Using register-diversified corpora for general language studies (paper)

1993, cited by 238
Natural Language Processing Techniques, Linguistic Variation and Morphology, Second Language Acquisition and Learning

Abstract

The present study summarizes corpus-based research on linguistic characteristics from several different structural levels, in English as well as other languages, showing that register variation is inherent in natural language. It further argues that, due to the importance and systematicity of the linguistic differences among registers, diversified corpora representing a broad range of register variation are required as the basis for general language studies. First, the extent of cross-register differences is illustrated from consideration of individual grammatical and lexical features; these register differences are also important for probabilistic part-of-speech taggers and syntactic parsers, because the probabilities associated with grammatically ambiguous forms are often markedly different across registers. Then, corpus-based multi-dimensional analyses of English are summarized, showing that linguistic features from several structural levels function together as underlying dimensions of variation, with each dimension defining a different set of linguistic relations among registers. Finally, the paper discusses how such analyses, based on register-diversified corpora, can be used to address two current issues in computational linguistics: the automatic classification of texts into register categories and cross-linguistic comparisons of register variation.