Building a Large Annotated Corpus of Learner English: The NUS Corpus of Learner English 论文

2013引用 346
Natural Language Processing TechniquesTopic ModelingText Readability and Simplification

摘要

We describe the NUS Corpus of Learner English (NUCLE), a large, fully annotated corpus of learner English that is freely available for research purposes. The goal of the corpus is to provide a large data resource for the development and evaluation of grammatical error correction systems. Although NUCLE has been available for almost two years, there has been no reference paper that describes the corpus in detail. In this paper, we address this need. We describe the annotation schema and the data collection and annotation process of NUCLE. Most importantly, we report on an unpublished study of annotator agreement for grammatical error correction. Finally, we present statistics on the distribution of grammatical errors in the NUCLE corpus. 1