Introducing and evaluating ukWaC, a very large Web-derived corpus of English

2008 · Cited by 325
Discourse Analysis in Language Studies; Natural Language Processing Techniques; Lexicography and Language Studies

Abstract
In this paper we introduce ukWaC, a large corpus of English constructed by crawling the .uk Internet domain. The corpus contains more than 2 billion tokens and is one of the largest freely available linguistic resources for English. The paper describes the tools and methodology used in the construction of the corpus and provides a qualitative evaluation of its contents, carried out through a vocabulary-based comparison with the BNC. We conclude by giving practical information about the availability and format of the corpus.