Parallel corpora for medium density languages (paper)
2007, Amsterdam studies in the theory and history of linguistic science. Series 4, Current issues in linguistic theory. Citations: 371
Natural Language Processing Techniques, Topic Modeling, Text Readability and Simplification
Abstract
The choice of natural language technology appropriate for a given language is greatly impacted by density (availability of digitally stored material). More than half of the world speaks medium density languages, yet many of the methods appropriate for high or low density languages yield suboptimal results when applied to the medium density case. In this paper we describe a general methodology for rapidly collecting, building, and aligning parallel corpora for medium density languages, illustrating our main points on the case of Hungarian, Romanian, and Slovenian. We also describe and evaluate the hybrid sentence alignment method we are using.