Learning Bilingual Lexicons from Monolingual Corpora (paper)

2008 · Cited by 312
Natural Language Processing Techniques · Topic Modeling · Handwritten Text Recognition Techniques

Abstract

We present a method for learning bilingual translation lexicons from monolingual corpora. Word types in each language are characterized by purely monolingual features, such as context counts and orthographic substrings. Translations are induced using a generative model based on canonical correlation analysis, which explains the monolingual lexicons in terms of latent matchings. We show that high-precision lexicons can be learned in a variety of language pairs and from a range of corpus types.
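To make the idea of matching word types through monolingual features concrete, the following is a minimal sketch and not the paper's model. It uses only orthographic substring features (character n-grams) and replaces the CCA-based generative model with plain cosine similarity and a greedy one-to-one matching. The function names (`char_ngrams`, `cosine`, `greedy_matching`), the `#` padding symbol, the n-gram range, and the threshold are all illustrative assumptions; context-count features are omitted.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n_max=3):
    """Count character n-grams (n = 1..n_max) of a word padded with '#'.

    Illustrative stand-in for orthographic substring features.
    """
    padded = f"#{word}#"
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            feats[padded[i:i + n]] += 1
    return feats


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_matching(source_words, target_words, threshold=0.0):
    """Greedy one-to-one matching by descending similarity.

    Assumption: this simple heuristic stands in for the latent matching
    that the paper infers with a CCA-based generative model.
    """
    src = {w: char_ngrams(w) for w in source_words}
    tgt = {w: char_ngrams(w) for w in target_words}
    scored = sorted(
        ((cosine(src[s], tgt[t]), s, t) for s in src for t in tgt),
        reverse=True,
    )
    used_src, used_tgt, lexicon = set(), set(), {}
    for score, s, t in scored:
        if score <= threshold:
            break  # remaining pairs are too dissimilar to match
        if s in used_src or t in used_tgt:
            continue  # each word takes part in at most one pair
        lexicon[s] = t
        used_src.add(s)
        used_tgt.add(t)
    return lexicon


if __name__ == "__main__":
    english = ["problem", "planet", "form"]
    spanish = ["forma", "problema", "planeta"]
    print(greedy_matching(english, spanish))
```

Orthographic features alone only help for cognates in related languages sharing a script; the abstract's context counts are what would carry the signal for other word pairs, and raising `threshold` trades recall for precision.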