Interpolating between types and tokens by estimating power-law generators (paper)

2005, cited by 215
Natural Language Processing Techniques; Topic Modeling; Language and cultural evolution

Abstract

Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that generically produce power laws, augmenting standard generative models with an adaptor that produces the appropriate pattern of token frequencies. We show that taking a particular stochastic process, the Pitman-Yor process, as an adaptor justifies the appearance of type frequencies in formal analyses of natural language, and improves the performance of a model for unsupervised learning of morphology.
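
To make the adaptor idea concrete, the following is a minimal sketch of the Chinese-restaurant-style seating scheme commonly used to describe the Pitman-Yor process. It is an illustration under stated assumptions, not the paper's implementation: the function name `pitman_yor_seating` and the parameter names `discount` and `concentration` are chosen here for readability. Each customer stands for a token; a customer joins an existing table with weight proportional to the table's count minus the discount, or opens a new table with weight proportional to the concentration plus the discount times the number of tables.

```python
import random


def pitman_yor_seating(num_customers, discount, concentration, seed=0):
    """Simulate table counts under a Pitman-Yor seating scheme.

    Hypothetical helper for illustration only. Requires
    0 <= discount < 1 and concentration > -discount.
    Returns a list with the number of customers at each table.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must satisfy 0 <= discount < 1")
    if concentration <= -discount:
        raise ValueError("concentration must be greater than -discount")

    rng = random.Random(seed)
    table_counts = []

    for _ in range(num_customers):
        if not table_counts:
            # The first customer always opens the first table.
            table_counts.append(1)
            continue

        # Unnormalised weights: one per existing table, plus one for a new table.
        weights = [count - discount for count in table_counts]
        weights.append(concentration + discount * len(table_counts))

        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(table_counts):
            table_counts.append(1)
        else:
            table_counts[choice] += 1

    return table_counts


if __name__ == "__main__":
    counts = sorted(pitman_yor_seating(10_000, discount=0.8, concentration=1.0),
                    reverse=True)
    print("number of tables:", len(counts))
    print("largest tables:", counts[:10])
```

With a discount near 1, the simulation tends to produce a few very large tables and a long tail of tables holding one customer, which is the heavy-tailed pattern the abstract refers to. Setting the discount to 0 reduces the scheme to the ordinary Chinese restaurant process.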