An estimate of an upper bound for the entropy of English (paper)
1992 · The COCOON platform (University of Paris) · Cited by 354
Topics: Algorithms and Data Compression; Natural Language Processing Techniques; Cellular Automata and Applications
Abstract
We present an estimate of an upper bound of 1.75 bits for the entropy of characters in printed English, obtained by constructing a word trigram model and then computing the cross-entropy between this model and a balanced sample of English text. We suggest the well-known and widely available Brown Corpus of printed English as a standard against which to measure progress in language modeling and offer our bound as the first of what we hope will be a series of steadily decreasing bounds.
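The measurement described in the abstract can be illustrated with a minimal sketch: train a word trigram model, sum the negative log2 probability it assigns to each word of a held-out text, and divide by the number of characters. Everything specific here is an assumption for illustration, not the authors' method: the function names, the add-one smoothing, the single `<unk>` token for unseen words, and counting one space per word. Because unseen words are collapsed into `<unk>` without paying for their spelling, the number this toy produces is not a valid entropy bound on real text.

```python
import math
from collections import Counter

START, UNK = "<s>", "<unk>"  # hypothetical marker tokens


def train_trigram(tokens):
    """Count word trigrams and the bigram contexts that precede them."""
    padded = [START, START] + tokens
    tri = Counter(zip(padded, padded[1:], padded[2:]))
    ctx = Counter()
    for (a, b, _), n in tri.items():
        ctx[(a, b)] += n
    vocab = set(tokens) | {UNK}
    return tri, ctx, vocab


def trigram_prob(model, a, b, c):
    """P(c | a, b) with add-one smoothing (an assumption, chosen for brevity)."""
    tri, ctx, vocab = model
    return (tri[(a, b, c)] + 1) / (ctx[(a, b)] + len(vocab))


def cross_entropy_bits_per_char(model, tokens):
    """Cross-entropy of the model on `tokens`, in bits per character."""
    _, _, vocab = model
    # Map out-of-vocabulary words to UNK so the model can score them.
    mapped = [t if t in vocab else UNK for t in tokens]
    padded = [START, START] + mapped
    total_bits = 0.0
    for a, b, c in zip(padded, padded[1:], padded[2:]):
        total_bits -= math.log2(trigram_prob(model, a, b, c))
    # Characters of the original words, plus one separating space per word.
    n_chars = sum(len(t) for t in tokens) + len(tokens)
    return total_bits / n_chars


if __name__ == "__main__":
    train = "the cat sat on the mat and the dog sat on the rug".split()
    test = "the dog sat on the mat".split()
    model = train_trigram(train)
    print(f"{cross_entropy_bits_per_char(model, test):.3f} bits per character")
```

Dividing by characters rather than words is what makes the figure comparable with a per-character quantity such as the 1.75 bits quoted above, even though the model itself predicts whole words.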