A convergent gambling estimate of the entropy of English 论文

1978IEEE Transactions on Information Theory引用 225
Computability, Logic, AI AlgorithmsAlgorithms and Data CompressionEvolutionary Algorithms and Applications

摘要

In his original paper on the subject, Shannon found upper and lower bounds for the entropy of printed English based on the number of trials required for a subject to guess subsequent symbols in a given text. The guessing approach precludes asymptotic consistency of either the upper or lower bounds except for degenerate ergodic processes. Shannon's technique of guessing the next symbol is altered by having the subject place sequential bets on the next symbol of text. If <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">S_{n}</tex> denotes the subject's capital after <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">n</tex> bets at <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">27</tex> for <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">1</tex> odds, and if it is assumed that the subject knows the underlying probability distribution for the process <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">X</tex> , then the entropy estimate is <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">\hat{H}_{n}(X)=(1-(1/n) \log_{27}S_{n}) \log_{2} 27</tex> bits/symbol. If the subject does not know the true probability distribution for the stochastic process, then <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">\hat{H}_{n}(X)</tex> is an asymptotic upper bound for the true entropy. If <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">X</tex> is stationary, <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">E\hat{H}_{n}(X)\rightarrowH(X), H(X)</tex> being the true entropy of the process. Moreover, if X is ergodic, then by the Shannon-McMillan-Breiman theorem <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">\hat{H}_{n}(X)\rightarrowH(X)</tex> with probability one. Preliminary indications are that English text has an entropy of approximately <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">1.3</tex> bits/symbol, which agrees well with Shannon's estimate. In his original paper on the subject, Shannon found upper and lower bounds for the entropy of printed English based on the number of trials required for a subject to guess subsequent symbols in a given text. The guessing approach precludes asymptotic consistency of either the upper or lower bounds except for degenerate ergodic processes. Shannon's technique of guessing the next symbol is altered by having the subject place sequential bets on the next symbol of text. If <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">S_{n}</tex> denotes the subject's capital after <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">n</tex> bets at <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">27</tex> for <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">1</tex> odds, and if it is assumed that the subject knows the underlying probability distribution for the process <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">X</tex> , then the entropy estimate is <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">\hat{H}_{n}(X)=(1-(1/n) \log_{27}S_{n}) \log_{2} 27</tex> bits/symbol. If the subject does not know the true probability distribution for the stochastic process, then <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">\hat{H}_{n}(X)</tex> is an asymptotic upper bound for the true entropy. If <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">X</tex> is stationary, <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">E\hat{H}_{n}(X)\rightarrowH(X), H(X)</tex> being the true entropy of the process. Moreover, if X is ergodic, then by the Shannon-McMillan-Breiman theorem <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">\hat{H}_{n}(X)\rightarrowH(X)</tex> with probability one. Preliminary indications are that English text has an entropy of approximately <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">1.3</tex> bits/symbol, which agrees well with Shannon's estimate. In his original paper on the subject, Shannon found upper and lower bounds for the entropy of printed English based on the number of trials required for a subject to guess subsequent symbols in a given text. The guessing approach precludes asymptotic consistency of either the upper or lower bounds except for degenerate ergodic processes. Shannon's technique of guessing the next symbol is altered by having the subject place sequential bets on the next symbol of text. If <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">S_{n}</tex> denotes the subject's capital after <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">n</tex> bets at <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">27</tex> for <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">1</tex> odds, and if it is assumed that the subject knows the underlying probability distribution for the process <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">X</tex> , then the entropy estimate is <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">\hat{H}_{n}(X)=(1-(1/n) \log_{27}S_{n}) \log_{2} 27</tex> bits/symbol. If the subject does not know the true probability distribution for the stochastic process, then <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">\hat{H}_{n}(X)</tex> is an asymptotic upper bound for the true entropy. If <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">X</tex> is stationary, <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">E\hat{H}_{n}(X)\rightarrowH(X), H(X)</tex> being the true entropy of the process.Moreover, if X is ergodic, then by the Shannon-McMillan-Breiman theorem <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">\hat{H}_{n}(X)\rightarrowH(X)</tex> with probability one. Preliminary indications are that English text has an entropy of approximately <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">1.3</tex> bits/symbol, which agrees well with Shannon's estimate. In his original paper on the subject, Shannon found upper and lower bounds for the entropy of printed English based on the number of trials required for a subject to guess subsequent symbols in a given text. The guessing approach precludes asymptotic consistency of either the upper or lower bounds except for degenerate ergodic processes. Shannon's technique of guessing the next symbol is altered by having the subject place sequential bets on the next symbol of text. If <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">S_{n}</tex> denotes the subject's capital after <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">n</tex> bets at <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">27</tex> for <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">1</tex> odds, and if it is assumed that the subject knows the underlying probability distribution for the process <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">X</tex> , then the entropy estimate is <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">\hat{H}_{n}(X)=(1-(1/n) \log_{27}S_{n}) \log_{2} 27</tex> bits/symbol. If the subject does not know the true probability distribution for the stochastic process, then <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">\hat{H}_{n}(X)</tex> is an asymptotic upper bound for the true entropy. If <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">X</tex> is stationary, <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">E\hat{H}_{n}(X)\rightarrowH(X), H(X)</tex> being the true entropy of the process. Moreover, if X is ergodic, then by the Shannon-McMillan-Breiman theorem <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">\hat{H}_{n}(X)\rightarrowH(X)</tex> with probability one. Preliminary indications are that English text has an entropy of approximately <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">1.3</tex> bits/symbol, which agrees well with Shannon's estimate.