Estimating the unseen 论文
摘要
We introduce a new approach to characterizing the unobserved portion of a distribution, which provides sublinear--sample estimators achieving arbitrarily small additive constant error for a class of properties that includes entropy and distribution support size. Additionally, we show new matching lower bounds. Together, this settles the longstanding question of the sample complexities of these estimation problems, up to constant factors. Our algorithm estimates these properties up to an arbitrarily small additive constant, using O(n/log n) samples, where n is a bound on the support size, or in the case of estimating the support size, 1/n is a lower bound on the probability of any element of the domain. Previously, no explicit sublinear--sample algorithms for either of these problems were known. Our algorithm is also computationally extremely efficient, running in time linear in the number of samples used.