Coverage-Adjusted Entropy Estimation
Data on ``neural coding'' have frequently been analyzed using information-theoretic measures. These formulations involve the fundamental and generally difficult statistical problem of estimating entropy. In this paper we briefly review several methods that have been advanced for estimating entropy, and highlight a method due to Chao and Shen that appeared recently in the environmental statistics literature. This method begins with the elementary Horvitz-Thompson estimator, developed for sampling from a finite population, and adjusts for the potential new species that have not yet been observed in the sample; in the spike-train setting, these are the patterns or ``words'' that have not yet appeared. The adjustment, due to I.J. Good, is called the Good-Turing coverage estimate. We provide a new empirical regularization derivation of the coverage-adjusted probability estimator, which shrinks the maximum likelihood estimate (MLE, the naive or ``direct'' plug-in estimate) toward zero. We prove that the coverage-adjusted estimator of Chao and Shen is consistent and first-order optimal, with rate $O_P(1/\log n)$, in the class of distributions with finite entropy variance, and that, within the class of distributions with finite $q$th moment of the log-likelihood, the Good-Turing coverage estimate and the total probability of unobserved words converge at rate $O_P(1/(\log n)^q)$. We then provide a simulation study of the estimator using standard distributions, together with examples from neuronal data, where the observations are dependent. The results show that, with a minor modification, the coverage-adjusted estimator performs much better than the MLE and is better than the Best Upper Bound estimator, due to Paninski, when the number of possible words $m$ is unknown or infinite.
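For concreteness, the coverage-adjusted (Chao-Shen) construction described above can be sketched as follows; the notation here is ours, with $n_i$ denoting the number of times word $i$ appears in a sample of size $n$ and $f_1$ the number of words appearing exactly once:
\begin{align*}
\hat{C} &= 1 - \frac{f_1}{n}, \qquad \tilde{p}_i = \hat{C}\,\frac{n_i}{n},\\
\hat{H}_{\mathrm{CS}} &= -\sum_{i \,:\, n_i > 0} \frac{\tilde{p}_i \log \tilde{p}_i}{1 - (1 - \tilde{p}_i)^{n}}.
\end{align*}
Here $\hat{C}$ is the Good-Turing estimate of the sample coverage, $\tilde{p}_i$ is the coverage-adjusted probability that shrinks the MLE $n_i/n$ toward zero, and the denominator $1 - (1 - \tilde{p}_i)^{n}$ is the Horvitz-Thompson correction, the estimated probability that word $i$ appears at least once in a sample of size $n$. The ``minor modification'' used in the simulations is not reflected in this sketch.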