Maximum Likelihood Estimation on Binned Data - 2020-05-25
If in doubt, use KL divergence
Most of the time in machine learning is spent coming up with clever ways to estimate underlying probability distributions from which the observed samples have been drawn by chance, or waiting for said clever ways to rack up a sizable computing bill. But what if we have a lot of data? In such cases, we often use histograms to get a compressed representation.
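For concreteness, here is a minimal sketch of that compression step, assuming NumPy and a toy sample from a standard normal (the sample, bin edges, and bin count are made up for illustration):

```python
import numpy as np

# A large (toy) sample -- in practice this is whatever data we observed.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

# Compress a million numbers into 100 bin counts over fixed edges.
bin_edges = np.linspace(-5.0, 5.0, 101)
counts, _ = np.histogram(samples, bins=bin_edges)

# Normalized counts give the empirical (binned) distribution.
empirical_probs = counts / counts.sum()
```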
Similarly, the underlying (parametric) distribution can be discretized for faster computations, with an often negligible effect on accuracy. Such a formulation can also arise when the parametric model itself is defined as a mixture of (binned) empirical distributions (as in this real-world example).
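The same discretization can be applied to the model side. A sketch, assuming a SciPy continuous distribution as the parametric model and the bin edges from the snippet above (the parameter values are placeholders):

```python
import numpy as np
from scipy import stats

def discretize(dist, bin_edges):
    """Probability mass the continuous model assigns to each bin (CDF differences)."""
    return np.diff(dist.cdf(bin_edges))

# A candidate model N(mu, sigma) evaluated on the same bins as the data.
bin_edges = np.linspace(-5.0, 5.0, 101)
model_probs = discretize(stats.norm(loc=0.1, scale=1.2), bin_edges)
```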
How do we find the maximum likelihood estimate (MLE) of the distribution parameters in this binned world? My intuition suggested that MLE should be equivalent to minimizing the KL divergence between the empirical and the model distributions. Nonetheless, I felt that it was worth going through a simple derivation to remove any doubt.
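The gist of the argument, assuming the bin counts $n_k$ (with $N = \sum_k n_k$ and empirical probabilities $\hat{p}_k = n_k / N$) are modeled as a multinomial draw from the model's bin probabilities $p_k(\theta)$:

$$
\frac{1}{N} \log L(\theta)
= \frac{1}{N} \sum_k n_k \log p_k(\theta) + \mathrm{const}
= \sum_k \hat{p}_k \log p_k(\theta) + \mathrm{const}
= -D_{\mathrm{KL}}\!\left(\hat{p} \,\middle\|\, p(\theta)\right) - H(\hat{p}) + \mathrm{const}.
$$

Since the empirical entropy $H(\hat{p})$ and the constant do not depend on $\theta$, maximizing the binned likelihood is indeed the same as minimizing the KL divergence.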
more...