Maximum Likelihood Estimation on Binned Data - 2020-05-25

If in doubt, use KL divergence

Most of the time in machine learning is spent coming up with clever ways to estimate underlying probability distributions from which the observed samples have been drawn by chance, or waiting for said clever ways to rack up a sizable computing bill. But what if we have a lot of data? In such cases, we often use histograms to get a compressed representation.

Similarly, the underlying (parametric) distribution can be discretized for faster computations, with an often negligible effect on accuracy. Such a formulation can arise if the parametric model itself is defined as a mixture of (binned) empirical distributions (as in this real-world example).
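For instance, a continuous model can be reduced to bin probabilities by differencing its CDF at the bin edges. Below is a minimal sketch of that step, assuming a standard Gaussian model and an arbitrary grid of bins (both are illustrative choices, not taken from the example above):

```python
# A sketch of discretizing a parametric model onto a fixed grid of bins.
import numpy as np
from scipy import stats

bin_edges = np.linspace(-3.0, 3.0, num=11)   # edges b_0 < b_1 < ... < b_m
model = stats.norm(loc=0.0, scale=1.0)       # stand-in for any parametric model

# Probability mass of each bin [b_i, b_{i+1}) under the model, via CDF differences.
q = np.diff(model.cdf(bin_edges))
q /= q.sum()                                 # renormalize, since mass beyond the outer edges is dropped

print(q.round(3))
```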

How do we find the maximum likelihood estimate (MLE) of the distribution parameters in this binned world? My intuition suggested that MLE should be equivalent to minimizing the KL divergence between the empirical and the model distributions. Nonetheless, I felt that it was worth going through a simple derivation to remove any doubt.

The equivalence between MLE and minimizing the KL divergence is certainly taken for granted in classification problems, where the number of classes is fixed and the true labels are presented as one-hot vectors.

In regression problems, we often compute the log-likelihood explicitly for each observation, but the equivalence to KL under certain asymptotic assumptions is also taught in many courses.

However, I would be lying if I said that I made these connections immediately - hence I hope that others may find the derivation below somewhat illuminating.

The likelihood of observing binned data

Consider $m$ bins $[b_i, b_{i+1})$, where $0 \leq i < m$. Observed samples have bin counts $P_i$, with $\sum P_i = n$. This corresponds to empirical probabilities $p_i = P_i / n$.

We want to infer the underlying multinomial distribution, parametrized by the probability of each bin, written as $q_i = Q_i / n$, i.e. $\sum Q_i = n = \sum P_i$.
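As a concrete (and entirely synthetic) illustration of this setup:

```python
# A sketch of producing the bin counts P_i and empirical probabilities p_i;
# the samples and bin edges are synthetic, chosen purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=1_000)

bin_edges = np.linspace(-3.0, 3.0, num=11)   # m = 10 bins
P, _ = np.histogram(samples, bins=bin_edges) # counts P_i (samples outside the outer edges are dropped)
n = P.sum()
p = P / n                                    # empirical probabilities p_i, summing to 1

print(n, p.round(3))
```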

The likelihood of this observation, conditioned on $q_i$, is given by the multinomial distribution:

$$L(P_i | Q_i) = \frac{n!}{\prod P_i!} \prod \left( \frac{Q_i}{n} \right)^{P_i}$$
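In code, the exact log-likelihood can be evaluated with log-gamma functions in place of the factorials; a sketch (the counts $P_i$ and model counts $Q_i$ below are illustrative):

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import multinomial

def multinomial_log_likelihood(P, Q):
    """ln L(P | Q) for observed counts P and model counts Q, with sum(Q) == sum(P) == n."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    n = P.sum()
    return gammaln(n + 1) - gammaln(P + 1).sum() + (P * np.log(Q / n)).sum()

P = np.array([12, 30, 8])                    # illustrative bin counts, n = 50
Q = np.array([10.0, 25.0, 15.0])             # illustrative model counts, also summing to 50

print(multinomial_log_likelihood(P, Q))
# Cross-check against scipy's own multinomial pmf.
print(multinomial.logpmf(P, n=int(P.sum()), p=Q / Q.sum()))
```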

To simplify the above expression, we use Stirling's approximation, $\ln k! \approx k \ln k - k + \frac{1}{2}\ln 2\pi k$, to expand the factorials. This leads to the following formula for the negative log-likelihood:

$$- \ln L(P_i | Q_i) = \frac{m - 1}{2} \ln 2\pi - \frac{1}{2}\ln n + \frac{1}{2}\sum \ln P_i + n \sum p_i \ln \frac{p_i}{q_i}$$

Note that the last term is a constant multiple ($n$ times) of the KL divergence of $p$ relative to $q$, $\mathrm{KL}(p\|q)$. No other term depends on $Q_i$.
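In other words, maximizing the binned likelihood over $Q_i$ (or over the parameters that generate it) is the same as minimizing $\mathrm{KL}(p\|q)$. A quick numerical sanity check, using a discretized unit-variance Gaussian with unknown mean as an illustrative one-parameter family:

```python
# Check that the exact negative log-likelihood and KL(p||q) are minimized
# by the same parameter value; all numbers below are illustrative.
import numpy as np
from scipy import stats
from scipy.special import gammaln, rel_entr

bin_edges = np.array([-np.inf, -0.5, 0.5, np.inf])   # m = 3 bins covering the real line
P = np.array([12.0, 30.0, 8.0])                      # illustrative counts, n = 50
n, p = P.sum(), P / P.sum()

def q_of(mu):
    """Bin probabilities of a unit-variance Gaussian with mean mu."""
    return np.diff(stats.norm(loc=mu, scale=1.0).cdf(bin_edges))

def neg_log_likelihood(q):
    return -(gammaln(n + 1) - gammaln(P + 1).sum() + (P * np.log(q)).sum())

mus = np.linspace(-1.0, 1.0, 201)
nll = np.array([neg_log_likelihood(q_of(mu)) for mu in mus])
kl = np.array([rel_entr(p, q_of(mu)).sum() for mu in mus])

print(mus[nll.argmin()], mus[kl.argmin()])           # both pick the same mu
```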

A numerical example

To demonstrate the accuracy of the (amazingly versatile) Stirling approximation, consider a simple example with $n=50$ observations and $m=3$ bins. A numerical implementation can be found here [1].
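A rough sketch of such a comparison (the specific counts and model probabilities below are my own illustrative choices, not taken from the linked code):

```python
import numpy as np
from scipy.special import gammaln, rel_entr

P = np.array([12.0, 30.0, 8.0])              # illustrative observed counts, n = 50
q = np.array([0.2, 0.5, 0.3])                # illustrative model bin probabilities, m = 3
n, m, p = P.sum(), len(P), P / P.sum()

# Exact negative log-likelihood from the multinomial pmf.
exact = -(gammaln(n + 1) - gammaln(P + 1).sum() + (P * np.log(q)).sum())

# Stirling-based formula: every term except the last is independent of q.
approx = ((m - 1) / 2 * np.log(2 * np.pi)
          - 0.5 * np.log(n)
          + 0.5 * np.log(P).sum()
          + n * rel_entr(p, q).sum())

print(exact, approx)                         # roughly 6.51 vs 6.49 for these numbers
```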

By varying parameters of the modelled underlying distribution (think kurtosis), we can see how well the approximation works:

[Figure: MLE approximation and KL divergence for different parameters]

In fact, $n \times \mathrm{KL}(p\|q)$ seems like a very useful approximation to the negative log-likelihood, accurate (up to an additive constant) even for small $n$:

[Figure: MLE approximation and KL divergence for different parameters]

References

  1. Code for this post

  2. Relation between MLE and KL divergence
