Posts


Maximum Likelihood Estimation on Binned Data - 2020-05-25

If in doubt, use KL divergence

Much of one's time in machine learning is spent coming up with clever ways to estimate the underlying probability distributions from which the observed samples have been drawn by chance, or waiting for said clever ways to rack up a sizable computing bill. But what if we have a lot of data? In such cases, we often use histograms to get a compressed representation.

Similarly, the underlying (parametric) distribution can be discretized for faster computations, with an often negligible effect on accuracy. Such a formulation can arise if the parametric model itself is defined as a mixture of (binned) empirical distributions (as in this real-world example).
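To make the setup concrete, here is a minimal sketch of both halves, assuming numpy/scipy and a Gaussian model; the bin edges and parameter values are made up for the example:

```python
import numpy as np
from scipy import stats

# Compress samples into bin counts (the empirical side).
samples = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=1_000_000)
edges = np.linspace(-9, 11, 41)            # shared bin edges (40 bins)
counts, _ = np.histogram(samples, bins=edges)
p_hat = counts / counts.sum()              # empirical bin probabilities

# Discretize the parametric model over the same bins:
# the mass each bin receives under N(mu, sigma).
mu, sigma = 1.0, 2.0
cdf = stats.norm.cdf(edges, loc=mu, scale=sigma)
p_model = np.diff(cdf)                     # P(edge_i < X <= edge_{i+1})
p_model /= p_model.sum()                   # drop the (tiny) mass outside the edges

# Binned log-likelihood of the data under the model.
loglik = (counts * np.log(p_model)).sum()
```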

How do we find the maximum likelihood estimate (MLE) of the distribution parameters in this binned world? My intuition suggested that maximizing the likelihood should be equivalent to minimizing the KL divergence between the empirical and the model distributions. Nonetheless, I felt it was worth going through a simple derivation to remove any doubt.
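For the record, here is the one-line version of that derivation, with n_i the count in bin i, N the total count, p̂_i = n_i / N the empirical bin probabilities, and p_i(θ) the binned model:

```latex
% Binned log-likelihood per observation (up to the multinomial coefficient):
\frac{1}{N}\,\ell(\theta)
  = \frac{1}{N}\sum_i n_i \log p_i(\theta)
  = \sum_i \hat{p}_i \log p_i(\theta)
  = -H(\hat{p}) - D_{\mathrm{KL}}\!\left(\hat{p} \,\|\, p(\theta)\right)
% The entropy H(\hat{p}) does not depend on \theta, so maximizing the
% likelihood is the same as minimizing the KL divergence.
```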

more...

Deep Christmas-Tree-Based Learning - 2018-12-24

A novel technique applied to a novel dataset 🎄🎄🎄

Merry Christmas!

more...

How I learned to stop using CSV and love HDF - 2018-11-06

Comparing file formats for data science work

Waiting for a CSV file to load before any data science can begin is a great excuse for yet another coffee. While CSVs are human-readable, widely supported, and the go-to format for small data sets, they become unwieldy once we reach a few million rows. So while waiting for yet another file to load, I decided to investigate a few alternatives.

(image courtesy of https://xkcd-excuse.com)
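For reference, a minimal sketch of the round trip being compared; the file names are illustrative, and HDF5 support in pandas requires the PyTables package:

```python
import numpy as np
import pandas as pd

# A synthetic frame large enough for the difference to matter.
df = pd.DataFrame(np.random.default_rng(0).normal(size=(5_000_000, 4)),
                  columns=list("abcd"))

df.to_csv("data.csv", index=False)        # human-readable, but slow to parse
df.to_hdf("data.h5", key="df", mode="w")  # binary, typically much faster to load

df_csv = pd.read_csv("data.csv")          # text parsing on every load
df_hdf = pd.read_hdf("data.h5", "df")     # near-direct read of binary arrays
```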

more...