In Part 1 of this series of posts, we trained a Mixture Density Network to capture a heteroscedastic conditional probability distribution.
After predicting the parameters $(w_{ij}, \mu_{ij}, \sigma_{ij})$ of the $m$ mixture components, we would like to generate some - and in some cases many - samples from the learned distribution.
A fully vectorized implementation for the problem at hand required some thought, so I decided to write it up for future reference.
Being able to sample is important for two reasons:
- first, it gives us access to unlimited “synthetic” data,
- and second, any statistic you need can be computed from a big enough sample (see the sketch just below).
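To make the second point concrete, here is a minimal sketch (using placeholder data, since we have not built the sampler yet): given an array `samples` of shape `(num_rows, num_samples)`, conditional means, quantiles and exceedance probabilities all reduce to `numpy` reductions along the sample axis.

```python
import numpy as np

# placeholder standing in for a (num_rows, num_samples) array of samples
samples = np.random.randn(100, 10_000)

cond_mean = samples.mean(axis=1)                               # conditional mean per row
cond_q10, cond_q90 = np.quantile(samples, [0.1, 0.9], axis=1)  # conditional quantiles per row
p_exceed = (samples > 0.5).mean(axis=1)                        # estimated P(Y > 0.5 | X_i) per row
```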
Imagine we want to sample `num_samples` random values from a mixture of `num_components` Gaussians, at each of `num_rows` values of $X_i$. While `numpy.random` provides functions for efficient sampling from a number of probability distributions, we are trying to sample from `num_rows` different distributions simultaneously.
Let us create some test data of appropriate shape (you can find the complete code here):
```python
import numpy as np


def create_data(num_rows=100, num_components=5):
    w = np.random.rand(num_rows, num_components)
    # normalise the weights so that each row forms a valid mixture (softmax)
    w = np.exp(w) / np.exp(w).sum(axis=1, keepdims=True)
    mu = np.random.rand(num_rows, num_components)
    sigma = np.random.rand(num_rows, num_components)
    return w, mu, sigma


w, mu, sigma = create_data()
```
Sampling from a single Gaussian mixture (at row `i`) is a two-step process:

- Choose a random integer, `choice`, between `0` and `num_components - 1`, with probabilities `w[i, :]`
- Sample a random value from the chosen mixture component
This can be implemented as follows:
```python
# Implementation 0 - two nested loops
num_rows, num_components = w.shape
num_samples = 1000  # number of samples to draw per row
samples = np.empty((num_rows, num_samples))
for i in range(num_rows):
    _w, _mu, _sigma = w[i, :], mu[i, :], sigma[i, :]
    for j in range(num_samples):
        choice = np.random.choice(num_components, p=_w)
        samples[i, j] = _mu[choice] + _sigma[choice] * np.random.randn()
```
Looping through `numpy` arrays is never a good thing, and it is easy to spot that both `np.random.choice` and `np.random.randn` can generate an array of random values in a single call. Using `numpy`'s fancy indexing, we can get rid of the inner `for` loop:
```python
# Implementation 1 - a single loop over rows
samples = np.empty((num_rows, num_samples))
for i in range(num_rows):
    _w, _mu, _sigma = w[i, :], mu[i, :], sigma[i, :]
    choice = np.random.choice(num_components, size=num_samples, p=_w)
    samples[i] = _mu[choice] + _sigma[choice] * np.random.randn(num_samples)
```
For reasonable values (`num_rows, num_samples, num_components = 1000, 1000, 5`), the speed gain is massive:

| implementation | sampling time |
|---|---|
| Implementation 0 | 21s |
| Implementation 1 | 0.13s |
| Implementation 2 | 0.12s |
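If you want to reproduce timings of this kind, the following is a minimal sketch (not the benchmark behind the table above); `sample_mixture_loop` is a hypothetical wrapper around Implementation 1, added here purely so that it can be timed as a function:

```python
import time


def sample_mixture_loop(w, mu, sigma, num_samples):
    # Implementation 1, wrapped as a function for timing purposes
    num_rows, num_components = w.shape
    samples = np.empty((num_rows, num_samples))
    for i in range(num_rows):
        choice = np.random.choice(num_components, size=num_samples, p=w[i])
        samples[i] = mu[i, choice] + sigma[i, choice] * np.random.randn(num_samples)
    return samples


w_big, mu_big, sigma_big = create_data(num_rows=1000)
start = time.perf_counter()
sample_mixture_loop(w_big, mu_big, sigma_big, num_samples=1000)
print(f"Implementation 1: {time.perf_counter() - start:.2f}s")
```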
However, getting rid of the remaining loop over rows is harder. The only approach I could come up with is the following:
```python
# Implementation 2 - no python loops
## First, sample which mixture component to use for each (row, sample)
choices = np.random.rand(num_rows, 1, num_samples)
thresholds = w.cumsum(axis=1)[:, :, np.newaxis]
col_idx = (choices < thresholds).argmax(axis=1)
_, row_idx = np.meshgrid(np.arange(num_samples), np.arange(num_rows))
## Second, compute the samples
mu_chosen = mu[row_idx, col_idx]
sigma_chosen = sigma[row_idx, col_idx]
samples = mu_chosen + sigma_chosen * np.random.randn(num_rows, num_samples)
```
The approach hinges on `numpy` array broadcasting. Note the shapes of the arrays we construct to choose a mixture component for each sample:
| array | shape |
|---|---|
| `choices` | `(num_rows, 1, num_samples)` |
| `thresholds` | `(num_rows, num_components, 1)` |
| `(choices < thresholds)` | `(num_rows, num_components, num_samples)` |
| `col_idx` | `(num_rows, num_samples)` |
This way, we can index the right `mu` and `sigma` for all samples simultaneously.
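As a quick sanity check (a sketch added here, not part of the original code), the per-row mean of the vectorized samples should approach the analytic mixture mean $\sum_j w_{ij} \mu_{ij}$ as `num_samples` grows:

```python
empirical_mean = samples.mean(axis=1)  # Monte Carlo estimate of E[Y | X_i]
analytic_mean = (w * mu).sum(axis=1)   # exact mixture mean per row
print(np.abs(empirical_mean - analytic_mean).max())  # shrinks as num_samples grows
```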
All the indexing doesn't come for free, and in fact the speed difference to the previous (looped) implementation is not as clear-cut: when sampling a large number of points for each row, the looped approach is in fact marginally faster! However, vectorization comes into its own for a large number of rows, where my final implementation's speed depends only on the total number of sampled points (i.e. `num_rows * num_samples`), as evidenced by the straight-line contours in the log-log plot above.
Can we make it faster still? I suspect so. Is the speed-up significant in practice? Depends. Did solving this problem make me revisit vectorization and broadcasting? Definitely.
References
- Array broadcasting in `numpy`: https://docs.scipy.org/doc/numpy/user/theory.broadcasting.html
- Implicit array expansion in MATLAB: https://uk.mathworks.com/help/matlab/ref/bsxfun.html
- My code for this post: https://github.com/ig248/deepquantiles/blob/master/notebooks/mdn_sampling.ipynb
- My implementation of the `MixtureDensityRegressor`, including sampling methods: https://github.com/ig248/deepquantiles/blob/master/deepquantiles/regressors/mdn.py
- Box–Muller transform for generating normal random variables: https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform
- Marsaglia polar method for generating normal random variables (as used in `numpy.random`): https://en.wikipedia.org/wiki/Marsaglia_polar_method
- Sampling discrete choices: https://en.wikipedia.org/wiki/Pseudo-random_number_sampling#Finite_discrete_distributions