EM
Expectation-Maximization Algorithm
The Gaussian mixture distribution can be written as a linear superposition of Gaussians in the form:
$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
where $\pi_k$ are the mixing coefficients and $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ is the $k$-th Gaussian component with mean $\boldsymbol{\mu}_k$ and covariance $\boldsymbol{\Sigma}_k$.
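As a minimal illustration of this formula, here is a NumPy/SciPy sketch that evaluates the mixture density at a point; the names gmm_density, pis, mus, covs and the example numbers are placeholders chosen for this sketch, not taken from the text:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, covs):
    """Evaluate p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal(mu, cov).pdf(x)
               for pi, mu, cov in zip(pis, mus, covs))

# Illustrative two-component mixture in 2-D (numbers are made up).
pis = [0.3, 0.7]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), pis, mus, covs))
```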
Let us introduce a $K$-dimensional binary latent variable $\mathbf{z}$ having a 1-of-$K$ representation, in which a particular element $z_k$ equals 1 and all other elements equal 0, so that $z_k \in \{0, 1\}$ and $\sum_k z_k = 1$.
The joint distribution $p(\mathbf{x}, \mathbf{z})$ is defined in terms of a marginal distribution $p(\mathbf{z})$ and a conditional distribution $p(\mathbf{x} \mid \mathbf{z})$. The marginal distribution over $\mathbf{z}$ is specified in terms of the mixing coefficients $\pi_k$, such that:
$$p(z_k = 1) = \pi_k, \qquad 0 \le \pi_k \le 1, \qquad \sum_{k=1}^{K} \pi_k = 1$$
The conditional distribution of $\mathbf{x}$ given a particular value for $\mathbf{z}$ is a Gaussian:
$$p(\mathbf{x} \mid z_k = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
The conditional probability (posterior probability) of a particular value of $\mathbf{z}$ given $\mathbf{x}$ can be found by Bayes' rule:
$$\gamma(z_k) \equiv p(z_k = 1 \mid \mathbf{x}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
This probability $\gamma(z_k)$ can be viewed as the responsibility that component $k$ takes for explaining the observation $\mathbf{x}$.
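The responsibilities follow directly from the Bayes'-rule expression above. The sketch below evaluates them for a whole dataset X of shape (N, D); the function name responsibilities and the array layout gamma[n, k] are conventions of this example, not from the text:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, covs):
    """gamma[n, k] = pi_k N(x_n|mu_k, Sigma_k) / sum_j pi_j N(x_n|mu_j, Sigma_j)."""
    N, K = X.shape[0], len(pis)
    gamma = np.empty((N, K))
    for k in range(K):
        gamma[:, k] = pis[k] * multivariate_normal(mus[k], covs[k]).pdf(X)
    gamma /= gamma.sum(axis=1, keepdims=True)   # Bayes' rule: normalise over components
    return gamma
```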
Then the marginal distribution of the Gaussian mixture can be written by summing the joint distribution over all possible states of the latent variable $\mathbf{z}$:
$$p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z}) \, p(\mathbf{x} \mid \mathbf{z}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
It follows that, because we have represented the mixture through a joint distribution, every observed data point $\mathbf{x}_n$ has a corresponding latent variable $\mathbf{z}_n$.
We can express the joint distribution as a Bayesian network in which the latent variable $\mathbf{z}$ is a parent of the observed variable $\mathbf{x}$, i.e. $\mathbf{z} \rightarrow \mathbf{x}$.
We can then use ancestral sampling to generate random samples distributed according to the Gaussian mixture model (a code sketch follows the list):
- Sample a value $\hat{\mathbf{z}}$ from the marginal distribution $p(\mathbf{z})$, i.e. pick component $k$ with probability $\pi_k$.
- Sample a value $\hat{\mathbf{x}}$ from the conditional distribution $p(\mathbf{x} \mid \hat{\mathbf{z}})$, i.e. from the selected Gaussian $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$.
- Coloring the samples by the sampled value of $\hat{\mathbf{z}}$ shows which component generated each point.
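A minimal ancestral-sampling sketch, assuming the same pis/mus/covs parameterization as above (names and seed handling are illustrative):

```python
import numpy as np

def sample_gmm(n_samples, pis, mus, covs, seed=0):
    """Ancestral sampling: first draw z ~ p(z), then x ~ p(x | z)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pis), size=n_samples, p=pis)                  # latent component indices
    x = np.stack([rng.multivariate_normal(mus[k], covs[k]) for k in z])
    return x, z                                                      # z can be used to colour the samples
```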
EM for Gaussian Mixtures
Suppose we have a dataset of observations $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, represented as an $N \times D$ matrix $\mathbf{X}$ whose $n$-th row is $\mathbf{x}_n^{\mathsf{T}}$, and we wish to model this data using a mixture of Gaussians.
The log likelihood function is given by:
$$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$$
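In practice the sum inside the logarithm is usually evaluated in log space for numerical stability; a hedged sketch using scipy.special.logsumexp (the function name gmm_log_likelihood is an assumption of this example):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, covs):
    """ln p(X) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k), computed in log space."""
    log_terms = np.stack([np.log(pi) + multivariate_normal(mu, cov).logpdf(X)
                          for pi, mu, cov in zip(pis, mus, covs)], axis=1)  # shape (N, K)
    return logsumexp(log_terms, axis=1).sum()
```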
An elegant and powerful method for finding maximum likelihood solutions for models with latent variables is the expectation-maximization (EM) algorithm.
At a maximum of the likelihood function, taking the derivative with respect to the means $\boldsymbol{\mu}_k$ and setting it to zero gives
$$0 = \sum_{n=1}^{N} \gamma(z_{nk}) \, \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k)$$
By assuming that $\boldsymbol{\Sigma}_k$ is nonsingular and multiplying through by $\boldsymbol{\Sigma}_k$, we obtain
$$\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})$$
Since $N_k$ can be interpreted as the effective number of points assigned to component $k$, the mean $\boldsymbol{\mu}_k$ is a weighted average of the data points, with weights given by the responsibilities. By taking the derivative w.r.t. the covariance $\boldsymbol{\Sigma}_k$ and setting it to zero, we obtain
$$\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathsf{T}}$$
By taking the derivative w.r.t. the mixing coefficients $\pi_k$, using a Lagrange multiplier to enforce the constraint $\sum_k \pi_k = 1$, we obtain
$$\pi_k = \frac{N_k}{N}$$
so that the mixing coefficient for the $k$-th component is the average responsibility that the component takes for explaining the data points.
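These three re-estimation equations translate directly into an M-step routine; the sketch below follows the formulas above, with the function name m_step and the responsibility array gamma[n, k] being conventions of this example:

```python
import numpy as np

def m_step(X, gamma):
    """Re-estimate (pi_k, mu_k, Sigma_k) from responsibilities gamma[n, k]."""
    N, _ = X.shape
    Nk = gamma.sum(axis=0)                         # N_k: effective number of points per component
    pis = Nk / N                                   # pi_k = N_k / N
    mus = (gamma.T @ X) / Nk[:, None]              # mu_k = (1/N_k) sum_n gamma_nk x_n
    covs = []
    for k in range(len(Nk)):
        diff = X - mus[k]                          # (N, D) deviations from the new mean
        covs.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return pis, mus, covs
```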
We can see that these results do not constitute a closed-form solution, because the responsibilities $\gamma(z_{nk})$ themselves depend on the parameters. They do, however, suggest a simple iterative scheme:
- We first choose some initial values for the means, covariances and mixing coefficients.
- We then alternate between the E step and the M step:
- Expectation Step: We use the current values of the parameters to evaluate the posterior probabilities (responsibilities) $\gamma(z_{nk})$.
- Maximization Step: We then use these responsibilities to re-estimate the parameters by maximizing the log likelihood (means first, then covariances, then mixing coefficients).
In practice, the algorithm is deemed to have converged when the change in the log likelihood function, or alternatively in the parameters, falls below some threshold.
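Putting the pieces together, a driver loop might look like the sketch below; it reuses the responsibilities, m_step, and gmm_log_likelihood functions sketched earlier (all illustrative names, not from the text) and stops when the change in log likelihood falls below a threshold:

```python
import numpy as np

def fit_gmm(X, pis, mus, covs, tol=1e-6, max_iter=200):
    """Alternate E and M steps until the log likelihood stops improving."""
    prev_ll = -np.inf
    for _ in range(max_iter):
        gamma = responsibilities(X, pis, mus, covs)        # E step
        pis, mus, covs = m_step(X, gamma)                  # M step
        ll = gmm_log_likelihood(X, pis, mus, covs)
        if ll - prev_ll < tol:                             # convergence test on the log likelihood
            break
        prev_ll = ll
    return pis, mus, covs
```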
An Alternative View of EM
The goal of the EM algorithm is to find maximum likelihood solutions for models having latent variables. We denote the set of all observed data by $\mathbf{X}$, the set of all latent variables by $\mathbf{Z}$, and the set of all model parameters by $\boldsymbol{\theta}$, so that the log likelihood function is:
$$\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \ln \left\{ \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \right\}$$
If we have continuous latent variables, we can replace the summation over $\mathbf{Z}$ by an integral.
A key observation is that the summation over the latent variables appears inside the logarithm, which makes the expression for the log likelihood considerably more complicated. Suppose that, for each observation in $\mathbf{X}$, we were also given the corresponding value of the latent variable $\mathbf{Z}$. We call $\{\mathbf{X}, \mathbf{Z}\}$ the complete dataset, and maximizing the complete-data log likelihood $\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$ is straightforward.
In practice, we are not given the complete dataset but only the incomplete data $\mathbf{X}$. Our state of knowledge of the values of the latent variables in $\mathbf{Z}$ is given only by the posterior distribution $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})$.
Since we cannot use the complete-data log likelihood directly, we consider instead its expected value under the posterior distribution of the latent variables. In the E step, we use the current parameter values $\boldsymbol{\theta}^{\text{old}}$ to evaluate this posterior, $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$, and use it to compute the expectation of the complete-data log likelihood for a general parameter value $\boldsymbol{\theta}$:
$$\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}) \, \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$$
In the subsequent M step, we find the revised parameter estimate $\boldsymbol{\theta}^{\text{new}}$ by maximizing this expectation:
$$\boldsymbol{\theta}^{\text{new}} = \arg\max_{\boldsymbol{\theta}} \mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$$
Notice that in $\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$ the logarithm acts directly on the complete-data likelihood $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$; compared with the incomplete-data log likelihood, this makes the M-step maximization much easier to carry out.
General EM algorithm
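The procedure described above can be written as a short, model-agnostic skeleton; the callable arguments e_step, m_step, and log_likelihood stand in for the model-specific computations and are purely illustrative names for this sketch:

```python
def general_em(X, theta, e_step, m_step, log_likelihood, tol=1e-6, max_iter=200):
    """Generic EM skeleton: the E step evaluates p(Z | X, theta_old); the M step
    maximises Q(theta, theta_old) = sum_Z p(Z | X, theta_old) ln p(X, Z | theta)."""
    prev_ll = float("-inf")
    for _ in range(max_iter):
        posterior = e_step(X, theta)          # E step: posterior over the latent variables
        theta = m_step(X, posterior)          # M step: argmax_theta of Q(theta, theta_old)
        ll = log_likelihood(X, theta)
        if abs(ll - prev_ll) < tol:           # stop when ln p(X | theta) has converged
            break
        prev_ll = ll
    return theta
```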
General EM Algorithm for Gaussian Mixture
Previously, we used the incomplete-data log likelihood for the Gaussian mixture together with the EM algorithm; there, the summation over $k$ appears inside the logarithm, which couples all of the parameters and prevents a closed-form solution.
Now consider the complete-data likelihood:
$$p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}} \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_{nk}}$$
so that the complete-data log likelihood is
$$\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$$
where $z_{nk}$ denotes the $k$-th component of the latent variable $\mathbf{z}_n$.
Compared with the incomplete-data log likelihood, this form is much easier to maximize, because the logarithm now acts directly on the Gaussian. In practice, we do not observe the values of the latent variables $z_{nk}$, so we instead consider the expectation of the complete-data log likelihood with respect to the posterior distribution of the latent variables:
$$\mathbb{E}_{\mathbf{Z}}\!\left[ \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) \right] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$$
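A sketch of this expected complete-data log likelihood for the Gaussian mixture, reusing the gamma[n, k] responsibility convention from the earlier examples (the function name is an assumption of this sketch):

```python
import numpy as np
from scipy.stats import multivariate_normal

def expected_complete_log_likelihood(X, gamma, pis, mus, covs):
    """E_Z[ln p(X, Z)] = sum_n sum_k gamma_nk { ln pi_k + ln N(x_n | mu_k, Sigma_k) }."""
    q = 0.0
    for k in range(len(pis)):
        log_component = np.log(pis[k]) + multivariate_normal(mus[k], covs[k]).logpdf(X)
        q += np.sum(gamma[:, k] * log_component)
    return q
```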
We then proceed as follows:
- First, we choose some initial values for the parameters.
- We use these parameters to evaluate the responsibilities $\gamma(z_{nk})$ (E step).
- We then maximize the expected complete-data log likelihood with respect to the means, covariances, and mixing coefficients, keeping the responsibilities fixed (M step).
This yields exactly the same parameter updates as the EM algorithm obtained earlier from the incomplete-data log likelihood, but the function being maximized in the M step is much simpler.
Ref
Bishop, C. M., Pattern Recognition and Machine Learning (PRML), Springer, 2006, Chapter 9: Mixture Models and EM.