EM

Expectation-Maximization Algorithm

A Gaussian mixture distribution can be written as a linear superposition of Gaussians in the form:

$$P(X) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(X \mid \mu_k, \Sigma_k)$$

where $\pi_k$ is the mixing coefficient of the $k$th Gaussian component, satisfying the conditions:

$$0 \le \pi_k \le 1, \qquad \sum_{k=1}^{K} \pi_k = 1$$
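As a small numerical illustration (a minimal sketch; the two-component parameters and the function `gmm_density` below are made up for the example, not from the text):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, sigmas):
    """Evaluate P(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=sigma)
               for pi, mu, sigma in zip(pis, mus, sigmas))

# Toy two-component mixture in R^2 (illustrative parameters only).
pis = [0.3, 0.7]
mus = [np.zeros(2), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), pis, mus, sigmas))
```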


Let us introduce a $K$-dimensional binary latent random vector $Z$ having a 1-of-$K$ representation, in which a particular element $Z_k \in \{0, 1\}$ is equal to 1 and all other elements are equal to 0, so that $\sum_{k=1}^{K} Z_k = 1$. We define:

  • The joint distribution $P(X, Z) = P(Z)\,P(X \mid Z)$ in terms of a marginal distribution $P(Z)$ and a conditional distribution $P(X \mid Z)$.

  • The marginal distribution over $Z$ is specified in terms of the mixing coefficients $\pi_k$, such that $P(Z_k = 1) = \pi_k$, which can be written compactly as $P(Z) = \prod_{k=1}^{K} \pi_k^{Z_k}$.

  • The conditional distribution of $X$ given a particular value for $Z$ is a Gaussian: $P(X \mid Z_k = 1) = \mathcal{N}(X \mid \mu_k, \Sigma_k)$, which can also be written as $P(X \mid Z) = \prod_{k=1}^{K} \mathcal{N}(X \mid \mu_k, \Sigma_k)^{Z_k}$.

  • The conditional probability (posterior probability) of a particular value of $Z$ given a particular value for $X$, which can be found by Bayes' rule: $$\gamma(Z_k) \equiv P(Z_k = 1 \mid X) = \frac{P(X \mid Z_k = 1)\,P(Z_k = 1)}{P(X)} = \frac{\pi_k \mathcal{N}(X \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(X \mid \mu_j, \Sigma_j)}$$

    This probability can be viewed as the responsibility that component $k$ takes for explaining the observation $X$ (a small numerical sketch follows below).
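A minimal sketch of how the responsibilities might be computed (the function name and arguments are illustrative, not from the text): given data `X` of shape `(N, M)` and component parameters, the result is an `N × K` matrix whose rows sum to one.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, sigmas):
    """gamma[n, k] = pi_k N(x_n|mu_k,Sigma_k) / sum_j pi_j N(x_n|mu_j,Sigma_j)."""
    # Unnormalized posterior pi_k * N(x_n | mu_k, Sigma_k) for every n and k.
    weighted = np.column_stack([
        pi * multivariate_normal.pdf(X, mean=mu, cov=sigma)
        for pi, mu, sigma in zip(pis, mus, sigmas)
    ])
    # Normalize each row so the responsibilities sum to 1 over the K components.
    return weighted / weighted.sum(axis=1, keepdims=True)
```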


Then the marginal distribution of the Gaussian mixture can be written using the distribution of the latent random vector $Z$ as:

$$P(X) = \sum_{Z} P(Z)\,P(X \mid Z) = \sum_{k=1}^{K} \pi_k \mathcal{N}(X \mid \mu_k, \Sigma_k)$$

It follows that if we have a random sample $X_1, \dots, X_N$, then for every observed vector $X_n$ there is a corresponding latent variable $Z_n$. Therefore, we have found an equivalent formulation of the Gaussian mixture involving an explicit latent variable, and we can now work with the joint distribution $P(X, Z)$ instead of the original marginal distribution $P(X)$.

We can express the joint distribution as a Bayesian network in which $Z$ is a parent of $X$, corresponding to the factorization $P(X, Z) = P(Z)\,P(X \mid Z)$.

We can then use ancestral sampling to generate random samples distributed according to the Gaussian mixture model:

  1. Sample a value $\hat{Z} \sim P(Z)$.
  2. Sample $X$ from the conditional distribution $P(X \mid \hat{Z})$.
  3. Optionally, color each sample according to its $\hat{Z}$ to show which component generated it.
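A minimal sketch of ancestral sampling following these steps (function and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm(n_samples, pis, mus, sigmas):
    """Ancestral sampling: draw z_hat ~ P(Z), then x ~ P(X | z_hat)."""
    ks = rng.choice(len(pis), size=n_samples, p=pis)            # step 1: pick a component
    xs = np.array([rng.multivariate_normal(mus[k], sigmas[k])   # step 2: sample x given z_hat
                   for k in ks])
    return xs, ks                                               # ks can be used for coloring (step 3)

# Example with illustrative parameters.
xs, ks = sample_gmm(500, pis=[0.3, 0.7],
                    mus=[np.zeros(2), np.array([3.0, 3.0])],
                    sigmas=[np.eye(2), 2.0 * np.eye(2)])
```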

EM for Gaussian Mixtures

Suppose we have a dataset of observations $\{x_1, \dots, x_N\}$ with $x_n \in \mathbb{R}^M$, and we wish to model this data using a mixture of Gaussians. We can represent this dataset as an $N \times M$ matrix $D$ in which the $n$th row is given by $x_n^T$. Similarly, the corresponding latent variables will be denoted by an $N \times K$ matrix $H$ with rows $Z_n^T$. If we assume that the data points are drawn independently from the distribution, then we can express the Gaussian mixture model for this i.i.d. dataset using a graphical representation in which the pair $Z_n \rightarrow x_n$ is repeated inside a plate over the $N$ observations.

The log-likelihood function is given by:

$$\ln P(D \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln\left( \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$$
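A minimal sketch of evaluating this log likelihood (computed in log space with `logsumexp` for numerical stability; names are illustrative, not from the text):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, sigmas):
    """sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k), evaluated in log space."""
    log_terms = np.column_stack([
        np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=sigma)
        for pi, mu, sigma in zip(pis, mus, sigmas)
    ])                                         # shape (N, K)
    return logsumexp(log_terms, axis=1).sum()  # log-sum-exp over k, then sum over n
```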

An elegant and powerful method for finding maximum likelihood solutions for models with latent variables is called the expectation-maximization algorithm.

We know that at a maximum of the likelihood function (setting the derivative with respect to $\mu_k$ to zero):

$$0 = \sum_{n=1}^{N} \underbrace{\frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{\gamma(Z_{nk})} \, \Sigma_k^{-1} (x_n - \mu_k)$$

By assuming that $\Sigma_k$ is invertible and rearranging, we have:

$$\hat{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(Z_{nk})\, x_n$$

$$N_k = \sum_{n=1}^{N} \gamma(Z_{nk})$$

Since $\gamma(Z_{nk}) = P(Z_{nk} = 1 \mid X = x_n)$ is the posterior probability, we can interpret $N_k$ as the effective number of points assigned to cluster $k$. We can see that the mean $\hat{\mu}_k$ of the $k$th Gaussian component is obtained by taking a weighted mean of all of the points in the dataset, with weights given by the posterior probabilities for cluster $k$.
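For completeness, the rearrangement from the stationarity condition to $\hat{\mu}_k$ can be spelled out (multiplying through by $\Sigma_k$):

$$0 = \sum_{n=1}^{N} \gamma(Z_{nk})\, \Sigma_k^{-1}(x_n - \mu_k) \;\Longrightarrow\; \sum_{n=1}^{N} \gamma(Z_{nk})\, x_n = \mu_k \sum_{n=1}^{N} \gamma(Z_{nk}) = \mu_k N_k \;\Longrightarrow\; \hat{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(Z_{nk})\, x_n$$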


By taking the derivative w.r.t. $\Sigma_k$ and setting it to zero, we similarly obtain:

$$\hat{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(Z_{nk}) \, (x_n - \hat{\mu}_k)(x_n - \hat{\mu}_k)^T$$

By taking the derivative w.r.t. the mixing coefficients $\pi_k$ and using a Lagrange multiplier to enforce $\sum_k \pi_k = 1$, we have:

$$\hat{\pi}_k = \frac{N_k}{N}$$

Thus the mixing coefficient for the $k$th component is given by the average responsibility that the component takes for explaining the data points.
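Putting the three update equations together, a minimal sketch of the M step might look like this (array shapes and names are illustrative; `gamma` is the $N \times K$ responsibility matrix from the E step):

```python
import numpy as np

def m_step(X, gamma):
    """Re-estimate (pi, mu, Sigma) from data X (N, M) and responsibilities gamma (N, K)."""
    N, M = X.shape
    Nk = gamma.sum(axis=0)                    # N_k = sum_n gamma_{nk}
    pis = Nk / N                              # pi_k = N_k / N
    mus = (gamma.T @ X) / Nk[:, None]         # mu_k = (1/N_k) sum_n gamma_{nk} x_n
    sigmas = []
    for k in range(gamma.shape[1]):
        diff = X - mus[k]                     # (N, M) deviations from the new mean
        sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return pis, mus, np.array(sigmas)
```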

We can see that $\hat{\pi}_k$, $N_k$, $\hat{\mu}_k$, and $\hat{\Sigma}_k$ all depend on the responsibilities $\gamma(Z_{nk})$, and the responsibilities in turn depend on all of the parameters. Thus, we can use a simple iterative scheme to find a solution to the maximum likelihood problem, which turns out to be an instance of the EM algorithm for the particular case of the Gaussian mixture model:

  1. We first choose some initial values for the means, covariances, and mixing coefficients.
  2. We then alternate between the E step and the M step:
    • Expectation step: we use the current values of the parameters to evaluate the posterior probabilities (responsibilities).
    • Maximization step: we then use the responsibilities to re-estimate the parameters by maximizing the log likelihood (means first, then covariance matrices, then mixing coefficients).

In practice, the algorithm is deemed to have converged when the change in the log likelihood function or alternatively in the parameters falls below some threshold.
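Combining the two steps, a minimal sketch of the full loop (reusing the `responsibilities`, `m_step`, and `gmm_log_likelihood` helpers sketched above; the tolerance and iteration cap are arbitrary choices):

```python
import numpy as np

def em_gmm(X, pis, mus, sigmas, max_iter=100, tol=1e-6):
    """Run EM for a Gaussian mixture until the log likelihood stops improving by more than tol."""
    prev_ll = -np.inf
    for _ in range(max_iter):
        gamma = responsibilities(X, pis, mus, sigmas)    # E step
        pis, mus, sigmas = m_step(X, gamma)              # M step
        ll = gmm_log_likelihood(X, pis, mus, sigmas)     # monitor convergence
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pis, mus, sigmas
```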

An Alternative View of EM

The goal of the EM algorithm is to find maximum likelihood solutions for models having latent variables. We denote the set of all observed data by $D$, in which the $n$th row represents $x_n^T$, and similarly we denote the set of all latent variables by $H$, with corresponding rows $Z_n^T$. The set of all parameters is denoted by $\theta$, and so the log likelihood function is given by:

$$\ln L(\theta; D) = \ln\left( \sum_{H} P(D, H \mid \theta) \right)$$

If we have continuous latent variables, we can replace the summation with an integral.

A key observation is that the summation over the latent variables appears inside the logarithm, which makes the log likelihood much more complicated to maximize. Suppose instead that, for each observation in $D$, we had also observed the corresponding latent variable $Z_n^T = z_n^T$. We call $\{D, H\}$ the complete dataset, and when we observe only $D$ we have the incomplete dataset. Maximization of the complete-data log likelihood is straightforward and much simpler than maximizing the incomplete-data likelihood.

In practice, we are not given the complete dataset but only the incomplete data. Our state of knowledge of the values of the latent variables in H is given only by the posterior distribution:

$$P(H \mid D, \theta) = \frac{P(H, D \mid \theta)}{P(D \mid \theta)}$$

Since we cannot use the complete-data log likelihood directly, we instead consider its expectation under the posterior distribution (the E step). In the E step, we use the current parameter values $\theta^{old}$ to find the posterior distribution of the latent variables, $P(H \mid D, \theta^{old})$. We then use this posterior distribution to find the expectation of the complete-data log likelihood evaluated for some general parameter value $\theta$. This expectation is denoted:

$$Q(\theta, \theta^{old}) = \mathbb{E}_{H \mid D, \theta^{old}}\left[\ln P(D, H \mid \theta)\right] = \sum_{H} P(H \mid D, \theta^{old}) \ln P(D, H \mid \theta)$$

In the subsequent M step, we find the parameters that maximize this expectation. If the current estimate for the parameters is denoted $\theta^{old}$, then a pair of successive E and M steps gives rise to a revised estimate $\theta^{new}$. The algorithm is initialized by choosing some starting value $\theta_0$ for the parameters.

$$\theta^{new} = \arg\max_{\theta} Q(\theta, \theta^{old})$$

Notice that $Q$ involves the complete-data log likelihood; compared with the incomplete-data log likelihood, it is much easier to work with because the logarithm acts directly on the joint distribution.

General EM algorithm
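Collecting the steps described above, the general EM algorithm proceeds as follows:

  1. Choose an initial setting for the parameters $\theta^{old}$.
  2. E step: evaluate the posterior $P(H \mid D, \theta^{old})$.
  3. M step: evaluate $\theta^{new} = \arg\max_{\theta} Q(\theta, \theta^{old})$.
  4. Check for convergence of either the log likelihood or the parameters; if the convergence criterion is not satisfied, set $\theta^{old} \leftarrow \theta^{new}$ and return to step 2.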

General EM Algorithm for Gaussian Mixture

Previously, we applied EM to the Gaussian mixture using the incomplete-data log likelihood, in which the summation over $k$ occurs inside the logarithm. Now, we instead use the general approach of the EM algorithm.

We have the complete data likelihood:

$$P(D, H \mid \mu, \Sigma, \pi) = P(D \mid H, \mu, \Sigma, \pi)\, P(H \mid \mu, \Sigma, \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{Z_{nk}} \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{Z_{nk}}$$

where $Z_{nk}$ is the $k$th component of $Z_n$. Taking the logarithm, we have the complete-data log likelihood:

$$\ln P(D, H \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} Z_{nk} \left( \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$$

Compared with the incomplete-data log likelihood, this form is much easier to work with. In practice, however, we do not observe the values of $H$, and thus we use the expectation instead:

$$\mathbb{E}_{H}\left[\ln P(D, H \mid \mu, \Sigma, \pi)\right] = \sum_{n=1}^{N} \sum_{k=1}^{K} P(Z_{nk} = 1 \mid D, \mu, \Sigma, \pi) \left( \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(Z_{nk}) \left( \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$$
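A minimal sketch of evaluating this expected complete-data log likelihood for given responsibilities (the function name is illustrative and reuses the conventions of the earlier sketches):

```python
import numpy as np
from scipy.stats import multivariate_normal

def expected_complete_data_ll(X, gamma, pis, mus, sigmas):
    """sum_n sum_k gamma_{nk} * (ln pi_k + ln N(x_n | mu_k, Sigma_k))."""
    log_terms = np.column_stack([
        np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=sigma)
        for pi, mu, sigma in zip(pis, mus, sigmas)
    ])                                   # shape (N, K)
    return float((gamma * log_terms).sum())
```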

We then proceed as follows:

  1. First, we choose some initial values for the parameters.
  2. E step: we use these parameters to evaluate the responsibilities $\gamma(Z_{nk})$.
  3. M step: we then maximize the expected complete-data log likelihood with respect to the parameters.

This is the same procedure as the EM algorithm derived earlier from the incomplete-data log likelihood, but now the function being maximized in the M step is much easier to work with.

Ref

Bishop, C. M., Pattern Recognition and Machine Learning (PRML), Chapter 9.