# 17.7. Expectation Maximization#

Expectation-Maximization (EM)  method is a maximum likelihood based estimation paradigm. It requires an explicit probabilistic model of the mixed data-set. The algorithm estimates model parameters and the segmentation of data in Maximum-Likelihood (ML) sense.

We assume that $$\by_s$$ are samples drawn from multiple “component” distributions and each component distribution is centered around a mean. Let there be $$K$$ such component distributions. We introduce a latent (hidden) discrete random variable $$z \in \{1, \dots, K\}$$ associated with the random variable $$\by$$ such that $$z_s = k$$ if $$\by_s$$ is drawn from $$k$$-th component distribution. The random vector $$(\by, z) \in \RR^M \times \{1, \dots, K\}$$ completely describes the event that a point $$\by$$ is drawn from a component indexed by the value of $$z$$.

We assume that $$z$$ is subject to a multinomial (marginal) distribution. i.e.:

$p(z= k) = \pi_k \geq 0, \quad \pi_1 + \dots + \pi_K = 1.$

Each component distribution can then be modeled as a conditional (continuous) distribution $$f(\by | z)$$. If each of the components is a multivariate normal distribution, then we have $$f(\by | z = k) \sim \NNN(\mu_k, \Sigma_k)$$ where $$\mu_k$$ is the mean and $$\Sigma_k$$ is the covariance matrix of the $$k$$-th component distribution. The parameter set for this model is then $$\theta = \{\pi_k, \mu_k, \Sigma_K \}_{k=1}^K$$ which is unknown in general and needs to be estimated from the dataset $$\bY$$.

With $$(\by, z)$$ being the complete random vector, the marginal PDF of $$\by$$ given $$\theta$$ is given by

$f(\by | \theta) = \sum_{z = 1}^K f(\by | z, \theta) p (z | \theta) = \sum_{z = 1}^K \pi_k f(\by | z=k, \theta).$

The log-likelihood function for the dataset

$\bY = \{ \by_s\}_{s=1}^N$

is given by

$l (\bY; \theta) = \sum_{s=1}^S \ln f(\by_s | \theta).$

An ML estimate of the parameters, namely $$\hat{\theta}_{\ML}$$ is obtained by maximizing $$l (\bY; \theta)$$ over the parameter space. The statistic $$l (Y; \theta)$$ is called incomplete log-likelihood function since it is marginalized over $$z$$. It is very difficult to compute and maximize directly. The EM method provides an alternate means of maximizing $$l (\bY; \theta)$$ by utilizing the latent r.v. $$z$$.

$f(\by | \theta) p ( z | \by , \theta) = f(\by, z | \theta),$
$\sum_{k=1}^K p(z = k | \by , \theta) = 1.$

Thus, $$l (\bY; \theta)$$ can be rewritten as

$\begin{split} l (\bY; \theta) &= \sum_{s=1}^S \sum_{k=1}^K p(z_s = k | \by_s , \theta) \ln \frac{f(\by_s, z_s =k | \theta)}{p(z_s=k | \by_s, \theta)}\\ &= \sum_{s, k} p(z_s = k | \by_s , \theta) \ln f(\by_s, z_s =k | \theta) \\ &- \sum_{s, k} p(z_s = k | \by_s , \theta) \ln p(z_s=k | \by_s, \theta) . \end{split}$

The first term is expected complete log-likelihood function and the second term is the conditional entropy of $$z_s$$ given $$\by_s$$ and $$\theta$$.

Let us introduce auxiliary variables $$w_{s k} (\theta) = p(z_s = k | \by_s , \theta)$$. $$w_{s k}$$ represents the expected membership of $$\by_s$$ in the $$k$$-th cluster. Put $$w_{sk}$$ in a matrix $$\bW (\theta)$$ and write:

$l'(\bY; \theta, \bW) = \sum_{s=1}^S \sum_{k=1}^K w_{s k} \ln f(\by_s, z_s =k | \theta).$
$h( z | \by; \bW) = - \sum_{s=1}^S \sum_{k=1}^K w_{s k} \ln w_{sk}.$

Then, we have

$l(\bY; \theta, \bW) = l'(\bY; \theta, \bW) + h( z | \by; W)$

where, we have written $$l$$ as a function of both $$\theta$$ and $$W$$.

An iterative maximization approach can be introduced as follows:

1. Maximize $$l(\bY; \theta, \bW)$$ w.r.t. $$\bW$$ keeping $$\theta$$ as constant.

2. Maximize $$l(\bY; \theta, \bW)$$ w.r.t. $$\theta$$ keeping $$\bW$$ as constant.

3. Repeat the previous two steps till convergence.

This is essentially the EM algorithm. Step 1 is known as E-step and step 2 is known as the M-step. In the E-step, we are estimating the expected membership of each sample being drawn from each component distribution. In the M-step, we are maximizing the expected complete log-likelihood function as the conditional entropy term doesn’t depend on $$\theta$$.

Using Lagrange multiplier, we can show that the optimal $$\hat{w}_{s k}$$ in the E-step is given by

$\hat{w}_{sk} = \frac{\pi_k f( \by_s | z_s = k, \theta )} {\sum_{l=1}^K \pi_l f(\by_s | z_s = l, \theta )}.$

A closed form solution for the $$M$$-step depends on the particular choice of the component distributions. We provide a closed form solution for the special case when each of the components is an isotropic normal distribution ($$\NNN(\mu_k, \sigma_k^2 I)$$).

$\begin{split} &\hat{\mu_k} = \frac{\sum_{s=1}^S w_{sk} y_s} {\sum_{s=1}^S w_{sk}},\\ &\hat{\sigma}_k^2 = \frac{\sum_{s=1}^S w_{sk} \| y_s - \mu_k \|_2^2} {M \sum_{s=1}^S w_{sk}},\\ &\hat{\pi_k} = \frac{\sum_{k=1}^K w_{sk}}{K}. \end{split}$

In $$K$$-means, each $$\by_s$$ gets hard assigned to a specific cluster. In EM, we have a soft assignment given by $$w_{s k}$$.

EM-method is a good method for a hybrid dataset consisting of mixture of component distributions. Yet, its applicability is limited. We need to have a good idea of the number of components beforehand. Further, for a Gaussian Mixture Model (GMM), it fails to work if the variance in some of the directions is arbitrarily small . For example, a subspace like distribution is one where the data has large variance within a subspace but almost zero variance orthogonal to the subspace. The EM method tends to fail with subspace like distributions.