17.7. Expectation Maximization

The Expectation-Maximization (EM) method [26] is a maximum-likelihood-based estimation paradigm. It requires an explicit probabilistic model of the mixed dataset. The algorithm estimates the model parameters and the segmentation of the data in the maximum-likelihood (ML) sense.

We assume that the samples $y_{s}$ are drawn from multiple “component” distributions, each centered around a mean. Let there be $K$ such component distributions. We introduce a latent (hidden) discrete random variable $z \in \{1, \dots, K\}$ associated with the random variable $y$ such that $z_{s} = k$ if $y_{s}$ is drawn from the $k$-th component distribution. The random vector $(y, z) \in \mathbb{R}^{M} \times \{1, \dots, K\}$ completely describes the event that a point $y$ is drawn from the component indexed by the value of $z$.

We assume that $z$ follows a multinomial (marginal) distribution, i.e.,

p (z = k) = π_{k} \geq 0, π_{1} + \dots + π_{K} = 1.

Each component distribution can then be modeled as a conditional (continuous) distribution $f (y | z)$. If each of the components is a multivariate normal distribution, then $f (y | z = k)$ is the density of $N (μ_{k}, Σ_{k})$, where $μ_{k}$ is the mean and $Σ_{k}$ is the covariance matrix of the $k$-th component distribution. The parameter set for this model is then $θ = \{π_{k}, μ_{k}, Σ_{k}\}_{k = 1}^{K}$, which is unknown in general and needs to be estimated from the dataset $Y$.
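The generative model described above can be sketched in code. This is a minimal illustration, assuming isotropic covariances $Σ_{k} = σ_{k}^{2} I$ for brevity; the function name `sample_mixture` and all parameter values are hypothetical choices made for the example.

```python
import random

def sample_mixture(n, pis, mus, sigma2s, seed=0):
    """Draw n pairs (y, z): z from the multinomial pis, then y from N(mu_z, sigma2_z * I)."""
    rng = random.Random(seed)
    components = list(range(len(pis)))
    samples = []
    for _ in range(n):
        z = rng.choices(components, weights=pis)[0]  # latent label (0-based here)
        std = sigma2s[z] ** 0.5
        y = [rng.gauss(m, std) for m in mus[z]]      # one isotropic Gaussian draw
        samples.append((y, z))
    return samples

# Hypothetical two-component model in R^2 (assumed values).
pis = [0.3, 0.7]
mus = [[0.0, 0.0], [4.0, 4.0]]
sigma2s = [1.0, 0.25]

samples = sample_mixture(2000, pis, mus, sigma2s)
# Empirical fraction of points drawn from the second component.
frac_second = sum(z for _, z in samples) / len(samples)
```

In a real clustering problem only the points $y$ would be observed; the labels $z$ are kept here only to show what the latent variable represents.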

With $(y, z)$ being the complete random vector, the marginal PDF of $y$ given $θ$ is

f (y | θ) = \sum_{z = 1}^{K} f (y | z, θ) p (z | θ) = \sum_{k = 1}^{K} π_{k} f (y | z = k, θ) .
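As a small sketch of the mixture density, the following evaluates $f (y | θ)$ as a weighted sum of component densities. It assumes isotropic components $N (μ_{k}, σ_{k}^{2} I)$ to avoid matrix inversion; the function names and the numerical values are illustrative assumptions.

```python
import math

def isotropic_normal_pdf(y, mu, sigma2):
    """Density of N(mu, sigma2 * I) at the point y (plain lists of length M)."""
    m = len(y)
    sq_dist = sum((yi - mi) ** 2 for yi, mi in zip(y, mu))
    return math.exp(-0.5 * sq_dist / sigma2) / (2.0 * math.pi * sigma2) ** (m / 2.0)

def mixture_pdf(y, pis, mus, sigma2s):
    """f(y | theta) = sum_k pi_k * f(y | z = k, theta)."""
    return sum(p * isotropic_normal_pdf(y, mu, s2)
               for p, mu, s2 in zip(pis, mus, sigma2s))

# Hypothetical two-component model in R^2 (assumed values).
pis = [0.3, 0.7]
mus = [[0.0, 0.0], [4.0, 4.0]]
sigma2s = [1.0, 0.5]

# At the origin the second component contributes almost nothing,
# so the density is essentially pis[0] / (2 * pi).
density_at_origin = mixture_pdf([0.0, 0.0], pis, mus, sigma2s)
```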

The log-likelihood function for the dataset

Y = \{y_{s}\}_{s = 1}^{S}

is given by

l (Y; θ) = \sum_{s = 1}^{S} \ln f (y_{s} | θ) .
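The incomplete log-likelihood can be evaluated numerically as sketched below. The example assumes isotropic normal components and strictly positive mixing weights, and it uses the log-sum-exp trick so that points far from every mean do not underflow; the helper names are hypothetical.

```python
import math

def log_isotropic_normal(y, mu, sigma2):
    """ln of the N(mu, sigma2 * I) density at y."""
    m = len(y)
    sq_dist = sum((yi - mi) ** 2 for yi, mi in zip(y, mu))
    return -0.5 * sq_dist / sigma2 - 0.5 * m * math.log(2.0 * math.pi * sigma2)

def log_sum_exp(values):
    """Numerically stable ln(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def log_likelihood(Y, pis, mus, sigma2s):
    """l(Y; theta) = sum_s ln sum_k pi_k f(y_s | z_s = k, theta)."""
    total = 0.0
    for y in Y:
        terms = [math.log(p) + log_isotropic_normal(y, mu, s2)
                 for p, mu, s2 in zip(pis, mus, sigma2s)]
        total += log_sum_exp(terms)
    return total

# Single standard normal component in 1-D: l = -0.5 * ln(2 * pi) at y = 0.
ll_single = log_likelihood([[0.0]], [1.0], [[0.0]], [1.0])
# A distant point still yields a finite value thanks to log-sum-exp.
ll_far = log_likelihood([[1000.0]], [0.5, 0.5], [[0.0], [1.0]], [1.0, 1.0])
```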

An ML estimate of the parameters, namely ${\hat{θ}}_{ML}$, is obtained by maximizing $l (Y; θ)$ over the parameter space. The statistic $l (Y; θ)$ is called the incomplete log-likelihood function since it is marginalized over $z$. Because the logarithm acts on a sum over the components, the parameters of different components do not decouple, and $l (Y; θ)$ is very difficult to maximize directly. The EM method provides an alternate means of maximizing $l (Y; θ)$ by utilizing the latent r.v. $z$.

We start by noting that

f (y | θ) p (z | y, θ) = f (y, z | θ),

\sum_{k = 1}^{K} p (z = k | y, θ) = 1.

Thus, $l (Y; θ)$ can be rewritten as

\begin{aligned} l (Y; θ) & = \sum_{s = 1}^{S} \sum_{k = 1}^{K} p (z_{s} = k | y_{s}, θ) \ln \frac{f (y_{s}, z_{s} = k | θ)}{p (z_{s} = k | y_{s}, θ)} \\ & = \sum_{s, k} p (z_{s} = k | y_{s}, θ) \ln f (y_{s}, z_{s} = k | θ) \\ & \quad - \sum_{s, k} p (z_{s} = k | y_{s}, θ) \ln p (z_{s} = k | y_{s}, θ) . \end{aligned}

The first term is the expected complete log-likelihood function and the second term (including its minus sign) is the conditional entropy of $z_{s}$ given $y_{s}$ and $θ$, summed over the samples.

Let us introduce auxiliary variables $w_{s k} (θ) = p (z_{s} = k | y_{s}, θ)$. Here $w_{s k}$ represents the expected membership of $y_{s}$ in the $k$-th cluster. Collect the $w_{s k}$ in an $S \times K$ matrix $W (θ)$ and write:

l^{'} (Y; θ, W) = \sum_{s = 1}^{S} \sum_{k = 1}^{K} w_{s k} \ln f (y_{s}, z_{s} = k | θ) .

h (z | y; W) = - \sum_{s = 1}^{S} \sum_{k = 1}^{K} w_{s k} \ln w_{s k} .

Then, we have

l (Y; θ, W) = l^{'} (Y; θ, W) + h (z | y; W)

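where we have written $l$ as a function of both $θ$ and $W$. For a general $W$ whose rows are nonnegative and sum to one, the right-hand side is a lower bound on $l (Y; θ)$ (by Jensen's inequality), with equality when $W = W (θ)$.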
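The decomposition can be checked numerically. The sketch below uses a hypothetical 1-D two-component Gaussian mixture with assumed parameter values, sets $w_{s k}$ to the posterior $p (z_{s} = k | y_{s}, θ)$, and compares the incomplete log-likelihood with the sum of the expected complete log-likelihood and the entropy term.

```python
import math

def normal_pdf(y, mu, sigma2):
    """Density of the 1-D normal N(mu, sigma2) at y."""
    return math.exp(-0.5 * (y - mu) ** 2 / sigma2) / math.sqrt(2.0 * math.pi * sigma2)

# Hypothetical data and parameters (assumed values).
Y = [-1.2, 0.3, 2.9, 3.4]
pis, mus, sigma2s = [0.4, 0.6], [0.0, 3.0], [1.0, 0.5]

incomplete = 0.0         # l(Y; theta)
expected_complete = 0.0  # l'(Y; theta, W)
entropy = 0.0            # h(z | y; W)
for y in Y:
    # Joint densities f(y_s, z_s = k | theta) = pi_k f(y_s | z_s = k, theta).
    joint = [p * normal_pdf(y, mu, s2) for p, mu, s2 in zip(pis, mus, sigma2s)]
    marginal = sum(joint)
    incomplete += math.log(marginal)
    for j in joint:
        w = j / marginal  # posterior membership w_sk
        expected_complete += w * math.log(j)
        entropy -= w * math.log(w)

gap = incomplete - (expected_complete + entropy)  # zero up to rounding
```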

An iterative maximization approach can be introduced as follows:

1. Maximize $l (Y; θ, W)$ w.r.t. $W$ keeping $θ$ constant.
2. Maximize $l (Y; θ, W)$ w.r.t. $θ$ keeping $W$ constant.
3. Repeat the previous two steps till convergence.

This is essentially the EM algorithm. Step 1 is known as the E-step and step 2 is known as the M-step. In the E-step, we estimate the expected membership of each sample in each component distribution. In the M-step, we maximize only the expected complete log-likelihood function $l^{'}$, since for a fixed $W$ the conditional entropy term does not depend on $θ$.

Using a Lagrange multiplier for the constraint $\sum_{k = 1}^{K} w_{s k} = 1$, we can show that the optimal ${\hat{w}}_{s k}$ in the E-step is given by

{\hat{w}}_{s k} = \frac{π_{k} f (y_{s} | z_{s} = k, θ)}{\sum_{l = 1}^{K} π_{l} f (y_{s} | z_{s} = l, θ)} .
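The E-step formula can be sketched as follows, assuming isotropic normal components and strictly positive mixing weights. The function name `e_step` is hypothetical, and the computation is done in the log domain (an implementation choice, not something the formula requires) to avoid underflow.

```python
import math

def e_step(Y, pis, mus, sigma2s):
    """Return W with W[s][k] = pi_k f(y_s | k) / sum_l pi_l f(y_s | l)."""
    W = []
    for y in Y:
        m = len(y)
        log_joint = []
        for p, mu, s2 in zip(pis, mus, sigma2s):
            sq_dist = sum((yi - mi) ** 2 for yi, mi in zip(y, mu))
            log_joint.append(math.log(p) - 0.5 * sq_dist / s2
                             - 0.5 * m * math.log(2.0 * math.pi * s2))
        top = max(log_joint)  # subtract the max before exponentiating
        unnorm = [math.exp(v - top) for v in log_joint]
        total = sum(unnorm)
        W.append([u / total for u in unnorm])
    return W

# Hypothetical 1-D example: equal weights and variances, means at 0 and 2.
Y = [[0.0], [1.0], [2.0]]
W = e_step(Y, [0.5, 0.5], [[0.0], [2.0]], [1.0, 1.0])
# The midpoint y = 1 is shared equally; the end points lean toward the nearer mean.
```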

A closed-form solution for the M-step depends on the particular choice of the component distributions. We provide a closed-form solution for the special case when each of the components is an isotropic normal distribution $N (μ_{k}, σ_{k}^{2} I)$.

\begin{aligned} {\hat{μ}}_{k} & = \frac{\sum_{s = 1}^{S} w_{s k} y_{s}}{\sum_{s = 1}^{S} w_{s k}}, \\ {\hat{σ}}_{k}^{2} & = \frac{\sum_{s = 1}^{S} w_{s k} ‖ y_{s} - {\hat{μ}}_{k} ‖_{2}^{2}}{M \sum_{s = 1}^{S} w_{s k}}, \\ {\hat{π}}_{k} & = \frac{\sum_{s = 1}^{S} w_{s k}}{S} . \end{aligned}
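Putting the two steps together, the following is a minimal sketch of EM for an isotropic Gaussian mixture. It is not a production implementation: the function name `em_isotropic_gmm`, the fixed iteration count, the unit initial variances, the uniform initial weights, and the variance floor `var_floor` are all assumptions made for the example, and the M-step normalizes the mixing weights by the number of samples $S$ and uses the freshly updated mean in the variance update.

```python
import math
import random

def log_joint(y, pi_k, mu_k, sigma2_k):
    """ln(pi_k * N(y; mu_k, sigma2_k * I)) for a point y in R^M."""
    sq = sum((a - b) ** 2 for a, b in zip(y, mu_k))
    return (math.log(pi_k) - 0.5 * sq / sigma2_k
            - 0.5 * len(y) * math.log(2.0 * math.pi * sigma2_k))

def em_isotropic_gmm(Y, init_mus, n_iter=50, var_floor=1e-6):
    S, M, K = len(Y), len(Y[0]), len(init_mus)
    mus = [list(mu) for mu in init_mus]  # copy so the caller's list is untouched
    pis = [1.0 / K] * K
    sigma2s = [1.0] * K
    history = []  # incomplete log-likelihood before each M-step
    for _ in range(n_iter):
        # E-step: posterior memberships w_sk, computed in the log domain.
        W, ll = [], 0.0
        for y in Y:
            lj = [log_joint(y, pis[k], mus[k], sigma2s[k]) for k in range(K)]
            top = max(lj)
            unnorm = [math.exp(v - top) for v in lj]
            total = sum(unnorm)
            W.append([u / total for u in unnorm])
            ll += top + math.log(total)
        history.append(ll)
        # M-step: closed-form updates for the isotropic normal case.
        for k in range(K):
            n_k = sum(W[s][k] for s in range(S))  # effective cluster size
            mus[k] = [sum(W[s][k] * Y[s][d] for s in range(S)) / n_k
                      for d in range(M)]
            sq = sum(W[s][k] * sum((Y[s][d] - mus[k][d]) ** 2 for d in range(M))
                     for s in range(S))
            sigma2s[k] = max(sq / (M * n_k), var_floor)  # floor is an assumption
            pis[k] = n_k / S
    return pis, mus, sigma2s, history

# Synthetic data: two well-separated clusters in R^2 (assumed values).
rng = random.Random(0)
Y = ([[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(100)]
     + [[rng.gauss(5.0, 1.0), rng.gauss(5.0, 1.0)] for _ in range(100)])
pis, mus, sigma2s, history = em_isotropic_gmm(Y, init_mus=[[1.0, 1.0], [4.0, 4.0]])
```

The recorded log-likelihood values in `history` should be non-decreasing, which is a convenient sanity check when experimenting with this kind of loop. The sketch does not guard against a component whose effective size `n_k` collapses to zero.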

In $K$-means, each $y_{s}$ is hard-assigned to a specific cluster. In EM, we have a soft assignment given by $w_{s k}$.
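The difference between the two kinds of assignment can be illustrated with a hypothetical 1-D example. The sketch assumes two components with equal mixing weights and a shared variance; the function names and the numbers are illustrative only.

```python
import math

def hard_assignment(y, mus):
    """K-means style: a one-hot vector selecting the nearest mean."""
    dists = [(y - mu) ** 2 for mu in mus]
    nearest = dists.index(min(dists))
    return [1.0 if k == nearest else 0.0 for k in range(len(mus))]

def soft_assignment(y, pis, mus, sigma2):
    """EM style: posterior memberships w_k under a shared variance sigma2."""
    unnorm = [p * math.exp(-0.5 * (y - mu) ** 2 / sigma2)
              for p, mu in zip(pis, mus)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

mus = [0.0, 2.0]
y = 0.9  # slightly closer to the first mean
hard = hard_assignment(y, mus)
soft = soft_assignment(y, [0.5, 0.5], mus, sigma2=1.0)
# hard commits fully to the first cluster, while soft only leans toward it.
```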

The EM method is well suited to a hybrid dataset consisting of a mixture of component distributions. Yet its applicability is limited. We need to have a good idea of the number of components beforehand. Further, for a Gaussian Mixture Model (GMM), it fails to work if the variance in some of the directions is arbitrarily small [82]. For example, a subspace-like distribution is one where the data has large variance within a subspace but almost zero variance orthogonal to the subspace. The EM method tends to fail with such subspace-like distributions.
