=== Gaussian mixture ===
<!--This section is linked from [[Matrix calculus]] -->
[[File:ClusterAnalysis_Mouse.svg|thumb|400px|Comparison of [[K-means clustering|k-means]] and EM on artificial data visualized with [[Environment for DeveLoping KDD-Applications Supported by Index-Structures|ELKI]]. Using the variances, the EM algorithm can describe the normal distributions exactly, while k-means splits the data into [[Voronoi diagram|Voronoi]] cells. The cluster center is indicated by the lighter, bigger symbol.]]
[[File:Em old faithful.gif|thumb|240px|An animation demonstrating the EM algorithm fitting a two-component Gaussian [[mixture model]] to the [[Old Faithful]] dataset. The algorithm steps through from a random initialization to convergence.]]

Let <math>\mathbf{x} = (\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_n)</math> be a sample of <math>n</math> independent observations from a [[mixture model|mixture]] of two [[multivariate normal distribution]]s of dimension <math>d</math>, and let <math>\mathbf{z} = (z_1,z_2,\ldots,z_n)</math> be the latent variables that determine the component from which each observation originates.<ref name="hastie2001"/>

: <math>X_i \mid(Z_i = 1) \sim \mathcal{N}_d(\boldsymbol{\mu}_1,\Sigma_1)</math> and <math>X_i \mid(Z_i = 2) \sim \mathcal{N}_d(\boldsymbol{\mu}_2,\Sigma_2),</math>

where

: <math>\operatorname{P} (Z_i = 1 ) = \tau_1 \, </math> and <math>\operatorname{P} (Z_i=2) = \tau_2 = 1-\tau_1.</math>

The aim is to estimate the unknown parameters, comprising the ''mixing'' proportion between the two Gaussians and the mean and covariance of each:

: <math>\theta = \big( \boldsymbol{\tau},\boldsymbol{\mu}_1,\boldsymbol{\mu}_2,\Sigma_1,\Sigma_2 \big),</math>

where the incomplete-data likelihood function is

: <math>L(\theta;\mathbf{x}) = \prod_{i=1}^n \sum_{j=1}^2 \tau_j \ f(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j),</math>

and the complete-data likelihood function is

: <math>L(\theta;\mathbf{x},\mathbf{z}) = p(\mathbf{x},\mathbf{z} \mid \theta) = \prod_{i=1}^n \prod_{j=1}^2 \ [f(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j) \tau_j] ^{\mathbb{I}(z_i=j)},</math>

or

: <math>L(\theta;\mathbf{x},\mathbf{z}) = \exp \left\{ \sum_{i=1}^n \sum_{j=1}^2 \mathbb{I}(z_i=j) \big[ \log \tau_j -\tfrac{1}{2} \log |\Sigma_j| -\tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_j)^\top\Sigma_j^{-1} (\mathbf{x}_i-\boldsymbol{\mu}_j) -\tfrac{d}{2} \log(2\pi) \big] \right\},</math>

where <math>\mathbb{I}</math> is an [[indicator function]] and <math>f</math> is the [[probability density function]] of a multivariate normal. In the last equality, for each {{math|''i''}}, exactly one indicator <math>\mathbb{I}(z_i=j)</math> is equal to one and the other is equal to zero, so the inner sum reduces to a single term.

==== E step ====
Given the current estimate of the parameters ''θ''<sup>(''t'')</sup>, the conditional distribution of the ''Z''<sub>''i''</sub> is determined by [[Bayes' theorem]] to be the proportional height of the normal [[probability density function|density]] weighted by ''τ'':

: <math>T_{j,i}^{(t)} := \operatorname{P}(Z_i=j \mid X_i=\mathbf{x}_i ;\theta^{(t)}) = \frac{\tau_j^{(t)} \ f(\mathbf{x}_i;\boldsymbol{\mu}_j^{(t)},\Sigma_j^{(t)})}{\tau_1^{(t)} \ f(\mathbf{x}_i;\boldsymbol{\mu}_1^{(t)},\Sigma_1^{(t)}) + \tau_2^{(t)} \ f(\mathbf{x}_i;\boldsymbol{\mu}_2^{(t)},\Sigma_2^{(t)})}.</math>

These are called the "membership probabilities", which are normally considered the output of the E step (although this is not the Q function defined below).
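For illustration, the membership probabilities can be computed in [[Python (programming language)|Python]] with [[NumPy]] and [[SciPy]]. The following is a minimal sketch, assuming the observations are stored in an ''n'' × ''d'' array <code>X</code> and the current parameters are given as a length-2 mixing vector <code>tau</code> and length-2 lists <code>mu</code> and <code>Sigma</code>; the name <code>e_step</code> and this parameter layout are illustrative choices, not part of the derivation:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, tau, mu, Sigma):
    """E step: membership probabilities T[j, i] = P(Z_i = j | x_i; theta^(t))."""
    # Weighted densities tau_j * f(x_i; mu_j, Sigma_j) for each component j.
    weighted = np.vstack([
        tau[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
        for j in range(2)
    ])                                            # shape (2, n)
    # Normalize over the two components so each column sums to one.
    return weighted / weighted.sum(axis=0, keepdims=True)
</syntaxhighlight>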
This E step corresponds to setting up the function Q:

: <math>\begin{align}Q(\theta\mid\theta^{(t)}) &= \operatorname{E}_{\mathbf{Z}\mid\mathbf{X}=\mathbf{x};\mathbf{\theta}^{(t)}} [\log L(\theta;\mathbf{x},\mathbf{Z}) ] \\
&= \operatorname{E}_{\mathbf{Z}\mid\mathbf{X}=\mathbf{x};\mathbf{\theta}^{(t)}} \left[\log \prod_{i=1}^{n}L(\theta;\mathbf{x}_i,Z_i) \right] \\
&= \operatorname{E}_{\mathbf{Z}\mid\mathbf{X}=\mathbf{x};\mathbf{\theta}^{(t)}} \left[\sum_{i=1}^n \log L(\theta;\mathbf{x}_i,Z_i) \right] \\
&= \sum_{i=1}^n\operatorname{E}_{Z_i\mid X_i=\mathbf{x}_i;\mathbf{\theta}^{(t)}} [\log L(\theta;\mathbf{x}_i,Z_i) ] \\
&= \sum_{i=1}^n \sum_{j=1}^2 P(Z_i =j \mid X_i = \mathbf{x}_i; \theta^{(t)}) \log L(\theta_j;\mathbf{x}_i,j) \\
&= \sum_{i=1}^n \sum_{j=1}^2 T_{j,i}^{(t)} \big[ \log \tau_j -\tfrac{1}{2} \log |\Sigma_j| -\tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_j)^\top\Sigma_j^{-1} (\mathbf{x}_i-\boldsymbol{\mu}_j) -\tfrac{d}{2} \log(2\pi) \big].
\end{align}</math>

The expectation of <math>\log L(\theta;\mathbf{x}_i,Z_i)</math> inside the sum is taken with respect to the conditional distribution <math>P(Z_i \mid X_i = \mathbf{x}_i; \theta^{(t)})</math>, which may differ for each <math>\mathbf{x}_i</math> of the training set. Everything in the E step is known before the step is taken except <math>T_{j,i}</math>, which is computed according to the equation at the beginning of the E step section.

This full conditional expectation does not need to be calculated in one step, because ''τ'' and '''μ'''/'''Σ''' appear in separate linear terms and can thus be maximized independently.
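Continuing the sketch above, <math>Q(\theta\mid\theta^{(t)})</math> can be evaluated as the <math>T_{j,i}^{(t)}</math>-weighted sum of complete-data log-likelihood terms; the helper name <code>q_value</code> is again illustrative:

<syntaxhighlight lang="python">
def q_value(X, T, tau, mu, Sigma):
    """Evaluate Q(theta | theta^(t)) given membership probabilities T of shape (2, n)."""
    total = 0.0
    for j in range(2):
        # log tau_j + log f(x_i; mu_j, Sigma_j), weighted by T[j, i] and summed over i.
        log_term = np.log(tau[j]) + multivariate_normal.logpdf(X, mean=mu[j], cov=Sigma[j])
        total += np.sum(T[j] * log_term)
    return total
</syntaxhighlight>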
==== M step ====
<math>Q(\theta \mid \theta^{(t)} )</math> being quadratic in form means that determining the maximizing values of <math>\theta</math> is relatively straightforward. Also, <math>\tau</math>, <math>(\boldsymbol{\mu}_1,\Sigma_1)</math> and <math>(\boldsymbol{\mu}_2,\Sigma_2)</math> may all be maximized independently since they all appear in separate linear terms.

To begin, consider <math>\tau</math>, which has the constraint <math>\tau_1+\tau_2=1</math>:

: <math>\begin{align}\boldsymbol{\tau}^{(t+1)} &= \underset{\boldsymbol{\tau}} {\operatorname{arg\,max}}\ Q(\theta \mid \theta^{(t)} ) \\
&= \underset{\boldsymbol{\tau}} {\operatorname{arg\,max}} \ \left\{ \left[ \sum_{i=1}^n T_{1,i}^{(t)} \right] \log \tau_1 + \left[ \sum_{i=1}^n T_{2,i}^{(t)} \right] \log \tau_2 \right\}.
\end{align}</math>

This has the same form as the maximum likelihood estimate for the [[binomial distribution]], so

: <math>\tau^{(t+1)}_j = \frac{\sum_{i=1}^n T_{j,i}^{(t)}}{\sum_{i=1}^n (T_{1,i}^{(t)} + T_{2,i}^{(t)} ) } = \frac{1}{n} \sum_{i=1}^n T_{j,i}^{(t)}.</math>

For the next estimates of <math>(\boldsymbol{\mu}_1,\Sigma_1)</math>:

: <math>\begin{align}(\boldsymbol{\mu}_1^{(t+1)},\Sigma_1^{(t+1)}) &= \underset{\boldsymbol{\mu}_1,\Sigma_1} {\operatorname{arg\,max}}\ Q(\theta \mid \theta^{(t)} ) \\
&= \underset{\boldsymbol{\mu}_1,\Sigma_1} {\operatorname{arg\,max}}\ \sum_{i=1}^n T_{1,i}^{(t)} \left\{ -\tfrac{1}{2} \log |\Sigma_1| -\tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_1)^\top\Sigma_1^{-1} (\mathbf{x}_i-\boldsymbol{\mu}_1) \right\}.
\end{align}</math>

This has the same form as a weighted maximum likelihood estimate for a normal distribution, so

: <math>\boldsymbol{\mu}_1^{(t+1)} = \frac{\sum_{i=1}^n T_{1,i}^{(t)} \mathbf{x}_i}{\sum_{i=1}^n T_{1,i}^{(t)}} </math> and <math>\Sigma_1^{(t+1)} = \frac{\sum_{i=1}^n T_{1,i}^{(t)} (\mathbf{x}_i - \boldsymbol{\mu}_1^{(t+1)}) (\mathbf{x}_i - \boldsymbol{\mu}_1^{(t+1)})^\top }{\sum_{i=1}^n T_{1,i}^{(t)}} </math>

and, by symmetry,

: <math>\boldsymbol{\mu}_2^{(t+1)} = \frac{\sum_{i=1}^n T_{2,i}^{(t)} \mathbf{x}_i}{\sum_{i=1}^n T_{2,i}^{(t)}} </math> and <math>\Sigma_2^{(t+1)} = \frac{\sum_{i=1}^n T_{2,i}^{(t)} (\mathbf{x}_i - \boldsymbol{\mu}_2^{(t+1)}) (\mathbf{x}_i - \boldsymbol{\mu}_2^{(t+1)})^\top }{\sum_{i=1}^n T_{2,i}^{(t)}}.</math>

==== Termination ====
Conclude the iterative process if

: <math> E_{Z\mid\theta^{(t)},\mathbf{x}}[\log L(\theta^{(t)};\mathbf{x},\mathbf{Z})] \leq E_{Z\mid\theta^{(t-1)},\mathbf{x}}[\log L(\theta^{(t-1)};\mathbf{x},\mathbf{Z})] + \varepsilon</math>

for <math> \varepsilon </math> below some preset threshold.

==== Generalization ====
The algorithm illustrated above can be generalized for mixtures of more than two [[multivariate normal distribution]]s.
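Putting the pieces together, the M-step updates and the overall iteration can be sketched by continuing the illustrative Python above; the names <code>m_step</code> and <code>fit_two_component_gmm</code>, the initialization interface, and the tolerance <code>eps</code> are assumptions of the sketch, not part of the derivation:

<syntaxhighlight lang="python">
def m_step(X, T):
    """M step: closed-form updates for tau, mu and Sigma from the weighted data."""
    n = X.shape[0]
    tau = T.sum(axis=1) / n                          # tau_j = (1/n) * sum_i T[j, i]
    mu, Sigma = [], []
    for j in range(2):
        w = T[j][:, None]                            # weights of component j as a column
        mu_j = (w * X).sum(axis=0) / T[j].sum()      # weighted mean
        diff = X - mu_j
        Sigma_j = (w * diff).T @ diff / T[j].sum()   # weighted covariance
        mu.append(mu_j)
        Sigma.append(Sigma_j)
    return tau, mu, Sigma

def fit_two_component_gmm(X, tau, mu, Sigma, eps=1e-6, max_iter=500):
    """Alternate E and M steps until the expected complete-data log-likelihood stops improving."""
    T = e_step(X, tau, mu, Sigma)
    prev = q_value(X, T, tau, mu, Sigma)             # E_{Z|theta^(t-1)}[log L(theta^(t-1); x, Z)]
    for _ in range(max_iter):
        tau, mu, Sigma = m_step(X, T)                # M step: theta^(t)
        T = e_step(X, tau, mu, Sigma)                # E step with the new parameters
        cur = q_value(X, T, tau, mu, Sigma)          # E_{Z|theta^(t)}[log L(theta^(t); x, Z)]
        if cur <= prev + eps:                        # termination criterion from the section above
            break
        prev = cur
    return tau, mu, Sigma
</syntaxhighlight>

Starting from a reasonable initialization (for example, means taken from a k-means run), the loop alternates E and M steps until the expected complete-data log-likelihood improves by less than <math>\varepsilon</math>.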