== Description ==

=== The symbols ===
Given the [[statistical model]] which generates a set <math>\mathbf{X}</math> of observed data, a set of unobserved latent data or [[missing values]] <math>\mathbf{Z}</math>, and a vector of unknown parameters <math>\boldsymbol\theta</math>, along with a [[likelihood function]] <math>L(\boldsymbol\theta; \mathbf{X}, \mathbf{Z}) = p(\mathbf{X}, \mathbf{Z}\mid\boldsymbol\theta)</math>, the [[maximum likelihood estimate]] (MLE) of the unknown parameters is determined by maximizing the [[marginal likelihood]] of the observed data

:<math>L(\boldsymbol\theta; \mathbf{X}) = p(\mathbf{X}\mid\boldsymbol\theta) = \int p(\mathbf{X},\mathbf{Z} \mid \boldsymbol\theta) \, d\mathbf{Z} = \int p(\mathbf{X} \mid \mathbf{Z}, \boldsymbol\theta) p(\mathbf{Z} \mid \boldsymbol\theta) \, d\mathbf{Z}.</math>

However, this quantity is often intractable, since <math>\mathbf{Z}</math> is unobserved and the distribution of <math>\mathbf{Z}</math> is unknown before <math>\boldsymbol\theta</math> is obtained.

=== The EM algorithm ===
The EM algorithm seeks to find the maximum likelihood estimate of the marginal likelihood by iteratively applying these two steps:

:''Expectation step (E step)'': Define <math>Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)})</math> as the [[expected value]] of the log [[likelihood function]] of <math>\boldsymbol\theta</math>, with respect to the current [[conditional probability distribution|conditional distribution]] of <math>\mathbf{Z}</math> given <math>\mathbf{X}</math> and the current estimates of the parameters <math>\boldsymbol\theta^{(t)}</math>:
::<math>Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) = \operatorname{E}_{\mathbf{Z} \sim p(\cdot \mid \mathbf{X},\boldsymbol\theta^{(t)})}\left[ \log p(\mathbf{X},\mathbf{Z} \mid \boldsymbol\theta) \right]</math>

:''Maximization step (M step)'': Find the parameters that maximize this quantity:
::<math>\boldsymbol\theta^{(t+1)} = \underset{\boldsymbol\theta}{\operatorname{arg\,max}} \ Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)})</math>

More succinctly, the two steps can be written as one equation:
<math display="block">\boldsymbol\theta^{(t+1)} = \underset{\boldsymbol\theta}{\operatorname{arg\,max}} \ \operatorname{E}_{\mathbf{Z} \sim p(\cdot \mid \mathbf{X},\boldsymbol\theta^{(t)})}\left[ \log p(\mathbf{X},\mathbf{Z} \mid \boldsymbol\theta) \right]</math>

=== Interpretation of the variables ===
The typical models to which EM is applied use <math>\mathbf{Z}</math> as a latent variable indicating membership in one of a set of groups:
# The observed data points <math>\mathbf{X}</math> may be [[discrete random variable|discrete]] (taking values in a finite or countably infinite set) or [[continuous random variable|continuous]] (taking values in an uncountably infinite set). Associated with each data point may be a vector of observations.
# The [[missing values]] (also known as [[latent variables]]) <math>\mathbf{Z}</math> are [[discrete random variable|discrete]], drawn from a fixed number of values, with one latent variable per observed unit.
# The parameters are continuous and are of two kinds: parameters that are associated with all data points, and parameters associated with a specific value of a latent variable (i.e., associated with all data points whose corresponding latent variable has that value).

However, it is possible to apply EM to other sorts of models.
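To make the two steps concrete, the following is a minimal sketch of EM for the classic case of a two-component one-dimensional Gaussian mixture with known unit variances, so that both the E step and the M step have closed forms. The function name <code>em_two_gaussians</code> and its arguments are illustrative choices for this sketch, not taken from any particular library.

<syntaxhighlight lang="python">
import numpy as np

def em_two_gaussians(x, n_iter=100, tol=1e-8, seed=0):
    """EM for a two-component 1-D Gaussian mixture with unit variances.

    theta = (pi, mu[0], mu[1]); the latent Z_i in {0, 1} labels the
    component assumed to have generated x_i.
    """
    rng = np.random.default_rng(seed)
    pi = 0.5                                   # mixing weight of component 0
    mu = rng.choice(x, size=2, replace=False)  # initialize means from the data
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E step: responsibilities r[i, k] = p(Z_i = k | x_i, theta^(t)),
        # the conditional distribution defining Q(theta | theta^(t)).
        dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)
        weighted = dens * np.array([pi, 1.0 - pi])
        r = weighted / weighted.sum(axis=1, keepdims=True)
        # M step: closed-form argmax of Q over (pi, mu).
        nk = r.sum(axis=0)
        pi = nk[0] / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        # Observed-data log-likelihood at theta^(t); it never decreases.
        ll = np.log(weighted.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu
</syntaxhighlight>

On data drawn from two well-separated unit-variance Gaussians, the returned means should approach the true component means, and the monitored log-likelihood is nondecreasing across iterations.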
The motivation is as follows. If the value of the parameters <math>\boldsymbol\theta</math> is known, usually the value of the latent variables <math>\mathbf{Z}</math> can be found by maximizing the log-likelihood over all possible values of <math>\mathbf{Z}</math>, either simply by iterating over <math>\mathbf{Z}</math> or through an algorithm such as the [[Viterbi algorithm]] for [[hidden Markov model]]s. Conversely, if the value of the latent variables <math>\mathbf{Z}</math> is known, an estimate of the parameters <math>\boldsymbol\theta</math> can be found fairly easily, typically by grouping the observed data points according to the value of the associated latent variable and averaging the values, or some function of the values, of the points in each group. This suggests an iterative algorithm for the case where both <math>\boldsymbol\theta</math> and <math>\mathbf{Z}</math> are unknown:
# First, initialize the parameters <math>\boldsymbol\theta</math> to some random values.
# Compute the probability of each possible value of <math>\mathbf{Z}</math>, given <math>\boldsymbol\theta</math>.
# Then, use the just-computed probabilities of <math>\mathbf{Z}</math> to compute a better estimate for the parameters <math>\boldsymbol\theta</math>.
# Iterate steps 2 and 3 until convergence.

The algorithm as just described monotonically increases the observed-data log-likelihood; equivalently, taking the negative log-likelihood as a cost function, it monotonically approaches a local minimum of that cost.
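The alternation just described can also be run with ''hard'' assignments: step 2 replaces the full conditional distribution of <math>\mathbf{Z}</math> with its single most likely value, and step 3 groups and averages accordingly. The sketch below is a hypothetical illustration under the same two-component unit-variance setup as above; this hard variant is sometimes called classification EM rather than EM proper, and it makes the grouping-and-averaging structure explicit.

<syntaxhighlight lang="python">
import numpy as np

def hard_em(x, n_iter=50):
    """Hard-assignment variant of the iteration in steps 1-4.

    Given theta, each Z_i is set to its most likely value (the nearest
    mean); given Z, each mean is re-estimated by averaging the points
    assigned to it.
    """
    mu = np.array([x.min(), x.max()], dtype=float)  # step 1: crude initialization
    for _ in range(n_iter):
        # step 2 (hard): most likely component for each point, given theta
        z = np.argmin((x[:, None] - mu[None, :]) ** 2, axis=1)
        # step 3: re-estimate theta by grouping and averaging
        for k in (0, 1):
            if np.any(z == k):
                mu[k] = x[z == k].mean()
    return mu, z
</syntaxhighlight>

Unlike EM proper, this hard variant optimizes a different objective (with unit variances and equal weights it is the one-dimensional analogue of [[k-means clustering|''k''-means]]), but it shows why knowing either <math>\boldsymbol\theta</math> or <math>\mathbf{Z}</math> makes estimating the other easy.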