=== Gaussian mixture ===
<!--This section is linked from [[Matrix calculus]] -->
[[File:ClusterAnalysis_Mouse.svg|thumb|400px|Comparison of [[K-means clustering|k-means]] and EM on artificial data visualized with [[Environment for DeveLoping KDD-Applications Supported by Index-Structures|ELKI]]. Using the variances, the EM algorithm can describe the normal distributions exactly, while k-means splits the data into [[Voronoi diagram|Voronoi]] cells. The cluster center is indicated by the lighter, bigger symbol.]]
[[File:Em old faithful.gif|thumb|240px|An animation demonstrating the EM algorithm fitting a two-component Gaussian [[mixture model]] to the [[Old Faithful]] dataset. The algorithm steps through from a random initialization to convergence.]]

Let <math>\mathbf{x} = (\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_n)</math> be a sample of <math>n</math> independent observations from a [[mixture model|mixture]] of two [[multivariate normal distribution]]s of dimension <math>d</math>, and let <math>\mathbf{z} = (z_1,z_2,\ldots,z_n)</math> be the latent variables that determine the component from which each observation originates.<ref name="hastie2001"/>

: <math>X_i \mid(Z_i = 1) \sim \mathcal{N}_d(\boldsymbol{\mu}_1,\Sigma_1)</math> and <math>X_i \mid(Z_i = 2) \sim \mathcal{N}_d(\boldsymbol{\mu}_2,\Sigma_2),</math>

where

: <math>\operatorname{P} (Z_i = 1 ) = \tau_1 \, </math> and <math>\operatorname{P} (Z_i=2) = \tau_2 = 1-\tau_1.</math>

The aim is to estimate the unknown parameters, comprising the ''mixing'' proportion between the two Gaussians and the mean and covariance of each:

: <math>\theta = \big( \boldsymbol{\tau},\boldsymbol{\mu}_1,\boldsymbol{\mu}_2,\Sigma_1,\Sigma_2 \big),</math>

where the incomplete-data likelihood function is

: <math>L(\theta;\mathbf{x}) = \prod_{i=1}^n \sum_{j=1}^2 \tau_j \ f(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j),</math>

and the complete-data likelihood function is

: <math>L(\theta;\mathbf{x},\mathbf{z}) = p(\mathbf{x},\mathbf{z} \mid \theta) = \prod_{i=1}^n \prod_{j=1}^2 \ [f(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j) \tau_j] ^{\mathbb{I}(z_i=j)},</math>

or

: <math>L(\theta;\mathbf{x},\mathbf{z}) = \exp \left\{ \sum_{i=1}^n \sum_{j=1}^2 \mathbb{I}(z_i=j) \big[ \log \tau_j -\tfrac{1}{2} \log |\Sigma_j| -\tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_j)^\top\Sigma_j^{-1} (\mathbf{x}_i-\boldsymbol{\mu}_j) -\tfrac{d}{2} \log(2\pi) \big] \right\},</math>

where <math>\mathbb{I}</math> is an [[indicator function]] and <math>f</math> is the [[probability density function]] of a multivariate normal. In the last equality, for each {{math|''i''}}, exactly one indicator <math>\mathbb{I}(z_i=j)</math> is equal to one and the other is equal to zero, so the inner sum reduces to a single term.

==== E step ====
Given the current estimate of the parameters ''θ''<sup>(''t'')</sup>, the conditional distribution of the ''Z''<sub>''i''</sub> is determined by [[Bayes' theorem]] to be the proportional height of the normal [[probability density function|density]] weighted by ''τ'':

: <math>T_{j,i}^{(t)} := \operatorname{P}(Z_i=j \mid X_i=\mathbf{x}_i ;\theta^{(t)}) = \frac{\tau_j^{(t)} \ f(\mathbf{x}_i;\boldsymbol{\mu}_j^{(t)},\Sigma_j^{(t)})}{\tau_1^{(t)} \ f(\mathbf{x}_i;\boldsymbol{\mu}_1^{(t)},\Sigma_1^{(t)}) + \tau_2^{(t)} \ f(\mathbf{x}_i;\boldsymbol{\mu}_2^{(t)},\Sigma_2^{(t)})}.</math>

These are called the "membership probabilities", which are normally considered the output of the E step (although this is not the Q function defined below).
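For illustration, the membership probabilities can be computed in [[Python (programming language)|Python]] with [[NumPy]] and [[SciPy]]. The following is a minimal sketch, assuming the observations are stored in an ''n'' × ''d'' array <code>X</code> and the current parameters are given as a length-2 mixing vector <code>tau</code> and length-2 lists <code>mu</code> and <code>Sigma</code>; the name <code>e_step</code> and this parameter layout are illustrative choices, not part of the derivation:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, tau, mu, Sigma):
    """E step: membership probabilities T[j, i] = P(Z_i = j | x_i; theta^(t))."""
    # Weighted densities tau_j * f(x_i; mu_j, Sigma_j) for each component j.
    weighted = np.vstack([
        tau[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
        for j in range(2)
    ])                                            # shape (2, n)
    # Normalize over the two components so each column sums to one.
    return weighted / weighted.sum(axis=0, keepdims=True)
</syntaxhighlight>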
This E step corresponds to setting up the function Q:

: <math>\begin{align}Q(\theta\mid\theta^{(t)}) &= \operatorname{E}_{\mathbf{Z}\mid\mathbf{X}=\mathbf{x};\mathbf{\theta}^{(t)}} [\log L(\theta;\mathbf{x},\mathbf{Z}) ] \\
&= \operatorname{E}_{\mathbf{Z}\mid\mathbf{X}=\mathbf{x};\mathbf{\theta}^{(t)}} \left[\log \prod_{i=1}^{n}L(\theta;\mathbf{x}_i,Z_i) \right] \\
&= \operatorname{E}_{\mathbf{Z}\mid\mathbf{X}=\mathbf{x};\mathbf{\theta}^{(t)}} \left[\sum_{i=1}^n \log L(\theta;\mathbf{x}_i,Z_i) \right] \\
&= \sum_{i=1}^n\operatorname{E}_{Z_i\mid X_i=\mathbf{x}_i;\mathbf{\theta}^{(t)}} [\log L(\theta;\mathbf{x}_i,Z_i) ] \\
&= \sum_{i=1}^n \sum_{j=1}^2 P(Z_i =j \mid X_i = \mathbf{x}_i; \theta^{(t)}) \log L(\theta_j;\mathbf{x}_i,j) \\
&= \sum_{i=1}^n \sum_{j=1}^2 T_{j,i}^{(t)} \big[ \log \tau_j -\tfrac{1}{2} \log |\Sigma_j| -\tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_j)^\top\Sigma_j^{-1} (\mathbf{x}_i-\boldsymbol{\mu}_j) -\tfrac{d}{2} \log(2\pi) \big].
\end{align}</math>

The expectation of <math>\log L(\theta;\mathbf{x}_i,Z_i)</math> inside the sum is taken with respect to the conditional distribution <math>P(Z_i \mid X_i = \mathbf{x}_i; \theta^{(t)})</math>, which may differ for each <math>\mathbf{x}_i</math> of the training set. Everything in the E step is known before the step is taken except <math>T_{j,i}</math>, which is computed according to the equation at the beginning of the E step section.

This full conditional expectation does not need to be calculated in one step, because ''τ'' and '''μ'''/'''Σ''' appear in separate linear terms and can thus be maximized independently.
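Continuing the sketch above, <math>Q(\theta\mid\theta^{(t)})</math> can be evaluated as the <math>T_{j,i}^{(t)}</math>-weighted sum of complete-data log-likelihood terms; the helper name <code>q_value</code> is again illustrative:

<syntaxhighlight lang="python">
def q_value(X, T, tau, mu, Sigma):
    """Evaluate Q(theta | theta^(t)) given membership probabilities T of shape (2, n)."""
    total = 0.0
    for j in range(2):
        # log tau_j + log f(x_i; mu_j, Sigma_j), weighted by T[j, i] and summed over i.
        log_term = np.log(tau[j]) + multivariate_normal.logpdf(X, mean=mu[j], cov=Sigma[j])
        total += np.sum(T[j] * log_term)
    return total
</syntaxhighlight>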
==== M step ====
<math>Q(\theta \mid \theta^{(t)} )</math> being quadratic in form means that determining the maximizing values of <math>\theta</math> is relatively straightforward. Also, <math>\tau</math>, <math>(\boldsymbol{\mu}_1,\Sigma_1)</math> and <math>(\boldsymbol{\mu}_2,\Sigma_2)</math> may all be maximized independently since they all appear in separate linear terms.

To begin, consider <math>\tau</math>, which has the constraint <math>\tau_1+\tau_2=1</math>:

: <math>\begin{align}\boldsymbol{\tau}^{(t+1)} &= \underset{\boldsymbol{\tau}} {\operatorname{arg\,max}}\ Q(\theta \mid \theta^{(t)} ) \\
&= \underset{\boldsymbol{\tau}} {\operatorname{arg\,max}} \ \left\{ \left[ \sum_{i=1}^n T_{1,i}^{(t)} \right] \log \tau_1 + \left[ \sum_{i=1}^n T_{2,i}^{(t)} \right] \log \tau_2 \right\}.
\end{align}</math>

This has the same form as the maximum likelihood estimate for the [[binomial distribution]], so

: <math>\tau^{(t+1)}_j = \frac{\sum_{i=1}^n T_{j,i}^{(t)}}{\sum_{i=1}^n (T_{1,i}^{(t)} + T_{2,i}^{(t)} ) } = \frac{1}{n} \sum_{i=1}^n T_{j,i}^{(t)}.</math>

For the next estimates of <math>(\boldsymbol{\mu}_1,\Sigma_1)</math>:

: <math>\begin{align}(\boldsymbol{\mu}_1^{(t+1)},\Sigma_1^{(t+1)}) &= \underset{\boldsymbol{\mu}_1,\Sigma_1} {\operatorname{arg\,max}}\ Q(\theta \mid \theta^{(t)} ) \\
&= \underset{\boldsymbol{\mu}_1,\Sigma_1} {\operatorname{arg\,max}}\ \sum_{i=1}^n T_{1,i}^{(t)} \left\{ -\tfrac{1}{2} \log |\Sigma_1| -\tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_1)^\top\Sigma_1^{-1} (\mathbf{x}_i-\boldsymbol{\mu}_1) \right\}.
\end{align}</math>

This has the same form as a weighted maximum likelihood estimate for a normal distribution, so

: <math>\boldsymbol{\mu}_1^{(t+1)} = \frac{\sum_{i=1}^n T_{1,i}^{(t)} \mathbf{x}_i}{\sum_{i=1}^n T_{1,i}^{(t)}} </math> and <math>\Sigma_1^{(t+1)} = \frac{\sum_{i=1}^n T_{1,i}^{(t)} (\mathbf{x}_i - \boldsymbol{\mu}_1^{(t+1)}) (\mathbf{x}_i - \boldsymbol{\mu}_1^{(t+1)})^\top }{\sum_{i=1}^n T_{1,i}^{(t)}} </math>

and, by symmetry,

: <math>\boldsymbol{\mu}_2^{(t+1)} = \frac{\sum_{i=1}^n T_{2,i}^{(t)} \mathbf{x}_i}{\sum_{i=1}^n T_{2,i}^{(t)}} </math> and <math>\Sigma_2^{(t+1)} = \frac{\sum_{i=1}^n T_{2,i}^{(t)} (\mathbf{x}_i - \boldsymbol{\mu}_2^{(t+1)}) (\mathbf{x}_i - \boldsymbol{\mu}_2^{(t+1)})^\top }{\sum_{i=1}^n T_{2,i}^{(t)}}.</math>

==== Termination ====
Conclude the iterative process if

: <math> E_{Z\mid\theta^{(t)},\mathbf{x}}[\log L(\theta^{(t)};\mathbf{x},\mathbf{Z})] \leq E_{Z\mid\theta^{(t-1)},\mathbf{x}}[\log L(\theta^{(t-1)};\mathbf{x},\mathbf{Z})] + \varepsilon</math>

for <math> \varepsilon </math> below some preset threshold.

==== Generalization ====
The algorithm illustrated above can be generalized for mixtures of more than two [[multivariate normal distribution]]s.
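Putting the pieces together, the M-step updates and the overall iteration can be sketched by continuing the illustrative Python above; the names <code>m_step</code> and <code>fit_two_component_gmm</code>, the initialization interface, and the tolerance <code>eps</code> are assumptions of the sketch, not part of the derivation:

<syntaxhighlight lang="python">
def m_step(X, T):
    """M step: closed-form updates for tau, mu and Sigma from the weighted data."""
    n = X.shape[0]
    tau = T.sum(axis=1) / n                          # tau_j = (1/n) * sum_i T[j, i]
    mu, Sigma = [], []
    for j in range(2):
        w = T[j][:, None]                            # weights of component j as a column
        mu_j = (w * X).sum(axis=0) / T[j].sum()      # weighted mean
        diff = X - mu_j
        Sigma_j = (w * diff).T @ diff / T[j].sum()   # weighted covariance
        mu.append(mu_j)
        Sigma.append(Sigma_j)
    return tau, mu, Sigma

def fit_two_component_gmm(X, tau, mu, Sigma, eps=1e-6, max_iter=500):
    """Alternate E and M steps until the expected complete-data log-likelihood stops improving."""
    T = e_step(X, tau, mu, Sigma)
    prev = q_value(X, T, tau, mu, Sigma)             # E_{Z|theta^(t-1)}[log L(theta^(t-1); x, Z)]
    for _ in range(max_iter):
        tau, mu, Sigma = m_step(X, T)                # M step: theta^(t)
        T = e_step(X, tau, mu, Sigma)                # E step with the new parameters
        cur = q_value(X, T, tau, mu, Sigma)          # E_{Z|theta^(t)}[log L(theta^(t); x, Z)]
        if cur <= prev + eps:                        # termination criterion from the section above
            break
        prev = cur
    return tau, mu, Sigma
</syntaxhighlight>

Starting from a reasonable initialization (for example, means taken from a k-means run), the loop alternates E and M steps until the expected complete-data log-likelihood improves by less than <math>\varepsilon</math>.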