=== Expectation maximization (EM) === <!-- Linked from [[Expectation-maximization algorithm]] -->

[[Expectation-maximization algorithm|Expectation maximization]] (EM) is seemingly the most popular technique used to determine the parameters of a mixture with an ''a priori'' given number of components. This is a particular way of implementing [[maximum likelihood]] estimation for this problem. EM is of particular appeal for finite normal mixtures, where closed-form expressions are possible, such as in the following iterative algorithm by Dempster ''et al.'' (1977):<ref name=dempster1977>{{cite journal |first1=A.P. |last1=Dempster |first2=N.M. |last2=Laird |first3=D.B. |last3=Rubin |title=Maximum Likelihood from Incomplete Data via the EM Algorithm |journal=Journal of the Royal Statistical Society, Series B |volume=39 |issue=1 |pages=1–38 |date=1977 |doi=10.1111/j.2517-6161.1977.tb01600.x |jstor=2984875 |citeseerx=10.1.1.163.7580 }}</ref>

:<math> w_s^{(j+1)} = \frac{1}{N} \sum_{t=1}^N h_s^{(j)}(t) </math>

:<math> \mu_s^{(j+1)} = \frac{\sum_{t=1}^N h_s^{(j)}(t) x^{(t)}}{\sum_{t=1}^N h_s^{(j)}(t)} </math>

:<math> \Sigma_s^{(j+1)} = \frac{\sum_{t=1}^N h_s^{(j)}(t) [x^{(t)}-\mu_s^{(j+1)}][x^{(t)}-\mu_s^{(j+1)}]^{\top}}{\sum_{t=1}^N h_s^{(j)}(t)} </math>

with the posterior probabilities

:<math> h_s^{(j)}(t) = \frac{w_s^{(j)} p_s(x^{(t)}; \mu_s^{(j)},\Sigma_s^{(j)}) }{ \sum_{i=1}^n w_i^{(j)} p_i(x^{(t)}; \mu_i^{(j)}, \Sigma_i^{(j)})}. </math>

Thus, on the basis of the current estimate for the parameters, the [[conditional probability]] that a given observation ''x''<sup>(''t'')</sup> was generated from state ''s'' is determined for each {{nowrap|''t'' {{=}} 1, …, ''N''}}, where ''N'' is the sample size. The parameters are then updated such that the new component weights correspond to the average conditional probability, and each component mean and covariance is the component-specific weighted average of the mean and covariance of the entire sample.
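These closed-form updates translate directly into code. The following is a minimal NumPy/SciPy sketch of a single EM iteration for a multivariate Gaussian mixture; the function name <code>em_step</code> and the variable names are illustrative only and are not taken from the article or any particular library.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import multivariate_normal

def em_step(x, w, mu, sigma):
    """One EM iteration for a multivariate Gaussian mixture.

    x     : (N, d) array of observations
    w     : (n,) array of component weights
    mu    : (n, d) array of component means
    sigma : (n, d, d) array of component covariance matrices
    """
    n = w.shape[0]
    N = x.shape[0]

    # E-step: posterior probabilities h_s(t) of component s given x^(t)
    h = np.empty((n, N))
    for s in range(n):
        h[s] = w[s] * multivariate_normal.pdf(x, mean=mu[s], cov=sigma[s])
    h /= h.sum(axis=0, keepdims=True)

    # M-step: the closed-form weight, mean and covariance updates above
    w_new = h.mean(axis=1)
    mu_new = (h @ x) / h.sum(axis=1, keepdims=True)
    sigma_new = np.empty_like(sigma)
    for s in range(n):
        centred = x - mu_new[s]                                  # (N, d)
        sigma_new[s] = (h[s, :, None] * centred).T @ centred / h[s].sum()
    return w_new, mu_new, sigma_new
</syntaxhighlight>

Each call performs one E-step (computing the posterior probabilities <math>h_s^{(j)}(t)</math>) followed by one M-step (the weight, mean and covariance updates above), returning the parameters for iteration ''j''+1.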
Dempster<ref name="dempster1977"/> also showed that each successive EM iteration will not decrease the likelihood, a property not shared by other gradient-based maximization techniques. Moreover, EM naturally embeds within it constraints on the probability vector and, for sufficiently large sample sizes, positive definiteness of the covariance iterates. This is a key advantage, since explicitly constrained methods incur extra computational costs to check and maintain appropriate values. Theoretically, EM is a first-order algorithm and as such converges slowly to a fixed-point solution. Redner and Walker (1984){{full citation needed|date=November 2012}} make this point, arguing in favour of superlinear and second-order Newton and quasi-Newton methods and reporting slow convergence in EM on the basis of their empirical tests. They do concede that convergence in likelihood was rapid even if convergence in the parameter values themselves was not. The relative merits of EM and other algorithms vis-à-vis convergence have been discussed in other literature.<ref name=XuJordam>{{cite journal |first1=L. |last1=Xu |first2=M.I. |last2=Jordan |title=On Convergence Properties of the EM Algorithm for Gaussian Mixtures |journal=Neural Computation |volume=8 |issue=1 |pages=129–151 |date=January 1996 |doi=10.1162/neco.1996.8.1.129 |hdl=10338.dmlcz/135225 |s2cid=207714252 |hdl-access=free }}</ref>

Other common objections to the use of EM are that it has a propensity to spuriously identify local maxima, as well as displaying sensitivity to initial values.<ref name="McLachlan_2"/><ref name="botev2004global">{{Cite book |author1=Botev, Z.I. |author2=Kroese, D.P. |title=Proceedings of the 2004 Winter Simulation Conference, 2004 |chapter=Global Likelihood Optimization Via the Cross-Entropy Method, with an Application to Mixture Models |author-link2=Dirk Kroese |volume=1 |pages=517–523 |year=2004 |doi=10.1109/WSC.2004.1371358 |isbn=978-0-7803-8786-7 |citeseerx=10.1.1.331.2319 |s2cid=6880171 }}</ref> One may address these problems by evaluating EM at several initial points in the parameter space, but this is computationally costly, and other approaches, such as the annealing EM method of Ueda and Nakano (1998) (in which the initial components are essentially forced to overlap, providing a less heterogeneous basis for initial guesses), may be preferable.

Figueiredo and Jain<ref name="Jain" /> note that convergence to 'meaningless' parameter values obtained at the boundary (where regularity conditions break down, e.g., Ghosh and Sen (1985)) is frequently observed when the number of model components exceeds the optimal/true one. On this basis they suggest a unified approach to estimation and identification in which the initial ''n'' is chosen to greatly exceed the expected optimal value. Their optimization routine is constructed via a minimum message length (MML) criterion that effectively eliminates a candidate component if there is insufficient information to support it. In this way it is possible to systematize reductions in ''n'' and consider estimation and identification jointly.

==== The expectation step ====

With initial guesses for the parameters of our mixture model, "partial membership" of each data point in each constituent distribution is computed by calculating [[expectation value]]s for the membership variables of each data point. That is, for each data point ''x<sub>j</sub>'' and distribution ''Y<sub>i</sub>'', the membership value ''y''<sub>''i'', ''j''</sub> is:

:<math> y_{i,j} = \frac{a_i f_Y(x_j;\theta_i)}{f_{X}(x_j)}.</math>

==== The maximization step ====

With expectation values in hand for group membership, [[plug-in estimates]] are recomputed for the distribution parameters. The mixing coefficients ''a<sub>i</sub>'' are the [[arithmetic mean|mean]]s of the membership values over the ''N'' data points.

:<math> a_i = \frac{1}{N}\sum_{j=1}^N y_{i,j}</math>

The component model parameters ''θ<sub>i</sub>'' are also calculated by expectation maximization using data points ''x<sub>j</sub>'' that have been weighted using the membership values. For example, if ''θ'' is a mean ''μ'':

:<math> \mu_{i} = \frac{\sum_{j} y_{i,j}x_{j}}{\sum_{j} y_{i,j}}.</math>

With new estimates for ''a<sub>i</sub>'' and the ''θ<sub>i</sub>'''s, the expectation step is repeated to recompute new membership values. The entire procedure is repeated until model parameters converge.
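The alternation of the two steps can be written as a short, self-contained loop. The following sketch fits a one-dimensional Gaussian mixture using exactly the membership, mixing-coefficient, and weighted-mean formulas above (plus the analogous weighted standard-deviation update), stopping once the log-likelihood, which EM never decreases, levels off. The function name <code>fit_mixture_em</code> and the stopping parameters <code>tol</code> and <code>max_iter</code> are illustrative, not drawn from the article or any particular library.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

def fit_mixture_em(x, a, mu, sigma, tol=1e-6, max_iter=500):
    """EM for a one-dimensional Gaussian mixture.

    x            : (N,) array of observations
    a, mu, sigma : (n,) arrays of initial mixing coefficients,
                   component means and standard deviations
    """
    ll_old = -np.inf
    for _ in range(max_iter):
        # Expectation step: membership values y[i, j] = a_i f(x_j; theta_i) / f_X(x_j)
        dens = np.array([a[i] * norm.pdf(x, loc=mu[i], scale=sigma[i])
                         for i in range(len(a))])          # (n, N)
        y = dens / dens.sum(axis=0, keepdims=True)

        # Maximization step: plug-in estimates weighted by the memberships
        a = y.mean(axis=1)                                  # mixing coefficients
        mu = (y @ x) / y.sum(axis=1)                        # weighted means
        sigma = np.sqrt((y * (x - mu[:, None]) ** 2).sum(axis=1) / y.sum(axis=1))

        # Stop once the log-likelihood stops improving noticeably
        ll = np.log(dens.sum(axis=0)).sum()
        if ll - ll_old < tol:
            break
        ll_old = ll
    return a, mu, sigma
</syntaxhighlight>

A typical call might start from equal weights and rough guesses for the means, e.g. <code>fit_mixture_em(x, np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0]))</code>. Because of the sensitivity to initial values discussed above, such a loop is often run from several random starting points, keeping the solution with the highest final log-likelihood.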