== Parameter estimation and system identification ==

Parametric mixture models are often used when we know the distribution ''Y'' and we can sample from ''X'', but we would like to determine the ''a<sub>i</sub>'' and ''θ<sub>i</sub>'' values. Such situations can arise in studies in which we sample from a population that is composed of several distinct subpopulations.

It is common to think of probability mixture modeling as a missing data problem. One way to understand this is to assume that the data points under consideration have "membership" in one of the distributions we are using to model the data. When we start, this membership is unknown, or missing. The job of estimation is to devise appropriate parameters for the model functions we choose, with the connection to the data points being represented as their membership in the individual model distributions.

A variety of approaches to the problem of mixture decomposition have been proposed, many of which focus on maximum likelihood methods such as [[expectation maximization]] (EM) or maximum ''a posteriori'' estimation (MAP). Generally these methods consider separately the questions of system identification and parameter estimation; methods to determine the number and functional form of components within a mixture are distinguished from methods to estimate the corresponding parameter values. Some notable departures are the graphical methods as outlined in Tarter and Lock<ref name=tart>{{citation |title=Model Free Curve Estimation |first=Michael E. |last=Tarter |date=1993 |publisher=Chapman and Hall }}</ref> and more recently [[minimum message length]] (MML) techniques such as Figueiredo and Jain<ref name=Jain>{{cite journal |first1=M.A.T. |last1=Figueiredo |first2=A.K. |last2=Jain |title=Unsupervised Learning of Finite Mixture Models |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=24 |issue=3 |pages=381–396 |date=March 2002 |doi=10.1109/34.990138 |citeseerx=10.1.1.362.9811 }}</ref> and to some extent the moment matching pattern analysis routines suggested by McWilliam and Loh (2009).<ref name="mcwilli">{{citation |last1=McWilliam |first1=N. |last2=Loh |first2=K. |title=Incorporating Multidimensional Tail-Dependencies in the Valuation of Credit Derivatives (Working Paper) |date=2008}} [http://www.misys.com/cds-portlets/digitalAssets/4/2797_CDsAndTailDep_forPublication_final1.pdf]</ref>

=== Expectation maximization (EM) ===
<!-- Linked from [[Expectation-maximization algorithm]] -->
[[Expectation-maximization algorithm|Expectation maximization]] (EM) is seemingly the most popular technique used to determine the parameters of a mixture with an ''a priori'' given number of components. This is a particular way of implementing [[maximum likelihood]] estimation for this problem. EM is of particular appeal for finite normal mixtures where closed-form expressions are possible, such as in the following iterative algorithm by Dempster ''et al.'' (1977)<ref name=dempster1977>{{cite journal |first1=A.P. |last1=Dempster |first2=N.M. |last2=Laird |first3=D.B. |last3=Rubin |title=Maximum Likelihood from Incomplete Data via the EM Algorithm |journal=Journal of the Royal Statistical Society, Series B |volume=39 |issue=1 |pages=1–38 |date=1977 |doi=10.1111/j.2517-6161.1977.tb01600.x |jstor=2984875 |citeseerx=10.1.1.163.7580 }}</ref>

:<math> w_s^{(j+1)} = \frac{1}{N} \sum_{t=1}^N h_s^{(j)}(t) </math>
:<math> \mu_s^{(j+1)} = \frac{\sum_{t=1}^N h_s^{(j)}(t) x^{(t)}}{\sum_{t=1}^N h_s^{(j)}(t)} </math>
:<math> \Sigma_s^{(j+1)} = \frac{\sum_{t=1}^N h_s^{(j)}(t) [x^{(t)}-\mu_s^{(j+1)}][x^{(t)}-\mu_s^{(j+1)}]^{\top}}{\sum_{t=1}^N h_s^{(j)}(t)} </math>

with the posterior probabilities

:<math> h_s^{(j)}(t) = \frac{w_s^{(j)} p_s(x^{(t)}; \mu_s^{(j)},\Sigma_s^{(j)}) }{ \sum_{i=1}^n w_i^{(j)} p_i(x^{(t)}; \mu_i^{(j)}, \Sigma_i^{(j)})}. </math>

Thus on the basis of the current estimate for the parameters, the [[conditional probability]] for a given observation ''x''<sup>(''t'')</sup> being generated from state ''s'' is determined for each {{nowrap|''t'' {{=}} 1, …, ''N''}}; ''N'' being the sample size. The parameters are then updated such that the new component weights correspond to the average conditional probability and each component mean and covariance is the component-specific weighted average of the mean and covariance of the entire sample.

Dempster<ref name="dempster1977"/> also showed that each successive EM iteration will not decrease the likelihood, a property not shared by other gradient-based maximization techniques. Moreover, EM naturally embeds within it constraints on the probability vector, and for sufficiently large sample sizes positive definiteness of the covariance iterates. This is a key advantage since explicitly constrained methods incur extra computational costs to check and maintain appropriate values. Theoretically EM is a first-order algorithm and as such converges slowly to a fixed-point solution. Redner and Walker (1984){{full citation needed|date=November 2012}} make this point, arguing in favour of superlinear and second-order Newton and quasi-Newton methods and reporting slow convergence in EM on the basis of their empirical tests. They do concede that convergence in likelihood was rapid even if convergence in the parameter values themselves was not. The relative merits of EM and other algorithms vis-à-vis convergence have been discussed in other literature.<ref name=XuJordam>{{cite journal |first1=L. |last1=Xu |first2=M.I. |last2=Jordan |title=On Convergence Properties of the EM Algorithm for Gaussian Mixtures |journal=Neural Computation |volume=8 |issue=1 |pages=129–151 |date=January 1996 |doi=10.1162/neco.1996.8.1.129 |hdl=10338.dmlcz/135225 |s2cid=207714252 |hdl-access=free }}</ref>
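The update equations above can be written down almost verbatim in code. The following is an illustrative sketch only; the function name, array shapes and use of NumPy/SciPy are choices made here rather than part of the published algorithm.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import multivariate_normal

def em_step(x, w, mu, sigma):
    """One EM iteration for an n-component Gaussian mixture.

    x: (N, d) observations; w: (n,) weights; mu: (n, d) means; sigma: (n, d, d) covariances.
    Returns updated (w, mu, sigma) following the update equations above.
    """
    N = x.shape[0]
    n = w.shape[0]
    # E-step: posterior probabilities h_s(t) of each component s for each observation t
    h = np.array([w[s] * multivariate_normal.pdf(x, mu[s], sigma[s]) for s in range(n)])  # (n, N)
    h /= h.sum(axis=0, keepdims=True)
    # M-step: weights, means and covariances as h-weighted sample statistics
    w_new = h.sum(axis=1) / N
    mu_new = (h @ x) / h.sum(axis=1, keepdims=True)
    sigma_new = np.empty_like(sigma)
    for s in range(n):
        centred = x - mu_new[s]
        sigma_new[s] = (h[s, :, None] * centred).T @ centred / h[s].sum()
    return w_new, mu_new, sigma_new
</syntaxhighlight>

Iterating <code>em_step</code> from some initial guess until the log-likelihood stops improving reproduces the scheme described above.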
Other common objections to the use of EM are that it has a propensity to spuriously identify local maxima, as well as displaying sensitivity to initial values.<ref name="McLachlan_2"/><ref name="botev2004global">{{Cite book |author1=Botev, Z.I. |author2=Kroese, D.P. |author-link2=Dirk Kroese |title=Proceedings of the 2004 Winter Simulation Conference, 2004 |chapter=Global Likelihood Optimization Via the Cross-Entropy Method, with an Application to Mixture Models |volume=1 |pages=517–523 |year=2004 |doi=10.1109/WSC.2004.1371358 |isbn=978-0-7803-8786-7 |citeseerx=10.1.1.331.2319 |s2cid=6880171 }}</ref> One may address these problems by evaluating EM at several initial points in the parameter space, but this is computationally costly and other approaches, such as the annealing EM method of Ueda and Nakano (1998) (in which the initial components are essentially forced to overlap, providing a less heterogeneous basis for initial guesses), may be preferable.

Figueiredo and Jain<ref name="Jain" /> note that convergence to 'meaningless' parameter values obtained at the boundary (where regularity conditions break down, e.g., Ghosh and Sen (1985)) is frequently observed when the number of model components exceeds the optimal/true one. On this basis they suggest a unified approach to estimation and identification in which the initial ''n'' is chosen to greatly exceed the expected optimal value. Their optimization routine is constructed via a minimum message length (MML) criterion that effectively eliminates a candidate component if there is insufficient information to support it. In this way it is possible to systematize reductions in ''n'' and consider estimation and identification jointly.

==== The expectation step ====
With initial guesses for the parameters of our mixture model, "partial membership" of each data point in each constituent distribution is computed by calculating [[expectation value]]s for the membership variables of each data point. That is, for each data point ''x<sub>j</sub>'' and distribution ''Y<sub>i</sub>'', the membership value ''y''<sub>''i'', ''j''</sub> is:

:<math> y_{i,j} = \frac{a_i f_Y(x_j;\theta_i)}{f_{X}(x_j)}.</math>

==== The maximization step ====
With expectation values in hand for group membership, [[plug-in estimates]] are recomputed for the distribution parameters. The mixing coefficients ''a<sub>i</sub>'' are the [[arithmetic mean|mean]]s of the membership values over the ''N'' data points.

:<math> a_i = \frac{1}{N}\sum_{j=1}^N y_{i,j}</math>

The component model parameters ''θ<sub>i</sub>'' are also calculated by expectation maximization using data points ''x<sub>j</sub>'' that have been weighted using the membership values. For example, if ''θ'' is a mean ''μ''

:<math> \mu_{i} = \frac{\sum_{j} y_{i,j}x_{j}}{\sum_{j} y_{i,j}}.</math>

With new estimates for ''a<sub>i</sub>'' and the ''θ<sub>i</sub>'''s, the expectation step is repeated to recompute new membership values. The entire procedure is repeated until model parameters converge.

=== Markov chain Monte Carlo ===
As an alternative to the EM algorithm, the mixture model parameters can be deduced using [[posterior sampling]] as indicated by [[Bayes' theorem]]. This is still regarded as an incomplete data problem in which membership of data points is the missing data. A two-step iterative procedure known as [[Gibbs sampling]] can be used.

The previous example of a mixture of two [[Gaussian distribution]]s can demonstrate how the method works. As before, initial guesses of the parameters for the mixture model are made. Instead of computing partial memberships for each elemental distribution, a membership value for each data point is drawn from a [[Bernoulli distribution]] (that is, it will be assigned to either the first or the second Gaussian). The Bernoulli parameter for each data point is the conditional probability, given the current parameter estimates, that the point belongs to one of the two constituent distributions. Draws from this distribution generate a membership association for each data point. Plug-in estimators can then be used as in the M step of EM to generate a new set of mixture model parameters, and the Bernoulli draw step repeated.
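This two-step procedure can be sketched as follows for a two-component univariate Gaussian mixture. It is an illustrative sketch under assumptions made here: the number of sweeps, the initial guesses and the simple plug-in standard deviations are arbitrary, and a full Gibbs sampler would additionally draw the parameters from their posterior distributions given priors.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def membership_sampling(x, n_sweeps=200):
    """Alternate Bernoulli membership draws and plug-in parameter estimates."""
    # initial guesses for the weights, means and standard deviations
    a = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    sigma = np.array([x.std(), x.std()])
    for _ in range(n_sweeps):
        # Bernoulli parameter: conditional probability of membership in component 1
        p0 = a[0] * normal_pdf(x, mu[0], sigma[0])
        p1 = a[1] * normal_pdf(x, mu[1], sigma[1])
        z = rng.random(x.size) < p1 / (p0 + p1)      # True -> assigned to component 1
        # plug-in estimates computed from the sampled memberships
        for k, members in enumerate((x[~z], x[z])):
            if members.size > 1:
                a[k] = members.size / x.size
                mu[k] = members.mean()
                sigma[k] = members.std()
    return a, mu, sigma
</syntaxhighlight>

For example, calling <code>membership_sampling</code> on data pooled from two well-separated normal samples should return weights, means and standard deviations close to those of the two original samples.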
=== Moment matching ===
The [[Method of moments (statistics)|method of moment matching]] is one of the oldest techniques for determining the mixture parameters, dating back to Karl Pearson's seminal work of 1894. In this approach the parameters of the mixture are determined such that the composite distribution has moments matching some given value. In many instances extraction of solutions to the moment equations may present non-trivial algebraic or computational problems. Moreover, numerical analysis by Day<ref name=day>{{Cite journal |last1=Day |first1=N. E. |title=Estimating the Components of a Mixture of Normal Distributions |journal=Biometrika |volume=56 |issue=3 |pages=463–474 |doi=10.2307/2334652 |jstor=2334652 |year=1969 }}</ref> has indicated that such methods may be inefficient compared to EM. Nonetheless, there has been renewed interest in this method, e.g., Craigmile and Titterington (1998) and Wang.<ref name=wang>{{citation |title=Generating daily changes in market variables using a multivariate mixture of normal distributions |first=J. |last=Wang |date=2001 |journal=Proceedings of the 33rd Winter Conference on Simulation |pages=283–289 }}</ref>

McWilliam and Loh (2009) consider the characterisation of a hyper-cuboid normal mixture [[copula (statistics)|copula]] in large dimensional systems for which EM would be computationally prohibitive. Here a pattern analysis routine is used to generate multivariate tail-dependencies consistent with a set of univariate and (in some sense) bivariate moments. The performance of this method is then evaluated using equity log-return data, with [[Kolmogorov–Smirnov]] test statistics suggesting a good descriptive fit.

=== Spectral method ===
Some problems in mixture model estimation can be solved using [[spectral method]]s. In particular, these become useful when the data points ''x<sub>i</sub>'' are points in high-dimensional [[real coordinate space|real space]] and the hidden distributions are known to be [[Logarithmically concave function|log-concave]] (such as the [[Gaussian distribution]] or the [[exponential distribution]]).

Spectral methods of learning mixture models are based on the use of the [[Singular Value Decomposition]] of a matrix which contains the data points. The idea is to consider the top ''k'' singular vectors, where ''k'' is the number of distributions to be learned. The projection of each data point onto the [[linear subspace]] spanned by those vectors groups points originating from the same distribution very close together, while points from different distributions stay far apart.

One distinctive feature of the spectral method is that it allows us to [[Mathematical proof|prove]] that if the distributions satisfy a certain separation condition (e.g., not too close), then the estimated mixture will be very close to the true one with high probability.
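A sketch of the projection step is given below. The centring of the data, the use of NumPy's SVD routine and the suggestion of ''k''-means for the final grouping are assumptions made here for illustration, not prescriptions of the method.

<syntaxhighlight lang="python">
import numpy as np

def spectral_project(x, k):
    """Project the data onto the span of the top-k right singular vectors.

    x: (N, d) matrix with one data point per row; k: number of mixture components.
    In the k-dimensional projection, points drawn from well-separated components
    tend to cluster around their component means.
    """
    xc = x - x.mean(axis=0)                      # centre the data (a choice made here)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ vt[:k].T                         # (N, k) coordinates in the top singular subspace

# One possible final step (not prescribed by the method itself) is to group the
# projected points with an off-the-shelf clustering routine, e.g.:
#   from scipy.cluster.vq import kmeans2
#   centroids, labels = kmeans2(spectral_project(x, k), k, minit="++")
</syntaxhighlight>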
=== Graphical methods ===
Tarter and Lock<ref name="tart" /> describe a graphical approach to mixture identification in which a kernel function is applied to an empirical frequency plot so as to reduce intra-component variance. In this way one may more readily identify components having differing means. While this ''λ''-method does not require prior knowledge of the number or functional form of the components, its success does rely on the choice of the kernel parameters, which to some extent implicitly embeds assumptions about the component structure.

=== Other methods ===
Some other methods can even provably learn mixtures of [[heavy-tailed distribution]]s, including those with infinite [[variance]] (see [[#Recent Papers|links to papers]] below). In this setting, EM-based methods would not work, since the expectation step would diverge due to the presence of [[outlier]]s.

=== A simulation ===
To simulate a sample of size ''N'' from a mixture of distributions ''F''<sub>''i''</sub>, ''i'' = 1 to ''n'', with probabilities ''p''<sub>''i''</sub> (where the ''p''<sub>''i''</sub> sum to 1):
# Generate ''N'' random numbers from a [[categorical distribution]] of size ''n'' and probabilities ''p''<sub>''i''</sub> for ''i'' = 1 to ''n''. These tell you which of the ''F''<sub>''i''</sub> each of the ''N'' values will come from. Denote by ''m<sub>i</sub>'' the quantity of random numbers assigned to the ''i''<sup>th</sup> category.
# For each ''i'', generate ''m<sub>i</sub>'' random numbers from the ''F''<sub>''i''</sub> distribution.
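For instance, these two steps might be carried out as follows. This is an illustrative sketch; the particular component distributions, sample size and random number generator are arbitrary choices made here.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(42)

def sample_mixture(N, p, component_samplers):
    """Draw N values from a mixture with weights p, using one sampler per component F_i."""
    # Step 1: categorical draws decide which component each value comes from;
    # the multinomial counts m[i] are the quantities denoted m_i above.
    m = rng.multinomial(N, p)
    # Step 2: draw m[i] values from each component distribution F_i and pool them.
    sample = np.concatenate([draw(m_i) for draw, m_i in zip(component_samplers, m)])
    rng.shuffle(sample)        # optional: remove the ordering by component
    return sample

# Example: a three-component mixture of two normal distributions and an exponential
x = sample_mixture(
    10_000,
    p=[0.5, 0.3, 0.2],
    component_samplers=[
        lambda n: rng.normal(0.0, 1.0, n),
        lambda n: rng.normal(5.0, 2.0, n),
        lambda n: rng.exponential(1.0, n),
    ],
)
</syntaxhighlight>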