==Structure==

===General mixture model===

A typical finite-dimensional mixture model is a [[hierarchical Bayes model|hierarchical model]] consisting of the following components:

*''N'' random variables that are observed, each distributed according to a mixture of ''K'' components, with the components belonging to the same [[parametric family]] of distributions (e.g., all [[normal distribution|normal]], all [[Zipf's law|Zipfian]], etc.) but with different parameters
*''N'' random [[latent variable]]s specifying the identity of the mixture component of each observation, each distributed according to a ''K''-dimensional [[categorical distribution]]
*A set of ''K'' mixture weights, which are probabilities that sum to 1
*A set of ''K'' parameters, each specifying the parameter of the corresponding mixture component. In many cases, each "parameter" is actually a set of parameters. For example, if the mixture components are [[Gaussian distribution]]s, there will be a [[mean]] and [[variance]] for each component. If the mixture components are [[categorical distribution]]s (e.g., when each observation is a token from a finite alphabet of size ''V''), there will be a vector of ''V'' probabilities summing to 1.

In addition, in a [[Bayesian inference|Bayesian setting]], the mixture weights and parameters will themselves be random variables, and [[prior distribution]]s will be placed over the variables. In such a case, the weights are typically viewed as a ''K''-dimensional random vector drawn from a [[Dirichlet distribution]] (the [[conjugate prior]] of the categorical distribution), and the parameters will be distributed according to their respective conjugate priors.

Mathematically, a basic parametric mixture model can be described as follows:

:<math>
\begin{array}{lcl}
K &=& \text{number of mixture components} \\
N &=& \text{number of observations} \\
\theta_{i=1 \dots K} &=& \text{parameter of distribution of observation associated with component } i \\
\phi_{i=1 \dots K} &=& \text{mixture weight, i.e., prior probability of a particular component } i \\
\boldsymbol\phi &=& K\text{-dimensional vector composed of all the individual } \phi_{1 \dots K} \text{; must sum to 1} \\
z_{i=1 \dots N} &=& \text{component of observation } i \\
x_{i=1 \dots N} &=& \text{observation } i \\
F(x|\theta) &=& \text{probability distribution of an observation, parametrized on } \theta \\
z_{i=1 \dots N} &\sim& \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N}|z_{i=1 \dots N} &\sim& F(\theta_{z_i})
\end{array}
</math>

In a Bayesian setting, all parameters are associated with random variables, as follows:

:<math>
\begin{array}{lcl}
K,N &=& \text{as above} \\
\theta_{i=1 \dots K}, \phi_{i=1 \dots K}, \boldsymbol\phi &=& \text{as above} \\
z_{i=1 \dots N}, x_{i=1 \dots N}, F(x|\theta) &=& \text{as above} \\
\alpha &=& \text{shared hyperparameter for component parameters} \\
\beta &=& \text{shared hyperparameter for mixture weights} \\
H(\theta|\alpha) &=& \text{prior probability distribution of component parameters, parametrized on } \alpha \\
\theta_{i=1 \dots K} &\sim& H(\theta|\alpha) \\
\boldsymbol\phi &\sim& \operatorname{Symmetric-Dirichlet}_K(\beta) \\
z_{i=1 \dots N}|\boldsymbol\phi &\sim& \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N}|z_{i=1 \dots N},\theta_{i=1 \dots K} &\sim& F(\theta_{z_i})
\end{array}
</math>

This characterization uses ''F'' and ''H'' to describe arbitrary distributions over observations and parameters, respectively.
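The two-stage generative process above translates directly into code. The following is a minimal sketch in Python with NumPy, taking ''F'' to be a Poisson distribution purely for illustration; all parameter values are invented for the example:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(seed=0)

K, N = 3, 1000                      # number of mixture components and observations
phi = np.array([0.5, 0.3, 0.2])     # mixture weights; must sum to 1
theta = np.array([1.0, 5.0, 20.0])  # one Poisson rate per component (the role of theta_i)

# z_i ~ Categorical(phi): choose a component for each observation
z = rng.choice(K, size=N, p=phi)

# x_i | z_i ~ F(theta_{z_i}): draw each observation from its assigned component
x = rng.poisson(theta[z])
</syntaxhighlight>

Marginally, each <math>x_i</math> is then distributed according to the mixture density <math>\textstyle\sum_{i=1}^K \phi_i F(x|\theta_i)</math>.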
Typically ''H'' will be the [[conjugate prior]] of ''F''. The two most common choices of ''F'' are [[Gaussian distribution|Gaussian]] aka "[[normal distribution|normal]]" (for real-valued observations) and [[categorical distribution|categorical]] (for discrete observations). Other common possibilities for the distribution of the mixture components are:

*[[Binomial distribution]], for the number of "positive occurrences" (e.g., successes, yes votes, etc.) given a fixed number of total occurrences
*[[Multinomial distribution]], similar to the binomial distribution, but for counts of multi-way occurrences (e.g., yes/no/maybe in a survey)
*[[Negative binomial distribution]], for binomial-type observations but where the quantity of interest is the number of failures before a given number of successes occurs
*[[Poisson distribution]], for the number of occurrences of an event in a given period of time, for an event that is characterized by a fixed rate of occurrence
*[[Exponential distribution]], for the time before the next event occurs, for an event that is characterized by a fixed rate of occurrence
*[[Log-normal distribution]], for positive real numbers that are assumed to grow exponentially, such as incomes or prices
*[[Multivariate normal distribution]] (aka multivariate Gaussian distribution), for vectors of correlated outcomes that are individually Gaussian-distributed
*[[multivariate t-distribution|Multivariate Student's ''t''-distribution]], for vectors of heavy-tailed correlated outcomes<ref>{{cite journal |first1=Sotirios P. |last1=Chatzis |first2=Dimitrios I. |last2=Kosmopoulos |first3=Theodora A. |last3=Varvarigou |title=Signal Modeling and Classification Using a Robust Latent Space Model Based on t Distributions |journal=IEEE Transactions on Signal Processing |volume=56 |issue=3 |pages=949–963 |year=2008 |doi=10.1109/TSP.2007.907912 |bibcode=2008ITSP...56..949C |s2cid=15583243}}</ref>
*A vector of [[Bernoulli distribution|Bernoulli]]-distributed values, corresponding, e.g., to a black-and-white image, with each value representing a pixel; see the handwriting-recognition example below

===Specific examples===

====Gaussian mixture model====

[[File:nonbayesian-gaussian-mixture.svg|right|250px|thumb|Non-Bayesian Gaussian mixture model using [[plate notation]]. Smaller squares indicate fixed parameters; larger circles indicate random variables. Filled-in shapes indicate known values. The indication [K] means a vector of size ''K''.]]

A typical non-Bayesian [[Gaussian distribution|Gaussian]] mixture model looks like this:

:<math>
\begin{array}{lcl}
K,N &=& \text{as above} \\
\phi_{i=1 \dots K}, \boldsymbol\phi &=& \text{as above} \\
z_{i=1 \dots N}, x_{i=1 \dots N} &=& \text{as above} \\
\theta_{i=1 \dots K} &=& \{ \mu_{i=1 \dots K}, \sigma^2_{i=1 \dots K} \} \\
\mu_{i=1 \dots K} &=& \text{mean of component } i \\
\sigma^2_{i=1 \dots K} &=& \text{variance of component } i \\
z_{i=1 \dots N} &\sim& \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N} &\sim& \mathcal{N}(\mu_{z_i}, \sigma^2_{z_i})
\end{array}
</math>

{{clear}}
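As a concrete instance of the general sketch above, this model can be simulated and its marginal density <math>\textstyle p(x) = \sum_{i=1}^K \phi_i \mathcal{N}(x; \mu_i, \sigma^2_i)</math> evaluated directly; a minimal sketch with invented parameter values (Python with NumPy assumed):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(seed=0)

K, N = 2, 500
phi = np.array([0.6, 0.4])        # mixture weights
mu = np.array([-1.0, 4.0])        # component means
sigma2 = np.array([1.0, 2.25])    # component variances

z = rng.choice(K, size=N, p=phi)            # z_i ~ Categorical(phi)
x = rng.normal(mu[z], np.sqrt(sigma2[z]))   # x_i ~ N(mu_{z_i}, sigma^2_{z_i})

def mixture_pdf(t):
    """Marginal density: sum_i phi_i * N(t; mu_i, sigma2_i), for an array t."""
    comps = np.exp(-(t[:, None] - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return comps @ phi

print(mixture_pdf(np.array([0.0, 4.0])))
</syntaxhighlight>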
[[File:bayesian-gaussian-mixture.svg|right|300px|thumb|Bayesian Gaussian mixture model using [[plate notation]]. Smaller squares indicate fixed parameters; larger circles indicate random variables. Filled-in shapes indicate known values. The indication [K] means a vector of size ''K''.]]

A Bayesian version of a [[Gaussian distribution|Gaussian]] mixture model is as follows:

:<math>
\begin{array}{lcl}
K,N &=& \text{as above} \\
\phi_{i=1 \dots K}, \boldsymbol\phi &=& \text{as above} \\
z_{i=1 \dots N}, x_{i=1 \dots N} &=& \text{as above} \\
\theta_{i=1 \dots K} &=& \{ \mu_{i=1 \dots K}, \sigma^2_{i=1 \dots K} \} \\
\mu_{i=1 \dots K} &=& \text{mean of component } i \\
\sigma^2_{i=1 \dots K} &=& \text{variance of component } i \\
\mu_0, \lambda, \nu, \sigma_0^2, \beta &=& \text{shared hyperparameters} \\
\mu_{i=1 \dots K} &\sim& \mathcal{N}(\mu_0, \lambda\sigma_i^2) \\
\sigma_{i=1 \dots K}^2 &\sim& \operatorname{Inverse-Gamma}(\nu, \sigma_0^2) \\
\boldsymbol\phi &\sim& \operatorname{Symmetric-Dirichlet}_K(\beta) \\
z_{i=1 \dots N} &\sim& \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N} &\sim& \mathcal{N}(\mu_{z_i}, \sigma^2_{z_i})
\end{array}
</math>

[[File:Parameter estimation process infinite Gaussian mixture model.webm|thumb|end=49|Animation of the clustering process for one-dimensional data using a Bayesian Gaussian mixture model where normal distributions are drawn from a [[Dirichlet process]]. The histograms of the clusters are shown in different colours. During the parameter estimation process, new clusters are created and grow on the data. The legend shows the cluster colours and the number of datapoints assigned to each cluster.]]

====Multivariate Gaussian mixture model====

A Bayesian Gaussian mixture model is commonly extended to fit a vector of unknown parameters (denoted in bold), or multivariate normal distributions. In a multivariate distribution (i.e. one modelling a vector <math>\boldsymbol{x}</math> with ''N'' random variables) one may model a vector of parameters (such as several observations of a signal or patches within an image) using a Gaussian mixture model prior distribution on the vector of estimates given by
<math display="block"> p(\boldsymbol{\theta}) = \sum_{i=1}^K \phi_i \mathcal{N}(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i) </math>
where the ''i''<sup>th</sup> vector component is characterized by normal distributions with weights <math>\phi_i</math>, means <math>\boldsymbol{\mu}_i</math> and covariance matrices <math>\boldsymbol{\Sigma}_i</math>. To incorporate this prior into a Bayesian estimation, the prior is multiplied with the known distribution <math>p(\boldsymbol{x} \mid \boldsymbol{\theta})</math> of the data <math>\boldsymbol{x}</math> conditioned on the parameters <math>\boldsymbol{\theta}</math> to be estimated. With this formulation, the [[Posterior probability|posterior distribution]] <math>p(\boldsymbol{\theta} \mid \boldsymbol{x})</math> is ''also'' a Gaussian mixture model of the form
<math display="block"> p(\boldsymbol{\theta} \mid \boldsymbol{x}) = \sum_{i=1}^K \tilde{\phi}_i \mathcal{N}(\boldsymbol{\tilde{\mu}}_i, \boldsymbol{\tilde{\Sigma}}_i) </math>
with new parameters <math>\tilde{\phi}_i, \boldsymbol{\tilde{\mu}}_i</math> and <math>\boldsymbol{\tilde{\Sigma}}_i</math> that are updated using the [[Expectation-maximization algorithm|EM algorithm]].<ref>{{cite journal |last=Yu |first=Guoshen |title=Solving Inverse Problems with Piecewise Linear Estimators: From Gaussian Mixture Models to Structured Sparsity |journal=IEEE Transactions on Image Processing |volume=21 |issue=5 |pages=2481–2499 |date=2012 |doi=10.1109/tip.2011.2176743 |pmid=22180506 |bibcode=2012ITIP...21.2481G |arxiv=1006.3056 |s2cid=479845}}</ref>
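Maximum-likelihood EM updates for a multivariate Gaussian mixture are implemented in standard libraries. The following sketch fits a two-component mixture to synthetic 2-D data using scikit-learn's <code>GaussianMixture</code>; this is a generic EM fit for illustration, not the patch-based Bayesian estimator of the cited work, and all values are invented:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(seed=0)

# Synthetic 2-D data drawn from two well-separated Gaussian clusters
X = np.vstack([
    rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=300),
    rng.multivariate_normal([5, 5], [[1.5, -0.4], [-0.4, 0.8]], size=200),
])

# Fit a K=2 multivariate Gaussian mixture by EM
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

print(gmm.weights_)      # estimated mixture weights phi_i
print(gmm.means_)        # estimated means mu_i
print(gmm.covariances_)  # estimated covariance matrices Sigma_i
</syntaxhighlight>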
Although EM-based parameter updates are well-established, providing the initial estimates for these parameters is currently an area of active research. Note that this formulation yields a closed-form solution to the complete posterior distribution. Estimates of the random variable <math>\boldsymbol{\theta}</math> may be obtained via one of several estimators, such as the mean or maximum of the posterior distribution.

Such distributions are useful for modeling patch-wise shapes of images and clusters, for example. In the case of image representation, each Gaussian may be tilted, expanded, and warped according to the covariance matrices <math>\boldsymbol{\Sigma}_i</math>. One Gaussian distribution of the set is fit to each patch (usually of size 8×8 pixels) in the image. Notably, any distribution of points around a cluster (see [[K-means clustering|''k''-means]]) may be accurately modeled given enough Gaussian components, but rarely more than ''K''=20 components are needed to accurately model a given image distribution or cluster of data.

====Categorical mixture model====

[[File:nonbayesian-categorical-mixture.svg|right|250px|thumb|Non-Bayesian categorical mixture model using [[plate notation]]. Smaller squares indicate fixed parameters; larger circles indicate random variables. Filled-in shapes indicate known values. The indication [K] means a vector of size ''K''; likewise for [V].]]

A typical non-Bayesian mixture model with [[categorical distribution|categorical]] observations looks like this:

*<math>K,N:</math> as above
*<math>\phi_{i=1 \dots K}, \boldsymbol\phi:</math> as above
*<math>z_{i=1 \dots N}, x_{i=1 \dots N}:</math> as above
*<math>V:</math> dimension of categorical observations, e.g., size of word vocabulary
*<math>\theta_{i=1 \dots K, j=1 \dots V}:</math> probability for component <math>i</math> of observing item <math>j</math>
*<math>\boldsymbol\theta_{i=1 \dots K}:</math> vector of dimension <math>V,</math> composed of <math>\theta_{i,1 \dots V};</math> must sum to 1

The random variables:

:<math>
\begin{array}{lcl}
z_{i=1 \dots N} &\sim& \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N} &\sim& \operatorname{Categorical}(\boldsymbol\theta_{z_i})
\end{array}
</math>

{{clear}}

[[File:bayesian-categorical-mixture.svg|right|300px|thumb|Bayesian categorical mixture model using [[plate notation]]. Smaller squares indicate fixed parameters; larger circles indicate random variables. Filled-in shapes indicate known values. The indication [K] means a vector of size ''K''; likewise for [V].]]
A typical Bayesian mixture model with [[categorical distribution|categorical]] observations looks like this:

*<math>K,N:</math> as above
*<math>\phi_{i=1 \dots K}, \boldsymbol\phi:</math> as above
*<math>z_{i=1 \dots N}, x_{i=1 \dots N}:</math> as above
*<math>V:</math> dimension of categorical observations, e.g., size of word vocabulary
*<math>\theta_{i=1 \dots K, j=1 \dots V}:</math> probability for component <math>i</math> of observing item <math>j</math>
*<math>\boldsymbol\theta_{i=1 \dots K}:</math> vector of dimension <math>V,</math> composed of <math>\theta_{i,1 \dots V};</math> must sum to 1
*<math>\alpha:</math> shared concentration hyperparameter of <math>\boldsymbol\theta</math> for each component
*<math>\beta:</math> concentration hyperparameter of <math>\boldsymbol\phi</math>

The random variables:

:<math>
\begin{array}{lcl}
\boldsymbol\phi &\sim& \operatorname{Symmetric-Dirichlet}_K(\beta) \\
\boldsymbol\theta_{i=1 \dots K} &\sim& \operatorname{Symmetric-Dirichlet}_V(\alpha) \\
z_{i=1 \dots N} &\sim& \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N} &\sim& \operatorname{Categorical}(\boldsymbol\theta_{z_i})
\end{array}
</math>
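This full Bayesian generative process, with weights and per-component category probabilities drawn from symmetric Dirichlet priors followed by assignments and observations, can be simulated directly; a minimal sketch with invented hyperparameter values (Python with NumPy assumed):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(seed=0)

K, N, V = 3, 1000, 10    # components, observations, vocabulary size
alpha, beta = 0.5, 1.0   # symmetric Dirichlet concentration hyperparameters

phi = rng.dirichlet(np.full(K, beta))             # phi ~ Symmetric-Dirichlet_K(beta)
theta = rng.dirichlet(np.full(V, alpha), size=K)  # theta_i ~ Symmetric-Dirichlet_V(alpha), one row per component

z = rng.choice(K, size=N, p=phi)                        # z_i ~ Categorical(phi)
x = np.array([rng.choice(V, p=theta[zi]) for zi in z])  # x_i ~ Categorical(theta_{z_i})
</syntaxhighlight>

Smaller concentration values such as <math>\alpha = 0.5</math> yield sparser per-component distributions, so each component concentrates its probability mass on fewer of the <math>V</math> items.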