===General mixture model===
A typical finite-dimensional mixture model is a [[hierarchical Bayes model|hierarchical model]] consisting of the following components:
*''N'' random variables that are observed, each distributed according to a mixture of ''K'' components, with the components belonging to the same [[parametric family]] of distributions (e.g., all [[normal distribution|normal]], all [[Zipf's law|Zipfian]], etc.) but with different parameters
*''N'' random [[latent variable]]s specifying the identity of the mixture component of each observation, each distributed according to a ''K''-dimensional [[categorical distribution]]
*A set of ''K'' mixture weights, which are probabilities that sum to 1
*A set of ''K'' parameters, each specifying the parameter of the corresponding mixture component. In many cases, each "parameter" is actually a set of parameters. For example, if the mixture components are [[Gaussian distribution]]s, there will be a [[mean]] and [[variance]] for each component. If the mixture components are [[categorical distribution]]s (e.g., when each observation is a token from a finite alphabet of size ''V''), there will be a vector of ''V'' probabilities summing to 1.

In addition, in a [[Bayesian inference|Bayesian setting]], the mixture weights and parameters will themselves be random variables, and [[prior distribution]]s will be placed over the variables. In such a case, the weights are typically viewed as a ''K''-dimensional random vector drawn from a [[Dirichlet distribution]] (the [[conjugate prior]] of the categorical distribution), and the parameters will be distributed according to their respective conjugate priors.
Mathematically, a basic parametric mixture model can be described as follows:

:<math>
\begin{array}{lcl}
K &=& \text{number of mixture components} \\
N &=& \text{number of observations} \\
\theta_{i=1 \dots K} &=& \text{parameter of distribution of observation associated with component } i \\
\phi_{i=1 \dots K} &=& \text{mixture weight, i.e., prior probability of a particular component } i \\
\boldsymbol\phi &=& K\text{-dimensional vector composed of all the individual } \phi_{1 \dots K} \text{; must sum to 1} \\
z_{i=1 \dots N} &=& \text{component of observation } i \\
x_{i=1 \dots N} &=& \text{observation } i \\
F(x|\theta) &=& \text{probability distribution of an observation, parametrized on } \theta \\
z_{i=1 \dots N} &\sim& \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N}|z_{i=1 \dots N} &\sim& F(\theta_{z_i})
\end{array}
</math>

In a Bayesian setting, all parameters are associated with random variables, as follows:

:<math>
\begin{array}{lcl}
K,N &=& \text{as above} \\
\theta_{i=1 \dots K}, \phi_{i=1 \dots K}, \boldsymbol\phi &=& \text{as above} \\
z_{i=1 \dots N}, x_{i=1 \dots N}, F(x|\theta) &=& \text{as above} \\
\alpha &=& \text{shared hyperparameter for component parameters} \\
\beta &=& \text{shared hyperparameter for mixture weights} \\
H(\theta|\alpha) &=& \text{prior probability distribution of component parameters, parametrized on } \alpha \\
\theta_{i=1 \dots K} &\sim& H(\theta|\alpha) \\
\boldsymbol\phi &\sim& \operatorname{Symmetric-Dirichlet}_K(\beta) \\
z_{i=1 \dots N}|\boldsymbol\phi &\sim& \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N}|z_{i=1 \dots N},\theta_{i=1 \dots K} &\sim& F(\theta_{z_i})
\end{array}
</math>

This characterization uses ''F'' and ''H'' to describe arbitrary distributions over observations and parameters, respectively. Typically ''H'' will be the [[conjugate prior]] of ''F''.
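The Bayesian generative process above can be sketched in a few lines of Python. For illustration only, this sketch assumes ''F'' is a normal distribution with unit variance and ''H'' is a normal prior on the component means; the hyperparameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

K, N = 3, 10          # number of mixture components and observations
alpha = (0.0, 10.0)   # assumed hyperparameters of H: mean and std of a normal prior
beta = 1.0            # symmetric Dirichlet concentration for the mixture weights

# theta_i ~ H(theta | alpha): here H is taken to be normal (an assumption)
theta = rng.normal(alpha[0], alpha[1], size=K)

# phi ~ Symmetric-Dirichlet_K(beta): K-dimensional weight vector summing to 1
phi = rng.dirichlet(np.full(K, beta))

# z_i | phi ~ Categorical(phi): latent component of each observation
z = rng.choice(K, size=N, p=phi)

# x_i | z_i, theta ~ F(theta_{z_i}): here F is normal with unit variance (an assumption)
x = rng.normal(theta[z], 1.0)
```

Each observation `x[i]` is generated by first drawing its component label `z[i]` and then sampling from that component's distribution, which is exactly the two-stage sampling the formulas describe.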
The two most common choices of ''F'' are [[Gaussian distribution|Gaussian]] aka "[[normal distribution|normal]]" (for real-valued observations) and [[categorical distribution|categorical]] (for discrete observations). Other common possibilities for the distribution of the mixture components are:
*[[Binomial distribution]], for the number of "positive occurrences" (e.g., successes, yes votes, etc.) given a fixed number of total occurrences
*[[Multinomial distribution]], similar to the binomial distribution, but for counts of multi-way occurrences (e.g., yes/no/maybe in a survey)
*[[Negative binomial distribution]], for binomial-type observations but where the quantity of interest is the number of failures before a given number of successes occurs
*[[Poisson distribution]], for the number of occurrences of an event in a given period of time, for an event that is characterized by a fixed rate of occurrence
*[[Exponential distribution]], for the time before the next event occurs, for an event that is characterized by a fixed rate of occurrence
*[[Log-normal distribution]], for positive real numbers that are assumed to grow exponentially, such as incomes or prices
*[[Multivariate normal distribution]] (aka multivariate Gaussian distribution), for vectors of correlated outcomes that are individually Gaussian-distributed
*[[multivariate t-distribution|Multivariate Student's ''t''-distribution]], for vectors of heavy-tailed correlated outcomes<ref>{{cite journal |first1=Sotirios P. |last1=Chatzis |first2=Dimitrios I. |last2=Kosmopoulos |first3=Theodora A. |last3=Varvarigou |title=Signal Modeling and Classification Using a Robust Latent Space Model Based on t Distributions |journal=IEEE Transactions on Signal Processing |volume=56 |issue=3 |pages=949–963 |year=2008 |doi=10.1109/TSP.2007.907912 |bibcode=2008ITSP...56..949C |s2cid=15583243 }}</ref>
*A vector of [[Bernoulli distribution|Bernoulli]]-distributed values, corresponding, e.g., to a black-and-white image, with each value representing a pixel; see the handwriting-recognition example below
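As a sketch of the last case, a mixture of Bernoulli vectors (e.g., tiny black-and-white images, one Bernoulli value per pixel) can be sampled as follows; the sizes, weights, and pixel probabilities are arbitrary values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

K, D, N = 2, 16, 5                     # components, pixels per image, number of images
phi = np.array([0.4, 0.6])             # mixture weights; must sum to 1
theta = rng.uniform(0.1, 0.9, (K, D))  # per-component probability that each pixel is "on"

z = rng.choice(K, size=N, p=phi)       # latent component of each image
images = rng.binomial(1, theta[z])     # each pixel drawn as Bernoulli(theta[z_i, d])
```

Here each mixture component is parametrized by a vector of ''D'' pixel probabilities rather than a single scalar, illustrating the earlier point that a component "parameter" is often itself a set of parameters.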