=== Model-based clustering ===
The clustering framework most closely related to statistics is [[model-based clustering]], which is based on [[Probability distribution|distribution models]]. This approach models the data as arising from a mixture of probability distributions. Its main advantage is that it provides principled statistical answers to questions such as how many clusters there are, which clustering method or model to use, and how to detect and deal with outliers.

While the theoretical foundation of these methods is strong, they are prone to [[overfitting]] unless the model complexity is constrained. A more complex model can usually explain the data better, which makes choosing an appropriate model complexity inherently difficult. Standard model-based clustering methods include more parsimonious models based on the [[eigenvalue decomposition]] of the covariance matrices, which provide a balance between overfitting and fidelity to the data.
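
One common way to trade goodness of fit against model complexity is a penalized-likelihood criterion such as the [[Bayesian information criterion]] (BIC). The formula below is an illustrative sketch, not a prescription of this section. The symbols <math>k</math> (number of free parameters), <math>n</math> (number of observations), <math>\hat{L}</math> (maximized likelihood), <math>G</math> (number of mixture components) and <math>d</math> (number of attributes) are introduced here for the example. Under this sign convention lower values are preferred; some software reports the negated quantity, so that higher values are preferred.

<math display="block">\mathrm{BIC} = k \ln n - 2 \ln \hat{L}</math>

For a Gaussian mixture with unconstrained covariance matrices, the parameter count would be

<math display="block">k = (G - 1) + G d + G \, \frac{d(d+1)}{2},</math>

counting the mixing proportions, the means and the covariance entries respectively. The covariance term grows quadratically in <math>d</math>, which is why constraining the covariance structure can reduce the penalty substantially.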

One prominent method is the Gaussian mixture model, typically fitted with the [[expectation-maximization algorithm]]. Here, the data set is usually modeled with a fixed number of [[Gaussian distribution]]s (fixed to avoid overfitting) that are initialized randomly and whose parameters are iteratively optimized to better fit the data set. The algorithm converges to a [[local optimum]], so multiple runs may produce different results. To obtain a hard clustering, each object is usually assigned to the Gaussian distribution to which it most likely belongs; for soft clusterings, this step is not necessary.
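
The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions: one-dimensional data, a fixed number of components, and only the standard library. It is not a reference implementation. The function names <code>fit_gmm_1d</code>, <code>responsibilities</code> and <code>hard_labels</code>, and the parameters <code>init_means</code> and <code>min_var</code>, are hypothetical names chosen for this example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibilities(data, weights, means, variances):
    """E-step: posterior probability of each component for each data point."""
    k = len(weights)
    resp = []
    for x in data:
        dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(dens)
        if total == 0.0:
            # All densities underflowed; fall back to equal responsibilities.
            resp.append([1.0 / k] * k)
        else:
            resp.append([d / total for d in dens])
    return resp


def fit_gmm_1d(data, k, n_iter=100, seed=None, init_means=None, min_var=1e-6):
    """Fit a k-component 1-D Gaussian mixture by EM.

    Returns (weights, means, variances). Assumption for this sketch: unless
    init_means is given, the starting means are k randomly chosen data points.
    """
    rng = random.Random(seed)
    n = len(data)
    means = list(init_means) if init_means is not None else rng.sample(data, k)
    grand_mean = sum(data) / n
    grand_var = max(sum((x - grand_mean) ** 2 for x in data) / n, min_var)
    variances = [grand_var] * k  # start every component at the overall variance
    weights = [1.0 / k] * k      # start with equal mixing proportions

    for _ in range(n_iter):
        resp = responsibilities(data, weights, means, variances)  # E-step
        for j in range(k):                                        # M-step
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # effectively empty component: keep its old parameters
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, min_var)  # floor avoids zero-variance spikes
        total_w = sum(weights)
        weights = [w / total_w for w in weights]  # keep proportions summing to 1
    return weights, means, variances


def hard_labels(resp):
    """Hard clustering: index of the most probable component for each point."""
    return [max(range(len(r)), key=r.__getitem__) for r in resp]


if __name__ == "__main__":
    rng = random.Random(42)
    sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
    sample += [rng.gauss(8.0, 1.5) for _ in range(200)]
    weights, means, variances = fit_gmm_1d(sample, k=2, seed=1)
    soft = responsibilities(sample, weights, means, variances)  # soft clustering
    labels = hard_labels(soft)                                  # hard clustering
    print(weights, means, variances)
```

The responsibilities returned by the E-step already constitute a soft clustering; <code>hard_labels</code> performs the optional final assignment. Because the starting means are random, different values of <code>seed</code> can lead to different local optima or to slow convergence. A common remedy is to run the fit several times and keep the solution with the highest likelihood.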

Distribution-based clustering produces complex models for clusters that can capture [[correlation and dependence]] between attributes. However, these algorithms put an extra burden on the user, who must choose a suitable model: for many real data sets, there may be no concisely defined mathematical model (for example, assuming Gaussian distributions is a rather strong assumption about the data).

<gallery widths="200" heights="200" caption="Gaussian mixture model clustering examples">
File:EM-Gaussian-data.svg|On Gaussian-distributed data, <abbr title="expectation–maximization">EM</abbr> works well, since it uses Gaussians for modelling clusters.
File:EM-density-data.svg|Density-based clusters cannot be modeled using Gaussian distributions.
</gallery>