Cluster analysis
=== Model-based clustering ===
The clustering framework most closely related to statistics is [[model-based clustering]], which is based on [[Probability distribution|distribution models]]. This approach models the data as arising from a mixture of probability distributions. It has the advantage of providing principled statistical answers to questions such as how many clusters there are, what clustering method or model to use, and how to detect and deal with outliers.

While the theoretical foundation of these methods is excellent, they suffer from [[overfitting]] unless constraints are put on the model complexity. A more complex model will usually be able to explain the data better, which makes choosing the appropriate model complexity inherently difficult. Standard [[model-based clustering]] methods include more parsimonious models based on the [[eigenvalue decomposition]] of the covariance matrices, which provide a balance between overfitting and fidelity to the data.

One prominent method is known as Gaussian mixture models (using the [[expectation-maximization algorithm]]). Here, the data set is usually modeled with a fixed number of [[Gaussian distribution]]s (fixed to avoid overfitting) that are initialized randomly and whose parameters are iteratively optimized to better fit the data set. The procedure converges to a [[local optimum]], so multiple runs may produce different results. To obtain a hard clustering, objects are often then assigned to the Gaussian distribution they most likely belong to; for soft clusterings, this is not necessary.

Distribution-based clustering produces complex models for clusters that can capture [[correlation and dependence]] between attributes. However, these algorithms put an extra burden on the user: for many real data sets, there may be no concisely defined mathematical model (for example, assuming Gaussian distributions is a rather strong assumption about the data).
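
As a sketch of the Gaussian mixture model described in this section, using notation chosen here for illustration only (<math>K</math> components with mixing weights <math>\pi_k</math>, means <math>\mu_k</math> and covariance matrices <math>\Sigma_k</math>), the data density can be written as

<math display="block">p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1.</math>

A soft clustering keeps the posterior membership probabilities <math>\gamma_{ik}</math> of each object <math>x_i</math>, while a hard clustering assigns each object to its most probable component:

<math display="block">\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}, \qquad z_i = \arg\max_{k} \gamma_{ik}.</math>

One common way to obtain the more parsimonious models mentioned above is to decompose each covariance matrix as

<math display="block">\Sigma_k = \lambda_k D_k A_k D_k^{\mathsf{T}},</math>

where <math>\lambda_k > 0</math> controls the volume of the cluster, the orthogonal matrix <math>D_k</math> of eigenvectors controls its orientation, and the diagonal matrix <math>A_k</math> (with <math>\det A_k = 1</math>) controls its shape. Constraining some of these factors to be equal across components, or fixing <math>A_k</math> to the identity, reduces the number of free parameters.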
<gallery widths="200" heights="200" caption="Gaussian mixture model clustering examples">
File:EM-Gaussian-data.svg|On Gaussian-distributed data, <abbr title="expectation–maximization">EM</abbr> works well, since it uses Gaussians for modelling clusters.
File:EM-density-data.svg|Density-based clusters cannot be modeled using Gaussian distributions.
</gallery>
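
The iterative fitting procedure for Gaussian mixture models can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not a reference implementation: the function names <code>log_gaussian</code> and <code>fit_gmm</code>, the regularization term <code>reg</code>, the fixed iteration count and the synthetic two-cluster data are all hypothetical choices made for this example. The component means are initialized at randomly chosen data points. Because the procedure only reaches a local optimum, the run with the highest log-likelihood out of several random restarts is kept.

```python
import numpy as np


def log_gaussian(X, mean, cov):
    """Log density of N(mean, cov) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    chol = np.linalg.cholesky(cov)
    # Solve chol @ z = diff.T; the column sums of z**2 are the squared
    # Mahalanobis distances.
    z = np.linalg.solve(chol, diff.T)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + np.sum(z**2, axis=0))


def fit_gmm(X, k, n_iter=100, reg=1e-6, seed=0):
    """Fit a k-component Gaussian mixture to X with plain EM (illustrative)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Random initialization: k distinct data points serve as initial means.
    means = X[rng.choice(n, size=k, replace=False)].copy()
    base_cov = np.atleast_2d(np.cov(X.T)) + reg * np.eye(d)
    covs = np.array([base_cov.copy() for _ in range(k)])
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: posterior membership probabilities (responsibilities).
        log_p = np.stack(
            [np.log(weights[j]) + log_gaussian(X, means[j], covs[j])
             for j in range(k)],
            axis=1,
        )
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)

        # M-step: re-estimate weights, means and covariances.
        nk = resp.sum(axis=0) + 1e-12  # small guard against empty components
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)

    # Log-likelihood under the parameters used in the final E-step.
    loglik = float(log_norm.sum())
    return weights, means, covs, resp, loglik


# Synthetic example data (an assumption of this sketch): two Gaussian blobs.
data_rng = np.random.default_rng(42)
X = np.vstack([
    data_rng.normal([0.0, 0.0], 0.5, size=(150, 2)),
    data_rng.normal([5.0, 5.0], 0.5, size=(150, 2)),
])

# Several random restarts; keep the run with the highest log-likelihood.
runs = [fit_gmm(X, k=2, seed=s) for s in range(5)]
weights, means, covs, resp, loglik = max(runs, key=lambda r: r[4])

# Hard clustering: assign each object to its most probable component.
labels = resp.argmax(axis=1)
```

The array <code>resp</code> is itself a soft clustering; the final <code>argmax</code> step is only needed when a hard clustering is required. The sketch uses unconstrained covariance matrices, so it does not include the parsimonious eigenvalue-based constraints discussed in this section.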