== Definition ==
The notion of a "cluster" cannot be precisely defined, which is one of the reasons why there are so many clustering algorithms.<ref name="estivill">{{cite journal | title=Why so many clustering algorithms – A Position Paper | last = Estivill-Castro | first = Vladimir | s2cid = 7329935 | journal=ACM SIGKDD Explorations Newsletter |date=20 June 2002 | volume= 4 | issue=1 | pages=65–75 | doi=10.1145/568574.568575}}</ref> There is a common denominator: a group of data objects. However, different researchers employ different cluster models, and for each of these cluster models different algorithms can again be given. The notion of a cluster, as found by different algorithms, varies significantly in its properties. Understanding these "cluster models" is key to understanding the differences between the various algorithms. Typical cluster models include:
* ''{{vanchor|Connectivity model}}s'': for example, [[hierarchical clustering]] builds models based on distance connectivity.
* ''{{vanchor|Centroid model}}s'': for example, the [[k-means algorithm]] represents each cluster by a single mean vector.
* ''{{vanchor|Distribution model}}s'': clusters are modeled using statistical distributions, such as the [[multivariate normal distribution]]s used by the [[expectation-maximization algorithm]].
* ''{{vanchor|Density model}}s'': for example, [[DBSCAN]] and [[OPTICS]] define clusters as connected dense regions in the data space.
* ''{{vanchor|Subspace model}}s'': in [[biclustering]] (also known as co-clustering or two-mode clustering), clusters are modeled with both cluster members and relevant attributes.
* ''{{vanchor|Group model}}s'': some algorithms do not provide a refined model for their results and just provide the grouping information.
* ''{{vanchor|Graph-based model}}s'': a [[Clique (graph theory)|clique]], that is, a subset of nodes in a [[Graph (discrete mathematics)|graph]] such that every two nodes in the subset are connected by an edge, can be considered a prototypical form of cluster. Relaxations of the complete connectivity requirement (a fraction of the edges can be missing) are known as quasi-cliques, as in the [[HCS clustering algorithm]].
* ''Signed graph models'': every [[path (graph theory)|path]] in a [[signed graph]] has a [[sign (mathematics)|sign]] given by the product of the signs on its edges. Under the assumptions of [[balance theory]], edges may change sign and result in a bifurcated graph. The weaker "clusterability axiom" (no [[cycle (graph theory)|cycle]] has exactly one negative edge) yields results with more than two clusters, or subgraphs with only positive edges.<ref>[[James A. Davis]] (May 1967) "Clustering and structural balance in graphs", [[Human Relations]] 20:181–7</ref>
* ''{{vanchor|Neural model}}s'': the best-known [[unsupervised learning|unsupervised]] [[neural network]] is the [[self-organizing map]]. These models can usually be characterized as similar to one or more of the above models, including subspace models when neural networks implement a form of [[Principal Component Analysis]] or [[Independent Component Analysis]].

A "clustering" is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example, a hierarchy of clusters embedded in each other.
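As an illustrative sketch of the centroid model described above, the following minimal k-means loop represents each cluster by a single mean vector. The data points and the first-k initialization are made-up simplifications (real implementations typically use random or k-means++ initialization):

```python
import math

def kmeans(points, k, iters=10):
    # Simplified initialization: take the first k points as centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels, centroids

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
labels, centroids = kmeans(points, k=2)  # two well-separated groups
```

Note how the resulting cluster is fully described by its mean vector, in contrast to the density models above, which instead describe clusters as connected dense regions.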
Clusterings can be roughly distinguished as:
* ''{{vanchor|Hard clustering}}'': each object either belongs to a cluster or does not
* ''{{vanchor|Soft clustering}}'' (also: ''{{vanchor|[[fuzzy clustering]]}}''): each object belongs to each cluster to a certain degree (for example, a likelihood of belonging to the cluster)

There are also finer distinctions possible, for example:
* ''{{vanchor|Strict partitioning clustering}}'': each object belongs to exactly one cluster
* ''{{vanchor|Strict partitioning clustering with outliers}}'': objects can also belong to no cluster, in which case they are considered [[Anomaly detection|outliers]]
* ''{{vanchor|Overlapping clustering}}'' (also: ''alternative clustering'', ''multi-view clustering''): objects may belong to more than one cluster, usually involving hard clusters
* ''{{vanchor|Hierarchical clustering}}'': objects that belong to a child cluster also belong to the parent cluster
* ''{{vanchor|[[Subspace clustering]]}}'': while an overlapping clustering, within a uniquely defined subspace, clusters are not expected to overlap
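The hard/soft distinction above can be sketched in a few lines. The two fixed centroids, the query point, and the inverse-distance membership rule are all made up for illustration; fuzzy clustering methods define membership degrees in various other ways:

```python
import math

centroids = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical cluster representatives
point = (1.0, 0.0)

# Hard clustering: the point belongs to exactly one cluster
# (the one with the nearest centroid).
hard = min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

# Soft (fuzzy) clustering: the point belongs to each cluster to a degree.
# Here the degrees are inverse distances normalized to sum to 1
# (one simple choice among many).
inv = [1.0 / max(math.dist(point, c), 1e-12) for c in centroids]
soft = [w / sum(inv) for w in inv]
```

Under a strict partitioning, only `hard` would be reported; a fuzzy clustering reports the full membership vector `soft`, whose entries sum to 1.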