== Evaluation and assessment ==
Evaluation (or "validation") of clustering results is as difficult as the clustering itself.<ref name="pfitzner">{{cite journal|last1=Pfitzner|first1=Darius|last2=Leibbrandt|first2=Richard|last3=Powers|first3=David|s2cid=6935380|date=2009|title=Characterization and evaluation of similarity measures for pairs of clusterings|journal=Knowledge and Information Systems|publisher=Springer|volume=19|issue=3|pages=361–394|doi=10.1007/s10115-008-0150-6}}</ref> Popular approaches involve "''internal''" evaluation, where the clustering is summarized to a single quality score, "''external''" evaluation, where the clustering is compared to an existing "ground truth" classification, "''manual''" evaluation by a human expert, and "''indirect''" evaluation by evaluating the utility of the clustering in its intended application.<ref name=":0" />

Internal evaluation measures suffer from the problem that they represent functions that can themselves be seen as a clustering objective. For example, one could cluster the data set by the Silhouette coefficient, except that there is no known efficient algorithm for this. By using such an internal measure for evaluation, one rather compares the similarity of the optimization problems,<ref name=":0" /> and not necessarily how useful the clustering is. External evaluation has similar problems: if we had such "ground truth" labels, then we would not need to cluster; and in practical applications we usually do not have such labels. On the other hand, the labels only reflect one possible partitioning of the data set, which does not imply that there does not exist a different, and maybe even better, clustering. Neither of these approaches can therefore ultimately judge the actual quality of a clustering; this requires human evaluation,<ref name=":0">{{Cite book|title=The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data|last1=Feldman|first1=Ronen|last2=Sanger|first2=James|date=2007-01-01|publisher=Cambridge Univ. Press|isbn=978-0521836579|oclc=915286380}}</ref> which is highly subjective. Nevertheless, such statistics can be quite informative in identifying bad clusterings,<ref name=":1">{{Cite book|title=Text Mining: Predictive Methods for Analyzing Unstructured Information|last1=Weiss|first1=Sholom M.|last2=Indurkhya|first2=Nitin|last3=Zhang|first3=Tong|last4=Damerau|first4=Fred J.|publisher=Springer|year=2005|isbn=978-0387954332|oclc=803401334}}</ref> but one should not dismiss subjective human evaluation.<ref name=":1" />

=== Internal evaluation ===
{{see also|Determining the number of clusters in a data set}}
When a clustering result is evaluated based on the data that was clustered itself, this is called internal evaluation. These methods usually assign the best score to the algorithm that produces clusters with high similarity within a cluster and low similarity between clusters. One drawback of using internal criteria in cluster evaluation is that high scores on an internal measure do not necessarily result in effective information retrieval applications.<ref name="Christopher D. Manning, Prabhakar Raghavan & Hinrich Schutze">{{Cite book | first1 = Christopher D. | last1 = Manning | first2 = Prabhakar | last2 = Raghavan | first3 = Hinrich | last3 = Schütze | title = Introduction to Information Retrieval | publisher = Cambridge University Press | isbn = 978-0-521-86571-5 | date = 2008-07-07 }}</ref> Additionally, this evaluation is biased towards algorithms that use the same cluster model.
For example, k-means clustering naturally optimizes object distances, and a distance-based internal criterion will likely overrate the resulting clustering. Therefore, the internal evaluation measures are best suited for getting some insight into situations where one algorithm performs better than another, but this should not imply that one algorithm produces more valid results than another.<ref name="estivill" /> Validity as measured by such an index depends on the claim that this kind of structure exists in the data set. An algorithm designed for some kind of model has no chance if the data set contains a radically different kind of model, or if the evaluation measures a radically different criterion.<ref name="estivill" /> For example, k-means clustering can only find convex clusters, and many evaluation indexes assume convex clusters. On a data set with non-convex clusters, neither the use of ''k''-means nor of an evaluation criterion that assumes convexity is sound.

More than a dozen internal evaluation measures exist, usually based on the intuition that items in the same cluster should be more similar than items in different clusters.<ref name=":2">{{Citation|title=Knowledge Discovery in Databases – Part III – Clustering|date=2017|url=https://dbs.ifi.uni-heidelberg.de/files/Team/eschubert/lectures/KDDClusterAnalysis17-screen.pdf|place=[[Heidelberg University]]}}</ref>{{Rp|115–121}} For example, the following methods can be used to assess the quality of clustering algorithms based on internal criteria:

==== [[Davies–Bouldin index]] ====
The [[Davies–Bouldin index]] can be calculated by the following formula:
:<math> DB = \frac {1} {n} \sum_{i=1}^{n} \max_{j\neq i}\left(\frac{\sigma_i + \sigma_j} {d(c_i,c_j)}\right) </math>
where ''n'' is the number of clusters, <math>c_i</math> is the [[centroid]] of cluster <math>i</math>, <math>\sigma_i</math> is the average distance of all elements in cluster <math>i</math> to centroid <math>c_i</math>, and <math>d(c_i,c_j)</math> is the distance between centroids <math>c_i</math> and <math>c_j</math>. Since algorithms that produce clusters with low intra-cluster distances (high intra-cluster similarity) and high inter-cluster distances (low inter-cluster similarity) will have a low Davies–Bouldin index, the clustering algorithm that produces a collection of clusters with the smallest [[Davies–Bouldin index]] is considered the best algorithm based on this criterion.
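
For illustration, the index can be computed directly from a data matrix and a cluster labelling. The following is a minimal NumPy sketch of the formula above, assuming Euclidean distances and at least two clusters; the function name is illustrative, not a standard library interface:

<syntaxhighlight lang="python">
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index of a labelling; lower values indicate better separation."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    # centroid c_i and average within-cluster distance sigma_i for every cluster
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    sigma = np.array([np.linalg.norm(X[labels == c] - centroids[k], axis=1).mean()
                      for k, c in enumerate(clusters)])
    n = len(clusters)  # assumes at least two clusters
    total = 0.0
    for i in range(n):
        # worst-case (largest) similarity ratio of cluster i against all other clusters
        total += max((sigma[i] + sigma[j]) / np.linalg.norm(centroids[i] - centroids[j])
                     for j in range(n) if j != i)
    return total / n
</syntaxhighlight>
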
==== [[Dunn index]] ====
The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio between the minimal inter-cluster distance and the maximal intra-cluster distance. For each cluster partition, the Dunn index can be calculated by the following formula:<ref>{{Cite journal | last = Dunn | first = J. | title = Well separated clusters and optimal fuzzy partitions | journal = Journal of Cybernetics | year = 1974 | volume = 4 | pages = 95–104 | doi = 10.1080/01969727408546059 }}</ref>
:<math> D = \frac{\min_{1 \leq i < j \leq n} d(i,j)}{\max_{1 \leq k \leq n} d^{\prime}(k)} \,, </math>
where ''d''(''i'',''j'') represents the distance between clusters ''i'' and ''j'', and ''d'' '(''k'') measures the intra-cluster distance of cluster ''k''. The inter-cluster distance ''d''(''i'',''j'') between two clusters may be any of a number of distance measures, such as the distance between the [[centroids]] of the clusters. Similarly, the intra-cluster distance ''d'' '(''k'') may be measured in a variety of ways, such as the maximal distance between any pair of elements in cluster ''k''. Since internal criteria seek clusters with high intra-cluster similarity and low inter-cluster similarity, algorithms that produce clusters with a high Dunn index are more desirable.

==== [[Silhouette (clustering)|Silhouette coefficient]] ====
The silhouette coefficient contrasts the average distance to elements in the same cluster with the average distance to elements in other clusters. Objects with a high silhouette value are considered well clustered; objects with a low value may be outliers. This index works well with ''k''-means clustering, and is also used to determine the optimal number of clusters.<ref>{{cite journal |title=Silhouettes: A graphical aid to the interpretation and validation of cluster analysis |author=Peter J. Rousseeuw |journal=Journal of Computational and Applied Mathematics |volume=20 |pages=53–65 |year=1987 |doi=10.1016/0377-0427(87)90125-7}}</ref>
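
For illustration, the per-object silhouette compares the mean distance to the other members of the object's own cluster with the smallest mean distance to the members of any other cluster. A minimal NumPy sketch, assuming Euclidean distances and at least two clusters (the function name is illustrative):

<syntaxhighlight lang="python">
import numpy as np

def silhouette_values(X, labels):
    """Per-object silhouette values s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        own[i] = False                     # exclude the point itself
        if not own.any():                  # singleton cluster: s(i) = 0 by convention
            continue
        a = D[i, own].mean()               # mean distance within the own cluster
        b = min(D[i, labels == c].mean()   # smallest mean distance to another cluster
                for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# The overall silhouette coefficient is the mean of the per-object values:
# silhouette_values(X, labels).mean()
</syntaxhighlight>
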
=== External evaluation ===
In external evaluation, clustering results are evaluated based on data that was not used for clustering, such as known class labels and external benchmarks. Such benchmarks consist of a set of pre-classified items, and these sets are often created by (expert) humans. Thus, the benchmark sets can be thought of as a [[gold standard (test)|gold standard]] for evaluation.<ref name="pfitzner"/> These types of evaluation methods measure how close the clustering is to the predetermined benchmark classes. However, it has recently been discussed whether this is adequate for real data, or only for synthetic data sets with a factual ground truth, since classes can contain internal structure, the attributes present may not allow separation of clusters, or the classes may contain [[Anomaly detection|anomalies]].<ref name="Faerberetal2010">{{Cite conference | first1 = Ines | last1 = Färber | first2 = Stephan | last2 = Günnemann | author3-link = Hans-Peter Kriegel | first3 = Hans-Peter | last3 = Kriegel | first4 = Peer | last4 = Kröger | first5 = Emmanuel | last5 = Müller | first6 = Erich | last6 = Schubert | first7 = Thomas | last7 = Seidl | first8 = Arthur | last8 = Zimek |author-link8=Arthur Zimek | editor1-first = Xiaoli Z. | editor1-last = Fern | editor2-first = Ian | editor2-last = Davidson | editor3-first = Jennifer | editor3-last = Dy|editor3-link=Jennifer Dy | book-title = MultiClust: Discovering, Summarizing, and Using Multiple Clusterings | title = On Using Class-Labels in Evaluation of Clusterings | year = 2010 | publisher = [[Association for Computing Machinery|ACM]] [[SIGKDD]] | url = http://eecs.oregonstate.edu/research/multiclust/Evaluation-4.pdf }}</ref> Additionally, from a [[knowledge discovery]] point of view, the reproduction of known knowledge may not necessarily be the intended result.<ref name="Faerberetal2010"/> In the special scenario of [[constrained clustering]], where meta information (such as class labels) is used already in the clustering process, the hold-out of information for evaluation purposes is non-trivial.<ref name="pourrajabi">{{Cite conference | first1 = M. | last1 = Pourrajabi | first2 = D. | last2 = Moulavi | first3 = R. J. G. B. | last3 = Campello | first4 = A. | last4 = Zimek|author-link4=Arthur Zimek | first5 = J. | last5 = Sander |first6 = R. |last6 = Goebel | year = 2014 | title = Model Selection for Semi-Supervised Clustering | book-title = Proceedings of the 17th International Conference on Extending Database Technology (EDBT) | pages = 331–342 | doi = 10.5441/002/edbt.2014.31 }}</ref>

A number of measures are adapted from variants used to evaluate classification tasks. In place of counting the number of times a class was correctly assigned to a single data point (known as [[true positive]]s), such ''pair counting'' metrics assess whether each pair of data points that is truly in the same cluster is predicted to be in the same cluster.<ref name="pfitzner"/> As with internal evaluation, several external evaluation measures exist,<ref name=":2" />{{Rp|125–129}} for example:

==== Purity ====
Purity is a measure of the extent to which clusters contain a single class.<ref name="Christopher D. Manning, Prabhakar Raghavan & Hinrich Schutze"/> Its calculation can be thought of as follows: for each cluster, count the number of data points from the most common class in that cluster, then take the sum over all clusters and divide by the total number of data points. Formally, given some set of clusters <math>M</math> and some set of classes <math>D</math>, both partitioning <math>N</math> data points, purity can be defined as:
:<math> \frac{1}{N}\sum_{m\in M}\max_{d\in D}{|m \cap d|} </math>
This measure does not penalize having many clusters, and more clusters make it easier to produce a high purity: a purity score of 1 is always possible by putting each data point in its own cluster. Purity also does not work well for imbalanced data, where even poorly performing clustering algorithms will give a high purity value. For example, if a dataset of size 1000 consists of two classes, one containing 999 points and the other containing 1 point, then every possible partition will have a purity of at least 99.9%.

==== [[Rand index]] ====
The Rand index<ref>{{Cite journal | first = W. M. | last = Rand | title = Objective criteria for the evaluation of clustering methods | journal = [[Journal of the American Statistical Association]] | volume = 66 | pages = 846–850 | year = 1971 | doi = 10.2307/2284239 | issue = 336 | publisher = American Statistical Association | jstor = 2284239 | arxiv = 1704.01036 }}</ref> computes how similar the clusters (returned by the clustering algorithm) are to the benchmark classifications. It can be computed using the following formula:
:<math> RI = \frac {TP + TN} {TP + FP + FN + TN} </math>
where <math>TP</math> is the number of true positives, <math>TN</math> is the number of [[true negative]]s, <math>FP</math> is the number of [[false positives]], and <math>FN</math> is the number of [[false negatives]]. The instances being counted here are the numbers of correct ''pairwise'' assignments. That is, <math>TP</math> is the number of pairs of points that are clustered together in the predicted partition and in the ground truth partition, <math>FP</math> is the number of pairs of points that are clustered together in the predicted partition but not in the ground truth partition, and so on. If the dataset is of size <math>N</math>, then <math>TP + TN + FP + FN = \binom{N}{2}</math>. One issue with the [[Rand index]] is that [[false positive]]s and [[false negative]]s are equally weighted. This may be an undesirable characteristic for some clustering applications. The F-measure addresses this concern,{{Citation needed|date=May 2018|reason=it does not achieve the same for chance correction as ARI}} as does the chance-corrected [[adjusted Rand index]].
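
For illustration, purity and the pair counts underlying the Rand index can be computed by brute force over all pairs of points. The following plain-Python sketch uses illustrative function names rather than any particular library's interface:

<syntaxhighlight lang="python">
from collections import Counter
from itertools import combinations

def purity(labels_pred, labels_true):
    """Purity: fraction of points falling in the majority true class of their cluster."""
    total = 0
    for c in set(labels_pred):
        counts = Counter(t for p, t in zip(labels_pred, labels_true) if p == c)
        total += max(counts.values())
    return total / len(labels_pred)

def pair_confusion(labels_pred, labels_true):
    """Pair-level TP, FP, FN, TN between a clustering and a ground-truth labelling."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_pred)), 2):
        same_pred = labels_pred[i] == labels_pred[j]
        same_true = labels_true[i] == labels_true[j]
        if same_pred and same_true:
            tp += 1          # pair together in both partitions
        elif same_pred:
            fp += 1          # together in the clustering only
        elif same_true:
            fn += 1          # together in the ground truth only
        else:
            tn += 1          # separated in both partitions
    return tp, fp, fn, tn

def rand_index(labels_pred, labels_true):
    tp, fp, fn, tn = pair_confusion(labels_pred, labels_true)
    return (tp + tn) / (tp + fp + fn + tn)
</syntaxhighlight>
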
==== [[F-measure]] ====
The F-measure can be used to balance the contribution of [[false negative]]s by weighting [[recall (information retrieval)|recall]] through a parameter <math>\beta \geq 0</math>. Let '''[[precision (information retrieval)|precision]]''' and '''[[recall (information retrieval)|recall]]''' (both external evaluation measures in themselves) be defined as follows:
:<math> P = \frac {TP } {TP + FP } </math>
:<math> R = \frac {TP } {TP + FN} </math>
where <math>P</math> is the [[precision (information retrieval)|precision]] rate and <math>R</math> is the [[recall (information retrieval)|recall]] rate. We can calculate the F-measure by using the following formula:<ref name="Christopher D. Manning, Prabhakar Raghavan & Hinrich Schutze"/>
:<math> F_{\beta} = \frac {(\beta^2 + 1)\cdot P \cdot R } {\beta^2 \cdot P + R} </math>
When <math>\beta=0</math>, <math>F_{0}=P</math>. In other words, [[recall (information retrieval)|recall]] has no impact on the F-measure when <math>\beta=0</math>, and increasing <math>\beta</math> allocates an increasing amount of weight to recall in the final F-measure. Also, <math>TN</math> is not taken into account and can vary from 0 upward without bound.

==== [[Jaccard coefficient|Jaccard index]] ====
The Jaccard index is used to quantify the similarity between two datasets. The [[Jaccard coefficient|Jaccard index]] takes on a value between 0 and 1. An index of 1 means that the two datasets are identical, and an index of 0 indicates that the datasets have no common elements. The Jaccard index is defined by the following formula:
:<math> J(A,B) = \frac {|A \cap B| } {|A \cup B|} = \frac{TP}{TP + FP + FN} </math>
This is simply the number of unique elements common to both sets divided by the total number of unique elements in both sets. Note that <math>TN</math> is not taken into account.

==== [[Sørensen–Dice coefficient|Dice index]] ====
The Dice symmetric measure doubles the weight on <math>TP</math> while still ignoring <math>TN</math>:
:<math> DSC = \frac{2TP}{2TP + FP + FN} </math>

==== [[Fowlkes–Mallows Index|Fowlkes–Mallows index]] ====
The Fowlkes–Mallows index<ref>{{cite journal | last1 = Fowlkes | first1 = E. B. | last2 = Mallows | first2 = C. L. | year = 1983 | title = A Method for Comparing Two Hierarchical Clusterings | jstor = 2288117 | journal = Journal of the American Statistical Association | volume = 78 | issue = 383| pages = 553–569 | doi = 10.1080/01621459.1983.10478008 }}</ref> computes the similarity between the clusters returned by the clustering algorithm and the benchmark classifications. The higher the value of the Fowlkes–Mallows index, the more similar the clusters and the benchmark classifications are. It can be computed using the following formula:
:<math> FM = \sqrt{ \frac {TP}{TP+FP} \cdot \frac{TP}{TP+FN} } </math>
where <math>TP</math> is the number of [[true positive]]s, <math>FP</math> is the number of [[false positives]], and <math>FN</math> is the number of [[false negatives]]. The <math>FM</math> index is the geometric mean of the [[precision (information retrieval)|precision]] and [[recall (information retrieval)|recall]] <math>P</math> and <math>R</math>, and is thus also known as the [[G-measure]], while the F-measure is their harmonic mean.<ref name="powers">{{cite conference | last = Powers | first = David | date = 2003 | conference = International Conference on Cognitive Science | pages = 529–534 | title = Recall and Precision versus the Bookmaker }}</ref><ref>{{cite journal | last1 = Hubert | first1 = L. | last2 = Arabie | first2 = P. | year = 1985 | title = Comparing partitions | journal = Journal of Classification | volume = 2 | issue = 1 | pages = 193–218 | doi = 10.1007/BF01908075 | s2cid = 189915041 }}</ref> Moreover, [[precision (information retrieval)|precision]] and [[recall (information retrieval)|recall]] are also known as Wallace's indices <math>B^I</math> and <math>B^{II}</math>.<ref>{{cite journal | last1 = Wallace | first1 = D. L. | year = 1983 | title = Comment | journal = Journal of the American Statistical Association | volume = 78 | issue = 383| pages = 569–579 | doi=10.1080/01621459.1983.10478009}}</ref> Chance-normalized versions of recall, precision and G-measure correspond to [[Informedness]], [[Markedness]] and [[Matthews correlation coefficient|Matthews Correlation]], and relate strongly to [[Cohen's kappa|Kappa]].<ref name="kappa">{{cite conference | last1 = Powers | first1 = David | title = The Problem with Kappa | conference = European Chapter of the Association for Computational Linguistics | date = 2012 | pages = 345–355}}</ref>
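
Given the pair counts, the remaining pair-counting measures in this section follow directly. The sketch below reuses the illustrative <code>pair_confusion</code> helper defined earlier and assumes non-degenerate counts (no division by zero):

<syntaxhighlight lang="python">
import math

def pair_counting_scores(labels_pred, labels_true, beta=1.0):
    """F_beta, Jaccard, Dice and Fowlkes-Mallows from the pair counts defined above."""
    tp, fp, fn, tn = pair_confusion(labels_pred, labels_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "F_beta": (beta**2 + 1) * precision * recall / (beta**2 * precision + recall),
        "Jaccard": tp / (tp + fp + fn),
        "Dice": 2 * tp / (2 * tp + fp + fn),
        "Fowlkes-Mallows": math.sqrt(precision * recall),  # geometric mean of P and R
    }
</syntaxhighlight>
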
==== Chi Index ====
The Chi index<ref>{{Cite journal |last1=Luna-Romera |first1=José María |last2=Martínez-Ballesteros |first2=María |last3=García-Gutiérrez |first3=Jorge |last4=Riquelme |first4=José C. |date=June 2019 |title=External clustering validity index based on chi-squared statistical test |url=https://linkinghub.elsevier.com/retrieve/pii/S0020025519301550 |journal=Information Sciences |language=en |volume=487 |pages=1–17 |doi=10.1016/j.ins.2019.02.046|hdl=11441/132081 |s2cid=93003939 |hdl-access=free }}</ref> is an external validation index that measures the clustering results by applying the [[Chi-squared test|chi-squared statistic]]. It rewards clusterings in which the labels are as sparse as possible across the clusters, i.e., in which each cluster contains as few different labels as possible. The higher the value of the Chi index, the greater the relationship between the resulting clusters and the labels used.

==== [[Mutual Information]] ====
Mutual information is an [[information theory|information theoretic]] measure of how much information is shared between a clustering and a ground-truth classification; it can detect a non-linear similarity between two clusterings. [[Adjusted mutual information|Normalized mutual information]] is a family of corrected-for-chance variants of this that has a reduced bias for varying cluster numbers.<ref name="pfitzner" />
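
For illustration, the mutual information between a clustering and a classification can be estimated from their empirical joint label distribution. The following minimal NumPy sketch uses the natural logarithm, so the result is in nats (the function name is illustrative, not a library API):

<syntaxhighlight lang="python">
import numpy as np

def mutual_information(labels_pred, labels_true):
    """Empirical mutual information (in nats) between two labelings."""
    labels_pred = np.asarray(labels_pred)
    labels_true = np.asarray(labels_true)
    mi = 0.0
    for u in np.unique(labels_pred):
        p_u = np.mean(labels_pred == u)                       # P(cluster = u)
        for v in np.unique(labels_true):
            p_v = np.mean(labels_true == v)                   # P(class = v)
            p_uv = np.mean((labels_pred == u) & (labels_true == v))
            if p_uv > 0:
                mi += p_uv * np.log(p_uv / (p_u * p_v))
    # Normalized variants additionally divide by some combination of the
    # entropies of the two labelings.
    return mi
</syntaxhighlight>
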
==== [[Confusion matrix]] ====
A confusion matrix can be used to quickly visualize the results of a classification (or clustering) algorithm. It shows how different a cluster is from the gold-standard cluster.

==== Validity Measure ====
The validity measure (V-measure for short) is a combined metric for homogeneity and completeness of the clusters.<ref>Rosenberg, Andrew, and Julia Hirschberg. "V-measure: A conditional entropy-based external cluster evaluation measure." Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). 2007. [https://aclanthology.org/D07-1043.pdf pdf]</ref>

=== Cluster tendency ===
{{more citations needed section|date=April 2025}}
Measuring cluster tendency means measuring to what degree clusters exist in the data to be clustered; this may be performed as an initial test before attempting clustering. One way to do this is to compare the data against random data. On average, random data should not have clusters.{{verification needed|[[Random cluster model]] how does this fit here??|date=April 2025}}

*'''[[Hopkins statistic]]'''
:There are multiple formulations of the [[Hopkins statistic]].<ref>{{Cite journal | title = A new method for determining the type of distribution of plant individuals | last1 = Hopkins | first1 = Brian | last2 = Skellam | first2 = John Gordon | journal = Annals of Botany | volume =18 | number = 2 | pages = 213–227 | year = 1954 | publisher = Annals Botany Co | doi=10.1093/oxfordjournals.aob.a083391 }}</ref> A typical one is as follows.<ref>{{Cite book | last = Banerjee | first = A. | title = 2004 IEEE International Conference on Fuzzy Systems (IEEE Cat. No.04CH37542) | chapter = Validating clusters using the Hopkins statistic | s2cid = 36701919 | volume = 1 | pages = 149–153 | doi = 10.1109/FUZZY.2004.1375706 | year = 2004 | isbn = 978-0-7803-8353-1 }}</ref> Let <math>X</math> be the set of <math>n</math> data points in <math>d</math>-dimensional space. Consider a random sample (without replacement) of <math>m \ll n</math> data points with members <math>x_i</math>. Also generate a set <math>Y</math> of <math>m</math> uniformly randomly distributed data points. Now define two distance measures: <math>u_i</math>, the distance of <math>y_i \in Y</math> from its nearest neighbor in <math>X</math>, and <math>w_i</math>, the distance of <math>x_i \in X</math> from its nearest neighbor in <math>X</math> (other than itself). We then define the Hopkins statistic as:
:: <math> H=\frac{\sum_{i=1}^m{u_i^d}}{\sum_{i=1}^m{u_i^d}+\sum_{i=1}^m{w_i^d}} \,, </math>
:With this definition, uniform random data should tend to have values near 0.5, and clustered data should tend to have values nearer to 1.
:However, data containing just a single Gaussian will also score close to 1, as this statistic measures deviation from a ''uniform'' distribution, not [[Multimodal distribution|multimodality]], making this statistic largely useless in application (as real data is never remotely uniform).
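
For illustration, the formulation above can be sketched as follows, assuming Euclidean distances and drawing the artificial points uniformly from the bounding box of the data (one common choice); the function name and the default sample size are illustrative:

<syntaxhighlight lang="python">
import numpy as np

def hopkins(X, m=None, seed=None):
    """Hopkins statistic sketch: values near 0.5 suggest uniform random data,
    values near 1 suggest clustered data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if m is None:
        m = max(1, n // 10)                       # sample size m << n
    sample = X[rng.choice(n, size=m, replace=False)]   # m real data points
    # m artificial points drawn uniformly from the bounding box of the data
    Y = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))

    def nn_dist(points, exclude_self):
        dists = np.linalg.norm(points[:, None, :] - X[None, :, :], axis=-1)
        if exclude_self:
            # drop zero self-distances (exact duplicate points are also skipped)
            dists = np.where(dists == 0, np.inf, dists)
        return dists.min(axis=1)

    u = nn_dist(Y, exclude_self=False)       # artificial point to nearest real point
    w = nn_dist(sample, exclude_self=True)   # real point to nearest other real point
    return np.sum(u**d) / (np.sum(u**d) + np.sum(w**d))
</syntaxhighlight>
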