=== External evaluation ===

In external evaluation, clustering results are evaluated based on data that was not used for clustering, such as known class labels and external benchmarks. Such benchmarks consist of a set of pre-classified items, and these sets are often created by (expert) humans. Thus, the benchmark sets can be thought of as a [[gold standard (test)|gold standard]] for evaluation.<ref name="pfitzner"/> These types of evaluation methods measure how close the clustering is to the predetermined benchmark classes. However, it has recently been discussed whether this is adequate for real data, or only for synthetic data sets with a factual ground truth, since classes can contain internal structure, the attributes present may not allow separation of clusters, or the classes may contain [[Anomaly detection|anomalies]].<ref name="Faerberetal2010">{{Cite conference | first1 = Ines | last1 = Färber | first2 = Stephan | last2 = Günnemann | author3-link = Hans-Peter Kriegel | first3 = Hans-Peter | last3 = Kriegel | first4 = Peer | last4 = Kröger | first5 = Emmanuel | last5 = Müller | first6 = Erich | last6 = Schubert | first7 = Thomas | last7 = Seidl | first8 = Arthur | last8 = Zimek |author-link8=Arthur Zimek | editor1-first = Xiaoli Z. | editor1-last = Fern | editor2-first = Ian | editor2-last = Davidson | editor3-first = Jennifer | editor3-last = Dy|editor3-link=Jennifer Dy | book-title = MultiClust: Discovering, Summarizing, and Using Multiple Clusterings | title = On Using Class-Labels in Evaluation of Clusterings | year = 2010 | publisher = [[Association for Computing Machinery|ACM]] [[SIGKDD]] | url = http://eecs.oregonstate.edu/research/multiclust/Evaluation-4.pdf }}</ref> Additionally, from a [[knowledge discovery]] point of view, the reproduction of known knowledge may not necessarily be the intended result.<ref name="Faerberetal2010"/> In the special scenario of [[constrained clustering]], where meta information (such as class labels) is used already in the clustering process, the hold-out of information for evaluation purposes is non-trivial.<ref name="pourrajabi">{{Cite conference | first1 = M. | last1 = Pourrajabi | first2 = D. | last2 = Moulavi | first3 = R. J. G. B. | last3 = Campello | first4 = A. | last4 = Zimek|author-link4=Arthur Zimek | first5 = J. | last5 = Sander |first6 = R. |last6 = Goebel | year = 2014 | title = Model Selection for Semi-Supervised Clustering | book-title = Proceedings of the 17th International Conference on Extending Database Technology (EDBT) | pages = 331–342 | doi = 10.5441/002/edbt.2014.31 }}</ref>

A number of measures are adapted from variants used to evaluate classification tasks. In place of counting the number of times a class was correctly assigned to a single data point (known as [[true positive]]s), such ''pair counting'' metrics assess whether each pair of data points that is truly in the same cluster is predicted to be in the same cluster.<ref name="pfitzner"/> As with internal evaluation, several external evaluation measures exist,<ref name=":2" />{{Rp|125–129}} for example:

==== Purity ====

Purity is a measure of the extent to which clusters contain a single class.<ref name="Christopher D. Manning, Prabhakar Raghavan & Hinrich Schutze"/> Its calculation can be thought of as follows: for each cluster, count the number of data points from the most common class in that cluster, then take the sum over all clusters and divide by the total number of data points.

Formally, given some set of clusters <math>M</math> and some set of classes <math>D</math>, both partitioning <math>N</math> data points, purity can be defined as:

:<math> \frac{1}{N}\sum_{m\in M}\max_{d\in D}{|m \cap d|} </math>

This measure does not penalize having many clusters, and more clusters make it easier to produce a high purity: a purity score of 1 is always possible by putting each data point in its own cluster. Purity also does not work well for imbalanced data, where even poorly performing clustering algorithms will give a high purity value. For example, if a dataset of size 1000 consists of two classes, one containing 999 points and the other containing 1 point, then every possible partition will have a purity of at least 99.9%.
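The following minimal sketch (an illustration, not drawn from the cited sources) shows how purity can be computed, assuming cluster and class assignments are given as equal-length label sequences:

<syntaxhighlight lang="python">
from collections import Counter

def purity(cluster_labels, class_labels):
    """Fraction of points that belong to the most common class of their cluster."""
    assert len(cluster_labels) == len(class_labels)
    # Count class occurrences within each cluster.
    per_cluster = {}
    for m, d in zip(cluster_labels, class_labels):
        per_cluster.setdefault(m, Counter())[d] += 1
    # Sum the majority-class counts and normalize by the total number of points.
    return sum(max(c.values()) for c in per_cluster.values()) / len(class_labels)

# Example: purity([0, 0, 1, 1], ['a', 'a', 'a', 'b']) == 0.75
</syntaxhighlight>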
==== [[Rand index]] ====

The Rand index<ref>{{Cite journal | first = W. M. | last = Rand | title = Objective criteria for the evaluation of clustering methods | journal = [[Journal of the American Statistical Association]] | volume = 66 | pages = 846–850 | year = 1971 | doi = 10.2307/2284239 | issue = 336 | publisher = American Statistical Association | jstor = 2284239 | arxiv = 1704.01036 }}</ref> computes how similar the clusters (returned by the clustering algorithm) are to the benchmark classifications. It can be computed using the following formula:

:<math> RI = \frac {TP + TN} {TP + FP + FN + TN} </math>

where <math>TP</math> is the number of true positives, <math>TN</math> is the number of [[true negative]]s, <math>FP</math> is the number of [[false positives]], and <math>FN</math> is the number of [[false negatives]]. The instances being counted here are the number of correct ''pairwise'' assignments. That is, <math>TP</math> is the number of pairs of points that are clustered together in the predicted partition and in the ground truth partition, <math>FP</math> is the number of pairs of points that are clustered together in the predicted partition but not in the ground truth partition, and so on. If the dataset is of size <math>N</math>, then <math>TP + TN + FP + FN = \binom{N}{2}</math>.

One issue with the [[Rand index]] is that [[false positive]]s and [[false negative]]s are equally weighted. This may be an undesirable characteristic for some clustering applications. The F-measure addresses this concern,{{Citation needed|date=May 2018|reason=it does not achieve the same for chance correction as ARI}} as does the chance-corrected [[adjusted Rand index]].

==== [[F-measure]] ====

The F-measure can be used to balance the contribution of [[false negative]]s by weighting [[recall (information retrieval)|recall]] through a parameter <math>\beta \geq 0</math>. Let '''[[precision (information retrieval)|precision]]''' and '''[[recall (information retrieval)|recall]]''' (both external evaluation measures in themselves) be defined as follows:

:<math> P = \frac {TP } {TP + FP } </math>

:<math> R = \frac {TP } {TP + FN} </math>

where <math>P</math> is the [[precision (information retrieval)|precision]] rate and <math>R</math> is the [[recall (information retrieval)|recall]] rate. The F-measure is then given by:<ref name="Christopher D. Manning, Prabhakar Raghavan & Hinrich Schutze"/>

:<math> F_{\beta} = \frac {(\beta^2 + 1)\cdot P \cdot R } {\beta^2 \cdot P + R} </math>

When <math>\beta=0</math>, <math>F_{0}=P</math>; in other words, recall has no impact on the F-measure when <math>\beta=0</math>, and increasing <math>\beta</math> allocates an increasing amount of weight to recall in the final F-measure. Also, <math>TN</math> is not taken into account and can vary from 0 upward without bound.
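A similar illustrative sketch (again not from the cited sources, and assuming the two partitions are given as equal-length label lists) derives the pairwise counts <math>TP</math>, <math>TN</math>, <math>FP</math>, <math>FN</math> and, from them, the Rand index and <math>F_\beta</math>:

<syntaxhighlight lang="python">
from itertools import combinations

def pair_counts(pred_labels, true_labels):
    """Classify every pair of points by agreement between predicted and true partitions."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_pred = pred_labels[i] == pred_labels[j]
        same_true = true_labels[i] == true_labels[j]
        if same_pred and same_true:
            tp += 1          # together in both partitions
        elif same_pred:
            fp += 1          # together in prediction only
        elif same_true:
            fn += 1          # together in ground truth only
        else:
            tn += 1          # separated in both
    return tp, tn, fp, fn

def rand_index(pred_labels, true_labels):
    tp, tn, fp, fn = pair_counts(pred_labels, true_labels)
    return (tp + tn) / (tp + tn + fp + fn)

def f_beta(pred_labels, true_labels, beta=1.0):
    tp, _, fp, fn = pair_counts(pred_labels, true_labels)
    p, r = tp / (tp + fp), tp / (tp + fn)
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
</syntaxhighlight>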
==== [[Jaccard coefficient|Jaccard index]] ====

The Jaccard index is used to quantify the similarity between two datasets. The [[Jaccard coefficient|Jaccard index]] takes on a value between 0 and 1. An index of 1 means that the two datasets are identical, and an index of 0 indicates that the datasets have no common elements. The Jaccard index is defined by the following formula:

:<math> J(A,B) = \frac {|A \cap B| } {|A \cup B|} = \frac{TP}{TP + FP + FN} </math>

This is simply the number of unique elements common to both sets divided by the total number of unique elements in both sets. Note that <math>TN</math> is not taken into account.

==== [[Sørensen–Dice coefficient|Dice index]] ====

The Dice symmetric measure doubles the weight on <math>TP</math> while still ignoring <math>TN</math>:

:<math> DSC = \frac{2TP}{2TP + FP + FN} </math>

==== [[Fowlkes–Mallows Index|Fowlkes–Mallows index]] ====

The Fowlkes–Mallows index<ref>{{cite journal | last1 = Fowlkes | first1 = E. B. | last2 = Mallows | first2 = C. L. | year = 1983 | title = A Method for Comparing Two Hierarchical Clusterings | jstor = 2288117 | journal = Journal of the American Statistical Association | volume = 78 | issue = 383| pages = 553–569 | doi = 10.1080/01621459.1983.10478008 }}</ref> computes the similarity between the clusters returned by the clustering algorithm and the benchmark classifications. The higher the value of the Fowlkes–Mallows index, the more similar the clusters and the benchmark classifications are. It can be computed using the following formula:

:<math> FM = \sqrt{ \frac {TP}{TP+FP} \cdot \frac{TP}{TP+FN} } </math>

where <math>TP</math> is the number of [[true positive]]s, <math>FP</math> is the number of [[false positives]], and <math>FN</math> is the number of [[false negatives]]. The <math>FM</math> index is the geometric mean of the [[precision (information retrieval)|precision]] and [[recall (information retrieval)|recall]] <math>P</math> and <math>R</math>, and is thus also known as the [[G-measure]], while the F-measure is their harmonic mean.<ref name="powers">{{cite conference | last = Powers | first = David | date = 2003 | conference = International Conference on Cognitive Science | pages = 529–534 | title = Recall and Precision versus the Bookmaker }}</ref><ref>{{cite journal | last1 = Arabie | first1 = P. | year = 1985| title = Comparing partitions | journal = Journal of Classification | volume = 2 | issue = 1| page = 1985 | doi = 10.1007/BF01908075 | s2cid = 189915041 }}</ref> Moreover, [[precision (information retrieval)|precision]] and [[recall (information retrieval)|recall]] are also known as Wallace's indices <math>B^I</math> and <math>B^{II}</math>.<ref>{{cite journal | last1 = Wallace | first1 = D. L. | year = 1983 | title = Comment | journal = Journal of the American Statistical Association | volume = 78 | issue = 383| pages = 569–579 | doi=10.1080/01621459.1983.10478009}}</ref> Chance-normalized versions of recall, precision and G-measure correspond to [[Informedness]], [[Markedness]] and [[Matthews correlation coefficient|Matthews Correlation]], and relate strongly to [[Cohen's kappa|Kappa]].<ref name="kappa">{{cite conference | last1 = Powers | first1 = David | title = The Problem with Kappa | conference = European Chapter of the Association for Computational Linguistics | date = 2012 | pages = 345–355}}</ref>
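The Jaccard, Dice, and Fowlkes–Mallows indices can be derived from the same pair counts; the sketch below reuses the hypothetical <code>pair_counts</code> helper defined in the earlier illustration (it is not a library function):

<syntaxhighlight lang="python">
from math import sqrt

def jaccard_dice_fowlkes_mallows(pred_labels, true_labels):
    """Jaccard, Dice, and Fowlkes–Mallows indices from pairwise TP/FP/FN counts."""
    tp, _, fp, fn = pair_counts(pred_labels, true_labels)  # TN is ignored by all three
    jaccard = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    fm = sqrt((tp / (tp + fp)) * (tp / (tp + fn)))  # geometric mean of precision and recall
    return jaccard, dice, fm
</syntaxhighlight>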
==== Chi Index ====

The Chi index<ref>{{Cite journal |last1=Luna-Romera |first1=José María |last2=Martínez-Ballesteros |first2=María |last3=García-Gutiérrez |first3=Jorge |last4=Riquelme |first4=José C. |date=June 2019 |title=External clustering validity index based on chi-squared statistical test |url=https://linkinghub.elsevier.com/retrieve/pii/S0020025519301550 |journal=Information Sciences |language=en |volume=487 |pages=1–17 |doi=10.1016/j.ins.2019.02.046|hdl=11441/132081 |s2cid=93003939 |hdl-access=free }}</ref> is an external validation index that measures the clustering results by applying the [[Chi-squared test|chi-squared statistic]]. It rewards clusterings in which the labels are spread as sparsely as possible across the clusters, i.e., in which each cluster contains as few different labels as possible. The higher the value of the Chi index, the greater the relationship between the resulting clusters and the labels used.

==== [[Mutual Information]] ====

The mutual information is an [[information theory|information theoretic]] measure of how much information is shared between a clustering and a ground-truth classification; it can detect a non-linear similarity between two clusterings. [[Adjusted mutual information|Normalized mutual information]] is a family of corrected-for-chance variants with a reduced bias for varying cluster numbers.<ref name="pfitzner" />

==== [[Confusion matrix]] ====

A confusion matrix can be used to quickly visualize the results of a classification (or clustering) algorithm. It shows how different a cluster is from the gold-standard cluster.

==== Validity Measure ====

The validity measure (V-measure for short) is a combined metric for homogeneity and completeness of the clusters.<ref>Rosenberg, Andrew, and Julia Hirschberg. "V-measure: A conditional entropy-based external cluster evaluation measure." Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). 2007. [https://aclanthology.org/D07-1043.pdf pdf]</ref>
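Several of the external measures discussed above are available in common libraries. For example, assuming [[scikit-learn]] is installed, they can be computed as follows (the label arrays are illustrative only):

<syntaxhighlight lang="python">
# Requires scikit-learn (pip install scikit-learn).
from sklearn.metrics import (adjusted_rand_score, fowlkes_mallows_score,
                             normalized_mutual_info_score, v_measure_score)

true_labels = [0, 0, 0, 1, 1, 1]   # ground-truth classes
pred_labels = [0, 0, 1, 1, 2, 2]   # clustering result

print(adjusted_rand_score(true_labels, pred_labels))          # chance-corrected Rand index
print(fowlkes_mallows_score(true_labels, pred_labels))        # geometric mean of precision and recall
print(normalized_mutual_info_score(true_labels, pred_labels)) # normalized mutual information
print(v_measure_score(true_labels, pred_labels))              # harmonic mean of homogeneity and completeness
</syntaxhighlight>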