=== Internal evaluation ===
{{see also|Determining the number of clusters in a data set}}

When a clustering result is evaluated based on the data that was clustered itself, this is called internal evaluation. These methods usually assign the best score to the algorithm that produces clusters with high similarity within a cluster and low similarity between clusters. One drawback of using internal criteria in cluster evaluation is that high scores on an internal measure do not necessarily result in effective information retrieval applications.<ref name="Christopher D. Manning, Prabhakar Raghavan & Hinrich Schutze">{{Cite book | first1 = Christopher D. | last1 = Manning | first2 = Prabhakar | last2 = Raghavan | first3 = Hinrich | last3 = Schütze | title = Introduction to Information Retrieval | publisher = Cambridge University Press | isbn = 978-0-521-86571-5 | date = 2008-07-07 }}</ref> Additionally, this evaluation is biased towards algorithms that use the same cluster model. For example, ''k''-means clustering naturally optimizes object distances, and a distance-based internal criterion will likely overrate the resulting clustering.

Therefore, internal evaluation measures are best suited for gaining some insight into situations where one algorithm performs better than another, but this should not be taken to imply that one algorithm produces more valid results than another.<ref name="estivill" /> Validity as measured by such an index depends on the claim that this kind of structure exists in the data set. An algorithm designed for one kind of model has no chance if the data set contains a radically different kind of model, or if the evaluation measures a radically different criterion.<ref name="estivill" /> For example, ''k''-means clustering can only find convex clusters, and many evaluation indexes assume convex clusters. On a data set with non-convex clusters, neither the use of ''k''-means nor of an evaluation criterion that assumes convexity is sound.

More than a dozen internal evaluation measures exist, usually based on the intuition that items in the same cluster should be more similar than items in different clusters.<ref name=":2">{{Citation|title=Knowledge Discovery in Databases – Part III – Clustering|date=2017|url=https://dbs.ifi.uni-heidelberg.de/files/Team/eschubert/lectures/KDDClusterAnalysis17-screen.pdf|place=[[Heidelberg University]]}}</ref>{{Rp|115–121}} For example, the following methods can be used to assess the quality of clustering algorithms based on an internal criterion:

==== [[Davies–Bouldin index]] ====
The Davies–Bouldin index can be calculated by the following formula:
:<math>
DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j\neq i}\left(\frac{\sigma_i + \sigma_j}{d(c_i,c_j)}\right)
</math>
where ''n'' is the number of clusters, <math>c_i</math> is the [[centroid]] of cluster <math>i</math>, <math>\sigma_i</math> is the average distance of all elements in cluster <math>i</math> to centroid <math>c_i</math>, and <math>d(c_i,c_j)</math> is the distance between centroids <math>c_i</math> and <math>c_j</math>. Since algorithms that produce clusters with low intra-cluster distances (high intra-cluster similarity) and high inter-cluster distances (low inter-cluster similarity) will have a low Davies–Bouldin index, the clustering algorithm that produces the collection of clusters with the smallest Davies–Bouldin index is considered the best algorithm by this criterion.
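The index can be computed directly from this definition. The following is a minimal Python sketch, assuming clusters are given as NumPy arrays of points and that Euclidean distance is used; the function name and data representation are illustrative, not a standard API:

<syntaxhighlight lang="python">
import numpy as np

def davies_bouldin(clusters):
    # clusters: list of (m_i, d) NumPy arrays of points (Euclidean distance assumed).
    centroids = [c.mean(axis=0) for c in clusters]
    # sigma_i: average distance of the elements of cluster i to its centroid.
    sigma = [np.linalg.norm(c - mu, axis=1).mean()
             for c, mu in zip(clusters, centroids)]
    n = len(clusters)
    # For each cluster, take the worst (largest) similarity ratio
    # against any other cluster, then average over all clusters.
    return sum(
        max((sigma[i] + sigma[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(n) if j != i)
        for i in range(n)
    ) / n
</syntaxhighlight>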
==== [[Dunn index]] ====
The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio between the minimal inter-cluster distance and the maximal intra-cluster distance. For each cluster partition, the Dunn index can be calculated by the following formula:<ref>{{Cite journal | last = Dunn | first = J. | title = Well separated clusters and optimal fuzzy partitions | journal = Journal of Cybernetics | year = 1974 | volume = 4 | pages = 95–104 | doi = 10.1080/01969727408546059 }}</ref>
:<math>
D = \frac{\min_{1 \leq i < j \leq n} d(i,j)}{\max_{1 \leq k \leq n} d^{\prime}(k)} \,,
</math>
where ''d''(''i'',''j'') represents the distance between clusters ''i'' and ''j'', and ''d''′(''k'') measures the intra-cluster distance of cluster ''k''. The inter-cluster distance ''d''(''i'',''j'') between two clusters may be any of a number of distance measures, such as the distance between the [[centroid]]s of the clusters. Similarly, the intra-cluster distance ''d''′(''k'') may be measured in a variety of ways, such as the maximal distance between any pair of elements in cluster ''k''. Since internal criteria seek clusters with high intra-cluster similarity and low inter-cluster similarity, algorithms that produce clusters with a high Dunn index are more desirable.

==== [[Silhouette (clustering)|Silhouette coefficient]] ====
The silhouette coefficient contrasts the average distance to elements in the same cluster with the average distance to elements in other clusters. Objects with a high silhouette value are considered well clustered; objects with a low value may be outliers. This index works well with ''k''-means clustering, and is also used to determine the optimal number of clusters.<ref>{{cite journal |title=Silhouettes: A graphical aid to the interpretation and validation of cluster analysis |author=Peter J. Rousseeuw |journal=Journal of Computational and Applied Mathematics |volume=20 |pages=53–65 |year=1987 |doi=10.1016/0377-0427(87)90125-7}}</ref>
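The Dunn index translates directly into code once particular inter- and intra-cluster distances are chosen. The following Python sketch makes two illustrative assumptions, as noted in the text above: centroid distance between clusters for ''d''(''i'',''j''), and the cluster diameter (maximal pairwise distance) for ''d''′(''k''):

<syntaxhighlight lang="python">
import numpy as np
from scipy.spatial.distance import pdist

def dunn_index(clusters):
    # clusters: list of (m_k, d) NumPy arrays, each with at least two points
    # (singleton clusters would need a separate convention for d'(k)).
    # d(i, j): here, the distance between centroids; any inter-cluster
    # distance measure could be substituted.
    centroids = np.array([c.mean(axis=0) for c in clusters])
    min_inter = pdist(centroids).min()
    # d'(k): here, the diameter of cluster k, i.e. the maximal
    # pairwise distance between its elements.
    max_intra = max(pdist(c).max() for c in clusters)
    return min_inter / max_intra
</syntaxhighlight>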
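In practice the silhouette coefficient is usually computed with library support; for example, scikit-learn's <code>silhouette_score</code> averages the coefficient over all samples and can be used to compare candidate numbers of clusters. The synthetic data set and the range of ''k'' below are arbitrary choices for illustration:

<syntaxhighlight lang="python">
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data purely for illustration.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Run k-means for several candidate numbers of clusters and
# keep the k with the highest mean silhouette coefficient.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"best k = {best_k} (silhouette = {scores[best_k]:.3f})")
</syntaxhighlight>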