=== Density-based clustering ===
In density-based clustering,<ref>{{Cite journal
 | author1-link = Hans-Peter Kriegel | first1 = Hans-Peter | last1 = Kriegel | first2 = Peer | last2 = Kröger | first3 = Jörg | last3 = Sander | first4 = Arthur | last4 = Zimek|author-link4=Arthur Zimek
 | title = Density-based Clustering
 | journal = WIREs Data Mining and Knowledge Discovery
 | volume = 1
 | issue = 3
 | year = 2011
 | pages = 231–240
 | url = http://wires.wiley.com/WileyCDA/WiresArticle/wisId-WIDM30.html
 | doi = 10.1002/widm.30
| s2cid = 36920706 }}</ref> clusters are defined as areas of higher density than the remainder of the data set. Objects in the sparse areas that are required to separate clusters are usually considered to be noise and border points.

The most popular<ref>[http://academic.research.microsoft.com/CSDirectory/paper_category_7.htm Microsoft academic search: most cited data mining articles] {{webarchive|url=https://web.archive.org/web/20100421170848/http://academic.research.microsoft.com/CSDirectory/Paper_category_7.htm |date=2010-04-21 }}: DBSCAN is on rank 24, when accessed on: 4/18/2010</ref> density-based clustering method is [[DBSCAN]].<ref>{{Cite conference | author1-first = Martin | author1-last = Ester | author2-link = Hans-Peter Kriegel | author2-first = Hans-Peter | author2-last = Kriegel | author3-first = Jörg | author3-last = Sander | author4-first = Xiaowei | author4-last = Xu | title = A density-based algorithm for discovering clusters in large spatial databases with noise | pages = 226–231 | editor1-first = Evangelos | editor1-last = Simoudis | editor2-first = Jiawei | editor2-last = Han | editor3-first = Usama M. | editor3-last = Fayyad | book-title = Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96) | publisher = [[AAAI Press]] | year = 1996 | isbn = 1-57735-004-9 }}</ref> In contrast to many newer methods, it features a well-defined cluster model called "density-reachability". Similar to linkage-based clustering, it is based on connecting points within certain distance thresholds. However, it only connects points that satisfy a density criterion, in the original variant defined as a minimum number of other objects within this radius. A cluster consists of all density-connected objects (which can form a cluster of an arbitrary shape, in contrast to many other methods) plus all objects that are within these objects' range. 
DBSCAN also has fairly low complexity, requiring only a linear number of range queries on the database, and it discovers essentially the same result in each run (it is [[deterministic algorithm|deterministic]] for core and noise points, but not for border points), so there is no need to run it multiple times. [[OPTICS algorithm|OPTICS]]<ref>{{Cite conference
 | first1 = Mihael | last1 = Ankerst | first2 = Markus M. | last2 = Breunig | author3-link = Hans-Peter Kriegel | first3 = Hans-Peter | last3 = Kriegel | first4 = Jörg | last4 = Sander
 | title = OPTICS: Ordering Points To Identify the Clustering Structure
 | year = 1999
 | pages = 49–60
 | book-title = ACM SIGMOD international conference on Management of data
 | publisher = [[ACM Press]]
 | citeseerx = 10.1.1.129.6542
}}</ref> is a generalization of DBSCAN that removes the need to choose an appropriate value for the range parameter <math>\varepsilon</math>, and produces a hierarchical result related to that of [[hierarchical clustering|linkage clustering]]. DeLi-Clu (Density-Link-Clustering)<ref name="ReferenceA">{{Cite conference| doi = 10.1007/11731139_16| isbn = 978-3-540-33206-0| contribution = DeLi-Clu: Boosting Robustness, Completeness, Usability, and Efficiency of Hierarchical Clustering by a Closest Pair Ranking| year = 2006| last1 = Achtert | first1 = E.| last2 = Böhm| series = Lecture Notes in Computer Science | first2 = C.| last3 = Kröger | first3 = P.| pages = 119–128|title= Advances in Knowledge Discovery and Data Mining| volume = 3918| citeseerx = 10.1.1.64.1161}}</ref> combines ideas from [[single-linkage clustering]] and OPTICS, eliminating the <math>\varepsilon</math> parameter entirely and offering performance improvements over OPTICS by using an [[R-tree]] index.
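
The density-reachability model described above can be illustrated with a minimal Python sketch. It is not the reference implementation: the names <code>region_query</code> and <code>dbscan</code>, the brute-force range query, and the convention that a point counts toward its own neighbourhood size are assumptions made for illustration. Each point triggers at most one range query, matching the linear number of queries mentioned above.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i].

    Brute-force scan (assumption); an index structure would replace this.
    The point itself is included in the result.
    """
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not a core point; may later be relabelled as a border point.
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not core
            if labels[j] is not None:
                continue  # already assigned, no further range query
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core point: keep expanding
    return labels
```

For example, calling <code>dbscan(points, eps=1.5, min_pts=3)</code> on two well-separated groups of four points plus one isolated point would assign one label per group and mark the isolated point as noise. Because a border point is assigned to whichever cluster reaches it first, its label can depend on the processing order, which is the non-determinism noted above.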

The key drawback of [[DBSCAN]] and [[OPTICS algorithm|OPTICS]] is that they expect some kind of density drop to detect cluster borders. On data sets with, for example, overlapping Gaussian distributions – a common use case in artificial data – the cluster borders produced by these algorithms will often look arbitrary, because the cluster density decreases continuously. On a data set consisting of mixtures of Gaussians, these algorithms are nearly always outperformed by methods such as [[Expectation–maximization algorithm|EM clustering]] that are able to precisely model this kind of data.

[[Mean-shift]] is a clustering approach where each object is moved to the densest area in its vicinity, based on [[kernel density estimation]]. Eventually, objects converge to local maxima of density. Similar to k-means clustering, these "density attractors" can serve as representatives for the data set, but like DBSCAN, mean-shift can detect clusters of arbitrary shape. Due to the expensive iterative procedure and density estimation, mean-shift is usually slower than DBSCAN or k-means. In addition, the applicability of the mean-shift algorithm to multidimensional data is hindered by the non-smooth behaviour of the kernel density estimate, which results in over-fragmentation of cluster tails.<ref name="ReferenceA"/>
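
The iterative move toward a density maximum can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than a definitive implementation: the Gaussian kernel, the names <code>shift_to_mode</code> and <code>mean_shift</code>, and the parameters <code>bandwidth</code>, <code>tol</code> and <code>merge_tol</code> are choices made here, not part of the description above.

```python
import math


def shift_to_mode(x, data, bandwidth, max_iter=100, tol=1e-6):
    """Move x uphill until it (approximately) stops at a local density maximum.

    Each step replaces x by the kernel-weighted mean of the data,
    using a Gaussian kernel of the given bandwidth (assumption).
    """
    for _ in range(max_iter):
        weights = [
            math.exp(-math.dist(x, p) ** 2 / (2 * bandwidth ** 2)) for p in data
        ]
        total = sum(weights)
        new_x = tuple(
            sum(w * p[d] for w, p in zip(weights, data)) / total
            for d in range(len(x))
        )
        if math.dist(new_x, x) < tol:
            return new_x
        x = new_x
    return x


def mean_shift(data, bandwidth, merge_tol=1e-2):
    """Return (modes, labels): the density attractors and each point's attractor index."""
    modes, labels = [], []
    for p in data:
        m = shift_to_mode(p, data, bandwidth)
        for k, existing in enumerate(modes):
            if math.dist(m, existing) < merge_tol:
                labels.append(k)  # converged to an already known attractor
                break
        else:
            modes.append(m)  # first point to reach this attractor
            labels.append(len(modes) - 1)
    return modes, labels
```

Every point is shifted independently and each step evaluates the kernel against the whole data set, which illustrates why the procedure tends to be slower than DBSCAN or k-means. Points whose attractors coincide (within <code>merge_tol</code>) are placed in the same cluster.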

<gallery widths="200" heights="200" caption="Density-based clustering examples">
File:DBSCAN-density-data.svg|Density-based clustering with [[DBSCAN]]
File:DBSCAN-Gaussian-data.svg|[[DBSCAN]] assumes clusters of similar density, and may have problems separating nearby clusters.
File:OPTICS-Gaussian-data.svg|[[OPTICS algorithm|OPTICS]] is a DBSCAN variant that improves the handling of clusters with different densities.
</gallery>