Editing Cluster analysis (section)

=== Recent developments ===
In recent years, considerable effort has been put into improving the performance of existing algorithms.<ref>{{cite conference | first = D. | last = Sculley |title=Web-scale k-means clustering |conference=Proc. 19th WWW |year=2010}}</ref><ref>{{cite journal | last1 = Huang | first1 = Z. | s2cid = 11323096 | year = 1998 | title = Extensions to the ''k''-means algorithm for clustering large data sets with categorical values | journal = Data Mining and Knowledge Discovery | volume = 2 | issue = 3| pages = 283–304 | doi = 10.1023/A:1009769707641 }}</ref> Among them are ''CLARANS'',<ref>R. Ng and J. Han. "Efficient and effective clustering method for spatial data mining". In: Proceedings of the 20th VLDB Conference, pages 144–155, Santiago, Chile, 1994.</ref> and ''[[Birch (data clustering)|BIRCH]]''.<ref>Tian Zhang, Raghu Ramakrishnan, Miron Livny. "[http://www.cs.du.edu/~leut/4423/papers/zhang96birch.pdf An Efficient Data Clustering Method for Very Large Databases]." In: Proc. Int'l Conf. on Management of Data, ACM SIGMOD, pp. 103–114.</ref> With the recent need to process larger and larger data sets (also known as [[big data]]), the willingness to trade semantic meaning of the generated clusters for performance has been increasing. This led to the development of pre-clustering methods such as [[canopy clustering algorithm|canopy clustering]], which can process huge data sets efficiently, but the resulting "clusters" are merely a rough pre-partitioning of the data set to then analyze the partitions with existing slower methods such as [[k-means clustering]].

For [[high-dimensional data]], many of the existing methods fail due to the [[curse of dimensionality]], which renders particular distance functions problematic in high-dimensional spaces. This led to new [[clustering high-dimensional data|clustering algorithms for high-dimensional data]] that focus on [[subspace clustering]] (where only some attributes are used, and cluster models include the relevant attributes for the cluster) and [[correlation clustering]] that also looks for arbitrary rotated ("correlated") subspace clusters that can be modeled by giving a [[correlation]] of their attributes.<ref>{{cite journal|last1=Kriegel|first1=Hans-Peter|author-link1=Hans-Peter Kriegel|last2=Kröger|first2=Peer|last3=Zimek|first3=Arthur|author-link3=Arthur Zimek|title=Subspace clustering|journal=Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery|date=July 2012|volume=2|issue=4|pages=351–364|doi=10.1002/widm.1057|s2cid=7241355}}</ref> Examples for such clustering algorithms are CLIQUE<ref>{{Cite journal | last1 = Agrawal | first1 = R. | last2 = Gehrke | first2 = J. | last3 = Gunopulos | first3 = D. | last4 = Raghavan | first4 = P. | title = Automatic Subspace Clustering of High Dimensional Data | doi = 10.1007/s10618-005-1396-1 | journal = Data Mining and Knowledge Discovery | volume = 11 | pages = 5–33 | year = 2005 | citeseerx = 10.1.1.131.5152 | s2cid = 9289572 }}</ref> and [[SUBCLU]].<ref>Karin Kailing, [[Hans-Peter Kriegel]] and Peer Kröger. ''Density-Connected Subspace Clustering for High-Dimensional Data''. In: ''Proc. SIAM Int. Conf. on Data Mining (SDM'04)'', pp. 246–257, 2004.</ref>

Ideas from density-based clustering methods (in particular the [[DBSCAN]]/[[OPTICS]] family of algorithms) have been adapted to subspace clustering (HiSC,<ref>{{Cite conference| doi = 10.1007/11871637_42| contribution = Finding Hierarchies of Subspace Clusters| isbn = 978-3-540-45374-1| year = 2006| last1 = Achtert | first1 = E.| last2 = Böhm | first2 = C.| last3 = Kriegel | first3 = H.-P.| series = Lecture Notes in Computer Science | author-link3 =Hans-Peter Kriegel| last4 = Kröger | first4 = P.| last5 = Müller-Gorman | first5 = I.| last6 = Zimek | first6 = A.|author-link6=Arthur Zimek| pages = 446–453| title= Knowledge Discovery in Databases: PKDD 2006| volume = 4213| citeseerx = 10.1.1.705.2956}}</ref> hierarchical subspace clustering and DiSH<ref>{{Cite conference| doi = 10.1007/978-3-540-71703-4_15| contribution = Detection and Visualization of Subspace Cluster Hierarchies| isbn = 978-3-540-71702-7| year = 2007| last1 = Achtert | first1 = E.| last2 = Böhm | first2 = C.| last3 = Kriegel | first3 = H. P. | author-link3 =Hans-Peter Kriegel| series = Lecture Notes in Computer Science| last4 = Kröger | first4 = P.| last5 = Müller-Gorman | first5 = I.| last6 = Zimek | first6 = A.|author-link6=Arthur Zimek| volume = 4443| pages = 152–163| title= Advances in Databases: Concepts, Systems and Applications| citeseerx = 10.1.1.70.7843}}</ref>) and correlation clustering (HiCO,<ref>{{Cite book| doi = 10.1109/SSDBM.2006.35| isbn = 978-0-7695-2590-7| year = 2006| last1 = Achtert | first1 = E.| last2 = Böhm | first2 = C.| last3 = Kröger | first3 = P.| last4 = Zimek | first4 = A.| title = 18th International Conference on Scientific and Statistical Database Management (SSDBM'06)| chapter = Mining Hierarchies of Correlation Clusters|author-link4=Arthur Zimek| pages = 119–128| citeseerx = 10.1.1.707.7872| s2cid = 2679909}}</ref> hierarchical correlation clustering, 4C<ref>{{Cite book | last1 = Böhm | first1 = C. | last2 = Kailing | first2 = K. | last3 = Kröger | first3 = P. | last4 = Zimek | first4 = A. |author-link4=Arthur Zimek| chapter = Computing Clusters of Correlation Connected objects | doi = 10.1145/1007568.1007620 | title = Proceedings of the 2004 ACM SIGMOD international conference on Management of data - SIGMOD '04 | pages = 455 | year = 2004 | isbn = 978-1581138597 | citeseerx = 10.1.1.5.1279 | s2cid = 6411037 }}</ref> using "correlation connectivity" and ERiC<ref>{{Cite book | last1 = Achtert | first1 = E. | last2 = Bohm | first2 = C. | last3 = Kriegel | first3 = H. P.  | author-link3=Hans-Peter Kriegel| last4 = Kröger | first4 = P. | last5 = Zimek | first5 = A. |author-link5=Arthur Zimek| doi = 10.1109/SSDBM.2007.21 | chapter = On Exploring Complex Relationships of Correlation Clusters | title = 19th International Conference on Scientific and Statistical Database Management (SSDBM 2007) | pages = 7 | year = 2007 | isbn = 978-0-7695-2868-7 | citeseerx = 10.1.1.71.5021 | s2cid = 1554722 }}</ref> exploring hierarchical density-based correlation clusters).

Several different clustering systems based on [[mutual information]] have been proposed. One is Marina Meilă's ''[[variation of information]]'' metric;<ref>{{Cite conference|last=Meilă|first=Marina | contribution=Comparing Clusterings by the Variation of Information |title=Learning Theory and Kernel Machines|year=2003|pages=173–187|doi=10.1007/978-3-540-45167-9_14|series=Lecture Notes in Computer Science|isbn=978-3-540-40720-1|volume=2777}}</ref> another provides hierarchical clustering.<ref>{{cite arXiv| first1 = Alexander | last1 = Kraskov | first2 = Harald | last2 = Stögbauer | first3 = Ralph G. | last3 = Andrzejak | first4 = Peter | last4 = Grassberger | title = Hierarchical Clustering Based on Mutual Information | date = 1 December 2003 | arxiv= q-bio/0311039}}</ref> Using genetic algorithms, a wide range of different fit-functions can be optimized, including mutual information.<ref>{{cite journal | last = Auffarth | first = B. | title = Clustering by a Genetic Algorithm with Biased Mutation Operator | journal = Wcci Cec | publisher = IEEE | date = July 18–23, 2010 |url=http://www.diva-portal.org/smash/get/diva2:456742/FULLTEXT02}}</ref> Also [[belief propagation]], a recent development in [[computer science]] and [[statistical physics]], has led to the creation of new types of clustering algorithms.<ref>{{Cite journal| first1 =  B. J. | last1 = Frey | first2 = D. | last2 = Dueck| title=Clustering by Passing Messages Between Data Points|journal=Science| volume=315| pages=972–976|doi= 10.1126/science.1136800|year= 2007|issue= 5814|pmid= 17218491 | bibcode = 2007Sci...315..972F | citeseerx = 10.1.1.121.3145 | s2cid = 6502291 }}</ref>