Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Cluster analysis
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
=== Recent developments === In recent years, considerable effort has been put into improving the performance of existing algorithms.<ref>{{cite conference | first = D. | last = Sculley |title=Web-scale k-means clustering |conference=Proc. 19th WWW |year=2010}}</ref><ref>{{cite journal | last1 = Huang | first1 = Z. | s2cid = 11323096 | year = 1998 | title = Extensions to the ''k''-means algorithm for clustering large data sets with categorical values | journal = Data Mining and Knowledge Discovery | volume = 2 | issue = 3| pages = 283–304 | doi = 10.1023/A:1009769707641 }}</ref> Among them are ''CLARANS'',<ref>R. Ng and J. Han. "Efficient and effective clustering method for spatial data mining". In: Proceedings of the 20th VLDB Conference, pages 144–155, Santiago, Chile, 1994.</ref> and ''[[Birch (data clustering)|BIRCH]]''.<ref>Tian Zhang, Raghu Ramakrishnan, Miron Livny. "[http://www.cs.du.edu/~leut/4423/papers/zhang96birch.pdf An Efficient Data Clustering Method for Very Large Databases]." In: Proc. Int'l Conf. on Management of Data, ACM SIGMOD, pp. 103–114.</ref> With the recent need to process larger and larger data sets (also known as [[big data]]), the willingness to trade semantic meaning of the generated clusters for performance has been increasing. This led to the development of pre-clustering methods such as [[canopy clustering algorithm|canopy clustering]], which can process huge data sets efficiently, but the resulting "clusters" are merely a rough pre-partitioning of the data set to then analyze the partitions with existing slower methods such as [[k-means clustering]]. For [[high-dimensional data]], many of the existing methods fail due to the [[curse of dimensionality]], which renders particular distance functions problematic in high-dimensional spaces. This led to new [[clustering high-dimensional data|clustering algorithms for high-dimensional data]] that focus on [[subspace clustering]] (where only some attributes are used, and cluster models include the relevant attributes for the cluster) and [[correlation clustering]] that also looks for arbitrary rotated ("correlated") subspace clusters that can be modeled by giving a [[correlation]] of their attributes.<ref>{{cite journal|last1=Kriegel|first1=Hans-Peter|author-link1=Hans-Peter Kriegel|last2=Kröger|first2=Peer|last3=Zimek|first3=Arthur|author-link3=Arthur Zimek|title=Subspace clustering|journal=Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery|date=July 2012|volume=2|issue=4|pages=351–364|doi=10.1002/widm.1057|s2cid=7241355}}</ref> Examples for such clustering algorithms are CLIQUE<ref>{{Cite journal | last1 = Agrawal | first1 = R. | last2 = Gehrke | first2 = J. | last3 = Gunopulos | first3 = D. | last4 = Raghavan | first4 = P. | title = Automatic Subspace Clustering of High Dimensional Data | doi = 10.1007/s10618-005-1396-1 | journal = Data Mining and Knowledge Discovery | volume = 11 | pages = 5–33 | year = 2005 | citeseerx = 10.1.1.131.5152 | s2cid = 9289572 }}</ref> and [[SUBCLU]].<ref>Karin Kailing, [[Hans-Peter Kriegel]] and Peer Kröger. ''Density-Connected Subspace Clustering for High-Dimensional Data''. In: ''Proc. SIAM Int. Conf. on Data Mining (SDM'04)'', pp. 246–257, 2004.</ref> Ideas from density-based clustering methods (in particular the [[DBSCAN]]/[[OPTICS]] family of algorithms) have been adapted to subspace clustering (HiSC,<ref>{{Cite conference| doi = 10.1007/11871637_42| contribution = Finding Hierarchies of Subspace Clusters| isbn = 978-3-540-45374-1| year = 2006| last1 = Achtert | first1 = E.| last2 = Böhm | first2 = C.| last3 = Kriegel | first3 = H.-P.| series = Lecture Notes in Computer Science | author-link3 =Hans-Peter Kriegel| last4 = Kröger | first4 = P.| last5 = Müller-Gorman | first5 = I.| last6 = Zimek | first6 = A.|author-link6=Arthur Zimek| pages = 446–453| title= Knowledge Discovery in Databases: PKDD 2006| volume = 4213| citeseerx = 10.1.1.705.2956}}</ref> hierarchical subspace clustering and DiSH<ref>{{Cite conference| doi = 10.1007/978-3-540-71703-4_15| contribution = Detection and Visualization of Subspace Cluster Hierarchies| isbn = 978-3-540-71702-7| year = 2007| last1 = Achtert | first1 = E.| last2 = Böhm | first2 = C.| last3 = Kriegel | first3 = H. P. | author-link3 =Hans-Peter Kriegel| series = Lecture Notes in Computer Science| last4 = Kröger | first4 = P.| last5 = Müller-Gorman | first5 = I.| last6 = Zimek | first6 = A.|author-link6=Arthur Zimek| volume = 4443| pages = 152–163| title= Advances in Databases: Concepts, Systems and Applications| citeseerx = 10.1.1.70.7843}}</ref>) and correlation clustering (HiCO,<ref>{{Cite book| doi = 10.1109/SSDBM.2006.35| isbn = 978-0-7695-2590-7| year = 2006| last1 = Achtert | first1 = E.| last2 = Böhm | first2 = C.| last3 = Kröger | first3 = P.| last4 = Zimek | first4 = A.| title = 18th International Conference on Scientific and Statistical Database Management (SSDBM'06)| chapter = Mining Hierarchies of Correlation Clusters|author-link4=Arthur Zimek| pages = 119–128| citeseerx = 10.1.1.707.7872| s2cid = 2679909}}</ref> hierarchical correlation clustering, 4C<ref>{{Cite book | last1 = Böhm | first1 = C. | last2 = Kailing | first2 = K. | last3 = Kröger | first3 = P. | last4 = Zimek | first4 = A. |author-link4=Arthur Zimek| chapter = Computing Clusters of Correlation Connected objects | doi = 10.1145/1007568.1007620 | title = Proceedings of the 2004 ACM SIGMOD international conference on Management of data - SIGMOD '04 | pages = 455 | year = 2004 | isbn = 978-1581138597 | citeseerx = 10.1.1.5.1279 | s2cid = 6411037 }}</ref> using "correlation connectivity" and ERiC<ref>{{Cite book | last1 = Achtert | first1 = E. | last2 = Bohm | first2 = C. | last3 = Kriegel | first3 = H. P. | author-link3=Hans-Peter Kriegel| last4 = Kröger | first4 = P. | last5 = Zimek | first5 = A. |author-link5=Arthur Zimek| doi = 10.1109/SSDBM.2007.21 | chapter = On Exploring Complex Relationships of Correlation Clusters | title = 19th International Conference on Scientific and Statistical Database Management (SSDBM 2007) | pages = 7 | year = 2007 | isbn = 978-0-7695-2868-7 | citeseerx = 10.1.1.71.5021 | s2cid = 1554722 }}</ref> exploring hierarchical density-based correlation clusters). Several different clustering systems based on [[mutual information]] have been proposed. One is Marina Meilă's ''[[variation of information]]'' metric;<ref>{{Cite conference|last=Meilă|first=Marina | contribution=Comparing Clusterings by the Variation of Information |title=Learning Theory and Kernel Machines|year=2003|pages=173–187|doi=10.1007/978-3-540-45167-9_14|series=Lecture Notes in Computer Science|isbn=978-3-540-40720-1|volume=2777}}</ref> another provides hierarchical clustering.<ref>{{cite arXiv| first1 = Alexander | last1 = Kraskov | first2 = Harald | last2 = Stögbauer | first3 = Ralph G. | last3 = Andrzejak | first4 = Peter | last4 = Grassberger | title = Hierarchical Clustering Based on Mutual Information | date = 1 December 2003 | arxiv= q-bio/0311039}}</ref> Using genetic algorithms, a wide range of different fit-functions can be optimized, including mutual information.<ref>{{cite journal | last = Auffarth | first = B. | title = Clustering by a Genetic Algorithm with Biased Mutation Operator | journal = Wcci Cec | publisher = IEEE | date = July 18–23, 2010 |url=http://www.diva-portal.org/smash/get/diva2:456742/FULLTEXT02}}</ref> Also [[belief propagation]], a recent development in [[computer science]] and [[statistical physics]], has led to the creation of new types of clustering algorithms.<ref>{{Cite journal| first1 = B. J. | last1 = Frey | first2 = D. | last2 = Dueck| title=Clustering by Passing Messages Between Data Points|journal=Science| volume=315| pages=972–976|doi= 10.1126/science.1136800|year= 2007|issue= 5814|pmid= 17218491 | bibcode = 2007Sci...315..972F | citeseerx = 10.1.1.121.3145 | s2cid = 6502291 }}</ref>
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)