=== Connectivity-based clustering (hierarchical clustering) ===
{{Main|Hierarchical clustering}}
Connectivity-based clustering, also known as ''[[hierarchical clustering]]'', is based on the core idea of objects being more related to nearby objects than to objects farther away. These algorithms connect "objects" to form "clusters" based on their distance. A cluster can be described largely by the maximum distance needed to connect parts of the cluster. At different distances, different clusters will form, which can be represented using a [[dendrogram]], which explains where the common name "[[hierarchical clustering]]" comes from: these algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which the clusters merge, while the objects are placed along the x-axis such that the clusters don't mix.

Connectivity-based clustering is a whole family of methods that differ by the way distances are computed. Apart from the usual choice of [[distance function]]s, the user also needs to decide on the linkage criterion (since a cluster consists of multiple objects, there are multiple candidates to compute the distance) to use. Popular choices are known as [[single-linkage clustering]] (the minimum of object distances), [[complete linkage clustering]] (the maximum of object distances), and [[UPGMA]] or [[WPGMA]] ("Unweighted or Weighted Pair Group Method with Arithmetic Mean", also known as average linkage clustering). Furthermore, hierarchical clustering can be agglomerative (starting with single elements and aggregating them into clusters) or divisive (starting with the complete data set and dividing it into partitions).

These methods will not produce a unique partitioning of the data set, but a hierarchy from which the user still needs to choose appropriate clusters. They are not very robust towards outliers, which will either show up as additional clusters or even cause other clusters to merge (known as "chaining phenomenon", in particular with [[single-linkage clustering]]). In the general case, the complexity is <math>\mathcal{O}(n^3)</math> for agglomerative clustering and <math>\mathcal{O}(2^{n-1})</math> for [[divisive clustering]],<ref>{{cite book | last = Everitt | first = Brian | title = Cluster analysis | publisher = Wiley | location = Chichester, West Sussex, U.K | year = 2011 | isbn = 9780470749913 }}</ref> which makes them too slow for large data sets. For some special cases, optimal efficient methods (of complexity <math>\mathcal{O}(n^2)</math>) are known: SLINK<ref>{{cite journal | first = R. | last = Sibson | title = SLINK: an optimally efficient algorithm for the single-link cluster method | journal = The Computer Journal | volume = 16 | issue = 1 | pages = 30–34 | year = 1973 | publisher = British Computer Society | url = http://www.cs.gsu.edu/~wkim/index_files/papers/sibson.pdf | doi = 10.1093/comjnl/16.1.30 }}</ref> for single-linkage and CLINK<ref>{{cite journal | first = D. | last = Defays | title = An efficient algorithm for a complete link method | journal = The Computer Journal | volume = 20 | issue = 4 | pages = 364–366 | year = 1977 | publisher = British Computer Society | doi = 10.1093/comjnl/20.4.364 }}</ref> for complete-linkage clustering.

<gallery caption="Linkage clustering examples" widths="200px" heights="200px">
File:SLINK-Gaussian-data.svg|Single-linkage on Gaussian data. At 35 clusters, the biggest cluster starts fragmenting into smaller parts, while before it was still connected to the second largest due to the single-link effect.
File:SLINK-density-data.svg|Single-linkage on density-based clusters. 20 clusters extracted, most of which contain single elements, since linkage clustering does not have a notion of "noise".
</gallery>
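The following is an illustrative sketch (not taken from the sources cited above) of agglomerative clustering with a user-chosen linkage criterion, using SciPy's <code>scipy.cluster.hierarchy</code> routines; the toy data set, the <code>method</code> parameter, and the cut distance <code>t=2.0</code> are arbitrary example choices.

<syntaxhighlight lang="python">
# Illustrative sketch: agglomerative (bottom-up) hierarchical clustering
# with a chosen linkage criterion, using SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Toy data: two well-separated Gaussian blobs in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
               rng.normal(loc=(5, 5), scale=0.5, size=(50, 2))])

# Build the merge hierarchy. 'single' = minimum object distance,
# 'complete' = maximum object distance, 'average' = UPGMA.
Z = linkage(X, method='single', metric='euclidean')

# Cut the dendrogram at a chosen distance to obtain one flat partition
# out of the hierarchy.
labels = fcluster(Z, t=2.0, criterion='distance')
print(labels)

# The dendrogram's y-axis shows the distances at which clusters merge.
dendrogram(Z)
plt.show()
</syntaxhighlight>

Cutting the hierarchy at a different distance <code>t</code>, or switching <code>method</code> to <code>'complete'</code> or <code>'average'</code>, corresponds to the other linkage variants discussed above.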