Editing Hierarchical clustering (section)

== Divisive clustering ==

The basic principle of divisive clustering was published as the DIANA (DIvisive ANAlysis clustering) algorithm.<ref>{{cite book |first1=L. |last1=Kaufman |first2=P.J. |last2=Rousseeuw |chapter=6. Divisive Analysis (Program DIANA) |title=Finding Groups in Data: An Introduction to Cluster Analysis |chapter-url=https://books.google.com/books?id=YeFQHiikNo0C&pg=PA253 |orig-year=1990 |date=2009 |publisher=Wiley |isbn=978-0-470-31748-8 |pages=253–279}}</ref> Initially, all data is in the same cluster, and the largest cluster is split until every object is separate.
Because there exist <math>O(2^n)</math> ways of splitting each cluster, heuristics are needed. DIANA chooses the object with the maximum average dissimilarity and then moves all objects to this cluster that are more similar to the new cluster than to the remainder.

Informally, DIANA is not so much a process of "dividing" as it is of "hollowing out": each iteration, an existing cluster (e.g. the initial cluster of the entire dataset) is chosen to form a new cluster inside of it. Objects progressively move to this nested cluster, and hollow out the existing cluster. Eventually, all that's left inside a cluster is nested clusters that grew there, without it owning any loose objects by itself.

Formally, DIANA operates in the following steps:

# Let <math>C_0 = \{1\dots n\}</math> be the set of all <math>n</math> object indices and <math>\mathcal{C} = \{C_0\}</math> the set of all formed clusters so far.
# Iterate the following until <math>|\mathcal{C}| = n</math>:
## Find the current cluster with 2 or more objects that has the largest diameter: <math>C_* = \arg\max_{C\in \mathcal{C}} \max_{i_1,i_2\in C} \delta(i_1,i_2)</math>
## Find the object in this cluster with the most dissimilarity to the rest of the cluster: <math>i^* = \arg\max_{i\in C_*} \frac{1}{|C_*|-1}\sum_{j\in C_*\setminus\{i\}} \delta(i,j)</math>
## Pop <math>i^*</math> from its old cluster <math>C_*</math> and put it into a new ''splinter group'' <math>C_\textrm{new} = \{i^*\}</math>.
## As long as <math>C_*</math> isn't empty, keep migrating objects from <math>C_*</math> to add them to <math>C_\textrm{new}</math>. To choose which objects to migrate, don't just consider dissimilarity to <math>C_*</math>, but also adjust for dissimilarity to the splinter group: let <math>i^* = \arg\max_{i\in C} D(i)</math> where we define <math>D(i) = \frac{1}{|C_*|-1}\sum_{j\in C_*\setminus\{i\}} \delta(i,j) - \frac{1}{|C_\textrm{new}|}\sum_{j\in C_\textrm{new}} \delta(i,j)</math>, then either stop iterating when <math>D(i^*) < 0</math>, or migrate <math>i^*</math>.
## Add <math>C_\textrm{new}</math> to <math>\mathcal{C}</math>.

Intuitively, <math>D(i)</math> above measures how strongly an object wants to leave its current cluster, but it is attenuated when the object wouldn't fit in the splinter group either. Such objects will likely start their own splinter group eventually.

The dendrogram of DIANA can be constructed by letting the splinter group <math>C_\textrm{new}</math> be a child of the hollowed-out cluster <math>C_*</math> each time. This constructs a tree with <math>C_0</math> as its root and <math>n</math> unique single-object clusters as its leaves.