=== Cluster tendency ===
{{more citations needed section|date=April 2025}}
Measuring cluster tendency means assessing the degree to which clusters exist in the data to be clustered; it may be performed as an initial test before attempting clustering. One way to do this is to compare the data against random data, since random data should, on average, not contain clusters.{{verification needed|[[Random cluster model]] how does this fit here??|date=April 2025}}

*'''[[Hopkins statistic]]'''
:There are multiple formulations of the [[Hopkins statistic]].<ref>{{Cite journal
  | title = A new method for determining the type of distribution of plant individuals
  | last1 = Hopkins | first1 = Brian
  | last2 = Skellam | first2 = John Gordon
  | journal = Annals of Botany
  | volume =18
  | number = 2
  | pages = 213–227
  | year = 1954
  | publisher = Annals Botany Co
  | doi=10.1093/oxfordjournals.aob.a083391
}}</ref> A typical one is as follows.<ref>{{Cite book
 | last = Banerjee | first = A.
 | title = 2004 IEEE International Conference on Fuzzy Systems (IEEE Cat. No.04CH37542)
 | chapter = Validating clusters using the Hopkins statistic
 | s2cid = 36701919
 | volume = 1
 | pages = 149–153
 | doi = 10.1109/FUZZY.2004.1375706
| year = 2004
 | isbn = 978-0-7803-8353-1
 }}</ref> Let <math>X</math> be the set of <math>n</math> data points in <math>d</math>-dimensional space. Consider a random sample (without replacement) of <math>m \ll n</math> data points with members <math>x_i</math>. Also generate a set <math>Y</math> of <math>m</math> uniformly randomly distributed data points. Now define two distance measures: <math>u_i</math>, the distance of <math>y_i \in Y</math> from its nearest neighbor in <math>X</math>, and <math>w_i</math>, the distance of the sampled point <math>x_i \in X</math> from its nearest neighbor in <math>X</math> other than itself. The Hopkins statistic is then defined as:
:: <math>
H=\frac{\sum_{i=1}^m{u_i^d}}{\sum_{i=1}^m{u_i^d}+\sum_{i=1}^m{w_i^d}} \,,
</math>
:With this definition, uniform random data should tend to have values near 0.5, and clustered data should tend to have values closer to 1.
:However, data containing just a single Gaussian will also score close to 1, because the statistic measures deviation from a ''uniform'' distribution, not [[Multimodal distribution|multimodality]]. This makes the statistic largely useless in practice, as real data is never remotely uniform.
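
The formulation above can be illustrated with a minimal Python sketch. It is not a reference implementation and makes several assumptions that the text does not specify: the function name <code>hopkins</code> and its parameters are hypothetical, the uniform points <math>Y</math> are drawn from the axis-aligned bounding box of <math>X</math>, distances are Euclidean, and nearest neighbors are found by brute force.

```python
import numpy as np

def hopkins(X, m, rng=None):
    """Sketch of the Hopkins statistic H for an (n, d) array X.

    Hypothetical helper: the bounding-box sampling window and the
    Euclidean metric are assumptions, not part of the definition.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if not 0 < m < n:
        raise ValueError("m must satisfy 0 < m < n")

    # Assumption: Y is uniform over the bounding box of the data.
    lo, hi = X.min(axis=0), X.max(axis=0)
    Y = rng.uniform(lo, hi, size=(m, d))

    # Random sample of m data points, without replacement.
    idx = rng.choice(n, size=m, replace=False)

    # u_i: distance from y_i to its nearest neighbor in X.
    dist_y = np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=2)
    u = dist_y.min(axis=1)

    # w_i: distance from sampled x_i to its nearest *other* point in X.
    dist_x = np.linalg.norm(X[idx][:, None, :] - X[None, :, :], axis=2)
    dist_x[np.arange(m), idx] = np.inf  # exclude the point itself
    w = dist_x.min(axis=1)

    u_d, w_d = np.sum(u ** d), np.sum(w ** d)
    return u_d / (u_d + w_d)
```

Calling <code>hopkins(X, m=50, rng=0)</code> on points drawn uniformly from a square would be expected to give a value near 0.5, while two tight, well-separated blobs or a single Gaussian would be expected to give a value much closer to 1. The brute-force distance computation uses memory proportional to <math>m \cdot n \cdot d</math>; a spatial index would be preferable for large data sets.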