=== Cluster tendency ===
{{more citations needed section|date=April 2025}}
Cluster tendency is the degree to which clusters exist in the data to be clustered. It may be measured as an initial test, before attempting clustering. One way to do this is to compare the data against random data: on average, random data should not have clusters.{{verification needed|[[Random cluster model]] how does this fit here??|date=April 2025}}

*'''[[Hopkins statistic]]'''
:There are multiple formulations of the [[Hopkins statistic]].<ref>{{Cite journal | title = A new method for determining the type of distribution of plant individuals | last1 = Hopkins | first1 = Brian | last2 = Skellam | first2 = John Gordon | journal = Annals of Botany | volume = 18 | number = 2 | pages = 213–227 | year = 1954 | publisher = Annals Botany Co | doi=10.1093/oxfordjournals.aob.a083391 }}</ref> A typical one is as follows.<ref>{{Cite book | last = Banerjee | first = A. | title = 2004 IEEE International Conference on Fuzzy Systems (IEEE Cat. No.04CH37542) | chapter = Validating clusters using the Hopkins statistic | s2cid = 36701919 | volume = 1 | pages = 149–153 | doi = 10.1109/FUZZY.2004.1375706 | year = 2004 | isbn = 978-0-7803-8353-1 }}</ref> Let <math>X</math> be the set of <math>n</math> data points in <math>d</math>-dimensional space. Consider a random sample (without replacement) of <math>m \ll n</math> data points from <math>X</math>, with members <math>x_i</math>. Also generate a set <math>Y</math> of <math>m</math> uniformly randomly distributed data points. Now define two distance measures: <math>u_i</math>, the distance of <math>y_i \in Y</math> from its nearest neighbor in <math>X</math>, and <math>w_i</math>, the distance of the sampled point <math>x_i \in X</math> from its nearest neighbor in <math>X</math> (other than <math>x_i</math> itself). We then define the Hopkins statistic as:
::<math> H=\frac{\sum_{i=1}^m{u_i^d}}{\sum_{i=1}^m{u_i^d}+\sum_{i=1}^m{w_i^d}} \,, </math>
:With this definition, uniform random data should tend to have values near 0.5, and clustered data should tend to have values nearer to 1.
:However, data containing just a single Gaussian will also score close to 1, because this statistic measures deviation from a ''uniform'' distribution, not [[Multimodal distribution|multimodality]]. This makes the statistic largely useless in practice, since real data are almost never close to uniform.
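The definition above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation. The function names <code>hopkins</code> and <code>nearest_distance</code> are hypothetical, and nearest neighbors are found by brute force. The sketch also assumes that <math>Y</math> is drawn uniformly from the axis-aligned bounding box of <math>X</math>, since the definition does not fix the sampling region. The three test data sets are synthetic and chosen only for illustration.

```python
import math
import random


def nearest_distance(point, data, skip_index=None):
    """Euclidean distance from `point` to its nearest neighbor in `data`.

    `skip_index` lets a data point ignore itself, so that w_i is not
    trivially zero.
    """
    return min(
        math.dist(point, other)
        for j, other in enumerate(data)
        if j != skip_index
    )


def hopkins(data, m, rng):
    """Hopkins statistic H for a list of d-dimensional tuples."""
    n, d = len(data), len(data[0])

    # Assumption: Y is sampled uniformly from the bounding box of X.
    lows = [min(p[k] for p in data) for k in range(d)]
    highs = [max(p[k] for p in data) for k in range(d)]

    # w_i: sampled data point to its nearest *other* data point.
    sample_idx = rng.sample(range(n), m)  # without replacement
    w = [nearest_distance(data[i], data, skip_index=i) for i in sample_idx]

    # u_i: uniformly generated point to its nearest data point.
    y = [
        tuple(rng.uniform(lows[k], highs[k]) for k in range(d))
        for _ in range(m)
    ]
    u = [nearest_distance(p, data) for p in y]

    sum_u = sum(ui ** d for ui in u)
    sum_w = sum(wi ** d for wi in w)
    return sum_u / (sum_u + sum_w)


# Synthetic 2-D data sets (illustrative only).
rng = random.Random(0)
uniform = [(rng.random(), rng.random()) for _ in range(500)]
centres = [(0.2, 0.2), (0.8, 0.8)]
clustered = [
    (rng.gauss(cx, 0.02), rng.gauss(cy, 0.02))
    for cx, cy in (rng.choice(centres) for _ in range(500))
]
gaussian = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]

for name, data in [
    ("uniform", uniform),
    ("two clusters", clustered),
    ("single Gaussian", gaussian),
]:
    print(name, round(hopkins(data, 50, rng), 3))
```

Under these assumptions the uniform sample would be expected to score near 0.5, while both the two-cluster sample and the single Gaussian should score noticeably higher, which illustrates why a high value does not by itself indicate more than one cluster.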