Cluster analysis
=== Centroid-based clustering ===
{{Main|k-means clustering}}
In centroid-based clustering, each cluster is represented by a central vector, which is not necessarily a member of the data set. When the number of clusters is fixed to ''k'', [[k-means clustering|''k''-means clustering]] gives a formal definition as an optimization problem: find the ''k'' cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster centers are minimized.

The optimization problem itself is known to be [[NP-hard]], so the common approach is to search only for approximate solutions. A particularly well-known approximate method is [[Lloyd's algorithm]],<ref name="lloyd">{{Cite journal | last1 = Lloyd | first1 = S. | title = Least squares quantization in PCM | doi = 10.1109/TIT.1982.1056489 | journal = IEEE Transactions on Information Theory | volume = 28 | issue = 2 | pages = 129–137 | year = 1982 | s2cid = 10833328 }}</ref> often referred to simply as the "''k''-means algorithm" (although [[k-means clustering#History|another algorithm introduced this name]]). It finds only a [[local optimum]], however, and is commonly run multiple times with different random initializations. Variations of ''k''-means often include such optimizations as choosing the best of multiple runs, but also restricting the centroids to members of the data set ([[K-medoids|''k''-medoids]]), choosing [[median]]s ([[k-medians clustering|''k''-medians clustering]]), choosing the initial centers less randomly ([[k-means++|''k''-means++]]), or allowing a fuzzy cluster assignment ([[Fuzzy clustering|fuzzy c-means]]). Most ''k''-means-type algorithms require the [[Determining the number of clusters in a data set|number of clusters]] – ''k'' – to be specified in advance, which is considered one of the biggest drawbacks of these algorithms.
Furthermore, the algorithms prefer clusters of approximately similar size, because they always assign an object to the nearest centroid; this often yields improperly cut cluster borders. This happens primarily because the algorithm optimizes cluster centers, not cluster borders.

The steps of the centroid-based clustering algorithm are:
# Choose ''k'' distinct clusters at random. These are the initial centroids, to be improved upon.
# Given a set of observations {{math|('''x'''<sub>1</sub>, '''x'''<sub>2</sub>, ..., '''x'''<sub>''n''</sub>)}}, assign each observation to the centroid to which it has the smallest squared [[Euclidean distance]]. This results in ''k'' distinct groups, each containing unique observations.
# Recalculate the centroids (see [[k-means clustering|''k''-means clustering]]).
# Stop if the new centroids are equal to those of the previous iteration; otherwise repeat, as the centroids have not yet converged.

''k''-means has a number of interesting theoretical properties. First, it partitions the data space into a structure known as a [[Voronoi diagram]]. Second, it is conceptually close to nearest-neighbor classification, and as such is popular in [[machine learning]]. Third, it can be seen as a variation of model-based clustering, and Lloyd's algorithm as a variation of the [[Expectation-maximization algorithm]] for this model, discussed below.

<gallery widths="200" heights="200" caption="''k''-means clustering examples">
File:KMeans-Gaussian-data.svg|''k''-means separates data into Voronoi cells, which assumes equal-sized clusters (not adequate here).
File:KMeans-density-data.svg|''k''-means cannot represent density-based clusters.
</gallery>

Centroid-based clustering problems such as ''k''-means and ''k''-medoids are special cases of the uncapacitated, metric [[Optimal facility location|facility location problem]], a canonical problem in the operations research and computational geometry communities.
In a basic facility location problem (of which there are numerous variants that model more elaborate settings), the task is to find the best warehouse locations to optimally service a given set of consumers. One may view "warehouses" as cluster centroids and "consumer locations" as the data to be clustered. This makes it possible to apply the well-developed algorithmic solutions from the facility location literature to the presently considered centroid-based clustering problem.
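The iterative steps of Lloyd's algorithm described above can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: the function name <code>lloyd_kmeans</code>, the use of data points as initial centroids, and the fixed iteration cap are choices made for this example.

```python
import random

def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Sketch of Lloyd's algorithm: points are tuples of floats."""
    rng = random.Random(seed)
    # Step 1: pick k distinct observations as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign each observation to the centroid with the
        # smallest squared Euclidean distance.
        assignments = [
            min(range(k),
                key=lambda j: sum((p - c) ** 2
                                  for p, c in zip(x, centroids[j])))
            for x in points
        ]
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for j in range(k):
            members = [x for x, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members)
                                           for dim in zip(*members)))
            else:  # keep the old centroid if a cluster becomes empty
                new_centroids.append(centroids[j])
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments
```

Because only a local optimum is found, practical use would rerun this with several seeds and keep the run with the lowest sum of squared distances, as the article notes.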