# K-means

K-means is an algorithm for partitioning (or clustering) <math>N</math> data points into <math>K</math> disjoint subsets <math>S_j</math> containing <math>N_j</math> data points so as to minimize the sum-of-squares criterion

<math>J=\sum_{j=1}^K\sum_{n \in S_j}|x_n-mu_j|^2,</math>

where <math>x_n</math> is a vector representing the <math>nth</math> data point and <math>mu_j</math> is the geometric centroid of the data points in <math>S_j</math>.

In general, the algorithm does not achieve a global minimum of <math>J</math> over the assignments. In fact, since the algorithm uses discrete assignment rather than a set of continuous parameters, the “minimum” it reaches cannot even be properly called a local minimum. Despite these limitations, the algorithm is used fairly frequently as a result of its ease of implementation.

The algorithm consists of a simple re-estimation procedure as follows. Initially, the data points are assigned at random to the <math>K</math> sets. For step 1, the centroid is computed for each set. In step 2, every point is assigned to the cluster whose centroid is closest to that point.

The most common algorithm uses an iterative refinement technique. Due to its ubiquity it is often called the k-means algorithm; it is also referred to as Lloyd’s algorithm, particularly in the computer science community.

Given an initial set of <math>k</math> means <math>\mathbf{mu_1^{(1)},\dots,mu_k^{(1)}}</math>, which may be specified randomly or by some heuristic, the algorithm proceeds by alternating between two steps:

Assignment step: Assign each observation to the cluster with the closest mean

(i.e. partition the observations according to the Voronoi diagram generated by the means).

<math> S_i^{(t)} = \left\{ \mathbf x_j : \big\| \mathbf x_j - \mathbf{ mu^{(t)}_i} \big\| \leq \big\| \mathbf x_j - \mathbf {mu^{(t)}_{i^*}} \big\| \forall i^*=1,\ldots,k \right\} </math>

Update step: Calculate the new means to be the centroid of the observations in the cluster.

<math> \mathbf mu^{(t+1)}_i = \frac{1}{|S^{(t)}_i|} \sum_{\mathbf x_j \in S^{(t)}_i} \mathbf x_j </math>

These two steps are alternated until a stopping criterion is met, i.e., when there is no further change in the assignment of the data points.