K-Means
Consider the problem of identifying groups, or clusters, of data points in a multidimensional space.
Suppose we have a dataset \(\mathbf{x}_1, \ldots, \mathbf{x}_N\) consisting of \(N\) observations of a random \(D\)-dimensional Euclidean variable \(\mathbf{x}\). Our goal is to partition these points into \(K\) clusters.
To formalize this, we introduce two sets of variables:
- Cluster centers: \(\boldsymbol{\mu}_k \in \mathbb{R}^{D}\), one for each cluster \(k = 1, \ldots, K\).
- Assignment indicator variables: \(\mathbf{r}_{n} \in \{0, 1\}^{K}\), one for each data point, where \(r_{nk} = 1\) indicates that data point \(\mathbf{x}_n\) belongs to cluster \(k\) (a 1-of-\(K\) coding, so exactly one entry of \(\mathbf{r}_n\) equals 1).
So we can reformulate our goal: find an assignment of data points to clusters, as well as a set of cluster centers \(\{\boldsymbol{\mu}_k\}^{K}_{k=1}\), such that the sum of squared distances from each data point to its assigned cluster center is minimized. In equations, the objective is:
\[J(\mathbf{x}; \; \{\boldsymbol{\mu}_{k}, \mathbf{r}_{nk}\}) = \sum^{N}_{n=1}\sum^{K}_{k=1} r_{nk} \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2_2\]
where the distance measure is the squared Euclidean (L2) distance. Thus, we want to find parameters \(\{\boldsymbol{\mu}_{k}, \mathbf{r}_{nk}\}\) that minimize this objective.
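To make the objective concrete, here is a minimal NumPy sketch that evaluates \(J\) for given centers and one-hot assignments; the helper name `kmeans_objective` is ours, not from the text:

```python
import numpy as np

def kmeans_objective(X, mu, r):
    """Evaluate J: sum of squared distances from each point to its assigned center.

    X  : (N, D) data matrix
    mu : (K, D) cluster centers
    r  : (N, K) one-hot assignment matrix (r[n, k] = 1 iff x_n is in cluster k)
    """
    # Squared Euclidean distance from every point to every center, shape (N, K)
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # r zeroes out every distance except the one to the assigned center
    return (r * sq_dists).sum()
```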
Algorithm
1. Initialize \(\boldsymbol{\mu}_k\) for all \(k = 1, \ldots, K\).
2. Given \(\{\boldsymbol{\mu}_k\}\), minimize \(J\) w.r.t. \(\{\mathbf{r}_{n}\}\).
3. Given \(\{\mathbf{r}_{n}\}\), minimize \(J\) w.r.t. \(\{\boldsymbol{\mu}_k\}\).
4. Repeat steps 2 and 3 until convergence.
For step 2:
Given the cluster centers, \(J\) is minimized by simply assigning each point to its closest cluster center:
\[ r_{nk}= \begin{cases} 1, & \text{if }\; k = \underset{j}{\arg\min}\, \|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2_2\\ 0, & \text{otherwise} \end{cases} \]
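In code, this assignment step is an argmin over squared distances. A sketch under the same conventions (the helper name `assign_clusters` is an assumption); note that `argmin` returns the lowest index on a tie, which breaks ties consistently:

```python
import numpy as np

def assign_clusters(X, mu):
    """Step 2: build the one-hot matrix r by assigning each point to its nearest center."""
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    nearest = sq_dists.argmin(axis=1)        # index of the closest center for each point
    r = np.zeros_like(sq_dists)
    r[np.arange(len(X)), nearest] = 1.0      # one-hot encode the assignments
    return r
```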
For step 3:
Given the assignments, the objective is convex (quadratic) in each \(\boldsymbol{\mu}_k\), so we can set the gradient to zero and solve for the optimal \(\boldsymbol{\mu}_k\):
\[\frac{\partial J}{\partial \boldsymbol{\mu}_k} = \sum^{N}_{n=1} -2 r_{nk} \mathbf{x}_n + 2\boldsymbol{\mu}_k \sum^{N}_{n=1} r_{nk} = 0\]
\[\implies \boldsymbol{\mu}_k \sum^{N}_{n=1} r_{nk} = \sum^{N}_{n=1} r_{nk} \mathbf{x}_n\]
\[\implies \boldsymbol{\mu}_k = \frac{\sum^{N}_{n=1} r_{nk} \mathbf{x}_n}{\sum^{N}_{n=1} r_{nk}}\]
\(\sum^{N}_{n=1} r_{nk} \mathbf{x}_n\) is the sum of all points that belong to cluster \(k\) and \(\sum^{N}_{n=1} r_{nk}\) is the number of points in cluster \(k\), so the new \(\boldsymbol{\mu}_k\) is simply the sample mean of the points in cluster \(k\).
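Correspondingly, the update step is a per-cluster sample mean. One detail the derivation leaves open is an empty cluster, where \(\sum_n r_{nk} = 0\) and the mean is undefined; a common remedy, used in this sketch (the helper name `update_centers` is ours), is to keep the old center:

```python
import numpy as np

def update_centers(X, r, mu_old):
    """Step 3: move each center to the mean of the points assigned to it."""
    counts = r.sum(axis=0)                        # points per cluster, shape (K,)
    sums = r.T @ X                                # per-cluster sum of points, shape (K, D)
    means = sums / np.maximum(counts, 1)[:, None] # safe divide (empty clusters give 0 / 1)
    # If a cluster is empty its mean is undefined: keep the old center instead
    return np.where(counts[:, None] > 0, means, mu_old)
```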
Convergence
Since:
- \(\{\boldsymbol{\mu}_{k}, \mathbf{r}_{nk}\}\) can only take finitely many values (the assignments partition a finite dataset, and each partition uniquely determines the centers).
- At each step, we minimize the objective (i.e., we either decrease the objective or leave it unchanged).
- The cost function is bounded below by zero.
- Ties are broken consistently.
Thus, the algorithm can only take a finite number of non-increasing steps before terminating at a local minimum.
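Using the helpers sketched above (`kmeans_objective`, `assign_clusters`, `update_centers`), the monotone behaviour is easy to check empirically on random data; this is a toy sanity check, not a proof:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
mu = X[:3].copy()                      # arbitrary initialization with K = 3
J_prev = np.inf
for _ in range(20):
    r = assign_clusters(X, mu)         # step 2
    mu = update_centers(X, r, mu)      # step 3
    J = kmeans_objective(X, mu, r)
    assert J <= J_prev + 1e-12         # the objective never increases
    J_prev = J
```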
Implementation
```python
import numpy as np
```
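Building on the import above, here is a minimal self-contained sketch of the full loop; the function name `kmeans` and the parameters `max_iters`, `tol`, and `seed` are illustrative choices, not from the original text. It initializes the centers as \(K\) distinct data points chosen at random and alternates steps 2 and 3 until the objective stops decreasing:

```python
def kmeans(X, K, max_iters=100, tol=1e-8, seed=0):
    """Alternate assignment and mean updates until J stops improving."""
    rng = np.random.default_rng(seed)
    N, _ = X.shape
    # Initialize centers as K distinct data points chosen at random
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)
    prev_J = np.inf
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest center
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        labels = sq_dists.argmin(axis=1)
        J = sq_dists[np.arange(N), labels].sum()
        # Step 3: move each center to the mean of its assigned points
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:           # keep the old center if a cluster empties
                mu[k] = members.mean(axis=0)
        if prev_J - J < tol:               # objective stopped decreasing
            break
        prev_J = J
    return mu, labels, J
```

For example, `mu, labels, J = kmeans(X, K=3)` returns the final centers, the hard assignments, and the final objective value. Since the algorithm only reaches a local minimum, a common practice is to run it from several random initializations and keep the run with the lowest \(J\).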