K-Means
Consider the problem of identifying groups, or clusters, of data points in a multidimensional space.
Suppose we have a dataset \(\mathbf{x}_1, \ldots, \mathbf{x}_N\) consisting of \(N\) observations of a random \(D\)-dimensional Euclidean variable \(\mathbf{x}\). Our goal is to partition these points into \(K\) clusters.
To formalize this, we introduce two sets of variables:
- Cluster centers: \(\boldsymbol{\mu}_k \in \mathbb{R}^{D}\), one for each cluster \(k = 1, \ldots, K\).
- Assignment indicator variables: \(\mathbf{r}_{n} \in \{0, 1\}^{K}\), one for each data point, where \(r_{nk} = 1\) indicates that data point \(\mathbf{x}_n\) belongs to cluster \(k\) (a 1-of-\(K\) coding, so exactly one entry of \(\mathbf{r}_n\) equals 1).
So we can reformulate our goal: find an assignment of data points to clusters, as well as a set of cluster centers \(\{\boldsymbol{\mu}_k\}^{K}_{k=1}\), such that the sum of squared distances from each data point to its assigned cluster center is minimized. In equations, the objective is:
\[J(\mathbf{x}; \; \{\boldsymbol{\mu}_{k}, \mathbf{r}_{nk}\}) = \sum^{N}_{n=1}\sum^{K}_{k=1} r_{nk} \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2_2\]
where the distance measure is the squared Euclidean (L2) distance. Thus, we want to find parameters \(\{\boldsymbol{\mu}_{k}, \mathbf{r}_{nk}\}\) that minimize this objective.
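To make the objective concrete, here is a minimal NumPy sketch that evaluates \(J\) for given centers and one-hot assignments; the helper name `kmeans_objective` is ours, not from the text:

```python
import numpy as np

def kmeans_objective(X, mu, r):
    """Evaluate J: sum of squared distances from each point to its assigned center.

    X  : (N, D) data matrix
    mu : (K, D) cluster centers
    r  : (N, K) one-hot assignment matrix (r[n, k] = 1 iff x_n is in cluster k)
    """
    # Squared Euclidean distance from every point to every center, shape (N, K)
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # r zeroes out every distance except the one to the assigned center
    return (r * sq_dists).sum()
```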
Algorithm
1. Initialize \(\boldsymbol{\mu}_k\) for all \(k = 1, \ldots, K\).
2. Given \(\{\boldsymbol{\mu}_k\}\), minimize \(J\) w.r.t. \(\{\mathbf{r}_{n}\}\).
3. Given \(\{\mathbf{r}_{n}\}\), minimize \(J\) w.r.t. \(\{\boldsymbol{\mu}_k\}\).
4. Repeat steps 2 and 3 until convergence.
For step 2:
Given the cluster centers, \(J\) is minimized by simply assigning each point to its closest cluster center:
\[ r_{nk}= \begin{cases} 1, & \text{if }\; k = \underset{j}{\arg\min}\, \|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2_2\\ 0, & \text{otherwise} \end{cases} \]
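In code, this assignment step is an argmin over squared distances. A sketch under the same conventions (the helper name `assign_clusters` is an assumption); note that `argmin` returns the lowest index on a tie, which breaks ties consistently:

```python
import numpy as np

def assign_clusters(X, mu):
    """Step 2: build the one-hot matrix r by assigning each point to its nearest center."""
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    nearest = sq_dists.argmin(axis=1)        # index of the closest center for each point
    r = np.zeros_like(sq_dists)
    r[np.arange(len(X)), nearest] = 1.0      # one-hot encode the assignments
    return r
```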
For step 3:
Given the assignments, the objective is convex (quadratic) in each \(\boldsymbol{\mu}_k\), so we can set the gradient to zero and solve for the optimal \(\boldsymbol{\mu}_k\):
\[\frac{\partial J}{\partial \boldsymbol{\mu}_k} = \sum^{N}_{n=1} -2 r_{nk} \mathbf{x}_n + 2\boldsymbol{\mu}_k \sum^{N}_{n=1} r_{nk} = 0\]
\[\implies \boldsymbol{\mu}_k \sum^{N}_{n=1} r_{nk} = \sum^{N}_{n=1} r_{nk} \mathbf{x}_n\]
\[\implies \boldsymbol{\mu}_k = \frac{\sum^{N}_{n=1} r_{nk} \mathbf{x}_n}{\sum^{N}_{n=1} r_{nk}}\]
\(\sum^{N}_{n=1} r_{nk} \mathbf{x}_n\) is the sum of all points that belong to cluster \(k\) and \(\sum^{N}_{n=1} r_{nk}\) is the number of points in cluster \(k\), so the new \(\boldsymbol{\mu}_k\) is simply the sample mean of the points in cluster \(k\).
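Correspondingly, the update step is a per-cluster sample mean. One detail the derivation leaves open is an empty cluster, where \(\sum_n r_{nk} = 0\) and the mean is undefined; a common remedy, used in this sketch (the helper name `update_centers` is ours), is to keep the old center:

```python
import numpy as np

def update_centers(X, r, mu_old):
    """Step 3: move each center to the mean of the points assigned to it."""
    counts = r.sum(axis=0)                        # points per cluster, shape (K,)
    sums = r.T @ X                                # per-cluster sum of points, shape (K, D)
    means = sums / np.maximum(counts, 1)[:, None] # safe divide (empty clusters give 0 / 1)
    # If a cluster is empty its mean is undefined: keep the old center instead
    return np.where(counts[:, None] > 0, means, mu_old)
```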
Convergence
Since:
- \(\{\boldsymbol{\mu}_{k}, \mathbf{r}_{nk}\}\) can only take finitely many values (the assignments partition a finite dataset, and each partition uniquely determines the centers).
- At each step, we minimize the objective (i.e., we either decrease the objective or leave it unchanged).
- The cost function is bounded below by zero.
- Ties are broken consistently.
Thus, the algorithm can only take a finite number of non-increasing steps before terminating at a local minimum.
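Using the helpers sketched above (`kmeans_objective`, `assign_clusters`, `update_centers`), the monotone behaviour is easy to check empirically on random data; this is a toy sanity check, not a proof:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
mu = X[:3].copy()                      # arbitrary initialization with K = 3
J_prev = np.inf
for _ in range(20):
    r = assign_clusters(X, mu)         # step 2
    mu = update_centers(X, r, mu)      # step 3
    J = kmeans_objective(X, mu, r)
    assert J <= J_prev + 1e-12         # the objective never increases
    J_prev = J
```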
Implementation
```python
import numpy as np
```
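Building on the import above, here is a minimal self-contained sketch of the full loop; the function name `kmeans` and the parameters `max_iters`, `tol`, and `seed` are illustrative choices, not from the original text. It initializes the centers as \(K\) distinct data points chosen at random and alternates steps 2 and 3 until the objective stops decreasing:

```python
def kmeans(X, K, max_iters=100, tol=1e-8, seed=0):
    """Alternate assignment and mean updates until J stops improving."""
    rng = np.random.default_rng(seed)
    N, _ = X.shape
    # Initialize centers as K distinct data points chosen at random
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)
    prev_J = np.inf
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest center
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        labels = sq_dists.argmin(axis=1)
        J = sq_dists[np.arange(N), labels].sum()
        # Step 3: move each center to the mean of its assigned points
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:           # keep the old center if a cluster empties
                mu[k] = members.mean(axis=0)
        if prev_J - J < tol:               # objective stopped decreasing
            break
        prev_J = J
    return mu, labels, J
```

For example, `mu, labels, J = kmeans(X, K=3)` returns the final centers, the hard assignments, and the final objective value. Since the algorithm only reaches a local minimum, a common practice is to run it from several random initializations and keep the run with the lowest \(J\).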