PCA

Principal Component Analysis

Let \([\mathbf{X}_1 \ldots \mathbf{X}_N]\) be a \(p \times N\) matrix whose columns \(\mathbf{X}_i \in \mathbb{R}^p\) form a random sample. The sample mean \(\mathbf{M}\) of the random sample is defined as:

\[\mathbf{M} = \frac{1}{N} \sum^{N}_{i=1} \mathbf{X}_{i}\]

Let \(\hat{\mathbf{X}}_k\) be the centered random vector for \(k = 1, \ldots, N\):

\[\hat{\mathbf{X}}_k = \mathbf{X}_k - \mathbf{M}\]

Then we can define the centered random sample matrix as:

\[B = [\hat{\mathbf{X}}_1 \ldots \hat{\mathbf{X}}_N]_{[p\times N]}\]

The sample covariance matrix \(S\) is then:

\[S = \left[\frac{1}{N-1} BB^T\right]_{[p\times p]}\]
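As a quick sanity check, here is a minimal NumPy sketch of the definitions so far; the sizes p, N and the random data X are made up purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
p, N = 3, 100                      # illustrative sizes
X = rng.normal(size=(p, N))        # columns are the samples X_1 ... X_N

M = X.mean(axis=1, keepdims=True)  # sample mean (p x 1)
B = X - M                          # centered sample matrix (p x N)
S = (B @ B.T) / (N - 1)            # sample covariance matrix (p x p)

assert np.allclose(S, np.cov(X))   # agrees with NumPy's estimator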

We can easily show that \(S\) is positive semi-definite. Let \(\mathbf{x} \neq \mathbf{0}\) be an eigenvector of \(BB^T\) with eigenvalue \(\lambda\):

\[\begin{aligned} BB^T \mathbf{x} &= \lambda \mathbf{x}\\ \implies \mathbf{x}^T BB^T \mathbf{x} &= \lambda \mathbf{x}^T \mathbf{x}\\ \implies \frac{\mathbf{x}^T BB^T \mathbf{x}}{\mathbf{x}^T \mathbf{x}} &= \lambda \end{aligned}\]

Let \(\mathbf{A} = B^T\mathbf{x}\), then:

\[\frac{\mathbf{A}^T \mathbf{A}}{\mathbf{x}^T \mathbf{x}} = \lambda\]

Since \(\mathbf{A}^T \mathbf{A} \geq 0\) for any vector \(\mathbf{A}\), and \(\mathbf{x}^T \mathbf{x} > 0\) because \(\mathbf{x} \neq \mathbf{0}\), it follows that:

\[\lambda \geq 0\]

Since \(S = \frac{1}{N-1}BB^T\) is symmetric and its eigenvalues are those of \(BB^T\) scaled by \(\frac{1}{N-1}\), every eigenvalue of \(S\) satisfies \(\lambda \geq 0\), and we conclude that \(S\) is positive semi-definite.

The total variance in the data is:

\[tr(S)\]
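Continuing the sketch above, both claims can be checked numerically: the eigenvalues of S are nonnegative, and tr(S) equals the sum of the per-feature sample variances:

eigvals = np.linalg.eigvalsh(S)   # eigenvalues of the symmetric matrix S
assert np.all(eigvals >= -1e-10)  # nonnegative up to round-off
assert np.isclose(np.trace(S), X.var(axis=1, ddof=1).sum())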

PCA

The goal of PCA is to find an orthogonal \(p \times p\) matrix \(P = [\mathbf{u}_1 \ldots \mathbf{u}_p]\) that determines a change of variable \(\mathbf{X} = P\mathbf{Y}\), with the property that the features of \(\mathbf{Y}\) are uncorrelated and arranged in order of decreasing variance.

Assume that \(\mathbf{X}\) is already centered, that is, \(B = [\mathbf{X}_1 \ldots \mathbf{X}_N]\). Then \(\mathbf{Y}\) is also centered, since \(P\) is invertible:

\[E[\mathbf{Y}] = P^{-1} E[\mathbf{X}] = \mathbf{0}\]

Then the sample covariance matrix of \(\mathbf{Y}\) is:

\[S_{\mathbf{Y}} = \frac{1}{N-1}[P^{-1}\mathbf{X}_1 \ldots P^{-1}\mathbf{X}_N] [P^{-1}\mathbf{X}_1 \ldots P^{-1}\mathbf{X}_N]^T\]

\[\implies S_{\mathbf{Y}} = \frac{1}{N-1} P^{T}BB^{T}P = P^{T} S_{\mathbf{X}} P\]

where we used \(P^{-1} = P^T\), since \(P\) is orthogonal.

So the desired orthogonal matrix is the one that makes \(\hat{Cov}[Y_i, Y_j] = 0, \; \forall i \neq j\) (the features are uncorrelated), which means we want the sample covariance matrix \(S_{\mathbf{Y}}\) to be diagonal.


Let \(D\) be the diagonal matrix with the eigenvalues \(\lambda_1, \ldots, \lambda_p\) of \(S_{\mathbf{X}}\) on its diagonal, ordered so that \(\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p \geq 0\), and let \(P\) be an orthogonal matrix whose columns are the corresponding unit eigenvectors \(\mathbf{u}_1, \ldots, \mathbf{u}_p\). (Such a \(P\) exists because symmetric matrices have mutually orthogonal eigenspaces: eigenvectors corresponding to different eigenvalues are orthogonal.) Then we can use \(P\) and \(\mathbf{X}\) to represent \(\mathbf{Y}\), and the sample covariance matrix satisfies:

\[S_{\mathbf{X}} = PDP^T \implies P^{-1} S_{\mathbf{X}} P = D\]

Thus, \(S_{\mathbf{Y}} = D\).

Then, the eigenvectors \(\mathbf{u}_1, \ldots, \mathbf{u}_p\) are called the Principal Components of the data. The first PC is the eigenvector corresponding to the largest eigenvalue of \(S_{\mathbf{X}}\).

The resulting transformation is defined as:

\[P^{T} \mathbf{X} = \mathbf{Y}\]

\[ \mathbf{Y} = \begin{bmatrix} \mathbf{u}^T_1 \mathbf{X} \\ \vdots \\ \mathbf{u}^T_p \mathbf{X} \end{bmatrix} \]
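Reusing B and S from the sketch above, a minimal sketch of this construction with np.linalg.eigh (which returns eigenvalues in ascending order, so they are reversed here to get decreasing variance):

eigvals, P = np.linalg.eigh(S)             # eigendecomposition of symmetric S
order = np.argsort(eigvals)[::-1]          # decreasing-variance order
eigvals, P = eigvals[order], P[:, order]

Y = P.T @ B                                # transformed (centered) data
S_Y = (Y @ Y.T) / (N - 1)                  # sample covariance of Y

assert np.allclose(S_Y, np.diag(eigvals))  # diagonal, eigenvalues on it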

Total Variance

It can be shown that the PCA transformation does not change the total variance of the data, that is, the total variance of the data is the sum of the eigenvalues:

\[tr(S_{\mathbf{X}}) = tr(S_{\mathbf{Y}}) = tr(D) = \sum_{i=1}^{p} \lambda_i\]

This follows from the cyclic property of the trace applied to the sample covariance matrix: \(tr(S_{\mathbf{Y}}) = tr(P^T S_{\mathbf{X}} P) = tr(S_{\mathbf{X}} P P^T) = tr(S_{\mathbf{X}})\).
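In the running sketch this can be checked directly:

assert np.isclose(np.trace(S), np.trace(S_Y))
assert np.isclose(np.trace(S), eigvals.sum())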

Implementation

import numpy as np


def pca(X, k=2):
    """Project the rows of X onto the first k principal components.

    X is expected to be centered column-wise (e.g., the output of
    normalize below); here the rows of X are the samples, transposed
    relative to the math above.
    """
    # SVD of the centered data: X = U S Vt. The rows of U * S are the
    # principal-component scores of the samples.
    u, s, vt = np.linalg.svd(X, full_matrices=False)
    xv = u * s

    return xv[:, :k]


def normalize(X):
    """Center each column and scale it to unit standard deviation."""
    n, d = X.shape
    x_trans = X.astype(float)  # copy, and avoid integer truncation
    for j in range(d):
        x_trans[:, j] = (X[:, j] - X[:, j].mean()) / X[:, j].std()

    return x_trans
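A small usage example of the helpers above, on random data that is purely illustrative; note again that here the rows of X are the samples:

X = np.random.default_rng(0).normal(size=(100, 5))
scores = pca(normalize(X), k=2)
print(scores.shape)  # (100, 2)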

Ref

https://stats.stackexchange.com/questions/266864/why-is-the-sum-of-eigenvalues-of-a-pca-equal-to-the-original-variance-of-the-dat