Deriving Principal Components Analysis from Singular Value Decomposition - Part II (PCA)

I wrote a previous article about Singular Value Decomposition (SVD) here, with the promise that I would eventually use SVD to motivate Principal Components Analysis (PCA). Hopefully you are well-rested, and ready to dive in.

Preliminaries

Sample Matrix

Let's imagine that we have a collection of n samples, each with m dimensions.

Let's call this feature space F. Typically, there are many more samples than dimensions, so m < n. We will arrange our samples into a matrix X, where each column of X is a sample in F. In the end, PCA is a change of basis, and basis changes don't do translation, so let's assume that the samples have already been translated to have mean zero in each coordinate.

Below is an example of samples in 2 dimensions:

[Figure: example samples in two dimensions]
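
If a concrete sketch helps, here is one way such a sample matrix might be built with NumPy. The specific numbers (500 samples, the particular mean and covariance) are made up for illustration and are not the data behind the figures:

import numpy as np

# Hypothetical example: n = 500 samples in m = 2 dimensions.
rng = np.random.default_rng(0)
n = 500
samples = rng.multivariate_normal(mean=[3.0, -1.0],
                                  cov=[[3.0, 1.2], [1.2, 1.0]],
                                  size=n)              # shape (n, 2)

# Arrange the samples as the columns of X (shape m x n) ...
X = samples.T

# ... and translate each coordinate to mean zero, as assumed above.
X = X - X.mean(axis=1, keepdims=True)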


Change of Basis

We may want to map samples from their original coordinates in F to some new set of coordinates. This operation is known as a change of basis. For example, we might want to stretch and rotate the axes. The mapping between a vector x⃗ in the old coordinates and the vector n⃗ in the new set of coordinates is given by a change of basis map

x⃗ = An⃗

where the columns of A are linearly independent vectors in F. Writing it this way around shows explicitly that it is possible to recover the original vector x⃗ from n⃗. If we are changing the basis for all the samples at once, we can write

X = AN

where each column n⃗ᵢ of N gives the new coordinates of the sample x⃗ᵢ.

Interestingly, for a given matrix X, A need not be square, so long as the columns of A span the column space of X. If X has rank r, then all the samples in X belong to some r-dimensional subspace of F and could be completely described with r coordinates. Thus, A could have as few as r columns.
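
As a small sketch of a non-square change of basis (continuing the NumPy example above, with a made-up matrix B used only to embed the samples in three dimensions):

# Embed the 2-D samples in a 3-D space; X3 has rank 2.
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
X3 = B @ X                                  # shape 3 x n, but rank 2

# Any A whose columns span the column space of X3 will do, so A can be 3 x 2.
A = B
N, *_ = np.linalg.lstsq(A, X3, rcond=None)  # r = 2 new coordinates per sample
assert np.allclose(A @ N, X3)               # the original samples are recovered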

Singular Value Decomposition

Let's remind ourselves of the Singular Value Decomposition. First, we need some nomenclature.

For a matrix A, define Aᴴ as the conjugate transpose. If A is real, Aᴴ = Aᵀ, the transpose of A. Instead of ᴴ, some sources use † and some use *.

A matrix A is called unitary if Aᴴ = A⁻¹. A square matrix A is unitary if and only if all the column vectors are orthonormal, which is to say perpendicular to each other and of norm 1. Similarly, all the row vectors of a square matrix A are orthonormal if and only if A is unitary.

An m x n matrix X can be decomposed as

X = UΣVᴴ

where U is an m x m unitary matrix, V is an n x n unitary matrix, and Σ is an m x n diagonal matrix whose diagonal entries are non-negative and arranged in decreasing order. The diagonal elements of Σ are called the singular values. The number of positive singular values in Σ corresponds to the rank of X.

If X has real entries, then U and V also have real entries.

We will be considering a situation where m < n. The shape of this decomposition looks like this:

[Figure: the shapes of U, Σ and Vᴴ in the full SVD when m < n]


Here, the dark blue diagonal of Σ indicates where all the non-zero values lie. The light blue portions are all zero.
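
Continuing the NumPy sketch from earlier, the full decomposition and its shapes look like this; note that np.linalg.svd returns the singular values as a vector s rather than as the full m x n matrix Σ:

U, s, Vh = np.linalg.svd(X, full_matrices=True)
print(U.shape, s.shape, Vh.shape)        # (m, m), (min(m, n),), (n, n)

# Rebuild the m x n matrix Sigma with the singular values on its diagonal.
Sigma = np.zeros(X.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
assert np.allclose(U @ Sigma @ Vh, X)    # X = U Σ Vᴴ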

The Compact SVD

The picture above teaches us that the full SVD is wasteful. In our example, the lower portion of Vᴴ is guaranteed to be multiplied by zero:

[Figure: the lower portion of Vᴴ is multiplied only by zeros]


It makes no difference to the outcome what you put there. Similarly, if X is of rank r < m, some rows at the bottom of Σ will be zero, and some columns on the right-hand side of U are guaranteed to be multiplied by zero:

[Figure: when rank r < m, the bottom rows of Σ are zero and the rightmost columns of U are multiplied only by zeros]

We can eliminate all these extra dimensions by throwing away the irrelevant portions of U and V and restricting Σ to be an r x r diagonal matrix with positive entries on the diagonal:

[Figure: the compact SVD, with Σ reduced to an r x r diagonal matrix]

This is the compact singular value decomposition. It is natural enough that some people think of the compact version as the regular singular value decomposition.
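
A sketch of the compact SVD in NumPy, continuing the example above (full_matrices=False gives the "thin" decomposition, which we then restrict to the numerical rank r):

U, s, Vh = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(s > 1e-12))                         # numerical rank of X
U_r, s_r, Vh_r = U[:, :r], s[:r], Vh[:r, :]
assert np.allclose(U_r @ np.diag(s_r) @ Vh_r, X)   # X is reproduced exactly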

Interpreting U and UΣ as changes of basis

Recall that a change of basis for the matrix X takes the form

X = AN

Substituting U for A and ΣVᴴ for N, we have

X = U (ΣVᴴ)

so U is a change of basis from the m-dimensional space F to an r-dimensional space.

Similarly, UΣ is a change of basis in the formula

X = (UΣ) Vᴴ

Of course, the columns z⃗ᵢ of UΣ are more than linearly independent - they are orthogonal, since for i ≠ k, (z⃗ᵢ, z⃗ₖ) = (σᵢu⃗ᵢ, σₖu⃗ₖ) = σ̄ᵢσₖ(u⃗ᵢ, u⃗ₖ) = 0.
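
A quick numerical check of both interpretations, using the compact factors U_r, s_r, Vh_r from the sketch above:

# With A = U_r, the new coordinates are N = Σ Vᴴ; because the columns of U_r
# are orthonormal, they can also be read off as U_rᴴ X.
N = np.diag(s_r) @ Vh_r
assert np.allclose(U_r @ N, X)
assert np.allclose(U_r.conj().T @ X, N)

# The columns of U_r Σ are mutually orthogonal, as claimed above.
Z = U_r @ np.diag(s_r)
assert np.allclose(Z.T @ Z, np.diag(s_r**2))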

Deriving Principal Components Analysis

Recall that we have constructed X with each column vector representing a single sample. We will motivate PCA by approximating X.

Approximation

If we wanted to approximate X with a lower-dimensional basis, how would we go about it? We know that the singular values on the diagonal of Σ are arranged in decreasing order, so what if we just ... set the smaller singular values to zero? Let's imagine we only retain the p largest singular values.

Our decomposition would look like this:

[Figure: the decomposition with only the p largest singular values retained]

and once again, U and V have columns that are destined to be multiplied by zero:

[Figure: the columns of U and V that are now multiplied only by zeros]


Compacting as before, we let Σ̂ be an invertible p x p diagonal matrix:

[Figure: the compacted rank-p approximation with a p x p matrix Σ̂]

This gives us

X ≈ ÛΣ̂V̂ᴴ

We won't prove it here, but the Eckart–Young–Mirsky theorem tells us that we have stumbled on the best possible p-dimensional approximation, both for the L₂ norm and the Frobenius norm. The idea of the proof is that our approximation errs by failing to add vectors scaled by small singular values. Any other approximation must err by failing to add vectors scaled by larger singular values, so it won't be as good.
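
Here is a minimal sketch of the rank-p truncation, again with the compact factors from above and an illustrative choice of p = 1. The Frobenius error it checks is exactly the square root of the sum of the squared discarded singular values:

p = 1
X_hat = U_r[:, :p] @ np.diag(s_r[:p]) @ Vh_r[:p, :]   # best rank-p approximation
err = np.linalg.norm(X - X_hat, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s_r[p:]**2)))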

Interpreting SVD as Principal Components Analysis

In the SVD X = UΣVᴴ, U is an orthonormal change of basis from F to an r-dimensional space, with the maximum variance along the first axis and decreasing variance along each subsequent axis. The columns of U are the principal components. In our example, the principal components look like this:

[Figure: the principal components of the example data]

The new coordinates are given by ΣVᴴ. Here, the transformed data looks just like the original data, only rotated so that each successive coordinate axis lies along the direction of maximum remaining variance:

[Figure: the example data in principal component coordinates]


The fraction of explained variance along the i-th principal component is proportional to σᵢ²:

var(i) = σᵢ² / Σₖσₖ²
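In code, using s_r from the compact decomposition above (a sketch):

# Fraction of variance explained by each principal component.
explained = s_r**2 / np.sum(s_r**2)     # one entry per component, summing to 1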

Whitening

Simply rotating the data may not be what's desired. If we have two measurements with vastly different scales, we often like to scale the measurements so that each axis has unit variance. However, if the direction of maximum variance does not lie along an axis, simply rescaling the original axes still leaves the data stretched out.

Looking again at the SVD X = UΣVᴴ, our approach here is to use the basis change UΣ. The basis vectors of UΣ in our example look like this:

[Figure: the basis vectors of UΣ for the example data]

This maps the data onto axes with unit variance on each axis. The new coordinates are given by Vᴴ. Here is the whitened data in the new space:

[Figure: the whitened data in the new coordinates]
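
A sketch of the whitening step with the same compact factors. The optional √n scaling is an assumption of mine rather than something fixed by the decomposition; it makes the sample variance along each axis equal to one:

whitened = Vh_r                                    # r x n: one column per sample
assert np.allclose((U_r @ np.diag(s_r)) @ whitened, X)

# Each row of Vᴴ is a unit vector with mean zero, so scaling by sqrt(n)
# gives unit (population) variance along every axis.
whitened_unit_var = np.sqrt(n) * Vh_r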

Reducing Dimensions

Our approximation X ≈ ÛΣ̂V̂ᴴ gives us reduced-dimension representations of the samples. To retain the original shape of the data, we use Û as a basis change to p dimensions. The new coordinates are given by Σ̂V̂ᴴ.

For a reduced-dimension and whitened representation, we use ÛΣ̂ as a basis change. The new, whitened coordinates are given by V̂ᴴ.
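
In code, with p and X_hat as in the approximation sketch above:

reduced = np.diag(s_r[:p]) @ Vh_r[:p, :]     # p coordinates per sample
whitened_reduced = Vh_r[:p, :]               # whitened p coordinates per sample
assert np.allclose(U_r[:, :p] @ reduced, X_hat)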

A Final Note of Caution

Reducing dimensions by PCA allows us to construct the best p-dimensional approximation to the original data, but possibly not the most useful. Let's extend our example to include two clusters of data:

[Figure: the example extended to two clusters of data]

If we apply PCA, the bulk of the variance lies along the x-axis:

[Figure: the principal components of the two-cluster data, with most of the variance along the x-axis]

If we now reduce to one dimension, the two clusters are indistinguishable. Here are the histograms of the two clusters when projected onto the x-axis:

[Figure: histograms of the two clusters projected onto the x-axis, overlapping heavily]

A more useful idea in this circumstance is to retain the y-axis, even though it's a bad approximation to the original data. Along the y-axis, the two clusters are quite distinguishable:

[Figure: histograms of the two clusters along the y-axis, clearly separated]
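
To make the caution concrete, here is a small sketch with made-up clusters (the sizes, means and spreads are illustrative only): the first principal component follows the x-axis, so projecting onto it mixes the clusters, while the y coordinate still separates them.

rng = np.random.default_rng(1)
cluster_a = rng.normal([0.0,  1.0], [5.0, 0.3], size=(200, 2))   # wide in x, tight in y
cluster_b = rng.normal([0.0, -1.0], [5.0, 0.3], size=(200, 2))
Xc = np.vstack([cluster_a, cluster_b]).T
Xc = Xc - Xc.mean(axis=1, keepdims=True)

Uc, sc, Vhc = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = sc[0] * Vhc[0]        # projection onto the first principal component
# pc1_scores of the two clusters overlap heavily; their y coordinates do not.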


Conclusion

The SVD offers both a theoretical grounding for PCA and a practical set of tools.

The motivation for PCA arises organically from the SVD X = UΣVᴴ. Recognizing that the matrix U can be interpreted as a change of basis, and taking advantage of the fact that the diagonal elements of Σ are in decreasing order, allows us to map the features so that the direction of maximum remaining variance lies along each successive axis.

By imagining that small elements of Σ can be ignored, we arrive at optimal (with respect to both the L₂ and Frobenius norms) reduced-dimensional approximations for X.

Finally, by understanding UΣ as a change of basis, we are able to whiten the features to provide unit variance on each axis.


