Deriving Principal Components Analysis from Singular Value Decomposition - Part II (PCA)

I wrote a previous article about Singular Value Decomposition (SVD) here, with the promise that I would eventually use SVD to motivate Principal Components Analysis (PCA). Hopefully you are well-rested, and ready to dive in.

Preliminaries

Sample Matrix

Let's imagine that we have a collection of n samples, each with m dimensions.

Let's call this feature space F. Typically, there are many more samples than dimensions, so m < n. We will arrange our samples into a matrix X, where each column of X is a sample in F. In the end, PCA is a change of basis, and basis changes don't do translation, so let's assume that the samples have already been translated to have mean zero in each coordinate.

Below is an example of samples in 2 dimensions:

[Figure: example samples in two dimensions]
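
If a concrete sketch helps, here is one way such a sample matrix might be built with NumPy. The specific numbers (500 samples, the particular mean and covariance) are made up for illustration and are not the data behind the figures:

import numpy as np

# Hypothetical example: n = 500 samples in m = 2 dimensions.
rng = np.random.default_rng(0)
n = 500
samples = rng.multivariate_normal(mean=[3.0, -1.0],
                                  cov=[[3.0, 1.2], [1.2, 1.0]],
                                  size=n)              # shape (n, 2)

# Arrange the samples as the columns of X (shape m x n) ...
X = samples.T

# ... and translate each coordinate to mean zero, as assumed above.
X = X - X.mean(axis=1, keepdims=True)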


Change of Basis

We may want to map samples from their original coordinates in F to some new set of coordinates. This operation is known as a change of basis. For example, we might want to stretch and rotate the axes. The mapping between a vector x⃗ in the old coordinates and the vector n⃗ in the new set of coordinates is given by a change of basis map

x⃗ = An⃗

where the columns of A are linearly independent vectors in F. Writing it this way around shows explicitly that it is possible to recover the original vector x⃗ from n⃗. If we are changing the basis for all the samples at once, we can write

X = AN

where each column n⃗ᵢ of N gives the new coordinates of the sample x⃗ᵢ.

Interestingly, for a given matrix X, A need not be square, so long as the columns of A span the column space of X. If X has rank r, then all the samples in X belong to some r-dimensional subspace of F and could be completely described with r coordinates. Thus, A could have as few as r columns.
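
As a small sketch of a non-square change of basis (continuing the NumPy example above, with a made-up matrix B used only to embed the samples in three dimensions):

# Embed the 2-D samples in a 3-D space; X3 has rank 2.
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
X3 = B @ X                                  # shape 3 x n, but rank 2

# Any A whose columns span the column space of X3 will do, so A can be 3 x 2.
A = B
N, *_ = np.linalg.lstsq(A, X3, rcond=None)  # r = 2 new coordinates per sample
assert np.allclose(A @ N, X3)               # the original samples are recovered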

Singular Value Decomposition

Let's remind ourselves of the Singular Value Decomposition. First, we need some nomenclature.

For a matrix A, define Aᴴ as the conjugate transpose. If A is real, Aᴴ = Aᵀ, the transpose of A. Instead of ᴴ, some sources use † and some use *.

A matrix A is called unitary if Aᴴ = A⁻¹. A square matrix A is unitary if and only if all the column vectors are orthonormal, which is to say perpendicular to each other and of norm 1. Similarly, all the row vectors of a square matrix A are orthonormal if and only if A is unitary.

An m x n matrix X can be decomposed as

X = UΣVᴴ

where U is an m x m unitary matrix, V is an n x n unitary matrix, and Σ is an m x n diagonal matrix whose diagonal entries are non-negative and arranged in decreasing order. The diagonal elements of Σ are called the singular values. The number of positive singular values in Σ corresponds to the rank of X.

If X has real entries, then U and V also have real entries.

We will be considering a situation where m < n. The shape of this decomposition looks like this:

[Figure: the shapes of U, Σ and Vᴴ in the full SVD when m < n]


Here, the dark blue diagonal of Σ indicates where all the non-zero values lie. The light blue portions are all zero.
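
Continuing the NumPy sketch from earlier, the full decomposition and its shapes look like this; note that np.linalg.svd returns the singular values as a vector s rather than as the full m x n matrix Σ:

U, s, Vh = np.linalg.svd(X, full_matrices=True)
print(U.shape, s.shape, Vh.shape)        # (m, m), (min(m, n),), (n, n)

# Rebuild the m x n matrix Sigma with the singular values on its diagonal.
Sigma = np.zeros(X.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
assert np.allclose(U @ Sigma @ Vh, X)    # X = U Σ Vᴴ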

The Compact SVD

The picture above teaches us that the full SVD is wasteful. In our example, the lower portion of Vᴴ is guaranteed to be multiplied by zero:

[Figure: the lower portion of Vᴴ is multiplied only by zeros]


It makes no difference to the outcome what you put there. Similarly, if X is of rank r < m, some rows at the bottom of Σ will be zero, and some columns on the right-hand side of U are guaranteed to be multiplied by zero:

[Figure: when rank r < m, the bottom rows of Σ are zero and the rightmost columns of U are multiplied only by zeros]

We can eliminate all these extra dimensions by throwing away the irrelevant portions of U and V and restricting Σ to be an r x r diagonal matrix with positive entries on the diagonal:

[Figure: the compact SVD, with Σ reduced to an r x r diagonal matrix]

This is the compact singular value decomposition. It is natural enough that some people think of the compact version as the regular singular value decomposition.
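
A sketch of the compact SVD in NumPy, continuing the example above (full_matrices=False gives the "thin" decomposition, which we then restrict to the numerical rank r):

U, s, Vh = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(s > 1e-12))                         # numerical rank of X
U_r, s_r, Vh_r = U[:, :r], s[:r], Vh[:r, :]
assert np.allclose(U_r @ np.diag(s_r) @ Vh_r, X)   # X is reproduced exactly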

Interpreting U and UΣ as changes of basis

Recall that a change of basis for the matrix X takes the form

X = AN

Substituting U for A and ΣVᴴ for N, we have

X = U (ΣVᴴ)

so U is a change of basis from the m-dimensional space F to an r-dimensional space.

Similarly, UΣ is a change of basis in the formula

X = (UΣ) Vᴴ

Of course, the columns z⃗ᵢ of UΣ are more than linearly independent - they are orthogonal, since for i ≠ k, (z⃗ᵢ, z⃗ₖ) = (σᵢu⃗ᵢ, σₖu⃗ₖ) = σ̄ᵢσₖ(u⃗ᵢ, u⃗ₖ) = 0.
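
A quick numerical check of both interpretations, using the compact factors U_r, s_r, Vh_r from the sketch above:

# With A = U_r, the new coordinates are N = Σ Vᴴ; because the columns of U_r
# are orthonormal, they can also be read off as U_rᴴ X.
N = np.diag(s_r) @ Vh_r
assert np.allclose(U_r @ N, X)
assert np.allclose(U_r.conj().T @ X, N)

# The columns of U_r Σ are mutually orthogonal, as claimed above.
Z = U_r @ np.diag(s_r)
assert np.allclose(Z.T @ Z, np.diag(s_r**2))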

Deriving Principal Components Analysis

Recall that we have constructed X with each column vector representing a single sample. We will motivate PCA by approximating X.

Approximation

If we wanted to approximate X with a lower-dimensional basis, how would we go about it? We know that the singular values on the diagonal of Σ are arranged in decreasing order, so what if we just ... set the smaller singular values to zero? Let's imagine we only retain the p largest singular values.

Our decomposition would look like this:

[Figure: the decomposition with only the p largest singular values retained]

and once again, U and V have columns that are destined to be multiplied by zero:

[Figure: the columns of U and V that are now multiplied only by zeros]


Compacting as before, we let Σ̂ be an invertible p x p diagonal matrix:

[Figure: the compacted rank-p approximation with a p x p matrix Σ̂]

This gives us

X ≈ ÛΣ̂V̂ᴴ

We won't prove it here, but the Eckart–Young–Mirsky theorem tells us that we have stumbled on the best possible p-dimensional approximation, both for the L₂ norm and the Frobenius norm. The idea of the proof is that our approximation errs by failing to add vectors scaled by small singular values. Any other approximation must err by failing to add vectors scaled by larger singular values, so it won't be as good.
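
Here is a minimal sketch of the rank-p truncation, again with the compact factors from above and an illustrative choice of p = 1. The Frobenius error it checks is exactly the square root of the sum of the squared discarded singular values:

p = 1
X_hat = U_r[:, :p] @ np.diag(s_r[:p]) @ Vh_r[:p, :]   # best rank-p approximation
err = np.linalg.norm(X - X_hat, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s_r[p:]**2)))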

Interpreting SVD as Principal Components Analysis

In the SVD X = UΣVᴴ, U is an orthonormal change of basis from F to an r-dimensional space, with the maximum variance along the first axis and decreasing variance along each subsequent axis. The columns of U are the principal components. In our example, the principal components look like this:

[Figure: the principal components of the example data]

The new coordinates are given by ΣVᴴ. Here, the transformed data looks just like the original data, only rotated so that each successive coordinate axis lies along the direction of maximum remaining variance:

[Figure: the example data in principal component coordinates]


The fraction of explained variance along the i-th principal component is proportional to σᵢ²:

var(i) = σᵢ² / Σₖσₖ²
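In code, using s_r from the compact decomposition above (a sketch):

# Fraction of variance explained by each principal component.
explained = s_r**2 / np.sum(s_r**2)     # one entry per component, summing to 1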

Whitening

Simply rotating the data may not be what's desired. If we have two measurements with vastly different scales, we often like to scale the measurements so that each axis has unit variance. However, if the direction of maximum variance does not lie along an axis, simply rescaling the original axes still leaves the data stretched out.

Looking again at the SVD X = UΣVᴴ, our approach here is to use the basis change UΣ. The basis vectors of UΣ in our example look like this:

[Figure: the basis vectors of UΣ for the example data]

This maps the data onto axes with unit variance on each axis. The new coordinates are given by Vᴴ. Here is the whitened data in the new space:

[Figure: the whitened data in the new coordinates]
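
A sketch of the whitening step with the same compact factors. The optional √n scaling is an assumption of mine rather than something fixed by the decomposition; it makes the sample variance along each axis equal to one:

whitened = Vh_r                                    # r x n: one column per sample
assert np.allclose((U_r @ np.diag(s_r)) @ whitened, X)

# Each row of Vᴴ is a unit vector with mean zero, so scaling by sqrt(n)
# gives unit (population) variance along every axis.
whitened_unit_var = np.sqrt(n) * Vh_r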

Reducing Dimensions

Our approximation X ≈ ÛΣ̂V̂ᴴ gives us reduced-dimension representations of the samples. To retain the original shape of the data, we use Û as a basis change to p dimensions. The new coordinates are given by Σ̂V̂ᴴ.

For a reduced-dimension and whitened representation, we use ÛΣ̂ as a basis change. The new, whitened coordinates are given by V̂ᴴ.
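
In code, with p and X_hat as in the approximation sketch above:

reduced = np.diag(s_r[:p]) @ Vh_r[:p, :]     # p coordinates per sample
whitened_reduced = Vh_r[:p, :]               # whitened p coordinates per sample
assert np.allclose(U_r[:, :p] @ reduced, X_hat)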

A Final Note of Caution

Reducing dimensions by PCA allows us to construct the best p-dimensional approximation to the original data, but possibly not the most useful. Let's extend our example to include two clusters of data:

[Figure: the example extended to two clusters of data]

If we apply PCA, the bulk of the variance lies along the x-axis:

[Figure: the principal components of the two-cluster data, with most of the variance along the x-axis]

If we now reduce to one dimension, the two clusters are indistinguishable. Here are the histograms of the two clusters when projected onto the x-axis:

[Figure: histograms of the two clusters projected onto the x-axis, overlapping heavily]

A more useful idea in this circumstance is to retain the y-axis, even though it's a bad approximation to the original data. Along the y-axis, the two clusters are quite distinguishable:

[Figure: histograms of the two clusters along the y-axis, clearly separated]
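
To make the caution concrete, here is a small sketch with made-up clusters (the sizes, means and spreads are illustrative only): the first principal component follows the x-axis, so projecting onto it mixes the clusters, while the y coordinate still separates them.

rng = np.random.default_rng(1)
cluster_a = rng.normal([0.0,  1.0], [5.0, 0.3], size=(200, 2))   # wide in x, tight in y
cluster_b = rng.normal([0.0, -1.0], [5.0, 0.3], size=(200, 2))
Xc = np.vstack([cluster_a, cluster_b]).T
Xc = Xc - Xc.mean(axis=1, keepdims=True)

Uc, sc, Vhc = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = sc[0] * Vhc[0]        # projection onto the first principal component
# pc1_scores of the two clusters overlap heavily; their y coordinates do not.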


Conclusion

The SVD offers both a theoretical grounding for PCA and a practical set of tools.

The motivation for PCA arises organically from the SVD X = UΣVᴴ. Recognizing that the matrix U can be interpreted as a change of basis, and taking advantage of the fact that the diagonal elements of Σ are in decreasing order, allows us to map the features so that the direction of maximum remaining variance lies along each successive axis.

By imagining that small elements of Σ can be ignored, we arrive at optimal (with respect to both the L₂ and Frobenius norms) reduced-dimensional approximations for X.

Finally, by understanding UΣ as a change of basis, we are able to whiten the features to provide unit variance on each axis.


