Deriving Principal Components Analysis from Singular Value Decomposition - Part I (SVD)
Perhaps the fault is mine. I am sometimes more of a geometric thinker than an algebraic thinker. In any case, I am working on a series of AI lectures for my son and his university peers, and I was not satisfied with the articles I found on using Singular Value Decomposition (SVD) to perform Principal Components Analysis (PCA), so I want to write it up from my perspective. The write-up will have two parts -- the first will explain SVD as a useful tool and an analog to eigenspaces and eigenvalues. The second will motivate PCA using SVD. You can find the second article here.
What is SVD?
I want to explore the significance of SVD and show how it extends some useful tools from a small class of matrices to general matrices. We will start with a brief note on complex matrices.
A note on complex matrices
Bottom line up front: In spaces over complex numbers, replace the transpose operator ᵀ with the conjugate transpose operator ᴴ.
We like a couple of ideas about vector spaces. First, that the inner product gives a norm:
(u⃗, u⃗)=‖u⃗‖²
and second, that the inner product is given by matrix multiplication:
(u⃗, v⃗) = u⃗ᵀ v⃗
This doesn't work correctly in complex spaces. Given a vector u⃗ with complex entries uᵢ,
u⃗ᵀu⃗ = Σuᵢ²
but the Euclidean norm of u⃗ is given by
‖u⃗‖² = Σūᵢuᵢ
Instead of a transpose operation, what we need is an operation that conjugates and transposes. Displaying absolutely no imagination, we will call it a conjugate transpose. We need a new symbol for the conjugate transpose, and since T is already taken, sometimes physicists use †. Unfortunately, † is typographically difficult, and is also sometimes used for something else. Some authors use * instead, but sadly that notation means something different in quantum mechanics. Within this document, we will settle on ᴴ, because the conjugate transpose is sometimes called the Hermitian transpose. Thus,
(u⃗, v⃗) = u⃗ᴴ v⃗.
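If you like to check such things numerically, here is a minimal numpy sketch (the vector below is just an arbitrary example of my own): the plain transpose product Σuᵢ² gives a complex number, while the conjugate transpose product matches the squared Euclidean norm.

```python
import numpy as np

u = np.array([1 + 2j, 3 - 1j])        # an arbitrary complex vector

plain = u.T @ u                       # Σ uᵢ² -- a complex number, not a length
hermitian = u.conj().T @ u            # Σ ūᵢuᵢ -- real and non-negative

print(plain)                          # (5-2j)
print(hermitian.real)                 # 15.0
print(np.linalg.norm(u) ** 2)         # 15.0, matching uᴴu
```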
Conjugate transpose is its own inverse:
Aᴴᴴ = A
Also, the product rule for transpose works here:
(AB)ᴴ = BᴴAᴴ
Finally, we need an analog to symmetry. If
U = Uᴴ
we will say that U is Hermitian (named after the inimitable Charles Hermite).
An aside on Charles Hermite
Bottom line up front: Charles Hermite's story is interesting. That is all.
Let's pause a moment to acknowledge Charles Hermite, born 1822.
Interested in mathematics, he was accepted to the prestigious École Polytechnique in 1842. However, after a year they decided to expel him, because he had a congenital defect in his right foot which would make him unsuitable for military service. After intervention by Hermite's advocates, the expulsion was withdrawn, but it was too late. Hermite left.
Hermite continued to study mathematics and befriended Joseph Bertrand (Bertrand's paradox, also the inventor of Dedekind cuts. Don't ask - this aside is long enough as it is), Carl Jacobi (Jacobian matrix, Jacobian determinant) and Joseph Liouville (Riemann-Liouville integral, Liouville λ function, and the Liouville crater). Well, I say he "befriended" Bertrand. He did, but he also married Bertrand's sister. I bet when that family got together, Charles and Joseph were under strict but ineffective instructions not to talk math the whole time.
By 1848 Hermite was back at École Polytechnique - as an instructor. His students included Henri Poincaré (algebraic topology, Poincaré conjecture) and Jacques Hadamard (Hadamard's matrix, Hadamard's inequality, Hadamard quantum gate).
Hermite caught smallpox in 1856 and survived the experience. Influenced by the nun who nursed him and by fellow mathematician Augustin-Louis Cauchy (complex analysis, Cauchy sequences, Cauchy's transmission equation), Hermite became Catholic.
In 1873, Hermite cemented his place in the pantheon of mathematics by proving that e is transcendental. Following his technique, Ferdinand Lindemann proved in 1882 that π is transcendental.
Despite all that, Hermite comes to our attention today only because Hermitian matrices are named after him. Hermite proved in 1855 that the eigenvalues of Hermitian matrices are real.
You can see this for yourself. If A = Aᴴ, consider an eigenvector u⃗, a non-zero vector such that Au⃗ = λ u⃗
u⃗ᴴ Au⃗ = u⃗ᴴ λu⃗ = λ u⃗ᴴ u⃗ = λ ‖u⃗‖²
Taking the conjugate transpose, we have
[u⃗ᴴ A u⃗]ᴴ = λ̅‖u⃗‖²
But we know that [u⃗ᴴ A u⃗]ᴴ = u⃗ᴴ Aᴴ u⃗ = u⃗ᴴ A u⃗, so
λ ‖u⃗‖² = λ̅‖u⃗‖²
Since ‖u⃗‖² is not zero,
λ = λ̅
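You can also let numpy confirm this (the matrix below is a randomly generated example, not anything special): symmetrize a random complex matrix into a Hermitian one and check that every eigenvalue has a negligible imaginary part.

```python
import numpy as np

rng = np.random.default_rng(0)

B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
A = B + B.conj().T                       # A = Aᴴ by construction

eigenvalues = np.linalg.eigvals(A)       # general routine, returns complex numbers
print(np.max(np.abs(eigenvalues.imag)))  # ~1e-15: the eigenvalues are real
print(np.linalg.eigvalsh(A))             # eigvalsh uses the Hermitian structure directly
```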
We didn't really need to know all that, but I hope you enjoyed the journey. In any case, having taken a moment to appreciate Charles Hermite, we are ready to move on.
Eigenvectors diagonalize, and I want them
Bottom line up front: Any non-defective matrix X can be decomposed as X=UDU⁻¹, where the column vectors of U are unit eigenvectors and D is a diagonal matrix with eigenvalues on the diagonal. Xᵖ=UDᵖU⁻¹. If X is invertible, then X⁻¹=UD⁻¹U⁻¹
For some matrix X, recall that an eigenvector u⃗ is a vector that is scaled by the application of X by some factor λ:
X u⃗ = λ u⃗
We should notice at this moment that X must be square. If u⃗ has n entries, then X has n columns, because the product X u⃗ requires it. X has n rows because the product X u⃗ is a column vector with n rows. You might be tempted to add rows or columns to X to make it square. We will introduce a trick later, however, to create square matrices and I think you will agree it's more useful.
We should also notice something more subtle: There is no guarantee that λ will be real. From the definition,
[X - λI ] u⃗ = 0⃗
so we know that X - λI is singular and thus det(X - λI) = 0. An eigenvalue is a solution to the polynomial det(X - λI) = 0, and complex solutions are possible. In honour of Hermite's student Hadamard, let's consider the two dimensional Hadamard matrix:
Inadvertently, we have entered the world of complex matrices. Happily, life is fine here, with the adjustments explained above.
Next, let's write the definition of an eigenvector slightly differently, embedding λ in a matrix:
X u⃗ = u⃗ [λ]
This new formulation allows us to write down all the eigenvectors in a single equation. Let U be a matrix whose columns are eigenvectors. Then we can write:
X [u⃗₁ u⃗₂ ⋯ u⃗ₖ] = [u⃗₁ u⃗₂ ⋯ u⃗ₖ] diag(λ₁, λ₂, …, λₖ)
Let's call this diagonal matrix of eigenvalues D so that our definition of eigenvalues reads
XU=UD.
It's good to notice the power of definitions here. If we ever see an equation like this, the column vectors of U are eigenvectors and the diagonal elements of D are eigenvalues -- by definition.
We should also notice the possibility that some eigenvalues λᵢ = 0, so even if D is square, there is no guarantee that D is invertible.
We have a few freedoms. We are free to place the eigenvectors in any order, so we can arrange the elements of D so that |λᵢ| ≥ |λᵢ₊₁|. We can also scale each of the k eigenvectors to have norm 1, where k is the number of independent eigenvectors we can find. Interestingly, algebraists don't care for the possibility that k < n. Matrices with this property are called defective. Mathematicians can be hostile sometimes.
If X is not defective, it is diagonalizable. For non-defective X, U has n independent columns, so it is invertible, and we can multiply both sides by U⁻¹:
X=UDU⁻¹
This expression of X has fantastic properties. If D is invertible, D⁻¹ is easy to find -- simply invert all the elements on the diagonal. X is invertible if and only if D is invertible, and its inverse is trivial to find:
X⁻¹=UD⁻¹U⁻¹
Powers of X are also easy to find:
Xᵖ=UDᵖU⁻¹
In fact, you can use this form to define arbitrary exponents, should you need them.
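Here is a short numpy sketch of these identities, using an arbitrary non-defective matrix of my own choosing: np.linalg.eig supplies U and the diagonal of D, and the formulas for X, X⁻¹, and Xᵖ all check out.

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [1.0, 3.0]])                 # an arbitrary, non-defective example

eigenvalues, U = np.linalg.eig(X)          # columns of U are unit eigenvectors
D = np.diag(eigenvalues)
U_inv = np.linalg.inv(U)

print(np.allclose(X, U @ D @ U_inv))                                         # X = U D U⁻¹
print(np.allclose(np.linalg.inv(X), U @ np.diag(1 / eigenvalues) @ U_inv))   # X⁻¹ = U D⁻¹ U⁻¹
print(np.allclose(np.linalg.matrix_power(X, 3), U @ D**3 @ U_inv))           # X³ = U D³ U⁻¹
```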
The magic of XᴴX
Bottom line up front: For any matrix X, XᴴX can be decomposed as XᴴX=VΣ̂²Vᴴ where V is unitary and Σ̂ is a diagonal matrix with non-negative entries.
I haven't forgotten that eigenvectors require square matrices. For an arbitrary matrix X, we can create a square matrix XᴴX, which will turn out to have nice properties.
First of all, XᴴX is Hermitian, as
[XᴴX ]ᴴ = [X]ᴴ [Xᴴ]ᴴ = XᴴX
We won't prove it here, but the spectral theorem tells us that a Hermitian matrix A is diagonalizable in the form
A=VDVᴴ
where the entries of D are real, V is square, and V is unitary, a term that means Vᴴ=V⁻¹.
Unitary matrices are themselves interesting, since the evolution of a quantum state can be expressed as a unitary operator.
Interestingly, the spectral theorem was proven by Hermite's colleague Cauchy for real matrices, but a more general formulation had to wait for John von Neumann, who was born in 1903, almost three years after Hermite's death in early 1901.
XᴴX has yet another cool property. As above, consider an eigenvector and eigenvalue of XᴴX:
v⃗ᴴ XᴴXv⃗ = λ ‖v⃗‖²
But now,
v⃗ᴴ XᴴXv⃗ = [Xv⃗]ᴴ Xv⃗ = ‖Xv⃗‖²
Since
λ ‖v⃗ ‖² = ‖Xv⃗ ‖²,
λ must be non-negative. That being the case, we can define σ ≥ 0 such that
σ² = λ
We will call σ a singular value of X, which is convenient, as singular values are defined for all matrices, not just the square ones. It will turn out that the singular values calculated from XXᴴ are the same.
We should pause to notice something else. Since λ ‖v⃗ ‖² = ‖Xv⃗ ‖², it must be true that if λ=0, then Xv⃗ = 0⃗, which is to say that v⃗ is in the null space of X. It's not critical to our story, but it turns out that the right-hand columns of V (those paired with zero eigenvalues, given our ordering) form a basis for the null space of X.
We now know that XᴴX can be decomposed as
XᴴX = VΣ̂²Vᴴ
Where V is unitary and Σ̂ is a diagonal matrix with non-negative entries.
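A quick numerical illustration, again with an arbitrary rectangular example matrix: XᴴX is Hermitian, so np.linalg.eigh diagonalizes it with a unitary V, and its eigenvalues come out non-negative, as promised.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))              # an arbitrary 5 x 3 example

gram = X.conj().T @ X                        # XᴴX: 3 x 3 and Hermitian
eigenvalues, V = np.linalg.eigh(gram)        # V is unitary, eigenvalues are real

print(eigenvalues.min() >= -1e-12)           # non-negative, up to round-off
sigma_hat = np.sqrt(np.clip(eigenvalues, 0, None))
print(np.allclose(gram, V @ np.diag(sigma_hat**2) @ V.conj().T))   # XᴴX = V Σ̂² Vᴴ
```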
The Singular Value Decomposition
Bottom line up front: Any matrix X can be decomposed as X = UΣVᴴ, where U and V are unitary and Σ is a diagonal matrix with decreasing, non-negative entries.
We have at last arrived at the point. Given an n x m matrix X, we know that
XᴴX = VΣ̂²Vᴴ
By symmetry, XXᴴ is similarly decomposable:
XXᴴ = UΣ²Uᴴ
We will wind up constructing U so that the SVD becomes more clear. If Σ̂ is invertible, everything is easy. Define
U = XVΣ̂⁻¹
It follows that
X = UΣ̂Vᴴ.
U is unitary, as
Uᴴ U = [Σ̂⁻¹VᴴXᴴ] [XVΣ̂⁻¹] = Σ̂⁻¹Vᴴ [XᴴX] VΣ̂⁻¹ = Σ̂⁻¹Vᴴ [VΣ̂²Vᴴ] VΣ̂⁻¹ = Σ̂⁻¹Σ̂²Σ̂⁻¹ = I.
We have to work a little harder when Σ̂ is not invertible, but we can still get there. Define a new matrix Σ of n rows and m columns whose diagonal entries are those of Σ̂ (truncating or padding with zeros as needed).
Define a new matrix U whose column vectors u⃗ᵢ are given by (1/σᵢ)Xv⃗ᵢ, unless σᵢ=0.
For non-zero values of σᵢ, σᵣ
u⃗ᵢᴴu⃗ᵣ = 1/(σᵢσᵣ)v⃗ᵢᴴ Xᴴ X v⃗ᵣ
If i ≠ r, v⃗ᵢᴴ v⃗ᵣ = 0, so
u⃗ᵢᴴu⃗ᵣ = 1/(σᵢσᵣ) v⃗ᵢᴴ [Xᴴ X v⃗ᵣ] = 1/(σᵢσᵣ)v⃗ᵢᴴ σᵣ²v⃗ᵣ = 0
On the other hand, if i = r,
u⃗ᵢᴴu⃗ᵣ = 1/(σᵢ²)v⃗ᵢᴴ [Xᴴ X v⃗ᵢ] = 1/(σᵢ²)v⃗ᵢᴴ σᵢ²v⃗ᵢ = 1.
That is to say, these vectors u⃗ᵢ are orthonormal. Also,
σᵢu⃗ᵢ = X v⃗ᵢ
We have yet to define u⃗ᵢ if σᵢ = 0. Once we hit zero singular values, we build the remainder of U from the left null space of X, the vectors satisfying Xᴴu⃗ᵢ = 0⃗. We can construct U in this way to be unitary.
For these additional vectors, it is trivially true that
σᵢu⃗ᵢ = 0 = X v⃗ᵢ
Thus, we can always write
UΣ = XV
Multiplying both sides on the right by Vᴴ, we have
UΣVᴴ = X
With this, we have at last found our diagonalization for an arbitrary matrix.
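In practice nobody builds U and V by hand; np.linalg.svd performs the construction above for you. Here is a sketch with an arbitrary example matrix, confirming that X = UΣVᴴ, that U is unitary, and that the squared singular values are the eigenvalues of XᴴX.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 3))                  # an arbitrary 5 x 3 example

U, sigma, Vh = np.linalg.svd(X)                  # full SVD: U is 5 x 5, Vh is 3 x 3
Sigma = np.zeros((5, 3))
np.fill_diagonal(Sigma, sigma)                   # embed the singular values in a 5 x 3 Σ

print(np.allclose(X, U @ Sigma @ Vh))            # X = U Σ Vᴴ
print(np.allclose(U.conj().T @ U, np.eye(5)))    # U is unitary
print(np.allclose(np.sort(sigma**2), np.sort(np.linalg.eigvalsh(X.conj().T @ X))))  # σ² are eigenvalues of XᴴX
```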
The Properties of SVD
There is an argument, which I will take up in the next article, that we should relax the requirement that U and V be square matrices in favor of creating an invertible diagonal matrix Σ. There is good sense to that, and I look forward to discussing it. Regardless, there are helpful consequences that arise from the SVD.
The Moore-Penrose pseudoinverse
Bottom line up front: Define a diagonal matrix Σᐩ by transposing Σ and inverting all its non-zero elements. The Moore-Penrose pseudoinverse Xᐩ = VΣᐩUᴴ is always defined.
Define a diagonal matrix Σᐩ by transposing Σ and inverting all of its non-zero elements. If Σ is invertible, Σ⁻¹ = Σᐩ. If X is invertible, its inverse is easy to find:
X⁻¹ = VΣᐩUᴴ
Imagine we want to solve for some unknown B where
XB = Y
Then
B = X⁻¹Y = VΣᐩUᴴY
But even if X is not invertible, Xᐩ = VΣᐩUᴴ is still defined, and is called the Moore-Penrose pseudoinverse.
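As a sketch of the construction (once more with an arbitrary example matrix): build Xᐩ directly from the SVD by inverting only the non-zero singular values, and compare against numpy's own np.linalg.pinv.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 3))                          # an arbitrary 5 x 3 example

U, sigma, Vh = np.linalg.svd(X, full_matrices=False)     # the thin SVD is enough here

# Invert only the non-zero singular values (with a small tolerance for round-off).
sigma_plus = np.array([1 / s if s > 1e-12 else 0.0 for s in sigma])
X_plus = Vh.conj().T @ np.diag(sigma_plus) @ U.conj().T  # Xᐩ = V Σᐩ Uᴴ

print(np.allclose(X_plus, np.linalg.pinv(X)))            # matches numpy's pseudoinverse
```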
Just for a moment, let's pause to consider this name. The Moore-Penrose pseudoinverse was independently invented by E.H. Moore in 1920, Arne Bjerhammar in 1951 and Roger Penrose in 1955. Surely it would be more proper either to call this construct the "Moore pseudoinverse" or the "Moore-Bjerhammar-Penrose pseudoinverse." Sadly, I have been calling it "Moore-Penrose" for more than 30 years now, and it's the only name I can remember.
Ordinary Least Squares
Bottom line up front: B ≈ XᐩY is an ordinary least squares solution to the linear regression problem.
Let's imagine a situation where we have (x, y) pairs, where x (the independent variable) is a point in ℝⁿ and y (the dependent variable) is an outcome in ℝʳ. Suppose we have m such pairs. Then our problem above can be phrased as
XB = Y,
where X is a known m x n matrix, B is an unknown n x r matrix, and Y is a known m x r matrix. Finding such a B is the task of linear regression.
We can attempt to solve our equation by application of the Moore-Penrose pseudoinverse
B ≈ XᐩY
This solution is always defined. Also, it is always an Ordinary Least Squares (OLS) solution. If X has full rank (which is common), the OLS solution is unique.
If X does not have full rank, then there are data problems. If you insist on going ahead, then please stop for a moment and think again. You have data problems. If you have thought twice, then you can take comfort that B ≈ XᐩY is still a least squares solution -- in fact, the one of minimum norm. But please don't do this. Terrible things happen when you rely on bad data.
Of course, numerical instabilities could arise, which would require us to be slightly more clever, perhaps through the use of PCA, or Lasso or Ridge regression, but that is a topic for another day.
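To make the regression story concrete, here is a minimal sketch on synthetic data (the shapes, noise level, and variable names are illustrative assumptions of mine): the pseudoinverse solution coincides with numpy's dedicated least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(4)

m, n, r = 100, 4, 2                                    # 100 samples, 4 features, 2 outcomes
X = rng.standard_normal((m, n))                        # known independent variables
B_true = rng.standard_normal((n, r))                   # the coefficients we hope to recover
Y = X @ B_true + 0.1 * rng.standard_normal((m, r))     # noisy dependent variables

B_hat = np.linalg.pinv(X) @ Y                          # B ≈ Xᐩ Y, the OLS solution
B_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)        # numpy's least-squares solver

print(np.allclose(B_hat, B_lstsq))                     # the two solutions agree
print(np.linalg.norm(B_hat - B_true))                  # small: close to B_true, up to noise
```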
Conclusion
The motivation for SVD decomposition of an arbitrary matrix X arises naturally as a consequence of the beautiful diagonalization properties of eigenvectors. By exploiting the amazing diagonalization properties of XᴴX, we arrive at a natural decomposition X = UΣVᴴ, where U and V are unitary and Σ is diagonal with non-negative entries. We also discover Xᐩ = VΣᐩUᴴ, the Moore-Penrose pseudoinverse, which provides an ordinary least squares solution to the linear regression problem. The procedure for using SVD for Principal Components Analysis (PCA) has to wait for the next article.