Principal Component Regression: Part 1 – The Magic of the SVD

Introduction

This is the first part of a multi-part series on Principal Component Regression, or PCR for short. We will eventually end up with a computational algorithm for PCR and code it up in C# using the NMath libraries. PCR is a method for constructing a linear regression model when we have a large number of predictor variables that are highly correlated. Of course, we don’t know exactly which variables are correlated; otherwise we’d just throw them out and perform an ordinary linear regression.

In order to understand what is going on in the PCR algorithm, we need to know a little bit about the SVD (Singular Value Decomposition). Understanding a bit about the SVD and its relationship to the eigenvalue decomposition will go a long way toward understanding the PCR algorithm.

The Singular Value Decomposition

The SVD (Singular Value Decomposition) is one of the most revealing matrix decompositions in linear algebra. It is a bit expensive to compute, but the bounty of information it yields is awe-inspiring. Understanding a little about the SVD will illuminate the Principal Component Regression (PCR) algorithm. The SVD may seem like a deep and mysterious thing; at least I thought it was until I read the chapters covering it in the book “Numerical Linear Algebra” by Lloyd N. Trefethen and David Bau, III, which I summarize below.

We begin with an easy-to-state, and not-too-difficult-to-prove, geometric fact about linear transformations.

A Geometric Fact

Let $S$ be the unit sphere in $\mathbb{R}^n$, and let $X$ be any $m \times n$ matrix mapping $\mathbb{R}^n$ into $\mathbb{R}^m$, and suppose, for the moment, that $X$ has full rank. Then the image, $XS$, of $S$ under $X$ is a hyperellipse in $\mathbb{R}^m$ (see the book for the proof).

Figure 1. SVD of a 2x2 matrix
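As a concrete illustration of this fact (the specific matrix below is my own choice and is not taken from Figure 1), a symmetric $2 \times 2$ matrix stretches the unit circle into an ellipse whose semiaxes point along its eigenvector directions:

$$
X = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
X \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix} = 3 \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}, \qquad
X \begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix} = 1 \begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix}.
$$

The image $XS$ of the unit circle is the ellipse with semiaxes of length 3 and 1 along those two directions; we will reuse this matrix below.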

Given this fact, we make the following definitions (refer to Figure 1):

Define the singular values,

$$\sigma_1, \sigma_2, \ldots, \sigma_n,$$

of $X$ to be the lengths of the principal semiaxes of the hyperellipse $XS$. It is conventional to assume the singular values are numbered in descending order,

$$\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n > 0.$$

Define the left singular vectors,

$$u_1, u_2, \ldots, u_n,$$

to be unit vectors in the directions of the principal semiaxes of $XS$, and define the right singular vectors,

$$v_1, v_2, \ldots, v_n,$$

to be the pre-images of the principal semiaxes of $XS$, so that

$$X v_j = \sigma_j u_j, \qquad j = 1, \ldots, n.$$

In matrix form we have

$$X V = U \Sigma, \qquad (1)$$

where $V$ is the $n \times n$ matrix whose orthonormal columns are the right singular vectors of $X$, $\Sigma$ is an $n \times n$ diagonal matrix with positive entries equal to the singular values, and $U$ is an $m \times n$ matrix whose orthonormal columns are the left singular vectors.
Since the columns of $V$ are orthonormal by construction, $V$ is a unitary matrix, that is, its transpose is equal to its inverse, thus we can write

$$X = U \Sigma V^T. \qquad (2)$$

And there you have it, the SVD in all its majesty! Actually, the above decomposition is what is known as the reduced SVD. Note that the columns of $U$ are orthonormal vectors in $m$-dimensional space. $U$ can be extended to a unitary $m \times m$ matrix by adjoining $m - n$ additional orthonormal columns. If, in addition, we append $m - n$ rows of zeros to the bottom of the matrix $\Sigma$, those rows effectively multiply the appended columns of $U$ by zero, thus preserving equation (2). When $U$ and $\Sigma$ are modified in this way, equation (2) is called the full SVD.
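For the illustrative $2 \times 2$ matrix introduced above (my own example, not one taken from the post’s figures), the factors of equation (2) can be written down explicitly; here $m = n$, so the reduced and full SVDs coincide:

$$
X = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} = U \Sigma V^T, \qquad
U = V = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}.
$$

Multiplying the factors back together recovers $X$, and the columns of $U$ and $V$ are exactly the semiaxis directions and their pre-images from the geometric picture.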

The Relationship Between Singular Values and Eigenvalues

There is an important relationship between the singular values of $X$ and the eigenvalues of $X^T X$. Recall that a vector $v$ is an eigenvector with corresponding eigenvalue $\lambda$ for a matrix $A$ if and only if $A v = \lambda v$. Now, suppose we have the full SVD for $X$ as in equation (2). Then

$$X^T X = (U \Sigma V^T)^T (U \Sigma V^T) = V \Sigma^T U^T U \Sigma V^T = V \Sigma^T \Sigma V^T,$$

or,

$$X^T X \, V = V D,$$

where we have used the fact that $U$ and $V$ are unitary and set

$$D = \Sigma^T \Sigma.$$

Note that $D$ is a diagonal matrix with the singular values squared along the diagonal. From this it follows that the columns of $V$ are eigenvectors for $X^T X$ and the main diagonal of $D$ contains the corresponding eigenvalues. Thus the nonzero singular values of $X$ are the square roots of the nonzero eigenvalues of $X^T X$.
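Checking this against the running $2 \times 2$ example (again, my own illustrative matrix):

$$
X^T X = \begin{pmatrix} 5 & 4 \\ 4 & 5 \end{pmatrix}, \qquad
X^T X \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix} = 9 \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}, \qquad
X^T X \begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix} = 1 \begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix}.
$$

The eigenvalues $9 = 3^2$ and $1 = 1^2$ are the squares of the singular values, and the eigenvectors are precisely the columns of $V$.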

We need one more very cool fact about the SVD before we get to the algorithm: low-rank approximation.

Low-Rank Approximation

Suppose now that $X$ has rank $r$ and write $\Sigma$ in equation (2) as the sum of $r$ rank-one matrices, $\Sigma = \Sigma_1 + \Sigma_2 + \cdots + \Sigma_r$ (each rank-one matrix $\Sigma_j$ is all zeros except for $\sigma_j$ as the $j$th diagonal element). We can then, using equation (2), write $X$ as the sum of $r$ rank-one matrices,

$$X = \sum_{j=1}^{r} \sigma_j u_j v_j^T. \qquad (3)$$

Equation (3) gives us a way to approximate any rank $r$ matrix by a matrix of lower rank. Indeed, given $k < r$, form the partial sum

$$X_k = \sum_{j=1}^{k} \sigma_j u_j v_j^T.$$

Then $X_k$ is a rank $k$ approximation to $X$. How good is this approximation? It turns out to be the best rank $k$ approximation you can get (in both the 2-norm and the Frobenius norm).
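To make this concrete with the running $2 \times 2$ example (my own illustrative matrix), the best rank-one approximation keeps only the first term of the sum in equation (3):

$$
X_1 = \sigma_1 u_1 v_1^T
    = 3 \cdot \frac{1}{2} \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}
    = \begin{pmatrix} 3/2 & 3/2 \\ 3/2 & 3/2 \end{pmatrix}, \qquad
X - X_1 = \begin{pmatrix} 1/2 & -1/2 \\ -1/2 & 1/2 \end{pmatrix},
$$

and $\|X - X_1\|_F = 1 = \sigma_2$, the lone discarded singular value. The code below reports the analogous difference norms numerically for a larger matrix.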

Computing the Low-Rank Approximations Using NMath

The NMath library provides two classes for computing the SVD of a matrix (actually eight, since there are SVD classes for each of the datatypes Double, Float, DoubleComplex, and FloatComplex). There is a basic decomposition class for computing the standard, reduced SVD, and a decomposition server class for when more control is desired. Here is a simple C# routine that constructs the low-rank approximations for a matrix X and prints out the Frobenius norm of the difference between X and each of its low-rank approximations.

static void LowerRankApproximations( DoubleMatrix X )
{
  // Construct the reduced SVD for X. We will consider
  // all singular values less than 1e-15 to be zero.
  DoubleSVDecomp decomp = new DoubleSVDecomp( X );
  decomp.Truncate( 1e-15 );
  int r = decomp.Rank;
  Console.WriteLine( "The {0}x{1} matrix X has rank {2}", X.Rows, X.Cols, r );

  // Construct the best lower rank approximations to X and
  // look at the Frobenius norm of their differences.
  DoubleMatrix LowerRankApprox =
    new DoubleMatrix( X.Rows, X.Cols );
  double differenceNorm;
  for ( int k = 0; k < r; k++ )
  {
    LowerRankApprox += decomp.SingularValues[k] *
      NMathFunctions.OuterProduct( decomp.LeftVectors.Col( k ), decomp.RightVectors.Col( k ) );
    differenceNorm = ( X - LowerRankApprox ).FrobeniusNorm();
    Console.WriteLine( "Rank {0} approximation difference
      norm = {1:F4}", k+1, differenceNorm );
  }
}

Here’s the output for a matrix with 10 rows and 20 columns. Note that the rank can be at most 10.

The 10x20 matrix X has rank 10
Rank 1 approximation difference norm = 3.7954
Rank 2 approximation difference norm = 3.3226
Rank 3 approximation difference norm = 2.9135
Rank 4 approximation difference norm = 2.4584
Rank 5 approximation difference norm = 2.0038
Rank 6 approximation difference norm = 1.5689
Rank 7 approximation difference norm = 1.1829
Rank 8 approximation difference norm = 0.8107
Rank 9 approximation difference norm = 0.3676
Rank 10 approximation difference norm = 0.0000
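For completeness, here is a minimal driver sketch for the routine above. The Main method, the random fill via System.Random, and the element-by-element assignment through the DoubleMatrix row/column indexer are my own choices for illustration (and assume the usual using System; and using CenterSpace.NMath.Core; directives); any way of building a full-rank 10x20 DoubleMatrix will produce output of the same shape, though the norms themselves will differ.

static void Main()
{
  // Fill a 10x20 matrix with pseudo-random values; such a matrix is
  // full rank with probability one, so its rank will be 10.
  Random rand = new Random();
  DoubleMatrix X = new DoubleMatrix( 10, 20 );
  for ( int i = 0; i < X.Rows; i++ )
  {
    for ( int j = 0; j < X.Cols; j++ )
    {
      X[i, j] = rand.NextDouble();
    }
  }

  LowerRankApproximations( X );
}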

-Steve

8 thoughts on “Principal Component Regression: Part 1 – The Magic of the SVD”

  1. Great article; I also dig that book. Also, it looks as if your code snippets aren’t using character entity references correctly (or you’re doubling up?): the loop has the “lt” escape instead of the less-than character.

  2. Hi Ken,

    two questions:

    1. In the section ‘A Geometric Fact’ you say ‘…the columns of V are orthonormal by construction’.
    It may be so, but it is not obvious, since the columns of V are defined as ‘the pre-images of the principal semiaxes of XS’.

    2. In the same section you define ‘u’ as the left singular vectors and ‘v’ as the right singular vectors. Shouldn’t it be the other way around?

    Regards,
    Eran.

  3. 1. Look at the two-dimensional case. Consider the image of the unit sphere $S = \{v : \|v\| \leq 1\}$ under $X$. Since $S$ is compact and $X$ is continuous, there exists a vector $v_1$ such that $\|X v_1\|$ is maximal. That is,

    $$\|X v_1\| \geq \|X v\| \quad \forall\, v \in S.$$

    Set $\sigma u_1 = X v_1$, where $u_1$ is a unit vector, so that $\|X v_1\| = \sigma$. We will show that for any unit vector $v_2$ orthogonal to $v_1$, its image $X v_2$ is orthogonal to $u_1$.

    Choose a unit vector $u_2$ orthogonal to $u_1$ (à la Gram-Schmidt, for example); since $u_1$ and $u_2$ form a basis, we may write

    $$X v_2 = \alpha u_1 + \beta u_2.$$

    We will show that $\alpha = 0$. Set

    $$v = \frac{\sigma v_1 + \alpha v_2}{(\sigma^2 + \alpha^2)^{1/2}}.$$

    Then

    $$X v = \frac{\sigma^2 u_1 + \alpha^2 u_1 + \alpha\beta u_2}{(\sigma^2 + \alpha^2)^{1/2}}
          = \frac{(\sigma^2 + \alpha^2) u_1 + \alpha\beta u_2}{(\sigma^2 + \alpha^2)^{1/2}}.$$

    Taking the norm of both sides and using the fact that $u_1$ and $u_2$ are orthonormal vectors* yields

    $$\|X v\| = \frac{\left(\|(\sigma^2 + \alpha^2) u_1\|^2 + \|\alpha\beta u_2\|^2\right)^{1/2}}{(\sigma^2 + \alpha^2)^{1/2}}
              = \frac{\left((\sigma^2 + \alpha^2)^2 + (\alpha\beta)^2\right)^{1/2}}{(\sigma^2 + \alpha^2)^{1/2}}
              \geq \frac{\sigma^2 + \alpha^2}{(\sigma^2 + \alpha^2)^{1/2}},$$

    so that

    $$\|X v\| \geq (\sigma^2 + \alpha^2)^{1/2} \geq \sigma.$$

    Since $v_1$ and $v_2$ are orthonormal, $v$ is a unit vector, so $v \in S$ and $\|X v\|$ cannot be strictly greater than $\sigma$; equality must hold, which means $\alpha = 0$.

    *For orthogonal vectors $u$ and $v$ we have:

    $$\|u + v\|^2 = (u + v) \cdot (u + v) = u \cdot u + u \cdot v + v \cdot u + v \cdot v = \|u\|^2 + \|v\|^2.$$

    2. The terms “left” and “right” do seem a bit awkward. They come from the positions of the factors $U$ and $V$ in the decomposition $X = U \Sigma V^T$.

  4. Hi Steve,

    thank you very much for this article. You mentioned a third part in which you discuss the implementation of PCR. Is it correct that the third part is missing?

    Greetings,
    Willem
