Principal Components Regression: Recap of Part 2
Recall that the least squares solution to the multiple linear problem is given by
(1)
And that problems occurred finding when the matrix
(2)
was close to being singular. The Principal Components Regression approach to addressing the problem is to replace in equation (1) with a better conditioned approximation. This approximation is formed by computing the eigenvalue decomposition for and retaining only the r largest eigenvalues. This yields the PCR solution:
(3)
where is an r x r diagonal matrix consisting of the r largest eigenvalues of
are the corresponding eigenvectors of . In this piece we shall develop code for computing the PCR solution using the NMath libraries.
[eds: This blog article is final entry of a three part series on principal component regression. The first article in this series, “Principal Component Regression: Part 1 – The Magic of the SVD” is here. And the second, “Principal Components Regression: Part 2 – The Problem With Linear Regression” is here.]
Review: Eigenvalues and Singular Values
In order to develop the algorithm, I want to go back to the Singular Value Decomposition (SVD) of a matrix and its relationship to the eigenvalue decomposition. Recall that the SVD of a matrix X is given by
(4)
Where U is the matrix of left singular vectors, V is the matrix of right singular vectors, and Σ is a diagonal matrix with positive entries equal to the singular values. The eigenvalue decomposition of is given by
(5)
Where the eigenvalues of X are the diagonal entries of the diagonal matrix and the columns of V are the eigenvectors of (V is also composed of the right singular vectors of X).
Recall further that if the matrix X has rank r then X can be written as
(6)
Where is the jth singular value (jth diagonal element of the diagonal matrix ), is the jth column of U, and is the jth column of V. An equivalent way of expressing the PCR solution (3) to the least squares problem in terms of the SVD for X is that we’ve replaced X in the solution (1) by its rank r approximation shown in (6).
Principal Components
The subject here is Principal Components Regression (PCR), but we have yet to mention principal components. All we have talked about are eigenvalues, eigenvectors, singular values, and singular vectors. We’ve seen how singular stuff and eigen stuff are related, but what are principal components?
Principal component analysis applies when one considers statistical properties of data. In linear regression each column of our matrix X represents a variable and each row is a set of observed value for these variables. The variables being observed are random variables and as such have means and variances. If we center the matrix X by subtracting from each column of X its corresponding mean, then we’ve normalized the random variables being observed so that they have zero mean. Once the matrix X is centered in this way, the matrix is then proportional to the variance/covariance for the variables. In this context the eigenvectors of are called the Principal Components of X. For completeness (and because they are used in discussing the PCR algorithm), we define two more terms.
In the SVD given by equation (4), define the matrix T by
(7)
The matrix T is called the scores for X. Note that T is orthogonal, but not necessarily orthonormal. Substituting this into the SVD for X yields
(8)
Using the fact that V is orthogonal we can also write
(9)
We call the matrix V the loadings. The goal of our algorithm is to obtain the representation given by equation (8) for X, retaining all the most significant principal components (or eigenvalues, or singular values – depending on where your heads at at the time).
Computing the Solution
Using equation (3) to compute the solution to our problem involves forming the matrix and obtaining its eigenvalue decomposition. This solution is fairly straight forward and has reasonable performance for moderately sized matrices X. However, in practice, the matrix X can be quite large, containing hundreds, even thousands of columns. In addition, many procedures for choosing the optimal number r of eigenvalues/singular values to retain involve computing the solution for many different values of r and comparing them. We therefore introduce an algorithm which computes only the number of eigenvalues we need.
The NIPALS Algorithm
We will be using an algorithm known as NIPALS (Nonlinear Iterative PArtial Least Squares). The NIPALS algorithm for the matrix X in our least squares problem and r, the number of retained principal components, proceeds as follows:
Initialize and . Then iterate through the following steps –
- Choose as any column of
- Let
- Let
- If is unchanged continue to step 5. Otherwise return to step 2.
- Let
- If stop. Otherwise increment j and return to step 1.
Properties of the NIPALS Algorithm
Let us see how the NIPALS algorithm produces principal components for us.
Let and write step (2) as
(10)
Setting in step 3 yields
(11)
This equation is satisfied upon completion of the loop 2-4. This shows that and are an eigenvalue and eigenvector of . The astute reader will note that the loop 2-4 is essentially the power method for computing a dominant eigenvalue and eigenvector for a linear transformation. Note further that using and equation (11) we obtain
(12)
After one iteration of the NIPALS algorithm we end up at step 5 with and
(13)
Note that and
are orthogonal:
(14)
Furthermore, since is initially picked as a column of , it is orthogonal to . Upon completion of the algorithm we form the following two matrices:
- , whose columns are the vectors , is orthogonal
- whose columns are the , is orthonormal.
(15)
If r is equal to the rank of X then, using the information obtained from equations (12) and (14), it follows that (15) yields the matrix decomposition (8). The idea behind Principal Components Regression is that after choosing an appropriate r the important features of X have been captured in . We then perform a linear regression with in place of X,
(16) .
The least squares solution then gives
(17)
Note that since is diagonal it is easy to invert. Also note that we left out the loadings matrix . This is due to the fact that the scores are linear combinations of the columns of X, and the PCR method amounts to singling out those combinations that are best for predicting y. Finally, using (9) and (16) we rewrite our linear regression problem as
(18)
From (18) we see that the PCR estimation is given by
(19) .
Steve