Principal Components Regression: Part 3 – The NIPALS Algorithm

Principal Components Regression: Recap of Part 2

Recall that the least squares solution \hat{\beta} to the multiple linear regression problem X \beta = y is given by
(1) \hat{\beta} = (X^T X)^{-1} X^T y

And that problems occurred finding \hat{\beta} when the matrix
(2) X^T X

was close to being singular. The Principal Components Regression approach to addressing the problem is to replace X^T X in equation (1) with a better conditioned approximation. This approximation is formed by computing the eigenvalue decomposition for X^T X and retaining only the r largest eigenvalues. This yields the PCR solution:
(3) \hat{\beta}_r= V_1 \Lambda_1^{-1} V_1^T X^T y

where \Lambda_1 is an r x r diagonal matrix consisting of the r largest eigenvalues of X^T X, and V_1=(v_1,...,v_r) is the matrix whose columns are the corresponding eigenvectors of X^T X. In this piece we shall develop code for computing the PCR solution using the NMath libraries.

[eds: This blog article is the final entry of a three-part series on principal components regression. The first article in the series, “Principal Component Regression: Part 1 – The Magic of the SVD”, is here, and the second, “Principal Components Regression: Part 2 – The Problem With Linear Regression”, is here.]
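Before developing the NMath version, it helps to see equation (3) spelled out in code. The following is a minimal, language-neutral sketch in Python/NumPy; the function name pcr_eig and the use of numpy.linalg.eigh are my own illustrative choices, not part of the NMath implementation.

```python
import numpy as np

def pcr_eig(X, y, r):
    """PCR estimate of beta via a truncated eigendecomposition of X^T X, equation (3)."""
    lam, V = np.linalg.eigh(X.T @ X)       # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:r]        # indices of the r largest eigenvalues
    lam_1, V_1 = lam[idx], V[:, idx]       # Lambda_1 (stored as a vector) and V_1
    # beta_r = V_1 Lambda_1^{-1} V_1^T X^T y
    return V_1 @ ((V_1.T @ (X.T @ y)) / lam_1)
```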

Review: Eigenvalues and Singular Values

In order to develop the algorithm, I want to go back to the Singular Value Decomposition (SVD) of a matrix and its relationship to the eigenvalue decomposition. Recall that the SVD of a matrix X is given by
(4) X=U \Sigma V^T

Where U is the matrix of left singular vectors, V is the matrix of right singular vectors, and \Sigma is a diagonal matrix with positive entries equal to the singular values. The eigenvalue decomposition of X^T X is given by
(5) X^T X=V \Lambda V^T

Where the eigenvalues of X^T X are the diagonal entries of the diagonal matrix \Lambda (these are the squares of the singular values of X) and the columns of V are the eigenvectors of X^T X (V is the same matrix of right singular vectors that appears in the SVD of X).
Recall further that if the matrix X has rank r then X can be written as
(6) X= \sum_{j=1}^{r} \sigma_j u_j v_j^T

Where \sigma_j is the jth singular value (jth diagonal element of the diagonal matrix \Sigma), u_j is the jth column of U, and v_j is the jth column of V. An equivalent way of expressing the PCR solution (3) to the least squares problem in terms of the SVD for X is that we’ve replaced X in the solution (1) by its rank r approximation shown in (6).
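As a quick numerical illustration of the relationship between (4), (5), and (6), here is a small NumPy check; the random test matrix and variable names are mine, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))

# SVD of X: X = U Sigma V^T, equation (4)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition of X^T X, equation (5): the eigenvalues are the squared
# singular values, and the eigenvectors match the right singular vectors (up to sign).
lam, V = np.linalg.eigh(X.T @ X)
print(np.allclose(np.sort(lam)[::-1], s**2))          # True

# Rank-r approximation of X from equation (6)
r = 2
X_r = sum(s[j] * np.outer(U[:, j], Vt[j]) for j in range(r))
```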

Principal Components

The subject here is Principal Components Regression (PCR), but we have yet to mention principal components. All we have talked about are eigenvalues, eigenvectors, singular values, and singular vectors. We’ve seen how singular stuff and eigen stuff are related, but what are principal components?
Principal component analysis applies when one considers statistical properties of data. In linear regression each column of our matrix X represents a variable and each row is a set of observed values for these variables. The variables being observed are random variables and as such have means and variances. If we center the matrix X by subtracting from each column of X its corresponding mean, then we’ve normalized the random variables being observed so that they have zero mean. Once the matrix X is centered in this way, the matrix X^T X is proportional to the variance/covariance matrix of the variables. In this context the eigenvectors of X^T X are called the Principal Components of X. For completeness (and because they are used in discussing the PCR algorithm), we define two more terms.
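A small NumPy sketch of the centering step just described; the test data and the use of numpy.cov are my own, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.standard_normal((50, 4))

# Center each column so the observed variables have zero mean.
X = X_raw - X_raw.mean(axis=0)

# With X centered, X^T X is proportional to the sample covariance matrix.
n = X.shape[0]
print(np.allclose(X.T @ X / (n - 1), np.cov(X_raw, rowvar=False)))   # True
```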
In the SVD given by equation (4), define the matrix T by
(7) T=U\Sigma

The matrix T is called the scores for X. Note that the columns of T are mutually orthogonal, but not orthonormal (they do not have unit length). Substituting this into the SVD for X yields
(8) X=TV^T

Using the fact that V is orthogonal we can also write
(9) T=XV

We call the matrix V the loadings. The goal of our algorithm is to obtain the representation given by equation (8) for X, retaining only the most significant principal components (or eigenvalues, or singular values – depending on where your head’s at at the time).
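The relationships (7), (8), and (9) are easy to verify numerically. A short NumPy sketch (again illustrative; the test matrix is mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 6))
X = X - X.mean(axis=0)                      # centered, as in the previous section

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                                    # loadings
T = U * s                                   # scores T = U Sigma, equation (7)

print(np.allclose(X, T @ Vt))               # X = T V^T, equation (8)
print(np.allclose(T, X @ V))                # T = X V,   equation (9)
print(np.allclose(T.T @ T, np.diag(s**2)))  # columns of T orthogonal, not orthonormal
```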

Computing the Solution

Using equation (3) to compute the solution to our problem involves forming the matrix X^T X and obtaining its eigenvalue decomposition. This approach is fairly straightforward and has reasonable performance for moderately sized matrices X. However, in practice, the matrix X can be quite large, containing hundreds, even thousands, of columns. In addition, many procedures for choosing the optimal number r of eigenvalues/singular values to retain involve computing the solution for many different values of r and comparing them. We therefore introduce an algorithm that computes only as many principal components as we need.

The NIPALS Algorithm

We will be using an algorithm known as NIPALS (Nonlinear Iterative PArtial Least Squares). The NIPALS algorithm for the matrix X in our least squares problem and r, the number of retained principal components, proceeds as follows:
Initialize j=1 and X_1=X. Then iterate through the following steps (a code sketch of the iteration follows the list) –

  1. Choose t_j as any column of X_j
  2. Let v_j = (X_j^T t_j) / \left \| X_j^T t_j \right \|
  3. Let t_j= X_j v_j
  4. If t_j is unchanged (to within a convergence tolerance) continue to step 5. Otherwise return to step 2.
  5. Let X_{j+1}= X_j- t_j v_j^T
  6. If j=r stop. Otherwise increment j and return to step 1.
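Here is the promised sketch of the iteration in plain Python/NumPy. This is an illustrative implementation rather than the NMath code the series builds; the tolerance, iteration cap, starting-column choice, and the function name nipals are my own assumptions.

```python
import numpy as np

def nipals(X, r, tol=1e-10, max_iter=500):
    """Return the first r scores (columns of T) and loadings (columns of V) of X."""
    Xj = np.array(X, dtype=float, copy=True)
    n, p = Xj.shape
    T = np.zeros((n, r))
    V = np.zeros((p, r))
    for j in range(r):
        # Step 1: choose t_j as a column of X_j (here, the column of largest norm)
        t = Xj[:, np.argmax(np.sum(Xj**2, axis=0))].copy()
        for _ in range(max_iter):
            v = Xj.T @ t                  # Step 2: v_j = X_j^T t_j / ||X_j^T t_j||
            v /= np.linalg.norm(v)
            t_new = Xj @ v                # Step 3: t_j = X_j v_j
            # Step 4: stop iterating once t_j is (numerically) unchanged
            if np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new):
                t = t_new
                break
            t = t_new
        T[:, j], V[:, j] = t, v
        Xj = Xj - np.outer(t, v)          # Step 5: X_{j+1} = X_j - t_j v_j^T
    return T, V                           # Step 6 is the loop over j

```

Calling nipals(X, r) on a centered X returns the scores and loadings used in the remainder of the discussion.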

Properties of the NIPALS Algorithm

Let us see how the NIPALS algorithm produces principal components for us.
Let \lambda_j = \left \| X^T t_j \right \| and write step (2) as
(10) X^T t_j = \lambda_j v_j

Setting t_j = X v_j in step 3 yields
(11) X^T X v_j= \lambda_j v_j

This equation is satisfied upon completion of the loop 2-4. This shows that \lambda_j and v_j are an eigenvalue and eigenvector of X^T X (strictly, of X_j^T X_j, but deflation ensures that these are eigenpairs of X^T X as well). The astute reader will note that the loop 2-4 is essentially the power method for computing a dominant eigenvalue and eigenvector of a linear transformation. Note further that using t_j=X v_j and equation (11) we obtain
(12) t_j^T t_j = v_j^T X^T X v_j = \lambda_j v_j^T v_j = \lambda_j

After one iteration of the NIPALS algorithm we end up at step 5 with j=1 and
(13) X= t_1 v_1^T+ X_2

Note that t_1 and X_2 = X - t_1 v_1^T are orthogonal:
(14) (X - t_1 v_1^T)^T t_1 = X^T t_1 - v_1 t_1^T t_1 = X^T X v_1 - \lambda_1 v_1 = \lambda_1 v_1 - \lambda_1 v_1 = 0

Furthermore, since t_2 always lies in the column space of X_2 (it starts as a column of X_2, and each update sets t_2 = X_2 v_2), it is orthogonal to t_1. Upon completion of the algorithm we form the following two matrices:

  • T_r, whose columns are the score vectors t_j; the columns of T_r are mutually orthogonal
  • V_r, whose columns are the loading vectors v_j; the columns of V_r are orthonormal.

These give the rank r approximation
(15) X_r=T_r V_r^T

If r is equal to the rank of X then, using the information obtained from equations (12) and (14), it follows that (15) yields the matrix decomposition (8). The idea behind Principal Components Regression is that after choosing an appropriate r the important features of X have been captured in T_r. We then perform a linear regression with T_r in place of X,
(16) T_r c=y.

The least squares solution then gives
(17) \hat{c}= (T_r^T T_r )^{-1} T_r^T y

Note that since the columns of T_r are orthogonal, the matrix T_r^T T_r is diagonal and therefore easy to invert. Also note that the loadings matrix V_r does not appear in (16). This is because the scores t_j are linear combinations of the columns of X, and the PCR method amounts to singling out those combinations that are best for predicting y. Finally, using (9) and (16) we rewrite our linear regression problem X \beta=y as
(18) XV_r c=y

From (18) we see that the PCR estimation \hat{\beta}_r is given by
(19) \hat{\beta}_r= V_r \hat{c}.
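Putting the pieces together, here is an end-to-end sketch of PCR built on the nipals function above. As before, this is illustrative Python rather than the NMath code; the synthetic data and the function name pcr_nipals are my own assumptions.

```python
import numpy as np

def pcr_nipals(X, y, r):
    """PCR estimate of beta: regress y on the first r NIPALS scores, then map back."""
    T, V = nipals(X, r)                        # scores T_r and loadings V_r
    # T_r^T T_r is diagonal (orthogonal score columns), so the least squares
    # solution (17) reduces to an elementwise division.
    c_hat = (T.T @ y) / np.sum(T**2, axis=0)   # equation (17)
    return V @ c_hat                           # equation (19): beta_r = V_r c_hat

# Synthetic example: recover a coefficient vector from noisy data.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 8))
X = X - X.mean(axis=0)
beta_true = rng.standard_normal(8)
y = X @ beta_true + 0.1 * rng.standard_normal(100)

print(pcr_nipals(X, y, r=8))   # with r = rank(X) this matches ordinary least squares
print(beta_true)
```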

Steve
