Principal Components Regression: Part 3 – The NIPALS Algorithm

Principal Components Regression: Recap of Part 2

Recall that the least squares solution \hat{\beta} to the multiple linear regression problem X \beta = y is given by
(1) \hat{\beta} = (X^T X)^{-1} X^T y

And that problems occurred finding \hat{\beta} when the matrix
(2) X^T X

was close to being singular. The Principal Components Regression approach to addressing the problem is to replace X^T X in equation (1) with a better conditioned approximation. This approximation is formed by computing the eigenvalue decomposition for X^T X and retaining only the r largest eigenvalues. This yields the PCR solution:
(3) \hat{\beta}_r= V_1 \Lambda_1^{-1} V_1^T X^T y

where \Lambda_1 is an r x r diagonal matrix consisting of the r largest eigenvalues of X^T X, and V_1=(v_1,...,v_r) is the matrix whose columns are the corresponding eigenvectors of X^T X. In this piece we shall develop code for computing the PCR solution using the NMath libraries.

[eds: This blog article is the final entry of a three-part series on principal components regression. The first article in the series, “Principal Component Regression: Part 1 – The Magic of the SVD”, is here, and the second, “Principal Components Regression: Part 2 – The Problem With Linear Regression”, is here.]
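Before developing the NMath version, it helps to see equation (3) spelled out in code. The following is a minimal, language-neutral sketch in Python/NumPy; the function name pcr_eig and the use of numpy.linalg.eigh are my own illustrative choices, not part of the NMath implementation.

```python
import numpy as np

def pcr_eig(X, y, r):
    """PCR estimate of beta via a truncated eigendecomposition of X^T X, equation (3)."""
    lam, V = np.linalg.eigh(X.T @ X)       # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:r]        # indices of the r largest eigenvalues
    lam_1, V_1 = lam[idx], V[:, idx]       # Lambda_1 (stored as a vector) and V_1
    # beta_r = V_1 Lambda_1^{-1} V_1^T X^T y
    return V_1 @ ((V_1.T @ (X.T @ y)) / lam_1)
```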

Review: Eigenvalues and Singular Values

In order to develop the algorithm, I want to go back to the Singular Value Decomposition (SVD) of a matrix and its relationship to the eigenvalue decomposition. Recall that the SVD of a matrix X is given by
(4) X=U \Sigma V^T

Where U is the matrix of left singular vectors, V is the matrix of right singular vectors, and \Sigma is a diagonal matrix with positive entries equal to the singular values. The eigenvalue decomposition of X^T X is given by
(5) X^T X=V \Lambda V^T

Where the eigenvalues of X^T X are the diagonal entries of the diagonal matrix \Lambda (these are the squares of the singular values of X) and the columns of V are the eigenvectors of X^T X (V is the same matrix of right singular vectors that appears in the SVD of X).
Recall further that if the matrix X has rank r then X can be written as
(6) X= \sum_{j=1}^{r} \sigma_j u_j v_j^T

Where \sigma_j is the jth singular value (jth diagonal element of the diagonal matrix \Sigma), u_j is the jth column of U, and v_j is the jth column of V. An equivalent way of expressing the PCR solution (3) to the least squares problem in terms of the SVD for X is that we’ve replaced X in the solution (1) by its rank r approximation shown in (6).
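As a quick numerical illustration of the relationship between (4), (5), and (6), here is a small NumPy check; the random test matrix and variable names are mine, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))

# SVD of X: X = U Sigma V^T, equation (4)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition of X^T X, equation (5): the eigenvalues are the squared
# singular values, and the eigenvectors match the right singular vectors (up to sign).
lam, V = np.linalg.eigh(X.T @ X)
print(np.allclose(np.sort(lam)[::-1], s**2))          # True

# Rank-r approximation of X from equation (6)
r = 2
X_r = sum(s[j] * np.outer(U[:, j], Vt[j]) for j in range(r))
```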

Principal Components

The subject here is Principal Components Regression (PCR), but we have yet to mention principal components. All we have talked about are eigenvalues, eigenvectors, singular values, and singular vectors. We’ve seen how singular stuff and eigen stuff are related, but what are principal components?
Principal component analysis applies when one considers statistical properties of data. In linear regression each column of our matrix X represents a variable and each row is a set of observed values for these variables. The variables being observed are random variables and as such have means and variances. If we center the matrix X by subtracting from each column of X its corresponding mean, then we’ve normalized the random variables being observed so that they have zero mean. Once the matrix X is centered in this way, the matrix X^T X is proportional to the variance/covariance matrix of the variables. In this context the eigenvectors of X^T X are called the Principal Components of X. For completeness (and because they are used in discussing the PCR algorithm), we define two more terms.
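A small NumPy sketch of the centering step just described; the test data and the use of numpy.cov are my own, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.standard_normal((50, 4))

# Center each column so the observed variables have zero mean.
X = X_raw - X_raw.mean(axis=0)

# With X centered, X^T X is proportional to the sample covariance matrix.
n = X.shape[0]
print(np.allclose(X.T @ X / (n - 1), np.cov(X_raw, rowvar=False)))   # True
```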
In the SVD given by equation (4), define the matrix T by
(7) T=U\Sigma

The matrix T is called the scores for X. Note that the columns of T are mutually orthogonal, but not orthonormal (they do not have unit length). Substituting this into the SVD for X yields
(8) X=TV^T

Using the fact that V is orthogonal we can also write
(9) T=XV

We call the matrix V the loadings. The goal of our algorithm is to obtain the representation given by equation (8) for X, retaining only the most significant principal components (or eigenvalues, or singular values – depending on where your head’s at at the time).
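The relationships (7), (8), and (9) are easy to verify numerically. A short NumPy sketch (again illustrative; the test matrix is mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 6))
X = X - X.mean(axis=0)                      # centered, as in the previous section

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                                    # loadings
T = U * s                                   # scores T = U Sigma, equation (7)

print(np.allclose(X, T @ Vt))               # X = T V^T, equation (8)
print(np.allclose(T, X @ V))                # T = X V,   equation (9)
print(np.allclose(T.T @ T, np.diag(s**2)))  # columns of T orthogonal, not orthonormal
```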

Computing the Solution

Using equation (3) to compute the solution to our problem involves forming the matrix X^T X and obtaining its eigenvalue decomposition. This approach is fairly straightforward and has reasonable performance for moderately sized matrices X. However, in practice, the matrix X can be quite large, containing hundreds, even thousands, of columns. In addition, many procedures for choosing the optimal number r of eigenvalues/singular values to retain involve computing the solution for many different values of r and comparing them. We therefore introduce an algorithm that computes only as many principal components as we need.

The NIPALS Algorithm

We will be using an algorithm known as NIPALS (Nonlinear Iterative PArtial Least Squares). The NIPALS algorithm for the matrix X in our least squares problem and r, the number of retained principal components, proceeds as follows:
Initialize j=1 and X_1=X. Then iterate through the following steps (a code sketch of the iteration follows the list) –

  1. Choose t_j as any column of X_j
  2. Let v_j = (X_j^T t_j) / \left \| X_j^T t_j \right \|
  3. Let t_j= X_j v_j
  4. If t_j is unchanged (to within a convergence tolerance) continue to step 5. Otherwise return to step 2.
  5. Let X_{j+1}= X_j- t_j v_j^T
  6. If j=r stop. Otherwise increment j and return to step 1.
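Here is the promised sketch of the iteration in plain Python/NumPy. This is an illustrative implementation rather than the NMath code the series builds; the tolerance, iteration cap, starting-column choice, and the function name nipals are my own assumptions.

```python
import numpy as np

def nipals(X, r, tol=1e-10, max_iter=500):
    """Return the first r scores (columns of T) and loadings (columns of V) of X."""
    Xj = np.array(X, dtype=float, copy=True)
    n, p = Xj.shape
    T = np.zeros((n, r))
    V = np.zeros((p, r))
    for j in range(r):
        # Step 1: choose t_j as a column of X_j (here, the column of largest norm)
        t = Xj[:, np.argmax(np.sum(Xj**2, axis=0))].copy()
        for _ in range(max_iter):
            v = Xj.T @ t                  # Step 2: v_j = X_j^T t_j / ||X_j^T t_j||
            v /= np.linalg.norm(v)
            t_new = Xj @ v                # Step 3: t_j = X_j v_j
            # Step 4: stop iterating once t_j is (numerically) unchanged
            if np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new):
                t = t_new
                break
            t = t_new
        T[:, j], V[:, j] = t, v
        Xj = Xj - np.outer(t, v)          # Step 5: X_{j+1} = X_j - t_j v_j^T
    return T, V                           # Step 6 is the loop over j

```

Calling nipals(X, r) on a centered X returns the scores and loadings used in the remainder of the discussion.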

Properties of the NIPALS Algorithm

Let us see how the NIPALS algorithm produces principal components for us.
Let \lambda_j = \left \| X^T t_j \right \| and write step (2) as
(10) X^T t_j = \lambda_j v_j

Setting t_j = X v_j in step 3 yields
(11) X^T X v_j= \lambda_j v_j

This equation is satisfied upon completion of the loop 2-4. This shows that \lambda_j and v_j are an eigenvalue and eigenvector of X^T X (strictly, of X_j^T X_j, but deflation ensures that these are eigenpairs of X^T X as well). The astute reader will note that the loop 2-4 is essentially the power method for computing a dominant eigenvalue and eigenvector of a linear transformation. Note further that using t_j=X v_j and equation (11) we obtain
(12) t_j^T t_j = v_j^T X^T X v_j = \lambda_j v_j^T v_j = \lambda_j

After one iteration of the NIPALS algorithm we end up at step 5 with j=1 and
(13) X= t_1 v_1^T+ X_2

Note that t_1 and X_2 = X - t_1 v_1^T are orthogonal:
(14) (X - t_1 v_1^T)^T t_1 = X^T t_1 - v_1 t_1^T t_1 = X^T X v_1 - \lambda_1 v_1 = \lambda_1 v_1 - \lambda_1 v_1 = 0

Furthermore, since t_2 always lies in the column space of X_2 (it starts as a column of X_2, and each update sets t_2 = X_2 v_2), it is orthogonal to t_1. Upon completion of the algorithm we form the following two matrices:

  • T_r, whose columns are the score vectors t_j; the columns of T_r are mutually orthogonal
  • V_r, whose columns are the loading vectors v_j; the columns of V_r are orthonormal.

These give the rank r approximation
(15) X_r=T_r V_r^T

If r is equal to the rank of X then, using the information obtained from equations (12) and (14), it follows that (15) yields the matrix decomposition (8). The idea behind Principal Components Regression is that after choosing an appropriate r the important features of X have been captured in T_r. We then perform a linear regression with T_r in place of X,
(16) T_r c=y.

The least squares solution then gives
(17) \hat{c}= (T_r^T T_r )^{-1} T_r^T y

Note that since the columns of T_r are orthogonal, the matrix T_r^T T_r is diagonal and therefore easy to invert. Also note that the loadings matrix V_r does not appear in (16). This is because the scores t_j are linear combinations of the columns of X, and the PCR method amounts to singling out those combinations that are best for predicting y. Finally, using (9) and (16) we rewrite our linear regression problem X \beta=y as
(18) XV_r c=y

From (18) we see that the PCR estimation \hat{\beta}_r is given by
(19) \hat{\beta}_r= V_r \hat{c}.
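Putting the pieces together, here is an end-to-end sketch of PCR built on the nipals function above. As before, this is illustrative Python rather than the NMath code; the synthetic data and the function name pcr_nipals are my own assumptions.

```python
import numpy as np

def pcr_nipals(X, y, r):
    """PCR estimate of beta: regress y on the first r NIPALS scores, then map back."""
    T, V = nipals(X, r)                        # scores T_r and loadings V_r
    # T_r^T T_r is diagonal (orthogonal score columns), so the least squares
    # solution (17) reduces to an elementwise division.
    c_hat = (T.T @ y) / np.sum(T**2, axis=0)   # equation (17)
    return V @ c_hat                           # equation (19): beta_r = V_r c_hat

# Synthetic example: recover a coefficient vector from noisy data.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 8))
X = X - X.mean(axis=0)
beta_true = rng.standard_normal(8)
y = X @ beta_true + 0.1 * rng.standard_normal(100)

print(pcr_nipals(X, y, r=8))   # with r = rank(X) this matches ordinary least squares
print(beta_true)
```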

Steve
