<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>

<channel>
	<title>PCR c# Archives - CenterSpace</title>
	<atom:link href="https://www.centerspace.net/tag/pcr-c/feed" rel="self" type="application/rss+xml" />
	<link>https://www.centerspace.net/tag/pcr-c</link>
	<description>.NET numerical class libraries</description>
	<lastBuildDate>Sat, 17 Jul 2021 20:09:29 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.1.1</generator>
<site xmlns="com-wordpress:feed-additions:1">104092929</site>	<item>
		<title>Principal Components Regression: Part 3 – The NIPALS Algorithm</title>
		<link>https://www.centerspace.net/principal-components-regression</link>
					<comments>https://www.centerspace.net/principal-components-regression#respond</comments>
		
		<dc:creator><![CDATA[Steve Sneller]]></dc:creator>
		<pubDate>Tue, 29 Nov 2016 19:23:13 +0000</pubDate>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Theory]]></category>
		<category><![CDATA[NIPALS]]></category>
		<category><![CDATA[PCR]]></category>
		<category><![CDATA[PCR c#]]></category>
		<category><![CDATA[PCR estimator]]></category>
		<category><![CDATA[principal component analysis C#]]></category>
		<category><![CDATA[Principal Components Regression]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/?p=7075</guid>

					<description><![CDATA[<p>In this final entry of our three-part series on Principal Components Regression (PCR) we describe the NIPALS algorithm used to compute the principal components.  This is followed by a theoretical discussion, accessible to non-experts, of why the NIPALS algorithm works. </p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/principal-components-regression">Principal Components Regression: Part 3 – The NIPALS Algorithm</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2>Principal Components Regression: Recap of Part 2</h2>



<p>Recall that the least squares solution <img src="https://s0.wp.com/latex.php?latex=%5Cbeta&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;beta" class="latex" /> to the multiple linear regression problem <img src="https://s0.wp.com/latex.php?latex=X+%5Cbeta+%3D+y&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X &#92;beta = y" class="latex" /> is given by<br>(1) <img decoding="async" src="https://s0.wp.com/latex.php?latex=%5Chat%7B%5Cbeta%7D+%3D+%28X%5ET+X%29%5E%7B-1%7D+X%5ET+y+&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;hat{&#92;beta} = (X^T X)^{-1} X^T y " class="latex" /></p>



<p>And that problems occurred finding <img src="https://s0.wp.com/latex.php?latex=%5Chat%7B%5Cbeta%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;hat{&#92;beta}" class="latex" /> when the matrix<br>(2) <img src="https://s0.wp.com/latex.php?latex=X%5ET+X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X^T X" class="latex" /></p>



<p>was close to being singular. The Principal Components Regression approach to addressing the problem is to replace <img src="https://s0.wp.com/latex.php?latex=X%5ET+X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X^T X" class="latex" /> in equation (1) with a better conditioned approximation. This approximation is formed by computing the eigenvalue decomposition for <img src="https://s0.wp.com/latex.php?latex=X%5ET+X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X^T X" class="latex" /> and retaining only the r largest eigenvalues. This yields the PCR solution:<br>(3) <img src="https://s0.wp.com/latex.php?latex=%5Chat%7B%5Cbeta%7D_r%3D+V_1+%5CLambda_1%5E%7B-1%7D+V_1%5ET+X%5ET+y&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;hat{&#92;beta}_r= V_1 &#92;Lambda_1^{-1} V_1^T X^T y" class="latex" /></p>



<p>where <img src="https://s0.wp.com/latex.php?latex=%5CLambda_1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;Lambda_1" class="latex" /> is an r x r diagonal matrix consisting of the r largest eigenvalues of <img src="https://s0.wp.com/latex.php?latex=X%5ET+X%2CV_1%3D%28v_1%2C...%2Cv_r+%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X^T X,V_1=(v_1,...,v_r )" class="latex" /><br>are the corresponding eigenvectors of <img src="https://s0.wp.com/latex.php?latex=X%5ET+X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X^T X" class="latex" />. In this piece we shall develop code for computing the PCR solution using the NMath libraries.</p>
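<p>Before developing the NIPALS version, the direct eigen-based estimator (3) can be sketched in a few lines. The following is an illustrative sketch in Python/numpy rather than NMath, with a made-up design matrix whose fourth column is nearly redundant; all names are my own.</p>

```python
import numpy as np

# Made-up design matrix whose fourth column nearly duplicates the third,
# so X^T X is poorly conditioned.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
X[:, 3] = X[:, 2] + 1e-6 * rng.standard_normal(50)
y = X @ np.array([1.0, -2.0, 0.5, 0.5])

# Eigenvalue decomposition of X^T X; eigh returns ascending order,
# so reverse to put the largest eigenvalues first.
lam, V = np.linalg.eigh(X.T @ X)
lam, V = lam[::-1], V[:, ::-1]

# Equation (3): retain only the r largest eigenvalues.
r = 3
V1, lam1 = V[:, :r], lam[:r]
beta_r = V1 @ ((V1.T @ (X.T @ y)) / lam1)
```

<p>Dropping the near-zero eigenvalue discards the direction responsible for the poor conditioning while leaving the fit essentially unchanged.</p>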


<p>[eds: This blog article is the final entry of a three-part series on principal component regression. The first article in this series, &#8220;Principal Component Regression: Part 1 – The Magic of the SVD&#8221; is <a href="https://www.centerspace.net/theoretical-motivation-behind-pcr">here</a>. And the second, &#8220;Principal Components Regression: Part 2 – The Problem With Linear Regression&#8221; is <a href="https://www.centerspace.net/priniciple-components-regression-in-csharp">here</a>.]</p>



<h2>Review: Eigenvalues and Singular Values</h2>



<p>In order to develop the algorithm, I want to go back to the Singular Value Decomposition (SVD) of a matrix and its relationship to the eigenvalue decomposition. Recall that the SVD of a matrix X is given by<br>(4) <img src="https://s0.wp.com/latex.php?latex=X%3DU+%5CSigma+V%5ET&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X=U &#92;Sigma V^T" class="latex" /></p>



<p>Where U is the matrix of left singular vectors, V is the matrix of right singular vectors, and Σ is a diagonal matrix with positive entries equal to the singular values. The eigenvalue decomposition of <img src="https://s0.wp.com/latex.php?latex=X%5ET+X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X^T X" class="latex" /> is given by<br>(5) <img src="https://s0.wp.com/latex.php?latex=X%5ET+X%3DV+%5CLambda+V%5ET&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X^T X=V &#92;Lambda V^T" class="latex" /></p>



<p>Where the eigenvalues of <img src="https://s0.wp.com/latex.php?latex=X%5ET+X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X^T X" class="latex" /> are the diagonal entries of the diagonal matrix <img src="https://s0.wp.com/latex.php?latex=%5CLambda&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;Lambda" class="latex" /> and the columns of V are the eigenvectors of <img src="https://s0.wp.com/latex.php?latex=X%5ET+X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X^T X" class="latex" /> (V is also composed of the right singular vectors of X).<br>Recall further that if the matrix X has rank r then X can be written as<br>(6) <img src="https://s0.wp.com/latex.php?latex=X%3D+%5Csum_%7Bj%3D1%7D%5E%7Br%7D+%5Csigma_j+u_j+v_j%5ET&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X= &#92;sum_{j=1}^{r} &#92;sigma_j u_j v_j^T" class="latex" /></p>



<p>Where <img src="https://s0.wp.com/latex.php?latex=%5Csigma_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;sigma_j" class="latex" /> is the jth singular value (jth diagonal element of the diagonal matrix <img src="https://s0.wp.com/latex.php?latex=%5CSigma&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;Sigma" class="latex" />), <img src="https://s0.wp.com/latex.php?latex=u_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="u_j" class="latex" /> is the jth column of U, and <img src="https://s0.wp.com/latex.php?latex=v_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="v_j" class="latex" /> is the jth column of V. An equivalent way of expressing the PCR solution (3) to the least squares problem in terms of the SVD for X is that we’ve replaced X in the solution (1) by its rank r approximation shown in (6).</p>
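<p>The relationship between (4), (5), and (6) is easy to check numerically. A small illustrative Python/numpy sketch (made-up matrix, my own names):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))

# Equation (4): X = U Sigma V^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Equation (6): rank-r sum of outer products sigma_j u_j v_j^T.
r = 2
X_r = sum(s[j] * np.outer(U[:, j], Vt[j, :]) for j in range(r))

# The same approximation written with truncated factor matrices.
X_r_mat = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Equation (5): the eigenvalues of X^T X are the squared singular values.
eigvals = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
```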



<h2>Principal Components</h2>



<p>The subject here is Principal Components Regression (PCR), but we have yet to mention principal components. All we have talked about are eigenvalues, eigenvectors, singular values, and singular vectors. We’ve seen how singular stuff and eigen stuff are related, but what are principal components?<br>Principal component analysis applies when one considers statistical properties of data. In linear regression each column of our matrix X represents a variable and each row is a set of observed values for these variables. The variables being observed are random variables and as such have means and variances. If we center the matrix X by subtracting from each column of X its corresponding mean, then we’ve normalized the random variables being observed so that they have zero mean. Once the matrix X is centered in this way, the matrix <img src="https://s0.wp.com/latex.php?latex=X%5ET+X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X^T X" class="latex" /> is then proportional to the variance/covariance matrix of the variables. In this context the eigenvectors of <img src="https://s0.wp.com/latex.php?latex=X%5ET+X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X^T X" class="latex" /> are called the Principal Components of X. For completeness (and because they are used in discussing the PCR algorithm), we define two more terms.<br>In the SVD given by equation (4), define the matrix T by<br>(7) <img src="https://s0.wp.com/latex.php?latex=T%3DU%5CSigma&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="T=U&#92;Sigma" class="latex" /></p>



<p>The matrix T is called the <em>scores</em> for X. Note that the columns of T are orthogonal, but not necessarily orthonormal. Substituting this into the SVD for X yields<br>(8) <img src="https://s0.wp.com/latex.php?latex=X%3DTV%5ET&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X=TV^T" class="latex" /></p>



<p>Using the fact that V is orthogonal we can also write<br>(9) <img src="https://s0.wp.com/latex.php?latex=T%3DXV&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="T=XV" class="latex" /></p>



<p>We call the matrix V the <em>loadings</em>. The goal of our algorithm is to obtain the representation given by equation (8) for X, retaining only the most significant principal components (or eigenvalues, or singular values – depending on where your head is at the time).</p>
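<p>Equations (7) through (9) can be verified directly. An illustrative Python/numpy sketch (made-up data, my own names):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 3))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T      # loadings
T = U * s     # scores, equation (7): T = U Sigma (s scales the columns of U)

# T^T T = Sigma^2: the columns of T are orthogonal but not unit length.
gram = T.T @ T
```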



<h2>Computing the Solution</h2>



<p>Using equation (3) to compute the solution to our problem involves forming the matrix <img src="https://s0.wp.com/latex.php?latex=X%5ET+X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X^T X" class="latex" /> and obtaining its eigenvalue decomposition. This solution is fairly straightforward and has reasonable performance for moderately sized matrices X. However, in practice, the matrix X can be quite large, containing hundreds, even thousands of columns. In addition, many procedures for choosing the optimal number r of eigenvalues/singular values to retain involve computing the solution for many different values of r and comparing them. We therefore introduce an algorithm which computes only the number of eigenvalues we need.</p>



<h2>The NIPALS Algorithm</h2>



<p>We will be using an algorithm known as NIPALS (Nonlinear Iterative PArtial Least Squares). The NIPALS algorithm for the matrix X in our least squares problem and r, the number of retained principal components, proceeds as follows:<br>Initialize <img src="https://s0.wp.com/latex.php?latex=j%3D1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="j=1" class="latex" /> and <img src="https://s0.wp.com/latex.php?latex=X_1%3DX&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X_1=X" class="latex" />. Then iterate through the following steps –</p>



<ol><li>Choose <img src="https://s0.wp.com/latex.php?latex=t_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="t_j" class="latex" /> as any column of <img src="https://s0.wp.com/latex.php?latex=X_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X_j" class="latex" /></li><li>Let <img src="https://s0.wp.com/latex.php?latex=v_j+%3D+%28X_j%5ET+t_j%29+%2F+%5Cleft+%5C%7C+X_j%5ET+t_j+%5Cright+%5C%7C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="v_j = (X_j^T t_j) / &#92;left &#92;| X_j^T t_j &#92;right &#92;|" class="latex" /></li><li>Let <img src="https://s0.wp.com/latex.php?latex=t_j%3D+X_j+v_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="t_j= X_j v_j" class="latex" /></li><li>If <img src="https://s0.wp.com/latex.php?latex=t_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="t_j" class="latex" /> is unchanged continue to step 5. Otherwise return to step 2.</li><li>Let <img src="https://s0.wp.com/latex.php?latex=X_%7Bj%2B1%7D%3D+X_j-+t_j+v_j%5ET&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X_{j+1}= X_j- t_j v_j^T" class="latex" /></li><li>If <img src="https://s0.wp.com/latex.php?latex=j%3Dr&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="j=r" class="latex" /> stop. Otherwise increment j and return to step 1.</li></ol>
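<p>The six steps above translate nearly line for line into code. Below is a minimal illustrative sketch in Python/numpy (not the NMath implementation this post develops; names are my own). For step 1 it picks the largest-norm column of the current matrix, a common choice that speeds convergence of the inner power iteration.</p>

```python
import numpy as np

def nipals(X, r, tol=1e-12, max_iter=10000):
    """Scores T (n x r) and loadings V (k x r) for the r leading
    principal components of X, following steps 1-6 above."""
    Xj = np.array(X, dtype=float)
    n, k = Xj.shape
    T = np.zeros((n, r))
    V = np.zeros((k, r))
    for j in range(r):
        # Step 1: choose t_j as a column of X_j (largest norm, for speed).
        t = Xj[:, np.argmax(np.linalg.norm(Xj, axis=0))].copy()
        for _ in range(max_iter):
            v = Xj.T @ t
            v /= np.linalg.norm(v)                 # step 2
            t_new = Xj @ v                         # step 3
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:                          # step 4
                break
        T[:, j], V[:, j] = t, v
        Xj = Xj - np.outer(t, v)                   # step 5: deflate
    return T, V                                    # step 6: stop at j = r
```

<p>With r equal to the rank of X, multiplying the returned factors back together reproduces X, which is exactly the decomposition (8).</p>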



<h2>Properties of the NIPALS Algorithm</h2>



<p>Let us see how the NIPALS algorithm produces principal components for us.<br>Let <img src="https://s0.wp.com/latex.php?latex=%5Clambda_j+%3D+%5Cleft+%5C%7C+X%5ET+t_j+%5Cright+%5C%7C&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;lambda_j = &#92;left &#92;| X^T t_j &#92;right &#92;|" class="latex" /> and write step (2) as<br>(10) <img src="https://s0.wp.com/latex.php?latex=X%5ET+t_j+%3D+%5Clambda_j+v_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X^T t_j = &#92;lambda_j v_j" class="latex" /></p>



<p>Setting <img src="https://s0.wp.com/latex.php?latex=t_j+%3D+X+v_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="t_j = X v_j" class="latex" /> in step 3 yields<br>(11) <img src="https://s0.wp.com/latex.php?latex=X%5ET+X+v_j%3D+%5Clambda_j+v_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X^T X v_j= &#92;lambda_j v_j" class="latex" /></p>



<p>This equation is satisfied upon completion of the loop 2-4. This shows that <img src="https://s0.wp.com/latex.php?latex=%5Clambda_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;lambda_j" class="latex" /> and <img src="https://s0.wp.com/latex.php?latex=v_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="v_j" class="latex" /> are an eigenvalue and eigenvector of <img src="https://s0.wp.com/latex.php?latex=X%5ET+X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X^T X" class="latex" />. The astute reader will note that the loop 2-4 is essentially the power method for computing a dominant eigenvalue and eigenvector for a linear transformation. Note further that using <img src="https://s0.wp.com/latex.php?latex=t_j%3DX+v_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="t_j=X v_j" class="latex" /> and equation (11) we obtain<br>(12)</p>



<ul><li><img src="https://s0.wp.com/latex.php?latex=t_j%5ET+t_j%3D+v_j%5ET+X%5ET+Xv_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="t_j^T t_j= v_j^T X^T Xv_j" class="latex" /></li><li><img src="https://s0.wp.com/latex.php?latex=%3D+v_j%5ET+%28X%5ET+Xv_j+%29&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="= v_j^T (X^T Xv_j )" class="latex" /></li><li><img src="https://s0.wp.com/latex.php?latex=%3D+%5Clambda_j+v_j%5ET+v_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="= &#92;lambda_j v_j^T v_j" class="latex" /></li><li><img src="https://s0.wp.com/latex.php?latex=%3D+%5Clambda_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="= &#92;lambda_j" class="latex" /></li></ul>



<p>After one iteration of the NIPALS algorithm we end up at step 5 with <img src="https://s0.wp.com/latex.php?latex=j%3D1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="j=1" class="latex" /> and<br>(13) <img src="https://s0.wp.com/latex.php?latex=X%3D+t_1+v_1%5ET%2B+X_2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X= t_1 v_1^T+ X_2" class="latex" /></p>



<p>Note that <img src="https://s0.wp.com/latex.php?latex=t_1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="t_1" class="latex" /> and <img src="https://s0.wp.com/latex.php?latex=X_2%3DX+-+t_1+v_1%5ET&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X_2=X - t_1 v_1^T" class="latex" /><br>are orthogonal:<br>(14)</p>



<ul><li><img src="https://s0.wp.com/latex.php?latex=%28X-+t_1+v_1%5ET+%29%5ET+t_1+%3D+X%5ET+t_1-+v_1+t_1%5ET+t_1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="(X- t_1 v_1^T )^T t_1 = X^T t_1- v_1 t_1^T t_1" class="latex" /></li><li><img src="https://s0.wp.com/latex.php?latex=%3D+X%5ET+X+v_1-+v_1+%5Clambda_1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="= X^T X v_1- v_1 &#92;lambda_1" class="latex" /></li><li><img src="https://s0.wp.com/latex.php?latex=%3D0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="=0" class="latex" /></li></ul>



<p>Furthermore, since <img src="https://s0.wp.com/latex.php?latex=t_2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="t_2" class="latex" /> is initially picked as a column of <img src="https://s0.wp.com/latex.php?latex=X_2&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X_2" class="latex" />, it is orthogonal to <img src="https://s0.wp.com/latex.php?latex=t_1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="t_1" class="latex" />. Upon completion of the algorithm we form the following two matrices:</p>



<ul><li><img src="https://s0.wp.com/latex.php?latex=T_r&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="T_r" class="latex" />, whose columns are the vectors <img src="https://s0.wp.com/latex.php?latex=t_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="t_i" class="latex" />; the columns of <img src="https://s0.wp.com/latex.php?latex=T_r&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="T_r" class="latex" /> are orthogonal</li><li><img src="https://s0.wp.com/latex.php?latex=V_r&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="V_r" class="latex" />, whose columns are the <img src="https://s0.wp.com/latex.php?latex=v_i&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="v_i" class="latex" />; <img src="https://s0.wp.com/latex.php?latex=V_r&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="V_r" class="latex" /> is orthonormal.</li></ul>



<p>These matrices give the factorization<br>(15) <img src="https://s0.wp.com/latex.php?latex=X_r%3DT_r+V_r%5ET&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X_r=T_r V_r^T" class="latex" /></p>



<p>If r is equal to the rank of X then, using the information obtained from equations (12) and (14), it follows that (15) yields the matrix decomposition (8). The idea behind Principal Components Regression is that after choosing an appropriate r the important features of X have been captured in <img src="https://s0.wp.com/latex.php?latex=T_r&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="T_r" class="latex" />. We then perform a linear regression with <img src="https://s0.wp.com/latex.php?latex=T_r&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="T_r" class="latex" /> in place of X,<br>(16) <img src="https://s0.wp.com/latex.php?latex=T_r+c%3Dy&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="T_r c=y" class="latex" />.</p>



<p>The least squares solution then gives<br>(17) <img src="https://s0.wp.com/latex.php?latex=%5Chat%7Bc%7D%3D+%28T_r%5ET+T_r+%29%5E%7B-1%7D+T_r%5ET+y&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;hat{c}= (T_r^T T_r )^{-1} T_r^T y" class="latex" /></p>



<p>Note that since the columns of <img src="https://s0.wp.com/latex.php?latex=T_r&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="T_r" class="latex" /> are orthogonal, the matrix being inverted in (17) is diagonal and easy to invert. Also note that we left out the loadings matrix <img src="https://s0.wp.com/latex.php?latex=V_r&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="V_r" class="latex" />. This is because the scores <img src="https://s0.wp.com/latex.php?latex=t_j&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="t_j" class="latex" /> are linear combinations of the columns of X, and the PCR method amounts to singling out those combinations that are best for predicting y. Finally, using (9) and (16) we rewrite our linear regression problem <img src="https://s0.wp.com/latex.php?latex=X+%5Cbeta%3Dy&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="X &#92;beta=y" class="latex" /> as<br>(18) <img src="https://s0.wp.com/latex.php?latex=XV_r+c%3Dy&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="XV_r c=y" class="latex" /></p>



<p>From (18) we see that the PCR estimation <img src="https://s0.wp.com/latex.php?latex=%5Chat%7B%5Cbeta%7D_r&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;hat{&#92;beta}_r" class="latex" /> is given by<br>(19) <img src="https://s0.wp.com/latex.php?latex=%5Chat%7B%5Cbeta%7D_r%3D+V_r+%5Chat%7Bc%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002" alt="&#92;hat{&#92;beta}_r= V_r &#92;hat{c}" class="latex" />.</p>
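<p>Putting (16) through (19) together: the following illustrative Python/numpy sketch (made-up data, my own names; an SVD stands in for NIPALS, since both deliver the scores and loadings) regresses y on the scores, maps the coefficients back with the loadings, and cross-checks the result against the eigen-based form (3).</p>

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + 0.01 * rng.standard_normal(30)

# Scores and loadings for the r leading components (computed via SVD here).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = 3
T_r = U[:, :r] * s[:r]
V_r = Vt[:r, :].T

# Equation (17): T_r^T T_r = diag(sigma_j^2), so inversion is elementwise.
c_hat = (T_r.T @ y) / s[:r] ** 2

# Equation (19): the PCR estimate in the original variables.
beta_r = V_r @ c_hat

# Cross-check against the eigen form (3): V_1 Lambda_1^{-1} V_1^T X^T y.
beta_check = V_r @ ((V_r.T @ (X.T @ y)) / s[:r] ** 2)
```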



<p>Steve</p>



<p>The post <a rel="nofollow" href="https://www.centerspace.net/principal-components-regression">Principal Components Regression: Part 3 – The NIPALS Algorithm</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/principal-components-regression/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">7075</post-id>	</item>
		<item>
		<title>Principal Components Regression: Part 2 &#8211; The Problem With Linear Regression</title>
		<link>https://www.centerspace.net/priniciple-components-regression-in-csharp</link>
					<comments>https://www.centerspace.net/priniciple-components-regression-in-csharp#comments</comments>
		
		<dc:creator><![CDATA[Steve Sneller]]></dc:creator>
		<pubDate>Thu, 04 Mar 2010 17:17:07 +0000</pubDate>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Theory]]></category>
		<category><![CDATA[PCR]]></category>
		<category><![CDATA[PCR c#]]></category>
		<category><![CDATA[PCR estimator]]></category>
		<category><![CDATA[principal component analysis C#]]></category>
		<category><![CDATA[principal component regression]]></category>
		<guid isPermaLink="false">http://centerspace.net/blog/?p=1816</guid>

					<description><![CDATA[<p>Multiple Linear Regression (MLR) is a powerful approach to modeling the relationship between one or more <em>explanatory</em> variables and a <em>response </em>variable by fitting a linear equation to observed data.  This is the second part in a three-part series on PCR. </p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/priniciple-components-regression-in-csharp">Principal Components Regression: Part 2 &#8211; The Problem With Linear Regression</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><small> This is the second part in a three-part series on PCR; the first article on the topic can be found <a href="/theoretical-motivation-behind-pcr/">here</a>.</small></p>
<h3><strong>The Linear Regression Model</strong></h3>
<p>Multiple Linear Regression (MLR) is a common approach to modeling the relationship between one or more <em>explanatory</em> variables and a <em>response </em>variable by fitting a linear equation to observed data. First let’s set up some notation. I will be rather brief, assuming the audience is somewhat familiar with MLR.</p>
<p>In multiple linear regression it is assumed that a <em>response variable</em>, <img decoding="async" title="Y" src="http://latex.codecogs.com/gif.latex?Y" alt="" /> depends on k <em>explanatory variables</em>, <img decoding="async" title="X_1,...X_k" src="http://latex.codecogs.com/gif.latex?X_1,...,X_k" alt="" />, by way of a linear relationship:</p>
<p><img decoding="async" title="Y=b_1X_1+b_2X_2...+b_kX_k" src="http://latex.codecogs.com/gif.latex?Y=b_1X_1+b_2X_2...+b_kX_k" alt="" /></p>
<p>The idea is to perform several observations of the response and explanatory variables and then to choose the linear coefficients <img decoding="async" title="b_1,...b_k" src="http://latex.codecogs.com/gif.latex?b_1,...b_k" alt="" /> which best fit the observed data.</p>
<p>Thus, a multiple linear regression model is:</p>
<p><img decoding="async" title="y_{i} = c + x_{i1}b_{1} + \cdots + x_{ik}b_{k} + f_{i}" src="http://latex.codecogs.com/gif.latex?y_{i} = c + x_{i1}b_{1} + \cdots + x_{ik}b_{k} + f_{i}" alt="" /><br />
<img decoding="async" title="i = 1,\cdots,n,\textrm{ where:}" src="http://latex.codecogs.com/gif.latex?i = 1,\cdots,n,\textrm{ where:}" alt="" /><br />
<img decoding="async" title="y_{i}\textrm{ is the }i\textrm{th value of the response variable}" src="http://latex.codecogs.com/gif.latex?y_{i}\textrm{ is the }i\textrm{th value of the response variable}" alt="" /><br />
<img decoding="async" title="x_{ij}\textrm{ is the }i\textrm{th value of the }j\textrm{th explanatory variable}" src="http://latex.codecogs.com/gif.latex?x_{ij}\textrm{ is the }i\textrm{th value of the }j\textrm{th explanatory variable}" alt="" /><br />
<img decoding="async" title="n \textrm{ is the sample size}" src="http://latex.codecogs.com/gif.latex?n \textrm{ is the sample size}" alt="" /><br />
<img decoding="async" title="k \textrm{ is the number of }x \textrm{-variables}" src="http://latex.codecogs.com/gif.latex?k \textrm{ is the number of }x \textrm{-variables}" alt="" /><br />
<img decoding="async" title="c \textrm{ is the intercept of the regression model}" src="http://latex.codecogs.com/gif.latex?c \textrm{ is the intercept of the regression model}" alt="" /><br />
<img decoding="async" title="b_{j} \textrm{ is the regression coefficient for the }j\textrm{th explanatory variable}" src="http://latex.codecogs.com/gif.latex?b_{j} \textrm{ is the regression coefficient for the }j\textrm{th explanatory variable}" alt="" /><br />
<img decoding="async" title="f\textrm{ is the random noise term, assumed independent, with zero mean and common variance }\sigma ^{2}" src="http://latex.codecogs.com/gif.latex?f\textrm{ is the random noise term, assumed independent, with zero mean and common variance }\sigma ^{2}" alt="" /><br />
<img decoding="async" title="c, b_{1},\cdots,b_{k}\textrm{ and }\sigma^{2}\textrm{ are unknown parameters, to be estimated from the data.}" src="http://latex.codecogs.com/gif.latex?c, b_{1},\cdots,b_{k}\textrm{ and }\sigma^{2}\textrm{ are unknown parameters, to be estimated from the data.}" alt="" /></p>
<p>In matrix notation we have</p>
<p><img decoding="async" title="Xb = y + f" src="http://latex.codecogs.com/gif.latex?\textrm{(1)  }Xb = y + f" alt="" /></p>
<p>where</p>
<p><img decoding="async" title="X = (x_{ij})\textrm{, } y = (y_{i}) \textrm{, and } f = f_{i}" src="http://latex.codecogs.com/gif.latex?X = (x_{ij})\textrm{, } y = (y_{i}) \textrm{, and } f = f_{i}" alt="" />.</p>
<p>The solution for the coefficient vector <img decoding="async" title="b" src="http://latex.codecogs.com/gif.latex?b" alt="" /> which “best” fits the data is given by the so-called “normal equations”</p>
<p><img decoding="async" title="\textrm{(2)  }\beta=(X'X)^{-1}X'y" src="http://latex.codecogs.com/gif.latex?\textrm{(2)  }\beta=(X'X)^{-1}X'y" alt="" /></p>
<p>This is known as the least squares solution to the problem because it minimizes the sum of the squares of the errors.</p>
<p>Now, consider the following example in which<br />
<img decoding="async" title="X=\begin{bmatrix} 1 &amp; 1.9\ 1 &amp; 2.1\ 1 &amp; 2\\ 1&amp; 2\\ 1 &amp; 1.8 \end{bmatrix}" src="http://latex.codecogs.com/gif.latex?X=\begin{bmatrix} 1 &amp; 1.9\\ 1 &amp; 2.1\\ 1 &amp; 2\\ 1&amp; 2\\ 1 &amp; 1.8 \end{bmatrix}" alt="" /></p>
<p>and</p>
<p><img decoding="async" title="y=\begin{bmatrix} 6.0521\\ 7.0280\\ 7.1230\\ 4.4441\\ 5.0813 \end{bmatrix}" src="http://latex.codecogs.com/gif.latex?y=\begin{bmatrix} 6.0521\\ 7.0280\\ 7.1230\\ 4.4441\\ 5.0813 \end{bmatrix}" alt="" /></p>
<p>Solving this simple linear regression model using the normal equations yields</p>
<p><img decoding="async" title="\widehat{b}=\begin{bmatrix} -4.2489\\ 5.2013 \end{bmatrix}" src="http://latex.codecogs.com/gif.latex?\widehat{b}=\begin{bmatrix} -4.2489\\ 5.2013 \end{bmatrix}" alt="" /></p>
<p>which is quite far off from the actual solution</p>
<p><img decoding="async" title="b=\begin{bmatrix} 2\\ 2 \end{bmatrix}" src="http://latex.codecogs.com/gif.latex?b=\begin{bmatrix} 2\\ 2 \end{bmatrix}" alt="" /></p>
<p>The reason is that the matrix <img decoding="async" title="X'X" src="http://latex.codecogs.com/gif.latex?X'X" alt="" /> is ill-conditioned. Since the second column of <img decoding="async" title="X" src="http://latex.codecogs.com/gif.latex?X" alt="" /> is approximately twice the first, the matrix <img decoding="async" title="X'X" src="http://latex.codecogs.com/gif.latex?X'X" alt="" /> is almost singular.</p>
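<p>These numbers are easy to reproduce. A quick illustrative check of the normal equations (2) in Python/numpy:</p>

```python
import numpy as np

X = np.array([[1, 1.9],
              [1, 2.1],
              [1, 2.0],
              [1, 2.0],
              [1, 1.8]])
y = np.array([6.0521, 7.0280, 7.1230, 4.4441, 5.0813])

# Normal equations (2): solve (X'X) b = X'y.
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# X'X is nearly singular, so its condition number is large.
kappa = np.linalg.cond(X.T @ X)
```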
<p>One solution to this problem would be to change the model. Since the second column is approximately twice the first, these two explanatory variables encode essentially the same information, so we could remove one of them from the model.<br />
However, in practice it is usually not as easy to identify the source of the poor conditioning as it is in this example.</p>
<p>Another method for removing information from a model that is responsible for imprecision in the least squares solution is offered by the technique of <em>principal component regression </em>(PCR). Henceforth we shall assume that the data in the matrix <img decoding="async" title="X" src="http://latex.codecogs.com/gif.latex?X" alt="" /> is <em>centered</em>. By this we mean that the mean of each explanatory variable has been subtracted from each column of X so that the explanatory variables all have mean zero. In particular this implies that the matrix <img decoding="async" title="X'X" src="http://latex.codecogs.com/gif.latex?X'X" alt="" /> is proportional to the covariance matrix for the explanatory variables.</p>
<h3>Removing the Source of Imprecision</h3>
<p>Let <img decoding="async" title="X" src="http://latex.codecogs.com/gif.latex?X" alt="" /> be an m x n matrix, and recall from part 1 of this series that we can write <img decoding="async" title="X^TX" src="http://latex.codecogs.com/gif.latex?X^TX" alt="" /> as</p>
<p><img decoding="async" title="X^TX=V \Lambda V^T" src="http://latex.codecogs.com/gif.latex?X^TX=V \Lambda V^T" alt="" /></p>
<p>where <img decoding="async" title="\Lambda" src="http://latex.codecogs.com/gif.latex?\Lambda" alt="" /> is a diagonal matrix containing the eigenvalues (in descending order down the diagonal) of <img decoding="async" title="X^TX" src="http://latex.codecogs.com/gif.latex?X^TX" alt="" />, and <img decoding="async" title="V" src="http://latex.codecogs.com/gif.latex?V" alt="" /> is orthogonal. The condition number <img decoding="async" title="\kappa (X^TX)" src="http://latex.codecogs.com/gif.latex?\kappa (X^TX)" alt="" /> for <img decoding="async" title="X^TX" src="http://latex.codecogs.com/gif.latex?X^TX" alt="" /> is just the absolute value of the ratio of the largest and smallest eigenvalues:</p>
<p><img decoding="async" title="\kappa(X^TX)=\left | \frac{\lambda_{max}}{\lambda_{min}} \right |" src="http://latex.codecogs.com/gif.latex?\kappa(X^TX)=\left | \frac{\lambda_{max}}{\lambda_{min}} \right |" alt="" /></p>
<p>Thus we can see that if the smallest eigenvalue is much smaller than the largest eigenvalue, we get a very large condition number which implies a poorly conditioned matrix. The idea then is to remove these small eigenvalues from <img decoding="async" title="X^TX" src="http://latex.codecogs.com/gif.latex?X^TX" alt="" /> thus giving us an approximation to <img decoding="async" title="X^TX" src="http://latex.codecogs.com/gif.latex?X^TX" alt="" /> that is better conditioned. To this end, suppose that we wish to retain the r (r less than or equal to n) largest eigenvalues of <img decoding="async" title="X^TX" src="http://latex.codecogs.com/gif.latex?X^TX" alt="" /> in our approximation, and thus write</p>
<p><img decoding="async" title="X^TX=(V_1,V_2)\begin{pmatrix} \Lambda_1 &amp; 0 \\ 0 &amp; \Lambda_2 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix}" src="http://latex.codecogs.com/gif.latex?X^TX=(V_1,V_2)\begin{pmatrix} \Lambda_1 &amp; 0 \\ 0 &amp; \Lambda_2 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix}" alt="" />,</p>
<p>where</p>
<p><img decoding="async" title="\Lambda_1" src="http://latex.codecogs.com/gif.latex?\Lambda_1" alt="" /> is an r x r diagonal matrix consisting of the r largest eigenvalues of <img decoding="async" title="X^TX" src="http://latex.codecogs.com/gif.latex?X^TX" alt="" />, <img decoding="async" title="\Lambda_2" src="http://latex.codecogs.com/gif.latex?\Lambda_2" alt="" /> is a (n-r) x (n-r) diagonal matrix consisting of the remaining n – r eigenvalues of <img decoding="async" title="X^TX" src="http://latex.codecogs.com/gif.latex?X^TX" alt="" />, and the n x n matrix <img decoding="async" title="V=(V_1,V_2)" src="http://latex.codecogs.com/gif.latex?V=(V_1,V_2)" alt="" /> is orthogonal with <img decoding="async" title="V_1=(v_1,...v_r)" src="http://latex.codecogs.com/gif.latex?V_1=(v_1,...v_r)" alt="" /> consisting of the first  r columns of  <img decoding="async" title="V" src="http://latex.codecogs.com/gif.latex?V" alt="" />, and <img decoding="async" title="V_2=(v_{r+1},...v_n)" src="http://latex.codecogs.com/gif.latex?V_2=(v_{r+1},...v_n)" alt="" /> consisting of the remaining n – r columns of  <img decoding="async" title="V" src="http://latex.codecogs.com/gif.latex?V" alt="" />. Using this formulation we can write an approximation <img decoding="async" title="\widehat{X^TX}" src="http://latex.codecogs.com/gif.latex?\widehat{X^TX}" alt="" /> to <img decoding="async" title="X^TX" src="http://latex.codecogs.com/gif.latex?X^TX" alt="" /> using the r largest eigenvalues as</p>
<p><img decoding="async" title="\widehat{X^TX}=V_1 \Lambda_1 V_1^T" src="http://latex.codecogs.com/gif.latex?\widehat{X^TX}=V_1 \Lambda_1 V_1^T" alt="" />.</p>
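<p>As a quick numerical aside (a NumPy sketch rather than NMath code; the matrix sizes, perturbation, and tolerances below are made up for illustration), we can watch this truncation repair the conditioning. Note that <code>numpy.linalg.eigh</code> returns eigenvalues in ascending order, so we flip them to match the convention used here:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Build a 50 x 4 matrix whose last column nearly duplicates the first,
# so X^T X has one very small eigenvalue.
X = rng.standard_normal((50, 4))
X[:, 3] = X[:, 0] + 1e-6 * rng.standard_normal(50)

A = X.T @ X
lam, V = np.linalg.eigh(A)         # eigh returns ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]     # reorder to descending

kappa = lam[0] / lam[-1]           # condition number of X^T X: huge

# Keep the r = 3 largest eigenvalues: A_hat = V1 Lambda1 V1^T.
r = 3
V1, lam1 = V[:, :r], lam[:r]
A_hat = V1 @ np.diag(lam1) @ V1.T

kappa_hat = lam1[0] / lam1[-1]     # condition number after truncation: modest
print(kappa, kappa_hat)
```

<p>The truncated matrix is still an excellent approximation to <img decoding="async" title="X^TX" src="http://latex.codecogs.com/gif.latex?X^TX" alt="" />, because the discarded eigenvalue is tiny.</p>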
<p>If we substitute this approximation into the normal equations (2) and simplify, we end up with the <em>principal components estimator</em></p>
<p><img decoding="async" title="\textrm{(3) }\widehat{\beta^{(r)}}=V_1 \Lambda_1^{-1} V_1^T X^T y" src="http://latex.codecogs.com/gif.latex?\textrm{(3) }\widehat{\beta^{(r)}}=V_1 \Lambda_1^{-1} V_1^T X^T y" alt="" />.</p>
<p>While we could use equation (3) directly, it is usually not the best way to perform principal components regression. The next article in this series will illustrate an algorithm for PCR and implement it using the NMath libraries.</p>
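<p>Just to see the algebra of equation (3) in action before the NMath implementation, here is a NumPy sketch on made-up data. With r = n the estimator reduces to ordinary least squares; with r &lt; n it simply projects out the poorly determined directions:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Ill-conditioned design: the last predictor nearly duplicates the first.
X = rng.standard_normal((100, 4))
X[:, 3] = X[:, 0] + 1e-4 * rng.standard_normal(100)
beta_true = np.array([1.0, -2.0, 0.5, 0.0])
y = X @ beta_true + 0.01 * rng.standard_normal(100)

lam, V = np.linalg.eigh(X.T @ X)
lam, V = lam[::-1], V[:, ::-1]      # descending eigenvalues

# Equation (3) with the r = 3 largest eigenvalues retained:
# beta_hat = V1 Lambda1^{-1} V1^T X^T y
r = 3
V1, lam1 = V[:, :r], lam[:r]
beta_pcr = V1 @ np.diag(1.0 / lam1) @ V1.T @ (X.T @ y)

# Sanity check: with r = n the estimator is ordinary least squares.
beta_all = V @ np.diag(1.0 / lam) @ V.T @ (X.T @ y)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
```

<p>Dropping the smallest eigenvalue barely changes the fitted values, since the corresponding direction contributes almost nothing to <img decoding="async" title="Xv" src="http://latex.codecogs.com/gif.latex?Xv" alt="" />, but it prevents the noise amplification that plagues the full inverse.</p>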
<p>-Steve</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/priniciple-components-regression-in-csharp">Principal Components Regression: Part 2 &#8211; The Problem With Linear Regression</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/priniciple-components-regression-in-csharp/feed</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1816</post-id>	</item>
		<item>
		<title>Principal Component Regression: Part 1 &#8211; The Magic of the SVD</title>
		<link>https://www.centerspace.net/theoretical-motivation-behind-pcr</link>
					<comments>https://www.centerspace.net/theoretical-motivation-behind-pcr#comments</comments>
		
		<dc:creator><![CDATA[Steve Sneller]]></dc:creator>
		<pubDate>Mon, 08 Feb 2010 17:44:45 +0000</pubDate>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Theory]]></category>
		<category><![CDATA[PCR]]></category>
		<category><![CDATA[PCR c#]]></category>
		<category><![CDATA[principal component regression]]></category>
		<category><![CDATA[singular value decomposition]]></category>
		<category><![CDATA[SVD]]></category>
		<category><![CDATA[svd c#]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=1307</guid>

					<description><![CDATA[<p><img src="https://www.centerspace.net/blog/wp-content/uploads/2010/02/StevePCA_width400.jpg" alt="SVD of a 2x2 matrix" title="SVD of a 2x2 matrix" class="excerpt" /><br />
This is the first part of a multi-part series on Principal Component Regression, or PCR for short. We will eventually end up with a computational algorithm for PCR and code it up in C# using the NMath libraries. PCR is a method for constructing a linear regression model when we have a large number of highly correlated predictor variables. Of course, we don't know exactly which variables are correlated; if we did, we'd just throw the redundant ones out and perform an ordinary linear regression.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/theoretical-motivation-behind-pcr">Principal Component Regression: Part 1 &#8211; The Magic of the SVD</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<h2>Introduction</h2>
<p>This is the first part of a multi-part series on Principal Component Regression, or PCR for short. We will eventually end up with a computational algorithm for PCR and code it up in C# using the NMath libraries. PCR is a method for constructing a linear regression model when we have a large number of highly correlated predictor variables. Of course, we don&#8217;t know exactly which variables are correlated; if we did, we&#8217;d just throw the redundant ones out and perform an ordinary linear regression.</p>
<p>In order to understand what is going on in the PCR algorithm, we need to know a little about the SVD (Singular Value Decomposition) and its relationship to the eigenvalue decomposition. That understanding will go a long way toward understanding the PCR algorithm.</p>
<h2>The Singular Value Decomposition</h2>
<p>The SVD (Singular Value Decomposition) is one of the most revealing matrix decompositions in linear algebra. It is a bit expensive to compute, but the bounty of information it yields is awe-inspiring, and understanding a little about it will illuminate the Principal Components Regression (PCR) algorithm. The SVD may seem like a deep and mysterious thing; at least I thought it was until I read the chapters covering it in the book <a href="https://www.iri.upc.edu/people/thomas/Collection/details/5350.html">&#8220;Numerical Linear Algebra&#8221;</a> by Lloyd N. Trefethen and David Bau, III, which I summarize below.<br />
<span id="more-1307"></span><br />
We begin with an easy-to-state, and not-too-difficult-to-prove, geometric fact about linear transformations.</p>
<h2>A Geometric Fact</h2>
<p>Let <img decoding="async" title="S" src="http://latex.codecogs.com/gif.latex?S" alt="" /> be the unit sphere in <img decoding="async" title="\mathbb{R}^{n}" src="http://latex.codecogs.com/gif.latex?\mathbb{R}^{n}" alt="" />, and let <img decoding="async" title="X \in \mathbb{R}^{m \times n}" src="http://latex.codecogs.com/gif.latex?X \in \mathbb{R}^{m \times n}" alt="" /> be any matrix mapping <img decoding="async" title="\mathbb{R}^{n}" src="http://latex.codecogs.com/gif.latex?\mathbb{R}^{n}" alt="" /> into <img decoding="async" title="\mathbb{R}^{m}" src="http://latex.codecogs.com/gif.latex?\mathbb{R}^{m}" alt="" />, and suppose, for the moment, that <img decoding="async" title="X" src="http://latex.codecogs.com/gif.latex?X" alt="" /> has full rank. Then the image <img decoding="async" title="XS" src="http://latex.codecogs.com/gif.latex?XS" alt="" /> of <img decoding="async" title="S" src="http://latex.codecogs.com/gif.latex?S" alt="" /> under <img decoding="async" title="X" src="http://latex.codecogs.com/gif.latex?X" alt="" /> is a hyperellipse in <img decoding="async" title="\mathbb{R}^{m}" src="http://latex.codecogs.com/gif.latex?\mathbb{R}^{m}" alt="" /> (see the book for the proof).</p>
<p><figure id="attachment_1360" aria-describedby="caption-attachment-1360" style="width: 400px" class="wp-caption aligncenter"><a href="https://www.centerspace.net/blog/wp-content/uploads/2010/02/StevePCA_width400.jpg"><img decoding="async" loading="lazy" class="size-full wp-image-1360" title="SVD of a 2x2 matrix" src="https://www.centerspace.net/blog/wp-content/uploads/2010/02/StevePCA_width400.jpg" alt="SVD of a 2x2 matrix" width="400" height="184" srcset="https://www.centerspace.net/wp-content/uploads/2010/02/StevePCA_width400.jpg 400w, https://www.centerspace.net/wp-content/uploads/2010/02/StevePCA_width400-300x138.jpg 300w" sizes="(max-width: 400px) 100vw, 400px" /></a><figcaption id="caption-attachment-1360" class="wp-caption-text">Figure 1.  SVD of a 2x2 matrix</figcaption></figure></p>
<p>Given this fact we make the following definitions (refer to Figure 1.):</p>
<p>Define the singular values,</p>
<p><img decoding="async" title="\sigma _{1}\cdots\sigma_{n}" src="http://latex.codecogs.com/gif.latex?\sigma _{1}\cdots\sigma_{n}" alt="" /></p>
<p>of <img decoding="async" title="X" src="http://latex.codecogs.com/gif.latex?X" alt="" /> to be the lengths of the <img decoding="async" title="n" src="http://latex.codecogs.com/gif.latex?n" alt="" /> principal semiaxes of the hyperellipse <img decoding="async" title="XS" src="http://latex.codecogs.com/gif.latex?XS" alt="" />. It is conventional to assume the singular values are numbered in descending order</p>
<p><img decoding="async" title="\sigma_{1}\geq\sigma_{2}\geq\cdots\geq\sigma_{n}" src="http://latex.codecogs.com/gif.latex?\inline \sigma_{1}\geq\sigma_{2}\geq\cdots\geq\sigma_{n}" alt="" /></p>
<p>Define the left singular vectors</p>
<p><img decoding="async" title="u_{1},\cdots,u_{n}" src="http://latex.codecogs.com/gif.latex?u_{1},\cdots,u_{n}" alt="" /></p>
<p>to be unit vectors in the direction of the principal semiaxes of <img decoding="async" title="XS" src="http://latex.codecogs.com/gif.latex?XS" alt="" /> and define the right singular vectors,</p>
<p><img decoding="async" title="v_{1}\cdots v_{n}" src="http://latex.codecogs.com/gif.latex?v_{1}\cdots v_{n}" alt="" />,</p>
<p>to be the pre-images of the principal semiaxes of <img decoding="async" title="XS" src="http://latex.codecogs.com/gif.latex?XS" alt="" /> so that</p>
<p><img decoding="async" title="Xv_{i} = \sigma_{i}u_{i}" src="http://latex.codecogs.com/gif.latex?Xv_{i} = \sigma_{i}u_{i}" alt="" />.</p>
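<p>This defining relation is easy to check numerically. Here is a small NumPy sketch (illustrative only; the NMath classes used later in this series wrap the same decomposition) verifying that each right singular vector maps to the corresponding scaled left singular vector, and that no unit vector is stretched by more than the largest singular value:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))          # full rank with probability 1

# Reduced SVD: rows of Vt are the right singular vectors v_i,
# columns of U are the left singular vectors u_i.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The defining relation X v_i = sigma_i u_i, for each i.
for i in range(3):
    assert np.allclose(X @ Vt[i], s[i] * U[:, i])

# sigma_1 is the longest semiaxis of the hyperellipse XS:
# no unit vector is stretched by more than sigma_1.
v = rng.standard_normal(3)
v /= np.linalg.norm(v)
assert np.linalg.norm(X @ v) <= s[0] + 1e-12
```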
<p>In matrix form we have</p>
<p><img decoding="async" title="XV = U \Sigma" src="http://latex.codecogs.com/gif.latex?XV = U \Sigma" alt="" />,</p>
<p>where <img decoding="async" src="http://latex.codecogs.com/gif.latex?V" alt="" /> is the <img decoding="async" src="http://latex.codecogs.com/gif.latex?n\textrm{ x }n" alt="" /> orthonormal matrix whose columns are the right singular vectors of <img decoding="async" src="http://latex.codecogs.com/gif.latex?X" alt="" />, <img decoding="async" src="http://latex.codecogs.com/gif.latex?\Sigma" alt="" /> is an <img decoding="async" src="http://latex.codecogs.com/gif.latex?n\textrm{ x }n" alt="" /> diagonal matrix with positive entries equal to the singular values, and <img decoding="async" src="http://latex.codecogs.com/gif.latex?U" alt="" /> is an <img decoding="async" src="http://latex.codecogs.com/gif.latex?m\textrm{ x }n" alt="" /> matrix whose orthonormal columns are the left singular vectors.<br />
Since the columns of <img decoding="async" src="http://latex.codecogs.com/gif.latex?V" alt="" /> are orthonormal by construction, <img decoding="async" src="http://latex.codecogs.com/gif.latex?V" alt="" /> is a <em>unitary</em> matrix; that is, its transpose is equal to its inverse, so we can write</p>
<p><img decoding="async" title="\textrm{(2) }X = U \Sigma V^{T}" src="http://latex.codecogs.com/gif.latex?\textrm{(2) }X = U \Sigma V^{T}" alt="" /></p>
<p>And there you have it, the SVD in all its majesty! Actually, the above decomposition is what is known as the <em>reduced</em> SVD. Note that the columns of <img decoding="async" src="http://latex.codecogs.com/gif.latex?U" alt="" /> are <img decoding="async" src="http://latex.codecogs.com/gif.latex?n" alt="" /> orthonormal vectors in <img decoding="async" src="http://latex.codecogs.com/gif.latex?m" alt="" />-dimensional space. <img decoding="async" src="http://latex.codecogs.com/gif.latex?U" alt="" /> can be extended to a unitary matrix by adjoining an additional <img decoding="async" src="http://latex.codecogs.com/gif.latex?m-n" alt="" /> orthonormal columns. If in addition we append <img decoding="async" src="http://latex.codecogs.com/gif.latex?m-n" alt="" /> rows of zeros to the bottom of the matrix <img decoding="async" src="http://latex.codecogs.com/gif.latex?\Sigma" alt="" />, the appended columns of <img decoding="async" src="http://latex.codecogs.com/gif.latex?U" alt="" /> are effectively multiplied by zero, preserving equation (2). When <img decoding="async" src="http://latex.codecogs.com/gif.latex?U" alt="" /> and <img decoding="async" src="http://latex.codecogs.com/gif.latex?\Sigma" alt="" /> are modified in this way, equation (2) is called the <em>full</em> SVD.</p>
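<p>The reduced and full forms are both readily available in most linear algebra packages. As a hedged aside, here is how the distinction looks in NumPy (the <code>full_matrices</code> flag; sizes chosen arbitrarily), confirming that both factorizations reproduce <img decoding="async" src="http://latex.codecogs.com/gif.latex?X" alt="" /> and that the extended <img decoding="async" src="http://latex.codecogs.com/gif.latex?U" alt="" /> is unitary:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 3
X = rng.standard_normal((m, n))

# Reduced SVD: U is m x n, Sigma is n x n.
U_r, s, Vt = np.linalg.svd(X, full_matrices=False)

# Full SVD: U is extended to an m x m unitary matrix, and Sigma
# is padded with m - n rows of zeros so equation (2) still holds.
U_f, s_f, Vt_f = np.linalg.svd(X, full_matrices=True)
Sigma = np.zeros((m, n))
Sigma[:n, :n] = np.diag(s_f)

assert np.allclose(U_r @ np.diag(s) @ Vt, X)    # reduced SVD reproduces X
assert np.allclose(U_f @ Sigma @ Vt_f, X)       # full SVD reproduces X
assert np.allclose(U_f.T @ U_f, np.eye(m))      # extended U is unitary
```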
<h2>The Relationship Between Singular Values and Eigenvalues</h2>
<p>There is an important relationship between the singular values of <img decoding="async" title="X" src="http://latex.codecogs.com/gif.latex?X" alt="" /> and the eigenvalues of <img decoding="async" title="X^{T}X" src="http://latex.codecogs.com/gif.latex?X^{T}X" alt="" />. Recall that a nonzero vector <img decoding="async" title="v" src="http://latex.codecogs.com/gif.latex?v" alt="" /> is an eigenvector with corresponding eigenvalue <img decoding="async" title="\lambda" src="http://latex.codecogs.com/gif.latex?\lambda" alt="" /> for a matrix <img decoding="async" title="X" src="http://latex.codecogs.com/gif.latex?X" alt="" /> if and only if <img decoding="async" title="Xv=\lambda v" src="http://latex.codecogs.com/gif.latex?Xv=\lambda v" alt="" />. Now, suppose we have the full SVD for <img decoding="async" src="http://latex.codecogs.com/gif.latex?X" alt="" /> as in equation (2). Then</p>
<p><img decoding="async" title="X^{T}X=(U\Sigma V^{T})^{T}(U \Sigma V^{T})" src="http://latex.codecogs.com/gif.latex?X^{T}X=(U\Sigma V^{T})^{T}(U \Sigma V^{T})" alt="" /></p>
<p><img decoding="async" title="= V \Sigma ^{T}U^{T}U \Sigma V^{T}" src="http://latex.codecogs.com/gif.latex?= V \Sigma ^{T}U^{T}U \Sigma V^{T}" alt="" /></p>
<p><img decoding="async" title="= V \Sigma^{T} \Sigma V^{T}" src="http://latex.codecogs.com/gif.latex?= V \Sigma^{T} \Sigma V^{T}" alt="" /></p>
<p>or,</p>
<p><img decoding="async" title="(X^{T}X)V = V \Lambda" src="http://latex.codecogs.com/gif.latex?(X^{T}X)V = V \Lambda" alt="" /></p>
<p>where we have used the fact that <img decoding="async" src="http://latex.codecogs.com/gif.latex?U" alt="" /> and <img decoding="async" src="http://latex.codecogs.com/gif.latex?V" alt="" /> are unitary and set</p>
<p><img decoding="async" src="http://latex.codecogs.com/gif.latex?\Lambda = \Sigma^{T} \Sigma" alt="" />.</p>
<p>Note that <img decoding="async" src="http://latex.codecogs.com/gif.latex?\Lambda" alt="" /> is a diagonal matrix with the singular values squared along the diagonal. From this it follows that the columns of <img decoding="async" src="http://latex.codecogs.com/gif.latex?V" alt="" /> are eigenvectors for <img decoding="async" title="X^{T}X" src="http://latex.codecogs.com/gif.latex?X^{T}X" alt="" /> and the main diagonal of <img decoding="async" src="http://latex.codecogs.com/gif.latex?\Lambda" alt="" /> contains the corresponding eigenvalues. Thus the nonzero singular values of <img decoding="async" src="http://latex.codecogs.com/gif.latex?X" alt="" /> are the square roots of the nonzero eigenvalues of <img decoding="async" title="X^{T}X" src="http://latex.codecogs.com/gif.latex?X^{T}X" alt="" />.</p>
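<p>This square-root relationship is easy to confirm numerically; here is a tiny NumPy check (arbitrary data, for illustration only):</p>

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((8, 4))

s = np.linalg.svd(X, compute_uv=False)       # singular values, descending
lam = np.linalg.eigvalsh(X.T @ X)[::-1]      # eigenvalues of X^T X, descending

# Nonzero singular values of X are the square roots of the
# nonzero eigenvalues of X^T X.
assert np.allclose(s ** 2, lam)
```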
<p>We need one more very cool fact about the SVD before we get to the algorithm: low-rank approximation.</p>
<h2>Low-Rank Approximation</h2>
<p>Suppose now that <img decoding="async" src="http://latex.codecogs.com/gif.latex?X" alt="" /> has rank <img decoding="async" src="http://latex.codecogs.com/gif.latex?r" alt="" /> and write <img decoding="async" src="http://latex.codecogs.com/gif.latex?\Sigma" alt="" /> in equation (2) as the sum of <img decoding="async" src="http://latex.codecogs.com/gif.latex?r" alt="" /> rank-one matrices (the <img decoding="async" src="http://latex.codecogs.com/gif.latex?j" alt="" />th such matrix is all zeros except for <img decoding="async" src="http://latex.codecogs.com/gif.latex?\sigma_{j}" alt="" /> as the <img decoding="async" src="http://latex.codecogs.com/gif.latex?j" alt="" />th diagonal element). We can then, using equation (2), write <img decoding="async" src="http://latex.codecogs.com/gif.latex?X" alt="" /> as a sum of rank-one matrices,</p>
<p><img decoding="async" title="\textrm{(3)  }X=\sum_{j=1}^{r} \sigma_{j}u_{j}v_{j}^{T}" src="http://latex.codecogs.com/gif.latex?\textrm{(3)  }X=\sum_{j=1}^{r} \sigma_{j}u_{j}v_{j}^{T}" alt="" /></p>
<p>Equation (3) gives us a way to approximate any rank <img decoding="async" src="http://latex.codecogs.com/gif.latex?r" alt="" /> matrix <img decoding="async" src="http://latex.codecogs.com/gif.latex?X" alt="" /> by a lower rank <img decoding="async" src="http://latex.codecogs.com/gif.latex?k &lt; r" alt="" /> matrix. Indeed, given <img decoding="async" src="http://latex.codecogs.com/gif.latex?k &lt; r" alt="" />, form the <img decoding="async" src="http://latex.codecogs.com/gif.latex?k\textrm{th}" alt="" /> partial sum</p>
<p><img decoding="async" title="X_{k}=\sum_{j=1}^{k} \sigma_{j}u_{j}v_{j}^{T}" src="http://latex.codecogs.com/gif.latex?X_{k}=\sum_{j=1}^{k} \sigma_{j}u_{j}v_{j}^{T}" alt="" /></p>
<p>Then <img decoding="async" src="http://latex.codecogs.com/gif.latex?X_{k}" alt="" /> is a rank <img decoding="async" src="http://latex.codecogs.com/gif.latex?k" alt="" /> approximation for <img decoding="async" src="http://latex.codecogs.com/gif.latex?X" alt="" />. How good is this approximation? It turns out to be the best rank <img decoding="async" src="http://latex.codecogs.com/gif.latex?k" alt="" /> approximation you can get, in both the 2-norm and the Frobenius norm (this is the Eckart&#8211;Young theorem).</p>
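<p>Before turning to the NMath version, here is a NumPy sketch of the same partial-sum construction (random data, chosen only for illustration). It checks that the Frobenius error of the truncation equals the root-sum-square of the discarded singular values, and that an unrelated rank-k matrix does no better:</p>

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((10, 20))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 3
# Rank-k partial sum of the rank-one expansion in equation (3).
Xk = sum(s[j] * np.outer(U[:, j], Vt[j]) for j in range(k))
err = np.linalg.norm(X - Xk)                  # Frobenius norm of the error

# The error equals sqrt(sigma_{k+1}^2 + ... + sigma_r^2).
assert np.allclose(err, np.sqrt(np.sum(s[k:] ** 2)))

# Eckart-Young: an arbitrary competing rank-k matrix is no closer to X.
Y = rng.standard_normal((10, 20))
Uy, sy, Vty = np.linalg.svd(Y, full_matrices=False)
Yk = (Uy[:, :k] * sy[:k]) @ Vty[:k]           # some other rank-k matrix
assert err <= np.linalg.norm(X - Yk) + 1e-12
```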
<h2>Computing the Low-Rank Approximations Using NMath</h2>
<p>The NMath library provides two classes for computing the SVD for a matrix (actually eight, since there are SVD classes for each of the datatypes <code>Double</code>, <code>Float</code>, <code>DoubleComplex</code> and <code>FloatComplex</code>). There is a basic decomposition class for computing the standard, reduced SVD, and a decomposition server class when more control is desired. Here is a simple C# routine that constructs the low-rank approximations for a matrix <img decoding="async" src="http://latex.codecogs.com/gif.latex?X" alt="" /> and prints out the Frobenius norm of the difference between <img decoding="async" src="http://latex.codecogs.com/gif.latex?X" alt="" /> and each of its low-rank approximations.</p>
<pre lang="csharp">static void LowerRankApproximations( DoubleMatrix X )
{
  // Construct the reduced SVD for X. We will consider
  // all singular values less than 1e-15 to be zero.
  DoubleSVDecomp decomp = new DoubleSVDecomp( X );
  decomp.Truncate( 1e-15 );
  int r = decomp.Rank;
  Console.WriteLine( "The {0}x{1} matrix X has rank {2}", X.Rows, X.Cols, r );

  // Construct the best lower rank approximations to X and
  // look at the frobenius norm of their differences.
  DoubleMatrix LowerRankApprox =
    new DoubleMatrix( X.Rows, X.Cols );
  double differenceNorm;
  for ( int k = 0; k &lt; r; k++ )
  {
    LowerRankApprox += decomp.SingularValues[k] *
      NMathFunctions.OuterProduct( decomp.LeftVectors.Col( k ), decomp.RightVectors.Col( k ) );
    differenceNorm = ( X - LowerRankApprox ).FrobeniusNorm();
    Console.WriteLine( "Rank {0} approximation difference norm = {1:F4}", k+1, differenceNorm );
  }
}</pre>
<p>Here&#8217;s the output for a matrix with 10 rows and 20 columns. Note that the rank can be at most 10.</p>
<pre lang="csharp">The 10x20 matrix X has rank 10
Rank 1 approximation difference norm = 3.7954
Rank 2 approximation difference norm = 3.3226
Rank 3 approximation difference norm = 2.9135
Rank 4 approximation difference norm = 2.4584
Rank 5 approximation difference norm = 2.0038
Rank 6 approximation difference norm = 1.5689
Rank 7 approximation difference norm = 1.1829
Rank 8 approximation difference norm = 0.8107
Rank 9 approximation difference norm = 0.3676
Rank 10 approximation difference norm = 0.0000</pre>
<p>-Steve</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/theoretical-motivation-behind-pcr">Principal Component Regression: Part 1 &#8211; The Magic of the SVD</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/theoretical-motivation-behind-pcr/feed</wfw:commentRss>
			<slash:comments>8</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1307</post-id>	</item>
	</channel>
</rss>
