This is the second part in a three part series on PCR, the first article on the topic can be found here.
The Linear Regression Model
Multiple Linear Regression (MLR) is a common approach to modeling the relationship between one or two or more explanatory variables and a response variable by fitting a linear equation to observed data. First let’s set up some notation. I will be rather brief, assuming the audience is somewhat familiar with MLR.
In multiple linear regression it is assumed that a response variable, depends on k explanatory variables, , by way of a linear relationship:
The idea is to perform several observations of the response and explanatory variables and then to chose the linear coefficients which best fit the observed data.
Thus, a multiple linear regression model is:
In matrix notation we have
The solution for the coefficient vector which “best” fits the data is given by the so called “normal equations”
This is known as the least squares solution to the problem because it minimizes the sum of the squares of the errors.
Now, consider the following example in which
Solving this simple linear regression model using the normal equations yields
which is quite far off from the actual solution
The reason behind this is the fact that the matrix is ill conditioned. Since the second column of is approximately twice the first, the matrix is almost singular.
One solution to this problem would be to change the model. Since the second column is approximately twice the first, these two explanatory variables encode basically the same information, thus we could remove one of them from the model.
However, it is usually not so easy to identify the source of the bad conditioning as it is in this example.
Another method for removing information from a model that is responsible for impreciseness in the least squares solution is offered by the technique of principal component regression (PCR). Henceforth we shall assume that the data in the matrix is centered. By this we mean that the mean of each explanatory variable has been subtracted from each column of X so that the explanatory variables all have mean zero. In particular this implies that the matrix is proportional to the covariance matrix for the explanatory variables.
Removing the Source of Imprecision
Let be an mxn matrix, and recall from the part 1 of this series that we can write as
where is a diagonal matrix containing the eigenvalues (in ascending order down the diagonal) of , and is orthogonal. The condition number for is just the absolute value of the ratio of the largest and smallest eigenvalues:
Thus we can see that if the smallest eigenvalue is much smaller than the largest eigenvalue, we get a very large condition number which implies a poorly conditioned matrix. The idea then is to remove these small eigenvalues from thus giving us an approximation to that is better conditioned. To this end, suppose that we wish to retain the r (r less than or equal to n) largest eigenvalues of in our approximation, and thus write
is an r x r diagonal matrix consisting of the r largest eigenvalues of , is a (n-r) x (n-r) diagonal matrix consisting of the remaining n – r eigenvalues of , and the n x n matrix is orthogonal with consisting of the first r columns of , and consisting of the remaining n – r columns of . Using this formulation we can write an approximation to using the r largest eigenvalues as
If we substitute this approximation into the normal equations 2, and do some simplification, we end up with the principal components estimator
While we could use equation 3 directly, it is usually not the best way to perform principal components regression. The next article in this series will illustrate an algorithm for PCR and implement it using the NMath libraries.