# Principal Components Regression: Part 2 – The Problem With Linear Regression

This is the second part in a three-part series on PCR; the first article in the series can be found here.

### The Linear Regression Model

Multiple Linear Regression (MLR) is a common approach to modeling the relationship between one or more explanatory variables and a response variable by fitting a linear equation to observed data. First, let's set up some notation. I will be rather brief, assuming the audience is somewhat familiar with MLR.

In multiple linear regression it is assumed that a response variable, $Y$, depends on $k$ explanatory variables, $X_1,\dots,X_k$, by way of a linear relationship:

$Y=b_1X_1+b_2X_2+\cdots+b_kX_k$

The idea is to perform several observations of the response and explanatory variables and then to choose the linear coefficients $b_1,\dots,b_k$ which best fit the observed data.

Thus, a multiple linear regression model is:

$y_{i} = c + x_{i1}b_{1} + \cdots + x_{ik}b_{k} + f_{i}$
$i = 1,\cdots,n,\textrm{ where:}$
$y_{i}\textrm{ is the }i\textrm{th value of the response variable}$
$x_{ij}\textrm{ is the }i\textrm{th value of the }j\textrm{th explanatory variable}$
$n \textrm{ is the sample size}$
$k \textrm{ is the number of }x \textrm{-variables}$
$c \textrm{ is the intercept of the regression model}$
$b_{j} \textrm{ is the regression coefficient for the }j\textrm{th explanatory variable}$
$f_{i}\textrm{ is the random noise term, assumed independent, with zero mean and common variance }\sigma ^{2}$
$c, b_{1},\cdots,b_{k}\textrm{ and }\sigma^{2}\textrm{ are unknown parameters, to be estimated from the data.}$

In matrix notation we have

$\textrm{(1) }y = Xb + f$

where

$X = (x_{ij})\textrm{, } y = (y_{i}) \textrm{, and } f = (f_{i})$.

The solution for the coefficient vector $b$ which “best” fits the data is given by the so-called “normal equations”

$\textrm{(2) }\widehat{b}=(X'X)^{-1}X'y$

This is known as the least squares solution to the problem because it minimizes the sum of the squares of the errors.

Now, consider the following example in which
$X=\begin{bmatrix} 1 & 1.9\\ 1 & 2.1\\ 1 & 2\\ 1& 2\\ 1 & 1.8 \end{bmatrix}$

and

$y=\begin{bmatrix} 6.0521\\ 7.0280\\ 7.1230\\ 4.4441\\ 5.0813 \end{bmatrix}$

Solving this simple linear regression model using the normal equations yields

$\widehat{b}=\begin{bmatrix} -4.2489\\ 5.2013 \end{bmatrix}$

which is quite far off from the actual solution

$b=\begin{bmatrix} 2\\ 2 \end{bmatrix}$

The reason for this is that the matrix $X'X$ is ill-conditioned: since the second column of $X$ is approximately twice the first, the matrix $X'X$ is nearly singular.
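To make the example concrete, here is a short numpy sketch (the variable names are my own) that reproduces the least squares fit via the normal equations and reports the condition number of $X'X$:

```python
import numpy as np

# Design matrix and response from the example above.
X = np.array([[1, 1.9],
              [1, 2.1],
              [1, 2.0],
              [1, 2.0],
              [1, 1.8]])
y = np.array([6.0521, 7.0280, 7.1230, 4.4441, 5.0813])

XtX = X.T @ X

# Least squares via the normal equations: b = (X'X)^{-1} X'y.
b_hat = np.linalg.solve(XtX, X.T @ y)
print(b_hat)  # approximately [-4.2489, 5.2013]

# The condition number of X'X is large because the columns
# of X are nearly collinear.
print(np.linalg.cond(XtX))
```

Even though the fitted coefficients are far from the true $b = (2, 2)$, the residuals are small; near-collinearity makes the coefficient estimates, not the fitted values, unstable.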

One solution to this problem would be to change the model. Since the second column is approximately twice the first, these two explanatory variables encode basically the same information, thus we could remove one of them from the model.
However, it is usually not so easy to identify the source of the bad conditioning as it is in this example.

Another method for removing information from a model that is responsible for imprecision in the least squares solution is offered by the technique of principal components regression (PCR). Henceforth we shall assume that the data in the matrix $X$ are centered. By this we mean that the mean of each explanatory variable has been subtracted from its column of $X$, so that the explanatory variables all have mean zero. In particular, this implies that the matrix $X'X$ is proportional to the covariance matrix of the explanatory variables.
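Centering is a one-line operation in numpy. A minimal sketch (the data here are arbitrary, for illustration only) showing that the centered $X'X$ is proportional to the sample covariance matrix:

```python
import numpy as np

X = np.array([[1.0, 1.9],
              [1.2, 2.1],
              [0.9, 2.0],
              [1.1, 2.0],
              [0.8, 1.8]])

# Subtract each column's mean so every explanatory variable has mean zero.
Xc = X - X.mean(axis=0)

# After centering, X'X is proportional to the sample covariance matrix:
# X'X / (n - 1) equals np.cov of the original data.
n = X.shape[0]
cov = Xc.T @ Xc / (n - 1)
print(np.allclose(cov, np.cov(X, rowvar=False)))  # True
```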

### Removing the Source of Imprecision

Let $X$ be an $m \times n$ matrix, and recall from Part 1 of this series that we can write $X^TX$ as

$X^TX=V \Lambda V^T$

where $\Lambda$ is a diagonal matrix containing the eigenvalues of $X^TX$ (in descending order down the diagonal), and $V$ is orthogonal. The condition number $\kappa (X^TX)$ of $X^TX$ is just the absolute value of the ratio of the largest and smallest eigenvalues:

$\kappa(X^TX)=\left | \frac{\lambda_{max}}{\lambda_{min}} \right |$

Thus, if the smallest eigenvalue is much smaller than the largest, we get a very large condition number, which implies a poorly conditioned matrix. The idea, then, is to remove these small eigenvalues from $X^TX$, giving us an approximation to $X^TX$ that is better conditioned. To this end, suppose that we wish to retain the $r$ ($r \le n$) largest eigenvalues of $X^TX$ in our approximation, and thus write

$X^TX=(V_1,V_2)\begin{pmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix}$,

where

$\Lambda_1$ is an $r \times r$ diagonal matrix consisting of the $r$ largest eigenvalues of $X^TX$, $\Lambda_2$ is an $(n-r) \times (n-r)$ diagonal matrix consisting of the remaining $n-r$ eigenvalues of $X^TX$, and the $n \times n$ matrix $V=(V_1,V_2)$ is orthogonal, with $V_1=(v_1,\dots,v_r)$ consisting of the first $r$ columns of $V$ and $V_2=(v_{r+1},\dots,v_n)$ consisting of the remaining $n-r$ columns of $V$. Using this formulation we can write an approximation $\widehat{X^TX}$ to $X^TX$ using the $r$ largest eigenvalues as

$\widehat{X^TX}=V_1 \Lambda_1 V_1^T$.

If we substitute this approximation into the normal equations (2) and simplify, we end up with the principal components estimator

$\textrm{(3) }\widehat{b^{(r)}}=V_1 \Lambda_1^{-1} V_1^T X^T y$.
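Equation (3) translates almost verbatim into numpy. The sketch below (the function and variable names are my own, and for simplicity it reuses the earlier design matrix rather than a centered one) uses `np.linalg.eigh`, which returns eigenvalues in ascending order, so the columns are reversed to put the largest eigenvalues first:

```python
import numpy as np

def pcr_estimator(X, y, r):
    """Principal components estimator of equation (3),
    retaining the r largest eigenvalues of X'X."""
    evals, V = np.linalg.eigh(X.T @ X)   # eigh returns ascending eigenvalues
    evals, V = evals[::-1], V[:, ::-1]   # reorder so the largest come first
    V1 = V[:, :r]                        # first r eigenvectors
    L1_inv = np.diag(1.0 / evals[:r])    # Lambda_1 inverse
    return V1 @ L1_inv @ V1.T @ X.T @ y

# With r equal to the number of columns, no eigenvalues are discarded
# and the estimator reduces to the ordinary least squares solution.
X = np.array([[1, 1.9], [1, 2.1], [1, 2.0], [1, 2.0], [1, 1.8]], dtype=float)
y = np.array([6.0521, 7.0280, 7.1230, 4.4441, 5.0813])
print(pcr_estimator(X, y, r=2))  # matches the least squares fit above
print(pcr_estimator(X, y, r=1))  # discards the smallest eigenvalue
```

Choosing $r < n$ trades a little bias for a large reduction in the variance of the coefficient estimates, which is the whole point of the technique.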

While we could use equation (3) directly, it is usually not the best way to perform principal components regression in practice. The next article in this series will present an algorithm for PCR and implement it using the NMath libraries.

-Steve
