# Blog

## Principal Components Regression: Part 2 – The Problem With Linear Regression

Thursday, March 4th, 2010

This is the second part in a three-part series on PCR; the first article on the topic can be found here.

### The Linear Regression Model

Multiple Linear Regression (MLR) is a common approach to modeling the relationship between two or more explanatory variables and a response variable by fitting a linear equation to the observed data. First, let’s set up some notation. I will be rather brief, assuming the audience is somewhat familiar with MLR.

In multiple linear regression it is assumed that a response variable, $Y$, depends on $k$ explanatory variables, $X_1,\ldots,X_k$, by way of a linear relationship:

$Y=b_1X_1+b_2X_2+\cdots+b_kX_k$

The idea is to make several observations of the response and explanatory variables and then to choose the linear coefficients $b_1,\ldots,b_k$ which best fit the observed data.

Thus, a multiple linear regression model is:

$y_{i} = c + x_{i1}b_{1} + \cdots + x_{ik}b_{k} + f_{i},\quad i = 1,\ldots,n$

where:

- $y_{i}$ is the $i$th value of the response variable
- $x_{ij}$ is the $i$th value of the $j$th explanatory variable
- $n$ is the sample size
- $k$ is the number of $x$-variables
- $c$ is the intercept of the regression model
- $b_{j}$ is the regression coefficient for the $j$th explanatory variable
- $f_{i}$ is the random noise term, assumed independent, with zero mean and common variance $\sigma^{2}$
- $c, b_{1},\ldots,b_{k}$ and $\sigma^{2}$ are unknown parameters, to be estimated from the data

In matrix notation we have

$\textrm{(1) }y = Xb + f$

where

$X = (x_{ij})\textrm{, } y = (y_{i}) \textrm{, and } f = (f_{i})$.

The solution for the coefficient vector $b$ which “best” fits the data is given by the so-called “normal equations”

$\textrm{(2) }\widehat{b}=(X'X)^{-1}X'y$

This is known as the least squares solution to the problem because it minimizes the sum of the squares of the errors.

Now, consider the following example in which
$X=\begin{bmatrix} 1 & 1.9\\ 1 & 2.1\\ 1 & 2\\ 1& 2\\ 1 & 1.8 \end{bmatrix}$

and

$y=\begin{bmatrix} 6.0521\\ 7.0280\\ 7.1230\\ 4.4441\\ 5.0813 \end{bmatrix}$

Solving this simple linear regression model using the normal equations yields

$\widehat{b}=\begin{bmatrix} -4.2489\\ 5.2013 \end{bmatrix}$

which is quite far from the true coefficient vector

$b=\begin{bmatrix} 2\\ 2 \end{bmatrix}$

The reason behind this is that the matrix $X'X$ is ill-conditioned. Since the second column of $X$ is approximately twice the first, the matrix $X'X$ is nearly singular.
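To see the numbers concretely, here is a quick sketch in plain Python (not the NMath/C# code this series builds toward) that forms the normal equations for the example by hand and checks the conditioning of $X'X$:

```python
import math

# The example above: an intercept column plus a predictor that is
# roughly twice the intercept column
X = [[1, 1.9], [1, 2.1], [1, 2.0], [1, 2.0], [1, 1.8]]
y = [6.0521, 7.0280, 7.1230, 4.4441, 5.0813]

# Form the 2x2 normal-equations matrix X'X and the vector X'y
a = sum(r[0] * r[0] for r in X)   # 5
b = sum(r[0] * r[1] for r in X)   # 9.8
c = sum(r[1] * r[1] for r in X)   # 19.26
t0 = sum(r[0] * yi for r, yi in zip(X, y))
t1 = sum(r[1] * yi for r, yi in zip(X, y))

# Solve (X'X) b = X'y by Cramer's rule; det(X'X) = 0.26 is tiny
# relative to the entries -- the signature of an ill-conditioned system
det = a * c - b * b
b_hat = [(c * t0 - b * t1) / det, (a * t1 - b * t0) / det]
print(b_hat)   # roughly [-4.2489, 5.2013], far from the true [2, 2]

# Condition number of X'X from its two eigenvalues (closed form
# for a symmetric 2x2 matrix)
disc = math.sqrt((a - c) ** 2 + 4 * b * b)
lam_max = (a + c + disc) / 2
lam_min = (a + c - disc) / 2
print(lam_max / lam_min)   # on the order of 2000: poorly conditioned
```

The large ratio of eigenvalues is exactly the condition number discussed in the next section.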

One solution to this problem would be to change the model. Since the second column is approximately twice the first, these two explanatory variables encode essentially the same information, so we could remove one of them from the model.
However, the source of the bad conditioning is usually not as easy to identify as it is in this example.

Another method for removing information from a model that is responsible for imprecision in the least squares solution is offered by the technique of principal component regression (PCR). Henceforth we shall assume that the data in the matrix $X$ is centered. By this we mean that the mean of each explanatory variable has been subtracted from the corresponding column of $X$, so that the explanatory variables all have mean zero. In particular, this implies that the matrix $X'X$ is proportional to the covariance matrix of the explanatory variables.
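Centering is simple to sketch in Python; the two-column data below is made up purely for illustration:

```python
# Hypothetical data matrix: 5 observations of 2 explanatory variables
X = [[1.9, 3.7], [2.1, 4.4], [2.0, 3.9], [2.0, 4.1], [1.8, 3.5]]

n = len(X)
ncols = len(X[0])

# Subtract each column's mean from that column so every
# explanatory variable in the centered matrix has mean zero
means = [sum(row[j] for row in X) / n for j in range(ncols)]
Xc = [[row[j] - means[j] for j in range(ncols)] for row in X]

print(means)  # per-column means removed from X
```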

### Removing the Source of Imprecision

Let $X$ be an $m \times n$ matrix, and recall from part 1 of this series that we can write $X^TX$ as

$X^TX=V \Lambda V^T$

where $\Lambda$ is a diagonal matrix containing the eigenvalues (in descending order down the diagonal) of $X^TX$, and $V$ is orthogonal. The condition number $\kappa (X^TX)$ for $X^TX$ is just the absolute value of the ratio of the largest and smallest eigenvalues:

$\kappa(X^TX)=\left | \frac{\lambda_{max}}{\lambda_{min}} \right |$

Thus we can see that if the smallest eigenvalue is much smaller than the largest, the condition number is very large, implying a poorly conditioned matrix. The idea then is to remove these small eigenvalues from $X^TX$, giving us an approximation to $X^TX$ that is better conditioned. To this end, suppose that we wish to retain the $r$ ($r \le n$) largest eigenvalues of $X^TX$ in our approximation, and write

$X^TX=(V_1,V_2)\begin{pmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix}$,

where:

- $\Lambda_1$ is an $r \times r$ diagonal matrix consisting of the $r$ largest eigenvalues of $X^TX$,
- $\Lambda_2$ is an $(n-r) \times (n-r)$ diagonal matrix consisting of the remaining $n-r$ eigenvalues of $X^TX$, and
- the $n \times n$ matrix $V=(V_1,V_2)$ is orthogonal, with $V_1=(v_1,\ldots,v_r)$ consisting of the first $r$ columns of $V$ and $V_2=(v_{r+1},\ldots,v_n)$ consisting of the remaining $n-r$ columns of $V$.

Using this formulation we can write an approximation $\widehat{X^TX}$ to $X^TX$ using the $r$ largest eigenvalues as

$\widehat{X^TX}=V_1 \Lambda_1 V_1^T$.

If we substitute this approximation into the normal equations (2) and simplify, we end up with the principal components estimator

$\textrm{(3) }\widehat{b^{(r)}}=V_1 \Lambda_1^{-1} V_1^T X^T y$.
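As a sanity check on equation (3), the sketch below (plain Python again, rather than the NMath/C# implementation the next article will present) applies the principal components estimator to the earlier example, using the closed-form eigen decomposition of the symmetric 2×2 matrix $X'X$. The example matrix is not centered, so this is only an arithmetic check of the formula: keeping all eigenvalues ($r = 2$) must reproduce the ordinary least squares solution, while $r = 1$ drops the troublesome small eigenvalue.

```python
import math

# Data from the earlier example (intercept column plus one predictor)
X = [[1, 1.9], [1, 2.1], [1, 2.0], [1, 2.0], [1, 1.8]]
y = [6.0521, 7.0280, 7.1230, 4.4441, 5.0813]

# X'X entries and the vector X'y for the 2-parameter model
a = sum(r[0] * r[0] for r in X)
b = sum(r[0] * r[1] for r in X)
c = sum(r[1] * r[1] for r in X)
Xty = [sum(X[i][j] * y[i] for i in range(len(y))) for j in range(2)]

# Closed-form eigen decomposition of the symmetric 2x2 matrix X'X,
# eigenvalues in descending order with orthonormal eigenvectors
disc = math.sqrt((a - c) ** 2 + 4 * b * b)
lams = [(a + c + disc) / 2, (a + c - disc) / 2]
vecs = []
for lam in lams:
    v = (b, lam - a)              # solves (X'X - lam*I) v = 0 when b != 0
    nrm = math.hypot(v[0], v[1])
    vecs.append((v[0] / nrm, v[1] / nrm))

def pcr_estimate(r):
    """Equation (3): keep the r largest eigenvalues of X'X."""
    beta = [0.0, 0.0]
    for lam, v in zip(lams[:r], vecs[:r]):
        coef = (v[0] * Xty[0] + v[1] * Xty[1]) / lam  # (v' X'y) / lam
        beta[0] += coef * v[0]
        beta[1] += coef * v[1]
    return beta

print(pcr_estimate(2))  # all components -> the least squares solution
print(pcr_estimate(1))  # smallest eigenvalue removed -> stabler estimate
```

With $r = 2$ the estimator recovers $(-4.2489, 5.2013)$ from the earlier example, confirming that truncation, not the decomposition itself, is what changes the answer.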

While we could use equation (3) directly, it is usually not the best way to perform principal components regression. The next article in this series will present an algorithm for PCR and implement it using the NMath libraries.

-Steve

## Principal Component Regression: Part 1 – The Magic of the SVD

Monday, February 8th, 2010

### Introduction

This is the first part of a multi-part series on Principal Component Regression, or PCR for short. We will eventually end up with a computational algorithm for PCR and code it up in C# using the NMath libraries. PCR is a method for constructing a linear regression model when we have a large number of predictor variables which are highly correlated. Of course, we don’t know exactly which variables are correlated, otherwise we’d just throw them out and perform a normal linear regression.

In order to understand what is going on in the PCR algorithm, we need to know a little bit about the SVD (Singular Value Decomposition). Understanding a bit about the SVD and its relationship to the eigenvalue decomposition will go a long way toward understanding the PCR algorithm.

### The Singular Value Decomposition

The SVD (Singular Value Decomposition) is one of the most revealing matrix decompositions in linear algebra. It is a bit expensive to compute, but the bounty of information it yields is awe-inspiring. Understanding a little about the SVD will illuminate the Principal Components Regression (PCR) algorithm. The SVD may seem like a deep and mysterious thing; at least I thought it was until I read the chapters covering it in the book “Numerical Linear Algebra” by Lloyd N. Trefethen and David Bau, III, which I summarize below.