Principal Component Regression: Part 1 – The Magic of the SVD.

February 8th, 2010

Introduction

This is the first part of a multi-part series on Principal Component Regression, or PCR for short. We will eventually end up with a computational algorithm for PCR and code it up using C# using the NMath libraries. PCR is a method for constructing a linear regression model in the case that we have a large number of predictor variables which are highly correlated. Of course, we don’t know exactly which variables are correlated, otherwise we’d just throw them out and perform a normal linear regression.

In order to understand what is going on in the PCR algorithm, we need to know a little bit about the SVD (Singular Value Decomposition). Understanding a bit about the SVD and it’s relationship to the eigenvalue decomposition will go a long way in understanding the PCR algorithm.

The Singular Value Decomposition

The SVD (Singular Value Decomposition)  is one of the most revealing matrix decompositions in linear algebra. A bit expensive to compute, but the bounty of information it yields is awe inspiring. Understanding a little about the SVD will illuminate the Principal Components Regression (PCR) algorithm. The SVD may seem like a deep and mysterious thing, at least I thought it was until I read the chapters covering it in the book  “Numerical Linear Algebra”, by Lloyd N. Trefethen, and David Bau, III, which I summarize below.

We begin with an easy to state, and not too difficult to prove geometric statement about linear transformations.

A Geometric Fact

Let  be the unit sphere in , and let   be any matrix mapping  into  and suppose, for the moment, that has full rank. Then the image,  of  under  is a hyperellipse in  (see the book for the proof).

SVD of a 2x2 matrix

Figure 1. SVD of a 2x2 matrix

Given this fact we make the following definitions (refer to Figure 1.):

Define the singular values ,
Read the rest of this entry »

  • Share/Bookmark

Drawing Dendrograms

February 3rd, 2010

A dendrogram is a tree diagram often used to visualize the results of hierarchical clustering. A customer recently contacted us asking for help drawing dendrograms from the output of the hierarchical clustering algorithm in NMath Stats. (For more information in hierarchical clustering in NMath Stats, see this post.)

UserZoom is an international software company specializing in web customer experience and usability testing. UserZoom software includes a tool to run online card sorting exercises with real users. Card Sorting is a technique that information architects and user experience  practitioners use to explore how “real people” group items and content. UserZoom developers were using NMath hierarchical clustering to analyze card sorting results, and wanted to generate dendrograms to help researchers understand participant’s grouping criteria.

In NMath Stats, class ClusterAnalysis performs hierarchical cluster analyses. For example, the following C# code clusters 6 random vectors of length 3:

int seed = 123;
DoubleMatrix data =
  new DoubleMatrix(6, 3, new RandGenUniform(1, 10, seed));
ClusterAnalysis ca = new ClusterAnalysis(data);

The random seed is specified so we can get repeatable results.

After construction, the Linkages property gets an (n-1) x 3 matrix containing the complete hierarchical linkage tree. At each level in the tree, columns 1 and 2 contain the indices of the clusters linked to form the next cluster. Column 3 contains the distances between the clusters.

Console.WriteLine(ca.Linkages.ToTabDelimited());

The output is:

3	4	2.11126489430827
1	5	2.50743075649221
2	6	2.59997959766694
7	8	3.61586096606758
0	9	4.37499686772813

Each object is initially assigned to its own singleton cluster, numbered 0 to 5. The analysis then proceeds iteratively, at each stage joining the two most similar clusters into a new cluster, continuing until there is one overall cluster. The first new cluster formed is assigned index 6, the second is assigned index 7, and so forth. When these indices appear later in the tree, the clusters are being combined again into a still larger cluster.
Read the rest of this entry »

  • Share/Bookmark

Forward Scaling Computing

January 28th, 2010

Forward Scaling for Multicore Performance

The era of sequential, single-threaded software development deployed to a uniprocessor machine is rapidly fading into history. Nearly all computers sold today have at least two, if not four cores – and will have eight in the near future. Intel announced last month the successful production and testing of a new 48-core research processor which will be made available to industry and academia for research and development of manycore parallel software developer tools and languages.

Intel's 48-core processor

Intel's recently announced 48-core processor

In the near future users of high performance software in finance, bio-informatics, or GIS will expect their applications to scale with core count, and software that fails to do so will either need to be rewritten or abandoned. To future-proof performance-sensitive software, code written today needs to be multicore aware, and scale automatically to all available cores – this is the key idea behind forward scaling software. If Moore’s ‘law’ is to be sustained into the future, hardware scalability must be joined with a similar shift in software. This fundamental shift in computing and application development, termed the ‘Manycore Shift’ by Microsoft, is an evolutionary shift that software developers must appreciate and adapt to in order to create long-living scalable applications.

CenterSpace’s Forward Scaling Strategy

This project of creating forward scaling software can sound daunting, but for many application developers it can be reduced to choosing the right languages & libraries for the computationally demanding portions of their application. If an application’s performance-sensitive components are forward scaling, so goes the application. At CenterSpace we are very performance sensitive and are working to insure that our users benefit from forward scaling behavior. Linear scaling with core number cannot always be achieved, but in the numerical computational domain we can frequently come close to this ideal.

Here are some parallel computing technologies that we are looking at adopting to ensure we meet this goal.
Read the rest of this entry »

  • Share/Bookmark

Power Spectral Density with NMath

January 13th, 2010

Application Note

Power Spectral Density Calculation

I’ve had several customers ask about computing the PSD in C# with NMath, so I though it was time for a post on the subject. The power spectral density provides an estimate of the power present within each slice of spectrum, and is presented as graph of the signal power versus frequency. It’s a common signal processing calculation across many fields from acoustics to chemistry, and can provide insight into periodicities contained within a time domain signal.

For stationary, square summable signals, the PSD is expressed as,



where F(w) is the Fourier transform of the time domain signal f(t), and T is the width of the time domain sampled signal. Naturally we can never sample the entire signal, so calculations of the PSD are all estimates. Techniques for estimating the PSD can be divided into two classes, parametric (model based), and nonparametic (non-model based). We will discuss only nonparametic techniques here. For discrete signals that are not square summable (i.e. non-stationary signals – and so the Fourier transform does not exist), estimates of the power spectral density can be derived from,



Which is the Fourier transform of the autocorrelation as the correlation width of the sampled signal tends to infinity. For a concise derivation of both of these formulas read these short lecture notes.

Computing the PSD using NMath

Concentrating on stationary (periodic) signals, the PSD is most efficiently computed by applying smoothing to discrete periodograms .



Where n is the number of signal samples. Each point in the periodogram represents the relative contribution to the variance in the time domain signal at that frequency. (Visualization provided courtesy of Infragistics.)

Periodogram of sun spot data.

An example periodogram of sunspot data. This is smoothed in some fashion to estimate the PSD.

Another estimation technique involves computing multiple windowed periodograms and then combining these together to get a progressively more accurate estimate (Welch’s Method, similarly MTM with Slepian windows). The time domain signal should be detrended before any of these operations.
Read the rest of this entry »

  • Share/Bookmark

Cluster Analysis, Part V: Monte Carlo NMF

January 11th, 2010

In this continuing series, we explore the NMath Stats functions for performing cluster analysis. (For previous posts, see Part 1 – PCA Part 2 – K-Means, Part 3 – Hierarchical, and Part 4 – NMF.) The sample data set we’re using classifies 89 single malt scotch whiskies on a five-point scale (0-4) for 12 flavor characteristics. To visualize the data set and clusterings, we make use of the free Microsoft Chart Controls for .NET, which provide a basic set of charts.

In this post, the last in the series, we’ll look at how NMath provides a Monte Carlo method for performing multiple non-negative matrix factorization (NMF) clusterings using different random starting conditions, and combining the results.

NMF uses an iterative algorithm with random starting values for W and H. This, coupled with the fact that the factorization is not unique, means that if you cluster the columns of V multiple times, you may get different final clusterings. The consensus matrix is a way to average multiple clusterings, to produce a probability estimate that any pair of columns will be clustered together.
To compute the consensus matrix, the columns of V are clustered using NMF n times. Each clustering yields a connectivity matrix. Recall that the connectivity matrix is a symmetric matrix whose i, jth entry is 1 if columns i and j of V are clustered together, and 0 if they are not. The consensus matrix is also a symmetric matrix, whose i, jth entry is formed by taking the average of the i, jth entries of the n connectivity matrices.
Thus, each i, jth entry of the consensus matrix is a value between 0, when columns i and j are not clustered together on any of the runs, and 1, when columns i and j were clustered together on all runs. The i, jth entry of a consensus matrix may be considered, in some sense, a “probability” that columns i and j belong to the same cluster.

NMF uses an iterative algorithm with random starting values for W and H. (See Part IV for more information on NMF.) This, coupled with the fact that the factorization is not unique, means that if you cluster the columns of V multiple times, you may get different final clusterings. The consensus matrix is a way to average multiple clusterings, to produce a probability estimate that any pair of columns will be clustered together.
Read the rest of this entry »

  • Share/Bookmark