9.1 Principal Component Analysis
Principal component analysis finds a smaller set of synthetic variables that capture the variance in an original data set. The first principal component accounts for as much of the variability in the data as possible, and each succeeding orthogonal component accounts for as much of the remaining variability as possible. In NMath Stats, class PrincipalComponentAnalysis performs principal component analyses.
Creating Principal Component Analyses
A PrincipalComponentAnalysis instance is constructed from a matrix or a dataframe containing numeric data. Each column represents a variable, and each row represents an observation:
PrincipalComponentAnalysis pca =
  new PrincipalComponentAnalysis( data );
The data may optionally be zero-centered and scaled to have unit variance:
bool center = true;
bool scale = true;
PrincipalComponentAnalysis pca =
  new PrincipalComponentAnalysis( data, center, scale );
By default, variables are centered but not scaled.
After construction, you can retrieve information about the data set using the provided read-only properties:
- Data gets the data matrix. If centering or scaling was specified at construction time, the returned matrix reflects that preprocessing and may not match the original data.
- NumberOfObservations gets the number of observations in the data matrix.
- NumberOfVariables gets the number of variables in the data matrix.
- IsCentered returns true if the data supplied at construction time was shifted to be zero-centered.
- IsScaled returns true if the data supplied at construction time was scaled to have unit variance.
- Means gets the column means of the data matrix. If centering is specified, the column means are subtracted from the column values before analysis takes place.
- Norms gets the column norms (1-norm). If scaling is specified, column values are scaled to have unit variance before analysis by dividing by the column norm.
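The effect of the center and scale options on a single column can be sketched in plain C#, without NMath. The class and method names here are hypothetical. Using the sample standard deviation (n - 1 denominator) as the scale factor is an assumption; the factor reported by Norms may be defined differently.

```csharp
using System;
using System.Linq;

public static class CenterScaleSketch
{
    // Returns a copy of one column, optionally shifted to zero mean and
    // optionally divided by its sample standard deviation (assumed factor).
    public static double[] Standardize( double[] column, bool center, bool scale )
    {
        double mean = column.Average();
        double[] result = column.Select( x => center ? x - mean : x ).ToArray();
        if ( scale )
        {
            double sumSquares = column.Sum( x => ( x - mean ) * ( x - mean ) );
            double sd = Math.Sqrt( sumSquares / ( column.Length - 1 ) );
            result = result.Select( x => x / sd ).ToArray();
        }
        return result;
    }

    public static void Main()
    {
        double[] column = { 2.0, 4.0, 6.0 };  // hypothetical values
        double[] z = Standardize( column, true, true );
        Console.WriteLine( string.Join( ", ", z ) );  // prints -1, 0, 1
    }
}
```

With both options enabled, every column ends up with zero mean and unit variance, so variables measured on large scales do not dominate the leading components.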
Principal Component Analysis Results
The Loadings property gets the complete loading matrix. Each column in the loading matrix is a principal component. The columns are ordered so that the first accounts for as much of the variability in the data as possible, and each succeeding orthogonal component accounts for as much of the remaining variability as possible.
Console.WriteLine( "Loading Matrix = " + pca.Loadings );
The provided indexer also gets a specified principal component, referenced by zero-based index. For example:
Console.WriteLine( "First principal component = " + pca[0] );
Console.WriteLine( "Second principal component = " + pca[1] );
The VarianceProportions property gets an ordered vector containing the proportion of the total variance accounted for by each principal component. CumulativeVarianceProportions gets the cumulative variance proportions. Thus:
Console.WriteLine( "Variance Proportions = " +
  pca.VarianceProportions );
Console.WriteLine( "Cumulative Variance Proportions = " +
  pca.CumulativeVarianceProportions );
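To make the relationship between these two vectors concrete, the following standalone sketch derives both from a set of component standard deviations. It does not call NMath. The class and method names and the input values are hypothetical, and it assumes each proportion is the component's variance divided by the total variance.

```csharp
using System;
using System.Linq;

public static class VarianceProportionSketch
{
    // Proportion of total variance per component: s_i^2 / sum( s_j^2 ).
    public static double[] Proportions( double[] standardDeviations )
    {
        double[] variances = standardDeviations.Select( s => s * s ).ToArray();
        double total = variances.Sum();
        return variances.Select( v => v / total ).ToArray();
    }

    // Running sum of the proportions; the last element is 1 up to rounding.
    public static double[] Cumulative( double[] proportions )
    {
        double[] result = new double[proportions.Length];
        double running = 0.0;
        for ( int i = 0; i < proportions.Length; i++ )
        {
            running += proportions[i];
            result[i] = running;
        }
        return result;
    }

    public static void Main()
    {
        double[] sd = { 3.0, 2.0, 1.0 };  // hypothetical standard deviations
        double[] p = Proportions( sd );   // 9/14, 4/14, 1/14
        double[] c = Cumulative( p );     // 9/14, 13/14, 14/14
        Console.WriteLine( string.Join( " ", p ) );
        Console.WriteLine( string.Join( " ", c ) );
    }
}
```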
The Threshold() method calculates the number of principal components required to account for a given proportion of the total variance:
Console.WriteLine( "PCs that account for 99% of the variance = " +
  pca.Threshold( .99 ) );
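One way to picture this calculation is as a scan over the cumulative variance proportions. The sketch below is standalone and is not the library's implementation. The name ComponentsNeeded is hypothetical, and treating a cumulative proportion equal to the threshold as sufficient is an assumption.

```csharp
using System;

public static class ThresholdSketch
{
    // Returns the smallest number of leading components whose cumulative
    // variance proportion reaches the requested threshold.
    public static int ComponentsNeeded( double[] cumulative, double threshold )
    {
        for ( int i = 0; i < cumulative.Length; i++ )
        {
            if ( cumulative[i] >= threshold )
            {
                return i + 1;  // convert zero-based index to a count
            }
        }
        return cumulative.Length;  // threshold not reached: keep everything
    }

    public static void Main()
    {
        // Hypothetical cumulative proportions for four components.
        double[] cumulative = { 0.62, 0.91, 0.995, 1.0 };
        Console.WriteLine( ComponentsNeeded( cumulative, 0.99 ) );  // prints 3
    }
}
```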
The StandardDeviations property gets the standard deviations of the principal components. Eigenvalues gets the eigenvalues of the covariance/correlation matrix, though the calculation is actually performed using the singular values of the data matrix. The eigenvalues of the covariance/correlation matrix are equal to the squares of the standard deviations of the principal components.
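The link between singular values, eigenvalues, and standard deviations can be written out as follows. This is a sketch that assumes the data matrix X has n rows and has already been centered (and scaled, if requested), and that variances use an n - 1 denominator. The symbols U, Σ, V, σ_i, λ_i, and s_i are introduced here for illustration.

```latex
X = U \Sigma V^{\mathsf T}, \qquad
C = \frac{1}{n-1} X^{\mathsf T} X
  = V \, \frac{\Sigma^{2}}{n-1} \, V^{\mathsf T}, \qquad
\lambda_i = \frac{\sigma_i^{2}}{n-1} = s_i^{2}
```

Here the columns of V are the principal components, σ_i is the i-th singular value of X, λ_i is the i-th eigenvalue of C, and s_i is the standard deviation of the i-th principal component.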
Lastly, the Scores property gets the score matrix. The scores are the data formed by transforming the original data into the space of the principal components:
Console.WriteLine( "Scores = " + pca.Scores );
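As an illustration of this transformation, the scores can be thought of as the preprocessed data matrix multiplied by the loading matrix. The standalone sketch below uses plain arrays rather than NMath types. The class and method names and all values are hypothetical, and the data are assumed to be centered already.

```csharp
using System;

public static class ScoresSketch
{
    // Multiplies an n x p data matrix by a p x k loading matrix,
    // giving the n x k matrix of scores.
    public static double[,] Project( double[,] data, double[,] loadings )
    {
        int n = data.GetLength( 0 );
        int p = data.GetLength( 1 );
        int k = loadings.GetLength( 1 );
        double[,] scores = new double[n, k];
        for ( int i = 0; i < n; i++ )
            for ( int j = 0; j < k; j++ )
                for ( int v = 0; v < p; v++ )
                    scores[i, j] += data[i, v] * loadings[v, j];
        return scores;
    }

    public static void Main()
    {
        // Two centered observations that lie along the direction (1, 1).
        double[,] data = { { 1.0, 1.0 }, { -1.0, -1.0 } };
        double r = 1.0 / Math.Sqrt( 2.0 );
        // Each column is a unit-length component; the two are orthogonal.
        double[,] loadings = { { r, r }, { r, -r } };
        double[,] scores = Project( data, loadings );
        // All of the variation falls on the first component, so the
        // second-component score of each observation is zero.
        Console.WriteLine( scores[0, 0] + " " + scores[0, 1] );
    }
}
```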
This code displays the scores for the smallest number of principal components required to account for 99% of the variance:
Slice rowSlice = Slice.All;
Slice colSlice = new Slice( 0, pca.Threshold( .99 ) );
Console.WriteLine( pca.Scores[ rowSlice, colSlice ] );