NMath Stats User's Guide

TOC | Previous | Next | Index

11.2 Data Clustering Using NMF (.NET, C#, CSharp, VB, Visual Basic, F#)

NMath Stats provides class NMFClustering for performing data clustering using iterative nonnegative matrix factorization (NMF), where each iteration step produces a new W and H. At each iteration, each column v of V is placed into a cluster corresponding to the column w of W which has the largest coefficient in H. That is, column v of V is placed in cluster i if the entry hij in H is the largest entry in column hj of H. Results are returned as an adjacency matrix whose i, jth value is 1 if columns i and j of V are in the same cluster, and 0 if they are not.

Iteration stops when the clustering of the columns of the matrix V stabilizes. There are three parameters that control iteration:

the maximum number of iterations to perform

the stopping adjacency, which is the number of consecutive times the adjacency matrix remains unchanged before it is considered stabilized

the convergence check period. Computing the adjacency matrix can be a somewhat expensive operation, so you may want to perform this operation only every nth iteration.

For example, running a NMFClustering instance with maximum iterations 2000, stopping adjacency 40, and convergence check period 10, computes a new adjacency matrix every 10 iterations, and checks it against the previous adjacency matrix. If they are the same, a count is incremented. The iteration stops when 40 consecutive unchanged adjacency matrices are recorded, or the maximum 2000 iterations are reached.

Creating NMFClustering Instances

Class NMFClustering is parameterized on the NMF update algorithm to use (Section 11.1). For instance:

Code Example – C# nonnegative matrix factorization (NMF)

var nmfClustering = new NMFClustering<NMFDivergenceUpdate>();

The update algorithm can be changed post-construction using the Updater property.

Code Example – C# nonnegative matrix factorization (NMF)

nmfClustering.Updater = new NMFGdClsUpdate();

The maximum iterations, stopping adjacency, and convergence check period can be specified either as constructor parameters, or post-construction using the MaxFactorizationIterations, StoppingAdjacency, and ConvergenceCheckPeriod properties, respectively. The default maximum number of iterations is 2000, the default stopping adjacency is 40, and the default convergence check period is 10.

Performing the Factorization

The Factor() method performs the actual iterative factorization:

Code Example – C# nonnegative matrix factorization (NMF)

DoubleMatrix data;   // data to be factored
int k;               // number of columns in W
nmfClustering.Factor( data, k );

NMFClustering objects can factor data contained in either a DoubleMatrix or a DataFrame object.

Cluster Results

After clustering, the Converged property checks if the iterative factorization converged before hitting the default maximum number of iterations. Iterations gets the total number of iterations performed in the most recent calculation. For example:

Code Example – C# nonnegative matrix factorization (NMF)

if ( nmfClustering.Converged ) {
  Console.WriteLine( "Factorization converged in {0} iterations.", 
    nmfClustering.Iterations );
}
else {
  Console.WriteLine( 
    "Factorization failed to converge in {0} iterations.", 
    nmfClustering.MaxFactorizationIterations );
}

If clustering converged, the final factors W and H are accessed through properties W and H:

Code Example – C# nonnegative matrix factorization (NMF)

Console.WriteLine( "W = " + nmfClustering.W );
Console.WriteLine( "H = " + nmfClustering.H );

The Connectivity property returns the final adjacency matrix as an instance of ConnectivityMatrix. The connectivity matrix is an adjacency matrix, A, such that columns of the factored matrix are in the same cluster if A[i,j] == 1, and are in different clusters if A[i,j] == 0. For instance:

Code Example – C# nonnegative matrix factorization (NMF)

ConnectivityMatrix connectivity = nmfClustering.Connectivity;  
Console.WriteLine( "Connectivity Matrix: " );
Console.WriteLine( connectivity.ToTabDelimited() );

The ClusterSet property returns a ClusterSet (Section 10.3) describing the final clusters:

Code Example – C# nonnegative matrix factorization (NMF)

ClusterSet cs = nmfClustering.ClusterSet;

// Print out the cluster each column belongs to
for ( int i = 0; i < cs.N; i++ ) {
  Console.WriteLine( "Column {0} belongs to cluster {1}",
    i, cs[i] );
}

// Print out the the members of each cluster
for ( int i = 0; i < cs.NumberOfClusters; i++ ) {
  int[] members = cs.Cluster( i );
  Console.Write( "Cluster number {0} contains: ", i );
  for ( int j = 0; j < members.Length; j++ ) {
    Console.Write( "{0} ", j );
  }
  Console.WriteLine();
}

Lastly, the Cost property gets the value of the cost function for the factorization.

Code Example – C# nonnegative matrix factorization (NMF)

double cost = nmfClustering.Cost;

The cost function is the function that is minimized by the NMF update algorithm.

Computing a Consensus Matrix

NMF uses an iterative algorithm with random starting values for W and H. This, coupled with the fact that the factorization is not unique, means that if you cluster the columns of V multiple times, you may get different final clusterings. The consensus matrix is a way to average multiple clusterings, to produce a probability estimate that any pair of columns will be clustered together.

To compute the consensus matrix, the columns of V are clustered using NMF n times. Each clustering yields a connectivity matrix. Recall that the connectivity matrix is a symmetric matrix whose i, jth entry is 1 if columns i and j of V are clustered together, and 0 if they are not. The consensus matrix is also a symmetric matrix, whose i, jth entry is formed by taking the average of the i, jth entries of the n connectivity matrices.

Thus, each i, jth entry of the consensus matrix is a value between 0, when columns i and j are not clustered together on any of the runs, and 1, when columns i and j were clustered together on all runs. The i, jth entry of a consensus matrix may be considered, in some sense, a "probability" that columns i and j belong to the same cluster.

NMath Stats provides class NMFConsensusMatrix for compute a consensus matrix. NMFConsensusMatrix is parameterized on the NMF update algorithm to use (Section 11.1). Additional constructor parameters specify the matrix to factor, the order k of the NMF factorization (the number of columns in W), and the number of clustering runs. For example:

Code Example – C# nonnegative matrix factorization (NMF)

DoubleMatrix data;   // data to be factored
int k;               // number of columns in W
int numberOfRuns = 70;

NMFConsensusMatrix<NMFDivergenceUpdate> consensusMatrix = 
  new NMFConsensusMatrix<NMFDivergenceUpdate>(data, k, 
    numberOfRuns);

The consensus matrix is computed at construction time, so be aware that this may be an expensive operation. Post-construction, the NumberOfConvergedRuns property gets the number of clustering runs where the NMF computation converged:

Code Example – C# nonnegative matrix factorization (NMF)

Console.WriteLine( "{0} runs out of {1} converged.", 
  consensusMatrix.NumberOfConvergedRuns, numberOfRuns );

NMFConsensusMatrix provides a standard indexer for getting the element value at a specified row and column in the consensus matrix. For example, this code gets the probability that columns 2 and 7 will be clustered together:

Code Example – C# nonnegative matrix factorization (NMF)

double p = consensusMatrix[2, 7];

This code prints the entire consensus matrix:

Code Example – C# nonnegative matrix factorization (NMF)

Console.WriteLine( "Consensus Matrix:" );
Console.WriteLine( consensusMatrix.ToTabDelimited() );

A consensus matrix, C, can also used to perform a hierarhical clustering of the columns of V (Section 10.3), using the distance function:

 

A ClusterAnalysis instance is constructed from a matrix containing numeric data. Each row in the data set represents an object to be clustered. In this case, you're simply clustering the column numbers of V, so construct a matrix with one colunm containing the numbers 0 to n-1, where n is the number of columns of V (and the order of of the consensus matrix):

Code Example – C# nonnegative matrix factorization (NMF)

DoubleMatrix colNumbers =
  new DoubleMatrix( consensusMatrix.Order, 1, 0, 1 );

Distance.Function distance =
  delegate( DoubleVector data1, DoubleVector data2 ) {
    int i = (int)data1[0];
    int j = (int)data2[0];
    return 1.0 - consensusMatrix[i, j];
  };

ClusterAnalysis ca =
  new ClusterAnalysis( colNumbers, distance );

After you've created a ClusterAnalysis object, the CutTree() method constructs a set of clusters by cutting the hierarchical linkage tree either at the specified height, or into the specified number of clusters. For example, this code cuts the linkage tree to form three clusters:

Code Example – C# nonnegative matrix factorization (NMF)

ClusterSet clusters = ca.CutTree( 3 );

for ( int i = 0; i < clusters.NumberOfClusters; i++ ) {
  int[] members = clusters.Cluster( i );
  Console.Write( "Cluster {0} contains: ", i );
  for ( int j = 0; j < members.Length; j++ ) {
    Console.Write( "{0} ", members[j] );
  }
  Console.WriteLine();
}


Top

Top