**11.2**** ****Data Clustering Using
NMF** (.NET, C#, CSharp, VB, Visual Basic, F#)

**NMath Stats**
provides class **NMFClustering**
for performing data clustering using iterative
nonnegative matrix
factorization (NMF), where each iteration step produces a new *W* and *H*.
At each iteration, each column v of *V*
is placed into a cluster corresponding to the column *w*
of *W* which has the largest coefficient
in *H*. That is, column v of *V* is placed in cluster *i*
if the entry *h** _{ij}*
in

*H*is the largest entry in column

*h*

*of*

_{j}*H*. Results are returned as an

*adjacency matrix*whose

*i*,

*j*th value is 1 if columns

*i*and

*j*of

*V*are in the same cluster, and 0 if they are not.

Iteration stops when the clustering of the columns
of the matrix *V* stabilizes. There
are three parameters that control iteration:

● the maximum number of iterations to perform

● the stopping adjacency, which is the number of consecutive times the adjacency matrix remains unchanged before it is considered stabilized

● the convergence check period.
Computing the adjacency matrix can be a somewhat expensive operation,
so you may want to perform this operation only every *n*th
iteration.

For example, running a **NMFClustering**
instance with maximum iterations 2000, stopping
adjacency 40, and convergence check period
10, computes a new adjacency matrix every
10 iterations, and checks it against the
previous adjacency matrix. If they are the same, a count is incremented.
The iteration stops when 40 consecutive
unchanged adjacency matrices are recorded, or the maximum 2000 iterations are reached.

**Creating NMFClustering
Instances**

Class **NMFClustering**
is parameterized on the NMF update algorithm to use (Section 11.1). For instance:

Code Example – C# nonnegative matrix factorization (NMF)

var nmfClustering = new NMFClustering<NMFDivergenceUpdate>();

The update algorithm can be changed post-construction using the Updater property.

Code Example – C# nonnegative matrix factorization (NMF)

nmfClustering.Updater = new NMFGdClsUpdate();

The maximum iterations, stopping adjacency, and convergence check period can be specified either as constructor parameters, or post-construction using the MaxFactorizationIterations, StoppingAdjacency, and ConvergenceCheckPeriod properties, respectively. The default maximum number of iterations is 2000, the default stopping adjacency is 40, and the default convergence check period is 10.

The Factor() method performs the actual iterative factorization:

Code Example – C# nonnegative matrix factorization (NMF)

DoubleMatrix data; // data to be factored int k; // number of columns in W nmfClustering.Factor( data, k );

**NMFClustering**
objects can factor data contained in either a **DoubleMatrix**
or a **DataFrame** object.

After clustering, the Converged property checks if the iterative factorization converged before hitting the default maximum number of iterations. Iterations gets the total number of iterations performed in the most recent calculation. For example:

Code Example – C# nonnegative matrix factorization (NMF)

if ( nmfClustering.Converged ) { Console.WriteLine( "Factorization converged in {0} iterations.", nmfClustering.Iterations ); } else { Console.WriteLine( "Factorization failed to converge in {0} iterations.", nmfClustering.MaxFactorizationIterations ); }

If clustering converged, the final factors *W* and *H*
are accessed through properties W and H:

Code Example – C# nonnegative matrix factorization (NMF)

Console.WriteLine( "W = " + nmfClustering.W ); Console.WriteLine( "H = " + nmfClustering.H );

The Connectivity property returns the final adjacency matrix as
an instance of **ConnectivityMatrix**.
The connectivity matrix is an adjacency matrix, *A*,
such that columns of the factored matrix are in the same cluster if A[i,j] == 1, and are in different clusters if
A[i,j] == 0. For instance:

Code Example – C# nonnegative matrix factorization (NMF)

ConnectivityMatrix connectivity = nmfClustering.Connectivity; Console.WriteLine( "Connectivity Matrix: " ); Console.WriteLine( connectivity.ToTabDelimited() );

The ClusterSet property
returns a **ClusterSet** (Section 10.3) describing the final clusters:

Code Example – C# nonnegative matrix factorization (NMF)

ClusterSet cs = nmfClustering.ClusterSet; // Print out the cluster each column belongs to for ( int i = 0; i < cs.N; i++ ) { Console.WriteLine( "Column {0} belongs to cluster {1}", i, cs[i] ); } // Print out the the members of each cluster for ( int i = 0; i < cs.NumberOfClusters; i++ ) { int[] members = cs.Cluster( i ); Console.Write( "Cluster number {0} contains: ", i ); for ( int j = 0; j < members.Length; j++ ) { Console.Write( "{0} ", j ); } Console.WriteLine(); }

Lastly, the Cost property gets the value of the cost function for the factorization.

Code Example – C# nonnegative matrix factorization (NMF)

double cost = nmfClustering.Cost;

The cost function is the function that is minimized by the NMF update algorithm.

NMF uses an iterative algorithm with random starting
values for *W* and *H*.
This, coupled with the fact that the factorization is not unique, means
that if you cluster the columns of V multiple times, you may get different
final clusterings. The *consensus matrix*
is a way to average multiple clusterings, to produce a probability estimate
that any pair of columns will be clustered together.

To compute the consensus matrix, the columns of *V* are clustered using NMF *n*
times. Each clustering yields a connectivity matrix. Recall that the
connectivity matrix is a symmetric matrix whose *i*,
*j*th entry is 1
if columns *i* and *j*
of *V* are clustered together, and 0 if they are not. The consensus matrix is also
a symmetric matrix, whose *i*, *j*th entry is formed by taking the average
of the *i*, *j*th
entries of the *n* connectivity matrices.

Thus, each *i*, *j*th entry of the consensus matrix is a value
between 0, when columns *i*
and *j* are not clustered together on
any of the runs, and 1, when columns i and
j were clustered together on all runs. The *i*,
*j*th entry of a consensus matrix may
be considered, in some sense, a "probability" that columns *i* and *j*
belong to the same cluster.

**NMath Stats**
provides class **NMFConsensusMatrix**
for compute a consensus matrix. **NMFConsensusMatrix**
is parameterized on the NMF update algorithm to use (Section 11.1). Additional constructor parameters
specify the matrix to factor, the order *k*
of the NMF factorization (the number of columns in *W*),
and the number of clustering runs. For example:

Code Example – C# nonnegative matrix factorization (NMF)

DoubleMatrix data; // data to be factored int k; // number of columns in W int numberOfRuns = 70; NMFConsensusMatrix<NMFDivergenceUpdate> consensusMatrix = new NMFConsensusMatrix<NMFDivergenceUpdate>(data, k, numberOfRuns);

The consensus matrix is computed at construction time, so be aware that this may be an expensive operation. Post-construction, the NumberOfConvergedRuns property gets the number of clustering runs where the NMF computation converged:

Code Example – C# nonnegative matrix factorization (NMF)

Console.WriteLine( "{0} runs out of {1} converged.", consensusMatrix.NumberOfConvergedRuns, numberOfRuns );

**NMFConsensusMatrix**
provides a standard indexer for getting the element value at a specified
row and column in the consensus matrix. For example, this code gets the
probability that columns 2 and 7 will be clustered together:

Code Example – C# nonnegative matrix factorization (NMF)

double p = consensusMatrix[2, 7];

This code prints the entire consensus matrix:

Code Example – C# nonnegative matrix factorization (NMF)

Console.WriteLine( "Consensus Matrix:" ); Console.WriteLine( consensusMatrix.ToTabDelimited() );

A consensus matrix, *C*,
can also used to perform a hierarhical clustering of the columns of *V* (Section 10.3), using the distance function:

A **ClusterAnalysis**
instance is constructed from a matrix containing numeric data. Each row
in the data set represents an object to be clustered. In this case, you're
simply clustering the column numbers of *V*,
so construct a matrix with one colunm containing the numbers 0 to n-1, where *n* is the number of columns of *V* (and the order of of the consensus matrix):

Code Example – C# nonnegative matrix factorization (NMF)

DoubleMatrix colNumbers = new DoubleMatrix( consensusMatrix.Order, 1, 0, 1 ); Distance.Function distance = delegate( DoubleVector data1, DoubleVector data2 ) { int i = (int)data1[0]; int j = (int)data2[0]; return 1.0 - consensusMatrix[i, j]; }; ClusterAnalysis ca = new ClusterAnalysis( colNumbers, distance );

After you've created a **ClusterAnalysis**
object, the CutTree() method constructs
a set of clusters by cutting the hierarchical linkage tree either at
the specified height, or into the specified number of clusters. For example,
this code cuts the linkage tree to form three clusters:

Code Example – C# nonnegative matrix factorization (NMF)

ClusterSet clusters = ca.CutTree( 3 ); for ( int i = 0; i < clusters.NumberOfClusters; i++ ) { int[] members = clusters.Cluster( i ); Console.Write( "Cluster {0} contains: ", i ); for ( int j = 0; j < members.Length; j++ ) { Console.Write( "{0} ", members[j] ); } Console.WriteLine(); }