A couple of years ago, we were asked by a customer to provide an implementation of an algorithm called Non-negative Matrix Factorization (NMF). We did a basic implementation, which we later included in our NMath Stats library. I kind of forgot about it until we recently heard from a prospective NMath customer who wanted to use NMF for grouping, or clustering. Talking with this customer rekindled my interest in NMF and we decided to provide additional functionality built on the existing NMF code to facilitate using the NMF for clustering.

This entry will proceed in three parts. The first will give a brief introduction to NMF and its uses, the second will briefly cover how to compute the factorization, and the third will cover how NMF can be used for clustering.

**The Non-negative Matrix Factorization**

Given a non-negative *m*-row by *n*-column matrix A, a non-negative matrix factorization of A is a non-negative *n*-row by *k*-column matrix W and a non-negative *k*-row by *m*-column matrix H whose product approximates the matrix A.

A ~ WH

The non-negativity of the elements of W and H are crucial, and are what make this problem a bit different. The entries of A usually represent some quantity for which negative numbers make no sense. For instance, the numbers, *a*_{ij,} in A might be counts of the *i*^{th} term in the *j*^{th} document, or the *i*^{th} pixel value in the *j*^{th} image.

So, why is this useful? Of course, it depends on the particular application, but the basic idea is *dimension reduction*. In general, NMF is used only when the matrix A is large. In the image pixel value example, where each column of the matrix A contains the pixel values of a particular image, the number of rows will be quite large, as may be the number of columns. When we do an NMF of A and make *k* much smaller than the number of rows or columns in A, the factorization yields a representation of each column of A as a linear combination of the *k* columns of W, with the coefficients coming from H.

For example, suppose I have 300 facial images (pictures of people’s faces). Each image is encoded as 50,000 pixel values. I arrange these into a 50,000 x 300 matrix A. 50,000 is a fairly large number, and if I am looking at each column of A as a vector, it’s a vector with 50,000 coordinates. Let’s do a NMF on A with *k* = 7. Now, each image (column in A) can be approximated by a linear combination of these 7 *basis images*. If the approximation is good, these 7 basis images, which are the columns of W, must represent a good chunk of the information in the original 300 images, and we have reduced the dimension of the space we are working in from 50,000 down to 7. Indeed, in this particular application it was found that the columns of W represented facial characteristics