NMath Stats User's Guide

TOC |  Previous |  Next |  Index

10.3 K-Means Clustering (.NET, C#, CSharp, Visual Basic, VB.NET)

The k-means clustering method assigns data points into k groups such that the sum of squares from points to the computed cluster centers is minimized. In NMath Stats, class KMeansClustering performs k-means clustering.

The algorithm used is that of Hartigan and Wong (A K-means clustering algorithm. Applied Statistics 28, 100-108. 1979):

  1. For each point, move it to another cluster if that would lower the sum of squares from points to the computed cluster centers.
  2. If a point is moved, immediately update the cluster centers of the two affected clusters.
  3. Repeat until no points are moved, or the specified maximum number of iterations is reached.

Creating KMeansClustering Objects

A KMeansClustering instance is constructed from a matrix or a dataframe containing numeric data. Each row in the data set represents an object to be clustered.

KMeansClustering km = new KMeansClustering( data );

After construction, you can retrieve information about the KMeansClustering data using the provided properties:

Stopping Criteria

Iteration stops when either clustering stabilizes, or the maximum number of iterations is reached. You can specify the maximum number of iterations in several ways:

Clustering

The Cluster() method clusters the data into the specified number of clusters. The method accepts either k, the number of clusters, or a matrix of initial cluster centers:

For example, this code clusters eight random vectors of length three into two clusters, using random starting cluster centers:

DoubleMatrix data = new DoubleMatrix( 8, 3, new RandGenUniform() );
KMeansClustering cl = new KMeansClustering( data );
ClusterSet clusters = cl.Cluster( 2 );

This code specifies the two starting centers:

DoubleMatrix centers = new DoubleMatrix("2x3 [ 0 0 0  1 1 1 ]");
ClusterSet clusters = cl.Cluster( centers );

Cluster Analysis Results

The Cluster() method returns a ClusterSet object, which represents a collection of objects assigned to a finite number of clusters. Properties on the KMeansClustering instance give additional information about the clustering just performed:

For instance, this code clusters 30 random vectors of length three into three clusters, and prints out the results:

DoubleMatrix data = new DoubleMatrix(30, 3, new RandGenUniform());
KMeansClustering km = new KMeansClustering(data);
km.Cluster(3);

Console.WriteLine( "k = {0}", km.K );
Console.WriteLine( "Initial cluster centers:" );
Console.WriteLine( km.InitialCenters.ToTabDelimited() );
Console.WriteLine( "{0} iterations", km.Iterations );
Console.WriteLine("Stopped because max iterations of {0} met? {1}",
  km.MaxIterations, km.MaxIterationsMet);
Console.WriteLine( "Final cluster centers:" );
Console.WriteLine( km.FinalCenters.ToTabDelimited() );
Console.WriteLine( "Clustering assignments:" );
Console.WriteLine( km.Clusters );
for (int i = 0; i < km.K; i++) {
  Console.WriteLine( "Cluster {0} has {1} items", i, km.Sizes[i] );
}

TOC |  Previous |  Next |  Index