NMath Stats User's Guide

TOC | Previous | Next | Index

10.4 K-Means Clustering (.NET, C#, CSharp, VB, Visual Basic, F#)

The k-means clustering method assigns data points into k groups such that the sum of squares from points to the computed cluster centers is minimized. In NMath Stats, class KMeansClustering performs k-means clustering.

The algorithm used is that of Hartigan and Wong (A K-means clustering algorithm. Applied Statistics 28, 100–108. 1979):

1. For each point, move it to another cluster if that would lower the sum of squares from points to the computed cluster centers.

2. If a point is moved, immediately update the cluster centers of the two affected clusters.

3. Repeat until no points are moved, or the specified maximum number of iterations is reached.

Creating KMeansClustering Objects

A KMeansClustering instance is constructed from a matrix or a dataframe containing numeric data. Each row in the data set represents an object to be clustered.

Code Example – C# k-means clustering

var km = new KMeansClustering( data );

After construction, you can retrieve information about the KMeansClustering data using the provided properties:

N gets the total number of objects being clustered.

Data gets and set the data matrix

Stopping Criteria

Iteration stops when either clustering stabilizes, or the maximum number of iterations is reached. You can specify the maximum number of iterations in several ways:

The static DefaultMaxIterations property gets and sets the default maximum number of iterations for instances of KMeansClustering. (Initially set to 1000.)

You can specify a non-default maximum in the KMeansClustering constructor. For instance:

   var km = new KMeansClustering( data, 100 );

The MaxIterations property gets and sets the maximum number of iterations on an existing KMeansClustering instance.

Clustering

The Cluster() method clusters the data into the specified number of clusters. The method accepts either k, the number of clusters, or a matrix of initial cluster centers:

If k is given, a set of distinct rows in the data matrix are chosen as the initial centers using the algorithm specified by a KMeanClustering.Start enumerated value. By default, rows are chosen at random.

If a matrix of initial cluster centers is given, k is inferred from the number of rows.

For example, this code clusters eight random vectors of length three into two clusters, using random starting cluster centers:

Code Example – C# k-means clustering

var data = new DoubleMatrix( 8, 3, new RandGenUniform() );
var cl = new KMeansClustering( data );
ClusterSet clusters = cl.Cluster( 2 );

This code specifies the two starting centers:

Code Example – C# k-means clustering

var centers = new DoubleMatrix("2x3 [ 0 0 0  1 1 1 ]");
ClusterSet clusters = cl.Cluster( centers );

Cluster Analysis Results

The Cluster() method returns a ClusterSet object, which represents a collection of objects assigned to a finite number of clusters. Properties on the KMeansClustering instance give additional information about the clustering just performed:

K gets the number of clusters.

InitialCenters gets the matrix of initial cluster centers.

FinalCenters gets the matrix of final cluster centers.

Clusters gets the cluster assignments.

WithinSumOfSquares gets the within-cluster sum of squares for each cluster.

Sizes gets the number of points in each cluster.

Iterations gets the number of iterations performed.

MaxIterationsMet returns true if the clustering stopped because the maximum number of iterations was reached; otherwise, false.

For instance, this code clusters 30 random vectors of length three into three clusters, and prints out the results:

Code Example – C# k-means clustering

var data = new DoubleMatrix(30, 3, new RandGenUniform());
var km = new KMeansClustering(data);
km.Cluster(3);

Console.WriteLine( "k = {0}", km.K );
Console.WriteLine( "Initial cluster centers:" );
Console.WriteLine( km.InitialCenters.ToTabDelimited() );
Console.WriteLine( "{0} iterations", km.Iterations );
Console.WriteLine("Stopped because max iterations of {0} met? {1}",
  km.MaxIterations, km.MaxIterationsMet);
Console.WriteLine( "Final cluster centers:" );
Console.WriteLine( km.FinalCenters.ToTabDelimited() );
Console.WriteLine( "Clustering assignments:" );
Console.WriteLine( km.Clusters );
for (int i = 0; i < km.K; i++) {
  Console.WriteLine( "Cluster {0} has {1} items", i, km.Sizes[i] );
}


Top

Top