NMath Stats User's Guide

TOC | Previous | Next | Index

10.3 Hierarchical Cluster Analysis (.NET, C#, CSharp, VB, Visual Basic, F#)

Cluster analysis detects natural groupings in data. In hierarchical cluster analysis, each object is initially assigned to its own singleton cluster. The analysis then proceeds iteratively, at each stage joining the two most similar clusters into a new cluster, continuing until there is one overall cluster. In NMath Stats, class ClusterAnalysis performs hierarchical cluster analyses.

Distance Functions

During clustering, the distance between individual objects is computed using a distance function. The distance function is encapsulated in a Distance.Function delegate, which takes two vectors and returns a measure of the distance (similarity) between them:

Code Example – C# hierarchical cluster analysis

public delegate double Function( DoubleVector data1,
                                 DoubleVector data2 );

Delegates are provided as static variables on class Distance for many common distance functions:

Distance.EuclideanFunction computes the Euclidean distance between two data vectors (2 norm):

 

Euclidean distance is simply the geometric distance in the multidimen­sional space.

Distance.SquaredEuclideanFunction computes the squared Euclidean distance between two vectors:

 

Squaring the simple Euclidean distance places progressively greater weight on objects that are further apart.

Distance.CityBlockFunction computes the city-block (Manhattan) distance between two vectors (1 norm):

 

In most cases, the city-block distance measure yields results similar to the simple Euclidean distance. Note, however, that the effect of outliers is dampened, since they are not squared.

Distance.MaximumFunction computes the maximum (Chebychev) distance between two vectors:

 

This distance measure may be appropriate in cases when you want to define two objects as different if they differ on any one of the dimensions.

Distance.PowerFunction( double p, double r ) computes the power distance between two vectors:

 

where p and r are user-defined parameters. Parameter p controls the pro­gressive weight that is placed on differences on individual dimensions; parameter r controls the progressive weight that is placed on larger differ­ences between objects. Appropriate selections of p and r yield Euclidean, squared Euclidean, Minkowski, city-block, and many other distance met­rics. For example, if p and r are equal to 2, the power distance is equal to the Euclidean distance.

All provided distance functions allow missing values. Pairs of elements are excluded from the distance measure when their comparison returns NaN. If all pairs are excluded, NaN is returned for the distance measure.

You can also define your own Distance.Function delegate and use it to cluster your data. For example, if you have function MyDistance() that computes the distance between two vectors:

Code Example – C# hierarchical cluster analysis

public double MyDistance( DoubleVector x, DoubleVector y );

You can define a Distance.Function delegate like so:

Code Example – C# hierarchical cluster analysis

Distance.Function MyDistanceFunction =
  new Distance.Function( MyDistance );

Linkage Functions

During clustering, the distances between clusters of objects are computed using a linkage function. The linkage function is encapsulated in a Linkage.Function delegate. When two groups P and Q are united, a linkage function computes the distance between the new combined group P + Q and another group R.

 

Figure 2 – Computing the distance between clusters using a linkage function

The parameters to the Linkage.Function—which may not necessarily all be used to calculate the result—are the distance between R and P, the distance between R and Q, the distance between P and Q, and the sizes (n) of all three groups:

Code Example – C# hierarchical cluster analysis

public delegate double Function( double Drp, double Drq,
  double Dpq, double Nr, double Np, double Nq );

Delegates are provided as static variables on class Linkage for many common linkage functions:

Linkage.SingleFunction computes the distance between two clusters as the distance of the two closest objects (nearest neighbors) in the clusters. Adopting a friends-of-friends clustering strategy closely related to the minimal spanning tree, the single linkage method tends to result in long chains of clusters.

Linkage.CompleteFunction computes the distance between two clusters as the greatest distance between any two objects in the different clusters (furthest neighbors). The complete linkage method tends to work well in cases where objects form naturally distinct clumps.

Linkage.UnweightedAverageFunction computes the distance between two clusters as the average distance between all pairs of objects in the two different clusters. This method is sometimes referred to as unweighted pair-group method using arithmetic averages, and abbreviated UPGMA.

Linkage.WeightedAverageFunction computes the distance between two clusters as the average distance between all pairs of objects in the two different clusters, using the size of each cluster as a weighting factor. This method is sometimes referred to as weighted pair-group method using arithmetic averages, and abbreviated WPGMA.

Linkage.CentroidFunction computes the distance between two clusters as the difference between centroids. The centroid of a cluster is the average point in the multidimensional space. The centroid method is sometimes referred to as unweighted pair-group method using the centroid average, and abbreviated UPGMC.

Linkage.MedianFunction computes the distance between two clusters as the difference between centroids, using the size of each cluster as a weighting factor. This is sometimes referred to as weighted pair-group method using the centroid average, and abbreviated WPGMC.

Linkage.WardFunction computes the distance between two clusters using Ward's method. Ward's method uses an analysis of variance approach to evaluate the distances between clusters. The smaller the increase in the total within-group sum of squares as a result of joining two clusters, the closer they are. The within-group sum of squares of a cluster is defined as the sum of the squares of the distance between all objects in the cluster and the centroid of the cluster. Ward's method tends to produce compact groups of well-distributed size.

You can also define your own Linkage.Function delegate and use it to cluster your data. For example, if you have function MyLinkage() that computes the distance between two clusters:

Code Example – C# hierarchical cluster analysis

public double MyLinkage( double Drp, double Drq, double Dpq,
                         double Nr, double Np, double Nq );

You can define a Linkage.Function delegate like so:

Code Example – C# hierarchical cluster analysis

Linkage.Function MyLinkageFunction =
  new Linkage.Function( MyLinkage );

Creating Cluster Analyses

A ClusterAnalysis instance is constructed from a matrix or a dataframe containing numeric data. Each row in the data set represents an object to be clustered.

Code Example – C# hierarchical cluster analysis

var ca = new ClusterAnalysis( data );

The current default distance and linkage delegates are used. The default distance and linkage delegates are Distance.EuclideanFunction and Linkage.SingleFunction, unless the defaults have been changed using the static DefaultDistanceFunction and DefaultLinkageFunction properties. For example:

Code Example – C# hierarchical cluster analysis

ClusterAnalysis.DefaultDistanceFunction = Distance.MaximumFunction;
ClusterAnalysis.DefaultLinkageFunction = Linkage.CentroidFunction;

This changes the default distance and linkage functions for all subsequently constructed ClusterAnalysis objects.

You can also specify non-default distance and linkage functions in the constructor:

Code Example – C# hierarchical cluster analysis

var ca = new ClusterAnalysis( data,   
  Distance.PowerFunction( 1.25, 2.0 ), Linkage.CompleteFunction );

After construction, you can retrieve information about the ClusterAnalysis configuration using the provided properties:

N gets the total number of objects being clustered.

DistanceFunction gets and sets the distance function delegate used to measure the distance between individual objects. Setting the distance function using the DistanceFunction property has no effect until Update() is called with new data. (See below.)

LinkageFunction gets and sets the linkage function used to measure the distance between clusters of objects. Setting the linkage delegate using the LinkageFunction property has no effect until Update() is called with new data. (See below.)

Cluster Analysis Results

The Distances property gets the vector of distances between all possible object pairs, computed using the current distance delegate. For n objects, the distance vector is of length (n-1)(n/2), with distances arranged in the order:

(1,2), (1,3), ..., (1,n), (2,3), ..., (2,n), ..., ..., (n-1,n)

Linkages gets an (n-1) x 3 matrix containing the complete hierarchical linkage tree, computed from Distances using the current linkage delegate. At each level in the tree, columns 1 and 2 contain the indices of the clusters linked to form the next cluster. Column 3 contains the distances between the clusters. For example, this code clusters 8 random vectors of length 3, then shows a sample output of the hierarchical cluster tree:

Code Example – C# hierarchical cluster analysis

var data = new DoubleMatrix( 8, 3, new RandGenUniform() );
var ca = new ClusterAnalysis( data );
Console.WriteLine( ca.Linkages );

// Sample Output
// 
// 7x3 [
//        4 7 0.194409151975696
//        3 5 0.290431894003636
//        2 9 0.495557235783239
//        1 6 0.508966210536187
//        0 11 0.522321103698264
//        8 10 0.590187697768796
//        12 13 0.621675638177606 ]

Each object is initially assigned to its own singleton cluster, numbered 0 to 7. The analysis then proceeds iteratively, at each stage joining the two most similar clusters into a new cluster, continuing until there is one overall cluster. The first new cluster formed by the linkage function is assigned index 8, the second is assigned index 9, and so forth. When these indices appear later in the tree, the clusters are being combined again into a still larger cluster.

The CutTree() method constructs a set of clusters by cutting the hierarchical linkage tree either at the specified height, or into the specified number of clusters. For example, this code cuts the linkage tree to form 3 clusters:

Code Example – C# hierarchical cluster analysis

ca.CutTree( 3 );

This code cuts the linkage tree at a height of 0.75:

Code Example – C# hierarchical cluster analysis

ca.CutTree( 0.75 );

The CutTree() method returns a ClusterSet object, which represents a collection of objects assigned to a finite number of clusters. The NumberOfClusters property

gets the number of clusters into which objects are grouped; N gets the number of objects. The Clusters property returns an array of integers that identifies the cluster into which each object was grouped. Cluster numbers are arbitrary, and range from 0 to NumberOfClusters - 1. The indexer gets the cluster to which a given object is assigned. The Cluster() method returns the objects assigned to a given cluster as an array of integers. For instance:

Code Example – C# hierarchical cluster analysis

// Cluster 10 random vectors of length 4:
var data = new DoubleMatrix( 10, 4, new RandGenUniform() );
var ca = new ClusterAnalysis( data );

// Cut the tree into 5 clusters
ClusterSet cut = ca.CutTree( 5 );

Console.WriteLine( "ClusterSet = " + cut );
Console.WriteLine( "Object 0 is in cluster: " + cut[0] );
Console.WriteLine( "Object 3 is in cluster: " + cut[3] );
Console.WriteLine( "Object 8 is in cluster: " + cut[8] );
int[] objects = cut.Cluster( 1 );
Console.Write( "Objects in cluster 1: " );
for (int i = 0; i < objects.Length; i++ )
{
  Console.Write( objects[i] + " " );
}
Console.WriteLine();

// Sample Ouput
//
// ClusterSet = 0,1,2,1,1,1,3,1,4,1
// Object 0 is in cluster: 0
// Object 3 is in cluster: 1
// Object 8 is in cluster: 4
// Objects in cluster 1: 1 3 4 5 7 9

Lastly, the CopheneticDistances property on class ClusterAnalysis gets the vector of cophenetic distances between all possible object pairs. The cophenetic distance between two objects is defined to be the intergroup distance when the objects are first combined into a single cluster in the linkage tree. The format is the same as the distance vector returned by Distances.

The correlation between the original Distances and the CopheneticDistances is sometimes taken as a measure of the appropriateness of a cluster analysis relative to the original data:

Code Example – C# hierarchical cluster analysis

var ca = new ClusterAnalysis( data );
double r = StatsFunctions.Correlation( ca.Distances, 
                                       ca.CopheneticDistances );

Reusing Cluster Analysis Objects

Method Update() updates an existing ClusterAnalysis instance with new data, and optionally with new distance and linkage functions. For example:

Code Example – C# hierarchical cluster analysis

var ca = new ClusterAnalysis( data, Linkage.SingleFunction );
Console.WriteLine( ca.Linkages );

ca.Update( data, Linkage.CompleteFunction );
Console.WriteLine( ca.Linkages );

Top

Top