Cluster analysis detects natural groupings in data. In hierarchical cluster analysis, each object is initially assigned to its own singleton cluster. The analysis then proceeds iteratively, at each stage joining the two most similar clusters into a new cluster, continuing until there is one overall cluster. In NMath Stats, class ClusterAnalysis perform hierarchical cluster analyses.
During clustering, the distance between individual objects is computed using a distance function. The distance function is encapsulated in a Distance.Function delegate, which takes two vectors and returns a measure of the distance (similarity) between them:
public delegate double Function( DoubleVector data1, DoubleVector data2 );
Delegates are provided as static variables on class Distance for many common distance functions:
All provided distance functions allow missing values. Pairs of elements are excluded from the distance measure when their comparison returns NaN. If all pairs are excluded, NaN is returned for the distance measure.
You can also define your own Distance.Function delegate and use it to cluster your data. For example, if you have function MyDistance() that computes the distance between two vectors:
public double MyDistance( DoubleVector x, DoubleVector y );
You can define a Distance.Function delegate like so:
Distance.Function MyDistanceFunction = new Distance.Function( MyDistance );
During clustering, the distances between clusters of objects are computed using a linkage function. The linkage function is encapsulated in a Linkage.Function delegate. When two groups P and Q are united, a linkage function computes the distance between the new combined group P + Q and another group R.
|
The parameters to the Linkage.Function-which may not necessarily all be used to calculate the result-are the distance between R and P, the distance between R and Q, the distance between P and Q, and the sizes (n) of all three groups:
public delegate double Function( double Drp, double Drq, double Dpq, double Nr, double Np, double Nq );
Delegates are provided as static variables on class Linkage for many common linkage functions:
You can also define your own Linkage.Function delegate and use it to cluster your data. For example, if you have function MyLinkage() that computes the distance between two clusters:
public double MyLinkage( double Drp, double Drq, double Dpq,
double Nr, double Np, double Nq );
You can define a Linkage.Function delegate like so:
Linkage.Function MyLinkageFunction = new Linkage.Function( MyLinkage );
A ClusterAnalysis instance is constructed from a matrix or a dataframe containing numeric data. Each row in the data set represents an object to be clustered.
ClusterAnalysis ca = new ClusterAnalysis( data );
The current default distance and linkage delegates are used. The default distance and linkage delegates are Distance.EuclideanFunction and Linkage.SingleFunction, unless the defaults have been changed using the static DefaultDistanceFunction and DefaultLinkageFunction properties. For example:
ClusterAnalysis.DefaultDistanceFunction = Distance.MaximumFunction; ClusterAnalysis.DefaultLinkageFunction = Linkage.CentroidFunction;
This changes the default distance and linkage functions for all subsequently constructed ClusterAnalysis objects.
You can also specify non-default distance and linkage functions in the constructor:
ClusterAnalysis ca = new ClusterAnalysis( data, Distance.PowerFunction( 1.25, 2.0 ), Linkage.CompleteFunction );
After construction, you can retrieve information about the ClusterAnalysis configuration using the provided properties:
The Distances property gets the vector of distances between all possible object pairs, computed using the current distance delegate. For n objects, the distance vector is of length (n-1)(n/2), with distances arranged in the order:
(1,2), (1,3), ..., (1,n), (2,3), ..., (2,n), ..., ..., (n-1,n)
Linkages gets an (n-1) x 3 matrix containing the complete hierarchical linkage tree, computed from Distances using the current linkage delegate. At each level in the tree, columns 1 and 2 contain the indices of the clusters linked to form the next cluster. Column 3 contains the distances between the clusters. For example, this code clusters 8 random vectors of length 3, then shows a sample output of the hierarchical cluster tree:
DoubleMatrix data = new DoubleMatrix( 8, 3, new RandGenUniform() ); ClusterAnalysis ca = new ClusterAnalysis( data ); Console.WriteLine( ca.Linkages ); // Sample Output // // 7x3 [ // 4 7 0.194409151975696 // 3 5 0.290431894003636 // 2 9 0.495557235783239 // 1 6 0.508966210536187 // 0 11 0.522321103698264 // 8 10 0.590187697768796 // 12 13 0.621675638177606 ]
Each object is initially assigned to its own singleton cluster, numbered 0 to 7. The analysis then proceeds iteratively, at each stage joining the two most similar clusters into a new cluster, continuing until there is one overall cluster. The first new cluster formed by the linkage function is assigned index 8, the second is assigned index 9, and so forth. When these indices appear later in the tree, the clusters are being combined again into a still larger cluster.
The CutTree() method constructs a set of clusters by cutting the hierarchical linkage tree either at the specified height, or into the specified number of clusters. For example, this code cuts the linkage tree to form 3 clusters:
ca.CutTree( 3 );
This code cuts the linkage tree at a height of 0.75:
ca.CutTree( 0.75 );
The CutTree() method returns a ClusterSet object, which represents a collection of objects assigned to a finite number of clusters. The NumberOfClusters property
gets the number of clusters into which objects are grouped; N gets the number of objects. The Clusters property returns an array of integers that identifies the cluster into which each object was grouped. Cluster numbers are arbitrary, and range from 0 to NumberOfClusters - 1. The indexer gets the cluster to which a given object is assigned. The Cluster() method returns the objects assigned to a given cluster as an array of integers. For instance:
// Cluster 10 random vectors of length 4: DoubleMatrix data = new DoubleMatrix( 10, 4, new RandGenUniform() ); ClusterAnalysis ca = new ClusterAnalysis( data ); // Cut the tree into 5 clusters ClusterSet cut = ca.CutTree( 5 ); Console.WriteLine( "ClusterSet = " + cut ); Console.WriteLine( "Object 0 is in cluster: " + cut[0] ); Console.WriteLine( "Object 3 is in cluster: " + cut[3] ); Console.WriteLine( "Object 8 is in cluster: " + cut[8] ); int[] objects = cut.Cluster( 1 ); Console.Write( "Objects in cluster 1: " ); for (int i = 0; i < objects.Length; i++ ) { Console.Write( objects[i] + " " ); } Console.WriteLine(); // Sample Ouput // // ClusterSet = 0,1,2,1,1,1,3,1,4,1 // Object 0 is in cluster: 0 // Object 3 is in cluster: 1 // Object 8 is in cluster: 4 // Objects in cluster 1: 1 3 4 5 7 9
Lastly, the CopheneticDistances property on class ClusterAnalysis gets the vector of cophenetic distances between all possible object pairs. The cophenetic distance between two objects is defined to be the intergroup distance when the objects are first combined into a single cluster in the linkage tree. The format is the same as the distance vector returned by Distances.
The correlation between the original Distances and the CopheneticDistances is sometimes taken as a measure of the appropriateness of a cluster analysis relative to the original data:
ClusterAnalysis ca = new ClusterAnalysis( data ); double r = StatsFunctions.Correlation( ca.Distances, ca.CopheneticDistances );
Method Update() updates an existing ClusterAnalysis instance with new data, and optionally with new distance and linkage functions. For example:
ClusterAnalysis ca = new ClusterAnalysis( data, Linkage.SingleFunction ); Console.WriteLine( ca.Linkages ); ca.Update( data, Linkage.CompleteFunction ); Console.WriteLine( ca.Linkages );
TOC | Previous | Next | Index