Drawing Dendrograms

A dendrogram is a tree diagram often used to visualize the results of hierarchical clustering. A customer recently contacted us asking for help drawing dendrograms from the output of the hierarchical clustering algorithm in NMath Stats. (For more information in hierarchical clustering in NMath Stats, see this post.)

UserZoom is an international software company specializing in web customer experience and usability testing. UserZoom software includes a tool to run online card sorting exercises with real users. Card Sorting is a technique that information architects and user experience practitioners use to explore how “real people” group items and content. UserZoom developers were using NMath hierarchical clustering to analyze card sorting results, and wanted to generate dendrograms to help researchers understand participant’s grouping criteria.

In NMath Stats, class ClusterAnalysis performs hierarchical cluster analyses. For example, the following C# code clusters 6 random vectors of length 3:

int seed = 123;
DoubleMatrix data =
  new DoubleMatrix(6, 3, new RandGenUniform(1, 10, seed));
ClusterAnalysis ca = new ClusterAnalysis(data);

The random seed is specified so we can get repeatable results.

After construction, the Linkages property gets an (n-1) x 3 matrix containing the complete hierarchical linkage tree. At each level in the tree, columns 1 and 2 contain the indices of the clusters linked to form the next cluster. Column 3 contains the distances between the clusters.

Console.WriteLine(ca.Linkages.ToTabDelimited());

The output is:

3	4	2.11126489430827
1	5	2.50743075649221
2	6	2.59997959766694
7	8	3.61586096606758
0	9	4.37499686772813

Each object is initially assigned to its own singleton cluster, numbered 0 to 5. The analysis then proceeds iteratively, at each stage joining the two most similar clusters into a new cluster, continuing until there is one overall cluster. The first new cluster formed is assigned index 6, the second is assigned index 7, and so forth. When these indices appear later in the tree, the clusters are being combined again into a still larger cluster.

The Linkages matrix contains the complete hierarchical cluster tree. A problem arises, however, when trying to draw a dendrogram from the Linkages matrix: how should you layout the leaf nodes of the tree so that no lines cross in the diagram? Supppose, for example, you tried laying out the leaf nodes in the order in which they appear in the Linkages matrix:

3 4 1 5 2 0

If you try drawing a dendrogram by hand starting from these leaf nodes, following the steps in the linkage tree, you’ll quickly see the problem. First, singleton clusters 3 and 4 are joined to create a new cluster (6). No problem. Then singleton clusters 1 and 5 are joined to create a new cluster (7). Again, no problem. But then singleton cluster 2 is joined to synthetic cluster 6 to create a new cluster (8), and we have a problem. You can see the result here:

The issue is that the order of the leaf nodes in the Linkages matrix is based on the distance between the clusters at the time they are joined. This order does not necessarily produce a dendrogram with no crossing lines. One solution is to walk the hierarchical cluster tree recursively from the top-down, to determine a good drawing order of the leaf nodes. The following C# code does this:

public static int[] ComputeLeafNodes(ClusterAnalysis ca)
{
  int[] order = new int[ca.Linkages.Rows + 1];
  int pos = 0;
  ReorderChildren(ca);
  WalkChildren(ca, ref order, ref pos, ca.Linkages.Rows - 1);
  return order;
}

private static void WalkChildren(ClusterAnalysis ca,
  ref int[] order, ref int pos, int row)
{
  int n = ca.Linkages.Rows + 1;
  int node = (int)ca.Linkages[row, 0];

  if (node >= n)
  {
    WalkChildren(ca, ref order, ref pos, node - n);
  }

  if (node < n)
    order[pos++] = node;
  
  node = (int)ca.Linkages[row, 1];
  if (node >= n)
  {
    WalkChildren(ca, ref order, ref pos, node - n);
  }

  if (node < n)
    order[pos++] = node;
}

private static void ReorderChildren(ClusterAnalysis ca)
{
  int n = ca.Linkages.Rows + 1;
  for (int i = 0; i < ca.Linkages.Rows; i++)
  {
    int l1 = (int)ca.Linkages[i, 0];
    int l2 = (int)ca.Linkages[i, 1];
    if (l1 < n && l2 < n)     {       
      if (l1 > l2)
      {
        Swap(ca, i);
      }
    }
    else if (l1 < n && l2 >= n)
    {
      Swap(ca, i);
    }
    else if (l1 >= n && l2 >= n)
    {
      if (l1 > l2)
      {
        Swap(ca, i);
      }
    }
  }
}

private static void Swap(ClusterAnalysis ca, int row)
{
  double tmp = ca.Linkages[row, 0];
  ca.Linkages[row, 0] = ca.Linkages[row, 1];
  ca.Linkages[row, 1] = tmp;
}

If we use this ComputeLeafNodes() function on the cluster analysis computed above:

int[] leaves = ComputeLeafNodes(ca);
foreach (int i in leaves) Console.Write("{0} ", i);
Console.WriteLine();

The output is:

1 5 3 4 2 0           (the previous ordering was:  3 4 1 5 2 0)

If you repeat the exercise of drawing the dendrogram by hand, you’ll see that this layout avoids any crossing lines because the 1-5 and 3-4 cluster pairs have been reordered:

Using a technique like that shown in ComputeLeafNodes(), UserZoom was able to generate dendrograms to help researchers understand participant’s grouping criteria in the card sorting data:

Ken

6 thoughts on “Drawing Dendrograms

  1. Is there any way to generate a dendrogram from nmath ClusterAnalysis result ?
    If someone can share the c# source code to generate dendrograms which is not made by UserZoom, I ‘ll greatly appreciated 😉
    Any resource is very welcome !~

  2. Hi Sam,

    There are no graphing functions in NMath, but many of our customers use NMath in conjunction with other charting packages, such as Infragistics (http://www.infragistics.com/), Syncfusion (http://www.syncfusion.com/), Nevron (http://www.nevron.com/), and SoftwareFX (http://www.softwarefx.com/). If you want a full-featured dendrogram, I’d suggest you check out these products to see what’s available.

    I do have some C# code I wrote to draw the first two dendrograms shown above from ClusterAnalysis objects (using System.Drawing). I’d be happy to send it to you if you want to try rolling your own solution. Email me at baldwin@centerspace.net, if you’re interested.

    Ken

  3. An adjacency matrix can represent a more general graph structure than the tree structure of a dendrogram.

Leave a Reply

Your email address will not be published. Required fields are marked *

Top