NMath Stats User's Guide

TOC | Previous | Next | Index

2.11 Cross-Tabulation (.NET, C#, CSharp, VB, Visual Basic, F#)

As described in Section 2.10, the DataFrame.GetFactor() method can be used in conjunction with Subset.GetGroupings() to access "cells" of data based on one or two grouping factors. This is such a common operation that class DataFrame also provides the Tabulate() methods as a convenience. This method accepts one or two grouping columns, a data column, and a delegate to apply to each data column subset. The results are returned in a new data frame.

Column Delegates

Overloads of Tabulate() accept static IDFColumn function delegates that return various types. For instance, this code encapsulates the static StatsFunctions.Mean() function in a Func<IDFColumn, double>:

Code Example – C#

Func<IDFColumn, double> mean =
  new Func<IDFColumn, double>(StatsFunctions.Mean);

Most of the static descriptive statistics functions on class StatsFunctions (Chapter 3) have overloads that accept an IDFColumn and return a double, and so can be encapsulated in this way. A few return integers.

For example, this code encapsulates StatsFunctions.Count(), which returns the number of items in a column, in a Func<IDFColumn, int>:

Code Example – C#

Func<IDFColumn, int> count =
  new Func<IDFColumn, int>(StatsFunctions.Count);

Applying Column Delegates to Tabulated Data

The following code fills a DataFrame with some sales data:

Code Example – C#

var df = new DataFrame();
df.AddColumn( new DFStringColumn( "Product" ) );
df.AddColumn( new DFStringColumn("Month") );  
df.AddColumn( new DFIntColumn( "Quantity" ) );
df.AddColumn( new DFNumericColumn( "Price" ) );
df.AddColumn( new DFNumericColumn( "TotalSale" ) );

int rowID = 0;
df.AddRow( rowID++, "Squash", "Nov", 40, 1.50, 60.0 );
df.AddRow( rowID++, "Carrots", "Nov", 15, 1.20, 18.0 );
df.AddRow( rowID++, "Squash", "Nov", 37, 1.45, 53.65 );
df.AddRow( rowID++, "Carrots", "Nov", 18, 1.25, 22.50 );
df.AddRow( rowID++, "Squash", "Nov", 34, 1.39, 47.26 );
df.AddRow( rowID++, "Carrots", "Dec", 20, 1.30, 26.0 );
df.AddRow( rowID++, "Squash", "Dec", 31, 1.30, 40.30 );
df.AddRow( rowID++, "Carrots", "Dec", 25, 1.40, 35.0 );
df.AddRow( rowID++, "Squash", "Dec", 25, 1.25, 31.25 );
df.AddRow( rowID++, "Carrots", "Dec", 30, 1.45, 43.50 );
df.AddRow( rowID++, "Carrots", "Jan", 33, 1.50, 49.50 );
df.AddRow( rowID++, "Squash", "Jan", 19, 1.21, 22.99 );
df.AddRow( rowID++, "Carrots", "Jan", 40, 1.65, 66.0 );
df.AddRow( rowID++, "Squash", "Jan", 15, 1.11, 16.65 );
df.AddRow( rowID++, "Carrots", "Jan", 47, 1.80, 84.60 );
df.AddRow( rowID++, "Squash", "Jan", 10, 1.00, 10.0 );

This code displays the average sales for each product:

Code Example – C#

Func<IDFColumn, double> mean =
  new Func<IDFColumn, double>(StatsFunctions.Mean);
Console.WriteLine( df.Tabulate( "Product", "TotalSale", mean ) );

The Product column is used as a grouping column, TotalSale contains the data, and the mean delegate returns the mean of the value in each cell. The output is:

#         Results
Carrots   43.1375
Squash    35.2625
Overall   39.2000

The Tabulate() methods return a new data frame. If only one grouping factor is specified, as in this example, the row keys are the sorted, unique factor levels. The only column, named Results, contains the results of applying the given delegate to the values in the data column tabulated for each level of the factor. A final row is appended, with key Overall, containing the results of applying the given delegate to all values in the data column.

Similarly, this code displays the number of observations in each cell for every combination of Product and Month:

Code Example – C#

Func<IDFColumn, int> count =
  new Func<IDFColumn, int>( StatsFunctions.Count );
Console.WriteLine(
  df.Tabulate( "Product", "Month", "TotalSale", count ); 

The Product and Month columns are used as grouping columns, TotalSale contains the data, and the count delegate returns the number of items in each cell.

The output is:

#         Dec  Jan  Nov  Overall
Carrots   3    3    2    8
Squash    2    3    3    8
Overall   5    6    5    16

When two grouping factors are specified, as in this case, the returned data frame has row keys containing the sorted, unique levels of the first grouping factor as strings. The columns in the data frame are named using the sorted, unique levels of the second grouping factor.

NOTE—In this example the alphabetic sorting of the Month names has put them into non-chronological order. In the months had been stored as DateTime objects in an DFDateTimeColumn, they would have been ordered chronologically.

Each cell in the data frame contains the results of applying the given delegate to the values in the data column tabulated for the appropriate combination of the two factors. A final column is appended, named Overall, containing the overall results for each level of the first factor. A final row is appended, with key Overall, containing the overall results for each level of the second factor. The lower right corner cell, accessed by indexer this["Overall","Overall"], contains the results of applying the given delegate to all values in the data column.


Top

Top