NMath Stats User's Guide

2.10 Factors

The Factor class represents a categorical vector in which all elements are drawn from a finite number of factor levels. Thus, a Factor contains two parts:

an object array of factor levels

an integer array of categorical data, of which each element is an index into the array of levels

For example, this string data:

"A", "A", "C", "B", "A", "C", "B"

could be represented as a Factor with the following levels and categorical data:

Code Example – C#

object[] levels = { "A", "B", "C" };
int[] data = { 0, 0, 2, 1, 0, 2, 1 };

Factors are usually constructed from a data frame column using the GetFactor() method, but they can also be constructed independently.

Creating Factors

The GetFactor() method on DataFrame accepts a column index or name and returns a Factor with levels for the sorted, unique elements in the given column:

Code Example – C#

Factor myColFactor = df.GetFactor( "myCol" );

Alternatively, you can provide the factor levels yourself. The order is preserved. Thus:

Code Example – C#

var levels = new object[] { "Q1", "Q2", "Q3", "Q4" };
Factor myColFactor = df.GetFactor( "myCol", levels );

An InvalidArgumentException is thrown if the specified column contains a value not present in the given array of levels.
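
The effect of supplying the levels yourself can be sketched in plain C#, without NMath. This is only an illustration under stated assumptions, not how GetFactor() is implemented: ExplicitLevelsSketch and Encode are hypothetical names, and the standard ArgumentException stands in for InvalidArgumentException.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration only; not part of NMath.
public static class ExplicitLevelsSketch
{
    // Maps each value to its position in the caller-supplied levels array.
    // The order of 'levels' is used as given; nothing is sorted.
    public static int[] Encode(object[] values, object[] levels)
    {
        return values.Select(v =>
        {
            int index = Array.IndexOf(levels, v);
            if (index < 0)
            {
                // Assumption: ArgumentException stands in for NMath's exception type.
                throw new ArgumentException(
                    $"Value '{v}' is not among the supplied levels.");
            }
            return index;
        }).ToArray();
    }
}
```

With the levels "Q1" through "Q4" in that order, a value of "Q3" is encoded as 2, and a value such as "Q5" causes the exception.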

You can also construct a Factor independently of a DataFrame. For example, you can construct a Factor from an array of values:

Code Example – C#

var values = new object[] { 1, 1, 3, 2, 1, 3, 2 };
var factor = new Factor( values );

Factor levels are constructed from a sorted list of unique values in the passed array.
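
To make the encoding concrete, here is a minimal sketch in plain C# of how sorted, unique levels and the matching index data could be derived from an array of values. It is an assumption about the general technique, not NMath's actual implementation; FactorEncodingSketch and Encode are hypothetical names.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration only; not part of NMath.
public static class FactorEncodingSketch
{
    // Returns the sorted, unique levels and one level index per input value.
    public static (object[] Levels, int[] Data) Encode(object[] values)
    {
        // Distinct() drops duplicates; OrderBy sorts with the default comparer,
        // so the values must be mutually comparable (all ints, all strings, ...).
        object[] levels = values.Distinct().OrderBy(v => v).ToArray();

        // Replace each value with the position of its level.
        int[] data = values.Select(v => Array.IndexOf(levels, v)).ToArray();

        return (levels, data);
    }
}
```

Applied to the values 1, 1, 3, 2, 1, 3, 2, the sketch yields the levels 1, 2, 3 and the data 0, 0, 2, 1, 0, 2, 1.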

Alternatively, you can construct a Factor from an array of factor levels, and a data array consisting of indices into the factor levels:

Code Example – C#

var levels = new object[] { 1, 2, 3 };
var data = new int[] { 0, 0, 2, 1, 0, 2, 1 };
var factor = new Factor( levels, data );

An InvalidArgumentException is thrown if the given data array contains an invalid index.

Properties of Factors

The Factor class provides the following properties:

Data gets the categorical data for the factor. Each element in the returned integer array is an index into Levels.

Levels gets the levels of the factor as an array of objects.

Length gets the length of the Data in the factor.

Name gets and sets the name of the factor.

NumberOfLevels gets the number of levels in the factor.

Accessing Factors

A standard indexer is provided for accessing the element at a given index. For example, for a factor whose levels are strings:

Code Example – C#

string str = (string)factor[2];

The indexer returns Levels[ Data[index] ]; that is, it returns the level of the element at the given position.
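
The lookup rule can be written out in plain C# using the levels and data arrays from the start of this section. FactorIndexerSketch, Lookup, and Decode are hypothetical names used only for this sketch; they are not NMath members.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration only; not part of NMath.
public static class FactorIndexerSketch
{
    // Mirrors the rule Levels[ Data[index] ] for a single position.
    public static object Lookup(object[] levels, int[] data, int index)
    {
        return levels[data[index]];
    }

    // Applies the same rule to every position, recovering the original values.
    public static object[] Decode(object[] levels, int[] data)
    {
        return data.Select(i => levels[i]).ToArray();
    }
}
```

For the levels "A", "B", "C" and the data 0, 0, 2, 1, 0, 2, 1, looking up index 2 gives "C", and decoding every position gives back "A", "A", "C", "B", "A", "C", "B".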

Creating Groupings with Factors

The principal use of factors is in conjunction with the GetGroupings() methods on Subset. One overload of this method accepts a single Factor and returns an array of subsets containing the indices for each level of the given factor. Another overload accepts two Factor objects and returns a two-dimensional array of subsets containing the indices for each combination of levels in the two factors.
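
Conceptually, a grouping collects the positions that share a level. The following plain C# sketch shows the idea for one and for two factors, using lists of indices in place of Subset objects. GroupingSketch and Group are hypothetical names, and the sketch makes no claim about how GetGroupings() is implemented.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of NMath.
public static class GroupingSketch
{
    // One list of row indices per level of a single factor.
    public static List<int>[] Group(int[] data, int numberOfLevels)
    {
        var groups = new List<int>[numberOfLevels];
        for (int level = 0; level < numberOfLevels; level++)
            groups[level] = new List<int>();

        for (int row = 0; row < data.Length; row++)
            groups[data[row]].Add(row);

        return groups;
    }

    // One list of row indices per combination of levels of two factors.
    public static List<int>[,] Group(int[] dataA, int levelsA, int[] dataB, int levelsB)
    {
        if (dataA.Length != dataB.Length)
            throw new ArgumentException("Both factors must have the same length.");

        var groups = new List<int>[levelsA, levelsB];
        for (int i = 0; i < levelsA; i++)
            for (int j = 0; j < levelsB; j++)
                groups[i, j] = new List<int>();

        for (int row = 0; row < dataA.Length; row++)
            groups[dataA[row], dataB[row]].Add(row);

        return groups;
    }
}
```

For the data 0, 0, 2, 1, 0, 2, 1 with three levels, the single-factor sketch places rows 0, 1, 4 in the first group, rows 3, 6 in the second, and rows 2, 5 in the third.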

For example, suppose we record the weights of human subjects classified by sex and age group. The data for 15 subjects might look like this:

Table 4 – Sample data

            Male        Female
Child       45, 42      30, 35, 60, 40
Adult       182, 170    115, 130
Senior      142, 155    115, 110, 123

In a DataFrame, each observation would be a row, like so:

Code Example – C#

var df = new DataFrame();
df.AddColumn( new DFStringColumn( "Sex" ) );  
df.AddColumn( new DFStringColumn( "AgeGroup" ));
df.AddColumn( new DFIntColumn( "Weight" ) );

df.AddRow( "John Smith", "Male", "Child", 45 );
df.AddRow( "Ruth Barnes", "Female", "Senior", 115 );
df.AddRow( "Jane Jones", "Female", "Adult", 115 );
df.AddRow( "Timmy Toddler", "Male", "Child", 42 );
df.AddRow( "Betsy Young", "Female", "Adult", 130 );
df.AddRow( "Arthur Smith", "Male", "Senior", 142 );
df.AddRow( "Lucy Young", "Female", "Child", 30 );
df.AddRow( "Emma Allen", "Female", "Child", 35 );
df.AddRow( "Roy Wilkenson", "Male", "Adult", 182 );
df.AddRow( "Susan Schwarz", "Female", "Senior", 110 );
df.AddRow( "Ming Tao", "Female", "Senior", 123 );
df.AddRow( "Johanna Glynn", "Female", "Child", 60 );
df.AddRow( "Randall Harvey", "Male", "Adult", 170 );
df.AddRow( "Tom Howard", "Male", "Senior", 155 );
df.AddRow( "Jennifer Watson", "Female", "Child", 40 );

In this case, we're using the subjects' names as row keys.

It is natural to construct factors from the Sex and AgeGroup columns:

Code Example – C#

Factor sex = df.GetFactor( "Sex" );
Factor age = df.GetFactor( "AgeGroup" );

We can then use these factors in conjunction with the GetGroupings() methods on Subset to create subsets representing the original rows, columns, and cells in Table 4:

Code Example – C#

Subset[] sexGroups = Subset.GetGroupings( sex );
Subset[] ageGroups = Subset.GetGroupings( age );
Subset[,] cellGroups = Subset.GetGroupings( sex, age );

These subsets can then be used to operate on the relevant portions of the data frame. For instance, this code prints out row means, column means, and cell means for Table 4:

Code Example – C#

Console.WriteLine( "\nTABLE ROW MEANS" ); 
for ( int i = 0; i < age.NumberOfLevels; i++ )
{
  double mean = StatsFunctions.Mean(
    df[ df.IndexOfColumn( "Weight" ), ageGroups[i] ] );
  Console.WriteLine( "Mean for {0} = {1}", age.Levels[i], mean );
}

Console.WriteLine( "\nTABLE COLUMN MEANS" ); 
for ( int i = 0; i < sex.NumberOfLevels; i++ )
{
  double mean = StatsFunctions.Mean(
    df[ df.IndexOfColumn( "Weight" ), sexGroups[i] ] );
  Console.WriteLine( "Mean for {0} = {1}", sex.Levels[i], mean );
}

Console.WriteLine( "\nTABLE CELL MEANS" );
for ( int i = 0; i < sex.NumberOfLevels; i++ )
{
  for ( int j = 0; j < age.NumberOfLevels; j++ )
  {
    double mean = StatsFunctions.Mean(
      df[ df.IndexOfColumn( "Weight" ), cellGroups[i,j] ] );
    Console.WriteLine( "Mean for {0} {1} = {2}",
      sex.Levels[i], age.Levels[j], mean );
  }
}

The output is:

TABLE ROW MEANS
Mean for Adult = 149.25
Mean for Child = 42
Mean for Senior = 129

TABLE COLUMN MEANS
Mean for Female = 84.2222222222222
Mean for Male = 122.666666666667

TABLE CELL MEANS
Mean for Female Adult = 122.5
Mean for Female Child = 41.25
Mean for Female Senior = 116
Mean for Male Adult = 176
Mean for Male Child = 43.5
Mean for Male Senior = 148.5
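
As a cross-check that does not depend on NMath, the same means can be recomputed from the 15 observations with LINQ. This is a sketch only: TableMeansCheck, Rows, and MeansBy are hypothetical names, and the rows are copied from the AddRow() calls above without the row keys.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of NMath.
public static class TableMeansCheck
{
    // The 15 observations from the data frame example, without row keys.
    public static readonly (string Sex, string AgeGroup, int Weight)[] Rows =
    {
        ("Male", "Child", 45), ("Female", "Senior", 115), ("Female", "Adult", 115),
        ("Male", "Child", 42), ("Female", "Adult", 130), ("Male", "Senior", 142),
        ("Female", "Child", 30), ("Female", "Child", 35), ("Male", "Adult", 182),
        ("Female", "Senior", 110), ("Female", "Senior", 123), ("Female", "Child", 60),
        ("Male", "Adult", 170), ("Male", "Senior", 155), ("Female", "Child", 40),
    };

    // Mean weight for each distinct key, stored in ordinal key order.
    public static SortedDictionary<string, double> MeansBy(
        Func<(string Sex, string AgeGroup, int Weight), string> key)
    {
        var means = new SortedDictionary<string, double>(StringComparer.Ordinal);
        foreach (var group in Rows.GroupBy(key))
            means[group.Key] = group.Average(r => r.Weight);
        return means;
    }
}
```

Calling MeansBy( r => r.AgeGroup ) corresponds to the row means, MeansBy( r => r.Sex ) to the column means, and MeansBy( r => r.Sex + " " + r.AgeGroup ) to the cell means.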

See also the Tabulate() convenience methods on class DataFrame, as described in Section 2.11.

