The Factor class represents a categorical vector in which all elements are drawn from a finite number of factor levels. Thus, a Factor contains two parts: an array of factor levels, and a data array of indices into those levels.
For example, this string data:
"A", "A", "C", "B", "A", "C", "B"
could be represented as a Factor with the following levels and categorical data:
object[] levels = { "A", "B", "C" };
int[] data = { 0, 0, 2, 1, 0, 2, 1 };
Factors are usually constructed from a data frame column using the GetFactor() method, but they can also be constructed independently.
The GetFactor() method on DataFrame accepts a column index or name and returns a Factor with levels for the sorted, unique elements in the given column:
Factor myColFactor = df.GetFactor( "myCol" );
Alternatively, you can provide the factor levels yourself, in which case the order you give is preserved:
object[] levels = new object[] { "Q1", "Q2", "Q3", "Q4" };
Factor myColFactor = df.GetFactor( "myCol", levels );
An InvalidArgumentException is raised if the specified column contains a value not present in the given array of levels.
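To make the order-preserving behavior concrete, here is a minimal sketch in plain C# that does not use NMath. The helper name EncodeWithLevels is hypothetical, and the sketch throws ArgumentException where NMath raises InvalidArgumentException. It uses age groups as levels because their natural order (Child, Adult, Senior) differs from their sorted order (Adult, Child, Senior).

```csharp
using System;
using System.Linq;

// Hypothetical helper, not NMath code: map each value to its position in the
// caller-supplied levels, keeping the levels in the order they were given.
static int[] EncodeWithLevels( object[] values, object[] levels )
{
    return values.Select( v =>
    {
        int i = Array.IndexOf( levels, v );
        if ( i < 0 )
        {
            // Stand-in for the InvalidArgumentException described above.
            throw new ArgumentException( $"Value '{v}' is not among the given levels." );
        }
        return i;
    } ).ToArray();
}

object[] ageLevels = { "Child", "Adult", "Senior" };
object[] column = { "Child", "Senior", "Adult", "Child" };

int[] codes = EncodeWithLevels( column, ageLevels );
Console.WriteLine( string.Join( ", ", codes ) );  // 0, 2, 1, 0
```

Because the levels are taken as given, "Child" maps to index 0 rather than to the index 1 it would receive under sorted levels.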
You can also construct a Factor independently of a DataFrame. For example, you can construct a Factor from an array of values:
object[] values = { 1, 1, 3, 2, 1, 3, 2 };
Factor factor = new Factor( values );
Factor levels are constructed from a sorted list of unique values in the passed array.
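As an illustration of how levels and data relate, the following plain C# sketch derives both from an array of values without using NMath. The helper name BuildFactor is hypothetical, and the approach (distinct values, sorted, then looked up by position) is an assumption about the general technique, not the library's actual implementation.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of NMath: derive sorted unique levels and the
// matching array of level indices from raw values.
static (object[] Levels, int[] Data) BuildFactor( object[] values )
{
    // Unique values in sorted order become the levels.
    object[] levels = values.Distinct().OrderBy( v => v ).ToArray();

    // Each element is replaced by the index of its level.
    int[] data = values.Select( v => Array.IndexOf( levels, v ) ).ToArray();

    return ( levels, data );
}

var ( levels, data ) = BuildFactor( new object[] { 1, 1, 3, 2, 1, 3, 2 } );
Console.WriteLine( string.Join( ", ", levels ) );  // 1, 2, 3
Console.WriteLine( string.Join( ", ", data ) );    // 0, 0, 2, 1, 0, 2, 1
```

The sketch assumes all values share one comparable type, such as all integers or all strings.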
Alternatively, you can construct a Factor from an array of factor levels, and a data array consisting of indices into the factor levels:
object[] levels = { 1, 2, 3 };
int[] data = { 0, 0, 2, 1, 0, 2, 1 };
Factor factor = new Factor( levels, data );
An InvalidArgumentException is thrown if the given data array contains an invalid index.
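The Factor class provides properties including Levels, the array of factor levels; Data, the array of indices into the levels; and NumberOfLevels, the number of levels.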
A standard indexer is provided for accessing the element at a given index:
string str = (string)factor[2];
The indexer returns Levels[ Data[index] ]; that is, it returns the level of the element at the given position.
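The lookup can be traced by hand with the levels and data from the string example earlier in this section. This is a plain C# sketch that does not use NMath; ElementAt is a hypothetical stand-in for the Factor indexer.

```csharp
using System;

object[] levels = { "A", "B", "C" };
int[] data = { 0, 0, 2, 1, 0, 2, 1 };

// Hypothetical stand-in for the indexer: the level of the element at position i.
object ElementAt( int i ) => levels[ data[i] ];

Console.WriteLine( ElementAt( 2 ) );  // C
Console.WriteLine( ElementAt( 3 ) );  // B
```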
The principal use of factors is in conjunction with the GetGroupings() methods on Subset. One overload of this method accepts a single Factor and returns an array of subsets containing the indices for each level of the given factor. Another overload accepts two Factor objects and returns a two-dimensional array of subsets containing the indices for each combination of levels in the two factors.
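The grouping idea can be sketched in plain C# without NMath. The helpers GroupByLevel and GroupByLevels are hypothetical, and lists of integer indices stand in for Subset objects; this shows the technique under those assumptions, not the library's implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not the NMath implementation: one index list per level.
static List<int>[] GroupByLevel( int[] data, int numLevels )
{
    var groups = new List<int>[numLevels];
    for ( int k = 0; k < numLevels; k++ ) groups[k] = new List<int>();
    for ( int i = 0; i < data.Length; i++ ) groups[ data[i] ].Add( i );
    return groups;
}

// Two-factor variant: one index list per combination of levels.
static List<int>[,] GroupByLevels( int[] a, int aLevels, int[] b, int bLevels )
{
    var cells = new List<int>[aLevels, bLevels];
    for ( int r = 0; r < aLevels; r++ )
        for ( int c = 0; c < bLevels; c++ )
            cells[r, c] = new List<int>();
    for ( int i = 0; i < a.Length; i++ ) cells[ a[i], b[i] ].Add( i );
    return cells;
}

// Level indices for two factors over the same five observations.
int[] first = { 0, 0, 1, 1, 0 };
int[] second = { 0, 1, 0, 1, 0 };

List<int>[] groups = GroupByLevel( first, 2 );
Console.WriteLine( string.Join( ", ", groups[0] ) );  // 0, 1, 4
Console.WriteLine( string.Join( ", ", groups[1] ) );  // 2, 3

List<int>[,] cells = GroupByLevels( first, 2, second, 2 );
Console.WriteLine( string.Join( ", ", cells[0, 0] ) );  // 0, 4
Console.WriteLine( string.Join( ", ", cells[1, 1] ) );  // 3
```

Each observation lands in exactly one group per factor and in exactly one cell per pair of factors.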
For example, suppose we weigh human subjects and classify them by sex and age group. The data for 15 subjects might look like this:
| | Male | Female |
|---|---|---|
| Child | 45, 42 | 30, 35, 60, 40 |
| Adult | 182, 170 | 115, 130 |
| Senior | 142, 155 | 115, 110, 123 |
In a DataFrame, each observation would be a row, like so:
DataFrame df = new DataFrame();
df.AddColumn( new DFStringColumn( "Sex" ) );
df.AddColumn( new DFStringColumn( "AgeGroup" ) );
df.AddColumn( new DFIntColumn( "Weight" ) );
df.AddRow( "John Smith", "Male", "Child", 45 );
df.AddRow( "Ruth Barnes", "Female", "Senior", 115 );
df.AddRow( "Jane Jones", "Female", "Adult", 115 );
df.AddRow( "Timmy Toddler", "Male", "Child", 42 );
df.AddRow( "Betsy Young", "Female", "Adult", 130 );
df.AddRow( "Arthur Smith", "Male", "Senior", 142 );
df.AddRow( "Lucy Young", "Female", "Child", 30 );
df.AddRow( "Emma Allen", "Female", "Child", 35 );
df.AddRow( "Roy Wilkenson", "Male", "Adult", 182 );
df.AddRow( "Susan Schwarz", "Female", "Senior", 110 );
df.AddRow( "Ming Tao", "Female", "Senior", 123 );
df.AddRow( "Johanna Glynn", "Female", "Child", 60 );
df.AddRow( "Randall Harvey", "Male", "Adult", 170 );
df.AddRow( "Tom Howard", "Male", "Senior", 155 );
df.AddRow( "Jennifer Watson", "Female", "Child", 40 );
In this case, we're using the subjects' names as row keys.
It is natural to construct factors from the Sex and AgeGroup columns:
Factor sex = df.GetFactor( "Sex" );
Factor age = df.GetFactor( "AgeGroup" );
We can then use these factors in conjunction with the GetGroupings() methods on Subset to create subsets representing the original rows, columns, and cells in Table 4:
Subset[] sexGroups = Subset.GetGroupings( sex );
Subset[] ageGroups = Subset.GetGroupings( age );
Subset[,] cellGroups = Subset.GetGroupings( sex, age );
These subsets can then be used to operate on the relevant portions of the data frame. For instance, this code prints out row means, column means, and cell means for Table 4:
Console.WriteLine( "\nTABLE ROW MEANS" );
for ( int i = 0; i < age.NumberOfLevels; i++ )
{
  double mean = StatsFunctions.Mean(
    df[ df.IndexOfColumn( "Weight" ), ageGroups[i] ] );
  Console.WriteLine( "Mean for {0} = {1}", age.Levels[i], mean );
}

Console.WriteLine( "\nTABLE COLUMN MEANS" );
for ( int i = 0; i < sex.NumberOfLevels; i++ )
{
  double mean = StatsFunctions.Mean(
    df[ df.IndexOfColumn( "Weight" ), sexGroups[i] ] );
  Console.WriteLine( "Mean for {0} = {1}", sex.Levels[i], mean );
}

Console.WriteLine( "\nTABLE CELL MEANS" );
for ( int i = 0; i < sex.NumberOfLevels; i++ )
{
  for ( int j = 0; j < age.NumberOfLevels; j++ )
  {
    double mean = StatsFunctions.Mean(
      df[ df.IndexOfColumn( "Weight" ), cellGroups[i,j] ] );
    Console.WriteLine( "Mean for {0} {1} = {2}",
      sex.Levels[i], age.Levels[j], mean );
  }
}
TABLE ROW MEANS
Mean for Adult = 149.25
Mean for Child = 42
Mean for Senior = 129

TABLE COLUMN MEANS
Mean for Female = 84.2222222222222
Mean for Male = 122.666666666667

TABLE CELL MEANS
Mean for Female Adult = 122.5
Mean for Female Child = 41.25
Mean for Female Senior = 116
Mean for Male Adult = 176
Mean for Male Child = 43.5
Mean for Male Senior = 148.5
See also the Tabulate() convenience methods on class DataFrame, as described in Section 2.11.