The Importance of Graphing Your Data

In his classic book The Visual Display of Quantitative Information, Edward R. Tufte argued that “graphics can be more precise and revealing than conventional statistical computations.” As an example, he described Anscombe’s Quartet: four data sets that have identical simple statistical properties, yet appear very different when graphed.
Anscombe's Quartet
These data sets, each consisting of 11 (x, y) points, were constructed by the statistician Francis Anscombe in 1973.

As previously described, NMath 5.1 and NMath Stats 3.4 include classes for plotting NMath types using the Microsoft Chart Controls for .NET. (Free adapter code is also available for using NMath with Syncfusion Essential Chart.) Let’s use Anscombe’s data to explore how NMath’s new visualization capabilities reveal the differences between the data sets.

First, we’ll load the data into a DoubleMatrix.

DoubleMatrix A = new DoubleMatrix( @"11x8 [
  10.0 8.04  10.0 9.14 10.0 7.46  8.0  6.58
  8.0  6.95  8.0  8.14 8.0  6.77  8.0  5.76
  13.0 7.58  13.0 8.74 13.0 12.74 8.0  7.71
  9.0  8.81  9.0  8.77 9.0  7.11  8.0  8.84
  11.0 8.33  11.0 9.26 11.0 7.81  8.0  8.47
  14.0 9.96  14.0 8.10 14.0 8.84  8.0  7.04
  6.0  7.24  6.0  6.13 6.0  6.08  8.0  5.25
  4.0  4.26  4.0  3.10 4.0  5.39  19.0 12.50
  12.0 10.84 12.0 9.13 12.0 8.15  8.0  5.56
  7.0  4.82  7.0  7.26 7.0  6.42  8.0  7.91
  5.0  5.68  5.0  4.74 5.0  5.73  8.0  6.89 ]" );

Now let’s perform some simple descriptive statistics.

 int groups = 4;
 Slice rows = Slice.All;
 Slice xCols = new Slice( 0, groups, 2 );
 Slice yCols = new Slice( 1, groups, 2 );
 double unbiased = (double)A.Rows / ( A.Rows - 1 );

 Console.WriteLine( "Mean of x: {0}",
   NMathFunctions.Mean( A[ rows, xCols ] ) );
 Console.WriteLine( "Variance of x: {0}",
   NMathFunctions.Variance( A[rows, xCols] ) * unbiased );
 Console.WriteLine( "Mean of y: {0}",
   NMathFunctions.Round( NMathFunctions.Mean( A[rows, yCols] ), 2 ) );
 Console.WriteLine( "Variance of y: {0}",
   NMathFunctions.Round(
     NMathFunctions.Variance( A[rows, yCols] ) * unbiased, 3 ) );

 Console.Write( "Correlation of x-y: " );
 for( int i = 0; i < A.Cols; i += 2 )
 {
   Console.Write( NMathFunctions.Round(
    StatsFunctions.Correlation( A.Col(i), A.Col(i + 1) ), 3 ) + " " );
 }
 Console.WriteLine();
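A note on the `unbiased` factor: the correction n/(n-1) in the code implies that NMathFunctions.Variance divides by n (the population variance), so the code rescales it to the sample variance. A minimal plain-Python sketch of the same adjustment, using only the standard statistics module and the first x column of the matrix above:

```python
import statistics

# x values of Anscombe dataset 1 (column 0 of the matrix above)
x1 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]

n = len(x1)
pop_var = statistics.pvariance(x1)    # population variance: divides by n
sample_var = pop_var * n / (n - 1)    # rescaled by n/(n-1), as in the C# code
print(sample_var)                     # 11.0
print(statistics.variance(x1))        # 11.0 -- the sample variance, computed directly
```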

You can see from the output that the statistics are nearly identical for all four data sets:

Mean of x: [ 9 9 9 9 ]
Variance of x: [ 11 11 11 11 ]
Mean of y: [ 7.5 7.5 7.5 7.5 ]
Variance of y: [ 4.127 4.128 4.123 4.123 ]
Correlation of x-y: 0.816 0.816 0.816 0.817

Now let's fit a linear model to each data set.

 LinearRegression[] lrs = new LinearRegression[groups];

 for( int i = 0; i < groups; i++ )
 {
   Console.WriteLine( "Group {0}", i + 1 );

   bool addIntercept = true;
   lrs[i] = new LinearRegression( new DoubleMatrix( A.Col( 2 * i ) ),
     A.Col( 2 * i + 1 ), addIntercept );
   Console.WriteLine( "equation of regression line: Y = {0} + {1}X",
     Math.Round( lrs[i].Parameters[0], 2 ),
     Math.Round( lrs[i].Parameters[1], 3 ) );

   LinearRegressionParameter param =
     new LinearRegressionParameter( lrs[i], 1 );
   Console.WriteLine( "standard error of estimate of slope: {0}",
     Math.Round( param.StandardError, 3 ) );
   Console.WriteLine( "t-statistic: {0}",
     Math.Round( param.TStatistic( 0 ), 2 ) );

   LinearRegressionAnova anova = new LinearRegressionAnova( lrs[i] );
   Console.WriteLine( "regression sum of squares: {0}",
     Math.Round( anova.RegressionSumOfSquares, 2 ) );
   Console.WriteLine( "residual Sum of squares: {0}",
     Math.Round( anova.ResidualSumOfSquares, 2 ) );
   Console.WriteLine( "r2: {0}", Math.Round( anova.RSquared, 2 ) );

   Console.WriteLine();
 }

Again, the output is nearly identical for each data set:

Group 1
equation of regression line: Y = 3 + 0.5X
standard error of estimate of slope: 0.118
t-statistic: 4.24
regression sum of squares: 27.51
residual sum of squares: 13.76
r2: 0.67

Group 2
equation of regression line: Y = 3 + 0.5X
standard error of estimate of slope: 0.118
t-statistic: 4.24
regression sum of squares: 27.5
residual sum of squares: 13.78
r2: 0.67

Group 3
equation of regression line: Y = 3 + 0.5X
standard error of estimate of slope: 0.118
t-statistic: 4.24
regression sum of squares: 27.47
residual sum of squares: 13.76
r2: 0.67

Group 4
equation of regression line: Y = 3 + 0.5X
standard error of estimate of slope: 0.118
t-statistic: 4.24
regression sum of squares: 27.49
residual sum of squares: 13.74
r2: 0.67

Finally, let's use the new NMath charting functionality to plot each linear regression fit. Note that we use the Compose() method to combine multiple charts into a single composite Chart control.

 List<Chart> charts = new List<Chart>();
 for( int i = 0; i < lrs.Length; i++ )
 {
   charts.Add( NMathStatsChart.ToChart( lrs[i], 0 ) );
 }
 Chart all = NMathStatsChart.Compose( charts, 2, 2,
   NMathChart.AreaLayoutOrder.RowMajor );
 for( int i = 0; i < groups; i++ )
 {
   all.ChartAreas[i].AxisX.Title = "x" + ( i + 1 );
   all.ChartAreas[i].AxisX.Minimum = 2;
   all.ChartAreas[i].AxisX.Maximum = 22;
   all.ChartAreas[i].AxisX.Interval = 4;

   all.ChartAreas[i].AxisY.Title = "y" + ( i + 1 );
   all.ChartAreas[i].AxisY.Minimum = 2;
   all.ChartAreas[i].AxisY.Maximum = 14;
   all.ChartAreas[i].AxisY.Interval = 4;

   all.Series[2 * i].Color = Color.DarkOrange;
   all.Series[2 * i + 1].Color = Color.SteelBlue;
 }
 NMathStatsChart.Show( all );

The charts reveal dramatic differences between the data sets, despite the identical fitted models. Group 1 shows a simple linear relationship, while in Group 2 the relationship is clearly non-linear. Groups 3 and 4 demonstrate how a single outlier can have a large effect on simple statistics: an outlying y value in Group 3 tilts the fitted line, and in Group 4 a single high-leverage point produces the entire apparent relationship.

Ken

References

Anscombe, F. J. (1973). "Graphs in Statistical Analysis". American Statistician 27 (1): 17–21. JSTOR 2682899.
Tufte, Edward R. (2001). The Visual Display of Quantitative Information (2nd ed.). Cheshire, CT: Graphics Press.
