Blog

Posts Tagged ‘PDF’

Distribution Fitting Demo

Monday, April 9th, 2012

A customer recently asked how to fit a normal (Gaussian) distribution to a vector of experimental data. Here’s a demonstration of how to do it.

Let’s start by creating a data set: 100 values drawn from a normal distribution with known parameters (mean = 0.5, variance = 2.0).

int n = 100;
double mean = .5;
double variance = 2.0;
var data = new DoubleVector( n, new RandGenNormal( mean, variance ) );

Now, compute y values based on the empirical cumulative distribution function (CDF), which returns the probability that a random variable X will have a value less than or equal to x–that is, f(x) = P(X <= x). Here’s an easy way to do, although not necessarily the most efficient for larger data sets:

var cdfY = new DoubleVector( data.Length );
var sorted = NMathFunctions.Sort( data );
for ( int i = 0; i < data.Length; i++ )
{
  int j = 0;
  while ( j < sorted.Length && sorted[j] <= data[i] ) j++;
  cdfY[i] = j / (double)data.Length;
}

The data is sorted, then for each value x in the data, we iterate through the sorted vector looking for the first value that is greater than x.

We’ll use one of NMath’s non-linear least squares minimization routines to fit a normal distribution CDF() function to our empirical CDF. NMath provides classes for fitting generalized one variable functions to a set of points. In the space of the function parameters, beginning at a specified starting point, these classes finds a minimum (possibly local) in the sum of the squared residuals with respect to a set of data points.

A one variable function takes a single double x, and returns a double y:

y = f(x)

A generalized one variable function additionally takes a set of parameters, p, which may appear in the function expression in arbitrary ways:

y = f(p1, p2,..., pn; x)

For example, this code computes y=a*sin(b*x + c):

public double MyGeneralizedFunction( DoubleVector p, double x )
{
  return p[0] * Math.Sin( p[1] * x + p[2] );
}

In the distribution fitting example, we want to define a parameterized function delegate that returns CDF(x) for the distribution described by the given parameters (mean, variance):

Func<DoubleVector, double, double> f =
  ( DoubleVector p, double x ) =>
    new NormalDistribution( p[0], p[1] ).CDF( x );

Now that we have our data and the function we want to fit, we can apply the curve fitting routine. We’ll use a bounded function fitter, because the variance of the fitted normal distribution must be constrained to be greater than 0.

var fitter = new BoundedOneVariableFunctionFitter<TrustRegionMinimizer>( f );
var start = new DoubleVector( new double[] { 0.1, 0.1 } );
var lowerBounds = new DoubleVector( new double[] { Double.MinValue, 0 } );
var upperBounds = 
   new DoubleVector( new double[] { Double.MaxValue, Double.MaxValue } );
var solution = fitter.Fit( data, cdfY, start, lowerBounds, upperBounds );
var fit = new NormalDistribution( solution[0], solution[1] );
 
Console.WriteLine( "Fitted distribution:\nmean={0}\nvariance={1}",
  fit.Mean, fit.Variance );

The output for one run is

Fitted distribution: 
mean=0.567334190790594
variance=2.0361207956132

which is a reasonable approximation to the original distribution (given 100 points).

We can also visually inspect the fit by plotting the original data and the CDF() function of the fitted distribution.

ToChart( data, cdfY, SeriesChartType.Point, fit,
  NMathStatsChart.DistributionFunction.CDF );
 
private static void ToChart( DoubleVector x, DoubleVector y,
  SeriesChartType dataChartType, NormalDistribution dist,
  NMathStatsChart.DistributionFunction distFunction )
{
  var chart = NMathStatsChart.ToChart( dist, distFunction );
  chart.Series[0].Name = "Fit";
 
  var series = new Series() {
    Name = "Data",
    ChartType = dataChartType
  };
  series.Points.DataBindXY( x, y );
  chart.Series.Insert( 0, series );
 
  chart.Legends.Add( new Legend() );
  NMathChart.Show( chart );
}

CDF() of fitted distribution

We can also look at the probability density function (PDF) of the fitted distribution, but to do so we must first construct an empirical PDF using a histogram. The x-values are the midpoints of the histogram bins, and the y-values are the histogram counts converted to probabilities, scaled to integrate to 1.

int numBins = 10;
var hist = new Histogram( numBins, data );
 
var pdfX = new DoubleVector( hist.NumBins );
var pdfY = new DoubleVector( hist.NumBins );
for ( int i = 0; i < hist.NumBins; i++ )
{
  // use bin midpoint for x value
  Interval bin = hist.Bins[i];
  pdfX[i] = ( bin.Min + bin.Max ) / 2;
 
   // convert histogram count to probability for y value
   double binWidth = bin.Max - bin.Min;
   pdfY[i] = hist.Count( i ) / ( data.Length * binWidth );
}
 
ToChart( pdfX, pdfY, SeriesChartType.Column, fit,
  NMathStatsChart.DistributionFunction.PDF );

PDF() of fitted distribution

You might be tempted to try to fit a distribution PDF() function directly to the histogram data, rather than using the CDF() function like we did above, but this is problematic for several reasons. The bin counts have different variability than the original data. They also have a fixed sum, so they are not independent measurements. Also, for continuous data, fitting a model based on aggregated histogram counts, rather than the original data, throws away information.

Ken

Probability Distributions in NMath Stats

Tuesday, March 2nd, 2010

Normal Distribution PDF CDF

Gaussian Distribution

Probability distributions are central to many applications in statistical analysis. The NMath Stats library offers a large set of probability distributions, covering most domains of application, all with an easy to use common interface. Each distribution class uses numerically stable accurate algorithms to compute both the probability distribution and the cumulative distribution. In this post we’ll look at some code examples using these distribution classes. All of the charts in this post were generated using the Infragistics WPF tool set.

Available Distributions

The NMath Stats library offers the following set of probability distributions, with each name linked to their API documentation page. More information can be found on the CenterSpace probability distribution landing page, including links to code examples in C# and VB, and more documentation.

Distributions in NMath Stats
Normal Distribution (Gaussian) Log Normal Distribution
Poisson Distribution Geometric Distribution

Weibull Distribution Uniform Distribution

Chi-Square Distribution Binomial Distribution

Negative Binomial Distribution Exponential Distribution

T Distribution F Distribution

Triangular Distribution Logistic Distribution

Beta Distribution Gamma Distribution

Distribution of Running Times

Corvallis hosts many foot races during the year, and this application note analyzes the finishing times data from two of those: the annual Fall Festival 10K Fun Run, and the one-off Strands 5K. The central limit theorem tells us to expect the running times for a foot race (of enough participants) to be normally distributed. But what happens to that distribution when a large prize is offered? The Fall Festival 10K Fun Run offers a prize of exactly $0, where the Strands 5K offered an amazing $10,000 prize.

We can estimate a normal distribution from the two data sets, and then use a Kolmogorov-Smirnov test to determine if the distribution passed the K-S null hypothesis. If the Kolmogorov-Smirnov null hypothesis is not rejected, then under this statistic, the data points are said to be drawn from the reference distribution (in this case the normal distribution).

using System.IO;
using CenterSpace.NMath.Core;
using CenterSpace.NMath.Stats;
 
public Main()
{
  // Load fall festival 10K data and strands 5K data
  StreamReader reader = new StreamReader("fall_festival_times.txt", false);
  DoubleVector fallfestival10k = new DoubleVector(reader);
 
  reader = new StreamReader("strands_times.txt", false);
  DoubleVector strands10k = new DoubleVector(reader);
 
  // Estimate Normal (Gaussian) Distributions and 
  // check the Kolmogorov-Smirnov Test
  NormalDistribution ndist_ff = new NormalDistribution(
    StatsFunctions.Mean(fallfestival10k), StatsFunctions.Variance(fallfestival10k));
  OneSampleKSTest kstest = new OneSampleKSTest(fallfestival10k, ndist_ff);
  bool rejectNH = kstest.Reject; // False
 
  NormalDistribution ndist_s = new NormalDistribution(
    StatsFunctions.Mean(strands10k), StatsFunctions.Variance(strands10k));
  kstest = new OneSampleKSTest(strands10k, ndist_s);
  rejectNH = kstest.Reject;  // True
}

A look at the data makes the results of the Kolmogorov-Smirnov test look plausible.

CDF of Fall Festival 10K & the Strands 5K finishing times.

CDF of Fall Festival 10K & the Strands 5K finishing times with normalized finishing times.


The Strand 5K finishing times are not normally distributed because the big prize prompted many fast runners to show up and many average runners to enjoy the race from the sidelines. This grouped the finishing times around the winner (many close finishers) and so they were no longer normally distributed. This distribution looks more like a Weibull and we can test against that intuition with the code snippit.

  WeibullDistribution wdist_s = new WeibullDistribution(25,4);
 
  kstest = new OneSampleKSTest(strands10k, wdist_s);
  rejectNH = kstest.Reject; // now False

Let’s look at the CDF of this weibull versus the data again.

Strands 5K finishing times with a Weibull CDF.

Normalized Strands 5K finishing times overlayed on a Weibull CDF.

Simple Coin Flipping Example

I’ll present one more simple example using a discrete distribution. The binomial distribution is used for modeling most coin flipping games, as it represents the distribution of successes in a sequence of independent yes/no questions. The binomial distribution is parametrized on the number of trials n, and the probability of each independent success, p. For example, using this binomial distribution we can model, say, the number of heads founds in a sequence of 10 coin flips, using n=10 and p=1/2

Binomial Distribution n=10, p=0.5

Binomial Distribution with n=10 and p=0.5


As expected, the most likely number of heads would occur at 5 (with a probability of 0.246), and the probability of either getting 3, 4, 5, or 6 heads in 10 flips would be the difference of the CDF at 6 and 3, equal to 0.656. Below is the simple C# code used to compute the answers to these questions.

using CenterSpace.NMath.Core;
using CenterSpace.NMath.Stats;
 
public Main()
{
  int number_of_trials = 10;
  int prob_of_success = 0.5;
  BinomialDistribution dist = 
    new BinomialDistribution(number_of_trials, prob_of_success);
 
  // Probability of landing 5 heads in 10 flips ( = 0.246)
  Double five_heads = dist.PDF(5); 
 
  // Probability of landing 3, 4, 5, or 6 heads in 10 flips ( = 0.656)
  Double three_to_six_heads = dist.CDF(6) - dist.CDF(3); 
}

I hope these code examples can help you get started using the NMath Stats distribution classes quickly and correctly.

-Happy Computing,

Paul