Blog

Archive for the ‘NMath Stats’ Category

July Release of NMath and NMath Stats

Tuesday, June 8th, 2010

We are currently working hard on our upcoming July release of our NMath and NMath Stats C# math libraries. This release will add many new features from Runge-Kutta to automatic Peak Finding algorithms as well as address our most frequent support requests. Developers using our math libraries currently will find the new release build-compatible with the prior release. Upgrades are provided free of charge to customers with current annual maintenance contracts.  Maintenance contracts are available through our webstore.

Pure C#

Both libraries are now supported by a new pure C# math kernel doing away with our old C++ kernel. Because we are now a pure .NET assembly, deployment of NMath based applications is simplified by eliminating the Microsoft C++ runtime library dependency. As with all releases we will be posting our updated performance benchmarks at the time of the release.

Full Control

Additionally, our libraries have been re-architected to dynamically link to both native numerical libraries and ( and perhaps more importantly for our customers ) the Intel OMP threading library (libiomp.dll) . This means that our customers will have complete control over the threading library. In the past, we statically linked in OMP. Now, we are picking up OMP dynamically and thereby avoid collisions between statically and dynamically linked OMP libraries. In a nutshell, NMath will now play more nicely with libraries from other vendors.

New Features

Now for the fun stuff. The table below summarized the new features in NMath 4.1 and NMath Stats 3.2.

Product Feature Summary
NMath 4.1 Savitzky-Golay derivatives Class generates correlation coefficients to compute the smoothed Savizky-Golay derivatives of sampled data.
Savitzky-Golay smoothing See blog article on SG smoothing
Peak Finding Class finds peaks in sampled data using Savitzky-Golay smoothed polynomials and their derivatives.
Runge-Kutta ODE solver Class for solving ODE’s
Bounded function fitting Class for fitting general nonlinear models with bounds on the parameters. Also see this blog article for code examples of bounded nonlinear curving fitting.
Correlated random number generators Class creates streams of induced correlated random numbers typically for simulation studies using Monte Carlo.
NMath Stats 3.2 Johnson System of distributions The Johnson system of distributions is based on three possible transformations of a normal distribution–exponential, logistic, and hyperbolic sine–plus the identity transformation:

X = xi + (lambda * T((z – gamma) / delta))

where z is a standard normal random variable, xi and lambda are shape parameters, delta is a scale parameter, gamma is a location parameter, and T is the transformation.

Kruskal-Wallis rank sum test The Kruskal-Wallis rank sum test is a non-parametric test for equality of population medians among groups. It is a non-parametric version of the classical one-way ANOVA.
Regression statistics for PolynomialLeastSquares [see below]
Regression statistics for OneVariableFunctionFitter Class provides a variety of regression statistics including the residual sum of squares, R squared, adjusted R squared, F statistic, and others.

I hope you find these new additions to the library useful in your application work. If you are looking for something specific that isn’t currently supported in our library, please contact us. We build custom numeric classes for existing and new customers on a regular basis.

Happy Computing,
Paul

  • Share/Bookmark

Non-linear Curve Fitting

Monday, April 12th, 2010

SIGA, a public biotechnology company, recently hired CenterSpace consultants to refine their logistic modeling capabilities. Modeling a dose-response system with a logistic curve is one important special case of the more general non-linear curve fitting problem. If you are familiar with linear regression and related statistics, the non-linear case closely parallels the linear case making the jump to non-linear curve fitting easier.

SIGA Human BioArmor
SIGA is a world leader in designing and developing novel countermeasures to prevent and treat serious infectious diseases, with an emphasis on biological warfare defense.

In a previous blog post, Ken outlined the techniques for using NMath to compute various common linear trend lines and non-linear trend lines linearizable via a simple variable substitution.  This post extends that curve fitting article to the non-linear case.

Non-linear Curving Fitting – The Logistic

The logistic model is a fundamental non-linear model for many systems, and is widely used in the life sciences, medicine, and environmental toxicology.

Non-linear curve fit with logistic model

Non-linear curve fit with logistic model


This image shows a fit of a 4-parameter logistic model to the measured inhibitory response of an infectious agent to a treatment at various drug dose levels – this is a classic dose-response curve.  The 4-parameter model was used here because the underlying physical process is expected to be symmetric, and is defined as:

a - response at x = 0
b - slope factor
c - mid-range (50%) concentration
d - response as x -> infinity

The 5-parameter logistic adds an additional asymmetry parameter that allow the leading and trailing tails to have different curvatures. It is defined as:

a - response at x = 0
b - slope factor
c - mid-range (50%) concentration
d - response as x -> infinity
g - asymmetry factor

Using NMath the following C# code fits the four-parameter logistic to the given data. By extending the initial conditions vector by one element, and changing the model function delegate we could use the same code to fit the FiveParameterLogistic function.

using CenterSpace.NMath.Core;
using CenterSpace.NMath.Analysis;
 
 public void NonlinearFitExampleCode()
 {
    // Make some synthetic data
    DoubleVector xValues = new DoubleVector("30000 10000 3000 500 200 60  25 10");
    DoubleVector yValues = new DoubleVector("112.8 115 131 118 26 1.7 -0.8 -1.4");
 
    // This is the function delegate that the curve fitting 
    // class will use - any delegate can be used.
    NMathFunctions.GeneralizedDoubleUnaryFunction f 
      = AnalysisFunctions.FourParameterLogistic;
 
    // The starting point in the function parameter space.
    DoubleVector start = new DoubleVector("0.1 0.1 0.1 0.1");
 
    // Construct a curve fitting object for our function, then perform the fit.
    OneVariableFunctionFitter fitter 
      = new OneVariableFunctionFitter(f);
    DoubleVector solution = fitter.Fit(xValues, yValues, start);
 
    // Display solution
    Console.WriteLine("4 Parameter Logistic Model Parameters: {0}", solution.ToString("0.####"));
}

The model parameters are returned in a vector, in the order defined above [a, b, c, d].

4 Parameter Logistic Model Parameters:
  [ -0.1755 6.0556 246.8343 119.6108 ]

The fitting class OneVariableFunctionFitter is based on the Trust Region algorithm which is an extension of the Levenberg-Marquardt minimization algorithm. As used in the example above, the partial derivatives with respect to the parameters are numerically estimated. In our experience numerically estimating the partials works very well. However, for improved precision and performance it is also possible to explicitly specify delegates for the partials, using the property:

 NMathFunctions.GeneralizedDoubleUnaryFunction[] PartialDeriviates 

If the curvature of the function surface is low, as is typically the case for the logistic model, and the local linearized estimate closely approximates the function surface during the minimization process, the partial derivatives are not even evaluated for performance reasons.

Bounded Parameters

Because the OneVariableFunctionFitter is based on the Trust Region algorithm, it is possible to place bounds on the parameters. This is a very powerful feature, which is frequently employed when modeling a dose-response curve with a logistic. There’s an old debate between the chemists and biologists regarding the dose-response curve on whether the curve should be relative or absolute. In the image above, a relative dose-response curve is shown, as there were no bounds placed on the four parameters. However, we can create an absolute dose-response curve by adding the constraints 100 >= a >= 0 and 0 <= d <= 100 and re-fitting the curve. This will enforce all predicted responses to be between 0% and 100%.

Logistic model fit with parameter bounds

Logistic model fit with parameter bounds 100 >= a >= 0 and 0 <= d <= 100

Here is a C# code example on how to do this with NMath.

using CenterSpace.NMath.Core;
using CenterSpace.NMath.Analysis;
 
 public void BoundedNonlinearFitExampleCode()
 {
 
  // Make some synthetic data
  DoubleVector xValues = new DoubleVector("30000 10000 3000 500 200 60  25 10");
  DoubleVector yValues = new DoubleVector("112.8 115 131 118 26 1.7 -0.8 -1.4");
 
  // The starting point in the function parameter space.
  DoubleVector start = new DoubleVector("0.1 0.1 0.1 100.0");
 
  // 4-parameter bounds in the order of a, b, c, & d
  DoubleVector lowerBounds = new DoubleVector("0.0 0.0 0.0 0.0");
  DoubleVector upperBounds = new DoubleVector("100.0 100000.0 100000.0 100.0");
 
  // Convert the 4-parameter logistic to a Func<>
  Func<Double[], Double, Double> fitFunction = 
    new Func<double[], double, double>((double[] soln, Double x) => AnalysisFunctions.FourParameterLogisticFunction(new   DoubleVector(soln), x));
 
  // This class is not included with the current release.
  FunctionFitterWithBounds fitter = 
    new FunctionFitterWithBounds(fitFunction, lowerBounds, upperBounds);
 
  //Now define and fit the data as shown above.
  DoubleVector solution = fitter.Fit(xValues, yValues, start);
 
  // Display solution
   Console.WriteLine("4 Parameter Logistic Model Parameters: {0}", solution.ToString("0.####"));
}

The bounded model parameters returned, in the order defined above [a, b, c, d] are:

4 Parameter Logistic Model Parameters: [ 0 0.1652 0.787 100 ]

Note that the first and last parameters are now within the specified bounds.

The class FunctionFitterWithBounds is unreleased (and won't be in this form) but you can download it and use it if you have the latest release of NMath installed. The non-linear curving fitting NMath API will be extended and refined in our next NMath release, and bounded non-linear curve fitting will be more clearly accessible.

  • Click here to download the free C# FunctionFitterWithBounds class.

Goodness-Of-Fit Model Statistics

After we are done fitting our model, the fit may look good, but really how good is the fit? To better understand the goodness-of-fit of our model, we need to analyze the residuals of the fit and related statistics. The next release of NMath Stats (expected in June) will offer a new class GoodnessOfFit that will provide a complete set of statistical measures that can be used to assess the goodness-of-fit for a non-linear model, including the F-Statistic, t-Statistic, and parameter confidence intervals. Please contact us if you are interested in obtaining an advanced copy of this class.

Happy Computing,
Paul

Resources

  • The five-parameter logistic: A characterization and comparison with the four-parameter logistic, Paul G. Gotschalk, John R. Dunn, Analytical Biochemistry 343 (2005) pp 54-65.
  • Principles of Curve Fitting for Multiplex Sandwich Immunoassys, D. Davis, A. Zhang, C. Etienne, I. Huang, & M. Malit, tech note 2861.
  • A brief technical introduction to Trust Region methods, by Mark S. Gockenbach
    • Share/Bookmark

Probability Distributions in NMath Stats

Tuesday, March 2nd, 2010
Normal Distribution PDF CDF

Gaussian Distribution

Probability distributions are central to many applications in statistical analysis. The NMath Stats library offers a large set of probability distributions, covering most domains of application, all with an easy to use common interface. Each distribution class uses numerically stable accurate algorithms to compute both the probability distribution and the cumulative distribution. In this post we’ll look at some code examples using these distribution classes. All of the charts in this post were generated using the Infragistics WPF tool set.

Available Distributions

The NMath Stats library offers the following set of probability distributions, with each name linked to their API documentation page. More information can be found on the CenterSpace probability distribution landing page, including links to code examples in C# and VB, and more documentation.

Distributions in NMath Stats
Normal Distribution (Gaussian) Log Normal Distribution
Poisson Distribution Geometric Distribution

Weibull Distribution Uniform Distribution

Chi-Square Distribution Binomial Distribution

Negative Binomial Distribution Exponential Distribution

T Distribution F Distribution

Triangular Distribution Logistic Distribution

Beta Distribution Gamma Distribution

Distribution of Running Times

Corvallis hosts many foot races during the year, and this application note analyzes the finishing times data from two of those: the annual Fall Festival 10K Fun Run, and the one-off Strands 5K. The central limit theorem tells us to expect the running times for a foot race (of enough participants) to be normally distributed. But what happens to that distribution when a large prize is offered? The Fall Festival 10K Fun Run offers a prize of exactly $0, where the Strands 5K offered an amazing $10,000 prize.

We can estimate a normal distribution from the two data sets, and then use a Kolmogorov-Smirnov test to determine if the distribution passed the K-S null hypothesis. If the Kolmogorov-Smirnov null hypothesis is not rejected, then under this statistic, the data points are said to be drawn from the reference distribution (in this case the normal distribution).

using System.IO;
using CenterSpace.NMath.Core;
using CenterSpace.NMath.Stats;
 
public Main()
{
  // Load fall festival 10K data and strands 5K data
  StreamReader reader = new StreamReader("fall_festival_times.txt", false);
  DoubleVector fallfestival10k = new DoubleVector(reader);
 
  reader = new StreamReader("strands_times.txt", false);
  DoubleVector strands10k = new DoubleVector(reader);
 
  // Estimate Normal (Gaussian) Distributions and 
  // check the Kolmogorov-Smirnov Test
  NormalDistribution ndist_ff = new NormalDistribution(
    StatsFunctions.Mean(fallfestival10k), StatsFunctions.Variance(fallfestival10k));
  OneSampleKSTest kstest = new OneSampleKSTest(fallfestival10k, ndist_ff);
  bool rejectNH = kstest.Reject; // False
 
  NormalDistribution ndist_s = new NormalDistribution(
    StatsFunctions.Mean(strands10k), StatsFunctions.Variance(strands10k));
  kstest = new OneSampleKSTest(strands10k, ndist_s);
  rejectNH = kstest.Reject;  // True
}

A look at the data makes the results of the Kolmogorov-Smirnov test look plausible.

CDF of Fall Festival 10K & the Strands 5K finishing times.

CDF of Fall Festival 10K & the Strands 5K finishing times with normalized finishing times.


The Strand 5K finishing times are not normally distributed because the big prize prompted many fast runners to show up and many average runners to enjoy the race from the sidelines. This grouped the finishing times around the winner (many close finishers) and so they were no longer normally distributed. This distribution looks more like a Weibull and we can test against that intuition with the code snippit.

  WeibullDistribution wdist_s = new WeibullDistribution(25,4);
 
  kstest = new OneSampleKSTest(strands10k, wdist_s);
  rejectNH = kstest.Reject; // now False

Let’s look at the CDF of this weibull versus the data again.

Strands 5K finishing times with a Weibull CDF.

Normalized Strands 5K finishing times overlayed on a Weibull CDF.

Simple Coin Flipping Example

I’ll present one more simple example using a discrete distribution. The binomial distribution is used for modeling most coin flipping games, as it represents the distribution of successes in a sequence of independent yes/no questions. The binomial distribution is parametrized on the number of trials n, and the probability of each independent success, p. For example, using this binomial distribution we can model, say, the number of heads founds in a sequence of 10 coin flips, using n=10 and p=1/2

Binomial Distribution n=10, p=0.5

Binomial Distribution with n=10 and p=0.5


As expected, the most likely number of heads would occur at 5 (with a probability of 0.246), and the probability of either getting 3, 4, 5, or 6 heads in 10 flips would be the difference of the CDF at 6 and 3, equal to 0.656. Below is the simple C# code used to compute the answers to these questions.

using CenterSpace.NMath.Core;
using CenterSpace.NMath.Stats;
 
public Main()
{
  int number_of_trials = 10;
  int prob_of_success = 0.5;
  BinomialDistribution dist = 
    new BinomialDistribution(number_of_trials, prob_of_success);
 
  // Probability of landing 5 heads in 10 flips ( = 0.246)
  Double five_heads = dist.PDF(5); 
 
  // Probability of landing 3, 4, 5, or 6 heads in 10 flips ( = 0.656)
  Double three_to_six_heads = dist.CDF(6) - dist.CDF(3); 
}

I hope these code examples can help you get started using the NMath Stats distribution classes quickly and correctly.

-Happy Computing,

Paul

  • Share/Bookmark

Drawing Dendrograms

Wednesday, February 3rd, 2010

A dendrogram is a tree diagram often used to visualize the results of hierarchical clustering. A customer recently contacted us asking for help drawing dendrograms from the output of the hierarchical clustering algorithm in NMath Stats. (For more information in hierarchical clustering in NMath Stats, see this post.)

UserZoom is an international software company specializing in web customer experience and usability testing. UserZoom software includes a tool to run online card sorting exercises with real users. Card Sorting is a technique that information architects and user experience practitioners use to explore how “real people” group items and content. UserZoom developers were using NMath hierarchical clustering to analyze card sorting results, and wanted to generate dendrograms to help researchers understand participant’s grouping criteria.

In NMath Stats, class ClusterAnalysis performs hierarchical cluster analyses. For example, the following C# code clusters 6 random vectors of length 3:

int seed = 123;
DoubleMatrix data =
  new DoubleMatrix(6, 3, new RandGenUniform(1, 10, seed));
ClusterAnalysis ca = new ClusterAnalysis(data);

The random seed is specified so we can get repeatable results.

After construction, the Linkages property gets an (n-1) x 3 matrix containing the complete hierarchical linkage tree. At each level in the tree, columns 1 and 2 contain the indices of the clusters linked to form the next cluster. Column 3 contains the distances between the clusters.

Console.WriteLine(ca.Linkages.ToTabDelimited());

The output is:

3	4	2.11126489430827
1	5	2.50743075649221
2	6	2.59997959766694
7	8	3.61586096606758
0	9	4.37499686772813

Each object is initially assigned to its own singleton cluster, numbered 0 to 5. The analysis then proceeds iteratively, at each stage joining the two most similar clusters into a new cluster, continuing until there is one overall cluster. The first new cluster formed is assigned index 6, the second is assigned index 7, and so forth. When these indices appear later in the tree, the clusters are being combined again into a still larger cluster.
(more…)

  • Share/Bookmark

Cluster Analysis, Part V: Monte Carlo NMF

Monday, January 11th, 2010

In this continuing series, we explore the NMath Stats functions for performing cluster analysis. (For previous posts, see Part 1 – PCA , Part 2 – K-Means, Part 3 – Hierarchical, and Part 4 – NMF.) The sample data set we’re using classifies 89 single malt scotch whiskies on a five-point scale (0-4) for 12 flavor characteristics. To visualize the data set and clusterings, we make use of the free Microsoft Chart Controls for .NET, which provide a basic set of charts.

In this post, the last in the series, we’ll look at how NMath provides a Monte Carlo method for performing multiple non-negative matrix factorization (NMF) clusterings using different random starting conditions, and combining the results.

NMF uses an iterative algorithm with random starting values for W and H. This, coupled with the fact that the factorization is not unique, means that if you cluster the columns of V multiple times, you may get different final clusterings. The consensus matrix is a way to average multiple clusterings, to produce a probability estimate that any pair of columns will be clustered together.
To compute the consensus matrix, the columns of V are clustered using NMF n times. Each clustering yields a connectivity matrix. Recall that the connectivity matrix is a symmetric matrix whose i, jth entry is 1 if columns i and j of V are clustered together, and 0 if they are not. The consensus matrix is also a symmetric matrix, whose i, jth entry is formed by taking the average of the i, jth entries of the n connectivity matrices.
Thus, each i, jth entry of the consensus matrix is a value between 0, when columns i and j are not clustered together on any of the runs, and 1, when columns i and j were clustered together on all runs. The i, jth entry of a consensus matrix may be considered, in some sense, a “probability” that columns i and j belong to the same cluster.

NMF uses an iterative algorithm with random starting values for W and H. (See Part IV for more information on NMF.) This, coupled with the fact that the factorization is not unique, means that if you cluster the columns of V multiple times, you may get different final clusterings. The consensus matrix is a way to average multiple clusterings, to produce a probability estimate that any pair of columns will be clustered together.
(more…)

  • Share/Bookmark