Blog

Archive for the ‘NMath Stats Tutorial’ Category

Calling External .NET Libraries from Excel

Wednesday, December 8th, 2010

There are many circumstances where you may need to access an external library of functions or routines from Excel.  For example, if you need a complex function such as fitting data to a surface, or portfolio optimization, that is not natively available in Excel.  There also may be a need to protect proprietary calculations by using user defined functions to process algorithms in a black box manner.  I was looking for a way to rapidly prototype some calculations without setting up a complex development environment.

Harking back to the old rule of development projects of  “two out of three”, when the three metrics are fast, cheap, and quality.  On any time limited project you only can plan to achieve two metrics, never all three. Initially I like to dive in and start with fast and cheap and work my way towards quality as necessary.  So, we’ll start with the quick and dirty approach to calling external libraries from Excel.

Project Setup

You must have a version of .NET Framework of 2.0 or greater.  The latest version is free from Microsoft at this link.
http://www.microsoft.com/net/download.aspx.

You’ll also need:

  • Excel 97 or later.
  • External library assemblies that you need to access from Excel. In our case we will use Centerspace’s NMath.dll and NMathStats.dll.
  • A freeware product called ExcelDNA written by Govert van Drimmelen that can be downloaded at http://exceldna.codeplex.com/ .

The first order of business is to unpack the downloaded file, ExcelDNA.zip, into a working directory.  For our example, we will use CenterSpaceExcel as our directory name. After unpacking you should have two folders Distribution and Source in our CenterSpaceExcel directory.  Inside the Distribution folder locate the file ExcelDNA.xll and rename it to NMathExcel.xll.

We now need to locate in the same directory the file ExcelDna.dna and rename it to NMathExcel.dna.  Then using notepad, or your favorite code editor, and open the file NMathExcel.dna.

You should see the following code:

<DnaLibrary>
<![CDATA[ 
     Public Module Module1   
       Function AddThem(x, y)      
          AddThem = x + y   
       End Function 
    End Module 
]]>
</DnaLibrary>
Assuming CenterSpace NMath and NStat are installed in the standard locations. Change it to read as follows and save:
<DnaLibrary>
 
<Reference Name="CenterSpace.NMath.Core" />
<Reference Path="C:\Program Files\CenterSpace\NMath 4.1\Assemblies\NMath.dll" />
<Reference Name="CenterSpace.NMath.Stats" />
<Reference Path="C:\Program Files\CenterSpace\NMath Stats 3.2\Assemblies\NMathStats.dll" />
 
<![CDATA[
	Imports NMath = CenterSpace.NMath.Core
	Imports Stats = CenterSpace.NMath.Stats
 
	Public Module NMathExcel
 
		<ExcelFunction(Description:="Returns x to the y power")> _
	    	Function NPower(x as double, y As double) As double
			NPower = NMath.NMathFunctions.PowFunction(x, y)
		End Function
 
		<ExcelFunction(IsMacroType:=True, IsVolatile:=True)> _
		Function NRand() As double
			dim rand As New NMath.RandGenMTwist
			NRand = rand.Next53BitRes()
		End Function
 
		<ExcelFunction(Description:="Binomial Distribution: Number of Successes, Trials, Probability, Cumulative is True or False")> _
		Function NBinomDist(NSuccess As Int32, NTrials As Int32, Prob As double, Cumul As Boolean) As double
			dim nbin As New Stats.BinomialDistribution
			nbin.N = NTrials
			nbin.P = Prob
			IF Cumul
				NBinomDist = nbin.CDF(NSuccess)
			Else
				NBinomDist = nbin.PDF(NSuccess)
			End If
		End Function
 
		Function NDoubleMatrixRand(rsize As integer, csize As integer, RandLBx As integer, RandUBy As integer) As Object(,)
			dim rng As New NMath.RandGenUniform(RandLBx,RandUBy)
			Rng.Reset(&#038;H124)
			dim TempA As New NMath.DoubleMatrix(rsize, csize, Rng)
			NDoubleMatrixRand = NCopyArray(TempA, rsize, csize)
 
		End Function
 
		Function NCopyArray(IMatrix As Object, rsize As integer, csize As integer) As Object(,)
			dim i As Integer
			dim j As Integer
			dim OArray(rsize, csize) As Object
			for i = 0 to rsize - 1
			   for j = 0 to csize - 1
				OArray(i,j) = IMatrix(i,j)
			   next j
			next i
			NCopyArray = OArray
		End Function		
 
	End Module
]]>

We now have created the VB code to call our CenterSpace Math and Statistics libraries with the following five functions.

  1. The first function shows a simple math library call to the Power function which takes a number x and raises it to the y power and returns the value.
  2. The second function shows a call to obtain a fast random number from the math library.  Since we want a new number each time the spreadsheet is re-calculated we have made the function volatile.
  3. The third function call shows how to set values that need to be accessed by a function in our .NET assemble; in this case, the Binomial Distribution.
  4. The fourth function demonstrates the creation of a DoubleMatrix that is the filled with random uniformly distributed numbers.
  5. The fifth function is a helper sub-routine to transfer data across the com interface.

Test our setup in Excel

Open Excel and move your cursor the Tools menu item.  Usually towards the bottom of the drop down menu you will find the selection Add-Ins.  After selecting Add-Ins, you see the pop-up window with the option to select Microsoft supplied Add-ins.  Choose the Browse option and go to the working directory we created at the beginning.  In our case, this will be the CenterSpaceExcel directory.  Next select the Distribution folder and you should see the renamed file: NMathExcel.xll.  Select it and you should now see the following screen.

Selecting a user created XLL as an Add-in for Excel

Make sure NMathExcel is checked and click OK. If you get an error and this point it is probably due to a typo in the DNA file, otherwise you will get the expected new sheet ready for entry.

Select an empty cell and then select from the menu bar Insert then from the pulldown Function.  You should see the following pop-up.

Selecting the category containing our NMath functions

At the bottom of the category pull down you should see our NMathExcel Functions;  Select it and you should have these options.:

NMath Excel Function Category

If we choose NPower, we will get the next screen,

Calling NMath Library Power function in Excel

I arbitrarily typed the value of 3.2 for x and 3.327 for y.  You can see the result of 47.9329301 before selecting OK.

Select OK and Excel will insert the value into the cell.  Select another blank cell and this time choose our NRand() function.  You will notice there is no opportunity to enter values and finish by selecting OK.  At this point you should see a number between 0 and 1 in the cell.  Each time you press F9 (sheet recalc) a new random number will appear.  If we had not made this function volatile the number would not change unless you edit the cell.

To test our Binomial Distribution function, again we will select a new cell and use the insert function option to insert the NBinomDist function with the following values.

Calling NMath Statistical function Binomial Distribution from Excel

At this point we have made successful calls into both of CenterSpace’s NMath and NMath Stats .NET math libraries.

In our fourth example, we will see how Excel handles matrices and look at issues passing array arguments across the COM interface.  Excel 2003 was limited to a maximum of 60,000 cells in an array, but Excel 2007 was expanded to handle 2 million.  Excel has some quirky ways of displaying matrices, and I’ll cover the in’s and out’s of these quirks.

We have written the basic code to set up a function called NDoubleMatrixRand for the purpose of creating a matrix with supplied dimensions and filled with uniform Random numbers over a specified distribution.  We will select another blank cell and again go to insert function and this time choose NDoubleMatrixRand.  Suppose we want to create a 6×6 matrix filled with random numbers between -2 and 2.  Our input will look like the following screen.

Creating a DoubleMatrix in Excel using NMath

Notice the equal sign in the middle right of the above screen is equal to {-0.994818527251482,-0.08

Values inclosed in curly brackets that are separated by commas indicates that an matrix was actually created, but you can see the Formula result is only displaying a partial value due to display size. At this point when you select OK, you will have a cell with a single value.  Here is where the fun begins.  Start at the cell and drag a 6×6 range as shown in the following screen.

Selecting the area the matrix is to be displayed in

Now get your fingers limbered. Here is where it gets a bit obscure – do exactly as follows.

  • Press the F2 key.  (pressing F2 may be  optional but is recommended by Excel as the cell leaves the edit mode)
  • Press and hold the Ctrl key followed by
  • pressing and holding the Shift key followed by
  • pressing the Enter key

and presto chango!  You should see a screen like this.

Displaying a Matrix in Excel

Notice that your cell’s formula is now enclosed in { }, indicating to Excel that the contained formula is an array function.  This is the only way to get matrices displayed.  Also, if you try to edit this cell you will get an error that changes are not allowed.  If  you want to change the dimensions simply reference the values from another cell when you create the function.

The fifth function NCopyArray copies the library matrix across the COM bridge into an Excel array object.  As I stated in the beginning this would be a quick and dirty approach and would leave room for improvement.

Summary

In my next post, I will provide the above code in C# and add more function calls with some matrices with hopefully an improved approach to NCopyArray.  Future posts will include creating a packaged XLL and a more complex example such as curve fitting.

Since time is our most precious asset, being able to quickly access complex math functions with a general purpose tool like Excel should save time and money!

At CenterSpace, we are interested if this blog is helpful and if there is a need for more examples of how our libraries can be accessed by Excel.  Let us know what areas are of interest to you.

Mike  Magee

Thanks and Resources
Also, a special thanks to Govert van Drimmelen for writing a wonderful tool such as ExcelDNA.

Statistical Quality Control Charts

Wednesday, August 11th, 2010

Statistical quality control charts, or Shewart quality control charts, are used across nearly all sectors of industry to maintain and improve product quality. Quality control charts provide a means to detect when a time varying process exceeds its historic process variation and needs analysis and/or intervention to remedy the out-of-control process (known as special cause variation). These process control charts are independent of any engineering decision-making about the particular process at hand, but are instead based on the statistical nature of the process itself. This standardized statistical control framework was created and refined by Walter Shewart at Bell Telephone Laboratories from 1925 to his retirement in 1956. It is this independence of process details that make Mr. Shewart’s techniques powerful, widely applicable, decision-making aids.

With their ongoing partnership, CenterSpace Software and Nevron have teamed up to create some free code examples for creating Shewart charts.

Quality Chart Types

Statistical quality control charts can be generally divided into two categories, those for tracking discrete attribute variables (e.g. a pass/fail test), and those for tracking continuous process variables (e.g. pipe diameter, temperature).

Chart Process Observation Process Observation Variable
X-bar and R chart Quality characteristic measurement within one subgroup Variables
X-bar and s chart Quality characteristic measurement within one subgroup Variables
Shewhart individuals control chart (I-R chart or I chart) Quality characteristic measurement for one observation Variables
Three-way chart Quality characteristic measurement within one subgroup Variables
p-chart Fraction nonconforming within one subgroup Attributes
np-chart Number nonconforming within one subgroup Attributes
c-chart Number of nonconformances within one subgroup Attributes
u-chart Nonconformances per unit within one subgroup Attributes

The statistical modeling language, “R”, provides a package (qcc) for creating these and other statistical process control charts. This R package was created by Luca Scrucca and is actively maintained and can be found in CRAN repository.

c-chart generated the R package qcc

c-chart generated by R package qcc

u-chart generated by the R package qcc

u-chart generated by R package qcc

These two images demonstrate the standard look of the ‘c’ and ‘u’ attribute quality control chart. Some typical chart features include the highlighting of out-of-control data points and time varying upper and lower control limits. The charts generated by the R qcc package have served as our standard for recreating these in the .NET / C# development environment. The real world data used in our examples below was copied from the qcc package so direct comparisons can be made.

Creating a Quality Chart with .NET

To integrate these quality controls charts into a .NET/C# data driven quality monitoring application, we need both a statistical analysis library and a visualization tool that can manage the special chart style demanded by quality control engineers. CenterSpace, in partnership with Nevron, has created an extensible example application to build these types of specialized charts. Once you have these free helper classes, building an attribute u-chart is as simple or simpler than prototyping charts in R.

    public void UChart()
    {
 
      // u-Chart sample data
      // This data-set was copied from the 'dyedcloth' data set packaged with
      // the R-package qcc by Luca Scrucca
      //
      // Example Data Description
      // In a textile finishing plant, dyed cloth is inspected for the occurrence of
      // defects per 50 square meters.
      // The data on ten rolls of cloth are presented
      //    x number of nonconformities per 50 square meters (inspection units)
      //    samplesize number of inspection units in roll (variable sample size
      DoubleVector x =
        new DoubleVector(14, 12, 20, 11, 7, 10, 21, 16, 19, 23);
      DoubleVector samplesize =
        new DoubleVector(10.0, 8.0, 13.0, 10.0, 9.5, 10.0, 12.0, 10.5, 12.0, 12.5);
 
      // This builds the statistical information for the drawing the chart.
      IAttributeChartStats stats_u = new Stats_u(x, samplesize);
 
      // Build and display the Nevron u-Chart visualization
      NevronControlChart.AutoRefresh = true;
      NevronControlChart.Clear();
      AttributeChart cChart =
        new AttributeChart(stats_u, this.NevronControlChart);
 
    }

This code creates a u-Chart that looks like this below.

u-Chart, or Unit Chart

For those familiar with the aforementioned R-package qcc, these .NET/C# classes follow the same R naming convention for the particular chart statistics objects, but with an improved object model. So as seen in this example, the u-chart statistics are contained in a class named Stats_u, similar to the R stats.u command. Each of these statistical chart objects implements either an IAttributeChartStats or an IVariableChartStats interface, which is used by the chart generating class (AttributeChart) as seen in the last line of the code above.

Building control charts boils down to three steps using these example classes.

  1. Build the necessary data vectors.
  2. Build the desired chart’s statistics object,
    e.g. IAttributeChartStats Stats = new Stats_c(DoubleVector data);
  3. Show chart using Nevrons .NET chart control,
    e.g. new AttributeChart(IAttributeChartStats Stats, NChartControl Chart)

Free Example Code

The example code now available on github can currently create all four essential attribute quality control charts, as seen below.


c-Chart, or Count Chart

c-Chart, or Count Chart


u-Chart, or Unit Chart

p-Chart, Percentage Chart

np-Chart

To download and run these examples just navigate to our Nevron / CenterSpace github repository and either click on the “Download Source” button in the upper right-hand corner and download either a .zip or .tar file of the project, or just clone the repository. For those unfamiliar with git, git is a source code control system designed specifically for collaborative projects such as this one. To clone the project, after installing git, simply type at your command prompt:

 git clone git@github.com:MilenMetodiev/CenterSpaceNevronExamples.git

This will create a clone of this project code at your current drive location in a directory call “CenterSpaceNevronExamples”.

Other Quality Control Charts and Future Development

Currently we have only implemented the attribute control charts. Other common quality system charts including EWMA (exponential weighted moving average), Pareto, and CumSum (cumulative sum) have not been implemented in this example, but can be using the same tool set and class patterns established in this example. If you would like help or need any assistance in getting the project running or extending this to other chart types, drop us an email .

Happy Computing,
-Paul

Resources

  • The table above is adapted from the Wikipedia control chart article.
  • “qcc: An R package for quality control charting and statistical process control”, R News, Volume 4/1 , June 2004.
  • The standard qcc documentation from the CRAN project was very helpful with this project.

Cluster Analysis, Part V: Monte Carlo NMF

Monday, January 11th, 2010

In this continuing series, we explore the NMath Stats functions for performing cluster analysis. (For previous posts, see Part 1 – PCA , Part 2 – K-Means, Part 3 – Hierarchical, and Part 4 – NMF.) The sample data set we’re using classifies 89 single malt scotch whiskies on a five-point scale (0-4) for 12 flavor characteristics. To visualize the data set and clusterings, we make use of the free Microsoft Chart Controls for .NET, which provide a basic set of charts.

In this post, the last in the series, we’ll look at how NMath provides a Monte Carlo method for performing multiple non-negative matrix factorization (NMF) clusterings using different random starting conditions, and combining the results.

NMF uses an iterative algorithm with random starting values for W and H. This, coupled with the fact that the factorization is not unique, means that if you cluster the columns of V multiple times, you may get different final clusterings. The consensus matrix is a way to average multiple clusterings, to produce a probability estimate that any pair of columns will be clustered together.
To compute the consensus matrix, the columns of V are clustered using NMF n times. Each clustering yields a connectivity matrix. Recall that the connectivity matrix is a symmetric matrix whose i, jth entry is 1 if columns i and j of V are clustered together, and 0 if they are not. The consensus matrix is also a symmetric matrix, whose i, jth entry is formed by taking the average of the i, jth entries of the n connectivity matrices.
Thus, each i, jth entry of the consensus matrix is a value between 0, when columns i and j are not clustered together on any of the runs, and 1, when columns i and j were clustered together on all runs. The i, jth entry of a consensus matrix may be considered, in some sense, a “probability” that columns i and j belong to the same cluster.

NMF uses an iterative algorithm with random starting values for W and H. (See Part IV for more information on NMF.) This, coupled with the fact that the factorization is not unique, means that if you cluster the columns of V multiple times, you may get different final clusterings. The consensus matrix is a way to average multiple clusterings, to produce a probability estimate that any pair of columns will be clustered together.
(more…)

Cluster Analysis, Part IV: Non-negative Matrix Factorization (NMF)

Wednesday, January 6th, 2010

In this continuing series, we explore the NMath Stats functions for performing cluster analysis. (For previous posts, see Part 1 – PCA , Part 2 – K-Means, and Part 3 – Hierarchical.) The sample data set we’re using classifies 89 single malt scotch whiskies on a five-point scale (0-4) for 12 flavor characteristics. To visualize the data set and clusterings, we make use of the free Microsoft Chart Controls for .NET, which provide a basic set of charts.

In this post, we’ll cluster the scotches using non-negative matrix factorization (NMF). NMF approximately factors a matrix V into two matrices, W and H:

wh

If V in an n x m matrix, then NMF can be used to approximately factor V into an n x r matrix W and an r x m matrix H. Usually r is chosen to be much smaller than either m or n, for dimension reduction. Thus, each column of V is approximated by a linear combination of the columns of W, with the coefficients being the corresponding column H. This extracts underlying features of the data as basis vectors in W, which can then be used for identification, clustering, and compression.
(more…)

Clustering Analysis, Part III: Hierarchical Cluster Analysis

Monday, December 28th, 2009

In this continuing series, we explore the NMath Stats functions for performing cluster analysis. (For previous posts, see Part 1 – PCA and Part 2 – K-Means.) The sample data set we’re using classifies 89 single malt scotch whiskies on a five-point scale (0-4) for 12 flavor characteristics. To visualize the data set and clusterings, we make use of the free Microsoft Chart Controls for .NET, which provide a basic set of charts.

In this post, we’ll cluster the scotches based on “similarity” in the original 12-dimensional flavor space using hierarchical cluster analysis. In hierarchical cluster analysis, each object is initially assigned to its own singleton cluster. The analysis then proceeds iteratively, at each stage joining the two most “similar” clusters into a new cluster, continuing until there is one overall cluster. In NMath Stats, class ClusterAnalysis performs hierarchical cluster analyses.
(more…)