<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>

<channel>
	<title>NMath Stats Tutorial Archives - CenterSpace</title>
	<atom:link href="https://www.centerspace.net/category/nmath-stats/nmath-stats-tutorial/feed" rel="self" type="application/rss+xml" />
	<link>https://www.centerspace.net/category/nmath-stats/nmath-stats-tutorial</link>
	<description>.NET numerical class libraries</description>
	<lastBuildDate>Tue, 07 Feb 2023 21:49:16 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.1.1</generator>
<site xmlns="com-wordpress:feed-additions:1">104092929</site>	<item>
		<title>Calling External .NET Libraries from Excel</title>
		<link>https://www.centerspace.net/calling-external-net-libraries-from-excel</link>
					<comments>https://www.centerspace.net/calling-external-net-libraries-from-excel#comments</comments>
		
		<dc:creator><![CDATA[CenterSpace]]></dc:creator>
		<pubDate>Thu, 09 Dec 2010 06:10:05 +0000</pubDate>
				<category><![CDATA[Excel]]></category>
		<category><![CDATA[Marketing]]></category>
		<category><![CDATA[NMath Stats Tutorial]]></category>
		<category><![CDATA[NMath Tutorial]]></category>
		<category><![CDATA[.NET]]></category>
		<category><![CDATA[excel interop]]></category>
		<category><![CDATA[NMath and Excel]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=2845</guid>

					<description><![CDATA[<p>There are many circumstances where you may need to access an external library of functions or routines from Excel.  For example, if you need a complex function such as fitting data to a surface, or portfolio optimization, that is not natively available in Excel.  There also may be a need to protect proprietary calculations by [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/calling-external-net-libraries-from-excel">Calling External .NET Libraries from Excel</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
					<content:encoded><![CDATA[<p>There are many circumstances where you may need to access an external library of functions or routines from Excel.  For example, you may need a complex capability, such as fitting data to a surface or portfolio optimization, that is not natively available in Excel.  You may also need to protect proprietary calculations by using user-defined functions that process algorithms in a black-box manner.  In my case, I was looking for a way to rapidly prototype some calculations without setting up a complex development environment.</p>
<p>Harking back to the old “two out of three” rule of development projects: when the three metrics are fast, cheap, and good, a time-limited project can only ever achieve two, never all three. Initially I like to dive in and start with fast and cheap, working my way towards quality as necessary.  So, we&#8217;ll start with the quick-and-dirty approach to calling external libraries from Excel.</p>
<h3>Project Setup</h3>
<p>You must have .NET Framework 2.0 or later.  The latest .NET version is a free and easy download from Microsoft.</p>
<p>You&#8217;ll also need:</p>
<ul>
<li>Excel 97 or later.</li>
<li>The external library assemblies that you need to access from Excel. In our case we will use CenterSpace’s <strong>NMath.dll</strong> and <strong>NMathStats.dll</strong>.</li>
<li>ExcelDNA, a free tool written by Govert van Drimmelen, which can be downloaded from <a href="https://excel-dna.net/">Excel-DNA</a>.</li>
</ul>
<p>The first order of business is to unpack the downloaded file, <strong>ExcelDNA.zip</strong>, into a working directory.  For our example, we will use <em>CenterSpaceExcel</em> as our directory name. After unpacking you should have two folders, <em>Distribution</em> and <em>Source</em>, in the <em>CenterSpaceExcel</em> directory.  Inside the <em>Distribution</em> folder locate the file <strong>ExcelDNA.xll</strong> and rename it to <strong>NMathExcel.xll</strong>.</p>
<p>Next, locate the file <strong>ExcelDna.dna</strong> in the same directory and rename it to <strong>NMathExcel.dna</strong>.  Then open <strong>NMathExcel.dna</strong> in Notepad or your favorite code editor.</p>
<p>You should see the following code:</p>
<pre lang="vb">
<DnaLibrary>
<![CDATA[ 
     Public Module Module1   
       Function AddThem(x, y)      
          AddThem = x + y   
       End Function 
    End Module 
]]&gt;
</DnaLibrary></pre>
<p>Assuming CenterSpace NMath and NMath Stats are installed in their standard locations, change it to read as follows and save:</p>
<pre lang="vb">
<DnaLibrary>

<Reference Name="CenterSpace.NMath.Core" />
<Reference Path="C:\Program Files\CenterSpace\NMath 4.1\Assemblies\NMath.dll" />
<Reference Name="CenterSpace.NMath.Stats" />
<Reference Path="C:\Program Files\CenterSpace\NMath Stats 3.2\Assemblies\NMathStats.dll" />

<![CDATA[
	Imports NMath = CenterSpace.NMath.Core
	Imports Stats = CenterSpace.NMath.Stats

	Public Module NMathExcel

		<ExcelFunction(Description:="Returns x to the y power")> _
	    	Function NPower(x as double, y As double) As double
			NPower = NMath.NMathFunctions.PowFunction(x, y)
		End Function

		<ExcelFunction(IsMacroType:=True, IsVolatile:=True)> _
		Function NRand() As double
			dim rand As New NMath.RandGenMTwist
			NRand = rand.Next53BitRes()
		End Function

		<ExcelFunction(Description:="Binomial Distribution: Number of Successes, Trials, Probability, Cumulative is True or False")> _
		Function NBinomDist(NSuccess As Int32, NTrials As Int32, Prob As double, Cumul As Boolean) As double
			dim nbin As New Stats.BinomialDistribution
			nbin.N = NTrials
			nbin.P = Prob
			If Cumul Then
				NBinomDist = nbin.CDF(NSuccess)
			Else
				NBinomDist = nbin.PDF(NSuccess)
			End If
		End Function

		Function NDoubleMatrixRand(rsize As integer, csize As integer, RandLBx As integer, RandUBy As integer) As Object(,)
			dim rng As New NMath.RandGenUniform(RandLBx,RandUBy)
			rng.Reset(&#038;H124)
			dim TempA As New NMath.DoubleMatrix(rsize, csize, rng)
			NDoubleMatrixRand = NCopyArray(TempA, rsize, csize)

		End Function

		Function NCopyArray(IMatrix As Object, rsize As integer, csize As integer) As Object(,)
			dim i As Integer
			dim j As Integer
			dim OArray(rsize, csize) As Object
			for i = 0 to rsize - 1
			   for j = 0 to csize - 1
				OArray(i,j) = IMatrix(i,j)
			   next j
			next i
			NCopyArray = OArray
		End Function		

	End Module
]]&gt;</pre>
<p>We now have created the VB code to call our CenterSpace Math and Statistics libraries with the following five functions.</p>
<ol>
<li>The first function shows a simple math library call to the Power function which takes a number x and raises it to the y power and returns the value.</li>
<li>The second function shows a call to obtain a fast random number from the math library.  Since we want a new number each time the spreadsheet is recalculated, we have made the function <code>volatile</code>.</li>
<li>The third function shows how to set values that need to be accessed by a function in our .NET assembly; in this case, the binomial distribution.</li>
<li>The fourth function demonstrates the creation of a <code>DoubleMatrix</code> that is then filled with uniformly distributed random numbers.</li>
<li>The fifth function is a helper routine to transfer data across the COM interface.</li>
</ol>
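<p>As a quick sanity check on what <code>NBinomDist</code> returns, the binomial PDF and CDF it wraps are easy to reproduce in a few lines.  The sketch below uses Python purely for illustration; the helper names are ours and are not part of NMath:</p>

```python
from math import comb

def binom_pdf(k: int, n: int, p: float) -> float:
    # Probability of exactly k successes in n independent trials.
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k: int, n: int, p: float) -> float:
    # Probability of at most k successes in n trials.
    return sum(binom_pdf(i, n, p) for i in range(k + 1))

# NBinomDist(NSuccess, NTrials, Prob, Cumul) corresponds to
# binom_cdf when Cumul is True and binom_pdf when Cumul is False.
```

<p>For example, <code>binom_cdf(2, 4, 0.5)</code> gives 11/16, which is what you would also expect from Excel&#8217;s built-in BINOMDIST with the cumulative flag set.</p>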
<h3>Test our setup in Excel</h3>
<p>Open Excel and move your cursor to the <span style="text-decoration: underline;">Tools</span> menu item.  Towards the bottom of the drop-down menu you will find the selection <span style="text-decoration: underline;">Add-Ins</span>.  After selecting <span style="text-decoration: underline;">Add-Ins</span>, you will see a pop-up window with the option to select Microsoft-supplied add-ins.  Choose the <span style="text-decoration: underline;">Browse</span> option and go to the working directory we created at the beginning.  In our case, this is the <em>CenterSpaceExcel</em> directory.  Next select the <em>Distribution</em> folder and you should see the renamed file <strong>NMathExcel.xll</strong>.  Select it and you should now see the following screen.</p>
<figure id="attachment_2899" aria-describedby="caption-attachment-2899" style="width: 360px" class="wp-caption alignnone"><a href="https://www.centerspace.net/blog/wp-content/uploads/2010/12/ExcelAddin1.gif"><img decoding="async" class="size-full wp-image-2899" title="ExcelAddin" src="https://www.centerspace.net/blog/wp-content/uploads/2010/12/ExcelAddin1.gif" alt="" width="360" height="422" srcset="https://www.centerspace.net/wp-content/uploads/2010/12/ExcelAddin1.gif 360w, https://www.centerspace.net/wp-content/uploads/2010/12/ExcelAddin1-255x300.gif 255w" sizes="(max-width: 360px) 100vw, 360px" /></a><figcaption id="caption-attachment-2899" class="wp-caption-text">Selecting a user created XLL as an Add-in for Excel</figcaption></figure>
<p>Make sure NMathExcel is checked and click OK. If you get an error at this point, it is probably due to a typo in the DNA file; otherwise you will get the expected new sheet ready for entry.</p>
<p>Select an empty cell, then select <span style="text-decoration: underline;">Insert</span> from the menu bar and <span style="text-decoration: underline;">Function</span> from the pull-down.  You should see the following pop-up.</p>
<figure id="attachment_2902" aria-describedby="caption-attachment-2902" style="width: 540px" class="wp-caption alignnone"><a href="https://www.centerspace.net/blog/wp-content/uploads/2010/12/InsertFunction.gif"><img decoding="async" loading="lazy" class="size-full wp-image-2902" title="InsertFunction" src="https://www.centerspace.net/blog/wp-content/uploads/2010/12/InsertFunction.gif" alt="" width="540" height="423" srcset="https://www.centerspace.net/wp-content/uploads/2010/12/InsertFunction.gif 540w, https://www.centerspace.net/wp-content/uploads/2010/12/InsertFunction-300x235.gif 300w" sizes="(max-width: 540px) 100vw, 540px" /></a><figcaption id="caption-attachment-2902" class="wp-caption-text">Selecting the category containing our NMath functions</figcaption></figure>
<p>At the bottom of the category pull-down you should see our NMathExcel functions.  Select it and you should have these options:</p>
<figure id="attachment_2905" aria-describedby="caption-attachment-2905" style="width: 540px" class="wp-caption alignnone"><a href="https://www.centerspace.net/blog/wp-content/uploads/2010/12/InsNMathFctn.gif"><img decoding="async" loading="lazy" class="size-full wp-image-2905" title="InsNMathFctn" src="https://www.centerspace.net/blog/wp-content/uploads/2010/12/InsNMathFctn.gif" alt="" width="540" height="427" srcset="https://www.centerspace.net/wp-content/uploads/2010/12/InsNMathFctn.gif 540w, https://www.centerspace.net/wp-content/uploads/2010/12/InsNMathFctn-300x237.gif 300w" sizes="(max-width: 540px) 100vw, 540px" /></a><figcaption id="caption-attachment-2905" class="wp-caption-text">NMath Excel Function Category</figcaption></figure>
<p>If we choose <code>NPower</code>, we will get the next screen,</p>
<figure id="attachment_2906" aria-describedby="caption-attachment-2906" style="width: 540px" class="wp-caption alignnone"><a href="https://www.centerspace.net/blog/wp-content/uploads/2010/12/InsNMathFctnPwrArg.gif"><img decoding="async" loading="lazy" class="size-full wp-image-2906" title="InsNMathFctnPwrArg" src="https://www.centerspace.net/blog/wp-content/uploads/2010/12/InsNMathFctnPwrArg.gif" alt="" width="540" height="359" srcset="https://www.centerspace.net/wp-content/uploads/2010/12/InsNMathFctnPwrArg.gif 540w, https://www.centerspace.net/wp-content/uploads/2010/12/InsNMathFctnPwrArg-300x199.gif 300w" sizes="(max-width: 540px) 100vw, 540px" /></a><figcaption id="caption-attachment-2906" class="wp-caption-text">Calling NMath Library Power function in Excel</figcaption></figure>
<p>I arbitrarily typed the value of 3.2 for x and 3.327 for y.  You can see the result of 47.9329301 before selecting OK.</p>
<p>Select OK and Excel will insert the value into the cell.  Select another blank cell and this time choose our <code>NRand()</code> function.  You will notice there is no opportunity to enter values; finish by selecting OK.  At this point you should see a number between 0 and 1 in the cell.  Each time you press F9 (sheet recalc) a new random number will appear.  If we had not made this function volatile, the number would not change unless you edited the cell.</p>
<p>To test our Binomial Distribution function, again we will select a new cell and use the insert function option to insert the <code>NBinomDist</code> function with the following values.</p>
<figure id="attachment_2909" aria-describedby="caption-attachment-2909" style="width: 540px" class="wp-caption alignnone"><a href="https://www.centerspace.net/blog/wp-content/uploads/2010/12/InsNMathFctnBinomArg.gif"><img decoding="async" loading="lazy" class="size-full wp-image-2909" title="InsNMathFctnBinomArg" src="https://www.centerspace.net/blog/wp-content/uploads/2010/12/InsNMathFctnBinomArg.gif" alt="" width="540" height="361" srcset="https://www.centerspace.net/wp-content/uploads/2010/12/InsNMathFctnBinomArg.gif 540w, https://www.centerspace.net/wp-content/uploads/2010/12/InsNMathFctnBinomArg-300x200.gif 300w" sizes="(max-width: 540px) 100vw, 540px" /></a><figcaption id="caption-attachment-2909" class="wp-caption-text">Calling NMath Statistical function Binomial Distribution from Excel</figcaption></figure>
<p>At this point we have made successful calls into both of CenterSpace&#8217;s NMath and NMath Stats .NET math libraries.</p>
<p>In our fourth example, we will see how Excel handles matrices and look at issues passing array arguments across the COM interface.  Excel 2003 was limited to a maximum of 60,000 cells in an array, but Excel 2007 was expanded to handle 2 million.  Excel has some quirky ways of displaying matrices, and I&#8217;ll cover the ins and outs of these quirks.</p>
<p>We have written the basic code to set up a function called <code>NDoubleMatrixRand</code> that creates a matrix with the supplied dimensions, filled with uniformly distributed random numbers over a specified interval.  We will select another blank cell, again go to insert function, and this time choose <code>NDoubleMatrixRand</code>.  Suppose we want to create a 6&#215;6 matrix filled with random numbers between -2 and 2.  Our input will look like the following screen.</p>
<figure id="attachment_2910" aria-describedby="caption-attachment-2910" style="width: 600px" class="wp-caption alignnone"><a href="https://www.centerspace.net/blog/wp-content/uploads/2010/12/InsDblMtrxRandArg.gif"><img decoding="async" loading="lazy" class="size-full wp-image-2910" title="InsDblMtrxRandArg" src="https://www.centerspace.net/blog/wp-content/uploads/2010/12/InsDblMtrxRandArg.gif" alt="" width="600" height="421" srcset="https://www.centerspace.net/wp-content/uploads/2010/12/InsDblMtrxRandArg.gif 600w, https://www.centerspace.net/wp-content/uploads/2010/12/InsDblMtrxRandArg-300x210.gif 300w" sizes="(max-width: 600px) 100vw, 600px" /></a><figcaption id="caption-attachment-2910" class="wp-caption-text">Creating a DoubleMatrix in Excel using NMath</figcaption></figure>
<p>Notice the formula result in the middle right of the above screen is equal to {-0.994818527251482,-0.08</p>
<p>Values enclosed in curly brackets and separated by commas indicate that a matrix was actually created, but the formula result is only displaying a partial value due to display size. At this point when you select OK, you will have a cell with a single value.  Here is where the fun begins.  Start at the cell and drag a 6&#215;6 range as shown in the following screen.</p>
<figure id="attachment_2911" aria-describedby="caption-attachment-2911" style="width: 564px" class="wp-caption alignnone"><a href="https://www.centerspace.net/blog/wp-content/uploads/2010/12/InsDblMtrxRandDsply.gif"><img decoding="async" loading="lazy" class="size-full wp-image-2911" title="InsDblMtrxRandDsply" src="https://www.centerspace.net/blog/wp-content/uploads/2010/12/InsDblMtrxRandDsply.gif" alt="" width="564" height="274" srcset="https://www.centerspace.net/wp-content/uploads/2010/12/InsDblMtrxRandDsply.gif 564w, https://www.centerspace.net/wp-content/uploads/2010/12/InsDblMtrxRandDsply-300x145.gif 300w" sizes="(max-width: 564px) 100vw, 564px" /></a><figcaption id="caption-attachment-2911" class="wp-caption-text">Selecting the area the matrix is to be displayed in</figcaption></figure>
<p>Now get your fingers limbered. Here is where it gets a bit obscure &#8211; do exactly as follows.</p>
<ul>
<li>Press the F2 key.  (<em>pressing F2 may be optional, but Excel recommends it so that the cell leaves edit mode</em>)</li>
<li>Press and hold the Ctrl key followed by</li>
<li>pressing and holding the Shift key followed by</li>
<li>pressing the Enter key</li>
</ul>
<p>and presto chango!  You should see a screen like this.</p>
<figure id="attachment_2913" aria-describedby="caption-attachment-2913" style="width: 561px" class="wp-caption alignnone"><a href="https://www.centerspace.net/blog/wp-content/uploads/2010/12/InsDblMtrxRandDsplyRslt.gif"><img decoding="async" loading="lazy" class="size-full wp-image-2913" title="InsDblMtrxRandDsplyRslt" src="https://www.centerspace.net/blog/wp-content/uploads/2010/12/InsDblMtrxRandDsplyRslt.gif" alt="" width="561" height="211" srcset="https://www.centerspace.net/wp-content/uploads/2010/12/InsDblMtrxRandDsplyRslt.gif 561w, https://www.centerspace.net/wp-content/uploads/2010/12/InsDblMtrxRandDsplyRslt-300x112.gif 300w" sizes="(max-width: 561px) 100vw, 561px" /></a><figcaption id="caption-attachment-2913" class="wp-caption-text">Displaying a Matrix in Excel </figcaption></figure>
<p>Notice that your cell&#8217;s formula is now enclosed in { }, indicating to Excel that the contained formula is an array function.  This is the only way to get matrices displayed.  Also, if you try to edit this cell you will get an error that changes are not allowed.  If you want to change the dimensions, simply reference the values from another cell when you create the function.</p>
<p>The fifth function, <code>NCopyArray</code>, copies the library matrix across the COM bridge into an Excel array object.  As I stated at the beginning, this is a quick-and-dirty approach and leaves room for improvement.</p>
<h3>Summary</h3>
<p>In my next post, I will provide the above code in C# and add more function calls involving matrices, with, hopefully, an improved approach to <code>NCopyArray</code>.  Future posts will include creating a packaged XLL and a more complex example such as curve fitting.</p>
<p>Since time is our most precious asset, being able to quickly access complex math functions with a general purpose tool like Excel should save time and money!</p>
<p>At CenterSpace, we are interested in whether this post is helpful and whether there is a need for more examples of how our libraries can be accessed from Excel.  Let us know what areas are of interest to you.</p>
<p>Mike  Magee</p>
<p><strong>Thanks and Resources</strong><br />
A special thanks to Govert van Drimmelen for writing such a wonderful tool as ExcelDNA.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/calling-external-net-libraries-from-excel">Calling External .NET Libraries from Excel</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/calling-external-net-libraries-from-excel/feed</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2845</post-id>	</item>
		<item>
		<title>Statistical Quality Control Charts</title>
		<link>https://www.centerspace.net/statistical-quality-control-charts</link>
					<comments>https://www.centerspace.net/statistical-quality-control-charts#comments</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Wed, 11 Aug 2010 17:33:47 +0000</pubDate>
				<category><![CDATA[NMath Stats Tutorial]]></category>
		<category><![CDATA[attribute quality chart]]></category>
		<category><![CDATA[c-chart]]></category>
		<category><![CDATA[Pareto chart]]></category>
		<category><![CDATA[quality charts in .NET]]></category>
		<category><![CDATA[quality charts in C#]]></category>
		<category><![CDATA[quality control charts]]></category>
		<category><![CDATA[Shewhart charts]]></category>
		<category><![CDATA[statistical quality control charts]]></category>
		<category><![CDATA[u-chart]]></category>
		<category><![CDATA[variable quality chart]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=2345</guid>

					<description><![CDATA[<p><img src="https://www.centerspace.net/blog/wp-content/uploads/2010/07/u-chart-150x150.png" alt="" title="u-chart"  class="excerpt" /><br />
Statistical quality control charts, or Shewhart quality control charts, are used across nearly all sectors of industry to maintain and improve product quality.  Quality control charts provide a means to detect when a time-varying process exceeds its historic process variation and needs analysis and/or intervention to remedy the out-of-control process (known as special cause variation).  CenterSpace Software and Nevron have teamed up to create some free code examples for creating Shewhart attribute control charts.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/statistical-quality-control-charts">Statistical Quality Control Charts</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Statistical quality control charts, or Shewhart quality control charts, are used across nearly all sectors of industry to maintain and improve product quality.  Quality control charts provide a means to detect when a time-varying process exceeds its historic process variation and needs analysis and/or intervention to remedy the out-of-control process (known as special cause variation).  These process control charts are independent of any engineering decision-making about the particular process at hand, and are instead based on the statistical nature of the process itself.  This standardized statistical control framework was created and refined by Walter Shewhart at Bell Telephone Laboratories from 1925 until his retirement in 1956.  It is this independence from process details that makes Shewhart&#8217;s techniques powerful, widely applicable decision-making aids.</p>
<p>Through their ongoing partnership, CenterSpace Software and Nevron have teamed up to create some free code examples for creating Shewhart charts.</p>
<h3>Quality Chart Types</h3>
<p>Statistical quality control charts can be generally divided into two categories, those for tracking discrete attribute variables (e.g. a pass/fail test), and those for tracking continuous process variables (e.g. pipe diameter, temperature).</p>
<table>
<tbody>
<tr>
<th> Chart</th>
<th> Process Observation</th>
<th> Process Observation Variable</th>
</tr>
<tr>
<td>X-bar and R chart</td>
<td>Quality characteristic measurement within one subgroup</td>
<td>Variables</td>
</tr>
<tr>
<td>X-bar and s chart</td>
<td>Quality characteristic measurement within one subgroup</td>
<td>Variables</td>
</tr>
<tr>
<td>Shewhart individuals control chart (I-R chart or I chart)</td>
<td>Quality characteristic measurement for one observation</td>
<td>Variables</td>
</tr>
<tr>
<td>Three-way chart</td>
<td>Quality characteristic measurement within one subgroup</td>
<td>Variables</td>
</tr>
<tr>
<td>p-chart</td>
<td>Fraction nonconforming within one subgroup</td>
<td>Attributes</td>
</tr>
<tr>
<td>np-chart</td>
<td>Number nonconforming within one subgroup</td>
<td>Attributes</td>
</tr>
<tr>
<td>c-chart</td>
<td>Number of nonconformances within one subgroup</td>
<td>Attributes</td>
</tr>
<tr>
<td>u-chart</td>
<td>Nonconformances per unit within one subgroup</td>
<td>Attributes</td>
</tr>
</tbody>
</table>
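<p>All of the attribute charts in the table above share the same recipe: a center line at the historic mean rate, and control limits three estimated standard deviations away, where the standard deviation comes from the assumed count model (Poisson for c- and u-charts, binomial for p- and np-charts).  Here is a minimal sketch of the standard textbook formulas for two of them, in Python for illustration; the helper names are ours:</p>

```python
from math import sqrt

def c_chart_limits(counts):
    # c-chart: center line and 3-sigma limits for nonconformance
    # counts per constant-size inspection unit (Poisson model).
    c_bar = sum(counts) / len(counts)
    sigma = sqrt(c_bar)
    return c_bar, max(0.0, c_bar - 3 * sigma), c_bar + 3 * sigma

def p_chart_limits(nonconforming, sample_sizes):
    # p-chart: center line and per-sample 3-sigma limits for the
    # fraction nonconforming (binomial model, variable sample size).
    p_bar = sum(nonconforming) / sum(sample_sizes)
    limits = []
    for n in sample_sizes:
        sigma = sqrt(p_bar * (1 - p_bar) / n)
        limits.append((max(0.0, p_bar - 3 * sigma),
                       min(1.0, p_bar + 3 * sigma)))
    return p_bar, limits
```

<p>The lower limit is clipped at zero because counts and fractions cannot be negative, which is why these charts often show a flat lower control limit at 0.</p>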
<p>The statistical modeling language R provides a package, qcc, for creating these and other statistical process control charts.  This R package was created by Luca Scrucca, is actively maintained, and can be found in the CRAN repository.</p>
<table>
<tbody>
<tr>
<td>
<p><figure id="attachment_2364" aria-describedby="caption-attachment-2364" style="width: 200px" class="wp-caption alignleft"><a href="https://www.centerspace.net/blog/wp-content/uploads/2010/07/c-chart-R.png"><img decoding="async" class="size-medium wp-image-2364" title="c-chart generated the R package qcc" src="https://www.centerspace.net/blog/wp-content/uploads/2010/07/c-chart-R-300x200.png" alt="c-chart generated the R package qcc" width="200" srcset="https://www.centerspace.net/wp-content/uploads/2010/07/c-chart-R-300x200.png 300w, https://www.centerspace.net/wp-content/uploads/2010/07/c-chart-R.png 738w" sizes="(max-width: 300px) 100vw, 300px" /></a><figcaption id="caption-attachment-2364" class="wp-caption-text">c-chart generated by R package qcc</figcaption></figure></td>
<td>
<p><figure id="attachment_2365" aria-describedby="caption-attachment-2365" style="width: 200px" class="wp-caption alignright"><a href="https://www.centerspace.net/blog/wp-content/uploads/2010/07/u-chart-R.png"><img decoding="async" class="size-medium wp-image-2365" title="u-chart generated by the R package qcc" src="https://www.centerspace.net/blog/wp-content/uploads/2010/07/u-chart-R-300x190.png" alt="u-chart generated by the R package qcc" width="200" srcset="https://www.centerspace.net/wp-content/uploads/2010/07/u-chart-R-300x190.png 300w, https://www.centerspace.net/wp-content/uploads/2010/07/u-chart-R.png 760w" sizes="(max-width: 300px) 100vw, 300px" /></a><figcaption id="caption-attachment-2365" class="wp-caption-text">u-chart generated by R package qcc</figcaption></figure></td>
</tr>
</tbody>
</table>
<p>These two images demonstrate the standard look of the &#8216;c&#8217; and &#8216;u&#8217; attribute quality control charts.  Some typical chart features include the highlighting of out-of-control data points and time-varying upper and lower control limits.  The charts generated by the R qcc package have served as our standard for recreating these in the .NET / C# development environment.  The real-world data used in our examples below was copied from the qcc package so direct comparisons can be made.</p>
<h3>Creating a Quality Chart with .NET</h3>
<p>To integrate these quality control charts into a .NET/C# data-driven quality monitoring application, we need both a statistical analysis library and a visualization tool that can manage the special chart style demanded by quality control engineers.  CenterSpace, in partnership with Nevron, has created an extensible example application to build these types of specialized charts.  Once you have these free helper classes, building an attribute u-chart is as simple as, or simpler than, prototyping charts in R.</p>
<pre lang="csharp">    public void UChart()
    {

      // u-Chart sample data
      // This data-set was copied from the 'dyedcloth' data set packaged with
      // the R-package qcc by Luca Scrucca
      //
      // Example Data Description
      // In a textile finishing plant, dyed cloth is inspected for the occurrence of
      // defects per 50 square meters.
      // The data on ten rolls of cloth are presented
      //    x: number of nonconformities per 50 square meters (inspection units)
      //    samplesize: number of inspection units in roll (variable sample size)
      DoubleVector x =
        new DoubleVector(14, 12, 20, 11, 7, 10, 21, 16, 19, 23);
      DoubleVector samplesize =
        new DoubleVector(10.0, 8.0, 13.0, 10.0, 9.5, 10.0, 12.0, 10.5, 12.0, 12.5);

      // This builds the statistical information for drawing the chart.
      IAttributeChartStats stats_u = new Stats_u(x, samplesize);

      // Build and display the Nevron u-Chart visualization
      NevronControlChart.AutoRefresh = true;
      NevronControlChart.Clear();
      AttributeChart uChart =
        new AttributeChart(stats_u, this.NevronControlChart);

    }</pre>
<p>This code creates the u-Chart shown below.</p>
<figure id="attachment_2503" aria-describedby="caption-attachment-2503" style="width: 300px" class="wp-caption aligncenter"><a href="https://www.centerspace.net/blog/wp-content/uploads/2010/07/u-chart.png"><img decoding="async" loading="lazy" class="size-medium wp-image-2503" title="u-chart" src="https://www.centerspace.net/blog/wp-content/uploads/2010/07/u-chart-300x209.png" alt="" width="300" height="209" srcset="https://www.centerspace.net/wp-content/uploads/2010/07/u-chart-300x209.png 300w, https://www.centerspace.net/wp-content/uploads/2010/07/u-chart.png 618w" sizes="(max-width: 300px) 100vw, 300px" /></a><figcaption id="caption-attachment-2503" class="wp-caption-text">u-Chart, or Unit Chart</figcaption></figure>
<p>For those familiar with the aforementioned R package qcc, these .NET/C# classes follow the same R naming convention for the particular chart statistics objects, but with an improved object model.  So, as seen in this example, the u-chart statistics are contained in a class named <code>Stats_u</code>, similar to the R <code>stats.u</code> function.  Each of these statistical chart objects implements either an <code>IAttributeChartStats</code> or an <code>IVariableChartStats</code> interface, which is used by the chart-generating class (<code>AttributeChart</code>) as seen in the last line of the code above.</p>
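<p>Assuming <code>Stats_u</code> implements the standard u-chart formulas (a center line at total nonconformities over total inspection units, with per-sample 3-sigma limits), the numbers behind the chart above can be cross-checked in a few lines.  This Python sketch is ours, purely for verification, and is not CenterSpace&#8217;s API:</p>

```python
from math import sqrt

# The dyedcloth data from the C# example above.
x = [14, 12, 20, 11, 7, 10, 21, 16, 19, 23]
samplesize = [10.0, 8.0, 13.0, 10.0, 9.5, 10.0, 12.0, 10.5, 12.0, 12.5]

# Center line: total nonconformities per total inspection units.
u_bar = sum(x) / sum(samplesize)  # 153 / 107.5

# Per-roll rates and 3-sigma control limits (limits vary with sample size).
u = [xi / ni for xi, ni in zip(x, samplesize)]
ucl = [u_bar + 3 * sqrt(u_bar / ni) for ni in samplesize]
lcl = [max(0.0, u_bar - 3 * sqrt(u_bar / ni)) for ni in samplesize]
```

<p>Because the sample sizes vary from roll to roll, the control limits are not flat lines; they widen for smaller rolls, as seen in the u-chart images above.</p>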
<p>Building control charts boils down to three steps using these example classes.</p>
<ol>
<li> Build the necessary data vectors.</li>
<li> Build the desired chart&#8217;s statistics object,<br />
e.g. <code>IAttributeChartStats Stats = new Stats_c(DoubleVector data);</code></li>
<li> Show the chart using Nevron&#8217;s .NET chart control,<br />
e.g. <code>new AttributeChart(IAttributeChartStats Stats, NChartControl Chart)</code></li>
</ol>
<h3>Free Example Code</h3>
<p>The example code now available on <a href="https://github.com/MilenMetodiev/CenterSpaceNevronExamples">github</a> can currently create all four essential attribute quality control charts, as seen below.</p>
<table>
<tbody>
<tr>
<td>
<p><figure id="attachment_2503" aria-describedby="caption-attachment-2503" style="width: 180px" class="wp-caption alignnone"><br />
<a href="https://www.centerspace.net/blog/wp-content/uploads/2010/07/c-chart.png"><img decoding="async" class="size-thumbnail wp-image-2506" title="c-chart" src="https://www.centerspace.net/blog/wp-content/uploads/2010/07/c-chart-150x150.png" alt="c-Chart, or Count Chart" width="180" /></a><br />
<figcaption id="caption-attachment-2503" class="wp-caption-text">c-Chart, or Count Chart</figcaption></figure></td>
<td>
<p><figure id="attachment_2503" aria-describedby="caption-attachment-2503" style="width: 180px" class="wp-caption alignnone"><a href="https://www.centerspace.net/blog/wp-content/uploads/2010/07/u-chart.png"><img decoding="async" class="size-thumbnail wp-image-2503" title="u-chart" src="https://www.centerspace.net/blog/wp-content/uploads/2010/07/u-chart-150x150.png" alt="" width="180" /></a><br />
<figcaption id="caption-attachment-2503" class="wp-caption-text">u-Chart, or Unit Chart</figcaption></figure></td>
</tr>
<tr>
<td>
<p><figure id="attachment_2507" aria-describedby="caption-attachment-2507" style="width: 180px" class="wp-caption alignnone"><a href="https://www.centerspace.net/blog/wp-content/uploads/2010/07/p-chart.png"><img decoding="async" class="size-thumbnail wp-image-2507" title="p-chart" src="https://www.centerspace.net/blog/wp-content/uploads/2010/07/p-chart-150x150.png" alt="" width="180" /></a><figcaption id="caption-attachment-2507" class="wp-caption-text">p-Chart, Percentage Chart</figcaption></figure></td>
<td>
<p><figure id="attachment_2508" aria-describedby="caption-attachment-2508" style="width: 180px" class="wp-caption alignnone"><a href="https://www.centerspace.net/blog/wp-content/uploads/2010/07/np-chart.png"><img decoding="async" class="size-thumbnail wp-image-2508" title="np-chart" src="https://www.centerspace.net/blog/wp-content/uploads/2010/07/np-chart-150x150.png" alt="" width="180" /></a><figcaption id="caption-attachment-2508" class="wp-caption-text">np-Chart</figcaption></figure></td>
</tr>
</tbody>
</table>
<p>To download and run these examples, navigate to our <a href="https://github.com/MilenMetodiev/CenterSpaceNevronExamples"> Nevron / CenterSpace github repository</a> and either click the &#8220;Download Source&#8221; button in the upper right-hand corner to download a .zip or .tar file of the project, or clone the repository.   For those unfamiliar with git: git is a distributed version control system well suited to collaborative projects such as this one.  To clone the project, after installing <a href="http://git-scm.com/download">git</a>, simply type at your command prompt:</p>
<pre lang="dos"> git clone git@github.com:MilenMetodiev/CenterSpaceNevronExamples.git</pre>
<p>This will create a clone of the project code at your current location, in a directory called &#8220;CenterSpaceNevronExamples&#8221;.</p>
<h3>Other Quality Control Charts and Future Development</h3>
<p>Currently we have implemented only the attribute control charts.  Other common quality-control charts, including EWMA (exponentially weighted moving average), Pareto, and CUSUM (cumulative sum) charts, have not been implemented in this example, but can be built using the same tool set and class patterns established here.  If you would like help getting the project running or extending it to other chart types, drop us an <a href="mailto:consulting@centerspace.net"> email </a>.</p>
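<p>As a hint at what an EWMA extension would involve, the EWMA chart statistic is just the recursion z<sub>i</sub> = &#955;x<sub>i</sub> + (1 &#8211; &#955;)z<sub>i-1</sub>. A minimal Python sketch, illustrative only and not part of the example project:</p>

```python
def ewma(observations, lam=0.2, target=None):
    """EWMA chart statistic: z_i = lam * x_i + (1 - lam) * z_{i-1}.

    z_0 is initialized to the process target (the sample mean when no
    target is supplied), as is conventional for EWMA control charts.
    """
    if target is None:
        target = sum(observations) / len(observations)
    z = target
    zs = []
    for x in observations:
        z = lam * x + (1 - lam) * z
        zs.append(z)
    return zs

ewma([10, 12, 11, 13], lam=0.5, target=10)  # [10.0, 11.0, 11.0, 12.0]
```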
<p>Happy Computing,<br />
-Paul</p>
<p><strong> Resources </strong></p>
<ul>
<li> The table above is adapted from the Wikipedia control chart <a href="https://en.wikipedia.org/wiki/Control_chart">article</a>.</li>
<li> &#8220;qcc: An R package for quality control charting and statistical process control&#8221;, R News, <a href="https://www.r-project.org/doc/Rnews/Rnews_2004-1.pdf">Volume 4/1 </a>, June 2004.</li>
<li> The standard <a href="https://cran.r-project.org/web/packages/qcc/qcc.pdf"> qcc documentation </a> from the CRAN project was very helpful with this project.</li>
</ul>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/statistical-quality-control-charts">Statistical Quality Control Charts</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/statistical-quality-control-charts/feed</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2345</post-id>	</item>
		<item>
		<title>Cluster Analysis, Part V: Monte Carlo NMF</title>
		<link>https://www.centerspace.net/clustering-analysis-part-v-monte-carlo-nmf</link>
					<comments>https://www.centerspace.net/clustering-analysis-part-v-monte-carlo-nmf#respond</comments>
		
		<dc:creator><![CDATA[Ken Baldwin]]></dc:creator>
		<pubDate>Mon, 11 Jan 2010 03:59:51 +0000</pubDate>
				<category><![CDATA[NMath Stats Tutorial]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[clustering .NET]]></category>
		<category><![CDATA[clustering C#]]></category>
		<category><![CDATA[NMF]]></category>
		<category><![CDATA[NMF .NET]]></category>
		<category><![CDATA[NMF C#]]></category>
		<category><![CDATA[nonnegative matrix factorization]]></category>
		<category><![CDATA[nonnegative matrix factorization .NET]]></category>
		<category><![CDATA[nonnegative matrix factorization C#]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=1031</guid>

					<description><![CDATA[<p>In this continuing series, we explore the NMath Stats functions for performing cluster analysis. (For previous posts, see Part 1 &#8211; PCA , Part 2 &#8211; K-Means, Part 3 &#8211; Hierarchical, and Part 4 &#8211; NMF.) The sample data set we&#8217;re using classifies 89 single malt scotch whiskies on a five-point scale (0-4) for 12 [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/clustering-analysis-part-v-monte-carlo-nmf">Cluster Analysis, Part V: Monte Carlo NMF</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>In this continuing series, we explore the NMath Stats functions for performing cluster analysis. (For previous posts, see <a href="/clustering-analysis-part-iv-non-negative-matrix-factorization/">Part 1 &#8211; PCA </a>, <a href="/clustering-analysis-part-iv-non-negative-matrix-factorization/">Part 2 &#8211; K-Means</a>, <a href=" https://www.centerspace.net/drawing-dendrograms/">Part 3 &#8211; Hierarchical</a>, and <a href="/clustering-analysis-part-iv-non-negative-matrix-factorization/">Part 4 &#8211; NMF</a>.) The sample data set we&#8217;re using classifies 89 single malt scotch whiskies on a five-point scale (0-4) for 12 flavor characteristics. To visualize the data set and clusterings, we make use of the free <a href="https://www.nuget.org/packages/Microsoft.Chart.Controls/">Microsoft Chart Controls for .NET</a>, which provide a basic set of charts.</p>
<p>In this post, the last in the series, we&#8217;ll look at how NMath provides a Monte Carlo method for performing multiple non-negative matrix factorization (NMF) clusterings using different random starting conditions, and combining the results.</p>
<p>NMF uses an iterative algorithm with random starting values for <em>W</em> and <em>H</em>. (See <a href="/clustering-analysis-part-iv-non-negative-matrix-factorization/">Part IV</a> for more information on NMF.) This, coupled with the fact that the factorization is not unique, means that if you cluster the columns of <em>V</em> multiple times, you may get different final clusterings. The <em>consensus matrix</em> is a way to average multiple clusterings, to produce a probability estimate that any pair of columns will be clustered together.<br />
<span id="more-1031"></span><br />
To compute the consensus matrix, the columns of V are clustered using NMF <em>n</em> times. Each clustering yields a connectivity matrix. Recall that the connectivity matrix is a symmetric matrix whose <em>i</em>, <em>j</em>th entry is 1 if columns <em>i</em> and <em>j</em> of <em>V</em> are clustered together, and 0 if they are not. The consensus matrix is also a symmetric matrix, whose <em>i</em>, <em>j</em>th entry is formed by taking the average of the <em>i</em>, <em>j</em>th entries of the <em>n</em> connectivity matrices. The <em>i</em>, <em>j</em>th entry of a consensus matrix may be considered a &#8220;probability&#8221; that columns <em>i</em> and <em>j</em> belong to the same cluster.</p>
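<p>In code, the consensus matrix is just an element-wise average of the per-run connectivity matrices. A small language-neutral sketch in Python, using plain lists rather than NMath classes:</p>

```python
def consensus(connectivity_matrices):
    """Average n connectivity matrices into a consensus matrix.

    Each connectivity matrix has a 1 at (i, j) if columns i and j were
    clustered together on that run and 0 otherwise, so each consensus
    entry is the fraction of runs on which i and j co-clustered.
    """
    n = len(connectivity_matrices)
    size = len(connectivity_matrices[0])
    return [[sum(C[i][j] for C in connectivity_matrices) / n
             for j in range(size)]
            for i in range(size)]

run1 = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]  # columns 0 and 1 together
run2 = [[1, 0, 0], [0, 1, 1], [0, 1, 1]]  # columns 1 and 2 together
C = consensus([run1, run2])
# C[0][1] == 0.5: columns 0 and 1 co-clustered on one of the two runs
```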
<p>NMath Stats provides class <a href="/doc/NMathSuite/ref/html/T_CenterSpace_NMath_Core_NMFConsensusMatrix_1.htm">NMFConsensusMatrix </a>for computing a consensus matrix. NMFConsensusMatrix is parameterized on the NMF update algorithm to use. Additional constructor parameters specify the matrix to factor, the order <em>k</em> of the NMF factorization (the number of columns in <em>W</em>), and the number of clustering runs. The consensus matrix is computed at construction time, so be aware that this may be an expensive operation.</p>
<p>For example, the following C# code creates a consensus matrix for 100 runs, clustering the scotch data (loaded into a dataframe in <a href="/clustering-analysis-part-iv-non-negative-matrix-factorization/">Part I</a>) into four clusters:</p>
<pre class="code">int k = 4;
int numberOfRuns = 100;
NMFConsensusMatrix&lt;NMFDivergenceUpdate&gt; consensusMatrix =
  new NMFConsensusMatrix&lt;NMFDivergenceUpdate&gt;(
    data.ToDoubleMatrix().Transpose(),
    k,
    numberOfRuns);

Console.WriteLine("{0} runs out of {1} converged.",
  consensusMatrix.NumberOfConvergedRuns, numberOfRuns);</pre>
<p>The output is:</p>
<pre>100 runs out of 100 converged.</pre>
<p>NMFConsensusMatrix provides a standard indexer for getting the element value at a specified row and column in the consensus matrix. For instance, one of the goals of Young et al. was to identify single malts that are particularly good representatives of each cluster. This information could be used, for example, to purchase a representative sampling of scotches. As described in <a href="/clustering-analysis-part-iv-non-negative-matrix-factorization/">Part IV</a>, they reported that these whiskies were the closest to each flavor profile:</p>
<ul>
<li>Glendronach and Macallan</li>
<li>Tomatin and Speyburn</li>
<li>AnCnoc and Miltonduff</li>
<li>Ardbeg and Clynelish</li>
</ul>
<p>The consensus matrix reveals, however, that the pairings are not equally strong:</p>
<pre lang="csharp">Console.WriteLine("Probability that Glendronach is clustered with Macallan = {0}",
  consensusMatrix[data.IndexOfKey("Glendronach"), data.IndexOfKey("Macallan")]);
Console.WriteLine("Probability that Tomatin is clustered with Speyburn = {0}",
  consensusMatrix[data.IndexOfKey("Tomatin"), data.IndexOfKey("Speyburn")]);
Console.WriteLine("Probability that AnCnoc is clustered with Miltonduff = {0}",
  consensusMatrix[data.IndexOfKey("AnCnoc"), data.IndexOfKey("Miltonduff")]);
Console.WriteLine("Probability that Ardbeg is clustered with Clynelish = {0}",
  consensusMatrix[data.IndexOfKey("Ardbeg"), data.IndexOfKey("Clynelish")]);</pre>
<p>The output is:</p>
<pre>Probability that Glendronach is clustered with Macallan = 1
Probability that Tomatin is clustered with Speyburn = 0.4
Probability that AnCnoc is clustered with Miltonduff = 0.86
Probability that Ardbeg is clustered with Clynelish = 1</pre>
<p>Thus, although Glendronach and Macallan are clustered together in all 100 runs, Tomatin and Speyburn are only clustered together 40% of the time.</p>
<p>A consensus matrix, <em>C</em>, can itself be used to cluster objects, by performing a hierarchical cluster analysis using the distance function:</p>
<p style="text-align: center;"><img decoding="async" loading="lazy" class="alignnone size-full wp-image-1051" title="nmf_distance_function" src="https://www.centerspace.net/blog/wp-content/uploads/2010/01/nmf_distance_function.gif" alt="nmf_distance_function" width="155" height="24" srcset="https://www.centerspace.net/wp-content/uploads/2010/01/nmf_distance_function.gif 155w, https://www.centerspace.net/wp-content/uploads/2010/01/nmf_distance_function-150x24.gif 150w" sizes="(max-width: 155px) 100vw, 155px" /></p>
<p>For example, this C# code performs a hierarchical cluster analysis using this distance function, then cuts the tree at the level of four clusters, printing out the cluster members:</p>
<pre lang="csharp">DoubleMatrix colNumbers = new DoubleMatrix(consensusMatrix.Order, 1, 0, 1);
string[] names = data.StringRowKeys;

Distance.Function distance =
  delegate(DoubleVector data1, DoubleVector data2)
  {
    int i = (int)data1[0];
    int j = (int)data2[0];
    return 1.0 - consensusMatrix[i, j];
  };

ClusterAnalysis ca = new ClusterAnalysis(colNumbers, distance, Linkage.WardFunction);

int k = 4;
ClusterSet cs = ca.CutTree(k);
for (int clusterNumber = 0; clusterNumber &lt; cs.NumberOfClusters; clusterNumber++)
{
  int[] members = cs.Cluster(clusterNumber);
  Console.Write("Objects in cluster {0}: ", clusterNumber);
  for (int i = 0; i &lt; members.Length; i++)
  {
    Console.Write("{0} ", names[members[i]]);
  }
  Console.WriteLine("\n");
}</pre>
<p>The output is:</p>
<pre>Objects in cluster 0:
Aberfeldy Auchroisk Balmenach Dailuaine Glendronach
Glendullan Glenfarclas Glenrothes Glenturret Macallan
Mortlach RoyalLochnagar Tomore 

Objects in cluster 1:
Aberlour ArranIsleOf Belvenie BenNevis Benriach Benromach
Bladnoch BlairAthol Bowmore Craigallechie Dalmore
Dalwhinnie Deanston GlenElgin GlenGarioch GlenKeith
GlenOrd Glenkinchie Glenlivet Glenlossie Inchgower
Knochando Linkwood OldFettercairn RoyalBrackla
Speyburn Teaninich Tomatin Tomintoul Tullibardine 

Objects in cluster 2:
AnCnoc Ardmore Auchentoshan Aultmore Benrinnes
Bunnahabhain Cardhu Craigganmore Dufftown Edradour
GlenGrant GlenMoray GlenSpey Glenallachie Glenfiddich
Glengoyne Glenmorangie Loch Lomond Longmorn
Mannochmore Miltonduff Scapa Speyside Strathisla
Strathmill Tamdhu Tamnavulin Tobermory 

Objects in cluster 3:
Ardbeg Balblair Bruichladdich Caol Ila Clynelish
GlenDeveronMacduff GlenScotia Highland Park
Isle of Jura Lagavulin Laphroig Oban OldPulteney
Springbank Talisker</pre>
<p>Once again using the cluster assignments to color the objects in the plane of the first two principal components, we can see the grouping represented by the consensus matrix (k=4).</p>
<p><img decoding="async" loading="lazy" class="alignnone size-full wp-image-1049" title="nmf2" src="https://www.centerspace.net/blog/wp-content/uploads/2010/01/nmf2.png" alt="nmf2" width="448" height="358" srcset="https://www.centerspace.net/wp-content/uploads/2010/01/nmf2.png 448w, https://www.centerspace.net/wp-content/uploads/2010/01/nmf2-300x239.png 300w" sizes="(max-width: 448px) 100vw, 448px" /></p>
<p>Well, this concludes our tour through the NMath clustering functionality. Techniques such as principal component analysis, <em>k</em>-means clustering, hierarchical cluster analysis, and non-negative matrix factorization can all be applied to data such as these to explore various clusterings. Choosing among these approaches is ultimately a matter of domain knowledge and performance requirements. Is it appropriate to cluster based on distance in the original space, or should dimension reduction be applied first? If dimension reduction is used, are negative component parameters meaningful? Are there sufficient computational resources available to construct a complete hierarchical cluster tree, or should a <em>k</em>-means approach be used? If a hierarchical cluster tree is computed, what distance and linkage functions should be used? NMath provides a powerful, flexible set of clustering tools for data mining and data analysis.</p>
<p>Ken</p>
<p><strong>References</strong></p>
<p>Young, S.S., Fogel, P., Hawkins, D. M. (unpublished manuscript). “Clustering Scotch Whiskies using Non-Negative Matrix Factorization”. Retrieved December 15, 2009 from <a href="http://www.niss.org/sites/default/files/ScotchWhisky.pdf">http://niss.org/sites/default/files/ScotchWhisky.pdf</a>.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/clustering-analysis-part-v-monte-carlo-nmf">Cluster Analysis, Part V: Monte Carlo NMF</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/clustering-analysis-part-v-monte-carlo-nmf/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1031</post-id>	</item>
		<item>
		<title>Cluster Analysis, Part IV: Non-negative Matrix Factorization (NMF)</title>
		<link>https://www.centerspace.net/clustering-analysis-part-iv-non-negative-matrix-factorization</link>
					<comments>https://www.centerspace.net/clustering-analysis-part-iv-non-negative-matrix-factorization#comments</comments>
		
		<dc:creator><![CDATA[Ken Baldwin]]></dc:creator>
		<pubDate>Wed, 06 Jan 2010 16:48:14 +0000</pubDate>
				<category><![CDATA[NMath Stats Tutorial]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[clustering .NET]]></category>
		<category><![CDATA[clustering C#]]></category>
		<category><![CDATA[NMF]]></category>
		<category><![CDATA[NMF .NET]]></category>
		<category><![CDATA[NMF C#]]></category>
		<category><![CDATA[nonnegative matrix factorization]]></category>
		<category><![CDATA[nonnegative matrix factorization .NET]]></category>
		<category><![CDATA[nonnegative matrix factorization C#]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=808</guid>

					<description><![CDATA[<p>In this continuing series, we explore the NMath Stats functions for performing cluster analysis. (For previous posts, see Part 1 &#8211; PCA , Part 2 &#8211; K-Means, and Part 3 &#8211; Hierarchical.) The sample data set we&#8217;re using classifies 89 single malt scotch whiskies on a five-point scale (0-4) for 12 flavor characteristics. To visualize [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/clustering-analysis-part-iv-non-negative-matrix-factorization">Cluster Analysis, Part IV: Non-negative Matrix Factorization (NMF)</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>In this continuing series, we explore the NMath Stats functions for performing cluster analysis. (For previous posts, see <a href="/clustering-analysis-part-iv-non-negative-matrix-factorization/">Part 1 &#8211; PCA </a>, <a href="/clustering-analysis-part-iv-non-negative-matrix-factorization/">Part 2 &#8211; K-Means</a>, and <a href=" https://www.centerspace.net/drawing-dendrograms/">Part 3 &#8211; Hierarchical</a>.) The sample data set we&#8217;re using classifies 89 single malt scotch whiskies on a five-point scale (0-4) for 12 flavor characteristics. To visualize the data set and clusterings, we make use of the free Microsoft Chart Controls for .NET, which provide a basic set of charts.</p>
<p>In this post, we&#8217;ll cluster the scotches using non-negative matrix factorization (NMF). NMF approximately factors a matrix <em>V</em> into two matrices, <em>W</em> and <em>H</em>:</p>
<p style="text-align: center;"><img decoding="async" loading="lazy" class="alignnone size-full wp-image-962" title="wh" src="https://www.centerspace.net/blog/wp-content/uploads/2010/01/wh.gif" alt="wh" width="60" height="20" /></p>
<p>If <em>V</em> is an <em>n</em> x <em>m</em> matrix, then NMF can be used to approximately factor <em>V</em> into an <em>n</em> x <em>r</em> matrix <em>W</em> and an <em>r</em> x <em>m</em> matrix <em>H</em>. Usually <em>r</em> is chosen to be much smaller than either <em>m</em> or <em>n</em>, for dimension reduction. Thus, each column of <em>V</em> is approximated by a linear combination of the columns of <em>W</em>, with the coefficients given by the corresponding column of <em>H</em>. This extracts underlying features of the data as basis vectors in <em>W</em>, which can then be used for identification, clustering, and compression.<br />
<span id="more-808"></span><br />
Earlier in this series, we used principal component analysis (PCA) as a means of dimension reduction for the purposes of visualizing the scotch data. NMF differs from PCA in two important respects:</p>
<ol>
<li>NMF enforces the constraint that the factors <em>W</em> and <em>H</em> must be non-negative; that is, all elements must be greater than or equal to zero. By not allowing negative entries in <em>W</em> and <em>H</em>, NMF enables a non-subtractive combination of parts to form a whole, and in some contexts, more meaningful basis vectors. In the scotch data, for example, what would it mean for a scotch to have a negative value for a flavor characteristic?</li>
<li>NMF does not require the basis vectors to be orthogonal. If we are using NMF to extract meaningful underlying components of the data, there is no <em>a priori</em> reason to require the components to be orthogonal.</li>
</ol>
<p>Let&#8217;s begin by reproducing the NMF analysis of the scotch data presented in Young <em>et al.</em> The authors performed NMF with <em>r</em>=4 to identify four major flavor factors in scotch whiskies, and then asked whether there are single malts that appear to be relatively pure embodiments of these four flavor profiles.</p>
<p>NMath Stats provides class <a href="https://www.centerspace.net/doc/NMath/ref/html/T_CenterSpace_NMath_Core_NMFClustering_1.htm">NMFClustering </a>for performing data clustering using iterative non-negative matrix factorization (NMF), where each iteration step produces a new <em>W</em> and <em>H</em>. At each iteration, each column of <em>V</em> is placed into the cluster corresponding to the column of <em>W</em> which has the largest coefficient in <em>H</em>. That is, column <em>j</em> of <em>V</em> is placed in cluster <em>i</em> if the entry <em>h<sub>ij</sub></em> is the largest entry in column <em>j</em> of <em>H</em>. Results are returned as an adjacency matrix whose <em>i</em>, <em>j</em>th value is 1 if columns <em>i</em> and <em>j</em> of <em>V</em> are in the same cluster, and 0 if they are not. Iteration stops when the clustering of the columns of <em>V</em> stabilizes.</p>
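<p>The argmax rule and the resulting adjacency matrix can be sketched in a few lines of Python; these are illustrative helpers, not the NMFClustering API:</p>

```python
def assign_clusters(H):
    """Assign column j of V to the cluster i whose entry H[i][j] is
    largest, i.e. the argmax over the rows of H."""
    k = len(H)        # number of clusters (rows of H)
    m = len(H[0])     # number of objects (columns of V)
    return [max(range(k), key=lambda i: H[i][j]) for j in range(m)]

def connectivity(labels):
    """Adjacency matrix with a 1 wherever two objects share a cluster."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]

H = [[0.9, 0.1, 0.2],
     [0.1, 0.8, 0.7]]
labels = assign_clusters(H)      # [0, 1, 1]
adjacency = connectivity(labels)
```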
<p>NMFClustering is parameterized on the NMF update algorithm to use. For instance:</p>
<pre>NMFClustering&lt;NMFDivergenceUpdate&gt; nmf =
  new NMFClustering&lt;NMFDivergenceUpdate&gt;();</pre>
<p>This specifies the <em>divergence update</em> algorithm, which minimizes a divergence functional related to the Poisson likelihood of generating V from W and H. (For more information, see Brunet <em>et al.</em>, 2004.)</p>
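<p>The functional being minimized can be written down directly. Here is a Python sketch of the divergence D(V || WH), with the convention that 0&#183;log 0 = 0; this illustrates the objective only, not NMath&#8217;s implementation:</p>

```python
import math

def divergence(V, WH):
    """Divergence objective minimized by the divergence-update algorithm:
    the sum over entries of V*log(V/WH) - V + WH, with 0*log(0) taken
    as 0. It is related to the negative log-likelihood of V under a
    Poisson model with mean WH."""
    total = 0.0
    for v_row, wh_row in zip(V, WH):
        for v, wh in zip(v_row, wh_row):
            if v > 0:
                total += v * math.log(v / wh)
            total += wh - v
    return total

V = [[1.0, 2.0], [3.0, 4.0]]
divergence(V, V)  # exactly 0.0 when WH reproduces V
```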
<p>The Factor() method performs the actual iterative factorization. The following C# code clusters the scotch data (loaded into a dataframe in <a href="/clustering-analysis-part-iv-non-negative-matrix-factorization/">Part I</a>) into four clusters:</p>
<pre lang="csharp">int k = 4;

// specify starting conditions (optional)
int seed = 1973;
RandGenUniform rnd = new RandGenUniform(seed);
DoubleMatrix starting_W = new DoubleMatrix(data.Cols, k, rnd);
DoubleMatrix starting_H = new DoubleMatrix(k, data.Rows, rnd);

nmf.Factor(data.ToDoubleMatrix().Transpose(),
           k,
           starting_W,
           starting_H);
Console.WriteLine("Factorization converged in {0} iterations.\n",
                   nmf.Iterations);</pre>
<p>There are a couple things to note in this code:</p>
<ul>
<li>By default, NMFClustering uses random starting values for W and H. This, coupled with the fact that the factorization is not unique, means that if you cluster the columns of V multiple times, you may get different final clusterings. In order to reproduce the results in Young <em>et al.</em>, the code above specifies a particular random seed for the initial conditions.</li>
<li>The scotch data needs to be transposed before clustering, since NMFClustering requires each object to be clustered to be a column in the input matrix.</li>
</ul>
<p>The output is:</p>
<pre>Factorization converged in 530 iterations.</pre>
<p>We can examine the four flavor factors (columns of W) to see what linear combination of the original flavor characteristics each represents. The following code orders each factor, normalized so the largest value is 1.0, similar to the data shown in Table 1 of Young <em>et al</em>.:</p>
<pre lang="csharp">ReproduceTable1(nmf.W, data.ColumnHeaders);

private static void ReproduceTable1(DoubleMatrix W,
  object[] rowKeys)
{
  // normalize
  for (int i = 0; i &lt; W.Cols; i++)
  {
    W[Slice.All, i] /= NMathFunctions.MaxValue(W.Col(i));
  }

  // Create data frame to hold W
  string[] factorNames = GetFactorNames(W.Cols);
  DataFrame df_W = new DataFrame(W, factorNames);
  df_W.SetRowKeys(rowKeys);

  // Print out sorted columns
  for (int i = 0; i &lt; df_W.Cols; i++)
  {
    df_W.SortRows(new int[] { i },
                         new SortingType[] { SortingType.Descending });
    Console.WriteLine(df_W[Slice.All, new Slice(i, 1)]);
    Console.WriteLine();
  }
  Console.WriteLine();
}</pre>
<p>The output is:</p>
<pre>#	Factor 0
Fruity	1.0000
Floral	0.8681
Sweetness	0.8292
Malty	0.6568
Nutty	0.5855
Body	0.4295
Smoky	0.2805
Honey	0.2395
Spicy	0.0000
Winey	0.0000
Tobacco	0.0000
Medicinal	0.0000 

#	Factor 1
Winey	1.0000
Body	0.6951
Nutty	0.5078
Sweetness	0.4257
Honey	0.3517
Malty	0.3301
Fruity	0.2949
Smoky	0.2631
Spicy	0.0000
Floral	0.0000
Tobacco	0.0000
Medicinal	0.0000 

#	Factor 2
Spicy	1.0000
Honey	0.4885
Sweetness	0.4697
Floral	0.4301
Smoky	0.3508
Malty	0.3492
Body	0.3160
Fruity	0.0036
Nutty	0.0000
Winey	0.0000
Tobacco	0.0000
Medicinal	0.0000 

#	Factor 3
Medicinal	1.0000
Smoky	0.8816
Body	0.7873
Spicy	0.3936
Sweetness	0.3375
Malty	0.3069
Nutty	0.2983
Fruity	0.2441
Tobacco	0.2128
Floral	0.0000
Winey	0.0000
Honey	0.0000</pre>
<p>Thus:</p>
<ul>
<li>Factor 0 contains Fruity, Floral, and Sweetness flavors.</li>
<li>Factor 1 emphasizes the Winey flavor.</li>
<li>Factor 2 contains Spicy and Honey flavors.</li>
<li>Factor 3 contains Medicinal and Smokey flavors.</li>
</ul>
<p>The objects are placed into clusters corresponding to the column of W which has the largest coefficient in H. The following C# code prints out the contents of each cluster, ordered by largest coefficient, after normalizing so the sum of each component is 1.0:</p>
<pre lang="csharp">ReproduceTable2(nmf.H, data.RowKeys, nmf.ClusterSet);

private static void ReproduceTable2(DoubleMatrix H, object[] rowKeys, ClusterSet cs)
{
  // normalize
  for (int i = 0; i &lt; H.Rows; i++)
  {
    H[i, Slice.All] /= NMathFunctions.Sum(H.Row(i));
  }

  // Create data frame to hold H
  string[] factorNames = GetFactorNames(H.Rows);
  DataFrame df_H = new DataFrame(H.Transpose(), factorNames);
  df_H.SetRowKeys(rowKeys);

  // Print information on each cluster
  for (int clusterNumber = 0; clusterNumber &lt; cs.NumberOfClusters; clusterNumber++)
  {
    int[] members = cs.Cluster(clusterNumber);
    int factor = NMathFunctions.MaxIndex(H.Col(members[0]));
    Console.WriteLine("Cluster {0} ordered by {1}: ", clusterNumber, factorNames[factor]);

    DataFrame cluster = df_H[new Subset(members), Slice.All];
    cluster.SortRows(new int[] { factor }, new SortingType[] { SortingType.Descending });

    Console.WriteLine(cluster);
    Console.WriteLine();
  }
}</pre>
<p>The output is:</p>
<pre>Cluster 0 ordered by Factor 1:
#	        Factor 0	Factor 1	Factor 2	Factor 3
Glendronach	0.0000	0.0567	0.0075	0.0000
Macallan	0.0085	0.0469	0.0083	0.0000
Balmenach	0.0068	0.0395	0.0123	0.0000
Dailuaine	0.0070	0.0317	0.0164	0.0000
Mortlach	0.0060	0.0316	0.0240	0.0000
Tomore	        0.0000	0.0308	0.0000	0.0000
RoyalLochnagar	0.0104	0.0287	0.0164	0.0000
Glenrothes	0.0054	0.0280	0.0081	0.0000
Glenfarclas	0.0127	0.0279	0.0164	0.0000
Auchroisk	0.0103	0.0267	0.0099	0.0000
Aberfeldy	0.0125	0.0238	0.0117	0.0000
Strathisla	0.0162	0.0229	0.0151	0.0000
Glendullan	0.0140	0.0228	0.0102	0.0000
BlairAthol	0.0111	0.0211	0.0166	0.0000
Dalmore	        0.0088	0.0208	0.0114	0.0204
Ardmore	        0.0104	0.0182	0.0118	0.0000 

Cluster 1 ordered by Factor 2:
#	        Factor 0	Factor 1	Factor 2	Factor 3
Tomatin	        0.0000	0.0170	0.0306	0.0000
Aberlour	0.0136	0.0260	0.0282	0.0000
Belvenie	0.0087	0.0123	0.0262	0.0000
GlenGarioch	0.0079	0.0086	0.0252	0.0000
Speyburn	0.0115	0.0000	0.0244	0.0000
BenNevis	0.0202	0.0000	0.0242	0.0000
Bowmore	        0.0049	0.0109	0.0225	0.0186
Inchgower	0.0104	0.0000	0.0218	0.0118
Craigallechie	0.0131	0.0098	0.0216	0.0136
Tomintoul	0.0085	0.0083	0.0214	0.0000
Benriach	0.0150	0.0000	0.0214	0.0000
Glenlivet	0.0125	0.0176	0.0205	0.0000
Glenturret	0.0080	0.0228	0.0203	0.0000
Benromach	0.0132	0.0140	0.0198	0.0000
Glenkinchie	0.0112	0.0000	0.0190	0.0000
OldFettercairn	0.0068	0.0137	0.0182	0.0160
Knochando	0.0131	0.0133	0.0179	0.0000
GlenOrd	        0.0118	0.0128	0.0175	0.0000
Glenlossie	0.0143	0.0000	0.0167	0.0000
GlenDeveronMacduff	0.0000	0.0156	0.0158	0.0216
GlenKeith	0.0108	0.0146	0.0145	0.0000
ArranIsleOf	0.0073	0.0086	0.0127	0.0125
GlenSpey	0.0086	0.0091	0.0119	0.0000 

Cluster 2 ordered by Factor 0:
#	        Factor 0	Factor 1	Factor 2	Factor 3
AnCnoc	        0.0294	0.0000	0.0000	0.0000
Miltonduff	0.0242	0.0000	0.0000	0.0000
Aultmore	0.0242	0.0000	0.0000	0.0000
Longmorn	0.0214	0.0141	0.0089	0.0000
Cardhu	        0.0204	0.0000	0.0094	0.0000
Auchentoshan	0.0203	0.0000	0.0065	0.0000
Strathmill	0.0203	0.0000	0.0125	0.0000
Edradour	0.0195	0.0172	0.0092	0.0000
Tobermory	0.0190	0.0000	0.0000	0.0000
Glenfiddich	0.0190	0.0000	0.0000	0.0000
Tamnavulin	0.0189	0.0000	0.0148	0.0000
Dufftown	0.0189	0.0000	0.0000	0.0147
Craigganmore	0.0184	0.0000	0.0030	0.0254
Speyside	0.0182	0.0138	0.0000	0.0000
Glenallachie	0.0178	0.0000	0.0108	0.0000
Dalwhinnie	0.0174	0.0000	0.0172	0.0000
GlenMoray	0.0174	0.0079	0.0157	0.0000
Tamdhu	        0.0172	0.0124	0.0000	0.0000
Glengoyne	0.0170	0.0090	0.0065	0.0000
Benrinnes	0.0158	0.0196	0.0161	0.0000
GlenElgin	0.0155	0.0107	0.0133	0.0000
Bunnahabhain	0.0148	0.0075	0.0078	0.0110
Glenmorangie	0.0143	0.0000	0.0123	0.0166
Scapa	        0.0140	0.0128	0.0089	0.0127
Bladnoch	0.0137	0.0063	0.0088	0.0000
Linkwood	0.0129	0.0165	0.0092	0.0000
Mannochmore	0.0124	0.0126	0.0081	0.0000
GlenGrant	0.0122	0.0121	0.0000	0.0000
Deanston	0.0119	0.0151	0.0122	0.0000
Loch Lomond	0.0105	0.0000	0.0094	0.0130
Tullibardine	0.0099	0.0093	0.0098	0.0138 

Cluster 3 ordered by Factor 3:
#	        Factor 0	Factor 1	Factor 2	Factor 3
Ardbeg	        0.0000	0.0000	0.0000	0.0906
Clynelish	0.0001	0.0000	0.0000	0.0855
Lagavulin	0.0000	0.0138	0.0000	0.0740
Laphroig	0.0000	0.0082	0.0000	0.0731
Talisker	0.0030	0.0000	0.0129	0.0706
Caol Ila	0.0048	0.0000	0.0019	0.0694
Oban	        0.0067	0.0000	0.0008	0.0564
OldPulteney	0.0114	0.0073	0.0000	0.0429
Isle of Jura	0.0079	0.0000	0.0059	0.0352
Balblair	0.0125	0.0000	0.0074	0.0297
Springbank	0.0000	0.0142	0.0189	0.0282
RoyalBrackla	0.0122	0.0078	0.0135	0.0276
GlenScotia	0.0096	0.0144	0.0000	0.0275
Bruichladdich	0.0100	0.0098	0.0140	0.0249
Teaninich	0.0081	0.0000	0.0111	0.0216
Highland Park	0.0050	0.0145	0.0146	0.0211</pre>
<p>These data are very similar to those shown in Table 2 in Young <em>et al</em>. According to their analysis, the most representative malts in each cluster are:</p>
<ul>
<li>Glendronach and Macallan</li>
<li>Tomatin and Speyburn</li>
<li>AnCnoc and Miltonduff</li>
<li>Ardbeg and Clynelish</li>
</ul>
<p>As you can see, these scotches are at, or very near, the top of each ordered cluster in the output above.</p>
<p>Finally, it is interesting to view the clusters found by NMF in the same plane of the first two principal components that we have looked at previously.</p>
<p><img decoding="async" loading="lazy" class="alignnone size-full wp-image-977" title="nmf1" src="https://www.centerspace.net/blog/wp-content/uploads/2010/01/nmf1.png" alt="nmf1" width="355" height="356" srcset="https://www.centerspace.net/wp-content/uploads/2010/01/nmf1.png 355w, https://www.centerspace.net/wp-content/uploads/2010/01/nmf1-150x150.png 150w, https://www.centerspace.net/wp-content/uploads/2010/01/nmf1-299x300.png 299w" sizes="(max-width: 355px) 100vw, 355px" /></p>
<p>If you compare this plot to that produced by <em><a href="/clustering-analysis-part-ii-k-means-clustering">k</a></em><a href="/clustering-analysis-part-ii-k-means-clustering">-means clustering</a> or <a href="/clustering-analysis-part-iii-hierarchical-cluster-analysis">hierarchical cluster analysis</a>, you can see how different the results are. We are no longer clustering based on &#8220;similarity&#8221; in the original 12-dimensional flavor space (of which this is a view). Instead, we&#8217;ve used a reduced set of synthetic dimensions which capture underlying features in the data.</p>
<p>In order to produce results similar to those of Young <em>et al</em>., we explicitly specified a random seed to the NMF process. With different seeds, somewhat different final clusterings can occur. In the final post in this series, we&#8217;ll look at how NMath provides a Monte Carlo method for performing multiple NMF clusterings using different random starting conditions, and combining the results.</p>
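<p>The role of the random seed is easy to see in a minimal multiplicative-update NMF. The following Python sketch of the Lee-Seung updates is purely illustrative (it is not the NMath implementation, and all names are ours); the seed fixes the random starting factors, and therefore which local minimum the iteration converges to:</p>
<pre lang="python">import random

def matmul(A, B):
    # naive matrix product
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def nmf(V, rank, iters=200, seed=0):
    """Factor nonnegative V as W * H by Lee-Seung multiplicative updates.
    The seed fixes the random starting factors, and hence the local minimum."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    eps = 1e-9
    for _ in range(iters):
        WH, Wt = matmul(W, H), transpose(W)
        num, den = matmul(Wt, V), matmul(Wt, WH)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        WH, Ht = matmul(W, H), transpose(H)
        num, den = matmul(V, Ht), matmul(WH, Ht)
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return W, H

def sq_error(V, W, H):
    WH = matmul(W, H)
    return sum((V[i][j] - WH[i][j]) ** 2
               for i in range(len(V)) for j in range(len(V[0])))
</pre>
<p>Running this with two different seeds on the same matrix often yields two different (but comparably good) factorizations, which is why a consensus across many runs is useful.</p>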
<p>Ken</p>
<h3>References</h3>
<p>Brunet, Jean-Philippe et al. (2004). &#8220;Metagenes and Molecular Pattern Discovery Using Matrix Factorization&#8221;, <em>Proceedings of the National Academy of Sciences</em> 101, no. 12 (March 23, 2004): 4164-4169.</p>
<p>Young, S.S., Fogel, P., Hawkins, D. M. (unpublished manuscript). “Clustering Scotch Whiskies using Non-Negative Matrix Factorization”. Retrieved December 15, 2009 from <a href="http://www.niss.org/sites/default/files/ScotchWhisky.pdf">http://www.niss.org/sites/default/files/ScotchWhisky.pdf</a>.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/clustering-analysis-part-iv-non-negative-matrix-factorization">Cluster Analysis, Part IV: Non-negative Matrix Factorization (NMF)</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/clustering-analysis-part-iv-non-negative-matrix-factorization/feed</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">808</post-id>	</item>
		<item>
		<title>Clustering Analysis, Part III: Hierarchical Cluster Analysis</title>
		<link>https://www.centerspace.net/clustering-analysis-part-iii-hierarchical-cluster-analysis</link>
					<comments>https://www.centerspace.net/clustering-analysis-part-iii-hierarchical-cluster-analysis#comments</comments>
		
		<dc:creator><![CDATA[Ken Baldwin]]></dc:creator>
		<pubDate>Mon, 28 Dec 2009 08:55:28 +0000</pubDate>
				<category><![CDATA[NMath Stats Tutorial]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[clustering .NET]]></category>
		<category><![CDATA[clustering C#]]></category>
		<category><![CDATA[hierarchical cluster analysis]]></category>
		<category><![CDATA[hierarchical cluster analysis .NET]]></category>
		<category><![CDATA[hierarchical cluster analysis C#]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=781</guid>

					<description><![CDATA[<p>In this continuing series, we explore the NMath Stats functions for performing cluster analysis. (For previous posts, see Part 1 &#8211; PCA and Part 2 &#8211; K-Means.) The sample data set we&#8217;re using classifies 89 single malt scotch whiskies on a five-point scale (0-4) for 12 flavor characteristics. To visualize the data set and clusterings, [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/clustering-analysis-part-iii-hierarchical-cluster-analysis">Clustering Analysis, Part III: Hierarchical Cluster Analysis</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>In this continuing series, we explore the NMath Stats functions for performing cluster analysis. (For previous posts, see <a href="/clustering-analysis-part-i-principal-component-analysis-pca">Part 1 &#8211; PCA</a> and <a href="/clustering-analysis-part-ii-k-means-clustering">Part 2 &#8211; K-Means</a>.) The sample data set we&#8217;re using classifies 89 single malt scotch whiskies on a five-point scale (0-4) for 12 flavor characteristics. To visualize the data set and clusterings, we make use of the free Microsoft Chart Controls for .NET, which provide a basic set of charts.</p>
<p>In this post, we&#8217;ll cluster the scotches based on &#8220;similarity&#8221; in the original 12-dimensional flavor space using hierarchical cluster analysis. In hierarchical cluster analysis, each object is initially assigned to its own singleton cluster. The analysis then proceeds iteratively, at each stage joining the two most &#8220;similar&#8221; clusters into a new cluster, continuing until there is one overall cluster. In NMath Stats, class <a href="https://www.centerspace.net/doc/NMath/ref/html/T_CenterSpace_NMath_Core_ClusterAnalysis.htm">ClusterAnalysis</a> performs hierarchical cluster analyses.<br />
<span id="more-781"></span><br />
The clustering process is governed by two functions:</p>
<ul>
<li>A <em>distance function</em> computes the distance between individual objects. In NMath Stats, the distance function is encapsulated in a Distance.Function delegate, which takes two vectors and returns a measure of the distance (similarity) between them. Delegates are provided as static variables on class <a href="https://www.centerspace.net/doc/NMath/ref/html/T_CenterSpace_NMath_Core_Distance.htm">Distance</a> for many common distance functions. You can also define your own delegate.</li>
<li>A <em>linkage function</em> computes the distance between clusters. In NMath Stats, the linkage function is encapsulated in a Linkage.Function delegate. When two groups P and Q are united, a linkage function computes the distance between the new combined group P + Q and another group R. Delegates are provided as static variables on class Linkage for many common linkage functions. Again, you can also define your own delegate.</li>
</ul>
<p>Based on the choice of distance and linkage function, radically different clusterings can often result. Ultimately, background knowledge of the domain is required to choose between them.</p>
<p>In this case, we&#8217;ll use the Euclidean distance function and the Ward linkage function. The Ward linkage function computes the distance between two clusters using Ward&#8217;s method, which tends to produce compact groups of well-distributed size. Ward&#8217;s method uses an analysis of variance approach to evaluate the distances between clusters. The smaller the increase in the total within-group sum of squares as a result of joining two clusters, the closer they are. The within-group sum of squares of a cluster is defined as the sum of the squares of the distance between all objects in the cluster and the centroid of the cluster.</p>
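<p>The within-group sum of squares that drives Ward&#8217;s method is simple to compute directly. Here is an illustrative Python sketch (not the NMath code; names are ours) of the quantity being minimized and the resulting merge cost:</p>
<pre lang="python">def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def within_group_ss(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((p[d] - c[d]) ** 2 for d in range(len(c))) for p in points)

def ward_merge_cost(cluster_p, cluster_q):
    """Increase in total within-group SS if P and Q are joined.
    At each step, Ward's method joins the pair with the smallest cost."""
    return (within_group_ss(cluster_p + cluster_q)
            - within_group_ss(cluster_p) - within_group_ss(cluster_q))
</pre>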
<p>This code clusters the scotch data (loaded into a DataFrame in <a href="/clustering-analysis-part-i-principal-component-analysis-pca">Part I</a>), then cuts the hierarchical cluster tree at the level of four clusters:</p>
<pre lang="csharp">ClusterAnalysis ca = new ClusterAnalysis(df,
    Distance.EuclideanDistance,
    Linkage.WardFunction
);
ClusterSet cs = ca.CutTree(4);</pre>
<p>Printing out the cluster members from the cluster set, as shown in <a href="/clustering-analysis-part-ii-k-means-clustering">Part II</a>, produces:</p>
<pre class="code">Objects in cluster 0:
Aberfeldy Aberlour Ardmore Auchroisk Balmenach Belvenie BenNevis
Benriach Benrinnes Benromach BlairAthol Dailuaine Dalmore
Deanston Edradour GlenElgin GlenKeith GlenOrd Glendronach
Glendullan Glenfarclas Glenlivet Glenrothes Glenturret Knochando
Linkwood Longmorn Macallan Mortlach OldFettercairn RoyalBrackla
RoyalLochnagar Strathisla Tullibardine 

Objects in cluster 1:
AnCnoc ArranIsleOf Auchentoshan Aultmore Bladnoch Bunnahabhain
Cardhu Craigallechie Dalwhinnie Dufftown GlenDeveronMacduff
GlenGrant GlenMoray GlenSpey Glenallachie Glenfiddich Glengoyne
Glenkinchie Glenlossie Glenmorangie Inchgower Loch Lomond
Mannochmore Miltonduff Scapa Speyburn Speyside Tamdhu Tobermory
Tomintoul Tomore 

Objects in cluster 2:
Ardbeg Caol Ila Clynelish Lagavulin Laphroig Talisker 

Objects in cluster 3:
Balblair Bowmore Bruichladdich Craigganmore GlenGarioch
GlenScotia Highland Park Isle of Jura Oban OldPulteney
Springbank Strathmill Tamnavulin Teaninich Tomatin</pre>
<p>Coloring the objects based on cluster assignment in the plot of the first two principal components shows how similar this clustering is to the results of <em>k</em>-means clustering.</p>
<p><img decoding="async" loading="lazy" class="alignnone size-full wp-image-887" title="hierarchical1" src="https://www.centerspace.net/blog/wp-content/uploads/2009/12/hierarchical11.png" alt="hierarchical1" width="456" height="371" srcset="https://www.centerspace.net/wp-content/uploads/2009/12/hierarchical11.png 456w, https://www.centerspace.net/wp-content/uploads/2009/12/hierarchical11-300x244.png 300w" sizes="(max-width: 456px) 100vw, 456px" /></p>
<p>Again, remember that although we’ve used dimension reduction (principal component analysis, in this case) to visualize the clustering, the clustering itself was performed based on similarity in the original 12-dimensional flavor space, not based on distance in this plane.</p>
<p>Because we have the entire hierarchical cluster tree, we can cut the tree at different levels. For example, into six clusters:</p>
<pre class="code">Objects in cluster 0:
Aberfeldy Aberlour Ardmore Auchroisk Belvenie BenNevis Benriach
Benrinnes Benromach BlairAthol Deanston Edradour GlenElgin
GlenKeith GlenOrd Glendullan Glenfarclas Glenlivet Glenrothes
Glenturret Knochando Linkwood Longmorn OldFettercairn
RoyalBrackla Strathisla Tullibardine 

Objects in cluster 1:
AnCnoc ArranIsleOf Auchentoshan Aultmore Bladnoch Bunnahabhain
Cardhu Craigallechie Dalwhinnie Dufftown GlenDeveronMacduff
GlenGrant GlenMoray GlenSpey Glenallachie Glenfiddich Glengoyne
Glenkinchie Glenlossie Glenmorangie Inchgower Loch Lomond
Mannochmore Miltonduff Scapa Speyburn Speyside Tamdhu
Tobermory Tomintoul Tomore 

Objects in cluster 2:
Ardbeg Caol Ila Clynelish Lagavulin Laphroig Talisker 

Objects in cluster 3:
Balblair Craigganmore GlenGarioch Oban Strathmill Tamnavulin
Teaninich 

Objects in cluster 4:
Balmenach Dailuaine Dalmore Glendronach Macallan Mortlach
RoyalLochnagar 

Objects in cluster 5:
Bowmore Bruichladdich GlenScotia Highland Park Isle of Jura
OldPulteney Springbank Tomatin</pre>
<p>Coloring the objects based on cluster assignment in the plot of the first two principal components:</p>
<p><img decoding="async" loading="lazy" class="alignnone size-full wp-image-888" title="hierarchical2" src="https://www.centerspace.net/blog/wp-content/uploads/2009/12/hierarchical21.png" alt="hierarchical2" width="454" height="362" srcset="https://www.centerspace.net/wp-content/uploads/2009/12/hierarchical21.png 454w, https://www.centerspace.net/wp-content/uploads/2009/12/hierarchical21-300x239.png 300w" sizes="(max-width: 454px) 100vw, 454px" /></p>
<p>Clusters (1, 2) are unchanged, while clusters (0, 3) are now split into sub-clusters.</p>
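<p>Cutting the tree at different levels is cheap because agglomerative clustering records every merge; a CutTree-style operation just replays the merges until the requested number of clusters remains. A bare-bones Python sketch (single linkage for brevity; illustrative only, not the NMath internals):</p>
<pre lang="python">def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(points, ca, cb):
    # distance between two clusters: closest pair of members (single linkage)
    return min(euclidean(points[a], points[b]) for a in ca for b in cb)

def agglomerate(points):
    """Record the merge history: each step joins the two closest clusters."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) != 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(points, clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merges[-1][0] + merges[-1][1])
    return merges

def cut_tree(points, k):
    """Replay merges until k clusters remain (the analogue of CutTree(k))."""
    clusters = [[i] for i in range(len(points))]
    for left, right in agglomerate(points):
        if len(clusters) == k:
            break
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
    return clusters
</pre>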
<p>Both of the clustering techniques we&#8217;ve looked at so far&#8211;<em>k</em>-means and hierarchical cluster analysis&#8211;have clustered the scotches based on similarity in the original 12-dimensional flavor space. In the next post, we look at clustering using non-negative matrix factorization (NMF), which can be used to cluster the objects using a reduced set of synthetic dimensions.</p>
<p>Ken</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/clustering-analysis-part-iii-hierarchical-cluster-analysis">Clustering Analysis, Part III: Hierarchical Cluster Analysis</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/clustering-analysis-part-iii-hierarchical-cluster-analysis/feed</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">781</post-id>	</item>
		<item>
		<title>Clustering Analysis, Part II: K-Means Clustering</title>
		<link>https://www.centerspace.net/clustering-analysis-part-ii-k-means-clustering</link>
					<comments>https://www.centerspace.net/clustering-analysis-part-ii-k-means-clustering#comments</comments>
		
		<dc:creator><![CDATA[Ken Baldwin]]></dc:creator>
		<pubDate>Mon, 21 Dec 2009 19:53:54 +0000</pubDate>
				<category><![CDATA[NMath Stats Tutorial]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[clustering .NET]]></category>
		<category><![CDATA[clustering C#]]></category>
		<category><![CDATA[k-means]]></category>
		<category><![CDATA[k-means .NET]]></category>
		<category><![CDATA[k-means C#]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=780</guid>

					<description><![CDATA[<p>In this continuing series, we explore the NMath Stats functions for performing cluster analysis in .NET. (For previous posts, see here.) The sample data set we&#8217;re using classifies 89 single malt scotch whiskies on a five-point scale (0-4) for 12 flavor characteristics. To visualize the data set and clusterings, we make use of the free [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/clustering-analysis-part-ii-k-means-clustering">Clustering Analysis, Part II: K-Means Clustering</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>In this continuing series, we explore the NMath Stats functions for performing cluster analysis in .NET. (For previous posts, see <a href="/clustering-analysis-part-i-principal-component-analysis-pca">here</a>.) The sample data set we&#8217;re using classifies 89 single malt scotch whiskies on a five-point scale (0-4) for 12 flavor characteristics. To visualize the data set and clusterings, we make use of the free Microsoft Chart Controls for .NET, which provide a basic set of charts.</p>
<p>In this post, we&#8217;ll cluster the scotches based on &#8220;similarity&#8221; in the original 12-dimensional flavor space using <em>k</em>-means clustering. The <em>k</em>-means clustering method assigns data points into <em>k</em> groups such that the sum of squares from points to the computed cluster centers is minimized. In NMath Stats, class <a href="https://www.centerspace.net/doc/NMath/ref/html/T_CenterSpace_NMath_Core_KMeansClustering.htm">KMeansClustering </a>performs <em>k</em>-means clustering.<br />
<span id="more-780"></span><br />
The algorithm used is that of Hartigan and Wong (1979):</p>
<ul>
<li>For each point, move it to another cluster if that would lower the sum of squares from points to the computed cluster centers.</li>
<li>If a point is moved, immediately update the cluster centers of the two affected clusters.</li>
<li>Repeat until no points are moved, or the specified maximum number of iterations is reached.</li>
</ul>
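<p>The loop above can be sketched in a few lines. This illustrative Python version captures the point-moving idea with immediate center updates, but omits the incremental-cost bookkeeping that makes the real Hartigan&#8211;Wong algorithm efficient (names are ours, not the NMath API):</p>
<pre lang="python">def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    return [sum(p[d] for p in points) / len(points) for d in range(len(points[0]))]

def recenter(data, assign, centers, which):
    for c in which:
        members = [p for p, a in zip(data, assign) if a == c]
        if members:  # keep the old center if a cluster empties
            centers[c] = centroid(members)

def kmeans(data, start_centers, max_iter=100):
    """Assign each point to its nearest center, then repeatedly move points,
    updating the two affected centers immediately after each move."""
    k = len(start_centers)
    centers = [list(c) for c in start_centers]
    assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in data]
    recenter(data, assign, centers, range(k))
    for _ in range(max_iter):
        moved = False
        for i, p in enumerate(data):
            best = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            if best != assign[i]:
                old, assign[i] = assign[i], best
                recenter(data, assign, centers, (old, best))  # immediate update
                moved = True
        if not moved:
            break
    return assign, centers
</pre>
<p>This naive version recomputes the two affected centroids from scratch after every move, which is fine for a sketch but not for large data sets.</p>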
<p>A KMeansClustering instance is constructed from a matrix or a dataframe containing numeric data. Each row in the data set represents an object to be clustered. The Cluster() method clusters the data into the specified number of clusters. The method accepts either <em>k</em>, the number of clusters, or a matrix of initial cluster centers:</p>
<ul>
<li>If <em>k</em> is given, a set of distinct rows in the data matrix is chosen as the initial centers using the algorithm specified by a KMeansClustering.Start enumerated value. By default, rows are chosen at random.</li>
<li>If a matrix of initial cluster centers is given, <em>k</em> is inferred from the number of rows.</li>
</ul>
<p>For example, this C# code clusters the scotch data (loaded into a dataframe in <a href="/clustering-analysis-part-i-principal-component-analysis-pca">Part I</a>) into four clusters:</p>
<pre lang="csharp">KMeansClustering km = new KMeansClustering(df);
int k = 4;
km.Cluster(k, KMeansClustering.Start.QuickCluster);</pre>
<p><em>K</em>-means clustering requires a set of starting cluster centers to initiate the iterative algorithm. By default, rows are chosen at random from the given data. Here we employ the QuickCluster algorithm, similar to the SPSS QuickCluster function, for choosing the starting centers. (The QuickCluster algorithm proceeds as follows: If the distance between row <em>r</em> and its closest center is greater than the distance between the two closest centers <em>(m, n)</em>, then <em>r</em> replaces <em>m</em> or <em>n</em>, whichever is closest to <em>r</em>. Otherwise, if the distance between row <em>r</em> and its closest center (<em>q</em>) is greater than the distance between <em>q</em> and its closest center, then row <em>r</em> replaces <em>q</em>.)</p>
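<p>The replacement rules just described can be written out directly. The following Python sketch follows the parenthetical description above; it is illustrative only (not the SPSS or NMath code), and assumes <em>k</em> is at least 2:</p>
<pre lang="python">def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def quickcluster_start(data, k):
    """Seed with the first k rows, then apply the replacement rules
    to each remaining row."""
    centers = [list(r) for r in data[:k]]
    for r in data[k:]:
        # q: the center closest to row r
        q = min(range(k), key=lambda c: dist(r, centers[c]))
        d_r = dist(r, centers[q])
        # (m, n): the two centers closest to each other
        m, n = min(((i, j) for i in range(k) for j in range(i + 1, k)),
                   key=lambda ij: dist(centers[ij[0]], centers[ij[1]]))
        if d_r > dist(centers[m], centers[n]):
            # r replaces whichever of m, n is closer to r
            closer = min((m, n), key=lambda c: dist(r, centers[c]))
            centers[closer] = list(r)
        elif d_r > min(dist(centers[q], centers[c]) for c in range(k) if c != q):
            centers[q] = list(r)
    return centers
</pre>
<p>The net effect is a set of starting centers that are spread out across the data, rather than an arbitrary random sample.</p>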
<p>The Clusters property returns a <a href="https://www.centerspace.net/doc/NMath/ref/html/T_CenterSpace_NMath_Core_ClusterSet.htm">ClusterSet </a>object, which represents a collection of objects assigned to a finite number of clusters.  The following C# code prints out the members of each cluster:</p>
<pre lang="csharp">ClusterSet cs = km.Clusters;
for (int i = 0; i &lt; cs.NumberOfClusters; i++)
{
     Console.WriteLine("Cluster {0} contains:", i);
     int[] members = cs.Cluster(i);
     for (int j = 0; j &lt; members.Length; j++)
     {
          Console.Write("{0} ", df.RowKeys[members[j]]);
     }
     Console.WriteLine("\n");
}</pre>
<p>The output looks like this:</p>
<pre class="code">Cluster 0 contains:
Aberfeldy Aberlour Ardmore Auchroisk Balmenach Belvenie
BenNevis Benriach Benrinnes Benromach BlairAthol Craigallechie
Dailuaine Dalmore Deanston Edradour GlenKeith GlenOrd
Glendronach Glendullan Glenfarclas Glenlivet Glenrothes
Glenturret Knochando Linkwood Longmorn Macallan Mortlach
OldFettercairn RoyalLochnagar Scapa Strathisla 

Cluster 1 contains:
Ardbeg Caol Ila Clynelish Lagavulin Laphroig Talisker 

Cluster 2 contains:
AnCnoc Auchentoshan Aultmore Bladnoch Bunnahabhain
Cardhu Craigganmore Dalwhinnie Dufftown GlenElgin GlenGrant
GlenMoray GlenSpey Glenallachie Glenfiddich Glengoyne
Glenkinchie Glenlossie Inchgower Loch Lomond Mannochmore
Miltonduff Speyburn Speyside Strathmill Tamdhu Tamnavulin
Tobermory Tomintoul Tomore Tullibardine 

Cluster 3 contains:
ArranIsleOf Balblair Bowmore Bruichladdich GlenDeveronMacduff
GlenGarioch GlenScotia Glenmorangie Highland Park Isle of Jura
Oban OldPulteney RoyalBrackla Springbank Teaninich Tomatin</pre>
<p>To help visualize these clusters, we can once again plot the scotches in the plane formed by the first two principal components (see <a href="/clustering-analysis-part-i-principal-component-analysis-pca">Part I</a>), which collectively account for ~50% of the variance, coloring the points based on cluster assignment.</p>
<p><img decoding="async" loading="lazy" class="alignnone size-full wp-image-840" title="kmeans" src="https://www.centerspace.net/blog/wp-content/uploads/2009/12/kmeans1.png" alt="kmeans" width="451" height="363" srcset="https://www.centerspace.net/wp-content/uploads/2009/12/kmeans1.png 451w, https://www.centerspace.net/wp-content/uploads/2009/12/kmeans1-300x241.png 300w" sizes="(max-width: 451px) 100vw, 451px" /></p>
<p>Remember that although we&#8217;ve used dimension reduction (principal component analysis, in this case) to <em>visualize </em>the clustering, the clustering itself was performed based on similarity in the original 12-dimensional flavor space, not based on distance in this plane. Nonetheless, the clusters look pretty reasonable.</p>
<p><em>K</em>-means clustering is very efficient for large data sets, but does require you to know the number of clusters, <em>k, </em>in advance. Also, you have no control of the similarity metric used to cluster the objects (within-cluster sum of squares). In the next post in this series, we&#8217;ll look at hierarchical cluster analysis, which constructs the entire hierarchical cluster tree, and allows you to specify the distance and linkage functions to use.</p>
<p>Ken</p>
<h3>References</h3>
<p>Hartigan, J.A. and Wong, M.A. (1979). A k-means clustering algorithm. Algorithm AS136, Appl. Stat. 28, pp. 100–108.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/clustering-analysis-part-ii-k-means-clustering">Clustering Analysis, Part II: K-Means Clustering</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/clustering-analysis-part-ii-k-means-clustering/feed</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">780</post-id>	</item>
		<item>
		<title>Clustering Analysis, Part I: Principal Component Analysis (PCA)</title>
		<link>https://www.centerspace.net/clustering-analysis-part-i-principal-component-analysis-pca</link>
					<comments>https://www.centerspace.net/clustering-analysis-part-i-principal-component-analysis-pca#comments</comments>
		
		<dc:creator><![CDATA[Ken Baldwin]]></dc:creator>
		<pubDate>Tue, 15 Dec 2009 21:08:01 +0000</pubDate>
				<category><![CDATA[NMath Stats Tutorial]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[clustering .NET]]></category>
		<category><![CDATA[clustering C#]]></category>
		<category><![CDATA[pca]]></category>
		<category><![CDATA[pca .NET]]></category>
		<category><![CDATA[pca C#]]></category>
		<category><![CDATA[principal component analysis]]></category>
		<category><![CDATA[principal component analysis .NET]]></category>
		<category><![CDATA[principal component analysis C#]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=741</guid>

					<description><![CDATA[<p>Cluster analysis is the assignment of a set of objects into one or more clusters based on object similarity. NMath Stats includes a variety of techniques for [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/clustering-analysis-part-i-principal-component-analysis-pca">Clustering Analysis, Part I: Principal Component Analysis (PCA)</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p><em>Cluster analysis</em> is the assignment of a set of objects into one or more clusters based on object similarity.  NMath Stats includes a variety of techniques for performing cluster analysis, which we will explore in a series of posts.</p>
<h3>The Data Set</h3>
<p>The data set we&#8217;ll use was created by David Wishart (2002), who classified 89 single malt scotch whiskies on a five-point scale (0-4) for 12 flavor characteristics: Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral. Wishart provides clusterings of the whiskies into 4, 6, and 10 clusters.  Young et al. (unpublished manuscript) demonstrate a further clustering into 4 clusters using non-negative matrix factorization (NMF). Both the Young et al. paper and the original data set are available <a href="http://www.niss.org/research/software/irMF">here</a>.<span id="more-741"></span></p>
<h3>Visualization</h3>
<p>To visualize the data set and clusterings, we&#8217;ll make use of the free Microsoft Chart Controls for .NET, which provide a basic set of charts. NMath is also available as a <a href="/partners/">bundle</a> with the Syncfusion Essential Studio and Nevron Chart for .NET at a substantial discount. NMath easily interoperates with most charting packages.</p>
<h3>Getting Started</h3>
<p>To begin, let&#8217;s load the <a href="http://www.niss.org/sites/default/files/ScotchWhisky01.txt">data set</a> into a <a href="https://www.centerspace.net/doc/NMath/user/data-frame.htm">CenterSpace.NMath.DataFrame</a> object:</p>
<pre lang="csharp">DataFrame df =
  DataFrame.Load("ScotchWhisky01.txt", true, true, ",", true);</pre>
<p>The parameters to the Load() method specify:</p>
<ul>
<li>the filename containing the data</li>
<li>whether the data in the file contains column headers</li>
<li>whether the data in the file contains  row keys</li>
<li>the column delimiter</li>
<li>whether to parse the column types, or treat everything as string data.</li>
</ul>
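<p>As a rough mental model of what Load() does with these parameters, here is an illustrative Python sketch (reading from a string instead of a file; a simplification, not the NMath implementation):</p>
<pre lang="python">import csv
import io

def load_frame(text, has_headers=True, has_row_keys=True,
               delimiter=",", parse_types=True):
    """Rough analogue of DataFrame.Load, reading from a string.
    Returns (column_headers, row_keys, columns)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    headers = rows.pop(0) if has_headers else None
    keys = None
    if has_row_keys:
        keys = [r[0] for r in rows]
        rows = [r[1:] for r in rows]
        if headers:
            headers = headers[1:]
    def convert(v):
        # try int, then float, else leave as string
        if not parse_types:
            return v
        for kind in (int, float):
            try:
                return kind(v)
            except ValueError:
                pass
        return v
    columns = [[convert(r[j]) for r in rows] for j in range(len(rows[0]))]
    return headers, keys, columns
</pre>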
<p>The data set includes a leading column of row ids. Let&#8217;s replace these keys with the distillery names, then remove the distillery column from the data frame:</p>
<pre lang="csharp">df.SetRowKeys(df[0]);
df.RowKeyHeader = df.ColumnHeaders[0];
df.RemoveColumn(0);</pre>
<p>The data frame now looks like this:</p>
<pre class="code">Distillery     Body Sweetness Smoky Medicinal Tobacco Honey Spicy Winey Nutty Malty Fruity Floral
Aberfeldy      2 2 2 0 0 2 1 2 2 2 2 2
Aberlour       3 3 1 0 0 4 3 2 2 3 3 2
AnCnoc         1 3 2 0 0 2 0 0 2 2 3 2
Ardbeg         4 1 4 4 0 0 2 0 1 2 1 0
Ardmore        2 2 2 0 0 1 1 1 2 3 1 1
ArranIsleOf    2 3 1 1 0 1 1 1 0 1 1 2
...</pre>
<p>There are 89 rows representing each scotch, and 12 columns representing the score on each flavor characteristic.</p>
<h3>Principal Component Analysis</h3>
<p>Each whisky is represented as a point in a 12-dimensional flavor space. Principal component analysis (PCA) finds a smaller set of synthetic variables that capture the maximum variance in an original data set. The first principal component accounts for as much of the variability in the data as possible, and each succeeding orthogonal component accounts for as much of the remaining variability as possible. In NMath Stats, classes <a href="https://www.centerspace.net/doc/NMath/ref/html/T_CenterSpace_NMath_Core_DoublePCA.htm">DoublePCA</a> and <a href="https://www.centerspace.net/doc/NMath/ref/html/T_CenterSpace_NMath_Core_FloatPCA.htm">FloatPCA</a> perform principal component analyses. (For more information on PCA in NMath Stats, see <a href="/principal-component-analysis/">this page</a>.)</p>
<p>For example, the following C# code constructs a PCA from the whisky data set, then prints the proportion of the variance accounted for by each principal component:</p>
<pre lang="csharp">DoublePCA pca = new DoublePCA(df);
Console.WriteLine("Variance Proportions = " +
  pca.VarianceProportions);
Console.WriteLine("Cumulative Variance Proportions = " +
  pca.CumulativeVarianceProportions);</pre>
<p>The output looks like this:</p>
<pre lang="csharp">Variance Proportions =
[ 0.301109794401424 0.192178864989234 0.0956019274277357
  0.0825032185621017  0.0723086445838344 0.0599231013596576
  0.0510808855222438 0.0458706422880217  0.0349809707532734
  0.0319772808383918 0.0229738209669541 0.00949084830712784 ]

Cumulative Variance Proportions =
[ 0.301109794401424 0.493288659390658 0.588890586818394
  0.671393805380496 0.74370244996433 0.803625551323988
  0.854706436846231 0.900577079134253 0.935558049887526
  0.967535330725918 0.990509151692872 1 ]</pre>
<p>To visualize this information, we can construct a Scree plot, a simple line chart showing the fraction of the total variance explained by each principal component.</p>
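<p>The data behind such a plot comes directly from the VarianceProportions vector. As a minimal sketch (assuming only the DoubleVector indexer and Length property shown above), a crude text version of the Scree plot can be printed to the console:</p>
<pre lang="csharp">// Sketch: one bar per principal component, scaled to 50 characters.
for (int i = 0; i &lt; pca.VarianceProportions.Length; i++)
{
  int bar = (int)Math.Round(50 * pca.VarianceProportions[i]);
  Console.WriteLine("PC{0,2}: {1}", i + 1, new string('*', bar));
}</pre>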
<p><img decoding="async" loading="lazy" class="alignnone size-full wp-image-763" title="scree" src="https://www.centerspace.net/blog/wp-content/uploads/2009/12/scree1.png" alt="scree" width="376" height="338" srcset="https://www.centerspace.net/wp-content/uploads/2009/12/scree1.png 376w, https://www.centerspace.net/wp-content/uploads/2009/12/scree1-300x269.png 300w" sizes="(max-width: 376px) 100vw, 376px" /></p>
<p>As you can see, the first two principal components account for ~50% of the variance.</p>
<p>The Scores property on DoublePCA gets the score matrix: the original data transformed into the space of the principal components. For example, here we create a view of the original 12-dimensional data by plotting the first two principal components for each scotch against each other.</p>
<p><img decoding="async" loading="lazy" class="alignnone size-full wp-image-749" title="pca" src="https://www.centerspace.net/blog/wp-content/uploads/2009/12/pca.png" alt="pca" width="366" height="345" srcset="https://www.centerspace.net/wp-content/uploads/2009/12/pca.png 366w, https://www.centerspace.net/wp-content/uploads/2009/12/pca-300x282.png 300w" sizes="(max-width: 366px) 100vw, 366px" /></p>
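<p>The coordinates for a plot like this can be pulled from the first two columns of the score matrix. Here is a sketch (it assumes Scores is a DoubleMatrix and uses its Col() column accessor):</p>
<pre lang="csharp">// Sketch: scores on the first two principal components,
// one (x, y) pair per whisky, suitable for a scatter plot.
DoubleMatrix scores = pca.Scores;
DoubleVector pc1 = scores.Col(0);
DoubleVector pc2 = scores.Col(1);
for (int i = 0; i &lt; pc1.Length; i++)
{
  Console.WriteLine("{0}\t{1}", pc1[i], pc2[i]);
}</pre>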
<p>The synthetic dimensions themselves are not particularly meaningful. Essentially we&#8217;ve fit a plane into the original 12-dimensional flavor space which accounts for as much of the variance as possible. This can help reveal any natural clustering. In the whisky data, however, there do not appear to be any strong natural clusters, apart from perhaps a group of outliers at the bottom of the plot and another group at the right. Of course, the original flavor characteristics were chosen precisely to avoid any dramatic clustering.</p>
<p>In future posts, we&#8217;ll apply functions in NMath Stats for k-means clustering, hierarchical cluster analysis, and non-negative matrix factorization to explore clusterings in the data.</p>
<p>Ken</p>
<h3>References</h3>
<p>Wishart, D. (2002). <em>Whisky Classified, Choosing Single Malts by Flavor</em>. Pavilion, London.</p>
<p>Young, S.S., Fogel, P., Hawkins, D. M. (unpublished manuscript). &#8220;Clustering Scotch Whiskies using Non-Negative Matrix Factorization&#8221;. Retrieved December 15, 2009 from <a href="http://www.niss.org/sites/default/files/ScotchWhisky.pdf">http://www.niss.org/sites/default/files/ScotchWhisky.pdf</a>.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/clustering-analysis-part-i-principal-component-analysis-pca">Clustering Analysis, Part I: Principal Component Analysis (PCA)</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/clustering-analysis-part-i-principal-component-analysis-pca/feed</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">741</post-id>	</item>
	</channel>
</rss>
