<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>

<channel>
	<title>Performance Archives - CenterSpace</title>
	<atom:link href="https://www.centerspace.net/category/performance/feed" rel="self" type="application/rss+xml" />
	<link>https://www.centerspace.net/category/performance</link>
	<description>.NET numerical class libraries</description>
	<lastBuildDate>Tue, 07 Feb 2023 21:48:41 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.1.1</generator>
<site xmlns="com-wordpress:feed-additions:1">104092929</site>	<item>
		<title>Precision and Reproducibility in Computing</title>
		<link>https://www.centerspace.net/precision-and-reproducibility-in-computing</link>
					<comments>https://www.centerspace.net/precision-and-reproducibility-in-computing#respond</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Mon, 16 Nov 2015 22:32:31 +0000</pubDate>
				<category><![CDATA[MKL]]></category>
		<category><![CDATA[NMath]]></category>
		<category><![CDATA[Object-Oriented Numerics]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[floating point precision]]></category>
		<category><![CDATA[MKL repeatability]]></category>
		<category><![CDATA[MKL reproducibility]]></category>
		<category><![CDATA[NMath repeatability]]></category>
		<category><![CDATA[NMath Reproducibility]]></category>
		<category><![CDATA[repeatability]]></category>
		<category><![CDATA[repeatability in computing]]></category>
		<category><![CDATA[Reproducibility in computing]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=5810</guid>

					<description><![CDATA[<p>Run-to-run reproducibility in computing is often assumed as an obvious truth.  However, software running on modern computer architectures, particularly when coupled with advanced performance-optimized libraries, is often guaranteed to produce reproducible results only up to a certain precision; beyond that, results can and do vary from run to run.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/precision-and-reproducibility-in-computing">Precision and Reproducibility in Computing</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Run-to-run reproducibility in computing is often assumed as an obvious truth.  However, software running on modern computer architectures, particularly when coupled with advanced performance-optimized libraries, is often guaranteed to produce reproducible results only up to a certain precision; beyond that, results can and do vary from run to run.  Reproducibility is interrelated with the precision of floating-point types and the resulting rounding, with operation reordering, with memory structure and use, and finally with how real numbers are represented internally in a computer&#8217;s registers.</p>
<p>This issue of reproducibility arises for <strong>NMath</strong> users when writing and running unit tests, which is why it&#8217;s important, when writing tests, to compare floating point numbers only up to their designed precision, at an absolute maximum.  In the IEEE 754 floating point representation, to which virtually all modern computers adhere, the single precision <code>float</code> type uses 32 bits (4 bytes) and offers 24 bits of precision, or about <em>7 decimal digits</em>, while the double precision <code>double</code> type uses 64 bits (8 bytes) and offers 53 bits of precision, or about <em>15 decimal digits</em>.  Few algorithms can achieve significant results to the 15th decimal place due to rounding, cancellation in subtraction, and other sources of precision degradation.  <strong>NMath&#8217;s</strong> numerical results are tested, at a maximum, to the 14th decimal place.</p>
<h4 style="padding-left: 30px;"><em>A Precision Example</em></h4>
<p style="padding-left: 30px;">As an example, what does the following code output?</p>
<pre style="padding-left: 30px;" lang="csharp">      double x = .050000000000000003;
      double y = .050000000000000000;
      if ( x == y )
        Console.WriteLine( "x is y" );
      else
        Console.WriteLine( "x is not y" );
</pre>
<p style="padding-left: 30px;">I get &#8220;x is y&#8221;, even though the two literals differ mathematically: the value specified for x lies beyond the precision of a <code>double</code> type, so both literals round to the same representable number.</p>
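<p style="padding-left: 30px;">To see why, compare the raw bit patterns of the two values; a minimal sketch using .NET&#8217;s <code>BitConverter</code>:</p>
<pre style="padding-left: 30px;" lang="csharp">      double x = .050000000000000003;
      double y = .050000000000000000;
      // Both literals round to the nearest representable double,
      // so their 64-bit patterns are identical.
      long xBits = BitConverter.DoubleToInt64Bits( x );
      long yBits = BitConverter.DoubleToInt64Bits( y );
      Console.WriteLine( xBits == yBits );  // True
</pre>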
<p>Due to these limits on decimal number representation and the resulting rounding, the numerical results of some operations can be affected by the associative reordering of operations. For example, in some cases <code>a*x + a*z</code> may not equal <code>a*(x + z)</code> with floating point types.  This can be difficult to demonstrate with a modern optimizing compiler, because the code you write and the code that runs may be organized very differently; the two are mathematically equivalent but not necessarily numerically equivalent.</p>
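<p>The additive analogue is easy to reproduce. In the sketch below, the two groupings of the same three constants produce different doubles:</p>
<pre lang="csharp">      double a = ( 0.1 + 0.2 ) + 0.3;  // 0.6000000000000001
      double b = 0.1 + ( 0.2 + 0.3 );  // 0.6
      Console.WriteLine( a == b );     // False
</pre>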
<p>So <em>reproducibility</em> is impacted by precision via dynamic operation reordering in the ALU, and additionally by run-time processor dispatching, data-array alignment, and variation in thread count, among other factors.  These issues can create <em>run-to-run</em> differences in the least significant digits.  Two runs, same code, two answers.  <em>This is by design and is not an issue of correctness</em>.  Subtle changes in the memory layout of the program&#8217;s data, differences in the loading of ALU registers and in operation order, and differences in threading &#8211; often triggered by unrelated processes running on the same machine &#8211; cause these run-to-run differences.</p>
<h3> Managing Reproducibility </h3>
<p>Most importantly, one should test code&#8217;s numerical results only to the precision that can be expected given the algorithm, the input data, and the limits of floating point arithmetic.  To do this in unit tests, compare floating point numbers only to a fixed number of digits.  The code snippet below compares two doubles and returns true only if they match to a specified number of digits.</p>
<pre lang="csharp">
// DOUBLE_EPSILON is a small class-level tolerance constant, on the
// order of machine epsilon.
private static bool EqualToNumDigits( double expected, double actual, int numDigits )
{
  double max = System.Math.Abs( expected ) > System.Math.Abs( actual ) ? System.Math.Abs( expected ) : System.Math.Abs( actual );
  double diff = System.Math.Abs( expected - actual );
  double relDiff = max > 1.0 ? diff / max : diff;
  if ( relDiff <= DOUBLE_EPSILON )
  {
    return true;
  }

  int numDigitsAgree = (int) ( -System.Math.Floor( Math.Log10( relDiff ) ) - 1 );
  return numDigitsAgree >= numDigits;
}
</pre>
<p>This type of comparison should be used throughout unit testing code.  The full code listing, which we use for our internal testing, is provided at the end of this article.</p>
<p>If it is essential to enforce binary run-to-run reproducibility to the limits of precision, <strong>NMath</strong> provides a flag in its configuration class to ensure this is the case.  However, this flag should be set for unit testing only, because there can be a significant performance cost.  In general, expect a 10% to 20% reduction in performance, with some common operations degrading far more than that.  For example, some matrix multiplications will take twice the time with this flag set.</p>
<p>Note that the number of threads used by Intel&#8217;s MKL library (on which <strong>NMath</strong> depends) must also be fixed before setting the reproducibility flag.</p>
<pre lang="csharp">
int numThreads = 2;  // This must be fixed for reproducibility.
NMathConfiguration.SetMKLNumThreads( numThreads );
NMathConfiguration.Reproducibility = true;
</pre>
<p>This reproducibility configuration for <strong>NMath</strong> cannot be unset at a later point in the program.  Note that both the number of threads and the reproducibility flag may also be set in the application config file or in environment variables.  See the <a href="https://www.centerspace.net/doc/NMath/user/overview-83549.htm#Xoverview-83549">NMath User Guide</a> for instructions on how to do this.</p>
<p>Paul</p>
<p><strong>References</strong></p>
<p>M. A. Cornea-Hasegan, B. Norin.  <em>IA-64 Floating-Point Operations and the IEEE Standard for Binary Floating-Point Arithmetic</em>. Intel Technology Journal, Q4, 1999.<br />
<a href="http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf">http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf</a></p>
<p>D. Goldberg, <em>What Every Computer Scientist Should Know About Floating-Point Arithmetic</em>. Computing Surveys. March 1991.<br />
<a href="http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html">http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html</a></p>
<h3> Full <code>double</code> Comparison Code </h3>
<pre lang="csharp">
// Tolerance below which two values are considered equal outright.
// (The exact value used internally is on the order of machine
// epsilon; 1e-15 here is illustrative.)
private const double DOUBLE_EPSILON = 1e-15;

private static bool EqualToNumDigits( double expected, double actual, int numDigits )
{
  bool xNaN = double.IsNaN( expected );
  bool yNaN = double.IsNaN( actual );
  if ( xNaN && yNaN )
  {
    return true;
  }
  if ( xNaN || yNaN )
  {
    return false;
  }
  if ( numDigits <= 0 )
  {
    throw new InvalidArgumentException( "numDigits is not positive in TestCase::EqualToNumDigits." );
  }

  double max = System.Math.Abs( expected ) > System.Math.Abs( actual ) ? System.Math.Abs( expected ) : System.Math.Abs( actual );
  double diff = System.Math.Abs( expected - actual );
  double relDiff = max > 1.0 ? diff / max : diff;
  if ( relDiff <= DOUBLE_EPSILON )
  {
    return true;
  }

  int numDigitsAgree = (int) ( -System.Math.Floor( Math.Log10( relDiff ) ) - 1 );
  return numDigitsAgree >= numDigits;
}
</pre>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/precision-and-reproducibility-in-computing">Precision and Reproducibility in Computing</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/precision-and-reproducibility-in-computing/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5810</post-id>	</item>
		<item>
		<title>NMath Premium: FFT Performance</title>
		<link>https://www.centerspace.net/nmath-premium-fft-performance</link>
					<comments>https://www.centerspace.net/nmath-premium-fft-performance#respond</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Tue, 28 May 2013 16:00:29 +0000</pubDate>
				<category><![CDATA[NMath Premium]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[C#]]></category>
		<category><![CDATA[C# GPU]]></category>
		<category><![CDATA[C# Nvidia GPU]]></category>
		<category><![CDATA[GPU]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=4212</guid>

					<description><![CDATA[<p><img class="excerpt" title="Double Precision FFT" src="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-51.png" alt="NMath Premium" /><br />
NMath Premium is CenterSpace Software's NVIDIA GPU-accelerated edition of the NMath math and statistics library. Many linear algebra and signal processing algorithms can now run on a local NVIDIA GPU, frequently realizing severalfold performance gains. In this post, we look at the performance of complex-to-complex forward 1D and 2D FFTs.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/nmath-premium-fft-performance">NMath Premium: FFT Performance</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><strong>NMath Premium</strong> is our new GPU-accelerated math and statistics library for the .NET platform. The supported NVIDIA GPU routines include both a range of dense linear algebra algorithms and 1D and 2D Fast Fourier Transforms (FFTs). NMath Premium is designed to be a near drop-in replacement for NMath; however, there are a few important differences and additional logging capabilities that are specific to the premium product.</p>
<p><strong>NMath Premium</strong> will be released June 11. For immediate access, sign up <a href="https://www.centerspace.net/nmath-premium/">here</a> to join the beta program.</p>
<h2>Benchmark Approach</h2>
<p>Modern FFT implementations are hybridized algorithms which switch between algorithmic approaches and processing kernels depending on the available hardware, the FFT type, and the FFT length. An FFT library may use the straight Cooley-Tukey algorithm for a short power-of-two FFT but switch to Bluestein&#8217;s algorithm for odd-length FFTs. Further, depending on the factors of the FFT length, different combinations of processing kernels may be used. In other words, there is no single &#8216;FFT algorithm&#8217;, and so there is no easy expression for the FLOPs completed per FFT computed. Therefore, when analyzing the performance of FFT libraries today, performance is often reported <em>relative to the Cooley-Tukey implementation</em>, with the FLOPs estimated at <code>5 * N * log2( N )</code>. This relative performance is reported here. As an example, if we report a performance of 10 GFLOPs for a particular FFT, that means that to match the performance (finish as quickly) with an implementation of the Cooley-Tukey algorithm, you&#8217;d need a machine capable of 10 GFLOPs.</p>
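<p>As a sketch of how such a relative figure is computed (the exact benchmark harness may differ), the estimated FLOP count is divided by the measured wall-clock time:</p>
<pre lang="csharp">// Relative GFLOPS for an FFT of length n completed in 'seconds',
// using the conventional 5 * n * log2( n ) operation estimate.
static double RelativeGflops( int n, double seconds )
{
  double flops = 5.0 * n * Math.Log( n, 2.0 );
  return flops / seconds / 1e9;
}
</pre>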
<p>Because GPU computation takes place in a different memory space from the CPU, all data must be copied to the GPU and the results then copied back to the CPU. This copy time overhead <em>is included in all reported performance numbers.</em> We include this copy time to give our library users an accurate picture of attainable performance.</p>
<h3>GPU&#8217;s Tested</h3>
<p>The <strong>NMath Premium</strong> 1D and 2D FFT library was tested on four different NVIDIA GPUs and a 4-core 2.0 GHz Intel i7. These models represent the current range of performance available from NVIDIA, ranging from the widely installed GeForce GTX 525 to NVIDIA&#8217;s fastest double precision GPU, the Tesla K20.</p>
<table>
<tbody>
<tr>
<th>GPU</th>
<th>Peak GFLOP (single / double)</th>
<th>Summary</th>
</tr>
<tr>
<td>Tesla K20</td>
<td>3510 / 1170</td>
<td>Optimized for applications requiring double precision performance such as computational physics, biochemistry simulations, and computational finance.</td>
</tr>
<tr>
<td>Tesla K10</td>
<td>2288 / 95</td>
<td>This is a dual GPU processor card optimized for single precision performance for applications such as seismic and video or image processing. If both GPU cores are maximally utilized these GFLOP numbers would double.</td>
</tr>
<tr>
<td>Tesla 2090</td>
<td>1331 / 655</td>
<td>A single core GPU with a more balanced single and double precision performance.</td>
</tr>
<tr>
<td>GeForce 525</td>
<td>230 / &#8211;</td>
<td>A single core consumer GPU found in many gaming computers.</td>
</tr>
</tbody>
</table>
<h2>FFT Performance Charts</h2>
<p>The four charts below represent the performance of various power-of-two length, complex-to-complex forward 1D and 2D FFTs. All <strong>NMath</strong> products also seamlessly compute non-power-of-two length FFTs, but their performance is not part of this GPU comparison note.</p>
<p>The CPU-bound 1D FFT outperformed all of the GPUs for relatively short FFT lengths. This is expected: at short lengths, the data transfer overhead outweighs the GPUs&#8217; superior compute performance. Once the computational complexity of the 1D FFT is high enough, the data transfer overhead is outweighed by the efficient parallel nature of the GPUs, and they start to overtake the CPU-bound 1D FFT. This cross-over point occurs when the FFT reaches a length near 65536. The exception is the consumer-level GeForce GTX 525, where the GPU and CPU FFT performance roughly track each other.</p>
<p>The 2D FFT case is different because of the higher computational demand of two dimensions. First, in the single precision case the K20 lags, as it is designed primarily as a double precision computation engine; here the CPU-bound FFT outperforms the K20 for all image sizes. However, the K10 and 2090 are extremely fast (including the data transfer time) and outperform the CPU-bound 2D FFT by approximately 60-70%. In the double precision 2D FFT case, the K20 outperforms all other processors in nearly all cases measured. The tested K20 was memory limited in the [ 8192 x 8192 ] test case and couldn&#8217;t complete the computation.</p>
<table>
<tbody>
<tr>
<td><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-61.png"><img decoding="async" title="Performance of single precision FFT" alt="Performance of single precision FFT" src="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-61.png" width="350" /></a></td>
<td><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-51.png"><img decoding="async" title="Performance of double precision FFT" alt="Performance of double precision FFT" src="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-51.png" width="350" /></a></td>
</tr>
<tr>
<td><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-12.png"><img decoding="async" class="alignnone size-full wp-image-4231" title="Performance or single precision 2D FFT" alt="Performance or single precision 2D FFT" src="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-12.png" width="350" srcset="https://www.centerspace.net/wp-content/uploads/2013/02/ScreenClip-12.png 800w, https://www.centerspace.net/wp-content/uploads/2013/02/ScreenClip-12-300x262.png 300w" sizes="(max-width: 800px) 100vw, 800px" /></a></td>
<td><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-13.png"><img decoding="async" class="alignnone size-full wp-image-4232" title="Performance of double precision 2D FFT" alt="Performance of double precision 2D FFT" src="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-13.png" width="350" srcset="https://www.centerspace.net/wp-content/uploads/2013/02/ScreenClip-13.png 800w, https://www.centerspace.net/wp-content/uploads/2013/02/ScreenClip-13-300x262.png 300w" sizes="(max-width: 800px) 100vw, 800px" /></a></td>
</tr>
</tbody>
</table>
<h3>Batch FFT</h3>
<p>To amortize the cost of data transfer to and from the GPU, <strong>NMath Premium</strong> can run FFTs in batches of signal arrays. For the smaller FFT sizes, batch processing nearly doubles the performance of the FFT on the GPU. As the length of the FFT increases, the advantage of batch processing decreases because the full set of signal arrays can no longer be loaded into the GPU.</p>
<p><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/02/BatchFFT.png"><img decoding="async" class="alignnone size-full wp-image-4259" title="Performance of batch 1D FFT" alt="" src="https://www.centerspace.net/blog/wp-content/uploads/2013/02/BatchFFT.png" width="350" srcset="https://www.centerspace.net/wp-content/uploads/2013/02/BatchFFT.png 800w, https://www.centerspace.net/wp-content/uploads/2013/02/BatchFFT-300x262.png 300w" sizes="(max-width: 800px) 100vw, 800px" /></a></p>
<h2>Summary</h2>
<p>As the complexity of the FFT increases, whether due to an increase in length or in problem dimension, the GPU-leveraged FFT performance overtakes the CPU-bound version. The advantage of the GPU 1D FFT grows substantially as the FFT length grows beyond ~100,000 samples. Batch processing of signals arranged in the rows of a matrix can be used to mitigate the data transfer overhead to the GPU. There are times when it may be advantageous to offload FFT processing onto the GPU even when CPU-bound performance is greater, because this frees many CPU cycles for other activities. Because <strong>NMath Premium</strong> supports adjustable crossover thresholds, the developer can control the FFT length at which FFT computation switches to the GPU. Setting this threshold to zero will push all FFT processing to the GPU, completely offloading this work from the CPU.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/nmath-premium-fft-performance">NMath Premium: FFT Performance</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/nmath-premium-fft-performance/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4212</post-id>	</item>
		<item>
		<title>Clearing a vector</title>
		<link>https://www.centerspace.net/clearing-a-vector</link>
					<comments>https://www.centerspace.net/clearing-a-vector#respond</comments>
		
		<dc:creator><![CDATA[Trevor Misfeldt]]></dc:creator>
		<pubDate>Wed, 09 Nov 2011 22:28:01 +0000</pubDate>
				<category><![CDATA[.NET]]></category>
		<category><![CDATA[Best Practices]]></category>
		<category><![CDATA[NMath]]></category>
		<category><![CDATA[NMath Tutorial]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[clearing a matrix]]></category>
		<category><![CDATA[clearing a vector]]></category>
		<category><![CDATA[NMath matirx]]></category>
		<category><![CDATA[NMath vector]]></category>
		<category><![CDATA[zeroing a matrix]]></category>
		<category><![CDATA[zeroing a vector]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=3621</guid>

					<description><![CDATA[<p>A customer recently asked us for the best method to zero out a vector. We decided to run some tests to find out. Here are the five methods we tried followed by performance timing and any drawbacks. The following tests were performed on a DoubleVector of length 100,000,000. 1) Create a new vector. This isn&#8217;t [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/clearing-a-vector">Clearing a vector</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>A customer recently asked us for the best method to zero out a vector. We decided to run some tests to find out. Here are the five methods we tried followed by performance timing and any drawbacks.</p>
<p>The following tests were performed on a <code>DoubleVector</code> of length 100,000,000 (the <code>size</code> used in the test code below).</p>
<p>1) Create a new vector. This isn&#8217;t really clearing out an existing vector but we thought we should include it for completeness.</p>
<pre lang="csharp" line="1"> DoubleVector v2 = new DoubleVector( v.Length, 0.0 );</pre>
<p>The big drawback here is that you&#8217;re creating new memory. Time: <strong>419.5ms</strong></p>
<p>2) Probably the first thing to come to mind is to simply iterate through the vector and set everything to zero.</p>
<pre lang="csharp" line="1">
for ( int i = 0; i < v.Length; i++ )
{
  v[i] = 0.0;
}</pre>
<p>The index operator must do bounds checking on every access. No new memory is created. Time: <strong>578.5ms</strong></p>
<p>3) In some cases, you could iterate through the underlying array of data inside the DoubleVector.</p>
<pre lang="csharp" line="1"> 
for ( int i = 0; i &lt; v.DataBlock.Data.Length; i++ )
{
  v.DataBlock.Data[i] = 0.0;
}</pre>
<p>This is a little less intuitive. And, very importantly, it will not work with many views into other data structures, such as a row slice of a matrix. However, this loop is easier for the CLR to optimize. Time: <strong>173.5ms</strong></p>
<p>4) We can use the power of Intel's MKL to multiply the vector by zero.</p>
<pre lang="csharp" line="1"> v.Scale( 0.0 );</pre>
<p>Scale() does this in-place. No new memory is created. In this example, we assume that MKL has already been loaded and is ready to go which is true if another MKL-based NMath call was already made or if NMath was <a href="/initializing-nmath/">initialized</a>. This method will work on all views of other data structures. Time: <strong>170ms</strong></p>
<p>5) This surprised us a bit, but the best method we could find was to clear the underlying array using .NET's Array.Clear().</p>
<pre lang="csharp" line="1"> Array.Clear( v.DataBlock.Data, 0, v.DataBlock.Data.Length );</pre>
<p>This creates no new memory and is very fast; however, it will not work with non-contiguous views. Time: <strong>85.8ms</strong></p>
<p>To make efficient clearing simpler for NMath users, we have added a <code>Clear()</code> method and a <code>Clear( Slice )</code> method to the vector and matrix classes.  They do the right thing in each circumstance and will be released in NMath 5.2 in 2012.</p>
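<p>Once NMath 5.2 is available, usage would look like the following sketch (the <code>Slice</code> arguments here are illustrative):</p>
<pre lang="csharp"> v.Clear();                        // zero every element
 v.Clear( new Slice( 0, 100 ) );   // zero the first 100 elements
</pre>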
<h3> Test Code </h3>
<pre lang="csharp" line="1">
using System;
using CenterSpace.NMath.Core;

namespace Test
{
  class ClearVector
  {
    static int size = 100000000;
    static int runs = 10;
    static int methods = 5;
    
    static void Main( string[] args )
    {
      System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
      DoubleMatrix times = new DoubleMatrix( runs, methods );
      NMathKernel.Init();

      for ( int run = 0; run < runs; run++ )
      {
        Console.WriteLine( "Run {0}...", run );
        DoubleVector v = null;

        // Create a new one
        v = new DoubleVector( size, 1.0, 2.0 );
        sw.Start();
        DoubleVector v2 = new DoubleVector( v.Length, 0.0 );
        sw.Stop();
        times[run, 0] = sw.ElapsedMilliseconds;
        Console.WriteLine( Assert( v2 ) );

        // iterate through vector
        v = new DoubleVector( size, 1.0, 2.0 );
        sw.Reset();
        sw.Start();
        for ( int i = 0; i < v.Length; i++ )
        {
          v[i] = 0.0;
        }
        sw.Stop();
        times[run, 1] = sw.ElapsedMilliseconds;
        Console.WriteLine( Assert( v ) );

        // iterate through array
        v = new DoubleVector( size, 1.0, 2.0 );
        sw.Reset();
        sw.Start();
        for ( int i = 0; i < v.DataBlock.Data.Length; i++ )
        {
          v.DataBlock.Data[i] = 0.0;
        }
        sw.Stop();
        times[run, 2] = sw.ElapsedMilliseconds;
        Console.WriteLine( Assert( v ) );
        
        // scale
        v = new DoubleVector( size, 1.0, 2.0 );
        sw.Reset();
        sw.Start();
        v.Scale( 0.0 );
        sw.Stop();
        times[run, 3] = sw.ElapsedMilliseconds;
        Console.WriteLine( Assert( v ) );

        // Array Clear
        v = new DoubleVector( size, 1.0, 2.0 );
        sw.Reset();
        sw.Start();
        Array.Clear( v.DataBlock.Data, 0, v.DataBlock.Data.Length );
        sw.Stop();
        times[run, 4] = sw.ElapsedMilliseconds;
        Console.WriteLine( Assert( v ) );
        Console.WriteLine( times.Row( run ) );
      }
      Console.WriteLine( "Means: " + NMathFunctions.Mean( times ) );
    }

    private static bool Assert( DoubleVector v )
    {
      if ( v.Length != size )
      {
        return false;
      }
      for ( int i = 0; i < v.Length; ++i )
      {
        if ( v[i] != 0.0 )
        {
          return false;
        }
      }
      return true;
    }
  }
}
</pre>
<p>- Trevor</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/clearing-a-vector">Clearing a vector</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/clearing-a-vector/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3621</post-id>	</item>
		<item>
		<title>Initializing NMath</title>
		<link>https://www.centerspace.net/initializing-nmath</link>
					<comments>https://www.centerspace.net/initializing-nmath#respond</comments>
		
		<dc:creator><![CDATA[Trevor Misfeldt]]></dc:creator>
		<pubDate>Wed, 09 Nov 2011 22:01:27 +0000</pubDate>
				<category><![CDATA[Best Practices]]></category>
		<category><![CDATA[NMath]]></category>
		<category><![CDATA[NMath Tutorial]]></category>
		<category><![CDATA[Performance]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=3611</guid>

					<description><![CDATA[<p>NMath uses Intel&#8217;s Math Kernel Library (MKL) internally. This code contains native, optimized code to wring out the best performance possible. There is a one-time delay when the appropriate x86 or x64 native code is loaded. This cost can be easily controlled by the developer by using the NMathKernel.Init() method. Please see Initializing NMath for [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/initializing-nmath">Initializing NMath</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>NMath uses Intel&#8217;s Math Kernel Library (MKL) internally. This library contains native, optimized code to wring out the best performance possible.</p>
<p>There is a one-time delay when the appropriate x86 or x64 native code is loaded. The developer can easily control when this cost is paid by calling the <code>NMathKernel.Init()</code> method. Please see <a href="http://centerspace.net/doc/NMath/user/overview-83549.htm">Initializing NMath</a> for more details.</p>
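<p>For example, calling <code>Init()</code> once at application startup moves the load cost out of timing-sensitive code:</p>
<pre lang="csharp">using CenterSpace.NMath.Core;

// Pay the one-time native-library load cost up front,
// before any performance measurements are taken.
NMathKernel.Init();
</pre>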
<p>&#8211; Trevor</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/initializing-nmath">Initializing NMath</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/initializing-nmath/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3611</post-id>	</item>
		<item>
		<title>Forward Scaling Computing</title>
		<link>https://www.centerspace.net/forward-scaling-computing</link>
					<comments>https://www.centerspace.net/forward-scaling-computing#respond</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Thu, 28 Jan 2010 18:02:13 +0000</pubDate>
				<category><![CDATA[CenterSpace]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[Ct]]></category>
		<category><![CDATA[data parallelism]]></category>
		<category><![CDATA[Forward Scaling computing]]></category>
		<category><![CDATA[NMR in the cloud]]></category>
		<category><![CDATA[task parallel library]]></category>
		<category><![CDATA[task parallelism]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=1327</guid>

					<description><![CDATA[<p><img class="excerpt" src="https://www.centerspace.net/blog/wp-content/uploads/2010/01/scc-h-wafer_small-150x150.jpg" /><br />
The era of sequential, single-threaded software development deployed to a uniprocessor machine is rapidly fading into history.  Nearly all computers sold today have at least two, if not four cores - and will have eight in the near future.  Intel announced last month the successful production and testing of a new <a href="http://bit.ly/4SJiun">48-core research processor</a> which will be made available to industry and academia for research and development of <em> manycore </em> parallel software developer tools and languages.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/forward-scaling-computing">Forward Scaling Computing</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<h2>Forward Scaling for Multicore Performance</h2>
<p>The era of sequential, single-threaded software development deployed to a uniprocessor machine is rapidly fading into history.  Nearly all computers sold today have at least two, if not four cores &#8211; and will have eight in the near future.  Intel announced last month the successful production and testing of a new <a href="http://www.wired.com/2009/12/intel-48-core-processor/">48-core research processor</a> which will be made available to industry and academia for research and development of <em> manycore </em> parallel software developer tools and languages.</p>
<figure id="attachment_1172" aria-describedby="caption-attachment-1172" style="width: 300px" class="wp-caption aligncenter"><img decoding="async" loading="lazy" src="https://www.centerspace.net/blog/wp-content/uploads/2010/01/scc-h-wafer_small-300x200.jpg" alt="Intel&#039;s 48-core processor" title="Intel&#039;s 48-core processor" width="300" height="200" class="size-medium wp-image-1172" srcset="https://www.centerspace.net/wp-content/uploads/2010/01/scc-h-wafer_small-300x200.jpg 300w, https://www.centerspace.net/wp-content/uploads/2010/01/scc-h-wafer_small-1024x682.jpg 1024w, https://www.centerspace.net/wp-content/uploads/2010/01/scc-h-wafer_small.jpg 1500w" sizes="(max-width: 300px) 100vw, 300px" /><figcaption id="caption-attachment-1172" class="wp-caption-text">Intel's recently announced 48-core processor</figcaption></figure>
<p>In the near future, users of high performance software in finance, bio-informatics, or GIS will expect their applications to scale with core count, and software that fails to do so will either need to be rewritten or abandoned.  To future-proof performance-sensitive software, code written today needs to be multicore aware and scale automatically to all available cores &#8211; this is the key idea behind forward scaling software.  If Moore&#8217;s &#8216;law&#8217; is to be sustained into the future, hardware scalability must be joined with a similar shift in software.  This fundamental shift in computing and application development, termed the &#8216;Manycore Shift&#8217; by Microsoft, is an evolutionary shift that software developers must appreciate and adapt to in order to create long-living, scalable applications.</p>
<h2> CenterSpace&#8217;s Forward Scaling Strategy </h2>
<p>This project of creating forward scaling software can sound daunting, but for many application developers it reduces to choosing the right languages &#038; libraries for the computationally demanding portions of their application.  If an application&#8217;s performance-sensitive components are forward scaling, so goes the application. <!-- problem size grows, serial parts typically grow slower--> At CenterSpace we are very performance sensitive and are working to ensure that our users benefit from forward scaling behavior.  Linear scaling with core count cannot always be achieved, but in the numerical computing domain we can frequently come close to this ideal.</p>
<p>Here are some parallel computing technologies that we are looking at adopting to ensure we meet this goal.<br />
<span id="more-1327"></span></p>
<h3> Microsoft&#8217;s Task Parallel Library </h3>
<p>Microsoft Research released the <a href="https://msdn.microsoft.com/en-us/library/dd460717(v=vs.110).aspx">Task Parallel Library</a> about two years ago, and most recently updated it in June 2008.  Building on this experience, Microsoft will include task-based parallelism in the March 2010 release of .NET 4.</p>
<p>These parallel extensions will reside in a new static class <code>Parallel</code> inside the <code>System.Threading.Tasks</code> namespace.  While this framework provides an extensible set of classes for complex task parallelism problems, the common use cases will include the typical variants of the lowly loop.  Here&#8217;s a very simple example contrasting sequential and parallel code patterns for computing the element-wise square of a vector.</p>
<pre lang="csharp">
// Single-threaded loop
Double[] s = new Double[n];
for(int i = 0; i < n; i++)
  s[i] = v[i]*v[i];

// Using the Parallel class &#038; anonymous delegates
Double[] s = new Double[n];
Parallel.For(0, n, delegate (int i)
{
  s[i] = v[i]*v[i];
} );

// Using the Parallel class &#038; lambda expressions
Double[] s = new Double[n];
Parallel.For(0, n, (i) => s[i] = v[i]*v[i]);
</pre>
<p>Note how concisely the parallel looping code can be written with lambda expressions.  If you haven&#8217;t yet gotten excited about lambda expressions, hopefully you are now!   Also, <em>outer variables</em> can be referenced from inside the lambda expression (or the anonymous delegate), making code ports fairly simple once the inherent parallelism is recognized.</p>
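<p>For instance, a scale factor declared outside the loop can be captured directly by the lambda (a minimal sketch; <code>v</code>, <code>s</code>, and <code>n</code> are assumed to be declared as in the example above):</p>
<pre lang="csharp">
// An outer variable captured by the lambda -- no explicit
// argument passing is needed when porting serial code.
double alpha = 2.5;
Parallel.For(0, n, (i) => s[i] = alpha * v[i] * v[i]);
</pre>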
<h3> Intel&#8217;s Ct Data Parallel VM </h3>
<p>CenterSpace already leverages Intel&#8217;s forward scaling implementations of BLAS and LAPACK in our NMath and NMath Stats libraries, so we are very attuned to Intel&#8217;s efforts to bring programmers new tools to leverage their multicore chips.  Intel has a long history of supporting developers creating and debugging multithreaded applications by offering solid libraries, compilers, and debuggers.  Intel has a major initiative called the <a href="https://software.intel.com/en-us/articles/tera-scale-computing-a-parallel-path-to-the-future">Tera-Scale Computing Research Program</a> to push forward all areas of high performance computing, spanning from hardware to software.</p>
<p>At CenterSpace we are particularly interested in a new <em>data parallel</em> virtual machine that will offer all the data parallel functionality of Ct to any language with C bindings.  Backing up a bit, <a href="https://software.intel.com/en-us/articles/tera-scale-computing-a-parallel-path-to-the-future">Ct</a> is a new data parallel language in development at Intel that, with the help of the recently acquired RapidMind (August 2009), will enable programmer-friendly data-parallel programming in widespread languages such as C++.  The new <a href="https://software.intel.com/en-us/articles/data-parallel-vm/">data parallel virtual machine</a>, with its C front end, can interoperate with languages such as C#, allowing programmers using the .NET family of languages to leverage this data parallel technology.</p>
<p>This is a fundamentally different approach from Microsoft&#8217;s task-based parallel classes.  In the example above, note that the computation in the lambda expression must be independent for every <code>i</code>.  This places a burden on the programmer to identify and create these parallel lambda expressions; the data parallel approach frees the programmer of this significant burden.  Computing the dot product in C#, leveraging the C front end to the data parallel VM, might look something like:</p>
<pre lang="csharp">
// Using a Ct based parallel class & lambda expressions
CtVector<int> v1 = new CtVector<int>(1,2, ... ,49999,50000);
CtVector<int> v2 = new CtVector<int>(50000,49999, ... ,2,1);
int dot_product = CtParallel.AddReduce(v1*v2);
</pre>
<p>Note that the dot product operation is not easily converted to a task-based parallel implementation, since each operation in the lambda expression is not independent but instead requires both a multiplication and a summation (reduction).  However, there is significant data parallelism in the vector product and reduction steps that should exhibit near linear scaling with processor count using this Ct based implementation.</p>
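<p>For completeness, a dot product <em>can</em> be expressed with the task parallel library, but only by explicitly carrying per-thread partial sums and combining them at the end &#8211; a sketch using the thread-local-state overload of <code>Parallel.For</code> (<code>v1</code>, <code>v2</code>, and <code>n</code> are assumed to be declared elsewhere):</p>
<pre lang="csharp">
double total = 0.0;
object sync = new object();
Parallel.For(0, n,
  () => 0.0,                                    // per-thread initial partial sum
  (i, loopState, partial) => partial + v1[i] * v2[i],
  (partial) => { lock (sync) { total += partial; } });  // combine partials
</pre>
<p>The data parallel version above remains considerably simpler, which is precisely its appeal.</p>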
<p>As with task-based parallelism, the data parallelism approach has its drawbacks.  First, the data must reside not in native types but in special Ct containers (<code>CtVector</code> in this example) that the data parallel engine knows how to divide and reassemble.  This is generally a minor issue; however, if your data doesn&#8217;t naturally reside in vectors or matrices at all, data parallelism may not be an option for leveraging multicore hardware &#8211; task-based parallelism may be the answer.  Both approaches have their strengths and weaknesses, and application domains where each shines.  At CenterSpace, since our focus is on numerical computation, our data typically resides comfortably in vectors, so we expect Ct&#8217;s data parallel approach to be an important tool in our future.</p>
<h3> Cloud Computing with EC2 &#038; Azure </h3>
<p>Cloud computing, typically thought of as a room full of servers and disk drives, is not conceptually that different from a single computer with many cores.  In fact, Intel likes to refer to their new 48-core processor as a <em>single-chip cloud computer</em>.  In both cases the central goals are performance and scalability.</p>
<p>Since many of CenterSpace&#8217;s customers have high computational demands and often process large datasets, we have started pushing some NMath functionality out into the cloud.  A powerful &#038; computationally demanding data clustering algorithm called Non-negative Matrix Factorization was our first NMath port to the cloud.  With this algorithm now residing in the cloud, customers can access this high-performance NMF implementation from virtually any programming language, cutting their run times from days to hours.  We&#8217;ll be blogging more on our cloud computing efforts in the near future.</p>
<p>Happy Computing,</p>
<p><em> -Paul </em></p>
<p><em> References &#038; Additional Resources </em></p>
<ol>
<li>Toub, Stephen.  Patterns of Parallel Programming.  Whitepaper, Microsoft Corporation, 2009.</li>
<li><a href="https://www.microsoft.com/en-us/download/details.aspx?id=17702">The Manycore Shift</a>: Microsoft Parallel Computing Initiative Ushers Computing into the Next Era, 2007.</li>
<li>An in-depth technical <a href="http://www.intel.com/technology/itj/2007/v11i4/4-libraries/2-intro.htm">article</a> on Intel&#8217;s numeric-intensive, multi-core ready technologies, November 2007.</li>
</ol>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/forward-scaling-computing">Forward Scaling Computing</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/forward-scaling-computing/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1327</post-id>	</item>
		<item>
		<title>High Performance Numerics in C#</title>
		<link>https://www.centerspace.net/high-performance-numerics-in-c</link>
					<comments>https://www.centerspace.net/high-performance-numerics-in-c#comments</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Mon, 21 Dec 2009 19:52:38 +0000</pubDate>
				<category><![CDATA[Performance]]></category>
		<category><![CDATA[forward scaling]]></category>
		<category><![CDATA[multi-core performance]]></category>
		<category><![CDATA[numerics c#]]></category>
		<category><![CDATA[stackoverflow question]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=678</guid>

					<description><![CDATA[<p>Recently a programmer on stackoverflow commented that the performance of NMath was &#8220;really amazing&#8221; and was wondering how we achieved that performance in the context of the .NET/C# framework/language pair. This blog post discusses how CenterSpace achieves such great performance in this memory managed framework. A future post will discuss where we are looking to [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/high-performance-numerics-in-c">High Performance Numerics in C#</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Recently a programmer on stackoverflow commented that the performance of NMath was &#8220;<a href="http://stackoverflow.com/questions/1831353/the-speed-of-net-in-numerical-computing">really amazing</a>&#8221; and was wondering how we achieved that performance in the context of the .NET/C# framework/language pair.  This blog post discusses how CenterSpace achieves such great performance in this memory managed framework.  A future post will discuss where we are looking to gain even more performance.<br />
<span id="more-678"></span></p>
<h2> 1. C# is Fast, Memory Allocation Is Not</h2>
<p>CenterSpace libraries never allocate memory unless absolutely necessary, and we provide an API that doesn&#8217;t force users to unnecessarily allocate memory.  For example, where appropriate, nearly all classes provide two method signatures for each computational operation &#8211; one that returns a new vector, and one that returns the result in a caller-provided vector passed by reference.</p>
<pre lang="csharp">
 Double1DConvolution conv = new Double1DConvolution(kernel, 256);

// Allocates and returns the result in a new vector.
DoubleVector result = conv.Convolve(data);

// Returns the result in the provided vector.
conv.Convolve(data, ref result);
</pre>
<p>In a loop, the latter is far superior if the <code>result</code> vector is reused.  The former is fine for a one-off result, and is a convenient method signature for the API user.  Inexperienced C# programmers often complain that their applications suffer from poor performance &#8211; and frequently the root of the issue is not the language itself, but poor memory allocation/reuse practices.  Languages that offer garbage collection services are easy to abuse in this way.</p>
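<p>To make the contrast concrete, here is a loop that allocates the result vector once and reuses it across iterations &#8211; a sketch in which <code>firstFrame</code>, <code>remainingFrames</code>, and <code>Process</code> are hypothetical stand-ins for your data source and downstream code:</p>
<pre lang="csharp">
Double1DConvolution conv = new Double1DConvolution(kernel, 256);

// First call allocates the result vector once...
DoubleVector result = conv.Convolve(firstFrame);
Process(result);

// ...and every subsequent iteration reuses it.
foreach (DoubleVector frame in remainingFrames)
{
  conv.Convolve(frame, ref result);
  Process(result);
}
</pre>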
<h2> 2. Precision  &#8211; Ability to Use Just What You Need </h2>
<p>Frequently programmers do all of their computation using <code> Double </code> precision math.  If 7 digits of precision are all you need, using strictly single precision algorithms will vastly improve performance.  Below is a table comparing double and single precision FFT&#8217;s computed using NMath.</p>
<pre class="code">
<table align="center">
<tr bgcolor="#efefef">
<th> FFT Length </th> <th> Double Precision (ns) </th> <th> Single Precision (ns) </th> <th> Performance Gain </th>
</tr>
<tr align="center"> <td> 1024 </td> <td> 200 </td> <td> 25 </td> <td> 8X </td> </tr>
<tr align="center"> <td> 2048 </td> <td> 325 </td> <td> 50 </td> <td> 6.5X </td> </tr>
<tr align="center"> <td> 4096 </td> <td> 675 </td> <td> 150 </td> <td> 4.5X </td> </tr>
</table>
</pre>
<p>Clearly, if the precision is not necessary, the performance gain in switching from double to single precision is considerable (not to mention the memory saving for the data storage).  NMath provides both single and double precision options for nearly every class.  </p>
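<p>In NMath the switch is typically just a change of class name &#8211; the single precision classes mirror their double precision counterparts.  A sketch (class names follow NMath&#8217;s Double/Float naming convention; consult the documentation for exact signatures):</p>
<pre lang="csharp">
// Double precision forward 1D FFT of length 1024
DoubleForward1DFFT fft64 = new DoubleForward1DFFT(1024);

// Single precision analogue -- same usage pattern, and
// considerably faster when 7 digits of precision suffice
FloatForward1DFFT fft32 = new FloatForward1DFFT(1024);
</pre>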
<h2> 3. Processor Optimized Code </h2>
<p>Part of the NMath class library is based on <a href="https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms">BLAS</a> and <a href="https://en.wikipedia.org/wiki/LAPACK">LAPACK</a>, two long established interfaces for linear algebra.  We use Intel&#8217;s implementation of these libraries because Intel carefully optimizes their performance for the Intel multicore processors on an on-going basis.  We also leverage MKL&#8217;s implementation of the FFT.  Below is a brief comparison between NMath&#8217;s FFT and FFTW (the FFT implementation shipped with MATLAB) &#8211; on a different machine than above.</p>
<pre class="code">
<table align="center"><tbody >
<tr>
<th colspan="3"> Comparison of a forward, real, out-of-place FFT. </th>
</tr>
<tr> 
<th> FFT length</th> <th> FFTW </th> <th> NMATH FFT </th></tr>
<tr> 
<td> 1024 </td> <td> 4.14 &mu;s</td> <td> 4.36 &mu;s </td> </tr>
<tr> 
<td> 1000</td> <td> 5.98 &mu;s </td> <td> 5.33 &mu;s </td> </tr>
<tr> 
<td> 4096</td> <td> 20.31 &mu;s </td> <td> 21.71 &mu;s </td> </tr>
<tr> 
<td> 4095</td> <td> 49.90 &mu;s </td> <td> 43.01 &mu;s </td> </tr>
<tr> 
<td> 1024^2 </td> <td> 17.16 ms </td> <td> 15.63 ms </td> </tr>
</tbody>
</table>
</pre>
<p>Clearly, .NET programmers can have both the productive development language of C# and world-class computational performance.</p>
<p>There are a couple of ways to call a native library from the .NET framework without unacceptably impacting performance.  Using P/Invoke, the library can be called directly from C#.  Due to the cost of marshaling the data, this is not a good option for many short computations, but for significant operations the P/Invoke cost is negligible.  We also have the option of calling a C++/CLI routine from C#, pinning pointers to data allocated in the managed heap, and then calling the Intel library.  In terms of performance, pinning pointers in C++/CLI is generally better, but it is also more complex to manage pointers into both the managed and unmanaged heaps.  CenterSpace uses both techniques.</p>
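<p>As an illustration of the P/Invoke route, a declaration for a native BLAS dot product might look like the following (the DLL name and calling details here are illustrative, not NMath&#8217;s actual bindings):</p>
<pre lang="csharp">
using System.Runtime.InteropServices;

class NativeBlas
{
  // cblas_ddot computes the dot product of two double vectors.
  [DllImport("mkl_rt.dll", EntryPoint = "cblas_ddot")]
  static extern double cblas_ddot(int n,
    double[] x, int incX, double[] y, int incY);

  public static double Dot(double[] x, double[] y)
  {
    // The runtime pins the arrays for the duration of the call, so
    // the marshaling cost is negligible next to an O(n) computation.
    return cblas_ddot(x.Length, x, 1, y, 1);
  }
}
</pre>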
<p>Happy Computing,</p>
<p><em> -Paul </em></p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/high-performance-numerics-in-c">High Performance Numerics in C#</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/high-performance-numerics-in-c/feed</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">678</post-id>	</item>
		<item>
		<title>Modern Fast Fourier Transform</title>
		<link>https://www.centerspace.net/modern-fast-fourier-transform</link>
					<comments>https://www.centerspace.net/modern-fast-fourier-transform#comments</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Tue, 29 Sep 2009 05:32:46 +0000</pubDate>
				<category><![CDATA[NMath]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[FFT]]></category>
		<category><![CDATA[FFT performance]]></category>
		<category><![CDATA[High performance FFT]]></category>
		<category><![CDATA[Multicore FFT]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=226</guid>

					<description><![CDATA[<p>All variants of the original Cooley-Tukey O(n log n) fast Fourier transform fundamentally exploit different ways to factor the discrete Fourier summation of length N. For example, the split-radix FFT algorithm divides the Fourier summation of length N into three new Fourier summations: one of length N/2 and two of length N/4. The prime factor [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/modern-fast-fourier-transform">Modern Fast Fourier Transform</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>All variants of the original Cooley-Tukey O(n log n) fast Fourier transform fundamentally exploit different ways to factor the discrete Fourier summation of length N.</p>
<p><center><br />
<a href="http://www.codecogs.com/eqnedit.php?latex=X_k = \sum_{n=0}^{N-1} x_n e^{(-2 \pi i / N) kn} \ \ \ \ \ k = 0, ... ,N-1" target="_blank" rel="noopener"><img decoding="async" title="X_k = \sum_{n=0}^{N-1} x_n e^{(-2 \pi i / N) kn} \ \ \ \ \ k = 0, ... ,N-1" src="http://latex.codecogs.com/gif.latex?X_k = \sum_{n=0}^{N-1} x_n e^{(-2 \pi i / N) kn} \ \ \ \ \ k = 0, ... ,N-1" alt="" /></a></center><br />
For example, the <em>split-radix FFT</em> algorithm divides the Fourier summation of length N into three new Fourier summations: one of length N/2 and two of length N/4.</p>
<p><center><br />
<a href="http://www.codecogs.com/eqnedit.php?latex=X_{k_N} = X_{k_{N/2}} + X_{k_{N/4}} + X_{k_{N/4}}" target="_blank" rel="noopener"><img decoding="async" title="X_{k_N} = X_{k_{N/2}} + X_{k_{N/4}} + X_{k_{N/4}}" src="http://latex.codecogs.com/gif.latex?X_{k_N} = X_{k_{N/2}} + X_{k_{N/4}} + X_{k_{N/4}}" alt="" /></a></center><br />
The <em>prime factor FFT</em>, divides the Fourier summation of length N, into two (if they exist) summations of length N1 and N2, where N1 and N2 must be relatively prime.</p>
<p><center><br />
<a href="http://www.codecogs.com/eqnedit.php?latex=X_{k_N} = X_{k_{N1}} ( X_{k_{N2}} ) \ \ where \ N1 \perp N2" target="_blank" rel="noopener"><img decoding="async" title="X_{k_N} = X_{k_{N1}} ( X_{k_{N2}} ) \ \ where \ N1 \perp N2" src="http://latex.codecogs.com/gif.latex?X_{k_N} = X_{k_{N1}} ( X_{k_{N2}} ) \ \ where \ N1 \perp N2" alt="" /></a></center><br />
These algorithms are typically applied recursively, and in combination with one another (or with still other factorizations) to maximize performance for a particular N.</p>
<p>In modern implementations there really isn&#8217;t a single static FFT algorithm, but more a dynamic collection of FFT algorithms and tools that are cleverly collated for the Fourier transform type at hand. Major algorithmic changes occur in the underlying implementation as the length and forward domain (real or complex) of the problem vary. Sophisticated FFT implementations insulate the end-user programmer from all of this background machinery.</p>
<h5>DFT length is fundamental to performance</h5>
<p>The days of power-of-2-only FFT algorithms are dead. Users of modern FFT libraries should not need to worry about the large complexities involved in finding the optimal algorithm for the FFT computation at hand; the library should look at the FFT length, problem domain (real or complex), number of machine cores, and machine architecture, and find and compute with the best hybridized FFT algorithm available. However, it is still helpful to understand that your realized performance will depend fundamentally on the factorization of your FFT length. Most users know that the best FFT performance is had when N is a power of 2. If this stringent length requirement cannot be met, then it is best to use a length that can be factored into small primes. CenterSpace&#8217;s FFT algorithms contain optimized kernels for prime factor lengths of 2, 3, 5, 7 and 11. The table below demonstrates the FFT performance sensitivity to FFT length.</p>
<table border="0" cellpadding="4">
<caption>Forward real 1D FFT performance at various lengths.</caption>
<tbody>
<tr align="center">
<td><em> DFT Length </em></td>
<td><em>Factors </em></td>
<td><em>MFLOP approximation </em></td>
</tr>
<tr>
<td>512</td>
<td>2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2</td>
<td>5324.5</td>
</tr>
<tr>
<td>511</td>
<td>7 x 73</td>
<td>1327.8</td>
</tr>
<tr>
<td>510</td>
<td>2 x 3 x 5 x 17</td>
<td>3879.4</td>
</tr>
<tr>
<td>509</td>
<td>509 (prime)</td>
<td>1762.4</td>
</tr>
<tr>
<td>508</td>
<td>2 x 2 x 127</td>
<td>2637.6</td>
</tr>
<tr>
<td>507</td>
<td>3 x 13 x 13</td>
<td>2631.5</td>
</tr>
<tr>
<td>506</td>
<td>2 x 11 x 23</td>
<td>3938.3</td>
</tr>
<tr>
<td>505</td>
<td>5 x 101</td>
<td>1122.6</td>
</tr>
<tr>
<td>504</td>
<td>2 x 2 x 2 x 3 x 3 x 7</td>
<td>5227</td>
</tr>
</tbody>
</table>
<p>Clearly the fastest FFT&#8217;s are for lengths that can be factored into small primes (512, 510, 507, 506, 504), and especially small primes that have optimized kernels (512 and 504). The more kernel-optimized primes your FFT length contains, the faster it will run. This is a universal fact that all FFT implementations confront, and it holds true for higher dimension FFT&#8217;s as well. <em>Slight changes in length can have a profound impact on FFT performance</em>.</p>
<p>You can factor your FFT length using an online service to assess how your FFT will perform.</p>
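<p>Alternatively, a few lines of code suffice to factor a candidate length and check it against the optimized kernel primes (2, 3, 5, 7 and 11) &#8211; a simple sketch:</p>
<pre lang="csharp">
using System.Collections.Generic;

static List<int> Factor(int n)
{
  // Trial division is plenty fast for realistic FFT lengths.
  List<int> factors = new List<int>();
  for (int p = 2; p * p <= n; p++)
    while (n % p == 0) { factors.Add(p); n /= p; }
  if (n > 1) factors.Add(n);   // remaining prime factor
  return factors;
}

// Factor(510) yields 2 x 3 x 5 x 17; the factor 17 falls outside the
// optimized kernel set, so a length of 512 should be noticeably faster.
</pre>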
<h5>Multi-core Scalability</h5>
<p>The ability to factor a particular FFT into a set of independent computations makes it fundamentally suitable for parallelization. All modern desktop and many laptop computers today contain at least two processor cores, and any modern math library should exploit this fact where possible. CenterSpace&#8217;s complex domain FFT&#8217;s (and related convolutions) are multi-core aware, and automatically expand to fully utilize the available processor cores. Small problems are run on a single core, but once the computational advantages of algorithm parallelization overcome the overhead costs of multi-core parallelization, the computation is spread across all available cores. This automatic parallelization is gained simply by using CenterSpace&#8217;s NMath class libraries. No end-user programming effort is involved.</p>
<table border="0" cellpadding="6">
<caption>Forward complex 1D FFT performance on 1 and 8 cores.</caption>
<tbody>
<tr align="center">
<th><em> FFT Length </em></th>
<th><em> Machine Cores </em></th>
<th><em> Time (seconds) </em></th>
<th><em> MFLOP approximation </em></th>
</tr>
<tr>
<td>2^20</td>
<td>One</td>
<td>56.7</td>
<td>6405.9</td>
</tr>
<tr>
<td>2^20 + 1</td>
<td>One</td>
<td>554.6</td>
<td>655.3</td>
</tr>
<tr>
<td>2^20</td>
<td>Eight</td>
<td>53.3</td>
<td>6813.7</td>
</tr>
<tr>
<td>2^20 + 1</td>
<td>Eight</td>
<td>124.2</td>
<td>2925.3</td>
</tr>
</tbody>
</table>
<p>Power-of-two FFT&#8217;s are so computationally efficient on modern processors that the gain between one and eight cores is only about 3 seconds on a 2^20-point FFT. However, for the non-power-of-two case we get a 4.5 times speed improvement going from one core to eight. Looked at another way, with multi-core scalability of the FFT, we suffered only a 2X loss in performance going from a 2^20 length FFT to a 2^20+1 length FFT, instead of a 10X loss. In other words, the multi-core scalability of CenterSpace&#8217;s NMath FFT algorithms mitigates the performance loss of using non-power-of-2 lengths, simplifying the end-user programmer&#8217;s job.</p>
<p><em> -Paul </em></p>
<p>See our <a href="/topic-fast-fourier-transforms/">FFT landing page </a> for complete documentation and code examples.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/modern-fast-fourier-transform">Modern Fast Fourier Transform</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/modern-fast-fourier-transform/feed</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">226</post-id>	</item>
	</channel>
</rss>
