<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>

<channel>
	<title>Performance Archives - CenterSpace</title>
	<atom:link href="https://www.centerspace.net/category/performance/feed" rel="self" type="application/rss+xml" />
	<link>https://www.centerspace.net/category/performance</link>
	<description>.NET numerical class libraries</description>
	<lastBuildDate>Tue, 07 Feb 2023 21:48:41 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.1.1</generator>
<site xmlns="com-wordpress:feed-additions:1">104092929</site>	<item>
		<title>Precision and Reproducibility in Computing</title>
		<link>https://www.centerspace.net/precision-and-reproducibility-in-computing</link>
					<comments>https://www.centerspace.net/precision-and-reproducibility-in-computing#respond</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Mon, 16 Nov 2015 22:32:31 +0000</pubDate>
				<category><![CDATA[MKL]]></category>
		<category><![CDATA[NMath]]></category>
		<category><![CDATA[Object-Oriented Numerics]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[floating point precision]]></category>
		<category><![CDATA[MKL repeatability]]></category>
		<category><![CDATA[MKL reproducibility]]></category>
		<category><![CDATA[NMath repeatability]]></category>
		<category><![CDATA[NMath Reproducibility]]></category>
		<category><![CDATA[repeatability]]></category>
		<category><![CDATA[repeatability in computing]]></category>
		<category><![CDATA[Reproducibility in computing]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=5810</guid>

					<description><![CDATA[<p>Run-to-run reproducibility in computing is often assumed as an obvious truth.  However, software running on modern computer architectures, particularly when coupled with advanced performance-optimized libraries, is often guaranteed to produce reproducible results only up to a certain precision; beyond that, results can and do vary from run to run.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/precision-and-reproducibility-in-computing">Precision and Reproducibility in Computing</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Run-to-run reproducibility in computing is often assumed as an obvious truth.  However, software running on modern computer architectures, particularly when coupled with advanced performance-optimized libraries, is often guaranteed to produce reproducible results only up to a certain precision; beyond that, results can and do vary from run to run.  Reproducibility is interrelated with the precision of floating-point types and the resulting rounding, with operation reordering, with memory structure and use, and finally with how real numbers are represented internally in a computer&#8217;s registers.</p>
<p>This issue of reproducibility arises for <strong>NMath</strong> users when writing and running unit tests, which is why it&#8217;s important, when writing tests, to compare floating point numbers only up to their designed precision, at an absolute maximum.  In the IEEE 754 floating point representation, to which virtually all modern computers adhere, the single precision <code>float</code> type uses 32 bits (4 bytes) and offers 24 bits of precision, or about <em>7 decimal digits</em>, while the double precision <code>double</code> type uses 64 bits (8 bytes) and offers 53 bits of precision, or about <em>15 decimal digits</em>.  Few algorithms can achieve significant results to the 15th decimal place due to rounding, cancellation in subtraction, and other sources of precision degradation.  <strong>NMath&#8217;s</strong> numerical results are tested, at a maximum, to the 14th decimal place.</p>
<h4 style="padding-left: 30px;"><em>A Precision Example</em></h4>
<p style="padding-left: 30px;">As an example, what does the following code output?</p>
<pre style="padding-left: 30px;" lang="csharp">      double x = .050000000000000003;
      double y = .050000000000000000;
      if ( x == y )
        Console.WriteLine( "x is y" );
      else
        Console.WriteLine( "x is not y" );
</pre>
<p style="padding-left: 30px;">I get &#8220;x is y&#8221;, even though the two literals differ mathematically: the value specified for x lies beyond the precision of a <code>double</code> type, so both literals round to the same representable number.</p>
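<p style="padding-left: 30px;">To see why, compare the raw bit patterns of the two values; a minimal sketch using .NET&#8217;s <code>BitConverter</code>:</p>
<pre style="padding-left: 30px;" lang="csharp">      double x = .050000000000000003;
      double y = .050000000000000000;
      // Both literals round to the nearest representable double,
      // so their 64-bit patterns are identical.
      long xBits = BitConverter.DoubleToInt64Bits( x );
      long yBits = BitConverter.DoubleToInt64Bits( y );
      Console.WriteLine( xBits == yBits );  // True
</pre>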
<p>Due to these limits on decimal number representation and the resulting rounding, the numerical results of some operations can be affected by the associative reordering of operations. For example, in some cases <code>a*x + a*z</code> may not equal <code>a*(x + z)</code> with floating point types.  This can be difficult to demonstrate with a modern optimizing compiler, because the code you write and the code that runs may be organized very differently; the two are mathematically equivalent but not necessarily numerically equivalent.</p>
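<p>The additive analogue is easy to reproduce. In the sketch below, the two groupings of the same three constants produce different doubles:</p>
<pre lang="csharp">      double a = ( 0.1 + 0.2 ) + 0.3;  // 0.6000000000000001
      double b = 0.1 + ( 0.2 + 0.3 );  // 0.6
      Console.WriteLine( a == b );     // False
</pre>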
<p>So <em>reproducibility</em> is impacted by precision via dynamic operation reordering in the ALU, and additionally by run-time processor dispatching, data-array alignment, and variation in thread count, among other factors.  These issues can create <em>run-to-run</em> differences in the least significant digits.  Two runs, same code, two answers.  <em>This is by design and is not an issue of correctness</em>.  Subtle changes in the memory layout of the program&#8217;s data, differences in the loading of ALU registers and in operation order, and differences in threading &#8211; often triggered by unrelated processes running on the same machine &#8211; cause these run-to-run differences.</p>
<h3> Managing Reproducibility </h3>
<p>Most importantly, one should test code&#8217;s numerical results only to the precision that can be expected given the algorithm, the input data, and the limits of floating point arithmetic.  To do this in unit tests, compare floating point numbers only to a fixed number of digits.  The code snippet below compares two doubles and returns true only if they match to a specified number of digits.</p>
<pre lang="csharp">
// DOUBLE_EPSILON is a small class-level tolerance constant, on the
// order of machine epsilon.
private static bool EqualToNumDigits( double expected, double actual, int numDigits )
{
  double max = System.Math.Abs( expected ) > System.Math.Abs( actual ) ? System.Math.Abs( expected ) : System.Math.Abs( actual );
  double diff = System.Math.Abs( expected - actual );
  double relDiff = max > 1.0 ? diff / max : diff;
  if ( relDiff <= DOUBLE_EPSILON )
  {
    return true;
  }

  int numDigitsAgree = (int) ( -System.Math.Floor( Math.Log10( relDiff ) ) - 1 );
  return numDigitsAgree >= numDigits;
}
</pre>
<p>This type of comparison should be used throughout unit testing code.  The full code listing, which we use for our internal testing, is provided at the end of this article.</p>
<p>If it is essential to enforce binary run-to-run reproducibility to the limits of precision, <strong>NMath</strong> provides a flag in its configuration class to ensure this is the case.  However, this flag should be set for unit testing only, because there can be a significant performance cost.  In general, expect a 10% to 20% reduction in performance, with some common operations degrading far more than that.  For example, some matrix multiplications will take twice the time with this flag set.</p>
<p>Note that the number of threads used by Intel&#8217;s MKL library (on which <strong>NMath</strong> depends) must also be fixed before setting the reproducibility flag.</p>
<pre lang="csharp">
int numThreads = 2;  // This must be fixed for reproducibility.
NMathConfiguration.SetMKLNumThreads( numThreads );
NMathConfiguration.Reproducibility = true;
</pre>
<p>This reproducibility configuration for <strong>NMath</strong> cannot be unset at a later point in the program.  Note that both the number of threads and the reproducibility flag may also be set in the application config file or in environment variables.  See the <a href="https://www.centerspace.net/doc/NMath/user/overview-83549.htm#Xoverview-83549">NMath User Guide</a> for instructions on how to do this.</p>
<p>Paul</p>
<p><strong>References</strong></p>
<p>M. A. Cornea-Hasegan, B. Norin.  <em>IA-64 Floating-Point Operations and the IEEE Standard for Binary Floating-Point Arithmetic</em>. Intel Technology Journal, Q4, 1999.<br />
<a href="http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf">http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf</a></p>
<p>D. Goldberg, <em>What Every Computer Scientist Should Know About Floating-Point Arithmetic</em>. Computing Surveys. March 1991.<br />
<a href="http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html">http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html</a></p>
<h3> Full <code>double</code> Comparison Code </h3>
<pre lang="csharp">
// Tolerance below which two values are considered equal outright.
// (The exact value used internally is on the order of machine
// epsilon; 1e-15 here is illustrative.)
private const double DOUBLE_EPSILON = 1e-15;

private static bool EqualToNumDigits( double expected, double actual, int numDigits )
{
  bool xNaN = double.IsNaN( expected );
  bool yNaN = double.IsNaN( actual );
  if ( xNaN && yNaN )
  {
    return true;
  }
  if ( xNaN || yNaN )
  {
    return false;
  }
  if ( numDigits <= 0 )
  {
    throw new InvalidArgumentException( "numDigits is not positive in TestCase::EqualToNumDigits." );
  }

  double max = System.Math.Abs( expected ) > System.Math.Abs( actual ) ? System.Math.Abs( expected ) : System.Math.Abs( actual );
  double diff = System.Math.Abs( expected - actual );
  double relDiff = max > 1.0 ? diff / max : diff;
  if ( relDiff <= DOUBLE_EPSILON )
  {
    return true;
  }

  int numDigitsAgree = (int) ( -System.Math.Floor( Math.Log10( relDiff ) ) - 1 );
  return numDigitsAgree >= numDigits;
}
</pre>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/precision-and-reproducibility-in-computing">Precision and Reproducibility in Computing</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/precision-and-reproducibility-in-computing/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5810</post-id>	</item>
		<item>
		<title>NMath Premium: FFT Performance</title>
		<link>https://www.centerspace.net/nmath-premium-fft-performance</link>
					<comments>https://www.centerspace.net/nmath-premium-fft-performance#respond</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Tue, 28 May 2013 16:00:29 +0000</pubDate>
				<category><![CDATA[NMath Premium]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[C#]]></category>
		<category><![CDATA[C# GPU]]></category>
		<category><![CDATA[C# Nvidia GPU]]></category>
		<category><![CDATA[GPU]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=4212</guid>

					<description><![CDATA[<p><img class="excerpt" title="Double Precision FFT" src="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-51.png" alt="NMath Premium" /><br />
NMath Premium is CenterSpace Software's NVIDIA GPU-accelerated edition of the NMath math and statistics library. Many linear algebra and signal processing algorithms can now run on a local NVIDIA GPU, frequently realizing severalfold performance gains. In this post, we look at the performance of complex-to-complex forward 1D and 2D FFTs.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/nmath-premium-fft-performance">NMath Premium: FFT Performance</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><strong>NMath Premium</strong> is our new GPU-accelerated math and statistics library for the .NET platform. The supported NVIDIA GPU routines include both a range of dense linear algebra algorithms and 1D and 2D Fast Fourier Transforms (FFTs). NMath Premium is designed to be a near drop-in replacement for NMath; however, there are a few important differences and additional logging capabilities that are specific to the premium product.</p>
<p><strong>NMath Premium</strong> will be released June 11. For immediate access, sign up <a href="https://www.centerspace.net/nmath-premium/">here</a> to join the beta program.</p>
<h2>Benchmark Approach</h2>
<p>Modern FFT implementations are hybridized algorithms which switch between algorithmic approaches and processing kernels depending on the available hardware, the FFT type, and the FFT length. An FFT library may use the straight Cooley-Tukey algorithm for a short power-of-two FFT but switch to Bluestein&#8217;s algorithm for odd-length FFTs. Further, depending on the factors of the FFT length, different combinations of processing kernels may be used. In other words, there is no single &#8216;FFT algorithm&#8217;, and so there is no easy expression for the FLOPs completed per FFT computed. Therefore, when analyzing the performance of FFT libraries today, performance is often reported <em>relative to the Cooley-Tukey implementation</em>, with the FLOPs estimated at <code>5 * N * log2( N )</code>. This relative performance is reported here. As an example, if we report a performance of 10 GFLOPs for a particular FFT, that means that to match the performance (finish as quickly) with an implementation of the Cooley-Tukey algorithm, you&#8217;d need a machine capable of 10 GFLOPs.</p>
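<p>As a sketch of how such a relative figure is computed (the exact benchmark harness may differ), the estimated FLOP count is divided by the measured wall-clock time:</p>
<pre lang="csharp">// Relative GFLOPS for an FFT of length n completed in 'seconds',
// using the conventional 5 * n * log2( n ) operation estimate.
static double RelativeGflops( int n, double seconds )
{
  double flops = 5.0 * n * Math.Log( n, 2.0 );
  return flops / seconds / 1e9;
}
</pre>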
<p>Because GPU computation takes place in a different memory space from the CPU, all data must be copied to the GPU and the results then copied back to the CPU. This copy time overhead <em>is included in all reported performance numbers.</em> We include this copy time to give our library users an accurate picture of attainable performance.</p>
<h3>GPU&#8217;s Tested</h3>
<p>The <strong>NMath Premium</strong> 1D and 2D FFT library was tested on four different NVIDIA GPUs and a 4-core 2.0 GHz Intel i7. These models represent the current range of performance available from NVIDIA, ranging from the widely installed GeForce GTX 525 to NVIDIA&#8217;s fastest double precision GPU, the Tesla K20.</p>
<table>
<tbody>
<tr>
<th>GPU</th>
<th>Peak GFLOP (single / double)</th>
<th>Summary</th>
</tr>
<tr>
<td>Tesla K20</td>
<td>3510 / 1170</td>
<td>Optimized for applications requiring double precision performance such as computational physics, biochemistry simulations, and computational finance.</td>
</tr>
<tr>
<td>Tesla K10</td>
<td>2288 / 95</td>
<td>This is a dual GPU processor card optimized for single precision performance for applications such as seismic and video or image processing. If both GPU cores are maximally utilized these GFLOP numbers would double.</td>
</tr>
<tr>
<td>Tesla 2090</td>
<td>1331 / 655</td>
<td>A single core GPU with a more balanced single and double precision performance.</td>
</tr>
<tr>
<td>GeForce 525</td>
<td>230 / &#8211;</td>
<td>A single core consumer GPU found in many gaming computers.</td>
</tr>
</tbody>
</table>
<h2>FFT Performance Charts</h2>
<p>The four charts below represent the performance of various power-of-two length, complex-to-complex forward 1D and 2D FFTs. All <strong>NMath</strong> products also seamlessly compute non-power-of-two length FFTs, but their performance is not part of this GPU comparison note.</p>
<p>The CPU-bound 1D FFT outperformed all of the GPUs for relatively short FFT lengths. This is expected: at short lengths, the data transfer overhead outweighs the GPUs&#8217; superior compute performance. Once the computational complexity of the 1D FFT is high enough, the data transfer overhead is outweighed by the efficient parallel nature of the GPUs, and they start to overtake the CPU-bound 1D FFT. This cross-over point occurs when the FFT reaches a length near 65536. The exception is the consumer-level GeForce GTX 525, where the GPU and CPU FFT performance roughly track each other.</p>
<p>The 2D FFT case is different because of the higher computational demand of two dimensions. First, in the single precision case the K20 lags, as it is designed primarily as a double precision computation engine; here the CPU-bound FFT outperforms the K20 for all image sizes. However, the K10 and 2090 are extremely fast (including the data transfer time) and outperform the CPU-bound 2D FFT by approximately 60-70%. In the double precision 2D FFT case, the K20 outperforms all other processors in nearly all cases measured. The tested K20 was memory limited in the [ 8192 x 8192 ] test case and couldn&#8217;t complete the computation.</p>
<table>
<tbody>
<tr>
<td><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-61.png"><img decoding="async" title="Performance of single precision FFT" alt="Performance of single precision FFT" src="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-61.png" width="350" /></a></td>
<td><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-51.png"><img decoding="async" title="Performance of double precision FFT" alt="Performance of double precision FFT" src="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-51.png" width="350" /></a></td>
</tr>
<tr>
<td><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-12.png"><img decoding="async" class="alignnone size-full wp-image-4231" title="Performance or single precision 2D FFT" alt="Performance or single precision 2D FFT" src="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-12.png" width="350" srcset="https://www.centerspace.net/wp-content/uploads/2013/02/ScreenClip-12.png 800w, https://www.centerspace.net/wp-content/uploads/2013/02/ScreenClip-12-300x262.png 300w" sizes="(max-width: 800px) 100vw, 800px" /></a></td>
<td><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-13.png"><img decoding="async" class="alignnone size-full wp-image-4232" title="Performance of double precision 2D FFT" alt="Performance of double precision 2D FFT" src="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-13.png" width="350" srcset="https://www.centerspace.net/wp-content/uploads/2013/02/ScreenClip-13.png 800w, https://www.centerspace.net/wp-content/uploads/2013/02/ScreenClip-13-300x262.png 300w" sizes="(max-width: 800px) 100vw, 800px" /></a></td>
</tr>
</tbody>
</table>
<h3>Batch FFT</h3>
<p>To amortize the cost of data transfer to and from the GPU, <strong>NMath Premium</strong> can run FFTs in batches of signal arrays. For the smaller FFT sizes, batch processing nearly doubles the performance of the FFT on the GPU. As the length of the FFT increases, the advantage of batch processing decreases because the full set of signal arrays can no longer be loaded into the GPU.</p>
<p><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/02/BatchFFT.png"><img decoding="async" class="alignnone size-full wp-image-4259" title="Performance of batch 1D FFT" alt="" src="https://www.centerspace.net/blog/wp-content/uploads/2013/02/BatchFFT.png" width="350" srcset="https://www.centerspace.net/wp-content/uploads/2013/02/BatchFFT.png 800w, https://www.centerspace.net/wp-content/uploads/2013/02/BatchFFT-300x262.png 300w" sizes="(max-width: 800px) 100vw, 800px" /></a></p>
<h2>Summary</h2>
<p>As the complexity of the FFT increases, whether due to an increase in length or in problem dimension, the GPU-leveraged FFT performance overtakes the CPU-bound version. The advantage of the GPU 1D FFT grows substantially as the FFT length grows beyond ~100,000 samples. Batch processing of signals arranged in the rows of a matrix can be used to mitigate the data transfer overhead to the GPU. There are times when it may be advantageous to offload FFT processing onto the GPU even when CPU-bound performance is greater, because this frees many CPU cycles for other activities. Because <strong>NMath Premium</strong> supports adjustable crossover thresholds, the developer can control the FFT length at which FFT computation switches to the GPU. Setting this threshold to zero will push all FFT processing to the GPU, completely offloading this work from the CPU.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/nmath-premium-fft-performance">NMath Premium: FFT Performance</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/nmath-premium-fft-performance/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4212</post-id>	</item>
		<item>
		<title>Clearing a vector</title>
		<link>https://www.centerspace.net/clearing-a-vector</link>
					<comments>https://www.centerspace.net/clearing-a-vector#respond</comments>
		
		<dc:creator><![CDATA[Trevor Misfeldt]]></dc:creator>
		<pubDate>Wed, 09 Nov 2011 22:28:01 +0000</pubDate>
				<category><![CDATA[.NET]]></category>
		<category><![CDATA[Best Practices]]></category>
		<category><![CDATA[NMath]]></category>
		<category><![CDATA[NMath Tutorial]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[clearing a matrix]]></category>
		<category><![CDATA[clearing a vector]]></category>
		<category><![CDATA[NMath matirx]]></category>
		<category><![CDATA[NMath vector]]></category>
		<category><![CDATA[zeroing a matrix]]></category>
		<category><![CDATA[zeroing a vector]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=3621</guid>

					<description><![CDATA[<p>A customer recently asked us for the best method to zero out a vector. We decided to run some tests to find out. Here are the five methods we tried followed by performance timing and any drawbacks. The following tests were performed on a DoubleVector of length 100,000,000. 1) Create a new vector. This isn&#8217;t [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/clearing-a-vector">Clearing a vector</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>A customer recently asked us for the best method to zero out a vector. We decided to run some tests to find out. Here are the five methods we tried followed by performance timing and any drawbacks.</p>
<p>The following tests were performed on a <code>DoubleVector</code> of length 100,000,000 (the <code>size</code> used in the test code below).</p>
<p>1) Create a new vector. This isn&#8217;t really clearing out an existing vector but we thought we should include it for completeness.</p>
<pre lang="csharp" line="1"> DoubleVector v2 = new DoubleVector( v.Length, 0.0 );</pre>
<p>The big drawback here is that you&#8217;re creating new memory. Time: <strong>419.5ms</strong></p>
<p>2) Probably the first thing to come to mind is to simply iterate through the vector and set everything to zero.</p>
<pre lang="csharp" line="1">
for ( int i = 0; i < v.Length; i++ )
{
  v[i] = 0.0;
}</pre>
<p>The index operator must do bounds checking on every access. No new memory is created. Time: <strong>578.5ms</strong></p>
<p>3) In some cases, you could iterate through the underlying array of data inside the DoubleVector.</p>
<pre lang="csharp" line="1"> 
for ( int i = 0; i &lt; v.DataBlock.Data.Length; i++ )
{
  v.DataBlock.Data[i] = 0.0;
}</pre>
<p>This is a little less intuitive. And, very importantly, it will not work with many views into other data structures, such as a row slice of a matrix. However, this loop is easier for the CLR to optimize. Time: <strong>173.5ms</strong></p>
<p>4) We can use the power of Intel's MKL to multiply the vector by zero.</p>
<pre lang="csharp" line="1"> v.Scale( 0.0 );</pre>
<p>Scale() does this in-place. No new memory is created. In this example, we assume that MKL has already been loaded and is ready to go which is true if another MKL-based NMath call was already made or if NMath was <a href="/initializing-nmath/">initialized</a>. This method will work on all views of other data structures. Time: <strong>170ms</strong></p>
<p>5) This surprised us a bit, but the best method we could find was to clear the underlying array using .NET's Array.Clear().</p>
<pre lang="csharp" line="1"> Array.Clear( v.DataBlock.Data, 0, v.DataBlock.Data.Length );</pre>
<p>This creates no new memory and is very fast; however, it will not work with non-contiguous views. Time: <strong>85.8ms</strong></p>
<p>To make efficient clearing simpler for NMath users, we have added a <code>Clear()</code> method and a <code>Clear( Slice )</code> method to the vector and matrix classes.  They do the right thing in each circumstance and will be released in NMath 5.2 in 2012.</p>
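<p>Once NMath 5.2 is available, usage would look like the following sketch (the <code>Slice</code> arguments here are illustrative):</p>
<pre lang="csharp"> v.Clear();                        // zero every element
 v.Clear( new Slice( 0, 100 ) );   // zero the first 100 elements
</pre>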
<h3> Test Code </h3>
<pre lang="csharp" line="1">
using System;
using CenterSpace.NMath.Core;

namespace Test
{
  class ClearVector
  {
    static int size = 100000000;
    static int runs = 10;
    static int methods = 5;
    
    static void Main( string[] args )
    {
      System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
      DoubleMatrix times = new DoubleMatrix( runs, methods );
      NMathKernel.Init();

      for ( int run = 0; run < runs; run++ )
      {
        Console.WriteLine( "Run {0}...", run );
        DoubleVector v = null;

        // Create a new one
        v = new DoubleVector( size, 1.0, 2.0 );
        sw.Start();
        DoubleVector v2 = new DoubleVector( v.Length, 0.0 );
        sw.Stop();
        times[run, 0] = sw.ElapsedMilliseconds;
        Console.WriteLine( Assert( v2 ) );

        // iterate through vector
        v = new DoubleVector( size, 1.0, 2.0 );
        sw.Reset();
        sw.Start();
        for ( int i = 0; i < v.Length; i++ )
        {
          v[i] = 0.0;
        }
        sw.Stop();
        times[run, 1] = sw.ElapsedMilliseconds;
        Console.WriteLine( Assert( v ) );

        // iterate through array
        v = new DoubleVector( size, 1.0, 2.0 );
        sw.Reset();
        sw.Start();
        for ( int i = 0; i < v.DataBlock.Data.Length; i++ )
        {
          v.DataBlock.Data[i] = 0.0;
        }
        sw.Stop();
        times[run, 2] = sw.ElapsedMilliseconds;
        Console.WriteLine( Assert( v ) );
        
        // scale
        v = new DoubleVector( size, 1.0, 2.0 );
        sw.Reset();
        sw.Start();
        v.Scale( 0.0 );
        sw.Stop();
        times[run, 3] = sw.ElapsedMilliseconds;
        Console.WriteLine( Assert( v ) );

        // Array Clear
        v = new DoubleVector( size, 1.0, 2.0 );
        sw.Reset();
        sw.Start();
        Array.Clear( v.DataBlock.Data, 0, v.DataBlock.Data.Length );
        sw.Stop();
        times[run, 4] = sw.ElapsedMilliseconds;
        Console.WriteLine( Assert( v ) );
        Console.WriteLine( times.Row( run ) );
      }
      Console.WriteLine( "Means: " + NMathFunctions.Mean( times ) );
    }

    private static bool Assert( DoubleVector v )
    {
      if ( v.Length != size )
      {
        return false;
      }
      for ( int i = 0; i < v.Length; ++i )
      {
        if ( v[i] != 0.0 )
        {
          return false;
        }
      }
      return true;
    }
  }
}
</pre>
<p>- Trevor</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/clearing-a-vector">Clearing a vector</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/clearing-a-vector/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3621</post-id>	</item>
		<item>
		<title>Initializing NMath</title>
		<link>https://www.centerspace.net/initializing-nmath</link>
					<comments>https://www.centerspace.net/initializing-nmath#respond</comments>
		
		<dc:creator><![CDATA[Trevor Misfeldt]]></dc:creator>
		<pubDate>Wed, 09 Nov 2011 22:01:27 +0000</pubDate>
				<category><![CDATA[Best Practices]]></category>
		<category><![CDATA[NMath]]></category>
		<category><![CDATA[NMath Tutorial]]></category>
		<category><![CDATA[Performance]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=3611</guid>

					<description><![CDATA[<p>NMath uses Intel&#8217;s Math Kernel Library (MKL) internally. This code contains native, optimized code to wring out the best performance possible. There is a one-time delay when the appropriate x86 or x64 native code is loaded. This cost can be easily controlled by the developer by using the NMathKernel.Init() method. Please see Initializing NMath for [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/initializing-nmath">Initializing NMath</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>NMath uses Intel&#8217;s Math Kernel Library (MKL) internally. This library contains native, optimized code to wring out the best performance possible.</p>
<p>There is a one-time delay when the appropriate x86 or x64 native code is loaded. The developer can easily control when this cost is paid by calling the <code>NMathKernel.Init()</code> method. Please see <a href="http://centerspace.net/doc/NMath/user/overview-83549.htm">Initializing NMath</a> for more details.</p>
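<p>For example, calling <code>Init()</code> once at application startup moves the load cost out of timing-sensitive code:</p>
<pre lang="csharp">using CenterSpace.NMath.Core;

// Pay the one-time native-library load cost up front,
// before any performance measurements are taken.
NMathKernel.Init();
</pre>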
<p>&#8211; Trevor</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/initializing-nmath">Initializing NMath</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/initializing-nmath/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3611</post-id>	</item>
		<item>
		<title>Forward Scaling Computing</title>
		<link>https://www.centerspace.net/forward-scaling-computing</link>
					<comments>https://www.centerspace.net/forward-scaling-computing#respond</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Thu, 28 Jan 2010 18:02:13 +0000</pubDate>
				<category><![CDATA[CenterSpace]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[Ct]]></category>
		<category><![CDATA[data parallelism]]></category>
		<category><![CDATA[Forward Scaling computing]]></category>
		<category><![CDATA[NMR in the cloud]]></category>
		<category><![CDATA[task parallel library]]></category>
		<category><![CDATA[task parallelism]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=1327</guid>

					<description><![CDATA[<p><img class="excerpt" src="https://www.centerspace.net/blog/wp-content/uploads/2010/01/scc-h-wafer_small-150x150.jpg" /><br />
The era of sequential, single-threaded software development deployed to a uniprocessor machine is rapidly fading into history.  Nearly all computers sold today have at least two, if not four cores - and will have eight in the near future.  Intel announced last month the successful production and testing of a new <a href="http://bit.ly/4SJiun">48-core research processor</a> which will be made available to industry and academia for research and development of <em> manycore </em> parallel software developer tools and languages.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/forward-scaling-computing">Forward Scaling Computing</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<h2>Forward Scaling for Multicore Performance</h2>
<p>The era of sequential, single-threaded software development deployed to a uniprocessor machine is rapidly fading into history.  Nearly all computers sold today have at least two, if not four cores &#8211; and will have eight in the near future.  Intel announced last month the successful production and testing of a new <a href="http://www.wired.com/2009/12/intel-48-core-processor/">48-core research processor</a> which will be made available to industry and academia for research and development of <em> manycore </em> parallel software developer tools and languages.</p>
<figure id="attachment_1172" aria-describedby="caption-attachment-1172" style="width: 300px" class="wp-caption aligncenter"><img decoding="async" loading="lazy" src="https://www.centerspace.net/blog/wp-content/uploads/2010/01/scc-h-wafer_small-300x200.jpg" alt="Intel&#039;s 48-core processor" title="Intel&#039;s 48-core processor" width="300" height="200" class="size-medium wp-image-1172" srcset="https://www.centerspace.net/wp-content/uploads/2010/01/scc-h-wafer_small-300x200.jpg 300w, https://www.centerspace.net/wp-content/uploads/2010/01/scc-h-wafer_small-1024x682.jpg 1024w, https://www.centerspace.net/wp-content/uploads/2010/01/scc-h-wafer_small.jpg 1500w" sizes="(max-width: 300px) 100vw, 300px" /><figcaption id="caption-attachment-1172" class="wp-caption-text">Intel's recently announced 48-core processor</figcaption></figure>
<p>In the near future, users of high performance software in finance, bio-informatics, or GIS will expect their applications to scale with core count, and software that fails to do so will either need to be rewritten or abandoned.  To future-proof performance-sensitive software, code written today needs to be multicore aware and scale automatically to all available cores &#8211; this is the key idea behind forward scaling software.  If Moore&#8217;s &#8216;law&#8217; is to be sustained into the future, hardware scalability must be joined with a similar shift in software.  This fundamental shift in computing and application development, termed the &#8216;Manycore Shift&#8217; by Microsoft, is an evolutionary shift that software developers must appreciate and adapt to in order to create long-living, scalable applications.</p>
<h2> CenterSpace&#8217;s Forward Scaling Strategy </h2>
<p>This project of creating forward scaling software can sound daunting, but for many application developers it reduces to choosing the right languages &#038; libraries for the computationally demanding portions of their application.  If an application&#8217;s performance-sensitive components are forward scaling, so goes the application. <!-- problem size grows, serial parts typically grow slower--> At CenterSpace we are very performance sensitive and are working to ensure that our users benefit from forward scaling behavior.  Linear scaling with core count cannot always be achieved, but in the numerical computing domain we can frequently come close to this ideal.</p>
<p>Here are some parallel computing technologies that we are looking at adopting to ensure we meet this goal.<br />
<span id="more-1327"></span></p>
<h3> Microsoft&#8217;s Task Parallel Library </h3>
<p>Microsoft Research released the <a href="https://msdn.microsoft.com/en-us/library/dd460717(v=vs.110).aspx">Task Parallel Library</a> about two years ago, and most recently updated it in June 2008.  Building on this experience, Microsoft will include task-based parallelism in the March 2010 release of .NET 4.</p>
<p>These parallel extensions will reside in a new static class <code>Parallel</code> inside the <code>System.Threading.Tasks</code> namespace.  While this framework provides an extensible set of classes for complex task parallelism problems, the common use cases will include the typical variants of the lowly loop.  Here&#8217;s a very simple example contrasting sequential and parallel code patterns for computing the element-wise square of a vector.</p>
<pre lang="csharp">
// Single-threaded loop
Double[] s = new Double[n];
for(int i = 0; i < n; i++)
  s[i] = v[i]*v[i];

// Using the Parallel class &#038; anonymous delegates
Double[] s = new Double[n];
Parallel.For(0, n, delegate (int i)
{
  s[i] = v[i]*v[i];
} );

// Using the Parallel class &#038; lambda expressions
Double[] s = new Double[n];
Parallel.For(0, n, (i) => s[i] = v[i]*v[i]);
</pre>
<p>Note how concisely the parallel looping code can be written with lambda expressions.  If you haven&#8217;t yet gotten excited about lambda expressions, hopefully you are now!   Also, <em>outer variables</em> can be referenced from inside the lambda expression (or the anonymous delegate), making code ports fairly simple once the inherent parallelism is recognized.</p>
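<p>For instance, a scale factor declared outside the loop can be captured directly by the lambda (a minimal sketch; <code>v</code>, <code>s</code>, and <code>n</code> are assumed to be declared as in the example above):</p>
<pre lang="csharp">
// An outer variable captured by the lambda -- no explicit
// argument passing is needed when porting serial code.
double alpha = 2.5;
Parallel.For(0, n, (i) => s[i] = alpha * v[i] * v[i]);
</pre>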
<h3> Intel&#8217;s Ct Data Parallel VM </h3>
<p>CenterSpace already leverages Intel&#8217;s forward scaling implementations of BLAS and LAPACK in our NMath and NMath Stats libraries, so we are very attuned to Intel&#8217;s efforts to bring programmers new tools to leverage their multicore chips.  Intel has a long history of supporting developers creating and debugging multithreaded applications by offering solid libraries, compilers, and debuggers.  Intel has a major initiative called the <a href="https://software.intel.com/en-us/articles/tera-scale-computing-a-parallel-path-to-the-future">Tera-Scale Computing Research Program</a> to push forward all areas of high performance computing, spanning from hardware to software.</p>
<p>At CenterSpace we are particularly interested in a new <em>data parallel</em> virtual machine that will offer all the data parallel functionality of Ct to any language with C bindings.  Backing up a bit, <a href="https://software.intel.com/en-us/articles/tera-scale-computing-a-parallel-path-to-the-future">Ct</a> is a new data parallel language in development at Intel that, with the help of the recently acquired RapidMind (August 2009), will enable programmer-friendly data-parallel programming in widespread languages such as C++.  The new <a href="https://software.intel.com/en-us/articles/data-parallel-vm/">data parallel virtual machine</a>, with its C front end, can interoperate with languages such as C#, allowing programmers using the .NET family of languages to leverage this data parallel technology.</p>
<p>This is a fundamentally different approach from Microsoft&#8217;s task-based parallel classes.  In the example above, note that the computation in the lambda expression must be independent for every <code>i</code>.  This places a burden on the programmer to identify and create these parallel lambda expressions; the data parallel approach frees the programmer of this significant burden.  Computing the dot product in C#, leveraging the C front end to the data parallel VM, might look something like:</p>
<pre lang="csharp">
// Using a Ct based parallel class & lambda expressions
CtVector<int> v1 = new CtVector<int>(1,2, ... ,49999,50000);
CtVector<int> v2 = new CtVector<int>(50000,49999, ... ,2,1);
int dot_product = CtParallel.AddReduce(v1*v2);
</pre>
<p>Note that the dot product operation is not easily converted to a task-based parallel implementation, since each operation in the lambda expression is not independent but instead requires both a multiplication and a summation (reduction).  However, there is significant data parallelism in the vector product and reduction steps that should exhibit near linear scaling with processor count using this Ct based implementation.</p>
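<p>For completeness, a dot product <em>can</em> be expressed with the task parallel library, but only by explicitly carrying per-thread partial sums and combining them at the end &#8211; a sketch using the thread-local-state overload of <code>Parallel.For</code> (<code>v1</code>, <code>v2</code>, and <code>n</code> are assumed to be declared elsewhere):</p>
<pre lang="csharp">
double total = 0.0;
object sync = new object();
Parallel.For(0, n,
  () => 0.0,                                    // per-thread initial partial sum
  (i, loopState, partial) => partial + v1[i] * v2[i],
  (partial) => { lock (sync) { total += partial; } });  // combine partials
</pre>
<p>The data parallel version above remains considerably simpler, which is precisely its appeal.</p>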
<p>As with task-based parallelism, the data parallelism approach has its drawbacks.  First, the data must reside not in native types but in special Ct containers (<code>CtVector</code> in this example) that the data parallel engine knows how to divide and reassemble.  This is generally a minor issue; however, if your data doesn&#8217;t naturally reside in vectors or matrices at all, data parallelism may not be an option for leveraging multicore hardware &#8211; task-based parallelism may be the answer.  Both approaches have their strengths and weaknesses, and application domains where each shines.  At CenterSpace, since our focus is on numerical computation, our data typically resides comfortably in vectors, so we expect Ct&#8217;s data parallel approach to be an important tool in our future.</p>
<h3> Cloud Computing with EC2 &#038; Azure </h3>
<p>Cloud computing, typically thought of as a room full of servers and disk drives, is not conceptually that different from a single computer with many cores.  In fact, Intel likes to refer to their new 48-core processor as a <em>single-chip cloud computer</em>.  In both cases the central goals are performance and scalability.</p>
<p>Since many of CenterSpace&#8217;s customers have high computational demands and often process large datasets, we have started pushing some NMath functionality out into the cloud.  A powerful &#038; computationally demanding data clustering algorithm called Non-negative Matrix Factorization was our first NMath port to the cloud.  With this algorithm now residing in the cloud, customers can access this high-performance NMF implementation from virtually any programming language, cutting their run times from days to hours.  We&#8217;ll be blogging more on our cloud computing efforts in the near future.</p>
<p>Happy Computing,</p>
<p><em> -Paul </em></p>
<p><em> References &#038; Additional Resources </em></p>
<ol>
<li>Toub, Stephen.  Patterns of Parallel Programming.  Whitepaper, Microsoft Corporation, 2009.</li>
<li><a href="https://www.microsoft.com/en-us/download/details.aspx?id=17702">The Manycore Shift</a>: Microsoft Parallel Computing Initiative Ushers Computing into the Next Era, 2007.</li>
<li>An in-depth technical <a href="http://www.intel.com/technology/itj/2007/v11i4/4-libraries/2-intro.htm">article</a> on Intel&#8217;s numeric-intensive, multi-core ready technologies, November 2007.</li>
</ol>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/forward-scaling-computing">Forward Scaling Computing</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/forward-scaling-computing/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1327</post-id>	</item>
		<item>
		<title>High Performance Numerics in C#</title>
		<link>https://www.centerspace.net/high-performance-numerics-in-c</link>
					<comments>https://www.centerspace.net/high-performance-numerics-in-c#comments</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Mon, 21 Dec 2009 19:52:38 +0000</pubDate>
				<category><![CDATA[Performance]]></category>
		<category><![CDATA[forward scaling]]></category>
		<category><![CDATA[multi-core performance]]></category>
		<category><![CDATA[numerics c#]]></category>
		<category><![CDATA[stackoverflow question]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=678</guid>

					<description><![CDATA[<p>Recently a programmer on stackoverflow commented that the performance of NMath was &#8220;really amazing&#8221; and was wondering how we achieved that performance in the context of the .NET/C# framework/language pair. This blog post discusses how CenterSpace achieves such great performance in this memory managed framework. A future post will discuss where we are looking to [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/high-performance-numerics-in-c">High Performance Numerics in C#</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Recently a programmer on stackoverflow commented that the performance of NMath was &#8220;<a href="http://stackoverflow.com/questions/1831353/the-speed-of-net-in-numerical-computing">really amazing</a>&#8221; and was wondering how we achieved that performance in the context of the .NET/C# framework/language pair.  This blog post discusses how CenterSpace achieves such great performance in this memory managed framework.  A future post will discuss where we are looking to gain even more performance.<br />
<span id="more-678"></span></p>
<h2> 1. C# is Fast, Memory Allocation Is Not</h2>
<p>CenterSpace libraries never allocate memory unless absolutely necessary, and we provide an API that doesn&#8217;t force users to unnecessarily allocate memory.  For example, where appropriate, nearly all classes provide two method signatures for each computational operation &#8211; one that returns a new vector, and one that returns the result in a caller-provided vector passed by reference.</p>
<pre lang="csharp">
 Double1DConvolution conv = new Double1DConvolution(kernel, 256);

// Allocates and returns the result in a new vector.
DoubleVector result = conv.Convolve(data);

// Returns the result in the provided vector.
conv.Convolve(data, ref result);
</pre>
<p>In a loop, the latter is far superior if the <code>result</code> vector is reused.  The former is fine for a one-off result, and is a convenient method signature for the API user.  Inexperienced C# programmers often complain that their applications suffer from poor performance &#8211; and frequently the root of the issue is not the language itself, but poor memory allocation/reuse practices.  Languages that offer garbage collection services are easy to abuse in this way.</p>
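<p>To make the contrast concrete, here is a loop that allocates the result vector once and reuses it across iterations &#8211; a sketch in which <code>firstFrame</code>, <code>remainingFrames</code>, and <code>Process</code> are hypothetical stand-ins for your data source and downstream code:</p>
<pre lang="csharp">
Double1DConvolution conv = new Double1DConvolution(kernel, 256);

// First call allocates the result vector once...
DoubleVector result = conv.Convolve(firstFrame);
Process(result);

// ...and every subsequent iteration reuses it.
foreach (DoubleVector frame in remainingFrames)
{
  conv.Convolve(frame, ref result);
  Process(result);
}
</pre>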
<h2> 2. Precision  &#8211; Ability to Use Just What You Need </h2>
<p>Frequently programmers do all of their computation using <code> Double </code> precision math.  If 7 digits of precision are all you need, using strictly single precision algorithms will vastly improve performance.  Below is a table comparing double and single precision FFT&#8217;s computed using NMath.</p>
<pre class="code">
<table align="center">
<tr bgcolor="#efefef">
<th> FFT Length </th> <th> Double Precision (ns) </th> <th> Single Precision (ns) </th> <th> Performance Gain </th>
</tr>
<tr align="center"> <td> 1024 </td> <td> 200 </td> <td> 25 </td> <td> 8X </td> </tr>
<tr align="center"> <td> 2048 </td> <td> 325 </td> <td> 50 </td> <td> 6.5X </td> </tr>
<tr align="center"> <td> 4096 </td> <td> 675 </td> <td> 150 </td> <td> 4.5X </td> </tr>
</table>
</pre>
<p>Clearly, if the precision is not necessary, the performance gain in switching from double to single precision is considerable (not to mention the memory saving for the data storage).  NMath provides both single and double precision options for nearly every class.  </p>
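<p>In NMath the switch is typically just a change of class name &#8211; the single precision classes mirror their double precision counterparts.  A sketch (class names follow NMath&#8217;s Double/Float naming convention; consult the documentation for exact signatures):</p>
<pre lang="csharp">
// Double precision forward 1D FFT of length 1024
DoubleForward1DFFT fft64 = new DoubleForward1DFFT(1024);

// Single precision analogue -- same usage pattern, and
// considerably faster when 7 digits of precision suffice
FloatForward1DFFT fft32 = new FloatForward1DFFT(1024);
</pre>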
<h2> 3. Processor Optimized Code </h2>
<p>Part of the NMath class library is based on <a href="https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms">BLAS</a> and <a href="https://en.wikipedia.org/wiki/LAPACK">LAPACK</a>, two long established interfaces for linear algebra.  We use Intel&#8217;s implementation of these libraries because Intel carefully optimizes their performance for the Intel multicore processors on an on-going basis.  We also leverage MKL&#8217;s implementation of the FFT.  Below is a brief comparison between NMath&#8217;s FFT and FFTW (the FFT implementation shipped with MATLAB) &#8211; on a different machine than above.</p>
<pre class="code">
<table align="center"><tbody >
<tr>
<th colspan="3"> Comparison of a forward, real, out-of-place FFT. </th>
</tr>
<tr> 
<th> FFT length</th> <th> FFTW </th> <th> NMATH FFT </th></tr>
<tr> 
<td> 1024 </td> <td> 4.14 &mu;s</td> <td> 4.36 &mu;s </td> </tr>
<tr> 
<td> 1000</td> <td> 5.98 &mu;s </td> <td> 5.33 &mu;s </td> </tr>
<tr> 
<td> 4096</td> <td> 20.31 &mu;s </td> <td> 21.71 &mu;s </td> </tr>
<tr> 
<td> 4095</td> <td> 49.90 &mu;s </td> <td> 43.01 &mu;s </td> </tr>
<tr> 
<td> 1024^2 </td> <td> 17.16 ms </td> <td> 15.63 ms </td> </tr>
</tbody>
</table>
</pre>
<p>Clearly, .NET programmers can have both the productive development language of C# and world-class computational performance.</p>
<p>There are a couple of ways to call a native library from the .NET framework without unacceptably impacting performance.  Using P/Invoke, the library can be called directly from C#.  Due to the cost of marshaling the data, this is not a good option for many short computations, but for significant operations the P/Invoke cost is negligible.  We also have the option of calling a C++/CLI routine from C#, pinning pointers to data allocated in the managed heap, and then calling the Intel library.  In terms of performance, pinning pointers in C++/CLI is generally better, but it is also more complex to manage pointers into both the managed and unmanaged heaps.  CenterSpace uses both techniques.</p>
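<p>As an illustration of the P/Invoke route, a declaration for a native BLAS dot product might look like the following (the DLL name and calling details here are illustrative, not NMath&#8217;s actual bindings):</p>
<pre lang="csharp">
using System.Runtime.InteropServices;

class NativeBlas
{
  // cblas_ddot computes the dot product of two double vectors.
  [DllImport("mkl_rt.dll", EntryPoint = "cblas_ddot")]
  static extern double cblas_ddot(int n,
    double[] x, int incX, double[] y, int incY);

  public static double Dot(double[] x, double[] y)
  {
    // The runtime pins the arrays for the duration of the call, so
    // the marshaling cost is negligible next to an O(n) computation.
    return cblas_ddot(x.Length, x, 1, y, 1);
  }
}
</pre>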
<p>Happy Computing,</p>
<p><em> -Paul </em></p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/high-performance-numerics-in-c">High Performance Numerics in C#</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/high-performance-numerics-in-c/feed</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">678</post-id>	</item>
		<item>
		<title>Modern Fast Fourier Transform</title>
		<link>https://www.centerspace.net/modern-fast-fourier-transform</link>
					<comments>https://www.centerspace.net/modern-fast-fourier-transform#comments</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Tue, 29 Sep 2009 05:32:46 +0000</pubDate>
				<category><![CDATA[NMath]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[FFT]]></category>
		<category><![CDATA[FFT performance]]></category>
		<category><![CDATA[High performance FFT]]></category>
		<category><![CDATA[Multicore FFT]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=226</guid>

					<description><![CDATA[<p>All variants of the original Cooley-Tukey O(n log n) fast Fourier transform fundamentally exploit different ways to factor the discrete Fourier summation of length N. For example, the split-radix FFT algorithm divides the Fourier summation of length N into three new Fourier summations: one of length N/2 and two of length N/4. The prime factor [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/modern-fast-fourier-transform">Modern Fast Fourier Transform</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>All variants of the original Cooley-Tukey O(n log n) fast Fourier transform fundamentally exploit different ways to factor the discrete Fourier summation of length N.</p>
<p><center><br />
<a href="http://www.codecogs.com/eqnedit.php?latex=X_k = \sum_{n=0}^{N-1} x_n e^{(-2 \pi i / N) kn} \ \ \ \ \ k = 0, ... ,N-1" target="_blank" rel="noopener"><img decoding="async" title="X_k = \sum_{n=0}^{N-1} x_n e^{(-2 \pi i / N) kn} \ \ \ \ \ k = 0, ... ,N-1" src="http://latex.codecogs.com/gif.latex?X_k = \sum_{n=0}^{N-1} x_n e^{(-2 \pi i / N) kn} \ \ \ \ \ k = 0, ... ,N-1" alt="" /></a></center><br />
For example, the <em>split-radix FFT</em> algorithm divides the Fourier summation of length N into three new Fourier summations: one of length N/2 and two of length N/4.</p>
<p><center><br />
<a href="http://www.codecogs.com/eqnedit.php?latex=X_{k_N} = X_{k_{N/2}} + X_{k_{N/4}} + X_{k_{N/4}}" target="_blank" rel="noopener"><img decoding="async" title="X_{k_N} = X_{k_{N/2}} + X_{k_{N/4}} + X_{k_{N/4}}" src="http://latex.codecogs.com/gif.latex?X_{k_N} = X_{k_{N/2}} + X_{k_{N/4}} + X_{k_{N/4}}" alt="" /></a></center><br />
The <em>prime factor FFT</em>, divides the Fourier summation of length N, into two (if they exist) summations of length N1 and N2, where N1 and N2 must be relatively prime.</p>
<p><center><br />
<a href="http://www.codecogs.com/eqnedit.php?latex=X_{k_N} = X_{k_{N1}} ( X_{k_{N2}} ) \ \ where \ N1 \perp N2" target="_blank" rel="noopener"><img decoding="async" title="X_{k_N} = X_{k_{N1}} ( X_{k_{N2}} ) \ \ where \ N1 \perp N2" src="http://latex.codecogs.com/gif.latex?X_{k_N} = X_{k_{N1}} ( X_{k_{N2}} ) \ \ where \ N1 \perp N2" alt="" /></a></center><br />
These algorithms are typically applied recursively, and in combination with one another (or with still other factorizations) to maximize performance for a particular N.</p>
<p>In modern implementations there really isn&#8217;t a single static FFT algorithm, but more a dynamic collection of FFT algorithms and tools that are cleverly collated for the Fourier transform type at hand. Major algorithmic changes occur in the underlying implementation as the length and forward domain (real or complex) of the problem vary. Sophisticated FFT implementations insulate the end-user programmer from all of this background machinery.</p>
<h5>DFT length is fundamental to performance</h5>
<p>The days of power-of-2-only FFT algorithms are dead. Users of modern FFT libraries should not need to worry about the large complexities involved in finding the optimal algorithm for the FFT computation at hand; the library should look at the FFT length, problem domain (real or complex), number of machine cores, and machine architecture, and find and compute with the best hybridized FFT algorithm available. However, it is still helpful to understand that your realized performance will depend fundamentally on the factorization of your FFT length. Most users know that the best FFT performance is had when N is a power of 2. If this stringent length requirement cannot be met, then it is best to use a length that can be factored into small primes. CenterSpace&#8217;s FFT algorithms contain optimized kernels for prime factor lengths of 2, 3, 5, 7 and 11. The table below demonstrates the FFT performance sensitivity to FFT length.</p>
<table border="0" cellpadding="4">
<caption>Forward real 1D FFT performance at various lengths.</caption>
<tbody>
<tr align="center">
<td><em> DFT Length </em></td>
<td><em>Factors </em></td>
<td><em>MFLOP approximation </em></td>
</tr>
<tr>
<td>512</td>
<td>2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2</td>
<td>5324.5</td>
</tr>
<tr>
<td>511</td>
<td>7 x 73</td>
<td>1327.8</td>
</tr>
<tr>
<td>510</td>
<td>2 x 3 x 5 x 17</td>
<td>3879.4</td>
</tr>
<tr>
<td>509</td>
<td>509 (prime)</td>
<td>1762.4</td>
</tr>
<tr>
<td>508</td>
<td>2 x 2 x 127</td>
<td>2637.6</td>
</tr>
<tr>
<td>507</td>
<td>3 x 13 x 13</td>
<td>2631.5</td>
</tr>
<tr>
<td>506</td>
<td>2 x 11 x 23</td>
<td>3938.3</td>
</tr>
<tr>
<td>505</td>
<td>5 x 101</td>
<td>1122.6</td>
</tr>
<tr>
<td>504</td>
<td>2 x 2 x 2 x 3 x 3 x 7</td>
<td>5227</td>
</tr>
</tbody>
</table>
<p>Clearly the fastest FFT&#8217;s are for lengths that can be factored into small primes (512, 510, 507, 506, 504), and especially small primes that have optimized kernels (512 and 504). The more kernel-optimized primes your FFT length contains, the faster it will run. This is a universal fact that all FFT implementations confront, and it holds true for higher dimension FFT&#8217;s as well. <em>Slight changes in length can have a profound impact on FFT performance</em>.</p>
<p>You can factor your FFT length using an online service to assess how your FFT will perform.</p>
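<p>Alternatively, a few lines of code suffice to factor a candidate length and check it against the optimized kernel primes (2, 3, 5, 7 and 11) &#8211; a simple sketch:</p>
<pre lang="csharp">
using System.Collections.Generic;

static List<int> Factor(int n)
{
  // Trial division is plenty fast for realistic FFT lengths.
  List<int> factors = new List<int>();
  for (int p = 2; p * p <= n; p++)
    while (n % p == 0) { factors.Add(p); n /= p; }
  if (n > 1) factors.Add(n);   // remaining prime factor
  return factors;
}

// Factor(510) yields 2 x 3 x 5 x 17; the factor 17 falls outside the
// optimized kernel set, so a length of 512 should be noticeably faster.
</pre>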
<h5>Multi-core Scalability</h5>
<p>The ability to factor a particular FFT into a set of independent computations makes it fundamentally suitable for parallelization. All modern desktop and many laptop computers today contain at least two processor cores, and any modern math library should exploit this fact where possible. CenterSpace&#8217;s complex domain FFT&#8217;s (and related convolutions) are multi-core aware, and automatically expand to fully utilize the available processor cores. Small problems are run on a single core, but once the computational advantages of algorithm parallelization overcome the overhead costs of multi-core parallelization, the computation is spread across all available cores. This automatic parallelization is gained simply by using CenterSpace&#8217;s NMath class libraries. No end-user programming effort is involved.</p>
<table border="0" cellpadding="6">
<caption>Forward complex 1D FFT performance on 1 and 8 cores.</caption>
<tbody>
<tr align="center">
<th><em> FFT Length </em></th>
<th><em> Machine Cores </em></th>
<th><em> Time (seconds) </em></th>
<th><em> MFLOP approximation </em></th>
</tr>
<tr>
<td>2^20</td>
<td>One</td>
<td>56.7</td>
<td>6405.9</td>
</tr>
<tr>
<td>2^20 + 1</td>
<td>One</td>
<td>554.6</td>
<td>655.3</td>
</tr>
<tr>
<td>2^20</td>
<td>Eight</td>
<td>53.3</td>
<td>6813.7</td>
</tr>
<tr>
<td>2^20 + 1</td>
<td>Eight</td>
<td>124.2</td>
<td>2925.3</td>
</tr>
</tbody>
</table>
<p>Power-of-two FFT&#8217;s are so computationally efficient on modern processors that the gain between one and eight cores is only about 3 seconds on a 2^20-point FFT. However, for the non-power-of-two case we get a 4.5 times speed improvement going from one core to eight. Looked at another way, with multi-core scalability of the FFT, we suffered only a 2X loss in performance going from a 2^20 length FFT to a 2^20+1 length FFT, instead of a 10X loss. In other words, the multi-core scalability of CenterSpace&#8217;s NMath FFT algorithms mitigates the performance loss of using non-power-of-2 lengths, simplifying the end-user programmer&#8217;s job.</p>
<p><em> -Paul </em></p>
<p>See our <a href="/topic-fast-fourier-transforms/">FFT landing page </a> for complete documentation and code examples.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/modern-fast-fourier-transform">Modern Fast Fourier Transform</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/modern-fast-fourier-transform/feed</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">226</post-id>	</item>
	</channel>
</rss>
