FFT in .NET Archives - CenterSpace

FFT Performance Benchmarks in .NET

Paul Shirkey — Wed, 05 Jan 2011 20:07:46 +0000

We’ve had a number of inquiries about the CenterSpace FFT benchmarks, so I thought I would code up a few tests and run them on my machine. I’ve included our FFT performance numbers and the code that generated them so you can try them on your own machine. (If you don’t have NMath, you’ll need to download the eval version.) I also compared one-dimensional real DFTs against FFTW, one of the fastest desktop FFT implementations available.

Benchmarks

These benchmarks were run on a 2.80 GHz Intel Core i7 CPU with 4 GB of memory installed.

The clock resolution is 0.003 ns
1024 point, forward, real FFT required 4361.364 ns, Mflops 4069
1000 point, forward, real FFT required 5338.785 ns, Mflops 3235
4096 point, forward, real FFT required 21708.565 ns, Mflops 3924
4095 point, forward, real FFT required 43012.010 ns, Mflops 1980
1024 * 1024 point, forward, real FFT required 15.635 ms, Mflops 2324

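I’m estimating the megaflop performance during the FFT using:

Mflops = 2.5 * N * ln(N) / t

where N is the FFT length and t is the mean time per transform in microseconds. This is the expression the benchmark code below evaluates, and its logarithm is the natural log. The 2.5 N log N term is the asymptotic number of floating point operations for the radix-2 Cooley-Tukey FFT algorithm applied to real data. This FFT MFlop estimate is used in a number of FFT benchmark reports and serves as a good basis for comparing algorithm efficiency. Note that FFTW’s published benchmarks use a base-2 logarithm in the same expression, so figures on that scale would be about 1.44 times higher.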
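The Mflops column in the results above can be reproduced from the reported timings. The following is a minimal sketch that assumes the estimate is 2.5 * N * ln(N) floating point operations per transform divided by the time per transform in microseconds, which is what the benchmark code in this post computes. The class and method names are hypothetical and not part of NMath.

```csharp
using System;

public static class FftMflops
{
    // Estimated Mflops for one real FFT of length n that took
    // elapsedNanoseconds to run. Assumes 2.5 * n * ln(n) operations.
    public static double Estimate(int n, double elapsedNanoseconds)
    {
        double flopCount = 2.5 * n * Math.Log(n);           // natural log
        double microseconds = elapsedNanoseconds / 1000.0;  // ns -> us
        return flopCount / microseconds;                    // flops per us = Mflops
    }
}
```

For example, FftMflops.Estimate(1024, 4361.364) rounds to 4069 and FftMflops.Estimate(1000, 5338.785) rounds to 3235, matching the first two result lines.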

As expected, we take a performance hit for non-power-of-2 lengths, but thanks to optimized FFT kernels for the prime lengths 3, 5, 7 and 11, the hit is minimal in many cases. The 1000-point FFT has prime factors (2)(2)(2)(5)(5)(5), and the 4095-point FFT has prime factors (3)(3)(5)(7)(13), so the larger prime factors in the 4095-point FFT (notably 13, which is outside that kernel set) cost us some performance. Typically, users zero-pad their data vectors to a power-of-two length to get optimal performance.
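To check how a given FFT length factors, or to zero-pad a signal to the next power of two, something like the following could be used. It is a plain C# sketch with hypothetical helper names and no NMath dependency. Keep in mind that zero-padding changes the transform length and therefore the frequency spacing of the output bins.

```csharp
using System;
using System.Collections.Generic;

public static class FftLengthHelper
{
    // Prime factors of n in ascending order, found by trial division.
    public static List<int> PrimeFactors(int n)
    {
        if (n < 2) throw new ArgumentOutOfRangeException(nameof(n));
        var factors = new List<int>();
        for (int p = 2; (long)p * p <= n; p++)
        {
            while (n % p == 0)
            {
                factors.Add(p);
                n /= p;
            }
        }
        if (n > 1) factors.Add(n); // whatever remains is itself prime
        return factors;
    }

    // Smallest power of two that is greater than or equal to n.
    public static int NextPowerOfTwo(int n)
    {
        if (n < 1 || n > (1 << 30)) throw new ArgumentOutOfRangeException(nameof(n));
        int p = 1;
        while (p < n) p <<= 1;
        return p;
    }

    // Copy of the signal with zeros appended up to a power-of-two length.
    public static double[] ZeroPad(double[] signal)
    {
        var padded = new double[NextPowerOfTwo(signal.Length)];
        Array.Copy(signal, padded, signal.Length);
        return padded;
    }
}
```

PrimeFactors(1000) gives 2, 2, 2, 5, 5, 5 and PrimeFactors(4095) gives 3, 3, 5, 7, 13, as quoted above. NextPowerOfTwo(4095) is 4096, so padding a 4095-point signal adds a single zero.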

Side by side comparison with FFTW

FFTW claims to be the “Fastest Fourier Transform in the West”, and is a clever, high performance implementation of the discrete Fourier transform. It ships with all copies of MATLAB. FFTW is implemented in C and has a reputation as one of the fastest desktop FFT implementations.

Both the NMath FFT and FFTW have a pre-computation setup that establishes the best algorithmic approach for the DFT at hand before computing any FFTs. This pre-computation phase is not included in the times below. In the case of the NMath FFT classes, this phase is done in the class constructor; therefore, users should avoid constructing NMath FFT classes in tight loops for best performance (as shown in the benchmark code below). Below is a small side-by-side comparison between FFTW and NMath’s FFT (using the numbers from above).



Comparison of a forward, real, out-of-place FFT.
FFT length	FFTW	NMATH FFT
1024	4.14 μs	4.36 μs
1000	5.98 μs	5.33 μs
4096	20.31 μs	21.71 μs
4095	49.90 μs	43.01 μs
1024^2	17.16 ms	15.63 ms

Clearly NMath is very competitive with FFTW, and at times outperforms it, for real FFTs of both power-of-2 and other lengths. I chose 1D real signals as a test case because this is one of the most frequent use cases of our NMath FFT library.

On a subjective scale, running a 1024-point FFT on a commodity desktop machine at around 4 GFlops (algorithm-normalized) is amazing. In a real-time measurement situation, that means users can compute 1024-point FFTs at a rate of roughly 229 kHz (one transform every 4.36 μs), all with just a couple of lines of code.

Happy Computing,
Paul

Benchmark Code

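    // Requires: using System; using System.Diagnostics; using CenterSpace.NMath.Core;
    public void BenchMarks()
    {
      Double numberTrials = 10000;
      Double flops;

      Stopwatch timer = new System.Diagnostics.Stopwatch();
      // Note: Stopwatch.Frequency / 1e9 is ticks per nanosecond; the tick period in ns is 1e9 / Stopwatch.Frequency.
      Console.WriteLine( String.Format("The clock resolution is {0:0.000} ns", Stopwatch.Frequency / 1000000000.0 ) );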

      // Snip one - power of two
      RandGenUniform rand = new RandGenUniform();
      DoubleForward1DFFT fft = new DoubleForward1DFFT( 1024 );
      DoubleVector realsignal = new DoubleVector( 1024, rand );

      DoubleVector result = new DoubleVector( 1024 * 1024 );

      timer.Reset();
      for( int i = 0; i < numberTrials; i++ )
      {
        timer.Start();
        fft.FFT( realsignal, ref result );
        timer.Stop();
      }
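      // Mflops estimate: 2.5 * N * ln(N) operations divided by the mean time per FFT in microseconds.
      flops = ( 2.5 * 1024 * NMathFunctions.Log( 1024 ) ) / ( ( ( timer.ElapsedTicks / numberTrials ) / Stopwatch.Frequency ) * 1000000.0 );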
      Console.WriteLine( String.Format( "1024 point, forward, real FFT required {0:0.000} ns, Mflops {1:0}", ( ( timer.ElapsedTicks / numberTrials ) / Stopwatch.Frequency ) * 1000000000.0, flops ) );

      // length 1000
      fft = new DoubleForward1DFFT( 1000 );
      realsignal = new DoubleVector( 1000, rand );

      timer.Reset();
      for( int i = 0; i < numberTrials; i++ )
      {
        timer.Start();
        fft.FFT( realsignal, ref result );
        timer.Stop();
      }
      flops = ( 2.5 * 1000 * NMathFunctions.Log( 1000 ) ) / ( ( ( timer.ElapsedTicks / numberTrials ) / Stopwatch.Frequency ) * 1000000.0 );
      Console.WriteLine( String.Format( "1000 point, forward, real FFT required {0:0.000} ns, Mflops {1:0}", ( ( timer.ElapsedTicks / numberTrials ) / Stopwatch.Frequency ) * 1000000000.0, flops ) );

      // length 4096
      fft = new DoubleForward1DFFT( 4096 );
      realsignal = new DoubleVector( 4096, rand );

      timer.Reset();
      for( int i = 0; i < numberTrials; i++ )
      {
        timer.Start();
        fft.FFT( realsignal, ref result );
        timer.Stop();
      }
      flops = ( 2.5 * 4096 * NMathFunctions.Log( 4096 ) ) / ( ( ( timer.ElapsedTicks / numberTrials ) / Stopwatch.Frequency ) * 1000000.0 );
      Console.WriteLine( String.Format( "4096 point, forward, real FFT required {0:0.000} ns, Mflops {1:0}", ( ( timer.ElapsedTicks / numberTrials ) / Stopwatch.Frequency ) * 1000000000.0, flops ) );

      // length 4095
      fft = new DoubleForward1DFFT( 4095 );
      realsignal = new DoubleVector( 4095, rand );

      timer.Reset();
      for( int i = 0; i < numberTrials; i++ )
      {
        timer.Start();
        fft.FFT( realsignal, ref result );
        timer.Stop();
      }
      flops = ( 2.5 * 4095 * NMathFunctions.Log( 4095 ) ) / ( ( ( timer.ElapsedTicks / numberTrials ) / Stopwatch.Frequency ) * 1000000.0 );
      Console.WriteLine( String.Format( "4095 point, forward, real FFT required {0:0.000} ns, Mflops {1:0}", ( ( timer.ElapsedTicks / numberTrials ) / Stopwatch.Frequency ) * 1000000000.0, flops ) );


      // length 1M
      fft = new DoubleForward1DFFT( 1024 * 1024 );
      realsignal = new DoubleVector( 1024 * 1024, rand );

      timer.Reset();
      for( int i = 0; i < 100; i++ )
      {
        timer.Start();
        fft.FFT( realsignal, ref result );
        timer.Stop();
      }
      flops = ( 2.5 * 1024 * 1024 * NMathFunctions.Log( 1024 * 1024 ) ) / ( ( ( timer.ElapsedTicks / 100.0 ) / Stopwatch.Frequency ) * 1000000.0 );
      Console.WriteLine( String.Format( "Million point (1024 * 1024), forward, real point FFT required {0:0.000} ms, Mflops {1:0}", ( ( timer.ElapsedTicks / 100.0 ) / Stopwatch.Frequency ) * 1000.0, flops ) );

    }

The post FFT Performance Benchmarks in .NET appeared first on CenterSpace.

High Performance FFT in NMath 4.0

Paul Shirkey — Wed, 02 Sep 2009 23:41:45 +0000

The next release of CenterSpace’s NMath .NET libraries will contain high performance, multi-core aware, fast Fourier transform classes. This set of classes will support all common 1D and 2D FFT computations through a robust, easy-to-use, object-oriented interface.

The following FFT classes will be available.



DoubleComplexForward1DFFT
DoubleComplexBackward1DFFT
DoubleComplexForward2DFFT
DoubleComplexBackward2DFFT
DoubleForward1DFFT
DoubleSymmetricBackward1DFFT
DoubleForward2DFFT
DoubleGeneral1DFFT (for computing FFTs of data with offset and strided memory layouts)

All classes efficiently support FFTs of arbitrary length, with a simple interface for both in-place and out-of-place computations. Additionally, there is a parallel set of classes for single precision computation.

Example

Here is a simple example computing a 1000-point forward 1D FFT.

    // Create some random signal data.
    RandomNumberGenerator rand = new RandGenMTwist(427);
    DoubleVector data = new DoubleVector(1000, rand);

    // Create the 1D real FFT instance
    DoubleForward1DFFT fft1000 = new DoubleForward1DFFT(1000);

    // Compute the FFT
    fft1000.FFTInPlace(data);

The FFT of real (non-complex) data is a complex-conjugate symmetric signal. For memory efficiency, this is returned to the user in a packed format (making in-place computation possible). To facilitate unpacking this data, signal reader classes are supplied that support random-access indexers into the packed data. Continuing with the example above:

    // Ask the FFT instance for the correct reader,
    // passing in the FFT data.
    DoubleSymmetricSignalReader reader = fft1000.GetSignalReader(data);

    // Now we can access any element from the
    // packed complex-conjugate symmetric FFT data set
    // using common random-access index semantics.
    DoubleComplex thirdelement = reader[2];

    // Also the entire result can be unpacked
    DoubleComplex[] unpackedfft = reader.UnpackFullToArray();

The readers are not necessary for the complex versions of the FFT classes, because the FFT of complex data is itself complex, with no symmetry to exploit, so no data packing (and no memory saving) is possible.

Packing Format Notes

As mentioned above, the Fourier transform of a real signal results in a complex-conjugate symmetric signal. This symmetry is used by CenterSpace to pack the Fourier transform into an array which is the same size as the signal array.
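The symmetry itself is easy to verify numerically. The sketch below uses a naive O(N^2) DFT built on System.Numerics.Complex rather than any NMath class, and its names are hypothetical. It only illustrates why roughly half of the spectrum is redundant and says nothing about the actual packed layout NMath uses.

```csharp
using System;
using System.Numerics;

public static class ConjugateSymmetryDemo
{
    // Naive O(N^2) DFT of a real signal; for illustration only, not an FFT.
    public static Complex[] NaiveDft(double[] x)
    {
        int n = x.Length;
        var result = new Complex[n];
        for (int k = 0; k < n; k++)
        {
            Complex sum = Complex.Zero;
            for (int t = 0; t < n; t++)
            {
                double angle = -2.0 * Math.PI * k * t / n;
                sum += x[t] * Complex.FromPolarCoordinates(1.0, angle);
            }
            result[k] = sum;
        }
        return result;
    }

    // Largest deviation |X[N-k] - conj(X[k])| over k = 1..N-1.
    // For a real input signal this should be zero up to rounding error.
    public static double MaxSymmetryError(Complex[] spectrum)
    {
        int n = spectrum.Length;
        double worst = 0.0;
        for (int k = 1; k < n; k++)
        {
            Complex diff = spectrum[n - k] - Complex.Conjugate(spectrum[k]);
            worst = Math.Max(worst, Complex.Abs(diff));
        }
        return worst;
    }
}
```

Because X[N-k] equals the conjugate of X[k], and X[0] (plus X[N/2] when N is even) is purely real, the spectrum of an N-point real signal carries only N independent real numbers. That is why a packed result can occupy an array of the same length as the input.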

The following table describes the layout of the packed complex-conjugate symmetric signal, of length N, in one dimension.

For N even

For N odd

If we were to unroll the array, where each element in the array contains alternating real and complex values, for the case of N even, we would have an array of length 2*N.

The complexities of the packing in two dimensions increase substantially, and will not be recorded here. All NMath FFT users are encouraged to use the readers to unwind packed results. Not only does this reduce coding complexity, but if the underlying packing format changes, the readers will still provide the expected functionality.

Finally, when inverting complex-conjugate symmetric signals using the DoubleSymmetricBackward1DFFT class, the input signals are expected to be packed.

-Paul

The post High Performance FFT in NMath 4.0 appeared first on CenterSpace.