FFT Performance Benchmarks in .NET

We’ve had a number of inquiries about the CenterSpace FFT benchmarks, so I coded up a few tests and ran them on my machine. I’ve included our FFT performance numbers and the code that generated them so you can run the same tests on your machine. (If you don’t have NMath, you’ll need to download the eval version.) I also compared one-dimensional real DFTs against FFTW, one of the fastest desktop FFT implementations available.

Benchmarks

These benchmarks were run on a 2.80 GHz Intel Core i7 CPU with 4 GB of memory installed.

The clock resolution is 0.003 ns
1024 point, forward, real FFT required 4361.364 ns, Mflops 4069
1000 point, forward, real FFT required 5338.785 ns, Mflops 3235
4096 point, forward, real FFT required 21708.565 ns, Mflops 3924
4095 point, forward, real FFT required 43012.010 ns, Mflops 1980
1024 * 1024 point, forward, real FFT required 15.635 ms, Mflops 2324

I’m estimating the megaflop performance during the FFT using:

$\text{MFlops} \approx \dfrac{2.5 \, n \ln(n)}{\text{time in } \mu\text{s}}$

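This estimate is based on the asymptotic floating point operation count of the radix-2 Cooley-Tukey FFT algorithm. Here $\ln(n)$ is the natural logarithm, as computed in the benchmark code below, so figures from reports that use $\log_2(n)$ instead are larger by a factor of about 1.44. This style of FFT MFlop estimate is used in a number of FFT benchmark reports and serves as a good basis for comparing algorithm efficiency.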
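As a quick check of the formula above, here is a minimal sketch of the estimate as a standalone function. The class and method names (`FftFlops`, `EstimateMflops`) are hypothetical and not part of NMath; the only assumption is that the time is supplied in microseconds.

```csharp
using System;

// Hypothetical helper (not part of NMath): the MFlops estimate used in this post.
public static class FftFlops
{
    // Returns 2.5 * n * ln(n) / (elapsed time in microseconds).
    // Math.Log is the natural logarithm, matching the benchmark code.
    public static double EstimateMflops(int n, double elapsedMicroseconds)
    {
        if (n < 2)
            throw new ArgumentOutOfRangeException(nameof(n));
        if (elapsedMicroseconds <= 0.0)
            throw new ArgumentOutOfRangeException(nameof(elapsedMicroseconds));

        return 2.5 * n * Math.Log(n) / elapsedMicroseconds;
    }
}
```

For example, `FftFlops.EstimateMflops(1024, 4.361364)` evaluates to roughly 4069, which agrees with the 1024-point line in the benchmark output.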

As expected, we take a performance hit for non-power-of-2 lengths, but thanks to various optimizations for small prime-length FFT kernels (3, 5, 7, and 11), the hit is minimal in many cases. The 1000-point FFT has prime factors (2)(2)(2)(5)(5)(5), and the 4095-point FFT has prime factors (3)(3)(5)(7)(13), so the larger prime factor 13 in the 4095-point FFT costs us some performance. Typically, users zero-pad their data vectors to a power-of-two length to get optimal performance.
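To see which lengths fall on the fast path, it can help to factor the length and, if needed, pad up to the next power of two. The sketch below is only an illustration in plain C#; `FftLength` and its methods are hypothetical names, it works on `double[]` rather than NMath vectors, and it makes no claim about how NMath chooses its kernels internally.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helpers (not part of NMath) for inspecting and padding FFT lengths.
public static class FftLength
{
    // Prime factors of n in ascending order, found by trial division.
    public static List<int> PrimeFactors(int n)
    {
        if (n < 2)
            throw new ArgumentOutOfRangeException(nameof(n));

        var factors = new List<int>();
        for (int p = 2; (long)p * p <= n; p++)
        {
            while (n % p == 0)
            {
                factors.Add(p);
                n /= p;
            }
        }
        if (n > 1)
            factors.Add(n); // whatever remains is itself prime
        return factors;
    }

    // Smallest power of two that is greater than or equal to n.
    public static int NextPowerOfTwo(int n)
    {
        if (n < 1 || n > (1 << 30))
            throw new ArgumentOutOfRangeException(nameof(n));

        int p = 1;
        while (p < n)
            p <<= 1;
        return p;
    }

    // Copies the signal into a zero-filled array of power-of-two length.
    public static double[] ZeroPad(double[] signal)
    {
        if (signal == null)
            throw new ArgumentNullException(nameof(signal));

        var padded = new double[NextPowerOfTwo(signal.Length)];
        Array.Copy(signal, padded, signal.Length);
        return padded;
    }
}
```

`FftLength.PrimeFactors(4095)` returns 3, 3, 5, 7, 13, and `FftLength.ZeroPad` turns a 1000-sample signal into a 1024-sample one. Keep in mind that padding changes the number of frequency bins in the result, so it is a trade-off rather than a free speedup.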

Side by side comparison with FFTW

FFTW, the “Fastest Fourier Transform in the West”, is a clever, high-performance implementation of the discrete Fourier transform. It ships with all copies of MATLAB. FFTW is implemented in C and has a reputation as one of the fastest desktop FFT implementations.

Both the NMath FFT and FFTW have a pre-computation setup that establishes the best algorithmic approach for the DFT at hand before computing any FFTs. This pre-computation phase is not included in the times below. In the NMath FFT classes, this phase is done in the class constructor; therefore, for best performance, users must avoid constructing NMath FFT classes in tight loops (as shown in the benchmark code below). Below is a small side-by-side comparison between FFTW and NMath’s FFT (using the numbers from above).



Comparison of a forward, real, out-of-place FFT.

| FFT length | FFTW | NMath FFT |
|---|---|---|
| 1024 | 4.14 μs | 4.36 μs |
| 1000 | 5.98 μs | 5.33 μs |
| 4096 | 20.31 μs | 21.71 μs |
| 4095 | 49.90 μs | 43.01 μs |
| 1024^2 | 17.16 ms | 15.63 ms |

Clearly, NMath is very competitive with FFTW, and at times outperforms it, for real FFTs of both power-of-2 lengths and other lengths. I chose 1D real signals as a test case because this is one of the most frequent use cases of our NMath FFT library.

On a subjective scale, running a 1024-point FFT on a commodity desktop machine at around 4 GFlops (algorithm-normalized) is amazing. It means that in a real-time measurement situation, users can compute 1024-point FFTs at a rate of roughly 230 kHz (one transform every 4.36 μs), all with just a couple of lines of code.

Happy Computing,
Paul

Benchmark Code

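    // Requires: using System; using System.Diagnostics;
    // plus the NMath namespace (assumed here to be CenterSpace.NMath.Core; it may differ by NMath version).
    public void BenchMarks()
    {
      Double numberTrials = 10000;
      Double flops;

      Stopwatch timer = new System.Diagnostics.Stopwatch();
      // Note: Stopwatch.Frequency / 1e9 is the number of ticks per nanosecond;
      // the length of one tick in ns would be 1e9 / Stopwatch.Frequency.
      Console.WriteLine( String.Format("The clock resolution is {0:0.000} ns", Stopwatch.Frequency / 1000000000.0 ) );

      // Snip one - power of two
      // The FFT object is constructed once, outside the timing loop, so setup cost is not measured.
      RandGenUniform rand = new RandGenUniform();
      DoubleForward1DFFT fft = new DoubleForward1DFFT( 1024 );
      DoubleVector realsignal = new DoubleVector( 1024, rand );

      // Output buffer sized for the largest transform below and reused by every test.
      DoubleVector result = new DoubleVector( 1024 * 1024 );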

      timer.Reset();
      for( int i = 0; i < numberTrials; i++ )
      {
        timer.Start();
        fft.FFT( realsignal, ref result );
        timer.Stop();
      }
      flops = (2.5 * 1024 * NMathFunctions.Log(1024)) / (((timer.ElapsedTicks / numberTrials) / Stopwatch.Frequency) * 1000000.0 );
      Console.WriteLine( String.Format( "1024 point, forward, real FFT required {0:0.000} ns, Mflops {1:0}", ( ( timer.ElapsedTicks / numberTrials ) / Stopwatch.Frequency ) * 1000000000.0, flops ) );

      // length 1000
      fft = new DoubleForward1DFFT( 1000 );
      realsignal = new DoubleVector( 1000, rand );

      timer.Reset();
      for( int i = 0; i < numberTrials; i++ )
      {
        timer.Start();
        fft.FFT( realsignal, ref result );
        timer.Stop();
      }
      flops = ( 2.5 * 1000 * NMathFunctions.Log( 1000 ) ) / ( ( ( timer.ElapsedTicks / numberTrials ) / Stopwatch.Frequency ) * 1000000.0 );
      Console.WriteLine( String.Format( "1000 point, forward, real FFT required {0:0.000} ns, Mflops {1:0}", ( ( timer.ElapsedTicks / numberTrials ) / Stopwatch.Frequency ) * 1000000000.0, flops ) );

      // length 4096
      fft = new DoubleForward1DFFT( 4096 );
      realsignal = new DoubleVector( 4096, rand );

      timer.Reset();
      for( int i = 0; i < numberTrials; i++ )
      {
        timer.Start();
        fft.FFT( realsignal, ref result );
        timer.Stop();
      }
      flops = ( 2.5 * 4096 * NMathFunctions.Log( 4096 ) ) / ( ( ( timer.ElapsedTicks / numberTrials ) / Stopwatch.Frequency ) * 1000000.0 );
      Console.WriteLine( String.Format( "4096 point, forward, real FFT required {0:0.000} ns, Mflops {1:0}", ( ( timer.ElapsedTicks / numberTrials ) / Stopwatch.Frequency ) * 1000000000.0, flops ) );

      // length 4095
      fft = new DoubleForward1DFFT( 4095 );
      realsignal = new DoubleVector( 4095, rand );

      timer.Reset();
      for( int i = 0; i < numberTrials; i++ )
      {
        timer.Start();
        fft.FFT( realsignal, ref result );
        timer.Stop();
      }
      flops = ( 2.5 * 4095 * NMathFunctions.Log( 4095 ) ) / ( ( ( timer.ElapsedTicks / numberTrials ) / Stopwatch.Frequency ) * 1000000.0 );
      Console.WriteLine( String.Format( "4095 point, forward, real FFT required {0:0.000} ns, Mflops {1:0}", ( ( timer.ElapsedTicks / numberTrials ) / Stopwatch.Frequency ) * 1000000000.0, flops ) );


      // length 1M
      fft = new DoubleForward1DFFT( 1024 * 1024 );
      realsignal = new DoubleVector( 1024 * 1024, rand );

      timer.Reset();
      for( int i = 0; i < 100; i++ )
      {
        timer.Start();
        fft.FFT( realsignal, ref result );
        timer.Stop();
      }
      flops = ( 2.5 * 1024 * 1024 * NMathFunctions.Log( 1024 * 1024 ) ) / ( ( ( timer.ElapsedTicks / 100.0 ) / Stopwatch.Frequency ) * 1000000.0 );
      Console.WriteLine( String.Format( "Million point (1024 * 1024), forward, real point FFT required {0:0.000} ms, Mflops {1:0}", ( ( timer.ElapsedTicks / 100.0 ) / Stopwatch.Frequency ) * 1000.0, flops ) );

    }
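The five timing blocks above all follow the same pattern, so one way to shorten the listing would be a small helper that times any delegate. This is only a sketch under my own assumptions: `BenchTimer` and `MeanMicroseconds` are hypothetical names, and the untimed warm-up call is an addition that the original benchmark does not make.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not part of NMath): times a delegate with Stopwatch.
public static class BenchTimer
{
    // Runs `work` `trials` times and returns the mean time per call in microseconds.
    public static double MeanMicroseconds(Action work, int trials)
    {
        if (work == null)
            throw new ArgumentNullException(nameof(work));
        if (trials < 1)
            throw new ArgumentOutOfRangeException(nameof(trials));

        work(); // untimed warm-up call so JIT compilation is not measured

        var timer = new Stopwatch();
        for (int i = 0; i < trials; i++)
        {
            timer.Start();
            work();
            timer.Stop();
        }

        // ElapsedTicks / Frequency gives seconds; scale to microseconds.
        return (double)timer.ElapsedTicks / trials / Stopwatch.Frequency * 1e6;
    }
}
```

With the objects from the benchmark above, a call would look like `BenchTimer.MeanMicroseconds(() => fft.FFT(realsignal, ref result), 10000)`. Because of the warm-up call and the delegate overhead, timings from this helper will not match the published numbers exactly.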
