<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>

<channel>
	<title>Multicore FFT Archives - CenterSpace</title>
	<atom:link href="https://www.centerspace.net/tag/multicore-fft/feed" rel="self" type="application/rss+xml" />
	<link>https://www.centerspace.net/tag/multicore-fft</link>
	<description>.NET numerical class libraries</description>
	<lastBuildDate>Tue, 07 Feb 2023 21:48:41 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.1.1</generator>
<site xmlns="com-wordpress:feed-additions:1">104092929</site>	<item>
		<title>FFT Performance Benchmarks in .NET</title>
		<link>https://www.centerspace.net/fft-performance-benchmarks-in-net</link>
					<comments>https://www.centerspace.net/fft-performance-benchmarks-in-net#respond</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Wed, 05 Jan 2011 20:07:46 +0000</pubDate>
				<category><![CDATA[.NET]]></category>
		<category><![CDATA[NMath]]></category>
		<category><![CDATA[FFT]]></category>
		<category><![CDATA[FFT .NET benchmarks]]></category>
		<category><![CDATA[FFT benchmarks]]></category>
		<category><![CDATA[fft C#]]></category>
		<category><![CDATA[FFT in .NET]]></category>
		<category><![CDATA[Multicore FFT]]></category>
		<category><![CDATA[NMATH FFT and FFTW]]></category>
		<category><![CDATA[Non power of 2 FFT]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=2942</guid>

					<description><![CDATA[<p>We've had a number of inquiries about the CenterSpace FFT benchmarks, so I thought I would code up a few tests and run them on my machine.  I've included our FFT performance numbers and the code that generated those numbers so you can try them on your machine.  (If you don't have NMath, you'll need to download the <a href="https://www.centerspace.net/downloads/trial-versions/">eval version</a>).  I also did a head-to-head comparison with FFTW, one of the fastest desktop FFT implementations.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/fft-performance-benchmarks-in-net">FFT Performance Benchmarks in .NET</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>We&#8217;ve had a number of inquiries about the CenterSpace FFT benchmarks, so I thought I would code up a few tests and run them on my machine.  I&#8217;ve included our FFT performance numbers and the code that generated those numbers so you can try them on your machine.  (If you don&#8217;t have NMath, you&#8217;ll need to download the <a href="/trial-version/">eval version</a>).  I also did a comparison of 1-dimensional real DFTs with FFTW, one of the fastest desktop FFT implementations available.</p>
<h3> Benchmarks </h3>
<p>These benchmarks were run on a 2.80 GHz Intel Core i7 CPU with 4 GB of memory installed. </p>
<pre class="code">
The clock rate is 0.003 ticks per ns
1024 point, forward, real FFT required 4361.364 ns, Mflops 4069
1000 point, forward, real FFT required 5338.785 ns, Mflops 3235
4096 point, forward, real FFT required 21708.565 ns, Mflops 3924
4095 point, forward, real FFT required 43012.010 ns, Mflops 1980
1024 * 1024 point, forward, real FFT required 15.635 ms, Mflops 2324
</pre>
<p>I&#8217;m estimating the megaflop performance during the FFT using:<br />
<center><br />
<img decoding="async" src="http://latex.codecogs.com/gif.latex?MFlops \approx {2.5 \, n \, \ln(n) \over{ \textit{time in} \ \mu s} }" title="MFlops \approx {2.5 \, n \, \ln(n) \over{ \textit{time in} \ \mu s} }" /><br />
</center></p>
<p>This is the asymptotic number of floating point operations for the radix-2 Cooley-Tukey FFT algorithm. This FFT MFlop estimate is used in a number of FFT benchmark reports and serves as a good basis for comparing algorithm efficiency.</p>
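<p>As a quick sanity check (an illustrative Python sketch, not part of the benchmark itself), plugging the measured times from above into this estimate reproduces the reported megaflop numbers:</p>
<pre lang="python">
import math

def fft_mflops(n, time_us):
    # Asymptotic radix-2 flop count 2.5 * n * ln(n), over time in microseconds.
    return 2.5 * n * math.log(n) / time_us

print(round(fft_mflops(1024, 4.361364)))   # 1024-point at 4361.364 ns -> 4069
print(round(fft_mflops(4095, 43.012010)))  # 4095-point at 43012.010 ns -> 1980
</pre>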
<p>As expected, we take a performance hit for non-power-of-2 lengths, but due to various optimizations for processing prime-length FFT kernels (3, 5, 7 &#038; 11), the performance hit is minimal in many cases. The 1000-point FFT has prime factors <code>(2)(2)(2)(5)(5)(5)</code>, while the 4095-point FFT has prime factors <code>(3)(3)(5)(7)(13)</code>, so the larger prime factors in the 4095-point FFT cost us some performance.  Typically, users zero-pad their data vectors to a power-of-two length to get optimal performance.</p>
<h3> Side by side comparison with FFTW </h3>
<p>FFTW claims to be the &#8220;Fastest Fourier Transform in the West&#8221;, and is a clever, high-performance implementation of the discrete Fourier transform.  This library ships with all copies of MATLAB.  FFTW is implemented in C and has the reputation of being one of the fastest desktop FFT implementations.  </p>
<p>Both the NMath FFT and FFTW have a pre-computation setup phase that establishes the best algorithmic approach for the DFT at hand, before computing any FFTs.  This pre-computation phase is not included in the times below.  In the case of the NMath FFT classes, this phase is done in the class constructor, so for best performance users must avoid constructing NMath FFT classes in tight loops (as shown in the benchmark code below).  Below is a small side-by-side comparison between FFTW and NMath&#8217;s FFT (using the numbers from above).</p>
<table><tbody>
<tr>
<th colspan="3"> Comparison of a forward, real, out-of-place FFT. </th>
</tr>
<tr>
<th> FFT length </th> <th> FFTW </th> <th> NMath FFT </th>
</tr>
<tr>
<td> 1024 </td> <td> 4.14 &mu;s </td> <td> 4.36 &mu;s </td>
</tr>
<tr>
<td> 1000 </td> <td> 5.98 &mu;s </td> <td> 5.33 &mu;s </td>
</tr>
<tr>
<td> 4096 </td> <td> 20.31 &mu;s </td> <td> 21.71 &mu;s </td>
</tr>
<tr>
<td> 4095 </td> <td> 49.90 &mu;s </td> <td> 43.01 &mu;s </td>
</tr>
<tr>
<td> 1024^2 </td> <td> 17.16 ms </td> <td> 15.63 ms </td>
</tr>
</tbody>
</table>
<p>Clearly NMath is very competitive with, and at times out-performs, FFTW for real FFTs of both power-of-2 lengths and otherwise.  I chose 1D real signals as a test case because this is one of the most frequent use cases of our NMath FFT library. </p>
<p>On a subjective scale, running a 1024-point FFT on a commodity desktop machine at around (an algorithm-normalized) 4 GFlops is amazing.  That means that in a real-time measurement situation, users can compute 1024-point FFTs at around 220 kHz &#8211; all with just a couple of lines of code.</p>
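<p>For the curious, that throughput figure follows directly from the measured time per transform (a quick back-of-the-envelope check in Python, not additional benchmark data):</p>
<pre lang="python">
# 1024-point forward real FFT measured at 4361.364 ns per transform.
time_s = 4361.364e-9
rate_khz = 1.0 / time_s / 1000.0
print(round(rate_khz))  # ~229 kHz peak; ~220 kHz once loop overhead is allowed for
</pre>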
<p>Happy Computing,<br />
<em> Paul </em></p>
<h3> Benchmark Code </h3>
<pre lang="csharp">
// Assumes: using System; using System.Diagnostics;
// and using CenterSpace.NMath.Core; for the NMath types.
public void BenchMarks()
{
  // Stopwatch.Frequency is ticks per second; report the timer's tick rate.
  Console.WriteLine( "The clock rate is {0:0.000} ticks per ns",
                     Stopwatch.Frequency / 1000000000.0 );

  RandGenUniform rand = new RandGenUniform();

  RunBenchmark( 1024, 10000, rand );        // power of two
  RunBenchmark( 1000, 10000, rand );        // factors (2)(2)(2)(5)(5)(5)
  RunBenchmark( 4096, 10000, rand );        // power of two
  RunBenchmark( 4095, 10000, rand );        // factors (3)(3)(5)(7)(13)
  RunBenchmark( 1024 * 1024, 100, rand );   // million point
}

private static void RunBenchmark( int n, int numberTrials, RandGenUniform rand )
{
  // Construct the FFT outside the timing loop - the constructor performs
  // the expensive plan pre-computation discussed above.
  DoubleForward1DFFT fft = new DoubleForward1DFFT( n );
  DoubleVector realSignal = new DoubleVector( n, rand );
  DoubleVector result = new DoubleVector( n );

  Stopwatch timer = new Stopwatch();
  for ( int i = 0; i < numberTrials; i++ )
  {
    timer.Start();
    fft.FFT( realSignal, ref result );
    timer.Stop();
  }

  // Average seconds per transform.
  double seconds = timer.ElapsedTicks / (double) numberTrials / Stopwatch.Frequency;

  // MFlops ~ 2.5 * n * ln(n) / (time in microseconds).
  double mflops = 2.5 * n * Math.Log( n ) / ( seconds * 1000000.0 );

  Console.WriteLine( "{0} point, forward, real FFT required {1:0.000} ns, Mflops {2:0}",
                     n, seconds * 1000000000.0, mflops );
}
</pre>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/fft-performance-benchmarks-in-net">FFT Performance Benchmarks in .NET</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/fft-performance-benchmarks-in-net/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2942</post-id>	</item>
		<item>
		<title>Modern Fast Fourier Transform</title>
		<link>https://www.centerspace.net/modern-fast-fourier-transform</link>
					<comments>https://www.centerspace.net/modern-fast-fourier-transform#comments</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Tue, 29 Sep 2009 05:32:46 +0000</pubDate>
				<category><![CDATA[NMath]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[FFT]]></category>
		<category><![CDATA[FFT performance]]></category>
		<category><![CDATA[High performance FFT]]></category>
		<category><![CDATA[Multicore FFT]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=226</guid>

					<description><![CDATA[<p>All variants of the original Cooley-Tukey O(n log n) fast Fourier transform fundamentally exploit different ways to factor the discrete Fourier summation of length N. For example, the split-radix FFT algorithm divides the Fourier summation of length N into three new Fourier summations: one of length N/2 and two of length N/4. The prime factor [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/modern-fast-fourier-transform">Modern Fast Fourier Transform</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>All variants of the original Cooley-Tukey O(n log n) fast Fourier transform fundamentally exploit different ways to factor the discrete Fourier summation of length N.</p>
<p><center><br />
<a href="http://www.codecogs.com/eqnedit.php?latex=X_k = \sum_{n=0}^{N-1} x_n e^{(-2 \pi i / N) kn} \ \ \ \ \ k = 0, ... ,N-1" target="_blank" rel="noopener"><img decoding="async" title="X_k = \sum_{n=0}^{N-1} x_n e^{(-2 \pi i / N) kn} \ \ \ \ \ k = 0, ... ,N-1" src="http://latex.codecogs.com/gif.latex?X_k = \sum_{n=0}^{N-1} x_n e^{(-2 \pi i / N) kn} \ \ \ \ \ k = 0, ... ,N-1" alt="" /></a></center><br />
For example, the <em>split-radix FFT</em> algorithm divides the Fourier summation of length N into three new Fourier summations: one of length N/2 and two of length N/4.</p>
<p><center><br />
<a href="http://www.codecogs.com/eqnedit.php?latex=X_{k_N} = X_{k_{N/2}} @plus; X_{k_{N/4}} @plus; X_{k_{N/4}}" target="_blank" rel="noopener"><img decoding="async" title="X_{k_N} = X_{k_{N/2}} + X_{k_{N/4}} + X_{k_{N/4}}" src="http://latex.codecogs.com/gif.latex?X_{k_N} = X_{k_{N/2}} + X_{k_{N/4}} + X_{k_{N/4}}" alt="" /></a></center><br />
The <em>prime factor FFT</em> divides the Fourier summation of length N into two summations (if they exist) of length N1 and N2, where N1 and N2 must be relatively prime.</p>
<p><center><br />
<a href="http://www.codecogs.com/eqnedit.php?latex=X_{k_N} = X_{k_{N1}} ( X_{k_{N2}} ) \ \ where \ N1 \perp N2" target="_blank" rel="noopener"><img decoding="async" title="X_{k_N} = X_{k_{N1}} ( X_{k_{N2}} ) \ \ where \ N1 \perp N2" src="http://latex.codecogs.com/gif.latex?X_{k_N} = X_{k_{N1}} ( X_{k_{N2}} ) \ \ where \ N1 \perp N2" alt="" /></a></center><br />
These algorithms are typically applied recursively, and in combination with one another (or with still other factorizations) to maximize performance for a particular N.</p>
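<p>The radix-2 special case of Cooley-Tukey makes this recursive factorization concrete. The sketch below (illustrative Python, not CenterSpace library code) splits each length-N summation into two length-N/2 summations over the even- and odd-indexed samples, and agrees with the naive O(N^2) summation:</p>
<pre lang="python">
import cmath

def dft(x):
    # Naive O(N^2) evaluation of the Fourier summation X_k.
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def fft(x):
    # Radix-2 Cooley-Tukey: factor a length-N sum into two length-N/2 sums.
    N = len(x)
    if N == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    twiddled = [cmath.exp(-2j * cmath.pi * k / N) * odd[k] for k in range(N // 2)]
    return ([even[k] + twiddled[k] for k in range(N // 2)] +
            [even[k] - twiddled[k] for k in range(N // 2)])

x = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, -3.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft(x), dft(x)))
</pre>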
<p>In modern implementations there really isn&#8217;t a single static FFT algorithm, but rather a dynamic collection of FFT algorithms and tools that are cleverly collated for the Fourier transform type at hand. Major algorithmic changes occur in the underlying implementation as the length and forward domain (real or complex) of the problem vary. Sophisticated FFT implementations insulate the end-user programmer from all of this background machinery.</p>
<h5>DFT length is fundamental to performance</h5>
<p>The days of power-of-2-only FFT algorithms are dead. Users of modern FFT libraries should not need to worry about the large complexities involved in finding the optimal algorithm for the FFT computation at hand; the library should look at the FFT length, problem domain (real or complex), number of machine cores, and machine architecture, and find and compute with the best hybridized FFT algorithm available. However, it is still helpful to understand that your realized performance will depend fundamentally on the factorization of the length of your FFT. Most developers know that the best FFT performance is had when N is a power of 2. If this stringent length requirement cannot be met, it is best to use a length that can be factored into small primes. CenterSpace&#8217;s FFT algorithms contain optimized kernels for prime factors of 2, 3, 5, 7 and 11. The table below demonstrates the FFT performance sensitivity to FFT length.</p>
<table border="0" cellpadding="4">
<caption>Forward real 1D FFT performance at various lengths.</caption>
<tbody>
<tr align="center">
<td><em> DFT Length </em></td>
<td><em>Factors </em></td>
<td><em>MFLOP approximation </em></td>
</tr>
<tr>
<td>512</td>
<td>2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2</td>
<td>5324.5</td>
</tr>
<tr>
<td>511</td>
<td>7 x 73</td>
<td>1327.8</td>
</tr>
<tr>
<td>510</td>
<td>2 x 3 x 5 x 17</td>
<td>3879.4</td>
</tr>
<tr>
<td>509</td>
<td>509 (prime)</td>
<td>1762.4</td>
</tr>
<tr>
<td>508</td>
<td>2 x 2 x 127</td>
<td>2637.6</td>
</tr>
<tr>
<td>507</td>
<td>3 x 13 x 13</td>
<td>2631.5</td>
</tr>
<tr>
<td>506</td>
<td>2 x 11 x 23</td>
<td>3938.3</td>
</tr>
<tr>
<td>505</td>
<td>5 x 101</td>
<td>1122.6</td>
</tr>
<tr>
<td>504</td>
<td>2 x 2 x 2 x 3 x 3 x 7</td>
<td>5227</td>
</tr>
</tbody>
</table>
<p>Clearly the fastest FFTs are for lengths that can be factored into small primes (512, 510, 507, 506, 504), and especially small primes that have optimized kernels (512 and 504). The more kernel-optimized primes your FFT length contains, the faster it will run. This is a universal fact that all FFT implementations confront, and it holds true for higher-dimension FFTs as well. <em> Slight changes in length can have a profound impact on FFT performance</em>.</p>
<p>You can factor your FFT length using an online service to assess how your FFT will perform.</p>
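<p>A few lines of code do the job as well; this trial-division sketch (illustrative Python, not part of NMath) reproduces the factorizations in the table above:</p>
<pre lang="python">
def prime_factors(n):
    # Trial division; fast enough for any realistic FFT length.
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

print(prime_factors(511))  # [7, 73]            -- large factors, slow
print(prime_factors(509))  # [509]              -- prime length, slowest
print(prime_factors(504))  # [2, 2, 2, 3, 3, 7] -- small optimized kernels, fast
</pre>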
<h5>Multi-core Scalability</h5>
<p>The ability to factor a particular FFT into a set of independent computations makes it fundamentally suitable for parallelization. All modern desktop and many laptop computers today contain at least two processor cores, and any modern math library should exploit this fact where possible. CenterSpace&#8217;s complex-domain FFTs (and related convolutions) are multi-core aware, and automatically expand to fully utilize the available processor cores. Small problems are run on a single core, but once the computational advantages of algorithm parallelization overcome the overhead costs of multi-core parallelization, the computation is spread across all available cores. This automatic parallelization is gained simply by using CenterSpace&#8217;s NMath class libraries. No end-user programming effort is involved.</p>
<table border="0" cellpadding="6">
<caption>Forward complex 1D FFT performance on 1 and 8 cores.</caption>
<tbody>
<tr align="center">
<th><em> FFT Length </em></th>
<th><em> Machine Cores </em></th>
<th><em> Time (seconds) </em></th>
<th><em> MFLOP approximation </em></th>
</tr>
<tr>
<td>2^20</td>
<td>One</td>
<td>56.7</td>
<td>6405.9</td>
</tr>
<tr>
<td>2^20 + 1</td>
<td>One</td>
<td>554.6</td>
<td>655.3</td>
</tr>
<tr>
<td>2^20</td>
<td>Eight</td>
<td>53.3</td>
<td>6813.7</td>
</tr>
<tr>
<td>2^20 + 1</td>
<td>Eight</td>
<td>124.2</td>
<td>2925.3</td>
</tr>
</tbody>
</table>
<p>The power-of-two FFTs are so computationally efficient on modern processors that the gain between one and eight cores is only about 3 seconds on a 2^20-point FFT. However, for the non-power-of-two case we get a 4.5-times speed improvement going from one core to eight. Looked at another way, with multi-core scalability of the FFT, we suffered only a 2X loss in performance going from a 2^20-length FFT to a 2^20+1-length FFT, instead of a 10X loss. In other words, the multi-core scalability of CenterSpace&#8217;s NMath FFT algorithms mitigates the performance loss of non-power-of-2 lengths, which simplifies the end-user programmer&#8217;s job.</p>
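<p>The ratios quoted above come straight from the table (a quick Python check, using the table's times in the same units throughout):</p>
<pre lang="python">
t1_pow2, t8_pow2 = 56.7, 53.3    # 2^20 on one and eight cores
t1_odd,  t8_odd  = 554.6, 124.2  # 2^20 + 1 on one and eight cores

print(round(t1_odd / t8_odd, 1))   # ~4.5x speedup from eight cores
print(round(t1_odd / t1_pow2, 1))  # ~9.8x penalty for 2^20 + 1 on one core
print(round(t8_odd / t8_pow2, 1))  # ~2.3x penalty on eight cores
</pre>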
<p><em> -Paul </em></p>
<p>See our <a href="/topic-fast-fourier-transforms/">FFT landing page </a> for complete documentation and code examples.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/modern-fast-fourier-transform">Modern Fast Fourier Transform</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/modern-fast-fourier-transform/feed</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">226</post-id>	</item>
	</channel>
</rss>
