Blog

Archive for the ‘Performance’ Category

Clearing a vector

Wednesday, November 9th, 2011

A customer recently asked us for the best method to zero out a vector. We decided to run some tests to find out. Here are the five methods we tried, with any drawbacks and the timings.

These were performed on a DoubleVector, v,  of length 10,000,000.

1) Create a new vector. This isn’t really clearing out an existing vector but we thought we should include it for completeness.

 DoubleVector v2 = new DoubleVector( v.Length, 0.0 );

The big drawback here is that you’re creating new memory. Time: 419.5ms

2) Probably the first thing to come to mind is to simply iterate through the vector and set everything to zero.

for ( int i = 0; i < v.Length; i++ )
{
  v[i] = 0.0;
}

We have to do some checking in the index operator. No new memory created. Time: 578.5ms

3) In some cases, you could iterate through the underlying array of data inside the DoubleVector.

for ( int i = 0; i < v.DataBlock.Data.Length; i++ )
{
  v.DataBlock.Data[i] = 0.0;
}

This is a little less intuitive. And, very importantly, it will not work with many views into other data structures. For example, a row slice of a matrix. However, it’s easier for the CLR to optimize this loop. Time: 173.5ms

4) We can use the power of Intel’s MKL to multiply the vector by zero.

 v.Scale( 0.0 );

Scale() does this in-place. No new memory is created. In this example, we assume that MKL has already been loaded and is ready to go which is true if another MKL-based NMath call was already made or if NMath was initialized. This method will work on all views of other data structures. Time: 170ms

5) This surprised us a bit but the best method we could find was to clear out the underlying array using Array.Clear() in .NET

 Array.Clear( v.DataBlock.Data, 0, v.DataBlock.Data.Length );

This creates no new memory. However, this will not work with non-contiguous views. However, this method is very fast. Time: 85.8ms

To make this much simpler, we have created a Clear() method and a Clear( Slice) method on vectors and matrices. It will do the right thing in the right circumstance. It will be released in NMath 5.2 in 2012.

And, here is the code we used for this test:

using System;
using CenterSpace.NMath.Core;

namespace Test
{
  class ClearVector
  {
    static int size = 100000000;
    static int runs = 10;
    static int methods = 5;

    static void Main( string[] args )
    {
      System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
      DoubleMatrix times = new DoubleMatrix( runs, methods );
      NMathKernel.Init();

      for ( int run = 0; run < runs; run++ )
      {
        Console.WriteLine( "Run {0}...", run );
        DoubleVector v = null;

        // Create a new one
        v = new DoubleVector( size, 1.0, 2.0 );
        sw.Start();
        DoubleVector v2 = new DoubleVector( v.Length, 0.0 );
        sw.Stop();
        times[run, 0] = sw.ElapsedMilliseconds;
        Console.WriteLine( Assert( v2 ) );

        // iterate through vector
        v = new DoubleVector( size, 1.0, 2.0 );
        sw.Reset();
        sw.Start();
        for ( int i = 0; i < v.Length; i++ )
        {
          v[i] = 0.0;
        }
        sw.Stop();
        times[run, 1] = sw.ElapsedMilliseconds;
        Console.WriteLine( Assert( v ) );

        // iterate through array
        v = new DoubleVector( size, 1.0, 2.0 );
        sw.Reset();
        sw.Start();
        for ( int i = 0; i < v.DataBlock.Data.Length; i++ )
        {
          v.DataBlock.Data[i] = 0.0;
        }
        sw.Stop();
        times[run, 2] = sw.ElapsedMilliseconds;
        Console.WriteLine( Assert( v ) );

        // scale
        v = new DoubleVector( size, 1.0, 2.0 );
        sw.Reset();
        sw.Start();
        v.Scale( 0.0 );
        sw.Stop();
        times[run, 3] = sw.ElapsedMilliseconds;
        Console.WriteLine( Assert( v ) );

        // Array Clear
        v = new DoubleVector( size, 1.0, 2.0 );
        sw.Reset();
        sw.Start();
        Array.Clear( v.DataBlock.Data, 0, v.DataBlock.Data.Length );
        sw.Stop();
        times[run, 4] = sw.ElapsedMilliseconds;
        Console.WriteLine( Assert( v ) );
        Console.WriteLine( times.Row( run ) );
      }
      Console.WriteLine( "Means: " + NMathFunctions.Mean( times ) );
    }

    private static bool Assert( DoubleVector v )
    {
      if ( v.Length != size )
      {
        return false;
      }
      for ( int i = 0; i < v.Length; ++i )
      {
        if ( v[i] != 0.0 )
        {
          return false;
        }
      }
      return true;
    }
  }
}

- Trevor

Share

Initializing NMath

Wednesday, November 9th, 2011

NMath uses Intel’s Math Kernel Library (MKL) internally. This code contains native, optimized code to wring out the best performance possible.

There is a one-time delay when the appropriate x86 or x64 native code is loaded. This cost can be easily controlled by the developer by using the NMathKernel.Init() method. Please see Initializing NMath for more details.

- Trevor

Share

Forward Scaling Computing

Thursday, January 28th, 2010

Forward Scaling for Multicore Performance

The era of sequential, single-threaded software development deployed to a uniprocessor machine is rapidly fading into history. Nearly all computers sold today have at least two, if not four cores – and will have eight in the near future. Intel announced last month the successful production and testing of a new 48-core research processor which will be made available to industry and academia for research and development of manycore parallel software developer tools and languages.

Intel's 48-core processor

Intel's recently announced 48-core processor

In the near future users of high performance software in finance, bio-informatics, or GIS will expect their applications to scale with core count, and software that fails to do so will either need to be rewritten or abandoned. To future-proof performance-sensitive software, code written today needs to be multicore aware, and scale automatically to all available cores – this is the key idea behind forward scaling software. If Moore’s ‘law’ is to be sustained into the future, hardware scalability must be joined with a similar shift in software. This fundamental shift in computing and application development, termed the ‘Manycore Shift’ by Microsoft, is an evolutionary shift that software developers must appreciate and adapt to in order to create long-living scalable applications.

CenterSpace’s Forward Scaling Strategy

This project of creating forward scaling software can sound daunting, but for many application developers it can be reduced to choosing the right languages & libraries for the computationally demanding portions of their application. If an application’s performance-sensitive components are forward scaling, so goes the application. At CenterSpace we are very performance sensitive and are working to insure that our users benefit from forward scaling behavior. Linear scaling with core number cannot always be achieved, but in the numerical computational domain we can frequently come close to this ideal.

Here are some parallel computing technologies that we are looking at adopting to ensure we meet this goal.
(more…)

Share

High Performance Numerics in C#

Monday, December 21st, 2009

Recently a programmer on stackoverflow commented that the performance of NMath was “really amazing” and was wondering how we achieved that performance in the context of the .NET/C# framework/language pair. This blog post discusses how CenterSpace achieves such great performance in this memory managed framework. A future post will discuss where we are looking to gain even more performance.
(more…)

Share

Modern Fast Fourier Transform

Monday, September 28th, 2009

All variants of the original Cooley-Tukey O(n log n) fast Fourier transform fundamentally exploit different ways to factor the discrete Fourier summation of length N.




For example, the split-radix FFT algorithm divides the Fourier summation of length N into three new Fourier summations: one of length N/2 and two of length N/4.



The prime factor FFT, divides the Fourier summation of length N, into two (if they exist) summations of length N1 and N2, where N1 and N2 must be relatively prime.



These algorithms are typically applied recursively, and in combination with one another (or with still other factorizations) to maximize performance for a particular N.

In modern implementations there really isn’t a single static FFT algorithm, but more a dynamic collection of FFT algorithms and tools that are cleverly collated for the Fourier transform type at hand. Major algorithmic changes occur in the underlying implementation as the length and forward domain (real or complex) of the problem vary. Sophisticated FFT implementations insulate the end-user programmer from all of this background machinery.

DFT length is fundamental to performance

The days of power-of-2-only FFT algorithms are dead. Users of modern FFT libraries should not need to worry about the large complexities involved in finding the optimal algorithm for the FFT computation at hand; the library should look at the FFT length, problem domain (real or complex), number of machine cores, and machine architecture, and find and compute with the best hybridized FFT algorithm available. However, it is still helpful to understand that your realized performance will depend fundamentally on the various factorization of the length of your FFT. Most know that the best FFT performance will be had when N is a power of 2. If this stringent length requirement cannot be met, then it is best to use a length that be factored into small primes. CenterSpace’s FFT algorithms contain optimized kernels for prime factor lengths of 2, 3, 5, 7 and 11. The table below demonstrates the FFT performance sensitivity to FFT length.

Forward real 1D FFT performance at various lengths.
DFT Length Factors MFLOP approximation
512 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 5324.5
511 7 x 73 1327.8
510 2 x 3 x 5 x 17 3879.4
509 509 (prime) 1762.4
508 2 x 2 x 127 2637.6
507 3 x 13 x 13 2631.5
506 2 x 11 x 23 3938.3
505 5 x 101 1122.6
504 2 x 2 x 2 x 3 x 3 x 7 5227

Clearly the fastest FFT’s are for lengths that can be factored into small primes (512, 510, 507, 506, 504), and especially small primes that have optimized kernels (512 and 504). The more kernel optimized primes your FFT length contains the faster it will run. This is a universal fact that all FFT implementations confront and holds true for higher dimension FFT’s as well. Slight changes in length can have a profound impact on FFT performance.

You can factor your FFT length using an online service to assess how your FFT will perform.

Multi-core Scalability

The ability to factor a particular FFT into a set independent computations makes it fundamentally suitable for parallelization. All modern desktop and many laptop computers today contain at least two processor cores and any modern math library should be exploiting this fact where possible. CenterSpace’s complex domain FFT’s (and related convolutions) are multi-core aware, and automatically expand to fully utilize the available processor cores. Small problems are run on a single core, but once the computational advantages of algorithm parallelization overcome the overhead costs of multi-core parallelization, the computation is spread across all available cores. This automatic parallelization is gained simply by using CenterSpace’s NMath class libraries. No end-user programming effort is involved.

Forward complex 1D FFT performance on 1 and 8 cores.
FFT Length Machine Cores Time (seconds) MFLOP approximation
2^20 One 56.7 6405.9
2^20 + 1 One 554.6 655.3
2^20 Eight 53.3 6813.7
2^20 + 1 Eight 124.2 2925.3

The power of two FFT’s are so computationally efficient on modern processors that the gain between one and eight cores is only about 3 seconds on a 2^20-point FFT. However, for the non-power-of-two case we get a 4.5 times speed improvement going from one core to eight. Looked at another way, with multi-core scalability of the FFT, we suffered only a 2X loss in performance going from a 2^20 length FFT to a 2^20+1 length FFT, instead of a 10X loss in performance. In other words, the multi-core scalability of CenterSpace’s NMath FFT algorithms mitigate the performance loss in using non-power-of-2 lengths, and this simplifies the end-user programmer’s job.

-Paul

See our FFT landing page for complete documentation and code examples.

Share