High Performance Numerics in C#
Recently a programmer on Stack Overflow commented that the performance of NMath was “really amazing” and wondered how we achieved that performance in the context of the .NET/C# framework/language pair. This blog post discusses how CenterSpace achieves such great performance in this memory-managed framework. A future post will discuss where we are looking to gain even more performance.
1. C# is Fast, Memory Allocation Is Not
CenterSpace libraries never allocate memory unless absolutely necessary, and we provide an API that doesn’t force users to allocate memory unnecessarily. For example, where appropriate, nearly all classes provide two method signatures for each computational operation: one that returns the result in a caller-provided vector passed by reference, and one that allocates and returns a new vector.
    Double1DConvolution conv = new Double1DConvolution(kernel, 256);

    // Allocates and returns the result in a new vector.
    DoubleVector result = conv.Convolve(data);

    // Returns the result in the provided vector.
    conv.Convolve(data, ref result);
In a loop, the latter is far superior if the result vector is reused. The former is fine for a one-off result, and is a convenient method signature for the API user. Inexperienced C# programmers often complain that their applications suffer from poor performance, and frequently the root of the issue lies not in the language itself but in poor memory allocation and reuse practices. Languages that offer garbage collection services are easy to abuse in this way.
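The convolution example above follows this pattern; here is a minimal, self-contained sketch of the same idea using plain arrays, where the `Scale` function is a hypothetical stand-in for any computational kernel that can write into a caller-provided buffer:

```csharp
using System;

// Hypothetical stand-in for a computation that writes its result into a
// caller-provided buffer instead of allocating a new one on each call.
static void Scale(double[] input, double factor, double[] result)
{
    for (int i = 0; i < input.Length; i++)
        result[i] = input[i] * factor;
}

double[] data = new double[4096];
double[] result = new double[4096]; // allocated once, outside the loop

for (int iter = 0; iter < 1000; iter++)
{
    // The same result buffer is reused on every pass, so the loop creates
    // no per-iteration garbage for the collector to clean up.
    Scale(data, 1.0001, result);
}
Console.WriteLine(result.Length); // prints "4096"
```

Allocating `result` inside the loop instead would produce 1000 short-lived arrays and correspondingly more garbage-collector work.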
2. Precision – Ability to Use Just What You Need
Frequently programmers do all of their computation using double precision math. If 7 digits of precision are all you need, using strictly single precision algorithms will vastly improve performance. Below is a table comparing double and single precision FFTs computed using NMath.
| FFT Length | Double Precision (ns) | Single Precision (ns) | Performance Gain |
Clearly, if the extra precision is not necessary, the performance gain from switching from double to single precision is considerable (not to mention the memory savings in data storage). NMath provides both single and double precision options for nearly every class.
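A quick way to see the precision difference is to compute each type's machine epsilon, the gap between 1.0 and the next representable value. This sketch uses only the standard library; `float` carries roughly 7 significant decimal digits and half the storage of `double`, which carries roughly 15-16:

```csharp
using System;

// Machine epsilon of float: halve a candidate until 1 + eps/2 rounds to 1.
// The cast to float forces rounding to single precision on every iteration.
float epsF = 1.0f;
while ((float)(1.0f + epsF / 2.0f) > 1.0f)
    epsF /= 2.0f;

// Same procedure in double precision.
double epsD = 1.0;
while (1.0 + epsD / 2.0 > 1.0)
    epsD /= 2.0;

Console.WriteLine(epsF); // ~1.19e-07, i.e. about 7 significant digits
Console.WriteLine(epsD); // ~2.22e-16, i.e. about 16 significant digits
Console.WriteLine(sizeof(float));  // 4 bytes per element
Console.WriteLine(sizeof(double)); // 8 bytes per element
```

Halved element size also means twice as many values per cache line, which is part of why single precision kernels run faster on the same hardware.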
3. Processor Optimized Code
Part of the NMath class library is based on BLAS and LAPACK, two long-established interfaces for linear algebra. We use Intel’s implementation of these libraries because Intel carefully optimizes their performance for Intel multicore processors on an ongoing basis. We also leverage MKL’s implementation of the FFT. Below is a brief comparison between NMath’s FFT and FFTW (the FFT implementation shipped with MATLAB), run on a different machine than the one above.
Comparison of a forward, real, out-of-place FFT:

| FFT length | FFTW | NMath FFT |
|---|---|---|
| 1024 | 4.14 μs | 4.36 μs |
| 1000 | 5.98 μs | 5.33 μs |
| 4096 | 20.31 μs | 21.71 μs |
| 4095 | 49.90 μs | 43.01 μs |
| 1024^2 | 17.16 ms | 15.63 ms |
Clearly, .NET programmers can have both the productive development language of C# and world-class computational performance.
There are a couple of different ways to call a native library from the .NET framework without unacceptably impacting performance. Using P/Invoke, the library can be called directly from C#. Due to the cost of marshaling the data, this is not a good option for many short computations, but for significant operations the P/Invoke cost is negligible. We also have the option to call a C++/CLI routine from C#, pinning all pointers to data allocated in the managed space, and then calling the Intel library. In terms of performance, pinning pointers in C++/CLI is generally better, but it is also more complex to implement, since the code must manage pointers into both the managed and unmanaged heaps. CenterSpace uses both techniques.
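As an illustration of the P/Invoke-plus-pinning mechanics (not NMath’s actual internals), the sketch below pins two managed arrays so the garbage collector cannot move them, then passes their raw addresses to the C runtime’s `memcpy`. The library name `libc.so.6` is a Linux/glibc assumption; it differs on other platforms:

```csharp
using System;
using System.Runtime.InteropServices;

double[] src = { 1.0, 2.0, 3.0 };
double[] dst = new double[3];

// Pin both arrays for the duration of the native call, then hand the raw
// addresses across the P/Invoke boundary.
GCHandle hSrc = GCHandle.Alloc(src, GCHandleType.Pinned);
GCHandle hDst = GCHandle.Alloc(dst, GCHandleType.Pinned);
try
{
    Native.memcpy(hDst.AddrOfPinnedObject(), hSrc.AddrOfPinnedObject(),
                  (UIntPtr)(ulong)(sizeof(double) * src.Length));
}
finally
{
    // Always release pins promptly; pinned objects fragment the GC heap.
    hSrc.Free();
    hDst.Free();
}
Console.WriteLine(dst[2]); // prints "3"

// "libc.so.6" is a Linux/glibc assumption; the native library name and
// entry points differ on Windows and macOS.
static class Native
{
    [DllImport("libc.so.6")]
    public static extern IntPtr memcpy(IntPtr dest, IntPtr src, UIntPtr count);
}
```

For a call this short the pin/unpin and transition overhead would dominate; the technique pays off when the native routine, like an MKL FFT or matrix factorization, does substantial work per call.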