Archive for the ‘MKL’ Category

Precision and Reproducibility in Computing

Monday, November 16th, 2015

Run-to-run reproducibility in computing is often assumed as an obvious truth. However, software running on modern computer architectures, particularly when coupled with advanced performance-optimized libraries, is often guaranteed to produce reproducible results only up to a certain precision; beyond that, results can and do vary from run to run. Reproducibility is interrelated with the precision of floating-point types and the resulting rounding, with operation reordering, with memory structure and use, and finally with how real numbers are represented internally in a computer’s registers.

This issue of reproducibility arises for NMath users when writing and running unit tests, which is why it’s important when writing tests to compare floating-point numbers only up to their designed precision, at an absolute maximum. With the IEEE 754 floating-point representation, which virtually all modern computers adhere to, the single-precision float type uses 32 bits (4 bytes) and offers 24 bits of precision, or about 7 decimal digits, while the double-precision double type requires 64 bits (8 bytes) and offers 53 bits of precision, or about 15 decimal digits. Few algorithms can achieve significant results to the 15th decimal place, due to rounding, loss of precision from subtraction, and other sources of numerical degradation. NMath’s numerical results are tested, at a maximum, to the 14th decimal place.

A Precision Example

As an example, what does the following code output?

      double x = .050000000000000003;
      double y = .050000000000000000;
      if ( x == y )
        Console.WriteLine( "x is y" );
      else
        Console.WriteLine( "x is not y" );

I get “x is y”, which is mathematically false, but the value assigned to x specifies more digits than a double can represent, so after rounding x and y hold exactly the same value.

Due to these limits on decimal number representation and the resulting rounding, the numerical results of some operations can be affected by the associative reordering of operations. For example, in some cases a*x + a*z may not equal a*(x + z) with floating-point types. This can be difficult to test with modern optimizing compilers, because the code you write and the code that runs may be organized very differently: mathematically equivalent, but not necessarily numerically equivalent.
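
To make this concrete, here is a small standalone sketch (ours, not from NMath) that searches random values for a triple where the two algebraically identical expressions differ; on typical hardware it finds one almost immediately:

using System;

class AssociativityDemo
{
  static void Main()
  {
    var rand = new Random();
    for ( int i = 0; i < 1000000; i++ )
    {
      double a = rand.NextDouble();
      double x = rand.NextDouble();
      double z = rand.NextDouble();

      // Algebraically identical, but each expression rounds at
      // different intermediate steps.
      double left  = a * x + a * z;
      double right = a * ( x + z );

      if ( left != right )
      {
        Console.WriteLine( "a = {0:R}, x = {1:R}, z = {2:R}", a, x, z );
        Console.WriteLine( "a*x + a*z = {0:R}", left );
        Console.WriteLine( "a*(x + z) = {0:R}", right );
        return;
      }
    }
    Console.WriteLine( "No difference found." );
  }
}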

So reproducibility is impacted by precision via dynamic operation reordering in the ALU, and additionally by run-time processor dispatching, data-array alignment, and variation in the number of threads, among other factors. These issues can create run-to-run differences in the least significant digits. Two runs, same code, two answers. This is by design and is not an issue of correctness. Subtle changes in the memory layout of the program’s data, differences in the loading of ALU registers and in operation order, and differences in threading, all caused by unrelated processes running on the same machine, produce these run-to-run differences.

Managing Reproducibility

Most importantly, one should test code’s numerical results only to the precision that can be expected given the algorithm, the input data, and ultimately the limits of floating-point arithmetic. To do this in unit tests, compare floating-point numbers carefully, only to a fixed number of digits. The code snippet below compares two double numbers and returns true only if the numbers match to a specified number of digits.

// DOUBLE_EPSILON is the smallest relative difference treated as zero;
// 1e-15 is a reasonable choice near the limit of double precision.
private const double DOUBLE_EPSILON = 1e-15;

private static bool EqualToNumDigits( double expected, double actual, int numDigits )
{
  double max = System.Math.Abs( expected ) > System.Math.Abs( actual ) ?
    System.Math.Abs( expected ) : System.Math.Abs( actual );
  double diff = System.Math.Abs( expected - actual );

  // Use the relative difference for numbers larger than 1, the absolute difference otherwise.
  double relDiff = max > 1.0 ? diff / max : diff;
  if ( relDiff <= DOUBLE_EPSILON )
    return true;

  // Count the number of leading decimal digits on which the values agree.
  int numDigitsAgree = (int) ( -System.Math.Floor( System.Math.Log10( relDiff ) ) - 1 );
  return numDigitsAgree >= numDigits;
}

This type of comparison should be used throughout unit testing code. The full code listing, which we use for our internal testing, is provided at the end of this article.
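
For instance, in an NUnit-style test the comparison replaces a direct equality assertion; the test name and values here are illustrative only:

[Test]
public void SumRoundsAsExpected()
{
  double expected = 0.3;
  double actual = 0.1 + 0.2;   // rounds to 0.30000000000000004

  // Assert.AreEqual( expected, actual ) would fail;
  // instead, require agreement to 12 digits.
  Assert.IsTrue( EqualToNumDigits( expected, actual, 12 ) );
}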

If it is essential to enforce binary run-to-run reproducibility to the limits of precision, NMath provides a flag in its configuration class to ensure this is the case. However, this flag should be set for unit testing only, because there can be a significant cost in performance. In general, expect a 10% to 20% reduction in performance, with some common operations degrading far more than that. For example, some matrix multiplications will take twice the time with this flag set.

Note that the number of threads used by Intel’s MKL library (on which NMath depends) must also be fixed before setting the reproducibility flag.

int numThreads = 2;  // This must be fixed for reproducibility.
NMathConfiguration.SetMKLNumThreads( numThreads );
NMathConfiguration.Reproducibility = true;

This reproducibility configuration for NMath cannot be unset at a later point in the program. Note that both the number of threads and the reproducibility flag may also be set in the app config or in environment variables; see the NMath User Guide for instructions on how to do this.
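
In a test project this configuration is typically applied once before any tests run. For example, with an NUnit-style one-time setup (the fixture name is ours; the thread count of 2 is just the value from the example above):

using CenterSpace.NMath.Core;
using NUnit.Framework;

[SetUpFixture]
public class ReproducibleNMathSetup
{
  [OneTimeSetUp]
  public void ConfigureNMath()
  {
    // Fix the MKL thread count first, then enable reproducibility.
    // Once enabled, the flag cannot be turned off in this process.
    NMathConfiguration.SetMKLNumThreads( 2 );
    NMathConfiguration.Reproducibility = true;
  }
}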



References

M. A. Cornea-Hasegan and B. Norin, “IA-64 Floating-Point Operations and the IEEE Standard for Binary Floating-Point Arithmetic,” Intel Technology Journal, Q4 1999.

D. Goldberg, “What Every Computer Scientist Should Know About Floating-Point Arithmetic,” ACM Computing Surveys, March 1991.

Full double Comparison Code

// DOUBLE_EPSILON is the smallest relative difference treated as zero;
// 1e-15 is a reasonable choice near the limit of double precision.
private const double DOUBLE_EPSILON = 1e-15;

private static bool EqualToNumDigits( double expected, double actual, int numDigits )
{
  bool xNaN = double.IsNaN( expected );
  bool yNaN = double.IsNaN( actual );
  if ( xNaN && yNaN )
    return true;
  if ( xNaN || yNaN )
    return false;
  if ( numDigits <= 0 )
    throw new ArgumentException( "numDigits is not positive in TestCase::EqualToNumDigits." );

  double max = System.Math.Abs( expected ) > System.Math.Abs( actual ) ?
    System.Math.Abs( expected ) : System.Math.Abs( actual );
  double diff = System.Math.Abs( expected - actual );

  // Use the relative difference for numbers larger than 1, the absolute difference otherwise.
  double relDiff = max > 1.0 ? diff / max : diff;
  if ( relDiff <= DOUBLE_EPSILON )
    return true;

  // Count the number of leading decimal digits on which the values agree.
  int numDigitsAgree = (int) ( -System.Math.Floor( System.Math.Log10( relDiff ) ) - 1 );
  //// Console.WriteLine( "expected = {0}, actual = {1}, rel diff = {2}, diff = {3}, num digits = {4}", expected, actual, relDiff, diff, numDigitsAgree );
  return numDigitsAgree >= numDigits;
}

Absolute value of complex numbers

Tuesday, March 8th, 2011

Max Hadley of Schlumberger in Southampton, UK, came to us with an interesting bug report regarding the MaxAbsValue() and MaxAbsIndex() functions as applied to complex vectors in the NMathFunctions class. Most of the time these methods worked as expected, but they would intermittently fail to correctly identify the maximum element in large vectors with similar elements.

In researching the MKL documentation, we found that this was in fact not a problem from MKL’s perspective. MKL uses the L1-norm, or Manhattan distance from zero, as the metric to compute the absolute value of a complex number. This simply means that it adds together the absolute values of the real and imaginary components:

|a + bi| = |a| + |b|

Absolute value of a complex number according to BLAS.

We had expected the absolute value to be computed via the L2-norm, or Euclidean distance from zero, |a + bi| = sqrt(a² + b²), which is sometimes referred to as the magnitude. Interestingly, MKL uses the L1-norm because that is the norm defined by the underlying BLAS standard, and apparently the original designers of BLAS chose that norm for computational efficiency, since it avoids the square root. This means that all BLAS-based linear algebra packages compute the norm of a complex vector in this way, and it’s probably not what most people expect.
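
To see how the two norms can disagree about which element is largest, consider 3 + 4i and 5.5i. The following standalone sketch uses only the standard System.Numerics.Complex type (no NMath or MKL involved):

using System;
using System.Numerics;

class NormDemo
{
  static void Main()
  {
    var z1 = new Complex( 3.0, 4.0 );   // L2-norm 5.0, L1-norm 7.0
    var z2 = new Complex( 0.0, 5.5 );   // L2-norm 5.5, L1-norm 5.5

    double L1( Complex z ) => Math.Abs( z.Real ) + Math.Abs( z.Imaginary );

    // The L2-norm (Complex.Abs) says z2 is the larger element...
    Console.WriteLine( "L2: |z1| = {0}, |z2| = {1}", Complex.Abs( z1 ), Complex.Abs( z2 ) );

    // ...while the BLAS-style L1-norm says z1 is larger.
    Console.WriteLine( "L1: |z1| = {0}, |z2| = {1}", L1( z1 ), L1( z2 ) );
  }
}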

This was a tricky bug to find for two reasons. First, substituting one norm for the other rarely elicited incorrect behavior, because one component generally dominates the magnitude, in which case both norms agree on which element is largest. Second, the absolute value of a single complex number (as opposed to the maximum absolute value over a complex vector) has always been calculated using the L2-norm.

Now that we had found the problem, we faced the unenviable task of making our API consistent while still interfacing with MKL and its way of finding the maximum-absolute-value element in a vector of complex numbers. We started by adding a ‘1’ suffix to all complex versions of the min and max abs methods that use MKL, and therefore the L1-norm, to compute the absolute value of complex numbers:

public static int MaxAbs1Index( FloatComplexVector v )
public static float MaxAbs1Value( FloatComplexVector v )
public static int MinAbs1Index( FloatComplexVector v )
public static float MinAbs1Value( FloatComplexVector v )
public static int MaxAbs1Index( DoubleComplexVector v )
public static double MaxAbs1Value( DoubleComplexVector v )
public static int MinAbs1Index( DoubleComplexVector v )
public static double MinAbs1Value( DoubleComplexVector v )

We have subsequently written new methods that compute the maximum and minimum absolute values of a complex vector according to the L2-norm, or Euclidean distance, of its elements. Users should be aware that these methods do not use MKL:

public static int MaxAbsIndex( FloatComplexVector v )
public static float MaxAbsValue( FloatComplexVector v )
public static int MinAbsIndex( FloatComplexVector v )
public static float MinAbsValue( FloatComplexVector v )
public static int MaxAbsIndex( DoubleComplexVector v )
public static double MaxAbsValue( DoubleComplexVector v )
public static int MinAbsIndex( DoubleComplexVector v )
public static double MinAbsValue( DoubleComplexVector v )
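
For example, using the two elements from the sketch above, the two families of methods pick different elements. This usage sketch assumes the usual NMath pattern of constructing a vector by length and assigning elements through the indexer:

var v = new DoubleComplexVector( 2 );
v[0] = new DoubleComplex( 3.0, 4.0 );   // L2-norm 5.0, L1-norm 7.0
v[1] = new DoubleComplex( 0.0, 5.5 );   // L2-norm 5.5, L1-norm 5.5

int i2 = NMathFunctions.MaxAbsIndex( v );    // 1 : Euclidean maximum
int i1 = NMathFunctions.MaxAbs1Index( v );   // 0 : BLAS/MKL L1 maximum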

We hope the change is intuitive and useful.


MKL Memory Leak?

Wednesday, January 28th, 2009

We recently heard from an NMath user:

I am seeing a memory accumulation in my application (which uses NMath Core 2.5). From my memory profiler it looks like it could be an allocation in DotNetBlas.Product(), within the MKL dgemm() function.

I understand that MKL is designed such that memory is not released until the application closes. However, as this application is always spinning up new worker threads, I’m wondering if each new thread is holding onto its own memory for Product().

I’ve tried setting the system variable MKL_DISABLE_FAST_MM, but this seems to have made no difference. Should I expect this to have an immediate effect (after restarting the application)? Is there any other way within NMath to force MKL to release memory?

It’s true that for performance reasons, memory allocated by the Intel Math Kernel Library (MKL) is not released. This is by design and is a one-time occurrence for MKL routines that require workspace memory buffers. However, this workspace appears to be allocated on a per-thread basis, which can be a problem for applications that spawn large numbers of threads. As the MKL documentation delicately puts it, “the user should be aware that some tools might report this as a memory leak”.

There are two solutions for multithreaded applications to avoid continuous memory accumulation:

  1. Use a thread pool, so the number of new threads is bounded by the size of the pool (a minimal sketch follows this list).
  2. Use the MKL_FreeBuffers() function to free the memory allocated by the MKL memory manager.
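
Below is a minimal sketch of the first approach using the standard .NET ThreadPool; the zero-matrix Product call is merely a stand-in for whatever MKL-backed NMath work your threads actually perform:

using System;
using System.Threading;
using CenterSpace.NMath.Core;

class PooledMklWork
{
  static void Main()
  {
    using ( var done = new CountdownEvent( 100 ) )
    {
      for ( int i = 0; i < 100; i++ )
      {
        // Work items are serviced by a bounded set of pool threads, so MKL
        // allocates its per-thread buffers only a bounded number of times.
        ThreadPool.QueueUserWorkItem( state =>
        {
          var a = new DoubleMatrix( 100, 100 );
          var b = NMathFunctions.Product( a, a );   // MKL dgemm under the hood
          done.Signal();
        } );
      }
      done.Wait();
    }
  }
}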

The MKL_FreeBuffers() function is not currently exposed in NMath, but will be added in the next release. In the meantime, you can add this function to Kernel.cpp in NMath Core, and rebuild:

static void MklFreeBuffers() {
  // Release the workspace buffers held by MKL’s memory manager.
  BLAS_PREFIX(MKL_FreeBuffers());
}

Or, if you want some console output to confirm that memory is being released, try this:

static void MklFreeBuffers() {

  MKL_INT64 AllocatedBytes;
  int N_AllocatedBuffers;

  // Report how much memory the MKL memory manager is currently holding.
  AllocatedBytes = MKL_MemStat(&N_AllocatedBuffers);
  System::Console::WriteLine("BEFORE: " + (long)AllocatedBytes + " bytes in " + N_AllocatedBuffers + " buffers");

  // Release the workspace buffers.
  BLAS_PREFIX(MKL_FreeBuffers());

  AllocatedBytes = MKL_MemStat(&N_AllocatedBuffers);
  System::Console::WriteLine("AFTER: " + (long)AllocatedBytes + " bytes in " + N_AllocatedBuffers + " buffers");
}


Once you’ve rebuilt NMath Core, you’d use the new method like so (the Kernel class name in the call below is an assumption; use whatever class you added the method to):

using CenterSpace.NMath.Kernel;

// ... then, at the end of a worker thread’s MKL work:
Kernel.MklFreeBuffers();

Note that some care should be taken when calling MklFreeBuffers(), since subsequent MKL functions on the same thread may see a drop in performance due to the reallocation of buffers. Furthermore, given the cost of freeing the buffers itself, rather than calling MklFreeBuffers() at the end of every thread, it might be more performant to do so after every n threads, or perhaps based on the total memory usage of your program.
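
If you take the every-n-threads approach, a rough sketch might look like the following; the FREE_EVERY_N value, the class name, and the Kernel.MklFreeBuffers() wrapper are placeholders for your own plumbing:

using System.Threading;
using CenterSpace.NMath.Kernel;

static class MklBufferJanitor
{
  private const int FREE_EVERY_N = 50;   // tuning knob; choose by profiling
  private static int threadsFinished = 0;

  // Call this at the end of each worker thread’s MKL work.
  public static void OnWorkerThreadExit()
  {
    // Free MKL’s buffers only every FREE_EVERY_N finished threads,
    // amortizing the cost of freeing and then reallocating them.
    if ( Interlocked.Increment( ref threadsFinished ) % FREE_EVERY_N == 0 )
      Kernel.MklFreeBuffers();
  }
}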