NMath Premium’s new Adaptive GPU Bridge Architecture

The most recent release, NMath Premium 6.0, is a major update that includes an upgraded optimization suite, now backed by the Microsoft Solver Foundation; a significantly more powerful GPU bridge architecture; and a new class for cubic smoothing splines. This blog post focuses on the new API for doing computation on GPUs with NMath Premium.

The adaptive GPU bridge API in NMath Premium 6.0 includes the following important new features.

  • Support for multiple GPUs.
  • Automatic tuning of the CPU–GPU adaptive bridge to ensure optimal hardware usage.
  • Per-thread control for binding threads to GPUs.

As with the first release of NMath Premium, using NMath to leverage massively parallel GPUs never requires any kernel-level GPU programming or other specialized GPU programming skills. Yet the programmer can easily take as much control as needed to route executing threads or tasks to any available GPU device. In the following, after introducing the new GPU bridge architecture, we'll discuss each of these features separately with code examples.

Before getting started on our NMath Premium tutorial, it's important to consider your test GPU model. While many of NVIDIA's GPUs provide a good to excellent computational advantage over the CPU, not all of NVIDIA's GPUs were designed with general computing in mind. The “NVS” class of NVIDIA GPUs (such as the NVS 5400M) generally performs very poorly, as do the “GT” cards in the GeForce series. However, the “GTX” cards in the GeForce series generally perform well, as do the Quadro desktop products and the Tesla cards. While it's fine to test NMath Premium on any NVIDIA GPU, testing on inexpensive consumer-grade video cards will rarely show any performance advantage.

NMath’s GPU API Basics

With NMath there are three fundamental software entities involved in routing computations between the CPU and GPUs: GPU hardware devices, represented by IComputeDevice instances; the Bridge classes, which control whether a particular operation is sent to the CPU or a GPU; and finally the BridgeManager, which provides the primary means for managing the devices and bridges.

These three entities are governed by two important ideas.

  1. Bridges are assigned to compute devices, and there is a strict one-to-one relationship between each Bridge and IComputeDevice. Once assigned, the bridge instance governs when computations are sent to its paired GPU device or kept on the CPU.
  2. Executing threads are assigned to devices; this is a many-to-one relationship. Any number of threads can be routed to a particular compute device.

Assigning a Bridge class to a device is one line of code with the BridgeManager.

// Here 'bridge' is an existing Bridge instance, created for example with
// BridgeManager.Instance.NewDefaultBridge().
BridgeManager.Instance.SetBridge( BridgeManager.Instance.GetComputeDevice( 0 ), bridge );

Assigning a thread, in this case the CurrentThread, to a device is again accomplished using the BridgeManager.

// Bind the current thread to GPU device 0.
IComputeDevice cd = BridgeManager.Instance.GetComputeDevice( 0 );
BridgeManager.Instance.SetComputeDevice( cd, Thread.CurrentThread );

After installing NMath Premium, the default behavior creates a default bridge and assigns it to the GPU with device number 0 (generally the fastest GPU installed). Also by default, all unassigned threads execute on device 0. This means that out of the box, with no additional programming, existing NMath code, once recompiled against the new NMath Premium assemblies, will route all appropriate computations to the device 0 GPU. All of the following discussions and code examples are ways to refine this default behavior to get the best performance from your GPU hardware.
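For example, here's a minimal sketch of this zero-configuration behavior, assuming the standard NMath DoubleMatrix class and NMathFunctions.Product routine (the matrix sizes here are purely illustrative):

using CenterSpace.NMath.Core;

// With the default bridge in place, a large matrix product like this is a
// candidate for routing to GPU device 0; no GPU-specific code is needed.
var a = new DoubleMatrix( 2000, 2000, 1.0 );  // 2000 x 2000, all elements 1.0
var b = new DoubleMatrix( 2000, 2000, 2.0 );
var c = NMathFunctions.Product( a, b );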

Math on Multiple GPUs Now Supported

Previously, only the NVIDIA GPU with device number 0 was supported by NMath Premium; this release removes that barrier. With version 6.0, work can be assigned to any installed NVIDIA device as long as the device drivers are up to date.

The work done by an executing thread is routed to a particular device using BridgeManager.Instance.SetComputeDevice(), as we saw in the example above. Any properly configured hardware device can be used here, including any NVIDIA device and the CPU. The CPU is simply viewed as another compute device and is always assigned a device number of -1.

var bmanager = BridgeManager.Instance;

// First, bind the current thread to the CPU device; no code on this
// thread will run on any GPU.
var cd = bmanager.GetComputeDevice( -1 );
bmanager.SetComputeDevice( cd, Thread.CurrentThread );
// ...
// Later, switch the current thread to GPU device 2.
cd = bmanager.GetComputeDevice( 2 );
bmanager.SetComputeDevice( cd, Thread.CurrentThread );

This code first binds the current thread to the CPU device, so no code on that thread will run on any GPU, and then switches the current thread to GPU device 2. If an invalid compute device is requested, a null IComputeDevice is returned. To find all available compute devices, the BridgeManager offers the Devices property, an array of IComputeDevice instances containing all detected compute devices, including the CPU. The number of detected GPUs can be found using the property BridgeManager.Instance.CountGPU.
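As a quick illustration, this sketch enumerates everything the BridgeManager detected; it assumes IComputeDevice exposes a DeviceNumber property (check the NMath Premium API reference for the exact member names):

var manager = BridgeManager.Instance;
Console.WriteLine( "GPUs detected: {0}", manager.CountGPU );

foreach ( IComputeDevice device in manager.Devices )
{
   // The CPU appears in this list alongside the GPUs, with device number -1.
   Console.WriteLine( "Found compute device number {0}", device.DeviceNumber );
}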

As an aside, keep in mind that PCI slot numbers do not necessarily correspond to GPU device numbers. NVIDIA assigns device number 0 to the fastest detected GPU, so installing an additional GPU in a machine may renumber the previously installed GPUs.

Tuning the Adaptive Bridge

Assigning a Bridge to a GPU device doesn't necessarily mean that all computation routed to that device will run on it. Instead, the assigned Bridge acts as an intermediary between the CPU and the GPU, moving larger problems to the GPU, where there's a speed advantage, and retaining smaller problems on the CPU. NMath has a built-in default bridge, but it may produce non-optimal run times depending on your hardware or your customers' hardware configuration. To improve hardware usage and performance, a bridge can be tuned once and then persisted to disk for all future use.

// Get a compute device and a new bridge.
IComputeDevice cd = BridgeManager.Instance.GetComputeDevice( 0 );
Bridge bridge = BridgeManager.Instance.NewDefaultBridge( cd );

// Tune this bridge for the matrix multiply operation alone. 
bridge.Tune( BridgeFunctions.dgemm, cd, 1200 );

// Or just tune the entire bridge.  Depending on the hardware and tuning parameters
// this can be an expensive one-time operation. 
bridge.TuneAll( cd, 1200 );

// Now assign this updated bridge to the device.
BridgeManager.Instance.SetBridge( cd, bridge );

// Persisting the bridge that was tuned above is done with the BridgeManager.  
// Note that this overwrites any existing bridge with the same name.
BridgeManager.Instance.SaveBridge( bridge, @".\MyTunedBridge" );

// Then loading that bridge from disk is simple.
var myTunedBridge = BridgeManager.Instance.LoadBridge( @".\MyTunedBridge" );

Once a bridge is tuned, it can be persisted, redistributed, and used again. If three different GPUs are installed, this tuning should be done once for each GPU, and each bridge should then be assigned to the device it was tuned on. However, if there are three identical GPUs, the tuning need be done only once, persisted to disk, and later assigned to all of the identical GPUs. A bridge assigned to a GPU device for which it wasn't tuned will never produce incorrect results; it may only underuse the hardware.
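For instance, here's a sketch of the identical-GPU case, reusing the MyTunedBridge file persisted above and assuming three identical GPUs at device numbers 0 through 2:

// Assign the persisted tuned bridge to each of three identical GPUs.
// A fresh Bridge instance is loaded per device to preserve the strict
// one-to-one bridge/device pairing.
for ( int deviceNumber = 0; deviceNumber < 3; deviceNumber++ )
{
   IComputeDevice gpu = BridgeManager.Instance.GetComputeDevice( deviceNumber );
   if ( gpu != null )
   {
      Bridge tuned = BridgeManager.Instance.LoadBridge( @".\MyTunedBridge" );
      BridgeManager.Instance.SetBridge( gpu, tuned );
   }
}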

Thread Control

Once a bridge is paired with a device, threads may be assigned to that device for execution. This is not a necessary step, as all unassigned threads run on the default device (typically device 0). However, suppose we have three tasks and three GPUs, and we wish to use one GPU per task. The following code does that.

...
IComputeDevice gpu0 = BridgeManager.Instance.GetComputeDevice( 0 );
IComputeDevice gpu1 = BridgeManager.Instance.GetComputeDevice( 1 );
IComputeDevice gpu2 = BridgeManager.Instance.GetComputeDevice( 2 );

if ( gpu0 != null && gpu1 != null && gpu2 != null )
{
   Task[] tasks = new Task[3]
   {
      Task.Factory.StartNew( () => Task1Worker( gpu0 ) ),
      Task.Factory.StartNew( () => Task2Worker( gpu1 ) ),
      Task.Factory.StartNew( () => Task3Worker( gpu2 ) ),
   };

   // Block until all tasks complete.
   Task.WaitAll( tasks );
}
...

This is standard C# code using the Task Parallel Library and contains no NMath Premium-specific API calls beyond passing a GPU compute device to each task. The task worker routines have the following simple structure.

private static void Task1Worker( IComputeDevice cd )
{
   // Bind this task's thread to the supplied compute device.
   BridgeManager.Instance.SetComputeDevice( cd, Thread.CurrentThread );

   // Do work here.
}

The other two task workers are identical, apart from whatever useful computing work they do.
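For illustration, here's one of the workers with its body filled in; the matrix product is a hypothetical workload standing in for real work, not part of the thread-binding API:

private static void Task1Worker( IComputeDevice cd )
{
   // Bind this task's thread to the supplied compute device.
   BridgeManager.Instance.SetComputeDevice( cd, Thread.CurrentThread );

   // Hypothetical workload: a large matrix product that this device's
   // bridge may route to its GPU.
   var a = new DoubleMatrix( 1500, 1500, 1.0 );
   var b = new DoubleMatrix( 1500, 1500, 2.0 );
   var product = NMathFunctions.Product( a, b );
}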

Good luck, and please post any questions in the comments below, or just email us at support AT centerspace.net and we'll get back to you.

Happy Computing,

Paul
