<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>

<channel>
	<title>Offloading to GPU Archives - CenterSpace</title>
	<atom:link href="https://www.centerspace.net/tag/offloading-to-gpu/feed" rel="self" type="application/rss+xml" />
	<link>https://www.centerspace.net/tag/offloading-to-gpu</link>
	<description>.NET numerical class libraries</description>
	<lastBuildDate>Sun, 03 May 2020 15:32:22 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.1.1</generator>
<site xmlns="com-wordpress:feed-additions:1">104092929</site>	<item>
		<title>NMath Premium&#8217;s new Adaptive GPU Bridge Architecture</title>
		<link>https://www.centerspace.net/gpu-math-csharp</link>
					<comments>https://www.centerspace.net/gpu-math-csharp#respond</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Mon, 13 Oct 2014 16:35:01 +0000</pubDate>
				<category><![CDATA[NMath Premium]]></category>
		<category><![CDATA[.NET GPU]]></category>
		<category><![CDATA[C# GPU]]></category>
		<category><![CDATA[c# LAPACK GPU]]></category>
		<category><![CDATA[C# Nvidia GPU]]></category>
		<category><![CDATA[math gpu]]></category>
		<category><![CDATA[math gpu csharp]]></category>
		<category><![CDATA[NMath GPU]]></category>
		<category><![CDATA[Offloading to GPU]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=5295</guid>

					<description><![CDATA[<p>The most recent release of NMath Premium 6.0 is a major upgrade to the GPU API, and it enables users to easily use multiple installed NVIDIA GPUs.  As always, using NMath Premium to leverage GPUs never requires any kernel-level GPU programming or other specialized GPU programming skills.  In the following article, after introducing the new GPU bridge architecture, we'll discuss each of the new API features separately with code examples.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/gpu-math-csharp">NMath Premium&#8217;s new Adaptive GPU Bridge Architecture</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>The most recent release of NMath Premium 6.0 is a major update which includes an upgraded optimization suite, now backed by the Microsoft Solver Foundation, a significantly more powerful GPU-bridge architecture, and a new class for cubic smoothing splines. This blog post will focus on the new API for doing computation on GPUs with NMath Premium. </p>
<p>The adaptive GPU bridge API in NMath Premium 6.0 includes the following important new features.</p>
<section>
<ul>
<li>Support for multiple GPUs</li>
<li>Automatic tuning of the CPU&#8211;GPU adaptive bridge to ensure optimal hardware usage.</li>
<li>Per-thread control for binding threads to GPUs.</li>
</ul>
</section>
<p>As with the first release of NMath Premium, using NMath to leverage massively-parallel GPUs never requires any kernel-level GPU programming or other specialized GPU programming skills. Yet the programmer can easily take as much control as needed to route executing threads or tasks to any available GPU device. In the following, after introducing the new GPU bridge architecture, we&#8217;ll discuss each of these features separately with code examples.</p>
<p>Before getting started on our NMath Premium tutorial, it&#8217;s important to consider your test GPU model.  While many of NVIDIA&#8217;s GPUs provide a good to excellent computational advantage over the CPU, not all of NVIDIA&#8217;s GPUs were designed with general computing in mind. The &#8220;NVS&#8221; class of NVIDIA GPUs (such as the NVS 5400M) generally performs very poorly, as do the &#8220;GT&#8221; cards in the GeForce series. However, the &#8220;GTX&#8221; cards in the <a href="http://www.geforce.com/hardware" target="_blank">GeForce series</a> generally perform well, as do the Quadro desktop products and the Tesla cards. While it&#8217;s fine to test NMath Premium on any NVIDIA GPU, testing on inexpensive consumer-grade video cards will rarely show any performance advantage.</p>
<h3>NMath&#8217;s GPU API Basics</h3>
<p>With NMath there are three fundamental software entities involved with routing computations between the CPU and GPU&#8217;s: GPU hardware devices represented by <code>IComputeDevice</code> instances, the <code>Bridge</code> classes which control when a particular operation is sent to the CPU or a GPU, and finally the <code>BridgeManager</code> which provides the primary means for managing the devices and bridges.</p>
<p>These three entities are governed by two important ideas.</p>
<ol>
<li><code>Bridge</code> instances are assigned to compute devices, and there is a strict one-to-one relationship between each <code>Bridge</code> and <code>IComputeDevice</code>. Once assigned, the bridge instance governs when computations will be sent to its paired GPU device or to the CPU.</li>
<li>Executing threads are assigned to devices; this is a many-to-one relationship. Any number of threads can be routed to a particular compute device.</li>
</ol>
<p>Assigning a <code>Bridge</code> class to a device is one line of code with the <code>BridgeManager</code>.</p>
<pre lang="csharp">BridgeManager.Instance.SetBridge( BridgeManager.Instance.GetComputeDevice( 0 ), bridge );</pre>
<p>Assigning a thread, in this case the <code>CurrentThread</code>, to a device is again accomplished using the <code>BridgeManager</code>.</p>
<pre lang="csharp">IComputeDevice cd = BridgeManager.Instance.GetComputeDevice( 0 );
BridgeManager.Instance.SetComputeDevice( cd, Thread.CurrentThread );</pre>
<p>After installing NMath Premium, the default behavior creates a default bridge and assigns it to the GPU with device number 0 (generally the fastest GPU installed). Also by default, all unassigned threads execute on device 0. This means that out of the box, with no additional programming, existing NMath code, once recompiled against the new NMath Premium assemblies, will route all appropriate computations to the device 0 GPU. All of the following discussions and code examples are ways to refine this default behavior to get the best performance from your GPU hardware.</p>
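<p>For example, to override this default and route the current thread&#8217;s work to a second GPU (device 1, assuming one is installed), the whole setup is only a few lines. This sketch uses just the calls introduced above:</p>
<pre lang="csharp">// Pair a default bridge with GPU device 1 and route this thread to it.
IComputeDevice gpu1 = BridgeManager.Instance.GetComputeDevice( 1 );
if ( gpu1 != null )
{
  Bridge bridge = BridgeManager.Instance.NewDefaultBridge( gpu1 );
  BridgeManager.Instance.SetBridge( gpu1, bridge );
  BridgeManager.Instance.SetComputeDevice( gpu1, Thread.CurrentThread );
}</pre>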
<h3>Math on Multiple GPUs Now Supported</h3>
<p>Previously, only the NVIDIA GPU with device number 0 was supported by NMath Premium; this release removes that barrier. With version 6, work can be assigned to any installed NVIDIA device as long as the device drivers are up-to-date.</p>
<p>The work done by an executing thread is routed to a particular device using <code>BridgeManager.Instance.SetComputeDevice()</code>, as we saw in the example above. Any properly configured hardware device can be used here, including any NVIDIA device and the CPU. The CPU is simply viewed as another compute device and is always assigned a device number of -1.</p>
<pre lang="csharp" line="1">var bmanager = BridgeManager.Instance;

var cd = bmanager.GetComputeDevice( -1 );
bmanager.SetComputeDevice( cd, Thread.CurrentThread );
// ...
cd = bmanager.GetComputeDevice( 2 );
bmanager.SetComputeDevice( cd, Thread.CurrentThread );</pre>
<p>Lines 3 &#038; 4 first assign the current thread to the CPU device (no code on this thread will run on any GPU), and then in lines 6 &#038; 7 the current thread is switched to GPU device 2.  If an invalid compute device is requested, a null <code>IComputeDevice</code> is returned.  To find all available compute devices, the <code>BridgeManager</code> offers the <code>Devices</code> property, an array of <code>IComputeDevice</code> instances containing every detected compute device <em>including the CPU</em>. The number of detected GPUs can be found with the property <code>BridgeManager.Instance.CountGPU</code>.</p>
<p>As an aside, keep in mind that PCI slot numbers do not necessarily correspond to GPU device numbers. NVIDIA assigns device number 0 to the fastest detected GPU, so installing an additional GPU into a machine may renumber the device numbers of the previously installed GPUs.</p>
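<p>Because of this renumbering, it can be useful to enumerate what NMath Premium actually detects at runtime before assigning any bridges or threads. A quick sketch using the <code>Devices</code> and <code>CountGPU</code> members described above (the console output simply relies on each device&#8217;s string representation):</p>
<pre lang="csharp">// List every detected compute device, including the CPU (device -1).
Console.WriteLine( "Detected GPUs: " + BridgeManager.Instance.CountGPU );
foreach ( IComputeDevice device in BridgeManager.Instance.Devices )
{
  Console.WriteLine( device );
}</pre>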
<h3>Tuning the Adaptive Bridge</h3>
<p>Assigning a <code>Bridge</code> to a GPU device doesn&#8217;t necessarily mean that all computation routed to that device will run on that device. Instead, the assigned <code>Bridge</code> acts as an intermediary between the CPU and the GPU, moving the larger problems to the GPU where there&#8217;s a speed advantage and retaining the smaller problems on the CPU. NMath has a built-in default bridge, but it may yield non-optimal run-times depending on your hardware or your customer&#8217;s hardware configuration. To improve hardware usage and performance, a bridge can be tuned once and then persisted to disk for all future use.</p>
<pre lang="csharp">// Get a compute device and a new bridge.
IComputeDevice cd = BridgeManager.Instance.GetComputeDevice( 0 );
Bridge bridge = BridgeManager.Instance.NewDefaultBridge( cd );

// Tune this bridge for the matrix multiply operation alone. 
bridge.Tune( BridgeFunctions.dgemm, cd, 1200 );

// Or just tune the entire bridge.  Depending on the hardware and tuning parameters
// this can be an expensive one-time operation. 
bridge.TuneAll( cd, 1200 );

// Now assign this updated bridge to the device.
BridgeManager.Instance.SetBridge( cd, bridge );

// Persisting the bridge that was tuned above is done with the BridgeManager.  
// Note that this overwrites any existing bridge with the same name.
BridgeManager.Instance.SaveBridge( bridge, @".\MyTunedBridge" );

// Then loading that bridge from disk is simple.
var myTunedBridge = BridgeManager.Instance.LoadBridge( @".\MyTunedBridge" );</pre>
<p>Once a bridge is tuned it can be persisted, redistributed, and used again. If three different GPUs are installed, this tuning should be done once for each GPU, and each bridge should then be assigned to the device it was tuned on. However, if the three GPUs are identical, the tuning need be done only once, persisted to disk, and later assigned to all of them. A bridge assigned to a GPU device for which it wasn&#8217;t tuned will never produce incorrect results; at worst the hardware will underperform.</p>
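<p>Putting these pieces together, the tune-once-and-reuse workflow for a pair of identical GPUs might look like the following sketch. Each device is given its own loaded bridge instance, respecting the one-to-one pairing of bridges and devices described earlier (the device numbers here are illustrative):</p>
<pre lang="csharp">// Tune against the first GPU and persist the result to disk.
IComputeDevice gpu0 = BridgeManager.Instance.GetComputeDevice( 0 );
Bridge tuned = BridgeManager.Instance.NewDefaultBridge( gpu0 );
tuned.TuneAll( gpu0, 1200 );
BridgeManager.Instance.SaveBridge( tuned, @".\MyTunedBridge" );

// Later, or on a machine with the same GPU models, load a copy
// of the tuned bridge for each identical device.
BridgeManager.Instance.SetBridge( BridgeManager.Instance.GetComputeDevice( 0 ),
    BridgeManager.Instance.LoadBridge( @".\MyTunedBridge" ) );
BridgeManager.Instance.SetBridge( BridgeManager.Instance.GetComputeDevice( 1 ),
    BridgeManager.Instance.LoadBridge( @".\MyTunedBridge" ) );</pre>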
<h3>Thread Control</h3>
<p>Once a bridge is paired to a device, threads may be assigned to that device for execution. This is not a necessary step, as all unassigned threads will run on the default device (typically device 0). However, suppose we have three tasks and three GPUs, and we wish to dedicate one GPU to each task.  The following code does that.</p>
<pre lang="csharp">...
IComputeDevice gpu0 = BridgeManager.Instance.GetComputeDevice( 0 );
IComputeDevice gpu1 = BridgeManager.Instance.GetComputeDevice( 1 );
IComputeDevice gpu2 = BridgeManager.Instance.GetComputeDevice( 2 );

if( gpu0 != null && gpu1 != null && gpu2 != null)
{
   System.Threading.Tasks.Task[] tasks = new Task[3]
   {
      Task.Factory.StartNew(() => Task1Worker(gpu0)),
      Task.Factory.StartNew(() => Task2Worker(gpu1)),
      Task.Factory.StartNew(() => Task3Worker(gpu2)),
   };

   //Block until all tasks complete.
   Task.WaitAll(tasks);
}
...</pre>
<p>This code is standard C# code using the <a href="https://msdn.microsoft.com/en-us/library/dd460717(v=vs.110).aspx" target="_blank">Task Parallel Library</a> and contains no NMath Premium specific API calls outside of passing a GPU compute device to each task. The task worker routines have the following simple structure.</p>
<pre lang="csharp">private static void Task1Worker( IComputeDevice cd  )
  {
      BridgeManager.Instance.SetComputeDevice( cd );

      // Do Work here.
  }</pre>
<p>The other two task workers are identical outside of whatever useful computing work they may be doing.</p>
<p>Good luck, and please post any questions in the comments below or email us at support AT centerspace.net and we&#8217;ll get back to you.</p>
<p>Happy Computing,</p>
<p>Paul</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/gpu-math-csharp">NMath Premium&#8217;s new Adaptive GPU Bridge Architecture</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/gpu-math-csharp/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5295</post-id>	</item>
		<item>
		<title>Offloading Computation to your GPU</title>
		<link>https://www.centerspace.net/offloading-computation-to-your-gpu</link>
					<comments>https://www.centerspace.net/offloading-computation-to-your-gpu#respond</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Thu, 13 Jun 2013 22:32:35 +0000</pubDate>
				<category><![CDATA[NMath Premium]]></category>
		<category><![CDATA[C# GPU]]></category>
		<category><![CDATA[C# Nvidia GPU]]></category>
		<category><![CDATA[Offloading to GPU]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=4473</guid>

					<description><![CDATA[<p>Large computational problems are offloaded onto a GPU because the problems run substantially faster on the GPU than on the CPU. By leveraging the innate parallelism of the GPU, the overall performance of the application is improved. (For example, see here and here.) However, a second collateral benefit of moving computation to the GPU is the [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/offloading-computation-to-your-gpu">Offloading Computation to your GPU</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Large computational problems are offloaded onto a GPU because they run substantially faster on the GPU than on the CPU. By leveraging the innate parallelism of the GPU, the overall performance of the application is improved. (For example, see <a href="https://web.archive.org/web/20160329160744/http://www.centerspace.net:80/nmath-premium-gpu-accelerated-performance-test">here </a>and <a href="/nmath-premium-fft-performance/">here</a>.) A second, collateral benefit of moving computation to the GPU is that work is offloaded from the CPU. Until the advent of tools like <strong>NMath Premium</strong>, this benefit was seldom discussed because of the complexity of programming the GPU; the focus has been on the raw performance of the GPU, but for desktop users the ability to offload work to a second, underutilized processor is often just as important. In this post I&#8217;ll present a code example that provides a simple task queuing model that can asynchronously offload work to the GPU and return results without writing any specialized GPU code.</p>
<h2>Offloading to Your GPU</h2>
<p>Frequently, data processing applications have a tripartite structure &#8211; the data flows in from a disk on the network, the data is then computationally processed, and finally the results are analyzed and exported. Each of these stages carries a different computational load, and each can proceed independently. In the code example below, this common structure is mirrored in three asynchronous tasks, one for each stage, linked by two queues. We want to compute a stream of 2D FFTs and would like to offload that work to the GPU to free up the CPU for more analysis.</p>
<pre lang="csharp">    public void ThreadedGPUFFTExample()
    {

      //NMathConfiguration.ProcessorSharingMethod = ProcessorManagement.CPU;
      //NMathConfiguration.EnableGPULogging = true;

      Stopwatch timer = new Stopwatch();
      timer.Reset();

      // Off-load all FFT work to the GPU.
      var fftLength = 3000;
      FloatComplexForward2DFFT fftEngine = 
          new FloatComplexForward2DFFT( fftLength, fftLength );

      // Synchronized wrappers because these queues are shared across threads.
      Queue dataInQ = Queue.Synchronized( new Queue( 2 ) );
      Queue dataOutQ = Queue.Synchronized( new Queue( 10 ) );

      var jobBlockCount = 10;

      // Start up threaded tasks that each monitor their respective Queues.
      var fftTask = Task.Factory.StartNew( () 
          =&gt; GPUFFTWorker( jobBlockCount, fftEngine, dataInQ, dataOutQ ) );
      var cpuTask = Task.Factory.StartNew( () 
          =&gt; CPUWorker( jobBlockCount, dataOutQ ) );
      var cpuDataReaderTask = Task.Factory.StartNew( () 
          =&gt; CPUDataReader( jobBlockCount, dataInQ ) );

      timer.Start();
      cpuTask.Wait();  // Wait until we are finished with the jobs
      timer.Stop();

      Console.WriteLine( String.Format( "\n * Tasks required {0} ms for {1} jobs. ", timer.ElapsedMilliseconds, jobBlockCount ) );

    }</pre>
<p>This is the main body of our example, where two queues are set up to pass data structures between the three tasks, <code>GPUFFTWorker(), CPUWorker(), &amp; CPUDataReader()</code>. The data stored in the queues are <code>FloatComplexMatrix</code> instances, but any type or data structure could be used as needed. Here our main GPU task is computing a series of 2D FFTs, so 2D arrays are passed in the queues. Once the three tasks are started, we simply wait for the main CPU task to finish all of the analysis, print a message, and exit.</p>
<p>The three worker tasks are simple routines which are polling the queues for incoming work, and once their 10 jobs have been completed they exit. The code is provided at the bottom of this article.</p>
<h2>Measuring the offloading</h2>
<p>Running this example as shown above, computing ten 3000&#215;3000 2D FFTs, we see the following output.</p>
<pre class="code">Enqueued data for job #10 
  Finished FFT on GPU for job 10.
  Dequeued spectrum 10 for analysis 
Enqueued data for job #9 
  Finished FFT on GPU for job 9.
  Dequeued spectrum 9 for analysis 
Enqueued data for job #8 
  Finished FFT on GPU for job 8.
  Dequeued spectrum 8 for analysis 
Enqueued data for job #7 
  Finished FFT on GPU for job 7.
Enqueued data for job #6 
  Dequeued spectrum 7 for analysis 
  Finished FFT on GPU for job 6.
Enqueued data for job #5 
  Finished FFT on GPU for job 5.
  Dequeued spectrum 6 for analysis 
Enqueued data for job #4 
  Finished FFT on GPU for job 4.
  Dequeued spectrum 5 for analysis 
Enqueued data for job #3 
  Finished FFT on GPU for job 3.
Enqueued data for job #2 
  Finished FFT on GPU for job 2.
  Dequeued spectrum 4 for analysis 
Enqueued data for job #1 
 * Finished loading all requested datasets.
  Finished FFT on GPU for job 1.
 * Finished all 2D FFT's.
  Dequeued spectrum 3 for analysis 
  Dequeued spectrum 2 for analysis 
  Dequeued spectrum 1 for analysis 

 * Tasks required 14148 ms for 10 jobs.</pre>
<p>This output shows that the three tasks are indeed running asynchronously and that the final analysis in the <code>CPUWorker</code> can&#8217;t quite keep up with the two upstream tasks. To measure how much work we are offloading to the GPU, we need to run this example twice, doing the 2D FFTs first on the GPU and then on the CPU, and compare the CPU load graphs in the resource monitor. If we are successfully offloading work to the GPU, we should see substantially lower CPU loading while using the GPU for the 2D FFTs. We can control where the computation runs by including or commenting out the first line of code in our example.</p>
<pre lang="csharp">//NMathConfiguration.ProcessorSharingMethod = ProcessorManagement.CPU;</pre>
<p>If this line of code is commented out, the default processor sharing method of <code>ProblemSize</code> is used, which will cause our large 2D FFTs to be shunted over to the GPU. If this line is included, all processing will be done on the CPU alone.</p>
<p>The following two images were plucked from my resource monitor after a complete run of 30 2D FFT jobs.</p>
<p>Offloading measurement by monitoring CPU loading</p>
<table border="1">
<thead></thead>
<tbody>
<tr>
<td>
<p><figure id="attachment_4509" aria-describedby="caption-attachment-4509" style="width: 196px" class="wp-caption alignnone"><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/06/CPU-Load.png"><img decoding="async" class="size-full wp-image-4509" alt="CPU load while running FFT's on CPU" src="https://www.centerspace.net/blog/wp-content/uploads/2013/06/CPU-Load.png" width="196" height="466" srcset="https://www.centerspace.net/wp-content/uploads/2013/06/CPU-Load.png 196w, https://www.centerspace.net/wp-content/uploads/2013/06/CPU-Load-126x300.png 126w" sizes="(max-width: 196px) 100vw, 196px" /></a><figcaption id="caption-attachment-4509" class="wp-caption-text">CPU load while running FFT&#8217;s on CPU</figcaption></figure></td>
<td>
<p><figure id="attachment_4510" aria-describedby="caption-attachment-4510" style="width: 197px" class="wp-caption alignnone"><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/06/GPU-Load.png"><img decoding="async" loading="lazy" class="size-full wp-image-4510" alt="CPU load while running FFT's on GPU" src="https://www.centerspace.net/blog/wp-content/uploads/2013/06/GPU-Load.png" width="197" height="466" /></a><figcaption id="caption-attachment-4510" class="wp-caption-text">CPU load while running FFT&#8217;s on GPU</figcaption></figure></td>
</tr>
</tbody>
</table>
<p>I ran these two experiments on my 4-core hyper-threaded i7 desktop using an NVIDIA GeForce 640 GPU. This particular GPU was shipped standard with my Dell computer and would be commonly found in many performance desktops. Clearly, shifting the 2D FFTs to the GPU offloads a lot of work from my CPUs; in fact, CPU-7 and CPU-4 are completely parked (shut down) during the entire run, and CPU-3 barely lifted a finger. Now we should go to work on threading the CPU-analysis portion of our code to leverage these idle cores.</p>
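<p>As a first step in that direction, the per-spectrum analysis inside <code>CPUWorker</code> could itself be handed to the Task Parallel Library, so that several spectra are analyzed concurrently while the GPU keeps producing new ones. A sketch of the idea, with the peak finding left as a placeholder just as in the worker code below:</p>
<pre lang="csharp">// Analyze each dequeued spectrum on its own task so the idle cores share the load.
FloatComplexMatrix fftSpectrum = (FloatComplexMatrix)dataOut.Dequeue();
Task.Factory.StartNew( () =&gt;
{
  // The magnitude computation and peak finding now run concurrently
  // with the data reader and the GPU FFT worker.
  FloatMatrix absFFT = NMathFunctions.Abs( fftSpectrum );
  // Find spectral peaks, write out results, ...
} );</pre>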
<p>&#8211; Happy Computing,</p>
<p>Paul</p>
<h3>Worker Code</h3>
<pre lang="csharp">    // CPUDataReader is responsible for gathering the data.
    private void CPUDataReader( int jobCounter, Queue dataIn )
    {

      while ( jobCounter &gt; 0 )
      {
        // Read the initial data set from disk and load into memory for
        // each job.  I'm just simulating this with a random matrix.
        FloatComplexMatrix data = new FloatComplexMatrix( 3000, 3000, 
          new RandGenNormal( 0.0, 1.0, 445 + jobCounter ) );

        dataIn.Enqueue( data );

        Console.WriteLine( String.Format( "Enqueued data for job #{0} ", jobCounter ) );

        jobCounter--;
      }

      Console.WriteLine( " * Finished loading all requested datasets." );
    }

    // GPUFFTWorker is responsible for computing the stream of 2D FFT's
    private void GPUFFTWorker( int jobCounter, FloatComplexForward2DFFT fftEngine,   
      Queue dataIn, Queue dataOut )
    {
      FloatComplexMatrix signal;

      // Monitor the job queue and execute the FFT's as the data becomes available.
      while ( jobCounter &gt; 0 )
      {
        if( dataIn.Count &gt; 0 )
        {
          signal = (FloatComplexMatrix)dataIn.Dequeue();

          fftEngine.FFTInPlace( signal );

          Console.WriteLine( String.Format("  Finished FFT on GPU for job {0}.", jobCounter) );

          dataOut.Enqueue( signal );

          jobCounter--;
        }
      }

      Console.WriteLine( " * Finished all 2D FFT's." );
    }

    // CPUWorker is responsible for the post analysis of the data.
    private void CPUWorker( int jobCounter, Queue dataOut )
    {

      while ( jobCounter &gt; 0 )
      {
        if ( dataOut.Count &gt; 0 )
        {

          FloatComplexMatrix fftSpectrum = (FloatComplexMatrix)dataOut.Dequeue();

          Console.WriteLine( String.Format( "  Dequeued spectrum {0} for analysis ", jobCounter ) );

          // Compute magnitude of FFT
          FloatMatrix absFFT = NMathFunctions.Abs( fftSpectrum );

          // Find spectral peaks, write out results, ...

          jobCounter--;
        }
      }
    }</pre>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/offloading-computation-to-your-gpu">Offloading Computation to your GPU</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/offloading-computation-to-your-gpu/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4473</post-id>	</item>
	</channel>
</rss>
