<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>

<channel>
	<title>C# Nvidia GPU Archives - CenterSpace</title>
	<atom:link href="https://www.centerspace.net/tag/c-nvidia-gpu/feed" rel="self" type="application/rss+xml" />
	<link>https://www.centerspace.net/tag/c-nvidia-gpu</link>
	<description>.NET numerical class libraries</description>
	<lastBuildDate>Tue, 07 Feb 2023 21:29:19 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.1.1</generator>
<site xmlns="com-wordpress:feed-additions:1">104092929</site>	<item>
		<title>NMath Premium&#8217;s new Adaptive GPU Bridge Architecture</title>
		<link>https://www.centerspace.net/gpu-math-csharp</link>
					<comments>https://www.centerspace.net/gpu-math-csharp#respond</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Mon, 13 Oct 2014 16:35:01 +0000</pubDate>
				<category><![CDATA[NMath Premium]]></category>
		<category><![CDATA[.NET GPU]]></category>
		<category><![CDATA[C# GPU]]></category>
		<category><![CDATA[c# LAPACK GPU]]></category>
		<category><![CDATA[C# Nvidia GPU]]></category>
		<category><![CDATA[math gpu]]></category>
		<category><![CDATA[math gpu csharp]]></category>
		<category><![CDATA[NMath GPU]]></category>
		<category><![CDATA[Offloading to GPU]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=5295</guid>

					<description><![CDATA[<p>The most recent release of NMath Premium, version 6.0, is a major upgrade to the GPU API that enables users to easily use multiple installed NVIDIA GPUs.  As always, using NMath Premium to leverage GPUs never requires any kernel-level GPU programming or other specialized GPU programming skills.  In the following article, after introducing the new GPU bridge architecture, we'll discuss each of the new API features separately with code examples.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/gpu-math-csharp">NMath Premium&#8217;s new Adaptive GPU Bridge Architecture</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>The most recent release of NMath Premium 6.0 is a major update which includes an upgraded optimization suite, now backed by the Microsoft Solver Foundation, a significantly more powerful GPU-bridge architecture, and a new class for cubic smoothing splines. This blog post will focus on the new API for doing computation on GPUs with NMath Premium. </p>
<p>The adaptive GPU bridge API in NMath Premium 6.0 includes the following important new features.</p>
<section>
<ul>
<li>Support for multiple GPUs</li>
<li>Automatic tuning of the CPU&#8211;GPU adaptive bridge to ensure optimal hardware usage.</li>
<li>Per-thread control for binding threads to GPUs.</li>
</ul>
</section>
<p>As with the first release of NMath Premium, using NMath to leverage massively-parallel GPUs never requires any kernel-level GPU programming or other specialized GPU programming skills. Yet the programmer can easily take as much control as needed to route executing threads or tasks to any available GPU device. In the following, after introducing the new GPU bridge architecture, we&#8217;ll discuss each of these features separately with code examples.</p>
<p>Before getting started on our NMath Premium tutorial it&#8217;s important to consider your test GPU model.  While many of NVIDIA&#8217;s GPUs provide a good to excellent computational advantage over the CPU, not all of NVIDIA&#8217;s GPUs were designed with general computing in mind. The &#8220;NVS&#8221; class of NVIDIA GPUs (such as the NVS 5400M) generally performs very poorly, as do the &#8220;GT&#8221; cards in the GeForce series. However, the &#8220;GTX&#8221; cards in the <a href="http://www.geforce.com/hardware" target="_blank">GeForce series</a> generally perform well, as do the Quadro desktop products and the Tesla cards. While it&#8217;s fine to test NMath Premium on any NVIDIA GPU, testing on inexpensive consumer-grade video cards will rarely show any performance advantage.</p>
<h3>NMath&#8217;s GPU API Basics</h3>
<p>With NMath there are three fundamental software entities involved in routing computations between the CPU and GPUs: GPU hardware devices represented by <code>IComputeDevice</code> instances, the <code>Bridge</code> classes which control when a particular operation is sent to the CPU or a GPU, and finally the <code>BridgeManager</code> which provides the primary means for managing the devices and bridges.</p>
<p>These three entities are governed by two important ideas.</p>
<ol>
<li><code>Bridges</code> are assigned to compute devices and there is a strict one-to-one relationship between each <code>Bridge</code> and <code>IComputeDevice</code>. Once assigned, the bridge instance governs when computations will be sent to its paired GPU device or the CPU.</li>
<li>Executing threads are assigned to devices; this is a many-to-one relationship. Any number of threads can be routed to a particular compute device.</li>
</ol>
<p>Assigning a <code>Bridge</code> class to a device is one line of code with the <code>BridgeManager</code>.</p>
<pre lang="csharp">BridgeManager.Instance.SetBridge( BridgeManager.Instance.GetComputeDevice( 0 ), bridge );</pre>
<p>Assigning a thread, in this case the <code>CurrentThread</code>, to a device is again accomplished using the <code>BridgeManager</code>.</p>
<pre lang="csharp">IComputeDevice cd = BridgeManager.Instance.GetComputeDevice( 0 );
BridgeManager.Instance.SetComputeDevice( cd, Thread.CurrentThread );</pre>
<p>After installing NMath Premium, the default behavior will create a default bridge and assign it to the GPU with a device number of 0 (generally the fastest GPU installed). Also by default, all unassigned threads will execute on device 0. This means that out of the box with no additional programming, existing NMath code, once recompiled against the new NMath Premium assemblies, will route all appropriate computations to the device 0 GPU. All of the following discussions and code examples are ways to refine this default behavior to get the best performance from your GPU hardware.</p>
<h3>Math on Multiple GPUs Supported</h3>
<p>Previously, only the NVIDIA GPU with device number 0 was supported by NMath Premium; this release removes that barrier. With version 6, work can be assigned to any installed NVIDIA device as long as the device drivers are up-to-date.</p>
<p>The work done by an executing thread is routed to a particular device using <code>BridgeManager.Instance.SetComputeDevice()</code>, as we saw in the example above. Any properly configured hardware device can be used here, including any NVIDIA device and the CPU. The CPU is simply viewed as another compute device and is always assigned a device number of -1.</p>
<pre lang="csharp" line="1">var bmanager = BridgeManager.Instance;

var cd = bmanager.GetComputeDevice( -1 );
bmanager.SetComputeDevice( cd, Thread.CurrentThread );
// ...
cd = bmanager.GetComputeDevice( 2 );
bmanager.SetComputeDevice( cd, Thread.CurrentThread );</pre>
<p>Lines 3 &#038; 4 first assign the current thread to the CPU device (no code on this thread will run on any GPU), and then in lines 6 &#038; 7 the current thread is switched to GPU device 2.  If an invalid compute device is requested, a null <code>IComputeDevice</code> is returned.  To find all available compute devices, the <code>BridgeManager</code> offers a <code>Devices</code> property, an array of <code>IComputeDevice</code> instances which contains all detected compute devices <em>including the CPU</em>. The number of detected GPUs can be found using the property <code>BridgeManager.Instance.CountGPU</code>.</p>
<p>As an aside, keep in mind that PCI slot numbers do not necessarily correspond to GPU device numbers. NVIDIA assigns device number 0 to the fastest detected GPU, so installing an additional GPU into a machine may renumber the device numbers of the previously installed GPUs.</p>
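<p>As a sketch of how device discovery might look in code (the <code>Devices</code> and <code>CountGPU</code> members are those described above; <code>DeviceName</code> is assumed from the device interface, as used in the timing examples later in this post):</p>
<pre lang="csharp">// Sketch only: enumerate the compute devices NMath Premium detected.
// Devices and CountGPU are the BridgeManager members described above;
// DeviceName is assumed from IComputeDevice.
var manager = BridgeManager.Instance;

Console.WriteLine( "Detected " + manager.CountGPU + " GPU(s)." );

// Devices includes the CPU (device number -1) as well as every GPU.
foreach ( IComputeDevice device in manager.Devices )
{
  Console.WriteLine( device.DeviceName );
}

// Requesting a device number that doesn't exist returns null.
if ( manager.GetComputeDevice( 99 ) == null )
{
  Console.WriteLine( "No compute device with number 99." );
}</pre>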
<h3>Tuning the Adaptive Bridge</h3>
<p>Assigning a <code>Bridge</code> to a GPU device doesn&#8217;t necessarily mean that all computation routed to that device will run on that device. Instead, the assigned <code>Bridge</code> acts as an intermediary between the CPU and the GPU, moving the larger problems to the GPU where there&#8217;s a speed advantage and retaining the smaller problems on the CPU. NMath has a built-in default bridge, but it may generate non-optimal run-times depending on your hardware or your customers&#8217; hardware configuration. To improve hardware usage and performance, a bridge can be tuned once and then persisted to disk for all future use.</p>
<pre lang="csharp">// Get a compute device and a new bridge.
IComputeDevice cd = BridgeManager.Instance.GetComputeDevice( 0 );
Bridge bridge = BridgeManager.Instance.NewDefaultBridge( cd );

// Tune this bridge for the matrix multiply operation alone. 
bridge.Tune( BridgeFunctions.dgemm, cd, 1200 );

// Or just tune the entire bridge.  Depending on the hardware and tuning parameters
// this can be an expensive one-time operation. 
bridge.TuneAll( cd, 1200 );

// Now assign this updated bridge to the device.
BridgeManager.Instance.SetBridge( cd, bridge );

// Persisting the bridge that was tuned above is done with the BridgeManager.  
// Note that this overwrites any existing bridge with the same name.
BridgeManager.Instance.SaveBridge( bridge, @".\MyTunedBridge" );

// Then loading that bridge from disk is simple.
var myTunedBridge = BridgeManager.Instance.LoadBridge( @".\MyTunedBridge" );</pre>
<p>Once a bridge is tuned it can be persisted, redistributed, and used again. If three different GPUs are installed, this tuning should be done once for each GPU and then each bridge should be assigned to the device it was tuned on. However, if there are three identical GPUs, the tuning need be done only once, then persisted to disk, and later assigned to all identical GPUs. A bridge assigned to a GPU device for which it wasn&#8217;t tuned will never produce incorrect results; it may only underperform the hardware.</p>
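<p>For example, with two identical GPUs a single persisted tuning might be reused like this (a sketch using only the <code>LoadBridge</code>, <code>SetBridge</code>, and <code>GetComputeDevice</code> calls shown above; the one-to-one bridge-to-device rule means each device gets its own loaded instance):</p>
<pre lang="csharp">// Sketch: reuse one tuned bridge file across two identical GPUs.
IComputeDevice gpu0 = BridgeManager.Instance.GetComputeDevice( 0 );
IComputeDevice gpu1 = BridgeManager.Instance.GetComputeDevice( 1 );

// Load a separate Bridge instance per device, since each Bridge
// pairs with exactly one IComputeDevice.
Bridge bridge0 = BridgeManager.Instance.LoadBridge( @".\MyTunedBridge" );
Bridge bridge1 = BridgeManager.Instance.LoadBridge( @".\MyTunedBridge" );

BridgeManager.Instance.SetBridge( gpu0, bridge0 );
BridgeManager.Instance.SetBridge( gpu1, bridge1 );</pre>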
<h3>Thread Control</h3>
<p>Once a bridge is paired to a device, threads may be assigned to that device for execution. This is not a necessary step, as all unassigned threads will run on the default device (typically device 0). However, suppose we have three tasks and three GPUs, and we wish to use one GPU per task.  The following code does that.</p>
<pre lang="csharp">...
IComputeDevice gpu0 = BridgeManager.Instance.GetComputeDevice( 0 );
IComputeDevice gpu1 = BridgeManager.Instance.GetComputeDevice( 1 );
IComputeDevice gpu2 = BridgeManager.Instance.GetComputeDevice( 2 );

if( gpu0 != null && gpu1 != null && gpu2 != null )
{
   System.Threading.Tasks.Task[] tasks = new Task[3]
   {
      Task.Factory.StartNew(() => Task1Worker(gpu0)),
      Task.Factory.StartNew(() => Task2Worker(gpu1)),
      Task.Factory.StartNew(() => Task3Worker(gpu2)),
   };

   // Block until all tasks complete.
   Task.WaitAll(tasks);
}
...</pre>
<p>This code is standard C# code using the <a href="https://msdn.microsoft.com/en-us/library/dd460717(v=vs.110).aspx" target="_blank">Task Parallel Library</a> and contains no NMath Premium specific API calls outside of passing a GPU compute device to each task. The task worker routines have the following simple structure.</p>
<pre lang="csharp">private static void Task1Worker( IComputeDevice cd  )
  {
      BridgeManager.Instance.SetComputeDevice( cd );

      // Do Work here.
  }</pre>
<p>The other two task workers are identical outside of whatever useful computing work they may be doing.</p>
<p>Good luck, and please post any questions in the comments below or email us at support AT centerspace.net and we&#8217;ll get back to you.</p>
<p>Happy Computing,</p>
<p>Paul</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/gpu-math-csharp">NMath Premium&#8217;s new Adaptive GPU Bridge Architecture</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/gpu-math-csharp/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5295</post-id>	</item>
		<item>
		<title>Distributing Parallel Tasks on Multiple GPU&#8217;s</title>
		<link>https://www.centerspace.net/tasks-on-gpu</link>
					<comments>https://www.centerspace.net/tasks-on-gpu#respond</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Wed, 17 Sep 2014 20:50:51 +0000</pubDate>
				<category><![CDATA[NMath Premium]]></category>
		<category><![CDATA[.NET GPU]]></category>
		<category><![CDATA[C# GPU]]></category>
		<category><![CDATA[c# LAPACK GPU]]></category>
		<category><![CDATA[C# Nvidia GPU]]></category>
		<category><![CDATA[math gpu]]></category>
		<category><![CDATA[math gpu csharp]]></category>
		<category><![CDATA[NMath GPU]]></category>
		<category><![CDATA[Offloading to GPU's]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=5397</guid>

					<description><![CDATA[<p><img class="excerpt" alt="NMath Premium" src="/themes/centerspace/images/nmath-premium.png" /> Once Microsoft published the <code>Threading.Tasks</code> library with .NET 4, many programmers who never or only occasionally wrote multi-threaded code began doing so regularly with the <code>Threading.Tasks</code> API.  The Task library reduced the complexity of writing threaded code and provided several new related classes to make the process easier while eliminating some pitfalls.  In this post I'm going to show how to use the Task library with NMath Premium 6.0 to run tasks in parallel on multiple GPUs and the CPU.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/tasks-on-gpu">Distributing Parallel Tasks on Multiple GPU&#8217;s</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>In this post I&#8217;m going to demonstrate how to use the Task Parallel Library with NMath Premium to run tasks in parallel on multiple GPUs and the CPU.  Back in 2010, when Microsoft released .NET 4.0 and the <code>System.Threading.Tasks</code> namespace, many .NET programmers never wrote multi-threaded code, or did so only under duress.  It&#8217;s old news now that the <a href="https://msdn.microsoft.com/en-us/library/dd460693(v=vs.100).aspx" target="_blank">TPL</a> has reduced the complexity of writing threaded code by providing several new classes that make the process easier while eliminating some pitfalls.  Leveraging the TPL API together with NMath Premium is a powerful combination for quickly getting code running on your GPU hardware without the burden of learning complex CUDA programming techniques.</p>
<h2> NMath Premium GPU Smart Bridge</h2>
<p>The NMath Premium 6.0 library is now integrated with a new CPU-GPU hybrid-computing Adaptive Bridge&trade; Technology.  This technology allows users to easily assign specific threads to a particular compute device and manage computational routing between the CPU and multiple on-board GPUs.  Each piece of installed computing hardware is uniformly treated as a compute device and managed in software as an immutable <code>IComputeDevice</code>.  Currently the adaptive bridge allows a single CPU compute device (naturally!) along with any number of NVIDIA GPU devices.  How NMath Premium interacts with each compute device is governed by a <code>Bridge</code> class.  A one-to-one relationship between each <code>Bridge</code> instance and each compute device is enforced.  All of the compute devices and bridges are managed by the singleton <code>BridgeManager</code> class.</p>
<figure id="attachment_5473" aria-describedby="caption-attachment-5473" style="width: 600px" class="wp-caption alignnone"><img decoding="async" src="https://www.centerspace.net/blog/wp-content/uploads/2014/04/Adaptive-Bridge.png" alt="Adaptive Bridge" width="600" class="size-full wp-image-5473" srcset="https://www.centerspace.net/wp-content/uploads/2014/04/Adaptive-Bridge.png 700w, https://www.centerspace.net/wp-content/uploads/2014/04/Adaptive-Bridge-300x186.png 300w" sizes="(max-width: 700px) 100vw, 700px" /><figcaption id="caption-attachment-5473" class="wp-caption-text">Adaptive Bridge</figcaption></figure>
<p>These three classes (the <code>BridgeManager</code>, the <code>Bridge</code>, and the immutable <code>IComputeDevice</code>) form the entire API of the Adaptive Bridge&trade;.  With this API, nearly all programming tasks, such as assigning a particular <code>Action<></code> to a specific GPU, are accomplished in one or two lines of code.  Let&#8217;s look at some code that does just that: run an <code>Action<></code> on a GPU.</p>
<pre lang="csharp">
using CenterSpace.NMath.Matrix;

public void MainProgram( string[] args )
    {
      // Set up an Action<> that runs on an IComputeDevice.
      Action<IComputeDevice, int> worker = WorkerAction;

      // Get the compute device we wish to run our
      // Action<> on - in this case GPU 0.
      IComputeDevice deviceGPU0 = BridgeManager.Instance.GetComputeDevice( 0 );

      // Do work
      worker( deviceGPU0, 9 );
    }

    private void WorkerAction( IComputeDevice device, int input )
    {
      // Assign this thread to the given compute device.
      BridgeManager.Instance.SetComputeDevice( device );

      // Do all the hard work here on the assigned device.
      // Call various GPU-aware NMath Premium routines here.
      FloatMatrix A = new FloatMatrix( 1230, 900, new RandGenUniform( -1, 1, 37 ) );
      FloatSVDecompServer server = new FloatSVDecompServer();
      FloatSVDecomp svd = server.GetDecomp( A );
    }
</pre>
<p>It&#8217;s important to understand that only operations where the GPU has a computational advantage are actually run on the GPU.  So it&#8217;s not as though all of the code in the <code>WorkerAction</code> runs on the GPU; only the operations where it makes sense, such as SVD, QR decomposition, matrix multiply, eigenvalue decomposition, and so forth.  But using this as a code template, you can easily run your own worker several times, passing in a different compute device each time, to compare the computational advantages or disadvantages of using various devices &#8211; including the CPU compute device.</p>
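<p>That comparison might be sketched as follows (assuming the <code>WorkerAction</code> and <code>BridgeManager</code> members from the example above; device -1 is the CPU, and 0 and up are GPUs):</p>
<pre lang="csharp">// Sketch: time the same worker on several compute devices.
foreach ( int deviceNumber in new[] { -1, 0 } )
{
  IComputeDevice device = BridgeManager.Instance.GetComputeDevice( deviceNumber );
  if ( device == null ) continue;  // Skip devices that aren't installed.

  var timer = System.Diagnostics.Stopwatch.StartNew();
  WorkerAction( device, 9 );
  timer.Stop();

  Console.WriteLine( device.DeviceName + ": " + timer.ElapsedMilliseconds + " ms" );
}</pre>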
<p>In the above code example the <code>BridgeManager</code> is used twice: once to get an <code>IComputeDevice</code> reference and once to assign a thread (the <code>Action<>'s</code> thread in this case) to the device.  The <code>Bridge</code> class didn&#8217;t come into play since we implicitly relied on a default bridge being assigned to our compute device of choice.  Relying on the default bridge will likely result in inferior performance, so it&#8217;s best to use a bridge that has been specifically tuned to your NVIDIA GPU.  The following code shows how to accomplish bridge tuning.</p>
<pre lang="csharp">
  // Here we get the bridge associated with GPU device 0.
  var cd = BridgeManager.Instance.GetComputeDevice( 0 );
  var bridge = (Bridge) BridgeManager.Instance.GetBridge( cd );

  // Tune the bridge and save it.  Tuning can take a few minutes.
  bridge.TuneAll( cd, 1200 );
  BridgeManager.Instance.SaveBridge( bridge, "Device0Bridge.bdg" );
</pre>
<p>This bridge tuning is typically a one-time operation per computer, and once done, the tuned bridge can be serialized to disk and then reloaded at application start-up.  If new GPU hardware is installed, this tuning operation should be repeated.  The following code snippet loads a saved bridge and pairs it with a device.</p>
<pre lang="csharp">
  // Load our serialized bridge.
  Bridge bridge = BridgeManager.Instance.LoadBridge( "Device0Bridge.bdg" );
  
  // Now pair this saved bridge with compute device 0.   
  var device0 = BridgeManager.Instance.GetComputeDevice( 0 );
  BridgeManager.Instance.SetBridge( device0, bridge );
</pre>
<p>Once the tuned bridge is assigned to a device, the behavior of all threads assigned to that device will be governed by that bridge.  In a typical application the pairing of bridges to devices is done at start-up and not altered again, while the assignment of threads to devices may be done frequently at runtime.</p>
<p>It&#8217;s interesting to note that beyond optimally routing small and large problems to the CPU and GPU respectively, bridges can be configured to shunt all work to the GPU regardless of problem size.  This is useful for testing, and for offloading work to a GPU when the CPU is taxed.  Even if a particular problem runs slower on the GPU than the CPU, when the CPU is fully occupied, offloading work to an otherwise idle GPU will enhance performance.</p>
<h2> C# Code Example of Running Tasks on Two GPUs </h2>
<p>I&#8217;m going to wrap up this blog post with a complete C# code example which runs a matrix multiplication task simultaneously on two GPUs and the CPU.  The framework of this example uses the TPL and aspects of the adaptive bridge already covered here.  I ran this code on a machine with two NVIDIA GeForce GPUs, a GTX760 and a GT640, and the timing results from this run for executing a large matrix multiplication are shown below.</p>
<pre class="code">
Finished matrix multiply on the GeForce GTX 760 in 67 ms.
Finished matrix multiply on the Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz in 103 ms.
Finished matrix multiply on the GeForce GT 640 in 282 ms.

Finished all double precision matrix multiplications in parallel in 282 ms.
</pre>
<p>The complete code for this example is given in the section below.  In this run we see the GeForce GTX760 easily finished first in 67ms, followed by the CPU, and finally by the GeForce GT640.  It&#8217;s expected that the GeForce GT640 would not do well in this example because it&#8217;s optimized for single-precision work and these matrix multiplies are double precision.  Nevertheless, this example shows it&#8217;s programmatically simple to push work to any NVIDIA GPU, and in a threaded application even a relatively slow GPU can be used to offload work from the CPU.  Also note that the entire program ran in 282ms &#8211; the time required to finish the matrix multiply on the slowest hardware &#8211; verifying that all three tasks did run in parallel and that there was very little overhead in using the TPL or the Adaptive Bridge&trade;.</p>
<p>Below is a snippet of the NMath Premium log file generated during the run above.</p>
<pre class="code">
	Time 		        tid   Device#  Function    Device Used    
2014-04-28 11:22:47.417 AM	10	0	dgemm		GPU
2014-04-28 11:22:47.421 AM	15	1	dgemm		GPU
2014-04-28 11:22:47.425 AM	13	-1	dgemm		CPU
</pre>
<p>We can see here that three threads were created nearly simultaneously with thread ids of 10, 15, &#038; 13, and that the first two threads ran their matrix multiplies (dgemm) on GPUs 0 and 1 while the last thread, 13, ran on the CPU.  As a matter of convention the CPU device number is always -1 and all GPU device numbers are integers 0 and greater.  Typically device number 0 is assigned to the fastest installed GPU, and that is the default GPU used by NMath Premium.  </p>
<p>-Paul</p>
<h3> TPL Tasks on Multiple GPUs C# Code </h3>
<pre lang="csharp">
public void GPUTaskExample()
    {
     
      NMathConfiguration.Init();

      // Set up a string writer for logging
      using ( var writer = new System.IO.StringWriter() )
      {

        // Enable the CPU/GPU bridge logging
        BridgeManager.Instance.EnableLogging( writer );

        // Get the compute devices we wish to run our tasks on - in this case 
        // two GPU's and the CPU.
        IComputeDevice deviceGPU0 = BridgeManager.Instance.GetComputeDevice( 0 );
        IComputeDevice deviceGPU1 = BridgeManager.Instance.GetComputeDevice( 1 );
        IComputeDevice deviceCPU = BridgeManager.Instance.CPU;

        // Build some matrices
        var A = new DoubleMatrix( 1200, 1400, 0, 1 );
        var B = new DoubleMatrix( 1400, 1300, 0, 1 );

        // Build the task array and assign matrix multiply jobs and compute devices
        // to those tasks.  Any number of tasks can be added here and any number 
        // of tasks can be assigned to a particular device.
        Stopwatch timer = new Stopwatch();
        timer.Start();
        System.Threading.Tasks.Task[] tasks = new Task[3]
        {
          Task.Factory.StartNew(() => MatrixMultiply(deviceGPU0, A, B)),
          Task.Factory.StartNew(() => MatrixMultiply(deviceGPU1, A, B)),
          Task.Factory.StartNew(() => MatrixMultiply(deviceCPU, A, B)),
        };

        // Block until all tasks complete
        Task.WaitAll( tasks );
        timer.Stop();
        Console.WriteLine( "Finished all double precision matrix multiplications in parallel in " + timer.ElapsedMilliseconds + " ms.\n" );

        // Dump the log file for verification.
        Console.WriteLine( writer );

        // Quit logging
        BridgeManager.Instance.DisableLogging();
      
      }
    }

    private static void MatrixMultiply( IComputeDevice device, DoubleMatrix A, DoubleMatrix B )
    {
      // Place this thread to the given compute device.
      BridgeManager.Instance.SetComputeDevice( device );

      Stopwatch timer = new Stopwatch();
      timer.Start();

      // Do this task work.
      NMathFunctions.Product( A, B );

      timer.Stop();
      Console.WriteLine( "Finished matrix multiply on the " + device.DeviceName  + " in " + timer.ElapsedMilliseconds + " ms.\n" );
    }
    
</pre>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/tasks-on-gpu">Distributing Parallel Tasks on Multiple GPU&#8217;s</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/tasks-on-gpu/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5397</post-id>	</item>
		<item>
		<title>Detecting and Configuring your GPU for Computation</title>
		<link>https://www.centerspace.net/detecting-and-configuring-your-gpu-for-computation</link>
					<comments>https://www.centerspace.net/detecting-and-configuring-your-gpu-for-computation#respond</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Wed, 19 Jun 2013 15:30:03 +0000</pubDate>
				<category><![CDATA[NMath Premium]]></category>
		<category><![CDATA[C# GPU]]></category>
		<category><![CDATA[C# Nvidia GPU]]></category>
		<category><![CDATA[deviceQuery]]></category>
		<category><![CDATA[GPU Setup]]></category>
		<category><![CDATA[NVIDIA deviceQuery]]></category>
		<category><![CDATA[NVIDIA GPU drivers]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=4591</guid>

					<description><![CDATA[<p>Before evaluating NMath Premium or any other GPU-aware software you need to know what type of hardware you have and verify that the correct drivers are installed. There are two quick ways of detecting your NVIDIA GPU and viewing its hardware specifications.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/detecting-and-configuring-your-gpu-for-computation">Detecting and Configuring your GPU for Computation</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<h2> Detecting your GPU </h2>
<p>Before evaluating NMath Premium or any other GPU-aware software you need to know what type of hardware you have and verify that the correct drivers are installed. There are two quick ways of detecting your NVIDIA GPU and viewing its hardware specifications.</p>
<ol>
<li>
The majority of NVIDIA GPUs installed in desktop computers are there acting as high-performance video rendering hardware.  You can quickly see if you have an NVIDIA GPU installed by opening the Windows <em>Device Manager</em> (right-click on Computer from the Start menu, select Properties, and click on Device Manager).  Once in the <em>Device Manager</em>, open the Display adapters entry in the tree menu and a list of installed devices will be shown. On my development machine I see one display adapter listed as &#8220;NVIDIA GeForce GT 640&#8221;.  </p>
<table>
<tr>
<td>
<a href="https://www.centerspace.net/blog/wp-content/uploads/2013/06/DeviceManager.png"><img decoding="async" loading="lazy" src="https://www.centerspace.net/blog/wp-content/uploads/2013/06/DeviceManager.png" alt="My device manager" width="324" height="366" class="alignnone size-full wp-image-4608" srcset="https://www.centerspace.net/wp-content/uploads/2013/06/DeviceManager.png 324w, https://www.centerspace.net/wp-content/uploads/2013/06/DeviceManager-265x300.png 265w" sizes="(max-width: 324px) 100vw, 324px" /></a>
</td>
</tr>
</table>
<p>Multiple display adapters can be installed, and it&#8217;s important to note that <em>NMath Premium</em> currently only runs on the <code>device 0</code> adapter.  By right-clicking on a listed display adapter more details are provided, including the driver version and device number.  Display adapters (GPUs) can be individually enabled or disabled from the right-click context menu.
</li>
<li>
NVIDIA provides a freely available GPU device query program called <code>DeviceQuery.exe</code> that gives a detailed list of the features of all installed GPUs.  CenterSpace ships a version of this program with the GPU-aware <em>NMath Premium</em> product.  When I run this program on my development machine I get the following:</p>
<pre class="code">
CenterSpace Software NMath Premium Check...

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 640"
  CUDA Driver Version / Runtime Version          5.5 / 5.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 1024 MBytes (107...
  ( 2) Multiprocessors x (192) CUDA Cores/MP:    384 CUDA Cores
  GPU Clock rate:                                954 MHz (0.95 GHz)
  Memory Clock rate:                             2500 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65...
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048,...
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 655...
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy e...
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() 
       with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, 
CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GT 640
</pre>
<p>Note that <code>deviceQuery</code> has a CUDA library dependency (on the CUDART64 runtime DLL) and so will complain unless <em>NMath Premium</em> is installed.  Probably the most important items to note are at the bottom of the listing, where the CUDA driver and CUDA runtime versions are given.  Currently <em>NMath Premium</em> requires a CUDA driver of at least 5.0 and a CUDA runtime of the same version or higher.  I&#8217;ll describe the simple process of upgrading your driver in the following section.
</li>
</ol>
<h2> NVIDIA GPU Drivers </h2>
<p>NVIDIA has made upgrading to the latest driver simple, and you&#8217;ll need to do this upgrade to use <em>NMath Premium</em> if your CUDA driver is below 5.0.  If you have a GeForce GPU, just point your browser at <a href="http://www.geforce.com/drivers" title="http://www.geforce.com/drivers">http://www.geforce.com/drivers</a> and the site can automatically detect your hardware and download the latest correct driver.  Alternatively, the latest drivers for all NVIDIA hardware are available for download <a href="http://www.nvidia.com/Download/index.aspx?lang=en-us" title="NVIDIA driver download page">here</a>.</p>
<h2> Compute Capability  </h2>
<p>NVIDIA has classified its various hardware architectures under the moniker of <em>Compute Capability</em>.  The higher a GPU&#8217;s compute capability number, the more modern its architecture.  Most software leveraging NVIDIA GPUs requires some minimum compute capability to run correctly, and <em>NMath Premium</em> is no different: it requires a GPU with a compute capability of 1.3 or higher.  All of NVIDIA&#8217;s GPUs are listed <a href="https://developer.nvidia.com/cuda-gpus">here</a> along with their compute capability numbers.  The <code>deviceQuery</code> program also lists each installed GPU&#8217;s compute capability near the head of the listing, under <code>CUDA Capability Major/Minor version number</code> (see above).</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/detecting-and-configuring-your-gpu-for-computation">Detecting and Configuring your GPU for Computation</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/detecting-and-configuring-your-gpu-for-computation/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4591</post-id>	</item>
		<item>
		<title>Offloading Computation to your GPU</title>
		<link>https://www.centerspace.net/offloading-computation-to-your-gpu</link>
					<comments>https://www.centerspace.net/offloading-computation-to-your-gpu#respond</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Thu, 13 Jun 2013 22:32:35 +0000</pubDate>
				<category><![CDATA[NMath Premium]]></category>
		<category><![CDATA[C# GPU]]></category>
		<category><![CDATA[C# Nvidia GPU]]></category>
		<category><![CDATA[Offloading to GPU]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=4473</guid>

					<description><![CDATA[<p>Large computational problems are offloaded onto a GPU because the problems run substantially faster on the GPU than on the CPU. By leveraging the innate parallelism of the GPU overall performance of the application is improved. (For example, see here and here.) However a second collateral benefit of moving computation to the GPU is the [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/offloading-computation-to-your-gpu">Offloading Computation to your GPU</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
					<content:encoded><![CDATA[<p>Large computational problems are offloaded onto a GPU because they run substantially faster on the GPU than on the CPU. By leveraging the innate parallelism of the GPU, the overall performance of the application is improved. (For example, see <a href="https://web.archive.org/web/20160329160744/http://www.centerspace.net:80/nmath-premium-gpu-accelerated-performance-test">here</a> and <a href="/nmath-premium-fft-performance/">here</a>.) However, a second, collateral benefit of moving computation to the GPU is the resulting offloading of work from the CPU. Until the advent of tools like <strong>NMath Premium</strong>, this benefit was seldom discussed because of the complexity of programming the GPU; raw GPU performance has been the focus, but for desktop users the ability to offload work to a second, underutilized processor is often just as important. In this post I&#8217;ll present a code example that provides a simple task-queuing model that can asynchronously offload work to the GPU and return results without writing any specialized GPU code.</p>
<h2>Offloading to Your GPU</h2>
<p>Frequently data processing applications have a tripartite structure &#8211; data flows in from a disk or the network, it is computationally processed, and finally the results are analyzed and exported. Each of these stages carries a different computational load and each can proceed independently. In the code example below, this common structure is mirrored in three asynchronous tasks, one for each stage, linked by two queues. We want to compute a stream of 2D FFTs and would like to offload that work to the GPU, freeing the CPU for more analysis.</p>
<pre lang="csharp">    public void ThreadedGPUFFTExample()
    {
      // Uncomment to force all work onto the CPU, or to log GPU activity.
      //NMathConfiguration.ProcessorSharingMethod = ProcessorManagement.CPU;
      //NMathConfiguration.EnableGPULogging = true;

      Stopwatch timer = new Stopwatch();
      timer.Reset();

      // Off-load all FFT work to the GPU.
      var fftLength = 3000;
      FloatComplexForward2DFFT fftEngine = 
          new FloatComplexForward2DFFT( fftLength, fftLength );

      // Note: Queue&lt;T&gt; is not thread-safe; this polling example relies on a
      // single producer and single consumer per queue. A production version
      // should use ConcurrentQueue&lt;T&gt; or explicit locking.
      Queue&lt;FloatComplexMatrix&gt; dataInQ = new Queue&lt;FloatComplexMatrix&gt;( 2 );
      Queue&lt;FloatComplexMatrix&gt; dataOutQ = new Queue&lt;FloatComplexMatrix&gt;( 10 );

      var jobBlockCount = 10;

      // Start up threaded tasks that each monitor their respective queues.
      var fftTask = Task.Factory.StartNew( () 
          =&gt; GPUFFTWorker( jobBlockCount, fftEngine, dataInQ, dataOutQ ) );
      var cpuTask = Task.Factory.StartNew( () 
          =&gt; CPUWorker( jobBlockCount, dataOutQ ) );
      var cpuDataReaderTask = Task.Factory.StartNew( () 
          =&gt; CPUDataReader( jobBlockCount, dataInQ ) );

      timer.Start();
      cpuTask.Wait();  // Wait until we are finished with the jobs.
      timer.Stop();

      Console.WriteLine( String.Format( "\n * Tasks required {0} ms for {1} jobs. ", timer.ElapsedMilliseconds, jobBlockCount ) );
    }</pre>
<p>This is the main body of our example, where two queues are set up to pass data between the three tasks, <code>GPUFFTWorker()</code>, <code>CPUWorker()</code>, &amp; <code>CPUDataReader()</code>. The data stored in the queues are <code>FloatComplexMatrix</code> instances, but any type or data structure could be used as needed. Here our main GPU task is computing a series of 2D FFTs, so 2D arrays are passed in the queues. Once the three tasks are started, we simply wait for the main CPU task to finish all of the analysis, print a message, and exit.</p>
<p>The three worker tasks are simple routines that poll their queues for incoming work and exit once their 10 jobs have been completed. The code is provided at the bottom of this article.</p>
<h2>Measuring the offloading</h2>
<p>Running this example as shown above, computing ten 3000&#215;3000 2D FFTs, we see the following output.</p>
<pre class="code">Enqueued data for job #10 
  Finished FFT on GPU for job 10.
  Dequeued spectrum 10 for analysis 
Enqueued data for job #9 
  Finished FFT on GPU for job 9.
  Dequeued spectrum 9 for analysis 
Enqueued data for job #8 
  Finished FFT on GPU for job 8.
  Dequeued spectrum 8 for analysis 
Enqueued data for job #7 
  Finished FFT on GPU for job 7.
Enqueued data for job #6 
  Dequeued spectrum 7 for analysis 
  Finished FFT on GPU for job 6.
Enqueued data for job #5 
  Finished FFT on GPU for job 5.
  Dequeued spectrum 6 for analysis 
Enqueued data for job #4 
  Finished FFT on GPU for job 4.
  Dequeued spectrum 5 for analysis 
Enqueued data for job #3 
  Finished FFT on GPU for job 3.
Enqueued data for job #2 
  Finished FFT on GPU for job 2.
  Dequeued spectrum 4 for analysis 
Enqueued data for job #1 
 * Finished loading all requested datasets.
  Finished FFT on GPU for job 1.
 * Finished all 2D FFT's.
  Dequeued spectrum 3 for analysis 
  Dequeued spectrum 2 for analysis 
  Dequeued spectrum 1 for analysis 

 * Tasks required 14148 ms for 10 jobs.</pre>
<p>This output shows that the three tasks are indeed running asynchronously and that the final analysis in <code>CPUWorker</code> can&#8217;t quite keep up with the two upstream tasks. To measure how much work we are offloading to the GPU, we need to run this example twice, computing the 2D FFTs first on the GPU and then on the CPU, and compare the CPU spark charts in the resource monitor. If we are successfully offloading work to the GPU, we should see substantially lower CPU loading while the GPU computes the 2D FFTs. We control where the computation runs by including or commenting out the first line of code in our example.</p>
<pre lang="csharp">//NMathConfiguration.ProcessorSharingMethod = ProcessorManagement.CPU;</pre>
<p>If this line of code is commented out, the default processor sharing method of <code>ProblemSize</code> is used, which causes our large 2D FFTs to be shunted over to the GPU. If this line is included, all processing is done on the CPU alone.</p>
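<p>For reference, the two processor-management settings used in this example can be written out side by side. This is only a sketch; <code>ProblemSize</code> and <code>CPU</code> are the <code>ProcessorManagement</code> values shown in this post, and any other values should be checked against the <em>NMath User&#8217;s Guide</em>.</p>
<pre lang="csharp">// Default: route each computation by problem size (large work goes to the GPU).
NMathConfiguration.ProcessorSharingMethod = ProcessorManagement.ProblemSize;

// Debugging/profiling: force all computation onto the CPU, bypassing the GPU.
NMathConfiguration.ProcessorSharingMethod = ProcessorManagement.CPU;</pre>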
<p>The following two images were plucked from my resource monitor after a complete run of 30 2D FFT jobs.</p>
<p>Offloading measurement by monitoring CPU loading</p>
<table border="1">
<thead></thead>
<tbody>
<tr>
<td>
<figure id="attachment_4509" aria-describedby="caption-attachment-4509" style="width: 196px" class="wp-caption alignnone"><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/06/CPU-Load.png"><img decoding="async" loading="lazy" class="size-full wp-image-4509" alt="CPU load while running FFT's on CPU" src="https://www.centerspace.net/blog/wp-content/uploads/2013/06/CPU-Load.png" width="196" height="466" srcset="https://www.centerspace.net/wp-content/uploads/2013/06/CPU-Load.png 196w, https://www.centerspace.net/wp-content/uploads/2013/06/CPU-Load-126x300.png 126w" sizes="(max-width: 196px) 100vw, 196px" /></a><figcaption id="caption-attachment-4509" class="wp-caption-text">CPU load while running FFT&#8217;s on CPU</figcaption></figure></td>
<td>
<figure id="attachment_4510" aria-describedby="caption-attachment-4510" style="width: 197px" class="wp-caption alignnone"><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/06/GPU-Load.png"><img decoding="async" loading="lazy" class="size-full wp-image-4510" alt="CPU load while running FFT's on GPU" src="https://www.centerspace.net/blog/wp-content/uploads/2013/06/GPU-Load.png" width="197" height="466" /></a><figcaption id="caption-attachment-4510" class="wp-caption-text">CPU load while running FFT&#8217;s on GPU</figcaption></figure></td>
</tr>
</tbody>
</table>
<p>I ran these two experiments on my 4-core hyper-threaded i7 desktop using an NVIDIA GeForce 640 GPU. This particular GPU shipped standard with my Dell computer and is commonly found in many performance desktops. Clearly, shifting the 2D FFTs to the GPU offloads a lot of work from my CPU cores; in fact, CPU 7 and CPU 4 were completely parked (shut down) during the entire run, and CPU 3 barely lifted a finger. A natural next step would be to thread the CPU-analysis portion of our code to leverage these idle cores.</p>
<p>&#8211; Happy Computing,</p>
<p>Paul</p>
<h3>Worker Code</h3>
<pre lang="csharp">    // CPUDataReader is responsible for gathering the data.
    private void CPUDataReader( int jobCounter, Queue&lt;FloatComplexMatrix&gt; dataIn )
    {
      while ( jobCounter &gt; 0 )
      {
        // Read the initial data set from disk and load it into memory for
        // each job.  Here this is simulated with a random matrix.
        FloatComplexMatrix data = new FloatComplexMatrix( 3000, 3000, 
          new RandGenNormal( 0.0, 1.0, 445 + jobCounter ) );

        dataIn.Enqueue( data );

        Console.WriteLine( String.Format( "Enqueued data for job #{0} ", jobCounter ) );

        jobCounter--;
      }

      Console.WriteLine( " * Finished loading all requested datasets." );
    }

    // GPUFFTWorker is responsible for computing the stream of 2D FFTs.
    private void GPUFFTWorker( int jobCounter, FloatComplexForward2DFFT fftEngine,   
      Queue&lt;FloatComplexMatrix&gt; dataIn, Queue&lt;FloatComplexMatrix&gt; dataOut )
    {
      FloatComplexMatrix signal;

      // Monitor the job queue and execute the FFTs as data becomes available.
      while ( jobCounter &gt; 0 )
      {
        if ( dataIn.Count &gt; 0 )
        {
          signal = dataIn.Dequeue();

          fftEngine.FFTInPlace( signal );

          Console.WriteLine( String.Format( "  Finished FFT on GPU for job {0}.", jobCounter ) );

          dataOut.Enqueue( signal );

          jobCounter--;
        }
      }

      Console.WriteLine( " * Finished all 2D FFT's." );
    }

    // CPUWorker is responsible for the post-analysis of the data.
    private void CPUWorker( int jobCounter, Queue&lt;FloatComplexMatrix&gt; dataOut )
    {
      while ( jobCounter &gt; 0 )
      {
        if ( dataOut.Count &gt; 0 )
        {
          FloatComplexMatrix fftSpectrum = dataOut.Dequeue();

          Console.WriteLine( String.Format( "  Dequeued spectrum {0} for analysis ", jobCounter ) );

          // Compute the magnitude of the FFT.
          FloatMatrix absFFT = NMathFunctions.Abs( fftSpectrum );

          // Find spectral peaks, write out results, ...

          jobCounter--;
        }
      }
    }</pre>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/offloading-computation-to-your-gpu">Offloading Computation to your GPU</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/offloading-computation-to-your-gpu/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4473</post-id>	</item>
		<item>
		<title>NMath Premium Tuning</title>
		<link>https://www.centerspace.net/nmath-premium-tuning</link>
					<comments>https://www.centerspace.net/nmath-premium-tuning#respond</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Fri, 31 May 2013 15:49:21 +0000</pubDate>
				<category><![CDATA[NMath Premium]]></category>
		<category><![CDATA[C# GPU]]></category>
		<category><![CDATA[C# Nvidia GPU]]></category>
		<category><![CDATA[NMath GPU]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=4310</guid>

					<description><![CDATA[<p>NMath Premium is designed to be a near drop-in replacement for NMath. However, there are a few important configuration differences and additional logging capabilities that are specific to the premium product.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/nmath-premium-tuning">NMath Premium Tuning</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><strong>NMath Premium</strong> is the new CenterSpace GPU-accelerated math and statistics library for the .NET platform. The supported NVIDIA GPU routines include both a range of dense linear algebra algorithms and 1D &amp; 2D Fast Fourier Transforms (FFTs). <strong>NMath Premium</strong> is designed to be a near drop-in replacement for <strong>NMath</strong>; however, there are a few important configuration differences and additional logging capabilities specific to the premium product that I will discuss in this article.</p>
<p><strong>NMath Premium</strong> will be released June 11. For immediate access, sign up <a href="https://www.centerspace.net/nmath-premium/">here</a> to join the beta program.</p>
<h2>Crossover Thresholds</h2>
<p><strong>NMath Premium</strong> makes it very easy to take advantage of the GPU&#8217;s performance benefits by hiding the complexities of data formatting, GPU memory management, algorithms, and diagnostics. Because there is a memory-transfer overhead for any GPU computation, <strong>NMath Premium</strong> automatically routes computations between the CPU and GPU as appropriate for best performance. Additionally, there are configuration options that can globally force all computations (regardless of problem size) to either the CPU or GPU, which can be useful for debugging or performance profiling. Let&#8217;s continue with an example.</p>
<pre lang="csharp">   NMathConfiguration.ProcessorSharingMethod = ProcessorManagement.ProblemSize;

   FloatComplexForward1DFFT fft = new FloatComplexForward1DFFT( 1024*1024 );
   FloatComplexVector signal = new FloatComplexVector( 1024*1024, new RandGenUniform( -1, 1, seed) );

   fft.FFTInPlace( signal ); // Execute the million point FFT</pre>
<p>The first line directs <strong>NMath Premium</strong> to route GPU-enabled routines automatically between the CPU and GPU depending on problem size. Small problems remain on the CPU and large problems are off-loaded to the GPU. The <code>ProblemSize</code> setting is the default behavior, so this line of code is not strictly required. The last three lines of code, which build the FFT object, populate the random signal vector, and execute the million-point FFT, are standard <strong>NMath</strong> code.<em> Except for configuration options, the <strong>NMath Premium</strong> API is unchanged from <strong>NMath</strong>.</em></p>
<p>The problem-size cross-over threshold can be tuned for every GPU-enabled algorithm. The optimal cross-over threshold depends primarily on the computational precision of the problem (<code>Double</code> or <code>Float</code>) and the installed hardware. Frequently applications need to solve similarly sized problems repeatedly, and the threshold can be adjusted to place the computation where needed. By default, 1D FFTs with a length over 16384 execute on the GPU, as do 2D FFTs larger than 256&#215;256.</p>
<p>The cross-over threshold for any GPU-enabled algorithm can be set with the following code.</p>
<pre lang="csharp">NMathConfiguration.SetCrossoverThreshold( 
   NMathConfiguration.GraphicsProcessorFunctions.FFT1D, 2000);
   ... 
// Now execute a 1D FFT on a 2100 point signal on the GPU.
fft = new FloatComplexForward1DFFT( 2100 );
signal = new FloatComplexVector( 2100, new RandGenUniform( -1, 1, seed) );
fft.FFTInPlace( signal ); // Execute the 2100 point FFT</pre>
<p>With this setting, all (complex) 1D FFTs with a length greater than 2000 will execute on the GPU.</p>
<h2>Logging and Troubleshooting</h2>
<p>Because <strong>NMath Premium</strong> automatically falls back to CPU execution if there are any problems with the installed NVIDIA GPU (or if no NVIDIA GPU is installed at all), we often found ourselves wanting to verify that our code was actually executing on the GPU. To verify that our small 2100-point FFT did indeed run on the GPU, we can enable GPU logging, run the example, and then check the log file, <code>NMathConfiguration.log</code>. The log file will reside next to the executable unless the <code>LogLocation</code> property has been set to a different directory. The following line of code enables GPU logging.</p>
<pre lang="csharp">NMathConfiguration.EnableGPULogging = true;</pre>
<p>Logging should only be used while debugging and must be turned on before any <strong>NMath</strong> classes are created. In the current release of <strong>NMath Premium</strong>, logging cannot be turned on or off dynamically; either the entire program logs or none of it does. This will change in a future release to allow specific sections of code to create log entries.</p>
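<p>Putting the ordering rule above into a small sketch, using only the configuration property and an FFT class shown in this series:</p>
<pre lang="csharp">// Enable logging first, before any NMath classes are constructed...
NMathConfiguration.EnableGPULogging = true;

// ...and only then create and use NMath objects; their GPU activity is logged.
FloatComplexForward1DFFT fft = new FloatComplexForward1DFFT( 2100 );</pre>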
<p>Running our 2100-point FFT above, we see the following entries near the end of the log file (many lines have been trimmed from the head of the log file for clarity).</p>
<pre class="code">   ...
Instantiating GPUManagerKernel: class CenterSpace.NMath.Kernel.GPUKern....

GPU Kernel: GeForce GT 525M CUDA hardware installed and ready to use by NMath Premium.
GPU Kernel: CUDA Driver Version 5.0 detected.
GPU Kernel: CUDA Runtime Version 5.0 detected.

Instantiating FFTManagerKernelGPU: class CenterSpace.NMath.Kernel.FFTMan....
<strong>NMath created GPU executing 1-D, FLOAT REAL, 2100-point FFT object.</strong>
Instantiating FFTKernelInstantiator: class CenterSpace.NMath.Kernel.FF.....</pre>
<p>The boldface line (bold added) reports that we have successfully created an FFT object that will execute its 2100-point FFTs on the GPU. Every time such a GPU-active FFT object is created, a similar line is added to the log file. The three lines starting with &#8220;GPU Kernel:&#8221; report the type of GPU hardware found and that the correct NVIDIA CUDA driver and runtime were detected. If any hardware or driver configuration problems prevent <strong>NMath Premium</strong> from using the GPU, the errors will be reported in this section of the log file. Additional GPU hardware and driver setup information can be found by running a diagnostic program, <code>deviceQuery.exe</code>, bundled with <strong>NMath Premium</strong> (found in the Assemblies/x64 and Assemblies/x86 directories).</p>
<h2>Summary</h2>
<p>With a few lines of code, .NET developers can now write optimally executing GPU software with <strong>NMath Premium</strong>. Applications currently using <strong>NMath</strong> can easily be accelerated by installing <strong>NMath Premium</strong> with few if any code changes. Small problems remain on the CPU, large problems are routed to the GPU, and the programmer has control over the cross-over thresholds for all GPU-enabled classes in <strong>NMath Premium</strong>. A logging capability is provided to help diagnose any GPU hardware or driver issues and to verify that your FFTs are executing on the installed NVIDIA GPU.</p>
<p>For more information on <strong>NMath Premium</strong> tuning, see the chapter on NMath Premium in the <em>NMath User&#8217;s Guide</em>.</p>
<p>Happy Computing,</p>
<p>-Paul Shirkey</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/nmath-premium-tuning">NMath Premium Tuning</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/nmath-premium-tuning/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4310</post-id>	</item>
		<item>
		<title>NMath Premium: FFT Performance</title>
		<link>https://www.centerspace.net/nmath-premium-fft-performance</link>
					<comments>https://www.centerspace.net/nmath-premium-fft-performance#respond</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Tue, 28 May 2013 16:00:29 +0000</pubDate>
				<category><![CDATA[NMath Premium]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[C#]]></category>
		<category><![CDATA[C# GPU]]></category>
		<category><![CDATA[C# Nvidia GPU]]></category>
		<category><![CDATA[GPU]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=4212</guid>

					<description><![CDATA[<p><img class="excerpt" title="Double Precision FFT" src="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-51.png" alt="NMath Premium" /><br />
NMath Premium is CenterSpace Software's NVIDIA GPU-accelerated edition of the NMath math and statistics library. Many linear algebra and signal processing algorithms can now run on a local NVIDIA GPU processor, frequently realizing several multiples of performance gain. In this post, we look at  the performance of complex to complex forward 1D and 2D FFT's.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/nmath-premium-fft-performance">NMath Premium: FFT Performance</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><strong>NMath Premium</strong> is our new GPU-accelerated math and statistics library for the .NET platform. The supported NVIDIA GPU routines include both a range of dense linear algebra algorithms and 1D and 2D Fast Fourier Transforms (FFTs). NMath Premium is designed to be a near drop-in replacement for NMath; however, there are a few important differences and additional logging capabilities that are specific to the premium product.</p>
<p><strong>NMath Premium</strong> will be released June 11. For immediate access, sign up <a href="https://www.centerspace.net/nmath-premium/">here</a> to join the beta program.</p>
<h2>Benchmark Approach</h2>
<p>Modern FFT implementations are hybridized algorithms that switch between algorithmic approaches and processing kernels depending on the available hardware, FFT type, and FFT length. An FFT library may use the straight Cooley-Tukey algorithm for a short power-of-two FFT but switch to Bluestein&#8217;s algorithm for lengths with large prime factors. Further, depending on the factors of the FFT length, different combinations of processing kernels may be used. In other words, there is no single &#8216;FFT algorithm&#8217;, and so there is no easy expression for the FLOPs completed per FFT computed. Therefore, when analyzing the performance of FFT libraries today, performance is often reported <em>relative to the Cooley-Tukey implementation</em>, with the FLOPs estimated at <code>5 * N * log( N )</code>. This relative performance is reported here. As an example, if we report a performance of 10 GFLOPs for a particular FFT, that means that to match its performance (finish as quickly) with an implementation of the Cooley-Tukey algorithm, you&#8217;d need a machine capable of 10 GFLOPs.</p>
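<p>To make the metric concrete, here is a small sketch of how such a relative GFLOPs figure is computed from a measured run time. The FFT length and timing below are purely illustrative, not measurements from this article.</p>
<pre lang="csharp">// Relative GFLOPs for an N-point FFT, using the conventional
// Cooley-Tukey operation count of 5 * N * log2( N ).
long n = 1048576;                // a 2^20-point FFT (illustrative)
double elapsedSeconds = 0.005;   // hypothetical measured wall-clock time
double flops = 5.0 * n * Math.Log( n, 2 );
double relativeGflops = flops / elapsedSeconds / 1e9;  // ~21 relative GFLOPs</pre>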
<p>Because GPU computation takes place in a different memory space from the CPU, all data must be copied to the GPU and the results then copied back to the CPU. This copy time overhead <em>is included in all reported performance numbers.</em> We include this copy time to give our library users an accurate picture of attainable performance.</p>
<h3>GPU&#8217;s Tested</h3>
<p>The <strong>NMath Premium</strong> 1D and 2D FFT library was tested on four different NVIDIA GPUs and a 4-core 2.0 GHz Intel i7. These models represent the current range of performance available from NVIDIA, ranging from the widely installed GeForce GTX 525 to NVIDIA&#8217;s fastest double-precision GPU, the Tesla K20.</p>
<table>
<tbody>
<tr>
<th>GPU</th>
<th>Peak GFLOP (single / double)</th>
<th>Summary</th>
</tr>
<tr>
<td>Tesla K20</td>
<td>3510 / 1170</td>
<td>Optimized for applications requiring double precision performance such as computational physics, biochemistry simulations, and computational finance.</td>
</tr>
<tr>
<td>Tesla K10</td>
<td>2288 / 95</td>
<td>This is a dual GPU processor card optimized for single precision performance for applications such as seismic and video or image processing. If both GPU cores are maximally utilized these GFLOP numbers would double.</td>
</tr>
<tr>
<td>Tesla 2090</td>
<td>1331 / 655</td>
<td>A single core GPU with a more balanced single and double precision performance.</td>
</tr>
<tr>
<td>GeForce 525</td>
<td>230 / &#8211;</td>
<td>A single core consumer GPU found in many gaming computers.</td>
</tr>
</tbody>
</table>
<h2>FFT Performance Charts</h2>
<p>The four charts below show the performance of various power-of-two-length, complex-to-complex forward 1D and 2D FFTs. All <strong>NMath</strong> products also seamlessly compute non-power-of-two-length FFTs, but their performance is not part of this GPU comparison note.</p>
<p>The CPU-bound 1D FFT outperformed all of the GPUs for relatively short FFT lengths. This is expected: at these sizes the superior parallelism of the GPUs cannot be realized because of the data-transfer overhead. Once the computational complexity of the 1D FFT is high enough, the data-transfer overhead is outweighed by the efficient parallel nature of the GPUs, and they start to overtake the CPU-bound 1D FFTs. This cross-over point occurs when the FFT reaches a length near 65536. The exception is the consumer-level GeForce GTX 525, whose FFT performance roughly tracks the CPU&#8217;s.</p>
<p>The 2D FFT case is different because of the higher computational demand of the two-dimensional problem. First, in the single-precision case we see the weakness of the NVIDIA K20, which is designed primarily as a double-precision computation engine; here the CPU-bound FFT outperforms the K20 for all image sizes. However, the K10 and 2090 are extremely fast (including the data-transfer time) and outperform the CPU-bound 2D FFT by approximately 60-70%. In the double-precision 2D FFT case, the K20 outperforms all other processors in nearly all cases measured. The tested K20 was memory-limited in the [ 8192 x 8192 ] test case and couldn&#8217;t complete the computation.</p>
<table>
<tbody>
<tr>
<td><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-61.png"><img decoding="async" title="Performance of single precision FFT" alt="Performance of single precision FFT" src="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-61.png" width="350" /></a></td>
<td><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-51.png"><img decoding="async" title="Performance of double precision FFT" alt="Performance of double precision FFT" src="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-51.png" width="350" /></a></td>
</tr>
<tr>
<td><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-12.png"><img decoding="async" class="alignnone size-full wp-image-4231" title="Performance or single precision 2D FFT" alt="Performance or single precision 2D FFT" src="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-12.png" width="350" srcset="https://www.centerspace.net/wp-content/uploads/2013/02/ScreenClip-12.png 800w, https://www.centerspace.net/wp-content/uploads/2013/02/ScreenClip-12-300x262.png 300w" sizes="(max-width: 800px) 100vw, 800px" /></a></td>
<td><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-13.png"><img decoding="async" class="alignnone size-full wp-image-4232" title="Performance of double precision 2D FFT" alt="Performance of double precision 2D FFT" src="https://www.centerspace.net/blog/wp-content/uploads/2013/02/ScreenClip-13.png" width="350" srcset="https://www.centerspace.net/wp-content/uploads/2013/02/ScreenClip-13.png 800w, https://www.centerspace.net/wp-content/uploads/2013/02/ScreenClip-13-300x262.png 300w" sizes="(max-width: 800px) 100vw, 800px" /></a></td>
</tr>
</tbody>
</table>
<h3>Batch FFT</h3>
<p>To amortize the cost of data transfer to and from the GPU, <strong>NMath Premium</strong> can run FFTs in batches of signal arrays. For the smaller FFT sizes, batch processing nearly doubles the performance of the FFT on the GPU. As the length of the FFT increases, the advantage of batch processing decreases because the full set of signal arrays can no longer be loaded into GPU memory at once.</p>
<p><a href="https://www.centerspace.net/blog/wp-content/uploads/2013/02/BatchFFT.png"><img decoding="async" class="alignnone size-full wp-image-4259" title="Performance of batch 1D FFT" alt="" src="https://www.centerspace.net/blog/wp-content/uploads/2013/02/BatchFFT.png" width="350" srcset="https://www.centerspace.net/wp-content/uploads/2013/02/BatchFFT.png 800w, https://www.centerspace.net/wp-content/uploads/2013/02/BatchFFT-300x262.png 300w" sizes="(max-width: 800px) 100vw, 800px" /></a></p>
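<p>The batching idea itself is library-agnostic: transform many signals, stored one per row of a matrix, in a single call rather than one call per signal, so the fixed per-call overhead (on a GPU, chiefly the host-to-device transfer) is paid once for the whole batch. The NumPy sketch below illustrates only this concept; it is not NMath Premium's API, and the transfer cost noted in the comments is hypothetical since NumPy runs on the CPU.</p>

```python
import numpy as np

# 64 signals, one per row, each 1024 samples long.
signals = np.random.rand(64, 1024)

# Looping transforms each signal separately; on a GPU this would
# incur the transfer and launch overhead once per signal.
looped = np.array([np.fft.fft(row) for row in signals])

# Batching transforms all rows in one call, amortizing that
# (hypothetical) overhead across the entire matrix.
batched = np.fft.fft(signals, axis=1)

print(np.allclose(looped, batched))  # True: identical results
```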
<h2>Summary</h2>
<p>As the complexity of the FFT increases, either due to an increase in length or in problem dimension, the GPU-leveraged FFT performance overtakes the CPU-bound version. The advantage of the GPU 1D FFT grows substantially as the FFT length grows beyond ~100,000 samples. Batch processing of signals arranged in the rows of a matrix can be used to mitigate the data transfer overhead to the GPU. There are also times when it may be advantageous to offload FFT processing to the GPU even when CPU-bound performance is greater, because this frees many CPU cycles for other work. Because <strong>NMath Premium</strong> supports adjustable crossover thresholds, the developer can control the FFT length at which FFT computation switches to the GPU. Setting this threshold to zero will push all FFT processing to the GPU, completely offloading this work from the CPU.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/nmath-premium-fft-performance">NMath Premium: FFT Performance</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/nmath-premium-fft-performance/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4212</post-id>	</item>
	</channel>
</rss>
