<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>

<channel>
	<title>c# LAPACK GPU Archives - CenterSpace</title>
	<atom:link href="https://www.centerspace.net/tag/c-lapack-gpu/feed" rel="self" type="application/rss+xml" />
	<link>https://www.centerspace.net/tag/c-lapack-gpu</link>
	<description>.NET numerical class libraries</description>
	<lastBuildDate>Wed, 02 Mar 2016 18:39:25 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.1.1</generator>
<site xmlns="com-wordpress:feed-additions:1">104092929</site>	<item>
		<title>NMath Premium&#8217;s new Adaptive GPU Bridge Architecture</title>
		<link>https://www.centerspace.net/gpu-math-csharp</link>
					<comments>https://www.centerspace.net/gpu-math-csharp#respond</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Mon, 13 Oct 2014 16:35:01 +0000</pubDate>
				<category><![CDATA[NMath Premium]]></category>
		<category><![CDATA[.NET GPU]]></category>
		<category><![CDATA[C# GPU]]></category>
		<category><![CDATA[c# LAPACK GPU]]></category>
		<category><![CDATA[C# Nvidia GPU]]></category>
		<category><![CDATA[math gpu]]></category>
		<category><![CDATA[math gpu csharp]]></category>
		<category><![CDATA[NMath GPU]]></category>
		<category><![CDATA[Offloading to GPU]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=5295</guid>

					<description><![CDATA[<p>The most recent release, NMath Premium 6.0, is a major upgrade to the GPU API and enables users to easily use multiple installed NVIDIA GPUs.  As always, using NMath Premium to leverage GPUs never requires any kernel-level GPU programming or other specialized GPU programming skills.  In the following article, after introducing the new GPU bridge architecture, we'll discuss each of the new API features separately with code examples.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/gpu-math-csharp">NMath Premium&#8217;s new Adaptive GPU Bridge Architecture</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>The most recent release of NMath Premium, version 6.0, is a major update that includes an upgraded optimization suite, now backed by the Microsoft Solver Foundation, a significantly more powerful GPU-bridge architecture, and a new class for cubic smoothing splines. This blog post will focus on the new API for doing computation on GPUs with NMath Premium. </p>
<p>The adaptive GPU bridge API in NMath Premium 6.0 includes the following important new features.</p>
<section>
<ul>
<li>Support for multiple GPUs.</li>
<li>Automatic tuning of the CPU&#8211;GPU adaptive bridge to ensure optimal hardware usage.</li>
<li>Per-thread control for binding threads to GPUs.</li>
</ul>
</section>
<p>As with the first release of NMath Premium, using NMath to leverage massively-parallel GPUs never requires any kernel-level GPU programming or other specialized GPU programming skills. Yet the programmer can easily take as much control as needed to route executing threads or tasks to any available GPU device. In the following, after introducing the new GPU bridge architecture, we&#8217;ll discuss each of these features separately with code examples.</p>
<p>Before getting started on our NMath Premium tutorial, it&#8217;s important to consider your test GPU model.  While many of NVIDIA&#8217;s GPUs provide a good to excellent computational advantage over the CPU, not all of them were designed with general computing in mind. The &#8220;NVS&#8221; class of NVIDIA GPUs (such as the NVS 5400M) generally performs very poorly, as do the &#8220;GT&#8221; cards in the GeForce series. However, the &#8220;GTX&#8221; cards in the <a href="http://www.geforce.com/hardware" target="_blank">GeForce series</a> generally perform well, as do the Quadro desktop products and the Tesla cards. While it&#8217;s fine to test NMath Premium on any NVIDIA GPU, testing on inexpensive consumer-grade video cards will rarely show any performance advantage.</p>
<h3>NMath&#8217;s GPU API Basics</h3>
<p>With NMath there are three fundamental software entities involved with routing computations between the CPU and GPU&#8217;s: GPU hardware devices represented by <code>IComputeDevice</code> instances, the <code>Bridge</code> classes which control when a particular operation is sent to the CPU or a GPU, and finally the <code>BridgeManager</code> which provides the primary means for managing the devices and bridges.</p>
<p>These three entities are governed by two important ideas.</p>
<ol>
<li><code>Bridges</code> are assigned to compute devices, and there is a strict one-to-one relationship between each <code>Bridge</code> and <code>IComputeDevice</code>. Once assigned, the bridge instance governs when computations will be sent to its paired GPU device or the CPU.</li>
<li>Executing threads are assigned to devices; this is a many-to-one relationship. Any number of threads can be routed to a particular compute device.</li>
</ol>
<p>Assigning a <code>Bridge</code> class to a device is one line of code with the <code>BridgeManager</code>.</p>
<pre lang="csharp">BridgeManager.Instance.SetBridge( BridgeManager.Instance.GetComputeDevice( 0 ), bridge );</pre>
<p>Assigning a thread, in this case the <code>CurrentThread</code>, to a device is again accomplished using the <code>BridgeManager</code>.</p>
<pre lang="csharp">IComputeDevice cd = BridgeManager.Instance.GetComputeDevice( 0 );
BridgeManager.Instance.SetComputeDevice( cd, Thread.CurrentThread );</pre>
<p>After installing NMath Premium, the default behavior is to create a default bridge and assign it to the GPU with a device number of 0 (generally the fastest GPU installed). Also by default, all unassigned threads will execute on device 0. This means that out of the box, with no additional programming, existing NMath code, once recompiled against the new NMath Premium assemblies, will route all appropriate computations to the device 0 GPU. All of the following discussion and code examples are ways to refine this default behavior to get the best performance from your GPU hardware.</p>
<h3>Math on Multiple GPUs Supported</h3>
<p>Previously, only the NVIDIA GPU with device number 0 was supported by NMath Premium; this release removes that barrier. With version 6, work can be assigned to any installed NVIDIA device as long as the device drivers are up to date.</p>
<p>The work done by an executing thread is routed to a particular device using <code>BridgeManager.Instance.SetComputeDevice()</code>, as we saw in the example above. Any properly configured hardware device can be used here, including any NVIDIA device and the CPU. The CPU is simply viewed as another compute device and is always assigned a device number of -1.</p>
<pre lang="csharp" line="1">var bmanager = BridgeManager.Instance;

var cd = bmanager.GetComputeDevice( -1 );
BridgeManager.Instance.SetComputeDevice( cd, Thread.CurrentThread );
// ...
cd = bmanager.GetComputeDevice( 2 );
BridgeManager.Instance.SetComputeDevice( cd, Thread.CurrentThread );</pre>
<p>Lines 3 &#038; 4 first assign the current thread to the CPU device (no code on this thread will run on any GPU), and then in lines 6 &#038; 7 the current thread is switched to GPU device 2.  If an invalid compute device is requested, a null <code>IComputeDevice</code> is returned.  To find all available compute devices, the <code>BridgeManager</code> offers a <code>Devices</code> array which contains all detected compute devices, <em>including the CPU</em>. The number of detected GPUs can be found using the property <code>BridgeManager.Instance.CountGPU</code>.</p>
<p>As an aside, keep in mind that PCI slot numbers do not necessarily correspond to GPU device numbers. NVIDIA assigns device number 0 to the fastest detected GPU, so installing an additional GPU into a machine may renumber the previously installed GPUs.</p>
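<p>Putting these pieces together, the following minimal sketch lists every detected compute device. It uses only the <code>Devices</code> and <code>CountGPU</code> members described above, and assumes the <code>DeviceName</code> property on <code>IComputeDevice</code> seen elsewhere in our examples.</p>
<pre lang="csharp">// Enumerate every detected compute device, CPU included.
foreach ( IComputeDevice device in BridgeManager.Instance.Devices )
{
  Console.WriteLine( device.DeviceName );
}

// Report the number of detected GPU's (the CPU is not counted here).
Console.WriteLine( "Detected " + BridgeManager.Instance.CountGPU + " GPU(s)." );</pre>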
<h3>Tuning the Adaptive Bridge</h3>
<p>Assigning a <code>Bridge</code> to a GPU device doesn&#8217;t necessarily mean that all computation routed to that device will run on that device. Instead, the assigned <code>Bridge</code> acts as an intermediary between the CPU and the GPU, moving the larger problems to the GPU where there&#8217;s a speed advantage and retaining the smaller problems on the CPU. NMath has a built-in default bridge, but it may produce non-optimal run times depending on your hardware or your customers&#8217; hardware configuration. To improve hardware usage and performance, a bridge can be tuned once and then persisted to disk for all future use.</p>
<pre lang="csharp">// Get a compute device and a new bridge.
IComputeDevice cd = BridgeManager.Instance.GetComputeDevice( 0 );
Bridge bridge = BridgeManager.Instance.NewDefaultBridge( cd );

// Tune this bridge for the matrix multiply operation alone. 
bridge.Tune( BridgeFunctions.dgemm, cd, 1200 );

// Or just tune the entire bridge.  Depending on the hardware and tuning parameters
// this can be an expensive one-time operation. 
bridge.TuneAll( cd, 1200 );

// Now assign this updated bridge to the device.
BridgeManager.Instance.SetBridge( cd, bridge );

// Persisting the bridge that was tuned above is done with the BridgeManager.  
// Note that this overwrites any existing bridge with the same name.
BridgeManager.Instance.SaveBridge( bridge, @".\MyTunedBridge" );

// Then loading that bridge from disk is simple.
var myTunedBridge = BridgeManager.Instance.LoadBridge( @".\MyTunedBridge" );</pre>
<p>Once a bridge is tuned it can be persisted, redistributed, and used again. If three different GPUs are installed, this tuning should be done once for each GPU, and then each bridge should be assigned to the device it was tuned on. However, if there are three identical GPUs, the tuning need be done only once, then persisted to disk and later assigned to all of the identical GPUs. A bridge assigned to a GPU device for which it wasn&#8217;t tuned will never produce incorrect results; it may only underuse the hardware.</p>
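<p>For example, with three identical GPUs, the tuned bridge persisted above could be loaded once and paired with each device in turn. This is a sketch built from the <code>BridgeManager</code> calls already shown, assuming devices 0 through 2 are the identical cards.</p>
<pre lang="csharp">// Load the previously tuned bridge from disk.
var tunedBridge = BridgeManager.Instance.LoadBridge( @".\MyTunedBridge" );

// Pair the same tuned bridge with each of the three identical GPU's.
for ( int deviceNumber = 0; deviceNumber < 3; deviceNumber++ )
{
  IComputeDevice cd = BridgeManager.Instance.GetComputeDevice( deviceNumber );
  if ( cd != null ) // An invalid device request returns null.
  {
    BridgeManager.Instance.SetBridge( cd, tunedBridge );
  }
}</pre>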
<h3>Thread Control</h3>
<p>Once a bridge is paired to a device, threads may be assigned to that device for execution. This is not a necessary step, as all unassigned threads will run on the default device (typically device 0). However, suppose we have three tasks and three GPUs, and we wish to dedicate a GPU to each task.  The following code does that.</p>
<pre lang="csharp">...
IComputeDevice gpu0 = BridgeManager.Instance.GetComputeDevice( 0 );
IComputeDevice gpu1 = BridgeManager.Instance.GetComputeDevice( 1 );
IComputeDevice gpu2 = BridgeManager.Instance.GetComputeDevice( 2 );

if( gpu0 != null && gpu1 != null && gpu2 != null)
{
   System.Threading.Tasks.Task[] tasks = new Task[3]
   {
      Task.Factory.StartNew(() => Task1Worker(gpu0)),
      Task.Factory.StartNew(() => Task2Worker(gpu1)),
      Task.Factory.StartNew(() => Task3Worker(gpu2)),
   };

   //Block until all tasks complete.
   Task.WaitAll(tasks);
}
...</pre>
<p>This code is standard C# code using the <a href="https://msdn.microsoft.com/en-us/library/dd460717(v=vs.110).aspx" target="_blank">Task Parallel Library</a> and contains no NMath Premium specific API calls outside of passing a GPU compute device to each task. The task worker routines have the following simple structure.</p>
<pre lang="csharp">private static void Task1Worker( IComputeDevice cd  )
  {
      BridgeManager.Instance.SetComputeDevice( cd );

      // Do Work here.
  }</pre>
<p>The other two task workers are identical outside of whatever useful computing work they may be doing.</p>
<p>Good luck, and please post any questions in the comments below or email us at support AT centerspace.net and we&#8217;ll get back to you.</p>
<p>Happy Computing,</p>
<p>Paul</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/gpu-math-csharp">NMath Premium&#8217;s new Adaptive GPU Bridge Architecture</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/gpu-math-csharp/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5295</post-id>	</item>
		<item>
		<title>Distributing Parallel Tasks on Multiple GPU&#8217;s</title>
		<link>https://www.centerspace.net/tasks-on-gpu</link>
					<comments>https://www.centerspace.net/tasks-on-gpu#respond</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Wed, 17 Sep 2014 20:50:51 +0000</pubDate>
				<category><![CDATA[NMath Premium]]></category>
		<category><![CDATA[.NET GPU]]></category>
		<category><![CDATA[C# GPU]]></category>
		<category><![CDATA[c# LAPACK GPU]]></category>
		<category><![CDATA[C# Nvidia GPU]]></category>
		<category><![CDATA[math gpu]]></category>
		<category><![CDATA[math gpu csharp]]></category>
		<category><![CDATA[NMath GPU]]></category>
		<category><![CDATA[Offloading to GPU's]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=5397</guid>

					<description><![CDATA[<p><img class="excerpt" alt="NMath Premium" src="/themes/centerspace/images/nmath-premium.png" /> Once Microsoft published the <code>System.Threading.Tasks</code> library with .NET 4, many programmers who had never, or only occasionally, written multi-threaded code began doing so regularly with the Task API.  The Task library reduced the complexity of writing threaded code and provided several new related classes to make the process easier while eliminating some pitfalls.  In this post I'm going to show how to use the Task library with NMath Premium 6.0 to run tasks in parallel on multiple GPUs and the CPU.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/tasks-on-gpu">Distributing Parallel Tasks on Multiple GPU&#8217;s</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>In this post I&#8217;m going to demonstrate how to use the Task Parallel Library with NMath Premium to run tasks in parallel on multiple GPUs and the CPU.  Back in 2010, when Microsoft released .NET 4.0 and the <code>System.Threading.Tasks</code> namespace, many .NET programmers never, or only under duress, wrote multi-threaded code.  It&#8217;s old news now that the <a href="https://msdn.microsoft.com/en-us/library/dd460693(v=vs.100).aspx" target="_blank">TPL</a> has reduced the complexity of writing threaded code by providing several new classes that make the process easier while eliminating some pitfalls.  Leveraging the TPL API together with NMath Premium is a powerful combination for quickly getting code running on your GPU hardware without the burden of learning complex CUDA programming techniques.</p>
<h2> NMath Premium GPU Smart Bridge</h2>
<p>The NMath Premium 6.0 library is now integrated with a new CPU-GPU hybrid-computing Adaptive Bridge&trade; Technology.  This technology allows users to easily assign specific threads to a particular compute device and manage computational routing between the CPU and multiple on-board GPUs.  Each piece of installed computing hardware is uniformly treated as a compute device and managed in software as an immutable <code>IComputeDevice</code>.  Currently the adaptive bridge allows a single CPU compute device (naturally!) along with any number of NVIDIA GPU devices.  How NMath Premium interacts with each compute device is governed by a <code>Bridge</code> class, with a one-to-one relationship enforced between each <code>Bridge</code> instance and each compute device.  All of the compute devices and bridges are managed by the singleton <code>BridgeManager</code> class.</p>
<figure id="attachment_5473" aria-describedby="caption-attachment-5473" style="width: 600px" class="wp-caption alignnone"><img decoding="async" src="https://www.centerspace.net/blog/wp-content/uploads/2014/04/Adaptive-Bridge.png" alt="Adaptive Bridge" width="600" class="size-full wp-image-5473" srcset="https://www.centerspace.net/wp-content/uploads/2014/04/Adaptive-Bridge.png 700w, https://www.centerspace.net/wp-content/uploads/2014/04/Adaptive-Bridge-300x186.png 300w" sizes="(max-width: 700px) 100vw, 700px" /><figcaption id="caption-attachment-5473" class="wp-caption-text">Adaptive Bridge</figcaption></figure>
<p>These three classes &#8211; the <code>BridgeManager</code>, the <code>Bridge</code>, and the immutable <code>IComputeDevice</code> &#8211; form the entire API of the Adaptive Bridge&trade;.  With this API, nearly all programming tasks, such as assigning a particular <code>Action<></code> to a specific GPU, are accomplished in one or two lines of code.  Let&#8217;s look at some code that does just that: run an <code>Action<></code> on a GPU.</p>
<pre lang="csharp">
using CenterSpace.NMath.Matrix;

public void mainProgram( string[] args )
    {
      // Set up a Action<> that runs on a IComputeDevice.
      Action<IComputeDevice, int> worker = WorkerAction;
      
      // Get the compute device we wish to run our
      // Action<> on - in this case GPU 0.
      IComputeDevice deviceGPU0 = BridgeManager.Instance.GetComputeDevice( 0 );

      // Do work
      worker(deviceGPU0, 9);
    }

    private void WorkerAction( IComputeDevice device, int input )
    {
      // Place this thread to the given compute device.
      BridgeManager.Instance.SetComputeDevice( device );

      // Do all the hard work here on the assigned device.
      // Call various GPU-aware NMath Premium routines here.
      FloatMatrix A = new FloatMatrix( 1230, 900, new RandGenUniform( -1, 1, 37 ) );
      FloatSVDecompServer server = new FloatSVDecompServer();
      FloatSVDecomp svd = server.GetDecomp( A );
    }
</pre>
<p>It&#8217;s important to understand that only operations where the GPU has a computational advantage are actually run on the GPU.  So it&#8217;s not as though all of the code in the <code>WorkerAction</code> runs on the GPU, but only code that makes sense such as: SVD, QR decomp, matrix multiply, Eigenvalue decomposition and so forth.  But using this as a code template, you can easily run your own worker several times passing in different compute devices each time to compare the computational advantages or disadvantages of using various devices &#8211; including the CPU compute device.</p>
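<p>As a sketch of that comparison, the worker above can be timed against each detected device in turn. The only NMath Premium calls here are ones already introduced; the timing is plain <code>System.Diagnostics.Stopwatch</code> code.</p>
<pre lang="csharp">// Time the same worker against every detected compute device,
// CPU included, to compare per-device performance.
foreach ( IComputeDevice device in BridgeManager.Instance.Devices )
{
  var timer = System.Diagnostics.Stopwatch.StartNew();
  WorkerAction( device, 9 );
  timer.Stop();
  Console.WriteLine( device.DeviceName + ": " + timer.ElapsedMilliseconds + " ms" );
}</pre>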
<p>In the above code example the <code>BridgeManager</code> is used twice: once to get an <code>IComputeDevice</code> reference and once to assign a thread (the <code>Action<>'s</code> thread in this case) to the device.  The <code>Bridge</code> class didn&#8217;t come into play since we implicitly relied on a default bridge being assigned to our compute device of choice.  Relying on the default bridge will likely result in inferior performance, so it&#8217;s best to use a bridge that has been specifically tuned to your NVIDIA GPU.  The following code shows how to accomplish bridge tuning.</p>
<pre lang="csharp">
  // Here we get the bridge associated with GPU device 0.
  var cd = BridgeManager.Instance.GetComputeDevice( 0 );
  var bridge = (Bridge) BridgeManager.Instance.GetBridge( cd );

  // Tune the bridge and save it.  Tuning can take a few minutes.
  bridge.TuneAll( cd, 1200 );
  BridgeManager.Instance.SaveBridge( bridge, "Device0Bridge.bdg" );
</pre>
<p>This bridge tuning is typically a one-time operation per computer, and once done, the tuned bridge can be serialized to disk and then reloaded at application start-up.  If new GPU hardware is installed, this tuning operation should be repeated.  The following code snippet loads a saved bridge and pairs it with a device.</p>
<pre lang="csharp">
  // Load our serialized bridge.
  Bridge bridge = BridgeManager.Instance.LoadBridge( "Device0Bridge.bdg" );
  
  // Now pair this saved bridge with compute device 0.   
  var device0 = BridgeManager.Instance.GetComputeDevice( 0 );
  BridgeManager.Instance.SetBridge( device0, bridge );
</pre>
<p>Once the tuned bridge is assigned to a device, the behavior of all threads assigned to that device will be governed by that bridge.  In the typical application the pairing of bridges to devices is done at start-up and not altered again, while the assignment of threads to devices may be done frequently at runtime.</p>
<p>It&#8217;s interesting to note that beyond optimally routing small and large problems to the CPU and GPU respectively, bridges can be configured to shunt all work to the GPU regardless of problem size.  This is useful for testing, and for offloading work to a GPU when the CPU is taxed.  Even if a particular problem runs slower on the GPU than the CPU, if the CPU is fully occupied, offloading work to an otherwise idle GPU will enhance performance.</p>
<h2> C# Code Example of Running Tasks on Two GPU&#8217;s </h2>
<p>I&#8217;m going to wrap up this blog post with a complete C# code example which runs a matrix multiplication task simultaneously on two GPUs and the CPU.  The framework of this example uses the TPL and aspects of the adaptive bridge already covered here.  I ran this code on a machine with two NVIDIA GeForce GPUs, a GTX760 and a GT640, and the timing results from this run for executing a large matrix multiplication are shown below.</p>
<pre class="code">
Finished matrix multiply on the GeForce GTX 760 in 67 ms.
Finished matrix multiply on the Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz in 103 ms.
Finished matrix multiply on the GeForce GT 640 in 282 ms.

Finished all double precision matrix multiplications in parallel in 282 ms.
</pre>
<p>The complete code for this example is given in the section below.  In this run we see the GeForce GTX760 easily finished first in 67ms, followed by the CPU, and then finally by the GeForce GT640.  It&#8217;s expected that the GeForce GT640 would not do well in this example because it&#8217;s optimized for single-precision work and these matrix multiplies are double precision.  Nevertheless, this example shows that it&#8217;s programmatically simple to push work to any NVIDIA GPU, and in a threaded application even a relatively slow GPU can be used to offload work from the CPU.  Also note that the entire program ran in 282ms &#8211; the time required to finish the matrix multiply on the slowest hardware &#8211; verifying that all three tasks did run in parallel and that there was very little overhead in using the TPL or the Adaptive Bridge&trade;.</p>
<p>Below is a snippet of the NMath Premium log file generated during the run above.</p>
<pre class="code">
	Time 		        tid   Device#  Function    Device Used    
2014-04-28 11:22:47.417 AM	10	0	dgemm		GPU
2014-04-28 11:22:47.421 AM	15	1	dgemm		GPU
2014-04-28 11:22:47.425 AM	13	-1	dgemm		CPU
</pre>
<p>We can see here that three threads were created nearly simultaneously with thread ids of 10, 15, &#038; 13, and that the first two threads ran their matrix multiplies (dgemm) on GPUs 0 and 1 while the last thread, 13, ran on the CPU.  As a matter of convention the CPU device number is always -1 and all GPU device numbers are integers 0 and greater.  Typically device number 0 is assigned to the fastest installed GPU, and that is the default GPU used by NMath Premium.  </p>
<p>-Paul</p>
<h3> TPL Tasks on Multiple GPU&#8217;s C# Code </h3>
<pre lang="csharp">
public void GPUTaskExample()
    {
     
      NMathConfiguration.Init();

      // Set up a string writer for logging
      using ( var writer = new System.IO.StringWriter() )
      {

        // Enable the CPU/GPU bridge logging
        BridgeManager.Instance.EnableLogging( writer );

        // Get the compute devices we wish to run our tasks on - in this case 
        // two GPU's and the CPU.
        IComputeDevice deviceGPU0 = BridgeManager.Instance.GetComputeDevice( 0 );
        IComputeDevice deviceGPU1 = BridgeManager.Instance.GetComputeDevice( 1 );
        IComputeDevice deviceCPU = BridgeManager.Instance.CPU;

        // Build some matrices
        var A = new DoubleMatrix( 1200, 1400, 0, 1 );
        var B = new DoubleMatrix( 1400, 1300, 0, 1 );

        // Build the task array and assign matrix multiply jobs and compute devices
        // to those tasks.  Any number of tasks can be added here and any number 
        // of tasks can be assigned to a particular device.
        Stopwatch timer = new Stopwatch();
        timer.Start();
        System.Threading.Tasks.Task[] tasks = new Task[3]
        {
          Task.Factory.StartNew(() => MatrixMultiply(deviceGPU0, A, B)),
          Task.Factory.StartNew(() => MatrixMultiply(deviceGPU1, A, B)),
          Task.Factory.StartNew(() => MatrixMultiply(deviceCPU, A, B)),
        };

        // Block until all tasks complete
        Task.WaitAll( tasks );
        timer.Stop();
        Console.WriteLine( "Finished all double precision matrix multiplications in parallel in " + timer.ElapsedMilliseconds + " ms.\n" );

        // Dump the log file for verification.
        Console.WriteLine( writer );

        // Quit logging
        BridgeManager.Instance.DisableLogging();
      
      }
    }

    private static void MatrixMultiply( IComputeDevice device, DoubleMatrix A, DoubleMatrix B )
    {
      // Place this thread to the given compute device.
      BridgeManager.Instance.SetComputeDevice( device );

      Stopwatch timer = new Stopwatch();
      timer.Start();

      // Do this task work.
      NMathFunctions.Product( A, B );

      timer.Stop();
      Console.WriteLine( "Finished matrix multiply on the " + device.DeviceName + " in " + timer.ElapsedMilliseconds + " ms.\n" );
    }
    
</pre>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/tasks-on-gpu">Distributing Parallel Tasks on Multiple GPU&#8217;s</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/tasks-on-gpu/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">5397</post-id>	</item>
		<item>
		<title>An Introduction to Linear Algebra on the GPU</title>
		<link>https://www.centerspace.net/linear-algebra-gpu</link>
					<comments>https://www.centerspace.net/linear-algebra-gpu#respond</comments>
		
		<dc:creator><![CDATA[Paul Shirkey]]></dc:creator>
		<pubDate>Tue, 04 Jun 2013 19:02:38 +0000</pubDate>
				<category><![CDATA[NMath Premium]]></category>
		<category><![CDATA[C# GPU]]></category>
		<category><![CDATA[c# LAPACK GPU]]></category>
		<category><![CDATA[c# linear algebra]]></category>
		<category><![CDATA[NMath GPU]]></category>
		<category><![CDATA[NVIDIA GPU]]></category>
		<guid isPermaLink="false">http://www.centerspace.net/blog/?p=4383</guid>

					<description><![CDATA[<p><img src="https://www.centerspace.net/blog/wp-content/uploads/2013/05/nvidia_logo.gif" alt="" title="Nvida Logo" width="150" height="150" class="excerpt" /><br />
<strong>NMath Premium</strong> was designed to provide an easy-to-follow pathway for .NET developers to leverage the performance of the GPU without having to wade through the complexities of GPU programming and their attendant details.  And to provide a GPU-aware math library that developers can use to build once &#038; run anywhere without concerning themselves with their users&#8217; installed GPU model or version, or even the existence of a GPU.</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/linear-algebra-gpu">An Introduction to Linear Algebra on the GPU</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><strong>NMath Premium</strong> was designed to provide an easy-to-follow path for .NET developers to leverage the performance of the GPU without having to wade through the complexities of GPU programming and their attendant details. <strong>NMath Premium</strong> allows developers to build once and run anywhere without concerning themselves with their users&#8217; installed GPU models and versions, or even the existence of a GPU. <strong>NMath Premium</strong> is designed with fail-safe CPU fallbacks based on problem size, installed GPU hardware, and configuration settings. <strong>NMath Premium</strong> supports a complete set of dense linear algebra operations that execute on a wide class of NVIDIA GPUs, all using the intuitive, easy-to-use <strong>NMath</strong> API. As a result, <strong>NMath Premium</strong> not only offers superior performance with GPU-enabled linear algebra functions but also leverages these GPU-enabled classes internally in a wide range of algorithms.</p>
<p><strong>NMath Premium</strong> will be released June 11, 2013. For immediate access, sign up <a href="https://www.centerspace.net/nmath-premium/">here</a> to join the beta program.</p>
<h2>An SVD example</h2>
<p>After installing and adding the <strong>NMath Premium</strong> assemblies to your project, the following example demonstrates the computation of a large SVD on the GPU. The <strong>NMath</strong> API has been largely preserved in <strong>NMath Premium</strong>, so the following code example will be familiar to current <strong>NMath</strong> users as its syntax is identical to <strong>NMath</strong>. Nearly all <strong>NMath</strong> code will run correctly without edits and can be dropped into an <strong>NMath Premium</strong> project.</p>
<pre lang="csharp">   // Build a dense random float matrix
   var A = new FloatMatrix( 5000, 5000, new RandGenUniform( -1, 1, seed));

   // Build the SVD server and request the right vectors
   var server = new FloatSVDecompServer();
   server.ComputeLeftVectors = false;
   server.ComputeRightVectors = true;
   server.ComputeFull = false;

   // Do the SVD
   FloatSVDecomp svd = server.GetDecomp( A );</pre>
<p>Running in an <strong>NMath Premium</strong> project, this <code>5000x5000</code> SVD decomposition can execute either on the GPU or the CPU depending on the installed hardware and configuration settings, so most new users will want to immediately verify that their decomposition did indeed run on the GPU. To accomplish this, <strong>NMath Premium</strong> provides a logging feature that allows programmers to track where their GPU-aware classes routed their computation. The line of code below enables the logging feature &#8211; but because of the associated file writes, logging should only be used while debugging and avoided in production code.</p>
<pre lang="csharp">#if DEBUG
   NMathConfiguration.EnableGPULogging = true;
#endif</pre>
<p>The log file will be written to a file named <code>NMathGPULapack.log</code> located next to the built executable. The location and name of this log file can be modified with the <code>NMathConfiguration</code> configuration class. <code>NMathConfiguration</code> contains a number of new features and is worth a <a href="https://www.centerspace.net/doc/NMathSuite/ref/html/T_CenterSpace_NMath_Core_NMathConfiguration.htm" target="_blank">perusal</a>. It&#8217;s important to note that in the current release <em>the logging must be configured before any computational operations take place</em>; otherwise an exception will be thrown. Logging cannot currently be turned on or off once <strong>NMath Premium</strong> has loaded its dependent DLLs and started running computational algorithms.</p>
<p>Having run the simple example above, I see the following in my <code>NMathGPULapack</code> log file.</p>
<pre lang="basic">  cula info:  sgesvd (N, S, 5000, 5000, ... , 5000)
  cula info:  issuing to CPU (work query)
  cula info:  CPU library is lapackcpu.dll
  cula info:  work query returned 654872
  cula info:  done
  cula info:  sgesvd (N, S, 5000, 5000, ... , 5000)
  cula info:  issuing to GPU (over threshold)
  cula info:  done</pre>
<p>The first five lines record a query to the LAPACK library to determine the total memory requirements for this operation. This is simply a work query and does not run on the GPU. The final three lines record the operation name (<code>sgesvd</code>), size, where it ran and why. In this case the <code>5000x5000</code> SVD ran on the GPU because its size was over the cross-over threshold.</p>
<h2>Performance</h2>
<p>The <em>raison d&#8217;être</em> of GPU computation is performance, and new users need to be aware of the various factors that impact GPU performance. For GPU developers leveraging a product like <strong>NMath Premium</strong>, the three most important factors that determine the performance of a particular algorithm are:</p>
<ol>
<li>Installed GPU hardware</li>
<li>Size of problem</li>
<li>Computational precision, <code>Single</code> or <code>Double</code></li>
</ol>
<p>There are many other reasons that a particular algorithm will run well on a GPU, including its computational complexity or how well it can be bent to run on a highly parallel GPU architecture, but the three reasons above are paramount to our library users. Following the example above, if you happen to have run the SVD on an early-model GeForce laptop GPU, the performance may not have been much better than running it in a CPU-bound manner; although doing so would have freed your CPU for other tasks &#8211; an important collateral benefit of GPU computation. Likewise, computational precision can have a major impact on GPU performance, and given the single-precision graphics origin of GPU architecture, it&#8217;s important to match the required computational precision with the proper NVIDIA hardware, particularly if double precision is needed. Lastly, problem size is the primary determinant used by <strong>NMath Premium</strong> to route problems between the CPU and GPU: because there is memory-transfer overhead involved in transferring data to the GPU, small problems are retained on the CPU and large problems are shifted to the GPU. This routing can be controlled by the developer if fine control is needed; however, most developers will use <strong>NMath Premium</strong> in its default configuration with great success.</p>
<p>-Happy Computing,</p>
<p>Paul Shirkey</p>
<p>The post <a rel="nofollow" href="https://www.centerspace.net/linear-algebra-gpu">An Introduction to Linear Algebra on the GPU</a> appeared first on <a rel="nofollow" href="https://www.centerspace.net">CenterSpace</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.centerspace.net/linear-algebra-gpu/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">4383</post-id>	</item>
	</channel>
</rss>
