The post Principal Components Regression: Part 3 – The NIPALS Algorithm appeared first on CenterSpace.
Recall that the least squares solution to the multiple linear regression problem is given by
(1) \(\hat{\beta} = (X^T X)^{-1} X^T y\)
And that problems occurred finding \(\hat{\beta}\) when the matrix
(2) \(X^T X\)
was close to being singular. The Principal Components Regression approach to addressing the problem is to replace \((X^T X)^{-1}\) in equation (1) with a better-conditioned approximation. This approximation is formed by computing the eigenvalue decomposition for \(X^T X\) and retaining only the r largest eigenvalues. This yields the PCR solution:
(3) \(\hat{\beta}_{PCR} = V_r \Lambda_r^{-1} V_r^T X^T y\)
where \(\Lambda_r\) is an r x r diagonal matrix consisting of the r largest eigenvalues of \(X^T X\), and the columns of \(V_r\) are the corresponding eigenvectors of \(X^T X\). In this piece we shall develop code for computing the PCR solution using the NMath libraries.
[eds: This blog article is the final entry of a three-part series on principal component regression. The first article in this series, “Principal Component Regression: Part 1 – The Magic of the SVD” is here. And the second, “Principal Components Regression: Part 2 – The Problem With Linear Regression” is here.]
In order to develop the algorithm, I want to go back to the Singular Value Decomposition (SVD) of a matrix and its relationship to the eigenvalue decomposition. Recall that the SVD of a matrix X is given by
(4) \(X = U \Sigma V^T\)
Where U is the matrix of left singular vectors, V is the matrix of right singular vectors, and Σ is a diagonal matrix with positive entries equal to the singular values. The eigenvalue decomposition of \(X^T X\) is given by
(5) \(X^T X = V \Sigma^T \Sigma V^T = V \Lambda V^T\)
Where the eigenvalues of \(X^T X\) are the diagonal entries of the diagonal matrix \(\Lambda = \Sigma^T \Sigma\) and the columns of V are the eigenvectors of \(X^T X\) (V is also composed of the right singular vectors of X).
Recall further that if the matrix X has rank r then X can be written as
(6) \(X = \sum_{j=1}^{r} \sigma_j u_j v_j^T\)
Where \(\sigma_j\) is the jth singular value (the jth diagonal element of \(\Sigma\)), \(u_j\) is the jth column of U, and \(v_j\) is the jth column of V. An equivalent way of expressing the PCR solution (3) to the least squares problem in terms of the SVD for X is that we’ve replaced X in the solution (1) by its rank r approximation shown in (6).
The subject here is Principal Components Regression (PCR), but we have yet to mention principal components. All we have talked about are eigenvalues, eigenvectors, singular values, and singular vectors. We’ve seen how singular stuff and eigen stuff are related, but what are principal components?
Principal component analysis applies when one considers the statistical properties of data. In linear regression each column of our matrix X represents a variable and each row is a set of observed values for these variables. The variables being observed are random variables and as such have means and variances. If we center the matrix X by subtracting from each column of X its corresponding mean, then we’ve normalized the random variables being observed so that they have zero mean. Once the matrix X is centered in this way, the matrix \(X^T X\) is then proportional to the variance/covariance matrix for the variables. In this context the eigenvectors of \(X^T X\) are called the Principal Components of X. For completeness (and because they are used in discussing the PCR algorithm), we define two more terms.
In the SVD given by equation (4), define the matrix T by
(7) \(T = U \Sigma\)
The matrix T is called the scores for X. Note that the columns of T are orthogonal, but not necessarily orthonormal. Substituting this into the SVD for X yields
(8) \(X = T V^T\)
Using the fact that V is orthogonal we can also write
(9) \(T = X V\)
We call the matrix V the loadings. The goal of our algorithm is to obtain the representation given by equation (8) for X, retaining only the most significant principal components (or eigenvalues, or singular values, depending on where your head is at the time).
Using equation (3) to compute the solution to our problem involves forming the matrix \(X^T X\) and obtaining its eigenvalue decomposition. This approach is fairly straightforward and has reasonable performance for moderately sized matrices X. However, in practice, the matrix X can be quite large, containing hundreds, even thousands, of columns. In addition, many procedures for choosing the optimal number r of eigenvalues/singular values to retain involve computing the solution for many different values of r and comparing them. We therefore introduce an algorithm which computes only the number of eigenvalues we need.
We will be using an algorithm known as NIPALS (Nonlinear Iterative PArtial Least Squares). The NIPALS algorithm for the matrix X in our least squares problem and r, the number of retained principal components, proceeds as follows:
Initialize \(E_0 = X\) and \(j = 1\). Then iterate through the following steps –
1. Choose \(t\) as a column of \(E_{j-1}\) (an initial guess at the score vector).
2. Compute the loading \(p = E_{j-1}^T t / (t^T t)\).
3. Normalize \(p\) to length one and update the score \(t = E_{j-1} p\).
4. If \(t\) has converged continue to step 5; otherwise return to step 2.
5. Deflate, \(E_j = E_{j-1} - t p^T\), and record \(t_j = t\), \(p_j = p\). Stop when \(j = r\); otherwise increment \(j\) and return to step 1.
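To make these steps concrete, here is a rough pure-Python sketch of the loop (an illustration only; NMath's implementation is optimized C# and certainly differs in detail, and the fixed iteration count stands in for a real convergence test):

```python
# Pure-Python NIPALS sketch: extract r score/loading pairs from X.

def nipals(X, r, iters=200):
    """Return scores T and loadings P, each a list of r vectors."""
    m, n = len(X), len(X[0])
    E = [row[:] for row in X]              # E_0 = X; deflated in place (step 5)
    T, P = [], []
    for _ in range(r):
        t = [row[0] for row in E]          # step 1: a column of E as the score
        for _ in range(iters):             # steps 2-4, iterated to convergence
            tt = sum(v * v for v in t)     # lambda = t^T t
            # step 2: loading p = E^T t / (t^T t)
            p = [sum(E[i][j] * t[i] for i in range(m)) / tt for j in range(n)]
            # step 3: normalize p, then update the score t = E p
            norm = sum(v * v for v in p) ** 0.5
            p = [v / norm for v in p]
            t = [sum(E[i][j] * p[j] for j in range(n)) for i in range(m)]
        # step 5: deflate E and record the score/loading pair
        for i in range(m):
            for j in range(n):
                E[i][j] -= t[i] * p[j]
        T.append(t)
        P.append(p)
    return T, P
```

For a small rank-2 matrix such as [[1, 2], [2, 1], [1, 1]], two passes reproduce X exactly as the sum of the outer products of scores and loadings, and the two score vectors come out mutually orthogonal.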
Let us see how the NIPALS algorithm produces principal components for us.
Let \(\lambda = t^T t\) and write step (2) as
(10) \(\lambda p = E^T t\)
Setting \(t = E p\) from step (3) into (10) yields, at convergence,
(11) \(E^T E\, p = \lambda p\)
This equation is satisfied upon completion of the loop 2-4. Since \(E = X\) on the first pass, this shows that \(\lambda\) and \(p\) are an eigenvalue and eigenvector of \(X^T X\). The astute reader will note that the loop 2-4 is essentially the power method for computing a dominant eigenvalue and eigenvector for a linear transformation. Note further that using \(t = X p\) and equation (11) we obtain
(12) \(X X^T t = X (X^T X p) = \lambda X p = \lambda t\)
so that \(t\) is a corresponding eigenvector of \(X X^T\); up to sign, \(t = \sigma u\) with \(\lambda = \sigma^2\). After one iteration of the NIPALS algorithm we end up at step 5 with \(t_1\), \(p_1\), and
(13) \(E_1 = X - t_1 p_1^T\)
Note that \(E_1\) and \(t_1\)
are orthogonal:
(14) \(E_1^T t_1 = X^T t_1 - p_1 (t_1^T t_1) = \lambda_1 p_1 - \lambda_1 p_1 = 0\)
Furthermore, since \(t_2\) is initially picked as a column of \(E_1\), it is orthogonal to \(t_1\). Upon completion of the algorithm we form the following two matrices:
(15) \(T = [t_1 \; t_2 \; \cdots \; t_r], \quad P = [p_1 \; p_2 \; \cdots \; p_r]\)
If r is equal to the rank of X then, using the information obtained from equations (12) and (14), it follows that (15) yields the matrix decomposition (8). The idea behind Principal Components Regression is that after choosing an appropriate r the important features of X have been captured in \(T\). We then perform a linear regression with \(T\) in place of X,
(16) \(y = T \alpha + \epsilon\).
The least squares solution then gives
(17) \(\hat{\alpha} = (T^T T)^{-1} T^T y\)
Note that since \(T^T T\) is diagonal it is easy to invert. Also note that we left out the loadings matrix \(P\). This is because the scores are linear combinations of the columns of X, and the PCR method amounts to singling out those combinations that are best for predicting y. Finally, using (9) and (16) we rewrite our linear regression problem as
(18) \(y = X P \alpha + \epsilon\)
From (18) we see that the PCR estimate is given by
(19) \(\hat{\beta}_{PCR} = P \hat{\alpha} = P (T^T T)^{-1} T^T y\).
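As a sanity check on the derivation, here is a tiny self-contained Python illustration (not NMath code) that runs rank-1 PCR on a nearly collinear two-column design matrix; the hypothetical pcr_rank1 helper uses plain power iteration in place of NIPALS:

```python
# Illustration only: rank-1 PCR on a nearly collinear design matrix.
# A hand-rolled power iteration on X^T X stands in for NIPALS.

def pcr_rank1(X, y, iters=500):
    m, n = len(X), len(X[0])
    p = [1.0] * n                          # power iteration for the top eigenvector
    for _ in range(iters):
        Xp = [sum(X[i][k] * p[k] for k in range(n)) for i in range(m)]
        s = [sum(X[i][j] * Xp[i] for i in range(m)) for j in range(n)]
        norm = sum(v * v for v in s) ** 0.5
        p = [v / norm for v in s]
    t = [sum(X[i][j] * p[j] for j in range(n)) for i in range(m)]   # score t = Xp
    alpha = sum(t[i] * y[i] for i in range(m)) / sum(v * v for v in t)
    return [p[j] * alpha for j in range(n)]                         # beta = p * alpha

X = [[1.0, 0.99], [2.0, 2.01], [3.0, 2.98], [4.0, 4.02]]
y = [1.99, 4.01, 5.98, 8.02]               # exactly X times beta = [1, 1]
beta = pcr_rank1(X, y)
```

Even though the two columns of X are nearly identical (so the normal equations are badly conditioned), the single retained principal component recovers coefficients close to [1, 1].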
Steve
The post CenterSpace partner releases symbolic, computational library appeared first on CenterSpace.
Please check them out.
– Trevor
The post Announcing NMath 6.2 and NMath Stats 4.2 appeared first on CenterSpace.
Added functionality includes:
For more complete changelogs, see:
Upgrades are provided free of charge to customers with current annual maintenance contracts. To request an upgrade, please visit our upgrade page, or contact sales@centerspace.net. Maintenance contracts are available through our webstore.
The post Filtering with Wavelet Transforms appeared first on CenterSpace.
In signal processing, wavelets have been widely investigated for use in filtering bio-electric signals, among many other applications. Bio-electric signals are good candidates for multi-resolution wavelet filtering over standard Fourier analysis due to their non-stationary character. In this article we’ll discuss the filtering of electrocardiograms, or ECGs, and demonstrate with code examples how to filter an ECG waveform using NMath‘s new wavelet classes, keeping in mind that the techniques and code shown here apply to a wide class of time series measurements. If wavelets and their applications to filtering are unfamiliar to the reader, read a gentle and brief introduction to the subject in Wavelets for Kids: A Tutorial Introduction [5].
PhysioNet provides free access to a large collection of recorded physiologic signals, including many ECGs. The ECG signal we will filter here, named aami-ec13 on PhysioNet, is shown below.
Our goal will be to remove the high frequency noise while preserving the character of the waveform, including the high frequency transitions at the signal peaks. Fourier-based filter methods are ill-suited for filtering this type of signal due both to its non-stationarity, as mentioned, and to the need to preserve the peak locations (phase) and shape.
As with Fourier analysis, there are three basic steps to filtering signals using wavelets: decompose the signal with the forward transform, threshold the resulting coefficients, and rebuild the filtered signal with the inverse transform.
Briefly, the filtering of signals using wavelets is based on the idea that as the DWT decomposes the signal into details and approximation parts, at some scale the details contain mostly insignificant noise and can be removed or zeroed out using thresholding without affecting the signal. This idea is discussed in more detail in the introductory paper [5]. To implement this DWT filtering scheme there are two basic filter design parameters: the wavelet type and a threshold. Typically the shape and form of the signal to be filtered is qualitatively matched to the general shape of the wavelet. In this example we will use the fourth-order Daubechies wavelet.
The general shape of this wavelet roughly matches, at various scales, the morphology of the ECG signal. Currently NMath supports the following wavelet families: Haar, Daubechies, Symlet, Best Localized, and Coiflet, 27 in all. Additionally, any custom wavelet of your invention can be created by passing in the wavelet’s low- and high-pass decimation filter values. The wavelet class then imposes the wavelet’s symmetry properties to compute the reconstruction filters.
// Build a Coiflet wavelet.
var coifletWavelet = new FloatWavelet( Wavelet.Wavelets.C4 );
// Build a custom reverse bi-orthogonal wavelet.
var customWavelet = new DoubleWavelet( new double[] {0.0, 0.0, 0.7071068, 0.7071068, 0.0, 0.0}, new double[] {0.0883883, 0.0883883, -0.7071068, 0.7071068, -0.0883883, -0.0883883} );
The FloatDWT class provides four different thresholding strategies: Universal, UniversalMAD, Sure, and Hybrid (a.k.a. SureShrink). We’ll use the Universal threshold strategy here. This is a good starting point, but this strategy can over-smooth the signal. Typically some empirical experimentation is done here to find the best threshold for the data (see [1]; see also [4] for a good overview of common thresholding strategies).
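For concreteness, here is a Python sketch of soft thresholding and a MAD-based universal threshold as they are commonly defined in the wavelet-shrinkage literature; this is an illustration, not NMath's internal code, and the helper names are my own:

```python
import math

def soft_threshold(c, lam):
    """Shrink a detail coefficient toward zero by lam; zero it inside [-lam, lam]."""
    if c > lam:
        return c - lam
    if c < -lam:
        return c + lam
    return 0.0

def universal_threshold(details):
    """lambda_U = sigma * sqrt(2 ln n), with sigma estimated from the
    median absolute deviation of the finest-scale detail coefficients."""
    n = len(details)
    devs = sorted(abs(c) for c in details)
    mad = devs[n // 2]
    sigma = mad / 0.6745          # MAD-to-sigma conversion for Gaussian noise
    return sigma * math.sqrt(2.0 * math.log(n))
```

With a Soft policy, a coefficient of 3.0 and a threshold of 1.0 become 2.0, while any coefficient smaller in magnitude than the threshold is zeroed, which is exactly what removes the noise floor from the detail levels.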
The three steps outlined above are easily coded using two classes in the NMath library: the FloatDWT class and the FloatWavelet class. As always in NMath, the library offers both a float precision and a double precision version of each of these classes. Let’s look at a code snippet that implements a DWT based filter with NMath.
// Choose wavelet, the Daubechies 4 wavelet
var wavelet = new FloatWavelet( Wavelet.Wavelets.D4 );
// Build DWT object using our wavelet & data
var dwt = new FloatDWT( data, wavelet );
// Decompose signal with DWT to level 5
dwt.Decompose( 5 );
// Find Universal threshold & threshold all detail levels
double lambdaU = dwt.ComputeThreshold( FloatDWT.ThresholdMethod.Universal, 1 );
dwt.ThresholdAllLevels( FloatDWT.ThresholdPolicy.Soft, new double[] { lambdaU,
lambdaU, lambdaU, lambdaU, lambdaU } );
// Rebuild the filtered signal.
float[] reconstructedData = dwt.Reconstruct();
The first two lines of code build the wavelet object and the DWT object using both the input data signal and the abbreviated Daubechies wavelet name Wavelet.Wavelets.D4. The third line of code executes the wavelet decomposition at five consecutive scales. Both the signal’s details and approximations are stored in the DWT object at each step in the decomposition. Next, the Universal threshold is computed and the wavelet details are thresholded using the same threshold with a Soft policy (see [1], pg. 63). Lastly, the now-filtered signal is reconstructed.
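For readers without NMath at hand, the same decompose/threshold/reconstruct pipeline can be sketched with a minimal single-level Haar transform in plain Python (an illustration only; the real example below uses the Daubechies 4 wavelet and five decomposition levels, and this sketch assumes an even-length signal):

```python
import math

def haar_filter(signal, lam):
    """One-level Haar DWT filter: decompose, soft-threshold details, reconstruct."""
    s = math.sqrt(2.0)
    # 1. Decompose into approximation and detail coefficients.
    approx = [(a + b) / s for a, b in zip(signal[0::2], signal[1::2])]
    detail = [(a - b) / s for a, b in zip(signal[0::2], signal[1::2])]
    # 2. Soft-threshold the details.
    detail = [math.copysign(max(abs(d) - lam, 0.0), d) for d in detail]
    # 3. Reconstruct the filtered signal from the kept coefficients.
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out
```

With a threshold of zero the reconstruction is exact; with a small threshold, low-amplitude sample-to-sample jitter is smoothed away while the local mean of the signal is preserved.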
Below, the chart on the left shows the unfiltered ECG signal and the chart on the right shows the wavelet filtered ECG signal. It’s clear that this filter very effectively removed the noise while preserving the signal.
These two charts below show a detail from the chart above from indices 500 to 1000. Note how well the signal shape, phase, and amplitude have been preserved in this non-stationary wavelet-filtered signal.
It is this ability of DWT-based filters to preserve phase, form, and amplitude, all while running in O(n) time (Fourier-based filters require O(n log n)), that has made wavelets such an important part of signal processing today. The complete code for this example along with a link to the ECG data is provided below.
Paul
To copy the data file provided by PhysioNet for this example click: ECG_AAMIEC13.data
This ECG data was taken from the ANSI EC13 test data set waveforms.
public void BlogECGExample()
{
// Define your own dataDir
var dataDir = "................";
// Load ECG wave from physionet.org data file.
string filename = Path.Combine( dataDir, "ECG_AAMIEC13.data.txt" );
string line;
int cnt = 0;
FloatVector ecgMeasurement = new FloatVector( 3000 );
var fileStrm = new System.IO.StreamReader( filename );
fileStrm.ReadLine(); fileStrm.ReadLine();
while ( ( line = fileStrm.ReadLine() ) != null && cnt < 3000 )
{
ecgMeasurement[cnt] = Single.Parse( line.Split( ',' )[1] );
cnt++;
}
// Choose wavelet
var wavelet = new FloatWavelet( Wavelet.Wavelets.D4 );
// Build DWT object
var dwt = new FloatDWT( ecgMeasurement.DataBlock.Data, wavelet );
// Decompose signal with DWT to level 5
dwt.Decompose( 5 );
// Find Universal threshold & threshold all detail levels with lambdaU
double lambdaU = dwt.ComputeThreshold( FloatDWT.ThresholdMethod.Universal, 1 );
dwt.ThresholdAllLevels( FloatDWT.ThresholdPolicy.Soft, new double[] { lambdaU, lambdaU, lambdaU, lambdaU, lambdaU } );
// Rebuild the signal to level 1 - the original (filtered) signal.
float[] reconstructedData = dwt.Reconstruct();
// Display DWT results.
BlogECGExampleBuildCharts( dwt, ecgMeasurement, reconstructedData );
}
public void BlogECGExampleBuildCharts( FloatDWT dwt, FloatVector ECGMeasurement, float[] ReconstructedData )
{
// Plot out approximations at various levels of decomposition.
var approxAllLevels = new FloatVector();
for ( int n = 5; n > 0; n-- )
{
var approx = new FloatVector( dwt.WaveletCoefficients( DiscreteWaveletTransform.WaveletCoefficientType.Approximation, n ) );
approxAllLevels.Append( new FloatVector( approx ) );
}
var detailsAllLevels = new FloatVector();
for ( int n = 5; n > 0; n-- )
{
var approx = new FloatVector( dwt.WaveletCoefficients( DiscreteWaveletTransform.WaveletCoefficientType.Details, n ) );
detailsAllLevels.Append( new FloatVector( approx ) );
}
// Create and display charts.
Chart chart0 = NMathChart.ToChart( detailsAllLevels );
chart0.Titles.Add( "Concatenated DWT Details to Level 5" );
chart0.ChartAreas[0].AxisY.Title = "DWT Details";
chart0.Height = 270;
NMathChart.Show( chart0 );
Chart chart1 = NMathChart.ToChart( approxAllLevels );
chart1.Titles.Add("Concatenated DWT Approximations to Level 5");
chart1.ChartAreas[0].AxisY.Title = "DWT Approximations";
chart1.Height = 270;
NMathChart.Show( chart1 );
Chart chart2 = NMathChart.ToChart( (new FloatVector( ReconstructedData ))[new Slice(500,500)] );
chart2.Titles[0].Text = "Thresholded & Reconstructed ECG Signal";
chart2.ChartAreas[0].AxisY.Title = "mV";
chart2.Height= 270;
NMathChart.Show( chart2 );
Chart chart3 = NMathChart.ToChart( (new FloatVector( ECGMeasurement ))[new Slice(500,500)] );
chart3.Titles[0].Text = "Raw ECG Signal";
chart3.ChartAreas[0].AxisY.Title = "mV";
chart3.Height = 270;
NMathChart.Show( chart3 );
}
The post Precision and Reproducibility in Computing appeared first on CenterSpace.
This issue of reproducibility arises with NMath users when writing and running unit tests, which is why it’s important when writing tests to compare floating point numbers only up to their designed precision, at an absolute maximum. With the IEEE 754 floating point representation, which virtually all modern computers adhere to, the single precision float type uses 32 bits or 4 bytes and offers 24 bits of precision, or about 7 decimal digits. The double precision double type requires 64 bits or 8 bytes and offers 53 bits of precision, or about 15 decimal digits. Few algorithms can achieve significant results to the 15th decimal place due to rounding, loss of precision due to subtraction, and other sources of numerical precision degradation. NMath’s numerical results are tested, at a maximum, to the 14th decimal place.
As an example, what does the following code output?
double x = .050000000000000003;
double y = .050000000000000000;
if ( x == y )
Console.WriteLine( "x is y" );
else
Console.WriteLine( "x is not y" );
I get “x is y”, which is mathematically false, but the value of x as written lies beyond the precision of a double type, so both literals round to the same stored value.
Due to these limits on decimal number representation and the resulting rounding, the numerical results of some operations can be affected by the associative reordering of operations. For example, in some cases a*x + a*z may not equal a*(x + z) with floating point types. This can be difficult to test with modern optimizing compilers, because the code you write and the code that actually runs may be organized quite differently, in ways that are mathematically equivalent but not necessarily numerically equivalent.
So reproducibility is impacted by precision via dynamic operation reorderings in the ALU and additionally by run-time processor dispatching, data-array alignment, and variation in thread number among other factors. These issues can create run-to-run differences in the least significant digits. Two runs, same code, two answers. This is by design and is not an issue of correctness. Subtle changes in the memory layout of the program’s data, differences in loading of the ALU registers and operation order, and differences in threading all due to unrelated processes running on the same machine cause these run-to-run differences.
Most importantly, one should test code’s numerical results only to the precision that can be expected by the algorithm, input data, and finally the limits of floating point arithmetic. To do this in unit tests, compare floating point numbers carefully only to a fixed number of digits. The code snippet below compares two double numbers and returns true only if the numbers match to a specified number of digits.
private static bool EqualToNumDigits( double expected, double actual, int numDigits )
{
double max = System.Math.Abs( expected ) > System.Math.Abs( actual ) ? System.Math.Abs( expected ) : System.Math.Abs( actual );
double diff = System.Math.Abs( expected - actual );
double relDiff = max > 1.0 ? diff / max : diff;
if ( relDiff <= DOUBLE_EPSILON )
{
return true;
}
int numDigitsAgree = (int) ( -System.Math.Floor( Math.Log10( relDiff ) ) - 1 );
return numDigitsAgree >= numDigits;
}
This type of comparison should be used throughout unit testing code. The full code listing, which we use for our internal testing, is provided at the end of this article.
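For reference, a rough Python translation of the same comparison logic might look like this (DOUBLE_EPSILON is chosen here as 1e-15 purely for illustration; in our C# test code it is a class constant):

```python
import math

DOUBLE_EPSILON = 1e-15   # assumed value for this sketch, not NMath's constant

def equal_to_num_digits(expected, actual, num_digits):
    """True if the two doubles agree to at least num_digits significant digits."""
    big = max(abs(expected), abs(actual))
    diff = abs(expected - actual)
    rel_diff = diff / big if big > 1.0 else diff
    if rel_diff <= DOUBLE_EPSILON:
        return True
    digits_agree = int(-math.floor(math.log10(rel_diff)) - 1)
    return digits_agree >= num_digits
```

For example, 1.234567 and 1.234568 agree to 6 significant digits, so a 5-digit comparison passes while a 7-digit comparison fails.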
If it is essential to enforce binary run-to-run reproducibility to the limits of precision, NMath provides a flag in its configuration class to ensure this is the case. However this flag should be set for unit testing only because there can be a significant cost to performance. In general, expect a 10% to 20% reduction in performance with some common operations degrading far more than that. For example, some matrix multiplications will take twice the time with this flag set.
Note that the number of threads that Intel’s MKL library (which NMath depends on) uses must also be fixed before setting the reproducibility flag.
int numThreads = 2; // This must be fixed for reproducibility.
NMathConfiguration.SetMKLNumThreads( numThreads );
NMathConfiguration.Reproducibility = true;
This reproducibility run configuration for NMath cannot be unset at a later point in the program. Note that both the number of threads and the reproducibility flag may also be set in the AppConfig or in environment variables. See the NMath User Guide for instructions on how to do this.
Paul
References
M. A. Cornea-Hasegan, B. Norin. IA-64 Floating-Point Operations and the IEEE Standard for Binary Floating-Point Arithmetic. Intel Technology Journal, Q4, 1999.
http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf
D. Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic. Computing Surveys. March 1991.
http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
Full double comparison code:
private static bool EqualToNumDigits( double expected, double actual, int numDigits )
{
bool xNaN = double.IsNaN( expected );
bool yNaN = double.IsNaN( actual );
if ( xNaN && yNaN )
{
return true;
}
if ( xNaN || yNaN )
{
return false;
}
if ( numDigits <= 0 )
{
throw new InvalidArgumentException( "numDigits is not positive in TestCase::EqualToNumDigits." );
}
double max = System.Math.Abs( expected ) > System.Math.Abs( actual ) ? System.Math.Abs( expected ) : System.Math.Abs( actual );
double diff = System.Math.Abs( expected - actual );
double relDiff = max > 1.0 ? diff / max : diff;
if ( relDiff <= DOUBLE_EPSILON )
{
return true;
}
int numDigitsAgree = (int) ( -System.Math.Floor( Math.Log10( relDiff ) ) - 1 );
//// Console.WriteLine( "expected = {0}, actual = {1}, rel diff = {2}, diff = {3}, num digits = {4}", expected, actual, relDiff, diff, numDigitsAgree );
return numDigitsAgree >= numDigits;
}
The post Special Functions appeared first on CenterSpace.
NMath now includes a new SpecialFunctions class, which is structured similarly to the existing StatsFunctions and NMathFunctions classes.
Below is a complete list of the special functions now available in the SpecialFunctions
class which resides in the CenterSpace.NMath.Core
namespace. Previously, a handful of these functions were available in either the NMathFunctions
or StatsFunctions
classes, but now those functions have been deprecated and consolidated into the SpecialFunctions
class. Please update your code accordingly as these deprecated functions will be removed from NMath within two to three release cycles.
Using these special functions in your code is simple.
using CenterSpace.NMath.Core;

// Compute the Jacobi elliptic function Sn() with a complex argument.
var cmplx = new DoubleComplex( 0.1, 3.3 );
var sn = SpecialFunctions.Sn( cmplx, .3 ); // sn = 0.16134 - i 0.99834

// Compute the complete elliptic integral, K(m).
var ei = SpecialFunctions.EllipticK( 0.432 ); // ei = 1.80039
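For readers who want to spot-check a few of these functions without NMath, Python's standard library exposes the log-gamma function, from which FactorialLn and BinomialLn analogues follow directly (illustrative helpers of my own, not NMath code):

```python
import math

def factorial_ln(n):
    """ln(n!), computed via the log-gamma function: ln(n!) = lgamma(n + 1)."""
    return math.lgamma(n + 1)

def binomial_ln(n, k):
    """ln of the binomial coefficient, n choose k."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

print(round(math.exp(factorial_ln(5))))    # 120 = 5!
print(round(math.exp(binomial_ln(5, 2))))  # 10  = C(5, 2)
```

Working in log space this way is exactly why FactorialLn and BinomialLn exist: n! overflows a double near n = 171, while its logarithm stays representable far beyond that.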
Below is a complete list of all NMath special functions.
| Special Function | Comments |
|---|---|
| EulerGamma | A constant, also known as the Euler–Mascheroni constant. Famously, rationality unknown. |
| Airy | Provides solutions Ai, Bi, and derivatives Ai’, Bi’ to y” – yz = 0. |
| Zeta | The Riemann zeta function. |
| PolyLogarithm | The polylogarithm, Li_n(x), reduces to the Riemann zeta for x = 1. |
| HarmonicNumber | The harmonic number is a truncated sum of the harmonic series, closely related to the digamma function. |
| Factorial | n! |
| FactorialLn | The natural log of the factorial, ln( n! ). |
| Binomial | The binomial coefficient, n choose k; the number of ways of picking k unordered outcomes from n possibilities. |
| BinomialLn | The natural log of the binomial coefficient. |
| Gamma | The gamma function, conceptually a generalization of the factorial. |
| GammaReciprocal | The reciprocal of the gamma function. |
| IncompleteGammaFunction | Computes the gamma integral from 0 to x. |
| IncompleteGammaComplement | Computes the gamma integral from x to infinity (and beyond!). |
| Digamma | Also known as the psi function. |
| GammaLn | The natural log of the gamma function. |
| Beta | The beta integral is also known as the Eulerian integral of the first kind. |
| IncompleteBeta | Computes the beta integral from 0 to x in [0,1]. |
| Ei | The exponential integral. |
| EllipticK | The complete elliptic integral, K(m), of the first kind. Note that m is related to the elliptic modulus k by m = k * k. |
| EllipticE( m ) | The complete elliptic integral, E(m), of the second kind. |
| EllipticF | The incomplete elliptic integral of the first kind. |
| EllipticE(phi, m) | The incomplete elliptic integral of the second kind. |
| EllipJ | Computes the Jacobi elliptic functions Cn(), Sn(), and Dn() for real arguments. |
| Sn | Computes the Jacobi elliptic function Sn() for complex arguments. |
| Cn | Computes the Jacobi elliptic function Cn() for complex arguments. |
| BesselI0 | Modified Bessel function of the first kind, order zero. |
| BesselI1 | Modified Bessel function of the first kind, first order. |
| BesselIv | Modified Bessel function of the first kind, non-integer order. |
| BesselJ0 | Bessel function of the first kind, order zero. |
| BesselJ1 | Bessel function of the first kind, first order. |
| BesselJn | Bessel function of the first kind, arbitrary integer order. |
| BesselJv | Bessel function of the first kind, non-integer order. |
| BesselK0 | Modified Bessel function of the second kind, order zero. |
| BesselK1 | Modified Bessel function of the second kind, order one. |
| BesselKn | Modified Bessel function of the second kind, arbitrary integer order. |
| BesselY0 | Bessel function of the second kind, order zero. |
| BesselY1 | Bessel function of the second kind, order one. |
| BesselYn | Bessel function of the second kind, integer order. |
| BesselYv | Bessel function of the second kind, non-integer order. |
| Hypergeometric1F1 | The confluent hypergeometric series of the first kind. |
| Hypergeometric2F1 | The Gauss or generalized hypergeometric function. |
Let us know if you need any additional special functions and we’ll see if we can add them.
Mathematically,
Paul Shirkey
[1] Abramowitz, M. and Stegun, I. (1965). Handbook of Mathematical Functions. Dover Publications. ( Abramowitz and Stegun PDF )
[2] Wolfram Alpha LLC. (2014). www.wolframalpha.com
[3] Weisstein, Eric W. “[Various Articles]” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/
[4] Stephen L. Moshier. (1995). The Cephes Math Library. (Cephes).
The post Announcing NMath 6.1 and NMath Stats 4.1 appeared first on CenterSpace.
Added functionality includes:
For more complete changelogs, see here and here.
Upgrades are provided free of charge to customers with current annual maintenance contracts. To request an upgrade, please visit our upgrade page, or contact sales@centerspace.net. Maintenance contracts are available through our webstore.
The post NMath Premium’s new Adaptive GPU Bridge Architecture appeared first on CenterSpace.
The adaptive GPU bridge API in NMath Premium 6.0 includes the following important new features.
As with the first release of NMath Premium, using NMath to leverage massively parallel GPUs never requires any kernel-level GPU programming or other specialized GPU programming skills. Yet the programmer can easily take as much control as needed to route executing threads or tasks to any available GPU device. In the following, after introducing the new GPU bridge architecture, we’ll discuss each of these features separately with code examples.
Before getting started on our NMath Premium tutorial, it’s important to consider your test GPU model. While many of NVIDIA’s GPUs provide a good to excellent computational advantage over the CPU, not all of NVIDIA’s GPUs were designed with general computing in mind. The “NVS” class of NVIDIA GPUs (such as the NVS 5400M) generally performs very poorly, as do the “GT” cards in the GeForce series. However, the “GTX” cards in the GeForce series generally perform well, as do the Quadro desktop products and the Tesla cards. While it’s fine to test NMath Premium on any NVIDIA card, testing on inexpensive consumer-grade video cards will rarely show any performance advantage.
With NMath there are three fundamental software entities involved with routing computations between the CPU and GPUs: GPU hardware devices represented by IComputeDevice instances, the Bridge classes which control when a particular operation is sent to the CPU or a GPU, and finally the BridgeManager which provides the primary means for managing the devices and bridges.
These three entities are governed by two important ideas.
Bridges are assigned to compute devices, and there is a strict one-to-one relationship between each Bridge and IComputeDevice. Once assigned, the bridge instance governs when computations will be sent to its paired GPU device or to the CPU.
Assigning a Bridge class to a device is one line of code with the BridgeManager.
BridgeManager.Instance.SetBridge( BridgeManager.Instance.GetComputeDevice( 0 ), bridge );
Assigning a thread, in this case the CurrentThread, to a device is again accomplished using the BridgeManager.
IComputeDevice cd = BridgeManager.Instance.GetComputeDevice( 0 );
BridgeManager.Instance.SetComputeDevice( cd, Thread.CurrentThread );
After installing NMath Premium, the default behavior is to create a default bridge and assign it to the GPU with device number 0 (generally the fastest GPU installed). Also by default, all unassigned threads will execute on device 0. This means that out of the box, with no additional programming, existing NMath code, once recompiled against the new NMath Premium assemblies, will route all appropriate computations to the device 0 GPU. All of the following discussions and code examples are ways to refine this default behavior to get the best performance from your GPU hardware.
Previously, only the NVIDIA GPU with device number 0 was supported by NMath Premium; this release removes that barrier. With version 6, work can be assigned to any installed NVIDIA device as long as the device drivers are up-to-date.
The work done by an executing thread is routed to a particular device using BridgeManager.Instance.SetComputeDevice(), as we saw in the example above. Any properly configured hardware device can be used here, including any NVIDIA device and the CPU. The CPU is simply viewed as another compute device and is always assigned a device number of -1.
var bmanager = BridgeManager.Instance;
var cd = bmanager.GetComputeDevice( -1 );
bmanager.SetComputeDevice( cd, Thread.CurrentThread );
....
cd = bmanager.GetComputeDevice( 2 );
bmanager.SetComputeDevice( cd, Thread.CurrentThread );
The first half of this snippet assigns the current thread to the CPU device (no code on this thread will run on any GPU); the second half then switches the current thread to GPU device 2. If an invalid compute device is requested, a null IComputeDevice is returned. To find all available computing devices, the BridgeManager offers an array property, Devices, which contains all detected compute devices including the CPU. The number of detected GPUs can be found using the property BridgeManager.Instance.CountGPU.
As an aside, keep in mind that PCI slot numbers do not necessarily correspond to GPU device numbers. NVIDIA assigns device number 0 to the fastest detected GPU, so installing an additional GPU into a machine may renumber the device numbers of the previously installed GPUs.
Assigning a Bridge to a GPU device doesn’t necessarily mean that all computation routed to that device will run on that device. Instead, the assigned Bridge acts as an intermediary between the CPU and the GPU, moving the larger problems to the GPU, where there’s a speed advantage, and retaining the smaller problems on the CPU. NMath has a built-in default bridge, but it may produce non-optimal run times depending on your hardware configuration or your customer’s. To improve hardware usage and performance, a bridge can be tuned once and then persisted to disk for all future use.
// Get a compute device and a new bridge.
IComputeDevice cd = BridgeManager.Instance.GetComputeDevice( 0 );
Bridge bridge = BridgeManager.Instance.NewDefaultBridge( cd );
// Tune this bridge for the matrix multiply operation alone.
bridge.Tune( BridgeFunctions.dgemm, cd, 1200 );
// Or just tune the entire bridge. Depending on the hardware and tuning parameters
// this can be an expensive one-time operation.
bridge.TuneAll( cd, 1200 );
// Now assign this updated bridge to the device.
BridgeManager.Instance.SetBridge( cd, bridge );
// Persisting the bridge that was tuned above is done with the BridgeManager.
// Note that this overwrites any existing bridge with the same name.
BridgeManager.Instance.SaveBridge( bridge, @".\MyTunedBridge" );
// Then loading that bridge from disk is simple.
var myTunedBridge = BridgeManager.Instance.LoadBridge( @".\MyTunedBridge" );
Once a bridge is tuned it can be persisted, redistributed, and used again. If three different GPUs are installed, this tuning should be done once for each GPU, and each bridge should then be assigned to the device it was tuned on. However, if there are three identical GPUs, the tuning need be done only once, persisted to disk, and later assigned to all of the identical GPUs. A bridge assigned to a GPU device for which it wasn't tuned will never produce incorrect results, only possibly suboptimal hardware performance.
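For example, a tune-once, assign-everywhere workflow for a machine with identical GPUs might be sketched as follows, using only the BridgeManager and Bridge members shown above; the bridge file name is arbitrary.

```csharp
// Sketch only: tune a bridge once on GPU 0, persist it, and later assign
// the same tuned bridge to every detected (identical) GPU.
var manager = BridgeManager.Instance;

// One-time tuning pass, done on a single representative GPU.
IComputeDevice gpu0 = manager.GetComputeDevice( 0 );
Bridge tuned = manager.NewDefaultBridge( gpu0 );
tuned.TuneAll( gpu0, 1200 );
manager.SaveBridge( tuned, @".\SharedGpuBridge" );

// At application start-up on machines with identical GPUs.
Bridge shared = manager.LoadBridge( @".\SharedGpuBridge" );
for ( int i = 0; i < manager.CountGPU; i++ )
{
  manager.SetBridge( manager.GetComputeDevice( i ), shared );
}
```

Because the GPUs are identical, the crossover points found during tuning apply equally well to each device.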
Once a bridge is paired with a device, threads may be assigned to that device for execution. This is not a necessary step, as all unassigned threads will run on the default device (typically device 0). However, suppose we have three tasks and three GPUs, and we wish to use one GPU per task. The following code does that.
...
IComputeDevice gpu0 = BridgeManager.Instance.GetComputeDevice( 0 );
IComputeDevice gpu1 = BridgeManager.Instance.GetComputeDevice( 1 );
IComputeDevice gpu2 = BridgeManager.Instance.GetComputeDevice( 2 );
if ( gpu0 != null && gpu1 != null && gpu2 != null )
{
  System.Threading.Tasks.Task[] tasks = new Task[3]
  {
    Task.Factory.StartNew( () => Task1Worker( gpu0 ) ),
    Task.Factory.StartNew( () => Task2Worker( gpu1 ) ),
    Task.Factory.StartNew( () => Task3Worker( gpu2 ) ),
  };
  // Block until all tasks complete.
  Task.WaitAll( tasks );
}
...
This code is standard C# code using the Task Parallel Library and contains no NMath Premium-specific API calls beyond passing a GPU compute device to each task. The task worker routines have the following simple structure.
private static void Task1Worker( IComputeDevice cd )
{
  // Pin this thread to the given compute device.
  BridgeManager.Instance.SetComputeDevice( cd );
  // Do work here.
}
The other two task workers are identical outside of whatever useful computing work they may be doing.
Good luck, and please post any questions in the comments below or email us at support AT centerspace.net, and we'll get back to you.
Happy Computing,
Paul
The post NMath Premium’s new Adaptive GPU Bridge Architecture appeared first on CenterSpace.
Prior to the Task Parallel Library (TPL) and the System.Threading.Task namespace, many .NET programmers never, or only under duress, wrote multi-threaded code. It's old news now that the TPL has reduced the complexity of writing threaded code by providing several new classes that make the process easier while eliminating some pitfalls. Leveraging the TPL API together with NMath Premium is a powerful combination for quickly getting code running on your GPU hardware without the burden of learning complex CUDA programming techniques.
The NMath Premium 6.0 library is now integrated with a new CPU-GPU hybrid-computing Adaptive Bridge™ Technology. This technology allows users to easily assign specific threads to a particular compute device and to manage computational routing between the CPU and multiple on-board GPUs. Each piece of installed computing hardware is uniformly treated as a compute device and managed in software as an immutable IComputeDevice. Currently the adaptive bridge allows a single CPU compute device (naturally!) along with any number of NVIDIA GPU devices. How NMath Premium interacts with each compute device is governed by a Bridge class; a one-to-one relationship between each Bridge instance and each compute device is enforced. All of the compute devices and bridges are managed by the singleton BridgeManager class.
These three classes, the BridgeManager, the Bridge, and the immutable IComputeDevice, form the entire API of the Adaptive Bridge™. With this API, nearly all programming tasks, such as assigning a particular Action<> to a specific GPU, are accomplished in one or two lines of code. Let's look at some code that does just that: run an Action<> on a GPU.
using CenterSpace.NMath.Matrix;

public void mainProgram( string[] args )
{
  // Set up an Action<> that runs on an IComputeDevice.
  Action<IComputeDevice, int> worker = WorkerAction;

  // Get the compute device we wish to run our
  // Action<> on - in this case GPU 0.
  IComputeDevice deviceGPU0 = BridgeManager.Instance.GetComputeDevice( 0 );

  // Do work
  worker( deviceGPU0, 9 );
}
private void WorkerAction( IComputeDevice device, int input )
{
  // Pin this thread to the given compute device.
  BridgeManager.Instance.SetComputeDevice( device );

  // Do all the hard work here on the assigned device.
  // Call various GPU-aware NMath Premium routines here.
  FloatMatrix A = new FloatMatrix( 1230, 900, new RandGenUniform( -1, 1, 37 ) );
  FloatSVDecompServer server = new FloatSVDecompServer();
  FloatSVDecomp svd = server.GetDecomp( A );
}
It's important to understand that only operations where the GPU has a computational advantage are actually run on the GPU. It's not as though all of the code in the WorkerAction runs on the GPU; only the code where it makes sense, such as SVD, QR decomposition, matrix multiply, eigenvalue decomposition, and so forth. Using this as a code template, you can easily run your own worker several times, passing in a different compute device each time, to compare the computational advantages or disadvantages of the various devices, including the CPU compute device.
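As a sketch, that comparison might look like the following, reusing the WorkerAction above. Device numbering follows the convention described in this post (-1 is the CPU, 0 and up are GPUs); the timing code is illustrative.

```csharp
// Sketch only: run the same worker on the CPU (-1) and on each detected
// GPU, timing each run to compare the devices.
var manager = BridgeManager.Instance;
for ( int deviceNumber = -1; deviceNumber < manager.CountGPU; deviceNumber++ )
{
  IComputeDevice device = manager.GetComputeDevice( deviceNumber );
  if ( device == null ) continue;

  var timer = System.Diagnostics.Stopwatch.StartNew();
  WorkerAction( device, 9 );
  timer.Stop();
  Console.WriteLine( device.DeviceName + ": " + timer.ElapsedMilliseconds + " ms" );
}
```

Note that each iteration re-pins the current thread via SetComputeDevice inside WorkerAction, so the runs execute sequentially on different devices.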
In the above code example the BridgeManager is used twice: once to get an IComputeDevice reference, and once to assign a thread (the Action<>'s thread in this case) to the device. The Bridge class didn't come into play, since we implicitly relied on a default bridge being assigned to our compute device of choice. Relying on the default bridge will likely result in inferior performance, so it's best to use a bridge that has been specifically tuned to your NVIDIA GPU. The following code shows how to accomplish bridge tuning.
// Here we get the bridge associated with GPU device 0.
var cd = BridgeManager.Instance.GetComputeDevice( 0 );
var bridge = (Bridge) BridgeManager.Instance.GetBridge( cd );

// Tune the bridge and save it. Tuning can take a few minutes.
bridge.TuneAll( cd, 1200 );
BridgeManager.Instance.SaveBridge( bridge, "Device0Bridge.bdg" );
This bridge tuning is typically a one-time operation per computer; once done, the tuned bridge can be serialized to disk and then reloaded at application start-up. If new GPU hardware is installed, the tuning operation should be repeated. The following code snippet loads a saved bridge and pairs it with a device.
// Load our serialized bridge.
Bridge bridge = BridgeManager.Instance.LoadBridge( "Device0Bridge.bdg" );
// Now pair this saved bridge with compute device 0.
var device0 = BridgeManager.Instance.GetComputeDevice( 0 );
BridgeManager.Instance.SetBridge( device0, bridge );
Once the tuned bridge is assigned to a device, the behavior of all threads assigned to that device will be governed by that bridge. In a typical application the pairing of bridges to devices is done at start-up and not altered again, while the assignment of threads to devices may be done frequently at runtime.
It's interesting to note that beyond optimally routing small and large problems to the CPU and GPU respectively, bridges can be configured to shunt all work to the GPU regardless of problem size. This is useful for testing, and for offloading work to a GPU when the CPU is taxed. Even if a particular problem runs slower on the GPU than on the CPU, offloading work to an otherwise idle GPU will enhance performance when the CPU is fully occupied.
I'm going to wrap up this blog post with a complete C# code example which runs a matrix multiplication task simultaneously on two GPUs and the CPU. The framework of this example uses the TPL and aspects of the adaptive bridge already covered here. I ran this code on a machine with two NVIDIA GeForce GPUs, a GTX 760 and a GT 640, and the timing results for executing a large matrix multiplication are shown below.
Finished matrix multiply on the GeForce GTX 760 in 67 ms.
Finished matrix multiply on the Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz in 103 ms.
Finished matrix multiply on the GeForce GT 640 in 282 ms.
Finished all double precision matrix multiplications in parallel in 282 ms.
The complete code for this example is given in the section below. In this run the GeForce GTX 760 easily finished first in 67 ms, followed by the CPU, and finally by the GeForce GT 640. It's expected that the GeForce GT 640 would not do well in this example, because it's optimized for single-precision work and these matrix multiplies are double precision. Nevertheless, this example shows that it's programmatically simple to push work to any NVIDIA GPU, and that in a threaded application even a relatively slow GPU can be used to offload work from the CPU. Also note that the entire program ran in 282 ms, the time required to finish the matrix multiply on the slowest hardware, verifying that all three tasks did run in parallel and that there was very little overhead in using the TPL or the Adaptive Bridge™.
Below is a snippet of the NMath Premium log file generated during the run above.
Time                        tid  Device#  Function  Device Used
2014-04-28 11:22:47.417 AM  10   0        dgemm     GPU
2014-04-28 11:22:47.421 AM  15   1        dgemm     GPU
2014-04-28 11:22:47.425 AM  13   -1       dgemm     CPU
We can see here that three threads were created nearly simultaneously, with thread IDs of 10, 15, and 13, and that the first two threads ran their matrix multiplies (dgemm) on GPUs 0 and 1 while the last thread, 13, ran on the CPU. As a matter of convention the CPU device number is always -1 and all GPU device numbers are integers 0 and greater. Typically device number 0 is assigned to the fastest installed GPU, and that is the default GPU used by NMath Premium.
-Paul
public void GPUTaskExample()
{
  NMathConfiguration.Init();

  // Set up a string writer for logging.
  using ( var writer = new System.IO.StringWriter() )
  {
    // Enable the CPU/GPU bridge logging.
    BridgeManager.Instance.EnableLogging( writer );

    // Get the compute devices we wish to run our tasks on - in this case
    // two GPUs and the CPU.
    IComputeDevice deviceGPU0 = BridgeManager.Instance.GetComputeDevice( 0 );
    IComputeDevice deviceGPU1 = BridgeManager.Instance.GetComputeDevice( 1 );
    IComputeDevice deviceCPU = BridgeManager.Instance.CPU;

    // Build some matrices.
    var A = new DoubleMatrix( 1200, 1400, 0, 1 );
    var B = new DoubleMatrix( 1400, 1300, 0, 1 );

    // Build the task array and assign matrix multiply jobs and compute devices
    // to those tasks. Any number of tasks can be added here and any number
    // of tasks can be assigned to a particular device.
    Stopwatch timer = new Stopwatch();
    timer.Start();
    System.Threading.Tasks.Task[] tasks = new Task[3]
    {
      Task.Factory.StartNew( () => MatrixMultiply( deviceGPU0, A, B ) ),
      Task.Factory.StartNew( () => MatrixMultiply( deviceGPU1, A, B ) ),
      Task.Factory.StartNew( () => MatrixMultiply( deviceCPU, A, B ) ),
    };

    // Block until all tasks complete.
    Task.WaitAll( tasks );
    timer.Stop();
    Console.WriteLine( "Finished all double precision matrix multiplications in parallel in " + timer.ElapsedMilliseconds + " ms.\n" );

    // Dump the log file for verification.
    Console.WriteLine( writer );

    // Quit logging.
    BridgeManager.Instance.DisableLogging();
  }
}

private static void MatrixMultiply( IComputeDevice device, DoubleMatrix A, DoubleMatrix B )
{
  // Pin this thread to the given compute device.
  BridgeManager.Instance.SetComputeDevice( device );

  Stopwatch timer = new Stopwatch();
  timer.Start();

  // Do this task's work.
  NMathFunctions.Product( A, B );

  timer.Stop();
  Console.WriteLine( "Finished matrix multiply on the " + device.DeviceName + " in " + timer.ElapsedMilliseconds + " ms.\n" );
}
The post Distributing Parallel Tasks on Multiple GPU’s appeared first on CenterSpace.
Added functionality includes:
For more complete changelogs, see here and here.
Upgrades are provided free of charge to customers with current annual maintenance contracts. To request an upgrade, please contact sales@centerspace.net. Maintenance contracts are available through our webstore.
The post Announcing NMath 6.0 and NMath Stats 4.0 appeared first on CenterSpace.