C++ Accelerate Massive Parallelism

by Call me dave... 25. June 2011 22:43

I have been an avid follower of Herb Sutter’s writings for years, from his magazine articles in the C++ Report and the C/C++ Users Journal and his books on C++ development to, more recently, his series of posts on parallel programming and concurrency for Dr. Dobb’s.  Herb is the chair of the C++ standards committee, where he and others are close to finalising the C++0x standard.  In essence, Herb knows his stuff.

Herb recently announced C++ AMP at the AMD Fusion conference and I was blown away.  Microsoft have modified their C++ compiler so that it can target both the CPU and a DirectX-compatible graphics card within the same program.  This enables a developer to write code that uses the hundreds of cores available on graphics cards to perform blazingly fast parallel computations, without resorting to the current practice of writing code either in a language designed for graphics processing (High Level Shader Language [HLSL] or GL Shading Language [GLSL]) or in a C-like language (CUDA or OpenCL) where the OO benefits of C++ are missing.

Instead, the keyword restrict is used to mark a function as able to execute on the graphics card.  The developer then uses a subset of the C++ language (enforced by the compiler) to implement the computation.  The complexity of moving data between the CPU, RAM and the GPU is also reduced by a set of classes that automatically handle the marshalling of data between the devices.  The compiler converts the C++ code into HLSL code, which is then embedded into the executable.  At runtime the HLSL code is sent to the DirectX driver, which in turn converts it into the appropriate machine code for the particular device and executes it on the graphics card.

The example below (from Daniel Moth’s presentation - URL below) is a simple matrix by matrix multiplication.  It’s clear that the outer for loop can be parallelised.  An interesting point is the complex indexing needed to pull values out of vA and vB, which are both one-dimensional arrays but actually store a two-dimensional structure row by row.

#include <vector>
using std::vector;

void MatrixMultiply( vector<float>& vC,
                     const vector<float>& vA,
                     const vector<float>& vB, int M, int N, int W )
{
  // vA is M x W, vB is W x N and vC is M x N, all stored row by row
  for (int y = 0; y < M; y++) {
    for (int x = 0; x < N; x++) {
      float sum = 0;
      for (int i = 0; i < W; i++)
        sum += vA[y * W + i] * vB[i * N + x];
      vC[y * N + x] = sum;
    }
  }
}

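
To make the indexing concrete, here is a minimal sketch in plain standard C++ that names the offset calculation. The helpers flatIndex and multiplyFlat are hypothetical names used only for illustration; they are not from the presentation, and the sketch assumes the matrices are stored row by row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major layout: element (row, col) of a matrix with `cols` columns
// lives at offset row * cols + col in the flat vector.
inline std::size_t flatIndex(std::size_t row, std::size_t col, std::size_t cols) {
    assert(col < cols);
    return row * cols + col;
}

// Same algorithm as the listing above, with the index arithmetic named.
// A is M x W, B is W x N, and the returned matrix is M x N.
std::vector<float> multiplyFlat(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t M, std::size_t N, std::size_t W) {
    assert(A.size() == M * W && B.size() == W * N);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t y = 0; y < M; ++y) {
        for (std::size_t x = 0; x < N; ++x) {
            float sum = 0.0f;
            for (std::size_t i = 0; i < W; ++i)
                sum += A[flatIndex(y, i, W)] * B[flatIndex(i, x, N)];
            C[flatIndex(y, x, N)] = sum;
        }
    }
    return C;
}
```

For a 2 x 3 matrix stored as {1, 2, 3, 4, 5, 6}, element (1, 2) sits at offset 1 * 3 + 2 = 5, which holds the value 6.
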
Using C++ AMP, the matrices are marshalled across to the GPU by wrapping the vectors in array_views.  The array_views also project the two-dimensional matrix onto the one-dimensional vector, so the complex indexing code isn’t present in the C++ AMP version.  The code that executes on the GPGPU is the body of the lambda expression that is marked with the restrict(direct3d) keyword.  On the graphics card the texture that receives the result of the computation is write only, hence variable c is of type array_view<writeonly<float>,2>.

void MatrixMultiply( vector<float>& vC,
                     const vector<float>& vA,
                     const vector<float>& vB, int M, int N, int W )
{
  array_view<const float,2> a(M,W,vA), b(W,N,vB);
  array_view<writeonly<float>,2> c(M,N,vC);
  // the lambda body runs on the GPU, once per element of c
  parallel_for_each(c.grid, [=](index<2> idx) restrict(direct3d) {
    float sum = 0;
    for (int i = 0; i < a.extent.x; i++)
      sum += a(idx.y, i) * b(i, idx.x);
    c[idx] = sum;
  });
}
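
To see what the projection amounts to, here is a rough sketch in plain standard C++ that runs sequentially on the CPU. View2D and multiplyViews are hypothetical names invented for illustration; they are not part of C++ AMP and do no marshalling to the GPU. The sketch only shows how a thin, non-owning wrapper can hide the row * cols + col arithmetic behind a two-dimensional call syntax.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the idea behind array_view: a non-owning
// 2D window over flat, row-major storage. Not part of C++ AMP.
template <typename T>
class View2D {
public:
    View2D(std::size_t rows, std::size_t cols, T* data)
        : rows_(rows), cols_(cols), data_(data) {}

    // Two-dimensional access; the flat offset is computed in one place.
    T& operator()(std::size_t row, std::size_t col) const {
        assert(row < rows_ && col < cols_);
        return data_[row * cols_ + col];
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_;
    std::size_t cols_;
    T* data_;
};

// CPU-only matrix multiply written against the views.
// vA is M x W, vB is W x N, vC must already hold M * N elements.
void multiplyViews(std::vector<float>& vC,
                   const std::vector<float>& vA,
                   const std::vector<float>& vB,
                   std::size_t M, std::size_t N, std::size_t W) {
    assert(vA.size() == M * W && vB.size() == W * N && vC.size() == M * N);
    View2D<const float> a(M, W, vA.data());
    View2D<const float> b(W, N, vB.data());
    View2D<float> c(M, N, vC.data());
    for (std::size_t row = 0; row < c.rows(); ++row) {
        for (std::size_t col = 0; col < c.cols(); ++col) {
            float sum = 0.0f;
            for (std::size_t i = 0; i < a.cols(); ++i)
                sum += a(row, i) * b(i, col);
            c(row, col) = sum;
        }
    }
}
```

The inner loop now reads like the mathematical definition, which is the same readability gain the array_view version gets.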

There are millions of lines of C++ code used in the finance industry for modelling derivatives, and many of the larger institutions are now looking at GPGPU technology to give them an edge in the competitive world of algorithmic trading.  CUDA is currently the default choice for GPGPU development; it will be very interesting to see in twelve months whether this trend reverses due to the simplicity of parallelising a function using C++ AMP.

Herb’s keynote.

Daniel Moth: Blazing-fast code using GPUs and more, with C++ AMP

NVidia will also support C++ AMP but points out that CUDA is available for Linux and Mac as well as Windows.

Finally, Microsoft have indicated that they will take their C++ extensions to a standards body, thus enabling other compiler vendors to implement the restrict keyword in their products.  Hopefully this will mean that in a couple of years it will be possible to have C++ code target other devices such as FPGAs, the PS3’s Cell processor and Intel’s Many Integrated Core (MIC) daughter board.


Side Note: 

Eventually all good ideas come back again. When C with Classes was being created it had a readonly and a writeonly keyword, and when C++ was created the C standard committee liked the concept so much they asked that the readonly keyword be renamed to const and added it to the language. Meanwhile writeonly was dropped entirely. Instead of adding writeonly as a keyword, Microsoft have gone down the library route, i.e. writeonly<T> instead of writeonly T. Personally I think a keyword would have been far cleaner, but then again I don’t work on the standards committee and have no idea how complex it would be to add into the official ISO C++ standard…  But considering how long it’s taken to complete this version of the standard, I can’t say I am terribly surprised...
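
To illustrate what the library route can look like, here is a minimal sketch of a write-only wrapper in standard C++. WriteOnly and storeSum are hypothetical names made up for this example; this is not how Microsoft's writeonly<T> is implemented, only an assumption about the general idea of offering assignment while providing no way to read the value back.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical write-only handle: callers can assign through it,
// but it exposes no getter and no conversion back to T.
template <typename T>
class WriteOnly {
public:
    // Non-owning: the target must outlive the handle.
    explicit WriteOnly(T* target) : target_(target) {
        assert(target != nullptr);
    }

    // Writing is the only operation offered.
    WriteOnly& operator=(const T& value) {
        *target_ = value;
        return *this;
    }

private:
    T* target_;
};

// The type system rejects reads: there is no conversion to the value type.
static_assert(!std::is_convertible<WriteOnly<float>, float>::value,
              "WriteOnly must not be readable as a plain value");

// Usage: the callee can fill the slot but cannot inspect it.
inline void storeSum(WriteOnly<float> out, float a, float b) {
    out = a + b;
}
```

A keyword would let the compiler enforce the same rule on any variable, whereas the wrapper enforces it only for code that goes through the handle.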


Coding Dojos at the workplace

by Call me dave... 25. June 2011 20:58

You’re an IT manager at a small manufacturing company and have a staff of 5.  In two months’ time your team is slated to replace the internal HR web site that is currently running on Classic ASP. The architect and lead developer have selected the following software stack:

  • Entity Framework
  • .NET 4.0

The team is very experienced; however, for the last 2 years they have been maintaining the sales web site, which runs on ASP.NET and .NET 2.0.  Due to production support constraints you are unable to send the team on an externally run training course…

Over my career I have seen this situation played out far too many times.  Teams spend the first couple of weeks or months learning on the job and end up having to replace most of the code that was written at the project’s commencement.

A Coding Dojo is a meeting where a bunch of coders get together to work on a programming challenge. They are there to have fun and to engage in DeliberatePractice in order to improve their skills. (ref Coding Dojo)

What’s the best way to learn a new technology, pattern or programming language?  By playing, doing, prototyping, fiddling and experimenting with it.  What’s the best way for a team to learn a new technology, pattern or programming language? By doing it together!!

Coding Dojos rules:

  • A challenge isn’t solved or beaten without code, i.e. a paper design doesn’t cut it
  • Code doesn’t exist unless there are tests
  • Work as a team/hook up with an expert


  • Prepared Kata – Presenter solves the problem in front of an audience but can/should/will change his solution based on ideas/enhancements/feedback from the audience
  • Randori Kata – A pair of developers are selected from the audience.  The pair work together to solve the problem and every 15 or 20 minutes one of the pair is swapped with another audience member.  Preferably everyone in the audience should have an opportunity to sit at the computer and bang out some code.

In a Coding Dojo “meet up” the Randori Kata can be intimidating for a newcomer – no one wants to get a mental blank in front of a group of people or have a group commenting on their coding style.  This should be less of an issue in a work environment where the developers already know each other.  In fact the format ensures that everyone participates in the learning experience.

Workplace <Insert Technology here> Coding Dojo:

  1. Prep work
    1. Find the most experienced person in the team who has already used the technology
    2. Ask them to create a 30 minute demo (PowerPoint is fine) where the technology is used
    3. Ask them to create a “Cheat sheet” that consists of 3 or 4 pages of generic code snippets that show how the technology is used for different situations
    4. Create a coding problem (don’t tell the team or the person that wrote the Cheat Sheet) that can be solved reasonably easily using the technology and ensure that the solution to the problem utilises items within the Cheat Sheet
  2. Kata
    1. Have the team member present the 30 minute introduction demo
    2. Provide everyone with a copy of the “Cheat Sheet”
    3. Spend 10 minutes discussing the technology and the contents of the Cheat Sheet
    4. Explain the coding problem
    5. Randomly choose a developer to join the presenter – Coding in a pair is less stressful
    6. Have the presenter/colleague pair start solving the problem (make sure they write tests)
    7. Following the Randori Kata format, rotate through the developers every 15 to 20 minutes
    8. After two hours of coding stop the clock
    9. Spend 10 minutes in a discussion
      1. What was hard?
      2. What was easy?
      3. How confident do people feel?
      4. What else do we need to learn?