A few weeks ago we had the annual workshop of one of the groups I’m involved in, the POWER Acceleration and Design Centre (PADC).

In the scope of the PADC we investigate new processors offered by IBM and the OpenPOWER consortium and how well their architectural choices map to applications. One of the features of the latest incarnation of the POWER processor chip is its connection to NVIDIA’s GPUs: the POWER8NVL employs NVLink, a new, larger bus, to connect to the GPU device. The processor can use NVLink to exchange data with the GPU more than four times as fast as over the usual PCI-Express interface.1 Neat!
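To put that factor into perspective, here is a back-of-the-envelope sketch using the peak bandwidths from footnote 1. It assumes an idealized transfer (time = size / bandwidth, no latency or protocol overhead), and the 4 GB payload is a hypothetical value chosen only for illustration; real transfers will be slower than this.

```python
# Peak bandwidths as quoted in footnote 1 (GB/s). Achieved rates are lower in practice.
PCIE_GEN3_GBPS = 16.0
NVLINK_GBPS = 80.0


def transfer_time_s(size_gb: float, bandwidth_gbps: float) -> float:
    """Idealized transfer time in seconds: size / bandwidth, ignoring latency."""
    if bandwidth_gbps <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gbps


def speedup(fast_gbps: float, slow_gbps: float) -> float:
    """Ratio of two peak bandwidths."""
    return fast_gbps / slow_gbps


if __name__ == "__main__":
    size_gb = 4.0  # hypothetical payload size, for illustration only
    t_pcie = transfer_time_s(size_gb, PCIE_GEN3_GBPS)
    t_nvlink = transfer_time_s(size_gb, NVLINK_GBPS)
    print(f"PCIe Gen3: {t_pcie:.3f} s, NVLink: {t_nvlink:.3f} s")
    print(f"Peak bandwidth ratio: {speedup(NVLINK_GBPS, PCIE_GEN3_GBPS):.1f}x")
```

With the quoted peak numbers the ratio comes out at a factor of five, which is where the "more than four times" comes from.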

I have yet to dive fully into the new world of POWER8NVL, NVLink, and NVIDIA’s Pascal GPU on the other side, since there are only a few systems available right now. It’s brand new. But for evaluating the integrated design of POWER8 CPU and Pascal GPU for a specific project (the Human Brain Project, read more about the pre-commercial procurement here) we actually received a small test system with this brand-new architecture.2 Unfortunately, the machine only arrived shortly before the PADC workshop. There was no time for extended tests. But on Sunday afternoon before Monday’s workshop I managed to measure at least one aspect of one of my app’s behaviors. Yay!

In the second part of the presentation, you can see the performance of JuSPIC, a plasma physics application I’m researching, on the Pascal P100 GPU in a POWER8 system, evaluated under the assumption of a simple information exchange model. In the somewhat larger first part of the talk, I show which techniques I used to begin accelerating the application on the GPU. I started out with OpenACC, a pragma-based acceleration programming model, but soon found out that the code is a bit too complex for the compiler I use. See the slides for how it turned out.

I hope to continue the acceleration as well as the performance analysis (with a more refined model) soon. But I’m busy with other cool stuff right now.

You can find a handout version of the slides on the webpage of the workshop, or after the click. The version with all the overlays is available as well.

Let me know what you think!3

1. PCIe Gen3: 16 GB/s, NVLink (Device to Host): 80 GB/s

2. Well. Small, as in multiple P100s, each with about 10 TFLOP/s of single-precision performance…

3. I still do not have comments in this static blog engine. So you either need to tweet at me (@AndiH) or send me an email (a.herten@fz-ju…).