• ## SC17 Tutorial: Application Porting and Optimization on GPU-accelerated POWER Architectures

At last year’s Supercomputing Conference we held a full-day tutorial on Application Porting and Optimization on GPU-Accelerated POWER Architectures. And this year we did it again!

While last year’s tutorial was its first-ever incarnation, we were able to fine-tune the curriculum for this year a bit. The majority of the tutorial (e.g. the overall structure: CPU in the morning, GPU in the afternoon) was the same, but we changed quite a few details.

• The introductory lecture introduced the POWER/Minsky architecture in more detail. The speaker also introduced some theoretical concepts and explained the application we were using during the hands-on parts. Because…
• … we had one common example source code for each of the 3½ hands-on sessions. Based up on this, each session taught their respective examples.
• My lecture on performance counters and measurement was reduced in time because the new introductory session took some of it. I trimmed down the content and moved lots of the stuff into the appendix. I re-did my hands-on session to focus on the new common source code – a Jacobi example – and also implemented more meaningful performance counters: Misses in L1/L2/L3 caches.
• The lecture on application optimization focused on platform-specific optimization, in addition to compiler flags. The speaker also used the common example and exploited the many cores of the processor by OpenMP.
• The GPU part was virtually untouched except for some updates. The speaker here used OpenACC to bring the application to a) a GPU and b) a number of GPUs.
• In the final presentation, the speaker presented some real-world examples using the Minsky machine and how they choose to utilize the fat node; also: Deep Learning.

I think the changes to the content made the tutorial only more useful, as now one common theme is established. Also, we introduce modern programming models (OpenMP, OpenACC) and tools (PAPI, perf) through practical examples in passing. I like it.

For my part on performance counters, I was able to measure the sizes of the processor’s three cache levels by counting the number of cache misses and relating them to the source code. A rather practical learning experience, IMHO.

The evaluation is not yet out for this one, but speaking to individual people after the sessions I heard good things. It’s yet to be decided if we submit another application to host the tutorial again next year.

You can find the different talks of the tutorial at the corresponding Indico site or on JuSER. My slides are also available locally and embedded further down. Also for this one, there’s a overlay-reduced version.

• ## TALK: Brain Research Applications on Minsky

Right before the Supercomputing Conference in Denver, CO, I participated in a workshop titled OpenPOWER Academic Discussion Group. In it, academics invested in the OpenPOWER platform came together to present their current work and what the architecture holds for the future. Since I’m part of the POWER Acceleration and Design Centre and have hands-on experience in administrating JURON (our IBM POWER8’ Minsky server), it was a good occasion to present news from Jülich.

My talk, Brain Research Applications on Minsky, covers a selection of applications running on JURON from the neuroscience community. JURON was acquired as a pre-commercial procurement system in the scope of the Human Brain Project, so it makes sense to use this as a POWER use-case.

For the talk I read some background on the history of brain research, since neuroscience is not my research field at all. I also spoke to colleagues of mine involved in neuroscientific applications using JURON. They were so kind to provide me with content I could use and offered time to chat about neuroscience in general and their applications specifically. I picked applications which participated in our GPU Hackathon at the beginning of this year, since I already knew a few things about then.1

In the end, this was quite a novel experience, talking about other people’s research from a field I know little about. I did my best and hope everything turned out well. Personally, I think it went alright!

Also this talk is available at JuSER, locally here, or embedded below. Since I needed to think carefully what I was going to say about each slide, I took some notes to guide my words. Based on this, there’s a version of the slides appended by notes, thanks to the \note{} LaTeX Beamer feature.

1. I have not written about the GPU Hackathon here, have I? Damn. I wrote something in Jülich’s blogs, though.

• ## TALK/TUT: GPU Introduction and OpenACC Tutorial at GridKa Summer School 2017

Remember? Last year I held a talk at Karlsruhe Institute of Technology’s GridKa Summer School titled GPUs: Platform, Programming, Pitfalls introducing GPUs and stuff.

As we all know, the topic is still interesting. Also, my evaluation was quite alright, so the organizers invited me this year again to introduce GPUs at 2017’s make science && run-titled school. In addition, they asked if I wanted to hold a tutorial on OpenACC programming. I was happy to oblige.

The talk, GPU Programming 101, is basically last year’s talk with some tweaks and updates (Volta, …). But instead of running through the pitfalls – which was hard to do in the time constraints anyway – I decided rather to present some alternative GPU programming models. That is fitting with respect to the OpenACC tutorial, but also gives me a chance to present something really current in GPU programming: Higher-level, portable models.1

In the afternoon I have about five hours to host an OpenACC introduction. I present a boiled-down version of our JSC OpenACC course. I need to create a lot of slides, since I’ve never held the OpenACC introduction part of the course myself. Since I’m at it, I refine the curriculum a bit. Also, I use slightly new programming examples.2 During train travels to Karlsruhe (time is short, as usual), I add a OpenACC↔OtherGPU interoperability section to the tutorial; just in case we are finishing early. Turns out, the participants in the course were all quite apt and we indeed went through the interoperability stuff.

Again, the evaluation of talk and tutorial was quite good. The talk got a 9/10; all four tutorial-feedbacks were 10/10. Nice! Makes the effort worth it.

I made some time-consuming LaTeX fanciness in the slides, which are going into a dedicated posting.

Materials:

• Slides of the GPU-introducing talk are on JuSER, here (handout version), or embedded further down.
• Slides of the OpenACC tutorial are on JuSER, here (handout version), or embedded below the GPU slides. It’s 110 slides, so beware.3 If you’re interested in the source code of the tasks, contact me.
1. I still need a chance to properly test-run Kokkos, though.

2. I decide to go all-in in the method I introduced for our JSC courses for generating code. There, I have a master source code file from which the individual tasks are generated. This is done by appending the master source code with pre-processor macros which are resolved using the C partial pre-processor, cppp. When doing only one part of the tutorial, only a few tasks are generated from one master source code. But in a five-hour tutorial, where all tasks build up on each other, I ended up with eight tasks. Each accompanied by a solution version of the source code. So many #ifdefs. So many.

3. Which take like 10 minutes to typeset from scratch. Or 3 minutes for my faster computer…

• ## PAPER/TALK: GPU-Accelerated Particle-in-Cell Code on Minsky

During this year’s International Supercomputing Conference (ISC) in June in Frankfurt, I attended a workshop called International Workshop on OpenPOWER for HPC (IWOPH). I submitted and presented a paper titled GPU-Accelerated Particle-in-Cell Code on Minsky, which is now released (see below for PDF).

The workshop was about use cases of OpenPOWER systems (mainly, IBM’s POWER8NVL processor) in the context of high-performance computing. With our JURON cluster at JSC 1, this is quite fitting. A good chance, to revisit my port of JuSPIC to OpenACC, test it well, and write about it. You’ve read already about JuSPIC+OpenACC here on the blog before.

The paper goes through the work I already did before related to JuSPIC with OpenACC; i.e. the challenges of formulating the source code in a way that OpenACC can efficiently generate GPU-accelerated functions and my decision, to eventually move over to CUDA Fortran 2.

I run through the different optimizations I evaluate in order to speed things up, looking closely on different stages of the processing pipeline (data allocation, conversion, copy, processing, copy, conversion). Turns out, a structure-of-array approach is the fastest data container to crunch numbers with on the GPU device. Who would’ve thought – but there actually are some interesting and even unexpected results.

On of the unexpected results is that the conversion of the data structure takes an awful lot of time. Background: JuSPIC is written with one piece of data pointing to the next piece of data in a chain (a linked list). This is quite natural, but not optimal for modern computer architectures. Instead of data potentially scattered throughout memory, they want data packed closely together in memory so that as much data as possible can be fetched with as few instructions as possible. To achieve good performance, I convert data from JuSPIC’s linked list as soon as it enters the GPU-accelerated scope, do stuff, and convert it back afterwards. And, now back to the unexpected result, especially this back-conversion turns out to be very slow on the system tested (JURON).

Because things happen as they always do, I only recognized that during the last couple of days before the paper deadline. So, as far as time allowed for it, I did quite a few tests together with my colleagues. And we were able to pin-point were the issues comes from, studying the generated assembly, and could eventually integrate a temporary workaround. 3 I also filed a bug report with the compiler vendor, which got fixed two weeks ago. I do still need to test it, though.

In the last part of the paper I model JuSPIC’s GPU performance by simple performance models. I look at the effective bandwidth of the application on different GPU architectures. In addition, I study the dependency on GPU clock rates. For sure the (then) latest GPU we had access to, the NVIDIA Tesla P100, is the best choice for a GPU in JuSPIC’s case. So many multiprocessors, so much bandwidth.

The hunt for the allocation bug was quite involving, with different compiler versions, systems, system architectures, and MPIs on top. I think it was just luck that we noticed that the performance reduction was related to the specific MPI. Bug hunt feels always like Sherlock Matrix.

The paper is available at Springer, both in web and in PDF form as part of the High Performance Computing book. Unfortunately it is not Open Access. But I’m allowed to share the pre-print of the document, which is attached to the entry in Jülich’s library, JUSER, or here. 4

I presented the paper at the workshop in a talk, which can be found below (and on JUSER). For a first time, I created a notes version of the talk, which includes additional content not on the presented slides but things I talked about – intended for after-talk readers. Using LaTeX Beamer this is quite easy to do, and a good addition IMHO. More on that soon™ in a dedicated post.

1. A node of JURON is based on Minsky servers, the cluster has 18 of them.

2. Which was actually closer to the original source code than the OpenACC-optimized Fortran code.

3. The quickest one was using the linkers LD_PRELOAD environment variable to inject a different malloc() function…

4. The editors decided to reduce white space in my tables and not use booktabs as the LaTeX package. So, the nicer tables are in the pre-print version…

• ## POSTER: Accelerating Plasma Physics with GPUs – JuSPIC with OpenACC and CUDA Fortran

At GPU Technology Conference (GTC) this year I presented a poster on my acceleration work on JuSPIC with OpenACC. See below for an embed.

The main part of the poster are the different stages I went through while porting JuSPIC to OpenACC and to efficient OpenACC. It is meant as an experiences pieces outlining different strategies of acceleration at the same time.

I also present first models of JuSPIC’s performance here, but I show more on that in the IWOPH paper.

I was able to travel to GTC17 with the poster (which was awesome!), it was even chosen to be included in the Top 20 Fast Forward Program. Cool! I was worried that the non-standard design could be interpreted as somewhat unscientific. Nice that it wasn’t.
It’s unfortunate, though, that the poster submission deadline is so much earlier than the actual conference.

• ## Interesting Talks from GTC 17

A month ago, I traveled to San Jose, CA, to visit the GPU Technology Conference, GTC, and learn the latest on all things GPU.

The schedule was super packed and at more than one time, I wasn’t able to see some interesting talk because I was already sitting in one other interesting talk.1

Here’s a list of sessions I found interesting/noteworthy and/or want to (re)visit after the conference, sorted by topics.
Links to recordings and slides are provided. Bold indicates that I have not yet seen the talk and want to do so.

I post this only today since the materials have been private up to now.2

• Volta, CUDA 9
• General CUDA, GPU:
• GPU Data Management
• S7362: Benchmarking The New Unified Memory Of CUDA 8 (link, recording)
• S7628: The Future Of GPU Data Management (link, recording, slides)
• S7285: Unified Memory On The Latest GPU Architectures (Pascal, Volta) (link, recording, slides)
• S7764: GPUs: Using HMM To Blur The Lines Between CPU And GPU Programming (link, recording, slides)
• S7128: How To Enable NVIDIA CUDA Stream Synchronous Communications Using Gpudirect (link, recording, slides)
• S7700: An Introduction To The GPU Memory Model - Presented By Acceleware (session 2 Of 4) (link, recording)
• S7628: The Future Of GPU Data Management (link, recording, slides)
• Libraries, Packages, Tools
• S7150: Accelerating cuBLAS/cuDNN Using Input-aware Auto-tuning: The ISAAC Library (link, recording, slides)
• S7405: Bifrost: A Python/c++ Framework For Easy High-throughput Computing (link, recording, slides)
• S7438: Build Systems: Combining CUDA And Modern CMake (link, recording, slides)
• Multi-GPU, MPI
• Other Programming Models (OpenACC, OpenMP, OpenCL, Etc.)
• S7344: Kokkos - The C++ Performance Portability Programming Model (link, recording, slides)
• S7192: OmpSs+OpenACC: Multi-target Task-based Programming Model Exploiting OpenACC GPU Kernels (link, recording, slides)
• S7496: OpenCL At NVIDIA: Best Practices, Learnings, And Plans (link, recording, slides)
• S7626: A Simple Guideline For Code Optimizations On Modern Architectures With OpenACC And CUDA (link, recording, slides)
• S7636: Cache Directive Optimization In OpenACC Programming Model (link, recording, slides)
• Use-Cases
• S7341: Using OpenAC For NGS Techniques To Create A Portable And Easy-to-use Code Base (link, recording, slides)
• S7640: Porting C++ Applications To GPUs With OpenACC For Lattice Quantum Chromodynamics (link, recording, slides)
• S7672: OpenACC Best Practices: Accelerating The C++ NUMECA FINE/Open CFD (link, recording, slides)
• S7635: Comparison Of OpenACC And OpenMP4.5 Offloading: Speeding Up Simulations Of Stellar Explosions (link, recording, slides)
• S7478: Using OpenACC To Parallelize Irregular Algorithms On GPUs (link, recording, slides)
• S7193: Achieving Portable Performance For GTC-P With OpenACC On GPU, Multi-core CPU, And Sunway Many-core Processor (link, recording, slides)
• S7735: GPU Acceleration Of The Higrad Computational Fluid Dynamics Code With Mixed OpenACC And CUDA Fortran (link, recording, slides)
• S7382: GPUs Unleashed: Analysis Of Petascale Molecular Simulations With VMD (link, recording, slides)
• S7535: Potential Field Solutions Of The Solar Corona: Converting A PCG Solver From MPI To MPI+OpenACC (link, recording)
• AI, Machine Learning, Deep Learning, and Siblings
• S7457: Deep Learning Demystified (link, recording, slides)
• S7515: Eliminating The Regular Expression With Neural Networks (link, recording, slides)
• S7860: Starting A Deep Learning Project (link, recording, slides)
• S7666: Learning Particle Physics By Example: Using Generative Adversarial Networks To Accelerate Physics (link, recording, slides)
• S7804: Tensorflow: Open Source Machine Learning (presented By Google) (link, recording)
• Round Tables, Panels
• SE7142: CUDA Developer Tools Round Table (nothing on this :()
• S7564: Accelerator Programming Ecosystems (link, recording, slides)
• Use-Cases, Applications
• S7332: Accelerated Astrophysics: Using NVIDIA DGX-1 To Simulate And Understand The Universe (link, recording, slides)
• Others
• Python:
• S7609: Porting After Effects To The GPU (link, recording, slides)
• S7590: Passengers: Awakening VR, When Film Meet VR (link, nothing on this :()
• S7296: Cloudlighting: Merging GPU-based Hpc With Cloud Services (link, recording, slides)
• S7329: Open-source Tools For GPU Programming Assignments In Large Classroom Settings (link, recording, slides)
• S7482: Advances In Real-time Graphics At Pixar (link, unfortunately nothing else, even though I thought they said so during the session)
• S7642: Preparing GPU-accelerated Applications For The Summit Supercomputer (link, recording, slides)
1. The pinnacle of things was the Wednesday-4pm timeslot, when four this year new-like talks happened at the same time. Talk about parallelism.

• ## Data Analysis with Python

In the last few weeks I needed to crunch some data. It was structured data, so I had a reason for finally jumping into pivoting DataFrames in Pandas1 – a thing I still knew (and know…) very little about.

I’m using Python for any kinds of visualization since quite some time already. It’s so versatile, productive, and handy! #♥

After finishing my paper, I wanted to show my colleagues shortly the basics of what they need to know to massage their data and make nice-looking plots from it. With Python. A kind of Data Analysis with Python 1-0-½.

Here are the slides, which scratch the surface of Matplotlib, Pandas, and Jupyter Notebooks. Also: Seaborn. Navigate with space bar.

The presentation itself is done in a Jupyter Notebook. Hence the embedded HTML presentation with reveal.js, which Jupyter natively generates. If you’re looking for a more static version, there’s a PDF of it as well2. Also, the Notebook is available in this Gist, in case you’d like to see how its done.

Edit, 29 May: There’s a handy cheatsheet available in Pandas’ Github repository.

Let me know what you think of the slides. What would be your recommendations to further simplify or improve data analysis with Python? Tweet me!

1. WTF you say? Well. Read on. Or just jump ahead to the presentation. It all makes sense. I promise.

2. Which were hell to compile. That’s really not the strong suit of those HTML/JS presentation frameworks (and for me a show-stopper). I used the decktape method to get a PDF from the HTML and used pdfcrop to get rid of scrollbars.

• ## Preprocessor Macros for CUDA Errors

TL;DR: Error-Checking Preprocessor Macros for CUDA Fortran and CUDA C/C++. They are also on Github.

When calling functions from the CUDA Runtime API 1, usually an error code is returned. Although this gets ignored by many (most?) of the users, it can give great insight into the wrong-doings of your code.

Error handling is something omitted regularly for smaller code bases. For the CUDA errors, the reason might be in the additional lines of code, cluttering the more directly content-related API calls, or simple laziness.

But fear not! With preprocessor macros, there’s just little overhead to include error handling in your CUDA code.

Here are macros for CUDA C(++) and CUDA Fortran. See also the notes on error checking of kernels at the end.

## Macros

### C++, C

I do not know who initially came up with the idea. It’s on the NVIDIA devblogs, in a Gist, and also in a lot of the codes of my colleagues. I modified the usual snippet a bit, though, to create what I think is the most informative and concise representation of an error call.

#define CUDA_CALL( call )               \
{                                       \
cudaError_t result = call;              \
if ( cudaSuccess != result )            \
std::cerr << "CUDA error " << result << " in " << __FILE__ << ":" << __LINE__ << ": " << cudaGetErrorString( result ) << " (" << #call << ")" << std::endl;  \
}

This assumes that iostream is loaded. For C, replace the std::cerr << std::endl statement with fprintf(stderr, "CUDA error %i in %s …", result, __FILE__, …).

Use it by wrapping a plain API call into it:

CUDA_CALL( cudaMalloc( (void**)&ad, csize ); )

It will print one line per error, giving the file name and the line number of the error, the raw error code and its explained string; and it will print the actual call (#call). An erroneous call will then look like

CUDA error 38 in hello-world.cu:50: no CUDA-capable device is detected (cudaMalloc( (void**)&bd, isize );)


### CUDA Fortran

Since CUDA Fortran is only available through the PGI Fortran compiler, the following is true only for this compiler (especially with regards to the preprocessor and column width). Note: If you find more elegant solutions to the problems discussed in the following, let me know! I still have a lot to learn in the depths that is Fortran.

In general, there is no limitation in using a similar macro in CUDA Fortran code compared to the CUDA C version. But: Column width. Also when using modern Fortran 90 (.F90)2, PGI’s Fortran compiler only allows for lines with a width of 256 characters. And because the preprocessor is not able to propagate line breaks into the target source, the error-check-augmented resulting line will be quite long. If you run into a line-too-long error, consider using shorter variables, which is ugly and horrible and arrrr, Fortran, but that’s just the way it is.3 Another workaround would be the one explained for kernels later on. The line length is also the reason I opted for removing whitespace and non-descriptive variables. Sorry.

Macro:

#define CUDA_SUCCESS 0
#define CUDA_CALL__(e,fmt,c) \
e=c; \
if(e/=CUDA_SUCCESS) \
write(*,fmt) "CUDA Error ",e," in ",__FILE__,":",__LINE__,": ",trim(cudaGetErrorString(e))," (",#c,")"
#define CUDA_CALL(c) CUDA_CALL__(gpuStatus,fmt,c)

The macro lives best closely together with a module which provides gpuStatus and the format string fmt. It can then be used in any instrumented routine/program with a use debug:

module debug
character(len=27) :: fmt = "(A,I0,A,A,A,I0,A,A,A,A,A,A)"
integer :: gpuStatus
end module debug

CUDA_CALL is a shortcut to CUDA_CALL__. The latter is in case one wants to use a different status variable (for reuse or explicit checking) or a different format string.

## Error Checking on Kernels

Kernels do not have a return value so the usual means of error checking will not work on them. Instead, use a macro-wrapped cudaGetLastError()4 directly after the kernel launch, plus a wrapped cudaDeviceSynchronize() if there aren’t any other subsequent API calls or to help structure error messages.

CUDA_CALL( cudaGetLastError(); )
CUDA_CALL( cudaDeviceSynchronize(); )

This method works in all cases where the error code can not be handled directly, e.g. if the Fortran line is too long.

1. Although all of the written is true for the CUDA Driver API, I will refer to the Runtime API, since this is the more commonly used method of access to the GPU.

2. Make sure to give your filenames a capital F in F90 to include preprocessing on the file.

3. On StackOverflow, user talonmies adds an additional routine to reduce the length of the preprocessor macro. This should affect line-too-longs in Fortran beneficially, but comes with other caveats. It could be worth to do, though, if line-too-longs are a regular problem.

4. This StackOverflow answers uses cudaPeekAtLastError(). But as far as I see it the result should be equal.