• Interesting Talks from GTC 17

    A month ago, I traveled to San Jose, CA, to visit the GPU Technology Conference, GTC, and learn the latest on all things GPU.

    The schedule was super packed, and more than once I couldn’t attend an interesting talk because I was already sitting in another interesting talk.1

    Here’s a list of sessions I found interesting/noteworthy and/or want to (re)visit after the conference, sorted by topics.
    Links to recordings and slides are provided. Bold indicates that I have not yet seen the talk and want to do so.

    I’m only posting this today because the materials were private until now.2

    • Volta, CUDA 9
    • General CUDA, GPU
    • GPU Data Management
      • S7362: Benchmarking The New Unified Memory Of CUDA 8 (link, recording)
      • S7628: The Future Of GPU Data Management (link, recording, slides)
      • S7285: Unified Memory On The Latest GPU Architectures (Pascal, Volta) (link, recording, slides)
      • S7764: GPUs: Using HMM To Blur The Lines Between CPU And GPU Programming (link, recording, slides)
      • S7128: How To Enable NVIDIA CUDA Stream Synchronous Communications Using GPUDirect (link, recording, slides)
      • S7700: An Introduction To The GPU Memory Model - Presented By Acceleware (session 2 Of 4) (link, recording)
    • Libraries, Packages, Tools
      • S7150: Accelerating cuBLAS/cuDNN Using Input-aware Auto-tuning: The ISAAC Library (link, recording, slides)
      • S7405: Bifrost: A Python/C++ Framework For Easy High-throughput Computing (link, recording, slides)
      • S7438: Build Systems: Combining CUDA And Modern CMake (link, recording, slides)
    • Multi-GPU, MPI
    • Other Programming Models (OpenACC, OpenMP, OpenCL, Etc.)
      • S7344: Kokkos - The C++ Performance Portability Programming Model (link, recording, slides)
      • S7192: OmpSs+OpenACC: Multi-target Task-based Programming Model Exploiting OpenACC GPU Kernels (link, recording, slides)
      • S7496: OpenCL At NVIDIA: Best Practices, Learnings, And Plans (link, recording, slides)
      • S7626: A Simple Guideline For Code Optimizations On Modern Architectures With OpenACC And CUDA (link, recording, slides)
      • S7636: Cache Directive Optimization In OpenACC Programming Model (link, recording, slides)
      • Use-Cases
        • S7341: Using OpenACC For NGS Techniques To Create A Portable And Easy-to-use Code Base (link, recording, slides)
        • S7640: Porting C++ Applications To GPUs With OpenACC For Lattice Quantum Chromodynamics (link, recording, slides)
        • S7672: OpenACC Best Practices: Accelerating The C++ NUMECA FINE/Open CFD (link, recording, slides)
        • S7635: Comparison Of OpenACC And OpenMP4.5 Offloading: Speeding Up Simulations Of Stellar Explosions (link, recording, slides)
        • S7478: Using OpenACC To Parallelize Irregular Algorithms On GPUs (link, recording, slides)
        • S7193: Achieving Portable Performance For GTC-P With OpenACC On GPU, Multi-core CPU, And Sunway Many-core Processor (link, recording, slides)
        • S7735: GPU Acceleration Of The Higrad Computational Fluid Dynamics Code With Mixed OpenACC And CUDA Fortran (link, recording, slides)
        • S7382: GPUs Unleashed: Analysis Of Petascale Molecular Simulations With VMD (link, recording, slides)
        • S7535: Potential Field Solutions Of The Solar Corona: Converting A PCG Solver From MPI To MPI+OpenACC (link, recording)
    • AI, Machine Learning, Deep Learning, and Siblings
      • S7457: Deep Learning Demystified (link, recording, slides)
      • S7515: Eliminating The Regular Expression With Neural Networks (link, recording, slides)
      • S7800: Leveraging The Power Of Google’s Cloud Machine Learning Service (presented By Google) (link, slides)
      • S7860: Starting A Deep Learning Project (link, recording, slides)
      • S7666: Learning Particle Physics By Example: Using Generative Adversarial Networks To Accelerate Physics (link, recording, slides)
      • S7804: TensorFlow: Open Source Machine Learning (presented By Google) (link, recording)
    • Round Tables, Panels
      • SE7142: CUDA Developer Tools Round Table (nothing on this :()
      • S7564: Accelerator Programming Ecosystems (link, recording, slides)
    • Use-Cases, Applications
      • S7332: Accelerated Astrophysics: Using NVIDIA DGX-1 To Simulate And Understand The Universe (link, recording, slides)
    • Others
      • Python:
      • S7609: Porting After Effects To The GPU (link, recording, slides)
      • S7590: Passengers: Awakening VR, When Film Meet VR (link, nothing on this :()
      • S7296: CloudLightning: Merging GPU-based HPC With Cloud Services (link, recording, slides)
      • S7329: Open-source Tools For GPU Programming Assignments In Large Classroom Settings (link, recording, slides)
      • S7482: Advances In Real-time Graphics At Pixar (link, unfortunately nothing else, even though I thought they said so during the session)
      • S7642: Preparing GPU-accelerated Applications For The Summit Supercomputer (link, recording, slides)
    • Keynote (link)
    1. The pinnacle was the Wednesday 4 pm timeslot, when four of this year’s new talks happened at the same time. Talk about parallelism.

  • Data Analysis with Python

    In the last few weeks I needed to crunch some data. It was structured data, so I had a reason for finally jumping into pivoting DataFrames in Pandas1 – a thing I still knew (and know…) very little about.
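    To give a flavor of what pivoting means here, a tiny, made-up example (the data is illustrative, not from my paper):

```python
import pandas as pd

# Made-up measurement data: runtimes per device and problem size
df = pd.DataFrame({
    "device": ["CPU", "CPU", "GPU", "GPU"],
    "size":   [1024, 2048, 1024, 2048],
    "time":   [1.0, 4.1, 0.3, 0.9],
})

# Pivot: one row per problem size, one column per device
pivoted = df.pivot_table(index="size", columns="device", values="time")
print(pivoted)
```

    From here, pivoted.plot() is one line away from a comparison plot.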

    I’ve been using Python for all kinds of visualization for quite some time now. It’s so versatile, productive, and handy! #♥

    After finishing my paper, I wanted to give my colleagues a short introduction to the basics of massaging their data and making nice-looking plots from it. With Python. A kind of Data Analysis with Python 1-0-½.

    Here are the slides, which scratch the surface of Matplotlib, Pandas, and Jupyter Notebooks. Also: Seaborn. Navigate with space bar.

    The presentation itself is done in a Jupyter Notebook. Hence the embedded HTML presentation with reveal.js, which Jupyter natively generates. If you’re looking for a more static version, there’s a PDF of it as well2. Also, the Notebook is available in this Gist, in case you’d like to see how it’s done.

    Edit, 29 May: There’s a handy cheatsheet available in Pandas’ Github repository.

    Let me know what you think of the slides. What would be your recommendations to further simplify or improve data analysis with Python? Tweet me!

    1. WTF you say? Well. Read on. Or just jump ahead to the presentation. It all makes sense. I promise.

    2. Which was hell to compile. That’s really not the strong suit of those HTML/JS presentation frameworks (and, for me, a show-stopper). I used the decktape method to get a PDF from the HTML and pdfcrop to get rid of scrollbars.

  • Preprocessor Macros for CUDA Errors

    TL;DR: Error-Checking Preprocessor Macros for CUDA Fortran and CUDA C/C++. They are also on Github.

    Most functions of the CUDA Runtime API1 return an error code. Although it gets ignored by many (most?) users, it can give great insight into the wrong-doings of your code.

    Error handling is regularly omitted in smaller code bases. For CUDA errors, the reason might be the additional lines of code cluttering the actual, content-related API calls – or simple laziness.

    But fear not! With preprocessor macros, adding error handling to your CUDA code takes only little extra effort.

    Here are macros for CUDA C(++) and CUDA Fortran. See also the notes on error checking of kernels at the end.


    C++, C

    I do not know who initially came up with the idea. It’s on the NVIDIA devblogs, in a Gist, and also in a lot of the codes of my colleagues. I modified the usual snippet a bit, though, to create what I think is the most informative and concise representation of an error call.

    #define CUDA_CALL( call )               \
    {                                       \
        cudaError_t result = call;          \
        if ( cudaSuccess != result )        \
            std::cerr << "CUDA error " << result << " in " << __FILE__ << ":" << __LINE__ << ": " << cudaGetErrorString( result ) << " (" << #call << ")" << std::endl;  \
    }
    This assumes that iostream is included. For C, replace the std::cerr << std::endl statement with fprintf(stderr, "CUDA error %i in %s …", result, __FILE__, …).
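    Spelled out, a C version might look like this sketch (the exact format string is my assumption, mirroring the C++ output below):

```c
#include <stdio.h>
#include <cuda_runtime.h>

/* Same idea as the C++ macro, with fprintf instead of std::cerr */
#define CUDA_CALL( call )                                         \
{                                                                 \
    cudaError_t result = call;                                    \
    if ( cudaSuccess != result )                                  \
        fprintf( stderr, "CUDA error %i in %s:%i: %s (%s)\n",     \
                 (int)result, __FILE__, __LINE__,                 \
                 cudaGetErrorString( result ), #call );           \
}
```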

    Use it by wrapping a plain API call into it:

    CUDA_CALL( cudaMalloc( (void**)&ad, csize ); )

    It will print one line per error, giving the file name and line number of the error, the raw error code and its explanatory string, as well as the actual call (#call). An erroneous call then looks like

    CUDA error 38 in hello-world.cu:50: no CUDA-capable device is detected (cudaMalloc( (void**)&bd, isize );)

    CUDA Fortran

    Since CUDA Fortran is only available through the PGI Fortran compiler, the following holds only for this compiler (especially with regard to the preprocessor and column width). Note: If you find more elegant solutions to the problems discussed in the following, let me know! I still have a lot to learn in the depths that are Fortran.

    In general, there is no limitation to using a similar macro in CUDA Fortran code compared to the CUDA C version. But: column width. Even when using modern Fortran 90 (.F90)2, PGI’s Fortran compiler only allows lines of up to 256 characters. And because the preprocessor does not propagate line breaks into the target source, the error-check-augmented resulting line can get quite long. If you run into a line-too-long error, consider using shorter variable names, which is ugly and horrible and arrrr, Fortran, but that’s just the way it is.3 Another workaround is the one explained for kernels later on. The line length is also the reason I opted for removing whitespace and using non-descriptive variables. Sorry.


    #define CUDA_SUCCESS 0
    #define CUDA_CALL__(e,fmt,c) \
    e=c; \
    if(e/=CUDA_SUCCESS) \
    write(*,fmt) "CUDA Error ",e," in ",__FILE__,":",__LINE__,": ",trim(cudaGetErrorString(e))," (",#c,")"
    #define CUDA_CALL(c) CUDA_CALL__(gpuStatus,fmt,c)

    The macro lives best together with a module providing gpuStatus and the format string fmt. It can then be used in any instrumented routine or program via use debug:

    module debug
        character(len=27) :: fmt = "(A,I0,A,A,A,I0,A,A,A,A,A,A)"
        integer :: gpuStatus
    end module debug

    CUDA_CALL is a shortcut to CUDA_CALL__. The latter is in case one wants to use a different status variable (for reuse or explicit checking) or a different format string.
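    Put together, an instrumented routine could look roughly like this (a sketch; the routine name is a placeholder and the kernel launch is elided):

```fortran
subroutine do_work()
    use cudafor
    use debug
    ! ... kernel launches, memory operations, etc. ...
    CUDA_CALL( cudaDeviceSynchronize() )
end subroutine do_work
```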

    Error Checking on Kernels

    Kernels do not have a return value, so the usual means of error checking will not work for them. Instead, use a macro-wrapped cudaGetLastError()4 directly after the kernel launch, plus a wrapped cudaDeviceSynchronize() if there aren’t any subsequent API calls, or to help structure error messages.

    CUDA_CALL( cudaGetLastError(); )
    CUDA_CALL( cudaDeviceSynchronize(); )

    This method works in all cases where the error code cannot be handled directly, e.g. if the Fortran line would be too long.

    1. Although all of this is also true for the CUDA Driver API, I will refer to the Runtime API, since it is the more commonly used method of accessing the GPU.

    2. Make sure to give your filenames a capital F in F90 to include preprocessing on the file.

    3. On StackOverflow, user talonmies adds an additional routine to reduce the length of the preprocessor macro. This should help with line-too-longs in Fortran, but comes with other caveats. It could be worth doing, though, if line-too-longs are a regular problem.

    4. This StackOverflow answer uses cudaPeekAtLastError(). But as far as I can see, the result should be the same.

  • SC16 tutorial on OpenPOWER and GPUs (including a TALK)

    On 14 November we held a tutorial on »Application Porting and Optimization on GPU-Accelerated POWER Architectures« during the Supercomputing Conference in Salt Lake City. The tutorial allowed me to travel to this largest conference on supercomputing-related topics for the first time. It was a busy but great experience! I actually wrote a bit about it for the Jülich blog portal.

    In the tutorial we spoke about

    • the new POWER8NVL architecture, including its large bus to NVLink-equipped Tesla P100 GPUs,
    • how to measure an application’s performance by means of hardware performance counters,
    • how to use compiler flags to speed up an application’s runtime - and what’s behind the flags,
    • what’s new in the Tesla P100 GPUs,
    • how to use OpenACC to run a program across a few of the GPU devices at once, and
    • how an actual application can make use of the new features (especially the larger bus).

    For three of the six parts we also had hands-on sessions where attendees could try out the various things taught. Since the POWER8NVL+P100 setup is so new, we actually had to use a machine in Jülich. Despite some general lag and initial WiFi problems in the room, this worked out surprisingly well. Every attendee downloaded an SSH key, signed a usage agreement, and exchanged the agreement for the password to the key. Then they were good to go.

    The organization of the tutorial was also an interesting study in international collaboration. We worked together with colleagues from Switzerland, the States, and India. Timezones nicely tripartite1, nearly at least. Also a peculiarity: We had to hand in all slides and material 2.5 months before the actual conference. Quite challenging when you speak about unreleased hardware…

    In the end, the tutorial turned out quite well. In the evaluation, attendees gave us above average and close to best marks in all categories. Yay!

    The slides and sets of source code are not available publicly, unfortunately. But here are mine! Slides, anyway.

    I had the second talk (and the subsequent first hands-on): »Performance Counters and Tools«. After the first talk, which introduced the POWER8 chip and related software infrastructure, I zoomed in on performance analysis with performance counters. This was a new topic for me as well; it was assigned to me after a colleague left the institute. Although I had worked with some performance counters before, my knowledge was nowhere near the level needed to knowledgeably teach others. So: reading, testing, digging through documentation, finding use-cases. Detailed work, but as researchers this is in our job description after all.

    In the end I easily had enough content for a 50-minute talk. There is so much interesting stuff to tell about performance counters! To cut it down to 30 minutes, I moved a lot of the material into the appendix. Not ideal, but since the attendees receive the slides anyway, at least the work was not in vain.

    My hands-on afterwards focused on measuring stalls of a matrix multiplication. I measure the number of stalls due to data cache misses (PMU code PM_CMPLU_STALL_DCACHE_MISS), once for a simple, unoptimized matrix-matrix multiplication and once for a version in which the inner two loops are interchanged. The interchange reduces the number of misses by two orders of magnitude and leads to a speed-up of about 20 %. The message was along the lines of: stalls are bad and can be reduced by clever code tuning; but their impact is also hidden by modern processor features.

    I re-did the POWER8 CPI Stack!

    If you’re interested, you can find the slides of the talk after the click. The slides I actually presented (with all the overlays) are here. For the time being, the source code for my hands-on is not available publicly. If you want them, give me a note.

    I also remade IBM’s POWER8 CPI stack diagram into tree form. I like it better this way, although one could argue that the original table version also has its use. Here’s the PDF, a PNG, and also the TeX file.

    1. I looked this up. It is supposed to mean divided by three, since India, Germany, and USA are roughly arranged in three equivalently distant time zones.

  • TALK: Accelerating Plasma Physics with GPUs

    A few weeks ago we had the annual workshop of one of the groups I’m involved in, the POWER Acceleration and Design Centre (PADC).

    In the scope of the PADC we investigate new processors offered by IBM and the OpenPOWER consortium and how well the architectural choices map to applications. One of the features of the latest incarnation of the POWER processor chip is its connection to NVIDIA’s GPUs: The POWER8NVL employs a new, larger bus to connect to the GPU device – NVLink. The processor can make use of NVLink to exchange data with the GPU more than four times as fast compared to usual PCI-Express interfaces.1 Neat!

    I have yet to dive fully into the new world of POWER8NVL, NVLink, and NVIDIA’s Pascal GPU on the other side, since there are only a few systems available right now. It’s brand new. But for evaluating the combination of the integrated design of POWER8 CPU and Pascal GPU for a specific project (the Human Brain Project, read more about the pre-commercial procurement here), we actually received a small test system with this brand-new architecture.2 Unfortunately, the machine only arrived shortly before the PADC workshop. There was no time for extended tests. But on the Sunday afternoon before Monday’s workshop I managed to measure at least one aspect of one of my app’s behaviors. Yay!

    You can see the performance of JuSPIC, a plasma physics application I’m researching, under the assumption of a simple information exchange model on the Pascal P100 GPU in a POWER8 system in the second part of the presentation. In the somewhat larger first part of the talk, I show what techniques I used to begin accelerating the application on the GPU. I started out with OpenACC, a pragma-based acceleration programming model, but soon found out that the code is a bit too complex for the compiler I use. See the slides for how it turned out.

    I hope to continue the acceleration as well as the performance analysis (with a more refined model) soon. But I’m busy with other cool stuff right now.

    You can find a handout version of the slides on the webpage of the workshop – or after the click; the version with all the overlays is also available.

    Let me know what you think!3

    1. PCIe Gen3: 16 GB/s, NVLink (Device to Host): 80 GB/s

    2. Well. Small, as in multiple P100s, each with about 10 TFLOP/s of single precision performance…

    3. I still do not have comments in this static blog engine. So you either need to tweet at me (@AndiH) or send me an email (a.herten@fz-ju…).

  • Collected: LaTeX Beamer Tips, Tricks, Workarounds

    I recently compiled two 30+ minute talks in LaTeX Beamer. While the output of LaTeX usually is great, getting there is not. There are many things which can (and will) go wrong; many commands and techniques you only discover after a lengthy workaround; and, anyway, sigh, workarounds…!

    The usual way to deal with any LaTeX insufficiency is to discover the solution in a StackExchange post you already read a couple of times last week. Or you read through some of the documentation and hope to find the solution in there.

    Well, here follows a list of all the neat tricks I searched for, stumbled upon, or discovered myself. All somehow related to creating presentations in LaTeX: LaTeX Beamer, TikZ, and more.


    Handout from Same File

    Apart from producing slides for presentations, Beamer has built-in support for handouts. Adding [handout] as an option to the beamer document class will enable the mode.

    In handout mode, all overlay specifications are reduced to one and the same value: 1. If overlays are still needed in handout mode, they can be given explicitly by adding | handout: n to the overlay specification, e.g. \only<2- | handout: 3>{} (this will usually print its argument from overlay two on, but in handout mode only on slide three).

    To typeset a handout version from the same file as the presentation slides, without adding and removing [handout] all the time, I created a second file (ah-conference--handout.tex) and added the following line:
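    A sketch of what that line amounts to (assuming the main presentation file is ah-conference.tex):

```latex
% ah-conference--handout.tex (sketch)
\PassOptionsToClass{handout}{beamer}
\input{ah-conference.tex}
```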


    This will insert the whole content of ah-conference.tex into the file and pass the handout option to beamer. Typeset as usual1, done ✓.

    1. You should use latexmk for typesetting, by the way. It’s great. Also, it can be steered by a .latexmkrc file on a per-folder basis.

  • TALK: GPUs: Platform, Programming, Pitfalls

    Last Thursday, 1 September, I held a talk at the GridKa Summer School 2016. Under the general topic of Data Science on Modern Architectures, I gave a 30-minute introduction to programming GPUs.

    This was my first general introduction talk on the GPU platform, so I started from scratch (and tried to be influenced as little as possible). Additionally, I made the slides completely in LaTeX Beamer. Using a lot of TikZ adds an additional layer of cumbersomeness on top of the usual cumbersome LaTeX experience.1

    Anyway: I think the talk turned out quite well. I got some interested follow-up questions afterwards, especially in the coffee break.2 The work was well worth it!

    An embed of the slides follows. This is the handout version. You can also download the actual set I presented, which has more overlays to guide my story.

    1. LaTeX is a constant struggle between »Why did I do this?« and »Wow, cool, that was worth it«. LaTeX, the alternating experience™.

    2. Well, plus the usual discussion of NVIDIA vs. AMD.

  • Clang's Optimization Levels

    Clang is a neat compiler. I like using it.

    For some manual optimization of source code, I was interested in the exact difference between the automatic optimization levels -O1 to -O3. What do they do, anyway?

    It turns out, this info is not so easy to come by.

    The official documentation of Clang specifies the different levels only quite coarsely:

    -O2: Moderate level of optimization;
    -O1: Somewhere between -O0 and -O2


    Luckily there’s StackOverflow.

    In this answer by Antoine, the two lines needed to get the optimization passes are printed:

    llvm-as < /dev/null | opt -O1 -disable-output -debug-pass=Arguments
    echo 'int;' | clang -xc -O1 - -o /dev/null -\#\#\#

    The first line uses opt, which is the modular LLVM optimizer and analyzer, running on LLVM source files and, I reckon, being independent of the actual programming language.1

    The second command prints the optimization passes which clang, the C/C++ driver of LLVM, puts on top of opt.

    This will not explain anything, but solely print the switches used. To understand what lies beneath each switch, LLVM has an explanatory website about the passes (opt --help will also print them, apparently).2

    Luckily, Antoine has compiled the passes Clang uses in the above posting (at least up to Clang 3.8).

    1. Although I can’t find -disable-output and -debug-pass in the list of options of opt’s help…

    2. For some of the options clang prints, the description is available through clang -cc1 --help, where cc1 is the frontend; find yours through clang -\#\#\# -c file.c.