• SC16 tutorial on OpenPOWER and GPUs (including a TALK)

    On 14 November we held a tutorial on »Application Porting and Optimization on GPU-Accelerated POWER Architectures« during the Supercomputing Conference in Salt Lake City. The tutorial allowed me to travel to this largest conference on supercomputing-related topics for a first time. It was a busy, but great experience! I actually wrote a bit about it for the Jülich blog portal.

    In the tutorial we spoke about

    • the new POWER8NVL architecture, including its large bus to NVLink-equipped Tesla P100 GPUs,
    • how to measure an application’s performance by means of hardware performance counters,
    • how to use compiler flags to speed-up an application’s runtime - and what’s behind the flags,
    • what’s new in the Tesla P100 GPUs,
    • how to use OpenACC to run a program across a few of the GPU devices at once, and
    • how an actual application can make use of the new features (especially the larger bus).

    For three of the six parts we also had hand-ons where attendees of the tutorial could try out the various things taught. Since the POWER8NVL+P100 setup is so new, we actually had to use a machine in Jülich. Despite some general lag and initial WiFi problems in the room, this worked out surprisingly well. Every attendee downloaded an SSH key, signed a Usage Agreement, and exchanged it for the password to the key. Then they were good to go.

    The organization of the tutorial was also an interesting study in international collaboration. We worked together with colleagues from Switzerland, the States, and India. Timezones nicely tripartite1, nearly at least. Also a peculiarity: We had to hand in all slides and material 2.5 months before the actual conference. Quite challenging when you speak about unreleased hardware…

    In the end, the tutorial turned out quite well. In the evaluation, attendees gave us above average and close to best marks in all categories. Yay!

    The slides and sets of source code are not available publicly, unfortunately. But here are mine! Slides, anyway.

    I had the second talk (and subsequent first hands-on): »Performance Counters and Tools«. After the first talk, which introduced the POWER8 chip and related software infrastructure, I zoomed in on performance analysis with performance counters. This was a new topic for me as well it was assigned to me after a colleague left the institute. Although I worked with some performance counters before, my knowledge was not anywhere near the level of being able to knowledgeable teach others. So: Reading, testing, digging through documentation, finding use-cases. Detailed work, but as researchers this is in our job description after all.

    In the end I had enough content to hold a 50 minute talk (easily). So much interesting stuff is to be told about performance counters! To cut it down to 30 minutes I moved a lot of the material into the appendix. Not ideal, but since the attendees receive the slides anyway, at least the work was not in vain this way.

    My hands-on afterwards focused on measuring stalls of a matrix multiplication. I use the number of stalls due to data cache misses (PMU code PM_CMPLU_STALL_DCACHE_MISS) once with a simple, un-optimized matrix-matrix multiplication and once with a matrix multiplication in which the inner two loops are interchanged. This reduces the amount of misses by two orders of magnitude and leads to a speed-up of about 20 %. The message was along the lines of: Stalls are bad, they can be reduced by clever code tuning; but their impact is also hidden by modern processor features.

    I re-did the POWER8 CPI Stack!

    If you’re interested, you can find the slides of the talk after the click. The slides I actually presented (with all the overlays) are here. For the time being, the source code for my hands-on is not available publicly. If you want them, give me a note.

    I also remade IBM’s POWER8 CPI stack diagram into tree form. I like it better this way, although one could argue that the original table version also has its use. Here’s the PDF, a PNG, and also the TeX file.

    1. I looked this up. It is supposed to mean divided by three, since India, Germany, and USA are roughly arranged in three equivalently distant time zones.

    → continue reading!
  • TALK: Accelerating Plasma Physics with GPUs

    A few weeks ago we had the annual workshop of one of the groups I’m involved in, the POWER Acceleration and Design Centre (PADC).

    In the scope of the PADC we investigate new processors offered by IBM and the OpenPOWER consortium and how well the architectural choices map to applications. One of the features of the latest incarnation of the POWER processor chip is its connection to NVIDIA’s GPUs: The POWER8NVL employs a new, larger bus to connect to the GPU device – NVLink. The processor can make use of NVLink to exchange data with the GPU more than four times as fast compared to usual PCI-Express interfaces.1 Neat!

    I’m yet to dive fully into the new world of POWER8NVL, NVLink, and NVIDIA’s Pascal GPU on the other side, since there are only few systems available right now. It’s brand new. But for evaluating the combination of the integrated design of POWER8 CPU and Pascal GPU for a specific project (the Human Brain Project, read more about the precommerical precurement here) we actually received a small test system with this brand new architecture. 2 Unfortunately, the machine only arrived shortly before the PADC workshop. There was no time for extended tests. But on Sunday afternoon before Monday’s workshop I managed to measure at least one aspect of one of my app’s behaviors. Yay!

    You can see the performance of JuSPIC, a plasma physics application I’m researching, under the assumption of a simple information exchange model on the Pascal P100 GPU in a POWER8 system in the second part of the presentation. In the somewhat larger first part of the talk, I show what techniques I used to begin accelerating the application on the GPU. I started out with OpenACC, a pragma-based acceleration programming model, but soon found out that the code is a bit too complex for the compiler I use. See the slides for how it turned out.

    I hope to continue the acceleration as well as the performance analysis (with a more refined model) soon. But I’m busy with other cool stuff right now.

    You can find a handout version of the slides on the webpage of the workshop – or after the click; the version with all the overlays is also available, though.

    Let me know what you think!3

    1. PCIe Gen3: 16 GB/s, NVLink (Device to Host): 80 GB/s

    2. Well. Small, as in multiple P100s with each about 10 TFLOP/s single precision performance…

    3. I still do not have comments in this static blog engine. So you either need to tweet at me (@AndiH) or send me an email (a.herten@fz-ju…).

    → continue reading!
  • Collected: LaTeX Beamer Tips, Tricks, Workarounds

    I recently compiled two 30+ minute talks in LaTeX Beamer. While the output of LaTeX usually is great, getting there is not. There are many things which can (and will) go wrong; many commands and techniques you only discover after a lengthy workaround; and, anyway, sigh, workarounds…!

    The usual way to deal with any LaTeX insufficiency is to discover the solution in a post on StackExchange, you already read last week a couple of times. Or you read through some of the documentation and hope to find the solution in there.

    Well, here follows a list of all the neat tricks I searched for, stumbled upon, or discovered myself. All somehow related to creating presentations in LaTeX: LaTeX Beamer, TikZ, and more.

    Beamer

    Handout from Same File

    Apart from producing slides for presentations, Beamer has support for handouts built in. Adding [handout] as a option to the beamer document class will initialize the mode.

    In handout mode, all overlay specifications are reduced to one same value: 1. If overlays in handout mode are still needed, those can explicitly given by adding | handout: 1 to the overlay specification, i.e. \only<2- | handout: 3>{} (this will print its argument usually from overlay two on, but in handout mode only on slide three).

    To typeset a handout version from the same file as the presentation slides, without adding and removing [handout] all the time, I created a second file (ah-conference--handout.tex) and added the following line:

    \PassOptionsToClass{handout}{beamer}\input{ah-conference}
    

    This will insert the whole content of ah-conference.tex into the file and pass the handout option to beamer. Typeset as usual 1, done ✓.

    1. You should use latexmk for typesetting, by the way. It’s great. Also, it can be steered by a .latexmkrc file on a per-folder basis.

    → continue reading!
  • TALK: GPUs: Platform, Programming, Pitfalls

    Last Thursday, on 1 September, I held a talk at the GridKa Summer School 2016. Under the general topic of Data Science on Modern Architectures I introduced 30 Minutes programming on GPUs.

    This was my first general introduction talk on the GPU platform, so I started from scratch (and tried only to be influenced as little as possible). Additionally, I made the slides completely in LaTeX Beamer. Using a lot of TikZ adds an additional layer of cumbersomeness on-top of the usual LaTeX cumbersome experience. 1

    Anyway: I think the talk turned out quite well. I got some interested follow-up questions afterwards, especially in the coffee break. 2 The work was well worth it!

    An embed of the slides follows. This is the handout version. You can also download the actual set I presented, which has more overlays to guide my story.

    1. LaTeX is a constant struggle between »Why did I do this?« and »Wow, cool, that was worth it«. LaTeX, the alternating experience™.

    2. Well, plus the usual discussion of NVIDIA vs. AMD.

    → continue reading!
  • Clang's Optimization Levels

    Clang is a neat compiler. I like using it.

    For some manual optimization of a source code I was interested in the exact difference between the automatic optimization levels -O1 to -O3. What are they doing anyway?

    It turns out, this info is not so easy to come by.

    The official documentation of Clang specifies quite coarsely the different levels:

    -O2: Moderate level of optimization;
    -O1: Somewhere between -O0 and -O2

    Yay.

    Luckily there’s StackOverflow.

    In this answer by Antoine, the two lines needed to get the optimization passes are printed:

    llvm-as < /dev/null | opt -O1 -disable-output -debug-pass=Arguments
    echo 'int;' | clang -xc -O1 - -o /dev/null -\#\#\#
    

    The first line uses opt, which is the modular LLVM optimizer and analyzer, running on LLVM source files and, I reckon, being independent of the actual programming language. 1

    The second command prints the optimization passes which clang, the C/C++ driver of LLVM, puts on top of opt.

    This will not explain anything, but solely print the switches used. To understand what lies beneath each switch, LLVM has an explanatory website about the passes (opt --help will also print them, apparently). 2

    Luckily, Antoine has compiled the passes Clang uses in the above posting. (At least until Clang 3.8.)

    1. Although I can’t find -disable-output and -debug-pass in the list of options of opt’s help…

    2. For some of the options clang prints, the description is available through clang -cc1 --help, where cc1 is the frontend; find your’s through clang -\#\#\# -c file.c.

  • CUDA Course 2016: CUDA Tools

    Last week we had a CUDA course for the students of our guest student program. I held the session on CUDA Tools; that is, NVIDIA tools for programming, debugging, and profiling of GPU applications.

    Here are the slides, which are closely based on my colleague Jiri Kraus’ slides of the open-to-public CUDA course earlier this year.

    Download the slides here, or see them embedded below.

    → continue reading!
  • Reduce Filesize of PDF-embedded Bitmap Images with Ghostscript

    Ghostscript is a powerful tool for manipulating PDF and PS files. But with great power comes great complexity. Here are examples on embedding fonts and reducing image size with it!

    Embedding Fonts

    Usually, your PDF typesetting program takes care of embedding fonts into a PDF document (PDFLaTeX does); but sometimes you have strange sources of PDFs: My ROOT-generated plots for example do not embed their fonts1.

    In a blog post, Karl Rupp summarizes how to embed fonts into PDFs from different sources. To really embed ALL the fonts, also those usually ignored by Ghostscript, you have to dive in even deeper. Here is the command, which I found in a Stackoverflow reply:

    gs -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dCompressFonts=true -dSubsetFonts=true -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=output.pdf -c ".setpdfwrite <</NeverEmbed [ ]>> setdistillerparams" -f input.pdf

    A quicker alternative to Ghostscript is the pdftocairo command of the poppler PDF library. The command enables conversion to different vector graphics formats2. But it can also convert from PDF to PDF, embedding the fonts in the process.

    pdftocairo input.pdf -pdf output.pdf

    Changing Image Quality

    For printing a document, you probably want to have it available in the best quality possible. For uploading it somewhere for sharing with your friends, file size might be more important than quality. Usually, in best vector fashion, the bulk of bits of a LaTeX-set document are taken by bitmap images (or bitmap-like raster images like JPG, PNG, …). Ghostscript offers a batch way to reduce the size of all embedded bitmap-like images.

    Everything revolves around the -dPDFSETTINGS=/ setting. It can take different values, e.g. screen from the command above (equivalent to 72 dpi images) to prepress (300 dpi). A one-liner to get all images of a document down to 150 dpi would be

    gs -sDEVICE=pdfwrite -dCompabilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf

    Since I’m lazy and don’t want to memorize this, I made a small, encapsulating shell script a while ago to reduce the PDF’s size by means of image compression: reducePdfSize.sh.

    Using pdfimages -all on my thesis, which is in total 41 MB of size, results in extraction of about 21 MB images – half of the data in the PDF of my thesis is for bitmap images. Using the above Ghostscript command on thesis.pdf reduces the 41 MB to 15 MB, using the printer option3.
    Not bad, right?

    1. But then again, they use very basic fonts which should be available on any system.

    2. It’s probably also the easiest way to convert your PDF vector graphics to SVG.

    3. I don’t know what happens additionally to reduce the file size even beyond lossy image compression. If you know, tell me!

  • Talks with LaTeX Beamer, written in Markdown

    There are different ways out there to create slides for talks. One used a lot in academia is LaTeX Beamer. For the unknowing, in short, Beamer allows to generate PDF slides by relying on the comprehensive typesetting greatness of LaTeX.

    Compared to WYSIWYG tools like Powerpoint and Keynote, LaTeX Beamer has a high getting-started threshold of learning the keywords and peculiarities, which is inherent to all things LaTeX. This makes it unappealing for beginners, but also somewhat reduces productivity for experienced users.
    But fear not! There’s a well-working converter from the great markup language Markdown to LaTeX Beamer slides. And this is how it works.

    TL;DR: Pandoc can convert Markdown to PDF slides using LaTeX Beamer. It works out of the box, but can easily be extended. Apart from LaTeX Beamer, also HTML slideshows using reveal.js (and others) are possible from the same source file.

    → continue reading!