GPU kernel tuning

To get the maximum performance out of your GPU, the CUDA kernels need to be tuned to find the optimal settings (e.g. thread block sizes) for your setup. For this tuning, we use Kernel Launcher and Kernel Tuner, both developed by the Netherlands eScience Center (NLeSC). For more info on GPU kernel tuning, see the Kernel Tuner reference paper.

The tuning consists of two steps. First, Kernel Launcher captures runtime information from MicroHH: the code of the kernels that need to be tuned, together with their input data and parameters. Second, Kernel Tuner is used to find the optimal configuration for each kernel. These configurations are appended to a wisdom file, which MicroHH reads at runtime to select the best configuration for the setup and case that you are running.
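
In short, and using the drycblles case from the examples below, the workflow looks as follows (each command is explained in the sections below; paths and the case name are examples):

# 1. Capture kernels and their input data while running the case:
KERNEL_LAUNCHER_TUNE=* ./microhh run drycblles
# 2. Tune the captured kernels (from microhh/tuner); the results are appended to the wisdom files:
python tune.py ../captures/*.json --time=30:00
# 3. Run as usual; MicroHH reads the wisdom files and selects the best configuration:
./microhh run drycblles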

Installation

The Kernel Launcher code is included automatically through a Git submodule in microhh/external/kernel_launcher. If that directory is missing or empty, update all submodules:

git submodule update --init --recursive

Kernel Tuner is available through PyPI, and can be installed (together with its dependencies) using:

pip install kernel_tuner[cuda]
# Or to install the requirements manually:
pip install pycuda cupy scikit-learn scikit-optimize kernel_tuner

Next, you need to configure and compile MicroHH with Kernel Launcher enabled:

cmake -DUSESP=true -DUSECUDA=true -DUSEKERNELLAUNCHER=true ..

GPU tuning

To tune a specific model setup, simply run the init and run phases as usual, but for the run phase, specify which kernels you want to tune:

# Specify selection of kernels:
KERNEL_LAUNCHER_TUNE=advec_2i5* ./microhh run drycblles
# Or simply tune all:
KERNEL_LAUNCHER_TUNE=* ./microhh run drycblles

Kernel Launcher will provide info about which captures are exported, e.g.:

KERNEL_LAUNCHER [INFO] the following kernels will be captured: *
KERNEL_LAUNCHER [INFO] writing capture to /home/bart/meteo/models/microhh/captures/diff_smag2__calc_strain2_float_128x128x128.json for kernel diff_smag2::calc_strain2@float
Et cetera.

Once the captures are saved and the time integration starts, you can safely kill MicroHH. A number of captures have now been written to microhh/captures in the form of .json (tuning settings/parameters) and .bin (input data for the kernels) files.
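
For example, listing the captures directory should show one .json/.bin pair per captured kernel; the file names below are illustrative, based on the log message above:

ls captures/
# diff_smag2__calc_strain2_float_128x128x128.json
# diff_smag2__calc_strain2_float_128x128x128.bin
# ...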

Warning

The captured files include full 3D fields, which are needed as input for the kernels during tuning. As a result, the disk space used by captures can grow rapidly. After the tuning step below is finished, the .json and .bin files can safely be deleted to free disk space.
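
For example, once the wisdom files have been written, something like the following frees the disk space again:

# Only after tuning has finished:
rm captures/*.json captures/*.bin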

To tune the code, run the tune.py script in microhh/tuner:

# Tune selection of captures:
python tune.py ../captures/*128x128x128.json --time=30:00
# Or simply tune all:
python tune.py ../captures/*.json --iterations=250

As you can see, there are different ways to specify how much time Kernel Tuner should spend tuning each kernel. After tuning, the optimal configurations for your specific setup are appended to the wisdom files in microhh/wisdom/. For each kernel (with capture file captures/kernel_name.json in this example), the individual timings are written to captures/kernel_name.cache.json, which can be visualised with microhh/tuner/plot_cache.py.
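
As a hedged example, assuming plot_cache.py takes the cache file as a positional argument (check the script for its exact options), visualising the timings of the kernel captured earlier could look like:

# From microhh/tuner; the cache file name follows the capture name:
python plot_cache.py ../captures/diff_smag2__calc_strain2_float_128x128x128.cache.json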

In its default configuration, tune.py uses a Bayesian optimization strategy to find the optimal configuration. The figure below, which shows the timings from tuning the advec_2i5 scheme for one hour, illustrates that Kernel Tuner finds a reasonably fast configuration fairly quickly, although finding the absolute fastest configuration took almost the full hour.

Figure: tuning results for the advec_2i5 scheme.

Using the wisdom files

During the run phase, MicroHH / Kernel Launcher reads the wisdom files and selects the best available configuration. If the exact configuration for your problem size is not available in the wisdom files, Kernel Launcher will inform you which configuration it selected instead, for example:

KERNEL_LAUNCHER [INFO] no wisdom found for kernel "diff_smag2::calc_strain2@float",
device "NVIDIA RTX A5000", and problem size (64, 64, 128),
using configuration for different problem size: (128, 128, 128).

Note that Kernel Launcher will always pick one configuration, even if it might not be optimal for your setup or GPU:

KERNEL_LAUNCHER [WARN] no wisdom found for kernel "diff_smag2::calc_strain2@float"
and device "NVIDIA GeForce RTX 2060 SUPER",
using configuration for different device "NVIDIA RTX A5000".

Tuning all tunable kernels

Exporting the captures for all available (tunable) kernels, and a variety of grid sizes, can be a tedious task. The drycblles_tuner.py script in the drycblles case simplifies this task by automatically running the model in various configurations, such that all tunable kernels are captured.
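
A hedged example, assuming the script is started from the drycblles case directory and the default options are sufficient (check drycblles_tuner.py itself for its exact usage):

# From the drycblles case directory; runs the model in several configurations
# so that all tunable kernels are captured:
python drycblles_tuner.py
# Afterwards, tune all captures as described above (from microhh/tuner):
python tune.py ../captures/*.json --iterations=250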

Debugging

To get more info out of Kernel Launcher, run the model with:

KERNEL_LAUNCHER_LOG=debug ./microhh run drycblles