GPU Containers on Slurm¶
This tutorial introduces software containers, how to build them, and how to run them on Slurm clusters using apptainer. It is not meant to teach container mastery, but to expose you to some best practices with containers on HPC systems.
Last updated: Oct 02, 2025
Objectives
Quickstart to running GPU containers
Building and testing your first GPU container
Best practices for building python-based containers
Developing python scripts inside a running container
Running multi-node containers
Requirements
Container build system - Apptainer build OR Docker CLI, pre-authenticated with a container registry
Container runtime - Slurm GPU cluster with apptainer OR enroot
Prepare your environment
# Start a 3 hour interactive job with 1 GPU
srun -p interactive -n 1 -G 1 --cpus-per-task 16 -t 03:00:00 --pty bash -l
# Change cachedir to /tmp
export APPTAINER_CACHEDIR=/tmp/${USER}_apptainer_cache
# Create workspace for tutorial
mkdir -p ${MYDATA}/containers
# Change to your workspace
# - This is a good place to store containers
# - Keep definition files on $HOME
cd ${MYDATA}/containers
# Pull the container with apptainer
apptainer pull docker://nvcr.io/nvidia/pytorch:24.03-py3
# Or, if using docker, pull the container
docker pull nvcr.io/nvidia/pytorch:24.03-py3
Introduction to containers¶
Containers are a common method for building, distributing, and running applications, web services, development environments, and more. Containers have gained popularity because they package up an application and all of its dependencies, provide isolation from the host environment, and allow for consistent deployment across platforms. While you may have also heard of virtual machines (VMs), containers are distinct: they rely on namespace virtualization rather than hardware emulation, so there is essentially no performance penalty.
The portability and reproducibility, without sacrificing performance, make containers ideal for scientific applications. Whole environments can be saved to ensure a published tool can always be used over time.
Docker is the most common container runtime, but there are many others that can consume Open Container Initiative (OCI) images. This tutorial focuses on building containers with apptainer or docker, and then running containers in a shared HPC environment with apptainer or enroot.
Quickstart: Running your first GPU container¶
Think of this as a quickstart to running GPU containers on HPC systems.
Pulling the container¶
Containers should only be run on a compute node inside a job, so I recommend starting a job on a GPU node.
If you’re not already on one, take a look at how to prepare your environment or the srun command below if you’re using enroot.
To run a container on HPC systems, we first need to pull the layers from the container registry and then convert them into a single image.
# Pull the container with apptainer
apptainer pull docker://nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04
# If using enroot instead:
# Start a 3 hour interactive job with 1 GPU
srun -p gpu -n 1 -G 1 --cpus-per-task 16 -t 03:00:00 --pty bash -l
# Import the container with enroot
enroot import docker://nvcr.io#nvidia/cuda:12.4.1-devel-ubuntu22.04
Note
Keep this job running for the rest of this tutorial
Examining the CUDA environment¶
To start off, take a look at the CUDA environment outside of the container by running the NVIDIA System Management Interface program (nvidia-smi).
$ nvidia-smi
Tue Dec 3 18:46:40 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S Off | 00000000:41:00.0 Off | 0 |
| N/A 36C P8 35W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Running nvidia-smi is the easiest way to see if there’s a GPU on your system and what driver it is running.
In addition to using it on your host, it works inside GPU-capable containers.
apptainer exec --nv cuda_12.4.1-devel-ubuntu22.04.sif nvidia-smi
If you’re running apptainer, you’ll notice that the CUDA version doesn’t change with the --nv flag.
This will change if the --nvccli option (nvidia container cli) is enabled on your system.
Optional Exercises¶
What happens if you exclude the --nv flag with apptainer?
What happens if you run the container on a system without a GPU?
Building and testing your first GPU container¶
In this section, we’ll be building the nbody sample benchmark from https://github.com/NVIDIA/cuda-samples.
The nbody benchmark demonstrates an efficient all-pairs gravitational n-body simulation in CUDA and reports a GFLOP/s metric at the end. While this GFLOP/s metric is not meant for true performance comparisons, this sample code supports multiple GPUs and is relatively easy to build.
Containers are built using recipe files like Docker’s Dockerfile or Apptainer’s Definition file, which are essentially scripts for provisioning a Linux environment.
Choosing a starting container¶
The first step to building any container is choosing an image to start from. This starting image is often a clean OS like this ubuntu image, to which you can add any dependencies needed to build and run your software. Alternatively, you can start from an image that already has the software pre-installed.
We’re going to be building and running a GPU application, so I recommend starting from NVIDIA’s CUDA container on NGC. NGC is NVIDIA’s container registry, where NVIDIA software, SDKs, and models are published in container format. Not only are these meant to make your development easier, they also serve as a common environment for NVIDIA to reproduce and troubleshoot any issues you might encounter through enterprise support with NVAIE.
Looking at the tags tab, you’ll see many different containers.
To help you understand the naming convention, containers usually have a <project>/<name>:<tag> format.
If you browse through the available containers, you’ll see that each container is named cuda, but tags have some common elements along with a CUDA version prefix:
base: includes the CUDA runtime (cudart)
runtime: base + CUDA math libraries and NCCL
devel: runtime + headers and development tools for compiling CUDA applications
cudnn-: (prefix) any of the above + cuDNN libraries
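To make the `<project>/<name>:<tag>` convention concrete, here is a small bash sketch (not part of the tutorial files) that splits the CUDA image reference into its parts using parameter expansion:

```shell
# Split a full image reference into registry/project/name/tag
# (a sketch for illustration; assumes the reference always includes a tag)
ref="nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04"
registry="${ref%%/*}"                 # nvcr.io
tag="${ref##*:}"                      # 12.4.1-devel-ubuntu22.04
path="${ref#*/}"; path="${path%:*}"   # nvidia/cuda
project="${path%%/*}"                 # nvidia
name="${path##*/}"                    # cuda
echo "registry=$registry project=$project name=$name tag=$tag"
```

The same pattern applies to Docker Hub references, where the registry part is implied.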
There are a ton of options, so here are some recommendations on choosing a container:
Latest CUDA version (unless a specific one is needed)
Newer libraries work on older drivers
base for simple CUDA applications
devel for multi-stage builds
Choose an OS with a package manager you’re familiar with
Note
We’ll cover multi-stage builds in container optimization
In this case, we’re going to start from the nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04 container that we already pulled and cached during the quickstart.
Installing dependencies and building¶
Just like when trying to run an application, identifying and installing compatible dependencies is the hardest part of container development.
If you look at the dependencies for nbody, X11 and GL are required to build and run.
On an Ubuntu system (note the container tag), we can install the development headers and libraries along with curl using:
apt-get update && apt-get install -y --no-install-recommends \
freeglut3-dev libgl1-mesa-dev libglu1-mesa-dev curl
These commands require root privileges because they modify system locations, so they won’t work for non-root users. If you’re figuring out how to build a container, you can prototype commands in an interactive container:
# Create an overlay directory
# - The base container is never changed, just the overlay
# - Overlay only works with a single image
mkdir -p ${APPTAINER_CACHEDIR}/cuda-devel_overlay
# Launch a shell in the cuda devel container with
# - fakeroot - appear to be root in the container
# - overlay - allow modifications to be written to overlay directory
apptainer shell --fakeroot \
--overlay ${APPTAINER_CACHEDIR}/cuda-devel_overlay \
cuda_12.4.1-devel-ubuntu22.04.sif
# Your cluster may also support overlay images instead of directories
# Pull the container
docker pull nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04
# Enter the container and delete any modifications (--rm)
docker run --rm -it nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04 bash -l
Once the dependencies are installed, you can download, build, and install the nbody application with the following commands:
# Grab the sample code
curl -sL https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v12.4.1.tar.gz -o v12.4.1.tar.gz
# Unpack the tarball
tar -xzf v12.4.1.tar.gz
# Build the nbody executable
cd cuda-samples-12.4.1/Samples/5_Domain_Specific/nbody \
&& make && mv nbody /usr/local/bin
Wrapping it all up and building the container¶
Your desired starting container and installation commands can be wrapped up into a single file. Apptainer uses Definition files and Docker uses Dockerfiles.
Exit your interactive container instance and download the corresponding build file with wget.
Bootstrap: docker
From: nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04
%post
# Install dependencies
apt-get update \
&& apt-get install -y --no-install-recommends \
freeglut3-dev libgl1-mesa-dev libglu1-mesa-dev curl
# Grab the sample code
curl -sL https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v12.4.1.tar.gz -o /root/v12.4.1.tar.gz
# Unpack the tarball to /root
tar -C /root -xzf /root/v12.4.1.tar.gz
# Build the nbody executable
cd /root/cuda-samples-12.4.1/Samples/5_Domain_Specific/nbody \
&& make && mv nbody /usr/local/bin
FROM nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04
# Install dependencies
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
freeglut3-dev libgl1-mesa-dev libglu1-mesa-dev curl
# Grab the sample code
RUN curl -sL https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v12.4.1.tar.gz -o /root/v12.4.1.tar.gz
# Unpack the tarball to /root
RUN tar -C /root -xzf /root/v12.4.1.tar.gz
# Build the nbody executable
RUN cd /root/cuda-samples-12.4.1/Samples/5_Domain_Specific/nbody \
&& make && mv nbody /usr/local/bin
Note
You can either download this file directly or copy and paste into your favorite text editor
You can then build a container named nbody from your build script as follows:
# You should already be on a compute node
# srun -p gpu -n 1 -G 1 --cpus-per-task 16 -t 03:00:00 --pty bash -l
# build the container
apptainer build nbody.sif Definition.nbody
# Look at image size
ls -lh nbody.sif
# Your Docker hub username
HUB_USER=
# build the container
docker build -t ${HUB_USER}/nbody -f Dockerfile.nbody .
# Look at image size
docker images | grep nbody
This is a relatively large image, so not only does it take up a lot of space on the filesystem, it would also take a while to upload to a remote registry for sharing or archiving. Let’s instead figure out how to make our final image more space efficient.
Making your container more space efficient¶
We can make this much smaller using the following techniques:
Use a multi-stage build - Building in one container and copying the built binaries to a runtime container
Only install runtime libraries in the final container
Using the base container instead of devel
Not installing *-devel packages from apt
Copy the finished binary instead of the full source repo
Bootstrap: docker
From: nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04
Stage: builder
%post
# Install dependencies
apt-get update \
&& apt-get install -y --no-install-recommends \
freeglut3-dev libgl1-mesa-dev libglu1-mesa-dev curl
# Grab the sample code
curl -sL https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v12.4.1.tar.gz -o /root/v12.4.1.tar.gz
# Unpack the tarball to /root
tar -C /root -xzf /root/v12.4.1.tar.gz
# Build the nbody executable
cd /root/cuda-samples-12.4.1/Samples/5_Domain_Specific/nbody \
&& make && mv nbody /usr/local/bin
# Change to the base image
Bootstrap: docker
From: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
%post
# Only install the runtime libraries
apt-get update \
&& apt-get install -y --no-install-recommends \
freeglut3 libgl1 libglu1 \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
# Copy the pre-built binary from our builder stage
%files from builder
/usr/local/bin/nbody /usr/local/bin/nbody
FROM nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04 AS builder
# Install runtime and build dependencies
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
freeglut3-dev libgl1-mesa-dev libglu1-mesa-dev
# Grab the sample code
ADD https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v12.4.1.tar.gz /root
# Unpack the tarball to /root
RUN tar -C /root -xzf /root/v12.4.1.tar.gz
# Build the nbody executable
RUN cd /root/cuda-samples-12.4.1/Samples/5_Domain_Specific/nbody \
&& make && mv nbody /usr/local/bin/
# Change to the base image
FROM nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
# Install the runtime dependencies (not *-dev)
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
freeglut3 libgl1 libglu1 \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
# Copy the pre-built binary from our builder stage
COPY --from=builder /usr/local/bin/nbody /usr/local/bin/nbody
Make sure to change the name or tag of the container when building it.
# build the container
apptainer build nbody-efficient.sif Definition.nbody-efficient
# Look at image size
ls -lh nbody-efficient.sif
# Your Docker hub username
HUB_USER=
# build the container
docker build -t ${HUB_USER}/nbody:efficient -f Dockerfile.nbody-efficient .
# Look at image size
docker images | grep nbody
# Push the container
docker push ${HUB_USER}/nbody:efficient
Once again, let’s look at the final size of the containers we built.
$ ls -lh nbody*sif
-rwxr-xr-x 1 greg.zynda greg.zynda.grp 147M Dec 3 20:34 nbody-efficient.sif
-rwxr-xr-x 1 greg.zynda greg.zynda.grp 4.2G Dec 3 08:51 nbody.sif
Comparing the two .sif images built by apptainer, you’ll notice that the efficient build is much smaller: 147MB vs 4.2GB!
Not only will this take up less space on your filesystem, but it’s also easier to archive with a publication.
Running the nbody sample benchmark¶
You should already be inside a job with an allocated GPU, so you can run the benchmark with the following:
# Check that GPU is still detected
apptainer exec --nv nbody-efficient.sif nvidia-smi
# Run nbody benchmark
apptainer exec --nv nbody-efficient.sif nbody -benchmark -numbodies=2000000
When your job is done, you should see output similar to the following:
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ada" with compute capability 8.9
> Compute 8.9 CUDA device: [NVIDIA L40S]
Warning: "number of bodies" specified 2000000 is not a multiple of 256.
Rounding up to the nearest multiple: 2000128.
2000128 bodies, total time for 10 iterations: 21772.984 ms
= 1837.374 billion interactions per second
= 36747.484 single-precision GFLOP/s at 20 flops per interaction
Note
These performance results will change based on the GPU type you were allocated.
Optional Exercises¶
Looking at the help text, try using a different number of GPUs (requires new job)
Try increasing the number of bodies in the simulation
Try using double precision
Best practices for building python-based containers¶
One of the most common issues I encounter when folks use containers with pre-existing python packages and libraries is accidentally replacing or overwriting those packages with conda or pip.
NVIDIA’s NGC containers ship patched versions of PyTorch and supporting libraries that shouldn’t be altered if you’re looking for optimal and verified performance.
This section will focus on how to install python packages in a way that will prevent changes to the pre-installed packages.
To illustrate this, try installing pytorch from the base pytorch:24.03-py3 container.
# Create an overlay directory
# - The base container is never changed, just the overlay
# - Overlay only works with a single image
mkdir -p ${APPTAINER_CACHEDIR}/pytorch_24.03_overlay
# Launch a shell in the cuda devel container with
# - fakeroot - appear to be root in the container
# - overlay - allow modifications to be written to overlay directory
apptainer shell --fakeroot \
--overlay ${APPTAINER_CACHEDIR}/pytorch_24.03_overlay \
pytorch_24.03-py3.sif
# Pip install pytorch inside the running container
pip install torch torchvision torchaudio
# Launch pytorch container on your host
docker run --rm -it nvcr.io/nvidia/pytorch:24.03-py3 bash -l
# Pip install pytorch inside the running container
pip install torch torchvision torchaudio
You’ll notice that installing these packages changes the torch package and installs a bunch of CUDA libraries even though both already exist. As you learned with our efficient builds, this greatly increases the size of the container layers while also potentially breaking any applications linked against these libraries and the “known working state”.
Let’s exit this container and create a fresh overlay.
# Be sure to exit your interactive container session
exit
Luckily, you can lock the versions by creating a package constraints file, which has the same format as a requirements file.
# Delete old overlay and recreate
rm -rf ${APPTAINER_CACHEDIR}/pytorch_24.03_overlay
mkdir -p ${APPTAINER_CACHEDIR}/pytorch_24.03_overlay
# Launch a shell in the cuda devel container with
# - fakeroot - appear to be root in the container
# - overlay - allow modifications to be written to overlay directory
apptainer shell --fakeroot \
--overlay ${APPTAINER_CACHEDIR}/pytorch_24.03_overlay \
pytorch_24.03-py3.sif
# Save all existing packages and versions to a text file
pip list | awk '{print$1"=="$2}' | tail -n +3 > /root/base_constraints.txt
# Install any new packages without upgrading existing packages
pip install -c /root/base_constraints.txt torch torchvision torchaudio
This install should now fail because the pre-built torchaudio wheels can’t be installed with the NVIDIA patched versions of torch.
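As an aside, if the awk/tail pipeline above feels fragile, recent versions of pip can emit the same `name==version` lines directly (a sketch; assumes pip supports the `--format=freeze` flag, which has been available for many years):

```shell
# Equivalent to: pip list | awk '{print$1"=="$2}' | tail -n +3
pip list --format=freeze > /tmp/base_constraints.txt
# Peek at the first few pinned packages
head -n 3 /tmp/base_constraints.txt
```

Either approach produces a valid constraints file for `pip install -c`.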
Note
If you actually want to install torchaudio into the Pytorch NGC container, take a look at this recipe.
Let’s practice using this constraint method by building a new container with the PyTorch Lightning framework starting FROM the pytorch:24.03-py3 container.
# Change to the base image
Bootstrap: docker
From: nvcr.io/nvidia/pytorch:24.03-py3
%post
# Save all existing packages and versions to a text file
pip list | awk '{print$1"=="$2}' | tail -n +3 > /root/base_constraints.txt
# Install any new packages without upgrading existing packages
pip install -c /root/base_constraints.txt lightning
FROM nvcr.io/nvidia/pytorch:24.03-py3
# Save all existing packages and versions to a text file
RUN pip list | awk '{print$1"=="$2}' | tail -n +3 > /root/base_constraints.txt
# Install any new packages without upgrading existing packages
RUN pip install -c /root/base_constraints.txt lightning
After downloading the corresponding build script, the container can be built with the following commands.
# build the container
apptainer build lightning.sif Definition.lightning
# Your Docker hub username
HUB_USER=
# build the container
docker build -t ${HUB_USER}/lightning:latest -f Dockerfile.lightning .
# Push the container
docker push ${HUB_USER}/lightning:latest
Unlike the torchaudio install, this went fine, and no existing packages changed.
If a package or its dependencies require a different version of PyTorch, you can either change the container version based on the NVIDIA support matrix to match the required version or determine if the package’s dependencies can be relaxed to match the package version in the container.
Optional Exercises¶
Try installing another python package
Developing python scripts inside a running container¶
Containers are meant to be static, reproducible checkpoints for your code that can always be started in the same way. This makes them ideal for porting software to different systems, reproducing results, archiving software, and more. However, since containers shouldn’t change once they’re built (because that would break reproducibility), developing software in them is not always intuitive.
If you try to incorporate all your code in the container and rebuilding as it evolves, this can get tedious - especially if you’re pushing and pulling these containers between a registry. Instead, I recommend making a container with most or all of your dependencies, and mounting your code into the container at runtime.
To explore these concepts, lets launch an interactive environment with our lightning container.
# Launch an interactive shell in the lightning container
# - Include GPU support with --nv
apptainer shell --nv lightning.sif
First, let’s open another terminal to the cluster. That could be another tmux pane or a whole new terminal connection from your local system. Once you have that open, let’s look around in the running container.
|   | Container shell | Second shell |
|---|---|---|
| Who are you running as? | `whoami` | `whoami` |
| Where are you running from? | `pwd` | `cd $MYDATA/containers` |
| Do files match? | `ls -lh` | `ls -lh` |
| Do changes propagate? | `echo "hello" > container.txt` | `cat container.txt` |
| What else is in the container by default? | `ls -lh $HOME; ls -lh /tmp` | `ls -lh $HOME; ls -lh /tmp` |
| What if you create a file somewhere else? | `touch /workspace/test` | `ls /workspace` |
| Should you be able to create files? | `ls -lhd /workspace` | `ls -lhd /workspace` |
Note
$MYDATA/containers was available in the container because the container mounts our current working directory at runtime.
If you need additional locations available in the container, you can make them available with (similar to Docker’s -v):
apptainer - bind (-B)
enroot - mount (-m)
Running external scripts¶
As you experienced when trying to create a test file in /workspace, the container has a read-only filesystem, even though the directory permissions suggest it is open for writing.
This means that you can’t make any changes without an overlay.
This might be tedious for prototyping, but it’s good if you’re sharing a container with colleagues on a project, or if you just want to make sure you can’t accidentally make changes.
First, download python_dev.tar.gz to your current working directory with wget (may need to use --no-check-certificate).
After downloading, unpack the tarball with tar.
# Unpack
tar -xzf python_dev.tar.gz
cd python_dev
ls *
The script self_contained.py doesn’t require any extra python modules other than PyTorch, which exists in the container, and can be run directly.
Try running it.
python self_contained.py
Not only can you run scripts from inside the container, you can interact with them outside the container too. If you still have your other terminal open, find these files and open the script in your favorite editor. Not only can you open the files, you can edit them too - all while being able to run them inside a container.
Developing packages from inside a container¶
If you’re developing a whole package that needs to be updated, you either have to rely on relative imports or install the package.
Relative imports often work, but may not depending on the complexity of the package.
In our example python code, there’s a pt_bench python module that gets loaded and used by bench.py.
# Prints where pt_bench was loaded from
python bench.py
# Change directories
cd ..
# Copy bench.py to break relative imports
cp python_dev/bench.py .
python bench.py
You can see that it’s easy to go wrong with relative imports, so I often recommend fully installing the package.
We already know that the container can’t be modified.
Luckily, python can install packages in a user directory, which defaults to $HOME/.local, using the --user flag.
# Install pt_bench using our constraint file
pip install -c /root/base_constraints.txt --user python_dev/
# Try running bench.py again
python bench.py
You should see that pt_bench is being loaded from $HOME/.local, which is where user packages are installed.
While this works, this location is shared by every container you run, which can lead to package collisions between containers.
I recommend launching the container with -c, which will not mount any external locations, and -B to mount the current working directory.
Since many things require a valid $HOME for writing files, apptainer creates a temporary filesystem (tmpfs) for /home.
You’ll be able to make changes, like installing a small package, and it won’t affect the container or bleed into other python environments.
First, lets clean our environment
# remove pt_bench
pip uninstall -y pt_bench
# Exit the container
exit
and then relaunch.
# Relaunch the lightning container shell
# - Include GPU support (--nv)
# - Exclude $HOME mount (-c)
# - Mount CWD (-B)
apptainer shell --nv -c -B $PWD:$PWD lightning.sif
# Make sure home is empty
ls $HOME
# Change to container directory
cd $MYDATA/containers
# Try running bench.py
python bench.py
# The pt_bench module wasn't found
# Do a local install in $HOME tmpfs
pip install -c /root/base_constraints.txt --user python_dev/
# Run bench.py
python bench.py
Lastly, if you’re making changes to the package, you can do an editable install with -e. This means that when the package is installed, it’s really just linked to its current location instead of copying files.
# remove pt_bench
pip uninstall -y pt_bench
# Editable install (-e)
pip install -c /root/base_constraints.txt --user -e python_dev/
# Make a change to a package file
echo "print('New Change')" >> python_dev/pt_bench/__init__.py
# Run bench, and see if change works
python bench.py
When you exit the container, make sure the pt_bench package no longer exists.
# Exit the container
exit
# Make sure pt_bench doesn't exist
find $HOME/.local/ | grep pt_bench
If you’re done exploring the container, feel free to exit the job in preparation for the next section.
# Exit the job
exit
Running multi-node containers¶
Multi-node, or distributed, computing is a model of computation that runs parallel tasks across multiple computers. While it’s easy to spawn threads and processes on a single system, distributed applications need to be launched across all nodes and told how to communicate with each other. This sounds difficult, but many frameworks make it accessible and give you near-linear speedups as more compute nodes are used.
Multi-node MPI NCCL Test¶
PyTorch containers from NGC ship with NCCL tests, which are useful for diagnosing MPI and bandwidth issues. If I’m ever questioning the performance of the compute fabric between GPUs, this is the first thing I run.
These can be run as single-line jobs using srun to handle the allocation and process spawning.
# Run on 2 GPUs of any type
# (-g) argument sets how many GPUs each process will use
srun -p gpu -N 2 -n 2 --gpus-per-node 1 --mpi=pmi2 apptainer exec --nv lightning.sif all_reduce_perf_mpi -b 1G -e 4G -f 2 -g 1
# Run on 4 H100 GPUs across 2 nodes
srun -p gpu --mem=32G -N 2 -n 2 --gpus-per-node h100:2 --mpi=pmi2 apptainer exec --nv lightning.sif all_reduce_perf_mpi -b 1G -e 4G -f 2 -g 2
Note
If you want to figure out how many GPUs are on a node and their type, you can run scontrol show node [node name] to see what resources are available on that node.
Multi-node PyTorch¶
Using wget, download pt_ddp_example.py, which is a simple script to demonstrate strong scaling using PyTorch DDP.
We’ll be skipping over PyTorch specifics to focus on how to launch multi-node PyTorch containers with Slurm.
Download the following sbatch script as well.
#!/bin/bash
#SBATCH --job-name=pt_ddp_example
#SBATCH --nodes=2 # Set number of nodes
#SBATCH --gpus-per-node=2 # Set number of GPUs per node
#SBATCH --mem=32GB # Set memory limits (consider --exclusive)
#SBATCH --tasks-per-node=1
#SBATCH --output=%x-%j.out
#SBATCH --cpus-per-task=8
#SBATCH --partition=gpu
#SBATCH --time=00:30:00
# Job debug info
echo "Launching on ${SLURM_JOB_NUM_NODES} nodes"
echo "Launching on: " ${SLURM_JOB_NODELIST}
echo "Launching ${SLURM_NTASKS_PER_NODE} tasks per node"
echo "Using ${SLURM_GPUS_ON_NODE} GPUs per task"
# Optional debug logging
#export LOGLEVEL=INFO
#export NCCL_DEBUG=INFO
##### No need to edit these #########################################################
# main address is detected by first name in nodelist
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
# port is chosen by jobID (prevents collisions if nodes are shared)
export MASTER_PORT=$(expr 10000 + $(echo -n ${SLURM_JOBID} | tail -c 4))
export WORLD_SIZE=$((${SLURM_NNODES} * ${SLURM_GPUS_ON_NODE}))
#####################################################################################
echo "Training on ${WORLD_SIZE} GPUs - ${MASTER_ADDR}:${MASTER_PORT}"
srun --mpi=pmi2 apptainer exec --nv lightning.sif torchrun \
--nnodes ${SLURM_JOB_NUM_NODES} \
--nproc_per_node ${SLURM_GPUS_ON_NODE} \
--rdzv_id $RANDOM \
--rdzv_backend c10d \
--rdzv_endpoint $MASTER_ADDR:${MASTER_PORT} \
pt_ddp_example.py
PyTorch needs the following variables set for multi-node runs:
MASTER_ADDR - Address of the main node
MASTER_PORT - Port to connect on
WORLD_SIZE - Total number of workers/GPUs
While srun launches the initial process on each node, it calls torchrun, which spawns additional processes based on the argument --nproc_per_node.
Think of torchrun as a helper script that handles a lot of the global and local rank logic.
Optional variables:
LOGLEVEL - pytorch log level
NCCL_DEBUG - NCCL log level
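If you want to sanity-check the MASTER_PORT and WORLD_SIZE arithmetic from the sbatch script outside of a job, here is a sketch with fake Slurm variables (the values are for illustration only):

```shell
# Fake Slurm variables for illustration (normally set by the scheduler)
SLURM_JOBID=1234567
SLURM_NNODES=2
SLURM_GPUS_ON_NODE=2
# Port = 10000 + last 4 digits of the job ID (prevents collisions on shared nodes)
MASTER_PORT=$(expr 10000 + $(echo -n ${SLURM_JOBID} | tail -c 4))
# Total workers = nodes * GPUs per node
WORLD_SIZE=$((${SLURM_NNODES} * ${SLURM_GPUS_ON_NODE}))
echo "port=${MASTER_PORT} world_size=${WORLD_SIZE}"   # port=14567 world_size=4
```

Because the port is derived from the job ID, two jobs sharing a node are unlikely to pick the same rendezvous port.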
Submit the script with sbatch, which will generate a .out file with a number corresponding to the job with all output text.
You’ll see that this runs a training job on 4 GPUs in total, distributed across 2 nodes.
If you increase the resources allocated by the SBATCH arguments, training will scale as well.
Multi-node Pytorch Lightning¶
This is the same task as the Multi-node PyTorch script, just adapted to PyTorch Lightning. You’ll notice that the code is cleaner because PyTorch Lightning does its best to simplify common training tasks, including multi-GPU and multi-node training.
Download both the training script ptl_ddp_example.py and the sbatch script below.
#!/bin/bash
#SBATCH --job-name=ptl_ddp_example
#SBATCH --nodes=2 # Set number of nodes
#SBATCH --tasks-per-node=2 # Set number of GPUs per node
#SBATCH --gpus-per-node=2 # - set to the same
#SBATCH --mem=16GB # Set memory limits (consider --exclusive)
#SBATCH --cpus-per-task=8
#SBATCH --output=%x-%j.out
#SBATCH --partition=gpu
#SBATCH --time=00:30:00
#export LOGLEVEL=INFO
#export NCCL_DEBUG=INFO
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=$(expr 10000 + $(echo -n ${SLURM_JOBID} | tail -c 4))
# Launches one container per node
# - Container spawns multiple processes
srun --mpi=pmi2 apptainer exec --nv lightning.sif \
bash -c "export NODE_RANK=\${SLURM_PROCID}; python \
ptl_ddp_example.py -N $SLURM_JOB_NUM_NODES \
-p ${SLURM_GPUS_ON_NODE}"
With PyTorch Lightning, both the MASTER_ADDR and MASTER_PORT need to be set, but also the NODE_RANK, which is the 0-based index of the node the process is on.
In this example, it’s being set in a bash shell, with the $ escaped so it’s substituted after being launched on each node by srun.
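You can see the effect of that escape without Slurm. In this sketch, the inline variable assignment stands in for the per-node value that srun’s spawned process would see:

```shell
# Pretend the submitting node sees rank 0...
export SLURM_PROCID=0
# ...but each launched process gets its own rank (3 here) from its environment.
# Escaped \$: expanded by the inner shell, so the per-process value wins
SLURM_PROCID=3 bash -c "echo rank=\${SLURM_PROCID}"   # rank=3
# Unescaped $: expanded by the submitting shell before launch
SLURM_PROCID=3 bash -c "echo rank=${SLURM_PROCID}"    # rank=0
```

This is exactly why NODE_RANK must be set with the escaped form inside the srun command line.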
When it’s running, you’ll see that Lightning has nice logging about the process pool at the start, and produces nice output during the training progress.
Optional Exercises¶
Try using scontrol show node [node name] to see what kinds of GPUs are available on your cluster.
Try using more GPUs to see how the number of steps run by each GPU scales.
Try comparing training and NCCL performance on different types of nodes.
Next Steps¶
Apptainer/Singularity is a well-known container runtime in the world of HPC, but NVIDIA recommends using enroot as a container runtime for several reasons. Enroot doesn’t have build functionality, but it can consume OCI images built by Docker or buildah and can be combined with pyxis for Slurm support. I also highly recommend checking out Docker for building containers due to the size of the community and support availability.
NVIDIA Containers:
Container workshops/tutorials: