GPU Containers on Slurm¶
This tutorial introduces software containers, how to build them, and how to run them on Slurm clusters using apptainer. It is not meant to teach container mastery, but to expose you to some best practices with containers on HPC systems.
Last updated: Oct 02, 2025
Objectives
Quickstart to running GPU containers
Building and testing your first GPU container
Best practices for building python-based containers
Developing python scripts inside a running container
Running multi-node containers
Requirements
Container build system - Apptainer build OR Docker CLI, pre-authenticated with a container registry
Container runtime - Slurm GPU cluster with apptainer OR enroot
Prepare your environment
# Start a 3 hour interactive job with 1 GPU
srun -p interactive -n 1 -G 1 --cpus-per-task 16 -t 03:00:00 --pty bash -l
# Change cachedir to /tmp
export APPTAINER_CACHEDIR=/tmp/${USER}_apptainer_cache
# Create workspace for tutorial
mkdir -p ${MYDATA}/containers
# Change to your workspace
# - This is a good place to store containers
# - Keep definition files on $HOME
cd ${MYDATA}/containers
# Pull the container with apptainer
apptainer pull docker://nvcr.io/nvidia/pytorch:24.03-py3
# Or, if using docker, pull the container
docker pull nvcr.io/nvidia/pytorch:24.03-py3
Introduction to containers¶
Containers are a common method for building, distributing, and running applications, web services, development environments, and more. Containers have gained popularity because they package up an application and all of its dependencies, provide isolation from the host environment, and allow for consistent deployment across platforms. While you may have also heard of virtual machines (VMs), containers are distinct: they rely on namespace virtualization rather than hardware emulation, so there is essentially no performance penalty.
The portability and reproducibility, without sacrificing performance, make containers ideal for scientific applications. Whole environments can be saved to ensure a published tool can always be used over time.
Docker is the most common container runtime, but there are many others that can consume Open Container Initiative (OCI) images. This tutorial focuses on building containers with apptainer or docker, and then running containers in a shared HPC environment with apptainer or enroot.
Quickstart: Running your first GPU container¶
Think of this as a quickstart to running GPU containers on HPC systems.
Pulling the container¶
Containers should only be run on a compute node inside a job, so I recommend starting a job on a GPU node.
If you’re not already on one, take a look at how to prepare your environment or the srun command below if you’re using enroot.
To run a container on HPC systems, we first need to pull the layers from the container registry and then convert them into a single image.
# Pull the container with apptainer
apptainer pull docker://nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04
# If using enroot instead:
# Start a 3 hour interactive job with 1 GPU
srun -p gpu -n 1 -G 1 --cpus-per-task 16 -t 03:00:00 --pty bash -l
# Import the container with enroot
enroot import docker://nvcr.io#nvidia/cuda:12.4.1-devel-ubuntu22.04
Note
Keep this job running for the rest of this tutorial
Examining the CUDA environment¶
To start off, take a look at the CUDA environment outside of the container by running the NVIDIA System Management Interface program (nvidia-smi).
$ nvidia-smi
Tue Dec 3 18:46:40 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S Off | 00000000:41:00.0 Off | 0 |
| N/A 36C P8 35W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Running nvidia-smi is the easiest way to see if there’s a GPU on your system and what driver it is running.
In addition to using it on your host, it works inside GPU-capable containers.
apptainer exec --nv cuda_12.4.1-devel-ubuntu22.04.sif nvidia-smi
If you’re running apptainer, you’ll notice that the CUDA version doesn’t change with the --nv flag.
This will change if the --nvccli option (nvidia container cli) is enabled on your system.
Optional Exercises¶
What happens if you exclude the --nv flag with apptainer?
What happens if you run the container on a system without a GPU?
Building and testing your first GPU container¶
In this section, we’ll be building the nbody sample benchmark from https://github.com/NVIDIA/cuda-samples.
The nbody benchmark demonstrates an efficient all-pairs gravitational n-body simulation in CUDA and reports a GFLOP/s metric at the end. While this GFLOP/s metric is not meant for true performance comparisons, this sample code supports multiple GPUs and is relatively easy to build.
Containers are built using recipe files like Docker’s Dockerfile or Apptainer’s Definition file, which are essentially scripts for provisioning a Linux environment.
Choosing a starting container¶
The first step to building any container is choosing an image to start from. This starting image is often a clean OS like this ubuntu image, to which you can add any dependencies needed to build and run your software. Alternatively, you can start from an image that already has the software pre-installed.
We’re going to be building and running a GPU application, so I recommend starting from NVIDIA’s CUDA container on NGC. NGC is NVIDIA’s container registry, where NVIDIA software, SDKs, and models are published in container format. Not only are these meant to make your development easier, they also serve as a common environment for NVIDIA to reproduce and troubleshoot any issues you might encounter through enterprise support with NVAIE.
Looking at the tags tab, you’ll see many different containers.
To help you understand the naming convention, containers usually have a <project>/<name>:<tag> format.
If you browse through the available containers, you’ll see that each container is named cuda, but tags have some common elements along with a CUDA version prefix:
base: includes the CUDA runtime (cudart)
runtime: base + CUDA math libraries and NCCL
devel: runtime + headers and development tools for compiling CUDA applications
cudnn-: (prefix) any of the above + cuDNN libraries
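To make the `<project>/<name>:<tag>` convention concrete, here is a small bash sketch (not part of the tutorial files) that splits the CUDA image reference into its parts using parameter expansion:

```shell
# Split a full image reference into registry/project/name/tag
# (a sketch for illustration; assumes the reference always includes a tag)
ref="nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04"
registry="${ref%%/*}"                 # nvcr.io
tag="${ref##*:}"                      # 12.4.1-devel-ubuntu22.04
path="${ref#*/}"; path="${path%:*}"   # nvidia/cuda
project="${path%%/*}"                 # nvidia
name="${path##*/}"                    # cuda
echo "registry=$registry project=$project name=$name tag=$tag"
```

The same pattern applies to Docker Hub references, where the registry part is implied.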
There are a ton of options, so here are some recommendations on choosing a container:
Latest CUDA version (unless a specific one is needed)
Newer libraries work on older drivers
base for simple CUDA applications
devel for multi-stage builds
Choose an OS with a package manager you’re familiar with
Note
We’ll cover multi-stage builds in container optimization
In this case, we’re going to start from the nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04 container that we already pulled and cached during the quickstart.
Installing dependencies and building¶
Just like when trying to run an application, identifying and installing compatible dependencies is the hardest part of container development.
If you look at the dependencies for nbody, X11 and GL are required to build and run.
On an Ubuntu system (note the container tag), we can install the development headers and libraries along with curl using:
apt-get update && apt-get install -y --no-install-recommends \
freeglut3-dev libgl1-mesa-dev libglu1-mesa-dev curl
These commands require root privileges because they modify system locations, so they won’t work for non-root users. If you’re figuring out how to build a container, you can prototype commands in an interactive container:
# Create an overlay directory
# - The base container is never changed, just the overlay
# - Overlay only works with a single image
mkdir -p ${APPTAINER_CACHEDIR}/cuda-devel_overlay
# Launch a shell in the cuda devel container with
# - fakeroot - appear to be root in the container
# - overlay - allow modifications to be written to overlay directory
apptainer shell --fakeroot \
--overlay ${APPTAINER_CACHEDIR}/cuda-devel_overlay \
cuda_12.4.1-devel-ubuntu22.04.sif
# Your cluster may also support overlay images instead of directories
# Pull the container
docker pull nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04
# Enter the container and delete any modifications (--rm)
docker run --rm -it nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04 bash -l
Once the dependencies are installed, you can download, build, and install the nbody application with the following commands:
# Grab the sample code
curl -sL https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v12.4.1.tar.gz -o v12.4.1.tar.gz
# Unpack the tarball
tar -xzf v12.4.1.tar.gz
# Build the nbody executable
cd cuda-samples-12.4.1/Samples/5_Domain_Specific/nbody \
&& make && mv nbody /usr/local/bin
Wrapping it all up and building the container¶
Your desired starting container and installation commands can be wrapped up into a single file. Apptainer uses Definition files and Docker uses Dockerfiles.
Exit your interactive container instance and download the corresponding build file with wget.
Bootstrap: docker
From: nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04
%post
# Install dependencies
apt-get update \
&& apt-get install -y --no-install-recommends \
freeglut3-dev libgl1-mesa-dev libglu1-mesa-dev curl
# Grab the sample code
curl -sL https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v12.4.1.tar.gz -o /root/v12.4.1.tar.gz
# Unpack the tarball to /root
tar -C /root -xzf /root/v12.4.1.tar.gz
# Build the nbody executable
cd /root/cuda-samples-12.4.1/Samples/5_Domain_Specific/nbody \
&& make && mv nbody /usr/local/bin
FROM nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04
# Install dependencies
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
freeglut3-dev libgl1-mesa-dev libglu1-mesa-dev curl
# Grab the sample code
RUN curl -sL https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v12.4.1.tar.gz -o /root/v12.4.1.tar.gz
# Unpack the tarball to /root
RUN tar -C /root -xzf /root/v12.4.1.tar.gz
# Build the nbody executable
RUN cd /root/cuda-samples-12.4.1/Samples/5_Domain_Specific/nbody \
&& make && mv nbody /usr/local/bin
Note
You can either download this file directly or copy and paste into your favorite text editor
You can then build a container named nbody from your build script as follows:
# You should already be on a compute node
# srun -p gpu -n 1 -G 1 --cpus-per-task 16 -t 03:00:00 --pty bash -l
# build the container
apptainer build nbody.sif Definition.nbody
# Look at image size
ls -lh nbody.sif
# Your Docker hub username
HUB_USER=
# build the container
docker build -t ${HUB_USER}/nbody -f Dockerfile.nbody .
# Look at image size
docker images | grep nbody
This is a relatively large image, so not only does it take up a lot of space on the filesystem, it would also take a while to upload to a remote registry for sharing or archiving. Let’s instead figure out how to make our final image more space efficient.
Making your container more space efficient¶
We can make this much smaller using the following techniques:
Use a multi-stage build - Building in one container and copying the built binaries to a runtime container
Only install runtime libraries in the final container
Using the base container instead of devel
Not installing *-devel packages from apt
Copy the finished binary instead of the full source repo
Bootstrap: docker
From: nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04
Stage: builder
%post
# Install dependencies
apt-get update \
&& apt-get install -y --no-install-recommends \
freeglut3-dev libgl1-mesa-dev libglu1-mesa-dev curl
# Grab the sample code
curl -sL https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v12.4.1.tar.gz -o /root/v12.4.1.tar.gz
# Unpack the tarball to /root
tar -C /root -xzf /root/v12.4.1.tar.gz
# Build the nbody executable
cd /root/cuda-samples-12.4.1/Samples/5_Domain_Specific/nbody \
&& make && mv nbody /usr/local/bin
# Change to the base image
Bootstrap: docker
From: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
%post
# Only install the runtime libraries
apt-get update \
&& apt-get install -y --no-install-recommends \
freeglut3 libgl1 libglu1 \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
# Copy the pre-built binary from our builder stage
%files from builder
/usr/local/bin/nbody /usr/local/bin/nbody
FROM nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04 AS builder
# Install runtime and build dependencies
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
freeglut3-dev libgl1-mesa-dev libglu1-mesa-dev
# Grab the sample code
ADD https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v12.4.1.tar.gz /root
# Unpack the tarball to /root
RUN tar -C /root -xzf /root/v12.4.1.tar.gz
# Build the nbody executable
RUN cd /root/cuda-samples-12.4.1/Samples/5_Domain_Specific/nbody \
&& make && mv nbody /usr/local/bin/
# Change to the base image
FROM nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
# Install the runtime dependencies (not *-dev)
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
freeglut3 libgl1 libglu1 \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
# Copy the pre-built binary from our builder stage
COPY --from=builder /usr/local/bin/nbody /usr/local/bin/nbody
Make sure to change the name or tag of the container when building it.
# build the container
apptainer build nbody-efficient.sif Definition.nbody-efficient
# Look at image size
ls -lh nbody-efficient.sif
# Your Docker hub username
HUB_USER=
# build the container
docker build -t ${HUB_USER}/nbody:efficient -f Dockerfile.nbody-efficient .
# Look at image size
docker images | grep nbody
# Push the container
docker push ${HUB_USER}/nbody:efficient
Once again, let’s look at the final size of the containers we built.
$ ls -lh nbody*sif
-rwxr-xr-x 1 greg.zynda greg.zynda.grp 147M Dec 3 20:34 nbody-efficient.sif
-rwxr-xr-x 1 greg.zynda greg.zynda.grp 4.2G Dec 3 08:51 nbody.sif
Comparing the two .sif images built by apptainer, you’ll notice that the efficient build is much smaller: 147MB vs 4.2GB!
Not only will this take up less space on your filesystem, but it’s also easier to archive with a publication.
Running the nbody sample benchmark¶
You should already be inside a job with an allocated GPU, so you can run the benchmark with the following:
# Check that GPU is still detected
apptainer exec --nv nbody-efficient.sif nvidia-smi
# Run nbody benchmark
apptainer exec --nv nbody-efficient.sif nbody -benchmark -numbodies=2000000
When your job is done, you should see output similar to the following:
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ada" with compute capability 8.9
> Compute 8.9 CUDA device: [NVIDIA L40S]
Warning: "number of bodies" specified 2000000 is not a multiple of 256.
Rounding up to the nearest multiple: 2000128.
2000128 bodies, total time for 10 iterations: 21772.984 ms
= 1837.374 billion interactions per second
= 36747.484 single-precision GFLOP/s at 20 flops per interaction
Note
These performance results will change based on the GPU type you were allocated.
Optional Exercises¶
Looking at the help text, try using a different number of GPUs (requires new job)
Try increasing the number of bodies in the simulation
Try using double precision
Best practices for building python-based containers¶
One of the most common issues I encounter when folks use containers with pre-existing python packages and libraries is accidentally replacing or overwriting those packages with conda or pip.
NVIDIA’s NGC containers ship patched versions of PyTorch and supporting libraries that shouldn’t be altered if you’re looking for optimal and verified performance.
This section will focus on how to install python packages in a way that will prevent changes to the pre-installed packages.
To illustrate this, try installing pytorch from the base pytorch:24.03-py3 container.
# Create an overlay directory
# - The base container is never changed, just the overlay
# - Overlay only works with a single image
mkdir -p ${APPTAINER_CACHEDIR}/pytorch_24.03_overlay
# Launch a shell in the cuda devel container with
# - fakeroot - appear to be root in the container
# - overlay - allow modifications to be written to overlay directory
apptainer shell --fakeroot \
--overlay ${APPTAINER_CACHEDIR}/pytorch_24.03_overlay \
pytorch_24.03-py3.sif
# Pip install pytorch inside the running container
pip install torch torchvision torchaudio
# Launch pytorch container on your host
docker run --rm -it nvcr.io/nvidia/pytorch:24.03-py3 bash -l
# Pip install pytorch inside the running container
pip install torch torchvision torchaudio
You’ll notice that installing these packages changes the torch package and installs a bunch of CUDA libraries even though both already exist. As you learned with our efficient builds, this greatly increases the size of the container layers while also potentially breaking any applications linked against these libraries and the “known working state”.
Let’s exit this container and create a fresh overlay.
# Be sure to exit your interactive container session
exit
Luckily, you can lock the versions by creating a package constraints file, which has the same format as a requirements file.
# Delete old overlay and recreate
rm -rf ${APPTAINER_CACHEDIR}/pytorch_24.03_overlay
mkdir -p ${APPTAINER_CACHEDIR}/pytorch_24.03_overlay
# Launch a shell in the cuda devel container with
# - fakeroot - appear to be root in the container
# - overlay - allow modifications to be written to overlay directory
apptainer shell --fakeroot \
--overlay ${APPTAINER_CACHEDIR}/pytorch_24.03_overlay \
pytorch_24.03-py3.sif
# Save all existing packages and versions to a text file
pip list | awk '{print$1"=="$2}' | tail -n +3 > /root/base_constraints.txt
# Install any new packages without upgrading existing packages
pip install -c /root/base_constraints.txt torch torchvision torchaudio
This install should now fail because the pre-built torchaudio wheels can’t be installed with the NVIDIA patched versions of torch.
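As an aside, if the awk/tail pipeline above feels fragile, recent versions of pip can emit the same `name==version` lines directly (a sketch; assumes pip supports the `--format=freeze` flag, which has been available for many years):

```shell
# Equivalent to: pip list | awk '{print$1"=="$2}' | tail -n +3
pip list --format=freeze > /tmp/base_constraints.txt
# Peek at the first few pinned packages
head -n 3 /tmp/base_constraints.txt
```

Either approach produces a valid constraints file for `pip install -c`.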
Note
If you actually want to install torchaudio into the Pytorch NGC container, take a look at this recipe.
Let’s practice using this constraint method by building a new container with the PyTorch Lightning framework starting FROM the pytorch:24.03-py3 container.
# Change to the base image
Bootstrap: docker
From: nvcr.io/nvidia/pytorch:24.03-py3
%post
# Save all existing packages and versions to a text file
pip list | awk '{print$1"=="$2}' | tail -n +3 > /root/base_constraints.txt
# Install any new packages without upgrading existing packages
pip install -c /root/base_constraints.txt lightning
FROM nvcr.io/nvidia/pytorch:24.03-py3
# Save all existing packages and versions to a text file
RUN pip list | awk '{print$1"=="$2}' | tail -n +3 > /root/base_constraints.txt
# Install any new packages without upgrading existing packages
RUN pip install -c /root/base_constraints.txt lightning
After downloading the corresponding build script, the container can be built with the following commands.
# build the container
apptainer build lightning.sif Definition.lightning
# Your Docker hub username
HUB_USER=
# build the container
docker build -t ${HUB_USER}/lightning:latest -f Dockerfile.lightning .
# Push the container
docker push ${HUB_USER}/lightning:latest
Unlike the torchaudio install, this went fine, and no existing packages changed.
If a package or its dependencies require a different version of PyTorch, you can either change the container version based on the NVIDIA support matrix to match the required version or determine if the package’s dependencies can be relaxed to match the package version in the container.
Optional Exercises¶
Try installing another python package
Developing python scripts inside a running container¶
Containers are meant to be static, reproducible checkpoints for your code that can always be started in the same way. This makes them ideal for porting software to different systems, reproducing results, archiving software, and more. However, since containers shouldn’t change once they’re built (because that would break reproducibility), developing software in them is not always intuitive.
If you try to incorporate all your code in the container and rebuilding as it evolves, this can get tedious - especially if you’re pushing and pulling these containers between a registry. Instead, I recommend making a container with most or all of your dependencies, and mounting your code into the container at runtime.
To explore these concepts, lets launch an interactive environment with our lightning container.
# Launch an interactive shell in the lightning container
# - Include GPU support with --nv
apptainer shell --nv lightning.sif
First, let’s open another terminal to the cluster. That could be another tmux pane or a whole new terminal connection from your local system. Once you have that open, let’s look around in the running container.
|   | Container shell | Second shell |
|---|---|---|
| Who are you running as? | `whoami` | `whoami` |
| Where are you running from? | `pwd` | `cd $MYDATA/containers` |
| Do files match? | `ls -lh` | `ls -lh` |
| Do changes propagate? | `echo "hello" > container.txt` | `cat container.txt` |
| What else is in the container by default? | `ls -lh $HOME; ls -lh /tmp` | `ls -lh $HOME; ls -lh /tmp` |
| What if you create a file somewhere else? | `touch /workspace/test` | `ls /workspace` |
| Should you be able to create files? | `ls -lhd /workspace` | `ls -lhd /workspace` |
Note
$MYDATA/containers was available in the container because the container mounts our current working directory at runtime.
If you need additional locations available in the container, you can make them available with (similar to Docker’s -v):
apptainer - bind (-B)
enroot - mount (-m)
Running external scripts¶
As you experienced when trying to create a test file in /workspace, the container has a read-only filesystem, even though the directory permissions suggest it is open for writing.
This means that you can’t make any changes without an overlay.
This might be tedious for prototyping, but it’s good if you’re sharing a container with colleagues on a project, or if you just want to make sure you can’t accidentally make changes.
First, download python_dev.tar.gz to your current working directory with wget (may need to use --no-check-certificate).
After downloading, unpack the tarball with tar.
# Unpack
tar -xzf python_dev.tar.gz
cd python_dev
ls *
The script self_contained.py doesn’t require any extra python modules other than PyTorch, which exists in the container, and can be run directly.
Try running it.
python self_contained.py
Not only can you run scripts from inside the container, you can interact with them outside the container too. If you still have your other terminal open, find these files and open the script in your favorite editor. Not only can you open the files, you can edit them too - all while being able to run them inside a container.
Developing packages from inside a container¶
If you’re developing a whole package that needs to be updated, you either have to rely on relative imports or install the package.
Relative imports often work, but may not depending on the complexity of the package.
In our example python code, there’s a pt_bench python module that gets loaded and used by bench.py.
# Prints where pt_bench was loaded from
python bench.py
# Change directories
cd ..
# Copy bench.py to break relative imports
cp python_dev/bench.py .
python bench.py
You can see that it’s easy to go wrong with relative imports, so I often recommend fully installing the package.
We already know that the container can’t be modified.
Luckily, python can install packages in a user directory, which defaults to $HOME/.local, using the --user flag.
# Install pt_bench using our constraint file
pip install -c /root/base_constraints.txt --user python_dev/
# Try running bench.py again
python bench.py
You should see that pt_bench is being loaded from $HOME/.local, which is where user packages are installed.
While this works, this location is shared by every container you run, which can lead to package collisions between containers.
I recommend launching the container with -c, which will not mount any external locations, and -B to mount the current working directory.
Since many things require a valid $HOME for writing files, apptainer creates a temporary filesystem (tmpfs) for /home.
You’ll be able to make changes, like installing a small package, and it won’t affect the container or bleed into other python environments.
First, lets clean our environment
# remove pt_bench
pip uninstall -y pt_bench
# Exit the container
exit
and then relaunch.
# Relaunch the lightning container shell
# - Include GPU support (--nv)
# - Exclude $HOME mount (-c)
# - Mount CWD (-B)
apptainer shell --nv -c -B $PWD:$PWD lightning.sif
# Make sure home is empty
ls $HOME
# Change to container directory
cd $MYDATA/containers
# Try running bench.py
python bench.py
# The pt_bench module wasn't found
# Do a local install in $HOME tmpfs
pip install -c /root/base_constraints.txt --user python_dev/
# Run bench.py
python bench.py
Lastly, if you’re making changes to the package, you can do an editable install with -e. This means that when the package is installed, it’s really just linked to its current location instead of copying files.
# remove pt_bench
pip uninstall -y pt_bench
# Editable install (-e)
pip install -c /root/base_constraints.txt --user -e python_dev/
# Make a change to a package file
echo "print('New Change')" >> python_dev/pt_bench/__init__.py
# Run bench, and see if change works
python bench.py
When you exit the container, make sure the pt_bench package no longer exists.
# Exit the container
exit
# Make sure pt_bench doesn't exist
find $HOME/.local/ | grep pt_bench
If you’re done exploring the container, feel free to exit the job in preparation for the next section.
# Exit the job
exit
Running multi-node containers¶
Multi-node, or distributed, computing is a model of computation that runs parallel tasks across multiple computers. While it’s easy to spawn threads and processes on a single system, distributed applications need to be launched across all nodes and told how to communicate with each other. This sounds difficult, but many frameworks make it accessible and give you near-linear speedups as more compute nodes are used.
Multi-node MPI NCCL Test¶
PyTorch containers from NGC ship with NCCL tests, which are useful for diagnosing MPI and bandwidth issues. If I’m ever questioning the performance of the compute fabric between GPUs, this is the first thing I run.
These can be run as single-line jobs using srun to handle the allocation and process spawning.
# Run on 2 GPUs of any type
# (-g) argument sets how many GPUs each process will use
srun -p gpu -N 2 -n 2 --gpus-per-node 1 --mpi=pmi2 apptainer exec --nv lightning.sif all_reduce_perf_mpi -b 1G -e 4G -f 2 -g 1
# Run on 4 H100 GPUs across 2 nodes
srun -p gpu --mem=32G -N 2 -n 2 --gpus-per-node h100:2 --mpi=pmi2 apptainer exec --nv lightning.sif all_reduce_perf_mpi -b 1G -e 4G -f 2 -g 2
Note
If you want to figure out how many GPUs are on a node and their type, you can run scontrol show node [node name] to see what resources are available on that node.
Multi-node PyTorch¶
Using wget, download pt_ddp_example.py, which is a simple script to demonstrate strong scaling using PyTorch DDP.
We’ll be skipping over PyTorch specifics to focus on how to launch multi-node PyTorch containers with Slurm.
Download the following sbatch script as well.
#!/bin/bash
#SBATCH --job-name=pt_ddp_example
#SBATCH --nodes=2 # Set number of nodes
#SBATCH --gpus-per-node=2 # Set number of GPUs per node
#SBATCH --mem=32GB # Set memory limits (consider --exclusive)
#SBATCH --tasks-per-node=1
#SBATCH --output=%x-%j.out
#SBATCH --cpus-per-task=8
#SBATCH --partition=gpu
#SBATCH --time=00:30:00
# Job debug info
echo "Launching on ${SLURM_JOB_NUM_NODES} nodes"
echo "Launching on: " ${SLURM_JOB_NODELIST}
echo "Launching ${SLURM_NTASKS_PER_NODE} tasks per node"
echo "Using ${SLURM_GPUS_ON_NODE} GPUs per task"
# Optional debug logging
#export LOGLEVEL=INFO
#export NCCL_DEBUG=INFO
##### No need to edit these #########################################################
# main address is detected by first name in nodelist
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
# port is chosen by jobID (prevents collisions if nodes are shared)
export MASTER_PORT=$(expr 10000 + $(echo -n ${SLURM_JOBID} | tail -c 4))
export WORLD_SIZE=$((${SLURM_NNODES} * ${SLURM_GPUS_ON_NODE}))
#####################################################################################
echo "Training on ${WORLD_SIZE} GPUs - ${MASTER_ADDR}:${MASTER_PORT}"
srun --mpi=pmi2 apptainer exec --nv lightning.sif torchrun \
--nnodes ${SLURM_JOB_NUM_NODES} \
--nproc_per_node ${SLURM_GPUS_ON_NODE} \
--rdzv_id $RANDOM \
--rdzv_backend c10d \
--rdzv_endpoint $MASTER_ADDR:${MASTER_PORT} \
pt_ddp_example.py
PyTorch needs the following variables set for multi-node runs:
MASTER_ADDR - Address of the main node
MASTER_PORT - Port to connect on
WORLD_SIZE - Total number of workers/GPUs
While srun launches the initial process on each node, it calls torchrun, which spawns additional processes based on the argument --nproc_per_node.
Think of torchrun as a helper script that handles a lot of the global and local rank logic.
Optional variables:
LOGLEVEL - pytorch log level
NCCL_DEBUG - NCCL log level
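If you want to sanity-check the MASTER_PORT and WORLD_SIZE arithmetic from the sbatch script outside of a job, here is a sketch with fake Slurm variables (the values are for illustration only):

```shell
# Fake Slurm variables for illustration (normally set by the scheduler)
SLURM_JOBID=1234567
SLURM_NNODES=2
SLURM_GPUS_ON_NODE=2
# Port = 10000 + last 4 digits of the job ID (prevents collisions on shared nodes)
MASTER_PORT=$(expr 10000 + $(echo -n ${SLURM_JOBID} | tail -c 4))
# Total workers = nodes * GPUs per node
WORLD_SIZE=$((${SLURM_NNODES} * ${SLURM_GPUS_ON_NODE}))
echo "port=${MASTER_PORT} world_size=${WORLD_SIZE}"   # port=14567 world_size=4
```

Because the port is derived from the job ID, two jobs sharing a node are unlikely to pick the same rendezvous port.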
Submit the script with sbatch, which will generate a .out file with a number corresponding to the job with all output text.
You’ll see that this runs a training job on 4 GPUs in total, distributed across 2 nodes.
If you increase the resources allocated by the SBATCH arguments, training will scale as well.
Multi-node Pytorch Lightning¶
This is the same task as the Multi-node PyTorch script, just adapted to PyTorch Lightning. You’ll notice that the code is cleaner because PyTorch Lightning does its best to simplify common training tasks, including multi-GPU and multi-node training.
Download both the training script ptl_ddp_example.py and the sbatch script below.
#!/bin/bash
#SBATCH --job-name=ptl_ddp_example
#SBATCH --nodes=2 # Set number of nodes
#SBATCH --tasks-per-node=2 # Set number of GPUs per node
#SBATCH --gpus-per-node=2 # - set to the same
#SBATCH --mem=16GB # Set memory limits (consider --exclusive)
#SBATCH --cpus-per-task=8
#SBATCH --output=%x-%j.out
#SBATCH --partition=gpu
#SBATCH --time=00:30:00
#export LOGLEVEL=INFO
#export NCCL_DEBUG=INFO
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=$(expr 10000 + $(echo -n ${SLURM_JOBID} | tail -c 4))
# Launches one container per node
# - Container spawns multiple processes
srun --mpi=pmi2 apptainer exec --nv lightning.sif \
bash -c "export NODE_RANK=\${SLURM_PROCID}; python \
ptl_ddp_example.py -N $SLURM_JOB_NUM_NODES \
-p ${SLURM_GPUS_ON_NODE}"
With PyTorch Lightning, both the MASTER_ADDR and MASTER_PORT need to be set, but also the NODE_RANK, which is the 0-based index of the node the process is on.
In this example, it’s being set in a bash shell, with the $ escaped so it’s substituted after being launched on each node by srun.
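You can see the effect of that escape without Slurm. In this sketch, the inline variable assignment stands in for the per-node value that srun’s spawned process would see:

```shell
# Pretend the submitting node sees rank 0...
export SLURM_PROCID=0
# ...but each launched process gets its own rank (3 here) from its environment.
# Escaped \$: expanded by the inner shell, so the per-process value wins
SLURM_PROCID=3 bash -c "echo rank=\${SLURM_PROCID}"   # rank=3
# Unescaped $: expanded by the submitting shell before launch
SLURM_PROCID=3 bash -c "echo rank=${SLURM_PROCID}"    # rank=0
```

This is exactly why NODE_RANK must be set with the escaped form inside the srun command line.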
When it’s running, you’ll see that Lightning has nice logging about the process pool at the start, and produces nice output during the training progress.
Optional Exercises¶
Try using scontrol show node [node name] to see what kinds of GPUs are available on your cluster.
Try using more GPUs to see how the number of steps run by each GPU scales.
Try comparing training and NCCL performance on different types of nodes.
Next Steps¶
Apptainer/Singularity is a well-known container runtime in the world of HPC, but NVIDIA recommends using enroot as a container runtime for several reasons. Enroot doesn’t have build functionality, but it can consume OCI images built by Docker or buildah and can be combined with pyxis for Slurm support. I also highly recommend checking out Docker for building containers due to the size of the community and support availability.
NVIDIA Containers:
Container workshops/tutorials: