GROMACS and MPS

This tutorial introduces MPS and the benefits of using it with GROMACS.

Objectives

  • Introduction to GROMACS benchmark

  • NVIDIA MPS and benefits

  • FOR and MPS

  • XARGS and MPS

  • Summary

  • Next Steps

Requirements

To run all example scripts, you’ll need access to the NGC CLI and a machine or cloud instance with at least two NVIDIA GPUs, such as the 2-GPU DGX Cloud instance used below.

Introduction to GROMACS benchmark

This tutorial will be using the GROMACS container built for DGX on NGC:

https://catalog.ngc.nvidia.com/orgs/hpc/containers/gromacs

GROMACS can also be built from source, but the pre-compiled container is convenient for this tutorial since it contains all necessary libraries.

First, let’s kick off an interactive 2-GPU job on DGX Cloud with the gromacs:2023.2 container:

ngc batch run --name "gromacs-2gpu" \
	--total-runtime 7200s --instance dgxa100.80g.2.norm \
	--commandline "apt update && DEBIAN_FRONTEND=noninteractive apt install -y xterm curl wget zip vim-nox less && sleep 2h" \
	--result /results \
	--image "nvcr.io/hpc/gromacs:2023.2"

You’ll notice that this job script installs the following packages at runtime:

  • xterm - used to resize terminal window

  • curl - used to download benchmark data

  • wget - used to download benchmark data

  • zip - used to unzip benchmark data

  • vim-nox - used for editing files

  • less - used for viewing files

The GROMACS container is meant to contain just enough packages to run GROMACS, so many useful interactive programs are excluded. This keeps the container small and great for running batch jobs, but sparse for interactive sessions. Luckily, the container is built from an Ubuntu base, so we can install these extra packages easily.

After your job is running, connect to it and load the AVX2 GROMACS environment:

# From your localhost
ngc batch exec <jobid>

# Fix terminal size from inside job
resize

# Load GROMACS environment
. /usr/local/gromacs/avx2_256/bin/GMXRC.bash

# cd to the local DGX filesystem
cd /raid

The benchMEM benchmark

This tutorial will be using input files from bench, a free GROMACS benchmark suite offering a variety of system sizes and run scripts. Specifically, we will work with the benchMEM benchmark: a protein-in-membrane system surrounded by water, comprising roughly 82,000 atoms and using a 2 fs time step. Its relatively fast runtime makes it well suited to interactive exploration.

Download and unpack it with the following:

# Download benchMEM benchmark from https://www.mpinat.mpg.de/grubmueller/bench

curl -sL https://www.mpinat.mpg.de/benchMEM -o benchMEM.zip
unzip benchMEM.zip

# should now have benchMEM.tpr file

Running the benchMEM benchmark

Now that the benchMEM benchmark is downloaded, we can run it as follows:

# Runs benchMEM with 1 process, 8 threads, for 10k steps on GPU 0
# Simulation output is written to /tmp/out (deleted)
# Log is written to 1gpu.log
gmx mdrun -ntmpi 1 -ntomp 8 -npme 0 -s benchMEM.tpr -cpt 1440 \
	-nsteps 10000 -v -noconfout -nb gpu -dlb yes -gpu_id 0 \
	-bonded gpu -e /tmp/out -g 1gpu.log

The final performance is printed in nanoseconds per day (ns/day). If you think of the simulation as a movie, ns/day is how much movie length you produce per day of wall-clock time, not how long the job runs, so a higher ns/day is better.
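To make the ns/day number concrete, here is a quick back-of-the-envelope conversion for this run (the 40 ns/day throughput is just an assumed example figure, not a measured result):

```shell
# Convert steps and timestep into simulated time, then use an assumed
# throughput of 40 ns/day to estimate wall-clock seconds for this run
awk 'BEGIN {
	steps = 10000; dt_fs = 2; ns_per_day = 40   # 40 ns/day is an example figure
	sim_ns = steps * dt_fs * 1e-6               # 20,000 fs = 0.02 ns simulated
	wall_s = sim_ns / ns_per_day * 86400        # fraction of a day, in seconds
	printf "simulated: %g ns, estimated walltime: %.0f s\n", sim_ns, wall_s
}'
```

So at that throughput, the 10,000-step benchmark covers only 0.02 ns of simulated time and finishes in well under a minute.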

Since the interactive job we submitted has two GPUs, which you can check with nvidia-smi, we can scale this run across both with:

# Runs benchMEM with 2 processes, 8 threads each (16 total), for 10k steps on GPUs 0 and 1
# Simulation output is written to /tmp/out (deleted)
# Log is written to 2gpu.log
gmx mdrun -ntmpi 2 -ntomp 8 -npme 0 -s benchMEM.tpr -cpt 1440 \
	-nsteps 10000 -v -noconfout -nb gpu -dlb yes -gpu_id 01 \
	-bonded gpu -e /tmp/out -g 2gpu.log

Because this is a small simulation, you should notice that running benchMEM on two GPUs actually results in a lower ns/day: the communication overhead between GPUs outweighs the extra compute.

benchMEM and GPU utilization

While these tasks are running, you can look at the telemetry for this BCP job to get an idea of how much of the GPU is being utilized. If you’re not using BCP, you can also use nvidia-smi as follows:

# Kill child processes on exit
trap 'pkill -P $$' SIGINT SIGTERM EXIT

# Start capturing utilization to CSV file in the background
nvidia-smi -i 0 --query-gpu=timestamp,name,utilization.gpu,memory.used --format=csv -l 5 > utilization.csv &

# Runs benchMEM with 1 process, 8 threads, for 10k steps on GPU 0
# Simulation output is written to /tmp/out (deleted)
# Log is written to 1gpu.log
gmx mdrun -ntmpi 1 -ntomp 8 -npme 0 -s benchMEM.tpr -cpt 1440 \
	-nsteps 10000 -v -noconfout -nb gpu -dlb yes -gpu_id 0 \
	-bonded gpu -e /tmp/out -g monitor_1gpu.log

This script starts by querying the GPU utilization and memory usage every 5 seconds in the background and writing it out to the file utilization.csv; the trap line ensures the background nvidia-smi process is killed when the script exits.
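Once the run finishes, the captured CSV can be summarized. The helper below is an illustrative sketch (avg_util is not one of the tutorial's scripts) that assumes the column layout produced by the nvidia-smi query above:

```shell
# Average GPU utilization from a CSV captured by the nvidia-smi loop above
# (assumed columns: timestamp, name, utilization.gpu [%], memory.used [MiB])
avg_util() {
	awk -F', ' 'NR > 1 { gsub(/ %/, "", $3); sum += $3; n++ }
		END { if (n) printf "average utilization: %.1f%% over %d samples\n", sum/n, n }' "$1"
}

# usage: avg_util utilization.csv
```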

Optional Exercises

  • What happens to performance if you increase the number of steps?

  • What happens to performance if you run on the CPU?

  • Monitor GPU utilization when using two GPUs

NVIDIA MPS and benefits

NVIDIA Multi-Process Service (MPS) is a feature that allows multiple CUDA applications to share the same GPU, improving system utilization and reducing the overhead of context switching between applications. By allowing kernel and memory-copy operations from different processes to overlap on the GPU, MPS enables efficient sharing of GPU resources, making it ideal for scenarios where multiple applications need to access the GPU simultaneously.

If you remember, GPU utilization during the benchMEM run wasn’t very good. Because the benchMEM system is small (82k atoms) relative to the capability of modern GPU infrastructure, the A100 GPUs used in this tutorial were far from fully utilized. This means that if you ran benchMEM simulations one at a time on a GPU, the hardware would mostly sit idle.

Usually, utilization is improved in the code at the application level by a developer, and different (larger) inputs can also improve hardware utilization. If you’re studying simulations of small molecules, though, hardware utilization can instead be improved by running multiple simulations on the same GPU. Multiple applications can share a GPU even without MPS, but MPS improves performance by making context switching more efficient.

To illustrate this with an example, let’s run four benchMEM simulations at a time on GPU 0, with and without MPS, using the following code:

# Variable of shared arguments
GMX_ARGS="-ntmpi 1 -ntomp 8 -npme 0 -s benchMEM.tpr -cpt 1440 -nsteps 10000 -v -noconfout -nb gpu -dlb yes -bonded gpu"

# without MPS
# unique log and output files
# processes are backgrounded and stdout sent to /dev/null
gmx mdrun ${GMX_ARGS} -gpu_id 0 -e /tmp/out1 -g 1gpu_sim1-4.log &> /dev/null &
gmx mdrun ${GMX_ARGS} -gpu_id 0 -e /tmp/out2 -g 1gpu_sim2-4.log &> /dev/null &
gmx mdrun ${GMX_ARGS} -gpu_id 0 -e /tmp/out3 -g 1gpu_sim3-4.log &> /dev/null &
gmx mdrun ${GMX_ARGS} -gpu_id 0 -e /tmp/out4 -g 1gpu_sim4-4.log &> /dev/null &
# wait for processes to complete
wait

# with MPS
nvidia-cuda-mps-control -d
# unique log and output files
# processes are backgrounded and stdout sent to /dev/null
gmx mdrun ${GMX_ARGS} -gpu_id 0 -e /tmp/out1 -g 1gpu_sim1-4_mps.log &> /dev/null &
gmx mdrun ${GMX_ARGS} -gpu_id 0 -e /tmp/out2 -g 1gpu_sim2-4_mps.log &> /dev/null &
gmx mdrun ${GMX_ARGS} -gpu_id 0 -e /tmp/out3 -g 1gpu_sim3-4_mps.log &> /dev/null &
gmx mdrun ${GMX_ARGS} -gpu_id 0 -e /tmp/out4 -g 1gpu_sim4-4_mps.log &> /dev/null &
# wait for processes to complete
wait
# stop MPS server
echo quit | nvidia-cuda-mps-control

To make it easier to compare throughput performance from the generated log files, use the calc_throughput.sh script as follows:

# Calculate throughput WITHOUT MPS
$ bash calc_throughput.sh 1gpu_sim*-4.log

1gpu_sim1-4.log 1gpu_sim2-4.log 1gpu_sim3-4.log 1gpu_sim4-4.log
1gpu_sim1-4.log:Performance:    41.402
1gpu_sim2-4.log:Performance:    39.433
1gpu_sim3-4.log:Performance:    51.208
1gpu_sim4-4.log:Performance:    41.574
Total throughput: 173.617

# Calculate throughput WITH MPS
$ bash calc_throughput.sh 1gpu_sim*-4_mps.log

1gpu_sim1-4_mps.log 1gpu_sim2-4_mps.log 1gpu_sim3-4_mps.log 1gpu_sim4-4_mps.log
1gpu_sim1-4_mps.log:Performance:        45.262
1gpu_sim2-4_mps.log:Performance:        44.495
1gpu_sim3-4_mps.log:Performance:        53.269
1gpu_sim4-4_mps.log:Performance:        44.811
Total throughput: 187.837

You’ll notice total throughput (ns/day) is higher when using MPS to share the GPU. In the next section, we’ll figure out what the maximum throughput can be on the A100.
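The calc_throughput.sh script itself is not listed in this tutorial. A minimal sketch that produces output in roughly this shape, by grepping each log’s Performance: line and summing the ns/day column, could look like this (an assumption about the script’s contents, not its actual source):

```shell
# Hypothetical sketch of calc_throughput.sh: list the log files, echo each
# file's "Performance:" line, and sum the ns/day column across all logs
calc_throughput() {
	echo "$@"
	grep "^Performance:" "$@"
	grep "^Performance:" "$@" |
		awk '{ total += $2 } END { printf "Total throughput: %g\n", total }'
}

# usage: calc_throughput 1gpu_sim*-4.log
```

With multiple files, grep prefixes each match with its filename, so the ns/day value lands in awk's second field either way.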

Optional Exercises

  • How is GPU utilization when running these concurrent simulations?

FOR and MPS

In the previous section, we ran each GROMACS process individually as a separate line in the bash script. This can also be done in a for loop if you’re looping over files or a range:

#!/bin/bash

# Number of cores in job
NCORES=22
# Number of tasks to run concurrently
NP=${1:-8}

# Variable of shared arguments
GMX_ARGS="-ntmpi 1 -ntomp $(( $NCORES / $NP )) -npme 0 -s benchMEM.tpr -cpt 1440 -nsteps 10000 -v -noconfout -nb gpu -dlb yes -bonded gpu"

# without MPS
echo "Running ${NP} simulations concurrently without MPS"
# Spawn NP processes and wait for them to complete
for i in $(seq 1 $NP); do
	gmx mdrun ${GMX_ARGS} -gpu_id 0 -e /tmp/out${i} -g 1gpu_sim${i}-${NP}.log &> /dev/null &
done
# wait for processes to complete
wait

# with MPS
echo "Running ${NP} simulations concurrently with MPS"
nvidia-cuda-mps-control -d
# Spawn NP processes and wait for them to complete
for i in $(seq 1 $NP); do
	gmx mdrun ${GMX_ARGS} -gpu_id 0 -e /tmp/out${i} -g 1gpu_sim${i}-${NP}_mps.log &> /dev/null &
done
# wait for processes to complete
wait
# stop MPS server
echo quit | nvidia-cuda-mps-control

By default, this script (saved here as Nsim_1gpu.sh) will run 8 GROMACS tasks concurrently, but that count can also be controlled at runtime with an integer argument. For example, you can run 5 tasks at a time with:

bash Nsim_1gpu.sh 5

Using the calc_throughput.sh script to determine the total throughput, take some time to identify the optimal number of processes to run at a time.

Optional Exercises

  • Try visualizing the throughput results (N x throughput) in your favorite plotting program (excel counts)

XARGS and MPS

In the previous example, we used FOR loops to launch GROMACS processes in the background and waited for them to complete. This worked great since we were just experimenting with a small, fixed number of tasks, but it gives no easy way to feed a long queue of tasks through a limited number of concurrent slots.

The xargs program can be thought of as a “map” operation, where a list of inputs is given and a function is applied to each one. This is often used to process file contents line by line, but it can also be used with a FOR loop in a script if the loop prints out the command.

# Prepend a value to every line in a list
for i in $(seq 1 8); do
	echo $i;
done | xargs -L 1 echo prepend
# prints "prepend 1" through "prepend 8", one per line

xargs can also call a function in a subshell:

# Can also run a function
function times2 {
	echo "$1 * 2 =" $(( $1 * 2 ))
	# Simulates processing time
	sleep 1
}
# Export function
export -f times2
time for i in $(seq 1 8); do
	echo $i;
# -I takes a whole line (-L 1 implied) and substitutes it as the matching symbol
done | xargs -I {} bash -c 'times2 {}'

One of the cooler features is that xargs can also run tasks in parallel with the -P argument:

time for i in $(seq 1 8); do
	echo $i;
# -I takes a whole line (-L 1 implied) and substitutes it as the matching symbol
# Can also run in parallel
done | xargs -P 4 -I {} bash -c 'times2 {}'

You should notice that this runs roughly 4x faster than the sequential version: eight 1-second tasks take about 8 seconds one at a time, but only about 2 seconds when 4 run concurrently.

XARGS on 1 GPU

Now that we know how to run tasks in parallel with a function, we can apply that pattern to our GROMACS benchmark. In the snippet below, you’ll notice that we create the variables NP for the number of tasks to run at a time (used by the -P argument) and NT for the number of tasks to generate. We then create the run_gmx function to run the benchmark, which takes one argument, TID, the task ID. First, since the function runs in a subshell, we need to reload the GROMACS environment. Then we can run the GROMACS benchmark along with some helper text to show what’s running.

#!/bin/bash

# Number of cores in job
export NCORES=22
# Number of tasks to run concurrently
export NP=${1:-8}
# Number of tasks in queue
export NT=8

# Variable of shared arguments
export GMX_ARGS="-ntmpi 1 -ntomp $(( $NCORES / $NP )) -npme 0 -s benchMEM.tpr -cpt 1440 -nsteps 10000 -v -noconfout -nb gpu -dlb yes -bonded gpu"

# Define and export function for running GMX
function run_gmx {
	TID=$1
	# Load GMX environment
	. /usr/local/gromacs/avx2_256/bin/GMXRC.bash
	echo starting task ${TID} in slot ${SLOT}
	# Task is not backgrounded
	gmx mdrun ${GMX_ARGS} -gpu_id 0 -e /tmp/out${TID} -g 1gpu_sim${TID}-${NP}_xargs.log &> /dev/null
	echo finished task ${TID} in slot ${SLOT}
}
export -f run_gmx

# with MPS
echo "Running ${NP} simulations concurrently with MPS"
nvidia-cuda-mps-control -d
# Queue NT tasks through NP concurrent slots and wait for them to complete
for i in $(seq 1 $NT); do
	echo run_gmx $i
done | xargs -P ${NP} --process-slot-var=SLOT -I {} bash -c '{}'
# stop MPS server
echo quit | nvidia-cuda-mps-control

Depending on the tasks you need to process, you may need to add additional arguments to this function to accept parameters or files you want to explore.
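For example, the same pattern can map file names instead of numeric task IDs onto a worker function. The sketch below is illustrative only: process_file and the .dat file names are hypothetical stand-ins for your own worker and inputs:

```shell
# Hypothetical worker that receives a filename instead of a task ID
process_file() {
	echo "slot ${SLOT}: processing $1"
}
export -f process_file

# Feed a list of files through 2 concurrent slots
printf '%s\n' a.dat b.dat c.dat d.dat |
	xargs -P 2 --process-slot-var=SLOT -I {} bash -c 'process_file {}'
```

In the GROMACS case you would echo the file paths from a loop or `ls` instead of the printf, and have the worker pass its argument to mdrun via -s.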

Optional Exercises

  • If you increase the number of tasks, does this solution scale nicely?

  • If you have time, try looking at the utilization

XARGS on multiple GPUs

To spread tasks across multiple GPUs, we can also do some math to calculate the GPU index from the SLOT index that xargs assigns.

#!/bin/bash

# Number of cores in job
export NCORES=22
# Number of tasks to run concurrently per GPU
export NP=${1:-8}
# Number of tasks in queue
export NT=16
# Number of GPUs
export NG=2

# Calculate total number of concurrent tasks
export TT=$(( $NP * $NG ))
# Variable of shared arguments
export GMX_ARGS="-ntmpi 1 -ntomp $(( $NCORES / $NP )) -npme 0 -s benchMEM.tpr -cpt 1440 -nsteps 10000 -v -noconfout -nb gpu -dlb yes -bonded gpu"

# Define and export function for running GMX
function run_gmx {
	TID=$1
	GPU=$(( $SLOT % $NG ))
	# Load GMX environment
	. /usr/local/gromacs/avx2_256/bin/GMXRC.bash
	echo starting task ${TID} in slot ${SLOT} on GPU ${GPU}
	# Task is not backgrounded
	gmx mdrun ${GMX_ARGS} -gpu_id ${GPU} -e /tmp/out${TID} -g ${NG}gpu_sim${TID}-${TT}_xargs.log &> /dev/null
	echo finished task ${TID} in slot ${SLOT}
}
export -f run_gmx

# with MPS
echo "Running ${TT} simulations concurrently with MPS"
nvidia-cuda-mps-control -d
# Queue NT tasks through TT concurrent slots and wait for them to complete
for i in $(seq 1 $NT); do
	echo run_gmx $i
done | xargs -P ${TT} --process-slot-var=SLOT -I {} bash -c '{}'
# stop MPS server
echo quit | nvidia-cuda-mps-control
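To see the SLOT-to-GPU mapping in isolation, without running GROMACS, you can echo the computed GPU index from each slot:

```shell
# Demonstrate the SLOT -> GPU mapping by itself: with NG=2, the modulo
# spreads the xargs slots evenly across both GPU indices
export NG=2
for i in $(seq 1 8); do
	echo "task $i"
done | xargs -P 4 --process-slot-var=SLOT -I {} \
	bash -c 'echo "{} -> slot ${SLOT} -> GPU $(( SLOT % NG ))"'
```

Each of the 4 slots maps to GPU 0 or GPU 1, so the 8 tasks split evenly across both devices.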

If we restricted each task to a single CPU core, this would be about double the single-GPU performance. As an exercise, try changing both scripts to allocate a single CPU core per task to see if this is true.

Optional Exercises

  • If you increase the number of tasks, does the xargs solution scale nicely?

  • Try looking at utilization while running xargs to make sure both GPUs are actually being used.

  • Try changing both xargs scripts to allocate a single CPU per task to see if 2-GPU throughput is 2x that of 1-GPU.

Summary

After completing this tutorial, you should have learned the benefit of MPS when running multiple applications and how to efficiently process many tasks across multiple GPUs.

Next Steps

NVIDIA Developer Blog Posts:

NVIDIA NGC Container:

Documentation: