The ‘Singularity’ software is installed on the cluster. Singularity allows you to build, manage and run ‘containers’. The aim of this kind of software is to encapsulate all the packages, programs and anything else needed to perform a computation in a single archive that is independent from the operating system and software of the computer that hosts the job.

Building a container is a relatively simple process. The following example aims to show the power of the tool, but users are encouraged to refer to the ‘Singularity’ website for further examples and suggestions.

Running an NVIDIA-provided CUDA-based container is the recommended way to run CUDA-based jobs. The example shows how to create a container with the necessary software and how to run it on the cluster.

STEP 1: definition file [‘container.def’]

Bootstrap: docker

#
# List on: https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/supported-tags.md
#
From: nvcr.io/nvidia/tensorflow:23.04-tf2-py3

%post

# Preconfigure tzdata
echo 'export TZ=Europe/Rome' >> /environment
export TZ=Europe/Rome
ln -snf /usr/share/zoneinfo/$TZ /etc/localtime
echo $TZ > /etc/timezone

# Update the package lists, apply the latest upgrades (important) and install gnuplot.
apt-get update -y && apt-get -y upgrade
apt-get install -y gnuplot-nox


The definition file starts with two lines declaring a ‘docker container definition’ as the starting point. This pre-defined container is available at the location described in the comment. It comes with a base Ubuntu 20.04 and CUDA 12.1 installation, complete with the development files. These two lines alone can be enough for a general purpose job using Ubuntu and CUDA.
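For reference, a minimal definition file using the same base image would contain just:

Bootstrap: docker
From: nvcr.io/nvidia/tensorflow:23.04-tf2-py3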

The lines after ‘%post’ list the commands to be issued once the installation of the base pre-defined container

nvcr.io/nvidia/tensorflow:23.04-tf2-py3

ends. As in a shell script, the lines starting with the character ‘#’ are comments, i.e. not real commands but notes inserted to clarify the command flow. In this file we find two commented sections: the first preconfigures ‘tzdata’ with the desired time zone, the second updates the package lists, upgrades the installed packages and installs ‘gnuplot’.

STEP 2: container build

With our definition file (named ‘container.def’) we can start building the container, a process similar to the installation of an operating system. The simplest version of the command is:

singularity build container.sif container.def

This creates the file ‘container.sif’, which you can think of as the hard disk of the container.
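As an optional sanity check, you can print the metadata stored in the image and, with recent Singularity 3.x versions, the definition file embedded in it:

# Show the image metadata (labels, Singularity version, build date, ...)
singularity inspect container.sif
# Show the definition file used to build the image
singularity inspect --deffile container.sif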

Please be aware that building a container usually requires root credentials, which are unavailable on the cluster submitting hosts (labsrv7 and labsrv8). You can instead use the command:

singularity-remote-build container.sif container.def container.log

to perform the build. The ‘container.log’ file stores the output of the build process and will be available when the process ends.

STEP 3: container execution

Usually, to run the container the command is:

singularity exec --nv container.sif my-command

This version of the command executes ‘my-command’ in the context of the container. The ‘--nv’ switch instructs Singularity to make the host NVIDIA driver and CUDA libraries available inside the container, so that the local CUDA card can be used.
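For example, a quick way to verify that the GPU is visible from inside the container built in STEP 2 is to run the NVIDIA driver utility (made available inside the container by the ‘--nv’ switch):

singularity exec --nv container.sif nvidia-smi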

Another way to execute a command in the context of our container is ‘shell’. This option is best suited for debugging the container, as it opens a shell inside the container where you can run tests or whatever else is needed. The command is:

singularity shell --nv container.sif
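As a sketch, a debugging session could look like the following (the ‘Singularity>’ prompt belongs to the container shell; the Python check assumes the TensorFlow image of STEP 1):

singularity shell --nv container.sif
Singularity> python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
Singularity> exit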

STEP 4: run a SLURM job using a container

Running a job on the cluster that uses a container is just a matter of adapting the command line you normally use to run your program into a new command line that uses the container. This is an example of a SLURM jobfile that uses the container defined in STEP 1 and built in STEP 2:

#!/bin/bash
#SBATCH --job-name=CUDATEST
#SBATCH --error=cudatest.err
#SBATCH --output=cudatest.out
#SBATCH --partition=debug
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --time=00:01:00
#SBATCH --gres=gpu:1
### Run my command        
singularity exec --nv ./container.sif python ./inquire.py

The file ‘inquire.py’ is a basic Python script that loads the device library of TensorFlow and reports the available CUDA cards. This is the file content:

# inquire.py: report the compute devices (CPU and GPU) visible to TensorFlow
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
exit()
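Assuming the jobfile above is saved, for example, as ‘cudatest.job’ (the name is arbitrary), the job is submitted as usual with:

sbatch cudatest.job

The list of detected devices will then appear in ‘cudatest.out’, and any error messages in ‘cudatest.err’, as configured by the ‘#SBATCH’ directives.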