The cluster has the ‘Singularity’ software installed. Singularity allows you to build, manage and run ‘containers’. The aim of this kind of software is to encapsulate all the packages, programs and anything else needed to perform a computation in an archive that is independent of the operating system and software of the computer that hosts the job.
Building a container is a relatively simple process. The following example is meant to show the power of the tool, but users are encouraged to refer to the ‘Singularity’ website for further examples and suggestions.
Running an NVIDIA-provided CUDA-based container is the recommended way to run CUDA-based jobs. The example shows how to create a container with the necessary software and how to run it on the cluster.
STEP 1: definition file [‘container.def’]
Bootstrap: docker
#
# List on: https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/supported-tags.md
#
From: nvcr.io/nvidia/tensorflow:23.04-tf2-py3
%post
# Preconfigure tzdata
echo 'export TZ=Europe/Rome' >> /environment
export TZ=Europe/Rome
ln -snf /usr/share/zoneinfo/$TZ /etc/localtime
echo $TZ > /etc/timezone
# Download the latest package lists and upgrades (important) and install gnuplot.
apt-get update -y && apt-get -y upgrade
apt-get install -y gnuplot-nox
The definition file starts with the first two lines, which declare a ‘docker container definition’ as the starting point. This pre-defined container is available at the location listed in the comment. It comes with a base Ubuntu 20.04 and a complete CUDA 12.1 installation, including the development files. These two lines alone can be enough for a general purpose job using Ubuntu and CUDA.
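As a sketch, such a minimal definition file, with nothing added on top of the pre-defined image, would contain just:
Bootstrap: docker
From: nvcr.io/nvidia/tensorflow:23.04-tf2-py3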
The ‘%post’ section lists the commands to be executed once the installation of the base pre-defined container
nvcr.io/nvidia/tensorflow:23.04-tf2-py3
ends. As in a shell script, lines starting with the character ‘#’ are comments, i.e. not real commands but notes inserted to clarify the command flow. In this file there are two comments starting with ‘#’, and the commands of the two corresponding sections:
- define the correct time zone as ‘Europe/Rome’; the reference to it is then added to the ‘/environment’ file;
- update the basic Ubuntu installation and install the package gnuplot-nox.
STEP 2: container build
With our definition file (named ‘container.def’) we can start building the container, a process similar to the installation of an operating system. The simplest version of the command is:
singularity build container.sif container.def
This creates the file ‘container.sif’, which you can think of as the hard disk of the container.
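As a quick sanity check you can run a command from the freshly built image; here ‘python --version’ is just an example, any program installed in the image will do:
singularity exec container.sif python --version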
Please be aware that building a container usually requires root privileges, which are not available on the cluster submit hosts (labsrv7 and labsrv8). You can:
- Create your container on your PC and then transfer the ‘container.sif’ file to the submit host of the cluster for execution (see the transfer example after this list);
- Test the creation of the container on your PC and then use the command
singularity-remote-build container.sif container.def container.log
to perform the build. The ‘container.log’ file stores the output of the process and will be available at the end of the build.
- Use the remote build service for Singularity (account required) available at the URL: https://cloud.sylabs.io/builder
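For the first option, a minimal sketch of the transfer could be the following, where the username and the destination directory are placeholders to be replaced with your own:
scp container.sif your_username@labsrv7:/path/to/your/work/dir/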
STEP 3: container execution
Usually, the command to run the container will be:
singularity exec --nv container.sif my-command
This version of the command executes ‘my-command’ in the context of the container. The ‘--nv’ switch instructs the Singularity software to make the local CUDA card available inside the container.
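For example, assuming an NVIDIA driver is installed on the execution host, you can check that the GPU is visible from inside the container with:
singularity exec --nv container.sif nvidia-smi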
Another way to execute commands in the context of our container is ‘shell’. This option is best suited for debugging the container, as it opens an interactive shell inside the container where you can perform tests or whatever else is needed. The execution command will be:
singularity shell --nv container.sif
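Inside the shell you can then run commands interactively; as a sketch, a possible check that TensorFlow sees the GPU (the ‘Singularity>’ prompt is shown only for clarity) could be:
Singularity> python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"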
STEP 4: run a SLURM job using a container
Running a job on the cluster that uses containers is just a matter of adapting the command line you use to run your program into a new command line that uses the container. This is an example of a SLURM jobfile that uses the container defined in STEP 1 and built in STEP 2:
#!/bin/bash
#SBATCH --job-name=CUDATEST
#SBATCH --error=cudatest.err
#SBATCH --output=cudatest.out
#SBATCH --partition=debug
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --time=00:01:00
#SBATCH --gres=gpu:1
### Run my command
singularity exec --nv ./container.sif python ./inquire.py
The file ‘inquire.py’ is a basic Python script that loads the device library of TensorFlow and reports the available CUDA cards. This is the file content:
# Python script: list the devices visible to TensorFlow
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
exit()
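The job can then be submitted as usual; assuming the jobfile was saved as ‘cudatest.sbatch’ (the name is just an example), the submission command is:
sbatch cudatest.sbatch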