The cluster provides the ‘Singularity‘ software. Singularity allows you to build, manage and run ‘containers‘. The aim of this kind of software is to encapsulate all the packages, programs and whatever else is needed to perform a computation in a single archive, independent of the operating system and software of the computer that hosts the job.

Building a container is a relatively simple process. The following example is meant to show the power of the tool, but users are encouraged to refer to the ‘Singularity‘ website for further examples and suggestions.

Running a CUDA based container requires that, at runtime, the container uses the correct CUDA driver. The example shows how to create a container with the necessary mount point and environment variables for the CUDA driver directory and how to run it on the cluster.
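As a quick sanity check before starting (this step is optional and just a suggestion), you can verify that Singularity is available on the submission host and note its version, which is useful when consulting the online documentation:

singularity --version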

STEP 1: definition file [‘container.def’]

Bootstrap: docker
From: nvidia/cuda:11.0.3-cudnn8-devel-ubuntu20.04

%post

    # NVIDIA: create directory and add nvidia driver paths to the environment variables
    mkdir /nvdriver
    echo "\n #Nvidia driver paths \n"                          >> /environment
    echo 'export PATH="/nvdriver:$PATH"'                       >> /environment
    echo 'export LD_LIBRARY_PATH="/nvdriver:$LD_LIBRARY_PATH"' >> /environment

    # NVIDIA: define CUDA paths
    echo "\n #Cuda paths \n" >> /environment
    echo 'export CPATH="/usr/local/cuda/include:$CPATH"'                   >> /environment
    echo 'export PATH="/usr/local/cuda/bin:$PATH"'                         >> /environment
    echo 'export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"' >> /environment
    echo 'export CUDA_HOME="/usr/local/cuda"'                              >> /environment

    # MY INSTALLATIONS:
    # Downloads the latest package lists (important).
    apt-get update -y

    # python3-tk is required by matplotlib.
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        python3 \
        python3-tk \
        python3-dev \
        python3-pip \
        python3-setuptools
    # Reduce the size of the image by deleting the package lists we downloaded,
    # which are useless now.
    rm -rf /var/lib/apt/lists/*
    # Install Python modules.
    pip3 install torch matplotlib tensorflow numpy==1.19.2

The definition file starts with two lines declaring a ‘docker container definition’ as the starting point. This pre-defined container is available in the ‘docker/nvidia/cuda repository‘. As the name suggests, this container comes with a base Ubuntu 20.04, CUDA 11.0 and CuDNN8 installation, complete with the development files. These two lines alone can be enough for a general purpose job using Ubuntu and CUDA.

After the ‘%post‘ line come the commands to be issued when the installation of the base pre-defined container (nvidia/cuda:11.0.3-cudnn8-devel-ubuntu20.04) ends. As in a shell script, the lines starting with the character ‘#’ are comments, i.e. not real commands but notes inserted to clarify the command flow. In this file we find two comments starting with ‘# NVIDIA:...‘ . The commands of these two sections create the ‘/nvdriver‘ directory, which will be used as the mount point for the host CUDA driver, and append the driver and CUDA paths to the environment variables of the container.

The next section starts after the comment ‘# MY INSTALLATIONS‘. Here we find the usual commands needed to install software on a Debian/Ubuntu Linux system, in this case a standard suite for AI research using Python, PyTorch and TensorFlow. You are free to modify this section to best fit your needs, as sketched below.
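For instance, a hypothetical variant of the ‘# MY INSTALLATIONS‘ section could look like the following; the package names are only illustrative, replace them with what your project actually needs:

    # MY INSTALLATIONS (illustrative variant): the package names below are
    # only examples, pick the ones your project actually requires.
    apt-get update -y
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        python3 \
        python3-pip \
        git
    # Remove the downloaded package lists to keep the image small.
    rm -rf /var/lib/apt/lists/*
    # Install the Python modules used by the project.
    pip3 install scipy pandas scikit-learn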

STEP 2: container build

With our definition file (named ‘container.def‘) we can start building the container, a process similar to the installation of an operating system. The simplest version of the command is:

singularity build container.sif container.def

This creates the file ‘container.sif‘, which you can think of as the hard disk of the container.

Please be aware that building a container usually requires root privileges, which are not available on the cluster submission hosts (labsrv7 and labsrv8). You can instead run:

singularity-remote-build container.sif container.def container.log

to perform the build. The ‘container.log’ file will store the output of the process.
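Once the build has finished, a quick way to check the result (assuming the build completed without errors) is to inspect the image metadata and the definition file stored inside it:

singularity inspect container.sif
singularity inspect --deffile container.sif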

STEP 3: container execution

The usual command to run the container is:

singularity exec container.sif my-command

This version of the command will execute ‘my-command‘ in the context of the container.

In our case we need to mount (connect) the directory ‘/nvdriver‘ in the context of the container before executing any CUDA related command. For user convenience, the cluster provides the various versions of the drivers in the directory ‘/conf/shared-software/Singularity/CUDA/driver/HOSTNAME‘, where ‘HOSTNAME‘ is the name of the particular cluster host where the execution takes place. Our execution command will be:

singularity exec -B /conf/shared-software/Singularity/CUDA/driver/`hostname`/:/nvdriver \
    container.sif my-command
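As a quick test of the driver bind (to be run on a node that actually has a GPU, and assuming the container was built from the definition file of STEP 1, which installs PyTorch), you can ask PyTorch whether it sees a CUDA device:

singularity exec -B /conf/shared-software/Singularity/CUDA/driver/`hostname`/:/nvdriver \
    container.sif python3 -c "import torch; print(torch.cuda.is_available())"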

Another way to execute a command in the context of our container is ‘shell’. This option is best suited to debugging the container, as it allows you to run a shell in the context of the container and perform tests or whatever else is needed. Our execution command will be:

singularity shell -B /conf/shared-software/Singularity/CUDA/driver/`hostname`/:/nvdriver \
    container.sif

It is worth noting that the current directory is automatically connected (mounted) as the current directory in the context of the running container. This makes every file and program stored there, outside the container itself, available to the container.
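If your data or programs live outside the current directory, you can bind additional host directories with further ‘-B‘ options; the path ‘/path/to/my/dataset‘ below is only a hypothetical example:

singularity exec -B /conf/shared-software/Singularity/CUDA/driver/`hostname`/:/nvdriver \
    -B /path/to/my/dataset:/data \
    container.sif my-command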

STEP 4: run a SLURM job using a container

Running a job in the cluster that uses containers is just a matter of adapting the command line you use to run your program into a new command line that uses the container. This is an example of a SLURM jobfile that uses a container like the one defined in STEP 1 and created in STEP 2 (here, a pre-built image stored in the shared software directory):

#!/bin/bash
#SBATCH --job-name=CUDATEST
#SBATCH --error=cudatest.err
#SBATCH --output=cudatest.out
#SBATCH --partition=allgroups
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --time=00:01:00
#SBATCH --gres=gpu:Tesla_K20m:1
### Run my command
SHAREDIR="/conf/shared-software/Singularity/CUDA"
singularity exec \
         -B ${SHAREDIR}/driver/`hostname`:/nvdriver \
         ${SHAREDIR}/containers/cuda-11.2-ubuntu2004-tensorflow-pytorch/container.sif \
         ./deviceQuery

Our program is ‘deviceQuery‘ (a standard example available in any CUDA distribution), which is stored in the current directory. The definition of the environment variable ‘SHAREDIR’ is useful for improving the readability of the script.
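Assuming the jobfile above is saved as, say, ‘cudatest.slurm‘ (the name is arbitrary), it can be submitted and monitored with the usual SLURM commands; the output and error files match the ‘#SBATCH‘ directives in the script:

sbatch cudatest.slurm
squeue -u $USER
cat cudatest.out cudatest.err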