Below you can find a series of examples that show the simplest uses of the cluster. At the end of the page a full sequence of commands for a cluster job is given.
Each example is compressed in a ZIP archive. You can extract it in your labsrv7/labsrv8 home directory, and it will create a directory Examples/XX-Name, where:
- XX is the example number
- Name is the example name
Some of the examples need to be compiled (C, Java, etc.); for those a Makefile is provided.
In a terminal, change to the example directory and use the make command. The job definition file is set to run from the compilation directory with the sbatch command. The job definition file has the suffix ‘.slurm’.
Example
Connect to labsrv7 and unzip/compile/run an example with the following commands (for the 01-Sieve example):
[labsrv7]:~$ unzip 01-Sieve.zip
Archive:  Examples/Archives/01-Sieve.zip
   creating: Examples/01-Sieve/
  inflating: Examples/01-Sieve/Makefile
  inflating: Examples/01-Sieve/sieve.c
  inflating: Examples/01-Sieve/sieve.slurm
[labsrv7]:~$ cd Examples/01-Sieve
[labsrv7]:~/Examples/01-Sieve$ make
cc -Wall -g -c sieve.c
gcc -o sieve sieve.o
[labsrv7]:~/Examples/01-Sieve$ sbatch sieve.slurm
Submitted batch job 132
[labsrv7]:~/Examples/01-Sieve$
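After the submission you can follow the job with the standard SLURM commands; the output file name below assumes the default slurm-<jobid>.out naming, which the example's sieve.slurm file may override:
[labsrv7]:~/Examples/01-Sieve$ squeue -u $USER      # list your pending/running jobs
[labsrv7]:~/Examples/01-Sieve$ cat slurm-132.out    # default output file of job 132 (if --output is not set in sieve.slurm)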
- Hello World
- Archive: 00-Hello.zip
- Execution: a simple ‘echo’ command in the job file
- Language: bash shell
- Description: a very minimal program (a sketch of a comparable job file is shown after this list)
- A simple compiled program (C/C++/Fortran)
- Archive: 01-Sieve.zip
- Execution: one process on a generic machine of the cluster
- Language: C
- Description: The program is a simple implementation of a Sieve of Eratosthenes that prints all prime numbers lower than an argument given in input. The job file is written to search all prime numbers lower than 100 million.
- GPU shared use (preferred)
- Archive: 02-Singularity-GPU-shared.zip
- Execution: one process (the ‘deviceQuery’ example of the CUDA distribution) started in a singularity container
- Language: Python or C
- Description: The job leverages container technology to run a simple pre-compiled program (deviceQuery) or a Python script (container.test.py). You can easily modify the ‘mps.slurm’ file to choose the program. In this case the GPU/host selection is performed via the line ‘#SBATCH --constraint=debug02’. The line ‘#SBATCH --gres=mps:30’ declares that 30% of the GPU is needed. With this setup more than one program can run by sharing the GPU.
- GPU reserved use
- Archive: 03-Singularity-GPU-reserved.zip
- Execution: one python program (‘gpu.test.py’) started in a singularity container
- Language: Python
- Description: The job leverages container technology to run a simple Python script. In this case the GPU selection is performed via the line ‘#SBATCH --gres=gpu:1’, which asks for a fully reserved GPU.
- MATLAB program
- Archive: 04-Matlab.zip
- Execution: a MATLAB run on an example script
- Language: MATLAB
- Description: The job selects a host that can run MATLAB via the line ‘#SBATCH --constraint=matlab’. The sample script is ‘integrate.m’.
- Checkpoint a program
- Archive: 05-Checkpoint-DMTCP.zip
- Execution: a Python script started in a singularity container with a checkpointing program installed
- Language: Python
- Description: The DMTCP program is installed in the container (a standard Ubuntu 20.04). The provided example script runs a cycle that: 1) performs a dummy computation, 2) performs the checkpoint, 3) starts a new cycle. This illustrates the correct use of the checkpointing technique, which, where possible, should be applied at the main points of the execution flow. PLEASE NOTE: In the case of a CUDA/GPU program the checkpoint is valid only if the GPU doesn’t retain data in GPU memory, because GPU memory is invisible to the checkpoint program.
- Multiprocess parallel job
- Archive: 07-MultiCPU.zip
- Execution: a script that starts multiple processes
- Language: Python
- Description: The job demonstrates the multi-process capabilities of the system
- Conda (anaconda) usage
- Archive: 10-condatest.zip
- Execution: a script that starts a conda environment and runs a GPU test
- Language: Python
- Description: The job demonstrates the conda usage (please see the Conda page)
- Running interactive queue
- You can ask for an interactive session on a particular host (in this example ‘dellcuda1’ for 30 minutes) with a command like:
srun --partition=interactive --pty \
     --export=ALL --ntasks=1 --constraint=dellcuda1 \
     --time=00:30:00 /bin/bash
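As a reference, a minimal job file in the style of the Hello World example could look like the sketch below. The options and the allgroups partition are taken from the full example at the end of this page; the actual file inside 00-Hello.zip may differ.
#!/bin/sh
#SBATCH --job-name=Hello          # job name shown by squeue
#SBATCH --output=hello.out        # standard output file
#SBATCH --error=hello.err         # standard error file
#SBATCH --partition=allgroups     # partition name taken from the full example below
#SBATCH --ntasks=1
#SBATCH --mem=100M
#SBATCH --time=00:05:00

echo "Hello World from $(hostname)"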
A full sequence example
Here you will find a complete example with:
- General setup
- Build of a CUDA-enabled container
- Setup of a job that uses the container
- Run of the job
1 – General setup
Connect to labsrv7 or labsrv8 and create a new directory (Job0 in our example) for the task:
user@myhost:~$ ssh username@labsrv7.math.unipd.it
....
username@labsrv7:~$ mkdir Job0
username@labsrv7:~$ cd Job0
username@labsrv7:~/Job0$
2 – Build of a CUDA-enabled container
The first step is to provide a Singularity container definition: a text file with a defined syntax. You can find the complete documentation on the Singularity site.
For this example we will use one of the basic containers provided in the common area of the cluster. With the following commands we will copy the definition file and launch the remote build of the container.
username@labsrv7:~/Job0$ cp /conf/shared-software/Singularity/CUDA/containers/cuda-11.2-ubuntu2004-tensorflow-pytorch/container.def .
username@labsrv7:~/Job0$ singularity-remote-build container.sif container.def container-build.log
Running with parameters:
 - Container file: /home/koriel/Job0/container.sif
 - Definition file: /home/koriel/Job0/container.def
 - Log file: /home/koriel/Job0/container-build.log
build started...
username@labsrv7:~/Job0$
Now you have to wait until your log file container-build.log becomes readable and the container file container.sif appears (if no error occurs).
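For example, from the same directory you can follow the build log and check for the container file:
username@labsrv7:~/Job0$ tail -f container-build.log    # follow the remote build log (Ctrl-C to stop following)
username@labsrv7:~/Job0$ ls -lh container.sif           # the container file appears once the build has finished successfully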
3 – Setup of a job that uses the container
Now we need to provide (at least) two files: the job description file and the actual program to run.
- job description: copy the following text into a file named “myjob.slurm”:
#!/bin/sh
#SBATCH --job-name=GPU-Shared
#SBATCH --error=myjob.err
#SBATCH --output=myjob.out
#SBATCH --partition=allgroups
#SBATCH --ntasks=1
#SBATCH --mem=12G
#SBATCH --time=01:00:00
#SBATCH --constraint=dellcuda1
#SBATCH --gres=mps:1

SHAREDIR="/conf/shared-software/Singularity/CUDA/"

singularity exec -B \
    ${SHAREDIR}/driver/`hostname`:/nvdriver \
    ./container.sif \
    ./container.test.py
- actual program: copy the example python program from the common area
username@labsrv7:~/Job0$ cp /conf/shared-software/Singularity/CUDA/containers/cuda-11.2-ubuntu2004-tensorflow-pytorch/container.test.py .
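Since the job file runs ./container.test.py directly inside the container, the script needs to be executable; you can verify this before submitting and, if needed, set the executable bit yourself:
username@labsrv7:~/Job0$ ls -l container.test.py        # check that the executable bit (x) is set
username@labsrv7:~/Job0$ chmod +x container.test.py     # set it if it is missing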
4 – Run of the job
The command to run the job (on host dellcuda1, as can be seen in the job description file) will be:
username@labsrv7:~/Job0$ sbatch myjob.slurm
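When the job finishes you can inspect the output and error files named in myjob.slurm:
username@labsrv7:~/Job0$ cat myjob.out        # standard output, as set by --output in myjob.slurm
username@labsrv7:~/Job0$ cat myjob.err        # standard error, as set by --error in myjob.slurm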