Introduction

The software that runs the cluster is ‘SLURM’, the de facto standard for this kind of task. You can find extensive documentation on the SLURM main site; here you find a short introduction.

To run a computation on the cluster you have to:

connect to a submit host

write a job file describing the computation

submit the job file with the sbatch command

Access

It is possible to run jobs from two submit hosts:

labsrv7.math.unipd.it

labsrv8.math.unipd.it

You can connect to these hosts via ssh from any host internal to the Department of Mathematics. To connect from the general internet, first make an ssh connection to riemann.math.unipd.it, labta.math.unipd.it or guestportal.math.unipd.it (depending on your status: faculty member, student or guest) and then connect to the submit hosts from there.
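The two-hop connection described above can also be made in a single command with ssh's -J (ProxyJump) option. A minimal sketch, assuming you are a faculty member (so riemann is your gateway) and that your username is the same on both hosts:

```shell
# Jump through riemann.math.unipd.it and land directly on the submit host.
# Substitute the gateway that matches your status (labta / guestportal)
# and add user@ prefixes if your usernames differ.
ssh -J riemann.math.unipd.it labsrv7.math.unipd.it
```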

Basic usage

Connect to a submit host (labsrv7 or labsrv8) and create a text file named ‘hello.slurm’ with the following content:

#!/bin/bash
#SBATCH --job-name="Hello World"
#SBATCH --error=hello.err
#SBATCH --output=hello.out
#SBATCH --partition=allgroups
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --time=01:00:00
echo "HELLO WORLD"

Then give the following command:

sbatch hello.slurm

Provided that the job finds room in the cluster, you should find two new files in your directory: hello.err (empty) and hello.out, which contains the words ‘HELLO WORLD’.

Congratulations, you have just run your first job.

Dissection of the job file

A job file is a ‘shell script’: in this case, a list of commands for the GNU/Linux shell ‘bash’. This is the reason for the first line (please note that this must be the very first line and the character ‘#’ must be the very first character):

#!/bin/bash

Then come the lines that describe the execution of the job from the point of view of the SLURM scheduler. These lines start with the ‘#’ sign, so from the point of view of the shell they are comments and are not executed; the SLURM scheduler, which scans the script before execution, finds in these comments the instructions for running the job.
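You can check this division of labour yourself: since every #SBATCH line is an ordinary comment to bash, running the job file with plain bash (outside SLURM) simply skips them. A minimal sketch:

```shell
#!/bin/bash
# To bash, any line whose first character is '#' is a comment, so the
# directive below is ignored at run time; only SLURM reads it.
#SBATCH --job-name=demo
msg="HELLO WORLD"   # an ordinary command: bash executes this
echo "$msg"
```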

#SBATCH --job-name="Hello World"

gives a name to the job. This is the name that will be displayed by commands like ‘squeue’.

#SBATCH --error=hello.err
#SBATCH --output=hello.out

define the files where the job will write its ‘standard error’ and ‘standard output’.

#SBATCH --partition=allgroups

This line defines the SLURM ‘partition’ (think of it as a queue) to which the job is submitted.

#SBATCH --ntasks=1

Defines the number of tasks (processes) needed to perform the job.

#SBATCH --mem=1G

Defines the maximum amount of memory needed for the job, in this case one gigabyte.

#SBATCH --time=01:00:00

Defines the maximum amount of time (format HH:MM:SS, here one hour) needed to perform all the operations of the job.

echo "HELLO WORLD"

This is the actual command that performs the computation of the job. In this case the only operation is printing the words ‘HELLO WORLD’. Real jobs will put one or more commands in this place.
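As a slightly more realistic sketch, a job usually asks for the resources it needs and then runs its own program. The program name, directory and input file below are hypothetical; adapt them to your own software:

```shell
#!/bin/bash
#SBATCH --job-name="My Analysis"
#SBATCH --error=analysis.err
#SBATCH --output=analysis.out
#SBATCH --partition=allgroups
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=02:00:00

# Everything below is ordinary shell: move to the working directory
# and run the (hypothetical) program on its input file.
cd "$HOME/analysis"
./my_program input.dat > result.txt
```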

Useful SLURM commands

Show queue status:

squeue

Show cluster queues and status:

sinfo -l -v

Show nodes (computing machines):

scontrol show node

Show a particular node:

scontrol show node <nodename>

Submit a job:

sbatch file.slurm
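These commands are often combined into a small cycle: submit, monitor, and (if needed) cancel. A sketch using the hello.slurm file from above; the --parsable option makes sbatch print only the numeric job ID, which is convenient in scripts:

```shell
# Submit and capture the numeric job ID
jobid=$(sbatch --parsable hello.slurm)

# Show only this job in the queue
squeue -j "$jobid"

# Show the full details of the job
scontrol show job "$jobid"

# Cancel it, if necessary
scancel "$jobid"
```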