Checkpointing – Dep. of Mathematics HPC facilities documentation

On a computing cluster it is common to have tasks (jobs) that run for days or weeks. This can lead to problems, because the longer a task runs, the higher the probability that some failure or external action stops it.

One possible answer to this problem is to ‘checkpoint’ the running program (the ‘process’).

At a basic level, ‘making a checkpoint’ of a running program means recording, in one or more files, the information needed to restart the program from the point it had reached when the checkpoint was taken.
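To make the idea concrete, here is a minimal sketch of a program that checkpoints itself, written in Python. It is only an illustration of the principle and does not use DMTCP. The function names, the JSON file format and the toy computation (a running sum) are all assumptions made for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` via a temporary file, so that an
    interruption never leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # rename over the previous checkpoint


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path) as handle:
        return json.load(handle)


def run_sum(limit, path, every=1000):
    """Toy computation: sum the integers 0..limit-1, saving the state
    every `every` iterations and resuming from the last saved state."""
    state = load_checkpoint(path, {"next_i": 0, "total": 0})
    for i in range(state["next_i"], limit):
        state["total"] += i
        state["next_i"] = i + 1
        if state["next_i"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the program is stopped and started again with the same file name, the loop continues from the saved value of ‘next_i’ instead of starting from zero.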

Some frameworks and programming environments include a facility for saving the state of the computation. One example is TensorFlow, as described on this page: https://www.tensorflow.org/guide/checkpoint

For a generic program written in Fortran, C, Python, etc., the DMTCP software is installed on the cluster in the directory /conf/shared-software/dmtcp. To use this software you have to add the line

. /conf/shared-software/dmtcp/load.sh

(including the initial ‘dot’, ‘space’ sequence) to the file ‘.bashrc’ in your home directory (you have to restart your shell for the change to take effect).

Basic operations

Given a generic program ‘myprogram’, the general workflow is as follows.

Start your program under the supervision of DMTCP (in this example a checkpoint is taken every 600 seconds, i.e. 10 minutes):

dmtcp_launch --interval 600 myprogram argument_1 ... argument_N

Your program is now running, and every 600 seconds a file with a name like ‘ckpt_myprogram_*.dmtcp’ is re-created. The file contains all the information needed to restart the computation from the time of the last save.

When the program is interrupted, for whatever reason, you can restart the computation from the last saved point with the command:

dmtcp_restart ckpt_myprogram_*.dmtcp
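In a job script it is often convenient to decide automatically between the two commands above. The following Python sketch builds the ‘dmtcp_restart’ command if checkpoint files are present and the ‘dmtcp_launch’ command otherwise. The function name is hypothetical, and the sketch assumes that the checkpoint files follow the naming pattern shown above and are stored in the working directory.

```python
import glob
import os


def build_dmtcp_command(program_args, workdir=".", interval=600):
    """Return the argument list to resume from existing checkpoint
    files, or to start the program from scratch if there are none.

    Assumption: checkpoint files are named ckpt_<program>_*.dmtcp
    and are located in `workdir`.
    """
    program_name = os.path.basename(program_args[0])
    pattern = os.path.join(workdir, "ckpt_" + program_name + "_*.dmtcp")
    checkpoints = sorted(glob.glob(pattern))
    if checkpoints:
        return ["dmtcp_restart"] + checkpoints
    return ["dmtcp_launch", "--interval", str(interval)] + list(program_args)
```

The returned list can be passed to ‘subprocess.run’. Whether the periodic interval is still applied after a restart is not covered here; please check the DMTCP documentation on this point.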

Observation 1

A checkpoint file contains a copy of the current memory of your program. This can lead to *HUGE* files that take a long time to create and manage. For this reason it is important to define a precise checkpointing strategy for any non-trivial program.

Observation 2

The DMTCP suite is richer than shown here. You can, for example, request a checkpoint at any time with the command

dmtcp_command --checkpoint

Using this capability, you can issue this command from within your running program/task to ask for a checkpoint at a convenient point in your algorithm (a trivial operation in a programming language such as C or Python). This setup has the advantage that a checkpoint is taken only when, and every time, it is needed.
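As an illustration, a Python program could issue the command above through the standard ‘subprocess’ module. This is a minimal sketch: the function names and the ‘main_loop’ structure are hypothetical, and it assumes that ‘dmtcp_command’ is on your PATH and that the program was started with ‘dmtcp_launch’.

```python
import subprocess


def request_checkpoint(run=subprocess.run):
    """Ask DMTCP for a checkpoint; return True if the command succeeded.

    `run` is a parameter only so that the function can be tried out
    without DMTCP installed.
    """
    try:
        result = run(["dmtcp_command", "--checkpoint"], check=False)
    except FileNotFoundError:
        # dmtcp_command was not found (e.g. load.sh was not sourced)
        return False
    return result.returncode == 0


def main_loop(steps, checkpoint_every, checkpoint=request_checkpoint):
    """Hypothetical computation that asks for a checkpoint after every
    `checkpoint_every` completed steps; returns the number of requests."""
    requested = 0
    for step in range(1, steps + 1):
        # ... one unit of real work would go here ...
        if step % checkpoint_every == 0:
            checkpoint()
            requested += 1
    return requested
```

Placing the request at the end of a step means that the checkpoint is taken when the program is in a well-defined state.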

If this is impossible for whatever reason, but you are able to create an arbitrary empty file from your program, then you can arrange for the program to create an empty file, for example “signal.myfile”, whenever it wants a checkpoint. A second program then has to wait for “signal.myfile” to appear, perform the checkpoint with the previous command, and remove “signal.myfile”.
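The second program can be very small. The following Python sketch polls for the signal file, calls a checkpoint function when the file appears, and then removes the file. The function name, the polling period and the ‘max_polls’ parameter are assumptions made for this example.

```python
import os
import time


def watch_for_signal(signal_path, do_checkpoint, poll_seconds=5.0, max_polls=None):
    """Poll for `signal_path`; each time it exists, call `do_checkpoint()`
    and then remove the file. Returns the number of checkpoints requested.

    With max_polls=None the loop runs until the watcher is killed.
    """
    handled = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        if os.path.exists(signal_path):
            do_checkpoint()
            os.remove(signal_path)
            handled += 1
        else:
            time.sleep(poll_seconds)
    return handled
```

In real use, ‘do_checkpoint’ would be a function that runs ‘dmtcp_command --checkpoint’, for example through ‘subprocess.run’.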

Observation 3

The components of the checkpointing system communicate over TCP on a default port (7779). This can lead to conflicts on multi-user systems where several instances of the DMTCP suite are running. To avoid the problem, the command ‘dmtcp_launch’ has a pair of options, ‘--new-coordinator’ and ‘--coord-port NUMBER’, which together create a coordination channel/server on the given port. All the commands of the DMTCP suite accept the option ‘--coord-port NUMBER’ to select the communication channel.
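One way to choose a port that is not already taken is to ask the operating system for one. The following Python sketch does this and builds the corresponding command lines with the options described above. The function names are hypothetical. Note also that another user could take the port between the moment it is found and the moment DMTCP uses it, so this reduces conflicts but does not rule them out.

```python
import socket


def find_free_port():
    """Ask the operating system for a TCP port that is currently unused."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]


def launch_command(program_args, port, interval=600):
    """Argument list that starts the program with its own coordinator."""
    return [
        "dmtcp_launch",
        "--new-coordinator",
        "--coord-port", str(port),
        "--interval", str(interval),
    ] + list(program_args)


def checkpoint_command(port):
    """Argument list that requests a checkpoint from the same coordinator."""
    return ["dmtcp_command", "--coord-port", str(port), "--checkpoint"]
```

The same port number has to be passed to every later DMTCP command that refers to this run, so it is a good idea to save it, for example in a file next to the job output.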

Again, the suite is richer than this. You can, for example, create a coordinator on a given host and use the option ‘--coord-host HOSTNAME’ to contact that remote coordinator from your program. Please read the DMTCP documentation if these basic instructions are not enough.