SLURM

SLURM (formerly the Simple Linux Utility for Resource Management) is a job scheduler for managing and allocating work on compute clusters.

The official SLURM cheatsheet can be found HERE

SLURM has two main operating modes: batch and interactive. Batch is the preferred mode for watgpu.cs: you can submit your task and receive an email when it completes. If you require an interactive session (useful for debugging) on the snorlax.cs hardware, see the salloc section below.

Batch mode usage

You can submit jobs using a SLURM job script. Below is an example of a simple script. :warning: #SBATCH is the directive prefix that tells SLURM to read your arguments. If you wish to disable a line, use ##SBATCH instead; do not remove the # at the beginning of the line.

#!/bin/bash
# To be submitted to the SLURM queue with the command:
# sbatch batch-submit.sh

# Set resource requirements: Queues are limited to seven day allocations
# Time format: HH:MM:SS
#SBATCH --time=00:15:00
#SBATCH --mem=10GB
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1

# Set output file destinations (optional)
# By default, output will appear in a file in the submission directory:
# slurm-$job_number.out
# This can be changed:
#SBATCH -o JOB%j.out # File to which STDOUT will be written
#SBATCH -e JOB%j.out # File to which STDERR will be written

# Email notifications: Get email when your job starts, stops, fails, completes...
# Set email address
#SBATCH --mail-user=(email address where notifications are delivered to)
# Set types of notifications (from the options: BEGIN, END, FAIL, REQUEUE, ALL):
#SBATCH --mail-type=ALL

# Load up your conda environment
# Set up environment on snorlax-login.cs or in interactive session (use `source` keyword instead of `conda`)
source activate <env>

# Task to run
~/cuda-samples/Samples/5_Domain_Specific/nbody/nbody -benchmark -device=0 -numbodies=16777216

You can use SBATCH options such as --mem to request resources; for example, the script above assigns 10GB of RAM to the job.

For CPU core allocation, use --cpus-per-task; for example, the script above assigns 4 cores to the job. The --gres=gpu:1 option assigns one GPU to your job.
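As a rough sketch, a job that needs two GPUs, eight cores, and 32GB of RAM for one hour could request them with directives like the following near the top of its batch script (the values are illustrative, not site recommendations):

# Illustrative resource request -- adjust values to your job's needs
#SBATCH --time=01:00:00
#SBATCH --mem=32GB
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:2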

Running the script

To run the script, simply run sbatch your_script.sh on snorlax.cs.
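For example (the job ID below is illustrative), sbatch prints the ID of the submitted job, and by default the output later appears in the corresponding slurm-<jobid>.out file in the submission directory:

$ sbatch your_script.sh
Submitted batch job 123456
$ cat slurm-123456.out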

Interactive mode usage

You can book/reserve resources on the cluster using the salloc command. Below is an example:

salloc --gres=gpu:2 --cpus-per-task=4 --mem=16G --time=2:00:00

The example above will reserve 2 GPUs, 4 CPU cores, and 16GB of RAM for 2 hours. Once you run the command, it will output the name of the host like so:

salloc: Nodes snorlax-1 are ready for job

Here, snorlax-1 is the assigned host that you can SSH to.

Ideally, you should run this command inside either a screen or tmux session on snorlax-login.cs.
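A typical interactive workflow might look like the following sketch (the tmux session name and the assigned host are illustrative):

$ tmux new -s gpu-debug
$ salloc --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=2:00:00
salloc: Nodes snorlax-1 are ready for job
$ ssh snorlax-1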

Queues

To view the current queue of jobs, use squeue.

The command scurrent will also list all jobs currently running on snorlax. It is useful for seeing which resources are in use at the moment.

Monitoring jobs

Current jobs

By default, squeue shows all the jobs the scheduler is managing at the moment. It will run much faster if you ask only about your own jobs with

$ squeue -u $USER

You can show only running jobs, or only pending jobs:

$ squeue -u <username> -t RUNNING
$ squeue -u <username> -t PENDING

You can show detailed information for a specific job with scontrol:

$ scontrol show job -dd <jobid>

Do not run squeue from a script or program at high frequency, e.g., every few seconds. Responding to squeue adds load to Slurm, and may interfere with its performance or correct operation.
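If you do need to watch the queue, squeue's --iterate option repeats the query at a fixed interval (in seconds), which is gentler than re-running the command in a tight loop; the 60-second interval below is just an example:

$ squeue -u $USER --iterate=60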

Cancelling jobs

Use scancel with the job ID to cancel a job:

$ scancel <jobid>

You can also use it to cancel all your jobs, or all your pending jobs:

$ scancel -u $USER

$ scancel -t PENDING -u $USER
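scancel can also filter on other attributes; for example (the job name below is a placeholder), you can cancel all of your jobs that share a given name:

$ scancel -u $USER --name=<jobname>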