SLURM
SLURM (formerly the Simple Linux Utility for Resource Management) is a workload manager for scheduling and managing jobs on compute clusters.
The official SLURM cheat sheet can be found on the SchedMD website at https://slurm.schedmd.com/pdfs/summary.pdf
SLURM has two main operating modes: batch and interactive. Batch is the preferred mode on snorlax.cs: you can start your task and get an email when it is complete. If you require an interactive session with snorlax.cs hardware (useful for debugging), see the salloc section below.
Batch mode usage
You can submit jobs using a SLURM job script. Below is an example of a simple script:
:warning: #SBATCH is the prefix that tells SLURM to take a line's arguments into account. If you wish to disable a line, change its prefix to ##SBATCH; do not remove the # at the beginning of the line.
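For example, of the two directives below (values are only illustrative), the first is active and the second is ignored because of the doubled #:
#SBATCH --mem=10GB      # read by SLURM
##SBATCH --gres=gpu:1   # disabled, skipped by SLURM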
#!/bin/bash
# To be submitted to the SLURM queue with the command:
# sbatch batch-submit.sh
# Set resource requirements: Queues are limited to seven day allocations
# Time format: HH:MM:SS
#SBATCH --time=00:15:00
#SBATCH --mem=10GB
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
# Set output file destinations (optional)
# By default, output will appear in a file in the submission directory:
# slurm-$job_number.out
# This can be changed:
#SBATCH -o JOB%j.out # File to which STDOUT will be written
#SBATCH -e JOB%j.out # File to which STDERR will be written
# email notifications: Get email when your job starts, stops, fails, completes...
# Set email address
#SBATCH --mail-user=<email address where notifications are sent>
# Set types of notifications (from the options: BEGIN, END, FAIL, REQUEUE, ALL):
#SBATCH --mail-type=ALL
# Load up your conda environment
# Set up environment on snorlax-login.cs or in interactive session (use `source` keyword instead of `conda`)
source activate <env>
# Task to run
~/cuda-samples/Samples/5_Domain_Specific/nbody/nbody -benchmark -device=0 -numbodies=16777216
You can use #SBATCH directives such as --mem to request resources; the example above assigns 10GB of RAM to the job. For CPU core allocation, use --cpus-per-task; the example above assigns 4 cores to the job. The --gres=gpu:1 directive assigns 1x GPU to your job.
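As a sketch, a larger job could combine these directives; the values below are only examples and should be adjusted to what you actually need:
#SBATCH --time=12:00:00        # 12-hour limit (maximum is seven days)
#SBATCH --mem=32GB             # 32GB of RAM
#SBATCH --cpus-per-task=8      # 8 CPU cores
#SBATCH --gres=gpu:2           # 2 GPUs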
Running the script
To run the script, simply run sbatch your_script.sh on snorlax.cs.
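For example (the job ID below is illustrative):
$ sbatch your_script.sh
Submitted batch job 12345
The printed job ID is the one you pass to squeue, scontrol, and scancel later on.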
Interactive mode usage
You can reserve resources on the cluster using the salloc command. Below is an example:
salloc --gres=gpu:2 --cpus-per-task=4 --mem=16G --time=2:00:00
The example above will reserve 2 GPUs, 4 CPU cores, and 16GB of RAM for 2 hours. Once you run the command, it will output the name of the host like so:
salloc: Nodes snorlax-1 are ready for job
Here snorlax-1 is the assigned host that the user can SSH to.
Ideally you want to run this command in either a screen or tmux session on snorlax-login.cs, so your allocation is not lost if your SSH connection drops.
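A minimal interactive workflow might look like the sketch below (resource values, session name, and node name are only examples):
$ tmux new -s interactive          # keeps the allocation alive if your connection drops
$ salloc --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=1:00:00
salloc: Nodes snorlax-1 are ready for job
$ ssh snorlax-1                    # connect to the assigned host
$ nvidia-smi                       # check that the allocated GPU(s) are visible
$ exit                             # log out of snorlax-1
$ exit                             # exiting the salloc shell releases the allocation early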
Queues
To look at the current queue of jobs, use squeue.
The scurrent command will also list all jobs currently running on snorlax. It is useful for seeing which resources are in use at the moment.
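For reference, squeue output looks roughly like this (the job shown is purely illustrative, including the partition name):
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
12345       gpu    nbody    alice  R      10:23      1 snorlax-1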
Monitoring jobs
Current jobs
By default, squeue will show all the jobs the scheduler is managing at the moment. It will run much faster if you ask only about your own jobs with:
$ squeue -u $USER
You can show only running jobs, or only pending jobs:
$ squeue -u <username> -t RUNNING
$ squeue -u <username> -t PENDING
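squeue also accepts a format string if you want a more compact view; the sketch below uses standard format codes (job ID, partition, name, state, elapsed time, node/reason):
$ squeue -u $USER -o "%.10i %.9P %.20j %.2t %.10M %R"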
You can show detailed information for a specific job with scontrol:
$ scontrol show job -dd <jobid>
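The output of scontrol show job is fairly verbose; one way to pick out a few fields is to filter it, for example (field names as they appear in standard scontrol output):
$ scontrol show job -dd <jobid> | grep -E "JobState|RunTime|TRES"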
Do not run squeue from a script or program at high frequency, e.g., every few seconds. Responding to squeue adds load to Slurm, and may interfere with its performance or correct operation.
Cancelling jobs
Use scancel with the job ID to cancel a job:
$ scancel <jobid>
You can also use it to cancel all your jobs, or all your pending jobs:
$ scancel -u $USER
$ scancel -t PENDING -u $USER
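scancel can also filter by job name; for example (the job name below is hypothetical):
$ scancel -u $USER -n my_experiment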