Queuing System Guide - Slurm

Overview

In an HPC cluster, users' tasks to be run on the compute nodes are managed by a batch queuing system. On Talon 3, we have chosen the Slurm Workload Manager (Slurm).

Queuing systems manage job requests (shell scripts, generally referred to as jobs) submitted by users. In other words, to have the cluster perform your computations, you must submit a job request to a specific batch queue. The scheduler assigns your job to a compute node in the order determined by that queue's policy and the availability of an idle compute node. Talon 3 currently has several policies in place to help guarantee fair resource utilization.

Partitions

On Talon 3, there is one partition, compute, with a limit of 448 CPUs (16 compute nodes).
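You can confirm the partition's nodes and limits directly on the cluster with sinfo; the commands below are a sketch (the exact columns shown vary by Slurm version and site configuration), and must be run on a Slurm cluster:

```shell
# Show the compute partition, its time limit, and node states
sinfo -p compute

# Summarized view: one line per partition with node counts by state
sinfo -s -p compute
```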

QOS (Quality of Service)

There are three QOSs under the compute partition.

Name     Description
general  The default QOS, for jobs that take 72 hours or less. Limit: 420 CPUs.
large    For large jobs. Limit: 15 compute nodes. Allows exclusive jobs.
debug    For quick test computations and debugging. Limit: 2 hours and 2 compute nodes.
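The limits above can also be inspected on the cluster itself with sacctmgr; this is a sketch (the fields reported depend on how the site's accounting database is configured), and it requires a Slurm cluster:

```shell
# List all QOS definitions
sacctmgr show qos

# Restrict the output to the name, wall-time limit, and per-job resource limits
sacctmgr show qos format=Name,MaxWall,MaxTRES
```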

SLURM Commands

The following table lists frequently used commands.

Slurm Command           Description                           UGE Equiv.
sbatch script.job       submit a job                          qsub script.job
squeue [job_id]         display job status (by job)           qstat [job_id]
squeue -u <user>        display the status of a user's jobs   qstat -u <user>
squeue                  display queue summary status          qstat -g c
scancel <job_id>        cancel a job in any state             qdel
scontrol update         modify a pending job                  qalter
salloc                  run an interactive job                qlogin or qrsh
(no direct equivalent)  run a parallel make to compile code   qmake
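For users coming from UGE, a typical session maps almost one-to-one. The sketch below assumes a hypothetical job script named script.job and a hypothetical job ID 12345, and must be run on a Slurm cluster:

```shell
# UGE: qsub script.job
sbatch script.job

# UGE: qstat -u $USER
squeue -u $USER

# UGE: qdel 12345
scancel 12345
```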

Job State

When using squeue, the following job states are possible.  
State Full State Name Description
R RUNNING The job currently has an allocation.
CA CANCELLED The job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
CD COMPLETED The job has terminated; all processes on all nodes have completed.
CF CONFIGURING The job has been allocated resources but is waiting for them to become ready for use (e.g., booting).
CG COMPLETING The job is in the process of completing. Some processes on some nodes may still be active.
F FAILED The job terminated with non-zero exit code or other failure condition.
NF NODE_FAIL The job terminated due to failure of one or more allocated nodes.
PD PENDING The job is awaiting resource allocation.
PR PREEMPTED The job terminated due to preemption.
S SUSPENDED The job has an allocation, but execution has been suspended.
TO TIMEOUT The job terminated upon reaching its time limit.
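squeue can filter on these state codes with the -t/--states option, which is handy when a queue is long. A sketch (requires a running Slurm cluster):

```shell
# Show only your pending jobs, including the reason they are waiting
squeue -t PD -u $USER

# Show pending and running jobs together
squeue --states=PD,R
```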

The following table lists common SLURM variables with their UGE equivalents; for a complete list, see the sbatch manpage:

SLURM Variable          Description                                            UGE Variable
SLURM_SUBMIT_DIR        current working directory of the submitting client     SGE_O_WORKDIR
SLURM_JOB_ID            unique identifier assigned when the job was submitted  JOB_ID
SLURM_NTASKS            number of CPUs in use by a parallel job                NSLOTS
SLURM_NNODES            number of hosts in use by a parallel job               NHOSTS
SLURM_ARRAY_TASK_ID     index number of the current array job task             SGE_TASK_ID
SLURM_JOB_CPUS_PER_NODE number of CPU cores per node
SLURM_JOB_NAME          name of the job
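Inside a job, Slurm sets these variables automatically; outside a job they are unset. The sketch below uses shell default values so the same script also runs on a login node, which makes it easy to test before submitting:

```shell
#!/usr/bin/env bash
# Fall back to sensible defaults when not running under Slurm.
jobid=${SLURM_JOB_ID:-interactive}
ntasks=${SLURM_NTASKS:-1}
workdir=${SLURM_SUBMIT_DIR:-$PWD}

echo "Job ${jobid}: ${ntasks} task(s), submitted from ${workdir}"
```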

Job Submission Tips

Begin your job script with special #SBATCH directives, which are sbatch options. Alternatively, these options can be passed as command-line options to sbatch or srun.

  • #SBATCH -p compute

Defines the partition used to execute this job. compute is the only partition on Talon 3.

  • #SBATCH --qos general

Defines the QOS under which the job will be executed (debug, general, and large are the only options).

  • #SBATCH --exclusive=user

Makes the job exclusive, preventing other jobs from sharing the compute node. This is required for all large QOS submissions.

  • #SBATCH -t 80:00:00

Sets the wall-time limit for the job in hh:mm:ss.

  • #SBATCH --ntasks=220

    Defines the total number of CPUs for the job.

  • #SBATCH --ntasks-per-node=28

    Defines the number of CPUs per node.
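These directives combine into a single script header. The sketch below is a hypothetical array job: the --array option and SLURM_ARRAY_TASK_ID are standard Slurm, while the input-N.dat file names are made up for illustration. A shell default lets the script run outside Slurm for testing:

```shell
#!/usr/bin/env bash
#SBATCH -p compute
#SBATCH --qos general
#SBATCH -t 01:00:00
#SBATCH --ntasks=1
#SBATCH --array=0-3          # four array tasks, indices 0..3

# SLURM_ARRAY_TASK_ID selects this task's input; default to 0 off-cluster.
task=${SLURM_ARRAY_TASK_ID:-0}
echo "Processing input-${task}.dat"
```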

Basic Information about Slurm:

The Slurm Workload Manager (formerly known as the Simple Linux Utility for Resource Management, or SLURM) is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job such as MPI) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending jobs. Slurm is the workload manager on about 60% of the TOP500 supercomputers, including Tianhe-2, which until 2016 was the world's fastest computer. Slurm uses a best fit algorithm based on Hilbert curve scheduling or fat tree network topology in order to optimize locality of task assignments on parallel computers.

Slurm Tutorials and Commands:

A Quick-Start Guide for those unfamiliar with Slurm can be found here:
 
Slurm Tutorial Videos can be found here for additional information:

Sample Slurm Job Script:

Here's a simple Slurm job script:

$ cat slurm-job.sh
#!/usr/bin/env bash

#SBATCH -o slurm.sh.out
#SBATCH -p compute

#SBATCH --qos general
#SBATCH -N 1
#SBATCH --ntasks-per-node=28
#SBATCH -t 72:00:00

echo "In the directory: `pwd`"
echo "As the user: `whoami`"
echo "write this to a file" > analysis.output
sleep 60

Submit the job:

$ module load slurm
$ sbatch slurm-job.sh
Submitted batch job 106

List jobs:

$ squeue
  JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    106 compute slurm-jo  rstober   R   0:04      1 atom01

Get job details:

$ scontrol show job 106
JobId=106 Name=slurm-job.sh
   UserId=rstober(1001) GroupId=rstober(1001)
   Priority=4294901717 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:07 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2013-01-26T12:55:02 EligibleTime=2013-01-26T12:55:02
   StartTime=2013-01-26T12:55:02 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=defq AllocNode:Sid=atom-head1:3526
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=atom01
   BatchHost=atom01
   NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/rstober/slurm/local/slurm-job.sh
   WorkDir=/home/rstober/slurm/local

Kill a job. Users can kill their own jobs; root can kill any job.

$ scancel 135
$ squeue
  JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

Hold a job:

$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    139      defq   simple  rstober  PD       0:00      1 (Dependency)
    138      defq   simple  rstober   R       0:16      1 atom01
$ scontrol hold 139
$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    139      defq   simple  rstober  PD       0:00      1 (JobHeldUser)
    138      defq   simple  rstober   R       0:32      1 atom01

Release a job:

$ scontrol release 139
$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    139      defq   simple  rstober  PD       0:00      1 (Dependency)
    138      defq   simple  rstober   R       0:46      1 atom01

List partitions:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      1  down* atom04
defq*        up   infinite      3   idle atom[01-03]
cloud        up   infinite      2  down* cnode1,cnodegpu1
cloudtran    up   infinite      1   idle atom-head1

Submit a job that's dependent on a prerequisite job completing:

Here's a simple job script. Note that the Slurm -J option is used to give the job a name.

#!/usr/bin/env bash

#SBATCH -p compute
#SBATCH -J simple

sleep 60

Submit the job

$ sbatch simple.sh
Submitted batch job 149

Now we'll submit another job that's dependent on the previous job. There are many ways to specify the dependency conditions, but the "singleton" is the simplest. The Slurm -d singleton argument tells Slurm not to dispatch this job until all previous jobs with the same name have completed.

$ sbatch -d singleton simple.sh
Submitted batch job 150
$ squeue
  JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    150 defq   simple  rstober  PD  0:00  1 (Dependency)
    149 defq   simple  rstober   R  0:17  1 atom01

Once the prerequisite job finishes, the dependent job is dispatched.

$ squeue
  JOBID PARTITION NAME USER ST TIME  NODES NODELIST(REASON)
    150 defq   simple  rstober   R   0:31  1 atom01
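Besides singleton, the -d/--dependency option accepts conditions keyed on specific job IDs; the sketch below reuses the hypothetical job ID 149 from the example above and requires a Slurm cluster:

```shell
# Start only if job 149 finishes successfully (exit code 0)
sbatch -d afterok:149 simple.sh

# Start once job 149 ends, regardless of its exit status
sbatch -d afterany:149 simple.sh
```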
...