Queuing System Guide - Slurm

Overview

In an HPC cluster, users' tasks are run on compute nodes under the control of a batch queuing system. On Talon 3, we have chosen the Slurm Workload Manager (Slurm).

Queuing systems manage job requests (shell scripts, generally referred to as jobs) submitted by users. In other words, to have your computations run on the cluster, you must submit a job request to a specific batch queue. The scheduler assigns your job to a compute node in the order determined by the policy on that queue and the availability of an idle compute node. Talon 3 currently has several policies in place to help guarantee fair resource utilization.

Partitions

On Talon 3, the main partition is named public. It is limited to 672 CPUs (24 compute nodes).

There are also private partitions for users who need more computing resources. Please contact hpc-admin@unt.edu for more information and to request access to these partitions.

QOS (Quality of Service)

There are four QOS levels under the 'public' partition:

Name        Description
debug       For running quick test computations for debugging purposes.
            Limits: 2 hours, 2 compute nodes
            Exclusive jobs allowed
general     The default QOS, for jobs that take 72 hours or less.
            Time limit: 72 hours
            Limit: 616 CPUs
large       For large jobs.
            Time limit: 3 weeks
            Limit: 22 compute nodes
            Exclusive jobs allowed
unlimited   Limits: 2 jobs
            Exclusive jobs allowed
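As a quick sanity check before submitting, the limits above can be turned into a small helper that suggests a QOS for a given wall time. This is only a sketch; pick_qos is a hypothetical function, not a Talon 3 or Slurm command:

```shell
#!/bin/bash
# pick_qos HOURS — suggest a QOS for a requested wall time, based on the
# limits in the table above (hypothetical helper, for illustration only).
pick_qos() {
  local hours=$1
  if [ "$hours" -le 2 ]; then
    echo debug      # quick tests: up to 2 hours, 2 nodes
  elif [ "$hours" -le 72 ]; then
    echo general    # default: up to 72 hours
  else
    echo large      # long jobs: up to 3 weeks, exclusive required
  fi
}

pick_qos 48   # prints: general
```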

SLURM Commands

The following table lists frequently used commands, with their UGE equivalents:

Slurm Command      Description                          UGE Equivalent
sbatch script.job  submit a job                         qsub script.job
squeue [job_id]    display job status (by job)          qstat [job_id]
squeue -u EUID     display the status of a user's jobs  qstat -u
squeue             display queue summary status         qstat -g c
scancel            delete a job (pending or running)    qdel
scontrol update    modify a pending job                 qalter
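For users coming from UGE, the mapping above can be captured as a small translation function. This is a mnemonic sketch only; uge2slurm is a hypothetical helper, not an installed command:

```shell
#!/bin/bash
# uge2slurm CMD — print the Slurm equivalent of a UGE command,
# per the table above (hypothetical helper, for illustration only).
uge2slurm() {
  case "$1" in
    qsub)   echo "sbatch" ;;
    qstat)  echo "squeue" ;;
    qdel)   echo "scancel" ;;
    qalter) echo "scontrol update" ;;
    *)      echo "unknown"; return 1 ;;
  esac
}

uge2slurm qdel   # prints: scancel
```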

Job State

When using squeue, the following job states are possible.

State  Full State Name  Description
R      RUNNING          The job currently has an allocation.
CA     CANCELLED        The job was explicitly cancelled by the user or a system administrator. The job may or may not have been initiated.
CD     COMPLETED        The job has terminated; all processes on all nodes have completed.
CF     CONFIGURING      The job has been allocated resources, but is waiting for them to become ready for use (e.g., nodes booting).
CG     COMPLETING       The job is in the process of completing. Some processes on some nodes may still be active.
F      FAILED           The job terminated with a non-zero exit code or other failure condition.
NF     NODE_FAIL        The job terminated due to failure of one or more allocated nodes.
PD     PENDING          The job is awaiting resource allocation.
PR     PREEMPTED        The job terminated due to preemption.
S      SUSPENDED        The job has an allocation, but execution has been suspended.
TO     TIMEOUT          The job terminated upon reaching its time limit.
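The state column (ST) from squeue can be summarized with a short awk pass, for example to see how many of your jobs are pending versus running. The sketch below runs on sample text standing in for real output; on the cluster the input would come from `squeue -u $USER`:

```shell
#!/bin/bash
# Count jobs by state code from squeue-style output. The sample string
# below stands in for the real output of: squeue -u $USER
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
101 public job1 user1 R 0:04 1 atom01
102 public job2 user1 PD 0:00 1 (Resources)
103 public job3 user1 PD 0:00 2 (Priority)'

# Column 5 is the two-letter state code; NR > 1 skips the header line.
echo "$sample" | awk 'NR > 1 { count[$5]++ } END { for (s in count) print s, count[s] }'
```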

The following table lists common SLURM variables, with their UGE equivalents; for a complete list, see the sbatch manpage:

SLURM Variable           Description                                            UGE Variable
SLURM_SUBMIT_DIR         current working directory of the submitting client     SGE_O_WORKDIR
SLURM_JOB_ID             unique identifier assigned when the job was submitted  JOB_ID
SLURM_NTASKS             number of CPUs in use by a parallel job                NSLOTS
SLURM_NNODES             number of hosts in use by a parallel job               NHOSTS
SLURM_ARRAY_TASK_ID      index number of the current array job task             SGE_TASK_ID
SLURM_JOB_CPUS_PER_NODE  number of CPU cores per node
SLURM_JOB_NAME           name of the job
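Slurm sets these variables in the job's environment at run time, so a job script can log where and how it ran. A minimal sketch; the ${VAR:-default} fallbacks are only there so the snippet also runs outside a Slurm job:

```shell
#!/bin/bash
# Print basic job information from Slurm-provided environment variables.
# The :-default fallbacks let this also run outside a Slurm job.
job_info() {
  echo "Job ID:     ${SLURM_JOB_ID:-none}"
  echo "Job name:   ${SLURM_JOB_NAME:-interactive}"
  echo "Submit dir: ${SLURM_SUBMIT_DIR:-$PWD}"
  echo "Tasks:      ${SLURM_NTASKS:-1} on ${SLURM_NNODES:-1} node(s)"
}

job_info
```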

Job Submission Tips


At the top of your job script, add lines beginning with the special directive #SBATCH, which are sbatch options. Alternatively, these options can be passed on the command line to sbatch or srun.

  • #SBATCH -p public

Defines the partition used to execute the job. The main partition on Talon 3 is 'public'.

  • #SBATCH -J job_name

    Defines the job name.

  • #SBATCH -o JOB.o%j

    Defines the output file name.

  • #SBATCH -e JOB.e%j

    Defines the error file name.

  • #SBATCH --qos general

Defines the QOS under which the job will run (debug, general, and large are the options).

  • #SBATCH --exclusive

Makes the job exclusive, preventing other jobs from sharing its compute nodes. This is required for all large QOS submissions.

  • #SBATCH -t 80:00:00

Sets the wall-time limit for the job in hh:mm:ss. (A limit above 72 hours, such as this one, requires the large QOS.)

  • #SBATCH -n 84

Defines the total number of MPI tasks.

  • #SBATCH -N 3

    Defines the number of compute nodes requested.

  • #SBATCH --ntasks-per-node 28

    Defines the number of tasks per node.

  • #SBATCH -C c6320

Requests the c6320 compute nodes (r420, r720, and r730 compute nodes can also be requested).

  • #SBATCH --mail-user=user@unt.edu

    Sets up email notification.

  • #SBATCH --mail-type=begin

    Email user when job begins.

  • #SBATCH --mail-type=end

    Email user when job finishes.
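Putting the directives above together, a typical job header might look like the following. This is a sketch of the directive block only; adjust the names, counts, and time limit to your job:

```shell
#!/bin/bash
# Example job header combining the directives described above.
#SBATCH -J my_job                  # job name
#SBATCH -o my_job.o%j              # output file (%j expands to the job ID)
#SBATCH -e my_job.e%j              # error file
#SBATCH -p public                  # partition
#SBATCH --qos general              # QOS: debug, general, or large
#SBATCH -N 2                       # number of compute nodes
#SBATCH --ntasks-per-node 28       # tasks per node
#SBATCH -t 48:00:00                # wall-time limit, hh:mm:ss
#SBATCH --mail-user=user@unt.edu   # notification address
#SBATCH --mail-type=begin          # email when the job begins
#SBATCH --mail-type=end            # email when the job finishes
```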

Basic Information about Slurm:

The Slurm Workload Manager, formerly known as the Simple Linux Utility for Resource Management (SLURM), is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job such as MPI) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending jobs. Slurm is the workload manager on about 60% of the TOP500 supercomputers, including Tianhe-2, which was the world's fastest computer until 2016. Slurm uses a best-fit algorithm based on Hilbert curve scheduling or fat-tree network topology to optimize the locality of task assignments on parallel computers.

Slurm Tutorials and Commands:

A Quick-Start Guide for those unfamiliar with Slurm can be found here:
 
Slurm Tutorial Videos can be found here for additional information:
 

Sample Slurm Job Script:

Simple serial job script example

#!/bin/bash
######################################
# Example of a SLURM job script for Talon3
# Job Name: Sample_Job
# Number of cores: 1
# Number of nodes: 1
# QOS: general
# Run time: 12 hrs
######################################
 
#SBATCH -J Sample_Job
#SBATCH -o Sample_job.o%j
#SBATCH -p public
#SBATCH --qos general
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 12:00:00
 
### Loading modules
module load intel/PS2017
 
./a.out > outfile.out

 

Simple parallel MPI job script example

#!/bin/bash
######################################
# Example of a SLURM job script for Talon3
# Job Name: Sample_Job
# Number of cores: 28
# Number of nodes: 1
# QOS: general
# Run time: 12 hrs
######################################
 
#SBATCH -J Sample_Job
#SBATCH -o Sample_job.o%j
#SBATCH -p public
#SBATCH --qos general
#SBATCH -N 1
#SBATCH -n 28
#SBATCH -t 12:00:00
 
### Loading modules
module load intel/PS2017
 
### Use mpirun to run parallel jobs
mpirun ./a.out > outfile.out

 

Large MPI job script example

#!/bin/bash
######################################
# Example of a SLURM job script for Talon3
# Job Name: Sample_Job
# Number of cores: 112
# Number of nodes: 4
# QOS: general
# Run time: 12 hrs
######################################
 
#SBATCH -J Sample_Job
#SBATCH -o Sample_job.o%j
#SBATCH -p public
#SBATCH --qos general
#SBATCH -N 4
#SBATCH -n 112
#SBATCH -t 12:00:00
 
### Loading modules
module load intel/PS2017
 
## Use mpirun for MPI jobs
mpirun ./a.out > outfile.out

 

OpenMP job script example

#!/bin/bash
######################################
# Example of a SLURM job script for Talon3
# Job Name: Sample_Job
# Number of MPI tasks: 1
# Number of nodes: 1
# QOS: general
# Run time: 12 hrs
######################################
 
#SBATCH -J Sample_Job
#SBATCH -o Sample_job.o%j
#SBATCH -p public
#SBATCH --qos general
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --cpus-per-task=28
#SBATCH -t 12:00:00
 
### Loading modules
module load intel/PS2017
 
### Set the number of threads
export OMP_NUM_THREADS=28
 
./a.out > outfile.out

 

CUDA parallel GPU job script example

#!/bin/bash
######################################
# Example of a SLURM job script for Talon3
# Job Name: Sample_Job
# Number of devices(GPUs): 2
# Number of nodes: 1
# QOS: general
# Run time: 12 hrs
######################################

#SBATCH -J Sample_Job
#SBATCH --ntasks=1
#SBATCH --qos=general
#SBATCH -p public
#SBATCH --gres=gpu:2
#SBATCH -t 12:00:00

### execute code
./a.out -numdevices=2

 

Submit the job:
$ sbatch slurm-job.sh

Interactive Jobs

Interactive job sessions can be used on Talon 3 if you need to compile or test software. An example command for starting an interactive session is shown below:

srun -p public --qos general -N 1 --pty bash

This launches an interactive job session with a bash shell on a compute node. From there, you can execute software and shell commands that would otherwise not be allowed on the Talon 3 login nodes.

List jobs:
$ squeue -u $USER

Output:
  JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    106 public slurm-jo  rstober   R   0:04      1 atom01

Get job details:
$ scontrol show job 106

Kill a job (users can kill their own jobs; root can kill any job):
$ scancel $JOB_ID
where $JOB_ID is the ID number of the job you want to kill.

Hold a job:
$ scontrol hold 139

Release a job:
$ scontrol release 139