Queuing System Guide—SLURM

Overview

In an HPC cluster, user tasks that run on the compute nodes are controlled by a batch queuing system. On Talon 3, we have chosen the Slurm Workload Manager (SLURM).

Queuing systems manage job requests (shell scripts, generally referred to as jobs) submitted by users. In other words, to get your computations done by the cluster, you must submit a job request to a specific batch queue. The scheduler assigns your job to a compute node in the order determined by the policy of that queue and the availability of an idle compute node. Talon 3 currently has several policies in place to help guarantee fair resource utilization.

Partitions

On Talon 3, the main partition is named 'public'. It is limited to 672 CPUs (24 compute nodes).

There are other private partitions for users who need more computing resources. Please contact hpc-admin@unt.edu for more information and to request access to these partitions.
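
You can check which partitions are visible to your account and their current state with the standard SLURM sinfo command from a login node (output will vary with cluster load):

$ sinfo -p public
$ sinfo --summarize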

Quality of Service (QOS)

The following QOS levels are available under the 'public' partition:

Name        Description
debug       For running quick test computations for debugging purposes. Time limit: two hours; limit: two compute nodes. Exclusive jobs allowed. High priority.
general     The default QOS, for jobs that take 72 hours or less. Time limit: 72 hours; limit: 616 CPUs.
large       For large jobs that require more resources. Time limit: three weeks; limit: 22 compute nodes. Exclusive jobs allowed. Low priority.
unlimited   Limits: two jobs. Exclusive jobs allowed.
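
For example, to run a short test under the debug QOS, a job script header might include directives like the following (a minimal sketch; adjust the node count and time to your job, within the limits above):

#SBATCH -p public
#SBATCH --qos debug
#SBATCH -N 1
#SBATCH -t 01:00:00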

SLURM Commands

The following table lists frequently used commands.

Slurm Command        Description
sbatch script.job    submit a job
squeue -j job_id     display the status of a specific job
squeue -u EUID       display the status of a user's jobs
squeue               display a summary of the queue
scancel job_id       cancel (delete) a pending or running job
scontrol update      modify a pending job
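
A typical workflow using these commands might look like the following (the job ID shown is illustrative):

$ sbatch script.job
Submitted batch job 12345
$ squeue -u $USER
$ scancel 12345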

Job State

When using squeue, the following job states are possible.  

State Full State Name Description
R RUNNING The job currently has an allocation.
CA CANCELED The job was explicitly canceled by the user or system administrator. The job may or may not have been initiated.
CD COMPLETED The job has terminated all processes on all nodes.
CF CONFIGURING The job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
CG COMPLETING The job is in the process of completing. Some processes on some nodes may still be active.
F FAILED The job terminated with non-zero exit code or other failure condition.
NF NODE_FAIL The job terminated due to failure of one or more allocated nodes.
PD PENDING The job is awaiting resource allocation.
PR PREEMPTED The job terminated due to preemption.
S SUSPENDED The job has an allocation, but execution has been suspended.
TO TIMEOUT The job terminated upon reaching its time limit.
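
squeue can also filter by state; for example, to list only your pending and running jobs:

$ squeue -u $USER -t PENDING,RUNNING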

The following table lists common SLURM environment variables; for a complete list, see the sbatch man page.

SLURM Variable             Description
SLURM_SUBMIT_DIR           directory from which the job was submitted
SLURM_JOB_ID               unique identifier assigned when the job was submitted
SLURM_NTASKS               number of tasks (CPU cores) in use by a parallel job
SLURM_NNODES               number of nodes in use by a parallel job
SLURM_ARRAY_TASK_ID        index number of the current array job task
SLURM_JOB_CPUS_PER_NODE    number of CPU cores available to the job on each node
SLURM_JOB_NAME             name of the job
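
These variables can be referenced inside a job script. A minimal sketch is shown below (the job name var_demo is just an example; the echo line simply reports where and how the job is running):

#!/bin/bash
#SBATCH -J var_demo
#SBATCH -p public
#SBATCH --qos general
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 00:10:00

### Run from the directory the job was submitted from
cd $SLURM_SUBMIT_DIR

echo "Job $SLURM_JOB_NAME ($SLURM_JOB_ID) is using $SLURM_NTASKS task(s) on $SLURM_NNODES node(s)"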

Job Submission Tips

At the top of your job script, include special directives beginning with #SBATCH, which are sbatch options. Alternatively, these options can also be given on the command line to sbatch or srun.

  • #SBATCH -p public
    Defines the partition which may be used to execute this job. The only partition on Talon 3 is 'public'.
  • #SBATCH -J job_name
    Defines the job name.
  • #SBATCH -o JOB.o%j
    Defines the output file name.
  • #SBATCH -e JOB.e%j
    Defines the error file name.
  • #SBATCH --qos general
    Defines the QOS under which the job will be executed (debug, general, and large are the only options).
  • #SBATCH --exclusive
    Sets the job to be exclusive, preventing other jobs from sharing the compute node. This is required for all large QOS submissions.
  • #SBATCH -t 80:00:00
    Sets the wall-time limit for the job in hh:mm:ss.
  • #SBATCH -n 84
    Defines the total number of CPU tasks.
  • #SBATCH -N 3
    Defines the number of compute nodes requested.
  • #SBATCH --ntasks-per-node 28
    Defines the number of tasks per node.
  • #SBATCH -C c6320
    Requests the c6320 compute nodes. (Also can request r420, r720, and r730 compute nodes)
  • #SBATCH --mail-user=user@unt.edu
    Sets the email address for job notifications.
  • #SBATCH --mail-type=begin
    Email user when job begins.
  • #SBATCH --mail-type=end
    Email user when job finishes.

Basic Information about Slurm

The Slurm Workload Manager (formerly known as the Simple Linux Utility for Resource Management, or SLURM) is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job such as MPI) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending jobs. Slurm is the workload manager on about 60% of the TOP500 supercomputers, including Tianhe-2, which until 2016 was the world's fastest computer. Slurm uses a best fit algorithm based on Hilbert curve scheduling or fat tree network topology in order to optimize locality of task assignments on parallel computers.

Slurm Tutorials and Commands:

A Quick-Start Guide for those unfamiliar with Slurm can be found at https://slurm.schedmd.com/quickstart.html

Slurm tutorial videos, for additional information, can be found at https://slurm.schedmd.com/tutorials.html

Sample Slurm Job Script:

Simple serial job script example

#!/bin/bash
######################################
# Example of a SLURM job script for Talon3
# Job Name: Sample_Job
# Number of cores: 1
# Number of nodes: 1
# QOS: general
# Run time: 12 hrs
######################################
 
#SBATCH -J Sample_Job
#SBATCH -o Sample_job.o%j
#SBATCH -p public
#SBATCH --qos general
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 12:00:00
#SBATCH -C r420
 
### Loading modules 
module load intel/compilers/18.0.1
 
./a.out > outfile.out

 

Simple parallel MPI job script example

#!/bin/bash
######################################
# Example of a SLURM job script for Talon3
# Job Name: Sample_Job
# Number of cores: 16
# Number of nodes: 1
# QOS: general
# Run time: 12 hrs
######################################
 
#SBATCH -J Sample_Job
#SBATCH -o Sample_job.o%j
#SBATCH -p public
#SBATCH --qos general
#SBATCH -N 1
#SBATCH -n 16
#SBATCH --ntasks-per-node 16
#SBATCH -t 12:00:00
#SBATCH -C r420
 
### Loading modules
module load PackageEnv/intel17.0.4_gcc8.1.0_MKL_IMPI_AVX
 
### Use mpirun to run parallel jobs
mpirun ./a.out > outfile.out

 

Large MPI job script example

#!/bin/bash
######################################
# Example of a SLURM job script for Talon3
# Job Name: Sample_Job
# Number of cores: 64
# Number of nodes: 4
# QOS: general
# Run time: 12 hrs
######################################
 
#SBATCH -J Sample_Job
#SBATCH -o Sample_job.o%j
#SBATCH -p public
#SBATCH --qos general
#SBATCH -N 4
#SBATCH -n 64
#SBATCH --ntasks-per-node 16
#SBATCH -t 12:00:00
#SBATCH -C r420
 
### Loading modules
module load PackageEnv/intel17.0.4_gcc8.1.0_MKL_IMPI_AVX
 
## Use mpirun for MPI jobs
mpirun ./a.out > outfile.out

 

OPENMP job script example

#!/bin/bash
######################################
# Example of a SLURM job script for Talon3
# Job Name: Sample_Job
# Number of MPI tasks: 1
# Number of nodes: 1
# QOS: general
# Run time: 12 hrs
######################################
 
#SBATCH -J Sample_Job
#SBATCH -o Sample_job.o%j
#SBATCH -p public
#SBATCH --qos general
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --cpus-per-task 28
#SBATCH -t 12:00:00
#SBATCH -C c6320
 
### Loading modules
module load PackageEnv/intel17.0.4_gcc8.1.0_MKL_IMPI_AVX
 
### Set the number of threads
export OMP_NUM_THREADS=28
 
./a.out > outfile.out

 

CUDA parallel GPU job script example

#!/bin/bash
######################################
# Example of a SLURM job script for Talon3
# Job Name: Sample_Job
# Number of devices(GPUs): 2
# Number of nodes: 1
# QOS: general
# Run time: 12 hrs
######################################


#SBATCH -J Sample_Job
#SBATCH --ntasks=1
#SBATCH --qos=general
#SBATCH -p public
#SBATCH --gres=gpu:2
#SBATCH -t 12:00:00


### execute code
./a.out -numdevices=2

 

Submit the job:
$ sbatch slurm-job.sh

Interactive Jobs

Interactive job sessions can be used on Talon 3 if you need to compile or test software. An example command for starting an interactive session is shown below:

srun -p public --qos debug -N 1  --pty bash

This launches an interactive job session and opens a bash shell on a compute node. From there, you can execute software and shell commands that would otherwise not be allowed on the Talon login nodes.
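
If you need more than the default allocation, you can request additional cores and a specific time limit for the interactive session; for example (the core count and time shown are illustrative, and must stay within the QOS limits above):

srun -p public --qos debug -N 1 -n 1 --cpus-per-task=16 -t 01:00:00 --pty bash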

List jobs:
$ squeue -u $USER


Get job details:
$ scontrol show job 106
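
Modify a pending job (for example, lowering its requested time limit; the job ID and value shown are illustrative):
$ scontrol update JobId=106 TimeLimit=24:00:00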

Kill a job. Users can kill their own jobs; root can kill any job.
$ scancel $JOB_ID
where $JOB_ID is the ID number of the job you want to kill


Hold a job:
$ scontrol hold 139


Release a job:
$ scontrol release 139