Job management with SLURM

Griz uses the SLURM Workload Manager to allocate resources and schedule jobs. This means that to run computationally or memory-intensive jobs, you will have to submit a job script that specifies the resources you need and the commands you would like to run. Below I outline an example job script and also show how to connect to a compute node interactively.

Important!

Currently, during this test phase, there are no limits on the resources we can request or on how long jobs can run, but this will likely change when the server is out of testing!

Example job script

I am still learning the SLURM system (my old university used TORQUE), so I am open to feedback about best practices!

Below is a basic job script called job_script.sh:

	#!/bin/bash
	#SBATCH --job-name=[job name]
	#SBATCH --output="/path/to/desired/directory/%x-%j.out"
	#SBATCH --mail-user=[your email]
	#SBATCH --mail-type=ALL
	#SBATCH --partition=good_lab_cpu
	#SBATCH --nodes=1
	#SBATCH --ntasks=1
	#SBATCH --cpus-per-task=4
	#SBATCH --mem=96000
	#SBATCH --time=2:30:00 # How long the job should run
	## Above is all information for SLURM. It should all appear at the top of
	## the script, before the commands you want to run. SLURM treats lines
	## beginning with ## as comments.

	## Command(s) to run:

	source ~/bin/anaconda3/bin/activate
	conda activate biotools
	# Make sure the environment with the software you need is activated.

	cd /mnt/beegfs/gt156213e/
	# Simulate 1000 read pairs from the reference (seed 1); second mates go to /dev/null.
	wgsim -N1000 -S1 genomes/NC_008253_1K.fna simulated_reads/sim_reads.fq /dev/null
	# Align the simulated reads to the pre-built E. coli bowtie2 index.
	bowtie2 -x indexes/e_coli -U simulated_reads/sim_reads.fq -S alignments/sim_reads_aligned.sam
	# Convert the SAM alignments to BAM.
	samtools view -b -S -o alignments/sim_reads_aligned.bam alignments/sim_reads_aligned.sam
	# Count unmapped reads (SAM flag 4).
	samtools view -c -f 4 alignments/sim_reads_aligned.bam
	# Count reads with mapping quality of at least 42.
	samtools view -q 42 -c alignments/sim_reads_aligned.bam

Full documentation of the sbatch options can be found in the official SLURM documentation: https://slurm.schedmd.com/sbatch.html

Briefly, the options above are:

	--job-name:	A name to give your job that will appear in the queue.
	--output:	Location for SLURM to write log files. If not set, SLURM writes slurm-[job id].out in the directory you submit the job from. %x represents the job name and %j represents the job ID.
	--mail-user:	An email address to receive updates from SLURM about job progress.
	--mail-type:	What type of email updates you'd like to receive (NONE, BEGIN, END, FAIL, ALL).
	--partition:	The type of node you want to run your job on. See Node info.
	--nodes:	Number of nodes you will need for the job.
	--ntasks:	Number of tasks you need for your job. Each command is a task. If you run commands with parallel or srun, set this to the number of commands you want to run simultaneously (see the sketch after this list).
	--cpus-per-task:	Number of threads available for each task. If you run a program that is multithreaded, set this to the number of threads specified by that program.
	--mem:		The amount of memory you need for your job. Default unit is MB.
	--time: 	The amount of time needed for your job to run. Ignore for now!
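
To make --ntasks more concrete, below is a minimal sketch (command_a through command_d are hypothetical placeholders) of a job script that runs four single-CPU commands at the same time with srun:

	#!/bin/bash
	#SBATCH --job-name=parallel_example
	#SBATCH --partition=good_lab_cpu
	#SBATCH --nodes=1
	#SBATCH --ntasks=4
	#SBATCH --cpus-per-task=1
	#SBATCH --mem=16000

	## Each srun launches one task on the allocation; '&' runs it in the
	## background so all four start at once, and 'wait' pauses the script
	## until every task has finished. --exclusive gives each step its own CPU.
	srun --ntasks=1 --exclusive command_a &
	srun --ntasks=1 --exclusive command_b &
	srun --ntasks=1 --exclusive command_c &
	srun --ntasks=1 --exclusive command_d &
	wait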

Submitting your job script

Job scripts are submitted with the sbatch command:

sbatch job_script.sh

SLURM will read the options in the header of the file and assign resources accordingly before executing the desired commands. The job will be assigned an ID, and a log file will be written to the path given by --output (with the example header above, [job name]-[job id].out); if --output is not set, SLURM writes slurm-[job id].out in the directory you submitted from. Check this file if you encounter errors during your run.
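
When the job is accepted, sbatch prints the assigned ID (the number below is just an illustration):

sbatch job_script.sh
Submitted batch job 12345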

Monitoring jobs

The status of running jobs can be checked by running squeue.

A job can be cancelled by running scancel [job id].
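
For example, to list only your own jobs, pass your username to squeue with -u (the job shown below is illustrative):

squeue -u [username]

  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  12345 good_lab_ job_scri gt156213  R       5:21      1 compute-0-1

Running scancel 12345 would then cancel this job.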

I have also provided a script called sres in the good-utils repository that checks node resource availability. Simply clone the repository and add it to your $PATH to run it:
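
For example, assuming you clone good-utils into your home directory (the repository URL is left as a placeholder; use the lab's actual location):

	# Clone the good-utils repository and put it on your PATH.
	cd ~
	git clone [good-utils repository URL]
	export PATH=$PATH:~/good-utils
	# Add the export line to your ~/.bashrc to make the change permanent.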

By default sres prints information for all Good Lab partitions. You can provide it with the name of a particular partition to print info for only that one.

For example:

sres good_lab_cpu

should produce the following output:

NODE NAME     PARTITION(S)                    TOTAL CPUs    ALLOCATED CPUs    FREE CPUs    TOTAL MEM (MB)    ALLOCATED MEM (MB)    FREE MEM (MB)    STATE
compute-0-1   good_lab_cpu,good_lab_large_cpu 72            0                 72           772439            0                     765683           IDLE
compute-0-2   good_lab_cpu,good_lab_large_cpu 72            0                 72           772439            0                     755300           IDLE
compute-0-3   good_lab_cpu,good_lab_large_cpu 72            0                 72           772439            0                     692035           IDLE
compute-0-4   good_lab_cpu,good_lab_large_cpu 72            0                 72           772439            0                     722757           IDLE
compute-0-5   good_lab_cpu,good_lab_large_cpu 72            0                 72           772439            0                     765922           IDLE
compute-0-6   good_lab_cpu,good_lab_large_cpu 72            0                 72           772439            0                     752593           IDLE
compute-0-7   good_lab_cpu,good_lab_large_cpu 72            0                 72           772439            0                     753552           IDLE
compute-0-8   good_lab_cpu,good_lab_large_cpu 72            69                3            772439            0                     23577            MIXED

Running commands on a compute node interactively

In some instances, it may be preferable to allocate resources on a compute node and run commands manually rather than through a job script. This can be especially useful for debugging and testing workflows, and can be effectively combined with screen.

To run commands interactively, use salloc:

salloc -p good_lab_cpu -N1 --exclusive srun --pty bash

This will allocate one entire good_lab_cpu node (because of --exclusive) and open an interactive shell on it. Many more options are available for salloc; see the official SLURM documentation for more info: https://slurm.schedmd.com/salloc.html
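
If you do not need a whole node, a minimal sketch (the CPU and memory numbers are only examples) is to request specific resources instead of --exclusive:

salloc -p good_lab_cpu -N1 -n1 -c4 --mem=16000 srun --pty bash

Type exit in the resulting shell when you are finished so the allocation is released.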

Important!

Be aware that if you request resources with salloc that are unavailable, you may have to wait in the queue even for an interactive session.
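
As mentioned above, an interactive allocation combines well with screen, so your session survives if your connection to Griz drops. A minimal sketch (the session name is arbitrary):

screen -S interactive
salloc -p good_lab_cpu -N1 --exclusive srun --pty bash

Detach with Ctrl-a d and reattach later with screen -r interactive.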

Recommendation

Download the good-utils repository, which includes the interact command to request interactive node allocations. Use interact -h to see its usage.