How to create and use a Slurm cluster within the TRE

Introduction

Most TRE users work from a single workspace. The size of this workspace depends on the type of work users want to do, which is discussed by PIs/HIC during project initiation meetings. However, if a problem exceeds the capacity of a single workspace, it can benefit from being distributed across multiple machines. In these cases, we provide access to a service catalogue item that provisions a private High Performance Computing (HPC) environment. Advantages of this over a standard workspace include:

  • Massive increase in available computing resources - from just a few machines to hundreds

  • Cost savings, as your cluster will dynamically scale depending on how many jobs are running

  • Simplicity of workload management through the Slurm Workload Manager (Simple Linux Utility for Resource Management)

This knowledge base article walks through the process of creating your own cluster and provides an introduction to how it can be used.

Creating a cluster

The process is much the same as when you provision a workspace. In this case, the workspace you provision may have fewer resources than you are used to, but it will be configured to provision additional compute resources as jobs are submitted to Slurm.

  1. From the Studies tab, find and select the study you wish to link the cluster to.

  2. Towards the upper right, select Next.

  3. Select the latest Slurm service catalogue item and click Next.

    1. Enter a name for your workspace. This name will be used later to keep track of your data egress requests, so you may wish to name it something relevant to the analysis you will be performing.

    2. Ignore Restricted CIDR

    3. Select the account to create this workspace in. There will likely only be one, based on the study you selected earlier.

    4. Select Yes to “Allow inbound traffic from project environment”

    5. Select the appropriate image based on your software and hardware requirements.

    6. Enter a brief description - this is a required field. You can use it to record any information you may wish to keep track of for the workspace.

  4. Click Create Research Workspace

Provisioning your workspace will take 10-20 minutes. Once done, you’ll be able to connect to the graphical desktop environment as you would normally.


An Introduction to High Performance Computing and Slurm

Slurm is a workload manager, which means instead of running jobs yourself, you ask Slurm to run them for you. It will decide the best location for your job and allocate the appropriate resources, up to a certain limit. The section below gives a short introduction to the most commonly used commands within your Slurm environment.

Concepts

The entry point to your cluster is referred to as a login node. In more traditional HPC environments, multiple users from different groups log in to this single point and submit their jobs. For this reason, running CPU-, I/O- and memory-intensive workloads on the login node is strongly discouraged, since it may interfere with other people working interactively. However, within the Cloud TRE, a login node is provisioned per user study. The computers which actually perform your work are referred to as compute nodes. In our Cloud TRE’s environment, these are provisioned on demand for each job, depending on the amount of CPU and memory requested.
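You can see how Slurm views the cluster with the sinfo command, run from the login node. The output below is only a sketch - the partition and node names will differ on your cluster - but the "~" suffix on the state is how Slurm marks cloud nodes that are currently powered down and will only be started when a job needs them.

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite     32  idle~ compute-dy-[1-32]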

One important aspect to keep in mind is that a cluster will not appear as a single computer - your workloads have to be designed to utilise a cluster efficiently. Workloads often fall into one of three categories:

  • Embarrassingly Parallel (also called Delightfully Parallel) - These kinds of jobs scale well across hundreds or thousands of machines due to absolute independence between the work chunks. A great example of an embarrassingly parallel workload would be resizing images. You might have several million images which you need to resize, but the size of one does not (usually) depend on the size of others. They can be operated on independently.

  • Communication Intensive - These jobs often scale quite well up to a certain point, but then communication between the tasks begins to occupy the majority of the resources. A good example of this might be training a deep neural network. Whether training on a single computer with multiple GPUs, or across many GPUs on multiple computers, keeping the parameters synchronised across these instances after each step of back-propagation takes a significant amount of bandwidth. At a certain point, the communication will take more time than the actual computation.

  • Inherently Sequential - Unfortunately, for these jobs it does not matter how many cores you throw at them; there may not be any performance gain. This is either because the task depends heavily on atomic operations, or because the steps have an inherent dependency and each one cannot begin until the last completes. The only ways to improve these problems are to rewrite them in a more efficient language, switch to a parallelisable algorithm, or use a faster computer.

Submitting a job

Jobs can be submitted to the cluster in two ways: interactively or as batch jobs. The former allows you to see the output from the process in real time, and in certain cases even allows keyboard input. However, this can become difficult when many processes need to be started at once, as the output from one process can clobber the output from another.

An interactive job can be started using the srun command. As an example, we can submit a simple job which will run on eight nodes and display their hostname.

$ srun -N8 hostname -f
srun: Requested partition configuration not available now
srun: job 1 queued and waiting for resources
ip-10-108-4-176.eu-west-2.compute.internal
ip-10-108-20-106.eu-west-2.compute.internal
ip-10-108-11-131.eu-west-2.compute.internal
ip-10-108-17-18.eu-west-2.compute.internal
ip-10-108-25-219.eu-west-2.compute.internal
ip-10-108-29-128.eu-west-2.compute.internal
ip-10-108-22-245.eu-west-2.compute.internal
ip-10-108-16-223.eu-west-2.compute.internal

If you haven’t run any jobs recently, this might take a few moments as compute nodes are created for your job in the background. Processes which require keyboard input can only be invoked as a single task, and also require the --pty argument, as this creates a pseudoterminal to redirect input and output. For example, we can start a shell session on one of the compute nodes.

$ nproc
2
$ srun --pty bash
ubuntu@ip-10-108-22-245:~$ nproc
16
ubuntu@ip-10-108-22-245:~$ exit
$

Batch Job Submission

A more typical use case might be to submit a job directly to the cluster. This means any terminal windows open on your login node can be closed; the tasks will continue in the background. Submission of batch jobs to Slurm is done by writing a script which will be executed on the compute nodes. Below is a simple job which produces some output.

#!/bin/bash
# Request 8 tasks spread across 8 nodes.
#SBATCH -n8 -N8
# 64 MB of memory per node and a single CPU per task.
#SBATCH --mem 64m --cpus-per-task 1
# Write the job's output to the shared Lustre filesystem, tagged with the job ID.
#SBATCH --output=/lustre/slurm-%j.out

# Launch the 8 tasks; each prints its task ID, hostname and the submitting host.
srun -n8 bash <<EOF
echo "I am part of job \${SLURM_JOB_ID} running task \${SLURM_PROCID}."
echo "My name is \$(hostname) and I work for \${SLURM_SUBMIT_HOST}."
EOF

If we call this file something like “hello.sh”, we can submit it using sbatch and then check on the progress of the job.
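As a sketch (the job ID shown is illustrative), submission and monitoring might look like the following: sbatch prints the ID of the new job, squeue shows its current state while it queues and runs, and the output file named in the script appears on the shared filesystem once the tasks start writing to it.

$ sbatch hello.sh
Submitted batch job 4242
$ squeue -j 4242
$ cat /lustre/slurm-4242.out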

Over time, we will aim to produce further documentation demonstrating how to use Slurm for a variety of tasks, such as with MATLAB and PyTorch.


A More Advanced Example

There are many ways to split your job into smaller, parallelisable parts. A common problem is needing to perform a number of operations on hundreds to millions of files, whether this is a simple resize operation, feature extraction or exploring some combination of parameters. This section aims to demonstrate a few of these approaches.

Processing many files

Let’s say you have a directory of 1 million images. We could use MATLAB to go through them all one by one and perform a resize, for example with a simple for-loop calling imread, imresize and imwrite on each file.

For a small number of files, this is fine - it may well take longer to parallelise the code than to simply let it run. Within MATLAB, we can easily parallelise this on a single host by replacing for with parfor. However, at a certain point, we need to split up the processing of these files across several machines. There are a couple of ways we can define these chunks:


For chunks containing 10,000 files each:
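One way, sketched here in bash (the image directory and file pattern are illustrative), is to write the list of files once and then use split to break that list into chunks of 10,000 lines, producing files named chunk_00, chunk_01 and so on:

$ find /lustre/images -name '*.png' > all_files.txt
$ split -d -l 10000 all_files.txt chunk_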

For 30 chunks of arbitrary size:
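Reusing the same file list, GNU split can instead divide it into a fixed number of chunks without breaking any lines:

$ split -d -n l/30 all_files.txt chunk_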

Our job is then able to index either a specific file or a specific chunk, and work through those files. For example, we might run a job along the following lines.
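The script below is a sketch of this approach. It is written as a job array, so each task reads the chunk file matching its array index; the resize itself uses ImageMagick's convert purely as an illustration (an assumption about the installed software), so substitute whatever per-file command or MATLAB invocation your analysis actually needs, and adjust the resource requests to match.

#!/bin/bash
#SBATCH --mem 512m --cpus-per-task 1
#SBATCH --output=/lustre/resize-%A_%a.out

# Each array task picks the chunk file matching its task ID (chunk_00, chunk_01, ...).
CHUNK=$(printf "chunk_%02d" "${SLURM_ARRAY_TASK_ID}")

# Work through every file listed in this chunk, writing a resized copy alongside it.
while read -r FILE; do
    convert "${FILE}" -resize 50% "${FILE%.png}_small.png"
done < "${CHUNK}"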

This can then be run in parallel using sbatch or srun, e.g.
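For instance, if the sketch above were saved as resize_chunks.sh (the name is illustrative), submitting it as a 30-task job array runs one task per chunk, with Slurm provisioning compute nodes as they are needed:

$ sbatch --array=0-29 resize_chunks.sh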

Note that parfor can also be used for the inner for-loop alongside Slurm, so there can be multiple levels of parallelism.

Workspace Configuration Version History

We aim to continuously improve and refine our workspace configurations, adding support for new functionality or fixing issues. 


For queries or comments regarding HIC How To articles, contact HICSupport@dundee.ac.uk

 
