Linux Cluster

Table of Contents

  1. Access and Job Restrictions
  2. SLURM and SGE Syntax Comparison
  3. How To Submit Single-core Jobs
  4. How To Kill a Job
  5. Interactive Jobs
  6. How to Monitor Jobs
  7. Submitting Jobs to the High Partition
  8. Submitting Parallel Jobs
  9. Submitting GPU Jobs

The SCF operates a Linux cluster with a total of 352 cores. 256 of the cores are in the 'low' partition (formerly the 'SGE' production cluster). 96 of the cores (the newest and fastest cores) are in the 'high' partition (formerly the high-priority 'SLURM' cluster). One node has an NVIDIA Tesla K20Xm GPGPU. For more information on access to this cluster, please contact consult [at] stat [dot] berkeley [dot] edu.

The 'low' partition has eight nodes, each with 32 cores and 256 GB dedicated RAM. The 'high' partition has four nodes, each with two 6-core CPUs (i.e., 12 cores per node) and 128 GB dedicated RAM. Each core has two hyperthreads, for a total of 96 processing units. 

The cluster is managed by the SLURM queueing software. SLURM provides a standard batch queueing system through which users submit jobs to the cluster. Jobs are typically submitted to SLURM using a user-defined shell script that executes one's application code. Interactive use is also an option. Users may also query the cluster to see job status. As currently set up, the cluster is designed for processing single-core and multi-core/threaded jobs (at most 32 cores per job), as well as distributed memory jobs that use MPI. All software running on SCF Linux machines is available on the cluster. Users can also compile programs on any SCF Linux machine and then run that program on the cluster.

Below is more detailed information about the cluster and how to use it.

Access and Job Restrictions

The cluster is open to Department of Statistics faculty, grad students, postdocs, and visitors using their SCF logon. Class account users do not have access by default, but instructors can email manager [at] stat [dot] berkeley [dot] edu to discuss access for their class.

Currently users may submit jobs on the following submit hosts:

  arwen, beren, bilbo, gandalf, gimli, legolas, pooh, radagast, roo, shelob, springer, treebeard

The cluster has multiple partitions, corresponding to groups of nodes. The different partitions have different hardware and job restrictions as discussed here: 

Partition   Max. # cores per user (running)   Time Limit   Max. memory/job   Max. # cores/job
low         256                               28 days      256 GB            32**
high*       24                                7 days       128 GB            24**
GPU*        2 (CPU cores)                     28 days      128 GB (CPU)      2

* See How to Submit Jobs to the High Partition or How to Submit GPU Jobs.

** If you use MPI (including foreach with doMPI in R), you can run individual jobs across more than one node. See the Subsection on "Submitting MPI Jobs" for such cases.

We have implemented a 'fair share' policy that governs the order in which jobs that are waiting in a given queue start when resources become available. In particular, if two users each have a job sitting in a queue, the job that will start first will be that of the user who has made less use of the cluster recently (measured in terms of CPU time). The measurement of CPU time downweights usage over time, with a half-life of one month, so a job that ran a month ago will count half as much as a job that ran yesterday. Apart from this prioritization based on recent use, all users are treated equally.

SLURM and SGE Syntax Comparison

If you're used to the SGE commands on our former production cluster (now the 'low' partition), the following table summarizes the analogous commands under SLURM.

Comparison of SGE and SLURM commands
SGE                       SLURM
qsub job.sh               sbatch job.sh
qstat                     squeue
qdel {job_id}             scancel {job_id}
qrsh -q interactive.q     srun --pty /bin/bash
                          srun --pty --x11=first {command}

How to Submit a Simple Single-core Job

Prepare a shell script containing the instructions you would like the system to execute. When submitted using the instructions in this section, your code should only use a single core at a time; it should not start any additional processes. In the later sections of this document, we describe how to submit jobs that use a variety of types of parallelization.

For example, a simple script to run an R program called 'simulate.R' would contain these lines:

#!/bin/bash
R CMD BATCH --no-save simulate.R simulate.out

Once logged onto a submit host, use the sbatch command with the name of the shell script (assumed to be job.sh here) to enter a job into the queue:

arwen:~/Desktop$ sbatch job.sh
Submitted batch job 380

Here the job was assigned job ID 380. Results that would normally be printed to the screen via standard output and standard error will be written to a file called slurm-380.out.

Any Matlab jobs submitted in this fashion must start Matlab with the -singleCompThread flag in your job script.
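For example, a minimal sketch of such a job script (assuming a Matlab program file called 'simulate.m'):

#!/bin/bash
# -singleCompThread restricts Matlab to a single computational thread
matlab -nodesktop -nodisplay -singleCompThread < simulate.m > simulate.out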

Similarly, any SAS jobs submitted in this fashion must start SAS with the following flags in your job script.

sas -nothreads -cpucount 1

SLURM provides a number of additional flags to control what happens; see the man page for sbatch for help with these. Here are some examples, placed in the job script file, where we name the job, ask for email updates, and name the output and error files:

#!/bin/bash
#SBATCH --job-name=myAnalysisName
#SBATCH --mail-type=ALL                       
#SBATCH --mail-user=blah [at] berkeley [dot] edu
#SBATCH -o myAnalysisName.out #File to which standard out will be written
#SBATCH -e myAnalysisName.err #File to which standard err will be written
R CMD BATCH --no-save simulate.R simulate.Rout

How to Kill a Job

First, find the job-id of the job, by typing squeue at the command line of a submit host (see Monitoring Jobs).

Then use scancel to delete the job (with id 380 in this case):

scancel 380

Interactive Jobs

You can work interactively on a cluster node from the Linux shell command line by starting an interactive job with srun.

The syntax for requesting an interactive (bash) shell session is:

srun --pty /bin/bash

This will start a shell on one of the cluster nodes. You can then act as you would on any SCF Linux compute server. For example, you might use top to assess the status of one of your non-interactive (i.e., batch) cluster jobs. Or you might test some code before running it as a batch job. You can also transfer files to the local disk of the cluster node.

If you want to run a program that involves a graphical interface (requiring an X11 window), you need to add --x11=first to your srun command. So you could directly run Matlab, e.g., as follows:

srun --pty --x11=first matlab 

or you could add the --x11=first flag when requesting an interactive shell session and then subsequently start a program that has a graphical interface.
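For example, a sketch of that second approach:

srun --pty --x11=first /bin/bash   # interactive shell with X11 forwarding enabled
matlab                             # then start the graphical program from within that shell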

To run an interactive session in which you would like to use multiple cores, do the following (here we request 4 cores for our use):

srun --pty --cpus-per-task 4 /bin/bash 

Note that "-c" is a shorthand for "--cpus-per-task".

To monitor a job on a specific node or transfer files to the local disk of a specific node, you need to request that your interactive session be started on the node of interest (in this case scf-sm20):

srun --pty -p high -w scf-sm20 /bin/bash

Note that if that specific node has 24 or 32 cores in use (depending on the partition), you will need to wait until resources become available on that node before your interactive session will start. The squeue command (see below in the section on monitoring jobs) will tell you on which node a given job is running.

Finally, you can request multiple cores using -c, as with batch jobs. As with batch jobs, you can change OMP_NUM_THREADS from its default of one, provided you make sure that the total number of cores used (the number of processes your code starts multiplied by the threads per process) does not exceed the number of cores you request.
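For example, a minimal sketch of a four-core interactive session in which threaded code is then allowed to use all four cores:

srun --pty -c 4 /bin/bash                     # request 4 cores for the interactive session
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # inside the session, allow up to 4 threads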

How To Monitor Jobs and See Output

The SLURM command squeue provides info on job status:

arwen:~/Desktop> squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               381      high   job.sh paciorek  R      25:28      1 scf-sm20
               380      high   job.sh paciorek  R      25:37      1 scf-sm20

The following will tailor the output to include information on the number of cores (the CPUs column below) being used as well as other potentially useful information:

arwen:~/Desktop> squeue -o "%.7i %.9P %.20j %.8u %.2t %l %.9M %.5C %.8r %.6D %R %p %q %b"
   JOBID PARTITION                 NAME     USER ST TIME_LIMIT      TIME  CPUS   REASON  NODES NODELIST(REASON) PRIORITY QOS GRES
     49       low                 bash paciorek  R 28-00:00:00   1:23:29     1     None      1 scf-sm00 0.00000017066486 normal (null)
     54       low              kbpew2v paciorek  R 28-00:00:00     11:01     1     None      1 scf-sm00 0.00000000488944 normal (null)

The 'ST' field indicates whether a job is running (R), failed (F), or pending (PD). The latter occurs when there are not yet enough resources on the system for your job to run.

Job output that would normally appear in your terminal session will be sent to a file named slurm-<jobid>.out where <jobid> will be the number of the job (visible via squeue as above).

Submitting Jobs to the High Partition

To submit jobs to the faster nodes in the high partition, you must include either the '--partition=high' or '-p high' flag. By default jobs will be run in the low partition. For example:

arwen:~/Desktop$ sbatch -p high job.sh
Submitted batch job 380

You can also submit interactive jobs to the high partition by simply adding the flag to specify that partition, as above.
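For example:

srun --pty -p high /bin/bash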

Submitting Parallel Jobs

One can use SLURM to submit a variety of types of parallel code. Here is a set of potentially useful templates that we expect will account for most user needs. If your situation does not fall into these categories, if you have questions about parallel programming or about submitting jobs that use more than one core, or if you are not sure how to follow these rules, please email consult [at] stat [dot] berkeley [dot] edu.

For additional details, please see the notes from SCF workshops on the basics of parallel programming in R, Python, Matlab and C, with some additional details on using the cluster. If you're making use of the threaded BLAS, it's worth doing some testing to make sure that threading is giving a non-negligible speedup; see the notes above for more information.

Submitting Threaded Jobs

Here's an example job script to use multiple threads (4 in this case) in R (or with your own OpenMP-based program):

#!/bin/bash
#SBATCH --cpus-per-task 4
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
R CMD BATCH --no-save simulate.R simulate.Rout

This will allow your R code to use the system's threaded BLAS and LAPACK routines. [Note that in R you can instead use the omp_set_num_threads() function in the RhpcBLASctl package, again making use of the SLURM_CPUS_PER_TASK environment variable.] The same syntax in your job script will work if you've compiled a C/C++/Fortran program that makes use of OpenMP for threading. Just replace the R CMD BATCH line with the line calling your program.

Here's an example job script to use multiple threads (4 in this case) in Matlab:

#!/bin/bash
#SBATCH --cpus-per-task 4
matlab -nodesktop -nodisplay < simulate.m > simulate.out

IMPORTANT: At the start of your Matlab code file you should include this line:

feature('numThreads', str2num(getenv('SLURM_CPUS_PER_TASK')));

Here's an example job script to use multiple threads (4 in this case) in SAS:

#!/bin/bash
#SBATCH --cpus-per-task 4
sas -threads -cpucount $SLURM_CPUS_PER_TASK

Submitting Multi-core Jobs

The following example job scripts pertain to jobs that need to use multiple cores on a single node in contexts that do not fall under the threading/OpenMP case. This is relevant for parallel code in R that starts multiple R processes (e.g., foreach, mclapply, parLapply), for parfor in Matlab, and for Pool.map and pp.Server in Python.

Here's an example script that uses multiple cores (4 in this case):

#!/bin/bash
#SBATCH --cpus-per-task 4
R CMD BATCH --no-save simulate.R simulate.Rout

IMPORTANT: Your R, Python, or other code should use no more than the total number of cores requested (4 in this case). You can use the SLURM_CPUS_PER_TASK environment variable to control this programmatically.
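One possible approach, sketched below, is to pass the value from the job script to your code as a command-line argument (simulate.R is a hypothetical script that would read it with commandArgs(trailingOnly=TRUE)); your code could equally read the SLURM_CPUS_PER_TASK environment variable directly, e.g., via Sys.getenv() in R or os.environ in Python.

#!/bin/bash
#SBATCH --cpus-per-task 4
# Pass the allotted core count to the R script as a command-line argument
R CMD BATCH --no-save "--args ncores=$SLURM_CPUS_PER_TASK" simulate.R simulate.Rout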

The same syntax for your job script pertains to Matlab. IMPORTANT: when using parpool in Matlab, you should do the following:

parpool(str2num(getenv('SLURM_CPUS_PER_TASK')))

Submitting MPI Jobs

You can use MPI to run jobs across multiple nodes. This is useful in two ways. First, it allows you to use more than 24 or 32 cores (depending on the partition) for a single job. Second, if you need fewer cores but the free cores are scattered across the nodes and there are not sufficient cores on any one node, this allows you to make use of those scattered cores.

Here's an example script that uses multiple processors via MPI (36 in this case):

#!/bin/bash
#SBATCH -n 36
mpirun -np $SLURM_NTASKS myMPIexecutable

"myMPIexecutable" could be C/C++/Fortran code you've written that uses MPI, or R or Python code that makes use of MPI. More details are available here. One simple way to use MPI is to use the doMPI back-end to foreach in R. In this case you invoke R via mpirun as:

mpirun -np $SLURM_NTASKS R CMD BATCH file.R file.out

Note that in this case, unlike some other invocations of R via mpirun, mpirun starts all of the R processes.

Another use case for R in a distributed computing context is to use functions such as parSapply and parLapply after calling makeCluster with a character vector giving the nodes allocated by SLURM. If you run the following as part of your job script before the command invoking R, the file slurm.hosts will contain a list of the node names, which you can read into R and pass to makeCluster.

srun hostname -s > slurm.hosts
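For example, here is a sketch of a full job script along those lines (parallel.R is a hypothetical R script that would read slurm.hosts, e.g. with readLines, and pass the node names to makeCluster):

#!/bin/bash
#SBATCH -n 36
# Write the short hostname of each allocated task's node to slurm.hosts
srun hostname -s > slurm.hosts
R CMD BATCH --no-save parallel.R parallel.Rout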

To run an MPI job with each process threaded, your job script would look like the following (here with 18 processes and two threads per process):

#!/bin/bash
#SBATCH -n 18 -c 2
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun -np $SLURM_NTASKS -x OMP_NUM_THREADS myMPIexecutable

Submitting Data Parallel (SPMD) Code

Here's how you would set up your job script if you want to run multiple instances (18 in this case) of the same code.

#!/bin/bash
#SBATCH -n 18 
srun myExecutable

To have each instance behave differently, you can make use of the SLURM_PROCID environment variable, which will take a distinct value (0, 1, 2, ...) in each of the instances.
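For example, here is a sketch in which each instance processes its own input file (myExecutable and the input file names are just placeholders):

#!/bin/bash
#SBATCH -n 18
# Each of the 18 tasks started by srun sees its own SLURM_PROCID (0-17)
# and uses it here to pick a distinct input file
srun bash -c 'myExecutable input_${SLURM_PROCID}.dat'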

To have each process be threaded, see the syntax under the MPI section above.

Job Arrays: Submitting Multiple Jobs in an Automated Fashion

Job array submissions are a nice way to submit multiple jobs in which you vary a parameter across the different jobs.

Here's what your job script would look like, in this case to run a total of 5 jobs with parameter values of 0, 1, 2, 5, 7:

#!/bin/bash
#SBATCH -a 0-2,5,7
myExecutable

Your program should then make use of the SLURM_ARRAY_TASK_ID environment variable, which for a given job will contain one of the values from the set given with the -a flag (in this case from {0,1,2,5,7}). You could, for example, read SLURM_ARRAY_TASK_ID into your R, Python, Matlab, or C code.

Here's a concrete example where it's sufficient to use SLURM_ARRAY_TASK_ID to distinguish different input files if you need to run the same command (the bioinformatics program tophat in this case) on multiple input files (in this case, trans0.fq, trans1.fq, ...):

#!/bin/bash
#SBATCH -a 0-2,5,7
tophat BowtieIndex trans$SLURM_ARRAY_TASK_ID.fq

Submitting a GPU Job

To use the GPU on the scf-sm20 node, you need to specifically request that your job use the GPU as follows:

arwen:~/Desktop$ sbatch --partition=gpu -w scf-sm20-gpu --gres=gpu:1 job.sh

Once it starts, your job will have exclusive access to the GPU and its memory. If another user is using the GPU, your job will be queued until the current job finishes.

Interactive jobs should use that same gres flag with the usual srun syntax for an interactive job:

arwen:~/Desktop$ srun --pty --partition=gpu -w scf-sm20-gpu --gres=gpu:1 /bin/bash

scf-sm20-gpu is a virtual node - it is a set of two CPUs on the scf-sm20 node that are partitioned off for GPU use.

If you want to log on to the GPU node interactively to check on compute or memory use on the GPU, make sure not to specify the gres flag (otherwise your interactive job will just queue, waiting for your original job to finish!) and specify the actual CPU portion of the node:

arwen:~/Desktop$ srun --pty --partition=high -w scf-sm20-cpu /bin/bash

and then use nvidia-smi commands, e.g.,

scf-sm20:~$ nvidia-smi -q -d UTILIZATION,MEMORY -l 1

For details on setting up your code to use the GPU, please see this link.