Linux Cluster

Table of Contents

  1. Access and Job Restrictions
  2. SLURM and SGE Syntax Comparison
  3. How To Submit Single-core Jobs
  4. How To Kill a Job
  5. Interactive Jobs
  6. How to Monitor Jobs and See Output
  7. How To Monitor Cluster Usage
  8. Submitting Jobs to the High Partitions
  9. Submitting Parallel Jobs
  10. Submitting GPU Jobs

The SCF operates a Linux cluster with a total of 556 cores.

  • 256 of the cores are in the 'low' partition.
    • 8 nodes, 32 cores and 256 GB RAM per node.
    • By default jobs submitted to the cluster will run here.
  • 96 of the cores are in the 'high' partition.
    • 4 nodes, 12 cores (each with two hyperthreads) and 128 GB RAM per node.
    • Newer and faster nodes than in the 'low' partition.
  • 48 cores are in the 'high_pre' partition.
    • 1 node, 24 cores (each with two hyperthreads) and 376 GB RAM.
    • Newer and perhaps somewhat faster cores than in the 'high' partition.
    • Very fast disk I/O (using NVMe SSDs) to files located in /tmp and /var/tmp.
    • Jobs are subject to preemption at any time and will be cancelled in that case without warning (more details below).
  • 144 cores are in the 'lowmem' partition, intended to supplement our other partitions, particularly during heavy cluster usage.
    • 9 nodes, 16 cores and 12-16 GB RAM per node.
    • Older, slower nodes that are comparable to, or a bit slower than, the nodes in the 'low' partition.
    • Only suitable for jobs that have low memory needs, but you should feel free to request an entire node for jobs needing more memory, even if your code will only use one or a few cores. 

The cluster is managed by the SLURM queueing software. SLURM provides a standard batch queueing system through which users submit jobs to the cluster. Jobs are typically submitted to SLURM using a user-defined shell script that executes one's application code. Interactive use is also an option. Users may also query the cluster to see job status. As currently set up, the cluster is designed for processing single-core and multi-core/threaded jobs (at most 32 cores per job in the 'low' partition), as well as distributed memory jobs that use MPI. All software running on SCF Linux machines is available on the cluster. Users can also compile programs on any SCF Linux machine and then run that program on the cluster.

Below is more detailed information about the cluster and how to use it.

Access and Job Restrictions

The cluster is open to Department of Statistics faculty, grad students, postdocs, and visitors using their SCF logon. Class account users do not have access by default, but instructors can email manager [at] stat [dot] berkeley [dot] edu to discuss access for their class.

Currently users may submit jobs on the following submit hosts:

  arwen, beren, gandalf, hagrid, harry, radagast, shelob

The cluster has multiple partitions, corresponding to groups of nodes. The different partitions have different hardware and job restrictions as discussed here: 

Partition         Max. # cores per user (running)   Time Limit   Max. memory/job   Max. # cores/job
low               256                               28 days      256 GB            32***
high*             36                                7 days       128 GB            24***
high_pre*         48                                28 days**    376 GB            48
lowmem            144                               28 days      12-16 GB          16***
gpu*              2 CPU cores                       28 days      varied (CPU)      2
gpu_jsteinhardt*  varied                            28 days**    varied (CPU)      2
gpu_yugroup*      2 CPU cores                       28 days**    varied (CPU)      2

* See How to Submit Jobs to the High Partitions or How to Submit GPU Jobs.

** Preemptible

*** If you use software that can parallelize across multiple nodes (e.g., R packages that use MPI or the future package, Python's Dask or IPython Parallel, MATLAB, MPI), you can run individual jobs across more than one node. See the Subsection on "Submitting MPI or Other Multi-Node Jobs" for such cases.

We have implemented a 'fair share' policy that governs the order in which jobs that are waiting in a given queue start when resources become available. In particular, if two users each have a job sitting in a queue, the job that will start first will be that of the user who has made less use of the cluster recently (measured in terms of CPU time). The measurement of CPU time downweights usage over time, with a half-life of one month, so a job that ran a month ago will count half as much as a job that ran yesterday. Apart from this prioritization based on recent use, all users are treated equally.

SLURM and SGE Syntax Comparison

If you're used to the SGE commands on our former production cluster (now the 'low' partition), the following table summarizes the analogous commands under SLURM.

Comparison of SGE and SLURM commands:

SGE                       SLURM
qsub                      sbatch
qstat                     squeue
qdel {job_id}             scancel {job_id}
qrsh -q interactive.q     srun --pty /bin/bash
                          srun --pty --x11 {command}

How to Submit a Simple Single-core Job

Prepare a shell script containing the instructions you would like the system to execute. When submitted using the instructions in this section, your code should only use a single core at a time; it should not start any additional processes. In the later sections of this document, we describe how to submit jobs that use a variety of types of parallelization.

For example, a simple script to run an R program called 'simulate.R' would contain this line:

R CMD BATCH --no-save simulate.R simulate.out

Once logged onto a submit host, use the sbatch command with the name of your shell script (called job.sh here) to enter a job into the queue:

arwen:~/Desktop$ sbatch job.sh
Submitted batch job 380

Here the job was assigned job ID 380. Results that would normally be printed to the screen via standard output and standard error will be written to a file called slurm-380.out.

Any Matlab jobs submitted in this fashion must start Matlab with the -singleCompThread flag in your job script.

Similarly, any SAS jobs submitted in this fashion must start SAS with the following flags in your job script.

sas -nothreads -cpucount 1

SLURM provides a number of additional flags to control what happens; you can see the man page for sbatch for help with these. Here are some examples, placed in the job script file, where we name the job, ask for email updates and name the output and error files:

#SBATCH --job-name=myAnalysisName
#SBATCH --mail-type=ALL                       
#SBATCH --mail-user=blah [at] berkeley [dot] edu
#SBATCH -o myAnalysisName.out #File to which standard out will be written
#SBATCH -e myAnalysisName.err #File to which standard err will be written
R CMD BATCH --no-save simulate.R simulate.Rout
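Putting these pieces together, a complete job script might look like the following sketch (the job name and file names are placeholders); note the interpreter line at the top, which tells SLURM which shell should run the script:

```shell
#!/bin/bash
#SBATCH --job-name=myAnalysisName
#SBATCH -o myAnalysisName.out   # file to which standard out will be written
#SBATCH -e myAnalysisName.err   # file to which standard err will be written
R CMD BATCH --no-save simulate.R simulate.Rout
```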

How to Kill a Job

First, find the job-id of the job, by typing squeue at the command line of a submit host (see Monitoring Jobs).

Then use scancel to delete the job (with id 380 in this case):

scancel 380
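scancel also accepts some other useful forms; for example, to cancel all of your own jobs at once:

```shell
# Cancel every job belonging to the current user:
scancel -u $USER
```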

Interactive Jobs

You can work interactively on a node from the Linux shell command line by starting a job in the interactive queue.

The syntax for requesting an interactive (bash) shell session is:

srun --pty /bin/bash

This will start a shell on one of the cluster nodes (by default in the 'low' partition). You can then act as you would on any SCF Linux compute server. For example, you might use top to assess the status of one of your non-interactive (i.e., batch) cluster jobs. Or you might test some code before running it as a batch job. You can also transfer files to the local disk of the cluster node.

If you want to run a program that involves a graphical interface (requiring an X11 window), you need to add --x11 to your srun command. So you could directly run Matlab, e.g., as follows:

srun --pty --x11 matlab 

or you could add the --x11 flag when requesting an interactive shell session and then subsequently start a program that has a graphical interface.

To run an interactive session in which you would like to use multiple cores, do the following (here we request 4 cores for our use):

srun --pty --cpus-per-task 4 /bin/bash 

Note that "-c" is a shorthand for "--cpus-per-task".

To transfer files to the local disk of a specific node, you need to request that your interactive session be started on the node of interest (in this case scf-sm20):

srun --pty -p high -w scf-sm20 /bin/bash

Note that if that specific node has 24 or 32 cores in use (depending on the partition), you will need to wait until resources become available on that node before your interactive session will start.

Finally, you can request multiple cores using -c, as with batch jobs. As with batch jobs, you can change OMP_NUM_THREADS, though if the total number of cores used (the number of processes your code starts multiplied by threads per process) exceeds the number of cores you requested, the system will automatically throttle back your core usage.

How To Monitor Jobs and See Output

The SLURM command squeue provides info on job status:

arwen:~/Desktop> squeue
               381      high paciorek  R      25:28      1 scf-sm20
               380      high paciorek  R      25:37      1 scf-sm20

The following will tailor the output to include information on the number of cores (the CPUs column below) being used as well as other potentially useful information:

arwen:~/Desktop> squeue -o "%.7i %.9P %.20j %.8u %.2t %l %.9M %.5C %.8r %.6D %R %p %q %b"
     49       low                 bash paciorek  R 28-00:00:00   1:23:29     1     None      1 scf-sm00 0.00000017066486 normal (null)
     54       low              kbpew2v paciorek  R 28-00:00:00     11:01     1     None      1 scf-sm00 0.00000000488944 normal (null)

The 'ST' field indicates whether a job is running (R), failed (F), or pending (PD). The latter occurs when there are not yet enough resources on the system for your job to run.

Job output that would normally appear in your terminal session will be sent to a file named slurm-<jobid>.out where <jobid> will be the number of the job (visible via squeue as above).

If you would like to logon to the node on which your job is running in order to assess CPU or memory use, you can run an interactive job within the context of your existing job. First determine the job ID of your running job using squeue and insert that in place of <jobid> in the following command:

arwen:~/Desktop$ srun --pty --jobid=<jobid> /bin/bash

You can then run top and other tools. Please do not do any intensive computation that would use additional cores in addition to those used by your running job.
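If job accounting is enabled on the cluster (an assumption worth checking with the SCF staff), you can also inspect resource usage of completed jobs with SLURM's sacct command, e.g.:

```shell
# Summarize state, elapsed time, and peak memory for job 380 (job ID hypothetical):
sacct -j 380 --format=JobID,State,Elapsed,MaxRSS
```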

How To Monitor Cluster Usage

If you'd like to see how busy each node is (e.g., to choose what partition to submit a job to), you can run the following:

arwen:~/Desktop$ sinfo -N -o "%8P %15N %.5a %6t %C"
low*     scf-sm00           up idle   0/32/0/32
low*     scf-sm01           up idle   0/32/0/32
low*     scf-sm02           up idle   0/32/0/32
low*     scf-sm03           up idle   0/32/0/32
low*     scf-sm10           up mix    29/3/0/32
low*     scf-sm11           up mix    27/5/0/32
low*     scf-sm12           up mix    9/23/0/32
low*     scf-sm13           up idle   0/32/0/32
high     scf-sm20-cpu       up idle   0/22/0/22
high     scf-sm21-cpu       up mix    8/12/0/20
high     scf-sm22           up idle   0/24/0/24
high     scf-sm23           up idle   0/24/0/24

Here the final column shows four numbers, A/I/O/T: the number of allocated (i.e., in-use) cores, idle cores, other (unavailable) cores, and the total number of cores on the node.

Submitting Jobs to the High Partitions

To submit jobs to the faster nodes in the high partition, you must include either the '--partition=high' or '-p high' flag. By default, jobs run in the low partition. For example, with a job script job.sh:

arwen:~/Desktop$ sbatch -p high job.sh
Submitted batch job 380

You can also submit interactive jobs to the high partition, by simply adding the flag to specify that partition as above.

The high_pre partition has a node with newer CPUs and very fast disk I/O to /tmp and /var/tmp using an NVMe SSD. This node is owned by a faculty member and is made available on a preemptible basis. Your job could be cancelled without warning at any time if researchers in the faculty member's group need to run a job using the cores/memory your job is using. However we don't expect this to happen very often given that the node has 48 cores that can be shared amongst jobs. For example:

arwen:~/Desktop$ sbatch -p high_pre job.sh
Submitted batch job 380

Jobs are cancelled when preemption happens. If you want your job to be automatically started again (i.e., started from the beginning) when the node becomes available you can add the "--requeue" flag when you submit via sbatch.
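For example, a sketch of a high_pre job script that requeues itself after preemption (since the job restarts from the beginning, your code should save intermediate results where possible):

```shell
#!/bin/bash
#SBATCH -p high_pre
#SBATCH --requeue
R CMD BATCH --no-save simulate.R simulate.Rout
```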

Also note that if you need more disk space on the NVMe SSD, we can make available space on a much larger NVMe SSD (in /data) if you request it.

Submitting Parallel Jobs

One can use SLURM to submit a variety of types of parallel code. Here is a set of potentially useful templates that we expect will account for most user needs. If you have a situation that does not fall into these categories or have questions about parallel programming, submitting jobs to use more than one core, or are not sure how to follow these rules, please email consult [at] stat [dot] berkeley [dot] edu.

For additional details, please see the notes from SCF workshops on the basics of parallel programming in R, Python, Matlab and C, with some additional details on using the cluster. If you're making use of the threaded BLAS, it's worth doing some testing to make sure that threading is giving an non-negligible speedup; see the notes above for more information.

Submitting Threaded Jobs

Here's an example job script to use multiple threads (4 in this case) in R (or with your own openMP-based program):

#SBATCH --cpus-per-task 4
R CMD BATCH --no-save simulate.R simulate.Rout

This will allow your R code to use the system's threaded BLAS and LAPACK routines. [Note that in R you can instead use the omp_set_num_threads() function in the RhpcBLASctl package, again making use of the SLURM_CPUS_PER_TASK environment variable.] The same syntax in your job script will work if you've compiled a C/C++/Fortran program that makes use of openMP for threading. Just replace the R CMD BATCH line with the line calling your program.
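For openMP-based programs, you can tie the number of threads to the cores SLURM allocated via the OMP_NUM_THREADS environment variable. A minimal sketch (the executable name is a placeholder; the default of 1 applies only when running outside a SLURM job):

```shell
#!/bin/bash
#SBATCH --cpus-per-task 4
# Match the number of OpenMP threads to the cores SLURM allocated
# (SLURM sets SLURM_CPUS_PER_TASK inside the job; default to 1 elsewhere):
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "using $OMP_NUM_THREADS threads"
# ./myOpenMPprogram   # your threaded executable would be invoked here
```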

Here's an example job script to use multiple threads (4 in this case) in Matlab:

#SBATCH --cpus-per-task 4
matlab -nodesktop -nodisplay < simulate.m > simulate.out

IMPORTANT: At the start of your Matlab code file you should include this line, so that Matlab's threading is limited to the cores you requested:

maxNumCompThreads(str2num(getenv('SLURM_CPUS_PER_TASK')));
Here's an example job script to use multiple threads (4 in this case) in SAS:

#SBATCH --cpus-per-task 4
sas -threads -cpucount $SLURM_CPUS_PER_TASK

Submitting Multi-core Jobs

The following example job script files pertain to jobs that need to use multiple cores on a single node in ways that do not fall under the threading/openMP context. This is relevant for parallel code in R that starts multiple R processes (e.g., foreach, mclapply, parLapply), for parfor in Matlab, and for pp.Server in Python.

Here's an example script that uses multiple cores (4 in this case):

#SBATCH --cpus-per-task 4
R CMD BATCH --no-save simulate.R simulate.Rout

IMPORTANT: Your R, Python, or any other code should use no more than the number of total cores requested (4 in this case). You can use the SLURM_CPUS_PER_TASK environment variable to programmatically control this.
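As one sketch of this, you can pass the allocated core count from the job script into your program as a command-line argument (the ncores argument is hypothetical; simulate.R would read it via commandArgs()):

```shell
#!/bin/bash
#SBATCH --cpus-per-task 4
# Pass the core count SLURM allocated into the R program as an argument:
R CMD BATCH --no-save "--args ncores=$SLURM_CPUS_PER_TASK" simulate.R simulate.Rout
```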

The same syntax for your job script pertains to Matlab. IMPORTANT: when using parpool in Matlab, you should size the pool to the number of cores you requested, e.g.:

parpool(str2num(getenv('SLURM_CPUS_PER_TASK')))
Note that by default the maximum number of workers is 12. To use more, run the following before invoking parpool.

c = parcluster('local');
c.NumWorkers = str2num(getenv('SLURM_CPUS_PER_TASK'));

By default MATLAB will use one core per worker (i.e., the parallel tasks will not be threaded). To use multiple threads per worker, here's an example job script (this is for four workers and two threads per worker):

#SBATCH --nodes 1
#SBATCH --ntasks 4
#SBATCH --cpus-per-task 2
matlab -nodesktop -nodisplay < simulate.m > simulate.out

and here's example MATLAB code for starting up your parallel pool of (threaded) workers:

c = parcluster('local');
c.NumThreads = str2num(getenv('SLURM_CPUS_PER_TASK'));

Running MATLAB across Multiple Nodes

As of June 2019, you can run MATLAB across as many workers as you like (constrained only by the number of cores on the system not in use by other jobs) on one or more nodes. There are two ways you can do this. Please see these instructions for how this is done on the EML (Economics) cluster, also run by the SCF. The same instructions will work on the SCF, but for 'Option 2', of course, you should be running MATLAB on an SCF stand-alone server.

Submitting MPI or Other Multi-node Jobs

You can use MPI to run jobs across multiple nodes. Alternatively, there are ways to run software such as Python and R across multiple nodes. This is useful in two ways. First, it allows you to use more than 24 or 32 cores (depending on the partition) for a single job. Second, if you need fewer cores but the free cores are scattered across the nodes and there are not sufficient cores on any one node, this allows you to make use of those scattered cores.

A standard approach for such jobs is to request the number of cores (one core per parallel worker) using the "-n" flag. Here we request 36 cores.

#SBATCH -n 36
< insert your call(s) to executables that can parallelize across nodes here >

Here's an example script that uses multiple processors via MPI (36 in this case):

#SBATCH -n 36
mpirun -np $SLURM_NTASKS myMPIexecutable

"myMPIexecutable" could be C/C++/Fortran code you've written that uses MPI, or R or Python code that makes use of MPI. More details are available here. One simple way to use MPI is to use the doMPI back-end to foreach in R. In this case you invoke R via mpirun as:

mpirun -np $SLURM_NTASKS R CMD BATCH file.R file.out

Note that in this case, unlike some other invocations of R via mpirun, mpirun starts all of the R processes.

Another use case for R in a distributed computing context is to use functions such as parSapply and parLapply after using the makeCluster command with a character vector indicating the nodes allocated by SLURM. If you run the following as part of your job script before the command invoking R, the file slurm.hosts will contain a list of the node names that you can read into R and pass to makeCluster.

srun hostname -s > slurm.hosts

To run an MPI job with each process threaded, your job script would look like the following (here with 18 processes and two threads per process):

#SBATCH -n 18 -c 2
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun -np $SLURM_NTASKS -x OMP_NUM_THREADS myMPIexecutable

Note that the -x flag exports the OMP_NUM_THREADS variable to the MPI processes, so it must be set before mpirun is invoked.

Submitting Data Parallel (SPMD) Code

Here's how you would set up your job script if you want to run multiple instances (18 in this case) of the same code.

#SBATCH -n 18 
srun myExecutable

To have each instance behave differently, you can make use of the SLURM_PROCID environment variable, which will be distinct (and have values 0, 1, 2, ...) between the different instances.
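For instance, here is a minimal sketch of a script that srun might launch, in which each instance selects its own input file based on SLURM_PROCID (the file names are hypothetical; the id defaults to 0 when run outside of SLURM):

```shell
#!/bin/bash
# Each parallel instance started by srun sees a distinct SLURM_PROCID (0, 1, 2, ...).
procid=${SLURM_PROCID:-0}
infile=data${procid}.csv
echo "process $procid handling $infile"
```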

To have each process be threaded, see the syntax under the MPI section above.

Job Arrays: Submitting Multiple Jobs in an Automated Fashion

Job array submissions are a nice way to submit multiple jobs in which you vary a parameter across the different jobs.

Here's what your job script would look like, in this case to run a total of 5 jobs with parameter values of 0, 1, 2, 5, 7:

#SBATCH -a 0-2,5,7

Your program should then make use of the SLURM_ARRAY_TASK_ID environment variable, which for a given job will contain one of the values from the set given with the -a flag (in this case from {0,1,2,5,7}). You could, for example, read SLURM_ARRAY_TASK_ID into your R, Python, Matlab, or C code.

Here's a concrete example where it's sufficient to use SLURM_ARRAY_TASK_ID to distinguish different input files if you need to run the same command (the bioinformatics program tophat in this case) on multiple input files (in this case, trans0.fq, trans1.fq, ...):

#SBATCH -a 0-2,5,7
tophat BowtieIndex trans$SLURM_ARRAY_TASK_ID.fq

Submitting a GPU Job

To use the GPUs on the scf-sm20 node or the roo server, you need to specifically request that your job use the GPU as follows:

arwen:~/Desktop$ sbatch --partition=gpu --gres=gpu:1 job.sh

Once it starts your job will have exclusive access to the GPU and its memory. If another user is using the GPU, your job will be queued until the current job finishes.

Interactive jobs should use that same gres flag with the usual srun syntax for an interactive job.

arwen:~/Desktop$ srun --pty --partition=gpu --gres=gpu:1 /bin/bash

Note that the GPU hosted on scf-sm20 is quite a bit older, and likely slower, than the GPU hosted on roo. If you'd like to specifically request one of the GPUs, you can add the -w flag, e.g. "-w roo" to request the GPU on roo. 

scf-sm20-gpu is a virtual node - it is a set of two CPUs on the scf-sm20 node that are partitioned off for GPU use.

If you want to interactively logon to the GPU node to check on compute or memory use of an sbatch job that uses the GPU, find the job ID of your job using squeue and insert that job ID in place of '<jobID>' in the following command. This will give you an interactive job running in the context of your original job:

arwen:~/Desktop$ srun --pty --partition=gpu --jobid=<jobid> /bin/bash

and then use nvidia-smi commands, e.g.,

scf-sm20:~$ nvidia-smi -q -d UTILIZATION,MEMORY -l 1

For details on setting up your code to use the GPU, please see this link.