Linux Cluster

The SCF operates a Linux cluster as the primary resource for users' computational jobs.

The cluster is managed by the SLURM queueing software. SLURM provides a standard batch queueing system through which users submit jobs to the cluster. Jobs are typically submitted to SLURM using a user-defined shell script that executes one's application code. Interactive use is also an option. Users may also query the cluster to see job status. As currently set up, the cluster is designed for processing single-core and multi-core/threaded jobs (at most 32 cores per job in the 'low' partition), as well as jobs that are set up to use multiple machines (aka 'nodes') within the cluster (i.e., distributed memory jobs). All software running on SCF Linux machines is available on the cluster. Users can also compile programs on any SCF Linux machine and then run that program on the cluster.

The cluster has a total of 556 cores, divided into 'partitions' of nodes of similar hardware/capabilities.

Below is more detailed information about how to use the cluster.

Access, Hardware, and Job Restrictions

The cluster is open to Department of Statistics faculty, grad students, postdocs, and visitors using their SCF logon. Class account users do not have access by default, but instructors can email manager@stat.berkeley.edu to discuss access for their class.

Currently users may submit jobs on the following submit hosts:

  arwen, beren, gandalf, hagrid, harry, radagast, shelob

The cluster has multiple partitions, corresponding to groups of nodes. The different partitions have different hardware and job restrictions as discussed here: 

Partition         Max # cores per user (running)   Time limit   Max memory per job                Max cores per job
low               256                              28 days      256 GB                            32***
high*             36                               7 days       128 GB                            24***
high_pre*         48 (smaug), 80 (balrog)          28 days**    376 GB (smaug), 792 GB (balrog)   48 or 80
gpu*              2 CPU cores                      28 days      128 GB (CPU)                      2
gpu_jsteinhardt*  varied                           28 days**    varied (CPU)                      2
gpu_yugroup*      2 CPU cores                      28 days**    varied (CPU)                      2
low_mem           144                              28 days      12-16 GB                          16***

* See How to Submit Jobs to the High Partitions or How to Submit GPU Jobs.

** Preemptible

*** If you use software that can parallelize across multiple nodes (e.g., R packages that use MPI or the future package, Python's Dask or IPython Parallel, MATLAB, MPI), you can run individual jobs across more than one node. See How to Submit Parallel Jobs. 

We have implemented a 'fair share' policy that governs the order in which jobs that are waiting in a given queue start when resources become available. In particular, if two users each have a job sitting in a queue, the job that will start first will be that of the user who has made less use of the cluster recently (measured in terms of CPU time). The measurement of CPU time downweights usage over time, with a half-life of one month, so a job that ran a month ago will count half as much as a job that ran yesterday. Apart from this prioritization based on recent use, all users are treated equally.
 

How to Submit Single-Core Jobs

Prepare a shell script containing the instructions you would like the system to execute.

The instructions here are for the simple case of submitting a job without any parallelization; i.e., a job using a single core (CPU). When submitted using the instructions in this section, your code should only use a single core at a time; it should not start any additional processes. We also have extensive instructions for submitting parallelized jobs and automating the submission of multiple jobs.

For example, a simple script to run an R program called 'simulate.R' would contain these lines:

#!/bin/bash
R CMD BATCH --no-save simulate.R simulate.out

Once logged onto a submit host, use the sbatch command with the name of the shell script (assumed to be job.sh here) to enter a job into the queue:

arwen:~/Desktop$ sbatch job.sh
Submitted batch job 380

Here the job was assigned job ID 380. Results that would normally be printed to the screen via standard output and standard error will be written to a file called slurm-380.out.

SLURM provides a number of additional flags to control what happens; you can see the man page for sbatch for help with these. Here are some examples, placed in the job script file, where we name the job, ask for email updates and name the output and error files:

#!/bin/bash
#SBATCH --job-name=myAnalysisName
#SBATCH --mail-type=ALL                       
#SBATCH --mail-user=blah@berkeley.edu
#SBATCH -o myAnalysisName.out #File to which standard out will be written
#SBATCH -e myAnalysisName.err #File to which standard err will be written
R CMD BATCH --no-save simulate.R simulate.Rout
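
To automate the submission of many similar single-core jobs, one option (shown here only as a minimal sketch; see our instructions on automating the submission of multiple jobs for other approaches) is a SLURM job array, which runs the same script once per array index:

#!/bin/bash
#SBATCH --array=1-10
# SLURM runs this script once per index and sets SLURM_ARRAY_TASK_ID (1 through 10);
# here the index is passed to R, where it can be read via commandArgs().
R CMD BATCH --no-save "--args $SLURM_ARRAY_TASK_ID" simulate.R simulate-$SLURM_ARRAY_TASK_ID.Rout

By default, output for each array task is written to a file named slurm-<jobid>_<taskid>.out.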

 

How to Submit Parallel Jobs

One can use SLURM to submit a variety of types of parallel code. Please see here for details.
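
As a minimal sketch (the detailed instructions linked above cover the various flavors of parallelism and the appropriate flags for each), a multi-core job on a single node can be requested with --cpus-per-task, while a distributed-memory (e.g., MPI) job can request multiple tasks that SLURM may place on more than one node. The MPI executable name below is a placeholder, not a real program:

#!/bin/bash
#SBATCH --cpus-per-task=4          # four cores on one node; your code must itself make use of them
R CMD BATCH --no-save simulate.R simulate.Rout

#!/bin/bash
#SBATCH --ntasks=8                 # eight processes, which SLURM may spread across nodes
srun ./my_mpi_program              # placeholder MPI executable; depending on the MPI installation, mpirun may be used instead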

How to Kill a Job

First, find the job-id of the job, by typing squeue at the command line of a submit host (see How to Monitor Jobs).

Then use scancel to delete the job (with id 380 in this case):

scancel 380
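
scancel can also take your username, so you can cancel all of your jobs at once:

scancel -u $USER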

 

Interactive Jobs

You can work interactively on a node from the Linux shell command line by starting a job in the interactive queue.

The syntax for requesting an interactive (bash) shell session is:

srun --pty /bin/bash

This will start a shell on one of the cluster nodes. You can then act as you would on any SCF Linux compute server. For example, you might use top to assess the status of one of your non-interactive (i.e., batch) cluster jobs. Or you might test some code before running it as a batch job. You can also transfer files to the local disk of the cluster node.

If you want to run a program that involves a graphical interface (requiring an X11 window), you need to add --x11 to your srun command. So you could directly run Matlab, e.g., as follows:

srun --pty --x11 matlab

or you could add the --x11 flag when requesting an interactive shell session and then subsequently start a program that has a graphical interface.
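
For example:

srun --pty --x11 /bin/bash
matlab

where the second command is run from within the interactive session once it starts.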

To run an interactive session in which you would like to use multiple cores, do the following (here we request 4 cores for our use):

srun --pty --cpus-per-task 4 /bin/bash

Note that "-c" is a shorthand for "--cpus-per-task".

To transfer files to the local disk of a specific node, you need to request that your interactive session be started on the node of interest (in this case scf-sm20):

srun --pty -p high -w scf-sm20 /bin/bash

Note that if that specific node has all its cores in use by other users, you will need to wait until resources become available on that node before your interactive session will start.

Finally, you can request multiple cores using -c, just as with batch jobs. As with batch jobs, you can change OMP_NUM_THREADS from its default of one, provided you make sure that the total number of cores used (the number of processes your code starts multiplied by the threads per process) does not exceed the number of cores you requested.
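
For example (the program name here is just a placeholder for your own threaded code):

srun --pty -c 4 /bin/bash
export OMP_NUM_THREADS=4      # match the number of cores requested
./my_threaded_program         # placeholder; any OpenMP-threaded executable behaves similarly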
 

How to Monitor Jobs

The SLURM command squeue provides info on job status:

arwen:~/Desktop> squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               381      high   job.sh paciorek  R      25:28      1 scf-sm20
               380      high   job.sh paciorek  R      25:37      1 scf-sm20

           
The following will tailor the output to include information on the number of cores (the CPUs column below) being used as well as other potentially useful information:

arwen:~/Desktop> squeue -o "%.7i %.9P %.20j %.8u %.2t %l %.9M %.5C %.8r %.6D %R %p %q %b"
   JOBID PARTITION                 NAME     USER ST TIME_LIMIT      TIME  CPUS   REASON  NODES NODELIST(REASON) PRIORITY QOS GRES
     49       low                 bash paciorek  R 28-00:00:00   1:23:29     1     None      1 scf-sm00 0.00000017066486 normal (null)
     54       low              kbpew2v paciorek  R 28-00:00:00     11:01     1     None      1 scf-sm00 0.00000000488944 normal (null)

The 'ST' field indicates whether a job is running (R), failed (F), or pending (PD). The latter occurs when there are not yet enough resources on the system for your job to run.

Job output that would normally appear in your terminal session will be sent to a file named slurm-<jobid>.out where <jobid> will be the number of the job (visible via squeue as above).

If you would like to logon to the node on which your job is running in order to assess CPU or memory use, you can run an interactive job within the context of your existing job. First determine the job ID of your running job using squeue and insert that in place of <jobid> in the following command:

arwen:~/Desktop$ srun --pty --jobid=<jobid> /bin/bash

You can then run top and other tools. Please do not do any intensive computation that would use cores beyond those already used by your running job.

How to Monitor Cluster Usage

If you'd like to see how busy each node is (e.g., to choose what partition to submit a job to), you can run the following:

arwen:~/Desktop$ sinfo -N  -o "%.12N %.5a %.6t %C"
    NODELIST AVAIL  STATE CPUS(A/I/O/T)
    scf-sm00    up  alloc 32/0/0/32
    scf-sm01    up  alloc 32/0/0/32
    scf-sm02    up  alloc 32/0/0/32
    scf-sm03    up  alloc 32/0/0/32
    scf-sm10    up  alloc 32/0/0/32
    scf-sm11    up  alloc 32/0/0/32
    scf-sm12    up  alloc 32/0/0/32
    scf-sm13    up    mix 27/5/0/32
scf-sm20-cpu    up   idle 0/22/0/22
scf-sm20-gpu    up   idle 0/2/0/2
scf-sm21-cpu    up   idle 0/20/0/20
scf-sm21-gpu    up   idle 0/4/0/4
    scf-sm22    up   idle 0/24/0/24
    scf-sm23    up   idle 0/24/0/24

Here the A column indicates the number of allocated (i.e., in-use) cores, I the number of idle cores, O the number of cores that are unavailable for other reasons, and T the total number of cores on the node.

How to Submit Jobs to the High Partitions

High partition

To submit jobs to the faster nodes in the high partition, you must include either the '--partition=high' or '-p high' flag. By default jobs will be run in the low partition. For example:

arwen:~/Desktop$ sbatch -p high job.sh
Submitted batch job 380

You can also submit interactive jobs to the high partition, by simply adding the flag to specify that partition as above.
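
For example, to start an interactive shell on a node in the high partition:

srun --pty -p high /bin/bash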

High preemptible partition

The high_pre partition has two nodes (smaug and balrog) with newer CPUs and very fast disk I/O to /tmp and /var/tmp using NVMe SSDs. These nodes are owned by a faculty member and are made available on a preemptible basis. Your job could be cancelled without warning at any time if researchers in the faculty member's group need the cores/memory your job is using. However, we don't expect this to happen very often, given that the nodes have 48 (smaug) and 80 (balrog) cores that can be shared amongst jobs. For example:

arwen:~/Desktop$ sbatch -p high_pre job.sh
Submitted batch job 380

Jobs are cancelled when they are preempted. If you want your job to be automatically started again (i.e., restarted from the beginning) when the node becomes available, you can add the "--requeue" flag when you submit via sbatch.
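
For example, either on the command line:

arwen:~/Desktop$ sbatch -p high_pre --requeue job.sh

or within the job script itself:

#!/bin/bash
#SBATCH --partition=high_pre
#SBATCH --requeue                 # if preempted, start the job again from the beginning
R CMD BATCH --no-save simulate.R simulate.Rout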

Also note that if you need more disk space on the NVMe SSD on smaug (this does not apply to balrog), we may be able to make space available on a much larger NVMe SSD (mounted at /data) if you request it.
 

How to Submit GPU Jobs

The SCF hosts a number of GPUs, available only by submitting a job through our SLURM scheduling software. The GPUs are quite varied in their hardware configurations (different generations of GPUs, with different speeds and amounts of GPU memory); details are provided here to guide you in selecting which GPU you may want to use.

To use the GPUs, you need to submit a job via our SLURM scheduling software. In doing so, you need to specifically request that your job use the GPU as follows:

arwen:~/Desktop$ sbatch --partition=gpu --gres=gpu:1 job.sh

Once it starts, your job will have exclusive access to the GPU and its memory. If another user is using the GPU, your job will be queued until the current job finishes.

Interactive jobs should use the same --gres flag with the usual srun syntax for an interactive job:

arwen:~/Desktop$ srun --pty --partition=gpu --gres=gpu:1 /bin/bash

Given the heterogeneity in the GPUs available (see here for details), you may want to request use of one or more GPUs on a specific machine. To do so, you can use the -w flag, e.g., requesting the roo GPU server:

arwen:~/Desktop$ sbatch --partition=gpu -w roo --gres=gpu:1 job.sh

If you want to interactively logon to the GPU node to check on compute or memory use of an sbatch job that uses the GPU, find the job ID of your job using squeue and insert that job ID in place of '<jobID>' in the following command. This will give you an interactive job running in the context of your original job:

arwen:~/Desktop$ srun --pty --partition=gpu --jobid=<jobid> /bin/bash

and then use nvidia-smi commands, e.g.,

scf-sm20:~$ nvidia-smi -q -d UTILIZATION,MEMORY -l 1

For details on setting up your code to use the GPU, please see this link.

Two GPUs are generally available to all SCF users; these are hosted on the scf-sm20 and roo servers. 

Note that the GPU hosted on scf-sm20 is quite a bit older, and likely slower, than the GPU hosted on roo. If you'd like to specifically request one of the GPUs, you can add the -w flag, e.g. "-w roo" to request the GPU on roo. 

scf-sm20-gpu is a virtual node: a set of two CPU cores on the scf-sm20 node that are partitioned off for GPU use.

Additional GPUs have been obtained by the Steinhardt and Yu lab groups. Most of these GPUs have higher performance (either speed or GPU memory) than our standard GPUs.

Members of the lab group have priority access to the GPUs of their group. Other SCF users can submit jobs that use these GPUs but those jobs will be preempted (killed) if higher-priority jobs need access to the GPUs. 

To submit jobs requesting access to these GPUs, you need to specify either the gpu_jsteinhardt or gpu_yugroup partitions. Here's an example:

arwen:~/Desktop$ sbatch --partition=gpu_jsteinhardt --gres=gpu:1 job.sh

To use multiple GPUs for a job (only possible when using a server with more than one GPU, namely scf-sm21-gpu, smaug, shadowfax, balrog, and morgoth), simply change the number 1 after --gres=gpu to the number desired.
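
For example, to request two GPUs:

arwen:~/Desktop$ sbatch --partition=gpu_jsteinhardt --gres=gpu:2 job.sh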

The Steinhardt group has priority access to the balrog (8 GPUs), shadowfax (8 GPUs), and smaug (2 GPUs) GPU servers. If you are in the group, simply submit jobs to the gpu_jsteinhardt partition; your jobs will automatically preempt jobs of users outside the group if that is needed for your jobs to run.

More details on optimal use of these servers are available in a guide prepared by Steinhardt group members and the SCF; please contact us to obtain it.

The Yu group has priority access to GPUs located on the scf-sm21-gpu (2 GPUs), merry (1 GPU), and morgoth (2 GPUs) servers. If you are in the group, simply submit jobs to the gpu_yugroup partition; your jobs will automatically preempt jobs of users outside the group if that is needed for your jobs to run. Please contact SCF staff or group members for more details.