Linux Cluster

The SCF operates a Linux cluster as our primary resource for users' computational jobs.

The cluster is managed by the SLURM queueing software. SLURM provides a standard batch queueing system through which users submit jobs to the cluster. Jobs are typically submitted to SLURM using a user-defined shell script that executes one's application code. Interactive use is also an option. Users may also query the cluster to see job status. As currently set up, the cluster is designed for processing single-core and multi-core/threaded jobs (at most 32 cores per job in the 'low' partition), as well as jobs that are set up to use multiple machines (aka 'nodes') within the cluster (i.e., distributed memory jobs). All software running on SCF Linux machines is available on the cluster. Users can also compile programs on any SCF Linux machine and then run that program on the cluster.

The cluster has a total of 1032 cores, divided into 'partitions' of nodes of similar hardware/capabilities. Each 'core' is actually a hardware thread, as we have hyperthreading enabled, so the number of physical cores per machine is generally half that indicated in the Hardware tab below.

Below is more detailed information about how to use the cluster.

Access

The cluster is open to Department of Statistics faculty, grad students, postdocs, and visitors using their SCF logon. Class account users do not have access by default, but instructors can email manager@stat.berkeley.edu to discuss access for their class.

Currently users may submit jobs on the following submit hosts:

arwen, bellatrix, beren, gandalf, gollum, hermione, radagast, shelob

Hardware

The cluster has 1032 logical cores (essentially CPUs) spread across a number of nodes of varying hardware specifications and ages. Each 'core' is actually a hardware thread, as we have hyperthreading enabled, so the number of physical cores per machine is generally half that indicated here. If you'd like to run multi-core jobs without hyperthreading, please contact us for work-arounds.

  • 256 of the cores are in the 'low' partition.
    • 8 nodes with 32 cores and 256 GB RAM per node.
    • By default jobs submitted to the cluster will run here.
  • 96 of the cores are in the 'high' partition.
    • 4 nodes with 24 cores and 128 GB RAM per node.
    • Newer (but not new) and faster nodes than in the 'low' partition.
  • 504 cores are available on preemptible basis in the 'jsteinhardt' partition.
    • Many cores and (for some nodes) large memory on various nodes:
      • smaug has 64 cores and 288 GB RAM.
      • balrog has 96 cores and 792 GB RAM.
      • sunstone and smokyquartz each have 64 cores and 128 GB RAM.
      • rainbowquartz has 64 cores and 792 GB RAM.
      • shadowfax has 48 cores.
      • saruman has 104 cores and 1 TB RAM.
    • Newer and perhaps somewhat faster cores than in the 'high' partition.
    • Very fast disk I/O (using NVMe SSDs) to files located in /tmp and /var/tmp.
    • Jobs are subject to preemption at any time and will be cancelled in that case without warning (more details below).
  • Additional GPU servers in the 'yugroup' partition.
  • 144 cores are in the 'lowmem' partition, intended to supplement our other partitions, particularly during heavy cluster usage.
    • 9 nodes, 16 cores and 12-16 GB RAM per node.
    • Older, slower nodes that are comparable to, or a bit slower than, the nodes in the 'low' partition.
    • Only suitable for jobs with low memory needs, but you should feel free to request an entire node for jobs needing more memory, even if your code will only use one or a few cores (see the example after this list).
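
For example, one simple way to claim an entire 16-core 'lowmem' node (and thus all of its memory) is to request all 16 of its cores:

sbatch -p lowmem --cpus-per-task=16 job.sh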

Slurm Scheduler Setup and Job Restrictions

The cluster has multiple partitions, corresponding to groups of nodes. The different partitions have different hardware and job restrictions as discussed here:

Partition    | Max # cores per user (running) | Time limit | Max memory per job                                                                     | Max cores per job
low          | 256                            | 28 days    | 256 GB                                                                                 | 32***
high*        | 36                             | 7 days     | 128 GB                                                                                 | 24***
gpu*         | 8 CPU cores                    | 28 days    | 6 GB (CPU)                                                                             | 8
jsteinhardt* | varied                         | 28 days**  | 288 GB (smaug), 792 GB (balrog, rainbowquartz), 1 TB (saruman), 128 GB (various) (CPU) | varied
yugroup*     | varied                         | 28 days**  | varied (CPU)                                                                           | varied
lowmem       | 144                            | 28 days    | 12-16 GB                                                                               | 16***

* See How to Submit Jobs to the High Partitions or How to Submit GPU Jobs.

** Preemptible

*** If you use software that can parallelize across multiple nodes (e.g., R packages that use MPI or the future package, Python's Dask or IPython Parallel, MATLAB, MPI), you can run individual jobs across more than one node. See How to Submit Parallel Jobs.

We have implemented a 'fair share' policy that governs the order in which jobs that are waiting in a given queue start when resources become available. In particular, if two users each have a job sitting in a queue, the job that will start first will be that of the user who has made less use of the cluster recently (measured in terms of CPU time). The measurement of CPU time downweights usage over time, with a half-life of one month, so a job that ran a month ago will count half as much as a job that ran yesterday. Apart from this prioritization based on recent use, all users are treated equally.

How to Submit Single-Core Jobs

Prepare a shell script containing the instructions you would like the system to execute. 

The instructions here are for the simple case of submitting a job without any parallelization; i.e., a job using a single core (CPU). When submitted using the instructions in this section, such jobs will have access to only a single core at a time. We also have extensive instructions for submitting parallelized jobs and automating the submission of multiple jobs.

For example, a simple script to run an R program called 'simulate.R' would contain these lines:

#!/bin/bash
R CMD BATCH --no-save simulate.R simulate.out

Once logged onto a submit host, navigate to a directory within your home or scratch directory (i.e., make sure your working directory is not in /tmp or /var/tmp) and use the sbatch command with the name of the shell script (assumed to be job.sh here) to enter a job into the queue:

arwen:~/Desktop$ sbatch job.sh
Submitted batch job 380

Here the job was assigned job ID 380. Results that would normally be printed to the screen via standard output and standard error will be written to a file called slurm-380.out.

If you have many single-core jobs to run, there are various ways to automate submitting such jobs.
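
For instance, one common approach (sketched here, assuming each task can be indexed by a number) is a Slurm job array, which submits many copies of the same script, each with a different value of the SLURM_ARRAY_TASK_ID environment variable:

#!/bin/bash
#SBATCH --array=1-100
# Each array task runs one single-core simulation, passing its task index to R.
R CMD BATCH --no-save "--args $SLURM_ARRAY_TASK_ID" simulate.R simulate-$SLURM_ARRAY_TASK_ID.out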

Note that Slurm is configured such that single-core jobs will have access to a single physical core (including both hyperthreads), so there won't be any contention between the two threads on a physical core. However, if you have many single-core jobs to run, you might improve your throughput by modifying your workflow so that you run one job per hyperthread rather than one job per physical core. You could do this by taking advantage of parallelization strategies in R, Python, or MATLAB to distribute tasks across workers in a single job, or you could use GNU parallel or srun within sbatch.
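
As a rough sketch of the GNU parallel approach (run_one_task.sh and the input values are placeholders for your own script and inputs), the following packs one task per requested hardware thread within a single job:

#!/bin/bash
#SBATCH --cpus-per-task=8
# Run up to 8 tasks at once, one per requested hardware thread.
parallel -j $SLURM_CPUS_PER_TASK ./run_one_task.sh ::: input1 input2 input3 input4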

Slurm provides a number of additional flags to control what happens; you can see the man page for sbatch for help with these. Here are some examples, placed in the job script file, where we name the job, ask for email updates and name the output and error files:

#!/bin/bash
#SBATCH --job-name=myAnalysisName
#SBATCH --mail-type=ALL                       
#SBATCH --mail-user=blah@berkeley.edu
#SBATCH -o myAnalysisName.out #File to which standard out will be written
#SBATCH -e myAnalysisName.err #File to which standard err will be written
R CMD BATCH --no-save simulate.R simulate.Rout

How to Submit Parallel Jobs

One can use SLURM to submit a variety of types of parallel code, including threaded/multi-core jobs on a single node and distributed-memory jobs that span multiple nodes.
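
For example, a minimal sketch of a job script for a threaded (shared-memory) job requesting four cores (the count of 4 is arbitrary, and simulate.R is the placeholder script used above):

#!/bin/bash
#SBATCH --cpus-per-task=4
# Let threaded code use as many threads as cores requested.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
R CMD BATCH --no-save simulate.R simulate.out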

Submitting Jobs to the High Performance Partitions

High partition

To submit jobs to the faster nodes in the high partition, you must include either the '--partition=high' or '-p high' flag. By default jobs will be run in the low partition. For example:

arwen:~/Desktop$ sbatch -p high job.sh
Submitted batch job 380

You can also submit interactive jobs to the high partition by simply adding the flag to specify that partition in your srun invocation.
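
For example, using the srun syntax described under Interactive Jobs below:

srun --pty -p high /bin/bash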

Pre-emptible jobs using many cores, fast disk I/O, or large memory

The jsteinhardt partition has various nodes with newer CPUs, a lot of memory, and very fast disk I/O to /tmp and /var/tmp using an NVMe SSD. These nodes are owned by a faculty member and are made available on a preemptible basis. Your job could be cancelled without warning at any time if researchers in the faculty member's group need to run a job using the cores/memory your job is using. However, we don't expect this to happen too often given that the nodes have 64 cores (96 cores and 104 cores in the cases of balrog and saruman, respectively) that can be shared amongst jobs. For example, to request use of one of these nodes, which are labelled as 'manycore' nodes:

arwen:~/Desktop$ sbatch -p jsteinhardt -C manycore job.sh
Submitted batch job 380

Jobs are cancelled when preemption happens. If you want your job to be automatically started again (i.e., started from the beginning) when the node becomes available, you can add the "--requeue" flag when you submit via sbatch.
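
For example, combining this with the submission shown above:

sbatch -p jsteinhardt -C manycore --requeue job.sh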

You can request specific resources as follows:

  • "-C fasttmp" for access to fast disk I/O in /tmp and /var/tmp,
  • "-C manycore" for access to many (64 or more) cores,
  • "-C mem256g" for up to 256 GB CPU memory,
  • "-C mem768g" for up to 768 GB CPU memory, and
  • "-C mem1024g" for up to 1024 GB CPU memory.

Also note that if you need more disk space on the NVMe SSD on some but not all of these nodes, we may be able to make available space on a much larger NVMe SSD if you request it.

How to Kill a Job

First, find the job-id of the job, by typing squeue at the command line of a submit host (see How to Monitor Jobs).

Then use scancel to delete the job (with id 380 in this case):

scancel 380

Interactive Jobs

You can work interactively on a node from the Linux shell command line by starting a job in the interactive queue.

The syntax for requesting an interactive (bash) shell session is:

srun --pty /bin/bash

This will start a shell on one of the cluster nodes (in the default 'low' partition unless you specify otherwise). You can then act as you would on any SCF Linux compute server. For example, you might use top to assess the status of one of your non-interactive (i.e., batch) cluster jobs. Or you might test some code before running it as a batch job. You can also transfer files to the local disk of the cluster node.

If you want to run a program that involves a graphical interface (requiring an X11 window), you need to add --x11 to your srun command. For example, you could directly run MATLAB as follows:

srun --pty --x11 matlab

or you could add the --x11 flag when requesting an interactive shell session and then subsequently start a program that has a graphical interface.
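
For example:

srun --pty --x11 /bin/bash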

To run an interactive session in which you would like to use multiple cores, do the following (here we request 4 cores for our use):

srun --pty --cpus-per-task 4 /bin/bash

Note that "-c" is a shorthand for "--cpus-per-task".

To transfer files to the local disk of a specific node, you need to request that your interactive session be started on the node of interest (in this case scf-sm20):

srun --pty -p high -w scf-sm20 /bin/bash

Note that if that specific node has all its cores in use by other users, you will need to wait until resources become available on that node before your interactive session will start.

Finally, as with batch jobs, you can request multiple cores using -c and change OMP_NUM_THREADS from its default of one, provided you make sure that the total number of cores used (the number of processes your code starts multiplied by the threads per process) does not exceed the number of cores you request.
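
For example, you might request four cores and then, inside the resulting shell, allow threaded code to use all of them:

srun --pty -c 4 /bin/bash
export OMP_NUM_THREADS=4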

How to Monitor Jobs

The SLURM command squeue provides info on job status:

arwen:~/Desktop> squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               381      high   job.sh paciorek  R      25:28      1 scf-sm20
               380      high   job.sh paciorek  R      25:37      1 scf-sm20

The following will tailor the output to include information on the number of cores (the CPUs column below) being used as well as other potentially useful information:

arwen:~/Desktop> squeue -o "%.7i %.9P %.20j %.8u %.2t %l %.9M %.5C %.8r %.6D %R %p %q %b"
   JOBID PARTITION                 NAME     USER ST TIME_LIMIT      TIME  CPUS   REASON  NODES NODELIST(REASON) PRIORITY QOS GRES
     49       low                 bash paciorek  R 28-00:00:00   1:23:29     1     None      1 scf-sm00 0.00000017066486 normal (null)
     54       low              kbpew2v paciorek  R 28-00:00:00     11:01     1     None      1 scf-sm00 0.00000000488944 normal (null)

The 'ST' field indicates whether a job is running (R), failed (F), or pending (PD). The latter occurs when there are not yet enough resources on the system for your job to run.

Job output that would normally appear in your terminal session will be sent to a file named slurm-<jobid>.out where <jobid> will be the number of the job (visible via squeue as above).

If you would like to login to the node on which your job is running in order to assess CPU or memory use, you can run an interactive job within the context of your existing job. First determine the job ID of your running job using squeue and insert that in place of <jobid> in the following command:

arwen:~/Desktop$ srun --pty --jobid=<jobid> /bin/bash

You can then run top and other tools. 

To see a history of your jobs (within a time range), including reasons they might have failed:

sacct  --starttime=2021-04-01 --endtime=2021-04-30 \
--format JobID,JobName,Partition,Account,AllocCPUS,State%30,ExitCode,Submit,Start,End,NodeList,MaxRSS

The MaxRSS column indicates memory usage, which can be very useful.

How to Monitor Cluster Usage

If you'd like to see how busy each node is (e.g., to choose what partition to submit a job to), you can run the following:

arwen:~/Desktop$ sinfo -p low,high,jsteinhardt,yugroup -N -o "%.12N %.5a %.6t %C"
    NODELIST AVAIL  STATE CPUS(A/I/O/T)
      balrog    up    mix 50/46/0/96
       merry    up   idle 0/4/0/4
     morgoth    up    mix 1/11/0/12
    scf-sm00    up   idle 0/32/0/32
    scf-sm01    up   idle 0/32/0/32
    scf-sm02    up    mix 31/1/0/32
    scf-sm03    up   idle 0/32/0/32
    scf-sm10    up   idle 0/32/0/32
    scf-sm11    up   idle 0/32/0/32
    scf-sm12    up   idle 0/32/0/32
    scf-sm13    up   idle 0/32/0/32
    scf-sm20    up   idle 0/24/0/24
    scf-sm21    up   idle 0/24/0/24
    scf-sm22    up   idle 0/24/0/24
    scf-sm23    up   idle 0/24/0/24
   shadowfax    up   idle 0/48/0/48
       smaug    up   idle 0/64/0/64
   treebeard    up    mix 2/30/0/32

Here the A column indicates the number of cores in use (i.e., allocated), I the number of idle cores, O the number of cores otherwise unavailable, and T the total number of cores on the node.

To see the jobs currently running in a given partition, pass the partition name to squeue, e.g., squeue -p low.

How to Submit GPU Jobs

The SCF hosts a number of GPUs, available only by submitting a job through our SLURM scheduling software. The GPUs are quite varied in their hardware configurations (different generations of GPUs, with different speeds and amounts of GPU memory). We have documented the GPU servers to guide you in selecting which GPU you may want to use.

To use the GPUs, you need to submit a job via our SLURM scheduling software. In doing so, you need to specifically request that your job use the GPU as follows using the 'gres' flag:

arwen:~/Desktop$ sbatch --partition=gpu --gres=gpu:1 job.sh

Note that the partition to use will vary, as discussed in the sections below.

Once it starts your job will have exclusive access to the GPU and its memory. If another user is using the GPU, your job will be queued until the current job finishes.

Interactive jobs should use that same gres flag with the usual srun syntax for an interactive job.

arwen:~/Desktop$ srun --pty --partition=gpu --gres=gpu:1 /bin/bash

Given the heterogeneity in the GPUs available, you may want to request use of a specific GPU type. To do so, you can add the type to the 'gres' flag, e.g., requesting a K80 GPU:

arwen:~/Desktop$ sbatch --partition=high --gres=gpu:K80:1 job.sh

If you want to interactively logon to the GPU node to check on compute or memory use of an sbatch job that uses the GPU, find the job ID of your job using squeue and insert that job ID in place of '<jobID>' in the following command. This will give you an interactive job running in the context of your original job:

arwen:~/Desktop$ srun --pty --partition=gpu --jobid=<jobid> /bin/bash

and then use nvidia-smi commands, e.g.,

scf-sm20:~$ nvidia-smi -q -d UTILIZATION,MEMORY -l 1

There are many ways to set up your code to use the GPU.
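
As one minimal sketch of a GPU job script (train.py is simply a placeholder for your own GPU-enabled code), Slurm typically exposes the assigned GPU(s) to the job via the CUDA_VISIBLE_DEVICES environment variable:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
# Report which GPU(s) Slurm assigned to this job, then run the (placeholder) code.
echo "Assigned GPU(s): $CUDA_VISIBLE_DEVICES"
python train.py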

Four GPUs are generally available to all SCF users; these are hosted on the roo (1 GPU), scf-sm21 (2 older GPUs), and scf-sm20 (1 older GPU) servers.

The GPUs hosted on scf-sm20 and scf-sm21 are quite a bit older, and likely slower, than the GPU hosted on roo.

  • If you'd like to specifically request the roo GPU, you should submit to the gpu partition.
  • To submit to either scf-sm20 or scf-sm21, submit to the high partition.
    • To specifically request either a K20 (scf-sm20) or K80 (scf-sm21) GPU, you can use the syntax "--gres=gpu:K20:1" or "--gres=gpu:K80:1".

Additional GPUs have been obtained by the Steinhardt and Yu lab groups. Most of these GPUs have higher performance (either speed or GPU memory) than our standard GPUs.

Members of the lab group have priority access to the GPUs of their group. Other SCF users can submit jobs that use these GPUs but those jobs will be preempted (killed) if higher-priority jobs need access to the GPUs.

To submit jobs requesting access to these GPUs, you need to specify either the jsteinhardt or yugroup partitions. Here's an example:

arwen:~/Desktop$ sbatch --partition=jsteinhardt --gres=gpu:1 job.sh

To use multiple GPUs for a job (only possible when using a server with more than one GPU, namely scf-sm21, smaug, shadowfax, balrog, saruman, sunstone, rainbowquartz, smokyquartz, treebeard, and morgoth), simply change the number 1 after --gres=gpu to the number desired.
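
For example, to request two GPUs:

sbatch --partition=jsteinhardt --gres=gpu:2 job.sh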

The Steinhardt group has priority access to the balrog, shadowfax, sunstone, rainbowquartz, smokyquartz (8 GPUs each), saruman (8, eventually 10, GPUs), and smaug (2 GPUs) GPU servers. If you are in the group, simply submit jobs to the jsteinhardt partition and you will automatically preempt jobs by users not in the group if it is needed for your job to run.

In addition to the notes below, more details on optimal use of these servers can be obtained from the guide prepared by Steinhardt group members and the SCF and available by contacting one of us.

The smaug, saruman, and balrog GPUs have a lot of GPU memory and are primarily intended for training very large models (e.g., ImageNet not CIFAR10 or MNIST), but it is fine to use these GPUs for smaller problems if shadowfax, sunstone, rainbowquartz, and smokyquartz are busy.

By default, if you do not specify a GPU type or a particular GPU server, Slurm will try to run the job on shadowfax, sunstone, or smokyquartz, unless they are busy.

To request a specific GPU type, you can add that to the gres flag, e.g., here requesting an A100:

sbatch -p jsteinhardt --gres=gpu:A100:1 job.sh

If you need more than one CPU, please request that using the --cpus-per-task flag. The value you specify actually requests that number of hardware threads, but with the caveat that a given job is allocated all the threads on a given core to avoid contention between jobs for a given physical core. So the default of "-c 1" allocates one physical core and two hardware threads. Your CPU usage will be restricted to the number of threads you request. 

As an example, since shadowfax has 48 CPUs (actually 48 threads and 24 physical cores as discussed above) and 8 GPUs, there are 6 CPUs per GPU. You could request more than 6 CPUs per GPU for your job, but note that if other group members do the same, it's possible that the total number of CPUs may be fully used before all the GPUs are used. Similar considerations hold for balrog (96 CPUs and 8 GPUs), saruman (104 CPUs and 8 GPUs) and smaug (64 CPUs and 2 GPUs) as well as rainbowquartz, smokyquartz and sunstone (all with 64 CPUs and 8 GPUs). That said, that's probably a rather unlikely scenario.
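
For example, to pair one GPU with six CPUs (hardware threads):

sbatch -p jsteinhardt --gres=gpu:1 --cpus-per-task=6 job.sh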

To see what jobs are running on particular machines, so that you can have a sense of when a job that requests a particular machine might start: 

arwen:~> squeue -p jsteinhardt -o "%.9i %.20j %.12u %.2t %.11l %.11M %.11V %.5C %.8r %.6D %.20R %.13p %8q %b"
    JOBID                 NAME         USER ST  TIME_LIMIT        TIME SUBMIT_TIME  CPUS   REASON  NODES     NODELIST(REASON)      PRIORITY QOS      TRES_PER_NODE
  1077240                 bash nikhil_ghosh  R 28-00:00:00    12:51:48 2021-11-01T     1     None      1               balrog 0.00196710997 normal   gpu:1
  1077248              jupyter         awei  R 28-00:00:00     1:40:55 2021-11-01T     1     None      1               balrog 0.00092315604 preempti gpu:1
  1077121             train.sh andyzou_jiam  R 28-00:00:00  2-17:32:41 2021-10-29T    48     None      1               balrog 0.00027842330 preempti gpu:2

The Yu group has priority access to GPUs located on merry (1 GTX GPU), morgoth (2 TITAN GPUs), and treebeard (1 A100 GPU) servers. If you are in the group, simply submit jobs to the yugroup partition and you will automatically preempt jobs by users not in the group if it is needed for your job to run.

To request a specific GPU type, you can add that to the gres flag, e.g., here requesting an A100:

sbatch -p yugroup --gres=gpu:A100:1 job.sh

If you need more than one CPU, please request that using the --cpus-per-task flag, but note that merry only has four CPUs. As discussed above, the value you specify actually requests that number of hardware threads, with a given job allocated all the threads on a given physical core to avoid contention between jobs.
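
For example, to target the merry GPU specifically (using the -w flag shown earlier to name a node) while requesting all four of its CPUs:

sbatch -p yugroup -w merry --gres=gpu:1 --cpus-per-task=4 job.sh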

Please contact SCF staff or group members for more details.