Linux Cluster

Linux Cluster

The SCF operates a Linux cluster as our primary resource for users for computational jobs.

The cluster is managed by the SLURM queueing software. SLURM provides a standard batch queueing system through which users submit jobs to the cluster. Jobs are typically submitted to SLURM using a user-defined shell script that executes one's application code. Interactive use is also an option. Users may also query the cluster to see job status. As currently set up, the cluster is designed for processing single-core and multi-core/threaded jobs (at most 32 cores per job in the 'low' partition), as well as jobs that are set up to use multiple machines (aka 'nodes') within the cluster (i.e., distributed memory jobs). All software running on SCF Linux machines is available on the cluster. Users can also compile programs on any SCF Linux machine and then run that program on the cluster.

The cluster has a total of 656 cores, divided into 'partitions' of nodes of similar hardware/capabilities.

Below is more detailed information about how to use the cluster.


The cluster is open to Department of Statistics faculty, grad students, postdocs, and visitors using their SCF logon. Class account users do not have access by default, but instructors can email to discuss access for their class.

Currently users may submit jobs on the following submit hosts:

arwen, bellatrix, beren, gandalf, gollum, hermione, radagast, shelob


The cluster has 656 cores (essentially CPUs) spread across a number of nodes of varying hardware specifications and ages.

  • 256 of the cores are in the 'low' partition.
    • 8 nodes, 32 cores and 256 GB RAM per node.
    • By default jobs submitted to the cluster will run here.
  • 96 of the cores are in the 'high' partition.
    • 4 nodes, 12 cores (each with two hyperthreads) and 128 GB RAM per node.
    • Newer (but not new) and faster nodes than in the 'low' partition.
  • 160 cores are available on preemptible basis in the 'jsteinhardt' partition.
    • 1 node (smaug) has 32 cores (each with two hyperthreads) and 376 GB RAM. A second node (balrog) has 48 cores (each with two hyperthreads) and 792 GB RAM.
    • Newer and perhaps somewhat faster cores than in the 'high' partition.
    • Very fast disk I/O (using NVMe SSDs) to files located in /tmp and /var/tmp.
    • Jobs are subject to preemption at any time and will be cancelled in that case without warning (more details below).
  • 144 cores are in the 'lowmem' partition, intended to supplement our other partitions, particularly during heavy cluster usage.
    • 9 nodes, 16 cores and 12-16 GB RAM per node.
    • Older, slower nodes that are comparable to, or a bit slower than, the nodes in the 'low' partition.
    • Only suitable for jobs that have low memory needs, but you should feel free to request an entire node for jobs needing more memory, even if your code will only use one or a few cores.
Slurm Scheduler Setup and Job Restrictions

The cluster has multiple partitions, corresponding to groups of nodes. The different partitions have different hardware and job restrictions as discussed here:

Partition Max # cores per user (running) Time limit Max memory per job Max cores per job
low 256 28 days 256 GB 32***
high* 36 7 days 128 GB 24***
gpu* 8 CPU cores 28 days 6 GB (CPU) 8
jsteinhardt* 64 (smaug), 96 (balrog), 48 (shadowfax) 28 days** 376 GB (smaug), 792 GB (balrog), 132 GB (shadowfax) (CPU) varied
yugroup* varied 28 days** varied (CPU) varied
low_mem 144 28 days 12-16 GB 16***

* See How to Submit Jobs to the High Partitions or How to Submit GPU Jobs.

** Preemptible

*** If you use software that can parallelize across multiple nodes (e.g., R packages that use MPI or the future package, Python's Dask or IPython Parallel, MATLAB, MPI), you can run individual jobs across more than one node. See How to Submit Parallel Jobs.

We have implemented a 'fair share' policy that governs the order in which jobs that are waiting in a given queue start when resources become available. In particular, if two users each have a job sitting in a queue, the job that will start first will be that of the user who has made less use of the cluster recently (measured in terms of CPU time). The measurement of CPU time downweights usage over time, with a half-life of one month, so a job that ran a month ago will count half as much as a job that ran yesterday. Apart from this prioritization based on recent use, all users are treated equally.

How to Submit Single-core jobs

Prepare a shell script containing the instructions you would like the system to execute.

The instructions here are for the simple case of submitting a job without any parallelization; i.e., a job using a single core (CPU). When submitted using the instructions in this section, your code should only use a single core at a time; it should not start any additional processes. We also have extensive instructions for submitting parallelized jobs and automating the submission of multiple jobs.

For example a simple script to run an R program called 'simulate.R' would contain these lines:

R CMD BATCH --no-save simulate.R simulate.out

Once logged onto a submit host, use the sbatch command with the name of the shell script (assumed to be here) to enter a job into the queue:

arwen:~/Desktop$ sbatch
Submitted batch job 380

Here the job and assigned job ID 380. Results that would normally be printed to the screen via standard output and standard error will be written to a file called slurm-380.out.

SLURM provides a number of additional flags to control what happens; you can see the man page for sbatch for help with these. Here are some examples, placed in the job script file, where we name the job, ask for email updates and name the output and error files:

#SBATCH --job-name=myAnalysisName
#SBATCH --mail-type=ALL                       
#SBATCH -o myAnalysisName.out #File to which standard out will be written
#SBATCH -e myAnalysisName.err #File to which standard err will be written
R CMD BATCH --no-save simulate.R simulate.Rout


How to Submit Parallel Jobs

One can use SLURM to submit parallel code of a variety of types.

Submittings Jobs to the High Performance Partitions

High partition

To submit jobs to the faster nodes in the high partition, you must include either the '--partition=high' or '-p high' flag. By default jobs will be run in the low partition. For example:

arwen:~/Desktop$ sbatch -p high
Submitted batch job 380

You can also submit interactive jobs to the high partition, by simply adding the flag to specify that partition in your srun invocation.

Pre-emptible jobs using many cores, fast disk I/O, or large memory

The jsteinhardt partition has two nodes (smaug and balrog) with newer CPUs, a lot of memory, and very fast disk I/O to /tmp and /var/tmp using an NVMe SSD. These nodes are owned by a faculty member and are made available on a preemptible basis. Your job could be cancelled without warning at any time if researchers in the faculty member's group need to run a job using the cores/memory your job is using. However we don't expect this to happen very often given that the nodes have 64 (smaug) and 96 (balrog) cores that can be shared amongst jobs. For example to request use of either node, which are labelled as 'manycore' nodes:

arwen:~/Desktop$ sbatch -p jsteinhardt -C manycore
Submitted batch job 380

Jobs are cancelled when preemption happens. If you want your job to be automatically started again (i.e., started from the beginning) when the node becomes available you can add the "--requeue" flag when you submit via sbatch.

If you specifically want access to fast disk I/O use "-C nvme". If you want access to many cores, use "-C manycore". if you need a lot of memory, use "-C bigmem" for up to 392 GB and "-C xlmem" for up to 792 GB.

Also note that if you need more disk space on the NVMe SSD on smaug (but not balrog), we may be able to make available space on a much larger NVMe SSD (in /data) if you request it.

How to Kill a Job

First, find the job-id of the job, by typing squeue at the command line of a submit host (see How to Monitor Jobs).

Then use scancel to delete the job (with id 380 in this case):

scancel 380


Interactive Jobs

You can work interactively on a node from the Linux shell command line by starting a job in the interactive queue.

The syntax for requesting an interactive (bash) shell session is:

srun --pty /bin/bash

This will start a shell on one of the four nodes. You can then act as you would on any SCF Linux compute server. For example, you might use top to assess the status of one of your non-interactive (i.e., batch) cluster jobs. Or you might test some code before running it as a batch job. You can also transfer files to the local disk of the cluster node.

If you want to run a program that involves a graphical interface (requiring an X11 window), you need to add --x11 to your srun command. So you could directly run Matlab, e.g., as follows:

srun --pty --x11 matlab

or you could add the --x11 flag when requesting an interactive shell session and then subsequently start a program that has a graphical interface.

To run an interactive session in which you would like to use multiple cores, do the following (here we request 4 cores for our use):

srun --pty --cpus-per-task 4 /bin/bash

Note that "-c" is a shorthand for "--cpus-per-task".

To transfer files to the local disk of a specific node, you need to request that your interactive session be started on the node of interest (in this case scf-sm20):

srun --pty -p high -w scf-sm20 /bin/bash

Note that if that specific node has all its cores in use by other users, you will need to wait until resources become available on that node before your interactive session will start.

Finally, you can request multiple cores using -c, as with batch jobs. As with batch jobs, you can change OMP_NUM_THREADS from its default of one, provided you make sure that that the total number of cores used (number of processes your code starts multiplied by threads per process) does not exceed the number of cores you request.

How to Monitor Jobs

The SLURM command squeue provides info on job status:

arwen:~/Desktop> squeue
               381      high paciorek  R      25:28      1 scf-sm20
               380      high paciorek  R      25:37      1 scf-sm20

The following will tailor the output to include information on the number of cores (the CPUs column below) being used as well as other potentially useful information:

arwen:~/Desktop> squeue -o "%.7i %.9P %.20j %.8u %.2t %l %.9M %.5C %.8r %.6D %R %p %q %b"
     49       low                 bash paciorek  R 28-00:00:00   1:23:29     1     None      1 scf-sm00 0.00000017066486 normal (null)
     54       low              kbpew2v paciorek  R 28-00:00:00     11:01     1     None      1 scf-sm00 0.00000000488944 normal (null)

The 'ST' field indicates whether a job is running (R), failed (F), or pending (PD). The latter occurs when there are not yet enough resources on the system for your job to run.

Job output that would normally appear in your terminal session will be sent to a file named slurm-<jobid>.out where <jobid> will be the number of the job (visible via squeue as above).

If you would like to logon to the node on which your job is running in order to assess CPU or memory use, you can run an interactive job within the context of your existing job. First determine the job ID of your running job using squeue and insert that in place of <jobid> in the following command:

arwen:~/Desktop$ srun --pty --jobid=<jobid> /bin/bash

You can then run top and other tools. Please do not do any intensive computation that would use additional cores in addition to those used by your running job.

How to Monitor Cluster Usage

If you'd like to see how busy each node is (e.g., to choose what partition to submit a job to), you can run the following:

arwen:~/Desktop$ sinfo -N  -o "%.12N %.5a %.6t %C"
    scf-sm00    up  alloc 32/0/0/32
    scf-sm01    up  alloc 32/0/0/32
    scf-sm02    up  alloc 32/0/0/32
    scf-sm03    up  alloc 32/0/0/32
    scf-sm10    up  alloc 32/0/0/32
    scf-sm11    up  alloc 32/0/0/32
    scf-sm12    up  alloc 32/0/0/32
    scf-sm13    up    mix 27/5/0/32
scf-sm20-cpu    up   idle 0/22/0/22
scf-sm20-gpu    up   idle 0/2/0/2
scf-sm21-cpu    up   idle 0/20/0/20
scf-sm21-gpu    up   idle 0/4/0/4
    scf-sm22    up   idle 0/24/0/24
    scf-sm23    up   idle 0/24/0/24

Here the A column indicates the number of cores used (i.e., active), I indicates the number of inactive cores, and T the total number of cores on the node.

How to Submit GPU Jobs

The SCF hosts a number of GPUs, available only by submitting a job through our SLURM scheduling software. The GPUs are quite varied in their hardware configurations (different generations of GPUS, with different speeds and GPU memory). We have documented the GPU servers to guide you in selecting which GPU you may want to use.

To use the GPUs, you need to submit a job via our SLURM scheduling software. In doing so, you need to specifically request that your job use the GPU as follows using the 'gres' flag:

arwen:~/Desktop$ sbatch --partition=gpu --gres=gpu:1

Note that the partition to use will vary, as discussed in the other tabs given in this set of tabbed information.

Once it starts your job will have exclusive access to the GPU and its memory. If another user is using the GPU, your job will be queued until the current job finishes.

Interactive jobs should use that same gres flag with the usual srun syntax for an interactive job.

arwen:~/Desktop$ srun --pty --partition=gpu --gres=gpu:1 /bin/bash

Given the heterogenity in the GPUs available, you may want to request use of a specific GPU type. To do so, you can add the type to the 'gres' flag, e.g., requesting a K80 GPU:

arwen:~/Desktop$ sbatch --partition=high --gres=gpu:K80:1

If you want to interactively logon to the GPU node to check on compute or memory use of an sbatch job that uses the GPU, find the job ID of your job using squeue and insert that job ID in place of '<jobID>' in the following command. This will give you an interactive job running in the context of your original job:

arwen:~/Desktop$ srun --pty --partition=gpu --jobid=<jobid> /bin/bash

and then use nvidia-smi commands, e.g.,

scf-sm20:~$ nvidia-smi -q -d UTILIZATION,MEMORY -l 1

There are many ways to set up your code to use the GPU.

Four GPUs are generally available to all SCF users; these are hosted on the roo (1 GPU), scf-sm21 (2 older GPUs), and scf-sm20 (1 older GPU) servers.

The GPUs hosted on scf-sm20 and scf-sm21 are quite a bit older, and likely slower, than the GPU hosted on roo.

  • If you'd like to specifically request the roo GPU, you should submit to the gpu partition.
  • To submit to either scf-sm20 or scf-sm20, submit to the high partition.
    • To specifically request either a K20 (scf-sm20) or K80 (scf-sm21) GPU, you can use the syntax "--gres=gpu:K20:1" or "--gres=gpu:K80:1".

Additional GPUs have been obtained by the Steinhardt and Yu lab groups. Most of these GPUs have higher performance (either speed or GPU memory) than our standard GPUs.

Members of the lab group have priority access to the GPUs of their group. Other SCF users can submit jobs that use these GPUs but those jobs will be preempted (killed) if higher-priority jobs need access to the GPUs.

To submit jobs requesting access to these GPUs, you need to specify either the jsteinhardt or yugroup partitions. Here's an example:

arwen:~/Desktop$ sbatch --partition=jsteinhardt --gres=gpu:1

To use multiple GPUs for a job (only possible when using a server with more than one GPU, namely scf-sm21, smaug, shadowfax, balrog, and morgoth), simply change the number 1 after --gres=gpu to the number desired.

To request a specific GPU type, you can add that to the gres flag, e.g., here requesting an A100:

sbatch -p jsteinhardt --gres=gpu:A100:1

If you need more than one CPU, please request that using the --cpus-per-task flag.

The Steinhart group has priority access to the balrog (8 GPUs), shadowfax (8 GPUs) and smaug (2 GPUs) GPU servers. If you are in the group, simply submit jobs to the jsteinhardt partition and you will automatically preempt jobs by users not in the group if it is needed for your job to run.

In addition to the notes below, more details on optimal use of these servers can be obtained from the guide prepared by Steinhardt group members and the SCF and available by contacting one of us.

The smaug and balrog GPUs have a lot of GPU memory and are primarily intended for training very large models (e.g., ImageNet not CIFAR10 or MNIST).

To request a specific GPU type, you can add that to the gres flag, e.g., here requesting an A100:

sbatch -p jsteinhardt --gres=gpu:A100:1

If you need more than one CPU, please request that using the --cpus-per-task flag. Your CPU usage will be restricted to the number of CPUs you request, which defaults to one if you don't use the --cpus-per-task flag. Furthermore, each GPU server has hardware hyperthreading, so if you request one CPU you are really getting a single hyperthread (aka virtual core). To get the full physical core, you'd need to use '--cpus-per-task=2'.

As an example, since shadowfax has 48 CPUs (actually 48 virtual cores and 24 physical cores as discussed above) and 8 GPUs, there are 6 CPUs per GPU. You could request more than 6 CPUs per GPU for your job, but note that if other group members do the same, it's possible that the total number of CPUs may be fully used before all the GPUs are used. Similar considerations hold for balrog (96 CPUs and 8 GPUs) and smaug (64 CPUs and 2 GPUs). That said, that's probably a rather unlikely scenario.


The Yu group has priority access to GPUs located on merry (1 GTX GPU), morgoth (2 TITAN GPUs), and treebeard (1 A100 GPU) servers. If you are in the group, simply submit jobs to the yugroup partition and you will automatically preempt jobs by users not in the group if it is needed for your job to run.

To request a specific GPU type, you can add that to the gres flag, e.g., here requesting an A100:

sbatch -p yugroup --gres=gpu:A100:1

If you need more than one CPU, please request that using the --cpus-per-task flag, but note that merry only has four CPUs).

Please contact SCF staff or group members for more details.