Linux Cluster

The SCF operates a Linux cluster as our primary resource for users' computational jobs.

The cluster is managed by the SLURM queueing software. SLURM provides a standard batch queueing system through which users submit jobs to the cluster. Jobs are typically submitted to SLURM using a user-defined shell script that executes one's application code. Interactive use is also an option. Users may also query the cluster to see job status. As currently set up, the cluster is designed for processing single-core and multi-core/threaded jobs (at most 32 cores per job in the 'low' partition), as well as jobs that are set up to use multiple machines (aka 'nodes') within the cluster (i.e., distributed memory jobs). All software running on SCF Linux machines is available on the cluster. Users can also compile programs on any SCF Linux machine and then run that program on the cluster.

The cluster has a total of 1064 cores, divided into 'partitions' of nodes of similar hardware/capabilities. In our newer machines, each 'core' is actually a hardware thread, as we have hyperthreading enabled, so the number of physical cores per machine is generally half that indicated in the Hardware tab below.

Below is more detailed information about how to use the cluster.

Quick Start

This is a quick-start guide, intended to show the basic commands to get up and running on the cluster.
The other tabs on this page provide detailed documentation on the SCF cluster.

This guide covers use of the cluster after connecting to the SCF via ssh. You can also use the SCF JupyterHub to start Jupyter notebooks, RStudio sessions, remote desktop sessions, terminal sessions, and more.

Connecting to the SCF

First, ssh to one of the SCF's standalone Linux servers. For example, from a terminal on your Mac or Linux machine:

ssh gandalf.berkeley.edu

Or use PuTTY or MobaXterm from your Windows machine.

All jobs that you run on the cluster need to be submitted via the Slurm scheduling software from one of our Linux servers. Slurm sends each job to run on one or more machines (aka nodes) in one of the partitions of the cluster. A partition is a collection of machines with similar hardware.

Checking Cluster Usage

Once you are logged onto an SCF server, you can submit a job (assuming you are in a group that has permission to use the cluster).

But first we'll check usage on the cluster to get a sense for how busy each partition is:

paciorek@gandalf:~> sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
low*           up 28-00:00:0      8    mix scf-sm[00-03,10-13]
high           up 7-00:00:00      2    mix scf-sm[22-23]
high           up 7-00:00:00      2   idle scf-sm[20-21]
lowmem         up 28-00:00:0      3  down$ boromir,pooh,toad
lowmem         up 28-00:00:0      1  maint gollum
lowmem         up 28-00:00:0      3   idle ghan,hermione,shelob
gpu            up 28-00:00:0      1   idle roo
jsteinhardt    up 28-00:00:0      6    mix balrog,rainbowquartz,saruman,shadowfax,smokyquartz,sunstone
jsteinhardt    up 28-00:00:0      1   idle smaug
yss            up 28-00:00:0      1   idle luthien
yugroup        up 28-00:00:0      1    mix treebeard
yugroup        up 28-00:00:0      2   idle merry,morgoth

Alternatively, we have a wrapper script snodes that shows usage on a per-node basis:

paciorek@gandalf:~> snodes
      NODELIST    PARTITION AVAIL  STATE CPUS(A/I/O/T)
        balrog  jsteinhardt    up    mix 50/46/0/96
<snip>
      scf-sm00         low*    up    mix 24/8/0/32
      scf-sm01         low*    up    mix 25/7/0/32
<snip>
      scf-sm02         low*    up    mix 24/8/0/32
      scf-sm03         low*    up    mix 28/4/0/32
      scf-sm10         low*    up    mix 30/2/0/32
      scf-sm11         low*    up    mix 28/4/0/32
      scf-sm12         low*    up    mix 24/8/0/32
<snip>

Here 'A' is for 'active' (i.e., in use), 'I' is for 'idle' (i.e., not in use), 'O' is for 'other' (CPUs unavailable, e.g., on a node that is down), and 'T' is for the 'total' number of CPUs (aka cores) on the node.

Interactive Jobs with srun

We can start an interactive job with srun. This will start an interactive session that can use a single core (aka CPU) on one of the machines in our (default) low partition, which contains older machines. You'll see the prompt change, indicating you're now running on one of the cluster machines. 

paciorek@gandalf:~> srun --pty bash
paciorek@scf-sm10:~> 

When you're done with your computation, make sure to exit out of the interactive session:

paciorek@scf-sm10:~> exit

We can instead start a job on the newer machines in the high partition:

paciorek@gandalf:~> srun -p high --pty bash

And we can start a job that needs four cores on a single machine using the cpus-per-task (-c) flag:

paciorek@gandalf:~> srun -c 4 --pty bash

Interactive jobs might take a while to start if the machines are busy with other users' jobs.

Batch jobs with sbatch

To run a background job, you need to create a job script. Here's a simple one that requests four cores and then runs a Python script (which should be set up to take advantage of four cores via parallelization in some fashion):

#! /bin/bash
#SBATCH -c 4
python code.py > code.pyout

When you submit the job, it will show you the job id. Here we assume the code above is in a file job.sh:

paciorek@gandalf:~> sbatch job.sh
Submitted batch job 47139

Monitoring

You can now check if the job is running:

paciorek@gandalf:~> squeue -u $(whoami)
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             41377      high jupyterh paciorek  R 3-21:55:42      1 scf-sm22
             47139       low   job.sh paciorek  R       0:46      1 scf-sm11

If you see `R` under the `ST` column, the job is running. If you see `PD` it is waiting in the queue.

Or you can use our wrapper script to see a richer set of information about the running job(s):

paciorek@gandalf:~> sjobs -u $(whoami)
         JOBID         USER                 NAME    PARTITION QOS      ST         REASON  TIME_LIMIT        TIME SUBMIT_TIME  CPUS  NODES     NODELIST(REASON)  PRIORITY FEATURES TRES_PER_NODE
         41377     paciorek           jupyterhub         high normal    R           None  7-00:00:00  3-22:39:36 2022-11-17T     1      1             scf-sm22 0.0022660 (null) N/A
         47139     paciorek               job.sh          low normal    R           None 28-00:00:00       44:40 2022-11-21T     4      1             scf-sm11 0.0022545 (null) N/A

To better understand CPU and memory use by the job, you can connect to the node the job is running on and then use commands like top and ps:

paciorek@gandalf:~> srun --jobid 47139 --pty bash
paciorek@scf-sm10:~> top    # use Ctrl-C to exit top
paciorek@scf-sm10:~> exit

To cancel a running job:

paciorek@gandalf:~> scancel 47139

To check details of recently-completed jobs:

paciorek@gandalf:~> sacct -u $(whoami) -S 2022-11-01  # jobs started since Nov 1, 2022
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
47138              bash        low       site          1  COMPLETED      0:0 
47138.extern     extern                  site          1  COMPLETED      0:0 
47138.0            bash                  site          1  COMPLETED      0:0 
47139            job.sh        low       site          4    RUNNING      0:0 
47139.batch       batch                  site          4    RUNNING      0:0 
47139.extern     extern                  site          4    RUNNING      0:0 

Alternatively, use our wrapper for a richer set of information. Of particular note, the MaxRSS column shows the maximum memory use.

paciorek@gandalf:~> shist -S 2022-10-01   # using our wrapper
     User        JobID    JobName  Partition    Account  AllocCPUS      State     MaxRSS ExitCode              Submit               Start                 End    Elapsed  Timelimit        NodeList 
--------- ------------ ---------- ---------- ---------- ---------- ---------- ---------- -------- ------------------- ------------------- ------------------- ---------- ---------- --------------- 
 paciorek 47138              bash        low       site          1  COMPLETED                 0:0 2022-11-21T15:10:19 2022-11-21T15:10:19 2022-11-21T15:10:22   00:02:03 28-00:00:+        scf-sm10 
          47138.extern     extern                  site          1  COMPLETED    902352K      0:0 2022-11-21T15:10:19 2022-11-21T15:10:19 2022-11-21T15:10:22   00:03:03                   scf-sm10 
          47138.0            bash                  site          1  COMPLETED       540K      0:0 2022-11-21T15:10:19 2022-11-21T15:10:19 2022-11-21T15:10:22   00:03:03                   scf-sm10 
 paciorek 47139            job.sh        low       site          4    RUNNING                 0:0 2022-11-21T15:51:04 2022-11-21T15:51:05             Unknown   00:04:37 28-00:00:+        scf-sm11 
          47139.batch       batch                  site          4    RUNNING                 0:0 2022-11-21T15:51:05 2022-11-21T15:51:05             Unknown   00:04:37                   scf-sm11 
          47139.extern     extern                  site          4    RUNNING                 0:0 2022-11-21T15:51:05 2022-11-21T15:51:05             Unknown   00:04:37                   scf-sm11 

Parallelization

Use the -c flag if you need multiple cores that must all be on a single node.

In some cases your code may be able to parallelize across the cores on multiple nodes. In this case you should use the --ntasks (-n) flag. Here's a job script in which we ask for 20 cores, which may be provided by Slurm on one or more machines:

#! /bin/bash
#SBATCH -n 20
python code.py > code.pyout

It only makes sense to use -n if your code is set up to be able to run across multiple machines (if you're not sure, it's likely that it won't run on multiple machines).

You can use both -c and -n in some cases, such as if you wanted to run Python or R in parallel with multiple processes and have each process use multiple cores for threaded linear algebra. Here's a job script asking for three tasks (suitable for starting three worker processes) and four cores per task (suitable for running four software threads per process):

#! /bin/bash
#SBATCH -n 3
#SBATCH -c 4
python code.py > code.pyout

Slurm sets various environment variables inside the context of your running job that you can then use in your code so that your code adapts to whatever resources you requested. This is much better than hard-coding things like the number of workers that you want to use in your parallelized code. 

Here are a few that can be useful:

env | grep SLURM
<snip>
SLURM_CPUS_PER_TASK=4
SLURM_NTASKS=3
SLURM_NNODES=2
SLURM_NODELIST=scf-sm[11-12]
<snip>

The job is using cores on two nodes, with four CPUs for each of three tasks. Note that only code that is set up to run across multiple machines will work in this situation.
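For instance, here is a minimal sketch of a job script that passes these values to the program rather than hard-coding them (the --workers and --threads-per-worker arguments are hypothetical and would need to be handled by your own code):

#!/bin/bash
#SBATCH -n 3
#SBATCH -c 4
# Let the code adapt to whatever was requested rather than hard-coding the worker and thread counts.
python code.py --workers "$SLURM_NTASKS" --threads-per-worker "$SLURM_CPUS_PER_TASK" > code.pyout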
 

Access

The cluster is open to Department of Statistics faculty, grad students, postdocs, and visitors using their SCF logon. Class account users do not have access by default, but instructors can email manager@stat.berkeley.edu to discuss access for their class.

Currently users may submit jobs on the following submit hosts:

arwen, gandalf, gollum, hermione, radagast, shelob

Hardware

The cluster has 1064 logical cores (essentially CPUs) spread across a number of nodes of varying hardware specifications and ages. For our newer machines, each 'core' is actually a hardware thread, as we have hyperthreading enabled, so the number of physical cores per machine is generally half that indicated here. If you'd like to run multi-core jobs without hyperthreading, please contact us for work-arounds.

  • 256 of the cores are in the 'low' partition.
    • 8 nodes with 32 cores and 256 GB RAM per node.
    • By default jobs submitted to the cluster will run here.
  • 96 of the cores are in the 'high' partition.
    • 4 nodes with 24 cores (12 physical cores) and 128 GB RAM per node.
    • Newer (but not new) and faster nodes than in the 'low' partition.
  • 504 cores are available on preemptible basis in the 'jsteinhardt' partition.
    • Many cores and (for some nodes) large memory on various nodes:
      • smaug has 64 cores (32 physical cores) and 288 GB RAM.
      • balrog has 96 cores (48 physical cores) and 792 GB RAM.
      • sunstone and smokyquartz each have 64 cores (32 physical cores) and 128 GB RAM.
      • rainbowquartz has 64 cores (32 physical cores) and 792 GB RAM.
      • saruman has 104 cores (52 physical cores) and 1 TB RAM.
    • Newer and perhaps somewhat faster cores than in the 'high' partition.
    • Very fast disk I/O (using NVMe SSDs) to files located in /tmp and /var/tmp.
    • Jobs are subject to preemption at any time and will be cancelled in that case without warning (more details below).
  • Additional GPU servers in the 'yugroup' and 'yss' partitions.
  • 144 cores are in the 'lowmem' partition, intended to supplement our other partitions, particularly during heavy cluster usage.
    • 9 nodes, 16 cores and 12-16 GB RAM per node.
    • Older, slower nodes that are comparable to, or a bit slower than, the nodes in the 'low' partition.
    • Only suitable for jobs that have low memory needs, but you should feel free to request an entire node for jobs needing more memory, even if your code will only use one or a few cores.

Slurm Scheduler Setup and Job Restrictions

The cluster has multiple partitions, corresponding to groups of nodes. The different partitions have different hardware and job restrictions as discussed here:

Partition | Max # cores per user (running) | Time limit | Max memory per job | Max cores per job
low | 256 | 28 days | 256 GB | 32***
high* | 36 | 7 days | 128 GB | 24***
gpu* | 8 CPU cores | 28 days | 6 GB (CPU) | 8
jsteinhardt* | varied | 28 days** | 288 GB (smaug), 792 GB (balrog, rainbowquartz), 1 TB (saruman), 128 GB (various) (CPU) | varied
yugroup* | varied | 28 days** | varied (CPU) | varied
yss* | 32 | 28 days** | 528 GB (CPU) | 32
lowmem | 144 | 28 days | 12-16 GB | 16***

* See How to Submit Jobs to the High Partitions or How to Submit GPU Jobs.

** Preemptible

*** If you use software that can parallelize across multiple nodes (e.g., R packages that use MPI or the future package, Python's Dask or IPython Parallel, MATLAB, MPI), you can run individual jobs across more than one node. See How to Submit Parallel Jobs.

How to Submit Single-Core Jobs

Prepare a shell script containing the instructions you would like the system to execute. 

The instructions here are for the simple case of submitting a job without any parallelization; i.e., a job using a single core (CPU). When submitted using the instructions in this section, such jobs will have access to only a single core at a time. We also have extensive instructions for submitting parallelized jobs and automating the submission of multiple jobs.

For example a simple script to run an R program called 'simulate.R' would contain these lines:

#!/bin/bash
R CMD BATCH --no-save simulate.R simulate.out

Once logged onto a submit host, navigate to a directory within your home or scratch directory (i.e., make sure your working directory is not in /tmp or /var/tmp) and use the sbatch command with the name of the shell script (assumed to be job.sh here) to enter a job into the queue:

arwen:~/Desktop$ sbatch job.sh
Submitted batch job 380

Here the job was assigned job ID 380. Results that would normally be printed to the screen via standard output and standard error will be written to a file called slurm-380.out.

If you have many single-core jobs to run, there are various ways to automate submitting such jobs.
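One common approach is a Slurm job array; here is a minimal sketch, in which the use of the array index as a command-line argument to a simulate.R script is just an illustration:

#!/bin/bash
#SBATCH --array=1-100
# Slurm runs one copy of this script per array index; each copy sees its own value of SLURM_ARRAY_TASK_ID.
R CMD BATCH --no-save "--args $SLURM_ARRAY_TASK_ID" simulate.R simulate-$SLURM_ARRAY_TASK_ID.Rout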

Note that Slurm is configured such that single-core jobs will have access to a single physical core (including both hyperthreads on our new machines), so there won't be any contention between the two threads on a physical core. However, if you have many single-core jobs to run on the high, jsteinhardt, or yugroup partitions, you might improve your throughput by modifying your workflow so that you run one job per hyperthread rather than one job per physical core. You could do this by taking advantage of parallelization strategies in R, Python, or MATLAB to distribute tasks across workers in a single job, or you could use GNU parallel or srun within sbatch, as sketched below.
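Here is a rough sketch of the GNU parallel approach, assuming a hypothetical process.py that handles one input file per invocation:

#!/bin/bash
#SBATCH -c 8
# Request 8 CPUs (hardware threads) and keep one single-core task running on each;
# GNU parallel starts the next task as soon as one finishes.
parallel -j "$SLURM_CPUS_PER_TASK" python process.py {} ::: input_*.txt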

Slurm provides a number of additional flags to control what happens; you can see the man page for sbatch for help with these. Here are some examples, placed in the job script file, where we name the job, ask for email updates and name the output and error files:

#!/bin/bash
#SBATCH --job-name=myAnalysisName
#SBATCH --mail-type=ALL                       
#SBATCH --mail-user=blah@berkeley.edu
#SBATCH -o myAnalysisName.out #File to which standard out will be written
#SBATCH -e myAnalysisName.err #File to which standard err will be written
R CMD BATCH --no-save simulate.R simulate.Rout

 

How to Submit Parallel Jobs

One can use SLURM to submit parallel code of a variety of types.

Submitting Jobs to the High Performance Partitions

High partition

To submit jobs to the faster nodes in the high partition, you must include either the '--partition=high' or '-p high' flag. By default jobs will be run in the low partition. For example:

arwen:~/Desktop$ sbatch -p high job.sh
Submitted batch job 380

You can also submit interactive jobs to the high partition, by simply adding the flag to specify that partition in your srun invocation.

Pre-emptible jobs using many cores, fast disk I/O, or large memory

The jsteinhardt partition has various nodes with newer CPUs, a lot of memory, and very fast disk I/O to /tmp and /var/tmp using an NVMe SSD. These nodes are owned by a faculty member and are made available on a preemptible basis. Your job could be cancelled without warning at any time if researchers in the faculty member's group need to run a job using the cores/memory your job is using. However, we don't expect this to happen too often, given that the nodes have 64 cores (96 and 104 cores in the cases of balrog and saruman, respectively) that can be shared amongst jobs. For example, to request use of one of these nodes, which are labelled as 'manycore' nodes:

arwen:~/Desktop$ sbatch -p jsteinhardt -C manycore job.sh
Submitted batch job 380

Jobs are cancelled when preemption happens. If you want your job to be automatically started again (i.e., started from the beginning) when the node becomes available, you can add the "--requeue" flag when you submit via sbatch.
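For example, to submit a preemptible job that will be automatically requeued if it is preempted:

sbatch --requeue -p jsteinhardt -C manycore job.sh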

You can request specific resources as follows:

  • "-C fasttmp" for access to fast disk I/O in /tmp and /var/tmp,
  • "-C manycore" for access to many (64 or more) cores,
  • "-C mem256g" for up to 256 GB CPU memory,
  • "-C mem768g" for up to 768 GB CPU memory, and
  • "-C mem1024g" for up to 1024 GB CPU memory.
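For example, to request a node with at least 768 GB of CPU memory:

sbatch -p jsteinhardt -C mem768g job.sh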

Also note that if you need more disk space on the NVMe SSD on some but not all of these nodes, we may be able to make available space on a much larger NVMe SSD if you request it.

When will my job run or why is it not starting?

The cluster is managed using the Slurm scheduling software. We configure Slurm to try to balance the needs of the various cluster users.

Often there may be enough available CPU cores (aka 'resources') on the partition, and your job will start immediately after you submit it.

However, there are various reasons a job may take a while to start. Here are some details of how the scheduler works.

  • If there aren't enough resources in a given partition to run a job when it is submitted, it goes into the queue. The queue is sorted based on how much CPU time you've used over the past few weeks, using the 'fair share' policy described below. Your jobs will be moved to a position in the queue below the jobs of users who have used less CPU time in recent weeks. This happens dynamically, so another user can submit a job well after you have submitted your job, and that other user's job can be placed higher in the queue at that time and start sooner than your job. If this were not the case, imagine the situation of a user sending 100s of big jobs to the system: they get into the queue, and everyone submitting after that has to wait for a long time while those jobs are run, if other jobs can't get moved above those jobs in the queue.
  • When a job at the top of the queue needs multiple CPU cores (or in some cases an entire node), then jobs submitted to that same partition that are lower in the queue will not start even if there are enough CPU cores available for those lower priority jobs. That's because the scheduler is trying to accumulate resources for the higher priority job(s) and guarantee that the higher priority jobs' start times wouldn't be pushed back by running the lower priority jobs.
  • In some cases, if the scheduler has enough information about how long jobs will run, it will run lower-priority jobs on available resources when it can do so without affecting when a higher priority job would start. This is called backfill. It can work well on some systems but on the SCF and EML it doesn't work all that well because we don't require that users specify their jobs' time limits carefully. We made that tradeoff to make the cluster more user-friendly at the expense of optimizing throughput.
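Relatedly, if you do have a reasonable sense of how long your job will take, you can optionally tell the scheduler via the --time flag, which gives backfill a better chance of starting your job early (the two-hour limit here is just an illustration; a job that exceeds the limit you give will be cancelled):

sbatch --time=2:00:00 job.sh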

The 'fair share' policy governs the order of jobs that are waiting in the queue for resources to become available. In particular, if two users each have a job sitting in a queue, the job that will start first will be that of the user who has made less use of the cluster recently (measured in terms of CPU time). The measurement of CPU time downweights usage over time, with a half-life of one month, so a job that ran a month ago will count half as much as a job that ran yesterday. Apart from this prioritization based on recent use, all users are treated equally.

How to Kill a Job

First, find the job-id of the job, by typing squeue at the command line of a submit host (see How to Monitor Jobs).

Then use scancel to delete the job (with id 380 in this case):

scancel 380

 

Interactive Jobs

You can work interactively on a node from the Linux shell command line by starting an interactive job with srun.

The syntax for requesting an interactive (bash) shell session is:

srun --pty /bin/bash

This will start a shell on one of the nodes in the default (low) partition. You can then act as you would on any SCF Linux compute server. For example, you might use top to assess the status of one of your non-interactive (i.e., batch) cluster jobs. Or you might test some code before running it as a batch job. You can also transfer files to the local disk of the cluster node.

If you want to run a program that involves a graphical interface (requiring an X11 window), you need to add --x11 to your srun command. So you could directly run Matlab, e.g., as follows:

srun --pty --x11 matlab

or you could add the --x11 flag when requesting an interactive shell session and then subsequently start a program that has a graphical interface.

To run an interactive session in which you would like to use multiple cores, do the following (here we request 4 cores for our use):

srun --pty --cpus-per-task 4 /bin/bash

Note that "-c" is a shorthand for "--cpus-per-task".

To transfer files to the local disk of a specific node, you need to request that your interactive session be started on the node of interest (in this case scf-sm20):

srun --pty -p high -w scf-sm20 /bin/bash

Note that if that specific node has all its cores in use by other users, you will need to wait until resources become available on that node before your interactive session will start.

Finally, as with batch jobs, you can request multiple cores using -c, and you can change OMP_NUM_THREADS from its default of one, provided you make sure that the total number of cores used (the number of processes your code starts multiplied by the threads per process) does not exceed the number of cores you request.
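For example, after starting an interactive session with -c 4, you might set the thread count to match your request before launching your program (a sketch; whether this has any effect depends on whether your code uses OpenMP-threaded routines such as threaded linear algebra):

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK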

How to Monitor Jobs

The SLURM command squeue provides info on job status:

arwen:~/Desktop> squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               381      high   job.sh paciorek  R      25:28      1 scf-sm20
               380      high   job.sh paciorek  R      25:37      1 scf-sm20

The following will tailor the output to include information on the number of cores (the CPUs column below) being used as well as other potentially useful information:

arwen:~/Desktop> squeue -o "%.7i %.9P %.20j %.8u %.2t %l %.9M %.5C %.8r %.6D %R %p %q %b"
   JOBID PARTITION                 NAME     USER ST TIME_LIMIT      TIME  CPUS   REASON  NODES NODELIST(REASON) PRIORITY QOS GRES
     49       low                 bash paciorek  R 28-00:00:00   1:23:29     1     None      1 scf-sm00 0.00000017066486 normal (null)
     54       low              kbpew2v paciorek  R 28-00:00:00     11:01     1     None      1 scf-sm00 0.00000000488944 normal (null)

The 'ST' field indicates whether a job is running (R), failed (F), or pending (PD). The latter occurs when there are not yet enough resources on the system for your job to run.

Job output that would normally appear in your terminal session will be sent to a file named slurm-<jobid>.out where <jobid> will be the number of the job (visible via squeue as above).

If you would like to login to the node on which your job is running in order to assess CPU or memory use, you can run an interactive job within the context of your existing job. First determine the job ID of your running job using squeue and insert that in place of <jobid> in the following command:

arwen:~/Desktop$ srun --pty --jobid=<jobid> /bin/bash

You can then run top and other tools. 

To see a history of your jobs (within a time range), including reasons they might have failed:

sacct  --starttime=2021-04-01 --endtime=2021-04-30 \
--format JobID,JobName,Partition,Account,AllocCPUS,State%30,ExitCode,Submit,Start,End,NodeList,MaxRSS

The MaxRSS column indicates memory usage, which can be very useful.

How to Monitor Cluster Usage

If you'd like to see how busy each node is (e.g., to choose what partition to submit a job to), you can run the following:

arwen:~/Desktop$ sinfo -p low,high,jsteinhardt,yugroup -N -o "%.12N %.5a %.6t %C"
    NODELIST AVAIL  STATE CPUS(A/I/O/T)
      balrog    up    mix 50/46/0/96
       merry    up   idle 0/4/0/4
     morgoth    up    mix 1/11/0/12
    scf-sm00    up   idle 0/32/0/32
    scf-sm01    up   idle 0/32/0/32
    scf-sm02    up    mix 31/1/0/32
    scf-sm03    up   idle 0/32/0/32
    scf-sm10    up   idle 0/32/0/32
    scf-sm11    up   idle 0/32/0/32
    scf-sm12    up   idle 0/32/0/32
    scf-sm13    up   idle 0/32/0/32
    scf-sm20    up   idle 0/24/0/24
    scf-sm21    up   idle 0/24/0/24
    scf-sm22    up   idle 0/24/0/24
    scf-sm23    up   idle 0/24/0/24
   shadowfax    up   idle 0/48/0/48
       smaug    up   idle 0/64/0/64
   treebeard    up    mix 2/30/0/32

Here the A column indicates the number of cores in use (i.e., active), I indicates the number of idle cores, and T the total number of cores on the node.

To see the jobs running in a particular partition, you can pass the partition name to squeue via the -p flag.
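For example, for the low partition (substitute whichever partition you are interested in):

squeue -p low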

 

Useful Slurm commands

We've prepared a set of shortcut commands that wrap around `srun`, `squeue`, `sinfo`, and `sacct` with some commonly-used options:

  • `slogin`: starts an interactive shell on a cluster node
  • `snodes`: prints the current usage of nodes on the cluster
  • `sjobs`: lists running jobs on the cluster
  • `shist`: provides information about completed (including failed) jobs
  • `sassoc`: gives information about user access to cluster partitions

For each of these commands, you can add the `-h` flag to see how to use them. For example:

gandalf:~$ slogin -h
Usages:
'slogin' to start an interactive job
'slogin <jobid>' to start a shell on the node a job is running on
'slogin <additional_arguments_to_run>' to start an interactive job with additional arguments to srun

How to Submit GPU Jobs

The SCF hosts a number of GPUs, available only by submitting a job through our SLURM scheduling software. The GPUs are quite varied in their hardware configurations (different generations of GPUs, with different speeds and amounts of GPU memory). We have documented the GPU servers to guide you in selecting which GPU you may want to use.

To use the GPUs, you need to submit a job via our SLURM scheduling software. In doing so, you need to specifically request that your job use the GPU via the 'gres' flag, as follows:

arwen:~/Desktop$ sbatch --partition=gpu --gres=gpu:1 job.sh

Note that the partition to use will vary, as discussed in the other tabs on this page.

Once it starts, your job will have exclusive access to the GPU and its memory. If another user is using the GPU, your job will be queued until the current job finishes.

Interactive jobs should use the same gres flag with the usual srun syntax for an interactive job:

arwen:~/Desktop$ srun --pty --partition=gpu --gres=gpu:1 /bin/bash

Given the heterogeneity in the GPUs available, you may want to request use of a specific GPU type. To do so, you can add the type to the 'gres' flag, e.g., requesting a K80 GPU:

arwen:~/Desktop$ sbatch --partition=high --gres=gpu:K80:1 job.sh

If you want to interactively logon to the GPU node to check on compute or memory use of an sbatch job that uses the GPU, find the job ID of your job using squeue and insert that job ID in place of '<jobID>' in the following command. This will give you an interactive job running in the context of your original job:

arwen:~/Desktop$ srun --pty --partition=gpu --jobid=<jobid> /bin/bash

and then use nvidia-smi commands, e.g.,

scf-sm20:~$ nvidia-smi -q -d UTILIZATION,MEMORY -l 1

There are many ways to set up your code to use the GPU.

Four GPUs are generally available to all SCF users; these are hosted on the roo (1 GPU), scf-sm21 (2 older GPUs), and scf-sm20 (1 older GPU) servers.

The GPUs hosted on scf-sm20 and scf-sm21 are quite a bit older, and likely slower, than the GPU hosted on roo.

  • If you'd like to specifically request the roo GPU, you should submit to the gpu partition.
  • To submit to either scf-sm20 or scf-sm21, submit to the high partition.
    • To specifically request either a K20 (scf-sm20) or K80 (scf-sm21) GPU, you can use the syntax "--gres=gpu:K20:1" or "--gres=gpu:K80:1".

Additional GPUs have been obtained by the Steinhardt, Song, and Yu lab groups. Most of these GPUs have higher performance (either speed or GPU memory) than our standard GPUs.

Members of the lab group have priority access to the GPUs of their group. Other SCF users can submit jobs that use these GPUs but those jobs will be preempted (killed) if higher-priority jobs need access to the GPUs.

To submit jobs requesting access to these GPUs, you need to specify either the jsteinhardt, yss or yugroup partitions. Here's an example:

arwen:~/Desktop$ sbatch --partition=jsteinhardt --gres=gpu:1 job.sh

To use multiple GPUs for a job (only possible when using a server with more than one GPU, namely scf-sm21, smaug, shadowfax, balrog, saruman, sunstone, rainbowquartz, smokyquartz, treebeard, and morgoth), simply change the number 1 after --gres=gpu to the number desired.

The Steinhardt group has priority access to the balrog, shadowfax, sunstone, rainbowquartz, smokyquartz (8 GPUs each), saruman (8, eventually 10, GPUs), and smaug (2 GPUs) GPU servers. If you are in the group, simply submit jobs to the jsteinhardt partition and you will automatically preempt jobs by users not in the group if it is needed for your job to run.

In addition to the notes below, more details on optimal use of these servers can be found in the guide prepared by Steinhardt group members and the SCF, which is available by contacting one of us.

The smaug, saruman, and balrog GPUs have a lot of GPU memory and are primarily intended for training very large models (e.g., ImageNet not CIFAR10 or MNIST), but it is fine to use these GPUs for smaller problems if shadowfax, sunstone, rainbowquartz, and smokyquartz are busy.

By default, if you do not specify a GPU type or a particular GPU server, Slurm will try to run the job on shadowfax, sunstone, or smokyquartz, unless they are busy.

To request a specific GPU type, you can add that to the gres flag, e.g., here requesting an A100:

sbatch -p jsteinhardt --gres=gpu:A100:1 job.sh

If you need more than one CPU, please request that using the --cpus-per-task flag. The value you specify actually requests that number of hardware threads, but with the caveat that a given job is allocated all the threads on a given core to avoid contention between jobs for a given physical core. So the default of "-c 1" allocates one physical core and two hardware threads. Your CPU usage will be restricted to the number of threads you request. 
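For example, to request one GPU along with four CPUs (hardware threads):

sbatch -p jsteinhardt --gres=gpu:1 -c 4 job.sh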

As an example, since shadowfax has 48 CPUs (actually 48 threads and 24 physical cores as discussed above) and 8 GPUs, there are 6 CPUs per GPU. You could request more than 6 CPUs per GPU for your job, but note that if other group members do the same, it's possible that the total number of CPUs may be fully used before all the GPUs are used. Similar considerations hold for balrog (96 CPUs and 8 GPUs), saruman (104 CPUs and 8 GPUs) and smaug (64 CPUs and 2 GPUs) as well as rainbowquartz, smokyquartz and sunstone (all with 64 CPUs and 8 GPUs). That said, that's probably a rather unlikely scenario.

To see what jobs are running on particular machines, so that you can have a sense of when a job that requests a particular machine might start: 

arwen:~> squeue -p jsteinhardt -o "%.9i %.20j %.12u %.2t %.11l %.11M %.11V %.5C %.8r %.6D %.20R %.13p %8q %b"
    JOBID                 NAME         USER ST  TIME_LIMIT        TIME SUBMIT_TIME  CPUS   REASON  NODES     NODELIST(REASON)      PRIORITY QOS      TRES_PER_NODE
  1077240                 bash nikhil_ghosh  R 28-00:00:00    12:51:48 2021-11-01T     1     None      1               balrog 0.00196710997 normal   gpu:1
  1077248              jupyter         awei  R 28-00:00:00     1:40:55 2021-11-01T     1     None      1               balrog 0.00092315604 preempti gpu:1
  1077121             train.sh andyzou_jiam  R 28-00:00:00  2-17:32:41 2021-10-29T    48     None      1               balrog 0.00027842330 preempti gpu:2

 

The Yu group has priority access to GPUs located on merry (1 GTX GPU), morgoth (2 TITAN GPUs), and treebeard (1 A100 GPU) servers. If you are in the group, simply submit jobs to the yugroup partition and you will automatically preempt jobs by users not in the group if it is needed for your job to run.

To request a specific GPU type, you can add that to the gres flag, e.g., here requesting an A100:

sbatch -p yugroup --gres=gpu:A100:1 job.sh

If you need more than one CPU, please request that using the --cpus-per-task flag, but note that merry only has four CPUs. The value you specify actually requests that number of hardware threads, but with the caveat that a given job is allocated all the threads on a given core to avoid contention between jobs for a given physical core. 

Please contact SCF staff or group members for more details.

The Song group has priority access to the GPUs located on luthien (4 A100 GPUs). If you are in the group, simply submit jobs to the yss partition and you will automatically preempt jobs by users not in the group if it is needed for your job to run.

If you need more than one CPU, please request that using the --cpus-per-task flag. The value you specify actually requests that number of hardware threads, but with the caveat that a given job is allocated all the threads on a given core to avoid contention between jobs for a given physical core. 

Please contact SCF staff or group members for more details.