Quick Start

This is a quick-start guide, intended to show the basic commands to get up and running on the cluster.
Please see the other pages on the sidebar for detailed documentation on the SCF cluster.

Connecting to the SCF

First, ssh to one of the SCF's standalone Linux servers. For example, from a terminal on your Mac or Linux machine:

ssh gandalf.berkeley.edu

Or use PuTTY or MobaXterm from your Windows machine.

You can also start a terminal on an SCF machine via the SCF JupyterHub.

All jobs that you run on the cluster need to be submitted via the Slurm scheduling software from one of our Linux servers. Slurm sends each job to run on one or more machines (aka nodes) in one of the partitions in the cluster. A partition is a collection of machines with similar hardware.

Checking Cluster Usage

Once you are logged onto an SCF server, you can submit a job (assuming you are in a group that has permission to use the cluster).

But first we'll check usage on the cluster to get a sense of how busy each partition is:

paciorek@gandalf:~> sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
low*           up 28-00:00:0      8    mix scf-sm[00-03,10-13]
high           up 7-00:00:00      2    mix scf-sm[22-23]
high           up 7-00:00:00      2   idle scf-sm[20-21]
lowmem         up 28-00:00:0      3  down$ boromir,pooh,toad
lowmem         up 28-00:00:0      1  maint gollum
lowmem         up 28-00:00:0      3   idle ghan,hermione,shelob
gpu            up 28-00:00:0      1   idle roo
jsteinhardt    up 28-00:00:0      6    mix balrog,rainbowquartz,saruman,shadowfax,smokyquartz,sunstone
jsteinhardt    up 28-00:00:0      1   idle smaug
yss            up 28-00:00:0      1   idle luthien
yugroup        up 28-00:00:0      1    mix treebeard
yugroup        up 28-00:00:0      2   idle merry,morgoth

Alternatively, we have a wrapper script snodes that shows usage on a per-node basis:

paciorek@gandalf:~> snodes
      NODELIST    PARTITION AVAIL  STATE CPUS(A/I/O/T)
        balrog  jsteinhardt    up    mix 50/46/0/96
<snip>
      scf-sm00         low*    up    mix 24/8/0/32
      scf-sm01         low*    up    mix 25/7/0/32
<snip>
      scf-sm02         low*    up    mix 24/8/0/32
      scf-sm03         low*    up    mix 28/4/0/32
      scf-sm10         low*    up    mix 30/2/0/32
      scf-sm11         low*    up    mix 28/4/0/32
      scf-sm12         low*    up    mix 24/8/0/32
<snip>

Here 'A' is for 'active' (i.e., in use), 'I' is for 'idle' (i.e., not in use), 'O' is for 'other' (unavailable, e.g., CPUs on a node that is down or offline), and 'T' is for the 'total' number of CPUs (aka cores) on the node.

Interactive Jobs with srun

We can start an interactive job with srun. This gives us an interactive session that can use a single core (aka CPU) on one of the machines in the low partition (the default), which contains older machines. You'll see the prompt change, indicating you're now running on one of the cluster machines (a machine named scf-sm10 in this case).

paciorek@gandalf:~> srun --pty bash
paciorek@scf-sm10:~> 

When you're done with your computation, make sure to exit the interactive session:

paciorek@scf-sm10:~> exit

We can instead start a job on the newer machines in the high partition:

paciorek@gandalf:~> srun -p high --pty bash

And we can start a job that needs four cores on a single machine using the --cpus-per-task (-c) flag:

paciorek@gandalf:~> srun -c 4 --pty bash

Interactive jobs might take a while to start if the machines are busy with other users' jobs.

Batch Jobs with sbatch

To run a batch (non-interactive) job, you need to create a job script. Here's a simple one that requests four cores and then runs a Python script (which should be set up to take advantage of the four cores via parallelization in some fashion):

#! /bin/bash
#SBATCH -c 4
python code.py > code.pyout
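
For illustration, here's a minimal sketch of what code.py might look like (your actual code will differ), assuming the work can be split into independent chunks. It reads the SLURM_CPUS_PER_TASK environment variable (discussed further under Parallelization below) so that the number of worker processes matches the number of cores requested:

# hypothetical sketch of code.py: run independent work units in parallel
# across the cores Slurm allocated to this job
import os
from multiprocessing import Pool

# cores allocated via the -c flag (falls back to 1 if run outside Slurm)
ncores = int(os.environ.get("SLURM_CPUS_PER_TASK", 1))

def work(x):
    return x ** 2   # stand-in for a real computation

if __name__ == "__main__":
    with Pool(ncores) as pool:
        results = pool.map(work, range(100))
    print(results[:5])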

When you submit the job, sbatch will print the job ID. Here we assume the job script above is saved in a file called job.sh:

paciorek@gandalf:~> sbatch job.sh
Submitted batch job 47139

Monitoring

You can now check if the job is running:

paciorek@gandalf:~> squeue -u $(whoami)
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             41377      high jupyterh paciorek  R 3-21:55:42      1 scf-sm22
             47139       low   job.sh paciorek  R       0:46      1 scf-sm11

If you see `R` under the `ST` column, the job is running. If you see `PD` it is waiting in the queue.

Or you can use our sjobs wrapper script to see a richer set of information about the running job(s):

paciorek@gandalf:~> sjobs -u $(whoami)
         JOBID         USER                 NAME    PARTITION QOS      ST         REASON  TIME_LIMIT        TIME SUBMIT_TIME  CPUS  NODES     NODELIST(REASON)  PRIORITY FEATURES TRES_PER_NODE
         41377     paciorek           jupyterhub         high normal    R           None  7-00:00:00  3-22:39:36 2022-11-17T     1      1             scf-sm22 0.0022660 (null) N/A
         47139     paciorek               job.sh          low normal    R           None 28-00:00:00       44:40 2022-11-21T     4      1             scf-sm11 0.0022545 (null) N/A

To better understand CPU and memory use by the job, you can connect to the node the job is running on and then use commands like top and ps:

paciorek@gandalf:~> srun --jobid 47139 --pty bash
paciorek@scf-sm11:~> top    # use Ctrl-C to exit top
paciorek@scf-sm11:~> exit

To cancel a running job:

paciorek@gandalf:~> scancel 47139

To check details of recently completed jobs:

paciorek@gandalf:~> sacct -u $(whoami) -S 2022-11-01  # jobs started since Nov 1, 2022
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
47138              bash        low       site          1  COMPLETED      0:0 
47138.extern     extern                  site          1  COMPLETED      0:0 
47138.0            bash                  site          1  COMPLETED      0:0 
47139            job.sh        low       site          4    RUNNING      0:0 
47139.batch       batch                  site          4    RUNNING      0:0 
47139.extern     extern                  site          4    RUNNING      0:0 

Alternatively, use our shist wrapper for a richer set of information. Of particular note, the MaxRSS column shows the maximum memory use.

paciorek@gandalf:~> shist -S 2022-10-01   # using our wrapper
     User        JobID    JobName  Partition    Account  AllocCPUS      State     MaxRSS ExitCode              Submit               Start                 End    Elapsed  Timelimit        NodeList 
--------- ------------ ---------- ---------- ---------- ---------- ---------- ---------- -------- ------------------- ------------------- ------------------- ---------- ---------- --------------- 
 paciorek 47138              bash        low       site          1  COMPLETED                 0:0 2022-11-21T15:10:19 2022-11-21T15:10:19 2022-11-21T15:10:22   00:02:03 28-00:00:+        scf-sm10 
          47138.extern     extern                  site          1  COMPLETED    902352K      0:0 2022-11-21T15:10:19 2022-11-21T15:10:19 2022-11-21T15:10:22   00:03:03                   scf-sm10 
          47138.0            bash                  site          1  COMPLETED       540K      0:0 2022-11-21T15:10:19 2022-11-21T15:10:19 2022-11-21T15:10:22   00:03:03                   scf-sm10 
 paciorek 47139            job.sh        low       site          4    RUNNING                 0:0 2022-11-21T15:51:04 2022-11-21T15:51:05             Unknown   00:04:37 28-00:00:+        scf-sm11 
          47139.batch       batch                  site          4    RUNNING                 0:0 2022-11-21T15:51:05 2022-11-21T15:51:05             Unknown   00:04:37                   scf-sm11 
          47139.extern     extern                  site          4    RUNNING                 0:0 2022-11-21T15:51:05 2022-11-21T15:51:05             Unknown   00:04:37                   scf-sm11 

Parallelization

Use the -c flag if your code parallelizes across multiple cores that all need to be on a single node.

In some cases your code may be able to parallelize across the cores on multiple nodes. In this case you should use the --ntasks (-n) flag. Here's a job script in which we ask for 20 cores, which may be provided by Slurm on one or more machines:

#! /bin/bash
#SBATCH -n 20
python code.py > code.pyout
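
Note that, as written, the script above starts a single python process; it's up to your code (or how you launch it) to actually make use of the 20 task slots. As one illustration (just a sketch, not the only approach), if the last line were changed to srun python code.py > code.pyout, Slurm would launch one copy of the program per task (possibly spread across machines), and each copy could identify its share of the work from Slurm's environment variables:

# hypothetical sketch: one copy of this program runs per task when launched
# with srun; each copy identifies itself from the environment
import os

rank = int(os.environ.get("SLURM_PROCID", 0))    # this task's index, 0 .. ntasks-1
ntasks = int(os.environ.get("SLURM_NTASKS", 1))  # total number of tasks requested

# a simple way to divide the work: task i handles items where i % ntasks == rank
my_items = [i for i in range(1000) if i % ntasks == rank]
print(f"Task {rank} of {ntasks} processing {len(my_items)} items")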

It only makes sense to use -n if your code is set up to be able to run across multiple machines (if you're not sure, it's likely that it won't run on multiple machines).

You can use both -c and -n in some cases, such as if you wanted to run Python or R in parallel with multiple processes and have each process use multiple cores for threaded linear algebra. Here's a job script asking for three tasks (suitable for starting three worker processes) and four cores per task (suitable for running four software threads per process):

#! /bin/bash
#SBATCH -n 3
#SBATCH -c 4
python code.py > code.pyout

Slurm sets various environment variables within your running job. You can use these in your code so that it adapts to whatever resources you requested; this is much better than hard-coding things like the number of workers to use in your parallelized code.

Here are a few that can be useful:

env | grep SLURM
<snip>
SLURM_CPUS_PER_TASK=4
SLURM_NTASKS=3
SLURM_NNODES=2
SLURM_NODELIST=scf-sm[11-12]
<snip>

The job is using cores on two nodes, with four CPUs for each of three tasks. Note that only code that is set up to run across multiple machines will work in this situation.
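
For example, here's a sketch of reading those variables in Python so that the number of workers and the number of threads per worker match the -n 3 -c 4 request above rather than being hard-coded (the exact mechanism for limiting threading will depend on the libraries you use):

# hypothetical sketch: adapt worker and thread counts to what Slurm provided
import os

nworkers = int(os.environ.get("SLURM_NTASKS", 1))                    # 3 in this example
threads_per_worker = int(os.environ.get("SLURM_CPUS_PER_TASK", 1))   # 4 in this example

# limit threaded linear algebra (e.g., OpenBLAS/MKL) to the cores assigned per
# worker; this generally needs to be set before numpy or similar is imported
os.environ["OMP_NUM_THREADS"] = str(threads_per_worker)

print(f"Starting {nworkers} workers with {threads_per_worker} threads each")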