This is a quick-start guide, intended to show the basic commands to get up and running on the cluster.
Please see the other pages on the sidebar for detailed documentation on the SCF cluster.
Connecting to the SCF
First, ssh to one of the SCF's standalone Linux servers. For example, from a terminal on your Mac or Linux machine (replacing username with your SCF username and gandalf with the name of any SCF standalone server):
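ssh username@gandalf.berkeley.edu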
Or use PuTTY or MobaXterm from your Windows machine.
You can also start a terminal on an SCF machine via the SCF JupyterHub.
All jobs that you run on the cluster need to be submitted via the Slurm scheduling software from one of our Linux servers. Slurm sends each job to run on one or more machines (aka nodes) in one of the partitions in the cluster. A partition is a collection of machines with similar hardware.
Checking Cluster Usage
Once you are logged onto an SCF server, you can submit a job (assuming you are in a group that has permission to use the cluster).
But first we'll check usage on the cluster to get a sense for how busy each partition is:
paciorek@gandalf:~> sinfo
PARTITION    AVAIL  TIMELIMIT   NODES  STATE  NODELIST
low*         up     28-00:00:0      8  mix    scf-sm[00-03,10-13]
high         up     7-00:00:00      2  mix    scf-sm[22-23]
high         up     7-00:00:00      2  idle   scf-sm[20-21]
lowmem       up     28-00:00:0      3  down$  boromir,pooh,toad
lowmem       up     28-00:00:0      1  maint  gollum
lowmem       up     28-00:00:0      3  idle   ghan,hermione,shelob
gpu          up     28-00:00:0      1  idle   roo
jsteinhardt  up     28-00:00:0      6  mix    balrog,rainbowquartz,saruman,shadowfax,smokyquartz,sunstone
jsteinhardt  up     28-00:00:0      1  idle   smaug
yss          up     28-00:00:0      1  idle   luthien
yugroup      up     28-00:00:0      1  mix    treebeard
yugroup      up     28-00:00:0      2  idle   merry,morgoth
Alternatively, we have a wrapper script snodes that shows usage on a per-node basis:
paciorek@gandalf:~> snodes
NODELIST  PARTITION    AVAIL  STATE  CPUS(A/I/O/T)
balrog    jsteinhardt  up     mix    50/46/0/96
<snip>
scf-sm00  low*         up     mix    24/8/0/32
scf-sm01  low*         up     mix    25/7/0/32
<snip>
scf-sm02  low*         up     mix    24/8/0/32
scf-sm03  low*         up     mix    28/4/0/32
scf-sm10  low*         up     mix    30/2/0/32
scf-sm11  low*         up     mix    28/4/0/32
scf-sm12  low*         up     mix    24/8/0/32
<snip>
Here 'A' is for 'active' (i.e., in use), 'I' is for 'idle' (i.e., not in use), 'O' is for 'other' (CPUs that are unavailable, e.g., because the node is down or offline), and 'T' is for the 'total' number of CPUs (aka cores) on the node.
Interactive Jobs with srun
We can start an interactive job with srun. This will start an interactive session that can use a single core (aka CPU) on one of the machines in our (default) low partition, which contains older machines. You'll see the prompt change, indicating you're now running on one of the cluster machines (a machine named scf-sm10 in this case).
paciorek@gandalf:~> srun --pty bash
paciorek@scf-sm10:~>
When you're done with your computation, make sure to exit out of the interactive session:
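paciorek@scf-sm10:~> exit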
We can instead start a job on the newer machines in the high partition:
paciorek@gandalf:~> srun -p high --pty bash
And we can start a job that needs four cores on a single machine using the --cpus-per-task (-c) flag:
paciorek@gandalf:~> srun -c 4 --pty bash
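These flags can be combined. For example, to request four cores on one of the newer machines in the high partition:

paciorek@gandalf:~> srun -p high -c 4 --pty bash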
Interactive jobs might take a while to start if the machines are busy with other users' jobs.
Batch Jobs with sbatch
To run a batch (non-interactive) job, you need to create a job script. Here's a simple one that requests four cores and then runs a Python script (which should be set up to take advantage of four cores via parallelization in some fashion):
#!/bin/bash
#SBATCH -c 4
python code.py > code.pyout
When you submit the job, it will show you the job id. Here we assume the code above is in a file job.sh:
paciorek@gandalf:~> sbatch job.sh
Submitted batch job 47139
You can now check if the job is running:
paciorek@gandalf:~> squeue -u $(whoami)
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  41377      high jupyterh paciorek  R 3-21:55:42      1 scf-sm22
  47139       low   job.sh paciorek  R       0:46      1 scf-sm11
If you see `R` under the `ST` column, the job is running. If you see `PD` it is waiting in the queue.
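Running squeue without the -u flag will show all users' jobs on the cluster, which can help you gauge how long you might wait in the queue:

paciorek@gandalf:~> squeue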
Or you can use our sjobs wrapper script to see a richer set of information about the running job(s):
paciorek@gandalf:~> sjobs -u $(whoami)
JOBID  USER      NAME        PARTITION  QOS     ST  REASON  TIME_LIMIT   TIME        SUBMIT_TIME  CPUS  NODES  NODELIST(REASON)  PRIORITY   FEATURES  TRES_PER_NODE
41377  paciorek  jupyterhub  high       normal  R   None    7-00:00:00   3-22:39:36  2022-11-17T  1     1      scf-sm22          0.0022660  (null)    N/A
47139  paciorek  job.sh      low        normal  R   None    28-00:00:00  44:40       2022-11-21T  4     1      scf-sm11          0.0022545  (null)    N/A
To better understand CPU and memory use by the job, you can connect to the node the job is running on and then use commands like top and ps:
paciorek@gandalf:~> srun --jobid 47139 --pty bash
paciorek@scf-sm11:~> top   # use Ctrl-C to exit top
paciorek@scf-sm11:~> exit
To cancel a running job:
paciorek@gandalf:~> scancel 47139
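To cancel all of your jobs (running and queued) at once, you can pass your username to scancel instead of a job id:

paciorek@gandalf:~> scancel -u $(whoami)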
To check details of recently-completed jobs:
paciorek@gandalf:~> sacct -u $(whoami) -S 2022-11-01   # jobs started since Nov 1, 2022
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
47138              bash        low       site          1  COMPLETED      0:0
47138.extern     extern                  site          1  COMPLETED      0:0
47138.0            bash                  site          1  COMPLETED      0:0
47139            job.sh        low       site          4    RUNNING      0:0
47139.batch       batch                  site          4    RUNNING      0:0
47139.extern     extern                  site          4    RUNNING      0:0
Alternatively, use our shist wrapper for a richer set of information. Of particular note, the MaxRSS column shows the maximum memory use.
paciorek@gandalf:~> shist -S 2022-10-01   # using our wrapper
     User        JobID    JobName  Partition    Account  AllocCPUS      State     MaxRSS ExitCode              Submit               Start                 End    Elapsed  Timelimit        NodeList
--------- ------------ ---------- ---------- ---------- ---------- ---------- ---------- -------- ------------------- ------------------- ------------------- ---------- ---------- ---------------
 paciorek 47138              bash        low       site          1  COMPLETED                 0:0 2022-11-21T15:10:19 2022-11-21T15:10:19 2022-11-21T15:10:22   00:02:03 28-00:00:+        scf-sm10
          47138.extern     extern                   site          1  COMPLETED    902352K      0:0 2022-11-21T15:10:19 2022-11-21T15:10:19 2022-11-21T15:10:22   00:03:03                   scf-sm10
          47138.0            bash                   site          1  COMPLETED       540K      0:0 2022-11-21T15:10:19 2022-11-21T15:10:19 2022-11-21T15:10:22   00:03:03                   scf-sm10
 paciorek 47139            job.sh        low       site          4    RUNNING                 0:0 2022-11-21T15:51:04 2022-11-21T15:51:05             Unknown   00:04:37 28-00:00:+        scf-sm11
          47139.batch       batch                   site          4    RUNNING                 0:0 2022-11-21T15:51:05 2022-11-21T15:51:05             Unknown   00:04:37                   scf-sm11
          47139.extern     extern                   site          4    RUNNING                 0:0 2022-11-21T15:51:05 2022-11-21T15:51:05             Unknown   00:04:37                   scf-sm11
Use the -c (--cpus-per-task) flag if you need all the cores on a single node.
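For example, the scf-sm nodes in the low partition each have 32 CPUs (as seen in the snodes output above), so a job script requesting an entire such node might look like this:

#!/bin/bash
#SBATCH -c 32
python code.py > code.pyout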
In some cases your code may be able to parallelize across the cores on multiple nodes. In this case you should use the --ntasks (-n) flag. Here's a job script in which we ask for 20 cores, which may be provided by Slurm on one or more machines:
#!/bin/bash
#SBATCH -n 20
python code.py > code.pyout
It only makes sense to use -n if your code is set up to run across multiple machines (if you're not sure whether it is, it probably isn't).
You can use both -c and -n in some cases, such as if you wanted to run Python or R in parallel with multiple processes and have each process use multiple cores for threaded linear algebra. Here's a job script asking for three tasks (suitable for starting three worker processes) and four cores per task (suitable for running four software threads per process):
#!/bin/bash
#SBATCH -n 3
#SBATCH -c 4
python code.py > code.pyout
Slurm sets various environment variables inside the context of your running job that you can then use in your code so that your code adapts to whatever resources you requested. This is much better than hard-coding things like the number of workers that you want to use in your parallelized code.
Here are a few that can be useful:
env | grep SLURM
<snip>
SLURM_CPUS_PER_TASK=4
SLURM_NTASKS=3
SLURM_NNODES=2
SLURM_NODELIST=scf-sm[11-12]
<snip>
The job is using cores on two nodes, with four CPUs for each of three tasks. Note that only code that is set up to run across multiple machines will work in this situation.
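For instance, here's a sketch of a job script that picks up these values rather than hard-coding them. It assumes (hypothetically) that code.py accepts the number of worker processes as a --workers argument and that its threaded linear algebra respects OMP_NUM_THREADS; adapt this to however your own code is set up.

#!/bin/bash
#SBATCH -n 3
#SBATCH -c 4
# Let threaded linear algebra use the cores allocated to each task.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Tell the code how many worker processes to start.
# (--workers is a hypothetical argument; starting workers on multiple
# nodes requires code that can do so, e.g., via MPI or srun.)
python code.py --workers "$SLURM_NTASKS" > code.pyout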