The SCF operates a Linux cluster as its primary resource for users' computational jobs.
The cluster is managed by the SLURM queueing software. SLURM provides a standard batch queueing system through which users submit jobs to the cluster. Jobs are typically submitted to SLURM using a user-defined shell script that executes one's application code. Interactive use is also an option. Users may also query the cluster to see job status. As currently set up, the cluster is designed for processing single-core and multi-core/threaded jobs (at most 32 cores per job in the 'low' partition), as well as jobs that are set up to use multiple machines (aka 'nodes') within the cluster (i.e., distributed memory jobs). All software running on SCF Linux machines is available on the cluster. Users can also compile programs on any SCF Linux machine and then run that program on the cluster.
The cluster has a total of 656 cores, divided into 'partitions' of nodes of similar hardware/capabilities.
Below is more detailed information about how to use the cluster.
The cluster is open to Department of Statistics faculty, grad students, postdocs, and visitors using their SCF logon. Class account users do not have access by default, but instructors can contact email@example.com to discuss access for their class.
Currently users may submit jobs on the following submit hosts:
arwen, beren, gandalf, hagrid, harry, radagast, shelob
The cluster has 656 cores (essentially CPUs) spread across a number of nodes of varying hardware specifications and ages.
- 256 of the cores are in the 'low' partition.
  - 8 nodes, 32 cores and 256 GB RAM per node.
  - By default, jobs submitted to the cluster will run here.
- 96 of the cores are in the 'high' partition.
  - 4 nodes, 12 cores (each with two hyperthreads) and 128 GB RAM per node.
  - Newer (but not new) and faster nodes than those in the 'low' partition.
- 128 of the cores are in the 'high_pre' partition.
  - 1 node (smaug) has 24 cores (each with two hyperthreads) and 376 GB RAM. A second node (balrog) has 40 cores (each with two hyperthreads) and 792 GB RAM.
  - Newer and perhaps somewhat faster cores than those in the 'high' partition.
  - Very fast disk I/O (using NVMe SSDs) for files located in /tmp and /var/tmp.
  - Jobs are subject to preemption at any time and will be cancelled without warning in that case (more details below).
- 144 of the cores are in the 'lowmem' partition, intended to supplement the other partitions, particularly during heavy cluster usage.
  - 9 nodes, 16 cores and 12-16 GB RAM per node.
  - Older, slower nodes that are comparable to, or a bit slower than, the nodes in the 'low' partition.
  - Only suitable for jobs with low memory needs, but you should feel free to request an entire node for jobs needing more memory, even if your code will only use one or a few cores.
- SLURM Scheduler Setup and Job Restrictions
The cluster has multiple partitions, corresponding to groups of nodes. The different partitions have different hardware and job restrictions as discussed here:
| Partition | Max # cores per user (running) | Time limit | Max memory per job | Max cores per job |
|---|---|---|---|---|
| low | 256 | 28 days | 256 GB | 32*** |
| high* | 36 | 7 days | 128 GB | 24*** |
| high_pre* | 48 (smaug), 80 (balrog) | 28 days** | 376 GB (smaug), 792 GB (balrog) | 48 or 80 |
| gpu* | 2 CPU cores | 28 days | 128 GB (CPU) | 2 |
| gpu_jsteinhardt* | varied | 28 days** | varied (CPU) | 2 |
| gpu_yugroup* | 2 CPU cores | 28 days** | varied (CPU) | 2 |
| low_mem | 144 | 28 days | 12-16 GB | 16*** |
* See How to Submit Jobs to the High Partitions or How to Submit GPU Jobs.
*** If you use software that can parallelize across multiple nodes (e.g., R packages that use MPI or the future package, Python's Dask or IPython Parallel, MATLAB, MPI), you can run individual jobs across more than one node. See How to Submit Parallel Jobs.
We have implemented a 'fair share' policy that governs the order in which jobs that are waiting in a given queue start when resources become available. In particular, if two users each have a job sitting in a queue, the job that will start first will be that of the user who has made less use of the cluster recently (measured in terms of CPU time). The measurement of CPU time downweights usage over time, with a half-life of one month, so a job that ran a month ago will count half as much as a job that ran yesterday. Apart from this prioritization based on recent use, all users are treated equally.
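As an illustrative sketch only (this is not SLURM's internal implementation), the half-life weighting described above can be computed like this; the `score` function and its inputs are hypothetical:

```shell
# Recent-usage score with a one-month half-life: a job's CPU time is
# downweighted by 0.5^(age_in_days / 30). Lower score = higher priority.
half_life=30

score() {  # usage: score CPU_HOURS AGE_IN_DAYS
  awk -v c="$1" -v a="$2" -v h="$half_life" \
    'BEGIN { printf "%.2f\n", c * 0.5 ^ (a / h) }'
}

score 100 0    # a job run today counts fully: 100.00
score 100 30   # the same job run a month ago counts half: 50.00
```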
- How to Submit Single-Core Jobs
Prepare a shell script containing the instructions you would like the system to execute.
The instructions here are for the simple case of submitting a job without any parallelization; i.e., a job using a single core (CPU). When submitted using the instructions in this section, your code should only use a single core at a time; it should not start any additional processes. We also have extensive instructions for submitting parallelized jobs and automating the submission of multiple jobs.
For example, a simple script to run an R program called 'simulate.R' would contain these lines:
#!/bin/bash
R CMD BATCH --no-save simulate.R simulate.out
Once logged onto a submit host, use the sbatch command with the name of the shell script (assumed to be job.sh here) to enter a job into the queue:
arwen:~/Desktop$ sbatch job.sh
Submitted batch job 380
Here the job was assigned job ID 380. Results that would normally be printed to the screen via standard output and standard error will be written to a file called slurm-380.out.
SLURM provides a number of additional flags to control what happens; you can see the man page for sbatch for help with these. Here are some examples, placed in the job script file, where we name the job, ask for email updates and name the output and error files:
#!/bin/bash
#SBATCH --job-name=myAnalysisName
#SBATCH --mail-type=ALL
#SBATCH --mail-user=firstname.lastname@example.org
#SBATCH -o myAnalysisName.out   # File to which standard out will be written
#SBATCH -e myAnalysisName.err   # File to which standard err will be written
R CMD BATCH --no-save simulate.R simulate.Rout
- How to Submit Parallel Jobs
One can use SLURM to submit parallel code of a variety of types.
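As a minimal sketch of one common case (a single-node, threaded job), a job script might look like the following; the job name and R command are placeholders, and the script defaults to one thread when run outside SLURM:

```shell
#!/bin/bash
#SBATCH --job-name=threadedJob
#SBATCH --cpus-per-task=4        # reserve 4 cores on a single node
# Inside a SLURM job, SLURM_CPUS_PER_TASK matches the --cpus-per-task
# request; default to 1 so the script also behaves sensibly outside SLURM.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "using $OMP_NUM_THREADS thread(s)"
# R CMD BATCH --no-save simulate.R simulate.Rout   # your threaded workload
```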
- How to Submit Jobs to the High Partitions
To submit jobs to the faster nodes in the high partition, you must include either the '--partition=high' or '-p high' flag. By default jobs will be run in the low partition. For example:
arwen:~/Desktop$ sbatch -p high job.sh
Submitted batch job 380
You can also submit interactive jobs to the high partition by adding the flag specifying that partition to your srun invocation, e.g.:
srun --pty -p high /bin/bash
High pre-emptible partition
The high_pre partition has two nodes (smaug and balrog) with newer CPUs and very fast disk I/O to /tmp and /var/tmp using NVMe SSDs. These nodes are owned by a faculty member and are made available on a preemptible basis. Your job could be cancelled without warning at any time if researchers in the faculty member's group need to run a job using the cores/memory your job is using. However, we don't expect this to happen very often given that the nodes have 48 (smaug) and 80 (balrog) cores that can be shared amongst jobs. For example:
arwen:~/Desktop$ sbatch -p high_pre job.sh
Submitted batch job 380
Jobs are cancelled when preemption happens. If you want your job to be automatically started again (i.e., restarted from the beginning) when the node becomes available, add the "--requeue" flag when you submit via sbatch.
If you specifically want access to either smaug or balrog, specify the -w flag with the name of the server, either "-w smaug-cpu" or "-w balrog-cpu".
Also note that if you need more disk space on the NVMe SSD on smaug (but not balrog), we may be able to make available space on a much larger NVMe SSD (in /data) if you request it.
- How to Kill a Job
First, find the job-id of the job, by typing squeue at the command line of a submit host (see How to Monitor Jobs).
Then use scancel to delete the job (with ID 380 in this case):
arwen:~/Desktop$ scancel 380
- Interactive Jobs
You can work interactively on a node from the Linux shell command line by starting an interactive job.
The syntax for requesting an interactive (bash) shell session is:
srun --pty /bin/bash
This will start a shell on one of the cluster nodes. You can then act as you would on any SCF Linux compute server. For example, you might use top to assess the status of one of your non-interactive (i.e., batch) cluster jobs. Or you might test some code before running it as a batch job. You can also transfer files to the local disk of the cluster node.
If you want to run a program that involves a graphical interface (requiring an X11 window), you need to add --x11 to your srun command. So you could directly run Matlab, e.g., as follows:
srun --pty --x11 matlab
or you could add the --x11 flag when requesting an interactive shell session and then subsequently start a program that has a graphical interface.
To run an interactive session in which you would like to use multiple cores, do the following (here we request 4 cores for our use):
srun --pty --cpus-per-task 4 /bin/bash
Note that "-c" is a shorthand for "--cpus-per-task".
To transfer files to the local disk of a specific node, you need to request that your interactive session be started on the node of interest (in this case scf-sm20):
srun --pty -p high -w scf-sm20 /bin/bash
Note that if that specific node has all its cores in use by other users, you will need to wait until resources become available on that node before your interactive session will start.
Finally, you can request multiple cores using -c, as with batch jobs. As with batch jobs, you can change OMP_NUM_THREADS from its default of one, provided you make sure that the total number of cores used (number of processes your code starts multiplied by threads per process) does not exceed the number of cores you request.
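The bookkeeping described above can be sketched as follows (the numbers are hypothetical): having requested 4 cores and planning to start 2 worker processes, each process should use at most 2 threads:

```shell
# processes x threads-per-process must not exceed the cores requested via -c
requested=4   # cores requested (srun --pty -c 4 /bin/bash)
procs=2       # worker processes your code will start
export OMP_NUM_THREADS=$(( requested / procs ))
echo "$OMP_NUM_THREADS threads per process"   # 2
```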
- How to Monitor Jobs
The SLURM command squeue provides info on job status:
arwen:~/Desktop> squeue
JOBID PARTITION   NAME     USER ST  TIME NODES NODELIST(REASON)
  381      high job.sh paciorek  R 25:28     1 scf-sm20
  380      high job.sh paciorek  R 25:37     1 scf-sm20
The following will tailor the output to include information on the number of cores (the CPUs column below) being used as well as other potentially useful information:
arwen:~/Desktop> squeue -o "%.7i %.9P %.20j %.8u %.2t %l %.9M %.5C %.8r %.6D %R %p %q %b"
JOBID PARTITION    NAME     USER ST  TIME_LIMIT    TIME  CPUS   REASON  NODES NODELIST(REASON) PRIORITY QOS GRES
   49       low    bash paciorek  R 28-00:00:00 1:23:29     1     None      1 scf-sm00 0.00000017066486 normal (null)
   54       low kbpew2v paciorek  R 28-00:00:00   11:01     1     None      1 scf-sm00 0.00000000488944 normal (null)
The 'ST' field indicates whether a job is running (R), failed (F), or pending (PD). The latter occurs when there are not yet enough resources on the system for your job to run.
Job output that would normally appear in your terminal session will be sent to a file named slurm-<jobid>.out where <jobid> will be the number of the job (visible via squeue as above).
If you would like to logon to the node on which your job is running in order to assess CPU or memory use, you can run an interactive job within the context of your existing job. First determine the job ID of your running job using squeue and insert that in place of <jobid> in the following command:
arwen:~/Desktop$ srun --pty --jobid=<jobid> /bin/bash
You can then run top and other tools. Please do not do any intensive computation that would use additional cores in addition to those used by your running job.
- How to Monitor Cluster Usage
If you'd like to see how busy each node is (e.g., to choose what partition to submit a job to), you can run the following:
arwen:~/Desktop$ sinfo -N -o "%.12N %.5a %.6t %C"
    NODELIST AVAIL  STATE CPUS(A/I/O/T)
    scf-sm00    up  alloc 32/0/0/32
    scf-sm01    up  alloc 32/0/0/32
    scf-sm02    up  alloc 32/0/0/32
    scf-sm03    up  alloc 32/0/0/32
    scf-sm10    up  alloc 32/0/0/32
    scf-sm11    up  alloc 32/0/0/32
    scf-sm12    up  alloc 32/0/0/32
    scf-sm13    up    mix 27/5/0/32
scf-sm20-cpu    up   idle 0/22/0/22
scf-sm20-gpu    up   idle 0/2/0/2
scf-sm21-cpu    up   idle 0/20/0/20
scf-sm21-gpu    up   idle 0/4/0/4
    scf-sm22    up   idle 0/24/0/24
    scf-sm23    up   idle 0/24/0/24
Here the A column indicates the number of allocated (i.e., active) cores, I the number of idle cores, O the number of cores that are otherwise unavailable, and T the total number of cores on the node.
- How to Submit GPU Jobs
The SCF hosts a number of GPUs, available only by submitting a job through our SLURM scheduling software. The GPUs vary considerably in their hardware configurations (different generations of GPUs, with different speeds and amounts of GPU memory). We have documented the GPU servers to guide you in selecting which GPU you may want to use.
To use the GPUs, you need to submit a job via our SLURM scheduling software. In doing so, you need to specifically request that your job use the GPU as follows:
arwen:~/Desktop$ sbatch --partition=gpu --gres=gpu:1 job.sh
Once it starts, your job will have exclusive access to the GPU and its memory. If another user is using the GPU, your job will be queued until the current job finishes.
Interactive jobs should use the same --gres flag with the usual srun syntax for an interactive job:
arwen:~/Desktop$ srun --pty --partition=gpu --gres=gpu:1 /bin/bash
Given the heterogeneity in the GPUs available, you may want to request use of one or more GPUs on a specific machine. To do so, you can use the -w flag, e.g., requesting the roo GPU server:
arwen:~/Desktop$ sbatch --partition=gpu -w roo --gres=gpu:1 job.sh
If you want to interactively logon to the GPU node to check on compute or memory use of an sbatch job that uses the GPU, find the job ID of your job using squeue and insert that job ID in place of '<jobID>' in the following command. This will give you an interactive job running in the context of your original job:
arwen:~/Desktop$ srun --pty --partition=gpu --jobid=<jobid> /bin/bash
and then use nvidia-smi commands, e.g.,
scf-sm20:~$ nvidia-smi -q -d UTILIZATION,MEMORY -l 1
There are many ways to set up your code to use the GPU.
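One detail that applies regardless of framework: within a job submitted with --gres, SLURM typically sets the CUDA_VISIBLE_DEVICES environment variable to the allocated device(s), and most GPU frameworks respect it automatically. A quick check (run outside a GPU job, it reports no allocation):

```shell
# Print which GPU device(s) SLURM has allocated to this job, if any.
echo "visible GPUs: ${CUDA_VISIBLE_DEVICES:-none allocated}"
```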
Two GPUs are generally available to all SCF users; these are hosted on the scf-sm20 and roo servers.
Note that the GPU hosted on scf-sm20 is quite a bit older, and likely slower, than the GPU hosted on roo. If you'd like to specifically request one of the GPUs, you can add the -w flag, e.g. "-w roo" to request the GPU on roo.
scf-sm20-gpu is a virtual node - it is a set of two CPUs on the scf-sm20 node that are partitioned off for GPU use.
Additional GPUs have been obtained by the Steinhardt and Yu lab groups. Most of these GPUs have higher performance (either speed or GPU memory) than our standard GPUs.
Members of the lab group have priority access to the GPUs of their group. Other SCF users can submit jobs that use these GPUs but those jobs will be preempted (killed) if higher-priority jobs need access to the GPUs.
To submit jobs requesting access to these GPUs, you need to specify either the gpu_jsteinhardt or gpu_yugroup partitions. Here's an example:
arwen:~/Desktop$ sbatch --partition=gpu_jsteinhardt --gres=gpu:1 job.sh
To use multiple GPUs for a job (only possible when using a server with more than one GPU, namely scf-sm21-gpu, smaug, shadowfax, balrog, and morgoth), simply change the number 1 after --gres=gpu to the number desired.
The Steinhardt group has priority access to the balrog (8 GPUs), shadowfax (8 GPUs), and smaug (2 GPUs) GPU servers. If you are in the group, simply submit jobs to the gpu_jsteinhardt partition; your jobs will automatically preempt those of users outside the group if that is needed for your job to run.
In addition to the notes below, more details on optimal use of these servers can be obtained from the guide prepared by Steinhardt group members and the SCF and available by contacting one of us.
The smaug and balrog GPUs have a lot of GPU memory and are primarily intended for training very large models (e.g., ImageNet not CIFAR10 or MNIST).
To request a specific GPU, use the -w flag with the name of the GPU server of interest (shadowfax, smaug-gpu, or balrog-gpu).
sbatch -p gpu_jsteinhardt -w smaug-gpu --gres=gpu:1 job.sh
For jobs running on shadowfax, your CPU usage will be restricted to the number of CPUs you request (via the --cpus-per-task flag). Since shadowfax has 48 CPUs and 8 GPUs, there are 6 CPUs per GPU. You could request more than 6 CPUs per GPU for your job, but note that if other group members do the same, it's possible that the total number of CPUs may be fully used before all the GPUs are used. That said, that's probably a rather unlikely scenario.
For jobs running on smaug and balrog, your code can use as many CPUs as needed (up to the 64 CPUs on smaug and 96 CPUs on balrog) without specifying the --cpus-per-task flag. However 48 (on smaug) and 80 (on balrog) of those CPUs are available to other SCF users for non-GPU jobs. It's possible that usage by those other users could compete with CPU usage by your job. If you would like to make sure that usage by other SCF users doesn't slow your work down, you can run a "dummy" job that excludes access by other users. Create a script, perhaps called 'sleep.sh' that has this content:
#!/bin/bash
#SBATCH -p high_pre
#SBATCH -w smaug-cpu   # or balrog-cpu
#SBATCH -c 48          # 80 for balrog-cpu
#SBATCH -t 24:00:00    # time limit equal to your main GPU job submission
sleep 10000000
Then submit the script via sbatch.
The Yu group has priority access to GPUs located on the scf-sm21-gpu (2 GPUs), merry (1 GPU), and morgoth (2 GPUs) servers. If you are in the group, simply submit jobs to the gpu_yugroup partition; your jobs will automatically preempt those of users outside the group if that is needed for your job to run. Please contact SCF staff or group members for more details.