Monitoring
You can monitor jobs on the cluster and usage of the different partitions. We also provide several useful commands that give commonly-needed information.
How to Monitor Jobs
The SLURM command squeue provides info on job status:
arwen:~/Desktop> squeue
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
    381      high   job.sh paciorek  R  25:28      1 scf-sm20
    380      high   job.sh paciorek  R  25:37      1 scf-sm20
The following will tailor the output to include information on the number of cores (the CPUs column below) being used as well as other potentially useful information:
arwen:~/Desktop> squeue -o "%.7i %.9P %.20j %.8u %.2t %l %.9M %.5C %.8r %.6D %R %p %q %b"
  JOBID PARTITION                 NAME     USER ST  TIME_LIMIT      TIME  CPUS   REASON  NODES NODELIST(REASON) PRIORITY QOS GRES
     49       low                 bash paciorek  R 28-00:00:00   1:23:29     1     None      1 scf-sm00 0.00000017066486 normal (null)
     54       low              kbpew2v paciorek  R 28-00:00:00     11:01     1     None      1 scf-sm00 0.00000000488944 normal (null)
The 'ST' field indicates whether a job is running (R), failed (F), or pending (PD). A job is pending when there are not yet enough free resources on the system for it to run.
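For instance, to restrict the output to just your own pending jobs, you can filter `squeue` by user and state:

arwen:~/Desktop> squeue -u $USER --states=PD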
Job output that would normally appear in your terminal session will be sent to a file named slurm-<jobid>.out where <jobid> will be the number of the job (visible via squeue as above).
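If you'd prefer a different file name, you can set it when submitting the job: `sbatch` accepts an `--output` pattern in which `%j` expands to the job ID. The pattern below is just an example:

arwen:~/Desktop> sbatch --output=myjob-%j.out job.sh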
If you would like to login to the node on which your job is running in order to assess CPU or memory use, you have two options.
The easier option is available if (and only if) you have a job running on a node: you can ssh directly to that node (from one of the SCF login servers), and the resulting ssh session will be associated with your existing job on the node. One important caveat is that if your terminal on the login server is through JupyterHub, you'll need to run `ssh -F none <name_of_node>` rather than just `ssh <name_of_node>`.
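For example, if the `squeue` output above shows your job running on scf-sm20 (substitute whatever node your job is actually on):

arwen:~$ ssh scf-sm20
scf-sm20:~$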
Alternatively, you can run an interactive job within the context of your existing job. First determine the job ID of your running job using squeue and insert that in place of <jobid> in the following command:
arwen:~/Desktop$ srun --pty --jobid=<jobid> /bin/bash
In either case, you can then run top and other tools.
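For example (again assuming your job is on scf-sm20), you could restrict `top` to your own processes and check overall memory availability; these are standard Linux tools, not Slurm-specific:

scf-sm20:~$ top -u $USER
scf-sm20:~$ free -h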
To see a history of your jobs (within a time range), including reasons they might have failed:
sacct --starttime=2021-04-01 --endtime=2021-04-30 \
  --format JobID,JobName,Partition,Account,AllocCPUS,State%30,ExitCode,Submit,Start,End,NodeList,MaxRSS
The MaxRSS column indicates memory usage, which can be very useful.
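To check a single job rather than a date range, you can pass the job ID directly to `sacct`, e.g.:

sacct -j <jobid> --format=JobID,State,ExitCode,Elapsed,MaxRSS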
How to Monitor Cluster Usage
If you'd like to see how busy each node is (e.g., to choose what partition to submit a job to), you can use `snodes` (which is an alias for `sinfo -N -o "%.12N %.5a %.6t %C"`):
arwen:~/Desktop$ snodes
    NODELIST AVAIL  STATE CPUS(A/I/O/T)
      balrog    up    mix 50/46/0/96
       merry    up   idle 0/4/0/4
     morgoth    up    mix 1/11/0/12
    scf-sm00    up   idle 0/32/0/32
    scf-sm01    up   idle 0/32/0/32
    scf-sm02    up    mix 31/1/0/32
    scf-sm03    up   idle 0/32/0/32
    scf-sm10    up   idle 0/32/0/32
    scf-sm11    up   idle 0/32/0/32
    scf-sm12    up   idle 0/32/0/32
    scf-sm13    up   idle 0/32/0/32
    scf-sm20    up   idle 0/24/0/24
    scf-sm21    up   idle 0/24/0/24
    scf-sm22    up   idle 0/24/0/24
    scf-sm23    up   idle 0/24/0/24
   shadowfax    up   idle 0/48/0/48
       smaug    up   idle 0/64/0/64
   treebeard    up    mix 2/30/0/32
Here the A column indicates the number of allocated (in-use) cores, I the number of idle cores, O the number of cores in some other state (e.g., offline), and T the total number of cores on the node.
To see the jobs running in a partition you can use `squeue` as discussed in the previous drop-down, but specifying the partition with `-p`, e.g., `-p high`.
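For example, to list the jobs in the high partition along with the user, state, and cores each is using (reusing a few of the format codes from the earlier example):

arwen:~/Desktop$ squeue -p high -o "%.7i %.8u %.2t %.5C %R"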
To see GPU availability, you can use `sgpus`, which is an alias for `sinfo` with some additional output processing.
arwen:~/Desktop$ sgpus
     NODELIST   PARTITION        GPUs  GPUs_USED
          roo         gpu     TITAN:1    TITAN:1
  smokyquartz jsteinhardt     A4000:8    A4000:0
        smaug jsteinhardt   RTX8000:2  RTX8000:0
rainbowquartz jsteinhardt     A5000:8    A5000:0
    shadowfax jsteinhardt   RTX2080:8  RTX2080:3
     sunstone jsteinhardt     A4000:8    A4000:1
       balrog jsteinhardt      A100:8     A100:0
      saruman jsteinhardt      A100:8     A100:3
        beren         yss      A100:8     A100:8
      luthien         yss      A100:4     A100:4
      morgoth     yugroup     TITAN:2    TITAN:0
        merry     yugroup       GTX:1      GTX:1
    treebeard     yugroup      A100:1     A100:1
Useful Slurm commands
We've prepared a set of shortcut commands that wrap `srun`, `squeue`, `sinfo`, and `sacct` with some commonly used options:
- `slogin`: starts an interactive shell on a cluster node
- `snodes`: prints the current usage of nodes on the cluster
- `sjobs`: lists running jobs on the cluster
- `shist`: provides information about completed (including failed) jobs
- `sassoc`: gives information about user access to cluster partitions
- `sgpus`: gives information about GPU availability
For each of these commands, you can add the `-h` flag to see how to use them. For example:
gandalf:~$ slogin -h
Usages:
  'slogin' to start an interactive job
  'slogin <jobid>' to start a shell on the node a job is running on
  'slogin <additional_arguments_to_srun>' to start an interactive job with additional arguments to srun
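For instance, since extra arguments are passed through to `srun` per the help text above, you could request an interactive session in a particular partition with several cores (a sketch; any valid `srun` flags should work the same way):

gandalf:~$ slogin -p high -c 4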