Monitoring
You can monitor jobs on the cluster and usage of the different partitions. We also provide several useful commands that give commonly-needed information.
- How to Monitor Jobs
The Slurm command `squeue` reports job status:
arwen:~/Desktop> squeue
JOBID PARTITION NAME   USER     ST TIME  NODES NODELIST(REASON)
381   high      job.sh paciorek R  25:28 1     scf-sm20
380   high      job.sh paciorek R  25:37 1     scf-sm20
The following will tailor the output to include information on the number of cores (the CPUs column below) being used as well as other potentially useful information:
arwen:~/Desktop> squeue -o "%.7i %.9P %.20j %.8u %.2t %l %.9M %.5C %.8r %.6D %R %p %q %b"
JOBID PARTITION NAME    USER     ST TIME_LIMIT  TIME    CPUS REASON NODES NODELIST(REASON) PRIORITY         QOS    GRES
49    low       bash    paciorek R  28-00:00:00 1:23:29 1    None   1     scf-sm00         0.00000017066486 normal (null)
54    low       kbpew2v paciorek R  28-00:00:00 11:01   1    None   1     scf-sm00         0.00000000488944 normal (null)
The 'ST' field indicates whether a job is running (R), failed (F), or pending (PD). A job is pending when there are not yet enough resources on the system for it to run.
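For a quick tally of how many jobs are in each state, the ST column can be counted with `awk`. This is a minimal sketch rather than a site-provided tool: the function name `count_states` is hypothetical, and it assumes the default `squeue` layout, in which ST is the fifth whitespace-separated column and job names contain no spaces.

```shell
# Hypothetical helper (not one of the cluster's shortcut commands).
# Reads squeue output on stdin and prints "<state> <count>" for each state.
# Assumes the default column layout, where ST is the fifth field.
# The header line (NR == 1) and short or blank lines are skipped.
count_states() {
    awk 'NR > 1 && NF >= 5 { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Usage on the cluster:
#   squeue | count_states
```

If you use a custom `-o` format, the field number in the sketch would need to change to match the position of the ST column.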
Job output that would normally appear in your terminal session will be sent to a file named slurm-<jobid>.out where <jobid> will be the number of the job (visible via squeue as above).
If you would like to log in to the node on which your job is running in order to assess CPU or memory use, you can run an interactive job within the context of your existing job. First determine the job ID of your running job using squeue, then substitute it for <jobid> in the following command:
arwen:~/Desktop$ srun --pty --jobid=<jobid> /bin/bash
You can then run top and other tools.
To see a history of your jobs (within a time range), including reasons they might have failed:
sacct --starttime=2021-04-01 --endtime=2021-04-30 \
--format JobID,JobName,Partition,Account,AllocCPUS,State%30,ExitCode,Submit,Start,End,NodeList,MaxRSS
The MaxRSS column reports peak memory use (the maximum resident set size), which is helpful when deciding how much memory to request for similar jobs.
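MaxRSS values are easier to compare once they are in a single unit. The sketch below converts them to MiB. The function name `maxrss_mib` is hypothetical, and the code assumes that values carry a K, M, or G suffix in 1024-based units and that a bare number is a byte count; check this against the output on your system before relying on it.

```shell
# Hypothetical helper: reads "JobID|MaxRSS" lines on stdin and prints
# "<JobID> <peak memory in MiB>". Rows with an empty MaxRSS are skipped.
# Assumption: suffixes K, M, G are 1024-based; a bare number means bytes.
maxrss_mib() {
    awk -F'|' '
        $2 != "" {
            v = $2 + 0                     # leading numeric part of the value
            u = substr($2, length($2))     # last character, the unit suffix
            if      (u == "K") v = v / 1024
            else if (u == "G") v = v * 1024
            else if (u != "M") v = v / (1024 * 1024)
            printf "%s %.1f\n", $1, v
        }'
}

# Usage on the cluster (-n drops the header, -P gives "|"-separated output):
#   sacct -n -P --starttime=2021-04-01 --format=JobID,MaxRSS | maxrss_mib
```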
- How to Monitor Cluster Usage
If you'd like to see how busy each node is (e.g., to choose what partition to submit a job to), you can use `snodes` (which is an alias for `sinfo -N -o "%.12N %.5a %.6t %C"`):
arwen:~/Desktop$ snodes
NODELIST  AVAIL STATE CPUS(A/I/O/T)
balrog    up    mix   50/46/0/96
merry     up    idle  0/4/0/4
morgoth   up    mix   1/11/0/12
scf-sm00  up    idle  0/32/0/32
scf-sm01  up    idle  0/32/0/32
scf-sm02  up    mix   31/1/0/32
scf-sm03  up    idle  0/32/0/32
scf-sm10  up    idle  0/32/0/32
scf-sm11  up    idle  0/32/0/32
scf-sm12  up    idle  0/32/0/32
scf-sm13  up    idle  0/32/0/32
scf-sm20  up    idle  0/24/0/24
scf-sm21  up    idle  0/24/0/24
scf-sm22  up    idle  0/24/0/24
scf-sm23  up    idle  0/24/0/24
shadowfax up    idle  0/48/0/48
smaug     up    idle  0/64/0/64
treebeard up    mix   2/30/0/32
Here the A column indicates the number of cores in use (i.e., active), I the number of inactive (idle) cores, O the number of cores in other states (e.g., on a node that is down), and T the total number of cores on the node.
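When looking for a node with room for a multi-core job, it can help to filter this listing by the number of idle cores. The following is a sketch under stated assumptions: the function name `idle_nodes` is hypothetical, and it expects the four-column layout that `snodes` prints, with the fourth column in A/I/O/T form.

```shell
# Hypothetical helper: reads snodes-style output on stdin and prints
# "<node> <idle cores>" for nodes with at least $1 idle cores (default 1).
# Assumes the fourth column has the form allocated/idle/other/total.
idle_nodes() {
    awk -v min="${1:-1}" '
        NR > 1 && NF >= 4 {
            split($4, c, "/")              # c[2] holds the idle-core count
            if (c[2] + 0 >= min) print $1, c[2]
        }'
}

# Usage on the cluster (nodes with at least 32 idle cores):
#   snodes | idle_nodes 32
```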
To see the jobs running in a partition, use `squeue` as described under "How to Monitor Jobs", specifying the partition with `-p`, e.g., `squeue -p high`.
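The `-p` flag can be combined with `-u` to show only your own jobs in a partition. A small sketch, in which the function name `my_jobs_in` is hypothetical and not one of the cluster's shortcut commands:

```shell
# Hypothetical helper: list your own jobs in a single partition.
# -p selects the partition; -u restricts the listing to one user.
# Falls back to `id -un` if the USER variable is not set.
my_jobs_in() {
    squeue -p "$1" -u "${USER:-$(id -un)}"
}

# Usage on the cluster:
#   my_jobs_in high
```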
- Useful Slurm commands
We've prepared a set of shortcut commands that wrap around `srun`, `squeue`, `sinfo`, and `sacct` with some commonly-used options:
- `slogin`: starts an interactive shell on a cluster node
- `snodes`: prints the current usage of nodes on the cluster
- `sjobs`: lists running jobs on the cluster
- `shist`: provides information about completed (including failed) jobs
- `sassoc`: gives information about user access to cluster partitions
For each of these commands, you can add the `-h` flag to see how to use them. For example:
gandalf:~$ slogin -h
Usages:
'slogin' to start an interactive job
'slogin <jobid>' to start a shell on the node a job is running on
'slogin <additional_arguments_to_run>' to start an interactive job with additional arguments to srun