Monitoring

You can monitor jobs on the cluster and usage of the different partitions. We also provide several useful commands that give commonly-needed information.

How to Monitor Jobs

The Slurm command squeue reports the status of jobs in the queue:

arwen:~/Desktop> squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               381      high   job.sh paciorek  R      25:28      1 scf-sm20
               380      high   job.sh paciorek  R      25:37      1 scf-sm20

The following will tailor the output to include information on the number of cores (the CPUs column below) being used as well as other potentially useful information:

arwen:~/Desktop> squeue -o "%.7i %.9P %.20j %.8u %.2t %l %.9M %.5C %.8r %.6D %R %p %q %b"
   JOBID PARTITION                 NAME     USER ST TIME_LIMIT      TIME  CPUS   REASON  NODES NODELIST(REASON) PRIORITY QOS GRES
     49       low                 bash paciorek  R 28-00:00:00   1:23:29     1     None      1 scf-sm00 0.00000017066486 normal (null)
     54       low              kbpew2v paciorek  R 28-00:00:00     11:01     1     None      1 scf-sm00 0.00000000488944 normal (null)

The 'ST' field gives the state of each job: running (R), pending (PD), or failed (F). A job is pending when the resources it requested are not yet available or when other jobs have higher priority; the NODELIST(REASON) column shows the reason.
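If you have many jobs, a quick tally by state can be easier to read than the full listing. The following is a minimal sketch, not a command we provide: `count_job_states` is a hypothetical helper that counts state codes read one per line from standard input. The usage comment assumes the standard squeue options `-h` (no header), `-u` (user), and `-o "%t"` (state code only).

```shell
# Tally squeue state codes (R, PD, F, ...) read one per line on stdin.
# Prints one "<state> <count>" pair per line, sorted by state code.
count_job_states() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Hypothetical usage on the cluster:
#   squeue -h -u "$USER" -o "%t" | count_job_states
```

With two running jobs and one pending job, this would print `PD 1` and `R 2` on separate lines.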

Job output that would normally appear in your terminal session will be sent to a file named slurm-<jobid>.out where <jobid> will be the number of the job (visible via squeue as above).

If you would like to log in to the node on which your job is running in order to assess CPU or memory use, you can run an interactive job within the context of your existing job. First determine the job ID of your running job using squeue, then insert it in place of <jobid> in the following command:

arwen:~/Desktop$ srun --pty --jobid=<jobid> /bin/bash

Once in that shell, you can run top and other tools to inspect your job's processes.

To see a history of your jobs (within a time range), including reasons they might have failed:

sacct  --starttime=2021-04-01 --endtime=2021-04-30 \
--format JobID,JobName,Partition,Account,AllocCPUS,State%30,ExitCode,Submit,Start,End,NodeList,MaxRSS

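The MaxRSS column reports the peak memory (maximum resident set size) used by the job's tasks, which is useful when deciding how much memory to request for similar jobs.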
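MaxRSS values are typically printed with a unit suffix (for example `2048K`), which can be awkward to compare by eye. Below is a minimal sketch of a converter to MiB. `rss_to_mib` is a hypothetical helper, and it assumes the suffixes K, M, and G denote binary units (multiples of 1024) and that a value with no suffix is in bytes; check this against your sacct output before relying on it.

```shell
# Convert a sacct-style memory value (e.g. "2048K", "512M", "1.5G") to MiB.
# Assumption: K/M/G are binary units; a value without a suffix is bytes.
rss_to_mib() {
    LC_ALL=C awk -v v="$1" 'BEGIN {
        unit = substr(v, length(v), 1)   # last character: K, M, G, or a digit
        num = v + 0                      # awk keeps the leading numeric part
        if (unit == "K")      mib = num / 1024
        else if (unit == "M") mib = num
        else if (unit == "G") mib = num * 1024
        else                  mib = num / (1024 * 1024)
        printf "%.1f\n", mib
    }'
}

# Hypothetical usage on the cluster:
#   sacct -j <jobid> -n -P --format=JobID,MaxRSS
# then pass a MaxRSS value to rss_to_mib, e.g.:
#   rss_to_mib 2048K
```

For example, `rss_to_mib 2048K` prints `2.0` and `rss_to_mib 1.5G` prints `1536.0`.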

How to Monitor Cluster Usage

If you'd like to see how busy each node is (e.g., to choose what partition to submit a job to), you can run the following:

arwen:~/Desktop$ sinfo -p low,high,jsteinhardt,yugroup -N -o "%.12N %.5a %.6t %C"
    NODELIST AVAIL  STATE CPUS(A/I/O/T)
      balrog    up    mix 50/46/0/96
       merry    up   idle 0/4/0/4
     morgoth    up    mix 1/11/0/12
    scf-sm00    up   idle 0/32/0/32
    scf-sm01    up   idle 0/32/0/32
    scf-sm02    up    mix 31/1/0/32
    scf-sm03    up   idle 0/32/0/32
    scf-sm10    up   idle 0/32/0/32
    scf-sm11    up   idle 0/32/0/32
    scf-sm12    up   idle 0/32/0/32
    scf-sm13    up   idle 0/32/0/32
    scf-sm20    up   idle 0/24/0/24
    scf-sm21    up   idle 0/24/0/24
    scf-sm22    up   idle 0/24/0/24
    scf-sm23    up   idle 0/24/0/24
   shadowfax    up   idle 0/48/0/48
       smaug    up   idle 0/64/0/64
   treebeard    up    mix 2/30/0/32

Here A is the number of cores allocated to jobs (i.e., in use), I is the number of idle cores, O is the number of cores in some other state (e.g., on a node that is down or being drained), and T is the total number of cores on the node.
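To pick out nodes that have room for a job of a given size, you can filter on the idle count. This is a minimal sketch under stated assumptions: `nodes_with_idle` is a hypothetical helper that expects lines of the form `<node> <A>/<I>/<O>/<T>` on standard input, which is what `sinfo -h -N -o "%N %C"` should produce (`-h` suppresses the header).

```shell
# Print nodes that have at least MIN idle cores, along with the idle count.
# Input lines on stdin: "<node> <allocated>/<idle>/<other>/<total>"
nodes_with_idle() {
    awk -v min="$1" '{
        split($2, c, "/")    # c[1]=allocated, c[2]=idle, c[3]=other, c[4]=total
        if (c[2] + 0 >= min + 0) print $1, c[2]
    }'
}

# Hypothetical usage on the cluster (a node in several partitions may be
# listed more than once, hence the sort -u):
#   sinfo -h -N -o "%N %C" | nodes_with_idle 32 | sort -u
```

Applied to the listing above with a minimum of 32, this would keep nodes such as balrog (46 idle) and smaug (64 idle) and drop scf-sm02 (1 idle).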

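To see the jobs running in a partition, pass the partition name to squeue with the -p flag (here, the high partition):

arwen:~/Desktop$ squeue -p high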

Useful Slurm commands

We've prepared a set of shortcut commands that wrap around `srun`, `squeue`, `sinfo`, and `sacct` with some commonly-used options:

  • `slogin`: starts an interactive shell on a cluster node
  • `snodes`: prints the current usage of nodes on the cluster
  • `sjobs`: lists running jobs on the cluster
  • `shist`: provides information about completed (including failed) jobs
  • `sassoc`: gives information about user access to cluster partitions

For each of these commands, you can add the `-h` flag to see how to use them. For example:

gandalf:~$ slogin -h
Usages:
'slogin' to start an interactive job
'slogin <jobid>' to start a shell on the node a job is running on
'slogin <additional_arguments_to_run>' to start an interactive job with additional arguments to srun