SLURM administrator commands

Posted on Wed 08 May 2024 by Pavlo Khmel

This is a collection of Slurm commands that I use often.

create accounts, add a user, and set the user's default account

sacctmgr create account name=test1
sacctmgr create account name=test2
sacctmgr create user name=bob cluster=tardis account=test1
sacctmgr show assoc format=cluster,account,user | grep bob
sacctmgr show assoc format=Account%15,User,QOS | grep -e QOS -e bob
sacctmgr add user bob DefaultAccount=test2
sacctmgr show user bob
      User   Def Acct     Admin 
---------- ---------- --------- 
       bob      test2      None
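
To add several users to the same account, the create command above can be generated in a loop. This is a minimal sketch: the helper name print_add_users and the second user alice are made up for illustration, and the function only prints the commands so they can be reviewed before anything is changed.

```shell
# Hypothetical helper: print one "sacctmgr create user" command per user.
# Usage: print_add_users CLUSTER ACCOUNT USER...
print_add_users() {
    cluster=$1
    account=$2
    shift 2
    for user in "$@"; do
        # -i commits the change without asking for confirmation
        printf 'sacctmgr -i create user name=%s cluster=%s account=%s\n' \
            "$user" "$cluster" "$account"
    done
}

# "alice" is a placeholder user name
print_add_users tardis test1 bob alice
```

Once the printed commands look right, the output can be piped to sh on the host where sacctmgr is available.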

report CPU and GPU usage for a week

sreport cluster AccountUtilizationByUser User=bob start=2024-04-29 end=2024-05-06 -t hour -T cpu,gres/gpu format=Accounts%21,TRESName,Used
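
The start and end dates do not have to be typed by hand. The sketch below builds the same sreport command for the 7 days before a given end date. It assumes GNU date (for the -d option), and the function name weekly_report_cmd is made up for illustration. It only prints the command.

```shell
# Hypothetical helper: print the sreport command for the 7 days before END.
# Usage: weekly_report_cmd USER [END]   (END as YYYY-MM-DD, default: today)
# Assumes GNU date, which understands -d "<date> 7 days ago".
weekly_report_cmd() {
    user=$1
    end=${2:-$(date -u +%F)}
    start=$(date -u -d "$end 7 days ago" +%F) || return 1
    printf 'sreport cluster AccountUtilizationByUser User=%s start=%s end=%s -t hour -T cpu,gres/gpu format=Accounts%%21,TRESName,Used\n' \
        "$user" "$start" "$end"
}

# Reproduces the command above: start=2024-04-29 end=2024-05-06
weekly_report_cmd bob 2024-05-06
```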

create reservation for user

scontrol create reservation ReservationName=pavlokh starttime=2024-03-17T20:04:00 endtime=2025-03-19T07:00:00 flags=ignore_jobs nodes=c001-c010 user=bob

create daily reservation for account

scontrol create reservation ReservationName=daily starttime=06:00:00 endtime=09:00:00 flags=ignore_jobs,daily nodes=c001 account=test

change job priority to the maximum

scontrol update priority=4294967293 job=19487792
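
The value 4294967293 is 2^32 - 3. As far as I know, Slurm stores the priority as an unsigned 32-bit number and keeps the two largest values (4294967294 and 4294967295) for internal use, which would make this the highest priority that can be set. Treat that explanation as an assumption; the arithmetic itself can be checked in the shell. The variable name max_priority is only for illustration.

```shell
# 2^32 - 3, assuming 64-bit shell arithmetic (bash and dash on 64-bit Linux)
max_priority=$(( (1 << 32) - 3 ))
echo "$max_priority"   # prints 4294967293
```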

show detailed sinfo output grouped by resource type

sinfo -o "%10P %5D %34N %5c %7m %37f %23G"

release a held job (this has also helped to force Slurm to re-evaluate a job)

scontrol release 19098440

show all collected accounting information about a job

sacct -j 19361471 --format="ALL"

some fields are long and get truncated; append %150 to widen the columns to 150 characters

sacct -j 19361471 --format="ALL%150"

show only selected fields

sacct -j 19108751 --format="JobID,JobName%30,Submit,Start,End,Elapsed"

change job time limit

scontrol update jobid=2569329 TimeLimit=8-00:00:00
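
The TimeLimit value above is written as days-hours:minutes:seconds, so 8-00:00:00 is eight days. Limits such as GrpTRESMins are counted in minutes, so a small converter can help when comparing the two. This is a sketch that handles only the full days-hours:minutes:seconds form and drops the seconds; the function name timelimit_to_minutes is made up for illustration.

```shell
# Hypothetical helper: convert D-HH:MM:SS to whole minutes (seconds ignored).
# Returns a non-zero status if the argument does not have four fields.
timelimit_to_minutes() {
    printf '%s\n' "$1" | awk -F'[-:]' '
        NF == 4 { print $1 * 1440 + $2 * 60 + $3; ok = 1 }
        END     { exit !ok }'
}

timelimit_to_minutes 8-00:00:00   # prints 11520
```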

show share information

sshare -l --format=Account,GrpTRESMins,TRESRunMins%215 -A account1

cluster usage by user bob for a week

sreport cluster AccountUtilizationByUser User=bob start=2024-05-01 end=2024-05-08 -t hour -T ALL

show QOS priority and limits per user

sacctmgr show qos format=name,priority,MaxTRESPerUser%20

change QOS priority and limits per user

sacctmgr modify qos normal set priority=25
sacctmgr modify qos high set priority=50
sacctmgr modify qos normal set maxtresperuser=cpu=800
sacctmgr modify qos high set maxtresperuser=cpu=800
sacctmgr modify qos normal set maxtresperuser=gres/gpu=80
sacctmgr modify qos high set maxtresperuser=gres/gpu=80

drain and resume node

scontrol update nodename=c003 state=drain reason=reinstall
scontrol update nodename=c003 state=resume

set shares and grptresmins

sacctmgr -i modify account test1 set share=1
sacctmgr -i modify account test1 set grptresmins=cpu=60

show shares and grptresmins

sacctmgr show assoc account=test1 format=cluster,account,user,share,grptresmins

show detailed overview of pending jobs (one per line)

# squeue -p GPUQ -t PENDING --format='%.20i|%.4P|%.5D|%.8c|%.10m|%.11l|%.8u|%.8q|%.17r|%b'
   JOBID|PART|NODES|MIN_CPUS|MIN_MEMORY| TIME_LIMIT|    USER|     QOS|           REASON|TRES_PER_NODE
19635781|GPUQ|    2|       4|     4000M|10-00:00:00|username|  normal|QOSMaxGRESPerUser|gres/gpu:a100:8
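
Because the format string separates the fields with |, the output is easy to post-process. The sketch below counts pending jobs per reason with awk. The function name count_pending_by_reason is made up for illustration, and it assumes REASON is the ninth field, as in the format string above.

```shell
# Hypothetical helper: read the squeue output above on stdin and
# print "count reason" pairs, most frequent first.
count_pending_by_reason() {
    awk -F'|' '
        NR > 1 {                       # skip the header line
            gsub(/^ +| +$/, "", $9)    # trim the padding added by %.17r
            count[$9]++
        }
        END { for (r in count) print count[r], r }' | sort -rn
}

# Usage (on a host with Slurm):
#   squeue -p GPUQ -t PENDING --format='%.20i|%.4P|%.5D|%.8c|%.10m|%.11l|%.8u|%.8q|%.17r|%b' | count_pending_by_reason
```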

set, remove (with -1), and show GrpTRES limits for a user

sacctmgr modify user bob set GrpTRES=cpu=150,gres/gpu=5
sacctmgr modify user bob set GrpTRES=cpu=-1,mem=-1,gres/gpu=-1
sacctmgr show assoc format=Account,user,GrpTRES%100 | grep bob

find all jobs run by a user on a given day

sacct -T -S 2024-06-19T00:00 -E 2024-06-19T23:59 --user bob -X -o jobid,jobname%10,user,start,end,state,nodelist