Basic Slurm Usage for Linux Clusters
Bright Cluster Manager is a comprehensive cluster management solution for all types of HPC clusters and server farms, including CPU and GPU clusters, storage and database clusters, and big-data Hadoop clusters. Slurm Workload Manager, which is integrated into Bright Cluster Manager, is an open-source resource manager with a plug-in architecture, used at many large installations, and it provides both queuing and scheduling functionality. This post walks through basic Slurm usage on a Linux cluster.
Creating a Slurm (memtester) job script
$ cat memtesterScript.sh
#!/bin/bash
/cm/shared/apps/memtester/current/memtester 24G
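The one-line script above relies on Slurm's defaults for everything else. A slightly fuller version can embed resource requests as #SBATCH directives so they travel with the script; the values below (job name, output file pattern, time limit) are illustrative additions, not part of the original script:

```shell
#!/bin/bash
#SBATCH --job-name=memtester        # name shown in squeue
#SBATCH --output=slurm-%A_%a.out    # %A = array job ID, %a = array task ID
#SBATCH --time=01:00:00             # wall-clock limit per task (illustrative)
/cm/shared/apps/memtester/current/memtester 24G
```

Flags given on the sbatch command line override these directives, so the plain one-line script remains usable as-is.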
Submitting the job to multiple nodes as a 50-task job array
$ module load slurm
$ sbatch --array=1-50 ~/memtesterScript.sh
Submitted batch job 120
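On success, sbatch prints the new job ID on a single line, which is handy to capture in wrapper scripts (for example, to chain jobs with --dependency). A minimal sketch, using the confirmation line above as a stand-in for live sbatch output:

```shell
# Capture the job ID from sbatch's one-line confirmation.
# In a real wrapper: submit_msg=$(sbatch --array=1-50 ~/memtesterScript.sh)
submit_msg='Submitted batch job 120'
jobid=${submit_msg##* }   # strip everything up to the last space
echo "captured job id: $jobid"
```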
Listing the job
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
120_[7-50] defq memteste exx PD 0:00 1 (Resources)
120_1 defq memteste exx R 1:42 1 node001
120_2 defq memteste exx R 1:42 1 node002
120_3 defq memteste exx R 1:42 1 node003
120_4 defq memteste exx R 1:42 1 node004
120_5 defq memteste exx R 1:42 1 node005
120_6 defq memteste exx R 1:42 1 node006
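The fifth column (ST) is the job state: R for running, PD for pending. That makes quick summaries easy to script with awk; the sample text below stands in for live `squeue --noheader` output on a real cluster:

```shell
# Count running vs. pending entries in squeue-style output.
# Sample lines stand in for `squeue --noheader` on a live cluster.
squeue_output='120_[7-50] defq memteste exx PD 0:00 1 (Resources)
120_1 defq memteste exx R 1:42 1 node001
120_2 defq memteste exx R 1:42 1 node002'
running=$(printf '%s\n' "$squeue_output" | awk '$5 == "R"  {n++} END {print n+0}')
pending=$(printf '%s\n' "$squeue_output" | awk '$5 == "PD" {n++} END {print n+0}')
echo "running=$running pending=$pending"
```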
Getting job details (each array task gets its own JobId; here 121 is task 1 of array job 120)
$ scontrol show job 121
JobId=121 ArrayJobId=120 ArrayTaskId=1 JobName=memtesterScript.sh
UserId=exx(1002) GroupId=exx(1002)
Priority=4294901753 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:05:01 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2015-08-14T00:13:20 EligibleTime=2015-08-14T00:13:21
StartTime=2015-08-14T00:13:21 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=defq AllocNode:Sid=bright71:17752
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node001
BatchHost=node001
NumNodes=1 NumCPUs=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/home/exx/memtesterScript.sh
WorkDir=/home/exx
StdErr=/home/exx/slurm-120_1.out
StdIn=/dev/null
StdOut=/home/exx/slurm-120_1.out
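Because scontrol prints space-separated Key=Value pairs, individual fields are easy to pull out in shell. A small sketch, using a trimmed copy of the output above in place of a live scontrol call:

```shell
# Extract one field (JobState) from `scontrol show job` style output.
# In practice: scontrol_output=$(scontrol show job 121)
scontrol_output='JobId=121 ArrayJobId=120 ArrayTaskId=1 JobName=memtesterScript.sh
JobState=RUNNING Reason=None Dependency=(null)'
state=$(printf '%s\n' "$scontrol_output" | tr ' ' '\n' | awk -F= '$1 == "JobState" {print $2}')
echo "JobState=$state"
```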
Suspending a job*
# scontrol suspend 125
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
125 defq memteste exx S 0:13 1 node001
Resuming a job*
# scontrol resume 125
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
125 defq memteste exx R 0:13 1 node001
Killing a job**
$ scancel 125
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
*Root only
**Users can kill their own jobs; root can kill any job
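scancel also accepts filters directly (for example `scancel --user=exx --state=PENDING`), but when killing many jobs it can be safer to generate the commands first and review them. A sketch over squeue-style sample text, standing in for live output:

```shell
# Print an scancel command for every pending job in squeue-style output.
# Pipe the result to `sh` only after reviewing it.
squeue_output='125 defq memteste exx PD 0:00 1 (Resources)
126 defq memteste exx R 0:13 1 node001'
printf '%s\n' "$squeue_output" | awk '$5 == "PD" {print "scancel " $1}'
```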