Basic Slurm Usage for Linux Clusters

November 25, 2015
2 min read

Bright Cluster Manager is a comprehensive cluster management solution for all types of HPC clusters and server farms, including CPU and GPU clusters, storage and database clusters, and big data Hadoop clusters. Slurm Workload Manager, which is integrated into Bright Cluster Manager, is an open source resource manager with a plug-in architecture and is used in many large installations. It provides both queuing and scheduling functionality. This blog post walks through basic Slurm usage on a Linux cluster, using a memtester job as the example workload.

Creating a Slurm (memtester) job script

$ cat memtesterScript.sh
#!/bin/bash

/cm/shared/apps/memtester/current/memtester 24G
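
In practice the resource requests usually live in the script itself as #SBATCH directives rather than on the sbatch command line. A minimal sketch of what that could look like for this memtester job is shown below; the job name, time limit, and output pattern are illustrative assumptions, while defq is the partition that appears in the squeue output later in this post.

#!/bin/bash
#SBATCH --job-name=memtester
#SBATCH --partition=defq            # partition shown in the queue listing below
#SBATCH --nodes=1                   # one node per array task
#SBATCH --time=02:00:00             # wall-clock limit (assumed value)
#SBATCH --output=slurm-%A_%a.out    # %A = array job ID, %a = array task ID

# test 24 GB of memory on whichever node this task lands on
/cm/shared/apps/memtester/current/memtester 24G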

Submitting the job as a job array (one task per node)

$ module load slurm
$ sbatch --array=1-50 ~/memtesterScript.sh
Submitted batch job 120
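
If 50 simultaneous memtester runs would tie up too much of the partition, the array specification can include a percent sign to cap how many tasks run at once (the cap of 10 below is an arbitrary example):

$ sbatch --array=1-50%10 ~/memtesterScript.sh   # at most 10 array tasks run at the same time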

Listing the jobs

$ squeue
     JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
120_[7-50]      defq memteste  exx PD  0:00     1 (Resources)
     120_1      defq memteste  exx  R  1:42     1 node001
     120_2      defq memteste  exx  R  1:42     1 node002
     120_3      defq memteste  exx  R  1:42     1 node003
     120_4      defq memteste  exx  R  1:42     1 node004
     120_5      defq memteste  exx  R  1:42     1 node005
     120_6      defq memteste  exx  R  1:42     1 node006
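
On a busy cluster the full queue listing gets long, so it is often narrowed to a single user or a single job:

$ squeue -u exx     # only jobs belonging to user exx
$ squeue -j 120     # only the tasks of array job 120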

Getting job details

$ scontrol show job 121
JobId=121 ArrayJobId=120 ArrayTaskId=1 JobName=memtesterScript.sh
   UserId=exx(1002) GroupId=exx(1002)
   Priority=4294901753 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:05:01 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2015-08-14T00:13:20 EligibleTime=2015-08-14T00:13:21
   StartTime=2015-08-14T00:13:21 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=defq AllocNode:Sid=bright71:17752
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node001
   BatchHost=node001
   NumNodes=1 NumCPUs=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/exx/memtesterScript.sh
   WorkDir=/home/exx
   StdErr=/home/exx/slurm-120_1.out
   StdIn=/dev/null
   StdOut=/home/exx/slurm-120_1.out
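
scontrol show job only works while a job is still pending or running. If Slurm accounting is configured on the cluster (not shown in this post), sacct can report similar details for jobs that have already finished, for example:

$ sacct -j 120 --format=JobID,JobName,Partition,State,Elapsed,ExitCode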

Suspending a job*

# scontrol suspend 125
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
125 defq memteste exx S 0:13 1 node01

Resuming a job*

# scontrol resume 125
# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
125 defq memteste exx R 0:13 1 node01
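
Suspend and resume act on a job that is already running. To stop a pending job (such as the queued array tasks above) from starting in the first place, scontrol hold and scontrol release can be used instead, and holding your own job normally does not require root:

$ scontrol hold 120      # pending tasks stay queued but will not be scheduled
$ scontrol release 120   # allow them to be scheduled again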

Killing a job**

$ scancel 125
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

*Root only
**Users can kill their own jobs; root can kill any job
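
scancel can also target individual array tasks or everything a user owns; the job and user IDs below are the ones from the session above:

$ scancel 120_3     # cancel only task 3 of array job 120
$ scancel 120       # cancel the entire array
$ scancel -u exx    # cancel all jobs belonging to user exx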
