General commands
Submit a job
sbatch myscript.sh
sbatch --test-only myscript.sh # validate the script and print an estimated start time without actually submitting the job
Job information for a user
squeue -u <username> # List all current jobs for a user
squeue -u <username> -t RUNNING # List all running jobs for a user
squeue -u <username> -t PENDING # List all pending jobs for a user
scontrol show jobid -dd <jobid> # List detailed information for a job (useful for troubleshooting)
sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps # List status info for a currently running job
Controlling jobs
scancel <jobid> # cancel one job
scancel -u <username> # cancel all the jobs for a user
scancel -t PENDING -u <username> # cancel all pending jobs for a user
scontrol hold <jobid> # hold a pending job so it is not scheduled (undo with scontrol release <jobid>)
scontrol requeue <jobid> # requeue (cancel and rerun) a particular job
scancel <jobid>_<index> # cancel a single indexed task of a job array
Exit code
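When I started using HPC I ran into a lot of problems, especially when deploying jobs with Slurm. If the Slurm script is set up to email you when a job begins, ends, or fails, each notification includes a code, and that code represents a job state. Below are notes on what some common exit codes mean.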
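The mail notifications mentioned above are requested with `#SBATCH` directives in the batch script. The following is a minimal sketch, not a site-specific template: the job name, time limit, output file name, and the address `user@example.com` are placeholders chosen for illustration, and the workload is a stand-in.

```shell
#!/bin/bash
# Hypothetical job name, task count, and time limit; adjust for your cluster.
#SBATCH --job-name=example
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
# %j in the output file name expands to the job ID.
#SBATCH --output=slurm-%j.out
# Send mail when the job begins, ends, or fails (placeholder address).
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=user@example.com

# SLURM_JOB_ID is set by Slurm inside a job; fall back when run by hand.
job_id=${SLURM_JOB_ID:-not-in-slurm}
echo "Job ${job_id} running on $(uname -n)"

# Stand-in for the real workload. The exit status of the last command
# in the script becomes the exit code reported for the batch job.
true
```

Because the `#SBATCH` lines are ordinary shell comments, the same file also runs directly with `bash myscript.sh` for a quick local check.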
The exit code of a batch job follows the standard Unix exit status conventions.
Typically, exit code 0 means successful completion.
Codes 1-127 are generated from the job calling exit() with a non-zero value to indicate an error.
Exit codes 129-255 indicate that the job was terminated by a Unix signal.
Each signal has a number, and the exit code is 128 plus that number; for example, SIGSEGV (signal 11) produces exit code 139.
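The ranges above can be turned into a small helper. This is a sketch in POSIX shell; the function name `describe_exit_code` is made up for this example, and the signal name is looked up with the shell's own `kill -l`, so the names depend on the system it runs on.

```shell
# Classify an exit code using the ranges described above.
# The body runs in a subshell so its variables do not leak.
describe_exit_code() (
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -ge 129 ] && [ "$code" -le 255 ]; then
        # Signal number is the exit code minus 128.
        sig=$((code - 128))
        # kill -l <number> prints the signal name without the SIG prefix;
        # it fails for numbers that are not valid signals on this system.
        name=$(kill -l "$sig" 2>/dev/null) || name=""
        if [ -n "$name" ]; then
            echo "terminated by signal $sig (SIG$name)"
        else
            echo "terminated by signal $sig"
        fi
    elif [ "$code" -ge 1 ] && [ "$code" -le 127 ]; then
        echo "error exit with status $code"
    else
        echo "unclassified exit code $code"
    fi
)

describe_exit_code 0
describe_exit_code 3
describe_exit_code 139
```

The last call reports signal 11, which matches the SIGSEGV row in the signal table.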
Job Termination Signals
Signal Name | Signal Number | Default Action (Term = terminate, Core = terminate and dump core) | Reason |
---|---|---|---|
SIGHUP | 1 | Term | Hangup detected on controlling terminal or death of controlling process |
SIGINT | 2 | Term | Interrupt from keyboard |
SIGQUIT | 3 | Core | Quit from keyboard |
SIGILL | 4 | Core | Illegal Instruction |
SIGABRT | 6 | Core | Abort signal from abort(3) |
SIGFPE | 8 | Core | Floating point exception |
SIGKILL | 9 | Term | Kill signal |
SIGSEGV | 11 | Core | Invalid memory reference |
SIGPIPE | 13 | Term | Broken pipe: write to pipe with no readers |
SIGALRM | 14 | Term | Timer signal from alarm(2) |
SIGTERM | 15 | Term | Termination signal |
Job Exit Status
Exit Code | Reason |
---|---|
9 | Ran out of CPU time. |
64 | The job ended cleanly, but it was running out of CPU time. The solution is to submit the job to a queue with more resources (a higher CPU time limit). |
125 | An ErrMsg(severe) was reached in your job. |
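127 | Command not found: by shell convention, 127 means a command could not be located. Check the executable name, PATH, and loaded modules before suspecting the machine. |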
130 | The job ran out of CPU or swap time. If swap time is the culprit, check for memory leaks. |
131 | The job ran out of CPU or swap time. If swap time is the culprit, check for memory leaks. |
134 | The job was killed by an abort signal (SIGABRT, 128 + 6) and probably dumped core. This is often caused by an assert() or an ErrMsg(fatal) being hit in your job, which may point to a run-time bug in your code. Use a debugger such as gdb or TotalView to find out what is wrong. |
137 | The job was killed by SIGKILL (128 + 9), typically because it exceeded the time limit. |
139 | Segmentation violation. Usually indicates a pointer error. |
140 | The job exceeded the “wall clock” time limit (as opposed to the CPU time limit). |
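Besides the mail notification, the exit code of a finished job can be read from Slurm accounting with `sacct`, which prints the `ExitCode` field as two numbers separated by a colon: the exit status and the signal that ended the job, if any. This is not covered in the notes above; the helper below is a sketch, and the name `explain_sacct_exitcode` is hypothetical.

```shell
# Look up the exit code of a finished job (requires Slurm accounting):
#   sacct -j <jobid> --format=JobID,State,ExitCode

# Interpret an ExitCode value of the form "<status>:<signal>".
explain_sacct_exitcode() (
    pair=$1
    status=${pair%%:*}   # part before the colon
    signal=${pair##*:}   # part after the colon
    if [ "$signal" -ne 0 ]; then
        echo "killed by signal $signal"
    elif [ "$status" -eq 0 ]; then
        echo "completed successfully"
    else
        echo "exited with status $status"
    fi
)

explain_sacct_exitcode "0:0"
explain_sacct_exitcode "1:0"
explain_sacct_exitcode "0:9"
```

The function only parses the string, so it can be tried on any machine without Slurm installed.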
Reference:
https://www.rc.fas.harvard.edu/resources/documentation/convenient-slurm-commands/
https://sites.google.com/a/case.edu/hpc-upgraded-cluster/cluster-faq/running-jobs/exit-code-status
https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-list-of-run-time-error-messages