:!: I copied this over from my old Wiki. It needs some updating. Especially the node sharing on the zBox partition is now deprecated (I think). --- //[[volker@physik.uzh.ch|Volker Hoffmann]] 2014/09/24 11:55// :!:
====== SLURM Scheduler ======

  * [[http://slurm.schedmd.com/rosetta.html|Rosetta Stone of Schedulers]]
  * Cf. [[https://computing.llnl.gov/linux/slurm/man_index.html]]
  * Especially [[https://computing.llnl.gov/linux/slurm/sbatch.html]]
| + | |||
| + | ===== Basics ====== | ||
| + | |||
| + | * Submit batch jobs | ||
| + | |||
| + | <code> | ||
| + | $ sbatch script.job | ||
| + | </code> | ||
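
sbatch prints the id of the newly created job; you will need that id for scancel, sattach, and scontrol further down (the id shown here is just an example):

<code>
$ sbatch script.job
Submitted batch job 1234
</code>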
| + | |||
| + | * Cancel jobs | ||
| + | |||
| + | <code> | ||
| + | $ scancel jobid | ||
| + | </code> | ||
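
scancel also accepts filters; for instance, to cancel all of your own jobs at once (use with care):

<code>
$ scancel --user=$USER
</code>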
| + | |||
| + | * View the queue | ||
| + | |||
| + | <code> | ||
| + | $ squeue | ||
| + | </code> | ||
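
To narrow the listing down to your own jobs, or to a single job:

<code>
$ squeue --user=$USER
$ squeue --job 1234
</code>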
| + | |||
| + | See below for example job scripts. | ||
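
As a primer, a minimal sketch of such a script follows; job name, output file, and time limit are placeholders, and your cluster may also require **--partition** and **--account** (cf. the GPU example below):

<file>
#!/bin/bash
#SBATCH --job-name hello
#SBATCH --output hello-%j.out
#SBATCH --ntasks 1
#SBATCH --time 0-00:10:00

srun hostname
</file>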
| + | |||
| + | ===== Random Tips & Tricks ===== | ||
| + | |||
| + | * Attach to a running job [[https://computing.llnl.gov/linux/slurm/sattach.html]] | ||
| + | |||
| + | <code> | ||
| + | $ sattach jobid.jobstep | ||
| + | </code> | ||
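
Job steps are numbered from zero, so attaching to the first (and usually only) step of job 1234 looks like:

<code>
$ sattach 1234.0
</code>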
| + | |||
| + | * We can hold a job by postponing it's start time [[https://computing.llnl.gov/linux/slurm/faq.html#hold]] | ||
| + | |||
| + | <code> | ||
| + | $ scontrol update JobId=1234 StartTime=now+30days | ||
| + | ... later ... | ||
| + | $ scontrol update JobId=1234 StartTime=now | ||
| + | </code> | ||
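
Depending on the SLURM version, the same can be done more directly with the hold/release commands of scontrol:

<code>
$ scontrol hold 1234
... later ...
$ scontrol release 1234
</code>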
| + | |||
| + | * If you want squeue to look like at CSCS, add the following to your .bashrc | ||
| + | |||
| + | <file> | ||
| + | alias squeue="squeue --format='%.12i %.8u %.9P %.32j %.12B %.2t %.12r %.14M %.14L %.6D %.10Q'" | ||
| + | </file> | ||
| + | |||
| + | ===== Launch Interactive GPU Jobs (Compiling, Testing) ===== | ||
| + | |||
| + | * Allocate a GPU slot | ||
| + | |||
| + | <code> | ||
| + | salloc --ntasks 1 --gres gpu:1 --partition tasna --account gpu | ||
| + | </code> | ||
| + | |||
| + | * Once allocated, launch bash shell | ||
| + | |||
| + | <code> | ||
| + | srun --pty bash | ||
| + | </code> | ||
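
To check that the allocation actually handed you a GPU (assuming the NVIDIA tools are installed on the node):

<code>
$ echo $CUDA_VISIBLE_DEVICES
$ nvidia-smi
</code>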
| + | |||
| + | * :!: Always do this from the front-end nodes. As Slurm inherits you're environment, CUDA stuff (nvcc, etc) won't be available of you issue this job from other computers. | ||
===== Example Script for GPU Jobs =====
  * #XSBATCH lines are comments and are not parsed by SLURM.
<file>
#!/bin/bash
# --- resource requests; adjust paths, partition, and account to your setup ---
#SBATCH --output /home/ics/volker/Genga/Jobs/HitnRun/Reufer2012/Logs/cC03m_conex-%j.out
#SBATCH --job-name HitnRun/R12/cC03m/ConeX
#SBATCH --partition vesta
#SBATCH --account gpu
#SBATCH --ntasks 1
#SBATCH --gres gpu:1
#SBATCH --time 28-00:00:00
#XSBATCH --exclude=tasna5
#SBATCH --mail-user you@yourdomain.com
#SBATCH --mail-type END
#SBATCH --no-requeue

# locations of the binary and the output directory
home=/home/ics/volker
data=/zbox/data/volker
genga=$home/Source/genga-dev-hitnrun/source/genga_hitnrun_coll24days_sm37
outdir=$data/HitnRun/Reufer2012/cC03m_conex

# log some diagnostics before launching
echo ""
echo "***** LAUNCHING *****"
echo `date '+%F %H:%M:%S'`
echo ""
echo "genga="$genga
echo "outdir="$outdir
echo "hostname="`hostname`
echo "cuda_visible_devices="$CUDA_VISIBLE_DEVICES

echo ""
echo "***"
echo ""

# run from the output directory; timestamp the log file
cd $outdir
export DATE=`date +%F_%H%M`
srun $genga > Run_$DATE.log
echo ""
</file>
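
Save the script under any name (**genga.job** below is just a placeholder) and submit it from a front-end node:

<code>
$ sbatch genga.job
</code>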
===== Example Script for MPI Jobs =====
The file **/home/itp/volker/Slurm/blacklist** contains a line-by-line listing of nodes we wish to avoid.
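
For reference, such a blacklist is plain text with one node name per line, e.g. (node names are placeholders):

<file>
tasna5
tasna7
</file>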
<file>
#!/bin/bash