:!: I copied this over from my old Wiki. It needs some updating. In particular, node sharing on the zBox partition is now deprecated (I think). --- //[[volker@physik.uzh.ch|Volker Hoffmann]] 2014/09/24 11:55// :!:
====== SLURM Scheduler ======

  * [[http://slurm.schedmd.com/rosetta.html|Rosetta Stone of Schedulers]]
  * Cf. [[https://computing.llnl.gov/linux/slurm/man_index.html]]
  * Especially [[https://computing.llnl.gov/linux/slurm/sbatch.html]]
+ | |||
+ | ===== Basics ====== | ||
+ | |||
+ | * Submit batch jobs | ||
+ | |||
+ | <code> | ||
+ | $ sbatch script.job | ||
+ | </code> | ||
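
When scripting around sbatch, it helps to capture the id of the job you just submitted; the --parsable flag makes sbatch print only the numeric job id. This is a sketch; ''script.job'' stands in for your own batch script:

```shell
# Submit a job and keep its id for later use (scancel, sattach, scontrol, ...).
# --parsable suppresses the "Submitted batch job" prefix, leaving only the id.
jobid=$(sbatch --parsable script.job)
echo "Submitted job $jobid"
```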
+ | |||
+ | * Cancel jobs | ||
+ | |||
+ | <code> | ||
+ | $ scancel jobid | ||
+ | </code> | ||
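
scancel also accepts filters instead of a single job id; for instance, to cancel all of your own jobs at once (check the queue first before doing this):

```shell
# Cancel every job belonging to the current user.
$ scancel --user=$USER
```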
+ | |||
+ | * View the queue | ||
+ | |||
+ | <code> | ||
+ | $ squeue | ||
+ | </code> | ||
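
The queue listing can be narrowed down to your own jobs, or to a single job of interest:

```shell
# Only my jobs
$ squeue --user=$USER

# Only job 1234; --start adds the expected start time for pending jobs
$ squeue --jobs=1234 --start
```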
+ | |||
+ | See below for example job scripts. | ||
+ | |||
+ | ===== Random Tips & Tricks ===== | ||
+ | |||
+ | * Attach to a running job [[https://computing.llnl.gov/linux/slurm/sattach.html]] | ||
+ | |||
+ | <code> | ||
+ | $ sattach jobid.jobstep | ||
+ | </code> | ||
+ | |||
+ | * We can hold a job by postponing it's start time [[https://computing.llnl.gov/linux/slurm/faq.html#hold]] | ||
+ | |||
+ | <code> | ||
+ | $ scontrol update JobId=1234 StartTime=now+30days | ||
+ | ... later ... | ||
+ | $ scontrol update JobId=1234 StartTime=now | ||
+ | </code> | ||
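
scontrol also has dedicated hold and release subcommands, which should achieve the same effect without touching StartTime:

```shell
$ scontrol hold 1234
... later ...
$ scontrol release 1234
```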
+ | |||
+ | * If you want squeue to look like at CSCS, add the following to your .bashrc | ||
+ | |||
+ | <file> | ||
+ | alias squeue="squeue --format='%.12i %.8u %.9P %.32j %.12B %.2t %.12r %.14M %.14L %.6D %.10Q'" | ||
+ | </file> | ||
+ | |||
+ | ===== Launch Interactive GPU Jobs (Compiling, Testing) ===== | ||
+ | |||
+ | * Allocate a GPU slot | ||
+ | |||
+ | <code> | ||
+ | salloc --ntasks 1 --gres gpu:1 --partition tasna --account gpu | ||
+ | </code> | ||
+ | |||
+ | * Once allocated, launch bash shell | ||
+ | |||
+ | <code> | ||
+ | srun --pty bash | ||
+ | </code> | ||
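
Inside the interactive shell you can verify that the allocation actually exposes a GPU (assuming nvidia-smi is installed on the node):

```shell
# Which GPU device(s) did SLURM hand us?
echo $CUDA_VISIBLE_DEVICES

# Does the driver see the card?
nvidia-smi
```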
+ | |||
+ | * :!: Always do this from the front-end nodes. As Slurm inherits you're environment, CUDA stuff (nvcc, etc) won't be available of you issue this job from other computers. | ||
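
The two steps above can also be combined into a single srun call that drops you straight into a shell on the allocated node (a sketch, using the same partition and account assumptions as above):

```shell
srun --ntasks 1 --gres gpu:1 --partition tasna --account gpu --pty bash
```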

===== Example Script for GPU Jobs =====

  * #XSBATCH lines are comments and are not parsed by SLURM.
<file>
#!/bin/bash
#SBATCH --output /home/ics/volker/Genga/Jobs/HitnRun/Reufer2012/Logs/cC03m_conex-%j.out
#SBATCH --job-name HitnRun/R12/cC03m/ConeX
#SBATCH --partition vesta
#SBATCH --account gpu
#SBATCH --ntasks 1
#SBATCH --gres gpu:1
#SBATCH --time 28-00:00:00
#XSBATCH --exclude=tasna5
#SBATCH --mail-user you@yourdomain.com
#SBATCH --mail-type END
#SBATCH --no-requeue

home=/home/ics/volker
data=/zbox/data/volker
genga=$home/Source/genga-dev-hitnrun/source/genga_hitnrun_coll24days_sm37
outdir=$data/HitnRun/Reufer2012/cC03m_conex

echo ""
echo "***** LAUNCHING *****"
echo `date '+%F %H:%M:%S'`
echo ""
echo "genga="$genga
echo "outdir="$outdir
echo "hostname="`hostname`
echo "cuda_visible_devices="$CUDA_VISIBLE_DEVICES

echo ""
echo "***"
echo ""

cd $outdir
export DATE=`date +%F_%H%M`
srun $genga > Run_$DATE.log
echo ""
</file>
===== Example Script for MPI Jobs =====

The file **/home/itp/volker/Slurm/blacklist** contains a line-by-line listing of nodes we wish to avoid.

<file>
#!/bin/bash