:!: I copied this over from my old wiki. It needs some updating. In particular, node sharing on the zBox partition is now deprecated (I think). --- //[[volker@physik.uzh.ch|Volker Hoffmann]] 2014/09/24 11:55// :!:

====== SLURM Scheduler ======

  * [[http://slurm.schedmd.com/rosetta.html|Rosetta Stone of Schedulers]]
  * Cf. [[https://computing.llnl.gov/linux/slurm/man_index.html]]
  * Especially [[https://computing.llnl.gov/linux/slurm/sbatch.html]]

===== Basics =====

  * Submit batch jobs
    $ sbatch script.job
  * Cancel jobs
    $ scancel jobid
  * View the queue
    $ squeue

See below for example job scripts.

===== Random Tips & Tricks =====

  * Attach to a running job [[https://computing.llnl.gov/linux/slurm/sattach.html]]
    $ sattach jobid.jobstep
  * We can hold a job by postponing its start time [[https://computing.llnl.gov/linux/slurm/faq.html#hold]]
    $ scontrol update JobId=1234 StartTime=now+30days
    ... later ...
    $ scontrol update JobId=1234 StartTime=now
  * If you want the squeue output to look like it does at CSCS, add the following to your .bashrc
    alias squeue="squeue --format='%.12i %.8u %.9P %.32j %.12B %.2t %.12r %.14M %.14L %.6D %.10Q'"

===== Launch Interactive GPU Jobs (Compiling, Testing) =====

  * Allocate a GPU slot
    salloc --ntasks 1 --gres gpu:1 --partition tasna --account gpu
  * Once allocated, launch a bash shell
    srun --pty bash
  * :!: Always do this from the front-end nodes. Because Slurm inherits your environment, the CUDA tools (nvcc, etc.) won't be available if you submit the job from other machines.

===== Example Script for GPU Jobs =====

  * Lines starting with #XSBATCH are treated as comments and are not parsed by Slurm.

  #!/bin/bash
  #SBATCH --output /home/ics/volker/Genga/Jobs/HitnRun/Reufer2012/Logs/cC03m_conex-%j.out
  #SBATCH --job-name HitnRun/R12/cC03m/ConeX
  #SBATCH --partition vesta
  #SBATCH --account gpu
  #SBATCH --ntasks 1
  #SBATCH --gres gpu:1
  #SBATCH --time 28-00:00:00
  #XSBATCH --exclude=tasna5
  #SBATCH --mail-user you@yourdomain.com
  #SBATCH --mail-type END
  #SBATCH --no-requeue
  # Paths to the executable and the output directory
  home=/home/ics/volker
  data=/zbox/data/volker
  genga=$home/Source/genga-dev-hitnrun/source/genga_hitnrun_coll24days_sm37
  outdir=$data/HitnRun/Reufer2012/cC03m_conex
  # Record what runs where in the job log
  echo ""
  echo "***** LAUNCHING *****"
  echo "$(date '+%F %H:%M:%S')"
  echo ""
  echo "genga=$genga"
  echo "outdir=$outdir"
  echo "hostname=$(hostname)"
  echo "cuda_visible_devices=$CUDA_VISIBLE_DEVICES"
  echo ""
  echo "***"
  echo ""
  # Run from the output directory; stop if it does not exist
  cd "$outdir" || exit 1
  export DATE=$(date +%F_%H%M)
  srun "$genga" > "Run_$DATE.log"
  echo ""
  echo "***** DONE *****"
  echo "$(date '+%F %H:%M:%S')"
  echo ""

===== Example Script for MPI Jobs =====

The file **/home/itp/volker/Slurm/blacklist** lists, one per line, the nodes we wish to avoid.

  #!/bin/bash
  #SBATCH -o /home/itp/volker/Mydisk/Jobs/AdaC/1024/Logs/t03000_E4__R1-%j.out
  #SBATCH -J AdaC/1024/t03000_E4__R1
  #SBATCH -p zbox
  #SBATCH --time 0-24:00:00
  #SBATCH --ntasks=256 --exclusive
  #SBATCH --exclude=/home/itp/volker/Slurm/blacklist
  #SBATCH --mail-user=volker@physik.uzh.ch
  # Paths to the namelist, the executable, and the run directory
  home=/home/itp/volker
  scratch=/zbox/project/volker
  nml=$home/Mydisk/NML/AdaC/1024/t03000_E4__R1.nml
  ramses=$home/Source/ramses-dev/trunk/ramses/bin/ppd3d
  data=$scratch/Mydisk/AdaC/1024/t03000_E4__R1
  # Run from the data directory; stop if it does not exist
  cd "$data" || exit 1
  echo "$nml"
  echo "$ramses"
  echo "***"
  pwd
  echo ""
  echo "***** LAUNCHING *****"
  echo "$(date '+%F %H:%M:%S')"
  echo ""
  export DATE=$(date +%F_%H%M)
  time srun "$ramses" "$nml" > "$data/Run_$DATE.log"
  echo ""
  echo "***** DONE *****"
  echo "$(date '+%F %H:%M:%S')"
  echo ""

===== Example Script for Node-Sharing Single-Core Jobs =====

  #!/bin/bash
  #SBATCH -o /home/itp/volker/Mydisk/Jobs/Viz4/AdaC/1024/Logs/t03000_E4__R1-%j.out
  #SBATCH -J Viz4/AdaC/1024/t03000_E4__R1
  #SBATCH -p zbox
  #SBATCH --ntasks=1
  #SBATCH --time=0-06:00:00
  #SBATCH --exclude=/home/itp/volker/Slurm/blacklist
  #SBATCH --mail-user=volker@physik.uzh.ch
  # Load Python Environment
  export WORKON_HOME=$HOME/.virtualenvs
  export PROJECT_HOME=$HOME/Source
  source "$HOME/.local/bin/virtualenvwrapper.sh"
  workon scipy
  # Range of outputs to process, plus options
  imin=1
  imax=47
  opts="--together"
  #fps=15
  # Paths to the scripts and the data directory
  home=/home/itp/volker
  scratch=/zbox/project/volker
  script1=$home/Source/Viz4/reduce.py
  script2=$home/Source/Viz4/plot_quad_xy.py
  script3=$home/Source/Viz4/plot_quad_rz.py
  script4=$home/Source/Viz4/plot_quad_r.py
  data=$scratch/Mydisk/AdaC/1024/t03000_E4__R1
  # Record the commands about to run in the job log
  echo "$data"
  echo "$script1 $imin $imax --lofi"
  echo "$script2 $imin $imax"
  echo "$script3 $imin $imax"
  echo "$script4 $imin $imax"
  echo ""
  echo "***** LAUNCHING *****"
  echo "$(date '+%F %H:%M:%S')"
  echo ""
  # Run from the data directory; stop if it does not exist
  cd "$data" || exit 1
  time python "$script1" "$imin" "$imax" --lofi
  time python "$script2" "$imin" "$imax"
  time python "$script3" "$imin" "$imax"
  time python "$script4" "$imin" "$imax"
  # mencoder "mf://quad_r_*.png" -mf w=1600:h=1200:fps=${fps}:type=png -ovc lavc -lavcopts vcodec=mpeg4:mbd=2:trell -oac copy -o quad_r.avi
  # mencoder "mf://quad_rz_*.png" -mf w=1600:h=1200:fps=${fps}:type=png -ovc lavc -lavcopts vcodec=mpeg4:mbd=2:trell -oac copy -o quad_rz.avi
  # mencoder "mf://quad_xy_*.png" -mf w=1600:h=1200:fps=${fps}:type=png -ovc lavc -lavcopts vcodec=mpeg4:mbd=2:trell -oac copy -o quad_xy.avi
  echo ""
  echo "***** DONE *****"
  echo "$(date '+%F %H:%M:%S')"
  echo ""
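
===== Sketch: Exclude List from a Blacklist File =====

The MPI and node-sharing scripts above pass a blacklist file to ''--exclude''. As an alternative for ad-hoc submissions, the file can be flattened into a comma-separated node list on the command line. This is only a minimal sketch: the function name ''blacklist_to_exclude'', the node names, and the convention of skipping blank lines and ''#'' comment lines are assumptions for illustration, not something the original setup defines.

```shell
#!/bin/bash
# Sketch: turn a one-node-per-line blacklist file into a comma-separated
# list that can be passed as "sbatch --exclude=<list>".
# Assumption: blank lines and lines starting with '#' are not node names.
blacklist_to_exclude() {
    local file=$1
    # Drop blank and comment lines, then join the rest with commas.
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$file" | paste -sd, -
}

# Demo with a temporary file and hypothetical node names.
demo_blacklist=$(mktemp)
printf '%s\n' 'zbox101' '' '# flaky fan' 'zbox117' > "$demo_blacklist"
exclude_list=$(blacklist_to_exclude "$demo_blacklist")

# Only print the command here; remove the echo to actually submit.
echo "sbatch --exclude=$exclude_list script.job"
rm -f "$demo_blacklist"
```

Printing the command first makes it easy to check the resulting node list before anything is submitted.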