SlurmQuest · 2025 · HPC scheduling lab

Learn HPC scheduling by running real experiments.

A living lab notebook of SLURM experiments, field notes, and hands-on challenges for curious cluster engineers.

Runbook autopsies
Queue & GPU signals
Interactive experiments
Engineers in the lab: 500+
Cluster playbooks: 38
New experiments / month: 12
Live queue signal
Healthy
Queue
Streaming
$ squeue -u user
JOBID  NAME      STATE    TIME   NODES
49231  train.py  RUNNING  03:15  1
49232  eval.sh   PENDING  0:00   1
$ tail -f train.log
[2025-02-14 11:36:39] Epoch 24: loss=0.421
[2025-02-14 11:37:01] Epoch 27: loss=0.334
[2025-02-14 11:37:08] Epoch 28: loss=0.312
GPU
live telemetry
$ nvidia-smi
A100 80GB · Util: 64% · Mem: 34/80GB
Temp: 64°C
Batch
steady
$ sbatch train-job.sbatch
Backfill window clear · Next slot opens in 02:12
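
Curious what that submission looks like on disk? Here is a minimal sketch of a train-job.sbatch; the partition name, GPU request, and time limit are assumptions for illustration, not the actual file from the demo.

train-job.sbatch
#!/bin/bash
#SBATCH --job-name=train.py          # matches the NAME column in the squeue panel
#SBATCH --nodes=1                    # one node, as in the mockup
#SBATCH --gres=gpu:1                 # assumed single-GPU request (the A100 in the telemetry panel)
#SBATCH --partition=gpu              # assumed partition name
#SBATCH --time=04:00:00              # tight wall-clock limits make backfill's job easier
#SBATCH --output=train.log           # the log tailed above

srun python train.py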
New: failure drills weekly
Fair-share tuning, backfill drills, GPU queue health
Build Log · Open build notes
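
If you want to read the fair-share, backfill, and GPU-health signals on your own cluster, these stock SLURM commands are a reasonable starting point (output formats vary by site):

$ sshare -a                          # fair-share usage and normalized shares per account
$ squeue --start -u $USER            # scheduler's estimated start times for pending jobs (backfill view)
$ sinfo -o "%P %G %D %t"             # partitions, GRES (GPUs), node counts, and node states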

Built in the open, for curious operators

Engineering notes, field reports, and new drills as we learn.

On deck

Next on the bench

Field tests soon

Challenges

Practice real-world SLURM problems in a playful sandbox. Submit jobs, break queues, fix scheduling failures — all inside your browser.

challenge_01_fair_share.sh
$...
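
As a purely hypothetical taste of what a fair-share drill could involve, the steps might look something like this; the account names and the low-prio.sbatch file are invented for illustration and are not the contents of challenge_01_fair_share.sh:

$ sshare -A research,teaching -l     # compare fair-share factors for two hypothetical accounts
$ sbatch -A teaching low-prio.sbatch # submit under the account that has been starved
$ sprio -j <jobid>                   # see how the fair-share factor moves the job's priority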

Want to help shape it? Drop your email in the lab notes below and we'll invite small waves for fast feedback.

Lab Notes · small batch each week

Learn SLURM by tinkering together

Short lab notes with experiments, failures, and fixes from real clusters. Built for the curious and the hands-on.

Runbook autopsies

Post-mortems, fixes, and what we learned.

Queue-first heuristics

Backfill recipes, fair-share tuning, and GPU-aware tips.

Notes + telemetry

Short lessons paired with interactive terminals.

Occasional notes. Unsubscribe anytime.

Field notes
Failure drills
Open build