SlurmQuest · 2025 · HPC scheduling lab

Learn HPC scheduling by running real experiments.

A living lab notebook of SLURM experiments, field notes, and hands-on challenges for curious cluster engineers.

Runbook autopsies
Queue & GPU signals
Interactive experiments
Engineers in the lab: 500+
Cluster playbooks: 38
New experiments / month: 12
Live queue signal
Healthy
Queue
Streaming
$ squeue -u user
JOBID  NAME      STATE    TIME   NODES
49231  train.py  RUNNING  03:15  1
49232  eval.sh   PENDING  0:00   1
$ tail -f train.log
[2025-02-14 11:36:39] Epoch 24: loss=0.421
[2025-02-14 11:37:01] Epoch 27: loss=0.334
[2025-02-14 11:37:08] Epoch 28: loss=0.312
GPU
live telemetry
$ nvidia-smi
A100 80GB · Util: 64% · Mem: 34/80GB
Temp: 64°C
Batch
steady
$ sbatch train-job.sbatch
Backfill window clear · Next slot opens in 02:12
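
Curious what that submission looks like on disk? Here is a minimal sketch of a train-job.sbatch; the partition name, GPU request, and time limit are assumptions for illustration, not the actual file from the demo.

train-job.sbatch
#!/bin/bash
#SBATCH --job-name=train.py          # matches the NAME column in the squeue panel
#SBATCH --nodes=1                    # one node, as in the mockup
#SBATCH --gres=gpu:1                 # assumed single-GPU request (the A100 in the telemetry panel)
#SBATCH --partition=gpu              # assumed partition name
#SBATCH --time=04:00:00              # tight wall-clock limits make backfill's job easier
#SBATCH --output=train.log           # the log tailed above

srun python train.py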
New: failure drills weekly
Fair-share tuning, backfill drills, GPU queue health
Build Log · Open build notes
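
If you want to read the fair-share, backfill, and GPU-health signals on your own cluster, these stock SLURM commands are a reasonable starting point (output formats vary by site):

$ sshare -a                          # fair-share usage and normalized shares per account
$ squeue --start -u $USER            # scheduler's estimated start times for pending jobs (backfill view)
$ sinfo -o "%P %G %D %t"             # partitions, GRES (GPUs), node counts, and node states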

Built in the open, for curious operators

Engineering notes, field reports, and new drills as we learn.

On deck

Next on the bench

Field tests soon

Challenges

Practice real-world SLURM problems in a playful sandbox. Submit jobs, break queues, fix scheduling failures — all inside your browser.

challenge_01_fair_share.sh
$...
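
As a purely hypothetical taste of what a fair-share drill could involve, the steps might look something like this; the account names and the low-prio.sbatch file are invented for illustration and are not the contents of challenge_01_fair_share.sh:

$ sshare -A research,teaching -l     # compare fair-share factors for two hypothetical accounts
$ sbatch -A teaching low-prio.sbatch # submit under the account that has been starved
$ sprio -j <jobid>                   # see how the fair-share factor moves the job's priority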

Want to help shape it? Drop your email in the lab notes below and we'll invite small waves for fast feedback.

Lab Notes · small batch each week

Learn SLURM by tinkering together

Short lab notes with experiments, failures, and fixes from real clusters. Built for the curious and the hands-on.

Runbook autopsies

Post-mortems, fixes, and what we learned.

Queue-first heuristics

Backfill recipes, fair-share tuning, and GPU-aware tips.

Notes + telemetry

Short lessons paired with interactive terminals.

Occasional notes. Unsubscribe anytime.

Field notes
Failure drills
Open build