SlurmQuest is my way of sharing years of hands-on experience managing petaflop-scale clusters, optimizing workloads, and solving real infrastructure challenges in production environments.
Currently working on RCCL (ROCm Communication Collectives Library) for high-performance computing workloads on AMD GPUs. Focused on optimizing collective communication operations for distributed training and HPC applications.
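To make the collective-communication work concrete, here is a minimal sketch of the kind of operation RCCL provides: a single-process all-reduce across all local AMD GPUs through RCCL's NCCL-compatible API. The header path, buffer sizes, and build line are illustrative assumptions, not code from any RCCL project.

```cpp
// Minimal single-process all-reduce sketch using RCCL's NCCL-compatible API.
// Illustrative only; build with something like: hipcc allreduce_sketch.cpp -lrccl
#include <rccl/rccl.h>        // older ROCm installs expose this header as <rccl.h>
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

#define HIP_CHECK(cmd)  do { hipError_t e = (cmd); if (e != hipSuccess) { \
    std::fprintf(stderr, "HIP: %s\n", hipGetErrorString(e)); return 1; } } while (0)
#define RCCL_CHECK(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
    std::fprintf(stderr, "RCCL: %s\n", ncclGetErrorString(r)); return 1; } } while (0)

int main() {
    int ndev = 0;
    HIP_CHECK(hipGetDeviceCount(&ndev));

    const size_t count = 1 << 20;                 // 1M floats per GPU (arbitrary)
    std::vector<ncclComm_t>  comms(ndev);
    std::vector<hipStream_t> streams(ndev);
    std::vector<float*>      send(ndev), recv(ndev);

    // Allocate a buffer pair and a stream on every visible GPU.
    for (int i = 0; i < ndev; ++i) {
        HIP_CHECK(hipSetDevice(i));
        HIP_CHECK(hipMalloc(&send[i], count * sizeof(float)));
        HIP_CHECK(hipMalloc(&recv[i], count * sizeof(float)));
        HIP_CHECK(hipMemset(send[i], 0, count * sizeof(float)));
        HIP_CHECK(hipStreamCreate(&streams[i]));
    }

    // One communicator per local GPU; a NULL device list means GPUs 0..ndev-1.
    RCCL_CHECK(ncclCommInitAll(comms.data(), ndev, nullptr));

    // Group the per-GPU calls so the sum-reduction launches as one collective.
    RCCL_CHECK(ncclGroupStart());
    for (int i = 0; i < ndev; ++i)
        RCCL_CHECK(ncclAllReduce(send[i], recv[i], count, ncclFloat, ncclSum,
                                 comms[i], streams[i]));
    RCCL_CHECK(ncclGroupEnd());

    // Wait for completion, then release resources.
    for (int i = 0; i < ndev; ++i) {
        HIP_CHECK(hipSetDevice(i));
        HIP_CHECK(hipStreamSynchronize(streams[i]));
        RCCL_CHECK(ncclCommDestroy(comms[i]));
        HIP_CHECK(hipStreamDestroy(streams[i]));
        HIP_CHECK(hipFree(send[i]));
        HIP_CHECK(hipFree(recv[i]));
    }
    std::printf("all-reduce of %zu floats across %d GPUs complete\n", count, ndev);
    return 0;
}
```

A multi-node training job would instead exchange an ncclUniqueId (typically over MPI) and call ncclCommInitRank once per rank; the single-process form above just keeps the sketch self-contained.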
Managed multiple petaflop-scale clusters across global locations, handling compute, storage, and networking infrastructure.
Built and deployed hybrid cloud bursting solutions for dynamic workload scaling (see the configuration sketch after this list)
Implemented end-to-end observability stack for cluster health monitoring
Configured and optimized Lustre and BeeGFS parallel file systems at scale
Designed and provisioned research computing infrastructure for scientific workloads
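For the cloud-bursting item above, the excerpt below sketches the Slurm power-saving knobs such a setup leans on. Node names, sizes, timeouts, and script paths are placeholders for illustration, not values from any production cluster.

```
# slurm.conf excerpt (illustrative sketch; names, sizes, and paths are placeholders)

# Elastic/cloud scheduling hooks: slurmctld invokes these scripts to create
# and tear down cloud instances on demand.
ResumeProgram=/opt/slurm/scripts/cloud_resume.sh
SuspendProgram=/opt/slurm/scripts/cloud_suspend.sh
ResumeTimeout=600               # seconds a bursted node gets to boot and register
SuspendTime=300                 # idle seconds before a cloud node is torn down
SuspendExcNodes=node[001-064]   # never power down the on-prem nodes

# Cloud nodes are defined up front but stay powered off until ResumeProgram runs.
NodeName=cloud[001-032] CPUs=32 RealMemory=128000 State=CLOUD Weight=100

PartitionName=burst Nodes=cloud[001-032] MaxTime=24:00:00 State=UP
```

The resume and suspend scripts are where the cloud-provider API calls live; Slurm itself only tracks node power state and schedules jobs onto whatever registers.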
After years of managing production HPC infrastructure—from petaflop clusters in oil and gas to research computing environments—I noticed a gap in how SLURM knowledge is shared and learned.
Most documentation is either too theoretical or scattered across wikis and forums. SlurmQuest brings together practical, battle-tested knowledge: the scripts that actually work, the configurations that scale, and the troubleshooting techniques that save hours of debugging.
This platform is my contribution to the HPC community—a living lab where cluster engineers can learn from real-world experience and accelerate their journey from novice to expert.
Join the SlurmQuest community to share experiences, ask questions, and stay updated on new content and resources.