SlurmQuest is my way of sharing years of hands-on experience managing petaflop-scale clusters, optimizing workloads, and solving real infrastructure challenges in production environments.
Currently working on RCCL (ROCm Communication Collectives Library) for high-performance computing workloads on AMD GPUs. Focused on optimizing collective communication operations for distributed training and HPC applications.
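To make the collective-communication work concrete, here is a minimal sketch of the kind of operation RCCL provides: a single-process all-reduce across all local AMD GPUs through RCCL's NCCL-compatible API. The header path, buffer sizes, and build line are illustrative assumptions, not code from any RCCL project.

```cpp
// Minimal single-process all-reduce sketch using RCCL's NCCL-compatible API.
// Illustrative only; build with something like: hipcc allreduce_sketch.cpp -lrccl
#include <rccl/rccl.h>        // older ROCm installs expose this header as <rccl.h>
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

#define HIP_CHECK(cmd)  do { hipError_t e = (cmd); if (e != hipSuccess) { \
    std::fprintf(stderr, "HIP: %s\n", hipGetErrorString(e)); return 1; } } while (0)
#define RCCL_CHECK(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
    std::fprintf(stderr, "RCCL: %s\n", ncclGetErrorString(r)); return 1; } } while (0)

int main() {
    int ndev = 0;
    HIP_CHECK(hipGetDeviceCount(&ndev));

    const size_t count = 1 << 20;                 // 1M floats per GPU (arbitrary)
    std::vector<ncclComm_t>  comms(ndev);
    std::vector<hipStream_t> streams(ndev);
    std::vector<float*>      send(ndev), recv(ndev);

    // Allocate a buffer pair and a stream on every visible GPU.
    for (int i = 0; i < ndev; ++i) {
        HIP_CHECK(hipSetDevice(i));
        HIP_CHECK(hipMalloc(&send[i], count * sizeof(float)));
        HIP_CHECK(hipMalloc(&recv[i], count * sizeof(float)));
        HIP_CHECK(hipMemset(send[i], 0, count * sizeof(float)));
        HIP_CHECK(hipStreamCreate(&streams[i]));
    }

    // One communicator per local GPU; a NULL device list means GPUs 0..ndev-1.
    RCCL_CHECK(ncclCommInitAll(comms.data(), ndev, nullptr));

    // Group the per-GPU calls so the sum-reduction launches as one collective.
    RCCL_CHECK(ncclGroupStart());
    for (int i = 0; i < ndev; ++i)
        RCCL_CHECK(ncclAllReduce(send[i], recv[i], count, ncclFloat, ncclSum,
                                 comms[i], streams[i]));
    RCCL_CHECK(ncclGroupEnd());

    // Wait for completion, then release resources.
    for (int i = 0; i < ndev; ++i) {
        HIP_CHECK(hipSetDevice(i));
        HIP_CHECK(hipStreamSynchronize(streams[i]));
        RCCL_CHECK(ncclCommDestroy(comms[i]));
        HIP_CHECK(hipStreamDestroy(streams[i]));
        HIP_CHECK(hipFree(send[i]));
        HIP_CHECK(hipFree(recv[i]));
    }
    std::printf("all-reduce of %zu floats across %d GPUs complete\n", count, ndev);
    return 0;
}
```

A multi-node training job would instead exchange an ncclUniqueId (typically over MPI) and call ncclCommInitRank once per rank; the single-process form above just keeps the sketch self-contained.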
Managed multiple petaflop-scale clusters across global locations, handling compute, storage, and networking infrastructure.
Built and deployed hybrid cloud bursting solutions for dynamic workload scaling (see the configuration sketch after this list)
Implemented end-to-end observability stack for cluster health monitoring
Configured and optimized Lustre and BeeGFS parallel file systems at scale
Designed and provisioned research computing infrastructure for scientific workloads
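For the cloud-bursting item above, the excerpt below sketches the Slurm power-saving knobs such a setup leans on. Node names, sizes, timeouts, and script paths are placeholders for illustration, not values from any production cluster.

```
# slurm.conf excerpt (illustrative sketch; names, sizes, and paths are placeholders)

# Elastic/cloud scheduling hooks: slurmctld invokes these scripts to create
# and tear down cloud instances on demand.
ResumeProgram=/opt/slurm/scripts/cloud_resume.sh
SuspendProgram=/opt/slurm/scripts/cloud_suspend.sh
ResumeTimeout=600               # seconds a bursted node gets to boot and register
SuspendTime=300                 # idle seconds before a cloud node is torn down
SuspendExcNodes=node[001-064]   # never power down the on-prem nodes

# Cloud nodes are defined up front but stay powered off until ResumeProgram runs.
NodeName=cloud[001-032] CPUs=32 RealMemory=128000 State=CLOUD Weight=100

PartitionName=burst Nodes=cloud[001-032] MaxTime=24:00:00 State=UP
```

The resume and suspend scripts are where the cloud-provider API calls live; Slurm itself only tracks node power state and schedules jobs onto whatever registers.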
After years of managing production HPC infrastructure—from petaflop clusters in oil and gas to research computing environments—I noticed a gap in how SLURM knowledge is shared and learned.
Most documentation is either too theoretical or scattered across wikis and forums. SlurmQuest brings together practical, battle-tested knowledge: the scripts that actually work, the configurations that scale, and the troubleshooting techniques that save hours of debugging.
This platform is my contribution to the HPC community—a living lab where cluster engineers can learn from real-world experience and accelerate their journey from novice to expert.
Join the SlurmQuest community to share experiences, ask questions, and stay updated on new content and resources.