remote
Senior Platform Engineer - STN Inc
DevOps Engineer
About the role
The Senior Platform Engineer designs, builds, and operates a multi‑tenant orchestration and scheduling layer that turns raw GPU infrastructure into a reliable cloud service using Kubernetes, Slurm, and Run:ai.
Key Responsibilities
- Architect and implement the core orchestration layer for a multi‑tenant GPU‑as‑a‑Service platform, leveraging Kubernetes, Slurm, and Run:ai.
- Develop and maintain scheduling, provisioning, and lifecycle management services that expose GPU resources to customers as a seamless cloud offering.
- Collaborate with cross‑functional teams to integrate monitoring, logging, and security controls into the platform stack.
- Optimize performance, reliability, and cost efficiency of GPU workloads across large‑scale clusters.
- Mentor junior engineers and contribute to best‑practice documentation and automation frameworks.
Requirements
- 5+ years of experience building large‑scale, cloud‑native platforms, preferably with GPU workloads.
- Deep expertise in Kubernetes, Slurm, or a comparable orchestration or batch‑scheduling system; hands‑on experience with Run:ai is a strong plus.
- Proficiency in Python (or Go) for automation, tooling, and service development.
- Strong understanding of cloud infrastructure (AWS, GCP, or Azure) and containerization technologies.
- Demonstrated ability to design highly available, multi‑tenant systems and to troubleshoot complex distributed environments.