Remote
Staff DevOps Engineer - Runware
DevOps Engineer
Lead the design, deployment, and scaling of a high‑performance, GPU‑enabled infrastructure for real‑time AI inference, leveraging Kubernetes, AWS, and advanced CI/CD pipelines to ensure reliability and rapid model rollout.
About the role
Key Responsibilities
- Architect and maintain a globally distributed, GPU‑centric Kubernetes cluster that supports thousands of concurrent inference requests.
- Design and implement CI/CD pipelines using GitHub Actions, Terraform, and Helm to automate model deployment and infrastructure updates.
- Collaborate with ML teams to optimize inference latency, throughput, and cost across AWS services (EKS, S3, Lambda).
- Implement observability, logging, and alerting with Prometheus, Grafana, and CloudWatch to meet a 99.9% uptime target.
- Drive capacity planning, auto‑scaling, and cost‑optimization strategies for GPU resources.
- Mentor junior engineers and establish best practices for infrastructure as code and security compliance.
Requirements
- 10+ years of experience in DevOps or Site Reliability Engineering, with a focus on AI/ML workloads.
- Proficiency in Kubernetes, Docker, and AWS (EKS, EC2, S3, Lambda).
- Hands‑on experience with Terraform, Helm, and CI/CD tooling.
- Strong scripting skills in Python or Bash and familiarity with monitoring tools.
- Excellent problem‑solving skills and a passion for building scalable, high‑performance systems.
Skills
Kubernetes, AWS, Docker, CI/CD, Terraform