Remote
Senior Site Reliability Engineer - Nebius
Site Reliability Engineer
The Senior Site Reliability Engineer designs, scales, and automates highly available AI cloud infrastructure, using Kubernetes, Docker, Terraform, and monitoring tools to keep large‑scale GPU workloads performant and reliable.
About the role
Key Responsibilities
- Design, implement, and operate highly available, scalable Kubernetes clusters for AI/ML workloads across multi‑cloud environments.
- Automate infrastructure provisioning and configuration management using Terraform and CI/CD pipelines.
- Develop observability solutions with Prometheus, Grafana, and custom Python scripts to monitor system health, latency, and resource utilization.
- Collaborate with development and data science teams to optimize GPU orchestration, storage, and networking for model training and inference.
- Lead incident response, root‑cause analysis, and post‑mortem processes to continuously improve reliability.
Requirements
- 5+ years of SRE or DevOps experience in cloud‑native environments, preferably with AI/ML workloads.
- Deep expertise in Kubernetes, Docker, and Linux system administration.
- Proficiency in infrastructure as code (Terraform) and scripting/automation using Python.
- Strong background in monitoring, alerting, and performance tuning with Prometheus/Grafana.
- Experience with multi‑cloud platforms (AWS, GCP, Azure) and networking concepts for high‑throughput data pipelines.
Skills
Kubernetes, Docker, Terraform, Prometheus, Python, Linux