onsite
Senior Software Engineer, Site Reliability Engineering GenAI
AI Engineer
Senior site reliability engineer driving reliability and automation for GenAI platforms, with a focus on capacity planning, fault‑tolerant distributed systems, and cloud infrastructure using Python, Go, Kubernetes, and AWS.
About the role
Key Responsibilities
- Design, implement, and maintain highly available, fault‑tolerant services that power GenAI workloads.
- Develop automation for capacity planning, scaling, and performance tuning across multi‑cloud environments.
- Build and operate monitoring, alerting, and observability pipelines using Prometheus, Grafana, and custom metrics.
- Collaborate with development teams to embed reliability best practices into CI/CD pipelines and infrastructure as code (Terraform, CloudFormation).
- Lead incident response, root‑cause analysis, and post‑mortem processes to continuously improve system resilience.
Requirements
- 5+ years of experience in site reliability engineering or production engineering for large‑scale distributed systems.
- Strong programming skills in Python and Go, with a solid understanding of networking, concurrency, and systems design.
- Hands‑on experience with container orchestration (Kubernetes) and cloud platforms such as AWS.
- Proficiency in infrastructure‑as‑code tools (Terraform, CloudFormation) and CI/CD frameworks (Jenkins, GitHub Actions, Argo CD).
- Demonstrated ability to build observability solutions (Prometheus, Grafana) and conduct capacity planning for AI/ML workloads.
Skills
Python, Go, Kubernetes, AWS, Terraform, Prometheus, CI/CD