Remote
Senior SRE Engineer - AI - Ally Financial
Site Reliability Engineer
Lead the reliability and automation of AI-driven services, ensuring high availability, performance, and security across cloud and on‑prem environments using Python, Kubernetes, Terraform, and AWS.
About the role
Key Responsibilities
- Design, build, and maintain highly available AI infrastructure on AWS, ensuring 99.99% uptime and rapid incident response.
- Implement and evolve CI/CD pipelines with Terraform, GitHub Actions, and Kubernetes to automate deployments and rollbacks.
- Develop observability solutions (metrics, logs, and traces) using Prometheus, Grafana, and OpenTelemetry to detect and resolve performance bottlenecks.
- Collaborate with data science and ML teams to optimize model serving, scaling, and resource allocation.
- Lead post‑mortem analyses, root‑cause investigations, and continuous improvement initiatives to reduce MTTR.
Requirements
- 5+ years of SRE or DevOps experience in a production environment.
- Strong proficiency in Python, Kubernetes, and Terraform.
- Hands‑on experience with AWS services (EKS, ECS, S3, CloudWatch).
- Deep understanding of monitoring, alerting, and incident management best practices.
- Excellent communication skills and a collaborative mindset.
Skills
Python, Kubernetes, Terraform, AWS