onsite
Senior Site Reliability Engineer - AcquireX
Site Reliability Engineer
Senior Site Reliability Engineer leading end‑to‑end reliability, automation, and observability for mission‑critical production systems using Kubernetes, Prometheus, Grafana, CI/CD pipelines, AWS, and Terraform.
About the role
Key Responsibilities
- Own end‑to‑end reliability of production services, ensuring high availability, scalability, and performance.
- Design, implement, and maintain CI/CD pipelines, infrastructure as code, and automated deployment workflows.
- Build and maintain the observability stack (Prometheus, Grafana, Loki) for real‑time monitoring, alerting, and incident response.
- Lead incident management, root‑cause analysis, and post‑mortem documentation to drive continuous improvement.
- Collaborate with development teams to embed SRE best practices into code reviews, architecture decisions, and release processes.
Requirements
- 10+ years of experience in SRE/DevOps roles with a proven track record in large‑scale distributed systems.
- Deep expertise in Kubernetes, container orchestration, and cloud platforms (AWS preferred).
- Hands‑on experience with Prometheus, Grafana, Loki, and other observability tools.
- Strong scripting skills (Python, Bash) and proficiency with IaC tools such as Terraform or CloudFormation.
- Excellent communication and problem‑solving skills, and readiness to take part in on‑call rotations for 24/7 production support.
Skills
Kubernetes, Prometheus, Grafana, CI/CD, AWS, Terraform