Remote
Agentic Reliability Engineer - SAP
Software Engineer
Drive end‑to‑end reliability for cloud‑native services by automating incident response, monitoring, and capacity planning with Kubernetes, AWS, Python, and Terraform, keeping those services highly available and performant.
About the role
Key Responsibilities
- Design, implement, and maintain observability, alerting, and incident response workflows for multi‑region cloud services.
- Automate deployment, scaling, and configuration of Kubernetes clusters and associated infrastructure using Terraform and CI/CD pipelines.
- Collaborate with development teams to embed reliability best practices into the software development lifecycle.
- Analyze post‑incident reports and root cause analyses, and implement preventive measures to reduce MTTR.
- Participate in on‑call rotations, providing rapid response to production incidents and coordinating cross‑functional resolution.
Requirements
- 3+ years of experience in Site Reliability Engineering or DevOps roles.
- Proficiency with Kubernetes, AWS services (EKS, CloudWatch, Lambda), and infrastructure as code (Terraform).
- Strong scripting skills in Python or Bash for automation and tooling.
- Experience with monitoring/alerting platforms such as Prometheus, Grafana, or Datadog.
- Excellent problem‑solving skills and a proactive, collaborative mindset.
Skills
Kubernetes, AWS, Python, Terraform