Remote
Site Reliability Engineer - S. A. Solution
Senior Site Reliability Engineer responsible for troubleshooting, incident response, and performance optimization across Kubernetes and AWS environments, ensuring high platform reliability and user satisfaction.
About the role
Key Responsibilities
- Act as the primary technical point of contact for user‑reported platform issues, triaging and resolving incidents within defined SLAs.
- Investigate, debug, and remediate problems across Kubernetes clusters, AWS infrastructure, and application services.
- Collaborate with engineering, product, and customer‑facing teams to root‑cause failures and implement preventive measures.
- Design and maintain monitoring, alerting, and logging solutions to detect and mitigate reliability risks.
- Participate in on‑call rotations, post‑mortem analysis, and continuous improvement initiatives.
Requirements
- Proven experience as an SRE or DevOps engineer in a cloud‑native environment.
- Strong knowledge of container orchestration with Kubernetes and of AWS services (EC2, EKS, RDS, CloudWatch).
- Hands‑on expertise with monitoring tools (Prometheus, Grafana, Datadog) and incident management.
- Excellent troubleshooting, communication, and collaboration skills.
- Experience with CI/CD pipelines, scripting (Python, Bash), and infrastructure as code (Terraform, CloudFormation) is a plus.