onsite
Senior Site Reliability Engineer - Happy Staffers
Lead platform reliability for a cloud‑native stack: troubleshoot Kubernetes and AWS infrastructure, drive incident response, and collaborate with engineering and product teams to maintain high availability and performance.
About the role
Key Responsibilities
- Act as the primary technical point of contact for user‑reported platform issues, triaging and resolving them within defined SLAs.
- Investigate, debug, and remediate incidents across Kubernetes clusters, AWS services, and application components.
- Collaborate with engineering, product, and customer‑facing teams to identify root causes and implement preventive measures.
- Design and maintain monitoring, alerting, and logging solutions to ensure proactive detection of reliability problems.
- Participate in on‑call rotations, post‑mortem analysis, and continuous improvement of SRE practices.
Requirements
- 5+ years of experience in Site Reliability Engineering or DevOps roles.
- Deep expertise with Kubernetes, AWS, and cloud‑native application stacks.
- Strong troubleshooting skills and familiarity with monitoring tools (Prometheus, Grafana, CloudWatch).
- Experience with incident response, root‑cause analysis, and post‑mortem documentation.
- Excellent communication skills and the ability to work cross‑functionally.