onsite
Lead Engineer - Site Reliability Engineering - CitiusTech
Software Engineer
Lead Engineer responsible for ensuring high availability and reliability of AWS‑based and enterprise applications, managing P1/P2 incidents and on‑call duties, and driving continuous improvement through root cause analysis (RCA) and performance tuning.
About the role
Key Responsibilities
- Own and resolve P1/P2 incidents end‑to‑end, including escalation and on‑call response for AWS cloud and enterprise applications.
- Provide L4 production support, triaging issues across application, middleware, and infrastructure layers.
- Troubleshoot and debug distributed systems in depth, and tune them for performance and stability.
- Conduct root cause analysis and post‑mortems, translating findings into actionable improvements.
- Maintain SLO/SLA compliance and improve the supporting monitoring and alerting frameworks.
Requirements
- 6–12 years of experience in production support or SRE roles, with strong AWS expertise.
- Proven ability to manage high‑impact incidents and lead on‑call rotations.
- Deep knowledge of distributed systems, performance tuning, and reliability engineering practices.
- Excellent communication skills for cross‑team collaboration and incident reporting.