remote
Site Reliability Engineer (SRE) Specialist - NTT DATA
Experienced Site Reliability Engineer with 8+ years of experience managing observability, defining SLIs/SLOs, and building automated alerting pipelines with New Relic in Linux environments.
About the role
Key Responsibilities
- Own the end‑to‑end observability stack, including New Relic APM, infrastructure monitoring, dashboards, and alerting.
- Define, implement, and continuously refine Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to drive reliability goals.
- Design and maintain automated alerting and incident response workflows, ensuring rapid detection and resolution of production issues.
- Collaborate with development and operations teams to embed reliability best practices into CI/CD pipelines.
- Develop and maintain automation scripts and tooling for configuration management, scaling, and performance tuning on Linux platforms.
Requirements
- 8+ years of hands‑on experience in Site Reliability Engineering or related roles.
- Deep expertise with New Relic for application performance monitoring and infrastructure observability.
- Proven ability to design and manage SLIs/SLOs and associated alerting strategies.
- Strong scripting/automation skills (e.g., Bash, Python) on Linux systems.
- Experience with incident management processes and a track record of improving system reliability.