onsite
Lead Site Reliability Engineer (SRE) - AWS/Linux/Windows - Cognizant
Site Reliability Engineer
Lead a compute reliability team, driving operational excellence across Linux, Windows, and AWS environments using SRE principles, automation, and performance tuning.
About the role
Key Responsibilities
- Lead and mentor a cross‑functional compute reliability team covering Linux/Unix, Windows, and AWS platforms.
- Design, implement, and maintain automated monitoring, alerting, and incident‑response workflows to reduce mean time to recovery (MTTR).
- Apply Site Reliability Engineering practices to improve system availability, performance, and scalability.
- Collaborate with development and infrastructure teams to define service level objectives (SLOs) and service level indicators (SLIs).
- Drive continuous improvement by identifying operational toil and implementing automation, scripting, and infrastructure‑as‑code solutions.
Requirements
- 10+ years of experience in systems administration or reliability engineering, with deep expertise in Linux (preferred) and solid working knowledge of Windows.
- Extensive hands‑on experience managing production workloads on AWS, including EC2, S3, RDS, and networking services.
- Proven track record of implementing SRE methodologies, incident management, and performance tuning at scale.
- Strong scripting/automation skills (e.g., Python, Bash, PowerShell) and familiarity with infrastructure‑as‑code tools.
- Excellent communication and leadership abilities to guide teams and influence stakeholders.