onsite
SRE Engineer / Site Reliability Engineer Specialist - NTT Data Americas, Inc.
Site Reliability Engineer
Senior Site Reliability Engineer responsible for designing, deploying, and maintaining highly available cloud-native services using Kubernetes, Docker, and CI/CD pipelines. The role uses AWS, monitoring tools, and Python scripting to ensure reliability, performance, and rapid incident response.
About the role
Key Responsibilities
- Design, implement, and operate scalable, highly available services on Kubernetes clusters in AWS.
- Build and maintain CI/CD pipelines to automate application delivery and infrastructure changes.
- Implement monitoring, alerting, and logging solutions (Prometheus, Grafana, ELK) to detect and resolve incidents quickly.
- Collaborate with development teams to embed reliability best practices into the software development lifecycle.
- Conduct post‑mortems and root cause analyses, and drive continuous improvement initiatives to reduce mean time to recovery (MTTR).
Requirements
- 5+ years of experience in Site Reliability Engineering or DevOps roles.
- Proficient with Kubernetes, Docker, and cloud platforms (AWS preferred).
- Strong scripting skills in Python or Bash for automation.
- Hands‑on experience with CI/CD tools (Jenkins, GitHub Actions, ArgoCD).
- Excellent problem‑solving skills and ability to work in a fast‑paced environment.
Skills
Kubernetes, Docker, CI/CD, AWS, Python