On-site
Senior Site Reliability Engineer - Optimum
Site Reliability Engineer
Lead the design, implementation, and operation of highly available, scalable connectivity services using Kubernetes, AWS, and Docker, while driving automation, monitoring, and incident response to ensure world‑class uptime and performance.
About the role
Key Responsibilities
- Architect, deploy, and maintain large‑scale, highly available services on Kubernetes and AWS infrastructure.
- Implement CI/CD pipelines, infrastructure as code, and automated testing to accelerate feature delivery.
- Design and maintain robust monitoring, alerting, and logging solutions to detect and resolve incidents proactively.
- Lead incident response, post‑mortem analysis, and continuous improvement initiatives to enhance reliability.
- Collaborate with development, security, and product teams to embed reliability best practices across the software lifecycle.
Requirements
- 5+ years of experience in site reliability or DevOps roles, with a strong focus on cloud and container orchestration.
- Proficiency with Kubernetes, AWS services (EC2, EKS, S3, CloudWatch), and Docker.
- Hands‑on experience with CI/CD tools (GitHub Actions, Jenkins, Argo CD) and IaC (Terraform, CloudFormation).
- Deep understanding of monitoring, alerting, and log aggregation tools (Prometheus, Grafana, ELK/EFK).
- Excellent problem‑solving skills, strong communication, and a passion for continuous learning and automation.
Skills
Kubernetes, AWS, Docker