onsite
Senior Site Reliability Engineer (SRE) - 1global
Site Reliability Engineer
Senior Site Reliability Engineer responsible for designing, deploying, and maintaining highly available cloud infrastructure using Kubernetes, Docker, and AWS. Leverages monitoring tools like Prometheus and Grafana, automates with Terraform, and writes Python scripts to ensure system reliability and performance.
About the role
Key Responsibilities
- Design, implement, and manage scalable Kubernetes clusters on AWS, ensuring high availability and fault tolerance.
- Develop and maintain CI/CD pipelines using Terraform, GitHub Actions, and Docker to automate deployments and infrastructure changes.
- Implement comprehensive monitoring and alerting with Prometheus, Grafana, and CloudWatch, and troubleshoot production incidents to minimize downtime.
- Collaborate with development teams to enforce best practices for observability, logging, and security across services.
- Conduct post‑mortem analyses, root cause investigations, and continuous improvement initiatives to enhance system reliability.
Requirements
- 5+ years of experience in site reliability engineering or DevOps roles.
- Proficient with Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
- Strong scripting skills in Python and experience with IaC tools like Terraform.
- Hands‑on experience with Prometheus, Grafana, and the ELK stack for monitoring and logging.
- Excellent problem‑solving skills and ability to work in a fast‑paced, collaborative environment.
Skills
Kubernetes, Docker, Prometheus, Grafana, AWS, Terraform, Python