onsite
Site Reliability Engineer - Artificial Labs
Lead the design, deployment, and maintenance of scalable, highly available services on Kubernetes and AWS, using Docker, Prometheus, Grafana, Terraform, and CI/CD pipelines to ensure reliability, performance, and rapid incident response.
About the role
Key Responsibilities
- Design, implement, and maintain production-grade Kubernetes clusters and Docker-based microservices across AWS environments.
- Build and manage an observability stack with Prometheus, Grafana, and alerting to detect and resolve incidents proactively.
- Automate infrastructure provisioning and configuration using Terraform, ensuring repeatable, version-controlled deployments.
- Collaborate with development teams to embed SRE best practices into CI/CD pipelines and release processes.
- Lead root‑cause analysis, post‑mortem documentation, and continuous improvement initiatives to reduce MTTR and improve system resilience.
Requirements
- 5+ years of experience in site reliability engineering or DevOps roles.
- Strong proficiency with Kubernetes, Docker, and cloud-native tooling.
- Hands‑on experience with Prometheus, Grafana, and alerting systems.
- Expertise in AWS services (EKS, EC2, S3, CloudWatch) and IaC with Terraform.
- Solid scripting skills (Python, Bash) and familiarity with CI/CD tools (GitHub Actions, Jenkins, ArgoCD).
Skills
Kubernetes, Docker, Prometheus, Grafana, AWS, Terraform, CI/CD