Onsite
Staff Site Reliability Engineer - UKG
Site Reliability Engineer
Lead the design, implementation, and operation of highly available, scalable cloud services using Kubernetes, Docker, and AWS, while driving automation, observability, and incident response excellence.
About the role
Key Responsibilities
- Architect, deploy, and maintain production‑grade Kubernetes clusters and containerized workloads across AWS environments.
- Implement CI/CD pipelines, infrastructure as code (Terraform), and automated configuration management to accelerate delivery and reduce toil.
- Design and enforce robust monitoring, alerting, and logging solutions (Prometheus, Grafana, ELK) to ensure high availability and rapid incident resolution.
- Lead incident investigations, post‑mortems, and continuous improvement initiatives to enhance system reliability and resilience.
- Collaborate with development, security, and product teams to embed SRE best practices into the software development lifecycle.
Requirements
- 10+ years of experience in production site reliability or DevOps roles, with a strong focus on cloud-native technologies.
- Deep expertise in Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
- Proficiency with infrastructure as code tools such as Terraform and configuration management tools (Ansible, Chef).
- Hands‑on experience with monitoring, alerting, and log aggregation platforms (Prometheus, Grafana, ELK).
- Strong analytical, problem‑solving, and communication skills, with a proven track record of driving reliability improvements.
Skills
Kubernetes, Docker, AWS, Terraform