Remote
Senior Site Reliability Engineer - Adobe
Site Reliability Engineer
Senior Site Reliability Engineer responsible for building and maintaining scalable, highly available HTTP APIs and infrastructure for a creative AI platform. The role uses Kubernetes, Docker, CI/CD pipelines, AWS, and observability tools to ensure reliability and performance.
About the role
Key Responsibilities
- Design, implement, and operate highly available Kubernetes clusters that host the Graph platform’s HTTP APIs and microservices.
- Build and maintain CI/CD pipelines using GitHub Actions, Terraform, and Docker to automate deployments across multiple environments.
- Implement observability with Prometheus, Grafana, and Loki, creating dashboards, alerts, and incident response playbooks.
- Collaborate with backend and frontend teams to optimize API performance, reduce latency, and enforce security best practices.
- Lead capacity planning, load testing, and cost optimization initiatives on AWS.
- Mentor junior engineers and contribute to the SRE knowledge base and tooling improvements.
Requirements
- 5+ years of experience in site reliability or DevOps roles, with a strong focus on cloud-native technologies.
- Proficient with Kubernetes, Docker, and Helm; hands‑on experience with Terraform or CloudFormation.
- Deep knowledge of AWS services (EKS, EC2, S3, CloudWatch, IAM) and experience building scalable, secure APIs.
- Strong scripting skills in Python or Go, and familiarity with CI/CD tooling.
- Excellent problem‑solving, communication, and collaboration skills in a fast‑paced, cross‑functional environment.
Skills
Kubernetes, Docker, CI/CD, AWS, Python, Terraform, Prometheus