Remote
Sr Site Reliability Engineer US Federal - Workday
Site Reliability Engineer
Senior Site Reliability Engineer driving reliability, scalability, and automation for federal cloud services using AWS, Kubernetes, Terraform, and Python, while leading incident response and continuous improvement initiatives.
About the role
Key Responsibilities
- Design, implement, and maintain highly available, secure, and scalable infrastructure on AWS for federal workloads.
- Automate deployment pipelines and configuration management using Terraform, Kubernetes, and CI/CD tools.
- Develop and maintain monitoring, alerting, and logging solutions that support a 99.9% uptime target and rapid incident resolution.
- Lead incident response, root cause analysis, and post‑mortem documentation to drive continuous improvement.
- Collaborate with development, security, and compliance teams to enforce best practices and regulatory requirements.
Requirements
- 5+ years of SRE or DevOps experience in a cloud‑native environment.
- Proficiency with AWS services (EC2, RDS, S3, IAM, CloudWatch) and Kubernetes orchestration.
- Strong scripting skills in Python and experience with Terraform for IaC.
- Hands‑on experience with monitoring tools (Prometheus, Grafana, Datadog) and log aggregation.
- Excellent problem‑solving, communication, and collaboration skills in a fast‑paced environment subject to federal compliance requirements.
Skills
AWS, Kubernetes, Terraform, Python