Senior DevOps Site Reliability Engineer (Remote Eligible) - Southwest Power Pool
Site Reliability Engineer
Senior DevOps SRE who leads reliability initiatives for large‑scale infrastructure: automating deployments, managing cloud services, and driving observability with Kubernetes, Terraform, Python, AWS, CI/CD pipelines, and Linux.
About the role
Key Responsibilities
- Design, implement, and maintain highly available, scalable infrastructure on AWS using IaC tools such as Terraform.
- Develop and manage container orchestration platforms (Kubernetes) and associated CI/CD pipelines to support rapid, reliable releases.
- Automate operational tasks and incident response workflows with Python scripting and configuration management.
- Implement monitoring, alerting, and observability solutions (e.g., Prometheus, Grafana) to ensure service reliability and performance.
- Collaborate with development and product teams to embed reliability best practices into the software development lifecycle.
Requirements
- 5+ years of hands‑on experience in DevOps or Site Reliability Engineering roles.
- Strong expertise with Kubernetes, Terraform, and AWS services.
- Proficiency in Python for automation and tooling development.
- Deep understanding of Linux systems, networking, and CI/CD concepts.
- Demonstrated ability to troubleshoot complex production issues and improve system reliability.
Skills
Kubernetes, Terraform, Python, AWS, CI/CD, Linux