remote
Senior Site Reliability Engineer - Red Hat
Site Reliability Engineer
Lead the design, deployment, and operation of a multi‑tenant Kubernetes platform on AWS, blending software engineering with production reliability to deliver a managed OpenShift service.
About the role
Key Responsibilities
- Architect, build, and maintain ROSA HCP (Red Hat OpenShift Service on AWS with hosted control planes) control plane infrastructure on AWS, ensuring high availability, scalability, and security.
- Write production‑grade code in Go (and Python) to extend platform capabilities, contribute to upstream OpenShift and Kubernetes projects, and automate operational tasks.
- Own end‑to‑end reliability: design observability, monitoring, alerting, and incident response processes; participate in on‑call rotations.
- Implement and evolve infrastructure as code (IaC) with Terraform, GitOps workflows, and CI/CD pipelines to streamline releases and rollbacks.
- Collaborate with cross‑functional teams (product, security, support) to define SLAs, plan capacity, and optimize costs.
Requirements
- 5+ years of SRE or DevOps experience in cloud‑native environments, with deep knowledge of Kubernetes and OpenShift.
- Proficiency with AWS services (EKS, VPC, IAM, CloudWatch) and infrastructure automation tools.
- Strong coding skills in Go (or similar) and experience with CI/CD, GitOps, and Terraform.
- Hands‑on experience with monitoring/observability stacks (Prometheus, Grafana, Loki) and incident management.
- Excellent communication, problem‑solving, and collaboration abilities in a fast‑paced, distributed team.
Skills
Kubernetes, AWS, Go, Terraform