Onsite
Principal Site Reliability Engineer - Oracle
Site Reliability Engineer
Lead the design, architecture, and operation of highly reliable, scalable infrastructure using Kubernetes, Terraform, and CI/CD pipelines, while driving automation, monitoring, and incident response to ensure optimal service performance.
About the role
Key Responsibilities
- Design, architect, and maintain highly available, scalable infrastructure for mission‑critical services using Kubernetes, Terraform, and cloud platforms.
- Collaborate with software engineering teams to embed reliability and scalability into application development lifecycles.
- Lead incident response, root‑cause analysis, and post‑mortem activities to continuously improve system resilience.
- Develop and enforce monitoring, alerting, and health‑check frameworks that provide actionable insights into system performance.
- Identify automation opportunities, implement CI/CD pipelines, and streamline deployment processes to reduce manual effort and error rates.
- Produce comprehensive health and performance reports for stakeholders, proactively communicating potential impacts of changes.
Requirements
- 10+ years of experience in site reliability engineering or related roles, with a proven track record of managing large‑scale distributed systems.
- Deep expertise in Kubernetes, container orchestration, and infrastructure as code (Terraform, CloudFormation).
- Strong background in CI/CD tooling (GitHub Actions, Jenkins, Argo CD) and automated deployment pipelines.
- Hands‑on experience with monitoring and observability stacks (Prometheus, Grafana, ELK, Datadog).
- Excellent problem‑solving skills, strong communication, and the ability to mentor and influence cross‑functional teams.
Skills
Kubernetes, Terraform, CI/CD