onsite

Principal Site Reliability Engineer - Oracle

Site Reliability Engineer

Lead the design, architecture, and operation of highly reliable, scalable infrastructure using Kubernetes, Terraform, and CI/CD pipelines, while driving automation, monitoring, and incident response to ensure optimal service performance.

About the role

Key Responsibilities

Design, architect, and maintain highly available, scalable infrastructure for mission‑critical services using Kubernetes, Terraform, and cloud platforms.
Collaborate with software engineering teams to embed reliability and scalability into application development lifecycles.
Lead incident response, root‑cause analysis, and post‑mortem activities to continuously improve system resilience.
Develop and enforce monitoring, alerting, and health‑check frameworks that provide actionable insights into system performance.
Identify automation opportunities, implement CI/CD pipelines, and streamline deployment processes to reduce manual effort and error rates.
Produce comprehensive health and performance reports for stakeholders, proactively communicating potential impacts of changes.

Requirements

10+ years of experience in site reliability engineering or related roles, with a proven track record of managing large‑scale distributed systems.
Deep expertise in Kubernetes, container orchestration, and infrastructure as code (Terraform, CloudFormation).
Strong background in CI/CD tooling (GitHub Actions, Jenkins, ArgoCD) and automated deployment pipelines.
Hands‑on experience with monitoring and observability stacks (Prometheus, Grafana, ELK, Datadog).
Excellent problem‑solving skills, strong communication, and the ability to mentor and influence cross‑functional teams.

Skills

Kubernetes, Terraform, CI/CD

Company: Oracle

Department: Engineering

Location: Nashville, Tennessee, United States

Experience: 7+ years

Tenure: Full-time

Level: Lead

Posted June 27, 2026