Onsite
Senior Site Reliability Engineer - Database Services - Toyota North America
Site Reliability Engineer
Own the reliability and performance of high‑availability database services, and lead automation, monitoring, and incident response across cloud and on‑prem environments using Kubernetes, Prometheus, Grafana, AWS, Terraform, and Python.
About the role
Key Responsibilities
- Design, implement, and maintain highly available database clusters (SQL/NoSQL) across hybrid cloud environments.
- Develop and maintain CI/CD pipelines, infrastructure as code (Terraform), and automated deployment workflows.
- Implement comprehensive monitoring, alerting, and observability using Prometheus, Grafana, and custom dashboards.
- Lead incident response, root‑cause analysis, and post‑mortem documentation to improve system resilience.
- Collaborate with development, security, and operations teams to enforce best practices and optimize performance.
Requirements
- 5+ years of SRE or database operations experience in production environments.
- Proficiency with Kubernetes, Helm, and container orchestration at scale.
- Strong scripting skills in Python and experience with Terraform or similar IaC tools.
- Hands‑on experience with AWS services (RDS, Aurora, EC2, EKS) and on‑prem database technologies.
- Excellent problem‑solving skills, strong communication, and a proactive, collaborative mindset.
Skills
Kubernetes, Prometheus, Grafana, AWS, Terraform, Python