onsite
Site Reliability Engineer, Diagnostics - Tesla
Seasoned Site Reliability Engineer needed to architect, scale, and maintain next‑generation diagnostics services for a growing fleet, using containerization, public cloud platforms, and cloud‑native tooling to ensure high availability and keep capacity ahead of demand.
About the role
Key Responsibilities
- Plan and analyze capacity for diagnostics services, proactively resizing and migrating infrastructure to meet demand.
- Lead infrastructure change management, tuning and reshaping production environments for optimal performance.
- Collaborate with software engineers to identify, troubleshoot, and resolve production incidents, ensuring minimal impact.
- Design, validate, and exercise failover and disaster recovery plans, implementing graceful degradation policies.
- Maintain and improve monitoring, alerting, and logging pipelines for real‑time visibility.
Requirements
- 5+ years of SRE experience in a fast‑moving, high‑scale environment.
- Deep expertise with containers and container orchestration (Docker, Kubernetes) and public cloud platforms (AWS, GCP, Azure).
- Strong background in cloud‑native application design, CI/CD, and infrastructure as code.
- Proven ability to perform capacity planning, performance tuning, and disaster recovery.
- Excellent communication skills and a collaborative mindset.
Skills
Python, Bash, Kubernetes, Docker, Linux, electrical engineering