onsite
Senior Site Reliability Engineer - Stack AV
Site Reliability Engineer
Lead reliability initiatives for AI‑driven autonomous systems, designing scalable infrastructure, automating deployments, and ensuring high availability using Kubernetes, AWS, Terraform, and modern observability tools.
About the role
Key Responsibilities
- Design, implement, and operate highly available, scalable infrastructure for AI‑powered autonomous trucking solutions.
- Develop and maintain infrastructure-as-code (IaC) pipelines using Terraform and CI/CD tools to automate provisioning and releases.
- Manage Kubernetes clusters on AWS, ensuring performance, security, and cost‑efficiency.
- Build robust monitoring, alerting, and incident‑response frameworks with tools such as Prometheus, Grafana, and PagerDuty.
- Collaborate with software, data science, and robotics teams to embed reliability best practices into the development lifecycle.
Requirements
- 5+ years of SRE or DevOps experience in cloud environments, preferably AWS.
- Strong proficiency in Kubernetes orchestration and containerization.
- Hands‑on experience with Terraform, plus Python or Go, for automation and tooling.
- Deep understanding of CI/CD pipelines, monitoring, logging, and incident management.
- Track record of improving system reliability, performance, and scalability in complex, real‑time applications.
Skills
Kubernetes, AWS, Terraform, Python, Go, CI/CD