onsite
Site Reliability Engineer - Stack AV
We seek a Site Reliability Engineer to design, implement, and maintain highly available cloud infrastructure for AI‑driven autonomous trucking solutions, leveraging Kubernetes, AWS, and automation tools.
About the role
Key Responsibilities
- Design, deploy, and operate scalable Kubernetes clusters on AWS to support AI and robotics workloads.
- Develop and maintain infrastructure‑as‑code using Terraform and related automation frameworks.
- Implement robust CI/CD pipelines for continuous delivery of services and updates.
- Monitor system performance, troubleshoot incidents, and drive root‑cause analysis to improve reliability.
- Collaborate with software, data science, and hardware teams to ensure seamless integration of autonomous system components.
Requirements
- 3+ years of experience in site reliability or DevOps roles, preferably in cloud‑native environments.
- Proficiency in programming/scripting with Python and Go.
- Strong hands‑on experience with Kubernetes, Docker, and AWS services (EKS, EC2, S3, etc.).
- Expertise in infrastructure‑as‑code tools such as Terraform, along with experience in configuration management.
- Solid understanding of Linux systems, networking, and monitoring tools (Prometheus, Grafana, ELK).
Skills
Python, Go, Kubernetes, AWS, Terraform, CI/CD, Linux