onsite
Senior Manager, Reliability Engineering - Anduril
Software Engineer
Lead a high‑performing reliability engineering team to design, build, and operate resilient, AI‑driven defense systems using cloud, container, and observability technologies.
About the role
Key Responsibilities
- Lead the design and implementation of reliability strategies for large‑scale, AI‑powered defense platforms.
- Oversee incident response, post‑mortem processes, and continuous improvement of system resilience.
- Drive automation of deployment, scaling, and monitoring pipelines across cloud and on‑prem environments.
- Collaborate with product, security, and operations teams to embed reliability best practices into the development lifecycle.
- Mentor and grow a team of reliability engineers, fostering a culture of ownership and rapid learning.
Requirements
- 10+ years of experience in reliability or site‑reliability engineering, including 3+ years in a leadership role.
- Deep expertise in cloud platforms (AWS, GCP, or Azure), container orchestration (Kubernetes), and CI/CD tooling.
- Proven track record of building observability stacks (Prometheus, Grafana, ELK) and incident‑management frameworks.
- Strong understanding of the reliability challenges of AI/ML systems and of real‑time data processing.
- Excellent communication, mentorship, and cross‑functional collaboration skills.