Onsite
Senior Site Reliability Engineer - Anduril
Site Reliability Engineer
Lead the design, deployment, and operation of highly available, secure, and scalable infrastructure for AI‑powered defense systems, leveraging Kubernetes, AWS, and automation tools to ensure continuous delivery and resilience.
About the role
Key Responsibilities
- Architect, deploy, and maintain production‑grade Kubernetes clusters and associated services on AWS, ensuring high availability and fault tolerance.
- Implement and manage CI/CD pipelines, infrastructure as code (Terraform), and automated testing to accelerate feature delivery while maintaining stability.
- Monitor system health, performance, and security using observability tools; respond to incidents, conduct post‑mortems, and drive continuous improvement.
- Collaborate with software, security, and operations teams to define and enforce best practices for scalability, reliability, and compliance.
- Lead capacity planning, cost optimization, and disaster recovery strategies for mission‑critical workloads.
Requirements
- 5+ years of experience in site reliability engineering or a related role, with a strong background in cloud-native technologies.
- Proficiency in Kubernetes, AWS services (EKS, EC2, S3, RDS), and Terraform for infrastructure automation.
- Solid scripting skills in Python and Bash; experience with CI/CD tools such as GitHub Actions, Jenkins, or ArgoCD.
- Deep understanding of networking, security, and observability concepts and tooling (Prometheus, Grafana, ELK).
- Excellent problem‑solving abilities, strong communication skills, and a proactive, collaborative mindset.
Skills
Kubernetes, AWS, Python, Terraform, CI/CD