onsite

Senior Site Reliability Engineer - Anduril

Site Reliability Engineer

Lead the design, deployment, and operation of highly available, secure, and scalable infrastructure for AI‑powered defense systems, leveraging Kubernetes, AWS, and automation tools to ensure continuous delivery and resilience.

About the role

Key Responsibilities

Architect, deploy, and maintain production‑grade Kubernetes clusters and associated services on AWS, ensuring high availability and fault tolerance.
Implement and manage CI/CD pipelines, infrastructure as code (Terraform), and automated testing to accelerate feature delivery while maintaining stability.
Monitor system health, performance, and security using observability tools; respond to incidents, conduct post‑mortems, and drive continuous improvement.
Collaborate with software, security, and operations teams to define and enforce best practices for scalability, reliability, and compliance.
Lead capacity planning, cost optimization, and disaster recovery strategies for mission‑critical workloads.

Requirements

5+ years of experience in site reliability engineering or a related role, with a strong background in cloud-native technologies.
Proficiency in Kubernetes, AWS services (EKS, EC2, S3, RDS), and Terraform for infrastructure automation.
Solid scripting skills in Python and Bash; experience with CI/CD tools such as GitHub Actions, Jenkins, or ArgoCD.
Deep understanding of networking, security, and observability concepts, with hands-on experience in tools such as Prometheus, Grafana, and ELK.
Excellent problem‑solving abilities, strong communication skills, and a proactive, collaborative mindset.

Skills

Kubernetes, AWS, Python, Terraform, CI/CD

Company: Anduril

Department: Engineering

Location: Washington, United States

Experience: 5+ years

Tenure: Full-time

Level: Senior

Posted June 23, 2026