onsite
Site Reliability Engineer, C2 Systems - Anduril
This role focuses on building and maintaining scalable, resilient infrastructure for AI‑powered defense systems, using Kubernetes, Docker, AWS, and observability tools such as Prometheus and Grafana.
About the role
Key Responsibilities
- Design, deploy, and manage containerized services on Kubernetes clusters to support real‑time command and control workloads.
- Implement CI/CD pipelines and infrastructure as code (Terraform) for rapid, reliable releases.
- Monitor system health with Prometheus, Grafana, and custom alerts; troubleshoot performance and availability issues.
- Collaborate with software, security, and operations teams to enforce best practices and improve system reliability.
- Automate routine operational tasks using scripting (Python, Bash) and configuration management.
Requirements
- 3+ years of SRE or DevOps experience in a high‑availability environment.
- Proficiency with Kubernetes, Docker, and cloud platforms (AWS).
- Hands‑on experience with monitoring/alerting tools such as Prometheus and Grafana.
- Strong scripting skills (Python or Bash) and familiarity with Terraform or similar IaC tools.
- Excellent problem‑solving skills and a proactive, collaborative mindset.
Skills
Kubernetes, Docker, AWS, Prometheus, Grafana, Terraform