Remote
Site Reliability Engineer, Kubernetes Platform Starshield - SpaceX
Lead the reliability and scalability of Starshield’s Kubernetes platform. You will ensure high availability, automated deployment, and robust monitoring across a global satellite constellation, using AWS/GCP, CI/CD pipelines, and advanced observability tools.
About the role
Key Responsibilities
- Design, implement, and maintain a highly available Kubernetes platform that supports the Starshield satellite constellation’s mission-critical workloads.
- Develop and manage CI/CD pipelines for automated build, test, and deployment of containerized services across multiple cloud environments.
- Implement comprehensive monitoring, logging, and alerting solutions to detect, diagnose, and remediate incidents with minimal downtime.
- Collaborate with cross‑functional teams to define and enforce best practices for infrastructure as code, security, and compliance.
- Lead capacity planning, performance tuning, and cost optimization initiatives for large‑scale distributed systems.
Requirements
- 5+ years of experience in Site Reliability Engineering or DevOps roles, with a strong focus on Kubernetes.
- Proficiency in cloud platforms (AWS or GCP) and experience with infrastructure as code tools such as Terraform or CloudFormation.
- Hands‑on expertise with CI/CD tools (Jenkins, GitHub Actions, Argo CD) and container orchestration best practices.
- Deep knowledge of monitoring and observability stacks (Prometheus, Grafana, ELK/EFK, or similar).
- Strong scripting skills (Python, Bash) and a solid understanding of networking, security, and compliance requirements for mission‑critical systems.