onsite
Sr. Site Reliability Engineer, Starshield - SpaceX
Site Reliability Engineer
Senior Site Reliability Engineer driving the reliability, scalability, and security of Starshield’s satellite‑based services using Kubernetes, Terraform, and AWS, while ensuring robust monitoring, incident response, and continuous delivery pipelines for mission‑critical government applications.
About the role
Key Responsibilities
- Design, implement, and maintain highly available, scalable infrastructure for Starshield’s satellite services on AWS and Kubernetes clusters.
- Develop and manage Terraform modules and CI/CD pipelines to automate provisioning, configuration, and deployment of services.
- Implement comprehensive monitoring, alerting, and logging solutions (Prometheus, Grafana, ELK) to ensure 99.99% uptime and rapid incident resolution.
- Collaborate with security, networking, and product teams to enforce best practices, perform threat modeling, and harden infrastructure against cyber threats.
- Lead post‑mortem analyses, root cause investigations, and continuous improvement initiatives to reduce MTTR and prevent recurrence.
Requirements
- 5+ years of SRE or DevOps experience in a high‑scale, mission‑critical environment.
- Proficiency with Kubernetes, Terraform, AWS (EC2, RDS, S3, VPC), and Linux system administration.
- Strong scripting skills in Python or Bash and experience with CI/CD tools (GitHub Actions, Jenkins, ArgoCD).
- Hands‑on experience with monitoring/alerting stacks (Prometheus, Grafana, Loki, ELK) and incident management tools.
- Excellent communication, problem‑solving, and collaboration skills in a fast‑paced, cross‑functional team.
Skills
Kubernetes, Terraform, AWS, Linux, CI/CD