onsite
Senior Linux Site Reliability Engineer - SpaceX
About the role
Lead the design, scaling, and optimization of Kubernetes clusters on Linux, ensuring high availability and performance for critical business services.
Key Responsibilities
- Architect, deploy, and maintain Kubernetes clusters across production environments, ensuring reliability and scalability.
- Collaborate with development teams to integrate CI/CD pipelines and automate deployment workflows.
- Implement monitoring, alerting, and logging solutions to proactively detect and resolve incidents.
- Optimize resource utilization, cost, and performance through capacity planning and tuning.
- Lead incident response, root-cause analysis, and postmortem documentation.
Requirements
- 5+ years of experience in Linux system administration and site reliability engineering.
- Deep expertise in Kubernetes, container runtimes, and related ecosystem tools.
- Proficiency with automation tools (Ansible, Terraform, Helm) and scripting (Bash, Python).
- Strong knowledge of monitoring/observability stacks (Prometheus, Grafana, ELK).
- Excellent problem-solving skills and the ability to work in a fast-paced, mission-critical environment.