Remote
Senior Cloud Operations Engineer - The Linux Foundation
Systems Engineer
The Senior Cloud Operations Engineer drives automation, optimization, and reliability for PyTorch's cloud infrastructure using Python, AWS, Kubernetes, Terraform, CI/CD pipelines, and Docker.
About the role
Key Responsibilities
- Design, implement, and maintain scalable cloud-native infrastructure for PyTorch using AWS services and Kubernetes clusters.
- Automate deployment pipelines with Terraform, CI/CD tools, and Docker to accelerate release cycles.
- Monitor system health, troubleshoot performance bottlenecks, and enforce security best practices across the stack.
- Collaborate with development, data science, and security teams to integrate new features and ensure compliance.
- Document architecture, processes, and runbooks to support knowledge transfer and incident response.
Requirements
- 5+ years of cloud operations experience, preferably in AI/ML environments.
- Proficiency in Python scripting, AWS, Kubernetes, Terraform, and CI/CD pipelines.
- Strong understanding of containerization (Docker) and observability tools.
- Excellent problem-solving skills and a proactive, collaborative mindset.
Skills
Python, AWS, Kubernetes, Terraform, CI/CD, Docker