remote
Senior Infrastructure Optimization Specialist - AI Trainer - 10xteam
Software Engineer
Senior specialist position focused on optimizing AI infrastructure and ensuring scalable, resilient environments for model training. Use Python, AWS, Kubernetes, Docker, and Terraform to deliver high‑performance, cost‑efficient solutions in a flexible, remote freelance role.
About the role
Key Responsibilities
- Design, implement, and maintain scalable AI training pipelines on AWS, ensuring high availability and performance.
- Automate infrastructure provisioning and deployment using Terraform, Docker, and Kubernetes.
- Optimize resource utilization and cost across cloud environments, applying best practices for auto‑scaling and spot instances, including spot‑based training.
- Collaborate with data scientists to integrate model training workflows, monitor GPU/CPU usage, and troubleshoot bottlenecks.
- Document infrastructure architecture, runbooks, and performance metrics for continuous improvement.
Requirements
- 5+ years of experience in cloud infrastructure, with a focus on AI/ML workloads.
- Proficiency in Python, AWS services (SageMaker, EC2, EKS), Kubernetes, Docker, and Terraform.
- Strong understanding of CI/CD pipelines, monitoring, and cost‑optimization strategies.
- Excellent problem‑solving skills and ability to work independently in a remote freelance setting.
Skills
Python, AWS, Kubernetes, Docker, Terraform