onsite
Infrastructure and MLOps Engineer - Graphcore
MLOps Engineer
Design, build, and operate scalable MLOps infrastructure for AI workloads, using Kubernetes, Docker, Terraform, and AWS to enable rapid model deployment and continuous integration on high‑performance compute.
About the role
Key Responsibilities
- Architect and implement end‑to‑end MLOps pipelines that support training, validation, and serving of large‑scale AI models.
- Deploy, manage, and optimize containerized workloads on Kubernetes clusters across on‑premises and cloud environments.
- Automate infrastructure provisioning and configuration using Terraform and related IaC tools.
- Integrate CI/CD workflows for model code, data, and artifacts, ensuring reproducibility and rapid iteration.
- Monitor system performance, reliability, and cost, implementing observability solutions for GPU‑intensive workloads.
- Collaborate with hardware, software, and data science teams to align infrastructure with evolving AI compute requirements.
Requirements
- Strong experience with Python scripting for automation and orchestration.
- Deep knowledge of Kubernetes, Docker, and container orchestration at scale.
- Proficiency in Terraform or similar infrastructure‑as‑code tools, and cloud platforms such as AWS.
- Hands‑on experience building CI/CD pipelines for machine‑learning workflows.
- Solid Linux systems administration skills and familiarity with GPU‑accelerated environments.
Skills
Python, Kubernetes, Docker, Terraform, AWS, CI/CD, Linux