On-site
Software Infrastructure Kubernetes Engineer - Graphcore
Software Engineer
Lead the design, deployment, and scaling of Kubernetes-based infrastructure for AI workloads, ensuring high availability, security, and performance across cloud and on‑prem environments.
About the role
Key Responsibilities
- Architect and maintain production‑grade Kubernetes clusters for AI and ML workloads.
- Implement CI/CD pipelines, Helm charts, and GitOps workflows to automate deployments.
- Collaborate with DevOps, security, and platform teams to enforce best practices and compliance.
- Monitor cluster health, troubleshoot performance bottlenecks, and optimize resource utilization.
- Drive continuous improvement of infrastructure tooling and documentation.
Requirements
- 5+ years of experience building and operating Kubernetes environments at scale.
- Proficiency with Docker, Helm, and cloud platforms (AWS, GCP, or Azure).
- Strong scripting skills (Python, Bash) and familiarity with IaC tools (Terraform, Pulumi).
- Experience with monitoring/observability stacks (Prometheus, Grafana, ELK).
- Excellent problem‑solving skills and a collaborative mindset.
Skills
Kubernetes, Docker, CI/CD, Helm