onsite
Agentic Engineering Manager - Infrastructure - OpenAI
Engineering Manager
Lead a high‑impact team building intelligent automation for large‑scale AI infrastructure. The team oversees the deployment, operation, and debugging of clusters, networks, and data‑center resources using cloud, Kubernetes, and infrastructure‑as‑code (IaC) technologies.
About the role
Key Responsibilities
- Lead and mentor a multidisciplinary engineering team focused on agentic automation for AI‑driven infrastructure at massive scale.
- Design, implement, and operate Kubernetes‑based clusters and networking across multi‑cloud and on‑prem environments.
- Develop and maintain infrastructure‑as‑code pipelines (Terraform, CI/CD) to ensure reliable, repeatable deployments.
- Collaborate with research and product groups to translate AI workload requirements into robust, observable infrastructure solutions.
- Drive continuous improvement of monitoring, debugging, and self‑healing capabilities using MLOps techniques.
Requirements
- 5+ years of experience building and operating large‑scale cloud or data‑center infrastructure, with deep expertise in Kubernetes and AWS.
- Proficiency in Python for automation, scripting, and building intelligent agents.
- Strong background in infrastructure‑as‑code tools such as Terraform and modern CI/CD systems.
- Demonstrated ability to lead technical teams, mentor engineers, and deliver complex projects on schedule.
- Experience applying MLOps concepts to improve reliability, observability, and automated remediation.
Skills
Python, Kubernetes, Terraform, AWS, CI/CD