hybrid
Machine Learning Infrastructure Engineer
Character.AI is seeking an experienced Machine Learning Infrastructure Engineer to design, build, and maintain training and serving infrastructure for ML research. This role involves providing infrastructure support, building diagnostic tools, monitoring deployments, and optimizing GPU allocation and utilization.
About the role
We’re looking for seasoned ML infrastructure engineers with experience designing, building, and maintaining training and serving infrastructure for ML research.
Responsibilities
- Provide infrastructure support to our ML research and product
- Build tooling to diagnose cluster issues and hardware failures
- Monitor deployments, manage experiments, and generally support our research
- Maximize GPU allocation and utilization for both serving and training
Requirements
- 4+ years of experience supporting infrastructure in an ML environment
- Experience developing tools to diagnose ML infrastructure problems and failures
- Experience with cloud platforms (e.g., Compute Engine, Kubernetes, Cloud Storage)
- Experience working with GPUs
Nice to have
- Experience with large GPU clusters and high-performance computing/networking
- Experience supporting large language model training
- Experience with ML frameworks like PyTorch/TensorFlow/JAX
- Experience with GPU kernel development
Skills
ML environment, Cloud Platforms, Compute Engine, Kubernetes, Cloud Storage, GPUs, large GPU clusters, high performance computing, high performance networking, large language model training, PyTorch, TensorFlow, JAX, GPU kernel development