About the Role
We are looking for a Senior Software Engineer to join our ML Infrastructure: Dev Enablement Team. Our mission is to build a frictionless development environment that empowers our researchers and engineers to rapidly innovate on deep learning models for autonomous driving.
We manage a high-scale Cloud Development Environment (CDE) platform that provides standardized, high-performance workspaces for ML development. In this role, you’ll spearhead high-impact initiatives as the platform evolves: designing multi-cloud setups to maximize GPU availability, driving in-depth model optimization, and building next-generation Agentic AI tooling. You will play a pivotal role in ensuring our training ecosystem remains cutting-edge, resilient, and highly efficient.
What You’ll Be Doing
- Architect Multi-Cloud Solutions: Explore, design, and implement multi-cloud architectures for our ML training platform to increase the availability, scalability, and resilience of high-performance compute resources (GPUs).
- System-Level ML Optimization: Partner closely with ML Researchers to profile and optimize distributed training jobs (PyTorch/DDP). Focus on resolving system-level bottlenecks—such as data loading (I/O), memory management, and network communication overhead—to maximize GPU utilization and training throughput.
- Build Agentic AI Tooling: Design, develop, and enhance Agentic AI tools and systems to automate workflows, streamline the ML lifecycle, and boost developer productivity.
- Scale Core Infrastructure: Drive the continuous development of our core ML infrastructure and existing CDE platform, leveraging Kubernetes to build robust, high-scale distributed solutions.
- Collaborate Cross-Functionally: Partner with ML engineers and data scientists to understand their needs, bridging the gap between the underlying infrastructure and model development.
- Drive Engineering Excellence: Champion best practices in software engineering, system reliability, and code quality while mentoring junior engineers through high-level system design and code reviews.
What We’re Looking For
- BS or MS in Computer Science or a related field.
- 5+ years of professional experience in software engineering with strong foundations in distributed systems.
- Expertise in Python, Go, or C++.
- Solid experience building on AWS or other cloud platforms, and with container orchestration using Kubernetes.
- Experience with the various stages of the ML development lifecycle.
- Strong written and oral communication skills, and a track record of taking ownership of infrastructure components and mentoring junior engineers.
Bonus Points
- Hands-on experience with ML model profiling and performance optimization for distributed training.
- Experience managing or wo