remote
Staff/Principal Software Engineer, ML/AI Environments
As a Staff/Principal Software Engineer on the ML/AI Environments team, you will build infrastructure enabling AI researchers and engineers to configure training and serving environments. You will focus on creating reliable, reproducible systems, collaborating with other teams, and shaping the future of AI development on the Databricks platform.
About the Role
As part of the ML/AI Environments team at Databricks, you will build the system that enables AI researchers and engineers to set up their desired training and serving environments. This is a high-agency, high-visibility team operating at the frontier of AI infrastructure, with deep ties to research, product, and real-world enterprise use cases. Databricks Mosaic AI is one of our fastest-growing businesses, helping thousands of customers democratize AI within their organizations by building the infrastructure that powers the next generation of AI.
The Impact You Will Have
- Build the infrastructure that enables ML and AI users to configure training and serving environments easily, reliably, and reproducibly.
- Collaborate with other AI infrastructure teams to build the features customers need to get more from the Databricks platform. Examples include speeding up virtual environment setup for short training and data processing sessions, and improving observability so customers can debug failed runs.
- Interact with turnkey customers and product managers to envision new features and identify areas for improvement.
- Shape how developers and data scientists build and interact with AI on Databricks.
What We Look For
- 5+ years of experience in backend or infrastructure engineering with a focus on building systems.
- Strong programming skills in Python, Scala, or Java.
- Experience with distributed systems, scalable APIs, or cloud-native infrastructure.
- Familiarity with service-oriented architecture, deployment pipelines, and system observability.
- Strong product and ownership mindset – you care about building the right solution, not just any solution.
- Strong understanding of dependency management technologies, such as virtual environments and containerization.
Skills
Python, Scala, Java, Distributed Systems, Scalable APIs, Cloud-Native Infrastructure, Service-Oriented Architecture, Deployment Pipelines, System Observability, Dependency Management, Virtual Environments, Containerization