Machine Learning: Multimodal Foundation Models
The Bot Company is seeking a Machine Learning Engineer specializing in Multimodal Foundation Models to build unified models that reason across text, image, video, and kinematics for intelligent robotic policies. This role involves developing native multimodal architectures, improving cross-modal reasoning, and owning the end-to-end training loop for large-scale experiments, integrating models into real robotic systems.
We're building a helpful robot for every home. We're a small team of engineers, designers, and operators based in San Francisco. Our team comes from Tesla, Cruise, OpenAI, Google, Pixar, and many other great companies. In the past we've shipped to hundreds of millions of users and know what it takes to build amazing products and experiences. Our team is deliberately lean to promote rapid decision making and do away with bureaucracy and hierarchy. Everyone is an IC and is empowered with massive scope, radical ownership, and direct responsibility. We work across the stack with a culture built for rapid iteration and fast execution.
We are building unified foundation models that natively reason across text, image, video, and kinematics to drive intelligent robotic policies. You will work on large multi-modal networks and own the entire stack from data to training and deploying models.
Posted May 31, 2026