About the Role
We are looking for a Research Scientist, Video Understanding, to own Mecka’s video understanding agenda end to end. You will train large-scale video representation and video-language models on our egocentric and stereo corpus, and turn the resulting checkpoints into production signals that the rest of the stack relies on.
This role focuses on large-model training, video encoders, video-language models, vision-language and vision-language-action models (VLMs/VLAs), and temporal representation learning on real-world robotics data.
What You’ll Work On
Large-Scale Training & Architecture
- Own model architecture and training strategy across Mecka’s task families (manipulation, locomotion, daily activity, long-horizon behavior).
- Run self-supervised and multimodal pretraining (VideoMAE / V-JEPA / VideoPrism / InternVideo-class) with rigorous evals and clean ablations.
Video-Language & Multimodal Modeling
- Train and fine-tune video encoders and video-language models (temporal transformers, joint-embedding models, contrastive objectives, masked modeling, instruction/video alignment).
- Incorporate useful priors (pose, depth, camera motion, optical flow) when they improve representation quality.
Research → Production Signals
- Turn checkpoints into usable artifacts: embeddings and model outputs that downstream systems can reliably consume (retrieval, labeling, QA, analytics).
- Build a disciplined training and evaluation workflow with regression tracking and reproducible runs.
Who You Are
Required Background
- Deep experience training large models in PyTorch (or equivalent), including multi-GPU or distributed training.
- Strong understanding of modern video representation learning and/or multimodal modeling.
- Ability to run rigorous experiments and communicate results clearly.
Strong Signals
- Experience with video VLMs / VLA-adjacent systems (VideoCLIP, InstructBLIP-Video, LLaVA-Video-class).
- Experience with egocentric / embodied datasets (Ego4D, Ego-Exo4D, EPIC-Kitchens, Something-Something).
- Strong software engineering discipline: you write research code that can be shipped.
Why This Role
- Work on a domain (egocentric embodied video) where data is scarce everywhere except here.
- Own a research agenda that directly feeds production systems and product outcomes.