onsite

Research Scientist, Video Understanding & World Models

Mecka AI is seeking a Research Scientist, Video Understanding, to lead its video understanding agenda. The role involves training large-scale video representation and video-language models on egocentric and stereo data, and turning those models into production signals for the company's robotics and embodied AI data infrastructure.

About the Role

We are looking for a Research Scientist, Video Understanding to own Mecka’s video understanding agenda end-to-end: train large-scale video representation and video-language models on our egocentric + stereo corpus, and turn the resulting checkpoints into production signals the rest of the stack ships on.

This role is focused on large model training, video encoders, video-language models, VLMs/VLAs, and temporal representation learning on real-world robotics data.

What You’ll Work On

Large-Scale Training & Architecture

Own model architecture and training strategy across Mecka’s task families (manipulation, locomotion, daily activity, long-horizon behavior).
Run self-supervised and multimodal pretraining (VideoMAE / V-JEPA / VideoPrism / InternVideo-class) with rigorous evals and clean ablations.

Video-Language & Multimodal Modeling

Train and fine-tune video encoders and video-language models (temporal transformers, joint-embedding models, contrastive objectives, masked modeling, instruction/video alignment).
Incorporate useful priors (pose, depth, camera motion, optical flow) when they improve representation quality.

Research → Production Signals

Turn checkpoints into usable artifacts: embeddings and model outputs that downstream systems can reliably consume (retrieval, labeling, QA, analytics).
Build a disciplined training + eval workflow with regression tracking and reproducible runs.

Who You Are

Required Background

Deep experience training large models in PyTorch (or equivalent), including multi-GPU or distributed training.
Strong understanding of modern video representation learning and/or multimodal modeling.
Ability to run rigorous experiments and communicate results clearly.

Strong Signals

Experience with video VLMs / VLA-adjacent systems (VideoCLIP, InstructBLIP-Video, LLaVA-Video-class).
Experience with egocentric / embodied datasets (Ego4D, Ego-Exo4D, EPIC-Kitchens, Something-Something).
Strong software engineering discipline: you write research code that can be shipped.

Why This Role

Work on a domain — egocentric embodied video — where data is scarce everywhere except here.
Own a research agenda that directly feeds production systems and product outcomes.
