remote

Staff Research Engineer, Post-training & Evaluation

Reddit is seeking a Staff Research Engineer to lead Post-Training & Evaluation Science for its AI Engineering team. The role involves defining the "Reddit Benchmark" for LLM quality, owning evaluation reliability and statistical rigor, and designing post-training methodologies that produce high-performing models that understand Reddit's unique culture.

About the Role

As a Staff Research Engineer for Post-Training & Evaluation Science, you will own the science of our model development "feedback loop." While pre-training builds the base models, you define how we measure whether those models are safe, smart, and "Reddit-native," and you set the post-training methodology that turns base checkpoints into high-performing endpoints. You will define the Reddit Benchmark — our internal standard for rigorous model quality across both generation and representation — and own the evaluation science that the rest of the org's iteration depends on.

Responsibilities

Define the "Reddit Benchmark" evaluation standard: Own the methodology — not just the harness — for rigorously measuring model quality across Safety, Reasoning, representation/retrieval, and Reddit-specific knowledge. Decide what "Reddit-native" means in measurable terms and set the bar the org trains against.
Own evaluation reliability and statistical rigor: Establish the science behind trustworthy evals — judge variance, multi-sample scoring, inter-rater/inter-sample agreement, sampling and temperature effects, and calibration of automated judges. You are accountable for whether a benchmark delta is real or noise. Drive the practice of evaluation as a release gate — offline against frozen datasets, and pre-merge in CI/CD — so regressions are caught before endpoints ship.
Design model-as-a-judge methodology: Own judge selection, prompt design, calibration, and reliability for automated evaluation using frontier external models, enabling rapid, trustworthy iteration cycles.
Set post-training recipes and strategy: Design SFT recipes (data mixtures, curriculum, ablation strategy) that convert base models into helpful, well-aligned endpoints; partner with engineering to scale them.
Evaluate base and CPT checkpoints, not just endpoints: Design checkpoint-selection methodology across CPT experiments and LR studies, so we pick the right base before committing post-training compute.
Drive synthetic data generation strategy: Define and curate high-quality instruction and evaluation sets to improve generalization where human data is scarce.
Partner with Safety Engineering: Translate high-level safety policy into concrete classification metrics, probe sets, and CI/CD unit tests — including precision/recall at threshold, label-noise handling, and false-positive taxonomy for abuse detection (HHV).
Diagnose post-training instability: Dive into loss curves and eval logs to identify alignment tax and capability degradation, and recommend the fix.
Lead research direction: Set technical direction for evaluation and post-training across the team, mentor engineers and scientists, and represent the work internally (and externally where appropriate).

Required Qualifications

6+ years of professional ML experience (or a PhD plus 4+ years) with a direct focus on LLM post-training and evaluation.
PhD or MS in CS, ML, NLP, IR, or a related quantitative field — or equivalent industry research experience.
Deep expertise in evaluation reliability: judge/sample variance, multi-sample scoring, calibration, statistical significance, and the failure modes of automated evaluation.
Strong experience building custom, domain-specific evaluation harnesses (e.g., lm-eval-harness, Inspect AI, LightEval) — you know the strengths and limits of benchmarks like MMLU and GSM8K and when they don't apply, and you treat eval sets as versioned, frozen, regression-tracked code.
Experience evaluating both generation and representation/classification: model-as-a-judge for generative quality, and precision/recall, PR-AUC, retrieval/MTEB-style metrics, gold-label denoising, and label-noise handling for representation and classification.
Deep understanding of Continuous Pre-training (CPT), Instruction Tuning (SFT), and how data quality shapes model behavior.
Fluency in Python; strong data-pipeline and eval-harness engineering (e.g., Hugging Face Transformers, vLLM, lm-eval-harness). Working knowledge of PyTorch and distributed training (FSDP2, DeepSpeed ZeRO-3) sufficient to direct and debug post-training runs.

Nice to Have

Experience with MLflow or similar experiment-tracking frameworks.
Familiarity with modern fine-tuning frameworks (Axolotl, TorchTune) and PyTorch-native training stacks (TorchTitan).
Synthetic data generation techniques (e.g., Self-Instruct).
Experience with preference optimization (DPO, RLHF, RLAIF, GRPO).
Publications in NLP/ML/FAccT or related venues, or other evidence of research leadership.
Experience evaluating multimodal models (embeddings, hateful-memes-style classification).
