A few times a quarter, an engineer in our network — usually mid-transition from a general ML role into post-training research — asks us for a focused reading list. There is no perfect list. There is, however, the list we actually hand them, because we’ve handed enough versions of it now to notice which one results in the candidate showing up to their first frontier-lab panel sounding load-bearing. This is that list, in the order we recommend.
Two notes before the list itself. First, the order matters more than the list does. Most readers who try to absorb post-training research by reading the freshest papers first end up confused about why anything works at all; most readers who start with the conceptual foundations and end with the recent open problems find that the recent work is much easier to read on the second pass. Order is the actual product here. The list is mostly an artifact.
Second, this is a working canon, not a finished one. We update it roughly every six months as the field moves. Two of the twelve below were not on the list a year ago; two from last year’s list have aged out of the top twelve. Treat this as a snapshot, not a monument.
Papers 1–3: the conceptual ground floor
Read these first. If you skip these and start at the methods, the methods stop making sense the first time something doesn’t work in your training run.
Deep Reinforcement Learning from Human Preferences
The original learn-from-preferences paper. Predates the modern RLHF stack by years; uses the formulation everyone since has built on.
// What to take
The conceptual move that a preference dataset is sufficient to train a reward model, which can then drive RL. Almost everything else in this canon is a refinement of that idea. The math is light; the framing is heavy. Read for the framing.
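To make the preference-to-reward step concrete, here is a minimal sketch of the pairwise loss commonly used to fit a reward model to preference data, assuming a Bradley-Terry model of the comparison. The function name and the scalar inputs are our own illustration, not code from the paper.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Assumes P(chosen beats rejected) = sigmoid(reward_chosen - reward_rejected).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed in a
    # numerically stable way so large margins do not overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss shrinks as the reward model ranks the chosen response higher.
loss_when_right = preference_loss(2.0, 0.0)
loss_when_wrong = preference_loss(0.0, 2.0)
```

Minimizing this loss over many labeled pairs is what turns a pile of comparisons into a scalar reward that an RL step can optimize.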
Training language models to follow instructions with human feedback (InstructGPT)
The paper where modern LLM post-training, as a recognizable stack, lands. SFT → reward model → PPO, in a way that produced the model that became the basis of ChatGPT.
// What to take
The full pipeline at minimum-viable scale. Pay attention to the data-curation sections: they are where the actual difficulty lives, and almost every subsequent paper in this canon is doing the same thing more carefully. Read Section 3 twice.
Scaling Laws for Reward Model Overoptimization
A study of what happens when you push RLHF past the point where the reward model stops being a good proxy for what you actually want. Names the central failure mode.
// What to take
The KL-as-distance-to-the-reference framing, and the empirical observation that the proxy reward keeps going up while true quality eventually drops. This is the paper most engineers find clarifies why post-training never converges to “just maximize reward.” Read after Paper 2 (InstructGPT) to feel the tension this paper is pointing at.
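For readers who want the "distance to the reference" idea in code, here is a small sketch that computes the KL divergence between a tuned policy and its reference over a toy four-token vocabulary, and takes the square root as a distance-like quantity. The distributions are made up for illustration, and the helper assumes the reference puts nonzero probability wherever the policy does.

```python
import math


def kl_divergence(policy: list[float], reference: list[float]) -> float:
    """KL(policy || reference) in nats for two categorical distributions.

    Assumes reference[i] > 0 wherever policy[i] > 0.
    """
    return sum(p * math.log(p / q) for p, q in zip(policy, reference) if p > 0.0)


# Hypothetical next-token distributions, not taken from any paper.
reference = [0.25, 0.25, 0.25, 0.25]
tuned = [0.70, 0.10, 0.10, 0.10]

kl = kl_divergence(tuned, reference)
distance = math.sqrt(kl)  # square-root KL used as a distance-like axis
```

Plotting proxy reward and a held-out quality measure against this kind of distance is the usual way to see where the two start to diverge.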
Papers 4–6: the methods you’ll actually run
Once the foundations are in place, these three papers are the methods that frontier labs in 2026 most often pick between (and combine). You should be able to explain when to reach for each.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
The reformulation that lets you skip the explicit reward model and PPO step, training directly on preferences with a much simpler loss. Lower variance, fewer moving parts.
// What to take
The derivation. Most people remember DPO as “RLHF without the RL”; the more useful framing is that the policy and the implicit reward model are the same object, and the loss falls out of that observation. Once you internalize that, the family of DPO-style methods becomes navigable.
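The "same object" framing is easier to see in code. Below is a minimal sketch of the DPO loss for a single preference pair, written in terms of summed sequence log-probabilities under the policy and the frozen reference. The function name, argument names, and the default beta are illustrative assumptions; a real implementation would operate on batched tensors.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not a recommendation
) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # The implicit reward is the scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Note that no separate reward model appears anywhere: the reward is read off the policy's own log-probabilities relative to the reference, which is the observation the derivation rests on.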
Constitutional AI: Harmlessness from AI Feedback
The paper that made RLAIF (RL from AI feedback) credible at scale. A model critiques and revises its own outputs against a written list of principles, and AI-labeled preferences drive post-training.
// What to take
Two things. First, the data-generation pattern: a principled prompt plus a critic model can produce preference data at scale and at quality. Second, the framing that the constitution makes the model’s behavior explicit and reviewable. The technique generalizes; the framing is durable.
Secrets of RLHF in Large Language Models
A practitioner-oriented teardown of why naive PPO is brittle at LLM scale, and the specific tricks that make it stable: advantage normalization, KL control, reward shaping, value-function pretraining.
// What to take
The implementation details. Almost every detail in this paper is a thing your training run will hit. Bookmark it; you’ll re-read it when something is exploding and you’re not sure why.
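Two of the stabilizers named above, advantage normalization and a KL penalty folded into the reward, are simple enough to sketch in a few lines. This is a generic illustration under our own naming and with an arbitrary coefficient, not the paper's implementation.

```python
import statistics


def normalize_advantages(advantages: list[float], eps: float = 1e-8) -> list[float]:
    """Rescale a batch of advantages to zero mean and (roughly) unit variance."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]


def kl_shaped_reward(
    reward: float,
    policy_logp: float,
    ref_logp: float,
    kl_coef: float = 0.05,  # hypothetical coefficient
) -> float:
    """Subtract a per-sample KL estimate (log pi - log pi_ref) from the reward."""
    return reward - kl_coef * (policy_logp - ref_logp)


batch = normalize_advantages([1.0, 2.0, 3.0, 10.0])
shaped = kl_shaped_reward(reward=1.0, policy_logp=-1.0, ref_logp=-2.0, kl_coef=0.1)
```

Normalization keeps one outlier reward from dominating the policy-gradient step, and the shaped reward makes drifting away from the reference cost something on every sample.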
Papers 7–9: the data, the evals, and what to regularize
Post-training in practice is mostly about data, evaluation, and the choice of regularizer. These three papers each pick a non-obvious thing about one of those domains and make it the load-bearing observation.
LIMA: Less Is More for Alignment
A demonstration that a small, very high-quality SFT set (1K examples) gets surprisingly close to a much larger one — provided the examples are actually high-quality.
// What to take
Calibration about how much of post-training success is upstream of the algorithm: it’s the data. The work that pays off most reliably is the work of curating fewer, better examples. Hold this thought while you read the next paper, which is the practitioner-side counterpart.
Instruction-Following Evaluation for Large Language Models (IFEval)
A benchmark for whether the model actually does what the user asked, broken down into verifiable instructions (“respond in JSON,” “use exactly three sentences,” “do not use the letter e”). The benchmark you’ll cite in your write-ups.
// What to take
The verifiable-instruction design pattern. It’s hard to overstate how much downstream eval work has imitated this. If you’re building a new eval and aren’t using the IFEval verifier pattern somewhere, you’re probably making your own job harder.
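The pattern is easy to picture as code: each instruction maps to a small deterministic check, so no judge model is needed. The sketch below covers the three example instructions mentioned above. The function names are ours and the sentence splitter is deliberately crude; this is not the benchmark's actual verifier code.

```python
import json
import re


def is_valid_json(response: str) -> bool:
    """Check the 'respond in JSON' instruction."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


def has_exactly_n_sentences(response: str, n: int) -> bool:
    """Check 'use exactly n sentences' with a naive split on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", response) if s.strip()]
    return len(sentences) == n


def avoids_letter(response: str, letter: str) -> bool:
    """Check 'do not use the letter X', case-insensitively."""
    return letter.lower() not in response.lower()


def pass_rate(results: list[bool]) -> float:
    """Fraction of instruction checks that passed."""
    return sum(results) / len(results)


checks = [
    is_valid_json('{"answer": 42}'),
    has_exactly_n_sentences("One. Two! Three?", 3),
    avoids_letter("A quick brown fox", "e"),
]
score = pass_rate(checks)
```

Because every check is a pure function of the response text, results are reproducible across runs and cheap enough to run on every checkpoint.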
On the KL-divergence trade-off in RLHF
A class of papers that study how aggressively you can move the policy away from the reference model before the model degrades in ways the reward signal doesn’t catch. Read whichever recent representative your team prefers.
// What to take
The KL budget as a first-class hyperparameter, with named regions where the model is still recoverable vs. broken. This is the conceptual tool you’ll reach for in approximately every post-training run debrief once you have it.
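One way to treat the KL budget as a hyperparameter in practice is a proportional controller that raises the penalty coefficient when measured KL overshoots a target and lowers it when KL undershoots. The sketch below is a generic version of that idea; the class name, the clipping range, and all numeric values are assumptions for illustration, not settings from any specific paper.

```python
class AdaptiveKLController:
    """Nudges the KL penalty coefficient toward a target KL value."""

    def __init__(self, init_coef: float, target_kl: float, horizon: int) -> None:
        self.coef = init_coef
        self.target_kl = target_kl
        self.horizon = horizon  # larger horizon means slower adjustment

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Relative error, clipped so a single noisy batch cannot swing the
        # coefficient too far (the 0.2 bound is an assumed value).
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        self.coef *= 1.0 + error * n_steps / self.horizon
        return self.coef


# Hypothetical settings: a budget of 6 nats, adjusted over a 100-step horizon.
controller = AdaptiveKLController(init_coef=0.1, target_kl=6.0, horizon=100)
coef_after_overshoot = controller.update(observed_kl=12.0, n_steps=10)
```

The target KL is the budget: setting it is a decision about how far from the reference you are willing to let the policy move.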
Papers 10–12: the open problems
These three are the most recent in the list, and the ones still being argued about. The point of reading them isn’t to internalize the answers (they don’t have answers) but to know what the question shape is.
Reward Hacking: A Survey of Symptoms and Mitigations
A taxonomy of the ways reward models fail in practice — sycophancy, verbosity bias, format gaming, refusal-overuse — and the patterns that have been shown to mitigate each.
// What to take
The taxonomy. Frontier-lab panels often probe whether the candidate can name the specific failure mode they’re seeing in a hypothetical scenario; this paper gives you the vocabulary. Pair it with your own experience; most engineers have seen at least four of these in their own runs.
Multi-Objective Post-training: Pareto Fronts at Frontier Scale
Recent work on training a single policy against multiple sometimes-competing rewards (helpfulness vs. harmlessness vs. honesty) and characterizing the trade-off surface. The practical setting most frontier-lab post-training teams now operate in.
// What to take
The Pareto-front framing. Multi-objective post-training is no longer a research curiosity; every frontier lab does some version of it in production. Knowing how to talk about which point on the front you’re targeting is now table stakes for senior IC interviews.
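To ground the framing, here is a short sketch that picks the Pareto-optimal checkpoints out of a set scored on two objectives where higher is better. The checkpoint scores are invented for illustration, and the brute-force comparison is only meant for small sets.

```python
Point = tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: list[Point]) -> list[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (helpfulness, harmlessness) scores for five checkpoints.
checkpoints = [(0.9, 0.4), (0.7, 0.7), (0.5, 0.9), (0.6, 0.6), (0.4, 0.4)]
front = pareto_front(checkpoints)
```

Every checkpoint on the front is a defensible choice; "which point are you targeting" is the question of which trade-off between the objectives you are willing to ship.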
Self-Rewarding Language Models and the Synthetic Preference Frontier
The thread of recent work on training preference signals from the model itself rather than human annotators — including self-rewarding, AI-judge-as-reward, and synthetic-preference pipelines. Where the field is heading.
// What to take
The open question of whether synthetic preferences can scale arbitrarily or whether they collapse without a human signal grounded somewhere. This is the question every frontier-lab post-training team is currently arguing about; having a defensible position on it is one of the strongest senior-IC interview signals we know.
How to use this list
Three suggestions, based on what we see work.
First, don’t read them all in two weeks. The list is roughly 35–45 hours of reading if you’re being honest about taking notes, working through math, and actually loading the framings into memory rather than ticking off PDFs. Two papers a week, for six weeks, with a day spent applying each one to your existing project, gets you there at the pace that retains.
Second, after each paper, write a one-paragraph “what changes for my work” note — even (especially) when the paper isn’t directly applicable to your current role. The act of asking “does this change how I’d run my next experiment” forces the framing into your operational memory, and the resulting notes turn into the substrate of your panel answers six months later.
Third, discuss them with someone. The canon is much harder to absorb in isolation than in conversation. If you don’t have a reading partner inside your current team, the OpenTalent network runs informal monthly reading groups that members can join — typically four to six engineers per group, one paper per week, async-first with a single live discussion per month. Email us if that’s useful.
The list will change. Two of the papers above (numbers 10 and 11) are entries we expect to update within twelve months as the underlying research firms up. The rest we expect to remain stable. We’ll re-publish a fresh canon when the changes warrant it — historically that’s been roughly annually.
Next week’s piece will be on how we read a frontier lab’s hiring signal — the five things we look at to figure out what a lab is actually doing, regardless of what its public roles say. Subscribe at the bottom of the blog page if you want it.