The 2026 post-training canon. Twelve papers.

A few times a quarter, an engineer in our network — usually mid-transition from a general ML role into post-training research — asks us for a focused reading list. There is no perfect list. There is, however, the list we actually hand them, because we’ve handed enough versions of it now to notice which one results in the candidate showing up to their first frontier-lab panel sounding load-bearing. This is that list, in the order we recommend.

Two notes before the list itself. First, the order matters more than the list does. Most readers who try to absorb post-training research by reading the freshest papers first end up confused about why anything works at all; most readers who start with the conceptual foundations and end with the recent open problems find that the recent work is much easier to read on the second pass. Order is the actual product here. The list is mostly an artifact.

Second, this is a working canon, not a finished one. We update it roughly every six months as the field moves. Two of the twelve below were not on the list a year ago; two from last year’s list have aged out of the top twelve. Treat this as a snapshot, not a monument.

// PART 1 · FOUNDATIONS

Papers 1–3: the conceptual ground floor

Read these first. If you skip these and start at the methods, the methods stop making sense the first time something doesn’t work in your training run.

// Paper 01 · 2017 · STILL FOUNDATIONAL

Deep Reinforcement Learning from Human Preferences

Christiano, Leike, Brown, Martic, Legg, Amodei · OpenAI / DeepMind

// What it is

The original learn-from-preferences paper. Predates the modern RLHF stack by years; uses the formulation everyone since has built on.

// What to take

The conceptual move that a preference dataset is sufficient to train a reward model, which can then drive RL. Almost everything else in this canon is a refinement of that idea. The math is light; the framing is heavy. Read for the framing.
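The framing fits in a few lines. The sketch below is our illustration, not code from the paper: it assumes a reward model has already produced scalar scores for a preferred and a rejected sample, and computes the Bradley-Terry style loss that pushes the preferred score above the rejected one. The name `preference_loss` and the example scores are hypothetical.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Return -log(sigmoid(r_chosen - r_rejected)) for one labeled pair."""
    margin = r_chosen - r_rejected
    # Stable softplus(-margin): avoids overflow in exp() for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward-model scores for one preferred/rejected pair.
agree = preference_loss(2.0, 0.5)     # reward model ranks the pair like the label
disagree = preference_loss(0.5, 2.0)  # reward model ranks the pair the other way
print(agree < disagree)
```

Minimizing this loss over a preference dataset is what turns pairwise labels into a scalar reward that RL can then optimize.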

// Paper 02 · 2022 · INFLECTION POINT

Training language models to follow instructions with human feedback (InstructGPT)

Ouyang et al. · OpenAI

// What it is

The paper where modern LLM post-training, as a recognizable stack, lands. SFT → reward model → PPO, in a way that produced the model that became the basis of ChatGPT.

// What to take

The full pipeline at minimum-viable scale. Pay attention to the data curation sections: they are where the actual difficulty lives, and almost every subsequent paper in this canon is doing the same thing more carefully. Read Section 3 twice.

// Paper 03 · 2022 · NAMED THE PROBLEM

Scaling Laws for Reward Model Overoptimization

Gao, Schulman, Hilton · OpenAI

// What it is

A study of what happens when you push RLHF past the point where the reward model stops being a good proxy for what you actually want. Names the central failure mode.

// What to take

The KL-as-distance-to-the-reference framing, and the empirical observation that the proxy reward keeps going up while true quality (the paper’s “gold” reward) eventually drops. This is the paper that, for most engineers, clarifies why post-training never converges to “just maximize reward.” Read after Paper 02 to feel the tension this paper is pointing at.
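A toy curve shows the shape of that tension. As we read the paper, the fitted form for best-of-n sampling is R(d) = d(α − βd), where R is the gold reward (the true objective the proxy reward model stands in for) and d is the square root of the KL divergence from the initial policy. The coefficients below are made up for illustration and are not the paper's fitted values.

```python
def gold_reward_bon(d: float, alpha: float, beta: float) -> float:
    """Toy gold-reward curve R(d) = d * (alpha - beta * d), with d = sqrt(KL)."""
    return d * (alpha - beta * d)

def peak_distance(alpha: float, beta: float) -> float:
    """Distance at which R(d) peaks: solve dR/dd = alpha - 2 * beta * d = 0."""
    return alpha / (2.0 * beta)

ALPHA, BETA = 1.0, 0.1  # hypothetical coefficients, illustration only
d_star = peak_distance(ALPHA, BETA)
for d in (1.0, d_star, 9.0):
    print(f"d={d:.1f}  gold reward={gold_reward_bon(d, ALPHA, BETA):.2f}")
```

The gold reward rises, peaks at `d_star`, and then falls as optimization continues. That is the overoptimization picture in miniature.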

// PART 2 · METHODS

Papers 4–6: the methods you’ll actually run

Once the foundations are in place, these three papers are the methods that frontier labs in 2026 most often pick between (and combine). You should be able to explain when to reach for each.

// Paper 04 · 2023 · DPO

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafailov et al. · Stanford

// What it is

The reformulation that lets you skip the explicit reward model and PPO step, training directly on preferences with a much simpler loss. Lower variance, fewer moving parts.

// What to take

The derivation. Most people remember DPO as “RLHF without the RL”; the more useful framing is that the policy and the implicit reward model are the same object, and the loss falls out of that observation. Once you internalize that, the family of DPO-style methods becomes navigable.
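The "same object" point is easiest to see in the loss itself. This is a minimal single-pair sketch in plain Python, assuming you already have summed sequence log-probabilities from the policy and the frozen reference model. The function name, argument names, and default `beta` are our own choices for illustration.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, from sequence log-probabilities."""
    # Implicit rewards: beta * log(pi(y|x) / pi_ref(y|x)) for each response.
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written as a stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# A policy identical to the reference has zero implicit-reward margin.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
```

No separate reward model appears anywhere: the log-ratio between policy and reference plays that role, so the policy is doing both jobs.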

// Paper 05 · 2022 · CONSTITUTIONAL AI

Constitutional AI: Harmlessness from AI Feedback

Bai et al. · Anthropic

// What it is

The paper that made RLAIF (RL from AI feedback) credible at scale. A model critiques and revises another model’s outputs against a written principles list, and the resulting preferences drive post-training.

// What to take

Two things. First, the data-generation pattern: a principled prompt plus a critic model can produce preference data at scale and at quality. Second, the framing that the constitution makes the model’s behavior explicit and reviewable. The technique generalizes; the framing is durable.
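The data-generation pattern can be sketched as a single labeling step. Everything here is hypothetical scaffolding of ours, not the paper's pipeline: `judge` stands in for whatever critic model you call, the prompt wording is invented, and the stub judge exists only so the example runs.

```python
from typing import Callable

def label_preference(prompt: str, response_a: str, response_b: str,
                     principle: str, judge: Callable[[str], str]) -> dict:
    """Ask a critic model which response better follows one written principle."""
    question = (
        f"Principle: {principle}\n"
        f"Prompt: {prompt}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    choice = judge(question).strip().upper()
    if choice not in ("A", "B"):
        raise ValueError(f"unparseable judge answer: {choice!r}")
    if choice == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Stub critic that always answers "B"; a real pipeline would call a model here.
record = label_preference(
    "How do I pick a lock?", "Sure, step one...", "I can't help with that, but...",
    "Choose the response that is less likely to enable harm.",
    judge=lambda question: "B",
)
print(record["chosen"])
```

Because the principle is a plain string passed in by the caller, the rule that produced each preference record stays explicit and reviewable.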

// Paper 06 · 2023 · PPO REVISITED FOR LLMS

Secrets of RLHF in Large Language Models

Zheng et al. · Fudan NLP Group

// What it is

A practitioner-oriented teardown of why naive PPO is brittle at LLM scale, and the specific tricks that make it stable: advantage normalization, KL control, reward shaping, value-function pretraining.

// What to take

The implementation details. Almost every detail in this paper is a thing your training run will hit. Bookmark it; you’ll re-read it when something is exploding and you’re not sure why.
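Of the tricks listed, advantage normalization is the easiest to show in isolation. This is a minimal sketch of one common form (subtract the batch mean, divide by the batch standard deviation); we are not claiming it is the exact variant the paper recommends, and the `eps` value is an assumption.

```python
import statistics

def normalize_advantages(advantages: list[float], eps: float = 1e-8) -> list[float]:
    """Rescale a batch of advantages to roughly zero mean and unit variance."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    # eps guards against division by zero when every advantage is identical.
    return [(a - mean) / (std + eps) for a in advantages]

# Hypothetical raw advantages with one outlier that would dominate the update.
raw = [0.1, -0.2, 0.05, 12.0]
print(normalize_advantages(raw))
```

The point of the rescaling is that the size of a policy update stops depending on the raw scale of the reward, which is one of the ways a run quietly becomes unstable.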

// PART 3 · PRACTICALITIES

Papers 7–9: the data, the evals, and what to regularize

Post-training in practice is mostly about data, evaluation, and the choice of regularizer. These three papers each pick a non-obvious thing about one of those domains and make it the load-bearing observation.

// Paper 07 · 2023 · LIMA

LIMA: Less Is More for Alignment

Zhou et al. · Meta

// What it is

A demonstration that a small, very high-quality SFT set (1K examples) gets surprisingly close to a much larger one — provided the examples are actually high-quality.

// What to take

Calibration about how much of post-training success is upstream of the algorithm: it’s the data. The work that pays off most reliably is the work of curating fewer, better examples. Hold this thought while you read the next paper, which is the practitioner-side counterpart.

// Paper 08 · 2023 · IFEVAL

Instruction-Following Evaluation for Large Language Models (IFEval)

Zhou et al. · Google

// What it is

A benchmark for whether the model actually does what the user asked, broken down into verifiable instructions (“respond in JSON,” “use exactly three sentences,” “do not use the letter e”). The benchmark you’ll cite in your write-ups.

// What to take

The verifiable-instruction design pattern. It’s hard to overstate how much downstream eval work has imitated this. If you’re building a new eval and aren’t using the IFEval verifier pattern somewhere, you’re probably making your own job harder.
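The pattern is simple: each instruction gets a small deterministic check instead of a human or model judge. The three checkers below are our own illustration for instructions of the "respond in JSON", "exactly three sentences", and "avoid the letter e" kind. They are not IFEval's code, and the sentence splitter is deliberately naive.

```python
import json
import re

def is_valid_json(response: str) -> bool:
    """Check for 'respond in JSON': the whole response must parse."""
    try:
        json.loads(response)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False

def has_exactly_n_sentences(response: str, n: int) -> bool:
    """Check for 'use exactly n sentences', using a naive split on . ! ?"""
    sentences = [s for s in re.split(r"[.!?]+", response) if s.strip()]
    return len(sentences) == n

def avoids_letter(response: str, letter: str) -> bool:
    """Check for 'do not use the letter X', case-insensitively."""
    return letter.lower() not in response.lower()

print(is_valid_json('{"answer": 42}'))
print(has_exactly_n_sentences("One. Two! Three?", 3))
print(avoids_letter("A big stubborn dog", "e"))
```

Each check returns a plain boolean, so pass rates can be aggregated per instruction type without any judge in the loop.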

// Paper 09 · 2024 · KL CONTROL STUDY

On the KL-divergence trade-off in RLHF

Multiple authors, multiple labs · representative of a small cluster of 2024 papers on this

// What it is

A class of papers that study how aggressively you can move the policy away from the reference model before the model degrades in ways the reward signal doesn’t catch. Read whichever recent representative your team prefers.

// What to take

The KL budget as a first-class hyperparameter, with named regions where the model is still recoverable vs. broken. This is the conceptual tool you’ll reach for in approximately every post-training run debrief once you have it.
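Treating the budget as a first-class hyperparameter can look like the sketch below. It assumes you log summed sequence log-probabilities under both the policy and the reference for samples drawn from the policy. The function names, the `beta` coefficient, and the budget value are all hypothetical, and real region boundaries would come from your own runs.

```python
def kl_estimate(policy_logps: list[float], ref_logps: list[float]) -> float:
    """Monte Carlo estimate of KL(policy || ref) from policy samples.

    Each entry is the summed log-probability of one sampled sequence.
    """
    ratios = [p - r for p, r in zip(policy_logps, ref_logps)]
    return sum(ratios) / len(ratios)

def kl_penalized_reward(reward: float, policy_logp: float,
                        ref_logp: float, beta: float) -> float:
    """Reward minus a per-sample KL penalty, as commonly used in RLHF."""
    return reward - beta * (policy_logp - ref_logp)

def within_budget(kl: float, budget: float) -> bool:
    """True while the policy is still inside the chosen KL budget."""
    return kl <= budget

# Hypothetical logged values for two sampled sequences.
kl = kl_estimate([-10.0, -12.0], [-11.0, -14.0])
print(kl, within_budget(kl, budget=5.0))
```

Logging this estimate next to the reward curve is what lets a debrief say where in the budget a run went wrong, rather than only that it did.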

// PART 4 · THE FRONTIER

Papers 10–12: the open problems

These three are the most recent in the list, and the ones still being argued about. The point of reading them isn’t to internalize the answers (they don’t have answers) but to know what shape the questions take.

// Paper 10 · 2024 · REWARD HACKING

Reward Hacking: A Survey of Symptoms and Mitigations

Multiple authors · survey paper

// What it is

A taxonomy of the ways reward models fail in practice — sycophancy, verbosity bias, format gaming, refusal-overuse — and the patterns that have been shown to mitigate each.

// What to take

The taxonomy. Frontier-lab panels often probe whether the candidate can name the specific failure mode they’re seeing in a hypothetical scenario; this paper gives you the vocabulary. Pair it with your own experience — most engineers have seen at least four of these in their own runs.

// Paper 11 · 2025 · MULTI-OBJECTIVE

Multi-Objective Post-training: Pareto Fronts at Frontier Scale

Representative of a 2025 cluster · multiple labs

// What it is

Recent work on training a single policy against multiple sometimes-competing rewards (helpfulness vs. harmlessness vs. honesty) and characterizing the trade-off surface. The practical setting most frontier-lab post-training teams now operate in.

// What to take

The Pareto-front framing. Multi-objective post-training is no longer a research curiosity — every frontier lab does some version of it in production. Knowing how to talk about which point on the front you’re targeting is now table stakes for senior IC interviews.
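The vocabulary is easier to use once you have computed a front by hand. This is a minimal sketch, assuming each checkpoint has been scored on every objective with higher meaning better; the checkpoint scores are invented for illustration.

```python
def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(points: list[tuple[float, ...]]) -> list[tuple[float, ...]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (helpfulness, harmlessness) scores for four checkpoints.
checkpoints = [(0.9, 0.4), (0.7, 0.7), (0.5, 0.9), (0.6, 0.6)]
front = pareto_front(checkpoints)
print(front)
```

Here `(0.6, 0.6)` drops out because `(0.7, 0.7)` beats it on both axes. Choosing among the three survivors is a product decision about which trade-off to ship, not something the optimizer can settle.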

// Paper 12 · 2025–2026 · SYNTHETIC DATA

Self-Rewarding Language Models and the Synthetic Preference Frontier

Multiple authors · ongoing research cluster

// What it is

The thread of recent work on deriving preference signals from the model itself rather than from human annotators, including self-rewarding, AI-judge-as-reward, and synthetic-preference pipelines. Where the field is heading.

// What to take

The open question of whether synthetic preferences can scale arbitrarily or whether they collapse without a human signal grounded somewhere. This is the question every frontier-lab post-training team is currently arguing about; having a defensible position on it is one of the strongest senior-IC interview signals we know.

// 13 · How to use this list.

Three suggestions, based on what we see work.

First, don’t read them all in two weeks. The list is roughly 35–45 hours of reading if you’re being honest about taking notes, working through math, and actually loading the framings into memory rather than ticking off PDFs. Two papers a week, for six weeks, with a day spent applying each one to your existing project, gets you there at the pace that retains.

Second, after each paper, write a one-paragraph “what changes for my work” note — even (especially) when the paper isn’t directly applicable to your current role. The act of asking “does this change how I’d run my next experiment” forces the framing into your operational memory, and the resulting notes turn into the substrate of your panel answers six months later.

Third, discuss them with someone. The canon is much harder to absorb in isolation than in conversation. If you don’t have a reading partner inside your current team, the OpenTalent network runs informal monthly reading groups that members can join — typically four to six engineers per group, one paper per week, async-first with a single live discussion per month. Email us if that’s useful.

The list will change. Two of the papers above (numbers 10 and 11) are entries we expect to update within twelve months as the underlying research firms up. The rest we expect to remain stable. We’ll re-publish a fresh canon when the changes warrant it — historically that’s been roughly annually.

Next week’s piece will be on how we read a frontier lab’s hiring signal — the five things we look at to figure out what a lab is actually doing, regardless of what its public roles say. Subscribe at the bottom of the blog page if you want it.

The OpenTalent Network

// EDITORIAL · RESEARCH