Preference — RLHF · DPO · KTO

Signal = comparisons (A ≻ B). This family exists for the case where you cannot write verify() — “helpful / harmless / on-brand” has no programmatic checker, but a human (or an AI judge) can rank two outputs. Load-bearing property: preference methods reshape ranking over behaviors π_θ can already produce — they inject no new capability (lessons/post-training/dpo-kto-for-agent-tool-selection.md, shared memory).

RLHF (reward model + PPO)

What: train a reward model on preference pairs (Bradley-Terry), then optimize π_θ against it with PPO + KL-to-reference. The canonical pipeline is InstructGPT (arXiv:2203.02155).
Eats: preference pairs → a learned scalar reward.
Still alive in 2026: Gemini 2.5 runs an explicit Reward-Model + Critic + RL loop (“RLF”, arXiv:2507.06261 §2.4); GPT-5’s sycophancy fix scores conversations and uses that as a training reward (OpenAI GPT-5 system card / model-training page).
Gotcha: a learned RM has parameters to exploit → reward hacking. Deterministic verifiers (RLVR) avoid this; see the gameability ladder in Contested edges.
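
To make the two stages concrete, here is a minimal sketch of the Bradley-Terry pairwise loss a reward model is trained on, followed by the KL-penalized reward the RL stage maximizes. The function names, the scalar inputs, and the default `kl_coef` are illustrative assumptions, not taken from any cited pipeline; real implementations work on batched tensors and often apply the KL penalty per token rather than per sequence.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def rm_pair_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    P(chosen > rejected) = sigmoid(r_chosen - r_rejected), so the loss
    shrinks as the reward model scores the chosen output higher.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def kl_shaped_reward(rm_score: float, logp_policy: float, logp_ref: float,
                     kl_coef: float = 0.1) -> float:
    """Sequence-level reward for the RL stage (assumed form).

    (logp_policy - logp_ref) is a single-sample estimate of the KL to the
    reference; kl_coef is a hypothetical default, tuned in practice.
    """
    return rm_score - kl_coef * (logp_policy - logp_ref)


# Usage: an untrained RM that scores both outputs equally pays log(2).
untrained = rm_pair_loss(0.0, 0.0)
# Drifting away from the reference eats into the RM score.
shaped = kl_shaped_reward(rm_score=1.0, logp_policy=-5.0, logp_ref=-7.0)
```

The KL term is what keeps the policy from chasing exploitable quirks of the learned RM too far from the reference; it limits reward hacking but does not remove it.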

DPO and the direct-preference family

What: skip the RM + RL loop. A closed-form loss directly raises the margin logπ_θ(chosen) − logπ_θ(rejected), measured relative to a frozen reference π_ref; under the Bradley-Terry model its optimum matches that of the KL-regularized RLHF objective (DPO, arXiv:2305.18290). Key hyperparameter: β, which sets how strongly π_θ is held to the reference (the KL strength).
Eats: (prompt, chosen, rejected) triples.
Policy: off-policy by default, which is its main weakness: the pairs usually come from another model or an earlier checkpoint. Iterative/online DPO resamples from the current π_θ to bring the data on-policy.
Production proof: Llama 3 chose DPO over PPO for its offline preference stage for stability/scalability at their scale, and runs it iteratively (their “iTeC” = rejection-sampling + SFT/DPO/IPO + online RL, several rounds) (arXiv:2407.21783).
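
The DPO loss for a single (prompt, chosen, rejected) triple can be sketched as below. Inputs are summed sequence log-probabilities under the policy and the frozen reference; the argument names and the default `beta` are assumptions for illustration, and a real trainer would compute these over batched tensors with gradients flowing only through the policy terms.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one triple, given sequence log-probs.

    The implicit reward of a response is beta * (log pi - log pi_ref).
    The loss is the Bradley-Terry NLL on the difference of the two
    implicit rewards.
    """
    chosen_logratio = pi_chosen - ref_chosen
    rejected_logratio = pi_rejected - ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# Usage: at initialization the policy equals the reference, so the margin
# is zero and the loss is log(2) regardless of beta.
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After the policy shifts mass toward the chosen response, the loss drops.
after_update = dpo_loss(-10.0, -16.0, -12.0, -15.0)
```

Note that the loss depends only on the margin between the two log-ratios, so it can fall even while both log-probabilities decrease, as long as the rejected one falls faster.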

Variants and their niche

KTO (arXiv:2402.01306): learns from unpaired good/bad labels (Kahneman-Tversky value model) — no matched pairs needed. This fits mined agent logs exactly (a pile of failed runs + a pile of clean solves).
IPO (arXiv:2310.12036) replaces DPO’s log-sigmoid with a squared loss that pulls the log-ratio margin toward a fixed target, curbing DPO’s tendency to overfit when preferences are near-deterministic; ORPO (arXiv:2403.07691) folds preference into SFT with no reference model; SimPO (arXiv:2405.14734) drops the reference via length-normalized reward.
Honest status: IPO/KTO/ORPO/SimPO are niche — real, used in fine-tuning shops and ablated in Tülu 3, but no Llama/Qwen/DeepSeek/GPT/Claude/Gemini tech report names them as the production choice (survey: arXiv:2503.11701). Plain DPO + iterative DPO are the mainstream ones.
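
For comparison with DPO, the per-example losses of three of these variants can be sketched side by side. This is a simplified reading of the respective papers: the function names, scalar inputs, and default hyperparameters are assumptions, and `kl_ref` stands in for KTO's batch-level KL estimate, which a real trainer computes separately and treats as a constant.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def kto_loss(logratio: float, kl_ref: float, desirable: bool,
             beta: float = 0.1, lambda_d: float = 1.0,
             lambda_u: float = 1.0) -> float:
    """KTO loss for one unpaired example.

    logratio = log pi(y|x) - log pi_ref(y|x); kl_ref is the reference
    point. Desirable examples are pushed above it, undesirable below.
    """
    if desirable:
        return lambda_d * (1.0 - sigmoid(beta * (logratio - kl_ref)))
    return lambda_u * (1.0 - sigmoid(beta * (kl_ref - logratio)))


def ipo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             tau: float = 0.1) -> float:
    """IPO: squared distance of the log-ratio margin from 1 / (2 * tau)."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return (margin - 1.0 / (2.0 * tau)) ** 2


def simpo_loss(pi_chosen: float, pi_rejected: float,
               len_chosen: int, len_rejected: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO: reference-free, length-normalized margin with target gamma."""
    margin = beta * (pi_chosen / len_chosen - pi_rejected / len_rejected) - gamma
    return -math.log(sigmoid(margin))


# Usage: mined agent logs as unpaired records (hypothetical values).
agent_runs = [
    {"logratio": 0.8, "desirable": True},    # clean solve
    {"logratio": 0.3, "desirable": False},   # failed run
    {"logratio": -0.5, "desirable": False},  # failed run
]
mean_kto = sum(kto_loss(r["logratio"], kl_ref=0.0, desirable=r["desirable"])
               for r in agent_runs) / len(agent_runs)
```

The KTO record carries no pairing at all, only a per-run label, which is why it matches a pile of failed runs plus a pile of clean solves without any matching step.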

RLAIF / Constitutional AI

What: replace human preference labels with AI feedback against a written constitution (Constitutional AI, arXiv:2212.08073).
Status: mainstream at Anthropic (it is the core method) and partially adopted at Google (Gemini 2.5 safety is “loosely inspired by Constitutional AI”, arXiv:2507.06261). 2026 refinement: Anthropic now teaches the constitution via synthetic document fine-tuning (SDF) → SFT → RL, because “demonstrating desired behavior is insufficient — the model must learn why” (alignment.anthropic.com, “teaching Claude why”, 2026).
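
A minimal sketch of the labeling step, assuming an AI judge is available as a plain callable: it turns pairs of sampled responses into (prompt, chosen, rejected) triples that any of the preference methods above can consume. The `Judge` signature, the `"A"`/`"B"` verdict format, and the swap-and-compare consistency check are all assumptions for illustration, not the procedure from the Constitutional AI paper; `toy_judge` is a stand-in for a real model call.

```python
from typing import Callable, List, NamedTuple, Tuple


class PreferenceTriple(NamedTuple):
    prompt: str
    chosen: str
    rejected: str


# Hypothetical judge interface: (constitution, prompt, response_a, response_b)
# returns "A" or "B" for whichever response better follows the constitution.
Judge = Callable[[str, str, str, str], str]


def label_with_ai_feedback(constitution: str,
                           samples: List[Tuple[str, str, str]],
                           judge: Judge) -> List[PreferenceTriple]:
    """Build preference triples from AI verdicts.

    Each pair is judged in both orders; pairs where the verdict flips with
    position are dropped (an assumed guard against position bias).
    """
    triples = []
    for prompt, resp_a, resp_b in samples:
        first = judge(constitution, prompt, resp_a, resp_b)
        second = judge(constitution, prompt, resp_b, resp_a)
        if first == "A" and second == "B":
            triples.append(PreferenceTriple(prompt, resp_a, resp_b))
        elif first == "B" and second == "A":
            triples.append(PreferenceTriple(prompt, resp_b, resp_a))
        # Any other combination is inconsistent and is skipped.
    return triples


def toy_judge(constitution: str, prompt: str, a: str, b: str) -> str:
    """Stand-in judge: prefers the shorter response, ties go to position A."""
    return "A" if len(a) <= len(b) else "B"


samples = [
    ("Summarize the report.", "Short summary.", "A much longer, padded summary."),
    ("Say hi.", "Hello", "Howdy"),  # equal length: verdict depends on position
]
triples = label_with_ai_feedback("Be concise.", samples, toy_judge)
```

With the toy judge, the first pair yields a triple and the second is dropped because the verdict flips when the order is swapped.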
