Preference — RLHF · DPO · KTO

Signal = comparisons (A ≻ B). This family exists for the case where you cannot write verify() — “helpful / harmless / on-brand” has no programmatic checker, but a human (or an AI judge) can rank two outputs. Load-bearing property: preference methods reshape ranking over behaviors π_θ can already produce — they inject no new capability (lessons/post-training/dpo-kto-for-agent-tool-selection.md, shared memory).

RLHF (reward model + PPO)

  • What: train a reward model on preference pairs (Bradley-Terry), then optimize π_θ against it with PPO + KL-to-reference. The canonical pipeline is InstructGPT (arXiv:2203.02155).
  • Eats: preference pairs → a learned scalar reward.
  • Still in production use in 2026: Gemini 2.5 runs an explicit Reward-Model + Critic + RL loop (“RLF”, arXiv:2507.06261 §2.4); GPT-5’s sycophancy fix scores conversations and uses that score as a training reward (OpenAI GPT-5 system card / model-training page).
  • Gotcha: a learned RM has parameters to exploit → reward hacking. Deterministic verifiers (RLVR) avoid this; see the gameability ladder in Contested edges.
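
As a rough illustration of the two pieces named above, the sketch below computes the Bradley-Terry negative log-likelihood for a single preference pair, plus a KL-penalized reward of the kind a PPO-style loop typically optimizes. The function names, the scalar inputs, and the `kl_coef` default are assumptions made for this example; a real pipeline would use a neural reward model and batched tensors.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one pair.

    P(chosen beats rejected) = sigmoid(r_chosen - r_rejected), so the loss
    shrinks as the reward model scores the chosen output higher.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def kl_shaped_reward(rm_score: float, logp_policy: float, logp_ref: float,
                     kl_coef: float = 0.1) -> float:
    """RM score minus a per-sample KL-to-reference penalty (assumed form)."""
    return rm_score - kl_coef * (logp_policy - logp_ref)


# When the RM cannot tell the pair apart, the loss is log(2), about 0.693.
print(reward_model_loss(0.0, 0.0))
```

The KL term is what keeps π_θ from drifting far from the reference in pursuit of reward-model quirks; it limits reward hacking but does not remove it.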

DPO and the direct-preference family

  • What: skip the RM + RL loop. A closed-form loss directly widens the gap log π_θ(chosen) − log π_θ(rejected) relative to the same gap under a frozen reference; under the Bradley-Terry model its optimum matches that of the KL-regularized RLHF objective (DPO, arXiv:2305.18290). Key hyperparameter: β, the strength of the implicit KL penalty toward the reference.
  • Eats: (prompt, chosen, rejected) triples.
  • Policy: off-policy by default, since pairs usually come from another model or an earlier checkpoint. This is its main weakness; iterative/online DPO resamples pairs from the current π_θ to make it on-policy.
  • Production proof: Llama 3 chose DPO over PPO for its offline preference stage for stability/scalability at their scale, and runs it iteratively (their “iTeC” = rejection-sampling + SFT/DPO/IPO + online RL, several rounds) (arXiv:2407.21783).
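
A minimal sketch of the DPO loss for one (prompt, chosen, rejected) triple, assuming the four sequence log-probabilities have already been computed by the policy and the frozen reference. The function name, argument names, and `beta=0.1` default are illustrative choices, not taken from any particular library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each log-ratio is log pi_theta(y|x) - log pi_ref(y|x), summed over tokens.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# At initialization the policy equals the reference, so the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

Note that only the difference of log-ratios enters the loss, so it can fall even while both log-probabilities drop; this is one reason the variants in the next section exist.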

Variants and their niche

  • KTO (arXiv:2402.01306): learns from unpaired good/bad labels (Kahneman-Tversky value model) — no matched pairs needed. This fits mined agent logs exactly (a pile of failed runs + a pile of clean solves).
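  • IPO (arXiv:2310.12036) replaces DPO’s log-sigmoid with a squared loss toward a fixed margin, targeting DPO’s tendency to overfit when preferences are near-deterministic; ORPO (arXiv:2403.07691) folds preference into SFT with no reference model; SimPO (arXiv:2405.14734) drops the reference by using a length-normalized log-probability as the implicit reward.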
  • Honest status: IPO/KTO/ORPO/SimPO are niche — real, used in fine-tuning shops and ablated in Tülu 3, but no Llama/Qwen/DeepSeek/GPT/Claude/Gemini tech report names them as the production choice (survey: arXiv:2503.11701). Plain DPO + iterative DPO are the mainstream ones.
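
To make the differences concrete, here is a hedged per-example sketch of three of these losses, written from the objectives in the cited papers. The function names, defaults, and the scalar `z_ref` argument are assumptions for illustration: in KTO the reference point is a batch-level KL estimate, which is simply passed in as a number here.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def kto_loss(logratio: float, desirable: bool, z_ref: float = 0.0,
             beta: float = 0.1, lambda_d: float = 1.0,
             lambda_u: float = 1.0) -> float:
    """KTO on ONE unpaired example; logratio = log pi_theta - log pi_ref."""
    if desirable:
        return lambda_d * (1.0 - sigmoid(beta * (logratio - z_ref)))
    return lambda_u * (1.0 - sigmoid(beta * (z_ref - logratio)))


def ipo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """IPO: squared distance of the log-ratio gap from the target 1/(2*beta)."""
    gap = ((policy_chosen_logp - ref_chosen_logp)
           - (policy_rejected_logp - ref_rejected_logp))
    return (gap - 1.0 / (2.0 * beta)) ** 2


def simpo_loss(chosen_logp: float, chosen_len: int,
               rejected_logp: float, rejected_len: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """SimPO: no reference model; reward is the average per-token log-prob."""
    margin = beta * (chosen_logp / chosen_len
                     - rejected_logp / rejected_len) - gamma
    return -math.log(sigmoid(margin))


# A mined agent log maps onto KTO directly: one label per run, no pairing.
runs = [(1.5, True), (-0.4, False)]  # (logratio, solved?), made-up values
print(sum(kto_loss(lr, ok) for lr, ok in runs) / len(runs))
```

The KTO function takes a single labeled example, while IPO and SimPO still need a matched pair; that difference in input shape is the practical reason KTO suits piles of unpaired failed and successful runs.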

RLAIF / Constitutional AI

  • What: replace human preference labels with AI feedback against a written constitution (Constitutional AI, arXiv:2212.08073).
  • Status: mainstream at Anthropic (it is the core method) and partially adopted at Google (Gemini 2.5 safety is “loosely inspired by Constitutional AI”, arXiv:2507.06261). 2026 refinement: Anthropic now teaches the constitution via synthetic document fine-tuning (SDF) → SFT → RL, because “demonstrating desired behavior is insufficient — the model must learn why” (alignment.anthropic.com, “teaching Claude why”, 2026).
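
The labeling step of RLAIF can be sketched as below: an AI judge compares two responses against a written principle, and its verdict becomes a (prompt, chosen, rejected) triple for the preference methods above. Everything here is hypothetical scaffolding: `Judge`, `label_pair`, and especially `toy_judge`, which is a keyword stub standing in for a call to a real judge model.

```python
from typing import Callable, NamedTuple


class PreferencePair(NamedTuple):
    prompt: str
    chosen: str
    rejected: str


# Hypothetical judge signature: (principle, prompt, a, b) -> "A" or "B".
Judge = Callable[[str, str, str, str], str]


def label_pair(judge: Judge, principle: str, prompt: str,
               a: str, b: str) -> PreferencePair:
    """Turn one AI verdict into a preference triple."""
    verdict = judge(principle, prompt, a, b)
    if verdict == "A":
        return PreferencePair(prompt, chosen=a, rejected=b)
    if verdict == "B":
        return PreferencePair(prompt, chosen=b, rejected=a)
    raise ValueError(f"unexpected verdict: {verdict!r}")


def toy_judge(principle: str, prompt: str, a: str, b: str) -> str:
    """Stand-in for a model call: prefers the response without an insult."""
    a_rude = "idiot" in a.lower()
    b_rude = "idiot" in b.lower()
    return "B" if a_rude and not b_rude else "A"


pair = label_pair(toy_judge, "Choose the more respectful response.",
                  "Fix my bug", "You idiot, read the docs.",
                  "Sure, here is a fix.")
print(pair.chosen)  # prints "Sure, here is a fix."
```

Because the judge is itself a model, its verdicts inherit the gameability concerns noted for learned reward models.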