The family map

Post-training methods fall into three families, grouped by the signal they learn from, plus one orthogonal axis (PEFT) that is a delivery mechanism rather than a learning signal. Every named method is a fixed preset on the on/off-policy axis from Foundations; you don’t freely combine axes.

graph TD
  ROOT["Post-training<br/>push mass toward good behavior"]
  ROOT --> IM["IMITATION<br/>signal = demonstrations"]
  ROOT --> PR["PREFERENCE<br/>signal = comparisons A≻B"]
  ROOT --> RL["REINFORCEMENT<br/>signal = reward / verifier"]

  IM --> SFT["SFT"]
  IM --> DIST["Distillation<br/>off-policy / on-policy"]
  IM --> RS["Rejection-sampling FT<br/>= 'RL without RL'"]

  PR --> RLHF["RLHF (RM + PPO)"]
  PR --> DPO["DPO · KTO · IPO · ORPO · SimPO"]

  RL --> PPO["PPO"]
  RL --> GRPO["GRPO → GSPO / DAPO"]
  RL --> RLVR["RLVR (verifiable reward)"]
  RL --> AG["Agentic / multi-turn RL"]

  PEFT["PEFT: LoRA · QLoRA · DoRA<br/>a HOW, applied to any of the above"]
  ROOT -.delivery.-> PEFT
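
The same map can be written down as a small data structure, which makes the "three families plus one orthogonal axis" split concrete. This is a minimal sketch, not part of any library: the names `FAMILIES`, `PEFT_OPTIONS`, `family_of`, and `Recipe` are hypothetical and chosen for illustration, and the method lists are transcribed from the diagram above (with grouped nodes such as "DPO · KTO · IPO · ORPO · SimPO" split into individual entries).

```python
from dataclasses import dataclass
from typing import Optional

# Learning-signal families, transcribed from the diagram.
# Keys and layout are illustrative assumptions, not an established schema.
FAMILIES = {
    "imitation": {
        "signal": "demonstrations",
        "methods": ["SFT", "Distillation", "Rejection-sampling FT"],
    },
    "preference": {
        "signal": "comparisons",
        "methods": ["RLHF (RM + PPO)", "DPO", "KTO", "IPO", "ORPO", "SimPO"],
    },
    "reinforcement": {
        "signal": "reward / verifier",
        "methods": ["PPO", "GRPO", "GSPO", "DAPO", "RLVR",
                    "Agentic / multi-turn RL"],
    },
}

# PEFT is the orthogonal delivery axis: a HOW, not a fourth family.
PEFT_OPTIONS = ["LoRA", "QLoRA", "DoRA"]


def family_of(method: str) -> str:
    """Return the family whose method list contains `method`."""
    for family, info in FAMILIES.items():
        if method in info["methods"]:
            return family
    raise KeyError(f"unknown method: {method}")


@dataclass(frozen=True)
class Recipe:
    """One training stage: a method plus an optional PEFT delivery choice."""
    method: str
    peft: Optional[str] = None  # None means full-parameter training

    def __post_init__(self) -> None:
        family_of(self.method)  # raises KeyError for unknown methods
        if self.peft is not None and self.peft not in PEFT_OPTIONS:
            raise ValueError(f"unknown PEFT option: {self.peft}")

    @property
    def family(self) -> str:
        return family_of(self.method)

    @property
    def signal(self) -> str:
        return FAMILIES[self.family]["signal"]
```

For example, `Recipe("GRPO", peft="LoRA")` and `Recipe("GRPO")` belong to the same family and learn from the same signal; only the delivery differs, which is the sense in which PEFT is orthogonal.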

What’s actually load-bearing in 2026 (verified)

The status column below comes from a fresh Exa pass over model tech reports and lab blogs (2025–2026), not from recall; sources are cited per row here and throughout the method chapters.

| Method | 2026 status | Anchor |
|---|---|---|
| SFT | Mainstream, universal: stage 0 of every recipe | |
| Off-policy distillation | Mainstream: DeepSeek distills R1 → V3/V3.2 as a named stage | arXiv:2412.19437, arXiv:2512.02556 |
| On-policy distillation | Niche / promising: NOT yet confirmed in any frontier lab’s production recipe | GKD arXiv:2306.13649; Thinking Machines blog 2025-10-27 |
| Rejection-sampling FT | Mainstream: named stage in Llama 3 & DeepSeek-R1 | arXiv:2407.21783, arXiv:2501.12948 |
| RLHF (RM+PPO) | Mainstream at proprietary labs (Gemini 2.5, GPT-5) | arXiv:2507.06261 |
| DPO | Mainstream: Llama 3’s offline preference stage | arXiv:2407.21783 |
| IPO/KTO/ORPO/SimPO | Niche: OSS/research tooling; no flagship names them as primary | arXiv:2503.11701 |
| GRPO | Mainstream: the reasoning-RL default | arXiv:2402.03300 |
| RLVR | Mainstream: arguably the defining 2025-26 technique | arXiv:2501.12948, arXiv:2507.06261 |
| GSPO (Qwen3) | Mainstream: first GRPO-successor with a flagship behind it | arXiv:2507.18071 |
| DAPO | OSS-tooling mainstream; ByteDance-origin, not confirmed elsewhere | arXiv:2503.14476 |
| PRM (process reward) | Niche: explicitly rejected for R1 (step-level reward hacking) | arXiv:2501.12948 |
| Rubric/critic outcome reward | Mainstream & growing: the real replacement for PRM | arXiv:2507.06261 |
| Agentic / multi-turn RL | Mainstream & the frontier edge: see its own chapter | Deep Research; Kimi K2/K2.5 |
| LoRA/QLoRA/DoRA | Mainstream in the applied layer; labs post-train flagships full-parameter | arXiv:2106.09685 |
| Self-play | Experimental: no confirmed frontier-lab production use as of this pass | |

Read the family chapters for each method’s mechanism, “what data it eats”, and when to reach for it.