The family map
Post-training methods fall into three families, grouped by the signal they learn from, plus one orthogonal axis (PEFT) that is a delivery mechanism, not a learning signal. Every named method is a fixed preset on the on/off-policy axis from Foundations; you don’t freely combine axes.
graph TD
  ROOT["Post-training<br/>push mass toward good behavior"]
  ROOT --> IM["IMITATION<br/>signal = demonstrations"]
  ROOT --> PR["PREFERENCE<br/>signal = comparisons A≻B"]
  ROOT --> RL["REINFORCEMENT<br/>signal = reward / verifier"]
  IM --> SFT["SFT"]
  IM --> DIST["Distillation<br/>off-policy / on-policy"]
  IM --> RS["Rejection-sampling FT<br/>= 'RL without RL'"]
  PR --> RLHF["RLHF (RM + PPO)"]
  PR --> DPO["DPO · KTO · IPO · ORPO · SimPO"]
  RL --> PPO["PPO"]
  RL --> GRPO["GRPO → GSPO / DAPO"]
  RL --> RLVR["RLVR (verifiable reward)"]
  RL --> AG["Agentic / multi-turn RL"]
  PEFT["PEFT: LoRA · QLoRA · DoRA<br/>a HOW, applied to any of the above"]
  ROOT -.delivery.-> PEFT
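One way to make the map concrete is to write it down as data. The sketch below is illustrative only: the record shapes (`Demonstration`, `Comparison`, `RewardTask`), the `FAMILY` table, and the `Recipe` class are hypothetical names invented here, not part of any library or of the method chapters. It assumes each family consumes one kind of training record, and it treats PEFT as a separate optional field so that changing it never changes the family.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Signal(Enum):
    """The three learning signals from the map."""
    IMITATION = "demonstrations"
    PREFERENCE = "comparisons"
    REINFORCEMENT = "reward / verifier"


# Hypothetical record shapes: one training example per signal type.
@dataclass(frozen=True)
class Demonstration:
    # Imitation: learn to reproduce a target response.
    prompt: str
    response: str


@dataclass(frozen=True)
class Comparison:
    # Preference: `chosen` is preferred over `rejected` for the same prompt.
    prompt: str
    chosen: str
    rejected: str


@dataclass(frozen=True)
class RewardTask:
    # Reinforcement: a reward function or verifier scores sampled responses.
    prompt: str
    reward_fn: Callable[[str], float]


# Method -> family, transcribed from the map (not an exhaustive registry).
FAMILY = {
    "SFT": Signal.IMITATION,
    "Distillation": Signal.IMITATION,
    "Rejection-sampling FT": Signal.IMITATION,
    "RLHF": Signal.PREFERENCE,
    "DPO": Signal.PREFERENCE,
    "PPO": Signal.REINFORCEMENT,
    "GRPO": Signal.REINFORCEMENT,
    "RLVR": Signal.REINFORCEMENT,
    "Agentic RL": Signal.REINFORCEMENT,
}


@dataclass(frozen=True)
class Recipe:
    method: str
    peft: Optional[str] = None  # e.g. "LoRA"; None means full-parameter

    @property
    def signal(self) -> Signal:
        # The family depends only on the method, never on the PEFT choice.
        return FAMILY[self.method]


full = Recipe("DPO")
lora = Recipe("DPO", peft="LoRA")
print(full.signal, lora.signal)
```

Both recipes report the same signal, which is the point of calling PEFT a delivery mechanism: it changes which parameters are updated, not what the model learns from.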
What’s actually load-bearing in 2026 (verified)
The status column below comes from a fresh Exa search pass over model tech reports and lab blogs (2025–2026), not from recall. Sources are cited per row throughout the method chapters.
| Method | 2026 status | Anchor |
|---|---|---|
| SFT | Mainstream, universal — stage 0 of every recipe | — |
| Off-policy distillation | Mainstream — DeepSeek distills R1 → V3/V3.2 as a named stage | arXiv:2412.19437, arXiv:2512.02556 |
| On-policy distillation | Niche / promising — NOT yet confirmed in any frontier lab’s production recipe | GKD arXiv:2306.13649; Thinking Machines blog 2025-10-27 |
| Rejection-sampling FT | Mainstream — named stage in Llama 3 & DeepSeek-R1 | arXiv:2407.21783, arXiv:2501.12948 |
| RLHF (RM+PPO) | Mainstream at proprietary labs (Gemini 2.5, GPT-5) | arXiv:2507.06261 |
| DPO | Mainstream — Llama 3’s offline preference stage | arXiv:2407.21783 |
| IPO/KTO/ORPO/SimPO | Niche — OSS/research tooling; no flagship names them as primary | arXiv:2503.11701 |
| GRPO | Mainstream — the reasoning-RL default | arXiv:2402.03300 |
| RLVR | Mainstream — arguably the defining 2025-26 technique | arXiv:2501.12948, arXiv:2507.06261 |
| GSPO (Qwen3) | Mainstream — first GRPO-successor with a flagship behind it | arXiv:2507.18071 |
| DAPO | OSS-tooling mainstream; ByteDance-origin, not confirmed elsewhere | arXiv:2503.14476 |
| PRM (process reward) | Niche — explicitly rejected for R1 (step-level reward hacking) | arXiv:2501.12948 |
| Rubric/critic outcome reward | Mainstream & growing — the real replacement for PRM | arXiv:2507.06261 |
| Agentic / multi-turn RL | Mainstream & the frontier edge — see its own chapter | Deep Research; Kimi K2/K2.5 |
| LoRA/QLoRA/DoRA | Mainstream in the applied layer; labs post-train flagships full-parameter | arXiv:2106.09685 |
| Self-play | Experimental — no confirmed frontier-lab production use as of this pass | — |
Read the family chapters for each method’s mechanism, “what data it eats”, and when to reach for it.