The family map

The map groups methods into three families by the signal they learn from, plus one orthogonal axis (PEFT) that is a delivery mechanism, not a learning signal. Every named method is a fixed preset on the on/off-policy axis from Foundations; you don’t freely combine axes.

graph TD
  ROOT["Post-training<br/>push mass toward good behavior"]
  ROOT --> IM["IMITATION<br/>signal = demonstrations"]
  ROOT --> PR["PREFERENCE<br/>signal = comparisons A≻B"]
  ROOT --> RL["REINFORCEMENT<br/>signal = reward / verifier"]

  IM --> SFT["SFT"]
  IM --> DIST["Distillation<br/>off-policy / on-policy"]
  IM --> RS["Rejection-sampling FT<br/>= 'RL without RL'"]

  PR --> RLHF["RLHF (RM + PPO)"]
  PR --> DPO["DPO · KTO · IPO · ORPO · SimPO"]

  RL --> PPO["PPO"]
  RL --> GRPO["GRPO → GSPO / DAPO"]
  RL --> RLVR["RLVR (verifiable reward)"]
  RL --> AG["Agentic / multi-turn RL"]

  PEFT["PEFT: LoRA · QLoRA · DoRA<br/>a HOW, applied to any of the above"]
  ROOT -.delivery.-> PEFT
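
The same map can be written down as data, which makes the "PEFT is a how, not a signal" point concrete. What follows is a minimal sketch, not part of any library: the names `Family`, `Delivery`, `FAMILY_OF`, and `Stage` are hypothetical, and `Delivery.FULL` is an assumed label for ordinary full-parameter training. The method names and signals are copied from the graph above.

```python
from dataclasses import dataclass
from enum import Enum


class Family(Enum):
    """The three learning signals from the map (values are the signal labels)."""
    IMITATION = "demonstrations"
    PREFERENCE = "comparisons A≻B"
    REINFORCEMENT = "reward / verifier"


class Delivery(Enum):
    """Orthogonal axis: how the update is applied, not what it learns from."""
    FULL = "full-parameter"  # assumed label for the non-PEFT default
    LORA = "LoRA"
    QLORA = "QLoRA"
    DORA = "DoRA"


# Each named method belongs to exactly one family, mirroring the graph edges.
FAMILY_OF = {
    "SFT": Family.IMITATION,
    "Distillation": Family.IMITATION,
    "Rejection-sampling FT": Family.IMITATION,
    "RLHF (RM + PPO)": Family.PREFERENCE,
    "DPO": Family.PREFERENCE,
    "KTO": Family.PREFERENCE,
    "IPO": Family.PREFERENCE,
    "ORPO": Family.PREFERENCE,
    "SimPO": Family.PREFERENCE,
    "PPO": Family.REINFORCEMENT,
    "GRPO": Family.REINFORCEMENT,
    "GSPO": Family.REINFORCEMENT,
    "DAPO": Family.REINFORCEMENT,
    "RLVR": Family.REINFORCEMENT,
    "Agentic / multi-turn RL": Family.REINFORCEMENT,
}


@dataclass(frozen=True)
class Stage:
    """One post-training stage: a named method plus a delivery mechanism."""
    method: str
    delivery: Delivery = Delivery.FULL

    @property
    def family(self) -> Family:
        # The family depends only on the method, never on the delivery.
        return FAMILY_OF[self.method]

    @property
    def signal(self) -> str:
        return self.family.value


if __name__ == "__main__":
    for stage in (Stage("DPO"), Stage("DPO", Delivery.LORA), Stage("GRPO", Delivery.QLORA)):
        print(f"{stage.method:6s} via {stage.delivery.value:15s} learns from {stage.signal}")
```

Swapping `Delivery.FULL` for `Delivery.LORA` changes nothing about `family` or `signal`, which is the sense in which PEFT hangs off the root with a dotted edge rather than sitting inside a family.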

What’s actually load-bearing in 2026 (verified)

The status column below comes from a fresh Exa pass over model tech reports and lab blogs (2025–2026), not from recall. Sources are cited per row throughout the method chapters.

Method	2026 status	Anchor
SFT	Mainstream, universal — stage 0 of every recipe	—
Off-policy distillation	Mainstream — DeepSeek distills R1 → V3/V3.2 as a named stage	arXiv:2412.19437, arXiv:2512.02556
On-policy distillation	Niche / promising — NOT yet confirmed in any frontier lab’s production recipe	GKD arXiv:2306.13649; Thinking Machines blog 2025-10-27
Rejection-sampling FT	Mainstream — named stage in Llama 3 & DeepSeek-R1	arXiv:2407.21783, arXiv:2501.12948
RLHF (RM+PPO)	Mainstream at proprietary labs (Gemini 2.5, GPT-5)	arXiv:2507.06261
DPO	Mainstream — Llama 3’s offline preference stage	arXiv:2407.21783
IPO/KTO/ORPO/SimPO	Niche — OSS/research tooling; no flagship names them as primary	arXiv:2503.11701
GRPO	Mainstream — the reasoning-RL default	arXiv:2402.03300
RLVR	Mainstream — arguably the defining 2025-26 technique	arXiv:2501.12948, arXiv:2507.06261
GSPO (Qwen3)	Mainstream — first GRPO-successor with a flagship behind it	arXiv:2507.18071
DAPO	OSS-tooling mainstream; ByteDance-origin, not confirmed elsewhere	arXiv:2503.14476
PRM (process reward)	Niche — explicitly rejected for R1 (step-level reward hacking)	arXiv:2501.12948
Rubric/critic outcome reward	Mainstream & growing — the real replacement for PRM	arXiv:2507.06261
Agentic / multi-turn RL	Mainstream & the frontier edge — see its own chapter	Deep Research; Kimi K2/K2.5
LoRA/QLoRA/DoRA	Mainstream in the applied layer; labs post-train flagships full-parameter	arXiv:2106.09685
Self-play	Experimental — no confirmed frontier-lab production use as of this pass	—

Read the family chapters for each method’s mechanism, “what data it eats”, and when to reach for it.
