Imitation — SFT · distillation · rejection sampling

Signal = demonstrations. Objective = cross-entropy on target tokens. What differs across the three is whose trajectories you imitate, and that on/off-policy axis does all the work.

Per-method template: what · data it eats · on/off-policy · when · gotcha · cite.

SFT (Supervised Fine-Tuning)

  • What: MLE on (prompt → target); for agents, target = a full trajectory (What “data” means). The standard reference for SFT on human demonstrations as a post-training stage is InstructGPT (arXiv:2203.02155).
  • Eats: curated/human/teacher demonstrations.
  • Policy: off-policy (targets aren’t π_θ’s samples).
  • When: inject a capability or format the model lacks; establish a cold-start before RL.
  • Gotcha: off-policy ⇒ blind to execution gaps (the εT² compounding, Foundations), and it overwrites existing behavior rather than amplifying it. This is the “SFT replaces capabilities” half of “Scalpel vs Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them” (arXiv:2507.10616). Worse on small models (less capacity to absorb new targets without forgetting).
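
A minimal sketch of “cross-entropy on target tokens”, assuming NumPy arrays stand in for model outputs. The function name sft_loss and the mask convention (1 = target token, 0 = prompt token) are illustrative assumptions, not taken from any particular training library.

```python
import numpy as np

def sft_loss(logits, token_ids, target_mask):
    """Mean cross-entropy over target tokens only (prompt tokens are masked out).

    logits:      (T, V) next-token logits; row t scores token_ids[t]
    token_ids:   (T,)   tokens to be predicted (prompt followed by target)
    target_mask: (T,)   1 where the token belongs to the target, 0 for the prompt
    """
    logits = np.asarray(logits, dtype=np.float64)
    token_ids = np.asarray(token_ids)
    mask = np.asarray(target_mask, dtype=np.float64)

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Negative log-likelihood of each reference token.
    nll = -log_probs[np.arange(len(token_ids)), token_ids]

    # Average only over target positions; the prompt contributes no gradient signal.
    return float((nll * mask).sum() / mask.sum())
```

The target tokens here come from a fixed dataset, never from the model's own samples, which is what makes the update off-policy.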

Distillation (a kind of SFT — the teacher supplies the demonstrations)

Knowledge distillation originates with Hinton et al., arXiv:1503.02531. Two variants, and the split is the on/off-policy axis:

  • Off-policy distillation = SFT on the teacher’s completions. Mainstream: DeepSeek transfers R1’s reasoning into V3/V3.2 as a named post-training stage (DeepSeek-V3, arXiv:2412.19437; V3.2, arXiv:2512.02556); “distilled from GPT-4/R1” datasets are how most small OSS models get capability.
  • On-policy distillation = student samples its own rollouts, teacher grades them densely (reverse-KL per token). Fixes the fixed-dataset train/inference mismatch (GKD, arXiv:2306.13649: +90% relative on GSM8K vs supervised-KD). Thinking Machines’ 2025 write-up reports Qwen3-8B ← Qwen3-32B reaching ~70% AIME’24 in 150–200 steps at ~9–30× less compute than RL-from-scratch (thinkingmachines.ai, 2025-10-27).
    • Honest status: niche / promising, not lab-confirmed. As of a 2026 pass, no frontier lab (OpenAI/Anthropic/Google/DeepSeek/Qwen) has stated on-policy distillation as its production recipe — evidence is GKD + one lab blog. Treat the efficiency numbers as directional, not settled. (This corrects an earlier over-strong “sleeper” framing.)
  • Eats: teacher completions (off) / teacher-graded student rollouts (on). Requires a teacher genuinely better at your task.
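
The two distillation losses can be sketched as per-token KL divergences in opposite directions. This is a rough illustration under stated assumptions: it uses the full-vocabulary KL at every position, and the helper names (log_softmax, per_token_kl) are hypothetical. The cited write-ups may estimate the divergence differently, for example from sampled tokens only.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def per_token_kl(student_logits, teacher_logits, reverse=True):
    """KL divergence at each position; inputs are (T, V), output is (T,).

    reverse=True  -> KL(student || teacher), the "reverse KL" named above
    reverse=False -> KL(teacher || student), the forward KL of soft-label KD
    """
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    if reverse:
        return (np.exp(log_s) * (log_s - log_t)).sum(axis=-1)
    return (np.exp(log_t) * (log_t - log_s)).sum(axis=-1)
```

The function itself does not decide on/off-policy. That depends on where the T positions come from: teacher completions give the off-policy case, while student rollouts scored by the teacher give the on-policy case.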

Rejection-sampling FT (“RL without RL”)

  • What: sample N completions from π_θ, keep verifier-accepted winners, SFT on them; iterate. The lineage: STaR (arXiv:2203.14465), ReST (arXiv:2308.08998), RAFT (arXiv:2304.06767), RFT (arXiv:2308.01825).
  • Eats: your own verifier-passed trajectories — which for a CTF harness with a flag check you already generate.
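  • Policy: on-policy data, SFT update. It’s the first-order special case of policy gradient: with reward ∈ {0,1}, the REINFORCE gradient E[r ∇log π_θ] reduces to the SFT gradient on the winners, since zero-reward samples contribute nothing.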
  • When: an execution gap, and you have a verifier but no stronger teacher. Cheapest on-policy move; reuses your SFT pipeline.
  • Gotcha (measurable graduation trigger): positives-only ⇒ policy-entropy collapse — fast early gains then plateau. GRPO’s real edge over it is not group-normalization (ablated → negligible) but discarding all-same-reward groups (implicit filtering). See “A Minimalist Approach to LLM Reasoning: From Rejection Sampling to Reinforce” (arXiv:2504.11343). Watch entropy; when it collapses, graduate to GRPO/RLVR.
  • Production proof: explicit named stage in Llama 3 (arXiv:2407.21783) and DeepSeek-R1 (~800K rejection-sampled examples between its two RL stages, arXiv:2501.12948).
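
One round of the loop above, plus the entropy check from the gotcha, might look like the following sketch. sample_fn and verify_fn are hypothetical callables standing in for the current policy's sampler and the verifier (for example a flag check); mean_token_entropy assumes access to the policy's logits on held-out rollouts. None of these names come from a specific library.

```python
import numpy as np

def rejection_sampling_round(prompts, sample_fn, verify_fn, n_samples=8):
    """Collect verifier-accepted (prompt, completion) pairs for one SFT round.

    sample_fn(prompt) -> completion          (hypothetical: samples from the current policy)
    verify_fn(prompt, completion) -> bool    (hypothetical: e.g. a flag check)
    """
    dataset, pass_rates = [], {}
    for prompt in prompts:
        samples = [sample_fn(prompt) for _ in range(n_samples)]
        winners = [c for c in samples if verify_fn(prompt, c)]
        pass_rates[prompt] = len(winners) / n_samples
        # Positives only: rejected samples are dropped, not penalized.
        dataset.extend((prompt, c) for c in winners)
    return dataset, pass_rates

def mean_token_entropy(logits):
    """Average Shannon entropy (nats) of the next-token distribution; logits are (T, V)."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_p = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    per_position = -(np.exp(log_p) * log_p).sum(axis=-1)
    return float(per_position.mean())
```

After each round, the dataset goes through the existing SFT pipeline and the next round samples from the updated policy. Logging mean_token_entropy across rounds gives a simple signal for the collapse described above; the threshold at which to switch methods is left open here because the source does not give one.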