The one axis that predicts everything

If you internalize one thing, make it this: on-policy vs off-policy. It predicts which methods can fix which failures, and it’s the reason a colleague’s SFT run can move the eval by two points and stall.

Definitions (precise)

  • Off-policy: training targets are sampled from a distribution other than the model’s current policy π_θ — a human, a teacher model, a frozen dataset. The model raises the likelihood of sequences it did not generate.
  • On-policy: the training data is sampled from π_θ itself (the current weights), then scored/labeled. The model learns from its own rollouts.

This is standard RL vocabulary, not a framing I invented — see any policy-gradient treatment; the LLM-specific consequences are laid out formally in the imitation-learning reduction below.

Why off-policy imitation is structurally blind to execution gaps

The mechanism, as a flow:

graph LR
  A["Train on the teacher's states"] --> B["Deploy: π_θ drives,<br/>visits π_θ's OWN states"]
  B --> C["One slip → a state the<br/>teacher never visited"]
  C --> D["No training signal there<br/>→ error compounds"]
  D --> B

This isn’t hand-waving; it’s a theorem. Ross, Gordon & Bagnell, “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning” (AISTATS 2011, arXiv:1011.0686), Thm 2.1: a behavior-cloned policy with per-step error ε on the expert’s state distribution incurs total cost bounded by J(π̂) ≤ J(π*) + ε·T², which is quadratic in the horizon T. The reason: a single deviation moves you to states off the expert’s distribution where you have no supervision, and errors accrue at up to unit cost for the rest of the episode. Their DAgger correction (aggregate data from the learner’s own induced state distribution) restores a near-linear O(ε·T) bound.
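To get a feel for the quadratic-versus-linear gap, here is a minimal numeric sketch. It uses a deliberately simplified toy model, which is an assumption of this example and not the construction in the paper: the learner slips with probability ε per step while still on the expert's states, and in the unrecoverable case pays unit cost on every step from the first slip onward. The function names are hypothetical.

```python
def bc_expected_cost(eps: float, horizon: int) -> float:
    """Expected cost when the first slip is never recovered from.

    Toy assumption: while on the expert's states the learner slips with
    probability eps per step; from the first slip onward it pays cost 1
    on every step until the episode ends.
    """
    # P(a slip has happened by step t) = 1 - (1 - eps)**t, and each such
    # step costs 1, so the expected cost is the sum of those probabilities.
    return sum(1.0 - (1.0 - eps) ** t for t in range(1, horizon + 1))


def recoverable_expected_cost(eps: float, horizon: int) -> float:
    """Expected cost when each slip costs 1 but is corrected immediately."""
    return eps * horizon


if __name__ == "__main__":
    eps = 0.01
    for horizon in (10, 100, 1000):
        print(
            f"T={horizon:5d}  "
            f"unrecoverable={bc_expected_cost(eps, horizon):9.2f}  "
            f"recoverable={recoverable_expected_cost(eps, horizon):7.2f}  "
            f"eps*T^2={eps * horizon ** 2:10.2f}"
        )
```

In this toy model the unrecoverable cost stays below ε·T² and, for any horizon longer than one step, above ε·T. It also saturates at T, because per-step cost is capped at 1, so the quadratic term is an upper bound and not a prediction.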

Translate to your setting: an execution failure happens, by definition, in a state π_θ reaches on its own. Off-policy data is, structurally, data about states π_θ doesn’t reach. So off-policy SFT optimizes correctness in the wrong region. On a long agentic CTF trajectory (large T), the T² vs T gap is the whole story.

The fix, stated once

Make the data on-policy: sample from π_θ, then score those samples. Every method that “fixes execution” — rejection-sampling FT, GRPO/RLVR, on-policy distillation — is a different way of doing exactly that. The on-policy distillation line makes the connection explicit: it exists specifically to kill the train/inference mismatch of fixed-dataset KD by sampling from the student during training (GKD, Agarwal et al., arXiv:2306.13649).
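The "sample from π_θ, then score" loop can be sketched in its simplest form, rejection-sampling fine-tuning data collection. This is an illustration under stated assumptions: `Policy`, `Scorer`, `collect_on_policy`, and the toy policy and scorer are hypothetical names invented here, and the actual fine-tuning step on the kept samples is omitted.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical interfaces, assumed for this sketch only.
Policy = Callable[[str, random.Random], str]  # prompt -> one sampled completion
Scorer = Callable[[str, str], float]          # (prompt, completion) -> reward


def collect_on_policy(
    policy: Policy,
    scorer: Scorer,
    prompts: List[str],
    n_samples: int = 4,
    threshold: float = 1.0,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Sample completions from the current policy and keep those that pass."""
    rng = random.Random(seed)
    kept: List[Tuple[str, str]] = []
    for prompt in prompts:
        for _ in range(n_samples):
            completion = policy(prompt, rng)  # drawn from the model itself
            if scorer(prompt, completion) >= threshold:
                kept.append((prompt, completion))
    return kept


def toy_policy(prompt: str, rng: random.Random) -> str:
    """Stand-in for the model: right about 30% of the time."""
    return "4" if rng.random() < 0.3 else "5"


def toy_scorer(prompt: str, completion: str) -> float:
    """Stand-in for a verifier: reward 1.0 for the correct answer."""
    return 1.0 if completion == "4" else 0.0


if __name__ == "__main__":
    data = collect_on_policy(toy_policy, toy_scorer, ["2+2"] * 50)
    print(f"kept {len(data)} of {50 * 4} on-policy samples")
```

Every kept pair is something the policy itself produced, so the training set covers states the policy actually reaches. GRPO/RLVR and on-policy distillation replace the hard filter with a weighted or teacher-scored update, but the sampling side has the same shape.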

The corollary you’ll use constantly

  • Failure lives in π_θ’s own distribution (it solves sometimes, fails often) → on-policy method.
  • The correct behavior is absent from π_θ entirely (never appears at high N) → nothing on-policy to reinforce → you must inject off-policy (SFT / teacher data / a tool).
  • The behavior exists but is mis-ranked → preference method.
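The three bullets above can be written as a small routing function. Treat it as a sketch: the names `Route` and `route` are hypothetical, and `mis_ranked` is a flag the caller has to supply, since the text gives no test for it. It also assumes `n_samples` is large enough that zero successes really means the behavior is absent.

```python
from enum import Enum


class Route(Enum):
    ON_POLICY = "on-policy method"
    OFF_POLICY = "off-policy injection (SFT / teacher data / a tool)"
    PREFERENCE = "preference method"


def route(successes: int, n_samples: int, mis_ranked: bool = False) -> Route:
    """Pick a method family from sampling results on a single task.

    successes:  how many of n_samples rollouts from the current policy solved it.
    mis_ranked: caller-supplied judgment that the right behavior appears in
                samples but the model prefers a worse one (an assumption here).
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0 <= successes <= n_samples:
        raise ValueError("successes must lie between 0 and n_samples")
    if successes == 0:
        # Nothing on-policy to reinforce: the behavior has to come from outside.
        return Route.OFF_POLICY
    if mis_ranked:
        return Route.PREFERENCE
    # The policy solves the task sometimes: reinforce its own successes.
    return Route.ON_POLICY


if __name__ == "__main__":
    print(route(successes=3, n_samples=64).value)
    print(route(successes=0, n_samples=64).value)
    print(route(successes=20, n_samples=64, mis_ranked=True).value)
```

The zero-successes check comes first on purpose: if the behavior never appears, neither reinforcement nor re-ranking has anything to work with.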

That routing is the spine of The decision.