The decision

The whole book collapses to one diagnostic and its branches. The routing question — does the correct action ever appear in π_θ’s own outputs at high N? — is what separates a knowledge gap (inject off-policy) from an execution gap (on-policy) from a ranking gap (preference). The principle underneath is “GRPO amplifies existing capabilities, SFT replaces them” (arXiv:2507.10616): you can only reinforce what already fires.

That routing question has a sharper, task-structure-dependent version once you sample at high N and track when in the episode the correct action shows up — an execution gap can hide an exploration gap underneath it. The tree below now branches on that.

The rigorous version of this page lives in From behavioral audit to training signal. That chapter runs all 5 BSides patterns through gap-type → training-signal → verification-check, cites the pass@k / Pass@(k,T) / Cover@τ instrumentation this page’s routing question is a shorthand for, and is where you go before trusting any single branch below as final for a given challenge.

graph TD
  Q0["Does the correct action ever appear in π_θ's<br/>own outputs — even rarely — at high N?"]
  Q0 -->|"Never, not once"| K["KNOWLEDGE gap"]
  Q0 -->|"Sometimes (solves occasionally)"| E{"Have a model that's<br/>actually better at your CTFs?"}
  Q0 -->|"Knows it, mis-picks / quits early"| R["RANKING gap"]

  K --> KM["Inject OFF-POLICY:<br/>SFT / teacher data / a TOOL<br/>(RL can't cheaply conjure it)"]
  E -->|Yes| DIST["On-policy distillation<br/>dense · ~10× cheaper than RL<br/>(promising, not yet lab-proven)"]
  E -->|"No, only a verifier"| Q1{"Is the winning path single-shot,<br/>or compositional / sequentially-gated<br/>(enumeration must land before<br/>exploitation is even visible)?"}
  R --> RM["DPO / KTO<br/>(KTO for unpaired good/bad logs)"]

  Q1 -->|"Single-shot — recon isn't gating"| RS["Rejection-sampling SFT<br/>→ GRPO when entropy collapses"]
  Q1 -->|"Sequentially-gated —<br/>PTES pattern 4: 62% of<br/>failures stall in exploitation"| XG["EXPLORATION gap:<br/>reaches the right region,<br/>then collapses / never enumerates"]

  XG --> XGM["On-policy RL FIRST, not more SFT —<br/>matched-data SFT regresses this exact<br/>subset (net −4) while RL expands it<br/>(net +4): self-directed exploration<br/>during rollout is the causal ingredient,<br/>not demonstrations"]

  classDef your fill:#132b22,stroke:#34d399,color:#eafaf3;
  class E,Q1,RS,XG your;

Reading the branches

  • Knowledge gap — nothing on-policy to reinforce. Inject by demonstration, or cheaper, put the missing fact in a tool (“knowledge in tools, not weights” — llmresearch-handbook.md rule 5). Standard RLVR won’t cheaply create it — see the contested boundary in Contested edges (Yue et al., arXiv:2504.13837).

  • Execution gap + stronger teacher → on-policy distillation: dense signal on your own rollouts (GKD arXiv:2306.13649); flagged promising, not lab-confirmed.

  • Execution gap + only a verifier, single-shot path: the common case. Rejection-sampling SFT on your ~100–200 solves → graduate to GRPO/RLVR when policy entropy collapses (arXiv:2504.11343).

  • Exploration gap — an execution gap that’s actually sequentially-gated [E][L][N] — the naive test (“solves occasionally at high N”) says execution gap, but if the winning path requires correct enumeration at turn 5 before turn 40’s exploit is even visible, that’s a different beast. Zhai et al.’s Pass@(k,T) analysis, arXiv:2604.14877 (2026-04-16, single-group, promising) found that on this exact task shape (“Category C” — compositional, sequentially-gated retrieval), the RL pass-curve pulls away from the base curve as k grows — real capability expansion — while matched-data SFT on the same task regresses it (net −4 vs RL’s net +4). The causal factor they isolate is self-directed exploration during on-policy rollout, not exposure to more demonstrations. Practically: don’t spend your rejection-sampling budget flattening this subset first — it needs GRPO’s exploration before SFT saturates it, the opposite ordering from the single-shot branch above.

    Designed to fix: patterns 1, 2, 4 — tool-space under-exploration (87.7% curl/shell bypass), no PTES enumeration before pivoting (82% pivot-after-one-failure), and the 62%-of-failures-stall-in-exploitation split. All three are the same mechanism at different granularities: Cui et al.’s entropy law, arXiv:2505.22617 (R = -a·exp(H) + b) predicts the policy trades away exactly this enumeration/tool-diversity budget for reward as entropy collapses — so plain GRPO on this subset needs DAPO-style clip-higher/dynamic sampling (arXiv:2503.14476) from the start, not as a later patch. Full technique-by-pattern mapping (ToolRL, GiGPO, RL-PLUS, NuRL, plus Pentest-R1 — academic, cited for context only, not a basis) is in the diagnosis chapter, not repeated here.

    Contested / not settled: this is the sharpened, task-conditional version of the “RL can’t create capability” debate in Contested edges §1 — Yue et al.’s crossover result (elicit-not-expand) holds on the static/independent-retrieval task shape; Zhai et al.’s result (genuine expansion) holds on the compositional/sequentially-gated shape. Same instrument, opposite conclusion, and the split is the falsifiable variable — segment your own 1000 challenges by this criterion before trusting either reading wholesale.

  • Ranking gap → DPO/KTO; KTO fits your unpaired solved/failed logs (arXiv:2402.01306).
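The branches above can be read as a small routing function. The sketch below is illustrative only: `ChallengeAudit`, its field names, and `route` are hypothetical, and each boolean stands in for a judgment you would make from your own high-N rollouts, not something the code can measure for you.

```python
from dataclasses import dataclass


@dataclass
class ChallengeAudit:
    """Hypothetical per-challenge audit record; all field names are assumptions."""
    correct_action_seen: bool   # correct action appears at least once at high N
    mis_ranked: bool            # model knows it, but mis-picks or quits early
    has_stronger_teacher: bool  # a model that is actually better at these CTFs
    sequentially_gated: bool    # enumeration must land before the exploit is visible


def route(audit: ChallengeAudit) -> str:
    """Walk the decision tree and return the suggested training signal."""
    if not audit.correct_action_seen:
        # Nothing on-policy to reinforce.
        return "knowledge gap: inject off-policy (SFT, teacher data, or a tool)"
    if audit.mis_ranked:
        return "ranking gap: DPO/KTO"
    if audit.has_stronger_teacher:
        return "execution gap: on-policy distillation"
    if audit.sequentially_gated:
        return "exploration gap: on-policy RL first, not more SFT"
    return "execution gap: rejection-sampling SFT, then GRPO when entropy collapses"


if __name__ == "__main__":
    audit = ChallengeAudit(
        correct_action_seen=True,
        mis_ranked=False,
        has_stronger_teacher=False,
        sequentially_gated=True,
    )
    print(route(audit))
```

The order of the checks mirrors the tree: the knowledge-gap test comes first because every other branch assumes the correct action fires at least occasionally.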

One prerequisite before any of this

Your 10–20% aggregate is a k=1 portfolio statistic, not a per-challenge pass rate. Before choosing a branch, run pass@k per challenge and bucket by difficulty — the 30–60% band is a per-group property that GRPO needs, and a challenge that “solved once” may be a 30% target that got lucky (a prime RL candidate), not a done deal (lessons/post-training/rl-candidate-selection-from-passk.md, shared memory). Diagnose per-challenge, then route.
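A minimal sketch of the per-challenge check, assuming you have `n` sampled rollouts per challenge with `c` of them correct. It uses the commonly used unbiased pass@k estimator, 1 − C(n−c, k)/C(n, k). The bucket names and the `bucket` helper are hypothetical; the 30–60% band boundaries are taken from the text above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples with c correct (requires k <= n)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Every size-k subset contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def bucket(n: int, c: int, low: float = 0.30, high: float = 0.60) -> str:
    """Bucket a challenge by its per-rollout solve rate (hypothetical labels)."""
    if c == 0:
        return "never_solved"
    rate = c / n
    if rate < low:
        return "below_band"
    if rate <= high:
        return "in_band"
    return "above_band"


if __name__ == "__main__":
    # Illustrative counts only: challenge id -> (samples drawn, correct samples).
    results = {"chal-a": (64, 0), "chal-b": (64, 3), "chal-c": (64, 26)}
    for name, (n, c) in results.items():
        print(name, bucket(n, c), round(pass_at_k(n, c, 8), 3))
```

A challenge that was solved once in a k=1 run can land in any of these buckets once you sample more, which is why the bucketing has to happen before routing.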

The single-shot-vs-sequentially-gated split above needs the same per-challenge treatment: label each challenge by whether its winning path is recon-gated (turn-5 enumeration must land before turn-40 exploit is reachable) or not, before running pass@k — the split determines which axis of the tree you’re even on. Two cheap refinements to the pass@k check, both eval-only (no training change): Cover@τ (Dragoi et al., arXiv:2510.08325) flags challenges where a high pass@64 is really “guessable by brute force” rather than genuinely reliable — don’t rejection-sample SFT on those, you’ll just teach confident guessing; and running the base model’s own pass@k as a control (per Yue et al., arXiv:2504.13837) tells you whether a claimed post-SFT gain on a given challenge is elicitation or noise before you credit it to the pipeline.
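The base-model control can be sketched as a comparison of pass@k curves over one segment of challenges, for example the sequentially-gated ones. This is an assumption-heavy illustration, not the procedure from either cited paper: `mean_pass_at_k`, `read_gain`, the labels, and the `eps` threshold are all hypothetical, and Cover@τ is deliberately not implemented here.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples with c correct (assumes k <= n)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(n: int, correct: Sequence[int], k: int) -> float:
    """Average pass@k over a segment; correct[i] is challenge i's correct count."""
    return sum(pass_at_k(n, c, k) for c in correct) / len(correct)


def read_gain(n: int, base: Sequence[int], tuned: Sequence[int],
              k_low: int = 1, k_high: int = 64, eps: float = 0.02) -> str:
    """Compare tuned vs base at low and high k (eps is an arbitrary threshold)."""
    gap_low = mean_pass_at_k(n, tuned, k_low) - mean_pass_at_k(n, base, k_low)
    gap_high = mean_pass_at_k(n, tuned, k_high) - mean_pass_at_k(n, base, k_high)
    if gap_low > eps and gap_high > eps:
        return "expansion-like: tuned stays ahead at high k"
    if gap_low > eps:
        return "elicitation-like: base catches up or crosses over at high k"
    return "no clear gain over base"


if __name__ == "__main__":
    # Illustrative counts only: correct rollouts out of n=64 per challenge.
    base_counts = [2, 1, 0, 3]
    tuned_counts = [20, 10, 0, 0]
    print(read_gain(64, base_counts, tuned_counts))
```

Run it separately on the single-shot and sequentially-gated segments; a pooled curve can hide the split that the tree branches on.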

The interactive version

The same tree, clickable, is in The 5-minute journey (final section) — answer it for your own failing challenges.