Diagnosing the gap — a scientific framework
The question this chapter answers: is there an industry-standard, peer-defensible way to prove a failure is a KNOWLEDGE gap, not an EXECUTION gap, not an EXPLORATION gap? Short answer, upfront: no single accepted instrument exists. There is no ISO-9001 for capability diagnosis. What exists is a converging set of measurement techniques, each independently validated, that — combined into one protocol — give you a defensible, falsifiable, split verdict. That protocol is what this chapter hands you.
Every arXiv id below was verified live against arxiv.org/abs/<id> on 2026-07-02 (project research pass, artifacts/overnight-rl-sweep/research/diagnosis.md). Confidence tags follow that pass: [HIGH] peer-reviewed/heavily reproduced, [MED] coherent preprint not yet contested, [LOW] single small-N preprint.
Bottom line up front
For your ~1000-challenge, ~100-turn, ground-truth-flag-verified CTF portfolio at ~10–20% k=1 solve rate, the honest, defensible answer will not be a single sentence. It will be: X% of the currently-failing challenges are a knowledge gap, Y% are an execution/performance-floor gap fixable by elicitation, Z% are an exploration gap that needs on-policy RL, not more demonstrations — and here is the measurement that sorted each challenge into its bucket. That heterogeneous, per-challenge-subtype verdict is itself the scientifically credible output — collapsing it into “it’s execution not knowledge” is exactly the move a skeptical reviewer will catch you on.
1. Three gap types, defined precisely
Ground the vocabulary in the 60-year-old linguistics/cognitive-science split this whole ML debate re-derives without citing: competence (what the system can in principle produce) vs. performance (what it actually produces under real constraints — prompting, memory, self-verification, time) — Firestone, “Performance vs. competence in human–machine comparisons,” PMC7604508 [HIGH], and its LLM-era instance splitting formal competence (linguistic surface mastery) from functional competence (using it in the world) — Mahowald et al., arXiv:2301.06627 [HIGH].
| Gap type | Competence/performance framing | Operational test | Fix lever |
|---|---|---|---|
| Knowledge | Competence ceiling — genuinely absent | Correct action never appears in any of N samples, at any N, on any checkpoint | Inject off-policy: SFT on demonstrations, a stronger teacher, or a tool (knowledge-in-tools rule) |
| Execution | Performance floor — competence present, elicitation fails | Correct action appears at moderate–large N, but pass@1 doesn’t convert it; prompting or a few SFT demos recover it | Cheap: better scaffolding/prompting, or light SFT elicitation |
| Exploration | Coverage present before training, destroyed during training | Correct action was recoverable at large N pre-RL; matched-data SFT actually regresses it; on-policy RL (not demonstrations) is what recovers/expands it | On-policy RL with explicit entropy/diversity preservation, not more SFT |
The exploration gap is the one that’s easy to misdiagnose as a knowledge gap if you only look at a single snapshot: it’s a process failure (the training loop killing coverage that existed a step ago), not a static property of the base model. Section 4 below is the test that tells these two apart.
2. The core instrument: pass@k → Cover@τ → Pass@(k,T)
2.1 pass@k as a coverage probe [E]
The unbiased pass@k estimator (the one this project already uses at eval time, per decisions/2026-06-11-ctf-benchmark-pass-count.md):
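    from math import comb

    def pass_at_k(n, c, k):
        """Unbiased pass@k. n = samples generated, c = number correct, k = budget."""
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        if n - c < k:
            return 1.0
        # 1 - P(all k drawn samples are incorrect).
        return 1.0 - comb(n - c, k) / comb(n, k)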
Sampling k completions per problem at large k and plotting the curve is, structurally, a coverage measurement — the probability mass the policy places on any correct completion. The theory for why this works: the Coverage Principle — cross-entropy loss is dominated by tokens irrelevant to correctness, but coverage (mass on high-quality responses) is necessary and sufficient for post-training / test-time scaling to succeed, and estimates faster than loss does. Chen et al., arXiv:2510.15020 [MED]. [E]
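As a minimal sketch of the curve described above (not the project's eval harness): the helper name `pass_at_k_curve` and the sample counts are illustrative assumptions, and the estimator is repeated only so the block runs on its own.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples drawn, c of them correct, budget k (k <= n)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_curve(correct_counts, n, ks=(1, 4, 16, 64, 256)):
    """Mean pass@k over challenges for each k that fits inside the sample budget n.

    correct_counts[i] is the number of correct samples (out of n) on challenge i.
    """
    usable_ks = [k for k in ks if k <= n]
    return {
        k: sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
        for k in usable_ks
    }


# Hypothetical data: 4 challenges, 256 samples each.
counts = [0, 1, 8, 128]
curve = pass_at_k_curve(counts, n=256)
for k, value in curve.items():
    print(f"pass@{k}: {value:.3f}")
```

A challenge with `c = 0` contributes zero at every k, which is why the mean curve can plateau below 1.0 however large k gets.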
2.2 The crossover test — does RL amplify or replace? [E][N]
The core instrument. Run the base model and your trained checkpoint through the same challenge set at k = {1, 4, 16, 64, 256…}. Plot both pass@k curves.
- Base catches up to or exceeds trained pass@k at large k → the training only reweighted an existing distribution (elicitation, not new capability). Yue et al., “Does RL Really Incentivize Reasoning Capacity Beyond the Base Model?,” arXiv:2504.13837 [HIGH] — the founding result. 6 RLVR algorithms tested; all “remain far from optimal in leveraging the base model’s potential.”
- Trained pass@k pulls away and widens as k grows → real capability expansion.
Designed to fix: pattern 5 (benchmarks measure pattern-match speed, not thoroughness). Yue et al.’s own prescription is exactly “report pass@k at large k, with base model as control” — not just pass@1. Report base-model pass@k at the same k as a mandatory control column in every benchmark table you publish internally.
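One way to make the crossover reading mechanical rather than visual is sketched below. The function name, the `tol` tolerance, and the three verdict labels are assumptions for illustration; the inputs are mean pass@k curves for the base and trained checkpoints on the same challenge set.

```python
def classify_crossover(base_curve, trained_curve, tol=0.01):
    """Compare two pass@k curves given as {k: mean pass@k} on the same challenges.

    'reweighting'  : base catches up to or exceeds trained at the largest shared k.
    'expansion'    : trained stays ahead at the largest k and the gap is wider
                     there than at the smallest k.
    'inconclusive' : trained is ahead at large k but the gap is narrowing.
    """
    ks = sorted(set(base_curve) & set(trained_curve))
    if len(ks) < 2:
        raise ValueError("need at least two shared k values to read a trend")
    gap_small = trained_curve[ks[0]] - base_curve[ks[0]]
    gap_large = trained_curve[ks[-1]] - base_curve[ks[-1]]
    if gap_large <= tol:
        return "reweighting"
    if gap_large > gap_small:
        return "expansion"
    return "inconclusive"


# Hypothetical curves: trained wins at k=1, base overtakes at k=256.
base = {1: 0.10, 16: 0.40, 256: 0.80}
trained = {1: 0.20, 16: 0.45, 256: 0.78}
print(classify_crossover(base, trained))
```

The `inconclusive` bucket is deliberate: a narrowing but still positive gap usually means k was not pushed high enough to see whether the curves cross.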
Contested rebuttal, CoT-Pass@K: pass@k credits a correct final answer even from a wrong chain-of-thought (a lucky guess). Require the reasoning path itself to be correct and the crossover disappears: RLVR shows monotonic gains at every k. Wen et al., arXiv:2506.14245 [MED]. State this as contested when presenting, because it directly challenges the load-bearing assumption behind the Yue et al. headline result above. For your agent, this maps onto a real risk in your own SFT curation: a verifier-passed trajectory can still contain wrong/wasted turns before the winning one (BSides: ~half of solves hinge on an ungrounded guess at a critical step). Filter on trajectory soundness (backtracking, wasted turns, tool-call validity), not just flag==1, or you reproduce the exact confound this paper diagnoses.
2.3 Cover@τ — punish guessing, reward reliability [E]
pass@k at huge k conflates “genuinely solvable” with “eventually guessable by brute force.” Cover@τ(q) = 1 if ≥ τ·n of n samples on problem q are correct — a reliability threshold, not a “did any sample land” threshold. Dragoi et al., arXiv:2510.08325 [MED]. Relative RLVR-algorithm rankings change under Cover@τ vs pass@1 — some algorithms that look best on pass@1 are worse at genuine reliability.
Designed to fix: pattern 5 and complements pattern 3 (good guessers until they’re not). A challenge with high pass@64 but near-zero Cover@0.3 is guessing-dominated — rejection-sampling SFT on its lucky wins teaches the model to guess more confidently, not more competently. Report Cover@τ (τ≈0.3) alongside pass@k as a second axis in every benchmark table.
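The Cover@τ definition and the guessing-dominated check can be sketched as follows. `is_guessing_dominated` and its `pass_floor` threshold are hypothetical choices for illustration, not part of the cited paper; only the Cover@τ rule itself comes from the definition above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples drawn, c of them correct, budget k (k <= n)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def cover_at_tau(n: int, c: int, tau: float) -> int:
    """1 if at least tau * n of the n samples are correct, else 0."""
    return 1 if c >= tau * n else 0


def is_guessing_dominated(n, c, k=64, tau=0.3, pass_floor=0.5) -> bool:
    """High pass@k but Cover@tau == 0: solvable by luck, not reliably.

    pass_floor is an assumed cut-off for what counts as 'high' pass@k.
    """
    return pass_at_k(n, c, k) >= pass_floor and cover_at_tau(n, c, tau) == 0


# Hypothetical challenge: 5 correct out of 256 samples.
print(is_guessing_dominated(n=256, c=5))
```

Both numbers come from the same rollouts, so adding the second axis costs no extra sampling.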
2.4 Pass@(k,T) — the agentic extension, and the single most load-bearing citation in this chapter [L][E][N]
Everything above is validated on static, single-shot reasoning (math). Your agent is T-round tool interaction. Zhai et al., “Does RL Expand the Capability Boundary of LLM Agents? A Pass@(k,T) Analysis,” arXiv:2604.14877 [MED], asks the crossover question with interaction-depth as a second axis:
PASS@(k,T)(q,π) = 1 - C(n - c_T, k) / C(n, k)
— identical to standard pass@k, except c_T counts correct at interaction depth T, not overall.
Their finding flips §2.2 on compositional tasks: on Category C (compositional, sequentially-gated information gathering — structurally identical to enumerate-then-chain vuln discovery), the RL curve pulls above and widens against the base curve as k grows — the opposite of the static-reasoning crossover. On independent-retrieval tasks the effect is small; on pure static reasoning (no tool, negative control) RL is inert, replicating Yue et al.
The critical additional result: matched-data SFT actually regresses the capability boundary on the same compositional tasks (net −4 vs RL’s net +4). This isolates self-directed exploration during RL — not data exposure — as the causal factor for expansion.
Designed to fix: pattern 4 (uneven PTES phases — strong at chaining inside an exploit, weak at enumeration, 62% of failures stall in exploitation). Category C (“sequential retrieval”) is structurally identical to “enumerate-correctly-at-turn-5-before-turn-40’s-exploit-becomes-visible.” This is the paper that tells you where your exploration gap lives on the portfolio.
The single highest-value experimental design in this chapter: segment your 1000 challenges by whether the winning path is (a) single-shot / not sequentially gated (their Cat A/B analog), or (b) genuinely compositional/sequentially-gated (Cat C analog — your “enumeration-gates-exploitation” weak spot). Run Pass@(k,T) on both segments, before and after rejection-sampling SFT. The falsifiable prediction: on (a), SFT works fine, further RL may plateau (§2.2’s static result holds — an execution gap, cheaply closed). On (b), SFT alone regresses capability and you need on-policy RL specifically — an exploration gap, not an execution gap, and rejection-sampling SFT is the wrong tool for it. This is directly testable this week and determines whether “SFT now, GRPO later” has the ordering right for the compositional subset, or whether it needs RL first.
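A minimal sketch of the Pass@(k,T) estimator used in this design, under one stated assumption: each rollout is recorded as the turn at which the flag was verified (or `None` if it never was), and `c_T` is read as "solved within T turns". The record format and function name are illustrative, not taken from the paper.

```python
from math import comb
from typing import Optional, Sequence


def pass_at_k_T(solve_turns: Sequence[Optional[int]], k: int, T: int) -> float:
    """Pass@(k,T) for one challenge.

    solve_turns[i] is the turn at which rollout i captured the flag, or None.
    c_T counts rollouts that solved the challenge at or before turn T (assumption).
    """
    n = len(solve_turns)
    if k > n:
        raise ValueError("k cannot exceed the number of rollouts")
    c_T = sum(1 for t in solve_turns if t is not None and t <= T)
    if n - c_T < k:
        return 1.0
    return 1.0 - comb(n - c_T, k) / comb(n, k)


# Hypothetical rollouts on one challenge: 8 samples, 3 solves at turns 12, 40, 95.
rollouts = [None, 12, 40, None, 95, None, None, None]
for depth in (10, 50, 100):
    print(depth, pass_at_k_T(rollouts, k=1, T=depth))
```

Running this per segment, per checkpoint, and over a grid of (k, T) yields the surfaces the falsifiable prediction above is stated against.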
3. Does base/SFT pass@k predict RL gains, before you spend the compute?
High SFT-stage scores are not reliably predictive of eventual RL performance — sometimes inversely so. What does predict post-RL pass@1: generalization loss on held-out examples and pass@large-k on the post-SFT checkpoint, with up to 2× better R²/Spearman correlation than post-SFT pass@1 alone. Kang et al. (Meta FAIR + Virginia Tech), “Quagmires in SFT-RL Post-Training,” arXiv:2510.01624 [HIGH] — >1M GPU-hours, hundreds of models to 12B, 7 math benchmarks, up to 256 repetitions.
Training-loop delta: add a cheap diagnostic gate between SFT and RL. Before launching a GRPO run, compute pass@64 (or larger) on your rejection-sampling-SFT checkpoint, cold-start, pdq --fresh-retries, on the held-out challenge set. If it’s flat/low, don’t trust the SFT accuracy number as a green light — this predicts a disappointing GRPO run regardless of how good SFT looked.
What changes in your graduation criterion: the project’s stated trigger is “graduate to GRPO/RLVR when policy entropy collapses.” Add a second, independent gate: AND pass@64 on held-out challenges is non-trivial. Entropy collapse tells you SFT has converged; pass@64 tells you there’s still coverage headroom worth converting. Both are needed — collapsed entropy with flat pass@64 means you’ve converged onto a policy with nothing left to reinforce.
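The two-gate criterion can be written down as a small decision function. This is a sketch only: `entropy_floor` and `coverage_floor` are placeholder thresholds that would need calibrating on the project's own runs, and the verdict strings are illustrative.

```python
def grpo_gate(policy_entropy: float, pass_at_64: float,
              entropy_floor: float = 0.1, coverage_floor: float = 0.05) -> str:
    """Combine the entropy trigger with the held-out pass@64 gate.

    Both thresholds are assumed values, not project constants.
    """
    entropy_collapsed = policy_entropy <= entropy_floor
    has_headroom = pass_at_64 >= coverage_floor
    if not entropy_collapsed:
        return "keep-sft"              # SFT has not converged yet
    if has_headroom:
        return "graduate-to-grpo"      # converged, and coverage left to convert
    return "nothing-to-reinforce"      # converged onto a policy with flat coverage


print(grpo_gate(policy_entropy=0.05, pass_at_64=0.40))
```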
4. The routing test — does the correct action ever appear at high N?
This is the book’s existing one-line diagnostic (see The decision), and it deserves the fuller justification here because it is doing real theoretical work, not just intuition:
- Never, at any N, on any checkpoint → knowledge gap. Standard RLVR “amplifies existing capabilities, SFT replaces them” (arXiv:2507.10616, confirmed project lesson) — you cannot cheaply RL your way to a distribution that places zero mass on the answer. Inject off-policy (demonstration, teacher, or — cheaper — put the missing fact in a tool, not the weights).
- Sometimes, and pass@(k,T) shows RL (not matched SFT) expanding it (§2.4) → exploration gap. Coverage exists but training-time entropy collapse is what’s suppressing it turn-to-turn; the fix is on-policy RL with explicit exploration preservation, not more demonstrations.
- Sometimes, and a few-shot prompt or a handful of SFT demos recover it (§5) → execution gap / performance floor. Cheap fix, no RL infrastructure needed.
§2.2–§2.4 give you a rigorous way to run this routing test rather than eyeball it: the crossover is the “does it ever appear at high N” test, made continuous and plottable rather than binary.
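The three-way routing above can be sketched as a single function. The names and the `UNRESOLVED` fallback are assumptions; checking the cheap elicitation branch before the RL branch is a choice made here to mirror cost order, not something the sources prescribe. Note that "never, at any N" can only be approximated by the finite N actually sampled.

```python
from enum import Enum
from typing import Mapping


class Gap(Enum):
    KNOWLEDGE = "knowledge"
    EXECUTION = "execution"
    EXPLORATION = "exploration"
    UNRESOLVED = "unresolved"


def route(correct_by_checkpoint: Mapping[str, int],
          cheap_elicitation_recovers: bool,
          rl_expands_where_sft_does_not: bool) -> Gap:
    """Route one challenge (or challenge subtype) to a gap type.

    correct_by_checkpoint maps checkpoint name -> number of correct samples
    observed at the largest N that was run on that checkpoint.
    """
    if not correct_by_checkpoint:
        raise ValueError("need sample counts from at least one checkpoint")
    if all(c == 0 for c in correct_by_checkpoint.values()):
        return Gap.KNOWLEDGE
    if cheap_elicitation_recovers:
        return Gap.EXECUTION
    if rl_expands_where_sft_does_not:
        return Gap.EXPLORATION
    return Gap.UNRESOLVED


print(route({"base": 0, "sft": 0}, False, False))
```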
5. The elicitation ladder — a borrowed, rigorous “prove it’s X not Y” methodology
The AI-safety capability elicitation literature already had to solve almost exactly this problem — “is the model bad at this, or is the capability locked away where naive prompting can’t reach it?” — because sandbagging/eval-integrity depends on the same distinction. Their protocol, adapted: build (or find) a case where you don’t yet know if the capability is present, then try elicitation techniques in escalating cost order, and whichever one first recovers the behavior tells you what kind of gap it was.
prompt / few-shot (cheapest) → SFT on a handful of demos (cheap) → RL against a judge/verifier (expensive) → (still nothing) → knowledge gap
- Password-locked models: fine-tune a model to only show a capability when a password token is present (imitating a weaker model otherwise), then test whether elicitation techniques recover it. Finding: a few high-quality SFT demonstrations are often sufficient to fully elicit a locked capability; RL can also elicit it when demos aren’t available, but less reliably. Greenblatt et al., arXiv:2405.19550 [HIGH].
- Harder organism (circuit-breaking, not just password-locking): prompting/steering fail; only fine-tuning-based elicitation succeeds. Recommendation: “evaluators should use fine-tuning techniques to elicit capabilities when possible” — if prompting can’t get behavior out, don’t conclude the capability is absent; fine-tune first, then conclude. Hofstätter et al., arXiv:2502.02180 [MED].
- Operational definition of real capability: what can be elicited at ≤1% of total training cost (Anthropic RSP’s own definition, directly reusable). van der Weij et al., arXiv:2406.07358 [HIGH, ICLR 2025].
- Order matters: neither SFT nor RL alone reliably elicits held-back performance from a degenerate policy; SFT on weak demonstrations first, then RL, is what fully elicits it — RL-first “almost always leads to reward hacking rather than genuine improvement” starting from a degenerate policy. Ryd et al., arXiv:2604.22082 [MED, 2026].
Designed to fix: pattern 1 (agents prefer their own tools; 87.7% of calls bypass the rich tool surface, 26/40 tools dead). This is the cheapest, most directly actionable experiment in the whole chapter. Take a handful of ignored tools and run the ladder: (a) few-shot prompting with 2–3 correct-usage examples recovers usage → pure elicitation/prompting gap, no training needed; (b) SFT on a small demonstrated-usage set recovers it → elicitation via light fine-tuning (matches Greenblatt’s finding); (c) neither works → genuinely a missing-knowledge/affordance problem, and rejection-sampling SFT should specifically upweight trajectories that exercise those tools. This turns “the model prefers curl” from an anecdote into a falsifiable, paper-backed experiment.
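The ladder logic for the dead-tool experiment can be sketched as below. Each rung is a caller-supplied check (hypothetical here) that returns whether the tool usage was recovered; rungs run in cost order and stop at the first success, so expensive rungs are never paid for when a cheap one already answers the question.

```python
from typing import Callable, Sequence, Tuple


def run_ladder(rungs: Sequence[Tuple[str, Callable[[], bool]]]) -> str:
    """Try elicitation rungs in the order given (cheapest first).

    Returns the name of the first rung that recovers the behavior, or
    'knowledge-or-affordance-gap' if none does.
    """
    for name, attempt in rungs:
        if attempt():
            return name
    return "knowledge-or-affordance-gap"


# Hypothetical stand-ins for real experiments on one ignored tool.
calls = []


def few_shot_prompt() -> bool:
    calls.append("few-shot")
    return False


def light_sft() -> bool:
    calls.append("light-sft")
    return True


def rl_against_verifier() -> bool:
    calls.append("rl")
    return True


verdict = run_ladder([
    ("prompting-gap", few_shot_prompt),
    ("light-sft-elicitation", light_sft),
    ("rl-elicitation", rl_against_verifier),
])
print(verdict, calls)
```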
5.1 The wrinkle mid-episode: the self-verification cliff [E][R]
Your flag verifier is external, ground-truth, and perfect — exactly the regime where BoN/rejection-sampling/RL should work without a ceiling (Stroebl et al., “Inference Scaling fLaws,” arXiv:2411.17501 [HIGH]: with an imperfect verifier the false-positive floor is non-removable even at infinite compute; a perfect verifier has no such floor). That’s a load-bearing reason the project’s “ground-truth-verified reward, never regex” rule is correct.
But the verifier only fires at submission — at turn 40 of 100, the agent must judge without it whether its current path is worth continuing. That’s exactly the regime multiple papers show degrades, not improves, with capability: models find a correct answer among k samples far more often than they can self-select it, and the self-selection gap widens with generator capability (contested/preliminary — one 2026-06 OpenReview submission, no confirmed arXiv id, treat as [LOW], flag if cited). Corroborating, harder evidence: Best-of-N provably degrades past a reward-hacking threshold even with a competent reward model — scaling samples isn’t monotonically good. Huang et al., arXiv:2503.21878 [HIGH, ICML 2025]. Formal geometry: rejection-sampling and Best-of-N both converge to a ceiling set by the verifier’s ROC curve; more samples cannot buy past it. Dorner et al., arXiv:2507.12399 [MED].
Designed to fix: pattern 3 (good guessers until they’re not — ~half of solves hinge on an ungrounded guess at a critical step; 82% pivot after a single failure). Diagnostic, cheap, run on the existing Phoenix corpus: log every point in a trajectory where the agent pivots/abandons a path, and check post-hoc whether the abandoned path was actually unproductive (e.g., did a later successful run on the same challenge use a similar path?). If abandoned paths are disproportionately ones that would have worked, that’s a self-verification-cliff signature — an execution/judgment gap, and the fix is a mid-episode progress signal, not more SFT demonstrations.
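A sketch of the pivot-point audit, assuming each abandoned path has already been reduced to a comparable label (how to decide that two paths are "similar" is the hard part and is not solved here). The record shapes and the function name are hypothetical, not a Phoenix schema.

```python
def abandoned_but_viable_rate(pivots, successful_paths) -> float:
    """Fraction of abandoned paths that a successful run on the same challenge used.

    pivots: iterable of (challenge_id, abandoned_path_label)
    successful_paths: set of (challenge_id, path_label) seen in solved runs
    """
    pivots = list(pivots)
    if not pivots:
        return 0.0
    hits = sum(1 for pivot in pivots if pivot in successful_paths)
    return hits / len(pivots)


# Hypothetical audit data.
pivots = [
    ("chal-01", "sqli-login"),
    ("chal-01", "ssti-profile"),
    ("chal-02", "lfi-download"),
    ("chal-03", "jwt-none-alg"),
]
solved_with = {("chal-01", "sqli-login"), ("chal-03", "jwt-none-alg")}
print(abandoned_but_viable_rate(pivots, solved_with))
```

A rate well above what a random abandoned path would score is the self-verification-cliff signature described above; the baseline for "random" has to come from the same corpus.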
6. A two-type failure vocabulary — grounded in general agent/RL evidence, not domain benchmarks
Standing project rule: no conclusion here may rest on academic cybersecurity-LLM training/benchmark papers (CTF-Dojo-style work, pentest-agent papers, CTF-family robustness studies, etc.) — none of that literature has produced a frontier cybersecurity model. The domain-specific pentesting-agent papers below are mentioned for context only, not as a basis for anything in this chapter:
- Deng et al., “What Makes a Good LLM Agent for Real-world Penetration Testing?,” arXiv:2602.17622 — academic, cited for context — not a basis for our decisions. (28 pentesting systems, proposes a Type A “capability gap” / Type B “planning and state-management limitation” taxonomy, and an Evidence-Guided Attack Tree Search system, “Excalibur.”)
- Nakano et al., arXiv:2509.07939 — academic, cited for context — not a basis for our decisions. (Deterministic ATT&CK-derived task tree lifts subtask completion 13.5–16.5% → 71.8–78.6% on the same models — a pentest-benchmark result, not project evidence.)
- Shen et al., “PentestAgent,” arXiv:2411.05185 — academic, cited for context — not a basis for our decisions. (Frames its own motivating failure as a knowledge gap fixed with RAG; illustrative only of how unsettled domain-specific pentest-agent papers are, not evidence either way.)
The two-type vocabulary itself is worth keeping — it just needs a non-domain-specific foundation. Re-grounded on general long-horizon-agent evidence and RL theory:
- Capability/elicitation gaps — missing tools, inadequate prompts, absent demonstrations. Cheaply closed by better scaffolding, tool surface, or light SFT (§5’s elicitation ladder, itself grounded in the AI-safety elicitation literature, not domain-security work).
- Planning/state-management limitations — a structurally different failure mode that does not reliably close with a stronger base model or more knowledge alone. Four independent, non-cybersecurity anchors support treating this as a real, separate axis — two qualitative/theoretical, two now directly quantitative:
- Empirical, frontier-model, general-domain: METR’s time-horizon study finds that what separates frontier models on long-horizon tasks is reliability and the ability to adapt to their own mistakes, not per-step knowledge or reasoning quality — and this axis scales on its own trajectory, doubling roughly every 7 months, independent of raw capability jumps. Kwa et al. (METR), “Measuring AI Ability to Complete Long Software Tasks,” arXiv:2503.14499 [HIGH]. This is general evidence (RE-Bench/HCAST software tasks, no cybersecurity framing) that long-horizon degradation is a distinct axis from single-turn competence — exactly the property a “Type B” needs to be a real thing and not just restated knowledge-gap.
- Theoretical, classic RL/imitation-learning result: compounding error under covariate shift — a policy trained/evaluated with a small per-step error rate accumulates error quadratically in trajectory length because each mistake pushes the agent into states its training distribution under-covers, and no single-step fix removes this without addressing the sequential structure itself. Ross, Gordon & Bagnell (DAgger), arXiv:1011.0686 [HIGH, 840+ citations]. This is the general-theory reason a state-tracking/replanning failure at turn 40 of 100 can be structurally invariant to swapping in a stronger base model — the problem is the sequential decision process, not the weights.
- Direct quantitative test, general-domain, verified 2026-07-02: isolating pure execution (plan and knowledge handed to the model, so only turn-to-turn execution is measured), larger models within the same family do execute more correct turns — but per-step accuracy still degrades as turns accumulate, driven by a self-conditioning effect (the model becomes more likely to err once its own prior mistakes are sitting in context). The paper’s own framing is exactly this chapter’s question: “self-conditioning does not reduce by just scaling the model size” — it is removed only by switching training paradigm to a “thinking”/reasoning-trained model, not by a bigger non-reasoning model of the same family. Sinha, Arun, Goel, Staab & Geiping, “The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs,” arXiv:2509.09677 [MED, preprint]. This is the closest thing in general literature to a direct scale-invariance test of a planning/state-tracking failure — the honest reading is invariant to parameter scale within a training paradigm, not invariant to LLM full stop (reasoning-trained models are a real escape hatch, a bigger base model of the old paradigm is not).
- Direct quantitative test, general-domain, verified 2026-07-02: a minimal explicit-lookahead planning module (FLARE) bolted onto a much weaker base model lets LLaMA-8B outperform GPT-4o run with standard step-by-step reasoning on multi-step planning benchmarks — i.e. an eight-billion-parameter model with a planning fix beats a frontier model without one. This is a clean existence proof that the planning axis is separable from and can dominate raw base-model strength. Wang, Wu, Wang, Tang, Li, Yin, Ma, Li, Sun, Chen & Ye, “Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents,” arXiv:2601.22311 [LOW, 0-citation preprint — promising, not yet validated].
- Corroborating, not separately load-bearing: τ-bench’s own cross-model pass^k leaderboard (cited in §6.1 below) shows the same steep pass^1→pass^8 reliability collapse for both gpt-4o and claude-3.5-sonnet: picking the “better” frontier model of a different family narrows but does not close the multi-trial consistency gap. Yao et al., arXiv:2406.12045 [HIGH]. And a 2026 cross-family diagnostic benchmark (GPT-5 variants + Claude models, 3100+ trajectories) documents the same horizon-dependent degradation pattern recurring across both families rather than being a one-model artifact. Wang, Bai, Sun, Wang, Zhang, Hu, Schroder, Mutlu, Song & Nowak, “The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break,” arXiv:2604.11978 [MED, preprint].
Designed to fix: pattern 4 (62% of failures stall in exploitation, a plausible instance of exactly the compounding-error dynamic DAgger formalizes: an early misstep in enumeration pushes the trajectory off-distribution, and the deeper into exploitation the agent gets, the harder recovery becomes).

Actionable, cheap: label a sample of your failed trajectories on two axes: (a) missing tool / bad prompt / no demonstration → capability/elicitation gap, cheap fix; (b) over-committed to a low-value branch, exhausted context near a dead end, no real-time replanning after a costly failure → planning/state-management gap. If (b) dominates, the theoretical prediction (DAgger), the empirical general-domain finding (METR), and the two direct quantitative tests above (Sinha et al.’s self-conditioning result; Wang et al.’s weak-model-plus-planning-beats-strong-model result) all say: don’t expect SFT-then-GRPO on raw episode reward alone to close it, and don’t expect swapping in a bigger base model of the same family to close it either. The fix needs to address the sequential/compounding structure (mid-trajectory checkpointing, explicit replanning triggers, a difficulty/progress signal, or a reasoning-trained backbone), not just more knowledge or more parameters.

Flag, updated 2026-07-02: the qualitative distinction (capability vs. planning/state) is now well-grounded in four independent general-literature anchors, and two of them (Sinha et al. 2509.09677, Wang et al. 2601.22311) are direct quantitative tests of “does a stronger/bigger base model fix it” in non-cybersecurity settings. Both say no, for different mechanisms (self-conditioning invariant to scale; planning-fix on a weak model beats a strong model without one). What general literature does not give you is this project’s own number: no source above measured the specific fraction of this portfolio’s failures that are planning/state-management vs. capability, nor whether it holds at this project’s turn-depths (~100) and challenge structure. Status: qualitative claim grounded; the quantitative “X% invariant to base LLM” figure for this corpus is worth pursuing, but it must be measured on our own Phoenix corpus (68,049 spans, 189 runs), not assumed from the general literature or the academic pentest-agent papers above.
6.1 Per-turn fault labeling — operationalize the taxonomy on your own trace corpus [R]
τ-bench provides an auto error-identification tool with a fixed taxonomy (fault assignment × fault type: used_wrong_tool, used_wrong_tool_argument, took_unintended_action, goal_partially_completed). Yao et al., arXiv:2406.12045 [HIGH]. AgentRx extends this to an automated framework that localizes the single critical failure step in a long trajectory, with a cross-domain taxonomy (Misinterpretation of Tool Output 24.1%, Intent-Plan Misalignment 24.1%, Under-specified Intent 27.6%, per their τ-bench column). Barke et al. (Microsoft Research), arXiv:2602.02475 [MED].
Your Phoenix trace corpus (68,049 spans, 189 runs) is exactly the substrate this tooling wants. Label each turn as {reconnaissance-adequate, wrong-tool, wrong-argument, wrong-decision/policy, tool-output-misread, under-specified-plan} and compute: fraction of episodes with a labeled “under-specified-plan” or “wrong-decision” turn before the first “tool-output-misread”. This operationalizes BSides pattern 2 (“no methodology”) into a countable metric that separates planning (execution/skill: a scaffolding or SFT-curriculum fix) from interpretation (closer to reasoning/knowledge) from tool affordance (§5’s elicitation question).
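Once turns are labeled, the metric is a few lines. The sketch below assumes each episode is an ordered list of per-turn label strings from the set above, and makes one interpretive choice worth flagging: an episode with a planning fault and no tool-output-misread at all still counts as "planning fault first".

```python
PLANNING_LABELS = {"under-specified-plan", "wrong-decision/policy"}
MISREAD_LABEL = "tool-output-misread"


def planning_fault_first(turn_labels) -> bool:
    """True if a planning-type fault occurs before the first tool-output-misread."""
    for label in turn_labels:
        if label in PLANNING_LABELS:
            return True
        if label == MISREAD_LABEL:
            return False
    return False  # neither kind of fault was labeled in this episode


def fraction_planning_first(episodes) -> float:
    """Share of episodes whose first planning fault precedes any misread."""
    episodes = list(episodes)
    if not episodes:
        return 0.0
    return sum(planning_fault_first(ep) for ep in episodes) / len(episodes)


# Hypothetical labeled episodes.
episodes = [
    ["reconnaissance-adequate", "under-specified-plan", "tool-output-misread"],
    ["tool-output-misread", "wrong-decision/policy"],
    ["wrong-tool", "wrong-argument"],
    ["wrong-decision/policy"],
]
print(fraction_planning_first(episodes))
```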
7. A robustness cross-check: is the “gain” generalization or memorization?
Domain-specific aside (context only, not a basis for the method below): Honarvar et al., arXiv:2602.05523 — academic, cited for context — not a basis for our decisions. They build families of semantics-preserving CTF variants and find models robust to shallow transforms but degrading sharply under composed/deeper obfuscation, in the CTF domain specifically; per project rule, an academic CTF-benchmark paper cannot be the basis for the method below.
The methodology stands on its own, general (non-cybersecurity) grounding: semantics-/knowledge-preserving perturbation is an established way to separate genuine generalization from pattern-matching in general LLM evaluation. C-BOD rephrases MMLU questions with a parameterized, meaning-preserving transform and finds an average 2.75% performance drop across 32 SOTA models under modest rephrasing — with higher-performing, larger models showing greater sensitivity, i.e. bigger benchmark numbers can mean more surface-cue reliance, not less. Cohen-Inger et al., “Forget What You Know about LLM Evaluations — LLMs are Like a Chameleon,” arXiv:2502.07445 [MED, EMNLP 2025]. The same logic applied to code generation: rewrite a task’s ground-truth solution into a semantically-different-but-equal-difficulty variant and check whether the model’s answer degrades — a Memorization Risk Index that’s high only when the model reproduces a similar-looking answer and fails the rewritten task. Zhang et al., “Memorize or Generalize? Evaluating LLM Code Generation with Code Rewriting,” arXiv:2503.02296 [LOW, brand-new preprint, 0 citations — promising, not yet validated].
Before crediting any pipeline change with “closing an execution gap,” check the gain isn’t an artifact of the fixed, memorized 10 canonical PD26 challenges the harness has seen many times, using the same semantics-preserving-perturbation logic C-BOD and the code-rewriting paper apply outside cybersecurity.
Designed to fix: pattern 5. Cheap, no new training: generate meaning-preserving transformed variants of a held-out subset (renaming, restructuring, composed obfuscation — the transform families are generic; you don’t need a domain-specific benchmark paper to justify the check) and see whether a solve-rate gain transfers. If it evaporates on transformed variants, you have elicitation/memorization, not the execution-reliability improvement the SFT/GRPO diagnosis is banking on — and per C-BOD’s finding, don’t assume your strongest checkpoints are exempt; they may be the most exposed to this failure mode.
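The transfer check reduces to comparing the solve-rate gain on original challenges with the gain on their transformed variants. A sketch, with an assumed function name and made-up rates; what ratio counts as "evaporated" is a judgment call this code does not make.

```python
from typing import Optional


def gain_transfer_ratio(before_orig: float, after_orig: float,
                        before_variant: float, after_variant: float) -> Optional[float]:
    """Share of the solve-rate gain on originals that survives on transformed variants.

    Returns None when there is no positive gain on the originals to transfer.
    """
    gain_orig = after_orig - before_orig
    if gain_orig <= 0:
        return None
    gain_variant = after_variant - before_variant
    return gain_variant / gain_orig


# Hypothetical rates: +10 points on originals, +1 point on transformed variants.
ratio = gain_transfer_ratio(0.15, 0.25, 0.14, 0.15)
print(f"transfer ratio: {ratio:.2f}")
```

A ratio near 1 says the gain carried over to the variants; a ratio near 0 is the memorization signature.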
Also check the opposite failure mode post-RL: RL-PLUS names “capability boundary collapse” — pass@k at large k dropping even as pass@1 rises during RLVR, i.e. the on-policy training itself narrowing what the model can still do, not just what it does by default. Their fix mixes in the very off-policy verifier-passed trajectories rejection-sampling SFT already produces, via importance-sampling-corrected updates. Dong et al., arXiv:2508.00222 [MED]. This is the training-time confirmatory check for an exploration gap that got worse, not better, under GRPO — run pass@large-k before and after every GRPO checkpoint, not just pass@1.
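The before/after check can be automated over a run's checkpoint history. In this sketch the history format, the function name, and the `tol` noise margin are assumptions; a checkpoint is flagged when pass@1 rose while pass@large-k fell relative to the previous checkpoint.

```python
def find_boundary_collapse(history, large_k, tol=0.0):
    """Return the training steps at which pass@1 rose while pass@large_k dropped.

    history: list of (step, {k: mean pass@k}) ordered by training step.
    tol: minimum change treated as real rather than sampling noise (assumed).
    """
    flagged = []
    for (_, prev), (step, cur) in zip(history, history[1:]):
        pass1_up = cur[1] > prev[1] + tol
        large_k_down = cur[large_k] < prev[large_k] - tol
        if pass1_up and large_k_down:
            flagged.append(step)
    return flagged


# Hypothetical GRPO run: the step-200 checkpoint trades coverage for pass@1.
history = [
    (0, {1: 0.10, 256: 0.60}),
    (100, {1: 0.14, 256: 0.61}),
    (200, {1: 0.18, 256: 0.52}),
]
print(find_boundary_collapse(history, large_k=256))
```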
8. The honest verdict: no single standard — here is the assembled protocol
No single accepted diagnostic instrument exists that a team runs once for a clean X-not-Y verdict. What the literature convergently offers, ranked by how load-bearing each is for this project:
- Pass@k / Pass@(k,T) / Cover@τ curve decomposition (§2) — the closest thing to a standard quantitative instrument, but interpretation is contested even among its own authors (§2.2 vs its CoT-Pass@K rebuttal), and its agentic extension shows the diagnosis is task-structure-dependent: same metric, opposite conclusion, depending on whether the task is compositional/sequentially-gated or not.
- Capability-elicitation methodology (§5) — a rigorous, falsifiable, cost-ordered protocol borrowed from AI-safety evaluation, directly reusable and cheap to run against the tool-avoidance finding.
- Capability/planning two-type taxonomy + per-turn fault labeling (§6) — a general vocabulary grounded in long-horizon-agent measurement (METR), compounding-error theory (DAgger), and now two direct general-domain quantitative tests of scale-invariance (self-conditioning not fixed by scale, Sinha et al. 2509.09677; weak-model-plus-planning beats strong-model-without, Wang et al. 2601.22311) — not in domain-specific pentest-agent papers; the specific per-corpus “X% invariant to base LLM” number stays flagged as worth pursuing, to be measured on this project’s own corpus, not asserted from general literature.
- The competence/performance vocabulary (§1) to frame the final answer for a skeptical reviewer: expect and report a split verdict by challenge subtype, not a single number.
The protocol — what to actually run, in order
| Step | Instrument | Discriminates | Cost | Section |
|---|---|---|---|---|
| 1 | Segment 1000 challenges: single-shot exploit chain vs. sequentially-gated (enumeration-gates-exploitation) | Sets up steps 2–3 correctly | Free (manual/heuristic labeling) | §2.4 |
| 2 | Pass@(k,T): base vs. rejection-sampling-SFT vs. (eventual) GRPO checkpoint, per segment | Execution gap vs. exploration gap vs. knowledge gap | Medium (sampling compute, no training) | §2.2–2.4 |
| 3 | Pass@64 on the SFT checkpoint as an RL go/no-go gate, alongside entropy | Whether GRPO is worth running at all | Cheap (no training run) | §3 |
| 4 | Cover@τ (τ≈0.3) alongside pass@k on every reported number | Genuine reliability vs. guessing | Free (same rollouts, different aggregation) | §2.3 |
| 5 | Elicitation ladder (prompt → few-shot → light SFT → RL) on the 26 dead tools + methodology failures | Elicitation/performance-floor vs. genuine knowledge gap | Cheap → medium, escalating | §5 |
| 6 | Post-hoc pivot-point audit: did abandoned paths ever succeed elsewhere? | Self-verification-cliff (execution/judgment) vs. genuinely dead end | Free (existing Phoenix corpus) | §5.1 |
| 7 | Capability/planning labeling + per-turn fault taxonomy on the Phoenix corpus | Engineering-fixable vs. architectural (test invariance-to-LLM claim on your own data, don’t assume it) | Medium (manual labeling pass, or automate via AgentRx-style tooling) | §6 |
| 8 | Semantics-preserving-transform robustness check on a held-out subset | Generalization vs. memorization | Medium (need the transform tool) | §7 |
| 9 | Entropy instrumentation from GRPO step 0 + pass@large-k before/after every checkpoint | Training-induced exploration collapse (boundary shrinking, not growing) | Free once GRPO is running | §7 |
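Steps 2 to 4 in the table reuse the same rollouts and differ only in how they aggregate them. The sketch below shows that difference under two stated assumptions: pass@k uses the standard unbiased combinatorial estimator, and Cover@τ is taken to mean the fraction of challenges whose per-challenge success rate is at least τ (check this against §2.3 before relying on it). The challenge ids and counts are hypothetical illustration data, not project results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n rollouts with c verified solves (needs k <= n)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: any k-subset must contain a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: dict[str, tuple[int, int]], k: int) -> float:
    """Average pass@k over challenges; results maps id -> (n, c)."""
    return sum(pass_at_k(n, c, k) for n, c in results.values()) / len(results)


def cover_at_tau(results: dict[str, tuple[int, int]], tau: float) -> float:
    """Assumed definition: share of challenges with success rate c/n >= tau."""
    if not results:
        raise ValueError("no challenges")
    hits = sum(1 for n, c in results.values() if c / n >= tau)
    return hits / len(results)


# Hypothetical data: challenge id -> (rollouts, flag-verified solves).
rollouts = {"web-01": (64, 32), "pwn-07": (64, 1), "rev-03": (64, 0)}

print(f"pass@64   = {mean_pass_at_k(rollouts, 64):.3f}")
print(f"cover@0.3 = {cover_at_tau(rollouts, 0.3):.3f}")
```

In this toy data, pass@64 credits the challenge solved once in 64 attempts exactly as much as the one solved half the time, while Cover@0.3 counts only the latter. That is the reliability-versus-guessing distinction step 4 asks you to report.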
Report all nine together, segmented by challenge subtype, and refuse to collapse them into one sentence: §2.4 and §6 both predict that the true answer is heterogeneous across the portfolio, and §7’s entropy check supplies a mechanism for that heterogeneity.
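A segmented report can be as simple as a per-subtype tally of verdicts. The following is a minimal sketch, assuming each failing challenge has already been assigned one subtype and one verdict by the steps above. The function name, label strings, and sample rows are hypothetical and carry no measured results.

```python
from collections import Counter, defaultdict


def split_verdict(labels):
    """Map (subtype, verdict) pairs to {subtype: {verdict: percent}}."""
    by_subtype = defaultdict(Counter)
    for subtype, verdict in labels:
        by_subtype[subtype][verdict] += 1
    report = {}
    for subtype, counts in by_subtype.items():
        total = sum(counts.values())
        # Percent of this subtype's failing challenges per verdict.
        report[subtype] = {
            verdict: round(100.0 * n / total, 1) for verdict, n in counts.items()
        }
    return report


# Hypothetical labels, one row per failing challenge.
labels = [
    ("single-shot", "execution"),
    ("single-shot", "execution"),
    ("single-shot", "elicitation"),
    ("single-shot", "knowledge"),
    ("sequentially-gated", "exploration"),
    ("sequentially-gated", "exploration"),
    ("sequentially-gated", "knowledge"),
    ("sequentially-gated", "execution"),
]

for subtype, shares in split_verdict(labels).items():
    print(subtype, shares)
```

Keeping the percentages keyed by subtype, rather than pooling them, is what prevents the report from degenerating into a single portfolio-wide number.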
9. The decision, expanded
flowchart TD
Start["Failing challenge / challenge-subtype<br/>under diagnosis"] --> Seg{"Winning path structure?"}
Seg -->|"Single-shot / independent recon<br/>(Cat A/B analog)"| PKT_AB["Pass@(k,T): base vs SFT vs RL<br/>(arXiv:2604.14877)"]
Seg -->|"Sequentially-gated:<br/>enum must succeed before<br/>exploit is even visible (Cat C)"| PKT_C["Pass@(k,T): base vs SFT vs RL<br/>(arXiv:2604.14877)"]
PKT_AB --> X1{"Base pass@k(large k)<br/>>= trained pass@k?"}
X1 -->|"Yes — crossover"| Elicit1["ELICITATION only:<br/>run the ladder (§5)<br/>before assuming knowledge gap"]
X1 -->|"No — trained pulls away"| Exec1["EXECUTION gap:<br/>rejection-sampling SFT → GRPO<br/>is the right ordering"]
PKT_C --> X2{"Does the correct action<br/>ever appear at large N,<br/>on ANY checkpoint?"}
X2 -->|"Never"| Know["KNOWLEDGE gap:<br/>inject off-policy<br/>(SFT / teacher / TOOL)"]
X2 -->|"Yes — but matched-data SFT<br/>REGRESSES it (§2.4 test)"| Explore["EXPLORATION gap:<br/>on-policy RL required,<br/>NOT more demonstrations"]
X2 -->|"Yes — few-shot prompting<br/>recovers it"| ElicitP["Performance floor:<br/>cheap prompting/scaffold fix<br/>(§5, pattern 1)"]
X2 -->|"Yes — only light SFT<br/>on demos recovers it"| ElicitS["Elicitation via light SFT<br/>(arXiv:2405.19550)"]
Elicit1 --> TypeAB["Capability vs. planning/state<br/>label the failures<br/>(arXiv:2503.14499, arXiv:1011.0686)"]
Exec1 --> TypeAB
Explore --> TypeAB
TypeAB -->|"Capability: tool / prompt gap"| FixA["Engineering fix:<br/>tool surface, scaffolding,<br/>walkthrough-shaped SFT data"]
TypeAB -->|"Planning/state: difficulty-<br/>estimation, compounding error<br/>(test invariance on own data)"| FixB["Architectural fix:<br/>difficulty-gating / attack-tree<br/>search wrapper — NOT more<br/>SFT-then-GRPO on raw reward"]
classDef your fill:#132b22,stroke:#34d399,color:#eafaf3;
class Exec1,Explore,TypeAB your;
This is the same routing question as The decision (does the correct action ever appear in π_θ’s own outputs at high N?), expanded with the two tests that make it rigorous for an agentic, T-round, sequentially-gated task rather than a single-shot one: the compositionality segmentation (§2.4) and the matched-data-SFT-regression test, which separates a genuine exploration gap from a knowledge gap when coverage exists but training destroys it.
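The first two levels of the flowchart can be written down as a small routing function, which makes the branch conditions explicit and testable. This is a sketch under assumptions, not the project's implementation: the `Evidence` fields and verdict strings are hypothetical names, and because the flowchart does not say which "Yes" branch wins when several hold at once, the code assumes the cheapest fix is checked first, following the elicitation ladder's cost ordering. The final capability-versus-planning labeling stage is omitted.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Measurements for one challenge subtype (all field names hypothetical)."""
    sequentially_gated: bool
    base_pass_large_k: float = 0.0
    trained_pass_large_k: float = 0.0
    appears_at_large_n: bool = False    # correct action seen on ANY checkpoint
    few_shot_recovers: bool = False
    light_sft_recovers: bool = False
    matched_sft_regresses: bool = False


def route(e: Evidence) -> str:
    """Return the verdict bucket implied by the flowchart's branch tests."""
    if not e.sequentially_gated:
        # Single-shot / independent recon: crossover test on pass@k curves.
        if e.base_pass_large_k >= e.trained_pass_large_k:
            return "elicitation"
        return "execution"
    # Sequentially-gated: coverage test first.
    if not e.appears_at_large_n:
        return "knowledge"
    # Assumed tie-break: cheapest recovery wins.
    if e.few_shot_recovers:
        return "performance-floor"
    if e.light_sft_recovers:
        return "elicitation-light-sft"
    if e.matched_sft_regresses:
        return "exploration"
    # Coverage exists but no listed test fired: the flowchart is silent here.
    return "unresolved"


print(route(Evidence(sequentially_gated=True)))
print(route(Evidence(True, appears_at_large_n=True, matched_sft_regresses=True)))
```

The explicit `"unresolved"` return is a deliberate choice: a subtype with coverage but no positive test result should be flagged for more measurement rather than forced into a bucket.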
Cross-links
- The decision — the one-line version of this chapter’s routing test; this chapter is its full justification.
- Contested edges & landmines §1 — “RL can’t create capability” is contested exactly along the lines §2.2/§2.4 draw out (recipe-dependent, not a law).
- Agentic & multi-turn RL — where the exploration-gap fix (on-policy RL, entropy preservation) is implemented once diagnosed.
- Imitation — SFT · distillation · rejection sampling — where the execution-gap and knowledge-gap fixes live.
- memory/research/ (shared pool): the long-horizon.md and exploration.md research notes underlie §2.4’s turn-level mechanics and §7’s entropy-collapse mechanics, respectively; they are not re-derived here, to keep this chapter’s scope to diagnosis, not fix.
Bibliography (all verified live via arxiv.org/abs/<id>, 2026-07-02)
| Citation | arXiv | Confidence |
|---|---|---|
| Yue et al., Does RL Really Incentivize Reasoning Capacity Beyond the Base Model? | 2504.13837 | HIGH |
| Wen et al., RLVR Implicitly Incentivizes Correct Reasoning (CoT-Pass@K) | 2506.14245 | MED |
| Dragoi et al., Beyond Pass@k: Breadth-Depth Metrics (Cover@τ) | 2510.08325 | MED |
| Zhai et al., Does RL Expand the Capability Boundary of LLM Agents? Pass@(k,T) | 2604.14877 | MED |
| Kang et al., Quagmires in SFT-RL Post-Training | 2510.01624 | HIGH |
| Chen et al., The Coverage Principle | 2510.15020 | MED |
| Greenblatt et al., Stress-Testing Capability Elicitation (password-locked) | 2405.19550 | HIGH |
| Hofstätter et al., The Elicitation Game | 2502.02180 | MED |
| van der Weij et al., AI Sandbagging | 2406.07358 | HIGH |
| Ryd et al., Removing Sandbagging via Weak Supervision | 2604.22082 | MED |
| Stroebl et al., Inference Scaling fLaws | 2411.17501 | HIGH |
| Dorner et al., ROC-n-reroll | 2507.12399 | MED |
| Huang et al., Is Best-of-N the Best of Them? | 2503.21878 | HIGH |
| Mahowald et al., Dissociating Language and Thought | 2301.06627 | HIGH |
| Firestone, Performance vs. Competence in Human–Machine Comparisons | PMC7604508 (journal, no arXiv) | HIGH |
| Yao et al., τ-bench | 2406.12045 | HIGH |
| Barke et al., AgentRx | 2602.02475 | MED |
| Kwa et al. (METR), Measuring AI Ability to Complete Long Software Tasks — general grounding for §6’s planning/state-management axis | 2503.14499 | HIGH |
| Ross, Gordon & Bagnell, A Reduction of Imitation Learning to No-Regret Online Learning (DAgger, compounding error) — theory grounding for §6 | 1011.0686 | HIGH |
| Sinha, Arun, Goel, Staab & Geiping, The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs (self-conditioning invariant to scale) — direct quantitative grounding for §6’s scale-invariance claim | 2509.09677 | MED |
| Wang, Wu, Wang, Tang, Li, Yin, Ma, Li, Sun, Chen & Ye, Why Reasoning Fails to Plan (FLARE; LLaMA-8B+planning beats GPT-4o) — direct quantitative grounding for §6’s scale-invariance claim | 2601.22311 | LOW, 0-citation preprint, promising not yet validated |
| Wang, Bai, Sun, Wang, Zhang, Hu, Schroder, Mutlu, Song & Nowak, The Long-Horizon Task Mirage? (HORIZON; cross-family GPT-5/Claude degradation) — corroborating cross-family evidence for §6 | 2604.11978 | MED |
| Cohen-Inger et al., Forget What You Know about LLM Evaluations — LLMs are Like a Chameleon (C-BOD) — general grounding for §7’s robustness check | 2502.07445 | MED |
| Zhang et al., Memorize or Generalize? Evaluating LLM Code Generation with Code Rewriting — secondary grounding for §7 | 2503.02296 | LOW, brand-new preprint, promising not yet validated |
| Dong et al., RL-PLUS: Countering Capability Boundary Collapse | 2508.00222 | MED |
| “GRPO amplifies existing capabilities, SFT replaces them” (confirmed project lesson) | 2507.10616 | project-confirmed |
Academic cybersecurity-LLM domain-specific work — mentioned in §6/§7 for context only, per standing project rule NOT a basis for any claim/decomposition/recipe/verdict/number in this chapter (none of these produced a frontier cybersecurity model):
| Citation | arXiv | Note |
|---|---|---|
| Deng et al., What Makes a Good LLM Agent for Pentesting? (Type A/B, Excalibur) | 2602.17622 | context only — §6’s taxonomy is re-grounded on METR + DAgger above |
| Nakano et al., Guided Reasoning / Structured Attack Trees | 2509.07939 | context only |
| Shen et al., PentestAgent (contested reading) | 2411.05185 | context only |
| Honarvar et al., Capture the Flags: Family-Based Evaluation | 2602.05523 | context only — §7’s check is re-grounded on C-BOD + code-rewriting above |
Flagged, not cited as fact: “The Self-Verification Cliff” (OpenReview, 2026-06-17) — no confirmed arXiv id as of this writing. Treat as [LOW], directional-only, per §5.1.