One problem, or many? — monolithic outcome-RL vs staged decomposition

Every other chapter in this book asks “which algorithm” (SFT vs DPO vs GRPO). This chapter asks a question one level up, specific to a sequential, multi-stage task with a single sparse terminal reward: should the CTF solve be trained as one end-to-end outcome-RL problem (flag reward only, let RL discover the stages), or decomposed into sub-problems — evaluated per stage and, more contentiously, trained per stage? The two halves of “decompose” turn out to have very different answers, and conflating them is the single easiest way to get this wrong.

All citations below are carried over, unmodified, from five research threads run 2026-07-02 (artifacts/overnight-decomposition/research/{monolithic-case,decomposition-case,staged-eval,pentest-ctf-rl,credit-assignment-theory,verdict}.md); no id below was invented for this chapter.

Re-grounding pass, same date: per the project’s standing rule, no conclusion in this chapter may rest on a domain-specific academic CTF/pentest training or benchmark paper (CTF-Dojo, Cyber-Zero, Pentest-R1, HackSynth-GRPO, AutoPenBench, Cybench, NYU CTF Bench, EnIGMA, InterCode-CTF, DRLRM-PT, node-fragility shaping, the kill-chain-staged-reward paper). Every such work is demoted to a labelled context-only mention below, and every claim that had rested on one is re-anchored on general frontier-lab / RL-theory evidence or this project’s own data instead.

Two new citations were added and independently verified live for this pass: STaR (arXiv:2203.14465) and ReST-EM (arXiv:2312.06585), both general (non-security) self-training literature, replacing CTF-Dojo/Cyber-Zero as the basis for §5’s rejection-sampling-SFT recommendation.


1. The pipeline, and why the failure isn’t uniform

A CTF solve is not one action; it is a chain:

flowchart LR
  R["Recon /\nenumeration"] --> E["Endpoint\ndiscovery"]
  E --> V["Identify the\nvulnerable endpoint"]
  V --> X["Exploit it"]
  X --> P["Post-exploitation /\npivot"]
  P --> F(("Flag\n{0,1}\nground-truth verified"))

  classDef stage fill:#132b22,stroke:#34d399,color:#eafaf3;
  class R,E,V,X,P stage;

Only the last box is ground-truth-checkable today. Observed failures cluster by where in the chain the agent dies, not uniformly across it — the project’s own F1–F4 taxonomy, which turns out to be a CTF-specific instance of failure clusters the general agent-eval literature keeps independently rediscovering (§4).

| Tag | Failure | Canonical RL/agent-research framing | What it is not |
| --- | --- | --- | --- |
| F1 | Never finds the vulnerable endpoint | Exploration / coverage failure: no gradient exists until the reward is first observed; large, deceptive state space | Not a credit-assignment problem: you can’t assign credit for a reward you’ve never seen |
| F2 | Finds it, probes shallowly, can’t land the exploit | Execution / skill (performance-floor) failure: capability present, doesn’t reliably convert | Not usually fixed by more exploration |
| F3 | Clumsy tool use, wrong tool for the job | Policy / tool-selection failure: a distinct axis from “does it find the bug” | Overlaps F2 but has its own literature (tool-augmented-LLM failure taxonomies) |
| F4 | No real pivot/chaining after a foothold | Long-horizon credit-assignment failure: the terminal bit has to retroactively explain ~100 turns; variance grows with horizon | Not solved by “try more” alone: it’s a variance problem, not a coverage problem |

The credit-assignment theory thread makes the split precise. A ~100-turn trajectory with reward only at the end is hard for three separable reasons: exploration burden (F1, upstream of everything else); credit-assignment variance (Monte-Carlo/GRPO-style returns smear one scalar across all turns; arXiv:1506.02438, GAE); and compounding distributional drift (a policy trained once, on one static snapshot, drifts off-distribution as a rollout gets longer; arXiv:1011.0686, DAgger, classically O(εT²) uncorrected). F1 is the first problem; F2–F4 are flavors of the second and third. Don’t expect one fix (a denser reward) to solve all three (arXiv:2312.01072, the credit-assignment-vs-exploration survey).
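The "one scalar smeared across all turns" point can be made concrete with a minimal sketch of group-relative, outcome-only advantages. This is a simplified illustration in the spirit of GRPO, not any particular library's implementation; the function names, the four-rollout group, and the 100-turn length are assumptions chosen for the example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize terminal rewards within one group of rollouts (GRPO-style, simplified)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def per_turn_credit(advantage, num_turns):
    """Outcome-only credit: every turn of the rollout receives the same scalar."""
    return [advantage] * num_turns

# Hypothetical group: four rollouts on one challenge, only the second captured the flag.
rewards = [0.0, 1.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

# The single successful rollout spreads one number over all of its turns,
# whether a given turn was the decisive exploit or an irrelevant detour.
credit = per_turn_credit(advantages[1], num_turns=100)

# F1 case: nobody in the group ever saw the reward, so every advantage is zero
# and there is nothing to assign credit for.
no_signal = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

The second case is why F1 sits upstream of the other two reasons: with an all-zero group there is no learning signal at all, so variance reduction has nothing to act on.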


2. Side A — the monolithic case, steelmanned

The pattern across every lab that tried both is consistent: outcome-only + scale beats hand-built process supervision, every time it’s been A/B’d.

  • DeepSeek-R1-Zero rejects process reward outright, in its own failure-experience writeup. Pure RL, no SFT, rule-based outcome-only reward; reasoning behaviors emerge as a side effect. Verbatim: “a model-based PRM… inevitably leads to reward hacking, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.” arXiv:2501.12948. This is the single strongest evidence against a learned/neural per-stage reward — note precisely what it doesn’t rule out: a deterministic, ground-truth per-stage check is a different animal (§4).
  • OpenAI Deep Research shipped a long-horizon, tool-using agent trained end-to-end on outcome/rubric reward, and its own team says why: “End-to-end training beats manual orchestration… constructing a graph of operations… is the common approach to building agents [but] Deep Research is trained end-to-end… This allows the model to develop flexible strategies… that would break if scripted manually.” (OpenAI Deep Research system card, no arxiv id — flagged as such.) Closest real-world analog to this project’s shape: long-horizon, tool-using, sparse/rubric-graded.
  • Kimi k1.5 gets SOTA reasoning results with no PRM, no MCTS, no value function — substituting long-context scaling for explicit search. arXiv:2501.12599. Second independent lab, same conclusion as R1.
  • Llama 4’s own post-training team found heavy SFT/DPO caps the ceiling of the subsequent RL stage — verbatim: “SFT and DPO can over-constrain the model, restricting exploration during the online RL stage.” They responded by pruning >50% (95% for Behemoth) of their SFT data (ai.meta.com blog, no arxiv id). This is the project-critical citation: a hand-imposed stage boundary is itself a form of prescriptive pre-RL structure, and this is the general mechanism by which imposed structure narrows a policy’s exploration before RL gets to use it.
  • Academic, cited for context only, not a basis (per project standing rule — no domain-specific academic CTF/pentest training paper has produced a frontier cybersecurity model): in the CTF/pentest domain specifically, the same monolithic-wins pattern is also reported by academic training papers — CTF-Dojo (arXiv:2508.18370), Cyber-Zero (arXiv:2508.00910), Pentest-R1 (arXiv:2508.07382), HackSynth-GRPO (arXiv:2506.02048). None of these is load-bearing here — the actual basis for Side A is the frontier-lab evidence directly above (R1, Kimi k1.5, OpenAI Deep Research, Llama 4, Bitter Lesson), which independently converges on the same conclusion without needing a CTF-specific data point.
  • The Bitter Lesson (Sutton 2019, incompleteideas.net) is the intellectual ancestor of all of the above: hand-built structure plateaus, general search+learning wins at scale. High confidence the historical pattern is real; medium-low confidence it transfers directly to a 1000-challenge corpus with real per-rollout infra cost — that’s exactly the disanalogy the honest limits below press.

The honest limits — where the monolithic case’s own literature admits it breaks down

| Limit | Citation | What it says |
| --- | --- | --- |
| Pure outcome RL can structurally fail to ever find the reward | Go-Explore, arXiv:1901.10995 / Nature s41586-020-03157-9 | Vanilla deep RL scored ~0 on Montezuma’s Revenge/Pitfall (canonical sparse, deceptive, long-horizon environments) until an explicit “remember states, return, explore from there” mechanism was added. This maps almost exactly onto F1: it’s an exploration-algorithm problem, not a hyperparameter one. |
| The best long-horizon precedent needed denser reward + enormous scale | OpenAI Five, arXiv:1912.06680 | 10 GPU-months of distributed self-play, AND a per-frame shaped reward (last-hits, kills, tower damage), not a single terminal bit. Citing this as “pure sparse reward at scale works” over-claims what the paper shows. |
| R1’s own reward-reliability admission | arXiv:2501.12948 | “The success of pure RL depends on reliable reward signals… for tasks that cannot obtain a reliable signal, DeepSeek-R1 uses human annotation… and only conducts RL for hundreds of steps.” The CTF flag reward is reliable (ground-truth) but far sparser per compute-dollar than a math/code answer. R1’s paper doesn’t test this sparsity regime; it’s an extrapolation this project would be making, not a validated claim. |

Bottom line for Side A: thick, convergent frontier-lab evidence that monolithic outcome reward plus better data/curriculum wins whenever it has been A/B’d against hand-built process supervision. Per the standing rule, the CTF-specific training papers that report the same pattern are context only, not part of this basis. The genuine open risk Side A must own is Go-Explore’s: whether the ~100-200/1000 solve rate reflects “still finding it eventually with more rollouts” (favors monolithic) or “structurally not finding it” (favors an exploration-specific intervention) is an empirical question that literature alone cannot resolve.


3. Side B — the decomposition case, steelmanned

Framing. Monolithic outcome-RL is implicitly betting on four things at once: (1) the base policy already puts non-zero mass on the correct trajectory shape for every stage on a large fraction of challenges, (2) the RL algorithm can correctly attribute a late reward to the right subset of ~100 turns, (3) one scalar is expressive enough to teach four qualitatively different skills (exploration breadth, exploit depth, tool discipline, chaining) without one skill’s gradient starving another’s, and (4) “more outcome RL” is uniformly the right lever for F1 through F4 alike. Every technique below is a documented failure mode of at least one of these assumptions.

The menu of decomposition mechanisms

| Mechanism | Citation | What changes in the loop | Confidence |
| --- | --- | --- | --- |
| Options / SMDP framework (the seminal foundation, asked-for regardless of age) | Sutton, Precup, Singh, Artificial Intelligence 112 (1999) (pre-arxiv; DOI 10.1016/S0004-3702(99)00052-1) | Action space becomes {launch_recon_option, launch_exploit_option, ...}; a high-level policy picks among temporally-extended sub-policies, shortening the effective horizon the terminal reward has to bridge | High (theory), medium (LLM transfer). Known failure: naive end-to-end option learning collapses to one mega-option or micro-manages every step. |
| FeUdal Networks: Manager/Worker split fixes option-collapse | arXiv:1703.01161 | Manager emits abstract directional goals in latent space at low temporal resolution; Worker is intrinsically rewarded for moving state toward that direction; own ablations show a plain (non-dilated) recurrent Manager “fails catastrophically” on long-credit-assignment tasks | High (mechanism), medium (LLM transfer: from-scratch Atari RL, not a token-level LLM policy) |
| ArCHer: the LLM-native analogue | arXiv:2402.19446 | A high-level, off-policy turn-level value function aggregates reward across turns; a low-level PPO-style update trains the token policy inside each turn using that value as its reward. Map “turn” onto “stage.” Single strongest “if I had to prototype one paper” citation for training-decomposition. | High (recipe exists), medium (untested on anything CTF-shaped) |
| HiPER: hierarchical advantage estimation | arXiv:2602.16165 | Factorizes policy into planner + executor; Hierarchical Advantage Estimation aggregates returns per subgoal, provably reducing variance vs flat GAE; +6.6% ALFWorld, +8.3% WebShop, largest gains specifically on long-horizon multi-subtask tasks | High (strong ablations) |
| MiRA: milestone-based dense reward | arXiv:2603.19685 | Dense, milestone-based reward replaces sparse outcome-only; on Gemma3-12B, WebArena-Lite success rate 6.4% → 43.0%, beating WebRL (38.4%) and GPT-4-Turbo (17.6%). The single strongest empirical existence-proof in this whole dossier that flag-only reward can leave a large gap on the table, but on web-navigation, not offensive-security CTF. | High |
| Pentest-R1: domain-specific two-stage training | arXiv:2508.07382 | Academic, cited for context only, not a basis (per standing rule): offline RL on 500+ real pentest walkthroughs → online RL in a live CTF env. Structurally resembles this project’s own planned SFT→GRPO, but that resemblance is not the basis for recommending it; the load-bearing mechanism for training-decomposition is the general options/ArCHer/HiPER hierarchical framing above plus curriculum-learning theory below. | N/A (context only) |
| Potential-based reward shaping: the theoretical safety net for everything above | Ng, Harada, Russell, ICML 1999 (pre-arxiv) | F(s,s') = γΦ(s') − Φ(s) for any state-only potential Φ provably leaves the optimal policy unchanged: the discounted telescoping sum over an episode collapses to γ^T Φ(s_T) − Φ(s_0) (which is Φ(s_T) − Φ(s_0) when γ = 1), added to the true return. This is a theorem, not an empirical claim. Modern reaffirmation: arXiv:2502.01307 (practical effectiveness still depends on Φ’s scaling). | Very high (correctness); risk is entirely implementation |
| RUDDER: learned, return-equivalent redistribution | arXiv:1806.07857 | Train an auxiliary model to predict final return from trajectory prefixes (using the project’s own ~100-200 verified solves), use its temporal differences as a per-step reward; a learned alternative to hand-specifying Φ, with the same correctness guarantee | High (theory), directly actionable given existing verified-solve data |
| Ground-truth per-stage verifiers / VPR / CM2 (checklist rewards) | arXiv:2605.10325 (VPR), arXiv:2602.12268 (CM2) | Decompose the terminal task into a checklist of objectively verifiable sub-criteria (sandbox-checked, not judge-opinion): the safe form of stage reward, symmetric to the flag verifier’s own contract. VPR’s own honest caveat: benefit “depends on the reliability of the verifier,” and extension “to less structured, open-ended environments… remains an open challenge,” which is directly relevant to CTF’s own stage 3 (§4). | High, with an explicit open-environment caveat |
| Curriculum learning & sequencing: orthogonal to reward, lowest risk of the whole menu | Bengio et al., ICML 2009 (pre-arxiv, foundational); h1 arXiv:2510.07312; FastCuRL arXiv:2503.17287; BPO arXiv:2508.03018 | Order training by difficulty (single-endpoint before decoy-heavy; 1-hop exploit before 2-hop pivot); touches no reward function at all. h1: curriculum + pure outcome-only reward gets an exponential sample-complexity gain. BPO explicitly reports vanilla GRPO on their sparse-reward setting yields only marginal improvement without it. | High: the cheapest, least-risky lever in this entire menu |
| Kill-chain-staged reward (cyber-defense red-teaming) | arXiv:2605.17075 (May 2026) | Academic cybersecurity-LLM training work, cited for context only, not a basis (per standing rule). Frozen LLM planner emits kill-chain intent; a trained RL controller gets reward “aligned with kill-chain progression.” Superficially the closest thing in the literature to Option B, but it’s brand-new, unreplicated, doesn’t ablate “staged reward” from “hybrid architecture,” and, per the standing rule, carries no evidentiary weight for this project regardless. The actual basis for Option B’s viability is the general HRL/theory rows above (ArCHer, HiPER, potential-based shaping). | N/A (context only) |
| Classical (non-LLM) staged reward for pentest | DRLRM-PT (reward machines), DOI 10.1109/ijcnn60899.2024.10650368; node-fragility shaping, DOI 10.3390/electronics13214311 | Academic pentest RL, cited for context only, not a basis (DRLRM-PT is explicitly named in the project’s standing rule). Reports staged/dense reward helping sample efficiency in small, discrete, formally-specified MDPs (network graphs, no language, no tool-calling): a structurally different regime, and not where this project’s “staged reward can help” claim rests. That claim’s actual basis is the potential-based-shaping theorem and RUDDER above. | N/A (context only) |
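The telescoping property in the potential-based shaping row can be checked numerically. The sketch below is illustrative only: the episode, the choice of Φ as a count of verified stages reached, and the discount value are assumptions made for the example, not project data.

```python
def shaping_term(phi_s, phi_next, gamma):
    """F(s, s') = gamma * Phi(s') - Phi(s), the Ng/Harada/Russell form."""
    return gamma * phi_next - phi_s

def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical episode with T = 5 transitions.
# phi[t] is Phi(s_t): here, the number of verified stages reached by state s_t.
phi = [0, 0, 1, 1, 2, 3]
true_rewards = [0, 0, 0, 0, 1]   # sparse terminal flag reward on the last transition
gamma = 0.99

shaped_rewards = [
    r + shaping_term(phi[t], phi[t + 1], gamma)
    for t, r in enumerate(true_rewards)
]

T = len(true_rewards)
shaped_return = discounted_return(shaped_rewards, gamma)
true_return = discounted_return(true_rewards, gamma)

# All intermediate potentials cancel; only the endpoints survive.
boundary_term = (gamma ** T) * phi[T] - phi[0]
assert abs(shaped_return - (true_return + boundary_term)) < 1e-9
```

Because the extra term depends only on the episode's endpoints, the shaping changes how densely feedback arrives without changing which policy is optimal. A bonus that is not of this difference form carries no such guarantee.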

The honest limits — where decomposed training breaks, concretely

The reward-hacking evidence against naive per-stage reward is thick and convergent, not one paper. The moment a “verifier” stops being a deterministic ground-truth check and becomes a learned/judge score, this project’s own confirmed lesson (SFT-induced FLAG{} confabulation from a loose format-matcher) generalizes into a much larger literature:

  • PURE / Stop Summation (arXiv:2504.15275) names the mechanism precisely: the canonical summation-form credit assignment (additive, per-step reward) “easily induces LLMs to hack steps with high rewards.”
  • Reward Under Attack (arXiv:2603.06621) shows SOTA PRMs function as “fluency detectors rather than reasoning verifiers” — >0.9 PRM reward on trajectories with <4% ground-truth accuracy.
  • Gao et al. (arXiv:2410.15115) — combining a learned PRM/ORM with success reward can hurt relative to success-reward-only, via “repeating correct but unnecessary steps.”
  • PRIME’s own authors (arXiv:2502.01456) state the central open problem is that process labels are “prohibitively expensive… making [PRMs] particularly vulnerable to reward hacking,” and route around a separately-trained PRM entirely for this reason.
  • MONA (arXiv:2501.13011, DeepMind) generalizes this: multi-step reward hacking can occur even when no single step looks bad to a human/judge overseer.
  • The ancestor of all of it: the bicycle-shaping failure (Randlov & Alstrom, ICML 1998, pre-arxiv) — a non-potential-based “looks like progress” bonus taught an agent to ride in tight circles farming the bonus instead of reaching the goal. Same species as this project’s own SFT confabulation lesson: reward emitting the right-looking pattern rather than deterministically verified success, and the policy learns to farm the pattern.

Ceiling-capping is a real cost of decomposed training too, symmetric to Llama 4’s SFT/DPO warning. HIRO (arXiv:1805.08296) needed an explicit off-policy correction specifically because a high-level subgoal’s meaning drifts as the lower-level policy improves during training — without it, the system converges on subgoals that are locally useful but cap out below the true optimum. Translated to F1–F4: an “exploit-only” sub-policy trained against a synthetic “endpoint identified” subgoal risks converging on the shallowest exploit that satisfies the boundary — which is precisely the F2 shallow-probing failure this project already observes, not a hypothetical.

Academic, cited for context only, not a basis: the domain-specific CTF/pentest training papers named above happen to be monolithic-outcome or curriculum-decomposed rather than reward-decomposed, and the one adjacent paper doing genuine staged reward in an LLM+RL security loop (arXiv:2605.17075) is unreplicated — but neither observation is the basis for caution here. The actual, load-bearing case against naive per-stage reward is the general reward-hacking convergence immediately above (PURE, Reward Under Attack, Gao et al., PRIME, MONA, HIRO) — that literature alone is sufficient to warrant the conditional verdict in §5, independent of what the CTF-training corpus does or doesn’t show.


4. The key split — eval-decomposition vs training-decomposition

This is the load-bearing distinction. They are not the same decision, and the evidence supports very different confidence levels for each.

| | Eval-decomposition | Training-decomposition |
| --- | --- | --- |
| What it means | Measure per-stage reached/not-reached, on top of the existing flag_verified terminal signal | Replace/augment the terminal reward with per-stage rewards, curricula, or hierarchically-trained sub-policies |
| Training-loop change | None: a read-only pass over traces already generated | The reward function, or the training architecture, or both |
| Cost | Near-free (one aggregation pass) | Real engineering + real risk surface |
| Evidence for it | AgentBoard’s “progress rate” (arXiv:2401.13178, NeurIPS 2024 Oral): “current evaluation frameworks mostly focus on the final success rate, revealing few insights”; MAST’s 14-mode/3-category taxonomy (arXiv:2503.13657, κ=0.88); AgentErrorTaxonomy: root-cause diagnosis alone (no reward change) buys +24% all-correct accuracy (arXiv:2509.25370); phase-aligned taxonomies independently reinvented in a different (non-security) domain (arXiv:2508.13143); tau-bench’s pass^k (arXiv:2406.12045), which separates “never clears” from “unreliable” and is composable with a phase vector. This general, non-security agent-eval literature is the basis for the verdict below on its own. Academic, cited for context only, not a basis: Cybench subtasks (arXiv:2408.08926), AutoPenBench milestones (arXiv:2410.03225), NYU CTF Bench (arXiv:2406.05590), and EnIGMA’s “soliloquizing” fabrication finding (arXiv:2409.16165, ICML 2025) happen to converge on a near-identical F1–F4-shaped split, which is a reassuring coincidence, not evidence this project’s verdict depends on. | |
| Evidence against it | None: every paper that ships it treats it as strictly additive, diagnostic-only, never a substitute for the terminal check | The 2025–2026 PRM-hacking convergence above (§3); ceiling-capping via HIRO-style subgoal drift. (The observation that domain-specific CTF/pentest training papers are uniformly monolithic when they report strong numbers is academic context only, not part of this basis; see §3.) |
| Honest limits | Matcher/judge reliability is the new bottleneck one level down: an LLM-judge-scored phase check inherits some of the flag-matcher’s fragility (MAST’s own top category is “task verification” failure); require a corroborating TOOL-kind span, not LLM-only reasoning, for any phase claiming environment interaction. Per-stage sample sizes shrink fast in a funnel, so apply the same pass@k confidence-interval discipline already used project-wide. Phase credit can mislead if not cross-checked against the final flag (treat it as diagnostic under flag_verified, never a replacement). | Safe only via a provably policy-invariant mechanism (potential-based shaping / RUDDER); anything softer (a per-step LLM-judge “does this look like competent recon” score) inherits a decade-plus of documented gaming behavior. |
| Verdict | Yes, unconditionally, do it now. | Conditional; see §5. |
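The shrinking-sample caveat in the honest-limits row can be made concrete. The sketch below uses a Wilson score interval, which is an assumption here: the text only says the project's existing pass@k confidence-interval discipline should be reused and does not name a method. The counts are invented for illustration.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials == 0:
        return (0.0, 1.0)  # no data: the rate is unconstrained
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    return (max(0.0, center - half), min(1.0, center + half))

# Hypothetical funnel counts (not project data): the same observed 30% rate,
# once at a wide early stage and once at a narrow late stage.
early = wilson_interval(300, 1000)
late = wilson_interval(6, 20)
early_width = early[1] - early[0]
late_width = late[1] - late[0]

print(f"early stage (n=1000): {early[0]:.3f} to {early[1]:.3f}")
print(f"late stage  (n=20):   {late[0]:.3f} to {late[1]:.3f}")
```

The same point estimate is far less certain at the narrow end of the funnel, so a late-stage conversion rate should be reported with its interval rather than as a bare percentage.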

The theory’s own framing of why these are different decisions: per-stage evaluation is just better logging — nothing is being optimized against it, so it carries none of the correctness burden. Per-stage reward is where every failure mode above lives, because now something in the loop is being optimized against the signal. This is why the project brief is right to force these into two separate decisions.

The one empirical finding that turns this from philosophy into an operational rule. A controlled study across the RL design space on TravelPlanner finds: “reward and algorithm choices are scale-dependent — smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense [outcome-only-adjacent] rewards.” arXiv:2603.21972. This is directly checkable via the eval funnel: is the current baseline “occasionally stumbles onto stage 3” (favors staged help) or “reliably reaches stage 3/4, fails to convert” (favors leaving outcome reward alone and attacking execution depth via data/SFT)? The project’s own diagnosis — “largely an execution gap” — leans toward the latter, but this is an empirical call the funnel should confirm, not an assumption to bake in from literature alone.


5. The verdict for this project

| Question | Verdict | Confidence |
| --- | --- | --- |
| EVAL-decomposition: measure recon / endpoint-discovery / vuln-ID / exploit / pivot independently, on top of flag_verified? | Yes. Do it now, unconditionally. Near-free (read-only pass over existing traces), zero effect on training dynamics, independently reinvented by every serious CTF/agent benchmark that hit this problem before this project. | High |
| TRAINING-decomposition: replace/augment the terminal flag reward with four separate per-stage rewards, curricula, or policies? | No, not as a wholesale redesign. But a narrow, provably-safe form (potential-based milestone shaping, layered on top of the flag reward, never instead of it) earns its keep once the eval funnel shows an exploration-dominated bottleneck. | Medium (conditional, not universal) |

The concrete next step this whole verdict depends on

The project’s own PTES matcher schema (benchmark/challenges/*/challenge.json, ptes.<phase>.steps[]) was arrived at independently, before this literature review, on the general non-security basis in §4 (AgentBoard, MAST, AgentErrorTaxonomy). Academic, cited for context only, not a basis: it happens to structurally resemble the Cybench-subtask / AutoPenBench-milestone design too. What’s missing is aggregation: run the existing triage subagent over completed runs and emit one funnel row per challenge (five phase-reached booleans + an exploit-given-vuln-found conditional rate), rolled up into a corpus-level funnel. This single aggregation is the input every downstream decision below depends on — it settles empirically whether the ~100-200/1000 solve rate is F1-dominated, F2/F3-dominated, or F4-dominated, which the flag_verified column alone cannot supply no matter how much data accumulates.
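The aggregation described above might look roughly like the following. This is a minimal sketch under stated assumptions: the phase names, the `FunnelRow` shape, and the `reached_up_to` helper are hypothetical and do not reflect the project's actual PTES schema or triage-subagent output. It also simplifies to one row per challenge and computes the exploit-given-vuln-found rate at corpus level.

```python
from dataclasses import dataclass

# Hypothetical phase names; the real ptes.<phase> keys may differ.
PHASES = ("recon", "endpoint_discovery", "vuln_id", "exploit", "pivot")

@dataclass(frozen=True)
class FunnelRow:
    challenge_id: str
    reached: dict          # phase name -> bool (phase reached in this run)
    flag_verified: bool    # the existing terminal ground-truth signal

def corpus_funnel(rows):
    """Roll per-challenge rows up into one corpus-level funnel (read-only)."""
    n = len(rows)
    counts = {p: sum(1 for r in rows if r.reached.get(p, False)) for p in PHASES}
    vuln_found = counts["vuln_id"]
    exploited_given_vuln = sum(
        1 for r in rows
        if r.reached.get("vuln_id", False) and r.reached.get("exploit", False)
    )
    return {
        "n_challenges": n,
        "reached": counts,
        "reach_rate": {p: (counts[p] / n if n else 0.0) for p in PHASES},
        # None when no run ever identified the vulnerability (rate undefined).
        "exploit_given_vuln_found": (
            exploited_given_vuln / vuln_found if vuln_found else None
        ),
        "flag_verified": sum(1 for r in rows if r.flag_verified),
    }

def reached_up_to(k):
    """Demo helper: mark the first k phases as reached."""
    return {p: i < k for i, p in enumerate(PHASES)}

# Invented example rows, not project data.
rows = [
    FunnelRow("c1", reached_up_to(5), True),
    FunnelRow("c2", reached_up_to(3), False),  # found the vuln, exploit failed
    FunnelRow("c3", reached_up_to(1), False),  # died in recon
    FunnelRow("c4", reached_up_to(2), False),  # never identified the vuln
]
funnel = corpus_funnel(rows)
```

In this toy corpus `funnel["reached"]["vuln_id"]` is 2 and `funnel["exploit_given_vuln_found"]` is 0.5. Phase booleans are reported alongside `flag_verified`, never in place of it.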

  1. Decompose the eval fully, now, unconditionally (§4). Cross with the project’s own pass@k methodology rather than one aggregate pass@5, per tau-bench’s pass^k precedent.
  2. Keep the terminal flag reward as the ground-truth backbone, unconditionally. Nothing in this dossier argues for demoting it below a milestone signal.
  3. Run rejection-sampling SFT on own verified solves as already planned — but keep it light. The general, non-security basis for this move is the frontier self-training line: STaR (arXiv:2203.14465, Zelikman et al. 2022) — the seminal “generate, keep only what’s verified correct, fine-tune, repeat” loop — and ReST^EM (arXiv:2312.06585, Singh et al., DeepMind 2023) — the frontier-lab scaling result showing this expectation-maximization-style self-training on a model’s own correct samples beats training on human data alone, on math/code reasoning, no cybersecurity domain involved. Academic, cited for context only, not a basis: CTF-Dojo and Cyber-Zero report the same pattern (~500 verified trajectories → double-digit gains, no staged reward) inside the CTF/pentest domain specifically — a reassuring domain-match, not the reason to do this. Respect the Llama 4 warning: don’t over-train on the easy/repetitive subset — it narrows the exploration space the subsequent RL stage needs.
  4. Add curriculum sequencing before touching the reward function at all. The single lowest-risk lever available — no new reward, hence none of §3’s hacking surface. Order by whichever axis the funnel identifies as the bottleneck.
  5. Only if the funnel shows an F1 (exploration)-dominated bottleneck, and only via a ground-truth mechanism: add potential-based milestone shaping on top of the terminal reward. Define Φ(s) as a monotonic count of deterministically-verified stage completions (same verification contract the flag oracle already uses — server-side checks, not judge opinions), paid once per stage-transition, never re-collectable. Do not build this for stage 3 (vuln identification) specifically — VPR’s own authors flag exactly this stage-shape (“identify which of several candidates is vulnerable”) as the “open, unstructured” regime their method doesn’t yet solve well; keep that stage eval-only until a genuine deterministic check exists.
  6. Explicitly do NOT build a learned/LLM-judge per-stage reward model. Every citation in §3’s honest-limits section converges on this being the failure mode to avoid.
  7. If the funnel instead shows an F2/F3 (execution-depth / tool-policy)-dominated bottleneck — which the project’s own current diagnosis (“largely an execution gap”) suggests is more likely — the evidence base points away from reward decomposition and toward better trajectory curation and more/better SFT data, not a training-loop change.
  8. When entropy collapses under GRPO (the project’s own stated graduation trigger), watch stage-transition tokens specifically — a badly-shaped milestone reward is an easy, low-entropy shortcut to farm, and would accelerate collapse.
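Step 1's distinction between "never clears" and "unreliable" can be illustrated with the two standard combinatorial estimators: the unbiased pass@k estimator widely used in code-generation evaluation, and tau-bench's pass^k. Whether the project's own pass@k methodology uses exactly these forms is an assumption; the sample counts below are invented.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples succeeds) from c successes in n samples."""
    if not 0 < k <= n:
        raise ValueError("require 0 < k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n, c, k):
    """Estimate P(all k samples succeed) from c successes in n samples (pass^k)."""
    if not 0 < k <= n:
        raise ValueError("require 0 < k <= n")
    return comb(c, k) / comb(n, k)

# Hypothetical challenge A: 2 verified solves in 10 rollouts ("unreliable").
unreliable_any = pass_at_k(10, 2, 5)    # fairly high: one of 5 tries often lands
unreliable_all = pass_hat_k(10, 2, 5)   # zero: 5 tries can never all succeed

# Hypothetical challenge B: 0 solves in 10 rollouts ("never clears").
never_any = pass_at_k(10, 0, 5)
never_all = pass_hat_k(10, 0, 5)
```

Challenge A scores well on pass@5 and zero on pass^5, while challenge B scores zero on both. A single aggregate pass@5 would hide that difference. The same pair of numbers can be computed per phase once the funnel exists.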

Deliberately not recommended as a first move: standing up four independently-trained sub-policies with four separate critics (the full options/ArCHer/HiPER/FeUdal-style architectural decomposition). Real, actively converging in the literature, and MiRA’s 6.4%→43.0% number is the strongest existence-proof in this whole dossier that monolithic reward can leave a large gap on the table — but every one of these is validated on web-navigation or generic agentic benchmarks with crisp, cheap-to-verify milestones, not on an offensive-security CTF corpus. Highest-upside, least-validated-for-this-domain lever here — a candidate for a later, small, gated experiment, not step one.

The decision, as a diagram

flowchart TD
  Start["Failing challenge / corpus\nunder diagnosis"] --> EvalDecomp["Step 1 — decompose the EVAL\n(PTES funnel, near-free)\ndo this unconditionally"]

  EvalDecomp --> Funnel{"Funnel shows which\nbottleneck dominates?"}

  Funnel -->|"F1: rarely reaches\nthe vulnerable endpoint"| ScaleCheck{"arXiv:2603.21972 —\nweak policy, capacity-limited?"}
  Funnel -->|"F2/F3: reaches it,\nfails to convert / clumsy tools"| Mono["Stay monolithic.\nInvest in trajectory curation +\nrejection-sampling SFT data\n(STaR / ReST-EM pattern)"]
  Funnel -->|"F4: no pivot after\na foothold"| Curric["Curriculum first\n(1-hop before 2-hop pivot chains,\nno reward change)"]

  ScaleCheck -->|"yes — occasionally\nstumbles onto it"| Shape["Potential-based milestone\nshaping ON TOP OF the flag reward\n(Ng/Harada/Russell 1999 — provably\npolicy-invariant), NOT stage 3"]
  ScaleCheck -->|"no — already capable,\njust unreliable"| Mono

  Shape --> Guard["Guard: ground-truth verifier only,\nnever a learned/LLM-judge score\n(PURE / Reward-Under-Attack / MONA)"]
  Mono --> Entropy["Watch entropy at GRPO\ngraduation regardless of path taken"]
  Curric --> Entropy
  Guard --> Entropy

  classDef safe fill:#132b22,stroke:#34d399,color:#eafaf3;
  classDef risk fill:#3a1414,stroke:#f87171,color:#fde8e8;
  class EvalDecomp,Curric,Mono safe;
  class Shape,Guard risk;

Contested point, stated plainly: there exists one paper doing genuine kill-chain-staged RL reward inside an LLM+RL hybrid (arXiv:2605.17075, cyber-defense red-teaming) that on its face looks like support for training-decomposition on an adjacent task. Per the project’s standing rule it carries no evidentiary weight here regardless — it’s academic cybersecurity-LLM work, cited for context only. The Medium-confidence, conditional verdict above does not rest on it; it rests on the general theory (potential-based shaping’s policy-invariance guarantee, HIRO’s ceiling-capping mechanism, the PRM-hacking convergence) and on this project’s own diagnosis. Demoting this citation changes nothing about the verdict.


  • Diagnosing the gap — a scientific framework — the pass@k / Pass@(k,T) / Cover@τ protocol that tells you which gap (knowledge / execution / exploration) a challenge subtype actually has; run this before deciding whether an F1-dominated funnel result calls for exploration-RL or just more samples. That chapter’s routing test and this chapter’s eval-funnel are complementary diagnostics, not competing ones.
  • RL that creates value — long-horizon, exploration, reasoning, novelty — the mechanics of how to fix an exploration or credit-assignment gap once diagnosed here (GiGPO step-level credit, DAPO, entropy instrumentation, ArCHer, curriculum-band filtering) — this chapter answers whether to decompose; that one answers how to execute the fix on the training-loop mechanics.
  • Agentic & multi-turn RL — the missing category — the training-loop shape (turn as the unit of advantage) that any of §5’s mechanisms (potential-based shaping, curriculum) has to be implemented inside.
  • Contested edges & landmines — the “does RL create capability or just amplify it” fight this chapter’s scale-dependence finding (arXiv:2603.21972) directly informs.

Bibliography (all traced to a verified source file, 2026-07-02)

| Citation | arXiv / DOI | Role here |
|---|---|---|
| Sutton, “The Bitter Lesson” | no arxiv; incompleteideas.net | Don’t hand-author structure that plateaus |
| DeepSeek-R1 / R1-Zero | 2501.12948 | Rejects neural PRM at scale; reward-reliability admission |
| OpenAI Deep Research | system card, no arxiv | End-to-end beats manual orchestration |
| Llama 4 post-training | ai.meta.com blog, no arxiv | Heavy SFT/DPO caps RL exploration ceiling |
| Kimi k1.5 | 2501.12599 | Outcome-only + long context beats PRM/MCTS/value-fn |
| Go-Explore | 1901.10995 / Nature s41586-020-03157-9 | Pure outcome RL can structurally fail to find sparse reward |
| OpenAI Five | 1912.06680 | Long-horizon precedent needed huge scale + denser reward |
| Options framework (Sutton/Precup/Singh 1999) | AIJ 112, DOI 10.1016/S0004-3702(99)00052-1 | Seminal HRL / temporal abstraction |
| FeUdal Networks | 1703.01161 | Manager/Worker HRL, option-collapse fix |
| ArCHer | 2402.19446 | LLM-native 2-level value function HRL |
| HiPER | 2602.16165 | Hierarchical advantage estimation, +6.6–8.3% |
| MiRA | 2603.19685 | Milestone reward, 6.4%→43% WebArena-Lite |
| Ng, Harada, Russell: reward shaping | ICML 1999, no arxiv | Potential-based shaping theorem (policy-invariant) |
| Müller & Kudenko | 2502.01307 | PBRS effectiveness depends on potential scaling |
| RUDDER | 1806.07857 | Learned return-equivalent redistribution |
| Randlov & Alstrom (bicycle shaping) | ICML 1998, no arxiv | Canonical non-potential-based shaping failure |
| Verifiable Process Rewards (VPR) | 2605.10325 | Safe ground-truth process reward, open-env caveat for stage 3 |
| CM2 checklist rewards | 2602.12268 | Checklist-style verifiable sub-criteria |
| Curriculum Learning (Bengio et al.) | ICML 2009, no arxiv | Foundational curriculum citation |
| h1 | 2510.07312 | Curriculum + outcome-only, exponential sample-complexity gain |
| FastCuRL | 2503.17287 | Context-length curriculum, entropy-collapse timing |
| BPO | 2508.03018 | Curriculum + rejection-sampling refine, near-identical to project plan |
| PURE / Stop Summation | 2504.15275 | Sum-form PRM hacking mechanism, named |
| Reward Under Attack | 2603.06621 | PRMs as fluency detectors, adversarial hackability |
| Gao et al., designing RL reward | 2410.15115 | Learned PRM+success reward can hurt vs success-only |
| PRIME | 2502.01456 | Authors’ own admission of PRM hacking vulnerability |
| MONA | 2501.13011 | Multi-step reward hacking even with no bad-looking single step |
| HIRO | 1805.08296 | Off-policy correction, HRL non-stationarity / ceiling-capping |
| AgentBoard | 2401.13178 | Progress-rate metric, general capability-decomposition principle |
| MAST | 2503.13657 | 14-mode/3-category failure taxonomy |
| AgentErrorTaxonomy / AgentDebug | 2509.25370 | Root-cause diagnosis gains without reward change |
| Phase-aligned taxonomy (autonomous agents) | 2508.13143 | Independent-domain convergence on phase-keyed failure |
| Cybench (academic; context only, not a basis) | 2408.08926 | Subtask decomposition, eval-only; reassuring convergence with AgentBoard/MAST, not evidence relied on |
| AutoPenBench (academic; context only, not a basis) | 2410.03225 | Milestone taxonomy near-matching F1–F4, eval-only; same caveat |
| NYU CTF Bench (academic; context only, not a basis) | 2406.05590 | CTF benchmark family |
| EnIGMA (academic; context only, not a basis) | 2409.16165 | “Soliloquizing” fabrication failure mode |
| tau-bench | 2406.12045 | pass^k reliability decomposition |
| InterCode-CTF (academic; context only, not a basis) | 2306.14898 | Seminal monolithic-reward CTF environment |
| CTF-Dojo (academic; context only, not a basis) | 2508.18370 | Monolithic rejection-sampling SFT, +11.6%; domain-match only, real basis is STaR/ReST-EM below |
| Cyber-Zero (academic; context only, not a basis) | 2508.00910 | Monolithic, simulated env, +13.1%; same caveat |
| Pentest-R1 (academic; context only, not a basis) | 2508.07382 | Two-stage curriculum, monolithic per-stage reward |
| HackSynth-GRPO (academic; context only, not a basis) | 2506.02048 | Outcome-only GRPO sufficient for single-stage CTF |
| STaR | 2203.14465 | Seminal general (non-security) rejection-sampling self-training loop; basis for §5’s SFT-on-own-solves recipe |
| ReST-EM (“Beyond Human Data”) | 2312.06585 | DeepMind frontier-lab scaling result for self-training on own correct samples, math/code domain |
| Kill-chain-staged reward (red-teaming) (academic cybersecurity-LLM; context only, not a basis) | 2605.17075 | The one LLM+RL staged-reward paper, adjacent domain, un-ablated |
| DRLRM-PT (reward machine, pentest) (academic; context only, not a basis) | DOI 10.1109/ijcnn60899.2024.10650368 | Classical RL, staged reward helps, non-LLM regime |
| Node-fragility reward shaping (academic; context only, not a basis) | DOI 10.3390/electronics13214311 | Classical dense-reward pentest, non-LLM regime |
| DAgger | 1011.0686 | Compounding error / distribution drift theory |
| GAE | 1506.02438 | Bias/variance dial for advantage estimation |
| Credit Assignment survey | 2312.01072 | Separates credit assignment from exploration |
| Demystifying long-horizon tool-use RL | 2603.21972 | Scale-dependence: staged reward helps weak models only |