One problem, or many? — monolithic outcome-RL vs staged decomposition

Every other chapter in this book asks “which algorithm” (SFT vs DPO vs GRPO). This chapter asks a question one level up, specific to a sequential, multi-stage task with a single sparse terminal reward: should the CTF solve be trained as one end-to-end outcome-RL problem (flag reward only, let RL discover the stages), or decomposed into sub-problems — evaluated per stage and, more contentiously, trained per stage? The two halves of “decompose” turn out to have very different answers, and conflating them is the single easiest way to get this wrong.

All citations below are carried over, unmodified, from five research threads run 2026-07-02 (artifacts/overnight-decomposition/research/{monolithic-case,decomposition-case,staged-eval,pentest-ctf-rl,credit-assignment-theory,verdict}.md) — no id below was invented for this chapter. Re-grounding pass, same date: per the project’s standing rule, no conclusion in this chapter may rest on a domain-specific academic CTF/pentest training or benchmark paper (CTF-Dojo, Cyber-Zero, Pentest-R1, HackSynth-GRPO, AutoPenBench, Cybench, NYU CTF Bench, EnIGMA, InterCode-CTF, DRLRM-PT, node-fragility shaping, the kill-chain-staged-reward paper) — every such work is demoted to a labelled context-only mention below, and every claim that had rested on one is re-anchored on general frontier-lab / RL-theory evidence or this project’s own data instead. Two new citations were added and independently verified live for this pass: STaR (arXiv:2203.14465) and ReST-EM (arXiv:2312.06585), both general (non-security) self-training literature, replacing CTF-Dojo/Cyber-Zero as the basis for §5’s rejection-sampling-SFT recommendation.

1. The pipeline, and why the failure isn’t uniform

A CTF solve is not one action, it’s a chain:

```mermaid
flowchart LR
  R["Recon /\nenumeration"] --> E["Endpoint\ndiscovery"]
  E --> V["Identify the\nvulnerable endpoint"]
  V --> X["Exploit it"]
  X --> P["Post-exploitation /\npivot"]
  P --> F(("Flag\n{0,1}\nground-truth verified"))

  classDef stage fill:#132b22,stroke:#34d399,color:#eafaf3;
  class R,E,V,X,P stage;
```

Only the last box is ground-truth-checkable today. Observed failures cluster by where in the chain the agent dies rather than spreading uniformly across it. The project’s own F1–F4 taxonomy captures those clusters, and it turns out to be a CTF-specific instance of failure clusters the general agent-eval literature keeps independently rediscovering (§4).

Tag	Failure	Canonical RL/agent-research framing	What it is not
F1	Never finds the vulnerable endpoint	Exploration / coverage failure — no gradient exists until the reward is first observed; large, deceptive state space	Not a credit-assignment problem — you can’t assign credit for a reward you’ve never seen
F2	Finds it, probes shallowly, can’t land the exploit	Execution / skill (performance-floor) failure — capability present, doesn’t reliably convert	Not usually fixed by more exploration
F3	Clumsy tool use, wrong tool for the job	Policy / tool-selection failure — a distinct axis from “does it find the bug”	Overlaps F2 but has its own literature (tool-augmented-LLM failure taxonomies)
F4	No real pivot/chaining after a foothold	Long-horizon credit-assignment failure — the terminal bit has to retroactively explain ~100 turns; variance grows with horizon	Not solved by “try more” alone — it’s a variance, not coverage, problem

The credit-assignment theory thread makes the split precise. A ~100-turn trajectory with reward only at the end is hard for three separable reasons: exploration burden (F1, upstream of everything else); credit-assignment variance (Monte-Carlo/GRPO-style returns smear one scalar across all turns; arXiv:1506.02438, GAE); and compounding distributional drift (a policy trained once, on one static snapshot, drifts off-distribution as a rollout gets longer; arXiv:1011.0686, DAgger, classically O(εT²) uncorrected). F1 is the first problem; F2–F4 are flavors of the second and third. Don’t expect a single fix such as a denser reward to solve exploration and credit assignment at once (arXiv:2312.01072, the credit-assignment-vs-exploration survey).
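A minimal sketch of the first two reasons, assuming a GRPO-style group-relative advantage computed from the terminal flag reward only. The function name, the group of four rollouts, and the turn counts are hypothetical illustrations, not the project's training code. It shows that a group with no solves yields zero advantage everywhere (the F1 regime, where there is nothing to assign credit for), and that a single solve gives every turn of that rollout the same scalar.

```python
from statistics import mean, pstdev


def outcome_only_advantages(terminal_rewards, turns_per_rollout, eps=1e-6):
    """Broadcast a group-relative, outcome-only advantage to every turn.

    terminal_rewards: one 0/1 flag reward per rollout in the group.
    turns_per_rollout: number of turns in each rollout (illustrative values).
    Returns one list of per-turn advantages per rollout.
    """
    mu = mean(terminal_rewards)
    sigma = pstdev(terminal_rewards)
    per_rollout = [(r - mu) / (sigma + eps) for r in terminal_rewards]
    # Every turn inherits the rollout-level scalar: no turn is singled out.
    return [[adv] * turns for adv, turns in zip(per_rollout, turns_per_rollout)]


turns = [100, 90, 110, 95]

# F1 regime: nobody in the group found the flag, so every advantage is zero.
no_solves = outcome_only_advantages([0, 0, 0, 0], turns)

# One solve: all 95 turns of the solving rollout share one positive value.
one_solve = outcome_only_advantages([0, 0, 0, 1], turns)
```

In the second case the recon turns, the dead ends, and the decisive exploit turn of the solving rollout are all reinforced equally, which is the smearing the paragraph above describes.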

2. Side A — the monolithic case, steelmanned

The pattern across every lab that tried both is consistent: outcome-only + scale beats hand-built process supervision, every time it’s been A/B’d.

DeepSeek-R1-Zero rejects process reward outright, in its own failure-experience writeup. Pure RL, no SFT, rule-based outcome-only reward; reasoning behaviors emerge as a side effect. Verbatim: “a model-based PRM… inevitably leads to reward hacking, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.” arXiv:2501.12948. This is the single strongest evidence against a learned/neural per-stage reward — note precisely what it doesn’t rule out: a deterministic, ground-truth per-stage check is a different animal (§4).
OpenAI Deep Research shipped a long-horizon, tool-using agent trained end-to-end on outcome/rubric reward, and its own team says why: “End-to-end training beats manual orchestration… constructing a graph of operations… is the common approach to building agents [but] Deep Research is trained end-to-end… This allows the model to develop flexible strategies… that would break if scripted manually.” (OpenAI Deep Research system card, no arxiv id — flagged as such.) Closest real-world analog to this project’s shape: long-horizon, tool-using, sparse/rubric-graded.
Kimi k1.5 gets SOTA reasoning results with no PRM, no MCTS, no value function — substituting long-context scaling for explicit search. arXiv:2501.12599. Second independent lab, same conclusion as R1.
Llama 4’s own post-training team found heavy SFT/DPO caps the ceiling of the subsequent RL stage — verbatim: “SFT and DPO can over-constrain the model, restricting exploration during the online RL stage.” They responded by pruning >50% (95% for Behemoth) of their SFT data (ai.meta.com blog, no arxiv id). This is the project-critical citation: a hand-imposed stage boundary is itself a form of prescriptive pre-RL structure, and this is the general mechanism by which imposed structure narrows a policy’s exploration before RL gets to use it.
Academic, cited for context only, not a basis (per project standing rule — no domain-specific academic CTF/pentest training paper has produced a frontier cybersecurity model): in the CTF/pentest domain specifically, the same monolithic-wins pattern is also reported by academic training papers — CTF-Dojo (arXiv:2508.18370), Cyber-Zero (arXiv:2508.00910), Pentest-R1 (arXiv:2508.07382), HackSynth-GRPO (arXiv:2506.02048). None of these is load-bearing here — the actual basis for Side A is the frontier-lab evidence directly above (R1, Kimi k1.5, OpenAI Deep Research, Llama 4, Bitter Lesson), which independently converges on the same conclusion without needing a CTF-specific data point.
The Bitter Lesson (Sutton 2019, incompleteideas.net) is the intellectual ancestor of all of the above: hand-built structure plateaus, general search+learning wins at scale. High confidence the historical pattern is real; medium-low confidence it transfers directly to a 1000-challenge corpus with real per-rollout infra cost — that’s exactly the disanalogy the honest limits below press.

The honest limits — where the monolithic case’s own literature admits it breaks down

Limit	Citation	What it says
Pure outcome RL can structurally fail to ever find the reward	Go-Explore, arXiv:1901.10995 / Nature s41586-020-03157-9	Vanilla deep RL scored ~0 on Montezuma’s Revenge/Pitfall — canonical sparse, deceptive, long-horizon environments — until an explicit “remember states, return, explore from there” mechanism was added. This maps almost exactly onto F1: it’s an exploration-algorithm problem, not a hyperparameter one.
The best long-horizon precedent needed denser reward + enormous scale	OpenAI Five, arXiv:1912.06680	About 10 months of large-scale distributed self-play, AND a per-frame shaped reward (last-hits, kills, tower damage) rather than a single terminal bit. Citing this as “pure sparse reward at scale works” over-claims what the paper shows.
R1’s own reward-reliability admission	arXiv:2501.12948	“The success of pure RL depends on reliable reward signals… for tasks that cannot obtain a reliable signal, DeepSeek-R1 uses human annotation… and only conducts RL for hundreds of steps.” The CTF flag reward is reliable (ground-truth) but far sparser per compute-dollar than a math/code answer — R1’s paper doesn’t test this sparsity regime; it’s an extrapolation this project would be making, not a validated claim.

Bottom line for Side A: thick, convergent frontier-lab evidence (R1-Zero, Kimi k1.5, OpenAI Deep Research, Llama 4) that outcome-only reward plus scale, with light pre-RL structure, wins wherever labs have compared it against heavier hand-built supervision. The CTF-specific training papers report the same pattern but, per the standing rule, are context only and not part of this basis. The genuine open risk Side A must own is Go-Explore’s: whether the ~100-200/1000 solve rate reflects “still finding it eventually with more rollouts” (favors monolithic) or “structurally not finding it” (favors an exploration-specific intervention) is an empirical question that literature alone cannot resolve.

3. Side B — the decomposition case, steelmanned

Framing. Monolithic outcome-RL is implicitly betting on four things at once: (1) the base policy already puts non-zero mass on the correct trajectory shape for every stage on a large fraction of challenges, (2) the RL algorithm can correctly attribute a late reward to the right subset of ~100 turns, (3) one scalar is expressive enough to teach four qualitatively different skills (exploration breadth, exploit depth, tool discipline, chaining) without one skill’s gradient starving another’s, and (4) “more outcome RL” is uniformly the right lever for F1 through F4 alike. Every technique below is a documented failure mode of at least one of these assumptions.

Mechanism	Citation	What changes in the loop	Confidence
Options / SMDP framework (the seminal foundation, asked-for regardless of age)	Sutton, Precup, Singh, Artificial Intelligence 112 (1999) (pre-arxiv; DOI 10.1016/S0004-3702(99)00052-1)	Action space becomes `{launch_recon_option, launch_exploit_option, ...}`; a high-level policy picks among temporally-extended sub-policies, shortening the effective horizon the terminal reward has to bridge	High (theory), medium (LLM transfer). Known failure: naive end-to-end option learning collapses to one mega-option or micro-manages every step.
FeUdal Networks — Manager/Worker split fixes option-collapse	arXiv:1703.01161	Manager emits abstract directional goals in latent space at low temporal resolution; Worker is intrinsically rewarded for moving state toward that direction; own ablations show a plain (non-dilated) recurrent Manager “fails catastrophically” on long-credit-assignment tasks	High (mechanism), medium (LLM transfer — from-scratch Atari RL, not a token-level LLM policy)
ArCHer — the LLM-native analogue	arXiv:2402.19446	A high-level, off-policy turn-level value function aggregates reward across turns; a low-level PPO-style update trains the token policy inside each turn using that value as its reward. Map “turn” onto “stage.” Single strongest “if I had to prototype one paper” citation for training-decomposition.	High (recipe exists), medium (untested on anything CTF-shaped)
HiPER — hierarchical advantage estimation	arXiv:2602.16165	Factorizes policy into planner + executor; Hierarchical Advantage Estimation aggregates returns per subgoal, provably reducing variance vs flat GAE; +6.6% ALFWorld, +8.3% WebShop, largest gains specifically on long-horizon multi-subtask tasks	High (strong ablations)
MiRA — milestone-based dense reward	arXiv:2603.19685	Dense, milestone-based reward replaces sparse outcome-only; on Gemma3-12B, WebArena-Lite success rate 6.4% → 43.0%, beating WebRL (38.4%) and GPT-4-Turbo (17.6%). The single strongest empirical existence-proof in this whole dossier that flag-only reward can leave a large gap on the table — but on web-navigation, not offensive-security CTF.	High
Pentest-R1 — domain-specific two-stage training	arXiv:2508.07382	Academic, cited for context only, not a basis (per standing rule): offline RL on 500+ real pentest walkthroughs → online RL in a live CTF env. Structurally resembles this project’s own planned SFT→GRPO, but that resemblance is not the basis for recommending it — the load-bearing mechanism for training-decomposition is the general options/ArCHer/HiPER hierarchical framing above plus curriculum-learning theory below.	N/A — context only
Potential-based reward shaping — the theoretical safety net for everything above	Ng, Harada, Russell, ICML 1999 (pre-arxiv)	`F(s,s') = γΦ(s') − Φ(s)` for any state-only potential Φ provably leaves the optimal policy unchanged: summed over an episode the shaping terms telescope, so the shaped return equals the true return plus a boundary term in Φ at the first and last states (`Φ(s_T) − Φ(s_0)` when γ = 1). This is a theorem, not an empirical claim. Modern reaffirmation: arXiv:2502.01307 (practical effectiveness still depends on Φ’s scaling).	Very high (correctness); risk is entirely implementation
RUDDER — learned, return-equivalent redistribution	arXiv:1806.07857	Train an auxiliary model to predict final return from trajectory prefixes (using the project’s own ~100-200 verified solves), use its temporal differences as a per-step reward — a learned alternative to hand-specifying Φ, with the same correctness guarantee	High (theory), directly actionable given existing verified-solve data
Ground-truth per-stage verifiers / VPR / CM2 (checklist rewards)	arXiv:2605.10325 (VPR), arXiv:2602.12268 (CM2)	Decompose the terminal task into a checklist of objectively verifiable sub-criteria (sandbox-checked, not judge-opinion) — the safe form of stage reward, symmetric to the flag verifier’s own contract. VPR’s own honest caveat: benefit “depends on the reliability of the verifier,” and extension “to less structured, open-ended environments… remains an open challenge” — directly relevant to CTF’s own stage 3 (§4).	High, with an explicit open-environment caveat
Curriculum learning & sequencing — orthogonal to reward, lowest risk of the whole menu	Bengio et al., ICML 2009 (pre-arxiv, foundational); h1 arXiv:2510.07312; FastCuRL arXiv:2503.17287; BPO arXiv:2508.03018	Order training by difficulty (single-endpoint before decoy-heavy; 1-hop exploit before 2-hop pivot) — touches no reward function at all. h1: curriculum + pure outcome-only reward gets an exponential sample-complexity gain. BPO explicitly reports vanilla GRPO on their sparse-reward setting yields only marginal improvement without it.	High — the cheapest, least-risky lever in this entire menu
Kill-chain-staged reward (cyber-defense red-teaming)	arXiv:2605.17075 (May 2026)	Academic cybersecurity-LLM training work, cited for context only, not a basis (per standing rule). Frozen LLM planner emits kill-chain intent; a trained RL controller gets reward “aligned with kill-chain progression.” Superficially the closest thing in the literature to Option B, but it’s brand-new, unreplicated, doesn’t ablate “staged reward” from “hybrid architecture,” and — per the standing rule — carries no evidentiary weight for this project regardless. The actual basis for Option B’s viability is the general HRL/theory rows above (ArCHer, HiPER, potential-based shaping).	N/A — context only
Classical (non-LLM) staged reward for pentest	DRLRM-PT (reward machines), DOI 10.1109/ijcnn60899.2024.10650368; node-fragility shaping, DOI 10.3390/electronics13214311	Academic pentest RL, cited for context only, not a basis (DRLRM-PT is explicitly named in the project’s standing rule). Reports staged/dense reward helping sample efficiency in small, discrete, formally-specified MDPs (network graphs, no language, no tool-calling) — a structurally different regime, and not where this project’s “staged reward can help” claim rests. That claim’s actual basis is the potential-based-shaping theorem and RUDDER above.	N/A — context only
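The telescoping argument behind the potential-based shaping row can be written out in a few lines. This is a sketch in that row's notation (Φ, γ, F), with two symbols introduced here for illustration: r_t for the environment reward at step t and T for the episode length.

```latex
\begin{aligned}
\sum_{t=0}^{T-1} \gamma^{t}\bigl(r_t + F(s_t, s_{t+1})\bigr)
  &= \sum_{t=0}^{T-1} \gamma^{t} r_t
   + \sum_{t=0}^{T-1} \bigl(\gamma^{t+1}\Phi(s_{t+1}) - \gamma^{t}\Phi(s_t)\bigr) \\
  &= \sum_{t=0}^{T-1} \gamma^{t} r_t \;+\; \gamma^{T}\Phi(s_T) \;-\; \Phi(s_0).
\end{aligned}
```

The Φ(s_0) term is identical for every policy started from the same state, so it cannot change which policy is best. The γ^T Φ(s_T) term vanishes if the potential is defined to be zero at episode end, which is an assumption worth making explicit whenever Φ is implemented for finite episodes.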

The honest limits — where decomposed training breaks, concretely

The reward-hacking evidence against naive per-stage reward is thick and convergent, not one paper. The moment a “verifier” stops being a deterministic ground-truth check and becomes a learned/judge score, this project’s own confirmed lesson (SFT-induced FLAG{} confabulation from a loose format-matcher) generalizes into a much larger literature:

PURE / Stop Summation (arXiv:2504.15275) names the mechanism precisely: the canonical summation-form credit assignment (additive, per-step reward) “easily induces LLMs to hack steps with high rewards.”
Reward Under Attack (arXiv:2603.06621) shows SOTA PRMs function as “fluency detectors rather than reasoning verifiers” — >0.9 PRM reward on trajectories with <4% ground-truth accuracy.
Gao et al. (arXiv:2410.15115) — combining a learned PRM/ORM with success reward can hurt relative to success-reward-only, via “repeating correct but unnecessary steps.”
PRIME’s own authors (arXiv:2502.01456) state the central open problem is that process labels are “prohibitively expensive… making [PRMs] particularly vulnerable to reward hacking,” and route around a separately-trained PRM entirely for this reason.
MONA (arXiv:2501.13011, DeepMind) generalizes this: multi-step reward hacking can occur even when no single step looks bad to a human/judge overseer.
The ancestor of all of it: the bicycle-shaping failure (Randlov & Alstrom, ICML 1998, pre-arxiv) — a non-potential-based “looks like progress” bonus taught an agent to ride in tight circles farming the bonus instead of reaching the goal. Same species as this project’s own SFT confabulation lesson: reward emitting the right-looking pattern rather than deterministically verified success, and the policy learns to farm the pattern.

Ceiling-capping is a real cost of decomposed training too, symmetric to Llama 4’s SFT/DPO warning. HIRO (arXiv:1805.08296) needed an explicit off-policy correction specifically because a high-level subgoal’s meaning drifts as the lower-level policy improves during training — without it, the system converges on subgoals that are locally useful but cap out below the true optimum. Translated to F1–F4: an “exploit-only” sub-policy trained against a synthetic “endpoint identified” subgoal risks converging on the shallowest exploit that satisfies the boundary — which is precisely the F2 shallow-probing failure this project already observes, not a hypothetical.

Academic, cited for context only, not a basis: the domain-specific CTF/pentest training papers named above happen to be monolithic-outcome or curriculum-decomposed rather than reward-decomposed, and the one adjacent paper doing genuine staged reward in an LLM+RL security loop (arXiv:2605.17075) is unreplicated — but neither observation is the basis for caution here. The actual, load-bearing case against naive per-stage reward is the general reward-hacking convergence immediately above (PURE, Reward Under Attack, Gao et al., PRIME, MONA, HIRO) — that literature alone is sufficient to warrant the conditional verdict in §5, independent of what the CTF-training corpus does or doesn’t show.

4. The key split — eval-decomposition vs training-decomposition

This is the load-bearing distinction. They are not the same decision, and the evidence supports very different confidence levels for each.

	Eval-decomposition	Training-decomposition
What it means	Measure per-stage reached/not-reached, on top of the existing `flag_verified` terminal signal	Replace/augment the terminal reward with per-stage rewards, curricula, or hierarchically-trained sub-policies
Training-loop change	None — a read-only pass over traces already generated	The reward function, or the training architecture, or both
Cost	Near-free (one aggregation pass)	Real engineering + real risk surface
Evidence for it	AgentBoard’s “progress rate” (arXiv:2401.13178, NeurIPS 2024 Oral) — “current evaluation frameworks mostly focus on the final success rate, revealing few insights”; MAST’s 14-mode/3-category taxonomy (arXiv:2503.13657, κ=0.88); AgentErrorTaxonomy — root-cause diagnosis alone (no reward change) buys +24% all-correct accuracy (arXiv:2509.25370); phase-aligned taxonomies independently reinvented in a different (non-security) domain (arXiv:2508.13143); tau-bench’s pass^k (arXiv:2406.12045) — separates “never clears” from “unreliable,” composable with a phase vector. This general, non-security agent-eval literature is the basis for the verdict below on its own. Academic, cited for context only, not a basis: Cybench subtasks (arXiv:2408.08926), AutoPenBench milestones (arXiv:2410.03225), NYU CTF Bench (arXiv:2406.05590), and EnIGMA’s “soliloquizing” fabrication finding (arXiv:2409.16165, ICML 2025) happen to converge on a near-identical F1–F4-shaped split, which is a reassuring coincidence, not evidence this project’s verdict depends on.
Evidence against it	None — every paper that ships it treats it as strictly additive, diagnostic-only, never a substitute for the terminal check	The 2025–2026 PRM-hacking convergence above (§3); ceiling-capping via HIRO-style subgoal drift. (The observation that domain-specific CTF/pentest training papers are uniformly monolithic when they report strong numbers is academic context only, not part of this basis — see §3.)
Honest limits	Matcher/judge reliability is the new bottleneck one level down — an LLM-judge-scored phase check inherits some of the flag-matcher’s fragility (MAST’s own top category is “task verification” failure); require a corroborating TOOL-kind span, not LLM-only reasoning, for any phase claiming environment interaction. Per-stage sample sizes shrink fast in a funnel — apply the same pass@k confidence-interval discipline already used project-wide. Phase credit can mislead if not cross-checked against the final flag (treat it as diagnostic under `flag_verified`, never a replacement).	Safe only via a provably policy-invariant mechanism (potential-based shaping / RUDDER) — anything softer (a per-step LLM-judge “does this look like competent recon” score) inherits a decade-plus of documented gaming behavior.
Verdict	Yes, unconditionally, do it now.	Conditional — see §5.

The theory’s own framing of why these are different decisions: per-stage evaluation is just better logging — nothing is being optimized against it, so it carries none of the correctness burden. Per-stage reward is where every failure mode above lives, because now something in the loop is being optimized against the signal. This is why the project brief is right to force these into two separate decisions.

The one empirical finding that turns this from philosophy into an operational rule. A controlled study across the RL design space on TravelPlanner finds: “reward and algorithm choices are scale-dependent — smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense [outcome-only-adjacent] rewards.” arXiv:2603.21972. This is directly checkable via the eval funnel: is the current baseline “occasionally stumbles onto stage 3” (favors staged help) or “reliably reaches stage 3/4, fails to convert” (favors leaving outcome reward alone and attacking execution depth via data/SFT)? The project’s own diagnosis — “largely an execution gap” — leans toward the latter, but this is an empirical call the funnel should confirm, not an assumption to bake in from literature alone.
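The difference between “occasionally stumbles onto” and “reliably reaches” can be put in numbers with the two combinatorial estimators this chapter already leans on: pass@k (at least one of k attempts succeeds) and tau-bench's pass^k (all k attempts succeed). The sketch below assumes n independent attempts per challenge with c of them reaching the phase in question; the function names are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate P(at least one of k attempts succeeds) from c successes in n."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n, c, k):
    """Estimate P(all k attempts succeed) from c successes in n (pass^k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# "Occasionally stumbles onto stage 3": 2 of 10 attempts reach it.
stumbles = (pass_at_k(10, 2, 5), pass_hat_k(10, 2, 5))

# "Reliably reaches stage 3": 9 of 10 attempts reach it.
reliable = (pass_at_k(10, 9, 5), pass_hat_k(10, 9, 5))
```

A phase with high pass@k but near-zero pass^k is reached by luck rather than by a stable skill; a phase where both are high is not the bottleneck, and attention moves to the next stage in the funnel.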

5. The verdict for this project

Question	Verdict	Confidence
EVAL-decomposition — measure recon / endpoint-discovery / vuln-ID / exploit / pivot independently, on top of `flag_verified`?	Yes. Do it now, unconditionally. Near-free (read-only pass over existing traces), zero effect on training dynamics, and independently reinvented across the general agent-eval literature that hit this problem before this project (§4).	High
TRAINING-decomposition — replace/augment the terminal flag reward with four separate per-stage rewards, curricula, or policies?	No, not as a wholesale redesign — but a narrow, provably-safe form (potential-based milestone shaping, layered on top of the flag reward, never instead of it) earns its keep once the eval funnel shows an exploration-dominated bottleneck.	Medium (conditional, not universal)

The concrete next step this whole verdict depends on

The project’s own PTES matcher schema (benchmark/challenges/*/challenge.json, ptes.<phase>.steps[]) was arrived at independently, before this literature review, on the general non-security basis in §4 (AgentBoard, MAST, AgentErrorTaxonomy). Academic, cited for context only, not a basis: it happens to structurally resemble the Cybench-subtask / AutoPenBench-milestone design too. What’s missing is aggregation: run the existing triage subagent over completed runs and emit one funnel row per challenge (five phase-reached booleans + an exploit-given-vuln-found conditional rate), rolled up into a corpus-level funnel. This single aggregation is the input every downstream decision below depends on — it settles empirically whether the ~100-200/1000 solve rate is F1-dominated, F2/F3-dominated, or F4-dominated, which the flag_verified column alone cannot supply no matter how much data accumulates.
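The aggregation described above might look like the following sketch. Everything in it is an assumption made for illustration: the phase names, the shape of a run record (a dict of phase to bool), and the reading of “phase reached” as “reached in at least one run of that challenge”. It does not reflect the real PTES schema or the triage subagent's output format.

```python
# Hypothetical phase names, ordered as in the pipeline diagram in section 1.
PHASES = ("recon", "endpoint_discovery", "vuln_id", "exploit", "pivot")


def run(*flags):
    """Build one run record from five booleans, in PHASES order."""
    return dict(zip(PHASES, flags))


def funnel_row(runs):
    """One funnel row for one challenge, aggregated over its runs."""
    reached = {p: any(r[p] for r in runs) for p in PHASES}
    vuln_runs = [r for r in runs if r["vuln_id"]]
    # Conditional rate is undefined (None) if no run ever found the vuln.
    rate = sum(r["exploit"] for r in vuln_runs) / len(vuln_runs) if vuln_runs else None
    return {"reached": reached, "exploit_given_vuln": rate}


def corpus_funnel(runs_by_challenge):
    """Per-challenge rows plus the fraction of challenges reaching each phase."""
    if not runs_by_challenge:
        raise ValueError("no challenges to aggregate")
    rows = {cid: funnel_row(runs) for cid, runs in runs_by_challenge.items()}
    n = len(rows)
    fractions = {p: sum(row["reached"][p] for row in rows.values()) / n for p in PHASES}
    return rows, fractions


rows, fractions = corpus_funnel({
    "chal-a": [run(True, True, True, False, False), run(True, True, True, True, False)],
    "chal-b": [run(True, False, False, False, False)],
})
```

Where the corpus-level fractions drop most sharply indicates the dominant bottleneck: a cliff before `vuln_id` points at F1, a low `exploit_given_vuln` at F2/F3, and a cliff at `pivot` at F4.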

The recommended recipe, cheapest and safest first

Decompose the eval fully, now, unconditionally (§4). Cross with the project’s own pass@k methodology rather than one aggregate pass@5, per tau-bench’s pass^k precedent.
Keep the terminal flag reward as the ground-truth backbone, unconditionally. Nothing in this dossier argues for demoting it below a milestone signal.
Run rejection-sampling SFT on own verified solves as already planned — but keep it light. The general, non-security basis for this move is the frontier self-training line: STaR (arXiv:2203.14465, Zelikman et al. 2022) — the seminal “generate, keep only what’s verified correct, fine-tune, repeat” loop — and ReST^EM (arXiv:2312.06585, Singh et al., DeepMind 2023) — the frontier-lab scaling result showing this expectation-maximization-style self-training on a model’s own correct samples beats training on human data alone, on math/code reasoning, no cybersecurity domain involved. Academic, cited for context only, not a basis: CTF-Dojo and Cyber-Zero report the same pattern (~500 verified trajectories → double-digit gains, no staged reward) inside the CTF/pentest domain specifically — a reassuring domain-match, not the reason to do this. Respect the Llama 4 warning: don’t over-train on the easy/repetitive subset — it narrows the exploration space the subsequent RL stage needs.
Add curriculum sequencing before touching the reward function at all. The single lowest-risk lever available — no new reward, hence none of §3’s hacking surface. Order by whichever axis the funnel identifies as the bottleneck.
Only if the funnel shows an F1 (exploration)-dominated bottleneck, and only via a ground-truth mechanism: add potential-based milestone shaping on top of the terminal reward. Define Φ(s) as a monotonic count of deterministically-verified stage completions (same verification contract the flag oracle already uses — server-side checks, not judge opinions), paid once per stage-transition, never re-collectable. Do not build this for stage 3 (vuln identification) specifically — VPR’s own authors flag exactly this stage-shape (“identify which of several candidates is vulnerable”) as the “open, unstructured” regime their method doesn’t yet solve well; keep that stage eval-only until a genuine deterministic check exists.
Explicitly do NOT build a learned/LLM-judge per-stage reward model. Every citation in §3’s honest-limits section converges on this being the failure mode to avoid.
If the funnel instead shows an F2/F3 (execution-depth / tool-policy)-dominated bottleneck — which the project’s own current diagnosis (“largely an execution gap”) suggests is more likely — the evidence base points away from reward decomposition and toward better trajectory curation and more/better SFT data, not a training-loop change.
When entropy collapses under GRPO (the project’s own stated graduation trigger), watch stage-transition tokens specifically — a badly-shaped milestone reward is an easy, low-entropy shortcut to farm, and would accelerate collapse.
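The milestone-shaping item in the recipe above can be sketched as follows, taking Φ as the count of distinct deterministically verified stages. The function names, the event encoding (one verified stage name or `None` per turn), and the `zero_terminal_potential` switch are hypothetical. The switch encodes an assumption added here: the potential is treated as zero at episode end so that the shaping terms cancel exactly.

```python
def phi(verified_stages):
    """Potential: number of distinct stages verified so far (never decreases)."""
    return float(len(verified_stages))


def shaped_rewards(stage_events, flag_reward, gamma=1.0, zero_terminal_potential=True):
    """Per-turn reward = environment reward + gamma * phi(s') - phi(s).

    stage_events: per turn, the stage a deterministic check verified on that
    turn, or None. The flag reward arrives only on the final turn.
    """
    verified = set()
    rewards = []
    last = len(stage_events) - 1
    for t, event in enumerate(stage_events):
        before = phi(verified)
        if event is not None:
            verified.add(event)  # a repeated stage adds nothing: not re-collectable
        after = phi(verified)
        if t == last and zero_terminal_potential:
            after = 0.0  # assumption: potential is zero once the episode ends
        env_reward = flag_reward if t == last else 0.0
        rewards.append(env_reward + gamma * after - before)
    return rewards


events = [None, "recon", "recon", "endpoint_discovery", None]

# With the terminal potential zeroed, the shaped return equals the flag reward.
zeroed = shaped_rewards(events, flag_reward=1.0)

# Without it, the return keeps a residual phi(s_T) - phi(s_0) boundary term.
residual = shaped_rewards(events, flag_reward=1.0, zero_terminal_potential=False)
```

Because the set only grows when a new stage is verified, re-triggering the recon check on the third turn pays nothing. A shaping scheme that paid on every check firing would be the farmable bonus the bicycle example warns about.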

Deliberately not recommended as a first move: standing up four independently-trained sub-policies with four separate critics (the full options/ArCHer/HiPER/FeUdal-style architectural decomposition). Real, actively converging in the literature, and MiRA’s 6.4%→43.0% number is the strongest existence-proof in this whole dossier that monolithic reward can leave a large gap on the table — but every one of these is validated on web-navigation or generic agentic benchmarks with crisp, cheap-to-verify milestones, not on an offensive-security CTF corpus. Highest-upside, least-validated-for-this-domain lever here — a candidate for a later, small, gated experiment, not step one.

The decision, as a diagram

```mermaid
flowchart TD
  Start["Failing challenge / corpus\nunder diagnosis"] --> EvalDecomp["Step 1 — decompose the EVAL\n(PTES funnel, near-free)\ndo this unconditionally"]

  EvalDecomp --> Funnel{"Funnel shows which\nbottleneck dominates?"}

  Funnel -->|"F1: rarely reaches\nthe vulnerable endpoint"| ScaleCheck{"arXiv:2603.21972 —\nweak policy, capacity-limited?"}
  Funnel -->|"F2/F3: reaches it,\nfails to convert / clumsy tools"| Mono["Stay monolithic.\nInvest in trajectory curation +\nrejection-sampling SFT data\n(STaR / ReST-EM pattern)"]
  Funnel -->|"F4: no pivot after\na foothold"| Curric["Curriculum first\n(1-hop before 2-hop pivot chains,\nno reward change)"]

  ScaleCheck -->|"yes — occasionally\nstumbles onto it"| Shape["Potential-based milestone\nshaping ON TOP OF the flag reward\n(Ng/Harada/Russell 1999 — provably\npolicy-invariant), NOT stage 3"]
  ScaleCheck -->|"no — already capable,\njust unreliable"| Mono

  Shape --> Guard["Guard: ground-truth verifier only,\nnever a learned/LLM-judge score\n(PURE / Reward-Under-Attack / MONA)"]
  Mono --> Entropy["Watch entropy at GRPO\ngraduation regardless of path taken"]
  Curric --> Entropy
  Guard --> Entropy

  classDef safe fill:#132b22,stroke:#34d399,color:#eafaf3;
  classDef risk fill:#3a1414,stroke:#f87171,color:#fde8e8;
  class EvalDecomp,Curric,Mono safe;
  class Shape,Guard risk;
```

Contested point, stated plainly: there exists one paper doing genuine kill-chain-staged RL reward inside an LLM+RL hybrid (arXiv:2605.17075, cyber-defense red-teaming) that on its face looks like support for training-decomposition on an adjacent task. Per the project’s standing rule it carries no evidentiary weight here regardless — it’s academic cybersecurity-LLM work, cited for context only. The Medium-confidence, conditional verdict above does not rest on it; it rests on the general theory (potential-based shaping’s policy-invariance guarantee, HIRO’s ceiling-capping mechanism, the PRM-hacking convergence) and on this project’s own diagnosis. Demoting this citation changes nothing about the verdict.

Cross-links

Diagnosing the gap — a scientific framework — the pass@k / Pass@(k,T) / Cover@τ protocol that tells you which gap (knowledge / execution / exploration) a challenge subtype actually has; run this before deciding whether an F1-dominated funnel result calls for exploration-RL or just more samples. That chapter’s routing test and this chapter’s eval-funnel are complementary diagnostics, not competing ones.
RL that creates value — long-horizon, exploration, reasoning, novelty — the mechanics of how to fix an exploration or credit-assignment gap once diagnosed here (GiGPO step-level credit, DAPO, entropy instrumentation, ArCHer, curriculum-band filtering) — this chapter answers whether to decompose; that one answers how to execute the fix on the training-loop mechanics.
Agentic & multi-turn RL — the missing category — the training-loop shape (turn as the unit of advantage) that any of §5’s mechanisms (potential-based shaping, curriculum) has to be implemented inside.
Contested edges & landmines — the “does RL create capability or just amplify it” fight this chapter’s scale-dependence finding (arXiv:2603.21972) directly informs.

Bibliography (all traced to a verified source file, 2026-07-02)

Citation	arXiv / DOI	Role here
Sutton, “The Bitter Lesson”	no arxiv; incompleteideas.net	Don’t hand-author structure that plateaus
DeepSeek-R1 / R1-Zero	2501.12948	Rejects neural PRM at scale; reward-reliability admission
OpenAI Deep Research	system card, no arxiv	End-to-end beats manual orchestration
Llama 4 post-training	ai.meta.com blog, no arxiv	Heavy SFT/DPO caps RL exploration ceiling
Kimi k1.5	2501.12599	Outcome-only + long context beats PRM/MCTS/value-fn
Go-Explore	1901.10995 / Nature s41586-020-03157-9	Pure outcome RL can structurally fail to find sparse reward
OpenAI Five	1912.06680	Long-horizon precedent needed huge scale + denser reward
Options framework (Sutton/Precup/Singh 1999)	AIJ 112, DOI 10.1016/S0004-3702(99)00052-1	Seminal HRL / temporal abstraction
FeUdal Networks	1703.01161	Manager/Worker HRL, option-collapse fix
ArCHer	2402.19446	LLM-native 2-level value function HRL
HiPER	2602.16165	Hierarchical advantage estimation, +6.6–8.3%
MiRA	2603.19685	Milestone reward, 6.4%→43% WebArena-Lite
Ng, Harada, Russell — reward shaping	ICML 1999, no arxiv	Potential-based shaping theorem (policy-invariant)
Müller & Kudenko	2502.01307	PBRS effectiveness depends on potential scaling
RUDDER	1806.07857	Learned return-equivalent redistribution
Randløv & Alstrøm (bicycle shaping)	ICML 1998, no arxiv	Canonical non-potential-based shaping failure
Verifiable Process Rewards (VPR)	2605.10325	Safe ground-truth process reward, open-env caveat for stage 3
CM2 checklist rewards	2602.12268	Checklist-style verifiable sub-criteria
Curriculum Learning (Bengio et al.)	ICML 2009, no arxiv	Foundational curriculum citation
h1	2510.07312	Curriculum + outcome-only, exponential sample-complexity gain
FastCuRL	2503.17287	Context-length curriculum, entropy-collapse timing
BPO	2508.03018	Curriculum + rejection-sampling refine, near-identical to project plan
PURE / Stop Summation	2504.15275	Sum-form PRM hacking mechanism, named
Reward Under Attack	2603.06621	PRMs as fluency detectors, adversarial hackability
Gao et al., designing RL reward	2410.15115	Learned PRM+success reward can hurt vs success-only
PRIME	2502.01456	Authors’ own admission of PRM hacking vulnerability
MONA	2501.13011	Multi-step reward hacking even with no bad-looking single step
HIRO	1805.08296	Off-policy correction, HRL non-stationarity / ceiling-capping
AgentBoard	2401.13178	Progress-rate metric, general capability-decomposition principle
MAST	2503.13657	14-mode/3-category failure taxonomy
AgentErrorTaxonomy / AgentDebug	2509.25370	Root-cause diagnosis gains without reward change
Phase-aligned taxonomy (autonomous agents)	2508.13143	Independent-domain convergence on phase-keyed failure
Cybench (academic — context only, not a basis)	2408.08926	Subtask decomposition, eval-only — reassuring convergence with AgentBoard/MAST, not evidence relied on
AutoPenBench (academic — context only, not a basis)	2410.03225	Milestone taxonomy near-matching F1–F4, eval-only — same caveat
NYU CTF Bench (academic — context only, not a basis)	2406.05590	CTF benchmark family
EnIGMA (academic — context only, not a basis)	2409.16165	“Soliloquizing” fabrication failure mode
tau-bench	2406.12045	pass^k reliability decomposition
InterCode-CTF (academic — context only, not a basis)	2306.14898	Seminal monolithic-reward CTF environment
CTF-Dojo (academic — context only, not a basis)	2508.18370	Monolithic rejection-sampling SFT, +11.6% — domain-match only; real basis is STaR/ReST-EM below
Cyber-Zero (academic — context only, not a basis)	2508.00910	Monolithic, simulated env, +13.1% — same caveat
Pentest-R1 (academic — context only, not a basis)	2508.07382	Two-stage curriculum, monolithic per-stage reward
HackSynth-GRPO (academic — context only, not a basis)	2506.02048	Outcome-only GRPO sufficient for single-stage CTF
STaR	2203.14465	Seminal general (non-security) rejection-sampling self-training loop; basis for §5’s SFT-on-own-solves recipe
ReST-EM (“Beyond Human Data”)	2312.06585	DeepMind frontier-lab scaling result for self-training on own correct samples, math/code domain
Kill-chain-staged reward (red-teaming) (academic cybersecurity-LLM — context only, not a basis)	2605.17075	The one LLM+RL staged-reward paper, adjacent domain, un-ablated
DRLRM-PT (reward machine, pentest) (academic — context only, not a basis)	DOI 10.1109/ijcnn60899.2024.10650368	Classical RL, staged reward helps, non-LLM regime
Node-fragility reward shaping (academic — context only, not a basis)	DOI 10.3390/electronics13214311	Classical dense-reward pentest, non-LLM regime
DAgger	1011.0686	Compounding error / distribution drift theory
GAE	1506.02438	Bias/variance dial for advantage estimation
Credit Assignment survey	2312.01072	Separates credit assignment from exploration
Demystifying long-horizon tool-use RL	2603.21972	Scale-dependence: staged reward helps weak models only

Post-Training Field Notes