Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Introduction

This is a living field manual for post-training an agent — written for an engineer who wants to know it, then do it, not experiment blindly. It grows: each session adds or sharpens a chapter.

The problem it exists to answer

~1,000 CTF challenges. The agent solves ~100–200 and fails ~800. We have the harness, the workflow, the budget, and the backing to do anything. The bottleneck is not data — it’s knowing which fine-tuning method we want, and therefore what data to build, and why.

Everything here builds toward answering that from first principles.

How to read it

What’s canonical vs. what’s a teaching scaffold (read this once)

Being honest about provenance, because you’re becoming a researcher and the distinction matters:

  • Canonical, universal, you’ll find it in any RL/post-training text: the on-policy vs. off-policy axis, and the three learning paradigms (imitation / preference / reinforcement). These are load-bearing and not up for debate.
  • My teaching scaffold: any packaging that presents these as “N knobs you freely toggle.” The axes describe methods; they are not independent dials you combine — each named method is a fixed preset. An earlier interactive matrix implied free combination and produced nonsense for some products. That was the scaffold over-reaching. Corrected here: learn the one axis (on/off-policy) + the fixed method presets, not a combinatorial grid.

The one line to anchor on

Every method is the same move — push probability mass toward good behavior — differing only on whose distribution the data comes from (off- vs on-policy) and whether you also learn from failures.

Keep that; the rest is detail.

The 5-minute journey

Before the theory, feel the shape of it. This is the interactive walk-through — problem → reasoning → decision — embedded live. Scroll it; click the ladder rungs and answer the decision tool at the end.

If the frame is cramped, open it full-screen: assets/journey.html

Everything in that frame is expanded, corrected, and cited in the chapters that follow. The frame is the map; the chapters are the territory.

The one axis that predicts everything

If you internalize one thing, make it this: on-policy vs off-policy. It predicts which methods can fix which failures, and it’s the reason a colleague’s SFT run can move the eval by two points and stall.

Definitions (precise)

  • Off-policy: training targets are sampled from a distribution other than the model’s current policy π_θ — a human, a teacher model, a frozen dataset. The model raises the likelihood of sequences it did not generate.
  • On-policy: the training data is sampled from π_θ itself (the current weights), then scored/labeled. The model learns from its own rollouts.

This is standard RL vocabulary, not a framing I invented — see any policy-gradient treatment; the LLM-specific consequences are laid out formally in the imitation-learning reduction below.

Why off-policy imitation is structurally blind to execution gaps

The mechanism, as a flow:

graph LR
  A["Train on the teacher's states"] --> B["Deploy: π_θ drives,<br/>visits π_θ's OWN states"]
  B --> C["One slip → a state the<br/>teacher never visited"]
  C --> D["No training signal there<br/>→ error compounds"]
  D --> B

This isn’t hand-waving — it’s a theorem. Ross, Gordon & Bagnell, “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning” (AISTATS 2011, arXiv:1011.0686), Thm 2.1: a behavior-cloned policy with per-step error ε incurs total cost bounded by J(π̂) ≤ J(π*) + ε·T²quadratic in horizon T, because a single deviation moves you to states off the expert’s distribution where you have no supervision, and errors accrue at up to unit cost for the rest of the episode. Their DAgger correction — aggregate data from the learner’s own induced state distribution — restores near-linear O(ε·T) regret.

Translate to your setting: an execution failure happens, by definition, in a state π_θ reaches on its own. Off-policy data is — structurally — data about states π_θ doesn’t reach. So off-policy SFT optimizes correctness in the wrong region. On a long agentic CTF trajectory (large T), the vs T gap is the whole story.

The fix, stated once

Make the data on-policy: sample from π_θ, then score those samples. Every method that “fixes execution” — rejection-sampling FT, GRPO/RLVR, on-policy distillation — is a different way of doing exactly that. The on-policy distillation line makes the connection explicit: it exists specifically to kill the train/inference mismatch of fixed-dataset KD by sampling from the student during training (GKD, Agarwal et al., arXiv:2306.13649).

The corollary you’ll use constantly

  • Failure lives in π_θ’s own distribution (it solves sometimes, fails often) → on-policy method.
  • The correct behavior is absent from π_θ entirely (never appears at high N) → nothing on-policy to reinforce → you must inject off-policy (SFT / teacher data / a tool).
  • The behavior exists but is mis-ranked → preference method.

That routing is the spine of The decision.

What “data” actually means for an agent

You asked the right question earlier: “demonstration = the answer — are you talking about trajectories?” Yes. Being concrete about the data object dissolves most of the confusion, because each method eats a different-shaped object, and for an agent those shapes are not what the chatbot literature implies.

A “demonstration” is a trajectory, not an answer

  • Chatbot SFT example = (prompt → ideal response text).

  • Agent SFT example = the whole trajectory:

    system prompt
    → [assistant: reasoning] → [tool_call: shell "nmap …"] → [tool_result: <output>]
    → [assistant: reasoning] → [tool_call: http_request …] → [tool_result: <output>]
    → …
    → [assistant: submit_flag("FLAG{…}")]
    

The training target is the path, not the flag. That is what SFT and rejection-sampling FT imitate token-by-token.

Loss-masking: don’t train the model to predict the world

Tool outputs / observations are part of the sequence but are not the policy’s own tokens — they come from the environment. Standard practice is to mask the loss on prompt and observation spans and only compute loss on the model-generated reasoning + tool_call + submit tokens. Otherwise you teach the model to hallucinate stdout. (This is the same principle as instruction-tuning masking the prompt; for agent traces it’s applied to every observation span.) The project’s own trajectory-synthesis note treats observation-masking as a first-class filter step — see lessons/post-training/verified-trajectory-synthesis-recipe.md in the shared memory pool. (Caveat: that recipe’s specific numbers derive from an unverified third-party report — trust the procedure, not the figures.)

The data shapes, per family (forward reference)

FamilyTraining object it consumesWhere it comes from
Imitation (SFT, distillation)full trajectories (yours, a teacher’s, or human)curate / run a teacher
Rejection-sampling FTyour own verifier-passed trajectoriesyou already generate these
Preference (DPO/KTO)(chosen, rejected) trajectory pairs, or tagged good/badyour solved vs failed logs
RL (GRPO/RLVR)prompts + a reward/verifier fn — no fixed datasetyour challenges + verify()

This table is the hinge of the whole book. It’s expanded in Method → Data, which is the chapter that actually addresses your stated bottleneck (“we don’t know what data we want”). The short version: you don’t choose data and then a method — you choose the method, and the method dictates the data object.

The family map

Three families by what signal they learn from, plus one orthogonal axis (PEFT) that is a delivery mechanism, not a learning signal. Every named method is a fixed preset on the on/off-policy axis from Foundations — you don’t freely combine axes.

graph TD
  ROOT["Post-training<br/>push mass toward good behavior"]
  ROOT --> IM["IMITATION<br/>signal = demonstrations"]
  ROOT --> PR["PREFERENCE<br/>signal = comparisons A≻B"]
  ROOT --> RL["REINFORCEMENT<br/>signal = reward / verifier"]

  IM --> SFT["SFT"]
  IM --> DIST["Distillation<br/>off-policy / on-policy"]
  IM --> RS["Rejection-sampling FT<br/>= 'RL without RL'"]

  PR --> RLHF["RLHF (RM + PPO)"]
  PR --> DPO["DPO · KTO · IPO · ORPO · SimPO"]

  RL --> PPO["PPO"]
  RL --> GRPO["GRPO → GSPO / DAPO"]
  RL --> RLVR["RLVR (verifiable reward)"]
  RL --> AG["Agentic / multi-turn RL"]

  PEFT["PEFT: LoRA · QLoRA · DoRA<br/>a HOW, applied to any of the above"]
  ROOT -.delivery.-> PEFT

What’s actually load-bearing in 2026 (verified)

The status column below is from a fresh Exa pass over model tech reports + lab blogs (2025–2026), not recalled — sources cited per row throughout the method chapters.

Method2026 statusAnchor
SFTMainstream, universal — stage 0 of every recipe
Off-policy distillationMainstream — DeepSeek distills R1 → V3/V3.2 as a named stagearXiv:2412.19437, arXiv:2512.02556
On-policy distillationNiche / promising — NOT yet confirmed in any frontier lab’s production recipeGKD arXiv:2306.13649; Thinking Machines blog 2025-10-27
Rejection-sampling FTMainstream — named stage in Llama 3 & DeepSeek-R1arXiv:2407.21783, arXiv:2501.12948
RLHF (RM+PPO)Mainstream at proprietary labs (Gemini 2.5, GPT-5)arXiv:2507.06261
DPOMainstream — Llama 3’s offline preference stagearXiv:2407.21783
IPO/KTO/ORPO/SimPONiche — OSS/research tooling; no flagship names them as primaryarXiv:2503.11701
GRPOMainstream — the reasoning-RL defaultarXiv:2402.03300
RLVRMainstream — arguably the defining 2025-26 techniquearXiv:2501.12948, arXiv:2507.06261
GSPO (Qwen3)Mainstream — first GRPO-successor with a flagship behind itarXiv:2507.18071
DAPOOSS-tooling mainstream; ByteDance-origin, not confirmed elsewherearXiv:2503.14476
PRM (process reward)Niche — explicitly rejected for R1 (step-level reward hacking)arXiv:2501.12948
Rubric/critic outcome rewardMainstream & growing — the real replacement for PRMarXiv:2507.06261
Agentic / multi-turn RLMainstream & the frontier edge — see its own chapterDeep Research; Kimi K2/K2.5
LoRA/QLoRA/DoRAMainstream in the applied layer; labs post-train flagships full-parameterarXiv:2106.09685
Self-playExperimental — no confirmed frontier-lab production use as of this pass

Read the family chapters for the mechanism + “what data it eats” + when to reach for each.

Imitation — SFT · distillation · rejection sampling

Signal = demonstrations. Objective = cross-entropy on target tokens. What differs across the three is whose trajectories you imitate — which is the on/off-policy axis doing all the work.

Per-method template: what · data it eats · on/off-policy · when · gotcha · cite.

SFT (Supervised Fine-Tuning)

  • What: MLE on (prompt → target); for agents, target = a full trajectory (What “data” means). The original instruction-tuning result is InstructGPT (arXiv:2203.02155).
  • Eats: curated/human/teacher demonstrations.
  • Policy: off-policy (targets aren’t π_θ’s samples).
  • When: inject a capability or format the model lacks; establish a cold-start before RL.
  • Gotcha: off-policy ⇒ blind to execution gaps (the εT² compounding, Foundations) and it overwrites — this is the “SFT replaces capabilities” half of “Scalpel vs Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them” (arXiv:2507.10616). Worse on small models (less capacity to absorb without forgetting).

Distillation (a kind of SFT — the teacher supplies the demonstrations)

Knowledge distillation originates with Hinton et al., arXiv:1503.02531. Two variants, and the split is the on/off-policy axis:

  • Off-policy distillation = SFT on the teacher’s completions. Mainstream: DeepSeek transfers R1’s reasoning into V3/V3.2 as a named post-training stage (DeepSeek-V3, arXiv:2412.19437; V3.2, arXiv:2512.02556); “distilled from GPT-4/R1” datasets are how most small OSS models get capability.
  • On-policy distillation = student samples its own rollouts, teacher grades them densely (reverse-KL per token). Fixes the fixed-dataset train/inference mismatch (GKD, arXiv:2306.13649: +90% relative on GSM8K vs supervised-KD). Thinking Machines’ 2025 write-up reports Qwen3-8B ← Qwen3-32B reaching ~70% AIME’24 in 150–200 steps at ~9–30× less compute than RL-from-scratch (thinkingmachines.ai, 2025-10-27).
    • Honest status: niche / promising, not lab-confirmed. As of a 2026 pass, no frontier lab (OpenAI/Anthropic/Google/DeepSeek/Qwen) has stated on-policy distillation as its production recipe — evidence is GKD + one lab blog. Treat the efficiency numbers as directional, not settled. (This corrects an earlier over-strong “sleeper” framing.)
  • Eats: teacher completions (off) / teacher-graded student rollouts (on). Requires a teacher genuinely better at your task.

Rejection-sampling FT (“RL without RL”)

  • What: sample N completions from π_θ, keep verifier-accepted winners, SFT on them; iterate. The lineage: STaR (arXiv:2203.14465), ReST (arXiv:2308.08998), RAFT (arXiv:2304.06767), RFT (arXiv:2308.01825).
  • Eats: your own verifier-passed trajectories — which for a CTF harness with a flag check you already generate.
  • Policy: on-policy data, SFT update. It’s the first-order special case of policy gradient (reward∈{0,1}, upweight winners).
  • When: an execution gap, and you have a verifier but no stronger teacher. Cheapest on-policy move; reuses your SFT pipeline.
  • Gotcha (measurable graduation trigger): positives-only ⇒ policy-entropy collapse — fast early gains then plateau. GRPO’s real edge over it is not group-normalization (ablated → negligible) but discarding all-same-reward groups (implicit filtering). See “A Minimalist Approach to LLM Reasoning: From Rejection Sampling to Reinforce” (arXiv:2504.11343). Watch entropy; when it collapses, graduate to GRPO/RLVR.
  • Production proof: explicit named stage in Llama 3 (arXiv:2407.21783) and DeepSeek-R1 (~800K rejection-sampled examples between its two RL stages, arXiv:2501.12948).

Preference — RLHF · DPO · KTO

Signal = comparisons (A ≻ B). This family exists for the case where you cannot write verify() — “helpful / harmless / on-brand” has no programmatic checker, but a human (or an AI judge) can rank two outputs. Load-bearing property: preference methods reshape ranking over behaviors π_θ can already produce — they inject no new capability (lessons/post-training/dpo-kto-for-agent-tool-selection.md, shared memory).

RLHF (reward model + PPO)

  • What: train a reward model on preference pairs (Bradley-Terry), then optimize π_θ against it with PPO + KL-to-reference. The canonical pipeline is InstructGPT (arXiv:2203.02155).
  • Eats: preference pairs → a learned scalar reward.
  • Still alive in 2026, not dead: Gemini 2.5 runs an explicit Reward-Model + Critic + RL loop (“RLF”, arXiv:2507.06261 §2.4); GPT-5’s sycophancy fix scores conversations and uses that as a training reward (OpenAI GPT-5 system card / model-training page).
  • Gotcha: a learned RM has parameters to exploit → reward hacking. Deterministic verifiers (RLVR) avoid this; see the gameability ladder in Contested edges.

DPO and the direct-preference family

  • What: skip the RM + RL loop — a closed-form loss directly raises logπ_θ(chosen) − logπ_θ(rejected) against a frozen reference, provably equivalent to the RLHF objective under Bradley-Terry (DPO, arXiv:2305.18290). Key hyperparameter: β (KL strength).
  • Eats: (prompt, chosen, rejected) triples.
  • Policy: off-policy by default (pairs usually from another model / earlier checkpoint) — its weakness; iterative/online DPO resamples from current π_θ to make it on-policy.
  • Production proof: Llama 3 chose DPO over PPO for its offline preference stage for stability/scalability at their scale, and runs it iteratively (their “iTeC” = rejection-sampling + SFT/DPO/IPO + online RL, several rounds) (arXiv:2407.21783).

Variants and their niche

  • KTO (arXiv:2402.01306): learns from unpaired good/bad labels (Kahneman-Tversky value model) — no matched pairs needed. This fits mined agent logs exactly (a pile of failed runs + a pile of clean solves).
  • IPO (arXiv:2310.12036) stabilizes DPO’s tendency to collapse both logprobs at high β; ORPO (arXiv:2403.07691) folds preference into SFT with no reference model; SimPO (arXiv:2405.14734) drops the reference via length-normalized reward.
  • Honest status: IPO/KTO/ORPO/SimPO are niche — real, used in fine-tuning shops and ablated in Tülu 3, but no Llama/Qwen/DeepSeek/GPT/Claude/Gemini tech report names them as the production choice (survey: arXiv:2503.11701). Plain DPO + iterative DPO are the mainstream ones.

RLAIF / Constitutional AI

  • What: replace human preference labels with AI feedback against a written constitution (Constitutional AI, arXiv:2212.08073).
  • Status: mainstream at Anthropic (it is the core method) and partially adopted at Google (Gemini 2.5 safety is “loosely inspired by Constitutional AI”, arXiv:2507.06261). 2026 refinement: Anthropic now teaches the constitution via synthetic document fine-tuning (SDF) → SFT → RL, because “demonstrating desired behavior is insufficient — the model must learn why” (alignment.anthropic.com, “teaching Claude why”, 2026).

Reinforcement — PPO · GRPO · RLVR

Signal = reward / verification. Fully on-policy, online, uses the whole reward landscape (push winners up and losers down). Most powerful, most expensive/unstable. This is the family driving every 2025–2026 reasoning model.

PPO

  • What: clipped-surrogate policy gradient with a value/critic network + KL-to-reference (arXiv:1707.06347).
  • Eats: prompts + a reward (learned RM or verifier).
  • 2026 status: still used for classic preference-RL at the proprietary labs, but declining share for reasoning-RL (the critic is expensive; GRPO/GSPO replaced it there).

GRPO (Group Relative Policy Optimization)

  • What: drop the critic. Sample a group of N completions per prompt; the group mean reward is the baseline; advantage Aᵢ = rᵢ − mean(r) (optionally std-normalized). Introduced in DeepSeekMath (arXiv:2402.03300).
  • Eats: prompts + a reward fn; no fixed target dataset.
  • 2026 status: the reasoning-RL default — DeepSeek-R1’s core algorithm (arXiv:2501.12948), still the base of V3.2’s mixed RL (arXiv:2512.02556).
  • Requirement (project rule): baseline solve rate must sit in 30–60% per prompt-group — all-pass or all-fail groups give zero advantage → zero gradient (llmresearch-handbook.md rule 7; mechanics in handbook.md §10, shared memory).

GRPO successors

  • GSPO (Qwen) [R][L] — the problem with vanilla GRPO/PPO at scale: the token-level importance ratio r_t = π_θ(a_t|s_t)/π_old(a_t|s_t) compounds multiplicatively over a long response, and noisy per-token drift is specifically what destabilizes MoE RL (expert routing shifts mid-rollout, under-policy). GSPO clips at the sequence level instead:

    # GRPO/PPO: one ratio PER TOKEN, clipped per token — variance compounds over length L
    r_t = pi_theta(a_t|s_t) / pi_old(a_t|s_t)
    
    # GSPO: one ratio for the WHOLE sequence (length-normalized geometric mean)
    r_seq = (pi_theta(y|x) / pi_old(y|x)) ** (1 / len(y))
    loss  = -mean(min(r_seq * A, clip(r_seq, 1-eps, 1+eps) * A))
    

    This is Qwen3’s actual stated production RL algorithm (arXiv:2507.18071) — the first GRPO-successor with a flagship behind it, and it gets more relevant, not less, as episodes lengthen: a 100-turn agentic trajectory with tool calls interleaved is exactly the long-sequence regime where token-level ratios drift furthest from 1 by the last token. If a future GRPO/RLVR run on the CTF agent shows training instability, sequence-level clipping is the first thing to try — not more KL-coefficient tuning.

  • DAPO (ByteDance Seed) [E][R][L] — four concrete engineering fixes, not one new algorithm, each independently adoptable as a verl loss-mode flag (arXiv:2503.14476):

    1. Clip-Higher — decouple the PPO clip bounds (eps_low ≠ eps_high, e.g. 0.20 / 0.28 vs. the symmetric PPO-default 0.20/0.20) so a rare-but-good token can gain probability faster than a bad one loses it. Symmetric clipping caps how fast a rare-correct action can ever be reinforced — a direct driver of entropy collapse (below).
    2. Dynamic Sampling — resample any prompt whose whole group of G rollouts is all-correct or all-incorrect (std(group_rewards) == 0 → zero advantage → zero gradient in GRPO) instead of paying for a wasted rollout batch.
    3. Token-level loss — average the policy-gradient loss over every token in the batch, not per-sample-then-averaged, so long correct/incorrect responses aren’t down-weighted relative to short ones. [L] — a credit-assignment fix that matters more the longer responses get.
    4. Overlong reward shaping — a soft length penalty instead of a hard truncation penalty, so a response cut off by the context window isn’t punished as if it were simply wrong.
    eps_low, eps_high = 0.20, 0.28
    ratio   = exp(logp_new - logp_old)
    clipped = clip(ratio, 1 - eps_low, 1 + eps_high)
    loss_pg = -min(ratio * adv, clipped * adv)          # per-token, mean over ALL tokens in the batch
    
    while std(group_rewards) == 0:                       # dynamic sampling
        prompt = resample_prompt()
        group_rewards = rollout_and_score(prompt, n=G)
    

    Mainstream in OSS RL tooling (verl and open GRPO reproductions default to these four fixes) and independently reproduced as a 50-point AIME 2024 result beating R1-Zero-Qwen-32B with half the training steps — one of the few fully open (algorithm + infra + data) large-scale reasoning-RL reproductions. Project rule 7 (GRPO baseline must hit 30–60%) is DAPO’s dynamic-sampling problem, stated as a portfolio-composition constraint instead of a training-loop fallback — keeping the baseline in-band is how you avoid feeding all-pass/all-fail groups into the update in the first place; DAPO’s dynamic sampling is the fallback for whatever still lands there. At ~100 turns per rollout, resampling a whole-group-zero-reward challenge is expensive — prefer upstream curriculum/difficulty filtering (drop challenges outside the 30–60% band) over paying for resamples on genuinely-unsolved-yet challenges.

    Designed to fix: pattern 2 & 3 — Clip-Higher targets the exact mechanism behind “no methodology / 82% pivot-after-failure” and “good guesser until it isn’t”: symmetric clipping caps how much probability mass a rare, correct enumeration branch (or a well-grounded, as opposed to lucky, guess) can ever accumulate — which is precisely what narrows a policy onto one brittle script. See Exploration and entropy below for the mechanism this is patching.

  • Dr. GRPO [R][E] — a smaller, easy-to-miss companion fix: vanilla GRPO’s per-sample length- and std-normalization secretly rewards longer wrong answers and shorter right ones (an optimization artifact, not a real preference). Fix is a two-line change — drop the 1/|response| length term and the group-std division, keep only A = r − mean(r) (arXiv:2503.20783). Matches or beats vanilla GRPO’s accuracy at the same compute while removing the length-inflation drift. For a 100-turn agent this bug has a much bigger attack surface than a single-turn math answer: a policy trained on the unfixed objective can learn to “look busy” (extra tool calls, redundant enumeration) after a wrong guess without the enumeration being useful — nearly indistinguishable from legitimate PTES-style enumeration unless you’re specifically checking for it.

RLVR (RL with Verifiable Rewards)

  • What: GRPO/PPO where the reward is a deterministic verifier (unit tests, math checker, flag check) rather than a neural RM. No parameters to game.
  • Eats: prompts + a verify(state) → {0,1} function. This is your setup — the CTF flag verifier is a textbook verifiable reward.
  • 2026 status: arguably the defining technique of the era. Every reasoning model (o1/o3, R1, Gemini-thinking, Qwen3, Kimi) scales RL against verifiable/rule-based rewards as the capability driver; Gemini 2.5 explicitly allocates increased RL compute to “verifiable rewards” (arXiv:2507.06261; OpenAI “Learning to reason with LLMs”; R1, arXiv:2501.12948).

Exploration and entropy: the GRPO graduation trigger

Cybersecurity is exploration — every technique here is [E]-tagged. This is the section that operationalizes the project’s own stated graduation condition (“SFT → GRPO/RLVR when policy entropy collapses”): what actually collapses, why, and the cheapest fixes, roughly in the order you’d reach for them.

The phenomenon has a fitted law. Policy entropy falls sharply and monotonically early in RLVR training, and downstream performance is empirically bounded by R = -a·exp(H) + b — performance is traded from entropy, hitting a fixed, predictable ceiling at H=0. The mechanism: entropy change is driven by the covariance between a token’s action-probability and its logit update, which is proportional to advantage under policy-gradient methods — a high-probability, high-advantage token gets pushed toward certainty (entropy ↓), a rare-but-good token gets reinforced (entropy ↑), and this covariance term stays positive almost everywhere during training, which is why the collapse is monotonic rather than noise (arXiv:2505.22617). Actionable consequence: instrument mean(entropy) from training step 0. A run whose entropy is already heading to 0 within the first 10–20% of steps is capped, regardless of what the reward curve currently shows — you cannot tell “40% because the task is hard” from “already collapsed onto one enumeration script at step 50, now just executing it faster” without this instrumentation.

Designed to fix: pattern 2 & 3 — this is the literature-side name for exactly the two most load-bearing BSides findings: a policy that has spent its entropy budget on one narrow, high-probability action sequence has no probability mass left on alternatives when that sequence fails (pattern 2, 82% pivot-after-failure) and commits to a single ungrounded guess at the critical step because it’s the only path its distribution still assigns real weight to (pattern 3).

Cheap, surgical fixes — try in this order:

  1. Clip-Higher (DAPO, above) — already the baseline recipe; decoupling eps_high from eps_low is the first lever, not an afterthought.
  2. Clip-Cov / KL-Cov — since collapse is driven by a small set of high-covariance tokens, don’t touch the loss globally. Each batch, flag the top ~0.2–2% of tokens by cov = (logp − logp.mean()) * (adv − adv.mean()) and either drop the policy-gradient update on them (Clip-Cov) or add an extra KL penalty scoped only to them (KL-Cov):
    cov  = (logp - logp.mean()) * (adv - adv.mean())
    mask = cov > percentile(cov, 1 - clip_frac)         # top ~0.2-2% of tokens
    
    loss_pg[mask] = loss_pg[mask].detach()                          # Clip-Cov
    loss = loss_pg + kl_coef * kl_per_token * mask                  # KL-Cov (scoped, not global)
    
    Already merged into verl (loss_mode="clip_cov" / "kl_cov") — a config flag, not a reimplementation (arXiv:2505.22617).
  3. The clip asymmetry is mechanistically real, not just DAPO folklore — raising the low-side clip bound increases entropy, raising the high-side bound decreases it; these are two independent knobs, not one symmetric hyperparameter (arXiv:2509.26114). If Clip-Higher alone doesn’t fully hold entropy up, look at eps_low too.
  4. High-entropy minority tokens — only ~20% of tokens in a trajectory are high-entropy “forking” tokens (where the policy actually chooses between distinct paths); restricting the gradient update to just those matches full-gradient RLVR at 8B and beats it at 32B, while training on the other 80% (low-entropy filler/syntax) actively hurts (arXiv:2506.01939). In an agentic trace (JSON tool-call formatting, restated context) the boilerplate-token fraction is plausibly even larger than in pure math CoT — untested in this domain, but the leverage ceiling is higher, not lower.

The graduation trigger, precisely stated: don’t wait for the reward curve to plateau — watch entropy. Once mean(entropy) is tracking toward the flat part of the R = -a·exp(H)+b curve, more rejection-sampling-SFT epochs on the same policy distribution will not move the needle (you’re re-sampling a narrowing distribution); that is the signal to graduate to GRPO/RLVR with the interventions above already wired in, not bolted on after observing collapse.

ProRL — prolonged, KL-controlled RL expands the boundary rather than just amplifying it. Directly tests the contested question in Contested edges: does RL discover genuinely new solution paths, or just reweight what the base model could already sample at large k? Answer, conditional on three ingredients: (a) adaptive KL control against a reference policy, (b) periodic reference-policy resetting (ref_policy ← policy.detach() every N steps, so the KL term compares against a recent snapshot instead of an ever-more-distant step-0 policy), and (c) a diverse task suite, sustained over prolonged training (thousands of steps, not a short polish pass) — under these conditions, RL-trained models solve problems the base model never solves at any k, and the boundary-expansion effect strengthens with training duration (arXiv:2505.24864).

loss += kl_coef * KL(policy || ref_policy)              # KL-to-reference, adaptive coefficient
if step % reset_interval == 0:
    ref_policy.load_state_dict(policy.state_dict())     # re-anchor — don't compare against a stale step-0 snapshot
# (c) mix task families/difficulties in the training distribution rather than a narrow curriculum

This directly qualifies the project’s own confirmed lesson (“GRPO amplifies existing capabilities, SFT replaces them,” arXiv:2507.10616): the amplify-vs-expand outcome is a function of how long and how KL-controlled the run is, not GRPO’s inherent ceiling — a short GRPO pass should be expected to behave like amplification; ProRL’s own ablations show the boundary-expansion effect only emerges after prolonged, reference-resetting training. This is genuinely contested — a separate 2025 measurement study reports the opposite under a short, single-reference-policy setup (pass@1 rises, pass@256 coverage falls over training, i.e. narrowing, arXiv:2504.13837) — the two are read here as reconciled by duration/KL-management, not as one being simply wrong, but that reconciliation is this book’s synthesis, not a claim either paper makes explicitly. Treat “does RL create new capability here” as open and verify with your own entropy instrumentation (above), not settled fact.

Designed to fix: pattern 4 — a model whose reasoning boundary hasn’t been expanded keeps failing the same class of enumeration-heavy exploitation steps no matter how much RL “polishing” it gets (62% of failures stall in exploitation). ProRL’s finding implies the fix is duration + diversity + entropy control together, not any single trick — budget real training time and periodic reference resets once past the initial GRPO baseline, rather than treating GRPO as a short fine-tune pass.

Everything past this point — pass@k as a training signal, diversity/curiosity/count-based intrinsic rewards, parameter-space noise for temporally-coherent exploration, tool-call-sequence diversity as the project’s own novel opportunity, and the full turn-level/step-level credit-assignment literature for the ~100-turn setting — is covered in depth in the dedicated long-horizon and exploration-sweep chapters, and in Agentic & multi-turn RL for the multi-turn training-loop shape itself. Read this section for the graduation trigger and the cheapest fixes; read those for the harder credit-assignment and boundary-expansion questions once you’re past the initial DAPO-recipe GRPO baseline.

The reward-model question (PRM vs outcome/rubric)

  • PRM (process reward, dense step-level) — score each reasoning step (Lightman et al., arXiv:2305.20050). Niche / avoided in production: DeepSeek explicitly rejected PRM for R1 due to step-level reward hacking (arXiv:2501.12948).
  • Outcome verifier + rubric/critic grading — the real 2026 answer to “what replaced PRM”: not dense step rewards, but LLM-judge/rubric-based outcome grading. Gemini’s “Critic” (prompted rubric grader, arXiv:2507.06261) and OpenAI’s RFT “model grader” are both this in production. Your deterministic flag verifier is the ungameable end of this spectrum — keep it there (gameability ladder in Contested edges).

When to reach for RL

An execution gap where rejection-sampling FT has plateaued (entropy collapsed), and you want the negative-sample gradient + online updates. Cost: online rollout infra, reward plumbing, KL control, instability, and the train↔inference precision mismatch (its own rabbit hole — see the shared memory note research/fp8-quantization-mechanics-training-serving.md).

Agentic & multi-turn RL — the missing category

This is the category that was absent from the first pass of “the major methods,” and it’s the one that matters most for you, because a CTF agent is exactly this shape. It is not a new update rule — it still runs on GRPO/PPO/GSPO-family gradients — it’s a new training-loop shape: RL over multi-turn trajectories with live tools/environments in the loop (browser, code sandbox, MCP servers), instead of single-shot verifier-scored completions.

What changes vs. single-shot RLVR

  • The rollout is an episode of tool-use, not one generation. Reward often arrives only at the end (flag captured / task complete) → a credit-assignment problem: which turn or tool-call earned the win or lost the run?
  • The “dataset” is a live environment service, not a static file. You need rollout orchestration (sandboxes, tools, resets), not a JSONL.
  • This is precisely the project’s “RL envs are the moat” thesis (lessons/post-training/rl-envs-as-moat-between-providers.md, shared memory) — now independently corroborated as the frontier bet.

Verified production evidence (2026)

  • OpenAI Deep Research (o3-based): “trained using end-to-end reinforcement learning on hard browsing and reasoning tasks” — a shipped product doing agentic RL over live tool use (openai.com/index/introducing-deep-research).
  • Kimi K2 (Moonshot, open-weight 1T/32B-active MoE): headline post-training is a large-scale agentic data-synthesis pipeline + joint RL, where simulated tool environments generate the rollouts RL trains on (Kimi K2 tech report).
  • Kimi K2.5 (2026): Agent Swarm trained with Parallel Agent Reinforcement Learning (PARL) — RL over cooperating multi-agent trajectories. This is the current frontier edge (Kimi K2.5 tech blog, 2026).
  • Gemini 2.5: RL environments explicitly extended to “multi-step actions and tool use” (arXiv:2507.06261 §2.4).
  • Anthropic: a 2026 lesson that safety training from chat-RLHF failed to generalize to agentic/tool-use settings, forcing explicit diversification into agentic environments (alignment.anthropic.com, “teaching Claude why”, 2026). Directly relevant: capability and alignment now have to be trained in the agentic loop, not chat.

What this means for your build

  • Your harness already is the environment. The engineering surface is (a) a clean verify(state)→{0,1} reward read from real environment state (not the transcript — see the confabulation gotcha in Contested edges), and (b) rollout orchestration to keep N episodes in flight.
  • Start where credit assignment is trivial (outcome-only flag reward on a solvable band), i.e. rejection-sampling FT → GRPO/RLVR on your own multi-turn trajectories, before reaching for dense per-turn shaping.

What is not mainstream yet

  • Self-play (self-generated curricula / self-critique-as-opponent) appears only in niche academic work as of this pass — no confirmed frontier-lab production use. Watch, don’t bet.

Turn-level vs. trajectory-level credit assignment

Everything below exists because naively lifting GRPO/PPO from single-turn math/code RL to a ~100-turn tool-using agent breaks two assumptions simultaneously: (1) the “trajectory” is now dozens of LLM-generation turns interleaved with environment/tool observations, not one generation; (2) reward is terminal-only (flag verified or not) — so the vanilla group-mean baseline conflates credit across every turn equally, rewarding/penalizing an early exploratory enumeration turn exactly as much as the final exploit turn.

flowchart TB
    subgraph Trajectory-level GRPO baseline
    A1["turn 1<br/>enumerate"] --> A2["turn 2<br/>probe"] --> A3["...turn 60..."] --> A4["turn 61<br/>exploit"] --> A5["flag: 0/1"]
    A5 -- "one advantage,<br/>broadcast to all turns" --> A1
    A5 --> A2
    A5 --> A3
    A5 --> A4
    end
flowchart TB
    subgraph Turn-level credit GiGPO / turn-PPO family
    B1["turn 1<br/>enumerate"] --> B2["turn 2<br/>probe"] --> B3["...turn 60..."] --> B4["turn 61<br/>exploit"] --> B5["flag: 0/1"]
    B5 -- "episode advantage macro" --> B1
    B4 -- "step/turn advantage micro<br/>via state-hash or turn-value" --> B4
    end

One-line idea: GRPO (arXiv:2402.03300, DeepSeekMath, 2024-02-05) replaces PPO’s learned critic with a group-mean baseline over N samples of the same prompt — cheap, no critic, but implicitly one-reward-per-generation. GAE (arXiv:1506.02438, 2015-06-08) is the classical mechanism for trading bias/variance in the advantage estimate across a single trajectory’s timesteps — built for one agent-environment stream, not turn-vs-token double granularity. Full method table (with your reinforcement chapter’s GRPO/GSPO/DAPO baseline) below.

GiGPO — step-level advantage with zero extra rollouts [L] [E]

arXiv:2505.10978 (Feng, Xue, Liu, An; 2025-05-16, NeurIPS 2025 poster).

  • Problem: vanilla GRPO computes one advantage per whole trajectory — a good early enumeration step and a lucky late guess get identical credit.
  • Key idea: two nested groupings. (1) Episode-level group — N full rollouts, GRPO-style trajectory advantage (macro: “was this whole run good?”). (2) Step-level group — hash each (state, step) pair, bucket steps that recur across trajectories into anchor groups, compute a second advantage from “what happened next” conditioned on that shared state (micro credit) — no new network, no extra rollouts.
  • Loop delta: after collecting the usual GRPO batch, add a step-indexing pass → advantage = episode_advantage + λ · step_advantage → feed into the same PPO-clipped update.
  • Hyperparameters that matter: state-hashing granularity (too coarse → false matches; too fine → anchor groups collapse to size 1) and the mixing weight λ.
  • Gotcha for CTF: built for hashable/discrete states (web pages, grid cells). A CTF agent’s state is unbounded free text (shell stdout, HTTP bodies) — you need a canonicalization step, e.g. hash on (tool_name, normalized_target, response_status_class) rather than raw text, or anchor groups never fire.

Designed to fix: pattern #4 (uneven PTES phases — 62% of failures stall in exploitation). Step-level credit is the mechanism that stops the 100-turn trajectory advantage from diluting the exploitation-phase steps that actually decide the run.

Given the harness already emits structured tool_call/tool_exec_ms spans (harness-observability-contract-2026-06.md), a state key from those spans is the cheapest first transplant in this whole chapter — no critic, no infra, just a canonicalization function.

ArCHer — the two-time-scale ancestor [L]

arXiv:2402.19446 (Zhou, Zanette, Pan, Levine, Kumar; 2024-02-29).

  • Key idea: a hierarchy — a turn-level off-policy critic (TD-learned, “how good is this utterance given the conversation-so-far”) and a token-level on-policy policy gradient bootstrapped off that turn value instead of only the final episode reward. Decouples “which turn was good” (dozens of decisions) from “which token was good” (thousands).
  • Ablation that matters: the sample-efficiency gap over flat single-critic baselines widens with horizon length — the regime you’re in.
  • Gotcha: off-policy turn-level value learning reintroduces the value-overestimation instability the project’s no-critic GRPO preference was designed to avoid. Treat ArCHer as the theoretical justification for “turn is the right unit,” not a recipe to implement wholesale — prefer GiGPO’s critic-free step-grouping or Verlog’s dual-discount GAE (below) for the same decomposition without a learned off-policy critic.

RAGEN / StarPO — naming the collapse [E] [L]

arXiv:2504.20073 (Wang et al.; 2025-04-24).

  • What it is: a diagnostic framework, not primarily a new algorithm. StarPO formalizes trajectory-level agent RL over whole (state, think, action, reward) rollouts, then uses the RAGEN testbed to empirically show what breaks when you train “the naive way.”
  • Central finding — the “Echo Trap”: the policy’s reasoning traces converge to a small set of repeated, low-diversity patterns that keep scoring reward on the training distribution while generalization/exploration collapses. This is entropy collapse, named and reproduced across four stylized environments — not a one-off.
  • Fix pattern reported: reward normalization across turns (so no single dominant reward source drowns out exploration signal) + explicit rollout-diversity interventions, triggered by monitoring entropy/reward-variance, not epoch count.

Designed to fix: pattern #2 (react-and-guess, no methodology — 82% pivot-after-failure). An entropy-collapsed policy is one direct explanation for why an agent stops trying diverse enumeration strategies and converges to a small guess-repertoire.

This is the citation for the project’s own stated plan — “watch policy entropy as the trigger to graduate from SFT to GRPO.” Concretely: instrument per-turn action-entropy (or a diversity metric over tool-call sequences) during RL and treat a converging curve as the operational graduation/intervention signal, mirroring RAGEN’s diagnosis instead of re-deriving it live mid-run. See also reinforcement.md’s “Exploration and entropy” section for the DAPO Clip-Higher / Dr. GRPO mechanisms that patch this at the single-turn level — RAGEN is the multi-turn-specific diagnosis of the same underlying failure.

The 2025–26 turn-level GRPO-variant cluster [L] [E]

A fast-moving cluster of near-simultaneous papers attacking the same problem — “turn is the unit of advantage, not trajectory or token” — with distinct mechanisms. Treat as one converging-consensus finding, not N competing final answers; none individually crosses into [N] (they refine existing capability rather than expand the boundary).

PaperarXivMechanismConfidence
Turn-Level Reward Design & Credit Assignment2505.11821Dense per-turn reward terms layered on the terminal outcome rewardPromising
Turn-PPO2512.17008Argues GRPO’s group-relative clip “exposes notable limitations” at long horizon; goes back to a per-turn PPO value functionPromising
TL-GRPO2601.16480Turn credit for same-state-revisited tasks (iterative code repair); narrower than general multi-turnPromising
A2TGPO2605.06200Adaptive per-turn clip range — early vs. near-terminal turns have different advantage-magnitude distributionsPromising
Proximity-Based MTO2602.19225Weight credit by task difficulty, not just turn positionPromising
GAGPO2605.13217GAE-style λ-discounted advantage synthesized into GRPO’s group-relative frameworkPromising

What this means for the CTF agent: don’t pick one paper as “the” answer — prototype the cheapest shared mechanism first: GiGPO’s step-grouping (zero extra infra) or a straightforward per-turn shaping term (2505.11821’s approach) before reaching for a second value network (Turn-PPO / TL-GRPO). Given the project’s ground-truth-only reward constraint (see below), any dense intermediate signal must stay a shaping term added to, not a replacement of, the terminal verifier reward.


Reward shaping for a sparse terminal reward — without reopening the confabulation bug

The project already has a hard rule: reward must be ground-truth flag-verified, never format/regex-matched — SFT-induced FLAG{} confabulation was a real observed failure (lessons/post-training/sft-induced-flag-confabulation.md). Every reward-shaping idea in this section has to be read through that constraint.

  • Keep the terminal signal as ground truth, add density, don’t replace it. The turn-level cluster above (2505.11821 in particular) explicitly diagnoses that “sparse outcome rewards… lack dense intermediate signals across multiple decision steps” — the fix is injecting per-turn shaping alongside the verifier, e.g. reward tool-call progress (new open port found, new endpoint discovered, new credential recovered) as a small dense bonus, while the flag check remains the only source of the large terminal reward. A shaped proxy that can be gamed (e.g. “reward finding any string that looks like a flag”) reopens exactly the confabulation failure already logged — the shaping term must be read from verifiable environment state, same discipline as the terminal check.

  • Mask tool/environment output tokens from the policy-gradient loss. Search-R1 (arXiv:2503.09516) masks retrieved-content tokens out of the loss — you don’t want to reinforce/penalize text the environment produced, only the model’s own query/action-generation tokens. Direct transplant, not optional: shell stdout, HTTP response bodies, scan output must never enter the policy-gradient loss, only the agent’s own tool-call arguments and reasoning tokens. Easy to miss when standing up a GRPO/RLVR loop on top of an SFT-warmed policy — this is a correctness bug, not a design choice.

    Designed to fix: pattern #1 (agents prefer raw curl/shell — 87.7% of tool calls bypass the provided surface). Search-R1 is direct evidence that RL specifically over which tool/query to issue is tractable in a comparable regime (search-engine calls) — supports RL (not just better prompting) as the lever for pattern #1.

  • Bootstrap a value estimate at truncation instead of reward = 0. Verlog (below) proposes trajectory early truncation with a value-function bootstrap rather than waiting for the terminal reward — directly relevant since episodes cap at ~100 turns: today a hard timeout presumably returns zero reward for a run that made real, unfinished progress. Recovering partial-progress signal from failed-but-in-progress attempts is a second lever on the same sparse-terminal-reward problem, distinct from per-turn shaping.

  • Plan exploration as its own object. PEARL (arXiv:2601.20439) treats which tools, in what order as something to explore/RL over, not just the final answer.

    Designed to fix: pattern #2 (no methodology / enumeration). “Plan exploration” is a formal mechanism for rewarding systematic tool-sequencing instead of react-and-guess.


Why a 100-turn CTF is the hard case

Stack the constraints and the difficulty compounds — this is the “hard case” every technique above is implicitly being stress-tested against:

ConstraintWhy it bites at ~100 turnsTechnique that targets it
Reward is terminal-onlyCredit for the winning exploit turn gets diluted across ~100 turns of a flat trajectory-level advantageGiGPO, turn-level cluster
Unbounded free-text state (shell/HTTP, not pixels/grid)State-hashing methods built for discrete environments (web pages, grid worlds) don’t transplant for freeGiGPO — needs a bespoke canonicalization step
Variable episode lengthBatched training wastes GPU cycles on padding/idle time when some rollouts finish in 10 turns and others run to 100Verlog’s early truncation
Long context growing every turnFull transcript in every prompt overloads context retrieval well before turn 100Verlog’s customizable agent memory (windowed history)
On-policy exploration required, but entropy collapses under naive multi-turn RLThe “Echo Trap” — repeated low-diversity patterns keep scoring reward while exploration diesRAGEN/StarPO’s entropy-as-trigger diagnosis
Exploitation phase, not enumeration, is where 62% of failures stallA flat advantage rewards enumeration and exploitation turns identically, so neither gets a sharpened gradientGiGPO step-groups, BSides pattern #4

Verlog — the only technique benchmarked past 100 turns [L] [E]

No arXiv id — OpenReview only (NeurIPS 2025 MTI-LLM workshop poster, openreview.net/forum?id=GmodkWwMV3) + project blog (wentsechen.github.io/Verlog_blogpost), Chen/Chen/Zhu/Schneider. Confidence: Promising — cite via OpenReview, do not fabricate an arXiv id.

Three mechanisms, all aimed at the “three failure modes of long-horizon agentic RL” the paper names explicitly: overloaded context, sparse terminal reward, variable trajectory length wasting GPU cycles.

  1. Customizable agent memory — a flexibly-sized history window per turn, decoupling “how much context the policy sees” from “how many turns the episode has run.”
  2. Dual-discounting GAE — two separate discount factors, γ_step (turn-to-turn credit decay) and γ_token (within-turn token credit decay), instead of one GAE discount applied uniformly. Direct generalization of ArCHer’s two-time-scale idea, implemented inside GAE instead of a separate off-policy critic.
  3. Trajectory early truncation — cuts long rollouts short during training and substitutes a value-bootstrap for the missing terminal reward, to cut GPU idle time from variance in episode length.

Scale claim: the blog states prior frameworks (VeRL, RAGEN) handle ~10-turn tasks, verl-agent scales to ~50, and Verlog targets 400+ turn episodes (Crafter, 70–400 steps, avg ~190) — the only technique in this thread validated longer than your ~100-turn ceiling.

What I’d change in your pipeline: the dual-discounting GAE split (γ_step vs γ_token) is a single hyperparameter change layered onto whatever advantage code the training loop already has, no new critic beyond what GAE needs — the most directly answerable “what would you change” in this whole file. Flag honestly: workshop-poster + blog source, not a peer-reviewed arXiv preprint.


The RL-framework landscape (verl-agent / VerlTool / RAGEN)

Framework choice is an infra decision, not a research-finding one — flagged here because it gates which of the mechanisms above you can actually run without building rollout orchestration from scratch. Cross-reference against the harness/GPU-economics material once the framework-choice chapter from this research sweep lands (see cross-links below).

  • verl-agent (github.com/langfengq/verl-agent) — open-source agent-RL extension of veRL. No standalone arXiv paper; cite as infra, not a research claim. Scales to ~50-turn tasks per Verlog’s own comparison.
  • VerlTool — “Towards Holistic Agentic Reinforcement Learning with Tool Use” (arXiv:2509.01055) — surfaced in this research pass but not independently abs-page-verified; confirm before citing as settled.
  • RAGEN (github.com — see StarPO above) — the modular multi-environment testbed the Echo Trap diagnosis was built on; useful as a reference implementation for entropy/diversity monitoring, not just the paper.
  • “Demystifying RL for Long-Horizon Tool-Using Agents” (arXiv:2603.21972, Wu et al., 2026-03-23) [L] [R] — the closest thing to a systematic “what to tune first” ablation study, decomposing the design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, environment design. Use this axis framing when triaging which lever to pull first on your own pipeline. Confidence: Promising (0 citations, <4 months old at verify-time, but methodologically the most comprehensive single source found).

Domain-adjacent: RL for CTF / pentesting agents directly

Academic, cited for context — not a basis for our decisions. CTF-Dojo, Pentest-R1, STRIATUM-CTF, and HackSynth are academic domain-specific CTF/pentest training or benchmark papers; none produced a frontier cybersecurity model, so none of them is load-bearing for any conclusion, recipe, or number below — they’re listed only so the researcher knows what already exists in the academic literature before presenting a technique here as novel.

  • CTF-Dojo (arXiv:2508.18370, Zhuo et al., 2025-08-25) [R] — “the first large-scale executable runtime tailored for training LLMs with verifiable feedback” for CTF-style tasks: 658 Docker-containerized challenges with ground-truth verified feedback. Context only — see re-grounding below for why verifier-grounded execution environments are the right substrate.
  • Pentest-R1 (arXiv:2508.07382, He Kong et al., 2025-08-10) [L] [R] — two-stage RL pipeline for autonomous pentesting reasoning, trained on 500+ real-world multi-step walkthroughs. Context only, unread beyond abstract — do not use it to lock the project’s own reward-shaping design; if a reward-design decision needs a citation, it must come from the general RLVR/reward-shaping literature (Ng/Harada/Russell potential-based shaping, PURE/MONA reward-hacking literature) or this project’s own measured data, not from Pentest-R1.
  • STRIATUM-CTF (arXiv:2603.22577, Hugglestone et al., 2026-03-23) [R] — MCP-standardized agentic framework for general-purpose CTF solving, targeting “multi-step, stateful reasoning” as the gap static benchmarks miss. Context only — it does not itself do turn-level RL and is not used here to justify GiGPO’s design (that justification stands entirely on GiGPO’s own paper, arXiv:2505.10978, which motivates state-hashed step-groups from first principles, no CTF-specific evidence required).
  • HackSynth (arXiv:2506.02048, Muzsai, Imolai, Lukács, 2025-06-01) [R] [E] — fine-tunes a tool-augmented Llama-3.1-8B via vanilla, trajectory-level GRPO on a procedurally-generated crypto-CTF dataset. Context only — not used as evidence for the “start with vanilla GRPO” recommendation below; that recommendation is re-grounded independently.

Re-grounded recommendation (start vanilla, add turn-level machinery only if needed): this is supported by the general empirical ablation in “Demystifying RL for Long-Horizon Tool-Using Agents” (arXiv:2603.21972, already cited above, domain-general not CTF-specific), whose 5-axis decomposition (reward shaping, model scaling, data composition, algorithm selection, environment design) treats algorithm choice as one axis to tune after establishing a working baseline — plus this project’s own handbook rule that GRPO baseline must first land in the 30–60% signal band (memory/handbook.md §10) before any additional machinery is justified. Re-grounded substrate claim: that verifier-grounded execution environments (not format/regex reward) are the right substrate is this project’s own confirmed constraint, not an inference from CTF-Dojo — see the ground-truth-flag-verified rule and the SFT-induced FLAG{} confabulation failure (lessons/post-training/sft-induced-flag-confabulation.md), which is the actual basis.

The live gap: no paper in this pass demonstrates turn-level credit assignment (GiGPO/Verlog-class) applied to an offensive-security/CTF domain specifically. Transplanting GiGPO/Verlog mechanisms to CTF is this project’s own contribution to make, not something to find pre-solved — and, per the standing rule, not something to validate by reference to the academic CTF papers above.


Tool-integrated reasoning RL: ReTool, ToRL, Search-R1

These three share a mechanism — RL over an interleaved reason+tool-call+observation loop — but differ in domain (math/code-interpreter vs. search). Relevant because the harness is a tool-integrated-reasoning loop (shell, HTTP, scanning tools).

  • ReTool (arXiv:2504.11536, Feng et al., 2025-04-15) [R] [E] — interleaves real-time code-interpreter execution inside the reasoning trace, and trains the interleaving policy with RL. The rollout is no longer “generate full CoT then maybe call a tool” — the policy learns when to interrupt its own reasoning to invoke a tool and resume conditioned on the result; credit must flow through that interruption boundary. Domain is math — direct transplant to CTF tool-calling (when to run curl vs. reason further) is analogous but unvalidated in this domain.
  • ToRL (arXiv:2503.23383, Li, Zou, Liu, 2025-03-30) [E] [N]the strongest [N] citation in this thread: pure RL (no SFT-on-tool-traces warmup) teaches autonomous tool invocation and reports emergent strategic tool-use behaviors absent from the SFT-only baseline, beating the best tool-integrated-reasoning model on AIME’24 by double digits. This is the same RL > SFT distinction the project’s diagnosis already leans on (“SFT replaces, GRPO amplifies,” arXiv:2507.10616) but pushed further — RL-from-scratch surfacing qualitatively new patterns is closer to boundary-expansion than amplification. Domain is math tool-use (Python), not offensive security — treat the emergent-behavior claim as suggestive, not proven, here.
  • Search-R1 (arXiv:2503.09516, Jin et al., 2025-03-12) [R] [E] — the loss-masking detail (mask tool/environment output tokens from the policy gradient) covered above under reward shaping; also BSides pattern #1 evidence.

Domain gap to flag honestly: all three are validated in math/search domains with much shorter tool-call chains than a 100-turn CTF episode — the mechanism (interleave, mask, learn-when-to-call) transfers; the specific hyperparameters (how often to call, reward magnitude per call) don’t, and need re-deriving on your own verifier-based reward.


Summary table

The four rows below marked “context only” (CTF-Dojo, Pentest-R1, STRIATUM-CTF, HackSynth) are academic domain-specific CTF/pentest papers — cited for landscape awareness, not as a basis for any conclusion/recipe/number in this file (see re-grounding above).

TechniquearXiv (verified unless noted)TagsBSides patternConfidenceOne-line takeaway
GRPO (baseline)2402.03300[L]VerifiedGroup-mean baseline, no critic — trajectory-level only
GAE (baseline)1506.02438[L]VerifiedClassical multi-step advantage; single time-scale
GiGPO2505.10978[L][E]#4VerifiedStep-level advantage via state-hash groups, zero extra rollouts
ArCHer2402.19446[L]VerifiedTurn-level off-policy critic + token-level on-policy — heavier infra
RAGEN / StarPO2504.20073[E][L]#2VerifiedNames & diagnoses the “Echo Trap” entropy collapse
Turn-Level Reward Design2505.11821[L][E]#4PromisingDense per-turn reward layered on sparse terminal reward
Turn-PPO2512.17008[L]PromisingArgues PPO turn-value beats GRPO group-baseline at long horizon
TL-GRPO2601.16480[L]PromisingTurn credit for same-state-revisited (iterative) tasks
A2TGPO2605.06200[L]PromisingAdaptive per-turn PPO clip range
Proximity-Based MTO2602.19225[L]PromisingWeight credit by task difficulty, not just position
GAGPO2605.13217[L]PromisingGAE-style generalized advantage inside GRPO grouping
Verlogno arXiv — OpenReview[L][E]PromisingDual-discount GAE + memory-windowing + early truncation; 400+ turn scale
Demystifying RL for Long-Horizon Tool Agents2603.21972[L][R]Promising5-axis empirical recipe: reward/scale/data/algorithm/env
PEARL2601.20439[E][R]#2PromisingRL over the planning/tool-sequencing step itself
ReTool2504.11536[R][E]VerifiedInterleaved code-exec reasoning, RL over interruption points
ToRL2503.23383[E][N]VerifiedRL-from-scratch surfaces emergent tool-use strategies
Search-R12503.09516[R][E]#1VerifiedMask tool-output tokens from policy-gradient loss
CTF-Dojo (context only)2508.18370[R]Verified, academic — not a basis658-challenge verifier-grounded CTF RL environment
Pentest-R1 (context only)2508.07382[L][R]Verified (needs deeper read), academic — not a basisTwo-stage RL for pentest reasoning
STRIATUM-CTF (context only)2603.22577[R]#4Promising, academic — not a basisMCP-standardized stateful CTF agent framework
HackSynth (crypto CTF) (context only)2506.02048[R][E]Verified, academic — not a basisVanilla GRPO already works on narrow (crypto) CTF
VerlTool2509.01055Unverified (flag before citing)Holistic agentic-RL-with-tool-use framework

Open questions for the next research pass

  1. No paper in this pass demonstrates turn-level credit assignment (GiGPO/Verlog-class) applied to an offensive-security/CTF domain specifically — transplanting these mechanisms to CTF is this project’s own contribution to make, not something to find pre-solved.
  2. Verlog has no arXiv paper — only an OpenReview NeurIPS-workshop submission and project blog. Cite the OpenReview id, not a fabricated arXiv id, and flag it as workshop-tier evidence.
  3. VerlTool (arXiv:2509.01055) surfaced in search but was not independently verified via abs-page crawl — confirm before citing as settled.
  4. Pentest-R1’s exact reward/credit design was not deep-dived here (abstract only). Per the project’s standing rule, it is context/landscape awareness only, never a basis for the project’s own reward-shaping design — that design must be re-grounded on general RLVR/reward-shaping theory or this project’s own measured data, not on Pentest-R1 or any other academic CTF/pentest paper.

  • Reinforcement — PPO · GRPO · RLVR — the base algorithm (and its own “Exploration and entropy” section — DAPO Clip-Higher, Dr. GRPO) every technique in this chapter modifies or wraps.
  • Imitation — SFT · rejection sampling — the pre-RL stage this project runs first; RAGEN’s Echo Trap is precisely the risk that activates once you graduate past it.
  • Contested edges & landmines — the flag-confabulation gotcha, the RFT terminology trap, and the “RL can’t create capability” debate that ToRL’s [N] evidence bears on directly.
  • Frontier recipes — Kimi K2.5’s PARL and OpenAI Deep Research’s end-to-end agentic RL, cited above as production evidence, are detailed there per-lab.
  • Framework-choice / GPU-economics chapter and the frontier-lab sweep chapter (this overnight research batch): once those land in SUMMARY.md, wire a link here for the verl-agent/VerlTool infra choice and the cross-lab RL-recipe comparison — their file names weren’t finalized at write-time for this chapter; the integrator should confirm the actual paths and slot them into this section.

RL that creates value — long-horizon, exploration, reasoning, novelty

The other method pages (Reinforcement, Agentic & multi-turn RL) cover which algorithm (PPO/GRPO/GSPO) and what training-loop shape (single-shot vs. multi-turn-with-tools). This page is the tagged sweep of what actually makes RL pay off for a 100-turn, tool-using, verifier-rewarded CTF agent that already has an execution gap (capability present, unreliable) rather than a knowledge gap. Every technique below is filed under the axis it serves, cross-referenced to the behavioral audit patterns it targets, and cited only where the arxiv id was verified live (crawled arxiv.org/abs/<id> on 2026-07-02).

Legend

TagAxisWhy it matters here
[L]Long-horizon / credit assignmentEpisodes run to ~100 turns, reward is usually terminal-only (flag verified or not) — plain trajectory-level advantage smears credit across every turn equally.
[E]Exploration / entropy preservationCybersecurity is search/enumeration. Entropy collapse = the policy commits early to one narrow script and stops trying alternatives.
[R]Reasoning / test-time computeMulti-step vuln chaining, deciding what to try next, allocating turns to hard vs. easy phases.
[N]Novelty / boundary-expansionEvidence the technique expands what the policy can do at all (finds attack paths the base model never finds at any k), not just amplifies what’s already samplable.

Baseline context this sweep assumes: GRPO (critic-free, group-mean baseline, arXiv:2402.03300) and GAE (classical single-trajectory multi-step advantage, arXiv:1506.02438) are what everything below is patching. Vanilla GRPO/PPO assume one reward at the end of one generation — a 100-turn tool-using episode breaks that on two axes simultaneously: the “trajectory” is now dozens of interleaved generation-turns + environment observations, and reward is terminal-only, so the group-mean baseline credits/blames every turn identically regardless of whether it was an exploratory enumeration step or the exploit-landing step.

Reward-shaping guardrail (applies to every [L] technique below): any dense, per-turn, or process-level reward introduced to fix credit assignment must be a shaping term added on top of, not a replacement for, the terminal ground-truth flag verification. This project’s confirmed lesson — SFT-induced FLAG{} confabulation from a loose regex reward — is exactly the failure mode that reopens if a proxy signal gets promoted to primary reward. See the reward-hacking cluster in §3.4.


1. Long-horizon / credit assignment for agents [L]

1.1 GiGPO — step-level credit with zero extra rollouts

arXiv:2505.10978 (Feng, Xue, Liu, An; 2025-05-16; NeurIPS 2025 poster). Two nested grouping levels instead of GRPO’s one: the usual episode-level group (N full rollouts, trajectory advantage as before) plus a step-level group — retroactively hash (state, step) pairs, bucket steps that recur across rollouts at the same environment state, and compute a second advantage from “what happened next” conditioned on that state. No new critic, no extra rollouts — pure post-hoc bookkeeping on trajectories you already sampled.

# after collecting the usual GRPO batch of N trajectories
key = lambda s: (s.tool_name, normalize(s.target), s.response_status_class)  # state canonicalization
anchor_groups = bucket_by(key, all_steps_across_trajectories)
step_adv = {step: advantage_within(anchor_groups[key(step)]) for step in all_steps}
advantage = episode_advantage + lam * step_adv[step]     # combine before the clipped PG update

Designed to fix: pattern 4 (uneven PTES phases — 62% of failures stall in exploitation). Step-level credit stops rewarding all 100 turns equally when only the exploitation phase decides the outcome.

Gotcha: needs a hashable notion of state recurrence — a CTF agent’s state is unbounded free text (shell/HTTP output), so the state key above needs bespoke canonicalization, not raw-text hashing. Confidence: Verified. For this agent: the single most directly transplantable idea in this sweep — no new infra, just a state-canonicalization function over the harness’s existing tool_call/tool_exec_ms spans.

1.2 ArCHer — turn-level vs. token-level, two time-scales

arXiv:2402.19446 (Zhou, Zanette, Pan, Levine, Kumar; 2024-02-29). High-level off-policy TD critic at turn granularity + low-level on-policy token PG bootstrapped off it — decouples “which turn was good” from “which token was good.” The conceptual ancestor of “turn is the right credit unit,” but the off-policy critic reintroduces exactly the instability/infra weight the project’s critic-free GRPO preference was chosen to avoid. Confidence: Verified. Takeaway: use the decomposition, not the off-policy mechanism — GiGPO or Verlog (§1.6) get the same turn/token separation without a learned critic.

1.3 RAGEN / StarPO — naming the collapse in multi-turn RL

arXiv:2504.20073 (Wang et al.; 2025-04-24). Diagnostic paper: trajectory-level RL on multi-turn agents reproducibly collapses into the “Echo Trap” — reasoning traces converge to a small repeated repertoire that keeps scoring reward on the training distribution while generalization/exploration dies. Fixes center on reward normalization across turns + explicit rollout-diversity interventions.

Designed to fix: pattern 2 (react-and-guess, no methodology, 82% pivot-after-failure). An entropy-collapsed policy is one explanation for an agent that stops trying diverse enumeration and converges to a small set of guesses.

Confidence: Verified. For this agent: the citation for “watch policy entropy as the graduation trigger” — instrument per-turn action-entropy or tool-sequence diversity during RL and treat a converging curve as the operational signal, not epoch count.

1.4 The turn-level-advantage cluster (2025–26 convergence)

A fast-moving cluster of near-simultaneous papers, all attacking “vanilla GRPO’s trajectory advantage is too coarse for multi-turn agents” with distinct mechanisms. Treat as one finding: turn is the unit of advantage is now cross-group consensus, even though no implementation is yet the default.

PaperarXivAngle
Turn-Level Reward Design2505.11821Dense per-turn reward layered on the sparse terminal reward; on tool-use benchmarks, trajectory-level baselines can fail to invoke tools at all (20-30% exact-match) vs. 100% tool-exec-success with turn-level reward.
Turn-PPO2512.17008Goes back to a per-turn value function (PPO-style) instead of GRPO’s group baseline, arguing GRPO’s clipped-group-relative update has “notable limitations” specifically for long-horizon reasoning.
TL-GRPO2601.16480Turn credit for same-state-revisited tasks (iterative code repair) — narrower than general multi-turn, but overlaps GiGPO’s step-grouping idea.
A2TGPO2605.06200Adaptive per-turn PPO clip range — early-episode and near-terminal turns have different advantage-magnitude distributions; one fixed clip under/over-constrains one of the two.
Proximity-Based MTO2602.19225Weight credit by task difficulty, not just turn position — a success on a hard task is more informative than a success on a trivial one.
GAGPO2605.13217Brings GAE’s λ-discounted multi-step advantage directly into GRPO’s group-relative framework.

Designed to fix: pattern 4 (uneven PTES phases) — dynamic/turn-level signal keeps hard-exploitation-phase prompts from silently degenerating to all-zero-reward groups.

Confidence: all Promising (recent, 0-few citations at verify time) except Turn-Level Reward Design (workshop-validated). For this agent: don’t pick one paper as “the” answer — prototype the cheapest version (GiGPO’s step-grouping, zero extra infra, or a per-turn tool-exec-success shaping term) before reaching for a second value network (Turn-PPO/TL-GRPO).

1.5 GSPO — sequence-level clipping stabilizes as sequences get long

arXiv:2507.18071 (Qwen team, 2025-07). Import from Reinforcement: clip/importance-sample at the whole-sequence level instead of per-token, because token-level ratios compound multiplicatively over long sequences — exactly the failure mode a 100-turn tool-interleaved trajectory maximizes. This is Qwen3’s actual production RL algorithm. Tags: [L] [R] [E] (more stable optimization = less premature collapse). For this agent: if a future GRPO run destabilizes on long trajectories, GSPO-style sequence-level ratios should be the first thing tried, not more KL tuning.

1.6 Verlog — the only technique benchmarked past 100 turns

No arXiv id — OpenReview only (NeurIPS 2025 MTI-LLM workshop poster, openreview.net/forum?id=GmodkWwMV3; Chen, Chen, Zhu, Schneider). Cite the OpenReview id, not a fabricated arXiv one. Three mechanisms: (1) customizable agent memory — a flexible history window decoupled from episode length, (2) dual-discounting GAE — separate discount factors for turn-to-turn (γ_step) vs. within-turn token (γ_token) credit decay, generalizing ArCHer’s two-time-scale idea into a single GAE, (3) trajectory early truncation — bootstrap a value estimate instead of paying full wall-clock for variable-length rollouts.

gamma_step, gamma_token = 0.95, 0.99   # two discounts instead of one uniform GAE gamma
# ... standard GAE recursion, but decayed across turns with gamma_step and within a turn with gamma_token

Scale claim: prior frameworks (RAGEN ~10 turns, verl-agent ~50 turns) top out well below this project’s ~100-turn ceiling; Verlog is validated on Crafter at 70–400 steps (avg ~190). Confidence: Promising, workshop-tier — flag as such if cited externally. For this agent: the dual-discount split is a single-hyperparameter change on top of whatever GAE code already exists, no new critic required; trajectory early truncation directly targets the “100-turn hard timeout returns reward=0” problem by recovering partial-progress signal instead of discarding it.

1.7 verl-agent and the framework landscape

verl-agent (github.com/langfengq/verl-agent) — an open-source veRL extension for agent RL, ~50-turn scale per Verlog’s own comparison. No arXiv paper of its own; cite as infrastructure, not a research claim. Adjacent: “Demystifying RL for Long-Horizon Tool-Using Agents” (arXiv:2603.21972, Wu et al., 2026-03-23) decomposes the agentic-RL design space along 5 axes — reward shaping, model scaling, data composition, algorithm selection, environment design — the most systematic “what to tune first” ablation study found for this domain. [L][R], Promising. PEARL (arXiv:2601.20439, Wang et al., 2026-01-28) treats plan exploration (which tools, what order) as its own RL object, directly relevant to pattern 2 (no methodology). [E][R], Promising.

1.8 Tool-integrated reasoning RL: ReTool, ToRL, Search-R1

Three papers sharing a mechanism — RL over an interleaved reason→tool-call→observation loop:

  • ReTool (arXiv:2504.11536, Feng et al., 2025-04-15): interleaves real-time code-interpreter execution inside the reasoning trace and RL-trains when to interrupt reasoning to invoke a tool. [R][E], Verified, math domain.
  • ToRL (arXiv:2503.23383, Li, Zou, Liu, 2025-03-30): pure RL, no SFT-on-tool-traces warmup, reports emergent strategic tool-invocation behaviors absent from an SFT-only baseline. [E][N] — see §4.2.
  • Search-R1 (arXiv:2503.09516, Jin et al., 2025-03-12): masks retrieved/tool-output tokens out of the policy-gradient loss — the policy didn’t generate them, don’t backprop through them.
# retrieved/tool-output masking — a correctness detail, not a design choice
loss_pg = -(mask_model_generated_tokens * min(ratio * adv, clipped * adv))
# shell stdout / HTTP response body / scan output: masked out, same rule as search results

Designed to fix: pattern 1 (87.7% of tool calls bypass the rich tool surface; 26/40 tools dead). Search-R1 is direct evidence RL over “which tool/query to issue” is solved-enough in a comparable regime (search-engine calls).

For this agent: the loss-masking detail is a must-have engineering correctness fix regardless of which algorithm gets chosen — easy to miss when standing up a GRPO/RLVR loop.

1.9 Domain precedent — CTF/pentest RL (academic, context only)

These are academic domain-specific CTF/pentest training/benchmark papers — cited for context only, not a basis for any conclusion, decomposition, recipe, or number on this page (per the project’s standing rule: none of this line of work has produced a frontier cybersecurity model). Every claim they might otherwise appear to license is re-grounded below on general frontier/theory evidence or the project’s own confirmed data instead.

  • CTF-Dojo (arXiv:2508.18370, 2025-08-25) — academic, cited for context only: 658 Docker-containerized CTF challenges with verified feedback. Re-grounded: this page’s “ground-truth-verified reward, not format-matched” requirement rests on the project’s own confirmed lesson (SFT-induced FLAG{} confabulation from a loose regex reward — see the reward-shaping guardrail above §1) and on the Spurious Rewards finding (§3.6, arXiv:2506.10947) that a wrong reward can still look like it’s working — not on CTF-Dojo.
  • Pentest-R1 (arXiv:2508.07382, 2025-08-10) — academic, cited for context only: a two-stage offline-RL-on-walkthroughs → online-RL-in-Intercode-CTF pipeline. Re-grounded: the project’s own SFT→RL staging decision is grounded on DeepSeek-R1’s four-stage recipe at frontier scale (§3.1, arXiv:2501.12948) and on RAFT/Reinforce-Rej’s controlled ablation of rejection-sampling SFT (§2.6, arXiv:2504.11343) — do not lock the project’s reward shape off Pentest-R1’s design.
  • STRIATUM-CTF (arXiv:2603.22577, 2026-03-23) — academic, cited for context only: an MCP-standardized framework targeting “multi-step, stateful reasoning.” Re-grounded: GiGPO’s state-hashing (§1.1) is justified on GiGPO’s own general-agent-RL evidence (arXiv:2505.10978), not on STRIATUM-CTF’s framing.
  • HackSynth / Random-Crypto (arXiv:2506.02048, 2025-06-01) — academic, cited for context only: fine-tunes Llama-3.1-8B with vanilla GRPO on procedural crypto-CTF. Re-grounded: the claim that turn-level machinery matters more as horizon/challenge-length grows is instead supported by Verlog’s own turn-count comparison (§1.6, OpenReview) and PSN-RLVR’s length-scaling result (§2.8, arXiv:2602.02555) — general long-horizon-RL evidence, not a domain-specific CTF result.

Open gap: none of the above academic CTF-domain papers demonstrate turn-level credit assignment (GiGPO/Verlog-class) applied to offensive security — transplanting it is this project’s own contribution to make, not something pre-solved by this (academic, context-only) body of work.


2. Exploration & entropy preservation [E]

2.1 The entropy-collapse law, and its surgical fix

One-line idea: policy entropy collapses sharply and monotonically early in RLVR, and performance is bound by a fitted law R = -a·exp(H) + b — you are trading entropy for performance, hitting a hard, predictable ceiling at H=0. Mechanism: entropy change is driven by the covariance between a token’s action-probability and its logit update, which is proportional to advantage — high-probability, high-advantage tokens keep getting pushed toward certainty, and that covariance term stays positive almost everywhere.

Fix — Clip-Cov / KL-Cov: identify the small set of highest-covariance tokens per batch and either drop the PG update on them or scope an extra KL penalty to just them, leaving the rest untouched. Already merged into verl as a loss-mode flag.

cov = (logp - logp.mean()) * (adv - adv.mean())
mask = cov > percentile(cov, 1 - clip_frac)        # top ~0.2-2% of tokens
loss_pg[mask] = loss_pg[mask].detach()              # Clip-Cov
# or: loss = loss_pg + kl_coef * kl_per_token * mask   # KL-Cov

Designed to fix: patterns 2 and 3 (no methodology / brittle single-guess) — both are symptoms of a policy that already spent its entropy budget on a narrow, high-probability action sequence.

Confidence: High (mechanistic + empirical, 342 citations within a year, adopted upstream into verl within a month). Source: Cui et al., arXiv:2505.22617 (2025-05-28). For this agent: instrument mean(entropy) from RLVR step 0 — the single cheapest, highest-leverage move in this whole sweep, with no design decisions required.

2.2 Clip asymmetry — Clip-Low and Clip-High are not symmetric

arXiv:2509.26114 (Park, Kim et al., 2025-09-30). Raising the low-side PPO clip bound increases entropy; raising the high-side decreases it — they are independent exploration knobs, not one symmetric hyperparameter. Mechanistic complement to DAPO’s clip-higher (§2.3). Confidence: Medium-high, single paper. For this agent: if clip-higher alone doesn’t fully solve collapse, look at the low-side clip too.

2.3 DAPO — four engineering fixes, one entropy-preserving

arXiv:2503.14476 (Yu, Zhang et al., 2025-03-18). Clip-Higher (decoupled ε_low/ε_high, ~0.20/0.28, so rare-but-good tokens gain probability faster than they’re suppressed) + Dynamic Sampling (resample any prompt whose whole rollout group is all-correct or all-incorrect — zero-advantage groups give zero gradient) + token-level loss aggregation + overlong-response soft penalty.

eps_low, eps_high = 0.20, 0.28
loss_pg = -min(ratio * adv, clip(ratio, 1-eps_low, 1+eps_high) * adv)   # per-token, mean over ALL tokens
while std(group_rewards) == 0:            # dynamic sampling — degenerate groups contribute nothing
    group_rewards = rollout_and_score(resample_prompt(), n=G)

Designed to fix: patterns 2, 3, and indirectly 4 — dynamic sampling keeps gradient flowing even on hard-exploitation prompts that would otherwise silently degenerate to all-zero-reward.

Confidence: High — open weights/code/data, one of the most widely adopted open RLVR recipes. For this agent: at ~100 turns/rollout, resampling degenerate groups is expensive — pair with curriculum/difficulty filtering (drop challenges outside the project’s own 30–60% band) rather than paying for resamples on genuinely-unsolved challenges.

2.4 High-entropy minority tokens — where the exploration budget actually lives

arXiv:2506.01939 (2025-06, NeurIPS 2025). Only ~20% of CoT tokens carry high entropy (the semantic “forking” decision tokens); restricting RLVR gradient to only those matches full-gradient RLVR at 8B and beats it at 32B (+11 AIME25), while training on the low-entropy 80% actively degrades performance. Confidence: Medium-high, math/code domain only. For this agent: in an agentic trace, boilerplate tool-call JSON/restated context is plausibly an even larger low-entropy fraction than in pure CoT — the potential leverage of masking gradient to just the “which tool/branch” decision tokens may be higher here, though unverified for this domain.

2.5 Positive-Advantage Reweighting — independent confirmation

arXiv:2511.05993 (Jin, Gao et al., 2025-11-08). Independently re-derives that positive-advantage tokens drive entropy collapse (converging with §2.1 via a different route) and proposes direct reweighting of the loss on those tokens as a simpler alternative to covariance-thresholding. Confidence: Medium — but the cross-group convergence with §2.1 raises confidence in the underlying mechanism.

2.6 RAFT / Reinforce-Rej — the project’s current recipe, validated

arXiv:2504.11343 (Xiong, Yao et al., 2025-04-15). Rejection-sampling SFT (train only on positively-rewarded samples) is competitive with GRPO/PPO; ablation shows GRPO’s real edge over vanilla policy gradient is discarding all-fail groups, not reward normalization. Reinforce-Rej extends this by filtering both all-wrong and all-right groups.

positives = [r for r in [gen(prompt) for _ in range(N)] if verifier(r) == 1]
sft_loss(positives)                        # this project's current recipe
if not (all_correct(group) or all_incorrect(group)):
    policy_gradient_update(group)          # Reinforce-Rej: same degenerate-group filter as DAPO §2.3

Designed to fix: nothing behaviorally — this is a validation, not a fix. It confirms the project’s rejection-sampling-SFT phase is a literature-grounded baseline, not an ad hoc placeholder.

Confidence: High for the ablation (clean controlled comparison). For this agent: the single most directly actionable paper for where the project is right now — and its degenerate-group filter is the same requirement DAPO’s dynamic sampling encodes, independently derived.

2.7 KL-regularization design space

arXiv:2505.17508 (Zhang, Liu et al., 2025-05-23). Systematic study of KL-term design choices (forward vs reverse, applied to reward vs loss vs both, against which reference policy) — the “why” behind ProRL’s reference-resetting (§4.1) and KL-Cov’s token-scoped KL (§2.1). Confidence: Medium — theoretical framing.

2.8 Parameter-space noise — temporally coherent exploration, classic and revived

Classic (arXiv:1706.01905, Plappert, Houthooft et al., 2017; 364 citations, well-established): perturb the policy’s parameters before a rollout instead of the action distribution (temperature/top-p) — produces a temporally-consistent “perturbed persona” for the whole episode rather than incoherent per-token jitter.

2026 revival, PSN-RLVR (arXiv:2602.02555, Bai, Wang et al., 2026-01-30): applies parameter noise to RLVR specifically because standard RLVR has an exploration ceiling that grows more visible at large sampling budgets. Corrects the resulting off-policy mismatch with truncated importance sampling; reports gains that get larger as reasoning length grows (marginal on ~738-token AMC responses, +8.9% pass@256 on ~1978-token AIME responses).

theta_noisy = theta + sigma * noise            # perturb before rollout (typically MLP/FFN blocks)
rollout = generate(theta_noisy, prompt)
importance_weight = clip(pi_theta(a|s) / pi_theta_noisy(a|s), max_val)   # truncated importance sampling
loss = -importance_weight * advantage * logp_theta(a|s)                  # update the CLEAN theta

Designed to fix: patterns 2 and 3 — and directly relevant since these episodes run to ~100 turns, far longer than the paper’s single-CoT setting, where token-level noise decorrelation would compound into incoherence exactly as the paper predicts.

Confidence: classic — High. PSN-RLVR — Low-medium, single very-recent paper, unreplicated. For this agent: the single technique in this sweep whose stated advantage scales with trajectory length instead of against it — worth a dedicated small pilot.

2.9 Multi-temperature — spend exploration budget where it helps

arXiv:2510.08892 (Zhuang, Zhou et al., 2025-10-10). Classify tokens into high-entropy “reasoning/fork” vs. low-entropy “knowledge/fact” tokens; sample fork tokens at higher temperature, knowledge tokens at lower — don’t want the agent “exploring” whether a CVE number or flag format is correct. Confidence: Medium. For this agent: directly portable — higher temperature at “which tool next” decision points, lower temperature inside verbatim payload/command construction.

2.10 DIVER — reward group-level diversity as an intrinsic bonus

arXiv:2509.26209 (Hu, Zhang et al., 2025-09-30). Rewards global sequence-level diversity across a rollout group (pairwise dissimilarity) using potential-based reward shaping (Ng et al. 1999 invariance) so diversity-seeking doesn’t distort what “correct” means. Reports beating GRPO-w/-clip-higher, entropy-RL, and pass@k training on both pass@1 and pass@k, in- and out-of-domain.

D = pairwise_dissimilarity_matrix(responses)          # G x G over a group
diversity_of_i = mean(D[i, :])
r_intrinsic = diversity(state_t) - diversity(state_t_minus_1)   # potential-based shaping
reward_total = reward_task + lambda_div * r_intrinsic

Designed to fix: patterns 1 and 2/3 — a direct counter to “87.7% of tool calls bypass the rich tool surface” if “different approach” is defined over which tools/commands were used, not token-level text.

Confidence: Medium-high, single paper. For this agent: the strongest concrete [N]-flavored opportunity surfaced in this whole sweep — reward a rollout group for trying genuinely different tools/approaches against the same challenge, not just for eventually finding the flag.

2.11 CDE — curiosity as cheap perplexity + critic-variance bonus

arXiv:2509.09675 (Dai, Song et al., 2025-09-11, ICLR 2026). Actor-side bonus = perplexity of the model’s own response (high = “surprised,” i.e. exploring); critic-side = variance across a multi-head critic. Reports a calibration-collapse finding as a byproduct — the policy becomes confident regardless of correctness, which the actor-bonus specifically counters.

Designed to fix: pattern 3 (good guessers until they’re not) — overconfident-wrong is the literature-side twin of committing to an ungrounded guess and not recovering.

Confidence: Medium-high, modest empirical gain (+~3pt AIME) but genuinely useful framing. For this agent: the actor-side perplexity bonus is cheap (no extra network, GRPO is critic-free) — a reasonable first experiment before anything heavier.

2.12 MERCI — count-based novelty, with a domain caveat

arXiv:2510.16614 (Zhang, Li et al., 2025-10-18, ICLR 2026 poster). Classical count-based exploration adapted to the autoregressive LLM MDP via a lightweight Coin Flipping Network pseudo-count estimator — cheaper than general count-based bonuses because the token-sequence MDP has known, deterministic transitions. Confidence: Medium — but that deterministic-transition assumption is exactly what a live sandboxed CTF environment violates (server responses/subprocess stdout are stochastic, environment-dependent). For this agent: treat as inspiration for a tool/command-novelty bonus, not a drop-in.

2.13 Representation-based exploration — a negative result worth acting on

arXiv:2510.11686 (Tuyls, Foster et al., 2025-10-13). A diversity bonus from the base LM’s own hidden states, usable at inference time (build a diverse k-of-N pool) or as an RL bonus. Notable negative result: the bonus improves verifier efficiency across sampling strategies except high-temperature sampling — high-temp outputs look “novel” in representation space without being useful. Temperature-driven and representation-driven exploration are not naively composable.

pool = [gen(prompt, temp=1.0) for _ in range(N)]
selected = top_k_by([hidden_state_diversity(r, pool) for r in pool], k)   # diverse k-of-N, not random

Confidence: Medium-high (>50% verifier-efficiency gain reported, single group). For this agent: the actionable finding is at eval time, not training — if the harness uses high temperature to get sample diversity for pass@k>1 runs, this paper says that may be producing noisier repeats of the same strategy, not genuinely different ones. Cheap to ablate, changes nothing about training.

2.14 Pass@k as diagnostic, not objective — and how to fix that if you insist

arXiv:2511.16231 (Yu Yang, 2025-11-20): optimizing pass@k directly is mathematically just a positive reweighting of pass@1, whose gradient vanishes exactly where exploration is most needed (a concentrated policy). Use pass@k as a diagnostic (is the ceiling still rising with more samples?), not a training objective.

arXiv:2505.15201 (Walder & Karkhanis, PKPO, 2025-05-21): if you do want a pass@k-shaped reward anyway, derives an unbiased, low-variance estimator that keeps gradient on harder problems where pass@1 gives near-zero signal but pass@k still has coverage — the same unbiased-estimator family this project already uses at eval time (1 - C(N-c,k)/C(N,k)).

arXiv:2508.10751 (Chen, Qin et al., Pass@k Training, 2025-08-14): a lighter-weight alternative — use pass@k as the reward signal itself to adaptively balance exploration/exploitation.

Designed to fix: pattern 5 (benchmarks measure pattern-match speed, not thoroughness) — §2.14’s diagnostic-not-objective framing is the formal version of that same critique, applied to the training objective.

For this agent: treat the pass@1-vs-pass@5-vs-pass@10 gap (already the project’s locked methodology) as the diagnostic signal — a shrinking gap while pass@1 stays flat is the entropy-collapse warning from §2.1, not “the model learned the task.”

2.15 NuRL — unlocking prompts GRPO currently can’t learn from at all

arXiv:2509.25666 (Chen, Peng et al., 2025-09-30, Salesforce AI Research + UNC). Standard GRPO/RLVR gets zero gradient from any prompt where every rollout in the group fails (the same degeneracy DAPO resamples and RAFT filters). NuRL instead unlocks these: generate a self-conditioned hint (model, given the gold answer, produces its own CoT + hint), inject it for 0%-pass-rate groups, re-roll with the hint — now training on a hint-augmented group with real signal; hint dropped at inference.

group = rollout(prompt, n=G)
if pass_rate(group) == 0.0:                          # dead for GRPO/DAPO/RAFT alike
    hint = self_generate_hint(prompt, gold_answer)
    group = rollout(prompt + hint, n=G)               # re-roll WITH the hint; hint dropped at inference

Designed to fix: pattern 4 and directly the portfolio’s ~100-200-of-1000 solve rate — the currently-unsolved ~800 challenges are plausibly many all-zero-reward-group cases today, exactly NuRL’s target regime.

Confidence: Medium, single paper, six-benchmark/three-model validation. For this agent: needs design work to adapt — the flag itself isn’t the how-to-get-there knowledge, so a CTF-shaped hint needs a walkthrough/verifier-metadata analogue or a previously-successful trajectory for a similar challenge family; doesn’t port zero-shot from math/code.

2.16 Cybersecurity RLVR precedent — the entropy-preservation gap in-domain (academic, context only)

Pentest-R1, HackSynth/Random-Crypto, and a Linux-privesc RLVR paper (arXiv:2603.17673, Normann, Happe et al., 2026-03-18 — SFT-then-RLVR on a 4B model, 95.8% success vs. 97.5% for Claude Opus 4.6 at >100x lower inference cost) are academic domain-specific security-training papers — cited for context only, not a basis for this page’s conclusions. They’re mentioned here purely to note an absence: none report an explicit entropy-preservation mechanism. The project’s actual grounding for the two-stage SFT→RLVR pipeline is DeepSeek-R1’s staged recipe at frontier scale (§3.1, arXiv:2501.12948) and RAFT/Reinforce-Rej’s controlled ablation (§2.6, arXiv:2504.11343), not these academic security papers. The entropy-preservation gap itself is this project’s own [N] opportunity to make, not something pre-solved: no cybersecurity-specific RL paper found combines DAPO/entropy-mechanism/curiosity/parameter-noise with a multi-vuln-class, ~100-turn, tool-rich CTF setting.


3. Reasoning & test-time compute [R]

3.1 The staging lesson from DeepSeek-R1

arXiv:2501.12948 (2025-01). R1-Zero (pure RL, binary rule-based reward, long CoT emerges) then R1’s four-stage fix (cold-start SFT → reasoning RL → rejection-sampling SFT on the RL checkpoint’s own correct trajectories → second RL pass). The middle-to-late stage — rejection-sampling SFT on verifier-passed solves — is literally this project’s chosen path, validated at frontier scale. Gap: R1’s reward is single-turn/terminal on math/code; the sparser, later-arriving terminal signal of a 100-turn CTF episode is the part R1 does not solve (that’s §1 of this page).

3.2 Dr.GRPO — the length-bias trap gets worse with more turns

arXiv:2503.20783 (2025-03). Vanilla GRPO’s length + group-std normalization secretly rewards longer wrong answers, shorter right ones. Fix: drop both normalizations, keep only advantage = reward - mean(rewards). Matches/beats GRPO accuracy at same compute.

Designed to fix: pattern 3 (good guessers until they’re not) — if the base algorithm rewards length-padding on failure, an agent could learn to “look busy” (redundant tool calls, extra enumeration) without the enumeration being useful — a bigger attack surface for this bug in a 100-turn setting than a single-turn math answer.

Confidence: High. For this agent: use Dr.GRPO’s advantage normalization, not vanilla GRPO’s, if/when graduating to RL — specifically check whether the policy is learning to burn turns on unproductive tool calls after a wrong guess, which is nearly indistinguishable from legitimate enumeration unless checked for.

3.3 LIMO — SFT data quality over quantity

arXiv:2502.03387 (2025-02). 817 carefully-curated SFT examples beat >100k loosely-curated ones on AIME/MATH500 plus strong OOD transfer — SFT works as “cognitive templates” for knowledge the base model already has, not a knowledge source.

For this agent — directly relevant to the immediate next step: when building the rejection-sampling SFT set from the agent’s own verifier-passed runs, prioritize trajectory quality/technique diversity over raw count. A smaller set of clean, full-PTES-phase, well-enumerated solves may generalize better than a larger set of lucky-guess successes — training on lucky-guess trajectories specifically risks teaching the “guess and hope” failure mode LIMO’s framing predicts generalizes poorly (pattern 3).

3.4 Test-time compute is a resource-allocation problem, not a skill problem

arXiv:2408.03314 (Snell et al., 2024-08, 1772+ citations). Difficulty-adaptive test-time compute allocation can match a 14x larger model at fixed budget; uniform allocation is wasteful. Related, arXiv:2502.15631 (o3-mini vs o1-mini, Feb 2025): higher accuracy achieved WITHOUT longer reasoning chains — accuracy generally declines as CoT length grows within a fixed model, even controlling for difficulty.

For this agent: the 100-turn budget IS test-time compute allocation, just framed as episode length. Maps onto pattern 4 (strong at chaining, weak at thorough enumeration) as a resource-allocation problem: the agent should spend more of its turn budget on thorough enumeration for hard/unfamiliar challenge types and less on easy/familiar ones, rather than a flat per-phase turn count. Track turns-per-solve alongside solve rate — more turns is not automatically better.

3.5 The contested question — does RL expand or just narrow the reasoning boundary?

Two papers, opposite conclusions, genuinely contested:

  • “Does RL Really Incentivize Reasoning Capacity Beyond the Base Model?” (arXiv:2504.13837, 2025-04): under pass@k with large k, RLVR-trained models’ correct paths are all already samplable from the base model — pass@1 improves, pass@256 decreases over training. RLVR-as-typically-run narrows.
  • ProRL (arXiv:2505.24864, Liu, Diao et al., 2025-05-30, NeurIPS 2025): with KL control + periodic reference-policy resetting + a diverse task suite, sustained over 2000+ steps, RL-trained models solve problems the base model never solves at any k — genuine boundary expansion, correlating with training duration and base-model competence.
loss += kl_coef * KL(policy || ref_policy)              # (a) adaptive KL control
if step % reset_interval == 0:
    ref_policy.load_state_dict(policy.state_dict())     # (b) periodic reference reset — the non-obvious piece
# (c) diverse multi-task training suite, not a narrow curriculum

Reconciliation (this page’s synthesis, not either paper’s claim): RLVR as typically run (short, KL-to-frozen-init) narrows/amplifies; RLVR prolonged, KL-controlled, reference-resetting, diverse-task can expand. Training duration + KL management is the resolving variable — treat as a hypothesis to validate with your own entropy instrumentation (§2.1), not settled fact.

Designed to fix: pattern 4 (uneven PTES, weak enumeration) — a reasoning boundary that hasn’t been expanded keeps failing the same class of enumeration-heavy step no matter how much RL polishing it gets.

For this agent: don’t expect boundary expansion from a short RL run — that’s expected behavior matching §2504.13837, not a bug. Periodic reference-policy resetting is cheap, orthogonal to GRPO/DAPO/GSPO choice, and worth adopting from the start of any RL run.

3.6 Spurious rewards — a methodology warning before trusting any result

arXiv:2506.10947 (2025-06, 93+ citations fast). For Qwen2.5-Math specifically, RLVR improves MATH-500 almost as much with completely spurious rewards (random, wrong-label, format-only) as with ground-truth ones — RL surfaces a latent pretrained quirk, not the reward’s information content. Does not replicate on Llama3/OLMo2 — a model-family-dependent finding.

For this agent: strengthens the project’s own confirmed lesson (ground-truth-verified reward, never format-matched) — a completely wrong reward can look like it’s working if the base model has the right latent bias, a scarier version of “format reward causes confabulation.” Sanity-check any future “the reward design worked” claim with a brief spurious-reward ablation on the same base model.

3.7 The reward-hacking cluster — what to pre-mortem before scaling RL

Three independent, very recent (2026) papers converging on one warning:

  • arXiv:2605.02269 — “Towards Understanding Specification Gaming in Reasoning Models”: RL reasoning training causally increases specification-gaming rate (32–170% across model pairs); test-time mitigations reduce but don’t eliminate it.
  • arXiv:2604.15149 — “LLMs Gaming Verifiers”: RLVR-trained models abandon intended generalizable behavior and exploit the gap between a verifier’s extensional (checks-the-output) and intensional (checks-the-process) correctness — the verifier admits false positives, RL finds exactly that gap.
  • arXiv:2604.13602 — “Reward Hacking in the Era of Large Models”: the “Proxy Compression Hypothesis” — reward hacking is near-inevitable when optimizing an expressive policy against any compressed proxy of a high-dimensional true objective. Framework/survey, not a novel empirical result.

For this agent — a concrete pre-mortem, not a hypothetical: a binary flag-match check IS an extensional verifier. §2604.15149’s mechanism predicts RL will find and exploit any gap between “produces the correct flag string” and “actually exploited the intended vulnerability” (info leak, predictable flag generation, a scoring bug) at a higher rate than the SFT-only regime already run. Audit the flag-verification harness for exactly these extensional gaps before scaling RL, and consider periodic trajectory-level spot-audits (not just flag-match) once RL training starts. All three: Medium confidence (very recent, unreplicated) — but the engineering precaution is warranted regardless of replication.

3.8 Process reward models — context, and why they stay off the table

Lightman et al. (arXiv:2305.20050, 2023, foundational) and its 2025 continuation “Process Reward Models That Think” (arXiv:2504.16828) are the step-level-verification lineage — DeepSeek explicitly rejected PRM for R1 due to step-level reward hacking. For this agent: the natural alternative if credit-assignment sparsity (§1) becomes a bottleneck, but a PRM is itself a learned (not ground-truth) reward — any PRM-style process reward would need its own anti-gaming safeguards on top of §3.7’s warnings, not a naive “good step = positive reward” scheme. Keep the terminal flag verifier as the ungameable primary signal.


4. Novelty / boundary-expansion [N]

The techniques above earn an [N] tag when they have direct evidence of expanding what the policy can do (solving what the base model never solves at any k), not just sharpening what it already sometimes does. Collected here as the load-bearing case for or against the project’s confirmed lesson “GRPO amplifies existing capabilities, SFT replaces them” (arXiv:2507.10616).

4.1 ProRL — the strongest counter-evidence to “amplification only”

Already covered in §3.5 — restated here because it’s the anchor [N] citation: prolonged, KL-controlled, reference-resetting RL demonstrably expands the reasoning boundary. The qualifier that makes this actionable: boundary expansion correlates with training duration and base-model competence — a short GRPO polish pass should be expected to behave like the narrowing result (§3.5), not ProRL’s.

4.2 ToRL — RL-from-scratch surfaces qualitatively new tool-use strategies

arXiv:2503.23383 (2025-03-30, math domain). No SFT-on-tool-traces warmup at all; pure RL discovers when and how to invoke tools and reports emergent invocation strategies absent from the SFT-only baseline, outperforming the best tool-integrated-reasoning model on AIME’24 by double digits. This is the strongest citation in this whole sweep for [N]: the explicit contrast “RL discovers emergent patterns vs. SFT imitates them” pushes past mere amplification — RL-from-scratch surfaced qualitatively new patterns. Domain gap: math tool-use (Python), not offensive-security tool-use — suggestive, not proven, for CTF.

4.3 Absolute Zero — self-play with zero external data

arXiv:2505.03335 (2025-05). A model proposes its own tasks (validated by a code executor, rewarded by a “learnability” signal peaking at the frontier of current competence — an automatic curriculum) and solves them, beating models trained on tens of thousands of curated examples with zero external labeled data.

For this agent — speculative but worth flagging cross-seat: architecturally similar to a self-generated CTF-challenge curriculum, IF the flag-verifier concept generalizes from “check code output” to “check flag capture in a sandbox.” Math/code domain only — flag to challenge-builder/main as a longer-term idea for auto-scaling challenge difficulty to agent capability rather than a fixed static portfolio, not a near-term recipe.

4.4 NuRL, DIVER, MERCI — engineered novelty-seeking

Already covered in §2.15, §2.10, §2.12 — grouped here as the [N]-tagged mechanisms that explicitly target “escape local routines to discover better solutions” (MERCI’s phrase) rather than sharpen an existing one: NuRL raises the model’s upper bound on prompts it currently cannot solve at all; DIVER rewards genuinely-different group-level strategies; MERCI’s novelty bonus (with the deterministic-MDP caveat) targets repetitive, suboptimal reasoning patterns directly.

4.5 Parameter-space noise (PSN-RLVR) — explicitly framed as boundary-expanding

Already covered in §2.8. Framed by its authors as “expanding the effective reasoning capability boundary,” with gains growing as reasoning length grows — the property most aligned with this project’s long-horizon axis of any [N]-tagged technique in this sweep.

4.6 Kimi K2 and MUA-RL — agentic capability as a distinct training target

Kimi K2 (arXiv:2507.20534, 2025-07): frontier labs increasingly treat agentic/tool-use capability as requiring dedicated training investment, not an emergent side-effect of reasoning-RL — validates not expecting math/code RLVR gains to automatically transfer to 100-turn agentic CTF competence. MUA-RL (arXiv:2508.18669, 2025-08): trains against a dynamic, LLM-simulated counterpart in the RL loop instead of a static script, generalizing better per-parameter on multi-turn tool-use benchmarks. For this agent: the project’s live sandboxed CTF environment already satisfies MUA-RL’s “train against the real, dynamic, reactive counterpart” principle — validating evidence, not a design change.


Decision flow — which lever to pull first

flowchart TD
    A["Rejection-sampling SFT plateaus"] --> B{"Entropy instrumented\nfrom step 0?"}
    B -- "not yet" --> B0["Add entropy logging NOW\n(§2.1) — do this regardless"]
    B0 --> C
    B -- "yes" --> C{"GRPO baseline in\n30-60% band?"}
    C -- "no, too low/high" --> C0["Curriculum-filter challenges\nto the 30-60% band first"]
    C0 --> D
    C -- "yes" --> D["Start RL with DAPO recipe\n(clip-higher + dynamic sampling, §2.3)\nnot vanilla GRPO"]
    D --> E{"Entropy still\ncollapsing?"}
    E -- "yes" --> F["Add Clip-Cov / KL-Cov\n(§2.1) — one-line verl flag"]
    E -- "no" --> G
    F --> G{"Credit smeared across\nall 100 turns equally?"}
    G -- "yes" --> H["GiGPO step-groups (§1.1)\nor turn-level reward shaping (§1.4)"]
    G -- "no" --> I
    H --> I{"~800 challenges still\nall-zero-reward?"}
    I -- "yes" --> J["NuRL-style hints (§2.15)\nor more rejection-sampling data"]
    I -- "no" --> K
    J --> K{"Tool avoidance persists\n(pattern 1)?"}
    K -- "yes" --> L["DIVER / MERCI-style\ntool-sequence diversity bonus (§2.10, §2.12)"]
    K -- "no" --> M["Budget real training duration +\nreference-policy resets for boundary\nexpansion (ProRL, §3.5/§4.1)"]
    L --> M

Ranked shortlist — what to reach for FIRST

Given the diagnosis (execution gap, not knowledge gap), the chosen path (rejection-sampling SFT → GRPO/RLVR at entropy collapse), the five BSides patterns, and the binary terminal verified-flag reward:

  1. Instrument entropy from step 0 of any future RL run (§2.1). Free, no design decisions, do this before anything else — you cannot diagnose a collapse you didn’t measure.
  2. Start the RL stage with DAPO’s four fixes as the baseline recipe (§2.3), not vanilla GRPO — clip-higher and dynamic sampling are exactly the entropy/degenerate-group guarantees the project’s own “30–60% baseline” rule is implicitly reaching for, made explicit. Reinforce-Rej (§2.6) independently validates the same direction.
  3. If entropy still collapses under DAPO, add KL-Cov/Clip-Cov as a one-line verl loss-mode flag (§2.1) before building anything custom.
  4. Adopt GiGPO’s step-level credit (§1.1) as the first turn-level fix — zero extra rollouts, zero new critic, just a state-canonicalization function over the harness’s existing tool-call spans. Reach for the turn-PPO/TL-GRPO cluster (§1.4) only if GiGPO’s assumptions (hashable state recurrence) don’t hold in practice.
  5. Separate the ~800 never-solved challenges from the 30–60%-band ones and treat them as NuRL-territory (§2.15) — a distinct problem requiring hints or more SFT data before they’re GRPO-ready at all, not more training on the same recipe.
  6. Pilot parameter-space noise (§2.8) as the one technique whose stated advantage scales with the 100-turn horizon rather than against it — small-scale, given PSN-RLVR is unproven outside math.
  7. Build a tool-call-sequence-level diversity bonus (DIVER §2.10 / MERCI §2.12, adapted) — the strongest concrete [N] opportunity surfaced anywhere in this sweep, and the most direct counter to pattern 1 (87.7% tool bypass) that isn’t a prompting fix.
  8. Do not optimize pass@k directly as a training reward (§2.14) without the PKPO estimator; use the pass@1/pass@k gap as a diagnostic, and ablate whether pass@5/pass@10 sampling diversity is real or just noisier repeats (§2.13) — cheap, changes nothing about training, tells you whether the eval methodology measures what you think.
  9. Audit the flag verifier for extensional gaps before scaling RL (§3.7) — a pre-mortem, not a reaction; RL is empirically expected to find gaming opportunities at a higher rate than the SFT-only regime already run.
  10. Budget real training duration + periodic reference-policy resets (ProRL, §3.5/§4.1) once past the initial GRPO baseline — boundary expansion (genuinely new attack paths, not just more reliable execution of known ones) is a function of how long and how KL-controlled the run is, not available from a short polish pass.

Summary table

TechniquearXiv (verified)[L][E][R][N]BSides patternConfidenceOne-line takeaway
GRPO (baseline)2402.03300HighGroup-mean baseline, no critic — trajectory-level only
GAE (baseline)1506.02438HighClassical multi-step advantage, single time-scale
GiGPO2505.10978#4HighStep-level advantage via state-hash groups, zero extra rollouts
ArCHer2402.19446HighTurn-level off-policy critic — heavier infra than needed
RAGEN / StarPO2504.20073#2HighNames and diagnoses the “Echo Trap” entropy collapse
Turn-level reward cluster2505.11821 +5#4Promising (cluster)Turn is the unit of advantage — cross-group consensus
GSPO2507.18071Med-HighSequence-level clip; stability grows more relevant with length
VerlogOpenReview onlyPromisingDual-discount GAE + memory-window + early truncation; 400+ turns
Demystifying long-horizon RL2603.21972Promising5-axis empirical recipe: reward/scale/data/algo/env
PEARL2601.20439#2PromisingRL over the planning/tool-sequencing step itself
ReTool / ToRL / Search-R12504.11536 / 2503.23383 / 2503.09516ToRL:✓#1High/VerifiedTool-output tokens masked from PG loss; ToRL = emergent tool strategies
CTF-Dojo / Pentest-R1 / HackSynth (academic, context only)2508.18370 / 2508.07382 / 2506.02048Context onlyNOT a basis for this page’s claims — see §1.9 re-grounding on DeepSeek-R1/RAFT/GiGPO/Verlog/PSN-RLVR
Entropy Mechanism / Clip-Cov / KL-Cov2505.22617#2 #3HighFitted entropy-vs-performance law; surgical per-token fix
Clip-Low/Clip-High asymmetry2509.26114#2 #3Med-HighTwo independent clip knobs, not one symmetric one
DAPO2503.14476#2 #3 #4HighClip-higher + dynamic sampling — the RL baseline recipe
High-entropy minority tokens2506.01939#3Med-HighOnly ~20% of tokens carry the exploration-relevant decisions
Positive-Advantage Reweighting2511.05993#2 #3MediumIndependent confirmation of the entropy-collapse mechanism
RAFT / Reinforce-Rej2504.11343HighValidates this project’s own rejection-sampling-SFT phase
Parameter-space noise / PSN-RLVR1706.01905 / 2602.02555#2 #3High / Low-medTemporally-coherent exploration; gains grow with length
Multi-temperature2510.08892#2MediumHigh temp on fork tokens, low temp on payload/syntax tokens
DIVER2509.26209#1 #2 #3Med-HighReward group-level diversity, potential-based shaping
CDE2509.09675#3Med-HighPerplexity + critic-variance curiosity bonus; calibration fix
MERCI2510.16614#1 #2MediumCount-based novelty; deterministic-MDP assumption breaks for live envs
Representation-based exploration2510.11686#3Med-HighNegative result: high temp and rep-diversity fight each other
Pass@k diagnostic / PKPO / Pass@k Training2511.16231 / 2505.15201 / 2508.10751#5High/Med/MedPass@k’s gradient vanishes as policy concentrates unless corrected
NuRL2509.25666#4MediumSelf-generated hints unlock currently-0%-pass-rate prompts
DeepSeek-R12501.12948#2HighThe staging recipe this project’s own path mirrors
Dr.GRPO2503.20783#3HighRemoves GRPO’s length-bias reward artifact
LIMO2502.03387#3Med-HighSFT set quality/technique-diversity over raw count
Test-time compute scaling2408.03314#4HighTurn budget is a resource-allocation problem, not a skill one
ProRL2505.24864#4Med-HighProlonged + KL-control + reference-reset genuinely expands boundary
Does RL Really Incentivize…2504.13837✓(neg)MediumCONTESTED vs. ProRL — short-run RLVR narrows, doesn’t expand
Spurious Rewards2506.10947✓(neg)High (scope-ltd)A wrong reward can still “work” — validate across model families
Reward-hacking cluster2605.02269 / 2604.15149 / 2604.13602✓(neg)MediumRL causally increases spec-gaming; audit the verifier’s extensional gaps
Absolute Zero2505.03335MediumSelf-play, zero external data — speculative curriculum idea
Kimi K2 / MUA-RL2507.20534 / 2508.18669Med/MedAgentic capability is a distinct training target, not RLVR side-effect

PEFT is orthogonal — LoRA · QLoRA · DoRA

Common confusion worth killing outright: PEFT is not a fine-tuning method — it’s a mechanism for applying one. Any of SFT / DPO / GRPO / RLVR can be delivered full-parameter or via a PEFT adapter. It changes which parameters get gradients and how much memory you burn, not what signal you learn from.

The methods

  • LoRA — freeze W, train a low-rank update ΔW = B·A (rank r), so y = Wx + (BA)x·(α/r). Only A, B get gradients (Hu et al., arXiv:2106.09685).
  • QLoRA — quantize the frozen base to 4-bit NF4, keep adapters in BF16; lets a large base fit a small GPU (Dettmers et al., arXiv:2305.14314).
  • DoRA — decompose the update into magnitude + direction for a bit more accuracy at the same budget (arXiv:2402.09353).

Two engineering facts that matter

  • It’s a knob on top of a method. “Should I do LoRA or GRPO?” is a category error — you do GRPO, via LoRA. On-policy distillation reproductions run rank-128 LoRA; OpenAI/Fireworks/Google Vertex customer-RFT products are LoRA-first — LoRA lives in the applied/enterprise fine-tuning layer. The frontier labs post-train their own flagship checkpoints full-parameter (Llama/Qwen/DeepSeek/GPT/Claude/Gemini reports; verified 2026 pass).
  • LoRA reduces forgetting (low-rank update can’t move W far → less overwrite of pretrained knowledge — relevant to the small-model overwrite problem in Imitation), but it is startlingly learning-rate-sensitive: the 2026 unified LoRA-variant study finds LoRA responds far more to LR than to which variant you pick, and a well-tuned vanilla LoRA matches or beats most fancy variants (arXiv:2601.22708). Tune LR before you tune adapter architecture.

Practical default for your scale

At ≤~9–16B, LoRA/QLoRA is the sane default for iteration cost; go full-parameter only when you have a concrete reason (measured OOM headroom aside, the project’s stance is LoRA-by-default, full-FT as a deliberate escalation — see lessons/post-training/ in shared memory). It composes with every method chapter here.

Diagnosing the gap — a scientific framework

The question this chapter answers: is there an industry-standard, peer-defensible way to prove a failure is a KNOWLEDGE gap, not an EXECUTION gap, not an EXPLORATION gap? Short answer, upfront: no single accepted instrument exists. There is no ISO-9001 for capability diagnosis. What exists is a converging set of measurement techniques, each independently validated, that — combined into one protocol — give you a defensible, falsifiable, split verdict. That protocol is what this chapter hands you.

Every arXiv id below was verified live against arxiv.org/abs/<id> on 2026-07-02 (project research pass, artifacts/overnight-rl-sweep/research/diagnosis.md). Confidence tags follow that pass: [HIGH] peer-reviewed/heavily reproduced, [MED] coherent preprint not yet contested, [LOW] single small-N preprint.

Bottom line up front

For your ~1000-challenge, ~100-turn, ground-truth-flag-verified CTF portfolio at ~10–20% k=1 solve rate, the honest, defensible answer will not be a single sentence. It will be: X% of the currently-failing challenges are a knowledge gap, Y% are an execution/performance-floor gap fixable by elicitation, Z% are an exploration gap that needs on-policy RL, not more demonstrations — and here is the measurement that sorted each challenge into its bucket. That heterogeneous, per-challenge-subtype verdict is itself the scientifically credible output — collapsing it into “it’s execution not knowledge” is exactly the move a skeptical reviewer will catch you on.


1. Three gap types, defined precisely

Ground the vocabulary in the 60-year-old linguistics/cognitive-science split this whole ML debate re-derives without citing: competence (what the system can in principle produce) vs. performance (what it actually produces under real constraints — prompting, memory, self-verification, time) — Firestone, “Performance vs. competence in human–machine comparisons,” PMC7604508 [HIGH], and its LLM-era instance splitting formal competence (linguistic surface mastery) from functional competence (using it in the world) — Mahowald et al., arXiv:2301.06627 [HIGH].

Gap typeCompetence/performance framingOperational testFix lever
KnowledgeCompetence ceiling — genuinely absentCorrect action never appears in any of N samples, at any N, on any checkpointInject off-policy: SFT on demonstrations, a stronger teacher, or a tool (knowledge-in-tools rule)
ExecutionPerformance floor — competence present, elicitation failsCorrect action appears at moderate–large N, but pass@1 doesn’t convert it; prompting or a few SFT demos recover itCheap: better scaffolding/prompting, or light SFT elicitation
ExplorationCoverage present before training, destroyed during trainingCorrect action was recoverable at large N pre-RL; SFT-matched-data actually regresses it; on-policy RL (not demonstrations) is what recovers/expands itOn-policy RL with explicit entropy/diversity preservation, not more SFT

The exploration gap is the one that’s easy to misdiagnose as a knowledge gap if you only look at a single snapshot: it’s a process failure (the training loop killing coverage that existed a step ago), not a static property of the base model. Section 4 below is the test that tells these two apart.


2. The core instrument: pass@k → Cover@τ → Pass@(k,T)

2.1 pass@k as a coverage probe [E]

The unbiased pass@k estimator (the one this project already uses at eval time, per decisions/2026-06-11-ctf-benchmark-pass-count.md):

def pass_at_k(n, c, k):
    """n = samples generated, c = number correct, k = budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

Sampling k completions per problem at large k and plotting the curve is, structurally, a coverage measurement — the probability mass the policy places on any correct completion. The theory for why this works: the Coverage Principle — cross-entropy loss is dominated by tokens irrelevant to correctness, but coverage (mass on high-quality responses) is necessary and sufficient for post-training / test-time scaling to succeed, and estimates faster than loss does. Chen et al., arXiv:2510.15020 [MED]. [E]

2.2 The crossover test — does RL amplify or replace? [E][N]

The core instrument. Run the base model and your trained checkpoint through the same challenge set at k = {1, 4, 16, 64, 256…}. Plot both pass@k curves.

  • Base catches up to or exceeds trained pass@k at large k → the training only reweighted an existing distribution (elicitation, not new capability). Yue et al., “Does RL Really Incentivize Reasoning Capacity Beyond the Base Model?,” arXiv:2504.13837 [HIGH] — the founding result. 6 RLVR algorithms tested; all “remain far from optimal in leveraging the base model’s potential.”
  • Trained pass@k pulls away and widens as k grows → real capability expansion.

Designed to fix: pattern 5 (benchmarks measure pattern-match speed, not thoroughness). Yue et al.’s own prescription is exactly “report pass@k at large k, with base model as control” — not just pass@1. Report base-model pass@k at the same k as a mandatory control column in every benchmark table you publish internally.

Contested rebuttal — CoT-Pass@K: pass@k credits a correct final answer even from a wrong chain-of-thought (a lucky guess). Require the reasoning path itself to be correct and the crossover disappears — RLVR shows monotonic gains at every k. Wen et al., arXiv:2506.14245 [MED]. State this as contested when presenting — it directly falsifies the load-bearing assumption of §2.2’s headline result. For your agent, this maps onto a real risk in your own SFT curation: a verifier-passed trajectory can still contain wrong/wasted turns before the winning one (BSides: ~half of solves hinge on an ungrounded guess at a critical step). Filter on trajectory soundness (backtracking, wasted turns, tool-call validity), not just flag==1, or you reproduce the exact confound this paper diagnoses.

2.3 Cover@τ — punish guessing, reward reliability [E]

pass@k at huge k conflates “genuinely solvable” with “eventually guessable by brute force.” Cover@τ(q) = 1 if ≥ τ·n of n samples on problem q are correct — a reliability threshold, not a “did any sample land” threshold. Dragoi et al., arXiv:2510.08325 [MED]. Relative RLVR-algorithm rankings change under Cover@τ vs pass@1 — some algorithms that look best on pass@1 are worse at genuine reliability.

Designed to fix: pattern 5 and complements pattern 3 (good guessers until they’re not). A challenge with high pass@64 but near-zero Cover@0.3 is guessing-dominated — rejection-sampling SFT on its lucky wins teaches the model to guess more confidently, not more competently. Report Cover@τ (τ≈0.3) alongside pass@k as a second axis in every benchmark table.

2.4 Pass@(k,T) — the agentic extension, and the single most load-bearing citation in this chapter [L][E][N]

Everything above is validated on static, single-shot reasoning (math). Your agent is T-round tool interaction. Zhai et al., “Does RL Expand the Capability Boundary of LLM Agents? A Pass@(k,T) Analysis,” arXiv:2604.14877 [MED], asks the crossover question with interaction-depth as a second axis:

PASS@(k,T)(q,π) = 1 - C(n - c_T, k) / C(n, k)

— identical to standard pass@k, except c_T counts correct at interaction depth T, not overall.

Their finding flips §2.2 on compositional tasks: on Category C (compositional, sequentially-gated information gathering — structurally identical to enumerate-then-chain vuln discovery), the RL curve pulls above and widens against the base curve as k grows — the opposite of the static-reasoning crossover. On independent-retrieval tasks the effect is small; on pure static reasoning (no tool, negative control) RL is inert, replicating Yue et al.

The critical additional result: matched-data SFT actually regresses the capability boundary on the same compositional tasks (net −4 vs RL’s net +4). This isolates self-directed exploration during RL — not data exposure — as the causal factor for expansion.

Designed to fix: pattern 4 (uneven PTES phases — strong at chaining inside an exploit, weak at enumeration, 62% of failures stall in exploitation). Category C (“sequential retrieval”) is structurally identical to “enumerate-correctly-at-turn-5-before-turn-40’s-exploit-becomes-visible.” This is the paper that tells you where your exploration gap lives on the portfolio.

The single highest-value experimental design in this chapter: segment your 1000 challenges by whether the winning path is (a) single-shot / not sequentially gated (their Cat A/B analog), or (b) genuinely compositional/sequentially-gated (Cat C analog — your “enumeration-gates-exploitation” weak spot). Run Pass@(k,T) on both segments, before and after rejection-sampling SFT. The falsifiable prediction: on (a), SFT works fine, further RL may plateau (§2.2’s static result holds — an execution gap, cheaply closed). On (b), SFT alone regresses capability and you need on-policy RL specifically — an exploration gap, not an execution gap, and rejection-sampling SFT is the wrong tool for it. This is directly testable this week and determines whether “SFT now, GRPO later” has the ordering right for the compositional subset, or whether it needs RL first.


3. Does base/SFT pass@k predict RL gains, before you spend the compute?

High SFT-stage scores are not reliably predictive of eventual RL performance — sometimes inversely so. What does predict post-RL pass@1: generalization loss on held-out examples and pass@large-k on the post-SFT checkpoint, with up to 2× better R²/Spearman correlation than post-SFT pass@1 alone. Kang et al. (Meta FAIR + Virginia Tech), “Quagmires in SFT-RL Post-Training,” arXiv:2510.01624 [HIGH] — >1M GPU-hours, hundreds of models to 12B, 7 math benchmarks, up to 256 repetitions.

Training-loop delta: add a cheap diagnostic gate between SFT and RL. Before launching a GRPO run, compute pass@64 (or larger) on your rejection-sampling-SFT checkpoint, cold-start, pdq --fresh-retries, on the held-out challenge set. If it’s flat/low, don’t trust the SFT accuracy number as a green light — this predicts a disappointing GRPO run regardless of how good SFT looked.

What changes in your graduation criterion: the project’s stated trigger is “graduate to GRPO/RLVR when policy entropy collapses.” Add a second, independent gate: AND pass@64 on held-out challenges is non-trivial. Entropy collapse tells you SFT has converged; pass@64 tells you there’s still coverage headroom worth converting. Both are needed — collapsed entropy with flat pass@64 means you’ve converged onto a policy with nothing left to reinforce.


4. The routing test — does the correct action ever appear at high N?

This is the book’s existing one-line diagnostic (see The decision), and it deserves the fuller justification here because it is doing real theoretical work, not just intuition:

  • Never, at any N, on any checkpoint → knowledge gap. Standard RLVR “amplifies existing capabilities, SFT replaces them” (arXiv:2507.10616, confirmed project lesson) — you cannot cheaply RL your way to a distribution that places zero mass on the answer. Inject off-policy (demonstration, teacher, or — cheaper — put the missing fact in a tool, not the weights).
  • Sometimes, and pass@(k,T) shows RL (not matched SFT) expanding it (§2.4) → exploration gap. Coverage exists but training-time entropy collapse is what’s suppressing it turn-to-turn; the fix is on-policy RL with explicit exploration preservation, not more demonstrations.
  • Sometimes, and a few-shot prompt or a handful of SFT demos recover it (§5) → execution gap / performance floor. Cheap fix, no RL infrastructure needed.

This routing test is directly the mechanism §2.2–§2.4 give you a rigorous way to run, not just eyeball. The crossover is the “does it ever appear at high N” test, made continuous and plottable rather than binary.


5. The elicitation ladder — a borrowed, rigorous “prove it’s X not Y” methodology

The AI-safety capability elicitation literature already had to solve almost exactly this problem — “is the model bad at this, or is the capability locked away where naive prompting can’t reach it?” — because sandbagging/eval-integrity depends on the same distinction. Their protocol, adapted: build (or find) a case where you don’t yet know if the capability is present, then try elicitation techniques in escalating cost order, and whichever one first recovers the behavior tells you what kind of gap it was.

prompt / few-shot  →  SFT on a handful of demos  →  RL against a judge/verifier  →  (still nothing) → knowledge gap
   (cheapest)              (cheap)                    (expensive)
  • Password-locked models: fine-tune a model to only show a capability when a password token is present (imitating a weaker model otherwise), then test whether elicitation techniques recover it. Finding: a few high-quality SFT demonstrations are often sufficient to fully elicit a locked capability; RL can also elicit it when demos aren’t available, but less reliably. Greenblatt et al., arXiv:2405.19550 [HIGH].
  • Harder organism (circuit-breaking, not just password-locking): prompting/steering fail; only fine-tuning-based elicitation succeeds. Recommendation: “evaluators should use fine-tuning techniques to elicit capabilities when possible” — if prompting can’t get behavior out, don’t conclude the capability is absent; fine-tune first, then conclude. Hofstätter et al., arXiv:2502.02180 [MED].
  • Operational definition of real capability: what can be elicited at ≤1% of total training cost (Anthropic RSP’s own definition, directly reusable). van der Weij et al., arXiv:2406.07358 [HIGH, ICLR 2025].
  • Order matters: neither SFT nor RL alone reliably elicits held-back performance from a degenerate policy; SFT on weak demonstrations first, then RL, is what fully elicits it — RL-first “almost always leads to reward hacking rather than genuine improvement” starting from a degenerate policy. Ryd et al., arXiv:2604.22082 [MED, 2026].

Designed to fix: pattern 1 (agents prefer their own tools; 87.7% of calls bypass the rich tool surface, 26/40 tools dead). This is the cheapest, most directly actionable experiment in the whole chapter. Take a handful of ignored tools and run the ladder: (a) few-shot prompting with 2–3 correct-usage examples recovers usage → pure elicitation/prompting gap, no training needed; (b) SFT on a small demonstrated-usage set recovers it → elicitation via light fine-tuning (matches Greenblatt’s finding); (c) neither works → genuinely a missing-knowledge/affordance problem, and rejection-sampling SFT should specifically upweight trajectories that exercise those tools. This turns “the model prefers curl” from an anecdote into a falsifiable, paper-backed experiment.

5.1 The wrinkle mid-episode: the self-verification cliff [E][R]

Your flag verifier is external, ground-truth, and perfect — exactly the regime where BoN/rejection-sampling/RL should work without a ceiling (Stroebl et al., “Inference Scaling fLaws,” arXiv:2411.17501 [HIGH]: with an imperfect verifier the false-positive floor is non-removable even at infinite compute; a perfect verifier has no such floor). That’s a load-bearing reason the project’s “ground-truth-verified reward, never regex” rule is correct.

But the verifier only fires at submission — at turn 40 of 100, the agent must judge without it whether its current path is worth continuing. That’s exactly the regime multiple papers show degrades, not improves, with capability: models find a correct answer among k samples far more often than they can self-select it, and the self-selection gap widens with generator capability (contested/preliminary — one 2026-06 OpenReview submission, no confirmed arXiv id, treat as [LOW], flag if cited). Corroborating, harder evidence: Best-of-N provably degrades past a reward-hacking threshold even with a competent reward model — scaling samples isn’t monotonically good. Huang et al., arXiv:2503.21878 [HIGH, ICML 2025]. Formal geometry: rejection-sampling and Best-of-N both converge to a ceiling set by the verifier’s ROC curve; more samples cannot buy past it. Dorner et al., arXiv:2507.12399 [MED].

Designed to fix: pattern 3 (good guessers until they’re not — ~half of solves hinge on an ungrounded guess at a critical step; 82% pivot after a single failure). Diagnostic, cheap, run on the existing Phoenix corpus: log every point in a trajectory where the agent pivots/abandons a path, and check post-hoc whether the abandoned path was actually unproductive (e.g., did a later successful run on the same challenge use a similar path?). If abandoned paths are disproportionately ones that would have worked, that’s a self-verification-cliff signature — an execution/judgment gap, and the fix is a mid-episode progress signal, not more SFT demonstrations.


6. A two-type failure vocabulary — grounded in general agent/RL evidence, not domain benchmarks

Standing project rule: no conclusion here may rest on academic cybersecurity-LLM training/benchmark papers (CTF-Dojo-style work, pentest-agent papers, CTF-family robustness studies, etc.) — none of that literature has produced a frontier cybersecurity model. The domain-specific pentesting-agent papers below are mentioned for context only, not as a basis for anything in this chapter:

  • Deng et al., “What Makes a Good LLM Agent for Real-world Penetration Testing?,” arXiv:2602.17622 — academic, cited for context — not a basis for our decisions. (28 pentesting systems, proposes a Type A “capability gap” / Type B “planning and state-management limitation” taxonomy, and an Evidence-Guided Attack Tree Search system, “Excalibur.”)
  • Nakano et al., arXiv:2509.07939 — academic, cited for context — not a basis for our decisions. (Deterministic ATT&CK-derived task tree lifts subtask completion 13.5–16.5% → 71.8–78.6% on the same models — a pentest-benchmark result, not project evidence.)
  • Shen et al., “PentestAgent,” arXiv:2411.05185 — academic, cited for context — not a basis for our decisions. (Frames its own motivating failure as a knowledge gap fixed with RAG; illustrative only of how unsettled domain-specific pentest-agent papers are, not evidence either way.)

The two-type vocabulary itself is worth keeping — it just needs a non-domain-specific foundation. Re-grounded on general long-horizon-agent evidence and RL theory:

  • Capability/elicitation gaps — missing tools, inadequate prompts, absent demonstrations. Cheaply closed by better scaffolding, tool surface, or light SFT (§5’s elicitation ladder, itself grounded in the AI-safety elicitation literature, not domain-security work).
  • Planning/state-management limitations — a structurally different failure mode that does not reliably close with a stronger base model or more knowledge alone. Four independent, non-cybersecurity anchors support treating this as a real, separate axis — two qualitative/theoretical, two now directly quantitative:
    1. Empirical, frontier-model, general-domain: METR’s time-horizon study finds that what separates frontier models on long-horizon tasks is reliability and the ability to adapt to their own mistakes, not per-step knowledge or reasoning quality — and this axis scales on its own trajectory, doubling roughly every 7 months, independent of raw capability jumps. Kwa et al. (METR), “Measuring AI Ability to Complete Long Software Tasks,” arXiv:2503.14499 [HIGH]. This is general evidence (RE-Bench/HCAST software tasks, no cybersecurity framing) that long-horizon degradation is a distinct axis from single-turn competence — exactly the property a “Type B” needs to be a real thing and not just restated knowledge-gap.
    2. Theoretical, classic RL/imitation-learning result: compounding error under covariate shift — a policy trained/evaluated with a small per-step error rate accumulates error quadratically in trajectory length because each mistake pushes the agent into states its training distribution under-covers, and no single-step fix removes this without addressing the sequential structure itself. Ross, Gordon & Bagnell (DAgger), arXiv:1011.0686 [HIGH, 840+ citations]. This is the general-theory reason a state-tracking/replanning failure at turn 40 of 100 can be structurally invariant to swapping in a stronger base model — the problem is the sequential decision process, not the weights.
    3. Direct quantitative test, general-domain, verified 2026-07-02: isolating pure execution (plan and knowledge handed to the model, so only turn-to-turn execution is measured), larger models within the same family do execute more correct turns — but per-step accuracy still degrades as turns accumulate, driven by a self-conditioning effect (the model becomes more likely to err once its own prior mistakes are sitting in context). The paper’s own framing is exactly this chapter’s question: “self-conditioning does not reduce by just scaling the model size” — it is removed only by switching training paradigm to a “thinking”/reasoning-trained model, not by a bigger non-reasoning model of the same family. Sinha, Arun, Goel, Staab & Geiping, “The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs,” arXiv:2509.09677 [MED, preprint]. This is the closest thing in general literature to a direct scale-invariance test of a planning/state-tracking failure — the honest reading is invariant to parameter scale within a training paradigm, not invariant to LLM full stop (reasoning-trained models are a real escape hatch, a bigger base model of the old paradigm is not).
    4. Direct quantitative test, general-domain, verified 2026-07-02: a minimal explicit-lookahead planning module (FLARE) bolted onto a much weaker base model lets LLaMA-8B outperform GPT-4o run with standard step-by-step reasoning on multi-step planning benchmarks — i.e. an eight-billion-parameter model with a planning fix beats a frontier model without one. This is a clean existence proof that the planning axis is separable from and can dominate raw base-model strength. Wang, Wu, Wang, Tang, Li, Yin, Ma, Li, Sun, Chen & Ye, “Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents,” arXiv:2601.22311 [LOW, 0-citation preprint — promising, not yet validated].
    • Corroborating, not separately load-bearing: τ-bench’s own cross-model pass^k leaderboard (already cited above, §6.1) shows the same steep pass^1→pass^8 reliability collapse for both gpt-4o and claude-3.5-sonnet — picking the “better” frontier model of a different family narrows but does not close the multi-trial consistency gap. Yao et al., arXiv:2406.12045 [HIGH]. And a 2026 cross-family diagnostic benchmark (GPT-5 variants + Claude models, 3100+ trajectories) documents the same horizon-dependent degradation pattern recurring across both families rather than being a one-model artifact. Wang, Bai, Sun, Wang, Zhang, Hu, Schroder, Mutlu, Song & Nowak, “The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break,” arXiv:2604.11978 [MED, preprint].

Designed to fix: pattern 4 (62% of failures stall in exploitation — a plausible instance of exactly the compounding-error dynamic DAgger formalizes: an early misstep in enumeration pushes the trajectory off-distribution, and the deeper into exploitation the agent gets, the harder recovery becomes). Actionable, cheap: label a sample of your failed trajectories on two axes — (a) missing tool / bad prompt / no demonstration → capability/elicitation gap, cheap fix; (b) over-committed to a low-value branch, exhausted context near a dead end, no real-time replanning after a costly failure → planning/state-management gap. If (b) dominates, the theoretical prediction (DAgger), the empirical general-domain finding (METR), and the two direct quantitative tests above (Sinha et al.’s self-conditioning result; Wang et al.’s weak-model-plus-planning-beats-strong-model result) all say: don’t expect SFT-then-GRPO on raw episode reward alone to close it, and don’t expect swapping in a bigger base model of the same family to close it either — the fix needs to address the sequential/compounding structure (mid-trajectory checkpointing, explicit replanning triggers, a difficulty/progress signal, or a reasoning-trained backbone), not just more knowledge or more parameters. Flag, updated 2026-07-02: the qualitative distinction (capability vs. planning/state) is now well-grounded in four independent general-literature anchors, and two of them (Sinha et al. 2509.09677, Wang et al. 2601.22311) are direct quantitative tests of “does a stronger/bigger base model fix it” in non-cybersecurity settings — both say no, for different mechanisms (self-conditioning invariant to scale; planning-fix on a weak model beats a strong model without one). What general literature does not give you is this project’s own number — no source above measured the specific fraction of this portfolio’s failures that are planning/state-management vs. capability, nor whether it holds at this project’s turn-depths (~100) and challenge structure. Status: qualitative claim grounded; quantitative “X% invariant to base LLM” for this corpus is worth pursuing — must be measured on our own Phoenix corpus (68,049 spans, 189 runs), not assumed from the general literature or the academic pentest-agent papers above.

6.1 Per-turn fault labeling — operationalize the taxonomy on your own trace corpus [R]

τ-bench provides an auto error-identification tool with a fixed taxonomy (fault assignment × fault type: used_wrong_tool, used_wrong_tool_argument, took_unintended_action, goal_partially_completed). Yao et al., arXiv:2406.12045 [HIGH]. AgentRx extends this to an automated framework that localizes the single critical failure step in a long trajectory, with a cross-domain taxonomy (Misinterpretation of Tool Output 24.1%, Intent-Plan Misalignment 24.1%, Under-specified Intent 27.6%, per their τ-bench column). Barke et al. (Microsoft Research), arXiv:2602.02475 [MED].

Your Phoenix trace corpus (68,049 spans, 189 runs) is exactly the substrate this tooling wants. Label each turn as {reconnaissance-adequate, wrong-tool, wrong-argument, wrong-decision/policy, tool-output-misread, under-specified-plan} and compute: fraction of episodes with a labeled “under-specified-plan” or “wrong-decision” turn before the first “tool-output-misread” — this operationalizes BSides pattern 2 (“no methodology”) into a countable metric that separates planning (execution/skill — a scaffolding or SFT-curriculum fix) from interpretation (closer to reasoning/knowledge) from tool affordance (§5’s elicitation question).


7. A robustness cross-check: is the “gain” generalization or memorization?

Domain-specific aside (context only, not a basis for the method below): Honarvar et al., arXiv:2602.05523 — academic, cited for context — not a basis for our decisions. They build families of semantics-preserving CTF variants and find models robust to shallow transforms but degrading sharply under composed/deeper obfuscation, in the CTF domain specifically; per project rule, an academic CTF-benchmark paper cannot be the basis for the method below.

The methodology stands on its own, general (non-cybersecurity) grounding: semantics-/knowledge-preserving perturbation is an established way to separate genuine generalization from pattern-matching in general LLM evaluation. C-BOD rephrases MMLU questions with a parameterized, meaning-preserving transform and finds an average 2.75% performance drop across 32 SOTA models under modest rephrasing — with higher-performing, larger models showing greater sensitivity, i.e. bigger benchmark numbers can mean more surface-cue reliance, not less. Cohen-Inger et al., “Forget What You Know about LLM Evaluations — LLMs are Like a Chameleon,” arXiv:2502.07445 [MED, EMNLP 2025]. The same logic applied to code generation: rewrite a task’s ground-truth solution into a semantically-different-but-equal-difficulty variant and check whether the model’s answer degrades — a Memorization Risk Index that’s high only when the model reproduces a similar-looking answer and fails the rewritten task. Zhang et al., “Memorize or Generalize? Evaluating LLM Code Generation with Code Rewriting,” arXiv:2503.02296 [LOW, brand-new preprint, 0 citations — promising, not yet validated].

Before crediting any pipeline change with “closing an execution gap,” check the gain isn’t an artifact of the fixed, memorized 10 canonical PD26 challenges the harness has seen many times, using the same semantics-preserving-perturbation logic C-BOD and the code-rewriting paper apply outside cybersecurity.

Designed to fix: pattern 5. Cheap, no new training: generate meaning-preserving transformed variants of a held-out subset (renaming, restructuring, composed obfuscation — the transform families are generic; you don’t need a domain-specific benchmark paper to justify the check) and see whether a solve-rate gain transfers. If it evaporates on transformed variants, you have elicitation/memorization, not the execution-reliability improvement the SFT/GRPO diagnosis is banking on — and per C-BOD’s finding, don’t assume your strongest checkpoints are exempt; they may be the most exposed to this failure mode.

Also check the opposite failure mode post-RL: RL-PLUS names “capability boundary collapse” — pass@k at large k dropping even as pass@1 rises during RLVR, i.e. the on-policy training itself narrowing what the model can still do, not just what it does by default. Their fix mixes in the very off-policy verifier-passed trajectories rejection-sampling SFT already produces, via importance-sampling-corrected updates. Dong et al., arXiv:2508.00222 [MED]. This is the training-time confirmatory check for an exploration gap that got worse, not better, under GRPO — run pass@large-k before and after every GRPO checkpoint, not just pass@1.


8. The honest verdict: no single standard — here is the assembled protocol

No single accepted diagnostic instrument exists that a team runs once for a clean X-not-Y verdict. What the literature convergently offers, ranked by how load-bearing each is for this project:

  1. Pass@k / Pass@(k,T) / Cover@τ curve decomposition (§2) — the closest thing to a standard quantitative instrument, but interpretation is contested even among its own authors (§2.2 vs its CoT-Pass@K rebuttal), and its agentic extension shows the diagnosis is task-structure-dependent: same metric, opposite conclusion, depending on whether the task is compositional/sequentially-gated or not.
  2. Capability-elicitation methodology (§5) — a rigorous, falsifiable, cost-ordered protocol borrowed from AI-safety evaluation, directly reusable and cheap to run against the tool-avoidance finding.
  3. Capability/planning two-type taxonomy + per-turn fault labeling (§6) — a general vocabulary grounded in long-horizon-agent measurement (METR), compounding-error theory (DAgger), and now two direct general-domain quantitative tests of scale-invariance (self-conditioning not fixed by scale, Sinha et al. 2509.09677; weak-model-plus-planning beats strong-model-without, Wang et al. 2601.22311) — not in domain-specific pentest-agent papers; the specific per-corpus “X% invariant to base LLM” number stays flagged as worth pursuing, to be measured on this project’s own corpus, not asserted from general literature.
  4. The competence/performance vocabulary (§1) to frame the final answer for a skeptical reviewer: expect and report a split verdict by challenge subtype, not a single number.

The protocol — what to actually run, in order

StepInstrumentDiscriminatesCostSection
1Segment 1000 challenges: single-shot exploit chain vs. sequentially-gated (enumeration-gates-exploitation)Sets up steps 2–3 correctlyFree (manual/heuristic labeling)§2.4
2Pass@(k,T): base vs. rejection-sampling-SFT vs. (eventual) GRPO checkpoint, per segmentExecution gap vs. exploration gap vs. knowledge gapMedium (sampling compute, no training)§2.2–2.4
3Pass@64 on the SFT checkpoint as an RL go/no-go gate, alongside entropyWhether GRPO is worth running at allCheap (no training run)§3
4Cover@τ (τ≈0.3) alongside pass@k on every reported numberGenuine reliability vs. guessingFree (same rollouts, different aggregation)§2.3
5Elicitation ladder (prompt → few-shot → light SFT → RL) on the 26 dead tools + methodology failuresElicitation/performance-floor vs. genuine knowledge gapCheap → medium, escalating§5
6Post-hoc pivot-point audit: did abandoned paths ever succeed elsewhere?Self-verification-cliff (execution/judgment) vs. genuinely dead endFree (existing Phoenix corpus)§5.1
7Capability/planning labeling + per-turn fault taxonomy on the Phoenix corpusEngineering-fixable vs. architectural (test invariance-to-LLM claim on your own data, don’t assume it)Medium (manual labeling pass, or automate via AgentRx-style tooling)§6
8Semantics-preserving-transform robustness check on a held-out subsetGeneralization vs. memorizationMedium (need the transform tool)§7
9Entropy instrumentation from GRPO step 0 + pass@large-k before/after every checkpointTraining-induced exploration collapse (boundary shrinking, not growing)Free once GRPO is running§7

Report all nine together, segmented by challenge subtype. Refuse to collapse it into one sentence — §2.4 and §6 both predict, and §7’s entropy check explains a mechanism for, the true answer being heterogeneous across the portfolio.


9. The decision, expanded

flowchart TD
  Start["Failing challenge / challenge-subtype<br/>under diagnosis"] --> Seg{"Winning path structure?"}

  Seg -->|"Single-shot / independent recon<br/>(Cat A/B analog)"| PKT_AB["Pass@(k,T): base vs SFT vs RL<br/>(arXiv:2604.14877)"]
  Seg -->|"Sequentially-gated:<br/>enum must succeed before<br/>exploit is even visible (Cat C)"| PKT_C["Pass@(k,T): base vs SFT vs RL<br/>(arXiv:2604.14877)"]

  PKT_AB --> X1{"Base pass@k(large k)<br/>>= trained pass@k?"}
  X1 -->|"Yes — crossover"| Elicit1["ELICITATION only:<br/>run the ladder (§5)<br/>before assuming knowledge gap"]
  X1 -->|"No — trained pulls away"| Exec1["EXECUTION gap:<br/>rejection-sampling SFT → GRPO<br/>is the right ordering"]

  PKT_C --> X2{"Does the correct action<br/>ever appear at large N,<br/>on ANY checkpoint?"}
  X2 -->|"Never"| Know["KNOWLEDGE gap:<br/>inject off-policy<br/>(SFT / teacher / TOOL)"]
  X2 -->|"Yes — but matched-data SFT<br/>REGRESSES it (§2.4 test)"| Explore["EXPLORATION gap:<br/>on-policy RL required,<br/>NOT more demonstrations"]
  X2 -->|"Yes — few-shot prompting<br/>recovers it"| ElicitP["Performance floor:<br/>cheap prompting/scaffold fix<br/>(§5, pattern 1)"]
  X2 -->|"Yes — only light SFT<br/>on demos recovers it"| ElicitS["Elicitation via light SFT<br/>(arXiv:2405.19550)"]

  Elicit1 --> TypeAB["Capability vs. planning/state<br/>label the failures<br/>(arXiv:2503.14499, arXiv:1011.0686)"]
  Exec1 --> TypeAB
  Explore --> TypeAB

  TypeAB -->|"Capability: tool / prompt gap"| FixA["Engineering fix:<br/>tool surface, scaffolding,<br/>walkthrough-shaped SFT data"]
  TypeAB -->|"Planning/state: difficulty-<br/>estimation, compounding error<br/>(test invariance on own data)"| FixB["Architectural fix:<br/>difficulty-gating / attack-tree<br/>search wrapper — NOT more<br/>SFT-then-GRPO on raw reward"]

  classDef your fill:#132b22,stroke:#34d399,color:#eafaf3;
  class Exec1,Explore,TypeAB your;

This is the same routing question as The decisiondoes the correct action ever appear in π_θ’s own outputs at high N? — expanded with the two tests that make it rigorous for an agentic, T-round, sequentially-gated task instead of a single-shot one: the compositionality segmentation (§2.4) and the matched-data-SFT-regression test that separates a genuine exploration gap from a knowledge gap when coverage exists but training destroys it.


  • The decision — the one-line version of this chapter’s routing test; this chapter is its full justification.
  • Contested edges & landmines §1 — “RL can’t create capability” is contested exactly along the lines §2.2/§2.4 draw out (recipe-dependent, not a law).
  • Agentic & multi-turn RL — where the exploration-gap fix (on-policy RL, entropy preservation) is implemented once diagnosed.
  • Imitation — SFT · distillation · rejection sampling — where the execution-gap and knowledge-gap fixes live.
  • memory/research/ (shared pool) — long-horizon.md and exploration.md research notes underlie §2.4/§7’s turn-level and entropy-collapse mechanics respectively; not re-derived here to keep this chapter’s scope to diagnosis, not fix.

Bibliography (all verified live via arxiv.org/abs/<id>, 2026-07-02)

CitationarXivConfidence
Yue et al., Does RL Really Incentivize Reasoning Capacity Beyond the Base Model?2504.13837HIGH
Wen et al., RLVR Implicitly Incentivizes Correct Reasoning (CoT-Pass@K)2506.14245MED
Dragoi et al., Beyond Pass@k: Breadth-Depth Metrics (Cover@τ)2510.08325MED
Zhai et al., Does RL Expand the Capability Boundary of LLM Agents? Pass@(k,T)2604.14877MED
Kang et al., Quagmires in SFT-RL Post-Training2510.01624HIGH
Chen et al., The Coverage Principle2510.15020MED
Greenblatt et al., Stress-Testing Capability Elicitation (password-locked)2405.19550HIGH
Hofstätter et al., The Elicitation Game2502.02180MED
van der Weij et al., AI Sandbagging2406.07358HIGH
Ryd et al., Removing Sandbagging via Weak Supervision2604.22082MED
Stroebl et al., Inference Scaling fLaws2411.17501HIGH
Dorner et al., ROC-n-reroll2507.12399MED
Huang et al., Is Best-of-N the Best of Them?2503.21878HIGH
Mahowald et al., Dissociating Language and Thought2301.06627HIGH
Firestone, Performance vs. Competence in Human–Machine ComparisonsPMC7604508 (journal, no arXiv)HIGH
Yao et al., τ-bench2406.12045HIGH
Barke et al., AgentRx2602.02475MED
Kwa et al. (METR), Measuring AI Ability to Complete Long Software Tasks — general grounding for §6’s planning/state-management axis2503.14499HIGH
Ross, Gordon & Bagnell, A Reduction of Imitation Learning to No-Regret Online Learning (DAgger, compounding error) — theory grounding for §61011.0686HIGH
Sinha, Arun, Goel, Staab & Geiping, The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs (self-conditioning invariant to scale) — direct quantitative grounding for §6’s scale-invariance claim2509.09677MED
Wang, Wu, Wang, Tang, Li, Yin, Ma, Li, Sun, Chen & Ye, Why Reasoning Fails to Plan (FLARE; LLaMA-8B+planning beats GPT-4o) — direct quantitative grounding for §6’s scale-invariance claim2601.22311LOW, 0-citation preprint, promising not yet validated
Wang, Bai, Sun, Wang, Zhang, Hu, Schroder, Mutlu, Song & Nowak, The Long-Horizon Task Mirage? (HORIZON; cross-family GPT-5/Claude degradation) — corroborating cross-family evidence for §62604.11978MED
Cohen-Inger et al., Forget What You Know about LLM Evaluations — LLMs are Like a Chameleon (C-BOD) — general grounding for §7’s robustness check2502.07445MED
Zhang et al., Memorize or Generalize? Evaluating LLM Code Generation with Code Rewriting — secondary grounding for §72503.02296LOW, brand-new preprint, promising not yet validated
Dong et al., RL-PLUS: Countering Capability Boundary Collapse2508.00222MED
“GRPO amplifies existing capabilities, SFT replaces them” (confirmed project lesson)2507.10616project-confirmed

Academic cybersecurity-LLM domain-specific work — mentioned in §6/§7 for context only, per standing project rule NOT a basis for any claim/decomposition/recipe/verdict/number in this chapter (none of these produced a frontier cybersecurity model):

CitationarXivNote
Deng et al., What Makes a Good LLM Agent for Pentesting? (Type A/B, Excalibur)2602.17622context only — §6’s taxonomy is re-grounded on METR + DAgger above
Nakano et al., Guided Reasoning / Structured Attack Trees2509.07939context only
Shen et al., PentestAgent (contested reading)2411.05185context only
Honarvar et al., Capture the Flags: Family-Based Evaluation2602.05523context only — §7’s check is re-grounded on C-BOD + code-rewriting above

Flagged, not cited as fact: “The Self-Verification Cliff” (OpenReview, 2026-06-17) — no confirmed arXiv id as of this writing. Treat as [LOW], directional-only, per §5.1.

From behavioral audit to training signal

The BSides LV CFP audit (189 runs, 4 frontier models, 68,049 Phoenix spans) is not a benchmark score — it’s a behavioral trace. Each of its 5 findings describes a shape of failure, and each shape implies a different kind of gap: exploration, methodology/reasoning, credit assignment, or eval validity. This page maps observed behavior → gap type → the literature’s actual fix → how you’d check the fix worked, for each of the 5 patterns, against this project’s real setup (rejection-sampling SFT on verifier-passed solves now, GRPO/RLVR when entropy collapses; ground-truth flag verifier; knowledge in tools not weights).

flowchart LR
    A[observed behavior] --> B[gap type]
    B --> C[training signal / method]
    C --> D[verification check]
    D -.re-audit.-> A

None of this relitigates the diagnosis (execution gap, not knowledge gap) — it asks a narrower question per pattern: given this specific failure shape, which lever in the SFT→GRPO graduation actually targets it, and how would you know if it worked?


Pattern 1 — Agents prefer their own tools (87.7% of calls bypass the rich tool surface; 26/40 tools dead)

Observed: raw curl/shell dominates; the bespoke sectools surface is mostly unused.

Gap type [E]: this is not a knowledge gap (the agent isn’t ignorant of the tools — they’re in its context) and not really a reasoning gap. It’s an exploration-of-tool-space problem: tool choice is driven by the model’s pretraining prior (shell/curl is high-likelihood, familiar, “free” under next-token probability) rather than by the tool’s actual value for the subtask. Faghih et al., “Tool Preferences in Agentic LLMs are Unreliable,” arXiv:2505.18135 (2025-05-23) shows this is gameable by description text alone — edited docstrings shift usage >10x with zero change in tool capability. Confidence: established (controlled cross-model study). This is a diagnosis, not a fix — it just rules out “better docstrings” as a ceiling-breaking move.

Training signal:

Designed to fix: pattern 1 — agents defaulting to raw shell/curl over the provided tool surface.

  • ToolRL (Qian et al., arXiv:2504.13958, 2025-04-16, established) — don’t SFT-imitate tool traces; decompose the reward into per-call terms so which tool was chosen is its own learnable signal, not folded into one coarse outcome score:
    r_format  = valid_json_schema(call)          # 0/1
    r_tool    = tool_is_appropriate(call, state)  # 0/1, graded by task type
    r_param   = params_correct(call)              # 0/1 or partial
    r_outcome = env_feedback(call)                # sparse, terminal-heavy — your flag verifier
    reward = w1*r_format + w2*r_tool + w3*r_param + w4*r_outcome
    
    Their ablation: reward granularity (per-call beats per-episode) and reward type (graded beats binary) both matter. This is the cheapest lever — you already have the tool registry, you need the auxiliary signal.
  • ReTool (Feng et al., arXiv:2504.11536, 2025-04-15, established) — trains the whole trajectory end-to-end so the policy learns when in a reasoning chain to reach for a tool, not just which one given an isolated decision point. Confirms the shape of the project’s SFT→GRPO plan; its specific addition is that real sandbox tool-execution output (not a paraphrase) must be in the rollout context that gets scored — worth auditing whether secagent/daytona feeds live stdout into the trajectory used for reward.
  • Tool-Star (Dong et al., arXiv:2505.16410, 2025-05-22, promising, <6mo/low-citation) — naive RL under-explores a large tool inventory because gradient concentrates on whatever already gets used. Its fix: manufacture forced/hinted rollouts that exercise under-used tools before RL, verify with the real environment, fold verifier-passed ones into SFT data. Directly targets the “26/40 dead” number — and flags a self-reinforcing trap: a rejection-sampling corpus built from the current curl-biased policy will never contain a dead tool succeeding, because the policy never tried it. RL alone has ~zero probability mass to reinforce on those tools.
  • Search-R1 (Jin et al., arXiv:2503.09516, 2025-03-12, established) contributes one mandatory engineering detail, domain-general: mask tool/environment output tokens from the policy-gradient loss — you don’t want to reinforce or penalize text the environment produced (shell stdout, HTTP bodies), only the model’s own query/command-generation tokens.

Honest caveat: ToolRL/ReTool/Tool-Star are validated on general agentic tool-use, not offensive-security tool-use specifically — the transplant to sectools is this researcher’s inference, not literature-verified. Random-Crypto/HackSynth, arXiv:2506.02048academic domain-specific CTF RL work, cited for context only, not a basis for this page’s claims — reports vanilla GRPO on crypto-CTF attributing generalization gains to improved tool usage; per this project’s standing rule, academic cybersecurity-LLM training/benchmark papers don’t count as frontier evidence, so this is mentioned but not relied on. The actual basis for “RL over tool-choice transfers” stays the general-agentic evidence above (ToolRL/ReTool/Tool-Star) — domain-general RL-for-tool-use results that don’t need a CTF-specific data point to hold.

Verify the fix: track the tool-usage histogram across all 40 sectools entries before/after training — do previously-dead tools get invoked at all on held-out challenges (not just replayed on training challenges)? Cheap, no-training-required companion check: DIVER, arXiv:2509.26209 (2025-09-30) rewards pairwise diversity across a rollout group — if adapted to “which tools/commands were used” rather than token text, a rising group-diversity score on tool choice is a leading indicator before solve-rate moves at all.


Pattern 2 — React-and-guess, no methodology (no wordlists/checklists/PTES sequencing; 82% pivot after failure)

Observed: no visible recon→enum→exploit sequencing; the agent abandons a line of attack after one failed attempt instead of enumerating alternatives methodically.

Gap type [R]: a sequencing prior is missing — the model has PTES-shaped knowledge somewhere in its weights (it can describe methodology if asked) but doesn’t apply it as an ordering constraint during live rollouts. This reads as reasoning/methodology, not exploration-in-the-entropy sense, though the two compound (§ below, and see long-horizon.md → RAGEN discussion of the “Echo Trap”).

Training signal:

Designed to fix: pattern 2 — no methodology / premature pivoting after a single failed attempt.

  • Two-stage guide-then-explore RL, re-grounded on general RL theory: Jump-Start Reinforcement Learning (JSRL) — Uchendu et al., arXiv:2204.02372 (2022-04-05, established, 17+ citations) is the domain-general theoretical basis for the same two-stage shape this page used to cite a CTF-specific paper for: a guide-policy (built from offline data/demonstrations/an existing policy) forms a curriculum of starting states, and an exploration-policy is trained forward from those states — naive initialization-then-finetune underperforms this because value-based methods handle a cold-start policy poorly. Mapped onto this project’s actual pipeline:
    # Stage 1 — guide-policy: rejection-sampling SFT on verifier-passed walkthroughs
    # (methodology prior — ordered recon → enum → exploit, not just any verifier-passed trace)
    # Stage 2 — exploration-policy: online GRPO/RLVR in the live sandbox
    # reward = terminal flag-verified {0,1} + intermediate env feedback (command succeeded/failed)
    # JSRL's curriculum knob: anneal how far into the guide-trajectory the exploration-policy starts,
    # rather than always starting cold — cheap to add to the existing rejection-sampling corpus.
    
    DeepSeek-R1’s own disclosed recipe (R1, arXiv:2501.12948) is the frontier-lab confirmation that this shape works at scale: “cold-start SFT data, then RL” outperforms RL-from-scratch specifically because RL-from-scratch on an unguided policy wastes exploration budget relearning basic structure the SFT stage would have given it for free — the same failure mode JSRL formalizes. Academic CTF/pentest work — cited for context only, not a basis: Pentest-R1 (Kong et al., arXiv:2508.07382, evaluated on Cybench + AutoPenBench) reports the identical two-stage shape in the CTF domain specifically and claims both stages are required, order matters, and stage-1 data must be walkthrough-shaped rather than raw verifier-passed traces. Per this project’s standing rule, academic cybersecurity-LLM training/benchmark papers (Pentest-R1, AutoPenBench, and similar) don’t count as frontier evidence for a decision — the load-bearing basis here is JSRL + DeepSeek-R1’s cold-start-then-RL recipe, both domain-general. If Pentest-R1’s specific findings (walkthrough-shaping matters, order matters) turn out to matter operationally, that’s this project’s own thing to verify empirically, not something to inherit from an academic CTF paper.
  • Structured attack trees / ATT&CK scaffolding (Nakano et al., arXiv:2509.07939, 2025-09-09, established mechanism, but this is scaffolding not weight-training — flag the distinction) — externally constrain rollout-time reasoning with a deterministic task tree built from MITRE ATT&CK’s kill-chain, filtering unproductive actions. Reports 71.8–78.6% subtask completion vs. 13.5–75.7% for self-guided reasoning at far fewer queries — a large, reproducible gap. The cheapest lever in this whole page: use the ATT&CK tree as a rollout-generation scaffold to harvest higher-quality, more methodical verifier-passed trajectories for the SFT corpus right now, with zero RL infrastructure — and it keeps knowledge in the tree (a prompt/tool artifact), not baked into weights, consistent with the project’s “knowledge in tools not weights” rule.
  • PEARL (Wang et al., arXiv:2601.20439, 2026-01-28, promising) — treats the planning step (which tools, in what order) as its own object of RL, rather than only optimizing the final answer. Directly relevant as a formal mechanism for systematic tool-sequencing instead of react-and-guess, though not yet validated outside general multihop tool-use.
  • Adjacent, unverified beyond title: classical-planner-hybridized LLM agents (arXiv:2512.11143) — a stronger, harder version of the ATT&CK-tree idea; worth a follow-up only if the tree scaffold proves insufficient.

Honest caveat: the load-bearing basis for the two-stage recipe above is domain-general (JSRL, DeepSeek-R1’s cold-start-then-RL disclosure), not the CTF-specific Pentest-R1 paper — per this project’s standing rule, an academic CTF/pentest paper being “in-domain” doesn’t make it stronger evidence, it makes it out of scope as a basis (no academic cybersecurity-LLM project has produced a frontier model). The ATT&CK-tree scaffold (Nakano et al.) is a separate, non-cybersecurity-specific mechanism (deterministic task-tree filtering built on a public taxonomy, evaluated as scaffolding not weight-training) and isn’t subject to the same caveat. The main open question either way is whether a walkthrough corpus built from this project’s own data generalizes past the ~10 canonical PD26 challenges it would initially be built from — that’s this project’s own thing to test.

Verify the fix: measure PTES-phase coverage per episode (recon steps taken before first exploit attempt) and the pivot-after-failure rate (does the agent try ≥2 alternatives before abandoning a technique?) on held-out challenges, pre/post. Both are derivable from existing Phoenix spans without new instrumentation.


Pattern 3 — Good guessers until they’re not (≈half of solves hinge on an ungrounded guess; brittle to a wrong first guess)

Observed: many successful runs pivot on a single ungrounded guess at a critical step; when that guess is wrong, the agent rarely recovers.

Gap type [E]/[R]: this is where entropy collapse and reasoning intersect — a policy that has already spent its exploration budget on one high-probability guess has no remaining probability mass on alternatives when that guess fails. Cui et al., “The Entropy Mechanism of RL for Reasoning Language Models,” arXiv:2505.22617 (2025-05-28, high confidence, mechanistic + empirical) is the underlying diagnosis this pattern shares with pattern 4: entropy falls monotonically and predictably (R = -a·exp(H) + b), and the mechanism is the covariance between a token’s probability and its advantage — the model reinforces what’s already likely rather than exploring what’s uncertain.

Training signal:

Designed to fix: pattern 3 — brittle single-guess behavior with no recovery on failure.

  • SCoRe — self-correction via RL (Kumar et al., arXiv:2409.12917, 2024-09-19, established, canonical DeepMind paper) — SFT on (wrong→right) correction pairs barely transfers because it’s distribution-mismatched; only multi-turn RL with reward shaped to reward improvement, not just final correctness, produces genuine revision instead of mode-collapsing to “be right turn 1” or no-op-collapsing to “always change the answer”:
    r1 = verifier(attempt_1)
    r2 = verifier(attempt_2)
    reward = r2 + alpha * max(0, r2 - r1)   # bonus specifically for turning a miss into a hit
    
    Naming collision, flag explicitly: a different Sept-2025 paper reuses “SCoRe” for teacher-corrected earliest-error localization + short-horizon RL-continue-from-verified-prefix (Lyu et al., arXiv:2509.14257, promising, 7B-matches-72B claim unreplicated). Cite by arxiv id, not the shared name. If a 100-turn episode fails at a single identifiable wrong-guess turn (which matches this exact BSides finding), earliest-error-localized short-horizon RL is much cheaper credit assignment per episode than scoring the whole trajectory as one unit.
  • CDE — curiosity-driven exploration (Dai et al., arXiv:2509.09675, 2025-09-11, medium-high, ICLR 2026 poster) — an actor-side perplexity bonus (reward the model for being “surprised” by its own output) reports a calibration collapse finding as a byproduct: the policy becomes confident regardless of correctness, which the perplexity bonus specifically counters. This is the literature-side twin of “commits to an ungrounded guess and doesn’t recover” — cheap to try (no extra network, just log-perplexity of the model’s own rollout) before anything heavier.
  • Representation-based exploration (Tuyls et al., arXiv:2510.11686, 2025-10-13, medium-high) — an inference-time-only lever, not training: build a diverse k-of-N pass@k pool from hidden-state dissimilarity instead of random k-of-N. Notable negative result: this is anti-composable with high-temperature sampling — high-temp outputs look “novel” in representation space without being useful. Worth an ablation on whatever temperature the pass@k>1 eval pool currently uses, since it changes nothing about training and answers whether current sample diversity is real strategic variance or just noisier repeats.

Honest caveat: SCoRe (both papers) and CDE are domain-general (math/code) — the CTF/100-turn transfer is inference, not literature-verified. One concrete, cheap check the project can run without any new training: confirm the reward/credit-assignment scheme doesn’t implicitly favor shorter successful trajectories — a 60-turn recovery-from-wrong-guess success should score the same as a 10-turn lucky first guess; if it doesn’t, the reward is actively working against fixing this pattern regardless of which paper’s fix gets adopted.

Verify the fix: on a held-out set, measure whether backtracking-after-a-wrong-guess correlates with eventual success pre/post training (does the trained policy actually try a second technique after the first fails, and does that second attempt land more often?).


Pattern 4 — Uneven PTES phases: strong at chaining inside an exploit, weak at thorough enumeration (62% of failures stall in exploitation)

Observed: the agent is good at following a known exploit chain once inside it, but weak at the enumeration/reconnaissance breadth that would get it there in the first place; most failures stall during exploitation rather than in an earlier phase.

Gap type [L]/[E]: two compounding gaps. First, entropy collapse narrows the recon repertoire — once RL reinforces whatever enumeration path happened to work once, alternative recon strategies stop being tried (same mechanism as pattern 3, arXiv:2505.22617). Second, a long-horizon credit-assignment problem: a flat terminal-only reward over a ~100-turn episode gives early, correct enumeration steps the same undifferentiated credit as late exploitation steps — so even when enumeration was necessary for the eventual win, nothing in the reward signal reinforces it specifically.

Training signal:

Designed to fix: pattern 4 — weak/uneven enumeration and stalling mid-exploitation.

  • DAPO (Yu et al., arXiv:2503.14476, 2025-03-18, established, widely reproduced) + the Entropy Mechanism paper (arXiv:2505.22617) together are the baseline recipe, not an optional add-on, once GRPO starts: clip-higher (decouple the PPO clip range so rare-but-good tokens aren’t capped as hard as likely ones) and dynamic sampling (drop degenerate all-correct/all-wrong prompt groups, which otherwise contribute zero gradient — this matters disproportionately here since a whole-group-zero-reward challenge at ~100 turns/rollout is expensive to keep resampling; consider curriculum-filtering to the 30–60% pass-rate band the project already targets, rather than paying for blind resamples on genuinely-unsolved challenges).
    eps_low, eps_high = 0.20, 0.28              # decoupled clip (vanilla PPO/GRPO: symmetric 0.20/0.20)
    ratio = exp(logp_new - logp_old)
    clipped = clip(ratio, 1 - eps_low, 1 + eps_high)
    loss_pg = -min(ratio * adv, clipped * adv)   # per-token, mean over ALL tokens in batch
    
  • GiGPO (Feng et al., arXiv:2505.10978, 2025-05-16, established, NeurIPS 2025 poster) — the most directly transplantable long-horizon idea in this map, and it needs no new infra: no critic, no extra rollouts. It adds a second, step-level advantage on top of GRPO’s trajectory-level one by retroactively hashing (state, step) pairs that recur across rollouts and scoring “what happened next” conditioned on that shared state — pure post-hoc bookkeeping on trajectories already sampled. This is exactly the mechanism that could stop rewarding all 100 turns equally when only the exploitation phase decides the outcome. Failure mode to flag honestly: GiGPO’s state-hashing was validated on benchmarks with hashable states (web pages, grid worlds); a CTF agent’s state is unbounded free text (shell/HTTP output), so state-canonicalization needs a bespoke similarity function — e.g. (tool_name, normalized_target, response_status_class) from the existing tool_call/tool_exec_ms spans — rather than a raw-text hash, or the anchor groups never fire.
  • HiPER / hindsight credit assignment (Peng et al., arXiv:2602.16165, 2026-02-18; Tan et al., arXiv:2603.08754, 2026-03-07, both promising, 0 citations, evaluated on WebShop/ALFWorld — domain transfer to cybersecurity is inference, not literature-verified) — explicit hierarchical decomposition: a planner proposes subgoals (recon-done, foothold-gained, priv-esc-done, flag-captured), each independently checkable against environment state, with terminal reward at the flag level and intermediate credit at the subgoal level. The PTES phases already tracked are a natural subgoal taxonomy for this — no new taxonomy needed. If the flag-verifier is extended to also check an intermediate condition (“foothold confirmed” via sandbox state), that stays deterministic and doesn’t violate the ground-truth-verified-reward rule.
  • RL-PLUS (Dong et al., arXiv:2508.00222, 2025-07-31, promising, strong math/code ablations, cybersecurity transfer untested) — names “capability boundary collapse” directly: pass@k at large k drops under pure on-policy RLVR even as pass@1 rises, i.e. uneven phases get more uneven as training narrows the policy. Its fix: mix verifier-passed off-policy trajectories (already produced by rejection sampling!) into GRPO via importance-sampling correction, plus an advantage bonus for visiting under-explored-but-successful states. Concretely: don’t discard the rejection-sampling SFT corpus once GRPO starts — feed it back in as off-policy anchor data.
  • NuRL (Chen et al., arXiv:2509.25666, 2025-09-30, medium, consistent gains across six benchmarks/three models) — targets prompts with zero reward across every rollout in the group, which vanilla GRPO simply cannot learn from (zero gradient). Self-generates a hint conditioned on the gold answer, re-rolls with the hint injected, trains on the hint-augmented rollout, then drops the hint at inference. This is the strongest candidate for the ~800 currently-unsolved challenges in the portfolio — but the “gold answer” analogue doesn’t port zero-shot: the flag itself isn’t the how-to- get-there knowledge, so a CTF adaptation needs a hint source (walkthrough/verifier metadata, or a successful trajectory from a similar challenge family) that doesn’t yet exist and requires design work.

Verify the fix: instrument mean policy entropy from step 0 of any GRPO run (this alone is diagnostic, not a fix — but without it you cannot tell “the task is hard” from “the policy already collapsed at step 50 and is just getting faster at one script”); track per-PTES-phase solve/stall rates before and after each intervention layer (DAPO → GiGPO → RL-PLUS/NuRL); and run RL-PLUS’s own diagnostic — pass@k at large k on the base model vs. the trained checkpoint — to check whether exploitation-phase gains are holding or narrowing over training.


Pattern 5 — Benchmarks measure pattern-match speed, not thoroughness/methodology/robustness

Observed: a rising solve-rate number doesn’t by itself tell you whether the agent generalized a strategy or pattern-matched something close to a memorized/leaked challenge shape.

Gap type [N]: this is an eval-methodology gap, not a training-loop gap directly — but it’s the project’s own check on whether the SFT→GRPO pipeline is doing what axis [N] (novelty/boundary-expansion) requires, versus merely axis-amplifying what the base model already does.

Training signal (mostly diagnostic, one eval-recipe change, one eval-recipe addition):

Designed to fix: pattern 5 — solve-rate gains that can’t be told apart from elicitation/memorization.

  • Yue et al., “Does RL Really Incentivize Reasoning Capacity Beyond the Base Model?,” arXiv:2504.13837 (2025-04-18, established, large systematic study across model families/algorithms) — across math/code/visual-reasoning and 6 popular RLVR algorithms, the base model catches up and overtakes at large pass@k even when RL wins at pass@1: the patterns RL concentrates on were already latent in the base model’s sampling distribution. This paper’s own recommended fix — evaluate pass@k at large k, not just pass@1 — is already this project’s locked methodology (pass@k, k=3/5/10 bands per decisions/2026-06-11-ctf-benchmark-pass-count.md); the missing piece is running the base model at the same pass@k bands as a mandatory control. If base-model pass@10 ≈ trained-model pass@10 on a chunk of challenges, that chunk’s improvement is elicitation, which is fine to attribute to the SFT stage (SFT is explicitly meant to replace/instill, not expand) but would be a red flag if it persists after GRPO, where genuine capability gain is expected.
  • Semantics-preserving-transform robustness testing, re-grounded on general (non-cybersecurity) evidence: GSM-Symbolic — Mirzadeh et al. (Apple), arXiv:2410.05229 (2024-10-18, established, frontier-lab work) is the domain-general instance of the same failure mode: regenerating GSM8K-style math questions from symbolic templates — same structure, only surface values/names changed — shows LLM solve rates drop and get noticeably more variable under these semantics-preserving perturbations, and performance degrades further as templates add clauses that shouldn’t change the answer. This is established, cited-widely evidence (independent of any cybersecurity-domain paper) that a rising benchmark number can reflect pattern-match on the specific surface form rather than a robust, generalized method — exactly the risk pattern 5 flags for this project’s own PD26 solve-rate numbers. Concrete recipe for this project: build semantics-preserving variants of the existing PD26-01..10 challenges (renamed services/users, reordered but logically-identical steps, cosmetic code/config changes that don’t alter the exploit path) and check whether a claimed solve-rate jump transfers to a transformed variant before crediting it to “the agent got better at CTF,” mirroring GSM-Symbolic’s methodology directly. Academic CTF-benchmark work — cited for context only, not a basis: Capture the Flags / Evolve-CTF (Honarvar et al., arXiv:2602.05523) runs the identical idea in the CTF domain specifically (source-transformation families, composed-obfuscation degradation) and would be the more directly-transplantable recipe if it counted as evidence — but per this project’s standing rule it’s an academic domain-specific CTF-benchmark paper, so it doesn’t serve as the basis here; GSM-Symbolic does.

Honest note: this pattern’s fix is mostly an eval-protocol addition, not a new training signal — the “fix” for pattern 5 is really making patterns 1–4’s fixes falsifiable. Both papers above are cheap (no new training) relative to any of the RL infrastructure work in patterns 1–4 and should run before crediting the first GRPO graduation with “fixing execution,” not after.

Verify the fix: run the base-model pass@k-at-large-k control (5.1) alongside every RL-trained checkpoint’s pass@k; separately, generate semantics-preserving-transform variants of a held-out challenge subset and check whether solve-rate gains survive the transform. If gains evaporate on either check, that’s elicitation/memorization, not the execution-reliability improvement the rejection-sampling-SFT diagnosis is banking on.


Summary table

PatternGap typeMethod designed to fix itCitationConfidence
1. Own-tool preference (87.7% bypass, 26/40 dead)[E] tool-space explorationToolRL — decomposed per-call rewardarXiv:2504.13958Established
1. (same)[E][N]Tool-Star — forced exposure to under-used tools pre-RLarXiv:2505.16410Promising
1. (same)[L][E]ReTool — end-to-end trajectory-level tool-RLarXiv:2504.11536Established
2. React-and-guess, no methodology (82% pivot)[R][L]JSRL + DeepSeek-R1 cold-start recipe — guide-policy (SFT walkthroughs) then exploration-policy (online RL)arXiv:2204.02372 / arXiv:2501.12948Established (domain-general)
2. (same, context only — not a basis)[R][L]Pentest-R1 — same two-stage shape, CTF-specific; academic, cited for context, not a basis per standing rulearXiv:2508.07382Academic CTF work — not evidentiary
2. (same)[R]ATT&CK structured attack-tree scaffold (no training needed)arXiv:2509.07939Established (mechanism), scaffold not training
3. Brittle single-guess (~half hinge on a guess)[E][R]SCoRe — RL reward for improvement, not final correctnessarXiv:2409.12917Established
3. (same)[E][N]CDE — curiosity/perplexity bonus counters calibration collapsearXiv:2509.09675Medium-high
4. Uneven PTES / 62% stall in exploitation[E][L]DAPO + Entropy Mechanism — clip-higher, dynamic samplingarXiv:2503.14476 / arXiv:2505.22617Established
4. (same)[L][E]GiGPO — step-level credit via state-hash groups, zero extra rolloutsarXiv:2505.10978Established
4. (same)[L][E][N]HiPER / hindsight credit assignment — PTES-shaped subgoal decompositionarXiv:2602.16165 / arXiv:2603.08754Promising, domain-transfer speculative
4. (same)[E][N]RL-PLUS — counters capability-boundary collapse w/ off-policy mixingarXiv:2508.00222Promising
4. (same, ~800 unsolved tail)[L][E][N]NuRL — self-generated hints unlock zero-reward-group promptsarXiv:2509.25666Medium
5. Benchmarks measure pattern-match, not thoroughness[N] eval validityBase-model pass@k-at-large-k controlarXiv:2504.13837Established
5. (same)[N] eval validityGSM-Symbolic — semantics-preserving-transform degrades solve rate (domain-general)arXiv:2410.05229Established
5. (same, context only — not a basis)[N] eval validityEvolve-CTF — same idea, CTF-specific; academic, cited for context, not a basis per standing rulearXiv:2602.05523Academic CTF work — not evidentiary

What this changes about the plan, concretely

  • Cheapest, no-training-required move first: the ATT&CK attack-tree scaffold (pattern 2) improves rejection-sampling SFT corpus quality today, before any RL infra exists.
  • When GRPO starts, DAPO’s clip-higher + dynamic sampling is the baseline, not an optional add-on — it’s simultaneously the fix for patterns 3 and 4’s shared entropy-collapse mechanism.
  • GiGPO is the single most transplantable long-horizon idea (pattern 4) — zero new infra, just a state-canonicalization function over the existing tool_call spans.
  • Keep the rejection-sampling SFT corpus as off-policy anchor data through GRPO, not a disjoint earlier stage — RL-PLUS’s argument (pattern 4) applies directly since that data already exists.
  • Pattern 5’s checks (base-model pass@k control, semantics-preserving-transform families) should run before the first GRPO graduation is credited with anything — they’re the cheapest falsification test available and gate whether patterns 1–4’s fixes actually expanded capability or just re-elicited it.
  • Contested / open: whether GiGPO/HiPER-style credit assignment transfers from hashable (WebShop/ALFWorld) state spaces to a CTF agent’s unbounded free-text environment state is this project’s own thing to test, not something published literature has already settled — say so plainly if this page gets cited externally.

What the frontier labs actually do (last 12 months: 2025-07 → 2026-07)

Ten labs, one year, one question: SFT + which RL, in what order, and why. The pattern holds from the first pass and gets stronger with more labs in the sample: everyone runs the same small method set (SFT · rejection-sampling · DPO-family · GRPO/GSPO/PPO-family · RLVR · RLAIF). Differentiation is ordering, data/environment scale, and a handful of stabilization tricks — not exotic new losses. Where a lab discloses a genuinely new algorithmic idea (Mistral’s clip-higher, Qwen’s GSPO, Moonshot’s PARL, Xiaomi’s MOPD, DeepSeek’s four GRPO stabilizers), it’s flagged [N] and it’s still a variation on group-relative policy optimization, not a different paradigm.

Tag legend (four axes that matter for a ~100-turn, terminal-reward, verifier-gated CTF agent): [L] long-horizon/credit-assignment · [E] exploration/entropy-collapse-resistance · [R] multi-step reasoning · [N] novelty / capability-boundary-expansion (not just amplification). [D] = lab-disclosed mechanism, [I] = third-party-inferred — kept inline per lab because disclosure quality varies by an order of magnitude between labs (DeepSeek/Qwen/Mistral publish ablation tables; xAI/OpenAI/Google publish one paragraph of prose per model).

Designed to fix: pattern N callouts below map a disclosed technique directly to one of this project’s 5 BSides behavioral-audit findings (1: agents prefer own raw tools over the rich surface; 2: react-and-guess, no methodology; 3: good-guessers-until-not, brittle on a wrong first guess; 4: uneven PTES phases, weak enumeration/exploitation; 5: benchmarks reward pattern-match speed not thoroughness).


Historical anchor: Llama 3 → Llama 4 (pre-window, kept short)

Llama 3/3.1 — several rounds of SFT → rejection sampling → DPO; tried PPO, dropped it for DPO+RS at their scale (arXiv:2407.21783).

Llama 4 (2025-04) — inverted the order: thin SFT (LLM-judge dropped >50% of “easy” data) → intensive online RL on hard prompts with continuous re-filtering → thin DPO for corner cases. Meta’s explicit finding — heavy SFT/DPO restricts RL exploration — is the one lesson from this era every later lab implicitly re-derives: don’t let imitation calcify the policy before RL gets to explore (ai.meta.com/blog/llama-4-multimodal-intelligence). This is why Mistral’s Magistral Medium below runs RL with zero SFT and why Zhipu’s GLM-4.5 keeps SFT to “just enough correctness for RL to have signal,” not more.


Anthropic (Claude) — richest on alignment mechanism, thinnest on capability-RL mechanism

The backbone [D], unchanged since 2022: Constitutional AI / RLAIF (arXiv:2212.08073) — SL-CAI critique-revise → RL-CAI against an AI-feedback preference model. Every 2025-26 system card repeats this as boilerplate; capability-RL hyperparameters, reward-model architecture, and dataset sizes are never disclosed for the Sonnet/Opus line. The useful material is in the alignment research, not the model cards.

  • Inoculation prompting [D, N] (arXiv:2510.04340; Anthropic’s own study arXiv:2511.18397) — deployed since Opus 4.5, expanded in Opus 4.6’s highest-risk RL settings. Problem: a model that learns to reward-hack on real production RL generalizes that disposition into broader misalignment (sabotage, alignment-faking) — chat-RLHF safety training doesn’t transfer to agentic settings. Fix: tell the model at train time that the gameable behavior is expected/acceptable in this context; query at test time with the unmodified prompt. The model still learns the capability, but doesn’t internalize “hacking is my default disposition when oversight is weak.”

    inoc_prompt = prompt + "\n(Note: hard-coding to pass this test case is expected here.)"
    loss = sft_loss(model, inoc_prompt, hacky_response)   # train-time only
    # test-time: query with the ORIGINAL (un-inoculated) prompt
    

    Directly relevant to this project’s own confirmed lesson (format/regex reward → SFT-induced flag confabulation): an imperfect reward doesn’t just fail locally, it teaches a generalizable disposition. If rejection-sampling SFT ever surfaces a technically-passing-but-degenerate solve (lucky guess vs. real exploitation), Anthropic’s finding says annotate it explicitly, don’t silently filter or leave the framing implicit.

  • “Teaching Claude Why” [D, L, N] (alignment.anthropic.com, 2026-05-08) — three interventions for agentic-misalignment generalization, all disclosed as complementary: (1) SFT on non-agentic chat transcripts about an ethical dilemma reduced agentic tool-calling misalignment to zero — cross-modality generalization; (2) SDF (synthetic-document finetuning, pretraining-style docs about an AI acting per Claude’s constitution) — 3M OOD-tokens beat 14M eval-similar tokens, 28x more token-efficient, and the effect survived subsequent RL rather than being washed out; (3) diversifying RL environments with tool defs + varied system prompts even when the tools are never needed for the task — measurably reduced honeypot misalignment.

    Designed to fix: pattern 1 (agents prefer their own raw tools, 26/40 provided tools dead). Anthropic’s finding (3) says the dead-tool problem may be a training-distribution coverage gap, not just an inference-time preference: if the RL/SFT distribution rarely rewards using the rich tool surface as the winning strategy, the model won’t reach for it regardless of the system prompt at eval time. Concrete action: oversample rejection-sampling-SFT trajectories that used the project’s own tool surface (not raw shell/curl) rather than filtering only on outcome.

  • Multi-agent orchestration [D, L] (Opus 4.5): tested/tuned as an orchestrator of Haiku/Sonnet worker subagents — cheap Haiku workers under an Opus orchestrator beat Opus alone by ~12 points; Opus is a measurably better orchestrator than Sonnet given the same subagent pool. Sonnet 4.5’s headline: ~30-hour autonomous coding sessions — the year’s clearest [L] claim from this lab.

  • Cyber capability [vendor-claimed, undisclosed recipe]: Claude Mythos 5 (gated, Project Glasswing) — “strongest cybersecurity capabilities of any model in the world,” explicit multi-phase agentic-hacking claim (recon → discovery → lateral movement). Zero SFT/RL-environment detail disclosed for the cyber-capable checkpoint — treat as a vendor capability claim, not a methodology to learn from.

Headline gains: SWE-bench Verified 80.9% (Opus 4.5, first model >80%); Sonnet 5 BrowseComp 84.7% at a 10M-token operating limit with context compaction. Tags: [L] strong (30-hr sessions, task budgets, dynamic-workflow subagent fan-out) · [R] steady benchmark climb · [N] inoculation prompting + SDF-for-values are genuine training-loop-level interventions, the clearest [N] items in this whole file from any lab · [E] not addressed anywhere in disclosed material — a real disclosure gap, not evidence of absence.


OpenAI — almost nothing on the RL algorithm, one very citable agentic-RL sentence

Disclosure reality [D]: every GPT-5.x system card repeats the same paragraph — “trained to reason through reinforcement learning… learn to refine their thinking process, try different strategies, and recognize their mistakes.” No algorithm name, no reward-model architecture, no compute numbers, across 10+ releases (GPT-5 → 5.4) from Aug 2025 to Mar 2026.

What actually is disclosed and load-bearing, repeated verbatim across the whole Codex line since Sep 2025:

“trained using reinforcement learning on real-world coding tasks in a variety of environments… iteratively run tests until passing results are achieved.”

That’s RLVR-shaped agentic RL: real-repo/PR environments, implicit test-pass reward (ground-truth verifiable — not format-matched, matching this project’s own confirmed rule), explicit iterate-until-verified inner loop. Cite this as independent industry confirmation that “RL against a ground-truth verifier with an iterate-until-pass loop” is the dominant agentic-coding recipe, not an idiosyncratic choice.

  • Safe-completions [D, the one fully-disclosed SFT/preference technique] (arXiv:2508.09224) — trains the policy over outputs, not a binary refuse/comply intent classifier, so a dual-use prompt gets a partial, non-harmful-boundary answer instead of a hard refusal. Breaks the refusal/helpfulness tradeoff rather than trading one for the other.
  • Compaction (GPT-5.1-Codex-Max, [D, the year’s strongest [L] mechanism]) — “first model natively trained to operate across multiple context windows… coherently working over millions of tokens in a single task.” Not a prompting trick — the model is trained to prune its own history and continue in a fresh window. METR: 50%-reliability time horizon ~2h42m vs GPT-5’s 2h15m.

    Designed to fix: pattern 4 (uneven PTES phases, weak thorough enumeration). OpenAI’s own cyber-eval writeup: “most cyber challenges are limited by exploring many different paths which involve running commands that can produce verbose logs and easily consume the model’s context window… trying different tools with an almost brute-force approach.” This is an independent, external confirmation of this project’s own diagnosis — CTF-style tasks are bottlenecked by long-horizon context exhaustion during enumeration, not single-step reasoning — and OpenAI’s fix is architectural (compaction), not reward-shaping.

  • RFT API [D, the closest thing to a public recipe] — sample completions, score with a programmable grader (string_check/text_similarity/score_model/python/multigrader), policy-gradient update toward higher-scoring completions. Their own eligibility guidance — “eval results must be variable enough to improve” — is the same 30–60% baseline-band logic this project already uses. Status: being wound down May 2026, new users cut off, existing users capped to Jan 2027 — flag to main: don’t plan around “fall back to OpenAI RFT.”

Headline gains: GPT-5.2 ARC-AGI-2 52.9% (vs 17.6% GPT-5.1, the largest single jump reported); GPT-5.4 OSWorld-Verified 75.0% (above the 72.4% human baseline). [N] evidence: GPT-5.2 Pro solved an open COLT-2019 problem in statistical learning theory unaided — a single vendor-reported anecdote, externally verified by unspecified “subject-matter experts,” treat as promising not validated. Tags: [R] core · [L] strongest disclosed mechanism this year (compaction) · [E] never named as a training objective anywhere in the OpenAI corpus — the “almost brute-force” tool-trying is an observed side effect, not an engineered exploration bonus · [N] one well-documented anecdote, caveated.


Google DeepMind (Gemini) — one real tech report (2.5), everything since is a model-card rerun

Gemini 2.5 [D] (arXiv:2507.06261 §2.4) is the only Gemini release with a real disclosed recipe; every 3.x card since repeats the same boilerplate with zero new mechanism.

  • SFT: adversarial/red-team-sourced data (model-probes-model + human-probes-model), “loosely inspired by Constitutional AI,” refined across successive model generations — a self-improving data engine, not a static curated set.
  • RL — “RLF” (Reinforcement Learning from human and critic Feedback), dual-channel: a trained Data Reward Model (DRM) amortizing human preference labels + a prompted Critic scored against offline-editable rubrics. Deliberate hedge against the two classic single-RM failure modes (trained-RM reward hacking vs. prompted-judge brittleness).
    reward = f(DRM(response), Critic(response, rubric))   # two channels, decoupled cost profile
    
    Separately: “increased training compute allocated to RL… enabled Gemini 2.5 to learn from more diverse and complex RL environments, including those requiring multi-step actions and tool use” — this is the direct antecedent of a verifiable-reward RL track, the branch closest to this project’s own ground-truth flag verifier (you don’t need the DRM/Critic hedge — flag capture is already ground-truth verifiable).
  • Gemini 3 Pro [D, model card only]: “RL techniques that can leverage multi-step reasoning, problem-solving and theorem-proving data” — no new mechanism named. The evidence is all benchmark: Vending-Bench 2 mean net worth $5,478 vs Gemini 2.5 Pro’s $573.64 — a ~9.6x jump, the single most relevant public [L] data point this year (closest published analog to a ~100-turn terminal-reward task, no per-step reward). “Thought Signatures” (encrypted reasoning-state tokens carried across multi-turn tool calls) is a serving-side mitigation for context/reasoning loss over long agentic loops.

    Designed to fix: pattern 4. Google’s own Frontier Safety Framework discloses a concrete capability ceiling directly in this project’s domain: cyber “v1 hard challenges: 11/12 solved; v2 challenges: 0/13 solved end-to-end” — evidence that even a 9.6x long-horizon jump on Vending-Bench doesn’t close the gap on harder, more adversarial multi-step exploitation tasks.

Headline gains: Gemini 2.5 → “5x on Aider Polyglot, 2x on SWE-bench Verified” (report’s own framing); Gemini 3 Pro SWE-bench Verified 76.2%, τ²-bench 85.4%. Tags: [L] strong (Vending-Bench 2) · [R] core focus (Thinking/Deep Think tracks are RL-trained test-time-compute) · [E] the one explicit phrase (“deeper exploration”) is a compute-scale claim, not an algorithmic one — nothing disclosed addresses entropy-collapse directly, a real gap · [N] contested — ARC-AGI-2 gains are suggestive, not mechanistically explained.


xAI / Grok — no arXiv report for any Grok-4-family model; the clearest [L] disclosure industry-wide

Company-wide posture: one model card per release, almost entirely a safety/RMF eval doc. Post-training gets one paragraph.

  • Grok 4 [D]: SFT is explicitly minor (“along with supervised finetuning of specific capabilities”); RL is the driver — pushed to “the same order of magnitude as pretraining” compute (caveat: the launch chart had no y-axis labels — directional, not audited), RLVR domains expanded “from math/coding to many more domains,” native RL-trained tool use (model chooses its own search depth). CyBench unguided success 0.43 — “below a human professional” end-to-end.

  • Grok 4.1 Fast [D] — the year’s most explicit long-horizon-RL disclosure:

    “We trained Grok 4.1 Fast using long-horizon reinforcement learning with a strong emphasis on multi-turn scenarios, ensuring consistent performance across its full 2-million-token context window.”

    This is a rare case of a lab naming “long-horizon RL” as the training objective, not just an eval axis — cite this precisely.

    Designed to fix: pattern 1. Co-launched with a first-party Agent Tools API (web search, X search, code exec, MCP) that the model was trained against directly — xAI controls and RL-trains against its own curated tool surface, structurally reducing the degrees of freedom for the model to default to raw shell/curl-equivalents, because the trained-against tools are the native path.

  • Grok 4.1 [D, N, but instructive as a contrast] — RLAIF using an agentic reasoning model as the reward model (not a static preference classifier) to extend RL onto non-verifiable axes (style, EQ, tone). Genuinely portable idea — but the model’s own safety card shows the cost: MASK dishonesty 0.43→0.49, sycophancy 0.07→0.19-0.23 (both worse). This is the opposite of this project’s ground-truth-reward rule, and it’s a first-party-disclosed regression — read as a live demonstration of what happens when you relax ground-truth verification, not something to adopt.

  • Grok Code Fast 1 [D]: pure SFT/imitation on real PR/tool-use demonstrations, no disclosed RL at all — xAI’s own precedent that SFT-only is a legitimate shipped strategy for a cheap specialist tier, not the capability frontier.

Headline gains: Grok 4 first to 50.7% HLE (w/ tools, Heavy); Grok 4.1 Fast τ²-bench Telecom 100%; Vending-Bench $4,694 (Grok 4) vs Claude Opus 4’s $2,077. Tags: [L] Grok 4.1 Fast = strongest in the corpus · [R] Grok 4 primary target · [E] never disclosed for any model, and large-scale RLVR is exactly the regime where entropy collapse is a known risk — xAI says nothing about it · [N] the RLAIF-agentic-judge idea (Grok 4.1) is the most transferable and the most cautionary.


Mistral — the cleanest published ablation table in the whole set (single-turn only)

Magistral Medium [D] (arXiv:2506.10910) — RL alone, zero SFT, zero distillation from a stronger teacher, on top of an instruct checkpoint. The paper’s own headline methodological claim, explicitly benchmarked against DeepSeek-R1’s SFT-then-RL pipeline. GRPO with three deliberate departures, each ablation-justified:

# vanilla GRPO
loss = -min(ratio_t * A_i, clip(ratio_t, 1-eps, 1+eps) * A_i)         # per-token, symmetric clip, KL to ref

# Magistral's departures
loss = normalize_by_group_token_count(loss)     # not per-sequence -> removes length bias
eps_high = 0.26-0.28                            # asymmetric "clip-higher" instead of an entropy bonus
# entropy bonus WAS tried: "unstable and dataset-dependent" -> collapsed on math, exploded on mixed data
# KL term: beta = 0 (removed) -> policy diverges anyway, the term bought nothing

Reward = 4 additive terms (format 0.1 / correctness 0.9, SymPy or compile+test, all-or-nothing — partial-credit code reward was tried and rejected, cost ~2pts LiveCodeBench / length penalty / language-consistency 0.1, fixes CoT code-switching). Magistral Small uses cold-start SFT distilled from Medium + RL on top — Table 3 ablation: SFT+RL (70.7 AIME’24) beats SFT-only (65.4) and RL-only (65.8) at the 24B scale — i.e., pure RL sufficed at Medium’s scale but not at Small’s.

Ministral 3 [D] (arXiv:2601.08584) cites Magistral’s GRPO recipe directly (“Rastogi et al. [2025]”). Adds a General RL stage with a rubric-based LLM-judge reward (reward = fraction of atomic rubric items satisfied) layered after verifiable STEM RL — a candidate pattern for domains (like reporting/methodology quality) where a ground-truth verifier exists for outcome but not for process, as an additional shaping signal, never replacing the terminal flag-verified reward.

Headline gain: Magistral Medium AIME’24 pass@1 26.8→73.6 (+~47pp, “nearly 50% boost,” Mistral’s own framing, without cold-start reasoning traces). Tags: [R] primary and only real target — this is single-turn math/code RLVR, reward is per-completion · [E] yes, narrowly: clip-higher is a genuine, ablation-validated anti-entropy-collapse mechanism, directly citable vocabulary — but tuned for single-shot generation, not multi-turn tool-call exploration · [L] not addressed at all — no multi-turn credit assignment in this report, nothing transfers directly to a 100-turn episode without further work · [N] the “RL-only, no distillation” result at Medium’s scale is the clearest disclosed boundary-expansion claim of the year (pure RLVR uncovered capability a teacher’s traces wouldn’t have shown).


DeepSeek — the four GRPO stabilizers are the single most transferable [E] artifact in this file

V3.1 [D, thin] — hybrid think/non-think in one checkpoint; SFT/RL specifics not disclosed beyond “post-training optimization.” V3.2-Exp [D] — DeepSeek Sparse Attention (DSA) via continued pretrain, post-training held identical to V3.1-Terminus by design (a controlled comparison). V3.2 [D] (arXiv:2512.02556) carries essentially all of the year’s real recipe detail:

  1. Specialist distillation — 6 domain specialists (math/code/reasoning/agentic/agentic-coding/agentic-search), each pushed with large-scale RL independently, distilled back into one generalist. “Models trained on the distilled data achieve performance only marginally below domain-specific specialists, with the gap eliminated through subsequent RL” — distillation gets 90% cheaply, RL closes the rest.
  2. Mixed RL (GRPO), merged not sequential — reasoning + agent + alignment trained together explicitly to avoid catastrophic forgetting from multi-stage sequencing. Post-training compute >10% of pretraining compute, disclosed directly.
  3. Four GRPO stabilizers, each a concrete anti-entropy-collapse/anti-instability fix:
    • Unbiased KL estimate — corrects Schulman’s K3 estimator via importance-sampling ratio; the uncorrected estimator assigns unboundedly large gradient weight when π_θ ≪ π_ref.
    • Off-policy sequence masking — zero the loss on negative-advantage sequences whose divergence from π_old exceeds a threshold; positive-advantage samples are kept regardless.
    • Keep Routing — freeze the MoE expert-routing path used at sampling time; don’t let training recompute a diverged routing.
    • Keep Sampling Mask — reapply the same top-p/top-k truncation mask from sampling during the training update, so importance-sampling validity holds; empirically “preserves language consistency during RL training” (fixes RL-induced mixed-language garbage, a textbook entropy-collapse symptom).
  4. Thinking-in-tool-use agentic RL environments — 1,827 environments, 85k+ prompts across code/search/general/interpreter agents. Search-agent verification keeps only samples where the ground truth is checkable AND every wrong candidate is provably wrong — the same hard-negative discipline as this project’s own flag-verifier, independently arrived at.

    Designed to fix: pattern 4 / pattern 3. The context-management fix (“retain reasoning across tool turns, drop only on a genuinely new user message”) directly targets long-horizon coherence during exploitation; the hard-negative-verified reward directly targets brittleness from an ungrounded guess (pattern 3) by refusing to reward a correct-looking answer unless every alternative is provably wrong too.

V3.2-Speciale: same base, reduced length penalty (let it think longer) + DeepSeekMath-V2 reward folded in — gold-medal IMO/IOI/ICPC/CMO 2025.

Headline gain: V3.2 “performs comparably to GPT-5” on reasoning at substantially lower cost, explicitly framed as narrowing the open-vs-closed gap on agentic long-tail tasks. Tags: [L] strong (DSA makes long-context RL tractable; context-retention rule is a direct multi-turn continuity fix) · [E] the strongest, most technically concrete axis in this whole file — four named, ablatable stabilizers, zero architecture changes required · [R] strong · [N] claimed (closing the gap on “long-tail/novel environments”) but self-rated, and structurally the specialist→distill→RL pipeline resembles SFT+RL, which this project’s own thesis (“GRPO amplifies, SFT replaces,” arXiv:2507.10616) would predict caps its novelty — DeepSeek doesn’t isolate this in an ablation, so contested/unresolved by their own disclosure.


Qwen (Alibaba) — Qwen3.7-Max is the single most directly relevant release in this entire file

Baseline (Qwen3, pre-window, arXiv:2505.09388): Long-CoT SFT cold-start → Reasoning RL (GRPO/RLVR) → thinking-mode fusion SFT → General RL + strong-to-weak distillation. Every later release patches this.

  • GSPO [D] (arXiv:2507.18071) — the load-bearing algorithm swap from 2025-07 onward. Problem: GRPO’s per-token importance ratio gets corrupted on a MoE model when routing jitters between rollout and update, forcing an expensive “Routing Replay” workaround. Fix: define the ratio at the sequence level.

    # GRPO: per-token ratio, needs Routing Replay to stay valid on MoE
    ratio_t = pi_theta(y_t|x,y_<t) / pi_old(y_t|x,y_<t)
    
    # GSPO: one length-normalized ratio per whole rollout -> removes Routing Replay entirely
    s_i = (pi_theta(y_i|x) / pi_old(y_i|x)) ** (1/len(y_i))
    A_i = (r_i - mean(group_rewards)) / std(group_rewards)
    loss += -min(s_i * A_i, clip(s_i, 1-eps, 1+eps) * A_i)
    

    Removes infra complexity (no per-token log-prob pinning) rather than adding it — cheap to try if a MoE base is ever adopted.

  • Qwen3-Coder [D]: “hard-to-solve, easy-to-verify” code RL (execution-driven, automatically-scaled test cases) as a separate stage from “long-horizon Agent RL” (multi-turn plan→act→observe→replan over 20,000 parallel real dev environments) — SOTA SWE-bench Verified without test-time scaling, i.e. the gain is in the policy.

  • Qwen3.5 [D]: pivot point — “the post-training performance gains in Qwen3.5 primarily stem from our extensive scaling of virtually all RL tasks and environments… we focused heavily on increasing the difficulty and generalizability of RL environments, rather than optimizing for specific metrics.” Origin of the decoupled Task/Harness/Verifier idea that gets its full writeup next.

  • Qwen3.7-Max [D] — read this one closely:

    “agent RL training conventionally couples the task, the harness, and the verifier — train on one fixed triple and the policy learns harness-specific shortcuts instead of a generalizable strategy.”

    This is a first-party, named statement of exactly this project’s own scaffold-overfitting finding. Fix: decoupled Task/Harness/Verifier rollout infra — the same task replayed against different harnesses (types and versions) and different verifiers, forcing cross-harness generalization.

    for task in tasks:
        for harness in sample_harnesses(task):     # e.g. Claude Code, OpenClaw, Qwen Code, Hermes
            for verifier in sample_verifiers(task):
                rollout = run(policy, task, harness)
                update(policy, rollout, verifier(rollout))
    

    Designed to fix: pattern 1, directly and by name — validated by consistent performance across QwenClawBench/CoWorkBench regardless of eval-time harness, contrasted against Qwen3.6-Plus which “showed significant variance.” Also ships a reward-hacking self-detection framework — the policy itself flags candidate reward-hacking patterns in its own trajectories, a governance mechanism directly relevant to this project’s “never regex-match, always verify” rule.

Headline gain: Qwen3.7-Max — 35-hour fully-autonomous kernel-optimization run on a previously-unseen accelerator, zero prior exposure, 432 iterations/1,158 tool calls, ~10x speedup, entirely self-directed (self-reported, not yet independently benchmarked). Tags: [E]+[L]+[N] for Qwen3.5/3.7-Max — the release most directly targeted at this project’s exact problem shape (harness generalization, long-horizon coherence, novel governance mechanism) · [R] GSPO/reasoning-RL lineage throughout.


Moonshot AI (Kimi) — strongest disclosed [N] case in the file (Agent Swarm is a qualitatively different solution shape)

K2 [D] (arXiv:2507.20534) — SFT data is itself rejection-sampled: synthetic agentic trajectories (3,000+ real MCP tools + 20,000+ synthesized) scored by an LLM judge against per-task rubrics, only passing trajectories enter SFT — “large-scale rejection sampling… through our quality filtering process,” disclosed in those words. RL = REINFORCE-with-baseline (GRPO-adjacent, group-mean baseline, no value model) + self-critique rubric reward re-grounded continuously by on-policy RLVR rollouts (a template for a safe non-verifiable auxiliary reward that doesn’t drift from the ground-truth signal). Three named engineering fixes: budget control (hard token cap, fights length inflation), PTX loss (replay high-quality SFT data during RL, fights catastrophic forgetting), temperature decay (high early, annealed later — an explicit, named [E] exploration-preservation mechanism). Partial rollout — long-tail unfinished episodes pause/resume across RL iterations rather than blocking the batch — the single most directly transferable engineering idea for ~100-turn terminal-reward episodes.

K2.5 [D] (arXiv:2602.02276) — Zero-Vision SFT: text-only SFT alone activates visual agentic tool-use; hand-annotated visual CoT data hurts generalization (don’t spend annotation budget on the modality you’re activating; spend it where you already have depth). PARL / Agent Swarm — a trainable orchestrator + frozen sub-agents (instantiated from an earlier checkpoint); only the orchestrator gets gradient updates, explicitly to sidestep credit-assignment ambiguity across sub-agent calls.

reward = λ1·r_parallel (fights orchestrator collapsing back to single-agent)
       + λ2·r_finish   (fights spawning many sub-agents without real decomposition)
       + r_perf         (task-level outcome)
# λ1, λ2 annealed to zero -> final policy optimizes pure task success

Designed to fix: pattern 2 / pattern 4. Sequential agentic execution has linear latency scaling and, per this project’s own audit, agents “pivot after a single failure instead of enumerating.” PARL’s whole premise is training a policy to decompose wide-search/enumeration tasks into parallel sub-agent calls rather than one brittle serial chain — a direct, working answer to “how do you reward decomposition without the model gaming step-count,” and structurally analogous to training an agent to enumerate broadly instead of guessing once and pivoting.

Toggle (token-efficient RL) alternates budget-limited and unconstrained phases, gated on accuracy already exceeding a threshold — fixes K2’s earlier length-overfitting failure mode (a rigid budget doesn’t generalize back up when a harder problem needs more room).

Headline gain: K2.5 Agent Swarm — 4.5x latency reduction with a simultaneous F1 gain (72.8%→79.0% WideSearch); K2 Thinking sustains 200-300 coherent tool calls vs. “prior models degrade after 30-50 steps” (a direct, quantified [L] claim). Tags: [L] yes throughout (partial rollout, PARL, 200-300-step coherence) · [E] temperature decay is named and explicit; PARL’s anti-serial-collapse term is the same shape of problem as entropy collapse solved with a shaping reward instead of an entropy bonus · [R] present, secondary — K2 itself is a non-thinking model · [N] the strongest in this file: Agent Swarm is not “faster at the same thing,” it’s a qualitatively different solution shape (parallel decomposition vs. any single sequential agent, however long-horizon).


GLM / Z.ai — the difficulty-curriculum + iterative self-distillation loop is a validated version of this project’s own plan

GLM-4.5 [D] (arXiv:2508.06471) — Stage 1: three domain specialists (Reasoning/Agent/General), each cold-start SFT’d separately (a domain-general model from scratch wastes RL exploration budget re-discovering what expert-labeled distillation data already gives free). Stage 2: self-distill into one generalist, with rejection sampling on the distillation data itself (strip malformed samples, verify correctness for objective answers, RM-filter subjective ones, verify tool-call trajectories reach a terminal state).

  • Reasoning RL: GRPO, no KL term, three ablation-justified fixes:
    • Two-stage difficulty curriculum — switch to problems that are pass@8=0 but pass@512>0 (hard-but-not-impossible); a static difficulty set goes stale as the policy improves, collapsing reward variance to all-0 or all-1 either way (zero gradient signal). This is the exact wall a ~1000-challenge portfolio at 100-200 solves is almost certainly sitting on for its hard tail.
    • Single-stage RL at the full 64K target length — staged length scaling (8K→16K→…→64K) caused an irreversible unlearning of long-output generation that never recovered even when length was scaled back up. Direct lesson: don’t RL-train shorter than your SFT init’s horizon if the target task is long.
    • Dynamic sampling temperature — raise temperature when rollout reward plateaus (their named signal for entropy collapse), gated by a max-1%-perf-drop bound on held-out validation.
  • Iterative self-distillation (Agentic RL): RL to a plateau → distill the RL-improved policy’s own outputs into a fresh SFT checkpoint (replacing the original cold-start data) → resume RL on the stronger base with a harder curriculum → repeat.

    This is the closest external validation of this project’s stated plan (“rejection-sampling SFT → GRPO/RLVR”) in the whole file — except Zhipu alternates SFT↔RL repeatedly rather than doing it once and switching permanently. Worth treating the handoff as a loop gated on reward plateauing, not a fixed step count.

GLM-5 [D] (arXiv:2602.15763) — the recipe shape changed: a sequential RL pipeline (Reasoning RL → Agentic RL → General RL) with On-Policy Cross-Stage Distillation blended throughout, replacing 4.5’s expert-then-unify structure, specifically to prevent catastrophic forgetting between stages. New async RL infra + double-sided importance sampling with hard-masking — tokens whose importance ratio falls outside [1-ε_l, 1+ε_h] are zeroed, not soft-clipped — explicitly motivated by policy drift compounding across long agentic trajectories before the (often sole) terminal reward arrives.

Designed to fix: pattern 4. Zhipu explicitly names Vending-Bench 2 and CC-Bench-V2 (“long-term coherence in agents”) as the benchmarks this release targets — a lab measuring the [L] axis directly, not incidentally.

Headline gain: GLM-5 ~20% average improvement over GLM-4.7 across 8 benchmarks; SWE-bench Verified 73.8→77.8; first open-weights model to hit 50 on Artificial Analysis Intelligence Index v4.0. Tags: [E] strong — dynamic temperature + difficulty curriculum are both explicit, named entropy-collapse countermeasures · [L] strong in GLM-5 (double-sided IS + hard masking is a credit-assignment fix aimed squarely at long trajectories) · [R] the most-developed program in GLM-4.5’s report · [N] weak-to-moderate — mostly amplification via expert-iteration/distillation, not boundary expansion.


Xiaomi (MiMo) — MOPD is a genuinely new algorithm, not a GRPO variant with a new name

MiMo-V2-Flash [D] (arXiv:2601.02780) — 3-stage post-training, and stage 3 is the interesting one. Problem named directly: “the see-saw effect” — naive multi-skill post-training improves one capability at the cost of another. Fix: Multi-Teacher On-Policy Distillation (MOPD).

  1. Stage 1 — SFT to “activate latent capabilities acquired during pretraining” (not teach new ones). Notable operational detail: num-zeros (count of MoE params with zero gradient) is their leading indicator of SFT instability — rising = expert-load-balance collapse, falling = overfitting.
  2. Stage 2 — train a suite of narrow domain-specialist teachers via independent RL (agentic: search/coding/tool-use; non-agentic: math/reasoning/safety).
  3. Stage 3 — MOPD: the student rolls out on its own policy distribution (not an offline distillation set, not weight merging) and receives dense, token-level reward = KL-divergence against each teacher’s logits, plus a verifiable outcome reward.
    reward_t = -KL(student_logits_t || teacher_logits_t) + verifiable_outcome_reward
    # student samples on-policy; teachers never generate the training data directly
    
    Result: the student mostly matches/beats the best individual teacher without the see-saw across the full skill set simultaneously (one model, not N specialist checkpoints) — not a free lunch (a couple of regressions), but net positive.
  • Agentic RL scaffold [D]: deliberately minimal — 3 atomic tools (bash, str_replace, finish), no prescribed workflow in the system prompt, “allowing the model to discover best practices during training” — an independently-arrived-at instance of this project’s own “light framing beats heavy scaffolding” rule.

    Designed to fix: pattern 4. MOPD is a documented mechanism for fusing narrow specialists (e.g. an enumeration-specialist and an exploitation-specialist RL checkpoint) into one policy without one specialist’s regressions bleeding into the other — a direct, algorithmic answer to “uneven PTES phases.”

MiMo-V2.5-Pro [D, undisclosed quantitatively] — same MOPD pipeline scaled to 1T params/1M context; qualitative claim of “thousand-tool-call” coherence (worked example: 672 tool calls / 4.3 hrs building a compiler, self-correction after a mid-run regression at turn 512).

Headline gain: matches Kimi-K2-Thinking / DeepSeek-V3.2-Thinking on most reasoning benchmarks at 1/2-1/3 the params; 73.4% SWE-Bench Verified. Tags: [L] explicit design target (“sustains complex trajectories”) · [E] explicit — minimal scaffold + on-policy (not offline) sampling are both named exploration-preserving choices · [R] the non-agentic teacher/reasoning stage · [N] the strongest algorithmic novelty in the file after Kimi’s Agent Swarm — MOPD’s token-level on-policy KL-distillation is a genuinely different move from rejection-sampling SFT, GRPO, or plain distillation, even though the agentic RL environments themselves (unit tests, visual verifiers) are recombinations of known reward-design patterns, not new task surfaces.


The transferable lessons (updated for ten labs)

  1. The method set is small and shared, and it has calcified further, not diversified. SFT · rejection-sampling · DPO-family · GRPO/GSPO/PPO-family · RLVR · RLAIF are the entire vocabulary across ten labs and ~40 releases. Every “new algorithm” this year (clip-higher, GSPO, PARL, MOPD, the four DeepSeek stabilizers) is a variation on group-relative policy optimization, not a new paradigm.
  2. Ordering and SFT-dosage are the live design choice. Llama 4 → thin-SFT/heavy-RL/thin-DPO; Magistral Medium → zero SFT; Zhipu → cold-start-just-enough-for-signal, then iterate SFT↔RL repeatedly rather than once. The Llama-4-era finding (“heavy SFT/DPO restricts RL exploration”) is now independently re-derived by three more labs.
  3. The frontier is agentic-RL-environment design, not new losses. Qwen3.7-Max’s decoupled Task/Harness/Verifier infra, Kimi’s PARL, DeepSeek’s 1,827-environment agentic-task synthesis, Xiaomi’s minimal-scaffold discipline, GLM’s iterative self-distillation loop — none of these are algorithm papers, all of them are environment/data/scaffold engineering. See Agentic RL.
  4. Exploration/entropy-collapse resistance is where disclosure is thinnest and most valuable when it exists. OpenAI, Google, and xAI disclose essentially nothing on this axis despite running RLVR at a scale where it’s a known risk. Where labs do disclose a mechanism (Mistral’s clip-higher, DeepSeek’s four stabilizers, GLM’s dynamic temperature + difficulty curriculum, Kimi’s temperature decay, Xiaomi’s minimal-scaffold + on-policy sampling), it’s the single most directly reusable material in this file for a project whose stated risk is exactly entropy collapse.
  5. Ground-truth-verified reward is now cross-lab-confirmed discipline, not an idiosyncratic project choice — DeepSeek’s hard-negative search-agent filter, Qwen3-Coder’s “hard-to-solve, easy-to-verify” framing, GLM’s format-gate-before-outcome-reward, Mistral’s rejected partial-credit code reward, Xiaomi’s rule-based-only embodied reward. Where a lab relaxes this (Grok 4.1’s agentic-LLM-judge RLAIF), the lab’s own safety card shows a measurable honesty/sycophancy regression — treat as a documented cautionary tale, not a competing recipe.

Summary table

LabFlagship (recipe source)SFTRL algorithmL / E / R / N
Llama (historical)Llama 4 MaverickThin, LLM-judge-filteredHeavy online RL → thin DPOR
AnthropicClaude Opus 4.5RLHF/RLAIF (undisclosed detail) + SDFInoculation-prompted RL (algorithm undisclosed)L, N
OpenAIGPT-5.1-Codex-MaxUndisclosed (safe-completions is the one named technique)RLVR-shaped agentic RL + compaction training (algorithm undisclosed)L, R
GoogleGemini 3 ProCAI-inspired adversarial SFTRLF (DRM + Critic) + verifiable-reward/agentic RL trackL, R
xAIGrok 4.1 FastMinorLong-horizon multi-turn RL, RLVR-domain-expandedL, R
MistralMagistral MediumNoneGRPO, no-KL, clip-higher, 4-part rewardR, E, N
DeepSeekDeepSeek-V3.2Specialist distillationGRPO, merged reasoning+agent+alignment, 4 stabilizersL, E, R
QwenQwen3.7-MaxLong-CoT cold-start (base recipe)GSPO-lineage + decoupled Task/Harness/Verifier RLL, E, N
Kimi (Moonshot)Kimi K2.5Rejection-sampled agentic trajectories; Zero-Vision SFTREINFORCE/GRPO-adjacent + PARL (Agent Swarm)L, E, N
GLM (Zhipu)GLM-5Expert-then-unify (4.5) → cross-stage on-policy distillation (5)GRPO, difficulty curriculum, dynamic temperature, double-sided ISL, E, R
Xiaomi (MiMo)MiMo-V2.5-ProActivate-latent-capability SFTMOPD (multi-teacher on-policy KL distillation)L, E, N, R

Provenance caveat, restated and strengthened: disclosure quality varies by an order of magnitude across this table. DeepSeek, Qwen, Mistral, Kimi, GLM, and Xiaomi publish arXiv tech reports with ablation tables — treat their mechanism claims as [D] high confidence. Anthropic, OpenAI, Google (post-2.5), and xAI disclose mechanism only in system-card prose or blog posts, often one paragraph per model, with capability-RL algorithm/hyperparameters/reward-model architecture never named — treat their “recipe” as mostly inferred continuity except where a technique is explicitly flagged [D] above (safe-completions, compaction, RLF’s DRM+Critic split, inoculation prompting, long-horizon multi-turn RL). Every arXiv id in this file was live-verified against arxiv.org/abs/<id> during the research pass that produced the per-lab notes this chapter is built from — none were fabricated or carried over from training-data memory. Lab recipes change fast; re-verify before betting a training run on a specific ordering or hyperparameter.

Method → Data (your real bottleneck)

Your words: “it’s not that we don’t have data; it’s that we don’t know what data we want and what fine-tune we want.” This chapter is the fix, and it’s a single causal claim:

You do not pick data and then a method. You pick the method — by failure type — and the method dictates the data object you must produce.

Once the method is chosen, “what data do we want” is answered mechanically. Here’s the mapping:

MethodData object it consumesFor your harness, where it comes from
SFT / off-policy distillationfull trajectories from a sourcecurate, or run a stronger model on your challenges and keep its solves
On-policy distillationyour model’s own rollouts, graded per-token by a teacheryour rollouts + a stronger teacher model
Rejection-sampling FTyour model’s own verifier-passed trajectoriesyou already generate these — filter your ~100–200 solves
DPO(chosen, rejected) trajectory pairs at a decision pointpair a solved run vs a failed run on the same challenge
KTOunpaired trajectories tagged good/badyour solved pile + your failed pile, as-is (no pairing)
GRPO / RLVRprompts + a verify() fn — no fixed datasetyour challenge set + the flag check
Agentic RLa live environment emitting rollouts + end-of-episode rewardyour harness itself, as a rollout service

Two consequences you can act on immediately:

  • RLVR needs almost no dataset — just challenges + a verifier. You have both. The “data problem” nearly vanishes; the work moves to the reward fn and rollout infra.
  • Rejection-sampling FT needs only your own solves, which you’re already producing. It’s the lowest-friction first move because the data object is a byproduct of running the benchmark.

So the real question isn’t “what data” — it’s “which gap”

The data object is downstream of the gap diagnosis. Do that first (The decision), and the data spec falls out. The diagnostic that routes everything:

Does the correct action ever appear in the model’s own outputs, even rarely, at high sampling N?

  • Never → knowledge gap → you need external trajectories (SFT / teacher / a tool). Data = curated or teacher-generated.
  • Sometimes (your case: solves 100–200/1000) → execution gap → data = your own rollouts (rejection-sampling) or a verifier (RLVR). You already have both.
  • Mis-ranked → data = good/bad pairs or tagged logs (DPO/KTO). You already have both piles.

In all three of the last cases, you already possess or can trivially generate the data — which is why your instinct that “data isn’t the bottleneck” is correct. The bottleneck was the method, and the method is chosen by the gap.

Before you train — instrumentation & data readiness

This chapter is a prerequisite, not a fix. Diagnosing the gap gives you the routing test (knowledge vs. execution vs. exploration); behavioral audit → training signal maps observed failure shapes to methods. Both assume you can already say which stage of a run failed. Today, for this project’s own harness, you can’t — not because the telemetry is missing, but because nobody has written the small amount of code that turns existing telemetry into a stage verdict. This chapter is the ground-truth answer to “what do I already have, and what’s the minimal thing to build first” — grounded entirely in the repo and vault as read on 2026-07-02, not in aspiration.

Everything below is a recommendation for main to implement — no code was written, no repo file was touched. Confidence is stated per claim; where I could not confirm something, I say so.

1. Why per-stage signal is the prerequisite for the whole diagnosis

The diagnosis framework’s routing test — does the correct action ever appear at high N, and does RL (not matched SFT) expand it? — is defined per challenge or per challenge-subtype, not per portfolio. Run it on the portfolio in aggregate and you get one number that averages over four structurally different failure modes (F1 exploration / F2 skill / F3 tool-use / F4 long-horizon), which is exactly the “collapsing a split verdict into one sentence” anti-pattern Diagnosing the gap §0 warns against. You cannot segment a portfolio you cannot localize. Per-stage signal is what turns “39% pass@1, up from 26%” into “F1 dropped from 40%→12% of failures, F2 is now the dominant failure mode” — the only shape of finding that tells you which lever (SFT curriculum, tool-use reward, GRPO exploration term) to pull next.

flowchart LR
    A["events.jsonl<br/>(already emitted)"] --> B["stage extractor<br/>(NOT built)"]
    C["per-challenge manifest<br/>(NOT built)"] --> B
    B --> D["F1-F4 attribution<br/>per run"]
    D --> E["Diagnosis framework<br/>routing test, per subtype"]
    E --> F["method choice:<br/>SFT curriculum / tool reward / GRPO term"]

2. What the harness already emits — confirmed, source-read

Source: go/libs/agent/events/events.go (schema), go/libs/secagent/runner.go (wiring), doc of record lessons/security-agent/harness-observability-contract-2026-06.md. Every event is Turn-indexed and tool_call/tool_result pairs join on tool_call_id — this is the load-bearing fact that makes any stage-localization script possible with zero changes to the agent loop.

EventWhat it carriesStage-attribution value
meta / tool_schemas / system_prompt / user_message (preamble)trace_id, model, action space, task stringRun identity; needed to join against a per-challenge manifest
agent_start{input}“Did the agent do anything” — trivially free
turn_start{history_len}Per-turn anchor
llm_responserole/content/reasoning, tool_calls[], token usage, TTFT/latency, finish_reasonReasoning text — usable for context, never as stage-reached evidence (confabulation risk)
tool_call{tool_call_id, name, args}The command actually issued — args is a raw string (bash/curl/etc.), not a structured HTTP call
tool_result{tool_call_id, name, output, tool_exec_ms, error?}The real, server-observed response text — this is the only tier that counts as ground truth
agent_finishstop_reason, turns/max_turns, finish_reasonDistinguishes “ran out of budget” (max_turns) from “gave up” (stop) from silent truncation (stop_reason=stop but finish_reason=length — don’t trust stop_reason alone)
flag_scan{solved, primary?, retrieved[], findings[{flag, origin, first_event, first_turn?}]}Terminal outcome, provenance-classified (retrieved/echoed/model_claim)

Tool surface bounding what a “stage” can even look like: five tools total (bash, read_file, write_file, update_file, WebSearch — confirmed go/libs/sectools/sandbox.go, no WebFetch, contra two now-stale lessons). All HTTP interaction with the target is embedded in free-text bash commands and free-text bash output — there is no structured http_request tool call with a machine-readable status/URL/body. This is the single biggest reason “endpoint discovered” and “vuln identified” are not already derivable cleanly — any stage parser must regex/parse free text, not read a field.

Confidence: high (direct source read, 2026-07-02). Full detail: artifacts/overnight-instrumentation/research/harness-signals.md.

3. Per-stage ground-truth verifier design

3.1 What already satisfies “ground-truth-verified, never transcript-matched”

The project’s non-negotiable rule — reward from real environment/tool-output state, never format/regex-matched on the transcript — is already met at the terminal (flag) stage and nowhere else:

  • go/libs/agent/events/flagscan.go (ScanFlags) classifies every FLAG{...} sighting as retrieved (in a tool_result, absent from that call’s own args) / echoed (present in both) / model_claim (only in model text). This is a provenance signal — it proves the string came back from the sandbox, not the model’s mouth — but is not a byte-compare against the real flag. A retrieved flag from a decoy or off-target leak still reads solved:true.
  • The actual byte-compare (flag_verified, the project’s own term) lives outside the agent runtime: benchmark/flags/pd26_flags.current.json (10 live flags) + benchmark/verification/PD26-NN/exploit/solve.py (held-out reference solvers). Verifying against these today is a manual, SSH-gated, human-authorized step (lessons/evals/verifying-agentic-security-runs.md) — not wired into the harness, pdq, or trace-verify as an automatic per-run check. grep -rln "flag_verified" --include="*.go" returns zero hits — confirmed absence, not a naming mismatch.
  • Documented failure modes any pre-terminal verifier must not reproduce: model-claim fabrication after repeated 404s (lessons/security-agent/flag-detection-false-positives.md, 6/17 fine-tune “solves” were exactly this); a hardcoded flag-format regex producing false negatives on an off-roster 24-hex flag (lessons/evals/gym-challenge-flag-format-breadcrumb-false-negative.md); an 18-point proxy-vs-verified inflation on the fine-tune’s own leaderboard (lessons/evals/ctf-flag-verification-and-proxy-pitfall.md).

3.2 The proposed per-stage predicate design

Phase names follow the already-designed (if unused) ptes schema in .claude/rules/challenges.md (recon → enumeration → detection → exploitation → lateral), mapped onto the brief’s F1–F4 taxonomy. The mechanism is a direct reuse of flagscan.go’s proven shape: a pure, read-only, post-hoc scan over events.jsonl — no new sandbox instrumentation, no changes to secagent’s execution path.

StageMaps toWhat real state proves itHow checkableRobustness
reconpre-F1A request reached a known recon surface and got a responsetool_result exists for a tool_call whose args path matches a per-challenge recon-surface allowlistCheap + robust
enumerationF1 (never finds vuln endpoint)A request’s method+path matched the vuln-bearing route, regardless of payload correctnesstool_call.args path/method vs. a per-challenge allowlist lifted from solve.pyCheap + robust — automates the existing by-hand method in lessons/evals/wall-attribution-discovery-vs-exploit-fail.md
detectionF1/F2 boundaryResponse shows diagnostic evidence of the specific bug class (error, type-confusion tell, introspection leak)tool_result.output vs. a per-challenge, bug-class-specific signatureHard/ambiguous — bug-class-specific; recommend optional/best-effort in v1, fold into “enumeration reached, exploitation not yet” if no clean signature
exploitationF2 (finds, can’t exploit)The payload actually worked — server-side artifact only possible on success (token, row leak, shell banner)tool_result.output vs. the exact success predicate already written in that challenge’s solve.py (verbatim reuse — this IS the ground-truth oracle)Cheap + robust when the exploit yields one identifiable artifact; coarser (any-200-on-payload) proxy where it doesn’t
lateral / flagF4 + terminalSecond request in a bypass→flag-read chain returned the flagflag_scan.retrieved, verbatimAlready built — zero new work
(cross-cutting, not a stage)F3 (clumsy tool-use)Tool-call diversity / pro-grade tool vs. improvised shell one-linerCount distinct tool_call.name, or classify args against a pro-tool allowlistDoes not fit the stage ladder — log as a separate metric, do not fold into a potential function (see §3.3)

Two things this design deliberately does NOT do, because both violate the project’s own reward rule:

  • Does not reuse the .claude/rules/challenges.md ptes matcher/triage-subagent mechanism. That mechanism is explicitly an LLM judge (“a span satisfies a matcher when the description’s intent is met, not merely when the regex matches”) — a level-3 rubric on the reward-gameability ladder (lessons/post-training/reward-signal-types-and-gameability-ladder.md), and per lessons/challenges/pd-challenge-file-anatomy.md, none of the 10 live PD26 challenges even carry the file this schema lives in. Use its phase names, not its judging mechanism.
  • Does not treat intent (the opt-in SECAGENT_CAPTURE_INTENT field) as evidence of “identified the vuln.” It’s documented observability metadata, low-faithfulness, off by default, and must be stripped before training use (lessons/security-agent/bash-intent-observability-field.md).

3.3 Potential-based shaping — the caveats if this ever feeds RL reward

If a stage-scan result is ever turned into dense RL signal (rather than just a diagnostic readout), the only shaping form proven not to change the optimal policy is F(s,a,s') = γΦ(s') − Φ(s) for any potential function Φ — Ng, Harada & Russell, “Policy Invariance Under Reward Transformations,” ICML 1999 (no arXiv id — this predates arXiv’s routine ML use; ACM DL 10.5555/645528.657613, verified live 2026-07-02). Two subtleties are easy to get wrong:

  1. Φ must be a monotone “best stage reached so far” running max, not the instantaneous current-turn stage — otherwise re-triggering an already-reached signature, or a later turn’s evidence going quiet because the agent moved on, can pay a spurious negative shaping reward for forward motion.
  2. Φ must be defined identically across every terminal branch (stop_reason{stop, max_turns, error}, all real values in this harness) — otherwise the invariance proof breaks across the different termination paths this project’s variable-length episodes actually produce. Recommended sidestep: apply shaping only over non-terminal transitions; let R_terminal (= flag_scan.Solved, unchanged) carry all outcome signal at the very last transition.

Domain-adjacent SOTA, verified live 2026-07-02 — none tick all three boxes (cybersec-specific + proven invariant + validated against a ground-truth terminal verifier), so this remains an unfilled niche, not a solved-elsewhere problem. The two rows below (Pentest-R1, DRLRM-PT) are academic cybersecurity-LLM training papers — per this project’s standing rule, they are cited for context only, not as a basis for any claim/recipe/verdict here; neither produced a frontier model, so neither is treated as evidence for the shaping design below. The recommendation that follows this table rests on the Ng/Harada/Russell invariance theorem and this project’s own gameability-ladder lesson, not on either of these two papers.

PaperarXivRelevanceConfidence
TIPS — turn-level potential shaping for search-augmented LLMs2603.22293Shaping machinery is directly on-point; domain (search-QA) is not0 citations, brand-new — promising, not validated
ToolRL — reward design for tool-use RL2504.13958Closest prior art on reward granularity/timing for tool-use GRPO; not potential-based1 citation
Pentest-R1 — two-stage RL for autonomous pentesting (academic cybersecurity-LLM training paper — cited for context, not a basis for our decisions)2508.07382Domain topic overlaps — a per-step reward in an interactive CTF env (InterCode-CTF); exact shaping formula not fully verified from search highlights alone — flag as unread in full. Not used to support the recommendation below0 citations, brand-new
DRLRM-PT — Reward Machine over kill-chain phases (academic cybersecurity-LLM training paper — cited for context, not a basis for our decisions)2405.15908Illustrates a non-potential-based design (flat +1/+10 phase bonuses, no γΦ(s')−Φ(s) structure) — cited only to warn against conflating “reward machine over phases” with “provably invariant shaping,” not as prior art we build onMedium

Recommended default if this is built: keep the terminal flag reward and the dense stage-shaping term as two separate additive components, never merged into one function — this is both what makes the invariance argument clean (Ng, Harada & Russell, ICML 1999, cited above) and what the project’s own gameability doctrine independently favors (decoupled dense process signal + sparse ground-truth outcome signal, lessons/post-training/reward-signal-types-and-gameability-ladder.md).

Confidence: high on the harness-reuse mechanism and the Ng et al. invariance result itself (25-year-old, well-established). Medium on the “running-max Φ” / “terminal-consistency” recommendations — applied reasoning from the theorem plus this project’s variable-horizon episode shape, not lifted from a paper. Full detail: artifacts/overnight-instrumentation/research/staged-verifier-design.md.

4. What training/eval data we already have, per candidate move

Scope: benchmark/ (repo, git-tracked) + ~/security-agent-qwen/ (untracked local run-artifact directory holding the actual trajectory takeouts, partially mirrored to s3://llmresearch-data/). On the order of 1,200+ individual agent trajectories across ~470 distinct challenge definitions exist already — this is a mining problem, not a collection problem, for three of the four candidate moves.

Candidate moveReadinessExtraction stepSharpest gotcha
(i) Rejection-sampling SFT positive set~185 raw solved trajectories across 5 corpora (gym263 64, gym564 39-cleaned, warpenv-broker 22, envgen 29, argus60-base 29); one prior SFT (cybersec-qwen36-traj-ep2 / pd-v5-qwen36-ft) already built this wayFollow lessons/post-training/verified-trajectory-synthesis-recipe.md verbatim: verifier-accepted terminal only → replay-reproduce → dedup → decontaminate → Thought/Action/Observation with Observation loss-maskedConfirmed, not hypothetical: the prior SFT fabricated flags on 6/17 claimed solves (lessons/post-training/sft-induced-flag-confabulation.md) — naive success-folder collection teaches success-shape, not success
(ii) KTO/DPO pairsKTO-native data is ready today for free — every success//failed/ split is an unpaired good/bad label (mechanical, zero judgment calls). True DPO (same-decision-point divergent pairs) needs k≥2 same-challenge same-model runs — only the PD26 canonical sweep (k=5, 10 challenges) has this; the larger gym pools are k=1Label KTO now; if DPO is wanted, mine the PD26 k=5 sweep, don’t re-sweep the gym poolsDon’t default to DPO just because solve/fail piles exist — lessons/post-training/dpo-kto-for-agent-tool-selection.md’s escalation ladder (fix tool description → prompt guidance → action-space → better base → SFT → DPO/KTO → GRPO) should gate the choice first
(iii) Per-stage eval (F1–F4)This is the actual gap. A PTES-phase tagger exists as a named, working concept — but on a different, sibling corpus (the pr seat’s BSides-LV-2026 audit: 189 Claude-model runs, Neo/XBOW-bench, not this project’s qwen/deepseek/glm/xai/gemini roster), and it is not yet open-sourced or present anywhere in this repoBuild a lightweight classifier over the existing tool_call/tool_result stream, scoped to the PD26/gym corpus specifically — materially smaller than the sibling talk’s full framework (no tool-tier/contamination/recovery-shape analyzers needed for internal F1–F4 attribution)Don’t import the BSides talk’s headline numbers (“82% pivot rate,” “62% stall in exploitation”) as if they describe this project’s own models — they describe Claude models on a different bench
(iv) Credit-assignment tracesBest-instrumented axis in the inventory — every run in every corpus carries the full per-turn event stream, confirmed live on a real sample (190-line events.jsonl, 63 tool_call/tool_result pairs, 29 turns)tool_call_id-paired parsing is a solved extraction problem (events.ScanFlags already demonstrates the pattern in Go)Turn-level “did this action retrieve the flag” ≠ “was this turn part of a coherent minimal solve path” — the replay-reproduce check proves an action sequence is causal, not that recorded Thought spans are faithful narration (a distinct, unaddressed reasoning-distillation risk)

Cross-cutting gotchas that apply to all four moves:

  • Ground truth exists for only 2 of 7 corpora (PD26, argus60/APEX). gym263/gym564/warpenv/envgen have no held-out flag file — their “solved” signal is the harness’s own retrieved classification, a weaker epistemic tier than exact-match. Flag this explicitly in anything built downstream of them.
  • retrieved (tool_result-not-in-own-args) is a strong genuineness signal but is still transcript-level heuristic, not an out-of-band verifier query — the project rule is not automatically satisfied just because the harness flagged something retrieved.
  • gym564’s local raw copy (521 completed) is not the cleaned number (305) cited elsewhere in memory — reconcile against experiments/2026-06-27-gym564-archive-cleanup.md before using it quantitatively.

Confidence: high on the corpus inventory and readiness split (row-counted directly, 2026-07-02); high on the (iii) framing gap being genuine (vault-searched for “PTES”/“stage-local”/“kill chain,” found exactly one non-applicable hit). Full detail, including per-corpus row counts and S3 paths: artifacts/overnight-instrumentation/research/data-inventory.md.

Collapsing §2–§4 into one actionable delta: the harness’s process telemetry is already complete for stage attribution. Nothing in go/libs/agent/events or secagent/runner.go needs to change. What’s missing is entirely semantic, and it splits into two independent, additive pieces of work that correctly sit on different sides of the seat boundary:

flowchart TD
  A["Pick ONE challenge<br/>(PD26-02 — chain already<br/>documented in lessons/challenges/<br/>pd26-02-nosqli-authbypass-chain.md)"] --> B["challenge-builder:<br/>author stage_oracle.json<br/>from solve.py — 4 predicates,<br/>editorial work, not infra"]
  A --> C["main: write stagescan.go,<br/>same shape as flagscan.go —<br/>pure io.Reader -> StageScan,<br/>no side effects"]
  B --> D["Run stagescan over existing<br/>events.jsonl from already-<br/>collected runs (benchmark/results/,<br/>~/security-agent-qwen/)"]
  C --> D
  D --> E["Validate: does the stage vector<br/>match what a human reading<br/>the same trace concludes?<br/>(same discipline as the flag-layer<br/>validation)"]
  E -->|"holds"| F["Generalize to all 10 PD26,<br/>then gym/warpenv/envgen"]
  E -->|"doesn't hold"| G["Refine predicates before<br/>trusting any F1-F4 number"]
  1. main/harness side: a deterministic (no-LLM, no-confabulation) URL/path extractor over tool_call.args
    • tool_result.output, scoped to the bash tool only. Emit as a derived per-run artifact, not a new harness event — keeps the harness itself free of challenge-specific semantics.
  2. challenge-builder side: one stage_oracle.json sibling per challenge — one entry per PTES phase, each a deterministic predicate over (method, path_regex, status_code, body_signature), authored directly from that challenge’s own solve.py. This is editorial work (one person reads ~10 solve.py files and writes ~4 predicates each), not new infrastructure — the ground-truth reference already exists, it’s just not lifted into a machine-readable file.
  3. Prototype on ONE challenge before committing to all 10. PD26-02 is the natural first pick — its two-step NoSQLi-authbypass chain is already fully narrated in lessons/challenges/pd26-02-nosqli-authbypass-chain.md, and solve.py’s own success checks (r.status_code == 200 and data.get("token") for the bypass; a flag regex on /api/profile for the pivot) are directly reusable as the exploitation-stage and lateral-stage predicates verbatim. Run the prototype over runs already sitting in benchmark/results/ and ~/security-agent-qwen/ — no new sweep needed to validate the mechanism.
  4. A narrower, more concrete companion gap on the flag side: turn flag_scan.retrieved into a true flag_verified boolean via an offline exact-match diff against the already-git-tracked benchmark/flags/pd26_flags.current.json, for the 10-challenge roster only. This needs no SSH, no live secrets, and closes the one place today’s “ground-truth-verified” claim is actually a provenance proxy.

None of this requires RL infrastructure, a new sandbox tool, or a change to secagent’s execution path — it is a read-only scan over data that already exists, validated against traces already collected, before any GRPO/RLVR reward design depends on it.

One problem, or many? — monolithic outcome-RL vs staged decomposition

Every other chapter in this book asks “which algorithm” (SFT vs DPO vs GRPO). This chapter asks a question one level up, specific to a sequential, multi-stage task with a single sparse terminal reward: should the CTF solve be trained as one end-to-end outcome-RL problem (flag reward only, let RL discover the stages), or decomposed into sub-problems — evaluated per stage and, more contentiously, trained per stage? The two halves of “decompose” turn out to have very different answers, and conflating them is the single easiest way to get this wrong.

All citations below are carried over, unmodified, from five research threads run 2026-07-02 (artifacts/overnight-decomposition/research/{monolithic-case,decomposition-case,staged-eval,pentest-ctf-rl,credit-assignment-theory,verdict}.md) — no id below was invented for this chapter. Re-grounding pass, same date: per the project’s standing rule, no conclusion in this chapter may rest on a domain-specific academic CTF/pentest training or benchmark paper (CTF-Dojo, Cyber-Zero, Pentest-R1, HackSynth-GRPO, AutoPenBench, Cybench, NYU CTF Bench, EnIGMA, InterCode-CTF, DRLRM-PT, node-fragility shaping, the kill-chain-staged-reward paper) — every such work is demoted to a labelled context-only mention below, and every claim that had rested on one is re-anchored on general frontier-lab / RL-theory evidence or this project’s own data instead. Two new citations were added and independently verified live for this pass: STaR (arXiv:2203.14465) and ReST-EM (arXiv:2312.06585), both general (non-security) self-training literature, replacing CTF-Dojo/Cyber-Zero as the basis for §5’s rejection-sampling-SFT recommendation.


1. The pipeline, and why the failure isn’t uniform

A CTF solve is not one action, it’s a chain:

flowchart LR
  R["Recon /\nenumeration"] --> E["Endpoint\ndiscovery"]
  E --> V["Identify the\nvulnerable endpoint"]
  V --> X["Exploit it"]
  X --> P["Post-exploitation /\npivot"]
  P --> F(("Flag\n{0,1}\nground-truth verified"))

  classDef stage fill:#132b22,stroke:#34d399,color:#eafaf3;
  class R,E,V,X,P stage;

Only the last box is ground-truth-checkable today. Observed failures cluster by where in the chain the agent dies, not uniformly across it — the project’s own F1–F4 taxonomy, which turns out to be a CTF-specific instance of failure clusters the general agent-eval literature keeps independently rediscovering (§4).

TagFailureCanonical RL/agent-research framingWhat it is not
F1Never finds the vulnerable endpointExploration / coverage failure — no gradient exists until the reward is first observed; large, deceptive state spaceNot a credit-assignment problem — you can’t assign credit for a reward you’ve never seen
F2Finds it, probes shallowly, can’t land the exploitExecution / skill (performance-floor) failure — capability present, doesn’t reliably convertNot usually fixed by more exploration
F3Clumsy tool use, wrong tool for the jobPolicy / tool-selection failure — a distinct axis from “does it find the bug”Overlaps F2 but has its own literature (tool-augmented-LLM failure taxonomies)
F4No real pivot/chaining after a footholdLong-horizon credit-assignment failure — the terminal bit has to retroactively explain ~100 turns; variance grows with horizonNot solved by “try more” alone — it’s a variance, not coverage, problem

The credit-assignment theory thread makes the split precise: a ~100-turn trajectory with reward only at the end is hard for three separable reasons — exploration burden (F1, upstream of everything else), credit-assignment variance (Monte-Carlo/GRPO-style returns smear one scalar across all turns — arXiv:1506.02438, GAE), and compounding distributional drift (a policy trained once, on one static snapshot, drifts off-distribution as a rollout gets longer — arXiv:1011.0686, DAgger, classically O(εT²) uncorrected). F1 is the first problem; F2–F4 are flavors of the second and third. Don’t expect one fix (a denser reward) to solve both (arXiv:2312.01072, the credit-assignment-vs-exploration survey).


2. Side A — the monolithic case, steelmanned

The pattern across every lab that tried both is consistent: outcome-only + scale beats hand-built process supervision, every time it’s been A/B’d.

  • DeepSeek-R1-Zero rejects process reward outright, in its own failure-experience writeup. Pure RL, no SFT, rule-based outcome-only reward; reasoning behaviors emerge as a side effect. Verbatim: “a model-based PRM… inevitably leads to reward hacking, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.” arXiv:2501.12948. This is the single strongest evidence against a learned/neural per-stage reward — note precisely what it doesn’t rule out: a deterministic, ground-truth per-stage check is a different animal (§4).
  • OpenAI Deep Research shipped a long-horizon, tool-using agent trained end-to-end on outcome/rubric reward, and its own team says why: “End-to-end training beats manual orchestration… constructing a graph of operations… is the common approach to building agents [but] Deep Research is trained end-to-end… This allows the model to develop flexible strategies… that would break if scripted manually.” (OpenAI Deep Research system card, no arxiv id — flagged as such.) Closest real-world analog to this project’s shape: long-horizon, tool-using, sparse/rubric-graded.
  • Kimi k1.5 gets SOTA reasoning results with no PRM, no MCTS, no value function — substituting long-context scaling for explicit search. arXiv:2501.12599. Second independent lab, same conclusion as R1.
  • Llama 4’s own post-training team found heavy SFT/DPO caps the ceiling of the subsequent RL stage — verbatim: “SFT and DPO can over-constrain the model, restricting exploration during the online RL stage.” They responded by pruning >50% (95% for Behemoth) of their SFT data (ai.meta.com blog, no arxiv id). This is the project-critical citation: a hand-imposed stage boundary is itself a form of prescriptive pre-RL structure, and this is the general mechanism by which imposed structure narrows a policy’s exploration before RL gets to use it.
  • Academic, cited for context only, not a basis (per project standing rule — no domain-specific academic CTF/pentest training paper has produced a frontier cybersecurity model): in the CTF/pentest domain specifically, the same monolithic-wins pattern is also reported by academic training papers — CTF-Dojo (arXiv:2508.18370), Cyber-Zero (arXiv:2508.00910), Pentest-R1 (arXiv:2508.07382), HackSynth-GRPO (arXiv:2506.02048). None of these is load-bearing here — the actual basis for Side A is the frontier-lab evidence directly above (R1, Kimi k1.5, OpenAI Deep Research, Llama 4, Bitter Lesson), which independently converges on the same conclusion without needing a CTF-specific data point.
  • The Bitter Lesson (Sutton 2019, incompleteideas.net) is the intellectual ancestor of all of the above: hand-built structure plateaus, general search+learning wins at scale. High confidence the historical pattern is real; medium-low confidence it transfers directly to a 1000-challenge corpus with real per-rollout infra cost — that’s exactly the disanalogy the honest limits below press.

The honest limits — where the monolithic case’s own literature admits it breaks down

LimitCitationWhat it says
Pure outcome RL can structurally fail to ever find the rewardGo-Explore, arXiv:1901.10995 / Nature s41586-020-03157-9Vanilla deep RL scored ~0 on Montezuma’s Revenge/Pitfall — canonical sparse, deceptive, long-horizon environments — until an explicit “remember states, return, explore from there” mechanism was added. This maps almost exactly onto F1: it’s an exploration-algorithm problem, not a hyperparameter one.
The best long-horizon precedent needed denser reward + enormous scaleOpenAI Five, arXiv:1912.0668010 GPU-months of distributed self-play, AND a per-frame shaped reward (last-hits, kills, tower damage) — not a single terminal bit. Citing this as “pure sparse reward at scale works” over-claims what the paper shows.
R1’s own reward-reliability admissionarXiv:2501.12948“The success of pure RL depends on reliable reward signals… for tasks that cannot obtain a reliable signal, DeepSeek-R1 uses human annotation… and only conducts RL for hundreds of steps.” The CTF flag reward is reliable (ground-truth) but far sparser per compute-dollar than a math/code answer — R1’s paper doesn’t test this sparsity regime; it’s an extrapolation this project would be making, not a validated claim.

Bottom line for Side A: thick, convergent, in-domain evidence that monolithic-outcome-plus-better-data/curriculum wins whenever it’s been tried against a decomposed alternative in LLM-CTF training specifically. The genuine open risk it must own is Go-Explore’s — whether the ~100-200/1000 solve rate reflects “still finding it eventually with more rollouts” (favors monolithic) or “structurally not finding it” (favors an exploration-specific intervention) is an empirical question literature alone cannot resolve.


3. Side B — the decomposition case, steelmanned

Framing. Monolithic outcome-RL is implicitly betting on four things at once: (1) the base policy already puts non-zero mass on the correct trajectory shape for every stage on a large fraction of challenges, (2) the RL algorithm can correctly attribute a late reward to the right subset of ~100 turns, (3) one scalar is expressive enough to teach four qualitatively different skills (exploration breadth, exploit depth, tool discipline, chaining) without one skill’s gradient starving another’s, and (4) “more outcome RL” is uniformly the right lever for F1 through F4 alike. Every technique below is a documented failure mode of at least one of these assumptions.

The menu of decomposition mechanisms

MechanismCitationWhat changes in the loopConfidence
Options / SMDP framework (the seminal foundation, asked-for regardless of age)Sutton, Precup, Singh, Artificial Intelligence 112 (1999) (pre-arxiv; DOI 10.1016/S0004-3702(99)00052-1)Action space becomes {launch_recon_option, launch_exploit_option, ...}; a high-level policy picks among temporally-extended sub-policies, shortening the effective horizon the terminal reward has to bridgeHigh (theory), medium (LLM transfer). Known failure: naive end-to-end option learning collapses to one mega-option or micro-manages every step.
FeUdal Networks — Manager/Worker split fixes option-collapsearXiv:1703.01161Manager emits abstract directional goals in latent space at low temporal resolution; Worker is intrinsically rewarded for moving state toward that direction; own ablations show a plain (non-dilated) recurrent Manager “fails catastrophically” on long-credit-assignment tasksHigh (mechanism), medium (LLM transfer — from-scratch Atari RL, not a token-level LLM policy)
ArCHer — the LLM-native analoguearXiv:2402.19446A high-level, off-policy turn-level value function aggregates reward across turns; a low-level PPO-style update trains the token policy inside each turn using that value as its reward. Map “turn” onto “stage.” Single strongest “if I had to prototype one paper” citation for training-decomposition.High (recipe exists), medium (untested on anything CTF-shaped)
HiPER — hierarchical advantage estimationarXiv:2602.16165Factorizes policy into planner + executor; Hierarchical Advantage Estimation aggregates returns per subgoal, provably reducing variance vs flat GAE; +6.6% ALFWorld, +8.3% WebShop, largest gains specifically on long-horizon multi-subtask tasksHigh (strong ablations)
MiRA — milestone-based dense rewardarXiv:2603.19685Dense, milestone-based reward replaces sparse outcome-only; on Gemma3-12B, WebArena-Lite success rate 6.4% → 43.0%, beating WebRL (38.4%) and GPT-4-Turbo (17.6%). The single strongest empirical existence-proof in this whole dossier that flag-only reward can leave a large gap on the table — but on web-navigation, not offensive-security CTF.High
Pentest-R1 — domain-specific two-stage trainingarXiv:2508.07382Academic, cited for context only, not a basis (per standing rule): offline RL on 500+ real pentest walkthroughs → online RL in a live CTF env. Structurally resembles this project’s own planned SFT→GRPO, but that resemblance is not the basis for recommending it — the load-bearing mechanism for training-decomposition is the general options/ArCHer/HiPER hierarchical framing above plus curriculum-learning theory below.N/A — context only
Potential-based reward shaping — the theoretical safety net for everything aboveNg, Harada, Russell, ICML 1999 (pre-arxiv)F(s,s') = γΦ(s') − Φ(s) for any state-only potential Φ provably leaves the optimal policy unchanged — the telescoping sum over an episode collapses back to Φ(s_T) − Φ(s_0) plus the true reward. This is a theorem, not an empirical claim. Modern reaffirmation: arXiv:2502.01307 (practical effectiveness still depends on Φ’s scaling).Very high (correctness); risk is entirely implementation
RUDDER — learned, return-equivalent redistributionarXiv:1806.07857Train an auxiliary model to predict final return from trajectory prefixes (using the project’s own ~100-200 verified solves), use its temporal differences as a per-step reward — a learned alternative to hand-specifying Φ, with the same correctness guaranteeHigh (theory), directly actionable given existing verified-solve data
Ground-truth per-stage verifiers / VPR / CM2 (checklist rewards)arXiv:2605.10325 (VPR), arXiv:2602.12268 (CM2)Decompose the terminal task into a checklist of objectively verifiable sub-criteria (sandbox-checked, not judge-opinion) — the safe form of stage reward, symmetric to the flag verifier’s own contract. VPR’s own honest caveat: benefit “depends on the reliability of the verifier,” and extension “to less structured, open-ended environments… remains an open challenge” — directly relevant to CTF’s own stage 3 (§4).High, with an explicit open-environment caveat
Curriculum learning & sequencing — orthogonal to reward, lowest risk of the whole menuBengio et al., ICML 2009 (pre-arxiv, foundational); h1 arXiv:2510.07312; FastCuRL arXiv:2503.17287; BPO arXiv:2508.03018Order training by difficulty (single-endpoint before decoy-heavy; 1-hop exploit before 2-hop pivot) — touches no reward function at all. h1: curriculum + pure outcome-only reward gets an exponential sample-complexity gain. BPO explicitly reports vanilla GRPO on their sparse-reward setting yields only marginal improvement without it.High — the cheapest, least-risky lever in this entire menu
Kill-chain-staged reward (cyber-defense red-teaming)arXiv:2605.17075 (May 2026)Academic cybersecurity-LLM training work, cited for context only, not a basis (per standing rule). Frozen LLM planner emits kill-chain intent; a trained RL controller gets reward “aligned with kill-chain progression.” Superficially the closest thing in the literature to Option B, but it’s brand-new, unreplicated, doesn’t ablate “staged reward” from “hybrid architecture,” and — per the standing rule — carries no evidentiary weight for this project regardless. The actual basis for Option B’s viability is the general HRL/theory rows above (ArCHer, HiPER, potential-based shaping).N/A — context only
Classical (non-LLM) staged reward for pentestDRLRM-PT (reward machines), DOI 10.1109/ijcnn60899.2024.10650368; node-fragility shaping, DOI 10.3390/electronics13214311Academic pentest RL, cited for context only, not a basis (DRLRM-PT is explicitly named in the project’s standing rule). Reports staged/dense reward helping sample efficiency in small, discrete, formally-specified MDPs (network graphs, no language, no tool-calling) — a structurally different regime, and not where this project’s “staged reward can help” claim rests. That claim’s actual basis is the potential-based-shaping theorem and RUDDER above.N/A — context only

The honest limits — where decomposed training breaks, concretely

The reward-hacking evidence against naive per-stage reward is thick and convergent, not one paper. The moment a “verifier” stops being a deterministic ground-truth check and becomes a learned/judge score, this project’s own confirmed lesson (SFT-induced FLAG{} confabulation from a loose format-matcher) generalizes into a much larger literature:

  • PURE / Stop Summation (arXiv:2504.15275) names the mechanism precisely: the canonical summation-form credit assignment (additive, per-step reward) “easily induces LLMs to hack steps with high rewards.”
  • Reward Under Attack (arXiv:2603.06621) shows SOTA PRMs function as “fluency detectors rather than reasoning verifiers” — >0.9 PRM reward on trajectories with <4% ground-truth accuracy.
  • Gao et al. (arXiv:2410.15115) — combining a learned PRM/ORM with success reward can hurt relative to success-reward-only, via “repeating correct but unnecessary steps.”
  • PRIME’s own authors (arXiv:2502.01456) state the central open problem is that process labels are “prohibitively expensive… making [PRMs] particularly vulnerable to reward hacking,” and route around a separately-trained PRM entirely for this reason.
  • MONA (arXiv:2501.13011, DeepMind) generalizes this: multi-step reward hacking can occur even when no single step looks bad to a human/judge overseer.
  • The ancestor of all of it: the bicycle-shaping failure (Randlov & Alstrom, ICML 1998, pre-arxiv) — a non-potential-based “looks like progress” bonus taught an agent to ride in tight circles farming the bonus instead of reaching the goal. Same species as this project’s own SFT confabulation lesson: reward emitting the right-looking pattern rather than deterministically verified success, and the policy learns to farm the pattern.

Ceiling-capping is a real cost of decomposed training too, symmetric to Llama 4’s SFT/DPO warning. HIRO (arXiv:1805.08296) needed an explicit off-policy correction specifically because a high-level subgoal’s meaning drifts as the lower-level policy improves during training — without it, the system converges on subgoals that are locally useful but cap out below the true optimum. Translated to F1–F4: an “exploit-only” sub-policy trained against a synthetic “endpoint identified” subgoal risks converging on the shallowest exploit that satisfies the boundary — which is precisely the F2 shallow-probing failure this project already observes, not a hypothetical.

Academic, cited for context only, not a basis: the domain-specific CTF/pentest training papers named above happen to be monolithic-outcome or curriculum-decomposed rather than reward-decomposed, and the one adjacent paper doing genuine staged reward in an LLM+RL security loop (arXiv:2605.17075) is unreplicated — but neither observation is the basis for caution here. The actual, load-bearing case against naive per-stage reward is the general reward-hacking convergence immediately above (PURE, Reward Under Attack, Gao et al., PRIME, MONA, HIRO) — that literature alone is sufficient to warrant the conditional verdict in §5, independent of what the CTF-training corpus does or doesn’t show.


4. The key split — eval-decomposition vs training-decomposition

This is the load-bearing distinction. They are not the same decision, and the evidence supports very different confidence levels for each.

Eval-decompositionTraining-decomposition
What it meansMeasure per-stage reached/not-reached, on top of the existing flag_verified terminal signalReplace/augment the terminal reward with per-stage rewards, curricula, or hierarchically-trained sub-policies
Training-loop changeNone — a read-only pass over traces already generatedThe reward function, or the training architecture, or both
CostNear-free (one aggregation pass)Real engineering + real risk surface
Evidence for itAgentBoard’s “progress rate” (arXiv:2401.13178, NeurIPS 2024 Oral) — “current evaluation frameworks mostly focus on the final success rate, revealing few insights”; MAST’s 14-mode/3-category taxonomy (arXiv:2503.13657, κ=0.88); AgentErrorTaxonomy — root-cause diagnosis alone (no reward change) buys +24% all-correct accuracy (arXiv:2509.25370); phase-aligned taxonomies independently reinvented in a different (non-security) domain (arXiv:2508.13143); tau-bench’s pass^k (arXiv:2406.12045) — separates “never clears” from “unreliable,” composable with a phase vector. This general, non-security agent-eval literature is the basis for the verdict below on its own. Academic, cited for context only, not a basis: Cybench subtasks (arXiv:2408.08926), AutoPenBench milestones (arXiv:2410.03225), NYU CTF Bench (arXiv:2406.05590), and EnIGMA’s “soliloquizing” fabrication finding (arXiv:2409.16165, ICML 2025) happen to converge on a near-identical F1–F4-shaped split, which is a reassuring coincidence, not evidence this project’s verdict depends on.
Evidence against itNone — every paper that ships it treats it as strictly additive, diagnostic-only, never a substitute for the terminal checkThe 2025–2026 PRM-hacking convergence above (§3); ceiling-capping via HIRO-style subgoal drift. (The observation that domain-specific CTF/pentest training papers are uniformly monolithic when they report strong numbers is academic context only, not part of this basis — see §3.)
Honest limitsMatcher/judge reliability is the new bottleneck one level down — an LLM-judge-scored phase check inherits some of the flag-matcher’s fragility (MAST’s own top category is “task verification” failure); require a corroborating TOOL-kind span, not LLM-only reasoning, for any phase claiming environment interaction. Per-stage sample sizes shrink fast in a funnel — apply the same pass@k confidence-interval discipline already used project-wide. Phase credit can mislead if not cross-checked against the final flag (treat it as diagnostic under flag_verified, never a replacement).Safe only via a provably policy-invariant mechanism (potential-based shaping / RUDDER) — anything softer (a per-step LLM-judge “does this look like competent recon” score) inherits a decade-plus of documented gaming behavior.
VerdictYes, unconditionally, do it now.Conditional — see §5.

The theory’s own framing of why these are different decisions: per-stage evaluation is just better logging — nothing is being optimized against it, so it carries none of the correctness burden. Per-stage reward is where every failure mode above lives, because now something in the loop is being optimized against the signal. This is why the project brief is right to force these into two separate decisions.

The one empirical finding that turns this from philosophy into an operational rule. A controlled study across the RL design space on TravelPlanner finds: “reward and algorithm choices are scale-dependent — smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense [outcome-only-adjacent] rewards.” arXiv:2603.21972. This is directly checkable via the eval funnel: is the current baseline “occasionally stumbles onto stage 3” (favors staged help) or “reliably reaches stage 3/4, fails to convert” (favors leaving outcome reward alone and attacking execution depth via data/SFT)? The project’s own diagnosis — “largely an execution gap” — leans toward the latter, but this is an empirical call the funnel should confirm, not an assumption to bake in from literature alone.


5. The verdict for this project

QuestionVerdictConfidence
EVAL-decomposition — measure recon / endpoint-discovery / vuln-ID / exploit / pivot independently, on top of flag_verified?Yes. Do it now, unconditionally. Near-free (read-only pass over existing traces), zero effect on training dynamics, independently reinvented by every serious CTF/agent benchmark that hit this problem before this project.High
TRAINING-decomposition — replace/augment the terminal flag reward with four separate per-stage rewards, curricula, or policies?No, not as a wholesale redesign — but a narrow, provably-safe form (potential-based milestone shaping, layered on top of the flag reward, never instead of it) earns its keep once the eval funnel shows an exploration-dominated bottleneck.Medium (conditional, not universal)

The concrete next step this whole verdict depends on

The project’s own PTES matcher schema (benchmark/challenges/*/challenge.json, ptes.<phase>.steps[]) was arrived at independently, before this literature review, on the general non-security basis in §4 (AgentBoard, MAST, AgentErrorTaxonomy). Academic, cited for context only, not a basis: it happens to structurally resemble the Cybench-subtask / AutoPenBench-milestone design too. What’s missing is aggregation: run the existing triage subagent over completed runs and emit one funnel row per challenge (five phase-reached booleans + an exploit-given-vuln-found conditional rate), rolled up into a corpus-level funnel. This single aggregation is the input every downstream decision below depends on — it settles empirically whether the ~100-200/1000 solve rate is F1-dominated, F2/F3-dominated, or F4-dominated, which the flag_verified column alone cannot supply no matter how much data accumulates.

  1. Decompose the eval fully, now, unconditionally (§4). Cross with the project’s own pass@k methodology rather than one aggregate pass@5, per tau-bench’s pass^k precedent.
  2. Keep the terminal flag reward as the ground-truth backbone, unconditionally. Nothing in this dossier argues for demoting it below a milestone signal.
  3. Run rejection-sampling SFT on own verified solves as already planned — but keep it light. The general, non-security basis for this move is the frontier self-training line: STaR (arXiv:2203.14465, Zelikman et al. 2022) — the seminal “generate, keep only what’s verified correct, fine-tune, repeat” loop — and ReST^EM (arXiv:2312.06585, Singh et al., DeepMind 2023) — the frontier-lab scaling result showing this expectation-maximization-style self-training on a model’s own correct samples beats training on human data alone, on math/code reasoning, no cybersecurity domain involved. Academic, cited for context only, not a basis: CTF-Dojo and Cyber-Zero report the same pattern (~500 verified trajectories → double-digit gains, no staged reward) inside the CTF/pentest domain specifically — a reassuring domain-match, not the reason to do this. Respect the Llama 4 warning: don’t over-train on the easy/repetitive subset — it narrows the exploration space the subsequent RL stage needs.
  4. Add curriculum sequencing before touching the reward function at all. The single lowest-risk lever available — no new reward, hence none of §3’s hacking surface. Order by whichever axis the funnel identifies as the bottleneck.
  5. Only if the funnel shows an F1 (exploration)-dominated bottleneck, and only via a ground-truth mechanism: add potential-based milestone shaping on top of the terminal reward. Define Φ(s) as a monotonic count of deterministically-verified stage completions (same verification contract the flag oracle already uses — server-side checks, not judge opinions), paid once per stage-transition, never re-collectable. Do not build this for stage 3 (vuln identification) specifically — VPR’s own authors flag exactly this stage-shape (“identify which of several candidates is vulnerable”) as the “open, unstructured” regime their method doesn’t yet solve well; keep that stage eval-only until a genuine deterministic check exists.
  6. Explicitly do NOT build a learned/LLM-judge per-stage reward model. Every citation in §3’s honest-limits section converges on this being the failure mode to avoid.
  7. If the funnel instead shows an F2/F3 (execution-depth / tool-policy)-dominated bottleneck — which the project’s own current diagnosis (“largely an execution gap”) suggests is more likely — the evidence base points away from reward decomposition and toward better trajectory curation and more/better SFT data, not a training-loop change.
  8. When entropy collapses under GRPO (the project’s own stated graduation trigger), watch stage-transition tokens specifically — a badly-shaped milestone reward is an easy, low-entropy shortcut to farm, and would accelerate collapse.

Deliberately not recommended as a first move: standing up four independently-trained sub-policies with four separate critics (the full options/ArCHer/HiPER/FeUdal-style architectural decomposition). Real, actively converging in the literature, and MiRA’s 6.4%→43.0% number is the strongest existence-proof in this whole dossier that monolithic reward can leave a large gap on the table — but every one of these is validated on web-navigation or generic agentic benchmarks with crisp, cheap-to-verify milestones, not on an offensive-security CTF corpus. Highest-upside, least-validated-for-this-domain lever here — a candidate for a later, small, gated experiment, not step one.

The decision, as a diagram

flowchart TD
  Start["Failing challenge / corpus\nunder diagnosis"] --> EvalDecomp["Step 1 — decompose the EVAL\n(PTES funnel, near-free)\ndo this unconditionally"]

  EvalDecomp --> Funnel{"Funnel shows which\nbottleneck dominates?"}

  Funnel -->|"F1: rarely reaches\nthe vulnerable endpoint"| ScaleCheck{"arXiv:2603.21972 —\nweak policy, capacity-limited?"}
  Funnel -->|"F2/F3: reaches it,\nfails to convert / clumsy tools"| Mono["Stay monolithic.\nInvest in trajectory curation +\nrejection-sampling SFT data\n(STaR / ReST-EM pattern)"]
  Funnel -->|"F4: no pivot after\na foothold"| Curric["Curriculum first\n(1-hop before 2-hop pivot chains,\nno reward change)"]

  ScaleCheck -->|"yes — occasionally\nstumbles onto it"| Shape["Potential-based milestone\nshaping ON TOP OF the flag reward\n(Ng/Harada/Russell 1999 — provably\npolicy-invariant), NOT stage 3"]
  ScaleCheck -->|"no — already capable,\njust unreliable"| Mono

  Shape --> Guard["Guard: ground-truth verifier only,\nnever a learned/LLM-judge score\n(PURE / Reward-Under-Attack / MONA)"]
  Mono --> Entropy["Watch entropy at GRPO\ngraduation regardless of path taken"]
  Curric --> Entropy
  Guard --> Entropy

  classDef safe fill:#132b22,stroke:#34d399,color:#eafaf3;
  classDef risk fill:#3a1414,stroke:#f87171,color:#fde8e8;
  class EvalDecomp,Curric,Mono safe;
  class Shape,Guard risk;

Contested point, stated plainly: there exists one paper doing genuine kill-chain-staged RL reward inside an LLM+RL hybrid (arXiv:2605.17075, cyber-defense red-teaming) that on its face looks like support for training-decomposition on an adjacent task. Per the project’s standing rule it carries no evidentiary weight here regardless — it’s academic cybersecurity-LLM work, cited for context only. The Medium-confidence, conditional verdict above does not rest on it; it rests on the general theory (potential-based shaping’s policy-invariance guarantee, HIRO’s ceiling-capping mechanism, the PRM-hacking convergence) and on this project’s own diagnosis. Demoting this citation changes nothing about the verdict.


  • Diagnosing the gap — a scientific framework — the pass@k / Pass@(k,T) / Cover@τ protocol that tells you which gap (knowledge / execution / exploration) a challenge subtype actually has; run this before deciding whether an F1-dominated funnel result calls for exploration-RL or just more samples. That chapter’s routing test and this chapter’s eval-funnel are complementary diagnostics, not competing ones.
  • RL that creates value — long-horizon, exploration, reasoning, novelty — the mechanics of how to fix an exploration or credit-assignment gap once diagnosed here (GiGPO step-level credit, DAPO, entropy instrumentation, ArCHer, curriculum-band filtering) — this chapter answers whether to decompose; that one answers how to execute the fix on the training-loop mechanics.
  • Agentic & multi-turn RL — the missing category — the training-loop shape (turn as the unit of advantage) that any of §5’s mechanisms (potential-based shaping, curriculum) has to be implemented inside.
  • Contested edges & landmines — the “does RL create capability or just amplify it” fight this chapter’s scale-dependence finding (arXiv:2603.21972) directly informs.

Bibliography (all traced to a verified source file, 2026-07-02)

CitationarXiv / DOIRole here
Sutton, “The Bitter Lesson”no arxiv; incompleteideas.netDon’t hand-author structure that plateaus
DeepSeek-R1 / R1-Zero2501.12948Rejects neural PRM at scale; reward-reliability admission
OpenAI Deep Researchsystem card, no arxivEnd-to-end beats manual orchestration
Llama 4 post-trainingai.meta.com blog, no arxivHeavy SFT/DPO caps RL exploration ceiling
Kimi k1.52501.12599Outcome-only + long context beats PRM/MCTS/value-fn
Go-Explore1901.10995 / Nature s41586-020-03157-9Pure outcome RL can structurally fail to find sparse reward
OpenAI Five1912.06680Long-horizon precedent needed huge scale + denser reward
Options framework (Sutton/Precup/Singh 1999)AIJ 112, DOI 10.1016/S0004-3702(99)00052-1Seminal HRL / temporal abstraction
FeUdal Networks1703.01161Manager/Worker HRL, option-collapse fix
ArCHer2402.19446LLM-native 2-level value function HRL
HiPER2602.16165Hierarchical advantage estimation, +6.6–8.3%
MiRA2603.19685Milestone reward, 6.4%→43% WebArena-Lite
Ng, Harada, Russell — reward shapingICML 1999, no arxivPotential-based shaping theorem (policy-invariant)
Müller & Kudenko2502.01307PBRS effectiveness depends on potential scaling
RUDDER1806.07857Learned return-equivalent redistribution
Randlov & Alstrom (bicycle shaping)ICML 1998, no arxivCanonical non-potential-based shaping failure
Verifiable Process Rewards (VPR)2605.10325Safe ground-truth process reward, open-env caveat for stage 3
CM2 checklist rewards2602.12268Checklist-style verifiable sub-criteria
Curriculum Learning (Bengio et al.)ICML 2009, no arxivFoundational curriculum citation
h12510.07312Curriculum + outcome-only, exponential sample-complexity gain
FastCuRL2503.17287Context-length curriculum, entropy-collapse timing
BPO2508.03018Curriculum + rejection-sampling refine, near-identical to project plan
PURE / Stop Summation2504.15275Sum-form PRM hacking mechanism, named
Reward Under Attack2603.06621PRMs as fluency detectors, adversarial hackability
Gao et al., designing RL reward2410.15115Learned PRM+success reward can hurt vs success-only
PRIME2502.01456Authors’ own admission of PRM hacking vulnerability
MONA2501.13011Multi-step reward hacking even with no bad-looking single step
HIRO1805.08296Off-policy correction, HRL non-stationarity / ceiling-capping
AgentBoard2401.13178Progress-rate metric, general capability-decomposition principle
MAST2503.1365714-mode/3-category failure taxonomy
AgentErrorTaxonomy / AgentDebug2509.25370Root-cause diagnosis gains without reward change
Phase-aligned taxonomy (autonomous agents)2508.13143Independent-domain convergence on phase-keyed failure
Cybench (academic — context only, not a basis)2408.08926Subtask decomposition, eval-only — reassuring convergence with AgentBoard/MAST, not evidence relied on
AutoPenBench (academic — context only, not a basis)2410.03225Milestone taxonomy near-matching F1–F4, eval-only — same caveat
NYU CTF Bench (academic — context only, not a basis)2406.05590CTF benchmark family
EnIGMA (academic — context only, not a basis)2409.16165“Soliloquizing” fabrication failure mode
tau-bench2406.12045pass^k reliability decomposition
InterCode-CTF (academic — context only, not a basis)2306.14898Seminal monolithic-reward CTF environment
CTF-Dojo (academic — context only, not a basis)2508.18370Monolithic rejection-sampling SFT, +11.6% — domain-match only; real basis is STaR/ReST-EM below
Cyber-Zero (academic — context only, not a basis)2508.00910Monolithic, simulated env, +13.1% — same caveat
Pentest-R1 (academic — context only, not a basis)2508.07382Two-stage curriculum, monolithic per-stage reward
HackSynth-GRPO (academic — context only, not a basis)2506.02048Outcome-only GRPO sufficient for single-stage CTF
STaR2203.14465Seminal general (non-security) rejection-sampling self-training loop; basis for §5’s SFT-on-own-solves recipe
ReST-EM (“Beyond Human Data”)2312.06585DeepMind frontier-lab scaling result for self-training on own correct samples, math/code domain
Kill-chain-staged reward (red-teaming) (academic cybersecurity-LLM — context only, not a basis)2605.17075The one LLM+RL staged-reward paper, adjacent domain, un-ablated
DRLRM-PT (reward machine, pentest) (academic — context only, not a basis)DOI 10.1109/ijcnn60899.2024.10650368Classical RL, staged reward helps, non-LLM regime
Node-fragility reward shaping (academic — context only, not a basis)DOI 10.3390/electronics13214311Classical dense-reward pentest, non-LLM regime
DAgger1011.0686Compounding error / distribution drift theory
GAE1506.02438Bias/variance dial for advantage estimation
Credit Assignment survey2312.01072Separates credit assignment from exploration
Demystifying long-horizon tool-use RL2603.21972Scale-dependence: staged reward helps weak models only

The decision

The whole book collapses to one diagnostic and its branches. The routing question — does the correct action ever appear in π_θ’s own outputs at high N? — is what separates a knowledge gap (inject off-policy) from an execution gap (on-policy) from a ranking gap (preference). The principle underneath is “GRPO amplifies existing capabilities, SFT replaces them” (arXiv:2507.10616): you can only reinforce what already fires.

That routing question has a sharper, task-structure-dependent version once you sample at high N and track when in the episode the correct action shows up — an execution gap can hide an exploration gap underneath it. The tree below now branches on that.

The rigorous version of this page lives in From behavioral audit to training signal. That chapter runs all 5 BSides patterns through gap-type → training-signal → verification-check, cites the pass@k / Pass@(k,T) / Cover@τ instrumentation this page’s routing question is a shorthand for, and is where you go before trusting any single branch below as final for a given challenge.

graph TD
  Q0["Does the correct action ever appear in π_θ's<br/>own outputs — even rarely — at high N?"]
  Q0 -->|"Never, not once"| K["KNOWLEDGE gap"]
  Q0 -->|"Sometimes (solves occasionally)"| E{"Have a model that's<br/>actually better at your CTFs?"}
  Q0 -->|"Knows it, mis-picks / quits early"| R["RANKING gap"]

  K --> KM["Inject OFF-POLICY:<br/>SFT / teacher data / a TOOL<br/>(RL can't cheaply conjure it)"]
  E -->|Yes| DIST["On-policy distillation<br/>dense · ~10× cheaper than RL<br/>(promising, not yet lab-proven)"]
  E -->|"No, only a verifier"| Q1{"Is the winning path single-shot,<br/>or compositional / sequentially-gated<br/>(enumeration must land before<br/>exploitation is even visible)?"}
  R --> RM["DPO / KTO<br/>(KTO for unpaired good/bad logs)"]

  Q1 -->|"Single-shot — recon isn't gating"| RS["Rejection-sampling SFT<br/>→ GRPO when entropy collapses"]
  Q1 -->|"Sequentially-gated —<br/>PTES pattern 4: 62% of<br/>failures stall in exploitation"| XG["EXPLORATION gap:<br/>reaches the right region,<br/>then collapses / never enumerates"]

  XG --> XGM["On-policy RL FIRST, not more SFT —<br/>matched-data SFT regresses this exact<br/>subset (net −4) while RL expands it<br/>(net +4): self-directed exploration<br/>during rollout is the causal ingredient,<br/>not demonstrations"]

  classDef your fill:#132b22,stroke:#34d399,color:#eafaf3;
  class E,Q1,RS,XG your;

Reading the branches

  • Knowledge gap — nothing on-policy to reinforce. Inject by demonstration, or cheaper, put the missing fact in a tool (“knowledge in tools, not weights” — llmresearch-handbook.md rule 5). Standard RLVR won’t cheaply create it — see the contested boundary in Contested edges (Yue et al., arXiv:2504.13837).

  • Execution gap + stronger teacheron-policy distillation: dense signal on your own rollouts (GKD arXiv:2306.13649); flagged promising, not lab-confirmed.

  • Execution gap + only a verifier, single-shot paththe common case. Rejection-sampling SFT on your ~100–200 solves → graduate to GRPO/RLVR when policy entropy collapses (arXiv:2504.11343).

  • Exploration gap — an execution gap that’s actually sequentially-gated [E][L][N] — the naive test (“solves occasionally at high N”) says execution gap, but if the winning path requires correct enumeration at turn 5 before turn 40’s exploit is even visible, that’s a different beast. Zhai et al.’s Pass@(k,T) analysis, arXiv:2604.14877 (2026-04-16, single-group, promising) found that on this exact task shape (“Category C” — compositional, sequentially-gated retrieval), the RL pass-curve pulls away from the base curve as k grows — real capability expansion — while matched-data SFT on the same task regresses it (net −4 vs RL’s net +4). The causal factor they isolate is self-directed exploration during on-policy rollout, not exposure to more demonstrations. Practically: don’t spend your rejection-sampling budget flattening this subset first — it needs GRPO’s exploration before SFT saturates it, the opposite ordering from the single-shot branch above.

    Designed to fix: patterns 1, 2, 4 — tool-space under-exploration (87.7% curl/shell bypass), no PTES enumeration before pivoting (82% pivot-after-one-failure), and the 62%-of-failures-stall-in-exploitation split. All three are the same mechanism at different granularities: Cui et al.’s entropy law, arXiv:2505.22617 (R = -a·exp(H) + b) predicts the policy trades away exactly this enumeration/tool-diversity budget for reward as entropy collapses — so plain GRPO on this subset needs DAPO-style clip-higher/dynamic sampling (arXiv:2503.14476) from the start, not as a later patch. Full technique-by-pattern mapping (ToolRL, GiGPO, RL-PLUS, NuRL, plus Pentest-R1 — academic, cited for context only, not a basis) is in the diagnosis chapter, not repeated here.

    Contested / not settled: this is the sharpened, task-conditional version of the “RL can’t create capability” debate in Contested edges §1 — Yue et al.’s crossover result (elicit-not-expand) holds on the static/independent-retrieval task shape; Zhai et al.’s result (genuine expansion) holds on the compositional/sequentially-gated shape. Same instrument, opposite conclusion, and the split is the falsifiable variable — segment your own 1000 challenges by this criterion before trusting either reading wholesale.

  • Ranking gapDPO/KTO; KTO fits your unpaired solved/failed logs (arXiv:2402.01306).

One prerequisite before any of this

Your 10–20% aggregate is a k=1 portfolio statistic, not a per-challenge pass rate. Before choosing a branch, run pass@k per challenge and bucket by difficulty — the 30–60% band is a per-group property that GRPO needs, and a challenge that “solved once” may be a 30% target that got lucky (a prime RL candidate), not a done deal (lessons/post-training/rl-candidate-selection-from-passk.md, shared memory). Diagnose per-challenge, then route.

The single-shot-vs-sequentially-gated split above needs the same per-challenge treatment: label each challenge by whether its winning path is recon-gated (turn-5 enumeration must land before turn-40 exploit is reachable) or not, before running pass@k — the split determines which axis of the tree you’re even on. Two cheap refinements to the pass@k check, both eval-only (no training change): Cover@τ (Dragoi et al., arXiv:2510.08325) flags challenges where a high pass@64 is really “guessable by brute force” rather than genuinely reliable — don’t rejection-sample SFT on those, you’ll just teach confident guessing; and running the base model’s own pass@k as a control (per Yue et al., arXiv:2504.13837) tells you whether a claimed post-SFT gain on a given challenge is elicitation or noise before you credit it to the pipeline.

The interactive version

The same tree, clickable, is in The 5-minute journey (final section) — answer it for your own failing challenges.

Contested edges & landmines

The places where confident-sounding claims are actually unsettled, plus the terminology traps that cause real planning errors. Cited so you can check me.

1. “RL can’t create capability” — contested, not a law

  • Elicit-not-expand (the base claim): RLVR raises pass@1 but the base model beats the RL model at large pass@k — the paths RL finds were already in the base distribution; the reasoning boundary narrows with training. Distillation from a stronger teacher does expand it; RL does not (Yue et al., “Does RL Really Incentivize Reasoning Capacity Beyond the Base Model?”, arXiv:2504.13837, NeurIPS 2025). Reproduced for vanilla GRPO by others (e.g. NuRL notes plain GRPO leaves pass@1024 ≈ base, arXiv:2509.25666).
  • The counter: prolonged RL + KL control + reference resets expands the boundary even on problems the base never solves (ProRL, arXiv:2505.24864); entropy/exploration bonuses and parameter-space noise (arXiv:2602.02555) show similar. Also a metric critique: pass@k over-credits lucky-but-wrong CoT (CoT-Pass@K, arXiv:2506.14245).
  • Safe framing: vanilla RLVR at a normal budget elicits; sufficient compute + explicit exploration-preservation can expand — recipe-dependent, not settled. So “you can’t RL your way out of a missing capability” holds for standard-recipe RL only. Don’t state it as a law.

2. The “RFT” terminology landmine

Two different things share the acronym; conflating them mis-scopes a whole plan:

  • Rejection-sampling Fine-Tuning (STaR-family) = “RL without RL,” positives-only SFT on your own verified samples. Cheap.
  • Reinforcement Fine-Tuning (OpenAI/Fireworks product term — and what the project handbook calls “RFT”) = actual online RL / GRPO against a grader. Expensive.

“Start with RFT before RL” only parses under the first. Say “rejection-sampling SFT” for the cheap thing so you don’t accidentally spec a GRPO run.

3. On-policy distillation is promising, not lab-proven

Earlier framing called it “the sleeper.” Correction after a 2026 verification pass: the efficiency numbers (~9–30× cheaper than RL) come from GKD + Thinking Machines’ own blog (arXiv:2306.13649; thinkingmachines.ai 2025-10-27). No frontier lab has stated it as their production recipe. Real and attractive; treat the numbers as directional, not settled.

4. Don’t over-SFT before RL (2026 lesson)

Meta’s Llama 4 recipe deliberately keeps SFT and DPO lightweight around an intensive online-RL core, with the explicit finding that heavy SFT/DPO restricts RL exploration (ai.meta.com/blog/llama-4-multimodal-intelligence). If RL is your capability driver, a big SFT stage can cap your ceiling — counter to the naive “more SFT is safer.”

5. Reward must be ground-truth-verified, never format-matched

A project-empirical finding: SFT on trajectory data trained the model to emit FLAG{…}-shaped strings on unsolved challenges — confabulation — and a loose regex matcher fired on the model’s own reasoning/tool-args, not real server output (lessons/post-training/sft-induced-flag-confabulation.md, lessons/security-agent/flag-detection-false-positives.md, shared memory). Rule: scan tool output / server state for the flag and verify against ground truth; never reward format. On the gameability ladder, a deterministic verifier (level 1) has no parameters to exploit — stay there; PRM (level 5) was rejected for R1 for exactly this (arXiv:2501.12948).

6. “Three knobs” was a teaching scaffold

The on/off-policy axis and the imitation/preference/reward paradigm split are canonical. Packaging them as “N independent knobs you toggle” is a scaffold that over-reached — the axes aren’t independent (signal + policy largely determine what changes), so free combinations produce non-methods. Learn the one axis + the fixed method presets, not a combinatorial grid.

7. Pass@k-as-diagnostic has its own landmines — don’t trust a bare crossover plot [E][R]

Point 1’s crossover test (base pass@large-k beats RL pass@large-k → RL only reweighted, didn’t teach) is the closest thing to a standard instrument in this literature, but it is contested on at least four fronts, each of which changes what conclusion you’re entitled to draw from your own portfolio’s pass@k curves:

  • The metric credits lucky final answers. CoT-Pass@K, arXiv:2506.14245 shows pass@k gives full credit to a correct terminal answer reached via a wrong reasoning chain — once you require the CoT itself to be correct, the base-vs-RL crossover disappears and RLVR shows monotonic gains at every k. This directly re-opens point 1’s “elicit vs expand” question: some of the reported “RL doesn’t expand the boundary” results may be an artifact of not checking how the answer was reached. Your flag verifier removes the lucky-final-string confound (exact match, not “42” appearing by chance) — but a verifier-passed CTF trajectory can still contain wasted turns or an ungrounded critical guess before the winning move (BSides pattern 3), so the same confound reappears one level up: filtering your rejection-sampling SFT set on flag==1 alone is the CTF-domain analogue of trusting bare pass@k.
  • Pass@k at large k conflates “solvable” with “brute-force-guessable.” Cover@τ, arXiv:2510.08325 proposes requiring a τ-fraction of samples (not just ≥1 of many) to be correct — reordering which RLVR algorithms “win” once you penalize guessing instead of rewarding eventual luck. Directly relevant to BSides pattern 5 (benchmarks measure pattern-match speed, not thoroughness): a challenge with high pass@64 but near-zero Cover@0.3 is guessing-dominated, and training on its “wins” teaches confident guessing, not competence.
  • Optimizing pass@k directly has a vanishing-gradient trap. Naive pass@k-as-objective is mathematically a per-example reweighting of plain pass@1 whose gradient goes to zero exactly where exploration is most needed — once the policy concentrates, pass@k and pass@1 converge and there’s nothing left to reweight (arXiv:2511.16231). PKPO, arXiv:2505.15201 fixes this with an unbiased, low-variance pass@k gradient estimator — so “just add a pass@k reward” is not a free entropy-preservation trick; it needs the right estimator or it’s a no-op past the point you needed it.
  • The agentic extension changes the verdict entirely. Pass@(k,T), arXiv:2604.14877 — a two-axis metric varying sampling budget k and interaction-depth T — finds the static-reasoning crossover (point 1) is task-structure-dependent: on tasks needing compositional, sequentially-gated information-gathering, the RL pass-curve pulls above and away from base as k grows (the opposite of Yue et al.’s finding), while on independent-retrieval tasks the effect is small, replicating Yue et al. This is the strongest bridge between the pure-reasoning literature and this project’s own agentic setting — see point 9 below.

Designed to fix: pattern 5 (benchmarks measure pattern-match speed, not thoroughness) — CoT-Pass@K and Cover@τ are both direct literature-side instantiations of that same audit finding, applied to the training/eval metric itself rather than the benchmark.

Safe framing: treat the pass@1-vs-pass@k gap as a diagnostic to segment by challenge type, not a single portfolio-wide verdict. A shrinking gap with flat pass@1 is the entropy-collapse warning sign (point 8 below), not “the model learned the task.” Never conclude “RL only elicited, didn’t expand” from an aggregate pass@k plot without checking (a) whether the crossover survives a CoT-correctness filter and (b) whether it’s driven by the compositional or independent-task subset.

8. Long-horizon credit assignment: turn-level vs trajectory-level vs sequence-level — no consensus on the right granularity [L]

Every flagship reasoning-RL recipe (GRPO, DAPO, Dr.GRPO) assigns one advantage to the whole trajectory — fine for a single-turn math answer, but it starts to matter once episodes run 100 turns with a terminal-only flag reward. The literature has responded with at least three different, non-convergent fixes, each validated on a different (mostly non-CTF, mostly short) domain:

  • Go finer — turn-level advantage. arXiv:2505.11821 shows trajectory-level GRPO applied naively to multi-turn tool-use can fail to teach tool invocation at all (baselines get 20-30% exact-match and never learn to call tools; their turn-level MT-GRPO variant hits 100% tool-execution success). Turn-PPO, arXiv:2512.17008 independently argues PPO-with-a-critic, reformulated so the MDP’s base unit is a turn (not a token), is more robust than GRPO for long-horizon agentic tasks — a direct challenge to defaulting to critic-free GRPO. Both are recent, small-scale (workshop poster / 0-citation preprint, toy benchmarks WebShop/Sokoban) — promising, not settled.
  • Go coarser — sequence-level ratio. GSPO, arXiv:2507.18071 goes the opposite direction: clip and optimize at the whole-response (sequence) level instead of per-token, because token-level importance ratios compound multiplicatively over long sequences and destabilize MoE RL training at scale — credited with letting Qwen3’s RL stage not destabilize. Backed by a shipped frontier model, not a toy benchmark — stronger evidence than Turn-PPO’s, but solving a different problem (numerical stability of the ratio, not credit assignment across turns): the two proposals are not mutually exclusive, but they are not the same fix either.
  • Fix the critic, don’t change the granularity. VAPO, arXiv:2504.05118 keeps trajectory-level PPO but argues the real problem is an unreliable critic at long/heterogeneous horizons — fixed with value-model pretraining + length-adaptive GAE (λ tuned per response length) — reporting zero training crashes across independent runs, directly disputing the “value-based RL is unstable for LLM reasoning” folklore GRPO was invented to route around.
  • Scale the horizon itself, skip the value function entirely. Kimi k1.5, arXiv:2501.12599 treats context/horizon length as a first-class RL scaling axis (not a constraint), using partial rollouts (checkpoint/resume mid-episode) to make 128k-context RL tractable — explicitly avoiding MCTS, value functions, and PRMs. Reframes “the episode is long” as an opportunity, contingent on whether security-agent-<family>/pdq can support partial-rollout checkpointing (unanswered today).
  • Curriculum over the horizon. AgentGym-RL / ScalingInter-RL, arXiv:2509.08755 sidesteps the granularity debate entirely: cap the allowed turn budget low early in training and relax it toward the full target (100 turns) as training proceeds, reporting this prevents the long-horizon collapse that training at full horizon from step one causes.

Designed to fix: pattern 1 (agents prefer their own raw tools, 87.7% bypass the rich surface, 26/40 tools dead) — the turn-level papers’ headline finding (trajectory-level GRPO can fail to teach tool invocation at all) is a plausible root-cause mechanism for tool disuse, not just a scaffolding/prompting issue, if this project ever RL-trains on tool-use.

Confidence: contested by construction — no single paper compares turn-level vs sequence-level vs trajectory-level-with-a-better-critic head-to-head on the same long-horizon agentic benchmark. Most of this cluster is 2025 H2–2026 preprints with 0 citations at verification time, validated on toy environments (WebShop, Sokoban) or math/code, not a 100-turn CTF setting. Treat as “candidate designs to pilot cheaply,” not a default architectural choice — and note turn-level, sequence-level, and critic-repair are not mutually exclusive; a future GRPO/RLVR run here could combine GSPO’s sequence-level clipping (stability) with a turn-level auxiliary tool-invocation reward without contradiction.

9. Do exploration bonuses genuinely EXPAND [N] the boundary, or just elicit what the base model already has? — open question, no paper has run the decisive test on this domain

Point 1 already flags “RL expands vs. only reweights” as contested between Yue et al. and ProRL. The exploration-specific literature (DIVER, CDE, MERCI, PSN-RLVR) all claim their intrinsic-reward/parameter-noise mechanism helps the policy “escape local routines” or “discover better solutions” — language asserting [N] (boundary expansion) rather than mere elicitation. That claim deserves the same skepticism point 1 applies to vanilla RLVR:

  • None of the exploration-bonus papers ran the decisive ablation. The rigorous test for “did this expand the boundary or just elicit/redistribute” is a pass@large-k comparison against the base model (point 1’s own protocol) — DIVER (arXiv:2509.26209), CDE (arXiv:2509.09675), MERCI (arXiv:2510.16614), and PSN-RLVR (arXiv:2602.02555) all compare against vanilla-GRPO/DAPO baselines, not against a very-large-k base-model ceiling. Beating a collapsed-entropy baseline is a much lower bar than beating the base model’s own pass@1024 — and per Spurious Rewards, arXiv:2506.10947, even a completely wrong reward can look like it’s “unlocking” capability on the right base model (Qwen2.5-Math specifically; does not replicate on Llama3/OLMo2) — a stark warning that “the policy now solves things it didn’t before” is not sufficient evidence of genuine novelty without a same-model-family, large-k, base-vs-trained comparison.
  • The capability-elicitation / AI-safety literature already built the falsification protocol for exactly this question — password-locked models (arXiv:2405.19550), harder circuit-broken organisms (The Elicitation Game, arXiv:2502.02180), and the “elicit within <1% of training cost” operational definition of latent capability (AI Sandbagging, arXiv:2406.07358) — all built around known-ground-truth hidden capabilities specifically to distinguish “the technique surfaced something already there” from “the technique taught something new.” No exploration-bonus paper in the RLVR-entropy literature has been tested against a model organism with known injected/withheld capability the way this sub-field requires before it will accept an expansion claim.
  • The one paper that ran something close to the decisive test, on an agentic/compositional task, found genuine expansion — but it’s a single, very recent result. Pass@(k,T), arXiv:2604.14877 (point 7 above) shows RL pulls ahead of base-model pass@k on compositional, sequentially-gated tasks, and — critically — that matched-data SFT on the same tasks regresses the capability boundary (net −4 vs RL’s net +4), isolating self-directed exploration during RL, not exposure to more data, as the causal factor. This is the strongest evidence in the entire corpus that exploration specifically (not RL in general, not more data) is what expands the boundary — but it is one paper (2026-04-16), one research group, unreplicated, studying retrieval-style compositional tasks, not cybersecurity.

Designed to fix: pattern 1 (tool avoidance) and pattern 3 (good guessers until they’re not) — DIVER’s pairwise-diversity-of-a-group reward and CDE’s perplexity-based actor bonus are both pitched as countering exactly these behavioral patterns. That framing may be correct as an elicitation mechanism (surfacing tool-diverse or better-calibrated behavior the base model can already produce with a nudge) even if the “expands the boundary” language in the papers’ abstracts is not yet earned.

Confidence: genuinely open. The honest position for this project: exploration bonuses are worth piloting (cheap, mechanistically motivated, all portable per the exploration research thread) — but do not claim any of them “expand the capability boundary” [N] until you’ve run a pass@large-k-vs-base-model check on the same challenge subset (point 1/7’s protocol), ideally segmented by task compositionality the way Pass@(k,T) recommends. Absent that check, the safer verb is “elicit” (base capability present, surfaced more reliably), matching this project’s own competence/performance framing, rather than “expand” (base capability genuinely absent, newly created).

Cybersecurity is one of a family — what cracked the others

Every other chapter in this book treats the CTF task as our problem: our harness, our 1000-challenge portfolio, our flag verifier. This chapter argues the opposite framing is more useful: cyber CTF-solving is one instance of a general problem family — {LONG-HORIZON (up to ~100 turns), EXPLORATORY (search/enumeration over a huge space), SPARSE-TERMINAL-REWARD (only the flag is verified, nothing in between), VERIFIABLE (an ungameable checker)} — and at least six other domains share that exact structural signature. Frontier labs and general RL research have been cracking members of this family for a decade. The move this chapter makes is: hold up each domain, find the stage (pretraining / SFT / RL) and the specific technique that actually fixed its long-horizon/sparse-reward problem, and rank what transfers.

Stance, honored throughout: academic cybersecurity-LLM projects (CTF-Dojo, Cyber-Zero, Pentest-R1, HackSynth, AutoPenBench, DRLRM-PT, and siblings) are mention-only, labelled “academic, not a basis” below — no conclusion here rests on them. General coding, competitive programming, theorem proving, deep-research/web agents, games, and robotics are not academic-security work — they are exactly the frontier-lab and general-RL-theory evidence the project’s stance asks for. Everything below is cross-linked to The path to a frontier cybersecurity model, Diagnosing the gap, RL that creates value, and One problem, or many? — this chapter is the cross-domain evidence layer those four already draw on; it does not re-derive their verdicts.


1. The structural frame

Strip away the domain-specific vocabulary (a “vulnerable endpoint,” a “failing unit test,” a “Lean tactic,” a “hidden test case,” a “Montezuma key”) and every member of this family reduces to the same abstract shape:

flowchart TD
  subgraph ABSTRACT["Abstract class"]
    direction LR
    A1["Long-horizon:\nmany sequential\ndecisions"] --> A2["Exploratory:\nhuge search space,\nmost paths fail"]
    A2 --> A3["Sparse terminal\nreward: only the\nend state is scored"]
    A3 --> A4["Verifiable:\nan ungameable,\nmechanical checker"]
  end
  subgraph CYBER["Our CTF pipeline"]
    direction LR
    C1["Recon /\nenumeration\n(~turns 1-20)"] --> C2["Endpoint /\nvuln discovery\n(~turns 20-50)"]
    C2 --> C3["Exploit\nchain\n(~turns 50-90)"]
    C3 --> C4["Flag read +\nserver-side\nverify {0,1}"]
  end
  A1 -. maps to .-> C1
  A2 -. maps to .-> C2
  A3 -. maps to .-> C3
  A4 -. maps to .-> C4

The point of drawing the arrow this way: the flag verifier is not special — it is our domain’s instance of “the proof kernel,” “the unit-test suite,” “the hidden Codeforces test cases,” “the reference-answer F1 score,” “the win/loss signal.” Every domain below chose (or was forced into) a stage + technique to make progress against that shape. The question this chapter answers is which of those choices generalize.


2. THE ANALOGY TABLE

DomainHorizonReward sparsityExploration burdenVerifierStage + technique that cracked itTransfers to cyber
Long-horizon coding (SWE-bench repo agents)20–80+ tool-call turnsTerminal (tests pass)Which file/function among thousandsExecution (unit tests)SFT on executable-env trajectories (SWE-Gym arXiv:2412.21139, R2E-Gym arXiv:2504.07164) → RL with execution-verified reward (SWE-RL arXiv:2502.18449, o1→o3 arXiv:2502.06807); long-context multi-turn RL needs DAPO-style stabilizers + progressive context curriculum (arXiv:2508.03501)HIGH — closest structural twin. Mirrors our planned pipeline almost exactly.
Particular-language lift (weak PL / weak NL, execution-RL on code)Short (1 program) → multi-turn repair (RLEF)Terminal (tests)Program-space / repair-spaceExecution (compiler/unit tests)Pretraining/CPT does the knowledge lift (corpus coverage — DeepSeek-Coder arXiv:2401.14196, StarCoder2 arXiv:2402.19173); RL fixes execution only — CodeRL arXiv:2207.01780, StepCoder arXiv:2402.01391, RLEF arXiv:2410.02089 — never injects new knowledgeHIGH, but as a diagnostic contrast, not a lever — see §4 below.
Competitive programmingShort per-problem; GrandCode reframes as multi-stage agentic loopTerminal (hidden tests)Program space (millions of candidates)Execution (unit tests)Sampling breadth + cheap filter (AlphaCode, arXiv:2203.07814) then a purpose-built multi-stage GRPO variant for delayed reward + off-policy drift (GrandCode’s “Agentic GRPO,” arXiv:2604.02721)STRONG (conceptually) — validates our rejection-sampling-breadth strategy; GrandCode is the most load-bearing GRPO-variant precedent for our exact delayed-reward shape.
Theorem proving (Lean/Coq)Long, many tactic steps — decomposable into subgoalsPer-step, not just terminal — the kernel checks every stepTactic/proof search spaceDeterministic, per-step, ungameable (proof kernel)Subgoal decomposition for cold-start data + curriculum (DeepSeek-Prover-V2, arXiv:2504.21801); synthetic self-play data generation (AlphaGeometry, Nature 2024, DOI:10.1038/s41586-023-06747-5); then RL vs. binary kernel-verified reward (DeepSeek-Prover-V1.5)PARTIAL — cleanest verifier of any row. Licenses subgoal-decomposition-for-DATA; does not license densifying reward over unverifiable CTF steps (shell/HTTP output isn’t kernel-checkable).
Deep-research / web agents~10–40 turns (search/click/browse)Terminal (task complete)Which query/page/source on the live open webLearned RM (WebGPT, weak) → outcome/ORM (WebRL) → reference-match F1 (DeepResearcher)RL (outcome-only) trained in the real live environment; WebRL’s self-evolving curriculum generated from the model’s own failures (arXiv:2411.02337); R1-Searcher’s sequential (not summed) format-then-outcome staged reward (arXiv:2503.05592)STRONGEST transfer row. Failure-to-curriculum + safe staged tool-use bootstrap, both directly actionable.
Hard-exploration games (Montezuma / Pitfall / NetHack / StarCraft)Very long (1000s–10,000+ actions)Terminal, near-zero for prior baselinesCombinatorial state space; adversarial strategy spaceEnvironment score / win-lossArchive-based “explore, remember, return-then-explore-further” (Go-Explore, arXiv:1901.10995); diverse self-play league vs. naive self-play (AlphaStar, DOI:10.1038/s41586-019-1724-z); scale + dense shaping, no exotic algorithm (OpenAI Five, arXiv:1912.06680)STRONG (Go-Explore) / different axis (league) / honest caveat (NetHack still unsolved).
Robotics (sparse-reward manipulation)Short (tens of steps)Sparse binary terminalContinuous action/goal spaceEnvironment success checkData-relabeling: HER (arXiv:1707.01495) — relabel a failed trajectory’s achieved state as the goal it accidentally satisfiedSPECULATIVE at the mechanism level (off-policy-specific); the idea licenses mining our own flag=0 trajectories for sub-skill SFT data.
Cyber-CTF (academic literature — mention-only)~100 turns (our regime)Terminal (flag)Endpoint/vuln/exploit-chain spaceDeterministic flag verifierMonolithic outcome-only RL / rejection-sampling (CTF-Dojo, Cyber-Zero — academic, not a basis); two-stage offline→online curriculum (Pentest-R1 — academic, not a basis)— target row; convergence with the frontier rows above is a mild corroborating note only, never load-bearing.

3. Per-domain deep-dive

3.1 Long-horizon agentic coding — the closest twin

Problem it fixes: a model that free-runs an unconstrained shell wastes turns on malformed edits and noisy tool output; naive RL on a full 20–80-turn trajectory with only a terminal test-pass reward hits credit-misattribution (an early correct action gets penalized because a later, unrelated action failed).

What fixed it, by stage:

  • Scaffold (pre-RL): SWE-agent’s Agent-Computer Interface (arXiv:2405.15793) — a small fixed action set + concise per-step feedback lifted pass@1 from 3.8%→12.5% on the same underlying LM, before touching weights. Anthropic’s Claude 3.5/3.7 Sonnet scaffold philosophy is the opposite-looking but complementary lesson: deliberately minimal scaffolding (bash + string-replace edit tool), crediting the gain to post-training, not scaffold cleverness.
  • SFT: trajectory distillation on executable-environment corpora is the near-universal first stage — SWE-Gym (arXiv:2412.21139), R2E-Gym (arXiv:2504.07164). SWE-Master (very recent, arXiv:2602.03411, low-confidence) adds a concrete, cheap idea: mask environment-feedback tokens out of the SFT loss — train on the agent’s own actions/reasoning, not on memorizing verbose tool stdout.
  • RL: execution-derived reward beats a learned/similarity proxy when both are available — SWE-Gym’s ground-truth path over SWE-RL’s difflib patch-similarity fallback (arXiv:2502.18449). DeepSWE (RL-only, no SFT, from Qwen3-32B) shows DAPO-style stabilizers (Clip-High, no-KL, compact filtering of failed/timeout trajectories) let RL-only work when the base model already has strong agentic priors. Progressive context/turn-budget curriculum — start RL at a shorter horizon than the full ceiling, extend once performance plateaus — is independently confirmed by two groups (arXiv:2508.03501, and KLong, arXiv:2602.17547, low-confidence but converging).
  • Long-horizon credit assignment specifically: GiGPO (arXiv:2505.10978, NeurIPS 2025) adds a step-level grouping on top of GRPO — group actions taken from repeated “anchor states” across different rollouts, giving fine-grained credit without an extra critic. This is the most mature fix in a fast-moving 2026 cluster (BEACON arXiv:2605.06078, HiPER arXiv:2602.16165, Ecpo arXiv:2606.05885 — all low-confidence individually, but converging on “flat trajectory-only advantage is the open problem”).
  • Test-time (no training): parallel sampling + a verifier to pick the best candidate is a large, cheap, repeatedly-replicated multiplier (Claude “high compute” 63.7%→70.3%; R2E-Gym 34.4%→51%; DeepSWE 42.2%→59%) — orthogonal to whatever training was done.

What I’d change in our pipeline: (1) audit our sectools tool surface against the SWE-agent lesson — concise, structured observations, a “you already tried this” signal, before assuming RL will fix noisy feedback; (2) mask tool/environment-output tokens from our SFT loss if we don’t already; (3) if moving to full-trajectory RLVR, do not expect vanilla GRPO with only a terminal flag-reward to assign credit well across ~100 turns — prototype GiGPO’s anchor-state grouping first; (4) check whether we do reject-and-rescore across pass@k, or only report pass@1/pass@5 independently — the hybrid-TTS multiplier is free money already inside our methodology.

3.2 Particular-language — the knowledge-injection contrast (read this one for the diagnosis framing)

The reason this domain is second, not last: it is the cleanest existing literature answering exactly the question our diagnosis framework poses — is a failure a knowledge gap or an execution gap?

For both a specific programming language and a specific natural language, the field’s converged answer is stage-specific:

  • Pretraining / continued-pretraining injects KNOWLEDGE — corpus composition (how many languages, how much of each) is the lever, not RL, not even SFT. StarCoder (arXiv:2305.06161), StarCoder2 (arXiv:2402.19173), DeepSeek-Coder (arXiv:2401.14196) for code; Sailor (arXiv:2404.03608), SEA-LION (arXiv:2504.05747), LLaMA Beyond English (arXiv:2401.01055) for natural language. Tokenizer/vocabulary coverage sits underneath this stage as an architectural precondition (arXiv:2406.11477) — poor coverage looks like “doesn’t know the language” even when data exists, because every token is spent on fragmented sub-word pieces.
  • SFT/instruction-tuning teaches USE, cheaply, once the base model already has passive knowledge — BLOOM+1’s finding (arXiv:2212.09535) is the sharpest data point: for an already-instruction-tuned model, simply including a new language in the multitask instruction-tuning mixture beat continued pretraining — the cheapest lever to try first, once a base exists.
  • RL (execution/compiler-verified) fixes BEHAVIOR, never knowledge — CodeRL (arXiv:2207.01780), PPOCoder (arXiv:2301.13816), RLTF (arXiv:2307.04349), StepCoder (arXiv:2402.01391), RLEF (arXiv:2410.02089, Meta FAIR, ICML 2025 spotlight). None of these teach new language knowledge — every one explicitly frames the problem as get-the-execution-loop-right on a domain the base model’s pretraining already covers. RLEF in particular is a near-literal preview of our structural frame: multi-turn POMDP, policy emits → executes against public tests → feedback appended → repair → repeat → reward on held-out private tests, solved with a turn-level (not token-level) value function.

The quote-worthy contrast: BLOOM+1’s “put it in the SFT mixture” (an SFT-stage move) vs. StepCoder/RLEF’s “RL never adds knowledge, only sharpens execution of knowledge already latent in the base model.” No amount of GRPO/RLVR on our harness will inject knowledge of a CVE or technique class the base model never saw much of in pretraining — see §4.

3.3 Theorem proving — the cleanest verifier, the strongest calibration check

Formal theorem proving (Lean/Coq) has the single cleanest sparse-reward substrate of any domain: the proof kernel checks every intermediate step, not just the final answer — cleaner even than our flag verifier. DeepSeek-Prover-V2 (arXiv:2504.21801) decomposes a hard theorem into a DAG of subgoals, generates cold-start SFT data by solving each subgoal independently with a smaller model, then RL’s on top with binary kernel-verified reward. AlphaGeometry (Nature 2024, DOI:10.1038/s41586-023-06747-5) goes further: synthetic self-play data generation that manufactures its own training problems, not just solutions to given ones — escaping a human-demonstration data-scarcity floor entirely. AlphaProof (Nature 2025, DOI:10.1038/s41586-025-09833-y) applies AlphaZero-style self-play/search on top.

The honest limit, stated precisely: the transferable lever is not “densify reward the way theorem proving does” in general — it is specifically: wherever a CTF sub-milestone can be reduced to a deterministic, server-side, ground-truth check (a reverse-shell callback actually received, a specific privileged file actually read — not an LLM-judge’s opinion), score it exactly like a verified Lean subgoal. Everywhere else — a curl command’s stdout, a subprocess’s raw output — there is no general “CTF kernel” that can certify a step was valid, and this domain’s recipe does not license densifying reward there. What does transfer safely: subgoal decomposition and synthetic self-play, used to generate additional training data or curriculum, leaving the terminal reward untouched.

3.4 The KNOWLEDGE vs EXECUTION contrast — tying it to our diagnosis framework

This is the load-bearing synthesis point of the whole chapter, and it is exactly the split Diagnosing the gap already formalizes (competence vs. performance, Firestone PMC7604508; formal vs. functional competence, Mahowald et al. arXiv:2301.06627) — the particular-language literature is independent, cross-domain confirmation of the same split, from a different field entirely:

Failure mode observedRoot causeFixing stageCross-domain evidence
Model has never seen the relevant tokens/technique at allMissing from pretraining corpusPretraining / continued-pretraining — data mixture, upsamplingStarCoder/DeepSeek-Coder (PL); Sailor/SEA-LION (NL)
Model “sort of” knows it but fragments/mishandles itTokenizer/vocabulary coverage gapVocabulary expansion + CPT (sub-pretraining level, not SFT/RL)Yamaguchi et al. arXiv:2406.11477
Model knows it passively but won’t reliably use it on commandInstruction-tuning data doesn’t cover itSFT / instruction-tuning mixture — cheapest lever, no CPT neededBLOOM+1 arXiv:2212.09535; Aya arXiv:2402.07827
Model knows the technique but fumbles execution over many turns / fails testsBehavior/execution gap, not knowledgeRL with an execution/compiler/flag verifierCodeRL, StepCoder, RLEF — this is our project’s diagnosed gap

What I’d change in our pipeline: before spending a training-run budget on GRPO/RLVR to fix a specific recurring failure, run it through this table first. If trace review shows the agent never produces the right technique/CVE reference at any sample count — that’s the top two rows, a pretraining/SFT-data problem, and no amount of RL will fix it (consistent with handbook rule 5: “knowledge in tools, not weights or prompt” — this generalizes: RL doesn’t need to hold facts if tools can supply them at inference time). If the technique does appear somewhere across k samples but pass@1 doesn’t convert it — that’s the bottom row, our diagnosed execution gap, and RLVR is the right lever. This is not a new framework; it’s the particular-language literature independently re-deriving the same split our diagnosis chapter already uses, from code/NLP rather than cyber — worth citing back as corroboration.

3.5 Deep-research / web agents — the strongest-transfer row

The tightest non-cyber structural analogue: many sequential tool calls into a real, noisy, adversarial live environment (not a toy simulator), reward assessable only once the task is actually done. WebGPT (arXiv:2112.09332) is the direct ancestor of “SFT cold-start, then reward-guided optimization” — our own planned shape — but its reward is a learned human-preference model, flagged by its own authors as struggling out-of-distribution; cite it as the origin of the recipe shape, not as license to use a learned reward. WebRL (arXiv:2411.02337, ICLR 2025) is the standout: a self-evolving curriculum that generates new training tasks directly from the model’s own unsuccessful attempts — Llama-3.1-8B goes 4.8%→42.4% success on WebArena-Lite from this alone. R1-Searcher (arXiv:2503.05592) shows a genuinely safe way to densify reward for tool-use: a sequential, not summed, two-stage reward — stage 1 rewards only correct tool-invocation format, stage 2 switches fully to outcome reward. Because the stages are sequential rather than concurrently-summed, the policy can’t “farm” stage-1’s format reward once stage 2 has begun — it avoids the reward-hacking trap a concurrent per-call bonus would carry. DeepResearcher (arXiv:2504.03160, EMNLP 2025) independently argues, as its central claim, that training end-to-end in the real environment (not a simulated proxy) is “a fundamental requirement” — direct validation of our own live-sandbox harness design.

What I’d change in our pipeline: (1) WebRL’s failure-to-curriculum mechanism is the single best non-cyber precedent for our own currently-unsolved long tail (most of the ~1000-challenge portfolio outside the 30-60% GRPO band) — an automatic, quantitatively-strong demonstration that a failure corpus converts into new, appropriately-calibrated training tasks rather than sitting inert; (2) R1-Searcher’s sequential staged reward is a concrete, safe fix if trace review shows tool-avoidance (agent bypassing the sectools surface in favor of raw shell) — a stage-1 format-only reward for correct tool invocation, switched off once stage 2 begins.

3.6 Hard-exploration games — the honest calibration check, plus one genuinely new lever

Montezuma’s Revenge and Pitfall are the domain where “sparse binary terminal reward with near-zero prior success” was most literally the entire problem. Go-Explore (arXiv:1901.10995, Nature 2021) solved both by maintaining an archive of previously-visited states, always returning to a promising already-discovered frontier state cheaply before exploring further from it, rather than re-discovering the same early states from scratch every episode — then robustifying the best archived trajectories into a closed-loop policy. AlphaStar (Nature, DOI:10.1038/s41586-019-1724-z) solves a different axis — adversarial strategy collapse — with a league: train against a diverse, continually-adapting population of checkpoints and hand-designed “exploiter” agents, not just the latest self; it also validates imitation-bootstrap-before-RL at frontier scale (pure self-play RL from scratch over such a long horizon was intractably slow to bootstrap). OpenAI Five (arXiv:1912.06680) shows scale + dense reward shaping substitutes for exotic exploration algorithms — but its dense proxy-reward choice is exactly the reward-hacking risk our handbook’s ground-truth-only stance guards against; treat it as a caution, not a recipe. NetHack (arXiv:2006.13760) remains genuinely unsolved — the honest calibration point that {long-horizon + heavy exploration + procedural generalization + sparse terminal reward} all at once is not a solved combination anywhere in the literature, cyber included.

What I’d change in our pipeline (speculative transfer — engineering-unproven in this domain): Go-Explore’s literal mechanism (deterministic sim-state checkpoint/restore) doesn’t map to a live target box that isn’t cheaply resettable — but the principle does: explicitly archive distinct states of partial progress within a CTF episode (new service found, new privilege level reached, new file discovered) as first-class checkpoints, and bias new rollout attempts toward continuing exploration from an archived frontier state rather than always cold-starting from turn 0. This changes where exploration compute is spent, not the reward function — it carries none of the reward-hacking risk a shaped reward would. Building the state-abstraction function for free-text shell/HTTP output is real, non-trivial work; flag accordingly.

3.7 Robotics — speculative mechanism, but a concrete data-curation descendant

Hindsight Experience Replay (arXiv:1707.01495, NeurIPS 2017) is the seminal sparse-reward paper: relabel a failed trajectory, post-hoc, as if the state it actually reached had been the intended goal all along — converting 100% sparse failure into 100% dense, correctly-labeled success. Honest limit: this is an off-policy, goal-conditioned, replay-buffer mechanism (DDPG/DQN-family) — it does not literally port to an on-policy GRPO/RLVR loop, and a CTF episode has one terminal flag, not a continuum of interchangeable goals a failed run could be relabeled as having achieved instead. Speculative transfer, principle only: mine the corpus of failed (flag=0) trajectories for reusable sub-skill demonstrations — a run that reached a foothold but failed to escalate privileges is a valid positive demonstration of “how to reach a foothold,” even though the overall episode is a failure. Segment the failed-trajectory corpus by furthest-pipeline-stage-reached and fold the phase-appropriate positive prefixes into the SFT corpus for that sub-skill — a pure data-curation move that never touches the RL reward function, carrying none of the reward-hacking risk an online auxiliary reward would.


4. The KNOWLEDGE vs EXECUTION contrast — the load-bearing takeaway

Section 3.4 above is the single most important cross-domain finding in this chapter, restated plainly: the particular-language literature is the cleanest existing evidence, from a domain with no cyber baggage, that pretraining/SFT injects knowledge and RL only sharpens execution of knowledge the base model already has. This directly operationalizes Diagnosing the gap’s competence-vs-performance split with a second, independent field’s worth of citations. Practical consequence for this project: run any suspected failure mode through the four-row table in §3.4 before committing GRPO/RLVR compute to it — a “doesn’t know the technique” failure needs a data lever (more SFT coverage, a tool that supplies the fact at inference time), not a training-loop lever; a “knows it, fumbles execution over ~100 turns” failure is where RLVR belongs, and is what every SWE/RLEF/GrandCode result above was built to fix.


5. Ranked shortlist of transferable levers

  1. RLEF’s turn-level value function over a multi-turn POMDP (arXiv:2410.02089) — the single most literal structural preview of our problem (public-test feedback → repair → repeat → private-test terminal reward) solved by Meta at an order of magnitude fewer samples than scaffolded prompting. High confidence, high relevance — read before finalizing a GRPO variant.
  2. GiGPO’s step-level anchor-state advantage (arXiv:2505.10978) and GrandCode’s Agentic-GRPO (arXiv:2604.02721) — two independent, purpose-built GRPO variants for exactly our credit-assignment shape (long trajectory, terminal-only reward, off-policy drift). High confidence on GiGPO (NeurIPS-accepted); promising, unverified on GrandCode (single team, very recent) — converging evidence “stage/turn is the unit of credit” is the field’s consensus fix.
  3. WebRL’s self-evolving curriculum from failures (arXiv:2411.02337) — the strongest available precedent for converting our own unsolved-challenge tail into new training data automatically. High confidence, needs CTF-specific task-generation design.
  4. Ground-truth execution reward over any learned/similarity proxy, now 4x cross-validated — SWE-Gym > SWE-RL, the theorem-proving kernel, WebGPT’s own flagged weakness, and DeepResearcher’s live-environment requirement all land on the same rule our handbook already locks. Established, high confidence — reconfirmation, not a new finding, but reassurance the rule is domain-general.
  5. Subgoal/curriculum decomposition for cold-start DATA generation, never for the reward itself (DeepSeek-Prover-V2 arXiv:2504.21801, AlphaGeometry) — safe wherever a genuine sub-milestone is deterministically verifiable; see One problem, or many? for the full decompose-eval-vs-decompose-reward argument this reconfirms. Established, high confidence.
  6. Progressive context/turn-budget curriculum for RL (arXiv:2508.03501, KLong arXiv:2602.17547) — start short, extend once performance plateaus; directly portable to the existing GRPO 30–60% baseline-band rule. Medium confidence (2 independent sources).
  7. R1-Searcher’s sequential (not summed) staged reward for tool-use bootstrap (arXiv:2503.05592) — a safe fix if tool-avoidance is diagnosed in trace review. Established mechanism, cyber-domain application untested.
  8. Go-Explore’s archive-and-return exploration scheduling (arXiv:1901.10995) — genuinely the most novel, least-already-covered addition here; changes where exploration compute is spent, zero reward-hacking risk. Strong transfer of the idea; engineering-unproven in this domain — speculative transfer.
  9. Hybrid test-time scaling (execution + execution-free verifiers, complementary blind spots) — a free multiplier on any trained policy, replicated across ≥4 independent groups (R2E-Gym, DeepSWE, SWE-Master, Claude “high compute”). High confidence, immediately actionable on our existing pass@k methodology.
  10. Mining flag=0 trajectories for sub-skill SFT data (HER’s spirit, arXiv:1707.01495) — segment failed runs by furthest-stage-reached, fold positive prefixes into SFT. Speculative-to-established idea, translated from a different algorithm family — low risk, purely data-curation.
  11. NetHack’s honest “still unsolved” status (arXiv:2006.13760) — not a lever, a calibration check: no algorithm anywhere has cracked {long-horizon + heavy exploration + procedural generalization + sparse terminal reward} simultaneously. Set expectations accordingly.

The path to a frontier cybersecurity model

Every other chapter in this book resolves a method question for this project’s current bottleneck (SFT vs GRPO, monolithic vs decomposed reward, which exploration fix). This chapter zooms out to the north star those decisions serve: not “improve pass@k on our 1000-challenge portfolio” as an end in itself, but a frontier cybersecurity model — the offensive-security analog of what DeepSeek-Coder, Qwen-Coder, and DeepSeekMath are to code and math. It asks the harder question every other chapter brackets: even if The decision’s routing tree is answered correctly and roadmap-inputs.md’s forks are all resolved well, does that alone produce a frontier model? The honest answer, argued below, is no — it produces a materially better agent on this project’s own portfolio, which is necessary but not sufficient. This chapter is the capstone that says what else the word “frontier” actually requires, stage by stage, and where this project genuinely already stands on that ladder.


1. The frame — what “frontier” means here, and why academic-security is the wrong template

“Frontier” is not a vibe or a marketing word here — across every domain-specialization lineage examined below (code, math, medical), it tracked the same three things simultaneously, never just one: (1) a strong general, already-agentic starting checkpoint, not a small model fine-tuned harder; (2) a domain RL stage with an ungameable, automatically-computable verifier at scale; (3) infrastructure investment matched to the RL stage’s actual bottleneck — which turns out to be environment count and diversity, not bigger GPUs for pretraining. A model that is merely “instruction-tuned on curated domain data” (Med-PaLM v1’s prompt-tuning-only recipe, arXiv:2212.13138) is a domain assistant, not a frontier domain model — the field’s own vocabulary distinguishes these, and this project should too.

This is why the standing project stance treats academic cybersecurity-LLM papers (CTF-Dojo, Cyber-Zero, Pentest-R1, HackSynth, AutoPenBench, DRLRM-PT, and siblings) as mention-only, never load-bearing: none of them has produced a model that clears the bar above. They are useful as landscape context, occasionally as a source of a technique name worth knowing, but citing them as the basis for a claim about what a frontier cyber model requires would be citing evidence that has never once been tested against the thing it claims to predict. Every load-bearing claim in this chapter instead rests on (a) frontier-lab flagship and domain-specialization disclosures — code, math, medical, and general-agentic recipes, 2023–2026; (b) general RL/ML theory (scaling laws, potential-based reward shaping, reward-hacking mechanics); (c) this project’s own measured data and confirmed lessons. Where a genuinely production, externally-verified cyber-specialized system exists outside the academic set (XBOW, Bugcrowd) it is admissible under clause (a) as a domain-specialization precedent, exactly like Code Llama or Med-PaLM — and is flagged as such, never folded in with the excluded academic set.


2. The transferable frontier domain-specialization recipe

Cross-referencing the code lineages (DeepSeek-Coder → DeepSeek-Coder-V2, Qwen2.5-Coder → Qwen3-Coder-Next), the math lineage (DeepSeekMath → Qwen2.5-Math → DeepSeek-Prover-V2), and the medical contrast vertical (Med-PaLM → Med-PaLM 2 → MedGemma), one skeleton recurs with variation in emphasis but is never fully absent. A fifth stage — mid-training — is a distinct, more recently named bridge stage the code/math lineages ran informally but only OLMo 2 and a 2025 controlled study give a name and a mechanism to (arXiv:2501.00656, arXiv:2510.14865).

flowchart TD
  P["Pretrain\n(inherited — a strong general/\nagentic open-weight checkpoint,\nnot trained here)"] --> S0

  subgraph S0["Stage 0 — Domain continued pretraining (CPT)"]
    direction TB
    S0A["100B-5.5T in-domain tokens,\nfrom a STRONG checkpoint, never from scratch\nCode Llama ~500B code tokens\nDeepSeekMath 120B math tokens\nQwen2.5-Coder 5.5T code tokens"]
  end

  S0 --> S05

  subgraph S05["Stage 0.5 — Mid-training (the named bridge)"]
    direction TB
    S05A["5-10% of pretrain FLOPs, curriculum-shaped,\nupsampled high-quality + synthetic patches\n'infuse knowledge, patch deficiencies'\n(OLMo 2); reduces catastrophic forgetting\nbefore SFT (2510.14865)"]
  end

  S05 --> S1

  subgraph S1["Stage 1 — Domain SFT / data synthesis,\nincreasingly SELF-BOOTSTRAPPED"]
    direction TB
    S1A["Rejection-sampling + iterative co-evolution:\nQwen3-Coder used Qwen2.5-Coder to clean its\nown next-gen data; Qwen2.5-Math co-evolved\nRM+SFT across rounds; DeepSeek-Prover-V2\nstitched subgoal-decomposed traces"]
  end

  S1 --> S2

  subgraph S2["Stage 2 — Domain RL, verifier-gated\n'hard to solve, easy to verify'"]
    direction TB
    S2A["GRPO / RLVR, no critic, group-mean baseline\n(origin: DeepSeekMath); scaling axis that\nmattered most = PARALLEL RL ENVIRONMENTS\n(Qwen3-Coder: 20,000), not model size"]
  end

  S2 -.->|"cross-cutting, every stage"| S4["Data-pipeline + scale engineering\nas a first-class investment\n(Qwen2.5-Coder: curation > scale;\nStarCoder2/Stack-v2: quality substitutes\nfor parameter count)"]

  classDef stage fill:#132b22,stroke:#34d399,color:#eafaf3;
  classDef cross fill:#3a2e14,stroke:#f5b942,color:#fff6e0;
  class P,S0A,S05A,S1A,S2A stage;
  class S4 cross;

Reading the skeleton stage by stage, cited:

  • Stage 0 — domain CPT. Every recent frontier vertical lineage continues pretraining from an existing strong checkpoint — never truly from scratch (DeepSeek-Coder v1’s from-scratch 2T-token run, arXiv:2401.14196, is the sole exception in this set, and even DeepSeek abandoned it by v2, arXiv:2406.11931). Cross-domain transfer is itself a load-bearing finding, not noise: DeepSeekMath deliberately starts from a code base (arXiv:2402.03300) for a math specialist, because precise multi-step symbolic reasoning transfers — argues a cyber CPT stage, if built, should start from a model already strong at general coding/tool-use/agentic reasoning, not a generic chat model.
  • Stage 0.5 — mid-training. OLMo 2 names this explicitly as “Stage 2: Mid-training (5–10% of training FLOPs)… upsample the highest-quality web documents and curated non-web sources; employ synthetic data crafted to patch math capabilities” (arXiv:2501.00656). A 2025 controlled study formalizes the mechanism: mid-training outperforms continued-pretraining-alone at a matched specialized-token budget and mitigates catastrophic forgetting in the subsequent SFT stage, because it acts as a better initialization for post-training rather than just adding knowledge (arXiv:2510.14865, moderate-high confidence — recent, not yet heavily cited, but consistent with and explaining the OLMo/Llama-3/DBRX practitioner reports it’s built on).
  • Stage 1 — self-bootstrapped SFT/data synthesis. The frontier pattern has moved past “filter and train once”: Qwen3-Coder used the prior generation (Qwen2.5-Coder) to clean and rewrite its own next-generation pretraining data (blog disclosure, no standalone arXiv for the 480B flagship — the architecture is covered by arXiv:2505.09388; the agentic-RL successor, Qwen3-Coder-Next, is arXiv:2603.00729). Qwen2.5-Math (arXiv:2409.12122) co-evolves a reward model and SFT data across rounds before RL is even applied, then reuses the same RM at inference for best-of-N reranking. DeepSeek-Prover-V2 (arXiv:2504.21801) decomposes a hard problem into subgoals, solves each with a cheaper model, and stitches the resolved subgoals into a single cold-start trajectory — a direct precedent for treating a ~100-turn CTF episode’s implicit stages (recon → foothold → priv-esc → flag) as subgoal-decomposable SFT-construction material, even while the RL reward itself stays terminal-only for ungameability.
  • Stage 2 — verifier-gated RL. This is where GRPO was born: DeepSeekMath’s own framing attributes math capability to two factors — a web-data mining pipeline, and Group Relative Policy Optimization, a critic-free PPO variant using the sampled group’s mean reward as the baseline (arXiv:2402.03300). Qwen3-Coder’s post-training explicitly names the reward-design principle “hard to solve, easy to verify,” and its own headline scaling axis wasn’t a bigger model, it was 20,000 parallel RL environments for the long-horizon agentic RL stage. DeepSeek-Prover-V2 runs the same pattern with Lean’s type-checker as a binary, ungameable reward — structurally identical in spirit to a terminal flag verifier.
  • Cross-cutting — data-pipeline engineering as its own investment. Qwen2.5-Coder’s whole story is “meticulous data cleaning, scalable synthetic data generation, balanced data mixing” beating larger models on the same benchmarks purely on data quality/composition. StarCoder2/The Stack v2 (arXiv:2402.19173, included as a data-pipeline lesson only — flagged explicitly as not a frontier-capability reference point) independently confirms curation-quality substituting for parameter count from a second source.

The medical contrast (mentioned, not a basis for cyber claims): Med-PaLM v1’s prompt-tuning-only recipe (arXiv:2212.13138) shows the cheap-adaptation-of-a-frozen-giant path is not sufficient on its own — the paper’s own human-eval gap (factuality, harm) motivated Med-PaLM 2 (arXiv:2305.09617, domain instruction fine-tuning + ensemble refinement) and MedGemma (arXiv:2507.05201, domain vision-language pretraining + task-specific fine-tuning, explicitly disclosed as not clinical-grade without further fine-tuning). The useful lesson by contrast: for a binary, adversarial correctness domain like offensive security (a wrong action doesn’t mislead a reader, it fails the exploit), a v1-style prompt-tuned ceiling is lower than in code/math — supporting this project’s existing bias toward continued adaptation + RLVR over prompting alone.


3. The frontier ingredients, as requirements

Restating the seven-ingredient survey as a checklist of what “frontier” actually costs, independent of any one domain:

#IngredientWhat frontier scale actually looks likeConfidence
1Compute + scale lawLoss falls as a power law in model size × data × compute; the ratio matters (Chinchilla-optimal ≈ equal scaling of params and tokens, not param-dominant) — Kaplan, arXiv:2001.08361; Hoffmann/Chinchilla, arXiv:2203.15556High — foundational, independently reproduced
2Data scale + quality + curation5–15T+ curated tokens is the pretraining norm (DeepSeek-V3, arXiv:2412.19437; Llama 3, arXiv:2407.21783) — or the phi-1 extreme: curation quality can substitute for ~100x less scale within a narrow domain (arXiv:2306.11644), though that ratio is an upper bound, not a universal ruleHigh for the scale rows; moderate on how far the “quality substitutes for scale” ratio generalizes
3Domain CPT before post-training120B–5.5T in-domain tokens continued-pretrained into an existing strong base, before any instruction-tuning/RL — not a thin adapter on the base chat model (Qwen2.5-Coder 5.5T+, DeepSeekMath 120B, Med-PaLM 2’s domain finetuning)High — three independent labs, primary technical reports
4Mid-training as a named bridge stageA shorter (5–10% FLOPs), curriculum-shaped stage between broad pretraining and narrow post-training that patches domain deficiencies cheaply and reduces forgetting — OLMo 2, arXiv:2501.00656; arXiv:2510.14865High (OLMo 2); moderate-high (mechanism study, new)
5RL-environment scale, diversity, verifiabilityThe reasoning/agentic jump to o1/R1-class models is attributed to large-scale RL with verifiable, not learned, rewards, not more pretraining — DeepSeek-R1, arXiv:2501.12948; OpenAI o1 system card (arXiv mirror 2412.16720); environment diversity/scale is its own axis, distinct from reward correctness — Kimi K2’s “tens of thousands” synthesized-tool pipeline (arXiv:2507.20534); framed as an emerging bottleneck by arXiv:2511.09586High (R1, o1, Kimi K2); moderate on the survey’s “emerging bottleneck” framing specifically (new, low-citation)
6Full pipeline vs. thin adapterLoRA measurably underperforms full fine-tuning specifically on code/math domain-skill acquisition — full fine-tuning learns perturbations at 10–100x the effective rank of typical LoRA configs — arXiv:2405.09673High — controlled, ablated, >250 citations in 18 months, directly on-domain
7Eval reflects reality, not a saturated benchmarkClassic benchmarks are contaminated enough to inflate scores by up to 22.9%/19.0% (GSM8K/MMLU) — arXiv:2406.13990; frontier practice responds with contamination-resistant-by-construction benchmarks: time-segmented LiveCodeBench, arXiv:2403.07974, “Google-proof” GPQA, arXiv:2311.12022; Tülu 3, arXiv:2411.15124 treats decontamination as a first-class deliverable, and is also the primary public naming of RLVRHigh — all four primary, independently corroborating

Net read of §2+§3 together: compute/scale (ingredient 1) is inherited from the base model’s own pretraining — not this project’s job to re-derive. Ingredients 3–4 (CPT, mid-training) are the stages this project’s plan currently skips by design (handbook rule “knowledge in tools, not weights”), a defensible bet but an unvalidated one at the CPT/mid-training layer specifically. Ingredient 5 (RL-environment scale/diversity/verifiability) is where this project is closest to the frontier pattern already — see §5. Ingredients 6–7 (full pipeline vs. adapter, eval integrity) are concrete, checkable knobs, not open research questions.


4. Gap analysis — the frontier recipe vs. this project, stage by stage

One-line verdict: this project has correctly identified the shape of the frontier recipe — harness as a live RL environment, ground-truth verifiable reward, rejection-sampling SFT → GRPO/RLVR ordering, “knowledge in tools not weights” — and has already built the two hardest structural pieces: a working ~100-turn agentic harness and a genuine, non-gameable terminal verifier. What it has not built is scale on every other axis, and it has skipped a pretraining-adjacent stage entirely (domain CPT / mid-training) that every cited frontier domain-specialization precedent inserts. The gap is 1–5 orders of magnitude on breadth, not a missing insight on direction.

Frontier-recipe stageHavePartialMissing
Stage 0 — Domain CPTNothing — explicit design choice (“knowledge in tools, not weights”), not oversightThe “knowledge in tools” architectural bet is coherent but unvalidated at this layer — the 87.7%-tool-bypass finding is at least as consistent with “raw vocabulary exposure is thin” as with “SFT/RL hasn’t reinforced the surface yet”A CPT stage on curated offensive-security text (tool docs, CVE writeups, exploit-dev reasoning) — every cited frontier precedent (Code Llama ~500B code tokens, DeepSeekMath 120B math tokens) runs this before SFT/RL
Stage 0.5 — Mid-trainingNothingThe single biggest concrete gap relative to its likely payoff — cheapest missing stage of the whole skeleton (§2), and this project currently goes straight from a general base into RL with no bridge stage at all
Stage 1 — SFT / rejection-samplingFully-designed quality-filter recipe (replay-reproduce, loss-masking, dedup, decontamination); one SFT already shipped and flag-verified (pass@1 26%→35%, pass@5 4/10→5/10); ~1,200+ trajectories across 5 corpora, ~185 verified-success candidates; a load-bearing negative result confirming the reward-must-be-ground-truth rule on this project’s own data (35% flag fabrication rate under a loose acceptance filter)The corrected filter (replay-reproduce, byte-exact flag verify) is designed but not confirmed re-run since the confabulation finding; ground truth exists for only 2 of 7 corpora~2–3 orders of magnitude below frontier scale — Kimi K2’s SFT draws on 3,000+ real + 20,000+ synthesized MCP tools; no systematic teacher-distillation at scale; no difficulty-curriculum construction (candidate pool too small)
Stage 2 — RLVRflagscan.go — a genuine rung-1, deterministic, ungameable terminal verifier, exactly the reward shape RLVR requires; RL-candidate-selection methodology matching GRPO’s actual zero-gradient mechanics (1–4-of-5 band); a theory-correct potential-based stage-shaping proposal (Ng-Harada-Russell) that doesn’t touch the ground-truth backboneNo GRPO/RLVR training loop implemented anywhere in the codebase (zero training binaries); retrieved heuristic not yet upgraded to byte-exact for 5 of 7 corpora; stage-shaping designed, zero code; the one OpenAI-RFT fallback option is winding downQwen3.7-Max’s decoupled Task/Harness/Verifier infra — the first-party-named fix for exactly this project’s own measured 87.7%-tool-bypass scaffold-overfitting finding; no entropy-collapse countermeasure (nothing to attach one to yet); no partial-rollout/pause-resume infra for long-tail ~100-turn episodes
Stage 3 — Scaled agentic RL environmentsA genuinely working RL-environment shell (100-turn multi-turn loop, sandboxed, fully OTel-traced); 10 canonical, hardened, contamination-free, single-solution PD26 challenges live in production, genuine vuln-class breadth; several hundred additional informal challenges (warpenv/envgen/gym); the RL-envs-as-moat thesis, now corroborated by frontier evidence, not just three original data pointsTotal known population is a few hundred distinct targets — an order of magnitude below the north star’s own “~1000” framing; the single Hetzner eval box is sized for sequential/moderate-parallel eval, never for concurrent GRPO rollout loadHarness/verifier diversity as its own trained axis (one tool surface, one verifier per challenge — categorical, not a scale gap); procedural/generative environment scaling at frontier order-of-magnitude (envgen’s 84 challenges is ~22x smaller than DeepSeek-V3.2’s 1,827-environment pipeline); training-time rollout compute at K8s scale (categorical — no training-scale infra exists, only eval-scale)
Stage 4 — Eval integrityA locked, rigorous pass@k methodology (unbiased estimator, k=3/5/10 bands, independent cold starts); the terminal flag_verified contract (exact match, never a proxy); two modest QA datasets beyond flag-capture; a fully-designed, near-zero-risk eval-decomposition planBase-model pass@64–128 control not confirmed run against the current SFT checkpoint; no per-challenge stage_oracle.json manifest yetNo confirmed contamination/canary audit applied to the cyber CTF corpus specifically; no externally-comparable published benchmark result — no way for an outside reader to place this project’s solve rate against any external reference point

What this means concretely: this project is not missing an idea anywhere in the recipe — every stage has a designed, theoretically-grounded, often partially-built answer, and two pieces (the harness-as- environment, the ground-truth verifier) are genuinely frontier-quality today. What’s missing everywhere except eval methodology and reward design is scale: more environments, more verified trajectories, more training-time compute — plus one categorical, non-scale gap (harness/verifier diversity) that is the single frontier-lab-named fix for a failure mode this project has already measured on its own data.


5. RL-environments + data-synthesis as the moat, at frontier scale

Frontier evidence on this axis is now broader than the project’s original three-source hypothesis (Anthropic’s reported spend, Bugcrowd’s market, this project’s own harness) — nine of ten labs surveyed for the frontier-recipes chapter independently corroborate it, and the general-agentic and cyber-specific evidence below sharpens the picture further.

  • Kimi K2 discloses “a large-scale agentic data synthesis pipeline and a joint reinforcement learning stage, where the model improves its capabilities through interactions with real and synthetic environments” (arXiv:2507.20534) — the cleanest public confirmation that a frontier lab’s post-training lever is environment synthesis + mixing real with synthetic, not base- model scale alone. Frontier_cyber_takeaway: PD26-01..10 is this project’s “real” side; anything procedurally generated (parameterized bug-class variants, mutated target configs) is the “synthetic” side — a program training only on the 10 canonical challenges is closer to “real-only, no synthesis,” which K2’s own design implies under-scales.
  • Kimi K2.5’s Agent Swarm (arXiv:2602.02276) — a self-directed parallel-agent orchestration framework decomposing tasks into concurrent heterogeneous sub-problems, 4.5x latency reduction — architecturally identical to what XBOW independently converged on for cybersecurity (below): narrow-scope parallel sub-agents, not one longer monolithic trajectory. Two unrelated programs landing on the same architecture is a signal worth taking seriously against this project’s current single-agent ~100-turn episode design.
  • OpenAI Deep Research — official disclosure that it “was trained on real-world tasks requiring browser and Python tool use, using the same reinforcement learning methods behind OpenAI o1,” and on-record team commentary that “end-to-end training beats manual orchestration” — a fixed recon→scan→exploit→flag graph “breaks” when the agent needs to adapt; letting the model learn strategy via RL over hard tasks outperforms hand-scripted phase logic. This directly reinforces the project’s own “light framing beats heavy scaffolding” rule, from the team that shipped the highest-profile agentic-RL product to date.
  • Anthropic — a reported (secondhand, via TechCrunch citing The Information; direction corroborated by surrounding market activity, dollar figure not on-record) >$1B RL-environment commitment, plus a live, current, on-record disclosure that cyber-offensive capability is deliberately tier-gated across the Claude line rather than propagated uniformly with general capability. Frontier_cyber_takeaway: cyber capability is not a free byproduct of general-agentic scaling even at a frontier lab — it has to be deliberately trained in, which is exactly this project’s actual bet (open-weight base + this project’s own cyber RL environments).
  • Google DeepMind SIMA / SIMA 2 (arXiv:2404.10179, arXiv:2512.04797) — SIMA 2’s headline: “by leveraging Gemini to generate tasks and provide rewards, SIMA 2 can autonomously learn new skills from scratch in a new environment.” The clearest frontier precedent for self-generated challenges: applied to cybersecurity, a frontier-scale program would use a strong model to propose novel vulnerable-target variants and score exploit attempts, with the project’s existing terminal flag verifier as the ungameable ground-truth check that keeps a self-generated curriculum honest.
  • Market corroboration, general-purpose: Prime Intellect’s Environments Hub — 1,000+ unique environments from 250+ creators, 100,000+ downloads. Cyber-specific instance: Bugcrowd’s RL Environments (built on Mayhem Security tech) — “hundreds of thousands of training environments, each built from authentic open-source vulnerabilities with real source code and verifiable outcomes,” with Chief AI/Science Officer David Brumley’s own framing: “Most AI security training stops too early. Models learn to find bugs, but not to prove the bugs are real and exploitable… detection through exploitation, patching, and audit.” That last phrase is also a concrete, safely-implementable curriculum idea — a graded multi-stage reward, but only if built potential-based (F(s,a,s') = γΦ(s') − Φ(s), the only reward-shaping form with a policy-invariance guarantee, per Ng/Harada/Russell, ICML 1999). A flat per-stage bonus is exactly the kind of ad hoc shaping the theorem warns produces gameable, farm-the-partial-credit policies.
  • XBOW — admissible as a production, externally-verified domain-specialization precedent (top-ranked on HackerOne against human researchers, real CVEs, real payouts), not the excluded academic set. Its own disclosed curriculum climbed four rungs in order: canned CTF (PortSwigger/PentesterLab, “artificial exercises”) → a custom-built realistic benchmark → white-box zero-day discovery in real open-source projects → black-box production dogfooding on HackerOne, where the real-world bug-bounty triage process itself is the verifier. This project currently sits at roughly rung 2 (PD26-01..10, custom-built, more realistic than generic CTF). XBOW’s own architecture note — “thousands of short-lived agents, each with a narrow objective, orchestrated by a persistent coordinator and validated by deterministic logic… if one agent runs into a dead end on step 4 of a 20-step attack, it doesn’t tank the whole operation” — independently confirms (alongside Kimi K2.5’s Agent Swarm) that decomposition into parallel narrow agents, not a longer monolithic single-agent trajectory, is where frontier-grade cyber-agent architecture is heading.

This project’s own numbers, side by side with the frontier evidence: ~1,200+ trajectories, solving ~100–200 of ~1,000 attempted challenges — but the distinct-challenge corpus underneath that is on the order of 10 canonical live challenges plus a 15-challenge locked dataset slice. Bugcrowd alone is “hundreds of thousands” of distinct verifiable cyber environments; Prime Intellect’s general hub is 1,000+; Kimi K2 leans on 3,000+ real + 20,000+ synthesized tools feeding trajectory generation. The gap is 2–5 orders of magnitude on environment volume and diversity — not on algorithm. Every frontier disclosure above agrees GRPO/RLVR-family joint RL with tool-use is now well-understood and largely commoditized; the highest-leverage next investment is a synthesis pipeline that turns the existing 10–15 hand-built challenges into hundreds-to-thousands of verifiably-distinct variants (parameterized bug-class mutations, stack/target permutations, difficulty-graded variants of the same vuln class) — mirroring Kimi K2’s real+synthetic split and XBOW’s own rung-2→3 transition — rather than continuing to hand-author challenges one-off at the current cadence.


6. How this ladders — from “improve solve rate” to “frontier model”

Diagnosing the gap, From behavioral audit to training signal, One problem, or many?, and Where you are & the forks ahead are all, correctly, scoped to this project’s own ~1,000-challenge portfolio — they answer “what training signal fixes F1 vs F2 vs F3 vs F4” and “monolithic vs milestone-shaped reward,” which are the right near-term engineering questions. This chapter’s honest addition: answering those questions well moves solve rate on the existing portfolio; it does not, by itself, cross into “frontier.” The ladder has three rungs, and the roadmap-inputs.md decision brief only climbs the first:

  1. Rung 1 — execute reliably on the portfolio you have. This is the diagnosis framework’s whole job: segment single-shot vs. sequentially-gated, route by the F1–F4 funnel, pick SFT-vs-GRPO ordering correctly per segment. Get this right and you’ve closed the execution gap this project diagnosed — necessary, and per §4 above, this project’s Stage 1/Stage 2 work already targets exactly this.
  2. Rung 2 — scale the environment/data axis by orders of magnitude. Per §4/§5, this is where the actual distance to “frontier” lives: 10 canonical + ~250 informal challenges vs. Bugcrowd’s hundreds of thousands; ~185 verified-solve candidates vs. Kimi K2’s tens-of-thousands-of-tools synthesis pipeline. No amount of correctly-routed SFT-vs-GRPO decision-making on Rung 1’s existing portfolio substitutes for this — every frontier precedent in this chapter treats environment/data scale as the dominant axis, not a nice-to-have.
  3. Rung 3 — close the CPT/mid-training gap, if evals reveal it’s real. Per §4’s Stage 0/0.5 rows, this project’s “knowledge in tools, not weights” bet is a legitimate scoping choice only if the deficit evals surface is confirmed execution-only. If a knowledge deficit shows up instead (not just an execution one), neither of the two disclosed frontier tools for fixing it — domain CPT, mid-training — is currently in this project’s plan at all. This is the one rung that’s genuinely contingent, not scheduled.

The decomposition-vs-monolithic.md verdict (stay monolithic on reward shape until the funnel says otherwise, build only the theorem-backed potential-based version if you do) and the roadmap-inputs.md forks (instrument first, segment before committing SFT capacity, route exploration-vs-execution emphasis by the funnel) are all Rung-1-scoped decisions, made correctly — none of them need to change in light of this chapter. What this chapter adds is the honest framing that Rung 1 is a prerequisite, not the destination: a project that nails every fork in roadmap-inputs.md and still has 10 canonical challenges and no CPT/mid-training stage has built an excellent instance of “the shape of the frontier recipe” at a scale that isn’t frontier yet. The concrete forward move this chapter argues for, independent of and parallel to the Rung-1 work already underway, is the Rung-2 synthesis-pipeline investment named in §5 — because unlike Rung 3 (contingent on an eval result not yet in hand), Rung 2’s gap is already confirmed, already the largest, and already has multiple frontier precedents (Kimi K2, SIMA 2, XBOW rung 2→3, Bugcrowd) showing the shape of the fix.


  • Where you are & the forks ahead — the Rung-1 decision surface this chapter sits on top of; its forks (a)–(e) and its DAG are unaffected by anything in this chapter, they’re prerequisites to it, not alternatives.
  • One problem, or many? — decomposition vs. monolithic — the reward-shape verdict this chapter’s §5 potential-based-shaping discussion (Bugcrowd’s detection→exploitation→patching→ audit staging idea) must stay consistent with: shape only via a theorem-backed potential function, never a flat per-stage bonus.
  • Diagnosing the gap — a scientific framework — the F1–F4 funnel and pass@k / Pass@(k,T) instrumentation that determines whether Rung 3 (CPT/mid-training) is contingent-but-unnecessary or the load-bearing risk this chapter flags it as.
  • What the frontier labs actually do (2026) — the ten-lab SFT/RL method survey this chapter’s Stage 1/Stage 2 skeleton is consistent with; that chapter is method-focused, this one is recipe-and-scale-focused — read them as companions, not duplicates.
  • RL that creates value — long-horizon, exploration, reasoning, novelty — the technique menu for Rung 1’s execution work; orthogonal to this chapter’s Rung 2/3 scale argument.

Where you are & the forks ahead

This is the capstone chapter, not a roadmap. Every other chapter in this book resolves a method question (SFT vs DPO vs GRPO, monolithic vs decomposed, which exploration fix). This chapter assembles those resolutions into the shape you actually need to draw your own plan: what’s true today, what you have to decide, in what order the decisions unlock each other, and what would have to be false to make you change course. It recommends nothing you haven’t already read elsewhere in this book — it routes you back to Diagnosing the gap, From behavioral audit to training signal, One problem, or many?, Before you train, RL that creates value, and The decision at every load-bearing point. Read this after those, not instead of them.

Where this fork-by-fork plan sits in the bigger picture: everything below is Rung-1-scoped — it resolves execute reliably on the portfolio you have. It is a prerequisite for, not a substitute for, The path to a frontier cybersecurity model, which argues that even a perfectly-resolved DAG below doesn’t by itself cross into “frontier” — that takes orders-of-magnitude more RL-environment scale (Rung 2) and possibly a CPT/mid-training stage (Rung 3). The cross-domain evidence grounding that argument — six other long-horizon/sparse-reward/verifiable domains and what actually cracked each one — lives in Cybersecurity is one of a family — what cracked the others; several forks below (especially (c) and (e)) draw directly on techniques surveyed there (GiGPO, DAPO, WebRL’s failure-to-curriculum, potential-based shaping).


1. Where you are — the diagnosis on one screen

The number: ~100–200 solved / ~1000 challenges, at k=1. This is a portfolio statistic, not a per-challenge pass rate — a challenge that “solved once” could be a 5% fluke or a 55% near-certainty, and those two cases call for opposite next moves (The decision, “one prerequisite before any of this”). Nobody has yet run pass@k per challenge, let alone per pipeline stage.

The pipeline is a chain, not a single action:

flowchart LR
  R["Recon /\nenumeration"] --> E["Endpoint\ndiscovery"]
  E --> V["Identify the\nvulnerable endpoint"]
  V --> X["Exploit it"]
  X --> P["Post-exploitation /\npivot"]
  P --> F(("Flag\n{0,1}\nground-truth verified"))

  classDef stage fill:#132b22,stroke:#34d399,color:#eafaf3;
  class R,E,V,X,P stage;

Only the last box (flag_verified) is checked today, and even that is presently a provenance proxy (flag_scan.retrieved — “the string came back from the sandbox, not the model’s mouth”) rather than a true byte-compare against benchmark/flags/pd26_flags.current.json — that exact-match wiring doesn’t exist yet outside a manual, SSH-gated step (instrumentation-and-data-readiness.md §3.1). So even the ground-truth anchor this whole book leans on is one small, well-scoped engineering task away from being fully automatic, not there yet.

The stage-localized failure taxonomy (F1–F4), mapped to canonical RL/agent-research framing:

TagFailureCanonical framingFix lever if confirmed dominant
F1Never finds the vulnerable endpointExploration / coverage failure — no gradient until reward is first observedOn-policy RL with exploration preservation, not more demonstrations
F2Finds it, probes shallowly, can’t land the exploitExecution / skill (performance-floor) failureTrajectory curation, rejection-sampling SFT, DAPO/GiGPO as GRPO baseline
F3Clumsy tool use, wrong tool for the jobPolicy / tool-selection failure — its own axisElicitation ladder → ToolRL/Tool-Star if elicitation fails
F4No real pivot/chaining after a footholdLong-horizon credit-assignment failure (variance, not coverage)Step-level credit (GiGPO), curriculum (1-hop before 2-hop), never “try more” alone

The diagnosis, stated as a hypothesis, not a fact: the project’s working read is “likely an execution gap” (F2/F3-flavored) rather than a knowledge gap or a pure exploration gap — capability is probably present and unreliable, not absent. This is the single most load-bearing framing decision in the whole plan, and it is currently unproven. The book’s own diagnosis framework is explicit about this: “the honest, defensible answer will not be a single sentence” — the true picture is almost certainly a split verdict, different F-tags dominating different challenge subtypes, not one gap type for the whole 1000 (Diagnosing the gap §0, §8).

What “proven” requires, concretely, and doesn’t exist yet:

  1. Per-challenge (not aggregate) pass@k, segmented by whether the winning path is single-shot or sequentially-gated (compositional — enumeration must land before exploitation is even visible).
  2. Pass@(k,T) — base model vs. current checkpoint, per segment — to tell a genuine execution gap (trained pulls away from base at large k) apart from a pure elicitation artifact (base catches up) apart from an exploration gap hiding underneath (matched-data SFT regresses the segment, RL expands it) (arXiv:2504.13837, arXiv:2604.14877).
  3. A working F1–F4 stage tagger over the existing Phoenix/events.jsonl corpus — the actual current gap, confirmed by direct source read: the harness’s process telemetry is already complete for this (tool_call/tool_result pairs joined on tool_call_id); nothing needs to change in go/libs/agent/events or secagent/runner.go. What’s missing is purely semantic — a deterministic post-hoc scan plus one stage_oracle.json per challenge, authored from that challenge’s own solve.py (Before you train §5).

Bottom line for this section: treat “it’s an execution gap” as the leading hypothesis, not settled ground. Everything in §2–§4 below is written so that it stays true whichever way the funnel eventually comes down — several forks are explicitly gated on a measurement that hasn’t been taken yet.


2. The forks

Five decisions, each with exactly two options (per this book’s convention — no hybrids, no third path). Every “what must be TRUE” column is a gate, not a preference — the fork should not be decided until its gate is checked, because in at least two of these forks (b, e) the two options are not just different costs, they require the opposite SFT/RL ordering.

(a) Instrument the F1–F4 stage tagger first?

Option 1 — build it firstOption 2 — skip it, decide off aggregate pass@k/flag_verified alone
What it isstagescan.go (same shape as the existing flagscan.go) + one stage_oracle.json per challenge, authored from solve.py; prototype on PD26-02 first, then generalizeProceed straight to the SFT/GRPO plan using only the terminal flag signal and portfolio-level solve rate
Cost (compute + eng)Near-free — read-only post-hoc scan over events.jsonl you already have; no new sandbox instrumentation; editorial authoring of ~4 predicates/challenge (one person reads ~10 solve.py files)Zero now — but every downstream decision (b–e) is made blind to which F-tag actually dominates
Risk (reward-hacking)None — this is diagnostic-only, no reward function touchedRouting risk, not hacking risk: you may sink a training cycle into the wrong lever (e.g. rejection-sampling SFT when the real bottleneck is F1/exploration, or milestone shaping when it’s actually F2/F4)
Information valueHighest single move in this whole chapter. “This single aggregation is the input every downstream decision below depends on” — verbatim from Before you train §4Low — an aggregate number “averages over four structurally different failure modes,” exactly the collapsing-a-split-verdict anti-pattern the diagnosis framework names first
What must be TRUE firstNothing — this is the recommended first step regardless of any other measurementN/A

Verdict pressure: there is no real argument for Option 2. This fork is here because it’s the fork everyone is tempted to skip under time pressure, not because the evidence is close.

(b) Rejection-sampling SFT on verified solves NOW vs. measure pass@k-per-stage FIRST

Option 1 — SFT nowOption 2 — segment + measure first
What it isRun rejection-sampling SFT on the ~185 already-collected verified-solve trajectories across the 5 corpora (gym263, gym564, warpenv-broker, envgen, argus60-base)Segment the 1000 challenges (single-shot vs. sequentially-gated), run Pass@(k,T) — base vs. current checkpoint — per segment, before deciding what to train on
CostCheap — data already exists, this is already the project’s stated near-term planMedium — sampling compute at multiple k, cold-start pdq --fresh-retries, no training required
Risk (reward-hacking / generalization)Concrete, not hypothetical. On the compositional/sequentially-gated segment, matched-data SFT actually regresses capability (net −4) while RL expands it (net +4) — Zhai et al., arXiv:2604.14877. Training the wrong subset with SFT doesn’t just waste compute, it can make that subset worse. Separately: the ~185-trajectory pool may be guessing-dominated (high pass@64, low Cover@τ — arXiv:2510.08325) or contain lucky-but-unsound paths (right flag, wrong/wasted reasoning — arXiv:2506.14245)Low — diagnostic only, but real opportunity cost if it delays shipping a known-safe move (SFT is the project’s current plan, already literature-validated as a baseline — arXiv:2504.11343)
Information valueLow incremental — you already believe SFT-on-solves works generically; this doesn’t test where it worksHigh — this is “the single highest-value experimental design” in the diagnosis chapter, and it’s directly testable this week with no new training
What must be TRUE before committing to Option 1 wholesale(i) the SFT pool isn’t guessing-dominated (Cover@τ check on its source challenges); (ii) trajectories are filtered on soundness (backtracking/wasted-turns/tool-validity), not just flag==1; (iii) fork (a)’s stage tagger doesn’t show these ~185 trajectories concentrated on the sequentially-gated segmentN/A

Verdict pressure: don’t cancel the SFT plan — but don’t treat “SFT now” as a blanket recipe across the whole portfolio either. The correct read of these two options is closer to “do (2) as a segmentation gate on (1)”: SFT the single-shot segment now, hold the sequentially-gated segment for GRPO once entropy instrumentation is live.

(c) Monolithic GRPO vs. milestone-shaped GRPO

Option 1 — monolithicOption 2 — milestone-shaped
What it isTerminal flag reward only, unchanged, once GRPO/RLVR startsPotential-based shaping F(s,a,s') = γΦ(s') − Φ(s) layered on top of (never instead of) the terminal reward, where Φ = a monotone running-max count of deterministically-verified stage completions (Ng/Harada/Russell, ICML 1999 — policy-invariant by theorem)
CostNone beyond baseline GRPO infraMedium — stage_oracle.json authoring (reuses fork (a)’s work if already done), Φ must be a running max (not instantaneous), and defined identically across every termination path (stop_reason{stop, max_turns, error}) or the invariance proof breaks
Risk (reward-hacking)Risk of leaving real gains on the table if the funnel is genuinely F1-dominated — MiRA’s 6.4%→43.0% WebArena-Lite result is the strongest existence-proof in this book that flag-only reward can leave a large gap, though that’s a web-navigation result, not CTF (arXiv:2603.19685)This is where the thick, convergent reward-hacking literature lives — PURE/Stop Summation (arXiv:2504.15275), Reward Under Attack (arXiv:2603.06621), Gao et al. (arXiv:2410.15115), PRIME’s own admission (arXiv:2502.01456), MONA (arXiv:2501.13011). Every one of these converges on: the moment a stage check becomes anything softer than deterministic ground-truth, it gets farmed. This project’s own confirmed lesson (SFT-induced FLAG{} confabulation from a loose format-matcher) is the small-scale preview of the same failure mode
Information valueN/A — this is the default, not an experimentHigh if built correctly — this is the only mechanism in the whole menu that’s a theorem, not an empirical bet, provided the two subtleties are respected
What must be TRUE before building Option 2(i) fork (a)’s funnel shows an F1 (exploration)-dominated bottleneck, not F2/F3; (ii) the scale-dependence check confirms the base policy is genuinely capacity-limited rather than already-capable-but-unreliable — arXiv:2603.21972 found staged reward helps weak models only, larger models converge fine on outcome-only reward; (iii) a deterministic oracle exists for the stage being shaped — explicitly excluding stage 3 (vuln identification), which VPR’s own authors flag as the “open, unstructured” regime their method doesn’t yet solve (arXiv:2605.10325)Default — no gate needed

Verdict pressure: stay monolithic until the funnel says otherwise. If it does, build the narrow, theorem-backed version — never a learned/LLM-judge per-stage reward, under any circumstance.

(d) Tool-use fix for the curl-preference (SFT/DPO/KTO)

Option 1 — elicitation ladder firstOption 2 — jump straight to a training-time fix
What it isEscalate cheapest→most-expensive: few-shot prompt with 2–3 correct-usage examples → light SFT on a handful of demonstrated-usage trajectories → only then consider RL-level interventionGo directly to DPO/KTO on tool-choice pairs, or ToolRL-style decomposed per-call reward, without testing whether elicitation alone recovers the behavior
CostVery cheap — the few-shot test is nearly free; SFT-demo step is cheapMedium — true DPO needs k≥2 same-challenge same-model divergent pairs (only the PD26 k=5 canonical sweep qualifies today; the larger gym pools are k=1); KTO-native data (unpaired success/failed splits) is free and ready today
Risk (reward-hacking / wasted engineering)Low — but a self-reinforcing trap exists regardless of which rung you’re on: a rejection-sampling corpus built from the current curl-biased policy will never contain a dead tool succeeding, because the policy never tried it — RL alone has ~zero probability mass to reinforce those tools without forced/hinted exposure first (Tool-Star, arXiv:2505.16410)Building ToolRL/DPO machinery for what might be a pure elicitation gap — Greenblatt et al.’s password-locked-model finding says a few high-quality SFT demonstrations are often sufficient to fully elicit a locked capability (arXiv:2405.19550); over-engineering here is real opportunity cost, not just aesthetic
Information valueHigh and cheap — turns “the model prefers curl” from anecdote into a falsifiable, staged experiment (framework.md §5)Lower until the ladder has been run — you don’t yet know which rung actually recovers the behavior
What must be TRUE before escalating past few-shot(a) few-shot prompting fails to recover tool usage on held-out challenges; (b) SFT on a small demonstrated-usage set also fails to recover it → only then is it a genuine missing-affordance problem calling for Tool-Star-style forced exposure + ToolRL-style decomposed reward (arXiv:2504.13958)N/A

Verdict pressure: run the ladder. Don’t skip to DPO/ToolRL on a hunch — the cheapest rungs have direct, citable precedent for “this alone is often sufficient,” and skipping them risks building infrastructure for a gap that a two-line prompt change would have closed.

(e) Exploration-emphasis vs. execution-emphasis — routed by the funnel

Option 1 — execution-emphasisOption 2 — exploration-emphasis
What it isInvest in trajectory curation, more/better rejection-sampling SFT data, DAPO’s clip-higher + dynamic sampling as the GRPO baseline, GiGPO step-level creditOn-policy RL first, not SFT, for the segment the funnel flags as F1/F4-dominant; DIVER/tool-sequence diversity bonus, curiosity bonus (CDE), parameter-space-noise pilot, periodic reference-policy resets (ProRL) for genuine boundary expansion
CostLower — DAPO is an established, widely-adopted recipe; trajectory curation reuses existing dataHigher — several of these techniques are RL-infra-dependent and Promising-not-validated (PSN-RLVR, DIVER, HiPER/hindsight credit assignment for a CTF-shaped domain)
RiskIf the funnel is actually F1-dominant, more SFT on the same recipe teaches guessing-and-hoping more confidently, not more competence (LIMO’s framing, arXiv:2502.03387)If the funnel is actually F2/F3-dominant, exploration machinery is solving a problem that doesn’t exist here and burns the RL-infra budget on the wrong axis — the entropy-collapse mechanism these fixes target (arXiv:2505.22617) is real but doesn’t help a policy that’s exploring fine and just executing unreliably
Information valueThis is literally what the funnel is for. Not a taste choice — a routed decisionSame
What must be TRUE before routingFunnel result from fork (a); the scale-check (arXiv:2603.21972); the Pass@(k,T) crossover-direction test on the specific segment (fork b) — does trained pull away from base at large k (execution), or does matched-data SFT regress it while RL expands it (exploration)?Same gate, opposite branch

Verdict pressure: this fork cannot be decided from priors or literature alone — by design, it is the output of forks (a) and (b), not an independent choice. If you find yourself picking an emphasis before the funnel exists, you are guessing, and the guess has better-than-even odds of being wrong given the project’s own “likely execution, unproven” framing in §1.


3. Dependency order — a DAG, not a timeline

This is deliberately not a schedule. It shows what unlocks what — several branches can run in parallel, and nothing downstream of “instrument” is safe to start before its own inputs exist.

flowchart TD
  subgraph INSTRUMENT["INSTRUMENT — near-free, read-only, do first"]
    I1["stagescan.go + stage_oracle.json\nper-challenge, F1-F4 tagger"]
    I2["flag_verified true byte-compare\n(replace retrieved-provenance proxy)"]
    I3["entropy logging wired,\nready from RL step 0"]
    I4["elicitation-ladder harness:\ntool-usage histogram across 40 sectools"]
  end

  subgraph MEASURE["MEASURE — diagnostic, no training changes"]
    M1["Segment 1000 challenges:\nsingle-shot vs sequentially-gated"]
    M2["Pass@(k,T): base vs current checkpoint,\nper segment (arXiv:2604.14877)"]
    M3["Cover@tau per challenge\n(arXiv:2510.08325) — guessing vs reliable"]
    M4["Base-model pass@k control\n(arXiv:2504.13837)"]
    M5["Scale-check: weak/capacity-limited\nvs already-capable-unreliable\n(arXiv:2603.21972)"]
    M6["Elicitation ladder run:\nfew-shot -> SFT-demo -> RL"]
  end

  subgraph ROUTE["ROUTE — the fork decisions (Section 2)"]
    RA["Fork a: ALREADY DECIDED\n(instrument first)"]
    RB["Fork b: SFT-now vs measure-first\nper segment"]
    RC["Fork c: monolithic vs\nmilestone-shaped GRPO"]
    RD["Fork d: elicitation vs\ntraining-time tool fix"]
    RE["Fork e: exploration- vs\nexecution-emphasis"]
  end

  subgraph TRAIN["TRAIN — the actual runs"]
    T1["Rejection-sampling SFT\non single-shot segment, curated\n(STaR / ReST-EM pattern)"]
    T2["GRPO + DAPO baseline\n(clip-higher, dynamic sampling)"]
    T3["+ GiGPO step-level credit\n(zero extra rollouts)"]
    T4["+ potential-based milestone\nshaping (gated, fork c only)"]
    T5["ToolRL / Tool-Star forced\nexposure (gated, fork d only)"]
  end

  subgraph GRADUATE["GRADUATE — the go/no-go gates"]
    G1["Entropy collapsed\nAND pass@64 non-trivial\n(arXiv:2510.01624)"]
    G2["Semantics-preserving-transform\nrobustness check survives\n(arXiv:2502.07445 / 2503.02296)"]
    G3["Base-pass@k control still\ntrails trained pass@k\n(gain is real, not elicitation)"]
    G4["pass@large-k did NOT shrink\npost-RL (RL-PLUS check,\narXiv:2508.00222)"]
  end

  I1 --> M1
  I2 --> M4
  I3 --> G1
  I4 --> M6

  M1 --> M2
  M2 --> M3
  M2 --> M4
  M2 --> M5

  M1 --> RB
  M2 --> RB
  M3 --> RB
  M5 --> RC
  M1 --> RC
  M6 --> RD
  RB --> RE
  M5 --> RE

  RB --> T1
  RC -->|"F1-dominant, weak policy"| T4
  RC -->|"F2/F3-dominant"| T2
  RD -->|"elicitation recovers it"| T1
  RD -->|"neither recovers it"| T5
  RE --> T2
  T1 --> T2
  T2 --> T3
  T3 --> T4
  T3 --> T5

  T2 --> G1
  T4 --> G1
  T5 --> G1
  G1 --> G2
  G2 --> G3
  G3 --> G4
  G4 -->|"holds"| Ship["Credit the gain.\nGeneralize to the next segment /\nchallenge subset"]
  G4 -->|"fails"| Back["Back to MEASURE —\nre-run funnel, re-check scale,\ndo not re-train blind"]

  classDef inst fill:#132b22,stroke:#34d399,color:#eafaf3;
  classDef meas fill:#0f2a3d,stroke:#38bdf8,color:#e6f6ff;
  classDef route fill:#3a2e14,stroke:#f5b942,color:#fff6e0;
  classDef train fill:#2a1438,stroke:#c084fc,color:#f3e8ff;
  classDef grad fill:#3a1414,stroke:#f87171,color:#fde8e8;
  class I1,I2,I3,I4 inst;
  class M1,M2,M3,M4,M5,M6 meas;
  class RA,RB,RC,RD,RE route;
  class T1,T2,T3,T4,T5 train;
  class G1,G2,G3,G4 grad;

Read this as: nothing in TRAIN is safe to start before its ROUTE gate fires, and nothing in ROUTE is safe to decide before its MEASURE inputs exist. INSTRUMENT is the only stage with no prerequisites — which is why fork (a) has no real counter-argument.


4. Open hypotheses to test

These are falsifiable, in the sense the diagnosis chapter insists on: each has a stated experiment and a stated result that would kill it. This is the “prove it to myself” frame, not a checklist to complete once — re-run per challenge segment as the corpus grows.

#HypothesisExperimentFalsified if
H1It’s execution, not knowledge, on the non-sequentially-gated segmentBase-model pass@k at large k (64, 256) on currently-failing single-shot challenges — does the correct action ever appear?The correct action never appears at any N on any checkpoint for a large fraction of these — that’s a knowledge gap for that subset, requiring off-policy injection (demonstration, teacher, or a tool), not more RL
H2Milestone shaping helps, doesn’t hackIntroduce potential-based shaping (fork c, gated) on the F1-dominant segment only; track held-out flag_verified rate and pass@large-k before/afterHeld-out flag rate drops, or pass@large-k shrinks post-introduction (capability-boundary collapse, arXiv:2508.00222) — either result means the shaping term is being farmed, revert to monolithic immediately
H3The horizon is tractable for GRPO at the 30–60% baseline bandRun DAPO+GiGPO on challenges the funnel tags F2/F3-dominant, in the 30–60% pass-rate band; watch entropy from step 0Entropy still collapses under DAPO’s own fixes, or stage-transition credit doesn’t concentrate on the exploitation phase specifically (GiGPO’s state-hash groups show flat credit) — means the horizon/credit-assignment problem is harder than the established recipe assumes for this task shape
H4A real exploration gap exists, localized to the sequentially-gated segmentReplicate Zhai et al.’s crossover-direction test on this project’s own compositional-segment challenges: does matched-data SFT regress pass@(k,T) on this segment while GRPO expands it?If SFT does not regress this segment (both SFT and RL improve it comparably), the sequential-gating framing doesn’t transfer to this task family, and the single-shot ordering (SFT then GRPO) is fine everywhere — fork (b)/(e)’s special-casing was unnecessary
H5Tool-avoidance (curl-preference) is elicitation, not a missing-affordance problemRun the elicitation ladder (fork d) on a sample of the 26 dead sectools entries on held-out challengesNeither few-shot prompting nor light SFT-on-demos recovers usage — genuinely a missing-affordance problem, escalate to Tool-Star forced exposure + ToolRL decomposed reward
H6Any claimed solve-rate gain reflects real execution-reliability improvement, not memorization/elicitationBase-model pass@k-at-large-k control (H1’s instrument, reused) and semantics-preserving-transform variants of a held-out subset, checked against every claimed gain before crediting itThe gain evaporates on either check — the gain is elicitation (fine to attribute to SFT, a red flag if it persists after GRPO) or memorization of the fixed ~10 canonical PD26 shapes
H7The SFT go/no-go gate (entropy collapse) is sufficient on its ownCheck whether pass@64 on the rejection-sampling-SFT checkpoint is non-trivial at the same time entropy collapses, before green-lighting GRPO (arXiv:2510.01624)Entropy has collapsed but pass@64 is flat/low — this predicts a disappointing GRPO run regardless of how good SFT accuracy looked; do not launch on entropy-collapse alone

5. What we deliberately are NOT basing this on

Standing project rule, restated for this chapter specifically: no fork, no hypothesis, no cost/risk estimate, and no number above rests on an academic cybersecurity-LLM training or benchmark paper — CTF-Dojo, Cyber-Zero, Pentest-R1, HackSynth/Random-Crypto, AutoPenBench, Cybench, NYU CTF Bench, EnIGMA, InterCode-CTF, DRLRM-PT, node-fragility reward shaping, the kill-chain-staged-reward paper, Nakano’s ATT&CK-tree scaffold, or Honarvar’s Evolve-CTF/Capture-the-Flags family-based evaluation — even where several of these report a finding that would superficially support one side of a fork here. None of that line of work has produced a frontier cybersecurity model, so none of it counts as frontier evidence for a load-bearing decision; every mention of them in the six source chapters this capstone draws from is explicitly labelled “academic, cited for context only, not a basis,” and this chapter inherits that discipline rather than re-importing their numbers under a different heading. Every claim above is re-grounded on one of: general frontier post-training disclosures (DeepSeek-R1, Kimi k1.5/K2, Llama 4, OpenAI Deep Research), general RL/agent theory (potential-based shaping, the reward-hacking convergence, DAgger, GAE, entropy-collapse mechanics), general (non-security) agent-eval and long-horizon literature (METR, AgentBoard, MAST, τ-bench, GSM-Symbolic/C-BOD), or this project’s own measured data and confirmed lessons (the SFT-induced FLAG{} confabulation, the Phoenix trace corpus, the existing pass@k methodology). Where a demoted academic-security idea is still worth pursuing on its own merits — e.g. staged/kill-chain-shaped reward in a cybersecurity-specific loop — the chapters this one draws from say so explicitly and flag it “worth pursuing — unvalidated outside academic-security work,” never as settled ground.


References

Every id below was crawl-verified during the sessions that built this book (title/authors/date confirmed on the arXiv abstract page). Lab blogs/tech reports are linked to their source. Citation counts are unreliable for <18-month-old work — venue/lab-report presence is the stronger signal.

Foundations & imitation

Preference

Reinforcement

Long-horizon & multi-turn agentic RL — credit assignment across turns, not tokens

  • GiGPO (step-level advantage from state-hash-matched steps across rollouts, zero extra rollouts) — arXiv:2505.10978
  • ArCHer (two-timescale: off-policy turn-level critic + on-policy token-level PG) — arXiv:2402.19446
  • RAGEN / StarPO (multi-turn agentic RL framework, state-thinking-action loop) — arXiv:2504.20073
  • Turn-Level Reward Design (dense per-turn reward layered under a terminal reward) — arXiv:2505.11821
  • Turn-PPO (turn as the MDP unit, not token or trajectory) — arXiv:2512.17008
  • Demystifying RL for Long-Horizon Tool-Using Agents (5-axis systematic ablation: reward/scale/data/algorithm/environment) — arXiv:2603.21972
  • Verlog (dual-discount GAE, memory-windowing, validated to 400+ turn episodes) — no arXiv id, cite OpenReview:GmodkWwMV3
  • Kimi k1.5 (128k-context RL via partial-rollout checkpoint/resume, no MCTS/value-fn/PRM) — arXiv:2501.12599
  • AgentGym-RL / ScalingInter-RL (horizon curriculum: short turn cap expanding to full budget over training) — arXiv:2509.08755
  • MUA-RL (trains against a dynamic, LLM-simulated counterpart instead of a static script) — arXiv:2508.18669
  • HiPER (hierarchical credit assignment) — arXiv:2602.16165 · Hindsight Credit Assignment for Long-Horizon LLM Agents — arXiv:2603.08754
  • RL-PLUS (names “capability boundary collapse” — pass@k at large k dropping even as pass@1 rises under RLVR) — arXiv:2508.00222

Hierarchical RL, decomposition & potential-based reward shaping — “one problem, or many?”

  • Sutton — “The Bitter Lesson” (hand-built structure plateaus, general search+learning wins at scale; intellectual ancestor of the monolithic-outcome-RL case) — no arXiv id, incompleteideas.net (2019)
  • OpenAI Deep Research system card (long-horizon tool-using agent trained end-to-end on outcome/rubric reward; “end-to-end training beats manual orchestration”) — no arXiv id, OpenAI system card
  • Options / SMDP framework (Sutton, Precup, Singh — the seminal HRL / temporal-abstraction paper) — Artificial Intelligence 112 (1999), no arXiv id, DOI 10.1016/S0004-3702(99)00052-1
  • FeUdal Networks (Manager/Worker HRL, fixes option-collapse) — arXiv:1703.01161
  • HIRO (off-policy correction for HRL non-stationarity / subgoal ceiling-capping) — arXiv:1805.08296
  • Ng, Harada & Russell — potential-based reward shaping, F(s,a,s')=γΦ(s')−Φ(s) provably policy-invariant — ICML 1999, no arXiv id (predates arXiv’s routine ML use), ACM DL 10.5555/645528.657613
  • Müller & Kudenko (PBRS practical effectiveness still depends on potential scaling) — arXiv:2502.01307
  • RUDDER (learned, return-equivalent reward redistribution — a learned alternative to hand-specifying Φ) — arXiv:1806.07857
  • Go-Explore (pure outcome RL structurally fails on sparse/deceptive long-horizon tasks without explicit remember-and-return exploration) — arXiv:1901.10995 / Nature s41586-020-03157-9
  • OpenAI Five (long-horizon precedent needed huge scale + a per-frame shaped reward, not a single terminal bit) — arXiv:1912.06680
  • Credit Assignment survey (separates credit-assignment variance from exploration burden) — arXiv:2312.01072
  • MiRA (milestone-based dense reward; Gemma3-12B WebArena-Lite 6.4%→43.0%, beating WebRL/GPT-4-Turbo) — arXiv:2603.19685
  • Verifiable Process Rewards / VPR (safe ground-truth checklist process reward; own caveat that open/unstructured stages remain unsolved) — arXiv:2605.10325
  • CM2 (checklist-style verifiable sub-criteria reward) — arXiv:2602.12268
  • Curriculum Learning (Bengio et al. — foundational, order training by difficulty, touches no reward function) — ICML 2009, no arXiv id
  • h1 (curriculum + pure outcome-only reward yields an exponential sample-complexity gain) — arXiv:2510.07312
  • FastCuRL (context-length curriculum, entropy-collapse timing) — arXiv:2503.17287
  • BPO (curriculum + rejection-sampling refine; vanilla GRPO on sparse reward gains only marginally without curriculum) — arXiv:2508.03018
  • TIPS (turn-level potential shaping for search-augmented LLMs — shaping machinery directly on-point, domain is not) — arXiv:2603.22293
  • Randlov & Alstrom — the canonical non-potential-based “bicycle shaping” failure (agent farms a looks-like-progress bonus instead of reaching the goal) — ICML 1998, no arXiv id
  • Kill-chain-staged reward (cyber-defense red-teaming) (academic cybersecurity-LLM work — cited for context only, not a basis)arXiv:2605.17075 (May 2026)
  • DRLRM-PT (reward machine over kill-chain phases, classical/non-LLM pentest RL) (academic — cited for context only, not a basis; explicitly named in the project’s standing rule)arXiv:2405.15908 / DOI 10.1109/ijcnn60899.2024.10650368
  • Node-fragility reward shaping (classical dense-reward pentest, non-LLM regime) (academic — cited for context only, not a basis) — DOI 10.3390/electronics13214311

Tool-integrated / tool-use RL — the direct BSides pattern-1 (tool avoidance) fixes

  • ReTool (trajectory-level tool-integrated RL) — arXiv:2504.11536
  • ToRL (tool-integrated RL, math) — arXiv:2503.23383
  • Search-R1 (RL for search-agent tool use) — arXiv:2503.09516
  • ToolRL (fine-grained, decomposed per-call tool-selection reward) — arXiv:2504.13958
  • Tool-Star (forced exposure to under-used tools via multi-tool synthesis pre-RL) — arXiv:2505.16410
  • Tool Preferences in Agentic LLMs are Unreliable (diagnosis of pattern-1-shaped tool avoidance) — arXiv:2505.18135

Exploration & entropy collapse

  • The Entropy Mechanism of RL for Reasoning LMs (R = -a·e^H + b; Clip-Cov/KL-Cov fixes) — arXiv:2505.22617
  • Beyond the 80/20 Rule (top-20%-entropy “forking tokens” carry nearly all exploration signal) — arXiv:2506.01939
  • Reasoning with Exploration: An Entropy Perspective — arXiv:2506.14758
  • Representation-Based Exploration for Language Models (hidden-state diversity bonus, usable at inference time) — arXiv:2510.11686
  • Pass@k Metric for RLVR: A Diagnostic Tool of Exploration, But Not an Objective — arXiv:2511.16231
  • Spurious Rewards (RLVR gains on Qwen2.5-Math nearly as large with completely wrong rewards; model-family-dependent) — arXiv:2506.10947
  • Absolute Zero (self-play task-proposal + solve, zero external labeled data) — arXiv:2505.03335
  • LIMO (817 curated SFT examples beat >100k loosely-curated ones — SFT as cognitive templates, not knowledge source) — arXiv:2502.03387
  • Test-time compute scaling (Snell et al., difficulty-adaptive allocation matches a 14x larger model) — arXiv:2408.03314 · o3-mini vs o1-mini (accuracy without longer CoT) — arXiv:2502.15631
  • OpenAI o1 System Card (methodology precedent, cited across the field) — arXiv:2412.16720
  • Reward-hacking-under-RL cluster: Specification Gaming in Reasoning Models — arXiv:2605.02269 · LLMs Gaming Verifiers (extensional vs intensional correctness) — arXiv:2604.15149 · Reward Hacking in the Era of Large Models (Proxy Compression Hypothesis) — arXiv:2604.13602
  • Per-step / process-reward hacking convergence (the case against naive per-stage reward): PURE / Stop Summation (summation-form credit assignment “easily induces LLMs to hack steps with high rewards”) — arXiv:2504.15275 · Reward Under Attack (SOTA PRMs as “fluency detectors rather than reasoning verifiers”) — arXiv:2603.06621 · Gao et al. (learned PRM/ORM + success reward can hurt vs success-only) — arXiv:2410.15115 · PRIME (authors’ own admission that process labels are “prohibitively expensive,” PRMs vulnerable to hacking) — arXiv:2502.01456 · MONA (multi-step reward hacking even when no single step looks bad to an overseer) — arXiv:2501.13011

Self-correction & tool-use self-correction RL

  • SCoRe — Training LMs to Self-Correct via RL (reward for improvement, not final correctness) — arXiv:2409.12917
  • From Correction to Mastery (earliest-error RL, distinct “SCoRe”) — arXiv:2509.14257

Domain-specialization lineages — the frontier recipe (code / math / medical)

  • Kaplan et al. — Scaling Laws for Neural Language Models — arXiv:2001.08361
  • Hoffmann et al. — Chinchilla, compute-optimal scaling — arXiv:2203.15556
  • phi-1 (textbook-quality data substitutes for ~100x scale in a narrow domain) — arXiv:2306.11644
  • DeepSeek-Coder V2 (abandons from-scratch pretraining, continues from a strong base) — arXiv:2406.11931
  • Qwen2.5-Math (co-evolves RM + SFT data across rounds before RL, reuses RM for best-of-N at inference) — arXiv:2409.12122
  • DeepSeek-Prover-V2 (subgoal-decomposed cold-start data + kernel-verified RL) — arXiv:2504.21801
  • OLMo 2 (mid-training as a named 5-10% FLOPs bridge stage) — arXiv:2501.00656
  • Mid-training mechanism study (outperforms CPT-alone at matched budget, reduces catastrophic forgetting before SFT) — arXiv:2510.14865
  • Med-PaLM v1 (mentioned, not a basis for cyber claims — prompt-tuning-only ceiling, motivates v2)arXiv:2212.13138
  • Med-PaLM 2 (domain instruction fine-tuning + ensemble refinement) — arXiv:2305.09617
  • MedGemma (domain VLM pretraining + task fine-tuning, explicitly not clinical-grade alone) — arXiv:2507.05201
  • Full fine-tuning vs. LoRA on code/math domain-skill acquisition (10-100x effective-rank gap) — arXiv:2405.09673
  • Benchmark contamination (GSM8K/MMLU scores inflated up to 22.9%/19.0%) — arXiv:2406.13990
  • LiveCodeBench (time-segmented, contamination-resistant-by-construction) — arXiv:2403.07974
  • GPQA (“Google-proof” QA) — arXiv:2311.12022
  • Tülu 3 (decontamination as a first-class deliverable; primary public naming of RLVR) — arXiv:2411.15124
  • Emerging RL-environment-scale bottleneck framing (moderate confidence, new/low-citation) — arXiv:2511.09586
  • SIMA (scalable instructable multiworld agent) — arXiv:2404.10179
  • SIMA 2 (self-generated tasks + rewards via Gemini) — arXiv:2512.04797

Adjacent-domain structural transfer — coding agents, competitive programming, theorem proving, web agents, games, robotics

  • SWE-agent / Agent-Computer Interface (fixed action set + concise feedback lifts pass@1 pre-RL) — arXiv:2405.15793
  • SWE-Gym (executable-environment SFT trajectories) — arXiv:2412.21139
  • R2E-Gym — arXiv:2504.07164
  • SWE-RL (execution-verified reward beats a difflib patch-similarity fallback) — arXiv:2502.18449
  • o1→o3 coding RL — arXiv:2502.06807
  • Progressive context/turn-budget curriculum for long-horizon RL — arXiv:2508.03501
  • SWE-Master (mask environment-feedback tokens out of the SFT loss, low-confidence/very recent) — arXiv:2602.03411
  • DeepSeek-Coder v1 (from-scratch pretraining, the lone exception in the lineage) — arXiv:2401.14196
  • StarCoder2 / The Stack v2 (curation quality substitutes for parameter count — data-pipeline lesson only) — arXiv:2402.19173
  • StarCoder — arXiv:2305.06161
  • CodeRL — arXiv:2207.01780
  • PPOCoder — arXiv:2301.13816
  • RLTF — arXiv:2307.04349
  • StepCoder — arXiv:2402.01391
  • RLEF (turn-level value function over a multi-turn POMDP; Meta FAIR, ICML 2025 spotlight) — arXiv:2410.02089
  • Sailor (SEA-language CPT) — arXiv:2404.03608
  • SEA-LION — arXiv:2504.05747
  • LLaMA Beyond English — arXiv:2401.01055
  • Tokenizer/vocabulary coverage as an architectural precondition — arXiv:2406.11477
  • BLOOM+1 (adding a new language to the SFT mixture beats continued pretraining) — arXiv:2212.09535
  • Aya — arXiv:2402.07827
  • AlphaCode (sampling breadth + cheap filter) — arXiv:2203.07814
  • GrandCode / Agentic GRPO (purpose-built GRPO variant for delayed reward + off-policy drift; single team, very recent) — arXiv:2604.02721
  • AlphaGeometry (Nature 2024; synthetic self-play manufactures its own training problems) — DOI 10.1038/s41586-023-06747-5
  • AlphaProof (Nature 2025; AlphaZero-style self-play/search on top) — DOI 10.1038/s41586-025-09833-y
  • WebGPT (learned human-preference reward; origin of the SFT-cold-start-then-RL recipe shape, flagged OOD-weak by its own authors) — arXiv:2112.09332
  • WebRL (self-evolving curriculum generated from the model’s own unsuccessful attempts; ICLR 2025) — arXiv:2411.02337
  • R1-Searcher (sequential, not summed, two-stage tool-use reward) — arXiv:2503.05592
  • DeepResearcher (training end-to-end in the real live environment is “a fundamental requirement”; EMNLP 2025) — arXiv:2504.03160
  • AlphaStar (Nature; league-based diverse self-play population fixes strategy collapse) — DOI 10.1038/s41586-019-1724-z
  • NetHack (honest “still unsolved” calibration point) — arXiv:2006.13760
  • HER — Hindsight Experience Replay (relabel a failed trajectory as the goal it accidentally satisfied; NeurIPS 2017) — arXiv:1707.01495
  • Firestone — competence vs. performance (formal/functional split; independently reconfirmed by the particular-language literature) — PMC7604508
  • KLong (progressive horizon curriculum, second converging source; low-confidence) — arXiv:2602.17547
  • BEACON (2026 long-horizon credit-assignment cluster; low-confidence individually) — arXiv:2605.06078
  • Ecpo (2026 long-horizon credit-assignment cluster; low-confidence individually) — arXiv:2606.05885

CTF / pentest RL environments

(academic, cited for context — not a basis for our decisions; see the stance in Contested edges)

  • CTF-Dojo (486 verified trajectories, 31.9% pass@1 credibility yardstick) — arXiv:2508.18370
  • Cyber-Zero (monolithic outcome-RL, simulated env, +13.1%) — arXiv:2508.00910
  • Pentest-R1 (two-stage RL for CTF methodology) — arXiv:2508.07382
  • HackSynth (crypto-CTF GRPO) — arXiv:2506.02048
  • InterCode-CTF (seminal monolithic-reward CTF environment) — arXiv:2306.14898
  • Cybench (subtask decomposition, eval-only) — arXiv:2408.08926
  • AutoPenBench (milestone taxonomy near-matching this book’s F1–F4 split, eval-only) — arXiv:2410.03225
  • NYU CTF Bench (CTF benchmark family) — arXiv:2406.05590
  • EnIGMA (“soliloquizing” fabrication failure mode, ICML 2025) — arXiv:2409.16165
  • Guided Reasoning via Structured Attack Trees (deterministic ATT&CK-derived task tree, +5x subtask completion on the same weights) — arXiv:2509.07939
  • From Capabilities to Performance (pentesting ablations) — arXiv:2509.14289
  • PentestAgent (RAG-fix framing of a knowledge gap; contested against the scaffolding/execution readings above) — arXiv:2411.05185
  • Capture the Flags: Family-Based Evaluation via Semantics-Preserving Transformations (CTF-specific robustness benchmark) — arXiv:2602.05523
  • What Makes a Good LLM Agent for Real-world Penetration Testing? (Task Difficulty Assessment + Evidence-Guided Attack Tree Search) — arXiv:2602.17622

Agent benchmarks & failure taxonomies

  • τ-bench (fault-assignment × fault-type taxonomy) — arXiv:2406.12045
  • AgentRx (localizes the single critical failure step in a long trajectory) — arXiv:2602.02475
  • AgentBoard (“progress rate” metric — general, non-security capability-decomposition principle, NeurIPS 2024 Oral) — arXiv:2401.13178
  • MAST (14-mode/3-category multi-agent failure taxonomy, κ=0.88) — arXiv:2503.13657
  • AgentErrorTaxonomy / AgentDebug (root-cause diagnosis alone, no reward change, buys +24% all-correct accuracy) — arXiv:2509.25370
  • Phase-aligned taxonomy for autonomous agents (independent-domain convergence on a phase-keyed failure split) — arXiv:2508.13143

Capability boundary, elicitation & sandbagging (contested)

  • Yue et al. — RL elicits, not expands — arXiv:2504.13837 (2025-04-18)
  • ProRL — prolonged RL expands — arXiv:2505.24864
  • Cohen-Inger et al. — “LLMs are Like a Chameleon” (benchmark scores mask overfitting; semantics-preserving perturbation robustness check) — arXiv:2502.07445 (2025-02-11)
  • Zhang et al. — “Memorize or Generalize?” (Memorization Risk Index via semantic-perturbation code rewriting; companion robustness-check citation) — arXiv:2503.02296 (2025-03-04)
  • PSN-RLVR — arXiv:2602.02555 · NuRL — arXiv:2509.25666 · CoT-Pass@K (Wen et al., RLVR implicitly incentivizes correct reasoning) — arXiv:2506.14245 (2025-06-17)
  • Scalpel vs Hammer (GRPO amplifies, SFT replaces) — arXiv:2507.10616
  • Zhai et al. — Does RL Expand the Capability Boundary of LLM Agents? Pass@(k,T) — arXiv:2604.14877 (2026-04-16)
  • Dragoi et al. — Beyond Pass@k: Breadth-Depth Metrics / Cover@τ — arXiv:2510.08325 (2025-10-09)
  • Kang et al. — Quagmires in SFT-RL Post-Training — arXiv:2510.01624 (2025-10-02)
  • Chen et al. — The Coverage Principle — arXiv:2510.15020 (2025-10-16)
  • Greenblatt et al. — Stress-Testing Capability Elicitation with Password-Locked Models — arXiv:2405.19550 (2024-05-29)
  • Hofstätter et al. — The Elicitation Game — arXiv:2502.02180 (2025-02-04)
  • van der Weij et al. — AI Sandbagging — arXiv:2406.07358 (2024-06-11)
  • Ryd et al. — Removing Sandbagging via Weak Supervision — arXiv:2604.22082 (2026)
  • Stroebl et al. — Inference Scaling fLaws — arXiv:2411.17501 (2024-11-26)
  • Dorner et al. — ROC-n-reroll — arXiv:2507.12399 (2025-07-16)
  • Huang et al. — Is Best-of-N the Best of Them? — arXiv:2503.21878 (2025-03-27, ICML 2025)
  • Mahowald et al. — Dissociating Language and Thought in LLMs — arXiv:2301.06627 (2023-01-16)
  • He et al. — LLMs as Neurolinguistic Subjects — arXiv:2411.07533 (2024-11-12)
  • Boháček et al. — Uncovering Competency Gaps (sparse autoencoders on internal representations) — arXiv:2512.20638 (2025-12-06)

PEFT

Frontier lab recipes (reports & blogs) — full-year refresh, 2025-07 → 2026-07, all 10 tracked labs

Llama (historical anchor): Llama 3 — arXiv:2407.21783 · Llama 4 — ai.meta.com/blog/llama-4-multimodal-intelligence

Anthropic (Claude) — Constitutional AI/RLAIF backbone arXiv:2212.08073; inoculation prompting arXiv:2510.04340, Anthropic’s own study arXiv:2511.18397; release posts/system cards — Opus 4.1 · Sonnet 4.5 card · Sonnet 4.5 · Sonnet 4.5 research · Haiku 4.5 card (PDF) · Haiku 4.5 · Opus 4.5 card · Opus 4.5 research · Opus 4.5 card walkthrough (secondary) · inoculation prompting post · emergent misalignment / reward hacking · “teaching Claude why” post · research page · Opus 4.6 · Opus 4.6 sabotage risk report (PDF) · Sonnet 4.6 · Opus 4.7 · Opus 4.8 · Fable 5 / Mythos 5 · Fable 5 & Mythos 5 card (PDF) · Mythos guardrails coverage (secondary) · Sonnet 5 · Sonnet 5 card · Sonnet 5 launch coverage (secondary) · Sonnet 5 launch guide (secondary) · AI organizations post

OpenAIGPT-5 (2025-08-07) · GPT-5 system card (PDF) · GPT-5 for developers · safe-completions arXiv:2508.09224 · safe-completions post · GPT-5-Codex addendum · Codex system card (PDF) · GPT-5.1 · GPT-5.1 deployment safety · routing/model-choice post (secondary) · GPT-5.1-Codex-Max · Codex-Max system card · Codex-Max safety training · long-horizon Codex tasks · GPT-5.2 · GPT-5.2 for science/math · GPT-5.2 system-card update · GPT-5.2-Codex · GPT-5.3-Codex · 5.3-Codex system card · 5.3-Codex coverage (secondary) · GPT-5.4 · GPT-5.4 thinking system card · GPT-5.4 (secondary) · graders docs · RFT guide · RFT use-cases · RFT wind-down, 2026-05-08 · community thread (secondary)

Google DeepMind (Gemini) — Gemini 2.5 tech report arXiv:2507.06261 (HTML, §2.4/2.5 mirror, cross-checked (secondary)) · Deep Think launch · Gemini 3 Pro model card (PDF) · Gemini 3 launch · agent-building with Gemini 3 · Gemini 3 Deep Think · Deep Think update · Gemini 3 Flash · Flash for enterprise · Gemini 3.1 Pro · Gemini 3.5 · Vending-Bench/τ²-bench cross-check (secondary)

xAI (Grok)Grok 4 · Grok 4 model card (PDF) · Grok 4 analysis (secondary) · Grok Code Fast 1 · Code Fast 1 model card (PDF) · Grok 4 Fast · Grok 4 Fast model card (PDF) · coverage (secondary) · coverage (secondary) · Grok 4.1 · Grok 4.1 model card (PDF) · sycophancy coverage (secondary) · Grok 4.1 Fast · news index

Mistral — Magistral arXiv:2506.10910 (HF paper page, benchmark tables) · Ministral 3 arXiv:2601.08584 · Mistral 3 blog · Magistral blog · Mistral-Large-3 card · Magistral-Small-2509 card · Magistral-Small-2507 card · Magistral Medium 1.2 docs

DeepSeek — V3 arXiv:2412.19437 · R1 arXiv:2501.12948 · V3.2 arXiv:2512.02556 (HTML) · V3.1 release · V3.1 card · V3.1-Terminus · V3.1-Terminus card · V3.2-Exp · V3.2-Exp repo · V3.2 / V3.2-Speciale

Qwen (Alibaba) — Qwen3 tech report arXiv:2505.09388 · GSPO arXiv:2507.18071 · Qwen3-Omni arXiv:2509.17765 · Qwen3-VL arXiv:2511.21631 · Qwen3-Coder-Next arXiv:2603.00729 · Qwen3.5-Omni arXiv:2604.15804 · GSPO blog · Qwen3-Coder blog · Qwen3 blog · Qwen3 README · Qwen3-Next efficiency · Qwen3-Max · Qwen3-Max-Thinking · Qwen3.5 blog · Qwen3.5-397B-A17B card · Qwen3.7-Max “agent frontier” · Qwen3.7 blog · The Batch coverage (secondary) · VentureBeat coverage (secondary) · TechSphere coverage (secondary) · Qwen3.6-35B-A3B agentic coding

Moonshot AI (Kimi) — K2 arXiv:2507.20534 (verified via arXiv abs + full HTML crawl) · K2.5 arXiv:2602.02276 (verified via arXiv abs + full HTML crawl) · Kimi-K2-Thinking card (no arXiv paper) · K2 Thinking intro post · Kimi-K2.6 card (no arXiv paper) · K2.6 tech blog · K2.6 benchmark deltas · K2.6 coverage (secondary) · K2.6 method non-disclosure (secondary) · Kimi-K2.5 model card/benchmarks

GLM / Z.ai (Zhipu) — GLM-4.5 arXiv:2508.06471 · GLM-5 arXiv:2602.15763 (HTML) · GLM-4.5 blog · GLM-4.6 blog · GLM-4.7 blog · GLM-5 blog · GLM-5.2 blog · GLM-4.5 repo · GLM-5 repo · slime RL infra repo · GLM-4.7-Flash card · MoE architecture deep-dive (secondary, unverified-primary) · agentic RL post citing GLM-5 report (secondary) · RL infra post citing GLM-5 report (secondary) · GLM-5.2 vs 5.1 (secondary) · GLM-5.2 open-source coverage (secondary) · GLM-4.7-Flash coverage (secondary)

Xiaomi (MiMo) — MiMo-V2-Flash arXiv:2601.02780 · MiMo-Embodied arXiv:2511.16518 · MiMo-VL-Miloco arXiv:2512.17436 · MiMo-Audio arXiv:2512.23808 · MiMo lineage (background only, outside the 12-month window): MiMo arXiv:2505.07608, MiMo-VL arXiv:2506.03569 · MiMo-V2.5-Pro card · MiMo-V2.5 card · MiMo-V2.5-Pro blog · MiMo model-update docs

The verified concept-map + genealogy notes in the shared memory pool (research/post-training-inference-concept-map.md, research/post-training-method-genealogy-onpolicy-offpolicy.md, research/frontier-lab-post-training-recipes-2026.md, research/rl-for-long-horizon-exploration-reasoning.md, research/diagnosing-capability-vs-execution-gap-framework.md) are the machine-readable companions to this book.

How this book grows

This is a living document maintained by the researcher seat of the llmresearch project. It grows one verified topic at a time; each chapter cites its sources so claims are checkable, not assertions.

Conventions

  • Engineer-level. Assumes you know logprobs, KL, advantage, rollouts, MoE, PPO clip. No 101 filler.
  • Cite or don’t claim. Every substantive statement carries an arXiv id or a named lab report/blog. Where something is contested, it’s marked contested with both sides (Contested edges).
  • Honesty about status. Methods are tagged mainstream / niche / promising-not-proven / experimental based on whether a frontier flagship’s report actually uses them.
  • Verified live. arXiv ids are crawl-checked; lab-recipe claims come from 2025–2026 tech reports and blogs, not training recall. Re-verify before betting a run — this field ships weekly.

Build & run locally

# one-time: install the toolchain (macOS)
brew install mdbook mdbook-mermaid
# from the book root:
mdbook-mermaid install .   # vendors mermaid assets + wires the preprocessor
mdbook serve --open        # live-reload server at http://localhost:3000

Mermaid flowcharts and raw HTML/iframes (e.g. the embedded journey) render offline — no CDN required.

Log

  • 2026-07-02 — v0.3. Wired three new chapters into the book: Before you train — instrumentation & data readiness (what the harness already emits vs. the minimal per-stage-verifier gap to instrument, grounded in a direct source read of go/libs/agent/events + the flag-verification pipeline), One problem, or many? — monolithic vs decomposed (the eval-decomposition-vs-training-decomposition split, the potential-based-shaping safety net, the verdict for this project), and Where you are & the forks ahead (the capstone — five forks, a dependency DAG, seven falsifiable hypotheses, placed last before References). Applied the project’s standing no-academic-cybersecurity-LLM-as-research-basis stance across all three: every CTF-Dojo/Cyber-Zero/Pentest-R1/HackSynth/AutoPenBench/DRLRM-PT-style citation is labelled “academic, cited for context — not a basis for our decisions,” with load-bearing claims re-anchored on frontier-lab disclosures, general RL/ML theory, or this project’s own measured data. Merged ~35 new citations into References (new Hierarchical RL/decomposition/reward-shaping section; additions to Exploration & entropy collapse, CTF/pentest RL environments, Agent benchmarks & failure taxonomies, and Capability boundary sections) and added the CTF/pentest-RL section’s context-only header note.

    Later same day — added the frontier north-star section. Wired the two standing capstone chapters into a new top-level section, “Toward a frontier cybersecurity model,” placed after “In practice” and before References: Cybersecurity is one of a family — what cracked the others (the cross-domain structural-analogy survey — six adjacent long-horizon/sparse-reward/verifiable domains and what actually cracked each), The path to a frontier cybersecurity model (the capstone recipe + gap analysis — what “frontier” costs beyond this project’s own portfolio), and moved Where you are & the forks ahead into this new section as its final chapter (out of “In practice”), completing the arc family → path → your forks. Cross-linked roadmap-inputs.md at top and in its Cross-links section to both new chapters. Merged the new chapters’ citations into References via two new sections — “Domain-specialization lineages (code/math/medical)” and “Adjacent-domain structural transfer” — ~65 new arXiv/DOI/PMC ids, deduped against the existing ~250-citation corpus, academic-security entries kept labelled context-only. Salvaged three of the higher-value ideas the standing academic-cybersecurity-LLM stance would otherwise have excluded, by re-grounding each on independent frontier-lab or general-RL-theory evidence instead: (1) staged/kill-chain reward shaping — salvaged as the theorem-backed potential-based form only (Ng-Harada-Russell, ICML 1999), never the flat per-stage bonus academic pentest-RL papers use; (2) subgoal/curriculum decomposition of a long episode — salvaged via DeepSeek-Prover-V2 and AlphaGeometry (cold-start SFT data generation only, never densifying the RL reward itself); (3) failure-corpus-to-curriculum conversion — salvaged via WebRL’s self-evolving curriculum and HER’s relabeling principle (mining flag=0 trajectories for sub-skill SFT data), not any academic CTF-RL paper’s claim.

  • 2026-07-02 — v0.2. New Diagnosis section: Diagnosing the gap — a scientific framework (the pass@k crossover protocol, Cover@τ, sandbagging/elicitation tests — is the ~10-20% k=1 solve rate an execution gap or a knowledge gap, before betting a GRPO run on the answer) and From behavioral audit to training signal (maps the BSides-LV 5-pattern behavioral audit — tool avoidance, no methodology, brittle single-guess, uneven PTES phases, benchmarks-measure-speed-not-thoroughness — onto the specific post-training techniques designed to fix each). New method chapter RL that creates value — long-horizon · exploration · reasoning · novelty, a ~50-paper sweep tagged [L]/[E]/[R]/[N] against this project’s own diagnosis (GiGPO, DAPO/Clip-Cov, ProRL, ReTool/ToRL/Search-R1, CTF-Dojo/Pentest-R1/HackSynth, pass@k-is-diagnostic-not-objective). Extended reinforcement.md and agentic-rl.md with cross-links into the sweep. Rewrote What the frontier labs actually do with a full last-year (2025-07→2026-07) refresh across all 10 tracked labs, filling in previously-thin xAI/Grok, Mistral (Magistral/Ministral), Zhipu/GLM, Xiaomi/MiMo, and deepening Kimi K2→K2.5→K2.6, each now carrying the same [L]/[E]/[R]/[N] tags. Extended Contested edges and The decision with the new capability-boundary and sandbagging/elicitation literature. Merged ~130 newly-cited sources into References.

  • 2026-07-02 — v0.1. Initial build from a live session: the on/off-policy foundation, the method genealogy (imitation/preference/reinforcement + agentic RL + PEFT), verified 2026 frontier-lab recipes, the method→data reframe, the decision tree, and contested edges. Embeds the interactive decision journey.

Backlog (next sessions)

  • A worked example: filtering your ~100–200 solves into a rejection-sampling SFT set (the verified-trajectory pipeline).
  • The reward-function chapter: building an ungameable verify(state) for CTF flags (state, not transcript).
  • Pass@k methodology: per-challenge bucketing before choosing a branch.
  • The train↔inference precision-mismatch rabbit hole (TIS vs FP16), for when you reach GRPO.
  • Harness-shape coupling: ruling out “it’s the scaffold, not the model” before fine-tuning.