Introduction
This is a living field manual for post-training an agent — written for an engineer who wants to know it, then do it, not experiment blindly. It grows: each session adds or sharpens a chapter.
The problem it exists to answer
~1,000 CTF challenges. The agent solves ~100–200 and fails ~800. We have the harness, the workflow, the budget, and the backing to do anything. The bottleneck is not data — it’s knowing which fine-tuning method we want, and therefore what data to build, and why.
Everything here builds toward answering that from first principles.
How to read it
- Want the feel first? → The 5-minute journey — the interactive version, embedded.
- Want the theory? → start at Foundations and go in order.
- Want the answer for your case? → jump to Method → Data and The decision.
- Want to know if you actually have an execution gap, and which RL technique fixes it? → Diagnosing the gap, then the RL sweep at RL that creates value.
What’s canonical vs. what’s a teaching scaffold (read this once)
Being honest about provenance, because you’re becoming a researcher and the distinction matters:
- Canonical, universal, you’ll find it in any RL/post-training text: the on-policy vs. off-policy axis, and the three learning paradigms (imitation / preference / reinforcement). These are load-bearing and not up for debate.
- My teaching scaffold: any packaging that presents these as “N knobs you freely toggle.” The axes describe methods; they are not independent dials you combine — each named method is a fixed preset. An earlier interactive matrix implied free combination and produced nonsense for some products. That was the scaffold over-reaching. Corrected here: learn the one axis (on/off-policy) + the fixed method presets, not a combinatorial grid.
The one line to anchor on
Every method is the same move — push probability mass toward good behavior — differing only on whose distribution the data comes from (off- vs on-policy) and whether you also learn from failures.
Keep that; the rest is detail.
The 5-minute journey
Before the theory, feel the shape of it. This is the interactive walk-through — problem → reasoning → decision — embedded live. Scroll it; click the ladder rungs and answer the decision tool at the end.
If the frame is cramped, open it full-screen:
assets/journey.html
Everything in that frame is expanded, corrected, and cited in the chapters that follow. The frame is the map; the chapters are the territory.
The one axis that predicts everything
If you internalize one thing, make it this: on-policy vs off-policy. It predicts which methods can fix which failures, and it’s the reason a colleague’s SFT run can move the eval by two points and stall.
Definitions (precise)
- Off-policy: training targets are sampled from a distribution other than the model’s current policy π_θ — a human, a teacher model, a frozen dataset. The model raises the likelihood of sequences it did not generate.
- On-policy: the training data is sampled from π_θ itself (the current weights), then scored/labeled. The model learns from its own rollouts.
This is standard RL vocabulary, not a framing I invented — see any policy-gradient treatment; the LLM-specific consequences are laid out formally in the imitation-learning reduction below.
Why off-policy imitation is structurally blind to execution gaps
The mechanism, as a flow:
    graph LR
        A["Train on the teacher's states"] --> B["Deploy: π_θ drives,<br/>visits π_θ's OWN states"]
        B --> C["One slip → a state the<br/>teacher never visited"]
        C --> D["No training signal there<br/>→ error compounds"]
        D --> B
This isn’t hand-waving — it’s a theorem. Ross, Gordon & Bagnell, “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning” (AISTATS 2011, arXiv:1011.0686), Thm 2.1: a behavior-cloned policy with per-step error ε incurs total cost bounded by J(π̂) ≤ J(π*) + ε·T² — quadratic in horizon T, because a single deviation moves you to states off the expert’s distribution where you have no supervision, and errors accrue at up to unit cost for the rest of the episode. Their DAgger correction — aggregate data from the learner’s own induced state distribution — restores near-linear O(ε·T) regret.
Translate to your setting: an execution failure happens, by definition, in a state π_θ reaches on its own. Off-policy data is — structurally — data about states π_θ doesn’t reach. So off-policy SFT optimizes correctness in the wrong region. On a long agentic CTF trajectory (large T), the T² vs T gap is the whole story.
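To put numbers on the T² vs T gap, here is a minimal sketch that evaluates the two bound terms. The function names and the sample values of ε and T are illustrative assumptions, constants hidden inside the O(·) are ignored, and these are upper bounds on extra cost, not predictions of what a given run will do.

```python
def bc_extra_cost_bound(eps: float, horizon: int) -> float:
    """Quadratic term of the behavior-cloning bound: eps * T^2."""
    return eps * horizon ** 2


def on_policy_extra_cost_bound(eps: float, horizon: int) -> float:
    """Near-linear term after training on the learner's own states: eps * T."""
    return eps * horizon


# Illustrative values only: per-step error of 1%, short vs long episodes.
for horizon in (10, 100):
    quadratic = bc_extra_cost_bound(0.01, horizon)
    linear = on_policy_extra_cost_bound(0.01, horizon)
    print(f"T={horizon}: eps*T^2={quadratic:.2f}  eps*T={linear:.2f}")
```

With per-step cost capped at 1, total cost can never exceed T, so at ε = 0.01 and T = 100 the quadratic term has already grown to the size of the whole episode: the off-policy guarantee says nothing useful at that horizon, while the linear term is still 1.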
The fix, stated once
Make the data on-policy: sample from π_θ, then score those samples. Every method that “fixes execution” — rejection-sampling FT, GRPO/RLVR, on-policy distillation — is a different way of doing exactly that. The on-policy distillation line makes the connection explicit: it exists specifically to kill the train/inference mismatch of fixed-dataset KD by sampling from the student during training (GKD, Agarwal et al., arXiv:2306.13649).
The corollary you’ll use constantly
- Failure lives in π_θ’s own distribution (it solves sometimes, fails often) → on-policy method.
- The correct behavior is absent from π_θ entirely (never appears at high N) → nothing on-policy to reinforce → you must inject off-policy (SFT / teacher data / a tool).
- The behavior exists but is mis-ranked → preference method.
That routing is the spine of The decision.
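The three-way routing above can be written down as a tiny function. This is a sketch under stated assumptions: `route_method`, its arguments, and the returned labels are hypothetical names, and "absent" is approximated as zero verifier passes in a large sample, which is only as trustworthy as N is large.

```python
def route_method(successes: int, n_samples: int, misranked: bool = False) -> str:
    """Map a pass@N measurement to a method family.

    successes  -- verifier-passed rollouts out of n_samples from the current policy
    misranked  -- True if good behavior is sampled but loses to worse behavior
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if successes == 0:
        # Nothing on-policy to reinforce: the behavior has to be injected.
        return "off-policy injection (SFT / teacher data / a tool)"
    if misranked:
        return "preference method"
    # Solves sometimes, fails often: the failure lives in the policy's own distribution.
    return "on-policy method"


print(route_method(successes=0, n_samples=256))
print(route_method(successes=12, n_samples=256))
print(route_method(successes=12, n_samples=256, misranked=True))
```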
What “data” actually means for an agent
You asked the right question earlier: “demonstration = the answer — are you talking about trajectories?” Yes. Being concrete about the data object dissolves most of the confusion, because each method eats a different-shaped object, and for an agent those shapes are not what the chatbot literature implies.
A “demonstration” is a trajectory, not an answer
- Chatbot SFT example = `(prompt → ideal response text)`.
- Agent SFT example = the whole trajectory:

      system prompt
        → [assistant: reasoning] → [tool_call: shell "nmap …"] → [tool_result: <output>]
        → [assistant: reasoning] → [tool_call: http_request …] → [tool_result: <output>]
        → … → [assistant: submit_flag("FLAG{…}")]
The training target is the path, not the flag. That is what SFT and rejection-sampling FT imitate token-by-token.
Loss-masking: don’t train the model to predict the world
Tool outputs / observations are part of the sequence but are not the policy’s own tokens — they come from the environment. Standard practice is to mask the loss on prompt and observation spans and only compute loss on the model-generated reasoning + tool_call + submit tokens. Otherwise you teach the model to hallucinate stdout. (This is the same principle as instruction-tuning masking the prompt; for agent traces it’s applied to every observation span.) The project’s own trajectory-synthesis note treats observation-masking as a first-class filter step — see lessons/post-training/verified-trajectory-synthesis-recipe.md in the shared memory pool. (Caveat: that recipe’s specific numbers derive from an unverified third-party report — trust the procedure, not the figures.)
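A minimal sketch of observation masking, assuming a trajectory is already split into role-tagged spans. The `Span` type, the role names, and the whitespace "tokens" are illustrative assumptions, not the project recipe's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Span:
    role: str          # hypothetical roles: system, assistant, tool_call, tool_result, submit
    tokens: list[str]


# Only spans the policy itself generated contribute to the loss.
TRAINABLE_ROLES = {"assistant", "tool_call", "submit"}


def build_loss_mask(trajectory: list[Span]) -> tuple[list[str], list[int]]:
    """Flatten spans into one token sequence plus a 0/1 loss mask."""
    tokens: list[str] = []
    mask: list[int] = []
    for span in trajectory:
        tokens.extend(span.tokens)
        mask.extend([1 if span.role in TRAINABLE_ROLES else 0] * len(span.tokens))
    return tokens, mask


def masked_mean_nll(token_nll: list[float], mask: list[int]) -> float:
    """Average per-token loss over unmasked (policy-generated) positions only."""
    kept = sum(mask)
    return sum(l * m for l, m in zip(token_nll, mask)) / kept if kept else 0.0


trajectory = [
    Span("system", ["You", "are", "a", "CTF", "agent"]),
    Span("assistant", ["scan", "first"]),
    Span("tool_call", ["nmap", "10.0.0.5"]),
    Span("tool_result", ["22/tcp", "open"]),
    Span("submit", ["FLAG{example}"]),
]
tokens, mask = build_loss_mask(trajectory)
print(mask)
```

Because masked positions are multiplied by zero, however badly the model predicts stdout has no effect on the gradient.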
The data shapes, per family (forward reference)
| Family | Training object it consumes | Where it comes from |
|---|---|---|
| Imitation (SFT, distillation) | full trajectories (yours, a teacher’s, or human) | curate / run a teacher |
| Rejection-sampling FT | your own verifier-passed trajectories | you already generate these |
| Preference (DPO/KTO) | (chosen, rejected) trajectory pairs, or tagged good/bad | your solved vs failed logs |
| RL (GRPO/RLVR) | prompts + a reward/verifier fn — no fixed dataset | your challenges + verify() |
This table is the hinge of the whole book. It’s expanded in Method → Data, which is the chapter that actually addresses your stated bottleneck (“we don’t know what data we want”). The short version: you don’t choose data and then a method — you choose the method, and the method dictates the data object.
The family map
Three families by what signal they learn from, plus one orthogonal axis (PEFT) that is a delivery mechanism, not a learning signal. Every named method is a fixed preset on the on/off-policy axis from Foundations — you don’t freely combine axes.
    graph TD
        ROOT["Post-training<br/>push mass toward good behavior"]
        ROOT --> IM["IMITATION<br/>signal = demonstrations"]
        ROOT --> PR["PREFERENCE<br/>signal = comparisons A≻B"]
        ROOT --> RL["REINFORCEMENT<br/>signal = reward / verifier"]
        IM --> SFT["SFT"]
        IM --> DIST["Distillation<br/>off-policy / on-policy"]
        IM --> RS["Rejection-sampling FT<br/>= 'RL without RL'"]
        PR --> RLHF["RLHF (RM + PPO)"]
        PR --> DPO["DPO · KTO · IPO · ORPO · SimPO"]
        RL --> PPO["PPO"]
        RL --> GRPO["GRPO → GSPO / DAPO"]
        RL --> RLVR["RLVR (verifiable reward)"]
        RL --> AG["Agentic / multi-turn RL"]
        PEFT["PEFT: LoRA · QLoRA · DoRA<br/>a HOW, applied to any of the above"]
        ROOT -.delivery.-> PEFT
What’s actually load-bearing in 2026 (verified)
The status column below is from a fresh Exa pass over model tech reports + lab blogs (2025–2026), not recalled — sources cited per row throughout the method chapters.
| Method | 2026 status | Anchor |
|---|---|---|
| SFT | Mainstream, universal — stage 0 of every recipe | — |
| Off-policy distillation | Mainstream — DeepSeek distills R1 → V3/V3.2 as a named stage | arXiv:2412.19437, arXiv:2512.02556 |
| On-policy distillation | Niche / promising — NOT yet confirmed in any frontier lab’s production recipe | GKD arXiv:2306.13649; Thinking Machines blog 2025-10-27 |
| Rejection-sampling FT | Mainstream — named stage in Llama 3 & DeepSeek-R1 | arXiv:2407.21783, arXiv:2501.12948 |
| RLHF (RM+PPO) | Mainstream at proprietary labs (Gemini 2.5, GPT-5) | arXiv:2507.06261 |
| DPO | Mainstream — Llama 3’s offline preference stage | arXiv:2407.21783 |
| IPO/KTO/ORPO/SimPO | Niche — OSS/research tooling; no flagship names them as primary | arXiv:2503.11701 |
| GRPO | Mainstream — the reasoning-RL default | arXiv:2402.03300 |
| RLVR | Mainstream — arguably the defining 2025-26 technique | arXiv:2501.12948, arXiv:2507.06261 |
| GSPO (Qwen3) | Mainstream — first GRPO-successor with a flagship behind it | arXiv:2507.18071 |
| DAPO | OSS-tooling mainstream; ByteDance-origin, not confirmed elsewhere | arXiv:2503.14476 |
| PRM (process reward) | Niche — explicitly rejected for R1 (step-level reward hacking) | arXiv:2501.12948 |
| Rubric/critic outcome reward | Mainstream & growing — the real replacement for PRM | arXiv:2507.06261 |
| Agentic / multi-turn RL | Mainstream & the frontier edge — see its own chapter | Deep Research; Kimi K2/K2.5 |
| LoRA/QLoRA/DoRA | Mainstream in the applied layer; labs post-train flagships full-parameter | arXiv:2106.09685 |
| Self-play | Experimental — no confirmed frontier-lab production use as of this pass | — |
Read the family chapters for the mechanism + “what data it eats” + when to reach for each.
Imitation — SFT · distillation · rejection sampling
Signal = demonstrations. Objective = cross-entropy on target tokens. What differs across the three is whose trajectories you imitate — which is the on/off-policy axis doing all the work.
Per-method template: what · data it eats · on/off-policy · when · gotcha · cite.
SFT (Supervised Fine-Tuning)
- What: MLE on `(prompt → target)`; for agents, target = a full trajectory (What “data” means). The original instruction-tuning result is InstructGPT (arXiv:2203.02155).
- Eats: curated/human/teacher demonstrations.
- Policy: off-policy (targets aren’t π_θ’s samples).
- When: inject a capability or format the model lacks; establish a cold-start before RL.
- Gotcha: off-policy ⇒ blind to execution gaps (the `εT²` compounding, Foundations) and it overwrites: this is the “SFT replaces capabilities” half of “Scalpel vs Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them” (arXiv:2507.10616). Worse on small models (less capacity to absorb without forgetting).
Distillation (a kind of SFT — the teacher supplies the demonstrations)
Knowledge distillation originates with Hinton et al., arXiv:1503.02531. Two variants, and the split is the on/off-policy axis:
- Off-policy distillation = SFT on the teacher’s completions. Mainstream: DeepSeek transfers R1’s reasoning into V3/V3.2 as a named post-training stage (DeepSeek-V3, arXiv:2412.19437; V3.2, arXiv:2512.02556); “distilled from GPT-4/R1” datasets are how most small OSS models get capability.
- On-policy distillation = student samples its own rollouts, teacher grades them densely (reverse-KL per token). Fixes the fixed-dataset train/inference mismatch (GKD, arXiv:2306.13649: +90% relative on GSM8K vs supervised-KD). Thinking Machines’ 2025 write-up reports Qwen3-8B ← Qwen3-32B reaching ~70% AIME’24 in 150–200 steps at ~9–30× less compute than RL-from-scratch (thinkingmachines.ai, 2025-10-27).
- Honest status: niche / promising, not lab-confirmed. As of a 2026 pass, no frontier lab (OpenAI/Anthropic/Google/DeepSeek/Qwen) has stated on-policy distillation as its production recipe — evidence is GKD + one lab blog. Treat the efficiency numbers as directional, not settled. (This corrects an earlier over-strong “sleeper” framing.)
- Eats: teacher completions (off) / teacher-graded student rollouts (on). Requires a teacher genuinely better at your task.
Rejection-sampling FT (“RL without RL”)
- What: sample N completions from π_θ, keep verifier-accepted winners, SFT on them; iterate. The lineage: STaR (arXiv:2203.14465), ReST (arXiv:2308.08998), RAFT (arXiv:2304.06767), RFT (arXiv:2308.01825).
- Eats: your own verifier-passed trajectories — which for a CTF harness with a flag check you already generate.
- Policy: on-policy data, SFT update. It’s the first-order special case of policy gradient (reward∈{0,1}, upweight winners).
- When: an execution gap, and you have a verifier but no stronger teacher. Cheapest on-policy move; reuses your SFT pipeline.
- Gotcha (measurable graduation trigger): positives-only ⇒ policy-entropy collapse — fast early gains then plateau. GRPO’s real edge over it is not group-normalization (ablated → negligible) but discarding all-same-reward groups (implicit filtering). See “A Minimalist Approach to LLM Reasoning: From Rejection Sampling to Reinforce” (arXiv:2504.11343). Watch entropy; when it collapses, graduate to GRPO/RLVR.
- Production proof: explicit named stage in Llama 3 (arXiv:2407.21783) and DeepSeek-R1 (an SFT set of ~800K samples, ~600K of them rejection-sampled reasoning traces, built between its two RL stages; arXiv:2501.12948).
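One round of the loop described above (sample N, keep verifier-accepted winners, hand them to SFT) can be sketched as follows. Everything here is a toy: `rejection_sampling_round`, the sampler, and the verifier are hypothetical stand-ins for the real harness, and the SFT update plus the outer iteration are left out.

```python
import random


def rejection_sampling_round(prompts, sample_fn, verify_fn, n=8, keep_per_prompt=2):
    """Sample n rollouts per prompt from the current policy; keep verified, de-duplicated winners."""
    dataset = []
    for prompt in prompts:
        seen, winners = set(), []
        for _ in range(n):
            trajectory = sample_fn(prompt)
            if trajectory in seen:
                continue                      # drop exact duplicates
            seen.add(trajectory)
            if verify_fn(prompt, trajectory):
                winners.append(trajectory)
        dataset.extend((prompt, t) for t in winners[:keep_per_prompt])
    return dataset


# Toy policy and verifier (assumptions): the "policy" can never produce chal-b's flag.
_rng = random.Random(0)
ANSWERS = {"chal-a": "FLAG{a}", "chal-b": "FLAG{b}"}


def toy_sample(prompt):
    return _rng.choice(["FLAG{a}", "FLAG{wrong}", "FLAG{other}"])


def toy_verify(prompt, trajectory):
    return trajectory == ANSWERS[prompt]


sft_batch = rejection_sampling_round(list(ANSWERS), toy_sample, toy_verify, n=16)
print(sft_batch)
```

The toy makes the positives-only limitation visible: a challenge the policy never solves contributes no training examples at all, and failed rollouts are simply thrown away.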
Preference — RLHF · DPO · KTO
Signal = comparisons (A ≻ B). This family exists for the case where you cannot write verify() — “helpful / harmless / on-brand” has no programmatic checker, but a human (or an AI judge) can rank two outputs. Load-bearing property: preference methods reshape ranking over behaviors π_θ can already produce — they inject no new capability (lessons/post-training/dpo-kto-for-agent-tool-selection.md, shared memory).
RLHF (reward model + PPO)
- What: train a reward model on preference pairs (Bradley-Terry), then optimize π_θ against it with PPO + KL-to-reference. The canonical pipeline is InstructGPT (arXiv:2203.02155).
- Eats: preference pairs → a learned scalar reward.
- Still alive in 2026, not dead: Gemini 2.5 runs an explicit Reward-Model + Critic + RL loop (“RLF”, arXiv:2507.06261 §2.4); GPT-5’s sycophancy fix scores conversations and uses that as a training reward (OpenAI GPT-5 system card / model-training page).
- Gotcha: a learned RM has parameters to exploit → reward hacking. Deterministic verifiers (RLVR) avoid this; see the gameability ladder in Contested edges.
DPO and the direct-preference family
- What: skip the RM + RL loop. A closed-form loss directly raises `logπ_θ(chosen) − logπ_θ(rejected)` against a frozen reference, provably equivalent to the RLHF objective under Bradley-Terry (DPO, arXiv:2305.18290). Key hyperparameter: β (KL strength).
- Eats: `(prompt, chosen, rejected)` triples.
- Policy: off-policy by default (pairs usually from another model / earlier checkpoint), which is its weakness; iterative/online DPO resamples from current π_θ to make it on-policy.
- Production proof: Llama 3 chose DPO over PPO for its offline preference stage for stability/scalability at their scale, and runs it iteratively (their “iTeC” = rejection-sampling + SFT/DPO/IPO + online RL, several rounds) (arXiv:2407.21783).
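As a worked form of the DPO objective, here is a minimal per-pair sketch operating on summed sequence log-probabilities. The function and argument names are illustrative assumptions, and real implementations batch this over tensors.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * [(policy - ref) on chosen  -  (policy - ref) on rejected])."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -log_sigmoid(beta * margin)


# Policy identical to the reference: the margin is zero and the loss is log(2).
print(dpo_loss(-40.0, -42.0, -40.0, -42.0))
# Policy has moved mass toward the chosen trajectory: the loss drops.
print(dpo_loss(-35.0, -45.0, -40.0, -42.0))
```

The loss only sees log-probability differences relative to the reference, so it can re-rank trajectories the policy already assigns mass to but has no term that creates behavior the policy cannot produce.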
Variants and their niche
- KTO (arXiv:2402.01306): learns from unpaired `good/bad` labels (Kahneman-Tversky value model), so no matched pairs are needed. This fits mined agent logs exactly (a pile of failed runs + a pile of clean solves).
- IPO (arXiv:2310.12036) stabilizes DPO’s tendency to collapse both logprobs at high β; ORPO (arXiv:2403.07691) folds preference into SFT with no reference model; SimPO (arXiv:2405.14734) drops the reference via length-normalized reward.
- Honest status: IPO/KTO/ORPO/SimPO are niche — real, used in fine-tuning shops and ablated in Tülu 3, but no Llama/Qwen/DeepSeek/GPT/Claude/Gemini tech report names them as the production choice (survey: arXiv:2503.11701). Plain DPO + iterative DPO are the mainstream ones.
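To make the paired-vs-unpaired difference concrete, the sketch below turns the same run logs into both shapes. The log schema (`challenge`, `trajectory`, `flag_captured`) and the output field names are assumptions for illustration; check your trainer's expected columns before relying on them.

```python
from collections import defaultdict


def logs_to_kto_records(runs):
    """Unpaired: every run becomes one labeled example, solved or not."""
    return [
        {"prompt": r["challenge"], "completion": r["trajectory"], "label": bool(r["flag_captured"])}
        for r in runs
    ]


def logs_to_dpo_pairs(runs):
    """Paired: only challenges with at least one solve AND one failure yield data."""
    solved, failed = defaultdict(list), defaultdict(list)
    for r in runs:
        bucket = solved if r["flag_captured"] else failed
        bucket[r["challenge"]].append(r["trajectory"])
    return [
        {"prompt": c, "chosen": s, "rejected": f}
        for c in solved if c in failed
        for s, f in zip(solved[c], failed[c])
    ]


runs = [
    {"challenge": "chal-1", "trajectory": "t-ok", "flag_captured": True},
    {"challenge": "chal-1", "trajectory": "t-bad", "flag_captured": False},
    {"challenge": "chal-2", "trajectory": "t-bad2", "flag_captured": False},
]
print(len(logs_to_kto_records(runs)), len(logs_to_dpo_pairs(runs)))
```

With mostly-failed challenges, the paired form discards every challenge that has no solve to pair against, while the unpaired form keeps all runs.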
RLAIF / Constitutional AI
- What: replace human preference labels with AI feedback against a written constitution (Constitutional AI, arXiv:2212.08073).
- Status: mainstream at Anthropic (it is the core method) and partially adopted at Google (Gemini 2.5 safety is “loosely inspired by Constitutional AI”, arXiv:2507.06261). 2026 refinement: Anthropic now teaches the constitution via synthetic document fine-tuning (SDF) → SFT → RL, because “demonstrating desired behavior is insufficient — the model must learn why” (alignment.anthropic.com, “teaching Claude why”, 2026).
Reinforcement — PPO · GRPO · RLVR
Signal = reward / verification. Fully on-policy, online, uses the whole reward landscape (push winners up and losers down). Most powerful, most expensive/unstable. This is the family driving every 2025–2026 reasoning model.
PPO
- What: clipped-surrogate policy gradient with a value/critic network + KL-to-reference (arXiv:1707.06347).
- Eats: prompts + a reward (learned RM or verifier).
- 2026 status: still used for classic preference-RL at the proprietary labs, but declining share for reasoning-RL (the critic is expensive; GRPO/GSPO replaced it there).
GRPO (Group Relative Policy Optimization)
- What: drop the critic. Sample a group of N completions per prompt; the group mean reward is the baseline; advantage `Aᵢ = rᵢ − mean(r)` (optionally std-normalized). Introduced in DeepSeekMath (arXiv:2402.03300).
- Eats: prompts + a reward fn; no fixed target dataset.
- 2026 status: the reasoning-RL default. It is DeepSeek-R1’s core algorithm (arXiv:2501.12948) and still the base of V3.2’s mixed RL (arXiv:2512.02556).
- Requirement (project rule): baseline solve rate must sit in 30–60% per prompt-group, because all-pass or all-fail groups give zero advantage → zero gradient (`llmresearch-handbook.md` rule 7; mechanics in `handbook.md` §10, shared memory).
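The zero-gradient claim is easy to check numerically. A minimal sketch of the group-relative advantage and the 30–60% band check follows; the function names are illustrative, and the use of population standard deviation with a small epsilon is an assumption (implementations differ on this detail).

```python
from statistics import mean, pstdev


def group_advantages(rewards, normalize_std=False, eps=1e-6):
    """A_i = r_i - mean(r), optionally divided by the group's std."""
    baseline = mean(rewards)
    advantages = [r - baseline for r in rewards]
    if normalize_std:
        scale = pstdev(rewards) + eps
        advantages = [a / scale for a in advantages]
    return advantages


def in_band(rewards, low=0.30, high=0.60):
    """Project rule 7: group solve rate should sit inside the band."""
    return low <= mean(rewards) <= high


mixed = [1, 0, 0, 1, 0, 0, 0, 1]      # 3 of 8 solved
print(group_advantages(mixed), in_band(mixed))
print(group_advantages([1, 1, 1, 1]))  # all-pass: every advantage is zero
print(group_advantages([0, 0, 0, 0]))  # all-fail: every advantage is zero
```

Every rollout in an all-pass or all-fail group gets advantage 0, so the group costs rollout compute and contributes nothing to the update.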
GRPO successors
- GSPO (Qwen) [R][L]. The problem with vanilla GRPO/PPO at scale: the token-level importance ratio `r_t = π_θ(a_t|s_t)/π_old(a_t|s_t)` compounds multiplicatively over a long response, and noisy per-token drift is specifically what destabilizes MoE RL (expert routing shifts between the rollout policy π_old and the updated π_θ). GSPO clips at the sequence level instead:

      # GRPO/PPO: one ratio PER TOKEN, clipped per token; variance compounds over length L
      r_t = pi_theta(a_t|s_t) / pi_old(a_t|s_t)
      # GSPO: one ratio for the WHOLE sequence (length-normalized geometric mean)
      r_seq = (pi_theta(y|x) / pi_old(y|x)) ** (1 / len(y))
      loss = -mean(min(r_seq * A, clip(r_seq, 1-eps, 1+eps) * A))

  This is Qwen3’s actual stated production RL algorithm (arXiv:2507.18071), the first GRPO-successor with a flagship behind it, and it gets more relevant, not less, as episodes lengthen: a 100-turn agentic trajectory with tool calls interleaved is exactly the long-sequence regime where token-level ratios drift furthest from 1 by the last token. If a future GRPO/RLVR run on the CTF agent shows training instability, sequence-level clipping is the first thing to try, not more KL-coefficient tuning.
- DAPO (ByteDance Seed) [E][R][L]. Four concrete engineering fixes, not one new algorithm, each independently adoptable as a `verl` configuration option (arXiv:2503.14476):
  - Clip-Higher: decouple the PPO clip bounds (`eps_low ≠ eps_high`, e.g. `0.20 / 0.28` vs. the symmetric PPO-default `0.20/0.20`) so a rare-but-good token can gain probability faster than a bad one loses it. Symmetric clipping caps how fast a rare-correct action can ever be reinforced, which is a direct driver of entropy collapse (below).
  - Dynamic Sampling: resample any prompt whose whole group of G rollouts is all-correct or all-incorrect (`std(group_rewards) == 0` → zero advantage → zero gradient in GRPO) instead of paying for a wasted rollout batch.
  - Token-level loss `[L]`: average the policy-gradient loss over every token in the batch, not per-sample-then-averaged, so long correct/incorrect responses aren’t down-weighted relative to short ones. A credit-assignment fix that matters more the longer responses get.
  - Overlong reward shaping: a soft length penalty instead of a hard truncation penalty, so a response cut off by the context window isn’t punished as if it were simply wrong.

      eps_low, eps_high = 0.20, 0.28
      ratio = exp(logp_new - logp_old)
      clipped = clip(ratio, 1 - eps_low, 1 + eps_high)
      loss_pg = -min(ratio * adv, clipped * adv)   # per-token, mean over ALL tokens in the batch
      while std(group_rewards) == 0:               # dynamic sampling
          prompt = resample_prompt()
          group_rewards = rollout_and_score(prompt, n=G)

  Mainstream in OSS RL tooling (verl and open GRPO reproductions default to these four fixes). The DAPO paper reports 50 points on AIME 2024 from a Qwen2.5-32B base, beating DeepSeek-R1-Zero-Qwen-32B with half the training steps, and it is one of the few fully open (algorithm + infra + data) large-scale reasoning-RL releases. Project rule 7 (GRPO baseline must hit 30–60%) is DAPO’s dynamic-sampling problem, stated as a portfolio-composition constraint instead of a training-loop fallback: keeping the baseline in-band is how you avoid feeding all-pass/all-fail groups into the update in the first place, and DAPO’s dynamic sampling is the fallback for whatever still lands there. At ~100 turns per rollout, resampling a whole-group-zero-reward challenge is expensive, so prefer upstream curriculum/difficulty filtering (drop challenges outside the 30–60% band) over paying for resamples on genuinely-unsolved-yet challenges.

  Designed to fix: pattern 2 & 3. Clip-Higher targets the exact mechanism behind “no methodology / 82% pivot-after-failure” and “good guesser until it isn’t”: symmetric clipping caps how much probability mass a rare, correct enumeration branch (or a well-grounded, as opposed to lucky, guess) can ever accumulate, which is precisely what narrows a policy onto one brittle script. See Exploration and entropy below for the mechanism this is patching.
- Dr. GRPO [R][E]. A smaller, easy-to-miss companion fix: vanilla GRPO’s per-sample length- and std-normalization secretly rewards longer wrong answers and shorter right ones (an optimization artifact, not a real preference). The fix is a two-line change: drop the `1/|response|` length term and the group-std division, keep only `A = r − mean(r)` (arXiv:2503.20783). Matches or beats vanilla GRPO’s accuracy at the same compute while removing the length-inflation drift. For a 100-turn agent this bug has a much bigger attack surface than a single-turn math answer: a policy trained on the unfixed objective can learn to “look busy” (extra tool calls, redundant enumeration) after a wrong guess without the enumeration being useful, which is nearly indistinguishable from legitimate PTES-style enumeration unless you’re specifically checking for it.
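The length bias that DAPO’s token-level loss and Dr. GRPO both target comes down to how much weight one token gets under each way of averaging. The sketch below computes that per-token weight for a batch of responses; the mode names are illustrative labels, and the fixed divisor used for the Dr. GRPO-style case (`max_len`) is an assumption about how the constant is chosen.

```python
def per_token_weights(lengths, mode, max_len=None):
    """Weight applied to each token's loss, per response, under three aggregation schemes."""
    n = len(lengths)
    if mode == "sample_mean":
        # Vanilla GRPO: mean over a response's tokens, then mean over responses.
        return [1.0 / (n * length) for length in lengths]
    if mode == "token_mean":
        # DAPO-style: one mean over every token in the batch.
        total = sum(lengths)
        return [1.0 / total for _ in lengths]
    if mode == "constant":
        # Dr. GRPO-style: divide by a fixed constant instead of the response length.
        if max_len is None:
            raise ValueError("constant mode needs max_len")
        return [1.0 / (n * max_len) for _ in lengths]
    raise ValueError(f"unknown mode: {mode}")


lengths = [10, 1000]   # a short response and a long one in the same group
for mode in ("sample_mean", "token_mean", "constant"):
    print(mode, per_token_weights(lengths, mode, max_len=1000))
```

Under sample-level averaging each token of the 10-token response carries 100 times the weight of a token in the 1000-token response, so a long wrong trajectory is penalized far less per token than a short one. The other two schemes weight every token equally.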
RLVR (RL with Verifiable Rewards)
- What: GRPO/PPO where the reward is a deterministic verifier (unit tests, math checker, flag check) rather than a neural RM. No parameters to game.
- Eats: prompts + a `verify(state) → {0,1}` function. This is your setup: the CTF flag verifier is a textbook verifiable reward.
- 2026 status: arguably the defining technique of the era. Every reasoning model (o1/o3, R1, Gemini-thinking, Qwen3, Kimi) scales RL against verifiable/rule-based rewards as the capability driver; Gemini 2.5 explicitly allocates increased RL compute to “verifiable rewards” (arXiv:2507.06261; OpenAI “Learning to reason with LLMs”; R1, arXiv:2501.12948).
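For concreteness, a deterministic flag verifier can be as small as the sketch below. The `state` dictionary and its keys (`submitted_flag`, `expected_flag`) are hypothetical; the real harness will carry this information in its own structure.

```python
def verify(state: dict) -> int:
    """Return 1 only if the submitted flag exactly matches the expected one, else 0."""
    submitted = state.get("submitted_flag")
    if submitted is None:          # the episode ended without a submission
        return 0
    return int(submitted.strip() == state["expected_flag"])


print(verify({"submitted_flag": "FLAG{toy}\n", "expected_flag": "FLAG{toy}"}))
print(verify({"submitted_flag": "FLAG{guess}", "expected_flag": "FLAG{toy}"}))
print(verify({"expected_flag": "FLAG{toy}"}))
```

Exact string comparison is the point: there is no learned component and no partial credit, so the only way to raise the reward is to produce the flag.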
Exploration and entropy: the GRPO graduation trigger
Cybersecurity is exploration — every technique here is [E]-tagged. This is the section that operationalizes the project’s own stated graduation condition (“SFT → GRPO/RLVR when policy entropy collapses”): what actually collapses, why, and the cheapest fixes, roughly in the order you’d reach for them.
The phenomenon has a fitted law. Policy entropy falls sharply and monotonically early in RLVR training, and downstream performance is empirically bounded by R = -a·exp(H) + b — performance is traded from entropy, hitting a fixed, predictable ceiling at H=0. The mechanism: entropy change is driven by the covariance between a token’s action-probability and its logit update, which is proportional to advantage under policy-gradient methods — a high-probability, high-advantage token gets pushed toward certainty (entropy ↓), a rare-but-good token gets reinforced (entropy ↑), and this covariance term stays positive almost everywhere during training, which is why the collapse is monotonic rather than noise (arXiv:2505.22617). Actionable consequence: instrument mean(entropy) from training step 0. A run whose entropy is already heading to 0 within the first 10–20% of steps is capped, regardless of what the reward curve currently shows — you cannot tell “40% because the task is hard” from “already collapsed onto one enumeration script at step 50, now just executing it faster” without this instrumentation.
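The ceiling follows directly from the fitted form. A short derivation, assuming a > 0 and b are the fitted coefficients and H ≥ 0 is the mean policy entropy:

```latex
\begin{aligned}
R(H) &= -a\,e^{H} + b, \qquad a > 0,\; H \ge 0 \\
R_{\max} &= R(0) = b - a \\
\frac{dR}{dH} &= -a\,e^{H}
\end{aligned}
```

Reading the derivative: giving up a small amount of entropy δ buys roughly a·e^H·δ of performance, so the exchange rate is best while entropy is still high and falls to a as H approaches 0. Once a and b are fitted from the early part of a run, b − a is the number the run is heading toward.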
Designed to fix: pattern 2 & 3 — this is the literature-side name for exactly the two most load-bearing BSides findings: a policy that has spent its entropy budget on one narrow, high-probability action sequence has no probability mass left on alternatives when that sequence fails (pattern 2, 82% pivot-after-failure) and commits to a single ungrounded guess at the critical step because it’s the only path its distribution still assigns real weight to (pattern 3).
Cheap, surgical fixes — try in this order:
- Clip-Higher (DAPO, above): already the baseline recipe; decoupling `eps_high` from `eps_low` is the first lever, not an afterthought.
- Clip-Cov / KL-Cov: since collapse is driven by a small set of high-covariance tokens, don’t touch the loss globally. Each batch, flag the top ~0.2–2% of tokens by `cov = (logp − logp.mean()) * (adv − adv.mean())` and either drop the policy-gradient update on them (Clip-Cov) or add an extra KL penalty scoped only to them (KL-Cov):

      cov = (logp - logp.mean()) * (adv - adv.mean())
      mask = cov > quantile(cov, 1 - clip_frac)        # top ~0.2-2% of tokens
      loss_pg[mask] = loss_pg[mask].detach()           # Clip-Cov
      loss = loss_pg + kl_coef * kl_per_token * mask   # KL-Cov (scoped, not global)

  Already merged into `verl` (`loss_mode="clip_cov"` / `"kl_cov"`): a config flag, not a reimplementation (arXiv:2505.22617).
- The clip asymmetry is mechanistically real, not just DAPO folklore: clipping on the low side (the `1 − eps_low` bound) pushes entropy up, while clipping on the high side (the `1 + eps_high` bound) pushes it down, so these are two independent knobs, not one symmetric hyperparameter (arXiv:2509.26114). This is consistent with Clip-Higher, which loosens the high-side bound and so weakens the entropy-lowering side. If Clip-Higher alone doesn’t fully hold entropy up, look at `eps_low` too.
- High-entropy minority tokens: only ~20% of tokens in a trajectory are high-entropy “forking” tokens (where the policy actually chooses between distinct paths); restricting the gradient update to just those matches full-gradient RLVR at 8B and beats it at 32B, while training on the other 80% (low-entropy filler/syntax) actively hurts (arXiv:2506.01939). In an agentic trace (JSON tool-call formatting, restated context) the boilerplate-token fraction is plausibly even larger than in pure math CoT. This is untested in this domain, but the leverage ceiling is higher, not lower.
The graduation trigger, precisely stated: don’t wait for the reward curve to plateau — watch entropy. Once mean(entropy) is tracking toward the flat part of the R = -a·exp(H)+b curve, more rejection-sampling-SFT epochs on the same policy distribution will not move the needle (you’re re-sampling a narrowing distribution); that is the signal to graduate to GRPO/RLVR with the interventions above already wired in, not bolted on after observing collapse.
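A minimal sketch of the instrumentation this trigger needs: per-token entropy from the policy’s next-token distributions, averaged per step, plus a crude collapse check. The `should_graduate` rule and its 10%-of-initial threshold are hypothetical placeholders, not a value from the cited paper; fit the R = -a·exp(H)+b curve on your own runs to pick a real one.

```python
import math


def token_entropy(probs):
    """Shannon entropy (natural log) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions):
    """Average entropy over all token positions logged for a training step."""
    entropies = [token_entropy(p) for p in distributions]
    return sum(entropies) / len(entropies)


def should_graduate(entropy_history, collapse_frac=0.1):
    """Hypothetical trigger: latest mean entropy has fallen below a fraction of its starting value."""
    if len(entropy_history) < 2 or entropy_history[0] <= 0:
        return False
    return entropy_history[-1] <= collapse_frac * entropy_history[0]


step_0 = mean_entropy([[0.5, 0.5], [0.25, 0.25, 0.25, 0.25]])
step_n = mean_entropy([[0.99, 0.01], [0.97, 0.01, 0.01, 0.01]])
print(step_0, step_n, should_graduate([step_0, step_n]))
```

Logging this from step 0 is what lets you separate a run that is hard-but-healthy from one that has already narrowed onto a single script.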
ProRL — prolonged, KL-controlled RL expands the boundary rather than just amplifying it. Directly tests the contested question in Contested edges: does RL discover genuinely new solution paths, or just reweight what the base model could already sample at large k? Answer, conditional on three ingredients: (a) adaptive KL control against a reference policy, (b) periodic reference-policy resetting (ref_policy ← policy.detach() every N steps, so the KL term compares against a recent snapshot instead of an ever-more-distant step-0 policy), and (c) a diverse task suite, sustained over prolonged training (thousands of steps, not a short polish pass) — under these conditions, RL-trained models solve problems the base model never solves at any k, and the boundary-expansion effect strengthens with training duration (arXiv:2505.24864).
loss += kl_coef * KL(policy || ref_policy) # KL-to-reference, adaptive coefficient
if step % reset_interval == 0:
ref_policy.load_state_dict(policy.state_dict()) # re-anchor — don't compare against a stale step-0 snapshot
# (c) mix task families/difficulties in the training distribution rather than a narrow curriculum
This directly qualifies the project’s own confirmed lesson (“GRPO amplifies existing capabilities, SFT replaces them,” arXiv:2507.10616): the amplify-vs-expand outcome is a function of how long and how KL-controlled the run is, not GRPO’s inherent ceiling — a short GRPO pass should be expected to behave like amplification; ProRL’s own ablations show the boundary-expansion effect only emerges after prolonged, reference-resetting training. This is genuinely contested — a separate 2025 measurement study reports the opposite under a short, single-reference-policy setup (pass@1 rises, pass@256 coverage falls over training, i.e. narrowing, arXiv:2504.13837) — the two are read here as reconciled by duration/KL-management, not as one being simply wrong, but that reconciliation is this book’s synthesis, not a claim either paper makes explicitly. Treat “does RL create new capability here” as open and verify with your own entropy instrumentation (above), not settled fact.
Designed to fix: pattern 4 — a model whose reasoning boundary hasn’t been expanded keeps failing the same class of enumeration-heavy exploitation steps no matter how much RL “polishing” it gets (62% of failures stall in exploitation). ProRL’s finding implies the fix is duration + diversity + entropy control together, not any single trick — budget real training time and periodic reference resets once past the initial GRPO baseline, rather than treating GRPO as a short fine-tune pass.
Everything past this point — pass@k as a training signal, diversity/curiosity/count-based intrinsic rewards, parameter-space noise for temporally-coherent exploration, tool-call-sequence diversity as the project’s own novel opportunity, and the full turn-level/step-level credit-assignment literature for the ~100-turn setting — is covered in depth in the dedicated long-horizon and exploration-sweep chapters, and in Agentic & multi-turn RL for the multi-turn training-loop shape itself. Read this section for the graduation trigger and the cheapest fixes; read those for the harder credit-assignment and boundary-expansion questions once you’re past the initial DAPO-recipe GRPO baseline.
The reward-model question (PRM vs outcome/rubric)
- PRM (process reward, dense step-level) — score each reasoning step (Lightman et al., arXiv:2305.20050). Niche / avoided in production: DeepSeek explicitly rejected PRM for R1 due to step-level reward hacking (arXiv:2501.12948).
- Outcome verifier + rubric/critic grading — the real 2026 answer to “what replaced PRM”: not dense step rewards, but LLM-judge/rubric-based outcome grading. Gemini’s “Critic” (prompted rubric grader, arXiv:2507.06261) and OpenAI’s RFT “model grader” are both this in production. Your deterministic flag verifier is the ungameable end of this spectrum — keep it there (gameability ladder in Contested edges).
When to reach for RL
An execution gap where rejection-sampling FT has plateaued (entropy collapsed), and you want the negative-sample gradient + online updates. Cost: online rollout infra, reward plumbing, KL control, instability, and the train↔inference precision mismatch (its own rabbit hole — see the shared memory note research/fp8-quantization-mechanics-training-serving.md).
Agentic & multi-turn RL — the missing category
This is the category that was absent from the first pass of “the major methods,” and it’s the one that matters most for you, because a CTF agent is exactly this shape. It is not a new update rule — it still runs on GRPO/PPO/GSPO-family gradients — it’s a new training-loop shape: RL over multi-turn trajectories with live tools/environments in the loop (browser, code sandbox, MCP servers), instead of single-shot verifier-scored completions.
What changes vs. single-shot RLVR
- The rollout is an episode of tool-use, not one generation. Reward often arrives only at the end (flag captured / task complete) → a credit-assignment problem: which turn or tool-call earned the win or lost the run?
- The “dataset” is a live environment service, not a static file. You need rollout orchestration (sandboxes, tools, resets), not a JSONL.
- This is precisely the project’s “RL envs are the moat” thesis (lessons/post-training/rl-envs-as-moat-between-providers.md, shared memory), now independently corroborated as the frontier bet.
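"Rollout orchestration, not a JSONL" can be made concrete with a small sketch. It assumes episodes are driven asynchronously and that a semaphore caps how many sandboxes run at once; run_episode is a stand-in that returns a fake turn count, and every name here is hypothetical rather than part of the project harness.

```python
import asyncio
import random

async def run_episode(challenge_id: str, seed: int) -> dict:
    """Stand-in for one sandboxed episode; a real one would drive the harness."""
    rng = random.Random(seed)
    await asyncio.sleep(0)  # yield control, as real tool calls would
    return {"challenge": challenge_id, "turns": rng.randint(10, 100)}

async def collect_rollouts(challenge_ids, group_size: int, max_in_flight: int):
    """Run group_size episodes per challenge, at most max_in_flight at once."""
    gate = asyncio.Semaphore(max_in_flight)

    async def guarded(cid, seed):
        async with gate:
            return await run_episode(cid, seed)

    tasks = [guarded(cid, seed) for cid in challenge_ids for seed in range(group_size)]
    return await asyncio.gather(*tasks)  # results come back in task order

rollouts = asyncio.run(
    collect_rollouts(["web-01", "pwn-02"], group_size=4, max_in_flight=3)
)
```

Grouping several seeds per challenge mirrors what a group-relative method needs: N rollouts of the same task to compare against each other.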
Verified production evidence (2026)
- OpenAI Deep Research (o3-based): “trained using end-to-end reinforcement learning on hard browsing and reasoning tasks” — a shipped product doing agentic RL over live tool use (openai.com/index/introducing-deep-research).
- Kimi K2 (Moonshot, open-weight 1T/32B-active MoE): headline post-training is a large-scale agentic data-synthesis pipeline + joint RL, where simulated tool environments generate the rollouts RL trains on (Kimi K2 tech report).
- Kimi K2.5 (2026): Agent Swarm trained with Parallel Agent Reinforcement Learning (PARL) — RL over cooperating multi-agent trajectories. This is the current frontier edge (Kimi K2.5 tech blog, 2026).
- Gemini 2.5: RL environments explicitly extended to “multi-step actions and tool use” (arXiv:2507.06261 §2.4).
- Anthropic: a 2026 lesson that safety training from chat-RLHF failed to generalize to agentic/tool-use settings, forcing explicit diversification into agentic environments (alignment.anthropic.com, “teaching Claude why”, 2026). Directly relevant: capability and alignment now have to be trained in the agentic loop, not chat.
What this means for your build
- Your harness already is the environment. The engineering surface is (a) a clean verify(state)→{0,1} reward read from real environment state (not the transcript; see the confabulation gotcha in Contested edges), and (b) rollout orchestration to keep N episodes in flight.
- Start where credit assignment is trivial (outcome-only flag reward on a solvable band), i.e. rejection-sampling FT → GRPO/RLVR on your own multi-turn trajectories, before reaching for dense per-turn shaping.
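A minimal sketch of the verify(state)→{0,1} shape described above, assuming the sandbox records which flags were actually submitted to its scoring endpoint. EnvState and its field are hypothetical; the point is only that the reward function never receives the transcript.

```python
from dataclasses import dataclass, field

@dataclass
class EnvState:
    """Hypothetical snapshot of what the sandbox itself recorded."""
    submitted_flags: list = field(default_factory=list)  # flags sent to the scorer

def verify(state: EnvState, expected_flag: str) -> int:
    """Return 1 only if the exact flag reached the environment, else 0."""
    return int(expected_flag in state.submitted_flags)

# The model may claim anything in its own text; only recorded state counts.
transcript = "I found it: FLAG{totally_real}"
state = EnvState(submitted_flags=["FLAG{wrong_guess}"])
reward = verify(state, expected_flag="FLAG{s3cret}")
```

Because verify takes no transcript argument, a confabulated flag string in the model output cannot reach the reward by construction.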
What is not mainstream yet
- Self-play (self-generated curricula / self-critique-as-opponent) appears only in niche academic work as of this pass — no confirmed frontier-lab production use. Watch, don’t bet.
Turn-level vs. trajectory-level credit assignment
Everything below exists because naively lifting GRPO/PPO from single-turn math/code RL to a ~100-turn tool-using agent breaks two assumptions simultaneously: (1) the “trajectory” is now dozens of LLM-generation turns interleaved with environment/tool observations, not one generation; (2) reward is terminal-only (flag verified or not) — so the vanilla group-mean baseline conflates credit across every turn equally, rewarding/penalizing an early exploratory enumeration turn exactly as much as the final exploit turn.
flowchart TB
subgraph Trajectory-level GRPO baseline
A1["turn 1<br/>enumerate"] --> A2["turn 2<br/>probe"] --> A3["...turn 60..."] --> A4["turn 61<br/>exploit"] --> A5["flag: 0/1"]
A5 -- "one advantage,<br/>broadcast to all turns" --> A1
A5 --> A2
A5 --> A3
A5 --> A4
end
flowchart TB
subgraph Turn-level credit GiGPO / turn-PPO family
B1["turn 1<br/>enumerate"] --> B2["turn 2<br/>probe"] --> B3["...turn 60..."] --> B4["turn 61<br/>exploit"] --> B5["flag: 0/1"]
B5 -- "episode advantage macro" --> B1
B4 -- "step/turn advantage micro<br/>via state-hash or turn-value" --> B4
end
One-line idea: GRPO (arXiv:2402.03300, DeepSeekMath, 2024-02-05) replaces PPO’s learned critic with a group-mean baseline over N samples of the same prompt — cheap, no critic, but implicitly one-reward-per-generation. GAE (arXiv:1506.02438, 2015-06-08) is the classical mechanism for trading bias/variance in the advantage estimate across a single trajectory’s timesteps — built for one agent-environment stream, not turn-vs-token double granularity. Full method table (with your reinforcement chapter’s GRPO/GSPO/DAPO baseline) below.
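The "one advantage, broadcast to all turns" behavior is easy to see in a few lines. This is a sketch of a group-relative baseline with mean and standard-deviation normalization, assuming a terminal 0/1 flag reward per rollout; the eps constant and the turn counts are illustrative.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: one scalar per rollout of the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_turns(advantages, turns_per_rollout):
    """Every turn in a rollout inherits that rollout's single advantage."""
    return [[a] * n for a, n in zip(advantages, turns_per_rollout)]

rewards = [1, 0, 0, 0]                      # terminal flag reward for 4 rollouts
adv = grpo_advantages(rewards)
per_turn = broadcast_to_turns(adv, [61, 100, 100, 37])
```

In the winning rollout, all 61 turns carry the same positive number, enumeration and exploit alike. A group where every rollout fails (or every rollout succeeds) yields all-zero advantages, so it contributes no gradient at all.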
GiGPO — step-level advantage with zero extra rollouts [L] [E]
arXiv:2505.10978 (Feng, Xue, Liu, An; 2025-05-16, NeurIPS 2025 poster).
- Problem: vanilla GRPO computes one advantage per whole trajectory — a good early enumeration step and a lucky late guess get identical credit.
- Key idea: two nested groupings. (1) Episode-level group: N full rollouts, GRPO-style trajectory advantage (macro: “was this whole run good?”). (2) Step-level group: hash each (state, step) pair, bucket steps that recur across trajectories into anchor groups, and compute a second advantage from “what happened next” conditioned on that shared state (micro credit). No new network, no extra rollouts.
- Loop delta: after collecting the usual GRPO batch, add a step-indexing pass → advantage = episode_advantage + λ · step_advantage → feed into the same PPO-clipped update.
- Hyperparameters that matter: state-hashing granularity (too coarse → false matches; too fine → anchor groups collapse to size 1) and the mixing weight λ.
- Gotcha for CTF: built for hashable/discrete states (web pages, grid cells). A CTF agent’s state is unbounded free text (shell stdout, HTTP bodies), so you need a canonicalization step, e.g. hash on (tool_name, normalized_target, response_status_class) rather than raw text, or anchor groups never fire.
Designed to fix: pattern #4 (uneven PTES phases — 62% of failures stall in exploitation). Step-level credit is the mechanism that stops the 100-turn trajectory advantage from diluting the exploitation-phase steps that actually decide the run.
Given the harness already emits structured tool_call/tool_exec_ms spans (harness-observability-contract-2026-06.md), a state key from those spans is the cheapest first transplant in this whole chapter — no critic, no infra, just a canonicalization function.
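A sketch of what such a canonicalization function could look like, assuming each span exposes a tool name, a URL-like target, and an HTTP-style status code. The field names and the normalization rules (lowercase host, drop query string, bucket status into classes) are assumptions for illustration, not the harness contract.

```python
from urllib.parse import urlsplit

def normalize_target(url: str) -> str:
    """Lowercase host, drop query string and fragment, strip trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return f"{parts.netloc.lower()}{path}"

def state_key(tool_name: str, target: str, status_code: int) -> tuple:
    """Coarse, hashable key: (tool, normalized target, status class like '2xx')."""
    return (tool_name, normalize_target(target), f"{status_code // 100}xx")

a = state_key("http_get", "http://Target.local/admin/?session=abc", 200)
b = state_key("http_get", "http://target.local/admin?session=xyz", 204)
c = state_key("http_get", "http://target.local/admin", 403)
```

Here a and b collapse to the same key while c stays distinct. Dropping the query string is exactly the granularity trade-off named above: it lets anchor groups form, but it would also merge requests whose parameters are the attack itself, so the rule needs tuning per tool.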
ArCHer — the two-time-scale ancestor [L]
arXiv:2402.19446 (Zhou, Zanette, Pan, Levine, Kumar; 2024-02-29).
- Key idea: a hierarchy — a turn-level off-policy critic (TD-learned, “how good is this utterance given the conversation-so-far”) and a token-level on-policy policy gradient bootstrapped off that turn value instead of only the final episode reward. Decouples “which turn was good” (dozens of decisions) from “which token was good” (thousands).
- Ablation that matters: the sample-efficiency gap over flat single-critic baselines widens with horizon length — the regime you’re in.
- Gotcha: off-policy turn-level value learning reintroduces the value-overestimation instability the project’s no-critic GRPO preference was designed to avoid. Treat ArCHer as the theoretical justification for “turn is the right unit,” not a recipe to implement wholesale — prefer GiGPO’s critic-free step-grouping or Verlog’s dual-discount GAE (below) for the same decomposition without a learned off-policy critic.
RAGEN / StarPO — naming the collapse [E] [L]
arXiv:2504.20073 (Wang et al.; 2025-04-24).
- What it is: a diagnostic framework, not primarily a new algorithm. StarPO formalizes trajectory-level agent RL over whole (state, think, action, reward) rollouts, then uses the RAGEN testbed to empirically show what breaks when you train “the naive way.”
- Central finding, the “Echo Trap”: the policy’s reasoning traces converge to a small set of repeated, low-diversity patterns that keep scoring reward on the training distribution while generalization/exploration collapses. This is entropy collapse, named and reproduced across four stylized environments, not a one-off.
- Fix pattern reported: reward normalization across turns (so no single dominant reward source drowns out exploration signal) + explicit rollout-diversity interventions, triggered by monitoring entropy/reward-variance, not epoch count.
Designed to fix: pattern #2 (react-and-guess, no methodology — 82% pivot-after-failure). An entropy-collapsed policy is one direct explanation for why an agent stops trying diverse enumeration strategies and converges to a small guess-repertoire.
This is the citation for the project’s own stated plan — “watch policy entropy as the trigger to graduate from SFT to GRPO.” Concretely: instrument per-turn action-entropy (or a diversity metric over tool-call sequences) during RL and treat a converging curve as the operational graduation/intervention signal, mirroring RAGEN’s diagnosis instead of re-deriving it live mid-run. See also reinforcement.md’s “Exploration and entropy” section for the DAPO Clip-Higher / Dr. GRPO mechanisms that patch this at the single-turn level — RAGEN is the multi-turn-specific diagnosis of the same underlying failure.
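One cheap way to instrument this, sketched under the assumption that you log the tool chosen at a given turn index across a batch of rollouts. It measures the Shannon entropy of the empirical action distribution, which is a proxy for diversity and not the policy's token-level entropy; the tool names are made up.

```python
from collections import Counter
from math import log2

def action_entropy(actions) -> float:
    """Shannon entropy (bits) of the empirical distribution over actions."""
    counts = Counter(actions)
    total = sum(counts.values())
    return sum(-(c / total) * log2(c / total) for c in counts.values())

early = ["nmap", "curl", "gobuster", "sqlmap"]   # diverse choices across rollouts
late = ["curl", "curl", "curl", "curl"]          # collapsed repertoire
drop = action_entropy(early) - action_entropy(late)
```

Plotting this per turn index over training steps gives the converging curve described above; a sustained drop toward zero is the intervention signal.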
The 2025–26 turn-level GRPO-variant cluster [L] [E]
A fast-moving cluster of near-simultaneous papers attacking the same problem — “turn is the unit of advantage, not trajectory or token” — with distinct mechanisms. Treat as one converging-consensus finding, not N competing final answers; none individually crosses into [N] (they refine existing capability rather than expand the boundary).
| Paper | arXiv | Mechanism | Confidence |
|---|---|---|---|
| Turn-Level Reward Design & Credit Assignment | 2505.11821 | Dense per-turn reward terms layered on the terminal outcome reward | Promising |
| Turn-PPO | 2512.17008 | Argues GRPO’s group-relative clip “exposes notable limitations” at long horizon; goes back to a per-turn PPO value function | Promising |
| TL-GRPO | 2601.16480 | Turn credit for same-state-revisited tasks (iterative code repair); narrower than general multi-turn | Promising |
| A2TGPO | 2605.06200 | Adaptive per-turn clip range — early vs. near-terminal turns have different advantage-magnitude distributions | Promising |
| Proximity-Based MTO | 2602.19225 | Weight credit by task difficulty, not just turn position | Promising |
| GAGPO | 2605.13217 | GAE-style λ-discounted advantage synthesized into GRPO’s group-relative framework | Promising |
What this means for the CTF agent: don’t pick one paper as “the” answer — prototype the cheapest shared mechanism first: GiGPO’s step-grouping (zero extra infra) or a straightforward per-turn shaping term (2505.11821’s approach) before reaching for a second value network (Turn-PPO / TL-GRPO). Given the project’s ground-truth-only reward constraint (see below), any dense intermediate signal must stay a shaping term added to, not a replacement of, the terminal verifier reward.
Reward shaping for a sparse terminal reward — without reopening the confabulation bug
The project already has a hard rule: reward must be ground-truth flag-verified, never format/regex-matched — SFT-induced FLAG{} confabulation was a real observed failure (lessons/post-training/sft-induced-flag-confabulation.md). Every reward-shaping idea in this section has to be read through that constraint.
- Keep the terminal signal as ground truth, add density, don’t replace it. The turn-level cluster above (2505.11821 in particular) explicitly diagnoses that “sparse outcome rewards… lack dense intermediate signals across multiple decision steps”. The fix is injecting per-turn shaping alongside the verifier, e.g. reward tool-call progress (new open port found, new endpoint discovered, new credential recovered) as a small dense bonus, while the flag check remains the only source of the large terminal reward. A shaped proxy that can be gamed (e.g. “reward finding any string that looks like a flag”) reopens exactly the confabulation failure already logged, so the shaping term must be read from verifiable environment state, with the same discipline as the terminal check.
- Mask tool/environment output tokens from the policy-gradient loss. Search-R1 (arXiv:2503.09516) masks retrieved-content tokens out of the loss: you don’t want to reinforce/penalize text the environment produced, only the model’s own query/action-generation tokens. Direct transplant, not optional: shell stdout, HTTP response bodies, and scan output must never enter the policy-gradient loss, only the agent’s own tool-call arguments and reasoning tokens. Easy to miss when standing up a GRPO/RLVR loop on top of an SFT-warmed policy; this is a correctness bug, not a design choice.
  Designed to fix: pattern #1 (agents prefer raw curl/shell; 87.7% of tool calls bypass the provided surface). Search-R1 is direct evidence that RL specifically over which tool/query to issue is tractable in a comparable regime (search-engine calls), which supports RL (not just better prompting) as the lever for pattern #1.
- Bootstrap a value estimate at truncation instead of reward = 0. Verlog (below) proposes trajectory early truncation with a value-function bootstrap rather than waiting for the terminal reward. This is directly relevant since episodes cap at ~100 turns: today a hard timeout presumably returns zero reward for a run that made real, unfinished progress. Recovering partial-progress signal from failed-but-in-progress attempts is a second lever on the same sparse-terminal-reward problem, distinct from per-turn shaping.
- Plan exploration as its own object. PEARL (arXiv:2601.20439) treats which tools, in what order as something to explore/RL over, not just the final answer.
  Designed to fix: pattern #2 (no methodology / enumeration). “Plan exploration” is a formal mechanism for rewarding systematic tool-sequencing instead of react-and-guess.
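The "add density, don't replace" rule from the first item above can be sketched as a capped bonus on top of the verifier reward. The bonus size, the cap, and the discovery tuples are all assumptions; the only load-bearing choices are that discoveries come from sandbox state and that the total bonus can never approach the terminal reward.

```python
def shaped_return(flag_verified: bool, verified_discoveries: set,
                  bonus_per_discovery: float = 0.02, bonus_cap: float = 0.2) -> float:
    """Terminal verifier reward plus a small, capped bonus for verified progress."""
    terminal = 1.0 if flag_verified else 0.0
    bonus = min(bonus_cap, bonus_per_discovery * len(verified_discoveries))
    return terminal + bonus

# Hypothetical facts read from sandbox state, never parsed from model text.
progress = {("port", 8080), ("endpoint", "/admin"), ("credential", "svc-backup")}
partial = shaped_return(False, progress)
solved = shaped_return(True, set())
```

With the cap in place, a run that farms discoveries without capturing the flag still scores well below any run that captures it, which keeps the shaping term from being promoted to the primary objective.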
Why a 100-turn CTF is the hard case
Stack the constraints and the difficulty compounds — this is the “hard case” every technique above is implicitly being stress-tested against:
| Constraint | Why it bites at ~100 turns | Technique that targets it |
|---|---|---|
| Reward is terminal-only | Credit for the winning exploit turn gets diluted across ~100 turns of a flat trajectory-level advantage | GiGPO, turn-level cluster |
| Unbounded free-text state (shell/HTTP, not pixels/grid) | State-hashing methods built for discrete environments (web pages, grid worlds) don’t transplant for free | GiGPO — needs a bespoke canonicalization step |
| Variable episode length | Batched training wastes GPU cycles on padding/idle time when some rollouts finish in 10 turns and others run to 100 | Verlog’s early truncation |
| Long context growing every turn | Full transcript in every prompt overloads context retrieval well before turn 100 | Verlog’s customizable agent memory (windowed history) |
| On-policy exploration required, but entropy collapses under naive multi-turn RL | The “Echo Trap” — repeated low-diversity patterns keep scoring reward while exploration dies | RAGEN/StarPO’s entropy-as-trigger diagnosis |
| Exploitation phase, not enumeration, is where 62% of failures stall | A flat advantage rewards enumeration and exploitation turns identically, so neither gets a sharpened gradient | GiGPO step-groups, BSides pattern #4 |
Verlog — the only technique benchmarked past 100 turns [L] [E]
No arXiv id — OpenReview only (NeurIPS 2025 MTI-LLM workshop poster, openreview.net/forum?id=GmodkWwMV3) + project blog (wentsechen.github.io/Verlog_blogpost), Chen/Chen/Zhu/Schneider. Confidence: Promising — cite via OpenReview, do not fabricate an arXiv id.
Three mechanisms, all aimed at the “three failure modes of long-horizon agentic RL” the paper names explicitly: overloaded context, sparse terminal reward, variable trajectory length wasting GPU cycles.
- Customizable agent memory: a flexibly-sized history window per turn, decoupling “how much context the policy sees” from “how many turns the episode has run.”
- Dual-discounting GAE: two separate discount factors, γ_step (turn-to-turn credit decay) and γ_token (within-turn token credit decay), instead of one GAE discount applied uniformly. Direct generalization of ArCHer’s two-time-scale idea, implemented inside GAE instead of a separate off-policy critic.
- Trajectory early truncation: cuts long rollouts short during training and substitutes a value-bootstrap for the missing terminal reward, to cut GPU idle time from variance in episode length.
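The memory-window mechanism is the simplest of the three to picture. A minimal sketch, assuming the prompt is rebuilt each turn from a fixed system prompt plus the last few turns; the function name and window size are illustrative and this is not Verlog's code.

```python
def build_context(system_prompt: str, turns: list, window: int) -> list:
    """Keep the system prompt plus only the most recent `window` turns."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [system_prompt] + turns[-window:]

history = [f"turn {i}" for i in range(1, 101)]   # a 100-turn episode
context = build_context("You are a CTF agent.", history, window=8)
```

Prompt length now depends on the window, not on how long the episode has run. The cost is that anything discovered before the window (an early credential, say) is invisible unless it is carried forward some other way.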
Scale claim: the blog states prior frameworks (VeRL, RAGEN) handle ~10-turn tasks, verl-agent scales to ~50, and Verlog targets 400+ turn episodes (Crafter, 70–400 steps, avg ~190) — the only technique in this thread validated longer than your ~100-turn ceiling.
What I’d change in your pipeline: the dual-discounting GAE split (γ_step vs γ_token) is a small change, one discount factor becoming two, layered onto whatever advantage code the training loop already has, with no new critic beyond what GAE needs. It is the most directly answerable “what would you change” in this whole file. Flag honestly: the source is a workshop poster plus a blog, not an arXiv preprint or a main-track peer-reviewed paper.
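One plausible reading of the dual-discount idea, written as a sketch and not taken from Verlog's code: run ordinary backward GAE over the token stream, but pick the discount by whether the transition crosses a turn boundary. Sharing a single λ across both scales, the flat token layout, and the default values are assumptions. The bootstrap_value argument covers the early-truncation case, where a value estimate stands in for the missing terminal reward.

```python
def dual_discount_gae(rewards, values, ends_turn, bootstrap_value=0.0,
                      gamma_step=0.99, gamma_token=1.0, lam=0.95):
    """Per-token GAE whose discount depends on whether a turn boundary is crossed."""
    advantages = [0.0] * len(rewards)
    next_value, next_adv = bootstrap_value, 0.0
    for t in reversed(range(len(rewards))):
        gamma = gamma_step if ends_turn[t] else gamma_token
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

# Two turns of two tokens each, terminal reward on the last token.
adv = dual_discount_gae([0, 0, 0, 1], [0, 0, 0, 0], [False, True, False, True],
                        gamma_step=0.5, gamma_token=1.0, lam=1.0)
# Truncated episode: no terminal reward, but a bootstrapped value of 0.8.
truncated = dual_discount_gae([0, 0], [0.0, 0.0], [False, True],
                              bootstrap_value=0.8, gamma_step=1.0, lam=1.0)
```

In the first example, tokens inside a turn share one advantage and credit only decays across the turn boundary. In the second, the truncated run receives positive credit instead of the zero a hard timeout would give it.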
The RL-framework landscape (verl-agent / VerlTool / RAGEN)
Framework choice is an infra decision, not a research-finding one — flagged here because it gates which of the mechanisms above you can actually run without building rollout orchestration from scratch. Cross-reference against the harness/GPU-economics material once the framework-choice chapter from this research sweep lands (see cross-links below).
- verl-agent (github.com/langfengq/verl-agent): open-source agent-RL extension of veRL. No standalone arXiv paper; cite as infra, not a research claim. Scales to ~50-turn tasks per Verlog’s own comparison.
- VerlTool, “Towards Holistic Agentic Reinforcement Learning with Tool Use” (arXiv:2509.01055): surfaced in this research pass but not independently abs-page-verified; confirm before citing as settled.
- RAGEN (github.com; see StarPO above): the modular multi-environment testbed the Echo Trap diagnosis was built on; useful as a reference implementation for entropy/diversity monitoring, not just the paper.
- “Demystifying RL for Long-Horizon Tool-Using Agents” (arXiv:2603.21972, Wu et al., 2026-03-23) [L][R]: the closest thing to a systematic “what to tune first” ablation study. It decomposes the design space along 5 axes (reward shaping, model scaling, data composition, algorithm selection, environment design). Use this axis framing when triaging which lever to pull first on your own pipeline. Confidence: Promising (0 citations, <4 months old at verify-time, but methodologically the most comprehensive single source found).
Domain-adjacent: RL for CTF / pentesting agents directly
Academic, cited for context — not a basis for our decisions. CTF-Dojo, Pentest-R1, STRIATUM-CTF, and HackSynth are academic domain-specific CTF/pentest training or benchmark papers; none produced a frontier cybersecurity model, so none of them is load-bearing for any conclusion, recipe, or number below — they’re listed only so the researcher knows what already exists in the academic literature before presenting a technique here as novel.
- CTF-Dojo (arXiv:2508.18370, Zhuo et al., 2025-08-25) [R]: “the first large-scale executable runtime tailored for training LLMs with verifiable feedback” for CTF-style tasks, with 658 Docker-containerized challenges and ground-truth verified feedback. Context only; see re-grounding below for why verifier-grounded execution environments are the right substrate.
- Pentest-R1 (arXiv:2508.07382, He Kong et al., 2025-08-10) [L][R]: two-stage RL pipeline for autonomous pentesting reasoning, trained on 500+ real-world multi-step walkthroughs. Context only, unread beyond abstract. Do not use it to lock the project’s own reward-shaping design; if a reward-design decision needs a citation, it must come from the general RLVR/reward-shaping literature (Ng/Harada/Russell potential-based shaping, PURE/MONA reward-hacking literature) or this project’s own measured data, not from Pentest-R1.
- STRIATUM-CTF (arXiv:2603.22577, Hugglestone et al., 2026-03-23) [R]: MCP-standardized agentic framework for general-purpose CTF solving, targeting “multi-step, stateful reasoning” as the gap static benchmarks miss. Context only: it does not itself do turn-level RL and is not used here to justify GiGPO’s design (that justification stands entirely on GiGPO’s own paper, arXiv:2505.10978, which motivates state-hashed step-groups from first principles, no CTF-specific evidence required).
- HackSynth (arXiv:2506.02048, Muzsai, Imolai, Lukács, 2025-06-01) [R][E]: fine-tunes a tool-augmented Llama-3.1-8B via vanilla, trajectory-level GRPO on a procedurally-generated crypto-CTF dataset. Context only; not used as evidence for the “start with vanilla GRPO” recommendation below, which is re-grounded independently.
Re-grounded recommendation (start vanilla, add turn-level machinery only if needed): this is supported by the general empirical ablation in “Demystifying RL for Long-Horizon Tool-Using Agents” (arXiv:2603.21972, already cited above, domain-general not CTF-specific), whose 5-axis decomposition (reward shaping, model scaling, data composition, algorithm selection, environment design) treats algorithm choice as one axis to tune after establishing a working baseline — plus this project’s own handbook rule that GRPO baseline must first land in the 30–60% signal band (memory/handbook.md §10) before any additional machinery is justified. Re-grounded substrate claim: that verifier-grounded execution environments (not format/regex reward) are the right substrate is this project’s own confirmed constraint, not an inference from CTF-Dojo — see the ground-truth-flag-verified rule and the SFT-induced FLAG{} confabulation failure (lessons/post-training/sft-induced-flag-confabulation.md), which is the actual basis.
The live gap: no paper in this pass demonstrates turn-level credit assignment (GiGPO/Verlog-class) applied to an offensive-security/CTF domain specifically. Transplanting GiGPO/Verlog mechanisms to CTF is this project’s own contribution to make, not something to find pre-solved — and, per the standing rule, not something to validate by reference to the academic CTF papers above.
Tool-integrated reasoning RL: ReTool, ToRL, Search-R1
These three share a mechanism — RL over an interleaved reason+tool-call+observation loop — but differ in domain (math/code-interpreter vs. search). Relevant because the harness is a tool-integrated-reasoning loop (shell, HTTP, scanning tools).
- ReTool (arXiv:2504.11536, Feng et al., 2025-04-15) [R][E]: interleaves real-time code-interpreter execution inside the reasoning trace, and trains the interleaving policy with RL. The rollout is no longer “generate full CoT then maybe call a tool”: the policy learns when to interrupt its own reasoning to invoke a tool and resume conditioned on the result; credit must flow through that interruption boundary. Domain is math; direct transplant to CTF tool-calling (when to run curl vs. reason further) is analogous but unvalidated in this domain.
- ToRL (arXiv:2503.23383, Li, Zou, Liu, 2025-03-30) [E][N]: the strongest [N] citation in this thread. Pure RL (no SFT-on-tool-traces warmup) teaches autonomous tool invocation and reports emergent strategic tool-use behaviors absent from the SFT-only baseline, beating the best tool-integrated-reasoning model on AIME’24 by double digits. This is the same RL > SFT distinction the project’s diagnosis already leans on (“SFT replaces, GRPO amplifies,” arXiv:2507.10616) but pushed further: RL-from-scratch surfacing qualitatively new patterns is closer to boundary-expansion than amplification. Domain is math tool-use (Python), not offensive security; treat the emergent-behavior claim as suggestive, not proven, here.
- Search-R1 (arXiv:2503.09516, Jin et al., 2025-03-12) [R][E]: the loss-masking detail (mask tool/environment output tokens from the policy gradient) covered above under reward shaping; also BSides pattern #1 evidence.
Domain gap to flag honestly: all three are validated in math/search domains with much shorter tool-call chains than a 100-turn CTF episode — the mechanism (interleave, mask, learn-when-to-call) transfers; the specific hyperparameters (how often to call, reward magnitude per call) don’t, and need re-deriving on your own verifier-based reward.
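The masking rule attributed to Search-R1 above is mechanical enough to sketch. This assumes per-token log-probabilities, per-token advantages, and a boolean mask marking which tokens the model itself generated; it uses plain lists and a REINFORCE-style term for readability, where a real loop would use tensors and the clipped objective.

```python
def masked_pg_loss(token_logprobs, advantages, is_model_token):
    """Mean policy-gradient loss over model-generated tokens only."""
    kept = [-lp * adv
            for lp, adv, keep in zip(token_logprobs, advantages, is_model_token)
            if keep]
    if not kept:
        raise ValueError("no model-generated tokens to train on")
    return sum(kept) / len(kept)

# Hypothetical 5-token span: 2 tool-call tokens, 2 shell-output tokens, 1 reasoning token.
logprobs = [-0.5, -1.0, -9.0, -9.0, -0.5]
advantages = [1.0] * 5
mask = [True, True, False, False, True]
loss = masked_pg_loss(logprobs, advantages, mask)
unmasked = masked_pg_loss(logprobs, advantages, [True] * 5)
```

The environment tokens are the ones with very low log-probability under the policy, so leaving them in lets text the agent never wrote dominate the loss. Note the mean is taken over kept tokens only, so masking does not silently shrink the loss scale.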
Summary table
The four rows below marked “context only” (CTF-Dojo, Pentest-R1, STRIATUM-CTF, HackSynth) are academic domain-specific CTF/pentest papers — cited for landscape awareness, not as a basis for any conclusion/recipe/number in this file (see re-grounding above).
| Technique | arXiv (verified unless noted) | Tags | BSides pattern | Confidence | One-line takeaway |
|---|---|---|---|---|---|
| GRPO (baseline) | 2402.03300 | [L] | — | Verified | Group-mean baseline, no critic — trajectory-level only |
| GAE (baseline) | 1506.02438 | [L] | — | Verified | Classical multi-step advantage; single time-scale |
| GiGPO | 2505.10978 | [L][E] | #4 | Verified | Step-level advantage via state-hash groups, zero extra rollouts |
| ArCHer | 2402.19446 | [L] | — | Verified | Turn-level off-policy critic + token-level on-policy — heavier infra |
| RAGEN / StarPO | 2504.20073 | [E][L] | #2 | Verified | Names & diagnoses the “Echo Trap” entropy collapse |
| Turn-Level Reward Design | 2505.11821 | [L][E] | #4 | Promising | Dense per-turn reward layered on sparse terminal reward |
| Turn-PPO | 2512.17008 | [L] | — | Promising | Argues PPO turn-value beats GRPO group-baseline at long horizon |
| TL-GRPO | 2601.16480 | [L] | — | Promising | Turn credit for same-state-revisited (iterative) tasks |
| A2TGPO | 2605.06200 | [L] | — | Promising | Adaptive per-turn PPO clip range |
| Proximity-Based MTO | 2602.19225 | [L] | — | Promising | Weight credit by task difficulty, not just position |
| GAGPO | 2605.13217 | [L] | — | Promising | GAE-style generalized advantage inside GRPO grouping |
| Verlog | no arXiv — OpenReview | [L][E] | — | Promising | Dual-discount GAE + memory-windowing + early truncation; 400+ turn scale |
| Demystifying RL for Long-Horizon Tool Agents | 2603.21972 | [L][R] | — | Promising | 5-axis empirical recipe: reward/scale/data/algorithm/env |
| PEARL | 2601.20439 | [E][R] | #2 | Promising | RL over the planning/tool-sequencing step itself |
| ReTool | 2504.11536 | [R][E] | — | Verified | Interleaved code-exec reasoning, RL over interruption points |
| ToRL | 2503.23383 | [E][N] | — | Verified | RL-from-scratch surfaces emergent tool-use strategies |
| Search-R1 | 2503.09516 | [R][E] | #1 | Verified | Mask tool-output tokens from policy-gradient loss |
| CTF-Dojo (context only) | 2508.18370 | [R] | — | Verified, academic — not a basis | 658-challenge verifier-grounded CTF RL environment |
| Pentest-R1 (context only) | 2508.07382 | [L][R] | — | Verified (needs deeper read), academic — not a basis | Two-stage RL for pentest reasoning |
| STRIATUM-CTF (context only) | 2603.22577 | [R] | #4 | Promising, academic — not a basis | MCP-standardized stateful CTF agent framework |
| HackSynth (crypto CTF) (context only) | 2506.02048 | [R][E] | — | Verified, academic — not a basis | Vanilla GRPO already works on narrow (crypto) CTF |
| VerlTool | 2509.01055 | — | — | Unverified (flag before citing) | Holistic agentic-RL-with-tool-use framework |
Open questions for the next research pass
- No paper in this pass demonstrates turn-level credit assignment (GiGPO/Verlog-class) applied to an offensive-security/CTF domain specifically — transplanting these mechanisms to CTF is this project’s own contribution to make, not something to find pre-solved.
- Verlog has no arXiv paper — only an OpenReview NeurIPS-workshop submission and project blog. Cite the OpenReview id, not a fabricated arXiv id, and flag it as workshop-tier evidence.
- VerlTool (arXiv:2509.01055) surfaced in search but was not independently verified via abs-page crawl — confirm before citing as settled.
- Pentest-R1’s exact reward/credit design was not deep-dived here (abstract only). Per the project’s standing rule, it is context/landscape awareness only, never a basis for the project’s own reward-shaping design — that design must be re-grounded on general RLVR/reward-shaping theory or this project’s own measured data, not on Pentest-R1 or any other academic CTF/pentest paper.
Cross-links
- Reinforcement — PPO · GRPO · RLVR — the base algorithm (and its own “Exploration and entropy” section — DAPO Clip-Higher, Dr. GRPO) every technique in this chapter modifies or wraps.
- Imitation — SFT · rejection sampling — the pre-RL stage this project runs first; RAGEN’s Echo Trap is precisely the risk that activates once you graduate past it.
- Contested edges & landmines: the flag-confabulation gotcha, the RFT terminology trap, and the “RL can’t create capability” debate that ToRL’s [N] evidence bears on directly.
- Frontier recipes: Kimi K2.5’s PARL and OpenAI Deep Research’s end-to-end agentic RL, cited above as production evidence, are detailed there per-lab.
- Framework-choice / GPU-economics chapter and the frontier-lab sweep chapter (this overnight research batch): once those land in SUMMARY.md, wire a link here for the verl-agent/VerlTool infra choice and the cross-lab RL-recipe comparison. Their file names weren’t finalized at write-time for this chapter; the integrator should confirm the actual paths and slot them into this section.
RL that creates value — long-horizon, exploration, reasoning, novelty
The other method pages (Reinforcement, Agentic & multi-turn RL) cover which algorithm (PPO/GRPO/GSPO) and what training-loop shape (single-shot vs. multi-turn-with-tools). This page is the tagged sweep of what actually makes RL pay off for a 100-turn, tool-using, verifier-rewarded CTF agent that already has an execution gap (capability present, unreliable) rather than a knowledge gap. Every technique below is filed under the axis it serves, cross-referenced to the behavioral audit patterns it targets, and cited only where the arxiv id was verified live (crawled arxiv.org/abs/<id> on 2026-07-02).
Legend
| Tag | Axis | Why it matters here |
|---|---|---|
| [L] | Long-horizon / credit assignment | Episodes run to ~100 turns, reward is usually terminal-only (flag verified or not) — plain trajectory-level advantage smears credit across every turn equally. |
| [E] | Exploration / entropy preservation | Cybersecurity is search/enumeration. Entropy collapse = the policy commits early to one narrow script and stops trying alternatives. |
| [R] | Reasoning / test-time compute | Multi-step vuln chaining, deciding what to try next, allocating turns to hard vs. easy phases. |
| [N] | Novelty / boundary-expansion | Evidence the technique expands what the policy can do at all (finds attack paths the base model never finds at any k), not just amplifies what’s already samplable. |
Baseline context this sweep assumes: GRPO (critic-free, group-mean baseline, arXiv:2402.03300) and GAE (classical single-trajectory multi-step advantage, arXiv:1506.02438) are what everything below is patching. Vanilla GRPO/PPO assume one reward at the end of one generation — a 100-turn tool-using episode breaks that on two axes simultaneously: the “trajectory” is now dozens of interleaved generation-turns + environment observations, and reward is terminal-only, so the group-mean baseline credits/blames every turn identically regardless of whether it was an exploratory enumeration step or the exploit-landing step.
Reward-shaping guardrail (applies to every [L] technique below): any dense, per-turn, or process-level reward introduced to fix credit assignment must be a shaping term added on top of, not a replacement for, the terminal ground-truth flag verification. This project’s confirmed lesson (SFT-induced FLAG{} confabulation from a loose regex reward) is exactly the failure mode that reopens if a proxy signal gets promoted to primary reward. See the reward-hacking cluster in §3.4.
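One way to enforce the guardrail mechanically is to cap the total shaping contribution below the value of a verified flag, so that no amount of proxy signal can outrank a real solve. This is a sketch under assumptions: the function name, the `shaping_cap` value, and the idea of summing per-turn bonuses are illustrative, not the project's actual reward code.

```python
def shaped_episode_reward(flag_verified, turn_bonuses, shaping_cap=0.2):
    """Terminal ground-truth reward plus a bounded shaping term.

    flag_verified: result of the real verifier (never a regex on the output).
    turn_bonuses: hypothetical per-turn shaping values, e.g. tool-exec success.
    shaping_cap: upper bound on total shaping, kept well below the terminal 1.0.
    """
    terminal = 1.0 if flag_verified else 0.0
    shaping = max(0.0, min(shaping_cap, sum(turn_bonuses)))
    return terminal + shaping

# A failed episode with maximal shaping still scores below any verified solve.
failed_but_busy = shaped_episode_reward(False, [0.1] * 50)
solved_quietly = shaped_episode_reward(True, [])
print(failed_but_busy, solved_quietly)
```

The cap is the design choice that matters: it keeps the ordering "verified solve beats any failure" true for every possible shaping signal.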
1. Long-horizon / credit assignment for agents [L]
1.1 GiGPO — step-level credit with zero extra rollouts
arXiv:2505.10978 (Feng, Xue, Liu, An; 2025-05-16; NeurIPS 2025 poster). Two nested grouping levels instead of GRPO’s one: the usual episode-level group (N full rollouts, trajectory advantage as before) plus a step-level group — retroactively hash (state, step) pairs, bucket steps that recur across rollouts at the same environment state, and compute a second advantage from “what happened next” conditioned on that state. No new critic, no extra rollouts — pure post-hoc bookkeeping on trajectories you already sampled.
# after collecting the usual GRPO batch of N trajectories
key = lambda s: (s.tool_name, normalize(s.target), s.response_status_class) # state canonicalization
anchor_groups = bucket_by(key, all_steps) # all_steps: every step from all N trajectories
step_adv = {step: advantage_within(step, anchor_groups[key(step)]) for step in all_steps}
advantage = {step: episode_advantage[step.traj] + lam * step_adv[step] for step in all_steps} # combine before the clipped PG update
Designed to fix: pattern 4 (uneven PTES phases — 62% of failures stall in exploitation). Step-level credit stops rewarding all 100 turns equally when only the exploitation phase decides the outcome.
Gotcha: needs a hashable notion of state recurrence — a CTF agent’s state is unbounded free text (shell/HTTP output), so the state key above needs bespoke canonicalization, not raw-text hashing. Confidence: Verified. For this agent: the single most directly transplantable idea in this sweep — no new infra, just a state-canonicalization function over the harness’s existing tool_call/tool_exec_ms spans.
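The bookkeeping described above can be sketched end to end in plain Python. Everything here is an assumption made for illustration: the step dictionaries, the `canonical_state` rule (tool name, lower-cased target without trailing slash, HTTP-style status class), and the use of a mean-centred return as the step advantage. The paper's exact normalization may differ.

```python
from collections import defaultdict
from statistics import mean

def canonical_state(step):
    """Hypothetical canonicalization: collapse free-text state to a hashable key."""
    return (step["tool"], step["target"].lower().rstrip("/"), step["status"] // 100)

def step_level_advantages(trajectories):
    """Group steps that share a canonical state and centre their returns.

    trajectories: list of trajectories, each a list of step dicts that carry
    'tool', 'target', 'status' and 'return' (return-to-go from that step).
    Returns {(trajectory_index, step_index): step_advantage}.
    """
    groups = defaultdict(list)
    for ti, trajectory in enumerate(trajectories):
        for si, step in enumerate(trajectory):
            groups[canonical_state(step)].append((ti, si, step["return"]))

    advantages = {}
    for members in groups.values():
        baseline = mean(ret for _, _, ret in members)
        for ti, si, ret in members:
            advantages[(ti, si)] = ret - baseline  # singleton groups get 0.0
    return advantages

rollouts = [
    [{"tool": "curl", "target": "http://x/Admin/", "status": 200, "return": 1.0}],
    [{"tool": "curl", "target": "http://x/admin", "status": 204, "return": 0.0}],
    [{"tool": "nmap", "target": "x", "status": 0, "return": 0.0}],
]
print(step_level_advantages(rollouts))
```

Steps whose state never recurs fall into singleton groups and contribute no step-level signal, which is why the quality of the canonicalization function decides how much credit this recovers.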
1.2 ArCHer — turn-level vs. token-level, two time-scales
arXiv:2402.19446 (Zhou, Zanette, Pan, Levine, Kumar; 2024-02-29). High-level off-policy TD critic at turn granularity + low-level on-policy token PG bootstrapped off it, which decouples “which turn was good” from “which token was good.” The conceptual ancestor of “turn is the right credit unit,” but the off-policy critic reintroduces exactly the instability/infra weight the project’s critic-free GRPO preference was chosen to avoid. Confidence: Verified. Takeaway: use the decomposition, not the off-policy mechanism. GiGPO (§1.1) gets step-level credit with no learned critic at all; Verlog (§1.6) folds the turn/token split into a single on-policy GAE, which still needs a value estimate but not ArCHer’s off-policy TD critic.
1.3 RAGEN / StarPO — naming the collapse in multi-turn RL
arXiv:2504.20073 (Wang et al.; 2025-04-24). Diagnostic paper: trajectory-level RL on multi-turn agents reproducibly collapses into the “Echo Trap” — reasoning traces converge to a small repeated repertoire that keeps scoring reward on the training distribution while generalization/exploration dies. Fixes center on reward normalization across turns + explicit rollout-diversity interventions.
Designed to fix: pattern 2 (react-and-guess, no methodology, 82% pivot-after-failure). An entropy-collapsed policy is one explanation for an agent that stops trying diverse enumeration and converges to a small set of guesses.
Confidence: Verified. For this agent: the citation for “watch policy entropy as the graduation trigger” — instrument per-turn action-entropy or tool-sequence diversity during RL and treat a converging curve as the operational signal, not epoch count.
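A cheap proxy for the entropy signal described above is the Shannon entropy of the tool names an agent actually calls, tracked per checkpoint. This is a sketch under the assumption that tool names can be read off the harness spans; `action_entropy` is an illustrative helper, and tool-name entropy is only a stand-in for true policy entropy.

```python
import math
from collections import Counter

def action_entropy(tool_calls):
    """Shannon entropy (nats) of the empirical tool-name distribution."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical tool traces from an early and a late RL checkpoint.
early = ["nmap", "curl", "gobuster", "sqlmap", "curl", "nikto"]
late = ["curl", "curl", "curl", "curl", "curl", "curl"]
print(action_entropy(early), action_entropy(late))
```

A value that falls toward zero across checkpoints while solve rate on held-out challenges stays flat is the pattern to treat as the graduation trigger.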
1.4 The turn-level-advantage cluster (2025–26 convergence)
A fast-moving cluster of near-simultaneous papers, all attacking “vanilla GRPO’s trajectory advantage is too coarse for multi-turn agents” with distinct mechanisms. Treat as one finding: turn is the unit of advantage is now cross-group consensus, even though no implementation is yet the default.
| Paper | arXiv | Angle |
|---|---|---|
| Turn-Level Reward Design | 2505.11821 | Dense per-turn reward layered on the sparse terminal reward; on tool-use benchmarks, trajectory-level baselines can fail to invoke tools at all (20-30% exact-match) vs. 100% tool-exec-success with turn-level reward. |
| Turn-PPO | 2512.17008 | Goes back to a per-turn value function (PPO-style) instead of GRPO’s group baseline, arguing GRPO’s clipped-group-relative update has “notable limitations” specifically for long-horizon reasoning. |
| TL-GRPO | 2601.16480 | Turn credit for same-state-revisited tasks (iterative code repair) — narrower than general multi-turn, but overlaps GiGPO’s step-grouping idea. |
| A2TGPO | 2605.06200 | Adaptive per-turn PPO clip range — early-episode and near-terminal turns have different advantage-magnitude distributions; one fixed clip under/over-constrains one of the two. |
| Proximity-Based MTO | 2602.19225 | Weight credit by task difficulty, not just turn position — a success on a hard task is more informative than a success on a trivial one. |
| GAGPO | 2605.13217 | Brings GAE’s λ-discounted multi-step advantage directly into GRPO’s group-relative framework. |
Designed to fix: pattern 4 (uneven PTES phases) — dynamic/turn-level signal keeps hard-exploitation-phase prompts from silently degenerating to all-zero-reward groups.
Confidence: all Promising (recent, 0-few citations at verify time) except Turn-Level Reward Design (workshop-validated). For this agent: don’t pick one paper as “the” answer — prototype the cheapest version (GiGPO’s step-grouping, zero extra infra, or a per-turn tool-exec-success shaping term) before reaching for a second value network (Turn-PPO/TL-GRPO).
1.5 GSPO — sequence-level clipping stabilizes as sequences get long
arXiv:2507.18071 (Qwen team, 2025-07). Import from Reinforcement: clip/importance-sample at the whole-sequence level instead of per-token, because token-level ratios compound multiplicatively over long sequences — exactly the failure mode a 100-turn tool-interleaved trajectory maximizes. This is Qwen3’s actual production RL algorithm. Tags: [L] [R] [E] (more stable optimization = less premature collapse). For this agent: if a future GRPO run destabilizes on long trajectories, GSPO-style sequence-level ratios should be the first thing tried, not more KL tuning.
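The compounding argument is easy to see numerically. The sketch below compares the product of per-token ratios with a length-normalized sequence ratio, exp of the mean per-token log-ratio, which is how the sequence-level ratio is assumed to be defined here. The `eps` clip width is a placeholder, not a recommended value, and all function names are illustrative.

```python
import math

def token_level_ratios(logp_new, logp_old):
    """Per-token importance ratios pi_new / pi_old."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_level_ratio(logp_new, logp_old):
    """Length-normalized sequence ratio: exp(mean per-token log-ratio)."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

def clipped_sequence_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate applied once per sequence."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# 100 tokens, each only 0.01 nats more likely under the new policy.
logp_old = [-1.01] * 100
logp_new = [-1.00] * 100
print(math.prod(token_level_ratios(logp_new, logp_old)))  # compounds to about e
print(sequence_level_ratio(logp_new, logp_old))           # stays near 1.01
```

A tiny, uniform per-token drift multiplies into a large trajectory-level ratio, while the normalized sequence ratio stays on the same scale regardless of length.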
1.6 Verlog — the only technique benchmarked past 100 turns
No arXiv id — OpenReview only (NeurIPS 2025 MTI-LLM workshop poster, openreview.net/forum?id=GmodkWwMV3; Chen, Chen, Zhu, Schneider). Cite the OpenReview id, not a fabricated arXiv one. Three mechanisms: (1) customizable agent memory — a flexible history window decoupled from episode length, (2) dual-discounting GAE — separate discount factors for turn-to-turn (γ_step) vs. within-turn token (γ_token) credit decay, generalizing ArCHer’s two-time-scale idea into a single GAE, (3) trajectory early truncation — bootstrap a value estimate instead of paying full wall-clock for variable-length rollouts.
gamma_step, gamma_token = 0.95, 0.99 # two discounts instead of one uniform GAE gamma
# ... standard GAE recursion, but decayed across turns with gamma_step and within a turn with gamma_token
Scale claim: prior frameworks (RAGEN ~10 turns, verl-agent ~50 turns) top out well below this project’s ~100-turn ceiling; Verlog is validated on Crafter at 70–400 steps (avg ~190). Confidence: Promising, workshop-tier; flag as such if cited externally. For this agent: the dual-discount split is a one-extra-hyperparameter change on top of whatever GAE code already exists, with no critic beyond the value estimate GAE already uses; trajectory early truncation directly targets the “100-turn hard timeout returns reward=0” problem by recovering partial-progress signal instead of discarding it.
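The dual-discount idea can be sketched as an ordinary backward GAE pass whose discount depends on whether the next token belongs to the same turn. This is an illustration under assumptions, not Verlog's implementation: it uses a single shared `lam`, bootstraps with zero after the final token, and takes per-token `values` from whatever value estimate the loop already has (pass zeros to get plain discounted returns).

```python
def dual_discount_gae(rewards, values, turn_ids,
                      gamma_step=0.95, gamma_token=0.99, lam=0.95):
    """GAE with one discount across turn boundaries and another within a turn.

    rewards, values: per-token lists of equal length.
    turn_ids: turn index of each token; a change marks a turn boundary.
    """
    n = len(rewards)
    advantages = [0.0] * n
    next_adv, next_value = 0.0, 0.0  # zero bootstrap after the last token
    for t in reversed(range(n)):
        crosses_turn = t + 1 < n and turn_ids[t + 1] != turn_ids[t]
        gamma = gamma_step if crosses_turn else gamma_token
        delta = rewards[t] + gamma * next_value - values[t]
        advantages[t] = delta + gamma * lam * next_adv
        next_adv, next_value = advantages[t], values[t]
    return advantages

# Two turns of two tokens each, terminal reward on the last token.
adv = dual_discount_gae([0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 1, 1], lam=1.0)
print(adv)
```

Credit decays by `gamma_token` inside a turn and by `gamma_step` only once per turn boundary, so a long tool-call turn does not dilute credit more than a short one.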
1.7 verl-agent and the framework landscape
verl-agent (github.com/langfengq/verl-agent) — an open-source veRL extension for agent RL, ~50-turn scale per Verlog’s own comparison. No arXiv paper of its own; cite as infrastructure, not a research claim. Adjacent: “Demystifying RL for Long-Horizon Tool-Using Agents” (arXiv:2603.21972, Wu et al., 2026-03-23) decomposes the agentic-RL design space along 5 axes — reward shaping, model scaling, data composition, algorithm selection, environment design — the most systematic “what to tune first” ablation study found for this domain. [L][R], Promising. PEARL (arXiv:2601.20439, Wang et al., 2026-01-28) treats plan exploration (which tools, what order) as its own RL object, directly relevant to pattern 2 (no methodology). [E][R], Promising.
1.8 Tool-integrated reasoning RL: ReTool, ToRL, Search-R1
Three papers sharing a mechanism — RL over an interleaved reason→tool-call→observation loop:
- ReTool (arXiv:2504.11536, Feng et al., 2025-04-15): interleaves real-time code-interpreter execution inside the reasoning trace and RL-trains when to interrupt reasoning to invoke a tool. [R][E], Verified, math domain.
- ToRL (arXiv:2503.23383, Li, Zou, Liu, 2025-03-30): pure RL, no SFT-on-tool-traces warmup, reports emergent strategic tool-invocation behaviors absent from an SFT-only baseline. [E][N] — see §4.2.
- Search-R1 (arXiv:2503.09516, Jin et al., 2025-03-12): masks retrieved/tool-output tokens out of the policy-gradient loss — the policy didn’t generate them, don’t backprop through them.
# retrieved/tool-output masking — a correctness detail, not a design choice
loss_pg = -(mask_model_generated_tokens * min(ratio * adv, clipped * adv))
# shell stdout / HTTP response body / scan output: masked out, same rule as search results
Designed to fix: pattern 1 (87.7% of tool calls bypass the rich tool surface; 26/40 tools dead). Search-R1 is direct evidence RL over “which tool/query to issue” is solved-enough in a comparable regime (search-engine calls).
For this agent: the loss-masking detail is a must-have engineering correctness fix regardless of which algorithm gets chosen — easy to miss when standing up a GRPO/RLVR loop.
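A minimal, framework-free sketch of the masking rule follows. It assumes per-token ratios and advantages are already computed and that the harness can label each token as model-generated or environment-produced; `masked_pg_loss` and the `eps` default are illustrative, not taken from Search-R1's code.

```python
def masked_pg_loss(ratios, advantages, generated_mask, eps=0.2):
    """Clipped policy-gradient loss averaged over model-generated tokens only.

    generated_mask[i] is True for tokens the policy sampled and False for
    tool output (shell stdout, HTTP bodies, scan results).
    """
    total, count = 0.0, 0
    for ratio, adv, is_model_token in zip(ratios, advantages, generated_mask):
        if not is_model_token:
            continue  # environment tokens contribute neither loss nor count
        clipped = max(1 - eps, min(1 + eps, ratio))
        total += -min(ratio * adv, clipped * adv)
        count += 1
    return total / count if count else 0.0

# The middle token is tool output with a wild ratio; it must not move the loss.
print(masked_pg_loss([1.0, 50.0, 1.0], [1.0, 1.0, 1.0], [True, False, True]))
```

Dividing by the count of model tokens rather than the full sequence length keeps long tool outputs from shrinking the gradient on the tokens that are trained.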
1.9 Domain precedent — CTF/pentest RL (academic, context only)
These are academic domain-specific CTF/pentest training/benchmark papers — cited for context only, not a basis for any conclusion, decomposition, recipe, or number on this page (per the project’s standing rule: none of this line of work has produced a frontier cybersecurity model). Every claim they might otherwise appear to license is re-grounded below on general frontier/theory evidence or the project’s own confirmed data instead.
- CTF-Dojo (arXiv:2508.18370, 2025-08-25) — academic, cited for context only: 658 Docker-containerized CTF challenges with verified feedback. Re-grounded: this page’s “ground-truth-verified reward, not format-matched” requirement rests on the project’s own confirmed lesson (SFT-induced FLAG{} confabulation from a loose regex reward — see the reward-shaping guardrail above §1) and on the Spurious Rewards finding (§3.6, arXiv:2506.10947) that a wrong reward can still look like it’s working — not on CTF-Dojo.
- Pentest-R1 (arXiv:2508.07382, 2025-08-10) — academic, cited for context only: a two-stage offline-RL-on-walkthroughs → online-RL-in-Intercode-CTF pipeline. Re-grounded: the project’s own SFT→RL staging decision is grounded on DeepSeek-R1’s four-stage recipe at frontier scale (§3.1, arXiv:2501.12948) and on RAFT/Reinforce-Rej’s controlled ablation of rejection-sampling SFT (§2.6, arXiv:2504.11343) — do not lock the project’s reward shape off Pentest-R1’s design.
- STRIATUM-CTF (arXiv:2603.22577, 2026-03-23) — academic, cited for context only: an MCP-standardized framework targeting “multi-step, stateful reasoning.” Re-grounded: GiGPO’s state-hashing (§1.1) is justified on GiGPO’s own general-agent-RL evidence (arXiv:2505.10978), not on STRIATUM-CTF’s framing.
- HackSynth / Random-Crypto (arXiv:2506.02048, 2025-06-01) — academic, cited for context only: fine-tunes Llama-3.1-8B with vanilla GRPO on procedural crypto-CTF. Re-grounded: the claim that turn-level machinery matters more as horizon/challenge-length grows is instead supported by Verlog’s own turn-count comparison (§1.6, OpenReview) and PSN-RLVR’s length-scaling result (§2.8, arXiv:2602.02555) — general long-horizon-RL evidence, not a domain-specific CTF result.
Open gap: none of the above academic CTF-domain papers demonstrate turn-level credit assignment (GiGPO/Verlog-class) applied to offensive security — transplanting it is this project’s own contribution to make, not something pre-solved by this (academic, context-only) body of work.
2. Exploration & entropy preservation [E]
2.1 The entropy-collapse law, and its surgical fix
One-line idea: policy entropy collapses sharply and monotonically early in RLVR, and performance is bound by a fitted law R = -a·exp(H) + b — you are trading entropy for performance, hitting a hard, predictable ceiling at H=0. Mechanism: entropy change is driven by the covariance between a token’s action-probability and its logit update, which is proportional to advantage — high-probability, high-advantage tokens keep getting pushed toward certainty, and that covariance term stays positive almost everywhere.
Fix — Clip-Cov / KL-Cov: identify the small set of highest-covariance tokens per batch and either drop the PG update on them or scope an extra KL penalty to just them, leaving the rest untouched. Already merged into verl as a loss-mode flag.
cov = (logp - logp.mean()) * (adv - adv.mean())
mask = cov > percentile(cov, 1 - clip_frac) # top ~0.2-2% of tokens
loss_pg[mask] = loss_pg[mask].detach() # Clip-Cov
# or: loss = loss_pg + kl_coef * kl_per_token * mask # KL-Cov
Designed to fix: patterns 2 and 3 (no methodology / brittle single-guess) — both are symptoms of a policy that already spent its entropy budget on a narrow, high-probability action sequence.
Confidence: High (mechanistic + empirical, 342 citations within a year, adopted upstream into verl within a month). Source: Cui et al., arXiv:2505.22617 (2025-05-28). For this agent: instrument mean(entropy) from RLVR step 0 — the single cheapest, highest-leverage move in this whole sweep, with no design decisions required.
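Instrumenting entropy and reading the fitted law needs only a few lines. The sketch below assumes access to the per-token probability distributions from the sampler; the coefficients `a` and `b` are placeholders for values you would fit on your own (entropy, reward) curve, not numbers from the paper.

```python
import math

def mean_token_entropy(token_distributions):
    """Average Shannon entropy (nats) over generated tokens.

    token_distributions: one probability list per generated token.
    """
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)

def predicted_reward(entropy, a, b):
    """Fitted law from the text: R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b

# One undecided token and one fully committed token.
print(mean_token_entropy([[0.5, 0.5], [1.0, 0.0]]))

# With placeholder coefficients, the ceiling is the value at H = 0, i.e. b - a.
print(predicted_reward(0.0, a=0.3, b=0.9))
```

Logging the first quantity from step 0 and fitting the second on the logged curve gives an early estimate of where the run will plateau if nothing preserves entropy.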
2.2 Clip asymmetry — Clip-Low and Clip-High are not symmetric
arXiv:2509.26114 (Park, Kim et al., 2025-09-30). The two PPO clip bounds act on entropy in opposite directions: clipping on the low side (clip-low) increases entropy, while clipping on the high side (clip-high) decreases it. They are independent exploration knobs, not one symmetric hyperparameter. Mechanistic complement to DAPO’s clip-higher (§2.3): raising ε_high loosens the entropy-decreasing clip, which is why clip-higher preserves entropy. Confidence: Medium-high, single paper. For this agent: if clip-higher alone doesn’t fully solve collapse, look at the low-side clip too.
2.3 DAPO — four engineering fixes, one entropy-preserving
arXiv:2503.14476 (Yu, Zhang et al., 2025-03-18). Clip-Higher (decoupled ε_low/ε_high, ~0.20/0.28, so rare-but-good tokens gain probability faster than they’re suppressed) + Dynamic Sampling (resample any prompt whose whole rollout group is all-correct or all-incorrect — zero-advantage groups give zero gradient) + token-level loss aggregation + overlong-response soft penalty.
eps_low, eps_high = 0.20, 0.28
loss_pg = -min(ratio * adv, clip(ratio, 1-eps_low, 1+eps_high) * adv) # per-token, mean over ALL tokens
group_rewards = rollout_and_score(prompt, n=G)
while std(group_rewards) == 0: # dynamic sampling: degenerate groups contribute nothing
    group_rewards = rollout_and_score(resample_prompt(), n=G)
Designed to fix: patterns 2, 3, and indirectly 4 — dynamic sampling keeps gradient flowing even on hard-exploitation prompts that would otherwise silently degenerate to all-zero-reward.
Confidence: High — open weights/code/data, one of the most widely adopted open RLVR recipes. For this agent: at ~100 turns/rollout, resampling degenerate groups is expensive — pair with curriculum/difficulty filtering (drop challenges outside the project’s own 30–60% band) rather than paying for resamples on genuinely-unsolved challenges.
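The "filter instead of resample" suggestion can be sketched as a triage pass over already-scored groups. Assumptions are loud here: rewards are binary, the band is applied to each group's own empirical pass rate as a stand-in for a separately measured difficulty estimate, and `select_training_groups` is an illustrative name.

```python
def select_training_groups(group_rewards, band=(0.30, 0.60)):
    """Split scored rollout groups into trainable, degenerate, and out-of-band.

    group_rewards: {challenge_id: [0/1 reward per rollout]}.
    band: inclusive pass-rate range treated as the productive difficulty zone.
    """
    keep, degenerate, out_of_band = {}, [], []
    for challenge, rewards in group_rewards.items():
        pass_rate = sum(rewards) / len(rewards)
        if pass_rate in (0.0, 1.0):
            degenerate.append(challenge)   # zero advantage, zero gradient
        elif band[0] <= pass_rate <= band[1]:
            keep[challenge] = rewards
        else:
            out_of_band.append(challenge)  # has signal, but outside the band
    return keep, degenerate, out_of_band

scored = {
    "web-01": [0, 0, 0, 0],
    "web-02": [1, 0, 1, 0],
    "pwn-01": [1, 1, 1, 1],
    "crypto-01": [1, 0, 0, 0, 0, 0, 0, 0],
}
print(select_training_groups(scored))
```

Degenerate groups are reported rather than silently dropped so that the all-fail set can be routed elsewhere (for example to the hint-based approach in §2.15) instead of being resampled at 100 turns per rollout.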
2.4 High-entropy minority tokens — where the exploration budget actually lives
arXiv:2506.01939 (2025-06, NeurIPS 2025). Only ~20% of CoT tokens carry high entropy (the semantic “forking” decision tokens); restricting RLVR gradient to only those matches full-gradient RLVR at 8B and beats it at 32B (+11 AIME25), while training on the low-entropy 80% actively degrades performance. Confidence: Medium-high, math/code domain only. For this agent: in an agentic trace, boilerplate tool-call JSON/restated context is plausibly an even larger low-entropy fraction than in pure CoT — the potential leverage of masking gradient to just the “which tool/branch” decision tokens may be higher here, though unverified for this domain.
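Restricting the gradient to the high-entropy minority amounts to building a per-token mask from entropies. The sketch below assumes per-token entropies are already available from the rollout; the 20% default mirrors the fraction quoted above, and `high_entropy_mask` is an illustrative helper rather than the paper's code.

```python
def high_entropy_mask(token_entropies, keep_frac=0.20):
    """True for tokens whose entropy is in the top keep_frac of the sequence.

    Ties at the threshold are all kept, so slightly more than keep_frac
    of the tokens can be selected.
    """
    if not token_entropies:
        return []
    k = max(1, round(len(token_entropies) * keep_frac))
    threshold = sorted(token_entropies, reverse=True)[k - 1]
    return [h >= threshold for h in token_entropies]

# Ten tokens: two "fork" decisions among low-entropy boilerplate.
entropies = [0.1, 2.0, 0.05, 0.2, 1.5, 0.0, 0.3, 0.1, 0.02, 0.4]
mask = high_entropy_mask(entropies)
print(mask)
```

Multiplying the per-token policy-gradient loss by this mask, in addition to the tool-output mask from §1.8, leaves only sampled decision tokens in the update.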
2.5 Positive-Advantage Reweighting — independent confirmation
arXiv:2511.05993 (Jin, Gao et al., 2025-11-08). Independently re-derives that positive-advantage tokens drive entropy collapse (converging with §2.1 via a different route) and proposes direct reweighting of the loss on those tokens as a simpler alternative to covariance-thresholding. Confidence: Medium — but the cross-group convergence with §2.1 raises confidence in the underlying mechanism.
2.6 RAFT / Reinforce-Rej — the project’s current recipe, validated
arXiv:2504.11343 (Xiong, Yao et al., 2025-04-15). Rejection-sampling SFT (train only on positively-rewarded samples) is competitive with GRPO/PPO; ablation shows GRPO’s real edge over vanilla policy gradient is discarding all-fail groups, not reward normalization. Reinforce-Rej extends this by filtering both all-wrong and all-right groups.
positives = [r for r in [gen(prompt) for _ in range(N)] if verifier(r) == 1]
sft_loss(positives) # this project's current recipe
if not (all_correct(group) or all_incorrect(group)):
    policy_gradient_update(group) # Reinforce-Rej: same degenerate-group filter as DAPO §2.3
Designed to fix: nothing behaviorally — this is a validation, not a fix. It confirms the project’s rejection-sampling-SFT phase is a literature-grounded baseline, not an ad hoc placeholder.
Confidence: High for the ablation (clean controlled comparison). For this agent: the single most directly actionable paper for where the project is right now — and its degenerate-group filter is the same requirement DAPO’s dynamic sampling encodes, independently derived.
2.7 KL-regularization design space
arXiv:2505.17508 (Zhang, Liu et al., 2025-05-23). Systematic study of KL-term design choices (forward vs reverse, applied to reward vs loss vs both, against which reference policy) — the “why” behind ProRL’s reference-resetting (§4.1) and KL-Cov’s token-scoped KL (§2.1). Confidence: Medium — theoretical framing.
2.8 Parameter-space noise — temporally coherent exploration, classic and revived
Classic (arXiv:1706.01905, Plappert, Houthooft et al., 2017; 364 citations, well-established): perturb the policy’s parameters before a rollout instead of the action distribution (temperature/top-p) — produces a temporally-consistent “perturbed persona” for the whole episode rather than incoherent per-token jitter.
2026 revival, PSN-RLVR (arXiv:2602.02555, Bai, Wang et al., 2026-01-30): applies parameter noise to RLVR specifically because standard RLVR has an exploration ceiling that grows more visible at large sampling budgets. Corrects the resulting off-policy mismatch with truncated importance sampling; reports gains that get larger as reasoning length grows (marginal on ~738-token AMC responses, +8.9% pass@256 on ~1978-token AIME responses).
theta_noisy = theta + sigma * noise # perturb before rollout (typically MLP/FFN blocks)
rollout = generate(theta_noisy, prompt)
importance_weight = clip(pi_theta(a|s) / pi_theta_noisy(a|s), max_val) # truncated importance sampling
loss = -importance_weight * advantage * logp_theta(a|s) # update the CLEAN theta
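Designed to fix: patterns 2 and 3. Directly relevant because these episodes run to ~100 turns, far longer than the paper’s single-CoT setting: per-token sampling noise is uncorrelated from one step to the next, so over that horizon it compounds into incoherence, exactly as the paper predicts, whereas a single parameter perturbation stays consistent for the whole episode.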
Confidence: classic — High. PSN-RLVR — Low-medium, single very-recent paper, unreplicated. For this agent: the single technique in this sweep whose stated advantage scales with trajectory length instead of against it — worth a dedicated small pilot.
2.9 Multi-temperature — spend exploration budget where it helps
arXiv:2510.08892 (Zhuang, Zhou et al., 2025-10-10). Classify tokens into high-entropy “reasoning/fork” vs. low-entropy “knowledge/fact” tokens; sample fork tokens at higher temperature, knowledge tokens at lower — don’t want the agent “exploring” whether a CVE number or flag format is correct. Confidence: Medium. For this agent: directly portable — higher temperature at “which tool next” decision points, lower temperature inside verbatim payload/command construction.
2.10 DIVER — reward group-level diversity as an intrinsic bonus
arXiv:2509.26209 (Hu, Zhang et al., 2025-09-30). Rewards global sequence-level diversity across a rollout group (pairwise dissimilarity) using potential-based reward shaping (Ng et al. 1999 invariance) so diversity-seeking doesn’t distort what “correct” means. Reports beating GRPO-w/-clip-higher, entropy-RL, and pass@k training on both pass@1 and pass@k, in- and out-of-domain.
D = pairwise_dissimilarity_matrix(responses) # G x G over a group
diversity_of_i = mean(D[i, :])
r_intrinsic = diversity(state_t) - diversity(state_t_minus_1) # potential-based shaping
reward_total = reward_task + lambda_div * r_intrinsic
Designed to fix: patterns 1 and 2/3 — a direct counter to “87.7% of tool calls bypass the rich tool surface” if “different approach” is defined over which tools/commands were used, not token-level text.
Confidence: Medium-high, single paper. For this agent: the strongest concrete [N]-flavored opportunity surfaced in this whole sweep — reward a rollout group for trying genuinely different tools/approaches against the same challenge, not just for eventually finding the flag.
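Defining "different approach" over tools rather than text can be sketched with a Jaccard distance between the sets of tools each rollout used. Note the simplification: this is a plain per-rollout bonus, not DIVER's potential-based shaping, and the function names and the choice of Jaccard distance are assumptions made for illustration.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over the sets of tools used; 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def group_diversity_bonus(tool_sequences):
    """Mean tool-set distance from each rollout to the rest of its group."""
    n = len(tool_sequences)
    if n < 2:
        return [0.0] * n
    return [
        sum(jaccard_distance(tool_sequences[i], tool_sequences[j])
            for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

# Two rollouts repeat the same tool set; the third tries something else.
group = [["curl", "nmap"], ["nmap", "curl", "curl"], ["sqlmap"]]
print(group_diversity_bonus(group))
```

Per the guardrail at the top of this page, any such bonus belongs in a capped shaping term next to the verified flag reward, never in place of it.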
2.11 CDE — curiosity as cheap perplexity + critic-variance bonus
arXiv:2509.09675 (Dai, Song et al., 2025-09-11, ICLR 2026). Actor-side bonus = perplexity of the model’s own response (high = “surprised,” i.e. exploring); critic-side = variance across a multi-head critic. Reports a calibration-collapse finding as a byproduct — the policy becomes confident regardless of correctness, which the actor-bonus specifically counters.
Designed to fix: pattern 3 (good guessers until they’re not) — overconfident-wrong is the literature-side twin of committing to an ungrounded guess and not recovering.
Confidence: Medium-high, modest empirical gain (+~3pt AIME) but genuinely useful framing. For this agent: the actor-side perplexity bonus is cheap (no extra network, GRPO is critic-free) — a reasonable first experiment before anything heavier.
2.12 MERCI — count-based novelty, with a domain caveat
arXiv:2510.16614 (Zhang, Li et al., 2025-10-18, ICLR 2026 poster). Classical count-based exploration adapted to the autoregressive LLM MDP via a lightweight Coin Flipping Network pseudo-count estimator — cheaper than general count-based bonuses because the token-sequence MDP has known, deterministic transitions. Confidence: Medium — but that deterministic-transition assumption is exactly what a live sandboxed CTF environment violates (server responses/subprocess stdout are stochastic, environment-dependent). For this agent: treat as inspiration for a tool/command-novelty bonus, not a drop-in.
2.13 Representation-based exploration — a negative result worth acting on
arXiv:2510.11686 (Tuyls, Foster et al., 2025-10-13). A diversity bonus from the base LM’s own hidden states, usable at inference time (build a diverse k-of-N pool) or as an RL bonus. Notable negative result: the bonus improves verifier efficiency across sampling strategies except high-temperature sampling — high-temp outputs look “novel” in representation space without being useful. Temperature-driven and representation-driven exploration are not naively composable.
pool = [gen(prompt, temp=1.0) for _ in range(N)]
selected = sorted(pool, key=lambda r: hidden_state_diversity(r, pool), reverse=True)[:k] # diverse k-of-N, not random
Confidence: Medium-high (>50% verifier-efficiency gain reported, single group). For this agent: the actionable finding is at eval time, not training — if the harness uses high temperature to get sample diversity for pass@k>1 runs, this paper says that may be producing noisier repeats of the same strategy, not genuinely different ones. Cheap to ablate, changes nothing about training.
2.14 Pass@k as diagnostic, not objective — and how to fix that if you insist
arXiv:2511.16231 (Yu Yang, 2025-11-20): optimizing pass@k directly is mathematically just a positive reweighting of pass@1, whose gradient vanishes exactly where exploration is most needed (a concentrated policy). Use pass@k as a diagnostic (is the ceiling still rising with more samples?), not a training objective.
arXiv:2505.15201 (Walder & Karkhanis, PKPO, 2025-05-21): if you do want a pass@k-shaped reward anyway, derives an unbiased, low-variance estimator that keeps gradient on harder problems where pass@1 gives near-zero signal but pass@k still has coverage — the same unbiased-estimator family this project already uses at eval time (1 - C(N-c,k)/C(N,k)).
arXiv:2508.10751 (Chen, Qin et al., Pass@k Training, 2025-08-14): a lighter-weight alternative — use pass@k as the reward signal itself to adaptively balance exploration/exploitation.
Designed to fix: pattern 5 (benchmarks measure pattern-match speed, not thoroughness). The diagnostic-not-objective framing above is the formal version of that same critique, applied to the training objective.
For this agent: treat the pass@1-vs-pass@5-vs-pass@10 gap (already the project’s locked methodology) as the diagnostic signal — a shrinking gap while pass@1 stays flat is the entropy-collapse warning from §2.1, not “the model learned the task.”
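The unbiased estimator quoted above, 1 - C(N-c,k)/C(N,k), is short enough to write out directly. The sketch uses only the standard library; `pass_at_k` is an illustrative name, and the sample counts in the usage lines are made up.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c were correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical challenge: 10 rollouts, 1 verified solve.
n, c = 10, 1
for k in (1, 5, 10):
    print(k, pass_at_k(n, c, k))
```

Computing this per challenge at k = 1, 5, 10 and tracking the gap between the curves across checkpoints gives the diagnostic described above without touching the training objective.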
2.15 NuRL — unlocking prompts GRPO currently can’t learn from at all
arXiv:2509.25666 (Chen, Peng et al., 2025-09-30, Salesforce AI Research + UNC). Standard GRPO/RLVR gets zero gradient from any prompt where every rollout in the group fails (the same degeneracy DAPO resamples and RAFT filters). NuRL instead unlocks these: generate a self-conditioned hint (model, given the gold answer, produces its own CoT + hint), inject it for 0%-pass-rate groups, re-roll with the hint — now training on a hint-augmented group with real signal; hint dropped at inference.
group = rollout(prompt, n=G)
if pass_rate(group) == 0.0: # dead for GRPO/DAPO/RAFT alike
    hint = self_generate_hint(prompt, gold_answer)
    group = rollout(prompt + hint, n=G) # re-roll WITH the hint; hint dropped at inference
Designed to fix: pattern 4 and directly the portfolio’s ~100-200-of-1000 solve rate — the currently-unsolved ~800 challenges are plausibly many all-zero-reward-group cases today, exactly NuRL’s target regime.
Confidence: Medium, single paper, six-benchmark/three-model validation. For this agent: needs design work to adapt — the flag itself isn’t the how-to-get-there knowledge, so a CTF-shaped hint needs a walkthrough/verifier-metadata analogue or a previously-successful trajectory for a similar challenge family; doesn’t port zero-shot from math/code.
2.16 Cybersecurity RLVR precedent — the entropy-preservation gap in-domain (academic, context only)
Pentest-R1, HackSynth/Random-Crypto, and a Linux-privesc RLVR paper (arXiv:2603.17673, Normann, Happe et al., 2026-03-18 — SFT-then-RLVR on a 4B model, 95.8% success vs. 97.5% for Claude Opus 4.6 at >100x lower inference cost) are academic domain-specific security-training papers — cited for context only, not a basis for this page’s conclusions. They’re mentioned here purely to note an absence: none report an explicit entropy-preservation mechanism. The project’s actual grounding for the two-stage SFT→RLVR pipeline is DeepSeek-R1’s staged recipe at frontier scale (§3.1, arXiv:2501.12948) and RAFT/Reinforce-Rej’s controlled ablation (§2.6, arXiv:2504.11343), not these academic security papers. The entropy-preservation gap itself is this project’s own [N] opportunity to make, not something pre-solved: no cybersecurity-specific RL paper found combines DAPO/entropy-mechanism/curiosity/parameter-noise with a multi-vuln-class, ~100-turn, tool-rich CTF setting.
3. Reasoning & test-time compute [R]
3.1 The staging lesson from DeepSeek-R1
arXiv:2501.12948 (2025-01). R1-Zero (pure RL, binary rule-based reward, long CoT emerges) then R1’s four-stage fix (cold-start SFT → reasoning RL → rejection-sampling SFT on the RL checkpoint’s own correct trajectories → second RL pass). The middle-to-late stage — rejection-sampling SFT on verifier-passed solves — is literally this project’s chosen path, validated at frontier scale. Gap: R1’s reward is single-turn/terminal on math/code; the sparser, later-arriving terminal signal of a 100-turn CTF episode is the part R1 does not solve (that’s §1 of this page).
3.2 Dr.GRPO — the length-bias trap gets worse with more turns
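arXiv:2503.20783 (2025-03). Vanilla GRPO’s per-response length normalization and group-std normalization bias the update: dividing by response length shrinks the per-token penalty on long wrong answers, so failures drift longer while correct answers are pushed shorter. Fix: drop both normalizations, keep only advantage = reward - mean(rewards). Matches/beats GRPO accuracy at same compute.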
Designed to fix: pattern 3 (good guessers until they’re not) — if the base algorithm rewards length-padding on failure, an agent could learn to “look busy” (redundant tool calls, extra enumeration) without the enumeration being useful — a bigger attack surface for this bug in a 100-turn setting than a single-turn math answer.
Confidence: High. For this agent: use Dr.GRPO’s advantage normalization, not vanilla GRPO’s, if/when graduating to RL — specifically check whether the policy is learning to burn turns on unproductive tool calls after a wrong guess, which is nearly indistinguishable from legitimate enumeration unless checked for.
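The difference between the two advantage rules, and where the length bias comes from, can be shown side by side. This is a sketch: `max_len` stands in for whatever constant the loss is divided by once per-response length normalization is removed, and all function names are illustrative.

```python
from statistics import mean, pstdev

def vanilla_grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage with std normalization."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def dr_grpo_advantages(rewards):
    """Mean-centred advantage only: reward - mean(rewards)."""
    mu = mean(rewards)
    return [r - mu for r in rewards]

def per_token_weight(length, max_len, length_normalize):
    """Weight each token's loss receives under the two aggregation schemes."""
    return 1.0 / length if length_normalize else 1.0 / max_len

print(dr_grpo_advantages([1, 0, 0, 0]))

# With length normalization, a 200-token failure is penalized less per token
# than a 50-token failure; with a constant divisor the two are treated alike.
print(per_token_weight(200, 1000, True), per_token_weight(50, 1000, True))
print(per_token_weight(200, 1000, False), per_token_weight(50, 1000, False))
```

In a 100-turn episode the token counts differ far more between rollouts than in single-turn math, which is why the per-token weighting is the part to check first.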
3.3 LIMO — SFT data quality over quantity
arXiv:2502.03387 (2025-02). 817 carefully-curated SFT examples beat >100k loosely-curated ones on AIME/MATH500 plus strong OOD transfer — SFT works as “cognitive templates” for knowledge the base model already has, not a knowledge source.
For this agent — directly relevant to the immediate next step: when building the rejection-sampling SFT set from the agent’s own verifier-passed runs, prioritize trajectory quality/technique diversity over raw count. A smaller set of clean, full-PTES-phase, well-enumerated solves may generalize better than a larger set of lucky-guess successes — training on lucky-guess trajectories specifically risks teaching the “guess and hope” failure mode LIMO’s framing predicts generalizes poorly (pattern 3).
3.4 Test-time compute is a resource-allocation problem, not a skill problem
arXiv:2408.03314 (Snell et al., 2024-08, 1772+ citations). Difficulty-adaptive test-time compute allocation can match a 14x larger model at fixed budget; uniform allocation is wasteful. Related, arXiv:2502.15631 (o3-mini vs o1-mini, Feb 2025): higher accuracy achieved WITHOUT longer reasoning chains — accuracy generally declines as CoT length grows within a fixed model, even controlling for difficulty.
For this agent: the 100-turn budget IS test-time compute allocation, just framed as episode length. Maps onto pattern 4 (strong at chaining, weak at thorough enumeration) as a resource-allocation problem: the agent should spend more of its turn budget on thorough enumeration for hard/unfamiliar challenge types and less on easy/familiar ones, rather than a flat per-phase turn count. Track turns-per-solve alongside solve rate — more turns is not automatically better.
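Tracking turns-per-solve next to solve rate takes one small report function. The definition used here, total turns spent across all episodes divided by the number of solves, is one reasonable choice rather than an established metric; the episode dictionaries and `turn_budget_report` are assumptions about what the harness can export.

```python
def turn_budget_report(episodes):
    """Summarize how the turn budget converts into solves.

    episodes: list of {"solved": bool, "turns": int}.
    turns_per_solve counts turns burned on failures too, so it rises when
    the agent spends its budget unproductively.
    """
    solved = [e for e in episodes if e["solved"]]
    total_turns = sum(e["turns"] for e in episodes)
    return {
        "solve_rate": len(solved) / len(episodes),
        "mean_turns_solved": (
            sum(e["turns"] for e in solved) / len(solved) if solved else None
        ),
        "turns_per_solve": total_turns / len(solved) if solved else float("inf"),
    }

episodes = [
    {"solved": True, "turns": 20},
    {"solved": False, "turns": 100},
    {"solved": True, "turns": 40},
]
print(turn_budget_report(episodes))
```

Breaking the same report down by challenge family shows whether extra turns are going to hard, unfamiliar categories or being spent evenly everywhere.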
3.5 The contested question — does RL expand or just narrow the reasoning boundary?
Two papers, opposite conclusions, genuinely contested:
- “Does RL Really Incentivize Reasoning Capacity Beyond the Base Model?” (arXiv:2504.13837, 2025-04): under pass@k with large k, RLVR-trained models’ correct paths are all already samplable from the base model — pass@1 improves, pass@256 decreases over training. RLVR-as-typically-run narrows.
- ProRL (arXiv:2505.24864, Liu, Diao et al., 2025-05-30, NeurIPS 2025): with KL control + periodic reference-policy resetting + a diverse task suite, sustained over 2000+ steps, RL-trained models solve problems the base model never solves at any k — genuine boundary expansion, correlating with training duration and base-model competence.
# Pseudocode sketch of the three ProRL ingredients; kl_divergence is a hypothetical helper.
loss += kl_coef * kl_divergence(policy, ref_policy)  # (a) adaptive KL control
if step % reset_interval == 0:
    ref_policy.load_state_dict(policy.state_dict())  # (b) periodic reference reset: the non-obvious piece
# (c) diverse multi-task training suite, not a narrow curriculum
Reconciliation (this page’s synthesis, not either paper’s claim): RLVR as typically run (short, KL-to-frozen-init) narrows/amplifies; RLVR prolonged, KL-controlled, reference-resetting, diverse-task can expand. Training duration + KL management is the resolving variable — treat as a hypothesis to validate with your own entropy instrumentation (§2.1), not settled fact.
Designed to fix: pattern 4 (uneven PTES, weak enumeration) — a reasoning boundary that hasn’t been expanded keeps failing the same class of enumeration-heavy step no matter how much RL polishing it gets.
For this agent: don’t expect boundary expansion from a short RL run; narrowing there is expected behavior matching arXiv:2504.13837, not a bug. Periodic reference-policy resetting is cheap, orthogonal to GRPO/DAPO/GSPO choice, and worth adopting from the start of any RL run.
3.6 Spurious rewards — a methodology warning before trusting any result
arXiv:2506.10947 (2025-06, already 93+ citations). For Qwen2.5-Math specifically, RLVR improves MATH-500 almost as much with completely spurious rewards (random, wrong-label, format-only) as with ground-truth ones: RL surfaces a latent pretrained quirk, not the reward’s information content. The effect does not replicate on Llama3/OLMo2, so treat it as a model-family-dependent finding.
For this agent: strengthens the project’s own confirmed lesson (ground-truth-verified reward, never format-matched) — a completely wrong reward can look like it’s working if the base model has the right latent bias, a scarier version of “format reward causes confabulation.” Sanity-check any future “the reward design worked” claim with a brief spurious-reward ablation on the same base model.
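A spurious-reward ablation only needs interchangeable reward functions; the training run per arm is the expensive part and is not shown. The sketch below is a hedged illustration: the `flag{...}` format, the arm names, and `score_batch` are assumptions for this example, not the project's reward code.

```python
import random
import re

FLAG_PATTERN = re.compile(r"flag\{[^}]+\}")  # hypothetical flag format

def ground_truth_reward(submission, true_flag, rng):
    """Control arm: 1.0 only when the submitted flag is the real one."""
    return 1.0 if submission == true_flag else 0.0

def format_only_reward(submission, true_flag, rng):
    """Spurious arm: rewards anything shaped like a flag, right or wrong."""
    return 1.0 if FLAG_PATTERN.fullmatch(submission) else 0.0

def random_reward(submission, true_flag, rng):
    """Spurious arm: a coin flip that ignores the submission entirely."""
    return 1.0 if rng.random() < 0.5 else 0.0

ABLATION_ARMS = {
    "ground_truth": ground_truth_reward,
    "format_only": format_only_reward,
    "random": random_reward,
}

def score_batch(arm, submissions, true_flag, seed=0):
    """Reward one batch under the named arm; seeded so reruns are comparable."""
    rng = random.Random(seed)
    reward_fn = ABLATION_ARMS[arm]
    return [reward_fn(s, true_flag, rng) for s in submissions]
```

If a short run on the `random` or `format_only` arm improves solve rate nearly as much as `ground_truth` on the same base model, the gain is not coming from the reward's information content.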
3.7 The reward-hacking cluster — what to pre-mortem before scaling RL
Three independent, very recent (2026) papers converging on one warning:
- arXiv:2605.02269 — “Towards Understanding Specification Gaming in Reasoning Models”: RL reasoning training causally increases specification-gaming rate (32–170% across model pairs); test-time mitigations reduce but don’t eliminate it.
- arXiv:2604.15149 — “LLMs Gaming Verifiers”: RLVR-trained models abandon intended generalizable behavior and exploit the gap between a verifier’s extensional (checks-the-output) and intensional (checks-the-process) correctness — the verifier admits false positives, RL finds exactly that gap.
- arXiv:2604.13602 — “Reward Hacking in the Era of Large Models”: the “Proxy Compression Hypothesis” — reward hacking is near-inevitable when optimizing an expressive policy against any compressed proxy of a high-dimensional true objective. Framework/survey, not a novel empirical result.
For this agent, a concrete pre-mortem, not a hypothetical: a binary flag-match check IS an extensional verifier. The mechanism in arXiv:2604.15149 predicts RL will find and exploit any gap between “produces the correct flag string” and “actually exploited the intended vulnerability” (info leak, predictable flag generation, a scoring bug) at a higher rate than the SFT-only regime already run. Audit the flag-verification harness for exactly these extensional gaps before scaling RL, and consider periodic trajectory-level spot-audits (not just flag-match) once RL training starts. All three papers: Medium confidence (very recent, unreplicated), but the engineering precaution is warranted regardless of replication.
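One cheap trajectory-level spot-audit is a flag-provenance check: a passed run whose flag string never appeared in any tool output deserves a human look, since that pattern is consistent with a guessed or predictable flag. This is a heuristic sketch only; the run dictionary keys (`id`, `tool_outputs`, `flag`) and the audit labels are hypothetical, and a flag that did appear in a tool output can still come from an unintended leak.

```python
import random

def flag_provenance(tool_outputs, submitted_flag):
    """Index of the first tool output containing the flag, or None if it never appeared."""
    for i, output in enumerate(tool_outputs):
        if submitted_flag in output:
            return i
    return None

def select_for_audit(passed_runs, sample_rate=0.1, seed=0):
    """Pick verifier-passed runs for manual review.

    Always audits runs whose flag was never observed in a tool output,
    and randomly samples the remainder at sample_rate.
    """
    rng = random.Random(seed)
    audit = []
    for run in passed_runs:
        unobserved = flag_provenance(run["tool_outputs"], run["flag"]) is None
        if unobserved:
            audit.append((run["id"], "flag-never-observed"))
        elif rng.random() < sample_rate:
            audit.append((run["id"], "random-sample"))
    return audit
```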
3.8 Process reward models — context, and why they stay off the table
Lightman et al. (arXiv:2305.20050, 2023, foundational) and its 2025 continuation “Process Reward Models That Think” (arXiv:2504.16828) are the step-level-verification lineage — DeepSeek explicitly rejected PRM for R1 due to step-level reward hacking. For this agent: the natural alternative if credit-assignment sparsity (§1) becomes a bottleneck, but a PRM is itself a learned (not ground-truth) reward — any PRM-style process reward would need its own anti-gaming safeguards on top of §3.7’s warnings, not a naive “good step = positive reward” scheme. Keep the terminal flag verifier as the ungameable primary signal.
4. Novelty / boundary-expansion [N]
The techniques above earn an [N] tag when they have direct evidence of expanding what the policy can do (solving what the base model never solves at any k), not just sharpening what it already sometimes does. Collected here as the load-bearing case for or against the project’s confirmed lesson “GRPO amplifies existing capabilities, SFT replaces them” (arXiv:2507.10616).
4.1 ProRL — the strongest counter-evidence to “amplification only”
Already covered in §3.5 — restated here because it’s the anchor [N] citation: prolonged, KL-controlled, reference-resetting RL demonstrably expands the reasoning boundary. The qualifier that makes this actionable: boundary expansion correlates with training duration and base-model competence — a short GRPO polish pass should be expected to behave like the narrowing result (§3.5), not ProRL’s.
4.2 ToRL — RL-from-scratch surfaces qualitatively new tool-use strategies
arXiv:2503.23383 (2025-03-30, math domain). No SFT-on-tool-traces warmup at all; pure RL discovers when and how to invoke tools and reports emergent invocation strategies absent from the SFT-only baseline, outperforming the best tool-integrated-reasoning model on AIME’24 by double digits. This is the strongest citation in this whole sweep for [N]: the explicit contrast “RL discovers emergent patterns vs. SFT imitates them” pushes past mere amplification — RL-from-scratch surfaced qualitatively new patterns. Domain gap: math tool-use (Python), not offensive-security tool-use — suggestive, not proven, for CTF.
4.3 Absolute Zero — self-play with zero external data
arXiv:2505.03335 (2025-05). A model proposes its own tasks (validated by a code executor, rewarded by a “learnability” signal peaking at the frontier of current competence — an automatic curriculum) and solves them, beating models trained on tens of thousands of curated examples with zero external labeled data.
For this agent — speculative but worth flagging cross-seat: architecturally similar to a self-generated CTF-challenge curriculum, IF the flag-verifier concept generalizes from “check code output” to “check flag capture in a sandbox.” Math/code domain only — flag to challenge-builder/main as a longer-term idea for auto-scaling challenge difficulty to agent capability rather than a fixed static portfolio, not a near-term recipe.
4.4 NuRL, DIVER, MERCI — engineered novelty-seeking
Already covered in §2.15, §2.10, §2.12 — grouped here as the [N]-tagged mechanisms that explicitly target “escape local routines to discover better solutions” (MERCI’s phrase) rather than sharpen an existing one: NuRL raises the model’s upper bound on prompts it currently cannot solve at all; DIVER rewards genuinely-different group-level strategies; MERCI’s novelty bonus (with the deterministic-MDP caveat) targets repetitive, suboptimal reasoning patterns directly.
4.5 Parameter-space noise (PSN-RLVR) — explicitly framed as boundary-expanding
Already covered in §2.8. Framed by its authors as “expanding the effective reasoning capability boundary,” with gains growing as reasoning length grows — the property most aligned with this project’s long-horizon axis of any [N]-tagged technique in this sweep.
4.6 Kimi K2 and MUA-RL — agentic capability as a distinct training target
Kimi K2 (arXiv:2507.20534, 2025-07): frontier labs increasingly treat agentic/tool-use capability as requiring dedicated training investment, not an emergent side-effect of reasoning-RL — validates not expecting math/code RLVR gains to automatically transfer to 100-turn agentic CTF competence. MUA-RL (arXiv:2508.18669, 2025-08): trains against a dynamic, LLM-simulated counterpart in the RL loop instead of a static script, generalizing better per-parameter on multi-turn tool-use benchmarks. For this agent: the project’s live sandboxed CTF environment already satisfies MUA-RL’s “train against the real, dynamic, reactive counterpart” principle — validating evidence, not a design change.
Decision flow — which lever to pull first
flowchart TD
A["Rejection-sampling SFT plateaus"] --> B{"Entropy instrumented\nfrom step 0?"}
B -- "not yet" --> B0["Add entropy logging NOW\n(§2.1) — do this regardless"]
B0 --> C
B -- "yes" --> C{"GRPO baseline in\n30-60% band?"}
C -- "no, too low/high" --> C0["Curriculum-filter challenges\nto the 30-60% band first"]
C0 --> D
C -- "yes" --> D["Start RL with DAPO recipe\n(clip-higher + dynamic sampling, §2.3)\nnot vanilla GRPO"]
D --> E{"Entropy still\ncollapsing?"}
E -- "yes" --> F["Add Clip-Cov / KL-Cov\n(§2.1) — one-line verl flag"]
E -- "no" --> G
F --> G{"Credit smeared across\nall 100 turns equally?"}
G -- "yes" --> H["GiGPO step-groups (§1.1)\nor turn-level reward shaping (§1.4)"]
G -- "no" --> I
H --> I{"~800 challenges still\nall-zero-reward?"}
I -- "yes" --> J["NuRL-style hints (§2.15)\nor more rejection-sampling data"]
I -- "no" --> K
J --> K{"Tool avoidance persists\n(pattern 1)?"}
K -- "yes" --> L["DIVER / MERCI-style\ntool-sequence diversity bonus (§2.10, §2.12)"]
K -- "no" --> M["Budget real training duration +\nreference-policy resets for boundary\nexpansion (ProRL, §3.5/§4.1)"]
L --> M
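The same flow can be read as a cascade rather than a tree: every "yes" adds one lever, and the DAPO start and the final duration budget are always on the path. A minimal sketch of that reading follows; the function name and boolean arguments are illustrative, and each boolean stands in for a measurement the harness would have to supply.

```python
def next_levers(entropy_logged, baseline_in_band, entropy_collapsing,
                credit_smeared, unsolved_all_zero, tool_avoidance):
    """Ordered list of levers implied by the decision flow above."""
    steps = []
    if not entropy_logged:
        steps.append("add entropy logging (§2.1)")
    if not baseline_in_band:
        steps.append("curriculum-filter challenges to the 30-60% band")
    # Reached on every path through the flow.
    steps.append("start RL with DAPO recipe (§2.3)")
    if entropy_collapsing:
        steps.append("add Clip-Cov / KL-Cov (§2.1)")
    if credit_smeared:
        steps.append("GiGPO step-groups (§1.1) or turn-level reward shaping (§1.4)")
    if unsolved_all_zero:
        steps.append("NuRL-style hints (§2.15) or more rejection-sampling data")
    if tool_avoidance:
        steps.append("tool-sequence diversity bonus (§2.10, §2.12)")
    # Also reached on every path.
    steps.append("budget training duration + reference resets (§3.5/§4.1)")
    return steps
```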
Ranked shortlist — what to reach for FIRST
Given the diagnosis (execution gap, not knowledge gap), the chosen path (rejection-sampling SFT → GRPO/RLVR at entropy collapse), the five BSides patterns, and the binary terminal verified-flag reward:
- Instrument entropy from step 0 of any future RL run (§2.1). Free, no design decisions, do this before anything else — you cannot diagnose a collapse you didn’t measure.
- Start the RL stage with DAPO’s four fixes as the baseline recipe (§2.3), not vanilla GRPO — clip-higher and dynamic sampling are exactly the entropy/degenerate-group guarantees the project’s own “30–60% baseline” rule is implicitly reaching for, made explicit. Reinforce-Rej (§2.6) independently validates the same direction.
- If entropy still collapses under DAPO, add KL-Cov/Clip-Cov as a one-line verl loss-mode flag (§2.1) before building anything custom.
- Adopt GiGPO’s step-level credit (§1.1) as the first turn-level fix: zero extra rollouts, zero new critic, just a state-canonicalization function over the harness’s existing tool-call spans. Reach for the turn-PPO/TL-GRPO cluster (§1.4) only if GiGPO’s assumptions (hashable state recurrence) don’t hold in practice.
- Separate the ~800 never-solved challenges from the 30–60%-band ones and treat them as NuRL-territory (§2.15) — a distinct problem requiring hints or more SFT data before they’re GRPO-ready at all, not more training on the same recipe.
- Pilot parameter-space noise (§2.8) as the one technique whose stated advantage scales with the 100-turn horizon rather than against it — small-scale, given PSN-RLVR is unproven outside math.
- Build a tool-call-sequence-level diversity bonus (DIVER §2.10 / MERCI §2.12, adapted): the strongest concrete [N] opportunity surfaced anywhere in this sweep, and the most direct counter to pattern 1 (87.7% tool bypass) that isn’t a prompting fix.
- Do not optimize pass@k directly as a training reward (§2.14) without the PKPO estimator; use the pass@1/pass@k gap as a diagnostic, and ablate whether pass@5/pass@10 sampling diversity is real or just noisier repeats (§2.13). That ablation is cheap, changes nothing about training, and tells you whether the eval methodology measures what you think.
- Audit the flag verifier for extensional gaps before scaling RL (§3.7) — a pre-mortem, not a reaction; RL is empirically expected to find gaming opportunities at a higher rate than the SFT-only regime already run.
- Budget real training duration + periodic reference-policy resets (ProRL, §3.5/§4.1) once past the initial GRPO baseline — boundary expansion (genuinely new attack paths, not just more reliable execution of known ones) is a function of how long and how KL-controlled the run is, not available from a short polish pass.
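The first item on this list, entropy instrumentation, is small enough to sketch in full. The version below is framework-free and assumes you can obtain the next-token probability distribution at each generated position; the `floor_ratio` collapse threshold is an illustrative assumption, not a value from the cited papers.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_policy_entropy(per_token_probs):
    """Average entropy over the generated positions of one rollout."""
    return sum(token_entropy(p) for p in per_token_probs) / len(per_token_probs)

def entropy_collapsed(history, floor_ratio=0.2):
    """history: mean entropy per logged step, step 0 first.

    Flags collapse when the latest value falls below floor_ratio of the
    step-0 value. The threshold is an illustrative assumption.
    """
    return len(history) > 1 and history[-1] < floor_ratio * history[0]
```

Logging `mean_policy_entropy` from step 0 is what makes the later "entropy still collapsing?" checks answerable at all.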
Summary table
| Technique | arXiv (verified) | [L] | [E] | [R] | [N] | BSides pattern | Confidence | One-line takeaway |
|---|---|---|---|---|---|---|---|---|
| GRPO (baseline) | 2402.03300 | ✓ | — | High | Group-mean baseline, no critic — trajectory-level only | |||
| GAE (baseline) | 1506.02438 | ✓ | — | High | Classical multi-step advantage, single time-scale | |||
| GiGPO | 2505.10978 | ✓ | ✓ | #4 | High | Step-level advantage via state-hash groups, zero extra rollouts | ||
| ArCHer | 2402.19446 | ✓ | — | High | Turn-level off-policy critic — heavier infra than needed | |||
| RAGEN / StarPO | 2504.20073 | ✓ | ✓ | #2 | High | Names and diagnoses the “Echo Trap” entropy collapse | ||
| Turn-level reward cluster | 2505.11821 +5 | ✓ | ✓ | #4 | Promising (cluster) | Turn is the unit of advantage — cross-group consensus | ||
| GSPO | 2507.18071 | ✓ | ✓ | ✓ | — | Med-High | Sequence-level clip; stability grows more relevant with length | |
| Verlog | OpenReview only | ✓ | ✓ | — | Promising | Dual-discount GAE + memory-window + early truncation; 400+ turns | ||
| Demystifying long-horizon RL | 2603.21972 | ✓ | ✓ | — | Promising | 5-axis empirical recipe: reward/scale/data/algo/env | ||
| PEARL | 2601.20439 | ✓ | ✓ | #2 | Promising | RL over the planning/tool-sequencing step itself | ||
| ReTool / ToRL / Search-R1 | 2504.11536 / 2503.23383 / 2503.09516 | ✓ | ✓ | ✓ | ToRL:✓ | #1 | High/Verified | Tool-output tokens masked from PG loss; ToRL = emergent tool strategies |
| CTF-Dojo / Pentest-R1 / HackSynth (academic, context only) | 2508.18370 / 2508.07382 / 2506.02048 | — | Context only | NOT a basis for this page’s claims — see §1.9 re-grounding on DeepSeek-R1/RAFT/GiGPO/Verlog/PSN-RLVR | ||||
| Entropy Mechanism / Clip-Cov / KL-Cov | 2505.22617 | ✓ | ✓ | #2 #3 | High | Fitted entropy-vs-performance law; surgical per-token fix | ||
| Clip-Low/Clip-High asymmetry | 2509.26114 | ✓ | ✓ | #2 #3 | Med-High | Two independent clip knobs, not one symmetric one | ||
| DAPO | 2503.14476 | ✓ | ✓ | ✓ | #2 #3 #4 | High | Clip-higher + dynamic sampling — the RL baseline recipe | |
| High-entropy minority tokens | 2506.01939 | ✓ | ✓ | ✓ | #3 | Med-High | Only ~20% of tokens carry the exploration-relevant decisions | |
| Positive-Advantage Reweighting | 2511.05993 | ✓ | ✓ | #2 #3 | Medium | Independent confirmation of the entropy-collapse mechanism | ||
| RAFT / Reinforce-Rej | 2504.11343 | ✓ | ✓ | — | High | Validates this project’s own rejection-sampling-SFT phase | ||
| Parameter-space noise / PSN-RLVR | 1706.01905 / 2602.02555 | ✓ | ✓ | ✓ | #2 #3 | High / Low-med | Temporally-coherent exploration; gains grow with length | |
| Multi-temperature | 2510.08892 | ✓ | ✓ | #2 | Medium | High temp on fork tokens, low temp on payload/syntax tokens | ||
| DIVER | 2509.26209 | ✓ | ✓ | ✓ | #1 #2 #3 | Med-High | Reward group-level diversity, potential-based shaping | |
| CDE | 2509.09675 | ✓ | ✓ | #3 | Med-High | Perplexity + critic-variance curiosity bonus; calibration fix | ||
| MERCI | 2510.16614 | ✓ | ✓ | #1 #2 | Medium | Count-based novelty; deterministic-MDP assumption breaks for live envs | ||
| Representation-based exploration | 2510.11686 | ✓ | ✓ | #3 | Med-High | Negative result: high temp and rep-diversity fight each other | ||
| Pass@k diagnostic / PKPO / Pass@k Training | 2511.16231 / 2505.15201 / 2508.10751 | ✓ | ✓ | ✓ | #5 | High/Med/Med | Pass@k’s gradient vanishes as policy concentrates unless corrected | |
| NuRL | 2509.25666 | ✓ | ✓ | ✓ | #4 | Medium | Self-generated hints unlock currently-0%-pass-rate prompts | |
| DeepSeek-R1 | 2501.12948 | ✓ | ✓ | ✓ | #2 | High | The staging recipe this project’s own path mirrors | |
| Dr.GRPO | 2503.20783 | ✓ | ✓ | #3 | High | Removes GRPO’s length-bias reward artifact | ||
| LIMO | 2502.03387 | ✓ | ✓ | #3 | Med-High | SFT set quality/technique-diversity over raw count | ||
| Test-time compute scaling | 2408.03314 | ✓ | ✓ | #4 | High | Turn budget is a resource-allocation problem, not a skill one | ||
| ProRL | 2505.24864 | ✓ | ✓ | ✓ | #4 | Med-High | Prolonged + KL-control + reference-reset genuinely expands boundary | |
| Does RL Really Incentivize… | 2504.13837 | ✓ | ✓ | ✓(neg) | — | Medium | CONTESTED vs. ProRL — short-run RLVR narrows, doesn’t expand | |
| Spurious Rewards | 2506.10947 | ✓ | ✓(neg) | — | High (scope-ltd) | A wrong reward can still “work” — validate across model families | ||
| Reward-hacking cluster | 2605.02269 / 2604.15149 / 2604.13602 | ✓ | ✓(neg) | — | Medium | RL causally increases spec-gaming; audit the verifier’s extensional gaps | ||
| Absolute Zero | 2505.03335 | ✓ | ✓ | ✓ | — | Medium | Self-play, zero external data — speculative curriculum idea | |
| Kimi K2 / MUA-RL | 2507.20534 / 2508.18669 | ✓ | ✓ | ✓ | — | Med/Med | Agentic capability is a distinct training target, not RLVR side-effect |
Cross-links
- Baseline algorithms this page assumes: Reinforcement — PPO · GRPO · RLVR.
- Training-loop shape (why the rollout is an episode of tool-use, not one generation): Agentic & multi-turn RL.
- The rejection-sampling-SFT phase this page’s §2.6/§3.3 grounds: Imitation — SFT · distillation · rejection sampling.
- The “does RL create capability” fight, gameability ladder, and RFT terminology trap: Contested edges & landmines.
PEFT is orthogonal — LoRA · QLoRA · DoRA
Common confusion worth killing outright: PEFT is not a fine-tuning method — it’s a mechanism for applying one. Any of SFT / DPO / GRPO / RLVR can be delivered full-parameter or via a PEFT adapter. It changes which parameters get gradients and how much memory you burn, not what signal you learn from.
The methods
- LoRA: freeze W, train a low-rank update ΔW = B·A (rank r), so y = Wx + (α/r)·(BA)x. Only A, B get gradients (Hu et al., arXiv:2106.09685).
- QLoRA: quantize the frozen base to 4-bit NF4, keep adapters in BF16; lets a large base fit a small GPU (Dettmers et al., arXiv:2305.14314).
- DoRA: decompose the update into magnitude + direction for a bit more accuracy at the same budget (arXiv:2402.09353).
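To make the LoRA arithmetic concrete, here is a minimal pure-Python sketch of the adapted forward pass and the trainable-parameter count. The helper names (`matvec`, `lora_forward`, `trainable_params`) are illustrative and not from any library; real implementations work on tensors, but the algebra is the same.

```python
def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """y = Wx + (alpha / r) * B(Ax).

    W (d x k) is frozen; only A (r x k) and B (d x r) would receive gradients.
    """
    r = len(A)  # rank = number of rows of A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def trainable_params(d, k, r):
    """(full fine-tune, LoRA) parameter counts for one d x k weight matrix."""
    return d * k, r * (d + k)
```

With B initialized to zeros, as in the LoRA paper, the adapter starts as an exact no-op on the base model. For a 4096 x 4096 matrix at rank 16, `trainable_params` gives 16,777,216 full-parameter weights against 131,072 adapter weights.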
Two engineering facts that matter
- It’s a knob on top of a method. “Should I do LoRA or GRPO?” is a category error — you do GRPO, via LoRA. On-policy distillation reproductions run rank-128 LoRA; OpenAI/Fireworks/Google Vertex customer-RFT products are LoRA-first — LoRA lives in the applied/enterprise fine-tuning layer. The frontier labs post-train their own flagship checkpoints full-parameter (Llama/Qwen/DeepSeek/GPT/Claude/Gemini reports; verified 2026 pass).
- LoRA reduces forgetting (low-rank update can’t move W far → less overwrite of pretrained knowledge — relevant to the small-model overwrite problem in Imitation), but it is startlingly learning-rate-sensitive: the 2026 unified LoRA-variant study finds LoRA responds far more to LR than to which variant you pick, and a well-tuned vanilla LoRA matches or beats most fancy variants (arXiv:2601.22708). Tune LR before you tune adapter architecture.
Practical default for your scale
At model sizes up to roughly 9–16B, LoRA/QLoRA is the sane default for iteration cost; go full-parameter only when you have a concrete reason. Measured OOM headroom aside, the project’s stance is LoRA-by-default, with full fine-tuning as a deliberate escalation (see lessons/post-training/ in shared memory). It composes with every method chapter here.
Diagnosing the gap — a scientific framework
The question this chapter answers: is there an industry-standard, peer-defensible way to prove a failure is a KNOWLEDGE gap, not an EXECUTION gap, not an EXPLORATION gap? Short answer, upfront: no single accepted instrument exists. There is no ISO-9001 for capability diagnosis. What exists is a converging set of measurement techniques, each independently validated, that — combined into one protocol — give you a defensible, falsifiable, split verdict. That protocol is what this chapter hands you.
Every arXiv id below was verified live against arxiv.org/abs/<id> on 2026-07-02 (project research pass, artifacts/overnight-rl-sweep/research/diagnosis.md). Confidence tags follow that pass: [HIGH] peer-reviewed/heavily reproduced, [MED] coherent preprint not yet contested, [LOW] single small-N preprint.
Bottom line up front
For your ~1000-challenge, ~100-turn, ground-truth-flag-verified CTF portfolio at ~10–20% k=1 solve rate, the honest, defensible answer will not be a single sentence. It will be: X% of the currently-failing challenges are a knowledge gap, Y% are an execution/performance-floor gap fixable by elicitation, Z% are an exploration gap that needs on-policy RL, not more demonstrations — and here is the measurement that sorted each challenge into its bucket. That heterogeneous, per-challenge-subtype verdict is itself the scientifically credible output — collapsing it into “it’s execution not knowledge” is exactly the move a skeptical reviewer will catch you on.
1. Three gap types, defined precisely
Ground the vocabulary in the 60-year-old linguistics/cognitive-science split this whole ML debate re-derives without citing: competence (what the system can in principle produce) vs. performance (what it actually produces under real constraints — prompting, memory, self-verification, time) — Firestone, “Performance vs. competence in human–machine comparisons,” PMC7604508 [HIGH], and its LLM-era instance splitting formal competence (linguistic surface mastery) from functional competence (using it in the world) — Mahowald et al., arXiv:2301.06627 [HIGH].
| Gap type | Competence/performance framing | Operational test | Fix lever |
|---|---|---|---|
| Knowledge | Competence ceiling — genuinely absent | Correct action never appears in any of N samples, at any N, on any checkpoint | Inject off-policy: SFT on demonstrations, a stronger teacher, or a tool (knowledge-in-tools rule) |
| Execution | Performance floor — competence present, elicitation fails | Correct action appears at moderate–large N, but pass@1 doesn’t convert it; prompting or a few SFT demos recover it | Cheap: better scaffolding/prompting, or light SFT elicitation |
| Exploration | Coverage present before training, destroyed during training | Correct action was recoverable at large N pre-RL; SFT-matched-data actually regresses it; on-policy RL (not demonstrations) is what recovers/expands it | On-policy RL with explicit entropy/diversity preservation, not more SFT |
The exploration gap is the one that’s easy to misdiagnose as a knowledge gap if you only look at a single snapshot: it’s a process failure (the training loop killing coverage that existed a step ago), not a static property of the base model. Section 4 below is the test that tells these two apart.
2. The core instrument: pass@k → Cover@τ → Pass@(k,T)
2.1 pass@k as a coverage probe [E]
The unbiased pass@k estimator (the one this project already uses at eval time, per decisions/2026-06-11-ctf-benchmark-pass-count.md):
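from math import comb  # exact binomial coefficients

def pass_at_k(n, c, k):
    """n = samples generated, c = number correct, k = budget."""
    # Fewer than k incorrect samples: every size-k draw contains a correct one.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are incorrect).
    return 1.0 - comb(n - c, k) / comb(n, k)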
Sampling k completions per problem at large k and plotting the curve is, structurally, a coverage measurement — the probability mass the policy places on any correct completion. The theory for why this works: the Coverage Principle — cross-entropy loss is dominated by tokens irrelevant to correctness, but coverage (mass on high-quality responses) is necessary and sufficient for post-training / test-time scaling to succeed, and estimates faster than loss does. Chen et al., arXiv:2510.15020 [MED]. [E]
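A short sketch of turning per-challenge sample counts into the coverage curve described above. It repeats the estimator so the block runs on its own; `pass_at_k_curve` and the default `ks` grid are illustrative choices, and challenges with fewer than k samples are skipped for that k because the estimator is undefined there.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: n samples generated, c correct, budget k."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_at_k_curve(results, ks=(1, 4, 16, 64)):
    """results: list of (n, c) per challenge. Returns {k: mean pass@k}."""
    curve = {}
    for k in ks:
        usable = [(n, c) for n, c in results if n >= k]
        if usable:  # skip budgets no challenge has enough samples for
            curve[k] = sum(pass_at_k(n, c, k) for n, c in usable) / len(usable)
    return curve
```

For example, two challenges with four samples each, one of them solved once and the other never, give a mean pass@1 of 0.125 and a mean pass@4 of 0.5.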
2.2 The crossover test — does RL amplify or replace? [E][N]
The core instrument. Run the base model and your trained checkpoint through the same challenge set at k = {1, 4, 16, 64, 256…}. Plot both pass@k curves.
- Base catches up to or exceeds trained pass@k at large k → the training only reweighted an existing distribution (elicitation, not new capability). Yue et al., “Does RL Really Incentivize Reasoning Capacity Beyond the Base Model?,” arXiv:2504.13837 [HIGH] — the founding result. 6 RLVR algorithms tested; all “remain far from optimal in leveraging the base model’s potential.”
- Trained pass@k pulls away and widens as k grows → real capability expansion.
Designed to fix: pattern 5 (benchmarks measure pattern-match speed, not thoroughness). Yue et al.’s own prescription is exactly “report pass@k at large k, with base model as control” — not just pass@1. Report base-model pass@k at the same k as a mandatory control column in every benchmark table you publish internally.
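Once both curves exist, the two outcomes above can be checked mechanically. This is a rough sketch: the function names are illustrative, the curves are plain `{k: pass@k}` dictionaries, and a real analysis would also want confidence intervals before calling a crossover.

```python
def crossover_k(base_curve, trained_curve):
    """Smallest shared k at which the base model matches or exceeds the trained one, else None."""
    for k in sorted(set(base_curve) & set(trained_curve)):
        if base_curve[k] >= trained_curve[k]:
            return k
    return None

def gap_widens(base_curve, trained_curve):
    """True when trained-minus-base is positive and never shrinks as k grows."""
    ks = sorted(set(base_curve) & set(trained_curve))
    gaps = [trained_curve[k] - base_curve[k] for k in ks]
    return (len(gaps) >= 2
            and gaps[0] > 0
            and all(later >= earlier for earlier, later in zip(gaps, gaps[1:]))
            and gaps[-1] > gaps[0])
```

A non-`None` `crossover_k` is the reweighting signature; `gap_widens` returning `True` is the expansion signature.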
Contested rebuttal (CoT-Pass@K): pass@k credits a correct final answer even from a wrong chain-of-thought (a lucky guess). Require the reasoning path itself to be correct and the crossover disappears: RLVR shows monotonic gains at every k. Wen et al., arXiv:2506.14245 [MED]. State this as contested when presenting, because if it holds it undercuts the load-bearing assumption behind §2.2’s headline result. For your agent, this maps onto a real risk in your own SFT curation: a verifier-passed trajectory can still contain wrong/wasted turns before the winning one (BSides: ~half of solves hinge on an ungrounded guess at a critical step). Filter on trajectory soundness (backtracking, wasted turns, tool-call validity), not just flag==1, or you reproduce the exact confound this paper diagnoses.
2.3 Cover@τ — punish guessing, reward reliability [E]
pass@k at huge k conflates “genuinely solvable” with “eventually guessable by brute force.” Cover@τ(q) = 1 if ≥ τ·n of n samples on problem q are correct — a reliability threshold, not a “did any sample land” threshold. Dragoi et al., arXiv:2510.08325 [MED]. Relative RLVR-algorithm rankings change under Cover@τ vs pass@1 — some algorithms that look best on pass@1 are worse at genuine reliability.
Designed to fix: pattern 5 and complements pattern 3 (good guessers until they’re not). A challenge with high pass@64 but near-zero Cover@0.3 is guessing-dominated — rejection-sampling SFT on its lucky wins teaches the model to guess more confidently, not more competently. Report Cover@τ (τ≈0.3) alongside pass@k as a second axis in every benchmark table.
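Cover@τ as defined above is a one-line threshold, and pairing it with pass@k gives a simple "guessing-dominated" flag per challenge. In this sketch `guessing_dominated` and its `pass_floor` default are illustrative assumptions; only the Cover@τ definition itself comes from the text.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: n samples generated, c correct, budget k."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def cover_at_tau(n, c, tau):
    """1 if at least tau * n of the n samples are correct, else 0."""
    return 1 if c >= tau * n else 0

def guessing_dominated(n, c, k=64, tau=0.3, pass_floor=0.9):
    """High pass@k but zero Cover@tau: solvable by luck, not reliably."""
    return pass_at_k(n, c, k) >= pass_floor and cover_at_tau(n, c, tau) == 0
```

A challenge with 2 correct samples out of 64 has pass@64 of 1.0 but Cover@0.3 of 0, so it is flagged; one with 30 correct out of 64 is not.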
2.4 Pass@(k,T) — the agentic extension, and the single most load-bearing citation in this chapter [L][E][N]
Everything above is validated on static, single-shot reasoning (math). Your agent is T-round tool interaction. Zhai et al., “Does RL Expand the Capability Boundary of LLM Agents? A Pass@(k,T) Analysis,” arXiv:2604.14877 [MED], asks the crossover question with interaction-depth as a second axis:
PASS@(k,T)(q,π) = 1 - C(n - c_T, k) / C(n, k)
— identical to standard pass@k, except c_T counts correct at interaction depth T, not overall.
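A minimal sketch of the estimator, under one stated assumption: "correct at interaction depth T" is read here as "solved using at most T interaction rounds". If the paper's definition differs, only the `c_T` line changes. The `(solved, turns_used)` record shape is hypothetical.

```python
from math import comb

def pass_at_k_T(samples, k, T):
    """Pass@(k,T) for one challenge under one policy.

    samples: list of (solved, turns_used) rollouts, with k <= len(samples).
    c_T counts rollouts that solved within T rounds (interpretation assumed).
    """
    n = len(samples)
    c_T = sum(1 for solved, turns in samples if solved and turns <= T)
    if n - c_T < k:
        return 1.0
    return 1.0 - comb(n - c_T, k) / comb(n, k)
```

Sweeping T at fixed k shows how much of the coverage depends on deep interaction: with four rollouts of which one solves in 10 turns and one in 50, pass@(1,20) is 0.25 while pass@(1,100) is 0.5.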
Their finding flips §2.2 on compositional tasks: on Category C (compositional, sequentially-gated information gathering — structurally identical to enumerate-then-chain vuln discovery), the RL curve pulls above and widens against the base curve as k grows — the opposite of the static-reasoning crossover. On independent-retrieval tasks the effect is small; on pure static reasoning (no tool, negative control) RL is inert, replicating Yue et al.
The critical additional result: matched-data SFT actually regresses the capability boundary on the same compositional tasks (net −4 vs RL’s net +4). This isolates self-directed exploration during RL — not data exposure — as the causal factor for expansion.
Designed to fix: pattern 4 (uneven PTES phases — strong at chaining inside an exploit, weak at enumeration, 62% of failures stall in exploitation). Category C (“sequential retrieval”) is structurally identical to “enumerate-correctly-at-turn-5-before-turn-40’s-exploit-becomes-visible.” This is the paper that tells you where your exploration gap lives on the portfolio.
The single highest-value experimental design in this chapter: segment your 1000 challenges by whether the winning path is (a) single-shot / not sequentially gated (their Cat A/B analog), or (b) genuinely compositional/sequentially-gated (Cat C analog — your “enumeration-gates-exploitation” weak spot). Run Pass@(k,T) on both segments, before and after rejection-sampling SFT. The falsifiable prediction: on (a), SFT works fine, further RL may plateau (§2.2’s static result holds — an execution gap, cheaply closed). On (b), SFT alone regresses capability and you need on-policy RL specifically — an exploration gap, not an execution gap, and rejection-sampling SFT is the wrong tool for it. This is directly testable this week and determines whether “SFT now, GRPO later” has the ordering right for the compositional subset, or whether it needs RL first.
3. Does base/SFT pass@k predict RL gains, before you spend the compute?
High SFT-stage scores are not reliably predictive of eventual RL performance — sometimes inversely so. What does predict post-RL pass@1: generalization loss on held-out examples and pass@large-k on the post-SFT checkpoint, with up to 2× better R²/Spearman correlation than post-SFT pass@1 alone. Kang et al. (Meta FAIR + Virginia Tech), “Quagmires in SFT-RL Post-Training,” arXiv:2510.01624 [HIGH] — >1M GPU-hours, hundreds of models to 12B, 7 math benchmarks, up to 256 repetitions.
Training-loop delta: add a cheap diagnostic gate between SFT and RL. Before launching a GRPO run, compute pass@64 (or larger) on your rejection-sampling-SFT checkpoint, cold-start, pdq --fresh-retries, on the held-out challenge set. If it’s flat/low, don’t trust the SFT accuracy number as a green light — this predicts a disappointing GRPO run regardless of how good SFT looked.
What changes in your graduation criterion: the project’s stated trigger is “graduate to GRPO/RLVR when policy entropy collapses.” Add a second, independent gate: AND pass@64 on held-out challenges is non-trivial. Entropy collapse tells you SFT has converged; pass@64 tells you there’s still coverage headroom worth converting. Both are needed — collapsed entropy with flat pass@64 means you’ve converged onto a policy with nothing left to reinforce.
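The two-gate criterion can be written as a small check. This is a sketch under assumptions: `collapse_ratio` and `min_coverage` are illustrative thresholds to be calibrated on your own runs, and the entropy history is assumed to be logged from step 0.

```python
def ready_for_rl(entropy_history, heldout_pass_at_64,
                 collapse_ratio=0.2, min_coverage=0.05):
    """Returns (go, reason). Both thresholds are illustrative assumptions."""
    collapsed = (len(entropy_history) > 1
                 and entropy_history[-1] < collapse_ratio * entropy_history[0])
    has_headroom = heldout_pass_at_64 >= min_coverage
    if not collapsed:
        return False, "SFT has not converged yet: entropy has not collapsed"
    if not has_headroom:
        return False, "converged with flat pass@64: nothing left to reinforce"
    return True, "entropy collapsed and coverage headroom remains"
```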
4. The routing test — does the correct action ever appear at high N?
This is the book’s existing one-line diagnostic (see The decision), and it deserves the fuller justification here because it is doing real theoretical work, not just intuition:
- Never, at any N, on any checkpoint → knowledge gap. Standard RLVR “amplifies existing capabilities, SFT replaces them” (arXiv:2507.10616, confirmed project lesson) — you cannot cheaply RL your way to a distribution that places zero mass on the answer. Inject off-policy (demonstration, teacher, or — cheaper — put the missing fact in a tool, not the weights).
- Sometimes, and pass@(k,T) shows RL (not matched SFT) expanding it (§2.4) → exploration gap. Coverage exists but training-time entropy collapse is what’s suppressing it turn-to-turn; the fix is on-policy RL with explicit exploration preservation, not more demonstrations.
- Sometimes, and a few-shot prompt or a handful of SFT demos recover it (§5) → execution gap / performance floor. Cheap fix, no RL infrastructure needed.
This routing test is directly the mechanism §2.2–§2.4 give you a rigorous way to run, not just eyeball. The crossover is the “does it ever appear at high N” test, made continuous and plottable rather than binary.
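The three routes can be sketched as a per-challenge classifier whose aggregate is the split verdict. The function names and boolean inputs are illustrative; each boolean stands for a measurement from §2 or §5. One assumption is made explicit: when both cheap elicitation and RL would recover the behavior, the cheaper explanation (execution) wins.

```python
from collections import Counter

def route_gap(appears_at_high_n, cheap_elicitation_recovers, rl_expands_over_sft):
    """Classify one challenge into a gap type from three measurements."""
    if not appears_at_high_n:
        return "knowledge"
    if cheap_elicitation_recovers:  # assumption: the cheapest explanation wins ties
        return "execution"
    if rl_expands_over_sft:
        return "exploration"
    return "undetermined"

def portfolio_verdict(measurements):
    """measurements: {challenge_id: (appears, cheap_recovers, rl_expands)}.

    Returns {gap_type: percent of challenges}.
    """
    counts = Counter(route_gap(*m) for m in measurements.values())
    total = sum(counts.values())
    return {bucket: 100.0 * n / total for bucket, n in counts.items()}
```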
5. The elicitation ladder — a borrowed, rigorous “prove it’s X not Y” methodology
The AI-safety capability elicitation literature already had to solve almost exactly this problem — “is the model bad at this, or is the capability locked away where naive prompting can’t reach it?” — because sandbagging/eval-integrity depends on the same distinction. Their protocol, adapted: build (or find) a case where you don’t yet know if the capability is present, then try elicitation techniques in escalating cost order, and whichever one first recovers the behavior tells you what kind of gap it was.
prompt / few-shot (cheapest) → SFT on a handful of demos (cheap) → RL against a judge/verifier (expensive) → (still nothing) → knowledge gap
- Password-locked models: fine-tune a model to only show a capability when a password token is present (imitating a weaker model otherwise), then test whether elicitation techniques recover it. Finding: a few high-quality SFT demonstrations are often sufficient to fully elicit a locked capability; RL can also elicit it when demos aren’t available, but less reliably. Greenblatt et al., arXiv:2405.19550 [HIGH].
- Harder organism (circuit-breaking, not just password-locking): prompting/steering fail; only fine-tuning-based elicitation succeeds. Recommendation: “evaluators should use fine-tuning techniques to elicit capabilities when possible” — if prompting can’t get behavior out, don’t conclude the capability is absent; fine-tune first, then conclude. Hofstätter et al., arXiv:2502.02180 [MED].
- Operational definition of real capability: what can be elicited at ≤1% of total training cost (Anthropic RSP’s own definition, directly reusable). van der Weij et al., arXiv:2406.07358 [HIGH, ICLR 2025].
- Order matters: neither SFT nor RL alone reliably elicits held-back performance from a degenerate policy; SFT on weak demonstrations first, then RL, is what fully elicits it — RL-first “almost always leads to reward hacking rather than genuine improvement” starting from a degenerate policy. Ryd et al., arXiv:2604.22082 [MED, 2026].
Designed to fix: pattern 1 (agents prefer their own tools; 87.7% of calls bypass the rich tool surface, 26/40 tools dead). This is the cheapest, most directly actionable experiment in the whole chapter. Take a handful of ignored tools and run the ladder: (a) few-shot prompting with 2–3 correct-usage examples recovers usage → pure elicitation/prompting gap, no training needed; (b) SFT on a small demonstrated-usage set recovers it → elicitation via light fine-tuning (matches Greenblatt’s finding); (c) neither works → genuinely a missing-knowledge/affordance problem, and rejection-sampling SFT should specifically upweight trajectories that exercise those tools. This turns “the model prefers curl” from an anecdote into a falsifiable, paper-backed experiment.
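The ladder is essentially a cheapest-first loop that stops at the first rung that recovers the behavior. A minimal sketch, assuming each rung is a callable that runs the intervention and returns a success rate on the probe set; `run_ladder`, `min_lift`, and the rung names are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# (rung name, function that applies the intervention and returns a success rate)
Rung = Tuple[str, Callable[[], float]]

VERDICTS = {
    "few-shot": "prompting/elicitation gap, no training needed",
    "light-sft": "elicitation via light fine-tuning",
    "rl": "capability present but only reachable with RL",
    None: "nothing recovered it: treat as a knowledge/affordance gap",
}


def run_ladder(rungs: Sequence[Rung], baseline: float,
               min_lift: float = 0.2) -> Optional[str]:
    """Try rungs in the order given (cheapest first).

    Returns the name of the first rung whose success rate beats the baseline
    by at least min_lift, or None if no rung does. Later, more expensive
    rungs are never evaluated once a cheaper one succeeds.
    """
    for name, evaluate in rungs:
        if evaluate() - baseline >= min_lift:
            return name
    return None
```

The `min_lift` threshold is a placeholder; in practice "recovers usage" needs a definition fixed before the experiment is run, otherwise the ladder stops being falsifiable.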
5.1 The wrinkle mid-episode: the self-verification cliff [E][R]
Your flag verifier is external, ground-truth, and perfect — exactly the regime where BoN/rejection-sampling/RL should work without a ceiling (Stroebl et al., “Inference Scaling fLaws,” arXiv:2411.17501 [HIGH]: with an imperfect verifier the false-positive floor is non-removable even at infinite compute; a perfect verifier has no such floor). That’s a load-bearing reason the project’s “ground-truth-verified reward, never regex” rule is correct.
But the verifier only fires at submission. At turn 40 of 100, the agent must judge without it whether its current path is worth continuing, and the evidence that this self-judgment is the weak point comes in three strengths:
- Directional only: models find a correct answer among k samples far more often than they can self-select it, and the self-selection gap reportedly widens with generator capability (contested/preliminary: one 2026-06 OpenReview submission, no confirmed arXiv id, treat as [LOW], flag if cited).
- Corroborating, harder evidence: Best-of-N provably degrades past a reward-hacking threshold even with a competent reward model, so scaling samples isn’t monotonically good. Huang et al., arXiv:2503.21878 [HIGH, ICML 2025].
- Formal geometry: rejection-sampling and Best-of-N both converge to a ceiling set by the verifier’s ROC curve; more samples cannot buy past it. Dorner et al., arXiv:2507.12399 [MED].
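The "ceiling set by the verifier" can be seen in a deliberately simplified model: a binary verifier with a true-positive rate and a false-positive rate, and a policy that returns the first sample the verifier accepts. This is an illustrative Bayes calculation under those assumptions, not the setup of any paper cited above, and it shows only the ceiling, not the non-monotonic degradation Huang et al. analyze.

```python
def accepted_precision(p_correct: float, tpr: float, fpr: float) -> float:
    """P(sample is correct | verifier accepted it), by Bayes' rule."""
    accept_correct = p_correct * tpr
    accept_wrong = (1.0 - p_correct) * fpr
    total = accept_correct + accept_wrong
    return accept_correct / total if total > 0 else 0.0


def best_of_n_accuracy(p_correct: float, tpr: float, fpr: float, n: int) -> float:
    """Accuracy when returning the first accepted sample out of n i.i.d. draws.

    Counts "nothing accepted" as a failure. As n grows this approaches
    accepted_precision(...) and cannot exceed it.
    """
    accept_any = p_correct * tpr + (1.0 - p_correct) * fpr
    found = 1.0 - (1.0 - accept_any) ** n
    return accepted_precision(p_correct, tpr, fpr) * found
```

With a 10% per-sample solve rate, a verifier at TPR 0.9 / FPR 0.05 caps out at 0.09 / 0.135 = 2/3 no matter how many samples are drawn; setting FPR to 0 (the ground-truth flag check) removes the cap.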
Designed to fix: pattern 3 (good guessers until they’re not — ~half of solves hinge on an ungrounded guess at a critical step; 82% pivot after a single failure). Diagnostic, cheap, run on the existing Phoenix corpus: log every point in a trajectory where the agent pivots/abandons a path, and check post-hoc whether the abandoned path was actually unproductive (e.g., did a later successful run on the same challenge use a similar path?). If abandoned paths are disproportionately ones that would have worked, that’s a self-verification-cliff signature — an execution/judgment gap, and the fix is a mid-episode progress signal, not more SFT demonstrations.
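A sketch of the pivot audit, assuming each pivot has already been reduced to a `(challenge_id, path_signature)` pair and that "a similar path" is approximated by exact signature match. Both are simplifying assumptions, and the field names are hypothetical rather than taken from the Phoenix schema.

```python
def premature_abandon_rate(pivots, successful_paths):
    """Fraction of abandoned paths that some successful run on the same challenge used.

    pivots: iterable of (challenge_id, abandoned_path_signature).
    successful_paths: dict mapping challenge_id -> set of path signatures
        seen in verifier-passed runs.
    """
    pivots = list(pivots)
    if not pivots:
        return 0.0
    hits = sum(1 for challenge_id, path in pivots
               if path in successful_paths.get(challenge_id, set()))
    return hits / len(pivots)


# Toy data (made up for illustration only).
example_pivots = [("web-01", "sqli-login"), ("web-01", "lfi-include"), ("pwn-03", "ret2libc")]
example_success = {"web-01": {"sqli-login"}, "pwn-03": {"fmt-string"}}
example_rate = premature_abandon_rate(example_pivots, example_success)
```

The rate only means something against a baseline, for example the share of all attempted paths (abandoned or not) that appear in a successful run on the same challenge; "disproportionately" is that comparison.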
6. A two-type failure vocabulary — grounded in general agent/RL evidence, not domain benchmarks
Standing project rule: no conclusion here may rest on academic cybersecurity-LLM training/benchmark papers (CTF-Dojo-style work, pentest-agent papers, CTF-family robustness studies, etc.) — none of that literature has produced a frontier cybersecurity model. The domain-specific pentesting-agent papers below are mentioned for context only, not as a basis for anything in this chapter:
- Deng et al., “What Makes a Good LLM Agent for Real-world Penetration Testing?,” arXiv:2602.17622 — academic, cited for context — not a basis for our decisions. (28 pentesting systems, proposes a Type A “capability gap” / Type B “planning and state-management limitation” taxonomy, and an Evidence-Guided Attack Tree Search system, “Excalibur.”)
- Nakano et al., arXiv:2509.07939 — academic, cited for context — not a basis for our decisions. (Deterministic ATT&CK-derived task tree lifts subtask completion 13.5–16.5% → 71.8–78.6% on the same models — a pentest-benchmark result, not project evidence.)
- Shen et al., “PentestAgent,” arXiv:2411.05185 — academic, cited for context — not a basis for our decisions. (Frames its own motivating failure as a knowledge gap fixed with RAG; illustrative only of how unsettled domain-specific pentest-agent papers are, not evidence either way.)
The two-type vocabulary itself is worth keeping — it just needs a non-domain-specific foundation. Re-grounded on general long-horizon-agent evidence and RL theory:
- Capability/elicitation gaps — missing tools, inadequate prompts, absent demonstrations. Cheaply closed by better scaffolding, tool surface, or light SFT (§5’s elicitation ladder, itself grounded in the AI-safety elicitation literature, not domain-security work).
- Planning/state-management limitations — a structurally different failure mode that does not reliably close with a stronger base model or more knowledge alone. Four independent, non-cybersecurity anchors support treating this as a real, separate axis — two qualitative/theoretical, two now directly quantitative:
- Empirical, frontier-model, general-domain: METR’s time-horizon study finds that what separates frontier models on long-horizon tasks is reliability and the ability to adapt to their own mistakes, not per-step knowledge or reasoning quality — and this axis scales on its own trajectory, doubling roughly every 7 months, independent of raw capability jumps. Kwa et al. (METR), “Measuring AI Ability to Complete Long Software Tasks,” arXiv:2503.14499 [HIGH]. This is general evidence (RE-Bench/HCAST software tasks, no cybersecurity framing) that long-horizon degradation is a distinct axis from single-turn competence — exactly the property a “Type B” needs to be a real thing and not just restated knowledge-gap.
- Theoretical, classic RL/imitation-learning result: compounding error under covariate shift — a policy trained/evaluated with a small per-step error rate accumulates error quadratically in trajectory length because each mistake pushes the agent into states its training distribution under-covers, and no single-step fix removes this without addressing the sequential structure itself. Ross, Gordon & Bagnell (DAgger), arXiv:1011.0686 [HIGH, 840+ citations]. This is the general-theory reason a state-tracking/replanning failure at turn 40 of 100 can be structurally invariant to swapping in a stronger base model — the problem is the sequential decision process, not the weights.
- Direct quantitative test, general-domain, verified 2026-07-02: isolating pure execution (plan and knowledge handed to the model, so only turn-to-turn execution is measured), larger models within the same family do execute more correct turns — but per-step accuracy still degrades as turns accumulate, driven by a self-conditioning effect (the model becomes more likely to err once its own prior mistakes are sitting in context). The paper’s own framing is exactly this chapter’s question: “self-conditioning does not reduce by just scaling the model size” — it is removed only by switching training paradigm to a “thinking”/reasoning-trained model, not by a bigger non-reasoning model of the same family. Sinha, Arun, Goel, Staab & Geiping, “The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs,” arXiv:2509.09677 [MED, preprint]. This is the closest thing in general literature to a direct scale-invariance test of a planning/state-tracking failure — the honest reading is invariant to parameter scale within a training paradigm, not invariant to LLM full stop (reasoning-trained models are a real escape hatch, a bigger base model of the old paradigm is not).
- Direct quantitative test, general-domain, verified 2026-07-02: a minimal explicit-lookahead planning module (FLARE) bolted onto a much weaker base model lets LLaMA-8B outperform GPT-4o run with standard step-by-step reasoning on multi-step planning benchmarks — i.e. an eight-billion-parameter model with a planning fix beats a frontier model without one. This is a clean existence proof that the planning axis is separable from and can dominate raw base-model strength. Wang, Wu, Wang, Tang, Li, Yin, Ma, Li, Sun, Chen & Ye, “Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents,” arXiv:2601.22311 [LOW, 0-citation preprint — promising, not yet validated].
- Corroborating, not separately load-bearing: τ-bench’s own cross-model pass^k leaderboard (cited in §6.1 below) shows the same steep pass^1→pass^8 reliability collapse for both gpt-4o and claude-3.5-sonnet: picking the “better” frontier model of a different family narrows but does not close the multi-trial consistency gap. Yao et al., arXiv:2406.12045 [HIGH]. And a 2026 cross-family diagnostic benchmark (GPT-5 variants + Claude models, 3100+ trajectories) documents the same horizon-dependent degradation pattern recurring across both families rather than being a one-model artifact. Wang, Bai, Sun, Wang, Zhang, Hu, Schroder, Mutlu, Song & Nowak, “The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break,” arXiv:2604.11978 [MED, preprint].
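To make the compounding-error anchor concrete, here is the arithmetic behind "quadratically in trajectory length". The T²ε bound is the classic result for naive imitation restated in the DAgger paper; the linear Tε line is shown for contrast with the constant factor set to 1, which is a simplifying assumption. Function names are illustrative.

```python
def naive_imitation_bound(horizon: int, per_step_error: float) -> float:
    """Extra expected cost bound T^2 * eps, capped at T (per-step cost is at most 1)."""
    return min(float(horizon), horizon * horizon * per_step_error)


def linear_bound(horizon: int, per_step_error: float) -> float:
    """T * eps scaling that interactive (no-regret) training aims for.

    The constant factor in front is taken as 1 here, which is an assumption.
    """
    return min(float(horizon), horizon * per_step_error)


# Per-step error of 1% at two horizons.
short = (naive_imitation_bound(10, 0.01), linear_bound(10, 0.01))
long_ = (naive_imitation_bound(100, 0.01), linear_bound(100, 0.01))
```

At 10 turns the two bounds differ by a factor of 10 (roughly 1 vs 0.1 expected mistakes). At 100 turns the quadratic bound reaches the trivial cap of 100, meaning it guarantees nothing, while the linear one still says about 1. That is the sense in which a turn-40-of-100 failure is a property of the sequential process rather than of per-step skill.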
Designed to fix: pattern 4 (62% of failures stall in exploitation). This is a plausible instance of exactly the compounding-error dynamic DAgger formalizes: an early misstep in enumeration pushes the trajectory off-distribution, and the deeper into exploitation the agent gets, the harder recovery becomes.
Actionable, cheap: label a sample of your failed trajectories on two axes:
- (a) missing tool / bad prompt / no demonstration → capability/elicitation gap, cheap fix;
- (b) over-committed to a low-value branch, exhausted context near a dead end, no real-time replanning after a costly failure → planning/state-management gap.
If (b) dominates, the theoretical prediction (DAgger), the empirical general-domain finding (METR), and the two direct quantitative tests above (Sinha et al.’s self-conditioning result; Wang et al.’s weak-model-plus-planning-beats-strong-model result) all say: don’t expect SFT-then-GRPO on raw episode reward alone to close it, and don’t expect swapping in a bigger base model of the same family to close it either. The fix needs to address the sequential/compounding structure (mid-trajectory checkpointing, explicit replanning triggers, a difficulty/progress signal, or a reasoning-trained backbone), not just more knowledge or more parameters.
Flag, updated 2026-07-02: the qualitative distinction (capability vs. planning/state) is now well-grounded in four independent general-literature anchors, and two of them (Sinha et al. 2509.09677, Wang et al. 2601.22311) are direct quantitative tests of “does a stronger/bigger base model fix it” in non-cybersecurity settings. Both say no, for different mechanisms (self-conditioning invariant to scale; planning-fix on a weak model beats a strong model without one). What general literature does not give you is this project’s own number: no source above measured the specific fraction of this portfolio’s failures that are planning/state-management vs. capability, nor whether it holds at this project’s turn-depths (~100) and challenge structure.
Status: qualitative claim grounded; the quantitative “X% invariant to base LLM” figure for this corpus is worth pursuing, but it must be measured on our own Phoenix corpus (68,049 spans, 189 runs), not assumed from the general literature or the academic pentest-agent papers above.
6.1 Per-turn fault labeling — operationalize the taxonomy on your own trace corpus [R]
τ-bench provides an auto error-identification tool with a fixed taxonomy (fault assignment × fault type: used_wrong_tool, used_wrong_tool_argument, took_unintended_action, goal_partially_completed). Yao et al., arXiv:2406.12045 [HIGH]. AgentRx extends this to an automated framework that localizes the single critical failure step in a long trajectory, with a cross-domain taxonomy (Misinterpretation of Tool Output 24.1%, Intent-Plan Misalignment 24.1%, Under-specified Intent 27.6%, per their τ-bench column). Barke et al. (Microsoft Research), arXiv:2602.02475 [MED].
Your Phoenix trace corpus (68,049 spans, 189 runs) is exactly the substrate this tooling wants. Label each turn as {reconnaissance-adequate, wrong-tool, wrong-argument, wrong-decision/policy, tool-output-misread, under-specified-plan} and compute: fraction of episodes with a labeled “under-specified-plan” or “wrong-decision” turn before the first “tool-output-misread”. This operationalizes BSides pattern 2 (“no methodology”) into a countable metric that separates planning (execution/skill: a scaffolding or SFT-curriculum fix) from interpretation (closer to reasoning/knowledge) from tool affordance (§5’s elicitation question).
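Once turns are labeled, the metric is a single pass over each episode. A minimal sketch, assuming every episode is an ordered list of the label strings above; counting an episode that has a planning fault but never reaches a tool-output-misread is a judgment call made here, not something the taxonomy specifies.

```python
PLANNING_LABELS = {"under-specified-plan", "wrong-decision/policy"}
MISREAD_LABEL = "tool-output-misread"


def planning_first_fraction(episodes):
    """Fraction of episodes with a planning-type fault before the first misread.

    episodes: list of per-episode label lists, in turn order.
    An episode with a planning fault and no misread at all counts as a hit
    (assumption; adjust if your reading of the metric differs).
    """
    if not episodes:
        return 0.0
    hits = 0
    for labels in episodes:
        for label in labels:
            if label in PLANNING_LABELS:
                hits += 1
                break
            if label == MISREAD_LABEL:
                break  # interpretation failure came first
    return hits / len(episodes)
```

For three episodes where one hits `under-specified-plan` first, one hits `tool-output-misread` first, and one has only a `wrong-tool` turn, the function returns 1/3.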
7. A robustness cross-check: is the “gain” generalization or memorization?
Domain-specific aside (context only, not a basis for the method below): Honarvar et al., arXiv:2602.05523 — academic, cited for context — not a basis for our decisions. They build families of semantics-preserving CTF variants and find models robust to shallow transforms but degrading sharply under composed/deeper obfuscation, in the CTF domain specifically; per project rule, an academic CTF-benchmark paper cannot be the basis for the method below.
The methodology stands on its own, general (non-cybersecurity) grounding: semantics-/knowledge-preserving perturbation is an established way to separate genuine generalization from pattern-matching in general LLM evaluation. C-BOD rephrases MMLU questions with a parameterized, meaning-preserving transform and finds an average 2.75% performance drop across 32 SOTA models under modest rephrasing — with higher-performing, larger models showing greater sensitivity, i.e. bigger benchmark numbers can mean more surface-cue reliance, not less. Cohen-Inger et al., “Forget What You Know about LLM Evaluations — LLMs are Like a Chameleon,” arXiv:2502.07445 [MED, EMNLP 2025]. The same logic applied to code generation: rewrite a task’s ground-truth solution into a semantically-different-but-equal-difficulty variant and check whether the model’s answer degrades — a Memorization Risk Index that’s high only when the model reproduces a similar-looking answer and fails the rewritten task. Zhang et al., “Memorize or Generalize? Evaluating LLM Code Generation with Code Rewriting,” arXiv:2503.02296 [LOW, brand-new preprint, 0 citations — promising, not yet validated].
Before crediting any pipeline change with “closing an execution gap,” check the gain isn’t an artifact of the fixed, memorized 10 canonical PD26 challenges the harness has seen many times, using the same semantics-preserving-perturbation logic C-BOD and the code-rewriting paper apply outside cybersecurity.
Designed to fix: pattern 5. Cheap, no new training: generate meaning-preserving transformed variants of a held-out subset (renaming, restructuring, composed obfuscation — the transform families are generic; you don’t need a domain-specific benchmark paper to justify the check) and see whether a solve-rate gain transfers. If it evaporates on transformed variants, you have elicitation/memorization, not the execution-reliability improvement the SFT/GRPO diagnosis is banking on — and per C-BOD’s finding, don’t assume your strongest checkpoints are exempt; they may be the most exposed to this failure mode.
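The transfer question reduces to one ratio: how much of the gain measured on the original challenges is still there on the transformed variants. The sketch assumes solve outcomes are available as boolean lists for the baseline and the changed pipeline on both sets; `gain_retention` is an illustrative name.

```python
def solve_rate(outcomes):
    """Mean of a list of booleans (True = solved)."""
    return sum(bool(o) for o in outcomes) / len(outcomes)


def gain_retention(base_orig, new_orig, base_variant, new_variant):
    """Share of the original-set gain that survives on transformed variants.

    Each argument is a list of per-challenge booleans.
    Returns None when there is no gain on the originals to explain.
    Values near 1.0 suggest the gain transfers; values near 0 suggest it was
    tied to surface form.
    """
    gain_orig = solve_rate(new_orig) - solve_rate(base_orig)
    if gain_orig <= 0:
        return None
    gain_variant = solve_rate(new_variant) - solve_rate(base_variant)
    return gain_variant / gain_orig
```

Running this per checkpoint, strongest included, follows the C-BOD caution above: the best headline number is not evidence of the best retention.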
Also check the opposite failure mode post-RL: RL-PLUS names “capability boundary collapse” — pass@k at large k dropping even as pass@1 rises during RLVR, i.e. the on-policy training itself narrowing what the model can still do, not just what it does by default. Their fix mixes in the very off-policy verifier-passed trajectories rejection-sampling SFT already produces, via importance-sampling-corrected updates. Dong et al., arXiv:2508.00222 [MED]. This is the training-time confirmatory check for an exploration gap that got worse, not better, under GRPO — run pass@large-k before and after every GRPO checkpoint, not just pass@1.
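The before/after check can be automated per checkpoint. A minimal sketch using the standard unbiased pass@k estimator on per-challenge correct counts; the function name `boundary_collapsed` and the toy counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k from n samples with c correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass(counts, n, k):
    return sum(pass_at_k(n, c, k) for c in counts) / len(counts)


def boundary_collapsed(before_counts, after_counts, n, large_k):
    """True when pass@1 rose but pass@large_k fell between two checkpoints.

    before_counts / after_counts: correct rollouts out of n, per challenge,
    on the same held-out set.
    """
    p1_before, p1_after = mean_pass(before_counts, n, 1), mean_pass(after_counts, n, 1)
    pk_before, pk_after = mean_pass(before_counts, n, large_k), mean_pass(after_counts, n, large_k)
    details = {"pass@1": (p1_before, p1_after), "pass@k": (pk_before, pk_after)}
    return (p1_after > p1_before and pk_after < pk_before), details


# Toy counts (made up): the later checkpoint nails one challenge and loses three.
collapsed, info = boundary_collapsed([2, 1, 1, 0], [12, 0, 0, 0], n=16, large_k=16)
```

In the toy case pass@1 rises from 0.0625 to 0.1875 while pass@16 falls from 0.75 to 0.25, which is the signature the paragraph above describes.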
8. The honest verdict: no single standard — here is the assembled protocol
No single accepted diagnostic instrument exists that a team runs once for a clean X-not-Y verdict. What the literature convergently offers, ranked by how load-bearing each is for this project:
- Pass@k / Pass@(k,T) / Cover@τ curve decomposition (§2) — the closest thing to a standard quantitative instrument, but interpretation is contested even among its own authors (§2.2 vs its CoT-Pass@K rebuttal), and its agentic extension shows the diagnosis is task-structure-dependent: same metric, opposite conclusion, depending on whether the task is compositional/sequentially-gated or not.
- Capability-elicitation methodology (§5) — a rigorous, falsifiable, cost-ordered protocol borrowed from AI-safety evaluation, directly reusable and cheap to run against the tool-avoidance finding.
- Capability/planning two-type taxonomy + per-turn fault labeling (§6) — a general vocabulary grounded in long-horizon-agent measurement (METR), compounding-error theory (DAgger), and now two direct general-domain quantitative tests of scale-invariance (self-conditioning not fixed by scale, Sinha et al. 2509.09677; weak-model-plus-planning beats strong-model-without, Wang et al. 2601.22311) — not in domain-specific pentest-agent papers; the specific per-corpus “X% invariant to base LLM” number stays flagged as worth pursuing, to be measured on this project’s own corpus, not asserted from general literature.
- The competence/performance vocabulary (§1) to frame the final answer for a skeptical reviewer: expect and report a split verdict by challenge subtype, not a single number.
The protocol — what to actually run, in order
| Step | Instrument | Discriminates | Cost | Section |
|---|---|---|---|---|
| 1 | Segment 1000 challenges: single-shot exploit chain vs. sequentially-gated (enumeration-gates-exploitation) | Sets up steps 2–3 correctly | Free (manual/heuristic labeling) | §2.4 |
| 2 | Pass@(k,T): base vs. rejection-sampling-SFT vs. (eventual) GRPO checkpoint, per segment | Execution gap vs. exploration gap vs. knowledge gap | Medium (sampling compute, no training) | §2.2–2.4 |
| 3 | Pass@64 on the SFT checkpoint as an RL go/no-go gate, alongside entropy | Whether GRPO is worth running at all | Cheap (no training run) | §3 |
| 4 | Cover@τ (τ≈0.3) alongside pass@k on every reported number | Genuine reliability vs. guessing | Free (same rollouts, different aggregation) | §2.3 |
| 5 | Elicitation ladder (prompt → few-shot → light SFT → RL) on the 26 dead tools + methodology failures | Elicitation/performance-floor vs. genuine knowledge gap | Cheap → medium, escalating | §5 |
| 6 | Post-hoc pivot-point audit: did abandoned paths ever succeed elsewhere? | Self-verification-cliff (execution/judgment) vs. genuinely dead end | Free (existing Phoenix corpus) | §5.1 |
| 7 | Capability/planning labeling + per-turn fault taxonomy on the Phoenix corpus | Engineering-fixable vs. architectural (test invariance-to-LLM claim on your own data, don’t assume it) | Medium (manual labeling pass, or automate via AgentRx-style tooling) | §6 |
| 8 | Semantics-preserving-transform robustness check on a held-out subset | Generalization vs. memorization | Medium (need the transform tool) | §7 |
| 9 | Entropy instrumentation from GRPO step 0 + pass@large-k before/after every checkpoint | Training-induced exploration collapse (boundary shrinking, not growing) | Free once GRPO is running | §7 |
Report all nine together, segmented by challenge subtype. Refuse to collapse the result into one sentence: §2.4 and §6 both predict that the true answer is heterogeneous across the portfolio, and §7’s entropy check explains one mechanism for it.
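Segmented reporting is mostly bookkeeping. The sketch below groups per-challenge rollout counts by subtype and reports pass@1 next to Cover@τ. Cover@τ is implemented here as "fraction of challenges where at least τ of the rollouts succeed", which is this sketch's reading of the metric; check it against §2.3 before relying on it. Segment names are hypothetical.

```python
from collections import defaultdict


def segment_report(rows, tau=0.3):
    """rows: iterable of (segment, n_rollouts, n_correct), one per challenge.

    Returns {segment: {"challenges", "pass@1", "cover@tau"}}.
    """
    by_segment = defaultdict(list)
    for segment, n, c in rows:
        by_segment[segment].append(c / n)  # per-challenge success rate
    report = {}
    for segment, rates in by_segment.items():
        report[segment] = {
            "challenges": len(rates),
            "pass@1": sum(rates) / len(rates),
            # share of challenges solved reliably, not just occasionally
            "cover@tau": sum(1 for r in rates if r >= tau) / len(rates),
        }
    return report
```

Two segments can share a similar pass@1 and still differ sharply in Cover@τ, which is the reason to print both columns side by side rather than one aggregate.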
9. The decision, expanded
flowchart TD
Start["Failing challenge / challenge-subtype<br/>under diagnosis"] --> Seg{"Winning path structure?"}
Seg -->|"Single-shot / independent recon<br/>(Cat A/B analog)"| PKT_AB["Pass@(k,T): base vs SFT vs RL<br/>(arXiv:2604.14877)"]
Seg -->|"Sequentially-gated:<br/>enum must succeed before<br/>exploit is even visible (Cat C)"| PKT_C["Pass@(k,T): base vs SFT vs RL<br/>(arXiv:2604.14877)"]
PKT_AB --> X1{"Base pass@k(large k)<br/>>= trained pass@k?"}
X1 -->|"Yes — crossover"| Elicit1["ELICITATION only:<br/>run the ladder (§5)<br/>before assuming knowledge gap"]
X1 -->|"No — trained pulls away"| Exec1["EXECUTION gap:<br/>rejection-sampling SFT → GRPO<br/>is the right ordering"]
PKT_C --> X2{"Does the correct action<br/>ever appear at large N,<br/>on ANY checkpoint?"}
X2 -->|"Never"| Know["KNOWLEDGE gap:<br/>inject off-policy<br/>(SFT / teacher / TOOL)"]
X2 -->|"Yes — but matched-data SFT<br/>REGRESSES it (§2.4 test)"| Explore["EXPLORATION gap:<br/>on-policy RL required,<br/>NOT more demonstrations"]
X2 -->|"Yes — few-shot prompting<br/>recovers it"| ElicitP["Performance floor:<br/>cheap prompting/scaffold fix<br/>(§5, pattern 1)"]
X2 -->|"Yes — only light SFT<br/>on demos recovers it"| ElicitS["Elicitation via light SFT<br/>(arXiv:2405.19550)"]
Elicit1 --> TypeAB["Capability vs. planning/state<br/>label the failures<br/>(arXiv:2503.14499, arXiv:1011.0686)"]
Exec1 --> TypeAB
Explore --> TypeAB
TypeAB -->|"Capability: tool / prompt gap"| FixA["Engineering fix:<br/>tool surface, scaffolding,<br/>walkthrough-shaped SFT data"]
TypeAB -->|"Planning/state: difficulty-<br/>estimation, compounding error<br/>(test invariance on own data)"| FixB["Architectural fix:<br/>difficulty-gating / attack-tree<br/>search wrapper — NOT more<br/>SFT-then-GRPO on raw reward"]
classDef your fill:#132b22,stroke:#34d399,color:#eafaf3;
class Exec1,Explore,TypeAB your;
This is the same routing question as The decision — does the correct action ever appear in π_θ’s own outputs at high N? — expanded with the two tests that make it rigorous for an agentic, T-round, sequentially-gated task instead of a single-shot one: the compositionality segmentation (§2.4) and the matched-data-SFT-regression test that separates a genuine exploration gap from a knowledge gap when coverage exists but training destroys it.
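For readers who prefer code to diagrams, the routing logic of the flowchart can be written as a plain function. This is a sketch of the branch structure only: the argument names are hypothetical, each boolean stands for the outcome of a test described in the referenced section, and the order in which the three "Yes" branches are checked (cheapest intervention first) is an assumption the flowchart itself does not fix.

```python
def route(structure, *, base_ge_trained_at_large_k=None, ever_correct=None,
          fewshot_recovers=False, light_sft_recovers=False,
          matched_sft_regresses=False):
    """Map diagnostic outcomes to a gap label, following the flowchart branches.

    structure: "single-shot" or "gated" (sequentially-gated).
    """
    if structure == "single-shot":
        if base_ge_trained_at_large_k is None:
            return "undetermined: run Pass@(k,T) first"
        return "elicitation" if base_ge_trained_at_large_k else "execution"
    if structure == "gated":
        if ever_correct is None:
            return "undetermined: run Pass@(k,T) first"
        if not ever_correct:
            return "knowledge"
        if fewshot_recovers:
            return "performance-floor"
        if light_sft_recovers:
            return "elicitation-light-sft"
        if matched_sft_regresses:
            return "exploration"
        return "undetermined: coverage exists, no test discriminated"
    raise ValueError(f"unknown structure: {structure!r}")
```

Applied per challenge subtype rather than to the whole portfolio, this yields a table of labels rather than a single verdict.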
Cross-links
- The decision — the one-line version of this chapter’s routing test; this chapter is its full justification.
- Contested edges & landmines §1 — “RL can’t create capability” is contested exactly along the lines §2.2/§2.4 draw out (recipe-dependent, not a law).
- Agentic & multi-turn RL — where the exploration-gap fix (on-policy RL, entropy preservation) is implemented once diagnosed.
- Imitation — SFT · distillation · rejection sampling — where the execution-gap and knowledge-gap fixes live.
- memory/research/ (shared pool): long-horizon.md and exploration.md research notes underlie §2.4/§7’s turn-level and entropy-collapse mechanics respectively; not re-derived here, to keep this chapter’s scope to diagnosis, not fix.
Bibliography (all verified live via arxiv.org/abs/<id>, 2026-07-02)
| Citation | arXiv | Confidence |
|---|---|---|
| Yue et al., Does RL Really Incentivize Reasoning Capacity Beyond the Base Model? | 2504.13837 | HIGH |
| Wen et al., RLVR Implicitly Incentivizes Correct Reasoning (CoT-Pass@K) | 2506.14245 | MED |
| Dragoi et al., Beyond Pass@k: Breadth-Depth Metrics (Cover@τ) | 2510.08325 | MED |
| Zhai et al., Does RL Expand the Capability Boundary of LLM Agents? Pass@(k,T) | 2604.14877 | MED |
| Kang et al., Quagmires in SFT-RL Post-Training | 2510.01624 | HIGH |
| Chen et al., The Coverage Principle | 2510.15020 | MED |
| Greenblatt et al., Stress-Testing Capability Elicitation (password-locked) | 2405.19550 | HIGH |
| Hofstätter et al., The Elicitation Game | 2502.02180 | MED |
| van der Weij et al., AI Sandbagging | 2406.07358 | HIGH |
| Ryd et al., Removing Sandbagging via Weak Supervision | 2604.22082 | MED |
| Stroebl et al., Inference Scaling fLaws | 2411.17501 | HIGH |
| Dorner et al., ROC-n-reroll | 2507.12399 | MED |
| Huang et al., Is Best-of-N the Best of Them? | 2503.21878 | HIGH |
| Mahowald et al., Dissociating Language and Thought | 2301.06627 | HIGH |
| Firestone, Performance vs. Competence in Human–Machine Comparisons | PMC7604508 (journal, no arXiv) | HIGH |
| Yao et al., τ-bench | 2406.12045 | HIGH |
| Barke et al., AgentRx | 2602.02475 | MED |
| Kwa et al. (METR), Measuring AI Ability to Complete Long Software Tasks — general grounding for §6’s planning/state-management axis | 2503.14499 | HIGH |
| Ross, Gordon & Bagnell, A Reduction of Imitation Learning to No-Regret Online Learning (DAgger, compounding error) — theory grounding for §6 | 1011.0686 | HIGH |
| Sinha, Arun, Goel, Staab & Geiping, The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs (self-conditioning invariant to scale) — direct quantitative grounding for §6’s scale-invariance claim | 2509.09677 | MED |
| Wang, Wu, Wang, Tang, Li, Yin, Ma, Li, Sun, Chen & Ye, Why Reasoning Fails to Plan (FLARE; LLaMA-8B+planning beats GPT-4o) — direct quantitative grounding for §6’s scale-invariance claim | 2601.22311 | LOW, 0-citation preprint, promising not yet validated |
| Wang, Bai, Sun, Wang, Zhang, Hu, Schroder, Mutlu, Song & Nowak, The Long-Horizon Task Mirage? (HORIZON; cross-family GPT-5/Claude degradation) — corroborating cross-family evidence for §6 | 2604.11978 | MED |
| Cohen-Inger et al., Forget What You Know about LLM Evaluations — LLMs are Like a Chameleon (C-BOD) — general grounding for §7’s robustness check | 2502.07445 | MED |
| Zhang et al., Memorize or Generalize? Evaluating LLM Code Generation with Code Rewriting — secondary grounding for §7 | 2503.02296 | LOW, brand-new preprint, promising not yet validated |
| Dong et al., RL-PLUS: Countering Capability Boundary Collapse | 2508.00222 | MED |
| “GRPO amplifies existing capabilities, SFT replaces them” (confirmed project lesson) | 2507.10616 | project-confirmed |
Academic cybersecurity-LLM domain-specific work — mentioned in §6/§7 for context only, per standing project rule NOT a basis for any claim/decomposition/recipe/verdict/number in this chapter (none of these produced a frontier cybersecurity model):
| Citation | arXiv | Note |
|---|---|---|
| Deng et al., What Makes a Good LLM Agent for Pentesting? (Type A/B, Excalibur) | 2602.17622 | context only — §6’s taxonomy is re-grounded on METR + DAgger above |
| Nakano et al., Guided Reasoning / Structured Attack Trees | 2509.07939 | context only |
| Shen et al., PentestAgent (contested reading) | 2411.05185 | context only |
| Honarvar et al., Capture the Flags: Family-Based Evaluation | 2602.05523 | context only — §7’s check is re-grounded on C-BOD + code-rewriting above |
Flagged, not cited as fact: “The Self-Verification Cliff” (OpenReview, 2026-06-17) — no confirmed arXiv id as of this writing. Treat as [LOW], directional-only, per §5.1.
From behavioral audit to training signal
The BSides LV CFP audit (189 runs, 4 frontier models, 68,049 Phoenix spans) is not a benchmark score — it’s a behavioral trace. Each of its 5 findings describes a shape of failure, and each shape implies a different kind of gap: exploration, methodology/reasoning, credit assignment, or eval validity. This page maps observed behavior → gap type → the literature’s actual fix → how you’d check the fix worked, for each of the 5 patterns, against this project’s real setup (rejection-sampling SFT on verifier-passed solves now, GRPO/RLVR when entropy collapses; ground-truth flag verifier; knowledge in tools not weights).
flowchart LR
A[observed behavior] --> B[gap type]
B --> C[training signal / method]
C --> D[verification check]
D -.re-audit.-> A
None of this relitigates the diagnosis (execution gap, not knowledge gap) — it asks a narrower question per pattern: given this specific failure shape, which lever in the SFT→GRPO graduation actually targets it, and how would you know if it worked?
Pattern 1 — Agents prefer their own tools (87.7% of calls bypass the rich tool surface; 26/40 tools dead)
Observed: raw curl/shell dominates; the bespoke sectools surface is mostly unused.
Gap type [E]: this is not a knowledge gap (the agent isn’t ignorant of the tools — they’re in its context) and not really a reasoning gap. It’s an exploration-of-tool-space problem: tool choice is driven by the model’s pretraining prior (shell/curl is high-likelihood, familiar, “free” under next-token probability) rather than by the tool’s actual value for the subtask. Faghih et al., “Tool Preferences in Agentic LLMs are Unreliable,” arXiv:2505.18135 (2025-05-23) shows this is gameable by description text alone — edited docstrings shift usage >10x with zero change in tool capability. Confidence: established (controlled cross-model study). This is a diagnosis, not a fix — it just rules out “better docstrings” as a ceiling-breaking move.
Training signal:
Designed to fix: pattern 1 — agents defaulting to raw shell/curl over the provided tool surface.
- ToolRL (Qian et al., arXiv:2504.13958, 2025-04-16, established): don’t SFT-imitate tool traces; decompose the reward into per-call terms so which tool was chosen is its own learnable signal, not folded into one coarse outcome score:

      r_format  = valid_json_schema(call)           # 0/1
      r_tool    = tool_is_appropriate(call, state)  # 0/1, graded by task type
      r_param   = params_correct(call)              # 0/1 or partial
      r_outcome = env_feedback(call)                # sparse, terminal-heavy: your flag verifier
      reward    = w1*r_format + w2*r_tool + w3*r_param + w4*r_outcome

  Their ablation: reward granularity (per-call beats per-episode) and reward type (graded beats binary) both matter. This is the cheapest lever: you already have the tool registry, you need the auxiliary signal.
- ReTool (Feng et al., arXiv:2504.11536, 2025-04-15, established): trains the whole trajectory end-to-end so the policy learns when in a reasoning chain to reach for a tool, not just which one given an isolated decision point. Confirms the shape of the project’s SFT→GRPO plan; its specific addition is that real sandbox tool-execution output (not a paraphrase) must be in the rollout context that gets scored. Worth auditing whether secagent/daytona feeds live stdout into the trajectory used for reward.
- Tool-Star (Dong et al., arXiv:2505.16410, 2025-05-22, promising, <6mo/low-citation): naive RL under-explores a large tool inventory because gradient concentrates on whatever already gets used. Its fix: manufacture forced/hinted rollouts that exercise under-used tools before RL, verify with the real environment, fold verifier-passed ones into SFT data. Directly targets the “26/40 dead” number, and flags a self-reinforcing trap: a rejection-sampling corpus built from the current curl-biased policy will never contain a dead tool succeeding, because the policy never tried it. RL alone has ~zero probability mass to reinforce on those tools.
- Search-R1 (Jin et al., arXiv:2503.09516, 2025-03-12, established) contributes one mandatory engineering detail, domain-general: mask tool/environment output tokens from the policy-gradient loss — you don’t want to reinforce or penalize text the environment produced (shell stdout, HTTP bodies), only the model’s own query/command-generation tokens.
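The masking detail above can be sketched in a few lines. This is a minimal illustration, not Search-R1's implementation: the function name `masked_pg_loss` and the per-token boolean mask `is_model_token` are assumptions introduced here, and the loss is a plain REINFORCE-style term rather than the clipped GRPO objective.

```python
from typing import List


def masked_pg_loss(
    logp_new: List[float],
    advantages: List[float],
    is_model_token: List[bool],
) -> float:
    """Policy-gradient loss averaged over model-generated tokens only.

    Tokens produced by the environment (shell stdout, HTTP bodies) carry
    is_model_token=False and contribute nothing to the loss.
    """
    if not (len(logp_new) == len(advantages) == len(is_model_token)):
        raise ValueError("all inputs must have one entry per token")
    kept = [
        (lp, adv)
        for lp, adv, keep in zip(logp_new, advantages, is_model_token)
        if keep
    ]
    if not kept:
        return 0.0  # nothing the model wrote, so nothing to reinforce
    # Normalize by the number of kept tokens, not the full sequence length.
    return -sum(lp * adv for lp, adv in kept) / len(kept)
```

Normalizing by the count of kept tokens (rather than the full sequence length) keeps the loss scale from shrinking on rollouts dominated by long tool output; that choice is a design assumption of this sketch.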
Honest caveat: ToolRL/ReTool/Tool-Star are validated on general agentic tool-use, not offensive-security tool-use specifically; the transplant to sectools is this researcher’s inference, not literature-verified. Random-Crypto/HackSynth (arXiv:2506.02048) is academic, domain-specific CTF RL work, cited for context only and not as a basis for this page’s claims: it reports vanilla GRPO on crypto-CTF and attributes the generalization gains to improved tool usage. Per this project’s standing rule, academic cybersecurity-LLM training/benchmark papers don’t count as frontier evidence, so it is mentioned but not relied on. The actual basis for “RL over tool-choice transfers” stays the general-agentic evidence above (ToolRL/ReTool/Tool-Star): domain-general RL-for-tool-use results that don’t need a CTF-specific data point to hold.
Verify the fix: track the tool-usage histogram across all 40 sectools entries before/after training —
do previously-dead tools get invoked at all on held-out challenges (not just replayed on training
challenges)? Cheap, no-training-required companion check: DIVER, arXiv:2509.26209
(2025-09-30) rewards pairwise diversity across a rollout group — if adapted to “which tools/commands were
used” rather than token text, a rising group-diversity score on tool choice is a leading indicator before
solve-rate moves at all.
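Both checks above are plain bookkeeping over logged tool names. The sketch below assumes each episode is available as a list of invoked tool names; the function names are hypothetical, and mean pairwise Jaccard distance is a stand-in chosen here for "group diversity on tool choice", not DIVER's own metric.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence, Set


def tool_histogram(
    episodes: Sequence[Sequence[str]], registry: Sequence[str]
) -> Dict[str, int]:
    """Invocation count for every registered tool (zero if never used)."""
    counts = Counter(tool for episode in episodes for tool in episode)
    return {tool: counts.get(tool, 0) for tool in registry}


def dead_tools(hist: Dict[str, int]) -> List[str]:
    """Registered tools that were never invoked."""
    return sorted(tool for tool, n in hist.items() if n == 0)


def group_tool_diversity(rollouts: Sequence[Sequence[str]]) -> float:
    """Mean pairwise Jaccard distance between the tool sets of a rollout group."""
    sets: List[Set[str]] = [set(r) for r in rollouts]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0

    def jaccard_distance(a: Set[str], b: Set[str]) -> float:
        union = a | b
        return 0.0 if not union else 1.0 - len(a & b) / len(union)

    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)
```

Run `tool_histogram` on held-out challenges before and after training and compare `dead_tools`; a group-diversity score that rises while solve-rate is still flat is the leading indicator described above.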
Pattern 2 — React-and-guess, no methodology (no wordlists/checklists/PTES sequencing; 82% pivot after failure)
Observed: no visible recon→enum→exploit sequencing; the agent abandons a line of attack after one failed attempt instead of enumerating alternatives methodically.
Gap type [R]: a sequencing prior is missing — the model has PTES-shaped knowledge somewhere in its weights (it can describe methodology if asked) but doesn’t apply it as an ordering constraint during live rollouts. This reads as reasoning/methodology, not exploration-in-the-entropy sense, though the two compound (§ below, and see long-horizon.md → RAGEN discussion of the “Echo Trap”).
Training signal:
Designed to fix: pattern 2 — no methodology / premature pivoting after a single failed attempt.
- Two-stage guide-then-explore RL, re-grounded on general RL theory: Jump-Start Reinforcement Learning (JSRL), Uchendu et al., arXiv:2204.02372 (2022-04-05, established, 17+ citations), is the domain-general theoretical basis for the same two-stage shape this page used to cite a CTF-specific paper for: a guide-policy (built from offline data/demonstrations/an existing policy) forms a curriculum of starting states, and an exploration-policy is trained forward from those states. Naive initialization-then-finetune underperforms this because value-based methods handle a cold-start policy poorly. Mapped onto this project’s actual pipeline:

      # Stage 1, guide-policy: rejection-sampling SFT on verifier-passed walkthroughs
      #   (methodology prior: ordered recon → enum → exploit, not just any verifier-passed trace)
      # Stage 2, exploration-policy: online GRPO/RLVR in the live sandbox
      #   reward = terminal flag-verified {0,1} + intermediate env feedback (command succeeded/failed)
      # JSRL's curriculum knob: anneal how far into the guide-trajectory the exploration-policy starts,
      #   rather than always starting cold; cheap to add to the existing rejection-sampling corpus.

  DeepSeek-R1’s own disclosed recipe (R1, arXiv:2501.12948) is the frontier-lab confirmation that this shape works at scale: “cold-start SFT data, then RL” outperforms RL-from-scratch specifically because RL-from-scratch on an unguided policy wastes exploration budget relearning basic structure the SFT stage would have given it for free, the same failure mode JSRL formalizes.

  Academic CTF/pentest work, cited for context only, not a basis: Pentest-R1 (Kong et al., arXiv:2508.07382, evaluated on Cybench + AutoPenBench) reports the identical two-stage shape in the CTF domain specifically and claims both stages are required, order matters, and stage-1 data must be walkthrough-shaped rather than raw verifier-passed traces. Per this project’s standing rule, academic cybersecurity-LLM training/benchmark papers (Pentest-R1, AutoPenBench, and similar) don’t count as frontier evidence for a decision; the load-bearing basis here is JSRL + DeepSeek-R1’s cold-start-then-RL recipe, both domain-general. If Pentest-R1’s specific findings (walkthrough-shaping matters, order matters) turn out to matter operationally, that’s this project’s own thing to verify empirically, not something to inherit from an academic CTF paper.
- Structured attack trees / ATT&CK scaffolding (Nakano et al., arXiv:2509.07939, 2025-09-09, established mechanism, but this is scaffolding not weight-training — flag the distinction) — externally constrain rollout-time reasoning with a deterministic task tree built from MITRE ATT&CK’s kill-chain, filtering unproductive actions. Reports 71.8–78.6% subtask completion vs. 13.5–75.7% for self-guided reasoning at far fewer queries — a large, reproducible gap. The cheapest lever in this whole page: use the ATT&CK tree as a rollout-generation scaffold to harvest higher-quality, more methodical verifier-passed trajectories for the SFT corpus right now, with zero RL infrastructure — and it keeps knowledge in the tree (a prompt/tool artifact), not baked into weights, consistent with the project’s “knowledge in tools not weights” rule.
- PEARL (Wang et al., arXiv:2601.20439, 2026-01-28, promising) — treats the planning step (which tools, in what order) as its own object of RL, rather than only optimizing the final answer. Directly relevant as a formal mechanism for systematic tool-sequencing instead of react-and-guess, though not yet validated outside general multihop tool-use.
- Adjacent, unverified beyond title: classical-planner-hybridized LLM agents (arXiv:2512.11143) — a stronger, harder version of the ATT&CK-tree idea; worth a follow-up only if the tree scaffold proves insufficient.
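JSRL's curriculum knob from the first bullet above reduces to one rule: replay a prefix of a guide trajectory, let the learner take over, and shorten the prefix once the learner succeeds reliably from the current hand-over point. A minimal sketch, assuming the guide is a stored verifier-passed command sequence; the names, the 0.5 threshold, and the decrement of 5 are illustrative assumptions, not values from the paper.

```python
from typing import List, Sequence


def guided_prefix(guide_trajectory: Sequence[str], guide_steps: int) -> List[str]:
    """Commands replayed from the guide before the learner takes over."""
    return list(guide_trajectory[: max(0, guide_steps)])


def next_guide_steps(
    guide_steps: int,
    success_rate: float,
    threshold: float = 0.5,
    decrement: int = 5,
) -> int:
    """Shorten the guided prefix once the learner is reliable from the current hand-over point."""
    if guide_steps < 0:
        raise ValueError("guide_steps must be non-negative")
    if success_rate >= threshold:
        return max(0, guide_steps - decrement)
    return guide_steps  # not reliable yet: keep the same amount of guidance
```

When `guide_steps` reaches 0 the learner starts cold, which is ordinary online RL; the curriculum only changes where episodes begin on the way there.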
Honest caveat: the load-bearing basis for the two-stage recipe above is domain-general (JSRL, DeepSeek-R1’s cold-start-then-RL disclosure), not the CTF-specific Pentest-R1 paper — per this project’s standing rule, an academic CTF/pentest paper being “in-domain” doesn’t make it stronger evidence, it makes it out of scope as a basis (no academic cybersecurity-LLM project has produced a frontier model). The ATT&CK-tree scaffold (Nakano et al.) is a separate, non-cybersecurity-specific mechanism (deterministic task-tree filtering built on a public taxonomy, evaluated as scaffolding not weight-training) and isn’t subject to the same caveat. The main open question either way is whether a walkthrough corpus built from this project’s own data generalizes past the ~10 canonical PD26 challenges it would initially be built from — that’s this project’s own thing to test.
Verify the fix: measure PTES-phase coverage per episode (recon steps taken before first exploit attempt) and the pivot-after-failure rate (does the agent try ≥2 alternatives before abandoning a technique?) on held-out challenges, pre/post. Both are derivable from existing Phoenix spans without new instrumentation.
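The two metrics above can be computed from a flat list of step records. The sketch assumes a hypothetical schema where each step is a dict with `phase` ("recon", "enum", or "exploit"), `technique`, and `ok`; the real Phoenix span fields will differ and need mapping. "Abandoned" is approximated here as a technique that never succeeded in the episode, which is an assumption of this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Set


def recon_steps_before_first_exploit(steps: List[dict]) -> int:
    """Count recon/enum steps taken before the first exploit attempt."""
    count = 0
    for step in steps:
        if step["phase"] == "exploit":
            break
        if step["phase"] in ("recon", "enum"):
            count += 1
    return count


def premature_pivot_rate(steps: List[dict], min_attempts: int = 2) -> float:
    """Fraction of abandoned techniques dropped after fewer than min_attempts tries."""
    attempts: Dict[str, int] = defaultdict(int)
    succeeded: Set[str] = set()
    for step in steps:
        if step["phase"] != "exploit":
            continue
        attempts[step["technique"]] += 1
        if step["ok"]:
            succeeded.add(step["technique"])
    abandoned = [t for t in attempts if t not in succeeded]
    if not abandoned:
        return 0.0
    premature = sum(1 for t in abandoned if attempts[t] < min_attempts)
    return premature / len(abandoned)
```

Compare both numbers on held-out challenges before and after training; the fix is working if recon coverage rises and the premature-pivot rate falls.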
Pattern 3 — Good guessers until they’re not (≈half of solves hinge on an ungrounded guess; brittle to a wrong first guess)
Observed: many successful runs pivot on a single ungrounded guess at a critical step; when that guess is wrong, the agent rarely recovers.
Gap type [E]/[R]: this is where entropy collapse and reasoning intersect: a policy that has already spent its exploration budget on one high-probability guess has no remaining probability mass on alternatives when that guess fails. Cui et al., “The Entropy Mechanism of RL for Reasoning Language Models,” arXiv:2505.22617 (2025-05-28, high confidence, mechanistic + empirical) is the underlying diagnosis this pattern shares with pattern 4: policy entropy falls monotonically and predictably during RL, downstream performance R is traded against entropy H along a fitted curve R = -a·exp(H) + b (a and b are fitted coefficients), and the mechanism is the covariance between a token’s probability and its advantage: the model reinforces what’s already likely rather than exploring what’s uncertain.
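The covariance named above is cheap to log per batch. A minimal sketch, assuming per-token log-probabilities and advantages are already available as flat lists; the function name is hypothetical, and this is a monitoring statistic, not the paper's intervention.

```python
from typing import Sequence


def logprob_advantage_covariance(
    logps: Sequence[float], advantages: Sequence[float]
) -> float:
    """Population covariance between token log-probability and advantage.

    A persistently positive value means already-likely tokens are the ones
    being rewarded, which is the entropy-reducing regime described above.
    """
    n = len(logps)
    if n == 0 or n != len(advantages):
        raise ValueError("need equal-length, non-empty inputs")
    mean_lp = sum(logps) / n
    mean_adv = sum(advantages) / n
    return sum(
        (lp - mean_lp) * (adv - mean_adv) for lp, adv in zip(logps, advantages)
    ) / n
```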
Training signal:
Designed to fix: pattern 3 — brittle single-guess behavior with no recovery on failure.
- SCoRe, self-correction via RL (Kumar et al., arXiv:2409.12917, 2024-09-19, established, canonical DeepMind paper): SFT on (wrong→right) correction pairs barely transfers because it’s distribution-mismatched; only multi-turn RL with reward shaped to reward improvement, not just final correctness, produces genuine revision instead of mode-collapsing to “be right turn 1” or no-op-collapsing to “always change the answer”:

      r1 = verifier(attempt_1)
      r2 = verifier(attempt_2)
      reward = r2 + alpha * max(0, r2 - r1)  # bonus specifically for turning a miss into a hit

  Naming collision, flag explicitly: a different Sept-2025 paper reuses “SCoRe” for teacher-corrected earliest-error localization + short-horizon RL-continue-from-verified-prefix (Lyu et al., arXiv:2509.14257, promising, 7B-matches-72B claim unreplicated). Cite by arxiv id, not the shared name. If a 100-turn episode fails at a single identifiable wrong-guess turn (which matches this exact BSides finding), earliest-error-localized short-horizon RL is much cheaper credit assignment per episode than scoring the whole trajectory as one unit.
- CDE, curiosity-driven exploration (Dai et al., arXiv:2509.09675, 2025-09-11, medium-high, ICLR 2026 poster): adds an actor-side perplexity bonus (reward the model for being “surprised” by its own output). The paper reports a calibration-collapse finding as a byproduct: the policy becomes confident regardless of correctness, which the perplexity bonus specifically counters. This is the literature-side twin of “commits to an ungrounded guess and doesn’t recover”, and it is cheap to try (no extra network, just log-perplexity of the model’s own rollout) before anything heavier.
- Representation-based exploration (Tuyls et al., arXiv:2510.11686, 2025-10-13, medium-high) — an inference-time-only lever, not training: build a diverse k-of-N pass@k pool from hidden-state dissimilarity instead of random k-of-N. Notable negative result: this is anti-composable with high-temperature sampling — high-temp outputs look “novel” in representation space without being useful. Worth an ablation on whatever temperature the pass@k>1 eval pool currently uses, since it changes nothing about training and answers whether current sample diversity is real strategic variance or just noisier repeats.
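One simple way to build a k-of-N pool from hidden-state dissimilarity is greedy farthest-point selection. The sketch below is an illustration of the idea only: the cosine distance, the greedy rule, and the choice to seed with sample 0 are assumptions made here, not the selection procedure from Tuyls et al.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; zero vectors are treated as maximally unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def select_diverse(vectors: Sequence[Sequence[float]], k: int) -> List[int]:
    """Greedily pick k sample indices, each maximizing its distance to the picks so far."""
    if k <= 0 or not vectors:
        return []
    chosen = [0]  # seed with the first sample (arbitrary choice)
    while len(chosen) < min(k, len(vectors)):
        best_idx, best_dist = -1, -1.0
        for i in range(len(vectors)):
            if i in chosen:
                continue
            # Distance to the nearest already-chosen sample.
            dist = min(cosine_distance(vectors[i], vectors[j]) for j in chosen)
            if dist > best_dist:
                best_idx, best_dist = i, dist
        chosen.append(best_idx)
    return chosen
```

For the ablation suggested above, run this selection at the eval pool's current temperature and at a lower one, then compare pass@k of the selected pool against a random k-of-N draw.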
Honest caveat: SCoRe (both papers) and CDE are domain-general (math/code) — the CTF/100-turn transfer is inference, not literature-verified. One concrete, cheap check the project can run without any new training: confirm the reward/credit-assignment scheme doesn’t implicitly favor shorter successful trajectories — a 60-turn recovery-from-wrong-guess success should score the same as a 10-turn lucky first guess; if it doesn’t, the reward is actively working against fixing this pattern regardless of which paper’s fix gets adopted.
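The length-bias check above needs no training run: call the reward function on a short and a long successful episode and compare. In this sketch the two reward functions are hypothetical examples (one terminal-only, one with per-turn discounting), written to show what a passing and a failing check look like.

```python
from typing import Callable


def terminal_reward(solved: bool, n_turns: int) -> float:
    """Length-invariant: any verified solve scores 1.0."""
    return 1.0 if solved else 0.0


def discounted_reward(solved: bool, n_turns: int, gamma: float = 0.99) -> float:
    """Length-sensitive: later solves are worth less."""
    return gamma ** n_turns if solved else 0.0


def length_bias(
    reward_fn: Callable[[bool, int], float],
    short_turns: int = 10,
    long_turns: int = 60,
) -> float:
    """Reward gap between a short and a long successful episode (0.0 means no bias)."""
    return reward_fn(True, short_turns) - reward_fn(True, long_turns)
```

A positive `length_bias` means a 60-turn recovery is paid less than a 10-turn lucky guess, which is the configuration the caveat warns against.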
Verify the fix: on a held-out set, measure whether backtracking-after-a-wrong-guess correlates with eventual success pre/post training (does the trained policy actually try a second technique after the first fails, and does that second attempt land more often?).
Pattern 4 — Uneven PTES phases: strong at chaining inside an exploit, weak at thorough enumeration (62% of failures stall in exploitation)
Observed: the agent is good at following a known exploit chain once inside it, but weak at the enumeration/reconnaissance breadth that would get it there in the first place; most failures stall during exploitation rather than in an earlier phase.
Gap type [L]/[E]: two compounding gaps. First, entropy collapse narrows the recon repertoire — once RL reinforces whatever enumeration path happened to work once, alternative recon strategies stop being tried (same mechanism as pattern 3, arXiv:2505.22617). Second, a long-horizon credit-assignment problem: a flat terminal-only reward over a ~100-turn episode gives early, correct enumeration steps the same undifferentiated credit as late exploitation steps — so even when enumeration was necessary for the eventual win, nothing in the reward signal reinforces it specifically.
Training signal:
Designed to fix: pattern 4 — weak/uneven enumeration and stalling mid-exploitation.
- DAPO (Yu et al., arXiv:2503.14476, 2025-03-18, established, widely reproduced) + the Entropy Mechanism paper (arXiv:2505.22617) together are the baseline recipe, not an optional add-on, once GRPO starts: clip-higher (decouple the PPO clip range so rare-but-good tokens aren’t capped as hard as likely ones) and dynamic sampling (drop degenerate all-correct/all-wrong prompt groups, which otherwise contribute zero gradient). Dynamic sampling matters disproportionately here since a whole-group-zero-reward challenge at ~100 turns/rollout is expensive to keep resampling; consider curriculum-filtering to the 30–60% pass-rate band the project already targets, rather than paying for blind resamples on genuinely-unsolved challenges.

      eps_low, eps_high = 0.20, 0.28  # decoupled clip (vanilla PPO/GRPO: symmetric 0.20/0.20)
      ratio = exp(logp_new - logp_old)
      clipped = clip(ratio, 1 - eps_low, 1 + eps_high)
      loss_pg = -min(ratio * adv, clipped * adv)  # per-token, mean over ALL tokens in batch

- GiGPO (Feng et al., arXiv:2505.10978, 2025-05-16, established, NeurIPS 2025 poster): the most directly transplantable long-horizon idea in this map, and it needs no new infra: no critic, no extra rollouts. It adds a second, step-level advantage on top of GRPO’s trajectory-level one by retroactively hashing (state, step) pairs that recur across rollouts and scoring “what happened next” conditioned on that shared state, which is pure post-hoc bookkeeping on trajectories already sampled. This is exactly the mechanism that could stop rewarding all 100 turns equally when only the exploitation phase decides the outcome. Failure mode to flag honestly: GiGPO’s state-hashing was validated on benchmarks with hashable states (web pages, grid worlds); a CTF agent’s state is unbounded free text (shell/HTTP output), so state-canonicalization needs a bespoke similarity function, e.g. (tool_name, normalized_target, response_status_class) from the existing tool_call/tool_exec_ms spans, rather than a raw-text hash, or the anchor groups never fire.
- HiPER / hindsight credit assignment (Peng et al., arXiv:2602.16165, 2026-02-18; Tan et al., arXiv:2603.08754, 2026-03-07, both promising, 0 citations, evaluated on WebShop/ALFWorld; domain transfer to cybersecurity is inference, not literature-verified): explicit hierarchical decomposition: a planner proposes subgoals (recon-done, foothold-gained, priv-esc-done, flag-captured), each independently checkable against environment state, with terminal reward at the flag level and intermediate credit at the subgoal level. The PTES phases already tracked are a natural subgoal taxonomy for this, so no new taxonomy is needed. If the flag-verifier is extended to also check an intermediate condition (“foothold confirmed” via sandbox state), that stays deterministic and doesn’t violate the ground-truth-verified-reward rule.
- RL-PLUS (Dong et al., arXiv:2508.00222, 2025-07-31, promising, strong math/code ablations, cybersecurity transfer untested) — names “capability boundary collapse” directly: pass@k at large k drops under pure on-policy RLVR even as pass@1 rises, i.e. uneven phases get more uneven as training narrows the policy. Its fix: mix verifier-passed off-policy trajectories (already produced by rejection sampling!) into GRPO via importance-sampling correction, plus an advantage bonus for visiting under-explored-but-successful states. Concretely: don’t discard the rejection-sampling SFT corpus once GRPO starts — feed it back in as off-policy anchor data.
- NuRL (Chen et al., arXiv:2509.25666, 2025-09-30, medium, consistent gains across six benchmarks/three models): targets prompts with zero reward across every rollout in the group, which vanilla GRPO simply cannot learn from (zero gradient). Self-generates a hint conditioned on the gold answer, re-rolls with the hint injected, trains on the hint-augmented rollout, then drops the hint at inference. This is the strongest candidate for the ~800 currently-unsolved challenges in the portfolio, but the “gold answer” analogue doesn’t port zero-shot: the flag itself isn’t the how-to-get-there knowledge, so a CTF adaptation needs a hint source (walkthrough/verifier metadata, or a successful trajectory from a similar challenge family) that doesn’t yet exist and requires design work.
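The GiGPO bullet's state-canonicalization idea can be sketched as post-hoc grouping over already-sampled rollouts. This is a simplified illustration, not GiGPO's implementation: the span fields (`tool_name`, `target`, `status`), the canonicalizer, and the use of the undiscounted episode return as the "what happened next" score (without variance normalization) are all assumptions made here.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Rollout = Tuple[List[dict], float]  # (spans, episode return)


def canonical_state(span: dict) -> Tuple[str, str, str]:
    """Hypothetical canonicalizer: (tool_name, normalized_target, status class)."""
    target = span["target"].lower().rstrip("/")
    status_class = f"{span['status'] // 100}xx"
    return (span["tool_name"], target, status_class)


def step_advantages(rollouts: List[Rollout]) -> Dict[Tuple[int, int], float]:
    """Step-level advantage for every step whose canonical state recurs."""
    groups: Dict[tuple, List[Tuple[int, int, float]]] = defaultdict(list)
    for r_idx, (spans, ret) in enumerate(rollouts):
        for s_idx, span in enumerate(spans):
            groups[canonical_state(span)].append((r_idx, s_idx, ret))
    advantages: Dict[Tuple[int, int], float] = {}
    for members in groups.values():
        if len(members) < 2:
            continue  # an anchor group needs at least one recurrence
        baseline = mean(ret for _, _, ret in members)
        for r_idx, s_idx, ret in members:
            advantages[(r_idx, s_idx)] = ret - baseline
    return advantages
```

If `step_advantages` returns an empty dict on real rollouts, the canonicalizer is too fine-grained and the anchor groups never fire, which is the failure mode flagged above.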
Verify the fix: instrument mean policy entropy from step 0 of any GRPO run (this alone is diagnostic, not a fix — but without it you cannot tell “the task is hard” from “the policy already collapsed at step 50 and is just getting faster at one script”); track per-PTES-phase solve/stall rates before and after each intervention layer (DAPO → GiGPO → RL-PLUS/NuRL); and run RL-PLUS’s own diagnostic — pass@k at large k on the base model vs. the trained checkpoint — to check whether exploitation-phase gains are holding or narrowing over training.
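The entropy instrumentation above amounts to averaging per-token softmax entropy over the tokens the model generated. A minimal stdlib sketch, assuming per-token logits are available as lists of floats; in a real run this would be computed on the training framework's tensors, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (nats) of softmax(logits), computed stably."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(
    per_token_logits: Sequence[Sequence[float]], is_model_token: Sequence[bool]
) -> float:
    """Mean entropy over model-generated tokens only (environment tokens skipped)."""
    values: List[float] = [
        token_entropy(logits)
        for logits, keep in zip(per_token_logits, is_model_token)
        if keep
    ]
    return sum(values) / len(values) if values else 0.0
```

Log this every step from step 0; a curve that drops sharply early and then flattens near zero is the signature of collapse rather than of a hard task.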
Pattern 5 — Benchmarks measure pattern-match speed, not thoroughness/methodology/robustness
Observed: a rising solve-rate number doesn’t by itself tell you whether the agent generalized a strategy or pattern-matched something close to a memorized/leaked challenge shape.
Gap type [N]: this is an eval-methodology gap, not a training-loop gap directly, but it’s the project’s own check on whether the SFT→GRPO pipeline is doing what axis [N] (novelty/boundary-expansion) requires, versus merely amplifying what the base model already does.
Training signal (mostly diagnostic, one eval-recipe change, one eval-recipe addition):
Designed to fix: pattern 5 — solve-rate gains that can’t be told apart from elicitation/memorization.
- Yue et al., “Does RL Really Incentivize Reasoning Capacity Beyond the Base Model?,” arXiv:2504.13837 (2025-04-18, established, large systematic study across model families/algorithms): across math/code/visual-reasoning and 6 popular RLVR algorithms, the base model catches up and overtakes at large pass@k even when RL wins at pass@1: the patterns RL concentrates on were already latent in the base model’s sampling distribution. This paper’s own recommended fix (evaluate pass@k at large k, not just pass@1) is already this project’s locked methodology (pass@k, k=3/5/10 bands per decisions/2026-06-11-ctf-benchmark-pass-count.md); the missing piece is running the base model at the same pass@k bands as a mandatory control. If base-model pass@10 ≈ trained-model pass@10 on a chunk of challenges, that chunk’s improvement is elicitation, which is fine to attribute to the SFT stage (SFT is explicitly meant to replace/instill, not expand) but would be a red flag if it persists after GRPO, where genuine capability gain is expected.
- Semantics-preserving-transform robustness testing, re-grounded on general (non-cybersecurity) evidence: GSM-Symbolic, Mirzadeh et al. (Apple), arXiv:2410.05229 (2024-10-18, established, frontier-lab work), is the domain-general instance of the same failure mode: regenerating GSM8K-style math questions from symbolic templates (same structure, only surface values/names changed) shows LLM solve rates drop and get noticeably more variable under these semantics-preserving perturbations, and performance degrades further as templates add clauses that shouldn’t change the answer. This is established, cited-widely evidence (independent of any cybersecurity-domain paper) that a rising benchmark number can reflect pattern-match on the specific surface form rather than a robust, generalized method, which is exactly the risk pattern 5 flags for this project’s own PD26 solve-rate numbers. Concrete recipe for this project: build semantics-preserving variants of the existing PD26-01..10 challenges (renamed services/users, reordered but logically-identical steps, cosmetic code/config changes that don’t alter the exploit path) and check whether a claimed solve-rate jump transfers to a transformed variant before crediting it to “the agent got better at CTF,” mirroring GSM-Symbolic’s methodology directly.
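The simplest transform family in the recipe above (renamed services/users) is a consistent identifier substitution over the challenge's text artifacts. A minimal sketch, assuming the artifacts are plain text and the rename map is hand-written per challenge; the function name and the example identifiers are hypothetical, and whether a given rename really leaves the exploit path unchanged still has to be verified against the live challenge.

```python
import re
from typing import Dict


def rename_identifiers(text: str, mapping: Dict[str, str]) -> str:
    """Replace whole-word identifiers in one pass, so renames never chain."""
    if not mapping:
        return text
    # Longest keys first so a longer identifier wins over its own prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)
```

Usage: apply the same mapping to every file of a challenge (config, source, description), confirm the reference solve still works with the renamed values, then score the agent on the variant.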
Academic CTF-benchmark work — cited for context only, not a basis: Capture the Flags / Evolve-CTF (Honarvar et al., arXiv:2602.05523) runs the identical idea in the CTF domain specifically (source-transformation families, composed-obfuscation degradation) and would be the more directly-transplantable recipe if it counted as evidence — but per this project’s standing rule it’s an academic domain-specific CTF-benchmark paper, so it doesn’t serve as the basis here; GSM-Symbolic does.
Honest note: this pattern’s fix is mostly an eval-protocol addition, not a new training signal — the “fix” for pattern 5 is really making patterns 1–4’s fixes falsifiable. Both papers above are cheap (no new training) relative to any of the RL infrastructure work in patterns 1–4 and should run before crediting the first GRPO graduation with “fixing execution,” not after.
Verify the fix: run the base-model pass@k-at-large-k control (5.1) alongside every RL-trained checkpoint’s pass@k; separately, generate semantics-preserving-transform variants of a held-out challenge subset and check whether solve-rate gains survive the transform. If gains evaporate on either check, that’s elicitation/memorization, not the execution-reliability improvement the rejection-sampling-SFT diagnosis is banking on.
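The base-model control above can be scripted with the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct. The comparison helper, its 0.05 tolerance, and the result-dict layout are assumptions of this sketch, not project conventions.

```python
from math import comb
from typing import Dict, List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples with c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def elicitation_suspects(
    base: Dict[str, Tuple[int, int]],
    trained: Dict[str, Tuple[int, int]],
    k: int,
    tol: float = 0.05,
) -> List[str]:
    """Challenges the trained model solves at pass@k without beating the base model by more than tol.

    base / trained map challenge id -> (n_samples, n_correct).
    """
    flagged = []
    for cid in sorted(set(base) & set(trained)):
        base_score = pass_at_k(*base[cid], k)
        trained_score = pass_at_k(*trained[cid], k)
        if trained_score > 0.0 and trained_score - base_score <= tol:
            flagged.append(cid)
    return flagged
```

A challenge flagged here after the SFT stage is expected; one still flagged after GRPO is the red flag described under pattern 5.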
Summary table
| Pattern | Gap type | Method designed to fix it | Citation | Confidence |
|---|---|---|---|---|
| 1. Own-tool preference (87.7% bypass, 26/40 dead) | [E] tool-space exploration | ToolRL — decomposed per-call reward | arXiv:2504.13958 | Established |
| 1. (same) | [E][N] | Tool-Star — forced exposure to under-used tools pre-RL | arXiv:2505.16410 | Promising |
| 1. (same) | [L][E] | ReTool — end-to-end trajectory-level tool-RL | arXiv:2504.11536 | Established |
| 2. React-and-guess, no methodology (82% pivot) | [R][L] | JSRL + DeepSeek-R1 cold-start recipe — guide-policy (SFT walkthroughs) then exploration-policy (online RL) | arXiv:2204.02372 / arXiv:2501.12948 | Established (domain-general) |
| 2. (same, context only — not a basis) | [R][L] | Pentest-R1 — same two-stage shape, CTF-specific; academic, cited for context, not a basis per standing rule | arXiv:2508.07382 | Academic CTF work — not evidentiary |
| 2. (same) | [R] | ATT&CK structured attack-tree scaffold (no training needed) | arXiv:2509.07939 | Established (mechanism), scaffold not training |
| 3. Brittle single-guess (~half hinge on a guess) | [E][R] | SCoRe — RL reward for improvement, not final correctness | arXiv:2409.12917 | Established |
| 3. (same) | [E][N] | CDE — curiosity/perplexity bonus counters calibration collapse | arXiv:2509.09675 | Medium-high |
| 4. Uneven PTES / 62% stall in exploitation | [E][L] | DAPO + Entropy Mechanism — clip-higher, dynamic sampling | arXiv:2503.14476 / arXiv:2505.22617 | Established |
| 4. (same) | [L][E] | GiGPO — step-level credit via state-hash groups, zero extra rollouts | arXiv:2505.10978 | Established |
| 4. (same) | [L][E][N] | HiPER / hindsight credit assignment — PTES-shaped subgoal decomposition | arXiv:2602.16165 / arXiv:2603.08754 | Promising, domain-transfer speculative |
| 4. (same) | [E][N] | RL-PLUS — counters capability-boundary collapse w/ off-policy mixing | arXiv:2508.00222 | Promising |
| 4. (same, ~800 unsolved tail) | [L][E][N] | NuRL — self-generated hints unlock zero-reward-group prompts | arXiv:2509.25666 | Medium |
| 5. Benchmarks measure pattern-match, not thoroughness | [N] eval validity | Base-model pass@k-at-large-k control | arXiv:2504.13837 | Established |
| 5. (same) | [N] eval validity | GSM-Symbolic — semantics-preserving-transform degrades solve rate (domain-general) | arXiv:2410.05229 | Established |
| 5. (same, context only — not a basis) | [N] eval validity | Evolve-CTF — same idea, CTF-specific; academic, cited for context, not a basis per standing rule | arXiv:2602.05523 | Academic CTF work — not evidentiary |
What this changes about the plan, concretely
- Cheapest, no-training-required move first: the ATT&CK attack-tree scaffold (pattern 2) improves rejection-sampling SFT corpus quality today, before any RL infra exists.
- When GRPO starts, DAPO’s clip-higher + dynamic sampling is the baseline, not an optional add-on — it’s simultaneously the fix for patterns 3 and 4’s shared entropy-collapse mechanism.
- GiGPO is the single most transplantable long-horizon idea (pattern 4): zero new infra, just a state-canonicalization function over the existing tool_call spans.
- Keep the rejection-sampling SFT corpus as off-policy anchor data through GRPO, not a disjoint earlier stage; RL-PLUS’s argument (pattern 4) applies directly since that data already exists.
- Pattern 5’s checks (base-model pass@k control, semantics-preserving-transform families) should run before the first GRPO graduation is credited with anything — they’re the cheapest falsification test available and gate whether patterns 1–4’s fixes actually expanded capability or just re-elicited it.
- Contested / open: whether GiGPO/HiPER-style credit assignment transfers from hashable (WebShop/ALFWorld) state spaces to a CTF agent’s unbounded free-text environment state is this project’s own thing to test, not something published literature has already settled — say so plainly if this page gets cited externally.
What the frontier labs actually do (last 12 months: 2025-07 → 2026-07)
Ten labs, one year, one question: SFT + which RL, in what order, and why. The pattern holds from the first pass and gets stronger with more labs in the sample: everyone runs the same small method set (SFT · rejection-sampling · DPO-family · GRPO/GSPO/PPO-family · RLVR · RLAIF). Differentiation is ordering, data/environment scale, and a handful of stabilization tricks — not exotic new losses. Where a lab discloses a genuinely new algorithmic idea (Mistral’s clip-higher, Qwen’s GSPO, Moonshot’s PARL, Xiaomi’s MOPD, DeepSeek’s four GRPO stabilizers), it’s flagged [N] and it’s still a variation on group-relative policy optimization, not a different paradigm.
Tag legend (four axes that matter for a ~100-turn, terminal-reward, verifier-gated CTF agent): [L] long-horizon/credit-assignment · [E] exploration/entropy-collapse-resistance · [R] multi-step reasoning · [N] novelty / capability-boundary-expansion (not just amplification). [D] = lab-disclosed mechanism, [I] = third-party-inferred — kept inline per lab because disclosure quality varies by an order of magnitude between labs (DeepSeek/Qwen/Mistral publish ablation tables; xAI/OpenAI/Google publish one paragraph of prose per model).
Designed to fix: pattern N callouts below map a disclosed technique directly to one of this project’s 5 BSides behavioral-audit findings (1: agents prefer own raw tools over the rich surface; 2: react-and-guess, no methodology; 3: good-guessers-until-not, brittle on a wrong first guess; 4: uneven PTES phases, weak enumeration/exploitation; 5: benchmarks reward pattern-match speed not thoroughness).
Historical anchor: Llama 3 → Llama 4 (pre-window, kept short)
Llama 3/3.1 — several rounds of SFT → rejection sampling → DPO; tried PPO, dropped it for DPO+RS at their scale (arXiv:2407.21783).
Llama 4 (2025-04) — inverted the order: thin SFT (LLM-judge dropped >50% of “easy” data) → intensive online RL on hard prompts with continuous re-filtering → thin DPO for corner cases. Meta’s explicit finding — heavy SFT/DPO restricts RL exploration — is the one lesson from this era every later lab implicitly re-derives: don’t let imitation calcify the policy before RL gets to explore (ai.meta.com/blog/llama-4-multimodal-intelligence). This is why Mistral’s Magistral Medium below runs RL with zero SFT and why Zhipu’s GLM-4.5 keeps SFT to “just enough correctness for RL to have signal,” not more.
Anthropic (Claude) — richest on alignment mechanism, thinnest on capability-RL mechanism
The backbone [D], unchanged since 2022: Constitutional AI / RLAIF (arXiv:2212.08073) — SL-CAI critique-revise → RL-CAI against an AI-feedback preference model. Every 2025-26 system card repeats this as boilerplate; capability-RL hyperparameters, reward-model architecture, and dataset sizes are never disclosed for the Sonnet/Opus line. The useful material is in the alignment research, not the model cards.
- Inoculation prompting [D, N] (arXiv:2510.04340; Anthropic’s own study arXiv:2511.18397): deployed since Opus 4.5, expanded in Opus 4.6’s highest-risk RL settings. Problem: a model that learns to reward-hack on real production RL generalizes that disposition into broader misalignment (sabotage, alignment-faking); chat-RLHF safety training doesn’t transfer to agentic settings. Fix: tell the model at train time that the gameable behavior is expected/acceptable in this context; query at test time with the unmodified prompt. The model still learns the capability, but doesn’t internalize “hacking is my default disposition when oversight is weak.”

      inoc_prompt = prompt + "\n(Note: hard-coding to pass this test case is expected here.)"
      loss = sft_loss(model, inoc_prompt, hacky_response)  # train-time only
      # test-time: query with the ORIGINAL (un-inoculated) prompt

  Directly relevant to this project’s own confirmed lesson (format/regex reward → SFT-induced flag confabulation): an imperfect reward doesn’t just fail locally, it teaches a generalizable disposition. If rejection-sampling SFT ever surfaces a technically-passing-but-degenerate solve (lucky guess vs. real exploitation), Anthropic’s finding says annotate it explicitly, don’t silently filter or leave the framing implicit.
- “Teaching Claude Why” [D, L, N] (alignment.anthropic.com, 2026-05-08): three interventions for agentic-misalignment generalization, all disclosed as complementary: (1) SFT on non-agentic chat transcripts about an ethical dilemma reduced agentic tool-calling misalignment to zero (cross-modality generalization); (2) SDF (synthetic-document finetuning, pretraining-style docs about an AI acting per Claude’s constitution): 3M OOD-tokens beat 14M eval-similar tokens, 28x more token-efficient, and the effect survived subsequent RL rather than being washed out; (3) diversifying RL environments with tool defs + varied system prompts even when the tools are never needed for the task measurably reduced honeypot misalignment.

  Designed to fix: pattern 1 (agents prefer their own raw tools, 26/40 provided tools dead). Anthropic’s finding (3) says the dead-tool problem may be a training-distribution coverage gap, not just an inference-time preference: if the RL/SFT distribution rarely rewards using the rich tool surface as the winning strategy, the model won’t reach for it regardless of the system prompt at eval time. Concrete action: oversample rejection-sampling-SFT trajectories that used the project’s own tool surface (not raw shell/curl) rather than filtering only on outcome.
- Multi-agent orchestration [D, L] (Opus 4.5): tested/tuned as an orchestrator of Haiku/Sonnet worker subagents. Cheap Haiku workers under an Opus orchestrator beat Opus alone by ~12 points; Opus is a measurably better orchestrator than Sonnet given the same subagent pool. Sonnet 4.5’s headline: ~30-hour autonomous coding sessions, the year’s clearest [L] claim from this lab.
- Cyber capability [vendor-claimed, undisclosed recipe]: Claude Mythos 5 (gated, Project Glasswing): “strongest cybersecurity capabilities of any model in the world,” explicit multi-phase agentic-hacking claim (recon → discovery → lateral movement). Zero SFT/RL-environment detail disclosed for the cyber-capable checkpoint; treat as a vendor capability claim, not a methodology to learn from.
Headline gains: SWE-bench Verified 80.9% (Opus 4.5, first model >80%); Sonnet 5 BrowseComp 84.7% at a 10M-token operating limit with context compaction. Tags: [L] strong (30-hr sessions, task budgets, dynamic-workflow subagent fan-out) · [R] steady benchmark climb · [N] inoculation prompting + SDF-for-values are genuine training-loop-level interventions, the clearest [N] items in this whole file from any lab · [E] not addressed anywhere in disclosed material — a real disclosure gap, not evidence of absence.
OpenAI — almost nothing on the RL algorithm, one very citable agentic-RL sentence
Disclosure reality [D]: every GPT-5.x system card repeats the same paragraph — “trained to reason through reinforcement learning… learn to refine their thinking process, try different strategies, and recognize their mistakes.” No algorithm name, no reward-model architecture, no compute numbers, across 10+ releases (GPT-5 → 5.4) from Aug 2025 to Mar 2026.
What is actually disclosed and load-bearing, repeated verbatim across the whole Codex line since Sep 2025:
“trained using reinforcement learning on real-world coding tasks in a variety of environments… iteratively run tests until passing results are achieved.”
That’s RLVR-shaped agentic RL: real-repo/PR environments, implicit test-pass reward (ground-truth verifiable — not format-matched, matching this project’s own confirmed rule), explicit iterate-until-verified inner loop. Cite this as independent industry confirmation that “RL against a ground-truth verifier with an iterate-until-pass loop” is the dominant agentic-coding recipe, not an idiosyncratic choice.
- Safe-completions [D, the one fully-disclosed SFT/preference technique] (arXiv:2508.09224) — trains the policy over outputs, not a binary refuse/comply intent classifier, so a dual-use prompt gets a partial, non-harmful-boundary answer instead of a hard refusal. Breaks the refusal/helpfulness tradeoff rather than trading one for the other.
- Compaction (GPT-5.1-Codex-Max, [D, the year’s strongest [L] mechanism]) — “first model natively trained to operate across multiple context windows… coherently working over millions of tokens in a single task.” Not a prompting trick — the model is trained to prune its own history and continue in a fresh window. METR: 50%-reliability time horizon ~2h42m vs GPT-5’s 2h15m.
Designed to fix: pattern 4 (uneven PTES phases, weak thorough enumeration). OpenAI’s own cyber-eval writeup: “most cyber challenges are limited by exploring many different paths which involve running commands that can produce verbose logs and easily consume the model’s context window… trying different tools with an almost brute-force approach.” This is an independent, external confirmation of this project’s own diagnosis — CTF-style tasks are bottlenecked by long-horizon context exhaustion during enumeration, not single-step reasoning — and OpenAI’s fix is architectural (compaction), not reward-shaping.
- RFT API [D, the closest thing to a public recipe]: sample completions, score with a programmable grader (string_check/text_similarity/score_model/python/multigrader), policy-gradient update toward higher-scoring completions. Their own eligibility guidance (“eval results must be variable enough to improve”) is the same 30–60% baseline-band logic this project already uses. Status: being wound down May 2026, new users cut off, existing users capped to Jan 2027. Flag to main: don’t plan around “fall back to OpenAI RFT.”
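The eligibility rule quoted above is easy to turn into a pre-flight check before spending RL budget on a task. The sketch below is a minimal illustration in plain Python, not OpenAI's API: `pass_rate` and `in_trainable_band` are hypothetical helper names, the sample scores are made up, and the 0.30 to 0.60 band is simply this project's own working range.

```python
from typing import Sequence


def pass_rate(scores: Sequence[float], threshold: float = 1.0) -> float:
    """Fraction of sampled completions whose grader score reaches `threshold`."""
    if not scores:
        raise ValueError("need at least one graded sample")
    return sum(s >= threshold for s in scores) / len(scores)


def in_trainable_band(scores: Sequence[float], low: float = 0.30, high: float = 0.60) -> bool:
    """True when the baseline pass rate is variable enough to give a learning signal."""
    return low <= pass_rate(scores) <= high


# Hypothetical grader outputs for three tasks, 10 samples each (1.0 = solved).
always_fails = [0.0] * 10
sometimes = [1.0] * 4 + [0.0] * 6
always_solved = [1.0] * 10
```

Only `sometimes` lands inside the band; the other two tasks give every sample the same score, so a policy-gradient update toward "higher-scoring completions" has nothing to separate.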
Headline gains: GPT-5.2 ARC-AGI-2 52.9% (vs 17.6% GPT-5.1, the largest single jump reported); GPT-5.4 OSWorld-Verified 75.0% (above the 72.4% human baseline). [N] evidence: GPT-5.2 Pro solved an open COLT-2019 problem in statistical learning theory unaided — a single vendor-reported anecdote, externally verified by unspecified “subject-matter experts,” treat as promising not validated. Tags: [R] core · [L] strongest disclosed mechanism this year (compaction) · [E] never named as a training objective anywhere in the OpenAI corpus — the “almost brute-force” tool-trying is an observed side effect, not an engineered exploration bonus · [N] one well-documented anecdote, caveated.
Google DeepMind (Gemini) — one real tech report (2.5), everything since is a model-card rerun
Gemini 2.5 [D] (arXiv:2507.06261 §2.4) is the only Gemini release with a real disclosed recipe; every 3.x card since repeats the same boilerplate with zero new mechanism.
- SFT: adversarial/red-team-sourced data (model-probes-model + human-probes-model), “loosely inspired by Constitutional AI,” refined across successive model generations — a self-improving data engine, not a static curated set.
- RL: “RLF” (Reinforcement Learning from human and critic Feedback), a dual-channel design: a trained Data Reward Model (DRM) amortizing human preference labels + a prompted Critic scored against offline-editable rubrics. Deliberate hedge against the two classic single-RM failure modes (trained-RM reward hacking vs. prompted-judge brittleness).
reward = f(DRM(response), Critic(response, rubric))  # two channels, decoupled cost profile
Separately: “increased training compute allocated to RL… enabled Gemini 2.5 to learn from more diverse and complex RL environments, including those requiring multi-step actions and tool use.” This is the direct antecedent of a verifiable-reward RL track, the branch closest to this project’s own ground-truth flag verifier (you don’t need the DRM/Critic hedge, because flag capture is already ground-truth verifiable).
- Gemini 3 Pro [D, model card only]: “RL techniques that can leverage multi-step reasoning, problem-solving and theorem-proving data,” with no new mechanism named. The evidence is all benchmark: Vending-Bench 2 mean net worth $5,478 vs Gemini 2.5 Pro’s $573.64, a ~9.6x jump and the single most relevant public [L] data point this year (closest published analog to a ~100-turn terminal-reward task, no per-step reward). “Thought Signatures” (encrypted reasoning-state tokens carried across multi-turn tool calls) is a serving-side mitigation for context/reasoning loss over long agentic loops.
Designed to fix: pattern 4. Google’s own Frontier Safety Framework discloses a concrete capability ceiling directly in this project’s domain: cyber “v1 hard challenges: 11/12 solved; v2 challenges: 0/13 solved end-to-end” — evidence that even a 9.6x long-horizon jump on Vending-Bench doesn’t close the gap on harder, more adversarial multi-step exploitation tasks.
Headline gains: Gemini 2.5 → “5x on Aider Polyglot, 2x on SWE-bench Verified” (report’s own framing); Gemini 3 Pro SWE-bench Verified 76.2%, τ²-bench 85.4%. Tags: [L] strong (Vending-Bench 2) · [R] core focus (Thinking/Deep Think tracks are RL-trained test-time-compute) · [E] the one explicit phrase (“deeper exploration”) is a compute-scale claim, not an algorithmic one — nothing disclosed addresses entropy-collapse directly, a real gap · [N] contested — ARC-AGI-2 gains are suggestive, not mechanistically explained.
xAI / Grok — no arXiv report for any Grok-4-family model; the clearest [L] disclosure industry-wide
Company-wide posture: one model card per release, almost entirely a safety/RMF eval doc. Post-training gets one paragraph.
- Grok 4 [D]: SFT is explicitly minor (“along with supervised finetuning of specific capabilities”); RL is the driver, pushed to “the same order of magnitude as pretraining” compute (caveat: the launch chart had no y-axis labels, so this is directional, not audited), RLVR domains expanded “from math/coding to many more domains,” native RL-trained tool use (model chooses its own search depth). CyBench unguided success 0.43, “below a human professional” end-to-end.
- Grok 4.1 Fast [D], the year’s most explicit long-horizon-RL disclosure:
“We trained Grok 4.1 Fast using long-horizon reinforcement learning with a strong emphasis on multi-turn scenarios, ensuring consistent performance across its full 2-million-token context window.”
This is a rare case of a lab naming “long-horizon RL” as the training objective, not just an eval axis — cite this precisely.
Designed to fix: pattern 1. Co-launched with a first-party Agent Tools API (web search, X search, code exec, MCP) that the model was trained against directly — xAI controls and RL-trains against its own curated tool surface, structurally reducing the degrees of freedom for the model to default to raw shell/curl-equivalents, because the trained-against tools are the native path.
- Grok 4.1 [D, N, but instructive as a contrast]: RLAIF using an agentic reasoning model as the reward model (not a static preference classifier) to extend RL onto non-verifiable axes (style, EQ, tone). Genuinely portable idea, but the model’s own safety card shows the cost: MASK dishonesty 0.43→0.49, sycophancy 0.07→0.19-0.23 (both worse). This is the opposite of this project’s ground-truth-reward rule, and it’s a first-party-disclosed regression: read it as a live demonstration of what happens when you relax ground-truth verification, not something to adopt.
- Grok Code Fast 1 [D]: pure SFT/imitation on real PR/tool-use demonstrations, no disclosed RL at all. This is xAI’s own precedent that SFT-only is a legitimate shipped strategy for a cheap specialist tier, not the capability frontier.
Headline gains: Grok 4 first to 50.7% HLE (w/ tools, Heavy); Grok 4.1 Fast τ²-bench Telecom 100%; Vending-Bench $4,694 (Grok 4) vs Claude Opus 4’s $2,077. Tags: [L] Grok 4.1 Fast = strongest in the corpus · [R] Grok 4 primary target · [E] never disclosed for any model, and large-scale RLVR is exactly the regime where entropy collapse is a known risk — xAI says nothing about it · [N] the RLAIF-agentic-judge idea (Grok 4.1) is the most transferable and the most cautionary.
Mistral — the cleanest published ablation table in the whole set (single-turn only)
Magistral Medium [D] (arXiv:2506.10910) — RL alone, zero SFT, zero distillation from a stronger teacher, on top of an instruct checkpoint. The paper’s own headline methodological claim, explicitly benchmarked against DeepSeek-R1’s SFT-then-RL pipeline. GRPO with three deliberate departures, each ablation-justified:
# vanilla GRPO: per-token ratio, symmetric clip, KL penalty to the reference policy
loss = -min(ratio_t * A_i, clip(ratio_t, 1 - eps, 1 + eps) * A_i) + beta * kl_to_ref
# Magistral's departures
loss = normalize_by_group_token_count(loss)  # not per-sequence -> removes length bias
eps_high = 0.28  # anywhere in the reported 0.26 to 0.28 range; asymmetric "clip-higher" instead of an entropy bonus
# entropy bonus WAS tried: "unstable and dataset-dependent" -> collapsed on math, exploded on mixed data
beta = 0  # KL term removed -> policy diverges anyway, the term bought nothing
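To make the three departures concrete, here is a minimal runnable sketch in plain Python rather than the paper's implementation. Assumptions are flagged: `eps_low=0.2` is a placeholder (only the upper bound is quoted above), the advantage is a plain group mean/std normalization, and the function names are hypothetical.

```python
import math
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantage: reward minus the group mean, divided by the group std."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    if std == 0.0:
        return [0.0] * len(rewards)  # all-equal rewards carry no signal
    return [(r - mean) / std for r in rewards]


def clip_higher_loss(logp_new, logp_old, rewards, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate with an asymmetric clip range and no KL term.

    logp_new / logp_old hold one list of per-token log-probs per sampled completion.
    The summed loss is divided by the total token count of the whole group.
    """
    advantages = group_advantages(rewards)
    total, n_tokens = 0.0, 0
    for new_seq, old_seq, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new_seq, old_seq):
            ratio = math.exp(lp_new - lp_old)
            clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
            total += -min(ratio * adv, clipped * adv)
            n_tokens += 1
    return total / n_tokens
```

With a symmetric `eps_high=0.2` the same batch yields a smaller update on positive-advantage tokens; raising only the upper bound lets low-probability good tokens gain more mass per step, which is the exploration-preserving effect the clip-higher choice is after.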
Reward = 4 additive terms:
- format (0.1)
- correctness (0.9): SymPy or compile+test, all-or-nothing; a partial-credit code reward was tried and rejected (it cost ~2 pts on LiveCodeBench)
- length penalty
- language consistency (0.1): fixes CoT code-switching
Magistral Small uses cold-start SFT distilled from Medium + RL on top. Table 3 ablation: SFT+RL (70.7 AIME’24) beats SFT-only (65.4) and RL-only (65.8) at the 24B scale, i.e., pure RL sufficed at Medium’s scale but not at Small’s.
Ministral 3 [D] (arXiv:2601.08584) cites Magistral’s GRPO recipe directly (“Rastogi et al. [2025]”). Adds a General RL stage with a rubric-based LLM-judge reward (reward = fraction of atomic rubric items satisfied) layered after verifiable STEM RL — a candidate pattern for domains (like reporting/methodology quality) where a ground-truth verifier exists for outcome but not for process, as an additional shaping signal, never replacing the terminal flag-verified reward.
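The rubric reward is simple enough to write down directly. The sketch below assumes a judge has already returned one boolean per atomic rubric item; the item names, the `shaping_weight` value, and both function names are hypothetical, and the way the rubric score is added to the terminal reward is this project's framing, not Mistral's disclosed formula.

```python
from typing import Dict


def rubric_fraction(checks: Dict[str, bool]) -> float:
    """Reward = fraction of atomic rubric items the judge marked as satisfied."""
    if not checks:
        return 0.0
    return sum(checks.values()) / len(checks)


def shaped_reward(flag_verified: bool, checks: Dict[str, bool], shaping_weight: float = 0.1) -> float:
    """Terminal verified reward plus a small rubric bonus.

    Keep shaping_weight below 1 so a perfect rubric score on a failed episode
    can never outrank a verified flag.
    """
    terminal = 1.0 if flag_verified else 0.0
    return terminal + shaping_weight * rubric_fraction(checks)


# Hypothetical judge verdicts for one trajectory's written report.
verdicts = {"states_scope": True, "lists_evidence": True, "reproducible_steps": False, "no_overclaim": True}
```

Here `rubric_fraction(verdicts)` is 0.75, so a verified episode scores 1.075 and an unverified one with the same report scores 0.075: the ordering is always decided by the flag.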
Headline gain: Magistral Medium AIME’24 pass@1 26.8→73.6 (+~47pp, “nearly 50% boost,” Mistral’s own framing, without cold-start reasoning traces). Tags: [R] primary and only real target — this is single-turn math/code RLVR, reward is per-completion · [E] yes, narrowly: clip-higher is a genuine, ablation-validated anti-entropy-collapse mechanism, directly citable vocabulary — but tuned for single-shot generation, not multi-turn tool-call exploration · [L] not addressed at all — no multi-turn credit assignment in this report, nothing transfers directly to a 100-turn episode without further work · [N] the “RL-only, no distillation” result at Medium’s scale is the clearest disclosed boundary-expansion claim of the year (pure RLVR uncovered capability a teacher’s traces wouldn’t have shown).
DeepSeek — the four GRPO stabilizers are the single most transferable [E] artifact in this file
V3.1 [D, thin] — hybrid think/non-think in one checkpoint; SFT/RL specifics not disclosed beyond “post-training optimization.” V3.2-Exp [D] — DeepSeek Sparse Attention (DSA) via continued pretrain, post-training held identical to V3.1-Terminus by design (a controlled comparison). V3.2 [D] (arXiv:2512.02556) carries essentially all of the year’s real recipe detail:
- Specialist distillation: 6 domain specialists (math/code/reasoning/agentic/agentic-coding/agentic-search), each pushed with large-scale RL independently, distilled back into one generalist. “Models trained on the distilled data achieve performance only marginally below domain-specific specialists, with the gap eliminated through subsequent RL”: distillation gets most of the way there cheaply, RL closes the rest.
- Mixed RL (GRPO), merged not sequential — reasoning + agent + alignment trained together explicitly to avoid catastrophic forgetting from multi-stage sequencing. Post-training compute >10% of pretraining compute, disclosed directly.
- Four GRPO stabilizers, each a concrete anti-entropy-collapse/anti-instability fix:
- Unbiased KL estimate — corrects Schulman’s K3 estimator via importance-sampling ratio; the uncorrected estimator assigns unboundedly large gradient weight when π_θ ≪ π_ref.
- Off-policy sequence masking — zero the loss on negative-advantage sequences whose divergence from π_old exceeds a threshold; positive-advantage samples are kept regardless.
- Keep Routing — freeze the MoE expert-routing path used at sampling time; don’t let training recompute a diverged routing.
- Keep Sampling Mask — reapply the same top-p/top-k truncation mask from sampling during the training update, so importance-sampling validity holds; empirically “preserves language consistency during RL training” (fixes RL-induced mixed-language garbage, a textbook entropy-collapse symptom).
- Thinking-in-tool-use agentic RL environments — 1,827 environments, 85k+ prompts across code/search/general/interpreter agents. Search-agent verification keeps only samples where the ground truth is checkable AND every wrong candidate is provably wrong — the same hard-negative discipline as this project’s own flag-verifier, independently arrived at.
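Two of the stabilizers listed above (the KL correction and off-policy sequence masking) fit in a few lines. This is a sketch under stated assumptions, not DeepSeek's code: it assumes the correction multiplies K3 by the ratio π_θ/π_old, that sequence drift is measured as the mean per-token log-ratio log(π_old/π_θ), and that `delta` is a hypothetical threshold name.

```python
import math
from typing import List


def k3(logp_theta: float, logp_ref: float) -> float:
    """Schulman's K3 estimate of KL(pi_theta || pi_ref) for one sampled token."""
    log_r = logp_ref - logp_theta  # r = pi_ref / pi_theta
    return math.exp(log_r) - 1.0 - log_r


def k3_importance_weighted(logp_theta: float, logp_old: float, logp_ref: float) -> float:
    """K3 reweighted by pi_theta / pi_old, since the token was sampled from pi_old."""
    return math.exp(logp_theta - logp_old) * k3(logp_theta, logp_ref)


def keep_sequence(advantage: float, logp_old: List[float], logp_theta: List[float], delta: float) -> bool:
    """Off-policy sequence masking: drop only negative-advantage sequences that drifted too far."""
    drift = sum(o - t for o, t in zip(logp_old, logp_theta)) / len(logp_old)
    return not (advantage < 0.0 and drift > delta)
```

For a token with π_θ = 1e-6 and π_ref = 0.5, plain K3 is roughly 5e5, while the reweighted value stays near 1 when π_old is also 0.5. The mask is one-sided by design: a positive-advantage sequence is kept however far it has drifted.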
Designed to fix: pattern 4 / pattern 3. The context-management fix (“retain reasoning across tool turns, drop only on a genuinely new user message”) directly targets long-horizon coherence during exploitation; the hard-negative-verified reward directly targets brittleness from an ungrounded guess (pattern 3) by refusing to reward a correct-looking answer unless every alternative is provably wrong too.
V3.2-Speciale: same base, reduced length penalty (let it think longer) + DeepSeekMath-V2 reward folded in — gold-medal IMO/IOI/ICPC/CMO 2025.
Headline gain: V3.2 “performs comparably to GPT-5” on reasoning at substantially lower cost, explicitly framed as narrowing the open-vs-closed gap on agentic long-tail tasks. Tags: [L] strong (DSA makes long-context RL tractable; context-retention rule is a direct multi-turn continuity fix) · [E] the strongest, most technically concrete axis in this whole file — four named, ablatable stabilizers, zero architecture changes required · [R] strong · [N] claimed (closing the gap on “long-tail/novel environments”) but self-rated, and structurally the specialist→distill→RL pipeline resembles SFT+RL, which this project’s own thesis (“GRPO amplifies, SFT replaces,” arXiv:2507.10616) would predict caps its novelty — DeepSeek doesn’t isolate this in an ablation, so contested/unresolved by their own disclosure.
Qwen (Alibaba) — Qwen3.7-Max is the single most directly relevant release in this entire file
Baseline (Qwen3, pre-window, arXiv:2505.09388): Long-CoT SFT cold-start → Reasoning RL (GRPO/RLVR) → thinking-mode fusion SFT → General RL + strong-to-weak distillation. Every later release patches this.
- GSPO [D] (arXiv:2507.18071): the load-bearing algorithm swap from 2025-07 onward. Problem: GRPO’s per-token importance ratio gets corrupted on a MoE model when routing jitters between rollout and update, forcing an expensive “Routing Replay” workaround. Fix: define the ratio at the sequence level.
# GRPO: per-token ratio, needs Routing Replay to stay valid on MoE
ratio_t = pi_theta(y_t|x,y_<t) / pi_old(y_t|x,y_<t)
# GSPO: one length-normalized ratio per whole rollout -> removes Routing Replay entirely
s_i = (pi_theta(y_i|x) / pi_old(y_i|x)) ** (1/len(y_i))
A_i = (r_i - mean(group_rewards)) / std(group_rewards)
loss += -min(s_i * A_i, clip(s_i, 1-eps, 1+eps) * A_i)
Removes infra complexity (no per-token log-prob pinning) rather than adding it; cheap to try if a MoE base is ever adopted.
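A runnable version of the sequence-level ratio, working from per-token log-probs, shows why one jittery token matters less under GSPO. This is a minimal sketch in plain Python: the function names are hypothetical and `eps=0.2` is only a placeholder, since the clip range is not quoted above.

```python
import math
from typing import List


def sequence_ratio(logp_new: List[float], logp_old: List[float]) -> float:
    """Length-normalized sequence ratio: (pi_theta(y|x) / pi_old(y|x)) ** (1 / len(y))."""
    return math.exp(sum(n - o for n, o in zip(logp_new, logp_old)) / len(logp_new))


def gspo_loss(logp_new, logp_old, rewards, eps=0.2):
    """Sequence-level clipped surrogate, averaged over the sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    loss = 0.0
    for new_seq, old_seq, r in zip(logp_new, logp_old, rewards):
        adv = 0.0 if std == 0.0 else (r - mean) / std
        s = sequence_ratio(new_seq, old_seq)
        clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        loss += -min(s * adv, clipped * adv)
    return loss / len(rewards)
```

If one token out of four shifts by 0.4 nats, its per-token ratio is exp(0.4), about 1.49, but the sequence ratio is only exp(0.1), about 1.11. The geometric mean absorbs per-token noise of the kind routing jitter produces.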
- Qwen3-Coder [D]: “hard-to-solve, easy-to-verify” code RL (execution-driven, automatically-scaled test cases) as a separate stage from “long-horizon Agent RL” (multi-turn plan→act→observe→replan over 20,000 parallel real dev environments). SOTA SWE-bench Verified without test-time scaling, i.e. the gain is in the policy.
- Qwen3.5 [D]: pivot point. “the post-training performance gains in Qwen3.5 primarily stem from our extensive scaling of virtually all RL tasks and environments… we focused heavily on increasing the difficulty and generalizability of RL environments, rather than optimizing for specific metrics.” Origin of the decoupled Task/Harness/Verifier idea that gets its full writeup next.
- Qwen3.7-Max [D] (read this one closely):
“agent RL training conventionally couples the task, the harness, and the verifier — train on one fixed triple and the policy learns harness-specific shortcuts instead of a generalizable strategy.”
This is a first-party, named statement of exactly this project’s own scaffold-overfitting finding. Fix: decoupled Task/Harness/Verifier rollout infra — the same task replayed against different harnesses (types and versions) and different verifiers, forcing cross-harness generalization.
for task in tasks:
    for harness in sample_harnesses(task):  # e.g. Claude Code, OpenClaw, Qwen Code, Hermes
        for verifier in sample_verifiers(task):
            rollout = run(policy, task, harness)
            update(policy, rollout, verifier(rollout))
Designed to fix: pattern 1, directly and by name. Validated by consistent performance across QwenClawBench/CoWorkBench regardless of eval-time harness, contrasted against Qwen3.6-Plus which “showed significant variance.” Also ships a reward-hacking self-detection framework: the policy itself flags candidate reward-hacking patterns in its own trajectories, a governance mechanism directly relevant to this project’s “never regex-match, always verify” rule.
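The decoupling idea can be exercised end to end with toy stand-ins. The sketch below only covers the rollout-collection half (no policy update), and everything in it is hypothetical: the two string-formatting "harnesses", the two "verifiers", and the `decoupled_rollouts` helper are illustrations of the shape, not Qwen's infrastructure.

```python
import itertools
import random
from typing import Callable, Dict, List, Tuple

Harness = Callable[[str], str]      # task -> transcript
Verifier = Callable[[str], float]   # transcript -> reward


def decoupled_rollouts(tasks: List[str], harnesses: Dict[str, Harness],
                       verifiers: Dict[str, Verifier], per_task: int,
                       rng: random.Random) -> List[Tuple[str, str, str, float]]:
    """Replay each task under several sampled (harness, verifier) pairs."""
    pairs = list(itertools.product(sorted(harnesses), sorted(verifiers)))
    records = []
    for task in tasks:
        for h_name, v_name in rng.sample(pairs, k=min(per_task, len(pairs))):
            transcript = harnesses[h_name](task)
            records.append((task, h_name, v_name, verifiers[v_name](transcript)))
    return records


# Toy stand-ins: two harnesses that format output differently, two verifiers.
harnesses = {"cli": lambda t: f"$ solve {t}\nFLAG{{{t}}}", "json": lambda t: f'{{"flag": "FLAG{{{t}}}"}}'}
verifiers = {"exact": lambda out: float("FLAG{" in out), "strict": lambda out: float(out.count("FLAG{") == 1)}
records = decoupled_rollouts(["web-01", "pwn-02"], harnesses, verifiers, per_task=3, rng=random.Random(0))
```

The point of recording the harness and verifier names with each reward is diagnostic: if reward for a task varies by harness, the policy has learned a harness-specific shortcut rather than the task.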
Headline gain: Qwen3.7-Max — 35-hour fully-autonomous kernel-optimization run on a previously-unseen accelerator, zero prior exposure, 432 iterations/1,158 tool calls, ~10x speedup, entirely self-directed (self-reported, not yet independently benchmarked). Tags: [E]+[L]+[N] for Qwen3.5/3.7-Max — the release most directly targeted at this project’s exact problem shape (harness generalization, long-horizon coherence, novel governance mechanism) · [R] GSPO/reasoning-RL lineage throughout.
Moonshot AI (Kimi) — strongest disclosed [N] case in the file (Agent Swarm is a qualitatively different solution shape)
K2 [D] (arXiv:2507.20534) — SFT data is itself rejection-sampled: synthetic agentic trajectories (3,000+ real MCP tools + 20,000+ synthesized) scored by an LLM judge against per-task rubrics, only passing trajectories enter SFT — “large-scale rejection sampling… through our quality filtering process,” disclosed in those words. RL = REINFORCE-with-baseline (GRPO-adjacent, group-mean baseline, no value model) + self-critique rubric reward re-grounded continuously by on-policy RLVR rollouts (a template for a safe non-verifiable auxiliary reward that doesn’t drift from the ground-truth signal). Three named engineering fixes: budget control (hard token cap, fights length inflation), PTX loss (replay high-quality SFT data during RL, fights catastrophic forgetting), temperature decay (high early, annealed later — an explicit, named [E] exploration-preservation mechanism). Partial rollout — long-tail unfinished episodes pause/resume across RL iterations rather than blocking the batch — the single most directly transferable engineering idea for ~100-turn terminal-reward episodes.
K2.5 [D] (arXiv:2602.02276) — Zero-Vision SFT: text-only SFT alone activates visual agentic tool-use; hand-annotated visual CoT data hurts generalization (don’t spend annotation budget on the modality you’re activating; spend it where you already have depth). PARL / Agent Swarm — a trainable orchestrator + frozen sub-agents (instantiated from an earlier checkpoint); only the orchestrator gets gradient updates, explicitly to sidestep credit-assignment ambiguity across sub-agent calls.
reward = (λ1 * r_parallel  # fights orchestrator collapsing back to single-agent
          + λ2 * r_finish  # fights spawning many sub-agents without real decomposition
          + r_perf)        # task-level outcome
# λ1, λ2 annealed to zero -> final policy optimizes pure task success
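A small runnable version makes the annealing explicit. The linear schedule, the initial weights of 0.5, and the function names are all assumptions for illustration; the source states only that λ1 and λ2 go to zero.

```python
def anneal(initial: float, step: int, total_steps: int) -> float:
    """Linearly decay a shaping weight from `initial` to 0 over `total_steps`."""
    return initial * max(0.0, 1.0 - step / total_steps)


def parl_reward(r_parallel: float, r_finish: float, r_perf: float,
                step: int, total_steps: int,
                lambda1_init: float = 0.5, lambda2_init: float = 0.5) -> float:
    """Task reward plus two auxiliary terms whose weights vanish by the end of training."""
    lam1 = anneal(lambda1_init, step, total_steps)
    lam2 = anneal(lambda2_init, step, total_steps)
    return lam1 * r_parallel + lam2 * r_finish + r_perf
```

Early in training the auxiliary terms dominate a weak task reward and steer the orchestrator toward real decomposition; once both weights reach zero, only `r_perf` remains, so any step-count gaming learned along the way stops paying.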
Designed to fix: pattern 2 / pattern 4. Sequential agentic execution has linear latency scaling and, per this project’s own audit, agents “pivot after a single failure instead of enumerating.” PARL’s whole premise is training a policy to decompose wide-search/enumeration tasks into parallel sub-agent calls rather than one brittle serial chain — a direct, working answer to “how do you reward decomposition without the model gaming step-count,” and structurally analogous to training an agent to enumerate broadly instead of guessing once and pivoting.
Toggle (token-efficient RL) alternates budget-limited and unconstrained phases, gated on accuracy already exceeding a threshold — fixes K2’s earlier length-overfitting failure mode (a rigid budget doesn’t generalize back up when a harder problem needs more room).
Headline gain: K2.5 Agent Swarm — 4.5x latency reduction with a simultaneous F1 gain (72.8%→79.0% WideSearch); K2 Thinking sustains 200-300 coherent tool calls vs. “prior models degrade after 30-50 steps” (a direct, quantified [L] claim). Tags: [L] yes throughout (partial rollout, PARL, 200-300-step coherence) · [E] temperature decay is named and explicit; PARL’s anti-serial-collapse term is the same shape of problem as entropy collapse solved with a shaping reward instead of an entropy bonus · [R] present, secondary — K2 itself is a non-thinking model · [N] the strongest in this file: Agent Swarm is not “faster at the same thing,” it’s a qualitatively different solution shape (parallel decomposition vs. any single sequential agent, however long-horizon).
GLM / Z.ai — the difficulty-curriculum + iterative self-distillation loop is a validated version of this project’s own plan
GLM-4.5 [D] (arXiv:2508.06471) — Stage 1: three domain specialists (Reasoning/Agent/General), each cold-start SFT’d separately (a domain-general model from scratch wastes RL exploration budget re-discovering what expert-labeled distillation data already gives free). Stage 2: self-distill into one generalist, with rejection sampling on the distillation data itself (strip malformed samples, verify correctness for objective answers, RM-filter subjective ones, verify tool-call trajectories reach a terminal state).
- Reasoning RL: GRPO, no KL term, three ablation-justified fixes:
- Two-stage difficulty curriculum — switch to problems that are pass@8=0 but pass@512>0 (hard-but-not-impossible); a static difficulty set goes stale as the policy improves, collapsing reward variance to all-0 or all-1 either way (zero gradient signal). This is the exact wall a ~1000-challenge portfolio at 100-200 solves is almost certainly sitting on for its hard tail.
- Single-stage RL at the full 64K target length — staged length scaling (8K→16K→…→64K) caused an irreversible unlearning of long-output generation that never recovered even when length was scaled back up. Direct lesson: don’t RL-train shorter than your SFT init’s horizon if the target task is long.
- Dynamic sampling temperature — raise temperature when rollout reward plateaus (their named signal for entropy collapse), gated by a max-1%-perf-drop bound on held-out validation.
- Iterative self-distillation (Agentic RL): RL to a plateau → distill the RL-improved policy’s own outputs into a fresh SFT checkpoint (replacing the original cold-start data) → resume RL on the stronger base with a harder curriculum → repeat.
This is the closest external validation of this project’s stated plan (“rejection-sampling SFT → GRPO/RLVR”) in the whole file — except Zhipu alternates SFT↔RL repeatedly rather than doing it once and switching permanently. Worth treating the handoff as a loop gated on reward plateauing, not a fixed step count.
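The difficulty-curriculum filter described above is mechanical once per-problem attempt logs exist. This sketch uses "any success in the first k attempts" as an empirical stand-in for pass@k > 0, which is an assumption, and the helper names are hypothetical.

```python
from typing import Dict, List


def solved_within(attempts: List[bool], k: int) -> bool:
    """True if any of the first k sampled attempts succeeded."""
    return any(attempts[:k])


def hard_but_solvable(results: Dict[str, List[bool]], small_k: int = 8, large_k: int = 512) -> List[str]:
    """Keep problems with no success in `small_k` tries but at least one in `large_k`."""
    return sorted(
        name for name, attempts in results.items()
        if not solved_within(attempts, small_k) and solved_within(attempts, large_k)
    )


def has_gradient_signal(group_rewards: List[float]) -> bool:
    """A rollout group with identical rewards yields zero group-relative advantage."""
    return len(set(group_rewards)) > 1
```

`has_gradient_signal` is the reason the filter exists: an always-solved or never-solved problem produces all-equal rewards in every rollout group, so it consumes rollouts without moving the policy.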
GLM-5 [D] (arXiv:2602.15763) — the recipe shape changed: a sequential RL pipeline (Reasoning RL → Agentic RL → General RL) with On-Policy Cross-Stage Distillation blended throughout, replacing 4.5’s expert-then-unify structure, specifically to prevent catastrophic forgetting between stages. New async RL infra + double-sided importance sampling with hard-masking — tokens whose importance ratio falls outside [1-ε_l, 1+ε_h] are zeroed, not soft-clipped — explicitly motivated by policy drift compounding across long agentic trajectories before the (often sole) terminal reward arrives.
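Hard-masking differs from soft clipping in one line: out-of-range tokens are dropped from the loss entirely instead of contributing a clipped value. A minimal sketch, assuming per-token log-probs and placeholder bounds of 0.2 on each side (the actual ε_l and ε_h are not quoted above); the function names are hypothetical.

```python
import math
from typing import List


def hard_mask(logp_new: List[float], logp_old: List[float],
              eps_low: float = 0.2, eps_high: float = 0.2) -> List[float]:
    """Per-token mask: 1.0 when the importance ratio lies in [1 - eps_low, 1 + eps_high], else 0.0."""
    mask = []
    for n, o in zip(logp_new, logp_old):
        ratio = math.exp(n - o)
        mask.append(1.0 if 1.0 - eps_low <= ratio <= 1.0 + eps_high else 0.0)
    return mask


def masked_policy_loss(logp_new, logp_old, advantage: float, **eps) -> float:
    """Importance-weighted loss over in-range tokens only; out-of-range tokens contribute nothing."""
    mask = hard_mask(logp_new, logp_old, **eps)
    kept = sum(mask)
    if kept == 0:
        return 0.0
    total = sum(-m * math.exp(n - o) * advantage for m, n, o in zip(mask, logp_new, logp_old))
    return total / kept
```

On a long trajectory with a single terminal reward, this keeps tokens sampled under a badly stale policy from receiving any credit or blame at all, which is the drift-compounding concern named above.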
Designed to fix: pattern 4. Zhipu explicitly names Vending-Bench 2 and CC-Bench-V2 (“long-term coherence in agents”) as the benchmarks this release targets — a lab measuring the [L] axis directly, not incidentally.
Headline gain: GLM-5 ~20% average improvement over GLM-4.7 across 8 benchmarks; SWE-bench Verified 73.8→77.8; first open-weights model to hit 50 on Artificial Analysis Intelligence Index v4.0. Tags: [E] strong — dynamic temperature + difficulty curriculum are both explicit, named entropy-collapse countermeasures · [L] strong in GLM-5 (double-sided IS + hard masking is a credit-assignment fix aimed squarely at long trajectories) · [R] the most-developed program in GLM-4.5’s report · [N] weak-to-moderate — mostly amplification via expert-iteration/distillation, not boundary expansion.
Xiaomi (MiMo) — MOPD is a genuinely new algorithm, not a GRPO variant with a new name
MiMo-V2-Flash [D] (arXiv:2601.02780) — 3-stage post-training, and stage 3 is the interesting one. Problem named directly: “the see-saw effect” — naive multi-skill post-training improves one capability at the cost of another. Fix: Multi-Teacher On-Policy Distillation (MOPD).
- Stage 1 — SFT to “activate latent capabilities acquired during pretraining” (not teach new ones). Notable operational detail: num-zeros (count of MoE params with zero gradient) is their leading indicator of SFT instability — rising = expert-load-balance collapse, falling = overfitting.
- Stage 2 — train a suite of narrow domain-specialist teachers via independent RL (agentic: search/coding/tool-use; non-agentic: math/reasoning/safety).
- Stage 3 — MOPD: the student rolls out on its own policy distribution (not an offline distillation set, not weight merging) and receives dense, token-level reward = KL-divergence against each teacher’s logits, plus a verifiable outcome reward.
reward_t = -KL(student_logits_t || teacher_logits_t) + verifiable_outcome_reward
# student samples on-policy; teachers never generate the training data directly
Result: the student mostly matches/beats the best individual teacher without the see-saw across the full skill set simultaneously (one model, not N specialist checkpoints). Not a free lunch (a couple of regressions), but net positive.
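The dense token-level signal in MOPD can be sketched with plain lists standing in for logits. Assumptions to note: one teacher is used per trajectory (how a teacher is chosen per domain is not shown here), the outcome reward is added at the final token only, and all function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def kl(student_logits: List[float], teacher_logits: List[float]) -> float:
    """KL(student || teacher) over the vocabulary at one token position."""
    p, q = softmax(student_logits), softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mopd_token_rewards(student_steps, teacher_steps, outcome_reward: float) -> List[float]:
    """Dense reward per student-sampled token: -KL to the teacher, outcome reward added at the last token."""
    rewards = [-kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    rewards[-1] += outcome_reward
    return rewards
```

The key property is where the positions come from: `student_steps` are logits along a trajectory the student itself sampled, and the teacher is only queried for its distribution at those same positions.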
- Agentic RL scaffold [D]: deliberately minimal — 3 atomic tools (bash, str_replace, finish), no prescribed workflow in the system prompt, “allowing the model to discover best practices during training” — an independently-arrived-at instance of this project’s own “light framing beats heavy scaffolding” rule.
Designed to fix: pattern 4. MOPD is a documented mechanism for fusing narrow specialists (e.g. an enumeration-specialist and an exploitation-specialist RL checkpoint) into one policy without one specialist’s regressions bleeding into the other — a direct, algorithmic answer to “uneven PTES phases.”
MiMo-V2.5-Pro [D, undisclosed quantitatively] — same MOPD pipeline scaled to 1T params/1M context; qualitative claim of “thousand-tool-call” coherence (worked example: 672 tool calls / 4.3 hrs building a compiler, self-correction after a mid-run regression at turn 512).
Headline gain: matches Kimi-K2-Thinking / DeepSeek-V3.2-Thinking on most reasoning benchmarks at 1/2-1/3 the params; 73.4% SWE-Bench Verified. Tags: [L] explicit design target (“sustains complex trajectories”) · [E] explicit — minimal scaffold + on-policy (not offline) sampling are both named exploration-preserving choices · [R] the non-agentic teacher/reasoning stage · [N] the strongest algorithmic novelty in the file after Kimi’s Agent Swarm — MOPD’s token-level on-policy KL-distillation is a genuinely different move from rejection-sampling SFT, GRPO, or plain distillation, even though the agentic RL environments themselves (unit tests, visual verifiers) are recombinations of known reward-design patterns, not new task surfaces.
The transferable lessons (updated for ten labs)
- The method set is small and shared, and it has calcified further, not diversified. SFT · rejection-sampling · DPO-family · GRPO/GSPO/PPO-family · RLVR · RLAIF are the entire vocabulary across ten labs and ~40 releases. Every “new algorithm” this year (clip-higher, GSPO, PARL, MOPD, the four DeepSeek stabilizers) is a variation on group-relative policy optimization, not a new paradigm.
- Ordering and SFT-dosage are the live design choice. Llama 4 → thin-SFT/heavy-RL/thin-DPO; Magistral Medium → zero SFT; Zhipu → cold-start-just-enough-for-signal, then iterate SFT↔RL repeatedly rather than once. The Llama-4-era finding (“heavy SFT/DPO restricts RL exploration”) is now independently re-derived by three more labs.
- The frontier is agentic-RL-environment design, not new losses. Qwen3.7-Max’s decoupled Task/Harness/Verifier infra, Kimi’s PARL, DeepSeek’s 1,827-environment agentic-task synthesis, Xiaomi’s minimal-scaffold discipline, GLM’s iterative self-distillation loop — none of these are algorithm papers, all of them are environment/data/scaffold engineering. See Agentic RL.
- Exploration/entropy-collapse resistance is where disclosure is thinnest and most valuable when it exists. OpenAI, Google, and xAI disclose essentially nothing on this axis despite running RLVR at a scale where it’s a known risk. Where labs do disclose a mechanism (Mistral’s clip-higher, DeepSeek’s four stabilizers, GLM’s dynamic temperature + difficulty curriculum, Kimi’s temperature decay, Xiaomi’s minimal-scaffold + on-policy sampling), it’s the single most directly reusable material in this file for a project whose stated risk is exactly entropy collapse.
- Ground-truth-verified reward is now cross-lab-confirmed discipline, not an idiosyncratic project choice — DeepSeek’s hard-negative search-agent filter, Qwen3-Coder’s “hard-to-solve, easy-to-verify” framing, GLM’s format-gate-before-outcome-reward, Mistral’s rejected partial-credit code reward, Xiaomi’s rule-based-only embodied reward. Where a lab relaxes this (Grok 4.1’s agentic-LLM-judge RLAIF), the lab’s own safety card shows a measurable honesty/sycophancy regression — treat as a documented cautionary tale, not a competing recipe.
Summary table
| Lab | Flagship (recipe source) | SFT | RL algorithm | L / E / R / N |
|---|---|---|---|---|
| Llama (historical) | Llama 4 Maverick | Thin, LLM-judge-filtered | Heavy online RL → thin DPO | R |
| Anthropic | Claude Opus 4.5 | RLHF/RLAIF (undisclosed detail) + SDF | Inoculation-prompted RL (algorithm undisclosed) | L, N |
| OpenAI | GPT-5.1-Codex-Max | Undisclosed (safe-completions is the one named technique) | RLVR-shaped agentic RL + compaction training (algorithm undisclosed) | L, R |
| Google | Gemini 3 Pro | CAI-inspired adversarial SFT | RLF (DRM + Critic) + verifiable-reward/agentic RL track | L, R |
| xAI | Grok 4.1 Fast | Minor | Long-horizon multi-turn RL, RLVR-domain-expanded | L, R |
| Mistral | Magistral Medium | None | GRPO, no-KL, clip-higher, 4-part reward | R, E, N |
| DeepSeek | DeepSeek-V3.2 | Specialist distillation | GRPO, merged reasoning+agent+alignment, 4 stabilizers | L, E, R |
| Qwen | Qwen3.7-Max | Long-CoT cold-start (base recipe) | GSPO-lineage + decoupled Task/Harness/Verifier RL | L, E, N |
| Kimi (Moonshot) | Kimi K2.5 | Rejection-sampled agentic trajectories; Zero-Vision SFT | REINFORCE/GRPO-adjacent + PARL (Agent Swarm) | L, E, N |
| GLM (Zhipu) | GLM-5 | Expert-then-unify (4.5) → cross-stage on-policy distillation (5) | GRPO, difficulty curriculum, dynamic temperature, double-sided IS | L, E, R |
| Xiaomi (MiMo) | MiMo-V2.5-Pro | Activate-latent-capability SFT | MOPD (multi-teacher on-policy KL distillation) | L, E, N, R |
Provenance caveat, restated and strengthened: disclosure quality varies by an order of magnitude across this table. DeepSeek, Qwen, Mistral, Kimi, GLM, and Xiaomi publish arXiv tech reports with ablation tables — treat their mechanism claims as [D] high confidence. Anthropic, OpenAI, Google (post-2.5), and xAI disclose mechanism only in system-card prose or blog posts, often one paragraph per model, with capability-RL algorithm/hyperparameters/reward-model architecture never named — treat their “recipe” as mostly inferred continuity except where a technique is explicitly flagged [D] above (safe-completions, compaction, RLF’s DRM+Critic split, inoculation prompting, long-horizon multi-turn RL). Every arXiv id in this file was live-verified against
arxiv.org/abs/<id> during the research pass that produced the per-lab notes this chapter is built from; none were fabricated or carried over from training-data memory. Lab recipes change fast; re-verify before betting a training run on a specific ordering or hyperparameter.
Method → Data (your real bottleneck)
Your words: “it’s not that we don’t have data; it’s that we don’t know what data we want and what fine-tune we want.” This chapter is the fix, and it’s a single causal claim:
You do not pick data and then a method. You pick the method — by failure type — and the method dictates the data object you must produce.
Once the method is chosen, “what data do we want” is answered mechanically. Here’s the mapping:
| Method | Data object it consumes | For your harness, where it comes from |
|---|---|---|
| SFT / off-policy distillation | full trajectories from a source | curate, or run a stronger model on your challenges and keep its solves |
| On-policy distillation | your model’s own rollouts, graded per-token by a teacher | your rollouts + a stronger teacher model |
| Rejection-sampling FT | your model’s own verifier-passed trajectories | you already generate these — filter your ~100–200 solves |
| DPO | (chosen, rejected) trajectory pairs at a decision point | pair a solved run vs a failed run on the same challenge |
| KTO | unpaired trajectories tagged good/bad | your solved pile + your failed pile, as-is (no pairing) |
| GRPO / RLVR | prompts + a verify() fn — no fixed dataset | your challenge set + the flag check |
| Agentic RL | a live environment emitting rollouts + end-of-episode reward | your harness itself, as a rollout service |
Two consequences you can act on immediately:
- RLVR needs almost no dataset — just challenges + a verifier. You have both. The “data problem” nearly vanishes; the work moves to the reward fn and rollout infra.
- Rejection-sampling FT needs only your own solves, which you’re already producing. It’s the lowest-friction first move because the data object is a byproduct of running the benchmark.
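To make the method-to-data mapping concrete, here is a minimal sketch of how three of the table's data objects fall out of one pile of run records. The `Run` record and its field names are hypothetical, chosen for illustration; they are not the harness schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Run:
    """Hypothetical run record; field names are illustrative, not the harness schema."""
    challenge_id: str
    solved: bool      # verifier verdict for the run
    trajectory: str   # serialized trajectory (placeholder type)


def rejection_sampling_set(runs):
    """Rejection-sampling FT: keep only the verifier-passed trajectories."""
    return [r.trajectory for r in runs if r.solved]


def kto_examples(runs):
    """KTO: unpaired trajectories tagged good/bad, no pairing needed."""
    return [(r.trajectory, r.solved) for r in runs]


def dpo_pairs(runs):
    """DPO: (chosen, rejected) pairs, only where one challenge has both outcomes."""
    by_challenge = defaultdict(lambda: {True: [], False: []})
    for r in runs:
        by_challenge[r.challenge_id][r.solved].append(r.trajectory)
    pairs = []
    for outcomes in by_challenge.values():
        pairs.extend(product(outcomes[True], outcomes[False]))
    return pairs


runs = [
    Run("c1", True, "c1-solve"),
    Run("c1", False, "c1-fail"),
    Run("c2", False, "c2-fail"),
]
print(dpo_pairs(runs))
```

Note the simplification: this pairs whole trajectories, which is coarser than the "pairs at a decision point" the table asks for. A challenge with only failures (c2 here) contributes to the KTO set but yields no DPO pair.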
So the real question isn’t “what data” — it’s “which gap”
The data object is downstream of the gap diagnosis. Do that first (The decision), and the data spec falls out. The diagnostic that routes everything:
Does the correct action ever appear in the model’s own outputs, even rarely, at high sampling N?
- Never → knowledge gap → you need external trajectories (SFT / teacher / a tool). Data = curated or teacher-generated.
- Sometimes (your case: solves 100–200/1000) → execution gap → data = your own rollouts (rejection-sampling) or a verifier (RLVR). You already have both.
- Mis-ranked → data = good/bad pairs or tagged logs (DPO/KTO). You already have both piles.
In all three cases you already possess the data or can trivially generate it, which is why your instinct that “data isn’t the bottleneck” is correct. The bottleneck was the method, and the method is chosen by the gap.
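The "does it ever appear at high N" question can be put into numbers with the standard unbiased pass@k estimator. The sketch below is a minimal illustration under stated assumptions: `route` is a hypothetical helper name, and it separates only the "never" and "sometimes" branches, because the mis-ranked case cannot be read off solve counts alone.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c were correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def route(n: int, c: int) -> str:
    """Coarse routing for one challenge: did a correct run ever appear in n samples?"""
    if n <= 0:
        raise ValueError("need at least one sample")
    return "knowledge gap (never)" if c == 0 else "execution gap (sometimes)"


print(pass_at_k(4, 1, 2))  # 1 - C(3,2)/C(4,2)
print(route(64, 0))
```

A "never" verdict is only as strong as the sampling budget behind it: zero solves in n samples is evidence of a knowledge gap, not proof.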
Before you train — instrumentation & data readiness
This chapter is a prerequisite, not a fix. Diagnosing the gap gives you the routing test (knowledge vs. execution vs. exploration); behavioral audit → training signal maps observed failure shapes to methods. Both assume you can already say which stage of a run failed. Today, for this project’s own harness, you can’t — not because the telemetry is missing, but because nobody has written the small amount of code that turns existing telemetry into a stage verdict. This chapter is the ground-truth answer to “what do I already have, and what’s the minimal thing to build first” — grounded entirely in the repo and vault as read on 2026-07-02, not in aspiration.
Everything below is a recommendation for main to implement — no code was written, no repo file was
touched. Confidence is stated per claim; where I could not confirm something, I say so.
1. Why per-stage signal is the prerequisite for the whole diagnosis
The diagnosis framework’s routing test — does the correct action ever appear at high N, and does RL (not matched SFT) expand it? — is defined per challenge or per challenge-subtype, not per portfolio. Run it on the portfolio in aggregate and you get one number that averages over four structurally different failure modes (F1 exploration / F2 skill / F3 tool-use / F4 long-horizon), which is exactly the “collapsing a split verdict into one sentence” anti-pattern Diagnosing the gap §0 warns against. You cannot segment a portfolio you cannot localize. Per-stage signal is what turns “39% pass@1, up from 26%” into “F1 dropped from 40%→12% of failures, F2 is now the dominant failure mode” — the only shape of finding that tells you which lever (SFT curriculum, tool-use reward, GRPO exploration term) to pull next.
flowchart LR
A["events.jsonl<br/>(already emitted)"] --> B["stage extractor<br/>(NOT built)"]
C["per-challenge manifest<br/>(NOT built)"] --> B
B --> D["F1-F4 attribution<br/>per run"]
D --> E["Diagnosis framework<br/>routing test, per subtype"]
E --> F["method choice:<br/>SFT curriculum / tool reward / GRPO term"]
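The shape of finding described above ("F1 dropped from 40% to 12% of failures") is just a per-segment share computation once each failed run carries a failure tag. A minimal sketch, assuming hypothetical tags "F1" to "F4" have already been attributed per run:

```python
from collections import Counter


def failure_shares(failure_tags):
    """Share of each failure mode among failed runs; tags are e.g. 'F1'..'F4'."""
    counts = Counter(failure_tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: counts[tag] / total for tag in sorted(counts)}


print(failure_shares(["F1", "F1", "F2", "F4"]))
```

Comparing two such dictionaries, one per checkpoint or per challenge subtype, gives the before/after view that a single portfolio-level pass rate hides.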
2. What the harness already emits — confirmed, source-read
Source: go/libs/agent/events/events.go (schema), go/libs/secagent/runner.go (wiring), doc of record
lessons/security-agent/harness-observability-contract-2026-06.md. Every event is Turn-indexed and
tool_call/tool_result pairs join on tool_call_id — this is the load-bearing fact that makes any
stage-localization script possible with zero changes to the agent loop.
| Event | What it carries | Stage-attribution value |
|---|---|---|
| meta / tool_schemas / system_prompt / user_message (preamble) | trace_id, model, action space, task string | Run identity; needed to join against a per-challenge manifest |
| agent_start | {input} | “Did the agent do anything”: trivially free |
| turn_start | {history_len} | Per-turn anchor |
| llm_response | role/content/reasoning, tool_calls[], token usage, TTFT/latency, finish_reason | Reasoning text: usable for context, never as stage-reached evidence (confabulation risk) |
| tool_call | {tool_call_id, name, args} | The command actually issued; args is a raw string (bash/curl/etc.), not a structured HTTP call |
| tool_result | {tool_call_id, name, output, tool_exec_ms, error?} | The real, server-observed response text; this is the only tier that counts as ground truth |
| agent_finish | stop_reason, turns/max_turns, finish_reason | Distinguishes “ran out of budget” (max_turns) from “gave up” (stop) from silent truncation (stop_reason=stop but finish_reason=length; don’t trust stop_reason alone) |
| flag_scan | {solved, primary?, retrieved[], findings[{flag, origin, first_event, first_turn?}]} | Terminal outcome, provenance-classified (retrieved/echoed/model_claim) |
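The join on tool_call_id and the stop_reason/finish_reason distinction in the table can be sketched as below. The JSON envelope (`type`, `turn`, `data`) is an assumption made for illustration; the real events.jsonl layout is defined in events.go and may differ.

```python
import json


def pair_tool_events(lines):
    """Join tool_call and tool_result events on tool_call_id.

    Assumes (hypothetically) that each line is a JSON object shaped like
    {"type": ..., "turn": ..., "data": {...}}; the real envelope may differ.
    """
    calls, pairs = {}, []
    for line in lines:
        event = json.loads(line)
        data = event.get("data", {})
        if event.get("type") == "tool_call":
            calls[data["tool_call_id"]] = (event.get("turn"), data)
        elif event.get("type") == "tool_result":
            turn, call = calls.pop(data["tool_call_id"], (None, None))
            if call is not None:  # skip results with no matching call
                pairs.append({"turn": turn, "name": call["name"],
                              "args": call["args"],
                              "output": data.get("output", "")})
    return pairs


def classify_finish(stop_reason: str, finish_reason: str) -> str:
    """Separate budget exhaustion, a voluntary stop, and silent truncation."""
    if stop_reason == "max_turns":
        return "out_of_budget"
    if stop_reason == "stop" and finish_reason == "length":
        return "silent_truncation"
    if stop_reason == "stop":
        return "stopped"
    return "other"
```

The truncation check runs before the plain-stop check on purpose: a run that reports stop_reason=stop with finish_reason=length would otherwise be miscounted as a voluntary stop.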
Tool surface bounding what a “stage” can even look like: five tools total (bash, read_file,
write_file, update_file, WebSearch — confirmed go/libs/sectools/sandbox.go, no WebFetch, contra
two now-stale lessons). All HTTP interaction with the target is embedded in free-text bash commands and
free-text bash output — there is no structured http_request tool call with a machine-readable
status/URL/body. This is the single biggest reason “endpoint discovered” and “vuln identified” are not
already derivable cleanly — any stage parser must regex/parse free text, not read a field.
Confidence: high (direct source read, 2026-07-02). Full detail: artifacts/overnight-instrumentation/research/harness-signals.md.
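Because every HTTP interaction lives inside free-text bash, a stage parser has to recover the method and path itself. The following is a best-effort sketch that handles plain `curl` invocations only; the function name and the list of recognized flags are illustrative assumptions, and real agent commands (wget, Python one-liners, shell variables) would need more cases.

```python
import shlex
from urllib.parse import urlsplit

SEPARATORS = {"|", "&&", "||", ";"}
BODY_FLAGS = {"-d", "--data", "--data-raw"}


def extract_requests(bash_command: str):
    """Best-effort (method, path) extraction from curl calls in a bash string."""
    try:
        tokens = shlex.split(bash_command)
    except ValueError:  # unbalanced quotes in free-text input
        return []
    requests, i = [], 0
    while i < len(tokens):
        if tokens[i] != "curl":
            i += 1
            continue
        method, has_body, url = None, False, None
        i += 1
        while i < len(tokens) and tokens[i] not in SEPARATORS:
            tok = tokens[i]
            if tok in {"-X", "--request"} and i + 1 < len(tokens):
                method = tokens[i + 1].upper()
                i += 1  # skip the method value
            elif tok in BODY_FLAGS and i + 1 < len(tokens):
                has_body = True
                i += 1  # skip the body value
            elif tok.startswith(("http://", "https://")):
                url = tok
            i += 1
        if url is not None:
            path = urlsplit(url).path or "/"
            # curl defaults to POST when a body is given, GET otherwise
            requests.append((method or ("POST" if has_body else "GET"), path))
    return requests


print(extract_requests("curl -s http://target:8080/api/users | head"))
```

Returning an empty list on unparseable input keeps the extractor deterministic: a command it cannot read counts as "no evidence", never as a guessed request.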
3. Per-stage ground-truth verifier design
3.1 What already satisfies “ground-truth-verified, never transcript-matched”
The project’s non-negotiable rule — reward from real environment/tool-output state, never format/regex-matched on the transcript — is already met at the terminal (flag) stage and nowhere else:
- go/libs/agent/events/flagscan.go (ScanFlags) classifies every FLAG{...} sighting as retrieved (in a tool_result, absent from that call’s own args) / echoed (present in both) / model_claim (only in model text). This is a provenance signal (it proves the string came back from the sandbox, not the model’s mouth) but is not a byte-compare against the real flag. A retrieved flag from a decoy or off-target leak still reads solved:true.
- The actual byte-compare (flag_verified, the project’s own term) lives outside the agent runtime: benchmark/flags/pd26_flags.current.json (10 live flags) + benchmark/verification/PD26-NN/exploit/solve.py (held-out reference solvers). Verifying against these today is a manual, SSH-gated, human-authorized step (lessons/evals/verifying-agentic-security-runs.md), not wired into the harness, pdq, or trace-verify as an automatic per-run check. grep -rln "flag_verified" --include="*.go" returns zero hits: confirmed absence, not a naming mismatch.
- Documented failure modes any pre-terminal verifier must not reproduce: model-claim fabrication after repeated 404s (lessons/security-agent/flag-detection-false-positives.md, 6/17 fine-tune “solves” were exactly this); a hardcoded flag-format regex producing false negatives on an off-roster 24-hex flag (lessons/evals/gym-challenge-flag-format-breadcrumb-false-negative.md); an 18-point proxy-vs-verified inflation on the fine-tune’s own leaderboard (lessons/evals/ctf-flag-verification-and-proxy-pitfall.md).
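The difference between the provenance tier and the byte-compare tier can be sketched in a few lines. This is an illustrative Python re-expression of the rules as stated above, not the Go code in flagscan.go; both function names and signatures are hypothetical.

```python
def classify_sighting(flag: str, call_args: str, result_output: str) -> str:
    """Provenance tier for one flag string against one tool_call/tool_result pair."""
    in_result = flag in result_output
    in_args = flag in call_args
    if in_result and not in_args:
        return "retrieved"    # came back from the sandbox, was not sent in
    if in_result and in_args:
        return "echoed"       # the agent supplied the string itself
    return "model_claim"      # not backed by this tool result


def flag_verified(retrieved_flags, expected_flag: str) -> bool:
    """Exact-match check against a held-out flag; stronger than provenance alone."""
    return expected_flag in set(retrieved_flags)


# A decoy passes the provenance check but fails the exact-match check.
tier = classify_sighting("FLAG{decoy}", "curl http://t/flag", "FLAG{decoy}")
print(tier, flag_verified(["FLAG{decoy}"], "FLAG{real}"))
```

The function takes the flag string as an input instead of extracting it with a format regex, which sidesteps the hardcoded-format false negative listed among the failure modes.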
3.2 The proposed per-stage predicate design
Phase names follow the already-designed (if unused) ptes schema in .claude/rules/challenges.md
(recon → enumeration → detection → exploitation → lateral), mapped onto the brief’s F1–F4 taxonomy. The
mechanism is a direct reuse of flagscan.go’s proven shape: a pure, read-only, post-hoc scan over
events.jsonl — no new sandbox instrumentation, no changes to secagent’s execution path.
| Stage | Maps to | What real state proves it | How checkable | Robustness |
|---|---|---|---|---|
| recon | pre-F1 | A request reached a known recon surface and got a response | tool_result exists for a tool_call whose args path matches a per-challenge recon-surface allowlist | Cheap + robust |
| enumeration | F1 (never finds vuln endpoint) | A request’s method+path matched the vuln-bearing route, regardless of payload correctness | tool_call.args path/method vs. a per-challenge allowlist lifted from solve.py | Cheap + robust — automates the existing by-hand method in lessons/evals/wall-attribution-discovery-vs-exploit-fail.md |
| detection | F1/F2 boundary | Response shows diagnostic evidence of the specific bug class (error, type-confusion tell, introspection leak) | tool_result.output vs. a per-challenge, bug-class-specific signature | Hard/ambiguous — bug-class-specific; recommend optional/best-effort in v1, fold into “enumeration reached, exploitation not yet” if no clean signature |
| exploitation | F2 (finds, can’t exploit) | The payload actually worked — server-side artifact only possible on success (token, row leak, shell banner) | tool_result.output vs. the exact success predicate already written in that challenge’s solve.py (verbatim reuse — this IS the ground-truth oracle) | Cheap + robust when the exploit yields one identifiable artifact; coarser (any-200-on-payload) proxy where it doesn’t |
| lateral / flag | F4 + terminal | Second request in a bypass→flag-read chain returned the flag | flag_scan.retrieved, verbatim | Already built — zero new work |
| (cross-cutting, not a stage) | F3 (clumsy tool-use) | Tool-call diversity / pro-grade tool vs. improvised shell one-liner | Count distinct tool_call.name, or classify args against a pro-tool allowlist | Does not fit the stage ladder — log as a separate metric, do not fold into a potential function (see §3.3) |
Two things this design deliberately does NOT do, because both violate the project’s own reward rule:
- Does not reuse the .claude/rules/challenges.md ptes matcher/triage-subagent mechanism. That mechanism is explicitly an LLM judge (“a span satisfies a matcher when the description’s intent is met, not merely when the regex matches”), a level-3 rubric on the reward-gameability ladder (lessons/post-training/reward-signal-types-and-gameability-ladder.md), and per lessons/challenges/pd-challenge-file-anatomy.md, none of the 10 live PD26 challenges even carry the file this schema lives in. Use its phase names, not its judging mechanism.
- Does not treat intent (the opt-in SECAGENT_CAPTURE_INTENT field) as evidence of “identified the vuln.” It’s documented observability metadata, low-faithfulness, off by default, and must be stripped before training use (lessons/security-agent/bash-intent-observability-field.md).
3.3 Potential-based shaping — the caveats if this ever feeds RL reward
If a stage-scan result is ever turned into dense RL signal (rather than just a diagnostic readout), the
only shaping form proven not to change the optimal policy is
F(s,a,s') = γΦ(s') − Φ(s) for any potential function Φ — Ng, Harada & Russell, “Policy Invariance Under
Reward Transformations,” ICML 1999 (no arXiv id — this predates arXiv’s routine ML use; ACM DL
10.5555/645528.657613, verified live 2026-07-02). Two subtleties are easy to get wrong:
- Φ must be a monotone “best stage reached so far” running max, not the instantaneous current-turn stage — otherwise re-triggering an already-reached signature, or a later turn’s evidence going quiet because the agent moved on, can pay a spurious negative shaping reward for forward motion.
- Φ must be defined identically across every terminal branch (stop_reason ∈ {stop, max_turns, error}, all real values in this harness); otherwise the invariance proof breaks across the different termination paths this project’s variable-length episodes actually produce. Recommended sidestep: apply shaping only over non-terminal transitions; let R_terminal (= flag_scan.Solved, unchanged) carry all outcome signal at the very last transition.
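The first subtlety is easy to see numerically. The sketch below computes the shaping term F = γΦ(s') − Φ(s) twice over the same trajectory: once with Φ as the running max of the stage index, once with Φ as the instantaneous stage. The integer stage encoding (0 = none, higher = later stage) is an assumption made for illustration.

```python
def shaping_rewards(stages, gamma=1.0):
    """Potential-based shaping F_t = gamma * Phi(s_{t+1}) - Phi(s_t).

    `stages[t]` is the stage index observed at step t (hypothetical encoding:
    0 = none, 1 = recon, 2 = enumeration, ...). Phi is the running max of that
    index, so it never decreases along a trajectory.
    """
    phi, best = [], 0
    for stage in stages:
        best = max(best, stage)
        phi.append(best)
    return [gamma * phi[t + 1] - phi[t] for t in range(len(phi) - 1)]


def instantaneous_rewards(stages, gamma=1.0):
    """Same formula with Phi = current-step stage, shown only for contrast."""
    return [gamma * stages[t + 1] - stages[t] for t in range(len(stages) - 1)]


# The agent reaches stage 2, its evidence goes quiet for a turn, then it reaches stage 3.
stages = [0, 1, 2, 0, 3]
print(shaping_rewards(stages))        # running max: no penalty for the quiet turn
print(instantaneous_rewards(stages))  # instantaneous: the quiet turn is penalized
```

With γ = 1 the running-max terms telescope to Φ(last) − Φ(first), so the shaping total depends only on the best stage reached. The terminal flag reward is deliberately absent from both functions; it stays a separate term.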
Domain-adjacent SOTA, verified live 2026-07-02 — none tick all three boxes (cybersec-specific + proven invariant + validated against a ground-truth terminal verifier), so this remains an unfilled niche, not a solved-elsewhere problem. The two rows below (Pentest-R1, DRLRM-PT) are academic cybersecurity-LLM training papers — per this project’s standing rule, they are cited for context only, not as a basis for any claim/recipe/verdict here; neither produced a frontier model, so neither is treated as evidence for the shaping design below. The recommendation that follows this table rests on the Ng/Harada/Russell invariance theorem and this project’s own gameability-ladder lesson, not on either of these two papers.
| Paper | arXiv | Relevance | Confidence |
|---|---|---|---|
| TIPS — turn-level potential shaping for search-augmented LLMs | 2603.22293 | Shaping machinery is directly on-point; domain (search-QA) is not | 0 citations, brand-new — promising, not validated |
| ToolRL — reward design for tool-use RL | 2504.13958 | Closest prior art on reward granularity/timing for tool-use GRPO; not potential-based | 1 citation |
| Pentest-R1 — two-stage RL for autonomous pentesting (academic cybersecurity-LLM training paper — cited for context, not a basis for our decisions) | 2508.07382 | Domain topic overlaps — a per-step reward in an interactive CTF env (InterCode-CTF); exact shaping formula not fully verified from search highlights alone — flag as unread in full. Not used to support the recommendation below | 0 citations, brand-new |
| DRLRM-PT — Reward Machine over kill-chain phases (academic cybersecurity-LLM training paper — cited for context, not a basis for our decisions) | 2405.15908 | Illustrates a non-potential-based design (flat +1/+10 phase bonuses, no γΦ(s')−Φ(s) structure) — cited only to warn against conflating “reward machine over phases” with “provably invariant shaping,” not as prior art we build on | Medium |
Recommended default if this is built: keep the terminal flag reward and the dense stage-shaping term
as two separate additive components, never merged into one function — this is both what makes the
invariance argument clean (Ng, Harada & Russell, ICML 1999, cited above) and what the project’s own
gameability doctrine independently favors (decoupled dense process signal + sparse ground-truth outcome
signal, lessons/post-training/reward-signal-types-and-gameability-ladder.md).
Confidence: high on the harness-reuse mechanism and the Ng et al. invariance result itself (25-year-old,
well-established). Medium on the “running-max Φ” / “terminal-consistency” recommendations — applied
reasoning from the theorem plus this project’s variable-horizon episode shape, not lifted from a paper.
Full detail: artifacts/overnight-instrumentation/research/staged-verifier-design.md.
4. What training/eval data we already have, per candidate move
Scope: benchmark/ (repo, git-tracked) + ~/security-agent-qwen/ (untracked local run-artifact directory
holding the actual trajectory takeouts, partially mirrored to s3://llmresearch-data/). On the order of
1,200+ individual agent trajectories across ~470 distinct challenge definitions exist already — this
is a mining problem, not a collection problem, for three of the four candidate moves.
| Candidate move | Readiness | Extraction step | Sharpest gotcha |
|---|---|---|---|
| (i) Rejection-sampling SFT positive set | ~185 raw solved trajectories across 5 corpora (gym263 64, gym564 39-cleaned, warpenv-broker 22, envgen 29, argus60-base 29); one prior SFT (cybersec-qwen36-traj-ep2 / pd-v5-qwen36-ft) already built this way | Follow lessons/post-training/verified-trajectory-synthesis-recipe.md verbatim: verifier-accepted terminal only → replay-reproduce → dedup → decontaminate → Thought/Action/Observation with Observation loss-masked | Confirmed, not hypothetical: the prior SFT fabricated flags on 6/17 claimed solves (lessons/post-training/sft-induced-flag-confabulation.md) — naive success-folder collection teaches success-shape, not success |
| (ii) KTO/DPO pairs | KTO-native data is ready today for free: every success/ / failed/ split is an unpaired good/bad label (mechanical, zero judgment calls). True DPO (same-decision-point divergent pairs) needs k≥2 same-challenge same-model runs; only the PD26 canonical sweep (k=5, 10 challenges) has this, and the larger gym pools are k=1 | Label KTO now; if DPO is wanted, mine the PD26 k=5 sweep, don’t re-sweep the gym pools | Don’t default to DPO just because solve/fail piles exist: lessons/post-training/dpo-kto-for-agent-tool-selection.md’s escalation ladder (fix tool description → prompt guidance → action-space → better base → SFT → DPO/KTO → GRPO) should gate the choice first |
| (iii) Per-stage eval (F1–F4) | This is the actual gap. A PTES-phase tagger exists as a named, working concept — but on a different, sibling corpus (the pr seat’s BSides-LV-2026 audit: 189 Claude-model runs, Neo/XBOW-bench, not this project’s qwen/deepseek/glm/xai/gemini roster), and it is not yet open-sourced or present anywhere in this repo | Build a lightweight classifier over the existing tool_call/tool_result stream, scoped to the PD26/gym corpus specifically — materially smaller than the sibling talk’s full framework (no tool-tier/contamination/recovery-shape analyzers needed for internal F1–F4 attribution) | Don’t import the BSides talk’s headline numbers (“82% pivot rate,” “62% stall in exploitation”) as if they describe this project’s own models — they describe Claude models on a different bench |
| (iv) Credit-assignment traces | Best-instrumented axis in the inventory — every run in every corpus carries the full per-turn event stream, confirmed live on a real sample (190-line events.jsonl, 63 tool_call/tool_result pairs, 29 turns) | tool_call_id-paired parsing is a solved extraction problem (events.ScanFlags already demonstrates the pattern in Go) | Turn-level “did this action retrieve the flag” ≠ “was this turn part of a coherent minimal solve path” — the replay-reproduce check proves an action sequence is causal, not that recorded Thought spans are faithful narration (a distinct, unaddressed reasoning-distillation risk) |
Cross-cutting gotchas that apply to all four moves:
- Ground truth exists for only 2 of 7 corpora (PD26, argus60/APEX). gym263/gym564/warpenv/envgen have no held-out flag file; their “solved” signal is the harness’s own retrieved classification, a weaker epistemic tier than exact-match. Flag this explicitly in anything built downstream of them.
- retrieved (tool_result-not-in-own-args) is a strong genuineness signal but is still a transcript-level heuristic, not an out-of-band verifier query; the project rule is not automatically satisfied just because the harness flagged something retrieved.
- gym564’s local raw copy (521 completed) is not the cleaned number (305) cited elsewhere in memory; reconcile against experiments/2026-06-27-gym564-archive-cleanup.md before using it quantitatively.
Confidence: high on the corpus inventory and readiness split (row-counted directly, 2026-07-02); high
on the (iii) framing gap being genuine (vault-searched for “PTES”/“stage-local”/“kill chain,” found exactly
one non-applicable hit). Full detail, including per-corpus row counts and S3 paths:
artifacts/overnight-instrumentation/research/data-inventory.md.
5. The minimal instrumentation gap — the recommended first step
Collapsing §2–§4 into one actionable delta: the harness’s process telemetry is already complete for
stage attribution. Nothing in go/libs/agent/events or secagent/runner.go needs to change. What’s
missing is entirely semantic, and it splits into two independent, additive pieces of work that
correctly sit on different sides of the seat boundary:
flowchart TD
A["Pick ONE challenge<br/>(PD26-02: chain already<br/>documented in lessons/challenges/<br/>pd26-02-nosqli-authbypass-chain.md)"] --> B["challenge-builder:<br/>author stage_oracle.json<br/>from solve.py, 4 predicates,<br/>editorial work, not infra"]
A --> C["main: write stagescan.go,<br/>same shape as flagscan.go:<br/>pure io.Reader -> StageScan,<br/>no side effects"]
B --> D["Run stagescan over existing<br/>events.jsonl from already-<br/>collected runs (benchmark/results/,<br/>~/security-agent-qwen/)"]
C --> D
D --> E["Validate: does the stage vector<br/>match what a human reading<br/>the same trace concludes?<br/>(same discipline as the flag-layer<br/>validation)"]
E -->|"holds"| F["Generalize to all 10 PD26,<br/>then gym/warpenv/envgen"]
E -->|"doesn't hold"| G["Refine predicates before<br/>trusting any F1-F4 number"]
- main/harness side: a deterministic (no-LLM, no-confabulation) URL/path extractor over tool_call.args and tool_result.output, scoped to the bash tool only. Emit as a derived per-run artifact, not a new harness event; this keeps the harness itself free of challenge-specific semantics.
- challenge-builder side: one stage_oracle.json sibling per challenge, with one entry per PTES phase, each a deterministic predicate over (method, path_regex, status_code, body_signature), authored directly from that challenge’s own solve.py. This is editorial work (one person reads ~10 solve.py files and writes ~4 predicates each), not new infrastructure: the ground-truth reference already exists, it’s just not lifted into a machine-readable file.
- Prototype on ONE challenge before committing to all 10. PD26-02 is the natural first pick: its two-step NoSQLi-authbypass chain is already fully narrated in lessons/challenges/pd26-02-nosqli-authbypass-chain.md, and solve.py’s own success checks (r.status_code == 200 and data.get("token") for the bypass; a flag regex on /api/profile for the pivot) are directly reusable as the exploitation-stage and lateral-stage predicates verbatim. Run the prototype over runs already sitting in benchmark/results/ and ~/security-agent-qwen/; no new sweep is needed to validate the mechanism.
- A narrower, more concrete companion gap on the flag side: turn flag_scan.retrieved into a true flag_verified boolean via an offline exact-match diff against the already-git-tracked benchmark/flags/pd26_flags.current.json, for the 10-challenge roster only. This needs no SSH, no live secrets, and closes the one place today’s “ground-truth-verified” claim is actually a provenance proxy.
None of this requires RL infrastructure, a new sandbox tool, or a change to secagent’s execution path —
it is a read-only scan over data that already exists, validated against traces already collected, before
any GRPO/RLVR reward design depends on it.
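A minimal sketch of what the read-only stage scan could look like, written in Python for brevity although the recommendation above is a Go stagescan.go. The `StagePredicate` and `Observation` types, the `/api/login` path, and the body signatures are all hypothetical placeholders; real predicates would be lifted from each challenge's solve.py, and the status code would have to be parsed out of free-text tool output first.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StagePredicate:
    """One hypothetical stage_oracle.json entry, already parsed."""
    stage: str
    method: str
    path_regex: str
    status_code: Optional[int] = None      # None means "any status"
    body_signature: Optional[str] = None   # regex searched in the response body


@dataclass(frozen=True)
class Observation:
    """One request/response pair recovered from a tool_call/tool_result pair."""
    method: str
    path: str
    status_code: int
    body: str


def matches(pred: StagePredicate, obs: Observation) -> bool:
    if pred.method != obs.method or not re.fullmatch(pred.path_regex, obs.path):
        return False
    if pred.status_code is not None and pred.status_code != obs.status_code:
        return False
    if pred.body_signature is not None and not re.search(pred.body_signature, obs.body):
        return False
    return True


def stage_scan(predicates, observations):
    """Return {stage: index of the first satisfying observation, or None}."""
    result = {p.stage: None for p in predicates}
    for index, obs in enumerate(observations):
        for p in predicates:
            if result[p.stage] is None and matches(p, obs):
                result[p.stage] = index
    return result


# Placeholder oracle; paths and signatures are illustrative, not from solve.py.
oracle = [
    StagePredicate("exploitation", "POST", r"/api/login", 200, r'"token"'),
    StagePredicate("lateral", "GET", r"/api/profile", 200, r"FLAG\{"),
]
trace = [
    Observation("GET", "/", 200, "<html>"),
    Observation("POST", "/api/login", 401, "denied"),
    Observation("POST", "/api/login", 200, '{"token": "abc"}'),
]
print(stage_scan(oracle, trace))
```

Recording the first satisfying index, not a bare boolean, gives the validation step something to compare against a human reading of the same trace.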
Cross-links
- Diagnosing the gap — the routing test this chapter’s stage signal feeds.
- From behavioral audit to training signal — what to do once a failure is localized to a stage.
- Method → Data — the data-object framing this chapter grounds in an actual corpus.
One problem, or many? — monolithic outcome-RL vs staged decomposition
Every other chapter in this book asks “which algorithm” (SFT vs DPO vs GRPO). This chapter asks a question one level up, specific to a sequential, multi-stage task with a single sparse terminal reward: should the CTF solve be trained as one end-to-end outcome-RL problem (flag reward only, let RL discover the stages), or decomposed into sub-problems — evaluated per stage and, more contentiously, trained per stage? The two halves of “decompose” turn out to have very different answers, and conflating them is the single easiest way to get this wrong.
All citations below are carried over, unmodified, from five research threads run 2026-07-02 (artifacts/overnight-decomposition/research/{monolithic-case,decomposition-case,staged-eval,pentest-ctf-rl,credit-assignment-theory,verdict}.md) — no id below was invented for this chapter. Re-grounding pass, same date: per the project’s standing rule, no conclusion in this chapter may rest on a domain-specific academic CTF/pentest training or benchmark paper (CTF-Dojo, Cyber-Zero, Pentest-R1, HackSynth-GRPO, AutoPenBench, Cybench, NYU CTF Bench, EnIGMA, InterCode-CTF, DRLRM-PT, node-fragility shaping, the kill-chain-staged-reward paper) — every such work is demoted to a labelled context-only mention below, and every claim that had rested on one is re-anchored on general frontier-lab / RL-theory evidence or this project’s own data instead. Two new citations were added and independently verified live for this pass: STaR (arXiv:2203.14465) and ReST-EM (arXiv:2312.06585), both general (non-security) self-training literature, replacing CTF-Dojo/Cyber-Zero as the basis for §5’s rejection-sampling-SFT recommendation.
1. The pipeline, and why the failure isn’t uniform
A CTF solve is not one action, it’s a chain:
flowchart LR
R["Recon /\nenumeration"] --> E["Endpoint\ndiscovery"]
E --> V["Identify the\nvulnerable endpoint"]
V --> X["Exploit it"]
X --> P["Post-exploitation /\npivot"]
P --> F(("Flag\n{0,1}\nground-truth verified"))
classDef stage fill:#132b22,stroke:#34d399,color:#eafaf3;
class R,E,V,X,P stage;
Only the last box is ground-truth-checkable today. Observed failures cluster by where in the chain the agent dies, not uniformly across it — the project’s own F1–F4 taxonomy, which turns out to be a CTF-specific instance of failure clusters the general agent-eval literature keeps independently rediscovering (§4).
| Tag | Failure | Canonical RL/agent-research framing | What it is not |
|---|---|---|---|
| F1 | Never finds the vulnerable endpoint | Exploration / coverage failure — no gradient exists until the reward is first observed; large, deceptive state space | Not a credit-assignment problem — you can’t assign credit for a reward you’ve never seen |
| F2 | Finds it, probes shallowly, can’t land the exploit | Execution / skill (performance-floor) failure — capability present, doesn’t reliably convert | Not usually fixed by more exploration |
| F3 | Clumsy tool use, wrong tool for the job | Policy / tool-selection failure — a distinct axis from “does it find the bug” | Overlaps F2 but has its own literature (tool-augmented-LLM failure taxonomies) |
| F4 | No real pivot/chaining after a foothold | Long-horizon credit-assignment failure — the terminal bit has to retroactively explain ~100 turns; variance grows with horizon | Not solved by “try more” alone — it’s a variance, not coverage, problem |
The credit-assignment theory thread makes the split precise: a ~100-turn trajectory with reward only at the end is hard for three separable reasons — exploration burden (F1, upstream of everything else), credit-assignment variance (Monte-Carlo/GRPO-style returns smear one scalar across all turns — arXiv:1506.02438, GAE), and compounding distributional drift (a policy trained once, on one static snapshot, drifts off-distribution as a rollout gets longer — arXiv:1011.0686, DAgger, classically O(εT²) uncorrected). F1 is the first problem; F2–F4 are flavors of the second and third. Don’t expect one fix (a denser reward) to solve both (arXiv:2312.01072, the credit-assignment-vs-exploration survey).
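Both the F1 point (no gradient until a reward is first observed) and the smearing point can be seen in a GRPO-style group-relative advantage. This is a minimal sketch of the standardize-within-group idea, not any lab's exact implementation; the function name and the `eps` constant are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each rollout's reward standardized within its group.

    The one scalar per rollout is what every turn of that rollout is credited
    with; this form carries no per-turn signal.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# F1 case: no rollout in the group ever saw the flag, so every advantage is zero.
print(group_relative_advantages([0, 0, 0, 0]))

# One solve in four: that rollout's ~100 turns all share the same positive scalar.
print(group_relative_advantages([1, 0, 0, 0]))
```

The all-zero group is the exploration problem in miniature: more rollouts of the same kind produce more zeros, not a gradient.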
2. Side A — the monolithic case, steelmanned
The pattern across every lab that tried both is consistent: outcome-only + scale beats hand-built process supervision, every time it’s been A/B’d.
- DeepSeek-R1-Zero rejects process reward outright, in its own failure-experience writeup. Pure RL, no SFT, rule-based outcome-only reward; reasoning behaviors emerge as a side effect. Verbatim: “a model-based PRM… inevitably leads to reward hacking, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.” arXiv:2501.12948. This is the single strongest evidence against a learned/neural per-stage reward — note precisely what it doesn’t rule out: a deterministic, ground-truth per-stage check is a different animal (§4).
- OpenAI Deep Research shipped a long-horizon, tool-using agent trained end-to-end on outcome/rubric reward, and its own team says why: “End-to-end training beats manual orchestration… constructing a graph of operations… is the common approach to building agents [but] Deep Research is trained end-to-end… This allows the model to develop flexible strategies… that would break if scripted manually.” (OpenAI Deep Research system card, no arxiv id — flagged as such.) Closest real-world analog to this project’s shape: long-horizon, tool-using, sparse/rubric-graded.
- Kimi k1.5 gets SOTA reasoning results with no PRM, no MCTS, no value function — substituting long-context scaling for explicit search. arXiv:2501.12599. Second independent lab, same conclusion as R1.
- Llama 4’s own post-training team found heavy SFT/DPO caps the ceiling of the subsequent RL stage — verbatim: “SFT and DPO can over-constrain the model, restricting exploration during the online RL stage.” They responded by pruning >50% (95% for Behemoth) of their SFT data (ai.meta.com blog, no arxiv id). This is the project-critical citation: a hand-imposed stage boundary is itself a form of prescriptive pre-RL structure, and this is the general mechanism by which imposed structure narrows a policy’s exploration before RL gets to use it.
- Academic, cited for context only, not a basis (per project standing rule — no domain-specific academic CTF/pentest training paper has produced a frontier cybersecurity model): in the CTF/pentest domain specifically, the same monolithic-wins pattern is also reported by academic training papers — CTF-Dojo (arXiv:2508.18370), Cyber-Zero (arXiv:2508.00910), Pentest-R1 (arXiv:2508.07382), HackSynth-GRPO (arXiv:2506.02048). None of these is load-bearing here — the actual basis for Side A is the frontier-lab evidence directly above (R1, Kimi k1.5, OpenAI Deep Research, Llama 4, Bitter Lesson), which independently converges on the same conclusion without needing a CTF-specific data point.
- The Bitter Lesson (Sutton 2019, incompleteideas.net) is the intellectual ancestor of all of the above: hand-built structure plateaus, general search+learning wins at scale. High confidence the historical pattern is real; medium-low confidence it transfers directly to a 1000-challenge corpus with real per-rollout infra cost — that’s exactly the disanalogy the honest limits below press.
The honest limits — where the monolithic case’s own literature admits it breaks down
| Limit | Citation | What it says |
|---|---|---|
| Pure outcome RL can structurally fail to ever find the reward | Go-Explore, arXiv:1901.10995 / Nature s41586-020-03157-9 | Vanilla deep RL scored ~0 on Montezuma’s Revenge/Pitfall — canonical sparse, deceptive, long-horizon environments — until an explicit “remember states, return, explore from there” mechanism was added. This maps almost exactly onto F1: it’s an exploration-algorithm problem, not a hyperparameter one. |
| The best long-horizon precedent needed denser reward + enormous scale | OpenAI Five, arXiv:1912.06680 | 10 months of continual, large-scale distributed self-play training, AND a per-frame shaped reward (last-hits, kills, tower damage), not a single terminal bit. Citing this as “pure sparse reward at scale works” over-claims what the paper shows. |
| R1’s own reward-reliability admission | arXiv:2501.12948 | “The success of pure RL depends on reliable reward signals… for tasks that cannot obtain a reliable signal, DeepSeek-R1 uses human annotation… and only conducts RL for hundreds of steps.” The CTF flag reward is reliable (ground-truth) but far sparser per compute-dollar than a math/code answer — R1’s paper doesn’t test this sparsity regime; it’s an extrapolation this project would be making, not a validated claim. |
Bottom line for Side A: thick, convergent frontier-lab evidence that monolithic outcome reward plus better data/curriculum wins whenever it has been A/B’d against hand-built process supervision. The CTF-specific papers report the same pattern but, per the standing rule, are context only. The genuine open risk Side A must own is Go-Explore’s: whether the ~100-200/1000 solve rate reflects “still finding it eventually with more rollouts” (favors monolithic) or “structurally not finding it” (favors an exploration-specific intervention) is an empirical question the literature alone cannot resolve.
3. Side B — the decomposition case, steelmanned
Framing. Monolithic outcome-RL is implicitly betting on four things at once: (1) the base policy already puts non-zero mass on the correct trajectory shape for every stage on a large fraction of challenges, (2) the RL algorithm can correctly attribute a late reward to the right subset of ~100 turns, (3) one scalar is expressive enough to teach four qualitatively different skills (exploration breadth, exploit depth, tool discipline, chaining) without one skill’s gradient starving another’s, and (4) “more outcome RL” is uniformly the right lever for F1 through F4 alike. Every technique below is a documented failure mode of at least one of these assumptions.
The menu of decomposition mechanisms
| Mechanism | Citation | What changes in the loop | Confidence |
|---|---|---|---|
| Options / SMDP framework (the seminal foundation, included regardless of age) | Sutton, Precup, Singh, Artificial Intelligence 112 (1999) (pre-arxiv; DOI 10.1016/S0004-3702(99)00052-1) | Action space becomes {launch_recon_option, launch_exploit_option, ...}; a high-level policy picks among temporally-extended sub-policies, shortening the effective horizon the terminal reward has to bridge | High (theory), medium (LLM transfer). Known failure: naive end-to-end option learning collapses to one mega-option or micro-manages every step. |
| FeUdal Networks — Manager/Worker split fixes option-collapse | arXiv:1703.01161 | Manager emits abstract directional goals in latent space at low temporal resolution; Worker is intrinsically rewarded for moving state toward that direction; own ablations show a plain (non-dilated) recurrent Manager “fails catastrophically” on long-credit-assignment tasks | High (mechanism), medium (LLM transfer — from-scratch Atari RL, not a token-level LLM policy) |
| ArCHer — the LLM-native analogue | arXiv:2402.19446 | A high-level, off-policy turn-level value function aggregates reward across turns; a low-level PPO-style update trains the token policy inside each turn using that value as its reward. Map “turn” onto “stage.” Single strongest “if I had to prototype one paper” citation for training-decomposition. | High (recipe exists), medium (untested on anything CTF-shaped) |
| HiPER — hierarchical advantage estimation | arXiv:2602.16165 | Factorizes policy into planner + executor; Hierarchical Advantage Estimation aggregates returns per subgoal, provably reducing variance vs flat GAE; +6.6% ALFWorld, +8.3% WebShop, largest gains specifically on long-horizon multi-subtask tasks | High (strong ablations) |
| MiRA — milestone-based dense reward | arXiv:2603.19685 | Dense, milestone-based reward replaces sparse outcome-only; on Gemma3-12B, WebArena-Lite success rate 6.4% → 43.0%, beating WebRL (38.4%) and GPT-4-Turbo (17.6%). The single strongest empirical existence-proof in this whole dossier that flag-only reward can leave a large gap on the table — but on web-navigation, not offensive-security CTF. | High |
| Pentest-R1 — domain-specific two-stage training | arXiv:2508.07382 | Academic, cited for context only, not a basis (per standing rule): offline RL on 500+ real pentest walkthroughs → online RL in a live CTF env. Structurally resembles this project’s own planned SFT→GRPO, but that resemblance is not the basis for recommending it — the load-bearing mechanism for training-decomposition is the general options/ArCHer/HiPER hierarchical framing above plus curriculum-learning theory below. | N/A — context only |
| Potential-based reward shaping: the theoretical safety net for everything above | Ng, Harada, Russell, ICML 1999 (pre-arxiv) | F(s,s') = γΦ(s') − Φ(s) for any state-only potential Φ provably leaves the optimal policy unchanged: the discounted telescoping sum over an episode collapses to γ^T Φ(s_T) − Φ(s_0) plus the true return, which at γ = 1 is simply Φ(s_T) − Φ(s_0). In episodic use, set Φ to 0 at terminal states so that offset cannot depend on how the episode ended. This is a theorem, not an empirical claim. Modern reaffirmation: arXiv:2502.01307 (practical effectiveness still depends on Φ’s scaling). | Very high (correctness); risk is entirely implementation |
| RUDDER — learned, return-equivalent redistribution | arXiv:1806.07857 | Train an auxiliary model to predict final return from trajectory prefixes (using the project’s own ~100-200 verified solves), use its temporal differences as a per-step reward — a learned alternative to hand-specifying Φ, with the same correctness guarantee | High (theory), directly actionable given existing verified-solve data |
| Ground-truth per-stage verifiers / VPR / CM2 (checklist rewards) | arXiv:2605.10325 (VPR), arXiv:2602.12268 (CM2) | Decompose the terminal task into a checklist of objectively verifiable sub-criteria (sandbox-checked, not judge-opinion) — the safe form of stage reward, symmetric to the flag verifier’s own contract. VPR’s own honest caveat: benefit “depends on the reliability of the verifier,” and extension “to less structured, open-ended environments… remains an open challenge” — directly relevant to CTF’s own stage 3 (§4). | High, with an explicit open-environment caveat |
| Curriculum learning & sequencing — orthogonal to reward, lowest risk of the whole menu | Bengio et al., ICML 2009 (pre-arxiv, foundational); h1 arXiv:2510.07312; FastCuRL arXiv:2503.17287; BPO arXiv:2508.03018 | Order training by difficulty (single-endpoint before decoy-heavy; 1-hop exploit before 2-hop pivot) — touches no reward function at all. h1: curriculum + pure outcome-only reward gets an exponential sample-complexity gain. BPO explicitly reports vanilla GRPO on their sparse-reward setting yields only marginal improvement without it. | High — the cheapest, least-risky lever in this entire menu |
| Kill-chain-staged reward (cyber-defense red-teaming) | arXiv:2605.17075 (May 2026) | Academic cybersecurity-LLM training work, cited for context only, not a basis (per standing rule). Frozen LLM planner emits kill-chain intent; a trained RL controller gets reward “aligned with kill-chain progression.” Superficially the closest thing in the literature to Option B, but it’s brand-new, unreplicated, doesn’t ablate “staged reward” from “hybrid architecture,” and — per the standing rule — carries no evidentiary weight for this project regardless. The actual basis for Option B’s viability is the general HRL/theory rows above (ArCHer, HiPER, potential-based shaping). | N/A — context only |
| Classical (non-LLM) staged reward for pentest | DRLRM-PT (reward machines), DOI 10.1109/ijcnn60899.2024.10650368; node-fragility shaping, DOI 10.3390/electronics13214311 | Academic pentest RL, cited for context only, not a basis (DRLRM-PT is explicitly named in the project’s standing rule). Reports staged/dense reward helping sample efficiency in small, discrete, formally-specified MDPs (network graphs, no language, no tool-calling) — a structurally different regime, and not where this project’s “staged reward can help” claim rests. That claim’s actual basis is the potential-based-shaping theorem and RUDDER above. | N/A — context only |
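The telescoping claim in the potential-based-shaping row can be written out in full. This is a worked derivation, assuming a finite episode of length T with environment reward r_t at step t (the symbols r_t, T and the index t are notation introduced here, not taken from the cited paper's text).

```latex
\begin{aligned}
\sum_{t=0}^{T-1} \gamma^{t}\,\bigl[\, r_t + F(s_t, s_{t+1}) \,\bigr]
  &= \sum_{t=0}^{T-1} \gamma^{t} r_t
     + \sum_{t=0}^{T-1} \bigl[\, \gamma^{t+1}\Phi(s_{t+1}) - \gamma^{t}\Phi(s_t) \,\bigr] \\
  &= \underbrace{\sum_{t=0}^{T-1} \gamma^{t} r_t}_{\text{true return}}
     \;+\; \gamma^{T}\Phi(s_T) \;-\; \Phi(s_0)
\end{aligned}
```

Every intermediate Φ term cancels, so the shaped return differs from the true return only by γ^T Φ(s_T) − Φ(s_0). The Φ(s_0) term is the same for every trajectory from a given start state. The Φ(s_T) term is the one to watch in an episodic setting: defining Φ as 0 at terminal states removes it, so no policy can gain from shaping alone.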
The honest limits — where decomposed training breaks, concretely
The reward-hacking evidence against naive per-stage reward is thick and convergent, not one paper. The moment a “verifier” stops being a deterministic ground-truth check and becomes a learned/judge score, this project’s own confirmed lesson (SFT-induced FLAG{} confabulation from a loose format-matcher) generalizes into a much larger literature:
- PURE / Stop Summation (arXiv:2504.15275) names the mechanism precisely: the canonical summation-form credit assignment (additive, per-step reward) “easily induces LLMs to hack steps with high rewards.”
- Reward Under Attack (arXiv:2603.06621) shows SOTA PRMs function as “fluency detectors rather than reasoning verifiers” — >0.9 PRM reward on trajectories with <4% ground-truth accuracy.
- Gao et al. (arXiv:2410.15115) — combining a learned PRM/ORM with success reward can hurt relative to success-reward-only, via “repeating correct but unnecessary steps.”
- PRIME’s own authors (arXiv:2502.01456) state the central open problem is that process labels are “prohibitively expensive… making [PRMs] particularly vulnerable to reward hacking,” and route around a separately-trained PRM entirely for this reason.
- MONA (arXiv:2501.13011, DeepMind) generalizes this: multi-step reward hacking can occur even when no single step looks bad to a human/judge overseer.
- The ancestor of all of it: the bicycle-shaping failure (Randlov & Alstrom, ICML 1998, pre-arxiv) — a non-potential-based “looks like progress” bonus taught an agent to ride in tight circles farming the bonus instead of reaching the goal. Same species as this project’s own SFT confabulation lesson: reward emitting the right-looking pattern rather than deterministically verified success, and the policy learns to farm the pattern.
Ceiling-capping is a real cost of decomposed training too, symmetric to Llama 4’s SFT/DPO warning. HIRO (arXiv:1805.08296) needed an explicit off-policy correction specifically because a high-level subgoal’s meaning drifts as the lower-level policy improves during training — without it, the system converges on subgoals that are locally useful but cap out below the true optimum. Translated to F1–F4: an “exploit-only” sub-policy trained against a synthetic “endpoint identified” subgoal risks converging on the shallowest exploit that satisfies the boundary — which is precisely the F2 shallow-probing failure this project already observes, not a hypothetical.
Academic, cited for context only, not a basis: the domain-specific CTF/pentest training papers named above happen to be monolithic-outcome or curriculum-decomposed rather than reward-decomposed, and the one adjacent paper doing genuine staged reward in an LLM+RL security loop (arXiv:2605.17075) is unreplicated — but neither observation is the basis for caution here. The actual, load-bearing case against naive per-stage reward is the general reward-hacking convergence immediately above (PURE, Reward Under Attack, Gao et al., PRIME, MONA, HIRO) — that literature alone is sufficient to warrant the conditional verdict in §5, independent of what the CTF-training corpus does or doesn’t show.
4. The key split — eval-decomposition vs training-decomposition
This is the load-bearing distinction. They are not the same decision, and the evidence supports very different confidence levels for each.
| | Eval-decomposition | Training-decomposition |
|---|---|---|
| What it means | Measure per-stage reached/not-reached, on top of the existing flag_verified terminal signal | Replace/augment the terminal reward with per-stage rewards, curricula, or hierarchically-trained sub-policies |
| Training-loop change | None — a read-only pass over traces already generated | The reward function, or the training architecture, or both |
| Cost | Near-free (one aggregation pass) | Real engineering + real risk surface |
| Evidence for it | AgentBoard’s “progress rate” (arXiv:2401.13178, NeurIPS 2024 Oral) — “current evaluation frameworks mostly focus on the final success rate, revealing few insights”; MAST’s 14-mode/3-category taxonomy (arXiv:2503.13657, κ=0.88); AgentErrorTaxonomy — root-cause diagnosis alone (no reward change) buys +24% all-correct accuracy (arXiv:2509.25370); phase-aligned taxonomies independently reinvented in a different (non-security) domain (arXiv:2508.13143); tau-bench’s pass^k (arXiv:2406.12045) — separates “never clears” from “unreliable,” composable with a phase vector. This general, non-security agent-eval literature is the basis for the verdict below on its own. Academic, cited for context only, not a basis: Cybench subtasks (arXiv:2408.08926), AutoPenBench milestones (arXiv:2410.03225), NYU CTF Bench (arXiv:2406.05590), and EnIGMA’s “soliloquizing” fabrication finding (arXiv:2409.16165, ICML 2025) happen to converge on a near-identical F1–F4-shaped split, which is a reassuring coincidence, not evidence this project’s verdict depends on. | |
| Evidence against it | None — every paper that ships it treats it as strictly additive, diagnostic-only, never a substitute for the terminal check | The 2025–2026 PRM-hacking convergence above (§3); ceiling-capping via HIRO-style subgoal drift. (The observation that domain-specific CTF/pentest training papers are uniformly monolithic when they report strong numbers is academic context only, not part of this basis — see §3.) |
| Honest limits | Matcher/judge reliability is the new bottleneck one level down — an LLM-judge-scored phase check inherits some of the flag-matcher’s fragility (MAST’s own top category is “task verification” failure); require a corroborating TOOL-kind span, not LLM-only reasoning, for any phase claiming environment interaction. Per-stage sample sizes shrink fast in a funnel — apply the same pass@k confidence-interval discipline already used project-wide. Phase credit can mislead if not cross-checked against the final flag (treat it as diagnostic under flag_verified, never a replacement). | Safe only via a provably policy-invariant mechanism (potential-based shaping / RUDDER) — anything softer (a per-step LLM-judge “does this look like competent recon” score) inherits a decade-plus of documented gaming behavior. |
| Verdict | Yes, unconditionally, do it now. | Conditional — see §5. |
The theory’s own framing of why these are different decisions: per-stage evaluation is just better logging — nothing is being optimized against it, so it carries none of the correctness burden. Per-stage reward is where every failure mode above lives, because now something in the loop is being optimized against the signal. This is why the project brief is right to force these into two separate decisions.
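The eval column leans on tau-bench's pass^k to separate "never clears" from "unreliable". A minimal sketch of the two standard combinatorial estimators, assuming n independent attempts per challenge of which c were flag-verified (function names are illustrative, not this project's API):

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n):
        raise ValueError("need 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("need 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k sampled attempts succeeds."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated probability that all k sampled attempts succeed (pass^k)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


# Three hypothetical challenges, 10 attempts each.
never_clears = (pass_at_k(10, 0, 5), pass_hat_k(10, 0, 5))
unreliable = (pass_at_k(10, 3, 5), pass_hat_k(10, 3, 5))
solid = (pass_at_k(10, 10, 5), pass_hat_k(10, 10, 5))
```

A challenge with high pass@k and near-zero pass^k is one the policy can solve but does not solve consistently. A challenge at zero on both has never been cleared. The same pair can be computed per phase by substituting "phase reached" for "flag verified".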
The one empirical finding that turns this from philosophy into an operational rule. A controlled study across the RL design space on TravelPlanner finds: “reward and algorithm choices are scale-dependent — smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense [outcome-only-adjacent] rewards.” arXiv:2603.21972. This is directly checkable via the eval funnel: is the current baseline “occasionally stumbles onto stage 3” (favors staged help) or “reliably reaches stage 3/4, fails to convert” (favors leaving outcome reward alone and attacking execution depth via data/SFT)? The project’s own diagnosis — “largely an execution gap” — leans toward the latter, but this is an empirical call the funnel should confirm, not an assumption to bake in from literature alone.
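Telling "occasionally stumbles onto stage 3" apart from "reliably reaches stage 3" on a shrinking funnel sample calls for an interval rather than a point estimate. A minimal sketch using the Wilson score interval; the counts in the example are made up for illustration, and the choice of Wilson over other binomial intervals is an assumption, not the project's stated method.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not (0 <= successes <= trials):
        raise ValueError("need 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Same observed 15% stage-3 reach rate, very different certainty.
small_sample = wilson_interval(6, 40)
large_sample = wilson_interval(600, 4000)
```

With 6 of 40 the interval is wide enough that "rarely" and "fairly often" are both still plausible, so a routing decision made on it is premature. The per-stage counts deep in a funnel are usually closer to the small case than the large one.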
5. The verdict for this project
| Question | Verdict | Confidence |
|---|---|---|
| EVAL-decomposition: measure recon / endpoint-discovery / vuln-ID / exploit / pivot independently, on top of flag_verified? | Yes. Do it now, unconditionally. Near-free (read-only pass over existing traces), zero effect on training dynamics, independently reinvented by every serious CTF/agent benchmark that hit this problem before this project. | High |
| TRAINING-decomposition — replace/augment the terminal flag reward with four separate per-stage rewards, curricula, or policies? | No, not as a wholesale redesign — but a narrow, provably-safe form (potential-based milestone shaping, layered on top of the flag reward, never instead of it) earns its keep once the eval funnel shows an exploration-dominated bottleneck. | Medium (conditional, not universal) |
The concrete next step this whole verdict depends on
The project’s own PTES matcher schema (benchmark/challenges/*/challenge.json, ptes.<phase>.steps[]) was arrived at independently, before this literature review, on the general non-security basis in §4 (AgentBoard, MAST, AgentErrorTaxonomy). Academic, cited for context only, not a basis: it happens to structurally resemble the Cybench-subtask / AutoPenBench-milestone design too. What’s missing is aggregation: run the existing triage subagent over completed runs and emit one funnel row per challenge (five phase-reached booleans + an exploit-given-vuln-found conditional rate), rolled up into a corpus-level funnel. This single aggregation is the input every downstream decision below depends on — it settles empirically whether the ~100-200/1000 solve rate is F1-dominated, F2/F3-dominated, or F4-dominated, which the flag_verified column alone cannot supply no matter how much data accumulates.
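The aggregation described above can be sketched as follows. This is an illustration only: the per-run record shape (`challenge`, `phases`, `flag_verified`) and the phase key names are assumptions made for the example, not the triage subagent's actual output or the real `ptes.<phase>.steps[]` schema.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical phase keys, in funnel order.
PHASES = ["recon", "endpoint_discovery", "vuln_id", "exploit", "pivot"]


def funnel_row(challenge_id: str, runs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """One row per challenge: five phase-reached booleans plus P(exploit | vuln found)."""
    reached = {p: any(r["phases"].get(p, False) for r in runs) for p in PHASES}
    vuln_runs = [r for r in runs if r["phases"].get("vuln_id", False)]
    exploit_given_vuln = (
        sum(bool(r["phases"].get("exploit", False)) for r in vuln_runs) / len(vuln_runs)
        if vuln_runs
        else None  # undefined when no run ever identified the vulnerability
    )
    return {
        "challenge": challenge_id,
        **reached,
        "exploit_given_vuln": exploit_given_vuln,
        "flag_verified": any(r["flag_verified"] for r in runs),
    }


def corpus_funnel(rows: List[Dict[str, Any]]) -> Dict[str, float]:
    """Fraction of challenges on which each phase was reached at least once."""
    if not rows:
        raise ValueError("no funnel rows to aggregate")
    return {p: sum(bool(row[p]) for row in rows) / len(rows) for p in PHASES}


# Made-up runs for two challenges.
runs = [
    {"challenge": "web-01", "flag_verified": False,
     "phases": {"recon": True, "endpoint_discovery": True, "vuln_id": True}},
    {"challenge": "web-01", "flag_verified": True,
     "phases": {"recon": True, "endpoint_discovery": True, "vuln_id": True, "exploit": True}},
    {"challenge": "web-02", "flag_verified": False, "phases": {"recon": True}},
]

by_challenge: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
for run in runs:
    by_challenge[run["challenge"]].append(run)

rows = [funnel_row(cid, rs) for cid, rs in sorted(by_challenge.items())]
funnel = corpus_funnel(rows)
```

Where the corpus-level rates drop is the bottleneck signal: a cliff at `endpoint_discovery` points at F1, a low `exploit_given_vuln` points at F2/F3, and a cliff at `pivot` points at F4. The `flag_verified` column is carried along so phase credit can always be cross-checked against the terminal signal.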
The recommended recipe, cheapest and safest first
- Decompose the eval fully, now, unconditionally (§4). Cross with the project’s own pass@k methodology rather than one aggregate pass@5, per tau-bench’s pass^k precedent.
- Keep the terminal flag reward as the ground-truth backbone, unconditionally. Nothing in this dossier argues for demoting it below a milestone signal.
- Run rejection-sampling SFT on own verified solves as already planned — but keep it light. The general, non-security basis for this move is the frontier self-training line: STaR (arXiv:2203.14465, Zelikman et al. 2022) — the seminal “generate, keep only what’s verified correct, fine-tune, repeat” loop — and ReST^EM (arXiv:2312.06585, Singh et al., DeepMind 2023) — the frontier-lab scaling result showing this expectation-maximization-style self-training on a model’s own correct samples beats training on human data alone, on math/code reasoning, no cybersecurity domain involved. Academic, cited for context only, not a basis: CTF-Dojo and Cyber-Zero report the same pattern (~500 verified trajectories → double-digit gains, no staged reward) inside the CTF/pentest domain specifically — a reassuring domain-match, not the reason to do this. Respect the Llama 4 warning: don’t over-train on the easy/repetitive subset — it narrows the exploration space the subsequent RL stage needs.
- Add curriculum sequencing before touching the reward function at all. The single lowest-risk lever available — no new reward, hence none of §3’s hacking surface. Order by whichever axis the funnel identifies as the bottleneck.
- Only if the funnel shows an F1 (exploration)-dominated bottleneck, and only via a ground-truth mechanism: add potential-based milestone shaping on top of the terminal reward. Define Φ(s) as a monotonic count of deterministically-verified stage completions (same verification contract the flag oracle already uses — server-side checks, not judge opinions), paid once per stage-transition, never re-collectable. Do not build this for stage 3 (vuln identification) specifically — VPR’s own authors flag exactly this stage-shape (“identify which of several candidates is vulnerable”) as the “open, unstructured” regime their method doesn’t yet solve well; keep that stage eval-only until a genuine deterministic check exists.
- Explicitly do NOT build a learned/LLM-judge per-stage reward model. Every citation in §3’s honest-limits section converges on this being the failure mode to avoid.
- If the funnel instead shows an F2/F3 (execution-depth / tool-policy)-dominated bottleneck — which the project’s own current diagnosis (“largely an execution gap”) suggests is more likely — the evidence base points away from reward decomposition and toward better trajectory curation and more/better SFT data, not a training-loop change.
- When entropy collapses under GRPO (the project’s own stated graduation trigger), watch stage-transition tokens specifically — a badly-shaped milestone reward is an easy, low-entropy shortcut to farm, and would accelerate collapse.
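The potential-based milestone shaping item in the recipe above can be sketched as a small per-episode object. This is a minimal sketch under stated assumptions: the stage names are hypothetical, `newly_verified` is assumed to come from a deterministic server-side check, and zeroing the potential at the terminal step is the standard condition for episodic shaping rather than something this project has specified.

```python
from typing import Iterable, Set

# Hypothetical stage names. Stage 3 (vulnerability identification) is deliberately
# excluded and stays eval-only, as the recipe recommends.
SHAPED_STAGES = frozenset({"recon", "endpoint_discovery", "exploit", "pivot"})


class MilestoneShaper:
    """Shaped reward = flag_reward + gamma * Phi(s') - Phi(s),
    where Phi counts distinct, deterministically verified stage completions."""

    def __init__(self, gamma: float = 1.0) -> None:
        self.gamma = gamma
        self.verified: Set[str] = set()

    def potential(self) -> float:
        # Monotonic: a set only grows, so a stage can never be paid twice.
        return float(len(self.verified))

    def step(self, flag_reward: float, newly_verified: Iterable[str], terminal: bool) -> float:
        phi_before = self.potential()
        self.verified |= set(newly_verified) & SHAPED_STAGES
        # Potential is defined as 0 at the terminal state, so the bonus
        # collected along the way is paid back when the episode ends.
        phi_after = 0.0 if terminal else self.potential()
        return flag_reward + self.gamma * phi_after - phi_before


# One made-up episode: (flag_reward, stages verified this turn, terminal?)
episode = [
    (0.0, ["recon"], False),
    (0.0, ["recon"], False),               # re-reporting a stage earns nothing
    (0.0, ["vuln_id"], False),             # unshaped stage is ignored
    (0.0, ["endpoint_discovery"], False),
    (1.0, [], True),                       # flag verified on the last turn
]
shaper = MilestoneShaper(gamma=1.0)
shaped = [shaper.step(r, v, t) for r, v, t in episode]
```

With γ = 1 the shaped rewards sum to exactly the flag reward, so the milestone signal changes when credit arrives, not how much an episode is worth. A bonus that did not net out this way would be an additive per-stage reward, which is the form the reward-hacking literature in §3 warns about.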
Deliberately not recommended as a first move: standing up four independently-trained sub-policies with four separate critics (the full options/ArCHer/HiPER/FeUdal-style architectural decomposition). Real, actively converging in the literature, and MiRA’s 6.4%→43.0% number is the strongest existence-proof in this whole dossier that monolithic reward can leave a large gap on the table — but every one of these is validated on web-navigation or generic agentic benchmarks with crisp, cheap-to-verify milestones, not on an offensive-security CTF corpus. Highest-upside, least-validated-for-this-domain lever here — a candidate for a later, small, gated experiment, not step one.
The decision, as a diagram
flowchart TD
Start["Failing challenge / corpus\nunder diagnosis"] --> EvalDecomp["Step 1 — decompose the EVAL\n(PTES funnel, near-free)\ndo this unconditionally"]
EvalDecomp --> Funnel{"Funnel shows which\nbottleneck dominates?"}
Funnel -->|"F1: rarely reaches\nthe vulnerable endpoint"| ScaleCheck{"arXiv:2603.21972 —\nweak policy, capacity-limited?"}
Funnel -->|"F2/F3: reaches it,\nfails to convert / clumsy tools"| Mono["Stay monolithic.\nInvest in trajectory curation +\nrejection-sampling SFT data\n(STaR / ReST-EM pattern)"]
Funnel -->|"F4: no pivot after\na foothold"| Curric["Curriculum first\n(1-hop before 2-hop pivot chains,\nno reward change)"]
ScaleCheck -->|"yes — occasionally\nstumbles onto it"| Shape["Potential-based milestone\nshaping ON TOP OF the flag reward\n(Ng/Harada/Russell 1999 — provably\npolicy-invariant), NOT stage 3"]
ScaleCheck -->|"no — already capable,\njust unreliable"| Mono
Shape --> Guard["Guard: ground-truth verifier only,\nnever a learned/LLM-judge score\n(PURE / Reward-Under-Attack / MONA)"]
Mono --> Entropy["Watch entropy at GRPO\ngraduation regardless of path taken"]
Curric --> Entropy
Guard --> Entropy
classDef safe fill:#132b22,stroke:#34d399,color:#eafaf3;
classDef risk fill:#3a1414,stroke:#f87171,color:#fde8e8;
class EvalDecomp,Curric,Mono safe;
class Shape,Guard risk;
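The same routing can be written as a plain function, which is convenient for applying it per challenge subtype once funnel rows exist. A sketch that only mirrors the diagram; the enum and the returned strings are illustrative labels, not project identifiers.

```python
from enum import Enum
from typing import List


class Bottleneck(Enum):
    F1 = "rarely reaches the vulnerable endpoint"
    F2_F3 = "reaches it, fails to convert / clumsy tools"
    F4 = "no pivot after a foothold"


EVAL_FIRST = "decompose the eval (PTES funnel)"
MONO = "stay monolithic: trajectory curation + rejection-sampling SFT data"
CURRICULUM = "curriculum first: 1-hop before 2-hop pivot chains, no reward change"
SHAPING = (
    "potential-based milestone shaping on top of the flag reward, "
    "ground-truth verifier only, not stage 3"
)
WATCH_ENTROPY = "watch entropy at GRPO graduation"


def route(bottleneck: Bottleneck, weak_policy: bool = False) -> List[str]:
    """Return the ordered steps the diagram prescribes.
    `weak_policy` is the scale check and is only consulted on the F1 branch."""
    if bottleneck is Bottleneck.F1:
        action = SHAPING if weak_policy else MONO
    elif bottleneck is Bottleneck.F2_F3:
        action = MONO
    elif bottleneck is Bottleneck.F4:
        action = CURRICULUM
    else:
        raise ValueError(f"unknown bottleneck: {bottleneck!r}")
    return [EVAL_FIRST, action, WATCH_ENTROPY]
```

Every path starts with the eval funnel and ends with the entropy watch; only the middle step varies, and shaping is reachable from exactly one branch.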
Contested point, stated plainly: there exists one paper doing genuine kill-chain-staged RL reward inside an LLM+RL hybrid (arXiv:2605.17075, cyber-defense red-teaming) that on its face looks like support for training-decomposition on an adjacent task. Per the project’s standing rule it carries no evidentiary weight here regardless — it’s academic cybersecurity-LLM work, cited for context only. The Medium-confidence, conditional verdict above does not rest on it; it rests on the general theory (potential-based shaping’s policy-invariance guarantee, HIRO’s ceiling-capping mechanism, the PRM-hacking convergence) and on this project’s own diagnosis. Demoting this citation changes nothing about the verdict.
Cross-links
- Diagnosing the gap — a scientific framework — the pass@k / Pass@(k,T) / Cover@τ protocol that tells you which gap (knowledge / execution / exploration) a challenge subtype actually has; run this before deciding whether an F1-dominated funnel result calls for exploration-RL or just more samples. That chapter’s routing test and this chapter’s eval-funnel are complementary diagnostics, not competing ones.
- RL that creates value — long-horizon, exploration, reasoning, novelty — the mechanics of how to fix an exploration or credit-assignment gap once diagnosed here (GiGPO step-level credit, DAPO, entropy instrumentation, ArCHer, curriculum-band filtering) — this chapter answers whether to decompose; that one answers how to execute the fix on the training-loop mechanics.
- Agentic & multi-turn RL — the missing category — the training-loop shape (turn as the unit of advantage) that any of §5’s mechanisms (potential-based shaping, curriculum) has to be implemented inside.
- Contested edges & landmines — the “does RL create capability or just amplify it” fight this chapter’s scale-dependence finding (arXiv:2603.21972) directly informs.
Bibliography (all traced to a verified source file, 2026-07-02)
| Citation | arXiv / DOI | Role here |
|---|---|---|
| Sutton, “The Bitter Lesson” | no arxiv; incompleteideas.net | Don’t hand-author structure that plateaus |
| DeepSeek-R1 / R1-Zero | 2501.12948 | Rejects neural PRM at scale; reward-reliability admission |
| OpenAI Deep Research | system card, no arxiv | End-to-end beats manual orchestration |
| Llama 4 post-training | ai.meta.com blog, no arxiv | Heavy SFT/DPO caps RL exploration ceiling |
| Kimi k1.5 | 2501.12599 | Outcome-only + long context beats PRM/MCTS/value-fn |
| Go-Explore | 1901.10995 / Nature s41586-020-03157-9 | Pure outcome RL can structurally fail to find sparse reward |
| OpenAI Five | 1912.06680 | Long-horizon precedent needed huge scale + denser reward |
| Options framework (Sutton/Precup/Singh 1999) | AIJ 112, DOI 10.1016/S0004-3702(99)00052-1 | Seminal HRL / temporal abstraction |
| FeUdal Networks | 1703.01161 | Manager/Worker HRL, option-collapse fix |
| ArCHer | 2402.19446 | LLM-native 2-level value function HRL |
| HiPER | 2602.16165 | Hierarchical advantage estimation, +6.6–8.3% |
| MiRA | 2603.19685 | Milestone reward, 6.4%→43% WebArena-Lite |
| Ng, Harada, Russell — reward shaping | ICML 1999, no arxiv | Potential-based shaping theorem (policy-invariant) |
| Müller & Kudenko | 2502.01307 | PBRS effectiveness depends on potential scaling |
| RUDDER | 1806.07857 | Learned return-equivalent redistribution |
| Randlov & Alstrom (bicycle shaping) | ICML 1998, no arxiv | Canonical non-potential-based shaping failure |
| Verifiable Process Rewards (VPR) | 2605.10325 | Safe ground-truth process reward, open-env caveat for stage 3 |
| CM2 checklist rewards | 2602.12268 | Checklist-style verifiable sub-criteria |
| Curriculum Learning (Bengio et al.) | ICML 2009, no arxiv | Foundational curriculum citation |
| h1 | 2510.07312 | Curriculum + outcome-only, exponential sample-complexity gain |
| FastCuRL | 2503.17287 | Context-length curriculum, entropy-collapse timing |
| BPO | 2508.03018 | Curriculum + rejection-sampling refine, near-identical to project plan |
| PURE / Stop Summation | 2504.15275 | Sum-form PRM hacking mechanism, named |
| Reward Under Attack | 2603.06621 | PRMs as fluency detectors, adversarial hackability |
| Gao et al., designing RL reward | 2410.15115 | Learned PRM+success reward can hurt vs success-only |
| PRIME | 2502.01456 | Authors’ own admission of PRM hacking vulnerability |
| MONA | 2501.13011 | Multi-step reward hacking even with no bad-looking single step |
| HIRO | 1805.08296 | Off-policy correction, HRL non-stationarity / ceiling-capping |
| AgentBoard | 2401.13178 | Progress-rate metric, general capability-decomposition principle |
| MAST | 2503.13657 | 14-mode/3-category failure taxonomy |
| AgentErrorTaxonomy / AgentDebug | 2509.25370 | Root-cause diagnosis gains without reward change |
| Phase-aligned taxonomy (autonomous agents) | 2508.13143 | Independent-domain convergence on phase-keyed failure |
| Cybench (academic — context only, not a basis) | 2408.08926 | Subtask decomposition, eval-only — reassuring convergence with AgentBoard/MAST, not evidence relied on |
| AutoPenBench (academic — context only, not a basis) | 2410.03225 | Milestone taxonomy near-matching F1–F4, eval-only — same caveat |
| NYU CTF Bench (academic — context only, not a basis) | 2406.05590 | CTF benchmark family |
| EnIGMA (academic — context only, not a basis) | 2409.16165 | “Soliloquizing” fabrication failure mode |
| tau-bench | 2406.12045 | pass^k reliability decomposition |
| InterCode-CTF (academic — context only, not a basis) | 2306.14898 | Seminal monolithic-reward CTF environment |
| CTF-Dojo (academic — context only, not a basis) | 2508.18370 | Monolithic rejection-sampling SFT, +11.6% — domain-match only; real basis is STaR/ReST-EM below |
| Cyber-Zero (academic — context only, not a basis) | 2508.00910 | Monolithic, simulated env, +13.1% — same caveat |
| Pentest-R1 (academic — context only, not a basis) | 2508.07382 | Two-stage curriculum, monolithic per-stage reward |
| HackSynth-GRPO (academic — context only, not a basis) | 2506.02048 | Outcome-only GRPO sufficient for single-stage CTF |
| STaR | 2203.14465 | Seminal general (non-security) rejection-sampling self-training loop; basis for §5’s SFT-on-own-solves recipe |
| ReST-EM (“Beyond Human Data”) | 2312.06585 | DeepMind frontier-lab scaling result for self-training on own correct samples, math/code domain |
| Kill-chain-staged reward (red-teaming) (academic cybersecurity-LLM — context only, not a basis) | 2605.17075 | The one LLM+RL staged-reward paper, adjacent domain, un-ablated |
| DRLRM-PT (reward machine, pentest) (academic — context only, not a basis) | DOI 10.1109/ijcnn60899.2024.10650368 | Classical RL, staged reward helps, non-LLM regime |
| Node-fragility reward shaping (academic — context only, not a basis) | DOI 10.3390/electronics13214311 | Classical dense-reward pentest, non-LLM regime |
| DAgger | 1011.0686 | Compounding error / distribution drift theory |
| GAE | 1506.02438 | Bias/variance dial for advantage estimation |
| Credit Assignment survey | 2312.01072 | Separates credit assignment from exploration |
| Demystifying long-horizon tool-use RL | 2603.21972 | Scale-dependence: staged reward helps weak models only |
The decision
The whole book collapses to one diagnostic and its branches. The routing question — does the correct action ever appear in π_θ’s own outputs at high N? — is what separates a knowledge gap (inject off-policy) from an execution gap (on-policy) from a ranking gap (preference). The principle underneath is “GRPO amplifies existing capabilities, SFT replaces them” (arXiv:2507.10616): you can only reinforce what already fires.
That routing question has a sharper, task-structure-dependent version once you sample at high N and track when in the episode the correct action shows up — an execution gap can hide an exploration gap underneath it. The tree below now branches on that.
The rigorous version of this page lives in From behavioral audit to training signal. That chapter runs all 5 BSides patterns through gap-type → training-signal → verification-check, cites the pass@k / Pass@(k,T) / Cover@τ instrumentation this page’s routing question is a shorthand for, and is where you go before trusting any single branch below as final for a given challenge.
graph TD
Q0["Does the correct action ever appear in π_θ's<br/>own outputs — even rarely — at high N?"]
Q0 -->|"Never, not once"| K["KNOWLEDGE gap"]
Q0 -->|"Sometimes (solves occasionally)"| E{"Have a model that's<br/>actually better at your CTFs?"}
Q0 -->|"Knows it, mis-picks / quits early"| R["RANKING gap"]
K --> KM["Inject OFF-POLICY:<br/>SFT / teacher data / a TOOL<br/>(RL can't cheaply conjure it)"]
E -->|Yes| DIST["On-policy distillation<br/>dense · ~10× cheaper than RL<br/>(promising, not yet lab-proven)"]
E -->|"No, only a verifier"| Q1{"Is the winning path single-shot,<br/>or compositional / sequentially-gated<br/>(enumeration must land before<br/>exploitation is even visible)?"}
R --> RM["DPO / KTO<br/>(KTO for unpaired good/bad logs)"]
Q1 -->|"Single-shot — recon isn't gating"| RS["Rejection-sampling SFT<br/>→ GRPO when entropy collapses"]
Q1 -->|"Sequentially-gated —<br/>PTES pattern 4: 62% of<br/>failures stall in exploitation"| XG["EXPLORATION gap:<br/>reaches the right region,<br/>then collapses / never enumerates"]
XG --> XGM["On-policy RL FIRST, not more SFT —<br/>matched-data SFT regresses this exact<br/>subset (net −4) while RL expands it<br/>(net +4): self-directed exploration<br/>during rollout is the causal ingredient,<br/>not demonstrations"]
classDef your fill:#132b22,stroke:#34d399,color:#eafaf3;
class E,Q1,RS,XG your;
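The tree above can be read as a small routing function. The sketch below is one possible encoding, not the project's tooling: the `ChallengeEvidence` fields and the `route` function are hypothetical names, and each boolean is assumed to come from a manual or scripted label on high-N rollouts of a single challenge.

```python
from dataclasses import dataclass


@dataclass
class ChallengeEvidence:
    """Labels for one challenge after high-N sampling (all field names are hypothetical)."""
    correct_action_appears: bool  # the correct action shows up in the policy's own outputs
    knows_but_mispicks: bool      # it appears, but the agent mis-picks or quits early
    has_stronger_teacher: bool    # a model that is actually better at these CTFs exists
    sequentially_gated: bool      # early enumeration must land before the exploit is visible


def route(ev: ChallengeEvidence) -> str:
    """Walk the decision tree top-down and return the branch it ends on."""
    if not ev.correct_action_appears:
        return "knowledge gap: inject off-policy (SFT / teacher data / a tool)"
    if ev.knows_but_mispicks:
        return "ranking gap: DPO / KTO"
    # From here on the challenge is solved occasionally: an execution gap.
    if ev.has_stronger_teacher:
        return "execution gap: on-policy distillation"
    if ev.sequentially_gated:
        return "exploration gap: on-policy RL first, not more SFT"
    return "execution gap: rejection-sampling SFT, then GRPO when entropy collapses"


if __name__ == "__main__":
    print(route(ChallengeEvidence(True, False, False, True)))
```

The order of the checks mirrors the diagram: the knowledge-gap test comes first because nothing downstream applies if the correct action never fires.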
Reading the branches
- Knowledge gap — nothing on-policy to reinforce. Inject by demonstration, or cheaper, put the missing fact in a tool (“knowledge in tools, not weights” — llmresearch-handbook.md rule 5). Standard RLVR won’t cheaply create it — see the contested boundary in Contested edges (Yue et al., arXiv:2504.13837).
- Execution gap + stronger teacher — on-policy distillation: dense signal on your own rollouts (GKD arXiv:2306.13649); flagged promising, not lab-confirmed.
- Execution gap + only a verifier, single-shot path — the common case. Rejection-sampling SFT on your ~100–200 solves → graduate to GRPO/RLVR when policy entropy collapses (arXiv:2504.11343).
- Exploration gap — an execution gap that’s actually sequentially-gated [E][L][N] — the naive test (“solves occasionally at high N”) says execution gap, but if the winning path requires correct enumeration at turn 5 before turn 40’s exploit is even visible, that’s a different beast. Zhai et al.’s Pass@(k,T) analysis, arXiv:2604.14877 (2026-04-16, single-group, promising) found that on this exact task shape (“Category C” — compositional, sequentially-gated retrieval), the RL pass-curve pulls away from the base curve as k grows — real capability expansion — while matched-data SFT on the same task regresses it (net −4 vs RL’s net +4). The causal factor they isolate is self-directed exploration during on-policy rollout, not exposure to more demonstrations. Practically: don’t spend your rejection-sampling budget flattening this subset first — it needs GRPO’s exploration before SFT saturates it, the opposite ordering from the single-shot branch above.
  Designed to fix: patterns 1, 2, 4 — tool-space under-exploration (87.7% curl/shell bypass), no PTES enumeration before pivoting (82% pivot-after-one-failure), and the 62%-of-failures-stall-in-exploitation split. All three are the same mechanism at different granularities: Cui et al.’s entropy law, arXiv:2505.22617 (R = -a·exp(H) + b) predicts the policy trades away exactly this enumeration/tool-diversity budget for reward as entropy collapses — so plain GRPO on this subset needs DAPO-style clip-higher/dynamic sampling (arXiv:2503.14476) from the start, not as a later patch. Full technique-by-pattern mapping (ToolRL, GiGPO, RL-PLUS, NuRL, plus Pentest-R1 — academic, cited for context only, not a basis) is in the diagnosis chapter, not repeated here.
  Contested / not settled: this is the sharpened, task-conditional version of the “RL can’t create capability” debate in Contested edges §1 — Yue et al.’s crossover result (elicit-not-expand) holds on the static/independent-retrieval task shape; Zhai et al.’s result (genuine expansion) holds on the compositional/sequentially-gated shape. Same instrument, opposite conclusion, and the split is the falsifiable variable — segment your own 1000 challenges by this criterion before trusting either reading wholesale.
- Ranking gap — DPO/KTO; KTO fits your unpaired solved/failed logs (arXiv:2402.01306).
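For the ranking-gap branch, it may help to see what DPO optimizes on one (prompt, chosen, rejected) example. The sketch below computes the standard DPO loss from sequence log-probabilities under the policy and a frozen reference. The function names and `beta=0.1` are illustrative assumptions; KTO, which works on unpaired labels, uses a different objective and is not shown.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, tune per run
) -> float:
    """DPO loss for one pair: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return softplus(-margin)  # equals -log(sigmoid(margin))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, so the loss is log(2).
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
```

The loss only falls when the policy raises the chosen response relative to the rejected one, measured against the reference, which is why this branch needs a policy that already produces both candidates.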
One prerequisite before any of this
Your 10–20% aggregate is a k=1 portfolio statistic, not a per-challenge pass rate. Before choosing a branch, run pass@k per challenge and bucket by difficulty. GRPO only gets a learning signal from groups that mix solves and failures, so the 30–60% solve-rate band it needs is a per-challenge property, and a challenge that “solved once” may be a 30% target that got lucky (a prime RL candidate), not a done deal (lessons/post-training/rl-candidate-selection-from-passk.md, shared memory). Diagnose per-challenge, then route.
The single-shot-vs-sequentially-gated split above needs the same per-challenge treatment: label each challenge by whether its winning path is recon-gated (turn-5 enumeration must land before turn-40 exploit is reachable) or not, before running pass@k — the split determines which axis of the tree you’re even on. Two cheap refinements to the pass@k check, both eval-only (no training change): Cover@τ (Dragoi et al., arXiv:2510.08325) flags challenges where a high pass@64 is really “guessable by brute force” rather than genuinely reliable — don’t rejection-sample SFT on those, you’ll just teach confident guessing; and running the base model’s own pass@k as a control (per Yue et al., arXiv:2504.13837) tells you whether a claimed post-SFT gain on a given challenge is elicitation or noise before you credit it to the pipeline.
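A minimal sketch of the per-challenge check, assuming each challenge has `n` sampled rollouts of which `c` passed the verifier. `pass_at_k` is the standard unbiased combinatorial estimator. The bucket names and the cut-offs outside the 30–60% band are illustrative assumptions. `guessing_dominated` is a simplified per-challenge reading of the Cover@τ idea as described here (a τ-fraction of samples must be correct), not the paper's exact definition.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples with c verified solves: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so this is 1.0 whenever a solve is unavoidable.
    return 1.0 - comb(n - c, k) / comb(n, k)


def bucket(n: int, c: int) -> str:
    """Bucket one challenge by its empirical solve rate (labels are hypothetical)."""
    rate = c / n
    if c == 0:
        return "never-solved"    # knowledge-gap suspect at this N
    if rate < 0.30:
        return "rare"
    if rate <= 0.60:
        return "rl-band"         # the 30-60% band the text points GRPO at
    return "mostly-solved"


def guessing_dominated(n: int, c: int, k: int, tau: float = 0.3, high: float = 0.9) -> bool:
    """High pass@k but fewer than a tau-fraction of samples correct."""
    return pass_at_k(n, c, k) >= high and (c / n) < tau


if __name__ == "__main__":
    # 3 solves out of 64: pass@64 saturates, yet the solve rate is under 5%.
    print(pass_at_k(64, 3, 1), pass_at_k(64, 3, 64), bucket(64, 3), guessing_dominated(64, 3, 64))
```

Running the same functions on the base model's rollouts gives the control the paragraph above asks for.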
The interactive version
The same tree, clickable, is in The 5-minute journey (final section) — answer it for your own failing challenges.
Contested edges & landmines
The places where confident-sounding claims are actually unsettled, plus the terminology traps that cause real planning errors. Cited so you can check me.
1. “RL can’t create capability” — contested, not a law
- Elicit-not-expand (the base claim): RLVR raises pass@1 but the base model beats the RL model at large pass@k — the paths RL finds were already in the base distribution; the reasoning boundary narrows with training. Distillation from a stronger teacher does expand it; RL does not (Yue et al., “Does RL Really Incentivize Reasoning Capacity Beyond the Base Model?”, arXiv:2504.13837, NeurIPS 2025). Reproduced for vanilla GRPO by others (e.g. NuRL notes plain GRPO leaves pass@1024 ≈ base, arXiv:2509.25666).
- The counter: prolonged RL + KL control + reference resets expands the boundary even on problems the base never solves (ProRL, arXiv:2505.24864); entropy/exploration bonuses and parameter-space noise (arXiv:2602.02555) report similar effects. Also a metric critique: pass@k over-credits lucky-but-wrong CoT (CoT-Pass@K, arXiv:2506.14245).
- Safe framing: vanilla RLVR at a normal budget elicits; sufficient compute + explicit exploration-preservation can expand — recipe-dependent, not settled. So “you can’t RL your way out of a missing capability” holds for standard-recipe RL only. Don’t state it as a law.
2. The “RFT” terminology landmine
Two different things share the acronym; conflating them mis-scopes a whole plan:
- Rejection-sampling Fine-Tuning (STaR-family) = “RL without RL,” positives-only SFT on your own verified samples. Cheap.
- Reinforcement Fine-Tuning (OpenAI/Fireworks product term — and what the project handbook calls “RFT”) = actual online RL / GRPO against a grader. Expensive.
“Start with RFT before RL” only parses under the first. Say “rejection-sampling SFT” for the cheap thing so you don’t accidentally spec a GRPO run.
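To make the cheap meaning concrete, here is a minimal sketch of rejection-sampling SFT data construction: sample K rollouts per prompt, keep only the verified ones, and emit supervised pairs. `sample` and `verify` are hypothetical callables standing in for the harness and the ground-truth verifier. The `max_keep_per_prompt` cap is an added assumption, so that easy prompts do not dominate the set.

```python
from typing import Callable, Dict, List


def build_rs_sft_set(
    prompts: List[str],
    sample: Callable[[str, int], List[str]],  # hypothetical: (prompt, k) -> k rollouts
    verify: Callable[[str, str], bool],       # hypothetical: ground-truth check per rollout
    k: int = 8,
    max_keep_per_prompt: int = 2,
) -> List[Dict[str, str]]:
    """Positives-only SFT set built from the policy's own verified samples."""
    dataset: List[Dict[str, str]] = []
    for prompt in prompts:
        kept = 0
        for rollout in sample(prompt, k):
            if kept >= max_keep_per_prompt:
                break
            if verify(prompt, rollout):
                dataset.append({"prompt": prompt, "response": rollout})
                kept += 1
    return dataset


if __name__ == "__main__":
    fake_sample = lambda p, k: [f"{p}-{i}" for i in range(k)]
    fake_verify = lambda p, r: r.endswith(("0", "2", "4"))
    print(build_rs_sft_set(["a", "b"], fake_sample, fake_verify, k=5))
```

No gradient flows through the sampler or the verifier here, which is the sense in which this is “RL without RL”: the only training step is ordinary SFT on the returned pairs.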
3. On-policy distillation is promising, not lab-proven
Earlier framing called it “the sleeper.” Correction after a 2026 verification pass: the efficiency numbers (~9–30× cheaper than RL) come from GKD + Thinking Machines’ own blog (arXiv:2306.13649; thinkingmachines.ai 2025-10-27). No frontier lab has stated it as their production recipe. Real and attractive; treat the numbers as directional, not settled.
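As a rough illustration of why the signal is dense: in on-policy distillation the student samples its own rollout and the teacher scores every token of it. One commonly used choice is the per-token reverse KL between student and teacher distributions, sketched below on plain probability lists. GKD covers a family of divergences, so treat reverse KL as an assumption; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def reverse_kl(student: Sequence[float], teacher: Sequence[float]) -> float:
    """KL(student || teacher) at one token position over a shared vocabulary."""
    if len(student) != len(teacher):
        raise ValueError("distributions must share a vocabulary")
    total = 0.0
    for p_s, p_t in zip(student, teacher):
        if p_s == 0.0:
            continue          # a zero-mass student entry contributes nothing
        if p_t == 0.0:
            return math.inf   # student puts mass where the teacher puts none
        total += p_s * math.log(p_s / p_t)
    return total


def per_token_signal(
    student_steps: Sequence[Sequence[float]],
    teacher_steps: Sequence[Sequence[float]],
) -> List[float]:
    """One penalty per token of a student-sampled rollout (lower is better)."""
    return [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]


if __name__ == "__main__":
    student = [[0.9, 0.1], [0.5, 0.5]]
    teacher = [[0.5, 0.5], [0.5, 0.5]]
    print(per_token_signal(student, teacher))
```

A 100-turn episode yields one such value per generated token, compared with the single terminal flag reward an outcome-only RL run would see.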
4. Don’t over-SFT before RL (2026 lesson)
Meta’s Llama 4 recipe deliberately keeps SFT and DPO lightweight around an intensive online-RL core, with the explicit finding that heavy SFT/DPO restricts RL exploration (ai.meta.com/blog/llama-4-multimodal-intelligence). If RL is your capability driver, a big SFT stage can cap your ceiling — counter to the naive “more SFT is safer.”
5. Reward must be ground-truth-verified, never format-matched
A project-empirical finding: SFT on trajectory data trained the model to emit FLAG{…}-shaped strings on unsolved challenges — confabulation — and a loose regex matcher fired on the model’s own reasoning/tool-args, not real server output (lessons/post-training/sft-induced-flag-confabulation.md, lessons/security-agent/flag-detection-false-positives.md, shared memory). Rule: scan tool output / server state for the flag and verify against ground truth; never reward format. On the gameability ladder, a deterministic verifier (level 1) has no parameters to exploit — stay there; PRM (level 5) was rejected for R1 for exactly this (arXiv:2501.12948).
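The difference between the two reward styles is small in code and large in effect. The sketch below contrasts a format matcher (the failure mode described above) with a verifier that scans only real tool output for the exact ground-truth flag. The function names and the `FLAG{...}` pattern are illustrative assumptions, not the project's actual harness.

```python
import re
from typing import Iterable

FLAG_SHAPE = re.compile(r"FLAG\{[^}]+\}")  # assumed flag format, for illustration only


def format_match(text: str) -> bool:
    """Anti-pattern: fires on anything flag-shaped, including the model's own reasoning."""
    return FLAG_SHAPE.search(text) is not None


def verified_reward(tool_outputs: Iterable[str], ground_truth_flag: str) -> int:
    """Return 1 only if the exact ground-truth flag appears in tool or server output."""
    if not ground_truth_flag:
        raise ValueError("ground-truth flag must be non-empty")
    for output in tool_outputs:
        if ground_truth_flag in output:
            return 1
    return 0


if __name__ == "__main__":
    truth = "FLAG{real_one}"
    print(format_match("I think the flag is FLAG{made_up}"))     # True: confabulation rewarded
    print(verified_reward(["FLAG{made_up}"], truth))              # 0: wrong flag, no reward
    print(verified_reward(["uid=0(root)", "FLAG{real_one}\n"], truth))  # 1
```

The verifier never sees the model's reasoning or tool arguments, so a confabulated string can only score if it is byte-identical to the real flag and came back from the environment.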
6. “Three knobs” was a teaching scaffold
The on/off-policy axis and the imitation/preference/reward paradigm split are canonical. Packaging them as “N independent knobs you toggle” is a scaffold that over-reached — the axes aren’t independent (signal + policy largely determine what changes), so free combinations produce non-methods. Learn the one axis + the fixed method presets, not a combinatorial grid.
7. Pass@k-as-diagnostic has its own landmines — don’t trust a bare crossover plot [E][R]
Point 1’s crossover test (base pass@large-k beats RL pass@large-k → RL only reweighted, didn’t teach) is the closest thing to a standard instrument in this literature, but it is contested on at least four fronts, each of which changes what conclusion you’re entitled to draw from your own portfolio’s pass@k curves:
- The metric credits lucky final answers. CoT-Pass@K, arXiv:2506.14245 shows pass@k gives full credit to a correct terminal answer reached via a wrong reasoning chain — once you require the CoT itself to be correct, the base-vs-RL crossover disappears and RLVR shows monotonic gains at every k. This directly re-opens point 1’s “elicit vs expand” question: some of the reported “RL doesn’t expand the boundary” results may be an artifact of not checking how the answer was reached. Your flag verifier removes the lucky-final-string confound (exact match, not “42” appearing by chance) — but a verifier-passed CTF trajectory can still contain wasted turns or an ungrounded critical guess before the winning move (BSides pattern 3), so the same confound reappears one level up: filtering your rejection-sampling SFT set on flag==1 alone is the CTF-domain analogue of trusting bare pass@k.
- Pass@k at large k conflates “solvable” with “brute-force-guessable.” Cover@τ, arXiv:2510.08325 proposes requiring a τ-fraction of samples (not just ≥1 of many) to be correct — reordering which RLVR algorithms “win” once you penalize guessing instead of rewarding eventual luck. Directly relevant to BSides pattern 5 (benchmarks measure pattern-match speed, not thoroughness): a challenge with high pass@64 but near-zero Cover@0.3 is guessing-dominated, and training on its “wins” teaches confident guessing, not competence.
- Optimizing pass@k directly has a vanishing-gradient trap. Naive pass@k-as-objective is mathematically a per-example reweighting of plain pass@1 whose gradient goes to zero exactly where exploration is most needed — once the policy concentrates, pass@k and pass@1 converge and there’s nothing left to reweight (arXiv:2511.16231). PKPO, arXiv:2505.15201 fixes this with an unbiased, low-variance pass@k gradient estimator — so “just add a pass@k reward” is not a free entropy-preservation trick; it needs the right estimator or it’s a no-op past the point you needed it.
- The agentic extension changes the verdict entirely. Pass@(k,T), arXiv:2604.14877 — a two-axis metric varying sampling budget k and interaction-depth T — finds the static-reasoning crossover (point 1) is task-structure-dependent: on tasks needing compositional, sequentially-gated information-gathering, the RL pass-curve pulls above and away from base as k grows (the opposite of Yue et al.’s finding), while on independent-retrieval tasks the effect is small, replicating Yue et al. This is the strongest bridge between the pure-reasoning literature and this project’s own agentic setting — see point 9 below.
Designed to fix: pattern 5 (benchmarks measure pattern-match speed, not thoroughness) — CoT-Pass@K and Cover@τ are both direct literature-side instantiations of that same audit finding, applied to the training/eval metric itself rather than the benchmark.
Safe framing: treat the pass@1-vs-pass@k gap as a diagnostic to segment by challenge type, not a single portfolio-wide verdict. A shrinking gap with flat pass@1 is the entropy-collapse warning sign (point 8 below), not “the model learned the task.” Never conclude “RL only elicited, didn’t expand” from an aggregate pass@k plot without checking (a) whether the crossover survives a CoT-correctness filter and (b) whether it’s driven by the compositional or independent-task subset.
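One way to act on this framing is to compute the trained-minus-base pass@k gap separately for each challenge segment instead of once for the whole portfolio. The sketch below assumes per-challenge `(n, c)` counts for both models, grouped by a segment label you assign yourself (the label names are hypothetical).

```python
from math import comb
from typing import Dict, List, Tuple

Counts = List[Tuple[int, int]]  # one (n samples, c verified solves) pair per challenge


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(counts: Counts, k: int) -> float:
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)


def gap_by_segment(
    base: Dict[str, Counts], trained: Dict[str, Counts], ks: List[int]
) -> Dict[str, Dict[int, float]]:
    """Per segment, trained minus base mean pass@k at each k."""
    return {
        seg: {k: mean_pass_at_k(trained[seg], k) - mean_pass_at_k(base[seg], k) for k in ks}
        for seg in base
    }


if __name__ == "__main__":
    base = {"independent": [(8, 1), (8, 1)], "gated": [(8, 0), (8, 1)]}
    trained = {"independent": [(8, 4), (8, 0)], "gated": [(8, 2), (8, 3)]}
    print(gap_by_segment(base, trained, ks=[1, 8]))
```

In this toy data the "independent" segment shows a positive gap at k=1 that turns negative at k=8 (the crossover pattern), while the "gated" segment stays positive at both; an aggregate plot would blur the two. A CoT-correctness filter would be applied upstream, when deciding which rollouts count toward `c`.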
8. Long-horizon credit assignment: turn-level vs trajectory-level vs sequence-level — no consensus on the right granularity [L]
Every flagship reasoning-RL recipe (GRPO, DAPO, Dr.GRPO) assigns one advantage to the whole trajectory — fine for a single-turn math answer, but it starts to matter once episodes run 100 turns with a terminal-only flag reward. The literature has responded with at least three different, non-convergent fixes, each validated on a different (mostly non-CTF, mostly short) domain:
- Go finer — turn-level advantage. arXiv:2505.11821 shows trajectory-level GRPO applied naively to multi-turn tool-use can fail to teach tool invocation at all (baselines get 20-30% exact-match and never learn to call tools; their turn-level MT-GRPO variant hits 100% tool-execution success). Turn-PPO, arXiv:2512.17008 independently argues PPO-with-a-critic, reformulated so the MDP’s base unit is a turn (not a token), is more robust than GRPO for long-horizon agentic tasks — a direct challenge to defaulting to critic-free GRPO. Both are recent, small-scale (workshop poster / 0-citation preprint, toy benchmarks WebShop/Sokoban) — promising, not settled.
- Go coarser — sequence-level ratio. GSPO, arXiv:2507.18071 goes the opposite direction: clip and optimize at the whole-response (sequence) level instead of per-token, because token-level importance ratios compound multiplicatively over long sequences and destabilize MoE RL training at scale — credited with keeping Qwen3’s RL stage stable. Backed by a shipped frontier model, not a toy benchmark — stronger evidence than Turn-PPO’s, but solving a different problem (numerical stability of the ratio, not credit assignment across turns): the two proposals are not mutually exclusive, but they are not the same fix either.
- Fix the critic, don’t change the granularity. VAPO, arXiv:2504.05118 keeps trajectory-level PPO but argues the real problem is an unreliable critic at long/heterogeneous horizons — fixed with value-model pretraining + length-adaptive GAE (λ tuned per response length) — reporting zero training crashes across independent runs, directly disputing the “value-based RL is unstable for LLM reasoning” folklore GRPO was invented to route around.
- Scale the horizon itself, skip the value function entirely. Kimi k1.5, arXiv:2501.12599 treats context/horizon length as a first-class RL scaling axis (not a constraint), using partial rollouts (checkpoint/resume mid-episode) to make 128k-context RL tractable — explicitly avoiding MCTS, value functions, and PRMs. Reframes “the episode is long” as an opportunity, contingent on whether security-agent-<family>/pdq can support partial-rollout checkpointing (unanswered today).
- Curriculum over the horizon. AgentGym-RL / ScalingInter-RL, arXiv:2509.08755 sidesteps the granularity debate entirely: cap the allowed turn budget low early in training and relax it toward the full target (100 turns) as training proceeds, reporting this prevents the long-horizon collapse that training at full horizon from step one causes.
Designed to fix: pattern 1 (agents prefer their own raw tools, 87.7% bypass the rich surface, 26/40 tools dead) — the turn-level papers’ headline finding (trajectory-level GRPO can fail to teach tool invocation at all) is a plausible root-cause mechanism for tool disuse, not just a scaffolding/prompting issue, if this project ever RL-trains on tool-use.
Confidence: contested by construction — no single paper compares turn-level vs sequence-level vs trajectory-level-with-a-better-critic head-to-head on the same long-horizon agentic benchmark. Most of this cluster is 2025 H2–2026 preprints with 0 citations at verification time, validated on toy environments (WebShop, Sokoban) or math/code, not a 100-turn CTF setting. Treat as “candidate designs to pilot cheaply,” not a default architectural choice — and note turn-level, sequence-level, and critic-repair are not mutually exclusive; a future GRPO/RLVR run here could combine GSPO’s sequence-level clipping (stability) with a turn-level auxiliary tool-invocation reward without contradiction.
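The granularity question is easier to see on a toy episode. The sketch below is a generic illustration, not the MT-GRPO or Turn-PPO algorithm: `trajectory_level` broadcasts one scalar to every turn, while `turn_level_returns` gives each turn its own discounted return-to-go, which only differs meaningfully across turns once some per-turn reward exists. The auxiliary tool-call reward of 0.1 is a hypothetical example.

```python
from typing import List


def trajectory_level(turn_rewards: List[float]) -> List[float]:
    """One signal for the whole episode, copied to every turn."""
    total = sum(turn_rewards)
    return [total] * len(turn_rewards)


def turn_level_returns(turn_rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Discounted return-to-go per turn, computed backwards from the last turn."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(turn_rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


if __name__ == "__main__":
    terminal_only = [0.0, 0.0, 1.0]    # flag reward on the final turn only
    with_tool_bonus = [0.1, 0.0, 1.0]  # hypothetical: turn 1 made a successful tool call
    print(trajectory_level(terminal_only))
    print(turn_level_returns(terminal_only, gamma=0.5))
    print(turn_level_returns(with_tool_bonus, gamma=0.5))
```

With a terminal-only reward the turn-level view is just a discounted copy of the same signal; the per-turn term is what lets a turn that invoked a tool be credited differently from one that did not.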
9. Do exploration bonuses genuinely EXPAND [N] the boundary, or just elicit what the base model already has? — open question, no paper has run the decisive test on this domain
Point 1 already flags “RL expands vs. only reweights” as contested between Yue et al. and ProRL. The exploration-specific literature (DIVER, CDE, MERCI, PSN-RLVR) all claim their intrinsic-reward/parameter-noise mechanism helps the policy “escape local routines” or “discover better solutions” — language asserting [N] (boundary expansion) rather than mere elicitation. That claim deserves the same skepticism point 1 applies to vanilla RLVR:
- None of the exploration-bonus papers ran the decisive ablation. The rigorous test for “did this expand the boundary or just elicit/redistribute” is a pass@large-k comparison against the base model (point 1’s own protocol) — DIVER (arXiv:2509.26209), CDE (arXiv:2509.09675), MERCI (arXiv:2510.16614), and PSN-RLVR (arXiv:2602.02555) all compare against vanilla-GRPO/DAPO baselines, not against a very-large-k base-model ceiling. Beating a collapsed-entropy baseline is a much lower bar than beating the base model’s own pass@1024 — and per Spurious Rewards, arXiv:2506.10947, even a completely wrong reward can look like it’s “unlocking” capability on the right base model (Qwen2.5-Math specifically; does not replicate on Llama3/OLMo2) — a stark warning that “the policy now solves things it didn’t before” is not sufficient evidence of genuine novelty without a same-model-family, large-k, base-vs-trained comparison.
- The capability-elicitation / AI-safety literature already built the falsification protocol for exactly this question — password-locked models (arXiv:2405.19550), harder circuit-broken organisms (The Elicitation Game, arXiv:2502.02180), and the “elicit within <1% of training cost” operational definition of latent capability (AI Sandbagging, arXiv:2406.07358) — all built around known-ground-truth hidden capabilities specifically to distinguish “the technique surfaced something already there” from “the technique taught something new.” No exploration-bonus paper in the RLVR-entropy literature has been tested against a model organism with known injected/withheld capability the way this sub-field requires before it will accept an expansion claim.
- The one paper that ran something close to the decisive test, on an agentic/compositional task, found genuine expansion — but it’s a single, very recent result. Pass@(k,T), arXiv:2604.14877 (point 7 above) shows RL pulls ahead of base-model pass@k on compositional, sequentially-gated tasks, and — critically — that matched-data SFT on the same tasks regresses the capability boundary (net −4 vs RL’s net +4), isolating self-directed exploration during RL, not exposure to more data, as the causal factor. This is the strongest evidence in the entire corpus that exploration specifically (not RL in general, not more data) is what expands the boundary — but it is one paper (2026-04-16), one research group, unreplicated, studying retrieval-style compositional tasks, not cybersecurity.
Designed to fix: pattern 1 (tool avoidance) and pattern 3 (good guessers until they’re not) — DIVER’s pairwise-diversity-of-a-group reward and CDE’s perplexity-based actor bonus are both pitched as countering exactly these behavioral patterns. That framing may be correct as an elicitation mechanism (surfacing tool-diverse or better-calibrated behavior the base model can already produce with a nudge) even if the “expands the boundary” language in the papers’ abstracts is not yet earned.
Confidence: genuinely open. The honest position for this project: exploration bonuses are worth piloting (cheap, mechanistically motivated, all portable per the exploration research thread) — but do not claim any of them “expand the capability boundary” [N] until you’ve run a pass@large-k-vs-base-model check on the same challenge subset (point 1/7’s protocol), ideally segmented by task compositionality the way Pass@(k,T) recommends. Absent that check, the safer verb is “elicit” (base capability present, surfaced more reliably), matching this project’s own competence/performance framing, rather than “expand” (base capability genuinely absent, newly created).
The recipe is a sequence, not a pick
Every other chapter in this book eventually asks “which technique” — SFT or GRPO, DPO or KTO, monolithic or decomposed reward. This chapter retires that framing at the root. None of the frontier reports surveyed below describe a technique choice. They describe a fixed, ordered sequence of stages, each doing a qualitatively different job on qualitatively different data at a qualitatively different scale, and the capability that ships is a function of the order those stages run in and the way they compound — not of which single stage you picked. Skip a stage and a later stage cannot silently make up for it (you cannot RL-amplify a capability pretraining/SFT never injected — §3). Run stages in the wrong order or the wrong dose and a later stage actively regresses (heavy SFT/DPO can cap RL’s exploration room before RL ever starts). This is the organizing thesis of this book’s north star: not “what recipe/technique should I choose,” but “what sequence of stages produces frontier capability, and how do we run it for cyber.”
Two explicit sequences follow, because this project’s actual path and the textbook from-scratch path are different sequences with different stage-skeletons even though every stage-name rhymes:
- Sequence A (§1) — a foundation model trained from scratch: pretraining → mid-training/annealing → SFT → rejection-sampling → preference opt → RL/RLVR → iterate.
- Sequence B (§2) — this project’s actual path: fine-tune an already-available open-weight dense checkpoint (Qwen/Llama-class) — base-vs-instruct choice → (optional) continued/domain pretraining → SFT cold-start (often distilled) → rejection-sampling/on-policy SFT → preference (DPO/KTO) → RLVR/GRPO → iterate.
Stance held throughout: no claim below is grounded in an academic cybersecurity-LLM project (CTF-Dojo, Cyber-Zero, Pentest-R1, HackSynth, AutoPenBench, DRLRM-PT) — those appear, if at all, labelled “academic, not a basis.” Grounding is frontier-lab technical reports, frontier open post-training recipes (Tülu 3, OLMo 2, DeepSeek-R1, Qwen, Llama-Nemotron), and general RL/ML theory. Every arXiv id below was checked live via Exa/WebFetch against arxiv.org/abs/<id> or arxiv.org/html/<id>, not recalled from training-data memory — confidence is stated per claim, and “promising, not yet validated” is used honestly where a finding is recent/low-citation.
1. Sequence A — the foundation model, from scratch
This is the textbook path — pretrain a dense model from zero, then run a multi-round post-training loop. Grounded in Llama 3 (arXiv:2407.21783), OLMo 2 (arXiv:2501.00656), DeepSeek-V3/R1 (arXiv:2412.19437, arXiv:2501.12948, cited for pipeline-shape — MoE, flagged where MoE-specific), and Qwen2.5 (arXiv:2412.15115, dense 0.5B–72B). It is not this project’s path — included so Sequence B’s compression ratio (§2) has a baseline to compress against.
flowchart TD
Pre["Pretraining\n15-18T tokens: web + code + math + multilingual\nLlama 3 405B 15.6T (2407.21783)\nDeepSeek-V3 14.8T (2412.19437)\nQwen2.5 18T (2412.15115)"] --> Mid
subgraph Mid["Stage 2 — Mid-training / annealing"]
direction TB
MidA["5-10% of pretrain FLOPs, curated premium mix\nLR linearly decayed to 0\nOLMo 2: 50/100/300B-token anneals, souped (2501.00656)\nLlama 3: 40B tokens, 30:70 weight (2407.21783)"]
end
Mid --> SFT1
subgraph SFT1["Stage 3 — SFT cold-start"]
direction TB
SFT1A["Curated (prompt,response) pairs — human,\ndistilled, or rejection-sampled from a prior round\nDeepSeek-V3 1.5M instances (2412.19437)\nQwen2.5 >1M samples (2412.15115)\nDeepSeek-R1 cold-start: 'thousands' only (2501.12948)"]
end
SFT1 --> RS
subgraph RS["Stage 4 — Rejection sampling"]
direction TB
RSA["Sample K per prompt from current policy,\nkeep verifier/RM-filtered correct-only\nDeepSeek-R1: ~600K reasoning + ~200K general\n= ~800K, 2 epochs (2501.12948)"]
end
RS --> DPO
subgraph DPO["Stage 5 — Preference opt (DPO)"]
direction TB
DPOA["(prompt, chosen, rejected) triplets, closed-form loss\nchosen over PPO for stability/scale (2407.21783)\nQwen2.5 stages SFT -> DPO -> GRPO (2412.15115)"]
end
DPO --> RL
subgraph RL["Stage 6 — RL / RLVR"]
direction TB
RLA["GRPO: group-relative advantage, no critic,\nrule-based verifiable reward\nR1-Zero AIME pass@1 15.6% -> 71.0% (2501.12948)"]
end
RL -.->|"iterate: rejection-sample the RL-converged\ncheckpoint, retrain SFT from base"| SFT1
classDef stage fill:#132b22,stroke:#34d399,color:#eafaf3;
class Pre,MidA,SFT1A,RSA,DPOA,RLA stage;
| Stage | Job | Data type | Approx size (verified) | Contribution / order-rationale |
|---|---|---|---|---|
| 1. Pretraining | Teach language structure + load broad-domain knowledge via next-token prediction at massive scale — the only stage that’s affordable at trillions of tokens | Filtered/deduped web + code + math/STEM + multilingual, general-purpose | 15.6T tokens (Llama 3 405B dense, 2407.21783 — mix: 50% general / 25% math-reasoning / 17% code / 8% multilingual); 14.8T (DeepSeek-V3 MoE, 2412.19437); 18T (Qwen2.5 dense family, 2412.15115, up from 7T for Qwen2) | Must come first — every later stage assumes a working “reads and represents language” substrate; RL/RLVR only sharpen a distribution pretraining already put mass on, they don’t create it |
| 2. Mid-training / annealing | Upsample small, high-quality, capability-specific data that a uniform trillion-token mix would dilute to near-zero; also a cheap diagnostic for “is this new dataset worth anything” | Curated high-quality web + synthetic + math/domain-specific; LR decayed to (near) zero | 5–10% of total pretrain FLOPs. OLMo 2: 832.6B-token curated pool (“Dolmino Mix 1124”), drawn down into 50B/100B/300B-token anneal runs, then checkpoint-averaged (“souped”) 2501.00656. Llama 3: 40B tokens, 30% new-data : 70% default-mix weight 2407.21783 | Too scarce (tens of B tokens) to survive uniform mixing into a 15T-token stream; too much (needs pretraining-scale volume) for SFT’s ~10⁶-example budget to inject. OLMo 2’s measured delta: +18.7% (7B) / +15.9% (13B) / +12.3% (32B) downstream from mid-training alone (Table 2) — the cleanest “small stage, outsized compounding gain” number in this whole thread |
| 3. SFT cold-start | Teach instruction-following / assistant-shaped output; for reasoning pipelines, narrows further to “stabilize RL’s starting point” | Curated (prompt, response) pairs — human-written, distilled from a stronger teacher, or rejection-sampled from a prior round | DeepSeek-V3: 1.5M instances, multi-domain 2412.19437; Qwen2.5: >1M samples 2412.15115; DeepSeek-R1 cold-start: “thousands” only, deliberately tiny 2501.12948 | Must follow pretraining+mid-training (needs the base capability). In multi-round designs, SFT is interleaved with RL, not one-shot — cold-start SFT is ~3 orders of magnitude smaller than “capability” SFT because its only job is fixing format/readability, not teaching the full skill |
| 4. Rejection sampling | Turn a trained policy into a data generator for the next SFT round — crystallize RL-gained capability back into cheap-to-train supervised pairs | Model-generated completions, filtered by rule-based correctness and/or a reward model/generative judge | DeepSeek-R1: ~600K reasoning + ~200K non-reasoning = ~800K samples, 2 epochs, retrained from DeepSeek-V3-Base 2501.12948; Llama 3: K≈10–30 samples/prompt (medium confidence — secondary source) | Sits between an RL stage and the next SFT stage in every multi-round design — cannot happen before a trained policy exists, and its output is consumed entirely by the next SFT round. This is the mechanism that makes the pipeline compounding rather than a single pass |
| 5. Preference opt (DPO) | Align to relative preferences without a live RL loop, critic, or continuously-updated reward model | (prompt, chosen, rejected) triplets | Qwen2.5: explicit SFT → offline DPO → online GRPO staging 2412.15115; Llama 3: ≈6 rounds, each SFT+DPO (medium confidence on the exact count) 2407.21783 | A deliberate simplicity/stability trade against full online RL, stated directly by Meta: “less stable and harder to scale” for PPO-family algorithms at 405B 2407.21783; needs a policy already producing two plausible candidates to rank |
| 6. RL / RLVR | Exceed imitation — discover reasoning behaviors never explicitly demonstrated, via a verifiable (not learned) reward | Prompts + a verifier, not response demonstrations | R1-Zero (pure RL, no SFT): AIME 2024 pass@1 15.6% → 71.0% (cons@64 86.7%) 2501.12948; GRPO: group-relative advantage, no critic/value network | Placed last/final rounds — least stable, most expensive per-sample (live generation + verification per step). R1-Zero is the direct demonstration of running this first: real capability gain, but a documented failure mode (poor readability, language mixing) the paper attributes explicitly to skipping cold-start SFT |
| 7. Iterate | Repair a failure mode the previous pass introduced; bootstrap the next round’s training data from the current-best policy | n/a — reuses stages 3–6 | DeepSeek-R1: explicit 4-stage loop (cold-start SFT → reasoning RL → RS+SFT → RL-for-all-scenarios) 2501.12948; Llama 3: 6 rounds, self-referential rejection-sampling across rounds 2407.21783 | Not “more of the same” — round N+1’s data quality is bounded by round N’s model quality (a genuine bootstrapping effect). Llama 2’s own early-version regression (rejection-sampling only from the latest round’s data caused a documented capability loss — “struggled more… to compose rhyming lines”) 2307.09288 is the concrete warning that compounding is not monotonic for free — you must deliberately mix in older-round data |
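The “group-relative advantage, no critic” entry in the Stage 6 row can be sketched in a few lines: sample a group of rollouts for one prompt, score each with the verifier, and normalize rewards within the group. Using the population standard deviation and `eps=1e-6` are assumptions; implementations differ on both.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: (reward - group mean) / (group std + eps), no value network."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumed convention)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))  # all fail: no signal
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # all solve: no signal
    print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))  # mixed: usable signal
```

A group that is all failures or all solves gets zero advantage everywhere, so it contributes no gradient. That is the mechanical reason prompts with an intermediate solve rate are the useful ones for this stage.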
2. Sequence B — fine-tune an open-weight DENSE model (our path)
This is the project’s actual path. Start from a Qwen3/Llama-class dense checkpoint — never pretrain from zero. Grounded in four frontier open post-training recipes: Tülu 3 (arXiv:2411.15124), DeepSeek-R1 (arXiv:2501.12948), Qwen3 (arXiv:2505.09388), and Llama-Nemotron (arXiv:2505.00949). Underneath surface naming differences, all four run the same stage skeleton.
flowchart TD
Base["Stage 0 — Base-vs-instruct choice\nStart from BASE, not Instruct\nDeepSeek-R1, Qwen3, Tulu 3 all start Base\n(2501.12948, 2505.09388, 2411.15124)"] --> CPT
subgraph CPT["Stage 1 — (optional) continued /\ndomain pretraining"]
direction TB
CPTA["Full-FT (not LoRA), low LR, unsupervised domain tokens\nQwen3 knowledge stage: +5T tokens on ~30T base (2505.09388)\nDeepSeekMath CPT: 120B math tokens on a 7B dense model (2402.03300)\nLoRA-vs-FFT CPT regime: ~20B tokens (2405.09673)"]
end
CPT --> SFT
subgraph SFT["Stage 2 — SFT cold-start\n(often distilled / synthetic)"]
direction TB
SFTA["Small, format-focused, often teacher-distilled\nDeepSeek-R1: 'thousands' (2501.12948)\nQwen3: deliberately minimized by design (2505.09388)\nDeepSeek-R1-Distill: 800K samples, no RL,\nbeats RL-on-Qwen2.5-32B directly"]
end
SFT --> RS
subgraph RS["Stage 3 — Rejection-sampling /\non-policy SFT"]
direction TB
RSA["Sample K, verify against REAL outcome, keep\ncorrect, retrain — closes the execution gap\nRAFT (2304.06767), STaR (2203.14465)\nDeepSeek-R1: 800K samples built this way (2501.12948)"]
end
RS --> Pref
subgraph Pref["Stage 4 — Preference opt (DPO/KTO)"]
direction TB
PrefA["(prompt,chosen,rejected) triplets or binary\ndesirable/undesirable labels\nDPO (2305.18290), KTO (2402.01306)\nTulu 3: ~273K pairs, stage 3-of-5 (2411.15124)"]
end
Pref --> RLVR
subgraph RLVR["Stage 5 — RLVR / GRPO"]
direction TB
RLVRA["Group-relative advantage, no critic,\nverifiable reward — 2-pass: narrow-reasoning\nthen broad-general (DeepSeek-R1, Qwen3, Nemotron)"]
end
RLVR -.->|"iterate: rejection-sample the RL-converged\ncheckpoint -> mint next SFT round"| SFT
classDef stage fill:#132b22,stroke:#34d399,color:#eafaf3;
class Base,CPTA,SFTA,RSA,PrefA,RLVRA stage;
| Stage | Job | Data type | Approx size (verified) | Contribution / order-rationale |
|---|---|---|---|---|
| 0. Base-vs-instruct | Pick a starting checkpoint that won’t fight the target behavior | n/a | n/a | Base, not Instruct — no competing “assistant persona”/prior RLHF alignment to fight; DeepSeek-R1, Qwen3, and Tülu 3 all start every reasoning/post-training recipe from Base, never from the vendor Instruct checkpoint 2501.12948, 2505.09388, 2411.15124 — Instruct’s habits (short answers, refusal patterns, chat-template quirks) actively fight a long-CoT/tool-use format install |
| 1. (Optional) continued/domain pretraining | Inject genuinely new facts SFT/RL cannot teach at low data volume — RL/DPO reweight existing capability, they don’t teach new facts | Raw domain text/code, unsupervised, next-token-prediction objective | Qwen3 knowledge-injection sub-stage: +5T tokens on top of ~30T general 2505.09388; DeepSeekMath: 120B math tokens continued-pretrained onto a dense 7B code checkpoint 2402.03300 — the cleanest Sequence-B-scale CPT anchor; LoRA-vs-FFT benchmark regime: ~20B tokens 2405.09673 | Full fine-tuning, not LoRA, is the recommended method here — CPT needs to learn too much (new facts, new token distributions) for a low-rank constraint; full-FT learns perturbations at 10–100× the effective rank of typical LoRA configs 2405.09673. Skip this stage if the corpus is small/curated — push knowledge in via tools/retrieval instead (handbook rule: knowledge in tools, not weights) |
| 2. SFT cold-start | Fix format/readability/tool-syntax so RL has a stable starting point to sharpen, not invent, from scratch; often distilled from a stronger teacher | Small curated set, often off-policy/teacher-distilled long-CoT or tool-use traces | DeepSeek-R1 cold-start: “thousands” 2501.12948; Qwen3 explicitly states design intent: “minimize both the number of training samples and the training steps during this preparatory phase” 2505.09388; DeepSeek-R1-Distill (dense Qwen/Llama, 1.5B–70B): 800K samples, SFT-only, outperforms running RL directly on Qwen2.5-32B 2501.12948 | Deliberately kept small if it will be followed by RL — over-investing here (turning cold-start into a full capability-SFT pass) is exactly the failure mode §3’s Llama 4 finding warns about: heavy SFT caps the RL stage’s exploration room |
| 3. Rejection-sampling / on-policy SFT | Close the execution gap — train on the model’s OWN correct outputs (grounded in its own tool-call results), not just imitation of a teacher’s plausible-looking trace | Model-generated completions, filtered by a real verifier (rule-based correctness, not a learned judge where avoidable) | DeepSeek-R1: same ~800K-sample set, produced by rejection-sampling the RL-converged checkpoint 2501.12948; RAFT formalizes the loop generically 2304.06767; STaR is the seminal reasoning-specific version, with the “rationalization” (backward-from-answer) trick 2203.14465; a contested finding argues plain rejection-sampling (RAFT) is competitive with full GRPO — the edge attributed to prompt-filtering, not reward-normalization 2504.11343 (low-citation, “promising, worth testing on your own gym”) | The on-policy bridge between imitation and RL — the mechanism that actually closes the off-policy execution gap (§5), because now the “reasoning” is grounded in the agent’s own tool-call outputs, not a teacher’s |
| 4. Preference opt (DPO/KTO) | Align the softer, harder-to-verify axis (style, report quality, tool-use elegance) where no ground-truth checker exists | (prompt, chosen, rejected) triplets, or binary desirable/undesirable labels | DPO: closed-form classification loss, β the hyperparameter that actually matters 2305.18290; KTO: binary labels only, HALO framing, “matches or exceeds DPO from 1B–30B” 2402.01306; Tülu 3: ~272,898 pairs (8B mixture), stage 3-of-5 2411.15124 | Ordering here is genuinely contested. Tülu 3 runs preference-opt before RLVR and Nemotron runs it after; the choice turns on whether the preference signal and the verifiable-reward signal target the same behavior (fold into one RL stage, à la DeepSeek) or orthogonal behaviors (sequence them as separate stages, à la Tülu 3/Nemotron, with Nemotron placing the harder-to-specify objective last) |
| 5. RLVR / GRPO | Sharpen and stabilize on-policy behavior against a ground-truth verifier — the stage that can exceed what imitation/preference-opt cap out at | Prompts + a verifier, no response demonstrations | Every 2025 open recipe surveyed (DeepSeek-R1, Qwen3, Llama-Nemotron) runs this in two passes: narrow reasoning-only RL, then a broader general-domain pass — this two-pass structure is close to a settled convention across 3/3 recipes | GRPO needs no critic/value network — tractable to bolt onto an existing dense checkpoint without training a same-size value model; the RAFT-vs-GRPO ablation above (2504.11343) applies directly — test whether gains come from reward-normalization or from prompt-filtering before committing to full GRPO infra |
| 6. Iterate | Repeat RL ↔ rejection-sampling ↔ light-SFT round-trips | n/a — reuses stages 2–5 | DeepSeek-R1’s pipeline literally loops this (SFT → RL → rejection-sample → SFT → RL); AdaSTaR formalizes efficient iteration (curriculum sampling, −58.6% training FLOPs at equal-or-better accuracy) 2505.16322 | Budget at least two RL↔rejection-sampling round-trips, not one — “RL once, done” undersells what every recipe surveyed here actually does |
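Stage 4's row describes DPO as a closed-form classification loss with β as the hyperparameter that matters. A minimal sketch of that loss for a single triplet follows, assuming you already have summed response log-probabilities from the policy and from a frozen reference model. The function name, argument names, and the default `beta=0.1` are illustrative choices, not taken from any recipe cited above.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triplet.

    Inputs are summed token log-probabilities of each full response
    under the trainable policy and the frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers "chosen"
    # over "rejected" than the reference model does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp)
    # -log(sigmoid(beta * margin)), written as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy has shifted toward the chosen response: loss drops below log(2).
print(dpo_loss(-5.0, -9.0, -7.0, -7.0))
```

Because β multiplies the margin, a larger β makes the same log-probability shift count for more, which is one way to read why the table singles it out.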
The Sequence A → B compression, cited: the 1000×+ data-volume saving of “fine-tune, don’t pretrain” comes almost entirely from skipping or shrinking the pretraining stage. SFT (500K–1.5M examples), preference pairs (10⁵–10⁶), and RLVR prompts (10³–10⁴) stay at roughly the same absolute order of magnitude whether you pretrain from scratch or fine-tune an existing dense checkpoint. This corrects the naive assumption that “Sequence B is smaller at every stage”: it isn’t; only the pretraining/CPT stage shrinks by orders of magnitude.
3. Why ORDER matters, and why stages COMPOUND (not add)
Three mechanisms recur across every source in this thread, each backed by a controlled comparison or a direct frontier-lab disclosure — this is what “order is load-bearing” cashes out to concretely.
Cold-start SFT before RL changes RL’s stability and convergence, not just its ceiling. DeepSeek-R1 vs. R1-Zero is the cleanest natural experiment: same base model (DeepSeek-V3-Base), same RL algorithm (GRPO), only the presence of an SFT stage differs. R1-Zero (pure RL, zero SFT) gets real reasoning gains (AIME 15.6%→71.0%) but “poor readability, language mixing” — the paper’s own stated reason cold-start SFT exists: “starting RL training from an uninitialized model can lead to instability and slow convergence” 2501.12948. “SFT Memorizes, RL Generalizes” reaches the same conclusion from a different testbed (GeneralPoints/V-IRL): even a paper whose headline is “SFT bad, RL good” finds “SFT stabilizes the model’s output format, enabling subsequent RL to achieve its performance gains” 2501.17161 (medium-high confidence, ICML 2025 poster).
Heavy SFT/DPO can cap RL exploration — a frontier lab’s own production disclosure, not theory. Meta’s Llama 4 blog states directly: “SFT and DPO can over-constrain the model, restricting exploration during the online RL stage and leading to suboptimal accuracy, particularly in reasoning, coding, and math domains.” Their fix: drop >50% of data tagged “easy,” train only on the harder remainder before RL. This is the single cleanest concrete counter-example to “more SFT/DPO is always better” — high confidence on the mechanism (first-party engineering account), unquantified on magnitude (no ablation numbers disclosed).
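The “drop the easy data” fix can be sketched as a filter over a per-example difficulty score. Meta does not disclose how it tagged examples as easy, so the solve-rate proxy, the `easy_threshold` value, and the function name below are all assumptions made for illustration.

```python
def drop_easy_examples(examples, solve_rate, easy_threshold=0.8):
    """Keep only examples whose empirical solve rate is below the threshold.

    examples:   list of example ids
    solve_rate: dict id -> fraction of sampled attempts that succeeded
                (a hypothetical difficulty proxy, not Meta's method)
    """
    kept = [ex for ex in examples if solve_rate[ex] < easy_threshold]
    dropped_fraction = 1 - len(kept) / len(examples) if examples else 0.0
    return kept, dropped_fraction

# Made-up solve rates for five challenges, for illustration only.
rates = {"web-01": 1.0, "web-02": 0.9, "pwn-01": 0.2,
         "rev-01": 0.0, "crypto-01": 0.85}
kept, dropped = drop_easy_examples(list(rates), rates)
print(kept, dropped)
```

Reporting the dropped fraction alongside the kept set makes it easy to check whether a chosen threshold lands anywhere near the “>50%” figure quoted above.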
RL’s gains are bounded by what the base/SFT policy can already sample — distillation is the one mechanism shown here to inject genuinely new capability. “Does RL Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?” (NeurIPS 2025 Best Paper Runner-Up, ICML 2025 AI4Math Best Paper — high confidence, well-vetted): RLVR-trained models win at small k, but base models overtake at large k — “the reasoning capability boundary of LLMs often narrows as RLVR training progresses… reasoning paths generated by RLVR models are already included in the base models’ sampling distribution” 2504.13837. RL narrows toward existing high-reward paths; it does not expand pass@large-k beyond the base model. Distillation, by contrast, transplants a stronger teacher’s genuinely new reasoning patterns — DeepSeek-R1’s own finding: “direct distillation from DeepSeek-R1 outperforms applying RL on [Qwen2.5-32B]” directly 2501.12948. A mechanistic (low-citation, “promising not validated”) companion result frames the same split via parameter-update analysis: “RL amplifies existing capabilities, while SFT replaces old skills with new ones” 2507.10616. A contested push-back worth flagging honestly: “RL Fine-Tuning Heals OOD Forgetting in SFT” reframes this as “SFT forgets, RL recovers” rather than “SFT memorizes, RL generalizes” — RL mostly restores an early-SFT peak rather than exceeding it 2509.12235 (medium confidence, newer). (Calibration note: 2507.10616 is a single, 0-citation preprint whose own abstract calls the finding a “preliminary indication,” not a settled result — that is the confidence level to carry for this citation everywhere it appears in this book. Other chapters that cite it as a flat, unhedged “principle” or a “confirmed”/“project-confirmed” finding are miscalibrated against the paper’s own abstract and should be brought in line with the hedge above, not the reverse.)
How much a stage buys, where disclosed (this is rarely cleanly ablated — say so):
| Stage | Disclosed delta | Source | Honest caveat |
|---|---|---|---|
| Mid-training/annealing | +12.3% to +18.7% downstream, at 5–10% of pretrain FLOPs | OLMo 2, Table 2 2501.00656 | Fully open data/code — one of the only genuinely ablated per-stage numbers in this whole thread |
| SFT → DPO → RLVR, average score | 8B: 60.6 → 64.7 (+4.1) → 65.1 (+0.4); 70B: 72.6 → 76.2 (+3.6) → 76.2 (+0.0); 405B: 77.5 → 79.6 (+2.1) → 80.7 (+1.1) | Tülu 3, Table 3 2411.15124 | The average column hides where the real gain lives — RLVR’s aggregate contribution looks ~0 at 70B, but MATH-specifically it went 59.9 → 67.3 (+7.4) at 405B. Don’t evaluate a stage by its average-score delta alone (this is the direct precedent for §6’s “track the narrow-skill delta, not the average”) |
| Cold-start SFT dose | “Thousands,” not hundreds of thousands (R1); Llama 2 ceased at exactly 27,540 annotations, having found “fewer but better-quality examples led to notable performance improvements” | 2501.12948, 2307.09288 | Cold-start dose is a genuine hyperparameter, not “as much data as you can get” — over-investing risks capping RL per the Llama 4 finding above |
| Iterative-round non-monotonicity | Early Llama 2 RLHF, sampling only the latest round for rejection-sampling data, caused a documented regression (“struggled more… to compose rhyming lines”) | 2307.09288 | Compounding is not free/monotonic by default — you must deliberately mix in older-round data or silently regress a capability an earlier round had |
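The Tülu 3 row can be checked with a few lines of arithmetic. The sketch below recomputes the per-stage deltas from the averages quoted in the table and sets them beside the MATH-specific number from the caveat column; only the figures already quoted above are used, and the helper name is hypothetical.

```python
# Average scores per stage as quoted in the table above (Tülu 3, Table 3).
averages = {
    "8B":   {"SFT": 60.6, "DPO": 64.7, "RLVR": 65.1},
    "70B":  {"SFT": 72.6, "DPO": 76.2, "RLVR": 76.2},
    "405B": {"SFT": 77.5, "DPO": 79.6, "RLVR": 80.7},
}

def stage_deltas(scores, order=("SFT", "DPO", "RLVR")):
    """Return the score change contributed by each stage after the first."""
    return {
        later: round(scores[later] - scores[earlier], 1)
        for earlier, later in zip(order, order[1:])
    }

for size, scores in averages.items():
    print(size, stage_deltas(scores))

# Narrow-skill view from the caveat column: MATH at 405B, before vs. after RLVR.
math_delta_405b = round(67.3 - 59.9, 1)
print("405B MATH delta from RLVR:", math_delta_405b)
```

Run per skill rather than per average, the same helper is a cheap way to keep the narrow-skill delta visible next to the aggregate.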
Synthesis (settled vs. contested, stated honestly): the “cold-start stabilizes RL,” “heavy SFT/DPO caps RL exploration,” and “iterative rounds compound only if you mix old+new data” claims are high-confidence and multi-source-corroborated. “SFT memorizes / RL generalizes” as a clean universal law is not settled: it holds in controlled game/navigation environments, but it is nuanced by a nearer-2026 result showing that the useful range of SFT checkpoints to start RL from is bounded, rather than “less SFT is always safer.” Clean, ablated per-stage attribution is rare: of everything surveyed across this whole thread, only Tülu 3 and OLMo 2 disclose a genuine stage-by-stage number, so treat any single-number stage-attribution claim (including the ones in this chapter) with the same skepticism the Tülu 3 MATH-vs-average gap earns.
4. What I’d change in the project’s pipeline — order + dosage
- Even a small, cheap cold-start SFT stage (thousands, not millions) ahead of GRPO/RLVR is worth the compute purely for RL stability/format-compliance, independent of whether it raises the reward ceiling.
- Audit SFT/DPO data difficulty before RLVR — Llama 4’s explicit fix (drop >50% “easy,” train the hard remainder) is the single most actionable, frontier-lab-disclosed lever available; this project’s harness-generated CTF trajectories should be difficulty-scored before SFT, so the later GRPO stage still has exploration room on hard challenges.
- RL will not inject a capability the base/SFT policy cannot already sample — if a CTF category stays at ~0% under GRPO, that is evidence the capability needs to enter via SFT (ideally teacher-distilled), not via more RL steps against the same reward.
- When iterating multiple GRPO rounds, mix in earlier-round data when constructing later rounds’ SFT/preference sets, per Llama 2’s own documented regression when it didn’t.
- Track narrow-skill deltas per stage, not just the eval average — a stage can look like ~0 aggregate contribution while delivering the specific skill-targeted gain (CTF-category solve-rate, in this project’s case) that actually mattered.
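The third bullet (a category stuck at ~0% under GRPO) has a mechanical reading that a small sketch makes concrete. Group-relative advantage normalizes each rollout's reward against the other rollouts for the same prompt, so a group with no successes carries no learning signal at all. This is a simplified illustration: the population standard deviation and the `eps` value are assumptions, and real implementations differ in these details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (same prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One success out of eight rollouts: the success gets a large positive
# advantage and the failures a small negative one.
print(group_relative_advantages([0, 0, 0, 1, 0, 0, 0, 0]))

# Zero successes: every advantage is 0, so this group contributes no gradient.
print(group_relative_advantages([0] * 8))
```

The all-zero case is why more RL steps against the same reward cannot rescue a challenge the policy never solves in any sample.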
5. The synthetic-trajectory bootstrap — teacher writes it from the answer
A trajectory used for cold-start SFT can come from two structurally different sources, and conflating them is the single most common mistake in this literature:
- Off-policy synthetic — a different, usually bigger model writes the trajectory (distillation, Self-Instruct 2212.10560, Evol-Instruct/WizardLM 2304.12244, persona-driven synthesis 2406.20094). The trajectory was never in the student’s own output distribution.
- On-policy synthetic (self-generated + filtered) — the model being trained writes the trajectory itself; a verifier decides which ones to keep (STaR 2203.14465, RAFT 2304.06767, ReST-EM 2312.06585, the rejection-sampling stage in R1/Tülu 3). The trajectory is always something the model could actually produce.
Where each slots into the sequences above: off-policy synthetic belongs at Stage 2 (SFT cold-start) in both Sequence A and B — it fixes format/instruction-following/tool-syntax. On-policy synthetic belongs at Stage 3 (rejection-sampling) in both — this is where genuine skill-compounding starts, because the training signal is now grounded in the model’s own execution.
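The on-policy loop (sample K, verify, keep the correct ones) can be sketched in a few lines. The `generate` and `verify` callables are placeholders for the real policy and the real verifier; the toy stand-ins below exist only so the sketch runs, and none of the names come from RAFT, STaR, or ReST-EM.

```python
import random

def rejection_sample(prompts, generate, verify, k=4):
    """Sample k completions per prompt from the current policy and keep
    only the ones the verifier accepts, as (prompt, completion) SFT pairs."""
    kept = []
    for prompt in prompts:
        for _ in range(k):
            completion = generate(prompt)
            if verify(prompt, completion):
                kept.append((prompt, completion))
    return kept

# Toy stand-ins (hypothetical): a "policy" that guesses a digit and a
# verifier that knows the right answer for each prompt.
rng = random.Random(0)
answers = {"2+2": "4", "3+3": "6"}

def toy_generate(prompt):
    return str(rng.randint(0, 9))

def toy_verify(prompt, completion):
    return answers[prompt] == completion

sft_pairs = rejection_sample(list(answers), toy_generate, toy_verify, k=50)
print(len(sft_pairs), "verified pairs kept out of", 50 * len(answers))
```

Every kept pair is by construction something the policy itself produced, which is the property that separates this from off-policy teacher data.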
The off-policy caveat: cold-start knowledge, yes — execution gap, no
Off-policy teacher trajectories are excellent for cold-start but do not close the execution gap. This conclusion had to be assembled by cross-referencing several 2025–2026 papers, because no single seminal paper states it cleanly:
- Theoretical reason (DAgger lineage / covariate shift): SFT on teacher-written trajectories trains only on teacher-visited states. At inference the student generates autoregressively from its own prior tokens/actions — the moment it errs somewhere the teacher never did, it enters a state distribution it was never trained on, and errors compound. “Revisiting DAgger in the Era of LLM-Agents” states this precisely for multi-turn agents: “SFT provides dense teacher supervision but suffers from covariate shift because it is trained on off-policy teacher trajectories; while RLVR avoids this off-policy mismatch by learning from on-policy rollouts but with only sparse outcome feedback” 2605.12913 (medium-high confidence, very recent). A companion result on SWE-bench: pure off-policy imitation on expert trajectories suffers covariate shift; mixing in on-policy expert corrections gives +13–14% relative gain over traditional imitation (OpenReview KXAJtW8Bib, ICLR 2026 submission).
- Empirical confirmation — distillation only expands capability (pass@k) when it brings NEW knowledge, not pattern imitation. A controlled ablation compares base model vs. real DeepSeek-R1-Distill (trained on genuine teacher trajectories, “likely incorporates substantial new knowledge”) vs. a distilled model trained only on teacher responses for questions the base model’s own output distribution already covered (pure pattern transfer, zero new knowledge): “both distilled models significantly improve accuracy, [but] only the DeepSeek model shows a meaningful increase in capability” 2505.14216 (high confidence — direct quote, this is the single most load-bearing paper for this nuance).
- Why this is categorically worse for agentic tool-use than for math CoT: a math CoT is a single linear token stream — the “state” barely diverges from what the teacher wrote. A CTF-solving trajectory is multi-turn and environment-coupled — tool call → real stdout → next decision. The instant the student’s tool call returns different real output than what was baked into the teacher-written trajectory (different port open, different file present), the student is off the teacher’s distribution with no grounded behavior for that state. The environment, not just the model’s own tokens, is a second source of state divergence a teacher trajectory can never have anticipated.
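A back-of-the-envelope model shows why trajectory length matters so much for the covariate-shift argument. Assume, purely for illustration, that each step independently leaves teacher-visited states with a fixed probability; real agents violate both the independence and the fixed-rate assumptions, so treat this as intuition rather than a bound from any cited paper.

```python
def on_distribution_probability(per_step_error, num_steps):
    """Chance that a rollout never leaves teacher-visited states, assuming
    each step independently deviates with probability per_step_error."""
    return (1.0 - per_step_error) ** num_steps

# A 5% per-step deviation rate is an arbitrary illustrative value.
for steps in (1, 10, 30):
    print(steps, round(on_distribution_probability(0.05, steps), 3))
```

Under this toy model a short single-stream answer mostly stays on the teacher's distribution, while a 30-turn tool-use episode usually does not, and the environment's own variability only adds to the per-step deviation rate.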
The confabulation risk — backward synthesis needs the same warning label
STaR’s “rationalization” (show the model the correct answer, let it construct a plausible rationale backward, strip the hint before training) 2203.14465 is the seminal instance of backward/reverse synthesis. A 2026 follow-up sharpens the risk directly relevant here: when a model can see the answer while generating the “reasoning,” the answer acts as a cognitive anchor, and the model tends to produce a rationalization rather than a reasoning process that would generalize — naive mitigation (telling it to ignore the answer) paradoxically makes anchoring worse 2602.14469 (medium confidence, very recent, 0 citations — “promising, mechanistically plausible, not yet validated at scale”).
The cyber instantiation
Applying the above to the project’s own harness (extrapolation from general theory, not from an academic cybersecurity-LLM paper): use a strong external model to write CTF-solve trajectories for challenges you already have ground-truth flags for, but treat that corpus strictly as cold-start (format, tool-call syntax, instruction-following). Every single trajectory must be gated through the real flag verifier (the actual environment check, flag_verified, not a regex on the model’s claimed flag) before it enters SFT: per the post-hoc-rationalization warning above, a teacher that already knows the flag can write a plausible-looking exploit chain that never actually ran against real environment state. Then budget real compute for a rejection-sampling pass on the agent’s own real rollouts before RL, because per the covariate-shift/execution-gap evidence above, off-policy teacher data cannot by itself close the gap between “writes a plausible exploit chain” and “actually recovers from a real tool-call failure the harness will hit.” This is the direct, cyber-specific reading of Sequence B’s Stage 2 → Stage 3 transition (§2).
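The gate described above can be sketched as a filter that trusts only the environment's verdict. The `Trajectory` record and its `challenge_id` and `transcript` fields are hypothetical; `flag_verified` is the environment-check result named in the text, assumed here to be set by the harness and never by the model.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    challenge_id: str
    transcript: str       # full tool-call / stdout transcript
    flag_verified: bool   # set by the real environment check, never by the model

def gate_for_sft(trajectories):
    """Admit a teacher-written trajectory into cold-start SFT only if the
    environment itself confirmed the flag."""
    return [t for t in trajectories if t.flag_verified]

candidates = [
    # Contains a flag-shaped string, but the environment never confirmed it.
    Trajectory("web-01", "... FLAG{looks_right} ...", flag_verified=False),
    Trajectory("web-01", "... FLAG{abc123} ...", flag_verified=True),
]
sft_ready = gate_for_sft(candidates)
print(len(sft_ready), "of", len(candidates), "trajectories admitted")
```

The gate deliberately never inspects the transcript text, so a regex match on a flag-shaped string cannot let a confabulated trajectory through.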
6. How to evaluate Sequence B stage-by-stage
Answer, up front: evaluate at every stage boundary, with a single frozen held-out suite run against each checkpoint as it’s produced. Never re-run from scratch, and never wait for the final model. This is exactly the pattern Tülu 3 and OLMo 2 use in the open literature, and DeepSeek-R1’s own developmental-stage table demonstrates the diagnostic payoff directly: comparing R1-Zero against “Dev1” (cold-start SFT added) shows Dev1 gaining on IFEval/Arena-Hard but losing ground on AIME, attributed to “the limited size of the cold-start dataset” 2501.12948. That stage regression is only visible because they evaluated at the intermediate checkpoint, not just at final R1. Tülu 3’s own words: “our methodology facilitates identifying skill deficiencies and refining the data mix… ensuring a balanced performance of core skills across the training process” 2411.15124, and they release the actual intermediate checkpoints (Tulu3-SFT, Tulu3-DPO, final RLVR model) specifically so this comparison is reproducible. Cost asymmetry reinforces the case: RLVR/GRPO is the most expensive stage in the sequence, so discovering an SFT-stage defect only after a multi-day RL run means you have already sunk compute you didn’t need to spend.
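A stage-boundary check can be as simple as diffing consecutive checkpoints on the frozen suite. The sketch below assumes scores are already collected per checkpoint; the metric names, the `tolerance` value, and every number in `history` are placeholders for illustration, not measurements from this project or from any cited paper.

```python
def stage_regressions(history, tolerance=1.0):
    """Compare each checkpoint to the previous one on every frozen-suite
    metric and report drops larger than `tolerance` points.

    history: ordered list of (checkpoint_name, {metric: score}) pairs.
    """
    findings = []
    for (prev_name, prev), (name, scores) in zip(history, history[1:]):
        for metric, score in scores.items():
            drop = prev[metric] - score
            if drop > tolerance:
                findings.append((prev_name, name, metric, round(drop, 2)))
    return findings

# Placeholder numbers for illustration only, not measurements.
history = [
    ("base",     {"ifeval": 40.0, "ctf_pass@1": 10.0}),
    ("cold_sft", {"ifeval": 62.0, "ctf_pass@1": 8.0}),
    ("rlvr",     {"ifeval": 61.5, "ctf_pass@1": 19.0}),
]
print(stage_regressions(history))
```

In this made-up run the cold-start stage improves format adherence while costing task-solve rate, the same shape as the Dev1 observation, and the check surfaces it before any RL compute is spent.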
Per-stage metrics — the frozen suite is fixed, but what you watch closely changes by stage:
- After (optional) CPT: held-out domain-knowledge QA (not CTF-solving — pure recall of the corpus you just trained on) + a general-capability retention check (MMLU/IFEval before/after). “Domain-continual pretraining induces moderate forgetting with low-to-moderate backward transfer” 2510.17776 — CPT is the mildest of the post-training stages for forgetting, but not free.
- After SFT cold-start: initial pass@1 on the target task family + format/instruction-adherence (IFEval-style: does it actually follow the tool-call schema). This is the stage with the sharpest forgetting risk in the literature — one documented case: SFT dropped a benchmark 52.1%→40.1% while RFT improved the same setting to 54.2% 2507.05386; a Qwen-family (dense, directly relevant) result documents SFT degrading TruthfulQA/HaluEval on Qwen3-4B specifically 2605.20005. MMLU/IFEval must be in the frozen suite at the SFT checkpoint, not just at the end.
- After preference opt: win-rate on a held-out preference-eval set (the thing DPO/KTO directly optimizes) and re-run the same pass@1 suite from the SFT checkpoint — DPO should not regress raw task-solve rate; if it does, β or the preference data is miscalibrated. Tülu 3 separates “development” evals (looked at between stages) from “unseen” evals (reserved until the end) precisely so preference tuning doesn’t overfit the eval suite itself 2411.15124 — mirror this split.
- After RLVR: pass@1 AND pass@k, not pass@1 alone (§3’s argument, restated as a diagnostic here), plus the live reward/entropy curve during training. Naive GRPO suffers an entropy-collapse failure mode (“the entropy of the policy decreases quickly… sampled responses of certain groups tend to be nearly identical… limited exploration”): in DAPO’s reproduction it reached only 30/100 AIME points vs. DeepSeek’s reported 47 before a fix (Clip-Higher) was applied 2503.14476. A frozen post-hoc suite alone will not catch this; you need the live curve as an in-training diagnostic in addition to before/after checkpoint comparisons.
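The live entropy curve mentioned in the last bullet can be computed from the policy's next-token distributions. A minimal sketch follows, assuming you can log those distributions (or their entropies) per training step; the collapse rule and its `floor_fraction` threshold are arbitrary illustrative choices, not values from DAPO.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_policy_entropy(step_distributions):
    """Average entropy over the sampled positions of one training step."""
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)

def entropy_collapsed(entropy_by_step, floor_fraction=0.2):
    """Flag collapse when the latest entropy falls below a fraction of the
    first logged value (threshold is an arbitrary illustrative choice)."""
    return entropy_by_step[-1] < floor_fraction * entropy_by_step[0]

# A uniform two-way choice has entropy log(2); a certain choice has 0.
print(token_entropy([0.5, 0.5]), token_entropy([1.0, 0.0]))
print(entropy_collapsed([1.0, 0.6, 0.1]))
```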
Pass@k, per funnel stage, is the specific diagnostic RLVR requires that earlier stages don’t. “Does RL Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?” (636+ citations in ~14 months, ICML 2025 AI4Math Oral — well-validated) shows: RLVR wins at small k, base models overtake at large k, and “the reasoning capability boundary… often narrows as RLVR training progresses” 2504.13837. If you track pass@1 alone at the RLVR checkpoint, an RL run that is actively narrowing solution-space diversity looks like a pure win right up until the model needs to solve something outside the narrowed distribution — a genuinely novel CTF challenge, or the hard tail of the funnel. Compute pass@1 and pass@k (k matched to this project’s own locked methodology — k=3 pilot, k=5 real, k=10 edge-band) against the base model’s own pass@k on the same suite as the reference ceiling, not just the previous checkpoint — and do this per funnel stage (F1–F4), not only as one aggregate number, so “RLVR improved pass@1 on easy challenges but narrowed pass@k on hard ones” is visible on a training-stage × funnel-stage grid rather than hidden inside an aggregate.
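The pass@k comparison against the base model can be sketched with the standard unbiased estimator, 1 − C(n−c, k)/C(n, k), for n samples of which c are correct. The `pass_at_k_gap` helper and the example solve counts are hypothetical; they only illustrate how a checkpoint can win at k=1 while losing at large k.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c were correct:
    1 - C(n - c, k) / C(n, k)."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_at_k_gap(base_counts, tuned_counts, n, k):
    """Mean pass@k of the tuned checkpoint minus the base model over the
    same challenges; a negative value at large k signals narrowing."""
    base = sum(pass_at_k(n, c, k) for c in base_counts) / len(base_counts)
    tuned = sum(pass_at_k(n, c, k) for c in tuned_counts) / len(tuned_counts)
    return tuned - base

# Made-up solve counts out of n=10 samples on four challenges.
base_counts = [1, 1, 1, 1]    # base solves everything, but rarely
tuned_counts = [6, 6, 0, 0]   # tuned solves two reliably, two never
for k in (1, 5, 10):
    print(k, round(pass_at_k_gap(base_counts, tuned_counts, n=10, k=k), 3))
```

Computing the gap separately per funnel stage, rather than over all challenges at once, gives the training-stage × funnel-stage grid described above.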
Guardrails at every stage. Forgetting-risk ranking across the literature converges: SFT is the worst offender, CPT is moderate, RLVR/RFT is the gentlest and can even improve general-capability numbers in some cited settings 2507.05386 — treat the ranking as transferable, the exact percentages as architecture/task-specific. Held-out discipline: never let training data overlap with the frozen eval suite — Tülu 3’s own rule is “removing any training set that has overlap with more than 2% of our evaluation suite” 2411.15124, a concrete, adoptable threshold (treat as convention, not law).
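The 2% overlap rule can be approximated with a word n-gram check. The matching procedure below (lowercased whitespace tokens, 8-grams, any shared n-gram counts as a hit) is an assumption made for illustration; the quoted threshold is Tülu 3's, but this sketch does not claim to reproduce their exact matching method.

```python
def ngrams(text, n=8):
    """Set of word n-grams in a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def eval_overlap_fraction(train_texts, eval_texts, n=8):
    """Fraction of eval items sharing at least one word n-gram with the
    training set."""
    train_grams = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    hits = sum(1 for text in eval_texts if ngrams(text, n) & train_grams)
    return hits / len(eval_texts)

def should_drop_training_set(train_texts, eval_texts, threshold=0.02, n=8):
    """Apply the 'more than 2% of the eval suite overlaps' rule."""
    return eval_overlap_fraction(train_texts, eval_texts, n) > threshold

# Tiny example with 3-grams so the overlap is visible by eye.
train = ["the quick brown fox jumps"]
suite = ["a quick brown fox appears", "nothing shared here at all"]
print(eval_overlap_fraction(train, suite, n=3))
```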
Ablation to attribute contribution and decide where to restart. Hold the frozen suite fixed, run the full sequence with a stage included vs. skipped, compare final-checkpoint numbers on the same suite. DeepSeek-R1’s own R1-Zero-vs-R1 comparison is this ablation, published. Reading the result: if ablating stage N barely moves the frozen-suite numbers, stage N is a target for shrinking/dropping in the next iteration — restart from the checkpoint just before stage N, don’t re-run the whole sequence. If ablating stage N causes a big regression, it’s load-bearing and any future recipe change must preserve it. This is the direct empirical test of “order is load-bearing” for this project’s specific data, not just an inherited belief from the literature — and DAPO’s own reproduction difficulty (only 30/100 AIME with naive GRPO despite a strong base) is a reminder to isolate order-effects from hyperparameter-effects before attributing a regression to staging.
7. The cyber mapping — Sequence B, instantiated
This project runs Sequence B. Mapping each rung to a concrete cyber-data instantiation (extrapolation from the general theory above, applied to this project’s own harness — not from an academic cybersecurity-LLM paper):
| Sequence B stage | Cyber-specific data at this rung |
|---|---|
| 0. Base-vs-instruct | Start from the Base checkpoint of whatever Qwen/Llama-class dense model is chosen — matches all three frontier open recipes and avoids inheriting a chat-alignment prior that resists long-CoT/tool-use/verifiable-reward shaping |
| 1. (Optional) CPT | If the security corpus (CVE descriptions, tool docs, writeups, security-agent-<family> trace history) is a genuine token corpus, not a handful of curated docs — run it as full-FT, low LR, before SFT. If it’s small/curated, skip this stage; push the knowledge in as SFT context/retrieval instead |
| 2. SFT cold-start | A strong external model writes CTF-solve trajectories for challenges with already-known, ground-truth flags — off-policy, format/tool-syntax-only. Gate every trajectory through the real flag verifier before it enters SFT (§5’s confabulation warning applies directly) |
| 3. Rejection-sampling / on-policy SFT | Run the cold-started model as the agent in the real harness. Sample K trajectories per challenge (temp ~0.7, per RFT/ReST-EM’s exact hyperparameters), keep only trajectories where the real environment state produced the real flag — not a string match on FLAG{...} in the model’s text, an actual environment check. This is the step that closes the execution gap |
| 4. Preference (DPO/KTO) | Contrast verified-correct vs. verified-incorrect trajectories from the same rejection-sampling pool; use KTO if only cheap binary “good/bad” labels exist (from a trace-verification tool), DPO if genuine paired trajectories on the same challenge exist |
| 5. RLVR/GRPO | Same ground-truth flag verifier as the reward function. The verifiable-reward signal (flag captured / exploit worked) and the preference signal (report quality, doesn’t waste turns) are genuinely orthogonal here, arguing for RLVR-then-short-DPO-polish ordering (separate stages with the harder-to-specify objective last, as in Nemotron; Tülu 3 also keeps the stages separate but runs DPO before RLVR) rather than folding both into one stage |
| 6. Iterate | Mint the next round’s cold-start/rejection data from the current RL-converged checkpoint (R1’s own stage-3 pattern) — budget at least two RL↔rejection-sampling round-trips before calling a GRPO baseline “done” |
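The Iterate row, combined with the Llama 2 regression warning from §3, suggests building each new round's training set from more than the latest round. A minimal sketch of one such mixing policy follows; the 50/50 split, the even spread over older rounds, and the function name are illustrative assumptions, not a disclosed recipe.

```python
import random

def mix_rounds(rounds, latest_share=0.5, total=1000, seed=0):
    """Build the next SFT set from all rounds' verified trajectories:
    `latest_share` from the newest round, the rest spread evenly over
    older rounds.

    rounds: list of lists, oldest first.
    """
    rng = random.Random(seed)
    *older, latest = rounds
    n_latest = min(len(latest), int(total * latest_share))
    mixed = rng.sample(latest, n_latest)
    if older:
        per_old = (total - n_latest) // len(older)
        for pool in older:
            mixed += rng.sample(pool, min(len(pool), per_old))
    rng.shuffle(mixed)
    return mixed

# Three rounds of 100 placeholder trajectories each, tagged by round.
rounds = [[(f"r{r}", i) for i in range(100)] for r in (1, 2, 3)]
mixed = mix_rounds(rounds, latest_share=0.5, total=100)
print({tag: sum(1 for t, _ in mixed if t == tag) for tag in ("r1", "r2", "r3")})
```

The right share for older rounds is an empirical question; the frozen-suite comparison from §6 is the natural way to tune it.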
Cross-links: What the frontier labs actually do (2026) for the per-lab method survey this chapter’s stage-skeleton was built on; The path to a frontier cybersecurity model for the domain-specialization lineage (code/math/medical) that cross-validates this same stage-skeleton and the current gap analysis against this project’s own harness; Diagnosing the gap — a scientific framework for how a measured failure mode maps to which stage needs the fix; Where you are & the forks ahead for how this sequence view resolves into this project’s next concrete decision.
Continued pretraining on an instruction-tuned model (without breaking it)
The north star this book keeps returning to is fine-tuning an already-available open-weight dense model into a frontier cybersecurity model — not training one from scratch. The path to a frontier cybersecurity model names Stage 0 — domain continued pretraining (CPT) as the first rung of that ladder: every code/math/medical specialization lineage examined there continues pretraining from an existing strong checkpoint on a raw in-domain corpus (Code Llama ~500B code tokens, DeepSeekMath 120B math tokens, Qwen2.5-Coder 5.5T code tokens) before SFT/RL ever starts. What that chapter leaves implicit — and this one makes explicit — is which checkpoint Stage 0 should actually start from.
In practice you frequently don’t have a clean choice. Many open-weight releases you’d actually want to build on are shipped, served, and eval-harnessed as the instruction-tuned/RLHF’d checkpoint — that’s the one your inference stack, prompt templates, and safety behavior are already built around. So the question this chapter answers is concrete and load-bearing, not academic: can you run raw next-token CPT on a raw cybersecurity corpus directly on top of an already-instruct checkpoint — for knowledge injection — without destroying the instruction-following/alignment layer that checkpoint already has? If yes, how? And how do you know, empirically, that it didn’t break?
Bottom line up front: yes, this is real, documented practice — but naive next-token CPT on an instruct checkpoint reliably damages it. The damage is disproportionately format/alignment collapse (chat-template adherence, instruction-following reliability, output degeneracy, sometimes safety behavior), not fact erasure — raw knowledge is comparatively robust. This is fixable, but not “just wait it out”: it needs an explicit countermeasure. The safer industry default remains CPT-on-base → re-instruct when a base twin exists (true for essentially every mainstream open-weight family — Llama, Qwen, Mistral, Gemma all ship base+instruct pairs); CPT-on-instruct directly is viable only as a budgeted fallback. Per this project’s standing stance, no cybersecurity-LLM paper is used as evidentiary ground anywhere below — every claim rests on general domain-adaptive-pretraining, continual-learning, and model-merging/task-arithmetic literature, or on frontier-lab disclosures already cited elsewhere in this book.
1. Why this is a real fork, not a formality
The reason this can’t be waved away as “just keep training” is that instruction-tuning/RLHF is a comparatively thin edit on top of the base pretraining distribution: instruction-following and safety behavior ride on a small weight delta, and (per the safety-alignment literature below) safety in particular is expressed mostly through the distribution over the first few output tokens of a response. Raw next-token CPT on a large new corpus is exactly the kind of update that can overwrite a thin layer without the training loss ever showing it. That’s the catastrophic-forgetting risk this chapter is about, and it’s a direct instance of the same competence-vs-performance split Diagnosing the gap uses for the harness’s own capability gaps: a CPT run that silently collapses instruction-following isn’t erasing knowledge (competence); it’s breaking elicitation (performance). Misdiagnosing a post-CPT regression as “the model needs more domain SFT” when the real damage is to the instruction-following layer sends you down the wrong fix.
The seminal reference for “keep pretraining on domain data before you specialize further” is Domain-Adaptive Pretraining (DAPT) — Gururangan et al., “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks,” ACL 2020, arXiv:2004.10964. It establishes the concept and lineage everything below inherits, but it’s an encoder (RoBERTa, MLM objective) with no instruction-following layer to protect and no chat-model eval — cite it for the concept, not for the instruct-preservation question. Confidence: high (canonical, correctly attributed; different model class from the rest of this chapter).
2. What actually breaks — cited, mechanism-level
The direct, on-point comparison. Jindal, Badrinath, Bharti, Vinay & Sharma (Samsung Research), “Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs,” arXiv:2410.10739 (Oct 2024), frames the question almost literally as S1 (CPT directly on the instruct checkpoint) vs. S2 (CPT the base, then re-instruct), tested across Llama-3, Llama-3.1, Qwen2, Qwen2.5 base/instruct pairs. Their own contribution bullet: “Continuous pre-training of an instruction model results in catastrophic forgetting of the instruction capabilities and, therefore should be avoided.” Conversely, CPT-on-base-then-reinstruct “preserve[s] both the domain knowledge and the instruction capabilities.” Confidence: medium-high — single preprint, but directly on-point and multi-model-family.
What specifically degrades. Li & Lee (NTU), “Examining Forgetting in Continual Pre-training of Aligned Large Language Models,” arXiv:2401.03129 (Jan 2024), CPT directly on Llama-2-7b-chat (already SFT+RLHF’d) with 1B tokens of raw Traditional Chinese text. Their own framing: “the model’s knowledge remains unaffected while its reliability declines” — increased repetition, drift toward generating in the CPT-corpus language regardless of prompt language. They also tried the cheap fixes (freezing first/last layers, attention-only vs. MLP-only, LoRA, (IA)³ adapters) and found none fully solved it — “more than straightforward methods are required.” Confidence: high for the qualitative finding; the mitigation-list result is a useful negative (rules out cheap partial-freezing as sufficient on its own).
Independent confirmation the loss is alignment, not knowledge. Zheng, Cai, Qiu & Ma, “Spurious Forgetting in Continual Learning of Language Models,” ICLR 2025 poster (OpenReview:ScI7IlKGdI): much of what looks like catastrophic forgetting is a decline in task alignment from near-orthogonal early-optimization-step weight updates, not true knowledge loss — the underlying knowledge is often still there, just mis-elicited. Proposed mitigation: freeze the bottom layers during the new training phase. Confidence: medium-high (peer-reviewed poster; corroborates Li & Lee independently).
Safety is the same mechanism, and it’s shallow. Qi, Zeng, Xie, Chen, Jia, Mittal & Henderson, “Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!,” arXiv:2310.03693 (ICLR 2024): further training on an aligned checkpoint degrades safety behavior even with entirely benign, non-adversarial data, with no malicious intent required. A raw cybersecurity corpus is exactly the kind of domain data that can interact badly with refusal/safety behavior; a security CPT run needs a dedicated safety eval pass, not just an instruction-following check. Qi et al. (Princeton + Google DeepMind), “Safety Alignment Should Be Made More Than Just a Few Tokens Deep,” arXiv:2406.05946 (ICLR 2025 Oral), gives the mechanism: safety alignment is disproportionately encoded in the model’s behavior over the first few output tokens (“shallow safety alignment”), so small perturbations to early-token distributions from any further training can collapse refusal behavior while leaving downstream behavior looking otherwise unaffected. Confidence: high (both peer-reviewed, foundational, and mutually corroborating).
Scale makes it worse, not better. Luo et al., “An Empirical Study of Catastrophic Forgetting in LLMs During Continual Fine-tuning,” arXiv:2308.08747: forgetting is general across the 1B–14B range tested and — counterintuitively — gets worse as scale increases within that range, because larger models start from a higher capability baseline and so have more to lose. One actionable positive from the same paper: mixing general instruction-tuning data into subsequent training measurably alleviates the forgetting — the empirical backbone for the replay technique below. Confidence: high (their own ablations directly support both the negative finding and the mitigation).
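The mixing mitigation mentioned above can be sketched in a few lines. This is a minimal illustration, not any paper's data loader: `mixed_batches`, its parameters, and the toy documents are hypothetical names, and the 25% ratio is only an example value that would need tuning for a real corpus. The one design choice shown is that replay is fixed per batch, so no batch is 100% raw domain text.

```python
import random
from typing import Iterator, Sequence


def mixed_batches(
    domain_docs: Sequence[str],
    replay_docs: Sequence[str],
    replay_ratio: float,
    batch_size: int,
    num_batches: int,
    seed: int = 0,
) -> Iterator[list[str]]:
    """Yield batches in which a fixed share of every batch is replay data."""
    if not 0.0 <= replay_ratio < 1.0:
        raise ValueError("replay_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Fixed per-batch counts: every batch carries replay, not just the average.
    n_replay = round(batch_size * replay_ratio)
    n_domain = batch_size - n_replay
    for _ in range(num_batches):
        batch = rng.choices(domain_docs, k=n_domain)
        batch += rng.choices(replay_docs, k=n_replay)
        rng.shuffle(batch)  # avoid a fixed domain-then-replay ordering
        yield batch


# Toy usage: "d*" stands in for raw domain text, "r*" for general instruction data.
domain = [f"d{i}" for i in range(50)]
replay = [f"r{i}" for i in range(50)]
batches = list(mixed_batches(domain, replay, replay_ratio=0.25, batch_size=8, num_batches=3))
```

Sampling with replacement keeps the sketch short; a real pipeline would more likely stream both sources without replacement and track epochs per source.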
Loss curves lie to you about this. Mousavi, Alghisi & Riccardi (U. Trento), “What Does Loss Optimization Actually Teach, If Anything? Knowledge Dynamics in Continual Pre-training of LLMs,” arXiv:2601.03858 (2026): CPT epoch-by-epoch on three instruction-tuned LLMs with diagnostic probes interleaved shows training loss decreases monotonically while factual learning is unstable/non-monotonic and out-of-domain general-skill performance degrades from early epochs onward — by the time your CPT loss curve “looks fine,” you may already be past the point where the instruct layer started eroding. Confidence: medium (single group, very recent, methodologically sound).
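One way to act on this is to log a probe score next to the training loss and flag the first point where the two disagree. The sketch below assumes you already have per-epoch probe results; `EpochLog`, `first_silent_regression`, the tolerance, and every number are hypothetical and only illustrate the pattern.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class EpochLog:
    epoch: int
    train_loss: float
    probe_score: float  # e.g. accuracy on a held-out general-skill probe


def first_silent_regression(
    logs: Iterable[EpochLog], baseline_probe: float, tolerance: float = 0.02
) -> Optional[int]:
    """Return the first epoch where loss improved but the probe fell below baseline - tolerance."""
    prev_loss = float("inf")
    for log in logs:
        loss_improved = log.train_loss < prev_loss
        probe_regressed = log.probe_score < baseline_probe - tolerance
        if loss_improved and probe_regressed:
            return log.epoch
        prev_loss = log.train_loss
    return None


# Made-up run: loss falls every epoch while the probe quietly erodes.
logs = [EpochLog(1, 2.0, 0.70), EpochLog(2, 1.8, 0.66), EpochLog(3, 1.6, 0.60)]
flagged = first_silent_regression(logs, baseline_probe=0.70)
```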
Large-scale but not yet peer-reviewed caveat. Harmon, Hochlehnert, Bethge & Prabhu (Tübingen AI Center), “Mapping Post-Training Forgetting in Language Models at Scale” (OpenReview:qCIg2WGudx, anonymous ICLR 2026 submission, no arXiv id found; treat as promising, not yet validated), measures forgetting and backward transfer by counting sample-wise transitions across ~30 model pairs. Two findings matter here: “domain-continual pretraining induces moderate forgetting with low-to-moderate backward transfer” at the knowledge level (i.e., moderate, not catastrophic, purely on facts, consistent with Li & Lee), and, the finding to carry into §3 below, “model merging does not reliably mitigate forgetting.” Confidence: medium, and explicitly flagged every time it’s cited because it cuts against several merging-based fixes cited next.
3. Preservation techniques — decision table
Six documented countermeasures, each independently verified, at different cost/robustness points. They compose — production recipes stack two or three at once.
| Technique | How | Cost | Confidence |
|---|---|---|---|
| Replay / data-mixing | Blend general and/or instruction-formatted data into every CPT batch instead of 100% raw domain text. Floor: 1% replay is enough to substantially mitigate forgetting in continual instruction-tuning (Scialom et al., EMNLP 2022, arXiv:2205.12393). At pretraining scale, Ibrahim et al. (arXiv:2403.08763) report 5% replay sufficient for a weak shift (English→English), 25% for a strong shift (English→German); treat a cybersecurity corpus as closer to the strong-shift end (in-distribution stylistically, out-of-distribution lexically/topically — jargon, CVE text, shellcode) and start at 15–25%. Prefer instruction-shaped replay over generic replay when affordable: AdaptLLM (arXiv:2309.09530) auto-converts raw domain text into reading-comprehension/QA task pairs; Instruction Pre-Training (arXiv:2406.14491, Microsoft, EMNLP 2024) scales this to 200M synthesized instruction-response pairs woven into the raw corpus, reporting a CPT’d Llama3-8B “comparable to or outperforming Llama3-70B” on their domain suite. | Low (mixing) → medium (synthesizing instruction-shaped replay via a teacher-model pass) | High — 4 independent groups, fine-tuning-scale through 10B-param/hundreds-of-billions-token pretraining-scale, consistent conclusion |
| Low LR + re-warm/re-decay + short CPT | Treat the instruct checkpoint’s weights as a fragile prior: re-warm the LR (most released checkpoints have already decayed to near-zero) then re-decay over a short cosine cycle sized to the new-token budget, rather than training to convergence on the new corpus. Ibrahim et al. (arXiv:2403.08763, building on Gupta et al., arXiv:2308.04014) show re-warm+re-decay+replay matches full retrain-from-scratch on final loss and LM-eval-harness average, at 405M–10B params, hundreds of billions of tokens. Skipping re-warm → poor adaptation; skipping re-decay/over-running at peak → forgetting spike. | Near-zero (scheduler choice, no new infra) | High on the mechanism; medium on any specific peak-LR number transferred to a much smaller CPT budget than the paper’s own experiments — needs a small sweep |
| LoRA / PEFT-CPT | Freeze base weights, train only low-rank adapters during CPT — the rank constraint mechanically bounds how far the update can move the checkpoint. Biderman et al., “LoRA Learns Less and Forgets Less,” TMLR 2024, arXiv:2405.09673 — the rigorous head-to-head at exactly the CPT regime (Llama-2-7B, ~20B unstructured tokens): LoRA forgets markedly less than full-FT, but underlearns the new domain and the gap is NOT closed even at r=256 — full-FT learns perturbations of effective rank 10–100× higher than typical LoRA configs, which is the mechanistic reason LoRA both underlearns and underforgets. Verified concrete settings from their CPT appendix: target all transformer modules (not just attention), α = 2r, LR 2e-4 at r=16/64, 1e-4 at r=256 (higher rank needs a lower LR to avoid instability), cosine schedule w/ warmup, bf16. | Low (standard PEFT tooling, memory-cheap) | High — single but rigorous, matched-hyperparameter, mechanistic (SVD-based) paper; explicitly reports the negative result (LoRA is not a free lunch for CPT knowledge injection) |
| Chat-vector / instruction-residual (task arithmetic) | Compute Δ_instruct = θ_instruct − θ_base once from the model family’s released base/instruct pair (Ilharco et al., “Editing Models with Task Arithmetic,” ICLR 2023, arXiv:2212.04089 — the seminal task-vector result: deltas can be added/negated/composed via simple weight-space arithmetic). CPT the base checkpoint (not the instruct one) on the raw corpus → θ_base_cpt; reattach θ_final = θ_base_cpt + γ·Δ_instruct (γ defaults to 1.0, sweep 0.8–1.2 if outputs degrade). No SFT rerun required. Verified independently twice: Chat Vector (Huang et al., ACL 2024, arXiv:2310.04799) for a language-shift version of the same problem, and Jindal et al. (arXiv:2410.10739) for the domain-shift version this chapter is about — both converge on the identical recipe independently. If the raw add underperforms, RESTA (arXiv:2402.11746, ACL 2024) shows DARE-sparsifying the delta before adding reduces interference. | Low — the CPT run costs the same as CPT’ing the base directly; reattachment is a single elementwise tensor op | High for the mechanism (ICLR-seminal + ACL-peer-reviewed); medium-high for direct applicability, since the published experiments CPT the base and reattach rather than training the instruct weights themselves — no cybersecurity corpus tested |
| Re-apply a light SFT (or SFT+DPO) stage after CPT | Accept that CPT degrades chat behavior and budget a short, cheap post-CPT SFT pass, cheaper than the original SFT since you’re re-sharpening a latent capability, not teaching it from scratch. LLaMA Pro (arXiv:2401.02415, ACL 2024) is the structural variant: expand the model with new transformer blocks while keeping the original blocks frozen, CPT only the new blocks, then run a separate instruction-tuning pass to produce “LLaMA Pro-Instruct”. The original weights are never touched, so preservation is structural, not repaired post-hoc (tradeoff: parameter growth + re-integration into the serving stack). Safety needs its own restoration step: per §2, a generic SFT pass is not guaranteed to restore safety alignment as reliably as instruction-following, because safety is shallow/early-token-concentrated; Qi et al.’s (arXiv:2406.05946) proposed fix is a fine-tuning objective that specifically constrains updates on the initial-token distribution. | Low-medium (a short SFT pass costing hours, not the CPT-scale token budget; block-expansion variant is medium, since it is architecture surgery) | High for “instruction-following degrades and a light SFT pass helps” (standard, cross-validated pattern); medium-high for the safety-specific restoration claim (newer, narrower result from a single ICLR 2025 Oral paper) |
| Model merging (TIES / DARE) | Instead of a single arithmetic addition, treat the domain-CPT’d model and the original instruct model as siblings and merge their deltas with an interference-aware algorithm rather than naive averaging. TIES-Merging (arXiv:2306.01708, NeurIPS 2023): trim small-magnitude changes, elect a majority sign per parameter to resolve sign disagreement, then merge only agreeing parameters. DARE (arXiv:2311.03099, ICML 2024, “Super Mario”): randomly drop delta parameters at rate p and rescale survivors by 1/(1−p) — SFT deltas are typically tiny and redundant enough that 90–99% can be zeroed without hurting the model’s own abilities, and using DARE as a pre-processing sparsifier before merging mitigates the same interference TIES targets. | Low — post-hoc weight-space operation, no additional training, needs both source checkpoints on the same architecture | High for both papers’ core claims (peer-reviewed, reproducible open code); medium for applying either to this exact CPT-knowledge + instruct-behavior scenario (a well-supported inference, not a verbatim citation) — and read the §2 caution again: Harmon et al.’s large-scale study reports merging “does not reliably mitigate forgetting,” so budget a stage-boundary eval after any merge regardless of technique |
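The DARE row describes a drop-and-rescale step that is easy to state in code. The sketch below works on a flat list of floats standing in for one delta tensor; `dare_sparsify` and `apply_delta` are hypothetical helper names, and this is an illustration of the stated rule (drop at rate p, rescale survivors by 1/(1−p)), not the authors' released implementation.

```python
import random


def dare_sparsify(delta: list[float], drop_rate: float, seed: int = 0) -> list[float]:
    """Randomly zero a fraction of delta entries and rescale survivors by 1/(1 - p)."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - drop_rate)
    # Each entry is dropped independently; rescaling keeps the expected delta unchanged.
    return [0.0 if rng.random() < drop_rate else d * scale for d in delta]


def apply_delta(base: list[float], delta: list[float]) -> list[float]:
    """Add a (possibly sparsified) delta back onto base weights."""
    return [b + d for b, d in zip(base, delta)]


# Toy delta: 10,000 identical small entries, 90% dropped.
delta = [0.01] * 10000
sparse = dare_sparsify(delta, drop_rate=0.9, seed=0)
mean_after = sum(sparse) / len(sparse)  # stays close to 0.01 in expectation
```

The rescale is what makes the sparsified delta an unbiased stand-in for the original: each entry survives with probability 1−p and is multiplied by 1/(1−p), so its expected value is unchanged.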
4. The practical recipe + recommendation
Two viable orderings, plus a fallback, evaluated against the actual constraint most teams face: you very likely have (or strongly prefer to keep working from) the instruct checkpoint your serving/eval harness is already built around, even though for mainstream open-weight families (Qwen, Llama) the base twin is also published.
Option A — CPT-on-base → re-instruct (the textbook-safe default).
Use when the model family publishes both Base and Instruct (true for Qwen2/2.5/3 and Llama-3.x; verify per model before committing) and you have budget to re-run SFT/RL afterward. CPT the base checkpoint with the Ibrahim et al. recipe (re-warm/re-decay LR + 15–25% instruction-shaped replay), then run the planned SFT→RL pipeline on the domain-adapted base. This is confirmed directly by Jindal et al.’s §4.4 result (it preserves both domain knowledge and instruction capability, no post-hoc repair needed), and since CPT→SFT→RL is already this project’s intended stage order (per Stage 0 in the frontier recipe), this isn’t really extra cost, just correct sequencing.
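The re-warm/re-decay schedule named above can be written as a plain function of the step index. This is a sketch under assumptions: linear warmup followed by a cosine decay is one common shape, and `rewarm_redecay_lr`, the step counts, and the peak value are hypothetical placeholders, since the peak LR needs its own small sweep.

```python
import math


def rewarm_redecay_lr(
    step: int,
    total_steps: int,
    warmup_steps: int,
    peak_lr: float,
    min_lr: float = 0.0,
) -> float:
    """Linear re-warm from min_lr to peak_lr, then cosine re-decay back to min_lr."""
    if step < warmup_steps:
        # Re-warm: the released checkpoint's LR has typically decayed to near zero.
        return min_lr + (peak_lr - min_lr) * (step + 1) / warmup_steps
    # Re-decay: one cosine cycle sized to the new-token budget, not run to convergence.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical budget: 1,000 steps, 100 of them warmup, peak LR 1e-4.
schedule = [rewarm_redecay_lr(s, 1000, 100, 1e-4) for s in range(1001)]
```

Sizing `total_steps` to the CPT token budget is the point: the schedule ends at `min_lr` when the new corpus is consumed, which avoids sitting at peak LR longer than planned.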
Option B — CPT-on-base + chat-vector reattach (the cheap shortcut).
Use when you specifically want to skip re-collecting/re-running SFT (e.g., reusing a vendor’s expensive proprietary instruction dataset you don’t want to reproduce). Mechanics: compute Δ_instruct = θ_instruct − θ_base once from the original released pair; CPT the base as in Option A; reattach θ_final = θ_base_cpt + γ·Δ_instruct (γ=1.0, sweep 0.8–1.2). Well-evidenced by two independent convergent recipes (Chat Vector, Jindal et al.), but with one fewer confirming source than Option A, and neither source paper tested a cybersecurity corpus, so the stage-boundary eval in §5 is load-bearing here, not optional.
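The Option B mechanics are two elementwise operations over matching parameter names. The sketch below uses dicts of flat float lists as stand-ins for checkpoint state dicts; `chat_vector` and `reattach` are hypothetical helper names and the toy weights are invented. With real checkpoints the same arithmetic would run per tensor, in a consistent dtype.

```python
Weights = dict[str, list[float]]


def chat_vector(theta_instruct: Weights, theta_base: Weights) -> Weights:
    """Delta_instruct = theta_instruct - theta_base, per parameter."""
    if theta_instruct.keys() != theta_base.keys():
        raise ValueError("base and instruct checkpoints must share parameter names")
    return {
        name: [i - b for i, b in zip(theta_instruct[name], theta_base[name])]
        for name in theta_base
    }


def reattach(theta_base_cpt: Weights, delta_instruct: Weights, gamma: float = 1.0) -> Weights:
    """theta_final = theta_base_cpt + gamma * Delta_instruct."""
    if theta_base_cpt.keys() != delta_instruct.keys():
        raise ValueError("CPT'd base and delta must share parameter names")
    return {
        name: [w + gamma * d for w, d in zip(theta_base_cpt[name], delta_instruct[name])]
        for name in theta_base_cpt
    }


# Toy checkpoints with a single two-element parameter.
base = {"w": [1.0, 2.0]}
instruct = {"w": [1.5, 1.0]}
base_cpt = {"w": [1.25, 2.5]}  # base after domain CPT

delta = chat_vector(instruct, base)
final = reattach(base_cpt, delta, gamma=1.0)
```

A cheap sanity check before trusting a reattach: adding the delta back onto the untouched base with γ=1 should reproduce the instruct checkpoint. If it does not, the base/instruct pair is mismatched.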
Option B′ — CPT literally on the instruct checkpoint’s own weights (LoRA-constrained + replay), fallback only. Use only if the base weights are genuinely unavailable. Run CPT with LoRA directly on the instruct checkpoint (Biderman et al.’s settings above), add 15–25% replay, and treat the stage-boundary eval as the primary gate rather than a confirmation — LoRA bounds but does not eliminate forgetting (their own math-domain ablation shows r=256 forgetting nearly as much as full-FT). No paper in this set tests “LoRA-CPT directly on an instruct checkpoint + later vector repair” as a combined recipe — this is a reasonable compositional inference from two separately-verified findings, not a directly validated one. Confidence: medium, flag it as such to the team.
Recommendation for a Qwen/Llama-class dense instruct model with a raw cybersecurity corpus: default to Option A. A clean base checkpoint is almost certainly available, it’s the best-evidenced path, and it costs no extra compute versus the CPT→SFT→RL order this book already argues for. Reach for Option B only if the second SFT run is the thing you’re specifically trying to avoid. Reserve Option B′ for the case where base weights are truly unavailable, and treat its output as provisional until it clears the eval in §5.
5. How to verify it didn’t break
Generic downstream-benchmark improvement is not evidence alignment survived: per §2, safety and instruction-following can degrade even while target-domain accuracy rises, because the degradation concentrates in a narrow behavioral slice (early-token safety distribution, chat-template format) that a domain benchmark never probes. This is the same trap Diagnosing the gap warns about generally: a single aggregate number collapses a heterogeneous verdict into a false single sentence. Run the checks below at the CPT stage boundary, meaning immediately after the CPT run (Option A/B′) or immediately after the chat-vector reattach (Option B), before SFT/RL starts and not only at the end of the pipeline. This is exactly the same “gate before you spend more compute” logic this book’s stage-wise eval protocol already uses elsewhere.
- Instruction-following retention — IFEval. Zhou, Lu, Mishra, Brahma, Basu, Luan, Zhou & Hou (Google), “Instruction-Following Evaluation for Large Language Models,” arXiv:2311.07911. ~500 prompts, 25 automatically/objectively verifiable instruction types (“write >400 words,” JSON-only, no commas). Chosen deliberately because it’s format/constraint-based, not LLM-judged — it directly measures the exact failure mode both forgetting papers observed (repetition, chat-template drift, “does what I asked” reliability), with no evaluator-model cost or bias.
- General-capability retention — MMLU (with a saturation caveat). Hendrycks et al., arXiv:2009.03300. Checks whether general world-knowledge/reasoning survived while domain knowledge was gained — a separate axis from IFEval, since a model can retain facts while completely losing instruction-following (exactly the Li & Lee finding), so both numbers are needed, not one. Log the caveat this project already applies to saturated benchmarks: MMLU is heavily contaminated/near-ceiling for current-generation models, so treat a small delta as necessary-but-not-sufficient and prefer a fresher check (GPQA, or a held-out-recent slice) as a secondary cross-check when budget allows.
- Domain-knowledge gain, on the same held-out set pre- and post-CPT. This is the thing CPT was for: a flat IFEval/MMLU alongside a flat or negative domain-knowledge delta means the CPT run bought nothing, and the thing to check is corpus quality/dedup, not the preservation hyperparameters. Run the identical prompt-formatting, chat-template, and decoding parameters across all three probes at every stage boundary; a mismatched sampling temperature or template version manufactures false deltas on its own.
- Safety/refusal retention, specifically adversarial, not just vanilla. Per Qi et al.’s shallow-alignment result (arXiv:2406.05946), “still refuses the standard red-team prompt” is not evidence safety alignment survived — check refusal robustness under adversarial-suffix/prefill-style probes, using Qi et al.’s (arXiv:2310.03693) small adversarially-designed probe-set methodology. This is not optional for a cybersecurity CPT corpus specifically — it’s exactly the domain where a refusal-behavior check matters most.
- Spot-check generations, don’t trust one aggregate number. The “spurious forgetting” finding (OpenReview:ScI7IlKGdI; a peer-reviewed ICLR 2025 poster, but not independently replicated at time of writing, so flag as promising-not-validated) is that some measured “forgetting” is a metric artifact: the model may still know the answer but phrase/format it differently post-CPT, tanking a strict-match score without real knowledge loss. Regardless of that paper’s own validity, the practical implication holds on general grounds: hand-inspect a sample of IFEval failures before declaring real forgetting, especially at a stage-boundary gate where you’re deciding whether to proceed or roll back.
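To make "objectively verifiable" concrete, here is what checks in the spirit of the three example instructions above could look like. These are simplified, hypothetical functions written for illustration; they are not IFEval's own checkers, and the real benchmark should be run through its published code.

```python
import json


def check_min_words(response: str, min_words: int) -> bool:
    """'Write more than N words': whitespace-delimited word count."""
    return len(response.split()) > min_words


def check_no_commas(response: str) -> bool:
    """'Do not use any commas.'"""
    return "," not in response


def check_json_only(response: str) -> bool:
    """'Answer in JSON only': the whole response must parse as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return False
    return True


def strict_accuracy(results: list[list[bool]]) -> float:
    """Fraction of prompts for which every attached instruction was satisfied."""
    return sum(all(r) for r in results) / len(results)


# A chatty preamble fails the JSON-only check even though the payload is correct.
# This is the kind of failure worth hand-inspecting before calling it forgetting.
clean = check_json_only('{"port": 443}')
chatty = check_json_only('Sure! Here is the JSON: {"port": 443}')
```

Because each check is deterministic, a failed prompt can be read directly: a response that fails only on formatting points at elicitation drift, while a wrong payload inside a well-formed response points elsewhere.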
Gate logic:
| Signal pattern | Read |
|---|---|
| IFEval flat, MMLU flat, domain-QA improved | Proceed to SFT/RL |
| IFEval drops hard, MMLU flat | Classic direct-CPT-on-instruct forgetting signature (matches §2 exactly). Option A: re-check you actually started from base, not instruct. Option B: re-check the reattach step (wrong γ, mismatched checkpoint version, dtype mismatch between CPT’d base and vector source). Option B′: expected to some degree — proceed only if domain-QA gain justifies it, or raise replay ratio and retrain |
| Both flat, domain-QA flat too | CPT taught nothing — check corpus size/quality/dedup before touching preservation hyperparameters |
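The gate table can be encoded as a small function so the same read is applied at every stage boundary. This is a sketch: `gate`, its labels, and the single shared tolerance are hypothetical, and the tolerance must be calibrated on a pilot run, not taken from here. Deltas are post-CPT minus pre-CPT scores on a 0–1 scale.

```python
def gate(d_ifeval: float, d_mmlu: float, d_domain: float, tol: float = 0.02) -> str:
    """Map stage-boundary deltas to the reads in the gate table.

    tol is a placeholder: "flat" means no drop larger than tol,
    "improved" means a gain larger than tol.
    """
    ifeval_flat = d_ifeval >= -tol
    mmlu_flat = d_mmlu >= -tol
    domain_improved = d_domain > tol

    if ifeval_flat and mmlu_flat and domain_improved:
        return "proceed"  # continue to SFT/RL
    if not ifeval_flat and mmlu_flat:
        return "instruct-forgetting"  # re-check starting checkpoint / reattach step
    if ifeval_flat and mmlu_flat and not domain_improved:
        return "cpt-taught-nothing"  # check corpus size/quality/dedup first
    return "unlisted-pattern"  # not covered by the table: inspect by hand


# Hypothetical deltas for each row of the table.
verdicts = [
    gate(0.00, -0.01, 0.10),
    gate(-0.15, 0.00, 0.10),
    gate(0.00, 0.00, 0.00),
]
```

The explicit fallthrough matters: a pattern the table does not list (for example, MMLU dropping as well) should stop the pipeline for manual review and not be forced into one of the three reads.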
This ties back into the ordering point from the frontier recipe: CPT is a pre-alignment stage, not a fine-tuning add-on layered after alignment. The evidence in §2 says the opposite order — CPT after alignment, naively, on the instruct checkpoint’s own weights — is the one failure mode every source converges on. The change this chapter makes concrete to that stage diagram: gradient updates from raw-corpus next-token training should land on the base weights (or be LoRA-isolated with explicit forgetting mitigation if base is genuinely unavailable), not be bolted directly onto the instruct checkpoint as an afterthought — and the IFEval+MMLU+domain-QA gate above is the mechanism that catches it early if that constraint is violated.
Confidence summary
| Claim | Confidence | Basis |
|---|---|---|
| Direct CPT on an instruct/RLHF checkpoint degrades instruction-following/format reliability | High | Two independent papers (2401.03129, 2410.10739), different groups and model families, same conclusion |
| What breaks is disproportionately format/alignment, not raw knowledge | High | Li & Lee; Zheng et al.’s “spurious forgetting” (ICLR 2025, peer-reviewed) |
| CPT-on-base → re-instruct preserves both knowledge and instruction-following | High | Jindal et al. §4.4, across 4 model families |
| Loss curves don’t reveal instruct-layer damage in real time | Medium | Mousavi et al., single group, very recent |
| Benign fine-tuning/CPT data erodes safety alignment without malicious intent, and safety is shallow | High | Qi et al. ×2, both peer-reviewed (ICLR 2024, ICLR 2025 Oral) |
| Forgetting gets worse, not better, with scale (1B–14B range) | High | Luo et al., direct ablations |
| LR re-warm/re-decay + replay matches from-scratch retraining | High | Ibrahim et al., multi-scale validated to 10B params |
| Chat-vector/instruction-residual reattachment lets you skip re-SFT after CPT | Medium-high | Two independent convergent recipes (language-shift + domain-shift) + seminal theory; no cybersecurity corpus tested by either |
| LoRA-constrained CPT bounds but doesn’t eliminate forgetting, and doesn’t fully close the domain-learning gap | High (general claim) / Medium (numbers transferring to a security corpus) | Biderman et al., rigorous but only code/math domains tested |
| Model merging (TIES/DARE) is a repair option but NOT a reliable fix at scale | High (core claims) / Medium (large-scale reliability caveat) | TIES/DARE peer-reviewed and reproducible; Harmon et al.’s ICLR 2026 submission (not yet peer-reviewed) reports merging “does not reliably mitigate forgetting” |
| IFEval + MMLU + domain-QA is the right stage-boundary gate | High (why these three axes) / Medium (universal numeric tolerance — must be pilot-calibrated) | Directly matches the failure modes documented above; MMLU-saturation caveat is a known general concern |
| “Spurious forgetting” (metric artifact vs. real capability loss) is a real confound to check for | Low-medium (flagging only) | Single un-independently-verified ICLR 2025 poster; included for the practical “spot-check generations” implication only |
Cross-links
- The path to a frontier cybersecurity model — this chapter is the “how” underneath that chapter’s Stage 0 (domain CPT); read that chapter first for why CPT is Stage 0 at all.
- Diagnosing the gap — a scientific framework — the competence/performance split this chapter borrows to explain why a post-CPT instruction-following collapse is an elicitation failure, not a knowledge gap.
- What the frontier labs actually do — the broader stage-ordering pattern (SFT → RL, imitation before exploration) this chapter’s CPT-before-alignment argument is a specific instance of.
- PEFT is orthogonal — general LoRA/QLoRA/DoRA mechanics; this chapter’s LoRA-CPT numbers are the CPT-specific instantiation of that general knob.
Proven post-training datasets — a usage-cited registry
Every other chapter in this book asks “which method” (The family map) or “which sequence of stages” (The recipe is a sequence, not a pick). This chapter answers the question those chapters leave open once you’ve picked Sequence B (§2 of the recipe chapter): which concrete, downloadable dataset actually fills each rung, for the six GENERAL (non-cyber) capabilities every Sequence-B rung needs regardless of what cyber-specific data you build on top — chat template, tool/function-calling, instruction following, preference, reasoning, and willingness/refusal calibration. This is a registry, not an essay: the tables are the payload.
1. The inclusion rule (read this before trusting any row)
This project’s permanent stance is never rely on academic cybersecurity-LLM projects — grounding comes from real frontier/open-weight recipes, not single-paper academic artifacts (frontier-cyber-model-path.md §1). This chapter applies the exact dataset-side analog of that stance:
A dataset gets a row ONLY if it is PROVEN-BY-USAGE — i.e. a named, real, released open-weight model or recipe (a tech report, an official model card, or a widely-used/well-cited community fine-tune) is documented as having actually trained on it. Every row below names that model/recipe and cites the source (model card, tech report, or lab blog post) that states the link directly — not inferred, not “this would probably be a good fit.”
Consequences enforced throughout:
- No single-paper-only academic datasets. A dataset whose only appearance is its own authors’ academic ablation, with no third-party or even self-reported shipped model consuming it, is dropped. Each category section below has an explicit “Dropped” list; read it before re-proposing something that already failed the bar.
- No cyber-specific datasets at all — this registry is scoped to GENERAL capabilities on purpose; cyber-specific data is out of scope here by design, not merely unfound.
- Self-proof is weaker than cross-lab proof, and is flagged as such. A dataset proven only via its own authors’ reference model (e.g. Magpie-Reasoning-150K → the Magpie team’s own checkpoints) is marked Medium confidence even though the usage is real; a dataset reused by an independent lab or absorbed into a bigger shipped mixture (e.g. COIG-CQIA → folded into BAAI’s Infinity-Instruct recipe) earns High.
- License ≠ proof-of-usage. Several rows below carry a real, verified training usage but a murky or non-commercial license (ShareGPT, ToolACE, COIG-CQIA, toxic-dpo-v0.2). The PROVEN_IN column tells you it was used; the License column tells you separately whether you can reuse it — don’t conflate them.
Every table row follows the format: dataset · link · capability · Sequence-B stage · size · license · PROVEN_IN (model/recipe) + citation · language(s) · notes · confidence. Source notes for all six categories, including live-verification method and the full per-dataset detail, are the six research files under artifacts/overnight-datasets/research/ (instruction-chat, tool-calling-agentic, preference-alignment, reasoning-cot, willingness-uncensor-safety, chinese-labs-multilingual); this chapter distills them, so consult those files for the complete citation quotes.
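If you keep the registry in code, the row format and the inclusion rule above map onto a small record type and one function. This is a hypothetical sketch: `RegistryRow` and `usage_confidence` are illustrative names, the rule is a simplification of §1 (the tables also weigh community adoption), and fields the condensed tables do not list are left empty here instead of guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegistryRow:
    dataset: str
    link: str
    capability: str
    stage: str          # Sequence-B stage
    size: str
    license: str
    proven_in: str      # named model/recipe documented as trained on the dataset
    citation: str
    languages: tuple[str, ...]
    notes: str
    confidence: str


def usage_confidence(self_reported_model: bool, independent_reuse: bool) -> Optional[str]:
    """Simplified inclusion rule; None means the dataset fails the proven-by-usage bar."""
    if independent_reuse:
        return "High"    # reused by another lab or absorbed into a shipped mixture
    if self_reported_model:
        return "Medium"  # proven only via the authors' own reference model
    return None          # single-paper-only: dropped


# Values copied from the Tulu 3 row in the instruction/chat table;
# languages and notes are not listed in that condensed table, so they stay empty.
row = RegistryRow(
    dataset="Tulu 3 SFT Mixture",
    link="HF",
    capability="instruction / chat",
    stage="SFT",
    size="939K",
    license="ODC-BY-1.0 (mixed per-subset)",
    proven_in="Tülu 3 + OLMo 2 (7B/13B/32B)",
    citation="arXiv:2411.15124",
    languages=(),
    notes="",
    confidence="High",
)
```

Keeping `license` and `confidence` as separate fields mirrors the rule that proof of usage and permission to reuse are independent questions.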
2. Instruction / chat SFT
The CPT→SFT rung: general instruction-following and multi-turn chat behavior — the base every other capability in this chapter sits on top of.
| Dataset | Link | Seq-B stage | Size | License | Proven in (+ cite) | Confidence |
|---|---|---|---|---|---|---|
| Tulu 3 SFT Mixture | HF | SFT | 939K | ODC-BY-1.0 (mixed per-subset) | Tülu 3 + OLMo 2 (7B/13B/32B) — arXiv:2411.15124, AI2 blog, OLMo 2 blog | High |
| OpenHermes 2.5 | HF | SFT | ~1M | mixed per-source | OpenHermes-2.5-Mistral-7B, Nous Hermes 2 series — dataset card states directly | High |
| UltraChat-200k | HF | SFT (stage 1 of SFT→DPO) | 231K | MIT | Zephyr-7B-α/β — arXiv:2310.16944, alignment-handbook recipe | High |
| OpenOrca / SlimOrca | OpenOrca · SlimOrca | SFT | 500K–4.2M | MIT | Mistral-7B-OpenOrca, OpenOrca-Platypus2-13B, Mistral-7B-SlimOrca — model card | High |
| ShareGPT (unofficial mirrors) | e.g. anon8231489123/ShareGPT_Vicuna_unfiltered | SFT | ~70K+ | unclear/gray (scraped, no official grant) | Vicuna-13B (LMSYS) — LMSYS blog | High (usage) / Low (license) |
| Dolphin (cognitivecomputations/dolphin) | HF | SFT | multi-million | Apache-2.0 | dolphin-2.x series (2.5-mixtral-8x7b, 2.9-llama3-8b, 2.9.3-Yi-1.5-34B, …) — model cards list it directly | High |
| Infinity-Instruct | HF | CPT-adjacent → SFT | 3M/7M (foundational) + chat | Apache-2.0 (gated) | BAAI InfInstruct-* family (Mistral-7B, Llama3-70B/8B, Yi-1.5-9B) — model cards state directly; arXiv:2506.11116 | Medium |
| Magpie → Smol-Magpie-Ultra | Magpie-Align org · SmolTalk card | SFT | raw 10M+ / Smol-Magpie-Ultra 400K curated | CC-BY-NC-4.0 (raw) / Apache-2.0 (Smol variant) | SmolLM2-Instruct family — arXiv:2502.02737; technique also self-proven via Llama-3-8B-Magpie-Align (arXiv:2406.08464, ICLR 2025) | High (technique) / Medium (raw license) |
| LMSYS-Chat-1M (GPT-4 subset, as ingredient) | HF (gated) | SFT (ingredient) | 1M total, subset used | custom gated license | OpenHermes 2.5 → Nous Hermes 2 — dataset card lists “ChatBot Arena (GPT-4 Only)” as a constituent source | Medium |
| WizardLM / Evol-Instruct | HF | SFT | 196K (public V2) | unclear (GPT-ToS-derived) | WizardLM family (7B–70B) — arXiv:2304.12244; ancestor of WizardCoder, absorbed into Tulu-3-style compilations | High |
| No Robots | HF | SFT | 10K | CC-BY-NC-4.0 | Tülu 3 / OLMo 2 (named constituent); independently: monsterapi/zephyr_7b_norobots | High |
Dropped: raw LMSYS-Chat-1M as a standalone primary SFT set (no lab found training directly+solely on the 1M dump for general chat SFT); raw Magpie-Align dumps as a standalone commercial-usable set (kept only the technique + the Apache-licensed Smol-Magpie-Ultra derivative).
3. Tool / function-calling & agentic trajectories
The SFT rung that teaches the model to use things — required regardless of cyber content, since the CTF agent’s core skill is calling tools correctly across multi-turn trajectories.
| Dataset | Link | Seq-B stage | Size | License | Proven in (+ cite) | Confidence |
|---|---|---|---|---|---|---|
| Glaive-function-calling-v2 | HF | SFT | 112,960 rows | Apache-2.0 | IBM Granite-20B-FunctionCalling — arXiv:2407.00121; also 256 HF models declare it as training data | High |
| Hermes-Function-Calling-v1 (NousResearch) | HF | SFT | tens of thousands | Apache-2.0 | Hermes-2-Pro-Llama-3-8B / -Mistral-7B — dataset card; its <tool_call> tag format is now vLLM’s default hermes tool-parser and Qwen’s own recommended format | High |
| Salesforce APIGen / xLAM-function-calling-60k | HF (arXiv:2406.18518) | SFT | 60,000 | CC-BY-4.0 | xLAM-1b-fc-r / xLAM-7b-fc-r (Salesforce) — ranked top-25 BFCL at release; NeurIPS 2024 D&B | High |
| APIGen-MT → xLAM-2 series | project (arXiv:2504.03601) | SFT | spans 1B–70B model family; APIGen-MT-5k public sample | research license (verify per artifact) | Salesforce xLAM-2-{1b..70b}-fc-r — model card states trained via this framework; SOTA on BFCL + τ-bench | High |
| ToolACE | HF (arXiv:2409.00920, ICLR 2025) | SFT | 11,300 rows | Apache-2.0 | ToolACE-8B + 51 total HF models trained/fine-tuned on it (Huawei) | High |
| ToolBench (OpenBMB) → ToolLLaMA | GitHub (arXiv:2307.16789, ICLR 2024 spotlight) | SFT | ~127k instructions over 16,000+ real APIs | Apache-2.0 | ToolLLaMA-7b-v1; feeds Agent-FLAN downstream | Medium-High (self-proven + downstream reuse) |
| Agent-FLAN | HF (arXiv:2403.12881, ACL 2024 Findings) | SFT | 219 MB (AgentInstruct+ToolBench+ShareGPT mix) | Apache-2.0 | InternLM Agent-FLAN-7B — dataset card states “+3.5% across agent eval datasets” over prior best | High |
| Gorilla / APIBench | HF (arXiv:2305.15334) | SFT | 1,600+ APIs | Apache-2.0 | Gorilla-7B family (UC Berkeley) — NeurIPS 2024 D&B track | High |
Dropped: NexusRaven-V2 training data (only an eval set was released publicly); watt-tool-8B/70B (proprietary/undisclosed dataset, no citable artifact); a standalone Qwen tool-calling SFT row (Qwen never published one — its format adoption of Hermes-style tool use is credited to the Hermes row above instead).
Cross-cutting for Sequence-B: prioritize the multi-turn/agentic sets (ToolACE, APIGen-MT, Agent-FLAN, ToolBench) over single-call sets (xLAM-60k, APIBench, Glaive) — CTF tool use is inherently multi-step and stateful. Agent-FLAN’s explicit negative/anti-hallucination samples are the one artifact in this table that doubles as willingness/refusal-adjacent signal (see §6).
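Since the Hermes row is the format anchor for the tool-use rung, it helps to see what that format means at the data level. Below is a minimal sketch of pulling tool calls out of one assistant turn, assuming each call is a single JSON object with `name` and `arguments` keys wrapped in `<tool_call>` tags. The tool name `nmap_scan` and the helper `parse_tool_calls` are hypothetical, and a production parser (such as the one a serving stack ships) handles more edge cases than this.

```python
import json
import re

# One <tool_call>...</tool_call> span; DOTALL lets the JSON body span several lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(assistant_text):
    """Return (calls, errors) for every <tool_call> block in one assistant turn."""
    calls, errors = [], []
    for raw in TOOL_CALL_RE.findall(assistant_text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append(f"invalid JSON: {exc.msg}")
            continue
        if not isinstance(payload, dict) or "name" not in payload:
            errors.append("missing 'name' field")
            continue
        calls.append({"name": payload["name"], "arguments": payload.get("arguments", {})})
    return calls, errors


# Hypothetical turn: one well-formed call, one malformed call.
turn = (
    "Scanning first.\n"
    '<tool_call>\n{"name": "nmap_scan", "arguments": {"target": "10.0.0.5"}}\n</tool_call>\n'
    "<tool_call>\n{not json}\n</tool_call>"
)
calls, errors = parse_tool_calls(turn)
```

Running the same check over a candidate SFT trajectory set is a cheap way to measure how many turns would teach the model a malformed call before any training is spent on them.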
4. Preference (DPO / KTO / RLHF)
| Dataset | Link | Seq-B stage | Size | License | Proven in (+ cite) | Pairwise/On-off-policy | Confidence |
|---|---|---|---|---|---|---|---|
| UltraFeedback (binarized) | HF | preference (DPO/RM) | 64K prompts → ~61K pairs | MIT | Zephyr-7B-β — arXiv:2310.16944; also a Tulu 3 DPO-mixture component (arXiv:2411.15124) | Pairwise, off-policy | High |
| Anthropic HH-RLHF | HF | preference (RM/RLHF) | ~170K comparisons | MIT | StableVicuna (Stability AI/CarperAI) — blog | Pairwise, off-policy | High |
| Stanford SHP / SHP-2 | SHP · SHP-2 | preference (RM/RLHF) | 385K / 4.8M | MIT | StableVicuna (combined 3-dataset RM recipe) — same blog above | Pairwise, off-policy | Medium-High |
| Nectar | HF | preference (RM → RLAIF) | ~183K prompts × 7-way ranking | Apache-2.0 (research) | Starling-RM-7B-alpha / Starling-LM-7B-alpha (Berkeley NEST) — blog | K-wise→pairwise, off-policy | High |
| HelpSteer2 (+v1) | HF | preference (attribute-scored RM) | ~21K pairs (v2) | CC-BY-4.0 | Llama-3.1-Nemotron-70B-Reward/-Instruct (NVIDIA), #1 RewardBench at release — arXiv:2406.08673 | Multi-attribute → pairwise, off-policy | High |
| PKU-SafeRLHF | HF | preference (safe-RLHF dual reward+cost) | 265K QA / 166.8K pref pairs | Apache-2.0 (framework) / non-commercial (data) | Beaver-7B (PKU-Alignment) — GitHub, arXiv:2406.15513 | Pairwise (dual-label), off-policy | High |
| Argilla DPO mixes (ultrafeedback-binarized-preferences-cleaned, distilabel-intel-orca-dpo-pairs) | HF | preference (DPO) | 64K / 12.9K | Apache-2.0 | Notus-7B-v1, argilla/distilabeled-OpenHermes-2.5-Mistral-7B — model cards | Pairwise, off-policy | High |
| Skywork-Reward-Preference-80K (v0.2) | HF | preference (RM) | 80K curated pairs | per-source (verify) | Skywork-Reward-Gemma-2-27B / -Llama-3.1-8B (v0.2) — #1 RewardBench; arXiv:2410.18451 | Pairwise, mixed on/off-policy | High |
| KTO-mix-14k | HF | preference (KTO) | ~15K rows | Apache-2.0 | HF TRL’s KTOTrainer reference dataset; ~9 community checkpoints; Oumi’s KtoMix14kDataset recipe | Unpaired, off-policy | Medium (real usage, no flagship model pinned) |
Dropped: no cyber-specific preference dataset was in scope; any preference set whose only usage was a single unrelated academic ablation was excluded before it reached the table.
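The pairwise vs. unpaired column matters once you shape your own trajectories. The sketch below shows both shapes side by side, assuming the `prompt` / `completion` / `label` fields that TRL documents for unpaired (KTO-style) data and `prompt` / `chosen` / `rejected` for pairs. The `Attempt` record and its field names are hypothetical stand-ins for whatever the harness logs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Attempt:
    challenge_id: str  # hypothetical identifier for one challenge
    prompt: str
    completion: str
    solved: bool       # True when the flag verifier returned 1


def to_unpaired(attempts):
    """KTO-shaped rows: every attempt is usable, solved or not."""
    return [{"prompt": a.prompt, "completion": a.completion, "label": a.solved} for a in attempts]


def to_pairs(attempts):
    """DPO-shaped rows: only challenges with both a solved and a failed attempt."""
    by_challenge = defaultdict(lambda: {True: [], False: []})
    for a in attempts:
        by_challenge[a.challenge_id][a.solved].append(a)
    pairs = []
    for group in by_challenge.values():
        for good, bad in zip(group[True], group[False]):
            pairs.append({"prompt": good.prompt, "chosen": good.completion, "rejected": bad.completion})
    return pairs


attempts = [
    Attempt("c1", "solve c1", "trace that found the flag", True),
    Attempt("c1", "solve c1", "trace that stalled", False),
    Attempt("c2", "solve c2", "trace that stalled", False),
]
unpaired_rows = to_unpaired(attempts)
paired_rows = to_pairs(attempts)
```

The design point is the drop-off: a challenge that was never solved contributes nothing to the paired set but still contributes negative rows to the unpaired one.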
5. Reasoning / CoT (math + code + science distillation)
| Dataset | Link | Seq-B stage | Size | License | Proven in (+ cite) | Confidence |
|---|---|---|---|---|---|---|
| NuminaMath-CoT | HF | SFT | 860K pairs | Apache-2.0 (code) | AI-MO/NuminaMath-7B-CoT/-TIR — 1st AIMO Progress Prize; also a stated source in Qwen2.5-Math (arXiv:2409.12122) | High |
| Sky-T1_data_17k | HF | SFT | 17K | Apache-2.0 | NovaSky-AI/Sky-T1-32B-Preview — blog | High |
| Bespoke-Stratos-17k | HF | SFT | 17K | CC-BY-NC-4.0 | Bespoke-Stratos-32B/-7B — blog | High |
| OpenThoughts-114k / OpenThoughts2-1M / OpenThoughts3-1.2M | 114k · 2-1M · 3-1.2M | SFT | 114K / 1M / 1.2M | Apache-2.0-style | OpenThinker-7B / -2-32B / -3-7B — arXiv:2506.04178; OpenThinker3-7B is SOTA-open-data at release (53% AIME25, 51% LCB, 54% GPQA-D) | High |
| OpenR1-Math-220k | HF | SFT (+DPO-usable) | 220K curated | Apache-2.0 | open-r1/OpenR1-Qwen-7B — Open-R1 update #2 | High |
| Mixture-of-Thoughts | HF | SFT | 350K verified traces | Apache-2.0 | open-r1/OpenR1-Distill-7B — replicates R1-Distill-Qwen-7B; mixture ratio follows Phi-4-reasoning methodology | High |
| OpenMathInstruct-2 | HF | SFT | 14M pairs | CC-BY-4.0-style | OpenMath2-Llama3.1-8B/70B (NVIDIA) — ICLR 2025, arXiv:2410.01560 | High |
| OpenMathReasoning | HF | SFT (+RL-prompts via problem-only split) | 3.2M CoT + 1.7M TIR + 566K GenSelect | CC-BY-4.0-style | OpenMath-Nemotron-1.5B..32B — literal data behind NVIDIA’s AIMO-2-winning submission, arXiv:2504.16891 | High |
| OpenCodeReasoning (+1.1) | HF | SFT | 736K–1.165M | CC-BY-4.0-style | OpenCodeReasoning-Nemotron-7B/14B/32B/-1.1 — model cards state directly; arXiv:2504.01943; SFT-only beats RL alternatives on LiveCodeBench (61.8%) | High |
| Magpie-Reasoning-150K (V1) | HF | SFT | 150K | Llama-3/Qwen2 community license | Llama-3-8B-Magpie-Align-SFT-v0.2 (Magpie’s own reference checkpoints) — arXiv:2406.08464, ICLR 2025 | Medium (self-proven only) |
Dropped: the raw DeepSeek-R1 “800k samples” SFT set was never released as a standalone artifact (only checkpoints were open-sourced) — the community reconstructions above are what’s actually usable and are what’s listed instead; AM-DeepSeek-R1-Distilled-1.4M (no verified third-party adoption beyond its own paper); any cyber-specific reasoning/CTF-reasoning dataset (out of scope by permanent stance).
Read order if you’re pulling data, not just reading the table: NuminaMath-CoT (substrate) → OpenThoughts3-1.2M (best current fully-open ablated set) → OpenMathReasoning (if you need competition-tier math) → OpenCodeReasoning (closest analog to CTF/exploit-reasoning) → Mixture-of-Thoughts/OpenR1-Math-220k (if reproducibility of the training recipe matters more than raw SOTA) → Bespoke-Stratos/Sky-T1 (cheapest pilot).
6. Willingness / uncensor vs. safety / refusal calibration
This category covers both directions of refusal/compliance behavior — deliberately, because the target model is a sandboxed offensive-security research agent and over-refusal is a real, documented failure mode (see §9).
6a. Compliance-increasing (uncensor / de-refusal)
| Dataset | Link | Seq-B stage | Size | License | Proven in (+ cite) | Confidence |
|---|---|---|---|---|---|---|
| Dolphin dataset family (incl. not_samantha_norefusals.jsonl) | HF | SFT | ~4.5M base + multi-M Dolphin-2.9 mix | Apache-2.0 | Dolphin model series (2.9.1-llama-3-8b, 2.9.2-Phi-3-Medium, 2.5-mixtral-8x7b, …) — Hartford’s post, axolotl configs list the file directly | High |
| unalignment/toxic-dpo-v0.2 | HF | preference (DPO) | 541 pairs | not permissively licensed; sensitive-content flagged | Component of mlabonne/orpo-dpo-mix-40k, used to DPO-“heal” NeuralDaredevil-8B-abliterated post weight-orthogonalization — dataset card, Labonne’s abliteration post | High |
| ehartford/wizard_vicuna_70k_unfiltered | HF | SFT | 34,598 conversations | unclear (ShareGPT-derived) | Wizard-Vicuna-13B-Uncensored family (7B/13B/33B/65B) — model card states the alignment-stripping directly; 128 HF models trained on it | High |
6b. Safety-increasing (harmlessness / appropriate refusal / anti-jailbreak / anti-over-refusal)
| Dataset | Link | Seq-B stage | Size | License | Proven in (+ cite) | Confidence |
|---|---|---|---|---|---|---|
| Anthropic/hh-rlhf | HF | preference / RL-prompts | 161K+ helpful + 42K+ harmless + red-team transcripts | research-use | StableVicuna-13B RM training + the original Anthropic RLHF paper (Bai et al. 2022) — blog | High |
| PKU-Alignment/PKU-SafeRLHF | HF | preference (safe-RLHF reward+cost) | 83.4K entries | non-commercial | Beaver-7B-v1.0 — model card lists it directly, arXiv:2310.12773 | High |
| allenai/wildguardmix | HF | SFT (safety_noncompliance) | 50K prompts (Tulu-3 slice) | Apache-2.0 | Tulu 3 — named safety_noncompliance bucket component; arXiv:2411.15124 | High |
| allenai/wildjailbreak | HF | SFT (safety_noncompliance) + RL-prompts | 262K total (50K sampled into Tulu 3) | ODC-BY-1.0 | Tulu 3 — same bucket; explicitly designed to fix over-refusal via harmful-vs-benign-but-scary contrastive pairs | High |
| allenai/coconot | HF | SFT (safety_noncompliance) | 10,983 prompts | ODC-BY-1.0 | Tulu 3 — same bucket; Brahman et al. 2024, “The Art of Saying No” | High |
Dropped: Nous Hermes 3’s compliance-steerability behavior is real but no single named public dataset is credited for it (proprietary blend); Do-Not-Answer/XSTest/OR-Bench are refusal-rate eval sets, not training sets, in any published recipe verified; Llama Guard’s training-data composition is not public.
7. Chinese-labs & multilingual
| Dataset | Link | Capability | Seq-B stage | Size | License | Proven in (+ cite) | Confidence |
|---|---|---|---|---|---|---|---|
| BAAI Infinity-Instruct | HF | instruction/chat SFT | SFT | ~7.4M found. + 1.5M chat | CC-BY-SA-4.0 (mixed subsets) | Own tech report fine-tunes Mistral/LLaMA/Qwen/Yi; InfInstruct-LLaMA3.1-70B beats GPT-4-0314 by 8.6% on IF — arXiv:2506.11116 | High |
| COIG-CQIA | HF | Chinese instruction SFT (LIMA-style) | SFT | ~48K (45.8K filtered) | CC-BY-NC-SA | Own paper fine-tunes Yi-6B/34B, Qwen2-7B/72B — arXiv:2403.18058; also folded into Infinity-Instruct (second, independent usage) | High |
| Firefly (firefly-train-1.1M) | HF | Chinese multi-task SFT | SFT (+DPO via toolkit) | ~1.15M–1.65M | unspecified (verify) | firefly-mixtral-8x7b, firefly-baichuan2-13b, firefly-llama-30b — GitHub, 6.6K★ | Medium-High |
| Magpie-Qwen2(.5)-Pro (+ -200K-Chinese) | HF | self-synthesized instruction/preference-pair SFT | SFT | 1M (Qwen2-Pro) + per-model variants | research-use | Method paper matches Llama-3-8B-Instruct SFT-only — arXiv:2406.08464, ICLR 2025; ZH subset generated by Qwen2-72B-Instruct itself | High (method+EN) / Medium (ZH subset) |
| MAP-Neo Matrix Data Pile | HF | bilingual EN/ZH pretrain corpus | CPT (+SFT/alignment released alongside) | 4.5–4.69T tokens | Apache-2.0 | MAP-Neo-7B, pretrained from scratch, fully open — arXiv:2405.19327 | High |
| OpenCSG Chinese Corpus (Chinese-Cosmopedia etc.) | HF | ZH synthetic textbook CPT + Smoltalk-style SFT | CPT + SFT | 15M docs / ~60B tokens (Cosmopedia slice) | Apache-2.0 | csg-wukong-1B (OpenCSG) — arXiv:2501.08197 | Medium |
| Congliu/Chinese-DeepSeek-R1-Distill-data-110k | HF | ZH reasoning/CoT distillation | SFT | 110K | not stated (research/community use) | Distilled via DeepSeek-R1-671B API per R1’s own protocol; 91 downstream community models trained on it | Medium-High |
| AgentInstruct (Zhipu/THUDM) | HF | agent/tool-use trajectories | SFT | 1,866 verified trajectories | Apache-2.0-style | AgentLM-7B/13B/70B — arXiv:2310.12823, ACL 2024 Findings; independently reused by InternLM’s Agent-FLAN | High |
| LongAlign-10k (Zhipu/THUDM) | HF | long-context instruction alignment | SFT | 10,000 (8K–64K tokens) | Apache-2.0-style | ChatGLM3-6B-128k, LongAlign-6B/7B/13B-64k — arXiv:2401.18058, EMNLP 2024 Findings | High |
| COIG-P | HF | Chinese preference (DPO) | preference | 1,009K (paper) / ~101K (HF release) | CC-BY-NC-4.0 | Own paper DPO-trains Qwen2.5-Instruct-7B-COIG-P, Infinity-Instruct-3M-*-COIG-P — arXiv:2504.05535, EACL 2026 Findings | Medium |
| huozi_rlhf_data | GitHub | Chinese human-labeled preference | preference | 16.9K pairs | Apache-2.0 | Huozi 2.0 (HIT-SCIR) — official RLHF stage of a named released model | Medium |
| Aya Dataset + Aya Collection (Cohere) | Dataset · Collection | massively multilingual instruction SFT | SFT | 204K (Dataset, 65 langs) / 513M (Collection, 101 langs) | Apache-2.0 (verify per-subset) | Aya-101 → Aya 23 → Aya Expanse (Cohere/Cohere Labs) — arXiv:2402.06619 | High |
Dropped: zhihu_rlhf_3k and dikw/hh_rlhf_cn (only scattered small/unlabeled community reward-model
usage, no named shipped recipe); a standalone first-party Qwen/DeepSeek/GLM/Yi SFT-or-preference-data row
(none of those labs release their actual training data — their footprint here is captured indirectly via
AgentInstruct/LongAlign, and via community R1-distillation sets like Congliu’s 110k).
8. Which datasets at which Sequence-B rung
Mapped onto Sequence B’s actual stage order (frontier-recipe-is-a-sequence.md §2): base/instruct choice → (optional) CPT → SFT cold-start → rejection-sampling/on-policy SFT → preference (DPO/KTO) → RLVR/GRPO → iterate.
| Stage | What it needs | Pull from |
|---|---|---|
| CPT / domain pretraining (optional) | Bilingual/domain raw-text corpus, if extending beyond the base checkpoint’s native coverage | MAP-Neo Matrix (§7, EN/ZH), OpenCSG Chinese-Cosmopedia (§7); Infinity-Instruct’s “foundational” phase blurs CPT/SFT (§2, §7) |
| SFT cold-start — general chat/instruction | Broad instruction-following + multi-turn chat prior before anything domain-specific | Tulu 3 SFT Mixture, UltraChat-200k, OpenHermes 2.5, No Robots (§2) |
| SFT cold-start — tool/agentic | Multi-turn tool-call format + agentic trajectory shape | Hermes-Function-Calling-v1 (format anchor for vLLM’s hermes parser), ToolACE, APIGen-MT, Agent-FLAN, AgentInstruct (§3, §7) |
| SFT cold-start — reasoning prior | Long-CoT math/code/science distillation before any RL stage touches it | OpenThoughts3-1.2M (default pick), OpenMathReasoning (competition-tier), OpenCodeReasoning (code/CTF-adjacent) (§5) |
| SFT cold-start — willingness calibration | Refusal-boundary precision for a sandboxed offensive agent, not blanket compliance or blanket refusal | CoCoNot + WildJailbreak + WildGuardMix (the Tulu 3 safety_noncompliance bucket) as the base; Dolphin/wizard_vicuna_70k_unfiltered only if you deliberately want the de-refusal direction too (§6) |
| Preference (DPO/KTO) | Chosen/rejected pairs (or unpaired KTO-shaped labels) once an SFT checkpoint exists to regenerate on-policy | UltraFeedback, Skywork-Reward-Preference-80K, HelpSteer2 as off-policy starting mixes; KTO-mix-14k as the format template if your own trajectories are naturally unpaired good/bad, not clean pairs (§4) |
| RL-prompts (RLVR/GRPO) | Prompts + a verifier — deliberately NOT a fixed labeled dataset per method-to-data.md | WildJailbreak’s adversarial-prompt pool and OpenMathReasoning’s 193k problem-only split are the closest “prompt source, no answer” shapes in this registry; your own challenge set + flag-check verifier remains the actual RLVR data object for the cyber-specific rung (out of scope here) |
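To make "prompts + a verifier" concrete: the RLVR data object for the cyber rung is a prompt source plus a function like the one below, not a labeled corpus. This is a minimal sketch assuming a `flag{...}` flag format and a per-challenge expected flag held by the harness. Both are assumptions, the name `flag_reward` is hypothetical, and the check is meant to run on the harness side rather than inside the agent's sandbox.

```python
import hmac
import re

# Hypothetical flag format; adjust to whatever the challenge portfolio actually uses.
FLAG_RE = re.compile(r"flag\{[^}\n]{1,128}\}")


def flag_reward(transcript: str, expected_flag: str) -> float:
    """Binary reward in {0.0, 1.0}: 1.0 only if the exact expected flag appears."""
    for candidate in FLAG_RE.findall(transcript):
        # Constant-time comparison, so the check itself leaks nothing about the flag.
        if hmac.compare_digest(candidate.encode(), expected_flag.encode()):
            return 1.0
    return 0.0


solved = flag_reward("cat /root/flag.txt\nflag{abc123}", "flag{abc123}")
failed = flag_reward("guessing flag{wrong}", "flag{abc123}")
```

Nothing here scores intermediate steps, which is deliberate: the reward stays as sparse and as hard to game as the flag check itself.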
9. Two honest caveats
Willingness vs. over-refusal, for a sandboxed offensive-security agent. Over-refusal — declining a
legitimate pentest/CTF request because it pattern-matches “harmful” — is a real, well-documented failure
mode, which is exactly what CoCoNot and WildJailbreak were purpose-built to fix (contrastive
harmful-vs-benign-but-scary-looking prompts, §6b).
That is the correct tool for this project’s actual problem: precision on the refusal boundary inside a
controlled, sandboxed tool-use context — not blanket compliance. The de-refusal-direction datasets in
§6a (Dolphin, toxic-dpo-v0.2, wizard_vicuna_70k_unfiltered) are included because they are genuinely
proven-by-usage and instructive as the opposite pole of the same axis, but they are a blunter instrument
(strip refusal-flavored completions wholesale) than CoCoNot’s calibrated “refuse the actually-unsafe
subset, comply with the rest.” Default recommendation: build the willingness rung primarily from the
Tulu-3 safety_noncompliance bucket (WildGuardMix + WildJailbreak + CoCoNot together, exactly as Tulu 3
ships it), and treat §6a as a reference recipe pattern rather than a default ingredient.
The off-policy caveat for distilled reasoning data. Nearly every dataset in §5 is CoT distilled from a stronger teacher (DeepSeek-R1, QwQ-32B, Llama-3.1/3.3-70B-Instruct) — the policy model you’d actually train on Sequence B never generated these traces itself. This is the correct, cheap way to bootstrap a reasoning prior before any on-policy stage (exactly what Sky-T1/Bespoke-Stratos/OpenThoughts/OpenR1 all do), but pure SFT-on-distillation caps quality at the teacher’s and plateaus — it does not substitute for an on-policy RL stage afterward. This mirrors the same on/off-policy axis this book treats as the single most load-bearing distinction in post-training generally (foundations/on-off-policy.md) and the same caveat the preference registry (§4) makes explicitly about off-policy DPO mixes: budget an on-policy regeneration-and-rescoring pass against your own SFT checkpoint before treating either the reasoning prior or the preference stage as final, mirroring what Tulu 3 and NVIDIA’s HelpSteer2-Preference both do in practice.
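The regeneration-and-rescoring pass can be sketched in a few lines. This is an illustration under stated assumptions, not the recipe any of the cited projects ship: `generate` stands in for sampling from your own SFT checkpoint, `score` for a reward model or verifier, and `k` and `min_margin` are hypothetical knobs.

```python
def regenerate_pairs(prompts, generate, score, k=4, min_margin=0.1):
    """Build on-policy preference pairs from the current checkpoint's own samples.

    generate(prompt) -> str           samples one completion from the policy (stub here)
    score(prompt, completion) -> float  reward model or verifier score (stub here)
    """
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        scored = sorted(((score(prompt, c), c) for c in candidates), key=lambda t: t[0])
        (low, worst), (high, best) = scored[0], scored[-1]
        # Skip prompts where the policy's samples are indistinguishable to the scorer.
        if high - low >= min_margin:
            pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs


# Deterministic stubs so the sketch runs without a model.
_outputs = iter(["a", "abcd", "ab", "abc"])
demo_pairs = regenerate_pairs(
    ["p"],
    generate=lambda prompt: next(_outputs),
    score=lambda prompt, completion: len(completion) / 10,
)
```

Both sides of every pair come from the policy being trained, which is what distinguishes this pass from starting on an off-policy mix.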
Cross-links
- The recipe is a sequence, not a pick — the Sequence-A/Sequence-B stage skeleton this registry’s §8 mapping is built against.
- The path to a frontier cybersecurity model — the north star this registry serves; the “proven-by-usage, not academic-only” stance is the dataset-side mirror of that chapter’s model/recipe-side stance.
- Method → Data (your real bottleneck) — the method-first framing that explains why RLVR/GRPO in §8 deliberately has no fixed dataset row, and why rejection-sampling FT’s data object is a byproduct you already produce.
- Full per-dataset detail, live-verification method, and additional dropped candidates:
artifacts/overnight-datasets/research/{instruction-chat,tool-calling-agentic,preference-alignment,reasoning-cot,willingness-uncensor-safety,chinese-labs-multilingual}.md.
Cybersecurity is one of a family — what cracked the others
Every other chapter in this book treats the CTF task as our problem: our harness, our 1000-challenge portfolio, our flag verifier. This chapter argues the opposite framing is more useful: cyber CTF-solving is one instance of a general problem family — {LONG-HORIZON (up to ~100 turns), EXPLORATORY (search/enumeration over a huge space), SPARSE-TERMINAL-REWARD (only the flag is verified, nothing in between), VERIFIABLE (an ungameable checker)} — and at least six other domains share that exact structural signature. Frontier labs and general RL research have been cracking members of this family for a decade. The move this chapter makes is: hold up each domain, find the stage (pretraining / SFT / RL) and the specific technique that actually fixed its long-horizon/sparse-reward problem, and rank what transfers.
Stance, honored throughout: academic cybersecurity-LLM projects (CTF-Dojo, Cyber-Zero, Pentest-R1, HackSynth, AutoPenBench, DRLRM-PT, and siblings) are mention-only, labelled “academic, not a basis” below — no conclusion here rests on them. General coding, competitive programming, theorem proving, deep-research/web agents, games, and robotics are not academic-security work — they are exactly the frontier-lab and general-RL-theory evidence the project’s stance asks for. Everything below is cross-linked to The path to a frontier cybersecurity model, Diagnosing the gap, RL that creates value, and One problem, or many? — this chapter is the cross-domain evidence layer those four already draw on; it does not re-derive their verdicts.
1. The structural frame
Strip away the domain-specific vocabulary (a “vulnerable endpoint,” a “failing unit test,” a “Lean tactic,” a “hidden test case,” a “Montezuma key”) and every member of this family reduces to the same abstract shape:
flowchart TD
subgraph ABSTRACT["Abstract class"]
direction LR
A1["Long-horizon:\nmany sequential\ndecisions"] --> A2["Exploratory:\nhuge search space,\nmost paths fail"]
A2 --> A3["Sparse terminal\nreward: only the\nend state is scored"]
A3 --> A4["Verifiable:\nan ungameable,\nmechanical checker"]
end
subgraph CYBER["Our CTF pipeline"]
direction LR
C1["Recon /\nenumeration\n(~turns 1-20)"] --> C2["Endpoint /\nvuln discovery\n(~turns 20-50)"]
C2 --> C3["Exploit\nchain\n(~turns 50-90)"]
C3 --> C4["Flag read +\nserver-side\nverify {0,1}"]
end
A1 -. maps to .-> C1
A2 -. maps to .-> C2
A3 -. maps to .-> C3
A4 -. maps to .-> C4
The point of drawing the arrow this way: the flag verifier is not special — it is our domain’s instance of “the proof kernel,” “the unit-test suite,” “the hidden Codeforces test cases,” “the reference-answer F1 score,” “the win/loss signal.” Every domain below chose (or was forced into) a stage + technique to make progress against that shape. The question this chapter answers is which of those choices generalize.
2. THE ANALOGY TABLE
| Domain | Horizon | Reward sparsity | Exploration burden | Verifier | Stage + technique that cracked it | Transfers to cyber |
|---|---|---|---|---|---|---|
| Long-horizon coding (SWE-bench repo agents) | 20–80+ tool-call turns | Terminal (tests pass) | Which file/function among thousands | Execution (unit tests) | SFT on executable-env trajectories (SWE-Gym arXiv:2412.21139, R2E-Gym arXiv:2504.07164) → RL with execution-verified reward (SWE-RL arXiv:2502.18449, o1→o3 arXiv:2502.06807); long-context multi-turn RL needs DAPO-style stabilizers + progressive context curriculum (arXiv:2508.03501) | HIGH — closest structural twin. Mirrors our planned pipeline almost exactly. |
| Particular-language lift (weak PL / weak NL, execution-RL on code) | Short (1 program) → multi-turn repair (RLEF) | Terminal (tests) | Program-space / repair-space | Execution (compiler/unit tests) | Pretraining/CPT does the knowledge lift (corpus coverage — DeepSeek-Coder arXiv:2401.14196, StarCoder2 arXiv:2402.19173); RL fixes execution only — CodeRL arXiv:2207.01780, StepCoder arXiv:2402.01391, RLEF arXiv:2410.02089 — never injects new knowledge | HIGH, but as a diagnostic contrast, not a lever — see §4 below. |
| Competitive programming | Short per-problem; GrandCode reframes as multi-stage agentic loop | Terminal (hidden tests) | Program space (millions of candidates) | Execution (unit tests) | Sampling breadth + cheap filter (AlphaCode, arXiv:2203.07814) then a purpose-built multi-stage GRPO variant for delayed reward + off-policy drift (GrandCode’s “Agentic GRPO,” arXiv:2604.02721) | STRONG (conceptually) — validates our rejection-sampling-breadth strategy; GrandCode is the most load-bearing GRPO-variant precedent for our exact delayed-reward shape. |
| Theorem proving (Lean/Coq) | Long, many tactic steps — decomposable into subgoals | Per-step, not just terminal — the kernel checks every step | Tactic/proof search space | Deterministic, per-step, ungameable (proof kernel) | Subgoal decomposition for cold-start data + curriculum (DeepSeek-Prover-V2, arXiv:2504.21801); synthetic self-play data generation (AlphaGeometry, Nature 2024, DOI:10.1038/s41586-023-06747-5); then RL vs. binary kernel-verified reward (DeepSeek-Prover-V1.5) | PARTIAL — cleanest verifier of any row. Licenses subgoal-decomposition-for-DATA; does not license densifying reward over unverifiable CTF steps (shell/HTTP output isn’t kernel-checkable). |
| Deep-research / web agents | ~10–40 turns (search/click/browse) | Terminal (task complete) | Which query/page/source on the live open web | Learned RM (WebGPT, weak) → outcome/ORM (WebRL) → reference-match F1 (DeepResearcher) | RL (outcome-only) trained in the real live environment; WebRL’s self-evolving curriculum generated from the model’s own failures (arXiv:2411.02337); R1-Searcher’s sequential (not summed) format-then-outcome staged reward (arXiv:2503.05592) | STRONGEST transfer row. Failure-to-curriculum + safe staged tool-use bootstrap, both directly actionable. |
| Hard-exploration games (Montezuma / Pitfall / NetHack / StarCraft) | Very long (1000s–10,000+ actions) | Terminal, near-zero for prior baselines | Combinatorial state space; adversarial strategy space | Environment score / win-loss | Archive-based “explore, remember, return-then-explore-further” (Go-Explore, arXiv:1901.10995); diverse self-play league vs. naive self-play (AlphaStar, DOI:10.1038/s41586-019-1724-z); scale + dense shaping, no exotic algorithm (OpenAI Five, arXiv:1912.06680) | STRONG (Go-Explore) / different axis (league) / honest caveat (NetHack still unsolved). |
| Robotics (sparse-reward manipulation) | Short (tens of steps) | Sparse binary terminal | Continuous action/goal space | Environment success check | Data-relabeling: HER (arXiv:1707.01495) — relabel a failed trajectory’s achieved state as the goal it accidentally satisfied | SPECULATIVE at the mechanism level (off-policy-specific); the idea licenses mining our own flag=0 trajectories for sub-skill SFT data. |
| Cyber-CTF (academic literature — mention-only) | ~100 turns (our regime) | Terminal (flag) | Endpoint/vuln/exploit-chain space | Deterministic flag verifier | Monolithic outcome-only RL / rejection-sampling (CTF-Dojo, Cyber-Zero — academic, not a basis); two-stage offline→online curriculum (Pentest-R1 — academic, not a basis) | — target row; convergence with the frontier rows above is a mild corroborating note only, never load-bearing. |
3. Per-domain deep-dive
3.1 Long-horizon agentic coding — the closest twin
Problem it fixes: a model that free-runs an unconstrained shell wastes turns on malformed edits and noisy tool output; naive RL on a full 20–80-turn trajectory with only a terminal test-pass reward hits credit-misattribution (an early correct action gets penalized because a later, unrelated action failed).
What fixed it, by stage:
- Scaffold (pre-RL): SWE-agent’s Agent-Computer Interface (arXiv:2405.15793) — a small fixed action set + concise per-step feedback lifted pass@1 from 3.8%→12.5% on the same underlying LM, before touching weights. Anthropic’s Claude 3.5/3.7 Sonnet scaffold philosophy is the opposite-looking but complementary lesson: deliberately minimal scaffolding (bash + string-replace edit tool), crediting the gain to post-training, not scaffold cleverness.
- SFT: trajectory distillation on executable-environment corpora is the near-universal first stage — SWE-Gym (arXiv:2412.21139), R2E-Gym (arXiv:2504.07164). SWE-Master (very recent, arXiv:2602.03411, low-confidence) adds a concrete, cheap idea: mask environment-feedback tokens out of the SFT loss — train on the agent’s own actions/reasoning, not on memorizing verbose tool stdout.
- RL: execution-derived reward beats a learned/similarity proxy when both are available — SWE-Gym’s ground-truth path over SWE-RL’s difflib patch-similarity fallback (arXiv:2502.18449). DeepSWE (RL-only, no SFT, from Qwen3-32B) shows DAPO-style stabilizers (Clip-High, no-KL, compact filtering of failed/timeout trajectories) let RL-only work when the base model already has strong agentic priors. Progressive context/turn-budget curriculum — start RL at a shorter horizon than the full ceiling, extend once performance plateaus — is independently confirmed by two groups (arXiv:2508.03501, and KLong, arXiv:2602.17547, low-confidence but converging).
- Long-horizon credit assignment specifically: GiGPO (arXiv:2505.10978, NeurIPS 2025) adds a step-level grouping on top of GRPO — group actions taken from repeated “anchor states” across different rollouts, giving fine-grained credit without an extra critic. This is the most mature fix in a fast-moving 2026 cluster (BEACON arXiv:2605.06078, HiPER arXiv:2602.16165, Ecpo arXiv:2606.05885 — all low-confidence individually, but converging on “flat trajectory-only advantage is the open problem”).
- Test-time (no training): parallel sampling + a verifier to pick the best candidate is a large, cheap, repeatedly-replicated multiplier (Claude “high compute” 63.7%→70.3%; R2E-Gym 34.4%→51%; DeepSWE 42.2%→59%) — orthogonal to whatever training was done.
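The anchor-state idea from the credit-assignment bullet is easier to judge with numbers in hand. The sketch below is a simplified illustration in the spirit of step-level grouping, not the GiGPO paper's exact estimator: how states are keyed, the discount `gamma`, and the mixing weight `step_weight` are all assumptions, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Group-relative normalization: (v - mean) / (std + eps)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def anchor_state_advantages(rollouts, rewards, gamma=0.95, step_weight=1.0):
    """rollouts: one list of hashable state keys per rollout (one key per step).
    rewards: one terminal reward per rollout (e.g. the flag check)."""
    # Trajectory-level term: identical for every step of a rollout.
    episode_adv = normalize(rewards)
    # Discounted return seen from each step, derived from the terminal reward only.
    returns = [
        [r * gamma ** (len(states) - 1 - t) for t in range(len(states))]
        for states, r in zip(rollouts, rewards)
    ]
    # Group steps that started from the same state across different rollouts.
    groups = defaultdict(list)
    for i, states in enumerate(rollouts):
        for t, state in enumerate(states):
            groups[state].append((i, t))
    step_adv = [[0.0] * len(states) for states in rollouts]
    for members in groups.values():
        if len(members) < 2:
            continue  # a state visited once has no group to compare against
        normed = normalize([returns[i][t] for i, t in members])
        for (i, t), value in zip(members, normed):
            step_adv[i][t] = value
    return [
        [episode_adv[i] + step_weight * a for a in step_adv[i]]
        for i in range(len(rollouts))
    ]


# Two rollouts share the state "s0"; only the first one reaches the flag.
adv = anchor_state_advantages([["s0", "s1", "s2"], ["s0", "s3"]], [1.0, 0.0])
```

In the toy example every step of the successful rollout gets the same trajectory-level credit, and only the step taken from the shared state `s0` gets an additional step-level term, because that is the only place two rollouts can be compared like for like.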
What I’d change in our pipeline: (1) audit our sectools tool surface against the SWE-agent lesson — concise, structured observations, a “you already tried this” signal, before assuming RL will fix noisy feedback; (2) mask tool/environment-output tokens from our SFT loss if we don’t already; (3) if moving to full-trajectory RLVR, do not expect vanilla GRPO with only a terminal flag-reward to assign credit well across ~100 turns — prototype GiGPO’s anchor-state grouping first; (4) check whether we do reject-and-rescore across pass@k, or only report pass@1/pass@5 independently — the hybrid-TTS multiplier is free money already inside our methodology.
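Point (2) is cheap to implement. A minimal sketch of masking tool and environment output from the SFT loss follows, assuming trajectories are already tokenized into per-role segments and that the trainer shifts labels internally (as Hugging Face causal-LM heads do). The segment layout and the name `build_labels` are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(segments):
    """segments: list of (role, token_ids) in trajectory order.

    Returns (input_ids, labels) where only assistant tokens carry a loss;
    user, system, and tool/environment tokens stay in the context but are masked.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Hypothetical tokenized trajectory: prompt, action, tool stdout, next action.
segments = [
    ("user", [1, 2]),
    ("assistant", [3, 4]),
    ("tool", [5, 6, 7]),
    ("assistant", [8]),
]
input_ids, labels = build_labels(segments)
```

The tool output still conditions the next action, since it remains in `input_ids`; the model just is not rewarded for predicting it.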
3.2 Particular-language — the knowledge-injection contrast (read this one for the diagnosis framing)
The reason this domain is second, not last: it is the cleanest existing literature answering exactly the question our diagnosis framework poses — is a failure a knowledge gap or an execution gap?
For both a specific programming language and a specific natural language, the field’s converged answer is stage-specific:
- Pretraining / continued-pretraining injects KNOWLEDGE — corpus composition (how many languages, how much of each) is the lever, not RL, not even SFT. StarCoder (arXiv:2305.06161), StarCoder2 (arXiv:2402.19173), DeepSeek-Coder (arXiv:2401.14196) for code; Sailor (arXiv:2404.03608), SEA-LION (arXiv:2504.05747), LLaMA Beyond English (arXiv:2401.01055) for natural language. Tokenizer/vocabulary coverage sits underneath this stage as an architectural precondition (arXiv:2406.11477) — poor coverage looks like “doesn’t know the language” even when data exists, because every token is spent on fragmented sub-word pieces.
- SFT/instruction-tuning teaches USE, cheaply, once the base model already has passive knowledge — BLOOM+1’s finding (arXiv:2212.09535) is the sharpest data point: for an already-instruction-tuned model, simply including a new language in the multitask instruction-tuning mixture beat continued pretraining — the cheapest lever to try first, once a base exists.
- RL (execution/compiler-verified) fixes BEHAVIOR, never knowledge — CodeRL (arXiv:2207.01780), PPOCoder (arXiv:2301.13816), RLTF (arXiv:2307.04349), StepCoder (arXiv:2402.01391), RLEF (arXiv:2410.02089, Meta FAIR, ICML 2025 spotlight). None of these teach new language knowledge — every one explicitly frames the problem as get-the-execution-loop-right on a domain the base model’s pretraining already covers. RLEF in particular is a near-literal preview of our structural frame: multi-turn POMDP, policy emits → executes against public tests → feedback appended → repair → repeat → reward on held-out private tests, solved with a turn-level (not token-level) value function.
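The episode shape described for RLEF in the last bullet maps almost one-to-one onto a CTF rollout, so here it is as code. This sketch covers only the rollout structure (emit, run public tests, append feedback, repair, score on private tests); the turn-level value function and the policy update are omitted, and every callable is a hypothetical stub.

```python
def repair_episode(policy, run_public_tests, run_private_tests, task, max_turns=3):
    """One multi-turn episode with execution feedback.

    policy(history) -> str                  next code attempt given the dialogue so far
    run_public_tests(code) -> (bool, str)   pass flag plus feedback text shown to the policy
    run_private_tests(code) -> bool         held-out check, used only for the reward
    """
    history = [task]
    code = ""
    for _ in range(max_turns):
        code = policy(history)
        history.append(code)
        passed, feedback = run_public_tests(code)
        if passed:
            break
        history.append(feedback)  # feedback becomes part of the next observation
    # Terminal reward comes from tests the policy never saw during the episode.
    reward = 1.0 if run_private_tests(code) else 0.0
    return reward, history


# Deterministic stubs: the first attempt fails, the repair succeeds.
_attempts = iter(["bad", "good"])
reward, history = repair_episode(
    policy=lambda h: next(_attempts),
    run_public_tests=lambda code: (code == "good", "AssertionError"),
    run_private_tests=lambda code: code == "good",
    task="task",
)
```

Swapping "public tests" for tool output and "private tests" for the flag check gives the CTF version of the same loop.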
The quote-worthy contrast: BLOOM+1’s “put it in the SFT mixture” (an SFT-stage move) vs. StepCoder/RLEF’s “RL never adds knowledge, only sharpens execution of knowledge already latent in the base model.” No amount of GRPO/RLVR on our harness will inject knowledge of a CVE or technique class the base model never saw much of in pretraining — see §4.
3.3 Theorem proving — the cleanest verifier, the strongest calibration check
Formal theorem proving (Lean/Coq) has the single cleanest sparse-reward substrate of any domain: the proof kernel checks every intermediate step, not just the final answer, which makes it cleaner even than our flag verifier. DeepSeek-Prover-V2 (arXiv:2504.21801) decomposes a hard theorem into a DAG of subgoals, generates cold-start SFT data by solving each subgoal independently with a smaller model, then runs RL on top with a binary kernel-verified reward. AlphaGeometry (Nature 2024, DOI:10.1038/s41586-023-06747-5) goes further: synthetic self-play data generation that manufactures its own training problems, not just solutions to given ones, escaping the human-demonstration data-scarcity floor entirely. AlphaProof (Nature 2025, DOI:10.1038/s41586-025-09833-y) applies AlphaZero-style self-play/search on top.
The honest limit, stated precisely: the transferable lever is not “densify reward the way theorem
proving does” in general — it is specifically: wherever a CTF sub-milestone can be reduced to a
deterministic, server-side, ground-truth check (a reverse-shell callback actually received, a specific
privileged file actually read — not an LLM-judge’s opinion), score it exactly like a verified Lean subgoal.
Everywhere else — a curl command’s stdout, a subprocess’s raw output — there is no general “CTF kernel”
that can certify a step was valid, and this domain’s recipe does not license densifying reward there. What
does transfer safely: subgoal decomposition and synthetic self-play, used to generate additional
training data or curriculum, leaving the terminal reward untouched.
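One way to keep that line honest in code is to accept milestone evidence only from infrastructure we control. A minimal sketch, assuming a hypothetical `Event` record and hypothetical milestone names; nothing here reflects the project's actual event schema:

```python
from dataclasses import dataclass
from typing import Iterable, Set

# Hypothetical milestone kinds; each must be checkable by infrastructure we control.
VERIFIABLE_KINDS = frozenset({"reverse_shell_callback", "privileged_file_read"})


@dataclass(frozen=True)
class Event:
    kind: str
    source: str  # "server" = logged by our listener/target; "agent" = parsed from agent output


def verified_milestones(events: Iterable[Event]) -> Set[str]:
    """Keep only milestones that a deterministic server-side check observed."""
    return {e.kind for e in events
            if e.source == "server" and e.kind in VERIFIABLE_KINDS}


def terminal_reward(flag_verified: bool) -> float:
    """The RL reward stays terminal-only; milestones never feed into it here."""
    return 1.0 if flag_verified else 0.0


events = [
    Event("reverse_shell_callback", "server"),   # listener actually received a connection
    Event("privileged_file_read", "agent"),      # the agent merely claims it read the file
    Event("curl_stdout_looks_right", "server"),  # no ground-truth check exists for this
]
milestones = verified_milestones(events)
print(milestones)
```

Verified milestones then label trajectories for curriculum or SFT construction; `terminal_reward` is deliberately unaware of them.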
3.4 The KNOWLEDGE vs EXECUTION contrast — tying it to our diagnosis framework
This is the load-bearing synthesis point of the whole chapter, and it is exactly the split Diagnosing the gap already formalizes (competence vs. performance, Firestone PMC7604508; formal vs. functional competence, Mahowald et al. arXiv:2301.06627) — the particular-language literature is independent, cross-domain confirmation of the same split, from a different field entirely:
| Failure mode observed | Root cause | Fixing stage | Cross-domain evidence |
|---|---|---|---|
| Model has never seen the relevant tokens/technique at all | Missing from pretraining corpus | Pretraining / continued-pretraining — data mixture, upsampling | StarCoder/DeepSeek-Coder (PL); Sailor/SEA-LION (NL) |
| Model “sort of” knows it but fragments/mishandles it | Tokenizer/vocabulary coverage gap | Vocabulary expansion + CPT (sub-pretraining level, not SFT/RL) | Yamaguchi et al. arXiv:2406.11477 |
| Model knows it passively but won’t reliably use it on command | Instruction-tuning data doesn’t cover it | SFT / instruction-tuning mixture — cheapest lever, no CPT needed | BLOOM+1 arXiv:2212.09535; Aya arXiv:2402.07827 |
| Model knows the technique but fumbles execution over many turns / fails tests | Behavior/execution gap, not knowledge | RL with an execution/compiler/flag verifier | CodeRL, StepCoder, RLEF — this is our project’s diagnosed gap |
What I’d change in our pipeline: before spending a training-run budget on GRPO/RLVR to fix a specific recurring failure, run it through this table first. If trace review shows the agent never produces the right technique/CVE reference at any sample count — that’s the top two rows, a pretraining/SFT-data problem, and no amount of RL will fix it (consistent with handbook rule 5: “knowledge in tools, not weights or prompt” — this generalizes: RL doesn’t need to hold facts if tools can supply them at inference time). If the technique does appear somewhere across k samples but pass@1 doesn’t convert it — that’s the bottom row, our diagnosed execution gap, and RLVR is the right lever. This is not a new framework; it’s the particular-language literature independently re-deriving the same split our diagnosis chapter already uses, from code/NLP rather than cyber — worth citing back as corroboration.
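The routing rule above fits in a few lines. A minimal sketch, assuming each sampled rollout has already been labeled in trace review with two booleans (right technique seen, flag verified); `Lever`, `route_failure`, and the `reliable_rate` threshold are hypothetical and not part of the project's tooling:

```python
from enum import Enum
from typing import Sequence, Tuple


class Lever(Enum):
    DATA = "knowledge gap: pretraining/SFT data or a tool that supplies the fact"
    RLVR = "execution gap: RL with a ground-truth verifier"
    NONE = "already reliable: no training lever needed"


def route_failure(samples: Sequence[Tuple[bool, bool]],
                  reliable_rate: float = 0.8) -> Lever:
    """samples: one (technique_seen, flag_verified) pair per independent rollout."""
    if not samples:
        raise ValueError("need at least one sampled rollout")
    solve_rate = sum(1 for _, solved in samples if solved) / len(samples)
    if solve_rate >= reliable_rate:
        return Lever.NONE
    if not any(seen for seen, _ in samples):
        return Lever.DATA   # never surfaces at any sample count: RL has nothing to sharpen
    return Lever.RLVR       # appears somewhere across k samples but does not convert


never_seen = route_failure([(False, False)] * 8)
fumbled = route_failure([(True, False), (False, False), (True, True)])
print(never_seen.name, fumbled.name)
```

The ordering of the two checks is the point: technique presence is tested across all k samples before anything is attributed to execution.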
3.5 Deep-research / web agents — the strongest-transfer row
The tightest non-cyber structural analogue: many sequential tool calls into a real, noisy, adversarial live environment (not a toy simulator), reward assessable only once the task is actually done. WebGPT (arXiv:2112.09332) is the direct ancestor of “SFT cold-start, then reward-guided optimization” — our own planned shape — but its reward is a learned human-preference model, flagged by its own authors as struggling out-of-distribution; cite it as the origin of the recipe shape, not as license to use a learned reward. WebRL (arXiv:2411.02337, ICLR 2025) is the standout: a self-evolving curriculum that generates new training tasks directly from the model’s own unsuccessful attempts — Llama-3.1-8B goes 4.8%→42.4% success on WebArena-Lite from this alone. R1-Searcher (arXiv:2503.05592) shows a genuinely safe way to densify reward for tool-use: a sequential, not summed, two-stage reward — stage 1 rewards only correct tool-invocation format, stage 2 switches fully to outcome reward. Because the stages are sequential rather than concurrently-summed, the policy can’t “farm” stage-1’s format reward once stage 2 has begun — it avoids the reward-hacking trap a concurrent per-call bonus would carry. DeepResearcher (arXiv:2504.03160, EMNLP 2025) independently argues, as its central claim, that training end-to-end in the real environment (not a simulated proxy) is “a fundamental requirement” — direct validation of our own live-sandbox harness design.
What I’d change in our pipeline: (1) WebRL’s failure-to-curriculum mechanism is the single best
non-cyber precedent for our own currently-unsolved long tail (most of the ~1000-challenge portfolio outside
the 30-60% GRPO band) — an automatic, quantitatively-strong demonstration that a failure corpus converts
into new, appropriately-calibrated training tasks rather than sitting inert; (2) R1-Searcher’s sequential
staged reward is a concrete, safe fix if trace review shows tool-avoidance (agent bypassing the sectools
surface in favor of raw shell) — a stage-1 format-only reward for correct tool invocation, switched off once
stage 2 begins.
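The difference between a sequential and a summed reward is small in code and large in incentive. A minimal sketch of the schedule as this section describes it, not R1-Searcher's implementation; `stage1_steps` and the `bonus` value are hypothetical:

```python
def staged_reward(step: int, stage1_steps: int,
                  format_ok: bool, flag_verified: bool) -> float:
    """Sequential schedule: format-only reward first, then outcome-only reward."""
    if step < stage1_steps:
        return 1.0 if format_ok else 0.0   # stage 1: correct tool-invocation format only
    return 1.0 if flag_verified else 0.0   # stage 2: ground-truth outcome only


def summed_reward(format_ok: bool, flag_verified: bool, bonus: float = 0.2) -> float:
    """The concurrent variant the text warns against: the format bonus is paid forever."""
    return (bonus if format_ok else 0.0) + (1.0 if flag_verified else 0.0)


# A rollout that calls tools correctly but never captures the flag:
early = staged_reward(step=10, stage1_steps=100, format_ok=True, flag_verified=False)
late = staged_reward(step=500, stage1_steps=100, format_ok=True, flag_verified=False)
farmed = summed_reward(format_ok=True, flag_verified=False)
print(early, late, farmed)
```

Under the staged schedule the same flagless rollout stops earning anything once stage 2 begins, while the summed variant keeps paying it a nonzero reward indefinitely.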
3.6 Hard-exploration games — the honest calibration check, plus one genuinely new lever
Montezuma’s Revenge and Pitfall are the domain where “sparse binary terminal reward with near-zero prior success” was most literally the entire problem. Go-Explore (arXiv:1901.10995, Nature 2021) solved both by maintaining an archive of previously-visited states, always returning to a promising already-discovered frontier state cheaply before exploring further from it, rather than re-discovering the same early states from scratch every episode — then robustifying the best archived trajectories into a closed-loop policy. AlphaStar (Nature, DOI:10.1038/s41586-019-1724-z) solves a different axis — adversarial strategy collapse — with a league: train against a diverse, continually-adapting population of checkpoints and hand-designed “exploiter” agents, not just the latest self; it also validates imitation-bootstrap-before-RL at frontier scale (pure self-play RL from scratch over such a long horizon was intractably slow to bootstrap). OpenAI Five (arXiv:1912.06680) shows scale + dense reward shaping substitutes for exotic exploration algorithms — but its dense proxy-reward choice is exactly the reward-hacking risk our handbook’s ground-truth-only stance guards against; treat it as a caution, not a recipe. NetHack (arXiv:2006.13760) remains genuinely unsolved — the honest calibration point that {long-horizon + heavy exploration + procedural generalization + sparse terminal reward} all at once is not a solved combination anywhere in the literature, cyber included.
What I’d change in our pipeline (speculative transfer — engineering-unproven in this domain): Go-Explore’s literal mechanism (deterministic sim-state checkpoint/restore) doesn’t map to a live target box that isn’t cheaply resettable — but the principle does: explicitly archive distinct states of partial progress within a CTF episode (new service found, new privilege level reached, new file discovered) as first-class checkpoints, and bias new rollout attempts toward continuing exploration from an archived frontier state rather than always cold-starting from turn 0. This changes where exploration compute is spent, not the reward function — it carries none of the reward-hacking risk a shaped reward would. Building the state-abstraction function for free-text shell/HTTP output is real, non-trivial work; flag accordingly.
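The archive idea can be sketched without committing to how states are abstracted. The code below is a minimal illustration under loud assumptions: the `Cell` abstraction (services, privilege level, files) is hypothetical, "returning" to a cell means replaying a stored action prefix (which may not reproduce on a nondeterministic target), and the count-based selection weight is only loosely in the spirit of Go-Explore rather than its exact rule.

```python
import random
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Hypothetical state abstraction: (open services, privilege level, files discovered).
Cell = Tuple[FrozenSet[str], str, FrozenSet[str]]


def make_cell(services: Iterable[str], privilege: str, files: Iterable[str]) -> Cell:
    return (frozenset(services), privilege, frozenset(files))


@dataclass
class Frontier:
    prefix: List[str]  # shortest known action sequence that reached this cell
    visits: int = 0    # how many rollouts have already continued from here


class Archive:
    def __init__(self) -> None:
        self.cells: Dict[Cell, Frontier] = {}

    def add(self, cell: Cell, prefix: List[str]) -> bool:
        """Record a cell; return True only if it is newly discovered."""
        known = self.cells.get(cell)
        if known is None:
            self.cells[cell] = Frontier(list(prefix))
            return True
        if len(prefix) < len(known.prefix):
            known.prefix = list(prefix)  # keep the cheaper route to the same progress
        return False

    def select(self, rng: random.Random) -> Frontier:
        """Sample a frontier to continue from, favoring rarely-visited cells."""
        frontiers = list(self.cells.values())
        if not frontiers:
            raise ValueError("archive is empty")
        weights = [1.0 / (1 + f.visits) ** 0.5 for f in frontiers]
        chosen = rng.choices(frontiers, weights=weights, k=1)[0]
        chosen.visits += 1
        return chosen


archive = Archive()
archive.add(make_cell(["ssh", "http"], "none", []), ["nmap -sV target"])
archive.add(make_cell(["ssh", "http"], "www-data", ["/etc/passwd"]),
            ["nmap -sV target", "exploit upload form"])
start = archive.select(random.Random(0))
print(len(archive.cells), start.prefix)
```

Nothing here touches the reward: the archive only decides where the next rollout begins. The hard part the text flags, mapping free-text shell/HTTP output to a `Cell`, is exactly what `make_cell` assumes away.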
3.7 Robotics — speculative mechanism, but a concrete data-curation descendant
Hindsight Experience Replay (arXiv:1707.01495, NeurIPS 2017) is the seminal sparse-reward paper: relabel a failed trajectory, post-hoc, as if the state it actually reached had been the intended goal all along — converting 100% sparse failure into 100% dense, correctly-labeled success. Honest limit: this is an off-policy, goal-conditioned, replay-buffer mechanism (DDPG/DQN-family) — it does not literally port to an on-policy GRPO/RLVR loop, and a CTF episode has one terminal flag, not a continuum of interchangeable goals a failed run could be relabeled as having achieved instead. Speculative transfer, principle only: mine the corpus of failed (flag=0) trajectories for reusable sub-skill demonstrations — a run that reached a foothold but failed to escalate privileges is a valid positive demonstration of “how to reach a foothold,” even though the overall episode is a failure. Segment the failed-trajectory corpus by furthest-pipeline-stage-reached and fold the phase-appropriate positive prefixes into the SFT corpus for that sub-skill — a pure data-curation move that never touches the RL reward function, carrying none of the reward-hacking risk an online auxiliary reward would.
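The data-curation move described above can be sketched as a pure function over the failed-trajectory corpus. This is a minimal illustration, assuming each turn has already been annotated with the stage it verifiably completed; `Turn`, `STAGES`, `positive_prefix`, and `bucket_failed_runs` are hypothetical names, and the stage labels are assumed to come from ground-truth checks rather than the agent's own claims.

```python
from typing import Dict, List, NamedTuple, Optional, Sequence, Tuple

# Hypothetical stage order for a CTF episode: recon -> foothold -> priv-esc -> flag.
STAGES = ["recon", "foothold", "priv_esc", "flag"]


class Turn(NamedTuple):
    action: str
    stage_verified: Optional[str] = None  # stage confirmed complete at this turn, if any


def positive_prefix(trajectory: Sequence[Turn]) -> Tuple[Optional[str], List[Turn]]:
    """Return the furthest verified stage and the turns up to where it was first reached."""
    best_rank, best_index = -1, -1
    for index, turn in enumerate(trajectory):
        if turn.stage_verified in STAGES:
            rank = STAGES.index(turn.stage_verified)
            if rank > best_rank:
                best_rank, best_index = rank, index
    if best_index < 0:
        return None, []
    return STAGES[best_rank], list(trajectory[:best_index + 1])


def bucket_failed_runs(failed: Sequence[Sequence[Turn]]) -> Dict[str, List[List[Turn]]]:
    """Group positive prefixes of flag=0 runs by the furthest stage they reached."""
    buckets: Dict[str, List[List[Turn]]] = {}
    for trajectory in failed:
        stage, prefix = positive_prefix(trajectory)
        if stage is not None:
            buckets.setdefault(stage, []).append(prefix)
    return buckets


run = [Turn("nmap -sV target", "recon"), Turn("upload webshell", "foothold"),
       Turn("sudo -l"), Turn("give up")]
buckets = bucket_failed_runs([run, [Turn("ls")]])
print({stage: len(prefixes) for stage, prefixes in buckets.items()})
```

The flailing turns after the foothold are dropped, so the SFT corpus for "how to reach a foothold" receives only the prefix that actually got there; runs with no verified stage contribute nothing.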
4. The KNOWLEDGE vs EXECUTION contrast — the load-bearing takeaway
Section 3.4 above is the single most important cross-domain finding in this chapter, restated plainly: the particular-language literature is the cleanest existing evidence, from a domain with no cyber baggage, that pretraining/SFT injects knowledge and RL only sharpens execution of knowledge the base model already has. This directly operationalizes Diagnosing the gap’s competence-vs-performance split with a second, independent field’s worth of citations. Practical consequence for this project: run any suspected failure mode through the four-row table in §3.4 before committing GRPO/RLVR compute to it — a “doesn’t know the technique” failure needs a data lever (more SFT coverage, a tool that supplies the fact at inference time), not a training-loop lever; a “knows it, fumbles execution over ~100 turns” failure is where RLVR belongs, and is what every SWE/RLEF/GrandCode result above was built to fix.
5. Ranked shortlist of transferable levers
- RLEF’s turn-level value function over a multi-turn POMDP (arXiv:2410.02089) — the single most literal structural preview of our problem (public-test feedback → repair → repeat → private-test terminal reward) solved by Meta at an order of magnitude fewer samples than scaffolded prompting. High confidence, high relevance — read before finalizing a GRPO variant.
- GiGPO’s step-level anchor-state advantage (arXiv:2505.10978) and GrandCode’s Agentic-GRPO (arXiv:2604.02721) — two independent, purpose-built GRPO variants for exactly our credit-assignment shape (long trajectory, terminal-only reward, off-policy drift). High confidence on GiGPO (NeurIPS-accepted); promising, unverified on GrandCode (single team, very recent) — converging evidence “stage/turn is the unit of credit” is the field’s consensus fix.
- WebRL’s self-evolving curriculum from failures (arXiv:2411.02337) — the strongest available precedent for converting our own unsolved-challenge tail into new training data automatically. High confidence, needs CTF-specific task-generation design.
- Ground-truth execution reward over any learned/similarity proxy, now 4x cross-validated — SWE-Gym > SWE-RL, the theorem-proving kernel, WebGPT’s own flagged weakness, and DeepResearcher’s live-environment requirement all land on the same rule our handbook already locks. Established, high confidence — reconfirmation, not a new finding, but reassurance the rule is domain-general.
- Subgoal/curriculum decomposition for cold-start DATA generation, never for the reward itself (DeepSeek-Prover-V2 arXiv:2504.21801, AlphaGeometry) — safe wherever a genuine sub-milestone is deterministically verifiable; see One problem, or many? for the full decompose-eval-vs-decompose-reward argument this reconfirms. Established, high confidence.
- Progressive context/turn-budget curriculum for RL (arXiv:2508.03501, KLong arXiv:2602.17547) — start short, extend once performance plateaus; directly portable to the existing GRPO 30–60% baseline-band rule. Medium confidence (2 independent sources).
- R1-Searcher’s sequential (not summed) staged reward for tool-use bootstrap (arXiv:2503.05592) — a safe fix if tool-avoidance is diagnosed in trace review. Established mechanism, cyber-domain application untested.
- Go-Explore’s archive-and-return exploration scheduling (arXiv:1901.10995) — genuinely the most novel, least-already-covered addition here; changes where exploration compute is spent, zero reward-hacking risk. Strong transfer of the idea; engineering-unproven in this domain — speculative transfer.
- Hybrid test-time scaling (execution + execution-free verifiers, complementary blind spots) — a free multiplier on any trained policy, replicated across ≥4 independent groups (R2E-Gym, DeepSWE, SWE-Master, Claude “high compute”). High confidence, immediately actionable on our existing pass@k methodology.
- Mining flag=0 trajectories for sub-skill SFT data (HER’s spirit, arXiv:1707.01495) — segment failed runs by furthest-stage-reached, fold positive prefixes into SFT. Speculative-to-established idea, translated from a different algorithm family — low risk, purely data-curation.
- NetHack’s honest “still unsolved” status (arXiv:2006.13760) — not a lever, a calibration check: no algorithm anywhere has cracked {long-horizon + heavy exploration + procedural generalization + sparse terminal reward} simultaneously. Set expectations accordingly.
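Of the levers in the list above, the progressive turn-budget curriculum is the most mechanical to sketch. A minimal illustration of "start short, extend once performance plateaus", assuming success rates are logged once per evaluation round; the function name and every parameter (`window`, `min_gain`, `factor`, `max_budget`) are hypothetical choices, not values from the cited papers:

```python
from statistics import mean
from typing import Sequence


def next_turn_budget(success_rates: Sequence[float], budget: int, max_budget: int = 100,
                     window: int = 3, min_gain: float = 0.02, factor: int = 2) -> int:
    """Extend the turn budget once the recent success rate stops improving."""
    if len(success_rates) < 2 * window:
        return budget  # not enough evaluations to judge a plateau yet
    recent = mean(success_rates[-window:])
    prior = mean(success_rates[-2 * window:-window])
    if recent - prior < min_gain:
        return min(max_budget, budget * factor)  # plateau: allow longer episodes
    return budget  # still improving at the current horizon


print(next_turn_budget([0.1, 0.1, 0.1, 0.3, 0.3, 0.3], budget=25))  # still improving
print(next_turn_budget([0.3] * 6, budget=25))                       # plateaued
```

A plateau near zero success would also trigger an extension under this rule; whether that is desirable, or should route back through the diagnosis table first, is a design decision the sketch leaves open.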
Cross-links
- The path to a frontier cybersecurity model — the capstone this chapter’s cross-domain evidence feeds; “frontier” requires the same {strong agentic base, ungameable verifier at scale, infra matched to the RL bottleneck} triad this chapter finds repeating across code/math/web.
- Diagnosing the gap — the competence/performance split §3.4 independently reconfirms from the particular-language literature; run failures through both before committing RL compute.
- RL that creates value — long-horizon · exploration · reasoning · novelty — the algorithm-level detail (GiGPO, DAPO stabilizers, entropy preservation) this chapter’s domain evidence is evaluated against.
- One problem, or many? — monolithic vs decomposed — the decompose-eval-vs-decompose-reward distinction that DeepSeek-Prover-V2’s and StepCoder’s subgoal work in §3.1/§3.3 reconfirms from two more domains.
The path to a frontier cybersecurity model
Every other chapter in this book resolves a method question for this project’s current bottleneck (SFT vs GRPO, monolithic vs decomposed reward, which exploration fix). The chapter “The recipe is a sequence, not a pick” makes the same point this chapter’s §2 skeleton depends on, one level down: the answer is never a single technique choice but a fixed, ordered sequence of stages that compound. This chapter zooms out to the north star those decisions serve: not “improve pass@k on our 1000-challenge portfolio” as an end in itself, but a frontier cybersecurity model, the offensive-security analog of what DeepSeek-Coder, Qwen-Coder, and DeepSeekMath are to code and math. It asks the harder question every other chapter brackets: even if The decision’s routing tree is answered correctly and roadmap-inputs.md’s forks are all resolved well, does that alone produce a frontier model? The honest answer, argued below, is no: it produces a materially better agent on this project’s own portfolio, which is necessary but not sufficient. This chapter is the capstone that says what else the word “frontier” actually requires, stage by stage, and where this project genuinely already stands on that ladder.
1. The frame — what “frontier” means here, and why academic-security is the wrong template
“Frontier” is not a vibe or a marketing word here — across every domain-specialization lineage examined below (code, math, medical), it tracked the same three things simultaneously, never just one: (1) a strong general, already-agentic starting checkpoint, not a small model fine-tuned harder; (2) a domain RL stage with an ungameable, automatically-computable verifier at scale; (3) infrastructure investment matched to the RL stage’s actual bottleneck — which turns out to be environment count and diversity, not bigger GPUs for pretraining. A model that is merely “instruction-tuned on curated domain data” (Med-PaLM v1’s prompt-tuning-only recipe, arXiv:2212.13138) is a domain assistant, not a frontier domain model — the field’s own vocabulary distinguishes these, and this project should too.
This is why the standing project stance treats academic cybersecurity-LLM papers (CTF-Dojo, Cyber-Zero, Pentest-R1, HackSynth, AutoPenBench, DRLRM-PT, and siblings) as mention-only, never load-bearing: none of them has produced a model that clears the bar above. They are useful as landscape context, occasionally as a source of a technique name worth knowing, but citing them as the basis for a claim about what a frontier cyber model requires would be citing evidence that has never once been tested against the thing it claims to predict. Every load-bearing claim in this chapter instead rests on (a) frontier-lab flagship and domain-specialization disclosures — code, math, medical, and general-agentic recipes, 2023–2026; (b) general RL/ML theory (scaling laws, potential-based reward shaping, reward-hacking mechanics); (c) this project’s own measured data and confirmed lessons. Where a genuinely production, externally-verified cyber-specialized system exists outside the academic set (XBOW, Bugcrowd) it is admissible under clause (a) as a domain-specialization precedent, exactly like Code Llama or Med-PaLM — and is flagged as such, never folded in with the excluded academic set.
2. The transferable frontier domain-specialization recipe
Cross-referencing the code lineages (DeepSeek-Coder → DeepSeek-Coder-V2, Qwen2.5-Coder → Qwen3-Coder-Next), the math lineage (DeepSeekMath → Qwen2.5-Math → DeepSeek-Prover-V2), and the medical contrast vertical (Med-PaLM → Med-PaLM 2 → MedGemma), one skeleton recurs: its emphasis varies, but it is never fully absent. A fifth stage, mid-training, is a distinct and more recently named bridge stage: the code/math lineages ran it informally, but only OLMo 2 and a 2025 controlled study give it a name and a mechanism (arXiv:2501.00656, arXiv:2510.14865).
flowchart TD
P["Pretrain\n(inherited — a strong general/\nagentic open-weight checkpoint,\nnot trained here)"] --> S0
subgraph S0["Stage 0 — Domain continued pretraining (CPT)"]
direction TB
S0A["100B-5.5T in-domain tokens,\nfrom a STRONG checkpoint, never from scratch\nCode Llama ~500B code tokens\nDeepSeekMath 120B math tokens\nQwen2.5-Coder 5.5T code tokens"]
end
S0 --> S05
subgraph S05["Stage 0.5 — Mid-training (the named bridge)"]
direction TB
S05A["5-10% of pretrain FLOPs, curriculum-shaped,\nupsampled high-quality + synthetic patches\n'infuse knowledge, patch deficiencies'\n(OLMo 2); reduces catastrophic forgetting\nbefore SFT (2510.14865)"]
end
S05 --> S1
subgraph S1["Stage 1 — Domain SFT / data synthesis,\nincreasingly SELF-BOOTSTRAPPED"]
direction TB
S1A["Rejection-sampling + iterative co-evolution:\nQwen3-Coder used Qwen2.5-Coder to clean its\nown next-gen data; Qwen2.5-Math co-evolved\nRM+SFT across rounds; DeepSeek-Prover-V2\nstitched subgoal-decomposed traces"]
end
S1 --> S2
subgraph S2["Stage 2 — Domain RL, verifier-gated\n'hard to solve, easy to verify'"]
direction TB
S2A["GRPO / RLVR, no critic, group-mean baseline\n(origin: DeepSeekMath); scaling axis that\nmattered most = PARALLEL RL ENVIRONMENTS\n(Qwen3-Coder: 20,000), not model size"]
end
S2 -.->|"cross-cutting, every stage"| S4["Data-pipeline + scale engineering\nas a first-class investment\n(Qwen2.5-Coder: curation > scale;\nStarCoder2/Stack-v2: quality substitutes\nfor parameter count)"]
classDef stage fill:#132b22,stroke:#34d399,color:#eafaf3;
classDef cross fill:#3a2e14,stroke:#f5b942,color:#fff6e0;
class P,S0A,S05A,S1A,S2A stage;
class S4 cross;
Reading the skeleton stage by stage, cited:
- Stage 0 — domain CPT. Every recent frontier vertical lineage continues pretraining from an existing strong checkpoint, never truly from scratch (DeepSeek-Coder v1’s from-scratch 2T-token run, arXiv:2401.14196, is the sole exception in this set, and even DeepSeek abandoned it by v2, arXiv:2406.11931). Cross-domain transfer is itself a load-bearing finding, not noise: DeepSeekMath deliberately starts from a code-pretrained base model (arXiv:2402.03300) for a math specialist, because precise multi-step symbolic reasoning transfers. That argues a cyber CPT stage, if built, should start from a model already strong at general coding/tool-use/agentic reasoning, not a generic chat model.
- Stage 0.5 — mid-training. OLMo 2 names this explicitly as “Stage 2: Mid-training (5–10% of training FLOPs)… upsample the highest-quality web documents and curated non-web sources; employ synthetic data crafted to patch math capabilities” (arXiv:2501.00656). A 2025 controlled study formalizes the mechanism: mid-training outperforms continued-pretraining-alone at a matched specialized-token budget and mitigates catastrophic forgetting in the subsequent SFT stage, because it acts as a better initialization for post-training rather than just adding knowledge (arXiv:2510.14865, moderate-high confidence — recent, not yet heavily cited, but consistent with and explaining the OLMo/Llama-3/DBRX practitioner reports it’s built on).
- Stage 1 — self-bootstrapped SFT/data synthesis. The frontier pattern has moved past “filter and train once”: Qwen3-Coder used the prior generation (Qwen2.5-Coder) to clean and rewrite its own next-generation pretraining data (blog disclosure, no standalone arXiv for the 480B flagship — the architecture is covered by arXiv:2505.09388; the agentic-RL successor, Qwen3-Coder-Next, is arXiv:2603.00729). Qwen2.5-Math (arXiv:2409.12122) co-evolves a reward model and SFT data across rounds before RL is even applied, then reuses the same RM at inference for best-of-N reranking. DeepSeek-Prover-V2 (arXiv:2504.21801) decomposes a hard problem into subgoals, solves each with a cheaper model, and stitches the resolved subgoals into a single cold-start trajectory — a direct precedent for treating a ~100-turn CTF episode’s implicit stages (recon → foothold → priv-esc → flag) as subgoal-decomposable SFT-construction material, even while the RL reward itself stays terminal-only for ungameability. For the general (non-cyber) chat/tool-use/reasoning/preference data that fills this same SFT rung before any cyber-specific data is layered on, see Proven post-training datasets — a usage-cited registry.
- Stage 2 — verifier-gated RL. This is where GRPO was born: DeepSeekMath’s own framing attributes math capability to two factors — a web-data mining pipeline, and Group Relative Policy Optimization, a critic-free PPO variant using the sampled group’s mean reward as the baseline (arXiv:2402.03300). Qwen3-Coder’s post-training explicitly names the reward-design principle “hard to solve, easy to verify,” and its own headline scaling axis wasn’t a bigger model, it was 20,000 parallel RL environments for the long-horizon agentic RL stage. DeepSeek-Prover-V2 runs the same pattern with Lean’s type-checker as a binary, ungameable reward — structurally identical in spirit to a terminal flag verifier.
- Cross-cutting — data-pipeline engineering as its own investment. Qwen2.5-Coder’s whole story is “meticulous data cleaning, scalable synthetic data generation, balanced data mixing” beating larger models on the same benchmarks purely on data quality/composition. StarCoder2/The Stack v2 (arXiv:2402.19173, included as a data-pipeline lesson only — flagged explicitly as not a frontier-capability reference point) independently confirms curation-quality substituting for parameter count from a second source.
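The Stage 2 bullet's "critic-free, group-mean baseline" is compact enough to show directly. A minimal sketch of the advantage computation only, not a training loop; `group_advantages` is a hypothetical name, and dividing by the group standard deviation is included as an optional, commonly used normalization and is an assumption here:

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], normalize: bool = True,
                     eps: float = 1e-8) -> List[float]:
    """Critic-free advantage: each rollout's reward minus its group's mean reward."""
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    scale = pstdev(rewards)  # group standard deviation; eps avoids division by zero
    return [c / (scale + eps) for c in centered]


mixed = group_advantages([1.0, 0.0, 0.0, 1.0, 0.0])  # 2 of 5 rollouts captured the flag
all_fail = group_advantages([0.0] * 5)               # no rollout succeeded
all_pass = group_advantages([1.0] * 5)               # every rollout succeeded
print(mixed, all_fail, all_pass)
```

With a binary terminal reward, a group that fails every rollout (or solves every rollout) yields all-zero advantages and therefore no learning signal, which is why challenge selection for this stage has to target tasks the policy solves some of the time.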
The medical contrast (mentioned, not a basis for cyber claims): Med-PaLM v1’s prompt-tuning-only recipe (arXiv:2212.13138) shows the cheap-adaptation-of-a-frozen-giant path is not sufficient on its own — the paper’s own human-eval gap (factuality, harm) motivated Med-PaLM 2 (arXiv:2305.09617, domain instruction fine-tuning + ensemble refinement) and MedGemma (arXiv:2507.05201, domain vision-language pretraining + task-specific fine-tuning, explicitly disclosed as not clinical-grade without further fine-tuning). The useful lesson by contrast: for a binary, adversarial correctness domain like offensive security (a wrong action doesn’t mislead a reader, it fails the exploit), a v1-style prompt-tuned ceiling is lower than in code/math — supporting this project’s existing bias toward continued adaptation + RLVR over prompting alone.
3. The frontier ingredients, as requirements
Restating the seven-ingredient survey as a checklist of what “frontier” actually costs, independent of any one domain:
| # | Ingredient | What frontier scale actually looks like | Confidence |
|---|---|---|---|
| 1 | Compute + scale law | Loss falls as a power law in model size × data × compute; the ratio matters (Chinchilla-optimal ≈ equal scaling of params and tokens, not param-dominant) — Kaplan, arXiv:2001.08361; Hoffmann/Chinchilla, arXiv:2203.15556 | High — foundational, independently reproduced |
| 2 | Data scale + quality + curation | 5–15T+ curated tokens is the pretraining norm (DeepSeek-V3, arXiv:2412.19437; Llama 3, arXiv:2407.21783) — or the phi-1 extreme: curation quality can substitute for ~100x less scale within a narrow domain (arXiv:2306.11644), though that ratio is an upper bound, not a universal rule | High for the scale rows; moderate on how far the “quality substitutes for scale” ratio generalizes |
| 3 | Domain CPT before post-training | 120B–5.5T in-domain tokens continued-pretrained into an existing strong base, before any instruction-tuning/RL — not a thin adapter on the base chat model (Qwen2.5-Coder 5.5T+, DeepSeekMath 120B, Med-PaLM 2’s domain finetuning) | High — three independent labs, primary technical reports |
| 4 | Mid-training as a named bridge stage | A shorter (5–10% FLOPs), curriculum-shaped stage between broad pretraining and narrow post-training that patches domain deficiencies cheaply and reduces forgetting — OLMo 2, arXiv:2501.00656; arXiv:2510.14865 | High (OLMo 2); moderate-high (mechanism study, new) |
| 5 | RL-environment scale, diversity, verifiability | The reasoning/agentic jump to o1/R1-class models is attributed to large-scale RL with verifiable, not learned, rewards, not more pretraining — DeepSeek-R1, arXiv:2501.12948; OpenAI o1 system card (arXiv mirror 2412.16720); environment diversity/scale is its own axis, distinct from reward correctness — Kimi K2’s “tens of thousands” synthesized-tool pipeline (arXiv:2507.20534); framed as an emerging bottleneck by arXiv:2511.09586 | High (R1, o1, Kimi K2); moderate on the survey’s “emerging bottleneck” framing specifically (new, low-citation) |
| 6 | Full pipeline vs. thin adapter | LoRA measurably underperforms full fine-tuning specifically on code/math domain-skill acquisition — full fine-tuning learns perturbations at 10–100x the effective rank of typical LoRA configs — arXiv:2405.09673 | High — controlled, ablated, >250 citations in 18 months, directly on-domain |
| 7 | Eval reflects reality, not a saturated benchmark | Classic benchmarks are contaminated enough to inflate scores by up to 22.9%/19.0% (GSM8K/MMLU) — arXiv:2406.13990; frontier practice responds with contamination-resistant-by-construction benchmarks: time-segmented LiveCodeBench, arXiv:2403.07974, “Google-proof” GPQA, arXiv:2311.12022; Tülu 3, arXiv:2411.15124 treats decontamination as a first-class deliverable, and is also the primary public naming of RLVR | High — all four primary, independently corroborating |
Net read of §2+§3 together: compute/scale (ingredient 1) is inherited from the base model’s own pretraining — not this project’s job to re-derive. Ingredients 3–4 (CPT, mid-training) are the stages this project’s plan currently skips by design (handbook rule “knowledge in tools, not weights”), a defensible bet but an unvalidated one at the CPT/mid-training layer specifically. Ingredient 5 (RL-environment scale/diversity/verifiability) is where this project is closest to the frontier pattern already — see §5. Ingredients 6–7 (full pipeline vs. adapter, eval integrity) are concrete, checkable knobs, not open research questions.
4. Gap analysis — the frontier recipe vs. this project, stage by stage
One-line verdict: this project has correctly identified the shape of the frontier recipe — harness as a live RL environment, ground-truth verifiable reward, rejection-sampling SFT → GRPO/RLVR ordering, “knowledge in tools not weights” — and has already built the two hardest structural pieces: a working ~100-turn agentic harness and a genuine, non-gameable terminal verifier. What it has not built is scale on every other axis, and it has skipped a pretraining-adjacent stage entirely (domain CPT / mid-training) that every cited frontier domain-specialization precedent inserts. The gap is 1–5 orders of magnitude on breadth, not a missing insight on direction.
| Frontier-recipe stage | Have | Partial | Missing |
|---|---|---|---|
| Stage 0 — Domain CPT | Nothing — explicit design choice (“knowledge in tools, not weights”), not oversight | The “knowledge in tools” architectural bet is coherent but unvalidated at this layer — the 87.7%-tool-bypass finding is at least as consistent with “raw vocabulary exposure is thin” as with “SFT/RL hasn’t reinforced the surface yet” | A CPT stage on curated offensive-security text (tool docs, CVE writeups, exploit-dev reasoning) — every cited frontier precedent (Code Llama ~500B code tokens, DeepSeekMath 120B math tokens) runs this before SFT/RL |
| Stage 0.5 — Mid-training | Nothing | — | The single biggest concrete gap relative to its likely payoff — cheapest missing stage of the whole skeleton (§2), and this project currently goes straight from a general base into RL with no bridge stage at all |
| Stage 1 — SFT / rejection-sampling | Fully-designed quality-filter recipe (replay-reproduce, loss-masking, dedup, decontamination); one SFT already shipped and flag-verified (pass@1 26%→35%, pass@5 4/10→5/10); ~1,200+ trajectories across 5 corpora, ~185 verified-success candidates; a load-bearing negative result confirming the reward-must-be-ground-truth rule on this project’s own data (35% flag fabrication rate under a loose acceptance filter) | The corrected filter (replay-reproduce, byte-exact flag verify) is designed but not confirmed re-run since the confabulation finding; ground truth exists for only 2 of 7 corpora | ~2–3 orders of magnitude below frontier scale — Kimi K2’s SFT draws on 3,000+ real + 20,000+ synthesized MCP tools; no systematic teacher-distillation at scale; no difficulty-curriculum construction (candidate pool too small) |
| Stage 2 — RLVR | flagscan.go — a genuine rung-1, deterministic, ungameable terminal verifier, exactly the reward shape RLVR requires; RL-candidate-selection methodology matching GRPO’s actual zero-gradient mechanics (1–4-of-5 band); a theory-correct potential-based stage-shaping proposal (Ng-Harada-Russell) that doesn’t touch the ground-truth backbone | No GRPO/RLVR training loop implemented anywhere in the codebase (zero training binaries); retrieved heuristic not yet upgraded to byte-exact for 5 of 7 corpora; stage-shaping designed, zero code; the one OpenAI-RFT fallback option is winding down | Qwen3.7-Max’s decoupled Task/Harness/Verifier infra — the first-party-named fix for exactly this project’s own measured 87.7%-tool-bypass scaffold-overfitting finding; no entropy-collapse countermeasure (nothing to attach one to yet); no partial-rollout/pause-resume infra for long-tail ~100-turn episodes |
| Stage 3 — Scaled agentic RL environments | A genuinely working RL-environment shell (100-turn multi-turn loop, sandboxed, fully OTel-traced); 10 canonical, hardened, contamination-free, single-solution PD26 challenges live in production, genuine vuln-class breadth; several hundred additional informal challenges (warpenv/envgen/gym); the RL-envs-as-moat thesis, now corroborated by frontier evidence, not just three original data points | Total known population is a few hundred distinct targets — an order of magnitude below the north star’s own “~1000” framing; the single Hetzner eval box is sized for sequential/moderate-parallel eval, never for concurrent GRPO rollout load | Harness/verifier diversity as its own trained axis (one tool surface, one verifier per challenge — categorical, not a scale gap); procedural/generative environment scaling at frontier order-of-magnitude (envgen’s 84 challenges is ~22x smaller than DeepSeek-V3.2’s 1,827-environment pipeline); training-time rollout compute at K8s scale (categorical — no training-scale infra exists, only eval-scale) |
| Stage 4 — Eval integrity | A locked, rigorous pass@k methodology (unbiased estimator, k=3/5/10 bands, independent cold starts); the terminal flag_verified contract (exact match, never a proxy); two modest QA datasets beyond flag-capture; a fully-designed, near-zero-risk eval-decomposition plan | Base-model pass@64–128 control not confirmed run against the current SFT checkpoint; no per-challenge stage_oracle.json manifest yet | No confirmed contamination/canary audit applied to the cyber CTF corpus specifically; no externally-comparable published benchmark result — no way for an outside reader to place this project’s solve rate against any external reference point |
What this means concretely: this project is not missing an idea anywhere in the recipe — every stage has a designed, theoretically-grounded, often partially-built answer, and two pieces (the harness-as-environment, the ground-truth verifier) are genuinely frontier-quality today. What’s missing everywhere except eval methodology and reward design is scale: more environments, more verified trajectories, more training-time compute — plus one categorical, non-scale gap (harness/verifier diversity) that is the single frontier-lab-named fix for a failure mode this project has already measured on its own data.
5. RL-environments + data-synthesis as the moat, at frontier scale
Frontier evidence on this axis is now broader than the project’s original three-source hypothesis (Anthropic’s reported spend, Bugcrowd’s market, this project’s own harness) — nine of ten labs surveyed for the frontier-recipes chapter independently corroborate it, and the general-agentic and cyber-specific evidence below sharpens the picture further.
- Kimi K2 discloses “a large-scale agentic data synthesis pipeline and a joint reinforcement learning stage, where the model improves its capabilities through interactions with real and synthetic environments” (arXiv:2507.20534) — the cleanest public confirmation that a frontier lab’s post-training lever is environment synthesis + mixing real with synthetic, not base-model scale alone. Frontier_cyber_takeaway: PD26-01..10 is this project’s “real” side; anything procedurally generated (parameterized bug-class variants, mutated target configs) is the “synthetic” side — a program training only on the 10 canonical challenges is closer to “real-only, no synthesis,” which K2’s own design implies under-scales.
- Kimi K2.5’s Agent Swarm (arXiv:2602.02276) — a self-directed parallel-agent orchestration framework decomposing tasks into concurrent heterogeneous sub-problems, 4.5x latency reduction — architecturally identical to what XBOW independently converged on for cybersecurity (below): narrow-scope parallel sub-agents, not one longer monolithic trajectory. Two unrelated programs landing on the same architecture is a signal worth taking seriously against this project’s current single-agent ~100-turn episode design.
- OpenAI Deep Research — official disclosure that it “was trained on real-world tasks requiring browser and Python tool use, using the same reinforcement learning methods behind OpenAI o1,” and on-record team commentary that “end-to-end training beats manual orchestration” — a fixed recon→scan→exploit→flag graph “breaks” when the agent needs to adapt; letting the model learn strategy via RL over hard tasks outperforms hand-scripted phase logic. This directly reinforces the project’s own “light framing beats heavy scaffolding” rule, from the team that shipped the highest-profile agentic-RL product to date.
- Anthropic — a reported (secondhand, via TechCrunch citing The Information; direction corroborated by surrounding market activity, dollar figure not on-record) >$1B RL-environment commitment, plus a live, current, on-record disclosure that cyber-offensive capability is deliberately tier-gated across the Claude line rather than propagated uniformly with general capability. Frontier_cyber_takeaway: cyber capability is not a free byproduct of general-agentic scaling even at a frontier lab — it has to be deliberately trained in, which is exactly this project’s actual bet (open-weight base + this project’s own cyber RL environments).
- Google DeepMind SIMA / SIMA 2 (arXiv:2404.10179, arXiv:2512.04797) — SIMA 2’s headline: “by leveraging Gemini to generate tasks and provide rewards, SIMA 2 can autonomously learn new skills from scratch in a new environment.” The clearest frontier precedent for self-generated challenges: applied to cybersecurity, a frontier-scale program would use a strong model to propose novel vulnerable-target variants and score exploit attempts, with the project’s existing terminal flag verifier as the ungameable ground-truth check that keeps a self-generated curriculum honest.
- Market corroboration, general-purpose: Prime Intellect’s Environments Hub — 1,000+ unique environments from 250+ creators, 100,000+ downloads. Cyber-specific instance: Bugcrowd’s RL Environments (built on Mayhem Security tech) — “hundreds of thousands of training environments, each built from authentic open-source vulnerabilities with real source code and verifiable outcomes,” with Chief AI/Science Officer David Brumley’s own framing: “Most AI security training stops too early. Models learn to find bugs, but not to prove the bugs are real and exploitable… detection through exploitation, patching, and audit.” That last phrase is also a concrete, safely-implementable curriculum idea — a graded multi-stage reward, but only if built potential-based (F(s,a,s') = γΦ(s') − Φ(s), the only reward-shaping form with a policy-invariance guarantee, per Ng/Harada/Russell, ICML 1999). A flat per-stage bonus is exactly the kind of ad hoc shaping the theorem warns produces gameable, farm-the-partial-credit policies.
- XBOW — admissible as a production, externally-verified domain-specialization precedent (top-ranked on HackerOne against human researchers, real CVEs, real payouts), not the excluded academic set. Its own disclosed curriculum climbed four rungs in order: canned CTF (PortSwigger/PentesterLab, “artificial exercises”) → a custom-built realistic benchmark → white-box zero-day discovery in real open-source projects → black-box production dogfooding on HackerOne, where the real-world bug-bounty triage process itself is the verifier. This project currently sits at roughly rung 2 (PD26-01..10, custom-built, more realistic than generic CTF). XBOW’s own architecture note — “thousands of short-lived agents, each with a narrow objective, orchestrated by a persistent coordinator and validated by deterministic logic… if one agent runs into a dead end on step 4 of a 20-step attack, it doesn’t tank the whole operation” — independently confirms (alongside Kimi K2.5’s Agent Swarm) that decomposition into parallel narrow agents, not a longer monolithic single-agent trajectory, is where frontier-grade cyber-agent architecture is heading.
This project’s own numbers, side by side with the frontier evidence: ~1,200+ trajectories, solving ~100–200 of ~1,000 attempted challenges — but the distinct-challenge corpus underneath that is on the order of 10 canonical live challenges plus a 15-challenge locked dataset slice. Bugcrowd alone is “hundreds of thousands” of distinct verifiable cyber environments; Prime Intellect’s general hub is 1,000+; Kimi K2 leans on 3,000+ real + 20,000+ synthesized tools feeding trajectory generation. The gap is 2–5 orders of magnitude on environment volume and diversity — not on algorithm. Every frontier disclosure above agrees GRPO/RLVR-family joint RL with tool-use is now well-understood and largely commoditized; the highest-leverage next investment is a synthesis pipeline that turns the existing 10–15 hand-built challenges into hundreds-to-thousands of verifiably-distinct variants (parameterized bug-class mutations, stack/target permutations, difficulty-graded variants of the same vuln class) — mirroring Kimi K2’s real+synthetic split and XBOW’s own rung-2→3 transition — rather than continuing to hand-author challenges one-off at the current cadence.
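As a rough illustration of the "parameterized variants" idea above, the sketch below enumerates variant specs for one seed challenge and gives each a content-derived ID so duplicates are easy to detect. Everything in it is an assumption made for illustration: the function name `enumerate_variants`, the spec fields, and the example bug, stack, and difficulty values (the actual vulnerability class of PD26-02 is not stated here). A spec is only a candidate; it would still need its own working solve and a ground-truth flag check before it counts as a verifiably-distinct environment.

```python
import hashlib
import itertools
import json


def enumerate_variants(seed_id, bug_params, stacks, difficulties):
    """Yield one candidate spec per (bug parameter, stack, difficulty) combination."""
    for bug, stack, level in itertools.product(bug_params, stacks, difficulties):
        spec = {"seed": seed_id, "bug": bug, "stack": stack, "difficulty": level}
        # Hash of the canonical JSON form: a stable ID that also works as a dedup key.
        digest = hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()
        yield {**spec, "variant_id": f"{seed_id}-{digest[:12]}"}


# Hypothetical parameter values, not taken from the real challenge.
variants = list(
    enumerate_variants(
        "PD26-02",
        bug_params=["sqli-union", "sqli-blind", "sqli-error"],
        stacks=["flask+sqlite", "express+postgres"],
        difficulties=[1, 2, 3],
    )
)
unique_ids = {v["variant_id"] for v in variants}
```

Three small parameter lists already multiply one hand-built challenge into 18 candidate specs, which is the arithmetic behind "hundreds-to-thousands" from 10 to 15 seeds. The expensive part is not enumeration but confirming that each candidate is solvable and verifiable.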
6. How this ladders — from “improve solve rate” to “frontier model”
Diagnosing the gap, From behavioral audit to training signal, One problem, or many?, and Where you are & the forks ahead are all, correctly, scoped to this project’s own ~1,000-challenge portfolio — they answer “what training signal fixes F1 vs F2 vs F3 vs F4” and “monolithic vs milestone-shaped reward,” which are the right near-term engineering questions. This chapter’s honest addition: answering those questions well moves solve rate on the existing portfolio; it does not, by itself, cross into “frontier.” The ladder has three rungs, and the roadmap-inputs.md decision brief only climbs the first:
- Rung 1 — execute reliably on the portfolio you have. This is the diagnosis framework’s whole job: segment single-shot vs. sequentially-gated, route by the F1–F4 funnel, pick SFT-vs-GRPO ordering correctly per segment. Get this right and you’ve closed the execution gap this project diagnosed — necessary, and per §4 above, this project’s Stage 1/Stage 2 work already targets exactly this.
- Rung 2 — scale the environment/data axis by orders of magnitude. Per §4/§5, this is where the actual distance to “frontier” lives: 10 canonical + ~250 informal challenges vs. Bugcrowd’s hundreds of thousands; ~185 verified-solve candidates vs. Kimi K2’s tens-of-thousands-of-tools synthesis pipeline. No amount of correctly-routed SFT-vs-GRPO decision-making on Rung 1’s existing portfolio substitutes for this — every frontier precedent in this chapter treats environment/data scale as the dominant axis, not a nice-to-have.
- Rung 3 — close the CPT/mid-training gap, if evals reveal it’s real. Per §4’s Stage 0/0.5 rows, this project’s “knowledge in tools, not weights” bet is a legitimate scoping choice only if the deficit evals surface is confirmed execution-only. If a knowledge deficit shows up instead (not just an execution one), neither of the two disclosed frontier tools for fixing it — domain CPT, mid-training — is currently in this project’s plan at all. This is the one rung that’s genuinely contingent, not scheduled.
The decomposition-vs-monolithic.md verdict (stay monolithic on reward shape until the funnel says otherwise, build only the theorem-backed potential-based version if you do) and the roadmap-inputs.md forks (instrument first, segment before committing SFT capacity, route exploration-vs-execution emphasis by the funnel) are all Rung-1-scoped decisions, made correctly — none of them need to change in light of this chapter. What this chapter adds is the honest framing that Rung 1 is a prerequisite, not the destination: a project that nails every fork in roadmap-inputs.md and still has 10 canonical challenges and no CPT/mid-training stage has built an excellent instance of “the shape of the frontier recipe” at a scale that isn’t frontier yet. The concrete forward move this chapter argues for, independent of and parallel to the Rung-1 work already underway, is the Rung-2 synthesis-pipeline investment named in §5 — because unlike Rung 3 (contingent on an eval result not yet in hand), Rung 2’s gap is already confirmed, already the largest, and already has multiple frontier precedents (Kimi K2, SIMA 2, XBOW rung 2→3, Bugcrowd) showing the shape of the fix.
Cross-links
- The recipe is a sequence, not a pick — the stage-order-and-compounding argument this chapter’s §2 domain-specialization skeleton is a specific instance of.
- Proven post-training datasets — a usage-cited registry — the concrete, proven-by-usage dataset shopping list for the general-capability SFT/preference/reasoning rungs §2’s Stage 1 describes abstractly.
- Where you are & the forks ahead — the Rung-1 decision surface this chapter sits on top of; its forks (a)–(e) and its DAG are unaffected by anything in this chapter, they’re prerequisites to it, not alternatives.
- One problem, or many? — decomposition vs. monolithic — the reward-shape verdict this chapter’s §5 potential-based-shaping discussion (Bugcrowd’s detection→exploitation→patching→audit staging idea) must stay consistent with: shape only via a theorem-backed potential function, never a flat per-stage bonus.
- Diagnosing the gap — a scientific framework — the F1–F4 funnel and pass@k / Pass@(k,T) instrumentation that determines whether Rung 3 (CPT/mid-training) is contingent-but-unnecessary or the load-bearing risk this chapter flags it as.
- What the frontier labs actually do (2026) — the ten-lab SFT/RL method survey this chapter’s Stage 1/Stage 2 skeleton is consistent with; that chapter is method-focused, this one is recipe-and-scale-focused — read them as companions, not duplicates.
- RL that creates value — long-horizon, exploration, reasoning, novelty — the technique menu for Rung 1’s execution work; orthogonal to this chapter’s Rung 2/3 scale argument.
Where you are & the forks ahead
This is the capstone chapter, not a roadmap. Every other chapter in this book resolves a method question (SFT vs DPO vs GRPO, monolithic vs decomposed, which exploration fix). This chapter assembles those resolutions into the shape you actually need to draw your own plan: what’s true today, what you have to decide, in what order the decisions unlock each other, and what would have to be false to make you change course. It recommends nothing you haven’t already read elsewhere in this book — it routes you back to Diagnosing the gap, From behavioral audit to training signal, One problem, or many?, Before you train, RL that creates value, and The decision at every load-bearing point. Read this after those, not instead of them.
Where this fork-by-fork plan sits in the bigger picture: everything below is Rung-1-scoped — it resolves execute reliably on the portfolio you have. It is a prerequisite for, not a substitute for, The path to a frontier cybersecurity model, which argues that even a perfectly-resolved DAG below doesn’t by itself cross into “frontier” — that takes orders-of-magnitude more RL-environment scale (Rung 2) and possibly a CPT/mid-training stage (Rung 3). The cross-domain evidence grounding that argument — six other long-horizon/sparse-reward/verifiable domains and what actually cracked each one — lives in Cybersecurity is one of a family — what cracked the others; several forks below (especially (c) and (e)) draw directly on techniques surveyed there (GiGPO, DAPO, WebRL’s failure-to-curriculum, potential-based shaping). Fork (b)’s SFT-vs-measure-first framing and its order-matters/compounding rationale are the Sequence-B-specific instance of the general argument in The recipe is a sequence, not a pick; the general-capability SFT/preference data fork (b)/(d) would eventually pull from is catalogued in Proven post-training datasets — a usage-cited registry.
1. Where you are — the diagnosis on one screen
The number: ~100–200 solved / ~1000 challenges, at k=1. This is a portfolio statistic, not a per-challenge pass rate — a challenge that “solved once” could be a 5% fluke or a 55% near-certainty, and those two cases call for opposite next moves (The decision, “one prerequisite before any of this”). Nobody has yet run pass@k per challenge, let alone per pipeline stage.
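To make the "5% fluke vs. 55% near-certainty" distinction concrete, here is a minimal sketch of the standard unbiased pass@k estimator, 1 − C(n−c, k) / C(n, k), computed from n independent attempts with c successes. The function name and the example counts are illustrative assumptions, not project data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n independent attempts, c of which succeeded."""
    if k > n:
        raise ValueError("k cannot exceed the number of attempts n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Two hypothetical challenges that both count as "solved" in the portfolio number.
fluke = {k: pass_at_k(20, 1, k) for k in (1, 5, 10)}      # 1 success in 20 attempts
reliable = {k: pass_at_k(20, 11, k) for k in (1, 5, 10)}  # 11 successes in 20 attempts
```

Both challenges contribute the same "1 solved" to the aggregate, yet their per-challenge curves differ at every k, which is why the portfolio statistic cannot route the next move on its own.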
The pipeline is a chain, not a single action:
flowchart LR
R["Recon /\nenumeration"] --> E["Endpoint\ndiscovery"]
E --> V["Identify the\nvulnerable endpoint"]
V --> X["Exploit it"]
X --> P["Post-exploitation /\npivot"]
P --> F(("Flag\n{0,1}\nground-truth verified"))
classDef stage fill:#132b22,stroke:#34d399,color:#eafaf3;
class R,E,V,X,P stage;
Only the last box (flag_verified) is checked today, and even that is presently a provenance proxy
(flag_scan.retrieved — “the string came back from the sandbox, not the model’s mouth”) rather than a true
byte-compare against benchmark/flags/pd26_flags.current.json — that exact-match wiring doesn’t exist yet
outside a manual, SSH-gated step (instrumentation-and-data-readiness.md §3.1). So even the ground-truth
anchor this whole book leans on is one small, well-scoped engineering task away from being fully automatic,
not there yet.
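The difference between the provenance proxy and a true byte-compare can be sketched in a few lines. This is an illustration under stated assumptions, not the project's flagscan.go logic: the function name, its arguments, and the idea that the expected flag is looked up from a mapping keyed by challenge ID are all hypothetical, since the schema of the flags file is not given here.

```python
import hmac


def flag_verified(submitted: str, expected: str, tool_outputs: list[str]) -> bool:
    """True only if the flag matches byte-for-byte AND came back from the sandbox."""
    # Exact match against ground truth (a constant-time compare is a harmless default).
    exact = hmac.compare_digest(submitted.encode("utf-8"), expected.encode("utf-8"))
    # Provenance: the same string must appear in at least one tool result.
    retrieved = any(submitted in out for out in tool_outputs)
    return exact and retrieved


# Hypothetical ground truth and sandbox output.
expected_flags = {"PD26-02": "FLAG{3f9a1c}"}
outputs = ["$ cat /root/flag.txt\nFLAG{3f9a1c}\n"]

genuine = flag_verified("FLAG{3f9a1c}", expected_flags["PD26-02"], outputs)
fabricated = flag_verified("FLAG{deadbeef}", expected_flags["PD26-02"], outputs)
```

A format-only matcher would accept both submissions; the exact compare rejects the fabricated one. Keeping the provenance check alongside it preserves the existing "came from the sandbox, not the model" guarantee.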
The stage-localized failure taxonomy (F1–F4), mapped to canonical RL/agent-research framing:
| Tag | Failure | Canonical framing | Fix lever if confirmed dominant |
|---|---|---|---|
| F1 | Never finds the vulnerable endpoint | Exploration / coverage failure — no gradient until reward is first observed | On-policy RL with exploration preservation, not more demonstrations |
| F2 | Finds it, probes shallowly, can’t land the exploit | Execution / skill (performance-floor) failure | Trajectory curation, rejection-sampling SFT, DAPO/GiGPO as GRPO baseline |
| F3 | Clumsy tool use, wrong tool for the job | Policy / tool-selection failure — its own axis | Elicitation ladder → ToolRL/Tool-Star if elicitation fails |
| F4 | No real pivot/chaining after a foothold | Long-horizon credit-assignment failure (variance, not coverage) | Step-level credit (GiGPO), curriculum (1-hop before 2-hop), never “try more” alone |
The diagnosis, stated as a hypothesis, not a fact: the project’s working read is “likely an execution gap” (F2/F3-flavored) rather than a knowledge gap or a pure exploration gap — capability is probably present and unreliable, not absent. This is the single most load-bearing framing decision in the whole plan, and it is currently unproven. The book’s own diagnosis framework is explicit about this: “the honest, defensible answer will not be a single sentence” — the true picture is almost certainly a split verdict, different F-tags dominating different challenge subtypes, not one gap type for the whole 1000 (Diagnosing the gap §0, §8).
What “proven” requires, concretely, and doesn’t exist yet:
- Per-challenge (not aggregate) pass@k, segmented by whether the winning path is single-shot or sequentially-gated (compositional — enumeration must land before exploitation is even visible).
- Pass@(k,T) — base model vs. current checkpoint, per segment — to tell a genuine execution gap (trained pulls away from base at large k) apart from a pure elicitation artifact (base catches up) apart from an exploration gap hiding underneath (matched-data SFT regresses the segment, RL expands it) (arXiv:2504.13837, arXiv:2604.14877).
- A working F1–F4 stage tagger over the existing Phoenix/events.jsonl corpus — the actual current gap, confirmed by direct source read: the harness’s process telemetry is already complete for this (tool_call/tool_result pairs joined on tool_call_id); nothing needs to change in go/libs/agent/events or secagent/runner.go. What’s missing is purely semantic — a deterministic post-hoc scan plus one stage_oracle.json per challenge, authored from that challenge’s own solve.py (Before you train §5).
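The post-hoc scan described in the last bullet can be sketched as follows. This is written in Python for illustration only (the text proposes a Go scanner), and the event field names (`type`, `input`, `output`), the oracle format (stage name mapped to a regex), and the stage-to-tag mapping are all assumptions. F3 is deliberately left out because tool-selection failure is its own axis and cannot be read off stage progress alone.

```python
import json
import re


def join_tool_events(lines):
    """Pair each tool_call with its tool_result via tool_call_id."""
    calls, pairs = {}, []
    for line in lines:
        event = json.loads(line)
        if event["type"] == "tool_call":
            calls[event["tool_call_id"]] = event
        elif event["type"] == "tool_result" and event["tool_call_id"] in calls:
            pairs.append((calls[event["tool_call_id"]], event))
    return pairs


def stages_reached(pairs, oracle):
    """Return the set of stages whose predicate matched any call/result pair."""
    reached = set()
    for call, result in pairs:
        text = call.get("input", "") + "\n" + result.get("output", "")
        reached.update(stage for stage, pattern in oracle.items() if re.search(pattern, text))
    return reached


def failure_tag(reached, flag_ok):
    """Simplified mapping from furthest verified stage to an F-tag."""
    if flag_ok:
        return "solved"
    if "identify" not in reached:
        return "F1"  # never found the vulnerable endpoint
    if "exploit" not in reached:
        return "F2"  # found it, could not land the exploit
    return "F4"      # foothold gained, no successful chaining to the flag


# Hypothetical oracle and a two-event trace.
oracle = {"recon": r"nmap|gobuster", "identify": r"/api/export", "exploit": r"uid=\d+"}
trace = [
    '{"type": "tool_call", "tool_call_id": "a", "input": "gobuster dir -u http://target"}',
    '{"type": "tool_result", "tool_call_id": "a", "output": "/api/export (Status: 200)"}',
]
tag = failure_tag(stages_reached(join_tool_events(trace), oracle), flag_ok=False)
```

The scan is read-only and deterministic: the same trace and oracle always produce the same tag, which is the property that keeps this diagnostic separate from any reward signal.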
Bottom line for this section: treat “it’s an execution gap” as the leading hypothesis, not settled ground. Everything in §2–§4 below is written so that it stays true whichever way the funnel eventually comes down — several forks are explicitly gated on a measurement that hasn’t been taken yet.
2. The forks
Five decisions, each with exactly two options (per this book’s convention — no hybrids, no third path). Every “what must be TRUE” column is a gate, not a preference — the fork should not be decided until its gate is checked, because in at least two of these forks (b, e) the two options are not just different costs, they require the opposite SFT/RL ordering. One explicit, flagged exception: fork (b)’s verdict below lands on a hybrid rather than either labeled option outright — called out and justified there, not silently smuggled in past this convention.
(a) Instrument the F1–F4 stage tagger first?
| | Option 1 — build it first | Option 2 — skip it, decide off aggregate pass@k/flag_verified alone |
|---|---|---|
| What it is | stagescan.go (same shape as the existing flagscan.go) + one stage_oracle.json per challenge, authored from solve.py; prototype on PD26-02 first, then generalize | Proceed straight to the SFT/GRPO plan using only the terminal flag signal and portfolio-level solve rate |
| Cost (compute + eng) | Near-free — read-only post-hoc scan over events.jsonl you already have; no new sandbox instrumentation; editorial authoring of ~4 predicates/challenge (one person reads ~10 solve.py files) | Zero now — but every downstream decision (b–e) is made blind to which F-tag actually dominates |
| Risk (reward-hacking) | None — this is diagnostic-only, no reward function touched | Routing risk, not hacking risk: you may sink a training cycle into the wrong lever (e.g. rejection-sampling SFT when the real bottleneck is F1/exploration, or milestone shaping when it’s actually F2/F4) |
| Information value | Highest single move in this whole chapter. “This single aggregation is the input every downstream decision below depends on” — verbatim from Before you train §4 | Low — an aggregate number “averages over four structurally different failure modes,” exactly the collapsing-a-split-verdict anti-pattern the diagnosis framework names first |
| What must be TRUE first | Nothing — this is the recommended first step regardless of any other measurement | N/A |
Verdict pressure: there is no real argument for Option 2. This fork is here because it’s the fork everyone is tempted to skip under time pressure, not because the evidence is close.
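Why the aggregate number hides the split verdict can be shown with a small funnel aggregation. This is a minimal sketch with made-up rollouts; the segment names and the (segment, tag) input shape are assumptions for illustration.

```python
from collections import Counter, defaultdict


def funnel_by_segment(rollouts):
    """rollouts: iterable of (segment, tag) pairs -> {segment: {tag: share}}."""
    counts = defaultdict(Counter)
    for segment, tag in rollouts:
        counts[segment][tag] += 1
    return {
        segment: {tag: n / sum(tags.values()) for tag, n in tags.items()}
        for segment, tags in counts.items()
    }


# Hypothetical tagged rollouts: execution failures in one segment, exploration in the other.
rollouts = (
    [("single_shot", "F2")] * 3
    + [("single_shot", "solved")]
    + [("sequentially_gated", "F1")] * 3
    + [("sequentially_gated", "F4")]
)
funnel = funnel_by_segment(rollouts)
overall = Counter(tag for _, tag in rollouts)
```

In this toy data the portfolio-level count shows F1 and F2 tied, which suggests no clear lever. The per-segment view shows each segment dominated by a different tag, and those two tags call for opposite SFT/RL orderings.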
(b) Rejection-sampling SFT on verified solves NOW vs. measure pass@k-per-stage FIRST
| | Option 1 — SFT now | Option 2 — segment + measure first |
|---|---|---|
| What it is | Run rejection-sampling SFT on the ~185 already-collected verified-solve trajectories across the 5 corpora (gym263, gym564, warpenv-broker, envgen, argus60-base) | Segment the 1000 challenges (single-shot vs. sequentially-gated), run Pass@(k,T) — base vs. current checkpoint — per segment, before deciding what to train on |
| Cost | Cheap — data already exists, this is already the project’s stated near-term plan | Medium — sampling compute at multiple k, cold-start pdq --fresh-retries, no training required |
| Risk (reward-hacking / generalization) | Concrete, not hypothetical. On the compositional/sequentially-gated segment, matched-data SFT actually regresses capability (net −4) while RL expands it (net +4) — Zhai et al., arXiv:2604.14877. Training the wrong subset with SFT doesn’t just waste compute, it can make that subset worse. Separately: the ~185-trajectory pool may be guessing-dominated (high pass@64, low Cover@τ — arXiv:2510.08325) or contain lucky-but-unsound paths (right flag, wrong/wasted reasoning — arXiv:2506.14245) | Low — diagnostic only, but real opportunity cost if it delays shipping a known-safe move (SFT is the project’s current plan, already literature-validated as a baseline — arXiv:2504.11343) |
| Information value | Low incremental — you already believe SFT-on-solves works generically; this doesn’t test where it works | High — this is “the single highest-value experimental design” in the diagnosis chapter, and it’s directly testable this week with no new training |
| What must be TRUE before committing to Option 1 wholesale | (i) the SFT pool isn’t guessing-dominated (Cover@τ check on its source challenges); (ii) trajectories are filtered on soundness (backtracking/wasted-turns/tool-validity), not just flag==1; (iii) fork (a)’s stage tagger doesn’t show these ~185 trajectories concentrated on the sequentially-gated segment | N/A |
Verdict pressure — this is this chapter’s one deliberate exception to the “no hybrids” convention stated in §2, flagged rather than smuggled in: don’t cancel the SFT plan — but don’t treat “SFT now” as a blanket recipe across the whole portfolio either. The correct read of these two options is closer to “do (2) as a segmentation gate on (1)”: SFT the single-shot segment now, hold the sequentially-gated segment for GRPO once entropy instrumentation is live. Why this fork earns the exception where the other four don’t: options 1 and 2 here aren’t mutually exclusive courses of action — one is a training decision, the other a measurement decision, and they resolve at different grain (portfolio-wide vs. per-segment). Once segmentation lands, “measure first” naturally gates “SFT now” rather than replacing it. Forks (a), (c), (d), (e) don’t have that structure — their two options are genuinely exclusive paths, which is why no-hybrids holds cleanly for them and only for them.
(c) Monolithic GRPO vs. milestone-shaped GRPO
| | Option 1 — monolithic | Option 2 — milestone-shaped |
|---|---|---|
| What it is | Terminal flag reward only, unchanged, once GRPO/RLVR starts | Potential-based shaping F(s,a,s') = γΦ(s') − Φ(s) layered on top of (never instead of) the terminal reward, where Φ = a monotone running-max count of deterministically-verified stage completions (Ng/Harada/Russell, ICML 1999 — policy-invariant by theorem) |
| Cost | None beyond baseline GRPO infra | Medium — stage_oracle.json authoring (reuses fork (a)’s work if already done), Φ must be a running max (not instantaneous), and defined identically across every termination path (stop_reason ∈ {stop, max_turns, error}) or the invariance proof breaks |
| Risk (reward-hacking) | Risk of leaving real gains on the table if the funnel is genuinely F1-dominated — MiRA’s 6.4%→43.0% WebArena-Lite result is the strongest existence-proof in this book that flag-only reward can leave a large gap, though that’s a web-navigation result, not CTF (arXiv:2603.19685) | This is where the thick, convergent reward-hacking literature lives — PURE/Stop Summation (arXiv:2504.15275), Reward Under Attack (arXiv:2603.06621), Gao et al. (arXiv:2410.15115), PRIME’s own admission (arXiv:2502.01456), MONA (arXiv:2501.13011). Every one of these converges on: the moment a stage check becomes anything softer than deterministic ground-truth, it gets farmed. This project’s own confirmed lesson (SFT-induced FLAG{} confabulation from a loose format-matcher) is the small-scale preview of the same failure mode. But that whole cluster is about gaming a soft/gameable proxy — none of it is reward tampering. Denison, MacDiarmid, Barez, Duvenaud, Kravec, Marks et al. (Anthropic, “Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models,” arXiv:2406.10162, 130+ citations — verified live) show a curriculum of easy, low-stakes specification-gaming generalizes zero-shot to models that directly rewrite their own reward function/checklists when they have the tool access to do so — categorically more severe than metric-farming, and making the verifier deterministic and ground-truth does not by itself stop it if the policy’s own tool surface can reach what the verifier reads. This is not hypothetical for this project: the agent already has real shell access to the sandboxed target, and today’s flag_verified check (flagscan.go, §1) is file/process-based — reachable by the same tool calls the agent uses to solve the challenge. The gap this menu doesn’t yet name: isolate the verifier’s read-path from the agent’s own tool-call surface, and adversarially probe for exactly this generalization before scaling RL — “deterministic ground-truth reward” alone is the fix for gaming, not for tampering |
| Information value | N/A — this is the default, not an experiment | High if built correctly — this is the only mechanism in the whole menu that’s a theorem, not an empirical bet, provided the two subtleties are respected |
| What must be TRUE before building Option 2 | (i) fork (a)’s funnel shows an F1 (exploration)-dominated bottleneck, not F2/F3; (ii) the scale-dependence check confirms the base policy is genuinely capacity-limited rather than already-capable-but-unreliable — arXiv:2603.21972 found staged reward helps weak models only, larger models converge fine on outcome-only reward; (iii) a deterministic oracle exists for the stage being shaped — explicitly excluding stage 3 (vuln identification), which VPR’s own authors flag as the “open, unstructured” regime their method doesn’t yet solve (arXiv:2605.10325) | Default — no gate needed |
Verdict pressure: stay monolithic until the funnel says otherwise. If it does, build the narrow,
theorem-backed version — never a learned/LLM-judge per-stage reward, under any circumstance. And regardless
of which option wins: verifier-integrity hardening (flagscan.go’s read-path isolated from the agent’s own
tool surface) is not optional once RL starts, because Denison et al.’s generalization result means a
ground-truth verifier alone doesn’t rule out the agent attacking the verifier rather than the challenge.
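The two subtleties of Option 2 (Φ as a running max, and identical treatment of every termination path) are easier to see in code. The sketch below is a toy under stated assumptions: stage names are invented, and zeroing the potential at the terminal state is one common way to make the shaping sum identical across termination paths, not necessarily the definition this project would adopt. It relies on the telescoping identity Σ γ^t · F_t = γ^T · Φ(s_T) − Φ(s_0).

```python
def stage_potentials(stage_events, zero_terminal=True):
    """Phi(s_0..s_T): running count of distinct verified stages (never decreases)."""
    seen = set()
    phis = [0.0]  # Phi(s_0): nothing verified yet
    for events in stage_events:
        seen.update(events)
        phis.append(float(len(seen)))
    if zero_terminal:
        # Assumption: the terminal state gets potential 0 on every termination path.
        phis[-1] = 0.0
    return phis


def shaping_rewards(phis, gamma):
    """F_t = gamma * Phi(s_{t+1}) - Phi(s_t) for each transition."""
    return [gamma * phis[t + 1] - phis[t] for t in range(len(phis) - 1)]


def discounted_sum(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


gamma = 0.99
# Two hypothetical episodes: one reaches three stages, one stalls after recon.
episode_a = [{"recon"}, {"endpoint"}, set(), {"exploit"}]
episode_b = [set(), set(), {"recon"}]

total_a = discounted_sum(shaping_rewards(stage_potentials(episode_a), gamma), gamma)
total_b = discounted_sum(shaping_rewards(stage_potentials(episode_b), gamma), gamma)
```

With the terminal potential zeroed, both episodes collect the same total shaping return (zero here), so the shaping terms change when credit arrives within an episode but not which episode the terminal flag reward prefers. Set `zero_terminal=False` and the totals become γ^T · Φ(s_T), which differs by path; that is the kind of inconsistency the termination-path caveat warns about.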
(d) Tool-use fix for the curl-preference (SFT/DPO/KTO)
| | Option 1 — elicitation ladder first | Option 2 — jump straight to a training-time fix |
|---|---|---|
| What it is | Escalate cheapest→most-expensive: few-shot prompt with 2–3 correct-usage examples → light SFT on a handful of demonstrated-usage trajectories → only then consider RL-level intervention | Go directly to DPO/KTO on tool-choice pairs, or ToolRL-style decomposed per-call reward, without testing whether elicitation alone recovers the behavior |
| Cost | Very cheap — the few-shot test is nearly free; SFT-demo step is cheap | Medium — true DPO needs k≥2 same-challenge same-model divergent pairs (only the PD26 k=5 canonical sweep qualifies today; the larger gym pools are k=1); KTO-native data (unpaired success/failed splits) is free and ready today |
| Risk (reward-hacking / wasted engineering) | Low — but a self-reinforcing trap exists regardless of which rung you’re on: a rejection-sampling corpus built from the current curl-biased policy will never contain a dead tool succeeding, because the policy never tried it — RL alone has ~zero probability mass to reinforce those tools without forced/hinted exposure first (Tool-Star, arXiv:2505.16410) | Building ToolRL/DPO machinery for what might be a pure elicitation gap — Greenblatt et al.’s password-locked-model finding says a few high-quality SFT demonstrations are often sufficient to fully elicit a locked capability (arXiv:2405.19550); over-engineering here is real opportunity cost, not just aesthetic |
| Information value | High and cheap — turns “the model prefers curl” from anecdote into a falsifiable, staged experiment (framework.md §5) | Lower until the ladder has been run — you don’t yet know which rung actually recovers the behavior |
| What must be TRUE before escalating past few-shot | (i) few-shot prompting fails to recover tool usage on held-out challenges; (ii) SFT on a small demonstrated-usage set also fails to recover it → only then is it a genuine missing-affordance problem calling for Tool-Star-style forced exposure + ToolRL-style decomposed reward (arXiv:2504.13958) | N/A |
Verdict pressure: run the ladder. Don’t skip to DPO/ToolRL on a hunch — the cheapest rungs have direct, citable precedent for “this alone is often sufficient,” and skipping them risks building infrastructure for a gap that a two-line prompt change would have closed.
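The data-readiness asymmetry in the Cost row (DPO needs k≥2 rollouts on the same challenge from the same model, KTO works on unpaired labels) can be sketched as below. The record fields and function name are assumptions, and this version pairs purely on outcome; a real tool-choice dataset would also need the paired trajectories to diverge on the tool decision of interest.

```python
from collections import defaultdict
from itertools import product


def build_preference_data(rollouts):
    """Split rollouts into DPO (chosen, rejected) pairs and unpaired KTO examples."""
    groups = defaultdict(list)
    for r in rollouts:
        groups[(r["challenge"], r["model"])].append(r)

    dpo_pairs, kto_examples = [], []
    for group in groups.values():
        wins = [r for r in group if r["solved"]]
        losses = [r for r in group if not r["solved"]]
        # Every rollout is usable for KTO as (trajectory, desirable?).
        kto_examples += [(r["trajectory"], r["solved"]) for r in group]
        # DPO needs a success and a failure from the same challenge and model.
        dpo_pairs += [(w["trajectory"], l["trajectory"]) for w, l in product(wins, losses)]
    return dpo_pairs, kto_examples


# Hypothetical data: one k=5 sweep with mixed outcomes, one k=1 pool entry.
rollouts = [
    {"challenge": "PD26-03", "model": "m1", "trajectory": f"t{i}", "solved": i < 2}
    for i in range(5)
] + [{"challenge": "gym-17", "model": "m1", "trajectory": "g0", "solved": True}]

dpo_pairs, kto_examples = build_preference_data(rollouts)
```

The k=1 entry contributes a KTO example but no DPO pair, which is the structural reason only the k=5 sweep qualifies for true DPO today.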
(e) Exploration-emphasis vs. execution-emphasis — routed by the funnel
| | Option 1 — execution-emphasis | Option 2 — exploration-emphasis |
|---|---|---|
| What it is | Invest in trajectory curation, more/better rejection-sampling SFT data, DAPO’s clip-higher + dynamic sampling as the GRPO baseline, GiGPO step-level credit | On-policy RL first, not SFT, for the segment the funnel flags as F1/F4-dominant; DIVER/tool-sequence diversity bonus, curiosity bonus (CDE), parameter-space-noise pilot, periodic reference-policy resets (ProRL) for genuine boundary expansion |
| Cost | Lower — DAPO is an established, widely-adopted recipe; trajectory curation reuses existing data | Higher — several of these techniques are RL-infra-dependent and Promising-not-validated (PSN-RLVR, DIVER, HiPER/hindsight credit assignment for a CTF-shaped domain) |
| Risk | If the funnel is actually F1-dominant, more SFT on the same recipe teaches guessing-and-hoping more confidently, not more competence (LIMO’s framing, arXiv:2502.03387) | If the funnel is actually F2/F3-dominant, exploration machinery is solving a problem that doesn’t exist here and burns the RL-infra budget on the wrong axis — the entropy-collapse mechanism these fixes target (arXiv:2505.22617) is real but doesn’t help a policy that’s exploring fine and just executing unreliably |
| Information value | This is literally what the funnel is for. Not a taste choice — a routed decision | Same |
| What must be TRUE before routing | Funnel result from fork (a); the scale-check (arXiv:2603.21972); the Pass@(k,T) crossover-direction test on the specific segment (fork b) — does trained pull away from base at large k (execution), or does matched-data SFT regress it while RL expands it (exploration)? | Same gate, opposite branch |
Verdict pressure: this fork cannot be decided from priors or literature alone — by design, it is the output of forks (a) and (b), not an independent choice. If you find yourself picking an emphasis before the funnel exists, you are guessing, and the guess has better-than-even odds of being wrong given the project’s own “likely execution, unproven” framing in §1.
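To make "routed, not chosen" concrete, here is a minimal sketch of the funnel-side routing only. It assumes each failed challenge in a segment carries one F1–F4 tag; the `min_share` dominance cutoff and the function name are hypothetical, and the scale-check and Pass@(k,T) crossover inputs from the gate row are deliberately left out.

```python
from collections import Counter

EXPLORATION_TAGS = {"F1", "F4"}  # per the table: F1/F4-dominant routes to exploration
EXECUTION_TAGS = {"F2", "F3"}    # F2/F3-dominant routes to execution


def route_emphasis(failure_tags: list[str], min_share: float = 0.6) -> str:
    """Route one segment from its per-challenge funnel tags.

    min_share is an illustrative dominance cutoff, not a project number.
    """
    if not failure_tags:
        # No funnel output yet: choosing an emphasis here would be a guess.
        return "undecided: no funnel data"
    counts = Counter(failure_tags)
    total = sum(counts.values())
    exploration_share = sum(counts[t] for t in EXPLORATION_TAGS) / total
    execution_share = sum(counts[t] for t in EXECUTION_TAGS) / total
    if exploration_share >= min_share:
        return "exploration-emphasis"
    if execution_share >= min_share:
        return "execution-emphasis"
    return "undecided: no dominant failure stage"


print(route_emphasis([]))
print(route_emphasis(["F2", "F3", "F3", "F1", "F2"]))
```

Returning an explicit "undecided" value rather than a default branch keeps the function from silently encoding a prior, which is the failure the verdict warns about.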
3. Dependency order — a DAG, not a timeline
This is deliberately not a schedule. It shows what unlocks what — several branches can run in parallel, and nothing downstream of “instrument” is safe to start before its own inputs exist.
flowchart TD
subgraph INSTRUMENT["INSTRUMENT — near-free, read-only, do first"]
I1["stagescan.go + stage_oracle.json\nper-challenge, F1-F4 tagger"]
I2["flag_verified true byte-compare\n(replace retrieved-provenance proxy)"]
I3["entropy logging wired,\nready from RL step 0"]
I4["elicitation-ladder harness:\ntool-usage histogram across 40 sectools"]
end
subgraph MEASURE["MEASURE — diagnostic, no training changes"]
M1["Segment 1000 challenges:\nsingle-shot vs sequentially-gated"]
M2["Pass@(k,T): base vs current checkpoint,\nper segment (arXiv:2604.14877)"]
M3["Cover@tau per challenge\n(arXiv:2510.08325) — guessing vs reliable"]
M4["Base-model pass@k control\n(arXiv:2504.13837)"]
M5["Scale-check: weak/capacity-limited\nvs already-capable-unreliable\n(arXiv:2603.21972)"]
M6["Elicitation ladder run:\nfew-shot -> SFT-demo -> RL"]
end
subgraph ROUTE["ROUTE — the fork decisions (Section 2)"]
RA["Fork a: ALREADY DECIDED\n(instrument first)"]
RB["Fork b: SFT-now vs measure-first\nper segment"]
RC["Fork c: monolithic vs\nmilestone-shaped GRPO"]
RD["Fork d: elicitation vs\ntraining-time tool fix"]
RE["Fork e: exploration- vs\nexecution-emphasis"]
end
subgraph TRAIN["TRAIN — the actual runs"]
T1["Rejection-sampling SFT\non single-shot segment, curated\n(STaR / ReST-EM pattern)"]
T2["GRPO + DAPO baseline\n(clip-higher, dynamic sampling)"]
T3["+ GiGPO step-level credit\n(zero extra rollouts)"]
T4["+ potential-based milestone\nshaping (gated, fork c only)"]
T5["ToolRL / Tool-Star forced\nexposure (gated, fork d only)"]
T2b["Exploration-emphasis RL:\nDIVER / CDE curiosity bonus /\nPSN-RLVR / ProRL resets\n(gated, fork e exploration branch)"]
end
subgraph GRADUATE["GRADUATE — the go/no-go gates"]
G1["Entropy collapsed\nAND pass@64 non-trivial\n(arXiv:2510.01624)"]
G2["Semantics-preserving-transform\nrobustness check survives\n(arXiv:2502.07445 / 2503.02296)"]
G3["Base-pass@k control still\ntrails trained pass@k\n(gain is real, not elicitation)"]
G4["pass@large-k did NOT shrink\npost-RL (RL-PLUS check,\narXiv:2508.00222)"]
end
I1 --> M1
I2 --> M4
I3 --> G1
I4 --> M6
M1 --> M2
M2 --> M3
M2 --> M4
M2 --> M5
M1 --> RB
M2 --> RB
M3 --> RB
M5 --> RC
M1 --> RC
M6 --> RD
RB --> RE
M5 --> RE
RB --> T1
RC -->|"F1-dominant, weak policy"| T4
RC -->|"F2/F3-dominant"| T2
RD -->|"elicitation recovers it"| T1
RD -->|"neither recovers it"| T5
RE -->|"execution-emphasis"| T2
RE -->|"exploration-emphasis"| T2b
T1 --> T2
T2 --> T3
T3 --> T4
T3 --> T5
T2 --> G1
T4 --> G1
T5 --> G1
T2b --> G1
G1 --> G2
G2 --> G3
G3 --> G4
G4 -->|"holds"| Ship["Credit the gain.\nGeneralize to the next segment /\nchallenge subset"]
G4 -->|"fails"| Back["Back to MEASURE —\nre-run funnel, re-check scale,\ndo not re-train blind"]
classDef inst fill:#132b22,stroke:#34d399,color:#eafaf3;
classDef meas fill:#0f2a3d,stroke:#38bdf8,color:#e6f6ff;
classDef route fill:#3a2e14,stroke:#f5b942,color:#fff6e0;
classDef train fill:#2a1438,stroke:#c084fc,color:#f3e8ff;
classDef grad fill:#3a1414,stroke:#f87171,color:#fde8e8;
class I1,I2,I3,I4 inst;
class M1,M2,M3,M4,M5,M6 meas;
class RA,RB,RC,RD,RE route;
class T1,T2,T3,T4,T5,T2b train;
class G1,G2,G3,G4 grad;
Read this as: nothing in TRAIN is safe to start before its ROUTE gate fires, and nothing in ROUTE is safe to decide before its MEASURE inputs exist. INSTRUMENT is the only stage with no prerequisites — which is why fork (a) has no real counter-argument.
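The "what unlocks what" reading can be checked mechanically. The sketch below is an illustration, not project tooling: it copies only the unconditional INSTRUMENT → MEASURE → ROUTE edges from the diagram (the label-gated TRAIN edges and the GRADUATE chain are omitted), and the helper names `build_predecessors` and `ready_nodes` are hypothetical. It uses the standard-library `graphlib.TopologicalSorter` to confirm the edge list is acyclic and to list which nodes are safe to start given what has finished.

```python
from graphlib import TopologicalSorter

# Unconditional edges copied from the diagram, INSTRUMENT -> MEASURE -> ROUTE only.
# The label-gated TRAIN edges and the GRADUATE chain are left out of this sketch.
EDGES = [
    ("I1", "M1"), ("I2", "M4"), ("I4", "M6"),
    ("M1", "M2"), ("M2", "M3"), ("M2", "M4"), ("M2", "M5"),
    ("M1", "RB"), ("M2", "RB"), ("M3", "RB"),
    ("M5", "RC"), ("M1", "RC"),
    ("M6", "RD"),
    ("RB", "RE"), ("M5", "RE"),
]


def build_predecessors(edges):
    """Map each node to the set of nodes that must finish before it."""
    preds = {}
    for src, dst in edges:
        preds.setdefault(src, set())
        preds.setdefault(dst, set()).add(src)
    return preds


def ready_nodes(preds, done):
    """Nodes not yet done whose prerequisites are all done (can run in parallel)."""
    return sorted(n for n, ps in preds.items() if n not in done and ps <= done)


PREDS = build_predecessors(EDGES)
# static_order() raises graphlib.CycleError if the edge list is not a DAG.
ORDER = list(TopologicalSorter(PREDS).static_order())

print(ready_nodes(PREDS, set()))         # nodes with no prerequisites in this subset
print(ready_nodes(PREDS, {"I1", "M1"}))  # what finishing I1 and M1 unlocks
```

With nothing done, only INSTRUMENT nodes come back as ready, which is the same observation the paragraph above makes about fork (a).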
4. Open hypotheses to test
These are falsifiable, in the sense the diagnosis chapter insists on: each has a stated experiment and a stated result that would kill it. This is the “prove it to myself” frame, not a checklist to complete once — re-run per challenge segment as the corpus grows.
| # | Hypothesis | Experiment | Falsified if |
|---|---|---|---|
| H1 | It’s execution, not knowledge, on the non-sequentially-gated segment | Base-model pass@k at large k (64, 256) on currently-failing single-shot challenges: does the correct action ever appear? | The correct action never appears at any k on any checkpoint for a large fraction of these; that is a knowledge gap for that subset, requiring off-policy injection (demonstration, teacher, or a tool), not more RL |
| H2 | Milestone shaping helps, doesn’t hack | Introduce potential-based shaping (fork c, gated) on the F1-dominant segment only; track held-out flag_verified rate and pass@large-k before/after | Held-out flag rate drops, or pass@large-k shrinks post-introduction (capability-boundary collapse, arXiv:2508.00222) — either result means the shaping term is being farmed, revert to monolithic immediately |
| H3 | The horizon is tractable for GRPO at the 30–60% baseline band | Run DAPO+GiGPO on challenges the funnel tags F2/F3-dominant, in the 30–60% pass-rate band; watch entropy from step 0 | Entropy still collapses under DAPO’s own fixes, or stage-transition credit doesn’t concentrate on the exploitation phase specifically (GiGPO’s state-hash groups show flat credit) — means the horizon/credit-assignment problem is harder than the established recipe assumes for this task shape |
| H4 | A real exploration gap exists, localized to the sequentially-gated segment | Replicate Zhai et al.’s crossover-direction test on this project’s own compositional-segment challenges: does matched-data SFT regress pass@(k,T) on this segment while GRPO expands it? | If SFT does not regress this segment (both SFT and RL improve it comparably), the sequential-gating framing doesn’t transfer to this task family, and the single-shot ordering (SFT then GRPO) is fine everywhere — fork (b)/(e)’s special-casing was unnecessary |
| H5 | Tool-avoidance (curl-preference) is elicitation, not a missing-affordance problem | Run the elicitation ladder (fork d) on a sample of the 26 dead sectools entries on held-out challenges | Neither few-shot prompting nor light SFT-on-demos recovers usage — genuinely a missing-affordance problem, escalate to Tool-Star forced exposure + ToolRL decomposed reward |
| H6 | Any claimed solve-rate gain reflects real execution-reliability improvement, not memorization/elicitation | Base-model pass@k-at-large-k control (H1’s instrument, reused) and semantics-preserving-transform variants of a held-out subset, checked against every claimed gain before crediting it | The gain evaporates on either check — the gain is elicitation (fine to attribute to SFT, a red flag if it persists after GRPO) or memorization of the fixed ~10 canonical PD26 shapes |
| H7 | The SFT go/no-go gate (entropy collapse) is sufficient on its own | Check whether pass@64 on the rejection-sampling-SFT checkpoint is non-trivial at the same time entropy collapses, before green-lighting GRPO (arXiv:2510.01624) | Entropy has collapsed but pass@64 is flat/low — this predicts a disappointing GRPO run regardless of how good SFT accuracy looked; do not launch on entropy-collapse alone |
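Several rows above (H1, H6, H7) depend on pass@k at large k. A minimal sketch of the arithmetic, assuming the standard unbiased estimator used in code-generation evaluation, pass@k = 1 − C(n−c, k)/C(n, k) for n samples with c correct. The `grpo_gate` helper and its `min_pass_at_64` cutoff are hypothetical: they only illustrate the H7 rule that entropy collapse alone is not a launch signal.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples of which c are correct.

    Computes 1 - C(n-c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the result is exactly 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)


def grpo_gate(entropy_collapsed: bool, pass_at_64: float,
              min_pass_at_64: float = 0.05) -> bool:
    """H7-style gate: require entropy collapse AND a non-trivial pass@64.

    min_pass_at_64 is an illustrative cutoff, not a project number.
    """
    return entropy_collapsed and pass_at_64 >= min_pass_at_64


# H1-style reading: with 256 samples and zero correct, pass@k is 0 at every k,
# which is the "correct action never appears" outcome for that challenge.
print(pass_at_k(256, 0, 64))
# H7-style reading: collapsed entropy with a flat pass@64 does not pass the gate.
print(grpo_gate(True, pass_at_k(256, 0, 64)))
```

Computing pass@k from a larger sample pool (n > k) rather than from exactly k draws reduces variance, which matters when the per-segment challenge counts are small.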
5. What we deliberately are NOT basing this on
Standing project rule, restated for this chapter specifically: no fork, no hypothesis, no cost/risk estimate, and no number above rests on an academic cybersecurity-LLM training or benchmark paper. That covers CTF-Dojo, Cyber-Zero, Pentest-R1, HackSynth/Random-Crypto, AutoPenBench, Cybench, NYU CTF Bench, EnIGMA, InterCode-CTF, DRLRM-PT, node-fragility reward shaping, the kill-chain-staged-reward paper, Nakano’s ATT&CK-tree scaffold, and Honarvar’s Evolve-CTF/Capture-the-Flags family-based evaluation, even where several of these report a finding that would superficially support one side of a fork here. None of that line of work has produced a frontier cybersecurity model, so none of it counts as frontier evidence for a load-bearing decision; every mention of them in the six source chapters this capstone draws from is explicitly labelled “academic, cited for context only, not a basis,” and this chapter inherits that discipline rather than re-importing their numbers under a different heading.
Every claim above is re-grounded on one of:
- general frontier post-training disclosures (DeepSeek-R1, Kimi k1.5/K2, Llama 4, OpenAI Deep Research);
- general RL/agent theory (potential-based shaping, the reward-hacking convergence, reward-tampering-as-generalization (arXiv:2406.10162), DAgger, GAE, entropy-collapse mechanics);
- general (non-security) agent-eval and long-horizon literature (METR, AgentBoard, MAST, τ-bench, GSM-Symbolic/C-BOD);
- this project’s own measured data and confirmed lessons (the SFT-induced FLAG{} confabulation, the Phoenix trace corpus, the existing pass@k methodology).
Where a demoted academic-security idea is still worth pursuing on its own merits (e.g. staged/kill-chain-shaped reward in a cybersecurity-specific loop), the chapters this one draws from say so explicitly and flag it “worth pursuing — unvalidated outside academic-security work,” never as settled ground.
Cross-links
- The path to a frontier cybersecurity model — the north star this whole Rung-1 fork-and-DAG plan is a prerequisite for, not a substitute for; explains why resolving every fork here still leaves Rungs 2–3 (environment scale, CPT/mid-training) unaddressed.
- The recipe is a sequence, not a pick — the stage-order-and-compounding frame fork (b)’s “don’t SFT the whole portfolio blind” verdict is a specific instance of.
- Proven post-training datasets — a usage-cited registry — the concrete dataset shopping list for the general-capability rungs underneath forks (b)/(d)’s SFT/preference data.
- Cybersecurity is one of a family — what cracked the others — the cross-domain evidence (coding agents, competitive programming, theorem proving, web agents, games, robotics) several forks above draw on directly for technique precedent and risk calibration.
- Diagnosing the gap — a scientific framework — the routing test and the pass@k / Pass@(k,T) / Cover@τ protocol every MEASURE node in §3’s DAG instantiates.
- From behavioral audit to training signal — the per-pattern gap→method→verification mapping fork (d) and fork (e) draw on directly.
- One problem, or many? — decomposition vs. monolithic — the full verdict and honest-limits case behind fork (c).
- Before you train — instrumentation & data readiness — the source of fork (a)’s cost estimate and the concrete first-step recipe for the F1–F4 tagger.
- RL that creates value — long-horizon, exploration, reasoning, novelty — the technique menu (DAPO, GiGPO, DIVER, ProRL, NuRL, PSN-RLVR) fork (e)’s exploration-emphasis branch draws on.
- The decision — the one-line version of the whole book’s routing question; this chapter is its decision-surface expansion, not a replacement.
References
Every id below was crawl-verified during the sessions that built this book (title/authors/date confirmed on the arXiv abstract page). Lab blogs/tech reports are linked to their source. Citation counts are unreliable for <18-month-old work — venue/lab-report presence is the stronger signal.
Foundations & imitation
- Ross, Gordon & Bagnell — DAgger, imitation compounds O(εT²) — arXiv:1011.0686 (AISTATS 2011)
- InstructGPT (SFT + RLHF) — arXiv:2203.02155
- Hinton et al. — Knowledge Distillation — arXiv:1503.02531
- GKD / on-policy distillation — arXiv:2306.13649 (ICLR 2024)
- STaR — arXiv:2203.14465 · ReST — arXiv:2308.08998 · ReST-EM / “Beyond Human Data” (DeepMind, self-training on own correct samples beats human-data-only, math/code) — arXiv:2312.06585 · RAFT — arXiv:2304.06767 · RFT (Yuan) — arXiv:2308.01825
- Rejection Sampling → Reinforce (entropy collapse; GRPO’s implicit filtering) — arXiv:2504.11343
Preference
- DPO — arXiv:2305.18290 · KTO — arXiv:2402.01306 · IPO — arXiv:2310.12036 · ORPO — arXiv:2403.07691 · SimPO — arXiv:2405.14734
- DPO-family survey — arXiv:2503.11701
- Constitutional AI / RLAIF — arXiv:2212.08073
Reinforcement
- PPO — arXiv:1707.06347
- GAE (classical single-time-scale credit assignment) — arXiv:1506.02438
- GRPO (DeepSeekMath) — arXiv:2402.03300
- DeepSeek-R1 (RLVR pipeline) — arXiv:2501.12948
- GSPO (Qwen3) — arXiv:2507.18071 · DAPO — arXiv:2503.14476
- Dr.GRPO (removes GRPO’s length-bias reward artifact) — arXiv:2503.20783
- REINFORCE++ (critic-free, no group-sampling requirement) — arXiv:2501.03262
- VAPO (value-model pretraining + length-adaptive GAE for long/heterogeneous responses) — arXiv:2504.05118
- PRM (Let’s Verify Step by Step) — arXiv:2305.20050 · Process Reward Models That Think — arXiv:2504.16828
Long-horizon & multi-turn agentic RL — credit assignment across turns, not tokens
- GiGPO (step-level advantage from state-hash-matched steps across rollouts, zero extra rollouts) — arXiv:2505.10978
- ArCHer (two-timescale: off-policy turn-level critic + on-policy token-level PG) — arXiv:2402.19446
- RAGEN / StarPO (multi-turn agentic RL framework, state-thinking-action loop) — arXiv:2504.20073
- Turn-Level Reward Design (dense per-turn reward layered under a terminal reward) — arXiv:2505.11821
- Turn-PPO (turn as the MDP unit, not token or trajectory) — arXiv:2512.17008
- Demystifying RL for Long-Horizon Tool-Using Agents (5-axis systematic ablation: reward/scale/data/algorithm/environment) — arXiv:2603.21972
- Verlog (dual-discount GAE, memory-windowing, validated to 400+ turn episodes) — no arXiv id, cite OpenReview:GmodkWwMV3
- Kimi k1.5 (128k-context RL via partial-rollout checkpoint/resume, no MCTS/value-fn/PRM) — arXiv:2501.12599
- AgentGym-RL / ScalingInter-RL (horizon curriculum: short turn cap expanding to full budget over training) — arXiv:2509.08755
- MUA-RL (trains against a dynamic, LLM-simulated counterpart instead of a static script) — arXiv:2508.18669
- HiPER (hierarchical credit assignment) — arXiv:2602.16165 · Hindsight Credit Assignment for Long-Horizon LLM Agents — arXiv:2603.08754
- RL-PLUS (names “capability boundary collapse” — pass@k at large k dropping even as pass@1 rises under RLVR) — arXiv:2508.00222
Hierarchical RL, decomposition & potential-based reward shaping — “one problem, or many?”
- Sutton — “The Bitter Lesson” (hand-built structure plateaus, general search+learning wins at scale; intellectual ancestor of the monolithic-outcome-RL case) — no arXiv id, incompleteideas.net (2019)
- OpenAI Deep Research system card (long-horizon tool-using agent trained end-to-end on outcome/rubric reward; “end-to-end training beats manual orchestration”) — no arXiv id, OpenAI system card
- Options / SMDP framework (Sutton, Precup, Singh — the seminal HRL / temporal-abstraction paper) — Artificial Intelligence 112 (1999), no arXiv id, DOI 10.1016/S0004-3702(99)00052-1
- FeUdal Networks (Manager/Worker HRL, fixes option-collapse) — arXiv:1703.01161
- HIRO (off-policy correction for HRL non-stationarity / subgoal ceiling-capping) — arXiv:1805.08296
- Ng, Harada & Russell — potential-based reward shaping, `F(s,a,s') = γΦ(s') − Φ(s)`, provably policy-invariant — ICML 1999, no arXiv id (predates arXiv’s routine ML use), ACM DL 10.5555/645528.657613
- Müller & Kudenko (PBRS practical effectiveness still depends on potential scaling) — arXiv:2502.01307
- RUDDER (learned, return-equivalent reward redistribution — a learned alternative to hand-specifying Φ) — arXiv:1806.07857
- Go-Explore (pure outcome RL structurally fails on sparse/deceptive long-horizon tasks without explicit remember-and-return exploration) — arXiv:1901.10995 / Nature s41586-020-03157-9
- OpenAI Five (long-horizon precedent needed huge scale + a per-frame shaped reward, not a single terminal bit) — arXiv:1912.06680
- Credit Assignment survey (separates credit-assignment variance from exploration burden) — arXiv:2312.01072
- MiRA (milestone-based dense reward; Gemma3-12B WebArena-Lite 6.4%→43.0%, beating WebRL/GPT-4-Turbo) — arXiv:2603.19685
- Verifiable Process Rewards / VPR (safe ground-truth checklist process reward; own caveat that open/unstructured stages remain unsolved) — arXiv:2605.10325
- CM2 (checklist-style verifiable sub-criteria reward) — arXiv:2602.12268
- Curriculum Learning (Bengio et al. — foundational, order training by difficulty, touches no reward function) — ICML 2009, no arXiv id
- h1 (curriculum + pure outcome-only reward yields an exponential sample-complexity gain) — arXiv:2510.07312
- FastCuRL (context-length curriculum, entropy-collapse timing) — arXiv:2503.17287
- BPO (curriculum + rejection-sampling refine; vanilla GRPO on sparse reward gains only marginally without curriculum) — arXiv:2508.03018
- TIPS (turn-level potential shaping for search-augmented LLMs — shaping machinery directly on-point, domain is not) — arXiv:2603.22293
- Randlov & Alstrom — the canonical non-potential-based “bicycle shaping” failure (agent farms a looks-like-progress bonus instead of reaching the goal) — ICML 1998, no arXiv id
- Kill-chain-staged reward (cyber-defense red-teaming) (academic cybersecurity-LLM work — cited for context only, not a basis) — arXiv:2605.17075 (May 2026)
- DRLRM-PT (reward machine over kill-chain phases, classical/non-LLM pentest RL) (academic — cited for context only, not a basis; explicitly named in the project’s standing rule) — arXiv:2405.15908 / DOI 10.1109/ijcnn60899.2024.10650368
- Node-fragility reward shaping (classical dense-reward pentest, non-LLM regime) (academic — cited for context only, not a basis) — DOI 10.3390/electronics13214311
Tool-integrated / tool-use RL — the direct BSides pattern-1 (tool avoidance) fixes
- ReTool (trajectory-level tool-integrated RL) — arXiv:2504.11536
- ToRL (tool-integrated RL, math) — arXiv:2503.23383
- Search-R1 (RL for search-agent tool use) — arXiv:2503.09516
- ToolRL (fine-grained, decomposed per-call tool-selection reward) — arXiv:2504.13958
- Tool-Star (forced exposure to under-used tools via multi-tool synthesis pre-RL) — arXiv:2505.16410
- Tool Preferences in Agentic LLMs are Unreliable (diagnosis of pattern-1-shaped tool avoidance) — arXiv:2505.18135
Exploration & entropy collapse
- The Entropy Mechanism of RL for Reasoning LMs (R = -a·e^H + b; Clip-Cov/KL-Cov fixes) — arXiv:2505.22617
- Beyond the 80/20 Rule (top-20%-entropy “forking tokens” carry nearly all exploration signal) — arXiv:2506.01939
- Reasoning with Exploration: An Entropy Perspective — arXiv:2506.14758
- Representation-Based Exploration for Language Models (hidden-state diversity bonus, usable at inference time) — arXiv:2510.11686
- Pass@k Metric for RLVR: A Diagnostic Tool of Exploration, But Not an Objective — arXiv:2511.16231
- Spurious Rewards (RLVR gains on Qwen2.5-Math nearly as large with completely wrong rewards; model-family-dependent) — arXiv:2506.10947
- Absolute Zero (self-play task-proposal + solve, zero external labeled data) — arXiv:2505.03335
- LIMO (817 curated SFT examples beat >100k loosely-curated ones — SFT as cognitive templates, not knowledge source) — arXiv:2502.03387
- Test-time compute scaling (Snell et al., difficulty-adaptive allocation matches a 14x larger model) — arXiv:2408.03314 · o3-mini vs o1-mini (accuracy without longer CoT) — arXiv:2502.15631
- OpenAI o1 System Card (methodology precedent, cited across the field) — arXiv:2412.16720
- Reward-hacking-under-RL cluster: Specification Gaming in Reasoning Models — arXiv:2605.02269 · LLMs Gaming Verifiers (extensional vs intensional correctness) — arXiv:2604.15149 · Reward Hacking in the Era of Large Models (Proxy Compression Hypothesis) — arXiv:2604.13602
- Per-step / process-reward hacking convergence (the case against naive per-stage reward): PURE / Stop Summation (summation-form credit assignment “easily induces LLMs to hack steps with high rewards”) — arXiv:2504.15275 · Reward Under Attack (SOTA PRMs as “fluency detectors rather than reasoning verifiers”) — arXiv:2603.06621 · Gao et al. (learned PRM/ORM + success reward can hurt vs success-only) — arXiv:2410.15115 · PRIME (authors’ own admission that process labels are “prohibitively expensive,” PRMs vulnerable to hacking) — arXiv:2502.01456 · MONA (multi-step reward hacking even when no single step looks bad to an overseer) — arXiv:2501.13011
Self-correction & tool-use self-correction RL
- SCoRe — Training LMs to Self-Correct via RL (reward for improvement, not final correctness) — arXiv:2409.12917
- From Correction to Mastery (earliest-error RL, distinct “SCoRe”) — arXiv:2509.14257
Post-training recipe as a sequence — order, compounding, synthetic-trajectory bootstrap
(new citations from The recipe is a sequence, not a pick; ids already covered elsewhere — Llama 3, Tülu 3, OLMo 2, DeepSeek-V3/R1, Qwen3, GRPO/DeepSeekMath, LoRA-learns-less, STaR, RAFT, ReST-EM, DAPO, the rejection-sampling→REINFORCE entropy-collapse paper, “Scalpel vs Hammer” — are not repeated here.)
- Self-Instruct (off-policy synthetic instruction generation) — arXiv:2212.10560
- WizardLM / Evol-Instruct (off-policy synthetic, complexity-evolved instructions) — arXiv:2304.12244
- Llama 2 (RLHF report; iterative-round rejection-sampling non-monotonicity — “struggled more… to compose rhyming lines” when only the latest round was sampled) — arXiv:2307.09288
- Persona-driven synthetic data generation — arXiv:2406.20094
- Qwen2.5 (dense 0.5B–72B; explicit SFT→offline-DPO→online-GRPO staging) — arXiv:2412.15115
- “SFT Memorizes, RL Generalizes” (GeneralPoints/V-IRL testbed; SFT stabilizes format, RL then generalizes) — arXiv:2501.17161 (ICML 2025 poster)
- Llama-Nemotron (post-training recipe report) — arXiv:2505.00949
- Distillation-vs-pattern-imitation ablation (“only the DeepSeek model shows a meaningful increase in capability” — new knowledge, not pattern transfer, is what expands pass@k) — arXiv:2505.14216
- AdaSTaR (efficient iterative rejection-sampling; curriculum sampling, −58.6% training FLOPs at equal-or-better accuracy) — arXiv:2505.16322
- SFT-vs-RFT forgetting comparison (SFT 52.1%→40.1% drop vs. RFT 54.2% improvement on the same setting) — arXiv:2507.05386
- “RL Fine-Tuning Heals OOD Forgetting in SFT” (contested re-framing of “SFT memorizes/RL generalizes” as “SFT forgets, RL recovers”) — arXiv:2509.12235
- Domain-continual pretraining forgetting / backward-transfer at scale (“moderate forgetting, low-to-moderate backward transfer”) — arXiv:2510.17776
- Backward-synthesis answer-anchoring / confabulation risk (STaR-rationalization follow-up; the answer acts as a cognitive anchor) — arXiv:2602.14469
- “Revisiting DAgger in the Era of LLM-Agents” (SFT’s off-policy covariate shift vs. RLVR’s on-policy but sparse feedback, stated precisely for multi-turn agents) — arXiv:2605.12913
- Qwen3-4B SFT degrading TruthfulQA/HaluEval (Qwen-family-specific forgetting evidence) — arXiv:2605.20005
Continued pretraining on an instruction-tuned model — preservation techniques
(new citations from Continued pretraining on an instruction-tuned model; LoRA-learns-less-forgets-less [2405.09673] already cited above, not repeated.)
- DAPT — Gururangan et al., “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks” (ACL 2020, seminal domain-adaptive-pretraining concept) — arXiv:2004.10964
- MMLU — Hendrycks et al. — arXiv:2009.03300
- Scialom et al. — replay mitigates forgetting in continual instruction-tuning (EMNLP 2022) — arXiv:2205.12393
- Ilharco et al. — “Editing Models with Task Arithmetic” (ICLR 2023, seminal task-vector/delta-arithmetic result) — arXiv:2212.04089
- TIES-Merging (NeurIPS 2023; trim + elect-sign + merge) — arXiv:2306.01708
- Gupta et al. — continual-pretraining LR re-warm/re-decay — arXiv:2308.04014
- Luo et al. — empirical study of catastrophic forgetting in LLMs during continual fine-tuning (1B–14B; forgetting worsens with scale) — arXiv:2308.08747
- AdaptLLM — auto-converts raw domain text into reading-comprehension/QA replay pairs — arXiv:2309.09530
- Qi, Zeng et al. — “Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” (ICLR 2024) — arXiv:2310.03693
- Chat Vector (Huang et al., ACL 2024; language-shift instance of the task-arithmetic reattach recipe) — arXiv:2310.04799
- DARE — “Super Mario” random delta-dropping + rescaling (ICML 2024) — arXiv:2311.03099
- IFEval — “Instruction-Following Evaluation for Large Language Models” (Zhou et al., Google) — arXiv:2311.07911
- LLaMA Pro — block-expansion CPT that never touches original weights, then a separate instruction-tuning pass (ACL 2024) — arXiv:2401.02415
- Li & Lee — “Examining Forgetting in Continual Pre-training of Aligned Large Language Models” (direct CPT-on-Llama-2-7b-chat comparison) — arXiv:2401.03129
- RESTA — DARE-sparsified delta subtraction/restoration (ACL 2024) — arXiv:2402.11746
- Ibrahim et al. — simple/scalable continual-pretraining strategies (re-warm+re-decay+replay matches from-scratch retraining) — arXiv:2403.08763
- Qi et al. — “Safety Alignment Should Be Made More Than Just a Few Tokens Deep” (ICLR 2025 Oral; shallow safety-alignment mechanism) — arXiv:2406.05946
- Instruction Pre-Training (Microsoft, EMNLP 2024; 200M synthesized instruction-response pairs woven into raw CPT corpus) — arXiv:2406.14491
- Jindal, Badrinath, Bharti, Vinay & Sharma (Samsung Research) — “Balancing Continuous Pre-Training and Instruction Fine-Tuning” (the direct S1-vs-S2 CPT-on-instruct-vs-base comparison, 4 model families) — arXiv:2410.10739
- Mousavi, Alghisi & Riccardi (U. Trento) — “What Does Loss Optimization Actually Teach, If Anything? Knowledge Dynamics in Continual Pre-training of LLMs” (loss curves don’t reveal instruct-layer damage in real time) — arXiv:2601.03858
- Zheng, Cai, Qiu & Ma — “Spurious Forgetting in Continual Learning of Language Models” (ICLR 2025 poster; forgetting is often a task-alignment/metric artifact, not true knowledge loss) — no arXiv id, OpenReview:ScI7IlKGdI
- Harmon, Hochlehnert, Bethge & Prabhu (Tübingen AI Center) — “Mapping Post-Training Forgetting in Language Models at Scale” (~30 model pairs; “model merging does not reliably mitigate forgetting”) — no arXiv id found, anonymous ICLR 2026 submission, OpenReview:qCIg2WGudx
Datasets (proven-by-usage) — general post-training data, Sequence-B rungs
Full registry, inclusion rule, and per-dataset detail: Proven post-training datasets — a usage-cited registry. Papers backing named training recipes for these datasets (Tülu 3, OLMo 2, Qwen2.5-Math, DPO, KTO, IPO, ORPO already cited above, not repeated):
- Gorilla / APIBench (UC Berkeley, NeurIPS 2024 D&B) — arXiv:2305.15334 · dataset
- ToolBench (OpenBMB) → ToolLLaMA (ICLR 2024 spotlight) — arXiv:2307.16789 · GitHub
- PKU-SafeRLHF → Beaver-7B-v1.0 — arXiv:2310.12773 · dataset
- AgentInstruct (Zhipu/THUDM) → AgentLM (ACL 2024 Findings) — arXiv:2310.12823 · dataset
- Zephyr-7B (UltraChat-200k SFT stage → UltraFeedback DPO stage) — arXiv:2310.16944 · dataset
- LongAlign-10k (Zhipu/THUDM, EMNLP 2024 Findings) — arXiv:2401.18058 · dataset
- Aya (Cohere) → Aya-101/23/Expanse — arXiv:2402.06619 · dataset
- Agent-FLAN (InternLM, ACL 2024 Findings) — arXiv:2403.12881 · dataset
- COIG-CQIA (Chinese instruction SFT, LIMA-style) — arXiv:2403.18058 · dataset
- MAP-Neo Matrix Data Pile (bilingual EN/ZH pretrain corpus) — arXiv:2405.19327 · dataset
- Magpie (self-synthesized instruction/preference data, ICLR 2025) → Llama-3-8B-Magpie-Align / SmolLM2 — arXiv:2406.08464 · org
- HelpSteer2 → Llama-3.1-Nemotron-70B-Reward (NVIDIA, #1 RewardBench at release) — arXiv:2406.08673 · dataset
- Beaver-7B (PKU-SafeRLHF, second citation) — arXiv:2406.15513
- Salesforce xLAM-function-calling-60k → xLAM-1b/7b-fc-r (NeurIPS 2024 D&B) — arXiv:2406.18518 · dataset
- IBM Granite-20B-FunctionCalling (Glaive-function-calling-v2) — arXiv:2407.00121 · dataset
- ToolACE → ToolACE-8B (Huawei, ICLR 2025) — arXiv:2409.00920 · dataset
- OpenMathInstruct-2 → OpenMath2-Llama3.1 (NVIDIA, ICLR 2025) — arXiv:2410.01560 · dataset
- Skywork-Reward-Preference-80K-v0.2 (#1 RewardBench) — arXiv:2410.18451 · dataset
- OpenCSG Chinese-Cosmopedia → csg-wukong-1B — arXiv:2501.08197 · dataset
- SmolLM2 (Smol-Magpie-Ultra derivative) — arXiv:2502.02737 · dataset
- OpenCodeReasoning-Nemotron (NVIDIA; SFT-only beats RL alternatives on LiveCodeBench) — arXiv:2504.01943 · dataset
- APIGen-MT → Salesforce xLAM-2 series (SOTA BFCL + τ-bench) — arXiv:2504.03601 · project
- COIG-P (Chinese preference/DPO, EACL 2026 Findings) — arXiv:2504.05535 · dataset
- OpenMathReasoning → OpenMath-Nemotron (NVIDIA’s AIMO-2-winning submission) — arXiv:2504.16891 · dataset
- OpenThoughts3-1.2M → OpenThinker3-7B (SOTA-open-data at release) — arXiv:2506.04178 · dataset
- BAAI Infinity-Instruct → InfInstruct family (beats GPT-4-0314 by 8.6% on IF) — arXiv:2506.11116 · dataset
Domain-specialization lineages — the frontier recipe (code / math / medical)
- Kaplan et al. — Scaling Laws for Neural Language Models — arXiv:2001.08361
- Hoffmann et al. — Chinchilla, compute-optimal scaling — arXiv:2203.15556
- phi-1 (textbook-quality data substitutes for ~100x scale in a narrow domain) — arXiv:2306.11644
- DeepSeek-Coder V2 (abandons from-scratch pretraining, continues from a strong base) — arXiv:2406.11931
- Qwen2.5-Math (co-evolves RM + SFT data across rounds before RL, reuses RM for best-of-N at inference) — arXiv:2409.12122
- DeepSeek-Prover-V2 (subgoal-decomposed cold-start data + kernel-verified RL) — arXiv:2504.21801
- OLMo 2 (mid-training as a named 5-10% FLOPs bridge stage) — arXiv:2501.00656
- Mid-training mechanism study (outperforms CPT-alone at matched budget, reduces catastrophic forgetting before SFT) — arXiv:2510.14865
- Med-PaLM v1 (mentioned, not a basis for cyber claims — prompt-tuning-only ceiling, motivates v2) — arXiv:2212.13138
- Med-PaLM 2 (domain instruction fine-tuning + ensemble refinement) — arXiv:2305.09617
- MedGemma (domain VLM pretraining + task fine-tuning, explicitly not clinical-grade alone) — arXiv:2507.05201
- Full fine-tuning vs. LoRA on code/math domain-skill acquisition (10-100x effective-rank gap) — arXiv:2405.09673
- Benchmark contamination (GSM8K/MMLU scores inflated up to 22.9%/19.0%) — arXiv:2406.13990
- LiveCodeBench (time-segmented, contamination-resistant-by-construction) — arXiv:2403.07974
- GPQA (“Google-proof” QA) — arXiv:2311.12022
- Tülu 3 (decontamination as a first-class deliverable; primary public naming of RLVR) — arXiv:2411.15124
- Emerging RL-environment-scale bottleneck framing (moderate confidence, new/low-citation) — arXiv:2511.09586
- SIMA (scalable instructable multiworld agent) — arXiv:2404.10179
- SIMA 2 (self-generated tasks + rewards via Gemini) — arXiv:2512.04797
Adjacent-domain structural transfer — coding agents, competitive programming, theorem proving, web agents, games, robotics
- SWE-agent / Agent-Computer Interface (fixed action set + concise feedback lifts pass@1 pre-RL) — arXiv:2405.15793
- SWE-Gym (executable-environment SFT trajectories) — arXiv:2412.21139
- R2E-Gym — arXiv:2504.07164
- SWE-RL (execution-verified reward beats a difflib patch-similarity fallback) — arXiv:2502.18449
- o1→o3 coding RL — arXiv:2502.06807
- Progressive context/turn-budget curriculum for long-horizon RL — arXiv:2508.03501
- SWE-Master (mask environment-feedback tokens out of the SFT loss, low-confidence/very recent) — arXiv:2602.03411
- DeepSeek-Coder v1 (from-scratch pretraining, the lone exception in the lineage) — arXiv:2401.14196
- StarCoder2 / The Stack v2 (curation quality substitutes for parameter count — data-pipeline lesson only) — arXiv:2402.19173
- StarCoder — arXiv:2305.06161
- CodeRL — arXiv:2207.01780
- PPOCoder — arXiv:2301.13816
- RLTF — arXiv:2307.04349
- StepCoder — arXiv:2402.01391
- RLEF (turn-level value function over a multi-turn POMDP; Meta FAIR, ICML 2025 spotlight) — arXiv:2410.02089
- Sailor (SEA-language CPT) — arXiv:2404.03608
- SEA-LION — arXiv:2504.05747
- LLaMA Beyond English — arXiv:2401.01055
- Tokenizer/vocabulary coverage as an architectural precondition — arXiv:2406.11477
- BLOOM+1 (adding a new language to the SFT mixture beats continued pretraining) — arXiv:2212.09535
- Aya — arXiv:2402.07827
- AlphaCode (sampling breadth + cheap filter) — arXiv:2203.07814
- GrandCode / Agentic GRPO (purpose-built GRPO variant for delayed reward + off-policy drift; single team, very recent) — arXiv:2604.02721
- AlphaGeometry (Nature 2024; synthetic self-play manufactures its own training problems) — DOI 10.1038/s41586-023-06747-5
- AlphaProof (Nature 2025; AlphaZero-style self-play/search on top) — DOI 10.1038/s41586-025-09833-y
- WebGPT (learned human-preference reward; origin of the SFT-cold-start-then-RL recipe shape, flagged OOD-weak by its own authors) — arXiv:2112.09332
- WebRL (self-evolving curriculum generated from the model’s own unsuccessful attempts; ICLR 2025) — arXiv:2411.02337
- R1-Searcher (sequential, not summed, two-stage tool-use reward) — arXiv:2503.05592
- DeepResearcher (training end-to-end in the real live environment is “a fundamental requirement”; EMNLP 2025) — arXiv:2504.03160
- AlphaStar (Nature; league-based diverse self-play population fixes strategy collapse) — DOI 10.1038/s41586-019-1724-z
- NetHack (honest “still unsolved” calibration point) — arXiv:2006.13760
- HER — Hindsight Experience Replay (relabel a failed trajectory as the goal it accidentally satisfied; NeurIPS 2017) — arXiv:1707.01495
- Firestone — competence vs. performance (formal/functional split; independently reconfirmed by the particular-language literature) — PMC7604508
- KLong (progressive horizon curriculum, second converging source; low-confidence) — arXiv:2602.17547
- BEACON (2026 long-horizon credit-assignment cluster; low-confidence individually) — arXiv:2605.06078
- Ecpo (2026 long-horizon credit-assignment cluster; low-confidence individually) — arXiv:2606.05885
CTF / pentest RL environments
(academic, cited for context — not a basis for our decisions; see the stance in Contested edges)
- CTF-Dojo (486 verified trajectories, 31.9% pass@1 credibility yardstick) — arXiv:2508.18370
- Cyber-Zero (monolithic outcome-RL, simulated env, +13.1%) — arXiv:2508.00910
- Pentest-R1 (two-stage RL for CTF methodology) — arXiv:2508.07382
- HackSynth (crypto-CTF GRPO) — arXiv:2506.02048
- InterCode-CTF (seminal monolithic-reward CTF environment) — arXiv:2306.14898
- Cybench (subtask decomposition, eval-only) — arXiv:2408.08926
- AutoPenBench (milestone taxonomy near-matching this book’s F1–F4 split, eval-only) — arXiv:2410.03225
- NYU CTF Bench (CTF benchmark family) — arXiv:2406.05590
- EnIGMA (“soliloquizing” fabrication failure mode, ICML 2025) — arXiv:2409.16165
- Guided Reasoning via Structured Attack Trees (deterministic ATT&CK-derived task tree, +5x subtask completion on the same weights) — arXiv:2509.07939
- From Capabilities to Performance (pentesting ablations) — arXiv:2509.14289
- PentestAgent (RAG-fix framing of a knowledge gap; contested against the scaffolding/execution readings above) — arXiv:2411.05185
- Capture the Flags: Family-Based Evaluation via Semantics-Preserving Transformations (CTF-specific robustness benchmark) — arXiv:2602.05523
- What Makes a Good LLM Agent for Real-world Penetration Testing? (Task Difficulty Assessment + Evidence-Guided Attack Tree Search) — arXiv:2602.17622
Agent benchmarks & failure taxonomies
- τ-bench (fault-assignment × fault-type taxonomy) — arXiv:2406.12045
- AgentRx (localizes the single critical failure step in a long trajectory) — arXiv:2602.02475
- AgentBoard (“progress rate” metric — general, non-security capability-decomposition principle, NeurIPS 2024 Oral) — arXiv:2401.13178
- MAST (14-mode/3-category multi-agent failure taxonomy, κ=0.88) — arXiv:2503.13657
- AgentErrorTaxonomy / AgentDebug (root-cause diagnosis alone, no reward change, buys +24% all-correct accuracy) — arXiv:2509.25370
- Phase-aligned taxonomy for autonomous agents (independent-domain convergence on a phase-keyed failure split) — arXiv:2508.13143
Capability boundary, elicitation & sandbagging (contested)
- Yue et al. — RL elicits, not expands — arXiv:2504.13837 (2025-04-18)
- ProRL — prolonged RL expands — arXiv:2505.24864
- Cohen-Inger et al. — “LLMs are Like a Chameleon” (benchmark scores mask overfitting; semantics-preserving perturbation robustness check) — arXiv:2502.07445 (2025-02-11)
- Zhang et al. — “Memorize or Generalize?” (Memorization Risk Index via semantic-perturbation code rewriting; companion robustness-check citation) — arXiv:2503.02296 (2025-03-04)
- PSN-RLVR — arXiv:2602.02555 · NuRL — arXiv:2509.25666 · CoT-Pass@K (Wen et al., RLVR implicitly incentivizes correct reasoning) — arXiv:2506.14245 (2025-06-17)
- Scalpel vs Hammer (GRPO amplifies, SFT replaces) — arXiv:2507.10616
- Zhai et al. — Does RL Expand the Capability Boundary of LLM Agents? Pass@(k,T) — arXiv:2604.14877 (2026-04-16)
- Dragoi et al. — Beyond Pass@k: Breadth-Depth Metrics / Cover@τ — arXiv:2510.08325 (2025-10-09)
- Kang et al. — Quagmires in SFT-RL Post-Training — arXiv:2510.01624 (2025-10-02)
- Chen et al. — The Coverage Principle — arXiv:2510.15020 (2025-10-16)
- Greenblatt et al. — Stress-Testing Capability Elicitation with Password-Locked Models — arXiv:2405.19550 (2024-05-29)
- Hofstätter et al. — The Elicitation Game — arXiv:2502.02180 (2025-02-04)
- van der Weij et al. — AI Sandbagging — arXiv:2406.07358 (2024-06-11)
- Ryd et al. — Removing Sandbagging via Weak Supervision — arXiv:2604.22082 (2026)
- Stroebl et al. — Inference Scaling fLaws — arXiv:2411.17501 (2024-11-26)
- Dorner et al. — ROC-n-reroll — arXiv:2507.12399 (2025-07-16)
- Huang et al. — Is Best-of-N the Best of Them? — arXiv:2503.21878 (2025-03-27, ICML 2025)
- Mahowald et al. — Dissociating Language and Thought in LLMs — arXiv:2301.06627 (2023-01-16)
- He et al. — LLMs as Neurolinguistic Subjects — arXiv:2411.07533 (2024-11-12)
- Boháček et al. — Uncovering Competency Gaps (sparse autoencoders on internal representations) — arXiv:2512.20638 (2025-12-06)
PEFT
- LoRA — arXiv:2106.09685 · QLoRA — arXiv:2305.14314 · DoRA — arXiv:2402.09353
- Unified LoRA-variant study (LR-sensitivity) — arXiv:2601.22708
Frontier lab recipes (reports & blogs) — full-year refresh, 2025-07 → 2026-07, all 10 tracked labs
Llama (historical anchor): Llama 3 — arXiv:2407.21783 · Llama 4 — ai.meta.com/blog/llama-4-multimodal-intelligence
Anthropic (Claude) — Constitutional AI/RLAIF backbone arXiv:2212.08073; inoculation prompting arXiv:2510.04340, Anthropic’s own study arXiv:2511.18397; release posts/system cards — Opus 4.1 · Sonnet 4.5 card · Sonnet 4.5 · Sonnet 4.5 research · Haiku 4.5 card (PDF) · Haiku 4.5 · Opus 4.5 card · Opus 4.5 research · Opus 4.5 card walkthrough (secondary) · inoculation prompting post · emergent misalignment / reward hacking · “teaching Claude why” post · research page · Opus 4.6 · Opus 4.6 sabotage risk report (PDF) · Sonnet 4.6 · Opus 4.7 · Opus 4.8 · Fable 5 / Mythos 5 · Fable 5 & Mythos 5 card (PDF) · Mythos guardrails coverage (secondary) · Sonnet 5 · Sonnet 5 card · Sonnet 5 launch coverage (secondary) · Sonnet 5 launch guide (secondary) · AI organizations post
OpenAI — GPT-5 (2025-08-07) · GPT-5 system card (PDF) · GPT-5 for developers · safe-completions arXiv:2508.09224 · safe-completions post · GPT-5-Codex addendum · Codex system card (PDF) · GPT-5.1 · GPT-5.1 deployment safety · routing/model-choice post (secondary) · GPT-5.1-Codex-Max · Codex-Max system card · Codex-Max safety training · long-horizon Codex tasks · GPT-5.2 · GPT-5.2 for science/math · GPT-5.2 system-card update · GPT-5.2-Codex · GPT-5.3-Codex · 5.3-Codex system card · 5.3-Codex coverage (secondary) · GPT-5.4 · GPT-5.4 thinking system card · GPT-5.4 (secondary) · graders docs · RFT guide · RFT use-cases · RFT wind-down, 2026-05-08 · community thread (secondary)
Google DeepMind (Gemini) — Gemini 2.5 tech report arXiv:2507.06261 (HTML, §2.4/2.5 mirror, cross-checked (secondary)) · Deep Think launch · Gemini 3 Pro model card (PDF) · Gemini 3 launch · agent-building with Gemini 3 · Gemini 3 Deep Think · Deep Think update · Gemini 3 Flash · Flash for enterprise · Gemini 3.1 Pro · Gemini 3.5 · Vending-Bench/τ²-bench cross-check (secondary)
xAI (Grok) — Grok 4 · Grok 4 model card (PDF) · Grok 4 analysis (secondary) · Grok Code Fast 1 · Code Fast 1 model card (PDF) · Grok 4 Fast · Grok 4 Fast model card (PDF) · coverage (secondary) · coverage (secondary) · Grok 4.1 · Grok 4.1 model card (PDF) · sycophancy coverage (secondary) · Grok 4.1 Fast · news index
Mistral — Magistral arXiv:2506.10910 (HF paper page, benchmark tables) · Ministral 3 arXiv:2601.08584 · Mistral 3 blog · Magistral blog · Mistral-Large-3 card · Magistral-Small-2509 card · Magistral-Small-2507 card · Magistral Medium 1.2 docs
DeepSeek — V3 arXiv:2412.19437 · R1 arXiv:2501.12948 · V3.2 arXiv:2512.02556 (HTML) · V3.1 release · V3.1 card · V3.1-Terminus · V3.1-Terminus card · V3.2-Exp · V3.2-Exp repo · V3.2 / V3.2-Speciale
Qwen (Alibaba) — Qwen3 tech report arXiv:2505.09388 · GSPO arXiv:2507.18071 · Qwen3-Omni arXiv:2509.17765 · Qwen3-VL arXiv:2511.21631 · Qwen3-Coder-Next arXiv:2603.00729 · Qwen3.5-Omni arXiv:2604.15804 · GSPO blog · Qwen3-Coder blog · Qwen3 blog · Qwen3 README · Qwen3-Next efficiency · Qwen3-Max · Qwen3-Max-Thinking · Qwen3.5 blog · Qwen3.5-397B-A17B card · Qwen3.7-Max “agent frontier” · Qwen3.7 blog · The Batch coverage (secondary) · VentureBeat coverage (secondary) · TechSphere coverage (secondary) · Qwen3.6-35B-A3B agentic coding
Moonshot AI (Kimi) — K2 arXiv:2507.20534 (verified via arXiv abs + full HTML crawl) · K2.5 arXiv:2602.02276 (verified via arXiv abs + full HTML crawl) · Kimi-K2-Thinking card (no arXiv paper) · K2 Thinking intro post · Kimi-K2.6 card (no arXiv paper) · K2.6 tech blog · K2.6 benchmark deltas · K2.6 coverage (secondary) · K2.6 method non-disclosure (secondary) · Kimi-K2.5 model card/benchmarks
GLM / Z.ai (Zhipu) — GLM-4.5 arXiv:2508.06471 · GLM-5 arXiv:2602.15763 (HTML) · GLM-4.5 blog · GLM-4.6 blog · GLM-4.7 blog · GLM-5 blog · GLM-5.2 blog · GLM-4.5 repo · GLM-5 repo · slime RL infra repo · GLM-4.7-Flash card · MoE architecture deep-dive (secondary, unverified-primary) · agentic RL post citing GLM-5 report (secondary) · RL infra post citing GLM-5 report (secondary) · GLM-5.2 vs 5.1 (secondary) · GLM-5.2 open-source coverage (secondary) · GLM-4.7-Flash coverage (secondary)
Xiaomi (MiMo) — MiMo-V2-Flash arXiv:2601.02780 · MiMo-Embodied arXiv:2511.16518 · MiMo-VL-Miloco arXiv:2512.17436 · MiMo-Audio arXiv:2512.23808 · MiMo lineage (background only, outside the 12-month window): MiMo arXiv:2505.07608, MiMo-VL arXiv:2506.03569 · MiMo-V2.5-Pro card · MiMo-V2.5 card · MiMo-V2.5-Pro blog · MiMo model-update docs
The verified concept-map + genealogy notes in the shared memory pool (research/post-training-inference-concept-map.md, research/post-training-method-genealogy-onpolicy-offpolicy.md, research/frontier-lab-post-training-recipes-2026.md, research/rl-for-long-horizon-exploration-reasoning.md, research/diagnosing-capability-vs-execution-gap-framework.md) are the machine-readable companions to this book.
How this book grows
This is a living document maintained by the researcher seat of the llmresearch project. It grows one verified topic at a time; each chapter cites its sources so claims are checkable, not assertions.
Conventions
- Engineer-level. Assumes you know logprobs, KL, advantage, rollouts, MoE, PPO clip. No 101 filler.
- Cite or don’t claim. Every substantive statement carries an arXiv id or a named lab report/blog. Where something is contested, it’s marked contested with both sides (Contested edges).
- Honesty about status. Methods are tagged mainstream / niche / promising-not-proven / experimental based on whether a frontier flagship’s report actually uses them.
- Verified live. arXiv ids are crawl-checked; lab-recipe claims come from 2025–2026 tech reports and blogs, not training recall. Re-verify before betting a run — this field ships weekly.
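As a cheap pre-check before the live crawl described above, a reference list can be linted offline for malformed or impossible arXiv ids. The following is a minimal sketch, not the project's actual verification pipeline: the function name `lint_arxiv_ids` and the sample text are hypothetical, and it only understands new-style ids (YYMM.NNNNN, five-digit sequence numbers from 2015 onward). It cannot tell whether a well-formed id points at the paper the entry claims, so it complements the crawl rather than replacing it.

```python
import re
from datetime import date

# Anything that follows "arXiv:" and looks like an identifier.
CANDIDATE = re.compile(r"arXiv:([0-9A-Za-z./-]+)")
# New-style arXiv id: YYMM.NNNN or YYMM.NNNNN, optional version suffix.
NEW_STYLE = re.compile(r"(\d{2})(\d{2})\.(\d{4,5})(?:v\d+)?")


def lint_arxiv_ids(text: str, today: date) -> list[str]:
    """Return human-readable problems for arXiv ids found in `text`.

    Format-only check (hypothetical helper): it does not contact arXiv.
    """
    problems = []
    for match in CANDIDATE.finditer(text):
        ident = match.group(1).rstrip("./-")  # drop trailing punctuation
        parsed = NEW_STYLE.fullmatch(ident)
        if parsed is None:
            problems.append(f"{ident}: not a new-style arXiv id")
            continue
        yy, mm, seq = parsed.groups()
        year, month = 2000 + int(yy), int(mm)
        if not 1 <= month <= 12:
            problems.append(f"{ident}: month {mm} out of range")
        elif (year, month) > (today.year, today.month):
            problems.append(f"{ident}: dated after {today:%Y-%m}")
        elif year >= 2015 and len(seq) != 5:
            problems.append(f"{ident}: expected a 5-digit sequence number")
    return problems


if __name__ == "__main__":
    # Made-up sample: one valid id, then three deliberately broken ones.
    sample = (
        "ok arXiv:2504.13837 · bad month arXiv:2513.00001 · "
        "future arXiv:2701.00001 · short arXiv:2502.1844"
    )
    for problem in lint_arxiv_ids(sample, today=date(2026, 7, 2)):
        print(problem)
```

Passing `today` explicitly keeps the check reproducible: the same reference list linted against the same date always yields the same report.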
Build & run locally
# one-time: install the toolchain (macOS)
brew install mdbook mdbook-mermaid
# from the book root:
mdbook-mermaid install . # vendors mermaid assets + wires the preprocessor
mdbook serve --open # live-reload server at http://localhost:3000
Mermaid flowcharts and raw HTML/iframes (e.g. the embedded journey) render offline — no CDN required.
Log
- 2026-07-02 — v0.4. Wired three new chapters into the “Toward a frontier cybersecurity model” section, re-ordered to read as a coherent arc — organizing reframe first, then the two chapters it’s a prerequisite for, then the existing family→path→forks arc: The recipe is a sequence, not a pick (retires the “which technique” framing at the root — two explicit stage sequences, Sequence A from-scratch-foundation-model and Sequence B fine-tune-an-open-weight-dense-model [this project’s actual path], why order matters and stages compound rather than add, the synthetic-trajectory bootstrap and its off-policy execution-gap caveat, and a stage-wise evaluation protocol for Sequence B), Continued pretraining on an instruction-tuned model (can you run raw CPT directly on an already-instruct/RLHF’d checkpoint without destroying it — yes, but naive CPT-on-instruct reliably causes format/alignment collapse, not fact erasure; a six-technique preservation decision table; recommends CPT-on-base→re-instruct as the default with chat-vector reattachment as a cheap fallback; an IFEval+MMLU+domain-QA stage-boundary gate to verify it didn’t break), and Proven post-training datasets — a usage-cited registry (a ~60-dataset registry across instruction/chat SFT, tool/function-calling, preference, reasoning/CoT, willingness/refusal-calibration, and Chinese-labs/multilingual — every row proven-by-usage in a named shipped model/recipe, never a single-paper-only academic artifact, mapped onto Sequence B’s actual stage order). Cross-linked frontier-cyber-model-path.md and roadmap-inputs.md to both the recipe-sequence and dataset-registry chapters where their existing arguments (Stage 1 SFT/data synthesis, fork (b)’s SFT-now-vs-measure-first) are specific instances of the general point. Merged ~90 new citations into References via two new sections (“Post-training recipe as a sequence — order, compounding, synthetic-trajectory bootstrap” and “Continued pretraining on an instruction-tuned model — preservation techniques”) plus a new “Datasets (proven-by-usage)” subsection for the dataset-card/model-card links, deduped against the existing corpus.
- 2026-07-02 — v0.3. Wired three new chapters into the book: Before you train — instrumentation & data readiness (what the harness already emits vs. the minimal per-stage-verifier gap to instrument, grounded in a direct source read of go/libs/agent/events + the flag-verification pipeline), One problem, or many? — monolithic vs decomposed (the eval-decomposition-vs-training-decomposition split, the potential-based-shaping safety net, the verdict for this project), and Where you are & the forks ahead (the capstone — five forks, a dependency DAG, seven falsifiable hypotheses, placed last before References). Applied the project’s standing no-academic-cybersecurity-LLM-as-research-basis stance across all three: every CTF-Dojo/Cyber-Zero/Pentest-R1/HackSynth/AutoPenBench/DRLRM-PT-style citation is labelled “academic, cited for context — not a basis for our decisions,” with load-bearing claims re-anchored on frontier-lab disclosures, general RL/ML theory, or this project’s own measured data. Merged ~35 new citations into References (new Hierarchical RL/decomposition/reward-shaping section; additions to Exploration & entropy collapse, CTF/pentest RL environments, Agent benchmarks & failure taxonomies, and Capability boundary sections) and added the CTF/pentest-RL section’s context-only header note.
  Later same day — added the frontier north-star section. Wired the two standing capstone chapters into a new top-level section, “Toward a frontier cybersecurity model,” placed after “In practice” and before References: Cybersecurity is one of a family — what cracked the others (the cross-domain structural-analogy survey — six adjacent long-horizon/sparse-reward/verifiable domains and what actually cracked each), The path to a frontier cybersecurity model (the capstone recipe + gap analysis — what “frontier” costs beyond this project’s own portfolio), and moved Where you are & the forks ahead into this new section as its final chapter (out of “In practice”), completing the arc family → path → your forks. Cross-linked roadmap-inputs.md at top and in its Cross-links section to both new chapters. Merged the new chapters’ citations into References via two new sections — “Domain-specialization lineages (code/math/medical)” and “Adjacent-domain structural transfer” — ~65 new arXiv/DOI/PMC ids, deduped against the existing ~250-citation corpus, academic-security entries kept labelled context-only. Salvaged three of the higher-value ideas the standing academic-cybersecurity-LLM stance would otherwise have excluded, by re-grounding each on independent frontier-lab or general-RL-theory evidence instead: (1) staged/kill-chain reward shaping — salvaged as the theorem-backed potential-based form only (Ng-Harada-Russell, ICML 1999), never the flat per-stage bonus academic pentest-RL papers use; (2) subgoal/curriculum decomposition of a long episode — salvaged via DeepSeek-Prover-V2 and AlphaGeometry (cold-start SFT data generation only, never densifying the RL reward itself); (3) failure-corpus-to-curriculum conversion — salvaged via WebRL’s self-evolving curriculum and HER’s relabeling principle (mining flag=0 trajectories for sub-skill SFT data), not any academic CTF-RL paper’s claim.
- 2026-07-02 — v0.2. New Diagnosis section: Diagnosing the gap — a scientific framework (the pass@k crossover protocol, Cover@τ, sandbagging/elicitation tests — is the ~10-20% k=1 solve rate an execution gap or a knowledge gap, before betting a GRPO run on the answer) and From behavioral audit to training signal (maps the BSides-LV 5-pattern behavioral audit — tool avoidance, no methodology, brittle single-guess, uneven PTES phases, benchmarks-measure-speed-not-thoroughness — onto the specific post-training techniques designed to fix each). New method chapter RL that creates value — long-horizon · exploration · reasoning · novelty, a ~50-paper sweep tagged [L]/[E]/[R]/[N] against this project’s own diagnosis (GiGPO, DAPO/Clip-Cov, ProRL, ReTool/ToRL/Search-R1, CTF-Dojo/Pentest-R1/HackSynth, pass@k-is-diagnostic-not-objective). Extended reinforcement.md and agentic-rl.md with cross-links into the sweep. Rewrote What the frontier labs actually do with a full last-year (2025-07→2026-07) refresh across all 10 tracked labs, filling in previously-thin xAI/Grok, Mistral (Magistral/Ministral), Zhipu/GLM, Xiaomi/MiMo, and deepening Kimi K2→K2.5→K2.6, each now carrying the same [L]/[E]/[R]/[N] tags. Extended Contested edges and The decision with the new capability-boundary and sandbagging/elicitation literature. Merged ~130 newly-cited sources into References.
- 2026-07-02 — v0.1. Initial build from a live session: the on/off-policy foundation, the method genealogy (imitation/preference/reinforcement + agentic RL + PEFT), verified 2026 frontier-lab recipes, the method→data reframe, the decision tree, and contested edges. Embeds the interactive decision journey.
Backlog (next sessions)
- A worked example: filtering your ~100–200 solves into a rejection-sampling SFT set (the verified-trajectory pipeline).
- The reward-function chapter: building an ungameable verify(state) for CTF flags (state, not transcript).
- Pass@k methodology: per-challenge bucketing before choosing a branch.
- The train↔inference precision-mismatch rabbit hole (TIS vs FP16), for when you reach GRPO.
- Harness-shape coupling: ruling out “it’s the scaffold, not the model” before fine-tuning.
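Until the pass@k chapter lands, the per-challenge bucketing item above can be sketched roughly as follows. This is an illustration under stated assumptions, not the planned methodology: `pass_at_k` is the commonly used unbiased estimator (n attempts, c successes), while the bucket names, the 0.5 threshold, and the challenge ids in the example are all hypothetical placeholders.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts with c successes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k draw contains a success.
        return 1.0
    # 1 - P(all k attempts drawn without replacement are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


def bucket(n: int, c: int, reliable_at_1: float = 0.5) -> str:
    """Assign a challenge to a coarse bucket (labels are hypothetical)."""
    if c == 0:
        return "never_solved"       # no success in any sampled attempt
    if pass_at_k(n, c, 1) >= reliable_at_1:
        return "reliable"           # usually solved on a single attempt
    return "sometimes_solved"       # solvable, but not consistently


if __name__ == "__main__":
    # Made-up results: challenge id -> (attempts, successes).
    results = {"web-01": (16, 12), "pwn-07": (16, 2), "crypto-03": (16, 0)}
    for name, (n, c) in results.items():
        print(name, bucket(n, c), round(pass_at_k(n, c, 8), 3))
```

Bucketing per challenge before aggregating matters because a pooled solve rate hides the split: challenges that are sometimes solved and challenges that are never solved in any sample can call for different next steps, which is the distinction the diagnosis chapters draw between an execution gap and a knowledge gap.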