Agentic & multi-turn RL — the missing category
This is the category that was absent from the first pass of “the major methods,” and it’s the one that matters most for you, because a CTF agent is exactly this shape. It is not a new update rule — it still runs on GRPO/PPO/GSPO-family gradients — it’s a new training-loop shape: RL over multi-turn trajectories with live tools/environments in the loop (browser, code sandbox, MCP servers), instead of single-shot verifier-scored completions.
What changes vs. single-shot RLVR
- The rollout is an episode of tool-use, not one generation. Reward often arrives only at the end (flag captured / task complete) → a credit-assignment problem: which turn or tool-call earned the win or lost the run?
- The “dataset” is a live environment service, not a static file. You need rollout orchestration (sandboxes, tools, resets), not a JSONL.
- This is precisely the project’s “RL envs are the moat” thesis (
lessons/post-training/rl-envs-as-moat-between-providers.md, shared memory) — now independently corroborated as the frontier bet.
Verified production evidence (2026)
- OpenAI Deep Research (o3-based): “trained using end-to-end reinforcement learning on hard browsing and reasoning tasks” — a shipped product doing agentic RL over live tool use (openai.com/index/introducing-deep-research).
- Kimi K2 (Moonshot, open-weight 1T/32B-active MoE): headline post-training is a large-scale agentic data-synthesis pipeline + joint RL, where simulated tool environments generate the rollouts RL trains on (Kimi K2 tech report).
- Kimi K2.5 (2026): Agent Swarm trained with Parallel Agent Reinforcement Learning (PARL) — RL over cooperating multi-agent trajectories. This is the current frontier edge (Kimi K2.5 tech blog, 2026).
- Gemini 2.5: RL environments explicitly extended to “multi-step actions and tool use” (arXiv:2507.06261 §2.4).
- Anthropic: a 2026 lesson that safety training from chat-RLHF failed to generalize to agentic/tool-use settings, forcing explicit diversification into agentic environments (alignment.anthropic.com, “teaching Claude why”, 2026). Directly relevant: capability and alignment now have to be trained in the agentic loop, not chat.
What this means for your build
- Your harness already is the environment. The engineering surface is (a) a clean
verify(state)→{0,1}reward read from real environment state (not the transcript — see the confabulation gotcha in Contested edges), and (b) rollout orchestration to keep N episodes in flight. - Start where credit assignment is trivial (outcome-only flag reward on a solvable band), i.e. rejection-sampling FT → GRPO/RLVR on your own multi-turn trajectories, before reaching for dense per-turn shaping.
What is not mainstream yet
- Self-play (self-generated curricula / self-critique-as-opponent) appears only in niche academic work as of this pass — no confirmed frontier-lab production use. Watch, don’t bet.
Turn-level vs. trajectory-level credit assignment
Everything below exists because naively lifting GRPO/PPO from single-turn math/code RL to a ~100-turn tool-using agent breaks two assumptions simultaneously: (1) the “trajectory” is now dozens of LLM-generation turns interleaved with environment/tool observations, not one generation; (2) reward is terminal-only (flag verified or not) — so the vanilla group-mean baseline conflates credit across every turn equally, rewarding/penalizing an early exploratory enumeration turn exactly as much as the final exploit turn.
flowchart TB
subgraph Trajectory-level GRPO baseline
A1["turn 1<br/>enumerate"] --> A2["turn 2<br/>probe"] --> A3["...turn 60..."] --> A4["turn 61<br/>exploit"] --> A5["flag: 0/1"]
A5 -- "one advantage,<br/>broadcast to all turns" --> A1
A5 --> A2
A5 --> A3
A5 --> A4
end
flowchart TB
subgraph Turn-level credit GiGPO / turn-PPO family
B1["turn 1<br/>enumerate"] --> B2["turn 2<br/>probe"] --> B3["...turn 60..."] --> B4["turn 61<br/>exploit"] --> B5["flag: 0/1"]
B5 -- "episode advantage macro" --> B1
B4 -- "step/turn advantage micro<br/>via state-hash or turn-value" --> B4
end
One-line idea: GRPO (arXiv:2402.03300, DeepSeekMath, 2024-02-05) replaces PPO’s learned critic with a group-mean baseline over N samples of the same prompt — cheap, no critic, but implicitly one-reward-per-generation. GAE (arXiv:1506.02438, 2015-06-08) is the classical mechanism for trading bias/variance in the advantage estimate across a single trajectory’s timesteps — built for one agent-environment stream, not turn-vs-token double granularity. Full method table (with your reinforcement chapter’s GRPO/GSPO/DAPO baseline) below.
GiGPO — step-level advantage with zero extra rollouts [L] [E]
arXiv:2505.10978 (Feng, Xue, Liu, An; 2025-05-16, NeurIPS 2025 poster).
- Problem: vanilla GRPO computes one advantage per whole trajectory — a good early enumeration step and a lucky late guess get identical credit.
- Key idea: two nested groupings. (1) Episode-level group — N full rollouts, GRPO-style trajectory advantage (macro: “was this whole run good?”). (2) Step-level group — hash each
(state, step)pair, bucket steps that recur across trajectories into anchor groups, compute a second advantage from “what happened next” conditioned on that shared state (micro credit) — no new network, no extra rollouts. - Loop delta: after collecting the usual GRPO batch, add a step-indexing pass →
advantage = episode_advantage + λ · step_advantage→ feed into the same PPO-clipped update. - Hyperparameters that matter: state-hashing granularity (too coarse → false matches; too fine → anchor groups collapse to size 1) and the mixing weight λ.
- Gotcha for CTF: built for hashable/discrete states (web pages, grid cells). A CTF agent’s state is unbounded free text (shell stdout, HTTP bodies) — you need a canonicalization step, e.g. hash on
(tool_name, normalized_target, response_status_class)rather than raw text, or anchor groups never fire.
Designed to fix: pattern #4 (uneven PTES phases — 62% of failures stall in exploitation). Step-level credit is the mechanism that stops the 100-turn trajectory advantage from diluting the exploitation-phase steps that actually decide the run.
Given the harness already emits structured tool_call/tool_exec_ms spans (harness-observability-contract-2026-06.md), a state key from those spans is the cheapest first transplant in this whole chapter — no critic, no infra, just a canonicalization function.
ArCHer — the two-time-scale ancestor [L]
arXiv:2402.19446 (Zhou, Zanette, Pan, Levine, Kumar; 2024-02-29).
- Key idea: a hierarchy — a turn-level off-policy critic (TD-learned, “how good is this utterance given the conversation-so-far”) and a token-level on-policy policy gradient bootstrapped off that turn value instead of only the final episode reward. Decouples “which turn was good” (dozens of decisions) from “which token was good” (thousands).
- Ablation that matters: the sample-efficiency gap over flat single-critic baselines widens with horizon length — the regime you’re in.
- Gotcha: off-policy turn-level value learning reintroduces the value-overestimation instability the project’s no-critic GRPO preference was designed to avoid. Treat ArCHer as the theoretical justification for “turn is the right unit,” not a recipe to implement wholesale — prefer GiGPO’s critic-free step-grouping or Verlog’s dual-discount GAE (below) for the same decomposition without a learned off-policy critic.
RAGEN / StarPO — naming the collapse [E] [L]
arXiv:2504.20073 (Wang et al.; 2025-04-24).
- What it is: a diagnostic framework, not primarily a new algorithm. StarPO formalizes trajectory-level agent RL over whole
(state, think, action, reward)rollouts, then uses the RAGEN testbed to empirically show what breaks when you train “the naive way.” - Central finding — the “Echo Trap”: the policy’s reasoning traces converge to a small set of repeated, low-diversity patterns that keep scoring reward on the training distribution while generalization/exploration collapses. This is entropy collapse, named and reproduced across four stylized environments — not a one-off.
- Fix pattern reported: reward normalization across turns (so no single dominant reward source drowns out exploration signal) + explicit rollout-diversity interventions, triggered by monitoring entropy/reward-variance, not epoch count.
Designed to fix: pattern #2 (react-and-guess, no methodology — 82% pivot-after-failure). An entropy-collapsed policy is one direct explanation for why an agent stops trying diverse enumeration strategies and converges to a small guess-repertoire.
This is the citation for the project’s own stated plan — “watch policy entropy as the trigger to graduate from SFT to GRPO.” Concretely: instrument per-turn action-entropy (or a diversity metric over tool-call sequences) during RL and treat a converging curve as the operational graduation/intervention signal, mirroring RAGEN’s diagnosis instead of re-deriving it live mid-run. See also reinforcement.md’s “Exploration and entropy” section for the DAPO Clip-Higher / Dr. GRPO mechanisms that patch this at the single-turn level — RAGEN is the multi-turn-specific diagnosis of the same underlying failure.
The 2025–26 turn-level GRPO-variant cluster [L] [E]
A fast-moving cluster of near-simultaneous papers attacking the same problem — “turn is the unit of advantage, not trajectory or token” — with distinct mechanisms. Treat as one converging-consensus finding, not N competing final answers; none individually crosses into [N] (they refine existing capability rather than expand the boundary).
| Paper | arXiv | Mechanism | Confidence |
|---|---|---|---|
| Turn-Level Reward Design & Credit Assignment | 2505.11821 | Dense per-turn reward terms layered on the terminal outcome reward | Promising |
| Turn-PPO | 2512.17008 | Argues GRPO’s group-relative clip “exposes notable limitations” at long horizon; goes back to a per-turn PPO value function | Promising |
| TL-GRPO | 2601.16480 | Turn credit for same-state-revisited tasks (iterative code repair); narrower than general multi-turn | Promising |
| A2TGPO | 2605.06200 | Adaptive per-turn clip range — early vs. near-terminal turns have different advantage-magnitude distributions | Promising |
| Proximity-Based MTO | 2602.19225 | Weight credit by task difficulty, not just turn position | Promising |
| GAGPO | 2605.13217 | GAE-style λ-discounted advantage synthesized into GRPO’s group-relative framework | Promising |
What this means for the CTF agent: don’t pick one paper as “the” answer — prototype the cheapest shared mechanism first: GiGPO’s step-grouping (zero extra infra) or a straightforward per-turn shaping term (2505.11821’s approach) before reaching for a second value network (Turn-PPO / TL-GRPO). Given the project’s ground-truth-only reward constraint (see below), any dense intermediate signal must stay a shaping term added to, not a replacement of, the terminal verifier reward.
Reward shaping for a sparse terminal reward — without reopening the confabulation bug
The project already has a hard rule: reward must be ground-truth flag-verified, never format/regex-matched — SFT-induced FLAG{} confabulation was a real observed failure (lessons/post-training/sft-induced-flag-confabulation.md). Every reward-shaping idea in this section has to be read through that constraint.
-
Keep the terminal signal as ground truth, add density, don’t replace it. The turn-level cluster above (2505.11821 in particular) explicitly diagnoses that “sparse outcome rewards… lack dense intermediate signals across multiple decision steps” — the fix is injecting per-turn shaping alongside the verifier, e.g. reward tool-call progress (new open port found, new endpoint discovered, new credential recovered) as a small dense bonus, while the flag check remains the only source of the large terminal reward. A shaped proxy that can be gamed (e.g. “reward finding any string that looks like a flag”) reopens exactly the confabulation failure already logged — the shaping term must be read from verifiable environment state, same discipline as the terminal check.
-
Mask tool/environment output tokens from the policy-gradient loss. Search-R1 (arXiv:2503.09516) masks retrieved-content tokens out of the loss — you don’t want to reinforce/penalize text the environment produced, only the model’s own query/action-generation tokens. Direct transplant, not optional: shell stdout, HTTP response bodies, scan output must never enter the policy-gradient loss, only the agent’s own tool-call arguments and reasoning tokens. Easy to miss when standing up a GRPO/RLVR loop on top of an SFT-warmed policy — this is a correctness bug, not a design choice.
Designed to fix: pattern #1 (agents prefer raw curl/shell — 87.7% of tool calls bypass the provided surface). Search-R1 is direct evidence that RL specifically over which tool/query to issue is tractable in a comparable regime (search-engine calls) — supports RL (not just better prompting) as the lever for pattern #1.
-
Bootstrap a value estimate at truncation instead of
reward = 0. Verlog (below) proposes trajectory early truncation with a value-function bootstrap rather than waiting for the terminal reward — directly relevant since episodes cap at ~100 turns: today a hard timeout presumably returns zero reward for a run that made real, unfinished progress. Recovering partial-progress signal from failed-but-in-progress attempts is a second lever on the same sparse-terminal-reward problem, distinct from per-turn shaping. -
Plan exploration as its own object. PEARL (arXiv:2601.20439) treats which tools, in what order as something to explore/RL over, not just the final answer.
Designed to fix: pattern #2 (no methodology / enumeration). “Plan exploration” is a formal mechanism for rewarding systematic tool-sequencing instead of react-and-guess.
Why a 100-turn CTF is the hard case
Stack the constraints and the difficulty compounds — this is the “hard case” every technique above is implicitly being stress-tested against:
| Constraint | Why it bites at ~100 turns | Technique that targets it |
|---|---|---|
| Reward is terminal-only | Credit for the winning exploit turn gets diluted across ~100 turns of a flat trajectory-level advantage | GiGPO, turn-level cluster |
| Unbounded free-text state (shell/HTTP, not pixels/grid) | State-hashing methods built for discrete environments (web pages, grid worlds) don’t transplant for free | GiGPO — needs a bespoke canonicalization step |
| Variable episode length | Batched training wastes GPU cycles on padding/idle time when some rollouts finish in 10 turns and others run to 100 | Verlog’s early truncation |
| Long context growing every turn | Full transcript in every prompt overloads context retrieval well before turn 100 | Verlog’s customizable agent memory (windowed history) |
| On-policy exploration required, but entropy collapses under naive multi-turn RL | The “Echo Trap” — repeated low-diversity patterns keep scoring reward while exploration dies | RAGEN/StarPO’s entropy-as-trigger diagnosis |
| Exploitation phase, not enumeration, is where 62% of failures stall | A flat advantage rewards enumeration and exploitation turns identically, so neither gets a sharpened gradient | GiGPO step-groups, BSides pattern #4 |
Verlog — the only technique benchmarked past 100 turns [L] [E]
No arXiv id — OpenReview only (NeurIPS 2025 MTI-LLM workshop poster, openreview.net/forum?id=GmodkWwMV3) + project blog (wentsechen.github.io/Verlog_blogpost), Chen/Chen/Zhu/Schneider. Confidence: Promising — cite via OpenReview, do not fabricate an arXiv id.
Three mechanisms, all aimed at the “three failure modes of long-horizon agentic RL” the paper names explicitly: overloaded context, sparse terminal reward, variable trajectory length wasting GPU cycles.
- Customizable agent memory — a flexibly-sized history window per turn, decoupling “how much context the policy sees” from “how many turns the episode has run.”
- Dual-discounting GAE — two separate discount factors,
γ_step(turn-to-turn credit decay) andγ_token(within-turn token credit decay), instead of one GAE discount applied uniformly. Direct generalization of ArCHer’s two-time-scale idea, implemented inside GAE instead of a separate off-policy critic. - Trajectory early truncation — cuts long rollouts short during training and substitutes a value-bootstrap for the missing terminal reward, to cut GPU idle time from variance in episode length.
Scale claim: the blog states prior frameworks (VeRL, RAGEN) handle ~10-turn tasks, verl-agent scales to ~50, and Verlog targets 400+ turn episodes (Crafter, 70–400 steps, avg ~190) — the only technique in this thread validated longer than your ~100-turn ceiling.
What I’d change in your pipeline: the dual-discounting GAE split (γ_step vs γ_token) is a single hyperparameter change layered onto whatever advantage code the training loop already has, no new critic beyond what GAE needs — the most directly answerable “what would you change” in this whole file. Flag honestly: workshop-poster + blog source, not a peer-reviewed arXiv preprint.
The RL-framework landscape (verl-agent / VerlTool / RAGEN)
Framework choice is an infra decision, not a research-finding one — flagged here because it gates which of the mechanisms above you can actually run without building rollout orchestration from scratch. Cross-reference against the harness/GPU-economics material once the framework-choice chapter from this research sweep lands (see cross-links below).
- verl-agent (
github.com/langfengq/verl-agent) — open-source agent-RL extension of veRL. No standalone arXiv paper; cite as infra, not a research claim. Scales to ~50-turn tasks per Verlog’s own comparison. - VerlTool — “Towards Holistic Agentic Reinforcement Learning with Tool Use” (arXiv:2509.01055) — surfaced in this research pass but not independently abs-page-verified; confirm before citing as settled.
- RAGEN (
github.com— see StarPO above) — the modular multi-environment testbed the Echo Trap diagnosis was built on; useful as a reference implementation for entropy/diversity monitoring, not just the paper. - “Demystifying RL for Long-Horizon Tool-Using Agents” (arXiv:2603.21972, Wu et al., 2026-03-23)
[L][R]— the closest thing to a systematic “what to tune first” ablation study, decomposing the design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, environment design. Use this axis framing when triaging which lever to pull first on your own pipeline. Confidence: Promising (0 citations, <4 months old at verify-time, but methodologically the most comprehensive single source found).
Domain-adjacent: RL for CTF / pentesting agents directly
Academic, cited for context — not a basis for our decisions. CTF-Dojo, Pentest-R1, STRIATUM-CTF, and HackSynth are academic domain-specific CTF/pentest training or benchmark papers; none produced a frontier cybersecurity model, so none of them is load-bearing for any conclusion, recipe, or number below — they’re listed only so the researcher knows what already exists in the academic literature before presenting a technique here as novel.
- CTF-Dojo (arXiv:2508.18370, Zhuo et al., 2025-08-25)
[R]— “the first large-scale executable runtime tailored for training LLMs with verifiable feedback” for CTF-style tasks: 658 Docker-containerized challenges with ground-truth verified feedback. Context only — see re-grounding below for why verifier-grounded execution environments are the right substrate. - Pentest-R1 (arXiv:2508.07382, He Kong et al., 2025-08-10)
[L][R]— two-stage RL pipeline for autonomous pentesting reasoning, trained on 500+ real-world multi-step walkthroughs. Context only, unread beyond abstract — do not use it to lock the project’s own reward-shaping design; if a reward-design decision needs a citation, it must come from the general RLVR/reward-shaping literature (Ng/Harada/Russell potential-based shaping, PURE/MONA reward-hacking literature) or this project’s own measured data, not from Pentest-R1. - STRIATUM-CTF (arXiv:2603.22577, Hugglestone et al., 2026-03-23)
[R]— MCP-standardized agentic framework for general-purpose CTF solving, targeting “multi-step, stateful reasoning” as the gap static benchmarks miss. Context only — it does not itself do turn-level RL and is not used here to justify GiGPO’s design (that justification stands entirely on GiGPO’s own paper, arXiv:2505.10978, which motivates state-hashed step-groups from first principles, no CTF-specific evidence required). - HackSynth (arXiv:2506.02048, Muzsai, Imolai, Lukács, 2025-06-01)
[R][E]— fine-tunes a tool-augmented Llama-3.1-8B via vanilla, trajectory-level GRPO on a procedurally-generated crypto-CTF dataset. Context only — not used as evidence for the “start with vanilla GRPO” recommendation below; that recommendation is re-grounded independently.
Re-grounded recommendation (start vanilla, add turn-level machinery only if needed): this is supported by the general empirical ablation in “Demystifying RL for Long-Horizon Tool-Using Agents” (arXiv:2603.21972, already cited above, domain-general not CTF-specific), whose 5-axis decomposition (reward shaping, model scaling, data composition, algorithm selection, environment design) treats algorithm choice as one axis to tune after establishing a working baseline — plus this project’s own handbook rule that GRPO baseline must first land in the 30–60% signal band (memory/handbook.md §10) before any additional machinery is justified. Re-grounded substrate claim: that verifier-grounded execution environments (not format/regex reward) are the right substrate is this project’s own confirmed constraint, not an inference from CTF-Dojo — see the ground-truth-flag-verified rule and the SFT-induced FLAG{} confabulation failure (lessons/post-training/sft-induced-flag-confabulation.md), which is the actual basis.
The live gap: no paper in this pass demonstrates turn-level credit assignment (GiGPO/Verlog-class) applied to an offensive-security/CTF domain specifically. Transplanting GiGPO/Verlog mechanisms to CTF is this project’s own contribution to make, not something to find pre-solved — and, per the standing rule, not something to validate by reference to the academic CTF papers above.
Tool-integrated reasoning RL: ReTool, ToRL, Search-R1
These three share a mechanism — RL over an interleaved reason+tool-call+observation loop — but differ in domain (math/code-interpreter vs. search). Relevant because the harness is a tool-integrated-reasoning loop (shell, HTTP, scanning tools).
- ReTool (arXiv:2504.11536, Feng et al., 2025-04-15)
[R][E]— interleaves real-time code-interpreter execution inside the reasoning trace, and trains the interleaving policy with RL. The rollout is no longer “generate full CoT then maybe call a tool” — the policy learns when to interrupt its own reasoning to invoke a tool and resume conditioned on the result; credit must flow through that interruption boundary. Domain is math — direct transplant to CTF tool-calling (when to runcurlvs. reason further) is analogous but unvalidated in this domain. - ToRL (arXiv:2503.23383, Li, Zou, Liu, 2025-03-30)
[E][N]— the strongest[N]citation in this thread: pure RL (no SFT-on-tool-traces warmup) teaches autonomous tool invocation and reports emergent strategic tool-use behaviors absent from the SFT-only baseline, beating the best tool-integrated-reasoning model on AIME’24 by double digits. This is the same RL > SFT distinction the project’s diagnosis already leans on (“SFT replaces, GRPO amplifies,” arXiv:2507.10616) but pushed further — RL-from-scratch surfacing qualitatively new patterns is closer to boundary-expansion than amplification. Domain is math tool-use (Python), not offensive security — treat the emergent-behavior claim as suggestive, not proven, here. - Search-R1 (arXiv:2503.09516, Jin et al., 2025-03-12)
[R][E]— the loss-masking detail (mask tool/environment output tokens from the policy gradient) covered above under reward shaping; also BSides pattern #1 evidence.
Domain gap to flag honestly: all three are validated in math/search domains with much shorter tool-call chains than a 100-turn CTF episode — the mechanism (interleave, mask, learn-when-to-call) transfers; the specific hyperparameters (how often to call, reward magnitude per call) don’t, and need re-deriving on your own verifier-based reward.
Summary table
The four rows below marked “context only” (CTF-Dojo, Pentest-R1, STRIATUM-CTF, HackSynth) are academic domain-specific CTF/pentest papers — cited for landscape awareness, not as a basis for any conclusion/recipe/number in this file (see re-grounding above).
| Technique | arXiv (verified unless noted) | Tags | BSides pattern | Confidence | One-line takeaway |
|---|---|---|---|---|---|
| GRPO (baseline) | 2402.03300 | [L] | — | Verified | Group-mean baseline, no critic — trajectory-level only |
| GAE (baseline) | 1506.02438 | [L] | — | Verified | Classical multi-step advantage; single time-scale |
| GiGPO | 2505.10978 | [L][E] | #4 | Verified | Step-level advantage via state-hash groups, zero extra rollouts |
| ArCHer | 2402.19446 | [L] | — | Verified | Turn-level off-policy critic + token-level on-policy — heavier infra |
| RAGEN / StarPO | 2504.20073 | [E][L] | #2 | Verified | Names & diagnoses the “Echo Trap” entropy collapse |
| Turn-Level Reward Design | 2505.11821 | [L][E] | #4 | Promising | Dense per-turn reward layered on sparse terminal reward |
| Turn-PPO | 2512.17008 | [L] | — | Promising | Argues PPO turn-value beats GRPO group-baseline at long horizon |
| TL-GRPO | 2601.16480 | [L] | — | Promising | Turn credit for same-state-revisited (iterative) tasks |
| A2TGPO | 2605.06200 | [L] | — | Promising | Adaptive per-turn PPO clip range |
| Proximity-Based MTO | 2602.19225 | [L] | — | Promising | Weight credit by task difficulty, not just position |
| GAGPO | 2605.13217 | [L] | — | Promising | GAE-style generalized advantage inside GRPO grouping |
| Verlog | no arXiv — OpenReview | [L][E] | — | Promising | Dual-discount GAE + memory-windowing + early truncation; 400+ turn scale |
| Demystifying RL for Long-Horizon Tool Agents | 2603.21972 | [L][R] | — | Promising | 5-axis empirical recipe: reward/scale/data/algorithm/env |
| PEARL | 2601.20439 | [E][R] | #2 | Promising | RL over the planning/tool-sequencing step itself |
| ReTool | 2504.11536 | [R][E] | — | Verified | Interleaved code-exec reasoning, RL over interruption points |
| ToRL | 2503.23383 | [E][N] | — | Verified | RL-from-scratch surfaces emergent tool-use strategies |
| Search-R1 | 2503.09516 | [R][E] | #1 | Verified | Mask tool-output tokens from policy-gradient loss |
| CTF-Dojo (context only) | 2508.18370 | [R] | — | Verified, academic — not a basis | 658-challenge verifier-grounded CTF RL environment |
| Pentest-R1 (context only) | 2508.07382 | [L][R] | — | Verified (needs deeper read), academic — not a basis | Two-stage RL for pentest reasoning |
| STRIATUM-CTF (context only) | 2603.22577 | [R] | #4 | Promising, academic — not a basis | MCP-standardized stateful CTF agent framework |
| HackSynth (crypto CTF) (context only) | 2506.02048 | [R][E] | — | Verified, academic — not a basis | Vanilla GRPO already works on narrow (crypto) CTF |
| VerlTool | 2509.01055 | — | — | Unverified (flag before citing) | Holistic agentic-RL-with-tool-use framework |
Open questions for the next research pass
- No paper in this pass demonstrates turn-level credit assignment (GiGPO/Verlog-class) applied to an offensive-security/CTF domain specifically — transplanting these mechanisms to CTF is this project’s own contribution to make, not something to find pre-solved.
- Verlog has no arXiv paper — only an OpenReview NeurIPS-workshop submission and project blog. Cite the OpenReview id, not a fabricated arXiv id, and flag it as workshop-tier evidence.
- VerlTool (arXiv:2509.01055) surfaced in search but was not independently verified via abs-page crawl — confirm before citing as settled.
- Pentest-R1’s exact reward/credit design was not deep-dived here (abstract only). Per the project’s standing rule, it is context/landscape awareness only, never a basis for the project’s own reward-shaping design — that design must be re-grounded on general RLVR/reward-shaping theory or this project’s own measured data, not on Pentest-R1 or any other academic CTF/pentest paper.
Cross-links
- Reinforcement — PPO · GRPO · RLVR — the base algorithm (and its own “Exploration and entropy” section — DAPO Clip-Higher, Dr. GRPO) every technique in this chapter modifies or wraps.
- Imitation — SFT · rejection sampling — the pre-RL stage this project runs first; RAGEN’s Echo Trap is precisely the risk that activates once you graduate past it.
- Contested edges & landmines — the flag-confabulation gotcha, the RFT terminology trap, and the “RL can’t create capability” debate that ToRL’s
[N]evidence bears on directly. - Frontier recipes — Kimi K2.5’s PARL and OpenAI Deep Research’s end-to-end agentic RL, cited above as production evidence, are detailed there per-lab.
- Framework-choice / GPU-economics chapter and the frontier-lab sweep chapter (this overnight research batch): once those land in
SUMMARY.md, wire a link here for the verl-agent/VerlTool infra choice and the cross-lab RL-recipe comparison — their file names weren’t finalized at write-time for this chapter; the integrator should confirm the actual paths and slot them into this section.