Agentic & multi-turn RL — the missing category

This is the category that was absent from the first pass of “the major methods,” and it’s the one that matters most for you, because a CTF agent is exactly this shape. It is not a new update rule — it still runs on GRPO/PPO/GSPO-family gradients — it’s a new training-loop shape: RL over multi-turn trajectories with live tools/environments in the loop (browser, code sandbox, MCP servers), instead of single-shot verifier-scored completions.

What changes vs. single-shot RLVR

  • The rollout is an episode of tool-use, not one generation. Reward often arrives only at the end (flag captured / task complete) → a credit-assignment problem: which turn or tool-call earned the win or lost the run?
  • The “dataset” is a live environment service, not a static file. You need rollout orchestration (sandboxes, tools, resets), not a JSONL.
  • This is precisely the project’s “RL envs are the moat” thesis (lessons/post-training/rl-envs-as-moat-between-providers.md, shared memory) — now independently corroborated as the frontier bet.

Verified production evidence (2026)

  • OpenAI Deep Research (o3-based): “trained using end-to-end reinforcement learning on hard browsing and reasoning tasks” — a shipped product doing agentic RL over live tool use (openai.com/index/introducing-deep-research).
  • Kimi K2 (Moonshot, open-weight 1T/32B-active MoE): headline post-training is a large-scale agentic data-synthesis pipeline + joint RL, where simulated tool environments generate the rollouts RL trains on (Kimi K2 tech report).
  • Kimi K2.5 (2026): Agent Swarm trained with Parallel Agent Reinforcement Learning (PARL) — RL over cooperating multi-agent trajectories. This is the current frontier edge (Kimi K2.5 tech blog, 2026).
  • Gemini 2.5: RL environments explicitly extended to “multi-step actions and tool use” (arXiv:2507.06261 §2.4).
  • Anthropic: a 2026 lesson that safety training from chat-RLHF failed to generalize to agentic/tool-use settings, forcing explicit diversification into agentic environments (alignment.anthropic.com, “teaching Claude why”, 2026). Directly relevant: capability and alignment now have to be trained in the agentic loop, not chat.

What this means for your build

  • Your harness already is the environment. The engineering surface is (a) a clean verify(state)→{0,1} reward read from real environment state (not the transcript — see the confabulation gotcha in Contested edges), and (b) rollout orchestration to keep N episodes in flight.
  • Start where credit assignment is trivial (outcome-only flag reward on a solvable band), i.e. rejection-sampling FT → GRPO/RLVR on your own multi-turn trajectories, before reaching for dense per-turn shaping.
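A minimal sketch of the two engineering surfaces above, under stated assumptions: `EnvState`, `run_episode`, and `collect` are hypothetical names, and the multi-turn rollout is replaced by a stand-in. It shows a `verify` that reads environment state only (there is no transcript argument to confabulate into) and a semaphore that caps how many episodes are in flight.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class EnvState:
    """Hypothetical snapshot of real environment state, not the transcript."""
    captured_flags: set = field(default_factory=set)


def verify(state: EnvState, expected_flag: str) -> int:
    """Binary outcome reward, read from environment state only."""
    return 1 if expected_flag in state.captured_flags else 0


async def run_episode(task_id: int, sem: asyncio.Semaphore) -> int:
    async with sem:  # cap the number of concurrent episodes
        await asyncio.sleep(0)  # stand-in for the real multi-turn tool-use rollout
        state = EnvState()
        if task_id % 2 == 0:  # toy outcome: even-numbered tasks capture the flag
            state.captured_flags.add(f"FLAG{{{task_id}}}")
        return verify(state, f"FLAG{{{task_id}}}")


async def collect(n_tasks: int, max_in_flight: int) -> list:
    sem = asyncio.Semaphore(max_in_flight)
    return await asyncio.gather(*(run_episode(i, sem) for i in range(n_tasks)))


rewards = asyncio.run(collect(n_tasks=6, max_in_flight=3))
```

In a real harness the stand-in would be replaced by sandbox creation, the agent loop, and a reset; the reward path stays the same.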

What is not mainstream yet

  • Self-play (self-generated curricula / self-critique-as-opponent) appears only in niche academic work as of this pass — no confirmed frontier-lab production use. Watch, don’t bet.

Turn-level vs. trajectory-level credit assignment

Everything below exists because naively lifting GRPO/PPO from single-turn math/code RL to a ~100-turn tool-using agent breaks two assumptions simultaneously: (1) the “trajectory” is now dozens of LLM-generation turns interleaved with environment/tool observations, not one generation; (2) reward is terminal-only (flag verified or not), so the vanilla group-mean baseline spreads credit equally across every turn, rewarding or penalizing an early exploratory enumeration turn exactly as much as the final exploit turn.

flowchart TB
    subgraph Trajectory-level GRPO baseline
    A1["turn 1<br/>enumerate"] --> A2["turn 2<br/>probe"] --> A3["...turn 60..."] --> A4["turn 61<br/>exploit"] --> A5["flag: 0/1"]
    A5 -- "one advantage,<br/>broadcast to all turns" --> A1
    A5 --> A2
    A5 --> A3
    A5 --> A4
    end
flowchart TB
    subgraph Turn-level credit GiGPO / turn-PPO family
    B1["turn 1<br/>enumerate"] --> B2["turn 2<br/>probe"] --> B3["...turn 60..."] --> B4["turn 61<br/>exploit"] --> B5["flag: 0/1"]
    B5 -- "episode advantage macro" --> B1
    B4 -- "step/turn advantage micro<br/>via state-hash or turn-value" --> B4
    end

One-line idea: GRPO (arXiv:2402.03300, DeepSeekMath, 2024-02-05) replaces PPO’s learned critic with a group-mean baseline over N samples of the same prompt — cheap, no critic, but implicitly one-reward-per-generation. GAE (arXiv:1506.02438, 2015-06-08) is the classical mechanism for trading bias/variance in the advantage estimate across a single trajectory’s timesteps — built for one agent-environment stream, not turn-vs-token double granularity. Full method table (with your reinforcement chapter’s GRPO/GSPO/DAPO baseline) below.
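To make the "one advantage, broadcast to all turns" problem concrete, here is a small sketch of a group-relative advantage: each rollout's terminal reward is centred on the group mean and divided by the group standard deviation, then copied to every turn of that rollout. Function names are illustrative, and the `eps` guard is an assumption added to avoid division by zero.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: one scalar per rollout of the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_turns(advantage, n_turns):
    """Trajectory-level credit: every turn receives the same advantage."""
    return [advantage] * n_turns


group_rewards = [1, 0, 0, 1]  # terminal flag outcomes for N=4 rollouts
advs = grpo_advantages(group_rewards)
per_turn = broadcast_to_turns(advs[0], n_turns=61)
```

Every entry of `per_turn` is identical, so the enumeration turn and the exploit turn of a winning run get exactly the same gradient weight.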

GiGPO — step-level advantage with zero extra rollouts [L] [E]

arXiv:2505.10978 (Feng, Xue, Liu, An; 2025-05-16, NeurIPS 2025 poster).

  • Problem: vanilla GRPO computes one advantage per whole trajectory — a good early enumeration step and a lucky late guess get identical credit.
  • Key idea: two nested groupings. (1) Episode-level group — N full rollouts, GRPO-style trajectory advantage (macro: “was this whole run good?”). (2) Step-level group — hash each (state, step) pair, bucket steps that recur across trajectories into anchor groups, compute a second advantage from “what happened next” conditioned on that shared state (micro credit) — no new network, no extra rollouts.
  • Loop delta: after collecting the usual GRPO batch, add a step-indexing pass → advantage = episode_advantage + λ · step_advantage → feed into the same PPO-clipped update.
  • Hyperparameters that matter: state-hashing granularity (too coarse → false matches; too fine → anchor groups collapse to size 1) and the mixing weight λ.
  • Gotcha for CTF: built for hashable/discrete states (web pages, grid cells). A CTF agent’s state is unbounded free text (shell stdout, HTTP bodies) — you need a canonicalization step, e.g. hash on (tool_name, normalized_target, response_status_class) rather than raw text, or anchor groups never fire.
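The loop delta above can be sketched as follows. This is an illustration of the two-level idea under simplifying assumptions (terminal-only reward, precomputed state keys, singleton anchor groups contributing zero), not the GiGPO authors' implementation; all function names and the toy state keys are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize(values, eps=1e-6):
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def discounted_returns(n_steps, terminal_reward, gamma):
    # Terminal-only reward: the return at step t is gamma^(T-1-t) * reward.
    return [terminal_reward * gamma ** (n_steps - 1 - t) for t in range(n_steps)]


def two_level_advantages(trajectories, rewards, lam=1.0, gamma=0.95):
    """trajectories: one list of state keys per rollout; rewards: one terminal reward each."""
    episode_adv = normalize(rewards)  # macro: was this whole run good?
    groups = defaultdict(list)  # state key -> [(rollout index, step index, return)]
    for i, (keys, r) in enumerate(zip(trajectories, rewards)):
        returns = discounted_returns(len(keys), r, gamma)
        for t, (key, ret) in enumerate(zip(keys, returns)):
            groups[key].append((i, t, ret))
    step_adv = [[0.0] * len(keys) for keys in trajectories]
    for members in groups.values():
        if len(members) < 2:  # an anchor group of size 1 carries no signal
            continue
        for (i, t, _), a in zip(members, normalize([m[2] for m in members])):
            step_adv[i][t] = a  # micro: what happened next from this shared state?
    return [[episode_adv[i] + lam * s for s in step_adv[i]]
            for i in range(len(trajectories))]


trajs = [["recon", "login", "exploit"],
         ["recon", "login", "dead_end"],
         ["recon", "fuzz"]]
adv = two_level_advantages(trajs, rewards=[1, 0, 0])
```

In the toy batch, the shared `login` state gets extra positive credit in the winning run and extra negative credit in the losing run, while states seen only once fall back to the episode-level advantage.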

Designed to fix: pattern #4 (uneven PTES phases — 62% of failures stall in exploitation). Step-level credit is the mechanism that stops the 100-turn trajectory advantage from diluting the exploitation-phase steps that actually decide the run.

Given the harness already emits structured tool_call/tool_exec_ms spans (harness-observability-contract-2026-06.md), a state key from those spans is the cheapest first transplant in this whole chapter — no critic, no infra, just a canonicalization function.
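A canonicalization function of that kind might look like the sketch below. The span field names (`tool_name`, `target`, `status`) are assumptions for illustration, not the harness's actual span schema, and the normalization rules (lowercase host, drop query string and trailing slash, bucket status by hundreds) are one possible granularity choice.

```python
import hashlib
from urllib.parse import urlsplit


def status_class(status):
    """Bucket an HTTP-like status code into 2xx/3xx/4xx/5xx."""
    return f"{status // 100}xx" if status is not None else "none"


def normalize_target(target):
    """Reduce a URL or host:port/path string to host[:port]/path."""
    parts = urlsplit(target if "://" in target else f"//{target}")
    host = (parts.hostname or "").lower()
    port = f":{parts.port}" if parts.port else ""
    path = parts.path.rstrip("/") or "/"
    return f"{host}{port}{path}"


def state_key(span):
    """Hash (tool_name, normalized_target, response_status_class) into a short key."""
    raw = "|".join([
        span["tool_name"],
        normalize_target(span["target"]),
        status_class(span.get("status")),
    ])
    return hashlib.sha256(raw.encode()).hexdigest()[:16]


a = state_key({"tool_name": "http_get", "target": "http://10.0.0.5:8080/admin/", "status": 200})
b = state_key({"tool_name": "http_get", "target": "http://10.0.0.5:8080/admin?id=7", "status": 204})
c = state_key({"tool_name": "http_get", "target": "http://10.0.0.5:8080/admin", "status": 404})
```

Here `a` and `b` collide on purpose (same tool, same endpoint, both 2xx) and `c` does not. Whether dropping the query string is too coarse for a given challenge set is exactly the granularity trade-off named above.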

ArCHer — the two-time-scale ancestor [L]

arXiv:2402.19446 (Zhou, Zanette, Pan, Levine, Kumar; 2024-02-29).

  • Key idea: a hierarchy — a turn-level off-policy critic (TD-learned, “how good is this utterance given the conversation-so-far”) and a token-level on-policy policy gradient bootstrapped off that turn value instead of only the final episode reward. Decouples “which turn was good” (dozens of decisions) from “which token was good” (thousands).
  • Ablation that matters: the sample-efficiency gap over flat single-critic baselines widens with horizon length — the regime you’re in.
  • Gotcha: off-policy turn-level value learning reintroduces the value-overestimation instability the project’s no-critic GRPO preference was designed to avoid. Treat ArCHer as the theoretical justification for “turn is the right unit,” not a recipe to implement wholesale — prefer GiGPO’s critic-free step-grouping or Verlog’s dual-discount GAE (below) for the same decomposition without a learned off-policy critic.

RAGEN / StarPO — naming the collapse [E] [L]

arXiv:2504.20073 (Wang et al.; 2025-04-24).

  • What it is: a diagnostic framework, not primarily a new algorithm. StarPO formalizes trajectory-level agent RL over whole (state, think, action, reward) rollouts, then uses the RAGEN testbed to empirically show what breaks when you train “the naive way.”
  • Central finding — the “Echo Trap”: the policy’s reasoning traces converge to a small set of repeated, low-diversity patterns that keep scoring reward on the training distribution while generalization/exploration collapses. This is entropy collapse, named and reproduced across four stylized environments — not a one-off.
  • Fix pattern reported: reward normalization across turns (so no single dominant reward source drowns out exploration signal) + explicit rollout-diversity interventions, triggered by monitoring entropy/reward-variance, not epoch count.

Designed to fix: pattern #2 (react-and-guess, no methodology — 82% pivot-after-failure). An entropy-collapsed policy is one direct explanation for why an agent stops trying diverse enumeration strategies and converges to a small guess-repertoire.

This is the citation for the project’s own stated plan — “watch policy entropy as the trigger to graduate from SFT to GRPO.” Concretely: instrument per-turn action-entropy (or a diversity metric over tool-call sequences) during RL and treat a converging curve as the operational graduation/intervention signal, mirroring RAGEN’s diagnosis instead of re-deriving it live mid-run. See also reinforcement.md’s “Exploration and entropy” section for the DAPO Clip-Higher / Dr. GRPO mechanisms that patch this at the single-turn level — RAGEN is the multi-turn-specific diagnosis of the same underlying failure.
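One way to instrument that signal is sketched below, assuming a diversity metric over sampled tool-call actions (empirical Shannon entropy) as a cheap proxy for policy entropy. The window size and the 50% drop threshold are illustrative assumptions, not values from RAGEN.

```python
import math
from collections import Counter


def action_entropy(actions):
    """Shannon entropy (bits) of the empirical distribution over tool-call actions."""
    counts = Counter(actions)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def collapse_alert(entropy_history, window=3, drop_ratio=0.5):
    """True when the last `window` readings all sit below drop_ratio * the first reading."""
    if len(entropy_history) <= window:
        return False
    baseline = entropy_history[0]
    recent = entropy_history[-window:]
    return baseline > 0 and all(h < drop_ratio * baseline for h in recent)


diverse = action_entropy(["nmap", "curl", "nmap", "curl"])
collapsed = action_entropy(["curl", "curl", "curl", "curl"])
```

Logging `action_entropy` once per training step and feeding the series to `collapse_alert` gives a trigger based on the curve rather than on epoch count.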

The 2025–26 turn-level GRPO-variant cluster [L] [E]

A fast-moving cluster of near-simultaneous papers attacking the same problem — “turn is the unit of advantage, not trajectory or token” — with distinct mechanisms. Treat as one converging-consensus finding, not N competing final answers; none individually crosses into [N] (they refine existing capability rather than expand the boundary).

| Paper | arXiv | Mechanism | Confidence |
|---|---|---|---|
| Turn-Level Reward Design & Credit Assignment | 2505.11821 | Dense per-turn reward terms layered on the terminal outcome reward | Promising |
| Turn-PPO | 2512.17008 | Argues GRPO’s group-relative clip “exposes notable limitations” at long horizon; goes back to a per-turn PPO value function | Promising |
| TL-GRPO | 2601.16480 | Turn credit for same-state-revisited tasks (iterative code repair); narrower than general multi-turn | Promising |
| A2TGPO | 2605.06200 | Adaptive per-turn clip range: early vs. near-terminal turns have different advantage-magnitude distributions | Promising |
| Proximity-Based MTO | 2602.19225 | Weight credit by task difficulty, not just turn position | Promising |
| GAGPO | 2605.13217 | GAE-style λ-discounted advantage synthesized into GRPO’s group-relative framework | Promising |

What this means for the CTF agent: don’t pick one paper as “the” answer. Prototype the cheapest shared mechanism first, either GiGPO’s step-grouping (zero extra infra) or a straightforward per-turn shaping term (2505.11821’s approach), before reaching for a separate learned value network (Turn-PPO / TL-GRPO). Given the project’s ground-truth-only reward constraint (see below), any dense intermediate signal must stay a shaping term added to, not a replacement of, the terminal verifier reward.
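A sketch of what "added to, not a replacement of" can mean in code. It assumes the environment exposes verified per-turn state snapshots as dictionaries of sets (for example newly discovered ports or endpoints); the snapshot schema, the bonus size, and the cap are all hypothetical choices made for illustration.

```python
TERMINAL_REWARD = 1.0   # only the ground-truth flag check can pay this
BONUS_PER_FACT = 0.02   # small dense bonus per newly verified fact (assumed value)
BONUS_CAP = 0.2         # total shaping stays well below the terminal reward (assumed value)

FACT_KINDS = ("open_ports", "endpoints", "credentials")


def new_facts(prev_state, next_state):
    """Facts present in the next verified snapshot but not in the previous one."""
    return {
        kind: next_state.get(kind, set()) - prev_state.get(kind, set())
        for kind in FACT_KINDS
    }


def turn_bonus(prev_state, next_state):
    return BONUS_PER_FACT * sum(len(v) for v in new_facts(prev_state, next_state).values())


def episode_reward(state_snapshots, flag_verified):
    """Capped shaping bonus plus the terminal verifier reward."""
    bonus = sum(turn_bonus(a, b) for a, b in zip(state_snapshots, state_snapshots[1:]))
    return min(bonus, BONUS_CAP) + (TERMINAL_REWARD if flag_verified else 0.0)


snapshots = [
    {},
    {"open_ports": {22, 80}},
    {"open_ports": {22, 80}, "endpoints": {"/admin"}},
]
failed_run = episode_reward(snapshots, flag_verified=False)
solved_run = episode_reward(snapshots, flag_verified=True)
```

The cap is a design choice, not something taken from the cited papers: it keeps any amount of shaped progress strictly less valuable than a verified flag.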


Reward shaping for a sparse terminal reward — without reopening the confabulation bug

The project already has a hard rule: reward must be ground-truth flag-verified, never format/regex-matched — SFT-induced FLAG{} confabulation was a real observed failure (lessons/post-training/sft-induced-flag-confabulation.md). Every reward-shaping idea in this section has to be read through that constraint.

  • Keep the terminal signal as ground truth, add density, don’t replace it. The turn-level cluster above (2505.11821 in particular) explicitly diagnoses that “sparse outcome rewards… lack dense intermediate signals across multiple decision steps” — the fix is injecting per-turn shaping alongside the verifier, e.g. reward tool-call progress (new open port found, new endpoint discovered, new credential recovered) as a small dense bonus, while the flag check remains the only source of the large terminal reward. A shaped proxy that can be gamed (e.g. “reward finding any string that looks like a flag”) reopens exactly the confabulation failure already logged — the shaping term must be read from verifiable environment state, same discipline as the terminal check.

  • Mask tool/environment output tokens from the policy-gradient loss. Search-R1 (arXiv:2503.09516) masks retrieved-content tokens out of the loss — you don’t want to reinforce/penalize text the environment produced, only the model’s own query/action-generation tokens. Direct transplant, not optional: shell stdout, HTTP response bodies, scan output must never enter the policy-gradient loss, only the agent’s own tool-call arguments and reasoning tokens. Easy to miss when standing up a GRPO/RLVR loop on top of an SFT-warmed policy — this is a correctness bug, not a design choice.

    Designed to fix: pattern #1 (agents prefer raw curl/shell — 87.7% of tool calls bypass the provided surface). Search-R1 is direct evidence that RL specifically over which tool/query to issue is tractable in a comparable regime (search-engine calls) — supports RL (not just better prompting) as the lever for pattern #1.

  • Bootstrap a value estimate at truncation instead of reward = 0. Verlog (below) proposes trajectory early truncation with a value-function bootstrap rather than waiting for the terminal reward — directly relevant since episodes cap at ~100 turns: today a hard timeout presumably returns zero reward for a run that made real, unfinished progress. Recovering partial-progress signal from failed-but-in-progress attempts is a second lever on the same sparse-terminal-reward problem, distinct from per-turn shaping.

  • Plan exploration as its own object. PEARL (arXiv:2601.20439) treats which tools, in what order as something to explore/RL over, not just the final answer.

    Designed to fix: pattern #2 (no methodology / enumeration). “Plan exploration” is a formal mechanism for rewarding systematic tool-sequencing instead of react-and-guess.
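The loss-masking bullet above can be sketched without any framework. This is a plain policy-gradient surrogate (not the PPO-clipped objective) over a flattened token sequence; the `"model"`/`"env"` role labels and function names are assumptions for illustration.

```python
def build_loss_mask(segments):
    """segments: list of (role, token_ids), where role is 'model' or 'env'.

    Returns the flattened tokens and a 0/1 mask that is 1 only on
    model-generated tokens (reasoning and tool-call arguments).
    """
    tokens, mask = [], []
    for role, ids in segments:
        tokens.extend(ids)
        mask.extend([1 if role == "model" else 0] * len(ids))
    return tokens, mask


def masked_pg_loss(logprobs, advantages, mask):
    """Mean of -logprob * advantage over unmasked (model) tokens only."""
    total = sum(-lp * adv * m for lp, adv, m in zip(logprobs, advantages, mask))
    n = sum(mask)
    return total / n if n else 0.0


segments = [
    ("model", [11, 12]),     # agent reasoning plus tool-call arguments
    ("env", [21, 22, 23]),   # shell stdout or HTTP body returned by the tool
    ("model", [31]),         # agent's next action
]
tokens, mask = build_loss_mask(segments)
loss = masked_pg_loss([-1.0] * 6, [1.0] * 6, mask)
```

Changing the log-probabilities at the three environment positions leaves `loss` unchanged, which is the property to assert in a unit test when the real training loop is stood up.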


Why a 100-turn CTF is the hard case

Stack the constraints and the difficulty compounds — this is the “hard case” every technique above is implicitly being stress-tested against:

| Constraint | Why it bites at ~100 turns | Technique that targets it |
|---|---|---|
| Reward is terminal-only | Credit for the winning exploit turn gets diluted across ~100 turns of a flat trajectory-level advantage | GiGPO, turn-level cluster |
| Unbounded free-text state (shell/HTTP, not pixels/grid) | State-hashing methods built for discrete environments (web pages, grid worlds) don’t transplant for free | GiGPO (needs a bespoke canonicalization step) |
| Variable episode length | Batched training wastes GPU cycles on padding/idle time when some rollouts finish in 10 turns and others run to 100 | Verlog’s early truncation |
| Long context growing every turn | Full transcript in every prompt overloads context retrieval well before turn 100 | Verlog’s customizable agent memory (windowed history) |
| On-policy exploration required, but entropy collapses under naive multi-turn RL | The “Echo Trap”: repeated low-diversity patterns keep scoring reward while exploration dies | RAGEN/StarPO’s entropy-as-trigger diagnosis |
| Exploitation phase, not enumeration, is where 62% of failures stall | A flat advantage rewards enumeration and exploitation turns identically, so neither gets a sharpened gradient | GiGPO step-groups, BSides pattern #4 |

Verlog — the only technique benchmarked past 100 turns [L] [E]

No arXiv id — OpenReview only (NeurIPS 2025 MTI-LLM workshop poster, openreview.net/forum?id=GmodkWwMV3) + project blog (wentsechen.github.io/Verlog_blogpost), Chen/Chen/Zhu/Schneider. Confidence: Promising — cite via OpenReview, do not fabricate an arXiv id.

Three mechanisms, all aimed at the “three failure modes of long-horizon agentic RL” the paper names explicitly: overloaded context, sparse terminal reward, variable trajectory length wasting GPU cycles.

  1. Customizable agent memory — a flexibly-sized history window per turn, decoupling “how much context the policy sees” from “how many turns the episode has run.”
  2. Dual-discounting GAE — two separate discount factors, γ_step (turn-to-turn credit decay) and γ_token (within-turn token credit decay), instead of one GAE discount applied uniformly. Direct generalization of ArCHer’s two-time-scale idea, implemented inside GAE instead of a separate off-policy critic.
  3. Trajectory early truncation — cuts long rollouts short during training and substitutes a value-bootstrap for the missing terminal reward, to cut GPU idle time from variance in episode length.
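Mechanisms 2 and 3 can be sketched together. The code below is one plausible reading of dual discounting, not Verlog's published implementation: the discount applied between two consecutive tokens is `gamma_token` inside a turn and `gamma_step` when crossing a turn boundary, and a truncated rollout passes a critic estimate as `bootstrap_value` instead of zero. The single shared `lam` and the default values are assumptions made for brevity.

```python
def dual_discount_gae(rewards, values, turn_ids, bootstrap_value=0.0,
                      gamma_step=0.99, gamma_token=1.0, lam=0.95):
    """Token-level GAE with a turn-aware discount.

    rewards, values, turn_ids are per-token lists of equal length.
    bootstrap_value is V(next state) after the last token:
    0.0 for a terminated episode, a critic estimate for a truncated one.
    """
    n = len(rewards)
    advantages = [0.0] * n
    next_adv, next_value = 0.0, bootstrap_value
    for t in reversed(range(n)):
        crosses_turn = (t == n - 1) or (turn_ids[t + 1] != turn_ids[t])
        gamma = gamma_step if crosses_turn else gamma_token
        delta = rewards[t] + gamma * next_value - values[t]
        advantages[t] = delta + gamma * lam * next_adv
        next_adv, next_value = advantages[t], values[t]
    return advantages


# Two turns of two tokens each, flag reward on the final token.
adv = dual_discount_gae(
    rewards=[0.0, 0.0, 0.0, 1.0],
    values=[0.0, 0.0, 0.0, 0.0],
    turn_ids=[0, 0, 1, 1],
    gamma_step=0.5, gamma_token=1.0, lam=1.0,
)
```

With these toy settings credit decays only at the turn boundary: both tokens of the final turn keep full credit and both tokens of the earlier turn are discounted once, regardless of how many tokens each turn contains.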

Scale claim: the blog states prior frameworks (VeRL, RAGEN) handle ~10-turn tasks, verl-agent scales to ~50, and Verlog targets 400+ turn episodes (Crafter, 70–400 steps, avg ~190) — the only technique in this thread validated longer than your ~100-turn ceiling.

What I’d change in your pipeline: the dual-discounting GAE split (γ_step vs. γ_token) replaces one discount factor with two inside whatever advantage code the training loop already has, with no new critic beyond what GAE needs. It is the most directly answerable “what would you change” in this whole file. Flag honestly: the source is a workshop poster plus a blog post, with no arXiv preprint and no full peer-reviewed paper behind it.


The RL-framework landscape (verl-agent / VerlTool / RAGEN)

Framework choice is an infra decision, not a research-finding one — flagged here because it gates which of the mechanisms above you can actually run without building rollout orchestration from scratch. Cross-reference against the harness/GPU-economics material once the framework-choice chapter from this research sweep lands (see cross-links below).

  • verl-agent (github.com/langfengq/verl-agent) — open-source agent-RL extension of veRL. No standalone arXiv paper; cite as infra, not a research claim. Scales to ~50-turn tasks per Verlog’s own comparison.
  • VerlTool — “Towards Holistic Agentic Reinforcement Learning with Tool Use” (arXiv:2509.01055) — surfaced in this research pass but not independently abs-page-verified; confirm before citing as settled.
  • RAGEN (github.com — see StarPO above) — the modular multi-environment testbed the Echo Trap diagnosis was built on; useful as a reference implementation for entropy/diversity monitoring, not just the paper.
  • “Demystifying RL for Long-Horizon Tool-Using Agents” (arXiv:2603.21972, Wu et al., 2026-03-23) [L] [R] — the closest thing to a systematic “what to tune first” ablation study, decomposing the design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, environment design. Use this axis framing when triaging which lever to pull first on your own pipeline. Confidence: Promising (0 citations, <4 months old at verify-time, but methodologically the most comprehensive single source found).

Domain-adjacent: RL for CTF / pentesting agents directly

Academic, cited for context — not a basis for our decisions. CTF-Dojo, Pentest-R1, STRIATUM-CTF, and HackSynth are academic domain-specific CTF/pentest training or benchmark papers; none produced a frontier cybersecurity model, so none of them is load-bearing for any conclusion, recipe, or number below — they’re listed only so the researcher knows what already exists in the academic literature before presenting a technique here as novel.

  • CTF-Dojo (arXiv:2508.18370, Zhuo et al., 2025-08-25) [R] — “the first large-scale executable runtime tailored for training LLMs with verifiable feedback” for CTF-style tasks: 658 Docker-containerized challenges with ground-truth verified feedback. Context only — see re-grounding below for why verifier-grounded execution environments are the right substrate.
  • Pentest-R1 (arXiv:2508.07382, He Kong et al., 2025-08-10) [L] [R] — two-stage RL pipeline for autonomous pentesting reasoning, trained on 500+ real-world multi-step walkthroughs. Context only, unread beyond abstract — do not use it to lock the project’s own reward-shaping design; if a reward-design decision needs a citation, it must come from the general RLVR/reward-shaping literature (Ng/Harada/Russell potential-based shaping, PURE/MONA reward-hacking literature) or this project’s own measured data, not from Pentest-R1.
  • STRIATUM-CTF (arXiv:2603.22577, Hugglestone et al., 2026-03-23) [R] — MCP-standardized agentic framework for general-purpose CTF solving, targeting “multi-step, stateful reasoning” as the gap static benchmarks miss. Context only — it does not itself do turn-level RL and is not used here to justify GiGPO’s design (that justification stands entirely on GiGPO’s own paper, arXiv:2505.10978, which motivates state-hashed step-groups from first principles, no CTF-specific evidence required).
  • HackSynth (arXiv:2506.02048, Muzsai, Imolai, Lukács, 2025-06-01) [R] [E] — fine-tunes a tool-augmented Llama-3.1-8B via vanilla, trajectory-level GRPO on a procedurally-generated crypto-CTF dataset. Context only — not used as evidence for the “start with vanilla GRPO” recommendation below; that recommendation is re-grounded independently.

Re-grounded recommendation (start vanilla, add turn-level machinery only if needed): this is supported by the general empirical ablation in “Demystifying RL for Long-Horizon Tool-Using Agents” (arXiv:2603.21972, already cited above, domain-general not CTF-specific), whose 5-axis decomposition (reward shaping, model scaling, data composition, algorithm selection, environment design) treats algorithm choice as one axis to tune after establishing a working baseline — plus this project’s own handbook rule that GRPO baseline must first land in the 30–60% signal band (memory/handbook.md §10) before any additional machinery is justified. Re-grounded substrate claim: that verifier-grounded execution environments (not format/regex reward) are the right substrate is this project’s own confirmed constraint, not an inference from CTF-Dojo — see the ground-truth-flag-verified rule and the SFT-induced FLAG{} confabulation failure (lessons/post-training/sft-induced-flag-confabulation.md), which is the actual basis.
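A small gate for that rule, assuming the "signal band" refers to the baseline solve rate over sampled rollouts (the handbook section itself is not reproduced here, so that reading is an assumption). The second helper reflects a property of group-relative advantages with binary rewards: a group whose rollouts all succeed or all fail yields zero advantage for every sample.

```python
def solve_rate(outcomes):
    """Fraction of rollouts whose flag was verified (outcomes are 0/1)."""
    return sum(outcomes) / len(outcomes)


def in_signal_band(outcomes, low=0.30, high=0.60):
    """True when the baseline solve rate falls inside the target band."""
    return low <= solve_rate(outcomes) <= high


def has_gradient_signal(group_rewards):
    """A group with identical rewards contributes no group-relative gradient."""
    return len(set(group_rewards)) > 1


baseline = [1, 0, 1, 0, 0, 1, 0, 1]  # toy outcomes for one challenge
ready_for_more_machinery = in_signal_band(baseline)
```

Running the gate per challenge (or per difficulty tier) before adding turn-level machinery keeps the decision tied to measured data rather than to any single paper.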

The live gap: no paper in this pass demonstrates turn-level credit assignment (GiGPO/Verlog-class) applied to an offensive-security/CTF domain specifically. Transplanting GiGPO/Verlog mechanisms to CTF is this project’s own contribution to make, not something to find pre-solved — and, per the standing rule, not something to validate by reference to the academic CTF papers above.


Tool-integrated reasoning RL: ReTool, ToRL, Search-R1

These three share a mechanism — RL over an interleaved reason+tool-call+observation loop — but differ in domain (math/code-interpreter vs. search). Relevant because the harness is a tool-integrated-reasoning loop (shell, HTTP, scanning tools).

  • ReTool (arXiv:2504.11536, Feng et al., 2025-04-15) [R] [E] — interleaves real-time code-interpreter execution inside the reasoning trace, and trains the interleaving policy with RL. The rollout is no longer “generate full CoT then maybe call a tool” — the policy learns when to interrupt its own reasoning to invoke a tool and resume conditioned on the result; credit must flow through that interruption boundary. Domain is math — direct transplant to CTF tool-calling (when to run curl vs. reason further) is analogous but unvalidated in this domain.
  • ToRL (arXiv:2503.23383, Li, Zou, Liu, 2025-03-30) [E] [N]: the strongest [N] citation in this thread. Pure RL (no SFT-on-tool-traces warmup) teaches autonomous tool invocation and reports emergent strategic tool-use behaviors absent from the SFT-only baseline, beating the best tool-integrated-reasoning model on AIME’24 by double digits. This is the same RL > SFT distinction the project’s diagnosis already leans on (“SFT replaces, GRPO amplifies,” arXiv:2507.10616) but pushed further: RL-from-scratch surfacing qualitatively new patterns is closer to boundary-expansion than amplification. Domain is math tool-use (Python), not offensive security, so treat the emergent-behavior claim as suggestive, not proven, here.
  • Search-R1 (arXiv:2503.09516, Jin et al., 2025-03-12) [R] [E] — the loss-masking detail (mask tool/environment output tokens from the policy gradient) covered above under reward shaping; also BSides pattern #1 evidence.

Domain gap to flag honestly: all three are validated in math/search domains with much shorter tool-call chains than a 100-turn CTF episode — the mechanism (interleave, mask, learn-when-to-call) transfers; the specific hyperparameters (how often to call, reward magnitude per call) don’t, and need re-deriving on your own verifier-based reward.


Summary table

The four rows below marked “context only” (CTF-Dojo, Pentest-R1, STRIATUM-CTF, HackSynth) are academic domain-specific CTF/pentest papers — cited for landscape awareness, not as a basis for any conclusion/recipe/number in this file (see re-grounding above).

| Technique | arXiv (verified unless noted) | Tags | BSides pattern | Confidence | One-line takeaway |
|---|---|---|---|---|---|
| GRPO (baseline) | 2402.03300 | [L] | | Verified | Group-mean baseline, no critic; trajectory-level only |
| GAE (baseline) | 1506.02438 | [L] | | Verified | Classical multi-step advantage; single time-scale |
| GiGPO | 2505.10978 | [L][E] | #4 | Verified | Step-level advantage via state-hash groups, zero extra rollouts |
| ArCHer | 2402.19446 | [L] | | Verified | Turn-level off-policy critic + token-level on-policy; heavier infra |
| RAGEN / StarPO | 2504.20073 | [E][L] | #2 | Verified | Names & diagnoses the “Echo Trap” entropy collapse |
| Turn-Level Reward Design | 2505.11821 | [L][E] | #4 | Promising | Dense per-turn reward layered on sparse terminal reward |
| Turn-PPO | 2512.17008 | [L] | | Promising | Argues PPO turn-value beats GRPO group-baseline at long horizon |
| TL-GRPO | 2601.16480 | [L] | | Promising | Turn credit for same-state-revisited (iterative) tasks |
| A2TGPO | 2605.06200 | [L] | | Promising | Adaptive per-turn PPO clip range |
| Proximity-Based MTO | 2602.19225 | [L] | | Promising | Weight credit by task difficulty, not just position |
| GAGPO | 2605.13217 | [L] | | Promising | GAE-style generalized advantage inside GRPO grouping |
| Verlog | no arXiv (OpenReview) | [L][E] | | Promising | Dual-discount GAE + memory-windowing + early truncation; 400+ turn scale |
| Demystifying RL for Long-Horizon Tool Agents | 2603.21972 | [L][R] | | Promising | 5-axis empirical recipe: reward/scale/data/algorithm/env |
| PEARL | 2601.20439 | [E][R] | #2 | Promising | RL over the planning/tool-sequencing step itself |
| ReTool | 2504.11536 | [R][E] | | Verified | Interleaved code-exec reasoning, RL over interruption points |
| ToRL | 2503.23383 | [E][N] | | Verified | RL-from-scratch surfaces emergent tool-use strategies |
| Search-R1 | 2503.09516 | [R][E] | #1 | Verified | Mask tool-output tokens from policy-gradient loss |
| CTF-Dojo (context only) | 2508.18370 | [R] | | Verified; academic, not a basis | 658-challenge verifier-grounded CTF RL environment |
| Pentest-R1 (context only) | 2508.07382 | [L][R] | | Verified (needs deeper read); academic, not a basis | Two-stage RL for pentest reasoning |
| STRIATUM-CTF (context only) | 2603.22577 | [R] | #4 | Promising; academic, not a basis | MCP-standardized stateful CTF agent framework |
| HackSynth (crypto CTF) (context only) | 2506.02048 | [R][E] | | Verified; academic, not a basis | Vanilla GRPO already works on narrow (crypto) CTF |
| VerlTool | 2509.01055 | | | Unverified (flag before citing) | Holistic agentic-RL-with-tool-use framework |

Open questions for the next research pass

  1. No paper in this pass demonstrates turn-level credit assignment (GiGPO/Verlog-class) applied to an offensive-security/CTF domain specifically — transplanting these mechanisms to CTF is this project’s own contribution to make, not something to find pre-solved.
  2. Verlog has no arXiv paper — only an OpenReview NeurIPS-workshop submission and project blog. Cite the OpenReview id, not a fabricated arXiv id, and flag it as workshop-tier evidence.
  3. VerlTool (arXiv:2509.01055) surfaced in search but was not independently verified via abs-page crawl — confirm before citing as settled.
  4. Pentest-R1’s exact reward/credit design was not deep-dived here (abstract only). Per the project’s standing rule, it is context/landscape awareness only, never a basis for the project’s own reward-shaping design — that design must be re-grounded on general RLVR/reward-shaping theory or this project’s own measured data, not on Pentest-R1 or any other academic CTF/pentest paper.

Cross-links

  • Reinforcement — PPO · GRPO · RLVR — the base algorithm (and its own “Exploration and entropy” section — DAPO Clip-Higher, Dr. GRPO) every technique in this chapter modifies or wraps.
  • Imitation — SFT · rejection sampling — the pre-RL stage this project runs first; RAGEN’s Echo Trap is precisely the risk that activates once you graduate past it.
  • Contested edges & landmines — the flag-confabulation gotcha, the RFT terminology trap, and the “RL can’t create capability” debate that ToRL’s [N] evidence bears on directly.
  • Frontier recipes — Kimi K2.5’s PARL and OpenAI Deep Research’s end-to-end agentic RL, cited above as production evidence, are detailed there per-lab.
  • Framework-choice / GPU-economics chapter and the frontier-lab sweep chapter (this overnight research batch): once those land in SUMMARY.md, wire a link here for the verl-agent/VerlTool infra choice and the cross-lab RL-recipe comparison — their file names weren’t finalized at write-time for this chapter; the integrator should confirm the actual paths and slot them into this section.