RL that creates value — long-horizon, exploration, reasoning, novelty

The other method pages (Reinforcement, Agentic & multi-turn RL) cover which algorithm (PPO/GRPO/GSPO) and what training-loop shape (single-shot vs. multi-turn-with-tools). This page is the tagged sweep of what actually makes RL pay off for a 100-turn, tool-using, verifier-rewarded CTF agent that already has an execution gap (capability present, unreliable) rather than a knowledge gap. Every technique below is filed under the axis it serves, cross-referenced to the behavioral audit patterns it targets, and cited only where the arxiv id was verified live (crawled arxiv.org/abs/<id> on 2026-07-02).

Legend

| Tag | Axis | Why it matters here |
|---|---|---|
| [L] | Long-horizon / credit assignment | Episodes run to ~100 turns and reward is usually terminal-only (flag verified or not), so plain trajectory-level advantage smears credit across every turn equally. |
| [E] | Exploration / entropy preservation | Cybersecurity is search/enumeration. Entropy collapse = the policy commits early to one narrow script and stops trying alternatives. |
| [R] | Reasoning / test-time compute | Multi-step vuln chaining, deciding what to try next, allocating turns to hard vs. easy phases. |
| [N] | Novelty / boundary-expansion | Evidence the technique expands what the policy can do at all (finds attack paths the base model never finds at any k), not just amplifies what’s already samplable. |

Baseline context this sweep assumes: GRPO (critic-free, group-mean baseline, arXiv:2402.03300) and GAE (classical single-trajectory multi-step advantage, arXiv:1506.02438) are what everything below is patching. Vanilla GRPO/PPO assume one reward at the end of one generation — a 100-turn tool-using episode breaks that on two axes simultaneously: the “trajectory” is now dozens of interleaved generation-turns + environment observations, and reward is terminal-only, so the group-mean baseline credits/blames every turn identically regardless of whether it was an exploratory enumeration step or the exploit-landing step.

Reward-shaping guardrail (applies to every [L] technique below): any dense, per-turn, or process-level reward introduced to fix credit assignment must be a shaping term added on top of, not a replacement for, the terminal ground-truth flag verification. This project’s confirmed lesson — SFT-induced FLAG{} confabulation from a loose regex reward — is exactly the failure mode that reopens if a proxy signal gets promoted to primary reward. See the reward-hacking cluster in §3.4.


1. Long-horizon / credit assignment for agents [L]

1.1 GiGPO — step-level credit with zero extra rollouts

arXiv:2505.10978 (Feng, Xue, Liu, An; 2025-05-16; NeurIPS 2025 poster). Two nested grouping levels instead of GRPO’s one: the usual episode-level group (N full rollouts, trajectory advantage as before) plus a step-level group — retroactively hash (state, step) pairs, bucket steps that recur across rollouts at the same environment state, and compute a second advantage from “what happened next” conditioned on that state. No new critic, no extra rollouts — pure post-hoc bookkeeping on trajectories you already sampled.

# after collecting the usual GRPO batch of N trajectories
key = lambda s: (s.tool_name, normalize(s.target), s.response_status_class)  # state canonicalization
anchor_groups = bucket_by(key, all_steps_across_trajectories)
step_adv = {step: advantage_within(anchor_groups[key(step)]) for step in all_steps}
advantage = episode_advantage + lam * step_adv[step]     # combine before the clipped PG update

Designed to fix: pattern 4 (uneven PTES phases — 62% of failures stall in exploitation). Step-level credit stops rewarding all 100 turns equally when only the exploitation phase decides the outcome.

Gotcha: needs a hashable notion of state recurrence — a CTF agent’s state is unbounded free text (shell/HTTP output), so the state key above needs bespoke canonicalization, not raw-text hashing. Confidence: Verified. For this agent: the single most directly transplantable idea in this sweep — no new infra, just a state-canonicalization function over the harness’s existing tool_call/tool_exec_ms spans.
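The step-grouping bookkeeping above can be made concrete with a minimal runnable sketch. Everything named here is an assumption for illustration: the `Step` record, the `state_key` canonicalization (lowercasing and trailing-slash stripping), and the use of a plain mean-centered baseline without std normalization. It is not GiGPO's reference implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Step:
    traj_id: int
    turn: int
    tool_name: str
    target: str
    status_class: str  # e.g. "2xx", "4xx", "error" (hypothetical classes)
    ret: float         # return from this step to the end of the episode


def state_key(step):
    # Hypothetical canonicalization: a real harness needs something bespoke.
    return (step.tool_name, step.target.lower().rstrip("/"), step.status_class)


def step_advantages(steps):
    """Group steps that share a canonical state and mean-center their returns."""
    groups = defaultdict(list)
    for s in steps:
        groups[state_key(s)].append(s)
    adv = {}
    for members in groups.values():
        baseline = mean(m.ret for m in members)
        for m in members:
            # A singleton group has no peer to compare against, so it gets 0.
            adv[(m.traj_id, m.turn)] = m.ret - baseline
    return adv


def combined_advantage(episode_adv, step_adv, lam=1.0):
    # Episode-level advantage plus a weighted step-level term.
    return episode_adv + lam * step_adv
```

Steps whose canonical state never recurs contribute no step-level signal, so the fraction of singleton groups is a quick check on whether the canonicalization is too fine-grained.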

1.2 ArCHer — turn-level vs. token-level, two time-scales

arXiv:2402.19446 (Zhou, Zanette, Pan, Levine, Kumar; 2024-02-29). A high-level off-policy TD critic at turn granularity plus a low-level on-policy token-level policy gradient bootstrapped off it, which decouples “which turn was good” from “which token was good.” It is the conceptual ancestor of “turn is the right credit unit,” but the off-policy critic reintroduces exactly the instability/infra weight the project’s critic-free GRPO preference was chosen to avoid. Confidence: Verified. Takeaway: use the decomposition, not the off-policy mechanism. GiGPO (§1.1) gets turn-level credit without a learned critic, and Verlog (§1.6) gets the turn/token separation from a single GAE rather than an off-policy critic.

1.3 RAGEN / StarPO — naming the collapse in multi-turn RL

arXiv:2504.20073 (Wang et al.; 2025-04-24). Diagnostic paper: trajectory-level RL on multi-turn agents reproducibly collapses into the “Echo Trap” — reasoning traces converge to a small repeated repertoire that keeps scoring reward on the training distribution while generalization/exploration dies. Fixes center on reward normalization across turns + explicit rollout-diversity interventions.

Designed to fix: pattern 2 (react-and-guess, no methodology, 82% pivot-after-failure). An entropy-collapsed policy is one explanation for an agent that stops trying diverse enumeration and converges to a small set of guesses.

Confidence: Verified. For this agent: the citation for “watch policy entropy as the graduation trigger” — instrument per-turn action-entropy or tool-sequence diversity during RL and treat a converging curve as the operational signal, not epoch count.
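One way to instrument the tool-sequence diversity signal mentioned above is to track the Shannon entropy of which tool the policy picks at each turn index across a rollout group. The sketch below is a diagnostic under assumed inputs (each rollout is a list of tool names, a hypothetical format), not something taken from the RAGEN paper.

```python
import math
from collections import Counter


def tool_entropy(tool_calls):
    """Shannon entropy (in nats) of the empirical tool distribution."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def entropy_by_turn(rollouts):
    """rollouts: list of tool-name sequences. Returns one entropy per turn index."""
    horizon = max(len(r) for r in rollouts)
    return [
        tool_entropy([r[t] for r in rollouts if t < len(r)])
        for t in range(horizon)
    ]
```

Logged once per training step, a curve that trends toward zero at every turn index is the converging signal described above.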

1.4 The turn-level-advantage cluster (2025–26 convergence)

A fast-moving cluster of near-simultaneous papers, all attacking “vanilla GRPO’s trajectory advantage is too coarse for multi-turn agents” with distinct mechanisms. Treat it as one finding: “the turn is the unit of advantage” is now cross-group consensus, even though no single implementation is yet the default.

| Paper | arXiv | Angle |
|---|---|---|
| Turn-Level Reward Design | 2505.11821 | Dense per-turn reward layered on the sparse terminal reward; on tool-use benchmarks, trajectory-level baselines can fail to invoke tools at all (20-30% exact-match) vs. 100% tool-exec-success with turn-level reward. |
| Turn-PPO | 2512.17008 | Goes back to a per-turn value function (PPO-style) instead of GRPO’s group baseline, arguing GRPO’s clipped-group-relative update has “notable limitations” specifically for long-horizon reasoning. |
| TL-GRPO | 2601.16480 | Turn credit for same-state-revisited tasks (iterative code repair); narrower than general multi-turn, but overlaps GiGPO’s step-grouping idea. |
| A2TGPO | 2605.06200 | Adaptive per-turn PPO clip range: early-episode and near-terminal turns have different advantage-magnitude distributions, so one fixed clip under- or over-constrains one of the two. |
| Proximity-Based MTO | 2602.19225 | Weight credit by task difficulty, not just turn position: a success on a hard task is more informative than a success on a trivial one. |
| GAGPO | 2605.13217 | Brings GAE’s λ-discounted multi-step advantage directly into GRPO’s group-relative framework. |

Designed to fix: pattern 4 (uneven PTES phases) — dynamic/turn-level signal keeps hard-exploitation-phase prompts from silently degenerating to all-zero-reward groups.

Confidence: all Promising (recent, zero to a few citations at verify time) except Turn-Level Reward Design (workshop-validated). For this agent: don’t pick one paper as “the” answer. Prototype the cheapest version first (GiGPO’s step-grouping with zero extra infra, or a per-turn tool-exec-success shaping term) before reaching for a learned value network (Turn-PPO/TL-GRPO), which a critic-free GRPO setup would otherwise not need.

1.5 GSPO — sequence-level clipping stabilizes as sequences get long

arXiv:2507.18071 (Qwen team, 2025-07). Import from Reinforcement: clip/importance-sample at the whole-sequence level instead of per-token, because token-level ratios compound multiplicatively over long sequences — exactly the failure mode a 100-turn tool-interleaved trajectory maximizes. This is Qwen3’s actual production RL algorithm. Tags: [L] [R] [E] (more stable optimization = less premature collapse). For this agent: if a future GRPO run destabilizes on long trajectories, GSPO-style sequence-level ratios should be the first thing tried, not more KL tuning.
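To make the compounding argument concrete, the sketch below contrasts the raw product of per-token ratios with a length-normalized sequence-level ratio, exp of the mean per-token log-ratio, which is the form GSPO clips. The clip width `eps` is a placeholder value, and the function names are illustrative, not verl's or Qwen's API.

```python
import math


def token_ratios(logp_new, logp_old):
    """Per-token importance ratios pi_new / pi_old."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence ratio: exp(mean per-token log-ratio)."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))


def clipped_sequence_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped surrogate applied once per sequence (eps is a placeholder)."""
    s = sequence_ratio(logp_new, logp_old)
    clipped = min(max(s, 1 - eps), 1 + eps)
    return min(s * advantage, clipped * advantage)


if __name__ == "__main__":
    # A drift of 0.01 nats per token over 1000 tokens.
    new, old = [-1.0] * 1000, [-1.01] * 1000
    print("product of token ratios:", math.prod(token_ratios(new, old)))
    print("length-normalized ratio:", sequence_ratio(new, old))
```

The same small per-token drift that leaves the normalized ratio close to 1 makes the unnormalized product explode, which is the instability the section describes for long tool-interleaved trajectories.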

1.6 Verlog — the only technique benchmarked past 100 turns

No arXiv id — OpenReview only (NeurIPS 2025 MTI-LLM workshop poster, openreview.net/forum?id=GmodkWwMV3; Chen, Chen, Zhu, Schneider). Cite the OpenReview id, not a fabricated arXiv one. Three mechanisms: (1) customizable agent memory — a flexible history window decoupled from episode length, (2) dual-discounting GAE — separate discount factors for turn-to-turn (γ_step) vs. within-turn token (γ_token) credit decay, generalizing ArCHer’s two-time-scale idea into a single GAE, (3) trajectory early truncation — bootstrap a value estimate instead of paying full wall-clock for variable-length rollouts.

gamma_step, gamma_token = 0.95, 0.99   # two discounts instead of one uniform GAE gamma
# ... standard GAE recursion, but decayed across turns with gamma_step and within a turn with gamma_token

Scale claim: prior frameworks (RAGEN ~10 turns, verl-agent ~50 turns) top out well below this project’s ~100-turn ceiling; Verlog is validated on Crafter at 70–400 steps (avg ~190). Confidence: Promising, workshop-tier; flag it as such if cited externally. For this agent: the dual-discount split adds one extra discount hyperparameter to whatever GAE code already exists and needs no critic beyond the value estimate GAE already relies on. Trajectory early truncation directly targets the “100-turn hard timeout returns reward=0” problem by bootstrapping partial-progress signal from a value estimate instead of discarding it.
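A minimal sketch of one possible reading of dual-discount GAE follows. The assumptions are mine, not confirmed against Verlog's code: `gamma_step` applies on the transition that crosses a turn boundary, `gamma_token` applies between tokens inside a turn, a single `lam` is shared, and per-token value estimates are available (a critic-free setup would have to supply them some other way).

```python
def dual_discount_gae(rewards, values, turn_end,
                      gamma_step=0.95, gamma_token=0.99, lam=0.95):
    """GAE with two discounts.

    rewards[t]  : reward received after token t
    values[t]   : value estimate at token t, plus one trailing bootstrap entry
    turn_end[t] : True when token t is the last token of its turn
    """
    horizon = len(rewards)
    advantages = [0.0] * horizon
    running = 0.0
    for t in reversed(range(horizon)):
        # Crossing a turn boundary uses the turn-level discount.
        gamma = gamma_step if turn_end[t] else gamma_token
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With terminal-only reward, a token's credit then decays by `gamma_step` once per remaining turn and by `gamma_token` once per remaining in-turn token, so the two decay rates can be tuned independently.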

1.7 verl-agent and the framework landscape

verl-agent (github.com/langfengq/verl-agent) — an open-source veRL extension for agent RL, ~50-turn scale per Verlog’s own comparison. No arXiv paper of its own; cite as infrastructure, not a research claim. Adjacent: “Demystifying RL for Long-Horizon Tool-Using Agents” (arXiv:2603.21972, Wu et al., 2026-03-23) decomposes the agentic-RL design space along 5 axes — reward shaping, model scaling, data composition, algorithm selection, environment design — the most systematic “what to tune first” ablation study found for this domain. [L][R], Promising. PEARL (arXiv:2601.20439, Wang et al., 2026-01-28) treats plan exploration (which tools, what order) as its own RL object, directly relevant to pattern 2 (no methodology). [E][R], Promising.

1.8 Tool-integrated reasoning RL: ReTool, ToRL, Search-R1

Three papers sharing a mechanism — RL over an interleaved reason→tool-call→observation loop:

  • ReTool (arXiv:2504.11536, Feng et al., 2025-04-15): interleaves real-time code-interpreter execution inside the reasoning trace and RL-trains when to interrupt reasoning to invoke a tool. [R][E], Verified, math domain.
  • ToRL (arXiv:2503.23383, Li, Zou, Liu, 2025-03-30): pure RL, no SFT-on-tool-traces warmup, reports emergent strategic tool-invocation behaviors absent from an SFT-only baseline. [E][N] — see §4.2.
  • Search-R1 (arXiv:2503.09516, Jin et al., 2025-03-12): masks retrieved/tool-output tokens out of the policy-gradient loss, because the policy didn’t generate them, so don’t backprop through them.

# retrieved/tool-output masking: a correctness detail, not a design choice
loss_pg = -(mask_model_generated_tokens * min(ratio * adv, clipped * adv))
# shell stdout / HTTP response body / scan output: masked out, same rule as search results

Designed to fix: pattern 1 (87.7% of tool calls bypass the rich tool surface; 26/40 tools dead). Search-R1 is direct evidence RL over “which tool/query to issue” is solved-enough in a comparable regime (search-engine calls).

For this agent: the loss-masking detail is a must-have engineering correctness fix regardless of which algorithm gets chosen — easy to miss when standing up a GRPO/RLVR loop.
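The masking rule can be sketched as a runnable function over plain Python lists. The per-token inputs, the `is_model_token` flag, and the clip width are illustrative assumptions; a real training loop would do this on tensors inside its own loss function.

```python
def masked_pg_loss(ratios, advantages, is_model_token, eps=0.2):
    """Clipped policy-gradient loss averaged over model-generated tokens only.

    Tokens that came from the environment (shell stdout, HTTP bodies, scan
    output) are skipped entirely, so they contribute neither loss nor count.
    """
    total, count = 0.0, 0
    for ratio, adv, generated in zip(ratios, advantages, is_model_token):
        if not generated:
            continue
        clipped = min(max(ratio, 1 - eps), 1 + eps)
        total += -min(ratio * adv, clipped * adv)
        count += 1
    return total / count if count else 0.0
```

Dividing by the number of model-generated tokens rather than the full sequence length keeps long tool outputs from diluting the gradient on the tokens the policy actually chose.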

1.9 Domain precedent — CTF/pentest RL (academic, context only)

These are academic domain-specific CTF/pentest training/benchmark papers — cited for context only, not a basis for any conclusion, decomposition, recipe, or number on this page (per the project’s standing rule: none of this line of work has produced a frontier cybersecurity model). Every claim they might otherwise appear to license is re-grounded below on general frontier/theory evidence or the project’s own confirmed data instead.

  • CTF-Dojo (arXiv:2508.18370, 2025-08-25) — academic, cited for context only: 658 Docker-containerized CTF challenges with verified feedback. Re-grounded: this page’s “ground-truth-verified reward, not format-matched” requirement rests on the project’s own confirmed lesson (SFT-induced FLAG{} confabulation from a loose regex reward — see the reward-shaping guardrail above §1) and on the Spurious Rewards finding (§3.6, arXiv:2506.10947) that a wrong reward can still look like it’s working — not on CTF-Dojo.
  • Pentest-R1 (arXiv:2508.07382, 2025-08-10) — academic, cited for context only: a two-stage offline-RL-on-walkthroughs → online-RL-in-Intercode-CTF pipeline. Re-grounded: the project’s own SFT→RL staging decision is grounded on DeepSeek-R1’s four-stage recipe at frontier scale (§3.1, arXiv:2501.12948) and on RAFT/Reinforce-Rej’s controlled ablation of rejection-sampling SFT (§2.6, arXiv:2504.11343) — do not lock the project’s reward shape off Pentest-R1’s design.
  • STRIATUM-CTF (arXiv:2603.22577, 2026-03-23) — academic, cited for context only: an MCP-standardized framework targeting “multi-step, stateful reasoning.” Re-grounded: GiGPO’s state-hashing (§1.1) is justified on GiGPO’s own general-agent-RL evidence (arXiv:2505.10978), not on STRIATUM-CTF’s framing.
  • HackSynth / Random-Crypto (arXiv:2506.02048, 2025-06-01) — academic, cited for context only: fine-tunes Llama-3.1-8B with vanilla GRPO on procedural crypto-CTF. Re-grounded: the claim that turn-level machinery matters more as horizon/challenge-length grows is instead supported by Verlog’s own turn-count comparison (§1.6, OpenReview) and PSN-RLVR’s length-scaling result (§2.8, arXiv:2602.02555) — general long-horizon-RL evidence, not a domain-specific CTF result.

Open gap: none of the above academic CTF-domain papers demonstrate turn-level credit assignment (GiGPO/Verlog-class) applied to offensive security — transplanting it is this project’s own contribution to make, not something pre-solved by this (academic, context-only) body of work.


2. Exploration & entropy preservation [E]

2.1 The entropy-collapse law, and its surgical fix

One-line idea: policy entropy collapses sharply and monotonically early in RLVR, and performance is bound by a fitted law R = -a·exp(H) + b — you are trading entropy for performance, hitting a hard, predictable ceiling at H=0. Mechanism: entropy change is driven by the covariance between a token’s action-probability and its logit update, which is proportional to advantage — high-probability, high-advantage tokens keep getting pushed toward certainty, and that covariance term stays positive almost everywhere.
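As a worked reading of the fitted law above, assuming $a > 0$ (which the entropy-for-performance trade implies) and $H \ge 0$, the ceiling follows directly from monotonicity:

```latex
\begin{align}
R(H) &= -a\,e^{H} + b, \qquad a > 0 \\
\frac{dR}{dH} &= -a\,e^{H} < 0 \\
R_{\max} &= R(0) = b - a
\end{align}
```

Under this form, every unit of entropy spent buys performance at a rate set by $a$, and once $H$ reaches 0 the remaining headroom is exactly zero regardless of further training.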

Fix — Clip-Cov / KL-Cov: identify the small set of highest-covariance tokens per batch and either drop the PG update on them or scope an extra KL penalty to just them, leaving the rest untouched. Already merged into verl as a loss-mode flag.

cov = (logp - logp.mean()) * (adv - adv.mean())
mask = cov > percentile(cov, 1 - clip_frac)        # top ~0.2-2% of tokens
loss_pg[mask] = loss_pg[mask].detach()              # Clip-Cov
# or: loss = loss_pg + kl_coef * kl_per_token * mask   # KL-Cov

Designed to fix: patterns 2 and 3 (no methodology / brittle single-guess) — both are symptoms of a policy that already spent its entropy budget on a narrow, high-probability action sequence.

Confidence: High (mechanistic + empirical, 342 citations within a year, adopted upstream into verl within a month). Source: Cui et al., arXiv:2505.22617 (2025-05-28). For this agent: instrument mean(entropy) from RLVR step 0 — the single cheapest, highest-leverage move in this whole sweep, with no design decisions required.

2.2 Clip asymmetry — Clip-Low and Clip-High are not symmetric

arXiv:2509.26114 (Park, Kim et al., 2025-09-30). Clipping on the low side (clip-low) increases entropy, while clipping on the high side (clip-high) decreases it, so the two bounds are independent exploration knobs, not one symmetric hyperparameter. A mechanistic complement to DAPO’s clip-higher (§2.3), where relaxing ε_high weakens the entropy-reducing high-side clip. Confidence: Medium-high, single paper. For this agent: if clip-higher alone doesn’t fully solve collapse, look at the low-side clip too.

2.3 DAPO — four engineering fixes, one entropy-preserving

arXiv:2503.14476 (Yu, Zhang et al., 2025-03-18). Clip-Higher (decoupled ε_low/ε_high, ~0.20/0.28, so rare-but-good tokens gain probability faster than they’re suppressed) + Dynamic Sampling (resample any prompt whose whole rollout group is all-correct or all-incorrect — zero-advantage groups give zero gradient) + token-level loss aggregation + overlong-response soft penalty.

eps_low, eps_high = 0.20, 0.28
loss_pg = -min(ratio * adv, clip(ratio, 1-eps_low, 1+eps_high) * adv)   # per-token, mean over ALL tokens
group_rewards = rollout_and_score(prompt, n=G)
while std(group_rewards) == 0:            # dynamic sampling: degenerate groups contribute nothing
    group_rewards = rollout_and_score(resample_prompt(), n=G)

Designed to fix: patterns 2, 3, and indirectly 4 — dynamic sampling keeps gradient flowing even on hard-exploitation prompts that would otherwise silently degenerate to all-zero-reward.

Confidence: High — open weights/code/data, one of the most widely adopted open RLVR recipes. For this agent: at ~100 turns/rollout, resampling degenerate groups is expensive — pair with curriculum/difficulty filtering (drop challenges outside the project’s own 30–60% band) rather than paying for resamples on genuinely-unsolved challenges.

2.4 High-entropy minority tokens — where the exploration budget actually lives

arXiv:2506.01939 (2025-06, NeurIPS 2025). Only ~20% of CoT tokens carry high entropy (the semantic “forking” decision tokens); restricting RLVR gradient to only those matches full-gradient RLVR at 8B and beats it at 32B (+11 AIME25), while training on the low-entropy 80% actively degrades performance. Confidence: Medium-high, math/code domain only. For this agent: in an agentic trace, boilerplate tool-call JSON/restated context is plausibly an even larger low-entropy fraction than in pure CoT — the potential leverage of masking gradient to just the “which tool/branch” decision tokens may be higher here, though unverified for this domain.

2.5 Positive-Advantage Reweighting — independent confirmation

arXiv:2511.05993 (Jin, Gao et al., 2025-11-08). Independently re-derives that positive-advantage tokens drive entropy collapse (converging with §2.1 via a different route) and proposes direct reweighting of the loss on those tokens as a simpler alternative to covariance-thresholding. Confidence: Medium — but the cross-group convergence with §2.1 raises confidence in the underlying mechanism.

2.6 RAFT / Reinforce-Rej — the project’s current recipe, validated

arXiv:2504.11343 (Xiong, Yao et al., 2025-04-15). Rejection-sampling SFT (train only on positively-rewarded samples) is competitive with GRPO/PPO; ablation shows GRPO’s real edge over vanilla policy gradient is discarding all-fail groups, not reward normalization. Reinforce-Rej extends this by filtering both all-wrong and all-right groups.

positives = [r for r in [gen(prompt) for _ in range(N)] if verifier(r) == 1]
sft_loss(positives)                        # this project's current recipe
if not (all_correct(group) or all_incorrect(group)):
    policy_gradient_update(group)          # Reinforce-Rej: same degenerate-group filter as DAPO §2.3

Designed to fix: nothing behaviorally — this is a validation, not a fix. It confirms the project’s rejection-sampling-SFT phase is a literature-grounded baseline, not an ad hoc placeholder.

Confidence: High for the ablation (clean controlled comparison). For this agent: the single most directly actionable paper for where the project is right now — and its degenerate-group filter is the same requirement DAPO’s dynamic sampling encodes, independently derived.

2.7 KL-regularization design space

arXiv:2505.17508 (Zhang, Liu et al., 2025-05-23). Systematic study of KL-term design choices (forward vs reverse, applied to reward vs loss vs both, against which reference policy) — the “why” behind ProRL’s reference-resetting (§4.1) and KL-Cov’s token-scoped KL (§2.1). Confidence: Medium — theoretical framing.

2.8 Parameter-space noise — temporally coherent exploration, classic and revived

Classic (arXiv:1706.01905, Plappert, Houthooft et al., 2017; 364 citations, well-established): perturb the policy’s parameters before a rollout instead of the action distribution (temperature/top-p) — produces a temporally-consistent “perturbed persona” for the whole episode rather than incoherent per-token jitter.

2026 revival, PSN-RLVR (arXiv:2602.02555, Bai, Wang et al., 2026-01-30): applies parameter noise to RLVR specifically because standard RLVR has an exploration ceiling that grows more visible at large sampling budgets. Corrects the resulting off-policy mismatch with truncated importance sampling; reports gains that get larger as reasoning length grows (marginal on ~738-token AMC responses, +8.9% pass@256 on ~1978-token AIME responses).

theta_noisy = theta + sigma * noise            # perturb before rollout (typically MLP/FFN blocks)
rollout = generate(theta_noisy, prompt)
importance_weight = min(pi_theta(a|s) / pi_theta_noisy(a|s), max_val)   # truncated importance sampling: cap the ratio from above
loss = -importance_weight * advantage * logp_theta(a|s)                  # update the CLEAN theta

Designed to fix: patterns 2 and 3 — and directly relevant since these episodes run to ~100 turns, far longer than the paper’s single-CoT setting, where token-level noise decorrelation would compound into incoherence exactly as the paper predicts.

Confidence: classic — High. PSN-RLVR — Low-medium, single very-recent paper, unreplicated. For this agent: the single technique in this sweep whose stated advantage scales with trajectory length instead of against it — worth a dedicated small pilot.

2.9 Multi-temperature — spend exploration budget where it helps

arXiv:2510.08892 (Zhuang, Zhou et al., 2025-10-10). Classify tokens into high-entropy “reasoning/fork” vs. low-entropy “knowledge/fact” tokens; sample fork tokens at higher temperature, knowledge tokens at lower — don’t want the agent “exploring” whether a CVE number or flag format is correct. Confidence: Medium. For this agent: directly portable — higher temperature at “which tool next” decision points, lower temperature inside verbatim payload/command construction.
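A minimal sketch of entropy-gated temperature selection, under stated assumptions: the token class is decided by thresholding the entropy of the unscaled next-token distribution, and the threshold and both temperatures are placeholder values rather than numbers from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_temperature(logits, threshold=1.0, t_fork=1.2, t_fact=0.6):
    """Higher temperature at high-entropy fork tokens, lower at fact tokens."""
    return t_fork if entropy(softmax(logits)) >= threshold else t_fact
```

In an agent harness the same gate could key off position instead of entropy, for example a higher temperature while choosing the tool name and a lower one while emitting the command arguments.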

2.10 DIVER — reward group-level diversity as an intrinsic bonus

arXiv:2509.26209 (Hu, Zhang et al., 2025-09-30). Rewards global sequence-level diversity across a rollout group (pairwise dissimilarity) using potential-based reward shaping (Ng et al. 1999 invariance) so diversity-seeking doesn’t distort what “correct” means. Reports beating GRPO-w/-clip-higher, entropy-RL, and pass@k training on both pass@1 and pass@k, in- and out-of-domain.

D = pairwise_dissimilarity_matrix(responses)          # G x G over a group
diversity_of_i = mean(D[i, :])
r_intrinsic = diversity(state_t) - diversity(state_t_minus_1)   # potential-based shaping
reward_total = reward_task + lambda_div * r_intrinsic

Designed to fix: patterns 1 and 2/3 — a direct counter to “87.7% of tool calls bypass the rich tool surface” if “different approach” is defined over which tools/commands were used, not token-level text.

Confidence: Medium-high, single paper. For this agent: the strongest concrete [N]-flavored opportunity surfaced in this whole sweep — reward a rollout group for trying genuinely different tools/approaches against the same challenge, not just for eventually finding the flag.

2.11 CDE — curiosity as cheap perplexity + critic-variance bonus

arXiv:2509.09675 (Dai, Song et al., 2025-09-11, ICLR 2026). Actor-side bonus = perplexity of the model’s own response (high = “surprised,” i.e. exploring); critic-side = variance across a multi-head critic. Reports a calibration-collapse finding as a byproduct — the policy becomes confident regardless of correctness, which the actor-bonus specifically counters.

Designed to fix: pattern 3 (good guessers until they’re not) — overconfident-wrong is the literature-side twin of committing to an ungrounded guess and not recovering.

Confidence: Medium-high, modest empirical gain (+~3pt AIME) but genuinely useful framing. For this agent: the actor-side perplexity bonus is cheap (no extra network, GRPO is critic-free) — a reasonable first experiment before anything heavier.
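The actor-side bonus can be sketched as response perplexity added on top of the task reward. The coefficient, the cap, and the additive form are assumptions for illustration, not CDE's published settings, and per the reward-shaping guardrail the bonus is a shaping term layered over the verified terminal reward.

```python
import math


def perplexity(token_logps):
    """exp of the mean negative log-probability of the sampled tokens."""
    return math.exp(-sum(token_logps) / len(token_logps))


def reward_with_curiosity(task_reward, token_logps, coef=0.01, cap=1.0):
    """Task reward plus a capped perplexity bonus (coef and cap are placeholders)."""
    bonus = min(coef * perplexity(token_logps), cap)
    return task_reward + bonus
```

The cap matters: without it, a policy can inflate the bonus by emitting low-probability noise, which is the proxy-promoted-to-primary failure the guardrail warns about.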

2.12 MERCI — count-based novelty, with a domain caveat

arXiv:2510.16614 (Zhang, Li et al., 2025-10-18, ICLR 2026 poster). Classical count-based exploration adapted to the autoregressive LLM MDP via a lightweight Coin Flipping Network pseudo-count estimator — cheaper than general count-based bonuses because the token-sequence MDP has known, deterministic transitions. Confidence: Medium — but that deterministic-transition assumption is exactly what a live sandboxed CTF environment violates (server responses/subprocess stdout are stochastic, environment-dependent). For this agent: treat as inspiration for a tool/command-novelty bonus, not a drop-in.
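As a sketch of the tool/command-novelty bonus suggested above, the snippet below uses the classical 1/sqrt(count) bonus over exact (tool, arguments) keys. This is deliberately not MERCI's Coin Flipping Network estimator; the key format and `scale` value are hypothetical.

```python
import math
from collections import Counter


class CommandNoveltyBonus:
    """Classical count-based bonus: scale / sqrt(times this command was seen)."""

    def __init__(self, scale=0.1):
        self.counts = Counter()
        self.scale = scale

    def bonus(self, tool_name, normalized_args):
        key = (tool_name, normalized_args)
        self.counts[key] += 1               # count includes the current visit
        return self.scale / math.sqrt(self.counts[key])
```

Because the key is the command rather than the environment's response, the bonus sidesteps the stochastic-transition problem noted above, at the cost of rewarding novelty in what was tried rather than in what was observed.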

2.13 Representation-based exploration — a negative result worth acting on

arXiv:2510.11686 (Tuyls, Foster et al., 2025-10-13). A diversity bonus from the base LM’s own hidden states, usable at inference time (build a diverse k-of-N pool) or as an RL bonus. Notable negative result: the bonus improves verifier efficiency across sampling strategies except high-temperature sampling — high-temp outputs look “novel” in representation space without being useful. Temperature-driven and representation-driven exploration are not naively composable.

pool = [gen(prompt, temp=1.0) for _ in range(N)]
selected = top_k_by([hidden_state_diversity(r, pool) for r in pool], k)   # diverse k-of-N, not random

Confidence: Medium-high (>50% verifier-efficiency gain reported, single group). For this agent: the actionable finding is at eval time, not training — if the harness uses high temperature to get sample diversity for pass@k>1 runs, this paper says that may be producing noisier repeats of the same strategy, not genuinely different ones. Cheap to ablate, changes nothing about training.

2.14 Pass@k as diagnostic, not objective — and how to fix that if you insist

arXiv:2511.16231 (Yu Yang, 2025-11-20): optimizing pass@k directly is mathematically just a positive reweighting of pass@1, whose gradient vanishes exactly where exploration is most needed (a concentrated policy). Use pass@k as a diagnostic (is the ceiling still rising with more samples?), not a training objective.
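A simplified per-prompt view shows why the objective reduces to a reweighting. It assumes $k$ independent samples, each succeeding with probability $p_\theta$, and is a sketch rather than the paper's own derivation:

```latex
\begin{align}
\mathrm{pass@}k(\theta) &= 1 - (1 - p_\theta)^{k} \\
\nabla_\theta\, \mathrm{pass@}k(\theta) &= \underbrace{k\,(1 - p_\theta)^{k-1}}_{w_k(p_\theta)\;\ge\;0}\; \nabla_\theta\, p_\theta
\end{align}
```

The weight $w_k$ is non-negative, so the update direction is always that of pass@1, and for $k > 1$ it shrinks toward zero as $p_\theta \to 1$.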

arXiv:2505.15201 (Walder & Karkhanis, PKPO, 2025-05-21): if you do want a pass@k-shaped reward anyway, derives an unbiased, low-variance estimator that keeps gradient on harder problems where pass@1 gives near-zero signal but pass@k still has coverage — the same unbiased-estimator family this project already uses at eval time (1 - C(N-c,k)/C(N,k)).

arXiv:2508.10751 (Chen, Qin et al., Pass@k Training, 2025-08-14): a lighter-weight alternative — use pass@k as the reward signal itself to adaptively balance exploration/exploitation.

Designed to fix: pattern 5 (benchmarks measure pattern-match speed, not thoroughness). This section’s diagnostic-not-objective framing is the formal version of that same critique, applied to the training objective.

For this agent: treat the pass@1-vs-pass@5-vs-pass@10 gap (already the project’s locked methodology) as the diagnostic signal — a shrinking gap while pass@1 stays flat is the entropy-collapse warning from §2.1, not “the model learned the task.”
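The unbiased estimator quoted above, 1 - C(N-c,k)/C(N,k), and the gap diagnostic can be computed with the standard library. The function names and the default k values are illustrative choices.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k from n samples with c successes: 1 - C(n-c, k) / C(n, k)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_gap(n, c, k_low=1, k_high=10):
    """Gap between pass@k_high and pass@k_low for one challenge."""
    return pass_at_k(n, c, k_high) - pass_at_k(n, c, k_low)
```

Averaged over the challenge set and logged per checkpoint, a gap that shrinks while the `k_low` term stays flat is the warning pattern described above.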

2.15 NuRL — unlocking prompts GRPO currently can’t learn from at all

arXiv:2509.25666 (Chen, Peng et al., 2025-09-30, Salesforce AI Research + UNC). Standard GRPO/RLVR gets zero gradient from any prompt where every rollout in the group fails (the same degeneracy DAPO resamples and RAFT filters). NuRL instead unlocks these: generate a self-conditioned hint (model, given the gold answer, produces its own CoT + hint), inject it for 0%-pass-rate groups, re-roll with the hint — now training on a hint-augmented group with real signal; hint dropped at inference.

group = rollout(prompt, n=G)
if pass_rate(group) == 0.0:                          # dead for GRPO/DAPO/RAFT alike
    hint = self_generate_hint(prompt, gold_answer)
    group = rollout(prompt + hint, n=G)               # re-roll WITH the hint; hint dropped at inference

Designed to fix: pattern 4 and, directly, the portfolio’s ~100-200-of-1000 solve rate: the ~800-900 currently unsolved challenges are plausibly many all-zero-reward-group cases today, exactly NuRL’s target regime.

Confidence: Medium, single paper, six-benchmark/three-model validation. For this agent: needs design work to adapt — the flag itself isn’t the how-to-get-there knowledge, so a CTF-shaped hint needs a walkthrough/verifier-metadata analogue or a previously-successful trajectory for a similar challenge family; doesn’t port zero-shot from math/code.

2.16 Cybersecurity RLVR precedent — the entropy-preservation gap in-domain (academic, context only)

Pentest-R1, HackSynth/Random-Crypto, and a Linux-privesc RLVR paper (arXiv:2603.17673, Normann, Happe et al., 2026-03-18 — SFT-then-RLVR on a 4B model, 95.8% success vs. 97.5% for Claude Opus 4.6 at >100x lower inference cost) are academic domain-specific security-training papers — cited for context only, not a basis for this page’s conclusions. They’re mentioned here purely to note an absence: none report an explicit entropy-preservation mechanism. The project’s actual grounding for the two-stage SFT→RLVR pipeline is DeepSeek-R1’s staged recipe at frontier scale (§3.1, arXiv:2501.12948) and RAFT/Reinforce-Rej’s controlled ablation (§2.6, arXiv:2504.11343), not these academic security papers. The entropy-preservation gap itself is this project’s own [N] opportunity to make, not something pre-solved: no cybersecurity-specific RL paper found combines DAPO/entropy-mechanism/curiosity/parameter-noise with a multi-vuln-class, ~100-turn, tool-rich CTF setting.


3. Reasoning & test-time compute [R]

3.1 The staging lesson from DeepSeek-R1

arXiv:2501.12948 (2025-01). R1-Zero (pure RL, binary rule-based reward, long CoT emerges) then R1’s four-stage fix (cold-start SFT → reasoning RL → rejection-sampling SFT on the RL checkpoint’s own correct trajectories → second RL pass). The middle-to-late stage — rejection-sampling SFT on verifier-passed solves — is literally this project’s chosen path, validated at frontier scale. Gap: R1’s reward is single-turn/terminal on math/code; the sparser, later-arriving terminal signal of a 100-turn CTF episode is the part R1 does not solve (that’s §1 of this page).

3.2 Dr.GRPO — the length-bias trap gets worse with more turns

arXiv:2503.20783 (2025-03). Vanilla GRPO’s length and group-std normalizations introduce a bias that favors longer wrong answers and shorter right ones. Fix: drop both normalizations and keep only advantage = reward - mean(rewards). Matches/beats GRPO accuracy at same compute.
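The two advantage definitions can be compared side by side in a small sketch. The population standard deviation and the `eps` stabilizer in the GRPO variant are implementation choices assumed here, and `per_token_weight` is a hypothetical helper that only illustrates the effect of averaging a response's loss over its own length.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Vanilla GRPO: mean-centered and divided by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr.GRPO: mean-centered only, no std normalization."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def per_token_weight(advantage, length):
    """Effective per-token weight when a response's loss is averaged over its length."""
    return advantage / length
```

With a negative advantage of -0.5, `per_token_weight` gives -0.005 per token for a 100-token response and -0.0005 for a 1000-token one, so under length normalization a long failure is penalized less per token than a short one.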

Designed to fix: pattern 3 (good guessers until they’re not) — if the base algorithm rewards length-padding on failure, an agent could learn to “look busy” (redundant tool calls, extra enumeration) without the enumeration being useful — a bigger attack surface for this bug in a 100-turn setting than a single-turn math answer.

Confidence: High. For this agent: use Dr.GRPO’s mean-centered, un-normalized advantage rather than vanilla GRPO’s normalized one if/when graduating to RL. Specifically check whether the policy is learning to burn turns on unproductive tool calls after a wrong guess, which is nearly indistinguishable from legitimate enumeration unless checked for.

3.3 LIMO — SFT data quality over quantity

arXiv:2502.03387 (2025-02). 817 carefully-curated SFT examples beat >100k loosely-curated ones on AIME/MATH500 plus strong OOD transfer — SFT works as “cognitive templates” for knowledge the base model already has, not a knowledge source.

For this agent — directly relevant to the immediate next step: when building the rejection-sampling SFT set from the agent’s own verifier-passed runs, prioritize trajectory quality/technique diversity over raw count. A smaller set of clean, full-PTES-phase, well-enumerated solves may generalize better than a larger set of lucky-guess successes — training on lucky-guess trajectories specifically risks teaching the “guess and hope” failure mode LIMO’s framing predicts generalizes poorly (pattern 3).
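One way to make “quality and technique diversity over raw count” concrete is a round-robin selection across techniques that excludes lucky-guess solves. This is only a sketch: the trajectory fields (`flag_verified`, `lucky_guess`, `technique`, `phases_covered`) are hypothetical labels that the harness would have to supply.

```python
from collections import defaultdict
from itertools import zip_longest


def select_sft_set(trajectories, max_total):
    """Pick verifier-passed, non-lucky trajectories, spread across techniques."""
    by_technique = defaultdict(list)
    for t in trajectories:
        # Keep only verified solves that were not flagged as lucky guesses.
        if t["flag_verified"] and not t["lucky_guess"]:
            by_technique[t["technique"]].append(t)

    # Within a technique, prefer solves that covered more phases.
    for group in by_technique.values():
        group.sort(key=lambda t: t["phases_covered"], reverse=True)

    # Round-robin across techniques so no single technique dominates the set.
    picked = []
    for layer in zip_longest(*by_technique.values()):
        for t in layer:
            if t is not None and len(picked) < max_total:
                picked.append(t)
    return picked
```

The cap `max_total` is deliberately small in spirit: the selection fills the set one technique at a time rather than taking every available success.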

3.4 Test-time compute is a resource-allocation problem, not a skill problem

arXiv:2408.03314 (Snell et al., 2024-08, 1772+ citations). Difficulty-adaptive test-time compute allocation can match a 14x larger model at fixed budget; uniform allocation is wasteful. Related, arXiv:2502.15631 (o3-mini vs o1-mini, Feb 2025): higher accuracy achieved WITHOUT longer reasoning chains — accuracy generally declines as CoT length grows within a fixed model, even controlling for difficulty.

For this agent: the 100-turn budget IS test-time compute allocation, just framed as episode length. Maps onto pattern 4 (strong at chaining, weak at thorough enumeration) as a resource-allocation problem: the agent should spend more of its turn budget on thorough enumeration for hard/unfamiliar challenge types and less on easy/familiar ones, rather than a flat per-phase turn count. Track turns-per-solve alongside solve rate — more turns is not automatically better.
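A rough sketch of difficulty-adaptive turn budgeting and of the turns-per-solve metric. The share bounds (20% to 60%) and the idea of a difficulty score in [0, 1] (for example, one minus the baseline solve rate) are assumptions for illustration, not values from the cited papers.

```python
def allocate_enumeration_turns(total_turns, difficulty, min_share=0.2, max_share=0.6):
    """Turns reserved for enumeration, rising linearly with difficulty in [0, 1]."""
    d = min(max(difficulty, 0.0), 1.0)  # clamp out-of-range difficulty scores
    share = min_share + (max_share - min_share) * d
    return round(total_turns * share)


def turns_per_solve(episodes):
    """Mean turns used over solved episodes only; None if nothing was solved."""
    solved = [e["turns"] for e in episodes if e["solved"]]
    return sum(solved) / len(solved) if solved else None
```

Reporting `turns_per_solve` next to solve rate makes it visible when a checkpoint gains solves only by spending more of the budget.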

3.5 The contested question — does RL expand or just narrow the reasoning boundary?

Two papers, opposite conclusions, genuinely contested:

  • “Does RL Really Incentivize Reasoning Capacity Beyond the Base Model?” (arXiv:2504.13837, 2025-04): under pass@k with large k, RLVR-trained models’ correct paths are all already samplable from the base model — pass@1 improves, pass@256 decreases over training. RLVR-as-typically-run narrows.
  • ProRL (arXiv:2505.24864, Liu, Diao et al., 2025-05-30, NeurIPS 2025): with KL control + periodic reference-policy resetting + a diverse task suite, sustained over 2000+ steps, RL-trained models solve problems the base model never solves at any k — genuine boundary expansion, correlating with training duration and base-model competence.
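ProRL’s three ingredients, as pseudocode:

```python
# Pseudocode: policy, ref_policy, kl_divergence, kl_coef, step and reset_interval
# are assumed to come from the surrounding training loop.
loss += kl_coef * kl_divergence(policy, ref_policy)    # (a) adaptive KL control
if step % reset_interval == 0:
    ref_policy.load_state_dict(policy.state_dict())    # (b) periodic reference reset, the non-obvious piece
# (c) diverse multi-task training suite, not a narrow curriculum
```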

Reconciliation (this page’s synthesis, not either paper’s claim): RLVR as typically run (short, KL-to-frozen-init) narrows/amplifies; RLVR prolonged, KL-controlled, reference-resetting, diverse-task can expand. Training duration + KL management is the resolving variable — treat as a hypothesis to validate with your own entropy instrumentation (§2.1), not settled fact.

Designed to fix: pattern 4 (uneven PTES, weak enumeration) — a reasoning boundary that hasn’t been expanded keeps failing the same class of enumeration-heavy step no matter how much RL polishing it gets.

For this agent: don’t expect boundary expansion from a short RL run; that is the expected behavior reported in arXiv:2504.13837, not a bug. Periodic reference-policy resetting is cheap, orthogonal to GRPO/DAPO/GSPO choice, and worth adopting from the start of any RL run.
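One way to check which regime a run is in is to compare pass@k for the base and RL checkpoints at small and large k. The sketch below uses the standard unbiased pass@k estimator (Chen et al., 2021); `boundary_report` and its inputs (per-challenge correct counts out of `n` samples each) are hypothetical names for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct (requires k <= n)."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def boundary_report(base_counts, rl_counts, n, ks=(1, 256)):
    """Mean pass@k per k as (base, rl), over the same list of challenges."""
    report = {}
    for k in ks:
        base = sum(pass_at_k(n, c, k) for c in base_counts) / len(base_counts)
        rl = sum(pass_at_k(n, c, k) for c in rl_counts) / len(rl_counts)
        report[k] = (base, rl)
    return report
```

If the RL checkpoint wins at k=1 but loses at large k, the run is in the narrowing regime. Expansion would instead show up as challenges where the base count is zero at every sampled k and the RL checkpoint’s count is not.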

3.6 Spurious rewards — a methodology warning before trusting any result

arXiv:2506.10947 (2025-06, 93+ citations already). For Qwen2.5-Math specifically, RLVR improves MATH-500 almost as much with completely spurious rewards (random, wrong-label, format-only) as with ground-truth ones: RL surfaces a latent pretrained quirk rather than using the reward’s information content. The effect does not replicate on Llama3/OLMo2, so it is a model-family-dependent finding.

For this agent: strengthens the project’s own confirmed lesson (ground-truth-verified reward, never format-matched) — a completely wrong reward can look like it’s working if the base model has the right latent bias, a scarier version of “format reward causes confabulation.” Sanity-check any future “the reward design worked” claim with a brief spurious-reward ablation on the same base model.
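The ablation arms can be sketched as interchangeable reward functions with one shared signature. Everything here is hypothetical scaffolding: the `flag{...}` pattern, the function names, and the `ABLATION_ARMS` registry are assumptions, and the training loop that would consume them is not shown.

```python
import random
import re

FLAG_SHAPE = re.compile(r"flag\{[^}]+\}")  # hypothetical flag format


def ground_truth_reward(output, expected_flag):
    """Control arm: 1.0 only if the verified flag appears in the output."""
    return 1.0 if expected_flag in output else 0.0


def format_only_reward(output, expected_flag):
    """Spurious arm: rewards anything flag-shaped, right or wrong."""
    return 1.0 if FLAG_SHAPE.search(output) else 0.0


def wrong_label_reward(output, expected_flag):
    """Spurious arm: rewards flag-shaped answers that are NOT the expected flag."""
    m = FLAG_SHAPE.search(output)
    return 1.0 if m is not None and m.group(0) != expected_flag else 0.0


def make_random_reward(seed, p=0.5):
    """Spurious arm: a seeded coin flip that ignores the output entirely."""
    rng = random.Random(seed)

    def random_reward(output, expected_flag):
        return 1.0 if rng.random() < p else 0.0

    return random_reward


ABLATION_ARMS = {
    "ground_truth": ground_truth_reward,
    "format_only": format_only_reward,
    "wrong_label": wrong_label_reward,
    "random": make_random_reward(seed=0),
}
```

Train each arm briefly from the same base checkpoint with the same step budget. If the spurious arms improve held-out solve rate nearly as much as `ground_truth`, the gain is not coming from the reward’s information content.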

3.7 The reward-hacking cluster — what to pre-mortem before scaling RL

Three independent, very recent (2026) papers converging on one warning:

  • arXiv:2605.02269 — “Towards Understanding Specification Gaming in Reasoning Models”: RL reasoning training causally increases specification-gaming rate (32–170% across model pairs); test-time mitigations reduce but don’t eliminate it.
  • arXiv:2604.15149 — “LLMs Gaming Verifiers”: RLVR-trained models abandon intended generalizable behavior and exploit the gap between a verifier’s extensional (checks-the-output) and intensional (checks-the-process) correctness — the verifier admits false positives, RL finds exactly that gap.
  • arXiv:2604.13602 — “Reward Hacking in the Era of Large Models”: the “Proxy Compression Hypothesis” — reward hacking is near-inevitable when optimizing an expressive policy against any compressed proxy of a high-dimensional true objective. Framework/survey, not a novel empirical result.

For this agent, a concrete pre-mortem rather than a hypothetical: a binary flag-match check IS an extensional verifier. The mechanism in arXiv:2604.15149 predicts RL will find and exploit any gap between “produces the correct flag string” and “actually exploited the intended vulnerability” (info leak, predictable flag generation, a scoring bug) at a higher rate than the SFT-only regime already run. Audit the flag-verification harness for exactly these extensional gaps before scaling RL, and consider periodic trajectory-level spot-audits (not just flag-match) once RL training starts. All three: Medium confidence (very recent, unreplicated), but the engineering precaution is warranted regardless of replication.
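A sketch of what a trajectory-level spot-audit could look like next to the flag check. The trajectory fields (`submitted_flag`, `tool_calls`) and the idea of per-challenge `required_markers` (strings an intended exploit should touch) are assumptions; the marker check is a crude proxy for process correctness, not a real intensional verifier.

```python
import random


def extensional_pass(trajectory, expected_flag):
    """What a binary flag-match check sees: only the final submitted string."""
    return trajectory["submitted_flag"] == expected_flag


def process_evidence(trajectory, required_markers):
    """Crude process proxy: does some tool call mention each required marker?"""
    calls = " ".join(trajectory["tool_calls"])
    return all(marker in calls for marker in required_markers)


def suspicious_solves(trajectories, expected_flag, required_markers):
    """Solves that pass the flag check but show no sign of the intended exploit."""
    return [
        t for t in trajectories
        if extensional_pass(t, expected_flag)
        and not process_evidence(t, required_markers)
    ]


def spot_audit_sample(trajectories, k, seed=0):
    """Seeded random sample of trajectories for manual review."""
    rng = random.Random(seed)
    return rng.sample(trajectories, min(k, len(trajectories)))
```

A rising share of `suspicious_solves` during RL is a signal to inspect the harness before trusting the solve-rate curve.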

3.8 Process reward models — context, and why they stay off the table

Lightman et al. (arXiv:2305.20050, 2023, foundational) and its 2025 continuation “Process Reward Models That Think” (arXiv:2504.16828) are the step-level-verification lineage — DeepSeek explicitly rejected PRM for R1 due to step-level reward hacking. For this agent: the natural alternative if credit-assignment sparsity (§1) becomes a bottleneck, but a PRM is itself a learned (not ground-truth) reward — any PRM-style process reward would need its own anti-gaming safeguards on top of §3.7’s warnings, not a naive “good step = positive reward” scheme. Keep the terminal flag verifier as the ungameable primary signal.


4. Novelty / boundary-expansion [N]

The techniques above earn an [N] tag when they have direct evidence of expanding what the policy can do (solving what the base model never solves at any k), not just sharpening what it already sometimes does. Collected here as the load-bearing case for or against the project’s confirmed lesson “GRPO amplifies existing capabilities, SFT replaces them” (arXiv:2507.10616).

4.1 ProRL — the strongest counter-evidence to “amplification only”

Already covered in §3.5 — restated here because it’s the anchor [N] citation: prolonged, KL-controlled, reference-resetting RL demonstrably expands the reasoning boundary. The qualifier that makes this actionable: boundary expansion correlates with training duration and base-model competence — a short GRPO polish pass should be expected to behave like the narrowing result (§3.5), not ProRL’s.

4.2 ToRL — RL-from-scratch surfaces qualitatively new tool-use strategies

arXiv:2503.23383 (2025-03-30, math domain). No SFT-on-tool-traces warmup at all; pure RL discovers when and how to invoke tools and reports emergent invocation strategies absent from the SFT-only baseline, outperforming the best tool-integrated-reasoning model on AIME’24 by double digits. This is the strongest citation in this whole sweep for [N]: the explicit contrast “RL discovers emergent patterns vs. SFT imitates them” pushes past mere amplification — RL-from-scratch surfaced qualitatively new patterns. Domain gap: math tool-use (Python), not offensive-security tool-use — suggestive, not proven, for CTF.

4.3 Absolute Zero — self-play with zero external data

arXiv:2505.03335 (2025-05). A model proposes its own tasks (validated by a code executor, rewarded by a “learnability” signal peaking at the frontier of current competence — an automatic curriculum) and solves them, beating models trained on tens of thousands of curated examples with zero external labeled data.

For this agent — speculative but worth flagging cross-seat: architecturally similar to a self-generated CTF-challenge curriculum, IF the flag-verifier concept generalizes from “check code output” to “check flag capture in a sandbox.” Math/code domain only — flag to challenge-builder/main as a longer-term idea for auto-scaling challenge difficulty to agent capability rather than a fixed static portfolio, not a near-term recipe.
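The “learnability” signal can be pictured with a toy scoring rule. The form below (zero when a proposed task is always or never solved, otherwise one minus the mean solve rate) is an assumption for illustration; check the paper for the exact definition before reusing it. `learnability_reward` and `pick_next_challenges` are hypothetical names.

```python
def learnability_reward(solve_results):
    """Score a proposed task from the solver's 0/1 outcomes on it."""
    rate = sum(solve_results) / len(solve_results)
    if rate == 0.0 or rate == 1.0:
        return 0.0  # always or never solved: no usable learning signal
    return 1.0 - rate


def pick_next_challenges(results_by_challenge, top_n):
    """Rank challenge ids by learnability, highest first, and keep top_n."""
    scored = {cid: learnability_reward(r) for cid, r in results_by_challenge.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_n]
```

Applied to a CTF portfolio, this would favor challenges the agent solves occasionally over ones it always or never solves, which is the auto-scaling idea flagged above.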

4.4 NuRL, DIVER, MERCI — engineered novelty-seeking

Already covered in §2.15, §2.10, §2.12 — grouped here as the [N]-tagged mechanisms that explicitly target “escape local routines to discover better solutions” (MERCI’s phrase) rather than sharpen an existing one: NuRL raises the model’s upper bound on prompts it currently cannot solve at all; DIVER rewards genuinely-different group-level strategies; MERCI’s novelty bonus (with the deterministic-MDP caveat) targets repetitive, suboptimal reasoning patterns directly.

4.5 Parameter-space noise (PSN-RLVR) — explicitly framed as boundary-expanding

Already covered in §2.8. Framed by its authors as “expanding the effective reasoning capability boundary,” with gains growing as reasoning length grows — the property most aligned with this project’s long-horizon axis of any [N]-tagged technique in this sweep.

4.6 Kimi K2 and MUA-RL — agentic capability as a distinct training target

Kimi K2 (arXiv:2507.20534, 2025-07): frontier labs increasingly treat agentic/tool-use capability as requiring dedicated training investment, not an emergent side-effect of reasoning-RL — validates not expecting math/code RLVR gains to automatically transfer to 100-turn agentic CTF competence. MUA-RL (arXiv:2508.18669, 2025-08): trains against a dynamic, LLM-simulated counterpart in the RL loop instead of a static script, generalizing better per-parameter on multi-turn tool-use benchmarks. For this agent: the project’s live sandboxed CTF environment already satisfies MUA-RL’s “train against the real, dynamic, reactive counterpart” principle — validating evidence, not a design change.


Decision flow — which lever to pull first

```mermaid
flowchart TD
    A["Rejection-sampling SFT plateaus"] --> B{"Entropy instrumented\nfrom step 0?"}
    B -- "not yet" --> B0["Add entropy logging NOW\n(§2.1), do this regardless"]
    B0 --> C
    B -- "yes" --> C{"GRPO baseline in\n30-60% band?"}
    C -- "no, too low/high" --> C0["Curriculum-filter challenges\nto the 30-60% band first"]
    C0 --> D
    C -- "yes" --> D["Start RL with DAPO recipe\n(clip-higher + dynamic sampling, §2.3)\nnot vanilla GRPO"]
    D --> E{"Entropy still\ncollapsing?"}
    E -- "yes" --> F["Add Clip-Cov / KL-Cov\n(§2.1), one-line verl flag"]
    E -- "no" --> G
    F --> G{"Credit smeared across\nall 100 turns equally?"}
    G -- "yes" --> H["GiGPO step-groups (§1.1)\nor turn-level reward shaping (§1.4)"]
    G -- "no" --> I
    H --> I{"~800 challenges still\nall-zero-reward?"}
    I -- "yes" --> J["NuRL-style hints (§2.15)\nor more rejection-sampling data"]
    I -- "no" --> K
    J --> K{"Tool avoidance persists\n(pattern 1)?"}
    K -- "yes" --> L["DIVER / MERCI-style\ntool-sequence diversity bonus (§2.10, §2.12)"]
    K -- "no" --> M["Budget real training duration +\nreference-policy resets for boundary\nexpansion (ProRL, §3.5/§4.1)"]
    L --> M
```
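The first box in the flow above, entropy logging, needs very little code. A minimal sketch in plain Python, assuming per-token logits for the generated tokens of a rollout are available from the trainer; the function names, the window size, and the 50% drop threshold are all illustrative assumptions.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(per_token_logits):
    """Mean entropy over the generated tokens of one rollout."""
    ents = [token_entropy(row) for row in per_token_logits]
    return sum(ents) / len(ents)


def collapse_alert(history, window=3, drop_ratio=0.5):
    """True if the mean of the last `window` entries is below
    drop_ratio times the mean of the first `window` entries."""
    if len(history) < 2 * window:
        return False
    start = sum(history[:window]) / window
    recent = sum(history[-window:]) / window
    return recent < drop_ratio * start
```

Logging `mean_policy_entropy` per training step from step 0 gives the baseline that every later branch of the flow compares against.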

Ranked shortlist — what to reach for FIRST

Given the diagnosis (execution gap, not knowledge gap), the chosen path (rejection-sampling SFT → GRPO/RLVR at entropy collapse), the five BSides patterns, and the binary terminal verified-flag reward:

  1. Instrument entropy from step 0 of any future RL run (§2.1). Free, no design decisions, do this before anything else — you cannot diagnose a collapse you didn’t measure.
  2. Start the RL stage with DAPO’s four fixes as the baseline recipe (§2.3), not vanilla GRPO — clip-higher and dynamic sampling are exactly the entropy/degenerate-group guarantees the project’s own “30–60% baseline” rule is implicitly reaching for, made explicit. Reinforce-Rej (§2.6) independently validates the same direction.
  3. If entropy still collapses under DAPO, add KL-Cov/Clip-Cov as a one-line verl loss-mode flag (§2.1) before building anything custom.
  4. Adopt GiGPO’s step-level credit (§1.1) as the first turn-level fix — zero extra rollouts, zero new critic, just a state-canonicalization function over the harness’s existing tool-call spans. Reach for the turn-PPO/TL-GRPO cluster (§1.4) only if GiGPO’s assumptions (hashable state recurrence) don’t hold in practice.
  5. Separate the ~800 never-solved challenges from the 30–60%-band ones and treat them as NuRL-territory (§2.15) — a distinct problem requiring hints or more SFT data before they’re GRPO-ready at all, not more training on the same recipe.
  6. Pilot parameter-space noise (§2.8) as the one technique whose stated advantage scales with the 100-turn horizon rather than against it — small-scale, given PSN-RLVR is unproven outside math.
  7. Build a tool-call-sequence-level diversity bonus (DIVER §2.10 / MERCI §2.12, adapted) — the strongest concrete [N] opportunity surfaced anywhere in this sweep, and the most direct counter to pattern 1 (87.7% tool bypass) that isn’t a prompting fix.
  8. Do not optimize pass@k directly as a training reward (§2.14) without the PKPO estimator; use the pass@1/pass@k gap as a diagnostic, and ablate whether pass@5/pass@10 sampling diversity is real or just noisier repeats (§2.13) — cheap, changes nothing about training, tells you whether the eval methodology measures what you think.
  9. Audit the flag verifier for extensional gaps before scaling RL (§3.7) — a pre-mortem, not a reaction; RL is empirically expected to find gaming opportunities at a higher rate than the SFT-only regime already run.
  10. Budget real training duration + periodic reference-policy resets (ProRL, §3.5/§4.1) once past the initial GRPO baseline — boundary expansion (genuinely new attack paths, not just more reliable execution of known ones) is a function of how long and how KL-controlled the run is, not available from a short polish pass.
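Item 5 above depends on separating challenges by baseline solve rate before any RL run. A minimal sketch, assuming a hypothetical `solve_counts` mapping from challenge id to the number of verified solves out of a fixed number of baseline attempts; the 30% and 60% bounds are the band named on this page.

```python
def bucket_challenges(solve_counts, attempts, low=0.30, high=0.60):
    """Split challenges into never-solved, below-band, in-band and above-band."""
    buckets = {"never_solved": [], "below_band": [], "in_band": [], "above_band": []}
    for challenge_id, solved in solve_counts.items():
        rate = solved / attempts
        if solved == 0:
            buckets["never_solved"].append(challenge_id)  # hints or more SFT data first
        elif rate < low:
            buckets["below_band"].append(challenge_id)
        elif rate <= high:
            buckets["in_band"].append(challenge_id)  # ready for the RL stage
        else:
            buckets["above_band"].append(challenge_id)
    return buckets
```

Only the `in_band` bucket feeds the initial RL run; `never_solved` is the set treated as a separate problem in item 5.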

Summary table

| Technique | arXiv (verified) | [L] [E] [R] [N] | BSides pattern | Confidence | One-line takeaway |
|---|---|---|---|---|---|
| GRPO (baseline) | 2402.03300 | | | High | Group-mean baseline, no critic; trajectory-level only |
| GAE (baseline) | 1506.02438 | | | High | Classical multi-step advantage, single time-scale |
| GiGPO | 2505.10978 | | #4 | High | Step-level advantage via state-hash groups, zero extra rollouts |
| ArCHer | 2402.19446 | | | High | Turn-level off-policy critic; heavier infra than needed |
| RAGEN / StarPO | 2504.20073 | | #2 | High | Names and diagnoses the “Echo Trap” entropy collapse |
| Turn-level reward cluster | 2505.11821 +5 | | #4 | Promising (cluster) | Turn is the unit of advantage; cross-group consensus |
| GSPO | 2507.18071 | | | Med-High | Sequence-level clip; stability grows more relevant with length |
| Verlog | OpenReview only | | | Promising | Dual-discount GAE + memory-window + early truncation; 400+ turns |
| Demystifying long-horizon RL | 2603.21972 | | | Promising | 5-axis empirical recipe: reward/scale/data/algo/env |
| PEARL | 2601.20439 | | #2 | Promising | RL over the planning/tool-sequencing step itself |
| ReTool / ToRL / Search-R1 | 2504.11536 / 2503.23383 / 2503.09516 | ToRL:✓ | #1 | High/Verified | Tool-output tokens masked from PG loss; ToRL = emergent tool strategies |
| CTF-Dojo / Pentest-R1 / HackSynth (academic, context only) | 2508.18370 / 2508.07382 / 2506.02048 | | | Context only | NOT a basis for this page’s claims; see §1.9 re-grounding on DeepSeek-R1/RAFT/GiGPO/Verlog/PSN-RLVR |
| Entropy Mechanism / Clip-Cov / KL-Cov | 2505.22617 | | #2 #3 | High | Fitted entropy-vs-performance law; surgical per-token fix |
| Clip-Low/Clip-High asymmetry | 2509.26114 | | #2 #3 | Med-High | Two independent clip knobs, not one symmetric one |
| DAPO | 2503.14476 | | #2 #3 #4 | High | Clip-higher + dynamic sampling: the RL baseline recipe |
| High-entropy minority tokens | 2506.01939 | | #3 | Med-High | Only ~20% of tokens carry the exploration-relevant decisions |
| Positive-Advantage Reweighting | 2511.05993 | | #2 #3 | Medium | Independent confirmation of the entropy-collapse mechanism |
| RAFT / Reinforce-Rej | 2504.11343 | | | High | Validates this project’s own rejection-sampling-SFT phase |
| Parameter-space noise / PSN-RLVR | 1706.01905 / 2602.02555 | | #2 #3 | High / Low-med | Temporally-coherent exploration; gains grow with length |
| Multi-temperature | 2510.08892 | | #2 | Medium | High temp on fork tokens, low temp on payload/syntax tokens |
| DIVER | 2509.26209 | | #1 #2 #3 | Med-High | Reward group-level diversity, potential-based shaping |
| CDE | 2509.09675 | | #3 | Med-High | Perplexity + critic-variance curiosity bonus; calibration fix |
| MERCI | 2510.16614 | | #1 #2 | Medium | Count-based novelty; deterministic-MDP assumption breaks for live envs |
| Representation-based exploration | 2510.11686 | | #3 | Med-High | Negative result: high temp and rep-diversity fight each other |
| Pass@k diagnostic / PKPO / Pass@k Training | 2511.16231 / 2505.15201 / 2508.10751 | | #5 | High/Med/Med | Pass@k’s gradient vanishes as policy concentrates unless corrected |
| NuRL | 2509.25666 | | #4 | Medium | Self-generated hints unlock currently-0%-pass-rate prompts |
| DeepSeek-R1 | 2501.12948 | | #2 | High | The staging recipe this project’s own path mirrors |
| Dr.GRPO | 2503.20783 | | #3 | High | Removes GRPO’s length-bias reward artifact |
| LIMO | 2502.03387 | | #3 | Med-High | SFT set quality/technique-diversity over raw count |
| Test-time compute scaling | 2408.03314 | | #4 | High | Turn budget is a resource-allocation problem, not a skill one |
| ProRL | 2505.24864 | | #4 | Med-High | Prolonged + KL-control + reference-reset genuinely expands boundary |
| Does RL Really Incentivize… | 2504.13837 | ✓(neg) | | Medium | CONTESTED vs. ProRL: short-run RLVR narrows, doesn’t expand |
| Spurious Rewards | 2506.10947 | ✓(neg) | | High (scope-ltd) | A wrong reward can still “work”; validate across model families |
| Reward-hacking cluster | 2605.02269 / 2604.15149 / 2604.13602 | ✓(neg) | | Medium | RL causally increases spec-gaming; audit the verifier’s extensional gaps |
| Absolute Zero | 2505.03335 | | | Medium | Self-play, zero external data; speculative curriculum idea |
| Kimi K2 / MUA-RL | 2507.20534 / 2508.18669 | | | Med/Med | Agentic capability is a distinct training target, not RLVR side-effect |