From behavioral audit to training signal

The BSides LV CFP audit (189 runs, 4 frontier models, 68,049 Phoenix spans) is not a benchmark score — it’s a behavioral trace. Each of its 5 findings describes a shape of failure, and each shape implies a different kind of gap: exploration, methodology/reasoning, credit assignment, or eval validity. This page maps observed behavior → gap type → the literature’s actual fix → how you’d check the fix worked, for each of the 5 patterns, against this project’s real setup (rejection-sampling SFT on verifier-passed solves now, GRPO/RLVR when entropy collapses; ground-truth flag verifier; knowledge in tools not weights).

flowchart LR
    A[observed behavior] --> B[gap type]
    B --> C[training signal / method]
    C --> D[verification check]
    D -.re-audit.-> A

None of this relitigates the diagnosis (execution gap, not knowledge gap) — it asks a narrower question per pattern: given this specific failure shape, which lever in the SFT→GRPO graduation actually targets it, and how would you know if it worked?

Pattern 1 — Agents prefer their own tools (87.7% of calls bypass the rich tool surface; 26/40 tools dead)

Observed: raw curl/shell dominates; the bespoke sectools surface is mostly unused.

Gap type [E]: this is not a knowledge gap (the agent isn’t ignorant of the tools — they’re in its context) and not really a reasoning gap. It’s an exploration-of-tool-space problem: tool choice is driven by the model’s pretraining prior (shell/curl is high-likelihood, familiar, “free” under next-token probability) rather than by the tool’s actual value for the subtask. Faghih et al., “Tool Preferences in Agentic LLMs are Unreliable,” arXiv:2505.18135 (2025-05-23) shows this is gameable by description text alone — edited docstrings shift usage >10x with zero change in tool capability. Confidence: established (controlled cross-model study). This is a diagnosis, not a fix — it just rules out “better docstrings” as a ceiling-breaking move.

Training signal:

Designed to fix: pattern 1 — agents defaulting to raw shell/curl over the provided tool surface.

ToolRL (Qian et al., arXiv:2504.13958, 2025-04-16, established) — don’t SFT-imitate tool traces; decompose the reward into per-call terms so which tool was chosen is its own learnable signal, not folded into one coarse outcome score:
```
r_format  = valid_json_schema(call)          # 0/1
r_tool    = tool_is_appropriate(call, state)  # 0/1, graded by task type
r_param   = params_correct(call)              # 0/1 or partial
r_outcome = env_feedback(call)                # sparse, terminal-heavy — your flag verifier
reward = w1*r_format + w2*r_tool + w3*r_param + w4*r_outcome
```
Their ablation: reward granularity (per-call beats per-episode) and reward type (graded beats binary) both matter. This is the cheapest training-side lever for this pattern: the tool registry already exists, so the only new piece is the auxiliary signal.
ReTool (Feng et al., arXiv:2504.11536, 2025-04-15, established) — trains the whole trajectory end-to-end so the policy learns when in a reasoning chain to reach for a tool, not just which one given an isolated decision point. Confirms the shape of the project’s SFT→GRPO plan; its specific addition is that real sandbox tool-execution output (not a paraphrase) must be in the rollout context that gets scored — worth auditing whether secagent/daytona feeds live stdout into the trajectory used for reward.
Tool-Star (Dong et al., arXiv:2505.16410, 2025-05-22, promising, <6mo/low-citation) — naive RL under-explores a large tool inventory because gradient concentrates on whatever already gets used. Its fix: manufacture forced/hinted rollouts that exercise under-used tools before RL, verify with the real environment, fold verifier-passed ones into SFT data. Directly targets the “26/40 dead” number — and flags a self-reinforcing trap: a rejection-sampling corpus built from the current curl-biased policy will never contain a dead tool succeeding, because the policy never tried it. RL alone has ~zero probability mass to reinforce on those tools.
Search-R1 (Jin et al., arXiv:2503.09516, 2025-03-12, established) contributes one mandatory engineering detail, domain-general: mask tool/environment output tokens from the policy-gradient loss — you don’t want to reinforce or penalize text the environment produced (shell stdout, HTTP bodies), only the model’s own query/command-generation tokens.
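The Search-R1 masking detail above can be sketched as follows. This is a minimal illustration, assuming a per-token loss and a 0/1 mask marking policy-generated tokens are already available; `masked_pg_loss` and its argument names are hypothetical and not taken from Search-R1's code.

```python
import numpy as np

def masked_pg_loss(per_token_loss, is_model_token):
    """Average a per-token policy-gradient loss over model-generated tokens only.

    per_token_loss: shape (T,), loss for every token in the rollout context.
    is_model_token: shape (T,), 1 where the policy produced the token,
                    0 where the environment did (shell stdout, HTTP bodies).
    """
    loss = np.asarray(per_token_loss, dtype=float)
    mask = np.asarray(is_model_token, dtype=float)
    n_model = mask.sum()
    if n_model == 0:
        return 0.0  # nothing the policy wrote, so nothing to learn from
    return float((loss * mask).sum() / n_model)

# Tokens 2 and 3 stand in for tool output; their loss values are ignored.
loss = masked_pg_loss([0.5, 1.5, 9.0, 9.0, 1.0], [1, 1, 0, 0, 1])
```

The denominator counts only model tokens, so a rollout with long tool output is not down-weighted relative to one with short tool output.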

Honest caveat: ToolRL, ReTool, and Tool-Star are validated on general agentic tool use, not offensive-security tool use specifically; the transplant to sectools is this researcher’s inference, not literature-verified. Random-Crypto/HackSynth (arXiv:2506.02048), an academic domain-specific CTF RL paper cited for context only, reports that vanilla GRPO on crypto CTFs yields generalization gains that its authors attribute to improved tool usage. Per this project’s standing rule, academic cybersecurity-LLM training/benchmark papers don’t count as frontier evidence, so it is mentioned but not relied on. The actual basis for “RL over tool choice transfers” stays the general-agentic evidence above (ToolRL/ReTool/Tool-Star): domain-general RL-for-tool-use results that don’t need a CTF-specific data point to hold.

Verify the fix: track the tool-usage histogram across all 40 sectools entries before/after training — do previously-dead tools get invoked at all on held-out challenges (not just replayed on training challenges)? Cheap, no-training-required companion check: DIVER, arXiv:2509.26209 (2025-09-30) rewards pairwise diversity across a rollout group — if adapted to “which tools/commands were used” rather than token text, a rising group-diversity score on tool choice is a leading indicator before solve-rate moves at all.
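Both checks above can be sketched in a few lines. This assumes tool calls have already been extracted from the spans as plain tool-name strings; the tool names are hypothetical, and mean pairwise Jaccard distance is a stand-in for a group-diversity score, not DIVER's own metric.

```python
from collections import Counter
from itertools import combinations

def dead_tools(registry, calls):
    """Tools in the registry that were never invoked in `calls`."""
    used = Counter(calls)
    return sorted(t for t in registry if used[t] == 0)

def group_tool_diversity(rollout_tool_sets):
    """Mean pairwise Jaccard distance between the tool sets of rollouts in one group."""
    pairs = list(combinations(rollout_tool_sets, 2))
    if not pairs:
        return 0.0

    def jaccard_distance(a, b):
        union = a | b
        return 0.0 if not union else 1.0 - len(a & b) / len(union)

    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

registry = ["curl", "shell", "port_scan", "dir_enum"]    # hypothetical tool names
held_out_calls = ["curl", "shell", "curl", "port_scan"]  # calls on held-out challenges
still_dead = dead_tools(registry, held_out_calls)

group = [{"curl"}, {"curl", "port_scan"}, {"dir_enum"}]  # one rollout group
diversity = group_tool_diversity(group)
```

Running `dead_tools` on held-out challenges only, before and after training, gives the "do dead tools wake up" number directly; the diversity score is meant to be tracked per rollout group over training steps.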

Pattern 2 — React-and-guess, no methodology (no wordlists/checklists/PTES sequencing; 82% pivot after failure)

Observed: no visible recon→enum→exploit sequencing; the agent abandons a line of attack after one failed attempt instead of enumerating alternatives methodically.

Gap type [R]: a sequencing prior is missing — the model has PTES-shaped knowledge somewhere in its weights (it can describe methodology if asked) but doesn’t apply it as an ordering constraint during live rollouts. This reads as reasoning/methodology, not exploration-in-the-entropy sense, though the two compound (§ below, and see long-horizon.md → RAGEN discussion of the “Echo Trap”).

Training signal:

Designed to fix: pattern 2 — no methodology / premature pivoting after a single failed attempt.

Two-stage guide-then-explore RL, grounded in general RL theory: Jump-Start Reinforcement Learning (JSRL), Uchendu et al., arXiv:2204.02372 (2022-04-05, established, 17+ citations), is the domain-general theoretical basis for the two-stage shape. A guide-policy (built from offline data, demonstrations, or an existing policy) forms a curriculum of starting states, and an exploration-policy is trained forward from those states. Naive initialize-then-finetune underperforms this because value-based methods handle a cold-start policy poorly. Mapped onto this project’s actual pipeline:
```
# Stage 1 — guide-policy: rejection-sampling SFT on verifier-passed walkthroughs
# (methodology prior — ordered recon → enum → exploit, not just any verifier-passed trace)
# Stage 2 — exploration-policy: online GRPO/RLVR in the live sandbox
# reward = terminal flag-verified {0,1} + intermediate env feedback (command succeeded/failed)
# JSRL's curriculum knob: anneal how far into the guide-trajectory the exploration-policy starts,
# rather than always starting cold — cheap to add to the existing rejection-sampling corpus.
```
DeepSeek-R1’s own disclosed recipe (R1, arXiv:2501.12948) is the frontier-lab confirmation that this shape works at scale: “cold-start SFT data, then RL” outperforms RL-from-scratch specifically because RL-from-scratch on an unguided policy wastes exploration budget relearning basic structure the SFT stage would have given it for free — the same failure mode JSRL formalizes. Academic CTF/pentest work — cited for context only, not a basis: Pentest-R1 (Kong et al., arXiv:2508.07382, evaluated on Cybench + AutoPenBench) reports the identical two-stage shape in the CTF domain specifically and claims both stages are required, order matters, and stage-1 data must be walkthrough-shaped rather than raw verifier-passed traces. Per this project’s standing rule, academic cybersecurity-LLM training/benchmark papers (Pentest-R1, AutoPenBench, and similar) don’t count as frontier evidence for a decision — the load-bearing basis here is JSRL + DeepSeek-R1’s cold-start-then-RL recipe, both domain-general. If Pentest-R1’s specific findings (walkthrough-shaping matters, order matters) turn out to matter operationally, that’s this project’s own thing to verify empirically, not something to inherit from an academic CTF paper.
Structured attack trees / ATT&CK scaffolding (Nakano et al., arXiv:2509.07939, 2025-09-09, established mechanism, but this is scaffolding not weight-training — flag the distinction) — externally constrain rollout-time reasoning with a deterministic task tree built from MITRE ATT&CK’s kill-chain, filtering unproductive actions. Reports 71.8–78.6% subtask completion vs. 13.5–75.7% for self-guided reasoning at far fewer queries — a large, reproducible gap. The cheapest lever in this whole page: use the ATT&CK tree as a rollout-generation scaffold to harvest higher-quality, more methodical verifier-passed trajectories for the SFT corpus right now, with zero RL infrastructure — and it keeps knowledge in the tree (a prompt/tool artifact), not baked into weights, consistent with the project’s “knowledge in tools not weights” rule.
PEARL (Wang et al., arXiv:2601.20439, 2026-01-28, promising) — treats the planning step (which tools, in what order) as its own object of RL, rather than only optimizing the final answer. Directly relevant as a formal mechanism for systematic tool-sequencing instead of react-and-guess, though not yet validated outside general multihop tool-use.
Adjacent, unverified beyond title: classical-planner-hybridized LLM agents (arXiv:2512.11143) — a stronger, harder version of the ATT&CK-tree idea; worth a follow-up only if the tree scaffold proves insufficient.
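The JSRL curriculum knob mentioned in the list above can be sketched as a schedule over how many guide-trajectory steps are replayed before the learner takes over. This is a minimal sketch under stated assumptions: the linear schedule, the function names, and the 0.5 success threshold are all illustrative choices, not values from the JSRL paper or this project.

```python
def guide_steps(stage, n_stages, guide_len):
    """Number of guide-trajectory steps to replay before the learner takes over.

    Stage 0 starts the learner near the end of the guide trajectory (easy);
    the final stage starts it cold at step 0 (no guidance).
    """
    if n_stages < 2:
        return 0
    frac = 1.0 - stage / (n_stages - 1)
    return round(frac * (guide_len - 1))

def next_stage(stage, n_stages, recent_success_rate, threshold=0.5):
    """Advance the curriculum only once the learner succeeds often enough."""
    if recent_success_rate >= threshold and stage < n_stages - 1:
        return stage + 1
    return stage

# A 21-step walkthrough split into 5 curriculum stages.
schedule = [guide_steps(s, n_stages=5, guide_len=21) for s in range(5)]
```

Gating the stage advance on measured success rate, rather than on step count, keeps the learner from being pushed to colder starts before it can solve from warmer ones.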

Honest caveat: the load-bearing basis for the two-stage recipe above is domain-general (JSRL, DeepSeek-R1’s cold-start-then-RL disclosure), not the CTF-specific Pentest-R1 paper — per this project’s standing rule, an academic CTF/pentest paper being “in-domain” doesn’t make it stronger evidence, it makes it out of scope as a basis (no academic cybersecurity-LLM project has produced a frontier model). The ATT&CK-tree scaffold (Nakano et al.) is a separate, non-cybersecurity-specific mechanism (deterministic task-tree filtering built on a public taxonomy, evaluated as scaffolding not weight-training) and isn’t subject to the same caveat. The main open question either way is whether a walkthrough corpus built from this project’s own data generalizes past the ~10 canonical PD26 challenges it would initially be built from — that’s this project’s own thing to test.

Verify the fix: measure PTES-phase coverage per episode (recon steps taken before first exploit attempt) and the pivot-after-failure rate (does the agent try ≥2 alternatives before abandoning a technique?) on held-out challenges, pre/post. Both are derivable from existing Phoenix spans without new instrumentation.
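Both metrics can be computed from a flat list of steps. The sketch below assumes each Phoenix span has already been reduced to a dict with `phase`, `technique`, and `ok` fields; that schema is hypothetical, and "alternatives" is read here as repeated attempts within one technique before switching.

```python
from itertools import groupby

def recon_steps_before_first_exploit(steps):
    """Count recon/enum steps that occur before the first exploit attempt."""
    count = 0
    for step in steps:
        if step["phase"] == "exploit":
            break
        if step["phase"] in ("recon", "enum"):
            count += 1
    return count

def premature_pivot_rate(steps, min_attempts=2):
    """Share of abandoned techniques dropped after fewer than `min_attempts` tries.

    A technique counts as abandoned when a run of consecutive attempts never
    succeeds and is followed by a different technique.
    """
    runs = [list(g) for _, g in groupby(steps, key=lambda s: s["technique"])]
    abandoned = [r for r in runs[:-1] if not any(s["ok"] for s in r)]
    if not abandoned:
        return 0.0
    premature = sum(1 for r in abandoned if len(r) < min_attempts)
    return premature / len(abandoned)

episode = [
    {"phase": "recon", "technique": "port_scan", "ok": True},
    {"phase": "enum", "technique": "dir_enum", "ok": False},
    {"phase": "exploit", "technique": "sqli", "ok": False},
    {"phase": "exploit", "technique": "sqli", "ok": False},
    {"phase": "exploit", "technique": "ssti", "ok": True},
]
coverage = recon_steps_before_first_exploit(episode)
pivot_rate = premature_pivot_rate(episode)
```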

Pattern 3 — Good guessers until they’re not (≈half of solves hinge on an ungrounded guess; brittle to a wrong first guess)

Observed: many successful runs pivot on a single ungrounded guess at a critical step; when that guess is wrong, the agent rarely recovers.

Gap type [E]/[R]: this is where entropy collapse and reasoning intersect — a policy that has already spent its exploration budget on one high-probability guess has no remaining probability mass on alternatives when that guess fails. Cui et al., “The Entropy Mechanism of RL for Reasoning Language Models,” arXiv:2505.22617 (2025-05-28, high confidence, mechanistic + empirical) is the underlying diagnosis this pattern shares with pattern 4: entropy falls monotonically and predictably (R = -a·exp(H) + b), and the mechanism is the covariance between a token’s probability and its advantage — the model reinforces what’s already likely rather than exploring what’s uncertain.
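Because R = -a·exp(H) + b is linear in exp(H), the two coefficients can be fitted from logged (entropy, reward) pairs with ordinary least squares. The sketch below uses synthetic numbers chosen purely for illustration; `fit_entropy_curve` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def fit_entropy_curve(entropy, reward):
    """Least-squares fit of R = -a * exp(H) + b; returns (a, b)."""
    x = np.exp(np.asarray(entropy, dtype=float))
    slope, intercept = np.polyfit(x, np.asarray(reward, dtype=float), 1)
    return -slope, intercept

# Synthetic logs generated from a = 0.2, b = 0.9 (illustrative numbers only).
H = np.array([1.0, 0.7, 0.4, 0.2, 0.1])
R = -0.2 * np.exp(H) + 0.9
a, b = fit_entropy_curve(H, R)
ceiling = b - a  # predicted reward once entropy reaches 0, since exp(0) = 1
```

If the fit holds on a real run, `ceiling` is the reward the run is heading toward unless something intervenes on entropy, which makes it a useful early-warning number.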

Training signal:

Designed to fix: pattern 3 — brittle single-guess behavior with no recovery on failure.

SCoRe, self-correction via RL (Kumar et al., arXiv:2409.12917, 2024-09-19, established, canonical DeepMind paper): SFT on (wrong→right) correction pairs barely transfers because it’s distribution-mismatched; only multi-turn RL with reward shaped to reward improvement, not just final correctness, produces genuine revision instead of collapsing to “be right on turn 1” followed by a second turn that barely edits the answer:
```
r1 = verifier(attempt_1)
r2 = verifier(attempt_2)
reward = r2 + alpha * max(0, r2 - r1)   # bonus specifically for turning a miss into a hit
```
Naming collision, flag explicitly: a different Sept-2025 paper reuses “SCoRe” for teacher-corrected earliest-error localization + short-horizon RL-continue-from-verified-prefix (Lyu et al., arXiv:2509.14257, promising, 7B-matches-72B claim unreplicated). Cite by arXiv ID, not the shared name. If a 100-turn episode fails at a single identifiable wrong-guess turn (which matches this exact BSides finding), earliest-error-localized short-horizon RL is much cheaper credit assignment per episode than scoring the whole trajectory as one unit.
CDE, curiosity-driven exploration (Dai et al., arXiv:2509.09675, 2025-09-11, medium-high, ICLR 2026 poster): adds an actor-side perplexity bonus (reward the model for being “surprised” by its own output). The paper also reports a calibration-collapse finding as a byproduct: the policy becomes confident regardless of correctness, which the perplexity bonus specifically counters. This is the literature-side twin of “commits to an ungrounded guess and doesn’t recover,” and it is cheap to try (no extra network, just log-perplexity of the model’s own rollout) before anything heavier.
Representation-based exploration (Tuyls et al., arXiv:2510.11686, 2025-10-13, medium-high) — an inference-time-only lever, not training: build a diverse k-of-N pass@k pool from hidden-state dissimilarity instead of random k-of-N. Notable negative result: this is anti-composable with high-temperature sampling — high-temp outputs look “novel” in representation space without being useful. Worth an ablation on whatever temperature the pass@k>1 eval pool currently uses, since it changes nothing about training and answers whether current sample diversity is real strategic variance or just noisier repeats.

Honest caveat: SCoRe (both papers) and CDE are domain-general (math/code) — the CTF/100-turn transfer is inference, not literature-verified. One concrete, cheap check the project can run without any new training: confirm the reward/credit-assignment scheme doesn’t implicitly favor shorter successful trajectories — a 60-turn recovery-from-wrong-guess success should score the same as a 10-turn lucky first guess; if it doesn’t, the reward is actively working against fixing this pattern regardless of which paper’s fix gets adopted.
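The length-invariance check above fits in a few lines. In this sketch `trajectory_reward` is a hypothetical stand-in for the project's real reward function (swap in the actual one), and the per-turn cost in the counter-example is an invented number that only shows what a failing reward looks like.

```python
def trajectory_reward(solved, n_turns):
    """Hypothetical stand-in for the project's reward: terminal flag check only."""
    return 1.0 if solved else 0.0

def favors_short_successes(reward_fn, short_turns=10, long_turns=60, tol=1e-9):
    """True if a short success scores higher than a long one under reward_fn."""
    return reward_fn(True, short_turns) - reward_fn(True, long_turns) > tol

def length_penalized_reward(solved, n_turns):
    """Counter-example: a per-turn cost silently prefers the lucky short solve."""
    return (1.0 if solved else 0.0) - 0.005 * n_turns

flat_is_biased = favors_short_successes(trajectory_reward)
penalized_is_biased = favors_short_successes(length_penalized_reward)
```

Discounting and per-token loss normalization can introduce the same bias without an explicit turn penalty, so the check is worth running against the reward as it actually reaches the optimizer.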

Verify the fix: on a held-out set, measure whether backtracking-after-a-wrong-guess correlates with eventual success pre/post training (does the trained policy actually try a second technique after the first fails, and does that second attempt land more often?).

Pattern 4 — Uneven PTES phases: strong at chaining inside an exploit, weak at thorough enumeration (62% of failures stall in exploitation)

Observed: the agent is good at following a known exploit chain once inside it, but weak at the enumeration/reconnaissance breadth that would get it there in the first place; most failures stall during exploitation rather than in an earlier phase.

Gap type [L]/[E]: two compounding gaps. First, entropy collapse narrows the recon repertoire — once RL reinforces whatever enumeration path happened to work once, alternative recon strategies stop being tried (same mechanism as pattern 3, arXiv:2505.22617). Second, a long-horizon credit-assignment problem: a flat terminal-only reward over a ~100-turn episode gives early, correct enumeration steps the same undifferentiated credit as late exploitation steps — so even when enumeration was necessary for the eventual win, nothing in the reward signal reinforces it specifically.

Training signal:

Designed to fix: pattern 4 — weak/uneven enumeration and stalling mid-exploitation.

DAPO (Yu et al., arXiv:2503.14476, 2025-03-18, established, widely reproduced) + the Entropy Mechanism paper (arXiv:2505.22617) together are the baseline recipe, not an optional add-on, once GRPO starts: clip-higher (decouple the PPO clip range so rare-but-good tokens aren’t capped as hard as likely ones) and dynamic sampling (drop degenerate all-correct/all-wrong prompt groups, which otherwise contribute zero gradient — this matters disproportionately here since a whole-group-zero-reward challenge at ~100 turns/rollout is expensive to keep resampling; consider curriculum-filtering to the 30–60% pass-rate band the project already targets, rather than paying for blind resamples on genuinely-unsolved challenges).
```
import numpy as np

def dapo_pg_loss(logp_new, logp_old, adv, eps_low=0.20, eps_high=0.28):
    # decoupled clip (vanilla PPO/GRPO: symmetric 0.20/0.20)
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1 - eps_low, 1 + eps_high)
    # per-token, mean over ALL tokens in batch (inputs are NumPy arrays)
    return -np.minimum(ratio * adv, clipped * adv).mean()
```
GiGPO (Feng et al., arXiv:2505.10978, 2025-05-16, established, NeurIPS 2025 poster) — the most directly transplantable long-horizon idea in this map, and it needs no new infra: no critic, no extra rollouts. It adds a second, step-level advantage on top of GRPO’s trajectory-level one by retroactively hashing (state, step) pairs that recur across rollouts and scoring “what happened next” conditioned on that shared state — pure post-hoc bookkeeping on trajectories already sampled. This is exactly the mechanism that could stop rewarding all 100 turns equally when only the exploitation phase decides the outcome. Failure mode to flag honestly: GiGPO’s state-hashing was validated on benchmarks with hashable states (web pages, grid worlds); a CTF agent’s state is unbounded free text (shell/HTTP output), so state-canonicalization needs a bespoke similarity function — e.g. (tool_name, normalized_target, response_status_class) from the existing tool_call/tool_exec_ms spans — rather than a raw-text hash, or the anchor groups never fire.
HiPER / hindsight credit assignment (Peng et al., arXiv:2602.16165, 2026-02-18; Tan et al., arXiv:2603.08754, 2026-03-07, both promising, 0 citations, evaluated on WebShop/ALFWorld — domain transfer to cybersecurity is inference, not literature-verified) — explicit hierarchical decomposition: a planner proposes subgoals (recon-done, foothold-gained, priv-esc-done, flag-captured), each independently checkable against environment state, with terminal reward at the flag level and intermediate credit at the subgoal level. The PTES phases already tracked are a natural subgoal taxonomy for this — no new taxonomy needed. If the flag-verifier is extended to also check an intermediate condition (“foothold confirmed” via sandbox state), that stays deterministic and doesn’t violate the ground-truth-verified-reward rule.
RL-PLUS (Dong et al., arXiv:2508.00222, 2025-07-31, promising, strong math/code ablations, cybersecurity transfer untested) — names “capability boundary collapse” directly: pass@k at large k drops under pure on-policy RLVR even as pass@1 rises, i.e. uneven phases get more uneven as training narrows the policy. Its fix: mix verifier-passed off-policy trajectories (already produced by rejection sampling!) into GRPO via importance-sampling correction, plus an advantage bonus for visiting under-explored-but-successful states. Concretely: don’t discard the rejection-sampling SFT corpus once GRPO starts — feed it back in as off-policy anchor data.
NuRL (Chen et al., arXiv:2509.25666, 2025-09-30, medium, consistent gains across six benchmarks/three models): targets prompts with zero reward across every rollout in the group, which vanilla GRPO simply cannot learn from (zero gradient). Self-generates a hint conditioned on the gold answer, re-rolls with the hint injected, trains on the hint-augmented rollout, then drops the hint at inference. This is the strongest candidate for the ~800 currently-unsolved challenges in the portfolio, but the “gold answer” analogue doesn’t port zero-shot: the flag itself isn’t the how-to-get-there knowledge, so a CTF adaptation needs a hint source (walkthrough/verifier metadata, or a successful trajectory from a similar challenge family) that doesn’t yet exist and requires design work.
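The state-canonicalization idea from the GiGPO item above can be sketched as follows. This is a simplified illustration, not GiGPO's implementation: the span fields (`tool_name`, `target`, `status`) are hypothetical, each span is treated as the observation the agent acted on at that step, and the advantage is mean-centered only (no standard-deviation normalization).

```python
from collections import defaultdict

def canonical_state(span):
    """Collapse a free-text tool span into a small hashable key (hypothetical fields)."""
    status = span["status"]
    status_class = f"{status // 100}xx" if isinstance(status, int) else "none"
    return (span["tool_name"], span["target"].lower().rstrip("/"), status_class)

def step_advantages(rollouts, gamma=0.95):
    """Step-level advantage: discounted return-to-go minus the mean over the same anchor state.

    rollouts: list of (spans, terminal_reward) pairs; returns one list of floats per rollout.
    """
    groups = defaultdict(list)  # canonical state -> [(rollout idx, step idx, return-to-go)]
    for i, (spans, reward) in enumerate(rollouts):
        n = len(spans)
        for t, span in enumerate(spans):
            ret = reward * gamma ** (n - 1 - t)
            groups[canonical_state(span)].append((i, t, ret))
    adv = [[0.0] * len(spans) for spans, _ in rollouts]
    for members in groups.values():
        if len(members) < 2:
            continue  # a state seen once gives no comparison, so its advantage stays 0
        mean = sum(r for _, _, r in members) / len(members)
        for i, t, ret in members:
            adv[i][t] = ret - mean
    return adv

solved = ([
    {"tool_name": "http_get", "target": "http://T/login/", "status": 200},
    {"tool_name": "http_get", "target": "http://t/admin", "status": 403},
], 1.0)
failed = ([
    {"tool_name": "http_get", "target": "HTTP://T/login", "status": 200},
    {"tool_name": "shell", "target": "t", "status": None},
], 0.0)
adv = step_advantages([solved, failed])
```

The two login-page spans differ in raw text but share one canonical key, so they form an anchor group; a raw-text hash would have kept them apart and the group would never fire.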

Verify the fix: instrument mean policy entropy from step 0 of any GRPO run (this alone is diagnostic, not a fix — but without it you cannot tell “the task is hard” from “the policy already collapsed at step 50 and is just getting faster at one script”); track per-PTES-phase solve/stall rates before and after each intervention layer (DAPO → GiGPO → RL-PLUS/NuRL); and run RL-PLUS’s own diagnostic — pass@k at large k on the base model vs. the trained checkpoint — to check whether exploitation-phase gains are holding or narrowing over training.
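The entropy instrumentation above amounts to one function over the logits of each rollout. A minimal NumPy sketch, assuming logits of shape (T, V) and a 0/1 mask for policy-generated positions are available from the training loop; the function name is hypothetical.

```python
import numpy as np

def mean_token_entropy(logits, is_model_token):
    """Mean Shannon entropy (nats) of the next-token distribution over the policy's own tokens.

    logits: shape (T, V); is_model_token: shape (T,), 1 for policy-generated positions.
    """
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)           # stabilise the softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    entropy = -(np.exp(logp) * logp).sum(axis=-1)   # shape (T,)
    mask = np.asarray(is_model_token, dtype=float)
    return float((entropy * mask).sum() / max(mask.sum(), 1.0))

uniform = mean_token_entropy([[0.0, 0.0, 0.0, 0.0]], [1])   # maximally uncertain
peaked = mean_token_entropy([[50.0, 0.0, 0.0, 0.0]], [1])   # effectively collapsed
```

Logging this value from step 0 gives the curve needed to tell a hard task from a collapsed policy.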

Pattern 5 — Benchmarks measure pattern-match speed, not thoroughness/methodology/robustness

Observed: a rising solve-rate number doesn’t by itself tell you whether the agent generalized a strategy or pattern-matched something close to a memorized/leaked challenge shape.

Gap type [N]: this is an eval-methodology gap, not directly a training-loop gap, but it’s the project’s own check on whether the SFT→GRPO pipeline is doing what axis [N] (novelty/boundary-expansion) requires, versus merely amplifying what the base model already does.

Training signal (mostly diagnostic, one eval-recipe change, one eval-recipe addition):

Designed to fix: pattern 5 — solve-rate gains that can’t be told apart from elicitation/memorization.

Yue et al., “Does RL Really Incentivize Reasoning Capacity Beyond the Base Model?,” arXiv:2504.13837 (2025-04-18, established, large systematic study across model families/algorithms): across math/code/visual-reasoning and 6 popular RLVR algorithms, the base model catches up and overtakes at large k even when RL wins at pass@1, because the patterns RL concentrates on were already latent in the base model’s sampling distribution. This paper’s own recommended fix (evaluate pass@k at large k, not just pass@1) is already this project’s locked methodology (pass@k, k=3/5/10 bands per decisions/2026-06-11-ctf-benchmark-pass-count.md); the missing piece is running the base model at the same pass@k bands as a mandatory control. If base-model pass@10 ≈ trained-model pass@10 on a chunk of challenges, that chunk’s improvement is elicitation, which is fine to attribute to the SFT stage (SFT is explicitly meant to replace/instill, not expand) but would be a red flag if it persists after GRPO, where genuine capability gain is expected.
Semantics-preserving-transform robustness testing, grounded in general (non-cybersecurity) evidence: GSM-Symbolic, Mirzadeh et al. (Apple), arXiv:2410.05229 (2024-10-18, established, frontier-lab work), is the domain-general instance of the same failure mode. Regenerating GSM8K-style math questions from symbolic templates (same structure, only surface values/names changed) shows LLM solve rates drop and get noticeably more variable under these semantics-preserving perturbations, and performance degrades further as templates add clauses that shouldn’t change the answer. This is established, widely cited evidence (independent of any cybersecurity-domain paper) that a rising benchmark number can reflect pattern-matching on the specific surface form rather than a robust, generalized method, which is exactly the risk pattern 5 flags for this project’s own PD26 solve-rate numbers. Concrete recipe for this project: build semantics-preserving variants of the existing PD26-01..10 challenges (renamed services/users, reordered but logically identical steps, cosmetic code/config changes that don’t alter the exploit path) and check whether a claimed solve-rate jump transfers to a transformed variant before crediting it to “the agent got better at CTF,” mirroring GSM-Symbolic’s methodology directly. Academic CTF-benchmark work, cited for context only and not as a basis: Capture the Flags / Evolve-CTF (Honarvar et al., arXiv:2602.05523) runs the identical idea in the CTF domain specifically (source-transformation families, composed-obfuscation degradation) and would be the more directly transplantable recipe if it counted as evidence. Per this project’s standing rule it’s an academic domain-specific CTF-benchmark paper, so it doesn’t serve as the basis here; GSM-Symbolic does.

Honest note: this pattern’s fix is mostly an eval-protocol addition, not a new training signal; the “fix” for pattern 5 is really making patterns 1–4’s fixes falsifiable. Both checks above are cheap (no new training) relative to any of the RL infrastructure work in patterns 1–4 and should run before crediting the first GRPO graduation with “fixing execution,” not after.

Verify the fix: run the base-model pass@k-at-large-k control (the Yue et al. check above) alongside every RL-trained checkpoint’s pass@k; separately, generate semantics-preserving-transform variants of a held-out challenge subset and check whether solve-rate gains survive the transform. If gains evaporate on either check, that’s elicitation/memorization, not the execution-reliability improvement the rejection-sampling-SFT diagnosis is banking on.
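The base-model control can be scripted with the standard unbiased pass@k estimator (Chen et al., 2021). In the sketch below the per-challenge sample counts are invented for illustration, and the 0.05 margin and the `elicitation_suspects` helper are assumptions rather than project-defined values.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k from n samples with c successes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def elicitation_suspects(base, trained, k, margin=0.05):
    """Challenges where trained pass@k exceeds base pass@k by no more than `margin`.

    base / trained: {challenge_id: (n_samples, n_successes)}.
    """
    suspects = []
    for cid, (n_b, c_b) in base.items():
        n_t, c_t = trained[cid]
        if pass_at_k(n_t, c_t, k) - pass_at_k(n_b, c_b, k) <= margin:
            suspects.append(cid)
    return sorted(suspects)

# Illustrative counts only: 20 samples per challenge per model.
base = {"PD26-01": (20, 2), "PD26-02": (20, 0)}
trained = {"PD26-01": (20, 2), "PD26-02": (20, 6)}
suspects = elicitation_suspects(base, trained, k=10)
```

A challenge landing in `suspects` means the trained checkpoint adds nothing at large k that the base model could not already reach by sampling, which is the elicitation signature described above.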

Summary table

Pattern	Gap type	Method designed to fix it	Citation	Confidence
1. Own-tool preference (87.7% bypass, 26/40 dead)	[E] tool-space exploration	ToolRL — decomposed per-call reward	arXiv:2504.13958	Established
1. (same)	[E][N]	Tool-Star — forced exposure to under-used tools pre-RL	arXiv:2505.16410	Promising
1. (same)	[L][E]	ReTool — end-to-end trajectory-level tool-RL	arXiv:2504.11536	Established
2. React-and-guess, no methodology (82% pivot)	[R][L]	JSRL + DeepSeek-R1 cold-start recipe — guide-policy (SFT walkthroughs) then exploration-policy (online RL)	arXiv:2204.02372 / arXiv:2501.12948	Established (domain-general)
2. (same, context only — not a basis)	[R][L]	Pentest-R1 — same two-stage shape, CTF-specific; academic, cited for context, not a basis per standing rule	arXiv:2508.07382	Academic CTF work — not evidentiary
2. (same)	[R]	ATT&CK structured attack-tree scaffold (no training needed)	arXiv:2509.07939	Established (mechanism), scaffold not training
3. Brittle single-guess (~half hinge on a guess)	[E][R]	SCoRe — RL reward for improvement, not final correctness	arXiv:2409.12917	Established
3. (same)	[E][N]	CDE — curiosity/perplexity bonus counters calibration collapse	arXiv:2509.09675	Medium-high
4. Uneven PTES / 62% stall in exploitation	[E][L]	DAPO + Entropy Mechanism — clip-higher, dynamic sampling	arXiv:2503.14476 / arXiv:2505.22617	Established
4. (same)	[L][E]	GiGPO — step-level credit via state-hash groups, zero extra rollouts	arXiv:2505.10978	Established
4. (same)	[L][E][N]	HiPER / hindsight credit assignment — PTES-shaped subgoal decomposition	arXiv:2602.16165 / arXiv:2603.08754	Promising, domain-transfer speculative
4. (same)	[E][N]	RL-PLUS — counters capability-boundary collapse w/ off-policy mixing	arXiv:2508.00222	Promising
4. (same, ~800 unsolved tail)	[L][E][N]	NuRL — self-generated hints unlock zero-reward-group prompts	arXiv:2509.25666	Medium
5. Benchmarks measure pattern-match, not thoroughness	[N] eval validity	Base-model pass@k-at-large-k control	arXiv:2504.13837	Established
5. (same)	[N] eval validity	GSM-Symbolic — semantics-preserving-transform degrades solve rate (domain-general)	arXiv:2410.05229	Established
5. (same, context only — not a basis)	[N] eval validity	Evolve-CTF — same idea, CTF-specific; academic, cited for context, not a basis per standing rule	arXiv:2602.05523	Academic CTF work — not evidentiary

What this changes about the plan, concretely

Cheapest, no-training-required move first: the ATT&CK attack-tree scaffold (pattern 2) improves rejection-sampling SFT corpus quality today, before any RL infra exists.
When GRPO starts, DAPO’s clip-higher + dynamic sampling is the baseline, not an optional add-on — it’s simultaneously the fix for patterns 3 and 4’s shared entropy-collapse mechanism.
GiGPO is the single most transplantable long-horizon idea (pattern 4) — zero new infra, just a state-canonicalization function over the existing tool_call spans.
Keep the rejection-sampling SFT corpus as off-policy anchor data through GRPO, not a disjoint earlier stage — RL-PLUS’s argument (pattern 4) applies directly since that data already exists.
Pattern 5’s checks (base-model pass@k control, semantics-preserving-transform families) should run before the first GRPO graduation is credited with anything — they’re the cheapest falsification test available and gate whether patterns 1–4’s fixes actually expanded capability or just re-elicited it.
Contested / open: whether GiGPO/HiPER-style credit assignment transfers from hashable (WebShop/ALFWorld) state spaces to a CTF agent’s unbounded free-text environment state is this project’s own thing to test, not something published literature has already settled — say so plainly if this page gets cited externally.
