Cybersecurity is one of a family — what cracked the others
Every other chapter in this book treats the CTF task as our problem: our harness, our 1000-challenge portfolio, our flag verifier. This chapter argues the opposite framing is more useful: cyber CTF-solving is one instance of a general problem family — {LONG-HORIZON (up to ~100 turns), EXPLORATORY (search/enumeration over a huge space), SPARSE-TERMINAL-REWARD (only the flag is verified, nothing in between), VERIFIABLE (an ungameable checker)} — and at least six other domains share that exact structural signature. Frontier labs and general RL research have been cracking members of this family for a decade. The move this chapter makes is: hold up each domain, find the stage (pretraining / SFT / RL) and the specific technique that actually fixed its long-horizon/sparse-reward problem, and rank what transfers.
Stance, honored throughout: academic cybersecurity-LLM projects (CTF-Dojo, Cyber-Zero, Pentest-R1, HackSynth, AutoPenBench, DRLRM-PT, and siblings) are mention-only, labelled “academic, not a basis” below — no conclusion here rests on them. General coding, competitive programming, theorem proving, deep-research/web agents, games, and robotics are not academic-security work — they are exactly the frontier-lab and general-RL-theory evidence the project’s stance asks for. Everything below is cross-linked to The path to a frontier cybersecurity model, Diagnosing the gap, RL that creates value, and One problem, or many? — this chapter is the cross-domain evidence layer those four already draw on; it does not re-derive their verdicts.
1. The structural frame
Strip away the domain-specific vocabulary (a “vulnerable endpoint,” a “failing unit test,” a “Lean tactic,” a “hidden test case,” a “Montezuma key”) and every member of this family reduces to the same abstract shape:
flowchart TD
subgraph ABSTRACT["Abstract class"]
direction LR
A1["Long-horizon:\nmany sequential\ndecisions"] --> A2["Exploratory:\nhuge search space,\nmost paths fail"]
A2 --> A3["Sparse terminal\nreward: only the\nend state is scored"]
A3 --> A4["Verifiable:\nan ungameable,\nmechanical checker"]
end
subgraph CYBER["Our CTF pipeline"]
direction LR
C1["Recon /\nenumeration\n(~turns 1-20)"] --> C2["Endpoint /\nvuln discovery\n(~turns 20-50)"]
C2 --> C3["Exploit\nchain\n(~turns 50-90)"]
C3 --> C4["Flag read +\nserver-side\nverify {0,1}"]
end
A1 -. maps to .-> C1
A2 -. maps to .-> C2
A3 -. maps to .-> C3
A4 -. maps to .-> C4
The point of drawing the arrow this way: the flag verifier is not special — it is our domain’s instance of “the proof kernel,” “the unit-test suite,” “the hidden Codeforces test cases,” “the reference-answer F1 score,” “the win/loss signal.” Every domain below chose (or was forced into) a stage + technique to make progress against that shape. The question this chapter answers is which of those choices generalize.
2. THE ANALOGY TABLE
| Domain | Horizon | Reward sparsity | Exploration burden | Verifier | Stage + technique that cracked it | Transfers to cyber |
|---|---|---|---|---|---|---|
| Long-horizon coding (SWE-bench repo agents) | 20–80+ tool-call turns | Terminal (tests pass) | Which file/function among thousands | Execution (unit tests) | SFT on executable-env trajectories (SWE-Gym arXiv:2412.21139, R2E-Gym arXiv:2504.07164) → RL with execution-verified reward (SWE-RL arXiv:2502.18449, o1→o3 arXiv:2502.06807); long-context multi-turn RL needs DAPO-style stabilizers + progressive context curriculum (arXiv:2508.03501) | HIGH — closest structural twin. Mirrors our planned pipeline almost exactly. |
| Particular-language lift (weak PL / weak NL, execution-RL on code) | Short (1 program) → multi-turn repair (RLEF) | Terminal (tests) | Program-space / repair-space | Execution (compiler/unit tests) | Pretraining/CPT does the knowledge lift (corpus coverage — DeepSeek-Coder arXiv:2401.14196, StarCoder2 arXiv:2402.19173); RL fixes execution only — CodeRL arXiv:2207.01780, StepCoder arXiv:2402.01391, RLEF arXiv:2410.02089 — never injects new knowledge | HIGH, but as a diagnostic contrast, not a lever — see §4 below. |
| Competitive programming | Short per-problem; GrandCode reframes as multi-stage agentic loop | Terminal (hidden tests) | Program space (millions of candidates) | Execution (unit tests) | Sampling breadth + cheap filter (AlphaCode, arXiv:2203.07814) then a purpose-built multi-stage GRPO variant for delayed reward + off-policy drift (GrandCode’s “Agentic GRPO,” arXiv:2604.02721) | STRONG (conceptually) — validates our rejection-sampling-breadth strategy; GrandCode is the most load-bearing GRPO-variant precedent for our exact delayed-reward shape. |
| Theorem proving (Lean/Coq) | Long, many tactic steps — decomposable into subgoals | Per-step, not just terminal — the kernel checks every step | Tactic/proof search space | Deterministic, per-step, ungameable (proof kernel) | Subgoal decomposition for cold-start data + curriculum (DeepSeek-Prover-V2, arXiv:2504.21801); synthetic self-play data generation (AlphaGeometry, Nature 2024, DOI:10.1038/s41586-023-06747-5); then RL vs. binary kernel-verified reward (DeepSeek-Prover-V1.5) | PARTIAL — cleanest verifier of any row. Licenses subgoal-decomposition-for-DATA; does not license densifying reward over unverifiable CTF steps (shell/HTTP output isn’t kernel-checkable). |
| Deep-research / web agents | ~10–40 turns (search/click/browse) | Terminal (task complete) | Which query/page/source on the live open web | Learned RM (WebGPT, weak) → outcome/ORM (WebRL) → reference-match F1 (DeepResearcher) | RL (outcome-only) trained in the real live environment; WebRL’s self-evolving curriculum generated from the model’s own failures (arXiv:2411.02337); R1-Searcher’s sequential (not summed) format-then-outcome staged reward (arXiv:2503.05592) | STRONGEST transfer row. Failure-to-curriculum + safe staged tool-use bootstrap, both directly actionable. |
| Hard-exploration games (Montezuma / Pitfall / NetHack / StarCraft) | Very long (1000s–10,000+ actions) | Terminal, near-zero for prior baselines | Combinatorial state space; adversarial strategy space | Environment score / win-loss | Archive-based “explore, remember, return-then-explore-further” (Go-Explore, arXiv:1901.10995); diverse self-play league vs. naive self-play (AlphaStar, DOI:10.1038/s41586-019-1724-z); scale + dense shaping, no exotic algorithm (OpenAI Five, arXiv:1912.06680) | STRONG (Go-Explore) / different axis (league) / honest caveat (NetHack still unsolved). |
| Robotics (sparse-reward manipulation) | Short (tens of steps) | Sparse binary terminal | Continuous action/goal space | Environment success check | Data-relabeling: HER (arXiv:1707.01495) — relabel a failed trajectory’s achieved state as the goal it accidentally satisfied | SPECULATIVE at the mechanism level (off-policy-specific); the idea licenses mining our own flag=0 trajectories for sub-skill SFT data. |
| Cyber-CTF (academic literature — mention-only) | ~100 turns (our regime) | Terminal (flag) | Endpoint/vuln/exploit-chain space | Deterministic flag verifier | Monolithic outcome-only RL / rejection-sampling (CTF-Dojo, Cyber-Zero — academic, not a basis); two-stage offline→online curriculum (Pentest-R1 — academic, not a basis) | — target row; convergence with the frontier rows above is a mild corroborating note only, never load-bearing. |
3. Per-domain deep-dive
3.1 Long-horizon agentic coding — the closest twin
Problem it fixes: a model that free-runs an unconstrained shell wastes turns on malformed edits and noisy tool output; naive RL on a full 20–80-turn trajectory with only a terminal test-pass reward hits credit-misattribution (an early correct action gets penalized because a later, unrelated action failed).
What fixed it, by stage:
- Scaffold (pre-RL): SWE-agent’s Agent-Computer Interface (arXiv:2405.15793) — a small fixed action set + concise per-step feedback lifted pass@1 from 3.8%→12.5% on the same underlying LM, before touching weights. Anthropic’s Claude 3.5/3.7 Sonnet scaffold philosophy is the opposite-looking but complementary lesson: deliberately minimal scaffolding (bash + string-replace edit tool), crediting the gain to post-training, not scaffold cleverness.
- SFT: trajectory distillation on executable-environment corpora is the near-universal first stage — SWE-Gym (arXiv:2412.21139), R2E-Gym (arXiv:2504.07164). SWE-Master (very recent, arXiv:2602.03411, low-confidence) adds a concrete, cheap idea: mask environment-feedback tokens out of the SFT loss — train on the agent’s own actions/reasoning, not on memorizing verbose tool stdout.
- RL: execution-derived reward beats a learned/similarity proxy when both are available: SWE-Gym’s ground-truth path over SWE-RL’s difflib patch-similarity fallback (arXiv:2502.18449). DeepSWE (RL-only, no SFT, from Qwen3-32B) shows DAPO-style stabilizers (Clip-High, no-KL, compact filtering of failed/timeout trajectories) let RL-only work when the base model already has strong agentic priors. Progressive context/turn-budget curriculum (start RL at a shorter horizon than the full ceiling, extend once performance plateaus) is independently confirmed by two groups (arXiv:2508.03501, and KLong, arXiv:2602.17547, low-confidence but converging).
- Long-horizon credit assignment specifically: GiGPO (arXiv:2505.10978, NeurIPS 2025) adds a step-level grouping on top of GRPO: it groups actions taken from repeated “anchor states” across different rollouts, giving fine-grained credit without an extra critic. This is the most mature fix in a fast-moving 2026 cluster (BEACON arXiv:2605.06078, HiPER arXiv:2602.16165, Ecpo arXiv:2606.05885; all low-confidence individually, but converging on “flat trajectory-only advantage is the open problem”).
- Test-time (no training): parallel sampling + a verifier to pick the best candidate is a large, cheap, repeatedly-replicated multiplier (Claude “high compute” 63.7%→70.3%; R2E-Gym 34.4%→51%; DeepSWE 42.2%→59%) — orthogonal to whatever training was done.
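The SFT-stage idea of masking environment-feedback tokens out of the loss can be sketched in a few lines. This is a minimal illustration, not SWE-Master's implementation: the `Segment` type and the role names `"agent"` and `"env"` are hypothetical, and `-100` is used only because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from dataclasses import dataclass
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


@dataclass
class Segment:
    role: str             # hypothetical role names: "agent" or "env"
    token_ids: List[int]  # already-tokenized text of this segment


def build_labels(segments: List[Segment]) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) where only agent tokens carry a loss."""
    input_ids: List[int] = []
    labels: List[int] = []
    for seg in segments:
        input_ids.extend(seg.token_ids)
        if seg.role == "agent":
            # Train on the agent's own actions and reasoning.
            labels.extend(seg.token_ids)
        else:
            # Tool stdout stays in the context but contributes no loss.
            labels.extend([IGNORE_INDEX] * len(seg.token_ids))
    return input_ids, labels
```

The model still conditions on the tool output, since it remains in `input_ids`; it is simply never asked to predict it.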
What I’d change in our pipeline: (1) audit our sectools tool surface against the SWE-agent lesson — concise, structured observations, a “you already tried this” signal, before assuming RL will fix noisy feedback; (2) mask tool/environment-output tokens from our SFT loss if we don’t already; (3) if moving to full-trajectory RLVR, do not expect vanilla GRPO with only a terminal flag-reward to assign credit well across ~100 turns — prototype GiGPO’s anchor-state grouping first; (4) check whether we do reject-and-rescore across pass@k, or only report pass@1/pass@5 independently — the hybrid-TTS multiplier is free money already inside our methodology.
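To make item (3) concrete, here is a rough sketch of anchor-state grouping in the spirit of GiGPO, assuming a terminal-only reward. It is not the paper's implementation: the state keys, the discount `gamma`, the mixing weight `omega`, and the choice of population standard deviation are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Hashable, List, Tuple


def normalize(values: List[float]) -> List[float]:
    """Group-relative normalization; a constant group gets zero advantage."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def combined_advantages(
    rollouts: List[List[Hashable]],  # per rollout: the state key seen at each step
    rewards: List[float],            # terminal reward per rollout (e.g. flag in {0, 1})
    gamma: float = 0.95,             # assumed discount
    omega: float = 1.0,              # assumed weight on the step-level term
) -> Dict[Tuple[int, int], float]:
    """Advantage per (rollout index, step index)."""
    episode_adv = normalize(rewards)  # trajectory-level, as in plain GRPO

    # Group steps from different rollouts that started at the same state.
    groups = defaultdict(list)
    for i, (states, reward) in enumerate(zip(rollouts, rewards)):
        horizon = len(states)
        for t, state in enumerate(states):
            discounted = reward * gamma ** (horizon - 1 - t)
            groups[state].append((i, t, discounted))

    advantages: Dict[Tuple[int, int], float] = {}
    for members in groups.values():
        step_adv = normalize([ret for _, _, ret in members])
        for (i, t, _), adv in zip(members, step_adv):
            advantages[(i, t)] = episode_adv[i] + omega * adv
    return advantages
```

States visited by only one rollout fall back to the trajectory-level advantage alone, so the step-level term only adds signal where rollouts actually share an anchor state. Deciding when two free-text shell states count as "the same" is the hard part and is left open here.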
3.2 Particular-language — the knowledge-injection contrast (read this one for the diagnosis framing)
The reason this domain is second, not last: it is the cleanest existing literature answering exactly the question our diagnosis framework poses — is a failure a knowledge gap or an execution gap?
For both a specific programming language and a specific natural language, the field’s converged answer is stage-specific:
- Pretraining / continued-pretraining injects KNOWLEDGE — corpus composition (how many languages, how much of each) is the lever, not RL, not even SFT. StarCoder (arXiv:2305.06161), StarCoder2 (arXiv:2402.19173), DeepSeek-Coder (arXiv:2401.14196) for code; Sailor (arXiv:2404.03608), SEA-LION (arXiv:2504.05747), LLaMA Beyond English (arXiv:2401.01055) for natural language. Tokenizer/vocabulary coverage sits underneath this stage as an architectural precondition (arXiv:2406.11477) — poor coverage looks like “doesn’t know the language” even when data exists, because every token is spent on fragmented sub-word pieces.
- SFT/instruction-tuning teaches USE, cheaply, once the base model already has passive knowledge — BLOOM+1’s finding (arXiv:2212.09535) is the sharpest data point: for an already-instruction-tuned model, simply including a new language in the multitask instruction-tuning mixture beat continued pretraining — the cheapest lever to try first, once a base exists.
- RL (execution/compiler-verified) fixes BEHAVIOR, never knowledge — CodeRL (arXiv:2207.01780), PPOCoder (arXiv:2301.13816), RLTF (arXiv:2307.04349), StepCoder (arXiv:2402.01391), RLEF (arXiv:2410.02089, Meta FAIR, ICML 2025 spotlight). None of these teach new language knowledge — every one explicitly frames the problem as get-the-execution-loop-right on a domain the base model’s pretraining already covers. RLEF in particular is a near-literal preview of our structural frame: multi-turn POMDP, policy emits → executes against public tests → feedback appended → repair → repeat → reward on held-out private tests, solved with a turn-level (not token-level) value function.
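The RLEF episode shape described in the last bullet can be sketched as a plain loop. This is an illustration of the structure only, not Meta's code: `policy`, `run_public_tests`, and `run_private_tests` are hypothetical callables supplied by the caller, and `max_turns` is an assumed limit.

```python
from typing import Callable, List, Tuple


def repair_episode(
    policy: Callable[[List[str]], str],
    run_public_tests: Callable[[str], Tuple[bool, str]],
    run_private_tests: Callable[[str], bool],
    task: str,
    max_turns: int = 3,
) -> Tuple[List[str], float]:
    """Emit, execute on public tests, append feedback, repair, repeat."""
    history = [task]
    program = ""
    for _ in range(max_turns):
        program = policy(history)              # policy emits a candidate
        passed, feedback = run_public_tests(program)
        history.append(program)
        if passed:
            break                              # stop once public tests pass
        history.append(feedback)               # feedback becomes context
    # The only reward is terminal and comes from the held-out private tests.
    reward = 1.0 if run_private_tests(program) else 0.0
    return history, reward
```

Swapping "public tests" for tool output and "private tests" for the flag verifier gives the same loop our harness runs, which is why this row reads as a preview of the CTF setting.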
The quote-worthy contrast: BLOOM+1’s “put it in the SFT mixture” (an SFT-stage move) vs. StepCoder/RLEF’s “RL never adds knowledge, only sharpens execution of knowledge already latent in the base model.” No amount of GRPO/RLVR on our harness will inject knowledge of a CVE or technique class the base model never saw much of in pretraining — see §4.
3.3 Theorem proving — the cleanest verifier, the strongest calibration check
Formal theorem proving (Lean/Coq) has the single cleanest sparse-reward substrate of any domain: the proof kernel checks every intermediate step, not just the final answer, which makes it cleaner even than our flag verifier. DeepSeek-Prover-V2 (arXiv:2504.21801) decomposes a hard theorem into a DAG of subgoals, generates cold-start SFT data by solving each subgoal independently with a smaller model, then runs RL on top with binary kernel-verified reward. AlphaGeometry (Nature 2024, DOI:10.1038/s41586-023-06747-5) goes further: synthetic self-play data generation that manufactures its own training problems, not just solutions to given ones, escaping a human-demonstration data-scarcity floor entirely. AlphaProof (Nature 2025, DOI:10.1038/s41586-025-09833-y) applies AlphaZero-style self-play/search on top.
The honest limit, stated precisely: the transferable lever is not “densify reward the way theorem
proving does” in general — it is specifically: wherever a CTF sub-milestone can be reduced to a
deterministic, server-side, ground-truth check (a reverse-shell callback actually received, a specific
privileged file actually read — not an LLM-judge’s opinion), score it exactly like a verified Lean subgoal.
Everywhere else — a curl command’s stdout, a subprocess’s raw output — there is no general “CTF kernel”
that can certify a step was valid, and this domain’s recipe does not license densifying reward there. What
does transfer safely: subgoal decomposition and synthetic self-play, used to generate additional
training data or curriculum, leaving the terminal reward untouched.
3.4 The KNOWLEDGE vs EXECUTION contrast — tying it to our diagnosis framework
This is the load-bearing synthesis point of the whole chapter, and it is exactly the split Diagnosing the gap already formalizes (competence vs. performance, Firestone PMC7604508; formal vs. functional competence, Mahowald et al. arXiv:2301.06627) — the particular-language literature is independent, cross-domain confirmation of the same split, from a different field entirely:
| Failure mode observed | Root cause | Fixing stage | Cross-domain evidence |
|---|---|---|---|
| Model has never seen the relevant tokens/technique at all | Missing from pretraining corpus | Pretraining / continued-pretraining — data mixture, upsampling | StarCoder/DeepSeek-Coder (PL); Sailor/SEA-LION (NL) |
| Model “sort of” knows it but fragments/mishandles it | Tokenizer/vocabulary coverage gap | Vocabulary expansion + CPT (sub-pretraining level, not SFT/RL) | Yamaguchi et al. arXiv:2406.11477 |
| Model knows it passively but won’t reliably use it on command | Instruction-tuning data doesn’t cover it | SFT / instruction-tuning mixture — cheapest lever, no CPT needed | BLOOM+1 arXiv:2212.09535; Aya arXiv:2402.07827 |
| Model knows the technique but fumbles execution over many turns / fails tests | Behavior/execution gap, not knowledge | RL with an execution/compiler/flag verifier | CodeRL, StepCoder, RLEF — this is our project’s diagnosed gap |
What I’d change in our pipeline: before spending a training-run budget on GRPO/RLVR to fix a specific recurring failure, run it through this table first. If trace review shows the agent never produces the right technique/CVE reference at any sample count, the failure sits in the upper rows of the table (a pretraining, tokenizer, or SFT-data problem), and no amount of RL will fix it. This is consistent with handbook rule 5, “knowledge in tools, not weights or prompt,” which generalizes: RL doesn’t need to hold facts if tools can supply them at inference time. If the technique does appear somewhere across k samples but pass@1 doesn’t convert it, that’s the bottom row, our diagnosed execution gap, and RLVR is the right lever. This is not a new framework; it’s the particular-language literature independently re-deriving the same split our diagnosis chapter already uses, from code/NLP rather than cyber, and it is worth citing back as corroboration.
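The triage rule above can be written down as a small function. This is a sketch under stated assumptions: each sampled rollout has already been labelled during trace review with whether the right technique appeared and whether the flag was captured, and the `min_solve_rate` threshold and the three label strings are hypothetical.

```python
from typing import List, Tuple


def triage(samples: List[Tuple[bool, bool]], min_solve_rate: float = 0.5) -> str:
    """samples: one (technique_appeared, solved) pair per sampled rollout."""
    if not samples:
        raise ValueError("need at least one sampled rollout")
    k = len(samples)
    technique_hits = sum(1 for appeared, _ in samples if appeared)
    solves = sum(1 for _, solved in samples if solved)

    if technique_hits == 0:
        # Never surfaces at any sample count: a data lever, not RL.
        return "knowledge_gap"
    if solves / k < min_solve_rate:
        # Technique is reachable but rarely converts: RLVR territory.
        return "execution_gap"
    return "no_gap"
```

The function only routes a failure to a table row; the labelling of `technique_appeared` is where the real judgment sits.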
3.5 Deep-research / web agents — the strongest-transfer row
The tightest non-cyber structural analogue: many sequential tool calls into a real, noisy, adversarial live environment (not a toy simulator), reward assessable only once the task is actually done. WebGPT (arXiv:2112.09332) is the direct ancestor of “SFT cold-start, then reward-guided optimization” — our own planned shape — but its reward is a learned human-preference model, flagged by its own authors as struggling out-of-distribution; cite it as the origin of the recipe shape, not as license to use a learned reward. WebRL (arXiv:2411.02337, ICLR 2025) is the standout: a self-evolving curriculum that generates new training tasks directly from the model’s own unsuccessful attempts — Llama-3.1-8B goes 4.8%→42.4% success on WebArena-Lite from this alone. R1-Searcher (arXiv:2503.05592) shows a genuinely safe way to densify reward for tool-use: a sequential, not summed, two-stage reward — stage 1 rewards only correct tool-invocation format, stage 2 switches fully to outcome reward. Because the stages are sequential rather than concurrently-summed, the policy can’t “farm” stage-1’s format reward once stage 2 has begun — it avoids the reward-hacking trap a concurrent per-call bonus would carry. DeepResearcher (arXiv:2504.03160, EMNLP 2025) independently argues, as its central claim, that training end-to-end in the real environment (not a simulated proxy) is “a fundamental requirement” — direct validation of our own live-sandbox harness design.
What I’d change in our pipeline: (1) WebRL’s failure-to-curriculum mechanism is the single best
non-cyber precedent for our own currently-unsolved long tail (most of the ~1000-challenge portfolio outside
the 30-60% GRPO band) — an automatic, quantitatively-strong demonstration that a failure corpus converts
into new, appropriately-calibrated training tasks rather than sitting inert; (2) R1-Searcher’s sequential
staged reward is a concrete, safe fix if trace review shows tool-avoidance (agent bypassing the sectools
surface in favor of raw shell) — a stage-1 format-only reward for correct tool invocation, switched off once
stage 2 begins.
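The sequential (not summed) staged reward is simple enough to state as code. This is a minimal sketch of the idea as described above, not R1-Searcher's implementation; `stage`, `format_ok`, and `flag_captured` are hypothetical names for signals our harness would have to supply.

```python
def staged_reward(stage: int, format_ok: bool, flag_captured: bool) -> float:
    """Stage 1 pays only for well-formed tool calls; stage 2 only for outcome."""
    if stage == 1:
        return 1.0 if format_ok else 0.0
    if stage == 2:
        # The format bonus is switched off entirely, so it cannot be farmed.
        return 1.0 if flag_captured else 0.0
    raise ValueError(f"unknown stage: {stage}")
```

The design point is that the two signals are never added together: in stage 2 a perfectly formatted but failed episode scores exactly zero.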
3.6 Hard-exploration games — the honest calibration check, plus one genuinely new lever
Montezuma’s Revenge and Pitfall are the domain where “sparse binary terminal reward with near-zero prior success” was most literally the entire problem. Go-Explore (arXiv:1901.10995, Nature 2021) solved both by maintaining an archive of previously-visited states and always returning cheaply to a promising already-discovered frontier state before exploring further from it, rather than re-discovering the same early states from scratch every episode; it then robustified the best archived trajectories into a closed-loop policy. AlphaStar (Nature, DOI:10.1038/s41586-019-1724-z) solves a different axis, adversarial strategy collapse, with a league: train against a diverse, continually-adapting population of past checkpoints and dedicated “exploiter” agents trained to target the current agents’ weaknesses, not just the latest self. It also validates imitation-bootstrap-before-RL at frontier scale (pure self-play RL from scratch over such a long horizon was intractably slow to bootstrap). OpenAI Five (arXiv:1912.06680) shows that scale plus dense reward shaping can substitute for exotic exploration algorithms, but its dense proxy-reward choice is exactly the reward-hacking risk our handbook’s ground-truth-only stance guards against; treat it as a caution, not a recipe. NetHack (arXiv:2006.13760) remains genuinely unsolved: the honest calibration point that {long-horizon + heavy exploration + procedural generalization + sparse terminal reward} all at once is not a solved combination anywhere in the literature, cyber included.
What I’d change in our pipeline (speculative transfer — engineering-unproven in this domain): Go-Explore’s literal mechanism (deterministic sim-state checkpoint/restore) doesn’t map to a live target box that isn’t cheaply resettable — but the principle does: explicitly archive distinct states of partial progress within a CTF episode (new service found, new privilege level reached, new file discovered) as first-class checkpoints, and bias new rollout attempts toward continuing exploration from an archived frontier state rather than always cold-starting from turn 0. This changes where exploration compute is spent, not the reward function — it carries none of the reward-hacking risk a shaped reward would. Building the state-abstraction function for free-text shell/HTTP output is real, non-trivial work; flag accordingly.
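A rough sketch of what such an archive could look like follows. It borrows Go-Explore's idea of keeping the shortest known route to each cell and sampling less-visited cells more often; the cell keys, the `1 / sqrt(visits + 1)` weighting, and the assumption that a stored action prefix can be replayed to get back to a state are all illustrative choices, not a tested design for a live target.

```python
import random
from dataclasses import dataclass
from typing import Dict, Hashable, List, Tuple


@dataclass
class Cell:
    prefix: List[str]  # actions that reached this partial-progress state
    visits: int = 0    # how often exploration was resumed from here


class FrontierArchive:
    def __init__(self) -> None:
        self.cells: Dict[Hashable, Cell] = {}

    def add(self, key: Hashable, prefix: List[str]) -> bool:
        """Record a state; keep the shortest known prefix. True if the cell is new."""
        cell = self.cells.get(key)
        if cell is None:
            self.cells[key] = Cell(list(prefix))
            return True
        if len(prefix) < len(cell.prefix):
            cell.prefix = list(prefix)
        return False

    def select(self, rng: random.Random) -> Tuple[Hashable, List[str]]:
        """Pick a cell to resume from, favoring rarely-visited ones."""
        keys = list(self.cells)
        weights = [1.0 / (self.cells[k].visits + 1) ** 0.5 for k in keys]
        key = rng.choices(keys, weights=weights, k=1)[0]
        self.cells[key].visits += 1
        return key, list(self.cells[key].prefix)
```

A key such as `("ssh", "user")` stands in for the output of the state-abstraction function, which is the non-trivial piece this sketch deliberately leaves out.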
3.7 Robotics — speculative mechanism, but a concrete data-curation descendant
Hindsight Experience Replay (arXiv:1707.01495, NeurIPS 2017) is the seminal sparse-reward paper: relabel a failed trajectory, post-hoc, as if the state it actually reached had been the intended goal all along, so that an episode that failed its original goal still yields a correctly-labeled success example for the goal it did reach. Honest limit: this is an off-policy, goal-conditioned, replay-buffer mechanism (DDPG/DQN-family). It does not literally port to an on-policy GRPO/RLVR loop, and a CTF episode has one terminal flag, not a continuum of interchangeable goals a failed run could be relabeled as having achieved instead. Speculative transfer, principle only: mine the corpus of failed (flag=0) trajectories for reusable sub-skill demonstrations. A run that reached a foothold but failed to escalate privileges is a valid positive demonstration of “how to reach a foothold,” even though the overall episode is a failure. Segment the failed-trajectory corpus by furthest-pipeline-stage-reached and fold the phase-appropriate positive prefixes into the SFT corpus for that sub-skill. This is a pure data-curation move that never touches the RL reward function, so it carries none of the reward-hacking risk an online auxiliary reward would.
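The segmentation step can be sketched as follows, assuming each logged step already carries a milestone label that was set by a deterministic check (for example "foothold") or `None`. The trajectory layout and milestone names are hypothetical; how milestones get verified is outside this sketch.

```python
from collections import defaultdict
from typing import Dict, List


def mine_failed_runs(trajectories: List[dict]) -> Dict[str, List[List[str]]]:
    """Bucket failed runs by furthest verified milestone.

    Each trajectory is {"flag": 0 or 1, "steps": [(action, milestone_or_None), ...]}.
    Returns milestone -> list of action prefixes that reached it.
    """
    buckets: Dict[str, List[List[str]]] = defaultdict(list)
    for traj in trajectories:
        if traj["flag"] != 0:
            continue  # successful runs already feed SFT through the normal path
        furthest = None
        for idx, (_action, milestone) in enumerate(traj["steps"]):
            if milestone is not None:
                furthest = (idx, milestone)
        if furthest is None:
            continue  # no verified progress, nothing safe to reuse
        idx, milestone = furthest
        # Keep only the prefix up to the verified milestone; drop the failed tail.
        buckets[milestone].append([a for a, _ in traj["steps"][: idx + 1]])
    return dict(buckets)
```

Cutting the prefix at the last verified milestone keeps the flailing that followed it out of the SFT corpus.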
4. The KNOWLEDGE vs EXECUTION contrast — the load-bearing takeaway
Section 3.4 above is the single most important cross-domain finding in this chapter, restated plainly: the particular-language literature is the cleanest existing evidence, from a domain with no cyber baggage, that pretraining/SFT injects knowledge and RL only sharpens execution of knowledge the base model already has. This directly operationalizes Diagnosing the gap’s competence-vs-performance split with a second, independent field’s worth of citations. Practical consequence for this project: run any suspected failure mode through the four-row table in §3.4 before committing GRPO/RLVR compute to it — a “doesn’t know the technique” failure needs a data lever (more SFT coverage, a tool that supplies the fact at inference time), not a training-loop lever; a “knows it, fumbles execution over ~100 turns” failure is where RLVR belongs, and is what every SWE/RLEF/GrandCode result above was built to fix.
5. Ranked shortlist of transferable levers
- RLEF’s turn-level value function over a multi-turn POMDP (arXiv:2410.02089) — the single most literal structural preview of our problem (public-test feedback → repair → repeat → private-test terminal reward) solved by Meta at an order of magnitude fewer samples than scaffolded prompting. High confidence, high relevance — read before finalizing a GRPO variant.
- GiGPO’s step-level anchor-state advantage (arXiv:2505.10978) and GrandCode’s Agentic-GRPO (arXiv:2604.02721): two independent, purpose-built GRPO variants for exactly our credit-assignment shape (long trajectory, terminal-only reward, off-policy drift). High confidence on GiGPO (NeurIPS-accepted); promising but unverified on GrandCode (single team, very recent). Together they are converging evidence that “stage/turn is the unit of credit” is the field’s consensus fix.
- WebRL’s self-evolving curriculum from failures (arXiv:2411.02337) — the strongest available precedent for converting our own unsolved-challenge tail into new training data automatically. High confidence, needs CTF-specific task-generation design.
- Ground-truth execution reward over any learned/similarity proxy, now 4x cross-validated — SWE-Gym > SWE-RL, the theorem-proving kernel, WebGPT’s own flagged weakness, and DeepResearcher’s live-environment requirement all land on the same rule our handbook already locks. Established, high confidence — reconfirmation, not a new finding, but reassurance the rule is domain-general.
- Subgoal/curriculum decomposition for cold-start DATA generation, never for the reward itself (DeepSeek-Prover-V2 arXiv:2504.21801, AlphaGeometry) — safe wherever a genuine sub-milestone is deterministically verifiable; see One problem, or many? for the full decompose-eval-vs-decompose-reward argument this reconfirms. Established, high confidence.
- Progressive context/turn-budget curriculum for RL (arXiv:2508.03501, KLong arXiv:2602.17547) — start short, extend once performance plateaus; directly portable to the existing GRPO 30–60% baseline-band rule. Medium confidence (2 independent sources).
- R1-Searcher’s sequential (not summed) staged reward for tool-use bootstrap (arXiv:2503.05592) — a safe fix if tool-avoidance is diagnosed in trace review. Established mechanism, cyber-domain application untested.
- Go-Explore’s archive-and-return exploration scheduling (arXiv:1901.10995) — genuinely the most novel, least-already-covered addition here; changes where exploration compute is spent, zero reward-hacking risk. Strong transfer of the idea; engineering-unproven in this domain — speculative transfer.
- Hybrid test-time scaling (execution + execution-free verifiers, complementary blind spots) — a free multiplier on any trained policy, replicated across ≥4 independent groups (R2E-Gym, DeepSWE, SWE-Master, Claude “high compute”). High confidence, immediately actionable on our existing pass@k methodology.
- Mining flag=0 trajectories for sub-skill SFT data (HER’s spirit, arXiv:1707.01495) — segment failed runs by furthest-stage-reached, fold positive prefixes into SFT. Speculative-to-established idea, translated from a different algorithm family — low risk, purely data-curation.
- NetHack’s honest “still unsolved” status (arXiv:2006.13760) — not a lever, a calibration check: no algorithm anywhere has cracked {long-horizon + heavy exploration + procedural generalization + sparse terminal reward} simultaneously. Set expectations accordingly.
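As one example of how cheaply a lever from this list can be prototyped, the progressive turn-budget curriculum reduces to a plateau check. This is a sketch only: the window size, the `min_gain` threshold, the step size, and the 100-turn ceiling are assumed values, not numbers taken from the cited papers.

```python
from statistics import mean
from typing import List


def next_turn_budget(
    current: int,
    success_history: List[float],  # success rate per evaluation, oldest first
    ceiling: int = 100,            # assumed full-horizon ceiling
    step: int = 20,                # assumed extension per plateau
    window: int = 3,               # evaluations per comparison window
    min_gain: float = 0.01,        # improvement below this counts as a plateau
) -> int:
    """Extend the turn budget once success has stopped improving."""
    if len(success_history) < 2 * window:
        return current  # not enough evidence yet
    recent = mean(success_history[-window:])
    previous = mean(success_history[-2 * window : -window])
    if recent - previous < min_gain:
        return min(ceiling, current + step)
    return current
```

Calling it after each evaluation round with the running success history yields the "start short, extend once performance plateaus" schedule.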
Cross-links
- The path to a frontier cybersecurity model — the capstone this chapter’s cross-domain evidence feeds; “frontier” requires the same {strong agentic base, ungameable verifier at scale, infra matched to the RL bottleneck} triad this chapter finds repeating across code/math/web.
- Diagnosing the gap — the competence/performance split §3.4 independently reconfirms from the particular-language literature; run failures through both before committing RL compute.
- RL that creates value — long-horizon · exploration · reasoning · novelty — the algorithm-level detail (GiGPO, DAPO stabilizers, entropy preservation) this chapter’s domain evidence is evaluated against.
- One problem, or many? (monolithic vs decomposed): the decompose-eval-vs-decompose-reward distinction that DeepSeek-Prover-V2’s and StepCoder’s subgoal work in §3.2/§3.3 reconfirms from two more domains.