Where you are & the forks ahead
This is the capstone chapter, not a roadmap. Every other chapter in this book resolves a method question (SFT vs DPO vs GRPO, monolithic vs decomposed, which exploration fix). This chapter assembles those resolutions into the shape you actually need to draw your own plan: what’s true today, what you have to decide, in what order the decisions unlock each other, and what would have to be false to make you change course. It recommends nothing you haven’t already read elsewhere in this book; at every load-bearing point it routes you back to Diagnosing the gap, From behavioral audit to training signal, One problem, or many?, Before you train, RL that creates value, and The decision. Read this after those, not instead of them.
Where this fork-by-fork plan sits in the bigger picture: everything below is Rung-1-scoped — it resolves execute reliably on the portfolio you have. It is a prerequisite for, not a substitute for, The path to a frontier cybersecurity model, which argues that even a perfectly-resolved DAG below doesn’t by itself cross into “frontier” — that takes orders-of-magnitude more RL-environment scale (Rung 2) and possibly a CPT/mid-training stage (Rung 3). The cross-domain evidence grounding that argument — six other long-horizon/sparse-reward/verifiable domains and what actually cracked each one — lives in Cybersecurity is one of a family — what cracked the others; several forks below (especially (c) and (e)) draw directly on techniques surveyed there (GiGPO, DAPO, WebRL’s failure-to-curriculum, potential-based shaping). Fork (b)’s SFT-vs-measure-first framing and its order-matters/compounding rationale are the Sequence-B-specific instance of the general argument in The recipe is a sequence, not a pick; the general-capability SFT/preference data fork (b)/(d) would eventually pull from is catalogued in Proven post-training datasets — a usage-cited registry.
1. Where you are — the diagnosis on one screen
The number: ~100–200 solved / ~1000 challenges, at k=1. This is a portfolio statistic, not a per-challenge pass rate: a challenge that was “solved once” could be a 5% fluke or a 55% near-certainty, and those two cases call for opposite next moves (The decision, “one prerequisite before any of this”). Nobody has yet run pass@k per challenge, let alone per pipeline stage.
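To make the fluke-versus-near-certainty distinction concrete, here is a minimal sketch of the standard unbiased pass@k estimator, 1 − C(n−c, k)/C(n, k), computed per challenge rather than across the portfolio. The sample size (n = 20 attempts) and the two example challenges are illustrative assumptions, not project measurements.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for ONE challenge: n sampled attempts, c of them solved."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset of the attempts contains at least one solve
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: both challenges could show up as "solved" in a k=1 portfolio tally.
examples = {
    "fluke": {"n": 20, "c": 1},      # roughly a 5% per-attempt success rate
    "reliable": {"n": 20, "c": 11},  # roughly a 55% per-attempt success rate
}

for name, s in examples.items():
    curve = {k: round(pass_at_k(s["n"], s["c"], k), 3) for k in (1, 5, 10)}
    print(name, curve)
```

Both rows collapse to the same "1 solved" entry in the portfolio count, which is why the per-challenge curve, not the aggregate, is the number that routes the next move.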
The pipeline is a chain, not a single action:
flowchart LR
R["Recon /\nenumeration"] --> E["Endpoint\ndiscovery"]
E --> V["Identify the\nvulnerable endpoint"]
V --> X["Exploit it"]
X --> P["Post-exploitation /\npivot"]
P --> F(("Flag\n{0,1}\nground-truth verified"))
classDef stage fill:#132b22,stroke:#34d399,color:#eafaf3;
class R,E,V,X,P stage;
Only the last box (flag_verified) is checked today, and even that is presently a provenance proxy
(flag_scan.retrieved — “the string came back from the sandbox, not the model’s mouth”) rather than a true
byte-compare against benchmark/flags/pd26_flags.current.json — that exact-match wiring doesn’t exist yet
outside a manual, SSH-gated step (instrumentation-and-data-readiness.md §3.1). So even the ground-truth
anchor this whole book leans on is one small, well-scoped engineering task away from being fully automatic,
not there yet.
The stage-localized failure taxonomy (F1–F4), mapped to canonical RL/agent-research framing:
| Tag | Failure | Canonical framing | Fix lever if confirmed dominant |
|---|---|---|---|
| F1 | Never finds the vulnerable endpoint | Exploration / coverage failure — no gradient until reward is first observed | On-policy RL with exploration preservation, not more demonstrations |
| F2 | Finds it, probes shallowly, can’t land the exploit | Execution / skill (performance-floor) failure | Trajectory curation, rejection-sampling SFT, DAPO/GiGPO as GRPO baseline |
| F3 | Clumsy tool use, wrong tool for the job | Policy / tool-selection failure — its own axis | Elicitation ladder → ToolRL/Tool-Star if elicitation fails |
| F4 | No real pivot/chaining after a foothold | Long-horizon credit-assignment failure (variance, not coverage) | Step-level credit (GiGPO), curriculum (1-hop before 2-hop), never “try more” alone |
The diagnosis, stated as a hypothesis, not a fact: the project’s working read is “likely an execution gap” (F2/F3-flavored) rather than a knowledge gap or a pure exploration gap — capability is probably present and unreliable, not absent. This is the single most load-bearing framing decision in the whole plan, and it is currently unproven. The book’s own diagnosis framework is explicit about this: “the honest, defensible answer will not be a single sentence” — the true picture is almost certainly a split verdict, different F-tags dominating different challenge subtypes, not one gap type for the whole 1000 (Diagnosing the gap §0, §8).
What “proven” requires, concretely, and doesn’t exist yet:
- Per-challenge (not aggregate) pass@k, segmented by whether the winning path is single-shot or sequentially-gated (compositional — enumeration must land before exploitation is even visible).
- Pass@(k,T), base model vs. current checkpoint, per segment, to distinguish three cases: a genuine execution gap (trained pulls away from base at large k), a pure elicitation artifact (base catches up), and an exploration gap hiding underneath (matched-data SFT regresses the segment, RL expands it) (arXiv:2504.13837, arXiv:2604.14877).
- A working F1–F4 stage tagger over the existing Phoenix/events.jsonl corpus. This is the actual current gap, confirmed by direct source read: the harness’s process telemetry is already complete for this (tool_call/tool_result pairs joined on tool_call_id); nothing needs to change in go/libs/agent/events or secagent/runner.go. What’s missing is purely semantic: a deterministic post-hoc scan plus one stage_oracle.json per challenge, authored from that challenge’s own solve.py (Before you train §5).
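The post-hoc scan described in the last bullet can be sketched roughly as follows. This is a Python illustration only, not the planned stagescan.go. The event field names (type, tool_call_id, input, output), the stage names, and the oracle format (one regex per stage) are all assumptions made for the example; the real events.jsonl schema and stage_oracle.json layout may differ.

```python
import json
import re
from typing import Iterable

# Stage names are illustrative, mirroring the pipeline chain in this section.
STAGES = ["recon", "endpoint_discovery", "vuln_identified", "exploit", "pivot"]

def load_events(path: str) -> list[dict]:
    """Read one JSON object per line from an events.jsonl file."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

def pair_calls(events: Iterable[dict]) -> list[tuple[dict, dict]]:
    """Join tool_call and tool_result events on tool_call_id (assumed field names)."""
    calls, pairs = {}, []
    for ev in events:
        if ev.get("type") == "tool_call":
            calls[ev["tool_call_id"]] = ev
        elif ev.get("type") == "tool_result" and ev.get("tool_call_id") in calls:
            pairs.append((calls[ev["tool_call_id"]], ev))
    return pairs

def stages_reached(events: Iterable[dict], oracle: dict[str, str]) -> list[str]:
    """Stages whose oracle regex matched some call/result pair, gated in pipeline order."""
    texts = [f"{c.get('input', '')}\n{r.get('output', '')}" for c, r in pair_calls(events)]
    reached = []
    for stage in STAGES:
        pattern = oracle.get(stage)
        if pattern is None or not any(re.search(pattern, t) for t in texts):
            break  # a later stage only counts if every earlier stage was reached
        reached.append(stage)
    return reached

def failure_tag(reached: list[str], flag_verified: bool) -> str:
    """Map furthest stage to F1/F2/F4. F3 (tool selection) is a separate axis, not derived here."""
    if flag_verified:
        return "solved"
    if "vuln_identified" not in reached:
        return "F1"
    if "exploit" not in reached:
        return "F2"
    return "F4"  # assumption: exploit landed but no flag is treated as a chaining failure
```

The scan is read-only and deterministic: it touches no reward function and needs no new sandbox instrumentation, only the per-challenge predicates.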
Bottom line for this section: treat “it’s an execution gap” as the leading hypothesis, not settled ground. Everything in §2–§4 below is written so that it stays true whichever way the funnel eventually comes down — several forks are explicitly gated on a measurement that hasn’t been taken yet.
2. The forks
Five decisions, each with exactly two options (per this book’s convention — no hybrids, no third path). Every “what must be TRUE” column is a gate, not a preference — the fork should not be decided until its gate is checked, because in at least two of these forks (b, e) the two options are not just different costs, they require the opposite SFT/RL ordering. One explicit, flagged exception: fork (b)’s verdict below lands on a hybrid rather than either labeled option outright — called out and justified there, not silently smuggled in past this convention.
(a) Instrument the F1–F4 stage tagger first?
| | Option 1: build it first | Option 2: skip it, decide off aggregate pass@k/flag_verified alone |
|---|---|---|
| What it is | stagescan.go (same shape as the existing flagscan.go) + one stage_oracle.json per challenge, authored from solve.py; prototype on PD26-02 first, then generalize | Proceed straight to the SFT/GRPO plan using only the terminal flag signal and portfolio-level solve rate |
| Cost (compute + eng) | Near-free — read-only post-hoc scan over events.jsonl you already have; no new sandbox instrumentation; editorial authoring of ~4 predicates/challenge (one person reads ~10 solve.py files) | Zero now — but every downstream decision (b–e) is made blind to which F-tag actually dominates |
| Risk (reward-hacking) | None — this is diagnostic-only, no reward function touched | Routing risk, not hacking risk: you may sink a training cycle into the wrong lever (e.g. rejection-sampling SFT when the real bottleneck is F1/exploration, or milestone shaping when it’s actually F2/F4) |
| Information value | Highest single move in this whole chapter. “This single aggregation is the input every downstream decision below depends on” — verbatim from Before you train §4 | Low — an aggregate number “averages over four structurally different failure modes,” exactly the collapsing-a-split-verdict anti-pattern the diagnosis framework names first |
| What must be TRUE first | Nothing — this is the recommended first step regardless of any other measurement | N/A |
Verdict pressure: there is no real argument for Option 2. This fork is here because it’s the fork everyone is tempted to skip under time pressure, not because the evidence is close.
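A small sketch of why the aggregate hides the split verdict: the same failure tags, pooled versus grouped by segment. The records, segment names, and the 0.5 dominance margin are hypothetical values chosen for illustration.

```python
from collections import Counter, defaultdict

def funnel(records):
    """records: iterable of (segment, failure_tag) pairs. Returns per-segment tag shares."""
    counts = defaultdict(Counter)
    for segment, tag in records:
        counts[segment][tag] += 1
    return {
        seg: {tag: n / sum(c.values()) for tag, n in c.items()}
        for seg, c in counts.items()
    }

def dominant_tag(shares, margin=0.5):
    """Tag whose share exceeds `margin`, or None when the verdict is split."""
    if not shares:
        return None
    tag, share = max(shares.items(), key=lambda kv: kv[1])
    return tag if share > margin else None

# Hypothetical failed-attempt tags, not project data.
records = (
    [("single_shot", "F2")] * 3 + [("single_shot", "F1")]
    + [("sequentially_gated", "F1")] * 4 + [("sequentially_gated", "F4")]
)

per_segment = funnel(records)
pooled = funnel(("all", tag) for _, tag in records)["all"]
print({seg: dominant_tag(s) for seg, s in per_segment.items()}, dominant_tag(pooled))
```

In this toy data the pooled view reports F1 as dominant, while the single-shot segment is actually F2-dominated. That is the routing error Option 2 invites.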
(b) Rejection-sampling SFT on verified solves NOW vs. measure pass@k-per-stage FIRST
| | Option 1: SFT now | Option 2: segment + measure first |
|---|---|---|
| What it is | Run rejection-sampling SFT on the ~185 already-collected verified-solve trajectories across the 5 corpora (gym263, gym564, warpenv-broker, envgen, argus60-base) | Segment the 1000 challenges (single-shot vs. sequentially-gated), run Pass@(k,T) — base vs. current checkpoint — per segment, before deciding what to train on |
| Cost | Cheap — data already exists, this is already the project’s stated near-term plan | Medium — sampling compute at multiple k, cold-start pdq --fresh-retries, no training required |
| Risk (reward-hacking / generalization) | Concrete, not hypothetical. On the compositional/sequentially-gated segment, matched-data SFT actually regresses capability (net −4) while RL expands it (net +4) — Zhai et al., arXiv:2604.14877. Training the wrong subset with SFT doesn’t just waste compute, it can make that subset worse. Separately: the ~185-trajectory pool may be guessing-dominated (high pass@64, low Cover@τ — arXiv:2510.08325) or contain lucky-but-unsound paths (right flag, wrong/wasted reasoning — arXiv:2506.14245) | Low — diagnostic only, but real opportunity cost if it delays shipping a known-safe move (SFT is the project’s current plan, already literature-validated as a baseline — arXiv:2504.11343) |
| Information value | Low incremental — you already believe SFT-on-solves works generically; this doesn’t test where it works | High — this is “the single highest-value experimental design” in the diagnosis chapter, and it’s directly testable this week with no new training |
| What must be TRUE before committing to Option 1 wholesale | (i) the SFT pool isn’t guessing-dominated (Cover@τ check on its source challenges); (ii) trajectories are filtered on soundness (backtracking/wasted-turns/tool-validity), not just flag==1; (iii) fork (a)’s stage tagger doesn’t show these ~185 trajectories concentrated on the sequentially-gated segment | N/A |
Verdict pressure — this is this chapter’s one deliberate exception to the “no hybrids” convention stated in §2, flagged rather than smuggled in: don’t cancel the SFT plan — but don’t treat “SFT now” as a blanket recipe across the whole portfolio either. The correct read of these two options is closer to “do (2) as a segmentation gate on (1)”: SFT the single-shot segment now, hold the sequentially-gated segment for GRPO once entropy instrumentation is live. Why this fork earns the exception where the other four don’t: options 1 and 2 here aren’t mutually exclusive courses of action — one is a training decision, the other a measurement decision, and they resolve at different grain (portfolio-wide vs. per-segment). Once segmentation lands, “measure first” naturally gates “SFT now” rather than replacing it. Forks (a), (c), (d), (e) don’t have that structure — their two options are genuinely exclusive paths, which is why no-hybrids holds cleanly for them and only for them.
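The "segmentation gate on SFT" reading can be sketched as a filter over the verified-solve pool. The record fields (segment, wasted_turn_ratio, invalid_tool_calls) and both thresholds are hypothetical stand-ins for whatever soundness metrics the project settles on; only the shape of the gate is the point.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trajectory:
    challenge_id: str
    segment: str              # "single_shot" or "sequentially_gated" (assumed labels)
    flag_verified: bool
    wasted_turn_ratio: float  # hypothetical soundness metric in [0, 1]
    invalid_tool_calls: int   # hypothetical soundness metric

def select_sft_pool(trajs, max_wasted=0.3, max_invalid=0):
    """Split verified solves into (train_now, held_for_rl).

    Unverified trajectories are dropped, sequentially-gated solves are held back,
    and single-shot solves must also pass the soundness filter, not just flag==1.
    """
    train_now, held_for_rl = [], []
    for t in trajs:
        if not t.flag_verified:
            continue
        if t.segment != "single_shot":
            held_for_rl.append(t)
            continue
        if t.wasted_turn_ratio <= max_wasted and t.invalid_tool_calls <= max_invalid:
            train_now.append(t)
    return train_now, held_for_rl
```

Unsound single-shot solves are dropped rather than held, on the assumption that a lucky-but-unsound path is not useful training signal for either method.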
(c) Monolithic GRPO vs. milestone-shaped GRPO
| | Option 1: monolithic | Option 2: milestone-shaped |
|---|---|---|
| What it is | Terminal flag reward only, unchanged, once GRPO/RLVR starts | Potential-based shaping F(s,a,s') = γΦ(s') − Φ(s) layered on top of (never instead of) the terminal reward, where Φ = a monotone running-max count of deterministically-verified stage completions (Ng/Harada/Russell, ICML 1999 — policy-invariant by theorem) |
| Cost | None beyond baseline GRPO infra | Medium — stage_oracle.json authoring (reuses fork (a)’s work if already done), Φ must be a running max (not instantaneous), and defined identically across every termination path (stop_reason ∈ {stop, max_turns, error}) or the invariance proof breaks |
| Risk (reward-hacking) | Risk of leaving real gains on the table if the funnel is genuinely F1-dominated: MiRA’s 6.4%→43.0% WebArena-Lite result is the strongest existence-proof in this book that flag-only reward can leave a large gap, though that’s a web-navigation result, not CTF (arXiv:2603.19685) | This is where the thick, convergent reward-hacking literature lives: PURE/Stop Summation (arXiv:2504.15275), Reward Under Attack (arXiv:2603.06621), Gao et al. (arXiv:2410.15115), PRIME’s own admission (arXiv:2502.01456), MONA (arXiv:2501.13011). Every one of these converges on: the moment a stage check becomes anything softer than deterministic ground-truth, it gets farmed. This project’s own confirmed lesson (SFT-induced FLAG{} confabulation from a loose format-matcher) is the small-scale preview of the same failure mode. But that whole cluster is about gaming a soft/gameable proxy; none of it is reward tampering. Denison, MacDiarmid, Barez, Duvenaud, Kravec, Marks et al. (Anthropic, “Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models,” arXiv:2406.10162, 130+ citations, verified live) show a curriculum of easy, low-stakes specification-gaming generalizes zero-shot to models that directly rewrite their own reward function/checklists when they have the tool access to do so. That is categorically more severe than metric-farming, and making the verifier deterministic and ground-truth does not by itself stop it if the policy’s own tool surface can reach what the verifier reads. This is not hypothetical for this project: the agent already has real shell access to the sandboxed target, and today’s flag_verified check (flagscan.go, §1) is file/process-based, reachable by the same tool calls the agent uses to solve the challenge. The gap this menu doesn’t yet name: isolate the verifier’s read-path from the agent’s own tool-call surface, and adversarially probe for exactly this generalization before scaling RL; “deterministic ground-truth reward” alone is the fix for gaming, not for tampering |
| Information value | N/A — this is the default, not an experiment | High if built correctly — this is the only mechanism in the whole menu that’s a theorem, not an empirical bet, provided the two subtleties are respected |
| What must be TRUE before building Option 2 | (i) fork (a)’s funnel shows an F1 (exploration)-dominated bottleneck, not F2/F3; (ii) the scale-dependence check confirms the base policy is genuinely capacity-limited rather than already-capable-but-unreliable — arXiv:2603.21972 found staged reward helps weak models only, larger models converge fine on outcome-only reward; (iii) a deterministic oracle exists for the stage being shaped — explicitly excluding stage 3 (vuln identification), which VPR’s own authors flag as the “open, unstructured” regime their method doesn’t yet solve (arXiv:2605.10325) | Default — no gate needed |
Verdict pressure: stay monolithic until the funnel says otherwise. If it does, build the narrow,
theorem-backed version — never a learned/LLM-judge per-stage reward, under any circumstance. And regardless
of which option wins: verifier-integrity hardening (flagscan.go’s read-path isolated from the agent’s own
tool surface) is not optional once RL starts, because Denison et al.’s generalization result means a
ground-truth verifier alone doesn’t rule out the agent attacking the verifier rather than the challenge.
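A minimal sketch of the potential-based term from Option 2, F(s,a,s') = γΦ(s') − Φ(s), with Φ as a running max over verified stage counts. Setting the terminal potential to 0 for every stop_reason is an assumption of this sketch: it is one way to make Φ "defined identically across every termination path", not necessarily the project's choice. The per-step stage counts are made up.

```python
def running_max_potential(stage_counts):
    """Phi for each non-terminal state: the highest verified stage count seen so far."""
    phi, best = [], 0
    for count in stage_counts:
        best = max(best, count)  # running max, so a transient dip never lowers Phi
        phi.append(float(best))
    return phi

def shaping_terms(stage_counts, gamma=1.0):
    """F_t = gamma * Phi(s_{t+1}) - Phi(s_t) for t = 0..T-1.

    stage_counts covers the non-terminal states s_0..s_{T-1}; the terminal state
    gets Phi = 0 regardless of stop_reason (assumption of this sketch).
    """
    if not stage_counts:
        raise ValueError("need at least one non-terminal state")
    phi = running_max_potential(stage_counts) + [0.0]
    return [gamma * phi[t + 1] - phi[t] for t in range(len(phi) - 1)]

def shaped_rewards(terminal_reward, stage_counts, gamma=1.0):
    """Shaping layered on top of the terminal flag reward, never instead of it."""
    rewards = shaping_terms(stage_counts, gamma)
    rewards[-1] += terminal_reward
    return rewards

def discounted_sum(values, gamma=1.0):
    return sum((gamma ** t) * v for t, v in enumerate(values))

# Hypothetical episode: the instantaneous count dips at step 3, the running max does not.
counts = [0, 1, 1, 0, 2]
print(shaping_terms(counts), shaped_rewards(1.0, counts))
```

Because the terms telescope to γ^T·Φ(terminal) − Φ(s_0), the discounted shaping total is zero whenever the terminal potential is 0 and the episode starts with no verified stages, so the return still equals the terminal flag reward. An instantaneous (non-running-max) Φ or a stop_reason-dependent terminal value would break that cancellation.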
(d) Tool-use fix for the curl-preference (SFT/DPO/KTO)
| | Option 1: elicitation ladder first | Option 2: jump straight to a training-time fix |
|---|---|---|
| What it is | Escalate cheapest→most-expensive: few-shot prompt with 2–3 correct-usage examples → light SFT on a handful of demonstrated-usage trajectories → only then consider RL-level intervention | Go directly to DPO/KTO on tool-choice pairs, or ToolRL-style decomposed per-call reward, without testing whether elicitation alone recovers the behavior |
| Cost | Very cheap — the few-shot test is nearly free; SFT-demo step is cheap | Medium — true DPO needs k≥2 same-challenge same-model divergent pairs (only the PD26 k=5 canonical sweep qualifies today; the larger gym pools are k=1); KTO-native data (unpaired success/failed splits) is free and ready today |
| Risk (reward-hacking / wasted engineering) | Low — but a self-reinforcing trap exists regardless of which rung you’re on: a rejection-sampling corpus built from the current curl-biased policy will never contain a dead tool succeeding, because the policy never tried it — RL alone has ~zero probability mass to reinforce those tools without forced/hinted exposure first (Tool-Star, arXiv:2505.16410) | Building ToolRL/DPO machinery for what might be a pure elicitation gap — Greenblatt et al.’s password-locked-model finding says a few high-quality SFT demonstrations are often sufficient to fully elicit a locked capability (arXiv:2405.19550); over-engineering here is real opportunity cost, not just aesthetic |
| Information value | High and cheap — turns “the model prefers curl” from anecdote into a falsifiable, staged experiment (framework.md §5) | Lower until the ladder has been run — you don’t yet know which rung actually recovers the behavior |
| What must be TRUE before escalating past few-shot | (i) few-shot prompting fails to recover tool usage on held-out challenges; (ii) SFT on a small demonstrated-usage set also fails to recover it → only then is it a genuine missing-affordance problem calling for Tool-Star-style forced exposure + ToolRL-style decomposed reward (arXiv:2504.13958) | N/A |
Verdict pressure: run the ladder. Don’t skip to DPO/ToolRL on a hunch — the cheapest rungs have direct, citable precedent for “this alone is often sufficient,” and skipping them risks building infrastructure for a gap that a two-line prompt change would have closed.
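One way to turn the ladder into a measurable experiment is a tool-usage histogram taken before and after each rung. The sketch below assumes each tool_call event carries a tool field and that the tool registry is available as a list of names; both are assumptions, as are the example tool names.

```python
from collections import Counter

def tool_histogram(events, registry):
    """Count calls per registered tool and list the tools that were never called."""
    used = Counter(
        ev["tool"] for ev in events
        if ev.get("type") == "tool_call" and "tool" in ev
    )
    hist = {tool: used.get(tool, 0) for tool in registry}
    dead = sorted(tool for tool, n in hist.items() if n == 0)
    return hist, dead

def recovery_rate(dead_before, hist_after, min_calls=1):
    """Fraction of previously dead tools that a rung brought back into use."""
    if not dead_before:
        return 1.0  # nothing to recover
    revived = [t for t in dead_before if hist_after.get(t, 0) >= min_calls]
    return len(revived) / len(dead_before)
```

Running the same measurement on held-out challenges after the few-shot rung, then after the SFT-demo rung, gives each rung a pass or fail against a threshold chosen in advance, and that is what makes the escalation falsifiable.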
(e) Exploration-emphasis vs. execution-emphasis — routed by the funnel
| | Option 1: execution-emphasis | Option 2: exploration-emphasis |
|---|---|---|
| What it is | Invest in trajectory curation, more/better rejection-sampling SFT data, DAPO’s clip-higher + dynamic sampling as the GRPO baseline, GiGPO step-level credit | On-policy RL first, not SFT, for the segment the funnel flags as F1/F4-dominant; DIVER/tool-sequence diversity bonus, curiosity bonus (CDE), parameter-space-noise pilot, periodic reference-policy resets (ProRL) for genuine boundary expansion |
| Cost | Lower — DAPO is an established, widely-adopted recipe; trajectory curation reuses existing data | Higher — several of these techniques are RL-infra-dependent and Promising-not-validated (PSN-RLVR, DIVER, HiPER/hindsight credit assignment for a CTF-shaped domain) |
| Risk | If the funnel is actually F1-dominant, more SFT on the same recipe teaches guessing-and-hoping more confidently, not more competence (LIMO’s framing, arXiv:2502.03387) | If the funnel is actually F2/F3-dominant, exploration machinery is solving a problem that doesn’t exist here and burns the RL-infra budget on the wrong axis — the entropy-collapse mechanism these fixes target (arXiv:2505.22617) is real but doesn’t help a policy that’s exploring fine and just executing unreliably |
| Information value | This is literally what the funnel is for. Not a taste choice — a routed decision | Same |
| What must be TRUE before routing | Funnel result from fork (a); the scale-check (arXiv:2603.21972); the Pass@(k,T) crossover-direction test on the specific segment (fork b) — does trained pull away from base at large k (execution), or does matched-data SFT regress it while RL expands it (exploration)? | Same gate, opposite branch |
Verdict pressure: this fork cannot be decided from priors or literature alone. By design, it is the output of forks (a) and (b), not an independent choice. If you find yourself picking an emphasis before the funnel exists, you are guessing, and the guess rests on a framing that §1 itself labels “likely execution, unproven.”
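The routing logic for this fork can be written down as a small decision function. This is a simplified reading of the gate row, not a complete decision rule: it omits the scale-check, and the function and argument names are hypothetical. A None input stands for a measurement that has not been taken yet.

```python
from typing import Optional

EXECUTION_TAGS = {"F2", "F3"}
EXPLORATION_TAGS = {"F1", "F4"}

def route_emphasis(
    dominant_tag: Optional[str],
    trained_pulls_away_at_large_k: Optional[bool],
    sft_regresses_while_rl_expands: Optional[bool],
) -> str:
    """Route one segment to an emphasis, refusing to decide on missing or conflicting inputs."""
    inputs = (dominant_tag, trained_pulls_away_at_large_k, sft_regresses_while_rl_expands)
    if any(x is None for x in inputs):
        return "undecided: measure first"
    if (
        dominant_tag in EXECUTION_TAGS
        and trained_pulls_away_at_large_k
        and not sft_regresses_while_rl_expands
    ):
        return "execution-emphasis"
    if dominant_tag in EXPLORATION_TAGS and sft_regresses_while_rl_expands:
        return "exploration-emphasis"
    return "undecided: signals conflict"
```

The function returns an emphasis only when the funnel tag and the crossover-direction test agree; every other combination sends the segment back to measurement.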
3. Dependency order — a DAG, not a timeline
This is deliberately not a schedule. It shows what unlocks what — several branches can run in parallel, and nothing downstream of “instrument” is safe to start before its own inputs exist.
flowchart TD
subgraph INSTRUMENT["INSTRUMENT — near-free, read-only, do first"]
I1["stagescan.go + stage_oracle.json\nper-challenge, F1-F4 tagger"]
I2["flag_verified true byte-compare\n(replace retrieved-provenance proxy)"]
I3["entropy logging wired,\nready from RL step 0"]
I4["elicitation-ladder harness:\ntool-usage histogram across 40 sectools"]
end
subgraph MEASURE["MEASURE — diagnostic, no training changes"]
M1["Segment 1000 challenges:\nsingle-shot vs sequentially-gated"]
M2["Pass@(k,T): base vs current checkpoint,\nper segment (arXiv:2604.14877)"]
M3["Cover@tau per challenge\n(arXiv:2510.08325) — guessing vs reliable"]
M4["Base-model pass@k control\n(arXiv:2504.13837)"]
M5["Scale-check: weak/capacity-limited\nvs already-capable-unreliable\n(arXiv:2603.21972)"]
M6["Elicitation ladder run:\nfew-shot -> SFT-demo -> RL"]
end
subgraph ROUTE["ROUTE — the fork decisions (Section 2)"]
RA["Fork a: ALREADY DECIDED\n(instrument first)"]
RB["Fork b: SFT-now vs measure-first\nper segment"]
RC["Fork c: monolithic vs\nmilestone-shaped GRPO"]
RD["Fork d: elicitation vs\ntraining-time tool fix"]
RE["Fork e: exploration- vs\nexecution-emphasis"]
end
subgraph TRAIN["TRAIN — the actual runs"]
T1["Rejection-sampling SFT\non single-shot segment, curated\n(STaR / ReST-EM pattern)"]
T2["GRPO + DAPO baseline\n(clip-higher, dynamic sampling)"]
T3["+ GiGPO step-level credit\n(zero extra rollouts)"]
T4["+ potential-based milestone\nshaping (gated, fork c only)"]
T5["ToolRL / Tool-Star forced\nexposure (gated, fork d only)"]
T2b["Exploration-emphasis RL:\nDIVER / CDE curiosity bonus /\nPSN-RLVR / ProRL resets\n(gated, fork e exploration branch)"]
end
subgraph GRADUATE["GRADUATE — the go/no-go gates"]
G1["Entropy collapsed\nAND pass@64 non-trivial\n(arXiv:2510.01624)"]
G2["Semantics-preserving-transform\nrobustness check survives\n(arXiv:2502.07445 / 2503.02296)"]
G3["Base-pass@k control still\ntrails trained pass@k\n(gain is real, not elicitation)"]
G4["pass@large-k did NOT shrink\npost-RL (RL-PLUS check,\narXiv:2508.00222)"]
end
I1 --> M1
I2 --> M4
I3 --> G1
I4 --> M6
M1 --> M2
M2 --> M3
M2 --> M4
M2 --> M5
M1 --> RB
M2 --> RB
M3 --> RB
M5 --> RC
M1 --> RC
M6 --> RD
RB --> RE
M5 --> RE
RB --> T1
RC -->|"F1-dominant, weak policy"| T4
RC -->|"F2/F3-dominant"| T2
RD -->|"elicitation recovers it"| T1
RD -->|"neither recovers it"| T5
RE -->|"execution-emphasis"| T2
RE -->|"exploration-emphasis"| T2b
T1 --> T2
T2 --> T3
T3 --> T4
T3 --> T5
T2 --> G1
T4 --> G1
T5 --> G1
T2b --> G1
G1 --> G2
G2 --> G3
G3 --> G4
G4 -->|"holds"| Ship["Credit the gain.\nGeneralize to the next segment /\nchallenge subset"]
G4 -->|"fails"| Back["Back to MEASURE —\nre-run funnel, re-check scale,\ndo not re-train blind"]
classDef inst fill:#132b22,stroke:#34d399,color:#eafaf3;
classDef meas fill:#0f2a3d,stroke:#38bdf8,color:#e6f6ff;
classDef route fill:#3a2e14,stroke:#f5b942,color:#fff6e0;
classDef train fill:#2a1438,stroke:#c084fc,color:#f3e8ff;
classDef grad fill:#3a1414,stroke:#f87171,color:#fde8e8;
class I1,I2,I3,I4 inst;
class M1,M2,M3,M4,M5,M6 meas;
class RA,RB,RC,RD,RE route;
class T1,T2,T3,T4,T5,T2b train;
class G1,G2,G3,G4 grad;
Read this as: nothing in TRAIN is safe to start before its ROUTE gate fires, and nothing in ROUTE is safe to decide before its MEASURE inputs exist. INSTRUMENT is the only stage with no prerequisites — which is why fork (a) has no real counter-argument.
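The "what unlocks what" reading can be checked mechanically. The sketch below transcribes the diagram's edges (leaving out the Ship/Back outcomes and the edge-less RA node) and uses Python's standard graphlib to confirm the graph is acyclic and to list what is safe to start given what is already done. It treats the conditional edges (for example RC to T4 only when the funnel is F1-dominant) as plain prerequisites, which is a simplification.

```python
from graphlib import TopologicalSorter

# (prerequisite, dependent) pairs transcribed from the diagram.
EDGES = [
    ("I1", "M1"), ("I2", "M4"), ("I3", "G1"), ("I4", "M6"),
    ("M1", "M2"), ("M2", "M3"), ("M2", "M4"), ("M2", "M5"),
    ("M1", "RB"), ("M2", "RB"), ("M3", "RB"), ("M5", "RC"), ("M1", "RC"),
    ("M6", "RD"), ("RB", "RE"), ("M5", "RE"),
    ("RB", "T1"), ("RC", "T4"), ("RC", "T2"), ("RD", "T1"), ("RD", "T5"),
    ("RE", "T2"), ("RE", "T2b"),
    ("T1", "T2"), ("T2", "T3"), ("T3", "T4"), ("T3", "T5"),
    ("T2", "G1"), ("T4", "G1"), ("T5", "G1"), ("T2b", "G1"),
    ("G1", "G2"), ("G2", "G3"), ("G3", "G4"),
]

def build_prereqs(edges):
    """Map each node to the set of nodes that must exist before it."""
    prereqs = {}
    for before, after in edges:
        prereqs.setdefault(before, set())
        prereqs.setdefault(after, set()).add(before)
    return prereqs

def roots(prereqs):
    """Nodes with no prerequisites at all."""
    return {node for node, deps in prereqs.items() if not deps}

def ready_to_start(prereqs, done):
    """Nodes not yet done whose prerequisites are all in `done`."""
    done = set(done)
    return sorted(n for n, deps in prereqs.items() if n not in done and deps <= done)

PREREQS = build_prereqs(EDGES)
# static_order() raises graphlib.CycleError if the transcription contains a cycle.
ORDER = list(TopologicalSorter(PREREQS).static_order())

print(sorted(roots(PREREQS)))
print(ready_to_start(PREREQS, {"I1"}))
```

The only nodes with an empty prerequisite set are the four INSTRUMENT nodes, which matches the claim that INSTRUMENT is the only stage with no prerequisites.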
4. Open hypotheses to test
These are falsifiable, in the sense the diagnosis chapter insists on: each has a stated experiment and a stated result that would kill it. This is the “prove it to myself” frame, not a checklist to complete once — re-run per challenge segment as the corpus grows.
| # | Hypothesis | Experiment | Falsified if |
|---|---|---|---|
| H1 | It’s execution, not knowledge, on the non-sequentially-gated segment | Base-model pass@k at large k (64, 256) on currently-failing single-shot challenges: does the correct action ever appear? | The correct action never appears at any k on any checkpoint for a large fraction of these; that’s a knowledge gap for that subset, requiring off-policy injection (demonstration, teacher, or a tool), not more RL |
| H2 | Milestone shaping helps, doesn’t hack | Introduce potential-based shaping (fork c, gated) on the F1-dominant segment only; track held-out flag_verified rate and pass@large-k before/after | Held-out flag rate drops, or pass@large-k shrinks post-introduction (capability-boundary collapse, arXiv:2508.00222) — either result means the shaping term is being farmed, revert to monolithic immediately |
| H3 | The horizon is tractable for GRPO at the 30–60% baseline band | Run DAPO+GiGPO on challenges the funnel tags F2/F3-dominant, in the 30–60% pass-rate band; watch entropy from step 0 | Entropy still collapses under DAPO’s own fixes, or stage-transition credit doesn’t concentrate on the exploitation phase specifically (GiGPO’s state-hash groups show flat credit) — means the horizon/credit-assignment problem is harder than the established recipe assumes for this task shape |
| H4 | A real exploration gap exists, localized to the sequentially-gated segment | Replicate Zhai et al.’s crossover-direction test on this project’s own compositional-segment challenges: does matched-data SFT regress pass@(k,T) on this segment while GRPO expands it? | If SFT does not regress this segment (both SFT and RL improve it comparably), the sequential-gating framing doesn’t transfer to this task family, and the single-shot ordering (SFT then GRPO) is fine everywhere — fork (b)/(e)’s special-casing was unnecessary |
| H5 | Tool-avoidance (curl-preference) is elicitation, not a missing-affordance problem | Run the elicitation ladder (fork d) on a sample of the 26 dead sectools entries on held-out challenges | Neither few-shot prompting nor light SFT-on-demos recovers usage — genuinely a missing-affordance problem, escalate to Tool-Star forced exposure + ToolRL decomposed reward |
| H6 | Any claimed solve-rate gain reflects real execution-reliability improvement, not memorization/elicitation | Base-model pass@k-at-large-k control (H1’s instrument, reused) and semantics-preserving-transform variants of a held-out subset, checked against every claimed gain before crediting it | The gain evaporates on either check — the gain is elicitation (fine to attribute to SFT, a red flag if it persists after GRPO) or memorization of the fixed ~10 canonical PD26 shapes |
| H7 | The SFT go/no-go gate (entropy collapse) is sufficient on its own | Check whether pass@64 on the rejection-sampling-SFT checkpoint is non-trivial at the same time entropy collapses, before green-lighting GRPO (arXiv:2510.01624) | Entropy has collapsed but pass@64 is flat/low — this predicts a disappointing GRPO run regardless of how good SFT accuracy looked; do not launch on entropy-collapse alone |
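
H7's two-condition gate can be sketched as a function of mean token entropy and pass@64. The entropy formula is the standard Shannon entropy in nats; the two thresholds (entropy_ceiling, pass_floor) are placeholder values for illustration, not numbers taken from the cited paper or from this project.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_token_entropy(distributions):
    """Average entropy over a list of next-token distributions."""
    if not distributions:
        raise ValueError("need at least one distribution")
    return sum(token_entropy(d) for d in distributions) / len(distributions)

def sft_gate(mean_entropy, pass_at_64, entropy_ceiling=0.5, pass_floor=0.1):
    """Go/no-go before GRPO: entropy collapse alone is not enough to launch."""
    collapsed = mean_entropy <= entropy_ceiling
    if collapsed and pass_at_64 >= pass_floor:
        return "go"
    if collapsed:
        return "no-go: entropy collapsed but pass@64 is flat"
    return "wait: entropy has not collapsed yet"
```

The middle branch is the case H7 exists to catch: a checkpoint that looks ready on entropy alone but has little large-k coverage left for RL to reinforce.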
5. What we deliberately are NOT basing this on
Standing project rule, restated for this chapter specifically: no fork, no hypothesis, no cost/risk
estimate, and no number above rests on an academic cybersecurity-LLM training or benchmark paper — CTF-Dojo,
Cyber-Zero, Pentest-R1, HackSynth/Random-Crypto, AutoPenBench, Cybench, NYU CTF Bench, EnIGMA, InterCode-CTF,
DRLRM-PT, node-fragility reward shaping, the kill-chain-staged-reward paper, Nakano’s ATT&CK-tree scaffold,
or Honarvar’s Evolve-CTF/Capture-the-Flags family-based evaluation — even where several of these report a
finding that would superficially support one side of a fork here. None of that line of work has produced a
frontier cybersecurity model, so none of it counts as frontier evidence for a load-bearing decision; every
mention of them in the six source chapters this capstone draws from is explicitly labelled “academic, cited
for context only, not a basis,” and this chapter inherits that discipline rather than re-importing their
numbers under a different heading. Every claim above is re-grounded on one of: general frontier post-training
disclosures (DeepSeek-R1, Kimi k1.5/K2, Llama 4, OpenAI Deep Research), general RL/agent theory (potential-based
shaping, the reward-hacking convergence, reward-tampering-as-generalization (arXiv:2406.10162),
DAgger, GAE, entropy-collapse mechanics), general (non-security)
agent-eval and long-horizon literature (METR, AgentBoard, MAST, τ-bench, GSM-Symbolic/C-BOD), or this project’s
own measured data and confirmed lessons (the SFT-induced FLAG{} confabulation, the Phoenix trace corpus, the
existing pass@k methodology). Where a demoted academic-security idea is still worth pursuing on its own
merits — e.g. staged/kill-chain-shaped reward in a cybersecurity-specific loop — the chapters this one draws
from say so explicitly and flag it “worth pursuing — unvalidated outside academic-security work,” never as
settled ground.
Cross-links
- The path to a frontier cybersecurity model — the north star this whole Rung-1 fork-and-DAG plan is a prerequisite for, not a substitute for; explains why resolving every fork here still leaves Rungs 2–3 (environment scale, CPT/mid-training) unaddressed.
- The recipe is a sequence, not a pick — the stage-order-and-compounding frame fork (b)’s “don’t SFT the whole portfolio blind” verdict is a specific instance of.
- Proven post-training datasets — a usage-cited registry — the concrete dataset shopping list for the general-capability rungs underneath forks (b)/(d)’s SFT/preference data.
- Cybersecurity is one of a family — what cracked the others — the cross-domain evidence (coding agents, competitive programming, theorem proving, web agents, games, robotics) several forks below draw on directly for technique precedent and risk calibration.
- Diagnosing the gap — a scientific framework — the routing test and the pass@k / Pass@(k,T) / Cover@τ protocol every MEASURE node in §3’s DAG instantiates.
- From behavioral audit to training signal — the per-pattern gap→method→verification mapping fork (d) and fork (e) draw on directly.
- One problem, or many? — decomposition vs. monolithic — the full verdict and honest-limits case behind fork (c).
- Before you train — instrumentation & data readiness — the source of fork (a)’s cost estimate and the concrete first-step recipe for the F1–F4 tagger.
- RL that creates value — long-horizon, exploration, reasoning, novelty — the technique menu (DAPO, GiGPO, DIVER, ProRL, NuRL, PSN-RLVR) fork (e)’s exploration-emphasis branch draws on.
- The decision — the one-line version of the whole book’s routing question; this chapter is its decision-surface expansion, not a replacement.