Where you are & the forks ahead

This is the capstone chapter, not a roadmap. Every other chapter in this book resolves a method question (SFT vs DPO vs GRPO, monolithic vs decomposed, which exploration fix). This chapter assembles those resolutions into the shape you actually need to draw your own plan: what’s true today, what you have to decide, in what order the decisions unlock each other, and what would have to be false to make you change course. It recommends nothing you haven’t already read elsewhere in this book — it routes you back to Diagnosing the gap, From behavioral audit to training signal, One problem, or many?, Before you train, RL that creates value, and The decision at every load-bearing point. Read this after those, not instead of them.

Where this fork-by-fork plan sits in the bigger picture: everything below is Rung-1-scoped. It resolves one question only: how to execute reliably on the portfolio you have. It is a prerequisite for, not a substitute for, The path to a frontier cybersecurity model, which argues that even a perfectly resolved DAG below doesn’t by itself cross into “frontier”; that takes orders-of-magnitude more RL-environment scale (Rung 2) and possibly a CPT/mid-training stage (Rung 3). The cross-domain evidence grounding that argument (six other long-horizon, sparse-reward, verifiable domains and what actually cracked each one) lives in Cybersecurity is one of a family — what cracked the others; several forks below, especially (c) and (e), draw directly on techniques surveyed there (GiGPO, DAPO, WebRL’s failure-to-curriculum, potential-based shaping). Fork (b)’s SFT-vs-measure-first framing and its order-matters/compounding rationale are the Sequence-B-specific instance of the general argument in The recipe is a sequence, not a pick. The general-capability SFT/preference data that forks (b) and (d) would eventually pull from is catalogued in Proven post-training datasets — a usage-cited registry.

1. Where you are — the diagnosis on one screen

The number: ~100–200 solved / ~1000 challenges, at k=1. This is a portfolio statistic, not a per-challenge pass rate: a challenge that was solved once could be a 5% fluke or a 55% better-than-even solve, and those two cases call for opposite next moves (The decision, “one prerequisite before any of this”). Nobody has yet run pass@k per challenge, let alone per pipeline stage.
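To make the portfolio-versus-per-challenge distinction concrete, here is a minimal sketch of a per-challenge pass@k estimate. It uses the standard unbiased combinatorial estimator (Chen et al., 2021). The challenge names and attempt counts are invented for illustration, and nothing here is claimed to be the project's existing pass@k tooling.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts, c of them verified solves."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the attempts contains at least one solve.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-challenge counts: (attempts, verified solves).
attempts = {"fluke": (20, 1), "reliable": (20, 11)}

# Both challenges can show up as "solved once" in a k=1 sweep,
# yet their per-challenge pass@1 differs by an order of magnitude.
per_challenge = {name: pass_at_k(n, c, 1) for name, (n, c) in attempts.items()}
```

Both hypothetical challenges would count identically toward a "solved at k=1" tally on a lucky run; only the per-challenge estimate separates them.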

The pipeline is a chain, not a single action:

flowchart LR
  R["Recon /\nenumeration"] --> E["Endpoint\ndiscovery"]
  E --> V["Identify the\nvulnerable endpoint"]
  V --> X["Exploit it"]
  X --> P["Post-exploitation /\npivot"]
  P --> F(("Flag\n{0,1}\nground-truth verified"))

  classDef stage fill:#132b22,stroke:#34d399,color:#eafaf3;
  class R,E,V,X,P stage;

Only the last box (flag_verified) is checked today, and even that is presently a provenance proxy (flag_scan.retrieved — “the string came back from the sandbox, not the model’s mouth”) rather than a true byte-compare against benchmark/flags/pd26_flags.current.json — that exact-match wiring doesn’t exist yet outside a manual, SSH-gated step (instrumentation-and-data-readiness.md §3.1). So even the ground-truth anchor this whole book leans on is one small, well-scoped engineering task away from being fully automatic, not there yet.
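The difference between the provenance proxy and a true byte-compare is small in code. The sketch below assumes the flags file is a flat JSON object mapping challenge id to flag string; that schema, the function name, and the sample flag are all hypothetical, since the source does not describe the layout of `pd26_flags.current.json`.

```python
import hmac
import json


def flag_matches(challenge_id: str, retrieved: bytes, expected_flags: dict) -> bool:
    """True only if the retrieved bytes equal the expected flag for this challenge."""
    expected = expected_flags.get(challenge_id)
    if expected is None:
        return False  # no ground truth on file: never count as verified
    # Assumption: a trailing newline from the sandbox is tolerated, nothing else.
    candidate = retrieved.rstrip(b"\r\n")
    return hmac.compare_digest(candidate, expected.encode("utf-8"))


# Hypothetical file content; the real file's schema is not given in the text.
expected_flags = json.loads('{"PD26-02": "FLAG{example-not-real}"}')
```

A format-only matcher would accept `FLAG{}`; the exact compare rejects it, which is the property the loose format-matcher lesson later in this chapter depends on.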

The stage-localized failure taxonomy (F1–F4), mapped to canonical RL/agent-research framing:

Tag	Failure	Canonical framing	Fix lever if confirmed dominant
F1	Never finds the vulnerable endpoint	Exploration / coverage failure — no gradient until reward is first observed	On-policy RL with exploration preservation, not more demonstrations
F2	Finds it, probes shallowly, can’t land the exploit	Execution / skill (performance-floor) failure	Trajectory curation, rejection-sampling SFT, DAPO/GiGPO as GRPO baseline
F3	Clumsy tool use, wrong tool for the job	Policy / tool-selection failure — its own axis	Elicitation ladder → ToolRL/Tool-Star if elicitation fails
F4	No real pivot/chaining after a foothold	Long-horizon credit-assignment failure (variance, not coverage)	Step-level credit (GiGPO), curriculum (1-hop before 2-hop), never “try more” alone

The diagnosis, stated as a hypothesis, not a fact: the project’s working read is “likely an execution gap” (F2/F3-flavored) rather than a knowledge gap or a pure exploration gap — capability is probably present and unreliable, not absent. This is the single most load-bearing framing decision in the whole plan, and it is currently unproven. The book’s own diagnosis framework is explicit about this: “the honest, defensible answer will not be a single sentence” — the true picture is almost certainly a split verdict, different F-tags dominating different challenge subtypes, not one gap type for the whole 1000 (Diagnosing the gap §0, §8).

What “proven” requires, concretely, and doesn’t exist yet:

Per-challenge (not aggregate) pass@k, segmented by whether the winning path is single-shot or sequentially-gated (compositional — enumeration must land before exploitation is even visible).
Pass@(k,T), base model vs. current checkpoint, per segment, to separate three cases: a genuine execution gap (trained pulls away from base at large k), a pure elicitation artifact (base catches up), and an exploration gap hiding underneath (matched-data SFT regresses the segment, RL expands it) (arXiv:2504.13837, arXiv:2604.14877).
A working F1–F4 stage tagger over the existing Phoenix/events.jsonl corpus — the actual current gap, confirmed by direct source read: the harness’s process telemetry is already complete for this (tool_call/tool_result pairs joined on tool_call_id); nothing needs to change in go/libs/agent/events or secagent/runner.go. What’s missing is purely semantic — a deterministic post-hoc scan plus one stage_oracle.json per challenge, authored from that challenge’s own solve.py (Before you train §5).
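The stage tagger in the last item can be sketched as a deterministic post-hoc scan. Everything about the data shape here is an assumption: the event fields (`type`, `tool_call_id`, `output`), the stage names, and the regex-per-stage oracle format are hypothetical stand-ins for whatever `events.jsonl` and `stage_oracle.json` actually contain. F3 (tool selection) is treated as its own axis and is not assigned by this funnel.

```python
import re
from typing import Optional

# Hypothetical stage names, in pipeline order (section 1's chain).
STAGES = ["recon", "endpoint_discovery", "vuln_identified", "exploit", "pivot"]


def furthest_stage(events: list[dict], oracle: dict[str, str]) -> int:
    """Count the leading stages whose oracle regex matches some tool_result."""
    # Join results to calls on tool_call_id, as the text describes.
    calls = {e["tool_call_id"] for e in events if e["type"] == "tool_call"}
    outputs = [e["output"] for e in events
               if e["type"] == "tool_result" and e["tool_call_id"] in calls]
    reached = 0
    for stage in STAGES:
        pattern = oracle.get(stage)
        if pattern is None or not any(re.search(pattern, o) for o in outputs):
            break
        reached += 1
    return reached


def f_tag(reached: int, flag_verified: bool) -> Optional[str]:
    """Map the furthest verified stage of a failed run to F1, F2, or F4."""
    if flag_verified:
        return None      # solved: nothing to tag
    if reached < 3:
        return "F1"      # never identified the vulnerable endpoint
    if reached == 3:
        return "F2"      # identified it, did not land the exploit
    return "F4"          # foothold, but no pivot or chaining to the flag


events = [
    {"type": "tool_call", "tool_call_id": "c1", "tool": "nmap"},
    {"type": "tool_result", "tool_call_id": "c1", "output": "80/tcp open http"},
    {"type": "tool_call", "tool_call_id": "c2", "tool": "curl"},
    {"type": "tool_result", "tool_call_id": "c2", "output": "GET /api/export 200"},
    {"type": "tool_call", "tool_call_id": "c3", "tool": "curl"},
    {"type": "tool_result", "tool_call_id": "c3", "output": "SQL syntax error near"},
]
oracle = {
    "recon": r"80/tcp open",
    "endpoint_discovery": r"/api/export",
    "vuln_identified": r"SQL syntax error",
    "exploit": r"rows dumped",
    "pivot": r"uid=0",
}
```

On this invented trace the scan reaches stage 3 and the failed run is tagged F2. The scan only reads existing events, which is why the text can call it read-only and near-free.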

Bottom line for this section: treat “it’s an execution gap” as the leading hypothesis, not settled ground. Everything in §2–§4 below is written so that it stays true whichever way the funnel eventually comes down — several forks are explicitly gated on a measurement that hasn’t been taken yet.

2. The forks

Five decisions, each with exactly two options (per this book’s convention: no hybrids, no third path). Every “what must be TRUE” row is a gate, not a preference. A fork should not be decided until its gate is checked, because in at least two of these forks (b, e) the two options don’t just differ in cost; they require opposite SFT/RL orderings. One explicit, flagged exception: fork (b)’s verdict below lands on a hybrid rather than either labeled option outright, and it is called out and justified there rather than smuggled in past this convention.

(a) Instrument the F1–F4 stage tagger first?

	Option 1 — build it first	Option 2 — skip it, decide off aggregate `pass@k`/`flag_verified` alone
What it is	`stagescan.go` (same shape as the existing `flagscan.go`) + one `stage_oracle.json` per challenge, authored from `solve.py`; prototype on PD26-02 first, then generalize	Proceed straight to the SFT/GRPO plan using only the terminal flag signal and portfolio-level solve rate
Cost (compute + eng)	Near-free — read-only post-hoc scan over `events.jsonl` you already have; no new sandbox instrumentation; editorial authoring of ~4 predicates/challenge (one person reads ~10 `solve.py` files)	Zero now — but every downstream decision (b–e) is made blind to which F-tag actually dominates
Risk (reward-hacking)	None — this is diagnostic-only, no reward function touched	Routing risk, not hacking risk: you may sink a training cycle into the wrong lever (e.g. rejection-sampling SFT when the real bottleneck is F1/exploration, or milestone shaping when it’s actually F2/F4)
Information value	Highest single move in this whole chapter. “This single aggregation is the input every downstream decision below depends on” — verbatim from Before you train §4	Low — an aggregate number “averages over four structurally different failure modes,” exactly the collapsing-a-split-verdict anti-pattern the diagnosis framework names first
What must be TRUE first	Nothing — this is the recommended first step regardless of any other measurement	N/A

Verdict pressure: there is no real argument for Option 2. This fork is here because it’s the fork everyone is tempted to skip under time pressure, not because the evidence is close.

(b) Rejection-sampling SFT on verified solves NOW vs. measure pass@k-per-stage FIRST

	Option 1 — SFT now	Option 2 — segment + measure first
What it is	Run rejection-sampling SFT on the ~185 already-collected verified-solve trajectories across the 5 corpora (gym263, gym564, warpenv-broker, envgen, argus60-base)	Segment the 1000 challenges (single-shot vs. sequentially-gated), run Pass@(k,T) — base vs. current checkpoint — per segment, before deciding what to train on
Cost	Cheap — data already exists, this is already the project’s stated near-term plan	Medium — sampling compute at multiple k, cold-start `pdq --fresh-retries`, no training required
Risk (reward-hacking / generalization)	Concrete, not hypothetical. On the compositional/sequentially-gated segment, matched-data SFT actually regresses capability (net −4) while RL expands it (net +4) — Zhai et al., arXiv:2604.14877. Training the wrong subset with SFT doesn’t just waste compute, it can make that subset worse. Separately: the ~185-trajectory pool may be guessing-dominated (high pass@64, low Cover@τ — arXiv:2510.08325) or contain lucky-but-unsound paths (right flag, wrong/wasted reasoning — arXiv:2506.14245)	Low — diagnostic only, but real opportunity cost if it delays shipping a known-safe move (SFT is the project’s current plan, already literature-validated as a baseline — arXiv:2504.11343)
Information value	Low incremental — you already believe SFT-on-solves works generically; this doesn’t test where it works	High — this is “the single highest-value experimental design” in the diagnosis chapter, and it’s directly testable this week with no new training
What must be TRUE before committing to Option 1 wholesale	(i) the SFT pool isn’t guessing-dominated (Cover@τ check on its source challenges); (ii) trajectories are filtered on soundness (backtracking/wasted-turns/tool-validity), not just `flag==1`; (iii) fork (a)’s stage tagger doesn’t show these ~185 trajectories concentrated on the sequentially-gated segment	N/A

Verdict pressure — this is this chapter’s one deliberate exception to the “no hybrids” convention stated in §2, flagged rather than smuggled in: don’t cancel the SFT plan — but don’t treat “SFT now” as a blanket recipe across the whole portfolio either. The correct read of these two options is closer to “do (2) as a segmentation gate on (1)”: SFT the single-shot segment now, hold the sequentially-gated segment for GRPO once entropy instrumentation is live. Why this fork earns the exception where the other four don’t: options 1 and 2 here aren’t mutually exclusive courses of action — one is a training decision, the other a measurement decision, and they resolve at different grain (portfolio-wide vs. per-segment). Once segmentation lands, “measure first” naturally gates “SFT now” rather than replacing it. Forks (a), (c), (d), (e) don’t have that structure — their two options are genuinely exclusive paths, which is why no-hybrids holds cleanly for them and only for them.
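The "segmentation gate on SFT" reading can be sketched as a filter over the trajectory pool. The record fields and the wasted-turn threshold below are hypothetical; the text only requires that selection uses soundness signals (wasted turns, tool validity) and the segment label, not `flag==1` alone.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    challenge_id: str
    segment: str              # "single_shot" or "sequentially_gated"
    flag_verified: bool
    turns: int
    wasted_turns: int         # hypothetical: turns whose result was never used
    invalid_tool_calls: int   # hypothetical: malformed or rejected tool calls


def sft_eligible(t: Trajectory, max_wasted_frac: float = 0.3) -> bool:
    """Keep verified, sound solves from the single-shot segment only."""
    if not t.flag_verified or t.segment != "single_shot":
        return False          # sequentially-gated solves are held for GRPO
    if t.turns == 0 or t.invalid_tool_calls > 0:
        return False
    return t.wasted_turns / t.turns <= max_wasted_frac


pool = [
    Trajectory("a", "single_shot", True, 10, 1, 0),
    Trajectory("b", "sequentially_gated", True, 12, 0, 0),
    Trajectory("c", "single_shot", True, 10, 6, 0),   # lucky but wasteful
    Trajectory("d", "single_shot", False, 8, 0, 0),
]
sft_pool = [t.challenge_id for t in pool if sft_eligible(t)]
```

The 0.3 threshold is a placeholder, not a recommendation; the point is that the segment check and the soundness check run before any trajectory reaches the trainer.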

(c) Monolithic GRPO vs. milestone-shaped GRPO

	Option 1 — monolithic	Option 2 — milestone-shaped
What it is	Terminal flag reward only, unchanged, once GRPO/RLVR starts	Potential-based shaping `F(s,a,s') = γΦ(s') − Φ(s)` layered on top of (never instead of) the terminal reward, where Φ = a monotone running-max count of deterministically-verified stage completions (Ng/Harada/Russell, ICML 1999 — policy-invariant by theorem)
Cost	None beyond baseline GRPO infra	Medium — `stage_oracle.json` authoring (reuses fork (a)’s work if already done), Φ must be a running max (not instantaneous), and defined identically across every termination path (`stop_reason` ∈ `{stop, max_turns, error}`) or the invariance proof breaks
Risk (reward-hacking)	Risk of leaving real gains on the table if the funnel is genuinely F1-dominated. MiRA’s 6.4%→43.0% WebArena-Lite result is the strongest existence-proof in this book that flag-only reward can leave a large gap, though that’s a web-navigation result, not CTF (arXiv:2603.19685)	This is where the thick, convergent reward-hacking literature lives: PURE/Stop Summation (arXiv:2504.15275), Reward Under Attack (arXiv:2603.06621), Gao et al. (arXiv:2410.15115), PRIME’s own admission (arXiv:2502.01456), MONA (arXiv:2501.13011). Every one of these converges on: the moment a stage check becomes anything softer than deterministic ground-truth, it gets farmed. This project’s own confirmed lesson (SFT-induced `FLAG{}` confabulation from a loose format-matcher) is the small-scale preview of the same failure mode. *But that whole cluster is about gaming a soft/gameable proxy; none of it is reward tampering.* Denison, MacDiarmid, Barez, Duvenaud, Kravec, Marks et al. (Anthropic, “Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models,” arXiv:2406.10162, 130+ citations, verified live) show a curriculum of easy, low-stakes specification-gaming generalizes zero-shot to models that directly rewrite their own reward function/checklists when they have the tool access to do so. That is categorically more severe than metric-farming, and making the verifier deterministic and ground-truth does not by itself stop it if the policy’s own tool surface can reach what the verifier reads. This is not hypothetical for this project: the agent already has real shell access to the sandboxed target, and today’s `flag_verified` check (`flagscan.go`, §1) is file/process-based, reachable by the same tool calls the agent uses to solve the challenge. The gap this menu doesn’t yet name: isolate the verifier’s read-path from the agent’s own tool-call surface, and adversarially probe for exactly this generalization before scaling RL. “Deterministic ground-truth reward” alone is the fix for gaming, not for tampering
Information value	N/A — this is the default, not an experiment	High if built correctly — this is the only mechanism in the whole menu that’s a theorem, not an empirical bet, provided the two subtleties are respected
What must be TRUE before building Option 2	(i) fork (a)’s funnel shows an F1 (exploration)-dominated bottleneck, not F2/F3; (ii) the scale-dependence check confirms the base policy is genuinely capacity-limited rather than already-capable-but-unreliable — arXiv:2603.21972 found staged reward helps weak models only, larger models converge fine on outcome-only reward; (iii) a deterministic oracle exists for the stage being shaped — explicitly excluding stage 3 (vuln identification), which VPR’s own authors flag as the “open, unstructured” regime their method doesn’t yet solve (arXiv:2605.10325)	Default — no gate needed

Verdict pressure: stay monolithic until the funnel says otherwise. If it does, build the narrow, theorem-backed version — never a learned/LLM-judge per-stage reward, under any circumstance. And regardless of which option wins: verifier-integrity hardening (flagscan.go’s read-path isolated from the agent’s own tool surface) is not optional once RL starts, because Denison et al.’s generalization result means a ground-truth verifier alone doesn’t rule out the agent attacking the verifier rather than the challenge.
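Option 2's shaping term is easy to get subtly wrong, so a small sketch may help. It computes `F = γΦ(s') − Φ(s)` with Φ as a running max over verified stage counts, then checks the telescoping identity: the discounted sum of shaping terms depends only on the first and last potential, not on the path between them. The function names and the two stage-count sequences are illustrative assumptions, not project code.

```python
def shaping_terms(stage_counts: list[int], gamma: float = 1.0) -> list[float]:
    """Per-step shaping F_t = gamma * phi(s_{t+1}) - phi(s_t).

    stage_counts[t] is the number of verified stages observed at step t.
    It may dip (e.g. a lost session); phi is the running max, so it cannot.
    """
    phi, best = [], 0
    for count in stage_counts:
        best = max(best, count)
        phi.append(float(best))
    return [gamma * phi[t + 1] - phi[t] for t in range(len(phi) - 1)]


def discounted_sum(terms: list[float], gamma: float) -> float:
    return sum((gamma ** t) * f for t, f in enumerate(terms))


# Two hypothetical episodes of equal length that both end with 3 verified stages.
path_a = [0, 1, 1, 2, 3]
path_b = [0, 0, 2, 1, 3]   # instantaneous count dips at step 3

total_a = discounted_sum(shaping_terms(path_a, 0.9), 0.9)
total_b = discounted_sum(shaping_terms(path_b, 0.9), 0.9)
# Telescoping: both totals equal gamma**T * phi(s_T) - phi(s_0).
```

With an instantaneous Φ, `path_b` would pay a negative term at the dip and a second positive term on recovery, which is the kind of loop a policy can farm. The running max removes that. The requirement that Φ be defined identically on every termination path is what keeps the final potential comparable across episodes.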

(d) Tool-use fix for the curl-preference (SFT/DPO/KTO)

	Option 1 — elicitation ladder first	Option 2 — jump straight to a training-time fix
What it is	Escalate cheapest→most-expensive: few-shot prompt with 2–3 correct-usage examples → light SFT on a handful of demonstrated-usage trajectories → only then consider RL-level intervention	Go directly to DPO/KTO on tool-choice pairs, or ToolRL-style decomposed per-call reward, without testing whether elicitation alone recovers the behavior
Cost	Very cheap — the few-shot test is nearly free; SFT-demo step is cheap	Medium — true DPO needs k≥2 same-challenge same-model divergent pairs (only the PD26 k=5 canonical sweep qualifies today; the larger gym pools are k=1); KTO-native data (unpaired success/failed splits) is free and ready today
Risk (reward-hacking / wasted engineering)	Low — but a self-reinforcing trap exists regardless of which rung you’re on: a rejection-sampling corpus built from the current curl-biased policy will never contain a dead tool succeeding, because the policy never tried it — RL alone has ~zero probability mass to reinforce those tools without forced/hinted exposure first (Tool-Star, arXiv:2505.16410)	Building ToolRL/DPO machinery for what might be a pure elicitation gap — Greenblatt et al.’s password-locked-model finding says a few high-quality SFT demonstrations are often sufficient to fully elicit a locked capability (arXiv:2405.19550); over-engineering here is real opportunity cost, not just aesthetic
Information value	High and cheap — turns “the model prefers curl” from anecdote into a falsifiable, staged experiment (framework.md §5)	Lower until the ladder has been run — you don’t yet know which rung actually recovers the behavior
What must be TRUE before escalating past few-shot	(i) few-shot prompting fails to recover tool usage on held-out challenges; (ii) SFT on a small demonstrated-usage set also fails to recover it. Only then is it a genuine missing-affordance problem calling for Tool-Star-style forced exposure + ToolRL-style decomposed reward (arXiv:2504.13958)	N/A

Verdict pressure: run the ladder. Don’t skip to DPO/ToolRL on a hunch — the cheapest rungs have direct, citable precedent for “this alone is often sufficient,” and skipping them risks building infrastructure for a gap that a two-line prompt change would have closed.
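Before running the ladder it helps to know which tools are actually dead. A minimal sketch of the tool-usage histogram follows, assuming each `tool_call` event carries a `tool` field (a hypothetical field name) and that the registered tool list is available separately; the tool names are examples only.

```python
from collections import Counter


def tool_histogram(events: list[dict], registered_tools: list[str]):
    """Count tool_call events per tool and list registered tools never called."""
    counts = Counter(e["tool"] for e in events if e["type"] == "tool_call")
    dead = sorted(t for t in registered_tools if counts[t] == 0)
    return counts, dead


# Hypothetical trace and registry.
events = [
    {"type": "tool_call", "tool": "curl"},
    {"type": "tool_result", "output": "200 OK"},
    {"type": "tool_call", "tool": "curl"},
    {"type": "tool_call", "tool": "nmap"},
    {"type": "tool_call", "tool": "curl"},
]
counts, dead = tool_histogram(events, ["curl", "nmap", "sqlmap", "ffuf"])
```

The `dead` list is the input to the ladder: those are the tools a rejection-sampling corpus from the current policy can never show succeeding, so they need few-shot or demonstrated exposure before any reward can reach them.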

(e) Exploration-emphasis vs. execution-emphasis — routed by the funnel

	Option 1 — execution-emphasis	Option 2 — exploration-emphasis
What it is	Invest in trajectory curation, more/better rejection-sampling SFT data, DAPO’s clip-higher + dynamic sampling as the GRPO baseline, GiGPO step-level credit	On-policy RL first, not SFT, for the segment the funnel flags as F1/F4-dominant; DIVER/tool-sequence diversity bonus, curiosity bonus (CDE), parameter-space-noise pilot, periodic reference-policy resets (ProRL) for genuine boundary expansion
Cost	Lower — DAPO is an established, widely-adopted recipe; trajectory curation reuses existing data	Higher — several of these techniques are RL-infra-dependent and Promising-not-validated (PSN-RLVR, DIVER, HiPER/hindsight credit assignment for a CTF-shaped domain)
Risk	If the funnel is actually F1-dominant, more SFT on the same recipe teaches guessing-and-hoping more confidently, not more competence (LIMO’s framing, arXiv:2502.03387)	If the funnel is actually F2/F3-dominant, exploration machinery is solving a problem that doesn’t exist here and burns the RL-infra budget on the wrong axis — the entropy-collapse mechanism these fixes target (arXiv:2505.22617) is real but doesn’t help a policy that’s exploring fine and just executing unreliably
Information value	This is literally what the funnel is for. Not a taste choice — a routed decision	Same
What must be TRUE before routing	Funnel result from fork (a); the scale-check (arXiv:2603.21972); the Pass@(k,T) crossover-direction test on the specific segment (fork b) — does trained pull away from base at large k (execution), or does matched-data SFT regress it while RL expands it (exploration)?	Same gate, opposite branch

Verdict pressure: this fork cannot be decided from priors or literature alone. By design, it is the output of forks (a) and (b), not an independent choice. If you find yourself picking an emphasis before the funnel exists, you are guessing, and the guess is unsafe on the project’s own terms: §1 frames the execution read as “likely” but unproven and expects a split verdict across challenge subtypes, so a single portfolio-wide emphasis is likely wrong for at least one segment.
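The crossover-direction test that routes this fork reduces to comparing two pass@k curves per segment. The sketch below only labels the direction of the gap between a base and a trained checkpoint; the tolerance, the label strings, and the sample numbers are hypothetical, and mapping a label to "execution" or "exploration" still follows the gate row above (including the SFT-vs-RL comparison, which needs a second call with the RL checkpoint).

```python
def crossover_direction(base: dict[int, float], trained: dict[int, float],
                        tol: float = 0.02) -> str:
    """Label how the trained-minus-base pass@k gap behaves from small to large k."""
    ks = sorted(set(base) & set(trained))
    if len(ks) < 2:
        raise ValueError("need at least two shared k values")
    gap_small = trained[ks[0]] - base[ks[0]]
    gap_large = trained[ks[-1]] - base[ks[-1]]
    if gap_large > tol:
        return "trained_ahead_at_large_k"    # trained pulls away from base
    if gap_large < -tol:
        return "trained_behind_at_large_k"   # training shrank the reachable set
    if gap_small > tol:
        return "base_catches_up"             # gain was elicitation only
    return "inconclusive"


# Hypothetical per-segment curves: {k: pass@k}.
base = {1: 0.05, 64: 0.40}
elicited = {1: 0.20, 64: 0.41}
improved = {1: 0.20, 64: 0.60}
regressed = {1: 0.20, 64: 0.30}
```

Run per segment, never on the pooled portfolio: a pooled curve can read "inconclusive" while one segment pulls away and another regresses.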

3. Dependency order — a DAG, not a timeline

This is deliberately not a schedule. It shows what unlocks what — several branches can run in parallel, and nothing downstream of “instrument” is safe to start before its own inputs exist.

flowchart TD
  subgraph INSTRUMENT["INSTRUMENT — near-free, read-only, do first"]
    I1["stagescan.go + stage_oracle.json\nper-challenge, F1-F4 tagger"]
    I2["flag_verified true byte-compare\n(replace retrieved-provenance proxy)"]
    I3["entropy logging wired,\nready from RL step 0"]
    I4["elicitation-ladder harness:\ntool-usage histogram across 40 sectools"]
  end

  subgraph MEASURE["MEASURE — diagnostic, no training changes"]
    M1["Segment 1000 challenges:\nsingle-shot vs sequentially-gated"]
    M2["Pass@(k,T): base vs current checkpoint,\nper segment (arXiv:2604.14877)"]
    M3["Cover@tau per challenge\n(arXiv:2510.08325) — guessing vs reliable"]
    M4["Base-model pass@k control\n(arXiv:2504.13837)"]
    M5["Scale-check: weak/capacity-limited\nvs already-capable-unreliable\n(arXiv:2603.21972)"]
    M6["Elicitation ladder run:\nfew-shot -> SFT-demo -> RL"]
  end

  subgraph ROUTE["ROUTE — the fork decisions (Section 2)"]
    RA["Fork a: ALREADY DECIDED\n(instrument first)"]
    RB["Fork b: SFT-now vs measure-first\nper segment"]
    RC["Fork c: monolithic vs\nmilestone-shaped GRPO"]
    RD["Fork d: elicitation vs\ntraining-time tool fix"]
    RE["Fork e: exploration- vs\nexecution-emphasis"]
  end

  subgraph TRAIN["TRAIN — the actual runs"]
    T1["Rejection-sampling SFT\non single-shot segment, curated\n(STaR / ReST-EM pattern)"]
    T2["GRPO + DAPO baseline\n(clip-higher, dynamic sampling)"]
    T3["+ GiGPO step-level credit\n(zero extra rollouts)"]
    T4["+ potential-based milestone\nshaping (gated, fork c only)"]
    T5["ToolRL / Tool-Star forced\nexposure (gated, fork d only)"]
    T2b["Exploration-emphasis RL:\nDIVER / CDE curiosity bonus /\nPSN-RLVR / ProRL resets\n(gated, fork e exploration branch)"]
  end

  subgraph GRADUATE["GRADUATE — the go/no-go gates"]
    G1["Entropy collapsed\nAND pass@64 non-trivial\n(arXiv:2510.01624)"]
    G2["Semantics-preserving-transform\nrobustness check survives\n(arXiv:2502.07445 / 2503.02296)"]
    G3["Base-pass@k control still\ntrails trained pass@k\n(gain is real, not elicitation)"]
    G4["pass@large-k did NOT shrink\npost-RL (RL-PLUS check,\narXiv:2508.00222)"]
  end

  I1 --> M1
  I2 --> M4
  I3 --> G1
  I4 --> M6

  M1 --> M2
  M2 --> M3
  M2 --> M4
  M2 --> M5

  M1 --> RB
  M2 --> RB
  M3 --> RB
  M5 --> RC
  M1 --> RC
  M6 --> RD
  RB --> RE
  M5 --> RE

  RB --> T1
  RC -->|"F1-dominant, weak policy"| T4
  RC -->|"F2/F3-dominant"| T2
  RD -->|"elicitation recovers it"| T1
  RD -->|"neither recovers it"| T5
  RE -->|"execution-emphasis"| T2
  RE -->|"exploration-emphasis"| T2b
  T1 --> T2
  T2 --> T3
  T3 --> T4
  T3 --> T5

  T2 --> G1
  T4 --> G1
  T5 --> G1
  T2b --> G1
  G1 --> G2
  G2 --> G3
  G3 --> G4
  G4 -->|"holds"| Ship["Credit the gain.\nGeneralize to the next segment /\nchallenge subset"]
  G4 -->|"fails"| Back["Back to MEASURE —\nre-run funnel, re-check scale,\ndo not re-train blind"]

  classDef inst fill:#132b22,stroke:#34d399,color:#eafaf3;
  classDef meas fill:#0f2a3d,stroke:#38bdf8,color:#e6f6ff;
  classDef route fill:#3a2e14,stroke:#f5b942,color:#fff6e0;
  classDef train fill:#2a1438,stroke:#c084fc,color:#f3e8ff;
  classDef grad fill:#3a1414,stroke:#f87171,color:#fde8e8;
  class I1,I2,I3,I4 inst;
  class M1,M2,M3,M4,M5,M6 meas;
  class RA,RB,RC,RD,RE route;
  class T1,T2,T3,T4,T5,T2b train;
  class G1,G2,G3,G4 grad;

Read this as: nothing in TRAIN is safe to start before its ROUTE gate fires, and nothing in ROUTE is safe to decide before its MEASURE inputs exist. INSTRUMENT is the only stage with no prerequisites — which is why fork (a) has no real counter-argument.
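The "what unlocks what" reading can be checked mechanically. The sketch below transcribes the diagram's edges (omitting the `Ship`/`Back` outcomes and the edge-less `RA` node) and uses the standard-library `graphlib` to confirm the graph is acyclic and to list which nodes are startable given a set of finished ones. Branch labels such as "F1-dominant" are ignored here: every edge is treated as a plain ordering constraint, which is a simplification of the routed branches.

```python
from graphlib import TopologicalSorter

EDGES = [
    ("I1", "M1"), ("I2", "M4"), ("I3", "G1"), ("I4", "M6"),
    ("M1", "M2"), ("M2", "M3"), ("M2", "M4"), ("M2", "M5"),
    ("M1", "RB"), ("M2", "RB"), ("M3", "RB"), ("M5", "RC"), ("M1", "RC"),
    ("M6", "RD"), ("RB", "RE"), ("M5", "RE"),
    ("RB", "T1"), ("RC", "T4"), ("RC", "T2"), ("RD", "T1"), ("RD", "T5"),
    ("RE", "T2"), ("RE", "T2b"), ("T1", "T2"), ("T2", "T3"),
    ("T3", "T4"), ("T3", "T5"),
    ("T2", "G1"), ("T4", "G1"), ("T5", "G1"), ("T2b", "G1"),
    ("G1", "G2"), ("G2", "G3"), ("G3", "G4"),
]

# Map each node to the set of nodes that must finish before it.
PREDS: dict[str, set[str]] = {}
for src, dst in EDGES:
    PREDS.setdefault(dst, set()).add(src)
    PREDS.setdefault(src, set())

# static_order() raises graphlib.CycleError if the diagram is not a DAG.
ORDER = list(TopologicalSorter(PREDS).static_order())


def unlocked(done: set[str]) -> set[str]:
    """Nodes not yet done whose prerequisites are all done."""
    return {n for n, preds in PREDS.items() if n not in done and preds <= done}
```

With nothing done, `unlocked(set())` returns exactly the four INSTRUMENT nodes, which is the diagram's claim that INSTRUMENT alone has no prerequisites.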

4. Open hypotheses to test

These are falsifiable, in the sense the diagnosis chapter insists on: each has a stated experiment and a stated result that would kill it. This is the “prove it to myself” frame, not a checklist to complete once — re-run per challenge segment as the corpus grows.

#	Hypothesis	Experiment	Falsified if
H1	It’s execution, not knowledge, on the non-sequentially-gated segment	Base-model pass@k at large k (64, 256) on currently-failing single-shot challenges: does the correct action ever appear?	The correct action never appears at any k on any checkpoint for a large fraction of these. That’s a knowledge gap for that subset, requiring off-policy injection (demonstration, teacher, or a tool), not more RL
H2	Milestone shaping helps, doesn’t hack	Introduce potential-based shaping (fork c, gated) on the F1-dominant segment only; track held-out `flag_verified` rate and pass@large-k before/after	Held-out flag rate drops, or pass@large-k shrinks post-introduction (capability-boundary collapse, arXiv:2508.00222) — either result means the shaping term is being farmed, revert to monolithic immediately
H3	The horizon is tractable for GRPO at the 30–60% baseline band	Run DAPO+GiGPO on challenges the funnel tags F2/F3-dominant, in the 30–60% pass-rate band; watch entropy from step 0	Entropy still collapses under DAPO’s own fixes, or stage-transition credit doesn’t concentrate on the exploitation phase specifically (GiGPO’s state-hash groups show flat credit) — means the horizon/credit-assignment problem is harder than the established recipe assumes for this task shape
H4	A real exploration gap exists, localized to the sequentially-gated segment	Replicate Zhai et al.’s crossover-direction test on this project’s own compositional-segment challenges: does matched-data SFT regress pass@(k,T) on this segment while GRPO expands it?	If SFT does not regress this segment (both SFT and RL improve it comparably), the sequential-gating framing doesn’t transfer to this task family, and the single-shot ordering (SFT then GRPO) is fine everywhere — fork (b)/(e)’s special-casing was unnecessary
H5	Tool-avoidance (curl-preference) is elicitation, not a missing-affordance problem	Run the elicitation ladder (fork d) on a sample of the 26 dead `sectools` entries on held-out challenges	Neither few-shot prompting nor light SFT-on-demos recovers usage — genuinely a missing-affordance problem, escalate to Tool-Star forced exposure + ToolRL decomposed reward
H6	Any claimed solve-rate gain reflects real execution-reliability improvement, not memorization/elicitation	Base-model pass@k-at-large-k control (H1’s instrument, reused) and semantics-preserving-transform variants of a held-out subset, checked against every claimed gain before crediting it	The gain evaporates on either check — the gain is elicitation (fine to attribute to SFT, a red flag if it persists after GRPO) or memorization of the fixed ~10 canonical PD26 shapes
H7	The SFT go/no-go gate (entropy collapse) is sufficient on its own	Check whether pass@64 on the rejection-sampling-SFT checkpoint is non-trivial at the same time entropy collapses, before green-lighting GRPO (arXiv:2510.01624)	Entropy has collapsed but pass@64 is flat/low — this predicts a disappointing GRPO run regardless of how good SFT accuracy looked; do not launch on entropy-collapse alone
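H2, H6, and H7 share their kill conditions with the GRADUATE gates in §3, so one check can serve both. The sketch below evaluates G1 through G4 in order and reports the first gate that fails. All field names and both thresholds are hypothetical placeholders; the source states the gates qualitatively and does not fix numeric cut-offs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateInputs:
    entropy_collapsed: bool
    pass_at_64: float
    original_pass_rate: float        # held-out subset, unmodified
    transform_pass_rate: float       # same subset, semantics-preserving variants
    base_pass_at_large_k: float
    trained_pass_at_large_k: float
    pre_rl_pass_at_large_k: float


def first_failed_gate(g: GateInputs, min_pass_at_64: float = 0.10,
                      max_transform_drop: float = 0.05) -> Optional[str]:
    """Return the first failing gate (G1..G4), or None if all four hold."""
    if not (g.entropy_collapsed and g.pass_at_64 >= min_pass_at_64):
        return "G1"   # H7: entropy collapse alone is not a go signal
    if g.original_pass_rate - g.transform_pass_rate > max_transform_drop:
        return "G2"   # H6: gain does not survive transformed variants
    if g.trained_pass_at_large_k <= g.base_pass_at_large_k:
        return "G3"   # H6: base catches up, so the gain is elicitation
    if g.trained_pass_at_large_k < g.pre_rl_pass_at_large_k:
        return "G4"   # H2: pass@large-k shrank after RL
    return None


healthy = GateInputs(True, 0.35, 0.40, 0.38, 0.45, 0.60, 0.55)
flat = GateInputs(True, 0.02, 0.40, 0.38, 0.45, 0.60, 0.55)
```

A non-`None` result corresponds to the diagram's "Back to MEASURE" branch, not to a retrain.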

5. What we deliberately are NOT basing this on

Standing project rule, restated for this chapter specifically: no fork, no hypothesis, no cost/risk estimate, and no number above rests on an academic cybersecurity-LLM training or benchmark paper — CTF-Dojo, Cyber-Zero, Pentest-R1, HackSynth/Random-Crypto, AutoPenBench, Cybench, NYU CTF Bench, EnIGMA, InterCode-CTF, DRLRM-PT, node-fragility reward shaping, the kill-chain-staged-reward paper, Nakano’s ATT&CK-tree scaffold, or Honarvar’s Evolve-CTF/Capture-the-Flags family-based evaluation — even where several of these report a finding that would superficially support one side of a fork here. None of that line of work has produced a frontier cybersecurity model, so none of it counts as frontier evidence for a load-bearing decision; every mention of them in the six source chapters this capstone draws from is explicitly labelled “academic, cited for context only, not a basis,” and this chapter inherits that discipline rather than re-importing their numbers under a different heading. Every claim above is re-grounded on one of: general frontier post-training disclosures (DeepSeek-R1, Kimi k1.5/K2, Llama 4, OpenAI Deep Research), general RL/agent theory (potential-based shaping, the reward-hacking convergence, reward-tampering-as-generalization (arXiv:2406.10162), DAgger, GAE, entropy-collapse mechanics), general (non-security) agent-eval and long-horizon literature (METR, AgentBoard, MAST, τ-bench, GSM-Symbolic/C-BOD), or this project’s own measured data and confirmed lessons (the SFT-induced FLAG{} confabulation, the Phoenix trace corpus, the existing pass@k methodology). Where a demoted academic-security idea is still worth pursuing on its own merits — e.g. staged/kill-chain-shaped reward in a cybersecurity-specific loop — the chapters this one draws from say so explicitly and flag it “worth pursuing — unvalidated outside academic-security work,” never as settled ground.

Cross-links

The path to a frontier cybersecurity model — the north star this whole Rung-1 fork-and-DAG plan is a prerequisite for, not a substitute for; explains why resolving every fork here still leaves Rungs 2–3 (environment scale, CPT/mid-training) unaddressed.
The recipe is a sequence, not a pick — the stage-order-and-compounding frame fork (b)’s “don’t SFT the whole portfolio blind” verdict is a specific instance of.
Proven post-training datasets — a usage-cited registry — the concrete dataset shopping list for the general-capability rungs underneath forks (b)/(d)’s SFT/preference data.
Cybersecurity is one of a family — what cracked the others — the cross-domain evidence (coding agents, competitive programming, theorem proving, web agents, games, robotics) several forks in §2 draw on directly for technique precedent and risk calibration.
Diagnosing the gap — a scientific framework — the routing test and the pass@k / Pass@(k,T) / Cover@τ protocol every MEASURE node in §3’s DAG instantiates.
From behavioral audit to training signal — the per-pattern gap→method→verification mapping fork (d) and fork (e) draw on directly.
One problem, or many? — decomposition vs. monolithic — the full verdict and honest-limits case behind fork (c).
Before you train — instrumentation & data readiness — the source of fork (a)’s cost estimate and the concrete first-step recipe for the F1–F4 tagger.
RL that creates value — long-horizon, exploration, reasoning, novelty — the technique menu (DAPO, GiGPO, DIVER, ProRL, NuRL, PSN-RLVR) fork (e)’s exploration-emphasis branch draws on.
The decision — the one-line version of the whole book’s routing question; this chapter is its decision-surface expansion, not a replacement.


Post-Training Field Notes