Where you are & the forks ahead

This is the capstone chapter, not a roadmap. Every other chapter in this book resolves a method question (SFT vs DPO vs GRPO, monolithic vs decomposed, which exploration fix). This chapter assembles those resolutions into the shape you actually need to draw your own plan: what’s true today, what you have to decide, in what order the decisions unlock each other, and what would have to be false to make you change course. It recommends nothing you haven’t already read elsewhere in this book — it routes you back to Diagnosing the gap, From behavioral audit to training signal, One problem, or many?, Before you train, RL that creates value, and The decision at every load-bearing point. Read this after those, not instead of them.

Where this fork-by-fork plan sits in the bigger picture: everything below is Rung-1-scoped. It resolves “execute reliably on the portfolio you have.” It is a prerequisite for, not a substitute for, The path to a frontier cybersecurity model, which argues that even a perfectly resolved DAG below doesn’t by itself cross into “frontier”: that takes orders-of-magnitude more RL-environment scale (Rung 2) and possibly a CPT/mid-training stage (Rung 3). The cross-domain evidence grounding that argument (six other long-horizon/sparse-reward/verifiable domains and what actually cracked each one) lives in Cybersecurity is one of a family — what cracked the others. Several forks below, especially (c) and (e), draw directly on techniques surveyed there (GiGPO, DAPO, WebRL’s failure-to-curriculum, potential-based shaping). Fork (b)’s SFT-vs-measure-first framing and its order-matters/compounding rationale are the Sequence-B-specific instance of the general argument in The recipe is a sequence, not a pick. The general-capability SFT/preference data that forks (b) and (d) would eventually pull from is catalogued in Proven post-training datasets — a usage-cited registry.


1. Where you are — the diagnosis on one screen

The number: ~100–200 solved / ~1000 challenges, at k=1. This is a portfolio statistic, not a per-challenge pass rate — a challenge that “solved once” could be a 5% fluke or a 55% near-certainty, and those two cases call for opposite next moves (The decision, “one prerequisite before any of this”). Nobody has yet run pass@k per challenge, let alone per pipeline stage.
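
To make the fluke-versus-near-certainty distinction concrete, here is a minimal sketch of a per-challenge pass@k estimate. It uses the standard unbiased combinatorial estimator, 1 − C(n−c, k)/C(n, k), for n sampled attempts with c successes. The function name `pass_at_k` and the attempt counts are illustrative assumptions, not project data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts with c successes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)


# Two hypothetical challenges that would both show up as "solved once" at k=1.
fluke_p1 = pass_at_k(n=20, c=1, k=1)    # about 0.05
solid_p1 = pass_at_k(n=20, c=11, k=1)   # about 0.55

# The same two challenges at k=5 separate even further.
fluke_p5 = pass_at_k(n=20, c=1, k=5)    # about 0.25
solid_p5 = pass_at_k(n=20, c=11, k=5)   # about 0.99
```

A single k=1 solve cannot distinguish these two challenges; only repeated sampling per challenge can.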

The pipeline is a chain, not a single action:

```mermaid
flowchart LR
  R["Recon /\nenumeration"] --> E["Endpoint\ndiscovery"]
  E --> V["Identify the\nvulnerable endpoint"]
  V --> X["Exploit it"]
  X --> P["Post-exploitation /\npivot"]
  P --> F(("Flag\n{0,1}\nground-truth verified"))

  classDef stage fill:#132b22,stroke:#34d399,color:#eafaf3;
  class R,E,V,X,P stage;
```

Only the last box (flag_verified) is checked today, and even that is presently a provenance proxy (flag_scan.retrieved — “the string came back from the sandbox, not the model’s mouth”) rather than a true byte-compare against benchmark/flags/pd26_flags.current.json — that exact-match wiring doesn’t exist yet outside a manual, SSH-gated step (instrumentation-and-data-readiness.md §3.1). So even the ground-truth anchor this whole book leans on is one small, well-scoped engineering task away from being fully automatic, not there yet.
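
As a rough illustration of what the missing exact-match step could look like, the sketch below compares a retrieved flag against an expected value byte-for-byte. The JSON schema (a flat `{challenge_id: flag}` mapping), the function names, and the whitespace stripping are all assumptions made for the example; the real layout of the flags file is not described here.

```python
import hmac
import json


def load_expected_flags(json_text: str) -> dict[str, bytes]:
    """Parse a {challenge_id: flag} mapping (hypothetical schema)."""
    return {cid: flag.encode("utf-8") for cid, flag in json.loads(json_text).items()}


def flag_matches(challenge_id: str, retrieved: bytes, expected: dict[str, bytes]) -> bool:
    """True only on an exact byte match against the expected flag."""
    want = expected.get(challenge_id)
    if want is None:
        return False  # unknown challenge: never count it as verified
    # Assumption: surrounding whitespace from sandbox output is not significant.
    return hmac.compare_digest(retrieved.strip(), want)


# Hypothetical flag value, for illustration only.
expected = load_expected_flags('{"PD26-02": "FLAG{example-value}"}')
exact = flag_matches("PD26-02", b"FLAG{example-value}\n", expected)
confabulated = flag_matches("PD26-02", b"FLAG{}", expected)
```

The point of the design is that a well-formed but wrong string such as `FLAG{}` fails the check, which a format-only matcher would not guarantee.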

The stage-localized failure taxonomy (F1–F4), mapped to canonical RL/agent-research framing:

| Tag | Failure | Canonical framing | Fix lever if confirmed dominant |
|---|---|---|---|
| F1 | Never finds the vulnerable endpoint | Exploration / coverage failure: no gradient until reward is first observed | On-policy RL with exploration preservation, not more demonstrations |
| F2 | Finds it, probes shallowly, can’t land the exploit | Execution / skill (performance-floor) failure | Trajectory curation, rejection-sampling SFT, DAPO/GiGPO as GRPO baseline |
| F3 | Clumsy tool use, wrong tool for the job | Policy / tool-selection failure (its own axis) | Elicitation ladder → ToolRL/Tool-Star if elicitation fails |
| F4 | No real pivot/chaining after a foothold | Long-horizon credit-assignment failure (variance, not coverage) | Step-level credit (GiGPO), curriculum (1-hop before 2-hop), never “try more” alone |

The diagnosis, stated as a hypothesis, not a fact: the project’s working read is “likely an execution gap” (F2/F3-flavored) rather than a knowledge gap or a pure exploration gap — capability is probably present and unreliable, not absent. This is the single most load-bearing framing decision in the whole plan, and it is currently unproven. The book’s own diagnosis framework is explicit about this: “the honest, defensible answer will not be a single sentence” — the true picture is almost certainly a split verdict, different F-tags dominating different challenge subtypes, not one gap type for the whole 1000 (Diagnosing the gap §0, §8).

What “proven” requires, concretely, and doesn’t exist yet:

  1. Per-challenge (not aggregate) pass@k, segmented by whether the winning path is single-shot or sequentially-gated (compositional — enumeration must land before exploitation is even visible).
  2. Pass@(k,T) — base model vs. current checkpoint, per segment — to tell a genuine execution gap (trained pulls away from base at large k) apart from a pure elicitation artifact (base catches up) apart from an exploration gap hiding underneath (matched-data SFT regresses the segment, RL expands it) (arXiv:2504.13837, arXiv:2604.14877).
  3. A working F1–F4 stage tagger over the existing Phoenix/events.jsonl corpus — the actual current gap, confirmed by direct source read: the harness’s process telemetry is already complete for this (tool_call/tool_result pairs joined on tool_call_id); nothing needs to change in go/libs/agent/events or secagent/runner.go. What’s missing is purely semantic — a deterministic post-hoc scan plus one stage_oracle.json per challenge, authored from that challenge’s own solve.py (Before you train §5).
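
The stage tagger in item 3 can be sketched as a deterministic post-hoc scan. The version below is a Python illustration only (the text names stagescan.go as the intended home). The event fields (`type`, `tool_call_id`, `output`), the four stage names, the regex-per-stage oracle format, and the stage-to-tag mapping are assumptions for the example. F3 is left out because the taxonomy treats tool selection as its own axis rather than a pipeline stage.

```python
import re

# Hypothetical stage names; one oracle predicate (a regex) per stage.
STAGES = ["endpoint_found", "vuln_identified", "exploit_landed", "pivoted"]


def join_calls(events):
    """Pair each tool_call with its tool_result on tool_call_id, in call order."""
    results = {e["tool_call_id"]: e for e in events if e["type"] == "tool_result"}
    return [
        (e, results[e["tool_call_id"]])
        for e in events
        if e["type"] == "tool_call" and e["tool_call_id"] in results
    ]


def stages_completed(events, oracle):
    """Advance through STAGES in order as tool results satisfy each predicate."""
    done = 0
    for _call, result in join_calls(events):
        while done < len(STAGES) and re.search(oracle[STAGES[done]], result["output"]):
            done += 1
    return done


def failure_tag(done):
    """Map the furthest completed stage to a failure tag (None means all stages passed)."""
    if done >= len(STAGES):
        return None
    return {0: "F1", 1: "F1", 2: "F2", 3: "F4"}[done]


# Hypothetical trace and oracle, for illustration only.
events = [
    {"type": "tool_call", "tool_call_id": "1", "command": "curl http://target/"},
    {"type": "tool_result", "tool_call_id": "1", "output": "200 OK /api/v1/users"},
    {"type": "tool_call", "tool_call_id": "2", "command": "curl 'http://target/api/v1/users?id=1%27'"},
    {"type": "tool_result", "tool_call_id": "2", "output": "SQL syntax error near ''"},
]
oracle = {
    "endpoint_found": r"/api/v1/users",
    "vuln_identified": r"SQL syntax error",
    "exploit_landed": r"admin_hash=",
    "pivoted": r"uid=0",
}
tag = failure_tag(stages_completed(events, oracle))
```

In this toy trace the agent finds and identifies the vulnerable endpoint but never lands the exploit, so the scan tags it F2.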

Bottom line for this section: treat “it’s an execution gap” as the leading hypothesis, not settled ground. Everything in §2–§4 below is written so that it stays true whichever way the funnel eventually comes down — several forks are explicitly gated on a measurement that hasn’t been taken yet.


2. The forks

Five decisions, each with exactly two options (per this book’s convention: no hybrids, no third path). Every “what must be TRUE” row is a gate, not a preference. A fork should not be decided until its gate is checked, because in at least two of these forks (b, e) the two options don’t just differ in cost; they require opposite SFT/RL orderings. One explicit, flagged exception: fork (b)’s verdict below lands on a hybrid rather than either labeled option outright. That exception is called out and justified there, not silently smuggled in past this convention.

(a) Instrument the F1–F4 stage tagger first?

| | Option 1: build it first | Option 2: skip it, decide off aggregate pass@k/flag_verified alone |
|---|---|---|
| What it is | stagescan.go (same shape as the existing flagscan.go) + one stage_oracle.json per challenge, authored from solve.py; prototype on PD26-02 first, then generalize | Proceed straight to the SFT/GRPO plan using only the terminal flag signal and portfolio-level solve rate |
| Cost (compute + eng) | Near-free: read-only post-hoc scan over events.jsonl you already have; no new sandbox instrumentation; editorial authoring of ~4 predicates/challenge (one person reads ~10 solve.py files) | Zero now, but every downstream decision (b–e) is made blind to which F-tag actually dominates |
| Risk (reward-hacking) | None: this is diagnostic-only, no reward function touched | Routing risk, not hacking risk: you may sink a training cycle into the wrong lever (e.g. rejection-sampling SFT when the real bottleneck is F1/exploration, or milestone shaping when it’s actually F2/F4) |
| Information value | Highest single move in this whole chapter. “This single aggregation is the input every downstream decision below depends on” (verbatim from Before you train §4) | Low: an aggregate number “averages over four structurally different failure modes,” exactly the collapsing-a-split-verdict anti-pattern the diagnosis framework names first |
| What must be TRUE first | Nothing: this is the recommended first step regardless of any other measurement | N/A |

Verdict pressure: there is no real argument for Option 2. This fork is here because it’s the fork everyone is tempted to skip under time pressure, not because the evidence is close.

(b) Rejection-sampling SFT on verified solves NOW vs. measure pass@k-per-stage FIRST

| | Option 1: SFT now | Option 2: segment + measure first |
|---|---|---|
| What it is | Run rejection-sampling SFT on the ~185 already-collected verified-solve trajectories across the 5 corpora (gym263, gym564, warpenv-broker, envgen, argus60-base) | Segment the 1000 challenges (single-shot vs. sequentially-gated), run Pass@(k,T) (base vs. current checkpoint) per segment, before deciding what to train on |
| Cost | Cheap: data already exists, this is already the project’s stated near-term plan | Medium: sampling compute at multiple k, cold-start pdq --fresh-retries, no training required |
| Risk (reward-hacking / generalization) | Concrete, not hypothetical. On the compositional/sequentially-gated segment, matched-data SFT actually regresses capability (net −4) while RL expands it (net +4) (Zhai et al., arXiv:2604.14877). Training the wrong subset with SFT doesn’t just waste compute, it can make that subset worse. Separately: the ~185-trajectory pool may be guessing-dominated (high pass@64, low Cover@τ; arXiv:2510.08325) or contain lucky-but-unsound paths (right flag, wrong/wasted reasoning; arXiv:2506.14245) | Low: diagnostic only, but real opportunity cost if it delays shipping a known-safe move (SFT is the project’s current plan, already literature-validated as a baseline; arXiv:2504.11343) |
| Information value | Low incremental: you already believe SFT-on-solves works generically; this doesn’t test where it works | High: this is “the single highest-value experimental design” in the diagnosis chapter, and it’s directly testable this week with no new training |
| What must be TRUE before committing to Option 1 wholesale | (i) the SFT pool isn’t guessing-dominated (Cover@τ check on its source challenges); (ii) trajectories are filtered on soundness (backtracking/wasted-turns/tool-validity), not just flag==1; (iii) fork (a)’s stage tagger doesn’t show these ~185 trajectories concentrated on the sequentially-gated segment | N/A |

Verdict pressure — this is this chapter’s one deliberate exception to the “no hybrids” convention stated in §2, flagged rather than smuggled in: don’t cancel the SFT plan — but don’t treat “SFT now” as a blanket recipe across the whole portfolio either. The correct read of these two options is closer to “do (2) as a segmentation gate on (1)”: SFT the single-shot segment now, hold the sequentially-gated segment for GRPO once entropy instrumentation is live. Why this fork earns the exception where the other four don’t: options 1 and 2 here aren’t mutually exclusive courses of action — one is a training decision, the other a measurement decision, and they resolve at different grain (portfolio-wide vs. per-segment). Once segmentation lands, “measure first” naturally gates “SFT now” rather than replacing it. Forks (a), (c), (d), (e) don’t have that structure — their two options are genuinely exclusive paths, which is why no-hybrids holds cleanly for them and only for them.
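
The segmentation gate described in this verdict can be sketched as a simple partition of the verified-solve pool. The trajectory fields (`challenge`, `flag`, `sound`) and the segment labels are assumptions for illustration. In particular, `sound` stands in for whatever soundness filter (backtracking, wasted turns, tool validity) is eventually built.

```python
def partition_sft_pool(trajectories, segment_of):
    """Split a trajectory pool into SFT-now, hold-for-GRPO, and dropped."""
    sft_now, hold_for_grpo, dropped = [], [], []
    for t in trajectories:
        if t["flag"] != 1 or not t["sound"]:
            dropped.append(t)        # not a verified, sound solve
        elif segment_of.get(t["challenge"]) == "single_shot":
            sft_now.append(t)        # safe to SFT on now
        else:
            hold_for_grpo.append(t)  # sequentially gated, or not yet segmented
    return sft_now, hold_for_grpo, dropped


# Hypothetical pool and segment labels.
segment_of = {"c1": "single_shot", "c2": "sequentially_gated"}
pool = [
    {"challenge": "c1", "flag": 1, "sound": True},
    {"challenge": "c1", "flag": 1, "sound": False},
    {"challenge": "c2", "flag": 1, "sound": True},
    {"challenge": "c3", "flag": 1, "sound": True},
]
sft_now, hold_for_grpo, dropped = partition_sft_pool(pool, segment_of)
```

Treating unsegmented challenges as "hold" rather than "SFT now" is a conservative choice in this sketch: a challenge is only trained on once its segment has actually been measured.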

(c) Monolithic GRPO vs. milestone-shaped GRPO

| | Option 1: monolithic | Option 2: milestone-shaped |
|---|---|---|
| What it is | Terminal flag reward only, unchanged, once GRPO/RLVR starts | Potential-based shaping F(s,a,s') = γΦ(s') − Φ(s) layered on top of (never instead of) the terminal reward, where Φ = a monotone running-max count of deterministically-verified stage completions (Ng/Harada/Russell, ICML 1999; policy-invariant by theorem) |
| Cost | None beyond baseline GRPO infra | Medium: stage_oracle.json authoring (reuses fork (a)’s work if already done), Φ must be a running max (not instantaneous), and defined identically across every termination path (stop_reason ∈ {stop, max_turns, error}) or the invariance proof breaks |
| Risk (reward-hacking) | Risk of leaving real gains on the table if the funnel is genuinely F1-dominated. MiRA’s 6.4%→43.0% WebArena-Lite result is the strongest existence-proof in this book that flag-only reward can leave a large gap, though that’s a web-navigation result, not CTF (arXiv:2603.19685) | This is where the thick, convergent reward-hacking literature lives: PURE/Stop Summation (arXiv:2504.15275), Reward Under Attack (arXiv:2603.06621), Gao et al. (arXiv:2410.15115), PRIME’s own admission (arXiv:2502.01456), MONA (arXiv:2501.13011). Every one of these converges on: the moment a stage check becomes anything softer than deterministic ground-truth, it gets farmed. This project’s own confirmed lesson (SFT-induced FLAG{} confabulation from a loose format-matcher) is the small-scale preview of the same failure mode. But that whole cluster is about gaming a soft/gameable proxy; none of it is reward tampering. Denison, MacDiarmid, Barez, Duvenaud, Kravec, Marks et al. (Anthropic, “Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models,” arXiv:2406.10162, 130+ citations, verified live) show a curriculum of easy, low-stakes specification-gaming generalizes zero-shot to models that directly rewrite their own reward function/checklists when they have the tool access to do so. That is categorically more severe than metric-farming, and making the verifier deterministic and ground-truth does not by itself stop it if the policy’s own tool surface can reach what the verifier reads. This is not hypothetical for this project: the agent already has real shell access to the sandboxed target, and today’s flag_verified check (flagscan.go, §1) is file/process-based, reachable by the same tool calls the agent uses to solve the challenge. The gap this menu doesn’t yet name: isolate the verifier’s read-path from the agent’s own tool-call surface, and adversarially probe for exactly this generalization before scaling RL. “Deterministic ground-truth reward” alone is the fix for gaming, not for tampering |
| Information value | N/A: this is the default, not an experiment | High if built correctly: this is the only mechanism in the whole menu that’s a theorem, not an empirical bet, provided the two subtleties are respected |
| What must be TRUE before building Option 2 | Default: no gate needed | (i) fork (a)’s funnel shows an F1 (exploration)-dominated bottleneck, not F2/F3; (ii) the scale-dependence check confirms the base policy is genuinely capacity-limited rather than already-capable-but-unreliable (arXiv:2603.21972 found staged reward helps weak models only, larger models converge fine on outcome-only reward); (iii) a deterministic oracle exists for the stage being shaped, explicitly excluding stage 3 (vuln identification), which VPR’s own authors flag as the “open, unstructured” regime their method doesn’t yet solve (arXiv:2605.10325) |

Verdict pressure: stay monolithic until the funnel says otherwise. If it does, build the narrow, theorem-backed version — never a learned/LLM-judge per-stage reward, under any circumstance. And regardless of which option wins: verifier-integrity hardening (flagscan.go’s read-path isolated from the agent’s own tool surface) is not optional once RL starts, because Denison et al.’s generalization result means a ground-truth verifier alone doesn’t rule out the agent attacking the verifier rather than the challenge.
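
The reason the shaping term is hard to farm is that it telescopes: summed with discounting over a trajectory of T transitions, the terms F = γΦ(s') − Φ(s) collapse to γ^T·Φ(s_T) − Φ(s_0), whatever happens in between. The sketch below checks that numerically for a running-max Φ. The stage counts and γ are made-up values, and the function names are hypothetical.

```python
def running_max_potential(stage_counts):
    """Phi per state: running max of verified-stage counts, so Phi never decreases."""
    phi, best = [], 0
    for count in stage_counts:
        best = max(best, count)
        phi.append(best)
    return phi


def shaping_terms(phi, gamma):
    """F_t = gamma * Phi(s_{t+1}) - Phi(s_t) for each transition."""
    return [gamma * phi[t + 1] - phi[t] for t in range(len(phi) - 1)]


def discounted_sum(terms, gamma):
    """Sum of gamma**t * F_t over the trajectory."""
    return sum(gamma ** t * f for t, f in enumerate(terms))


# Hypothetical instantaneous counts; note the dip at index 3.
counts = [0, 1, 1, 0, 2, 3]
gamma = 0.99
phi = running_max_potential(counts)            # [0, 1, 1, 1, 2, 3]
total = discounted_sum(shaping_terms(phi, gamma), gamma)
closed_form = gamma ** (len(phi) - 1) * phi[-1] - phi[0]
```

Because the total depends only on the first and last potentials, what Φ evaluates to at termination matters, which is one way to read the requirement that Φ be defined identically across every termination path.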

(d) Tool-use fix for the curl-preference (SFT/DPO/KTO)

| | Option 1: elicitation ladder first | Option 2: jump straight to a training-time fix |
|---|---|---|
| What it is | Escalate cheapest→most-expensive: few-shot prompt with 2–3 correct-usage examples → light SFT on a handful of demonstrated-usage trajectories → only then consider RL-level intervention | Go directly to DPO/KTO on tool-choice pairs, or ToolRL-style decomposed per-call reward, without testing whether elicitation alone recovers the behavior |
| Cost | Very cheap: the few-shot test is nearly free; SFT-demo step is cheap | Medium: true DPO needs k≥2 same-challenge same-model divergent pairs (only the PD26 k=5 canonical sweep qualifies today; the larger gym pools are k=1); KTO-native data (unpaired success/failed splits) is free and ready today |
| Risk (reward-hacking / wasted engineering) | Low, but a self-reinforcing trap exists regardless of which rung you’re on: a rejection-sampling corpus built from the current curl-biased policy will never contain a dead tool succeeding, because the policy never tried it. RL alone has ~zero probability mass to reinforce those tools without forced/hinted exposure first (Tool-Star, arXiv:2505.16410) | Building ToolRL/DPO machinery for what might be a pure elicitation gap. Greenblatt et al.’s password-locked-model finding says a few high-quality SFT demonstrations are often sufficient to fully elicit a locked capability (arXiv:2405.19550); over-engineering here is real opportunity cost, not just aesthetic |
| Information value | High and cheap: turns “the model prefers curl” from anecdote into a falsifiable, staged experiment (framework.md §5) | Lower until the ladder has been run: you don’t yet know which rung actually recovers the behavior |
| What must be TRUE before escalating past few-shot | (i) few-shot prompting fails to recover tool usage on held-out challenges; (ii) SFT on a small demonstrated-usage set also fails to recover it → only then is it a genuine missing-affordance problem calling for Tool-Star-style forced exposure + ToolRL-style decomposed reward (arXiv:2504.13958) | N/A |

Verdict pressure: run the ladder. Don’t skip to DPO/ToolRL on a hunch — the cheapest rungs have direct, citable precedent for “this alone is often sufficient,” and skipping them risks building infrastructure for a gap that a two-line prompt change would have closed.
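
The data-availability asymmetry between DPO and KTO noted in the cost row can be sketched directly. DPO pairs exist only where the same model has both a success and a failure on the same challenge, while KTO can label every rollout on its own. The rollout fields and the output record shapes below are assumptions for illustration and are not tied to any particular training library.

```python
from collections import defaultdict
from itertools import product


def build_preference_data(rollouts):
    """Return (dpo_pairs, kto_examples) from a flat list of rollouts."""
    groups = defaultdict(lambda: {"win": [], "loss": []})
    for r in rollouts:
        key = (r["challenge"], r["model"])
        groups[key]["win" if r["solved"] else "loss"].append(r["trajectory"])

    # DPO: needs a divergent (solved, failed) pair within the same group.
    dpo_pairs = [
        {"challenge": challenge, "chosen": win, "rejected": loss}
        for (challenge, _model), g in groups.items()
        for win, loss in product(g["win"], g["loss"])
    ]
    # KTO: every rollout is usable as an unpaired labeled example.
    kto_examples = [
        {"completion": r["trajectory"], "label": r["solved"]} for r in rollouts
    ]
    return dpo_pairs, kto_examples


# Hypothetical rollouts: one challenge sampled three times, one sampled once.
rollouts = [
    {"challenge": "PD26-02", "model": "ckpt-a", "solved": True, "trajectory": "t1"},
    {"challenge": "PD26-02", "model": "ckpt-a", "solved": True, "trajectory": "t2"},
    {"challenge": "PD26-02", "model": "ckpt-a", "solved": False, "trajectory": "t3"},
    {"challenge": "gym-001", "model": "ckpt-a", "solved": True, "trajectory": "t4"},
]
dpo_pairs, kto_examples = build_preference_data(rollouts)
```

The k=1 challenge contributes nothing to the DPO set but still yields a KTO example, which mirrors why the k=1 gym pools are KTO-ready and not DPO-ready.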

(e) Exploration-emphasis vs. execution-emphasis — routed by the funnel

| | Option 1: execution-emphasis | Option 2: exploration-emphasis |
|---|---|---|
| What it is | Invest in trajectory curation, more/better rejection-sampling SFT data, DAPO’s clip-higher + dynamic sampling as the GRPO baseline, GiGPO step-level credit | On-policy RL first, not SFT, for the segment the funnel flags as F1/F4-dominant; DIVER/tool-sequence diversity bonus, curiosity bonus (CDE), parameter-space-noise pilot, periodic reference-policy resets (ProRL) for genuine boundary expansion |
| Cost | Lower: DAPO is an established, widely-adopted recipe; trajectory curation reuses existing data | Higher: several of these techniques are RL-infra-dependent and Promising-not-validated (PSN-RLVR, DIVER, HiPER/hindsight credit assignment for a CTF-shaped domain) |
| Risk | If the funnel is actually F1-dominant, more SFT on the same recipe teaches guessing-and-hoping more confidently, not more competence (LIMO’s framing, arXiv:2502.03387) | If the funnel is actually F2/F3-dominant, exploration machinery is solving a problem that doesn’t exist here and burns the RL-infra budget on the wrong axis. The entropy-collapse mechanism these fixes target (arXiv:2505.22617) is real but doesn’t help a policy that’s exploring fine and just executing unreliably |
| Information value | This is literally what the funnel is for. Not a taste choice but a routed decision | Same |
| What must be TRUE before routing | Funnel result from fork (a); the scale-check (arXiv:2603.21972); the Pass@(k,T) crossover-direction test on the specific segment (fork b): does trained pull away from base at large k (execution), or does matched-data SFT regress it while RL expands it (exploration)? | Same gate, opposite branch |

Verdict pressure: this fork cannot be decided from priors or literature alone — by design, it is the output of forks (a) and (b), not an independent choice. If you find yourself picking an emphasis before the funnel exists, you are guessing, and the guess has better-than-even odds of being wrong given the project’s own “likely execution, unproven” framing in §1.
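
As a sketch of what "routed by the funnel" means mechanically, the function below takes the F-tags observed on one segment and returns an emphasis, or refuses to decide when there is no funnel data. It covers only the funnel input. The scale-check and the Pass@(k,T) crossover test named in the gate are deliberately not modeled, and the plain majority rule is an assumption, not a threshold taken from this book.

```python
from collections import Counter


def route_emphasis(tags):
    """Route one segment from its observed failure tags (F1..F4)."""
    if not tags:
        return "undecided"  # no funnel yet: picking now would be a guess
    counts = Counter(tags)
    exploration = counts["F1"] + counts["F4"]
    execution = counts["F2"] + counts["F3"]
    if exploration == execution:
        return "undecided"
    return "exploration" if exploration > execution else "execution"


# Hypothetical tag lists for two segments.
single_shot = route_emphasis(["F2", "F2", "F3", "F1"])
gated = route_emphasis(["F1", "F4", "F4", "F2"])
no_funnel = route_emphasis([])
```

Returning "undecided" for an empty funnel is the important branch: it encodes the rule that an emphasis is an output of measurement, not an independent choice.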


3. Dependency order — a DAG, not a timeline

This is deliberately not a schedule. It shows what unlocks what — several branches can run in parallel, and nothing downstream of “instrument” is safe to start before its own inputs exist.

```mermaid
flowchart TD
  subgraph INSTRUMENT["INSTRUMENT — near-free, read-only, do first"]
    I1["stagescan.go + stage_oracle.json\nper-challenge, F1-F4 tagger"]
    I2["flag_verified true byte-compare\n(replace retrieved-provenance proxy)"]
    I3["entropy logging wired,\nready from RL step 0"]
    I4["elicitation-ladder harness:\ntool-usage histogram across 40 sectools"]
  end

  subgraph MEASURE["MEASURE — diagnostic, no training changes"]
    M1["Segment 1000 challenges:\nsingle-shot vs sequentially-gated"]
    M2["Pass@(k,T): base vs current checkpoint,\nper segment (arXiv:2604.14877)"]
    M3["Cover@tau per challenge\n(arXiv:2510.08325) — guessing vs reliable"]
    M4["Base-model pass@k control\n(arXiv:2504.13837)"]
    M5["Scale-check: weak/capacity-limited\nvs already-capable-unreliable\n(arXiv:2603.21972)"]
    M6["Elicitation ladder run:\nfew-shot -> SFT-demo -> RL"]
  end

  subgraph ROUTE["ROUTE — the fork decisions (Section 2)"]
    RA["Fork a: ALREADY DECIDED\n(instrument first)"]
    RB["Fork b: SFT-now vs measure-first\nper segment"]
    RC["Fork c: monolithic vs\nmilestone-shaped GRPO"]
    RD["Fork d: elicitation vs\ntraining-time tool fix"]
    RE["Fork e: exploration- vs\nexecution-emphasis"]
  end

  subgraph TRAIN["TRAIN — the actual runs"]
    T1["Rejection-sampling SFT\non single-shot segment, curated\n(STaR / ReST-EM pattern)"]
    T2["GRPO + DAPO baseline\n(clip-higher, dynamic sampling)"]
    T3["+ GiGPO step-level credit\n(zero extra rollouts)"]
    T4["+ potential-based milestone\nshaping (gated, fork c only)"]
    T5["ToolRL / Tool-Star forced\nexposure (gated, fork d only)"]
    T2b["Exploration-emphasis RL:\nDIVER / CDE curiosity bonus /\nPSN-RLVR / ProRL resets\n(gated, fork e exploration branch)"]
  end

  subgraph GRADUATE["GRADUATE — the go/no-go gates"]
    G1["Entropy collapsed\nAND pass@64 non-trivial\n(arXiv:2510.01624)"]
    G2["Semantics-preserving-transform\nrobustness check survives\n(arXiv:2502.07445 / 2503.02296)"]
    G3["Base-pass@k control still\ntrails trained pass@k\n(gain is real, not elicitation)"]
    G4["pass@large-k did NOT shrink\npost-RL (RL-PLUS check,\narXiv:2508.00222)"]
  end

  I1 --> M1
  I2 --> M4
  I3 --> G1
  I4 --> M6

  M1 --> M2
  M2 --> M3
  M2 --> M4
  M2 --> M5

  M1 --> RB
  M2 --> RB
  M3 --> RB
  M5 --> RC
  M1 --> RC
  M6 --> RD
  RB --> RE
  M5 --> RE

  RB --> T1
  RC -->|"F1-dominant, weak policy"| T4
  RC -->|"F2/F3-dominant"| T2
  RD -->|"elicitation recovers it"| T1
  RD -->|"neither recovers it"| T5
  RE -->|"execution-emphasis"| T2
  RE -->|"exploration-emphasis"| T2b
  T1 --> T2
  T2 --> T3
  T3 --> T4
  T3 --> T5

  T2 --> G1
  T4 --> G1
  T5 --> G1
  T2b --> G1
  G1 --> G2
  G2 --> G3
  G3 --> G4
  G4 -->|"holds"| Ship["Credit the gain.\nGeneralize to the next segment /\nchallenge subset"]
  G4 -->|"fails"| Back["Back to MEASURE —\nre-run funnel, re-check scale,\ndo not re-train blind"]

  classDef inst fill:#132b22,stroke:#34d399,color:#eafaf3;
  classDef meas fill:#0f2a3d,stroke:#38bdf8,color:#e6f6ff;
  classDef route fill:#3a2e14,stroke:#f5b942,color:#fff6e0;
  classDef train fill:#2a1438,stroke:#c084fc,color:#f3e8ff;
  classDef grad fill:#3a1414,stroke:#f87171,color:#fde8e8;
  class I1,I2,I3,I4 inst;
  class M1,M2,M3,M4,M5,M6 meas;
  class RA,RB,RC,RD,RE route;
  class T1,T2,T3,T4,T5,T2b train;
  class G1,G2,G3,G4 grad;
```

Read this as: nothing in TRAIN is safe to start before its ROUTE gate fires, and nothing in ROUTE is safe to decide before its MEASURE inputs exist. INSTRUMENT is the only stage with no prerequisites — which is why fork (a) has no real counter-argument.
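
The "what unlocks what" reading can be made mechanical with a small prerequisite map. The sketch below encodes only the unconditional INSTRUMENT → MEASURE → ROUTE edges from the diagram. The labeled, conditional edges into TRAIN and GRADUATE are left out, because treating them as "all inputs required" would misrepresent them. The helper name `unlocked` is hypothetical.

```python
from graphlib import TopologicalSorter

# node -> set of nodes that must be done first (subset of the diagram's edges)
PREREQS = {
    "I1": set(), "I2": set(), "I3": set(), "I4": set(),
    "M1": {"I1"}, "M2": {"M1"}, "M3": {"M2"}, "M4": {"I2", "M2"},
    "M5": {"M2"}, "M6": {"I4"},
    "RB": {"M1", "M2", "M3"}, "RC": {"M1", "M5"}, "RD": {"M6"},
    "RE": {"RB", "M5"},
}


def unlocked(done):
    """Nodes not yet done whose prerequisites are all done."""
    return sorted(n for n, pre in PREREQS.items() if n not in done and pre <= done)


# static_order() raises graphlib.CycleError if the map is not a DAG.
order = list(TopologicalSorter(PREREQS).static_order())

start = unlocked(set())
after_instrument = unlocked({"I1", "I2", "I3", "I4"})
after_first_measure = unlocked({"I1", "I2", "I3", "I4", "M1", "M6"})
```

With nothing done, only the four INSTRUMENT nodes are available; fork (d) becomes decidable as soon as the elicitation ladder has run, independently of the pass@k branch.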


4. Open hypotheses to test

These are falsifiable, in the sense the diagnosis chapter insists on: each has a stated experiment and a stated result that would kill it. This is the “prove it to myself” frame, not a checklist to complete once — re-run per challenge segment as the corpus grows.

| # | Hypothesis | Experiment | Falsified if |
|---|---|---|---|
| H1 | It’s execution, not knowledge, on the non-sequentially-gated segment | Base-model pass@k at large k (64, 256) on currently-failing single-shot challenges: does the correct action ever appear? | The correct action never appears at any k on any checkpoint for a large fraction of these. That’s a knowledge gap for that subset, requiring off-policy injection (demonstration, teacher, or a tool), not more RL |
| H2 | Milestone shaping helps, doesn’t hack | Introduce potential-based shaping (fork c, gated) on the F1-dominant segment only; track held-out flag_verified rate and pass@large-k before/after | Held-out flag rate drops, or pass@large-k shrinks post-introduction (capability-boundary collapse, arXiv:2508.00222). Either result means the shaping term is being farmed; revert to monolithic immediately |
| H3 | The horizon is tractable for GRPO at the 30–60% baseline band | Run DAPO+GiGPO on challenges the funnel tags F2/F3-dominant, in the 30–60% pass-rate band; watch entropy from step 0 | Entropy still collapses under DAPO’s own fixes, or stage-transition credit doesn’t concentrate on the exploitation phase specifically (GiGPO’s state-hash groups show flat credit). That means the horizon/credit-assignment problem is harder than the established recipe assumes for this task shape |
| H4 | A real exploration gap exists, localized to the sequentially-gated segment | Replicate Zhai et al.’s crossover-direction test on this project’s own compositional-segment challenges: does matched-data SFT regress pass@(k,T) on this segment while GRPO expands it? | If SFT does not regress this segment (both SFT and RL improve it comparably), the sequential-gating framing doesn’t transfer to this task family, and the single-shot ordering (SFT then GRPO) is fine everywhere; fork (b)/(e)’s special-casing was unnecessary |
| H5 | Tool-avoidance (curl-preference) is elicitation, not a missing-affordance problem | Run the elicitation ladder (fork d) on a sample of the 26 dead sectools entries on held-out challenges | Neither few-shot prompting nor light SFT-on-demos recovers usage. That is genuinely a missing-affordance problem; escalate to Tool-Star forced exposure + ToolRL decomposed reward |
| H6 | Any claimed solve-rate gain reflects real execution-reliability improvement, not memorization/elicitation | Base-model pass@k-at-large-k control (H1’s instrument, reused) and semantics-preserving-transform variants of a held-out subset, checked against every claimed gain before crediting it | The gain evaporates on either check: the gain is elicitation (fine to attribute to SFT, a red flag if it persists after GRPO) or memorization of the fixed ~10 canonical PD26 shapes |
| H7 | The SFT go/no-go gate (entropy collapse) is sufficient on its own | Check whether pass@64 on the rejection-sampling-SFT checkpoint is non-trivial at the same time entropy collapses, before green-lighting GRPO (arXiv:2510.01624) | Entropy has collapsed but pass@64 is flat/low. This predicts a disappointing GRPO run regardless of how good SFT accuracy looked; do not launch on entropy-collapse alone |
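
H3 and H7 both depend on an entropy signal, and H7 adds a second condition. A minimal sketch of both pieces follows: mean per-token Shannon entropy over next-token distributions, and a two-condition gate. The thresholds are caller-supplied placeholders; none of the numbers used here come from this book or its cited papers.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy(distributions):
    """Average token entropy across a batch of next-token distributions."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def ready_for_grpo(entropy, pass_at_64, entropy_ceiling, pass_floor):
    """H7-style gate: entropy has collapsed AND pass@64 is non-trivial."""
    return entropy <= entropy_ceiling and pass_at_64 >= pass_floor


# Hypothetical distributions: one near-deterministic, one uniform over two tokens.
batch = [[0.98, 0.01, 0.01], [0.5, 0.5]]
avg = mean_entropy(batch)

# Placeholder thresholds, for illustration only.
collapsed_but_flat = ready_for_grpo(0.1, 0.02, entropy_ceiling=0.3, pass_floor=0.2)
collapsed_and_alive = ready_for_grpo(0.1, 0.40, entropy_ceiling=0.3, pass_floor=0.2)
```

The first call is the failure case H7 warns about: entropy alone would say "go", and the pass@64 condition is what blocks it.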

5. What we deliberately are NOT basing this on

Standing project rule, restated for this chapter specifically: no fork, no hypothesis, no cost/risk estimate, and no number above rests on an academic cybersecurity-LLM training or benchmark paper — CTF-Dojo, Cyber-Zero, Pentest-R1, HackSynth/Random-Crypto, AutoPenBench, Cybench, NYU CTF Bench, EnIGMA, InterCode-CTF, DRLRM-PT, node-fragility reward shaping, the kill-chain-staged-reward paper, Nakano’s ATT&CK-tree scaffold, or Honarvar’s Evolve-CTF/Capture-the-Flags family-based evaluation — even where several of these report a finding that would superficially support one side of a fork here. None of that line of work has produced a frontier cybersecurity model, so none of it counts as frontier evidence for a load-bearing decision; every mention of them in the six source chapters this capstone draws from is explicitly labelled “academic, cited for context only, not a basis,” and this chapter inherits that discipline rather than re-importing their numbers under a different heading. Every claim above is re-grounded on one of: general frontier post-training disclosures (DeepSeek-R1, Kimi k1.5/K2, Llama 4, OpenAI Deep Research), general RL/agent theory (potential-based shaping, the reward-hacking convergence, reward-tampering-as-generalization (arXiv:2406.10162), DAgger, GAE, entropy-collapse mechanics), general (non-security) agent-eval and long-horizon literature (METR, AgentBoard, MAST, τ-bench, GSM-Symbolic/C-BOD), or this project’s own measured data and confirmed lessons (the SFT-induced FLAG{} confabulation, the Phoenix trace corpus, the existing pass@k methodology). Where a demoted academic-security idea is still worth pursuing on its own merits — e.g. staged/kill-chain-shaped reward in a cybersecurity-specific loop — the chapters this one draws from say so explicitly and flag it “worth pursuing — unvalidated outside academic-security work,” never as settled ground.