The path to a frontier cybersecurity model

Every other chapter in this book resolves a method question for this project’s current bottleneck (SFT vs GRPO, monolithic vs decomposed reward, which exploration fix). The chapter The recipe is a sequence, not a pick makes, one level down, the same point this chapter’s §2 skeleton depends on: the answer is never a single technique choice but a fixed, ordered sequence of stages that compound. This chapter zooms out to the north star those decisions serve: not “improve pass@k on our 1000-challenge portfolio” as an end in itself, but a frontier cybersecurity model, the offensive-security analog of what DeepSeek-Coder, Qwen-Coder, and DeepSeekMath are to code and math. It asks the harder question every other chapter brackets: even if the routing tree in The decision is answered correctly and the forks in roadmap-inputs.md are all resolved well, does that alone produce a frontier model? The honest answer, argued below, is no. It produces a materially better agent on this project’s own portfolio, which is necessary but not sufficient. This chapter is the capstone that says what else the word “frontier” actually requires, stage by stage, and where this project already stands on that ladder.

1. The frame — what “frontier” means here, and why academic-security is the wrong template

“Frontier” is not a vibe or a marketing word here — across every domain-specialization lineage examined below (code, math, medical), it tracked the same three things simultaneously, never just one: (1) a strong general, already-agentic starting checkpoint, not a small model fine-tuned harder; (2) a domain RL stage with an ungameable, automatically-computable verifier at scale; (3) infrastructure investment matched to the RL stage’s actual bottleneck — which turns out to be environment count and diversity, not bigger GPUs for pretraining. A model that is merely “instruction-tuned on curated domain data” (Med-PaLM v1’s prompt-tuning-only recipe, arXiv:2212.13138) is a domain assistant, not a frontier domain model — the field’s own vocabulary distinguishes these, and this project should too.

This is why the standing project stance treats academic cybersecurity-LLM papers (CTF-Dojo, Cyber-Zero, Pentest-R1, HackSynth, AutoPenBench, DRLRM-PT, and siblings) as mention-only, never load-bearing: none of them has produced a model that clears the bar above. They are useful as landscape context, occasionally as a source of a technique name worth knowing, but citing them as the basis for a claim about what a frontier cyber model requires would be citing evidence that has never once been tested against the thing it claims to predict. Every load-bearing claim in this chapter instead rests on (a) frontier-lab flagship and domain-specialization disclosures — code, math, medical, and general-agentic recipes, 2023–2026; (b) general RL/ML theory (scaling laws, potential-based reward shaping, reward-hacking mechanics); (c) this project’s own measured data and confirmed lessons. Where a genuinely production, externally-verified cyber-specialized system exists outside the academic set (XBOW, Bugcrowd) it is admissible under clause (a) as a domain-specialization precedent, exactly like Code Llama or Med-PaLM — and is flagged as such, never folded in with the excluded academic set.

2. The transferable frontier domain-specialization recipe

Cross-referencing the code lineages (DeepSeek-Coder → DeepSeek-Coder-V2, Qwen2.5-Coder → Qwen3-Coder-Next), the math lineage (DeepSeekMath → Qwen2.5-Math → DeepSeek-Prover-V2), and the medical contrast vertical (Med-PaLM → Med-PaLM 2 → MedGemma), one skeleton recurs: emphasis varies between lineages, but the skeleton is never fully absent. A fifth stage, mid-training, is a distinct and more recently named bridge stage. The code/math lineages ran it informally; only OLMo 2 and a 2025 controlled study give it a name and a mechanism (arXiv:2501.00656, arXiv:2510.14865).

flowchart TD
  P["Pretrain\n(inherited — a strong general/\nagentic open-weight checkpoint,\nnot trained here)"] --> S0

  subgraph S0["Stage 0 — Domain continued pretraining (CPT)"]
    direction TB
    S0A["100B-5.5T in-domain tokens,\nfrom a STRONG checkpoint, never from scratch\nCode Llama ~500B code tokens\nDeepSeekMath 120B math tokens\nQwen2.5-Coder 5.5T code tokens"]
  end

  S0 --> S05

  subgraph S05["Stage 0.5 — Mid-training (the named bridge)"]
    direction TB
    S05A["5-10% of pretrain FLOPs, curriculum-shaped,\nupsampled high-quality + synthetic patches\n'infuse knowledge, patch deficiencies'\n(OLMo 2); reduces catastrophic forgetting\nbefore SFT (2510.14865)"]
  end

  S05 --> S1

  subgraph S1["Stage 1 — Domain SFT / data synthesis,\nincreasingly SELF-BOOTSTRAPPED"]
    direction TB
    S1A["Rejection-sampling + iterative co-evolution:\nQwen3-Coder used Qwen2.5-Coder to clean its\nown next-gen data; Qwen2.5-Math co-evolved\nRM+SFT across rounds; DeepSeek-Prover-V2\nstitched subgoal-decomposed traces"]
  end

  S1 --> S2

  subgraph S2["Stage 2 — Domain RL, verifier-gated\n'hard to solve, easy to verify'"]
    direction TB
    S2A["GRPO / RLVR, no critic, group-mean baseline\n(origin: DeepSeekMath); scaling axis that\nmattered most = PARALLEL RL ENVIRONMENTS\n(Qwen3-Coder: 20,000), not model size"]
  end

  S2 -.->|"cross-cutting, every stage"| S4["Data-pipeline + scale engineering\nas a first-class investment\n(Qwen2.5-Coder: curation > scale;\nStarCoder2/Stack-v2: quality substitutes\nfor parameter count)"]

  classDef stage fill:#132b22,stroke:#34d399,color:#eafaf3;
  classDef cross fill:#3a2e14,stroke:#f5b942,color:#fff6e0;
  class P,S0A,S05A,S1A,S2A stage;
  class S4 cross;

Reading the skeleton stage by stage, cited:

Stage 0: domain CPT. Every recent frontier vertical lineage continues pretraining from an existing strong checkpoint rather than starting from scratch. DeepSeek-Coder v1’s from-scratch 2T-token run (arXiv:2401.14196) is the sole exception in this set, and even DeepSeek abandoned that approach by v2 (arXiv:2406.11931). Cross-domain transfer is itself a load-bearing finding, not noise: DeepSeekMath deliberately starts from a code base (arXiv:2402.03300) for a math specialist, because precise multi-step symbolic reasoning transfers. That argues a cyber CPT stage, if built, should start from a model already strong at general coding/tool-use/agentic reasoning, not a generic chat model.
Stage 0.5 — mid-training. OLMo 2 names this explicitly as “Stage 2: Mid-training (5–10% of training FLOPs)… upsample the highest-quality web documents and curated non-web sources; employ synthetic data crafted to patch math capabilities” (arXiv:2501.00656). A 2025 controlled study formalizes the mechanism: mid-training outperforms continued-pretraining-alone at a matched specialized-token budget and mitigates catastrophic forgetting in the subsequent SFT stage, because it acts as a better initialization for post-training rather than just adding knowledge (arXiv:2510.14865, moderate-high confidence — recent, not yet heavily cited, but consistent with and explaining the OLMo/Llama-3/DBRX practitioner reports it’s built on).
Stage 1 — self-bootstrapped SFT/data synthesis. The frontier pattern has moved past “filter and train once”: Qwen3-Coder used the prior generation (Qwen2.5-Coder) to clean and rewrite its own next-generation pretraining data (blog disclosure, no standalone arXiv for the 480B flagship — the architecture is covered by arXiv:2505.09388; the agentic-RL successor, Qwen3-Coder-Next, is arXiv:2603.00729). Qwen2.5-Math (arXiv:2409.12122) co-evolves a reward model and SFT data across rounds before RL is even applied, then reuses the same RM at inference for best-of-N reranking. DeepSeek-Prover-V2 (arXiv:2504.21801) decomposes a hard problem into subgoals, solves each with a cheaper model, and stitches the resolved subgoals into a single cold-start trajectory — a direct precedent for treating a ~100-turn CTF episode’s implicit stages (recon → foothold → priv-esc → flag) as subgoal-decomposable SFT-construction material, even while the RL reward itself stays terminal-only for ungameability. For the general (non-cyber) chat/tool-use/reasoning/preference data that fills this same SFT rung before any cyber-specific data is layered on, see Proven post-training datasets — a usage-cited registry.
Stage 2 — verifier-gated RL. This is where GRPO was born: DeepSeekMath’s own framing attributes math capability to two factors — a web-data mining pipeline, and Group Relative Policy Optimization, a critic-free PPO variant using the sampled group’s mean reward as the baseline (arXiv:2402.03300). Qwen3-Coder’s post-training explicitly names the reward-design principle “hard to solve, easy to verify,” and its own headline scaling axis wasn’t a bigger model, it was 20,000 parallel RL environments for the long-horizon agentic RL stage. DeepSeek-Prover-V2 runs the same pattern with Lean’s type-checker as a binary, ungameable reward — structurally identical in spirit to a terminal flag verifier.
Cross-cutting — data-pipeline engineering as its own investment. Qwen2.5-Coder’s whole story is “meticulous data cleaning, scalable synthetic data generation, balanced data mixing” beating larger models on the same benchmarks purely on data quality/composition. StarCoder2/The Stack v2 (arXiv:2402.19173, included as a data-pipeline lesson only — flagged explicitly as not a frontier-capability reference point) independently confirms curation-quality substituting for parameter count from a second source.
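The critic-free baseline described under Stage 2 can be sketched in a few lines. This is a minimal illustration, not any lab's training code: the function name and the `eps` stabilizer are assumptions introduced here, and dividing by the group's standard deviation follows the outcome-reward form given in the DeepSeekMath paper. Rewards are binary terminal outcomes (1.0 when the flag verifies).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout against the mean reward of its own sampled group.

    No learned critic is involved: the group mean is the baseline.
    `eps` is an assumed stabilizer that avoids division by zero.
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]


# Five rollouts of one challenge, binary terminal reward.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.0])
all_fail = group_relative_advantages([0.0] * 5)
all_pass = group_relative_advantages([1.0] * 5)

print(mixed)     # successes get positive advantages, failures negative
print(all_fail)  # every advantage is zero
print(all_pass)  # every advantage is zero
```

A group in which every rollout fails, or every rollout succeeds, yields all-zero advantages and therefore no policy-gradient signal. That is the mechanical reason RL candidates are best drawn from challenges the current policy solves some, but not all, of the time.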

The medical contrast (mentioned, not a basis for cyber claims): Med-PaLM v1’s prompt-tuning-only recipe (arXiv:2212.13138) shows the cheap-adaptation-of-a-frozen-giant path is not sufficient on its own — the paper’s own human-eval gap (factuality, harm) motivated Med-PaLM 2 (arXiv:2305.09617, domain instruction fine-tuning + ensemble refinement) and MedGemma (arXiv:2507.05201, domain vision-language pretraining + task-specific fine-tuning, explicitly disclosed as not clinical-grade without further fine-tuning). The useful lesson by contrast: for a binary, adversarial correctness domain like offensive security (a wrong action doesn’t mislead a reader, it fails the exploit), a v1-style prompt-tuned ceiling is lower than in code/math — supporting this project’s existing bias toward continued adaptation + RLVR over prompting alone.

3. The frontier ingredients, as requirements

Restating the seven-ingredient survey as a checklist of what “frontier” actually costs, independent of any one domain:

#	Ingredient	What frontier scale actually looks like	Confidence
1	Compute + scale law	Loss falls as a power law in model size × data × compute; the ratio matters (Chinchilla-optimal ≈ equal scaling of params and tokens, not param-dominant) — Kaplan, arXiv:2001.08361; Hoffmann/Chinchilla, arXiv:2203.15556	High — foundational, independently reproduced
2	Data scale + quality + curation	5–15T+ curated tokens is the pretraining norm (DeepSeek-V3, arXiv:2412.19437; Llama 3, arXiv:2407.21783) — or the phi-1 extreme: curation quality can substitute for ~100x less scale within a narrow domain (arXiv:2306.11644), though that ratio is an upper bound, not a universal rule	High for the scale rows; moderate on how far the “quality substitutes for scale” ratio generalizes
3	Domain CPT before post-training	120B–5.5T in-domain tokens continued-pretrained into an existing strong base, before any instruction-tuning/RL — not a thin adapter on the base chat model (Qwen2.5-Coder 5.5T+, DeepSeekMath 120B, Med-PaLM 2’s domain finetuning)	High — three independent labs, primary technical reports
4	Mid-training as a named bridge stage	A shorter (5–10% FLOPs), curriculum-shaped stage between broad pretraining and narrow post-training that patches domain deficiencies cheaply and reduces forgetting — OLMo 2, arXiv:2501.00656; arXiv:2510.14865	High (OLMo 2); moderate-high (mechanism study, new)
5	RL-environment scale, diversity, verifiability	The reasoning/agentic jump to o1/R1-class models is attributed to large-scale RL with verifiable, not learned, rewards, not more pretraining — DeepSeek-R1, arXiv:2501.12948; OpenAI o1 system card (arXiv mirror 2412.16720); environment diversity/scale is its own axis, distinct from reward correctness — Kimi K2’s “tens of thousands” synthesized-tool pipeline (arXiv:2507.20534); framed as an emerging bottleneck by arXiv:2511.09586	High (R1, o1, Kimi K2); moderate on the survey’s “emerging bottleneck” framing specifically (new, low-citation)
6	Full pipeline vs. thin adapter	LoRA measurably underperforms full fine-tuning specifically on code/math domain-skill acquisition — full fine-tuning learns perturbations at 10–100x the effective rank of typical LoRA configs — arXiv:2405.09673	High — controlled, ablated, >250 citations in 18 months, directly on-domain
7	Eval reflects reality, not a saturated benchmark	Classic benchmarks are contaminated enough to inflate scores by up to 22.9%/19.0% (GSM8K/MMLU) — arXiv:2406.13990; frontier practice responds with contamination-resistant-by-construction benchmarks: time-segmented LiveCodeBench, arXiv:2403.07974, “Google-proof” GPQA, arXiv:2311.12022; Tülu 3, arXiv:2411.15124 treats decontamination as a first-class deliverable, and is also the primary public naming of RLVR	High — all four primary, independently corroborating

Net read of §2+§3 together: compute/scale (ingredient 1) is inherited from the base model’s own pretraining — not this project’s job to re-derive. Ingredients 3–4 (CPT, mid-training) are the stages this project’s plan currently skips by design (handbook rule “knowledge in tools, not weights”), a defensible bet but an unvalidated one at the CPT/mid-training layer specifically. Ingredient 5 (RL-environment scale/diversity/verifiability) is where this project is closest to the frontier pattern already — see §5. Ingredients 6–7 (full pipeline vs. adapter, eval integrity) are concrete, checkable knobs, not open research questions.

4. Gap analysis — the frontier recipe vs. this project, stage by stage

One-line verdict: this project has correctly identified the shape of the frontier recipe — harness as a live RL environment, ground-truth verifiable reward, rejection-sampling SFT → GRPO/RLVR ordering, “knowledge in tools not weights” — and has already built the two hardest structural pieces: a working ~100-turn agentic harness and a genuine, non-gameable terminal verifier. What it has not built is scale on every other axis, and it has skipped a pretraining-adjacent stage entirely (domain CPT / mid-training) that every cited frontier domain-specialization precedent inserts. The gap is 1–5 orders of magnitude on breadth, not a missing insight on direction.

Frontier-recipe stage	Have	Partial	Missing
Stage 0 — Domain CPT	Nothing — explicit design choice (“knowledge in tools, not weights”), not oversight	The “knowledge in tools” architectural bet is coherent but unvalidated at this layer — the 87.7%-tool-bypass finding is at least as consistent with “raw vocabulary exposure is thin” as with “SFT/RL hasn’t reinforced the surface yet”	A CPT stage on curated offensive-security text (tool docs, CVE writeups, exploit-dev reasoning) — every cited frontier precedent (Code Llama ~500B code tokens, DeepSeekMath 120B math tokens) runs this before SFT/RL
Stage 0.5 — Mid-training	Nothing	—	The single biggest concrete gap relative to its likely payoff — cheapest missing stage of the whole skeleton (§2), and this project currently goes straight from a general base into RL with no bridge stage at all
Stage 1 — SFT / rejection-sampling	Fully-designed quality-filter recipe (replay-reproduce, loss-masking, dedup, decontamination); one SFT already shipped and flag-verified (pass@1 26%→35%, pass@5 4/10→5/10); ~1,200+ trajectories across 5 corpora, ~185 verified-success candidates; a load-bearing negative result confirming the reward-must-be-ground-truth rule on this project’s own data (35% flag fabrication rate under a loose acceptance filter)	The corrected filter (replay-reproduce, byte-exact flag verify) is designed but not confirmed re-run since the confabulation finding; ground truth exists for only 2 of 7 corpora	~2–3 orders of magnitude below frontier scale — Kimi K2’s SFT draws on 3,000+ real + 20,000+ synthesized MCP tools; no systematic teacher-distillation at scale; no difficulty-curriculum construction (candidate pool too small)
Stage 2 — RLVR	`flagscan.go` — a genuine rung-1, deterministic, ungameable terminal verifier, exactly the reward shape RLVR requires; RL-candidate-selection methodology matching GRPO’s actual zero-gradient mechanics (1–4-of-5 band); a theory-correct potential-based stage-shaping proposal (Ng-Harada-Russell) that doesn’t touch the ground-truth backbone	No GRPO/RLVR training loop implemented anywhere in the codebase (zero training binaries); `retrieved` heuristic not yet upgraded to byte-exact for 5 of 7 corpora; stage-shaping designed, zero code; the one OpenAI-RFT fallback option is winding down	Qwen3.7-Max’s decoupled Task/Harness/Verifier infra — the first-party-named fix for exactly this project’s own measured 87.7%-tool-bypass scaffold-overfitting finding; no entropy-collapse countermeasure (nothing to attach one to yet); no partial-rollout/pause-resume infra for long-tail ~100-turn episodes
Stage 3 — Scaled agentic RL environments	A genuinely working RL-environment shell (100-turn multi-turn loop, sandboxed, fully OTel-traced); 10 canonical, hardened, contamination-free, single-solution PD26 challenges live in production, genuine vuln-class breadth; several hundred additional informal challenges (warpenv/envgen/gym); the RL-envs-as-moat thesis, now corroborated by frontier evidence, not just three original data points	Total known population is a few hundred distinct targets — an order of magnitude below the north star’s own “~1000” framing; the single Hetzner eval box is sized for sequential/moderate-parallel eval, never for concurrent GRPO rollout load	Harness/verifier diversity as its own trained axis (one tool surface, one verifier per challenge — categorical, not a scale gap); procedural/generative environment scaling at frontier order-of-magnitude (`envgen`’s 84 challenges is ~22x smaller than DeepSeek-V3.2’s 1,827-environment pipeline); training-time rollout compute at K8s scale (categorical — no training-scale infra exists, only eval-scale)
Stage 4 — Eval integrity	A locked, rigorous pass@k methodology (unbiased estimator, k=3/5/10 bands, independent cold starts); the terminal `flag_verified` contract (exact match, never a proxy); two modest QA datasets beyond flag-capture; a fully-designed, near-zero-risk eval-decomposition plan	Base-model pass@64–128 control not confirmed run against the current SFT checkpoint; no per-challenge `stage_oracle.json` manifest yet	No confirmed contamination/canary audit applied to the cyber CTF corpus specifically; no externally-comparable published benchmark result — no way for an outside reader to place this project’s solve rate against any external reference point
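The pass@k methodology in the Stage 4 row refers to an unbiased estimator. A common choice, assumed here rather than confirmed as this project's exact implementation, is the combinatorial estimator popularized by the HumanEval/Codex evaluation: from n independent attempts with c verified successes, estimate the probability that at least one of k sampled attempts succeeds. The function name and the example counts below are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k from n independent attempts with c successes.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random size-k subset of the n attempts contains no success.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical challenge: 10 cold-start attempts, 3 flag-verified solves.
for k in (3, 5, 10):
    print(k, round(pass_at_k(10, 3, k), 4))
```

Averaging this per-challenge estimate over the portfolio gives the headline number. Computing it from n larger than k, rather than running exactly k attempts, is what keeps the variance down.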

What this means concretely: this project is not missing an idea anywhere in the recipe. Every stage has a designed, theoretically-grounded, often partially-built answer, and two pieces (the harness-as-environment, the ground-truth verifier) are genuinely frontier-quality today. What’s missing everywhere except eval methodology and reward design is scale: more environments, more verified trajectories, more training-time compute. On top of that sits one categorical, non-scale gap (harness/verifier diversity), which is the single frontier-lab-named fix for a failure mode this project has already measured on its own data.

5. RL-environments + data-synthesis as the moat, at frontier scale

Frontier evidence on this axis is now broader than the project’s original three-source hypothesis (Anthropic’s reported spend, Bugcrowd’s market, this project’s own harness) — nine of ten labs surveyed for the frontier-recipes chapter independently corroborate it, and the general-agentic and cyber-specific evidence below sharpens the picture further.

Kimi K2 discloses “a large-scale agentic data synthesis pipeline and a joint reinforcement learning stage, where the model improves its capabilities through interactions with real and synthetic environments” (arXiv:2507.20534). This is the cleanest public confirmation that a frontier lab’s post-training lever is environment synthesis plus mixing real with synthetic, not base-model scale alone. Frontier cyber takeaway: PD26-01..10 is this project’s “real” side; anything procedurally generated (parameterized bug-class variants, mutated target configs) is the “synthetic” side. A program training only on the 10 canonical challenges is closer to “real-only, no synthesis,” which K2’s own design implies under-scales.
Kimi K2.5’s Agent Swarm (arXiv:2602.02276) — a self-directed parallel-agent orchestration framework decomposing tasks into concurrent heterogeneous sub-problems, 4.5x latency reduction — architecturally identical to what XBOW independently converged on for cybersecurity (below): narrow-scope parallel sub-agents, not one longer monolithic trajectory. Two unrelated programs landing on the same architecture is a signal worth taking seriously against this project’s current single-agent ~100-turn episode design.
OpenAI Deep Research — official disclosure that it “was trained on real-world tasks requiring browser and Python tool use, using the same reinforcement learning methods behind OpenAI o1,” and on-record team commentary that “end-to-end training beats manual orchestration” — a fixed recon→scan→exploit→flag graph “breaks” when the agent needs to adapt; letting the model learn strategy via RL over hard tasks outperforms hand-scripted phase logic. This directly reinforces the project’s own “light framing beats heavy scaffolding” rule, from the team that shipped the highest-profile agentic-RL product to date.
Anthropic: a reported >$1B RL-environment commitment (secondhand, via TechCrunch citing The Information; direction corroborated by surrounding market activity, dollar figure not on-record), plus a live, current, on-record disclosure that cyber-offensive capability is deliberately tier-gated across the Claude line rather than propagated uniformly with general capability. Frontier cyber takeaway: cyber capability is not a free byproduct of general-agentic scaling even at a frontier lab. It has to be deliberately trained in, which is exactly this project’s actual bet (open-weight base + this project’s own cyber RL environments).
Google DeepMind SIMA / SIMA 2 (arXiv:2404.10179, arXiv:2512.04797) — SIMA 2’s headline: “by leveraging Gemini to generate tasks and provide rewards, SIMA 2 can autonomously learn new skills from scratch in a new environment.” The clearest frontier precedent for self-generated challenges: applied to cybersecurity, a frontier-scale program would use a strong model to propose novel vulnerable-target variants and score exploit attempts, with the project’s existing terminal flag verifier as the ungameable ground-truth check that keeps a self-generated curriculum honest.
Market corroboration, general-purpose: Prime Intellect’s Environments Hub — 1,000+ unique environments from 250+ creators, 100,000+ downloads. Cyber-specific instance: Bugcrowd’s RL Environments (built on Mayhem Security tech) — “hundreds of thousands of training environments, each built from authentic open-source vulnerabilities with real source code and verifiable outcomes,” with Chief AI/Science Officer David Brumley’s own framing: “Most AI security training stops too early. Models learn to find bugs, but not to prove the bugs are real and exploitable… detection through exploitation, patching, and audit.” That last phrase is also a concrete, safely-implementable curriculum idea — a graded multi-stage reward, but only if built potential-based (F(s,a,s') = γΦ(s') − Φ(s), the only reward-shaping form with a policy-invariance guarantee, per Ng/Harada/Russell, ICML 1999). A flat per-stage bonus is exactly the kind of ad hoc shaping the theorem warns produces gameable, farm-the-partial-credit policies.
XBOW — admissible as a production, externally-verified domain-specialization precedent (top-ranked on HackerOne against human researchers, real CVEs, real payouts), not the excluded academic set. Its own disclosed curriculum climbed four rungs in order: canned CTF (PortSwigger/PentesterLab, “artificial exercises”) → a custom-built realistic benchmark → white-box zero-day discovery in real open-source projects → black-box production dogfooding on HackerOne, where the real-world bug-bounty triage process itself is the verifier. This project currently sits at roughly rung 2 (PD26-01..10, custom-built, more realistic than generic CTF). XBOW’s own architecture note — “thousands of short-lived agents, each with a narrow objective, orchestrated by a persistent coordinator and validated by deterministic logic… if one agent runs into a dead end on step 4 of a 20-step attack, it doesn’t tank the whole operation” — independently confirms (alongside Kimi K2.5’s Agent Swarm) that decomposition into parallel narrow agents, not a longer monolithic single-agent trajectory, is where frontier-grade cyber-agent architecture is heading.
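The contrast between potential-based shaping and a flat per-stage bonus can be checked numerically. The sketch below is illustrative only: the stage names follow the detection, exploitation, patching, audit ordering quoted above, while the potential values, the bonus size, and both trajectories are hypothetical. It sums the shaping term F(s,a,s') = γΦ(s') − Φ(s) along a trajectory and compares it with a bonus paid on every stage entry.

```python
GAMMA = 1.0  # undiscounted, assumed to keep the arithmetic exact

# Hypothetical potentials: higher means closer to a fully proven finding.
PHI = {
    "start": 0.0,
    "detection": 0.25,
    "exploitation": 0.5,
    "patching": 0.75,
    "audit": 1.0,
}
FLAT_BONUS = 0.25  # ad hoc bonus paid each time any stage is entered


def shaping_term(s, s_next, gamma=GAMMA):
    """Potential-based shaping reward for one transition."""
    return gamma * PHI[s_next] - PHI[s]


def shaped_return(path, gamma=GAMMA):
    """Discounted sum of shaping terms; telescopes to gamma^T*PHI(end) - PHI(start)."""
    steps = zip(path, path[1:])
    return sum(gamma ** t * shaping_term(s, s_next, gamma)
               for t, (s, s_next) in enumerate(steps))


def flat_bonus_return(path):
    """Total of a flat bonus paid on every entry into a non-start stage."""
    return sum(FLAT_BONUS for s_next in path[1:] if s_next != "start")


honest = ["start", "detection", "exploitation", "patching", "audit"]
farming = ["start"] + ["detection", "start"] * 10  # loops, never progresses

print(shaped_return(honest), shaped_return(farming))          # 1.0 0.0
print(flat_bonus_return(honest), flat_bonus_return(farming))  # 1.0 2.5
```

Under the potential-based form the looping trajectory earns nothing, because the sum depends only on where the trajectory starts and ends. Under the flat bonus the same loop out-earns the honest solve, which is the farm-the-partial-credit failure. In either case the shaping term would sit on top of the terminal flag reward, not replace it.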

This project’s own numbers, side by side with the frontier evidence: ~1,200+ trajectories, solving ~100–200 of ~1,000 attempted challenges — but the distinct-challenge corpus underneath that is on the order of 10 canonical live challenges plus a 15-challenge locked dataset slice. Bugcrowd alone is “hundreds of thousands” of distinct verifiable cyber environments; Prime Intellect’s general hub is 1,000+; Kimi K2 leans on 3,000+ real + 20,000+ synthesized tools feeding trajectory generation. The gap is 2–5 orders of magnitude on environment volume and diversity — not on algorithm. Every frontier disclosure above agrees GRPO/RLVR-family joint RL with tool-use is now well-understood and largely commoditized; the highest-leverage next investment is a synthesis pipeline that turns the existing 10–15 hand-built challenges into hundreds-to-thousands of verifiably-distinct variants (parameterized bug-class mutations, stack/target permutations, difficulty-graded variants of the same vuln class) — mirroring Kimi K2’s real+synthetic split and XBOW’s own rung-2→3 transition — rather than continuing to hand-author challenges one-off at the current cadence.
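The synthesis step argued for above can be pictured as enumerating variant specifications from a small set of seed challenges. The sketch below is a hypothetical illustration, not this project's `envgen` or any disclosed pipeline: the axis names and values, the spec fields, and both function names are assumptions. It only enumerates candidate specs. Building each environment and confirming a reference solve against the terminal flag verifier would still be required before a variant counts as verifiably distinct.

```python
import hashlib
from itertools import product

# Hypothetical mutation axes; real values would come from challenge authors.
MUTATIONS = ["baseline", "input_filter_added", "alternate_sink"]
STACKS = ["stack_a", "stack_b", "stack_c"]
DIFFICULTIES = ["easy", "medium", "hard"]


def variant_spec(seed, mutation, stack, difficulty):
    """Build one candidate spec with a stable, content-derived identifier."""
    key = "|".join((seed, mutation, stack, difficulty))
    return {
        "id": hashlib.sha256(key.encode("utf-8")).hexdigest()[:12],
        "seed": seed,
        "mutation": mutation,
        "stack": stack,
        "difficulty": difficulty,
        # Stays False until a reference solve passes the flag verifier.
        "verified": False,
    }


def enumerate_variants(seeds):
    """Cross every seed with every axis value, deduplicated by identifier."""
    specs = {}
    for combo in product(seeds, MUTATIONS, STACKS, DIFFICULTIES):
        spec = variant_spec(*combo)
        specs[spec["id"]] = spec
    return list(specs.values())


seeds = [f"PD26-{i:02d}" for i in range(1, 11)]  # the 10 canonical challenges
variants = enumerate_variants(seeds)
print(len(variants))  # 10 seeds x 3 x 3 x 3 = 270 candidate specs
```

Three small axes already turn 10 seeds into 270 candidates. The count grows multiplicatively with each added axis, which is why the binding constraint shifts from authoring to verification throughput.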

6. How this ladders — from “improve solve rate” to “frontier model”

The chapters Diagnosing the gap, From behavioral audit to training signal, One problem, or many?, and Where you are & the forks ahead are all, correctly, scoped to this project’s own ~1,000-challenge portfolio. They answer “what training signal fixes F1 vs F2 vs F3 vs F4” and “monolithic vs milestone-shaped reward,” which are the right near-term engineering questions. This chapter’s honest addition: answering those questions well moves solve rate on the existing portfolio; it does not, by itself, cross into “frontier.” The ladder has three rungs, and the roadmap-inputs.md decision brief only climbs the first:

Rung 1 — execute reliably on the portfolio you have. This is the diagnosis framework’s whole job: segment single-shot vs. sequentially-gated, route by the F1–F4 funnel, pick SFT-vs-GRPO ordering correctly per segment. Get this right and you’ve closed the execution gap this project diagnosed — necessary, and per §4 above, this project’s Stage 1/Stage 2 work already targets exactly this.
Rung 2 — scale the environment/data axis by orders of magnitude. Per §4/§5, this is where the actual distance to “frontier” lives: 10 canonical + ~250 informal challenges vs. Bugcrowd’s hundreds of thousands; ~185 verified-solve candidates vs. Kimi K2’s tens-of-thousands-of-tools synthesis pipeline. No amount of correctly-routed SFT-vs-GRPO decision-making on Rung 1’s existing portfolio substitutes for this — every frontier precedent in this chapter treats environment/data scale as the dominant axis, not a nice-to-have.
Rung 3 — close the CPT/mid-training gap, if evals reveal it’s real. Per §4’s Stage 0/0.5 rows, this project’s “knowledge in tools, not weights” bet is a legitimate scoping choice only if the deficit evals surface is confirmed execution-only. If a knowledge deficit shows up instead (not just an execution one), neither of the two disclosed frontier tools for fixing it — domain CPT, mid-training — is currently in this project’s plan at all. This is the one rung that’s genuinely contingent, not scheduled.

The decomposition-vs-monolithic.md verdict (stay monolithic on reward shape until the funnel says otherwise, build only the theorem-backed potential-based version if you do) and the roadmap-inputs.md forks (instrument first, segment before committing SFT capacity, route exploration-vs-execution emphasis by the funnel) are all Rung-1-scoped decisions, made correctly — none of them need to change in light of this chapter. What this chapter adds is the honest framing that Rung 1 is a prerequisite, not the destination: a project that nails every fork in roadmap-inputs.md and still has 10 canonical challenges and no CPT/mid-training stage has built an excellent instance of “the shape of the frontier recipe” at a scale that isn’t frontier yet. The concrete forward move this chapter argues for, independent of and parallel to the Rung-1 work already underway, is the Rung-2 synthesis-pipeline investment named in §5 — because unlike Rung 3 (contingent on an eval result not yet in hand), Rung 2’s gap is already confirmed, already the largest, and already has multiple frontier precedents (Kimi K2, SIMA 2, XBOW rung 2→3, Bugcrowd) showing the shape of the fix.

Cross-links

The recipe is a sequence, not a pick — the stage-order-and-compounding argument this chapter’s §2 domain-specialization skeleton is a specific instance of.
Proven post-training datasets — a usage-cited registry — the concrete, proven-by-usage dataset shopping list for the general-capability SFT/preference/reasoning rungs §2’s Stage 1 describes abstractly.
Where you are & the forks ahead — the Rung-1 decision surface this chapter sits on top of; its forks (a)–(e) and its DAG are unaffected by anything in this chapter, they’re prerequisites to it, not alternatives.
One problem, or many? — decomposition vs. monolithic — the reward-shape verdict this chapter’s §5 potential-based-shaping discussion (Bugcrowd’s detection→exploitation→patching→audit staging idea) must stay consistent with: shape only via a theorem-backed potential function, never a flat per-stage bonus.
Diagnosing the gap — a scientific framework — the F1–F4 funnel and pass@k / Pass@(k,T) instrumentation that determines whether Rung 3 (CPT/mid-training) is contingent-but-unnecessary or the load-bearing risk this chapter flags it as.
What the frontier labs actually do (2026) — the ten-lab SFT/RL method survey this chapter’s Stage 1/Stage 2 skeleton is consistent with; that chapter is method-focused, this one is recipe-and-scale-focused — read them as companions, not duplicates.
RL that creates value — long-horizon, exploration, reasoning, novelty — the technique menu for Rung 1’s execution work; orthogonal to this chapter’s Rung 2/3 scale argument.

Post-Training Field Notes