The recipe is a sequence, not a pick

Every other chapter in this book eventually asks “which technique” — SFT or GRPO, DPO or KTO, monolithic or decomposed reward. This chapter retires that framing at the root. None of the frontier reports surveyed below describe a technique choice. They describe a fixed, ordered sequence of stages, each doing a qualitatively different job on qualitatively different data at a qualitatively different scale, and the capability that ships is a function of the order those stages run in and the way they compound — not of which single stage you picked. Skip a stage and a later stage cannot silently make up for it (you cannot RL-amplify a capability pretraining/SFT never injected — §3). Run stages in the wrong order or the wrong dose and a later stage actively regresses (heavy SFT/DPO can cap RL’s exploration room before RL ever starts). This is the organizing thesis of this book’s north star: not “what recipe/technique should I choose,” but “what sequence of stages produces frontier capability, and how do we run it for cyber.”

Two explicit sequences follow, because this project’s actual path and the textbook from-scratch path are different sequences with different stage-skeletons even though every stage-name rhymes:

  • Sequence A (§1) — a foundation model trained from scratch: pretraining → mid-training/annealing → SFT → rejection-sampling → preference opt → RL/RLVR → iterate.
  • Sequence B (§2) — this project’s actual path: fine-tune an already-available open-weight dense checkpoint (Qwen/Llama-class) — base-vs-instruct choice → (optional) continued/domain pretraining → SFT cold-start (often distilled) → rejection-sampling/on-policy SFT → preference (DPO/KTO) → RLVR/GRPO → iterate.

Stance held throughout: no claim below is grounded in an academic cybersecurity-LLM project (CTF-Dojo, Cyber-Zero, Pentest-R1, HackSynth, AutoPenBench, DRLRM-PT) — those appear, if at all, labelled “academic, not a basis.” Grounding is frontier-lab technical reports, frontier open post-training recipes (Tülu 3, OLMo 2, DeepSeek-R1, Qwen, Llama-Nemotron), and general RL/ML theory. Every arXiv id below was checked live via Exa/WebFetch against arxiv.org/abs/<id> or arxiv.org/html/<id>, not recalled from training-data memory — confidence is stated per claim, and “promising, not yet validated” is used honestly where a finding is recent/low-citation.


1. Sequence A — the foundation model, from scratch

This is the textbook path — pretrain a dense model from zero, then run a multi-round post-training loop. Grounded in Llama 3 (arXiv:2407.21783), OLMo 2 (arXiv:2501.00656), DeepSeek-V3/R1 (arXiv:2412.19437, arXiv:2501.12948, cited for pipeline-shape — MoE, flagged where MoE-specific), and Qwen2.5 (arXiv:2412.15115, dense 0.5B–72B). It is not this project’s path — included so Sequence B’s compression ratio (§2) has a baseline to compress against.

flowchart TD
  Pre["Pretraining\n15-18T tokens: web + code + math + multilingual\nLlama 3 405B 15.6T (2407.21783)\nDeepSeek-V3 14.8T (2412.19437)\nQwen2.5 18T (2412.15115)"] --> Mid

  subgraph Mid["Stage 2 — Mid-training / annealing"]
    direction TB
    MidA["5-10% of pretrain FLOPs, curated premium mix\nLR linearly decayed to 0\nOLMo 2: 50/100/300B-token anneals, souped (2501.00656)\nLlama 3: 40B tokens, 30:70 weight (2407.21783)"]
  end

  Mid --> SFT1

  subgraph SFT1["Stage 3 — SFT cold-start"]
    direction TB
    SFT1A["Curated (prompt,response) pairs — human,\ndistilled, or rejection-sampled from a prior round\nDeepSeek-V3 1.5M instances (2412.19437)\nQwen2.5 >1M samples (2412.15115)\nDeepSeek-R1 cold-start: 'thousands' only (2501.12948)"]
  end

  SFT1 --> RS

  subgraph RS["Stage 4 — Rejection sampling"]
    direction TB
    RSA["Sample K per prompt from current policy,\nkeep verifier/RM-filtered correct-only\nDeepSeek-R1: ~600K reasoning + ~200K general\n= ~800K, 2 epochs (2501.12948)"]
  end

  RS --> DPO

  subgraph DPO["Stage 5 — Preference opt (DPO)"]
    direction TB
    DPOA["(chosen, rejected) triplets, closed-form loss\nchosen over PPO for stability/scale (2407.21783)\nQwen2.5 stages SFT -> DPO -> GRPO (2412.15115)"]
  end

  DPO --> RL

  subgraph RL["Stage 6 — RL / RLVR"]
    direction TB
    RLA["GRPO: group-relative advantage, no critic,\nrule-based verifiable reward\nR1-Zero AIME pass@1 15.6% -> 71.0% (2501.12948)"]
  end

  RL -.->|"iterate: rejection-sample the RL-converged\ncheckpoint, retrain SFT from base"| SFT1

  classDef stage fill:#132b22,stroke:#34d399,color:#eafaf3;
  class Pre,MidA,SFT1A,RSA,DPOA,RLA stage;

| Stage | Job | Data type | Approx size (verified) | Contribution / order-rationale |
|---|---|---|---|---|
| 1. Pretraining | Teach language structure + load broad-domain knowledge via next-token prediction at massive scale; the only stage that’s affordable at trillions of tokens | Filtered/deduped web + code + math/STEM + multilingual, general-purpose | 15.6T tokens (Llama 3 405B dense, 2407.21783; mix: 50% general / 25% math-reasoning / 17% code / 8% multilingual); 14.8T (DeepSeek-V3 MoE, 2412.19437); 18T (Qwen2.5 dense family, 2412.15115, up from 7T for Qwen2) | Must come first: every later stage assumes a working “reads and represents language” substrate; RL/RLVR only sharpen a distribution pretraining already put mass on, they don’t create it |
| 2. Mid-training / annealing | Upsample small, high-quality, capability-specific data that a uniform trillion-token mix would dilute to near-zero; also a cheap diagnostic for “is this new dataset worth anything” | Curated high-quality web + synthetic + math/domain-specific; LR decayed to (near) zero | 5–10% of total pretrain FLOPs. OLMo 2: 832.6B-token curated pool (“Dolmino Mix 1124”), drawn down into 50B/100B/300B-token anneal runs, then checkpoint-averaged (“souped”) 2501.00656. Llama 3: 40B tokens, 30% new-data : 70% default-mix weight 2407.21783 | Too scarce (tens of B tokens) to survive uniform mixing into a 15T-token stream; too much (needs pretraining-scale volume) for SFT’s ~10⁶-example budget to inject. OLMo 2’s measured delta: +18.7% (7B) / +15.9% (13B) / +12.3% (32B) downstream from mid-training alone (Table 2), the cleanest “small stage, outsized compounding gain” number in this whole thread |
| 3. SFT cold-start | Teach instruction-following / assistant-shaped output; for reasoning pipelines, narrows further to “stabilize RL’s starting point” | Curated (prompt, response) pairs: human-written, distilled from a stronger teacher, or rejection-sampled from a prior round | DeepSeek-V3: 1.5M instances, multi-domain 2412.19437; Qwen2.5: >1M samples 2412.15115; DeepSeek-R1 cold-start: “thousands” only, deliberately tiny 2501.12948 | Must follow pretraining+mid-training (needs the base capability). In multi-round designs, SFT is interleaved with RL, not one-shot; cold-start SFT is ~3 orders of magnitude smaller than “capability” SFT because its only job is fixing format/readability, not teaching the full skill |
| 4. Rejection sampling | Turn a trained policy into a data generator for the next SFT round: crystallize RL-gained capability back into cheap-to-train supervised pairs | Model-generated completions, filtered by rule-based correctness and/or a reward model/generative judge | DeepSeek-R1: ~600K reasoning + ~200K non-reasoning = ~800K samples, 2 epochs, retrained from DeepSeek-V3-Base 2501.12948; Llama 3: K≈10–30 samples/prompt (medium confidence, secondary source) | Sits between an RL stage and the next SFT stage in every multi-round design; cannot happen before a trained policy exists, and its output is consumed entirely by the next SFT round. This is the mechanism that makes the pipeline compounding rather than a single pass |
| 5. Preference opt (DPO) | Align to relative preferences without a live RL loop, critic, or continuously-updated reward model | (prompt, chosen, rejected) triplets | Qwen2.5: explicit SFT → offline DPO → online GRPO staging 2412.15115; Llama 3: ≈6 rounds, each SFT+DPO (medium confidence on the exact count) 2407.21783 | A deliberate simplicity/stability trade against full online RL, stated directly by Meta: “less stable and harder to scale” for PPO-family algorithms at 405B 2407.21783; needs a policy already producing two plausible candidates to rank |
| 6. RL / RLVR | Exceed imitation: discover reasoning behaviors never explicitly demonstrated, via a verifiable (not learned) reward | Prompts + a verifier, not response demonstrations | R1-Zero (pure RL, no SFT): AIME 2024 pass@1 15.6% → 71.0% (cons@64 86.7%) 2501.12948; GRPO: group-relative advantage, no critic/value network | Placed last/final rounds: least stable, most expensive per-sample (live generation + verification per step). R1-Zero is the direct demonstration of running this first: real capability gain, but a documented failure mode (poor readability, language mixing) the paper attributes explicitly to skipping cold-start SFT |
| 7. Iterate | Repair a failure mode the previous pass introduced; bootstrap the next round’s training data from the current-best policy | n/a (reuses stages 3–6) | DeepSeek-R1: explicit 4-stage loop (cold-start SFT → reasoning RL → RS+SFT → RL-for-all-scenarios) 2501.12948; Llama 3: 6 rounds, self-referential rejection-sampling across rounds 2407.21783 | Not “more of the same”: round N+1’s data quality is bounded by round N’s model quality (a genuine bootstrapping effect). Llama 2’s own early-version regression (rejection-sampling only from the latest round’s data caused a documented capability loss, “struggled more… to compose rhyming lines”) 2307.09288 is the concrete warning that compounding is not monotonic for free; you must deliberately mix in older-round data |
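
The “group-relative advantage, no critic” entry in the RL / RLVR row can be made concrete with a few lines of Python. This is a minimal sketch, assuming a binary rule-based reward and mean/standard-deviation normalization within each group of K completions. The function name, the use of the population standard deviation, and the `eps` guard are illustrative assumptions, not details taken from any of the cited reports.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each completion's reward against its own sampling group.

    rewards: verifier scores for K completions of ONE prompt.
    The group mean plays the role a learned value baseline would play,
    which is why no critic network is needed.
    eps is an assumed guard against a zero-variance group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group: 4 completions, 1 verified correct.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Hypothetical group where every completion fails verification.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In the second group every advantage is zero, so that prompt contributes no learning signal. This is one way to read the claim that RL sharpens what the policy can already sample: a prompt the policy never solves gives the group nothing to rank.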

2. Sequence B — fine-tune an open-weight DENSE model (our path)

This is the project’s actual path. Start from a Qwen3/Llama-class dense checkpoint — never pretrain from zero. Grounded in four frontier open post-training recipes: Tülu 3 (arXiv:2411.15124), DeepSeek-R1 (arXiv:2501.12948), Qwen3 (arXiv:2505.09388), and Llama-Nemotron (arXiv:2505.00949). Underneath surface naming differences, all four run the same stage skeleton.

flowchart TD
  Base["Stage 0 — Base-vs-instruct choice\nStart from BASE, not Instruct\nDeepSeek-R1, Qwen3, Tulu 3 all start Base\n(2501.12948, 2505.09388, 2411.15124)"] --> CPT

  subgraph CPT["Stage 1 — (optional) continued /\ndomain pretraining"]
    direction TB
    CPTA["Full-FT (not LoRA), low LR, unsupervised domain tokens\nQwen3 knowledge stage: +5T tokens on ~30T base (2505.09388)\nDeepSeekMath CPT: 120B math tokens on a 7B dense model (2402.03300)\nLoRA-vs-FFT CPT regime: ~20B tokens (2405.09673)"]
  end

  CPT --> SFT

  subgraph SFT["Stage 2 — SFT cold-start\n(often distilled / synthetic)"]
    direction TB
    SFTA["Small, format-focused, often teacher-distilled\nDeepSeek-R1: 'thousands' (2501.12948)\nQwen3: deliberately minimized by design (2505.09388)\nDeepSeek-R1-Distill: 800K samples, no RL,\nbeats RL-on-Qwen2.5-32B directly"]
  end

  SFT --> RS

  subgraph RS["Stage 3 — Rejection-sampling /\non-policy SFT"]
    direction TB
    RSA["Sample K, verify against REAL outcome, keep\ncorrect, retrain — closes the execution gap\nRAFT (2304.06767), STaR (2203.14465)\nDeepSeek-R1: 800K samples built this way (2501.12948)"]
  end

  RS --> Pref

  subgraph Pref["Stage 4 — Preference opt (DPO/KTO)"]
    direction TB
    PrefA["(chosen,rejected) triplets or binary\ndesirable/undesirable labels\nDPO (2305.18290), KTO (2402.01306)\nTulu 3: ~273K pairs, stage 3-of-5 (2411.15124)"]
  end

  Pref --> RLVR

  subgraph RLVR["Stage 5 — RLVR / GRPO"]
    direction TB
    RLVRA["Group-relative advantage, no critic,\nverifiable reward — 2-pass: narrow-reasoning\nthen broad-general (DeepSeek-R1, Qwen3, Nemotron)"]
  end

  RLVR -.->|"iterate: rejection-sample the RL-converged\ncheckpoint -> mint next SFT round"| SFT

  classDef stage fill:#132b22,stroke:#34d399,color:#eafaf3;
  class Base,CPTA,SFTA,RSA,PrefA,RLVRA stage;

| Stage | Job | Data type | Approx size (verified) | Contribution / order-rationale |
|---|---|---|---|---|
| 0. Base-vs-instruct | Pick a starting checkpoint that won’t fight the target behavior | n/a | n/a | Base, not Instruct: no competing “assistant persona”/prior RLHF alignment to fight. DeepSeek-R1, Qwen3, and Tülu 3 all start every reasoning/post-training recipe from Base, never from the vendor Instruct checkpoint 2501.12948, 2505.09388, 2411.15124. Instruct’s habits (short answers, refusal patterns, chat-template quirks) actively fight a long-CoT/tool-use format install |
| 1. (Optional) continued/domain pretraining | Inject genuinely new facts SFT/RL cannot teach at low data volume; RL/DPO reweight existing capability, they don’t teach new facts | Raw domain text/code, unsupervised, next-token-prediction objective | Qwen3 knowledge-injection sub-stage: +5T tokens on top of ~30T general 2505.09388; DeepSeekMath: 120B math tokens continued-pretrained onto a dense 7B code checkpoint 2402.03300 (the cleanest Sequence-B-scale CPT anchor); LoRA-vs-FFT benchmark regime: ~20B tokens 2405.09673 | Full fine-tuning, not LoRA, is the recommended method here: CPT needs to learn too much (new facts, new token distributions) for a low-rank constraint; full-FT learns perturbations at 10–100× the effective rank of typical LoRA configs 2405.09673. Skip this stage if the corpus is small/curated and push knowledge in via tools/retrieval instead (handbook rule: knowledge in tools, not weights) |
| 2. SFT cold-start | Fix format/readability/tool-syntax so RL starts from a stable policy it can sharpen, rather than having to invent the format from scratch; often distilled from a stronger teacher | Small curated set, often off-policy/teacher-distilled long-CoT or tool-use traces | DeepSeek-R1 cold-start: “thousands” 2501.12948; Qwen3 explicitly states design intent: “minimize both the number of training samples and the training steps during this preparatory phase” 2505.09388; DeepSeek-R1-Distill (dense Qwen/Llama, 1.5B–70B): 800K samples, SFT-only, outperforms running RL directly on Qwen2.5-32B 2501.12948 | Deliberately kept small if it will be followed by RL. Over-investing here (turning cold-start into a full capability-SFT pass) is exactly the failure mode §3’s Llama 4 finding warns about: heavy SFT caps the RL stage’s exploration room |
| 3. Rejection-sampling / on-policy SFT | Close the execution gap: train on the model’s OWN correct outputs (grounded in its own tool-call results), not just imitation of a teacher’s plausible-looking trace | Model-generated completions, filtered by a real verifier (rule-based correctness, not a learned judge where avoidable) | DeepSeek-R1: same ~800K-sample set, produced by rejection-sampling the RL-converged checkpoint 2501.12948; RAFT formalizes the loop generically 2304.06767; STaR is the seminal reasoning-specific version, with the “rationalization” (backward-from-answer) trick 2203.14465; a contested finding argues plain rejection-sampling (RAFT) is competitive with full GRPO, with the edge attributed to prompt-filtering, not reward-normalization 2504.11343 (low-citation, “promising, worth testing on your own gym”) | The on-policy bridge between imitation and RL, and the mechanism that actually closes the off-policy execution gap (§5), because now the “reasoning” is grounded in the agent’s own tool-call outputs, not a teacher’s |
| 4. Preference opt (DPO/KTO) | Align the softer, harder-to-verify axis (style, report quality, tool-use elegance) where no ground-truth checker exists | (prompt, chosen, rejected) triplets, or binary desirable/undesirable labels | DPO: closed-form classification loss, β the hyperparameter that actually matters 2305.18290; KTO: binary labels only, HALO framing, “matches or exceeds DPO from 1B–30B” 2402.01306; Tülu 3: 272,898 pairs (8B mixture), stage 3-of-5 2411.15124 | Ordering here is genuinely contested. Where preference-opt sits relative to RLVR depends on whether the preference signal and the verifiable-reward signal target the same behavior (fold both into one RL stage, à la DeepSeek) or orthogonal behaviors (sequence them, with the harder-to-specify objective last, à la Tülu 3/Nemotron) |
| 5. RLVR / GRPO | Sharpen and stabilize on-policy behavior against a ground-truth verifier; the stage that can exceed what imitation/preference-opt cap out at | Prompts + a verifier, no response demonstrations | Every 2025 open recipe surveyed (DeepSeek-R1, Qwen3, Llama-Nemotron) runs this in two passes: narrow reasoning-only RL, then a broader general-domain pass. This two-pass structure is close to a settled convention across 3/3 recipes | GRPO needs no critic/value network, so it is tractable to bolt onto an existing dense checkpoint without training a same-size value model; the RAFT-vs-GRPO ablation above (2504.11343) applies directly: test whether gains come from reward-normalization or from prompt-filtering before committing to full GRPO infra |
| 6. Iterate | Repeat RL ↔ rejection-sampling ↔ light-SFT round-trips | n/a (reuses stages 2–5) | DeepSeek-R1’s pipeline literally loops this (SFT → RL → rejection-sample → SFT → RL); AdaSTaR formalizes efficient iteration (curriculum sampling, −58.6% training FLOPs at equal-or-better accuracy) 2505.16322 | Budget at least two RL↔rejection-sampling round-trips, not one; “RL once, done” undersells what every recipe surveyed here actually does |
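
The “closed-form classification loss” with β in the preference-opt row can be written out for a single (chosen, rejected) pair. The sketch below follows the standard DPO objective, the negative log-sigmoid of a β-scaled difference of policy-vs-reference log-probability margins. The function name, the default `beta=0.1`, and the example log-probabilities are illustrative assumptions, not values from the cited recipes.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    Each argument is log pi(y | x) summed over the response tokens,
    under either the trained policy or the frozen reference model.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits); for very negative
    # logits the softplus is approximately -logits, which avoids overflow.
    return math.log1p(math.exp(-logits)) if logits > -30 else -logits

# Policy identical to the reference: the loss is log(2).
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Hypothetical policy that raised the chosen response and lowered the
# rejected one relative to the reference.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

# Same log-probabilities, larger beta: the same margin counts for more.
improved_high_beta = dpo_loss(-8.0, -14.0, -10.0, -12.0, beta=0.5)
```

The loss depends only on how far the policy has moved from the reference on each response, scaled by β, which is why β is the hyperparameter that decides how strongly a given margin is rewarded.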

The Sequence A → B compression, cited: the 1000×+ data-volume saving of “fine-tune, don’t pretrain” comes almost entirely from skipping or shrinking the pretraining stage. SFT (500K–1.5M examples either way), preference pairs (10⁵–10⁶ either way), and RLVR prompts (10³–10⁴ either way) stay at roughly the same absolute order of magnitude whether you pretrain from scratch or fine-tune an existing dense checkpoint. This corrects the naive assumption that “Sequence B is smaller at every stage”: it isn’t; only the pretraining/CPT stage shrinks by orders of magnitude.


3. Why ORDER matters, and why stages COMPOUND (not add)

Three mechanisms recur across every source in this thread, each backed by a controlled comparison or a direct frontier-lab disclosure — this is what “order is load-bearing” cashes out to concretely.

Cold-start SFT before RL changes RL’s stability and convergence, not just its ceiling. DeepSeek-R1 vs. R1-Zero is the cleanest natural experiment: same base model (DeepSeek-V3-Base), same RL algorithm (GRPO), only the presence of an SFT stage differs. R1-Zero (pure RL, zero SFT) gets real reasoning gains (AIME 15.6%→71.0%) but “poor readability, language mixing” — the paper’s own stated reason cold-start SFT exists: “starting RL training from an uninitialized model can lead to instability and slow convergence” 2501.12948. “SFT Memorizes, RL Generalizes” reaches the same conclusion from a different testbed (GeneralPoints/V-IRL): even a paper whose headline is “SFT bad, RL good” finds “SFT stabilizes the model’s output format, enabling subsequent RL to achieve its performance gains” 2501.17161 (medium-high confidence, ICML 2025 poster).

Heavy SFT/DPO can cap RL exploration — a frontier lab’s own production disclosure, not theory. Meta’s Llama 4 blog states directly: “SFT and DPO can over-constrain the model, restricting exploration during the online RL stage and leading to suboptimal accuracy, particularly in reasoning, coding, and math domains.” Their fix: drop >50% of data tagged “easy,” train only on the harder remainder before RL. This is the single cleanest concrete counter-example to “more SFT/DPO is always better” — high confidence on the mechanism (first-party engineering account), unquantified on magnitude (no ablation numbers disclosed).
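
The “drop the easy data, keep the harder remainder” fix can be sketched as a simple filter. This is an illustration only: it assumes difficulty is approximated by an empirical pass rate per prompt, and the threshold of 0.8, the prompt ids, and the pass rates are all hypothetical. The quoted disclosure does not specify a reusable scoring procedure.

```python
def drop_easy(prompts, pass_rates, easy_threshold=0.8):
    """Keep only prompts whose estimated pass rate is below the threshold.

    pass_rates maps prompt id -> fraction of sampled attempts that passed.
    easy_threshold is an assumed cutoff for what counts as "easy".
    """
    return [p for p in prompts if pass_rates[p] < easy_threshold]

# Hypothetical per-prompt pass rates measured on the pre-RL checkpoint.
pass_rates = {
    "web-01": 1.0,
    "web-02": 0.9,
    "pwn-01": 0.25,
    "crypto-01": 0.0,
    "rev-01": 0.85,
}

hard_only = drop_easy(list(pass_rates), pass_rates)
dropped_fraction = 1 - len(hard_only) / len(pass_rates)
```

On this toy data the filter drops three of five prompts. Note that `crypto-01` survives the filter with a pass rate of zero; whether such prompts are useful before RL is a separate question from whether they are easy.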

RL’s gains are bounded by what the base/SFT policy can already sample; distillation is the one mechanism shown here to inject genuinely new capability. “Does RL Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?” (NeurIPS 2025 Best Paper Runner-Up, ICML 2025 AI4Math Best Paper; high confidence, well-vetted) finds that RLVR-trained models win at small k, but base models overtake at large k: “the reasoning capability boundary of LLMs often narrows as RLVR training progresses… reasoning paths generated by RLVR models are already included in the base models’ sampling distribution” 2504.13837. RL narrows toward existing high-reward paths; it does not expand pass@large-k beyond the base model. Distillation, by contrast, transplants a stronger teacher’s genuinely new reasoning patterns. DeepSeek-R1’s own finding: “direct distillation from DeepSeek-R1 outperforms applying RL on [Qwen2.5-32B]” 2501.12948. A mechanistic companion result (low-citation, “promising, not validated”) frames the same split via parameter-update analysis: “RL amplifies existing capabilities, while SFT replaces old skills with new ones” 2507.10616. A contested push-back worth flagging honestly: “RL Fine-Tuning Heals OOD Forgetting in SFT” reframes this as “SFT forgets, RL recovers” rather than “SFT memorizes, RL generalizes”; in that account RL mostly restores an early-SFT peak rather than exceeding it 2509.12235 (medium confidence, newer).

(Calibration note: 2507.10616 is a single, 0-citation preprint whose own abstract calls the finding a “preliminary indication,” not a settled result. That is the confidence level to carry for this citation everywhere it appears in this book. Other chapters that cite it as a flat, unhedged “principle” or a “confirmed”/“project-confirmed” finding are miscalibrated against the paper’s own abstract and should be brought in line with the hedge above, not the reverse.)
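
The small-k versus large-k comparison rests on pass@k. A minimal sketch of the widely used unbiased estimator, 1 − C(n−c, k)/C(n, k) for n samples of which c are correct, shows how the two curves can cross. The per-problem correct counts below are hypothetical and chosen only to illustrate the shape; they are not data from 2504.13837.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

N = 256
# Hypothetical base model: solves every problem, but rarely.
base_counts = [4, 4, 4, 4]
# Hypothetical RLVR model: very reliable on two problems, never solves two.
rlvr_counts = [200, 200, 0, 0]

small_k = (mean_pass_at_k(base_counts, N, 1), mean_pass_at_k(rlvr_counts, N, 1))
large_k = (mean_pass_at_k(base_counts, N, 256), mean_pass_at_k(rlvr_counts, N, 256))
```

With these counts the RLVR model is far ahead at k = 1, while the base model is ahead at k = 256, because coverage at large k depends on how many problems have any correct sample at all.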

How much a stage buys, where disclosed (this is rarely cleanly ablated — say so):

| Stage | Disclosed delta | Source | Honest caveat |
|---|---|---|---|
| Mid-training/annealing | +12.3% to +18.7% downstream, at 5–10% of pretrain FLOPs | OLMo 2, Table 2 2501.00656 | Fully open data/code; one of the only genuinely ablated per-stage numbers in this whole thread |
| SFT → DPO → RLVR, average score | 8B: 60.6 → 64.7 (+4.1) → 65.1 (+0.4); 70B: 72.6 → 76.2 (+3.6) → 76.2 (+0.0); 405B: 77.5 → 79.6 (+2.1) → 80.7 (+1.1) | Tülu 3, Table 3 2411.15124 | The average column hides where the real gain lives: RLVR’s aggregate contribution looks ~0 at 70B, but MATH-specifically it went 59.9 → 67.3 (+7.4) at 405B. Don’t evaluate a stage by its average-score delta alone (this is the direct precedent for §6’s “track the narrow-skill delta, not the average”) |
| Cold-start SFT dose | “Thousands,” not hundreds of thousands (R1); Llama 2 ceased at exactly 27,540 annotations, having found “fewer but better-quality examples led to notable performance improvements” | 2501.12948, 2307.09288 | Cold-start dose is a genuine hyperparameter, not “as much data as you can get”; over-investing risks capping RL per the Llama 4 finding above |
| Iterative-round non-monotonicity | Early Llama 2 RLHF, sampling only the latest round for rejection-sampling data, caused a documented regression (“struggled more… to compose rhyming lines”) | 2307.09288 | Compounding is not free/monotonic by default; you must deliberately mix in older-round data or silently regress a capability an earlier round had |
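
The arithmetic behind the Tülu 3 row can be reproduced directly from the figures quoted in the table. The sketch below recomputes each stage-to-stage delta and sets it beside the MATH-specific delta quoted for 405B. The dictionary layout and function name are illustrative; the numbers are the ones in the table above.

```python
# Average scores as quoted in the table above (Tülu 3, Table 3).
tulu3_avg = {
    "8B": {"SFT": 60.6, "DPO": 64.7, "RLVR": 65.1},
    "70B": {"SFT": 72.6, "DPO": 76.2, "RLVR": 76.2},
    "405B": {"SFT": 77.5, "DPO": 79.6, "RLVR": 80.7},
}

def stage_deltas(scores, order=("SFT", "DPO", "RLVR")):
    """Delta contributed by each stage transition, rounded to one decimal."""
    return {
        f"{prev}->{cur}": round(scores[cur] - scores[prev], 1)
        for prev, cur in zip(order, order[1:])
    }

deltas = {size: stage_deltas(scores) for size, scores in tulu3_avg.items()}

# MATH-specific figures quoted in the caveat column for 405B.
math_405b_delta = round(67.3 - 59.9, 1)
```

Setting `deltas["405B"]["DPO->RLVR"]` beside `math_405b_delta` makes the caveat visible: the aggregate moves by about one point while the narrow skill moves by several.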

Synthesis, settled vs. contested, stated honestly: the “cold-start stabilizes RL,” “heavy SFT/DPO caps RL exploration,” and “iterative rounds compound only if you mix old+new data” claims are high-confidence and multi-source-corroborated. “SFT memorizes / RL generalizes” as a clean universal law is not settled: it holds in controlled game/navigation environments, but it is qualified by a nearer-2026 result showing that the useful range of SFT checkpoints to start RL from is bounded, which is not the same as “less SFT is always safer.” Clean, ablated per-stage attribution is rare: of everything surveyed across this whole thread, only Tülu 3 and OLMo 2 disclose a genuine stage-by-stage number; treat any single-number stage-attribution claim (including the ones in this chapter) with the same skepticism the Tülu 3 MATH-vs-average gap earns.


4. What I’d change in the project’s pipeline — order + dosage

  1. Even a small, cheap cold-start SFT stage (thousands, not millions) ahead of GRPO/RLVR is worth the compute purely for RL stability/format-compliance, independent of whether it raises the reward ceiling.
  2. Audit SFT/DPO data difficulty before RLVR — Llama 4’s explicit fix (drop >50% “easy,” train the hard remainder) is the single most actionable, frontier-lab-disclosed lever available; this project’s harness-generated CTF trajectories should be difficulty-scored before SFT, so the later GRPO stage still has exploration room on hard challenges.
  3. RL will not inject a capability the base/SFT policy cannot already sample — if a CTF category stays at ~0% under GRPO, that is evidence the capability needs to enter via SFT (ideally teacher-distilled), not via more RL steps against the same reward.
  4. When iterating multiple GRPO rounds, mix in earlier-round data when constructing later rounds’ SFT/preference sets, per Llama 2’s own documented regression when it didn’t.
  5. Track narrow-skill deltas per stage, not just the eval average — a stage can look like ~0 aggregate contribution while delivering the specific skill-targeted gain (CTF-category solve-rate, in this project’s case) that actually mattered.
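
Item 4 above, mixing earlier-round data into later rounds, can be sketched as a small sampling helper. This is a minimal illustration: the 70/30 split between the latest round and older rounds, the function name, and the example ids are assumptions, not a ratio disclosed by Llama 2 or any other cited report.

```python
import random

def build_round_mix(rounds, latest_fraction=0.7, size=10, seed=0):
    """Build one SFT/preference set from several rounds of examples.

    rounds: list of per-round example lists, oldest first.
    latest_fraction: assumed share drawn from the newest round; the
    remainder is drawn from the pooled older rounds.
    """
    rng = random.Random(seed)
    *older, latest = rounds
    if not older:
        return rng.sample(latest, size)
    n_latest = round(size * latest_fraction)
    mix = rng.sample(latest, n_latest)
    older_pool = [example for round_data in older for example in round_data]
    mix += rng.sample(older_pool, size - n_latest)
    return mix

# Hypothetical example ids from three rounds of rejection sampling.
rounds = [[f"r{r}-{i}" for i in range(20)] for r in (1, 2, 3)]
mix = build_round_mix(rounds)
```

The point of the helper is only that older rounds are never dropped to zero by default; the right ratio is something to measure per project, using the per-stage evaluation in §6.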

5. The synthetic-trajectory bootstrap — teacher writes it from the answer

A trajectory used for cold-start SFT can come from two structurally different sources, and conflating them is the single most common mistake in this literature:

  • Off-policy synthetic — a different, usually bigger model writes the trajectory (distillation, Self-Instruct 2212.10560, Evol-Instruct/WizardLM 2304.12244, persona-driven synthesis 2406.20094). The trajectory was never in the student’s own output distribution.
  • On-policy synthetic (self-generated + filtered) — the model being trained writes the trajectory itself; a verifier decides which ones to keep (STaR 2203.14465, RAFT 2304.06767, ReST-EM 2312.06585, the rejection-sampling stage in R1/Tülu 3). The trajectory is always something the model could actually produce.

Where each slots into the sequences above: off-policy synthetic belongs at the SFT cold-start stage (Stage 3 in Sequence A, Stage 2 in Sequence B), where it fixes format, instruction-following, and tool-syntax. On-policy synthetic belongs at the rejection-sampling stage (Stage 4 in Sequence A, Stage 3 in Sequence B); this is where genuine skill-compounding starts, because the training signal is now grounded in the model’s own execution.
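
The on-policy loop (sample from the model being trained, let a verifier decide what to keep) can be sketched generically. This is an illustration in the spirit of STaR/RAFT-style filtering, not any paper’s exact procedure. The toy policy and verifier stand in for a real model and a real environment check, and every name here is hypothetical.

```python
import random

def rejection_sample(prompts, sample_fn, verify_fn, k=8):
    """Sample k completions per prompt from the current policy and keep
    only those the verifier accepts, as (prompt, completion) SFT pairs."""
    kept = []
    for prompt in prompts:
        for _ in range(k):
            completion = sample_fn(prompt)
            if verify_fn(prompt, completion):
                kept.append({"prompt": prompt, "completion": completion})
    return kept

rng = random.Random(0)

def toy_policy(prompt):
    # Stand-in for the model: adds two integers, correct about half the
    # time in expectation, off by one otherwise.
    a, b = map(int, prompt.split("+"))
    return str(a + b + rng.choice([0, 0, 1, -1]))

def toy_verifier(prompt, completion):
    # Stand-in for a rule-based checker with access to ground truth.
    a, b = map(int, prompt.split("+"))
    return int(completion) == a + b

sft_pairs = rejection_sample(["2+3", "10+7"], toy_policy, toy_verifier, k=8)
```

Every kept pair is, by construction, something the policy itself produced, which is the property that separates this from off-policy teacher data.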

The off-policy caveat: cold-start knowledge, yes — execution gap, no

Off-policy teacher trajectories are excellent for cold-start but do not close the execution gap. Establishing this required cross-referencing several 2025–2026 papers, because no single seminal paper states it cleanly:

  • Theoretical reason (DAgger lineage / covariate shift): SFT on teacher-written trajectories trains only on teacher-visited states. At inference the student generates autoregressively from its own prior tokens/actions — the moment it errs somewhere the teacher never did, it enters a state distribution it was never trained on, and errors compound. “Revisiting DAgger in the Era of LLM-Agents” states this precisely for multi-turn agents: “SFT provides dense teacher supervision but suffers from covariate shift because it is trained on off-policy teacher trajectories; while RLVR avoids this off-policy mismatch by learning from on-policy rollouts but with only sparse outcome feedback” 2605.12913 (medium-high confidence, very recent). A companion result on SWE-bench: pure off-policy imitation on expert trajectories suffers covariate shift; mixing in on-policy expert corrections gives +13–14% relative gain over traditional imitation (OpenReview KXAJtW8Bib, ICLR 2026 submission).
  • Empirical confirmation — distillation only expands capability (pass@k) when it brings NEW knowledge, not pattern imitation. A controlled ablation compares base model vs. real DeepSeek-R1-Distill (trained on genuine teacher trajectories, “likely incorporates substantial new knowledge”) vs. a distilled model trained only on teacher responses for questions the base model’s own output distribution already covered (pure pattern transfer, zero new knowledge): “both distilled models significantly improve accuracy, [but] only the DeepSeek model shows a meaningful increase in capability” 2505.14216 (high confidence — direct quote, this is the single most load-bearing paper for this nuance).
  • Why this is categorically worse for agentic tool-use than for math CoT: a math CoT is a single linear token stream — the “state” barely diverges from what the teacher wrote. A CTF-solving trajectory is multi-turn and environment-coupled — tool call → real stdout → next decision. The instant the student’s tool call returns different real output than what was baked into the teacher-written trajectory (different port open, different file present), the student is off the teacher’s distribution with no grounded behavior for that state. The environment, not just the model’s own tokens, is a second source of state divergence a teacher trajectory can never have anticipated.

The confabulation risk — backward synthesis needs the same warning label

STaR’s “rationalization” (show the model the correct answer, let it construct a plausible rationale backward, strip the hint before training) 2203.14465 is the seminal instance of backward/reverse synthesis. A 2026 follow-up sharpens the risk directly relevant here: when a model can see the answer while generating the “reasoning,” the answer acts as a cognitive anchor, and the model tends to produce a rationalization rather than a reasoning process that would generalize — naive mitigation (telling it to ignore the answer) paradoxically makes anchoring worse 2602.14469 (medium confidence, very recent, 0 citations — “promising, mechanistically plausible, not yet validated at scale”).

The cyber instantiation

Applying the above to the project’s own harness (extrapolation from general theory, not from an academic cybersecurity-LLM paper): use a strong external model to write CTF-solve trajectories for challenges you already have ground-truth flags for — but treat that corpus strictly as cold-start (format, tool-call syntax, instruction-following). Every single trajectory must be gated through the real flag verifier (the actual environment check, flag_verified, not a regex on the model’s claimed flag) before it enters SFT — per the post-hoc-rationalization warning above, a teacher that already knows the flag can write a plausible-looking exploit chain that never actually ran against real environment state. Then budget real compute for a rejection-sampling pass on the agent’s own real rollouts before RL, because per the covariate-shift/execution-gap evidence above, off-policy teacher data cannot by itself close the gap between “writes a plausible exploit chain” and “actually recovers from a real tool-call failure the harness will hit.” This is the direct, cyber-specific reading of Sequence B’s Stage 2 → Stage 3 transition (§2).
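
The difference between gating on the environment’s check and gating on a regex over the claimed flag can be shown with a toy filter. This is a sketch under stated assumptions: the trajectory dictionaries, their keys other than `flag_verified`, and the flag pattern are hypothetical, and `flag_verified` is assumed to be set by the harness from the real environment check.

```python
import re

# Hypothetical flag format, used only to show what a regex gate would admit.
FLAG_RE = re.compile(r"flag\{[^}]+\}")

def admit_to_sft(trajectory):
    """Admit a trajectory only if the environment itself confirmed the flag."""
    return trajectory.get("flag_verified") is True

# Hypothetical teacher-written trajectories.
trajectories = [
    {"id": "t1", "final_message": "Got it: flag{abc}", "flag_verified": True},
    {"id": "t2", "final_message": "The flag is flag{plausible_but_never_ran}", "flag_verified": False},
    {"id": "t3", "final_message": "No flag found.", "flag_verified": False},
]

regex_admitted = [t["id"] for t in trajectories if FLAG_RE.search(t["final_message"])]
verified_admitted = [t["id"] for t in trajectories if admit_to_sft(t)]
```

The regex gate admits `t2`, a trajectory that claims a flag the environment never confirmed; the verifier gate admits only `t1`. A missing `flag_verified` key is treated as not verified.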


6. How to evaluate Sequence B stage-by-stage

Answer, up front: evaluate at every stage boundary, with a single frozen held-out suite run against each checkpoint as it’s produced; never re-run from scratch, never wait for the final model. This is exactly the pattern Tülu 3 and OLMo 2 use in the open literature, and DeepSeek-R1’s own developmental-stage table demonstrates the diagnostic payoff directly: comparing R1-Zero against “Dev1” (cold-start SFT added) shows Dev1 gaining on IFEval/Arena-Hard but losing ground on AIME, attributed to “the limited size of the cold-start dataset” 2501.12948. That stage regression is only visible because they evaluated at the intermediate checkpoint, not just at final R1. Tülu 3’s own words: “our methodology facilitates identifying skill deficiencies and refining the data mix… ensuring a balanced performance of core skills across the training process” 2411.15124; they also release the actual intermediate checkpoints (Tulu3-SFT, Tulu3-DPO, final RLVR model) specifically so this comparison is reproducible. Cost asymmetry reinforces the case: RLVR/GRPO is the most expensive stage in the sequence, so discovering an SFT-stage defect only after a multi-day RL run means that run was a sunk cost you didn’t need to pay.
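
Evaluating at every stage boundary amounts to comparing consecutive checkpoints on the same frozen metrics and flagging drops. The sketch below is a minimal version of that comparison. The checkpoint names, metric names, scores, and the one-point tolerance are all hypothetical.

```python
def stage_regressions(history, tolerance=1.0):
    """Flag metrics that dropped by more than `tolerance` between
    consecutive checkpoints.

    history: ordered list of (checkpoint_name, {metric: score}).
    A metric missing from the later checkpoint is skipped, not flagged.
    """
    flagged = []
    for (prev_name, prev), (cur_name, cur) in zip(history, history[1:]):
        for metric, prev_score in prev.items():
            drop = prev_score - cur.get(metric, prev_score)
            if drop > tolerance:
                flagged.append((prev_name, cur_name, metric, round(drop, 1)))
    return flagged

# Hypothetical frozen-suite scores recorded as each checkpoint is produced.
history = [
    ("base", {"task_pass@1": 5.0, "mmlu": 70.0, "ifeval": 40.0}),
    ("sft", {"task_pass@1": 22.0, "mmlu": 66.5, "ifeval": 71.0}),
    ("dpo", {"task_pass@1": 21.5, "mmlu": 66.8, "ifeval": 74.0}),
]

regressions = stage_regressions(history)
```

Here the only flagged transition is base to SFT on the retention metric, which is the kind of defect that is cheap to see at the SFT checkpoint and expensive to discover after an RL run.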

Per-stage metrics — the frozen suite is fixed, but what you watch closely changes by stage:

  • After (optional) CPT: held-out domain-knowledge QA (not CTF-solving — pure recall of the corpus you just trained on) + a general-capability retention check (MMLU/IFEval before/after). “Domain-continual pretraining induces moderate forgetting with low-to-moderate backward transfer” 2510.17776 — CPT is the mildest of the post-training stages for forgetting, but not free.
  • After SFT cold-start: initial pass@1 on the target task family + format/instruction-adherence (IFEval-style: does it actually follow the tool-call schema). This is the stage with the sharpest forgetting risk in the literature — one documented case: SFT dropped a benchmark 52.1%→40.1% while RFT improved the same setting to 54.2% 2507.05386; a Qwen-family (dense, directly relevant) result documents SFT degrading TruthfulQA/HaluEval on Qwen3-4B specifically 2605.20005. MMLU/IFEval must be in the frozen suite at the SFT checkpoint, not just at the end.
  • After preference opt: win-rate on a held-out preference-eval set (the thing DPO/KTO directly optimizes) and re-run the same pass@1 suite from the SFT checkpoint — DPO should not regress raw task-solve rate; if it does, β or the preference data is miscalibrated. Tülu 3 separates “development” evals (looked at between stages) from “unseen” evals (reserved until the end) precisely so preference tuning doesn’t overfit the eval suite itself 2411.15124 — mirror this split.
  • After RLVR: pass@1 AND pass@k, not pass@1 alone (§3’s argument, restated as a diagnostic here), plus the live reward/entropy curve during training. DAPO reports that naive GRPO, which shows the entropy-collapse failure mode (“the entropy of the policy decreases quickly… sampled responses of certain groups tend to be nearly identical… limited exploration”), reached only 30/100 AIME points vs. DeepSeek’s reported 47 before a fix (Clip-Higher) was applied 2503.14476. A frozen post-hoc suite alone will not catch this; you need the live curve as an in-training diagnostic in addition to before/after checkpoint comparisons.
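
The live entropy curve mentioned in the last bullet can be sketched as follows. This is a minimal sketch, assuming next-token probability vectors are already available from the training loop; `collapse_steps` and its `floor_ratio` default are hypothetical choices for illustration, not a threshold taken from DAPO or any other cited work.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(step_probs: Sequence[Sequence[float]]) -> float:
    """Average per-token entropy over the sampled tokens of one training step."""
    return sum(token_entropy(p) for p in step_probs) / len(step_probs)


def collapse_steps(entropy_curve: Sequence[float], floor_ratio: float = 0.2) -> List[int]:
    """Indices of steps whose entropy fell below floor_ratio * initial entropy.

    Assumption: comparing against the first logged value is a convenient
    reference point, not an established convention.
    """
    floor = floor_ratio * entropy_curve[0]
    return [i for i, h in enumerate(entropy_curve) if h < floor]
```

Logging `mean_entropy` once per optimizer step and checking `collapse_steps` alongside the reward curve gives an in-training signal that a before/after checkpoint comparison cannot provide.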

Pass@k, per funnel stage, is the specific diagnostic RLVR requires that earlier stages don’t. “Does RL Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?” (636+ citations in ~14 months, ICML 2025 AI4Math Oral — well-validated) shows: RLVR wins at small k, base models overtake at large k, and “the reasoning capability boundary… often narrows as RLVR training progresses” 2504.13837. If you track pass@1 alone at the RLVR checkpoint, an RL run that is actively narrowing solution-space diversity looks like a pure win right up until the model needs to solve something outside the narrowed distribution — a genuinely novel CTF challenge, or the hard tail of the funnel. Compute pass@1 and pass@k (k matched to this project’s own locked methodology — k=3 pilot, k=5 real, k=10 edge-band) against the base model’s own pass@k on the same suite as the reference ceiling, not just the previous checkpoint — and do this per funnel stage (F1–F4), not only as one aggregate number, so “RLVR improved pass@1 on easy challenges but narrowed pass@k on hard ones” is visible on a training-stage × funnel-stage grid rather than hidden inside an aggregate.
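
The training-stage × funnel-stage grid can be sketched with the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct. This is a minimal sketch: `pass_at_k_grid` and the shape of its `results` input are hypothetical, and the numbers in the usage note are made up to show the crossover, not measured.

```python
from math import comb
from typing import Dict, List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c verified-correct."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_grid(
    results: Dict[Tuple[str, str], List[Tuple[int, int]]],
    k: int,
) -> Dict[Tuple[str, str], float]:
    """Mean pass@k per (training_stage, funnel_stage) cell.

    results[(training_stage, funnel_stage)] is a list of (n, c) pairs,
    one pair per challenge in that funnel stage.
    """
    return {
        cell: sum(pass_at_k(n, c, k) for n, c in rows) / len(rows)
        for cell, rows in results.items()
    }
```

As an illustration, take two hard challenges with 10 samples each: a base model solving each once, `[(10, 1), (10, 1)]`, and an RLVR checkpoint solving one every time and the other never, `[(10, 10), (10, 0)]`. At k=1 the RLVR cell scores 0.5 against the base model's 0.1, but at k=10 the base cell reaches 1.0 while the RLVR cell stays at 0.5. A pass@1-only view would report this as a pure win.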

Guardrails at every stage. Forgetting-risk ranking across the literature converges: SFT is the worst offender, CPT is moderate, RLVR/RFT is the gentlest and can even improve general-capability numbers in some cited settings 2507.05386 — treat the ranking as transferable, the exact percentages as architecture/task-specific. Held-out discipline: never let training data overlap with the frozen eval suite — Tülu 3’s own rule is “removing any training set that has overlap with more than 2% of our evaluation suite” 2411.15124, a concrete, adoptable threshold (treat as convention, not law).
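
The 2% overlap rule can be sketched as an n-gram check between a candidate training set and the frozen suite. This is a minimal sketch under loud assumptions: whitespace tokenization, the default n-gram length of 8, and the definition of "overlap" as "an eval item shares at least one n-gram with the training set" are choices made here for illustration, and the function names are hypothetical.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercased whitespace-token n-grams; empty if the text is shorter than n."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_fraction(
    train_texts: Iterable[str], eval_texts: Iterable[str], n: int = 8
) -> float:
    """Fraction of eval items sharing at least one n-gram with the training set."""
    train_grams: Set[Tuple[str, ...]] = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    eval_list = list(eval_texts)
    if not eval_list:
        return 0.0
    hits = sum(1 for e in eval_list if ngrams(e, n) & train_grams)
    return hits / len(eval_list)


def should_drop_training_set(
    train_texts: Iterable[str],
    eval_texts: Iterable[str],
    threshold: float = 0.02,  # the 2% convention quoted above
    n: int = 8,
) -> bool:
    """True when the training set overlaps more than `threshold` of the suite."""
    return contaminated_fraction(train_texts, eval_texts, n) > threshold
```

For CTF data, exact n-gram matching is likely a lower bound on contamination, since a writeup can paraphrase a challenge without sharing long token runs. Treat a pass on this check as necessary, not sufficient.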

Ablation to attribute contribution and decide where to restart. Hold the frozen suite fixed, run the full sequence with a stage included vs. skipped, compare final-checkpoint numbers on the same suite. DeepSeek-R1’s own R1-Zero-vs-R1 comparison is this ablation, published. Reading the result: if ablating stage N barely moves the frozen-suite numbers, stage N is a target for shrinking/dropping in the next iteration — restart from the checkpoint just before stage N, don’t re-run the whole sequence. If ablating stage N causes a big regression, it’s load-bearing and any future recipe change must preserve it. This is the direct empirical test of “order is load-bearing” for this project’s specific data, not just an inherited belief from the literature — and DAPO’s own reproduction difficulty (only 30/100 AIME with naive GRPO despite a strong base) is a reminder to isolate order-effects from hyperparameter-effects before attributing a regression to staging.


7. The cyber mapping — Sequence B, instantiated

This project runs Sequence B. Mapping each rung to a concrete cyber-data instantiation (extrapolation from the general theory above, applied to this project’s own harness — not from an academic cybersecurity-LLM paper):

| Sequence B stage | Cyber-specific data at this rung |
|---|---|
| 0. Base-vs-instruct | Start from the Base checkpoint of whatever Qwen/Llama-class dense model is chosen: this matches all three frontier open recipes and avoids inheriting a chat-alignment prior that resists long-CoT/tool-use/verifiable-reward shaping |
| 1. (Optional) CPT | If the security corpus (CVE descriptions, tool docs, writeups, security-agent-<family> trace history) is a genuine token corpus, not a handful of curated docs, run it as full-FT, low LR, before SFT. If it’s small/curated, skip this stage; push the knowledge in as SFT context/retrieval instead |
| 2. SFT cold-start | A strong external model writes CTF-solve trajectories for challenges with already-known, ground-truth flags: off-policy, format/tool-syntax-only. Gate every trajectory through the real flag verifier before it enters SFT (§5’s confabulation warning applies directly) |
| 3. Rejection-sampling / on-policy SFT | Run the cold-started model as the agent in the real harness. Sample K trajectories per challenge (temp ~0.7, per RFT/ReST-EM’s exact hyperparameters), keep only trajectories where the real environment state produced the real flag: an actual environment check, not a string match on FLAG{...} in the model’s text. This is the step that closes the execution gap |
| 4. Preference (DPO/KTO) | Contrast verified-correct vs. verified-incorrect trajectories from the same rejection-sampling pool; use KTO if only cheap binary “good/bad” labels exist (from a trace-verification tool), DPO if genuine paired trajectories on the same challenge exist |
| 5. RLVR/GRPO | Same ground-truth flag verifier as the reward function. The verifiable-reward signal (flag captured / exploit worked) and the preference signal (report quality, doesn’t waste turns) are genuinely orthogonal here, arguing for RLVR-then-short-DPO-polish ordering (Tülu 3/Nemotron pattern) rather than folding both into one stage |
| 6. Iterate | Mint the next round’s cold-start/rejection data from the current RL-converged checkpoint (R1’s own stage-3 pattern); budget at least two RL↔rejection-sampling round-trips before calling a GRPO baseline “done” |
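
Stages 3 and 4 of the table share one pool, which can be sketched as below. This is a minimal sketch, not the project's pipeline: `rejection_sample`, `build_preference_data`, the `sample_fn(challenge_id, i)` and `flag_verified(challenge_id, trajectory)` signatures, and the per-challenge routing between DPO pairs and KTO examples are all assumptions made for illustration. The real verifier would be the environment check, never a string match.

```python
from collections import defaultdict
from itertools import product
from typing import Callable, Dict, Iterable, List, Tuple

PoolRow = Tuple[str, str, bool]  # (challenge_id, trajectory, verified)


def rejection_sample(
    challenge_ids: Iterable[str],
    sample_fn: Callable[[str, int], str],
    flag_verified: Callable[[str, str], bool],
    k: int,
) -> List[PoolRow]:
    """Sample k on-policy trajectories per challenge and label each by the verifier."""
    pool = []
    for cid in challenge_ids:
        for i in range(k):
            traj = sample_fn(cid, i)
            pool.append((cid, traj, bool(flag_verified(cid, traj))))
    return pool


def build_preference_data(pool: Iterable[PoolRow]):
    """Split the pool into DPO pairs and KTO binary examples.

    Assumed routing: a challenge with both verified and unverified
    trajectories yields (chosen, rejected) pairs; a challenge with only
    one kind yields unpaired binary-labelled examples.
    """
    by_challenge: Dict[str, Dict[bool, List[str]]] = defaultdict(
        lambda: {True: [], False: []}
    )
    for cid, traj, ok in pool:
        by_challenge[cid][ok].append(traj)

    dpo_pairs, kto_examples = [], []
    for cid, groups in by_challenge.items():
        if groups[True] and groups[False]:
            dpo_pairs += [
                (cid, chosen, rejected)
                for chosen, rejected in product(groups[True], groups[False])
            ]
        else:
            kto_examples += [
                (cid, traj, label)
                for label in (True, False)
                for traj in groups[label]
            ]
    return dpo_pairs, kto_examples
```

The on-policy SFT set for stage 3 is simply the verified rows of the pool, `[traj for _, traj, ok in pool if ok]`, so a single sampling pass feeds both stages.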

Cross-links: What the frontier labs actually do (2026) for the per-lab method survey this chapter’s stage-skeleton was built on; The path to a frontier cybersecurity model for the domain-specialization lineage (code/math/medical) that cross-validates this same stage-skeleton and the current gap analysis against this project’s own harness; Diagnosing the gap — a scientific framework for how a measured failure mode maps to which stage needs the fix; Where you are & the forks ahead for how this sequence view resolves into this project’s next concrete decision.