How this book grows

This is a living document maintained by the researcher seat of the llmresearch project. It grows one verified topic at a time; each chapter cites its sources, so every claim can be checked rather than taken on faith.

Conventions

Engineer-level. Assumes you know logprobs, KL, advantage, rollouts, MoE, PPO clip. No 101 filler.
Cite or don’t claim. Every substantive statement carries an arXiv id or a named lab report/blog. Where a claim is contested, it is marked as contested and both sides are given (see Contested edges).
Honesty about status. Methods are tagged mainstream / niche / promising-not-proven / experimental based on whether a frontier flagship’s report actually uses them.
Verified live. arXiv ids are crawl-checked; lab-recipe claims come from 2025–2026 tech reports and blogs, not training recall. Re-verify before betting a run — this field ships weekly.

Build & run locally

# one-time: install the toolchain (macOS)
brew install mdbook mdbook-mermaid
# from the book root:
mdbook-mermaid install .   # vendors mermaid assets + wires the preprocessor
mdbook serve --open        # live-reload server at http://localhost:3000

Mermaid flowcharts and raw HTML/iframes (e.g. the embedded interactive decision journey) render offline; no CDN is required.

Log

2026-07-02 — v0.4. Wired three new chapters into the “Toward a frontier cybersecurity model” section and re-ordered it to read as one coherent arc: the organizing reframe first, then the two chapters it is a prerequisite for, then the existing family→path→forks arc. The new chapters:
The recipe is a sequence, not a pick. Retires the “which technique” framing at the root. Covers two explicit stage sequences, Sequence A (from-scratch foundation model) and Sequence B (fine-tune an open-weight dense model, this project’s actual path); why order matters and stages compound rather than add; the synthetic-trajectory bootstrap and its off-policy execution-gap caveat; and a stage-wise evaluation protocol for Sequence B.
Continued pretraining on an instruction-tuned model. Asks whether raw CPT can run directly on an already-instruct/RLHF’d checkpoint without destroying it. The answer is yes, but naive CPT-on-instruct reliably causes format/alignment collapse, not fact erasure. Includes a six-technique preservation decision table; recommends CPT-on-base→re-instruct as the default, with chat-vector reattachment as a cheap fallback; and defines an IFEval+MMLU+domain-QA stage-boundary gate to verify that nothing broke.
Proven post-training datasets — a usage-cited registry. A ~60-dataset registry across instruction/chat SFT, tool/function-calling, preference, reasoning/CoT, willingness/refusal-calibration, and Chinese-labs/multilingual. Every row is proven by usage in a named shipped model or recipe, never a single-paper-only academic artifact, and the registry is mapped onto Sequence B’s actual stage order.
Cross-linked frontier-cyber-model-path.md and roadmap-inputs.md to both the recipe-sequence and dataset-registry chapters where their existing arguments (Stage 1 SFT/data synthesis, fork (b)’s SFT-now-vs-measure-first) are specific instances of the general point.
Merged ~90 new citations into References via two new sections (“Post-training recipe as a sequence — order, compounding, synthetic-trajectory bootstrap” and “Continued pretraining on an instruction-tuned model — preservation techniques”) plus a new “Datasets (proven-by-usage)” subsection for the dataset-card/model-card links, deduped against the existing corpus.
2026-07-02 — v0.3. Wired three new chapters into the book:
Before you train — instrumentation & data readiness. What the harness already emits versus the minimal per-stage-verifier gap still to instrument, grounded in a direct source read of go/libs/agent/events and the flag-verification pipeline.
One problem, or many? — monolithic vs decomposed. The eval-decomposition-vs-training-decomposition split, the potential-based-shaping safety net, and the verdict for this project.
Where you are & the forks ahead. The capstone: five forks, a dependency DAG, and seven falsifiable hypotheses, placed last before References.
Applied the project’s standing no-academic-cybersecurity-LLM-as-research-basis stance across all three: every CTF-Dojo/Cyber-Zero/Pentest-R1/HackSynth/AutoPenBench/DRLRM-PT-style citation is labelled “academic, cited for context — not a basis for our decisions,” with load-bearing claims re-anchored on frontier-lab disclosures, general RL/ML theory, or this project’s own measured data.
Merged ~35 new citations into References (a new Hierarchical RL/decomposition/reward-shaping section; additions to the Exploration & entropy collapse, CTF/pentest RL environments, Agent benchmarks & failure taxonomies, and Capability boundary sections) and added the CTF/pentest-RL section’s context-only header note.

Later same day — added the frontier north-star section. Wired the two standing capstone chapters into a new top-level section, “Toward a frontier cybersecurity model,” placed after “In practice” and before References:
Cybersecurity is one of a family — what cracked the others. The cross-domain structural-analogy survey: six adjacent long-horizon/sparse-reward/verifiable domains and what actually cracked each.
The path to a frontier cybersecurity model. The capstone recipe plus gap analysis: what “frontier” costs beyond this project’s own portfolio.
Also moved Where you are & the forks ahead out of “In practice” and into this new section as its final chapter, completing the arc family → path → your forks. Cross-linked roadmap-inputs.md, at the top and in its Cross-links section, to both new chapters. Merged the new chapters’ citations into References via two new sections, “Domain-specialization lineages (code/math/medical)” and “Adjacent-domain structural transfer”: ~65 new arXiv/DOI/PMC ids, deduped against the existing ~250-citation corpus, with academic-security entries kept labelled context-only.
Salvaged three of the higher-value ideas that the standing academic-cybersecurity-LLM stance would otherwise have excluded, by re-grounding each on independent frontier-lab or general-RL-theory evidence instead:
(1) Staged/kill-chain reward shaping: salvaged as the theorem-backed potential-based form only (Ng-Harada-Russell, ICML 1999), never the flat per-stage bonus that academic pentest-RL papers use.
(2) Subgoal/curriculum decomposition of a long episode: salvaged via DeepSeek-Prover-V2 and AlphaGeometry (cold-start SFT data generation only, never densifying the RL reward itself).
(3) Failure-corpus-to-curriculum conversion: salvaged via WebRL’s self-evolving curriculum and HER’s relabeling principle (mining flag=0 trajectories for sub-skill SFT data), not any academic CTF-RL paper’s claim.
2026-07-02 — v0.2. New Diagnosis section: Diagnosing the gap — a scientific framework (the pass@k crossover protocol, Cover@τ, sandbagging/elicitation tests — is the ~10-20% k=1 solve rate an execution gap or a knowledge gap, before betting a GRPO run on the answer) and From behavioral audit to training signal (maps the BSides-LV 5-pattern behavioral audit — tool avoidance, no methodology, brittle single-guess, uneven PTES phases, benchmarks-measure-speed-not-thoroughness — onto the specific post-training techniques designed to fix each). New method chapter RL that creates value — long-horizon · exploration · reasoning · novelty, a ~50-paper sweep tagged [L]/[E]/[R]/[N] against this project’s own diagnosis (GiGPO, DAPO/Clip-Cov, ProRL, ReTool/ToRL/Search-R1, CTF-Dojo/Pentest-R1/HackSynth, pass@k-is-diagnostic-not-objective). Extended reinforcement.md and agentic-rl.md with cross-links into the sweep. Rewrote What the frontier labs actually do with a full last-year (2025-07→2026-07) refresh across all 10 tracked labs, filling in previously-thin xAI/Grok, Mistral (Magistral/Ministral), Zhipu/GLM, Xiaomi/MiMo, and deepening Kimi K2→K2.5→K2.6, each now carrying the same [L]/[E]/[R]/[N] tags. Extended Contested edges and The decision with the new capability-boundary and sandbagging/elicitation literature. Merged ~130 newly-cited sources into References.
2026-07-02 — v0.1. Initial build from a live session: the on/off-policy foundation, the method genealogy (imitation/preference/reinforcement + agentic RL + PEFT), verified 2026 frontier-lab recipes, the method→data reframe, the decision tree, and contested edges. Embeds the interactive decision journey.

Backlog (next sessions)

A worked example: filtering your ~100–200 solves into a rejection-sampling SFT set (the verified-trajectory pipeline).
The reward-function chapter: building an ungameable verify(state) for CTF flags (state, not transcript).
Pass@k methodology: per-challenge bucketing before choosing a branch.
The train↔inference precision-mismatch rabbit hole (TIS vs FP16), for when you reach GRPO.
Harness-shape coupling: ruling out “it’s the scaffold, not the model” before fine-tuning.
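
Until the pass@k chapter lands, the per-challenge bucketing item above can be previewed with a minimal sketch. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), from the Codex paper (arXiv:2107.03374). Everything else is an assumption made for illustration: the challenge names, the attempt counts, the bucket labels, and the 0.5 threshold are hypothetical, not this project's measured data or its eventual protocol.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k attempts, drawn
    without replacement from n recorded attempts (c of them solves), succeeds."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which yields pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def bucket(n: int, c: int, k: int, reliable_at: float = 0.5) -> str:
    """Hypothetical three-way bucketing of one challenge (labels are illustrative)."""
    if c == 0:
        return "never-solved"          # no solve in any recorded attempt
    if pass_at_k(n, c, 1) >= reliable_at:
        return "reliable"              # usually solved in a single attempt
    return "solved-only-with-sampling"  # solvable, but rarely at k=1


# Hypothetical per-challenge tallies: name -> (attempts, solves).
attempts = {"web-01": (16, 8), "pwn-02": (16, 1), "crypto-03": (16, 0)}

report = {
    name: (pass_at_k(n, c, 1), pass_at_k(n, c, 8), bucket(n, c, 8))
    for name, (n, c) in attempts.items()
}

for name, (p1, p8, label) in report.items():
    print(f"{name}: pass@1={p1:.3f} pass@8={p8:.3f} bucket={label}")
```

Bucketing per challenge before aggregating keeps a handful of easy challenges from hiding the ones that are solved only under heavy sampling, or never. How each bucket maps onto a training branch is a question for the Diagnosis section, not something this sketch settles.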
