References
Every id below was crawl-verified during the sessions that built this book (title/authors/date confirmed on the arXiv abstract page). Lab blogs/tech reports are linked to their source. Citation counts are unreliable for <18-month-old work — venue/lab-report presence is the stronger signal.
Foundations & imitation
- Ross, Gordon & Bagnell — DAgger (behavior-cloning error compounds as O(εT²) over horizon T; DAgger reduces it to linear in T) — arXiv:1011.0686 (AISTATS 2011)
- InstructGPT (SFT + RLHF) — arXiv:2203.02155
- Hinton et al. — Knowledge Distillation — arXiv:1503.02531
- GKD / on-policy distillation — arXiv:2306.13649 (ICLR 2024)
- STaR — arXiv:2203.14465 · ReST — arXiv:2308.08998 · ReST-EM / “Beyond Human Data” (DeepMind, self-training on own correct samples beats human-data-only, math/code) — arXiv:2312.06585 · RAFT — arXiv:2304.06767 · RFT (Yuan) — arXiv:2308.01825
- Rejection Sampling → Reinforce (entropy collapse; GRPO’s implicit filtering) — arXiv:2504.11343
Preference
- DPO — arXiv:2305.18290 · KTO — arXiv:2402.01306 · IPO — arXiv:2310.12036 · ORPO — arXiv:2403.07691 · SimPO — arXiv:2405.14734
- DPO-family survey — arXiv:2503.11701
- Constitutional AI / RLAIF — arXiv:2212.08073
Reinforcement
- PPO — arXiv:1707.06347
- GAE (classical single-time-scale credit assignment) — arXiv:1506.02438
- GRPO (DeepSeekMath) — arXiv:2402.03300
- DeepSeek-R1 (RLVR pipeline) — arXiv:2501.12948
- GSPO (Qwen3) — arXiv:2507.18071 · DAPO — arXiv:2503.14476
- Dr.GRPO (removes GRPO’s length-bias reward artifact) — arXiv:2503.20783
- REINFORCE++ (critic-free, no group-sampling requirement) — arXiv:2501.03262
- VAPO (value-model pretraining + length-adaptive GAE for long/heterogeneous responses) — arXiv:2504.05118
- PRM (Let’s Verify Step by Step) — arXiv:2305.20050 · Process Reward Models That Think — arXiv:2504.16828
Long-horizon & multi-turn agentic RL — credit assignment across turns, not tokens
- GiGPO (step-level advantage from state-hash-matched steps across rollouts, zero extra rollouts) — arXiv:2505.10978
- ArCHer (two-timescale: off-policy turn-level critic + on-policy token-level PG) — arXiv:2402.19446
- RAGEN / StarPO (multi-turn agentic RL framework, state-thinking-action loop) — arXiv:2504.20073
- Turn-Level Reward Design (dense per-turn reward layered under a terminal reward) — arXiv:2505.11821
- Turn-PPO (turn as the MDP unit, not token or trajectory) — arXiv:2512.17008
- Demystifying RL for Long-Horizon Tool-Using Agents (5-axis systematic ablation: reward/scale/data/algorithm/environment) — arXiv:2603.21972
- Verlog (dual-discount GAE, memory-windowing, validated to 400+ turn episodes) — no arXiv id, cite OpenReview:GmodkWwMV3
- Kimi k1.5 (128k-context RL via partial-rollout checkpoint/resume, no MCTS/value-fn/PRM) — arXiv:2501.12599
- AgentGym-RL / ScalingInter-RL (horizon curriculum: short turn cap expanding to full budget over training) — arXiv:2509.08755
- MUA-RL (trains against a dynamic, LLM-simulated counterpart instead of a static script) — arXiv:2508.18669
- HiPER (hierarchical credit assignment) — arXiv:2602.16165 · Hindsight Credit Assignment for Long-Horizon LLM Agents — arXiv:2603.08754
- RL-PLUS (names “capability boundary collapse” — pass@k at large k dropping even as pass@1 rises under RLVR) — arXiv:2508.00222
Hierarchical RL, decomposition & potential-based reward shaping — “one problem, or many?”
- Sutton — “The Bitter Lesson” (hand-built structure plateaus, general search+learning wins at scale; intellectual ancestor of the monolithic-outcome-RL case) — no arXiv id, incompleteideas.net (2019)
- OpenAI Deep Research system card (long-horizon tool-using agent trained end-to-end on outcome/rubric reward; “end-to-end training beats manual orchestration”) — no arXiv id, OpenAI system card
- Options / SMDP framework (Sutton, Precup, Singh — the seminal HRL / temporal-abstraction paper) — Artificial Intelligence 112 (1999), no arXiv id, DOI 10.1016/S0004-3702(99)00052-1
- FeUdal Networks (Manager/Worker HRL, fixes option-collapse) — arXiv:1703.01161
- HIRO (off-policy correction for HRL non-stationarity / subgoal ceiling-capping) — arXiv:1805.08296
- Ng, Harada & Russell — potential-based reward shaping, F(s,a,s') = γΦ(s') − Φ(s), provably policy-invariant — ICML 1999, no arXiv id (predates arXiv’s routine ML use), ACM DL 10.5555/645528.657613
- Müller & Kudenko (PBRS practical effectiveness still depends on potential scaling) — arXiv:2502.01307
- RUDDER (learned, return-equivalent reward redistribution — a learned alternative to hand-specifying Φ) — arXiv:1806.07857
- Go-Explore (pure outcome RL structurally fails on sparse/deceptive long-horizon tasks without explicit remember-and-return exploration) — arXiv:1901.10995 / Nature s41586-020-03157-9
- OpenAI Five (long-horizon precedent needed huge scale + a per-frame shaped reward, not a single terminal bit) — arXiv:1912.06680
- Credit Assignment survey (separates credit-assignment variance from exploration burden) — arXiv:2312.01072
- MiRA (milestone-based dense reward; Gemma3-12B WebArena-Lite 6.4%→43.0%, beating WebRL/GPT-4-Turbo) — arXiv:2603.19685
- Verifiable Process Rewards / VPR (safe ground-truth checklist process reward; own caveat that open/unstructured stages remain unsolved) — arXiv:2605.10325
- CM2 (checklist-style verifiable sub-criteria reward) — arXiv:2602.12268
- Curriculum Learning (Bengio et al. — foundational, order training by difficulty, touches no reward function) — ICML 2009, no arXiv id
- h1 (curriculum + pure outcome-only reward yields an exponential sample-complexity gain) — arXiv:2510.07312
- FastCuRL (context-length curriculum, entropy-collapse timing) — arXiv:2503.17287
- BPO (curriculum + rejection-sampling refine; vanilla GRPO on sparse reward gains only marginally without curriculum) — arXiv:2508.03018
- TIPS (turn-level potential shaping for search-augmented LLMs — shaping machinery directly on-point, domain is not) — arXiv:2603.22293
- Randlov & Alstrom — the canonical non-potential-based “bicycle shaping” failure (agent farms a looks-like-progress bonus instead of reaching the goal) — ICML 1998, no arXiv id
- Kill-chain-staged reward (cyber-defense red-teaming) (academic cybersecurity-LLM work — cited for context only, not a basis) — arXiv:2605.17075 (May 2026)
- DRLRM-PT (reward machine over kill-chain phases, classical/non-LLM pentest RL) (academic — cited for context only, not a basis; explicitly named in the project’s standing rule) — arXiv:2405.15908 / DOI 10.1109/ijcnn60899.2024.10650368
- Node-fragility reward shaping (classical dense-reward pentest, non-LLM regime) (academic — cited for context only, not a basis) — DOI 10.3390/electronics13214311
Tool-integrated / tool-use RL — the direct BSides pattern-1 (tool avoidance) fixes
- ReTool (trajectory-level tool-integrated RL) — arXiv:2504.11536
- ToRL (tool-integrated RL, math) — arXiv:2503.23383
- Search-R1 (RL for search-agent tool use) — arXiv:2503.09516
- ToolRL (fine-grained, decomposed per-call tool-selection reward) — arXiv:2504.13958
- Tool-Star (forced exposure to under-used tools via multi-tool synthesis pre-RL) — arXiv:2505.16410
- Tool Preferences in Agentic LLMs are Unreliable (diagnosis of pattern-1-shaped tool avoidance) — arXiv:2505.18135
Exploration & entropy collapse
- The Entropy Mechanism of RL for Reasoning LMs (R = -a·e^H + b; Clip-Cov/KL-Cov fixes) — arXiv:2505.22617
- Beyond the 80/20 Rule (top-20%-entropy “forking tokens” carry nearly all exploration signal) — arXiv:2506.01939
- Reasoning with Exploration: An Entropy Perspective — arXiv:2506.14758
- Representation-Based Exploration for Language Models (hidden-state diversity bonus, usable at inference time) — arXiv:2510.11686
- Pass@k Metric for RLVR: A Diagnostic Tool of Exploration, But Not an Objective — arXiv:2511.16231
- Spurious Rewards (RLVR gains on Qwen2.5-Math nearly as large with completely wrong rewards; model-family-dependent) — arXiv:2506.10947
- Absolute Zero (self-play task-proposal + solve, zero external labeled data) — arXiv:2505.03335
- LIMO (817 curated SFT examples beat >100k loosely-curated ones — SFT as cognitive templates, not knowledge source) — arXiv:2502.03387
- Test-time compute scaling (Snell et al., difficulty-adaptive allocation matches a 14x larger model) — arXiv:2408.03314 · o3-mini vs o1-mini (accuracy without longer CoT) — arXiv:2502.15631
- OpenAI o1 System Card (methodology precedent, cited across the field) — arXiv:2412.16720
- Reward-hacking-under-RL cluster: Specification Gaming in Reasoning Models — arXiv:2605.02269 · LLMs Gaming Verifiers (extensional vs intensional correctness) — arXiv:2604.15149 · Reward Hacking in the Era of Large Models (Proxy Compression Hypothesis) — arXiv:2604.13602
- Per-step / process-reward hacking convergence (the case against naive per-stage reward): PURE / Stop Summation (summation-form credit assignment “easily induces LLMs to hack steps with high rewards”) — arXiv:2504.15275 · Reward Under Attack (SOTA PRMs as “fluency detectors rather than reasoning verifiers”) — arXiv:2603.06621 · Gao et al. (learned PRM/ORM + success reward can hurt vs success-only) — arXiv:2410.15115 · PRIME (authors’ own admission that process labels are “prohibitively expensive,” PRMs vulnerable to hacking) — arXiv:2502.01456 · MONA (multi-step reward hacking even when no single step looks bad to an overseer) — arXiv:2501.13011
Self-correction & tool-use self-correction RL
- SCoRe — Training LMs to Self-Correct via RL (reward for improvement, not final correctness) — arXiv:2409.12917
- From Correction to Mastery (earliest-error RL, distinct “SCoRe”) — arXiv:2509.14257
Post-training recipe as a sequence — order, compounding, synthetic-trajectory bootstrap
(new citations from The recipe is a sequence, not a pick; ids already covered elsewhere — Llama 3, Tülu 3, OLMo 2, DeepSeek-V3/R1, Qwen3, GRPO/DeepSeekMath, LoRA-learns-less, STaR, RAFT, ReST-EM, DAPO, the rejection-sampling→REINFORCE entropy-collapse paper, “Scalpel vs Hammer” — are not repeated here.)
- Self-Instruct (off-policy synthetic instruction generation) — arXiv:2212.10560
- WizardLM / Evol-Instruct (off-policy synthetic, complexity-evolved instructions) — arXiv:2304.12244
- Llama 2 (RLHF report; iterative-round rejection-sampling non-monotonicity — “struggled more… to compose rhyming lines” when only the latest round was sampled) — arXiv:2307.09288
- Persona-driven synthetic data generation — arXiv:2406.20094
- Qwen2.5 (dense 0.5B–72B; explicit SFT→offline-DPO→online-GRPO staging) — arXiv:2412.15115
- “SFT Memorizes, RL Generalizes” (GeneralPoints/V-IRL testbed; SFT stabilizes format, RL then generalizes) — arXiv:2501.17161 (ICML 2025 poster)
- Llama-Nemotron (post-training recipe report) — arXiv:2505.00949
- Distillation-vs-pattern-imitation ablation (“only the DeepSeek model shows a meaningful increase in capability” — new knowledge, not pattern transfer, is what expands pass@k) — arXiv:2505.14216
- AdaSTaR (efficient iterative rejection-sampling; curriculum sampling, −58.6% training FLOPs at equal-or-better accuracy) — arXiv:2505.16322
- SFT-vs-RFT forgetting comparison (SFT 52.1%→40.1% drop vs. RFT 54.2% improvement on the same setting) — arXiv:2507.05386
- “RL Fine-Tuning Heals OOD Forgetting in SFT” (contested re-framing of “SFT memorizes/RL generalizes” as “SFT forgets, RL recovers”) — arXiv:2509.12235
- Domain-continual pretraining forgetting / backward-transfer at scale (“moderate forgetting, low-to-moderate backward transfer”) — arXiv:2510.17776
- Backward-synthesis answer-anchoring / confabulation risk (STaR-rationalization follow-up; the answer acts as a cognitive anchor) — arXiv:2602.14469
- “Revisiting DAgger in the Era of LLM-Agents” (SFT’s off-policy covariate shift vs. RLVR’s on-policy but sparse feedback, stated precisely for multi-turn agents) — arXiv:2605.12913
- Qwen3-4B SFT degrading TruthfulQA/HaluEval (Qwen-family-specific forgetting evidence) — arXiv:2605.20005
Continued pretraining on an instruction-tuned model — preservation techniques
(new citations from Continued pretraining on an instruction-tuned model; LoRA-learns-less-forgets-less [2405.09673] already cited above, not repeated.)
- DAPT — Gururangan et al., “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks” (ACL 2020, seminal domain-adaptive-pretraining concept) — arXiv:2004.10964
- MMLU — Hendrycks et al. — arXiv:2009.03300
- Scialom et al. — replay mitigates forgetting in continual instruction-tuning (EMNLP 2022) — arXiv:2205.12393
- Ilharco et al. — “Editing Models with Task Arithmetic” (ICLR 2023, seminal task-vector/delta-arithmetic result) — arXiv:2212.04089
- TIES-Merging (NeurIPS 2023; trim + elect-sign + merge) — arXiv:2306.01708
- Gupta et al. — continual-pretraining LR re-warm/re-decay — arXiv:2308.04014
- Luo et al. — empirical study of catastrophic forgetting in LLMs during continual fine-tuning (1B–14B; forgetting worsens with scale) — arXiv:2308.08747
- AdaptLLM — auto-converts raw domain text into reading-comprehension/QA replay pairs — arXiv:2309.09530
- Qi, Zeng et al. — “Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” (ICLR 2024) — arXiv:2310.03693
- Chat Vector (Huang et al., ACL 2024; language-shift instance of the task-arithmetic reattach recipe) — arXiv:2310.04799
- DARE — “Super Mario” random delta-dropping + rescaling (ICML 2024) — arXiv:2311.03099
- IFEval — “Instruction-Following Evaluation for Large Language Models” (Zhou et al., Google) — arXiv:2311.07911
- LLaMA Pro — block-expansion CPT that never touches original weights, then a separate instruction-tuning pass (ACL 2024) — arXiv:2401.02415
- Li & Lee — “Examining Forgetting in Continual Pre-training of Aligned Large Language Models” (direct CPT-on-Llama-2-7b-chat comparison) — arXiv:2401.03129
- RESTA — DARE-sparsified delta subtraction/restoration (ACL 2024) — arXiv:2402.11746
- Ibrahim et al. — simple/scalable continual-pretraining strategies (re-warm+re-decay+replay matches from-scratch retraining) — arXiv:2403.08763
- Qi et al. — “Safety Alignment Should Be Made More Than Just a Few Tokens Deep” (ICLR 2025 Oral; shallow safety-alignment mechanism) — arXiv:2406.05946
- Instruction Pre-Training (Microsoft, EMNLP 2024; 200M synthesized instruction-response pairs woven into raw CPT corpus) — arXiv:2406.14491
- Jindal, Badrinath, Bharti, Vinay & Sharma (Samsung Research) — “Balancing Continuous Pre-Training and Instruction Fine-Tuning” (the direct S1-vs-S2 CPT-on-instruct-vs-base comparison, 4 model families) — arXiv:2410.10739
- Mousavi, Alghisi & Riccardi (U. Trento) — “What Does Loss Optimization Actually Teach, If Anything? Knowledge Dynamics in Continual Pre-training of LLMs” (loss curves don’t reveal instruct-layer damage in real time) — arXiv:2601.03858
- Zheng, Cai, Qiu & Ma — “Spurious Forgetting in Continual Learning of Language Models” (ICLR 2025 poster; forgetting is often a task-alignment/metric artifact, not true knowledge loss) — no arXiv id, OpenReview:ScI7IlKGdI
- Harmon, Hochlehnert, Bethge & Prabhu (Tübingen AI Center) — “Mapping Post-Training Forgetting in Language Models at Scale” (~30 model pairs; “model merging does not reliably mitigate forgetting”) — no arXiv id found, anonymous ICLR 2026 submission, OpenReview:qCIg2WGudx
Datasets (proven-by-usage) — general post-training data, Sequence-B rungs
Full registry, inclusion rule, and per-dataset detail: Proven post-training datasets — a usage-cited registry. Papers backing named training recipes for these datasets (Tülu 3, OLMo 2, Qwen2.5-Math, DPO, KTO, IPO, ORPO already cited above, not repeated):
- Gorilla / APIBench (UC Berkeley, NeurIPS 2024 D&B) — arXiv:2305.15334 · dataset
- ToolBench (OpenBMB) → ToolLLaMA (ICLR 2024 spotlight) — arXiv:2307.16789 · GitHub
- PKU-SafeRLHF → Beaver-7B-v1.0 — arXiv:2310.12773 · dataset
- AgentInstruct (Zhipu/THUDM) → AgentLM (ACL 2024 Findings) — arXiv:2310.12823 · dataset
- Zephyr-7B (UltraChat-200k SFT stage → UltraFeedback DPO stage) — arXiv:2310.16944 · dataset
- LongAlign-10k (Zhipu/THUDM, EMNLP 2024 Findings) — arXiv:2401.18058 · dataset
- Aya (Cohere) → Aya-101/23/Expanse — arXiv:2402.06619 · dataset
- Agent-FLAN (InternLM, ACL 2024 Findings) — arXiv:2403.12881 · dataset
- COIG-CQIA (Chinese instruction SFT, LIMA-style) — arXiv:2403.18058 · dataset
- MAP-Neo Matrix Data Pile (bilingual EN/ZH pretrain corpus) — arXiv:2405.19327 · dataset
- Magpie (self-synthesized instruction/preference data, ICLR 2025) → Llama-3-8B-Magpie-Align / SmolLM2 — arXiv:2406.08464 · org
- HelpSteer2 → Llama-3.1-Nemotron-70B-Reward (NVIDIA, #1 RewardBench at release) — arXiv:2406.08673 · dataset
- Beaver-7B (PKU-SafeRLHF, second citation) — arXiv:2406.15513
- Salesforce xLAM-function-calling-60k → xLAM-1b/7b-fc-r (NeurIPS 2024 D&B) — arXiv:2406.18518 · dataset
- IBM Granite-20B-FunctionCalling (Glaive-function-calling-v2) — arXiv:2407.00121 · dataset
- ToolACE → ToolACE-8B (Huawei, ICLR 2025) — arXiv:2409.00920 · dataset
- OpenMathInstruct-2 → OpenMath2-Llama3.1 (NVIDIA, ICLR 2025) — arXiv:2410.01560 · dataset
- Skywork-Reward-Preference-80K-v0.2 (#1 RewardBench) — arXiv:2410.18451 · dataset
- OpenCSG Chinese-Cosmopedia → csg-wukong-1B — arXiv:2501.08197 · dataset
- SmolLM2 (Smol-Magpie-Ultra derivative) — arXiv:2502.02737 · dataset
- OpenCodeReasoning-Nemotron (NVIDIA; SFT-only beats RL alternatives on LiveCodeBench) — arXiv:2504.01943 · dataset
- APIGen-MT → Salesforce xLAM-2 series (SOTA BFCL + τ-bench) — arXiv:2504.03601 · project
- COIG-P (Chinese preference/DPO, EACL 2026 Findings) — arXiv:2504.05535 · dataset
- OpenMathReasoning → OpenMath-Nemotron (NVIDIA’s AIMO-2-winning submission) — arXiv:2504.16891 · dataset
- OpenThoughts3-1.2M → OpenThinker3-7B (SOTA-open-data at release) — arXiv:2506.04178 · dataset
- BAAI Infinity-Instruct → InfInstruct family (beats GPT-4-0314 by 8.6% on IF) — arXiv:2506.11116 · dataset
Domain-specialization lineages — the frontier recipe (code / math / medical)
- Kaplan et al. — Scaling Laws for Neural Language Models — arXiv:2001.08361
- Hoffmann et al. — Chinchilla, compute-optimal scaling — arXiv:2203.15556
- phi-1 (textbook-quality data substitutes for ~100x scale in a narrow domain) — arXiv:2306.11644
- DeepSeek-Coder V2 (abandons from-scratch pretraining, continues from a strong base) — arXiv:2406.11931
- Qwen2.5-Math (co-evolves RM + SFT data across rounds before RL, reuses RM for best-of-N at inference) — arXiv:2409.12122
- DeepSeek-Prover-V2 (subgoal-decomposed cold-start data + kernel-verified RL) — arXiv:2504.21801
- OLMo 2 (mid-training as a named 5-10% FLOPs bridge stage) — arXiv:2501.00656
- Mid-training mechanism study (outperforms CPT-alone at matched budget, reduces catastrophic forgetting before SFT) — arXiv:2510.14865
- Med-PaLM v1 (mentioned, not a basis for cyber claims — prompt-tuning-only ceiling, motivates v2) — arXiv:2212.13138
- Med-PaLM 2 (domain instruction fine-tuning + ensemble refinement) — arXiv:2305.09617
- MedGemma (domain VLM pretraining + task fine-tuning, explicitly not clinical-grade alone) — arXiv:2507.05201
- Full fine-tuning vs. LoRA on code/math domain-skill acquisition (10-100x effective-rank gap) — arXiv:2405.09673
- Benchmark contamination (GSM8K/MMLU scores inflated up to 22.9%/19.0%) — arXiv:2406.13990
- LiveCodeBench (time-segmented, contamination-resistant-by-construction) — arXiv:2403.07974
- GPQA (“Google-proof” QA) — arXiv:2311.12022
- Tülu 3 (decontamination as a first-class deliverable; primary public naming of RLVR) — arXiv:2411.15124
- Emerging RL-environment-scale bottleneck framing (moderate confidence, new/low-citation) — arXiv:2511.09586
- SIMA (scalable instructable multiworld agent) — arXiv:2404.10179
- SIMA 2 (self-generated tasks + rewards via Gemini) — arXiv:2512.04797
Adjacent-domain structural transfer — coding agents, competitive programming, theorem proving, web agents, games, robotics
- SWE-agent / Agent-Computer Interface (fixed action set + concise feedback lifts pass@1 pre-RL) — arXiv:2405.15793
- SWE-Gym (executable-environment SFT trajectories) — arXiv:2412.21139
- R2E-Gym — arXiv:2504.07164
- SWE-RL (execution-verified reward beats a difflib patch-similarity fallback) — arXiv:2502.18449
- o1→o3 coding RL — arXiv:2502.06807
- Progressive context/turn-budget curriculum for long-horizon RL — arXiv:2508.03501
- SWE-Master (mask environment-feedback tokens out of the SFT loss, low-confidence/very recent) — arXiv:2602.03411
- DeepSeek-Coder v1 (from-scratch pretraining, the lone exception in the lineage) — arXiv:2401.14196
- StarCoder2 / The Stack v2 (curation quality substitutes for parameter count — data-pipeline lesson only) — arXiv:2402.19173
- StarCoder — arXiv:2305.06161
- CodeRL — arXiv:2207.01780
- PPOCoder — arXiv:2301.13816
- RLTF — arXiv:2307.04349
- StepCoder — arXiv:2402.01391
- RLEF (turn-level value function over a multi-turn POMDP; Meta FAIR, ICML 2025 spotlight) — arXiv:2410.02089
- Sailor (SEA-language CPT) — arXiv:2404.03608
- SEA-LION — arXiv:2504.05747
- LLaMA Beyond English — arXiv:2401.01055
- Tokenizer/vocabulary coverage as an architectural precondition — arXiv:2406.11477
- BLOOM+1 (adding a new language to the SFT mixture beats continued pretraining) — arXiv:2212.09535
- Aya — arXiv:2402.07827
- AlphaCode (sampling breadth + cheap filter) — arXiv:2203.07814
- GrandCode / Agentic GRPO (purpose-built GRPO variant for delayed reward + off-policy drift; single team, very recent) — arXiv:2604.02721
- AlphaGeometry (Nature 2024; synthetic self-play manufactures its own training problems) — DOI 10.1038/s41586-023-06747-5
- AlphaProof (Nature 2025; AlphaZero-style self-play/search on top) — DOI 10.1038/s41586-025-09833-y
- WebGPT (learned human-preference reward; origin of the SFT-cold-start-then-RL recipe shape, flagged OOD-weak by its own authors) — arXiv:2112.09332
- WebRL (self-evolving curriculum generated from the model’s own unsuccessful attempts; ICLR 2025) — arXiv:2411.02337
- R1-Searcher (sequential, not summed, two-stage tool-use reward) — arXiv:2503.05592
- DeepResearcher (training end-to-end in the real live environment is “a fundamental requirement”; EMNLP 2025) — arXiv:2504.03160
- AlphaStar (Nature; league-based diverse self-play population fixes strategy collapse) — DOI 10.1038/s41586-019-1724-z
- NetHack (honest “still unsolved” calibration point) — arXiv:2006.13760
- HER — Hindsight Experience Replay (relabel a failed trajectory as the goal it accidentally satisfied; NeurIPS 2017) — arXiv:1707.01495
- Firestone — competence vs. performance (formal/functional split; independently reconfirmed by the particular-language literature) — PMC7604508
- KLong (progressive horizon curriculum, second converging source; low-confidence) — arXiv:2602.17547
- BEACON (2026 long-horizon credit-assignment cluster; low-confidence individually) — arXiv:2605.06078
- Ecpo (2026 long-horizon credit-assignment cluster; low-confidence individually) — arXiv:2606.05885
CTF / pentest RL environments
(academic, cited for context — not a basis for our decisions; see the stance in Contested edges)
- CTF-Dojo (486 verified trajectories, 31.9% pass@1 credibility yardstick) — arXiv:2508.18370
- Cyber-Zero (monolithic outcome-RL, simulated env, +13.1%) — arXiv:2508.00910
- Pentest-R1 (two-stage RL for CTF methodology) — arXiv:2508.07382
- HackSynth (crypto-CTF GRPO) — arXiv:2506.02048
- InterCode-CTF (seminal monolithic-reward CTF environment) — arXiv:2306.14898
- Cybench (subtask decomposition, eval-only) — arXiv:2408.08926
- AutoPenBench (milestone taxonomy near-matching this book’s F1–F4 split, eval-only) — arXiv:2410.03225
- NYU CTF Bench (CTF benchmark family) — arXiv:2406.05590
- EnIGMA (“soliloquizing” fabrication failure mode, ICML 2025) — arXiv:2409.16165
- Guided Reasoning via Structured Attack Trees (deterministic ATT&CK-derived task tree, +5x subtask completion on the same weights) — arXiv:2509.07939
- From Capabilities to Performance (pentesting ablations) — arXiv:2509.14289
- PentestAgent (RAG-fix framing of a knowledge gap; contested against the scaffolding/execution readings above) — arXiv:2411.05185
- Capture the Flags: Family-Based Evaluation via Semantics-Preserving Transformations (CTF-specific robustness benchmark) — arXiv:2602.05523
- What Makes a Good LLM Agent for Real-world Penetration Testing? (Task Difficulty Assessment + Evidence-Guided Attack Tree Search) — arXiv:2602.17622
Agent benchmarks & failure taxonomies
- τ-bench (fault-assignment × fault-type taxonomy) — arXiv:2406.12045
- AgentRx (localizes the single critical failure step in a long trajectory) — arXiv:2602.02475
- AgentBoard (“progress rate” metric — general, non-security capability-decomposition principle, NeurIPS 2024 Oral) — arXiv:2401.13178
- MAST (14-mode/3-category multi-agent failure taxonomy, κ=0.88) — arXiv:2503.13657
- AgentErrorTaxonomy / AgentDebug (root-cause diagnosis alone, no reward change, buys +24% all-correct accuracy) — arXiv:2509.25370
- Phase-aligned taxonomy for autonomous agents (independent-domain convergence on a phase-keyed failure split) — arXiv:2508.13143
Capability boundary, elicitation & sandbagging (contested)
- Yue et al. — RL elicits, not expands — arXiv:2504.13837 (2025-04-18)
- ProRL — prolonged RL expands — arXiv:2505.24864
- Cohen-Inger et al. — “LLMs are Like a Chameleon” (benchmark scores mask overfitting; semantics-preserving perturbation robustness check) — arXiv:2502.07445 (2025-02-11)
- Zhang et al. — “Memorize or Generalize?” (Memorization Risk Index via semantic-perturbation code rewriting; companion robustness-check citation) — arXiv:2503.02296 (2025-03-04)
- PSN-RLVR — arXiv:2602.02555 · NuRL — arXiv:2509.25666 · CoT-Pass@K (Wen et al., RLVR implicitly incentivizes correct reasoning) — arXiv:2506.14245 (2025-06-17)
- Scalpel vs Hammer (GRPO amplifies, SFT replaces) — arXiv:2507.10616
- Zhai et al. — Does RL Expand the Capability Boundary of LLM Agents? Pass@(k,T) — arXiv:2604.14877 (2026-04-16)
- Dragoi et al. — Beyond Pass@k: Breadth-Depth Metrics / Cover@τ — arXiv:2510.08325 (2025-10-09)
- Kang et al. — Quagmires in SFT-RL Post-Training — arXiv:2510.01624 (2025-10-02)
- Chen et al. — The Coverage Principle — arXiv:2510.15020 (2025-10-16)
- Greenblatt et al. — Stress-Testing Capability Elicitation with Password-Locked Models — arXiv:2405.19550 (2024-05-29)
- Hofstätter et al. — The Elicitation Game — arXiv:2502.02180 (2025-02-04)
- van der Weij et al. — AI Sandbagging — arXiv:2406.07358 (2024-06-11)
- Ryd et al. — Removing Sandbagging via Weak Supervision — arXiv:2604.22082 (2026)
- Stroebl et al. — Inference Scaling fLaws — arXiv:2411.17501 (2024-11-26)
- Dorner et al. — ROC-n-reroll — arXiv:2507.12399 (2025-07-16)
- Huang et al. — Is Best-of-N the Best of Them? — arXiv:2503.21878 (2025-03-27, ICML 2025)
- Mahowald et al. — Dissociating Language and Thought in LLMs — arXiv:2301.06627 (2023-01-16)
- He et al. — LLMs as Neurolinguistic Subjects — arXiv:2411.07533 (2024-11-12)
- Boháček et al. — Uncovering Competency Gaps (sparse autoencoders on internal representations) — arXiv:2512.20638 (2025-12-06)
PEFT
- LoRA — arXiv:2106.09685 · QLoRA — arXiv:2305.14314 · DoRA — arXiv:2402.09353
- Unified LoRA-variant study (LR-sensitivity) — arXiv:2601.22708
Frontier lab recipes (reports & blogs) — full-year refresh, 2025-07 → 2026-07, all 10 tracked labs
Llama (historical anchor): Llama 3 — arXiv:2407.21783 · Llama 4 — ai.meta.com/blog/llama-4-multimodal-intelligence
Anthropic (Claude) — Constitutional AI/RLAIF backbone arXiv:2212.08073; inoculation prompting arXiv:2510.04340, Anthropic’s own study arXiv:2511.18397; release posts/system cards — Opus 4.1 · Sonnet 4.5 card · Sonnet 4.5 · Sonnet 4.5 research · Haiku 4.5 card (PDF) · Haiku 4.5 · Opus 4.5 card · Opus 4.5 research · Opus 4.5 card walkthrough (secondary) · inoculation prompting post · emergent misalignment / reward hacking · “teaching Claude why” post · research page · Opus 4.6 · Opus 4.6 sabotage risk report (PDF) · Sonnet 4.6 · Opus 4.7 · Opus 4.8 · Fable 5 / Mythos 5 · Fable 5 & Mythos 5 card (PDF) · Mythos guardrails coverage (secondary) · Sonnet 5 · Sonnet 5 card · Sonnet 5 launch coverage (secondary) · Sonnet 5 launch guide (secondary) · AI organizations post
OpenAI — GPT-5 (2025-08-07) · GPT-5 system card (PDF) · GPT-5 for developers · safe-completions arXiv:2508.09224 · safe-completions post · GPT-5-Codex addendum · Codex system card (PDF) · GPT-5.1 · GPT-5.1 deployment safety · routing/model-choice post (secondary) · GPT-5.1-Codex-Max · Codex-Max system card · Codex-Max safety training · long-horizon Codex tasks · GPT-5.2 · GPT-5.2 for science/math · GPT-5.2 system-card update · GPT-5.2-Codex · GPT-5.3-Codex · 5.3-Codex system card · 5.3-Codex coverage (secondary) · GPT-5.4 · GPT-5.4 thinking system card · GPT-5.4 (secondary) · graders docs · RFT guide · RFT use-cases · RFT wind-down, 2026-05-08 · community thread (secondary)
Google DeepMind (Gemini) — Gemini 2.5 tech report arXiv:2507.06261 (HTML, §2.4/2.5 mirror, cross-checked (secondary)) · Deep Think launch · Gemini 3 Pro model card (PDF) · Gemini 3 launch · agent-building with Gemini 3 · Gemini 3 Deep Think · Deep Think update · Gemini 3 Flash · Flash for enterprise · Gemini 3.1 Pro · Gemini 3.5 · Vending-Bench/τ²-bench cross-check (secondary)
xAI (Grok) — Grok 4 · Grok 4 model card (PDF) · Grok 4 analysis (secondary) · Grok Code Fast 1 · Code Fast 1 model card (PDF) · Grok 4 Fast · Grok 4 Fast model card (PDF) · coverage (secondary) · coverage (secondary) · Grok 4.1 · Grok 4.1 model card (PDF) · sycophancy coverage (secondary) · Grok 4.1 Fast · news index
Mistral — Magistral arXiv:2506.10910 (HF paper page, benchmark tables) · Ministral 3 arXiv:2601.08584 · Mistral 3 blog · Magistral blog · Mistral-Large-3 card · Magistral-Small-2509 card · Magistral-Small-2507 card · Magistral Medium 1.2 docs
DeepSeek — V3 arXiv:2412.19437 · R1 arXiv:2501.12948 · V3.2 arXiv:2512.02556 (HTML) · V3.1 release · V3.1 card · V3.1-Terminus · V3.1-Terminus card · V3.2-Exp · V3.2-Exp repo · V3.2 / V3.2-Speciale
Qwen (Alibaba) — Qwen3 tech report arXiv:2505.09388 · GSPO arXiv:2507.18071 · Qwen3-Omni arXiv:2509.17765 · Qwen3-VL arXiv:2511.21631 · Qwen3-Coder-Next arXiv:2603.00729 · Qwen3.5-Omni arXiv:2604.15804 · GSPO blog · Qwen3-Coder blog · Qwen3 blog · Qwen3 README · Qwen3-Next efficiency · Qwen3-Max · Qwen3-Max-Thinking · Qwen3.5 blog · Qwen3.5-397B-A17B card · Qwen3.7-Max “agent frontier” · Qwen3.7 blog · The Batch coverage (secondary) · VentureBeat coverage (secondary) · TechSphere coverage (secondary) · Qwen3.6-35B-A3B agentic coding
Moonshot AI (Kimi) — K2 arXiv:2507.20534 (verified via arXiv abs + full HTML crawl) · K2.5 arXiv:2602.02276 (verified via arXiv abs + full HTML crawl) · Kimi-K2-Thinking card (no arXiv paper) · K2 Thinking intro post · Kimi-K2.6 card (no arXiv paper) · K2.6 tech blog · K2.6 benchmark deltas · K2.6 coverage (secondary) · K2.6 method non-disclosure (secondary) · Kimi-K2.5 model card/benchmarks
GLM / Z.ai (Zhipu) — GLM-4.5 arXiv:2508.06471 · GLM-5 arXiv:2602.15763 (HTML) · GLM-4.5 blog · GLM-4.6 blog · GLM-4.7 blog · GLM-5 blog · GLM-5.2 blog · GLM-4.5 repo · GLM-5 repo · slime RL infra repo · GLM-4.7-Flash card · MoE architecture deep-dive (secondary, unverified-primary) · agentic RL post citing GLM-5 report (secondary) · RL infra post citing GLM-5 report (secondary) · GLM-5.2 vs 5.1 (secondary) · GLM-5.2 open-source coverage (secondary) · GLM-4.7-Flash coverage (secondary)
Xiaomi (MiMo) — MiMo-V2-Flash arXiv:2601.02780 · MiMo-Embodied arXiv:2511.16518 · MiMo-VL-Miloco arXiv:2512.17436 · MiMo-Audio arXiv:2512.23808 · MiMo lineage (background only, outside the 12-month window): MiMo arXiv:2505.07608, MiMo-VL arXiv:2506.03569 · MiMo-V2.5-Pro card · MiMo-V2.5 card · MiMo-V2.5-Pro blog · MiMo model-update docs
The verified concept-map + genealogy notes in the shared memory pool (research/post-training-inference-concept-map.md, research/post-training-method-genealogy-onpolicy-offpolicy.md, research/frontier-lab-post-training-recipes-2026.md, research/rl-for-long-horizon-exploration-reasoning.md, research/diagnosing-capability-vs-execution-gap-framework.md) are the machine-readable companions to this book.