References

Every id below was crawl-verified during the sessions that built this book (title/authors/date confirmed on the arXiv abstract page). Lab blogs/tech reports are linked to their source. Citation counts are unreliable for <18-month-old work — venue/lab-report presence is the stronger signal.

Foundations & imitation

Preference

Reinforcement

Long-horizon & multi-turn agentic RL — credit assignment across turns, not tokens

  • GiGPO (step-level advantage from state-hash-matched steps across rollouts, zero extra rollouts) — arXiv:2505.10978
  • ArCHer (two-timescale: off-policy turn-level critic + on-policy token-level PG) — arXiv:2402.19446
  • RAGEN / StarPO (multi-turn agentic RL framework, state-thinking-action loop) — arXiv:2504.20073
  • Turn-Level Reward Design (dense per-turn reward layered under a terminal reward) — arXiv:2505.11821
  • Turn-PPO (turn as the MDP unit, not token or trajectory) — arXiv:2512.17008
  • Demystifying RL for Long-Horizon Tool-Using Agents (5-axis systematic ablation: reward/scale/data/algorithm/environment) — arXiv:2603.21972
  • Verlog (dual-discount GAE, memory-windowing, validated to 400+ turn episodes) — no arXiv id, cite OpenReview:GmodkWwMV3
  • Kimi k1.5 (128k-context RL via partial-rollout checkpoint/resume, no MCTS/value-fn/PRM) — arXiv:2501.12599
  • AgentGym-RL / ScalingInter-RL (horizon curriculum: short turn cap expanding to full budget over training) — arXiv:2509.08755
  • MUA-RL (trains against a dynamic, LLM-simulated counterpart instead of a static script) — arXiv:2508.18669
  • HiPER (hierarchical credit assignment) — arXiv:2602.16165 · Hindsight Credit Assignment for Long-Horizon LLM Agents — arXiv:2603.08754
  • RL-PLUS (names “capability boundary collapse” — pass@k at large k dropping even as pass@1 rises under RLVR) — arXiv:2508.00222

Hierarchical RL, decomposition & potential-based reward shaping — “one problem, or many?”

  • Sutton — “The Bitter Lesson” (hand-built structure plateaus, general search+learning wins at scale; intellectual ancestor of the monolithic-outcome-RL case) — no arXiv id, incompleteideas.net (2019)
  • OpenAI Deep Research system card (long-horizon tool-using agent trained end-to-end on outcome/rubric reward; “end-to-end training beats manual orchestration”) — no arXiv id, OpenAI system card
  • Options / SMDP framework (Sutton, Precup, Singh — the seminal HRL / temporal-abstraction paper) — Artificial Intelligence 112 (1999), no arXiv id, DOI 10.1016/S0004-3702(99)00052-1
  • FeUdal Networks (Manager/Worker HRL, fixes option-collapse) — arXiv:1703.01161
  • HIRO (off-policy correction for HRL non-stationarity / subgoal ceiling-capping) — arXiv:1805.08296
  • Ng, Harada & Russell — potential-based reward shaping, F(s,a,s')=γΦ(s')−Φ(s) provably policy-invariant — ICML 1999, no arXiv id (predates arXiv’s routine ML use), ACM DL 10.5555/645528.657613
  • Müller & Kudenko (PBRS practical effectiveness still depends on potential scaling) — arXiv:2502.01307
  • RUDDER (learned, return-equivalent reward redistribution — a learned alternative to hand-specifying Φ) — arXiv:1806.07857
  • Go-Explore (pure outcome RL structurally fails on sparse/deceptive long-horizon tasks without explicit remember-and-return exploration) — arXiv:1901.10995 / Nature s41586-020-03157-9
  • OpenAI Five (long-horizon precedent needed huge scale + a per-frame shaped reward, not a single terminal bit) — arXiv:1912.06680
  • Credit Assignment survey (separates credit-assignment variance from exploration burden) — arXiv:2312.01072
  • MiRA (milestone-based dense reward; Gemma3-12B WebArena-Lite 6.4%→43.0%, beating WebRL/GPT-4-Turbo) — arXiv:2603.19685
  • Verifiable Process Rewards / VPR (safe ground-truth checklist process reward; own caveat that open/unstructured stages remain unsolved) — arXiv:2605.10325
  • CM2 (checklist-style verifiable sub-criteria reward) — arXiv:2602.12268
  • Curriculum Learning (Bengio et al. — foundational, order training by difficulty, touches no reward function) — ICML 2009, no arXiv id
  • h1 (curriculum + pure outcome-only reward yields an exponential sample-complexity gain) — arXiv:2510.07312
  • FastCuRL (context-length curriculum, entropy-collapse timing) — arXiv:2503.17287
  • BPO (curriculum + rejection-sampling refine; vanilla GRPO on sparse reward gains only marginally without curriculum) — arXiv:2508.03018
  • TIPS (turn-level potential shaping for search-augmented LLMs — shaping machinery directly on-point, domain is not) — arXiv:2603.22293
  • Randlov & Alstrom — the canonical non-potential-based “bicycle shaping” failure (agent farms a looks-like-progress bonus instead of reaching the goal) — ICML 1998, no arXiv id
  • Kill-chain-staged reward (cyber-defense red-teaming) (academic cybersecurity-LLM work — cited for context only, not a basis) — arXiv:2605.17075 (May 2026)
  • DRLRM-PT (reward machine over kill-chain phases, classical/non-LLM pentest RL) (academic — cited for context only, not a basis; explicitly named in the project’s standing rule) — arXiv:2405.15908 / DOI 10.1109/ijcnn60899.2024.10650368
  • Node-fragility reward shaping (classical dense-reward pentest, non-LLM regime) (academic — cited for context only, not a basis) — DOI 10.3390/electronics13214311

Tool-integrated / tool-use RL — the direct BSides pattern-1 (tool avoidance) fixes

  • ReTool (trajectory-level tool-integrated RL) — arXiv:2504.11536
  • ToRL (tool-integrated RL, math) — arXiv:2503.23383
  • Search-R1 (RL for search-agent tool use) — arXiv:2503.09516
  • ToolRL (fine-grained, decomposed per-call tool-selection reward) — arXiv:2504.13958
  • Tool-Star (forced exposure to under-used tools via multi-tool synthesis pre-RL) — arXiv:2505.16410
  • Tool Preferences in Agentic LLMs are Unreliable (diagnosis of pattern-1-shaped tool avoidance) — arXiv:2505.18135

Exploration & entropy collapse

  • The Entropy Mechanism of RL for Reasoning LMs (R = -a·e^H + b; Clip-Cov/KL-Cov fixes) — arXiv:2505.22617
  • Beyond the 80/20 Rule (top-20%-entropy “forking tokens” carry nearly all exploration signal) — arXiv:2506.01939
  • Reasoning with Exploration: An Entropy Perspective — arXiv:2506.14758
  • Representation-Based Exploration for Language Models (hidden-state diversity bonus, usable at inference time) — arXiv:2510.11686
  • Pass@k Metric for RLVR: A Diagnostic Tool of Exploration, But Not an Objective — arXiv:2511.16231
  • Spurious Rewards (RLVR gains on Qwen2.5-Math nearly as large with completely wrong rewards; model-family-dependent) — arXiv:2506.10947
  • Absolute Zero (self-play task-proposal + solve, zero external labeled data) — arXiv:2505.03335
  • LIMO (817 curated SFT examples beat >100k loosely-curated ones — SFT as cognitive templates, not knowledge source) — arXiv:2502.03387
  • Test-time compute scaling (Snell et al., difficulty-adaptive allocation matches a 14x larger model) — arXiv:2408.03314 · o3-mini vs o1-mini (accuracy without longer CoT) — arXiv:2502.15631
  • OpenAI o1 System Card (methodology precedent, cited across the field) — arXiv:2412.16720
  • Reward-hacking-under-RL cluster: Specification Gaming in Reasoning Models — arXiv:2605.02269 · LLMs Gaming Verifiers (extensional vs intensional correctness) — arXiv:2604.15149 · Reward Hacking in the Era of Large Models (Proxy Compression Hypothesis) — arXiv:2604.13602
  • Per-step / process-reward hacking convergence (the case against naive per-stage reward): PURE / Stop Summation (summation-form credit assignment “easily induces LLMs to hack steps with high rewards”) — arXiv:2504.15275 · Reward Under Attack (SOTA PRMs as “fluency detectors rather than reasoning verifiers”) — arXiv:2603.06621 · Gao et al. (learned PRM/ORM + success reward can hurt vs success-only) — arXiv:2410.15115 · PRIME (authors’ own admission that process labels are “prohibitively expensive,” PRMs vulnerable to hacking) — arXiv:2502.01456 · MONA (multi-step reward hacking even when no single step looks bad to an overseer) — arXiv:2501.13011

Self-correction & tool-use self-correction RL

  • SCoRe — Training LMs to Self-Correct via RL (reward for improvement, not final correctness) — arXiv:2409.12917
  • From Correction to Mastery (earliest-error RL; a distinct method that shares the “SCoRe” name) — arXiv:2509.14257

Post-training recipe as a sequence — order, compounding, synthetic-trajectory bootstrap

(new citations from “The recipe is a sequence, not a pick”; ids already covered elsewhere — Llama 3, Tülu 3, OLMo 2, DeepSeek-V3/R1, Qwen3, GRPO/DeepSeekMath, LoRA-learns-less, STaR, RAFT, ReST-EM, DAPO, the rejection-sampling→REINFORCE entropy-collapse paper, “Scalpel vs Hammer” — are not repeated here.)

  • Self-Instruct (off-policy synthetic instruction generation) — arXiv:2212.10560
  • WizardLM / Evol-Instruct (off-policy synthetic, complexity-evolved instructions) — arXiv:2304.12244
  • Llama 2 (RLHF report; iterative-round rejection-sampling non-monotonicity — “struggled more… to compose rhyming lines” when only the latest round was sampled) — arXiv:2307.09288
  • Persona-driven synthetic data generation — arXiv:2406.20094
  • Qwen2.5 (dense 0.5B–72B; explicit SFT→offline-DPO→online-GRPO staging) — arXiv:2412.15115
  • “SFT Memorizes, RL Generalizes” (GeneralPoints/V-IRL testbed; SFT stabilizes format, RL then generalizes) — arXiv:2501.17161 (ICML 2025 poster)
  • Llama-Nemotron (post-training recipe report) — arXiv:2505.00949
  • Distillation-vs-pattern-imitation ablation (“only the DeepSeek model shows a meaningful increase in capability” — new knowledge, not pattern transfer, is what expands pass@k) — arXiv:2505.14216
  • AdaSTaR (efficient iterative rejection-sampling; curriculum sampling, −58.6% training FLOPs at equal-or-better accuracy) — arXiv:2505.16322
  • SFT-vs-RFT forgetting comparison (SFT 52.1%→40.1% drop vs. RFT 54.2% improvement on the same setting) — arXiv:2507.05386
  • “RL Fine-Tuning Heals OOD Forgetting in SFT” (contested re-framing of “SFT memorizes/RL generalizes” as “SFT forgets, RL recovers”) — arXiv:2509.12235
  • Domain-continual pretraining forgetting / backward-transfer at scale (“moderate forgetting, low-to-moderate backward transfer”) — arXiv:2510.17776
  • Backward-synthesis answer-anchoring / confabulation risk (STaR-rationalization follow-up; the answer acts as a cognitive anchor) — arXiv:2602.14469
  • “Revisiting DAgger in the Era of LLM-Agents” (SFT’s off-policy covariate shift vs. RLVR’s on-policy but sparse feedback, stated precisely for multi-turn agents) — arXiv:2605.12913
  • Qwen3-4B SFT degrading TruthfulQA/HaluEval (Qwen-family-specific forgetting evidence) — arXiv:2605.20005

Continued pretraining on an instruction-tuned model — preservation techniques

(new citations from “Continued pretraining on an instruction-tuned model”; LoRA-learns-less-forgets-less [2405.09673] is cited elsewhere in this list and not repeated here.)

  • DAPT — Gururangan et al., “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks” (ACL 2020, seminal domain-adaptive-pretraining concept) — arXiv:2004.10964
  • MMLU — Hendrycks et al. — arXiv:2009.03300
  • Scialom et al. — replay mitigates forgetting in continual instruction-tuning (EMNLP 2022) — arXiv:2205.12393
  • Ilharco et al. — “Editing Models with Task Arithmetic” (ICLR 2023, seminal task-vector/delta-arithmetic result) — arXiv:2212.04089
  • TIES-Merging (NeurIPS 2023; trim + elect-sign + merge) — arXiv:2306.01708
  • Gupta et al. — continual-pretraining LR re-warm/re-decay — arXiv:2308.04014
  • Luo et al. — empirical study of catastrophic forgetting in LLMs during continual fine-tuning (1B–14B; forgetting worsens with scale) — arXiv:2308.08747
  • AdaptLLM — auto-converts raw domain text into reading-comprehension/QA replay pairs — arXiv:2309.09530
  • Qi, Zeng et al. — “Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” (ICLR 2024) — arXiv:2310.03693
  • Chat Vector (Huang et al., ACL 2024; language-shift instance of the task-arithmetic reattach recipe) — arXiv:2310.04799
  • DARE — “Super Mario” random delta-dropping + rescaling (ICML 2024) — arXiv:2311.03099
  • IFEval — “Instruction-Following Evaluation for Large Language Models” (Zhou et al., Google) — arXiv:2311.07911
  • LLaMA Pro — block-expansion CPT that never touches original weights, then a separate instruction-tuning pass (ACL 2024) — arXiv:2401.02415
  • Li & Lee — “Examining Forgetting in Continual Pre-training of Aligned Large Language Models” (direct CPT-on-Llama-2-7b-chat comparison) — arXiv:2401.03129
  • RESTA — DARE-sparsified delta subtraction/restoration (ACL 2024) — arXiv:2402.11746
  • Ibrahim et al. — simple/scalable continual-pretraining strategies (re-warm+re-decay+replay matches from-scratch retraining) — arXiv:2403.08763
  • Qi et al. — “Safety Alignment Should Be Made More Than Just a Few Tokens Deep” (ICLR 2025 Oral; shallow safety-alignment mechanism) — arXiv:2406.05946
  • Instruction Pre-Training (Microsoft, EMNLP 2024; 200M synthesized instruction-response pairs woven into raw CPT corpus) — arXiv:2406.14491
  • Jindal, Badrinath, Bharti, Vinay & Sharma (Samsung Research) — “Balancing Continuous Pre-Training and Instruction Fine-Tuning” (the direct S1-vs-S2 CPT-on-instruct-vs-base comparison, 4 model families) — arXiv:2410.10739
  • Mousavi, Alghisi & Riccardi (U. Trento) — “What Does Loss Optimization Actually Teach, If Anything? Knowledge Dynamics in Continual Pre-training of LLMs” (loss curves don’t reveal instruct-layer damage in real time) — arXiv:2601.03858
  • Zheng, Cai, Qiu & Ma — “Spurious Forgetting in Continual Learning of Language Models” (ICLR 2025 poster; forgetting is often a task-alignment/metric artifact, not true knowledge loss) — no arXiv id, OpenReview:ScI7IlKGdI
  • Harmon, Hochlehnert, Bethge & Prabhu (Tübingen AI Center) — “Mapping Post-Training Forgetting in Language Models at Scale” (~30 model pairs; “model merging does not reliably mitigate forgetting”) — no arXiv id found, anonymous ICLR 2026 submission, OpenReview:qCIg2WGudx

Datasets (proven-by-usage) — general post-training data, Sequence-B rungs

Full registry, inclusion rule, and per-dataset detail: “Proven post-training datasets — a usage-cited registry”. Papers backing named training recipes for these datasets (Tülu 3, OLMo 2, Qwen2.5-Math, DPO, KTO, IPO, ORPO are cited elsewhere in this list and not repeated):

Domain-specialization lineages — the frontier recipe (code / math / medical)

  • Kaplan et al. — Scaling Laws for Neural Language Models — arXiv:2001.08361
  • Hoffmann et al. — Chinchilla, compute-optimal scaling — arXiv:2203.15556
  • phi-1 (textbook-quality data substitutes for ~100x scale in a narrow domain) — arXiv:2306.11644
  • DeepSeek-Coder V2 (abandons from-scratch pretraining, continues from a strong base) — arXiv:2406.11931
  • Qwen2.5-Math (co-evolves RM + SFT data across rounds before RL, reuses RM for best-of-N at inference) — arXiv:2409.12122
  • DeepSeek-Prover-V2 (subgoal-decomposed cold-start data + kernel-verified RL) — arXiv:2504.21801
  • OLMo 2 (mid-training as a named 5-10% FLOPs bridge stage) — arXiv:2501.00656
  • Mid-training mechanism study (outperforms CPT-alone at matched budget, reduces catastrophic forgetting before SFT) — arXiv:2510.14865
  • Med-PaLM v1 (mentioned, not a basis for cyber claims — prompt-tuning-only ceiling, motivates v2) — arXiv:2212.13138
  • Med-PaLM 2 (domain instruction fine-tuning + ensemble refinement) — arXiv:2305.09617
  • MedGemma (domain VLM pretraining + task fine-tuning, explicitly not clinical-grade alone) — arXiv:2507.05201
  • Full fine-tuning vs. LoRA on code/math domain-skill acquisition (10-100x effective-rank gap) — arXiv:2405.09673
  • Benchmark contamination (GSM8K/MMLU scores inflated up to 22.9%/19.0%) — arXiv:2406.13990
  • LiveCodeBench (time-segmented, contamination-resistant-by-construction) — arXiv:2403.07974
  • GPQA (“Google-proof” QA) — arXiv:2311.12022
  • Tülu 3 (decontamination as a first-class deliverable; primary public naming of RLVR) — arXiv:2411.15124
  • Emerging RL-environment-scale bottleneck framing (moderate confidence, new/low-citation) — arXiv:2511.09586
  • SIMA (scalable instructable multiworld agent) — arXiv:2404.10179
  • SIMA 2 (self-generated tasks + rewards via Gemini) — arXiv:2512.04797

Adjacent-domain structural transfer — coding agents, competitive programming, theorem proving, web agents, games, robotics

  • SWE-agent / Agent-Computer Interface (fixed action set + concise feedback lifts pass@1 pre-RL) — arXiv:2405.15793
  • SWE-Gym (executable-environment SFT trajectories) — arXiv:2412.21139
  • R2E-Gym — arXiv:2504.07164
  • SWE-RL (execution-verified reward beats a difflib patch-similarity fallback) — arXiv:2502.18449
  • o1→o3 coding RL — arXiv:2502.06807
  • Progressive context/turn-budget curriculum for long-horizon RL — arXiv:2508.03501
  • SWE-Master (mask environment-feedback tokens out of the SFT loss, low-confidence/very recent) — arXiv:2602.03411
  • DeepSeek-Coder v1 (from-scratch pretraining, the lone exception in the lineage) — arXiv:2401.14196
  • StarCoder2 / The Stack v2 (curation quality substitutes for parameter count — data-pipeline lesson only) — arXiv:2402.19173
  • StarCoder — arXiv:2305.06161
  • CodeRL — arXiv:2207.01780
  • PPOCoder — arXiv:2301.13816
  • RLTF — arXiv:2307.04349
  • StepCoder — arXiv:2402.01391
  • RLEF (turn-level value function over a multi-turn POMDP; Meta FAIR, ICML 2025 spotlight) — arXiv:2410.02089
  • Sailor (SEA-language CPT) — arXiv:2404.03608
  • SEA-LION — arXiv:2504.05747
  • LLaMA Beyond English — arXiv:2401.01055
  • Tokenizer/vocabulary coverage as an architectural precondition — arXiv:2406.11477
  • BLOOM+1 (adding a new language to the SFT mixture beats continued pretraining) — arXiv:2212.09535
  • Aya — arXiv:2402.07827
  • AlphaCode (sampling breadth + cheap filter) — arXiv:2203.07814
  • GrandCode / Agentic GRPO (purpose-built GRPO variant for delayed reward + off-policy drift; single team, very recent) — arXiv:2604.02721
  • AlphaGeometry (Nature 2024; synthetic self-play manufactures its own training problems) — DOI 10.1038/s41586-023-06747-5
  • AlphaProof (Nature 2025; AlphaZero-style self-play/search on top) — DOI 10.1038/s41586-025-09833-y
  • WebGPT (learned human-preference reward; origin of the SFT-cold-start-then-RL recipe shape, flagged OOD-weak by its own authors) — arXiv:2112.09332
  • WebRL (self-evolving curriculum generated from the model’s own unsuccessful attempts; ICLR 2025) — arXiv:2411.02337
  • R1-Searcher (sequential, not summed, two-stage tool-use reward) — arXiv:2503.05592
  • DeepResearcher (training end-to-end in the real live environment is “a fundamental requirement”; EMNLP 2025) — arXiv:2504.03160
  • AlphaStar (Nature; league-based diverse self-play population fixes strategy collapse) — DOI 10.1038/s41586-019-1724-z
  • NetHack (honest “still unsolved” calibration point) — arXiv:2006.13760
  • HER — Hindsight Experience Replay (relabel a failed trajectory as the goal it accidentally satisfied; NeurIPS 2017) — arXiv:1707.01495
  • Firestone — competence vs. performance (formal/functional split; independently reconfirmed by the particular-language literature) — PMC7604508
  • KLong (progressive horizon curriculum, second converging source; low-confidence) — arXiv:2602.17547
  • BEACON (2026 long-horizon credit-assignment cluster; low-confidence individually) — arXiv:2605.06078
  • Ecpo (2026 long-horizon credit-assignment cluster; low-confidence individually) — arXiv:2606.05885

CTF / pentest RL environments

(academic, cited for context — not a basis for our decisions; see the stance in “Contested edges”)

  • CTF-Dojo (486 verified trajectories, 31.9% pass@1 credibility yardstick) — arXiv:2508.18370
  • Cyber-Zero (monolithic outcome-RL, simulated env, +13.1%) — arXiv:2508.00910
  • Pentest-R1 (two-stage RL for CTF methodology) — arXiv:2508.07382
  • HackSynth (crypto-CTF GRPO) — arXiv:2506.02048
  • InterCode-CTF (seminal monolithic-reward CTF environment) — arXiv:2306.14898
  • Cybench (subtask decomposition, eval-only) — arXiv:2408.08926
  • AutoPenBench (milestone taxonomy near-matching this book’s F1–F4 split, eval-only) — arXiv:2410.03225
  • NYU CTF Bench (CTF benchmark family) — arXiv:2406.05590
  • EnIGMA (“soliloquizing” fabrication failure mode, ICML 2025) — arXiv:2409.16165
  • Guided Reasoning via Structured Attack Trees (deterministic ATT&CK-derived task tree, +5x subtask completion on the same weights) — arXiv:2509.07939
  • From Capabilities to Performance (pentesting ablations) — arXiv:2509.14289
  • PentestAgent (RAG-fix framing of a knowledge gap; contested against the scaffolding/execution readings above) — arXiv:2411.05185
  • Capture the Flags: Family-Based Evaluation via Semantics-Preserving Transformations (CTF-specific robustness benchmark) — arXiv:2602.05523
  • What Makes a Good LLM Agent for Real-world Penetration Testing? (Task Difficulty Assessment + Evidence-Guided Attack Tree Search) — arXiv:2602.17622

Agent benchmarks & failure taxonomies

  • τ-bench (fault-assignment × fault-type taxonomy) — arXiv:2406.12045
  • AgentRx (localizes the single critical failure step in a long trajectory) — arXiv:2602.02475
  • AgentBoard (“progress rate” metric — general, non-security capability-decomposition principle, NeurIPS 2024 Oral) — arXiv:2401.13178
  • MAST (14-mode/3-category multi-agent failure taxonomy, κ=0.88) — arXiv:2503.13657
  • AgentErrorTaxonomy / AgentDebug (root-cause diagnosis alone, no reward change, buys +24% all-correct accuracy) — arXiv:2509.25370
  • Phase-aligned taxonomy for autonomous agents (independent-domain convergence on a phase-keyed failure split) — arXiv:2508.13143

Capability boundary, elicitation & sandbagging (contested)

  • Yue et al. — RL elicits, not expands — arXiv:2504.13837 (2025-04-18)
  • ProRL — prolonged RL expands — arXiv:2505.24864
  • Cohen-Inger et al. — “LLMs are Like a Chameleon” (benchmark scores mask overfitting; semantics-preserving perturbation robustness check) — arXiv:2502.07445 (2025-02-11)
  • Zhang et al. — “Memorize or Generalize?” (Memorization Risk Index via semantic-perturbation code rewriting; companion robustness-check citation) — arXiv:2503.02296 (2025-03-04)
  • PSN-RLVR — arXiv:2602.02555 · NuRL — arXiv:2509.25666 · CoT-Pass@K (Wen et al., RLVR implicitly incentivizes correct reasoning) — arXiv:2506.14245 (2025-06-17)
  • Scalpel vs Hammer (GRPO amplifies, SFT replaces) — arXiv:2507.10616
  • Zhai et al. — Does RL Expand the Capability Boundary of LLM Agents? Pass@(k,T) — arXiv:2604.14877 (2026-04-16)
  • Dragoi et al. — Beyond Pass@k: Breadth-Depth Metrics / Cover@τ — arXiv:2510.08325 (2025-10-09)
  • Kang et al. — Quagmires in SFT-RL Post-Training — arXiv:2510.01624 (2025-10-02)
  • Chen et al. — The Coverage Principle — arXiv:2510.15020 (2025-10-16)
  • Greenblatt et al. — Stress-Testing Capability Elicitation with Password-Locked Models — arXiv:2405.19550 (2024-05-29)
  • Hofstätter et al. — The Elicitation Game — arXiv:2502.02180 (2025-02-04)
  • van der Weij et al. — AI Sandbagging — arXiv:2406.07358 (2024-06-11)
  • Ryd et al. — Removing Sandbagging via Weak Supervision — arXiv:2604.22082 (2026)
  • Stroebl et al. — Inference Scaling fLaws — arXiv:2411.17501 (2024-11-26)
  • Dorner et al. — ROC-n-reroll — arXiv:2507.12399 (2025-07-16)
  • Huang et al. — Is Best-of-N the Best of Them? — arXiv:2503.21878 (2025-03-27, ICML 2025)
  • Mahowald et al. — Dissociating Language and Thought in LLMs — arXiv:2301.06627 (2023-01-16)
  • He et al. — LLMs as Neurolinguistic Subjects — arXiv:2411.07533 (2024-11-12)
  • Boháček et al. — Uncovering Competency Gaps (sparse autoencoders on internal representations) — arXiv:2512.20638 (2025-12-06)

PEFT

Frontier lab recipes (reports & blogs) — full-year refresh, 2025-07 → 2026-07, all 10 tracked labs

Llama (historical anchor): Llama 3 — arXiv:2407.21783 · Llama 4 — ai.meta.com/blog/llama-4-multimodal-intelligence

Anthropic (Claude) — Constitutional AI/RLAIF backbone arXiv:2212.08073; inoculation prompting arXiv:2510.04340, Anthropic’s own study arXiv:2511.18397; release posts/system cards — Opus 4.1 · Sonnet 4.5 card · Sonnet 4.5 · Sonnet 4.5 research · Haiku 4.5 card (PDF) · Haiku 4.5 · Opus 4.5 card · Opus 4.5 research · Opus 4.5 card walkthrough (secondary) · inoculation prompting post · emergent misalignment / reward hacking · “teaching Claude why” post · research page · Opus 4.6 · Opus 4.6 sabotage risk report (PDF) · Sonnet 4.6 · Opus 4.7 · Opus 4.8 · Fable 5 / Mythos 5 · Fable 5 & Mythos 5 card (PDF) · Mythos guardrails coverage (secondary) · Sonnet 5 · Sonnet 5 card · Sonnet 5 launch coverage (secondary) · Sonnet 5 launch guide (secondary) · AI organizations post

OpenAI — GPT-5 (2025-08-07) · GPT-5 system card (PDF) · GPT-5 for developers · safe-completions arXiv:2508.09224 · safe-completions post · GPT-5-Codex addendum · Codex system card (PDF) · GPT-5.1 · GPT-5.1 deployment safety · routing/model-choice post (secondary) · GPT-5.1-Codex-Max · Codex-Max system card · Codex-Max safety training · long-horizon Codex tasks · GPT-5.2 · GPT-5.2 for science/math · GPT-5.2 system-card update · GPT-5.2-Codex · GPT-5.3-Codex · 5.3-Codex system card · 5.3-Codex coverage (secondary) · GPT-5.4 · GPT-5.4 thinking system card · GPT-5.4 (secondary) · graders docs · RFT guide · RFT use-cases · RFT wind-down, 2026-05-08 · community thread (secondary)

Google DeepMind (Gemini) — Gemini 2.5 tech report arXiv:2507.06261 (HTML, §2.4/2.5 mirror, cross-checked (secondary)) · Deep Think launch · Gemini 3 Pro model card (PDF) · Gemini 3 launch · agent-building with Gemini 3 · Gemini 3 Deep Think · Deep Think update · Gemini 3 Flash · Flash for enterprise · Gemini 3.1 Pro · Gemini 3.5 · Vending-Bench/τ²-bench cross-check (secondary)

xAI (Grok) — Grok 4 · Grok 4 model card (PDF) · Grok 4 analysis (secondary) · Grok Code Fast 1 · Code Fast 1 model card (PDF) · Grok 4 Fast · Grok 4 Fast model card (PDF) · coverage (secondary) · coverage (secondary) · Grok 4.1 · Grok 4.1 model card (PDF) · sycophancy coverage (secondary) · Grok 4.1 Fast · news index

Mistral — Magistral arXiv:2506.10910 (HF paper page, benchmark tables) · Ministral 3 arXiv:2601.08584 · Mistral 3 blog · Magistral blog · Mistral-Large-3 card · Magistral-Small-2509 card · Magistral-Small-2507 card · Magistral Medium 1.2 docs

DeepSeek — V3 arXiv:2412.19437 · R1 arXiv:2501.12948 · V3.2 arXiv:2512.02556 (HTML) · V3.1 release · V3.1 card · V3.1-Terminus · V3.1-Terminus card · V3.2-Exp · V3.2-Exp repo · V3.2 / V3.2-Speciale

Qwen (Alibaba) — Qwen3 tech report arXiv:2505.09388 · GSPO arXiv:2507.18071 · Qwen3-Omni arXiv:2509.17765 · Qwen3-VL arXiv:2511.21631 · Qwen3-Coder-Next arXiv:2603.00729 · Qwen3.5-Omni arXiv:2604.15804 · GSPO blog · Qwen3-Coder blog · Qwen3 blog · Qwen3 README · Qwen3-Next efficiency · Qwen3-Max · Qwen3-Max-Thinking · Qwen3.5 blog · Qwen3.5-397B-A17B card · Qwen3.7-Max “agent frontier” · Qwen3.7 blog · The Batch coverage (secondary) · VentureBeat coverage (secondary) · TechSphere coverage (secondary) · Qwen3.6-35B-A3B agentic coding

Moonshot AI (Kimi) — K2 arXiv:2507.20534 (verified via arXiv abs + full HTML crawl) · K2.5 arXiv:2602.02276 (verified via arXiv abs + full HTML crawl) · Kimi-K2-Thinking card (no arXiv paper) · K2 Thinking intro post · Kimi-K2.6 card (no arXiv paper) · K2.6 tech blog · K2.6 benchmark deltas · K2.6 coverage (secondary) · K2.6 method non-disclosure (secondary) · Kimi-K2.5 model card/benchmarks

GLM / Z.ai (Zhipu) — GLM-4.5 arXiv:2508.06471 · GLM-5 arXiv:2602.15763 (HTML) · GLM-4.5 blog · GLM-4.6 blog · GLM-4.7 blog · GLM-5 blog · GLM-5.2 blog · GLM-4.5 repo · GLM-5 repo · slime RL infra repo · GLM-4.7-Flash card · MoE architecture deep-dive (secondary, unverified-primary) · agentic RL post citing GLM-5 report (secondary) · RL infra post citing GLM-5 report (secondary) · GLM-5.2 vs 5.1 (secondary) · GLM-5.2 open-source coverage (secondary) · GLM-4.7-Flash coverage (secondary)

Xiaomi (MiMo) — MiMo-V2-Flash arXiv:2601.02780 · MiMo-Embodied arXiv:2511.16518 · MiMo-VL-Miloco arXiv:2512.17436 · MiMo-Audio arXiv:2512.23808 · MiMo lineage (background only, outside the 12-month window): MiMo arXiv:2505.07608, MiMo-VL arXiv:2506.03569 · MiMo-V2.5-Pro card · MiMo-V2.5 card · MiMo-V2.5-Pro blog · MiMo model-update docs

The verified concept-map + genealogy notes in the shared memory pool (research/post-training-inference-concept-map.md, research/post-training-method-genealogy-onpolicy-offpolicy.md, research/frontier-lab-post-training-recipes-2026.md, research/rl-for-long-horizon-exploration-reasoning.md, research/diagnosing-capability-vs-execution-gap-framework.md) are the machine-readable companions to this book.