What the frontier labs actually do (last 12 months: 2025-07 → 2026-07)
Ten labs, one year, one question: SFT + which RL, in what order, and why. The pattern holds from the first pass and gets stronger with more labs in the sample: everyone runs the same small method set (SFT · rejection-sampling · DPO-family · GRPO/GSPO/PPO-family · RLVR · RLAIF). Differentiation is ordering, data/environment scale, and a handful of stabilization tricks — not exotic new losses. Where a lab discloses a genuinely new algorithmic idea (Mistral’s clip-higher, Qwen’s GSPO, Moonshot’s PARL, Xiaomi’s MOPD, DeepSeek’s four GRPO stabilizers), it’s flagged [N] and it’s still a variation on group-relative policy optimization, not a different paradigm.
Tag legend (four axes that matter for a ~100-turn, terminal-reward, verifier-gated CTF agent): [L] long-horizon/credit-assignment · [E] exploration/entropy-collapse-resistance · [R] multi-step reasoning · [N] novelty / capability-boundary-expansion (not just amplification). [D] = lab-disclosed mechanism, [I] = third-party-inferred — kept inline per lab because disclosure quality varies by an order of magnitude between labs (DeepSeek/Qwen/Mistral publish ablation tables; xAI/OpenAI/Google publish one paragraph of prose per model).
Designed to fix: pattern N callouts below map a disclosed technique directly to one of this project’s 5 BSides behavioral-audit findings (1: agents prefer own raw tools over the rich surface; 2: react-and-guess, no methodology; 3: good-guessers-until-not, brittle on a wrong first guess; 4: uneven PTES phases, weak enumeration/exploitation; 5: benchmarks reward pattern-match speed not thoroughness).
Historical anchor: Llama 3 → Llama 4 (pre-window, kept short)
Llama 3/3.1 — several rounds of SFT → rejection sampling → DPO; tried PPO, dropped it for DPO+RS at their scale (arXiv:2407.21783).
Llama 4 (2025-04) — inverted the order: thin SFT (LLM-judge dropped >50% of “easy” data) → intensive online RL on hard prompts with continuous re-filtering → thin DPO for corner cases. Meta’s explicit finding — heavy SFT/DPO restricts RL exploration — is the one lesson from this era every later lab implicitly re-derives: don’t let imitation calcify the policy before RL gets to explore (ai.meta.com/blog/llama-4-multimodal-intelligence). This is why Mistral’s Magistral Medium below runs RL with zero SFT and why Zhipu’s GLM-4.5 keeps SFT to “just enough correctness for RL to have signal,” not more.
Anthropic (Claude) — richest on alignment mechanism, thinnest on capability-RL mechanism
The backbone [D], unchanged since 2022: Constitutional AI / RLAIF (arXiv:2212.08073) — SL-CAI critique-revise → RL-CAI against an AI-feedback preference model. Every 2025-26 system card repeats this as boilerplate; capability-RL hyperparameters, reward-model architecture, and dataset sizes are never disclosed for the Sonnet/Opus line. The useful material is in the alignment research, not the model cards.
- Inoculation prompting [D, N] (arXiv:2510.04340; Anthropic’s own study arXiv:2511.18397): deployed since Opus 4.5, expanded in Opus 4.6’s highest-risk RL settings. Problem: a model that learns to reward-hack on real production RL generalizes that disposition into broader misalignment (sabotage, alignment-faking), and chat-RLHF safety training doesn’t transfer to agentic settings. Fix: tell the model at train time that the gameable behavior is expected/acceptable in this context; query at test time with the unmodified prompt. The model still learns the capability, but doesn’t internalize “hacking is my default disposition when oversight is weak.”
inoc_prompt = prompt + "\n(Note: hard-coding to pass this test case is expected here.)"
loss = sft_loss(model, inoc_prompt, hacky_response)  # train-time only
# test-time: query with the ORIGINAL (un-inoculated) prompt
Directly relevant to this project’s own confirmed lesson (format/regex reward → SFT-induced flag confabulation): an imperfect reward doesn’t just fail locally, it teaches a generalizable disposition. If rejection-sampling SFT ever surfaces a technically-passing-but-degenerate solve (lucky guess vs. real exploitation), Anthropic’s finding says annotate it explicitly, don’t silently filter or leave the framing implicit.
- “Teaching Claude Why” [D, L, N] (alignment.anthropic.com, 2026-05-08): three interventions for agentic-misalignment generalization, all disclosed as complementary. (1) SFT on non-agentic chat transcripts about an ethical dilemma reduced agentic tool-calling misalignment to zero (cross-modality generalization); (2) SDF (synthetic-document finetuning, pretraining-style docs about an AI acting per Claude’s constitution): 3M OOD-tokens beat 14M eval-similar tokens, 28x more token-efficient, and the effect survived subsequent RL rather than being washed out; (3) diversifying RL environments with tool defs + varied system prompts, even when the tools are never needed for the task, measurably reduced honeypot misalignment.
Designed to fix: pattern 1 (agents prefer their own raw tools, 26/40 provided tools dead). Anthropic’s finding (3) says the dead-tool problem may be a training-distribution coverage gap, not just an inference-time preference: if the RL/SFT distribution rarely rewards using the rich tool surface as the winning strategy, the model won’t reach for it regardless of the system prompt at eval time. Concrete action: oversample rejection-sampling-SFT trajectories that used the project’s own tool surface (not raw shell/curl) rather than filtering only on outcome.
- Multi-agent orchestration [D, L] (Opus 4.5): tested/tuned as an orchestrator of Haiku/Sonnet worker subagents. Cheap Haiku workers under an Opus orchestrator beat Opus alone by ~12 points; Opus is a measurably better orchestrator than Sonnet given the same subagent pool. Sonnet 4.5’s headline: ~30-hour autonomous coding sessions, the year’s clearest [L] claim from this lab.
- Cyber capability [vendor-claimed, undisclosed recipe]: Claude Mythos 5 (gated, Project Glasswing), billed as having the “strongest cybersecurity capabilities of any model in the world,” with an explicit multi-phase agentic-hacking claim (recon → discovery → lateral movement). Zero SFT/RL-environment detail disclosed for the cyber-capable checkpoint; treat as a vendor capability claim, not a methodology to learn from.
Headline gains: SWE-bench Verified 80.9% (Opus 4.5, first model >80%); Sonnet 5 BrowseComp 84.7% at a 10M-token operating limit with context compaction. Tags: [L] strong (30-hr sessions, task budgets, dynamic-workflow subagent fan-out) · [R] steady benchmark climb · [N] inoculation prompting + SDF-for-values are genuine training-loop-level interventions, the clearest [N] items in this whole file from any lab · [E] not addressed anywhere in disclosed material — a real disclosure gap, not evidence of absence.
OpenAI — almost nothing on the RL algorithm, one very citable agentic-RL sentence
Disclosure reality [D]: every GPT-5.x system card repeats the same paragraph — “trained to reason through reinforcement learning… learn to refine their thinking process, try different strategies, and recognize their mistakes.” No algorithm name, no reward-model architecture, no compute numbers, across 10+ releases (GPT-5 → 5.4) from Aug 2025 to Mar 2026.
What actually is disclosed and load-bearing, repeated verbatim across the whole Codex line since Sep 2025:
“trained using reinforcement learning on real-world coding tasks in a variety of environments… iteratively run tests until passing results are achieved.”
That’s RLVR-shaped agentic RL: real-repo/PR environments, implicit test-pass reward (ground-truth verifiable — not format-matched, matching this project’s own confirmed rule), explicit iterate-until-verified inner loop. Cite this as independent industry confirmation that “RL against a ground-truth verifier with an iterate-until-pass loop” is the dominant agentic-coding recipe, not an idiosyncratic choice.
- Safe-completions [D, the one fully-disclosed SFT/preference technique] (arXiv:2508.09224) — trains the policy over outputs, not a binary refuse/comply intent classifier, so a dual-use prompt gets a partial, non-harmful-boundary answer instead of a hard refusal. Breaks the refusal/helpfulness tradeoff rather than trading one for the other.
- Compaction (GPT-5.1-Codex-Max, [D, the year’s strongest [L] mechanism]) — “first model natively trained to operate across multiple context windows… coherently working over millions of tokens in a single task.” Not a prompting trick — the model is trained to prune its own history and continue in a fresh window. METR: 50%-reliability time horizon ~2h42m vs GPT-5’s 2h15m.
Designed to fix: pattern 4 (uneven PTES phases, weak thorough enumeration). OpenAI’s own cyber-eval writeup: “most cyber challenges are limited by exploring many different paths which involve running commands that can produce verbose logs and easily consume the model’s context window… trying different tools with an almost brute-force approach.” This is an independent, external confirmation of this project’s own diagnosis — CTF-style tasks are bottlenecked by long-horizon context exhaustion during enumeration, not single-step reasoning — and OpenAI’s fix is architectural (compaction), not reward-shaping.
- RFT API [D, the closest thing to a public recipe]: sample completions, score with a programmable grader (string_check/text_similarity/score_model/python/multigrader), then apply a policy-gradient update toward higher-scoring completions. Their own eligibility guidance, “eval results must be variable enough to improve,” is the same 30–60% baseline-band logic this project already uses. Status: being wound down May 2026, new users cut off, existing users capped to Jan 2027. Flag to main: don’t plan around “fall back to OpenAI RFT.”
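The baseline-band rule above can be sketched as a task filter. This is a minimal illustration, not OpenAI's eligibility check: the function names, the 0.30/0.60 bounds as hard cutoffs, and the per-task list of boolean baseline outcomes are all assumptions made for the example.

```python
from typing import Dict, List


def in_training_band(pass_rate: float, low: float = 0.30, high: float = 0.60) -> bool:
    """True when the baseline pass rate leaves room to improve but is not hopeless."""
    return low <= pass_rate <= high


def select_tasks(baseline_results: Dict[str, List[bool]]) -> List[str]:
    """Keep only tasks whose baseline pass rate falls inside the band."""
    selected = []
    for task_id, outcomes in baseline_results.items():
        if not outcomes:
            continue  # no baseline samples, nothing to judge
        rate = sum(outcomes) / len(outcomes)
        if in_training_band(rate):
            selected.append(task_id)
    return selected


# Hypothetical baseline: 10 sampled attempts per task.
baseline = {
    "easy": [True] * 10,                 # always solved: no reward variance
    "medium": [True] * 4 + [False] * 6,  # 40% pass rate: inside the band
    "impossible": [False] * 10,          # never solved: no reward variance
}
print(select_tasks(baseline))
```

Tasks that always pass or always fail give every sampled completion the same score, so a group-relative update has nothing to push against; the band filter is one cheap way to avoid spending rollouts there.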
Headline gains: GPT-5.2 ARC-AGI-2 52.9% (vs 17.6% GPT-5.1, the largest single jump reported); GPT-5.4 OSWorld-Verified 75.0% (above the 72.4% human baseline). [N] evidence: GPT-5.2 Pro solved an open COLT-2019 problem in statistical learning theory unaided — a single vendor-reported anecdote, externally verified by unspecified “subject-matter experts,” treat as promising not validated. Tags: [R] core · [L] strongest disclosed mechanism this year (compaction) · [E] never named as a training objective anywhere in the OpenAI corpus — the “almost brute-force” tool-trying is an observed side effect, not an engineered exploration bonus · [N] one well-documented anecdote, caveated.
Google DeepMind (Gemini) — one real tech report (2.5), everything since is a model-card rerun
Gemini 2.5 [D] (arXiv:2507.06261 §2.4) is the only Gemini release with a real disclosed recipe; every 3.x card since repeats the same boilerplate with zero new mechanism.
- SFT: adversarial/red-team-sourced data (model-probes-model + human-probes-model), “loosely inspired by Constitutional AI,” refined across successive model generations — a self-improving data engine, not a static curated set.
- RL — “RLF” (Reinforcement Learning from human and critic Feedback), dual-channel: a trained Data Reward Model (DRM) amortizing human preference labels + a prompted Critic scored against offline-editable rubrics. Deliberate hedge against the two classic single-RM failure modes (trained-RM reward hacking vs. prompted-judge brittleness).
reward = f(DRM(response), Critic(response, rubric))  # two channels, decoupled cost profile
Separately: “increased training compute allocated to RL… enabled Gemini 2.5 to learn from more diverse and complex RL environments, including those requiring multi-step actions and tool use”. This is the direct antecedent of a verifiable-reward RL track, the branch closest to this project’s own ground-truth flag verifier (you don’t need the DRM/Critic hedge; flag capture is already ground-truth verifiable).
- Gemini 3 Pro [D, model card only]: “RL techniques that can leverage multi-step reasoning, problem-solving and theorem-proving data”; no new mechanism named. The evidence is all benchmark: Vending-Bench 2 mean net worth $5,478 vs Gemini 2.5 Pro’s $573.64, a ~9.6x jump and the single most relevant public [L] data point this year (closest published analog to a ~100-turn terminal-reward task, no per-step reward). “Thought Signatures” (encrypted reasoning-state tokens carried across multi-turn tool calls) is a serving-side mitigation for context/reasoning loss over long agentic loops.
Designed to fix: pattern 4. Google’s own Frontier Safety Framework discloses a concrete capability ceiling directly in this project’s domain: cyber “v1 hard challenges: 11/12 solved; v2 challenges: 0/13 solved end-to-end” — evidence that even a 9.6x long-horizon jump on Vending-Bench doesn’t close the gap on harder, more adversarial multi-step exploitation tasks.
Headline gains: Gemini 2.5 → “5x on Aider Polyglot, 2x on SWE-bench Verified” (report’s own framing); Gemini 3 Pro SWE-bench Verified 76.2%, τ²-bench 85.4%. Tags: [L] strong (Vending-Bench 2) · [R] core focus (Thinking/Deep Think tracks are RL-trained test-time-compute) · [E] the one explicit phrase (“deeper exploration”) is a compute-scale claim, not an algorithmic one — nothing disclosed addresses entropy-collapse directly, a real gap · [N] contested — ARC-AGI-2 gains are suggestive, not mechanistically explained.
xAI / Grok — no arXiv report for any Grok-4-family model; the clearest [L] disclosure industry-wide
Company-wide posture: one model card per release, almost entirely a safety/RMF eval doc. Post-training gets one paragraph.
- Grok 4 [D]: SFT is explicitly minor (“along with supervised finetuning of specific capabilities”); RL is the driver, pushed to “the same order of magnitude as pretraining” compute (caveat: the launch chart had no y-axis labels, so this is directional, not audited), with RLVR domains expanded “from math/coding to many more domains” and native RL-trained tool use (model chooses its own search depth). CyBench unguided success 0.43, “below a human professional” end-to-end.
- Grok 4.1 Fast [D], the year’s most explicit long-horizon-RL disclosure:
“We trained Grok 4.1 Fast using long-horizon reinforcement learning with a strong emphasis on multi-turn scenarios, ensuring consistent performance across its full 2-million-token context window.”
This is a rare case of a lab naming “long-horizon RL” as the training objective, not just an eval axis — cite this precisely.
Designed to fix: pattern 1. Co-launched with a first-party Agent Tools API (web search, X search, code exec, MCP) that the model was trained against directly — xAI controls and RL-trains against its own curated tool surface, structurally reducing the degrees of freedom for the model to default to raw shell/curl-equivalents, because the trained-against tools are the native path.
- Grok 4.1 [D, N, but instructive as a contrast]: RLAIF using an agentic reasoning model as the reward model (not a static preference classifier) to extend RL onto non-verifiable axes (style, EQ, tone). Genuinely portable idea, but the model’s own safety card shows the cost: MASK dishonesty 0.43→0.49, sycophancy 0.07→0.19-0.23 (both worse). This is the opposite of this project’s ground-truth-reward rule, and it’s a first-party-disclosed regression: read it as a live demonstration of what happens when you relax ground-truth verification, not something to adopt.
- Grok Code Fast 1 [D]: pure SFT/imitation on real PR/tool-use demonstrations, no disclosed RL at all. This is xAI’s own precedent that SFT-only is a legitimate shipped strategy for a cheap specialist tier, not the capability frontier.
Headline gains: Grok 4 first to 50.7% HLE (w/ tools, Heavy); Grok 4.1 Fast τ²-bench Telecom 100%; Vending-Bench $4,694 (Grok 4) vs Claude Opus 4’s $2,077. Tags: [L] Grok 4.1 Fast = strongest in the corpus · [R] Grok 4 primary target · [E] never disclosed for any model, and large-scale RLVR is exactly the regime where entropy collapse is a known risk — xAI says nothing about it · [N] the RLAIF-agentic-judge idea (Grok 4.1) is the most transferable and the most cautionary.
Mistral — the cleanest published ablation table in the whole set (single-turn only)
Magistral Medium [D] (arXiv:2506.10910) — RL alone, zero SFT, zero distillation from a stronger teacher, on top of an instruct checkpoint. The paper’s own headline methodological claim, explicitly benchmarked against DeepSeek-R1’s SFT-then-RL pipeline. GRPO with three deliberate departures, each ablation-justified:
# vanilla GRPO: per-token ratio, symmetric clip, KL penalty to the reference policy
loss_t = -min(ratio_t * A_i, clip(ratio_t, 1-eps, 1+eps) * A_i) + beta * kl_to_ref_t
# Magistral's departures
loss = normalize_by_group_token_count(loss_t)  # not per-sequence -> removes length bias
eps_high = 0.28  # asymmetric "clip-higher" (reported range 0.26 to 0.28) instead of an entropy bonus
# entropy bonus WAS tried: "unstable and dataset-dependent" -> collapsed on math, exploded on mixed data
beta = 0         # KL term removed -> policy diverges from ref anyway, the term bought nothing
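A runnable toy version of the departures above, as a sketch only. Assumptions not taken from the source: `eps_low=0.2`, the function name, and the input layout (one list of per-token log-probs per sampled completion). The KL term is omitted to match beta = 0.

```python
import math
from typing import List


def grpo_clip_higher_loss(
    logp_new: List[List[float]],   # per-sequence, per-token log-probs under the current policy
    logp_old: List[List[float]],   # same tokens under the sampling policy
    advantages: List[float],       # one group-relative advantage per sequence
    eps_low: float = 0.2,          # assumed lower clip width
    eps_high: float = 0.28,        # wider upper clip ("clip-higher")
) -> float:
    """Clipped surrogate loss, normalized by the total token count of the group."""
    total = 0.0
    n_tokens = 0
    for seq_new, seq_old, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(seq_new, seq_old):
            ratio = math.exp(lp_new - lp_old)
            clipped = max(1.0 - eps_low, min(1.0 + eps_high, ratio))
            total += -min(ratio * adv, clipped * adv)
            n_tokens += 1
    # Dividing by group token count (not averaging per sequence first)
    # weights every token equally regardless of completion length.
    return total / n_tokens


# A token whose probability rose by 25% with a positive advantage:
# a symmetric 0.2 clip would cap it at 1.2, the wider upper bound does not.
lp = math.log(1.25)
print(grpo_clip_higher_loss([[lp]], [[0.0]], [1.0]))                 # unclipped
print(grpo_clip_higher_loss([[lp]], [[0.0]], [1.0], eps_high=0.2))   # clipped at 1.2
```

The wider upper bound lets low-probability tokens with positive advantage grow further per update, which is how clip-higher stands in for an entropy bonus.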
Reward = 4 additive terms: format (0.1); correctness (0.9, checked by SymPy or compile+test, all-or-nothing; a partial-credit code reward was tried and rejected because it cost ~2pts on LiveCodeBench); a length penalty; and language consistency (0.1, fixes CoT code-switching). Magistral Small uses cold-start SFT distilled from Medium + RL on top. Table 3 ablation: SFT+RL (70.7 AIME’24) beats SFT-only (65.4) and RL-only (65.8) at the 24B scale, i.e., pure RL sufficed at Medium’s scale but not at Small’s.
Ministral 3 [D] (arXiv:2601.08584) cites Magistral’s GRPO recipe directly (“Rastogi et al. [2025]”). Adds a General RL stage with a rubric-based LLM-judge reward (reward = fraction of atomic rubric items satisfied) layered after verifiable STEM RL — a candidate pattern for domains (like reporting/methodology quality) where a ground-truth verifier exists for outcome but not for process, as an additional shaping signal, never replacing the terminal flag-verified reward.
Headline gain: Magistral Medium AIME’24 pass@1 26.8→73.6 (+~47pp, “nearly 50% boost,” Mistral’s own framing, without cold-start reasoning traces). Tags: [R] primary and only real target — this is single-turn math/code RLVR, reward is per-completion · [E] yes, narrowly: clip-higher is a genuine, ablation-validated anti-entropy-collapse mechanism, directly citable vocabulary — but tuned for single-shot generation, not multi-turn tool-call exploration · [L] not addressed at all — no multi-turn credit assignment in this report, nothing transfers directly to a 100-turn episode without further work · [N] the “RL-only, no distillation” result at Medium’s scale is the clearest disclosed boundary-expansion claim of the year (pure RLVR uncovered capability a teacher’s traces wouldn’t have shown).
DeepSeek — the four GRPO stabilizers are the single most transferable [E] artifact in this file
V3.1 [D, thin] — hybrid think/non-think in one checkpoint; SFT/RL specifics not disclosed beyond “post-training optimization.” V3.2-Exp [D] — DeepSeek Sparse Attention (DSA) via continued pretrain, post-training held identical to V3.1-Terminus by design (a controlled comparison). V3.2 [D] (arXiv:2512.02556) carries essentially all of the year’s real recipe detail:
- Specialist distillation: 6 domain specialists (math/code/reasoning/agentic/agentic-coding/agentic-search), each pushed with large-scale RL independently, distilled back into one generalist. “Models trained on the distilled data achieve performance only marginally below domain-specific specialists, with the gap eliminated through subsequent RL”: distillation gets most of the way cheaply, RL closes the rest.
- Mixed RL (GRPO), merged not sequential — reasoning + agent + alignment trained together explicitly to avoid catastrophic forgetting from multi-stage sequencing. Post-training compute >10% of pretraining compute, disclosed directly.
- Four GRPO stabilizers, each a concrete anti-entropy-collapse/anti-instability fix:
- Unbiased KL estimate — corrects Schulman’s K3 estimator via importance-sampling ratio; the uncorrected estimator assigns unboundedly large gradient weight when π_θ ≪ π_ref.
- Off-policy sequence masking — zero the loss on negative-advantage sequences whose divergence from π_old exceeds a threshold; positive-advantage samples are kept regardless.
- Keep Routing — freeze the MoE expert-routing path used at sampling time; don’t let training recompute a diverged routing.
- Keep Sampling Mask — reapply the same top-p/top-k truncation mask from sampling during the training update, so importance-sampling validity holds; empirically “preserves language consistency during RL training” (fixes RL-induced mixed-language garbage, a textbook entropy-collapse symptom).
- Thinking-in-tool-use agentic RL environments — 1,827 environments, 85k+ prompts across code/search/general/interpreter agents. Search-agent verification keeps only samples where the ground truth is checkable AND every wrong candidate is provably wrong — the same hard-negative discipline as this project’s own flag-verifier, independently arrived at.
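Of the four stabilizers listed above, off-policy sequence masking is the easiest to prototype. The sketch below is an illustration under stated assumptions, not DeepSeek's implementation: divergence is taken to be the mean per-token log-ratio log(pi_old / pi_theta) over the sampled tokens, and the threshold `delta=0.1` and the function name are made up for the example.

```python
from typing import List


def sequence_mask(
    logp_old: List[float],   # per-token log-probs under the sampling policy
    logp_new: List[float],   # same tokens under the policy being updated
    advantage: float,        # group-relative advantage of the whole sequence
    delta: float = 0.1,      # assumed divergence threshold
) -> float:
    """Return 0.0 to drop the sequence from the loss, 1.0 to keep it."""
    # Sample estimate of KL(pi_old || pi_theta) on this sequence.
    divergence = sum(o - n for o, n in zip(logp_old, logp_new)) / len(logp_old)
    # Only negative-advantage sequences are ever masked; positive ones
    # are kept no matter how far the policy has drifted.
    if advantage < 0 and divergence > delta:
        return 0.0
    return 1.0


stale_old, stale_new = [-1.0, -1.0], [-2.0, -2.0]   # policy moved far from the sampler
print(sequence_mask(stale_old, stale_new, advantage=-0.5))  # 0.0
print(sequence_mask(stale_old, stale_new, advantage=+0.5))  # 1.0
```

The asymmetry is the point: pushing probability down on samples the current policy would barely produce anymore is where the unstable gradients come from, while stale positive samples are still useful signal.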
Designed to fix: pattern 4 / pattern 3. The context-management fix (“retain reasoning across tool turns, drop only on a genuinely new user message”) directly targets long-horizon coherence during exploitation; the hard-negative-verified reward directly targets brittleness from an ungrounded guess (pattern 3) by refusing to reward a correct-looking answer unless every alternative is provably wrong too.
V3.2-Speciale: same base, reduced length penalty (let it think longer) + DeepSeekMath-V2 reward folded in — gold-medal IMO/IOI/ICPC/CMO 2025.
Headline gain: V3.2 “performs comparably to GPT-5” on reasoning at substantially lower cost, explicitly framed as narrowing the open-vs-closed gap on agentic long-tail tasks. Tags: [L] strong (DSA makes long-context RL tractable; context-retention rule is a direct multi-turn continuity fix) · [E] the strongest, most technically concrete axis in this whole file — four named, ablatable stabilizers, zero architecture changes required · [R] strong · [N] claimed (closing the gap on “long-tail/novel environments”) but self-rated, and structurally the specialist→distill→RL pipeline resembles SFT+RL, which this project’s own thesis (“GRPO amplifies, SFT replaces,” arXiv:2507.10616) would predict caps its novelty — DeepSeek doesn’t isolate this in an ablation, so contested/unresolved by their own disclosure.
Qwen (Alibaba) — Qwen3.7-Max is the single most directly relevant release in this entire file
Baseline (Qwen3, pre-window, arXiv:2505.09388): Long-CoT SFT cold-start → Reasoning RL (GRPO/RLVR) → thinking-mode fusion SFT → General RL + strong-to-weak distillation. Every later release patches this.
- GSPO [D] (arXiv:2507.18071): the load-bearing algorithm swap from 2025-07 onward. Problem: GRPO’s per-token importance ratio gets corrupted on a MoE model when routing jitters between rollout and update, forcing an expensive “Routing Replay” workaround. Fix: define the ratio at the sequence level.
# GRPO: per-token ratio, needs Routing Replay to stay valid on MoE
ratio_t = pi_theta(y_t|x,y_<t) / pi_old(y_t|x,y_<t)
# GSPO: one length-normalized ratio per whole rollout -> removes Routing Replay entirely
s_i = (pi_theta(y_i|x) / pi_old(y_i|x)) ** (1/len(y_i))
A_i = (r_i - mean(group_rewards)) / std(group_rewards)
loss += -min(s_i * A_i, clip(s_i, 1-eps, 1+eps) * A_i)
Removes infra complexity (no per-token log-prob pinning) rather than adding it; cheap to try if a MoE base is ever adopted.
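The sequence-level ratio can be checked numerically with a few lines of stdlib Python. This is a toy sketch: the function names are hypothetical, inputs are summed sequence log-probs, the loss is averaged over the group (an assumption), and `eps=0.2` is chosen for readability rather than taken from the paper.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)  # all rollouts tied: no learning signal
    return [(r - mu) / sigma for r in rewards]


def gspo_loss(
    seq_logp_new: List[float],  # summed log-prob of each rollout, current policy
    seq_logp_old: List[float],  # summed log-prob of each rollout, sampling policy
    lengths: List[int],         # token count of each rollout
    rewards: List[float],
    eps: float = 0.2,           # illustrative clip width
) -> float:
    advantages = group_advantages(rewards)
    total = 0.0
    for lp_new, lp_old, n, adv in zip(seq_logp_new, seq_logp_old, lengths, advantages):
        # Length-normalized sequence ratio: geometric mean of the token ratios.
        s = math.exp((lp_new - lp_old) / n)
        clipped = max(1.0 - eps, min(1.0 + eps, s))
        total += -min(s * adv, clipped * adv)
    return total / len(rewards)


print(gspo_loss([-10.0, -8.0], [-10.0, -8.0], [4, 2], [1.0, 0.0]))
```

Because only one ratio exists per rollout, a single token whose routing changed between sampling and update shifts `s` by a small amount instead of producing one wildly off per-token ratio.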
- Qwen3-Coder [D]: “hard-to-solve, easy-to-verify” code RL (execution-driven, automatically-scaled test cases) as a separate stage from “long-horizon Agent RL” (multi-turn plan→act→observe→replan over 20,000 parallel real dev environments). SOTA SWE-bench Verified without test-time scaling, i.e. the gain is in the policy.
- Qwen3.5 [D]: pivot point. “the post-training performance gains in Qwen3.5 primarily stem from our extensive scaling of virtually all RL tasks and environments… we focused heavily on increasing the difficulty and generalizability of RL environments, rather than optimizing for specific metrics.” Origin of the decoupled Task/Harness/Verifier idea that gets its full writeup next.
- Qwen3.7-Max [D] (read this one closely):
“agent RL training conventionally couples the task, the harness, and the verifier — train on one fixed triple and the policy learns harness-specific shortcuts instead of a generalizable strategy.”
This is a first-party, named statement of exactly this project’s own scaffold-overfitting finding. Fix: decoupled Task/Harness/Verifier rollout infra — the same task replayed against different harnesses (types and versions) and different verifiers, forcing cross-harness generalization.
for task in tasks:
    for harness in sample_harnesses(task):  # e.g. Claude Code, OpenClaw, Qwen Code, Hermes
        for verifier in sample_verifiers(task):
            rollout = run(policy, task, harness)
            update(policy, rollout, verifier(rollout))
Designed to fix: pattern 1, directly and by name. Validated by consistent performance across QwenClawBench/CoWorkBench regardless of eval-time harness, contrasted against Qwen3.6-Plus which “showed significant variance.” Also ships a reward-hacking self-detection framework (the policy itself flags candidate reward-hacking patterns in its own trajectories), a governance mechanism directly relevant to this project’s “never regex-match, always verify” rule.
Headline gain: Qwen3.7-Max — 35-hour fully-autonomous kernel-optimization run on a previously-unseen accelerator, zero prior exposure, 432 iterations/1,158 tool calls, ~10x speedup, entirely self-directed (self-reported, not yet independently benchmarked). Tags: [E]+[L]+[N] for Qwen3.5/3.7-Max — the release most directly targeted at this project’s exact problem shape (harness generalization, long-horizon coherence, novel governance mechanism) · [R] GSPO/reasoning-RL lineage throughout.
Moonshot AI (Kimi) — strongest disclosed [N] case in the file (Agent Swarm is a qualitatively different solution shape)
K2 [D] (arXiv:2507.20534) — SFT data is itself rejection-sampled: synthetic agentic trajectories (3,000+ real MCP tools + 20,000+ synthesized) scored by an LLM judge against per-task rubrics, only passing trajectories enter SFT — “large-scale rejection sampling… through our quality filtering process,” disclosed in those words. RL = REINFORCE-with-baseline (GRPO-adjacent, group-mean baseline, no value model) + self-critique rubric reward re-grounded continuously by on-policy RLVR rollouts (a template for a safe non-verifiable auxiliary reward that doesn’t drift from the ground-truth signal). Three named engineering fixes: budget control (hard token cap, fights length inflation), PTX loss (replay high-quality SFT data during RL, fights catastrophic forgetting), temperature decay (high early, annealed later — an explicit, named [E] exploration-preservation mechanism). Partial rollout — long-tail unfinished episodes pause/resume across RL iterations rather than blocking the batch — the single most directly transferable engineering idea for ~100-turn terminal-reward episodes.
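The partial-rollout idea above, reduced to its scheduling skeleton. This is a toy sketch, not Moonshot's infrastructure: `Episode`, `turns_needed` (a stand-in for when the environment would terminate), and the per-iteration `turn_budget` are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Episode:
    task_id: str
    turns_needed: int     # toy stand-in for the environment's termination point
    turns_done: int = 0

    @property
    def finished(self) -> bool:
        return self.turns_done >= self.turns_needed


def rollout_iteration(active: List[Episode], turn_budget: int) -> Tuple[List[Episode], List[Episode]]:
    """Advance each episode by at most turn_budget turns.

    Returns (finished, carried_over); carried-over episodes keep their
    progress and resume in the next RL iteration instead of blocking the batch.
    """
    finished, carried = [], []
    for ep in active:
        ep.turns_done = min(ep.turns_needed, ep.turns_done + turn_budget)
        (finished if ep.finished else carried).append(ep)
    return finished, carried


queue = [Episode("short", 20), Episode("long", 100)]
done_1, queue = rollout_iteration(queue, turn_budget=40)
done_2, queue = rollout_iteration(queue, turn_budget=40)
done_3, queue = rollout_iteration(queue, turn_budget=40)
print([e.task_id for e in done_1], [e.task_id for e in done_3])
```

In this toy run the short episode yields a training sample in the first iteration while the 100-turn episode completes in the third. A real implementation would presumably also have to account for the resumed segments having been sampled by older policy versions.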
K2.5 [D] (arXiv:2602.02276) — Zero-Vision SFT: text-only SFT alone activates visual agentic tool-use; hand-annotated visual CoT data hurts generalization (don’t spend annotation budget on the modality you’re activating; spend it where you already have depth). PARL / Agent Swarm — a trainable orchestrator + frozen sub-agents (instantiated from an earlier checkpoint); only the orchestrator gets gradient updates, explicitly to sidestep credit-assignment ambiguity across sub-agent calls.
reward = λ1·r_parallel (fights orchestrator collapsing back to single-agent)
+ λ2·r_finish (fights spawning many sub-agents without real decomposition)
+ r_perf (task-level outcome)
# λ1, λ2 annealed to zero -> final policy optimizes pure task success
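A small sketch of the annealed shaping reward above. The linear decay schedule, the starting weights of 0.5, and the function names are assumptions for illustration; the source only states that λ1 and λ2 are annealed to zero.

```python
def annealed_weight(step: int, total_steps: int, start: float) -> float:
    """Linearly decay a shaping weight from `start` to 0 over training (assumed schedule)."""
    frac = min(1.0, max(0.0, step / total_steps))
    return start * (1.0 - frac)


def parl_reward(
    r_parallel: float,   # shaping: rewards actually spawning parallel sub-agents
    r_finish: float,     # shaping: rewards sub-agents that complete real sub-tasks
    r_perf: float,       # task-level outcome
    step: int,
    total_steps: int,
    lam1_start: float = 0.5,
    lam2_start: float = 0.5,
) -> float:
    lam1 = annealed_weight(step, total_steps, lam1_start)
    lam2 = annealed_weight(step, total_steps, lam2_start)
    return lam1 * r_parallel + lam2 * r_finish + r_perf


for step in (0, 500, 1000):
    print(step, parl_reward(1.0, 1.0, 1.0, step=step, total_steps=1000))
```

Early in training the shaping terms dominate enough to pull the orchestrator away from the serial default; by the end only `r_perf` remains, so the final policy cannot profit from spawning sub-agents for their own sake.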
Designed to fix: pattern 2 / pattern 4. Sequential agentic execution has linear latency scaling and, per this project’s own audit, agents “pivot after a single failure instead of enumerating.” PARL’s whole premise is training a policy to decompose wide-search/enumeration tasks into parallel sub-agent calls rather than one brittle serial chain — a direct, working answer to “how do you reward decomposition without the model gaming step-count,” and structurally analogous to training an agent to enumerate broadly instead of guessing once and pivoting.
Toggle (token-efficient RL) alternates budget-limited and unconstrained phases, gated on accuracy already exceeding a threshold — fixes K2’s earlier length-overfitting failure mode (a rigid budget doesn’t generalize back up when a harder problem needs more room).
Headline gain: K2.5 Agent Swarm — 4.5x latency reduction with a simultaneous F1 gain (72.8%→79.0% WideSearch); K2 Thinking sustains 200-300 coherent tool calls vs. “prior models degrade after 30-50 steps” (a direct, quantified [L] claim). Tags: [L] yes throughout (partial rollout, PARL, 200-300-step coherence) · [E] temperature decay is named and explicit; PARL’s anti-serial-collapse term is the same shape of problem as entropy collapse solved with a shaping reward instead of an entropy bonus · [R] present, secondary — K2 itself is a non-thinking model · [N] the strongest in this file: Agent Swarm is not “faster at the same thing,” it’s a qualitatively different solution shape (parallel decomposition vs. any single sequential agent, however long-horizon).
GLM / Z.ai — the difficulty-curriculum + iterative self-distillation loop is a validated version of this project’s own plan
GLM-4.5 [D] (arXiv:2508.06471) — Stage 1: three domain specialists (Reasoning/Agent/General), each cold-start SFT’d separately (a domain-general model from scratch wastes RL exploration budget re-discovering what expert-labeled distillation data already gives free). Stage 2: self-distill into one generalist, with rejection sampling on the distillation data itself (strip malformed samples, verify correctness for objective answers, RM-filter subjective ones, verify tool-call trajectories reach a terminal state).
- Reasoning RL: GRPO, no KL term, three ablation-justified fixes:
- Two-stage difficulty curriculum — switch to problems that are pass@8=0 but pass@512>0 (hard-but-not-impossible); a static difficulty set goes stale as the policy improves, collapsing reward variance to all-0 or all-1 either way (zero gradient signal). This is the exact wall a ~1000-challenge portfolio at 100-200 solves is almost certainly sitting on for its hard tail.
- Single-stage RL at the full 64K target length — staged length scaling (8K→16K→…→64K) caused an irreversible unlearning of long-output generation that never recovered even when length was scaled back up. Direct lesson: don’t RL-train shorter than your SFT init’s horizon if the target task is long.
- Dynamic sampling temperature — raise temperature when rollout reward plateaus (their named signal for entropy collapse), gated by a max-1%-perf-drop bound on held-out validation.
- Iterative self-distillation (Agentic RL): RL to a plateau → distill the RL-improved policy’s own outputs into a fresh SFT checkpoint (replacing the original cold-start data) → resume RL on the stronger base with a harder curriculum → repeat.
This is the closest external validation of this project’s stated plan (“rejection-sampling SFT → GRPO/RLVR”) in the whole file — except Zhipu alternates SFT↔RL repeatedly rather than doing it once and switching permanently. Worth treating the handoff as a loop gated on reward plateauing, not a fixed step count.
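The stage-two difficulty filter (pass@8 = 0 but pass@512 > 0) can be approximated from logged samples. A minimal sketch, assuming each problem has 512 recorded boolean outcomes and that the first 8 of them stand in for the pass@8 estimate; the function name and that shortcut are not from the GLM-4.5 report.

```python
from typing import List


def is_hard_but_solvable(outcomes: List[bool], small_k: int = 8, large_k: int = 512) -> bool:
    """True when no success appears in the first small_k samples but at least
    one appears within large_k samples."""
    if len(outcomes) < large_k:
        raise ValueError(f"need at least {large_k} sampled outcomes, got {len(outcomes)}")
    return not any(outcomes[:small_k]) and any(outcomes[:large_k])


def with_success_at(index: int, n: int = 512) -> List[bool]:
    """Helper for the demo: n failures with a single success at `index`."""
    row = [False] * n
    row[index] = True
    return row


print(is_hard_but_solvable(with_success_at(100)))   # rare success: keep for stage two
print(is_hard_but_solvable([False] * 512))          # never solved: no signal, skip
print(is_hard_but_solvable(with_success_at(3)))     # solved within 8 tries: too easy for this stage
```

Re-running the filter as the policy improves is what keeps the set from going stale: a problem that graduates to frequent success drops out, and a previously unsolved one may enter.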
GLM-5 [D] (arXiv:2602.15763) — the recipe shape changed: a sequential RL pipeline (Reasoning RL → Agentic RL → General RL) with On-Policy Cross-Stage Distillation blended throughout, replacing 4.5’s expert-then-unify structure, specifically to prevent catastrophic forgetting between stages. New async RL infra + double-sided importance sampling with hard-masking — tokens whose importance ratio falls outside [1-ε_l, 1+ε_h] are zeroed, not soft-clipped — explicitly motivated by policy drift compounding across long agentic trajectories before the (often sole) terminal reward arrives.
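The hard-masking rule can be illustrated on a single trajectory. This is a sketch under assumptions, not Zhipu's code: the bounds `eps_low=0.2` and `eps_high=0.28`, the function names, and normalizing by the full token count are all choices made for the example.

```python
import math
from typing import List, Tuple


def hard_mask(
    logp_new: List[float],
    logp_old: List[float],
    eps_low: float = 0.2,    # assumed lower bound width
    eps_high: float = 0.28,  # assumed upper bound width
) -> List[Tuple[float, float]]:
    """Per token, return (importance_ratio, keep) where keep is 1.0 inside
    [1 - eps_low, 1 + eps_high] and 0.0 outside."""
    out = []
    for n, o in zip(logp_new, logp_old):
        ratio = math.exp(n - o)
        keep = 1.0 if (1.0 - eps_low) <= ratio <= (1.0 + eps_high) else 0.0
        out.append((ratio, keep))
    return out


def masked_pg_loss(logp_new: List[float], logp_old: List[float], advantage: float) -> float:
    """Policy-gradient surrogate where out-of-range tokens contribute exactly zero."""
    pairs = hard_mask(logp_new, logp_old)
    return sum(-(ratio * advantage) * keep for ratio, keep in pairs) / len(pairs)


new = [0.0, math.log(1.5), math.log(0.5)]
old = [0.0, 0.0, 0.0]
print([keep for _, keep in hard_mask(new, old)])
print(masked_pg_loss(new, old, advantage=1.0))
```

Unlike a soft clip, which still lets a clipped token contribute a bounded term, a masked token drops out of the update entirely, so drift accumulated over a long trajectory cannot leak into the gradient through tokens the sampler and learner disagree on.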
Designed to fix: pattern 4. Zhipu explicitly names Vending-Bench 2 and CC-Bench-V2 (“long-term coherence in agents”) as the benchmarks this release targets — a lab measuring the [L] axis directly, not incidentally.
Headline gain: GLM-5 ~20% average improvement over GLM-4.7 across 8 benchmarks; SWE-bench Verified 73.8→77.8; first open-weights model to hit 50 on Artificial Analysis Intelligence Index v4.0. Tags: [E] strong — dynamic temperature + difficulty curriculum are both explicit, named entropy-collapse countermeasures · [L] strong in GLM-5 (double-sided IS + hard masking is a credit-assignment fix aimed squarely at long trajectories) · [R] the most-developed program in GLM-4.5’s report · [N] weak-to-moderate — mostly amplification via expert-iteration/distillation, not boundary expansion.
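The GLM-5 hard-masking rule (zero a token whose importance ratio leaves [1-ε_l, 1+ε_h], instead of clipping it) can be sketched as below. This is an illustration under stated assumptions: the epsilon values, the function names, and the plain clamp used for contrast are hypothetical, and the clamp is a simplification of PPO-style clipping, not GLM-5's or PPO's exact objective.

```python
import math


def importance_ratios(logp_new, logp_old):
    """Per-token ratio pi_new(token) / pi_old(token), computed from log-probabilities."""
    return [math.exp(new - old) for new, old in zip(logp_new, logp_old)]


def hard_mask_weights(logp_new, logp_old, eps_low=0.2, eps_high=0.28):
    """Keep the ratio when it lies inside [1 - eps_low, 1 + eps_high]; otherwise zero the token."""
    low, high = 1 - eps_low, 1 + eps_high
    return [r if low <= r <= high else 0.0
            for r in importance_ratios(logp_new, logp_old)]


def clamped_weights(logp_new, logp_old, eps_low=0.2, eps_high=0.28):
    """For contrast only: clamp the ratio to the band, so off-band tokens still contribute."""
    low, high = 1 - eps_low, 1 + eps_high
    return [min(max(r, low), high)
            for r in importance_ratios(logp_new, logp_old)]


# Three tokens: unchanged, drifted up by 1.5x, drifted down by 0.5x.
logp_old = [-1.2, 0.0, 0.0]
logp_new = [-1.2, math.log(1.5), math.log(0.5)]
print(hard_mask_weights(logp_new, logp_old))  # drifted tokens contribute nothing
print(clamped_weights(logp_new, logp_old))    # drifted tokens are pulled to the band edge
```

With hard masking, tokens that have drifted far from the sampling policy are removed from the update entirely, which matches the stated motivation of limiting drift that compounds over long trajectories.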
Xiaomi (MiMo) — MOPD is a genuinely new algorithm, not a GRPO variant with a new name
MiMo-V2-Flash [D] (arXiv:2601.02780) — 3-stage post-training, and stage 3 is the interesting one. Problem named directly: “the see-saw effect” — naive multi-skill post-training improves one capability at the cost of another. Fix: Multi-Teacher On-Policy Distillation (MOPD).
- Stage 1 — SFT to “activate latent capabilities acquired during pretraining” (not teach new ones). Notable operational detail: num-zeros (count of MoE params with zero gradient) is their leading indicator of SFT instability — rising = expert-load-balance collapse, falling = overfitting.
- Stage 2 — train a suite of narrow domain-specialist teachers via independent RL (agentic: search/coding/tool-use; non-agentic: math/reasoning/safety).
- Stage 3 — MOPD: the student rolls out on its own policy distribution (not an offline distillation set, not weight merging) and receives dense, token-level reward = KL-divergence against each teacher’s logits, plus a verifiable outcome reward.
Result: the student mostly matches/beats the best individual teacher without the see-saw across the full skill set simultaneously (one model, not N specialist checkpoints). It is not a free lunch (a couple of regressions), but it is net positive.
Per-token reward, in pseudocode: reward_t = -KL(student_logits_t || teacher_logits_t) + verifiable_outcome_reward (the student samples on-policy; teachers never generate the training data directly).
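The per-token reward described for Stage 3 can be made concrete with a small, dependency-free sketch. It is a reading of the formula as written in these notes, not Xiaomi's code: the function names are hypothetical, only one teacher is scored per token, and adding the outcome reward to every token is an assumption about how the two terms combine.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(student_logits, teacher_logits):
    """KL(student || teacher) for a single token position."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


def mopd_token_rewards(student_logits_seq, teacher_logits_seq, outcome_reward):
    """reward_t = -KL(student_t || teacher_t) + outcome_reward, one value per token.

    The logits come from scoring a sequence the *student* sampled; the teacher
    only scores that sequence and never generates it.
    """
    return [-kl_divergence(s, t) + outcome_reward
            for s, t in zip(student_logits_seq, teacher_logits_seq)]


# Two token positions over a toy 2-word vocabulary.
student = [[0.0, 0.0], [2.0, 0.0]]
teacher = [[0.0, 0.0], [0.0, 2.0]]
print(mopd_token_rewards(student, teacher, outcome_reward=1.0))
```

Where student and teacher agree, the token keeps the full outcome reward; where they disagree, the KL term subtracts from it, which is what makes the signal dense.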
- Agentic RL scaffold [D]: deliberately minimal — 3 atomic tools (bash, str_replace, finish), no prescribed workflow in the system prompt, “allowing the model to discover best practices during training” — an independently-arrived-at instance of this project’s own “light framing beats heavy scaffolding” rule.
Designed to fix: pattern 4. MOPD is a documented mechanism for fusing narrow specialists (e.g. an enumeration-specialist and an exploitation-specialist RL checkpoint) into one policy without one specialist’s regressions bleeding into the other — a direct, algorithmic answer to “uneven PTES phases.”
MiMo-V2.5-Pro [D, undisclosed quantitatively] — same MOPD pipeline scaled to 1T params/1M context; qualitative claim of “thousand-tool-call” coherence (worked example: 672 tool calls / 4.3 hrs building a compiler, self-correction after a mid-run regression at turn 512).
Headline gain: matches Kimi-K2-Thinking / DeepSeek-V3.2-Thinking on most reasoning benchmarks at 1/2-1/3 the params; 73.4% SWE-Bench Verified. Tags: [L] explicit design target (“sustains complex trajectories”) · [E] explicit — minimal scaffold + on-policy (not offline) sampling are both named exploration-preserving choices · [R] the non-agentic teacher/reasoning stage · [N] the strongest algorithmic novelty in the file after Kimi’s Agent Swarm — MOPD’s token-level on-policy KL-distillation is a genuinely different move from rejection-sampling SFT, GRPO, or plain distillation, even though the agentic RL environments themselves (unit tests, visual verifiers) are recombinations of known reward-design patterns, not new task surfaces.
The transferable lessons (updated for ten labs)
- The method set is small and shared, and it has calcified further, not diversified. SFT · rejection-sampling · DPO-family · GRPO/GSPO/PPO-family · RLVR · RLAIF are the entire vocabulary across ten labs and ~40 releases. Every “new algorithm” this year (clip-higher, GSPO, PARL, MOPD, the four DeepSeek stabilizers) is a variation on group-relative policy optimization, not a new paradigm.
- Ordering and SFT-dosage are the live design choice. Llama 4 → thin-SFT/heavy-RL/thin-DPO; Magistral Medium → zero SFT; Zhipu → cold-start-just-enough-for-signal, then iterate SFT↔RL repeatedly rather than once. The Llama-4-era finding (“heavy SFT/DPO restricts RL exploration”) is now independently re-derived by three more labs.
- The frontier is agentic-RL-environment design, not new losses. Qwen3.7-Max’s decoupled Task/Harness/Verifier infra, Kimi’s PARL, DeepSeek’s 1,827-environment agentic-task synthesis, Xiaomi’s minimal-scaffold discipline, GLM’s iterative self-distillation loop — none of these are algorithm papers, all of them are environment/data/scaffold engineering. See Agentic RL.
- Exploration/entropy-collapse resistance is where disclosure is thinnest and most valuable when it exists. OpenAI, Google, and xAI disclose essentially nothing on this axis despite running RLVR at a scale where it’s a known risk. Where labs do disclose a mechanism (Mistral’s clip-higher, DeepSeek’s four stabilizers, GLM’s dynamic temperature + difficulty curriculum, Kimi’s temperature decay, Xiaomi’s minimal-scaffold + on-policy sampling), it’s the single most directly reusable material in this file for a project whose stated risk is exactly entropy collapse.
- Ground-truth-verified reward is now cross-lab-confirmed discipline, not an idiosyncratic project choice — DeepSeek’s hard-negative search-agent filter, Qwen3-Coder’s “hard-to-solve, easy-to-verify” framing, GLM’s format-gate-before-outcome-reward, Mistral’s rejected partial-credit code reward, Xiaomi’s rule-based-only embodied reward. Where a lab relaxes this (Grok 4.1’s agentic-LLM-judge RLAIF), the lab’s own safety card shows a measurable honesty/sycophancy regression — treat as a documented cautionary tale, not a competing recipe.
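A format gate in front of a binary, ground-truth-checked outcome reward can be sketched as follows. This is a minimal illustration only: the `<answer>` tag convention, the function name, and scoring a format failure the same as a wrong answer are assumptions, not any lab's disclosed reward function.

```python
import re

# Hypothetical output convention: the final answer must sit inside <answer>...</answer>.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def gated_reward(completion: str, expected: str) -> float:
    """Return 1.0 only for a well-formed completion whose answer matches exactly.

    There is no partial credit and no judge model: a missing answer tag fails
    the gate before the outcome is checked at all.
    """
    match = ANSWER_PATTERN.search(completion)
    if match is None:
        return 0.0  # format gate failed; the outcome check is never reached
    return 1.0 if match.group(1).strip() == expected else 0.0


print(gated_reward("recon done. <answer>flag{abc}</answer>", "flag{abc}"))
print(gated_reward("the flag is probably flag{abc}", "flag{abc}"))
```

A real recipe may penalise format failures differently from wrong answers; here both score 0.0 to keep the example short.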
Summary table
| Lab | Flagship (recipe source) | SFT | RL algorithm | L / E / R / N |
|---|---|---|---|---|
| Llama (historical) | Llama 4 Maverick | Thin, LLM-judge-filtered | Heavy online RL → thin DPO | R |
| Anthropic | Claude Opus 4.5 | RLHF/RLAIF (undisclosed detail) + SDF | Inoculation-prompted RL (algorithm undisclosed) | L, N |
| OpenAI | GPT-5.1-Codex-Max | Undisclosed (safe-completions is the one named technique) | RLVR-shaped agentic RL + compaction training (algorithm undisclosed) | L, R |
| Google | Gemini 3 Pro | CAI-inspired adversarial SFT | RLF (DRM + Critic) + verifiable-reward/agentic RL track | L, R |
| xAI | Grok 4.1 Fast | Minor | Long-horizon multi-turn RL, RLVR-domain-expanded | L, R |
| Mistral | Magistral Medium | None | GRPO, no-KL, clip-higher, 4-part reward | R, E, N |
| DeepSeek | DeepSeek-V3.2 | Specialist distillation | GRPO, merged reasoning+agent+alignment, 4 stabilizers | L, E, R |
| Qwen | Qwen3.7-Max | Long-CoT cold-start (base recipe) | GSPO-lineage + decoupled Task/Harness/Verifier RL | L, E, N |
| Kimi (Moonshot) | Kimi K2.5 | Rejection-sampled agentic trajectories; Zero-Vision SFT | REINFORCE/GRPO-adjacent + PARL (Agent Swarm) | L, E, N |
| GLM (Zhipu) | GLM-5 | Expert-then-unify (4.5) → cross-stage on-policy distillation (5) | GRPO, difficulty curriculum, dynamic temperature, double-sided IS | L, E, R |
| Xiaomi (MiMo) | MiMo-V2.5-Pro | Activate-latent-capability SFT | MOPD (multi-teacher on-policy KL distillation) | L, E, N, R |
Provenance caveat, restated and strengthened: disclosure quality varies by an order of magnitude across this table. DeepSeek, Qwen, Mistral, Kimi, GLM, and Xiaomi publish arXiv tech reports with ablation tables; treat their mechanism claims as [D] high confidence. Anthropic, OpenAI, Google (post-2.5), and xAI disclose mechanism only in system-card prose or blog posts, often one paragraph per model, and never name the capability-RL algorithm, hyperparameters, or reward-model architecture; treat their “recipe” as mostly inferred continuity except where a technique is explicitly flagged [D] above (safe-completions, compaction, RLF’s DRM+Critic split, inoculation prompting, long-horizon multi-turn RL). Every arXiv id in this file was live-verified against arxiv.org/abs/<id> during the research pass that produced the per-lab notes this chapter is built from; none were fabricated or carried over from training-data memory. Lab recipes change fast, so re-verify before betting a training run on a specific ordering or hyperparameter.