Contested edges & landmines

The places where confident-sounding claims are actually unsettled, plus the terminology traps that cause real planning errors. Cited so you can check me.

1. “RL can’t create capability” — contested, not a law

  • Elicit-not-expand (the base claim): RLVR raises pass@1, but the base model beats the RL model on pass@k at large k. The paths RL finds were already in the base distribution, and the reasoning boundary narrows with training. Distillation from a stronger teacher does expand it; RL does not (Yue et al., “Does RL Really Incentivize Reasoning Capacity Beyond the Base Model?”, arXiv:2504.13837, NeurIPS 2025). Reproduced for vanilla GRPO by others (e.g. NuRL notes plain GRPO leaves pass@1024 ≈ base, arXiv:2509.25666).
  • The counter: prolonged RL with KL control and reference resets expands the boundary even on problems the base model never solves (ProRL, arXiv:2505.24864); entropy/exploration bonuses and parameter-space noise (arXiv:2602.02555) show similar effects. Also a metric critique: pass@k over-credits lucky-but-wrong CoT (CoT-Pass@K, arXiv:2506.14245).
  • Safe framing: vanilla RLVR at a normal budget elicits; sufficient compute + explicit exploration-preservation can expand — recipe-dependent, not settled. So “you can’t RL your way out of a missing capability” holds for standard-recipe RL only. Don’t state it as a law.
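
The crossover argument above depends on how pass@k is estimated from a finite sample. A minimal Python sketch, assuming the widely used unbiased estimator 1 − C(n−c, k)/C(n, k) (n samples per problem, c of them correct); the function names and the toy counts are hypothetical and are not taken from any of the cited papers.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimate over a problem set."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical toy data: correct samples out of n=16 on four problems.
base_counts = [1, 1, 1, 1]   # base: rare successes spread over every problem
rl_counts = [16, 16, 0, 0]   # RL: reliable on two problems, never on the others

for k in (1, 16):
    base = mean_pass_at_k(base_counts, 16, k)
    rl = mean_pass_at_k(rl_counts, 16, k)
    print(f"k={k}: base={base:.4f} rl={rl:.4f}")
```

In this toy setup the RL counts win at k=1 and the base counts win at k=16, which is the shape of the crossover plot. It illustrates the metric only; it says nothing about which real recipe produces which curve.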

2. The “RFT” terminology landmine

Two different things share the acronym; conflating them mis-scopes a whole plan:

  • Rejection-sampling Fine-Tuning (STaR-family) = “RL without RL,” positives-only SFT on your own verified samples. Cheap.
  • Reinforcement Fine-Tuning (OpenAI/Fireworks product term — and what the project handbook calls “RFT”) = actual online RL / GRPO against a grader. Expensive.

“Start with RFT before RL” only parses under the first. Say “rejection-sampling SFT” for the cheap thing so you don’t accidentally spec a GRPO run.
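
To make the cheap reading concrete: rejection-sampling SFT is just "sample, keep what verifies, fine-tune on the survivors." A minimal sketch of the data-building step, where `sample` and `verify` are hypothetical stand-ins for your own generator and ground-truth checker; no policy-gradient update appears anywhere.

```python
from typing import Callable

def build_rejection_sft_set(
    prompts: list[str],
    sample: Callable[[str], str],
    verify: Callable[[str, str], bool],
    attempts_per_prompt: int = 8,
) -> list[dict[str, str]]:
    """Keep only verified completions; the result is an ordinary SFT dataset."""
    kept = []
    for prompt in prompts:
        for _ in range(attempts_per_prompt):
            completion = sample(prompt)
            if verify(prompt, completion):  # positives only, no negatives stored
                kept.append({"prompt": prompt, "completion": completion})
    return kept

# Toy stand-ins so the sketch runs: a canned "model" and an exact-match checker.
canned = iter(["3", "4", "5", "4"])
data = build_rejection_sft_set(
    prompts=["2+2="],
    sample=lambda prompt: next(canned),
    verify=lambda prompt, completion: completion == "4",
    attempts_per_prompt=4,
)
print(data)
```

Everything downstream of this function is plain SFT, which is why it costs so much less than an online GRPO run against a grader.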

3. On-policy distillation is promising, not lab-proven

Earlier framing called it “the sleeper.” Correction after a 2026 verification pass: the efficiency numbers (~9–30× cheaper than RL) come from GKD + Thinking Machines’ own blog (arXiv:2306.13649; thinkingmachines.ai 2025-10-27). No frontier lab has stated it as their production recipe. Real and attractive; treat the numbers as directional, not settled.

4. Don’t over-SFT before RL (2026 lesson)

Meta’s Llama 4 recipe deliberately keeps SFT and DPO lightweight around an intensive online-RL core, with the explicit finding that heavy SFT/DPO restricts RL exploration (ai.meta.com/blog/llama-4-multimodal-intelligence). If RL is your capability driver, a big SFT stage can cap your ceiling — counter to the naive “more SFT is safer.”

5. Reward must be ground-truth-verified, never format-matched

A project-empirical finding: SFT on trajectory data trained the model to emit FLAG{…}-shaped strings on unsolved challenges — confabulation — and a loose regex matcher fired on the model’s own reasoning/tool-args, not real server output (lessons/post-training/sft-induced-flag-confabulation.md, lessons/security-agent/flag-detection-false-positives.md, shared memory). Rule: scan tool output / server state for the flag and verify against ground truth; never reward format. On the gameability ladder, a deterministic verifier (level 1) has no parameters to exploit — stay there; PRM (level 5) was rejected for R1 for exactly this (arXiv:2501.12948).
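
The rule can be written as a deterministic check. A minimal sketch, assuming the harness records each trajectory as a list of events tagged by origin; the `Event` type, the source labels, and the example flag are all hypothetical and not this project's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    source: str  # hypothetical labels: "tool_output", "model_reasoning", "tool_args"
    text: str

def flag_reward(events: list[Event], true_flag: str) -> int:
    """Return 1 only if the exact ground-truth flag appears in tool output."""
    for event in events:
        # Text the model wrote itself (reasoning, tool arguments) never counts.
        if event.source == "tool_output" and true_flag in event.text:
            return 1
    return 0

TRUE_FLAG = "FLAG{3f9a-example}"  # hypothetical ground truth held by the harness

confabulated = [
    Event("model_reasoning", "so the flag must be FLAG{admin_1234}"),
    Event("tool_args", "submit FLAG{admin_1234}"),
]
decoy = [Event("tool_output", "found FLAG{decoy}")]
solved = [Event("tool_output", "uid=0(root) FLAG{3f9a-example}")]

print(flag_reward(confabulated, TRUE_FLAG), flag_reward(decoy, TRUE_FLAG), flag_reward(solved, TRUE_FLAG))
```

There is no regex for the FLAG{…} shape anywhere in the check, so a flag-shaped guess scores zero whether it shows up in reasoning, tool arguments, or tool output.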

6. “Three knobs” was a teaching scaffold

The on/off-policy axis and the imitation/preference/reward paradigm split are canonical. Packaging them as “N independent knobs you toggle” is a scaffold that over-reached: the axes aren’t independent (signal + policy largely determine what changes), so free combinations produce non-methods. Learn the on/off-policy axis plus the fixed method presets, not a combinatorial grid.

7. Pass@k-as-diagnostic has its own landmines — don’t trust a bare crossover plot [E][R]

Point 1’s crossover test (base pass@large-k beats RL pass@large-k → RL only reweighted, didn’t teach) is the closest thing to a standard instrument in this literature, but it is contested on at least four fronts, each of which changes what conclusion you’re entitled to draw from your own portfolio’s pass@k curves:

  • The metric credits lucky final answers. CoT-Pass@K, arXiv:2506.14245 shows pass@k gives full credit to a correct terminal answer reached via a wrong reasoning chain — once you require the CoT itself to be correct, the base-vs-RL crossover disappears and RLVR shows monotonic gains at every k. This directly re-opens point 1’s “elicit vs expand” question: some of the reported “RL doesn’t expand the boundary” results may be an artifact of not checking how the answer was reached. Your flag verifier removes the lucky-final-string confound (exact match, not “42” appearing by chance) — but a verifier-passed CTF trajectory can still contain wasted turns or an ungrounded critical guess before the winning move (BSides pattern 3), so the same confound reappears one level up: filtering your rejection-sampling SFT set on flag==1 alone is the CTF-domain analogue of trusting bare pass@k.
  • Pass@k at large k conflates “solvable” with “brute-force-guessable.” Cover@τ, arXiv:2510.08325 proposes requiring a τ-fraction of samples (not just ≥1 of many) to be correct — reordering which RLVR algorithms “win” once you penalize guessing instead of rewarding eventual luck. Directly relevant to BSides pattern 5 (benchmarks measure pattern-match speed, not thoroughness): a challenge with high pass@64 but near-zero Cover@0.3 is guessing-dominated, and training on its “wins” teaches confident guessing, not competence.
  • Optimizing pass@k directly has a vanishing-gradient trap. Naive pass@k-as-objective is mathematically a per-example reweighting of plain pass@1 whose gradient goes to zero exactly where exploration is most needed — once the policy concentrates, pass@k and pass@1 converge and there’s nothing left to reweight (arXiv:2511.16231). PKPO, arXiv:2505.15201 fixes this with an unbiased, low-variance pass@k gradient estimator — so “just add a pass@k reward” is not a free entropy-preservation trick; it needs the right estimator or it’s a no-op past the point you needed it.
  • The agentic extension changes the verdict entirely. Pass@(k,T), arXiv:2604.14877 — a two-axis metric varying sampling budget k and interaction-depth T — finds the static-reasoning crossover (point 1) is task-structure-dependent: on tasks needing compositional, sequentially-gated information-gathering, the RL pass-curve pulls above and away from base as k grows (the opposite of Yue et al.’s finding), while on independent-retrieval tasks the effect is small, replicating Yue et al. This is the strongest bridge between the pure-reasoning literature and this project’s own agentic setting — see point 9 below.
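
The "per-example reweighting" reading of the pass@k objective follows from a one-line calculation. Assuming k independent samples with per-problem success probability p (an idealization, not the derivation used in either cited paper), pass@k = 1 − (1 − p)^k, so its derivative with respect to p is k(1 − p)^(k−1): the pass@1 gradient scaled by a per-problem weight. The sketch below only tabulates that weight; the function name is hypothetical.

```python
def pass_at_k_weight(p: float, k: int) -> float:
    """d/dp of 1 - (1 - p)**k: the factor multiplying the pass@1 gradient."""
    return k * (1.0 - p) ** (k - 1)

for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"p={p:<5} weight at k=8: {pass_at_k_weight(p, 8):.6f}")
```

The weight falls toward zero as p approaches 1, so on problems where the policy has already concentrated, the naive objective adds nothing beyond pass@1. At k=1 the weight is identically 1, which recovers plain pass@1.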

Designed to fix: pattern 5 (benchmarks measure pattern-match speed, not thoroughness) — CoT-Pass@K and Cover@τ are both direct literature-side instantiations of that same audit finding, applied to the training/eval metric itself rather than the benchmark.

Safe framing: treat the pass@1-vs-pass@k gap as a diagnostic to segment by challenge type, not a single portfolio-wide verdict. A shrinking gap with flat pass@1 is the entropy-collapse warning sign (point 9 below), not “the model learned the task.” Never conclude “RL only elicited, didn’t expand” from an aggregate pass@k plot without checking (a) whether the crossover survives a CoT-correctness filter and (b) whether it’s driven by the compositional or independent-task subset.
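
A segmented report is cheap to compute from the per-challenge correct counts. A minimal sketch, assuming each challenge has n samples and a type label; `frac_reliable` is a simplified threshold check (share of challenges whose per-sample success rate is at least τ) modeled on the description of Cover@τ above, not that paper's exact definition. All names and the toy data are hypothetical.

```python
from collections import defaultdict
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def segment_report(results: list[tuple[str, int]], n: int, k: int, tau: float) -> dict:
    """results holds (challenge_type, correct_count) pairs, n samples each."""
    by_type = defaultdict(list)
    for challenge_type, correct in results:
        by_type[challenge_type].append(correct)
    report = {}
    for challenge_type, counts in by_type.items():
        m = len(counts)
        report[challenge_type] = {
            "pass@1": sum(c / n for c in counts) / m,
            f"pass@{k}": sum(pass_at_k(n, c, k) for c in counts) / m,
            # Share of challenges solved in at least a tau-fraction of samples.
            "frac_reliable": sum(c / n >= tau for c in counts) / m,
        }
    return report

# Hypothetical toy data: 64 samples per challenge.
results = [
    ("compositional", 40), ("compositional", 0),
    ("independent", 1), ("independent", 2),
]
report = segment_report(results, n=64, k=64, tau=0.3)
for challenge_type, row in report.items():
    print(challenge_type, row)
```

In the toy data the "independent" segment has perfect pass@64 and zero reliable challenges, the guessing-dominated signature; an aggregate plot would have hidden it.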

8. Long-horizon credit assignment: turn-level vs trajectory-level vs sequence-level — no consensus on the right granularity [L]

Every flagship reasoning-RL recipe (GRPO, DAPO, Dr. GRPO) assigns one advantage to the whole trajectory. That is fine for a single-turn math answer, but it starts to matter once episodes run 100 turns with a terminal-only flag reward. The literature has responded with at least five different, non-convergent fixes, each validated on a different (mostly non-CTF, mostly short) domain:

  • Go finer: turn-level advantage. arXiv:2505.11821 shows trajectory-level GRPO applied naively to multi-turn tool-use can fail to teach tool invocation at all (baselines get 20-30% exact-match and never learn to call tools; the authors’ turn-level MT-GRPO variant hits 100% tool-execution success). Turn-PPO, arXiv:2512.17008 independently argues PPO-with-a-critic, reformulated so the MDP’s base unit is a turn (not a token), is more robust than GRPO for long-horizon agentic tasks, a direct challenge to defaulting to critic-free GRPO. Both are recent and small-scale (workshop poster / 0-citation preprint, toy benchmarks WebShop/Sokoban): promising, not settled.
  • Go coarser: sequence-level ratio. GSPO, arXiv:2507.18071 goes the opposite direction: clip and optimize at the whole-response (sequence) level instead of per-token, because token-level importance ratios compound multiplicatively over long sequences and destabilize MoE RL training at scale. It is credited with keeping Qwen3’s RL stage stable. Because it is backed by a shipped frontier model, not a toy benchmark, the evidence is stronger than Turn-PPO’s, but it solves a different problem (numerical stability of the ratio, not credit assignment across turns): the two proposals are not mutually exclusive, but they are not the same fix either.
  • Fix the critic, don’t change the granularity. VAPO, arXiv:2504.05118 keeps trajectory-level PPO but argues the real problem is an unreliable critic at long/heterogeneous horizons — fixed with value-model pretraining + length-adaptive GAE (λ tuned per response length) — reporting zero training crashes across independent runs, directly disputing the “value-based RL is unstable for LLM reasoning” folklore GRPO was invented to route around.
  • Scale the horizon itself, skip the value function entirely. Kimi k1.5, arXiv:2501.12599 treats context/horizon length as a first-class RL scaling axis (not a constraint), using partial rollouts (checkpoint/resume mid-episode) to make 128k-context RL tractable — explicitly avoiding MCTS, value functions, and PRMs. Reframes “the episode is long” as an opportunity, contingent on whether security-agent-<family>/pdq can support partial-rollout checkpointing (unanswered today).
  • Curriculum over the horizon. AgentGym-RL / ScalingInter-RL, arXiv:2509.08755 sidesteps the granularity debate entirely: cap the allowed turn budget low early in training and relax it toward the full target (100 turns) as training proceeds, reporting this prevents the long-horizon collapse that training at full horizon from step one causes.
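
Two of the ideas above are easy to see in miniature: what a single trajectory-level advantage looks like once it is spread over every turn, and what a relaxing turn budget looks like. A minimal sketch, assuming GRPO-style group normalization (reward minus group mean, divided by group standard deviation; the population-std choice here is an assumption) and a linear schedule. The start cap of 10 and the linear shape are hypothetical, and the cited paper's actual schedule is not reproduced.

```python
from statistics import mean, pstdev

def trajectory_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """One group-normalized advantage per trajectory (terminal reward only)."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

def broadcast_to_turns(advantage: float, num_turns: int) -> list[float]:
    """Trajectory-level credit: every turn receives the same advantage."""
    return [advantage] * num_turns

def turn_budget(step: int, total_steps: int, start_cap: int = 10, target_cap: int = 100) -> int:
    """Linearly relax the allowed turns from start_cap to target_cap."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return round(start_cap + frac * (target_cap - start_cap))

# Four rollouts of one challenge; only the first captured the flag.
advantages = trajectory_advantages([1.0, 0.0, 0.0, 0.0])
per_turn = broadcast_to_turns(advantages[0], num_turns=100)
print(advantages[0], len(per_turn), len(set(per_turn)))
print([turn_budget(s, 1000) for s in (0, 500, 1000)])
```

The winning rollout's 100 turns all carry an identical advantage, so a wasted turn and the decisive one are credited equally; that is the granularity problem the turn-level proposals target. A group with no successes produces all-zero advantages and therefore no learning signal.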

Designed to fix: pattern 1 (agents prefer their own raw tools, 87.7% bypass the rich surface, 26/40 tools dead) — the turn-level papers’ headline finding (trajectory-level GRPO can fail to teach tool invocation at all) is a plausible root-cause mechanism for tool disuse, not just a scaffolding/prompting issue, if this project ever RL-trains on tool-use.

Confidence: contested by construction — no single paper compares turn-level vs sequence-level vs trajectory-level-with-a-better-critic head-to-head on the same long-horizon agentic benchmark. Most of this cluster is 2025 H2–2026 preprints with 0 citations at verification time, validated on toy environments (WebShop, Sokoban) or math/code, not a 100-turn CTF setting. Treat as “candidate designs to pilot cheaply,” not a default architectural choice — and note turn-level, sequence-level, and critic-repair are not mutually exclusive; a future GRPO/RLVR run here could combine GSPO’s sequence-level clipping (stability) with a turn-level auxiliary tool-invocation reward without contradiction.
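
The stability argument for sequence-level ratios is easier to see with numbers. A minimal sketch comparing the raw product of token-level importance ratios with a length-normalized sequence ratio (the geometric mean of the token ratios); I believe this matches the form GSPO uses, but treat that as an assumption and check the paper before relying on it. The log-probabilities are synthetic.

```python
import math

def product_ratio(logp_new: list[float], logp_old: list[float]) -> float:
    """Unnormalized product of per-token importance ratios."""
    return math.exp(sum(a - b for a, b in zip(logp_new, logp_old)))

def sequence_ratio(logp_new: list[float], logp_old: list[float]) -> float:
    """Length-normalized sequence ratio: geometric mean of token ratios."""
    diffs = [a - b for a, b in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

# Synthetic example: every token is just 1% more likely under the new policy.
num_tokens = 2000
logp_old = [-2.0] * num_tokens
logp_new = [-2.0 + math.log(1.01)] * num_tokens

print(f"product of token ratios: {product_ratio(logp_new, logp_old):.3e}")
print(f"sequence-level ratio:    {sequence_ratio(logp_new, logp_old):.4f}")
```

A 1% per-token drift compounds into a ratio in the hundreds of millions over 2000 tokens, while the normalized ratio stays at 1.01 and can be clipped once per response. This addresses ratio stability only; it does not tell you which turn earned the reward.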

9. Do exploration bonuses genuinely EXPAND [N] the boundary, or just elicit what the base model already has? — open question, no paper has run the decisive test on this domain

Point 1 already flags “RL expands vs. only reweights” as contested between Yue et al. and ProRL. The exploration-specific papers (DIVER, CDE, MERCI, PSN-RLVR) all claim their intrinsic-reward/parameter-noise mechanism helps the policy “escape local routines” or “discover better solutions”, language that asserts [N] (boundary expansion) rather than mere elicitation. That claim deserves the same skepticism point 1 applies to vanilla RLVR:

  • None of the exploration-bonus papers ran the decisive ablation. The rigorous test for “did this expand the boundary or just elicit/redistribute” is a pass@large-k comparison against the base model (point 1’s own protocol) — DIVER (arXiv:2509.26209), CDE (arXiv:2509.09675), MERCI (arXiv:2510.16614), and PSN-RLVR (arXiv:2602.02555) all compare against vanilla-GRPO/DAPO baselines, not against a very-large-k base-model ceiling. Beating a collapsed-entropy baseline is a much lower bar than beating the base model’s own pass@1024 — and per Spurious Rewards, arXiv:2506.10947, even a completely wrong reward can look like it’s “unlocking” capability on the right base model (Qwen2.5-Math specifically; does not replicate on Llama3/OLMo2) — a stark warning that “the policy now solves things it didn’t before” is not sufficient evidence of genuine novelty without a same-model-family, large-k, base-vs-trained comparison.
  • The capability-elicitation / AI-safety literature already built the falsification protocol for exactly this question — password-locked models (arXiv:2405.19550), harder circuit-broken organisms (The Elicitation Game, arXiv:2502.02180), and the “elicit within <1% of training cost” operational definition of latent capability (AI Sandbagging, arXiv:2406.07358) — all built around known-ground-truth hidden capabilities specifically to distinguish “the technique surfaced something already there” from “the technique taught something new.” No exploration-bonus paper in the RLVR-entropy literature has been tested against a model organism with known injected/withheld capability the way this sub-field requires before it will accept an expansion claim.
  • The one paper that ran something close to the decisive test, on an agentic/compositional task, found genuine expansion — but it’s a single, very recent result. Pass@(k,T), arXiv:2604.14877 (point 7 above) shows RL pulls ahead of base-model pass@k on compositional, sequentially-gated tasks, and — critically — that matched-data SFT on the same tasks regresses the capability boundary (net −4 vs RL’s net +4), isolating self-directed exploration during RL, not exposure to more data, as the causal factor. This is the strongest evidence in the entire corpus that exploration specifically (not RL in general, not more data) is what expands the boundary — but it is one paper (2026-04-16), one research group, unreplicated, studying retrieval-style compositional tasks, not cybersecurity.

Designed to fix: pattern 1 (tool avoidance) and pattern 3 (good guessers until they’re not) — DIVER’s pairwise-diversity-of-a-group reward and CDE’s perplexity-based actor bonus are both pitched as countering exactly these behavioral patterns. That framing may be correct as an elicitation mechanism (surfacing tool-diverse or better-calibrated behavior the base model can already produce with a nudge) even if the “expands the boundary” language in the papers’ abstracts is not yet earned.

Confidence: genuinely open. The honest position for this project: exploration bonuses are worth piloting (cheap, mechanistically motivated, all portable per the exploration research thread) — but do not claim any of them “expand the capability boundary” [N] until you’ve run a pass@large-k-vs-base-model check on the same challenge subset (point 1/7’s protocol), ideally segmented by task compositionality the way Pass@(k,T) recommends. Absent that check, the safer verb is “elicit” (base capability present, surfaced more reliably), matching this project’s own competence/performance framing, rather than “expand” (base capability genuinely absent, newly created).
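
The check itself is a set comparison over the same challenge subset. A minimal sketch, assuming you have per-challenge correct counts from k samples of the base model and k samples of the trained model; the challenge IDs and counts are hypothetical.

```python
def solved_at_k(correct_counts: dict[str, int]) -> set[str]:
    """Challenges with at least one verified success among the k samples."""
    return {cid for cid, correct in correct_counts.items() if correct > 0}

def boundary_delta(base_solved: set[str], trained_solved: set[str]) -> dict:
    """Challenges gained and lost by the trained model relative to base."""
    gained = trained_solved - base_solved
    lost = base_solved - trained_solved
    return {"gained": sorted(gained), "lost": sorted(lost), "net": len(gained) - len(lost)}

# Hypothetical counts out of k=1024 samples per challenge and per model.
base_counts = {"web-01": 3, "web-02": 0, "pwn-01": 1}
trained_counts = {"web-01": 200, "web-02": 2, "pwn-01": 0}

delta = boundary_delta(solved_at_k(base_counts), solved_at_k(trained_counts))
print(delta)
```

Only the "gained" list can support the verb "expand," and only provisionally: a challenge the base model missed in k samples may still sit below its sampling resolution, so a larger base-model k tightens the claim. Here "web-01" went from 3 to 200 successes, which is elicitation, and the net boundary change is zero.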