Reinforcement — PPO · GRPO · RLVR
Signal = reward / verification. Fully on-policy, online, uses the whole reward landscape (push winners up and losers down). Most powerful, most expensive/unstable. This is the family driving every 2025–2026 reasoning model.
PPO
- What: clipped-surrogate policy gradient with a value/critic network (arXiv:1707.06347); in LLM post-training it is usually run with a KL-to-reference penalty.
- Eats: prompts + a reward (learned RM or verifier).
- 2026 status: still used for classic preference-RL at the proprietary labs, but declining share for reasoning-RL (the critic is expensive; GRPO/GSPO replaced it there).
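The clipped surrogate named above can be sketched per token as follows. This is a minimal illustration in plain Python, not a training-ready implementation: the function name `ppo_clip_loss` and the default `eps=0.2` are assumptions, and the critic and KL terms are left out on purpose.

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped-surrogate loss over tokens (hypothetical helper, lower is better)."""
    losses = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                 # importance ratio pi_theta / pi_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)   # clip(ratio, 1 - eps, 1 + eps)
        # Take the pessimistic (smaller) objective, then negate to get a loss.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)

# A token whose probability already rose well past the clip bound stops
# earning extra objective when its advantage is positive.
loss_clipped = ppo_clip_loss([1.0], [0.0], [1.0])
```

With a negative advantage the same ratio is not clipped, which is the pessimistic side of the `min`.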
GRPO (Group Relative Policy Optimization)
- What: drop the critic. Sample a group of N completions per prompt; the group mean reward is the baseline, so the advantage is Aᵢ = rᵢ − mean(r) (optionally std-normalized). Introduced in DeepSeekMath (arXiv:2402.03300).
- Eats: prompts + a reward fn; no fixed target dataset.
- 2026 status: the reasoning-RL default — DeepSeek-R1’s core algorithm (arXiv:2501.12948), still the base of V3.2’s mixed RL (arXiv:2512.02556).
- Requirement (project rule): baseline solve rate must sit in 30–60% per prompt-group, because all-pass or all-fail groups give zero advantage → zero gradient (llmresearch-handbook.md rule 7; mechanics in handbook.md §10, shared memory).
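The group-relative advantage and the zero-gradient failure mode can be checked with a few lines of plain Python. This is a sketch under stated assumptions: `grpo_advantages` and `in_band` are hypothetical helper names, binary rewards are assumed, and the small `eps` guarding the std division is an illustrative choice.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, normalize_std=True, eps=1e-6):
    """Advantage of each completion relative to its own group (hypothetical helper)."""
    baseline = mean(rewards)                       # group mean is the baseline
    centered = [r - baseline for r in rewards]     # A_i = r_i - mean(r)
    if not normalize_std:
        return centered
    scale = pstdev(rewards) + eps                  # optional std normalization
    return [c / scale for c in centered]

def in_band(solve_rate, lo=0.30, hi=0.60):
    """True if a prompt's baseline solve rate sits in the project's 30-60% band."""
    return lo <= solve_rate <= hi

all_pass = grpo_advantages([1, 1, 1, 1])           # every advantage is zero: no gradient
mixed = grpo_advantages([1, 0, 0, 0], normalize_std=False)
```

An all-pass or all-fail group yields all-zero advantages, so the rollouts spent on it contribute nothing to the update.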
GRPO successors
- GSPO (Qwen) [R][L]: the problem with vanilla GRPO/PPO at scale is that the token-level importance ratio r_t = π_θ(a_t|s_t)/π_old(a_t|s_t) compounds multiplicatively over a long response, and noisy per-token drift is specifically what destabilizes MoE RL (expert routing shifts between the rollout policy and the updated one, pushing tokens off-policy). GSPO clips at the sequence level instead:

      # pseudocode
      # GRPO/PPO: one ratio PER TOKEN, clipped per token; variance compounds over length L
      r_t = pi_theta(a_t|s_t) / pi_old(a_t|s_t)
      # GSPO: one ratio for the WHOLE sequence (length-normalized geometric mean)
      r_seq = (pi_theta(y|x) / pi_old(y|x)) ** (1 / len(y))
      loss = -mean(min(r_seq * A, clip(r_seq, 1-eps, 1+eps) * A))

  This is the RL algorithm Qwen states it used in production for Qwen3 (arXiv:2507.18071), which makes it the first GRPO successor with a flagship behind it. It gets more relevant, not less, as episodes lengthen: a 100-turn agentic trajectory with interleaved tool calls is exactly the long-sequence regime where token-level ratios drift furthest from 1 by the last token. If a future GRPO/RLVR run on the CTF agent shows training instability, sequence-level clipping is the first thing to try, not more KL-coefficient tuning.
- DAPO (ByteDance Seed) [E][R][L]: four concrete engineering fixes, not one new algorithm, each independently adoptable as a verl loss-mode flag (arXiv:2503.14476):
  - Clip-Higher: decouple the PPO clip bounds (eps_low ≠ eps_high, e.g. 0.20 / 0.28 vs. the symmetric PPO default 0.20 / 0.20) so a rare-but-good token can gain probability faster than a bad one loses it. Symmetric clipping caps how fast a rare-correct action can ever be reinforced, which is a direct driver of entropy collapse (below).
  - Dynamic Sampling: resample any prompt whose whole group of G rollouts is all-correct or all-incorrect (std(group_rewards) == 0 → zero advantage → zero gradient in GRPO) instead of paying for a wasted rollout batch.
  - Token-level loss [L]: average the policy-gradient loss over every token in the batch, not per-sample-then-averaged, so long correct/incorrect responses aren’t down-weighted relative to short ones. A credit-assignment fix that matters more the longer responses get.
  - Overlong reward shaping: a soft length penalty instead of a hard truncation penalty, so a response cut off by the context window isn’t punished as if it were simply wrong.

      # pseudocode
      eps_low, eps_high = 0.20, 0.28
      ratio = exp(logp_new - logp_old)
      clipped = clip(ratio, 1 - eps_low, 1 + eps_high)
      loss_pg = -min(ratio * adv, clipped * adv)   # per-token, mean over ALL tokens in the batch
      while std(group_rewards) == 0:               # dynamic sampling
          prompt = resample_prompt()
          group_rewards = rollout_and_score(prompt, n=G)

  Mainstream in OSS RL tooling (verl and open GRPO reproductions default to these four fixes). The paper reports 50 points on AIME 2024, beating DeepSeek-R1-Zero-Qwen-32B with half the training steps, which makes it one of the few fully open (algorithm + infra + data) large-scale reasoning-RL reproductions.

  Project rule 7 (GRPO baseline must hit 30–60%) is DAPO’s dynamic-sampling problem, stated as a portfolio-composition constraint instead of a training-loop fallback. Keeping the baseline in-band is how you avoid feeding all-pass/all-fail groups into the update in the first place; DAPO’s dynamic sampling is the fallback for whatever still lands there. At ~100 turns per rollout, resampling a whole-group-zero-reward challenge is expensive, so prefer upstream curriculum/difficulty filtering (drop challenges outside the 30–60% band) over paying for resamples on genuinely-unsolved-yet challenges.
Designed to fix: pattern 2 & 3 — Clip-Higher targets the exact mechanism behind “no methodology / 82% pivot-after-failure” and “good guesser until it isn’t”: symmetric clipping caps how much probability mass a rare, correct enumeration branch (or a well-grounded, as opposed to lucky, guess) can ever accumulate — which is precisely what narrows a policy onto one brittle script. See Exploration and entropy below for the mechanism this is patching.
- Dr. GRPO [R][E]: a smaller, easy-to-miss companion fix. Vanilla GRPO’s per-sample length- and std-normalization secretly rewards longer wrong answers and shorter right ones (an optimization artifact, not a real preference). The fix is a two-line change: drop the 1/|response| length term and the group-std division, and keep only A = r − mean(r) (arXiv:2503.20783). It matches or beats vanilla GRPO’s accuracy at the same compute while removing the length-inflation drift. For a 100-turn agent this bug has a much bigger attack surface than a single-turn math answer: a policy trained on the unfixed objective can learn to “look busy” (extra tool calls, redundant enumeration) after a wrong guess without the enumeration being useful, which is nearly indistinguishable from legitimate PTES-style enumeration unless you’re specifically checking for it.
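The length bias that Dr. GRPO removes is easy to see numerically. The sketch below compares the weight each token receives under the two objectives; the function names are hypothetical, and dividing by a constant `max_len` in the Dr. GRPO variant is an assumption made here so the two are on a comparable scale, not a detail taken from the text above.

```python
from statistics import mean, pstdev

def grpo_token_weights(rewards, lengths, eps=1e-6):
    """Vanilla GRPO: std-normalized advantage, divided by each response's own length."""
    base = mean(rewards)
    scale = pstdev(rewards) + eps
    return [((r - base) / scale) / n for r, n in zip(rewards, lengths)]

def dr_grpo_token_weights(rewards, max_len):
    """Dr. GRPO as described above: A = r - mean(r), no per-response length term."""
    base = mean(rewards)
    return [(r - base) / max_len for r in rewards]   # max_len: assumed constant normalizer

rewards = [1, 0, 0]            # one correct response, two wrong ones
lengths = [200, 100, 1000]     # the second wrong response is 10x longer

vanilla = grpo_token_weights(rewards, lengths)
fixed = dr_grpo_token_weights(rewards, max_len=1000)
# vanilla[2] is 10x smaller in magnitude than vanilla[1]: each token of the long
# wrong answer is penalized less. fixed[1] == fixed[2]: length no longer matters.
```

Under the vanilla weighting, padding a wrong answer dilutes its per-token penalty, which is the "look busy" incentive described above.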
RLVR (RL with Verifiable Rewards)
- What: GRPO/PPO where the reward is a deterministic verifier (unit tests, math checker, flag check) rather than a neural RM. No parameters to game.
- Eats: prompts + a verify(state) → {0,1} function. This is your setup: the CTF flag verifier is a textbook verifiable reward.
- 2026 status: arguably the defining technique of the era. Every reasoning model (o1/o3, R1, Gemini-thinking, Qwen3, Kimi) scales RL against verifiable/rule-based rewards as the capability driver; Gemini 2.5 explicitly allocates increased RL compute to “verifiable rewards” (arXiv:2507.06261; OpenAI “Learning to reason with LLMs”; R1, arXiv:2501.12948).
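A deterministic verifier of this kind can be very small. The sketch below assumes the rollout's final state has already been reduced to one submitted flag string; `verify_flag` and the example flag values are hypothetical and not the project's actual verifier.

```python
import hmac

def verify_flag(submitted: str, expected: str) -> int:
    """Return 1 if the submitted flag matches exactly, else 0 (hypothetical helper)."""
    # Strip surrounding whitespace only; no partial credit and no fuzzy matching,
    # so there is nothing for the policy to exploit short of finding the flag.
    ok = hmac.compare_digest(submitted.strip().encode(), expected.encode())
    return int(ok)

reward = verify_flag("flag{example}\n", "flag{example}")
```

Keeping the check exact is the design choice that matters: any leniency (substring matches, format-only credit) reintroduces a gameable surface.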
Exploration and entropy: the GRPO graduation trigger
Cybersecurity is exploration — every technique here is [E]-tagged. This is the section that operationalizes the project’s own stated graduation condition (“SFT → GRPO/RLVR when policy entropy collapses”): what actually collapses, why, and the cheapest fixes, roughly in the order you’d reach for them.
The phenomenon has a fitted law. Policy entropy falls sharply and monotonically early in RLVR training, and downstream performance is empirically bounded by R = -a·exp(H) + b: performance is bought by spending entropy, and it hits a fixed, predictable ceiling of R = b − a at H=0. The mechanism: under policy-gradient methods the step-to-step entropy change is governed by the covariance between a token’s log-probability and its logit update, and that update is proportional to the advantage. A high-probability, high-advantage token gets pushed toward certainty (entropy ↓), a rare-but-good token gets reinforced (entropy ↑), and the covariance stays positive almost everywhere during training, which is why the collapse is monotonic rather than noise (arXiv:2505.22617). Actionable consequence: instrument mean(entropy) from training step 0. A run whose entropy is already heading to 0 within the first 10–20% of steps is capped, regardless of what the reward curve currently shows. Without this instrumentation you cannot tell “40% because the task is hard” from “already collapsed onto one enumeration script at step 50, now just executing it faster”.
Designed to fix: pattern 2 & 3 — this is the literature-side name for exactly the two most load-bearing BSides findings: a policy that has spent its entropy budget on one narrow, high-probability action sequence has no probability mass left on alternatives when that sequence fails (pattern 2, 82% pivot-after-failure) and commits to a single ungrounded guess at the critical step because it’s the only path its distribution still assigns real weight to (pattern 3).
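The mean(entropy) instrumentation mentioned above amounts to averaging the Shannon entropy of the policy's next-token distribution over generated tokens. A minimal sketch in plain Python, assuming per-token logits are available as lists of floats (the helper names are hypothetical; a real training loop would compute this on tensors):

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one token position."""
    m = max(logits)                                   # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(logits_per_token):
    """Average token entropy over a batch of token positions."""
    return sum(token_entropy(l) for l in logits_per_token) / len(logits_per_token)

forking = token_entropy([0.0, 0.0, 0.0, 0.0])        # uniform over 4 options: log(4)
committed = token_entropy([100.0, 0.0, 0.0, 0.0])    # effectively zero
```

Logging this value every step from step 0 is what lets a collapsing run be told apart from a merely hard task.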
Cheap, surgical fixes — try in this order:
- Clip-Higher (DAPO, above): already the baseline recipe; decoupling eps_high from eps_low is the first lever, not an afterthought.
- Clip-Cov / KL-Cov: since collapse is driven by a small set of high-covariance tokens, don’t touch the loss globally. Each batch, flag the top ~0.2–2% of tokens by cov = (logp − logp.mean()) * (adv − adv.mean()) and either drop the policy-gradient update on them (Clip-Cov) or add an extra KL penalty scoped only to them (KL-Cov):

      # pseudocode; Clip-Cov and KL-Cov are alternatives, pick one
      cov = (logp - logp.mean()) * (adv - adv.mean())
      mask = cov > percentile(cov, 1 - clip_frac)       # top ~0.2-2% of tokens
      loss_pg[mask] = loss_pg[mask].detach()            # Clip-Cov
      loss = loss_pg + kl_coef * kl_per_token * mask    # KL-Cov (scoped, not global)

  Already merged into verl (loss_mode="clip_cov" / "kl_cov"): a config flag, not a reimplementation (arXiv:2505.22617).
- The clip asymmetry is mechanistically real, not just DAPO folklore: clipping on the low side pushes entropy up, while clipping on the high side pushes it down, so eps_low and eps_high are two independent knobs, not one symmetric hyperparameter (arXiv:2509.26114). If Clip-Higher alone doesn’t fully hold entropy up, look at eps_low too.
- High-entropy minority tokens: only ~20% of tokens in a trajectory are high-entropy “forking” tokens (where the policy actually chooses between distinct paths). Restricting the gradient update to just those matches full-gradient RLVR at 8B and beats it at 32B, while training on the other 80% (low-entropy filler/syntax) actively hurts (arXiv:2506.01939). In an agentic trace (JSON tool-call formatting, restated context) the boilerplate-token fraction is plausibly even larger than in pure math CoT; this is untested in this domain, but the leverage ceiling is higher, not lower.
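The covariance-based token selection from the list above can be sketched without a tensor library. This follows the description in this section (flag the top fraction of tokens by covariance) and is not guaranteed to match the paper's or verl's exact selection rule; all function names are hypothetical. With plain floats there is no gradient to detach, so the Clip-Cov variant simply leaves flagged tokens out of the loss.

```python
def high_cov_mask(logps, advs, frac=0.01):
    """Flag roughly the top `frac` of tokens by per-token covariance contribution."""
    n = len(logps)
    lp_mean = sum(logps) / n
    adv_mean = sum(advs) / n
    cov = [(lp - lp_mean) * (a - adv_mean) for lp, a in zip(logps, advs)]
    k = max(1, round(frac * n))                       # flag at least one token
    threshold = sorted(cov, reverse=True)[k - 1]
    return [c >= threshold for c in cov]              # ties may flag more than k tokens

def clip_cov_loss(pg_losses, mask):
    """Clip-Cov style: flagged tokens contribute nothing to the policy-gradient loss."""
    kept = [l for l, m in zip(pg_losses, mask) if not m]
    return sum(kept) / len(pg_losses)

def kl_cov_loss(pg_losses, kls, mask, kl_coef=1.0):
    """KL-Cov style: an extra KL penalty applied only to flagged tokens."""
    terms = [l + (kl_coef * kl if m else 0.0) for l, kl, m in zip(pg_losses, kls, mask)]
    return sum(terms) / len(terms)

logps = [-0.1, -2.0, -1.0, -3.0]
advs = [1.0, -1.0, 0.0, -0.5]
mask = high_cov_mask(logps, advs, frac=0.25)          # flags the confident, high-advantage token
```

The token flagged here is the high-probability, high-advantage one, which is the kind of token the covariance argument says drives entropy down.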
The graduation trigger, precisely stated: don’t wait for the reward curve to plateau — watch entropy. Once mean(entropy) is tracking toward the flat part of the R = -a·exp(H)+b curve, more rejection-sampling-SFT epochs on the same policy distribution will not move the needle (you’re re-sampling a narrowing distribution); that is the signal to graduate to GRPO/RLVR with the interventions above already wired in, not bolted on after observing collapse.
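One way to make "tracking toward the flat part of the curve" concrete is to fit R = -a·exp(H) + b to logged (entropy, reward) pairs and read off the implied ceiling. Because the law is linear in exp(H), ordinary least squares is enough. The sketch below uses hypothetical function names and synthetic points generated from a known (a, b) purely to show the arithmetic; it is not measured data.

```python
import math

def fit_entropy_law(entropies, rewards):
    """Least-squares fit of R = -a * exp(H) + b; returns (a, b)."""
    xs = [math.exp(h) for h in entropies]             # the law is linear in x = exp(H)
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(rewards) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, rewards))
    slope = sxy / sxx                                 # slope equals -a
    return -slope, y_mean - slope * x_mean            # (a, b)

def predicted_ceiling(a, b):
    """Reward predicted at H = 0, where exp(H) = 1."""
    return b - a

# Synthetic illustration only: points generated from a = 0.2, b = 0.7.
hs = [1.0, 0.6, 0.3, 0.1]
rs = [-0.2 * math.exp(h) + 0.7 for h in hs]
a, b = fit_entropy_law(hs, rs)
headroom = predicted_ceiling(a, b) - rs[-1]           # reward left before H reaches 0
```

When the remaining headroom is small relative to the gap you still need to close, further sampling from the same distribution is unlikely to help.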
ProRL — prolonged, KL-controlled RL expands the boundary rather than just amplifying it. Directly tests the contested question in Contested edges: does RL discover genuinely new solution paths, or just reweight what the base model could already sample at large k? Answer, conditional on three ingredients: (a) adaptive KL control against a reference policy, (b) periodic reference-policy resetting (ref_policy ← policy.detach() every N steps, so the KL term compares against a recent snapshot instead of an ever-more-distant step-0 policy), and (c) a diverse task suite, sustained over prolonged training (thousands of steps, not a short polish pass) — under these conditions, RL-trained models solve problems the base model never solves at any k, and the boundary-expansion effect strengthens with training duration (arXiv:2505.24864).
    # pseudocode
    # (a) KL-to-reference with an adaptive coefficient
    loss += kl_coef * KL(policy || ref_policy)
    # (b) periodic reference reset: re-anchor, don't compare against a stale step-0 snapshot
    if step % reset_interval == 0:
        ref_policy.load_state_dict(policy.state_dict())
    # (c) mix task families/difficulties in the training distribution rather than a narrow curriculum
This directly qualifies the project’s own confirmed lesson (“GRPO amplifies existing capabilities, SFT replaces them,” arXiv:2507.10616): the amplify-vs-expand outcome is a function of how long and how KL-controlled the run is, not GRPO’s inherent ceiling — a short GRPO pass should be expected to behave like amplification; ProRL’s own ablations show the boundary-expansion effect only emerges after prolonged, reference-resetting training. This is genuinely contested — a separate 2025 measurement study reports the opposite under a short, single-reference-policy setup (pass@1 rises, pass@256 coverage falls over training, i.e. narrowing, arXiv:2504.13837) — the two are read here as reconciled by duration/KL-management, not as one being simply wrong, but that reconciliation is this book’s synthesis, not a claim either paper makes explicitly. Treat “does RL create new capability here” as open and verify with your own entropy instrumentation (above), not settled fact.
Designed to fix: pattern 4 — a model whose reasoning boundary hasn’t been expanded keeps failing the same class of enumeration-heavy exploitation steps no matter how much RL “polishing” it gets (62% of failures stall in exploitation). ProRL’s finding implies the fix is duration + diversity + entropy control together, not any single trick — budget real training time and periodic reference resets once past the initial GRPO baseline, rather than treating GRPO as a short fine-tune pass.
Everything past this point — pass@k as a training signal, diversity/curiosity/count-based intrinsic rewards, parameter-space noise for temporally-coherent exploration, tool-call-sequence diversity as the project’s own novel opportunity, and the full turn-level/step-level credit-assignment literature for the ~100-turn setting — is covered in depth in the dedicated long-horizon and exploration-sweep chapters, and in Agentic & multi-turn RL for the multi-turn training-loop shape itself. Read this section for the graduation trigger and the cheapest fixes; read those for the harder credit-assignment and boundary-expansion questions once you’re past the initial DAPO-recipe GRPO baseline.
The reward-model question (PRM vs outcome/rubric)
- PRM (process reward, dense step-level) — score each reasoning step (Lightman et al., arXiv:2305.20050). Niche / avoided in production: DeepSeek explicitly rejected PRM for R1 due to step-level reward hacking (arXiv:2501.12948).
- Outcome verifier + rubric/critic grading — the real 2026 answer to “what replaced PRM”: not dense step rewards, but LLM-judge/rubric-based outcome grading. Gemini’s “Critic” (prompted rubric grader, arXiv:2507.06261) and OpenAI’s RFT “model grader” are both this in production. Your deterministic flag verifier is the ungameable end of this spectrum — keep it there (gameability ladder in Contested edges).
When to reach for RL
An execution gap where rejection-sampling FT has plateaued (entropy collapsed), and you want the negative-sample gradient + online updates. Cost: online rollout infra, reward plumbing, KL control, instability, and the train↔inference precision mismatch (its own rabbit hole — see the shared memory note research/fp8-quantization-mechanics-training-serving.md).