This walks through the reasoning from that problem to a method you can defend. Each step is a question the previous answer forces, so you arrive at the decision yourself.
The method is downstream of one thing: why does it fail the 800? The failure type selects the tool. Pick the tool first and you’ll pour compute into a fix that structurally can’t work.
Data/SFT injects capability the model lacks. RL only amplifies capability it already shows — sometimes. You cannot reinforce a behavior that never fires. This is a named 2025 result, not folklore: “Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them.”
So before choosing a tool, we need the vocabulary to describe the failure. That vocabulary is three knobs.
The whole zoo (SFT, DPO, GRPO, distillation) is one objective, move probability mass toward good behavior, at three settings of those knobs.
Off-policy training makes the model correct on states drawn from the teacher’s trajectory. But at inference the model drives: it visits states from its own trajectory. For long agentic runs the two distributions diverge, and the gap widens as the horizon grows.
Ross, Gordon & Bagnell (DAgger, 2011), Thm 2.1: behavioral cloning error grows like ε·T² in the horizon T; on-policy correction grows like ε·T. Long CTF trajectories punish off-policy imitation specifically.
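To feel how fast the two scalings separate, here is a rough arithmetic sketch. It treats the bounds as exactly ε·T² and ε·T with constants dropped, which is a simplification; the per-step error of 0.01 and the horizons are illustrative assumptions, not measurements. The numbers are bound magnitudes, so only the ratio matters.

```python
def bc_error_bound(eps: float, horizon: int) -> float:
    """Off-policy imitation (behavioral cloning): error scales like eps * T^2."""
    return eps * horizon ** 2


def on_policy_error_bound(eps: float, horizon: int) -> float:
    """On-policy correction: error scales like eps * T."""
    return eps * horizon


if __name__ == "__main__":
    eps = 0.01  # illustrative per-step error rate (assumption)
    for horizon in (10, 50, 200):
        bc = bc_error_bound(eps, horizon)
        op = on_policy_error_bound(eps, horizon)
        # The ratio of the two bounds is simply the horizon T.
        print(f"T={horizon:>3}  off-policy ~{bc:.1f}  on-policy ~{op:.1f}  ratio ~{bc / op:.0f}x")
```

Under these assumptions the off-policy bound is T times the on-policy one, so a 200-step trajectory is penalized far more than a 10-step one.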
An execution failure, by definition, happens in a state the model’s own policy reaches. Off-policy data is — structurally — data about states it doesn’t reach. So off-policy SFT is blind to exactly the region where execution fails. You imitate teacher-states while your own-state behavior barely moves. Not a hyperparameter miss — the wrong distribution.
The rungs are ordered by increasing on-policy-ness and use of the failure signal, at increasing cost.
SFT (supervised fine-tuning). Born to fix: a base model won’t follow instructions or lacks a behavior. Show it (prompt, answer) pairs and train with cross-entropy. Pure imitation.
Signature: teaches new capability and format; cheap and stable. But it is off-policy, so it is blind to execution failures, and it overwrites existing knowledge (worse on small models).
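The imitation objective above can be sketched as a mean negative log-likelihood over the answer tokens. This is a minimal sketch: the inputs are assumed to be log-probabilities the model already assigned to the reference tokens, and masking out the prompt tokens is a common convention assumed here, not something the text specifies.

```python
from typing import List


def sft_loss(token_logprobs: List[float], is_answer_token: List[bool]) -> float:
    """Mean cross-entropy (negative log-likelihood) over answer tokens only.

    token_logprobs[i] is the model's log-probability of the i-th reference token.
    is_answer_token[i] is False for prompt tokens, which are excluded from the loss.
    """
    picked = [-lp for lp, keep in zip(token_logprobs, is_answer_token) if keep]
    if not picked:
        raise ValueError("no answer tokens to train on")
    return sum(picked) / len(picked)
```

Every target token comes from the fixed dataset, never from the model’s own samples, which is what makes this update off-policy.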
On-policy distillation. Born to fix: the student imitates the teacher yet still fails at inference (the mismatch above). Fix: sample from the student and let the teacher grade those tokens, a dense signal.
Signature: the sleeper. Dense per-token signal on your own rollouts. GKD: +90% relative on GSM8K vs. fixed-dataset KD. ~10× cheaper than RL. Needs a teacher genuinely better at your task.
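A minimal sketch of “the teacher grades the student’s tokens”: at each position of a student-sampled sequence, compare the student’s next-token distribution with the teacher’s. Reverse KL is used here as one common choice of divergence; that choice is an assumption of this sketch (GKD considers a family of divergences), and the plain probability lists stand in for real model outputs.

```python
import math
from typing import List, Tuple


def reverse_kl(student_probs: List[float], teacher_probs: List[float], eps: float = 1e-12) -> float:
    """KL(student || teacher) over the vocabulary at a single position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )


def on_policy_distill_loss(
    student_dists: List[List[float]], teacher_dists: List[List[float]]
) -> Tuple[List[float], float]:
    """Per-token losses and their mean, computed on positions the STUDENT visited.

    Both lists are indexed by position in a sequence sampled from the student.
    """
    per_token = [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]
    return per_token, sum(per_token) / len(per_token)
```

The per-token list is the dense signal: each position where the student disagrees with the teacher gets its own nonzero penalty, instead of one scalar reward for the whole run.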
Rejection-sampling fine-tuning. Born to fix: you want RL’s on-policy gain without the machinery. Sample N completions from the current model, keep the verifier-accepted winners, and SFT on them.
Signature: on-policy data plus an SFT update is stable and cheap, with no reward model or critic. But positives-only training leads to entropy collapse and then a plateau. It’s the first-order special case of policy gradient.
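The sample-filter-train loop above can be sketched in a few lines. `sample_fn` and `verifier` are hypothetical callables standing in for your model’s sampler and your flag check; the SFT update itself is left out.

```python
from typing import Callable, List, Tuple


def collect_rejection_samples(
    prompts: List[str],
    sample_fn: Callable[[str], str],        # hypothetical: draws one completion from the current model
    verifier: Callable[[str, str], bool],   # hypothetical: True if the completion is accepted
    n_samples: int,
) -> List[Tuple[str, str]]:
    """Sample n_samples completions per prompt and keep only the accepted ones."""
    kept = []
    for prompt in prompts:
        for _ in range(n_samples):
            completion = sample_fn(prompt)
            if verifier(prompt, completion):
                kept.append((prompt, completion))
    return kept
```

The returned pairs feed an ordinary SFT step. Rejected samples are simply discarded, which is the positives-only property that later drives entropy collapse.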
Preference optimization (DPO, KTO). Born to fix: no verifier exists (“helpful/harmless”). Learn from comparisons (A≻B). DPO collapses the reward-model + RL loop into one closed-form loss. KTO takes unpaired good/bad logs.
Signature: reshapes the ranking over what the model can already produce; injects no capability. Uses the negative signal. Off-policy by default (its weakness).
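For reference, the closed-form DPO loss for one comparison can be sketched as below. Inputs are assumed to be summed sequence log-probabilities under the policy and under a frozen reference model; `beta=0.1` is an illustrative default, not a recommendation.

```python
import math


def dpo_loss(
    policy_chosen_lp: float,
    policy_rejected_lp: float,
    ref_chosen_lp: float,
    ref_rejected_lp: float,
    beta: float = 0.1,  # illustrative value (assumption)
) -> float:
    """-log sigmoid(beta * margin) for one (chosen, rejected) pair.

    margin = how much more the policy prefers the chosen answer over the
    rejected one, relative to the reference model.
    """
    margin = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    x = -beta * margin
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

Both completions are given, not sampled from the current policy, so the loss can only re-rank behavior that already appears in the data.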
Online RL (GRPO, RLVR). Born to fix: optimize a policy against a (verifiable) reward, using the whole landscape. GRPO drops the critic and uses a group-mean baseline instead. RLVR means the reward comes from a deterministic verifier (your flag check), which is ungameable.
Signature: fully on-policy and online; pushes winners up and losers down. Most powerful, and most expensive and unstable.
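The group-mean baseline can be sketched as follows: sample a group of rollouts for one prompt, score each with the verifier, and subtract the group mean. Dividing by the group standard deviation is the usual GRPO normalization and is optional here; the `eps` value is an assumption to avoid division by zero.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(
    rewards: List[float], normalize: bool = True, eps: float = 1e-6
) -> List[float]:
    """Advantage of each rollout relative to its own group (no critic needed)."""
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    scale = pstdev(rewards) + eps  # eps guards the all-same-reward case
    return [c / scale for c in centered]
```

With rewards `[1, 0, 0, 0]` the single winner gets a positive advantage and the three losers get negative ones, which is the push-up and push-down signal. A group with identical rewards yields all-zero advantages and contributes no gradient.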
On-policy distillation: dense per-token signal on your own rollouts. Distilling Qwen3-32B into Qwen3-8B reaches ~70% on AIME’24 in 150–200 steps, at ~9–30× less compute than RL. It beats your friend’s off-policy SFT decisively and undercuts RL on cost. The catch: you need a model actually better at your CTFs.
Rejection-sampling SFT: your flag check plus your 100–200 solves are the seed. It’s on-policy and cheap. The graduation trigger is measurable: positives-only training causes policy-entropy collapse, which shows up as fast early gains and then a plateau. When entropy collapses, switch to GRPO for the push-down-on-failures gradient. (GRPO’s real edge over rejection-FT isn’t group normalization; it’s discarding all-same-reward groups. arXiv 2504.11343.)
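One way to make the graduation trigger concrete is to log mean token entropy and pass rate per training round and check both. This is a sketch only: the thresholds (`entropy_drop`, `min_gain`, `window`) are hypothetical defaults you would tune, and the helper names are not from any library. The last function shows the all-same-reward filter mentioned above.

```python
import math
from typing import List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_graduate(
    entropy_history: List[float],
    pass_rate_history: List[float],
    entropy_drop: float = 0.5,  # hypothetical: "collapsed" = at or below half the initial entropy
    min_gain: float = 0.01,     # hypothetical: "plateau" = under 1 point gained over the window
    window: int = 2,
) -> bool:
    """True when entropy has collapsed AND pass rate has stopped improving."""
    if len(entropy_history) <= window or len(pass_rate_history) <= window:
        return False
    collapsed = entropy_history[-1] <= entropy_drop * entropy_history[0]
    plateaued = pass_rate_history[-1] - pass_rate_history[-1 - window] < min_gain
    return collapsed and plateaued


def drop_uniform_groups(reward_groups: List[List[float]]) -> List[List[float]]:
    """Discard groups whose rollouts all got the same reward (no learning signal)."""
    return [g for g in reward_groups if len(set(g)) > 1]
```

Requiring both conditions avoids switching to GRPO while rejection-sampling SFT is still paying off.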
Yue et al. (2504.13837): vanilla RLVR raises pass@1 but the base beats it at large pass@k — RL elicits, doesn’t expand; distillation does expand. But ProRL (2505.24864) shows prolonged RL + KL control expands the boundary even on never-solved problems. Safe framing: vanilla RLVR at normal budget elicits; enough compute + exploration-preservation can expand. Recipe-dependent.
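Since the argument turns on pass@1 versus large pass@k, here is the commonly used unbiased pass@k estimator as a sketch: given n samples of which c are correct, it estimates the chance that at least one of k drawn samples is correct. Whether the cited papers use exactly this estimator is not stated in the text above, so treat the connection as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct) from n samples with c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A problem with c = 0 scores 0 at every k: that is the “never fires” case, where there is nothing for RL to amplify. A problem with small c scores low at k = 1 but approaches 1 at large k, which is the gap elicitation closes.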
“RFT” names two different things. Rejection-sampling FT (STaR-family) is “RL without RL” and cheap. Reinforcement Fine-Tuning (OpenAI/Fireworks, and your project handbook) is actual online RL / GRPO and expensive. “Start with RFT instead of RL” only parses under the first reading. Say “rejection-sampling SFT” so you don’t accidentally order a GRPO run.
Answer for your failing challenges. The routing question is the one that separates every branch:
Nothing to reinforce, so no on-policy method helps. Add the capability by demonstration — or, cheapest of all, put the missing fact in a tool (“knowledge in tools, not weights”). Standard RL won’t cheaply conjure it (Yue 2504.13837); prolonged-RL might, but that’s expensive and contested.
On-policy distillation gives a dense on-policy signal, ~10× cheaper than RL, and it fixes the exact off-policy blindness that gave your friend small gains. This is likely your best first move, and most people skip it. GKD: arXiv 2306.13649.
Run your 100–200 verified solves through a filter (verifier-accept → drop loops → replay-reproduce → dedup → decontaminate → loss-mask observations) into an SFT set. On-policy, cheap, verifier-clean. Watch policy entropy; when it collapses and gains plateau, graduate to GRPO/RLVR with the flag as reward.
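The filter chain above can be sketched as a single pass over logged trajectories. Everything here is a hypothetical shape for illustration: the `Step` and `Trajectory` types, the loop heuristic (same action repeated more than `max_repeats` times in a row), and the `replay_ok` callable that stands in for re-running a solve to confirm it reproduces.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class Step:
    role: str  # "action" (model output) or "observation" (tool/environment output)
    text: str


@dataclass
class Trajectory:
    challenge_id: str
    steps: List[Step]
    flag_accepted: bool


def has_loop(traj: Trajectory, max_repeats: int = 3) -> bool:
    """True if the same action appears more than max_repeats times in a row."""
    run, prev = 0, None
    for step in traj.steps:
        if step.role != "action":
            continue
        run = run + 1 if step.text == prev else 1
        prev = step.text
        if run > max_repeats:
            return True
    return False


def build_sft_set(
    trajs: List[Trajectory],
    replay_ok: Callable[[Trajectory], bool],  # hypothetical: re-runs the solve, True if it reproduces
    eval_ids: Set[str],                       # held-out challenge ids to keep out of training
    max_repeats: int = 3,
) -> List[List[Tuple[str, bool]]]:
    """Return examples as lists of (text, train_on_this) pairs."""
    seen, out = set(), []
    for t in trajs:
        if not t.flag_accepted:          # verifier-accept
            continue
        if has_loop(t, max_repeats):     # drop loops
            continue
        if not replay_ok(t):             # replay-reproduce
            continue
        key = (t.challenge_id, tuple(s.text for s in t.steps if s.role == "action"))
        if key in seen:                  # dedup
            continue
        seen.add(key)
        if t.challenge_id in eval_ids:   # decontaminate
            continue
        # loss-mask observations: only model actions carry loss
        out.append([(s.text, s.role == "action") for s in t.steps])
    return out
```

Masking observations keeps the model from being trained to predict tool output it never generates, so the loss lands only on its own decisions.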
The capability exists; you’re reshaping which behavior it prefers. KTO fits your data shape — unpaired solved/failed logs, no matched pairs needed. Do this after any real knowledge gap is SFT-resolved.
The one sentence to carry: every fine-tuning method is the same move — push probability mass toward good behavior — differing only on whose distribution the data comes from (off- vs on-policy) and whether you use the failure signal. Off-policy imitation is structurally blind to execution gaps; that’s why your friend’s SFT stalled, and why any on-policy method is the fix.