00 the problem

My CTF agent solves 1 in 5.
How do I fine-tune it — and why that method?

GIVEN ~1,000 CTF challenges, varying difficulty.
RAN the agent → solved ~100–200, failed ~800.
GOAL make it better at cybersecurity via fine-tuning.
ASK where do I start — which fine-tuning, and the reason why, not a recipe.

This walks the reasoning from that problem to a method you can defend. Each step is a question the previous answer forces, so you arrive at the decision yourself.

01 reframe

“Which method” is the wrong first question.

The method is downstream of one thing: why does it fail the 800? The failure type selects the tool. Pick the tool first and you’ll pour compute into a fix that structurally can’t work.

The load-bearing principle

Data/SFT injects capability the model lacks. RL only amplifies capability it already shows — sometimes. You cannot reinforce a behavior that never fires. This is a named 2025 result, not folklore: “Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them.”

So before choosing a method, we need the vocabulary to describe the failure. That vocabulary is three knobs.

02 the skeleton

Every fine-tuning method is three knobs.

The whole zoo of SFT, DPO, GRPO, and distillation is one objective, moving probability mass toward good behavior, at three settings. The options per knob:

KNOB 1: what's the training signal?
    demonstration: “here’s the answer”
    comparison: “A ≻ B”
    reward / verifier: “scored 1”
KNOB 2: whose distribution is the data from? (the deep one)
    off-policy: someone else’s trajectories
    on-policy: the model’s own attempts, scored
KNOB 3: what are you changing?
    knowledge / capability it lacks
    policy: which existing behavior it prefers
03 the key axis

Why off-policy SFT is blind to execution gaps.

Off-policy training makes the model correct on states drawn from the teacher’s trajectory. But at inference the model drives: it visits states from its own trajectory. Over long agentic runs the two diverge, and the gap widens with the horizon:

[Figure: error vs. horizon T. Off-policy cloning of the teacher path grows as O(εT²): one slip leads to states the teacher never visited, and the error compounds. On-policy correction stays linear, O(εT).]

Ross, Gordon & Bagnell (DAgger, 2011): Thm 2.1 bounds behavioral cloning error at order ε·T²; their on-policy correction brings it down to order ε·T. Long CTF trajectories punish off-policy imitation specifically.
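
To feel the gap between ε·T² and ε·T, here is a toy model, not the theorem itself. The assumptions are mine, not the paper's: the cloned policy slips with probability `eps` at each step and never recovers once it leaves the teacher path, paying cost 1 per off-path step; the corrected policy slips at the same rate but is back on track by the next step.

```python
def cloned_cost(eps: float, horizon: int) -> float:
    """Expected off-path steps when a single slip is never recovered from."""
    # P(already slipped by step t) = 1 - (1 - eps)^t; each such step costs 1.
    return sum(1.0 - (1.0 - eps) ** t for t in range(1, horizon + 1))


def corrected_cost(eps: float, horizon: int) -> float:
    """Expected off-path steps when every slip costs exactly one step."""
    return eps * horizon


for T in (10, 50, 200):
    print(T, round(cloned_cost(0.001, T), 4), round(corrected_cost(0.001, T), 4))
```

While eps·T is small, doubling the horizon roughly quadruples the cloned cost and only doubles the corrected one. That is the shape of the bound, under this toy model's assumptions.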

The theorem behind “SFT gave small gains”

An execution failure, by definition, happens in a state the model’s own policy reaches. Off-policy data is — structurally — data about states it doesn’t reach. So off-policy SFT is blind to exactly the region where execution fails. You imitate teacher-states while your own-state behavior barely moves. Not a hyperparameter miss — the wrong distribution.

04 the genealogy

One ladder. Each rung fixes the one above.

Ordered by increasing on-policy-ness and use of the failure signal, at increasing cost.

◀ off-policy · positives-only · cheap | on-policy · uses failures · expensive ▶
SFT / off-policy distillation | OFF-POLICY | InstructGPT · Hinton KD

Born to fix: a base model won’t follow instructions / lacks a behavior. Show it (prompt, answer) pairs, cross-entropy. Pure imitation.

Signature: teaches new capability + format, cheap, stable — but off-policy → blind to execution + overwrites knowledge (worse on small models).

On-policy distillation | ON-POLICY | GKD 2306.13649

Born to fix: student imitates teacher yet still fails at inference (the mismatch above). Fix: sample from the student, let the teacher grade those tokens (dense).

Signature: the sleeper — dense per-token signal on your own rollouts. GKD: +90% rel. on GSM8K vs fixed-dataset KD. ~10× cheaper than RL. Needs a teacher genuinely better at your task.
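
A minimal sketch of what "the teacher grades those tokens" can look like, assuming per-token reverse KL as the grading loss. That choice is an assumption here; other divergences are used in practice. The distributions are hypothetical next-token probabilities over a tiny vocabulary, taken at positions the student itself sampled.

```python
import math
from typing import Sequence


def reverse_kl(student: Sequence[float], teacher: Sequence[float]) -> float:
    """KL(student || teacher) at one token position.

    Assumes teacher probabilities are strictly positive (softmax outputs).
    """
    return sum(
        s * math.log(s / t)
        for s, t in zip(student, teacher)
        if s > 0.0  # 0 * log(0) is taken as 0
    )


def rollout_loss(student_dists, teacher_dists) -> float:
    """Mean per-token reverse KL over one student-sampled rollout."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]
    return sum(per_token) / len(per_token)


# Two positions: the student agrees with the teacher at the first, not the second.
student = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]]
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
print(rollout_loss(student, teacher))
```

Every position contributes a number, which is what "dense" means here: the rollout gets a grade per token instead of one pass/fail bit at the end.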

Rejection-sampling FT · “RL without RL” | ON-POLICY | STaR · ReST · RAFT

Born to fix: want RL’s on-policy gain without the machinery. Sample N from the current model, keep verifier-accepted winners, SFT on them.

Signature: on-policy DATA + SFT UPDATE = stable, cheap, no reward-model/critic. But positives-only → entropy collapse → plateaus. It’s the first-order special case of policy gradient.
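
The loop "sample N, keep verifier-accepted winners, SFT on them" can be sketched as below. `sample` and `verify` are hypothetical callables standing in for your model's rollout function and your flag check; the SFT step itself is left out.

```python
from typing import Callable, List, Tuple


def build_rft_dataset(
    prompts: List[str],
    sample: Callable[[str], str],        # hypothetical: one rollout from the current model
    verify: Callable[[str, str], bool],  # hypothetical: e.g. the CTF flag check
    n: int = 8,
) -> List[Tuple[str, str]]:
    """Sample n rollouts per prompt and keep only the verifier-accepted ones."""
    accepted = []
    for prompt in prompts:
        for _ in range(n):
            completion = sample(prompt)
            if verify(prompt, completion):
                accepted.append((prompt, completion))
    return accepted
```

Failed rollouts are simply thrown away, which is the positives-only property: nothing in this dataset tells the model what to stop doing.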

Preference opt · DPO / KTO | OFF→ON | DPO 2305.18290 · KTO

Born to fix: no verifier exists (“helpful/harmless”). Learn from comparisons (A≻B). DPO collapses the reward-model + RL loop into a single closed-form loss. KTO takes unpaired good/bad logs.

Signature: reshapes ranking over what the model can already produce — injects NO capability. Uses the negative signal. Off-policy by default (its weakness).
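
For reference, a sketch of the DPO loss for a single comparison, written on summed sequence log-probabilities. The default `beta` is a placeholder value, not a recommendation.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # placeholder strength of the pull away from the reference
) -> float:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # softplus(-margin), in a form that does not overflow for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


print(dpo_loss(-4.0, -6.0, -5.0, -5.0))  # policy already prefers the chosen answer
```

Both terms are log-probabilities of sequences that already exist in the dataset, which is why the loss can only re-rank behavior and why it is off-policy unless the pairs are resampled from the current model.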

RL · PPO → GRPO → RLVR | ON-POLICY, ONLINE | GRPO 2402.03300 · R1

Born to fix: optimize a policy against a (verifiable) reward, using the whole landscape. GRPO drops the critic (group-mean baseline). RLVR = deterministic verifier reward (your flag check), hard to game as long as the flag can only be reached by solving the challenge.

Signature: fully on-policy, online, pushes winners UP and losers DOWN. Most powerful, most expensive/unstable.
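
The group-mean baseline is small enough to write out. A minimal sketch, assuming binary flag rewards and population standard deviation for the normalization; implementations differ on that detail and on the epsilon.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Advantage of each rollout relative to its own group, with no critic."""
    baseline = mean(rewards)
    spread = pstdev(rewards)
    if spread == 0.0:
        # Every rollout scored the same: nothing to prefer, so no gradient.
        return [0.0] * len(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]


print(group_advantages([1, 0, 0, 0]))  # one solve among four rollouts
print(group_advantages([0, 0, 0, 0]))  # all failed: all zeros
```

The solved rollout gets a positive advantage and the failures a negative one, which is the "winners UP, losers DOWN" behavior. A group where every rollout fails (or every rollout succeeds) contributes nothing.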

The unification: all five put mass on high-reward regions of an output distribution. They differ only on whose distribution (on/off-policy), how rich the signal (dense→binary→scalar), and whether failures push down too.
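
The ladder can also be read as data. A small sketch that records, for each rung, only what the text above states; the field names are mine, and `None` marks something the text does not say.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Method:
    name: str
    on_policy: bool
    uses_failures: Optional[bool]  # None = not stated in the text above


LADDER = [
    Method("SFT / off-policy distillation", on_policy=False, uses_failures=False),
    Method("On-policy distillation", on_policy=True, uses_failures=None),
    Method("Rejection-sampling FT", on_policy=True, uses_failures=False),
    Method("DPO / KTO", on_policy=False, uses_failures=True),  # off-policy by default
    Method("RL (PPO / GRPO / RLVR)", on_policy=True, uses_failures=True),
]

# Which rungs are both on-policy and push down on failures?
on_policy_with_failures = [m.name for m in LADDER if m.on_policy and m.uses_failures]
print(on_policy_with_failures)
```
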
05 for an execution gap

Two on-policy fixes — and when to graduate.

On-policy distillation — if a stronger teacher exists

Dense per-token signal on your own rollouts. Qwen3-8B ← Qwen3-32B → ~70% AIME’24 in 150–200 steps, ~9–30× less compute than RL. Beats your friend’s off-policy SFT decisively and undercuts RL on cost. The catch: you need a model actually better at your CTFs.

Rejection-sampling SFT → GRPO — if you only have a verifier

Your flag check + your 100–200 solves are the seed. It’s on-policy and cheap. Graduation trigger is measurable: positives-only training causes policy-entropy collapse — fast early gains, then plateau. When entropy collapses, switch to GRPO for the push-down-on-failures gradient. (GRPO’s real edge over rejection-FT isn’t group-norm — it’s discarding all-same-reward groups. arXiv 2504.11343.)
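
A sketch of the graduation trigger, assuming you log mean token entropy and solve rate once per rejection-sampling round. The thresholds (`entropy_drop`, `min_gain`, `window`) are hypothetical and need tuning on your own runs.

```python
import math
from typing import List, Sequence


def mean_token_entropy(dists: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (nats) over next-token distributions."""
    def entropy(p: Sequence[float]) -> float:
        return -sum(x * math.log(x) for x in p if x > 0.0)
    return sum(entropy(p) for p in dists) / len(dists)


def should_graduate(
    entropy_history: List[float],
    solve_rate_history: List[float],
    entropy_drop: float = 0.5,  # hypothetical: entropy fell to half its starting value
    min_gain: float = 0.01,     # hypothetical: under 1 point of gain counts as a plateau
    window: int = 3,            # rounds to look back when measuring gain
) -> bool:
    """True when entropy has collapsed and solve rate has stopped improving."""
    if len(entropy_history) <= window or len(solve_rate_history) <= window:
        return False
    collapsed = entropy_history[-1] <= entropy_drop * entropy_history[0]
    plateaued = solve_rate_history[-1] - solve_rate_history[-1 - window] < min_gain
    return collapsed and plateaued
```

Requiring both conditions avoids switching to GRPO on a run that is merely getting more confident while still improving.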

06 honesty

Two places not to overclaim.

“RL can’t create capability” — contested, not law

Yue et al. (2504.13837): vanilla RLVR raises pass@1 but the base beats it at large pass@k — RL elicits, doesn’t expand; distillation does expand. But ProRL (2505.24864) shows prolonged RL + KL control expands the boundary even on never-solved problems. Safe framing: vanilla RLVR at normal budget elicits; enough compute + exploration-preservation can expand. Recipe-dependent.
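
The pass@1 versus pass@k comparison can be computed from one batch of samples per challenge. A minimal sketch using the commonly used unbiased estimator 1 − C(n−c, k)/C(n, k), where `n` is the number of samples drawn, `c` the number that captured the flag, and `k` the budget you are asking about.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c were correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=2, k=1), pass_at_k(n=10, c=2, k=5))
```

A challenge with `c = 0` at large `n` scores 0 at every `k`: that is the "never fires" case, where there is nothing for RL to amplify.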

The “RFT” landmine

Rejection-sampling FT (STaR-family) = “RL without RL,” cheap. Reinforcement Fine-Tuning (OpenAI/Fireworks, and your project handbook) = actual online RL / GRPO, expensive. “Start with RFT instead of RL” only parses under the first. Say “rejection-sampling SFT” so you don’t accidentally order a GRPO run.

07 your conclusion

Diagnose the gap → the method follows.

Answer one question for your failing challenges. It is the routing question that separates every branch:

Does the correct action ever appear in the model’s own outputs — even rarely — at high sampling N?
→ Never: inject off-policy (knowledge SFT · teacher data · or a TOOL)

Nothing to reinforce, so no on-policy method helps. Add the capability by demonstration — or, cheapest of all, put the missing fact in a tool (“knowledge in tools, not weights”). Standard RL won’t cheaply conjure it (Yue 2504.13837); prolonged-RL might, but that’s expensive and contested.

→ Yes, and a stronger teacher exists: on-policy distillation

Dense on-policy signal, ~10× cheaper than RL, and it fixes the exact off-policy blindness that gave your friend small gains. This is likely your best first move — most people skip it. GKD 2306.13649.

→ Yes, and you only have a verifier: rejection-sampling SFT → then GRPO

Run your 100–200 verified solves through a filter (verifier-accept → drop loops → replay-reproduce → dedup → decontaminate → loss-mask observations) into an SFT set. On-policy, cheap, verifier-clean. Watch policy entropy; when it collapses and gains plateau, graduate to GRPO/RLVR with the flag as reward.
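
Part of that filter fits in a few lines. A sketch covering verifier-accept, loop dropping, exact-match dedup, and the loss mask; replay-reproduce and decontamination need your environment and benchmark lists, so they are left out. The `Trajectory` shape and the `max_repeats` threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    challenge_id: str
    steps: List[Tuple[str, str]]  # (role, text); role is "action" or "observation"
    solved: bool                  # result of the flag check


def has_loop(traj: Trajectory, max_repeats: int = 3) -> bool:
    """True if one action repeats more than max_repeats times consecutively
    (observations between the repeats are ignored)."""
    run, prev = 0, None
    for role, text in traj.steps:
        if role != "action":
            continue
        run = run + 1 if text == prev else 1
        prev = text
        if run > max_repeats:
            return True
    return False


def build_sft_set(trajs: List[Trajectory]) -> List[Trajectory]:
    """verifier-accept -> drop loops -> dedup on the exact action sequence."""
    seen, kept = set(), []
    for t in trajs:
        if not t.solved or has_loop(t):
            continue
        key = (t.challenge_id, tuple(text for role, text in t.steps if role == "action"))
        if key not in seen:
            seen.add(key)
            kept.append(t)
    return kept


def loss_mask(traj: Trajectory) -> List[int]:
    """1 = train on this step (model action), 0 = masked (environment output)."""
    return [1 if role == "action" else 0 for role, _ in traj.steps]
```

The mask here is per step for readability; in a real training set it would be expanded to per-token labels so the model is never trained to predict tool output.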

→ It already can, but prefers the wrong behavior: DPO / KTO

The capability exists; you’re reshaping which behavior it prefers. KTO fits your data shape — unpaired solved/failed logs, no matched pairs needed. Do this after any real knowledge gap is SFT-resolved.
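
The branches above can be collapsed into one function. This is a sketch of the routing logic as I read it, not a rule from any paper: the argument names are hypothetical, and the precedence (teacher before verifier, preference optimization as the fallback when neither exists) is an assumption drawn from the text.

```python
def route(
    hits_at_high_n: int,         # correct attempts seen across N samples of the failing set
    has_stronger_teacher: bool,  # a model genuinely better at your CTFs
    has_verifier: bool,          # e.g. a deterministic flag check
) -> str:
    """Map the routing question to one of the branches above."""
    if hits_at_high_n == 0:
        # The behavior never fires, so there is nothing to reinforce.
        return "inject off-policy: knowledge SFT, teacher data, or a tool"
    if has_stronger_teacher:
        return "on-policy distillation"
    if has_verifier:
        return "rejection-sampling SFT, then GRPO when entropy collapses"
    return "DPO / KTO on solved/failed logs"


print(route(hits_at_high_n=0, has_stronger_teacher=True, has_verifier=True))
print(route(hits_at_high_n=7, has_stronger_teacher=False, has_verifier=True))
```

Run it per cluster of failing challenges, not once for the whole set: different failure types can land on different branches.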

The one sentence to carry: every fine-tuning method is the same move — push probability mass toward good behavior — differing only on whose distribution the data comes from (off- vs on-policy) and whether you use the failure signal. Off-policy imitation is structurally blind to execution gaps; that’s why your friend’s SFT stalled, and why any on-policy method is the fix.