Method → Data (your real bottleneck)

Your words: “it’s not that we don’t have data; it’s that we don’t know what data we want and what fine-tune we want.” This chapter is the fix, and it’s a single causal claim:

You do not pick data and then a method. You pick the method — by failure type — and the method dictates the data object you must produce.

Once the method is chosen, “what data do we want” is answered mechanically. Here’s the mapping:

| Method | Data object it consumes | For your harness, where it comes from |
|---|---|---|
| SFT / off-policy distillation | full trajectories from a source | curate, or run a stronger model on your challenges and keep its solves |
| On-policy distillation | your model’s own rollouts, graded per-token by a teacher | your rollouts + a stronger teacher model |
| Rejection-sampling FT | your model’s own verifier-passed trajectories | you already generate these; filter your ~100–200 solves |
| DPO | (chosen, rejected) trajectory pairs at a decision point | pair a solved run vs a failed run on the same challenge |
| KTO | unpaired trajectories tagged good/bad | your solved pile + your failed pile, as-is (no pairing) |
| GRPO / RLVR | prompts + a verify() fn; no fixed dataset | your challenge set + the flag check |
| Agentic RL | a live environment emitting rollouts + end-of-episode reward | your harness itself, as a rollout service |
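
To make the DPO and KTO rows concrete, here is a minimal sketch of how one rollout log could be reshaped into either data object. The `Rollout` fields and the dictionary keys are hypothetical names chosen for illustration, not your harness's schema. The sketch also pairs whole trajectories, which is a simplification of pairing at a decision point.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    # Hypothetical record layout; adapt to whatever your harness logs.
    challenge_id: str
    trajectory: str
    solved: bool


def kto_examples(rollouts):
    """KTO object: unpaired trajectories, each tagged good (True) or bad (False)."""
    return [
        {"prompt": r.challenge_id, "completion": r.trajectory, "label": r.solved}
        for r in rollouts
    ]


def dpo_pairs(rollouts):
    """DPO object: a solved run paired with a failed run on the same challenge."""
    piles = defaultdict(lambda: {"solved": [], "failed": []})
    for r in rollouts:
        piles[r.challenge_id]["solved" if r.solved else "failed"].append(r)

    pairs = []
    for cid, pile in piles.items():
        # zip stops at the shorter pile, so challenges with only
        # solves or only failures contribute no pairs.
        for good, bad in zip(pile["solved"], pile["failed"]):
            pairs.append(
                {"prompt": cid, "chosen": good.trajectory, "rejected": bad.trajectory}
            )
    return pairs


log = [
    Rollout("c1", "run-a", True),
    Rollout("c1", "run-b", False),
    Rollout("c2", "run-c", False),
]
pairs = dpo_pairs(log)
tagged = kto_examples(log)
```

Note the asymmetry: KTO uses all three runs, while DPO can use only the challenge that has both a solve and a failure.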

Two consequences you can act on immediately:

  • RLVR needs almost no dataset — just challenges + a verifier. You have both. The “data problem” nearly vanishes; the work moves to the reward fn and rollout infra.
  • Rejection-sampling FT needs only your own solves, which you’re already producing. It’s the lowest-friction first move because the data object is a byproduct of running the benchmark.
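
Both consequences rest on the same small piece of code: a verifier. The sketch below assumes the flag check is an exact string match and that each rollout record carries a submitted flag; the names `verify`, `rejection_sample`, and the record keys are illustrative, not your harness's API.

```python
def verify(submitted_flag: str, expected_flag: str) -> float:
    """Binary verifiable reward: 1.0 if the flag matches, else 0.0.

    Assumes an exact-match flag check after trimming whitespace.
    """
    return 1.0 if submitted_flag.strip() == expected_flag else 0.0


def rejection_sample(rollouts, expected_flags):
    """Keep only the trajectories whose submitted flag passes the verifier."""
    return [
        r
        for r in rollouts
        if verify(r["flag"], expected_flags[r["challenge_id"]]) == 1.0
    ]


# Hypothetical challenge set and rollout log.
expected_flags = {"c1": "FLAG{a}", "c2": "FLAG{b}"}
rollouts = [
    {"challenge_id": "c1", "flag": "FLAG{a}\n", "trajectory": "t1"},
    {"challenge_id": "c2", "flag": "FLAG{x}", "trajectory": "t2"},
]

# RLVR view: the verifier is the reward function, scored per rollout.
rewards = [verify(r["flag"], expected_flags[r["challenge_id"]]) for r in rollouts]

# Rejection-sampling view: the same verifier filters a fine-tuning set.
sft_set = rejection_sample(rollouts, expected_flags)
```

The same `verify` function serves as the reward for RLVR and as the filter for rejection-sampling FT, which is why the second method costs so little once the first exists.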

So the real question isn’t “what data” — it’s “which gap”

The data object is downstream of the gap diagnosis. Do that first (see The decision), and the data spec falls out. The diagnostic that routes everything:

Does the correct action ever appear in the model’s own outputs, even rarely, at high sampling N?

  • Never → knowledge gap → you need external trajectories (SFT / teacher / a tool). Data = curated or teacher-generated.
  • Sometimes (your case: solves 100–200/1000) → execution gap → data = your own rollouts (rejection-sampling) or a verifier (RLVR). You already have both.
  • Mis-ranked (the correct action appears, but the model prefers a worse one) → data = good/bad pairs or tagged logs (DPO/KTO). You already have both piles.
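
The first two branches of the diagnostic can be sketched as a routing function over sampled outcomes. This is a rough illustration, assuming you have N pass/fail results per challenge; the function name and the label strings are made up for the example. It leaves out the mis-ranked branch, which needs a ranking signal that pass/fail counts alone do not provide.

```python
def diagnose(samples_per_challenge):
    """Route each challenge to a gap type from its N sampled outcomes.

    Zero solves in N samples only suggests a knowledge gap; a larger N
    may still reveal rare successes.
    """
    routes = {}
    for challenge_id, outcomes in samples_per_challenge.items():
        if any(outcomes):
            # The correct action appears at least once: execution gap.
            routes[challenge_id] = "execution gap"
        else:
            # The correct action never appears: knowledge gap.
            routes[challenge_id] = "knowledge gap"
    return routes


# Which data object each diagnosis points to, per the bullets above.
DATA_FOR_GAP = {
    "knowledge gap": "external trajectories (curated or teacher-generated)",
    "execution gap": "own rollouts (rejection-sampling) or a verifier (RLVR)",
}

# Hypothetical outcomes: True means the flag check passed on that sample.
samples = {
    "c1": [False, False, True, False],
    "c2": [False, False, False, False],
}
routes = diagnose(samples)
plan = {cid: DATA_FOR_GAP[gap] for cid, gap in routes.items()}
```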

In the last two cases you already possess the data, and in the first you can generate it by running a stronger model on your challenges. That is why your instinct that “data isn’t the bottleneck” is correct. The bottleneck was the method, and the method is chosen by the gap.