Method → Data (your real bottleneck)
Your words: “it’s not that we don’t have data; it’s that we don’t know what data we want and what fine-tune we want.” This chapter is the fix, and it’s a single causal claim:
You do not pick data and then a method. You pick the method — by failure type — and the method dictates the data object you must produce.
Once the method is chosen, “what data do we want” is answered mechanically. Here’s the mapping:
| Method | Data object it consumes | For your harness, where it comes from |
|---|---|---|
| SFT / off-policy distillation | full trajectories from a source | curate, or run a stronger model on your challenges and keep its solves |
| On-policy distillation | your model’s own rollouts, graded per-token by a teacher | your rollouts + a stronger teacher model |
| Rejection-sampling FT | your model’s own verifier-passed trajectories | you already generate these — filter your ~100–200 solves |
| DPO | (chosen, rejected) trajectory pairs at a decision point | pair a solved run vs a failed run on the same challenge |
| KTO | unpaired trajectories tagged good/bad | your solved pile + your failed pile, as-is (no pairing) |
| GRPO / RLVR | prompts + a verify() fn — no fixed dataset | your challenge set + the flag check |
| Agentic RL | a live environment emitting rollouts + end-of-episode reward | your harness itself, as a rollout service |
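To make the three offline rows concrete, here is a minimal sketch of how one run log could be reshaped into the rejection-sampling, KTO, and DPO data objects. The record fields (`challenge`, `trajectory`, `solved`) and the sample log are hypothetical, not your harness's actual schema, and the DPO builder pairs whole trajectories rather than isolating a specific decision point.

```python
from collections import defaultdict
from itertools import product

# Hypothetical run log: one record per rollout (field names are assumptions).
runs = [
    {"challenge": "web-01", "trajectory": "t1", "solved": True},
    {"challenge": "web-01", "trajectory": "t2", "solved": False},
    {"challenge": "web-01", "trajectory": "t3", "solved": False},
    {"challenge": "pwn-07", "trajectory": "t4", "solved": False},
    {"challenge": "rev-03", "trajectory": "t5", "solved": True},
]

def rejection_sampling_set(runs):
    """Rejection-sampling FT: keep only verifier-passed trajectories."""
    return [r["trajectory"] for r in runs if r["solved"]]

def kto_records(runs):
    """KTO: every trajectory, tagged good/bad, with no pairing."""
    return [{"trajectory": r["trajectory"], "label": r["solved"]} for r in runs]

def dpo_pairs(runs):
    """DPO: (chosen, rejected) pairs drawn from the same challenge."""
    piles = defaultdict(lambda: {"good": [], "bad": []})
    for r in runs:
        key = "good" if r["solved"] else "bad"
        piles[r["challenge"]][key].append(r["trajectory"])
    pairs = []
    for challenge, pile in piles.items():
        # Challenges with only solves or only failures yield no pairs.
        for chosen, rejected in product(pile["good"], pile["bad"]):
            pairs.append(
                {"challenge": challenge, "chosen": chosen, "rejected": rejected}
            )
    return pairs
```

Note the asymmetry in yield: KTO uses every rollout, while DPO only gets pairs from challenges that have both a solve and a failure.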
Two consequences you can act on immediately:
- RLVR needs almost no dataset — just challenges + a verifier. You have both. The “data problem” nearly vanishes; the work moves to the reward fn and rollout infra.
- Rejection-sampling FT needs only your own solves, which you’re already producing. It’s the lowest-friction first move because the data object is a byproduct of running the benchmark.
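As a rough illustration of why the RLVR work moves to the reward function, here is a sketch of a binary flag-check verifier plus the group-relative advantage that GRPO-style training derives from it. The substring match, the `flag{...}` format, and the function names are assumptions for illustration; a real harness would likely check the flag more strictly.

```python
import statistics

def verify(transcript: str, expected_flag: str) -> float:
    """Binary reward: 1.0 if the expected flag appears in the rollout output.

    Substring matching is an assumption made for this sketch.
    """
    return 1.0 if expected_flag in transcript else 0.0

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    Each rollout's advantage is its reward minus the group mean,
    divided by the group standard deviation (plus eps for stability).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical rollouts on one challenge; only the first finds the flag.
transcripts = ["... flag{abc} ...", "no luck", "timeout", "wrong path"]
rewards = [verify(t, "flag{abc}") for t in transcripts]
advantages = group_advantages(rewards)
```

With this normalization, a group in which no rollout solves the challenge gets all-zero advantages and therefore no learning signal, so the verifier only helps on challenges the model solves at least occasionally.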
So the real question isn’t “what data” — it’s “which gap”
The data object is downstream of the gap diagnosis. Do that first (see The decision), and the data spec falls out. The diagnostic that routes everything:
Does the correct action ever appear in the model’s own outputs, even rarely, at high sampling N?
- Never → knowledge gap → you need external trajectories (SFT / teacher / a tool). Data = curated or teacher-generated.
- Sometimes (your case: solves 100–200/1000) → execution gap → data = your own rollouts (rejection-sampling) or a verifier (RLVR). You already have both.
- Mis-ranked (the model produces the correct action but prefers a worse one) → data = good/bad pairs or tagged logs (DPO/KTO). You already have both piles.
In the last two cases, you already possess or can trivially generate the data, which is why your instinct that “data isn’t the bottleneck” is correct. The bottleneck was the method, and the method is chosen by the gap.
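The routing question can be sketched as a per-challenge check on solve counts at high sampling N. The function names and the sample counts are hypothetical. The sketch separates only the "never" and "sometimes" cases, because detecting mis-ranking needs a preference signal that raw solve counts do not carry.

```python
def diagnose(solves_per_challenge: dict[str, int]) -> dict[str, str]:
    """Classify each challenge by whether any of its N samples solved it."""
    return {
        challenge: "knowledge gap" if solves == 0 else "execution gap"
        for challenge, solves in solves_per_challenge.items()
    }

def data_spec(gap: str) -> str:
    """Return the data object implied by a diagnosed gap."""
    specs = {
        "knowledge gap": "external trajectories (curated or teacher-generated)",
        "execution gap": "own rollouts (rejection-sampling) or a verifier (RLVR)",
    }
    return specs[gap]

# Hypothetical solve counts out of N samples per challenge.
counts = {"web-01": 3, "pwn-07": 0, "rev-03": 1}
for challenge, gap in diagnose(counts).items():
    print(f"{challenge}: {gap} -> {data_spec(gap)}")
```

Running this per challenge rather than on the aggregate solve rate matters: an overall rate of 100–200 out of 1000 can hide a subset of challenges that are never solved and so need external trajectories.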