Method → Data (your real bottleneck)

Your words: “it’s not that we don’t have data; it’s that we don’t know what data we want and what fine-tune we want.” This chapter is the fix, and it’s a single causal claim:

You do not pick data and then a method. You pick the method — by failure type — and the method dictates the data object you must produce.

Once the method is chosen, “what data do we want” is answered mechanically. Here’s the mapping:

Method	Data object it consumes	Where it comes from in your harness
SFT / off-policy distillation	full trajectories from a source	curate, or run a stronger model on your challenges and keep its solves
On-policy distillation	your model’s own rollouts, graded per-token by a teacher	your rollouts + a stronger teacher model
Rejection-sampling FT	your model’s own verifier-passed trajectories	you already generate these — filter your ~100–200 solves
DPO	`(chosen, rejected)` trajectory pairs at a decision point	pair a solved run vs a failed run on the same challenge
KTO	unpaired trajectories tagged `good`/`bad`	your solved pile + your failed pile, as-is (no pairing)
GRPO / RLVR	prompts + a `verify()` fn — no fixed dataset	your challenge set + the flag check
Agentic RL	a live environment emitting rollouts + end-of-episode reward	your harness itself, as a rollout service
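To make the DPO and KTO rows concrete, here is a minimal sketch of turning one pile of rollout logs into both data objects. The `Rollout` record, its field names, and the output dict keys are assumptions for illustration, not your harness's format or any trainer's required schema. It also pairs whole trajectories; cutting each pair down to the decision point where the runs diverge is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    # Hypothetical log record; adapt the fields to whatever your harness emits.
    challenge_id: str
    trajectory: str  # serialized agent transcript
    solved: bool     # True if the flag check passed


def to_kto(rollouts):
    """Unpaired records tagged good/bad: the solved and failed piles as-is."""
    return [
        {
            "prompt": r.challenge_id,
            "completion": r.trajectory,
            "label": "good" if r.solved else "bad",
        }
        for r in rollouts
    ]


def to_dpo(rollouts):
    """One (chosen, rejected) pair per challenge that has both a solve and a failure."""
    by_challenge = defaultdict(lambda: {"solved": [], "failed": []})
    for r in rollouts:
        by_challenge[r.challenge_id]["solved" if r.solved else "failed"].append(r)

    pairs = []
    for challenge_id, group in by_challenge.items():
        # Challenges with only solves or only failures cannot form a pair.
        if group["solved"] and group["failed"]:
            pairs.append(
                {
                    "prompt": challenge_id,
                    "chosen": group["solved"][0].trajectory,
                    "rejected": group["failed"][0].trajectory,
                }
            )
    return pairs
```

Note the asymmetry: KTO uses every rollout, while DPO silently drops any challenge that lacks either a solve or a failure.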

Two consequences you can act on immediately:

RLVR needs almost no dataset — just challenges + a verifier. You have both. The “data problem” nearly vanishes; the work moves to the reward fn and rollout infra.
Rejection-sampling FT needs only your own solves, which you’re already producing. It’s the lowest-friction first move because the data object is a byproduct of running the benchmark.
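Both consequences fit in a few lines. The sketch below assumes the verifier is an exact-match flag check and that each rollout is a `(challenge_id, trajectory, submitted_flag)` tuple; those shapes, and the names `verify` and `rejection_sample`, are illustrative rather than taken from your harness.

```python
def verify(submitted_flag: str, expected_flag: str) -> float:
    """Binary reward for RLVR: 1.0 if the submitted flag matches, else 0.0."""
    return 1.0 if submitted_flag.strip() == expected_flag else 0.0


def rejection_sample(rollouts, flags):
    """Keep only verifier-passed trajectories, dropping exact duplicates.

    rollouts: iterable of (challenge_id, trajectory, submitted_flag) tuples.
    flags:    dict mapping challenge_id to the expected flag.
    """
    kept, seen = [], set()
    for challenge_id, trajectory, submitted in rollouts:
        key = (challenge_id, trajectory)
        if verify(submitted, flags[challenge_id]) == 1.0 and key not in seen:
            seen.add(key)
            kept.append({"prompt": challenge_id, "completion": trajectory})
    return kept
```

The same `verify` serves both routes: as the reward signal for RLVR, and as the filter that turns benchmark runs into a rejection-sampling fine-tuning set.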

So the real question isn’t “what data” — it’s “which gap”

The data object is downstream of the gap diagnosis. Do that first (see The decision), and the data spec falls out. The diagnostic that routes everything:

Does the correct action ever appear in the model’s own outputs, even rarely, at high sampling N?

Never → knowledge gap → you need external trajectories (SFT / teacher / a tool). Data = curated or teacher-generated.
Sometimes (your case: solves 100–200/1000) → execution gap → data = your own rollouts (rejection-sampling) or a verifier (RLVR). You already have both.
Mis-ranked → data = good/bad pairs or tagged logs (DPO/KTO). You already have both piles.
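The routing above can be sketched as a small function. This is a simplification under stated assumptions: `n_correct` out of `n_samples` stands in for "does the correct action ever appear at high sampling N", and `mis_ranked` is a hypothetical flag you set yourself, since the text does not specify how to detect mis-ranking.

```python
def diagnose_gap(n_correct: int, n_samples: int, mis_ranked: bool = False):
    """Route a sampling result to (gap, candidate methods).

    n_correct:  samples in which the correct action appeared.
    n_samples:  total samples drawn (the "high sampling N").
    mis_ranked: hypothetical caller-supplied flag; True if the model produces
                the correct action but ranks a wrong one above it.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0 <= n_correct <= n_samples:
        raise ValueError("n_correct must be between 0 and n_samples")

    if n_correct == 0:
        # Never appears: the data has to come from outside the model.
        return ("knowledge", ["SFT", "teacher", "tool"])
    if mis_ranked:
        # Appears but loses to a worse action: good/bad pairs or tagged logs.
        return ("mis-ranked", ["DPO", "KTO"])
    # Appears sometimes: the model's own rollouts or a verifier are enough.
    return ("execution", ["rejection-sampling FT", "RLVR"])
```

For example, `diagnose_gap(150, 1000)` lands in the execution branch, which matches the 100–200 solves per 1000 case described above.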

In all three cases, you already possess or can trivially generate the data, which is why your instinct that “data isn’t the bottleneck” is correct. The bottleneck was the method, and the method is chosen by the gap.

Post-Training Field Notes
