What “data” actually means for an agent

You asked the right question earlier: “demonstration = the answer — are you talking about trajectories?” Yes. Being concrete about the data object dissolves most of the confusion, because each method eats a different-shaped object, and for an agent those shapes are not what the chatbot literature implies.

A “demonstration” is a trajectory, not an answer

Chatbot SFT example = (prompt → ideal response text).

Agent SFT example = the whole trajectory:

system prompt
→ [assistant: reasoning] → [tool_call: shell "nmap …"] → [tool_result: <output>]
→ [assistant: reasoning] → [tool_call: http_request …] → [tool_result: <output>]
→ …
→ [assistant: submit_flag("FLAG{…}")]

The training target is the path, not the flag. That is what SFT and rejection-sampling FT imitate token-by-token.
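To make the shape difference concrete, here is a minimal sketch of the two data objects as Python dataclasses. The names `Step`, `Trajectory`, `ChatExample`, and `POLICY_ROLES`, as well as the role strings, are hypothetical and only mirror the bracketed labels in the trace above; the project's real schema may differ.

```python
from dataclasses import dataclass, field
from typing import Literal

# Hypothetical role names, mirroring the bracketed labels in the trace above.
Role = Literal["system", "assistant", "tool_call", "tool_result", "submit"]

# Roles whose tokens are emitted by the policy rather than the environment.
POLICY_ROLES = {"assistant", "tool_call", "submit"}


@dataclass
class ChatExample:
    """Chatbot SFT example: one prompt, one ideal response."""
    prompt: str
    response: str


@dataclass
class Step:
    role: Role
    content: str


@dataclass
class Trajectory:
    """Agent SFT example: the whole ordered path, not just the final answer."""
    steps: list[Step] = field(default_factory=list)

    def policy_steps(self) -> list[Step]:
        """Steps written by the model: reasoning, tool calls, final submission."""
        return [s for s in self.steps if s.role in POLICY_ROLES]

    def environment_steps(self) -> list[Step]:
        """Steps the model did not write: system prompt and tool results."""
        return [s for s in self.steps if s.role not in POLICY_ROLES]


# The trace from the text, with placeholders left as placeholders.
demo = Trajectory(steps=[
    Step("system", "<system prompt>"),
    Step("assistant", "<reasoning>"),
    Step("tool_call", 'shell "nmap …"'),
    Step("tool_result", "<output>"),
    Step("assistant", "<reasoning>"),
    Step("tool_call", "http_request …"),
    Step("tool_result", "<output>"),
    Step("submit", 'submit_flag("FLAG{…}")'),
])
```

In this sketch the flag is just the last of five policy-authored steps; everything before it is equally part of the training target.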

Loss-masking: don’t train the model to predict the world

Tool outputs (observations) are part of the sequence, but they are not the policy’s own tokens: they come from the environment. Standard practice is to mask the loss on prompt and observation spans and compute it only on the model-generated reasoning, tool_call, and submit tokens. Otherwise you teach the model to hallucinate stdout. (This is the same principle as instruction tuning masking the prompt; for agent traces it is applied to every observation span.) The project’s own trajectory-synthesis note treats observation-masking as a first-class filter step; see lessons/post-training/verified-trajectory-synthesis-recipe.md in the shared memory pool. (Caveat: that recipe’s specific numbers derive from an unverified third-party report, so trust the procedure, not the figures.)
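The masking rule can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's pipeline: `ToyTokenizer`, `build_labels`, `TRAINABLE_ROLES`, and the role names are hypothetical, and the whitespace tokenizer stands in for a real one. The `-100` sentinel follows the default `ignore_index` of PyTorch's `CrossEntropyLoss`; the one-position shift for next-token prediction is assumed to happen inside the training framework.

```python
IGNORE_INDEX = -100  # matches the default ignore_index of PyTorch's CrossEntropyLoss

# Hypothetical role names: only spans with these roles contribute to the loss.
TRAINABLE_ROLES = {"assistant", "tool_call", "submit"}


class ToyTokenizer:
    """Whitespace tokenizer with a growing vocabulary (stand-in for a real tokenizer)."""

    def __init__(self) -> None:
        self.vocab: dict[str, int] = {}

    def __call__(self, text: str) -> list[int]:
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in text.split()]


def build_labels(segments, tokenize):
    """Return (input_ids, labels) for a trajectory given as (role, text) segments.

    Labels copy the token ids on policy-authored spans and are IGNORE_INDEX on
    prompt and observation spans, so those positions add nothing to the loss.
    """
    input_ids: list[int] = []
    labels: list[int] = []
    for role, text in segments:
        ids = tokenize(text)
        input_ids.extend(ids)
        if role in TRAINABLE_ROLES:
            labels.extend(ids)
        else:
            labels.extend([IGNORE_INDEX] * len(ids))
    return input_ids, labels


segments = [
    ("system", "you are an agent"),
    ("assistant", "scan the host first"),
    ("tool_call", "shell nmap"),
    ("tool_result", "port list from the environment"),
    ("submit", "submit_flag"),
]
input_ids, labels = build_labels(segments, ToyTokenizer())
trainable = sum(label != IGNORE_INDEX for label in labels)
```

The model still sees the observation tokens as context in `input_ids`; it is only excused from being graded on predicting them.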

The data shapes, per family (forward reference)

| Family | Training object it consumes | Where it comes from |
| --- | --- | --- |
| Imitation (SFT, distillation) | full trajectories (yours, a teacher’s, or human) | curate / run a teacher |
| Rejection-sampling FT | your own verifier-passed trajectories | you already generate these |
| Preference (DPO/KTO) | `(chosen, rejected)` trajectory pairs, or tagged good/bad | your solved vs failed logs |
| RL (GRPO/RLVR) | prompts + a reward/verifier fn (no fixed dataset) | your challenges + `verify()` |

This table is the hinge of the whole book. It’s expanded in Method → Data, which is the chapter that actually addresses your stated bottleneck (“we don’t know what data we want”). The short version: you don’t choose data and then a method — you choose the method, and the method dictates the data object.
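As a rough illustration of "the method dictates the data object", the sketch below derives three of the four shapes from one log of verifier-scored rollouts. Everything here is hypothetical: the `Rollout` record, the function names, and the choice to pair chosen and rejected trajectories only within the same challenge (a common convention for preference data, assumed here rather than taken from the source).

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable


@dataclass(frozen=True)
class Rollout:
    """One logged attempt (hypothetical record layout)."""
    challenge_id: str
    prompt: str
    trajectory: str  # serialized trajectory
    passed: bool     # verifier verdict


def rejection_sampling_set(rollouts: list[Rollout]) -> list[str]:
    """Rejection-sampling FT: keep only verifier-passed trajectories."""
    return [r.trajectory for r in rollouts if r.passed]


def preference_pairs(rollouts: list[Rollout]) -> list[tuple[str, str]]:
    """Preference methods: (chosen, rejected) pairs within the same challenge."""
    by_challenge: dict[str, dict[str, list[str]]] = {}
    for r in rollouts:
        bucket = by_challenge.setdefault(r.challenge_id, {"good": [], "bad": []})
        bucket["good" if r.passed else "bad"].append(r.trajectory)
    pairs: list[tuple[str, str]] = []
    for bucket in by_challenge.values():
        # Challenges with only passes or only failures yield no pairs.
        pairs.extend(product(bucket["good"], bucket["bad"]))
    return pairs


def rl_inputs(rollouts: list[Rollout], verify: Callable[[str, str], bool]):
    """RL: no fixed dataset, just the prompts plus a verifier to score fresh rollouts."""
    prompts = sorted({r.prompt for r in rollouts})
    return prompts, verify
```

Note that a challenge the agent never solves contributes nothing to the first two objects but still supplies a prompt to the third, which is one practical way the choice of method changes what counts as usable data.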
