What “data” actually means for an agent
You asked the right question earlier: “demonstration = the answer — are you talking about trajectories?” Yes. Being concrete about the data object dissolves most of the confusion, because each method eats a different-shaped object, and for an agent those shapes are not what the chatbot literature implies.
A “demonstration” is a trajectory, not an answer
- Chatbot SFT example = (prompt → ideal response text).
- Agent SFT example = the whole trajectory:
system prompt → [assistant: reasoning] → [tool_call: shell "nmap …"] → [tool_result: <output>] → [assistant: reasoning] → [tool_call: http_request …] → [tool_result: <output>] → … → [assistant: submit_flag("FLAG{…}")]
The training target is the path, not the flag. That is what SFT and rejection-sampling FT imitate token-by-token.
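To make the shape concrete, here is a minimal sketch of a trajectory as a data object. The class names (`Step`, `Trajectory`), the role labels, and the `policy_steps` helper are illustrative assumptions, not the project's actual schema; the step contents are placeholders that mirror the trace above.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical role labels, chosen to mirror the trace above.
POLICY_ROLES = ("assistant", "tool_call")       # emitted by the model
ENVIRONMENT_ROLES = ("system", "tool_result")   # given to the model


@dataclass
class Step:
    role: str      # one of POLICY_ROLES or ENVIRONMENT_ROLES
    content: str   # raw text of this span


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)

    def policy_steps(self) -> List[Step]:
        """Return only the spans the model itself produced."""
        return [s for s in self.steps if s.role in POLICY_ROLES]


# Placeholder content only; elided parts of the original trace stay elided.
example = Trajectory(steps=[
    Step("system", "<system prompt>"),
    Step("assistant", "<reasoning>"),
    Step("tool_call", 'shell "nmap …"'),
    Step("tool_result", "<output>"),
    Step("assistant", "<reasoning>"),
    Step("tool_call", "http_request …"),
    Step("tool_result", "<output>"),
    Step("assistant", 'submit_flag("FLAG{…}")'),
])

print(len(example.steps), "steps,", len(example.policy_steps()), "from the policy")
```

One SFT example is the whole `Trajectory`, not its last step. Splitting policy spans from environment spans at the data level is a design choice that pays off when you decide which tokens carry loss.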
Loss-masking: don’t train the model to predict the world
Tool outputs / observations are part of the sequence but are not the policy’s own tokens — they come from the environment. Standard practice is to mask the loss on prompt and observation spans and only compute loss on the model-generated reasoning + tool_call + submit tokens. Otherwise you teach the model to hallucinate stdout. (This is the same principle as instruction-tuning masking the prompt; for agent traces it’s applied to every observation span.) The project’s own trajectory-synthesis note treats observation-masking as a first-class filter step — see lessons/post-training/verified-trajectory-synthesis-recipe.md in the shared memory pool. (Caveat: that recipe’s specific numbers derive from an unverified third-party report — trust the procedure, not the figures.)
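The masking rule can be sketched in a few lines. This is a toy illustration under stated assumptions: `toy_tokenize` is a stand-in for a real tokenizer, the role names are hypothetical, and `-100` is used because it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The usual one-token shift between inputs and targets is left to the training loop.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

# Hypothetical role names: only these spans contribute to the loss.
TRAINABLE_ROLES = {"assistant", "tool_call"}


def toy_tokenize(text):
    """Stand-in tokenizer: one deterministic integer id per word."""
    return [sum(map(ord, word)) % 1000 for word in text.split()]


def build_labels(spans):
    """Turn (role, text) spans into parallel input_ids and labels.

    Prompt and observation tokens stay in input_ids, so the model still
    conditions on them, but their labels are set to IGNORE_INDEX so they
    add nothing to the loss.
    """
    input_ids, labels = [], []
    for role, text in spans:
        ids = toy_tokenize(text)
        input_ids.extend(ids)
        if role in TRAINABLE_ROLES:
            labels.extend(ids)
        else:
            labels.extend([IGNORE_INDEX] * len(ids))
    return input_ids, labels


spans = [
    ("system", "You are an agent"),              # prompt: masked
    ("assistant", "scan the host"),              # policy: trained
    ("tool_call", "shell nmap"),                 # policy: trained
    ("tool_result", "raw scanner output here"),  # observation: masked
]
input_ids, labels = build_labels(spans)
```

In this example the four system tokens and the four `tool_result` tokens are masked, and the five policy tokens in between keep their ids as labels.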
The data shapes, per family (forward reference)
| Family | Training object it consumes | Where it comes from |
|---|---|---|
| Imitation (SFT, distillation) | full trajectories (yours, a teacher’s, or human) | curate / run a teacher |
| Rejection-sampling FT | your own verifier-passed trajectories | you already generate these |
| Preference (DPO/KTO) | (chosen, rejected) trajectory pairs, or tagged good/bad | your solved vs failed logs |
| RL (GRPO/RLVR) | prompts + a reward/verifier fn — no fixed dataset | your challenges + verify() |
This table is the hinge of the whole book. It’s expanded in Method → Data, which is the chapter that actually addresses your stated bottleneck (“we don’t know what data we want”). The short version: you don’t choose data and then a method — you choose the method, and the method dictates the data object.
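As a rough illustration of how the method dictates the data object, the sketch below derives three of the table's rows from one pile of logs. The log format (`(challenge_id, trajectory)` tuples), the function name `build_datasets`, and the dictionary keys are assumptions made for the example; `verify` stands in for whatever verifier you already have.

```python
from collections import defaultdict
from itertools import product


def build_datasets(logs, verify):
    """Split raw logs into the objects each method family consumes.

    logs:   iterable of (challenge_id, trajectory) tuples (assumed format)
    verify: callable, trajectory -> bool (your existing verifier)
    """
    solved, failed = defaultdict(list), defaultdict(list)
    challenges = set()
    for challenge_id, trajectory in logs:
        challenges.add(challenge_id)
        bucket = solved if verify(trajectory) else failed
        bucket[challenge_id].append(trajectory)

    # Rejection-sampling FT: verifier-passed trajectories only.
    rft = [t for trajs in solved.values() for t in trajs]

    # Preference: pair solved with failed runs of the same challenge.
    pairs = [
        {"challenge": cid, "chosen": good, "rejected": bad}
        for cid in solved if cid in failed
        for good, bad in product(solved[cid], failed[cid])
    ]

    # RL: no fixed dataset, just the challenges to roll out against verify().
    rl_prompts = sorted(challenges)
    return rft, pairs, rl_prompts


# Toy logs: strings stand in for full trajectories.
logs = [("c1", "ok-a"), ("c1", "bad-b"), ("c2", "bad-c"), ("c1", "ok-d")]
rft, pairs, rl_prompts = build_datasets(logs, lambda t: t.startswith("ok"))
```

The same logs yield a different object per family. Challenge `c2` has no solved run, so it contributes nothing to the preference pairs but remains a valid RL prompt.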