What “data” actually means for an agent

You asked the right question earlier: “demonstration = the answer — are you talking about trajectories?” Yes. Being concrete about the data object dissolves most of the confusion, because each method eats a different-shaped object, and for an agent those shapes are not what the chatbot literature implies.

A “demonstration” is a trajectory, not an answer

  • Chatbot SFT example = (prompt → ideal response text).

  • Agent SFT example = the whole trajectory:

    system prompt
    → [assistant: reasoning] → [tool_call: shell "nmap …"] → [tool_result: <output>]
    → [assistant: reasoning] → [tool_call: http_request …] → [tool_result: <output>]
    → …
    → [assistant: submit_flag("FLAG{…}")]
    

The training target is the path, not the flag. That is what SFT and rejection-sampling FT imitate token-by-token.
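To make the difference in shape concrete, here is a minimal sketch of the two example types as Python dataclasses. The class names, role names, and placeholder strings are assumptions made for illustration; they are not the project's actual schema, and the elided commands and outputs from the diagram are left elided.

```python
from dataclasses import dataclass
from typing import List, Literal

# Hypothetical role names, chosen to mirror the diagram above.
Role = Literal["system", "assistant", "tool_call", "tool_result"]


@dataclass
class Step:
    role: Role
    content: str


@dataclass
class ChatExample:
    """Chatbot SFT: one prompt, one ideal response."""
    prompt: str
    response: str


@dataclass
class AgentExample:
    """Agent SFT: the whole ordered path, including environment output."""
    steps: List[Step]

    def policy_steps(self) -> List[Step]:
        """Steps the model itself wrote (reasoning and tool calls)."""
        return [s for s in self.steps if s.role in ("assistant", "tool_call")]


# Placeholder contents only; real traces carry full commands and outputs.
example = AgentExample(steps=[
    Step("system", "system prompt"),
    Step("assistant", "reasoning"),
    Step("tool_call", 'shell "nmap …"'),
    Step("tool_result", "<output>"),
    Step("assistant", "reasoning"),
    Step("tool_call", "http_request …"),
    Step("tool_result", "<output>"),
    Step("tool_call", 'submit_flag("FLAG{…}")'),
])
```

Calling `example.policy_steps()` returns only the reasoning and tool-call steps. The final flag is just the last of those steps, which is the sense in which the path, not the flag, is the target.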

Loss-masking: don’t train the model to predict the world

Tool outputs (observations) are part of the sequence, but they are not the policy’s own tokens: they come from the environment. Standard practice is to mask the loss on prompt and observation spans and compute it only on the model-generated reasoning, tool_call, and submit tokens. Otherwise the loss rewards the model for predicting environment output, which teaches it to hallucinate stdout. (This is the same principle as instruction-tuning masking the prompt; for agent traces it’s applied to every observation span.) The project’s own trajectory-synthesis note treats observation-masking as a first-class filter step; see lessons/post-training/verified-trajectory-synthesis-recipe.md in the shared memory pool. (Caveat: that recipe’s specific numbers derive from an unverified third-party report, so trust the procedure, not the figures.)
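The masking rule can be sketched in a few lines. This is an illustration under stated assumptions, not the project's pipeline: `toy_tokenize` is a stand-in for a real tokenizer, the role names are hypothetical, and `-100` is used as the masked-label value because it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`. Any next-token label shifting is assumed to happen in the training framework.

```python
import zlib
from typing import List, Tuple

# Default ignore_index of torch.nn.CrossEntropyLoss; labels with this
# value contribute nothing to the loss.
IGNORE_INDEX = -100

# Hypothetical role names: spans the policy itself generated.
POLICY_ROLES = {"assistant", "tool_call", "submit"}


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer: one deterministic id per whitespace-separated word."""
    return [zlib.crc32(word.encode("utf-8")) % 50_000 for word in text.split()]


def build_labels(segments: List[Tuple[str, str]]) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) with prompt/observation spans masked out."""
    input_ids: List[int] = []
    labels: List[int] = []
    for role, text in segments:
        ids = toy_tokenize(text)
        input_ids.extend(ids)
        if role in POLICY_ROLES:
            labels.extend(ids)                        # train on policy tokens
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # mask system + tool output
    return input_ids, labels


# Toy trace; the strings are placeholders, not real tool output.
segments = [
    ("system", "You are a CTF agent"),
    ("assistant", "scan the host"),
    ("tool_call", "shell nmap"),
    ("tool_result", "port 80 open"),
    ("submit", "submit_flag"),
]
input_ids, labels = build_labels(segments)
```

Every token stays in `input_ids`, so the model still conditions on the tool output; only `labels` differs. The observation span is context to read, not a target to reproduce.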

The data shapes, per family (forward reference)

| Family | Training object it consumes | Where it comes from |
|---|---|---|
| Imitation (SFT, distillation) | full trajectories (yours, a teacher’s, or human) | curate / run a teacher |
| Rejection-sampling FT | your own verifier-passed trajectories | you already generate these |
| Preference (DPO/KTO) | (chosen, rejected) trajectory pairs, or tagged good/bad | your solved vs failed logs |
| RL (GRPO/RLVR) | prompts + a reward/verifier fn; no fixed dataset | your challenges + verify() |

This table is the hinge of the whole book. It’s expanded in Method → Data, which is the chapter that actually addresses your stated bottleneck (“we don’t know what data we want”). The short version: you don’t choose data and then a method — you choose the method, and the method dictates the data object.
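As a rough sketch of "the method dictates the data object", the snippet below derives three of the table's shapes from one pool of logged rollouts. Everything here is hypothetical: the `Rollout` fields, the helper names, and the toy `verify` function are assumptions made for illustration, and trajectories are simplified to plain strings.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Rollout:
    task_id: str
    prompt: str
    trajectory: str  # serialized steps, simplified to a string here
    passed: bool     # verifier result recorded for this rollout


def rft_dataset(rollouts: List[Rollout]) -> List[Rollout]:
    """Rejection-sampling FT: keep only verifier-passed trajectories."""
    return [r for r in rollouts if r.passed]


def preference_pairs(rollouts: List[Rollout]) -> List[Tuple[Rollout, Rollout]]:
    """Preference methods: (chosen, rejected) pairs drawn from the same task."""
    by_task: Dict[str, Dict[bool, List[Rollout]]] = defaultdict(
        lambda: {True: [], False: []}
    )
    for r in rollouts:
        by_task[r.task_id][r.passed].append(r)
    return [
        (good, bad)
        for group in by_task.values()
        for good in group[True]
        for bad in group[False]
    ]


def rl_inputs(
    rollouts: List[Rollout], verify: Callable[[str], float]
) -> Tuple[List[str], Callable[[str], float]]:
    """RL: just the prompts plus a reward function; no fixed dataset."""
    return sorted({r.prompt for r in rollouts}), verify


def verify(trajectory: str) -> float:
    """Toy verifier: reward 1.0 if the trajectory contains a flag marker."""
    return 1.0 if "FLAG{" in trajectory else 0.0


# Toy logs for illustration only.
logs = [
    Rollout("t1", "challenge 1", "… FLAG{a}", True),
    Rollout("t1", "challenge 1", "gave up", False),
    Rollout("t1", "challenge 1", "timed out", False),
    Rollout("t2", "challenge 2", "gave up", False),
]
sft_like = rft_dataset(logs)
pairs = preference_pairs(logs)
prompts, reward_fn = rl_inputs(logs, verify)
```

The same four log entries yield one training trajectory, two preference pairs, and two RL prompts. The logs do not change; what counts as "the data" depends on which method consumes them.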