Introduction

This is a living field manual for post-training an agent, written for an engineer who wants to understand a method first and then apply it, rather than experiment blindly. It grows: each session adds or sharpens a chapter.

The problem it exists to answer

~1,000 CTF challenges. The agent solves ~100–200 and fails ~800. We have the harness, the workflow, the budget, and the backing to do anything. The bottleneck is not data — it’s knowing which fine-tuning method we want, and therefore what data to build, and why.

Everything here builds toward answering that from first principles.

How to read it

What’s canonical vs. what’s a teaching scaffold (read this once)

Being honest about provenance, because you’re becoming a researcher and the distinction matters:

  • Canonical, universal, you’ll find it in any RL/post-training text: the on-policy vs. off-policy axis, and the three learning paradigms (imitation / preference / reinforcement). These are load-bearing and not up for debate.
  • My teaching scaffold: any packaging that presents these as “N knobs you freely toggle.” The axes describe methods; they are not independent dials you combine, because each named method is a fixed preset. An earlier interactive matrix implied free combination and produced nonsense for some combinations. That was the scaffold over-reaching. Corrected here: learn the one axis (on/off-policy) plus the fixed method presets, not a combinatorial grid.
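
To make "fixed presets, not free dials" concrete, here is a minimal sketch. The names `DataSource`, `Paradigm`, `MethodPreset`, `PRESETS` and `lookup` are hypothetical, and the three entries assume the common textbook form of each method (SFT on demonstrations, DPO on a fixed preference set, PPO on fresh rollouts); they are illustrations, not a catalogue from this manual.

```python
from dataclasses import dataclass
from enum import Enum


class DataSource(Enum):
    OFF_POLICY = "off-policy"  # data produced by something other than the current model
    ON_POLICY = "on-policy"    # data sampled from the current model


class Paradigm(Enum):
    IMITATION = "imitation"
    PREFERENCE = "preference"
    REINFORCEMENT = "reinforcement"


@dataclass(frozen=True)  # frozen: a preset's fields cannot be toggled after creation
class MethodPreset:
    name: str
    source: DataSource
    paradigm: Paradigm


# Assumption: each method in its most common form; variants exist.
PRESETS = {
    "SFT": MethodPreset("SFT", DataSource.OFF_POLICY, Paradigm.IMITATION),
    "DPO": MethodPreset("DPO", DataSource.OFF_POLICY, Paradigm.PREFERENCE),
    "PPO": MethodPreset("PPO", DataSource.ON_POLICY, Paradigm.REINFORCEMENT),
}


def lookup(name: str) -> MethodPreset:
    """Return a named preset. There is deliberately no way to mix axes freely."""
    return PRESETS[name]
```

The design choice is the point: you select a method by name and read off where it sits on the axes, rather than constructing a method by picking one value per axis.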

The one line to anchor on

Every method is the same move — push probability mass toward good behavior — differing only on whose distribution the data comes from (off- vs on-policy) and whether you also learn from failures.

Keep that; the rest is detail.
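
The anchor line can be sketched on a toy policy. This is an illustration under stated assumptions, not any particular method from the manual: the policy is a softmax over three actions, `update` is a hypothetical helper that takes one gradient-ascent step on the weighted log-probability of the sampled actions, and the weights (+1 for a success, -1 for a failure) are made-up stand-ins for whatever signal a real method uses.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def update(logits, samples, lr=0.5):
    """One gradient-ascent step on sum_i w_i * log p(a_i).

    samples: list of (action_index, weight). The update rule is identical
    whether the samples came from the current policy (on-policy) or from
    somewhere else (off-policy); only the origin of `samples` differs.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for action, weight in samples:
        for k in range(len(logits)):
            indicator = 1.0 if k == action else 0.0
            # d/d logit_k of log p(action) = indicator - p_k
            grad[k] += weight * (indicator - probs[k])
    return [z + lr * g for z, g in zip(logits, grad)]


start = [0.0, 0.0, 0.0]  # uniform policy over three actions

# Learn from a success only: action 0 was good.
after_success_only = softmax(update(start, [(0, 1.0)]))

# Also learn from a failure: action 2 was bad, so it gets a negative weight.
after_with_failures = softmax(update(start, [(0, 1.0), (2, -1.0)]))
```

In both cases the probability of action 0 rises above its starting value of 1/3. Adding the negatively weighted failure also pushes the probability of action 2 lower than the success-only update does, which is what "also learn from failures" buys you in this toy setting.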