Introduction
This is a living field manual for post-training an agent, written for an engineer who wants to understand the material first and then apply it, rather than experiment blindly. It grows: each session adds or sharpens a chapter.
The problem it exists to answer
~1,000 CTF challenges. The agent solves ~100–200 and fails the remaining ~800–900. We have the harness, the workflow, the budget, and the backing to do anything. The bottleneck is not data: it’s knowing which fine-tuning method we want, and therefore what data to build, and why.
Everything here builds toward answering that from first principles.
How to read it
- Want the feel first? → The 5-minute journey — the interactive version, embedded.
- Want the theory? → start at Foundations and go in order.
- Want the answer for your case? → jump to Method → Data and The decision.
- Want to know if you actually have an execution gap, and which RL technique fixes it? → Diagnosing the gap, then the RL sweep in RL that creates value.
What’s canonical vs. what’s a teaching scaffold (read this once)
Being honest about provenance, because you’re becoming a researcher and the distinction matters:
- Canonical, universal, you’ll find it in any RL/post-training text: the on-policy vs. off-policy axis, and the three learning paradigms (imitation / preference / reinforcement). These are load-bearing and not up for debate.
- My teaching scaffold: any packaging that presents these as “N knobs you freely toggle.” The axes describe methods; they are not independent dials you combine. Each named method is a fixed preset. An earlier interactive matrix implied free combination and produced nonsense for some combinations. That was the scaffold over-reaching. Corrected here: learn the one axis (on/off-policy) plus the fixed method presets, not a combinatorial grid.
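One way to picture "fixed presets, not free dials" is a lookup keyed by method name, where each name carries its own position on the axes. This is only a sketch: the type names are hypothetical, the three methods (SFT, DPO, PPO) are my illustrative picks rather than a list from this manual, and the classification reflects each method's common default setup, which is an assumption and not exhaustive.

```python
from enum import Enum
from typing import NamedTuple


class DataSource(Enum):
    OFF_POLICY = "off-policy"  # data sampled from some other distribution
    ON_POLICY = "on-policy"    # data sampled from the current policy


class Paradigm(Enum):
    IMITATION = "imitation"
    PREFERENCE = "preference"
    REINFORCEMENT = "reinforcement"


class Preset(NamedTuple):
    data_source: DataSource
    paradigm: Paradigm


# Illustrative and non-exhaustive (assumption): each method name fixes both
# descriptors at once, so you pick a method, not a pair of independent dials.
PRESETS = {
    "SFT": Preset(DataSource.OFF_POLICY, Paradigm.IMITATION),
    "DPO": Preset(DataSource.OFF_POLICY, Paradigm.PREFERENCE),
    "PPO": Preset(DataSource.ON_POLICY, Paradigm.REINFORCEMENT),
}


def describe(method: str) -> str:
    """Return a one-line description of a method's fixed preset."""
    preset = PRESETS[method]
    return f"{method}: {preset.data_source.value}, {preset.paradigm.value}"


for name in PRESETS:
    print(describe(name))
```

The point of the design is that the only valid input is a method name; there is no function that takes a data source and a paradigm and builds a method from them.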
The one line to anchor on
Every method is the same move — push probability mass toward good behavior — differing only on whose distribution the data comes from (off- vs on-policy) and whether you also learn from failures.
Keep that; the rest is detail.
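To make the anchor line concrete, here is a toy sketch under stated assumptions: a three-action softmax policy, one gradient step per example, and a hypothetical reward of +1 for the good action and -1 otherwise. None of these names or numbers come from the manual's harness. The same update function serves both cases; only the source of the sample and the sign of its weight change.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def update(logits, action, weight, lr=0.5):
    """One gradient-ascent step on weight * log pi(action).

    For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - pi(k).
    """
    probs = softmax(logits)
    return [
        logit + lr * weight * ((1.0 if k == action else 0.0) - p)
        for k, (logit, p) in enumerate(zip(logits, probs))
    ]


GOOD = 0                   # hypothetical index of the "good" action
start = [0.0, 0.0, 0.0]    # uniform policy over three toy actions

# Off-policy imitation: the sample comes from someone else's distribution
# (a demonstration of the good action) and always gets weight +1.
after_demo = update(start, action=GOOD, weight=1.0)

# On-policy with failures: the sample is drawn from the policy itself and
# scored by a hypothetical reward; a failure gets a negative weight.
rng = random.Random(0)
sampled = rng.choices(range(3), weights=softmax(start))[0]
reward = 1.0 if sampled == GOOD else -1.0
after_rollout = update(start, action=sampled, weight=reward)

print("start:        ", softmax(start))
print("after demo:   ", softmax(after_demo))
print("after rollout:", softmax(after_rollout))
```

In this toy setup, both updates leave the good action with more than its starting one-third share: the demonstration by raising it directly, the rollout either by raising it (if it was sampled) or by pushing mass away from the failed action.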