Introduction

This is a living field manual for post-training an agent, written for an engineer who wants to understand a method first and then apply it, rather than experiment blindly. It grows: each session adds or sharpens a chapter.

The problem it exists to answer

~1,000 CTF challenges. The agent solves ~100–200 and fails the remaining ~800–900. We have the harness, the workflow, the budget, and the backing to do anything. The bottleneck is not data; it is knowing which fine-tuning method we want, and therefore what data to build and why.

Everything here builds toward answering that from first principles.

How to read it

Want the feel first? → The 5-minute journey — the interactive version, embedded.
Want the theory? → start at Foundations and go in order.
Want the answer for your case? → jump to Method → Data and The decision.
Want to know if you actually have an execution gap, and which RL technique fixes it? → Diagnosing the gap, then the RL sweep at RL that creates value.

What’s canonical vs. what’s a teaching scaffold (read this once)

Being honest about provenance, because you’re becoming a researcher and the distinction matters:

Canonical, universal, you’ll find it in any RL/post-training text: the on-policy vs. off-policy axis, and the three learning paradigms (imitation / preference / reinforcement). These are load-bearing and not up for debate.
My teaching scaffold: any packaging that presents these as “N knobs you freely toggle.” The axes describe methods; they are not independent dials you combine, because each named method is a fixed preset. An earlier interactive matrix implied free combination and produced nonsense for some combinations. That was the scaffold over-reaching. Corrected here: learn the one axis (on/off-policy) plus the fixed method presets, not a combinatorial grid.
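One way to picture "fixed presets, not free dials" is a small lookup table. The sketch below is illustrative only: the three method names (SFT, DPO, PPO-style RL) and their placement on the axes are common textbook classifications assumed here, not a list taken from this manual, and `MethodPreset` is a hypothetical helper type.

```python
from dataclasses import dataclass
from itertools import product

DATA_SOURCES = ("off-policy", "on-policy")
PARADIGMS = ("imitation", "preference", "reinforcement")


@dataclass(frozen=True)  # frozen: a preset's axis values cannot be toggled
class MethodPreset:
    name: str
    data_source: str
    paradigm: str


# Illustrative, non-exhaustive presets (assumed examples, see lead-in).
PRESETS = (
    MethodPreset("SFT on demonstrations", "off-policy", "imitation"),
    MethodPreset("DPO on a fixed preference set", "off-policy", "preference"),
    MethodPreset("PPO-style policy-gradient RL", "on-policy", "reinforcement"),
)

# The combinatorial grid a "free dials" reading would suggest.
grid_cells = set(product(DATA_SOURCES, PARADIGMS))

# The cells these presets actually occupy.
used_cells = {(p.data_source, p.paradigm) for p in PRESETS}

# Cells the grid offers that no preset in this list fills.
unused_cells = grid_cells - used_cells

if __name__ == "__main__":
    print(f"grid cells: {len(grid_cells)}, preset cells: {len(used_cells)}")
    for cell in sorted(unused_cells):
        print("no preset listed for:", cell)
```

The unfilled cells are not claimed to be meaningless. The point is only that a cell existing in the grid does not mean a sensible method lives there; that has to be argued method by method.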

The one line to anchor on

Every method makes the same move, pushing probability mass toward good behavior; methods differ only in whose distribution the data comes from (off- vs. on-policy) and whether you also learn from failures.

Keep that; the rest is detail.
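The anchor line can be made concrete with a toy. The sketch below assumes a four-action softmax policy in which one hypothetical action (`GOOD`) is the only one that solves the task; all names, the learning rate, and the step counts are illustrative assumptions. Every variant applies the same update, a weighted step on the log-probability of an action, and differs only in where the action comes from and what weight a failure receives.

```python
import math
import random

GOOD = 2  # hypothetical: the only action that solves the toy task


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def update(logits, action, weight, lr=0.5):
    """One gradient step on weight * log pi(action).

    For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - pi(k).
    """
    probs = softmax(logits)
    return [
        logit + lr * weight * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]


def off_policy_imitation(logits, steps=50):
    # Data comes from someone else's distribution: a demonstrator
    # that always takes the good action. Weight is always +1.
    for _ in range(steps):
        logits = update(logits, GOOD, weight=1.0)
    return logits


def on_policy_reinforce(logits, steps=200, learn_from_failures=True, seed=0):
    # Data comes from the policy's own samples.
    rng = random.Random(seed)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(logits)), weights=probs)[0]
        reward = 1.0 if action == GOOD else 0.0
        if learn_from_failures:
            # Baseline = expected reward under the current policy, so a
            # failed sample gets a negative weight and is pushed down.
            weight = reward - probs[GOOD]
        else:
            # Failures get weight 0 and contribute nothing.
            weight = reward
        logits = update(logits, action, weight)
    return logits


if __name__ == "__main__":
    start = [0.0, 0.0, 0.0, 0.0]
    print("imitation:", softmax(off_policy_imitation(start))[GOOD])
    print("on-policy, with failures:", softmax(on_policy_reinforce(start))[GOOD])
    print(
        "on-policy, successes only:",
        softmax(on_policy_reinforce(start, learn_from_failures=False))[GOOD],
    )
```

In this toy all three routes end up favoring `GOOD`, which is the point: the update rule is shared, and the choices that distinguish methods are the data source and the treatment of failures. It says nothing about which choice works best on a real agent.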
