Before you train — instrumentation & data readiness

This chapter is a prerequisite, not a fix. “Diagnosing the gap” gives you the routing test (knowledge vs. execution vs. exploration); “From behavioral audit to training signal” maps observed failure shapes to methods. Both assume you can already say which stage of a run failed. Today, for this project’s own harness, you can’t. The telemetry is not missing; nobody has written the small amount of code that turns the existing telemetry into a stage verdict. This chapter answers “what do I already have, and what’s the minimal thing to build first,” grounded entirely in the repo and vault as read on 2026-07-02, not in aspiration.

Everything below is a recommendation for main to implement — no code was written, no repo file was touched. Confidence is stated per claim; where I could not confirm something, I say so.

1. Why per-stage signal is the prerequisite for the whole diagnosis

The diagnosis framework’s routing test — does the correct action ever appear at high N, and does RL (not matched SFT) expand it? — is defined per challenge or per challenge-subtype, not per portfolio. Run it on the portfolio in aggregate and you get one number that averages over four structurally different failure modes (F1 exploration / F2 skill / F3 tool-use / F4 long-horizon), which is exactly the “collapsing a split verdict into one sentence” anti-pattern Diagnosing the gap §0 warns against. You cannot segment a portfolio you cannot localize. Per-stage signal is what turns “39% pass@1, up from 26%” into “F1 dropped from 40%→12% of failures, F2 is now the dominant failure mode” — the only shape of finding that tells you which lever (SFT curriculum, tool-use reward, GRPO exploration term) to pull next.

flowchart LR
    A["events.jsonl<br/>(already emitted)"] --> B["stage extractor<br/>(NOT built)"]
    C["per-challenge manifest<br/>(NOT built)"] --> B
    B --> D["F1-F4 attribution<br/>per run"]
    D --> E["Diagnosis framework<br/>routing test, per subtype"]
    E --> F["method choice:<br/>SFT curriculum / tool reward / GRPO term"]
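
A minimal sketch of the segmentation step at the end of that pipeline, in Python. It assumes a hypothetical per-run record with `challenge`, `solved`, and `failure_mode` fields (the output a stage extractor would produce once it exists); none of these names come from the repo, and the sample records are illustrative, not measured data.

```python
from collections import Counter, defaultdict

def failure_shares(runs):
    """Share of each failure mode among failed runs (solved runs are excluded)."""
    failed = [r for r in runs if not r["solved"]]
    counts = Counter(r["failure_mode"] for r in failed)
    total = sum(counts.values())
    return {mode: n / total for mode, n in counts.items()} if total else {}

def shares_by_challenge(runs):
    """Same breakdown, segmented per challenge instead of portfolio-wide."""
    grouped = defaultdict(list)
    for r in runs:
        grouped[r["challenge"]].append(r)
    return {ch: failure_shares(rs) for ch, rs in grouped.items()}

# Hypothetical records, for illustration only.
runs = [
    {"challenge": "A", "solved": False, "failure_mode": "F1"},
    {"challenge": "A", "solved": False, "failure_mode": "F1"},
    {"challenge": "B", "solved": False, "failure_mode": "F2"},
    {"challenge": "B", "solved": True, "failure_mode": None},
]
```

The portfolio-wide call blends the two challenges into one distribution, while `shares_by_challenge` shows that challenge A fails purely on F1 and challenge B purely on F2, which is the split the routing test needs.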

2. What the harness already emits — confirmed, source-read

Source: go/libs/agent/events/events.go (schema), go/libs/secagent/runner.go (wiring), doc of record lessons/security-agent/harness-observability-contract-2026-06.md. Every event is Turn-indexed and tool_call/tool_result pairs join on tool_call_id — this is the load-bearing fact that makes any stage-localization script possible with zero changes to the agent loop.

Event	What it carries	Stage-attribution value
`meta` / `tool_schemas` / `system_prompt` / `user_message` (preamble)	trace_id, model, action space, task string	Run identity; needed to join against a per-challenge manifest
`agent_start`	`{input}`	“Did the agent do anything” — trivially free
`turn_start`	`{history_len}`	Per-turn anchor
`llm_response`	role/content/reasoning, `tool_calls[]`, token usage, TTFT/latency, `finish_reason`	Reasoning text — usable for context, never as stage-reached evidence (confabulation risk)
`tool_call`	`{tool_call_id, name, args}`	The command actually issued — `args` is a raw string (bash/curl/etc.), not a structured HTTP call
`tool_result`	`{tool_call_id, name, output, tool_exec_ms, error?}`	The real, server-observed response text — this is the only tier that counts as ground truth
`agent_finish`	`stop_reason`, `turns`/`max_turns`, `finish_reason`	Distinguishes “ran out of budget” (`max_turns`) from “gave up” (`stop`) from silent truncation (`stop_reason=stop` but `finish_reason=length` — don’t trust `stop_reason` alone)
`flag_scan`	`{solved, primary?, retrieved[], findings[{flag, origin, first_event, first_turn?}]}`	Terminal outcome, provenance-classified (`retrieved`/`echoed`/`model_claim`)
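
The `tool_call_id` join can be sketched in a few lines of Python. The payload field names (`tool_call_id`, `name`, `args`, `output`, `error`) are taken from the table above; the line envelope `{"type", "turn", "data"}` is an assumption made for illustration and should be checked against events.go before reuse.

```python
import json

def pair_tool_events(lines):
    """Join tool_call and tool_result events on tool_call_id.

    ASSUMPTION: each line is a JSON object shaped like
    {"type": ..., "turn": ..., "data": {...}}; the real envelope may differ.
    Returns (pairs, orphan_call_ids).
    """
    calls, pairs = {}, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ev = json.loads(line)
        data = ev.get("data", {})
        if ev.get("type") == "tool_call":
            calls[data["tool_call_id"]] = (ev.get("turn"), data)
        elif ev.get("type") == "tool_result":
            turn, call = calls.pop(data["tool_call_id"], (None, None))
            if call is None:
                continue  # result without a recorded call: skip, do not guess
            pairs.append({
                "turn": turn,
                "name": call["name"],
                "args": call["args"],  # raw command string
                "output": data.get("output", ""),
                "error": data.get("error"),
            })
    return pairs, list(calls)  # leftover ids are calls that never got a result
```

Returning the orphaned call ids separately keeps truncated runs visible instead of silently dropping their last command.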

Tool surface bounding what a “stage” can even look like: five tools total (bash, read_file, write_file, update_file, WebSearch — confirmed go/libs/sectools/sandbox.go, no WebFetch, contra two now-stale lessons). All HTTP interaction with the target is embedded in free-text bash commands and free-text bash output — there is no structured http_request tool call with a machine-readable status/URL/body. This is the single biggest reason “endpoint discovered” and “vuln identified” are not already derivable cleanly — any stage parser must regex/parse free text, not read a field.
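
To make the free-text problem concrete, here is a best-effort Python sketch that pulls (method, path) pairs out of a bash command string. It is an illustration under stated assumptions, not the extractor the harness needs: it only understands curl-style flags (`-X`/`--request`, `-d`/`--data*`/`--json`), applies one method to every URL in the command, and will misread chained commands, heredocs, and non-curl clients.

```python
import re
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://[^\s'\"|;&<>]+")
METHOD_RE = re.compile(r"(?:-X|--request)\s*=?\s*['\"]?([A-Za-z]+)")
BODY_RE = re.compile(r"(?:^|\s)(?:-d|--data(?:-raw|-binary|-urlencode)?|--json)\b")

def extract_requests(bash_args):
    """Best-effort (method, path) pairs from a free-text bash command string."""
    explicit = METHOD_RE.search(bash_args)
    if explicit:
        method = explicit.group(1).upper()
    else:
        # curl switches to POST when a body flag is present and no method is given
        method = "POST" if BODY_RE.search(bash_args) else "GET"
    requests = []
    for url in URL_RE.findall(bash_args):
        path = urlsplit(url).path or "/"
        requests.append((method, path))
    return requests
```

Every heuristic here is a place where a field would have been exact, which is why the output of such a parser should be validated against hand-read traces before any F1 number depends on it.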

Confidence: high (direct source read, 2026-07-02). Full detail: artifacts/overnight-instrumentation/research/harness-signals.md.

3. Per-stage ground-truth verifier design

3.1 What already satisfies “ground-truth-verified, never transcript-matched”

The project’s non-negotiable rule — reward from real environment/tool-output state, never format/regex-matched on the transcript — is already met at the terminal (flag) stage and nowhere else:

go/libs/agent/events/flagscan.go (ScanFlags) classifies every FLAG{...} sighting as retrieved (in a tool_result, absent from that call’s own args) / echoed (present in both) / model_claim (only in model text). This is a provenance signal — it proves the string came back from the sandbox, not the model’s mouth — but is not a byte-compare against the real flag. A retrieved flag from a decoy or off-target leak still reads solved:true.
The actual byte-compare (flag_verified, the project’s own term) lives outside the agent runtime: benchmark/flags/pd26_flags.current.json (10 live flags) + benchmark/verification/PD26-NN/exploit/solve.py (held-out reference solvers). Verifying against these today is a manual, SSH-gated, human-authorized step (lessons/evals/verifying-agentic-security-runs.md) — not wired into the harness, pdq, or trace-verify as an automatic per-run check. grep -rln "flag_verified" --include="*.go" returns zero hits — confirmed absence, not a naming mismatch.
Documented failure modes any pre-terminal verifier must not reproduce: model-claim fabrication after repeated 404s (lessons/security-agent/flag-detection-false-positives.md, 6/17 fine-tune “solves” were exactly this); a hardcoded flag-format regex producing false negatives on an off-roster 24-hex flag (lessons/evals/gym-challenge-flag-format-breadcrumb-false-negative.md); an 18-point proxy-vs-verified inflation on the fine-tune’s own leaderboard (lessons/evals/ctf-flag-verification-and-proxy-pitfall.md).
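
The provenance rule described above can be restated as a short Python sketch. This is a paraphrase of the documented behavior, not a port of `ScanFlags`: the precedence order when one flag is seen in several places, the input shapes, and the `FLAG{...}` regex are all assumptions, and the regex deliberately inherits the false-negative risk on off-roster flag formats noted above.

```python
import re

FLAG_RE = re.compile(r"FLAG\{[^}]*\}")  # illustrative format only
RANK = {"model_claim": 0, "echoed": 1, "retrieved": 2}

def classify_flags(pairs, model_texts):
    """Map each flag string to its strongest observed provenance.

    pairs: dicts with 'args' and 'output' (one per tool_call/tool_result pair).
    model_texts: model-authored strings (content or reasoning).
    """
    origin = {}

    def note(flag, label):
        # keep the strongest label seen so far (assumed precedence)
        if RANK[label] > RANK.get(origin.get(flag), -1):
            origin[flag] = label

    for p in pairs:
        for flag in FLAG_RE.findall(p["output"]):
            # a flag the command itself contained was echoed, not retrieved
            note(flag, "echoed" if flag in p["args"] else "retrieved")
    for text in model_texts:
        for flag in FLAG_RE.findall(text):
            note(flag, "model_claim")
    return origin
```

Nothing in this function consults the real flag, which is the point made above: `retrieved` says where a string came from, not whether it is correct.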

3.2 The proposed per-stage predicate design

Phase names follow the already-designed (if unused) ptes schema in .claude/rules/challenges.md (recon → enumeration → detection → exploitation → lateral), mapped onto the brief’s F1–F4 taxonomy. The mechanism is a direct reuse of flagscan.go’s proven shape: a pure, read-only, post-hoc scan over events.jsonl — no new sandbox instrumentation, no changes to secagent’s execution path.

Stage	Maps to	What real state proves it	How checkable	Robustness
recon	pre-F1	A request reached a known recon surface and got a response	`tool_result` exists for a `tool_call` whose `args` path matches a per-challenge recon-surface allowlist	Cheap + robust
enumeration	F1 (never finds vuln endpoint)	A request’s method+path matched the vuln-bearing route, regardless of payload correctness	`tool_call.args` path/method vs. a per-challenge allowlist lifted from `solve.py`	Cheap + robust — automates the existing by-hand method in `lessons/evals/wall-attribution-discovery-vs-exploit-fail.md`
detection	F1/F2 boundary	Response shows diagnostic evidence of the specific bug class (error, type-confusion tell, introspection leak)	`tool_result.output` vs. a per-challenge, bug-class-specific signature	Hard/ambiguous — bug-class-specific; recommend optional/best-effort in v1, fold into “enumeration reached, exploitation not yet” if no clean signature
exploitation	F2 (finds, can’t exploit)	The payload actually worked — server-side artifact only possible on success (token, row leak, shell banner)	`tool_result.output` vs. the exact success predicate already written in that challenge’s `solve.py` (verbatim reuse — this IS the ground-truth oracle)	Cheap + robust when the exploit yields one identifiable artifact; coarser (any-200-on-payload) proxy where it doesn’t
lateral / flag	F4 + terminal	Second request in a bypass→flag-read chain returned the flag	`flag_scan.retrieved`, verbatim	Already built — zero new work
(cross-cutting, not a stage)	F3 (clumsy tool-use)	Tool-call diversity / pro-grade tool vs. improvised shell one-liner	Count distinct `tool_call.name`, or classify `args` against a pro-tool allowlist	Does not fit the stage ladder — log as a separate metric, do not fold into a potential function (see §3.3)
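
A minimal Python sketch of the ladder above, assuming paired tool events are already available as dicts with `turn`, `args`, and `output`. The oracle below is entirely hypothetical (an imaginary challenge; no path or signature is lifted from a real solve.py), the optional detection stage is omitted, and the mapping from best stage to F1/F2/F4 is one reading of the table, not a settled definition.

```python
import re

# HYPOTHETICAL oracle: every path and signature is made up for illustration.
ORACLE = {
    "recon": {"args_regex": r"/api/health\b"},
    "enumeration": {"args_regex": r"/api/login\b"},
    "exploitation": {"args_regex": r"/api/login\b", "output_regex": r'"token"\s*:'},
}

def scan_stages(pairs, oracle, flag_retrieved=False):
    """Return {stage: first_turn} for each stage with tool-output evidence."""
    reached = {}
    for p in pairs:
        for stage, pred in oracle.items():
            if stage in reached:
                continue
            if not re.search(pred["args_regex"], p["args"]):
                continue
            out_re = pred.get("output_regex")
            if out_re and not re.search(out_re, p["output"]):
                continue
            reached[stage] = p["turn"]
    if flag_retrieved:
        reached["lateral"] = None  # taken from flag_scan, not re-derived here
    return reached

def attribute(reached):
    """Assumed mapping from the stage vector to a failure label."""
    if "lateral" in reached:
        return "solved"
    if "exploitation" in reached:
        return "F4"  # exploit worked, chain to the flag did not complete
    if "enumeration" in reached:
        return "F2"  # found the vuln route, could not exploit it
    return "F1"      # never touched the vuln-bearing route
```

Only `args` and `output` are read; model text never enters the predicate, which keeps the scan on the tool-output side of the ground-truth rule.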

Two things this design deliberately does NOT do, because both violate the project’s own reward rule:

Does not reuse the .claude/rules/challenges.md ptes matcher/triage-subagent mechanism. That mechanism is explicitly an LLM judge (“a span satisfies a matcher when the description’s intent is met, not merely when the regex matches”) — a level-3 rubric on the reward-gameability ladder (lessons/post-training/reward-signal-types-and-gameability-ladder.md), and per lessons/challenges/pd-challenge-file-anatomy.md, none of the 10 live PD26 challenges even carry the file this schema lives in. Use its phase names, not its judging mechanism.
Does not treat intent (the opt-in SECAGENT_CAPTURE_INTENT field) as evidence of “identified the vuln.” It’s documented observability metadata, low-faithfulness, off by default, and must be stripped before training use (lessons/security-agent/bash-intent-observability-field.md).

3.3 Potential-based shaping — the caveats if this ever feeds RL reward

If a stage-scan result is ever turned into dense RL signal (rather than just a diagnostic readout), the only shaping form proven not to change the optimal policy is F(s,a,s') = γΦ(s') − Φ(s) for any potential function Φ — Ng, Harada & Russell, “Policy Invariance Under Reward Transformations,” ICML 1999 (no arXiv id — this predates arXiv’s routine ML use; ACM DL 10.5555/645528.657613, verified live 2026-07-02). Two subtleties are easy to get wrong:

Φ must be a monotone “best stage reached so far” running max, not the instantaneous current-turn stage — otherwise re-triggering an already-reached signature, or a later turn’s evidence going quiet because the agent moved on, can pay a spurious negative shaping reward for forward motion.
Φ must be defined identically across every terminal branch (stop_reason ∈ {stop, max_turns, error}, all real values in this harness) — otherwise the invariance proof breaks across the different termination paths this project’s variable-length episodes actually produce. Recommended sidestep: apply shaping only over non-terminal transitions; let R_terminal (= flag_scan.Solved, unchanged) carry all outcome signal at the very last transition.
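
The running-max point is easiest to see with numbers. The Python sketch below computes F = γΦ(s') − Φ(s) over non-terminal transitions only, following the sidestep above. Using the stage index itself as Φ is an illustrative assumption, and γ = 1.0 is chosen only to keep the arithmetic readable.

```python
def shaping_terms(stage_per_turn, gamma=1.0, running_max=True):
    """Per-transition shaping terms gamma * Phi(s') - Phi(s).

    stage_per_turn: integer stage index with evidence at each turn
    (0 = nothing, 1 = recon, 2 = enumeration, ...).
    Phi is the stage index (illustrative choice), optionally as a running max.
    """
    phi, best = [], 0
    for stage in stage_per_turn:
        best = max(best, stage)
        phi.append(best if running_max else stage)
    return [gamma * phi[t + 1] - phi[t] for t in range(len(phi) - 1)]

# Evidence goes quiet at turn 4 (the agent moved on), then stage 3 is reached.
trace = [0, 1, 2, 0, 3]
instantaneous = shaping_terms(trace, running_max=False)  # [1, 1, -2, 3]
monotone = shaping_terms(trace, running_max=True)        # [1, 1, 0, 1]
```

Both variants sum to the same total at γ = 1, but the instantaneous version pays −2 on the turn where evidence merely went quiet, which is the spurious penalty described above.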

Domain-adjacent SOTA, verified live 2026-07-02: none ticks all three boxes (cybersec-specific + proven invariant + validated against a ground-truth terminal verifier), so this remains an unfilled niche, not a solved-elsewhere problem. Two of the rows below (Pentest-R1, DRLRM-PT) are academic cybersecurity-LLM training papers; per this project’s standing rule they are cited for context only, not as a basis for any claim, recipe, or verdict here, and neither produced a frontier model. The recommendation that follows the table rests on the Ng/Harada/Russell invariance theorem and this project’s own gameability-ladder lesson, not on either paper.

Paper	arXiv	Relevance	Confidence
TIPS — turn-level potential shaping for search-augmented LLMs	2603.22293	Shaping machinery is directly on-point; domain (search-QA) is not	0 citations, brand-new — promising, not validated
ToolRL — reward design for tool-use RL	2504.13958	Closest prior art on reward granularity/timing for tool-use GRPO; not potential-based	1 citation
Pentest-R1 — two-stage RL for autonomous pentesting (academic cybersecurity-LLM training paper — cited for context, not a basis for our decisions)	2508.07382	Domain topic overlaps — a per-step reward in an interactive CTF env (InterCode-CTF); exact shaping formula not fully verified from search highlights alone — flag as unread in full. Not used to support the recommendation below	0 citations, brand-new
DRLRM-PT — Reward Machine over kill-chain phases (academic cybersecurity-LLM training paper — cited for context, not a basis for our decisions)	2405.15908	Illustrates a non-potential-based design (flat +1/+10 phase bonuses, no `γΦ(s')−Φ(s)` structure) — cited only to warn against conflating “reward machine over phases” with “provably invariant shaping,” not as prior art we build on	Medium

Recommended default if this is built: keep the terminal flag reward and the dense stage-shaping term as two separate additive components, never merged into one function — this is both what makes the invariance argument clean (Ng, Harada & Russell, ICML 1999, cited above) and what the project’s own gameability doctrine independently favors (decoupled dense process signal + sparse ground-truth outcome signal, lessons/post-training/reward-signal-types-and-gameability-ladder.md).

Confidence: high on the harness-reuse mechanism and on the Ng et al. invariance result itself (published 1999, well established). Medium on the “running-max Φ” / “terminal-consistency” recommendations, which are applied reasoning from the theorem plus this project’s variable-horizon episode shape, not lifted from a paper. Full detail: artifacts/overnight-instrumentation/research/staged-verifier-design.md.

4. What training/eval data we already have, per candidate move

Scope: benchmark/ (repo, git-tracked) + ~/security-agent-qwen/ (untracked local run-artifact directory holding the actual trajectory takeouts, partially mirrored to s3://llmresearch-data/). More than 1,200 individual agent trajectories across roughly 470 distinct challenge definitions already exist, so for three of the four candidate moves this is a mining problem, not a collection problem.

Candidate move	Readiness	Extraction step	Sharpest gotcha
(i) Rejection-sampling SFT positive set	~185 raw solved trajectories across 5 corpora (gym263 64, gym564 39-cleaned, warpenv-broker 22, envgen 29, argus60-base 29); one prior SFT (`cybersec-qwen36-traj-ep2` / `pd-v5-qwen36-ft`) already built this way	Follow `lessons/post-training/verified-trajectory-synthesis-recipe.md` verbatim: verifier-accepted terminal only → replay-reproduce → dedup → decontaminate → Thought/Action/Observation with Observation loss-masked	Confirmed, not hypothetical: the prior SFT fabricated flags on 6/17 claimed solves (`lessons/post-training/sft-induced-flag-confabulation.md`) — naive success-folder collection teaches success-shape, not success
(ii) KTO/DPO pairs	KTO-native data is ready today for free — every `success/`/`failed/` split is an unpaired good/bad label (mechanical, zero judgment calls). True DPO (same-decision-point divergent pairs) needs k≥2 same-challenge same-model runs — only the PD26 canonical sweep (k=5, 10 challenges) has this; the larger gym pools are k=1	Label KTO now; if DPO is wanted, mine the PD26 k=5 sweep, don’t re-sweep the gym pools	Don’t default to DPO just because solve/fail piles exist — `lessons/post-training/dpo-kto-for-agent-tool-selection.md`’s escalation ladder (fix tool description → prompt guidance → action-space → better base → SFT → DPO/KTO → GRPO) should gate the choice first
(iii) Per-stage eval (F1–F4)	This is the actual gap. A PTES-phase tagger exists as a named, working concept — but on a different, sibling corpus (the `pr` seat’s BSides-LV-2026 audit: 189 Claude-model runs, Neo/XBOW-bench, not this project’s qwen/deepseek/glm/xai/gemini roster), and it is not yet open-sourced or present anywhere in this repo	Build a lightweight classifier over the existing `tool_call`/`tool_result` stream, scoped to the PD26/gym corpus specifically — materially smaller than the sibling talk’s full framework (no tool-tier/contamination/recovery-shape analyzers needed for internal F1–F4 attribution)	Don’t import the BSides talk’s headline numbers (“82% pivot rate,” “62% stall in exploitation”) as if they describe this project’s own models — they describe Claude models on a different bench
(iv) Credit-assignment traces	Best-instrumented axis in the inventory — every run in every corpus carries the full per-turn event stream, confirmed live on a real sample (190-line `events.jsonl`, 63 tool_call/tool_result pairs, 29 turns)	`tool_call_id`-paired parsing is a solved extraction problem (`events.ScanFlags` already demonstrates the pattern in Go)	Turn-level “did this action retrieve the flag” ≠ “was this turn part of a coherent minimal solve path” — the replay-reproduce check proves an action sequence is causal, not that recorded `Thought` spans are faithful narration (a distinct, unaddressed reasoning-distillation risk)
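
Row (ii) is mechanical enough to sketch in Python. The record fields (`run_id`, `challenge`, `model`, `solved`) are hypothetical stand-ins for whatever the `success/`/`failed/` directory walk yields. Note the limit: `dpo_candidates` only finds groups where pairing is possible at all; locating the same-decision-point divergence inside two trajectories is a separate step not shown here.

```python
from collections import defaultdict

def kto_labels(runs):
    """Unpaired labels: each run is one (run_id, desirable) example."""
    return [(r["run_id"], bool(r["solved"])) for r in runs]

def dpo_candidates(runs):
    """(challenge, model) groups holding at least one solve and one failure.

    Groups with k=1, or with k>=2 runs that all share one outcome,
    cannot yield a chosen/rejected pair and are dropped.
    """
    groups = defaultdict(list)
    for r in runs:
        groups[(r["challenge"], r["model"])].append(r)
    out = {}
    for key, rs in groups.items():
        wins = [r["run_id"] for r in rs if r["solved"]]
        losses = [r["run_id"] for r in rs if not r["solved"]]
        if wins and losses:
            out[key] = {"chosen": wins, "rejected": losses}
    return out

# Hypothetical run records, for illustration only.
runs = [
    {"run_id": "r1", "challenge": "PD26-02", "model": "m1", "solved": True},
    {"run_id": "r2", "challenge": "PD26-02", "model": "m1", "solved": False},
    {"run_id": "r3", "challenge": "gym-0001", "model": "m1", "solved": False},
]
```

Every run gets a KTO label, while only the multi-run group survives as a DPO candidate, mirroring the k=5 sweep versus k=1 pool distinction in the table.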

Cross-cutting gotchas that apply to all four moves:

Ground truth exists for only 2 of 7 corpora (PD26, argus60/APEX). gym263/gym564/warpenv/envgen have no held-out flag file — their “solved” signal is the harness’s own retrieved classification, a weaker epistemic tier than exact-match. Flag this explicitly in anything built downstream of them.
retrieved (tool_result-not-in-own-args) is a strong genuineness signal but is still transcript-level heuristic, not an out-of-band verifier query — the project rule is not automatically satisfied just because the harness flagged something retrieved.
gym564’s local raw copy (521 completed) is not the cleaned number (305) cited elsewhere in memory — reconcile against experiments/2026-06-27-gym564-archive-cleanup.md before using it quantitatively.

Confidence: high on the corpus inventory and readiness split (row-counted directly, 2026-07-02); high on the (iii) framing gap being genuine (vault-searched for “PTES”/“stage-local”/“kill chain,” found exactly one non-applicable hit). Full detail, including per-corpus row counts and S3 paths: artifacts/overnight-instrumentation/research/data-inventory.md.

5. The minimal instrumentation gap — the recommended first step

Collapsing §2–§4 into one actionable delta: the harness’s process telemetry is already complete for stage attribution. Nothing in go/libs/agent/events or go/libs/secagent/runner.go needs to change. What’s missing is entirely semantic, and it splits into two independent, additive pieces of work that sit on different sides of the seat boundary:

flowchart TD
  A["Pick ONE challenge<br/>(PD26-02 — chain already<br/>documented in lessons/challenges/<br/>pd26-02-nosqli-authbypass-chain.md)"] --> B["challenge-builder:<br/>author stage_oracle.json<br/>from solve.py — 4 predicates,<br/>editorial work, not infra"]
  A --> C["main: write stagescan.go,<br/>same shape as flagscan.go —<br/>pure io.Reader -> StageScan,<br/>no side effects"]
  B --> D["Run stagescan over existing<br/>events.jsonl from already-<br/>collected runs (benchmark/results/,<br/>~/security-agent-qwen/)"]
  C --> D
  D --> E["Validate: does the stage vector<br/>match what a human reading<br/>the same trace concludes?<br/>(same discipline as the flag-layer<br/>validation)"]
  E -->|"holds"| F["Generalize to all 10 PD26,<br/>then gym/warpenv/envgen"]
  E -->|"doesn't hold"| G["Refine predicates before<br/>trusting any F1-F4 number"]

main/harness side: a deterministic (no-LLM, no-confabulation) URL/path extractor over tool_call.args and tool_result.output, scoped to the bash tool only. Emit as a derived per-run artifact, not a new harness event; this keeps the harness itself free of challenge-specific semantics.
challenge-builder side: one stage_oracle.json sibling per challenge — one entry per PTES phase, each a deterministic predicate over (method, path_regex, status_code, body_signature), authored directly from that challenge’s own solve.py. This is editorial work (one person reads ~10 solve.py files and writes ~4 predicates each), not new infrastructure — the ground-truth reference already exists, it’s just not lifted into a machine-readable file.
Prototype on ONE challenge before committing to all 10. PD26-02 is the natural first pick — its two-step NoSQLi-authbypass chain is already fully narrated in lessons/challenges/pd26-02-nosqli-authbypass-chain.md, and solve.py’s own success checks (r.status_code == 200 and data.get("token") for the bypass; a flag regex on /api/profile for the pivot) are directly reusable as the exploitation-stage and lateral-stage predicates verbatim. Run the prototype over runs already sitting in benchmark/results/ and ~/security-agent-qwen/ — no new sweep needed to validate the mechanism.
A narrower, more concrete companion gap on the flag side: turn flag_scan.retrieved into a true flag_verified boolean via an offline exact-match diff against the already-git-tracked benchmark/flags/pd26_flags.current.json, for the 10-challenge roster only. This needs no SSH, no live secrets, and closes the one place today’s “ground-truth-verified” claim is actually a provenance proxy.
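
The flag-side companion is small enough to sketch in full, in Python. The layout of pd26_flags.current.json was not checked for this sketch: it assumes a flat JSON object mapping challenge id to flag string, and the helper names are hypothetical.

```python
import json

def load_flags(path):
    """ASSUMPTION: the file is a flat JSON object {challenge_id: flag}."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

def flag_verified(challenge_id, retrieved, known_flags):
    """Exact-match check of retrieved strings against the known flag.

    Returns True/False for on-roster challenges and None when no ground
    truth exists, so 'unverifiable' is never reported as 'wrong'.
    """
    expected = known_flags.get(challenge_id)
    if expected is None:
        return None
    return expected in retrieved  # exact string equality, no normalization
```

Keeping `None` distinct from `False` matters for the corpora without a held-out flag file: their runs stay at the `retrieved` tier instead of being counted as failed verifications.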

None of this requires RL infrastructure, a new sandbox tool, or a change to secagent’s execution path — it is a read-only scan over data that already exists, validated against traces already collected, before any GRPO/RLVR reward design depends on it.
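
The validation gate in the flowchart reduces to an agreement check between the scanner’s stage label and a human’s label for the same run. A Python sketch, assuming both are available as hypothetical `{run_id: stage}` mappings; the threshold for “holds” is a project decision and is not encoded here.

```python
def agreement(scanner_stage, human_stage):
    """Exact-match agreement on runs labeled by both sides.

    Returns (rate, disagreements); rate is None when nothing overlaps.
    """
    shared = sorted(set(scanner_stage) & set(human_stage))
    disagreements = [
        (rid, scanner_stage[rid], human_stage[rid])
        for rid in shared
        if scanner_stage[rid] != human_stage[rid]
    ]
    rate = 1 - len(disagreements) / len(shared) if shared else None
    return rate, disagreements

# Hypothetical labels, for illustration only.
scanner = {"r1": "enumeration", "r2": "exploitation", "r3": "recon", "r4": "lateral"}
human = {"r1": "enumeration", "r2": "enumeration", "r3": "recon", "r4": "lateral"}
rate, diffs = agreement(scanner, human)
```

The disagreement list is the more useful output: each entry points at a predicate to refine before any F1–F4 number is trusted.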

Cross-links

Diagnosing the gap — the routing test this chapter’s stage signal feeds.
From behavioral audit to training signal — what to do once a failure is localized to a stage.
Method → Data — the data-object framing this chapter grounds in an actual corpus.
