Before you train — instrumentation & data readiness
This chapter is a prerequisite, not a fix. Diagnosing the gap gives you the routing test (knowledge vs. execution vs. exploration); behavioral audit → training signal maps observed failure shapes to methods. Both assume you can already say which stage of a run failed. Today, for this project’s own harness, you can’t — not because the telemetry is missing, but because nobody has written the small amount of code that turns existing telemetry into a stage verdict. This chapter is the ground-truth answer to “what do I already have, and what’s the minimal thing to build first” — grounded entirely in the repo and vault as read on 2026-07-02, not in aspiration.
Everything below is a recommendation for main to implement — no code was written, no repo file was
touched. Confidence is stated per claim; where I could not confirm something, I say so.
1. Why per-stage signal is the prerequisite for the whole diagnosis
The diagnosis framework’s routing test — does the correct action ever appear at high N, and does RL (not matched SFT) expand it? — is defined per challenge or per challenge-subtype, not per portfolio. Run it on the portfolio in aggregate and you get one number that averages over four structurally different failure modes (F1 exploration / F2 skill / F3 tool-use / F4 long-horizon), which is exactly the “collapsing a split verdict into one sentence” anti-pattern Diagnosing the gap §0 warns against. You cannot segment a portfolio you cannot localize. Per-stage signal is what turns “39% pass@1, up from 26%” into “F1 dropped from 40%→12% of failures, F2 is now the dominant failure mode” — the only shape of finding that tells you which lever (SFT curriculum, tool-use reward, GRPO exploration term) to pull next.
flowchart LR
A["events.jsonl<br/>(already emitted)"] --> B["stage extractor<br/>(NOT built)"]
C["per-challenge manifest<br/>(NOT built)"] --> B
B --> D["F1-F4 attribution<br/>per run"]
D --> E["Diagnosis framework<br/>routing test, per subtype"]
E --> F["method choice:<br/>SFT curriculum / tool reward / GRPO term"]
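The per-subtype segmentation argued for in this section can be sketched in a few lines. This is a minimal illustration, assuming each run has already been given a hypothetical `subtype` and `failure` label (F1 to F4, or `None` for a solved run); the records below are toy values, not project data.

```python
from collections import Counter, defaultdict


def failure_mix(runs):
    """Return {subtype: {failure_label: share}} computed over failed runs only."""
    counts = defaultdict(Counter)
    for run in runs:
        if run["failure"] is not None:  # None marks a solved run
            counts[run["subtype"]][run["failure"]] += 1
    return {
        subtype: {label: n / sum(c.values()) for label, n in c.items()}
        for subtype, c in counts.items()
    }


# Toy records (hypothetical field names and values).
runs = [
    {"subtype": "nosqli", "failure": "F1"},
    {"subtype": "nosqli", "failure": "F2"},
    {"subtype": "nosqli", "failure": "F2"},
    {"subtype": "nosqli", "failure": None},
    {"subtype": "ssrf", "failure": "F4"},
]
mix = failure_mix(runs)
```

A single portfolio-level pass rate would hide that the two subtypes here fail for different reasons, which is the point of localizing before routing.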
2. What the harness already emits — confirmed, source-read
Source: go/libs/agent/events/events.go (schema), go/libs/secagent/runner.go (wiring), doc of record
lessons/security-agent/harness-observability-contract-2026-06.md. Every event is Turn-indexed and
tool_call/tool_result pairs join on tool_call_id — this is the load-bearing fact that makes any
stage-localization script possible with zero changes to the agent loop.
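The tool_call_id join described above can be sketched as follows. This is a minimal sketch, not the harness's own code: the JSON key names (`type`, `turn`, `tool_call_id`, `args`, `output`) are assumptions about how events.jsonl serializes each event and should be checked against events.go before use.

```python
import json


def pair_tool_calls(lines):
    """Join tool_call and tool_result events on tool_call_id.

    Returns (pairs, unmatched_calls). Key names are assumed, not confirmed.
    """
    pending, pairs = {}, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") == "tool_call":
            pending[event["tool_call_id"]] = event
        elif event.get("type") == "tool_result":
            call = pending.pop(event["tool_call_id"], None)
            if call is not None:
                pairs.append((call, event))
    return pairs, list(pending.values())


# Hypothetical sample lines in the assumed layout.
sample = [
    '{"type": "tool_call", "turn": 1, "tool_call_id": "c1", "name": "bash", "args": "id"}',
    '{"type": "tool_result", "turn": 1, "tool_call_id": "c1", "name": "bash", "output": "uid=0"}',
    '{"type": "tool_call", "turn": 2, "tool_call_id": "c2", "name": "bash", "args": "sleep 999"}',
]
pairs, unmatched = pair_tool_calls(sample)
```

Calls that never receive a result are returned separately, since a call without a server-observed response cannot count as evidence for any stage.
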
| Event | What it carries | Stage-attribution value |
|---|---|---|
| meta / tool_schemas / system_prompt / user_message (preamble) | trace_id, model, action space, task string | Run identity; needed to join against a per-challenge manifest |
| agent_start | {input} | “Did the agent do anything”: trivially free |
| turn_start | {history_len} | Per-turn anchor |
| llm_response | role/content/reasoning, tool_calls[], token usage, TTFT/latency, finish_reason | Reasoning text: usable for context, never as stage-reached evidence (confabulation risk) |
| tool_call | {tool_call_id, name, args} | The command actually issued; args is a raw string (bash/curl/etc.), not a structured HTTP call |
| tool_result | {tool_call_id, name, output, tool_exec_ms, error?} | The real, server-observed response text; this is the only tier that counts as ground truth |
| agent_finish | stop_reason, turns/max_turns, finish_reason | Distinguishes “ran out of budget” (max_turns) from “gave up” (stop) from silent truncation (stop_reason=stop but finish_reason=length; don’t trust stop_reason alone) |
| flag_scan | {solved, primary?, retrieved[], findings[{flag, origin, first_event, first_turn?}]} | Terminal outcome, provenance-classified (retrieved/echoed/model_claim) |
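The agent_finish row's three-way distinction can be written down as a small function. This is a sketch under the assumption that stop_reason and finish_reason arrive as plain strings; the returned label names are hypothetical, chosen only for this illustration.

```python
def classify_termination(stop_reason, finish_reason):
    """Map (stop_reason, finish_reason) to a coarse termination label.

    The truncation check must come before the plain "stop" check, because a
    truncated run also reports stop_reason == "stop".
    """
    if stop_reason == "max_turns":
        return "budget_exhausted"
    if stop_reason == "stop" and finish_reason == "length":
        return "silent_truncation"
    if stop_reason == "stop":
        return "gave_up"
    if stop_reason == "error":
        return "error"
    return "unknown"
```
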
Tool surface bounding what a “stage” can even look like: five tools total (bash, read_file,
write_file, update_file, WebSearch — confirmed go/libs/sectools/sandbox.go, no WebFetch, contra
two now-stale lessons). All HTTP interaction with the target is embedded in free-text bash commands and
free-text bash output — there is no structured http_request tool call with a machine-readable
status/URL/body. This is the single biggest reason “endpoint discovered” and “vuln identified” are not
already derivable cleanly — any stage parser must regex/parse free text, not read a field.
Confidence: high (direct source read, 2026-07-02). Full detail: artifacts/overnight-instrumentation/research/harness-signals.md.
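Because requests live inside free-text bash commands, a stage parser has to recover method and path by pattern matching. The following is a rough sketch of that step for curl invocations only; the regexes and the host name `target` are illustrative assumptions, and real traces will contain commands (wget, python one-liners, shell variables) that this does not handle.

```python
import re
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://[^\s'\"]+")
METHOD_RE = re.compile(r"(?:-X|--request)[ =]?\s*['\"]?([A-Za-z]+)")
DATA_RE = re.compile(r"(?:^|\s)(?:-d|--data(?:-raw|-binary|-urlencode)?)\b")


def extract_requests(command):
    """Return [(method, path)] for each URL in a curl command string."""
    if "curl" not in command:
        return []
    explicit = METHOD_RE.search(command)
    if explicit:
        method = explicit.group(1).upper()
    elif DATA_RE.search(command):
        method = "POST"  # curl switches to POST when a data flag is present
    else:
        method = "GET"
    return [(method, urlsplit(url).path or "/") for url in URL_RE.findall(command)]


login = extract_requests("curl -s -X POST http://target:8080/api/login -d '{\"user\": 1}'")
robots = extract_requests("curl -s 'http://target:8080/robots.txt' | head")
```

Keeping this step deterministic matters more than making it complete: a missed request only lowers recall, whereas a guessed one would contaminate the stage verdict.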
3. Per-stage ground-truth verifier design
3.1 What already satisfies “ground-truth-verified, never transcript-matched”
The project’s non-negotiable rule — reward from real environment/tool-output state, never format/regex-matched on the transcript — is already met at the terminal (flag) stage and nowhere else:
- go/libs/agent/events/flagscan.go (ScanFlags) classifies every FLAG{...} sighting as retrieved (in a tool_result, absent from that call’s own args) / echoed (present in both) / model_claim (only in model text). This is a provenance signal: it proves the string came back from the sandbox, not the model’s mouth, but it is not a byte-compare against the real flag. A retrieved flag from a decoy or off-target leak still reads solved:true.
- The actual byte-compare (flag_verified, the project’s own term) lives outside the agent runtime: benchmark/flags/pd26_flags.current.json (10 live flags) + benchmark/verification/PD26-NN/exploit/solve.py (held-out reference solvers). Verifying against these today is a manual, SSH-gated, human-authorized step (lessons/evals/verifying-agentic-security-runs.md); it is not wired into the harness, pdq, or trace-verify as an automatic per-run check. grep -rln "flag_verified" --include="*.go" returns zero hits: confirmed absence, not a naming mismatch.
- Documented failure modes any pre-terminal verifier must not reproduce: model-claim fabrication after repeated 404s (lessons/security-agent/flag-detection-false-positives.md, 6/17 fine-tune “solves” were exactly this); a hardcoded flag-format regex producing false negatives on an off-roster 24-hex flag (lessons/evals/gym-challenge-flag-format-breadcrumb-false-negative.md); an 18-point proxy-vs-verified inflation on the fine-tune’s own leaderboard (lessons/evals/ctf-flag-verification-and-proxy-pitfall.md).
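The three provenance tiers described above can be illustrated with a short sketch. This is not a port of ScanFlags: the FLAG{...} regex is deliberately naive (the false-negative lesson cited above is about exactly this kind of hardcoded pattern), and the rule that the strongest tier wins when a flag is seen more than once is an assumption made here for illustration.

```python
import re

FLAG_RE = re.compile(r"FLAG\{[^}]*\}")  # illustrative pattern only
RANK = {"model_claim": 0, "echoed": 1, "retrieved": 2}


def classify_flags(pairs, model_texts):
    """pairs: [(args, output)] per tool call; model_texts: [str] of model output.

    Returns {flag: origin}, keeping the strongest origin seen (assumed rule).
    """
    origin = {}

    def note(flag, label):
        if flag not in origin or RANK[label] > RANK[origin[flag]]:
            origin[flag] = label

    for args, output in pairs:
        for flag in FLAG_RE.findall(output):
            # Present in the call's own args means the agent supplied it itself.
            note(flag, "echoed" if flag in args else "retrieved")
    for text in model_texts:
        for flag in FLAG_RE.findall(text):
            note(flag, "model_claim")
    return origin


origins = classify_flags(
    [("cat /flag.txt", "FLAG{abc}"), ("echo FLAG{fake}", "FLAG{fake}")],
    ["I found FLAG{made_up}"],
)
```

Note that nothing in this sketch compares against a known-good flag, which is why a retrieved label remains a provenance signal rather than a verification.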
3.2 The proposed per-stage predicate design
Phase names follow the already-designed (if unused) ptes schema in .claude/rules/challenges.md
(recon → enumeration → detection → exploitation → lateral), mapped onto the brief’s F1–F4 taxonomy. The
mechanism is a direct reuse of flagscan.go’s proven shape: a pure, read-only, post-hoc scan over
events.jsonl — no new sandbox instrumentation, no changes to secagent’s execution path.
| Stage | Maps to | What real state proves it | How checkable | Robustness |
|---|---|---|---|---|
| recon | pre-F1 | A request reached a known recon surface and got a response | tool_result exists for a tool_call whose args path matches a per-challenge recon-surface allowlist | Cheap + robust |
| enumeration | F1 (never finds vuln endpoint) | A request’s method+path matched the vuln-bearing route, regardless of payload correctness | tool_call.args path/method vs. a per-challenge allowlist lifted from solve.py | Cheap + robust — automates the existing by-hand method in lessons/evals/wall-attribution-discovery-vs-exploit-fail.md |
| detection | F1/F2 boundary | Response shows diagnostic evidence of the specific bug class (error, type-confusion tell, introspection leak) | tool_result.output vs. a per-challenge, bug-class-specific signature | Hard/ambiguous — bug-class-specific; recommend optional/best-effort in v1, fold into “enumeration reached, exploitation not yet” if no clean signature |
| exploitation | F2 (finds, can’t exploit) | The payload actually worked — server-side artifact only possible on success (token, row leak, shell banner) | tool_result.output vs. the exact success predicate already written in that challenge’s solve.py (verbatim reuse — this IS the ground-truth oracle) | Cheap + robust when the exploit yields one identifiable artifact; coarser (any-200-on-payload) proxy where it doesn’t |
| lateral / flag | F4 + terminal | Second request in a bypass→flag-read chain returned the flag | flag_scan.retrieved, verbatim | Already built — zero new work |
| (cross-cutting, not a stage) | F3 (clumsy tool-use) | Tool-call diversity / pro-grade tool vs. improvised shell one-liner | Count distinct tool_call.name, or classify args against a pro-tool allowlist | Does not fit the stage ladder — log as a separate metric, do not fold into a potential function (see §3.3) |
Two things this design deliberately does NOT do, because both violate the project’s own reward rule:
- Does not reuse the .claude/rules/challenges.md ptes matcher/triage-subagent mechanism. That mechanism is explicitly an LLM judge (“a span satisfies a matcher when the description’s intent is met, not merely when the regex matches”), which makes it a level-3 rubric on the reward-gameability ladder (lessons/post-training/reward-signal-types-and-gameability-ladder.md); and per lessons/challenges/pd-challenge-file-anatomy.md, none of the 10 live PD26 challenges even carry the file this schema lives in. Use its phase names, not its judging mechanism.
- Does not treat intent (the opt-in SECAGENT_CAPTURE_INTENT field) as evidence of “identified the vuln.” It’s documented observability metadata, low-faithfulness, off by default, and must be stripped before training use (lessons/security-agent/bash-intent-observability-field.md).
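One way to read the stage table as an attribution rule is sketched below. The mapping is an interpretation of the table's "Maps to" column, not an existing function in the repo: detection is treated as optional and folded into the enumeration-reached case, and F3 is left out because the table treats it as a cross-cutting metric rather than a rung on the ladder.

```python
def attribute_failure(reached):
    """Map the set of stages a run reached to a failure label.

    Returns None for a solved run. Stage names follow the table above.
    """
    if "flag" in reached:
        return None  # terminal stage reached: not a failure
    if "exploitation" in reached:
        return "F4"  # exploit worked, chain to the flag did not complete
    if "enumeration" in reached:
        return "F2"  # found the vulnerable route, could not exploit it
    return "F1"  # never reached the vulnerable route
```
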
3.3 Potential-based shaping — the caveats if this ever feeds RL reward
If a stage-scan result is ever turned into dense RL signal (rather than just a diagnostic readout), the
only shaping form proven not to change the optimal policy is
F(s,a,s') = γΦ(s') − Φ(s) for any potential function Φ — Ng, Harada & Russell, “Policy Invariance Under
Reward Transformations,” ICML 1999 (no arXiv id — this predates arXiv’s routine ML use; ACM DL
10.5555/645528.657613, verified live 2026-07-02). Two subtleties are easy to get wrong:
- Φ must be a monotone “best stage reached so far” running max, not the instantaneous current-turn stage — otherwise re-triggering an already-reached signature, or a later turn’s evidence going quiet because the agent moved on, can pay a spurious negative shaping reward for forward motion.
- Φ must be defined identically across every terminal branch (stop_reason ∈ {stop, max_turns, error}, all real values in this harness); otherwise the invariance proof breaks across the different termination paths this project’s variable-length episodes actually produce. Recommended sidestep: apply shaping only over non-terminal transitions; let R_terminal (= flag_scan.Solved, unchanged) carry all outcome signal at the very last transition.
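The two subtleties above can be made concrete with a small sketch. It assumes stages are encoded as integers (a hypothetical encoding, higher meaning further along the ladder), takes Φ as the running maximum of the stage observed so far, and emits F = γΦ(s') − Φ(s) only for transitions between recorded turns, leaving the final transition into the terminal state to the unchanged flag reward. It illustrates the mechanics only and does not by itself establish policy invariance.

```python
def shaping_rewards(stage_per_turn, gamma=1.0):
    """Return the shaping term for each non-terminal transition.

    stage_per_turn[t] is the instantaneous stage index observed at turn t.
    Phi is the running max, so revisiting or losing an earlier signature
    never lowers the potential.
    """
    phi, best = [], 0
    for stage in stage_per_turn:
        best = max(best, stage)
        phi.append(float(best))
    return [gamma * phi[t + 1] - phi[t] for t in range(len(phi) - 1)]


# Turn 4 drops from stage 3 back to stage 2 (the agent moved on), then recovers.
rewards = shaping_rewards([0, 1, 1, 3, 2, 3])
```

With γ = 1 the terms telescope to Φ at the last recorded turn minus Φ at the first, and the dip at turn 4 contributes zero rather than a negative reward, which is the behavior the running-max requirement is meant to guarantee.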
Domain-adjacent SOTA, verified live 2026-07-02 — none tick all three boxes (cybersec-specific + proven invariant + validated against a ground-truth terminal verifier), so this remains an unfilled niche, not a solved-elsewhere problem. The two rows below (Pentest-R1, DRLRM-PT) are academic cybersecurity-LLM training papers — per this project’s standing rule, they are cited for context only, not as a basis for any claim/recipe/verdict here; neither produced a frontier model, so neither is treated as evidence for the shaping design below. The recommendation that follows this table rests on the Ng/Harada/Russell invariance theorem and this project’s own gameability-ladder lesson, not on either of these two papers.
| Paper | arXiv | Relevance | Confidence |
|---|---|---|---|
| TIPS — turn-level potential shaping for search-augmented LLMs | 2603.22293 | Shaping machinery is directly on-point; domain (search-QA) is not | 0 citations, brand-new — promising, not validated |
| ToolRL — reward design for tool-use RL | 2504.13958 | Closest prior art on reward granularity/timing for tool-use GRPO; not potential-based | 1 citation |
| Pentest-R1 — two-stage RL for autonomous pentesting (academic cybersecurity-LLM training paper — cited for context, not a basis for our decisions) | 2508.07382 | Domain topic overlaps — a per-step reward in an interactive CTF env (InterCode-CTF); exact shaping formula not fully verified from search highlights alone — flag as unread in full. Not used to support the recommendation below | 0 citations, brand-new |
| DRLRM-PT — Reward Machine over kill-chain phases (academic cybersecurity-LLM training paper — cited for context, not a basis for our decisions) | 2405.15908 | Illustrates a non-potential-based design (flat +1/+10 phase bonuses, no γΦ(s')−Φ(s) structure) — cited only to warn against conflating “reward machine over phases” with “provably invariant shaping,” not as prior art we build on | Medium |
Recommended default if this is built: keep the terminal flag reward and the dense stage-shaping term
as two separate additive components, never merged into one function — this is both what makes the
invariance argument clean (Ng, Harada & Russell, ICML 1999, cited above) and what the project’s own
gameability doctrine independently favors (decoupled dense process signal + sparse ground-truth outcome
signal, lessons/post-training/reward-signal-types-and-gameability-ladder.md).
Confidence: high on the harness-reuse mechanism and the Ng et al. invariance result itself (published 1999,
well-established). Medium on the “running-max Φ” / “terminal-consistency” recommendations: applied
reasoning from the theorem plus this project’s variable-horizon episode shape, not lifted from a paper.
Full detail: artifacts/overnight-instrumentation/research/staged-verifier-design.md.
4. What training/eval data we already have, per candidate move
Scope: benchmark/ (repo, git-tracked) + ~/security-agent-qwen/ (untracked local run-artifact directory
holding the actual trajectory takeouts, partially mirrored to s3://llmresearch-data/). On the order of
1,200+ individual agent trajectories across ~470 distinct challenge definitions exist already — this
is a mining problem, not a collection problem, for three of the four candidate moves.
| Candidate move | Readiness | Extraction step | Sharpest gotcha |
|---|---|---|---|
| (i) Rejection-sampling SFT positive set | ~185 raw solved trajectories across 5 corpora (gym263 64, gym564 39-cleaned, warpenv-broker 22, envgen 29, argus60-base 29); one prior SFT (cybersec-qwen36-traj-ep2 / pd-v5-qwen36-ft) already built this way | Follow lessons/post-training/verified-trajectory-synthesis-recipe.md verbatim: verifier-accepted terminal only → replay-reproduce → dedup → decontaminate → Thought/Action/Observation with Observation loss-masked | Confirmed, not hypothetical: the prior SFT fabricated flags on 6/17 claimed solves (lessons/post-training/sft-induced-flag-confabulation.md) — naive success-folder collection teaches success-shape, not success |
| (ii) KTO/DPO pairs | KTO-native data is ready today for free: every success/ vs. failed/ split is an unpaired good/bad label (mechanical, zero judgment calls). True DPO (same-decision-point divergent pairs) needs k≥2 same-challenge same-model runs; only the PD26 canonical sweep (k=5, 10 challenges) has this, and the larger gym pools are k=1 | Label KTO now; if DPO is wanted, mine the PD26 k=5 sweep, don’t re-sweep the gym pools | Don’t default to DPO just because solve/fail piles exist: lessons/post-training/dpo-kto-for-agent-tool-selection.md’s escalation ladder (fix tool description → prompt guidance → action-space → better base → SFT → DPO/KTO → GRPO) should gate the choice first |
| (iii) Per-stage eval (F1–F4) | This is the actual gap. A PTES-phase tagger exists as a named, working concept — but on a different, sibling corpus (the pr seat’s BSides-LV-2026 audit: 189 Claude-model runs, Neo/XBOW-bench, not this project’s qwen/deepseek/glm/xai/gemini roster), and it is not yet open-sourced or present anywhere in this repo | Build a lightweight classifier over the existing tool_call/tool_result stream, scoped to the PD26/gym corpus specifically — materially smaller than the sibling talk’s full framework (no tool-tier/contamination/recovery-shape analyzers needed for internal F1–F4 attribution) | Don’t import the BSides talk’s headline numbers (“82% pivot rate,” “62% stall in exploitation”) as if they describe this project’s own models — they describe Claude models on a different bench |
| (iv) Credit-assignment traces | Best-instrumented axis in the inventory — every run in every corpus carries the full per-turn event stream, confirmed live on a real sample (190-line events.jsonl, 63 tool_call/tool_result pairs, 29 turns) | tool_call_id-paired parsing is a solved extraction problem (events.ScanFlags already demonstrates the pattern in Go) | Turn-level “did this action retrieve the flag” ≠ “was this turn part of a coherent minimal solve path” — the replay-reproduce check proves an action sequence is causal, not that recorded Thought spans are faithful narration (a distinct, unaddressed reasoning-distillation risk) |
Cross-cutting gotchas that apply to all four moves:
- Ground truth exists for only 2 of 7 corpora (PD26, argus60/APEX). gym263/gym564/warpenv/envgen have no held-out flag file; their “solved” signal is the harness’s own retrieved classification, a weaker epistemic tier than exact-match. Flag this explicitly in anything built downstream of them.
- retrieved (tool_result-not-in-own-args) is a strong genuineness signal but is still a transcript-level heuristic, not an out-of-band verifier query; the project rule is not automatically satisfied just because the harness flagged something retrieved.
- gym564’s local raw copy (521 completed) is not the cleaned number (305) cited elsewhere in memory; reconcile against experiments/2026-06-27-gym564-archive-cleanup.md before using it quantitatively.
Confidence: high on the corpus inventory and readiness split (row-counted directly, 2026-07-02); high
on the (iii) framing gap being genuine (vault-searched for “PTES”/“stage-local”/“kill chain,” found exactly
one non-applicable hit). Full detail, including per-corpus row counts and S3 paths:
artifacts/overnight-instrumentation/research/data-inventory.md.
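The KTO-versus-DPO readiness split in row (ii) can be expressed as two small helpers. This is a sketch over hypothetical run records (`trace_id`, `challenge`, `model`, `solved` are illustrative field names); requiring at least one solved and one failed run in a group is an assumption about what makes a group usable for divergent pairs, and finding the actual shared decision point inside the trajectories is not attempted here.

```python
from collections import defaultdict


def kto_labels(runs):
    """Unpaired labels: one (trace_id, desirable) tuple per run."""
    return [(r["trace_id"], bool(r["solved"])) for r in runs]


def dpo_candidate_groups(runs):
    """(challenge, model) groups with k >= 2 runs and mixed outcomes."""
    groups = defaultdict(list)
    for r in runs:
        groups[(r["challenge"], r["model"])].append(bool(r["solved"]))
    return sorted(
        key for key, outcomes in groups.items()
        if len(outcomes) >= 2 and any(outcomes) and not all(outcomes)
    )


# Toy records, not project data.
runs = [
    {"trace_id": "t1", "challenge": "PD26-02", "model": "m1", "solved": True},
    {"trace_id": "t2", "challenge": "PD26-02", "model": "m1", "solved": False},
    {"trace_id": "t3", "challenge": "gym-17", "model": "m1", "solved": True},
]
labels = kto_labels(runs)
groups = dpo_candidate_groups(runs)
```
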
5. The minimal instrumentation gap — the recommended first step
Collapsing §2–§4 into one actionable delta: the harness’s process telemetry is already complete for
stage attribution. Nothing in go/libs/agent/events or secagent/runner.go needs to change. What’s
missing is entirely semantic, and it splits into two independent, additive pieces of work that
correctly sit on different sides of the seat boundary:
flowchart TD
A["Pick ONE challenge<br/>(PD26-02: chain already<br/>documented in lessons/challenges/<br/>pd26-02-nosqli-authbypass-chain.md)"] --> B["challenge-builder:<br/>author stage_oracle.json<br/>from solve.py, 4 predicates,<br/>editorial work, not infra"]
A --> C["main: write stagescan.go,<br/>same shape as flagscan.go,<br/>pure io.Reader -> StageScan,<br/>no side effects"]
B --> D["Run stagescan over existing<br/>events.jsonl from already-<br/>collected runs (benchmark/results/,<br/>~/security-agent-qwen/)"]
C --> D
D --> E["Validate: does the stage vector<br/>match what a human reading<br/>the same trace concludes?<br/>(same discipline as the flag-layer<br/>validation)"]
E -->|"holds"| F["Generalize to all 10 PD26,<br/>then gym/warpenv/envgen"]
E -->|"doesn’t hold"| G["Refine predicates before<br/>trusting any F1-F4 number"]
- main/harness side: a deterministic (no-LLM, no-confabulation) URL/path extractor over tool_call.args and tool_result.output, scoped to the bash tool only. Emit as a derived per-run artifact, not a new harness event; this keeps the harness itself free of challenge-specific semantics.
- challenge-builder side: one stage_oracle.json sibling per challenge, with one entry per PTES phase, each a deterministic predicate over (method, path_regex, status_code, body_signature), authored directly from that challenge’s own solve.py. This is editorial work (one person reads ~10 solve.py files and writes ~4 predicates each), not new infrastructure: the ground-truth reference already exists, it’s just not lifted into a machine-readable file.
- Prototype on ONE challenge before committing to all 10. PD26-02 is the natural first pick: its two-step NoSQLi-authbypass chain is already fully narrated in lessons/challenges/pd26-02-nosqli-authbypass-chain.md, and solve.py’s own success checks (r.status_code == 200 and data.get("token") for the bypass; a flag regex on /api/profile for the pivot) are directly reusable as the exploitation-stage and lateral-stage predicates verbatim. Run the prototype over runs already sitting in benchmark/results/ and ~/security-agent-qwen/; no new sweep is needed to validate the mechanism.
- A narrower, more concrete companion gap on the flag side: turn flag_scan.retrieved into a true flag_verified boolean via an offline exact-match diff against the already-git-tracked benchmark/flags/pd26_flags.current.json, for the 10-challenge roster only. This needs no SSH, no live secrets, and closes the one place today’s “ground-truth-verified” claim is actually a provenance proxy.
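The stage-oracle idea in the list above can be prototyped in Python before any Go is written. The sketch below is hypothetical throughout: the `Predicate` shape mirrors the (method, path_regex, status_code, body_signature) tuple named above, the `/api/login` route is an invented placeholder (only `/api/profile` appears in this chapter), and the status check only works when the agent's command happened to print an HTTP status line.

```python
import re
from dataclasses import dataclass
from typing import Optional

STATUS_RE = re.compile(r"HTTP/\d(?:\.\d)?\s+(\d{3})")


@dataclass
class Predicate:
    stage: str
    method: str
    path_regex: str
    status_code: Optional[int] = None  # None means "do not check status"
    body_signature: str = ""           # regex over the output; "" means any body


def scan_stages(oracle, observations):
    """observations: [(method, path, output)] in turn order.

    Returns {stage: index of the first observation satisfying it, or None}.
    """
    first = {p.stage: None for p in oracle}
    for i, (method, path, output) in enumerate(observations):
        codes = [int(c) for c in STATUS_RE.findall(output)]
        for p in oracle:
            if first[p.stage] is not None or method != p.method:
                continue
            if not re.fullmatch(p.path_regex, path):
                continue
            if p.status_code is not None and p.status_code not in codes:
                continue
            if p.body_signature and not re.search(p.body_signature, output):
                continue
            first[p.stage] = i
    return first


oracle = [
    Predicate("enumeration", "POST", r"/api/login"),
    Predicate("exploitation", "POST", r"/api/login", 200, r'"token"'),
    Predicate("lateral", "GET", r"/api/profile", None, r"FLAG\{[^}]*\}"),
]
observations = [
    ("GET", "/robots.txt", "HTTP/1.1 404 Not Found"),
    ("POST", "/api/login", 'HTTP/1.1 401 Unauthorized\n\n{"error": "bad creds"}'),
    ("POST", "/api/login", 'HTTP/1.1 200 OK\n\n{"token": "abc"}'),
]
stage_vector = scan_stages(oracle, observations)
```

The resulting stage vector is what a human reader of the same trace would be asked to confirm during validation.
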
None of this requires RL infrastructure, a new sandbox tool, or a change to secagent’s execution path —
it is a read-only scan over data that already exists, validated against traces already collected, before
any GRPO/RLVR reward design depends on it.
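The flag-side companion check described in this section reduces to an exact string comparison. The sketch below assumes, without having confirmed it, that the flags file is a JSON object mapping a challenge id to its flag string; the ids and flag values shown are placeholders.

```python
import json


def flag_verified(challenge_id, retrieved, flags):
    """True only if some retrieved flag exactly equals the known flag.

    flags: {challenge_id: flag_string} (assumed file layout).
    """
    expected = flags.get(challenge_id)
    if expected is None:
        return False  # off-roster challenge: cannot be verified offline
    return any(flag == expected for flag in retrieved)


# Placeholder contents standing in for the real flags file.
flags = json.loads('{"PD26-02": "FLAG{real}"}')
verified = flag_verified("PD26-02", ["FLAG{decoy}", "FLAG{real}"], flags)
decoy_only = flag_verified("PD26-02", ["FLAG{decoy}"], flags)
```

A run whose only retrieved string is a decoy stays solved under the provenance rule but fails this check, which is the gap the exact-match step is meant to close.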
Cross-links
- Diagnosing the gap — the routing test this chapter’s stage signal feeds.
- From behavioral audit to training signal — what to do once a failure is localized to a stage.
- Method → Data — the data-object framing this chapter grounds in an actual corpus.