Before you train — instrumentation & data readiness

This chapter is a prerequisite, not a fix. “Diagnosing the gap” gives you the routing test (knowledge vs. execution vs. exploration); “behavioral audit → training signal” maps observed failure shapes to methods. Both assume you can already say which stage of a run failed. Today, for this project’s own harness, you can’t. The telemetry is not missing; nobody has written the small amount of code that turns existing telemetry into a stage verdict. This chapter answers “what do I already have, and what’s the minimal thing to build first,” based entirely on the repo and vault as read on 2026-07-02, not on aspiration.

Everything below is a recommendation for main to implement — no code was written, no repo file was touched. Confidence is stated per claim; where I could not confirm something, I say so.

1. Why per-stage signal is the prerequisite for the whole diagnosis

The diagnosis framework’s routing test — does the correct action ever appear at high N, and does RL (not matched SFT) expand it? — is defined per challenge or per challenge-subtype, not per portfolio. Run it on the portfolio in aggregate and you get one number that averages over four structurally different failure modes (F1 exploration / F2 skill / F3 tool-use / F4 long-horizon), which is exactly the “collapsing a split verdict into one sentence” anti-pattern Diagnosing the gap §0 warns against. You cannot segment a portfolio you cannot localize. Per-stage signal is what turns “39% pass@1, up from 26%” into “F1 dropped from 40%→12% of failures, F2 is now the dominant failure mode” — the only shape of finding that tells you which lever (SFT curriculum, tool-use reward, GRPO exploration term) to pull next.

flowchart LR
    A["events.jsonl<br/>(already emitted)"] --> B["stage extractor<br/>(NOT built)"]
    C["per-challenge manifest<br/>(NOT built)"] --> B
    B --> D["F1-F4 attribution<br/>per run"]
    D --> E["Diagnosis framework<br/>routing test, per subtype"]
    E --> F["method choice:<br/>SFT curriculum / tool reward / GRPO term"]
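
As a minimal sketch of why the per-run attribution matters, the snippet below turns a list of per-run verdicts into each failure mode's share of all failures. The verdict labels are assumed to come from the not-yet-built stage extractor, and the two toy sweeps are invented purely for illustration; they are not measurements from this project.

```python
from collections import Counter


def failure_shares(verdicts):
    """Return each failure mode's share of all failed runs.

    `verdicts` holds one label per run: "solved" or one of "F1".."F4".
    ASSUMPTION: these labels would come from a stage scan that is not
    built yet.
    """
    failures = [v for v in verdicts if v != "solved"]
    if not failures:
        return {}
    counts = Counter(failures)
    return {mode: counts[mode] / len(failures) for mode in sorted(counts)}


# Toy sweeps (invented): pass rate rises, and the dominant failure mode moves.
before = ["solved"] * 2 + ["F1"] * 4 + ["F2"] * 2 + ["F3"] + ["F4"]
after = ["solved"] * 4 + ["F1"] + ["F2"] * 3 + ["F3"] + ["F4"]

shares_before = failure_shares(before)
shares_after = failure_shares(after)
print(shares_before)
print(shares_after)
```

The aggregate pass rate alone would report only "2/10 to 4/10"; the shares show which failure mode shrank and which one now dominates.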

2. What the harness already emits — confirmed, source-read

Source: go/libs/agent/events/events.go (schema), go/libs/secagent/runner.go (wiring), doc of record lessons/security-agent/harness-observability-contract-2026-06.md. Every event is Turn-indexed and tool_call/tool_result pairs join on tool_call_id — this is the load-bearing fact that makes any stage-localization script possible with zero changes to the agent loop.
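
The join on tool_call_id can be sketched in a few lines. This is an illustration in Python, not the harness's Go code: the JSON envelope (a "type" key naming the event, a "turn" key, payload fields at the top level) is an assumption on my part, and the sample trace is invented.

```python
import json


def pair_tool_events(lines):
    """Join tool_call and tool_result events on tool_call_id.

    ASSUMPTION: each JSONL line is an object whose "type" key names the
    event and whose payload fields sit at the top level. The real
    envelope defined in events.go may differ.
    """
    pending = {}  # tool_call_id -> tool_call event still awaiting a result
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        kind = event.get("type")
        if kind == "tool_call":
            pending[event["tool_call_id"]] = event
        elif kind == "tool_result":
            call = pending.pop(event["tool_call_id"], None)
            if call is None:
                continue  # result without a recorded call: skip, do not guess
            pairs.append({
                "turn": call.get("turn"),
                "name": call["name"],
                "args": call["args"],
                "output": event.get("output", ""),
                "error": event.get("error"),
            })
    return pairs, list(pending.values())


# Hypothetical three-line trace; host and path are invented.
sample = [
    '{"type": "tool_call", "turn": 1, "tool_call_id": "c1", "name": "bash", "args": "curl -s http://target/api/login"}',
    '{"type": "tool_result", "turn": 1, "tool_call_id": "c1", "name": "bash", "output": "404 Not Found"}',
    '{"type": "tool_call", "turn": 2, "tool_call_id": "c2", "name": "bash", "args": "ls"}',
]
pairs, unanswered = pair_tool_events(sample)
print(pairs[0]["args"], "->", pairs[0]["output"])
print("unanswered calls:", [c["tool_call_id"] for c in unanswered])
```

Returning the unanswered calls separately keeps a truncated run visible instead of silently dropping its last command.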

| Event | What it carries | Stage-attribution value |
| --- | --- | --- |
| meta / tool_schemas / system_prompt / user_message (preamble) | trace_id, model, action space, task string | Run identity; needed to join against a per-challenge manifest |
| agent_start | {input} | “Did the agent do anything”: trivially free |
| turn_start | {history_len} | Per-turn anchor |
| llm_response | role/content/reasoning, tool_calls[], token usage, TTFT/latency, finish_reason | Reasoning text: usable for context, never as stage-reached evidence (confabulation risk) |
| tool_call | {tool_call_id, name, args} | The command actually issued; args is a raw string (bash/curl/etc.), not a structured HTTP call |
| tool_result | {tool_call_id, name, output, tool_exec_ms, error?} | The real, server-observed response text; this is the only tier that counts as ground truth |
| agent_finish | stop_reason, turns/max_turns, finish_reason | Distinguishes “ran out of budget” (max_turns) from “gave up” (stop) from silent truncation (stop_reason=stop but finish_reason=length; don’t trust stop_reason alone) |
| flag_scan | {solved, primary?, retrieved[], findings[{flag, origin, first_event, first_turn?}]} | Terminal outcome, provenance-classified (retrieved/echoed/model_claim) |
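
The agent_finish row above implies a small decision rule, sketched here. The stop_reason values (stop, max_turns, error) and the finish_reason value length are taken from this chapter; the returned label names are hypothetical.

```python
def classify_finish(stop_reason, finish_reason):
    """Map agent_finish fields to a coarse termination label.

    Input values come from the text; the label names returned here are
    hypothetical and chosen only for this sketch.
    """
    if stop_reason == "max_turns":
        return "budget_exhausted"
    if stop_reason == "error":
        return "error"
    if stop_reason == "stop":
        # A "stop" whose last LLM response was cut off is a truncation,
        # which is why stop_reason alone is not trusted.
        return "truncated" if finish_reason == "length" else "stopped"
    return "unknown"


for fields in [("max_turns", "stop"), ("stop", "length"), ("stop", "stop")]:
    print(fields, "->", classify_finish(*fields))
```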

Tool surface bounding what a “stage” can even look like: five tools total (bash, read_file, write_file, update_file, WebSearch), confirmed in go/libs/sectools/sandbox.go; there is no WebFetch, contra two now-stale lessons. All HTTP interaction with the target is embedded in free-text bash commands and free-text bash output; there is no structured http_request tool call with a machine-readable status/URL/body. This is the single biggest reason “endpoint discovered” and “vuln identified” are not already cleanly derivable: any stage parser must regex/parse free text, not read a field.
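
To make the free-text parsing problem concrete, here is a best-effort sketch that pulls (method, path) pairs out of a bash command string. It only understands simple curl invocations; the host, port, and the /api/login path in the examples are invented, and a production extractor would need to handle other clients, shell variables, and combined flags such as -XPOST.

```python
import shlex
from urllib.parse import urlsplit

# curl options that attach a request body (and so imply POST by default).
DATA_FLAGS = {"-d", "--data", "--data-raw", "--data-binary"}


def extract_requests(command):
    """Best-effort (method, path) extraction from a free-text bash command.

    Limitations of this sketch: only `curl` is recognized, and any token
    starting with http(s):// is taken to be the request URL.
    """
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes in the recorded command
        tokens = command.split()

    requests, current = [], None
    for i, tok in enumerate(tokens):
        if tok == "curl":
            current = {"method": None, "has_body": False, "url": None}
            requests.append(current)
        elif current is None:
            continue
        elif tok in ("-X", "--request") and i + 1 < len(tokens):
            current["method"] = tokens[i + 1].upper()
        elif tok in DATA_FLAGS:
            current["has_body"] = True
        elif tok.startswith(("http://", "https://")) and current["url"] is None:
            current["url"] = tok

    found = []
    for req in requests:
        if req["url"] is None:
            continue
        method = req["method"] or ("POST" if req["has_body"] else "GET")
        found.append((method, urlsplit(req["url"]).path or "/"))
    return found


login = """curl -s -X POST http://target:3000/api/login -H 'Content-Type: application/json' -d '{"username": {"$ne": null}}'"""
print(extract_requests(login))
print(extract_requests("curl http://target:3000/api/profile?id=1 | grep FLAG"))
print(extract_requests("ls -la /tmp"))
```

Using shlex rather than a single regex keeps quoted JSON bodies intact as one token, so a URL inside a payload is less likely to be mistaken for the request target.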

Confidence: high (direct source read, 2026-07-02). Full detail: artifacts/overnight-instrumentation/research/harness-signals.md.

3. Per-stage ground-truth verifier design

3.1 What already satisfies “ground-truth-verified, never transcript-matched”

The project’s non-negotiable rule — reward from real environment/tool-output state, never format/regex-matched on the transcript — is already met at the terminal (flag) stage and nowhere else:

  • go/libs/agent/events/flagscan.go (ScanFlags) classifies every FLAG{...} sighting as retrieved (in a tool_result, absent from that call’s own args) / echoed (present in both) / model_claim (only in model text). This is a provenance signal — it proves the string came back from the sandbox, not the model’s mouth — but is not a byte-compare against the real flag. A retrieved flag from a decoy or off-target leak still reads solved:true.
  • The actual byte-compare (flag_verified, the project’s own term) lives outside the agent runtime: benchmark/flags/pd26_flags.current.json (10 live flags) + benchmark/verification/PD26-NN/exploit/solve.py (held-out reference solvers). Verifying against these today is a manual, SSH-gated, human-authorized step (lessons/evals/verifying-agentic-security-runs.md) — not wired into the harness, pdq, or trace-verify as an automatic per-run check. grep -rln "flag_verified" --include="*.go" returns zero hits — confirmed absence, not a naming mismatch.
  • Documented failure modes any pre-terminal verifier must not reproduce: model-claim fabrication after repeated 404s (lessons/security-agent/flag-detection-false-positives.md, 6/17 fine-tune “solves” were exactly this); a hardcoded flag-format regex producing false negatives on an off-roster 24-hex flag (lessons/evals/gym-challenge-flag-format-breadcrumb-false-negative.md); an 18-point proxy-vs-verified inflation on the fine-tune’s own leaderboard (lessons/evals/ctf-flag-verification-and-proxy-pitfall.md).
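
The provenance rule described in the first bullet can be sketched as follows. This is not the ScanFlags implementation: the FLAG{...} regex is hardcoded here only for brevity (which is exactly the false-negative risk the last bullet warns about), and letting the strongest provenance win when a flag appears in several places is my assumption.

```python
import re

FLAG_RE = re.compile(r"FLAG\{[^}]+\}")

# ASSUMPTION: when one flag is seen in several places, the strongest
# provenance wins. The real ScanFlags tie-breaking may differ.
RANK = {"model_claim": 1, "echoed": 2, "retrieved": 3}


def classify_flags(pairs, model_texts):
    """Label each flag sighting as retrieved, echoed, or model_claim.

    pairs:       (args, output) strings of joined tool_call/tool_result events
    model_texts: content strings from llm_response events
    """
    best = {}

    def note(flag, label):
        if RANK[label] > RANK.get(best.get(flag), 0):
            best[flag] = label

    for args, output in pairs:
        for flag in FLAG_RE.findall(output):
            # Present in the call's own args: the command printed it itself.
            note(flag, "echoed" if flag in args else "retrieved")
    for text in model_texts:
        for flag in FLAG_RE.findall(text):
            note(flag, "model_claim")
    return best


# Invented example values.
tool_pairs = [
    ("cat /flag.txt", "FLAG{abc123}\n"),
    ("echo FLAG{guess}", "FLAG{guess}\n"),
]
texts = ["The flag is FLAG{made_up}", "I found FLAG{abc123}"]
origins = classify_flags(tool_pairs, texts)
print(origins)
```

Note that nothing in this function knows the real flag: a decoy returned by the sandbox would still be labeled retrieved, which is the gap between provenance and flag_verified.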

3.2 The proposed per-stage predicate design

Phase names follow the already-designed (if unused) ptes schema in .claude/rules/challenges.md (recon → enumeration → detection → exploitation → lateral), mapped onto the brief’s F1–F4 taxonomy. The mechanism is a direct reuse of flagscan.go’s proven shape: a pure, read-only, post-hoc scan over events.jsonl — no new sandbox instrumentation, no changes to secagent’s execution path.

| Stage | Maps to | What real state proves it | How checkable | Robustness |
| --- | --- | --- | --- | --- |
| recon | pre-F1 | A request reached a known recon surface and got a response | tool_result exists for a tool_call whose args path matches a per-challenge recon-surface allowlist | Cheap + robust |
| enumeration | F1 (never finds vuln endpoint) | A request’s method+path matched the vuln-bearing route, regardless of payload correctness | tool_call.args path/method vs. a per-challenge allowlist lifted from solve.py | Cheap + robust; automates the existing by-hand method in lessons/evals/wall-attribution-discovery-vs-exploit-fail.md |
| detection | F1/F2 boundary | Response shows diagnostic evidence of the specific bug class (error, type-confusion tell, introspection leak) | tool_result.output vs. a per-challenge, bug-class-specific signature | Hard/ambiguous (bug-class-specific); recommend optional/best-effort in v1, fold into “enumeration reached, exploitation not yet” if no clean signature |
| exploitation | F2 (finds, can’t exploit) | The payload actually worked: a server-side artifact only possible on success (token, row leak, shell banner) | tool_result.output vs. the exact success predicate already written in that challenge’s solve.py (verbatim reuse; this IS the ground-truth oracle) | Cheap + robust when the exploit yields one identifiable artifact; coarser (any-200-on-payload) proxy where it doesn’t |
| lateral / flag | F4 + terminal | Second request in a bypass→flag-read chain returned the flag | flag_scan.retrieved, verbatim | Already built; zero new work |
| (cross-cutting, not a stage) | F3 (clumsy tool-use) | Tool-call diversity / pro-grade tool vs. improvised shell one-liner | Count distinct tool_call.name, or classify args against a pro-tool allowlist | Does not fit the stage ladder; log as a separate metric, do not fold into a potential function (see §3.3) |
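
A minimal sketch of how the table's predicates could be evaluated over already-extracted requests. Everything named here is hypothetical: the Predicate fields, the example oracle for a login-bypass challenge, and in particular the mapping from "highest stage reached" to a failure mode, which is my reading of the table rather than a settled definition. The optional detection stage is left out, in line with the v1 recommendation, and F3 is not assigned because it is cross-cutting.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Predicate:
    stage: str
    path_regex: str
    method: Optional[str] = None      # None matches any method
    body_regex: Optional[str] = None  # None means any response counts


def stages_reached(requests, oracle, flag_retrieved):
    """requests: (method, path, output) tuples taken from bash tool calls."""
    reached = set()
    for pred in oracle:
        for method, path, output in requests:
            if pred.method and pred.method != method:
                continue
            if not re.fullmatch(pred.path_regex, path):
                continue
            if pred.body_regex and not re.search(pred.body_regex, output):
                continue
            reached.add(pred.stage)
            break
    if flag_retrieved:  # reuse of the existing flag_scan result
        reached.add("lateral")
    return reached


def failure_mode(reached):
    """ASSUMPTION: one reading of the stage-to-F mapping in the table."""
    if "lateral" in reached:
        return "solved"
    if "exploitation" in reached:
        return "F4"  # exploit worked, chain to the flag did not complete
    if "enumeration" in reached:
        return "F2"  # found the vulnerable route, could not exploit it
    return "F1"      # never reached the vulnerable route


# Hypothetical oracle and trace for a login-bypass challenge.
oracle = [
    Predicate("recon", r"/"),
    Predicate("enumeration", r"/api/login", method="POST"),
    Predicate("exploitation", r"/api/login", method="POST", body_regex=r'"token"'),
]
trace = [
    ("GET", "/", "<html>welcome</html>"),
    ("POST", "/api/login", '{"error": "invalid credentials"}'),
]
reached = stages_reached(trace, oracle, flag_retrieved=False)
print(sorted(reached), failure_mode(reached))
```

The exploitation predicate differs from the enumeration predicate only by its body check, which mirrors the table's distinction between hitting the route and the payload actually working.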

Two things this design deliberately does NOT do, because both violate the project’s own reward rule:

  • Does not reuse the .claude/rules/challenges.md ptes matcher/triage-subagent mechanism. That mechanism is explicitly an LLM judge (“a span satisfies a matcher when the description’s intent is met, not merely when the regex matches”) — a level-3 rubric on the reward-gameability ladder (lessons/post-training/reward-signal-types-and-gameability-ladder.md), and per lessons/challenges/pd-challenge-file-anatomy.md, none of the 10 live PD26 challenges even carry the file this schema lives in. Use its phase names, not its judging mechanism.
  • Does not treat intent (the opt-in SECAGENT_CAPTURE_INTENT field) as evidence of “identified the vuln.” It’s documented observability metadata, low-faithfulness, off by default, and must be stripped before training use (lessons/security-agent/bash-intent-observability-field.md).

3.3 Potential-based shaping — the caveats if this ever feeds RL reward

If a stage-scan result is ever turned into dense RL signal (rather than just a diagnostic readout), the only shaping form proven not to change the optimal policy is F(s,a,s') = γΦ(s') − Φ(s) for any potential function Φ — Ng, Harada & Russell, “Policy Invariance Under Reward Transformations,” ICML 1999 (no arXiv id — this predates arXiv’s routine ML use; ACM DL 10.5555/645528.657613, verified live 2026-07-02). Two subtleties are easy to get wrong:

  1. Φ must be a monotone “best stage reached so far” running max, not the instantaneous current-turn stage — otherwise re-triggering an already-reached signature, or a later turn’s evidence going quiet because the agent moved on, can pay a spurious negative shaping reward for forward motion.
  2. Φ must be defined identically across every terminal branch (stop_reason{stop, max_turns, error}, all real values in this harness) — otherwise the invariance proof breaks across the different termination paths this project’s variable-length episodes actually produce. Recommended sidestep: apply shaping only over non-terminal transitions; let R_terminal (= flag_scan.Solved, unchanged) carry all outcome signal at the very last transition.
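
The two subtleties above can be sketched as bookkeeping. This is an illustration, not a proof of invariance: stage indices (0 = nothing reached, higher = later stage) and the function names are hypothetical, Φ is the running max of the per-turn stage index, shaping is paid only on non-terminal transitions, and the terminal reward is returned as a separate component.

```python
from itertools import accumulate


def running_max_potential(stage_per_turn):
    """Phi after each turn: the best stage index reached so far."""
    return list(accumulate(stage_per_turn, max))


def shaped_components(stage_per_turn, solved, gamma=1.0):
    """Return (shaping, terminal) reward lists, one entry per turn.

    shaping[t] = gamma * Phi(s_{t+1}) - Phi(s_t) on non-terminal
    transitions and 0 on the last one; terminal[t] carries the outcome
    reward on the last transition only.
    """
    phi = [0] + running_max_potential(stage_per_turn)  # phi[0]: before turn 1
    last = len(stage_per_turn) - 1
    shaping, terminal = [], []
    for t in range(len(stage_per_turn)):
        if t == last:
            shaping.append(0.0)
            terminal.append(1.0 if solved else 0.0)
        else:
            shaping.append(gamma * phi[t + 1] - phi[t])
            terminal.append(0.0)
    return shaping, terminal


# Instantaneous evidence dips at turn 3 and goes quiet at turn 5.
stages = [1, 2, 1, 3, 0]
shaping, terminal = shaped_components(stages, solved=True)
print(shaping)
print(terminal)
```

With the instantaneous stage in place of the running max, the third transition would pay 1 − 2 = −1 for an agent that simply moved on; with the running max it pays 0. At γ = 1 the shaping terms telescope to Φ at the last non-terminal state minus Φ at the start.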

Domain-adjacent SOTA, verified live 2026-07-02 — none tick all three boxes (cybersec-specific + proven invariant + validated against a ground-truth terminal verifier), so this remains an unfilled niche, not a solved-elsewhere problem. The two rows below (Pentest-R1, DRLRM-PT) are academic cybersecurity-LLM training papers — per this project’s standing rule, they are cited for context only, not as a basis for any claim/recipe/verdict here; neither produced a frontier model, so neither is treated as evidence for the shaping design below. The recommendation that follows this table rests on the Ng/Harada/Russell invariance theorem and this project’s own gameability-ladder lesson, not on either of these two papers.

| Paper | arXiv | Relevance | Confidence |
| --- | --- | --- | --- |
| TIPS: turn-level potential shaping for search-augmented LLMs | 2603.22293 | Shaping machinery is directly on-point; domain (search-QA) is not | 0 citations, brand-new; promising, not validated |
| ToolRL: reward design for tool-use RL | 2504.13958 | Closest prior art on reward granularity/timing for tool-use GRPO; not potential-based | 1 citation |
| Pentest-R1: two-stage RL for autonomous pentesting (academic cybersecurity-LLM training paper, cited for context, not a basis for our decisions) | 2508.07382 | Domain topic overlaps: a per-step reward in an interactive CTF env (InterCode-CTF); exact shaping formula not fully verified from search highlights alone; flag as unread in full. Not used to support the recommendation below | 0 citations, brand-new |
| DRLRM-PT: Reward Machine over kill-chain phases (academic cybersecurity-LLM training paper, cited for context, not a basis for our decisions) | 2405.15908 | Illustrates a non-potential-based design (flat +1/+10 phase bonuses, no γΦ(s')−Φ(s) structure); cited only to warn against conflating “reward machine over phases” with “provably invariant shaping,” not as prior art we build on | Medium |

Recommended default if this is built: keep the terminal flag reward and the dense stage-shaping term as two separate additive components, never merged into one function — this is both what makes the invariance argument clean (Ng, Harada & Russell, ICML 1999, cited above) and what the project’s own gameability doctrine independently favors (decoupled dense process signal + sparse ground-truth outcome signal, lessons/post-training/reward-signal-types-and-gameability-ladder.md).

Confidence: high on the harness-reuse mechanism and the Ng et al. invariance result itself (25-year-old, well-established). Medium on the “running-max Φ” / “terminal-consistency” recommendations — applied reasoning from the theorem plus this project’s variable-horizon episode shape, not lifted from a paper. Full detail: artifacts/overnight-instrumentation/research/staged-verifier-design.md.

4. What training/eval data we already have, per candidate move

Scope: benchmark/ (repo, git-tracked) + ~/security-agent-qwen/ (untracked local run-artifact directory holding the actual trajectory takeouts, partially mirrored to s3://llmresearch-data/). On the order of 1,200+ individual agent trajectories across ~470 distinct challenge definitions exist already — this is a mining problem, not a collection problem, for three of the four candidate moves.

| Candidate move | Readiness | Extraction step | Sharpest gotcha |
| --- | --- | --- | --- |
| (i) Rejection-sampling SFT positive set | ~185 raw solved trajectories across 5 corpora (gym263 64, gym564 39-cleaned, warpenv-broker 22, envgen 29, argus60-base 29); one prior SFT (cybersec-qwen36-traj-ep2 / pd-v5-qwen36-ft) already built this way | Follow lessons/post-training/verified-trajectory-synthesis-recipe.md verbatim: verifier-accepted terminal only → replay-reproduce → dedup → decontaminate → Thought/Action/Observation with Observation loss-masked | Confirmed, not hypothetical: the prior SFT fabricated flags on 6/17 claimed solves (lessons/post-training/sft-induced-flag-confabulation.md); naive success-folder collection teaches success-shape, not success |
| (ii) KTO/DPO pairs | KTO-native data is ready today for free: every success/ vs. failed/ split is an unpaired good/bad label (mechanical, zero judgment calls). True DPO (same-decision-point divergent pairs) needs k≥2 same-challenge same-model runs; only the PD26 canonical sweep (k=5, 10 challenges) has this, and the larger gym pools are k=1 | Label KTO now; if DPO is wanted, mine the PD26 k=5 sweep, don’t re-sweep the gym pools | Don’t default to DPO just because solve/fail piles exist; the escalation ladder in lessons/post-training/dpo-kto-for-agent-tool-selection.md (fix tool description → prompt guidance → action-space → better base → SFT → DPO/KTO → GRPO) should gate the choice first |
| (iii) Per-stage eval (F1–F4) | This is the actual gap. A PTES-phase tagger exists as a named, working concept, but on a different, sibling corpus (the pr seat’s BSides-LV-2026 audit: 189 Claude-model runs, Neo/XBOW-bench, not this project’s qwen/deepseek/glm/xai/gemini roster), and it is not yet open-sourced or present anywhere in this repo | Build a lightweight classifier over the existing tool_call/tool_result stream, scoped to the PD26/gym corpus specifically; materially smaller than the sibling talk’s full framework (no tool-tier/contamination/recovery-shape analyzers needed for internal F1–F4 attribution) | Don’t import the BSides talk’s headline numbers (“82% pivot rate,” “62% stall in exploitation”) as if they describe this project’s own models; they describe Claude models on a different bench |
| (iv) Credit-assignment traces | Best-instrumented axis in the inventory: every run in every corpus carries the full per-turn event stream, confirmed live on a real sample (190-line events.jsonl, 63 tool_call/tool_result pairs, 29 turns) | tool_call_id-paired parsing is a solved extraction problem (events.ScanFlags already demonstrates the pattern in Go) | Turn-level “did this action retrieve the flag” ≠ “was this turn part of a coherent minimal solve path”; the replay-reproduce check proves an action sequence is causal, not that recorded Thought spans are faithful narration (a distinct, unaddressed reasoning-distillation risk) |
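
Row (ii) can be sketched as a mechanical pass over run records. The record shape (challenge, model, solved) and the challenge/model names are hypothetical. Requiring mixed outcomes within a group is my added assumption: it is a necessary screen for divergent pairs, not the same-decision-point mining itself.

```python
from collections import defaultdict


def preference_readiness(runs, min_k=2):
    """Split run records into KTO labels and DPO-candidate groups.

    runs: iterable of (challenge, model, solved) tuples.
    Returns (kto_labels, dpo_groups).
    """
    runs = list(runs)
    # KTO: every run is an unpaired desirable/undesirable example.
    kto = [(c, m, "desirable" if s else "undesirable") for c, m, s in runs]

    groups = defaultdict(list)
    for challenge, model, solved in runs:
        groups[(challenge, model)].append(solved)

    # DPO candidates: k >= min_k runs of the same challenge and model,
    # with at least one success and one failure (ASSUMPTION, see lead-in).
    dpo = sorted(
        key for key, outcomes in groups.items()
        if len(outcomes) >= min_k and any(outcomes) and not all(outcomes)
    )
    return kto, dpo


# Invented records.
records = [
    ("PD26-02", "qwen", True),
    ("PD26-02", "qwen", False),
    ("gym-001", "qwen", True),
    ("PD26-03", "qwen", False),
    ("PD26-03", "qwen", False),
]
kto_labels, dpo_groups = preference_readiness(records)
print(len(kto_labels), "KTO examples;", "DPO candidates:", dpo_groups)
```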

Cross-cutting gotchas that apply to all four moves:

  • Ground truth exists for only 2 of 7 corpora (PD26, argus60/APEX). gym263/gym564/warpenv/envgen have no held-out flag file — their “solved” signal is the harness’s own retrieved classification, a weaker epistemic tier than exact-match. Flag this explicitly in anything built downstream of them.
  • retrieved (tool_result-not-in-own-args) is a strong genuineness signal but is still transcript-level heuristic, not an out-of-band verifier query — the project rule is not automatically satisfied just because the harness flagged something retrieved.
  • gym564’s local raw copy (521 completed) is not the cleaned number (305) cited elsewhere in memory — reconcile against experiments/2026-06-27-gym564-archive-cleanup.md before using it quantitatively.

Confidence: high on the corpus inventory and readiness split (row-counted directly, 2026-07-02); high on the (iii) framing gap being genuine (vault-searched for “PTES”/“stage-local”/“kill chain,” found exactly one non-applicable hit). Full detail, including per-corpus row counts and S3 paths: artifacts/overnight-instrumentation/research/data-inventory.md.

Collapsing §2–§4 into one actionable delta: the harness’s process telemetry is already complete for stage attribution. Nothing in go/libs/agent/events or secagent/runner.go needs to change. What’s missing is entirely semantic, and it splits into two independent, additive pieces of work that correctly sit on different sides of the seat boundary:

flowchart TD
  A["Pick ONE challenge<br/>(PD26-02 — chain already<br/>documented in lessons/challenges/<br/>pd26-02-nosqli-authbypass-chain.md)"] --> B["challenge-builder:<br/>author stage_oracle.json<br/>from solve.py — 4 predicates,<br/>editorial work, not infra"]
  A --> C["main: write stagescan.go,<br/>same shape as flagscan.go —<br/>pure io.Reader -> StageScan,<br/>no side effects"]
  B --> D["Run stagescan over existing<br/>events.jsonl from already-<br/>collected runs (benchmark/results/,<br/>~/security-agent-qwen/)"]
  C --> D
  D --> E["Validate: does the stage vector<br/>match what a human reading<br/>the same trace concludes?<br/>(same discipline as the flag-layer<br/>validation)"]
  E -->|"holds"| F["Generalize to all 10 PD26,<br/>then gym/warpenv/envgen"]
  E -->|"doesn't hold"| G["Refine predicates before<br/>trusting any F1-F4 number"]
  1. main/harness side: a deterministic (no-LLM, no-confabulation) URL/path extractor over tool_call.args + tool_result.output, scoped to the bash tool only. Emit as a derived per-run artifact, not a new harness event; this keeps the harness itself free of challenge-specific semantics.
  2. challenge-builder side: one stage_oracle.json sibling per challenge — one entry per PTES phase, each a deterministic predicate over (method, path_regex, status_code, body_signature), authored directly from that challenge’s own solve.py. This is editorial work (one person reads ~10 solve.py files and writes ~4 predicates each), not new infrastructure — the ground-truth reference already exists, it’s just not lifted into a machine-readable file.
  3. Prototype on ONE challenge before committing to all 10. PD26-02 is the natural first pick — its two-step NoSQLi-authbypass chain is already fully narrated in lessons/challenges/pd26-02-nosqli-authbypass-chain.md, and solve.py’s own success checks (r.status_code == 200 and data.get("token") for the bypass; a flag regex on /api/profile for the pivot) are directly reusable as the exploitation-stage and lateral-stage predicates verbatim. Run the prototype over runs already sitting in benchmark/results/ and ~/security-agent-qwen/ — no new sweep needed to validate the mechanism.
  4. A narrower, more concrete companion gap on the flag side: turn flag_scan.retrieved into a true flag_verified boolean via an offline exact-match diff against the already-git-tracked benchmark/flags/pd26_flags.current.json, for the 10-challenge roster only. This needs no SSH, no live secrets, and closes the one place today’s “ground-truth-verified” claim is actually a provenance proxy.
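
Item 4 is small enough to sketch in full. I have not confirmed the layout of pd26_flags.current.json, so the challenge-id-to-flag mapping below is an assumption, and the flag values are placeholders. Returning None for off-roster challenges keeps "no ground truth" distinct from "verified false".

```python
import json


def flag_verified(challenge_id, retrieved, flags):
    """Exact-match check of retrieved flags against the held-out flag.

    flags: mapping of challenge id -> expected flag string.
    ASSUMPTION: the real flags file may use a different layout.
    Returns True/False for roster challenges, None when no ground truth
    exists for the challenge.
    """
    expected = flags.get(challenge_id)
    if expected is None:
        return None
    return any(flag.strip() == expected for flag in retrieved)


# Placeholder content standing in for the flags file.
flags = json.loads('{"PD26-02": "FLAG{real_value}"}')

print(flag_verified("PD26-02", ["FLAG{decoy}", "FLAG{real_value}"], flags))
print(flag_verified("PD26-02", ["FLAG{decoy}"], flags))
print(flag_verified("gym-001", ["FLAG{anything}"], flags))
```

The second call is the case the provenance proxy gets wrong: a flag that genuinely came back from the sandbox but is not the real one.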

None of this requires RL infrastructure, a new sandbox tool, or a change to secagent’s execution path — it is a read-only scan over data that already exists, validated against traces already collected, before any GRPO/RLVR reward design depends on it.