Proven post-training datasets — a usage-cited registry

Every other chapter in this book asks “which method” (The family map) or “which sequence of stages” (The recipe is a sequence, not a pick). This chapter answers the question those chapters leave open once you’ve picked Sequence B (§2 of the recipe chapter): which concrete, downloadable dataset actually fills each rung, for the six GENERAL (non-cyber) capabilities every Sequence-B rung needs regardless of what cyber-specific data you build on top — chat template, tool/function-calling, instruction following, preference, reasoning, and willingness/refusal calibration. This is a registry, not an essay: the tables are the payload.

1. The inclusion rule (read this before trusting any row)

This project’s permanent stance is never rely on academic cybersecurity-LLM projects — grounding comes from real frontier/open-weight recipes, not single-paper academic artifacts (frontier-cyber-model-path.md §1). This chapter applies the exact dataset-side analog of that stance:

A dataset gets a row ONLY if it is PROVEN-BY-USAGE — i.e. a named, real, released open-weight model or recipe (a tech report, an official model card, or a widely-used/well-cited community fine-tune) is documented as having actually trained on it. Every row below names that model/recipe and cites the source (model card, tech report, or lab blog post) that states the link directly — not inferred, not “this would probably be a good fit.”

Consequences enforced throughout:

No single-paper-only academic datasets. A dataset whose only appearance is its own authors’ academic ablation, with no third-party or even self-reported shipped model consuming it, is dropped. Each category file below has an explicit “Dropped” list — read it before re-proposing something that already failed the bar.
No cyber-specific datasets at all — this registry is scoped to GENERAL capabilities on purpose; cyber-specific data is out of scope here by design, not merely unfound.
Self-proof is weaker than cross-lab proof, and is flagged as such. A dataset proven only via its own authors’ reference model (e.g. Magpie-Reasoning-150K → the Magpie team’s own checkpoints) is marked Medium confidence even though the usage is real; a dataset reused by an independent lab or absorbed into a bigger shipped mixture (e.g. COIG-CQIA → folded into BAAI’s Infinity-Instruct recipe) earns High.
License ≠ proof-of-usage. Several rows below carry a real, verified training usage but a murky or non-commercial license (ShareGPT, ToolACE, COIG-CQIA, toxic-dpo-v0.2). The PROVEN_IN column tells you it was used; the License column tells you separately whether you can reuse it — don’t conflate them.

Every table row format is: dataset · link · capability · Sequence-B stage · size · license · PROVEN_IN (model/recipe) + citation · language(s) · notes · confidence. Source notes for all six categories, including live-verification method and the full per-dataset detail, are the six research files under artifacts/overnight-datasets/research/ (instruction-chat, tool-calling-agentic, preference-alignment, reasoning-cot, willingness-uncensor-safety, chinese-labs-multilingual) — this chapter distills them; consult those files for the complete citation quotes.

2. Instruction / chat SFT

The CPT→SFT rung: general instruction-following and multi-turn chat behavior — the base every other capability in this chapter sits on top of.

Dataset	Link	Seq-B stage	Size	License	Proven in (+ cite)	Confidence
Tulu 3 SFT Mixture	HF	SFT	939K	ODC-BY-1.0 (mixed per-subset)	Tülu 3 + OLMo 2 (7B/13B/32B) — arXiv:2411.15124, AI2 blog, OLMo 2 blog	High
OpenHermes 2.5	HF	SFT	~1M	mixed per-source	OpenHermes-2.5-Mistral-7B, Nous Hermes 2 series — dataset card states directly	High
UltraChat-200k	HF	SFT (stage 1 of SFT→DPO)	231K	MIT	Zephyr-7B-α/β — arXiv:2310.16944, alignment-handbook recipe	High
OpenOrca / SlimOrca	OpenOrca · SlimOrca	SFT	500K–4.2M	MIT	Mistral-7B-OpenOrca, OpenOrca-Platypus2-13B, Mistral-7B-SlimOrca — model card	High
ShareGPT (unofficial mirrors)	e.g. `anon8231489123/ShareGPT_Vicuna_unfiltered`	SFT	~70K+	unclear/gray (scraped, no official grant)	Vicuna-13B (LMSYS) — LMSYS blog	High (usage) / Low (license)
Dolphin (`cognitivecomputations/dolphin`)	HF	SFT	multi-million	Apache-2.0	dolphin-2.x series (2.5-mixtral-8x7b, 2.9-llama3-8b, 2.9.3-Yi-1.5-34B, …) — model cards list it directly	High
Infinity-Instruct	HF	CPT-adjacent → SFT	3M/7M (foundational) + chat	Apache-2.0 (gated)	BAAI InfInstruct-* family (Mistral-7B, Llama3-70B/8B, Yi-1.5-9B) — model cards state directly; arXiv:2506.11116	Medium
Magpie → Smol-Magpie-Ultra	Magpie-Align org · SmolTalk card	SFT	raw 10M+ / Smol-Magpie-Ultra 400K curated	CC-BY-NC-4.0 (raw) / Apache-2.0 (Smol variant)	SmolLM2-Instruct family — arXiv:2502.02737; technique also self-proven via Llama-3-8B-Magpie-Align (arXiv:2406.08464, ICLR 2025)	High (technique) / Medium (raw license)
LMSYS-Chat-1M (GPT-4 subset, as ingredient)	HF (gated)	SFT (ingredient)	1M total, subset used	custom gated license	OpenHermes 2.5 → Nous Hermes 2 — dataset card lists “ChatBot Arena (GPT-4 Only)” as a constituent source	Medium
WizardLM / Evol-Instruct	HF	SFT	196K (public V2)	unclear (GPT-ToS-derived)	WizardLM family (7B–70B) — arXiv:2304.12244; ancestor of WizardCoder, absorbed into Tulu-3-style compilations	High
No Robots	HF	SFT	10K	CC-BY-NC-4.0	Tülu 3 / OLMo 2 (named constituent); independently: `monsterapi/zephyr_7b_norobots`	High

Dropped: raw LMSYS-Chat-1M as a standalone primary SFT set (no lab found training directly+solely on the 1M dump for general chat SFT); raw Magpie-Align dumps as a standalone commercial-usable set (kept only the technique + the Apache-licensed Smol-Magpie-Ultra derivative).

3. Tool / function-calling & agentic trajectories

The SFT rung that teaches the model to use things — required regardless of cyber content, since the CTF agent’s core skill is calling tools correctly across multi-turn trajectories.

Dataset	Link	Seq-B stage	Size	License	Proven in (+ cite)	Confidence
Glaive-function-calling-v2	HF	SFT	112,960 rows	Apache-2.0	IBM Granite-20B-FunctionCalling — arXiv:2407.00121; also 256 HF models declare it as training data	High
Hermes-Function-Calling-v1 (NousResearch)	HF	SFT	tens of thousands	Apache-2.0	Hermes-2-Pro-Llama-3-8B / -Mistral-7B — dataset card; its `<tool_call>` tag format is now vLLM’s default hermes tool-parser and Qwen’s own recommended format	High
Salesforce APIGen / xLAM-function-calling-60k	HF (arXiv:2406.18518)	SFT	60,000	CC-BY-4.0	xLAM-1b-fc-r / xLAM-7b-fc-r (Salesforce) — ranked top-25 BFCL at release; NeurIPS 2024 D&B	High
APIGen-MT → xLAM-2 series	project (arXiv:2504.03601)	SFT	spans 1B–70B model family; APIGen-MT-5k public sample	research license (verify per artifact)	Salesforce xLAM-2-{1b..70b}-fc-r — model card states trained via this framework; SOTA on BFCL + τ-bench	High
ToolACE	HF (arXiv:2409.00920, ICLR 2025)	SFT	11,300 rows	Apache-2.0	ToolACE-8B + 51 total HF models trained/fine-tuned on it (Huawei)	High
ToolBench (OpenBMB) → ToolLLaMA	GitHub (arXiv:2307.16789, ICLR 2024 spotlight)	SFT	~127k instructions over 16,000+ real APIs	Apache-2.0	ToolLLaMA-7b-v1; feeds Agent-FLAN downstream	Medium-High (self-proven + downstream reuse)
Agent-FLAN	HF (arXiv:2403.12881, ACL 2024 Findings)	SFT	219 MB (AgentInstruct+ToolBench+ShareGPT mix)	Apache-2.0	InternLM Agent-FLAN-7B — dataset card states “+3.5% across agent eval datasets” over prior best	High
Gorilla / APIBench	HF (arXiv:2305.15334)	SFT	1,600+ APIs	Apache-2.0	Gorilla-7B family (UC Berkeley) — NeurIPS 2024 D&B track	High

Dropped: NexusRaven-V2 training data (only an eval set was released publicly); watt-tool-8B/70B (proprietary/undisclosed dataset, no citable artifact); a standalone Qwen tool-calling SFT row (Qwen never published one — its format adoption of Hermes-style tool use is credited to the Hermes row above instead).

Cross-cutting for Sequence-B: prioritize the multi-turn/agentic sets (ToolACE, APIGen-MT, Agent-FLAN, ToolBench) over single-call sets (xLAM-60k, APIBench, Glaive) — CTF tool use is inherently multi-step and stateful. Agent-FLAN’s explicit negative/anti-hallucination samples are the one artifact in this table that doubles as willingness/refusal-adjacent signal (see §7).

4. Preference (DPO / KTO / RLHF)

Dataset	Link	Seq-B stage	Size	License	Proven in (+ cite)	Pairwise/On-off-policy	Confidence
UltraFeedback (binarized)	HF	preference (DPO/RM)	64K prompts → ~61K pairs	MIT	Zephyr-7B-β — arXiv:2310.16944; also a Tulu 3 DPO-mixture component (arXiv:2411.15124)	Pairwise, off-policy	High
Anthropic HH-RLHF	HF	preference (RM/RLHF)	~170K comparisons	MIT	StableVicuna (Stability AI/CarperAI) — blog	Pairwise, off-policy	High
Stanford SHP / SHP-2	SHP · SHP-2	preference (RM/RLHF)	385K / 4.8M	MIT	StableVicuna (combined 3-dataset RM recipe) — same blog above	Pairwise, off-policy	Medium-High
Nectar	HF	preference (RM → RLAIF)	~183K prompts × 7-way ranking	Apache-2.0 (research)	Starling-RM-7B-alpha / Starling-LM-7B-alpha (Berkeley NEST) — blog	K-wise→pairwise, off-policy	High
HelpSteer2 (+v1)	HF	preference (attribute-scored RM)	~21K pairs (v2)	CC-BY-4.0	Llama-3.1-Nemotron-70B-Reward/-Instruct (NVIDIA), #1 RewardBench at release — arXiv:2406.08673	Multi-attribute → pairwise, off-policy	High
PKU-SafeRLHF	HF	preference (safe-RLHF dual reward+cost)	265K QA / 166.8K pref pairs	Apache-2.0 (framework) / non-commercial (data)	Beaver-7B (PKU-Alignment) — GitHub, arXiv:2406.15513	Pairwise (dual-label), off-policy	High
Argilla DPO mixes (`ultrafeedback-binarized-preferences-cleaned`, `distilabel-intel-orca-dpo-pairs`)	HF	preference (DPO)	64K / 12.9K	Apache-2.0	Notus-7B-v1, argilla/distilabeled-OpenHermes-2.5-Mistral-7B — model cards	Pairwise, off-policy	High
Skywork-Reward-Preference-80K (v0.2)	HF	preference (RM)	80K curated pairs	per-source (verify)	Skywork-Reward-Gemma-2-27B / -Llama-3.1-8B (v0.2) — #1 RewardBench; arXiv:2410.18451	Pairwise, mixed on/off-policy	High
KTO-mix-14k	HF	preference (KTO)	~15K rows	Apache-2.0	HF TRL’s `KTOTrainer` reference dataset; ~9 community checkpoints; Oumi’s `KtoMix14kDataset` recipe	Unpaired, off-policy	Medium (real usage, no flagship model pinned)

Dropped: no cyber-specific preference dataset was in scope; any preference set whose only usage was a single unrelated academic ablation was excluded before it reached the table.

5. Reasoning / CoT (math + code + science distillation)

Dataset	Link	Seq-B stage	Size	License	Proven in (+ cite)	Confidence
NuminaMath-CoT	HF	SFT	860K pairs	Apache-2.0 (code)	AI-MO/NuminaMath-7B-CoT/-TIR — 1st AIMO Progress Prize; also a stated source in Qwen2.5-Math (arXiv:2409.12122)	High
Sky-T1_data_17k	HF	SFT	17K	Apache-2.0	NovaSky-AI/Sky-T1-32B-Preview — blog	High
Bespoke-Stratos-17k	HF	SFT	17K	CC-BY-NC-4.0	Bespoke-Stratos-32B/-7B — blog	High
OpenThoughts-114k / OpenThoughts2-1M / OpenThoughts3-1.2M	114k · 2-1M · 3-1.2M	SFT	114K / 1M / 1.2M	Apache-2.0-style	OpenThinker-7B / -2-32B / -3-7B — arXiv:2506.04178; OpenThinker3-7B is SOTA-open-data at release (53% AIME25, 51% LCB, 54% GPQA-D)	High
OpenR1-Math-220k	HF	SFT (+DPO-usable)	220K curated	Apache-2.0	open-r1/OpenR1-Qwen-7B — Open-R1 update #2	High
Mixture-of-Thoughts	HF	SFT	350K verified traces	Apache-2.0	open-r1/OpenR1-Distill-7B — replicates R1-Distill-Qwen-7B; mixture ratio follows Phi-4-reasoning methodology	High
OpenMathInstruct-2	HF	SFT	14M pairs	CC-BY-4.0-style	OpenMath2-Llama3.1-8B/70B (NVIDIA) — ICLR 2025, arXiv:2410.01560	High
OpenMathReasoning	HF	SFT (+RL-prompts via problem-only split)	3.2M CoT + 1.7M TIR + 566K GenSelect	CC-BY-4.0-style	OpenMath-Nemotron-1.5B..32B — literal data behind NVIDIA’s AIMO-2-winning submission, arXiv:2504.16891	High
OpenCodeReasoning (+1.1)	HF	SFT	736K–1.165M	CC-BY-4.0-style	OpenCodeReasoning-Nemotron-7B/14B/32B/-1.1 — model cards state directly; arXiv:2504.01943; SFT-only beats RL alternatives on LiveCodeBench (61.8%)	High
Magpie-Reasoning-150K (V1)	HF	SFT	150K	Llama-3/Qwen2 community license	Llama-3-8B-Magpie-Align-SFT-v0.2 (Magpie’s own reference checkpoints) — arXiv:2406.08464, ICLR 2025	Medium (self-proven only)

Dropped: the raw DeepSeek-R1 “800k samples” SFT set was never released as a standalone artifact (only checkpoints were open-sourced) — the community reconstructions above are what’s actually usable and are what’s listed instead; AM-DeepSeek-R1-Distilled-1.4M (no verified third-party adoption beyond its own paper); any cyber-specific reasoning/CTF-reasoning dataset (out of scope by permanent stance).

Read order if you’re pulling data, not just reading the table: NuminaMath-CoT (substrate) → OpenThoughts3-1.2M (best current fully-open ablated set) → OpenMathReasoning (if you need competition-tier math) → OpenCodeReasoning (closest analog to CTF/exploit-reasoning) → Mixture-of-Thoughts/OpenR1-Math-220k (if reproducibility of the training recipe matters more than raw SOTA) → Bespoke-Stratos/Sky-T1 (cheapest pilot).

6. Willingness / uncensor vs. safety / refusal calibration

This category is both directions of refusal/compliance behavior — deliberately, because the target model is a sandboxed offensive-security research agent and over-refusal is a real, documented failure mode (see §8).

6a. Compliance-increasing (uncensor / de-refusal)

Dataset	Link	Seq-B stage	Size	License	Proven in (+ cite)	Confidence
Dolphin dataset family (incl. `not_samantha_norefusals.jsonl`)	HF	SFT	~4.5M base + multi-M Dolphin-2.9 mix	Apache-2.0	Dolphin model series (2.9.1-llama-3-8b, 2.9.2-Phi-3-Medium, 2.5-mixtral-8x7b, …) — Hartford’s post, axolotl configs list the file directly	High
`unalignment/toxic-dpo-v0.2`	HF	preference (DPO)	541 pairs	not permissively licensed; sensitive-content flagged	Component of `mlabonne/orpo-dpo-mix-40k`, used to DPO-“heal” NeuralDaredevil-8B-abliterated post weight-orthogonalization — dataset card, Labonne’s abliteration post	High
`ehartford/wizard_vicuna_70k_unfiltered`	HF	SFT	34,598 conversations	unclear (ShareGPT-derived)	Wizard-Vicuna-13B-Uncensored family (7B/13B/33B/65B) — model card states the alignment-stripping directly; 128 HF models trained on it	High

6b. Safety-increasing (harmlessness / appropriate refusal / anti-jailbreak / anti-over-refusal)

Dataset	Link	Seq-B stage	Size	License	Proven in (+ cite)	Confidence
`Anthropic/hh-rlhf`	HF	preference / RL-prompts	161K+ helpful + 42K+ harmless + red-team transcripts	research-use	StableVicuna-13B RM training + the original Anthropic RLHF paper (Bai et al. 2022) — blog	High
`PKU-Alignment/PKU-SafeRLHF`	HF	preference (safe-RLHF reward+cost)	83.4K entries	non-commercial	Beaver-7B-v1.0 — model card lists it directly, arXiv:2310.12773	High
`allenai/wildguardmix`	HF	SFT (`safety_noncompliance`)	50K prompts (Tulu-3 slice)	Apache-2.0	Tulu 3 — named `safety_noncompliance` bucket component; arXiv:2411.15124	High
`allenai/wildjailbreak`	HF	SFT (`safety_noncompliance`) + RL-prompts	262K total (50K sampled into Tulu 3)	ODC-BY-1.0	Tulu 3 — same bucket; explicitly designed to fix over-refusal via harmful-vs-benign-but-scary contrastive pairs	High
`allenai/coconot`	HF	SFT (`safety_noncompliance`)	10,983 prompts	ODC-BY-1.0	Tulu 3 — same bucket; Brahman et al. 2024, “The Art of Saying No”	High

Dropped: Nous Hermes 3’s compliance-steerability behavior is real but no single named public dataset is credited for it (proprietary blend); Do-Not-Answer/XSTest/OR-Bench are refusal-rate eval sets, not training sets, in any published recipe verified; Llama Guard’s training-data composition is not public.

7. Chinese-labs & multilingual

Dataset	Link	Capability	Seq-B stage	Size	License	Proven in (+ cite)	Confidence
BAAI Infinity-Instruct	HF	instruction/chat SFT	SFT	~7.4M found. + 1.5M chat	CC-BY-SA-4.0 (mixed subsets)	Own tech report fine-tunes Mistral/LLaMA/Qwen/Yi; InfInstruct-LLaMA3.1-70B beats GPT-4-0314 by 8.6% on IF — arXiv:2506.11116	High
COIG-CQIA	HF	Chinese instruction SFT (LIMA-style)	SFT	~48K (45.8K filtered)	CC-BY-NC-SA	Own paper fine-tunes Yi-6B/34B, Qwen2-7B/72B — arXiv:2403.18058; also folded into Infinity-Instruct (second, independent usage)	High
Firefly (`firefly-train-1.1M`)	HF	Chinese multi-task SFT	SFT (+DPO via toolkit)	~1.15M–1.65M	unspecified (verify)	firefly-mixtral-8x7b, firefly-baichuan2-13b, firefly-llama-30b — GitHub, 6.6K★	Medium-High
Magpie-Qwen2(.5)-Pro (+ `-200K-Chinese`)	HF	self-synthesized instruction/preference-pair SFT	SFT	1M (Qwen2-Pro) + per-model variants	research-use	Method paper matches Llama-3-8B-Instruct SFT-only — arXiv:2406.08464, ICLR 2025; ZH subset generated by Qwen2-72B-Instruct itself	High (method+EN) / Medium (ZH subset)
MAP-Neo Matrix Data Pile	HF	bilingual EN/ZH pretrain corpus	CPT (+SFT/alignment released alongside)	4.5–4.69T tokens	Apache-2.0	MAP-Neo-7B, pretrained from scratch, fully open — arXiv:2405.19327	High
OpenCSG Chinese Corpus (Chinese-Cosmopedia etc.)	HF	ZH synthetic textbook CPT + Smoltalk-style SFT	CPT + SFT	15M docs / ~60B tokens (Cosmopedia slice)	Apache-2.0	csg-wukong-1B (OpenCSG) — arXiv:2501.08197	Medium
Congliu/Chinese-DeepSeek-R1-Distill-data-110k	HF	ZH reasoning/CoT distillation	SFT	110K	not stated (research/community use)	Distilled via DeepSeek-R1-671B API per R1’s own protocol; 91 downstream community models trained on it	Medium-High
AgentInstruct (Zhipu/THUDM)	HF	agent/tool-use trajectories	SFT	1,866 verified trajectories	Apache-2.0-style	AgentLM-7B/13B/70B — arXiv:2310.12823, ACL 2024 Findings; independently reused by InternLM’s Agent-FLAN	High
LongAlign-10k (Zhipu/THUDM)	HF	long-context instruction alignment	SFT	10,000 (8K–64K tokens)	Apache-2.0-style	ChatGLM3-6B-128k, LongAlign-6B/7B/13B-64k — arXiv:2401.18058, EMNLP 2024 Findings	High
COIG-P	HF	Chinese preference (DPO)	preference	1,009K (paper) / ~101K (HF release)	CC-BY-NC-4.0	Own paper DPO-trains Qwen2.5-Instruct-7B-COIG-P, *Infinity-Instruct-3M--COIG-P** — arXiv:2504.05535, EACL 2026 Findings	Medium
huozi_rlhf_data	GitHub	Chinese human-labeled preference	preference	16.9K pairs	Apache-2.0	Huozi 2.0 (HIT-SCIR) — official RLHF stage of a named released model	Medium
Aya Dataset + Aya Collection (Cohere)	Dataset · Collection	massively multilingual instruction SFT	SFT	204K (Dataset, 65 langs) / 513M (Collection, 101 langs)	Apache-2.0 (verify per-subset)	Aya-101 → Aya 23 → Aya Expanse (Cohere/Cohere Labs) — arXiv:2402.06619	High

Dropped: zhihu_rlhf_3k and dikw/hh_rlhf_cn (only scattered small/unlabeled community reward-model usage, no named shipped recipe); a standalone first-party Qwen/DeepSeek/GLM/Yi SFT-or-preference-data row (none of those labs release their actual training data — their footprint here is captured indirectly via AgentInstruct/LongAlign, and via community R1-distillation sets like Congliu’s 110k).

8. Which datasets at which Sequence-B rung

Mapped onto Sequence B’s actual stage order (frontier-recipe-is-a-sequence.md §2): base/instruct choice → (optional) CPT → SFT cold-start → rejection-sampling/on-policy SFT → preference (DPO/KTO) → RLVR/GRPO → iterate.

Stage	What it needs	Pull from
CPT / domain pretraining (optional)	Bilingual/domain raw-text corpus, if extending beyond the base checkpoint’s native coverage	MAP-Neo Matrix (§7, EN/ZH), OpenCSG Chinese-Cosmopedia (§7); Infinity-Instruct’s “foundational” phase blurs CPT/SFT (§2, §7)
SFT cold-start — general chat/instruction	Broad instruction-following + multi-turn chat prior before anything domain-specific	Tulu 3 SFT Mixture, UltraChat-200k, OpenHermes 2.5, No Robots (§2)
SFT cold-start — tool/agentic	Multi-turn tool-call format + agentic trajectory shape	Hermes-Function-Calling-v1 (format anchor for vLLM’s hermes parser), ToolACE, APIGen-MT, Agent-FLAN, AgentInstruct (§3, §7)
SFT cold-start — reasoning prior	Long-CoT math/code/science distillation before any RL stage touches it	OpenThoughts3-1.2M (default pick), OpenMathReasoning (competition-tier), OpenCodeReasoning (code/CTF-adjacent) (§5)
SFT cold-start — willingness calibration	Refusal-boundary precision for a sandboxed offensive agent, not blanket compliance or blanket refusal	CoCoNot + WildJailbreak + WildGuardMix (the Tulu 3 `safety_noncompliance` bucket) as the base; Dolphin/wizard_vicuna_70k_unfiltered only if you deliberately want the de-refusal direction too (§6)
Preference (DPO/KTO)	Chosen/rejected pairs (or unpaired KTO-shaped labels) once an SFT checkpoint exists to regenerate on-policy	UltraFeedback, Skywork-Reward-Preference-80K, HelpSteer2 as off-policy starting mixes; KTO-mix-14k as the format template if your own trajectories are naturally unpaired good/bad, not clean pairs (§4)
RL-prompts (RLVR/GRPO)	Prompts + a verifier — deliberately NOT a fixed labeled dataset per method-to-data.md	WildJailbreak’s adversarial-prompt pool and OpenMathReasoning’s 193k problem-only split are the closest “prompt source, no answer” shapes in this registry; your own challenge set + flag-check verifier remains the actual RLVR data object for the cyber-specific rung (out of scope here)

9. Two honest caveats

Willingness vs. over-refusal, for a sandboxed offensive-security agent. Over-refusal — declining a legitimate pentest/CTF request because it pattern-matches “harmful” — is a real, well-documented failure mode, which is exactly what CoCoNot and WildJailbreak were purpose-built to fix (contrastive harmful-vs-benign-but-scary-looking prompts, §6b). That is the correct tool for this project’s actual problem: precision on the refusal boundary inside a controlled, sandboxed tool-use context — not blanket compliance. The de-refusal-direction datasets in §6a (Dolphin, toxic-dpo-v0.2, wizard_vicuna_70k_unfiltered) are included because they are genuinely proven-by-usage and instructive as the opposite pole of the same axis, but they are a blunter instrument (strip refusal-flavored completions wholesale) than CoCoNot’s calibrated “refuse the actually-unsafe subset, comply with the rest.” Default recommendation: build the willingness rung primarily from the Tulu-3 safety_noncompliance bucket (WildGuardMix + WildJailbreak + CoCoNot together, exactly as Tulu 3 ships it), and treat §6a as a reference recipe pattern rather than a default ingredient.

The off-policy caveat for distilled reasoning data. Nearly every dataset in §5 is CoT distilled from a stronger teacher (DeepSeek-R1, QwQ-32B, Llama-3.1/3.3-70B-Instruct) — the policy model you’d actually train on Sequence B never generated these traces itself. This is the correct, cheap way to bootstrap a reasoning prior before any on-policy stage (exactly what Sky-T1/Bespoke-Stratos/OpenThoughts/OpenR1 all do), but pure SFT-on-distillation caps quality at the teacher’s and plateaus — it does not substitute for an on-policy RL stage afterward. This mirrors the same on/off-policy axis this book treats as the single most load-bearing distinction in post-training generally (foundations/on-off-policy.md) and the same caveat the preference registry (§4) makes explicitly about off-policy DPO mixes: budget an on-policy regeneration-and-rescoring pass against your own SFT checkpoint before treating either the reasoning prior or the preference stage as final, mirroring what Tulu 3 and NVIDIA’s HelpSteer2-Preference both do in practice.

Cross-links

The recipe is a sequence, not a pick — the Sequence-A/Sequence-B stage skeleton this registry’s §8 mapping is built against.
The path to a frontier cybersecurity model — the north star this registry serves; the “proven-by-usage, not academic-only” stance is the dataset-side mirror of that chapter’s model/recipe-side stance.
Method → Data (your real bottleneck) — the method-first framing that explains why RLVR/GRPO in §8 deliberately has no fixed dataset row, and why rejection-sampling FT’s data object is a byproduct you already produce.
Full per-dataset detail, live-verification method, and additional dropped candidates: artifacts/overnight-datasets/research/{instruction-chat,tool-calling-agentic,preference-alignment, reasoning-cot,willingness-uncensor-safety,chinese-labs-multilingual}.md.

Keyboard shortcuts

Post-Training Field Notes