Proven post-training datasets — a usage-cited registry
Every other chapter in this book asks “which method” (The family map) or “which sequence of stages” (The recipe is a sequence, not a pick). This chapter answers the question those chapters leave open once you’ve picked Sequence B (§2 of the recipe chapter): which concrete, downloadable dataset actually fills each rung, for the six GENERAL (non-cyber) capabilities every Sequence-B rung needs regardless of what cyber-specific data you build on top — chat template, tool/function-calling, instruction following, preference, reasoning, and willingness/refusal calibration. This is a registry, not an essay: the tables are the payload.
1. The inclusion rule (read this before trusting any row)
This project’s permanent stance is never rely on academic cybersecurity-LLM projects — grounding comes from real frontier/open-weight recipes, not single-paper academic artifacts (frontier-cyber-model-path.md §1). This chapter applies the exact dataset-side analog of that stance:
A dataset gets a row ONLY if it is PROVEN-BY-USAGE — i.e. a named, real, released open-weight model or recipe (a tech report, an official model card, or a widely-used/well-cited community fine-tune) is documented as having actually trained on it. Every row below names that model/recipe and cites the source (model card, tech report, or lab blog post) that states the link directly — not inferred, not “this would probably be a good fit.”
Consequences enforced throughout:
- No single-paper-only academic datasets. A dataset whose only appearance is its own authors’ academic ablation, with no third-party or even self-reported shipped model consuming it, is dropped. Each category file below has an explicit “Dropped” list — read it before re-proposing something that already failed the bar.
- No cyber-specific datasets at all — this registry is scoped to GENERAL capabilities on purpose; cyber-specific data is out of scope here by design, not merely unfound.
- Self-proof is weaker than cross-lab proof, and is flagged as such. A dataset proven only via its own authors’ reference model (e.g. Magpie-Reasoning-150K → the Magpie team’s own checkpoints) is marked Medium confidence even though the usage is real; a dataset reused by an independent lab or absorbed into a bigger shipped mixture (e.g. COIG-CQIA → folded into BAAI’s Infinity-Instruct recipe) earns High.
- License ≠ proof-of-usage. Several rows below carry a real, verified training usage but a murky or non-commercial license (ShareGPT, ToolACE, COIG-CQIA, toxic-dpo-v0.2). The PROVEN_IN column tells you it was used; the License column tells you separately whether you can reuse it — don’t conflate them.
Every table row format is: dataset · link · capability · Sequence-B stage · size · license ·
PROVEN_IN (model/recipe) + citation · language(s) · notes · confidence. Source notes for all six
categories, including live-verification method and the full per-dataset detail, are the six research
files under artifacts/overnight-datasets/research/ (instruction-chat, tool-calling-agentic,
preference-alignment, reasoning-cot, willingness-uncensor-safety, chinese-labs-multilingual) — this
chapter distills them; consult those files for the complete citation quotes.
2. Instruction / chat SFT
The CPT→SFT rung: general instruction-following and multi-turn chat behavior — the base every other capability in this chapter sits on top of.
| Dataset | Link | Seq-B stage | Size | License | Proven in (+ cite) | Confidence |
|---|---|---|---|---|---|---|
| Tulu 3 SFT Mixture | HF | SFT | 939K | ODC-BY-1.0 (mixed per-subset) | Tülu 3 + OLMo 2 (7B/13B/32B) — arXiv:2411.15124, AI2 blog, OLMo 2 blog | High |
| OpenHermes 2.5 | HF | SFT | ~1M | mixed per-source | OpenHermes-2.5-Mistral-7B, Nous Hermes 2 series — dataset card states directly | High |
| UltraChat-200k | HF | SFT (stage 1 of SFT→DPO) | 231K | MIT | Zephyr-7B-α/β — arXiv:2310.16944, alignment-handbook recipe | High |
| OpenOrca / SlimOrca | OpenOrca · SlimOrca | SFT | 500K–4.2M | MIT | Mistral-7B-OpenOrca, OpenOrca-Platypus2-13B, Mistral-7B-SlimOrca — model card | High |
| ShareGPT (unofficial mirrors) | e.g. anon8231489123/ShareGPT_Vicuna_unfiltered | SFT | ~70K+ | unclear/gray (scraped, no official grant) | Vicuna-13B (LMSYS) — LMSYS blog | High (usage) / Low (license) |
Dolphin (cognitivecomputations/dolphin) | HF | SFT | multi-million | Apache-2.0 | dolphin-2.x series (2.5-mixtral-8x7b, 2.9-llama3-8b, 2.9.3-Yi-1.5-34B, …) — model cards list it directly | High |
| Infinity-Instruct | HF | CPT-adjacent → SFT | 3M/7M (foundational) + chat | Apache-2.0 (gated) | BAAI InfInstruct-* family (Mistral-7B, Llama3-70B/8B, Yi-1.5-9B) — model cards state directly; arXiv:2506.11116 | Medium |
| Magpie → Smol-Magpie-Ultra | Magpie-Align org · SmolTalk card | SFT | raw 10M+ / Smol-Magpie-Ultra 400K curated | CC-BY-NC-4.0 (raw) / Apache-2.0 (Smol variant) | SmolLM2-Instruct family — arXiv:2502.02737; technique also self-proven via Llama-3-8B-Magpie-Align (arXiv:2406.08464, ICLR 2025) | High (technique) / Medium (raw license) |
| LMSYS-Chat-1M (GPT-4 subset, as ingredient) | HF (gated) | SFT (ingredient) | 1M total, subset used | custom gated license | OpenHermes 2.5 → Nous Hermes 2 — dataset card lists “ChatBot Arena (GPT-4 Only)” as a constituent source | Medium |
| WizardLM / Evol-Instruct | HF | SFT | 196K (public V2) | unclear (GPT-ToS-derived) | WizardLM family (7B–70B) — arXiv:2304.12244; ancestor of WizardCoder, absorbed into Tulu-3-style compilations | High |
| No Robots | HF | SFT | 10K | CC-BY-NC-4.0 | Tülu 3 / OLMo 2 (named constituent); independently: monsterapi/zephyr_7b_norobots | High |
Dropped: raw LMSYS-Chat-1M as a standalone primary SFT set (no lab found training directly+solely on the 1M dump for general chat SFT); raw Magpie-Align dumps as a standalone commercial-usable set (kept only the technique + the Apache-licensed Smol-Magpie-Ultra derivative).
3. Tool / function-calling & agentic trajectories
The SFT rung that teaches the model to use things — required regardless of cyber content, since the CTF agent’s core skill is calling tools correctly across multi-turn trajectories.
| Dataset | Link | Seq-B stage | Size | License | Proven in (+ cite) | Confidence |
|---|---|---|---|---|---|---|
| Glaive-function-calling-v2 | HF | SFT | 112,960 rows | Apache-2.0 | IBM Granite-20B-FunctionCalling — arXiv:2407.00121; also 256 HF models declare it as training data | High |
| Hermes-Function-Calling-v1 (NousResearch) | HF | SFT | tens of thousands | Apache-2.0 | Hermes-2-Pro-Llama-3-8B / -Mistral-7B — dataset card; its <tool_call> tag format is now vLLM’s default hermes tool-parser and Qwen’s own recommended format | High |
| Salesforce APIGen / xLAM-function-calling-60k | HF (arXiv:2406.18518) | SFT | 60,000 | CC-BY-4.0 | xLAM-1b-fc-r / xLAM-7b-fc-r (Salesforce) — ranked top-25 BFCL at release; NeurIPS 2024 D&B | High |
| APIGen-MT → xLAM-2 series | project (arXiv:2504.03601) | SFT | spans 1B–70B model family; APIGen-MT-5k public sample | research license (verify per artifact) | Salesforce xLAM-2-{1b..70b}-fc-r — model card states trained via this framework; SOTA on BFCL + τ-bench | High |
| ToolACE | HF (arXiv:2409.00920, ICLR 2025) | SFT | 11,300 rows | Apache-2.0 | ToolACE-8B + 51 total HF models trained/fine-tuned on it (Huawei) | High |
| ToolBench (OpenBMB) → ToolLLaMA | GitHub (arXiv:2307.16789, ICLR 2024 spotlight) | SFT | ~127k instructions over 16,000+ real APIs | Apache-2.0 | ToolLLaMA-7b-v1; feeds Agent-FLAN downstream | Medium-High (self-proven + downstream reuse) |
| Agent-FLAN | HF (arXiv:2403.12881, ACL 2024 Findings) | SFT | 219 MB (AgentInstruct+ToolBench+ShareGPT mix) | Apache-2.0 | InternLM Agent-FLAN-7B — dataset card states “+3.5% across agent eval datasets” over prior best | High |
| Gorilla / APIBench | HF (arXiv:2305.15334) | SFT | 1,600+ APIs | Apache-2.0 | Gorilla-7B family (UC Berkeley) — NeurIPS 2024 D&B track | High |
Dropped: NexusRaven-V2 training data (only an eval set was released publicly); watt-tool-8B/70B (proprietary/undisclosed dataset, no citable artifact); a standalone Qwen tool-calling SFT row (Qwen never published one — its format adoption of Hermes-style tool use is credited to the Hermes row above instead).
Cross-cutting for Sequence-B: prioritize the multi-turn/agentic sets (ToolACE, APIGen-MT, Agent-FLAN, ToolBench) over single-call sets (xLAM-60k, APIBench, Glaive) — CTF tool use is inherently multi-step and stateful. Agent-FLAN’s explicit negative/anti-hallucination samples are the one artifact in this table that doubles as willingness/refusal-adjacent signal (see §7).
4. Preference (DPO / KTO / RLHF)
| Dataset | Link | Seq-B stage | Size | License | Proven in (+ cite) | Pairwise/On-off-policy | Confidence |
|---|---|---|---|---|---|---|---|
| UltraFeedback (binarized) | HF | preference (DPO/RM) | 64K prompts → ~61K pairs | MIT | Zephyr-7B-β — arXiv:2310.16944; also a Tulu 3 DPO-mixture component (arXiv:2411.15124) | Pairwise, off-policy | High |
| Anthropic HH-RLHF | HF | preference (RM/RLHF) | ~170K comparisons | MIT | StableVicuna (Stability AI/CarperAI) — blog | Pairwise, off-policy | High |
| Stanford SHP / SHP-2 | SHP · SHP-2 | preference (RM/RLHF) | 385K / 4.8M | MIT | StableVicuna (combined 3-dataset RM recipe) — same blog above | Pairwise, off-policy | Medium-High |
| Nectar | HF | preference (RM → RLAIF) | ~183K prompts × 7-way ranking | Apache-2.0 (research) | Starling-RM-7B-alpha / Starling-LM-7B-alpha (Berkeley NEST) — blog | K-wise→pairwise, off-policy | High |
| HelpSteer2 (+v1) | HF | preference (attribute-scored RM) | ~21K pairs (v2) | CC-BY-4.0 | Llama-3.1-Nemotron-70B-Reward/-Instruct (NVIDIA), #1 RewardBench at release — arXiv:2406.08673 | Multi-attribute → pairwise, off-policy | High |
| PKU-SafeRLHF | HF | preference (safe-RLHF dual reward+cost) | 265K QA / 166.8K pref pairs | Apache-2.0 (framework) / non-commercial (data) | Beaver-7B (PKU-Alignment) — GitHub, arXiv:2406.15513 | Pairwise (dual-label), off-policy | High |
Argilla DPO mixes (ultrafeedback-binarized-preferences-cleaned, distilabel-intel-orca-dpo-pairs) | HF | preference (DPO) | 64K / 12.9K | Apache-2.0 | Notus-7B-v1, argilla/distilabeled-OpenHermes-2.5-Mistral-7B — model cards | Pairwise, off-policy | High |
| Skywork-Reward-Preference-80K (v0.2) | HF | preference (RM) | 80K curated pairs | per-source (verify) | Skywork-Reward-Gemma-2-27B / -Llama-3.1-8B (v0.2) — #1 RewardBench; arXiv:2410.18451 | Pairwise, mixed on/off-policy | High |
| KTO-mix-14k | HF | preference (KTO) | ~15K rows | Apache-2.0 | HF TRL’s KTOTrainer reference dataset; ~9 community checkpoints; Oumi’s KtoMix14kDataset recipe | Unpaired, off-policy | Medium (real usage, no flagship model pinned) |
Dropped: no cyber-specific preference dataset was in scope; any preference set whose only usage was a single unrelated academic ablation was excluded before it reached the table.
5. Reasoning / CoT (math + code + science distillation)
| Dataset | Link | Seq-B stage | Size | License | Proven in (+ cite) | Confidence |
|---|---|---|---|---|---|---|
| NuminaMath-CoT | HF | SFT | 860K pairs | Apache-2.0 (code) | AI-MO/NuminaMath-7B-CoT/-TIR — 1st AIMO Progress Prize; also a stated source in Qwen2.5-Math (arXiv:2409.12122) | High |
| Sky-T1_data_17k | HF | SFT | 17K | Apache-2.0 | NovaSky-AI/Sky-T1-32B-Preview — blog | High |
| Bespoke-Stratos-17k | HF | SFT | 17K | CC-BY-NC-4.0 | Bespoke-Stratos-32B/-7B — blog | High |
| OpenThoughts-114k / OpenThoughts2-1M / OpenThoughts3-1.2M | 114k · 2-1M · 3-1.2M | SFT | 114K / 1M / 1.2M | Apache-2.0-style | OpenThinker-7B / -2-32B / -3-7B — arXiv:2506.04178; OpenThinker3-7B is SOTA-open-data at release (53% AIME25, 51% LCB, 54% GPQA-D) | High |
| OpenR1-Math-220k | HF | SFT (+DPO-usable) | 220K curated | Apache-2.0 | open-r1/OpenR1-Qwen-7B — Open-R1 update #2 | High |
| Mixture-of-Thoughts | HF | SFT | 350K verified traces | Apache-2.0 | open-r1/OpenR1-Distill-7B — replicates R1-Distill-Qwen-7B; mixture ratio follows Phi-4-reasoning methodology | High |
| OpenMathInstruct-2 | HF | SFT | 14M pairs | CC-BY-4.0-style | OpenMath2-Llama3.1-8B/70B (NVIDIA) — ICLR 2025, arXiv:2410.01560 | High |
| OpenMathReasoning | HF | SFT (+RL-prompts via problem-only split) | 3.2M CoT + 1.7M TIR + 566K GenSelect | CC-BY-4.0-style | OpenMath-Nemotron-1.5B..32B — literal data behind NVIDIA’s AIMO-2-winning submission, arXiv:2504.16891 | High |
| OpenCodeReasoning (+1.1) | HF | SFT | 736K–1.165M | CC-BY-4.0-style | OpenCodeReasoning-Nemotron-7B/14B/32B/-1.1 — model cards state directly; arXiv:2504.01943; SFT-only beats RL alternatives on LiveCodeBench (61.8%) | High |
| Magpie-Reasoning-150K (V1) | HF | SFT | 150K | Llama-3/Qwen2 community license | Llama-3-8B-Magpie-Align-SFT-v0.2 (Magpie’s own reference checkpoints) — arXiv:2406.08464, ICLR 2025 | Medium (self-proven only) |
Dropped: the raw DeepSeek-R1 “800k samples” SFT set was never released as a standalone artifact (only checkpoints were open-sourced) — the community reconstructions above are what’s actually usable and are what’s listed instead; AM-DeepSeek-R1-Distilled-1.4M (no verified third-party adoption beyond its own paper); any cyber-specific reasoning/CTF-reasoning dataset (out of scope by permanent stance).
Read order if you’re pulling data, not just reading the table: NuminaMath-CoT (substrate) → OpenThoughts3-1.2M (best current fully-open ablated set) → OpenMathReasoning (if you need competition-tier math) → OpenCodeReasoning (closest analog to CTF/exploit-reasoning) → Mixture-of-Thoughts/OpenR1-Math-220k (if reproducibility of the training recipe matters more than raw SOTA) → Bespoke-Stratos/Sky-T1 (cheapest pilot).
6. Willingness / uncensor vs. safety / refusal calibration
This category is both directions of refusal/compliance behavior — deliberately, because the target model is a sandboxed offensive-security research agent and over-refusal is a real, documented failure mode (see §8).
6a. Compliance-increasing (uncensor / de-refusal)
| Dataset | Link | Seq-B stage | Size | License | Proven in (+ cite) | Confidence |
|---|---|---|---|---|---|---|
Dolphin dataset family (incl. not_samantha_norefusals.jsonl) | HF | SFT | ~4.5M base + multi-M Dolphin-2.9 mix | Apache-2.0 | Dolphin model series (2.9.1-llama-3-8b, 2.9.2-Phi-3-Medium, 2.5-mixtral-8x7b, …) — Hartford’s post, axolotl configs list the file directly | High |
unalignment/toxic-dpo-v0.2 | HF | preference (DPO) | 541 pairs | not permissively licensed; sensitive-content flagged | Component of mlabonne/orpo-dpo-mix-40k, used to DPO-“heal” NeuralDaredevil-8B-abliterated post weight-orthogonalization — dataset card, Labonne’s abliteration post | High |
ehartford/wizard_vicuna_70k_unfiltered | HF | SFT | 34,598 conversations | unclear (ShareGPT-derived) | Wizard-Vicuna-13B-Uncensored family (7B/13B/33B/65B) — model card states the alignment-stripping directly; 128 HF models trained on it | High |
6b. Safety-increasing (harmlessness / appropriate refusal / anti-jailbreak / anti-over-refusal)
| Dataset | Link | Seq-B stage | Size | License | Proven in (+ cite) | Confidence |
|---|---|---|---|---|---|---|
Anthropic/hh-rlhf | HF | preference / RL-prompts | 161K+ helpful + 42K+ harmless + red-team transcripts | research-use | StableVicuna-13B RM training + the original Anthropic RLHF paper (Bai et al. 2022) — blog | High |
PKU-Alignment/PKU-SafeRLHF | HF | preference (safe-RLHF reward+cost) | 83.4K entries | non-commercial | Beaver-7B-v1.0 — model card lists it directly, arXiv:2310.12773 | High |
allenai/wildguardmix | HF | SFT (safety_noncompliance) | 50K prompts (Tulu-3 slice) | Apache-2.0 | Tulu 3 — named safety_noncompliance bucket component; arXiv:2411.15124 | High |
allenai/wildjailbreak | HF | SFT (safety_noncompliance) + RL-prompts | 262K total (50K sampled into Tulu 3) | ODC-BY-1.0 | Tulu 3 — same bucket; explicitly designed to fix over-refusal via harmful-vs-benign-but-scary contrastive pairs | High |
allenai/coconot | HF | SFT (safety_noncompliance) | 10,983 prompts | ODC-BY-1.0 | Tulu 3 — same bucket; Brahman et al. 2024, “The Art of Saying No” | High |
Dropped: Nous Hermes 3’s compliance-steerability behavior is real but no single named public dataset is credited for it (proprietary blend); Do-Not-Answer/XSTest/OR-Bench are refusal-rate eval sets, not training sets, in any published recipe verified; Llama Guard’s training-data composition is not public.
7. Chinese-labs & multilingual
| Dataset | Link | Capability | Seq-B stage | Size | License | Proven in (+ cite) | Confidence |
|---|---|---|---|---|---|---|---|
| BAAI Infinity-Instruct | HF | instruction/chat SFT | SFT | ~7.4M found. + 1.5M chat | CC-BY-SA-4.0 (mixed subsets) | Own tech report fine-tunes Mistral/LLaMA/Qwen/Yi; InfInstruct-LLaMA3.1-70B beats GPT-4-0314 by 8.6% on IF — arXiv:2506.11116 | High |
| COIG-CQIA | HF | Chinese instruction SFT (LIMA-style) | SFT | ~48K (45.8K filtered) | CC-BY-NC-SA | Own paper fine-tunes Yi-6B/34B, Qwen2-7B/72B — arXiv:2403.18058; also folded into Infinity-Instruct (second, independent usage) | High |
Firefly (firefly-train-1.1M) | HF | Chinese multi-task SFT | SFT (+DPO via toolkit) | ~1.15M–1.65M | unspecified (verify) | firefly-mixtral-8x7b, firefly-baichuan2-13b, firefly-llama-30b — GitHub, 6.6K★ | Medium-High |
Magpie-Qwen2(.5)-Pro (+ -200K-Chinese) | HF | self-synthesized instruction/preference-pair SFT | SFT | 1M (Qwen2-Pro) + per-model variants | research-use | Method paper matches Llama-3-8B-Instruct SFT-only — arXiv:2406.08464, ICLR 2025; ZH subset generated by Qwen2-72B-Instruct itself | High (method+EN) / Medium (ZH subset) |
| MAP-Neo Matrix Data Pile | HF | bilingual EN/ZH pretrain corpus | CPT (+SFT/alignment released alongside) | 4.5–4.69T tokens | Apache-2.0 | MAP-Neo-7B, pretrained from scratch, fully open — arXiv:2405.19327 | High |
| OpenCSG Chinese Corpus (Chinese-Cosmopedia etc.) | HF | ZH synthetic textbook CPT + Smoltalk-style SFT | CPT + SFT | 15M docs / ~60B tokens (Cosmopedia slice) | Apache-2.0 | csg-wukong-1B (OpenCSG) — arXiv:2501.08197 | Medium |
| Congliu/Chinese-DeepSeek-R1-Distill-data-110k | HF | ZH reasoning/CoT distillation | SFT | 110K | not stated (research/community use) | Distilled via DeepSeek-R1-671B API per R1’s own protocol; 91 downstream community models trained on it | Medium-High |
| AgentInstruct (Zhipu/THUDM) | HF | agent/tool-use trajectories | SFT | 1,866 verified trajectories | Apache-2.0-style | AgentLM-7B/13B/70B — arXiv:2310.12823, ACL 2024 Findings; independently reused by InternLM’s Agent-FLAN | High |
| LongAlign-10k (Zhipu/THUDM) | HF | long-context instruction alignment | SFT | 10,000 (8K–64K tokens) | Apache-2.0-style | ChatGLM3-6B-128k, LongAlign-6B/7B/13B-64k — arXiv:2401.18058, EMNLP 2024 Findings | High |
| COIG-P | HF | Chinese preference (DPO) | preference | 1,009K (paper) / ~101K (HF release) | CC-BY-NC-4.0 | Own paper DPO-trains Qwen2.5-Instruct-7B-COIG-P, Infinity-Instruct-3M-*-COIG-P — arXiv:2504.05535, EACL 2026 Findings | Medium |
| huozi_rlhf_data | GitHub | Chinese human-labeled preference | preference | 16.9K pairs | Apache-2.0 | Huozi 2.0 (HIT-SCIR) — official RLHF stage of a named released model | Medium |
| Aya Dataset + Aya Collection (Cohere) | Dataset · Collection | massively multilingual instruction SFT | SFT | 204K (Dataset, 65 langs) / 513M (Collection, 101 langs) | Apache-2.0 (verify per-subset) | Aya-101 → Aya 23 → Aya Expanse (Cohere/Cohere Labs) — arXiv:2402.06619 | High |
Dropped: zhihu_rlhf_3k and dikw/hh_rlhf_cn (only scattered small/unlabeled community reward-model
usage, no named shipped recipe); a standalone first-party Qwen/DeepSeek/GLM/Yi SFT-or-preference-data row
(none of those labs release their actual training data — their footprint here is captured indirectly via
AgentInstruct/LongAlign, and via community R1-distillation sets like Congliu’s 110k).
8. Which datasets at which Sequence-B rung
Mapped onto Sequence B’s actual stage order (frontier-recipe-is-a-sequence.md §2): base/instruct choice → (optional) CPT → SFT cold-start → rejection-sampling/on-policy SFT → preference (DPO/KTO) → RLVR/GRPO → iterate.
| Stage | What it needs | Pull from |
|---|---|---|
| CPT / domain pretraining (optional) | Bilingual/domain raw-text corpus, if extending beyond the base checkpoint’s native coverage | MAP-Neo Matrix (§7, EN/ZH), OpenCSG Chinese-Cosmopedia (§7); Infinity-Instruct’s “foundational” phase blurs CPT/SFT (§2, §7) |
| SFT cold-start — general chat/instruction | Broad instruction-following + multi-turn chat prior before anything domain-specific | Tulu 3 SFT Mixture, UltraChat-200k, OpenHermes 2.5, No Robots (§2) |
| SFT cold-start — tool/agentic | Multi-turn tool-call format + agentic trajectory shape | Hermes-Function-Calling-v1 (format anchor for vLLM’s hermes parser), ToolACE, APIGen-MT, Agent-FLAN, AgentInstruct (§3, §7) |
| SFT cold-start — reasoning prior | Long-CoT math/code/science distillation before any RL stage touches it | OpenThoughts3-1.2M (default pick), OpenMathReasoning (competition-tier), OpenCodeReasoning (code/CTF-adjacent) (§5) |
| SFT cold-start — willingness calibration | Refusal-boundary precision for a sandboxed offensive agent, not blanket compliance or blanket refusal | CoCoNot + WildJailbreak + WildGuardMix (the Tulu 3 safety_noncompliance bucket) as the base; Dolphin/wizard_vicuna_70k_unfiltered only if you deliberately want the de-refusal direction too (§6) |
| Preference (DPO/KTO) | Chosen/rejected pairs (or unpaired KTO-shaped labels) once an SFT checkpoint exists to regenerate on-policy | UltraFeedback, Skywork-Reward-Preference-80K, HelpSteer2 as off-policy starting mixes; KTO-mix-14k as the format template if your own trajectories are naturally unpaired good/bad, not clean pairs (§4) |
| RL-prompts (RLVR/GRPO) | Prompts + a verifier — deliberately NOT a fixed labeled dataset per method-to-data.md | WildJailbreak’s adversarial-prompt pool and OpenMathReasoning’s 193k problem-only split are the closest “prompt source, no answer” shapes in this registry; your own challenge set + flag-check verifier remains the actual RLVR data object for the cyber-specific rung (out of scope here) |
9. Two honest caveats
Willingness vs. over-refusal, for a sandboxed offensive-security agent. Over-refusal — declining a
legitimate pentest/CTF request because it pattern-matches “harmful” — is a real, well-documented failure
mode, which is exactly what CoCoNot and WildJailbreak were purpose-built to fix (contrastive
harmful-vs-benign-but-scary-looking prompts, §6b).
That is the correct tool for this project’s actual problem: precision on the refusal boundary inside a
controlled, sandboxed tool-use context — not blanket compliance. The de-refusal-direction datasets in
§6a (Dolphin, toxic-dpo-v0.2, wizard_vicuna_70k_unfiltered) are included because they are genuinely
proven-by-usage and instructive as the opposite pole of the same axis, but they are a blunter instrument
(strip refusal-flavored completions wholesale) than CoCoNot’s calibrated “refuse the actually-unsafe
subset, comply with the rest.” Default recommendation: build the willingness rung primarily from the
Tulu-3 safety_noncompliance bucket (WildGuardMix + WildJailbreak + CoCoNot together, exactly as Tulu 3
ships it), and treat §6a as a reference recipe pattern rather than a default ingredient.
The off-policy caveat for distilled reasoning data. Nearly every dataset in §5 is CoT distilled from a stronger teacher (DeepSeek-R1, QwQ-32B, Llama-3.1/3.3-70B-Instruct) — the policy model you’d actually train on Sequence B never generated these traces itself. This is the correct, cheap way to bootstrap a reasoning prior before any on-policy stage (exactly what Sky-T1/Bespoke-Stratos/OpenThoughts/OpenR1 all do), but pure SFT-on-distillation caps quality at the teacher’s and plateaus — it does not substitute for an on-policy RL stage afterward. This mirrors the same on/off-policy axis this book treats as the single most load-bearing distinction in post-training generally (foundations/on-off-policy.md) and the same caveat the preference registry (§4) makes explicitly about off-policy DPO mixes: budget an on-policy regeneration-and-rescoring pass against your own SFT checkpoint before treating either the reasoning prior or the preference stage as final, mirroring what Tulu 3 and NVIDIA’s HelpSteer2-Preference both do in practice.
Cross-links
- The recipe is a sequence, not a pick — the Sequence-A/Sequence-B stage skeleton this registry’s §8 mapping is built against.
- The path to a frontier cybersecurity model — the north star this registry serves; the “proven-by-usage, not academic-only” stance is the dataset-side mirror of that chapter’s model/recipe-side stance.
- Method → Data (your real bottleneck) — the method-first framing that explains why RLVR/GRPO in §8 deliberately has no fixed dataset row, and why rejection-sampling FT’s data object is a byproduct you already produce.
- Full per-dataset detail, live-verification method, and additional dropped candidates:
artifacts/overnight-datasets/research/{instruction-chat,tool-calling-agentic,preference-alignment, reasoning-cot,willingness-uncensor-safety,chinese-labs-multilingual}.md.