Proven post-training datasets — a usage-cited registry

Every other chapter in this book asks “which method” (The family map) or “which sequence of stages” (The recipe is a sequence, not a pick). This chapter answers the question those chapters leave open once you’ve picked Sequence B (§2 of the recipe chapter): which concrete, downloadable dataset actually fills each rung, for the six GENERAL (non-cyber) capabilities every Sequence-B rung needs regardless of what cyber-specific data you build on top — chat template, tool/function-calling, instruction following, preference, reasoning, and willingness/refusal calibration. This is a registry, not an essay: the tables are the payload.

1. The inclusion rule (read this before trusting any row)

This project’s permanent stance is to never rely on academic cybersecurity-LLM projects: grounding comes from real frontier/open-weight recipes, not single-paper academic artifacts (frontier-cyber-model-path.md §1). This chapter applies the exact dataset-side analog of that stance:

A dataset gets a row ONLY if it is PROVEN-BY-USAGE — i.e. a named, real, released open-weight model or recipe (a tech report, an official model card, or a widely-used/well-cited community fine-tune) is documented as having actually trained on it. Every row below names that model/recipe and cites the source (model card, tech report, or lab blog post) that states the link directly — not inferred, not “this would probably be a good fit.”

Consequences enforced throughout:

  • No single-paper-only academic datasets. A dataset whose only appearance is its own authors’ academic ablation, with no third-party or even self-reported shipped model consuming it, is dropped. Each category section below has an explicit “Dropped” list; read it before re-proposing something that already failed the bar.
  • No cyber-specific datasets at all — this registry is scoped to GENERAL capabilities on purpose; cyber-specific data is out of scope here by design, not merely unfound.
  • Self-proof is weaker than cross-lab proof, and is flagged as such. A dataset proven only via its own authors’ reference model (e.g. Magpie-Reasoning-150K → the Magpie team’s own checkpoints) is marked Medium confidence even though the usage is real; a dataset reused by an independent lab or absorbed into a bigger shipped mixture (e.g. COIG-CQIA → folded into BAAI’s Infinity-Instruct recipe) earns High.
  • License ≠ proof-of-usage. Several rows below carry a real, verified training usage but a murky or non-commercial license (ShareGPT, ToolACE, COIG-CQIA, toxic-dpo-v0.2). The PROVEN_IN column tells you it was used; the License column tells you separately whether you can reuse it — don’t conflate them.

Every table row follows the same format: dataset · link · capability · Sequence-B stage · size · license · PROVEN_IN (model/recipe) + citation · language(s) · notes · confidence. Source notes for all six categories, including the live-verification method and the full per-dataset detail, are the six research files under artifacts/overnight-datasets/research/ (instruction-chat, tool-calling-agentic, preference-alignment, reasoning-cot, willingness-uncensor-safety, chinese-labs-multilingual). This chapter distills them; consult those files for the complete citation quotes.
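The inclusion rule can be sketched as a small data structure. This is a minimal illustration, not the registry's actual tooling: the `RegistryRow` fields, the `Proof` enum, and the `confidence` function are hypothetical names, and the function encodes only the rule as stated above (the tables grade some rows with more nuance).

```python
from dataclasses import dataclass
from enum import Enum


class Proof(Enum):
    NONE = "none"            # only the authors' own academic ablation
    SELF = "self"            # authors' own shipped reference model
    CROSS_LAB = "cross_lab"  # independent lab, or absorbed into a bigger shipped mixture


@dataclass(frozen=True)
class RegistryRow:
    dataset: str
    stage: str
    license: str      # reuse rights: tracked separately from proof of usage
    proven_in: str    # the named model/recipe that trained on the dataset
    proof: Proof
    cyber_specific: bool = False


def confidence(row):
    """Return 'High', 'Medium', or None when the row fails the inclusion bar."""
    if row.cyber_specific or row.proof is Proof.NONE:
        return None  # dropped: out of scope or not proven by usage
    return "High" if row.proof is Proof.CROSS_LAB else "Medium"


magpie = RegistryRow(
    dataset="Magpie-Reasoning-150K",
    stage="SFT",
    license="Llama-3/Qwen2 community license",
    proven_in="Llama-3-8B-Magpie-Align-SFT-v0.2",
    proof=Proof.SELF,
)
print(confidence(magpie))
```

Keeping `license` as its own field, with no influence on `confidence`, mirrors the "License ≠ proof-of-usage" point: a row can be High on usage and still be unusable commercially.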


2. Instruction / chat SFT

The CPT→SFT rung: general instruction-following and multi-turn chat behavior — the base every other capability in this chapter sits on top of.

| Dataset | Link | Seq-B stage | Size | License | Proven in (+ cite) | Confidence |
|---|---|---|---|---|---|---|
| Tulu 3 SFT Mixture | HF | SFT | 939K | ODC-BY-1.0 (mixed per-subset) | Tülu 3 + OLMo 2 (7B/13B/32B); arXiv:2411.15124, AI2 blog, OLMo 2 blog | High |
| OpenHermes 2.5 | HF | SFT | ~1M | mixed per-source | OpenHermes-2.5-Mistral-7B, Nous Hermes 2 series; dataset card states directly | High |
| UltraChat-200k | HF | SFT (stage 1 of SFT→DPO) | 231K | MIT | Zephyr-7B-α/β; arXiv:2310.16944, alignment-handbook recipe | High |
| OpenOrca / SlimOrca | OpenOrca · SlimOrca | SFT | 500K–4.2M | MIT | Mistral-7B-OpenOrca, OpenOrca-Platypus2-13B, Mistral-7B-SlimOrca; model card | High |
| ShareGPT (unofficial mirrors) | e.g. anon8231489123/ShareGPT_Vicuna_unfiltered | SFT | ~70K+ | unclear/gray (scraped, no official grant) | Vicuna-13B (LMSYS); LMSYS blog | High (usage) / Low (license) |
| Dolphin (cognitivecomputations/dolphin) | HF | SFT | multi-million | Apache-2.0 | dolphin-2.x series (2.5-mixtral-8x7b, 2.9-llama3-8b, 2.9.3-Yi-1.5-34B, …); model cards list it directly | High |
| Infinity-Instruct | HF | CPT-adjacent → SFT | 3M/7M (foundational) + chat | Apache-2.0 (gated) | BAAI InfInstruct-* family (Mistral-7B, Llama3-70B/8B, Yi-1.5-9B); model cards state directly; arXiv:2506.11116 | Medium |
| Magpie → Smol-Magpie-Ultra | Magpie-Align org · SmolTalk card | SFT | raw 10M+ / Smol-Magpie-Ultra 400K curated | CC-BY-NC-4.0 (raw) / Apache-2.0 (Smol variant) | SmolLM2-Instruct family; arXiv:2502.02737; technique also self-proven via Llama-3-8B-Magpie-Align (arXiv:2406.08464, ICLR 2025) | High (technique) / Medium (raw license) |
| LMSYS-Chat-1M (GPT-4 subset, as ingredient) | HF (gated) | SFT (ingredient) | 1M total, subset used | custom gated license | OpenHermes 2.5 → Nous Hermes 2; dataset card lists “ChatBot Arena (GPT-4 Only)” as a constituent source | Medium |
| WizardLM / Evol-Instruct | HF | SFT | 196K (public V2) | unclear (GPT-ToS-derived) | WizardLM family (7B–70B); arXiv:2304.12244; ancestor of WizardCoder, absorbed into Tulu-3-style compilations | High |
| No Robots | HF | SFT | 10K | CC-BY-NC-4.0 | Tülu 3 / OLMo 2 (named constituent); independently: monsterapi/zephyr_7b_norobots | High |

Dropped: raw LMSYS-Chat-1M as a standalone primary SFT set (no lab found training directly+solely on the 1M dump for general chat SFT); raw Magpie-Align dumps as a standalone commercial-usable set (kept only the technique + the Apache-licensed Smol-Magpie-Ultra derivative).


3. Tool / function-calling & agentic trajectories

The SFT rung that teaches the model to use things — required regardless of cyber content, since the CTF agent’s core skill is calling tools correctly across multi-turn trajectories.

| Dataset | Link | Seq-B stage | Size | License | Proven in (+ cite) | Confidence |
|---|---|---|---|---|---|---|
| Glaive-function-calling-v2 | HF | SFT | 112,960 rows | Apache-2.0 | IBM Granite-20B-FunctionCalling; arXiv:2407.00121; also 256 HF models declare it as training data | High |
| Hermes-Function-Calling-v1 (NousResearch) | HF | SFT | tens of thousands | Apache-2.0 | Hermes-2-Pro-Llama-3-8B / -Mistral-7B; dataset card; its <tool_call> tag format is now vLLM’s default hermes tool-parser and Qwen’s own recommended format | High |
| Salesforce APIGen / xLAM-function-calling-60k | HF (arXiv:2406.18518) | SFT | 60,000 | CC-BY-4.0 | xLAM-1b-fc-r / xLAM-7b-fc-r (Salesforce); ranked top-25 BFCL at release; NeurIPS 2024 D&B | High |
| APIGen-MT → xLAM-2 series | project (arXiv:2504.03601) | SFT | spans 1B–70B model family; APIGen-MT-5k public sample | research license (verify per artifact) | Salesforce xLAM-2-{1b..70b}-fc-r; model card states trained via this framework; SOTA on BFCL + τ-bench | High |
| ToolACE | HF (arXiv:2409.00920, ICLR 2025) | SFT | 11,300 rows | Apache-2.0 | ToolACE-8B + 51 total HF models trained/fine-tuned on it (Huawei) | High |
| ToolBench (OpenBMB) → ToolLLaMA | GitHub (arXiv:2307.16789, ICLR 2024 spotlight) | SFT | ~127k instructions over 16,000+ real APIs | Apache-2.0 | ToolLLaMA-7b-v1; feeds Agent-FLAN downstream | Medium-High (self-proven + downstream reuse) |
| Agent-FLAN | HF (arXiv:2403.12881, ACL 2024 Findings) | SFT | 219 MB (AgentInstruct+ToolBench+ShareGPT mix) | Apache-2.0 | InternLM Agent-FLAN-7B; dataset card states “+3.5% across agent eval datasets” over prior best | High |
| Gorilla / APIBench | HF (arXiv:2305.15334) | SFT | 1,600+ APIs | Apache-2.0 | Gorilla-7B family (UC Berkeley); NeurIPS 2024 D&B track | High |

Dropped: NexusRaven-V2 training data (only an eval set was released publicly); watt-tool-8B/70B (proprietary/undisclosed dataset, no citable artifact); a standalone Qwen tool-calling SFT row (Qwen never published one — its format adoption of Hermes-style tool use is credited to the Hermes row above instead).

Cross-cutting for Sequence-B: prioritize the multi-turn/agentic sets (ToolACE, APIGen-MT, Agent-FLAN, ToolBench) over single-call sets (xLAM-60k, APIBench, Glaive), because CTF tool use is inherently multi-step and stateful. Agent-FLAN’s explicit negative/anti-hallucination samples are the one artifact in this table that doubles as willingness/refusal-adjacent signal (see §6).
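To make the multi-turn preference concrete, here is a minimal sketch that parses Hermes-style `<tool_call>` blocks and counts how many assistant turns in a trajectory actually call a tool. It assumes each call is a JSON object with a `name` key wrapped in `<tool_call>` tags; the message shape (`role`/`content` dicts) and the helper names are illustrative assumptions, not the schema of any specific dataset in the table.

```python
import json
import re

# Assumed Hermes-style convention: one JSON object per <tool_call>...</tool_call> block.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return the well-formed tool calls found in one assistant message."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw.strip())
        except json.JSONDecodeError:
            continue  # malformed JSON: skip rather than guess
        if isinstance(payload, dict) and "name" in payload:
            calls.append(payload)
    return calls


def count_tool_turns(messages):
    """Count assistant turns that contain at least one parseable tool call."""
    return sum(
        1
        for message in messages
        if message.get("role") == "assistant"
        and extract_tool_calls(message.get("content", ""))
    )
```

A filter such as `count_tool_turns(messages) >= 2` is one simple way to separate multi-step trajectories from single-call examples when mixing sets; the threshold is a judgment call.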


4. Preference (DPO / KTO / RLHF)

| Dataset | Link | Seq-B stage | Size | License | Proven in (+ cite) | Pairwise/On-off-policy | Confidence |
|---|---|---|---|---|---|---|---|
| UltraFeedback (binarized) | HF | preference (DPO/RM) | 64K prompts → ~61K pairs | MIT | Zephyr-7B-β; arXiv:2310.16944; also a Tulu 3 DPO-mixture component (arXiv:2411.15124) | Pairwise, off-policy | High |
| Anthropic HH-RLHF | HF | preference (RM/RLHF) | ~170K comparisons | MIT | StableVicuna (Stability AI/CarperAI); blog | Pairwise, off-policy | High |
| Stanford SHP / SHP-2 | SHP · SHP-2 | preference (RM/RLHF) | 385K / 4.8M | MIT | StableVicuna (combined 3-dataset RM recipe); same blog above | Pairwise, off-policy | Medium-High |
| Nectar | HF | preference (RM → RLAIF) | ~183K prompts × 7-way ranking | Apache-2.0 (research) | Starling-RM-7B-alpha / Starling-LM-7B-alpha (Berkeley NEST); blog | K-wise→pairwise, off-policy | High |
| HelpSteer2 (+v1) | HF | preference (attribute-scored RM) | ~21K pairs (v2) | CC-BY-4.0 | Llama-3.1-Nemotron-70B-Reward/-Instruct (NVIDIA), #1 RewardBench at release; arXiv:2406.08673 | Multi-attribute → pairwise, off-policy | High |
| PKU-SafeRLHF | HF | preference (safe-RLHF dual reward+cost) | 265K QA / 166.8K pref pairs | Apache-2.0 (framework) / non-commercial (data) | Beaver-7B (PKU-Alignment); GitHub, arXiv:2406.15513 | Pairwise (dual-label), off-policy | High |
| Argilla DPO mixes (ultrafeedback-binarized-preferences-cleaned, distilabel-intel-orca-dpo-pairs) | HF | preference (DPO) | 64K / 12.9K | Apache-2.0 | Notus-7B-v1, argilla/distilabeled-OpenHermes-2.5-Mistral-7B; model cards | Pairwise, off-policy | High |
| Skywork-Reward-Preference-80K (v0.2) | HF | preference (RM) | 80K curated pairs | per-source (verify) | Skywork-Reward-Gemma-2-27B / -Llama-3.1-8B (v0.2); #1 RewardBench; arXiv:2410.18451 | Pairwise, mixed on/off-policy | High |
| KTO-mix-14k | HF | preference (KTO) | ~15K rows | Apache-2.0 | HF TRL’s KTOTrainer reference dataset; ~9 community checkpoints; Oumi’s KtoMix14kDataset recipe | Unpaired, off-policy | Medium (real usage, no flagship model pinned) |

Dropped: no cyber-specific preference dataset was in scope; any preference set whose only usage was a single unrelated academic ablation was excluded before it reached the table.
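The pairwise, K-wise, and unpaired shapes in the table above differ only in how the same judgments are laid out. The following sketch shows one way to convert between them. The field names (`prompt`/`chosen`/`rejected` and `prompt`/`completion`/`label`) are assumptions modeled on common trainer conventions, and the all-pairs expansion is an illustration, not the procedure any listed recipe is documented to use.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-first ranking into every (better, worse) pair."""
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]


def pairs_to_unpaired(pairs):
    """Split each pair into two independently labeled rows (KTO-style shape)."""
    rows = []
    for pair in pairs:
        rows.append({"prompt": pair["prompt"], "completion": pair["chosen"], "label": True})
        rows.append({"prompt": pair["prompt"], "completion": pair["rejected"], "label": False})
    return rows


pairs = ranking_to_pairs("Explain TCP slow start.", ["best", "middle", "worst"])
print(len(pairs))                      # 3 pairs from a 3-way ranking
print(len(pairs_to_unpaired(pairs)))   # 6 unpaired rows
```

Note that a mid-ranked response ends up as `chosen` in one pair and `rejected` in another, so applying `pairs_to_unpaired` to an all-pairs expansion gives it conflicting labels. For unpaired training it is usually cleaner to label only the top and bottom of each ranking.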


5. Reasoning / CoT (math + code + science distillation)

| Dataset | Link | Seq-B stage | Size | License | Proven in (+ cite) | Confidence |
|---|---|---|---|---|---|---|
| NuminaMath-CoT | HF | SFT | 860K pairs | Apache-2.0 (code) | AI-MO/NuminaMath-7B-CoT/-TIR; 1st AIMO Progress Prize; also a stated source in Qwen2.5-Math (arXiv:2409.12122) | High |
| Sky-T1_data_17k | HF | SFT | 17K | Apache-2.0 | NovaSky-AI/Sky-T1-32B-Preview; blog | High |
| Bespoke-Stratos-17k | HF | SFT | 17K | CC-BY-NC-4.0 | Bespoke-Stratos-32B/-7B; blog | High |
| OpenThoughts-114k / OpenThoughts2-1M / OpenThoughts3-1.2M | 114k · 2-1M · 3-1.2M | SFT | 114K / 1M / 1.2M | Apache-2.0-style | OpenThinker-7B / -2-32B / -3-7B; arXiv:2506.04178; OpenThinker3-7B is SOTA-open-data at release (53% AIME25, 51% LCB, 54% GPQA-D) | High |
| OpenR1-Math-220k | HF | SFT (+DPO-usable) | 220K curated | Apache-2.0 | open-r1/OpenR1-Qwen-7B; Open-R1 update #2 | High |
| Mixture-of-Thoughts | HF | SFT | 350K verified traces | Apache-2.0 | open-r1/OpenR1-Distill-7B; replicates R1-Distill-Qwen-7B; mixture ratio follows Phi-4-reasoning methodology | High |
| OpenMathInstruct-2 | HF | SFT | 14M pairs | CC-BY-4.0-style | OpenMath2-Llama3.1-8B/70B (NVIDIA); ICLR 2025, arXiv:2410.01560 | High |
| OpenMathReasoning | HF | SFT (+RL-prompts via problem-only split) | 3.2M CoT + 1.7M TIR + 566K GenSelect | CC-BY-4.0-style | OpenMath-Nemotron-1.5B..32B; literal data behind NVIDIA’s AIMO-2-winning submission, arXiv:2504.16891 | High |
| OpenCodeReasoning (+1.1) | HF | SFT | 736K–1.165M | CC-BY-4.0-style | OpenCodeReasoning-Nemotron-7B/14B/32B/-1.1; model cards state directly; arXiv:2504.01943; SFT-only beats RL alternatives on LiveCodeBench (61.8%) | High |
| Magpie-Reasoning-150K (V1) | HF | SFT | 150K | Llama-3/Qwen2 community license | Llama-3-8B-Magpie-Align-SFT-v0.2 (Magpie’s own reference checkpoints); arXiv:2406.08464, ICLR 2025 | Medium (self-proven only) |

Dropped: the raw DeepSeek-R1 “800k samples” SFT set was never released as a standalone artifact (only checkpoints were open-sourced) — the community reconstructions above are what’s actually usable and are what’s listed instead; AM-DeepSeek-R1-Distilled-1.4M (no verified third-party adoption beyond its own paper); any cyber-specific reasoning/CTF-reasoning dataset (out of scope by permanent stance).

Read order if you’re pulling data, not just reading the table: NuminaMath-CoT (substrate) → OpenThoughts3-1.2M (best current fully-open ablated set) → OpenMathReasoning (if you need competition-tier math) → OpenCodeReasoning (closest analog to CTF/exploit-reasoning) → Mixture-of-Thoughts/OpenR1-Math-220k (if reproducibility of the training recipe matters more than raw SOTA) → Bespoke-Stratos/Sky-T1 (cheapest pilot).


6. Willingness / uncensor vs. safety / refusal calibration

This category covers both directions of refusal/compliance behavior, deliberately: the target model is a sandboxed offensive-security research agent, and over-refusal is a real, documented failure mode (see §9).

6a. Compliance-increasing (uncensor / de-refusal)

| Dataset | Link | Seq-B stage | Size | License | Proven in (+ cite) | Confidence |
|---|---|---|---|---|---|---|
| Dolphin dataset family (incl. not_samantha_norefusals.jsonl) | HF | SFT | ~4.5M base + multi-M Dolphin-2.9 mix | Apache-2.0 | Dolphin model series (2.9.1-llama-3-8b, 2.9.2-Phi-3-Medium, 2.5-mixtral-8x7b, …); Hartford’s post, axolotl configs list the file directly | High |
| unalignment/toxic-dpo-v0.2 | HF | preference (DPO) | 541 pairs | not permissively licensed; sensitive-content flagged | Component of mlabonne/orpo-dpo-mix-40k, used to DPO-“heal” NeuralDaredevil-8B-abliterated post weight-orthogonalization; dataset card, Labonne’s abliteration post | High |
| ehartford/wizard_vicuna_70k_unfiltered | HF | SFT | 34,598 conversations | unclear (ShareGPT-derived) | Wizard-Vicuna-13B-Uncensored family (7B/13B/33B/65B); model card states the alignment-stripping directly; 128 HF models trained on it | High |

6b. Safety-increasing (harmlessness / appropriate refusal / anti-jailbreak / anti-over-refusal)

| Dataset | Link | Seq-B stage | Size | License | Proven in (+ cite) | Confidence |
|---|---|---|---|---|---|---|
| Anthropic/hh-rlhf | HF | preference / RL-prompts | 161K+ helpful + 42K+ harmless + red-team transcripts | research-use | StableVicuna-13B RM training + the original Anthropic RLHF paper (Bai et al. 2022); blog | High |
| PKU-Alignment/PKU-SafeRLHF | HF | preference (safe-RLHF reward+cost) | 83.4K entries | non-commercial | Beaver-7B-v1.0; model card lists it directly, arXiv:2310.12773 | High |
| allenai/wildguardmix | HF | SFT (safety_noncompliance) | 50K prompts (Tulu-3 slice) | Apache-2.0 | Tulu 3; named safety_noncompliance bucket component; arXiv:2411.15124 | High |
| allenai/wildjailbreak | HF | SFT (safety_noncompliance) + RL-prompts | 262K total (50K sampled into Tulu 3) | ODC-BY-1.0 | Tulu 3; same bucket; explicitly designed to fix over-refusal via harmful-vs-benign-but-scary contrastive pairs | High |
| allenai/coconot | HF | SFT (safety_noncompliance) | 10,983 prompts | ODC-BY-1.0 | Tulu 3; same bucket; Brahman et al. 2024, “The Art of Saying No” | High |

Dropped: Nous Hermes 3’s compliance-steerability behavior is real, but no single named public dataset is credited for it (proprietary blend); Do-Not-Answer, XSTest, and OR-Bench appear only as refusal-rate eval sets, not training sets, in any published recipe that could be verified; Llama Guard’s training-data composition is not public.


7. Chinese-labs & multilingual

| Dataset | Link | Capability | Seq-B stage | Size | License | Proven in (+ cite) | Confidence |
|---|---|---|---|---|---|---|---|
| BAAI Infinity-Instruct | HF | instruction/chat SFT | SFT | ~7.4M foundational + 1.5M chat | CC-BY-SA-4.0 (mixed subsets) | Own tech report fine-tunes Mistral/LLaMA/Qwen/Yi; InfInstruct-LLaMA3.1-70B beats GPT-4-0314 by 8.6% on IF; arXiv:2506.11116 | High |
| COIG-CQIA | HF | Chinese instruction SFT (LIMA-style) | SFT | ~48K (45.8K filtered) | CC-BY-NC-SA | Own paper fine-tunes Yi-6B/34B, Qwen2-7B/72B; arXiv:2403.18058; also folded into Infinity-Instruct (second, independent usage) | High |
| Firefly (firefly-train-1.1M) | HF | Chinese multi-task SFT | SFT (+DPO via toolkit) | ~1.15M–1.65M | unspecified (verify) | firefly-mixtral-8x7b, firefly-baichuan2-13b, firefly-llama-30b; GitHub, 6.6K★ | Medium-High |
| Magpie-Qwen2(.5)-Pro (+ -200K-Chinese) | HF | self-synthesized instruction/preference-pair SFT | SFT | 1M (Qwen2-Pro) + per-model variants | research-use | Method paper matches Llama-3-8B-Instruct SFT-only; arXiv:2406.08464, ICLR 2025; ZH subset generated by Qwen2-72B-Instruct itself | High (method+EN) / Medium (ZH subset) |
| MAP-Neo Matrix Data Pile | HF | bilingual EN/ZH pretrain corpus | CPT (+SFT/alignment released alongside) | 4.5–4.69T tokens | Apache-2.0 | MAP-Neo-7B, pretrained from scratch, fully open; arXiv:2405.19327 | High |
| OpenCSG Chinese Corpus (Chinese-Cosmopedia etc.) | HF | ZH synthetic textbook CPT + Smoltalk-style SFT | CPT + SFT | 15M docs / ~60B tokens (Cosmopedia slice) | Apache-2.0 | csg-wukong-1B (OpenCSG); arXiv:2501.08197 | Medium |
| Congliu/Chinese-DeepSeek-R1-Distill-data-110k | HF | ZH reasoning/CoT distillation | SFT | 110K | not stated (research/community use) | Distilled via DeepSeek-R1-671B API per R1’s own protocol; 91 downstream community models trained on it | Medium-High |
| AgentInstruct (Zhipu/THUDM) | HF | agent/tool-use trajectories | SFT | 1,866 verified trajectories | Apache-2.0-style | AgentLM-7B/13B/70B; arXiv:2310.12823, ACL 2024 Findings; independently reused by InternLM’s Agent-FLAN | High |
| LongAlign-10k (Zhipu/THUDM) | HF | long-context instruction alignment | SFT | 10,000 (8K–64K tokens) | Apache-2.0-style | ChatGLM3-6B-128k, LongAlign-6B/7B/13B-64k; arXiv:2401.18058, EMNLP 2024 Findings | High |
| COIG-P | HF | Chinese preference (DPO) | preference | 1,009K (paper) / ~101K (HF release) | CC-BY-NC-4.0 | Own paper DPO-trains Qwen2.5-Instruct-7B-COIG-P, Infinity-Instruct-3M-*-COIG-P; arXiv:2504.05535, EACL 2026 Findings | Medium |
| huozi_rlhf_data | GitHub | Chinese human-labeled preference | preference | 16.9K pairs | Apache-2.0 | Huozi 2.0 (HIT-SCIR); official RLHF stage of a named released model | Medium |
| Aya Dataset + Aya Collection (Cohere) | Dataset · Collection | massively multilingual instruction SFT | SFT | 204K (Dataset, 65 langs) / 513M (Collection, 101 langs) | Apache-2.0 (verify per-subset) | Aya-101 → Aya 23 → Aya Expanse (Cohere/Cohere Labs); arXiv:2402.06619 | High |

Dropped: zhihu_rlhf_3k and dikw/hh_rlhf_cn (only scattered small/unlabeled community reward-model usage, no named shipped recipe); a standalone first-party Qwen/DeepSeek/GLM/Yi SFT-or-preference-data row (none of those labs release their actual training data — their footprint here is captured indirectly via AgentInstruct/LongAlign, and via community R1-distillation sets like Congliu’s 110k).


8. Which datasets at which Sequence-B rung

Mapped onto Sequence B’s actual stage order (frontier-recipe-is-a-sequence.md §2): base/instruct choice → (optional) CPT → SFT cold-start → rejection-sampling/on-policy SFT → preference (DPO/KTO) → RLVR/GRPO → iterate.

| Stage | What it needs | Pull from |
|---|---|---|
| CPT / domain pretraining (optional) | Bilingual/domain raw-text corpus, if extending beyond the base checkpoint’s native coverage | MAP-Neo Matrix (§7, EN/ZH), OpenCSG Chinese-Cosmopedia (§7); Infinity-Instruct’s “foundational” phase blurs CPT/SFT (§2, §7) |
| SFT cold-start: general chat/instruction | Broad instruction-following + multi-turn chat prior before anything domain-specific | Tulu 3 SFT Mixture, UltraChat-200k, OpenHermes 2.5, No Robots (§2) |
| SFT cold-start: tool/agentic | Multi-turn tool-call format + agentic trajectory shape | Hermes-Function-Calling-v1 (format anchor for vLLM’s hermes parser), ToolACE, APIGen-MT, Agent-FLAN, AgentInstruct (§3, §7) |
| SFT cold-start: reasoning prior | Long-CoT math/code/science distillation before any RL stage touches it | OpenThoughts3-1.2M (default pick), OpenMathReasoning (competition-tier), OpenCodeReasoning (code/CTF-adjacent) (§5) |
| SFT cold-start: willingness calibration | Refusal-boundary precision for a sandboxed offensive agent, not blanket compliance or blanket refusal | CoCoNot + WildJailbreak + WildGuardMix (the Tulu 3 safety_noncompliance bucket) as the base; Dolphin/wizard_vicuna_70k_unfiltered only if you deliberately want the de-refusal direction too (§6) |
| Preference (DPO/KTO) | Chosen/rejected pairs (or unpaired KTO-shaped labels) once an SFT checkpoint exists to regenerate on-policy | UltraFeedback, Skywork-Reward-Preference-80K, HelpSteer2 as off-policy starting mixes; KTO-mix-14k as the format template if your own trajectories are naturally unpaired good/bad, not clean pairs (§4) |
| RL-prompts (RLVR/GRPO) | Prompts + a verifier; deliberately NOT a fixed labeled dataset, per method-to-data.md | WildJailbreak’s adversarial-prompt pool and OpenMathReasoning’s 193k problem-only split are the closest “prompt source, no answer” shapes in this registry; your own challenge set + flag-check verifier remains the actual RLVR data object for the cyber-specific rung (out of scope here) |
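The "prompts + a verifier" shape of the RL-prompts row can be illustrated with a toy flag-check reward. This is a sketch under stated assumptions: the `flag{...}` format, the challenge dict layout, and both function names are hypothetical, and a real verifier would run against a live sandbox, not a stored string.

```python
import re

# Hypothetical flag format; real challenge sets define their own.
FLAG_RE = re.compile(r"flag\{[^{}]+\}")


def flag_reward(completion, expected_flag):
    """Binary verifiable reward: 1.0 only if the exact expected flag appears."""
    return 1.0 if expected_flag in FLAG_RE.findall(completion) else 0.0


def score_rollouts(challenge, completions):
    """Score a group of sampled rollouts for one prompt (GRPO-style grouping)."""
    return [flag_reward(text, challenge["flag"]) for text in completions]


# The RLVR data object is a prompt plus what the verifier needs, with no labeled answer text.
challenge = {"prompt": "Recover the flag from the service on port 8080.", "flag": "flag{s3cret}"}
rollouts = ["I found flag{s3cret} in /tmp.", "Maybe it is flag{guess}?", "No flag recovered."]
print(score_rollouts(challenge, rollouts))
```

The point of the shape is that nothing in `challenge` says what a good trajectory looks like; the reward comes entirely from checking the outcome.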

9. Two honest caveats

Willingness vs. over-refusal, for a sandboxed offensive-security agent. Over-refusal — declining a legitimate pentest/CTF request because it pattern-matches “harmful” — is a real, well-documented failure mode, which is exactly what CoCoNot and WildJailbreak were purpose-built to fix (contrastive harmful-vs-benign-but-scary-looking prompts, §6b). That is the correct tool for this project’s actual problem: precision on the refusal boundary inside a controlled, sandboxed tool-use context — not blanket compliance. The de-refusal-direction datasets in §6a (Dolphin, toxic-dpo-v0.2, wizard_vicuna_70k_unfiltered) are included because they are genuinely proven-by-usage and instructive as the opposite pole of the same axis, but they are a blunter instrument (strip refusal-flavored completions wholesale) than CoCoNot’s calibrated “refuse the actually-unsafe subset, comply with the rest.” Default recommendation: build the willingness rung primarily from the Tulu-3 safety_noncompliance bucket (WildGuardMix + WildJailbreak + CoCoNot together, exactly as Tulu 3 ships it), and treat §6a as a reference recipe pattern rather than a default ingredient.
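One way to make "precision on the refusal boundary" measurable is to track two error rates separately instead of a single refusal rate. The sketch below assumes you already have per-prompt labels for whether a refusal was warranted and whether the model refused; the function name and the tuple layout are illustrative assumptions.

```python
def refusal_boundary_rates(results):
    """Compute over- and under-refusal rates.

    results: iterable of (should_refuse, did_refuse) boolean pairs.
    """
    results = list(results)  # allow generators; we iterate twice
    benign = [did for should, did in results if not should]
    unsafe = [did for should, did in results if should]
    over = sum(benign) / len(benign) if benign else 0.0
    under = sum(not did for did in unsafe) / len(unsafe) if unsafe else 0.0
    return {"over_refusal": over, "under_refusal": under}


evaluation = [
    (False, True),   # benign-but-scary prompt, refused: over-refusal
    (False, False),  # benign prompt, answered
    (True, True),    # unsafe prompt, refused
    (True, False),   # unsafe prompt, answered: under-refusal
]
print(refusal_boundary_rates(evaluation))
```

Blanket compliance drives `over_refusal` to zero at the cost of `under_refusal`, and blanket refusal does the reverse, which is why the two numbers are reported separately here.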

The off-policy caveat for distilled reasoning data. Nearly every dataset in §5 is CoT distilled from a stronger teacher (DeepSeek-R1, QwQ-32B, Llama-3.1/3.3-70B-Instruct), so the policy model you’d actually train on Sequence B never generated these traces itself. This is the correct, cheap way to bootstrap a reasoning prior before any on-policy stage (exactly what Sky-T1, Bespoke-Stratos, OpenThoughts, and OpenR1 all do), but pure SFT-on-distillation plateaus with quality capped at the teacher’s level; it does not substitute for an on-policy RL stage afterward. This is the same on/off-policy axis this book treats as the single most load-bearing distinction in post-training generally (foundations/on-off-policy.md), and it is the same caveat the preference registry (§4) makes about off-policy DPO mixes. In both cases, budget an on-policy regeneration-and-rescoring pass against your own SFT checkpoint before treating either the reasoning prior or the preference stage as final, as Tulu 3 and NVIDIA’s HelpSteer2-Preference both do in practice.
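A regeneration-and-rescoring pass can be sketched as follows. This is a minimal outline, not the Tulu 3 or HelpSteer2-Preference pipeline: `generate` and `score` are caller-supplied stand-ins for sampling from your own SFT checkpoint and for a reward model or verifier, and `num_samples` and `min_margin` are hypothetical knobs.

```python
def build_on_policy_pairs(prompts, generate, score, num_samples=4, min_margin=0.1):
    """Sample completions from the current policy and keep best-vs-worst pairs.

    generate(prompt) -> str           samples one completion from your own checkpoint
    score(prompt, completion) -> float  reward model or verifier score
    """
    if num_samples < 2:
        raise ValueError("need at least two samples per prompt to form a pair")
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(num_samples)]
        scored = sorted(
            ((score(prompt, text), text) for text in candidates),
            key=lambda item: item[0],
            reverse=True,
        )
        (best_score, best), (worst_score, worst) = scored[0], scored[-1]
        # Skip prompts where the policy's samples are indistinguishable.
        if best_score - worst_score >= min_margin:
            pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs
```

Because both sides of every pair come from the policy being trained, the resulting preference data is on-policy by construction, unlike the off-policy mixes in §4.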


  • The recipe is a sequence, not a pick — the Sequence-A/Sequence-B stage skeleton this registry’s §8 mapping is built against.
  • The path to a frontier cybersecurity model — the north star this registry serves; the “proven-by-usage, not academic-only” stance is the dataset-side mirror of that chapter’s model/recipe-side stance.
  • Method → Data (your real bottleneck) — the method-first framing that explains why RLVR/GRPO in §8 deliberately has no fixed dataset row, and why rejection-sampling FT’s data object is a byproduct you already produce.
  • Full per-dataset detail, live-verification method, and additional dropped candidates: artifacts/overnight-datasets/research/{instruction-chat,tool-calling-agentic,preference-alignment,reasoning-cot,willingness-uncensor-safety,chinese-labs-multilingual}.md.