Continued pretraining on an instruction-tuned model (without breaking it)
The north star this book keeps returning to is fine-tuning an already-available open-weight dense model into a frontier cybersecurity model, not training one from scratch. The chapter “The path to a frontier cybersecurity model” names Stage 0, domain continued pretraining (CPT), as the first rung of that ladder: every code/math/medical specialization lineage examined there continues pretraining from an existing strong checkpoint on a raw in-domain corpus (Code Llama ~500B code tokens, DeepSeekMath 120B math tokens, Qwen2.5-Coder 5.5T code tokens) before SFT/RL ever starts. What that chapter leaves implicit, and this one makes explicit, is which checkpoint Stage 0 should actually start from.
In practice you frequently don’t have a clean choice. Many open-weight releases you’d actually want to build on are shipped, served, and eval-harnessed as the instruction-tuned/RLHF’d checkpoint — that’s the one your inference stack, prompt templates, and safety behavior are already built around. So the question this chapter answers is concrete and load-bearing, not academic: can you run raw next-token CPT on a raw cybersecurity corpus directly on top of an already-instruct checkpoint — for knowledge injection — without destroying the instruction-following/alignment layer that checkpoint already has? If yes, how? And how do you know, empirically, that it didn’t break?
Bottom line up front: yes, this is real, documented practice — but naive next-token CPT on an instruct checkpoint reliably damages it. The damage is disproportionately format/alignment collapse (chat-template adherence, instruction-following reliability, output degeneracy, sometimes safety behavior), not fact erasure — raw knowledge is comparatively robust. This is fixable, but not “just wait it out”: it needs an explicit countermeasure. The safer industry default remains CPT-on-base → re-instruct when a base twin exists (true for essentially every mainstream open-weight family — Llama, Qwen, Mistral, Gemma all ship base+instruct pairs); CPT-on-instruct directly is viable only as a budgeted fallback. Per this project’s standing stance, no cybersecurity-LLM paper is used as evidentiary ground anywhere below — every claim rests on general domain-adaptive-pretraining, continual-learning, and model-merging/task-arithmetic literature, or on frontier-lab disclosures already cited elsewhere in this book.
1. Why this is a real fork, not a formality
The reason this can’t be waved away as “just keep training” is that instruction-tuning/RLHF is a comparatively thin edit on top of the base pretraining distribution — instruction-following and safety behavior live in a small slice of the weight change, concentrated (per the safety-alignment literature below) in the first few output tokens of a response. Raw next-token CPT on a large new corpus is exactly the kind of update that can overwrite a thin layer without the training loss ever showing it. That’s the catastrophic-forgetting risk this chapter is about, and it’s a direct instance of the same competence-vs-performance split Diagnosing the gap uses for the harness’s own capability gaps: a CPT run that silently collapses instruction-following isn’t erasing knowledge (competence) — it’s breaking elicitation (performance). Misdiagnosing a post-CPT regression as “the model needs more domain SFT” when it’s actually a broken output layer sends you down the wrong fix.
The seminal reference for “keep pretraining on domain data before you specialize further” is Domain-Adaptive Pretraining (DAPT) — Gururangan et al., “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks,” ACL 2020, arXiv:2004.10964. It establishes the concept and lineage everything below inherits, but it’s an encoder (RoBERTa, MLM objective) with no instruction-following layer to protect and no chat-model eval — cite it for the concept, not for the instruct-preservation question. Confidence: high (canonical, correctly attributed; different model class from the rest of this chapter).
2. What actually breaks — cited, mechanism-level
The direct, on-point comparison. Jindal, Badrinath, Bharti, Vinay & Sharma (Samsung Research), “Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs,” arXiv:2410.10739 (Oct 2024), frames the question almost literally as S1 (CPT directly on the instruct checkpoint) vs. S2 (CPT the base, then re-instruct), tested across Llama-3, Llama-3.1, Qwen2, Qwen2.5 base/instruct pairs. Their own contribution bullet: “Continuous pre-training of an instruction model results in catastrophic forgetting of the instruction capabilities and, therefore should be avoided.” Conversely, CPT-on-base-then-reinstruct “preserve[s] both the domain knowledge and the instruction capabilities.” Confidence: medium-high — single preprint, but directly on-point and multi-model-family.
What specifically degrades. Li & Lee (NTU), “Examining Forgetting in Continual Pre-training of Aligned Large Language Models,” arXiv:2401.03129 (Jan 2024), CPT directly on Llama-2-7b-chat (already SFT+RLHF’d) with 1B tokens of raw Traditional Chinese text. Their own framing: “the model’s knowledge remains unaffected while its reliability declines” — increased repetition, drift toward generating in the CPT-corpus language regardless of prompt language. They also tried the cheap fixes (freezing first/last layers, attention-only vs. MLP-only, LoRA, (IA)³ adapters) and found none fully solved it — “more than straightforward methods are required.” Confidence: high for the qualitative finding; the mitigation-list result is a useful negative (rules out cheap partial-freezing as sufficient on its own).
Independent confirmation the loss is alignment, not knowledge. Zheng, Cai, Qiu & Ma, “Spurious Forgetting in Continual Learning of Language Models,” ICLR 2025 poster (OpenReview:ScI7IlKGdI): much of what looks like catastrophic forgetting is a decline in task alignment from near-orthogonal early-optimization-step weight updates, not true knowledge loss — the underlying knowledge is often still there, just mis-elicited. Proposed mitigation: freeze the bottom layers during the new training phase. Confidence: medium-high (peer-reviewed poster; corroborates Li & Lee independently).
Safety is the same mechanism, and it’s shallow. Qi, Zeng, Xie, Chen, Jia, Mittal & Henderson, “Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!,” arXiv:2310.03693 (ICLR 2024): further training on an aligned checkpoint degrades safety behavior even with entirely benign, non-adversarial data — no malicious intent required. A raw cybersecurity corpus is exactly the kind of domain data that can interact badly with refusal/safety behavior; a security CPT run needs a dedicated safety eval pass, not just an instruction-following check. Qi, Zeng et al. (Princeton + Google DeepMind), “Safety Alignment Should Be Made More Than Just a Few Tokens Deep,” ICLR 2025 Oral, gives the mechanism: safety alignment is disproportionately encoded in the model’s behavior over the first few output tokens — “shallow safety alignment” — so small perturbations to early-token distributions from any further training can collapse refusal behavior while leaving downstream behavior looking otherwise unaffected. Confidence: high (both peer-reviewed, foundational, and independently corroborating).
Scale makes it worse, not better. Luo et al., “An Empirical Study of Catastrophic Forgetting in LLMs During Continual Fine-tuning,” arXiv:2308.08747: forgetting is general across the 1B–14B range tested and — counterintuitively — gets worse as scale increases within that range, because larger models start from a higher capability baseline and so have more to lose. One actionable positive from the same paper: mixing general instruction-tuning data into subsequent training measurably alleviates the forgetting — the empirical backbone for the replay technique below. Confidence: high (their own ablations directly support both the negative finding and the mitigation).
Loss curves lie to you about this. Mousavi, Alghisi & Riccardi (U. Trento), “What Does Loss Optimization Actually Teach, If Anything? Knowledge Dynamics in Continual Pre-training of LLMs,” arXiv:2601.03858 (2026): CPT epoch-by-epoch on three instruction-tuned LLMs with diagnostic probes interleaved shows training loss decreases monotonically while factual learning is unstable/non-monotonic and out-of-domain general-skill performance degrades from early epochs onward — by the time your CPT loss curve “looks fine,” you may already be past the point where the instruct layer started eroding. Confidence: medium (single group, very recent, methodologically sound).
A large-scale caveat, not yet peer-reviewed. Harmon, Hochlehnert, Bethge & Prabhu (Tübingen AI Center), “Mapping Post-Training Forgetting in Language Models at Scale” (OpenReview:qCIg2WGudx, anonymous ICLR 2026 submission, no arXiv id found; treat as promising, not yet validated), measure forgetting and backward transfer by counting sample-wise transitions across ~30 model pairs. They report that “domain-continual pretraining induces moderate forgetting with low-to-moderate backward transfer” at the knowledge level (i.e., moderate, not catastrophic, purely on facts, consistent with Li & Lee) and, the finding to carry into §3 below, that “model merging does not reliably mitigate forgetting.” Confidence: medium, and explicitly flagged every time it’s cited because it cuts against several merging-based fixes cited next.
3. Preservation techniques — decision table
Six documented countermeasures, each independently verified, at different cost/robustness points. They compose — production recipes stack two or three at once.
| Technique | How | Cost | Confidence |
|---|---|---|---|
| Replay / data-mixing | Blend general and/or instruction-formatted data into every CPT batch instead of 100% raw domain text. Floor: 1% replay is enough to substantially mitigate forgetting in continual instruction-tuning (Scialom et al., EMNLP 2022, arXiv:2205.12393). At pretraining scale, Ibrahim et al. (arXiv:2403.08763) report 5% replay sufficient for a weak shift (English→English), 25% for a strong shift (English→German); treat a cybersecurity corpus as closer to the strong-shift end (in-distribution stylistically, out-of-distribution lexically/topically — jargon, CVE text, shellcode) and start at 15–25%. Prefer instruction-shaped replay over generic replay when affordable: AdaptLLM (arXiv:2309.09530) auto-converts raw domain text into reading-comprehension/QA task pairs; Instruction Pre-Training (arXiv:2406.14491, Microsoft, EMNLP 2024) scales this to 200M synthesized instruction-response pairs woven into the raw corpus, reporting a CPT’d Llama3-8B “comparable to or outperforming Llama3-70B” on their domain suite. | Low (mixing) → medium (synthesizing instruction-shaped replay via a teacher-model pass) | High — 4 independent groups, fine-tuning-scale through 10B-param/hundreds-of-billions-token pretraining-scale, consistent conclusion |
| Low LR + re-warm/re-decay + short CPT | Treat the instruct checkpoint’s weights as a fragile prior: re-warm the LR (most released checkpoints have already decayed to near-zero) then re-decay over a short cosine cycle sized to the new-token budget, rather than training to convergence on the new corpus. Ibrahim et al. (arXiv:2403.08763, building on Gupta et al., arXiv:2308.04014) show re-warm+re-decay+replay matches full retrain-from-scratch on final loss and LM-eval-harness average, at 405M–10B params, hundreds of billions of tokens. Skipping re-warm → poor adaptation; skipping re-decay/over-running at peak → forgetting spike. | Near-zero (scheduler choice, no new infra) | High on the mechanism; medium on any specific peak-LR number transferred to a much smaller CPT budget than the paper’s own experiments — needs a small sweep |
| LoRA / PEFT-CPT | Freeze base weights, train only low-rank adapters during CPT — the rank constraint mechanically bounds how far the update can move the checkpoint. Biderman et al., “LoRA Learns Less and Forgets Less,” TMLR 2024, arXiv:2405.09673 — the rigorous head-to-head at exactly the CPT regime (Llama-2-7B, ~20B unstructured tokens): LoRA forgets markedly less than full-FT, but underlearns the new domain and the gap is NOT closed even at r=256 — full-FT learns perturbations of effective rank 10–100× higher than typical LoRA configs, which is the mechanistic reason LoRA both underlearns and underforgets. Verified concrete settings from their CPT appendix: target all transformer modules (not just attention), α = 2r, LR 2e-4 at r=16/64, 1e-4 at r=256 (higher rank needs a lower LR to avoid instability), cosine schedule w/ warmup, bf16. | Low (standard PEFT tooling, memory-cheap) | High — single but rigorous, matched-hyperparameter, mechanistic (SVD-based) paper; explicitly reports the negative result (LoRA is not a free lunch for CPT knowledge injection) |
| Chat-vector / instruction-residual (task arithmetic) | Compute Δ_instruct = θ_instruct − θ_base once from the model family’s released base/instruct pair (Ilharco et al., “Editing Models with Task Arithmetic,” ICLR 2023, arXiv:2212.04089 — the seminal task-vector result: deltas can be added/negated/composed via simple weight-space arithmetic). CPT the base checkpoint (not the instruct one) on the raw corpus → θ_base_cpt; reattach θ_final = θ_base_cpt + γ·Δ_instruct (γ defaults to 1.0, sweep 0.8–1.2 if outputs degrade). No SFT rerun required. Verified independently twice: Chat Vector (Huang et al., ACL 2024, arXiv:2310.04799) for a language-shift version of the same problem, and Jindal et al. (arXiv:2410.10739) for the domain-shift version this chapter is about — both converge on the identical recipe independently. If the raw add underperforms, RESTA (arXiv:2402.11746, ACL 2024) shows DARE-sparsifying the delta before adding reduces interference. | Low — the CPT run costs the same as CPT’ing the base directly; reattachment is a single elementwise tensor op | High for the mechanism (ICLR-seminal + ACL-peer-reviewed); medium-high for direct applicability, since the published experiments CPT the base and reattach rather than training the instruct weights themselves — no cybersecurity corpus tested |
| Re-apply a light SFT (or SFT+DPO) stage after CPT | Accept CPT degrades chat behavior and budget a short, cheap post-CPT SFT pass — cheaper than the original SFT since you’re re-sharpening a latent capability, not teaching it from scratch. LLaMA Pro (arXiv:2401.02415, ACL 2024) does exactly this: expand the model with new frozen-original transformer blocks, CPT only the new blocks, then run a separate instruction-tuning pass to produce “LLaMA Pro-Instruct” — original weights never touched, so preservation is structural, not repaired post-hoc (tradeoff: parameter growth + re-integration into the serving stack). Safety needs its own restoration step: per §2, a generic SFT pass is not guaranteed to restore safety alignment as reliably as instruction-following, because safety is shallow/early-token-concentrated; Qi et al.’s (arXiv:2406.05946) proposed fix is a fine-tuning objective that specifically constrains updates on the initial-token distribution. | Low-medium (a short SFT pass, hours not the CPT-scale token budget; block-expansion variant is medium — architecture surgery) | High for “instruction-following degrades and a light SFT pass helps” (standard, cross-validated pattern); medium-high for the safety-specific restoration claim (newer, narrower ICLR-2025-Oral-level result) |
| Model merging (TIES / DARE) | Instead of a single arithmetic addition, treat the domain-CPT’d model and the original instruct model as siblings and merge their deltas with an interference-aware algorithm rather than naive averaging. TIES-Merging (arXiv:2306.01708, NeurIPS 2023): trim small-magnitude changes, elect a majority sign per parameter to resolve sign disagreement, then merge only agreeing parameters. DARE (arXiv:2311.03099, ICML 2024, “Super Mario”): randomly drop delta parameters at rate p and rescale survivors by 1/(1−p) — SFT deltas are typically tiny and redundant enough that 90–99% can be zeroed without hurting the model’s own abilities, and using DARE as a pre-processing sparsifier before merging mitigates the same interference TIES targets. | Low — post-hoc weight-space operation, no additional training, needs both source checkpoints on the same architecture | High for both papers’ core claims (peer-reviewed, reproducible open code); medium for applying either to this exact CPT-knowledge + instruct-behavior scenario (a well-supported inference, not a verbatim citation) — and read the §2 caution again: Harmon et al.’s large-scale study reports merging “does not reliably mitigate forgetting,” so budget a stage-boundary eval after any merge regardless of technique |
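The DARE row above describes a drop-and-rescale step: zero a random fraction p of the delta, then scale the survivors by 1/(1−p). A minimal sketch of that step follows, assuming plain Python lists as stand-ins for weight tensors. The function names `dare_sparsify` and `merge_with_dare` and the `seed` parameter are illustrative and are not taken from the DARE codebase.

```python
import random


def dare_sparsify(delta, drop_rate, seed=0):
    """Zero each delta entry with probability drop_rate; rescale survivors by 1/(1 - drop_rate)."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = random.Random(seed)  # fixed seed so a merge is reproducible
    scale = 1.0 / (1.0 - drop_rate)
    return [0.0 if rng.random() < drop_rate else d * scale for d in delta]


def merge_with_dare(theta_base, theta_tuned, drop_rate, seed=0):
    """Sparsify the tuned-minus-base delta, then add it back onto the base weights."""
    if len(theta_base) != len(theta_tuned):
        raise ValueError("checkpoints must have the same shape")
    delta = [t - b for t, b in zip(theta_tuned, theta_base)]
    sparse = dare_sparsify(delta, drop_rate, seed)
    return [b + s for b, s in zip(theta_base, sparse)]


if __name__ == "__main__":
    base = [0.10, -0.20, 0.30, 0.05]
    tuned = [0.12, -0.21, 0.33, 0.05]
    print(merge_with_dare(base, tuned, drop_rate=0.5, seed=1))
```

The rescale keeps the expected value of each delta entry unchanged, which is why a high drop rate can still leave behavior roughly intact. Per the caution in the same row, this sketch is a pre-processing step, not a substitute for the stage-boundary eval after the merge.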
4. The practical recipe + recommendation
Two viable orderings, plus a fallback, evaluated against the actual constraint most teams face: you very likely have (or strongly prefer to keep working from) the instruct checkpoint your serving/eval harness is already built around, even though for mainstream open-weight families (Qwen, Llama) the base twin is also published.
Option A — CPT-on-base → re-instruct (the textbook-safe default).
Use when the model family publishes both Base and Instruct (true for Qwen2/2.5/3 and Llama-3.x; verify per model before committing) and you have budget to re-run SFT/RL afterward. CPT the base checkpoint with the Ibrahim et al. recipe (re-warm/re-decay LR + 15–25% instruction-shaped replay), then run the planned SFT→RL pipeline on the domain-adapted base. This is confirmed directly by Jindal et al.’s §4.4 result: it preserves both domain knowledge and instruction capability, with no post-hoc repair needed. And since CPT→SFT→RL is already this project’s intended stage order (per Stage 0 in the frontier recipe), this isn’t really extra cost, just correct sequencing.
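The re-warm/re-decay schedule and the replay ratio in Option A can be sketched as two small helpers. This is a sketch under stated assumptions: linear re-warm followed by a single cosine re-decay sized to the CPT step budget, with the replay share fixed per batch. The function names, the step counts, and the peak/minimum LR values in the usage lines are hypothetical placeholders, not numbers from Ibrahim et al.; the peak LR in particular needs the small sweep the decision table calls for.

```python
import math


def rewarm_redecay_lr(step, total_steps, warmup_steps, peak_lr, min_lr):
    """Linear re-warm from min_lr to peak_lr, then cosine re-decay back to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return min_lr + (peak_lr - min_lr) * step / warmup_steps
    # Fraction of the decay phase completed, clamped so over-running stays at min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def replay_batch_split(batch_size, replay_fraction):
    """Return (domain_examples, replay_examples) for one CPT batch."""
    if not 0.0 <= replay_fraction <= 1.0:
        raise ValueError("replay_fraction must be in [0, 1]")
    replay = round(batch_size * replay_fraction)
    return batch_size - replay, replay


if __name__ == "__main__":
    # Hypothetical budget: 1000 steps, 100 of them re-warm; LR values are placeholders.
    for step in (0, 100, 550, 1000):
        print(step, rewarm_redecay_lr(step, 1000, 100, peak_lr=3e-5, min_lr=3e-6))
    print(replay_batch_split(64, 0.25))
```

Clamping `progress` means a run that overshoots its budget sits at the minimum LR instead of re-climbing, which matches the warning that over-running at peak is where the forgetting spike appears.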
Option B — CPT-on-base + chat-vector reattach (the cheap shortcut).
Use when you specifically want to skip re-collecting/re-running SFT (e.g., reusing a vendor’s expensive proprietary instruction dataset you don’t want to reproduce). Mechanics: compute Δ_instruct = θ_instruct − θ_base once from the original released pair; CPT the base as in Option A; reattach θ_final = θ_base_cpt + γ·Δ_instruct (γ=1.0, sweep 0.8–1.2). Well-evidenced by two independent convergent recipes (Chat Vector, Jindal et al.), but with one fewer confirming source than Option A, and neither source paper tested a cybersecurity corpus, so the stage-boundary eval in §5 is load-bearing here, not optional.
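The Option B mechanics reduce to two elementwise operations per tensor. The sketch below assumes checkpoints are dictionaries mapping parameter names to flat lists of floats; real checkpoints would be framework state dicts with matching dtypes. The names `chat_vector` and `reattach` are illustrative.

```python
def chat_vector(theta_instruct, theta_base):
    """Delta_instruct = theta_instruct - theta_base, computed per named tensor."""
    if theta_instruct.keys() != theta_base.keys():
        raise ValueError("base and instruct checkpoints must share parameter names")
    return {
        name: [i - b for i, b in zip(theta_instruct[name], theta_base[name])]
        for name in theta_base
    }


def reattach(theta_base_cpt, delta_instruct, gamma=1.0):
    """theta_final = theta_base_cpt + gamma * Delta_instruct, per named tensor."""
    if theta_base_cpt.keys() != delta_instruct.keys():
        raise ValueError("CPT'd base and chat vector must share parameter names")
    final = {}
    for name, weights in theta_base_cpt.items():
        delta = delta_instruct[name]
        if len(weights) != len(delta):
            raise ValueError(f"shape mismatch for {name}")
        final[name] = [w + gamma * d for w, d in zip(weights, delta)]
    return final


if __name__ == "__main__":
    base = {"w": [1.0, 2.0]}
    instruct = {"w": [1.5, 1.0]}
    base_cpt = {"w": [2.0, 2.0]}  # stand-in for the base after domain CPT
    delta = chat_vector(instruct, base)
    for gamma in (0.8, 1.0, 1.2):  # the sweep range named in the text
        print(gamma, reattach(base_cpt, delta, gamma))
```

The name and shape checks are there because the gate table in §5 lists a mismatched checkpoint version as one of the ways the reattach step goes wrong; failing loudly at merge time is cheaper than discovering it in an eval.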
Option B′ — CPT literally on the instruct checkpoint’s own weights (LoRA-constrained + replay), fallback only. Use only if the base weights are genuinely unavailable. Run CPT with LoRA directly on the instruct checkpoint (Biderman et al.’s settings above), add 15–25% replay, and treat the stage-boundary eval as the primary gate rather than a confirmation — LoRA bounds but does not eliminate forgetting (their own math-domain ablation shows r=256 forgetting nearly as much as full-FT). No paper in this set tests “LoRA-CPT directly on an instruct checkpoint + later vector repair” as a combined recipe — this is a reasonable compositional inference from two separately-verified findings, not a directly validated one. Confidence: medium, flag it as such to the team.
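The settings quoted for Option B′ can be collected into one place so a run cannot silently drift from them. This is a sketch only: the dictionary keys are hypothetical and not tied to any PEFT library's configuration schema, and the values are the ones this chapter attributes to Biderman et al. (α = 2r, LR by rank, all transformer modules, cosine schedule with warmup, bf16) plus the 15–25% replay range from the paragraph above.

```python
# Learning rates per LoRA rank, as quoted in the decision table in section 3.
LR_BY_RANK = {16: 2e-4, 64: 2e-4, 256: 1e-4}


def lora_cpt_settings(rank):
    """Return an illustrative settings dict for LoRA-constrained CPT at a given rank."""
    if rank not in LR_BY_RANK:
        raise ValueError(f"no quoted learning rate for rank {rank}; run a sweep")
    return {
        "rank": rank,
        "alpha": 2 * rank,  # alpha = 2r
        "learning_rate": LR_BY_RANK[rank],
        "target_modules": "all transformer modules",  # not attention-only
        "lr_schedule": "cosine with warmup",
        "dtype": "bf16",
        "replay_fraction_range": (0.15, 0.25),  # from Option B' in this chapter
    }


if __name__ == "__main__":
    for r in (16, 64, 256):
        print(lora_cpt_settings(r))
```

Refusing ranks with no quoted learning rate is a deliberate choice here: the table rates any transferred LR number as medium confidence, so an unlisted rank should trigger a sweep, not an interpolation.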
Recommendation for a Qwen/Llama-class dense instruct model with a raw cybersecurity corpus: default to Option A. A clean base checkpoint is almost certainly available, it’s the best-evidenced path, and it costs no extra compute versus the CPT→SFT→RL order this book already argues for. Reach for Option B only if the second SFT run is the thing you’re specifically trying to avoid. Reserve Option B′ for the case where base weights are truly unavailable, and treat its output as provisional until it clears the eval in §5.
5. How to verify it didn’t break
Generic downstream-benchmark improvement is not evidence that alignment survived. Per §2, safety and instruction-following can degrade even while target-domain accuracy rises, because the degradation concentrates in a narrow behavioral slice (early-token safety distribution, chat-template format) that a domain benchmark never probes. This is the same trap Diagnosing the gap warns about generally: a single aggregate number collapses a heterogeneous verdict into a false single sentence. Run the probes below at the CPT stage boundary: immediately after the CPT run (Option A/B′) or immediately after the chat-vector reattach (Option B), before SFT/RL starts, and not only at the end of the pipeline. This is the same “gate before you spend more compute” logic this book’s stage-wise eval protocol already uses elsewhere.
- Instruction-following retention — IFEval. Zhou, Lu, Mishra, Brahma, Basu, Luan, Zhou & Hou (Google), “Instruction-Following Evaluation for Large Language Models,” arXiv:2311.07911. ~500 prompts, 25 automatically/objectively verifiable instruction types (“write >400 words,” JSON-only, no commas). Chosen deliberately because it’s format/constraint-based, not LLM-judged — it directly measures the exact failure mode both forgetting papers observed (repetition, chat-template drift, “does what I asked” reliability), with no evaluator-model cost or bias.
- General-capability retention — MMLU (with a saturation caveat). Hendrycks et al., arXiv:2009.03300. Checks whether general world-knowledge/reasoning survived while domain knowledge was gained — a separate axis from IFEval, since a model can retain facts while completely losing instruction-following (exactly the Li & Lee finding), so both numbers are needed, not one. Log the caveat this project already applies to saturated benchmarks: MMLU is heavily contaminated/near-ceiling for current-generation models, so treat a small delta as necessary-but-not-sufficient and prefer a fresher check (GPQA, or a held-out-recent slice) as a secondary cross-check when budget allows.
- Domain-knowledge gain, on the same held-out set pre- and post-CPT. This is the thing CPT was for: a flat IFEval/MMLU alongside a flat or negative domain-knowledge delta means the CPT run bought nothing, and corpus quality/dedup is the thing to check, not the preservation hyperparameters. Use identical prompt formatting, chat template, and decoding parameters across every probe at every stage boundary; a mismatched sampling temperature or template version manufactures false deltas on its own.
- Safety/refusal retention, specifically adversarial, not just vanilla. Per Qi et al.’s shallow-alignment result (arXiv:2406.05946), “still refuses the standard red-team prompt” is not evidence safety alignment survived — check refusal robustness under adversarial-suffix/prefill-style probes, using Qi et al.’s (arXiv:2310.03693) small adversarially-designed probe-set methodology. This is not optional for a cybersecurity CPT corpus specifically — it’s exactly the domain where a refusal-behavior check matters most.
- Spot-check generations, don’t trust one aggregate number. The “spurious forgetting” finding (Zheng et al., OpenReview:ScI7IlKGdI; peer-reviewed as an ICLR 2025 poster but not yet independently replicated, so treat it as promising rather than settled) implies that some measured “forgetting” is an elicitation artifact: the model may still know the answer but phrase/format it differently post-CPT, tanking a strict-match score without real knowledge loss. Regardless of that paper’s own validity, the practical implication holds on general grounds: hand-inspect a sample of IFEval failures before declaring real forgetting, especially at a stage-boundary gate where you’re deciding whether to proceed or roll back.
Gate logic:
| Signal pattern | Read |
|---|---|
| IFEval flat, MMLU flat, domain-QA improved | Proceed to SFT/RL |
| IFEval drops hard, MMLU flat | Classic direct-CPT-on-instruct forgetting signature (matches §2 exactly). Option A: re-check you actually started from base, not instruct. Option B: re-check the reattach step (wrong γ, mismatched checkpoint version, dtype mismatch between CPT’d base and vector source). Option B′: expected to some degree — proceed only if domain-QA gain justifies it, or raise replay ratio and retrain |
| Both flat, domain-QA flat too | CPT taught nothing — check corpus size/quality/dedup before touching preservation hyperparameters |
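The gate table can be written as a small decision function. This is a sketch under stated assumptions: deltas are post-CPT minus pre-CPT scores expressed as fractions, and the thresholds `tolerance`, `hard_drop`, and `min_gain` are hypothetical defaults that must be pilot-calibrated, as the confidence summary notes. The function name and the returned labels are illustrative.

```python
def stage_gate(ifeval_delta, mmlu_delta, domain_delta,
               tolerance=0.01, hard_drop=0.05, min_gain=0.01):
    """Map (post - pre) score deltas to one of the gate-table readings."""

    def is_flat(delta):
        return abs(delta) <= tolerance

    # Row 2: instruction-following collapses while general knowledge holds.
    if ifeval_delta <= -hard_drop and is_flat(mmlu_delta):
        return "instruct_forgetting"
    if is_flat(ifeval_delta) and is_flat(mmlu_delta):
        # Row 1: retention held and the domain probe improved.
        if domain_delta >= min_gain:
            return "proceed"
        # Row 3: retention held but CPT bought no domain gain.
        return "cpt_taught_nothing"
    # Any pattern the table does not cover goes to manual inspection.
    return "inspect_manually"


if __name__ == "__main__":
    print(stage_gate(0.0, 0.005, 0.08))
    print(stage_gate(-0.20, 0.0, 0.08))
    print(stage_gate(0.0, 0.0, 0.0))
    print(stage_gate(-0.03, -0.04, 0.10))
```

The function covers only the three axes in the table. The adversarial safety probe and the hand inspection of failures from the list above sit outside it and should be run and recorded separately before acting on a "proceed" result.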
This ties back into the ordering point from the frontier recipe: CPT is a pre-alignment stage, not a fine-tuning add-on layered after alignment. The evidence in §2 says the opposite order — CPT after alignment, naively, on the instruct checkpoint’s own weights — is the one failure mode every source converges on. The change this chapter makes concrete to that stage diagram: gradient updates from raw-corpus next-token training should land on the base weights (or be LoRA-isolated with explicit forgetting mitigation if base is genuinely unavailable), not be bolted directly onto the instruct checkpoint as an afterthought — and the IFEval+MMLU+domain-QA gate above is the mechanism that catches it early if that constraint is violated.
Confidence summary
| Claim | Confidence | Basis |
|---|---|---|
| Direct CPT on an instruct/RLHF checkpoint degrades instruction-following/format reliability | High | Two independent papers (2401.03129, 2410.10739), different model families/years, same conclusion |
| What breaks is disproportionately format/alignment, not raw knowledge | High | Li & Lee; Zheng et al.’s “spurious forgetting” (ICLR 2025, peer-reviewed) |
| CPT-on-base → re-instruct preserves both knowledge and instruction-following | High | Jindal et al. §4.4, across four base/instruct pairs (Llama-3, Llama-3.1, Qwen2, Qwen2.5) |
| Loss curves don’t reveal instruct-layer damage in real time | Medium | Mousavi et al., single group, very recent |
| Benign fine-tuning/CPT data erodes safety alignment without malicious intent, and safety is shallow | High | Qi et al. ×2, both peer-reviewed (ICLR 2024, ICLR 2025 Oral) |
| Forgetting gets worse, not better, with scale (1B–14B range) | High | Luo et al., direct ablations |
| LR re-warm/re-decay + replay matches from-scratch retraining | High | Ibrahim et al., multi-scale validated to 10B params |
| Chat-vector/instruction-residual reattachment lets you skip re-SFT after CPT | Medium-high | Two independent convergent recipes (language-shift + domain-shift) + seminal theory; no cybersecurity corpus tested by either |
| LoRA-constrained CPT bounds but doesn’t eliminate forgetting, and doesn’t fully close the domain-learning gap | High (general claim) / Medium (numbers transferring to a security corpus) | Biderman et al., rigorous but only code/math domains tested |
| Model merging (TIES/DARE) is a repair option but NOT a reliable fix at scale | High (core claims) / Medium (large-scale reliability caveat) | TIES/DARE peer-reviewed and reproducible; Harmon et al.’s ICLR 2026 submission (not yet peer-reviewed) reports merging “does not reliably mitigate forgetting” |
| IFEval + MMLU + domain-QA is the right stage-boundary gate | High (why these three axes) / Medium (universal numeric tolerance — must be pilot-calibrated) | Directly matches the failure modes documented above; MMLU-saturation caveat is a known general concern |
| “Spurious forgetting” (metric artifact vs. real capability loss) is a real confound to check for | Low-medium (flagging only) | Single ICLR 2025 poster, peer-reviewed but not independently replicated; included for the practical “spot-check generations” implication only |
Cross-links
- The path to a frontier cybersecurity model: this chapter is the “how” underneath that chapter’s Stage 0 (domain CPT); read it first for why CPT is Stage 0 at all.
- Diagnosing the gap — a scientific framework — the competence/performance split this chapter borrows to explain why a post-CPT instruction-following collapse is an elicitation failure, not a knowledge gap.
- What the frontier labs actually do — the broader stage-ordering pattern (SFT → RL, imitation before exploration) this chapter’s CPT-before-alignment argument is a specific instance of.
- PEFT is orthogonal — general LoRA/QLoRA/DoRA mechanics; this chapter’s LoRA-CPT numbers are the CPT-specific instantiation of that general knob.