PEFT is orthogonal — LoRA · QLoRA · DoRA
Common confusion worth killing outright: PEFT is not a fine-tuning method — it’s a mechanism for applying one. Any of SFT / DPO / GRPO / RLVR can be delivered full-parameter or via a PEFT adapter. It changes which parameters get gradients and how much memory you burn, not what signal you learn from.
The methods
- LoRA: freeze W and train a low-rank update ΔW = B·A (rank r), so y = Wx + (α/r)·(BA)x. Only A and B get gradients (Hu et al., arXiv:2106.09685).
- QLoRA: quantize the frozen base to 4-bit NF4 and keep the adapters in BF16; lets a large base fit a small GPU (Dettmers et al., arXiv:2305.14314).
- DoRA: decompose the weight into magnitude and direction, train the magnitude directly and apply the low-rank update to the direction, for a bit more accuracy at roughly the same budget (arXiv:2402.09353).
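The LoRA forward pass and its parameter savings can be sketched in a few lines. This is a minimal pure-Python illustration, not a training implementation: the helper names (`matvec`, `lora_forward`, `param_counts`) and the toy dimensions are assumptions made for the example. The init follows the LoRA paper (A small random, B zero), so the adapter is a no-op at step 0.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """y = Wx + (alpha / r) * B(Ax). W is frozen; only A and B would train."""
    r = len(A)                          # rank = number of rows of A
    base = matvec(W, x)                 # frozen path
    update = matvec(B, matvec(A, x))    # low-rank path; BA is never materialized
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def param_counts(d_out, d_in, r):
    """Trainable parameters: (full fine-tuning, rank-r LoRA adapter)."""
    return d_out * d_in, r * (d_in + d_out)

random.seed(0)
d_out, d_in, r, alpha = 6, 8, 2, 4.0    # toy sizes, chosen for illustration
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # small random init
B = [[0.0] * r for _ in range(d_out)]                                 # zero init: ΔW = 0 at step 0
x = [random.gauss(0, 1) for _ in range(d_in)]

print(lora_forward(W, A, B, x, alpha) == matvec(W, x))  # adapter is a no-op before training
print(param_counts(4096, 4096, 16))                     # (16777216, 131072)
```

For a single 4096×4096 projection at rank 16, the adapter trains 128× fewer parameters than the full matrix, which is where the memory savings on gradients and optimizer state come from.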
Two engineering facts that matter
- It’s a knob on top of a method. “Should I do LoRA or GRPO?” is a category error — you do GRPO, via LoRA. On-policy distillation reproductions run rank-128 LoRA; OpenAI/Fireworks/Google Vertex customer-RFT products are LoRA-first — LoRA lives in the applied/enterprise fine-tuning layer. The frontier labs post-train their own flagship checkpoints full-parameter (Llama/Qwen/DeepSeek/GPT/Claude/Gemini reports; verified 2026 pass).
- LoRA reduces forgetting (low-rank update can’t move W far → less overwrite of pretrained knowledge — relevant to the small-model overwrite problem in Imitation), but it is startlingly learning-rate-sensitive: the 2026 unified LoRA-variant study finds LoRA responds far more to LR than to which variant you pick, and a well-tuned vanilla LoRA matches or beats most fancy variants (arXiv:2601.22708). Tune LR before you tune adapter architecture.
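The "tune LR first" advice can be sketched as a two-stage sweep plan. Everything below is hypothetical scaffolding: the search range (1e-5 to 1e-3), the rank, the variant names, and the config dict layout are assumptions for illustration, not values from the cited study or from any trainer's API.

```python
import math

def log_spaced(lo, hi, n):
    """n learning rates spaced evenly in log10 between lo and hi, inclusive."""
    if n < 2:
        raise ValueError("need at least two points")
    lo_exp, hi_exp = math.log10(lo), math.log10(hi)
    step = (hi_exp - lo_exp) / (n - 1)
    return [10 ** (lo_exp + i * step) for i in range(n)]

# Assumed search range; the right band depends on model, rank, and batch size.
LRS = log_spaced(1e-5, 1e-3, 5)

# Stage 1: vanilla LoRA only; learning rate is the single swept axis.
stage_one = [{"variant": "lora", "rank": 16, "lr": lr} for lr in LRS]

def stage_two(best_lr, variants=("lora", "dora"), factor=3.0):
    """Compare variants only inside a narrow band around the tuned LR."""
    band = [best_lr / factor, best_lr, best_lr * factor]
    return [{"variant": v, "rank": 16, "lr": lr} for v in variants for lr in band]

# Placeholder: in practice best_lr comes from your own stage-1 eval results.
best_lr = LRS[2]
for cfg in stage_two(best_lr):
    print(cfg)
```

The design choice is that the adapter variant only becomes a swept axis after the learning rate has been pinned down, so a variant never "wins" just because the baseline ran at a bad LR.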
Practical default for your scale
At up to roughly 9–16B parameters, LoRA/QLoRA is the sane default for iteration cost; go full-parameter only when you have a concrete reason. Measured OOM headroom aside, the project’s stance is LoRA-by-default, with full-FT as a deliberate escalation (see lessons/post-training/ in shared memory). It composes with every method chapter here.
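A back-of-envelope memory estimate shows why the default leans this way. The sketch below is a rough approximation under stated assumptions: mixed-precision Adam at about 16 bytes per trainable parameter (BF16 weight and gradient, plus FP32 master copy and two Adam moments), a BF16 frozen base at 2 bytes per parameter, a 4-bit base at 0.5 bytes per parameter with quantization constants ignored, and activations excluded entirely. The 50M adapter size is a hypothetical figure.

```python
BYTES_PER_TRAINABLE = 16.0   # assumption: bf16 weight + bf16 grad + fp32 master + Adam m, v

def full_ft_gb(n_params):
    """Weights + gradients + optimizer state when every parameter trains."""
    return n_params * BYTES_PER_TRAINABLE / 1e9

def adapter_gb(n_params, n_adapter, base_bytes):
    """Frozen base at base_bytes per parameter, plus a fully trained adapter."""
    return (n_params * base_bytes + n_adapter * BYTES_PER_TRAINABLE) / 1e9

n_base = 9e9       # 9B-parameter base model
n_adapter = 50e6   # hypothetical adapter size

print(f"full-FT: {full_ft_gb(n_base):.1f} GB")
print(f"LoRA   : {adapter_gb(n_base, n_adapter, base_bytes=2.0):.1f} GB")
print(f"QLoRA  : {adapter_gb(n_base, n_adapter, base_bytes=0.5):.1f} GB")
```

Under these assumptions a 9B model needs about 144 GB for full fine-tuning, about 18.8 GB with LoRA, and about 5.3 GB with QLoRA, before activations. Real numbers shift with sequence length, batch size, and optimizer choice, so treat this as an order-of-magnitude check only.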