feat: add CLI runner + training recipe docs
Browse files
ai-ml/hf-finetuning/TRAINING_RECIPE.md
ADDED
|
@@ -0,0 +1,58 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Model Enhancement — Dataset & Training Recipe vNext
|
| 2 |
+
|
| 3 |
+
## What Changed (v1 → v2)
|
| 4 |
+
|
| 5 |
+
| Parameter | v1 (Old) | v2 (LoRA Without Regret) | Why |
|
| 6 |
+
|-----------|----------|--------------------------|-----|
|
| 7 |
+
| **Dataset** | ultrachat_200k (5K subset) | **tulu-3-sft-mixture** (940K) | 19 curated sources > single source |
|
| 8 |
+
| **LoRA r** | 16 | **256** | SFT-scale datasets need r=256 to match full FT |
|
| 9 |
+
| **LoRA alpha** | 32 | **16** | Stable scaling with high rank |
|
| 10 |
+
| **Target modules** | q/k/v/o_proj only | **all-linear** | Attention-only underperforms even at higher rank |
|
| 11 |
+
| **Effective batch** | 32 | **16** | LoRA less tolerant of large batches |
|
| 12 |
+
| **Learning rate** | 2e-4 | **2e-4** (same) | 10x full FT rate — correct in v1 |
|
| 13 |
+
| **Packing** | False | **True (bfd_split)** | Preserves all tokens, 2-3x throughput |
|
| 14 |
+
| **assistant_only_loss** | False | **True** | Loss only on assistant tokens |
|
| 15 |
+
| **EOS token** | Not set | **<\|eot_id\|>** | Llama 3.1 chat template |
|
| 16 |
+
| **LR scheduler** | linear | **cosine** | Better convergence for LoRA |
|
| 17 |
+
| **Epochs** | 3 | **1** | 940K examples = 1 epoch sufficient |
|
| 18 |
+
|
| 19 |
+
## Dataset Comparison
|
| 20 |
+
|
| 21 |
+
| Dataset | Size | Format | Best For | Quality |
|
| 22 |
+
|---------|------|--------|----------|---------|
|
| 23 |
+
| **tulu-3-sft-mixture** | 940K | messages ✅ | General SFT (code, math, IF, safety, science) | ⭐⭐⭐⭐⭐ |
|
| 24 |
+
| **OpenThoughts-114k** | 114K | conversations (needs conversion) | Reasoning, CoT traces | ⭐⭐⭐⭐ |
|
| 25 |
+
| ultrachat_200k | 200K | messages ✅ | Multi-turn chat baseline | ⭐⭐⭐ |
|
| 26 |
+
|
| 27 |
+
## Key Research: "LoRA Without Regret" (Schulman et al., 2025)
|
| 28 |
+
|
| 29 |
+
Four findings that change how we fine-tune:
|
| 30 |
+
|
| 31 |
+
1. **Target ALL linear layers** — not just attention. Increasing rank does NOT compensate for skipping layers.
|
| 32 |
+
2. **Use r=256 for SFT** — sufficient capacity for post-training scale datasets.
|
| 33 |
+
3. **Use 10x higher LR** (2e-4 vs 2e-5 for full FT) — 1/r scaling makes optimal LR rank-independent.
|
| 34 |
+
4. **Keep batch size < 32** — LoRA is less tolerant of large batches. Cannot be mitigated by increasing rank.
|
| 35 |
+
|
| 36 |
+
## Recommended Training Matrix
|
| 37 |
+
|
| 38 |
+
### SFT (Supervised Fine-Tuning)
|
| 39 |
+
|
| 40 |
+
| Model | Dataset | Hardware | Time | Cost |
|
| 41 |
+
|-------|---------|----------|------|------|
|
| 42 |
+
| Llama-3.1-8B-Instruct | tulu-3-sft (940K) | A100 (80GB) | ~6h | ~$24 |
|
| 43 |
+
| Llama-3.1-8B-Instruct | OpenThoughts-114k | A100 (80GB) | ~2h | ~$8 |
|
| 44 |
+
| Llama-3.1-8B-Instruct | tulu-3-sft (940K) | A10G (24GB) + QLoRA | ~12h | ~$24 |
|
| 45 |
+
|
| 46 |
+
### GRPO (Reinforcement Learning)
|
| 47 |
+
|
| 48 |
+
| Model | Dataset | LoRA r | Hardware |
|
| 49 |
+
|-------|---------|--------|----------|
|
| 50 |
+
| Qwen3-0.6B | OpenR1-Math-220k | 1 | A100 |
|
| 51 |
+
| Llama-3.1-8B-Base | GSM8k | 1-32 | A100 |
|
| 52 |
+
|
| 53 |
+
## Source Attribution
|
| 54 |
+
|
| 55 |
+
- LoRA Without Regret: Schulman et al., 2025, Thinking Machines Lab
|
| 56 |
+
- tulu-3-sft-mixture: Allen AI, used by Tulu 3 (SOTA open instruction-tuned)
|
| 57 |
+
- OpenThoughts-114k: Open community, reasoning-heavy CoT data
|
| 58 |
+
- LoRA Land: Predibase 2024, 224/310 LoRA models surpassed GPT-4
|