shaikhsalman commited on
Commit
8d1369e
·
verified ·
1 Parent(s): 4019af4

feat: add CLI runner + training recipe docs

Browse files
ai-ml/hf-finetuning/TRAINING_RECIPE.md ADDED
@@ -0,0 +1,58 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Model Enhancement — Dataset & Training Recipe vNext
2
+
3
+ ## What Changed (v1 → v2)
4
+
5
+ | Parameter | v1 (Old) | v2 (LoRA Without Regret) | Why |
6
+ |-----------|----------|--------------------------|-----|
7
+ | **Dataset** | ultrachat_200k (5K subset) | **tulu-3-sft-mixture** (940K) | 19 curated sources > single source |
8
+ | **LoRA r** | 16 | **256** | SFT-scale datasets need r=256 to match full FT |
9
+ | **LoRA alpha** | 32 | **16** | Stable scaling with high rank |
10
+ | **Target modules** | q/k/v/o_proj only | **all-linear** | Attention-only underperforms even at higher rank |
11
+ | **Effective batch** | 32 | **16** | LoRA less tolerant of large batches |
12
+ | **Learning rate** | 2e-4 | **2e-4** (same) | 10x full FT rate — correct in v1 |
13
+ | **Packing** | False | **True (bfd_split)** | Preserves all tokens, 2-3x throughput |
14
+ | **assistant_only_loss** | False | **True** | Loss only on assistant tokens |
15
+ | **EOS token** | Not set | **<\|eot_id\|>** | Llama 3.1 chat template |
16
+ | **LR scheduler** | linear | **cosine** | Better convergence for LoRA |
17
+ | **Epochs** | 3 | **1** | 940K examples = 1 epoch sufficient |
18
+
19
+ ## Dataset Comparison
20
+
21
+ | Dataset | Size | Format | Best For | Quality |
22
+ |---------|------|--------|----------|---------|
23
+ | **tulu-3-sft-mixture** | 940K | messages ✅ | General SFT (code, math, IF, safety, science) | ⭐⭐⭐⭐⭐ |
24
+ | **OpenThoughts-114k** | 114K | conversations (needs conversion) | Reasoning, CoT traces | ⭐⭐⭐⭐ |
25
+ | ultrachat_200k | 200K | messages ✅ | Multi-turn chat baseline | ⭐⭐⭐ |
26
+
27
+ ## Key Research: "LoRA Without Regret" (Schulman et al., 2025)
28
+
29
+ Four findings that change how we fine-tune:
30
+
31
+ 1. **Target ALL linear layers** — not just attention. Increasing rank does NOT compensate for skipping layers.
32
+ 2. **Use r=256 for SFT** — sufficient capacity for post-training scale datasets.
33
+ 3. **Use 10x higher LR** (2e-4 vs 2e-5 for full FT) — 1/r scaling makes optimal LR rank-independent.
34
+ 4. **Keep batch size < 32** — LoRA is less tolerant of large batches. Cannot be mitigated by increasing rank.
35
+
36
+ ## Recommended Training Matrix
37
+
38
+ ### SFT (Supervised Fine-Tuning)
39
+
40
+ | Model | Dataset | Hardware | Time | Cost |
41
+ |-------|---------|----------|------|------|
42
+ | Llama-3.1-8B-Instruct | tulu-3-sft (940K) | A100 (80GB) | ~6h | ~$24 |
43
+ | Llama-3.1-8B-Instruct | OpenThoughts-114k | A100 (80GB) | ~2h | ~$8 |
44
+ | Llama-3.1-8B-Instruct | tulu-3-sft (940K) | A10G (24GB) + QLoRA | ~12h | ~$24 |
45
+
46
+ ### GRPO (Reinforcement Learning)
47
+
48
+ | Model | Dataset | LoRA r | Hardware |
49
+ |-------|---------|--------|----------|
50
+ | Qwen3-0.6B | OpenR1-Math-220k | 1 | A100 |
51
+ | Llama-3.1-8B-Base | GSM8k | 1-32 | A100 |
52
+
53
+ ## Source Attribution
54
+
55
+ - LoRA Without Regret: Schulman et al., 2025, Thinking Machines Lab
56
+ - tulu-3-sft-mixture: Allen AI, used by Tulu 3 (SOTA open instruction-tuned)
57
+ - OpenThoughts-114k: Open community, reasoning-heavy CoT data
58
+ - LoRA Land: Predibase 2024, 224/310 LoRA models surpassed GPT-4