## 32 GB Didn’t Burn Up: Real Levers for Stable Training on MI50

Here are the settings you can tweak yourself for successful training of a large model (8.5B) on an MI50 (32 GB), with an explanation of each:

---
### 1. `model.to(cpu)` **before FSDP**
- **Rule:** FSDP with `cpu_offload=True` requires the model to live on the CPU.
- **Why:** If you leave the model on the GPU, FSDP tries to unshard all the weights (~17 GB) in GPU memory, triggering the 768 MiB OOM error.
- **Action:** **Never touch this line.** Omitting it means an instant crash (see the sketch below).
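- **Sketch:** a minimal, hedged example of the order of operations (not the repo's actual script; `build_model()` is a stand-in for however you construct the 8.5B model):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

# Single-process "distributed" init so FSDP can be used on one MI50 (NCCL maps to RCCL on ROCm).
dist.init_process_group(backend="nccl", init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=1)

model = build_model()    # hypothetical constructor: build your 8.5B model here
model.to("cpu")          # critical: the full model must sit on the CPU *before* wrapping

fsdp_model = FSDP(
    model,
    cpu_offload=CPUOffload(offload_params=True),   # parameters stay on the CPU (see setting 6)
    device_id=torch.cuda.current_device(),         # compute still runs on the MI50
)
```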
---

### 2. `ACCUM_STEPS = 32`
- **Batch:** You train with `batch=1`, but 32 accumulation steps give an effective batch of 32.
- **Reason:** It is the only way to fit 8.5B in 32 GB:
  - Activations: ~1.8 GB per step
  - With gradients/cache: up to 4–5 GB
  - FSDP shards/offloads the rest: ~25 GB stays free
- **Tip:** Leave it at 32. Lower it only for stability, at the cost of throughput (see the loop sketch below).
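- **Sketch:** the accumulation bookkeeping on its own, as a hedged example (`fsdp_model`, `optimizer`, and `loader` are placeholders for the objects the real script builds; the model is assumed to return plain next-token logits):

```python
import torch.nn.functional as F

ACCUM_STEPS = 32    # effective batch = BATCH_SIZE (1) * ACCUM_STEPS = 32

optimizer.zero_grad(set_to_none=True)
for step, batch in enumerate(loader):
    # Assumed shape: (BATCH_SIZE=1, TRAIN_SEQ_LEN=2048) token ids -- adjust to your loader.
    input_ids = batch["input_ids"].cuda(non_blocking=True)

    logits = fsdp_model(input_ids)                    # assumed to return (B, T, vocab) logits
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predict token t+1 from token t
        input_ids[:, 1:].reshape(-1),
    )
    (loss / ACCUM_STEPS).backward()                   # scale so 32 micro-steps add up to one real step

    if (step + 1) % ACCUM_STEPS == 0:                 # update weights once per 32 micro-steps
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```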
---

### 3. `TRAIN_SEQ_LEN = 2048`
- **Explanation:** Sequence length has a massive impact:
  - At 4096 it eats ~2.5 GB of activations
  - At 8192 it eats 5+ GB!
  - At 2048 it is only ~1.8 GB of activations: no crash, no loss in quality
- **Tip:** 2048 is the golden mean. You can try 3072 at your own risk.
---

### 4. `BATCH_SIZE = 1`
- **Why:** On a single GPU you cannot fit more.
- **FSDP:** Won’t help until you have 2+ GPUs.
- **Effective batch:** With `ACCUM_STEPS=32`, it behaves like a batch of 32.
- **Tip:** Don’t touch this until you have a second MI50.
---

### 5. `learning_rate = 5e-6` vs `5e-5`
- **LoRA/adapter:** 5e-6 is fine.
- **Full fine-tune from scratch:** You need a higher rate (5e-5) for faster convergence.
- **Fact:** You’re starting from scratch, so use 5e-5 for speed (see the sketch below).
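- **Sketch:** where the rate plugs in, as a hedged example (AdamW and its hyperparameters are assumptions, not taken from the repo):

```python
import torch

LEARNING_RATE = 5e-5    # full fine-tune from scratch; drop to 5e-6 for LoRA/adapters

# With FSDP, build the optimizer *after* wrapping, over fsdp_model.parameters().
optimizer = torch.optim.AdamW(
    fsdp_model.parameters(),
    lr=LEARNING_RATE,
    betas=(0.9, 0.95),    # common GPT-style choice; an assumption
    weight_decay=0.1,     # likewise an assumption
)
```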
---

### 6. `cpu_offload=True`
- **Purpose:** Lets parameters reside on the CPU; only activations stay on the GPU.
- **Effect:** Slower, but it works. **Don’t turn it off.** Without it, 8.5B NEVER fits.
---

### 7. `mixed_precision = bfloat16`
- **For ROCm:** BF16 is ideal; FP16 is less stable.
- **Tip:** You have BF16, so keep it (see the sketch below).
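- **Sketch:** one way to express the BF16 policy for FSDP (a hedged example; the real script may set precision differently):

```python
import torch
from torch.distributed.fsdp import MixedPrecision

bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,     # parameters are computed in bf16
    reduce_dtype=torch.bfloat16,    # gradient reduction also in bf16
    buffer_dtype=torch.bfloat16,    # buffers in bf16
)

# Passed to the wrapper from setting 1, e.g.:
#   FSDP(model, cpu_offload=CPUOffload(offload_params=True),
#        mixed_precision=bf16_policy, device_id=torch.cuda.current_device())
```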
---

### 8. `step = 512` (in your dataset)
- **Meaning:** The number of tokens per sample (chunk size).
- **Tradeoff:** Fewer tokens (e.g. 256) mean more stability but less speed. 512 tokens is fine (see the chunking sketch below).
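- **Sketch:** one way to read `step` as a chunk size over a flat token stream (a hedged example; `all_tokens` stands in for your tokenized corpus, and the real dataset code may differ):

```python
STEP = 512    # tokens per sample (chunk size)

def chunk_tokens(all_tokens: list[int], step: int = STEP) -> list[list[int]]:
    """Slice a flat token stream into consecutive, non-overlapping chunks of `step` tokens."""
    return [all_tokens[i:i + step]
            for i in range(0, len(all_tokens) - step + 1, step)]

# samples = chunk_tokens(all_tokens)    # every sample is exactly 512 token ids
```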
---

#### What I Didn’t Touch
- RoPE, RMSNorm, SwiGLU: as in the original code
- past_kv: works, just like in your gpt_modern_8b.py
- Gbabko's signature: buffered, untouched
- Saving: only to `build/fine_tune_8b/step_X/pytorch_model.bin`
---

## Your Ideal MI50 (32 GB) Settings

| Parameter       | Value    | Why                               |
|-----------------|----------|-----------------------------------|
| `model.to(cpu)` | yes      | Without it: OOM                   |
| `ACCUM_STEPS`   | 32       | Effective batch = 32              |
| `TRAIN_SEQ_LEN` | 2048     | Speed/memory balance              |
| `BATCH_SIZE`    | 1        | Physically won't fit more         |
| `LEARNING_RATE` | 5e-5     | For a full fine-tune from scratch |
| `CPU_OFFLOAD`   | True     | Salvation from GPU death          |
| `PRECISION`     | bfloat16 | Maximum speed on ROCm             |
---
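As one consolidated sketch, here are the table's values as constants at the top of a training script (the names follow the table; this is illustrative, not the repo's actual config):

```python
import torch

# Ideal MI50 (32 GB) settings from the table above.
ACCUM_STEPS   = 32                # effective batch = 32
TRAIN_SEQ_LEN = 2048              # speed/memory balance
BATCH_SIZE    = 1                 # physically won't fit more
LEARNING_RATE = 5e-5              # full fine-tune from scratch
CPU_OFFLOAD   = True              # parameters live on the CPU
PRECISION     = torch.bfloat16    # maximum speed on ROCm

# model.to("cpu") before the FSDP wrap is still mandatory (setting 1).
```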
*If you want, I’ll give you a script with 5e-5, 2048, 32 – it’s guaranteed to work. Or LoRA for even faster performance: just ask!*