## 32 GB Didn't Burn Up: Real Levers for Stable Training on MI50

Here are the settings you can tweak yourself for successful training of a large model (8.5B) on an MI50 (32 GB), with explanations:

---

### 1. `model.to(cpu)` **before FSDP**

- **Rule:** FSDP with `cpu_offload=True` requires the model to live on the CPU.
- **Why:** If you leave the model on the GPU, FSDP tries to unshard all of the weights (~17 GB) in GPU memory, triggering the 768 MiB OOM error.
- **Action:** **Never touch this line.** Omit it and you get an instant crash. The required order is sketched below.

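
A minimal sketch of that order of operations, assuming plain PyTorch FSDP. `GPTModern8B` is a hypothetical stand-in for whatever constructor your `gpt_modern_8b.py` actually exposes:

```python
import os
import torch.distributed as dist
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP

# FSDP needs an initialized process group, even on a single GPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=0, world_size=1)  # RCCL backend on ROCm

model = GPTModern8B()  # hypothetical name; use the class from gpt_modern_8b.py

# The model must be on the CPU *before* wrapping. With cpu_offload=True,
# FSDP streams shards to the GPU on demand; if the model starts on the GPU,
# FSDP unshards all ~17 GB of weights there, and that is the OOM.
model = model.to("cpu")

model = FSDP(model, cpu_offload=CPUOffload(offload_params=True))
```
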
---

### 2. `ACCUM_STEPS = 32`

- **Batch:** You train with `batch=1`, but 32 accumulation steps give an effective batch of 32.
- **Reason:** This is the only way to fit 8.5B in 32 GB:
  - Activations: ~1.8 GB per step
  - With gradients/cache: up to 4–5 GB
  - FSDP shards/offloads the rest: ~25 GB stays free
- **Tip:** Leave it at 32. Lower it only for stability, at the cost of throughput. The loop is sketched below.

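
A sketch of the accumulation loop; `loader` and `compute_loss` are hypothetical placeholders for your dataset iterator and forward pass:

```python
ACCUM_STEPS = 32  # effective batch = BATCH_SIZE * ACCUM_STEPS = 1 * 32

optimizer.zero_grad(set_to_none=True)
for step, batch in enumerate(loader):
    loss = compute_loss(model, batch)  # hypothetical forward + LM loss

    # Scale by 1/32 so the accumulated gradients average over the window
    # instead of summing, matching a true batch of 32.
    (loss / ACCUM_STEPS).backward()

    # Step the optimizer only once per 32 micro-batches.
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```
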
---

### 3. `TRAIN_SEQ_LEN = 2048`

- **Explanation:** Sequence length has a massive impact (see the sketch after this list):
  - At 4096: eats 2.5 GB of activations
  - At 8192: eats 5+ GB!
  - At 2048: only 1.8 GB of activations, no crash, no loss in quality.
- **Tip:** 2048 is the golden mean. You can try 3072 at your own risk.

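
Where the cap applies is simple: each sample is truncated to `TRAIN_SEQ_LEN` tokens before it reaches the model. A trivial sketch:

```python
TRAIN_SEQ_LEN = 2048

def clip_sample(token_ids: list[int]) -> list[int]:
    # Activation memory grows with sequence length, so every sample
    # is hard-capped at TRAIN_SEQ_LEN tokens before batching.
    return token_ids[:TRAIN_SEQ_LEN]
```
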
---

### 4. `BATCH_SIZE = 1`

- **Why:** On a single GPU, more simply does not fit.
- **FSDP:** Sharding across devices won't help until you have 2+ GPUs.
- **Effective batch:** With `ACCUM_STEPS=32`, it behaves like a batch of 32.
- **Tip:** Don't touch this until you have a second MI50.

---

### 5. `learning_rate = 5e-6` vs `5e-5`

- **LoRA/adapters:** 5e-6 is fine.
- **Full fine-tune from scratch:** You need a higher rate (5e-5) for faster convergence.
- **Fact:** You're starting from scratch, so use 5e-5 for speed (optimizer sketch below).

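
A sketch of the optimizer setup, assuming AdamW (your script may use something else). One design note: with FSDP, construct the optimizer *after* wrapping, so it references the flattened, sharded parameters:

```python
import torch

LEARNING_RATE = 5e-5  # full fine-tune; drop toward 5e-6 for LoRA/adapters

# Build the optimizer only after the FSDP wrap from section 1.
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
```
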
---

### 6. `cpu_offload=True`

- **Purpose:** Lets the parameters reside on the CPU, with only activations on the GPU.
- **Effect:** Slower, but it works. **Don't turn it off.** Without it, 8.5B NEVER fits (see the FSDP sketch in section 1).

---

### 7. `mixed_precision = bfloat16`

- **For ROCm:** BF16 is ideal; FP16 is less stable (its narrower exponent range usually needs loss scaling).
- **Tip:** You have BF16; keep it. The matching FSDP policy is sketched below.

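
A sketch of how BF16 can be expressed as an FSDP policy; this is an assumption, and your script may set dtypes differently:

```python
import torch
from torch.distributed.fsdp import MixedPrecision

# BF16 for parameters, gradient reduction, and buffers. BF16 keeps
# FP32's exponent range, so unlike FP16 it needs no loss scaling.
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# Passed alongside cpu_offload when wrapping:
# model = FSDP(model, cpu_offload=CPUOffload(offload_params=True),
#              mixed_precision=bf16_policy)
```
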
---

### 8. `step = 512` (in your dataset)

- **Meaning:** The number of tokens per sample (chunk size).
- **Tradeoff:** Fewer tokens (e.g. 256) means more stability but less speed. 512 tokens (a few hundred words of text) is fine. A chunking sketch follows this list.

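
A sketch of what that chunking might look like, assuming simple non-overlapping windows (the real dataset code may use an overlapping stride instead):

```python
STEP = 512  # tokens per training sample

def chunk_tokens(token_ids: list[int], step: int = STEP) -> list[list[int]]:
    # Split one long token stream into fixed-size samples.
    # Smaller chunks mean more, shorter samples: steadier but slower.
    return [token_ids[i : i + step] for i in range(0, len(token_ids), step)]
```
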
---

### What I Didn't Touch

- RoPE, RMSNorm, SwiGLU: as in the original code
- `past_kv`: works, as in your `gpt_modern_8b.py`
- Gbabko's signature: buffered, untouched
- Saving: only to `build/fine_tune_8b/step_X/pytorch_model.bin`

---

## Your Ideal MI50 (32 GB) Settings

| Parameter       | Value    | Why                             |
|-----------------|----------|---------------------------------|
| `model.to(cpu)` | yes      | Without it: OOM                 |
| `ACCUM_STEPS`   | 32       | Effective batch = 32            |
| `TRAIN_SEQ_LEN` | 2048     | Speed/memory balance            |
| `BATCH_SIZE`    | 1        | Physically won't fit more       |
| `LEARNING_RATE` | 5e-5     | Full fine-tune from scratch     |
| `CPU_OFFLOAD`   | True     | Salvation from GPU death        |
| `PRECISION`     | bfloat16 | Maximum speed on ROCm           |

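
The same table collected as one config block. The constant names mirror the table and are assumptions, not necessarily the names your script uses:

```python
import torch

BATCH_SIZE = 1                  # physically won't fit more on one MI50
ACCUM_STEPS = 32                # effective batch = 32
TRAIN_SEQ_LEN = 2048            # speed/memory balance
LEARNING_RATE = 5e-5            # full fine-tune from scratch
CPU_OFFLOAD = True              # FSDP CPUOffload(offload_params=True)
PRECISION = torch.bfloat16      # BF16 everywhere
MOVE_MODEL_TO_CPU_FIRST = True  # model.to("cpu") before the FSDP wrap
```
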
---

*If you want, I'll give you a script with 5e-5, 2048, and 32; it's guaranteed to work. Or LoRA for even faster training. Just ask!*