# Tutorial 13: Troubleshooting & Performance Debugging

## Overview
Even with perfect theory, training often fails in practice. You will encounter **NaN losses**, **Out-Of-Memory (OOM)** errors, **non-convergence**, and **mysteriously slow training**. This tutorial provides a systematic debugging framework to diagnose and fix these issues efficiently.

## Prerequisites
- Experience running training loops (Tutorial 02)
- Basic understanding of CUDA/GPU architecture
- Familiarity with PyTorch debugging tools

---

## 1. The Debugging Mindset

**Rule #1**: Reproduce the issue on the smallest possible scale.
- If it fails on 8 GPUs, try 1 GPU.
- If it fails on 50B tokens, try 1M tokens.
- If it fails with batch size 64, try batch size 2.

**Rule #2**: Isolate the variable.
- Change only one thing at a time (LR, batch size, model size).
- Use a fixed seed (`torch.manual_seed(42)`) so runs are repeatable; some GPU kernels remain nondeterministic unless deterministic algorithms are enabled.
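
As a minimal sketch of Rule #2, the snippet below seeds the relevant random number generators and confirms that two seeded draws match. The helper name `seed_everything` is hypothetical (not part of PyTorch), and the example assumes a single-process run.

```python
import random

import torch


def seed_everything(seed: int = 42) -> None:
    """Seed Python's and PyTorch's default RNGs (hypothetical helper)."""
    random.seed(seed)
    torch.manual_seed(seed)


seed_everything(42)
first = torch.randn(3)
seed_everything(42)
second = torch.randn(3)
print(torch.equal(first, second))  # identical draws after re-seeding
```

If data loading or augmentation uses NumPy, that generator would need seeding too. For stricter repeatability on GPU, `torch.use_deterministic_algorithms(True)` can be enabled, at some cost in speed.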

---

## 2. Diagnosing NaN/Inf Losses

The dreaded `loss = nan` is the most common failure mode.

### Causes & Solutions

#### 2.1 Learning Rate Too High
The optimizer takes steps that are too large, overshooting the minimum until the loss diverges.
- **Symptom**: Loss spikes suddenly then becomes NaN.
- **Fix**: Reduce LR by 10x. Use a learning rate warmup.
```python
# Check for gradient explosion
for name, param in model.named_parameters():
    if param.grad is not None:
        grad_norm = param.grad.norm()
        if torch.isinf(grad_norm) or torch.isnan(grad_norm):
            print(f"Bad Grad: {name}, Norm: {grad_norm}")
```

#### 2.2 Mixed Precision Instability
FP16 has a limited range: normal values span roughly $6 \times 10^{-5}$ to $65504$. Small gradients underflow to 0; large activations overflow to Inf.
- **Symptom**: NaN appears in specific layers (often attention).
- **Fix**: 
  - Use **gradient scaling** (handled automatically by `GradScaler`).
  - Switch to **BF16** (bfloat16) if the hardware supports it (e.g., A100/H100). BF16 keeps FP32's 8-bit exponent, so it covers nearly the same range at lower precision.
  - Disable AMP (Automatic Mixed Precision) while debugging.

```python
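import torch
from torch.cuda.amp import GradScaler

# Prefer BF16 when the GPU supports it; otherwise fall back to FP16
use_bf16 = torch.cuda.is_bf16_supported()
amp_dtype = torch.bfloat16 if use_bf16 else torch.float16

# Loss scaling is only needed for FP16; with BF16 the scaler becomes a no-op
scaler = GradScaler(enabled=not use_bf16)

for batch in dataloader:
    optimizer.zero_grad()

    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        outputs = model(batch)
        loss = outputs.loss

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()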
```
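
The FP16 limits quoted in this subsection can be checked on a CPU with the standard library alone. The sketch below uses `struct`'s half-precision format; `to_fp16` and `overflows_fp16` are hypothetical helper names. It also shows why loss scaling helps: a value that flushes to zero survives once it is multiplied by a scale factor (65536 here, chosen only for illustration).

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def overflows_fp16(x: float) -> bool:
    """True if x is too large to be stored as a finite FP16 value."""
    try:
        struct.pack("<e", x)
        return False
    except (OverflowError, struct.error):
        return True


print(to_fp16(65504.0))          # largest finite FP16 value
print(overflows_fp16(70000.0))   # outside the FP16 range
print(to_fp16(1e-8))             # tiny "gradient" underflows
print(to_fp16(1e-8 * 65536))     # the same value after scaling
```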

#### 2.3 Division by Zero / Log(0)
Common in custom loss functions.
- **Symptom**: Immediate NaN at step 1.
- **Fix**: Add a small epsilon ($\epsilon = 10^{-8}$) to denominators and log inputs. In FP16, $10^{-8}$ rounds to zero, so use a larger epsilon or compute the loss in FP32.
```python
# Bad
loss = -torch.log(probs)

# Good
loss = -torch.log(probs + 1e-8)
```
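
When the probabilities come from a softmax over logits, an alternative to adding epsilon is to compute log-probabilities directly with the log-sum-exp trick, which is what PyTorch's `torch.nn.functional.log_softmax` and `cross_entropy` do on logits. The pure-Python sketch below only illustrates the idea; the function and the extreme logits are made up for the example.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax via the log-sum-exp trick."""
    m = max(logits)  # subtracting the max keeps every exponent <= 0
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


logits = [1000.0, 0.0]  # a naive exp(1000.0) would overflow
log_probs = log_softmax(logits)
nll = -log_probs[1]  # negative log-likelihood of the unlikely class
print(log_probs, nll)
```

The loss stays finite even though the probability of the second class underflows to zero in floating point.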

#### 2.4 Bad Data
Input contains NaNs or Infs.
- **Check**:
```python
assert not torch.isnan(inputs).any(), "Input contains NaNs"
assert not torch.isinf(inputs).any(), "Input contains Infs"
```

### Systematic NaN Hunt Script
Run this when NaN occurs:
```python
import torch

def debug_nan(model, inputs, loss):
    """Report where NaN/Inf shows up. Call after loss.backward()."""
    print(f"Loss: {loss.item()}")

    def is_bad(t):
        return torch.isnan(t).any() or torch.isinf(t).any()

    # 1. Check Inputs
    if is_bad(inputs):
        print("NaN/Inf in Inputs")

    # 2. Check Parameters and 3. their Gradients (populated by backward)
    for name, p in model.named_parameters():
        if is_bad(p):
            print(f"NaN/Inf in Param: {name}")
        if p.grad is not None and is_bad(p.grad):
            print(f"NaN/Inf in Grad: {name}")

    # 4. Check Activations (Hook method)
    def make_hook(name):
        def hook(module, inp, out):
            # Some modules return tuples or dicts; only inspect plain tensors
            if isinstance(out, torch.Tensor) and is_bad(out):
                print(f"NaN/Inf in Output of: {name} ({module.__class__.__name__})")
        return hook

    handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]

    # Re-run forward pass to trigger hooks, then remove them again
    try:
        with torch.no_grad():
            _ = model(inputs)
    finally:
        for h in handles:
            h.remove()
```

---

## 3. Fixing Out-Of-Memory (OOM) Errors

`CUDA out of memory. Tried to allocate X GiB.`

### Step 3.1: Analyze Memory Usage
Use `nvidia-smi` to see total usage. Use PyTorch's memory statistics (or the profiler) to see *what* is using memory.

```python
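import torch

print(f"Allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")  # live tensors
print(f"Reserved: {torch.cuda.memory_reserved()/1e9:.2f} GB")  # held by the caching allocator
print(f"Peak allocated: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")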
```

### Step 3.2: Reduction Strategies

#### A. Reduce Batch Size
Most direct fix. Use **Gradient Accumulation** to maintain effective batch size.
```python
# Instead of batch_size=64 (OOM)
# Use batch_size=8, accumulation_steps=8
effective_batch = 8 * 8  # batch_size * accumulation_steps = 64
```
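
A minimal sketch of an accumulation loop, using a toy linear model and random data as stand-ins for a real setup (all names and sizes here are illustrative). It checks that eight micro-batches of 8 produce the same gradient as one batch of 64, provided each micro-batch loss is divided by the number of accumulation steps.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
inputs, targets = torch.randn(64, 4), torch.randn(64, 1)
micro_batch_size, accumulation_steps = 8, 8

# Reference: gradient of the full 64-sample batch in a single pass.
loss_fn(model(inputs), targets).backward()
full_grad = model.weight.grad.clone()

optimizer.zero_grad()
for i in range(accumulation_steps):
    start = i * micro_batch_size
    x = inputs[start:start + micro_batch_size]
    y = targets[start:start + micro_batch_size]
    # Dividing by accumulation_steps makes the summed gradients equal
    # the gradient of the mean loss over the effective batch.
    loss = loss_fn(model(x), y) / accumulation_steps
    loss.backward()  # gradients add up in .grad across micro-batches
accumulated_grad = model.weight.grad.clone()
optimizer.step()  # one optimizer update per effective batch

print(torch.allclose(accumulated_grad, full_grad, atol=1e-5))
```

The equivalence holds for losses averaged over equally sized micro-batches; layers that depend on batch statistics (such as batch normalization) would still see the smaller batch.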

#### B. Enable Gradient Checkpointing
Trade compute for memory. Recompute activations during backward pass instead of storing them.
- **Memory savings**: Up to 60%.
- **Cost**: Roughly 20-30% slower training.
```python
model.gradient_checkpointing_enable()  # Hugging Face Transformers models
# Or manually, for any nn.Module
from torch.utils.checkpoint import checkpoint
# Wrap expensive layers: hidden = checkpoint(layer_module, hidden, use_reentrant=False)
```
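
To illustrate the manual route, the sketch below wraps a small stand-in block with `torch.utils.checkpoint.checkpoint` and confirms that the gradients match a regular forward/backward pass. The block and tensor sizes are arbitrary; the memory benefit only becomes visible on much larger models.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
x = torch.randn(4, 16, requires_grad=True)

# Regular pass: intermediate activations are stored for backward.
block(x).sum().backward()
reference_grad = x.grad.clone()
x.grad = None
block.zero_grad()

# Checkpointed pass: activations inside `block` are dropped after the
# forward pass and recomputed when backward needs them.
out = checkpoint(block, x, use_reentrant=False)
out.sum().backward()

print(torch.allclose(x.grad, reference_grad, atol=1e-6))
```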

#### C. Use ZeRO / FSDP
Shard model states across GPUs (see Tutorial 10).
- **ZeRO-1**: Shard optimizer states.
- **ZeRO-2**: Also shard gradients.
- **ZeRO-3**: Also shard parameters (fits the largest models).
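
A back-of-the-envelope sketch of what sharding buys. It assumes mixed-precision Adam with the commonly used accounting of 16 bytes per parameter (2 for FP16 weights, 2 for FP16 gradients, 12 for FP32 optimizer states) and ignores activations; the function name and the 7B/8-GPU figures are illustrative assumptions, not measurements.

```python
def zero_bytes_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate model-state bytes per GPU for mixed-precision Adam.

    Assumes 2 bytes/param for FP16 weights, 2 for FP16 gradients, and
    12 for FP32 optimizer states (master weights, momentum, variance).
    Activations, buffers, and fragmentation are ignored.
    """
    params, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= num_gpus   # stage 1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus   # stage 2 also shards gradients
    if stage >= 3:
        params /= num_gpus  # stage 3 also shards parameters
    return num_params * (params + grads + optim)


for stage in (0, 1, 2, 3):
    gb = zero_bytes_per_gpu(7e9, num_gpus=8, stage=stage) / 1e9
    print(f"Stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions a 7B-parameter model drops from about 112 GB of model state per GPU without sharding to about 14 GB with stage 3 on 8 GPUs.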

#### D. Quantization
Load the model in 8-bit or 4-bit precision (for inference or QLoRA fine-tuning).
```python
from transformers import AutoModelForCausalLM  # 8-bit loading needs the bitsandbytes package
model = AutoModelForCausalLM.from_pretrained(..., load_in_8bit=True)
```

#### E. CPU Offload
Move optimizer states or parameters to CPU RAM.
- **DeepSpeed Config**:
```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true }
  }
}
```
*Note: Offloading is slower because tensors must cross the PCIe bus.*

---

## 4. Convergence Issues

The model trains but doesn't learn (loss stays flat) or learns too slowly.

### Checklist

#### 4.1 Data Loader Issues
- **Problem**: Labels are misaligned or all zeros.
- **Debug**: Print a sample batch. Verify `input_ids` correspond to `labels`.
- **Problem**: Shuffling is off, so the model sees the same pattern repeatedly.
- **Fix**: Ensure `shuffle=True` in DataLoader (unless the data must stay in order).

#### 4.2 Learning Rate Problems
- **Too Low**: Loss decreases imperceptibly.
  - *Fix*: Increase LR by 10x.
- **Too High**: Loss oscillates wildly.
  - *Fix*: Decrease LR, add warmup.
- **No Warmup**: Transformers usually need warmup; large early updates can destabilize training before the optimizer statistics settle.
  - *Fix*: Use scheduler with warmup (e.g., `get_linear_schedule_with_warmup`).
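
A minimal sketch of a linear warmup followed by linear decay, written as a plain multiplier function so it can be inspected without a model. It is intended to have the same overall shape as the warmup scheduler named above, but the function name and the step counts are assumptions for illustration.

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """LR multiplier: ramps 0 -> 1 during warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


base_lr = 3e-4  # illustrative value, not from the tutorial
for step in (0, 50, 100, 550, 1000):
    lr = base_lr * linear_warmup_decay(step, warmup_steps=100, total_steps=1000)
    print(step, lr)
```

A function like this can be passed to `torch.optim.lr_scheduler.LambdaLR` as `lr_lambda` (for example through a `lambda step: ...` wrapper that fixes the two step counts).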

#### 4.3 Vanishing/Exploding Gradients
- **Symptom**: Early layers have near-zero gradients; later layers have huge ones.
- **Fix**: 
  - **Gradient Clipping**:
  ```python
  torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
  ```
  - **Proper Initialization**: Ensure you aren't re-initializing a pre-trained model randomly.
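
The sketch below shows clipping on a toy model whose loss is scaled up artificially to provoke a large gradient. `clip_grad_norm_` returns the total norm measured before clipping, which makes it convenient to log; the model and the scale factor are assumptions made only for the demonstration.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 1)

# Scale the loss up to provoke a large gradient for the demo.
loss = 1e4 * model(torch.randn(16, 8)).pow(2).mean()
loss.backward()

# Gradients are rescaled in place; the return value is the pre-clip norm.
norm_before = float(torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0))
norm_after = float(
    torch.norm(torch.stack([p.grad.norm(2) for p in model.parameters()]), 2)
)
print(f"before: {norm_before:.2f}, after: {norm_after:.2f}")
```

In a training loop the call belongs between `loss.backward()` and `optimizer.step()`.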

#### 4.4 Label Smoothing
Sometimes helps convergence by preventing over-confidence.
```python
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```

#### 4.5 Verify Forward Pass
Is the model just predicting the same thing for every input?
- **Test**: Run a batch through. Check output distribution.
- If softmax probabilities are uniform ($1/|V|$ for vocabulary size $|V|$), the model hasn't learned.
- If they always peak at the same token, the output layer may have collapsed (for example, one bias term dominating the logits).
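
One way to put a number on this check is the entropy of the output distribution: it equals $\log |V|$ for a uniform prediction and approaches 0 when one token dominates. The pure-Python sketch below uses a toy vocabulary of 8 and made-up logits; with a real model the same quantity would be computed from the softmax outputs.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Softmax with max subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; log(len(probs)) for a uniform distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


vocab_size = 8
uniform = softmax([0.0] * vocab_size)                 # "hasn't learned"
peaked = softmax([20.0] + [0.0] * (vocab_size - 1))   # always the same token

print(entropy(uniform), math.log(vocab_size))
print(entropy(peaked))
```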

---

## 5. Performance Bottlenecks (Slow Training)

Training works but is slower than expected (below 40% Model FLOPs Utilization, MFU: the achieved fraction of the hardware's peak FLOP/s).
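
A rough way to estimate MFU from throughput, assuming the common approximation of about $6N$ FLOPs per token for training a dense transformer with $N$ parameters. The function name and the example numbers (7B parameters, 8 GPUs at 312 TFLOP/s peak each, 20,000 tokens/s) are hypothetical.

```python
def model_flops_utilization(
    num_params: float,
    tokens_per_second: float,
    peak_flops_per_second: float,
) -> float:
    """Estimate MFU with the ~6 * N FLOPs-per-token rule of thumb.

    The factor 6 (about 2 for forward, 4 for backward per parameter per
    token) ignores attention FLOPs, so treat the result as approximate.
    """
    achieved = 6.0 * num_params * tokens_per_second
    return achieved / peak_flops_per_second


# Hypothetical run: 7B params, 8 GPUs x 312 TFLOP/s peak, 20k tokens/s.
mfu = model_flops_utilization(7e9, 20_000, 8 * 312e12)
print(f"MFU: {mfu:.1%}")
```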

### 5.1 Profiling with PyTorch Profiler
Identify where time is spent.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], 
             schedule=torch.profiler.schedule(wait=1, warmup=1, active=3), 
             on_trace_ready=torch.profiler.tensorboard_trace_handler('./log')) as prof:
    
    for step, batch in enumerate(dataloader):
        # ... train step ...
        
        prof.step()
        if step > 5: break

# View results: tensorboard --logdir ./log
```

### Common Bottlenecks

#### A. Data Loading (CPU Bound)
GPU waits for CPU to feed data.
- **Symptom**: GPU utilization drops to 0% periodically.
- **Fix**:
  - Increase `num_workers` in DataLoader (try 4, 8, 16).
  - Use `pin_memory=True`.
  - Pre-process data offline (save tokenized IDs to disk).
  - Overlap loading with compute (e.g., `prefetch_factor` in DataLoader and `non_blocking=True` for host-to-device copies).
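
Before tuning the loader, it can help to measure how much wall-clock time the loop actually spends waiting for batches. The sketch below is a framework-agnostic timer; `data_wait_fraction` and `slow_loader` are hypothetical names, and the sleeping generator stands in for a CPU-bound DataLoader.

```python
import time
from typing import Callable, Iterable


def data_wait_fraction(loader: Iterable, train_step: Callable[[object], None]) -> float:
    """Fraction of wall-clock time spent waiting for the next batch."""
    wait = compute = 0.0
    iterator = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    total = wait + compute
    return wait / total if total > 0 else 0.0


def slow_loader(num_batches: int = 5, delay: float = 0.02):
    """Stand-in for a CPU-bound DataLoader (hypothetical)."""
    for i in range(num_batches):
        time.sleep(delay)
        yield i


fraction = data_wait_fraction(slow_loader(), train_step=lambda batch: None)
print(f"Waiting on data {fraction:.0%} of the time")
```

On a GPU, kernels run asynchronously, so `train_step` would need to end with `torch.cuda.synchronize()` for the compute time to be meaningful.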

#### B. Communication Overhead (Distributed)
GPUs wait for each other.
- **Symptom**: Low scaling efficiency when adding GPUs.
- **Fix**:
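  - Check that GPUs are connected via NVLink (`nvidia-smi topo -m` prints the interconnect topology).
  - Overlap communication with compute: `overlap_comm` in the DeepSpeed ZeRO config, or backward prefetching in FSDP.
  - Increase the micro-batch size so each synchronization is amortized over more compute.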
  - Check network bandwidth (InfiniBand vs Ethernet).

#### C. Kernel Launch Overhead
Many small operations, each paying a fixed kernel launch cost.
- **Fix**: Use **Fused Kernels**.
  - A fused optimizer: `apex.optimizers.FusedAdam`, or `torch.optim.AdamW(..., fused=True)` in PyTorch 2.0+.
  - `FlashAttention` (replaces standard attention, commonly reported as 2-3x faster).
  - `xformers` library for memory-efficient attention.

#### D. Python GIL / Overhead
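- **Fix**: Use `torch.compile()` (PyTorch 2.0+), which captures the model as a graph and fuses operations, reducing per-op Python overhead.
```python
model = torch.compile(model) # JIT compilation
```
*Note: Models with data-dependent control flow or frequently changing shapes may trigger graph breaks or recompilation.*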

---

## 6. Distributed Debugging Specifics

### 6.1 Hanging Processes
One rank finishes, others wait forever.
- **Cause**: Mismatched tensor sizes causing broadcast failure.
- **Cause**: One rank skips a step (e.g., `if rank == 0` inside training loop without sync).
- **Debug**: Add print statements with `dist.barrier()` to find which rank stops.
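```python
import torch.distributed as dist

# flush=True so the message appears even if the process hangs right after
print(f"Rank {dist.get_rank()} starting step", flush=True)
dist.barrier() # All ranks must reach here
```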

### 6.2 NCCL Errors
`NCCL error: unhandled system error`.
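- **Cause**: Often a failed peer-to-peer (P2P) or InfiniBand transport.
- **Fix**:
  - `export NCCL_DEBUG=INFO` to log which transport fails.
  - `export NCCL_P2P_DISABLE=1` to rule out P2P.
  - `export NCCL_IB_DISABLE=1` to rule out InfiniBand.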
  - Check for ECC errors in `nvidia-smi`.

---

## 7. Practical Exercise: Debugging a Broken Run

**Scenario**: You start training and get `Loss: NaN` at step 50.

**Step 1: Reproduce with minimal config**
- Set `batch_size=2`, `max_steps=100`, `model=tiny`.

**Step 2: Add Debug Hooks**
Insert the `debug_nan` function from Section 2.

**Step 3: Check Gradient Norms**
Log the global gradient norm (the L2 norm over all parameter gradients).
```python
total_norm = 0.0
for p in model.parameters():
    if p.grad is not None:
        param_norm = p.grad.detach().norm(2)  # L2 norm of this parameter's gradient
        total_norm += param_norm.item() ** 2
total_norm = total_norm ** 0.5  # global L2 norm across all parameters
print(f"Grad Norm: {total_norm}")
# If > 1e5, you have exploding gradients -> Clip!
```

**Step 4: Verify Data**
Print the input IDs that cause the crash.
```python
# Inside the training loop, right after computing the loss:
if torch.isnan(loss):
    print("Crash Input:", input_ids)
    break
```

**Step 5: Apply Fix**
- Found gradients exploding? → Add `clip_grad_norm_`.
- Found FP16 overflow? → Switch to BF16 or reduce LR.
- Found bad labels (e.g., a batch where every label is `-100`, the ignore index, so the mean loss is 0/0)? → Fix the tokenizer or label masking.

---

## 8. Summary Checklist

| Issue | Symptom | Likely Cause | Fix |
| :--- | :--- | :--- | :--- |
| **NaN Loss** | Loss becomes `nan` | High LR, FP16 overflow, Bad Data | Reduce LR, Use BF16, Check Data |
| **OOM** | `CUDA out of memory` | Batch too big, No checkpointing | Reduce Batch, Gradient Checkpointing, ZeRO |
| **No Convergence** | Flat loss curve | LR too low, Bad labels, No warmup | Increase LR, Check Labels, Add Warmup |
| **Oscillation** | Loss jumps up/down | LR too high, Small batch | Reduce LR, Increase Batch |
| **Slow Training** | Low GPU util | Data loading, Comm overhead | More workers, Fused kernels, Flash Attn |
| **Hang** | Process stalls | Rank mismatch, Missing barrier | Check control flow, Add barriers |

## Final Words
Debugging AI systems is iterative. Always start small, isolate variables, and use tools (profilers, hooks) rather than guessing. Keep detailed logs of experiments (Tutorial 12) so you can trace back what changed when things break.

---

## Series Conclusion

You have now completed the **End-to-End AI Training Tutorial Series**.

**Recap of Journey**:
1.  **Foundations**: Built models from scratch, understood transformers.
2.  **Training**: Ran first runs, mastered full fine-tuning and PEFT.
3.  **Advanced**: Explored RLHF, DPO, and multi-task learning.
4.  **Scale**: Mastered distributed training (TP/PP/ZeRO).
5.  **Deployment**: Optimized inference, quantization, serving.
6.  **Ops**: Built CI/CD pipelines, governance, monitoring.
7.  **Debugging**: Learned to fix NaNs, OOMs, and bottlenecks.

**Next Steps for You**:
- Experiment with open-source models (Llama, Mistral, Qwen).
- Contribute to frameworks (Hugging Face, DeepSpeed, vLLM).
- Build a portfolio project: Train a domain-specific assistant end-to-end.
- Stay updated: The field moves fast (new architectures, new quantization methods).

Happy Training! 🚀