# SFT Training Exception Playbook
**Project:** Korean 1B SFT retraining
**Server:** 8× B200 183GB, Driver 580.95.05, CUDA 13.1, PyTorch 2.10
**Date:** 2026-02-26
**Config:** bs=4 × 8 GPUs × grad_accum=2 = effective batch 64, max_steps=10000, lr=2e-5, FP8
---
## ์‹œ๋‚˜๋ฆฌ์˜ค 1: Loss๊ฐ€ 0์œผ๋กœ ๋–จ์–ด์ง€๋Š” ๊ฒฝ์šฐ
### ๊ฐ์ง€ ๊ธฐ์ค€
- **์ฆ‰๊ฐ ๊ฒฝ๊ณ :** loss < 0.01์ด 3 step ์—ฐ์† ๋ฐœ์ƒ
- **์ฃผ์˜:** loss < 0.1์ด 10 step ์ด์ƒ ์ง€์†
- **์ •์ƒ ๋ฒ”์œ„:** 1B SFT์—์„œ ์ˆ˜๋ ด ์‹œ loss โ‰ˆ 1.5~2.0. 0์— ๊ฐ€๊นŒ์šฐ๋ฉด 100% ๋น„์ •์ƒ
### ์ฆ‰๊ฐ ๋Œ€์‘
1. ํ•™์Šต ์ฆ‰์‹œ ์ค‘๋‹จ (Ctrl+C ๋˜๋Š” `kill -SIGINT <PID>`)
2. ๊ฐ€์žฅ ์ตœ๊ทผ ์ •์ƒ ์ฒดํฌํฌ์ธํŠธ ํ™•์ธ:
```bash
ls -lt checkpoints/korean_1b_sft/checkpoint-* | head -5
```
### ์›์ธ๋ณ„ ์ง„๋‹จ ๋ฐ ๋Œ€์‘
#### 1-A. Labels Shift ๋ฒ„๊ทธ ์žฌ๋ฐœ
**ํ™•์ธ ๋ฐฉ๋ฒ•:**
```python
# Load one sample from the dataset and verify the labels
from data.sft_dataset import SFTDataset
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer/korean_sp/tokenizer.json")
ds = SFTDataset("data/sft/train.jsonl", tok, max_seq_len=4096)
ids, labels = ds[0]
# Check that the non -1 portion of labels matches the next token of input_ids
mask = labels != -1
print(f"Valid label count: {mask.sum()}")
print(f"First valid label position: {mask.nonzero()[0].item() if mask.any() else 'NONE'}")
# labels[i] must equal input_ids[i+1] (autoregressive)
# If labels == input_ids, the shift was not applied -> bug
```
**์ˆ˜์ •:** `sft_dataset.py`์—์„œ `labels = input_ids[1:]`, `input_ids = input_ids[:-1]` shift ํ™•์ธ
#### 1-B. Data Contamination
**How to check:**
```python
# Inspect the training tokens actually seen in a batch
for batch in train_loader:
    ids, labels, mask = batch
    valid = (labels != -1)
    print(f"Valid token ratio: {valid.float().mean():.4f}")
    # Zero valid tokens means every label is -1 -> loss = 0
    if valid.sum() == 0:
        print("🔴 Every label is ignore_index! Data problem")
    break
```
**๋Œ€์‘:** ๋ฐ์ดํ„ฐ ์žฌ์ƒ์„ฑ, `prepare_sft_data.py` ์žฌ์‹คํ–‰
#### 1-C. Learning Rate ๋ฌธ์ œ
**ํ™•์ธ:** loss๊ฐ€ ๊ฐ‘์ž๊ธฐ 0์ด๋ฉด lr ๋ฌธ์ œ๋ณด๋‹ค๋Š” labels ๋ฒ„๊ทธ์ผ ๊ฐ€๋Šฅ์„ฑ์ด ํ›จ์”ฌ ๋†’์Œ. ๊ทธ๋ž˜๋„ ํ™•์ธ:
```bash
grep "lr " checkpoints/korean_1b_sft/train.log | tail -20
# If lr is abnormally high (>1e-3), fix it
```
---
## ์‹œ๋‚˜๋ฆฌ์˜ค 2: Loss Spike (๊ธ‰๋“ฑ)
### ๊ฐ์ง€ ๊ธฐ์ค€
- **Spike ์ •์˜:** ์ด์ „ log_interval ํ‰๊ท  ๋Œ€๋น„ **3๋ฐฐ ์ด์ƒ** ๊ธ‰๋“ฑ
- **์˜ˆ:** ํ‰๊ท  loss 1.9์—์„œ ๊ฐ‘์ž๊ธฐ 5.7 ์ด์ƒ
- **GNorm ๊ธฐ์ค€:** grad_norm > 10.0์ด๋ฉด ์ฃผ์˜, > 50.0์ด๋ฉด ์‹ฌ๊ฐ
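The spike rule above in code form (a sketch; the monitoring script may implement it differently):

```python
def is_spike(prev_losses, current, factor=3.0):
    """True when the current loss is >= factor x the mean of the losses
    from the previous log interval (3x per the definition above)."""
    if not prev_losses:
        return False
    mean = sum(prev_losses) / len(prev_losses)
    return current >= factor * mean
```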
### ์›์ธ๋ณ„ ๋Œ€์‘
| ์›์ธ | ์ง„๋‹จ | ๋Œ€์‘ |
|------|------|------|
| Bad batch (์ด์ƒ ๋ฐ์ดํ„ฐ) | ํ•ด๋‹น step์˜ ๋ฐฐ์น˜ ๋‚ด์šฉ ํ™•์ธ | 1~2ํšŒ spike ํ›„ ์ž์—ฐ ๋ณต๊ตฌ๋˜๋ฉด ๋ฌด์‹œ |
| LR ๋ฌธ์ œ | warmup ์งํ›„ spike โ†’ lr ๋„ˆ๋ฌด ๋†’์Œ | lr์„ 1e-5๋กœ ๋‚ฎ์ถ”๊ณ  ์žฌ์‹œ์ž‘ |
| GNorm ํญ๋ฐœ | gnorm > 50 | max_grad_norm์„ 0.5๋กœ ๊ฐ•ํ™” |
| FP8 ์ˆ˜์น˜ ๋ถˆ์•ˆ์ • | FP8 ๊ด€๋ จ warning ํ™•์ธ | `--use_fp8` ์ œ๊ฑฐํ•˜๊ณ  BF16์œผ๋กœ ์ „ํ™˜ |
### ๋Œ€์‘ ์ ˆ์ฐจ
1. **1ํšŒ spike:** ๋ฌด์‹œ (๋‹จ๋ฐœ์„ฑ bad batch). ๋‹ค์Œ log์—์„œ ๋ณต๊ตฌ ํ™•์ธ
2. **์—ฐ์† 3ํšŒ spike:** ํ•™์Šต ์ค‘๋‹จ
3. **๋ณต๊ตฌ ๋ฐฉ๋ฒ•:**
```bash
# ๋งˆ์ง€๋ง‰ ์ •์ƒ ์ฒดํฌํฌ์ธํŠธ์—์„œ ์žฌ์‹œ์ž‘, lr ๋‚ฎ์ถ”๊ธฐ
bash scripts/launch_sft.sh --resume checkpoints/korean_1b_sft/checkpoint-XXXX --lr 1e-5
```
### ํ˜„์žฌ ์ฝ”๋“œ์˜ ๋ณดํ˜ธ ์žฅ์น˜
- โœ… `max_grad_norm=1.0` (gradient clipping ํ™œ์„ฑํ™”)
- โœ… Non-finite loss ๊ฐ์ง€ โ†’ RuntimeError ๋ฐœ์ƒ (trainer.py `_step()`)
- โŒ Loss spike ์ž๋™ ๊ฐ์ง€/skip์€ ๋ฏธ๊ตฌํ˜„ โ†’ `monitor_training.sh`๋กœ ๋ณด์™„
---
## ์‹œ๋‚˜๋ฆฌ์˜ค 3: ๊ณผ์ ํ•ฉ (val_loss > train_loss ์ง€์†)
### ๊ฐ์ง€ ๊ธฐ์ค€
- **์ฃผ์˜:** val_loss - train_loss > 0.15 (์ƒ๋Œ€๊ฐญ 8% ์ด์ƒ)
- **์‹ฌ๊ฐ:** val_loss๊ฐ€ 3ํšŒ ์—ฐ์† eval์—์„œ ์ƒ์Šน (train_loss๋Š” ํ•˜๊ฐ• ์ค‘)
- **eval_interval:** ํ˜„์žฌ 250 steps โ†’ ๋งค 250 step๋งˆ๋‹ค val_loss ๊ธฐ๋ก๋จ
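The two criteria above can be combined into a single check (a sketch over the logged eval history; the function name is illustrative, not from trainer.py):

```python
def overfit_status(train_losses, val_losses, gap_warn=0.15, rise_n=3):
    """Classify overfitting from the eval history.

    'severe' = val_loss rose at rise_n consecutive evals while
    train_loss kept falling; 'warn' = absolute gap exceeds gap_warn.
    """
    if len(val_losses) > rise_n and len(train_losses) > rise_n:
        val_up = all(val_losses[-i] > val_losses[-i - 1] for i in range(1, rise_n + 1))
        train_down = all(train_losses[-i] < train_losses[-i - 1] for i in range(1, rise_n + 1))
        if val_up and train_down:
            return "severe"
    if val_losses and train_losses and val_losses[-1] - train_losses[-1] > gap_warn:
        return "warn"
    return None
```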
### ํ˜„์žฌ ์ฝ”๋“œ ์ƒํƒœ
- โœ… `val_loader` ์ง€์› (sft.py์—์„œ `--val_data` ์ธ์ž ์žˆ์Œ)
- โœ… `_run_validation()` ๊ตฌํ˜„๋จ (trainer.py)
- โœ… Best checkpoint ์ž๋™ ์ €์žฅ (`val_loss < self._best_val_loss`)
- โŒ **Early stopping ๋ฏธ๊ตฌํ˜„** โ€” val_loss๊ฐ€ ์˜ฌ๋ผ๋„ max_steps๊นŒ์ง€ ํ•™์Šต ๊ณ„์†
### ๋Œ€์‘
#### ์ฆ‰์‹œ ๊ฐ€๋Šฅํ•œ ์กฐ์น˜
1. **์ˆ˜๋™ early stop:** ๋ชจ๋‹ˆํ„ฐ๋ง ์Šคํฌ๋ฆฝํŠธ๊ฐ€ ๊ฒฝ๊ณ  โ†’ ์ˆ˜๋™ ์ค‘๋‹จ
2. **Best checkpoint ์‚ฌ์šฉ:** `checkpoint-best` ๋””๋ ‰ํ† ๋ฆฌ์— ์ž๋™ ์ €์žฅ๋จ
```bash
ls checkpoints/korean_1b_sft/checkpoint-best/
```
#### ๊ณผ์ ํ•ฉ ํ•ด์†Œ ๋ฐฉ๋ฒ• (์žฌํ•™์Šต ์‹œ)
| ๋ฐฉ๋ฒ• | ์„ค์ • ๋ณ€๊ฒฝ |
|------|-----------|
| LR ๋‚ฎ์ถ”๊ธฐ | `--lr 1e-5` |
| Weight decay ๋†’์ด๊ธฐ | `--weight_decay 0.05` |
| ๋ฐ์ดํ„ฐ augmentation | NEFTune ์ด๋ฏธ ํ™œ์„ฑํ™” (noise_alpha=10.0) โœ… |
| Steps ์ค„์ด๊ธฐ | `--max_steps 7000` (๊ณผ์ ํ•ฉ ์‹œ์ž‘ ์ „ step์—์„œ ๋ฉˆ์ถค) |
| Dropout | ๋ชจ๋ธ ๊ตฌ์กฐ ์ˆ˜์ • ํ•„์š” (ํ˜„์žฌ ์ฝ”๋“œ์—์„œ ์‰ฝ์ง€ ์•Š์Œ) |
#### Adding Early Stopping (trainer.py change)
```python
# In trainer.py's train() method, after validation:
if val_loss > self._best_val_loss:
    self._patience_counter += 1
    if self._patience_counter >= 5:  # stop after 5 evals without improvement
        self._log("Early stopping triggered")
        break
else:
    self._patience_counter = 0
    self._best_val_loss = val_loss
```
---
## ์‹œ๋‚˜๋ฆฌ์˜ค 4: OOM (Out of Memory)
### ํ˜„์žฌ ๋ฉ”๋ชจ๋ฆฌ ์ถ”์ •
| ํ•ญ๋ชฉ | ์ถ”์ • |
|------|------|
| ๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ (1.19B, BF16) | ~2.4 GB |
| ์˜ตํ‹ฐ๋งˆ์ด์ € ์ƒํƒœ (AdamW, fp32) | ~9.5 GB |
| Gradient (BF16) | ~2.4 GB |
| Activation (bs=4, seq=4096, gradient checkpointing ON) | ~8-15 GB |
| **Peak ์ดํ•ฉ (per GPU)** | **~25-35 GB** |
| **B200 ์—ฌ์œ ** | **183 - 35 = ~148 GB ์—ฌ์œ ** |
โ†’ 1B ๋ชจ๋ธ์—์„œ OOM ๊ฐ€๋Šฅ์„ฑ **๊ทนํžˆ ๋‚ฎ์Œ**
### ๋งŒ์•ฝ ๋ฐœ์ƒํ•œ๋‹ค๋ฉด
1. **์ฆ์ƒ:** `torch.cuda.OutOfMemoryError` โ†’ trainer.py์—์„œ ์ด๋ฏธ catchํ•˜์—ฌ ์ƒ์„ธ ๋ฉ”์‹œ์ง€ ์ถœ๋ ฅ
2. **์ฆ‰์‹œ ๋Œ€์‘:**
```bash
# batch_size ์ค„์ด๊ธฐ (4โ†’2), grad_accum ๋Š˜๋ฆฌ๊ธฐ (2โ†’4) โ†’ effective batch ๋™์ผ
bash scripts/launch_sft.sh --batch_size 2 --grad_accum 4 --resume <last_ckpt>
```
3. **Gradient checkpointing:**
   - ✅ **Already enabled** (`model.gradient_checkpointing_enable()` in sft.py)
4. **Additional measure:**
```bash
# Avoid memory fragmentation
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```
### ๋ฉ”๋ชจ๋ฆฌ ๋ชจ๋‹ˆํ„ฐ๋ง
```bash
watch -n 5 nvidia-smi # ์‹ค์‹œ๊ฐ„ ํ™•์ธ
# ๋˜๋Š” monitor_training.sh ์‚ฌ์šฉ (์•„๋ž˜ ์ฐธ์กฐ)
```
---
## ์‹œ๋‚˜๋ฆฌ์˜ค 5: GPU Hang / NCCL ํ†ต์‹  ์žฅ์• 
### ๊ฐ์ง€ ๋ฐฉ๋ฒ•
- **์ฆ์ƒ:** ํ•™์Šต ๋กœ๊ทธ๊ฐ€ ๋ฉˆ์ถค (์ƒˆ step์ด N๋ถ„ ์ด์ƒ ์•ˆ ๋‚˜์˜ด)
- **NCCL timeout:** ๊ธฐ๋ณธ 30๋ถ„ ํ›„ ์—๋Ÿฌ ๋ฐœ์ƒ
- `nvidia-smi`์—์„œ ํŠน์ • GPU utilization 0%
### ์ง„๋‹จ
```bash
# 1. GPU ์ƒํƒœ ํ™•์ธ
nvidia-smi
# 2. NCCL ๋””๋ฒ„๊ทธ ํ™œ์„ฑํ™”ํ•˜์—ฌ ์žฌ์‹œ์ž‘
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
# 3. ํ”„๋กœ์„ธ์Šค ์ƒํƒœ ํ™•์ธ
ps aux | grep torchrun
```
### ๋ณต๊ตฌ ๋ฐฉ๋ฒ•
```bash
# 1. ๊ธฐ์กด ํ”„๋กœ์„ธ์Šค ์ •๋ฆฌ
pkill -f torchrun
sleep 5
# 2. ๊ฐ€์žฅ ์ตœ๊ทผ ์ฒดํฌํฌ์ธํŠธ ์ž๋™ ๊ฐ์ง€
LATEST_CKPT=$(ls -d checkpoints/korean_1b_sft/checkpoint-* 2>/dev/null \
| grep -v best | sort -t- -k2 -n | tail -1)
echo "Latest checkpoint: ${LATEST_CKPT}"
# 3. ์žฌ์‹œ์ž‘
bash scripts/launch_sft.sh --resume "${LATEST_CKPT}"
```
### ์ตœ๊ทผ ์ฒดํฌํฌ์ธํŠธ ์ž๋™ ๊ฐ์ง€ ์Šคํฌ๋ฆฝํŠธ
```bash
#!/bin/bash
# find_latest_checkpoint.sh
CKPT_DIR="${1:-checkpoints/korean_1b_sft}"
LATEST=$(ls -d "${CKPT_DIR}"/checkpoint-[0-9]* 2>/dev/null \
| sort -t- -k2 -n | tail -1)
if [[ -z "$LATEST" ]]; then
echo "No checkpoint found in ${CKPT_DIR}" >&2
exit 1
fi
echo "$LATEST"
```
### ์˜ˆ๋ฐฉ
- `save_interval=500` (ํ˜„์žฌ ์„ค์ •) โ†’ ์ตœ๋Œ€ 500 step ์†์‹ค
- NCCL timeout ์กฐ์ •: `export NCCL_TIMEOUT=1800` (30๋ถ„ โ†’ ํ•„์š” ์‹œ ์ค„์ด๊ธฐ)
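A minimal sketch of shortening the timeout (the helper name is illustrative; the value is ultimately passed to `torch.distributed.init_process_group`, whose NCCL default is 30 minutes):

```python
from datetime import timedelta

def nccl_init_kwargs(timeout_minutes=10):
    """Kwargs for torch.distributed.init_process_group. A shorter
    timeout makes a hung rank fail fast instead of stalling the
    whole job for the full 30-minute default."""
    return {"backend": "nccl", "timeout": timedelta(minutes=timeout_minutes)}

# usage inside the training entry point (illustrative):
# import torch.distributed as dist
# dist.init_process_group(**nccl_init_kwargs(timeout_minutes=10))
```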
---
## ์‹œ๋‚˜๋ฆฌ์˜ค 6: ํ•™์Šต ์™„๋ฃŒ ํ›„ ๋ฐ˜๋ณต๋ฅ  >15%
### ํŒ๋‹จ ๊ธฐ์ค€
| ๋ฐ˜๋ณต๋ฅ  | ํŒ๋‹จ | ๋Œ€์‘ |
|--------|------|------|
| <5% (rep_penalty ์—†์ด) | โœ… ์„ฑ๊ณต | ๋ฐฐํฌ ๊ฐ€๋Šฅ |
| 5-10% | ๐ŸŸก OK | rep_penalty=1.1๋กœ ๋ฐฐํฌ |
| 10-20% | ๐ŸŸ  ๊ฒฝ๊ณ„ | ์•„๋ž˜ ํŒŒ๋ผ๋ฏธํ„ฐ ์กฐ์ • ์‹œ๋„ |
| >20% | ๐Ÿ”ด ์‹คํŒจ | ์žฌํ•™์Šต ํ•„์š” |
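The thresholds above assume a way to measure repetition. One common sketch is an n-gram duplicate ratio over whitespace tokens (illustrative; the project's evaluation may define the metric differently):

```python
def ngram_repetition_rate(text, n=4):
    """Fraction of n-grams in a generation that are duplicates of an
    earlier n-gram. 0.0 means no repeated n-grams at all."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)
```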
### ํŒŒ๋ผ๋ฏธํ„ฐ ์กฐ์ •์œผ๋กœ ํ•ด๊ฒฐ ์‹œ๋„ (์žฌํ•™์Šต ์—†์ด)
```python
# ์ถ”๋ก  ์‹œ ์ ์šฉ
generate_kwargs = {
"repetition_penalty": 1.1, # 1.05~1.2 ๋ฒ”์œ„ ํƒ์ƒ‰
"no_repeat_ngram_size": 3, # 3-gram ๋ฐ˜๋ณต ์ฐจ๋‹จ
"temperature": 0.7, # ์•ฝ๊ฐ„ ๋‚ฎ์ถ”๋ฉด ๋ฐ˜๋ณต ๊ฐ์†Œ
"top_p": 0.9,
}
```
### ์žฌํ•™์Šต์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ
- rep_penalty=1.2 + no_repeat_3gram์—์„œ๋„ >10%
- ์›์ธ ๋ถ„์„:
1. **๋ฐ์ดํ„ฐ ๋‚ด ๋ฐ˜๋ณต ํŒจํ„ด:** `data_quality_audit.py`๋กœ ์žฌํ™•์ธ
2. **Epoch ๊ณผ๋‹ค:** 5+ epoch์€ ๋ฐ˜๋ณต ํŒจํ„ด ์•”๊ธฐ ์œ ๋ฐœ โ†’ 3-4 epoch์ด ์ ์ •
3. **EOS ํ•™์Šต ๋ถ€์กฑ:** truncation ์‹œ EOS ์†์‹ค ์—ฌ๋ถ€ ํ™•์ธ
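The EOS check in item 3 can be sketched as follows (illustrative; `eos_id` depends on the tokenizer, and the function name is not from the project code):

```python
def eos_survival_rate(tokenized_samples, eos_id):
    """Share of samples whose last token is EOS. If truncation at
    max_seq_len frequently drops the EOS, the model rarely sees a
    stop signal and tends to ramble."""
    if not tokenized_samples:
        return 0.0
    kept = sum(1 for ids in tokenized_samples if ids and ids[-1] == eos_id)
    return kept / len(tokenized_samples)
```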
### ๊ณ ๊ธ‰ ๋Œ€์‘ (์ถ”๊ฐ€ ํ•™์Šต ๋ฐฉ๋ฒ•)
| ๋ฐฉ๋ฒ• | ์„ค๋ช… | ์†Œ์š” |
|------|------|------|
| ORPO | Preference optimization, ๋ฐ˜๋ณต ํŒจํ„ด ์ง์ ‘ penalize | +3-6์‹œ๊ฐ„ |
| DPO | Chosen(๋น„๋ฐ˜๋ณต) vs Rejected(๋ฐ˜๋ณต) ์Œ ํ•„์š” | +4-8์‹œ๊ฐ„ |
| rep_penalty fine-tuning | ์ถ”๋ก  ์‹œ penalty ๊ฒฐ๊ณผ๋ฅผ reward๋กœ RL | ๋ณต์žก |
---
## ์‹œ๋‚˜๋ฆฌ์˜ค 7: ko_ifeval ๊ธฐ๋Œ€์น˜ ๋ฏธ๋‹ฌ (<15%)
### ์›์ธ ๋ถ„์„ ๋ฐฉ๋ฒ•
#### Step 1: ๋ชจ๋ธ ์ถœ๋ ฅ ์ง์ ‘ ํ™•์ธ
```bash
# ko_ifeval ์‹คํŒจ ์ƒ˜ํ”Œ ๋ถ„์„
python -c "
# lm_eval ๊ฒฐ๊ณผ์—์„œ ์‹คํŒจ ์ผ€์ด์Šค ์ถ”์ถœ
# ์ง€์‹œ๋ฌธ ์ดํ•ด ๋ถ€์กฑ vs ํฌ๋งท ์˜ค๋ฅ˜ vs ํ•œ๊ตญ์–ด ๋Šฅ๋ ฅ ๋ถ€์กฑ ๊ตฌ๋ถ„
"
```
#### Step 2: Per-Category Analysis
| Failure type | Meaning | Response |
|-----------|------|------|
| Instruction ignored (wrong format) | Weak instruction following | Add format-constrained samples to the SFT data |
| Korean comprehension failure | Weak Korean ability | Raise the Korean data ratio (currently ~70%) |
| Reasoning errors | 1B model limits | Model-size ceiling → move to 3B |
#### Step 3: Model Limit vs Data Problem
```
Realistic ko_ifeval range for a 1B model: 15-30%
- <15%: likely a data/training problem
- 15-25%: normal range; data can still improve it
- 25-30%: near the 1B ceiling; consider moving to 3B
- >30%: hard to reach at 1B
```
### ๋ฐ์ดํ„ฐ ์ถ”๊ฐ€ ์ˆ˜์ง‘ ๋ฐฉํ–ฅ
1. **Korean instruction-following ๋ฐ์ดํ„ฐ:** KoAlpaca, KULLM ๋“ฑ์—์„œ format-constrained ์ƒ˜ํ”Œ
2. **Multi-turn ํ•œ๊ตญ์–ด ๋Œ€ํ™”:** ์ง€์‹œ ๋”ฐ๋ฅด๊ธฐ ๋Šฅ๋ ฅ ๊ฐ•ํ™”
3. **ko_ifeval๊ณผ ์œ ์‚ฌํ•œ ํฌ๋งท ๋ฐ์ดํ„ฐ:** "~ํ˜•์‹์œผ๋กœ ๋‹ตํ•˜์‹œ์˜ค" ์œ ํ˜•
---
## ์‹œ๋‚˜๋ฆฌ์˜ค 8: ๋””์Šคํฌ ๊ณต๊ฐ„ ๋ถ€์กฑ
### ํ˜„์žฌ ์ƒํƒœ
```
/PROJECT: 3.5TB ์ด, 1.4TB ์‚ฌ์šฉ, 2.2TB ๊ฐ€์šฉ (39% ์‚ฌ์šฉ)
```
### ์ฒดํฌํฌ์ธํŠธ ํฌ๊ธฐ ์ถ”์ •
| ํ•ญ๋ชฉ | ํฌ๊ธฐ |
|------|------|
| model.pt (1.19B BF16) | ~2.4 GB |
| optimizer.pt (AdamW states) | ~9.5 GB |
| scheduler + meta | ~1 MB |
| **์ฒดํฌํฌ์ธํŠธ 1๊ฐœ** | **~12 GB** |
| 10,000 steps / 500 save = 20๊ฐœ | **~240 GB** |
| + best checkpoint | +12 GB |
| + tensorboard logs | ~100 MB |
| **์ด ์˜ˆ์ƒ** | **~252 GB** |
โ†’ 2.2TB ๊ฐ€์šฉ ๋Œ€๋น„ ์ถฉ๋ถ„ํ•˜์ง€๋งŒ, ์—ฌ๋Ÿฌ ์‹คํ—˜ ์‹œ ๋ˆ„์  ์ฃผ์˜
### ์ฒดํฌํฌ์ธํŠธ ๊ด€๋ฆฌ ์ „๋žต
#### ์ €์žฅ ์ฃผ๊ธฐ ์ตœ์ ํ™”
- **ํ˜„์žฌ:** 500 step๋งˆ๋‹ค (์ถ”์ฒœ ์œ ์ง€)
- ๋””์Šคํฌ ๋ถ€์กฑ ์‹œ: 1000 step์œผ๋กœ ๋ณ€๊ฒฝ โ†’ 120 GB๋กœ ์ ˆ๋ฐ˜ ๊ฐ์†Œ
- `train_config.save_interval = 1000`
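The save-interval trade-off works out as follows (a sketch using the ~12 GB-per-checkpoint estimate above; tensorboard logs, ~100 MB, are ignored):

```python
def checkpoint_disk_gb(max_steps=10000, save_interval=500,
                       ckpt_gb=12.0, keep_best=True):
    """Disk projection: periodic checkpoints plus the always-kept
    best checkpoint."""
    periodic = (max_steps // save_interval) * ckpt_gb
    return periodic + (ckpt_gb if keep_best else 0.0)
```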
#### ์˜ค๋ž˜๋œ ์ฒดํฌํฌ์ธํŠธ ์ž๋™ ์‚ญ์ œ
```bash
#!/bin/bash
# cleanup_checkpoints.sh โ€” ์ตœ์‹  N๊ฐœ๋งŒ ์œ ์ง€, best๋Š” ํ•ญ์ƒ ๋ณด์กด
CKPT_DIR="${1:-checkpoints/korean_1b_sft}"
KEEP="${2:-5}" # ์ตœ์‹  5๊ฐœ ์œ ์ง€
CKPTS=$(ls -d "${CKPT_DIR}"/checkpoint-[0-9]* 2>/dev/null | sort -t- -k2 -n)
TOTAL=$(echo "$CKPTS" | wc -l)
DELETE=$((TOTAL - KEEP))
if [[ $DELETE -gt 0 ]]; then
echo "$CKPTS" | head -n "$DELETE" | while read ckpt; do
echo "Removing: $ckpt"
rm -rf "$ckpt"
done
echo "Kept latest $KEEP checkpoints + checkpoint-best"
else
echo "Only $TOTAL checkpoints, nothing to delete (keep=$KEEP)"
fi
```
### ๋””์Šคํฌ ๋ชจ๋‹ˆํ„ฐ๋ง
```bash
# ํ•™์Šต ์ค‘ ์ฃผ๊ธฐ์  ํ™•์ธ
df -h /PROJECT | awk 'NR==2 {if ($5+0 > 80) print "๐Ÿ”ด DISK >80%: "$5}'
```
---
## ํ•™์Šต ์žฌ์‹œ์ž‘ ๊ฐ€์ด๋“œ
### ํ˜„์žฌ ์ฝ”๋“œ์˜ Resume ์ง€์›
โœ… **์™„์ „ ์ง€์›๋จ:**
- `sft.py`์— `--resume` ์ธ์ž ์žˆ์Œ
- `load_checkpoint()`์œผ๋กœ model, optimizer, scheduler ์ƒํƒœ ๋ชจ๋‘ ๋ณต์›
- `start_step` ๋ฐ˜ํ™˜ โ†’ ์ด์–ด์„œ ํ•™์Šต
### ์žฌ์‹œ์ž‘ ๋ช…๋ น์–ด
```bash
# ๋ฐฉ๋ฒ• 1: ์ตœ์‹  ์ฒดํฌํฌ์ธํŠธ์—์„œ ์ž๋™ ์žฌ์‹œ์ž‘
LATEST=$(ls -d checkpoints/korean_1b_sft/checkpoint-[0-9]* 2>/dev/null \
| sort -t- -k2 -n | tail -1)
bash scripts/launch_sft.sh --resume "${LATEST}"
# ๋ฐฉ๋ฒ• 2: ํŠน์ • ์ฒดํฌํฌ์ธํŠธ ์ง€์ •
bash scripts/launch_sft.sh --resume checkpoints/korean_1b_sft/checkpoint-0003000
# ๋ฐฉ๋ฒ• 3: LR ๋ณ€๊ฒฝํ•˜๋ฉฐ ์žฌ์‹œ์ž‘ (๊ณผ์ ํ•ฉ/spike ๋Œ€์‘)
bash scripts/launch_sft.sh --resume "${LATEST}" --lr 1e-5
```
### ์ฃผ์˜์‚ฌํ•ญ
- **cosine schedule:** resume ์‹œ scheduler๊ฐ€ ์ค‘๊ฐ„ step์—์„œ ๋ณต์›๋จ โ†’ LR์ด ์˜ฌ๋ฐ”๋ฅธ ์œ„์น˜์—์„œ ์žฌ๊ฐœ
- **max_steps ๋ณ€๊ฒฝ ์‹œ:** ์›๋ž˜ 5000 step ๊ธฐ์ค€ schedule์ธ๋ฐ 10000์œผ๋กœ ๋ณ€๊ฒฝํ•˜๋ฉด LR curve๊ฐ€ ๋‹ฌ๋ผ์ง โ†’ ์ฒ˜์Œ๋ถ€ํ„ฐ ์žฌํ•™์Šต ๊ถŒ์žฅ
- **DDP seed:** resume ์‹œ ๋™์ผ seed ์‚ฌ์šฉํ•ด์•ผ ๋ฐ์ดํ„ฐ ์ˆœ์„œ ์žฌํ˜„ (ํ˜„์žฌ ์ฝ”๋“œ์—์„œ ์ž๋™ ์ฒ˜๋ฆฌ)
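The max_steps caveat can be seen numerically with a toy cosine schedule (a sketch assuming plain cosine decay to 0; the project's actual scheduler may add warmup or a floor):

```python
import math

def cosine_lr(step, max_steps, base_lr=2e-5):
    """Plain cosine decay: the same step maps to a different point on
    the curve when max_steps changes, which is why changing the
    horizon mid-run shifts the whole LR trajectory."""
    t = min(step / max(max_steps, 1), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```

For example, step 2500 sits at the halfway point of a 5000-step schedule but only a quarter of the way into a 10000-step one, so its LR is noticeably higher under the longer horizon.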
---
## ๋ชจ๋‹ˆํ„ฐ๋ง ์ž๋™ํ™”
๋ณ„๋„ ์Šคํฌ๋ฆฝํŠธ: `scripts/monitor_training.sh` ์ฐธ์กฐ
### ๊ฐ์‹œ ํ•ญ๋ชฉ ์š”์•ฝ
| ํ•ญ๋ชฉ | ์ž„๊ณ„๊ฐ’ | ์˜๋ฏธ |
|------|--------|------|
| loss = 0.0000 (3 step ์—ฐ์†) | ๐Ÿ”ด Critical | Labels ๋ฒ„๊ทธ |
| loss spike (3ร— ํ‰๊ท ) | ๐ŸŸ  Warning | Bad batch / LR |
| gnorm > 10.0 | ๐ŸŸ  Warning | ๋ถˆ์•ˆ์ • |
| gnorm > 50.0 | ๐Ÿ”ด Critical | ๋ฐœ์‚ฐ ์ง์ „ |
| GPU util < 50% | ๐ŸŸก Info | ๋ณ‘๋ชฉ (data loading?) |
| ๋กœ๊ทธ 5๋ถ„ ์ด์ƒ ๋ฉˆ์ถค | ๐Ÿ”ด Critical | Hang / NCCL ์žฅ์•  |
| ๋””์Šคํฌ ์‚ฌ์šฉ > 80% | ๐ŸŸ  Warning | ์ฒดํฌํฌ์ธํŠธ ์ •๋ฆฌ ํ•„์š” |
---
## ์œ„ํ—˜๋„ ์ˆœ์œ„ (๋†’์Œ โ†’ ๋‚ฎ์Œ)
| ์ˆœ์œ„ | ์‹œ๋‚˜๋ฆฌ์˜ค | ์œ„ํ—˜๋„ | ์˜ˆ๋ฐฉ |
|------|----------|--------|------|
| 1 | **Loss โ†’ 0 (Labels ๋ฒ„๊ทธ)** | ๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด | ํ•™์Šต ์ „ labels shift ๊ฒ€์ฆ ์Šคํฌ๋ฆฝํŠธ ์‹คํ–‰ |
| 2 | **GPU Hang (NCCL)** | ๐Ÿ”ด๐Ÿ”ด | save_interval=500, NCCL ํ™˜๊ฒฝ๋ณ€์ˆ˜ ์„ค์ • |
| 3 | **๊ณผ์ ํ•ฉ** | ๐Ÿ”ด | val_data ํ•„์ˆ˜, ๋ชจ๋‹ˆํ„ฐ๋ง |
| 4 | **๋ฐ˜๋ณต๋ฅ  >15%** | ๐ŸŸ ๐ŸŸ  | ๊นจ๋—ํ•œ ๋ฐ์ดํ„ฐ, ์ ์ • epoch |
| 5 | **Loss Spike** | ๐ŸŸ  | grad_clip=1.0, ์ด๋ฏธ ์„ค์ •๋จ |
| 6 | **ko_ifeval ๋ฏธ๋‹ฌ** | ๐ŸŸ  | 1B ํ•œ๊ณ„ ์ธ์ง€, ๋ฐ์ดํ„ฐ ๋‹ค์–‘์„ฑ |
| 7 | **๋””์Šคํฌ ๋ถ€์กฑ** | ๐ŸŸก | 2.2TB ์—ฌ์œ , ์ž๋™ ์ •๋ฆฌ |
| 8 | **OOM** | ๐ŸŸข | 183GB์— 1B ๋ชจ๋ธ, ๊ฑฐ์˜ ๋ถˆ๊ฐ€๋Šฅ |
---
## ํ•™์Šต ์ „ ์ฒดํฌ๋ฆฌ์ŠคํŠธ
```
โ–ก ๋ฐ์ดํ„ฐ ํ•„ํ„ฐ๋ง ์™„๋ฃŒ (data_quality_audit.py)
โ–ก Val split ์ƒ์„ฑ (90/10)
โ–ก Labels shift ๊ฒ€์ฆ (์œ„ ์ฝ”๋“œ ์Šค๋‹ˆํŽซ ์‹คํ–‰)
โ–ก sft_dataset.py ์ˆ˜์ • ํ™•์ธ (dynamic padding, EOS ๋ณด์กด)
โ–ก launch_sft.sh ์„ค์ • ํ™•์ธ (max_steps, val_data, lr)
โ–ก ๋””์Šคํฌ ๊ณต๊ฐ„ ํ™•์ธ (df -h /PROJECT)
โ–ก GPU ์ƒํƒœ ํ™•์ธ (nvidia-smi)
โ–ก monitor_training.sh ๋ฐฑ๊ทธ๋ผ์šด๋“œ ์‹คํ–‰
โ–ก tensorboard ์‹คํ–‰: tensorboard --logdir checkpoints/korean_1b_sft/tensorboard
```