
SFT Training Exception Playbook

Project: Korean 1B SFT retraining
Server: 8× B200 183 GB, Driver 580.95.05, CUDA 13.1, PyTorch 2.10
Date: 2026-02-26
Config: bs=4 × 8 GPUs × grad_accum=2 = effective batch 64, max_steps=10000, lr=2e-5, FP8


Scenario 1: Loss drops to 0

Detection criteria

  • Immediate alert: loss < 0.01 for 3 consecutive steps
  • Caution: loss < 0.1 sustained for 10+ steps
  • Normal range: converged loss for 1B SFT is ≈ 1.5~2.0; a loss near 0 is always abnormal
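The three thresholds above can be checked mechanically. A minimal sketch (hypothetical helper, not part of the actual monitor_training.sh):

```python
def classify_loss_stream(losses, crit_thresh=0.01, crit_steps=3,
                         warn_thresh=0.1, warn_steps=10):
    """Scan a sequence of per-step losses and return 'critical', 'warning',
    or 'ok' according to the detection criteria above."""
    crit_run = 0   # consecutive steps below the critical threshold
    warn_run = 0   # consecutive steps below the warning threshold
    level = "ok"
    for loss in losses:
        crit_run = crit_run + 1 if loss < crit_thresh else 0
        warn_run = warn_run + 1 if loss < warn_thresh else 0
        if crit_run >= crit_steps:
            return "critical"
        if warn_run >= warn_steps:
            level = "warning"
    return level
```

Feeding it the losses parsed from train.log gives the alert level at any point in training.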

์ฆ‰๊ฐ ๋Œ€์‘

  1. ํ•™์Šต ์ฆ‰์‹œ ์ค‘๋‹จ (Ctrl+C ๋˜๋Š” kill -SIGINT <PID>)
  2. ๊ฐ€์žฅ ์ตœ๊ทผ ์ •์ƒ ์ฒดํฌํฌ์ธํŠธ ํ™•์ธ:
    ls -lt checkpoints/korean_1b_sft/checkpoint-* | head -5
    

Diagnosis and response by cause

1-A. Labels shift bug recurrence

How to check:

# load one sample from the dataset and verify its labels
from data.sft_dataset import SFTDataset
from tokenizers import Tokenizer
tok = Tokenizer.from_file("tokenizer/korean_sp/tokenizer.json")
ds = SFTDataset("data/sft/train.jsonl", tok, max_seq_len=4096)
ids, labels = ds[0]
# check that the non -1 entries of labels match the next token of input_ids
mask = labels != -1
print(f"valid label count: {mask.sum()}")
print(f"first valid label position: {mask.nonzero()[0].item() if mask.any() else 'NONE'}")
# labels[i] must equal input_ids[i+1] (autoregressive)
# if labels == input_ids, the shift is missing → bug

Fix: in sft_dataset.py, confirm the shift labels = input_ids[1:], input_ids = input_ids[:-1]
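For reference, the intended shift can be sketched as follows; `shift_for_causal_lm` is a hypothetical stand-in for the logic in sft_dataset.py, assuming a plain token-id list and this project's -1 ignore value:

```python
IGNORE_INDEX = -1  # this project's ignore value (PyTorch's default is -100)

def shift_for_causal_lm(token_ids, prompt_len):
    """Return (input_ids, labels) shifted for next-token prediction:
    labels[i] is the target for input_ids[i], i.e. the token at position i+1.
    Prompt tokens are masked with IGNORE_INDEX so only the response is learned."""
    input_ids = token_ids[:-1]
    labels = token_ids[1:]
    # labels[i] targets token i+1, so positions i < prompt_len - 1 still
    # predict prompt tokens and must be ignored
    labels = [IGNORE_INDEX if i < prompt_len - 1 else t
              for i, t in enumerate(labels)]
    return input_ids, labels
```

With tokens [10, 11, 12, 13] and a 2-token prompt this yields input_ids [10, 11, 12] and labels [-1, 12, 13], which is exactly the property the verification snippet above tests.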

1-B. ๋ฐ์ดํ„ฐ ์˜ค์—ผ

ํ™•์ธ ๋ฐฉ๋ฒ•:

# ๋žœ๋ค ๋ฐฐ์น˜์—์„œ ์‹ค์ œ ํ•™์Šต ํ† ํฐ ๊ฒ€์‚ฌ
for batch in train_loader:
    ids, labels, mask = batch
    valid = (labels != -1)
    print(f"์œ ํšจ ํ† ํฐ ๋น„์œจ: {valid.float().mean():.4f}")
    # ์œ ํšจ ํ† ํฐ์ด 0์ด๋ฉด ๋ชจ๋“  labels๊ฐ€ -1 โ†’ loss=0
    if valid.sum() == 0:
        print("๐Ÿ”ด ๋ชจ๋“  labels๊ฐ€ ignore_index! ๋ฐ์ดํ„ฐ ๋ฌธ์ œ")
    break

๋Œ€์‘: ๋ฐ์ดํ„ฐ ์žฌ์ƒ์„ฑ, prepare_sft_data.py ์žฌ์‹คํ–‰

1-C. Learning-rate problem

Check: when loss suddenly drops to 0, a labels bug is far more likely than an lr problem. Verify anyway:

grep "lr " checkpoints/korean_1b_sft/train.log | tail -20
# fix if lr is abnormally high (>1e-3)

Scenario 2: Loss spike

Detection criteria

  • Spike definition: 3× or more above the mean of the previous log_interval
  • Example: mean loss of 1.9 suddenly jumping to 5.7 or higher
  • GNorm: grad_norm > 10.0 is a warning, > 50.0 is severe

์›์ธ๋ณ„ ๋Œ€์‘

์›์ธ ์ง„๋‹จ ๋Œ€์‘
Bad batch (์ด์ƒ ๋ฐ์ดํ„ฐ) ํ•ด๋‹น step์˜ ๋ฐฐ์น˜ ๋‚ด์šฉ ํ™•์ธ 1~2ํšŒ spike ํ›„ ์ž์—ฐ ๋ณต๊ตฌ๋˜๋ฉด ๋ฌด์‹œ
LR ๋ฌธ์ œ warmup ์งํ›„ spike โ†’ lr ๋„ˆ๋ฌด ๋†’์Œ lr์„ 1e-5๋กœ ๋‚ฎ์ถ”๊ณ  ์žฌ์‹œ์ž‘
GNorm ํญ๋ฐœ gnorm > 50 max_grad_norm์„ 0.5๋กœ ๊ฐ•ํ™”
FP8 ์ˆ˜์น˜ ๋ถˆ์•ˆ์ • FP8 ๊ด€๋ จ warning ํ™•์ธ --use_fp8 ์ œ๊ฑฐํ•˜๊ณ  BF16์œผ๋กœ ์ „ํ™˜

๋Œ€์‘ ์ ˆ์ฐจ

  1. 1ํšŒ spike: ๋ฌด์‹œ (๋‹จ๋ฐœ์„ฑ bad batch). ๋‹ค์Œ log์—์„œ ๋ณต๊ตฌ ํ™•์ธ
  2. ์—ฐ์† 3ํšŒ spike: ํ•™์Šต ์ค‘๋‹จ
  3. ๋ณต๊ตฌ ๋ฐฉ๋ฒ•:
    # ๋งˆ์ง€๋ง‰ ์ •์ƒ ์ฒดํฌํฌ์ธํŠธ์—์„œ ์žฌ์‹œ์ž‘, lr ๋‚ฎ์ถ”๊ธฐ
    bash scripts/launch_sft.sh --resume checkpoints/korean_1b_sft/checkpoint-XXXX --lr 1e-5
    

ํ˜„์žฌ ์ฝ”๋“œ์˜ ๋ณดํ˜ธ ์žฅ์น˜

  • โœ… max_grad_norm=1.0 (gradient clipping ํ™œ์„ฑํ™”)
  • โœ… Non-finite loss ๊ฐ์ง€ โ†’ RuntimeError ๋ฐœ์ƒ (trainer.py _step())
  • โŒ Loss spike ์ž๋™ ๊ฐ์ง€/skip์€ ๋ฏธ๊ตฌํ˜„ โ†’ monitor_training.sh๋กœ ๋ณด์™„

Scenario 3: Overfitting (val_loss persistently above train_loss)

Detection criteria

  • Caution: val_loss - train_loss > 0.15 (relative gap of 8%+)
  • Severe: val_loss rises on 3 consecutive evals (while train_loss keeps falling)
  • eval_interval: currently 250 steps → val_loss is recorded every 250 steps

ํ˜„์žฌ ์ฝ”๋“œ ์ƒํƒœ

  • โœ… val_loader ์ง€์› (sft.py์—์„œ --val_data ์ธ์ž ์žˆ์Œ)
  • โœ… _run_validation() ๊ตฌํ˜„๋จ (trainer.py)
  • โœ… Best checkpoint ์ž๋™ ์ €์žฅ (val_loss < self._best_val_loss)
  • โŒ Early stopping ๋ฏธ๊ตฌํ˜„ โ€” val_loss๊ฐ€ ์˜ฌ๋ผ๋„ max_steps๊นŒ์ง€ ํ•™์Šต ๊ณ„์†

๋Œ€์‘

์ฆ‰์‹œ ๊ฐ€๋Šฅํ•œ ์กฐ์น˜

  1. ์ˆ˜๋™ early stop: ๋ชจ๋‹ˆํ„ฐ๋ง ์Šคํฌ๋ฆฝํŠธ๊ฐ€ ๊ฒฝ๊ณ  โ†’ ์ˆ˜๋™ ์ค‘๋‹จ
  2. Best checkpoint ์‚ฌ์šฉ: checkpoint-best ๋””๋ ‰ํ† ๋ฆฌ์— ์ž๋™ ์ €์žฅ๋จ
    ls checkpoints/korean_1b_sft/checkpoint-best/
    

๊ณผ์ ํ•ฉ ํ•ด์†Œ ๋ฐฉ๋ฒ• (์žฌํ•™์Šต ์‹œ)

๋ฐฉ๋ฒ• ์„ค์ • ๋ณ€๊ฒฝ
LR ๋‚ฎ์ถ”๊ธฐ --lr 1e-5
Weight decay ๋†’์ด๊ธฐ --weight_decay 0.05
๋ฐ์ดํ„ฐ augmentation NEFTune ์ด๋ฏธ ํ™œ์„ฑํ™” (noise_alpha=10.0) โœ…
Steps ์ค„์ด๊ธฐ --max_steps 7000 (๊ณผ์ ํ•ฉ ์‹œ์ž‘ ์ „ step์—์„œ ๋ฉˆ์ถค)
Dropout ๋ชจ๋ธ ๊ตฌ์กฐ ์ˆ˜์ • ํ•„์š” (ํ˜„์žฌ ์ฝ”๋“œ์—์„œ ์‰ฝ์ง€ ์•Š์Œ)

How to add early stopping (trainer.py change)

# in trainer.py's train() method, after validation:
if val_loss > self._best_val_loss:
    self._patience_counter += 1
    if self._patience_counter >= 5:  # stop after 5 evals without improvement
        self._log("Early stopping triggered")
        break
else:
    self._patience_counter = 0
    self._best_val_loss = val_loss

Scenario 4: OOM (Out of Memory)

Current memory estimate

| Item | Estimate |
| --- | --- |
| Model parameters (1.19B, BF16) | ~2.4 GB |
| Optimizer state (AdamW, fp32) | ~9.5 GB |
| Gradients (BF16) | ~2.4 GB |
| Activations (bs=4, seq=4096, gradient checkpointing ON) | ~8-15 GB |
| Peak total (per GPU) | ~25-35 GB |
| B200 headroom | 183 - 35 = ~148 GB free |

→ OOM is extremely unlikely with a 1B model
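The byte arithmetic behind the table can be reproduced directly. A rough sketch using 1 GB = 1e9 bytes, as the table does, and assuming two fp32 AdamW moments per parameter:

```python
def estimate_train_mem_gb(n_params, activation_gb=15.0):
    """Rough per-GPU memory for BF16 training with fp32 AdamW moments,
    matching the per-parameter byte counts in the table above."""
    GB = 1e9
    weights = 2 * n_params / GB   # BF16 parameters: 2 bytes each
    grads = 2 * n_params / GB     # BF16 gradients: 2 bytes each
    adam = 8 * n_params / GB      # AdamW m and v in fp32: 4 + 4 bytes
    return weights + grads + adam + activation_gb
```

With n_params=1.19e9 this gives roughly 29 GB at the high end of the activation estimate, consistent with the ~25-35 GB peak above.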

If it happens anyway

  1. Symptom: torch.cuda.OutOfMemoryError → trainer.py already catches it and prints a detailed message
  2. Immediate response:
    # halve batch_size (4→2), double grad_accum (2→4) → same effective batch
    bash scripts/launch_sft.sh --batch_size 2 --grad_accum 4 --resume <last_ckpt>
    
  3. Gradient checkpointing:
    • ✅ already enabled (model.gradient_checkpointing_enable() in sft.py)
  4. Additional measure:
    # avoid memory fragmentation
    export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    

Memory monitoring

watch -n 5 nvidia-smi  # live view
# or use monitor_training.sh (see below)

Scenario 5: GPU hang / NCCL communication failure

Detection

  • Symptom: the training log stalls (no new step for several minutes)
  • NCCL timeout: an error is raised after the default 30 minutes
  • nvidia-smi shows a specific GPU stuck at 0% utilization

Diagnosis

# 1. check GPU status
nvidia-smi

# 2. enable NCCL debugging and restart
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# 3. check process status
ps aux | grep torchrun

Recovery

# 1. clean up existing processes
pkill -f torchrun
sleep 5

# 2. auto-detect the most recent checkpoint
LATEST_CKPT=$(ls -d checkpoints/korean_1b_sft/checkpoint-* 2>/dev/null \
  | grep -v best | sort -t- -k2 -n | tail -1)
echo "Latest checkpoint: ${LATEST_CKPT}"

# 3. restart
bash scripts/launch_sft.sh --resume "${LATEST_CKPT}"

์ตœ๊ทผ ์ฒดํฌํฌ์ธํŠธ ์ž๋™ ๊ฐ์ง€ ์Šคํฌ๋ฆฝํŠธ

#!/bin/bash
# find_latest_checkpoint.sh
CKPT_DIR="${1:-checkpoints/korean_1b_sft}"
LATEST=$(ls -d "${CKPT_DIR}"/checkpoint-[0-9]* 2>/dev/null \
  | sort -t- -k2 -n | tail -1)
if [[ -z "$LATEST" ]]; then
    echo "No checkpoint found in ${CKPT_DIR}" >&2
    exit 1
fi
echo "$LATEST"

Prevention

  • save_interval=500 (current setting) → at most 500 steps of work lost
  • NCCL timeout tuning: set it via the timeout argument of torch.distributed.init_process_group (default 30 min; shorten it if you want a hung rank to fail fast)

Scenario 6: Repetition rate >15% after training completes

Decision criteria

| Repetition rate | Verdict | Response |
| --- | --- | --- |
| <5% (without rep_penalty) | ✅ success | ready to deploy |
| 5-10% | 🟡 OK | deploy with rep_penalty=1.1 |
| 10-20% | 🟠 borderline | try the parameter tuning below |
| >20% | 🔴 failure | retraining required |
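The playbook does not pin down how repetition rate is computed; one common operationalization is the fraction of duplicated n-grams in a generation (hypothetical helper, shown only to make the thresholds concrete):

```python
from collections import Counter

def ngram_repetition_rate(tokens, n=3):
    """Fraction of n-grams in a generation that repeat an earlier n-gram.
    One way to operationalize the repetition-rate thresholds above."""
    if len(tokens) < n:
        return 0.0
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    repeated = sum(c - 1 for c in counts.values())  # every copy after the first
    return repeated / len(grams)
```

Averaging this over a few hundred sampled generations gives a single number to compare against the table.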

ํŒŒ๋ผ๋ฏธํ„ฐ ์กฐ์ •์œผ๋กœ ํ•ด๊ฒฐ ์‹œ๋„ (์žฌํ•™์Šต ์—†์ด)

# ์ถ”๋ก  ์‹œ ์ ์šฉ
generate_kwargs = {
    "repetition_penalty": 1.1,      # 1.05~1.2 ๋ฒ”์œ„ ํƒ์ƒ‰
    "no_repeat_ngram_size": 3,      # 3-gram ๋ฐ˜๋ณต ์ฐจ๋‹จ
    "temperature": 0.7,             # ์•ฝ๊ฐ„ ๋‚ฎ์ถ”๋ฉด ๋ฐ˜๋ณต ๊ฐ์†Œ
    "top_p": 0.9,
}

์žฌํ•™์Šต์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ

  • rep_penalty=1.2 + no_repeat_3gram์—์„œ๋„ >10%
  • ์›์ธ ๋ถ„์„:
    1. ๋ฐ์ดํ„ฐ ๋‚ด ๋ฐ˜๋ณต ํŒจํ„ด: data_quality_audit.py๋กœ ์žฌํ™•์ธ
    2. Epoch ๊ณผ๋‹ค: 5+ epoch์€ ๋ฐ˜๋ณต ํŒจํ„ด ์•”๊ธฐ ์œ ๋ฐœ โ†’ 3-4 epoch์ด ์ ์ •
    3. EOS ํ•™์Šต ๋ถ€์กฑ: truncation ์‹œ EOS ์†์‹ค ์—ฌ๋ถ€ ํ™•์ธ

๊ณ ๊ธ‰ ๋Œ€์‘ (์ถ”๊ฐ€ ํ•™์Šต ๋ฐฉ๋ฒ•)

๋ฐฉ๋ฒ• ์„ค๋ช… ์†Œ์š”
ORPO Preference optimization, ๋ฐ˜๋ณต ํŒจํ„ด ์ง์ ‘ penalize +3-6์‹œ๊ฐ„
DPO Chosen(๋น„๋ฐ˜๋ณต) vs Rejected(๋ฐ˜๋ณต) ์Œ ํ•„์š” +4-8์‹œ๊ฐ„
rep_penalty fine-tuning ์ถ”๋ก  ์‹œ penalty ๊ฒฐ๊ณผ๋ฅผ reward๋กœ RL ๋ณต์žก

Scenario 7: ko_ifeval below target (<15%)

How to analyze the cause

Step 1: inspect model outputs directly

# analyze failed ko_ifeval samples
python -c "
# extract failure cases from the lm_eval results
# distinguish poor instruction understanding vs format errors vs weak Korean ability
"

Step 2: per-category analysis

| Failure type | Meaning | Response |
| --- | --- | --- |
| Instruction ignored (wrong format) | weak instruction following | add format-constrained samples to the SFT data |
| Korean comprehension failure | insufficient Korean ability | raise the Korean data ratio (currently ~70%) |
| Reasoning errors | 1B model limits | model-size ceiling → move to 3B |

Step 3: distinguishing model limits from data problems

Realistic ko_ifeval range for a 1B model: 15-30%
- <15%: likely a data/training problem
- 15-25%: normal range; data improvements can still help
- 25-30%: near the 1B ceiling; consider moving to 3B
- >30%: hard to reach with 1B

๋ฐ์ดํ„ฐ ์ถ”๊ฐ€ ์ˆ˜์ง‘ ๋ฐฉํ–ฅ

  1. Korean instruction-following ๋ฐ์ดํ„ฐ: KoAlpaca, KULLM ๋“ฑ์—์„œ format-constrained ์ƒ˜ํ”Œ
  2. Multi-turn ํ•œ๊ตญ์–ด ๋Œ€ํ™”: ์ง€์‹œ ๋”ฐ๋ฅด๊ธฐ ๋Šฅ๋ ฅ ๊ฐ•ํ™”
  3. ko_ifeval๊ณผ ์œ ์‚ฌํ•œ ํฌ๋งท ๋ฐ์ดํ„ฐ: "~ํ˜•์‹์œผ๋กœ ๋‹ตํ•˜์‹œ์˜ค" ์œ ํ˜•
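A format-constrained sample of the kind described in item 3 might look like this as a JSONL line; the field names are illustrative, not this project's actual schema:

```python
import json

# Hypothetical format-constrained SFT sample ("answer in the following
# format" type). Field names are illustrative only.
sample = {
    "instruction": "Answer in exactly three bullet points, each starting with '- '.",
    "input": "Why is the sky blue?",
    "output": "- Sunlight contains all visible wavelengths\n"
              "- Air molecules scatter short (blue) wavelengths most\n"
              "- The scattered blue light reaches the eye from all directions",
}
line = json.dumps(sample, ensure_ascii=False)  # one JSONL line
```

The point is that the output verifiably satisfies the constraint stated in the instruction, which is what ko_ifeval-style checkers score.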

Scenario 8: Disk space shortage

Current state

/PROJECT: 3.5 TB total, 1.4 TB used, 2.2 TB available (39% used)

์ฒดํฌํฌ์ธํŠธ ํฌ๊ธฐ ์ถ”์ •

ํ•ญ๋ชฉ ํฌ๊ธฐ
model.pt (1.19B BF16) ~2.4 GB
optimizer.pt (AdamW states) ~9.5 GB
scheduler + meta ~1 MB
์ฒดํฌํฌ์ธํŠธ 1๊ฐœ ~12 GB
10,000 steps / 500 save = 20๊ฐœ ~240 GB
+ best checkpoint +12 GB
+ tensorboard logs ~100 MB
์ด ์˜ˆ์ƒ ~252 GB

โ†’ 2.2TB ๊ฐ€์šฉ ๋Œ€๋น„ ์ถฉ๋ถ„ํ•˜์ง€๋งŒ, ์—ฌ๋Ÿฌ ์‹คํ—˜ ์‹œ ๋ˆ„์  ์ฃผ์˜

์ฒดํฌํฌ์ธํŠธ ๊ด€๋ฆฌ ์ „๋žต

์ €์žฅ ์ฃผ๊ธฐ ์ตœ์ ํ™”

  • ํ˜„์žฌ: 500 step๋งˆ๋‹ค (์ถ”์ฒœ ์œ ์ง€)
  • ๋””์Šคํฌ ๋ถ€์กฑ ์‹œ: 1000 step์œผ๋กœ ๋ณ€๊ฒฝ โ†’ 120 GB๋กœ ์ ˆ๋ฐ˜ ๊ฐ์†Œ
  • train_config.save_interval = 1000

์˜ค๋ž˜๋œ ์ฒดํฌํฌ์ธํŠธ ์ž๋™ ์‚ญ์ œ

#!/bin/bash
# cleanup_checkpoints.sh โ€” ์ตœ์‹  N๊ฐœ๋งŒ ์œ ์ง€, best๋Š” ํ•ญ์ƒ ๋ณด์กด
CKPT_DIR="${1:-checkpoints/korean_1b_sft}"
KEEP="${2:-5}"  # ์ตœ์‹  5๊ฐœ ์œ ์ง€

CKPTS=$(ls -d "${CKPT_DIR}"/checkpoint-[0-9]* 2>/dev/null | sort -t- -k2 -n)
TOTAL=$(echo "$CKPTS" | wc -l)
DELETE=$((TOTAL - KEEP))

if [[ $DELETE -gt 0 ]]; then
    echo "$CKPTS" | head -n "$DELETE" | while read ckpt; do
        echo "Removing: $ckpt"
        rm -rf "$ckpt"
    done
    echo "Kept latest $KEEP checkpoints + checkpoint-best"
else
    echo "Only $TOTAL checkpoints, nothing to delete (keep=$KEEP)"
fi

๋””์Šคํฌ ๋ชจ๋‹ˆํ„ฐ๋ง

# ํ•™์Šต ์ค‘ ์ฃผ๊ธฐ์  ํ™•์ธ
df -h /PROJECT | awk 'NR==2 {if ($5+0 > 80) print "๐Ÿ”ด DISK >80%: "$5}'

Training restart guide

Resume support in the current code

✅ Fully supported:

  • sft.py has a --resume argument
  • load_checkpoint() restores model, optimizer, and scheduler state
  • returns start_step → training continues from there
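The restore step amounts to something like the following; `restore_training_state` is a hypothetical mirror of load_checkpoint()'s described behavior, taking an already-loaded checkpoint dict (e.g. the result of torch.load):

```python
def restore_training_state(state, model, optimizer, scheduler):
    """Restore all three components from a checkpoint dict and return the
    step at which to resume, mirroring the resume flow described above."""
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["step"] + 1  # start_step: continue after the saved step
```

Restoring the optimizer and scheduler along with the weights is what makes resuming equivalent to never having stopped; loading only the model would reset AdamW moments and the LR position.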

Restart commands

# method 1: auto-restart from the latest checkpoint
LATEST=$(ls -d checkpoints/korean_1b_sft/checkpoint-[0-9]* 2>/dev/null \
  | sort -t- -k2 -n | tail -1)
bash scripts/launch_sft.sh --resume "${LATEST}"

# method 2: specify a checkpoint explicitly
bash scripts/launch_sft.sh --resume checkpoints/korean_1b_sft/checkpoint-0003000

# method 3: restart with a different LR (overfitting/spike response)
bash scripts/launch_sft.sh --resume "${LATEST}" --lr 1e-5

์ฃผ์˜์‚ฌํ•ญ

  • cosine schedule: resume ์‹œ scheduler๊ฐ€ ์ค‘๊ฐ„ step์—์„œ ๋ณต์›๋จ โ†’ LR์ด ์˜ฌ๋ฐ”๋ฅธ ์œ„์น˜์—์„œ ์žฌ๊ฐœ
  • max_steps ๋ณ€๊ฒฝ ์‹œ: ์›๋ž˜ 5000 step ๊ธฐ์ค€ schedule์ธ๋ฐ 10000์œผ๋กœ ๋ณ€๊ฒฝํ•˜๋ฉด LR curve๊ฐ€ ๋‹ฌ๋ผ์ง โ†’ ์ฒ˜์Œ๋ถ€ํ„ฐ ์žฌํ•™์Šต ๊ถŒ์žฅ
  • DDP seed: resume ์‹œ ๋™์ผ seed ์‚ฌ์šฉํ•ด์•ผ ๋ฐ์ดํ„ฐ ์ˆœ์„œ ์žฌํ˜„ (ํ˜„์žฌ ์ฝ”๋“œ์—์„œ ์ž๋™ ์ฒ˜๋ฆฌ)
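The max_steps caveat is easy to see numerically. A sketch of warmup + cosine decay (hypothetical warmup=100 and decay-to-zero; the project's actual scheduler may differ):

```python
import math

def cosine_lr(step, max_steps, base_lr=2e-5, warmup=100):
    """Linear warmup followed by cosine decay to 0. The same step lands on a
    different point of the curve when max_steps changes, which is why
    extending max_steps on resume shifts the LR schedule."""
    if step < warmup:
        return base_lr * step / warmup
    t = (step - warmup) / max(1, max_steps - warmup)  # decay progress in [0, 1]
    return 0.5 * base_lr * (1 + math.cos(math.pi * t))
```

At step 5000, for example, a 10000-step schedule gives ≈1.0e-5 while a 20000-step schedule gives ≈1.7e-5: same checkpoint, different learning rate from that point on.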

๋ชจ๋‹ˆํ„ฐ๋ง ์ž๋™ํ™”

๋ณ„๋„ ์Šคํฌ๋ฆฝํŠธ: scripts/monitor_training.sh ์ฐธ์กฐ

๊ฐ์‹œ ํ•ญ๋ชฉ ์š”์•ฝ

ํ•ญ๋ชฉ ์ž„๊ณ„๊ฐ’ ์˜๋ฏธ
loss = 0.0000 (3 step ์—ฐ์†) ๐Ÿ”ด Critical Labels ๋ฒ„๊ทธ
loss spike (3ร— ํ‰๊ท ) ๐ŸŸ  Warning Bad batch / LR
gnorm > 10.0 ๐ŸŸ  Warning ๋ถˆ์•ˆ์ •
gnorm > 50.0 ๐Ÿ”ด Critical ๋ฐœ์‚ฐ ์ง์ „
GPU util < 50% ๐ŸŸก Info ๋ณ‘๋ชฉ (data loading?)
๋กœ๊ทธ 5๋ถ„ ์ด์ƒ ๋ฉˆ์ถค ๐Ÿ”ด Critical Hang / NCCL ์žฅ์• 
๋””์Šคํฌ ์‚ฌ์šฉ > 80% ๐ŸŸ  Warning ์ฒดํฌํฌ์ธํŠธ ์ •๋ฆฌ ํ•„์š”

์œ„ํ—˜๋„ ์ˆœ์œ„ (๋†’์Œ โ†’ ๋‚ฎ์Œ)

์ˆœ์œ„ ์‹œ๋‚˜๋ฆฌ์˜ค ์œ„ํ—˜๋„ ์˜ˆ๋ฐฉ
1 Loss โ†’ 0 (Labels ๋ฒ„๊ทธ) ๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด ํ•™์Šต ์ „ labels shift ๊ฒ€์ฆ ์Šคํฌ๋ฆฝํŠธ ์‹คํ–‰
2 GPU Hang (NCCL) ๐Ÿ”ด๐Ÿ”ด save_interval=500, NCCL ํ™˜๊ฒฝ๋ณ€์ˆ˜ ์„ค์ •
3 ๊ณผ์ ํ•ฉ ๐Ÿ”ด val_data ํ•„์ˆ˜, ๋ชจ๋‹ˆํ„ฐ๋ง
4 ๋ฐ˜๋ณต๋ฅ  >15% ๐ŸŸ ๐ŸŸ  ๊นจ๋—ํ•œ ๋ฐ์ดํ„ฐ, ์ ์ • epoch
5 Loss Spike ๐ŸŸ  grad_clip=1.0, ์ด๋ฏธ ์„ค์ •๋จ
6 ko_ifeval ๋ฏธ๋‹ฌ ๐ŸŸ  1B ํ•œ๊ณ„ ์ธ์ง€, ๋ฐ์ดํ„ฐ ๋‹ค์–‘์„ฑ
7 ๋””์Šคํฌ ๋ถ€์กฑ ๐ŸŸก 2.2TB ์—ฌ์œ , ์ž๋™ ์ •๋ฆฌ
8 OOM ๐ŸŸข 183GB์— 1B ๋ชจ๋ธ, ๊ฑฐ์˜ ๋ถˆ๊ฐ€๋Šฅ

ํ•™์Šต ์ „ ์ฒดํฌ๋ฆฌ์ŠคํŠธ

โ–ก ๋ฐ์ดํ„ฐ ํ•„ํ„ฐ๋ง ์™„๋ฃŒ (data_quality_audit.py)
โ–ก Val split ์ƒ์„ฑ (90/10)
โ–ก Labels shift ๊ฒ€์ฆ (์œ„ ์ฝ”๋“œ ์Šค๋‹ˆํŽซ ์‹คํ–‰)
โ–ก sft_dataset.py ์ˆ˜์ • ํ™•์ธ (dynamic padding, EOS ๋ณด์กด)
โ–ก launch_sft.sh ์„ค์ • ํ™•์ธ (max_steps, val_data, lr)
โ–ก ๋””์Šคํฌ ๊ณต๊ฐ„ ํ™•์ธ (df -h /PROJECT)
โ–ก GPU ์ƒํƒœ ํ™•์ธ (nvidia-smi)
โ–ก monitor_training.sh ๋ฐฑ๊ทธ๋ผ์šด๋“œ ์‹คํ–‰
โ–ก tensorboard ์‹คํ–‰: tensorboard --logdir checkpoints/korean_1b_sft/tensorboard