
🚀 3B Korean LLM Master Plan

Written: 2026-02-27 04:27 KST · Project: /PROJECT/0325120031_A/ghong/taketimes/llm-bang/ · Decision: switch from 1B to 3B (structural limits of the 1B confirmed) · Total estimated duration: ~35 hours


0. Status Summary

What the 1B run established

| Item | Result |
|---|---|
| Repetition rate (raw, correct format) | 30.7% |
| Repetition rate (rep_penalty=1.1) | 18.0% |
| val_loss | 2.2062 |
| Natural-termination rate | 60% |
| Short-QA quality | ✅ Good (capital cities, kimchi, etc.) |
| Complex-question quality | ❌ Severe repetition degeneration |

Rationale for moving to 3B

  1. The 18% repetition rate is a structural limit of the 1B: with d_model=2048 and 24 layers, hidden-state collapse on long sequences is unavoidable
  2. Scaling-law prediction: 3B cuts loss by ~7%, so a repetition rate of ~5-8% is expected
  3. The target is reachable even without ORPO: <10% from 3B SFT alone, <3% with ORPO on top
  4. Total time: trying ORPO first and then falling back to 3B (39h) vs. going straight to 3B (30h); going straight is faster

1. 3B Model Architecture

| Parameter | 1B (current) | 3B (target) | Change |
|---|---|---|---|
| d_model | 2048 | 3072 | 1.5× |
| n_layers | 24 | 32 | 1.33× |
| n_heads | 16 | 24 | 1.5× |
| n_kv_heads (GQA) | 4 | 8 | 2× (GQA 3:1) |
| d_ffn (SwiGLU) | 5472 | 8192 | 1.5× |
| vocab_size | 64000 | 64000 | same |
| max_seq_len | 4096 | 4096 | same |
| rope_theta | 500000 | 500000 | same |
| Total parameters | 1.19B | ~3.42B | 2.9× |

ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜ ์ƒ์„ธ

Embedding:      64000 ร— 3072                    = 196.6M
Attention/layer: Q(3072ร—3072) + K(3072ร—1024) + V(3072ร—1024) + O(3072ร—3072) = 25.1M
FFN/layer:      SwiGLU gate(3072ร—8192) + up(3072ร—8192) + down(8192ร—3072) = 75.5M
Layer total:    25.1 + 75.5 = 100.6M ร— 32 layers = 3,219M
LM Head:        tied with embedding
์ด๊ณ„:           196.6M + 3,219M โ‰ˆ 3.42B

Expected GPU memory (8× B200, 183 GB each)

Model (FP8):          3.42 GB
Optimizer (FP32):     27.4 GB (sharded via DDP → ~3.4 GB/GPU)
Gradients (BF16):     6.84 GB (sharded → ~0.86 GB/GPU)
Activations (bs=4):   ~15-25 GB (with gradient checkpointing)
Per-GPU total:        ~28 GB → 15% of a B200 → ample headroom
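The per-GPU figures are just parameter count × bytes per element, with optimizer state and gradients sharded across the 8 GPUs. A rough sketch of the arithmetic, taking activations as the midpoint of the estimated range:

```python
# Rough per-GPU memory estimate for 3.42B params on 8 GPUs.
params = 3.42e9
n_gpus = 8

model_gb = params * 1 / 1e9            # FP8 weights, replicated: 1 byte/param
optim_gb = params * 8 / 1e9 / n_gpus   # Adam m+v in FP32 (8 bytes/param), sharded
grad_gb = params * 2 / 1e9 / n_gpus    # BF16 gradients (2 bytes/param), sharded
act_gb = 20.0                          # midpoint of the ~15-25 GB activation estimate

per_gpu = model_gb + optim_gb + grad_gb + act_gb
print(f"{per_gpu:.1f} GB/GPU")  # ~28 GB
```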

Config file

# configs/korean_3b_fp8.yaml
model:
  vocab_size: 64000
  d_model: 3072
  n_layers: 32
  n_heads: 24
  n_kv_heads: 8
  d_ffn: 8192
  max_seq_len: 4096
  rope_theta: 500000.0
  dropout: 0.0
  bias: false
  use_flash_attn: true
  use_fp8: true

train:
  max_steps: 34000       # 8.91B × 4 epochs
  batch_size: 4
  grad_accum_steps: 8    # eff_batch: 4 × 8 × 8 GPUs × 4096 ≈ 1M tok/step
  lr: 1.5e-4
  min_lr: 1.5e-5
  weight_decay: 0.1
  warmup_steps: 2000
  max_grad_norm: 1.0
  log_interval: 10
  save_interval: 500
  eval_interval: 200
  fp8_format: "MXFP8"

tokenizer:
  vocab_size: 64000
  type: sentencepiece_unigram

2. Data Pipeline

Ready to use now

| Source | Size | Tokens | Status |
|---|---|---|---|
| korean_c4_train.bin | 15.1 GB | 7.56B | ✅ tokenized |
| korean_namuwiki_train.bin | 2.2 GB | 1.08B | ✅ tokenized |
| korean_wiki_train.bin | 0.5 GB | 0.26B | ✅ tokenized |
| Total (korean_train.bin) | 17.8 GB | 8.91B | ✅ can start immediately |

Needs preparation (parallel tokenization)

| Source | Size | Est. tokens | Work | Est. time |
|---|---|---|---|---|
| culturax_ko | 60 GB | ~30-40B | parquet → tokenize | 4-6h |
| hplt_ko | 23 GB | ~12-15B | tokenize | 2-3h |
| cc100_ko | 14 GB | ~8-10B | xz decompress + tokenize | 2h |
| oscar_ko | 9.2 GB | ~5-6B | tokenize | 1-2h |
| korean_textbooks | 6.4 GB | ~3-4B | tokenize | 1h |
| Total | ~123 GB | ~70-80B | | 8-12h (in parallel) |

Chinchilla analysis

3.42B × 20 = 68.4B tokens (compute-optimal)
Ready now: 8.91B × 4 epochs = 35.6B (52% of optimal)
With the extra sources: ~80-90B → sufficient (~131%)
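The coverage numbers follow from the standard Chinchilla rule of thumb of ~20 tokens per parameter; a quick sketch:

```python
# Chinchilla-style token budget for a 3.42B-parameter model.
params_b = 3.42
optimal_b = params_b * 20                 # ~20 tokens/param rule of thumb

ready_now_b = 8.91 * 4                    # 4 epochs over the already-tokenized corpus
with_extra_b = 90.0                       # upper end of the ~80-90B estimate

print(f"optimal: {optimal_b:.1f}B")                       # 68.4B
print(f"coverage now: {ready_now_b / optimal_b:.0%}")     # 52%
print(f"with extra: {with_extra_b / optimal_b:.0%}")      # ~130%
```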

๋ฐ์ดํ„ฐ ํƒ€์ž„๋ผ์ธ

์‹œ์  ํ–‰๋™
์ง€๊ธˆ korean_train.bin 8.91B๋กœ ์‚ฌ์ „ํ•™์Šต ์‹œ์ž‘ (4 epoch)
๋ณ‘๋ ฌ korean_extra ํ† ํฐํ™” + MinHash ์ค‘๋ณต์ œ๊ฑฐ + PPL ํ•„ํ„ฐ ์ง„ํ–‰
Phase 2 ์ „์ฒด 60-80B ํ† ํฐ์œผ๋กœ extended pretrain (์„ ํƒ)

SFT ๋ฐ์ดํ„ฐ

ํ•ญ๋ชฉ ๊ฐ’
ํ˜„์žฌ ํด๋ฆฐ ๋ฐ์ดํ„ฐ ~120-135K ์ƒ˜ํ”Œ (ํ•„ํ„ฐ๋ง ํ›„)
Val split 10% (~12-13K)
3B์— ์ถฉ๋ถ„? โœ… (7B Alpaca๋„ 52K๋กœ ํ•™์Šต)

Additional high-quality SFT sources (optional)

  • hPark/orca-ko (~200K)
  • maywell/synatra-orca (~300K)
  • HAERAE-HUB/qarv-instruct-100k (100K)
  • ~200-300K usable after filtering

3. Pretraining Plan

Hyperparameters

| Parameter | Value | Rationale |
|---|---|---|
| LR | 1.5e-4 | Standard for 3B (conservative vs. the 1B's 3e-4) |
| Min LR | 1.5e-5 | 10% of LR |
| Warmup | 2,000 steps | ~6% |
| Weight decay | 0.1 | Pretraining standard |
| Batch size | 4/GPU × 8 GPUs × 8 grad_accum = 256 eff | ~1M tok/step |
| Max steps | 34,000 | 8.91B × 4 epochs |
| Precision | MXFP8 | Optimized for B200 |
| Grad clip | 1.0 | Standard |
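The max_steps value falls out of the effective batch size in tokens; a quick sketch of the arithmetic:

```python
# Derive max_steps from the effective batch in tokens.
micro_bs, grad_accum, n_gpus, seq_len = 4, 8, 8, 4096
tokens_per_step = micro_bs * grad_accum * n_gpus * seq_len   # ≈ 1.05M tok/step

total_tokens = 8.91e9 * 4                                    # 4 epochs
steps = total_tokens / tokens_per_step

print(f"{tokens_per_step:,} tok/step, ~{steps:,.0f} steps")  # ~34,000 steps
```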

์˜ˆ์ƒ ์†Œ์š” ์‹œ๊ฐ„

1B ์‹ค์ธก: ~75,700 tok/s (๋‹จ์ผ B200)
3B ์˜ˆ์ƒ: ํŒŒ๋ผ๋ฏธํ„ฐ 3ร— โ†’ throughput ~40-50% ๊ฐ์†Œ
         BUT batch ์ตœ์ ํ™” + FP8 โ†’ ๋ณด์ •

๋ณด์ˆ˜์  ์ถ”์ •:
  8.91B ร— 4 epoch = 35.6B tokens
  ์ฒ˜๋ฆฌ๋Ÿ‰: ~400K tok/s (8ร— B200, FP8, ์ตœ์  ๋ฐฐ์น˜)
  ์†Œ์š”: 35.6B / 400K = 89,000์ดˆ โ‰ˆ 24.7์‹œ๊ฐ„

๋‚™๊ด€์  ์ถ”์ •:
  ์ฒ˜๋ฆฌ๋Ÿ‰: ~600K tok/s โ†’ 16.5์‹œ๊ฐ„
  
์ฑ„ํƒ ์ถ”์ •: ~26์‹œ๊ฐ„
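Both estimates are simply total tokens divided by cluster throughput; sketched:

```python
# Wall-clock estimate: total tokens divided by cluster throughput.
total_tokens = 35.6e9

for label, tok_per_s in [("conservative", 400e3), ("optimistic", 600e3)]:
    hours = total_tokens / tok_per_s / 3600
    print(f"{label}: {hours:.1f} h")
# conservative: 24.7 h
# optimistic: 16.5 h
```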

Monitoring

# Live log
tail -f checkpoints/korean_3b_fp8/train.log

# TensorBoard
tensorboard --logdir checkpoints/korean_3b_fp8/tensorboard --port 6007

# GPU status
watch -n 10 nvidia-smi

Key metrics to watch:

| Metric | Normal range | Warning | Abort immediately |
|---|---|---|---|
| Train loss | starts ~10, converges to ~3-4 | flat for 5000+ steps | diverging (rising) |
| GNorm | 0.5-2.0 | >5.0 | >50 |
| PPL | trending down | flat | rising |
| GPU util | >90% | <70% | <50% (bottleneck) |
| tok/s | >300K | <200K | <100K |

์ฒดํฌํฌ์ธํŠธ ์ „๋žต

Step ํ–‰๋™
500 Sanity check โ€” loss ํ•˜๊ฐ• ์ค‘? OOM ์—†๋‚˜?
5,000 1 epoch ์™„๋ฃŒ โ€” PPL ์ธก์ •, ํ•œ๊ตญ์–ด ํ…์ŠคํŠธ perplexity <20?
10,000 ์ค‘๊ฐ„์  โ€” loss ์ถ”์„ธ ํ™•์ธ, ๊ณผ์ ํ•ฉ ์ง•ํ›„?
17,000 2 epoch โ€” PPL < 15?
25,000 3 epoch โ€” PPL < 12?
34,000 ์ตœ์ข… โ€” PPL < 10 ๋ชฉํ‘œ

๋””์Šคํฌ: ์ฒดํฌํฌ์ธํŠธ 1๊ฐœ ~27GB (๋ชจ๋ธ 7GB + optimizer 20GB) ร— save_interval=500 โ†’ 68๊ฐœ = ~1.8TB โ†’ save_interval=2000์œผ๋กœ ๋ณ€๊ฒฝ ๊ถŒ์žฅ โ†’ 17๊ฐœ = ~460GB


4. SFT Plan

Apply every lesson from the 1B run

| Lesson | Found at 1B | Applied to 3B |
|---|---|---|
| Dynamic padding is required | fixed-4096 padding wasted 90% | ✅ sft_dataset.py already fixed; reuse as-is |
| Preserve EOS | EOS lost on truncation | ✅ force response_ids[-1] = eos_id |
| Val split is required | overfitting could not be monitored | ✅ 10% split |
| 3-4 epochs | 2 epochs underfit | ✅ compute max_steps accordingly |
| Cap OpenOrca over-representation | 5× weighting caused overfitting | ✅ ≤2.0× |
| Data-quality filters | </s> literals, Q/A-marker contamination | ✅ filter script finished |
| One consistent format | train/inference format mismatch | ✅ <\|user\|>/<\|assistant\|> throughout |
| Early stopping | training ran on while val_loss rose | ✅ patience=5 implemented |
| NEFTune | alpha 10.0 was excessive | ✅ lowered to 5.0 |
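The first two lessons (dynamic padding and forced EOS) can be sketched framework-free; `collate`, `pad_id`, and `eos_id` here are illustrative names, not the actual sft_dataset.py API:

```python
# Batch-level dynamic padding with forced EOS (plain-list sketch).
def collate(batch, pad_id=0, eos_id=2, max_len=4096):
    out = []
    for seq in batch:
        seq = list(seq[:max_len])
        seq[-1] = eos_id               # force EOS, even after truncation
        out.append(seq)
    # Pad only to the longest sequence in this batch, not to max_len.
    width = max(len(s) for s in out)
    return [s + [pad_id] * (width - len(s)) for s in out]

print(collate([[5, 6, 7, 2], [5, 6]]))  # [[5, 6, 7, 2], [5, 2, 0, 0]]
```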

Hyperparameters

| Parameter | Value | Rationale |
|---|---|---|
| LR | 2e-5 | SFT standard (same as Alpaca, Vicuna) |
| Warmup | 300 steps | ~3% |
| Max steps | 10,000 | ~3-4 epochs (adjust to data size) |
| Batch size | 4/GPU × 2 grad_accum × 8 GPUs = 64 | SFT standard |
| Weight decay | 0.01 | SFT standard (lower than pretrain's 0.1) |
| NEFTune | alpha=5.0 | Guards against overfitting |
| Eval interval | 500 steps | |
| Early stopping | patience=5 | Stop after 2,500 steps without improvement |
| Dropout | 0.05 | Guards against overfitting (was 0.0 at 1B) |

Launch command

RUN_NAME=korean_3b_sft \
BASE_CHECKPOINT=checkpoints/korean_3b_fp8/checkpoint-BEST \
SFT_DATA=data/sft_v2/train.jsonl \
VAL_DATA=data/sft_v2/val.jsonl \
MAX_STEPS=10000 \
LR=2.0e-5 \
WARMUP_STEPS=300 \
bash scripts/launch_sft.sh

์˜ˆ์ƒ ์‹œ๊ฐ„: ~2์‹œ๊ฐ„ (3B๋Š” 1B ๋Œ€๋น„ ~2.5ร— ๋А๋ฆผ)

Success criteria

| Metric | Target | Failure threshold |
|---|---|---|
| Train loss | < 1.85 | > 2.00 |
| Val loss | within 1.1× of train | exceeds 1.2× |
| Repetition rate (raw) | < 10% | > 15% |
| Repetition rate (rep_penalty=1.1) | < 3% | > 8% |
| EOS-termination rate | > 80% | < 60% |
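The repetition-rate criterion presumes a concrete measurement. As an illustration (not the actual logic of test_generation_params.py), a generation can be flagged as degenerate when any n-gram repeats beyond a threshold:

```python
# Flag a generation as degenerate if any 4-gram occurs >= 3 times.
from collections import Counter

def is_repetitive(tokens, n=4, threshold=3):
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return bool(grams) and max(grams.values()) >= threshold

def repetition_rate(generations):
    """Fraction of generations flagged as repetitive."""
    return sum(map(is_repetitive, generations)) / len(generations)

loop = [1, 2, 3, 4] * 5      # degenerate: the same 4-gram loops 5 times
clean = list(range(40))      # no repeated 4-grams
print(repetition_rate([loop, clean]))  # 0.5
```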

5. ORPO Plan

Timing: after SFT completes, and only if the repetition rate is >5%

๋ฐ์ดํ„ฐ

์†Œ์Šค ์ƒ˜ํ”Œ ์ˆ˜ ์œ ํ˜•
maywell/ko_Ultrafeedback_binarized ~60K ์ผ๋ฐ˜ ๋„๋ฉ”์ธ preference
kuotient/orca-math-korean-dpo-pairs ์ˆ˜์ฒœ ์ˆ˜ํ•™ ๋„๋ฉ”์ธ
์ž์ฒด ์ƒ์„ฑ (3B SFT ๋ชจ๋ธ๋กœ) ~2-5K ๋ฐ˜๋ณต ํƒ€๊ฒŸ preference

์ž์ฒด ์ƒ์„ฑ ๋ฐฉ๋ฒ•:

  1. 3B SFT ๋ชจ๋ธ๋กœ 1000 ํ”„๋กฌํ”„ํŠธ ร— 4 temperature ์ƒ์„ฑ
  2. ๋ฐ˜๋ณต ์ถœ๋ ฅ โ†’ rejected, ๊นจ๋—ํ•œ ์ถœ๋ ฅ โ†’ chosen
  3. 3B์—์„œ๋Š” ๋ฐ˜๋ณต๋ฅ  ๋‚ฎ์œผ๋ฏ€๋กœ 1B๋ณด๋‹ค ํ›จ์”ฌ ํŽธํ–ฅ ์ ์Œ
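Step 2 can be sketched as pairing each prompt's most repetitive sample with its cleanest one; the 4-gram repeat count below stands in for whatever repetition score is actually used:

```python
# Build (chosen, rejected) pairs from multiple samples per prompt.
from collections import Counter

def repeat_score(tokens, n=4):
    """Highest occurrence count of any 4-gram in the token list."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return max(grams.values(), default=0)

def make_pair(samples):
    """samples: token lists generated from one prompt at different temperatures."""
    ranked = sorted(samples, key=repeat_score)
    chosen, rejected = ranked[0], ranked[-1]
    # Only emit a pair when the two actually differ in degeneracy.
    return (chosen, rejected) if repeat_score(rejected) > repeat_score(chosen) else None

pair = make_pair([[1, 2, 3, 4] * 5, list(range(20))])
print(pair is not None)  # True: the looping sample became `rejected`
```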

Hyperparameters

ORPOConfig(
    learning_rate=5e-7,    # very low LR (alignment-stage training)
    num_train_epochs=1,    # 1 epoch is enough
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    beta=0.1,              # ORPO coefficient
)
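For reference, the penalty that `beta` scales is ORPO's odds-ratio term, added on top of the usual SFT loss. A scalar sketch, where average per-token log-likelihoods stand in for full model outputs:

```python
import math

def odds(avg_logp):
    """Odds of a response given its average per-token log-likelihood."""
    p = math.exp(avg_logp)
    return p / (1.0 - p)

def orpo_penalty(logp_chosen, logp_rejected, beta=0.1):
    # L_OR = -log sigmoid(log(odds(chosen) / odds(rejected)))
    log_or = math.log(odds(logp_chosen) / odds(logp_rejected))
    return -beta * math.log(1.0 / (1.0 + math.exp(-log_or)))

# The penalty shrinks as the model prefers `chosen` over `rejected`.
print(orpo_penalty(-1.0, -2.0) < orpo_penalty(-1.5, -1.5))  # True
```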

์˜ˆ์ƒ ์‹œ๊ฐ„: 1-2์‹œ๊ฐ„

๋ชฉํ‘œ: ๋ฐ˜๋ณต๋ฅ  <3% (raw), <1% (rep_penalty=1.1)


6. ํ‰๊ฐ€ ๊ณ„ํš

๋ฒค์น˜๋งˆํฌ

๋ฒค์น˜๋งˆํฌ ๋„๊ตฌ 1B ์˜ˆ์ƒ 3B ๋ชฉํ‘œ
ko_ifeval lm-eval-harness 15-25% 35-45%
ko_winogrande lm-eval-harness 53-58% 60-68%
KoBEST (5 tasks avg) lm-eval-harness 55-60% 65-75%
๋ฐ˜๋ณต๋ฅ  (raw) test_generation_params.py 18% <8%
๋ฐ˜๋ณต๋ฅ  (+rep_penalty) test_generation_params.py ~5-8% <3%

Run commands

# ko_ifeval
lm_eval --model hf \
    --model_args pretrained=checkpoints/korean_3b_sft/checkpoint-BEST,dtype=bfloat16 \
    --tasks ko_ifeval --device cuda:0

# KoBEST
lm_eval --model hf \
    --model_args pretrained=checkpoints/korean_3b_sft/checkpoint-BEST,dtype=bfloat16 \
    --tasks kobest_boolq,kobest_copa,kobest_wic,kobest_hellaswag,kobest_sentineg \
    --device cuda:0

ํŒ๋‹จ ๊ธฐ์ค€

                    [3B SFT ํ‰๊ฐ€ ๊ฒฐ๊ณผ]
                          โ”‚
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
    โ”‚                     โ”‚                     โ”‚
  โœ… PASS              โš ๏ธ PARTIAL            โŒ FAIL
 ๋ฐ˜๋ณต๋ฅ <5%            ๋ฐ˜๋ณต๋ฅ  5-10%          ๋ฐ˜๋ณต๋ฅ >10%
 ko_ifeval>35%       ko_ifeval 25-35%      ko_ifeval<25%
    โ”‚                     โ”‚                     โ”‚
    โ–ผ                     โ–ผ                     โ–ผ
 ๋ฐฐํฌ ์ค€๋น„             ORPO ์ ์šฉ             ์›์ธ ๋ถ„์„
 (Phase ๋ฐฐํฌ)        (Phase 5)           (๋ฐ์ดํ„ฐ/์•„ํ‚คํ…์ฒ˜ ์ ๊ฒ€)
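The branching above is mechanical enough to encode directly (an illustrative helper, not part of the repo; mixed cases fall through to PARTIAL):

```python
# Map SFT evaluation results onto the PASS / PARTIAL / FAIL branches.
def sft_verdict(repetition_pct, ko_ifeval_pct):
    if repetition_pct < 5 and ko_ifeval_pct > 35:
        return "PASS: deploy"
    if repetition_pct > 10 or ko_ifeval_pct < 25:
        return "FAIL: root-cause analysis"
    return "PARTIAL: apply ORPO"

print(sft_verdict(4.0, 40.0))   # PASS: deploy
print(sft_verdict(7.0, 30.0))   # PARTIAL: apply ORPO
print(sft_verdict(12.0, 20.0))  # FAIL: root-cause analysis
```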

7. Full Timeline

At a glance

Day 0 (now, 04:30)
  ├── [04:30] Write config + sanity check               30 min
  ├── [05:00] 🔥 Start pretraining                      ← begins tonight
  ├── [05:00] (parallel) Start korean_extra tokenization  8-12h
  │
Day 0 → Day 1
  ├── Pretraining running... (~26 hours)
  ├── [midway] Check checkpoint PPL
  │
Day 1
  ├── [05:00 + 26h ≈ 07:00] Pretraining done
  ├── [07:00] Start SFT                                 2 hours
  ├── [09:00] SFT done → evaluate
  ├── [09:30] Repetition <5%? → deploy
  ├── [09:30] Repetition 5-10%? → ORPO                  1-2 hours
  ├── [11:30] ORPO done → final evaluation
  │
Day 2
  ├── Full benchmark suite                              2 hours
  ├── HuggingFace upload                                1 hour
  ├── vLLM serving test                                 1 hour
  └── 🎉 Deployment complete

Table form

| Stage | Start | Duration | Done | Depends on |
|---|---|---|---|---|
| 0. Config + sanity | Day 0 04:30 | 30 min | 05:00 | none |
| 1. Pretraining | Day 0 05:00 | 26 hours | Day 1 ~07:00 | config |
| (parallel) Extra tokenization | Day 0 05:00 | 8-12 hours | Day 0 ~17:00 | none |
| 2. SFT | Day 1 07:00 | 2 hours | Day 1 09:00 | pretraining done |
| 3. First evaluation | Day 1 09:00 | 30 min | Day 1 09:30 | SFT done |
| 4. ORPO (conditional) | Day 1 09:30 | 1-2 hours | Day 1 11:30 | repetition >5% |
| 5. Full benchmarks | Day 1 11:30 | 2 hours | Day 1 13:30 | |
| 6. Deployment | Day 1 13:30 | 2 hours | Day 1 15:30 | benchmarks pass |

8. ์˜์‚ฌ๊ฒฐ์ • ํŠธ๋ฆฌ

Phase 1: ์‚ฌ์ „ํ•™์Šต ์ค‘ (Step 5000, 10000, ...)

Loss ํ•˜๊ฐ• ์ค‘?
โ”œโ”€โ”€ YES โ†’ ๊ณ„์†
โ””โ”€โ”€ NO (์ •์ฒด 3000+ steps)
    โ”œโ”€โ”€ ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ ๋ฌธ์ œ? โ†’ PPL ํ•„ํ„ฐ ๊ฐ•ํ™” + ์žฌ์‹œ์ž‘
    โ”œโ”€โ”€ LR ๋ฌธ์ œ? โ†’ LR ๋ฐ˜๊ฐ ํ›„ resume
    โ””โ”€โ”€ ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜? โ†’ d_model/n_layers ์กฐ์ • (์ตœํ›„ ์ˆ˜๋‹จ)

PPL (ํ•œ๊ตญ์–ด ํ…์ŠคํŠธ)?
โ”œโ”€โ”€ < 15 at 2 epoch โ†’ ์ •์ƒ
โ”œโ”€โ”€ 15-20 at 2 epoch โ†’ ์ฃผ์˜ (๋ฐ์ดํ„ฐ ๋ถ€์กฑ?)
โ””โ”€โ”€ > 20 at 2 epoch โ†’ ๋ฌธ์ œ (๋ฐ์ดํ„ฐ ํ’ˆ์งˆ or ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ)

OOM?
โ”œโ”€โ”€ batch_size 4โ†’2, grad_accum 8โ†’16
โ””โ”€โ”€ gradient checkpointing ํ™•์ธ

Phase 2: after SFT

Repetition rate (raw)?
├── < 5%  → ✅ ready to deploy! (skip ORPO)
├── 5-10% → ⚠️ apply ORPO
├── 10-15% → 🟠 retune SFT hyperparameters and retry
└── > 15% → ❌ pretraining-quality problem → re-examine Phase 1

ko_ifeval?
├── > 35% → ✅ target met
├── 25-35% → 🟡 consider data augmentation
└── < 25% → 🔴 if even the 3B scores this low, the data problem is serious

Phase 3: after ORPO

Repetition rate?
├── < 3% → ✅ done
├── 3-5% → 🟡 compensate with rep_penalty=1.05 at serving time
└── > 5% → 🔴 revisit the preference data

9. Contingencies

| Scenario | Probability | Response |
|---|---|---|
| OOM | 5% | batch_size 4→2, grad_accum 2× |
| Loss divergence | 5% | halve LR, tighten grad_clip to 0.5 |
| GPU hang / NCCL | 10% | pkill torchrun → resume from latest checkpoint |
| Disk full | 3% | save_interval 2000→5000, delete old checkpoints |
| Post-pretrain PPL >20 | 10% | add data (korean_extra) + extended training |
| Post-SFT repetition >15% | 10% | re-tighten data filters + adjust LR/epochs |
| Post-ORPO quality regression | 15% | lower ORPO LR (5e-7 → 1e-7), adjust beta |
| FP8 numerical instability | 5% | fall back to BF16 (1.5× longer) |

NCCL/GPU recovery script

# Kill stray processes
pkill -f torchrun && sleep 5

# Find the latest checkpoint
LATEST=$(ls -d checkpoints/korean_3b_fp8/checkpoint-[0-9]* 2>/dev/null \
  | sort -t- -k2 -n | tail -1)

# Resume
bash scripts/run_pretrain.sh --config configs/korean_3b_fp8.yaml --resume "${LATEST}"

10. Checklist of Lessons from the 1B Run

Must verify before training

  • Dynamic padding works: SFTDataset.__getitem__ returns variable-length tensors; dynamic_collate_fn pads per batch
  • EOS preserved: grep -n "eos_token_id" data/sft_dataset.py (forced append on truncation)
  • Val split exists: wc -l data/sft_v2/val.jsonl → confirm 10K+
  • Data contamination removed: filters for </s> literals, Q/A markers, and self-repetition patterns applied
  • OpenOrca weight ≤ 2.0: check in prepare_sft_data.py
  • Prompt format unified: training = inference = <|user|>/<|assistant|>
  • Labels shift correct: trainer.py compares logits[t] → targets[t] directly, with the shift handled in the labels

Must monitor during training

  • Track val loss: record at every eval_interval; 3 consecutive rises is a warning
  • Early stopping enabled: patience=5
  • Zero-loss detection: loss < 0.01 for 3 consecutive steps → check for a labels bug immediately
  • Grad norm: warn above 10, abort above 50
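The monitoring rules above can be folded into one small helper (an illustrative sketch, not the actual trainer code):

```python
# Stateful monitor for the early-stopping and sanity rules above.
class TrainMonitor:
    def __init__(self, patience=5):
        self.patience = patience
        self.best_val = float("inf")
        self.bad_evals = 0
        self.near_zero = 0

    def on_eval(self, val_loss):
        """Return True when early stopping should trigger."""
        if val_loss < self.best_val:
            self.best_val, self.bad_evals = val_loss, 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

    def on_step(self, loss, grad_norm):
        """Return 'abort' / 'warn' / 'ok' for this training step."""
        self.near_zero = self.near_zero + 1 if loss < 0.01 else 0
        if self.near_zero >= 3 or grad_norm > 50:
            return "abort"               # likely labels bug or divergence
        return "warn" if grad_norm > 10 else "ok"

m = TrainMonitor()
print(m.on_step(2.5, 1.2))  # ok
# Three consecutive near-zero losses trip the labels-bug check:
print(m.on_step(0.001, 1.0), m.on_step(0.001, 1.0), m.on_step(0.001, 1.0))  # ok ok abort
```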

Must verify after training

  • Generation test in the correct format: <|user|>\n{question}\n<|assistant|>\n
  • Measure repetition rate without rep_penalty: target <10%
  • Repetition rate with rep_penalty=1.1: target <3%
  • Run benchmarks: ko_ifeval, KoBEST

Never repeat these

  • ❌ Don't evaluate with mismatched training/inference formats
  • ❌ Don't train without a val split
  • ❌ Don't upsample any single source by 5× or more
  • ❌ Don't train for fewer than 2 epochs
  • ❌ Don't train with dynamic padding broken
  • ❌ Don't conclude "loss is low, so it's fine" without measuring repetition

🔥 The First Command to Run Right Now, Tonight

cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang

# 1. Write the 3B config
cat > configs/korean_3b_fp8.yaml << 'YAML'
model:
  vocab_size: 64000
  d_model: 3072
  n_layers: 32
  n_heads: 24
  n_kv_heads: 8
  d_ffn: 8192
  max_seq_len: 4096
  rope_theta: 500000.0
  dropout: 0.0
  bias: false
  use_flash_attn: true
  use_fp8: true
train:
  max_steps: 34000
  batch_size: 4
  grad_accum_steps: 8
  lr: 1.5e-4
  min_lr: 1.5e-5
  weight_decay: 0.1
  warmup_steps: 2000
  max_grad_norm: 1.0
  log_interval: 10
  save_interval: 2000
  eval_interval: 500
  fp8_format: "MXFP8"
YAML

# 2. Start pretraining!
bash scripts/run_pretrain.sh --config configs/korean_3b_fp8.yaml

⚡ The 3 Most Important Decision Points

1️⃣ Pretraining step 5,000 (1 epoch done): "baseline fitness check"

  • PPL < 20? → normal, keep going
  • PPL > 20? → data-quality or hyperparameter problem; diagnose immediately

2️⃣ Repetition rate after SFT: "the 3B's real ability"

  • <5%? → 🎉 deploy straight away, no ORPO; big win
  • 5-10%? → one round of ORPO should fix it
  • >10%? → re-examine pretraining quality (unlikely)

3️⃣ ko_ifeval score: "usable in practice?"

  • >35%? → competitive as a 3B Korean model; deploy
  • 25-35%? → room to improve with more SFT data
  • <25%? → pretraining data was probably insufficient → consider extended pretrain

"1B์—์„œ ๋ฐฐ์› ๊ณ , 3B์—์„œ ์ฆ๋ช…ํ•œ๋‹ค."

์ด ๋ฌธ์„œ๋Š” ๊ฐ Phase ์™„๋ฃŒ ์‹œ ์‹ค์ธก ๊ฒฐ๊ณผ๋กœ ์—…๋ฐ์ดํŠธํ•  ๊ฒƒ.