
# 3B Korean LLM Training Pipeline: Overall Plan

Date: 2026-02-27
Server: 8× B200 192GB, NVSwitch, CUDA 13.1, PyTorch 2.10, TransformerEngine FP8
Project: /PROJECT/0325120031_A/ghong/taketimes/llm-bang/


1. ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜ (3B)

ํ•ญ๋ชฉ 1B (ํ˜„์žฌ) 3B (์‹ ๊ทœ) ๊ทผ๊ฑฐ
d_model 2048 3072 LLaMA-3 3B ์ฐธ๊ณ 
n_layers 24 28
n_heads 16 24
n_kv_heads 4 (GQA 4:1) 8 (GQA 3:1) ๋” ํฐ ๋ชจ๋ธ์—์„œ KV ์ข€ ๋” ์—ฌ์œ 
d_ffn 5472 8192 ~2.67ร— d_model, 128๋ฐฐ์ˆ˜ (FP8)
max_seq_len 4096 4096 ๋™์ผ
vocab_size 64000 64000 ๋™์ผ ํ† ํฌ๋‚˜์ด์ €
์ด ํŒŒ๋ผ๋ฏธํ„ฐ ~1.0B ~3.0B

2. ์‚ฌ์ „ํ•™์Šต ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ

ํ•ญ๋ชฉ 1B ๊ฐ’ 3B ๊ฐ’ ๊ทผ๊ฑฐ
Learning Rate 2e-4 1.5e-4 ฮผP scaling ~1/โˆš(3), LLaMA-3 3B ์ฐธ๊ณ 
LR Schedule cosine decay cosine decay ๋™์ผ
Warmup Steps 2000 2000 57k์˜ 3.5% (์ ์ ˆ)
Weight Decay 0.1 0.1 ํ‘œ์ค€
Gradient Clip 1.0 1.0 ํ‘œ์ค€
Batch Size (local) 8 8 per GPU
Grad Accum 4 4
Eff Batch 1M tok/step 1M tok/step 8ร—8ร—4ร—4096
Max Steps 34,000 57,000 (60B tok) Chinchilla: 3B โ†’ 60B min
์ด ํ† ํฐ 35.6B 60B (์ตœ์†Œ) / 95k steps=100B (๊ถŒ์žฅ)
Save Interval 500 2000 27GB/์ฒดํฌํฌ์ธํŠธ โ†’ ๋œ ์ž์ฃผ
Eval Interval 200 500
FP8 MXFP8 MXFP8 B200 ๋„ค์ดํ‹ฐ๋ธŒ
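The effective-batch row and the two token budgets follow from the run shape; a quick arithmetic check:

```python
# Effective batch = GPUs x local batch x grad accum x seq_len (from the table).
gpus, local_batch, grad_accum, seq_len = 8, 8, 4, 4096
tokens_per_step = gpus * local_batch * grad_accum * seq_len
print(tokens_per_step)                    # 1048576, i.e. ~1M tokens/step

for steps in (57_000, 95_000):
    print(steps, f"{steps * tokens_per_step / 1e9:.1f}B tokens")
# 57k steps -> ~59.8B (~60B), 95k steps -> ~99.6B (~100B)
```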

3. ์ž์› ์˜ˆ์ธก

์ฒดํฌํฌ์ธํŠธ

  • model.pt: 3B ร— 1B(FP8) โ‰ˆ 3GB
  • optimizer.pt: 3B ร— 8B(FP32 states) โ‰ˆ 24GB
  • ์ฒดํฌํฌ์ธํŠธ๋‹น ~27GB
  • 2000 step ๊ฐ„๊ฒฉ โ†’ ์ตœ๋Œ€ ~28๊ฐœ = 756GB
  • /PROJECT ์—ฌ์œ  19TB โ†’ ์ถฉ๋ถ„

### VRAM

- Model: ~6 GB (FP8)
- Optimizer states: ~24 GB
- Activations: ~40-60 GB (batch 8, seq 4096)
- Total: ~80-90 GB/GPU → comfortable on a 192 GB B200 (~47% used)

ํ•™์Šต ์‹œ๊ฐ„ ์˜ˆ์ƒ

  • 1B: 34k steps โ†’ ~12h (๊ด€์ฐฐ๊ฐ’ ๊ธฐ๋ฐ˜)
  • 3B: step๋‹น ~3๋ฐฐ โ†’ step ~1.05s ์˜ˆ์ƒ
  • 57k steps ร— 1.05s โ‰ˆ ~17h (60B tokens)
  • 95k steps ร— 1.05s โ‰ˆ ~28h (100B tokens)
  • ์•ˆ์ „ ๋งˆ์ง„ ํฌํ•จ: 24~36h

## 4. NCCL Tuning (B200 NVSwitch)

```bash
export NCCL_IB_DISABLE=1          # single node, no InfiniBand needed
export NCCL_ALGO=Ring,Tree        # both algorithms for 3B-sized gradients
export NCCL_PROTO=Simple          # best for NVLink bulk transfers
export NCCL_MIN_NCHANNELS=16
export NCCL_MAX_NCHANNELS=16
export NCCL_BUFFSIZE=134217728    # 128 MB (up from 64 MB in the 1B run)
export NCCL_P2P_LEVEL=NVL         # direct P2P over NVLink
```

## 5. SFT Plan (after pretraining completes)

### Data

- 1B SFT data (161k samples): already validated
- Additional high-quality data under consideration:
  - Ko-Alpaca expansion
  - ShareGPT-ko addition
  - Target: 200k+ samples

ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ

ํ•ญ๋ชฉ 1B SFT 3B SFT ๊ทผ๊ฑฐ
LR 2e-5 1e-5 ๋” ํฐ ๋ชจ๋ธ โ†’ ๋” ๋‚ฎ์€ LR
Batch (local) 4 4
Grad Accum 2 2
Eff Batch 64 64
Max Steps 9000 12000 3B๋Š” ์ˆ˜๋ ด์— ์ข€ ๋” ํ•„์š”
Warmup 300 500
Max Seq Len 4096 4096
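The effective-batch row is just the product of the run shape (values from the table):

```python
# SFT effective batch: local batch x grad accum x GPUs.
local_batch, grad_accum, gpus = 4, 2, 8
eff_batch = local_batch * grad_accum * gpus
print(eff_batch)   # 64
```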

### Lessons from the 1B SFT run

- ✅ Labels-shift bug: already fixed in sft_dataset.py
- ✅ Duplicate-process guard: pgrep check added to the launch script
- ✅ Zero-loss detection: included in monitoring

์˜ˆ์ƒ SFT ์‹œ๊ฐ„

  • 12000 steps ร— 0.8s/step โ‰ˆ **3h**

## 6. ORPO Plan (after SFT)

### Data

- maywell/ko_Ultrafeedback_binarized: Korean preference data
- Download:

  ```python
  from datasets import load_dataset
  ds = load_dataset("maywell/ko_Ultrafeedback_binarized")
  ```

์„ค์ •

ํ•ญ๋ชฉ ๊ฐ’ ๊ทผ๊ฑฐ
LR 5e-6 ORPO ํ‘œ์ค€, SFT๋ณด๋‹ค ๋‚ฎ๊ฒŒ
ฮฒ (ORPO lambda) 0.1 ๋…ผ๋ฌธ ๊ธฐ๋ณธ๊ฐ’
Batch (local) 2 chosen+rejected ์Œ โ†’ ๋ฉ”๋ชจ๋ฆฌ 2๋ฐฐ
Grad Accum 4 eff_batch = 64
Max Steps ~3000-5000 ๋ฐ์ดํ„ฐ ํฌ๊ธฐ์— ๋”ฐ๋ผ
Max Seq Len 4096
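For context, the objective these settings feed into can be sketched in a few lines. This is a minimal, illustrative version of the ORPO loss (NLL on the chosen response plus λ times a log-odds-ratio penalty) with the table's λ = 0.1; `p_w` and `p_l` stand for the model's average per-token probabilities of the chosen and rejected responses and are illustrative names, not the training code's API:

```python
import math

# Minimal sketch of the ORPO objective with lambda = 0.1 from the table.
def orpo_loss(nll_chosen, p_w, p_l, lam=0.1):
    odds = lambda p: p / (1.0 - p)
    log_odds_ratio = math.log(odds(p_w)) - math.log(odds(p_l))
    l_or = -math.log(1.0 / (1.0 + math.exp(-log_odds_ratio)))  # -log sigmoid
    return nll_chosen + lam * l_or

# Preferring the chosen response more strongly lowers the loss:
print(orpo_loss(1.2, p_w=0.6, p_l=0.3))   # ≈ 1.225
print(orpo_loss(1.2, p_w=0.9, p_l=0.1))   # ≈ 1.201
```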

์˜ˆ์ƒ ORPO ์‹œ๊ฐ„

  • 5000 steps ร— 1.5s/step (์Œ ๋น„๊ต) โ‰ˆ **2h**

7. ์ „์ฒด ํƒ€์ž„๋ผ์ธ

Phase 1: ์‚ฌ์ „ํ•™์Šต (60B tokens)
โ”œโ”€ ์ค€๋น„: configs, scripts ํ™•์ธ          ~30๋ถ„
โ”œโ”€ ํ•™์Šต: 57k steps                     ~24-36h
โ””โ”€ ํ‰๊ฐ€: eval suite                    ~1h

Phase 2: SFT
โ”œโ”€ ๋ฐ์ดํ„ฐ ์ค€๋น„: 161k+ ๊ฒ€์ฆ             ~30๋ถ„
โ”œโ”€ ํ•™์Šต: 12k steps                     ~3h
โ””โ”€ ํ‰๊ฐ€                                ~30๋ถ„

Phase 3: ORPO
โ”œโ”€ ๋ฐ์ดํ„ฐ ๋‹ค์šด๋กœ๋“œ+์ฒ˜๋ฆฌ                 ~30๋ถ„
โ”œโ”€ ํ•™์Šต: 3-5k steps                    ~2h
โ””โ”€ ์ตœ์ข… ํ‰๊ฐ€                           ~1h

์ด ์˜ˆ์ƒ: ์•ฝ 3-4์ผ (์‚ฌ์ „ํ•™์Šต ํฌํ•จ)
        ์‚ฌ์ „ํ•™์Šต๋งŒ: 24-36h
        SFT+ORPO:  ~6h

8. ์˜ˆ์™ธ ๋Œ€์‘ ํ”Œ๋ ˆ์ด๋ถ (3B ํŠนํ™”)

8-1. ์„œ๋ฒ„ ์žฌ์‹œ์ž‘์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ

  1. Ctrl+C๋กœ graceful stop (SIGINT โ†’ ํ˜„์žฌ step ์™„๋ฃŒ ํ›„ ์ฒดํฌํฌ์ธํŠธ ์ €์žฅ)
  2. ์•ˆ ๋˜๋ฉด kill -15 <PID> โ†’ 10์ดˆ ๋Œ€๊ธฐ โ†’ kill -9
  3. ์žฌ์‹œ์ž‘ ํ›„: bash scripts/launch_3b_pretrain.sh (์ž๋™ resume ๊ฐ์ง€)

### 8-2. Checkpoint corruption

```bash
# Verify the integrity of the most recent checkpoint
python -c "
import torch
ckpt = torch.load('checkpoints/korean_3b_fp8_run1/checkpoint-XXXXX/model.pt', weights_only=True)
print(f'Keys: {len(ckpt)}')
print(f'Total params: {sum(v.numel() for v in ckpt.values()):,}')
"
# If corrupted → resume from the previous checkpoint
bash scripts/launch_3b_pretrain.sh --resume checkpoints/korean_3b_fp8_run1/checkpoint-YYYYY
```

### 8-3. NCCL hang detection

- monitor_3b.sh log stalls for 10 minutes → CRITICAL alert
- --auto-restart option: automatic kill plus restart guidance
- Manual: kill -9 $(pgrep -f pretrain.py) → relaunch

### 8-4. Running out of disk space

- monitor_3b.sh --auto-cleanup: automatically deletes the oldest checkpoints once MAX_CHECKPOINTS (15) is exceeded
- Manual cleanup:

  ```bash
  ls -d checkpoints/korean_3b_fp8_run1/checkpoint-* | sort -V | head -10 | xargs rm -rf
  ```

### 8-5. Loss divergence (NaN / spike)

1. Stop immediately
2. Resume from the most recent healthy checkpoint
3. Restart with the LR cut by 50%: --lr 7.5e-5
4. If it recurs, lengthen warmup: --warmup_steps 4000

9. ๋ชจ๋‹ˆํ„ฐ๋ง ์Šคํฌ๋ฆฝํŠธ

์‹ค์‹œ๊ฐ„ ๊ฐ์‹œ

bash scripts/monitor_3b.sh                    # ๊ธฐ๋ณธ (60์ดˆ ๊ฐ„๊ฒฉ)
bash scripts/monitor_3b.sh --check-once       # 1ํšŒ
bash scripts/monitor_3b.sh --auto-cleanup     # ์ž๋™ ์ฒดํฌํฌ์ธํŠธ ์ •๋ฆฌ
bash scripts/monitor_3b.sh --auto-restart     # NCCL hang ์‹œ ์ž๋™ kill

TensorBoard

tensorboard --logdir checkpoints/korean_3b_fp8_run1/tensorboard --port 6007

10. ์‹คํ–‰ ์ปค๋งจ๋“œ ์š”์•ฝ

# Step 1: ์‚ฌ์ „ํ•™์Šต ์‹œ์ž‘
bash scripts/launch_3b_pretrain.sh

# Step 1b: ๋ชจ๋‹ˆํ„ฐ๋ง (๋ณ„๋„ ํ„ฐ๋ฏธ๋„)
bash scripts/monitor_3b.sh --auto-cleanup

# Step 2: ์‚ฌ์ „ํ•™์Šต ์™„๋ฃŒ ํ›„ SFT (launch_sft.sh 3B ๋ฒ„์ „ ํ•„์š”)
BASE_CHECKPOINT=checkpoints/korean_3b_fp8_run1/checkpoint-XXXXX \
RUN_NAME=korean_3b_sft \
bash scripts/launch_sft.sh --lr 1e-5 --max_steps 12000 --warmup_steps 500

# Step 3: ORPO (๋ณ„๋„ ์Šคํฌ๋ฆฝํŠธ ํ•„์š” โ€” train/orpo.py ์ž‘์„ฑ ํ•„์š”)
# TBD after SFT

์ƒํƒœ: โœ… ํŒŒ์ดํ”„๋ผ์ธ ์„ค๊ณ„ ์™„๋ฃŒ, ์Šคํฌ๋ฆฝํŠธ ์ž‘์„ฑ ์™„๋ฃŒ
๋‹ค์Œ ์•ก์…˜: bash scripts/launch_3b_pretrain.sh ์‹คํ–‰