FRANKENSTALLM


Building a Korean 3B LLM from scratch on 8× NVIDIA B200. Stitched together from pieces like Frankenstein, and tempered hard like steel.

GitHub: pathcosmos/FRANKENSTALLM


Table of contents

  1. Why this project
  2. Current status at a glance
  3. Hardware environment
  4. Project structure
  5. Project journey timeline
  6. Model architecture
  7. Training data
  8. Training configuration and optimization
  9. Experiment results: 1B baseline
  10. Experiment results: 3B Base comprehensive evaluation (v2)
  11. Experiment results: 3B SFT comprehensive evaluation
  12. Phase 3: ORPO (preference alignment)
  13. How to run
  14. Roadmap
  15. Reference documents
  16. Tech stack summary
  17. Related projects
  18. Next optimization plan
  19. GPU hardware & cost analysis

1. Why this project

The Korean LLM ecosystem is growing quickly. However, most open models are either Korean fine-tunes layered on top of English-centric pretraining, or their training process is undisclosed and therefore cannot be reproduced.

This project is different.

  • From scratch: every stage is implemented directly, from tokenizer training through pretraining, SFT, and preference alignment.
  • Fully open builder's log: not only the successes are recorded. Bugs, failures, misjudgments, and their root-cause analyses are all written down.
  • Practical scale: neither a toy model for academic papers (125M) nor a 70B model that only a research lab can reproduce. The target is a practical Korean model at the 3B scale.
  • B200 optimization: the FP8 Tensor Cores, NVLink 5.0, and FlashAttention-2 on NVIDIA B200 are used as fully as possible. Squeezing the most out of the latest hardware is itself part of the learning.

This README is not an announcement of a finished result. It is the log of a build that is still in progress.


2. Current status at a glance

As of 2026-03-09

| Stage | Status | Details |
|---|---|---|
| Phase 0: Foundation | ✅ Done | OOM fix, GQA FA optimization, NCCL NVLS, pipeline preparation |
| Phase 1: 3B Pretrain | ✅ Done | 57,000 steps, loss 1.466, ~63 hours |
| Phase 2: SFT | ✅ Done | 25,500 steps (early stop), val_loss 1.8851, ~15.5 hours |
| Phase 2.5: SFT evaluation | ✅ Done | 6-dimension evaluation, 4/6 PASS, decision to proceed with ORPO |
| Phase 3: ORPO sweep | ✅ Done | 6-config sweep finished, best: lr=1.2e-5, beta=0.25 |
| Phase 3: ORPO main training | 🔄 In progress | 630K pairs, 2 epochs, ~9,840 steps, ~4.8 hours |
| Phase 4: Deployment | 📋 Pending | GGUF conversion → Ollama serving |

Phase 2 (SFT) final results

| Item | Value |
|---|---|
| Final step | 25,500 / 33,000 (77.3%, early stopping) |
| Val loss (best) | 1.8851 (step 23,000) |
| Training time | ~15 h 41 min (2026-03-05 22:15 ~ 2026-03-06 13:56) |
| VRAM usage | 24.2GB / 183GB per GPU (13.2%) |
| Base model | checkpoint-0057000 (pretrain loss 1.466) |
| SFT data | 2,439,397 samples (24 sources, 7.48 GB) |
| Incidents | 0 (no OOM, NCCL, or NaN issues) |

SFT val loss, full trend:

Step     500: 2.073
Step   2,000: 1.956  (-0.117)
Step   5,000: 1.911  (-0.045)
Step  10,000: 1.892  (-0.019)
Step  15,000: 1.886  (-0.006)
Step  20,000: 1.885  (-0.001)
Step  23,000: 1.8851 ← BEST
Step  25,500: 1.8851 → Early Stop (patience 5/5)
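
The stop at step 25,500 is what a patience counter produces when the best value lands at step 23,000 and evaluation runs every 500 steps. The following is a minimal sketch of that logic, assuming the 500-step eval interval and patience of 5 stated elsewhere in this README. The function name and the filler loss values are illustrative and are not taken from train/trainer.py.

```python
def early_stop_step(val_losses, eval_interval=500, patience=5):
    """Return (stop_step, best_step) for a list of val losses, one per eval."""
    best_loss = float("inf")
    best_step = None
    bad_evals = 0
    for i, loss in enumerate(val_losses, start=1):
        step = i * eval_interval
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_step, bad_evals = loss, step, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return step, best_step
    return None, best_step


# Hypothetical trace: improving at every eval up to step 23,000, then flat.
trace = [2.0 - 0.0001 * i for i in range(45)] + [1.8851] + [1.8851] * 10
stop_step, best_step = early_stop_step(trace)
```

With this trace the last improvement is the 46th evaluation (step 23,000), and five non-improving evaluations later the loop stops.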

SFT 6-dimension evaluation summary

| Dimension | Result | Key metric |
|---|---|---|
| Perplexity (knowledge retention) | PASS | forgetting 0.9% |
| Generation quality | FAIL | greedy repetition rate 72.97% |
| Korean benchmarks | FAIL | KoBEST average 43.26% |
| English benchmarks | PASS | all tasks above the lower bound |
| Calibration | PASS | Top-1 68.59% |
| SFT chat ability | PASS | EOS termination rate 60% (Base 0%) |

Verdict: proceed with ORPO. Knowledge retention is excellent (0.9%), and the repetition rate is to be addressed through preference alignment. Details: reports/2026-03-06_3B_SFT_COMPLETION_AND_EVAL_SUMMARY.md


3. ํ•˜๋“œ์›จ์–ด ํ™˜๊ฒฝ

GPU

ํ•ญ๋ชฉ ์‚ฌ์–‘
๋ชจ๋ธ 8ร— NVIDIA B200
VRAM 183GB HBM3e per GPU (~1.47TB ํ•ฉ๊ณ„)
FP8 Tensor Core 2,250 TFLOPS/GPU (์ด 18,000 TFLOPS)
BF16 1,125 TFLOPS/GPU
HBM3e ๋Œ€์—ญํญ ~7.67 TB/s per GPU
์ธํ„ฐ์ปค๋„ฅํŠธ NVLink 5.0 (900 GB/s bidirectional per GPU)
ํ† ํด๋กœ์ง€ NVSwitch โ€” ๋ชจ๋“  GPUโ†”GPU ๋‹จ์ผ ํ™‰ All-to-All Mesh
์ „๋ ฅ 940W ์‹ค์ธก / 1000W cap

B200์€ FP8 ๋„ค์ดํ‹ฐ๋ธŒ ์ง€์› ๋ชจ๋ธ์ด๋‹ค. torch.float8_e4m3fn ์„ TransformerEngine์˜ MXFP8 ๋ ˆ์‹œํ”ผ์™€ ๊ฒฐํ•ฉํ•ด ํ•™์Šตํ•œ๋‹ค. BF16 ๋Œ€๋น„ ์—ฐ์‚ฐ๋Ÿ‰์ด ์ด๋ก ์ƒ 2๋ฐฐ์ด๋ฉฐ, ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ๋„ ํ–ฅ์ƒ๋œ๋‹ค.

CPU and system memory

| Item | Spec |
|---|---|
| CPU | 2× AMD EPYC 9365 (Turin / Zen 5) |
| Physical cores | 72 (36 cores × 2 sockets) |
| NUMA layout | 2 nodes: node0 (cores 0-35) / node1 (cores 36-71) |
| GPU↔NUMA mapping | GPU 0-3 → NUMA node 0, GPU 4-7 → NUMA node 1 |
| RAM | 2.21TB DDR5 (~2.03TB free) |
| L3 cache | 384MB (12 CCX × 32MB) |

NUMA caveat: in the initial DDP launch, 5 of 8 ranks ran on the wrong NUMA node, and 69% of DataLoader workers were cross-NUMA. NUMA affinity optimization has not been applied yet (roadmap item).
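
As a rough illustration of what that roadmap item could look like, the sketch below pins each rank to the cores of the NUMA node that hosts its GPU, using the mapping from the table above. The helper names are hypothetical and this is not code from the repository. os.sched_setaffinity is Linux-only, and child processes forked after the call inherit the mask.

```python
import os

CORES_PER_NODE = 36   # node0: cores 0-35, node1: cores 36-71 (table above)
GPUS_PER_NODE = 4     # GPU 0-3 -> node 0, GPU 4-7 -> node 1


def cores_for_rank(local_rank: int) -> set[int]:
    """CPU cores on the NUMA node that hosts this rank's GPU."""
    node = local_rank // GPUS_PER_NODE
    start = node * CORES_PER_NODE
    return set(range(start, start + CORES_PER_NODE))


def pin_to_numa_node(local_rank: int) -> None:
    """Restrict the current process to its local NUMA node (Linux only)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cores_for_rank(local_rank))
```

A launcher would call pin_to_numa_node(local_rank) early in each rank, before DataLoader workers are created. Memory placement itself would still need a tool such as numactl, which this sketch does not cover.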

Storage

| Path | Purpose | Free space |
|---|---|---|
| /PROJECT/0325120031_A/ghong/taketimes/llm-bang/ | Main workspace (checkpoints, data) | 2.2TB |
| /home/ghong/ | Small code only | 5GB (quota) |

Caution: checkpoints (tens of GB), training data (82GB+), and intermediate artifacts are all stored under /PROJECT/.... Otherwise the home directory quota may be exceeded.

Software environment

| Package | Version |
|---|---|
| PyTorch | 2.10.0a0+b4e4ee81d3.nv25.12 (NVIDIA custom build) |
| FlashAttention | 2.7.4.post1+25.12 |
| TransformerEngine | 2.10.0 |
| NCCL | 2.28.9 |
| Triton | 3.5.1 |
| CUDA | 13.1 |
| Driver | 580.95.05 |

Warning: this PyTorch is a custom NVIDIA build optimized for B200. Reinstalling it with pip install torch breaks the B200 optimizations. Never reinstall it.


4. Project structure

llm-bang/
├── CLAUDE.md                          # Claude Code guide
├── README.md                          # This file
├── PROGRESS.md                        # Progress record (dated log)
├── Modelfile.3b                       # Ollama model file
│
├── configs/
│   ├── korean_3b_fp8.yaml             # 3B FP8 training config (currently in use)
│   ├── 3b_pretrain.yaml               # 3B pretrain config (alternative)
│   ├── korean_1b_fp8.yaml             # 1B FP8 config (archived)
│   ├── korean_3b_sft.yaml             # 3B SFT v1 config (completed)
│   ├── korean_3b_sft_v2.yaml          # 3B SFT v2 config (lr=5e-5, data mixing)
│   ├── korean_3b_orpo.yaml            # 3B ORPO config (lr=5e-6, beta=0.1)
│   ├── hybrid_3b.yaml                 # Hybrid 3B (Mamba-2 + Attention)
│   ├── small_fp8.yaml                 # 125M FP8 validation
│   ├── medium.yaml                    # Medium model config
│   └── small.yaml                     # Small model config
│
├── data/
│   ├── 3b_train.bin                   # Pretraining data (82GB, 41.12B tokens)
│   ├── 3b_val.bin                     # Validation data (151MB)
│   ├── cc100_ko_train.bin             # CC100 Korean (4.5GB)
│   ├── cosmo_auto_math_text_train.bin # Math text (2.6GB)
│   └── build scripts, __init__.py
│
├── model/
│   ├── attention.py                   # GQA FlashAttention (Phase 0 optimizations applied)
│   ├── transformer.py                 # Main transformer architecture
│   ├── config.py                      # Model config dataclass
│   └── layers.py                      # Custom layers (RMSNorm, SwiGLU, etc.)
│
├── train/
│   ├── pretrain.py                    # Pretraining script (DDP optimized)
│   ├── sft.py                         # SFT training
│   ├── orpo.py                        # ORPO training
│   ├── trainer.py                     # Unified trainer (loss sync optimized)
│   └── utils.py                       # Utilities (NCCL 7200s timeout, etc.)
│
├── scripts/
│   ├── launch_3b_pretrain.sh          # 3B pretrain launcher (includes NCCL env vars)
│   ├── launch_3b_sft.sh               # 3B SFT v1 launcher
│   ├── launch_3b_sft_v2.sh            # 3B SFT v2 launcher (data mixing)
│   ├── launch_3b_orpo.sh              # 3B ORPO launcher
│   ├── monitor_3b.sh                  # Real-time training monitor
│   ├── training_watchdog.sh           # Watchdog (every 10 minutes, cron)
│   ├── convert_3b_gguf.sh             # GGUF conversion script
│   ├── deploy_3b_ollama.sh            # Ollama deployment
│   ├── quality_gate.sh                # Pre-deployment quality gate
│   ├── telegram_notify.py             # Telegram notifications (uses urllib, curl is blocked)
│   └── hourly_status.sh               # Hourly status report
│
├── eval/
│   ├── debate/
│   │   └── justice_league_3b_case.md  # Case for the 3B switch ("Justice League" multi-agent debate)
│   ├── decision/
│   │   └── FINAL_DECISION_REPORT.md   # Ruling on the SFT restart
│   ├── plan/
│   │   └── 3B_MASTER_PLAN.md          # 3B master plan
│   ├── tasks/                         # Modular evaluation tasks
│   │   ├── task_runner.py             # 8-GPU parallel task runner
│   │   ├── ppl_task.py                # Perplexity evaluation task
│   │   ├── lm_eval_task.py            # lm-evaluation-harness wrapper
│   │   ├── calibration_task.py        # Calibration analysis
│   │   ├── generation_task.py         # Generation quality + parameter grid search
│   │   └── token_nll_task.py          # Token NLL distribution analysis
│   ├── outputs/                       # Evaluation results (auto-generated, .gitignore)
│   ├── full_eval_pipeline.py          # v2 comprehensive evaluation pipeline (8-GPU parallel)
│   ├── sft_eval_pipeline.py           # SFT 6-dimension evaluation pipeline
│   ├── reeval_pipeline.py             # Re-evaluation pipeline (0-shot then 5-shot)
│   ├── report_generator.py            # Automatic markdown report generation
│   ├── comprehensive_eval.py          # v1 comprehensive evaluation (legacy)
│   └── test_generation_params.py      # Generation parameter search
│
├── tokenizer/
│   ├── korean_sp/                     # SentencePiece 64K model files
│   ├── tokenizer.json                 # HuggingFace format (2.4MB)
│   ├── train_sp_tokenizer.py          # Tokenizer training script
│   └── convert_sp_to_hf.py            # SentencePiece → HF conversion
│
├── checkpoints/                       # Model checkpoints (large, .gitignore)
│
├── docs/
│   ├── PROJECT_HISTORY.md             # Detailed record of the whole project journey
│   └── 3B_WORKPLAN.md                 # 3B work plan
│
└── reports/
    ├── 2026-03-02_0200_FRANKENSTALLM_phase0_optimization_report.md
    ├── 2026-03-05_3B_BASE_EVALUATION_REPORT.md
    ├── 2026-03-05_3B_SFT_PROGRESS_REPORT.md   # SFT training report (Phase 2)
    ├── 2026-03-05_3B_NEXT_STEPS_REFERENCE.md
    ├── 2026-03-05_NEMOTRON_NANO_FEASIBILITY_STUDY.md
    ├── 2026-03-05_PPL_EVALUATION.md
    ├── 2026-03-05_BENCHMARK_RESULTS.md
    ├── 2026-03-05_GENERATION_QUALITY.md
    ├── 2026-03-06_3B_SFT_EVAL_PLAN.md         # SFT 6-dimension evaluation plan
    ├── 2026-03-06_3B_SFT_EVALUATION_REPORT.md  # SFT 6-dimension evaluation results
    └── 2026-03-06_3B_SFT_COMPLETION_AND_EVAL_SUMMARY.md  # SFT completion + code improvements summary

5. Project journey timeline

This section is the heart of this README. It records honestly not only the results but also why each decision was made and where things failed.


Day 1 (Feb 25) - First spark: 125M FP8 validation

The project started from a small question: does FP8 actually train stably on B200?

TransformerEngine's MXFP8 recipe was applied to a small 125M model to find out. The answer was that it runs stably. Loss converged normally, and VRAM efficiency improved clearly over BF16. This validation was the first green light for the whole pipeline.

The infrastructure setup was also finished the same day: the 8-GPU DDP environment, the NCCL environment variables, the checkpoint storage path, and a first draft of the Telegram notification system.


Day 1~2 (Feb 25~26) - 1B pretraining: 34K steps, PPL 5.67

Right after the 125M validation, pretraining of the 1B model began.

  • Architecture: d_model=2048, 24 layers, GQA 4:1, SwiGLU, RoPE
  • Data: based on C4 Korean
  • Training: 34,000 steps, FP8, 8× B200 DDP

Final results:

  • Loss: 1.904
  • PPL (C4 Korean): 5.67

On numbers alone this looks acceptable. Actual text generation, however, exposed problems: repetitive patterns, awkward sentence structure, and drifting off context. That is expected for a pretrained-only model. SFT was next.


Day 2 (Feb 26) - SFT v1: the disaster called 0.0

SFT was launched. As soon as training started, the loss began to fall quickly. At first this looked like a good sign.

Then the loss reached 0.0.

Val loss was 0.0 as well. The generated output was complete garbage.

The cause was found: a label off-by-one bug. Input tokens and label tokens were misaligned by one position. Instead of predicting the next token, the model was set up to reproduce an answer it could already see. A loss of 0 did not mean "perfect learning"; it was label leakage.
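
The effect of such a misalignment can be shown without any model. The sketch below is only an illustration of one way the bug can arise (labels aligned with the input token itself instead of the next token). It is not the original SFT code, and the token ids are made up.

```python
def make_pairs(tokens, shift=1):
    """Pair each input position with the label `shift` positions ahead."""
    inputs = tokens[: len(tokens) - shift] if shift else tokens[:]
    labels = tokens[shift:]
    return list(zip(inputs, labels))


def copy_accuracy(pairs):
    """Accuracy of a 'model' that simply echoes the token it is looking at."""
    return sum(x == y for x, y in pairs) / len(pairs)


tokens = [11, 42, 7, 99, 2]             # hypothetical token ids, 2 = EOS
correct = make_pairs(tokens, shift=1)   # input t is scored against token t+1
leaky = make_pairs(tokens, shift=0)     # input t is scored against token t
```

With shift=0 the trivial copy strategy is always right, so the loss can collapse to zero while the model learns nothing about the next token. With shift=1 copying never succeeds on this sequence.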

ํ•˜๋ฃจ๋ฅผ ๋‚ ๋ ธ๋‹ค.


Day 3 (Feb 27) - Five bugs, root-cause analysis

To analyze the failure, a 5-agent root-cause analysis was carried out. The conclusion was that there was not just one bug. The whole SFT pipeline had problems.

The five key bugs found:

| Bug | Symptom | Impact |
|---|---|---|
| Static padding (no packing) | Short samples padded to max_len | Wasted GPU, inefficient training |
| EOS token truncation | No EOS at the end of responses | Model never learns where a response ends |
| Single epoch | Data seen only once | Underfitting |
| No validation split | val_loss cannot be measured | Overfitting cannot be detected |
| Data quality | Noise, duplicates, imbalance | Induces repetitive generation patterns |

The EOS truncation bug is particularly subtle. If the model never learns when a response ends, at generation time it keeps repeating the same pattern or appends meaningless tokens. It was one of the causes of the 18% repetition rate.


Day 3 (Feb 27) - SFT v2: a success, but 18% repetition

All five bugs were fixed and SFT v2 was run.

  • val_loss: 2.2062, a reasonable level
  • Repetition rate: 18% (after applying rep_penalty=1.1)

Generation quality improved clearly over v1. Still, an 18% repetition rate is high. Raising rep_penalty reduces repetition but also reduces diversity and makes the output awkward. Decoding parameters alone run into a structural limit.

kobest_copa came in at 0.646. A decent figure, but short of the target.


Day 3 (Feb 27) - "Justice League vs Avengers": the decision to move to 3B

The 18% repetition rate triggered an internal debate. There was one core question:

Can ORPO fix the repetition, or is it necessary to go to 3B?

To answer it, a multi-agent debate was run (code name: "Justice League vs Avengers"). Each agent argued a different position.

Key findings of the debate:

  1. The 18% repetition is a structural limit of 1B parameters. A 1B model does not capture long-range dependencies well enough. Preference alignment such as ORPO helps reduce repetition to some extent, but it does not address the root cause (insufficient parameters).

  2. Scaling-law analysis: based on the Chinchilla law and the experimental data, the estimate was that a 3B model could bring the repetition rate down to 5~8% on the same data.

  3. Cost-benefit analysis: investing in 3B pretraining is better for final model quality than investing in ORPO on the 1B model.

Conclusion: move to 3B. The 1B model is archived and 3B pretraining begins.

This decision is recorded with the full argument in eval/debate/justice_league_3b_case.md.


Day 3 (Feb 27) - Assembling 640GB+ of data

As soon as the move to 3B was decided, the data pipeline was started. Far more data is needed than for 1B (Chinchilla-optimal ratio: 3B parameters × 20 = 60B tokens).

Data assembled in the end:

  • Total tokens: 41.12B tokens (final binary file)
  • Raw data: 640GB+ of multilingual text
  • Sources: C4 Korean, Namuwiki, Wikipedia Korean, korean_extra datasets

After preprocessing (tokenization, shuffling, binary conversion), data/3b_train.bin is 82GB. The validation set data/3b_val.bin is 151MB.


Mar 2 - Phase 0: defeating OOM and optimizing

The first attempt at 3B training ran out of memory (OOM). An OOM for a 3B model on 183GB of VRAM seems odd, but there was a cause.

It was the GQA FlashAttention implementation. The way the GQA (Grouped-Query Attention) code expanded the KV tensors was copying memory unnecessarily. FlashAttention's native GQA support was not being used properly.

Optimizations carried out in Phase 0:

| Optimization | Method | Effect |
|---|---|---|
| GQA FA native | Use the native GQA path of flash_attn_varlen_func | VRAM 60.4GB → 48.3GB (-20%) |
| DDP optimization | gradient_as_bucket_view=True | GPU-CPU synchronization overhead -87.5% |
| NCCL NVLS | Ring+Tree topology, NVLS enabled | Better AllReduce efficiency |
| Batch size analysis | Identified GPUs 2, 4, 6 as NCCL relay nodes | bs=5 judged optimal, bs=6 risky |
| SIGHUP defense | nohup+setsid + Python signal handler + emergency ckpt | Triple protection |
| Monitoring | Telegram Bot (B200Bot) + cron | 10-minute watchdog, hourly status report |
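
The "Python signal handler + emergency ckpt" layer of the SIGHUP defense can be sketched as follows. This is a minimal illustration under my own assumptions, not the code in train/. The handler only sets a flag, and the loop writes the checkpoint at a step boundary. train_loop and save_emergency_ckpt are hypothetical names.

```python
import os
import signal

stop_requested = False


def _on_hangup(signum, frame):
    # Only set a flag here; the training loop saves at a safe point.
    global stop_requested
    stop_requested = True


# SIGHUP is sent when the controlling terminal goes away (POSIX only).
signal.signal(signal.SIGHUP, _on_hangup)


def train_loop(max_steps, save_emergency_ckpt):
    """Run up to max_steps; on SIGHUP, checkpoint once and exit cleanly."""
    for step in range(max_steps):
        # ... one optimizer step would run here ...
        if stop_requested:
            save_emergency_ckpt(step)
            return step
    return max_steps
```

Keeping the handler trivial avoids doing file I/O or collective communication from inside a signal context. nohup and setsid then serve as the outer layers that stop most hangups from reaching the process at all.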

torch.compile test: no effect (1.00x). TransformerEngine's opaque kernels cause graph breaks, and /tmp is mounted with the noexec flag, so the compiled kernel cache could not be used. The time was effectively wasted, but confirming "no effect" by measurement is a result in itself.

Why bs=5: in the NCCL ring topology, GPUs 2, 4, and 6 act as relay nodes. These GPUs use about 11GB more than the others. At bs=5 there is headroom, but at bs=6 the relay GPUs get too close to the 183GB limit. bs=5 is kept as a safety margin.


Mar 2~Mar 5 - Phase 1: 3B pretraining completed

Phase 1 started once the Phase 0 optimizations were finished.

Early metrics (step 3150):

  • Loss: 2.38
  • Throughput: 36K tok/s per rank
  • Whole system: ~292K tok/s (8 GPUs)
  • MFU: ~33.5%

An MFU of 33.5% may look low at first. It is, however, the figure obtained with TE MXFP8 already optimized, measured as effective utilization against the theoretical peak (18,000 TFLOPS). Remaining optimization headroom: QKV fusion (+8~12%), NUMA affinity (+4~9%), and FA2 native RoPE (+3~5%).
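
For orientation, the MFU figure can be approximated from the throughput above. The sketch uses a common estimate of training FLOPs per token (6N for the dense weights plus 12·L·d·T for attention). The README does not state which formula was used for the reported 33.5%, so this is an assumption and the result is only expected to land in the same range.

```python
def train_flops_per_token(n_params, n_layers, d_model, seq_len):
    """Approximate training FLOPs per token: 6N plus the attention term."""
    return 6 * n_params + 12 * n_layers * d_model * seq_len


def mfu(tokens_per_sec, flops_per_token, peak_flops):
    """Model FLOPs utilization: achieved FLOP/s divided by peak FLOP/s."""
    return tokens_per_sec * flops_per_token / peak_flops


# 3B config from this README: ~3.0B params, 28 layers, d_model 3072, seq 2048.
flops_tok = train_flops_per_token(3.0e9, 28, 3072, 2048)
# ~292K tok/s system-wide against the 18,000 TFLOPS FP8 peak of 8 GPUs.
estimate = mfu(292_000, flops_tok, 18_000e12)
```

Under these assumptions the estimate comes out at roughly a third of peak, consistent in magnitude with the reported value.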

Phase 1 completed (2026-03-05):

  • 57,000 steps completed, final loss 1.466
  • 41.12B tokens processed, about 63 hours of total training time
  • Completed without incident (no SIGHUP, OOM, or NCCL problems)

Comprehensive evaluation summary (reflecting the v2 re-evaluation):

| Item | Result |
|---|---|
| PPL (combined validation set) | 5.2263 (initial v1 evaluation: 5.709) |
| PPL (C4 Korean) | 5.717 |
| KoBEST average (5 tasks) | 43.69% |
| MMLU-KO average (6 categories) | 22.75% |
| HAE-RAE | 19.71% |
| winogrande / piqa | 50.59% / 52.50% |
| Calibration Top-1 | 68.75% |
| Greedy 3-gram repetition rate | 60.99% (expected to improve after SFT) |
| Best generation parameters | temp=0.7, rep_penalty=1.3 → repetition rate 0% |

Decision to proceed with SFT: a loss of 1.466 signals a healthy end of training. PPL, repetition rate, and benchmarks are all areas that SFT is meant to address. There are no signs of a problem in the model structure. → Proceed to Phase 2 SFT.


Mar 5~ - Phase 2: 3B SFT begins (2.44M samples, val_loss 1.956)

Right after Phase 1, a large SFT dataset was prepared and training started.

Data pipeline:

  • 6.59M raw samples collected from 24 sources
  • prepare_sft_combined.sh: format unification (6 formats → messages), MD5 deduplication, 98:2 split
  • filter_sft_v2.py: 5-stage quality filter (EOS strip, QA marker removal, length filter, 4-gram repetition filter)
  • Final: 2,439,397 train + 49,801 val (7.48 GB)
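
The deduplication and split step can be pictured roughly as below. This is a sketch under stated assumptions, not the contents of prepare_sft_combined.sh. It hashes the canonical JSON of each sample with MD5 and sends every 50th unique sample to validation to approximate the 98:2 ratio. What the real script hashes and how it splits are not specified in this README.

```python
import hashlib
import json


def dedup_and_split(samples, val_every=50):
    """Drop exact duplicates by MD5 of canonical JSON, then split about 98:2."""
    seen, train, val = set(), [], []
    for sample in samples:
        canonical = json.dumps(sample, sort_keys=True, ensure_ascii=False)
        key = hashlib.md5(canonical.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # exact duplicate of an earlier sample
        seen.add(key)
        # Every 50th unique sample goes to validation (2%).
        (val if len(seen) % val_every == 0 else train).append(sample)
    return train, val


# Hypothetical input: 200 records in the unified messages format, 100 unique.
raw = [{"messages": [{"role": "user", "content": f"q{i % 100}"}]} for i in range(200)]
train, val = dedup_and_split(raw)
```

A deterministic split like this keeps the validation set stable across reruns. A random split with a fixed seed would serve the same purpose.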

The data mix was balanced across reasoning/CoT (38%), Korean instruction (22.5%), English general-purpose (16%), math (12%), and dialogue/code (11.5%). This is a 15× scale-up from the 161K samples used for 1B SFT.

SFT design, reflecting the lessons learned from the 1B failures:

| 1B lesson | Applied in 3B SFT |
|---|---|
| Label off-by-one → loss=0 | Loss masking verified (prompt=-1, train on the response only) |
| EOS truncation → cannot terminate | Chat template `<\|user\|>...<\|assistant\|>...</s>` with EOS included |
| Static padding → wasted GPU | Dynamic padding (aligned to 64 tokens) |
| No validation → overfitting undetected | 49,801 val samples, eval every 500 steps |
| Data noise | 5-stage quality filter (absent for 1B) |
| 18% repetition rate | NEFTune alpha=5.0 added (embedding noise injection) |

Training configuration:

  • LR: 1e-5 (1/15 of the pretraining LR, to prevent catastrophic forgetting)
  • Effective batch: 2 × 8 GPUs × 4 accum = 64 sequences
  • 33,000 steps (~3.3 epochs)
  • MXFP8, gradient checkpointing, NCCL Ring+Tree

Early results (step 2,000, 6%):

  • Val loss: 2.073 → 2.004 → 1.975 → 1.956 (monotonically decreasing)
  • Train-val gap ~0.1 (no sign of overfitting)
  • VRAM 24.2 GB (13.2%): half of pretraining, very stable
  • Grad norm steady at 1.0 (learning rate is appropriate)

Detailed report: reports/2026-03-05_3B_SFT_PROGRESS_REPORT.md


Mar 6 - Phase 2 completed: SFT early stopping (val_loss 1.8851)

SFT ended by early stopping at step 25,500 of 33,000. Val loss reached 1.8851 at step 23,000, and training stopped automatically after five consecutive evaluations without improvement.

Total training time: ~15 h 41 min (2026-03-05 22:15 ~ 2026-03-06 13:56)

This result is consistent with the cosine decay of LR 1e-5 having effectively converged to 0 after step 20K. The model learned as much as it could under the given LR schedule.


Mar 6 - SFT 6-dimension comprehensive evaluation: 4/6 PASS → ORPO decision

A 6-dimension comprehensive evaluation was run on the SFT checkpoint (checkpoint-best, step 23000). It took 49 min 27 s.

Key results:

  • Perplexity: forgetting 0.9% (PASS on all 19 datasets), excellent knowledge retention
  • Repetition rate: greedy 72.97% (worse than Base at 60.99%), FAIL
  • EOS termination rate: 0% → 60%, improved but below the target (90%)
  • KoBEST: 43.26% (almost identical to Base at 43.69%), FAIL
  • MMLU-KO: 22.75% → 26.00% (+3.2pp), partial improvement
  • Calibration: Top-1 68.59%, PASS

Decision: a greedy repetition rate of 72.97% cannot be solved by SFT alone. However, rep_penalty=1.2 brings the repetition rate to 0%, so internalizing that behavior through ORPO (preference alignment) is the right path.


Mar 6 - Code improvements and ORPO preparation

In parallel with the SFT evaluation, a number of code improvements and the Phase 3 preparation were completed:

| Change | Description | Impact |
|---|---|---|
| train/sft.py (+238 lines) | MixingDataLoader (SFT+pretrain interleaving), tokenization on DDP rank 0 | Prevents forgetting, 8× memory saving |
| train/trainer.py (+17 lines) | DDP early-stopping broadcast (prevents hangs), patience 5→10 | DDP stability |
| train/orpo.py (+30 lines) | YAML config support, 3B defaults | ORPO ready to run |
| eval/report_generator.py (+831 lines) | Automatic Base vs SFT comparison report | Evaluation automation |
| eval/sft_eval_pipeline.py (new) | SFT 6-dimension evaluation pipeline | Comprehensive evaluation |
| eval/tasks/generation_task.py (+75 lines) | Chat template, diversity metrics | SFT evaluation |
| configs/korean_3b_sft_v2.yaml (new) | SFT v2 config (lr=5e-5, data mixing 70/30) | Fallback path |
| configs/korean_3b_orpo.yaml (new) | ORPO config (lr=5e-6, beta=0.1) | Phase 3 |

Details: reports/2026-03-06_3B_SFT_COMPLETION_AND_EVAL_SUMMARY.md


6. Model architecture

1B (archived)

| Item | Value |
|---|---|
| vocab_size | 64,000 |
| d_model | 2,048 |
| n_layers | 24 |
| n_heads | 16 |
| n_kv_heads | 4 (GQA 4:1) |
| d_ffn | 5,461 (SwiGLU) |
| Parameters | ~1.19B |
| context | 2,048 |
| rope_theta | 500,000 |

3B (current)

| Item | Value |
|---|---|
| vocab_size | 64,000 |
| d_model | 3,072 |
| n_layers | 28 |
| n_heads | 24 |
| n_kv_heads | 8 (GQA 3:1) |
| d_ffn | 8,192 (SwiGLU) |
| Parameters | ~3.0B |
| context | 2,048 |
| rope_theta | 500,000 |
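
The parameter totals in both tables can be cross-checked from the listed dimensions. The sketch below assumes bias-free linear layers, a three-matrix SwiGLU FFN, two RMSNorm weights per layer, and tied input/output embeddings. These are my assumptions rather than facts read from model/transformer.py, but they do reproduce the stated ~1.19B and ~3.0B.

```python
def param_count(vocab, d_model, n_layers, n_heads, n_kv_heads, d_ffn,
                tied_embeddings=True):
    """Rough parameter count for a GQA + SwiGLU decoder-only transformer."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim   # Wq, Wo and Wk, Wv
    ffn = 3 * d_model * d_ffn                              # gate, up, down
    norms = 2 * d_model                                    # two RMSNorms per layer
    embed = vocab * d_model * (1 if tied_embeddings else 2)
    return n_layers * (attn + ffn + norms) + embed + d_model  # + final norm


p1b = param_count(64_000, 2048, 24, 16, 4, 5461)
p3b = param_count(64_000, 3072, 28, 24, 8, 8192)
```

With untied embeddings the 3B figure would rise by another vocab × d_model (about 0.2B), so the tied assumption matters for matching the table.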

Common design principles

| Component | Choice | Rationale |
|---|---|---|
| Normalization | Pre-norm RMSNorm | More stable to train than post-norm |
| Activation | SwiGLU FFN | Proven choice in the Llama family |
| Positional encoding | RoPE (θ=500K) | Leaves room for long-context extension |
| Attention | GQA (Grouped-Query Attention) | Reduces KV cache memory |
| Implementation | FlashAttention-2 | IO-aware, VRAM efficient |
| Precision | FP8 (MXFP8 via TransformerEngine) | Best use of B200 |

Rationale for the GQA ratio

The 1B model uses GQA 4:1 (16 heads, 4 kv_heads) and the 3B model uses GQA 3:1 (24 heads, 8 kv_heads). The ratio was relaxed slightly for 3B because, as the parameter count grows, giving up some attention quality was judged to be a net loss at the 3B scale. Mistral 7B and Llama 3 8B, which both use 8 KV heads (32 query heads, a 4:1 ratio), served as references.
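
To make the memory side of this trade-off concrete, the sketch below estimates the per-sequence KV cache for the 3B configuration with 8 KV heads versus a hypothetical full multi-head layout with 24. It assumes 2 bytes per element (BF16/FP16 cache) and the 2,048-token context. Both are illustration parameters, not measured values.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size for one sequence; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


head_dim = 3072 // 24                            # 128 for the 3B config
gqa = kv_cache_bytes(28, 8, head_dim, 2048)      # GQA 3:1 as configured
mha = kv_cache_bytes(28, 24, head_dim, 2048)     # hypothetical full MHA
gqa_mib = gqa / 2**20
```

The cache scales linearly with n_kv_heads, so the 3:1 ratio cuts it to one third of the multi-head size, and the 4:1 ratio of the 1B model cuts it to one quarter.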

What rope_theta=500,000 means

Raising θ from the standard RoPE value of 10,000 to 500,000 is meant to reduce frequency interference at long context lengths. Code Llama and Llama 3, among others, also raise the RoPE base in this way. With max_seq_len=2048 today, the benefit is not visible yet, but it lays the groundwork for later context-extension fine-tuning.


7. Training data

7.1 Tokenizer

| Item | Value |
|---|---|
| Type | SentencePiece Unigram |
| Vocabulary size | 64,000 |
| Korean character coverage | 99.95% |
| Location | tokenizer/korean_sp/ |
| HF format | tokenizer/tokenizer.json (2.4MB) |

A 64K vocabulary is a balance between 32K (too small, severe fragmentation of Korean subwords) and 128K (too large, more embedding-layer overhead). Llama 3 (128K) and GPT-4 (100K) show the trend toward larger vocabularies, but in a 3B model a 128K vocabulary makes the embedding layer alone take up too large a share of the parameters.

7.2 Pretraining data: overall composition

Final training files: data/3b_train.bin (77GB, ~38.5B tokens) + data/3b_val.bin (145MB)

By the Chinchilla law, 3B × 20 = 60B tokens would be optimal. The current ~38.5B tokens are consumed over 57,000 steps (batch 5 × accum 8 × seq 2048 × 8 GPUs), which is a reasonable range for a first 3B training run.
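
The token budget implied by that step configuration is plain arithmetic, shown here as a small worked example. The variable names are illustrative. The 20 tokens-per-parameter rule is the Chinchilla heuristic quoted above.

```python
micro_batch, grad_accum, seq_len, n_gpus = 5, 8, 2048, 8
steps = 57_000

# Tokens consumed by one optimizer step across all GPUs.
tokens_per_step = micro_batch * grad_accum * seq_len * n_gpus
# Tokens consumed over the whole run.
tokens_seen = tokens_per_step * steps

chinchilla_target = 20 * 3.0e9               # 60B tokens for a 3B model
fraction_of_target = tokens_seen / chinchilla_target
```

This gives 655,360 tokens per step and about 37.4B tokens over 57,000 steps, or roughly 62% of the 60B Chinchilla-style target.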

Korean: web crawl

| Dataset | HuggingFace ID | Tokenized file | Size | Est. tokens | Description |
|---|---|---|---|---|---|
| C4 Korean | allenai/c4 (ko subset) | korean_c4_train.bin | 15GB | ~7.5B | Korean-filtered Google C4, large-scale clean web text |
| CC-100 Korean | cc100 (ko subset) | cc100_ko_train.bin | 4.3GB | ~2.15B | Monolingual corpus based on Common Crawl |
| HPLT Korean | HPLT/hplt_monolingual_v2 (ko) | hplt_ko_train.bin | 15GB | ~7.5B | High Performance Language Technologies web data |

Korean: encyclopedia

| Dataset | HuggingFace ID | Tokenized file | Size | Est. tokens | Description |
|---|---|---|---|---|---|
| Korean Wikipedia | wikimedia/wikipedia (20231101.ko) | wikipedia_ko_train.bin | 566MB | ~283M | Full Korean Wikipedia, structured written style |
| Korean Wikipedia (v2) | wikimedia/wikipedia (ko) | korean_wiki_train.bin | 500MB | ~250M | Separate Wikipedia version |
| Namuwiki | heegyu/namuwiki-extracted | korean_namuwiki_train.bin | 2.1GB | ~1.05B | Namuwiki extract, rich in subculture and current affairs |
| Namuwiki 2023b | heegyu/namuwiki-extracted (2023b) | namuwiki_2023b_train.bin | 2.5GB | ~1.25B | 2023 updated snapshot |

English/multilingual: educational

| Dataset | HuggingFace ID | Tokenized file | Size | Est. tokens | Description |
|---|---|---|---|---|---|
| Cosmopedia Stories | HuggingFaceTB/cosmopedia | cosmo_stories_train.bin | 5.9GB | ~2.95B | Synthetic educational stories |
| Cosmopedia Web v2 | HuggingFaceTB/cosmopedia | cosmo_web_v2_train.bin | 2.7GB | ~1.35B | Web-based educational text |
| Cosmopedia Stanford | HuggingFaceTB/cosmopedia | cosmo_stanford_train.bin | 2.1GB | ~1.05B | Based on Stanford lectures |
| Cosmopedia WikiHow | HuggingFaceTB/cosmopedia | cosmo_wikihow_train.bin | 382MB | ~191M | WikiHow guides |
| Cosmopedia OpenStax | HuggingFaceTB/cosmopedia | cosmo_openstax_train.bin | 224MB | ~112M | Open textbooks |
| Cosmopedia Khan Academy | HuggingFaceTB/cosmopedia | cosmo_khanacademy_train.bin | 46MB | ~23M | Khan Academy |

English/multilingual: math & science

| Dataset | HuggingFace ID | Tokenized file | Size | Est. tokens | Description |
|---|---|---|---|---|---|
| Open Web Math | open-web-math/open-web-math | open_web_math_train.bin | 4.8GB | ~2.4B | Math text extracted from the web |
| MathPile | GAIR/MathPile | mathpile_train.bin | 2.9GB | ~1.45B | Math textbooks, papers, forums |
| Cosmopedia AutoMath | HuggingFaceTB/cosmopedia | cosmo_auto_math_text_train.bin | 2.5GB | ~1.25B | Synthetic math problems and solutions |

Korean: mixed (legacy merged)

| Dataset | Tokenized file | Size | Est. tokens | Description |
|---|---|---|---|---|
| Initial mix (C4 + Namuwiki + Wikipedia) | korean_train.bin | 17GB | ~8.5B | Original mixed data used for 1B training |
| 125M validation | train.bin | 1.2GB | ~600M | Used for the first FP8 validation |

Collected but unused data (korean_extra/, 640GB+)

Large-scale raw data collected into 39 subdirectories under data/korean_extra/, of which only part has been tokenized and merged:

| Category | Dataset | Description | Notes |
|---|---|---|---|
| Web crawl | CulturaX Korean | Korean portion of a large multilingual web corpus | ~50B+ tokens |
| Web crawl | FineWeb2 Educational Korean | Web data filtered for educational quality | 234GB raw |
| Web crawl | Korean Web Collection | KORMo web collection | 175GB raw |
| Web crawl | OSCAR Korean | Korean portion of a multilingual web corpus | |
| Education | Korean Textbooks | Korean textbook text | 45 subcategories |
| Education | FinePDFs Educational Korean | PDF-based educational material | |
| Law | Korean Law | Korean legal text | 15GB |
| News | Korean News Archive | Korean news archive | |
| Public corpus | Korean Public Corpus | KORMo public corpus | 26GB |
| Code | Code Pretrain | Programming code | |
| Academic | Academic Pretrain | Academic papers and reports | |
| General | SlimPajama | Slimmed-down version of RedPajama | |

This data is planned for use in the Extended Pretrain stage (80-100B tokens).

Pretraining data share by domain

┌────────────────────────────────────────────────────────────┐
│          3b_train.bin token composition (~38.5B)           │
├────────────────────────────────────────────────────────────┤
│ ████████████████████░░░░░░░░░░  Korean web crawl    44.7%  │
│ ██████████░░░░░░░░░░░░░░░░░░░░  Legacy merged       22.1%  │
│ ██████░░░░░░░░░░░░░░░░░░░░░░░░  Education (EN)      14.7%  │
│ █████░░░░░░░░░░░░░░░░░░░░░░░░░  Math & science      13.2%  │
│ ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░  Encyclopedia (KO)    5.3%  │
└────────────────────────────────────────────────────────────┘

7.3 SFT data: 2.44M samples (currently in training)

6.59M raw samples from 24 sources → merge and deduplication → quality filtering → 2,439,397 train + 49,801 val

Main SFT sources (top 12, 96% of the total)

| # | Dataset | Samples | Size | Domain |
|---|---|---|---|---|
| 1 | reasoning_r1_1.4m | 1,400,000 | 14.77 GB | Reasoning (CoT) |
| 2 | openhermes_2.5 | 1,001,551 | 1.82 GB | English general-purpose |
| 3 | AI-MO_NuminaMath-CoT | 859,494 | 2.51 GB | Math CoT |
| 4 | korean_instruction_mix | 515,911 | 1.39 GB | Korean mixed |
| 5 | lemon-mint_smol-koreantalk | 460,281 | 5.23 GB | Korean dialogue |
| 6 | open_korean_instructions | 375,159 | 0.73 GB | Korean instruction |
| 7 | magpie_reasoning_v2 | 249,922 | 3.99 GB | Reasoning (English) |
| 8 | magpie_reasoning_ko | 224,929 | 3.19 GB | Reasoning (Korean) |
| 9 | ultrachat_200k | 207,865 | 1.34 GB | Dialogue |
| 10 | kuotient_orca-math-ko | 193,789 | 0.61 GB | Math (Korean) |
| 11 | data/sft/train.jsonl (original) | 161,848 | 0.27 GB | Original SFT |
| 12 | kullm_v2 | 152,630 | 0.42 GB | Korean instruction |

The other 12 sources: DeepMath-103K, Evol-Instruct-Code-80k-ko, ShareGPT-74k-ko, evol-instruct-korean, alpaca-gpt4-korean, ko_wikidata_QA, Ko.WizardLM, KOR-OpenOrca-Platypus-v3, korean-writing-style-instruct, ko_lima, koalpaca_v1_1a, OpenAssistant_oasst1_ko

๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ

24๊ฐœ ์†Œ์Šค (6.59M raw)
    โ†“ prepare_sft_combined.sh (ํฌ๋งท ํ†ต์ผ, MD5 ์ค‘๋ณต ์ œ๊ฑฐ, 98:2 split)
ํ†ตํ•ฉ: 2,559,492 train + 52,234 val (7.95 GB)
    โ†“ filter_sft_v2.py (5๋‹จ๊ณ„: EOS strip, QA marker ์ œ๊ฑฐ, ๊ธธ์ด 50~20K, 4-gram ๋ฐ˜๋ณต >30% ์ œ๊ฑฐ)
์ตœ์ข…: 2,439,397 train + 49,801 val (7.63 GB)  โ† ์ œ๊ฑฐ์œจ 4.69%
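
The merge step above (MD5 deduplication followed by a 98:2 split) can be sketched as follows. This is a minimal illustration, not the contents of prepare_sft_combined.sh: the `prompt`/`response` field names, the shuffle-based split, and `seed=42` are all assumptions made for the example. `removal_rate` simply reproduces the 4.69% figure from the train counts.

```python
import hashlib
import random


def dedupe_md5(samples):
    """Keep the first occurrence of each (prompt, response) pair, keyed by MD5."""
    seen, unique = set(), []
    for sample in samples:
        # Field names are hypothetical; adapt them to the real JSONL schema.
        text = sample["prompt"] + "\x00" + sample["response"]
        key = hashlib.md5(text.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


def split_train_val(samples, val_ratio=0.02, seed=42):
    """Shuffle deterministically, then hold out val_ratio of the data (98:2 by default)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]


def removal_rate(before, after):
    """Fraction of samples dropped by a filtering stage."""
    return (before - after) / before
```

For example, `removal_rate(2_559_492, 2_439_397)` gives about 0.0469, matching the removal rate quoted in the pipeline.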

Domain share

Reasoning/CoT          38.0%  ████████████████████████
Korean instruction     22.5%  ██████████████
English general        16.0%  ██████████
Math                   12.0%  ████████
Dialogue/code/other    11.5%  ███████

7.4 Preference data (for ORPO): 795K pairs

795,468 preference pairs in total (7.9GB, data/preference/combined_preference.jsonl)

| HuggingFace ID | Size | Domain | Format |
|---|---|---|---|
| nayohan/preference-collection-ko-full | 4.9GB | General preference evaluation | instruction + response_A/B + preference |
| heegyu/orca-math-korean-preference-cleaned | 1.6GB | Math reasoning | prompt + chosen + rejected |
| kuotient/orca-math-korean-dpo-pairs | 750MB | Math DPO | prompt + chosen + rejected |
| maywell/ko_Ultrafeedback_binarized | 394MB | Feedback-based alignment | prompt + winning/losing response |
| tellang/yeji-preference-ko-v1 | 171MB | General preference | prompt + chosen + rejected |
| jojo0217/korean_rlhf_dataset | 137MB | RLHF pairs | prompt + chosen + rejected |
| lemon-mint/korean-realqa-reasoning-v01-preference | 58MB | QA reasoning | prompt + chosen + rejected |

Filtering criteria: minimum length of 20 characters, EOS removal, and format normalization before merging.

ORPO runs in Phase 3 only if the repetition rate exceeds 5%. If the 3B model resolves the 1B model's structural repetition problem on its own, it can ship without ORPO.

7.5 Data pipeline summary

[HuggingFace / web collection]
        │
        ▼
┌─── Raw collection ───────────────────────────────────────────────────┐
│  korean_extra/ (39 directories, 640GB+)                              │
│  sft_extra/ (27 directories, 1.08M samples)                          │
│  preference/ (7 JSONL files, 795K pairs)                             │
└──────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─── Tokenization (SentencePiece 64K) ─────────────────────────────────┐
│  tokenize_extra.py: format auto-detection (Arrow/Parquet/JSONL)      │
│  8 parallel workers, uint16 memmap (.bin) output                     │
└──────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─── Final merge ──────────────────────────────────────────────────────┐
│  Pretrain: 3b_train.bin (77GB, ~38.5B tokens)                        │
│  SFT:      sft_combined/train_filtered.jsonl (7.48GB, 2.44M samples) │
│  ORPO:     preference/combined_preference.jsonl (7.9GB)              │
└──────────────────────────────────────────────────────────────────────┘

8. Training configuration and optimization

Current training configuration (configs/korean_3b_fp8.yaml)

model:
  vocab_size: 64000
  d_model: 3072
  n_layers: 28
  n_heads: 24
  n_kv_heads: 8
  d_ffn: 8192
  max_seq_len: 2048
  rope_theta: 500000.0

training:
  batch_size: 5
  gradient_accumulation_steps: 8
  learning_rate: 1.5e-4
  min_lr: 1.5e-5
  warmup_steps: 2000
  max_steps: 57000
  weight_decay: 0.1
  grad_clip: 1.0
  optimizer: adamw
  scheduler: cosine

fp8:
  enabled: true
  recipe: "mxfp8"
  use_transformer_engine: true

distributed:
  strategy: ddp
  gradient_as_bucket_view: true
  find_unused_parameters: false

nccl:
  timeout_seconds: 7200
  nvls_enabled: true

Effective batch size = batch_size(5) × grad_accum(8) × num_gpus(8) = 320

LR schedule: 2000 warmup steps → cosine decay → min_lr=1.5e-5 (10% of max_lr)
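
The schedule and the batch arithmetic can be written out as a short sketch. It uses the values from the YAML above; the linear shape of the warmup is an assumption (the config only gives `warmup_steps`), and `lr_at` is a hypothetical helper name, not a function from the training code.

```python
import math


def lr_at(step, max_lr=1.5e-4, min_lr=1.5e-5, warmup_steps=2000, max_steps=57000):
    """Learning rate at a given step: linear warmup (assumed), then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0 over the decay phase
    return min_lr + (max_lr - min_lr) * cosine


# Effective batch size in sequences per optimizer step.
effective_batch = 5 * 8 * 8  # batch_size x grad_accum x num_gpus
tokens_per_step = effective_batch * 2048  # max_seq_len from the config
```

With these settings each optimizer step consumes 320 sequences of up to 2048 tokens.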

Optimization lessons learned in Phase 0

GQA FlashAttention Native

This optimization produced the largest VRAM saving. The key point is that FlashAttention supports GQA natively. Expanding the KV heads so that attention runs like MHA triggers a memory copy, whereas the native path handles the grouping internally.

from flash_attn import flash_attn_func

# Before (inefficient): expand KV heads so attention runs like plain MHA.
# With the [B, S, H, D] layout used by flash_attn_func, the head axis is dim=2.
k = k.repeat_interleave(n_heads // n_kv_heads, dim=2)
v = v.repeat_interleave(n_heads // n_kv_heads, dim=2)
out = flash_attn_func(q, k, v)

# After (native GQA): flash_attn handles the head grouping internally
out = flash_attn_func(q, k, v)  # q: [B, S, H, D], k/v: [B, S, Hkv, D]
# VRAM 60.4GB → 48.3GB (-20%)

DDP optimization

import torch

# gradient_as_bucket_view=True: map gradient tensors directly as views into bucket memory
# → removes unnecessary memory copies, GPU-CPU synchronization overhead -87.5%
model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    gradient_as_bucket_view=True,
    find_unused_parameters=False,  # every parameter is used
)

Note: static_graph=True is not used. TransformerEngine's te.Linear requires a dynamic graph in some cases, and enabling static_graph causes a runtime error.

NCCL NVLS

export NCCL_ALGO=NVLSTree    # enable NVLink SHARP (NVLS)
export NCCL_PROTO=Simple
export NCCL_P2P_DISABLE=0
export NCCL_TIMEOUT=7200     # timeout headroom for long backward passes

Because NVSwitch supports all-to-all traffic in a single hop, NVLSTree is more efficient than a ring topology.

Triple defense against SIGHUP

In long training runs, a dropped session (SIGHUP) is fatal. Three layers of protection were put in place:

# Layer 1 (shell): nohup + setsid (new session group)
nohup setsid torchrun --nproc_per_node=8 train/pretrain.py ... &

# Layer 2 (Python): signal handler that ignores SIGHUP at the Python level
import signal
import sys

signal.signal(signal.SIGHUP, signal.SIG_IGN)

# Layer 3 (Python): emergency checkpoint, saved on SIGTERM as well
def emergency_save(signum, frame):
    save_checkpoint(model, optimizer, step, "emergency")
    sys.exit(0)

signal.signal(signal.SIGTERM, emergency_save)

torch.compile: test result, no effect

torch.compile was applied in the hope of a speedup, but the measured result was 1.00x (no effect). Two reasons:

  1. TransformerEngine kernels are opaque, so graph breaks occur. torch.compile optimizes the Python operation graph, and the TE kernels sit outside that graph.
  2. The /tmp directory is mounted with the noexec flag, so compiled kernels cannot be cached.

Lesson: "understand first why it should help" matters more than "just try it".

Monitoring system

Telegram alert system
├── B200Bot (token configured)
├── training_watchdog.sh → cron, every 10 minutes
│   └── detects loss anomalies and process exit → immediate alert
└── hourly_status.sh → cron, every hour
    └── step, loss, speed, VRAM, eta → periodic report

# curl is blocked on this host, so urllib is used instead
import json
import os
import urllib.request

# Assumed environment variable names; the source only says the token is configured.
TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]
CHAT_ID = os.environ["TELEGRAM_CHAT_ID"]

def send_telegram(message):
    url = f"https://api.telegram.org/bot{TOKEN}/sendMessage"
    data = json.dumps({"chat_id": CHAT_ID, "text": message}).encode()
    req = urllib.request.Request(url, data=data,
                                  headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)

9. Experiment results: 1B baseline

The 1B results are recorded honestly, successes and failures alike.

Pretraining results

| Metric | Value |
|---|---|
| Final loss | 1.904 |
| PPL (C4 Korean) | 5.67 |
| Training steps | 34,000 |
| Training time | ~2 days |

SFT v1 results: failure

| Metric | Value |
|---|---|
| val_loss | 0.0 (abnormal) |
| Cause | label off-by-one bug (data leakage) |
| Conclusion | Discarded entirely |

SFT v2 results: partial success

| Metric | Value |
|---|---|
| val_loss | 2.2062 |
| Repetition rate | 18% (with rep_penalty=1.1) |
| kobest_copa | 0.646 |
| Conclusion | Functional, but with structural limits |

3B targets (projected from scaling laws)

| Benchmark | 1B (current) | 3B target |
|---|---|---|
| kobest_copa | 0.646 | >0.72 |
| kobest_hellaswag | ~0.42 | >0.52 |
| Repetition rate | 18% | <5% |
| PPL (C4 Korean) | 5.67 | <4.5 |

Scaling from 1B to 3B is not just a matter of adding parameters. The repetition rate drops structurally only if the model can retain longer context and learn more varied patterns. The 3B targets are projections informed by the Chinchilla scaling curves and by benchmarks of similarly sized models.


10. Experiment results: 3B Base comprehensive evaluation (v2)

A comprehensive evaluation run on checkpoint-0057000 after 3B pretraining finished. The v2 re-evaluation uses an 8-GPU parallel pipeline and covers 13+ benchmarks, a 0/5-shot comparison, calibration, and a comparison with reference models. Total time: 256.6 s.

v1 → v2 changes: v1 (the initial evaluation) measured only PPL on 3 datasets plus 2 benchmarks (belebele, MMLU). v2 covers PPL on 19 datasets, 5 KoBEST tasks, all of HAE-RAE, 6 MMLU-KO categories, 61 MMLU-EN subjects, 5 major English benchmarks, calibration, a 0/5-shot comparison, and a 12-combination parameter grid search.

10.1 Learning curve

| Step | Loss | LR | Notes |
|---|---|---|---|
| 10 | 11.657 | 1.50e-06 | Start (warmup begins) |
| 500 | 5.047 | 7.50e-05 | Warmup in progress |
| 2,000 | 2.851 | 3.00e-04 | Warmup done, peak LR |
| 10,000 | 2.057 | 2.86e-04 | Steady decline |
| 30,000 | 1.789 | 1.61e-04 | Midpoint, entering epoch 1 |
| 57,000 | 1.466 | 3.00e-05 | Final (cosine min) |

Throughput was stable at 36~38K tok/s throughout. Total training time was about 63 hours.

Base model backup

| Item | Value |
|---|---|
| Original checkpoint | checkpoints/korean_3b_fp8_run1/checkpoint-0057000/ (34GB) |
| Backup | checkpoints/korean_3b_fp8_run1/checkpoint-0057000_BASE_BACKUP/ |
| MD5 verification | 4f493d7bcc843727d32453bb3a4e6b7d (match confirmed) |
| HF conversion | eval/outputs/hf_3b_base/ (11GB safetensors) |

10.2 PPL (Perplexity): 19 datasets

Headline PPL (combined 3b_val): 5.2263 (initial v1 evaluation: 5.709)

| Dataset | PPL | Bits/token | Eval tokens | Time |
|---|---|---|---|---|
| korean_namuwiki | 25.88 | 4.694 | 6.5M | 63.7s |
| cc100_ko | 21.78 | 4.445 | 13.6M | 133.2s |
| namuwiki_2023b | 18.92 | 4.242 | 7.7M | 75.1s |
| val | 18.30 | 4.194 | 9.1M | 89.4s |
| korean_wiki | 11.84 | 3.565 | 1.6M | 15.5s |
| wikipedia_ko | 10.71 | 3.420 | 1.8M | 17.4s |
| korean | 7.02 | 2.811 | 53.5M | 521.6s |
| open_web_math | 6.93 | 2.792 | 15.7M | 153.5s |
| korean_c4 | 5.72 | 2.515 | 45.4M | 443.1s |
| 3b (combined) | 5.23 | 2.386 | 226.9M | 2227.3s |
| cosmo_web_v2 | 4.17 | 2.059 | 8.6M | 84.6s |
| cosmo_stories | 3.96 | 1.984 | 18.9M | 185.2s |
| cosmo_openstax | 3.87 | 1.951 | 0.7M | 7.2s |
| cosmo_stanford | 3.36 | 1.750 | 6.6M | 65.3s |
| cosmo_wikihow | 3.31 | 1.727 | 1.2M | 11.8s |
| cosmo_auto_math_text | 3.15 | 1.655 | 7.9M | 77.3s |
| cosmo_khanacademy | 2.93 | 1.552 | 0.1M | 1.5s |
| mathpile | 2.72 | 1.446 | 7.1M | 69.9s |
| hplt_ko | 2.40 | 1.265 | 48.5M | 475.9s |

Interpretation: in-distribution data (included in training; hplt_ko: 2.40, mathpile: 2.72) scores low and OOD data (lightly represented in training; cc100_ko: 21.78, namuwiki: 25.88) scores high, which is the expected pattern. korean_c4 at 5.72 matches v1's 5.717, confirming that the evaluation is reproducible.
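
The PPL and Bits/token columns are two views of the same quantity: bits per token is the base-2 logarithm of perplexity, and perplexity is the exponential of the mean per-token NLL in nats. A small sketch of the conversion (the helper names are made up for this example):

```python
import math


def bits_per_token(ppl):
    """Bits/token is log2 of perplexity."""
    return math.log2(ppl)


def ppl_from_mean_nll(mean_nll_nats):
    """Perplexity is exp of the mean negative log-likelihood (in nats)."""
    return math.exp(mean_nll_nats)


combined_bits = bits_per_token(5.2263)   # combined 3b_val row
namuwiki_bits = bits_per_token(25.88)    # korean_namuwiki row
```

Rows whose PPL is shown rounded to two decimals can differ from the table in the third decimal of Bits/token, since the table was presumably computed from unrounded values.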

10.3 Korean benchmarks

KoBEST (0-shot): average 43.69%

| Task | Accuracy | F1 |
|---|---|---|
| kobest_boolq | 50.28% | 0.3457 |
| kobest_copa | 49.30% | 0.4921 |
| kobest_hellaswag | 21.60% | 0.2153 |
| kobest_sentineg | 48.61% | 0.4737 |
| kobest_wic | 48.65% | 0.3286 |
| Average | 43.69% | |

HAE-RAE (0-shot): overall 19.71%

| Subtask | Accuracy |
|---|---|
| haerae_general_knowledge | 21.59% |
| haerae_history | 23.40% |
| haerae_loan_word | 21.30% |
| haerae_rare_word | 18.77% |
| haerae_standard_nomenclature | 13.73% |
| Overall | 19.71% |

MMLU-KO (0-shot): 6-category average 22.75%

| Category | Accuracy |
|---|---|
| medical | 30.56% |
| humanities | 24.51% |
| business | 24.14% |
| social_sciences | 20.59% |
| other | 19.64% |
| stem | 19.57% |
| Average | 22.75% |

The base model has had no instruction-following training and is not optimized for multiple-choice benchmarks. KoBEST boolq/copa/sentineg/wic sit at about 50%, which is around the random baseline for these two-choice and four-choice formats; improvement is expected after SFT.

10.4 English benchmarks

Main benchmarks (0-shot)

| Task | Accuracy | Acc (norm) |
|---|---|---|
| hellaswag | 26.00% | 26.15% |
| arc_easy | 25.63% | 26.64% |
| arc_challenge | 21.67% | 27.90% |
| winogrande | 50.59% | n/a |
| piqa | 52.50% | 48.31% |

winogrande (50.59%) and piqa (52.50%) are two-choice tasks and sit close to the 50% random baseline. hellaswag and arc are four-choice tasks with a 25% random baseline.

MMLU-EN (0-shot): 61-subject average 25.81%

Top 10 subjects:

| Subject | Accuracy |
|---|---|
| college_physics | 37.25% |
| college_computer_science | 34.00% |
| high_school_statistics | 33.80% |
| us_foreign_policy | 32.00% |
| security_studies | 31.43% |
| world_religions | 30.99% |
| professional_medicine | 30.88% |
| high_school_government_and_politics | 30.57% |
| jurisprudence | 30.56% |
| human_sexuality | 30.53% |

Bottom 5 subjects:

| Subject | Accuracy |
|---|---|
| human_aging | 19.73% |
| college_biology | 19.44% |
| anatomy | 17.04% |
| global_facts | 17.00% |
| abstract_algebra | 15.00% |

10.5 Calibration

| Metric | Value |
|---|---|
| Top-1 Accuracy | 68.75% |
| Top-5 Accuracy | 81.64% |
| Top-10 Accuracy | 85.93% |
| Mean Correct Prob | 0.6152 |
| Mean Entropy | 1.5682 |

Token NLL distribution:

| Statistic | Value |
|---|---|
| Mean NLL | 1.5561 |
| Standard deviation | 2.4926 |
| Median | 0.1221 |
| p95 | 7.0312 |
| p99 | 10.3125 |
| Share with NLL > 5 | 10.86% |
| Share with NLL > 10 | 1.18% |

A Top-1 accuracy of 68.75% means the model's most confident prediction is correct about 69% of the time. The median NLL of 0.12 (e^0.12 ≈ 1.13 PPL) shows that most tokens are predicted with very high confidence, while a small number of hard tokens pull the mean NLL up. This is a typical distribution.

10.6 0-shot vs 5-shot comparison

0-shot and 5-shot performance were compared on 18 Korean tasks.

| Task | 0-shot | 5-shot | Change |
|---|---|---|---|
| global_mmlu_ko | 22.75% | 26.75% | +4.00pp |
| global_mmlu_ko_business | 24.14% | 31.03% | +6.90pp |
| global_mmlu_ko_humanities | 24.51% | 28.43% | +3.92pp |
| global_mmlu_ko_medical | 30.56% | 36.11% | +5.56pp |
| global_mmlu_ko_other | 19.64% | 23.21% | +3.57pp |
| global_mmlu_ko_social_sciences | 20.59% | 23.53% | +2.94pp |
| global_mmlu_ko_stem | 19.57% | 21.74% | +2.17pp |
| haerae | 19.71% | 20.26% | +0.55pp |
| haerae_general_knowledge | 21.59% | 22.73% | +1.14pp |
| haerae_history | 23.40% | 14.89% | -8.51pp |
| haerae_loan_word | 21.30% | 24.26% | +2.96pp |
| haerae_rare_word | 18.77% | 18.02% | -0.74pp |
| haerae_standard_nomenclature | 13.73% | 25.49% | +11.76pp |
| kobest_boolq | 50.28% | 50.21% | -0.07pp |
| kobest_copa | 49.30% | 46.80% | -2.50pp |
| kobest_hellaswag | 21.60% | 20.80% | -0.80pp |
| kobest_sentineg | 48.61% | 47.86% | -0.76pp |
| kobest_wic | 48.65% | 48.97% | +0.32pp |

Average change: +1.80pp | Improved: 12 | Declined: 6

MMLU-KO improves consistently at 5-shot (+2~7pp), which confirms that in-context learning is working. KoBEST barely moves or dips slightly: the few-shot examples do not help on these tasks and may even interfere. The +11.76pp on haerae_standard_nomenclature reflects the model picking up that task's unusual format from the few-shot examples.

10.7 Comparison with reference models

| Model | Parameters | MMLU-KO | MMLU-EN | KoBEST average | PPL |
|---|---|---|---|---|---|
| FRANKENSTALLM 3B | 3B | 22.75% | 25.81% | 43.69% | 5.2263 |
| Llama-3.2-3B | 3B | ~42% | ~58% | ~55% | n/a |
| Qwen2.5-3B | 3B | ~48% | ~65% | ~60% | n/a |
| EXAONE-3.5-2.4B | 2.4B | ~35% | ~50% | ~50% | n/a |

The reference models are the product of trillions of training tokens and thousands of GPU-hours. FRANKENSTALLM 3B should be read in light of its budget: 41.12B tokens (~68% of Chinchilla-optimal), 63 hours, 8 GPUs. The gap is expected to narrow after SFT and extended pretraining (80-100B tokens).
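
The "~68% of Chinchilla-optimal" figure is consistent with the common rule of thumb of roughly 20 training tokens per parameter. That ratio is an assumption here (the README does not state which constant it used), and `chinchilla_fraction` is a hypothetical helper written for this example:

```python
def chinchilla_fraction(tokens, params, tokens_per_param=20.0):
    """Share of the rule-of-thumb compute-optimal token budget that was actually used."""
    return tokens / (params * tokens_per_param)


optimal_tokens = 3e9 * 20.0                      # about 60B tokens for a 3B model
fraction = chinchilla_fraction(41.12e9, 3e9)     # tokens seen in pretraining
```

Under the same assumption, the planned 80-100B-token extended pretraining would go past the 60B-token mark.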

10.8 Generation quality and parameter grid search

Repetition rate summary

| Setting | 3-gram repetition rate | 4-gram repetition rate |
|---|---|---|
| greedy (temp=0.0) | 60.99% | 57.02% |
| temp=0.5 | 60.12% | 58.68% |
| temp=0.7 | 47.69% | 43.40% |
| temp=1.0 | 3.58% | 2.81% |

The 71.1% greedy repetition rate in the initial v1 evaluation was measured with no_repeat_ngram_size=3 applied. v2 standardizes on the raw setting (not applied) and records 60.99%.
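
The README does not spell out how the n-gram repetition rate is computed. One common definition, assumed here, is the share of n-grams in a generated sequence that duplicate an earlier n-gram, i.e. one minus the ratio of unique to total n-grams. A minimal sketch under that assumption:

```python
def ngram_repetition_rate(tokens, n=3):
    """Fraction of n-grams that repeat an earlier n-gram in the same sequence."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0  # sequence shorter than n: nothing can repeat
    return 1.0 - len(set(grams)) / len(grams)


looping = "a b c a b c a b c".split()   # a degenerate, looping output
varied = list(range(10))                # no repeated 3-grams
```

Here `ngram_repetition_rate(looping)` is 4/7, because only 3 of the 7 trigrams are distinct, while `ngram_repetition_rate(varied)` is 0.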

12-combination parameter grid search results

| Setting | Temp | Rep Pen | 3-gram | 4-gram | Notes |
|---|---|---|---|---|---|
| t0.7_rep1.3 | 0.70 | 1.30 | 0.00% | 0.00% | Best |
| t0.9_rep1.2 | 0.90 | 1.20 | 0.00% | 0.00% | Runner-up |
| t0.7_rep1.2 | 0.70 | 1.20 | 0.88% | 0.00% | |
| t0.9_rep1.1 | 0.90 | 1.10 | 0.94% | 0.13% | |
| t1.0_rep1.1 | 1.00 | 1.10 | 1.21% | 0.48% | |
| t0.5_rep1.1 | 0.50 | 1.10 | 1.92% | 1.19% | |
| t1.0 | 1.00 | 1.00 | 3.58% | 2.81% | |
| t0.9 | 0.90 | 1.00 | 8.39% | 4.64% | |
| t0.7_rep1.1 | 0.70 | 1.10 | 8.51% | 5.51% | |
| t0.7 | 0.70 | 1.00 | 47.69% | 43.40% | |
| t0.5 | 0.50 | 1.00 | 60.12% | 58.68% | |
| greedy | 0.00 | 1.00 | 60.99% | 57.02% | |

Recommended inference parameters (for base-model experiments)

# v2 grid search optimum
temp=0.7, repetition_penalty=1.3
# or (more diverse generation)
temp=0.9, repetition_penalty=1.2

Relative to the initial v1 recommendation (temp=0.9, top_p=0.9, no_repeat_ngram=3, repetition_penalty=1.1), repetition_penalty is raised to 1.3. no_repeat_ngram_size is dropped because the grid search showed that repetition_penalty alone removes the repetition.
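
To make the role of repetition_penalty concrete, the sketch below applies the widely used CTRL-style rule to a plain list of logits: for every token id already generated, a positive logit is divided by the penalty and a negative logit is multiplied by it. This is an illustration of the general technique, assumed to match what the evaluation's decoding library does; it is not code from this repository.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    """Down-weight tokens that already appeared in the generated sequence."""
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] = out[tok] / penalty   # shrink positive logits
        else:
            out[tok] = out[tok] * penalty   # push negative logits further down
    return out


penalized = apply_repetition_penalty([2.6, -1.0, 0.5], generated_ids=[0, 1, 0], penalty=1.3)
```

Token 2 was never generated, so its logit is unchanged; tokens 0 and 1 both become less likely. A penalty of 1.0 leaves the logits untouched, which corresponds to the "raw" rows in the grid.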

10.9 Evaluation pipeline

The v2 re-evaluation was run with a modular 8-GPU parallel pipeline (eval/reeval_pipeline.py).

Architecture

reeval_pipeline.py
├── Load model once (HF model on GPU 0)
├── Phase 1: PPL evaluation (19 datasets, sequential)
├── Phase 2: Calibration + Token NLL
├── Phase 3: Generation quality + parameter grid search (12 combinations)
├── Phase 4: lm-evaluation-harness (0-shot, 8-GPU parallel)
├── Phase 5: lm-evaluation-harness (5-shot, 8-GPU parallel)
└── Phase 6: Automatic report generation (5 individual + 1 combined)

Pipeline Mode

The model is loaded once and the 0-shot and 5-shot runs execute back to back. Compared with the previous approach (two separate processes), this halves model loading time.

Task distribution per GPU

| GPU | 0-shot tasks | 5-shot tasks |
|---|---|---|
| 0 | kobest_boolq, kobest_copa, kobest_hellaswag | Same |
| 1 | kobest_sentineg, kobest_wic | Same |
| 2 | haerae (overall + 5 subtasks) | Same |
| 3 | global_mmlu_ko (6 categories) | Same |
| 4 | hellaswag, arc_easy | Same |
| 5 | arc_challenge, winogrande | Same |
| 6 | piqa, global_mmlu_en (61 subjects) | Same |
| 7 | (reserve: dedicated to PPL/calibration) | n/a |

NUMA affinity applied: GPUs 0-3 on NUMA node 0 (cores 0-35), GPUs 4-7 on NUMA node 1 (cores 36-71).
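
One way to apply the GPU-to-NUMA mapping above from inside a worker process is `os.sched_setaffinity`, which is available on Linux. This is a sketch of the idea only: the pipeline may well use `numactl` or `taskset` instead, and the helper names below are invented for the example.

```python
import os

# Core ranges per NUMA node, taken from the mapping described above.
NUMA_CORES = {0: range(0, 36), 1: range(36, 72)}


def cores_for_gpu(gpu_index):
    """GPUs 0-3 belong to NUMA node 0, GPUs 4-7 to NUMA node 1."""
    node = 0 if gpu_index < 4 else 1
    return set(NUMA_CORES[node])


def pin_current_process(gpu_index):
    """Restrict the calling process to the cores local to its GPU (Linux only)."""
    available = os.sched_getaffinity(0)
    target = cores_for_gpu(gpu_index) & available
    if target:  # leave affinity unchanged if none of the cores exist on this host
        os.sched_setaffinity(0, target)
    return target
```

Each evaluation worker would call `pin_current_process(gpu_index)` once at startup, before loading its tasks.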

Total time: 256.6 s (including model load)

SFT go/no-go decision

Conclusion: proceed with SFT. A loss of 1.466 is a healthy completion signal and there are no structural problems. → Phase 2 SFT started (2026-03-05)

Detailed reports:

  • v2 comprehensive: eval/outputs/3b_reeval_20260305_1451/reports/ (5 individual reports + combined)
  • v1 legacy: reports/2026-03-05_3B_BASE_EVALUATION_REPORT.md

11. Experiment results: 3B SFT comprehensive evaluation

A six-dimension evaluation performed after Phase 2 SFT ended via early stopping.

11.1 SFT training results

| Item | Value |
|---|---|
| Final step | 25,500 / 33,000 (77.3%, early stopping) |
| Best val_loss | 1.8851 (step 23,000) |
| Training time | ~15 h 41 min |
| Data | 24 sources → 2,439,397 samples (7.48 GB) |
| Settings | LR=1e-5, eff_batch=64, NEFTune alpha=5.0 |

Val loss trend:

Step     500: 2.0732 (warmup done)
Step   2,000: 1.9558 (rapid decline)
Step   5,000: 1.9107 (stable convergence)
Step  10,000: 1.8917 (slight decrease)
Step  15,000: 1.8864 (entering plateau)
Step  20,000: 1.8853 (change < 0.001)
Step  23,000: 1.8851 ← BEST (early stopping reference point)
Step  25,500: Early Stop (patience 5/5 exhausted)
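
The stop at step 25,500 is what a patience-based early stopper produces if validation runs every 500 steps and no evaluation after step 23,000 beats 1.8851. The 500-step interval is inferred from the numbers above, not stated in the source, and the intermediate loss values in this sketch are placeholders chosen for illustration.

```python
class EarlyStopper:
    """Stop once val_loss has failed to improve for `patience` consecutive evaluations."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_step = None
        self.bad_evals = 0

    def update(self, step, val_loss):
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best, self.best_step, self.bad_evals = val_loss, step, 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Only the (23000, 1.8851) point comes from the log above; the rest are placeholders.
history = [(22500, 1.8853), (23000, 1.8851)]
history += [(step, 1.8852) for step in range(23500, 26000, 500)]

stopper = EarlyStopper(patience=5)
stopped_at = None
for step, loss in history:
    if stopper.update(step, loss):
        stopped_at = step
        break
```

Under DDP the stop decision has to be shared across ranks (the README lists a "DDP early stop broadcast" change for this reason); otherwise one rank can exit while the others wait in a collective.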

11.2 Six-dimension evaluation summary

| # | Dimension | Result | Key figures |
|---|---|---|---|
| 1 | Perplexity (knowledge retention) | PASS | Max forgetting 0.9%, all 19 datasets PASS |
| 2 | Generation quality | FAIL | Greedy repetition rate 72.97% (target <5%), EOS 60% (target >90%) |
| 3 | Korean benchmarks | FAIL | KoBEST average 43.26% (target >55%) |
| 4 | English benchmarks | PASS | hellaswag 26.1%, winogrande 50.8%, piqa 52.6% (all above the floor) |
| 5 | Calibration | PASS | Top-1 68.59%, Top-5 81.55%, Entropy 1.54 |
| 6 | SFT chat ability | PASS | EOS termination rate 0%→60%, responds in the chat template |

11.3 Base vs SFT comparison

| Metric | Base | SFT | Change | Verdict |
|---|---|---|---|---|
| PPL (combined) | 5.2263 | 5.2529 | +0.5% forgetting | PASS |
| Greedy 3-gram repetition rate | 60.99% | 72.97% | +12pp (worse) | FAIL |
| EOS termination rate | 0% | 60% | +60pp (large improvement) | Partial PASS |
| KoBEST average | 43.69% | 43.26% | -0.4pp | FAIL |
| MMLU-KO | 22.75% | 26.00% | +3.2pp | Partial improvement |
| English benchmarks | n/a | n/a | within ±0.3pp | PASS (maintained) |
| Calibration Top-1 | 68.75% | 68.59% | -0.2pp | PASS (maintained) |
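
The "forgetting" figure in the first row reads as the relative increase in perplexity from Base to SFT, while the other rows are percentage-point differences. That reading is an assumption, but it reproduces the table's numbers:

```python
def forgetting_pct(base_ppl, tuned_ppl):
    """Relative PPL increase after fine-tuning, in percent (assumed definition)."""
    return (tuned_ppl - base_ppl) / base_ppl * 100.0


def pp_delta(base_pct, tuned_pct):
    """Change between two percentages, in percentage points."""
    return tuned_pct - base_pct


ppl_forgetting = forgetting_pct(5.2263, 5.2529)   # combined PPL row
repetition_change = pp_delta(60.99, 72.97)        # greedy 3-gram repetition row
kobest_change = pp_delta(43.69, 43.26)            # KoBEST average row
```

The distinction matters when reading the verdicts: a 0.5% relative PPL change is small, whereas a 12pp jump in repetition rate is a large absolute shift.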

Repetition parameter search (encouraging):

| Setting | Repetition rate | EOS rate |
|---|---|---|
| t0.7_rep1.2 | 0.00% | 100% |
| t1.0_rep1.1 | 0.00% | 100% |
| greedy (raw) | 72.97% | 60% |

With rep_penalty between 1.1 and 1.3 the repetition rate reaches 0%, so the model does have the ability to avoid repetition. ORPO can internalize that behavior.

11.4 Code improvements

Main code changes made in this phase:

| File | Change | Lines | Purpose |
|---|---|---|---|
| train/sft.py | MixingDataLoader, DDP rank-0 tokenization | +238 | SFT+pretrain interleaving, 8x memory reduction |
| train/trainer.py | DDP early stop broadcast | +17 | Prevents DDP hangs, patience 5→10 |
| train/orpo.py | YAML config, 3B defaults | +30 | ORPO readiness |
| eval/report_generator.py | Automatic SFT comparison report | +831 | Evaluation automation |
| eval/sft_eval_pipeline.py | Six-dimension evaluation pipeline | new | Comprehensive SFT evaluation |
| eval/tasks/generation_task.py | Chat template, diversity metrics | +75 | SFT evaluation support |

11.5 ORPO go/no-go decision

Decision: proceed to Phase 3 ORPO

| Rationale | Detail |
|---|---|
| Knowledge retention is good | forgetting 0.9%: SFT did not destroy base knowledge |
| Repetition unresolved | greedy 72.97%: preference alignment is the direct fix |
| Encouraging signal | 0% with rep_penalty applied → ORPO can internalize it |
| Data ready | 795,468 preference pairs (7.9 GB) |
| Code and config ready | train/orpo.py + configs/korean_3b_orpo.yaml |

Decision criteria after ORPO:

  • Repetition rate < 5% AND KoBEST > 50% → GGUF + Ollama deployment
  • Repetition rate 5~15% → adjust hyperparameters and retry
  • Repetition rate > 15% → retry after SFT v2 (lr=5e-5, data mixing)

Details: reports/2026-03-06_3B_SFT_COMPLETION_AND_EVAL_SUMMARY.md


12. Phase 3: ORPO (preference alignment)

12.1 Why ORPO

The six-dimension SFT evaluation exposed critical problems: a greedy repetition rate of 72.97% and an EOS termination rate of only 60% (target >90%). SFT only teaches the model to imitate good responses, so it carries no signal for suppressing bad ones. Fixing the repetition problem requires preference optimization.

ORPO vs DPO:

| Item | ORPO | DPO |
|---|---|---|
| Reference model | Not needed | Needed (2x VRAM) |
| Implementation complexity | Low | Medium |
| Memory efficiency | High (one 3B model loaded) | Low (two 3B models loaded) |
| Training stability | Medium | High |

ORPO was chosen as the first option, with DPO as Plan B.
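
For orientation, the ORPO objective adds an odds-ratio penalty to the usual NLL loss on the chosen response, which is why no reference model is needed. The scalar sketch below follows the published formulation as commonly implemented (odds computed from the length-averaged log-probability, weighted by beta); it is an assumption that train/orpo.py matches it exactly, and the function names are made up for this example.

```python
import math


def log_odds(avg_logp):
    """log(p / (1 - p)) where p = exp(avg_logp); avg_logp must be negative."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(nll_chosen, avg_logp_chosen, avg_logp_rejected, beta=0.25):
    """NLL on the chosen response plus beta times the odds-ratio term."""
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))
    return nll_chosen + beta * odds_ratio_term


prefers_chosen = orpo_loss(1.6, avg_logp_chosen=-0.5, avg_logp_rejected=-2.0)
prefers_rejected = orpo_loss(1.6, avg_logp_chosen=-2.0, avg_logp_rejected=-0.5)
```

When the model assigns equal likelihood to both responses the penalty is beta × log 2; it shrinks as the chosen response becomes relatively more likely, so a larger beta pushes harder against the rejected (for example, repetitive) response.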

12.2 Data

  • Original: 683,181 preference pairs (7 sources merged)
  • After filtering: ~630,000 pairs (NaN-prevention filter applied)
  • Eval split: 5% (~31,500 pairs, seed=42)
  • Effective batch: 4 × 8 GPU × 4 accum = 128

12.3 HP sweep design (6 configs)

Six combinations were chosen by varying one of three axes (beta, LR, max_length) at a time around a fixed center point:

| Run | Name | Beta | LR | Max length | Purpose |
|---|---|---|---|---|---|
| 1 | baseline_b015 | 0.15 | 8e-6 | 1536 | Weak-beta baseline |
| 2 | baseline_b025 | 0.25 | 8e-6 | 1536 | Medium-beta baseline |
| 3 | strong_b035 | 0.35 | 8e-6 | 1536 | Strong beta: aggressive repetition suppression |
| 4 | fast_lr12e6 | 0.25 | 1.2e-5 | 1536 | Higher LR: faster convergence |
| 5 | conserv_lr5e6 | 0.25 | 5e-6 | 1536 | Conservative LR: stability |
| 6 | short_1024 | 0.25 | 8e-6 | 1024 | Shorter max_length: VRAM savings |

Each run: 200 steps, eval_steps=100, 8×B200 DDP.
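
The six runs above follow a one-axis-at-a-time pattern around the center point (beta=0.25, LR=8e-6, max_length=1536). A small generator for that kind of sweep might look like the following; the dictionary layout and function name are assumptions for the sketch, not the sweep script's actual interface.

```python
CENTER = {"beta": 0.25, "lr": 8e-6, "max_length": 1536}

# Alternative values tried on each axis while the other two stay at the center.
ALTERNATIVES = {
    "beta": [0.15, 0.35],
    "lr": [1.2e-5, 5e-6],
    "max_length": [1024],
}


def one_axis_sweep(center, alternatives):
    """Return the center config plus one config per alternative value on each axis."""
    configs = [dict(center)]
    for axis, values in alternatives.items():
        for value in values:
            cfg = dict(center)
            cfg[axis] = value
            configs.append(cfg)
    return configs


sweep = one_axis_sweep(CENTER, ALTERNATIVES)
```

This yields 1 + 2 + 2 + 1 = 6 configs instead of the 3 × 3 × 2 = 18 a full grid over the same values would need, at the cost of not measuring interactions between axes.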

12.4 Attempt history: five failures

| # | Problem | Cause | Fix |
|---|---|---|---|
| 1 | NCCL timeout | Tokenization (~30 min) exceeded the 1800 s timeout | ddp_timeout=7200, num_proc=64 |
| 2 | Config conflict | save_steps was not a multiple of eval_steps | --no_load_best --save_steps 200 |
| 3 | Port conflict + missing QKV | Zombie processes + fused QKV not split | pkill + QKV split logic |
| 4 | TRL NaN bug | tokenize_row truncated both responses at once | Triple patch (clamp, truncation) |
| 5 | Tokenizer compatibility | zip(strict=True) + Korean merge ops | 8 patches to the TRL source |

The most serious was the TRL NaN bug, which set off the chain 0 response tokens → log(0) = -inf → NaN propagation. Details: reports/2026-03-08_ORPO_TRAINING_JOURNEY.md
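
The failure chain above starts when truncation leaves a response with zero tokens. A pre-filter in the spirit of the "NaN-prevention filter" can drop such pairs before they reach the trainer. The sketch below is hypothetical: the argument names are token counts invented for the example, and the actual patches (described in the linked report) also touch TRL's truncation and clamping logic.

```python
def has_response_budget(prompt_len, chosen_len, rejected_len,
                        max_length=1536, min_response_tokens=1):
    """True if both responses keep at least min_response_tokens after truncation.

    All *_len arguments are token counts. A prompt that fills max_length on its
    own leaves no room for response tokens, which is the case that led to log(0).
    """
    room = max_length - prompt_len
    if room < min_response_tokens:
        return False
    return min(chosen_len, rejected_len) >= min_response_tokens


def filter_pairs(pairs, max_length=1536):
    """Keep only preference pairs that cannot end up with an empty response."""
    return [
        p for p in pairs
        if has_response_budget(p["prompt_len"], p["chosen_len"], p["rejected_len"], max_length)
    ]


example_pairs = [
    {"prompt_len": 200, "chosen_len": 120, "rejected_len": 90},    # fine
    {"prompt_len": 1600, "chosen_len": 50, "rejected_len": 40},    # prompt alone exceeds max_length
    {"prompt_len": 300, "chosen_len": 0, "rejected_len": 40},      # empty chosen response
]
kept = filter_pairs(example_pairs)
```

Filtering up front is cheaper to reason about than clamping inside the loss, although the source reports that both kinds of safeguard were needed.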

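12.5 Final sweep results

| Run | Name | Beta | LR | MaxLen | Train loss | Eval loss | Margin | Status |
|---|---|---|---|---|---|---|---|---|
| 1 | baseline_b015 | 0.15 | 8e-6 | 1536 | 1.811 | 1.827 | 0.004 | ✅ |
| 2 | baseline_b025 | 0.25 | 8e-6 | 1536 | 1.890 | 1.906 | 0.009 | ✅ |
| 3 | strong_b035 | 0.35 | 8e-6 | 1536 | 2.055 | 1.985 | 0.007 | ✅ |
| 4 | fast_lr12e6 | 0.25 | 1.2e-5 | 1536 | 1.917 | 1.862 | 0.009 | 🏆 Best |
| 5 | conserv_lr5e6 | 0.25 | 5e-6 | 1536 | 1.833 | 1.910 | 0.004 | ✅ |
| 6 | short_1024 | 0.25 | 8e-6 | 1024 | 1.664 | 1.695 | 0.007 | ✅ |

Best config: Run 4. Its eval_loss of 1.862 is the lowest among the three runs that share beta=0.25 and max_length=1536 (Runs 2, 4, 5), its margin of 0.009 ties for the highest, and it converges fastest. Runs 1 and 6 show lower raw eval_loss, but they use a different beta or max_length, so their losses are not directly comparable.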

12.6 Throughput benchmark → main training settings

Before the main run, the throughput of several batch/grad_accum combinations was measured to pick the best setting:

| batch_size | grad_accum | eff_batch | Throughput | Notes |
|---|---|---|---|---|
| 4 | 4 | 128 | 80.63 samples/s | Selected |
| 2 | 8 | 128 | 73.14 samples/s | Previous setting |
| 8 | 2 | 128 | OOM | |

12.7 ORPO main training (in progress, 2026-03-09)

| Parameter | Value |
|---|---|
| Beta / LR | 0.25 / 1.2e-5 (Sweep Run 4) |
| Batch / Accum / Eff | 4 / 4 / 128 (benchmark optimum) |
| Max length | 1536 |
| Epochs | 2 (~9,840 steps) |
| GPU VRAM | ~52GB / 183GB (28%) |
| Speed | ~1.75 s/step |
| Estimated time | ~4.8 hours |

Training metrics so far (as of step ~1,660):

| Step | Eval loss | Pref accuracy | Reward margin | NLL loss |
|---|---|---|---|---|
| ~1,000 | 1.791 | 66.8% | 0.107 | 1.647 |
| ~2,000 | 1.713 | 70.1% | 0.293 | 1.591 |
| ~3,000 | 1.681 | 71.9% | 0.372 | 1.567 |

  • Train loss: 2.34 → 1.68 (-0.66)
  • rewards/accuracies: 0.43 → 0.74 (ability to tell chosen from rejected rises sharply)
  • rewards/margins: -0.005 → 0.387 (confirms the preference signal is being learned)
  • Speed 1.76 s/step, GPU utilization 92~100%, running stably

Automatic evaluation after training: scripts/orpo_eval_watchdog.sh watches the training process and, on completion, automatically runs the 10-dimension evaluation pipeline.

12.8 ORPO comprehensive evaluation pipeline

A 10-dimension evaluation that adds four ORPO-specific dimensions to the six dimensions of the SFT v2 evaluation. When training finishes, eval/orpo_eval_pipeline.py runs automatically and generates a 3-way Base vs SFT vs ORPO comparison report.

Evaluation structure:

| Phase | Contents | GPU | Estimated time |
|---|---|---|---|
| Pre-phase | Extract the learning curve from train.log | - | ~1 s |
| Phase 1 | Internal evaluation (PPL on 19 sets, Calibration, Generation, Repetition Grid) | 8 GPUs in parallel | ~30 min |
| Phase 2 | Benchmarks (KoBEST, HAE-RAE, MMLU-KO/EN, hellaswag, arc, piqa) | 8 GPUs in parallel | ~1 hour |
| Phase 3 | Automatic 3-way comparison report | - | ~10 s |

10-dimension evaluation items:

| # | Dimension | Criterion | SFT v2 result | ORPO target |
|---|---|---|---|---|
| 1 | Knowledge retention (PPL) | forgetting < 15% | 0.9% | < 5% |
| 2 | Generation quality | greedy repetition rate < 5%, EOS > 90% | 72.97% / 60% | < 5% / > 90% |
| 3 | Korean benchmarks | KoBEST average > 55% | 43.26% | ≥ 43% |
| 4 | English benchmarks | Above the floor | PASS | Maintain |
| 5 | Calibration | Top-1 ≥ 65% | 68.59% | ≥ 65% |
| 6 | Chat ability | EOS termination rate | 60% | > 90% |
| 7 | Preference Accuracy | > 65% | n/a | > 65% |
| 8 | Reward Margins | > 0.1 | n/a | > 0.1 |
| 9 | Sensitivity to repetition parameters | < 5% even at rep_penalty=1.0 | n/a | PASS |
| 10 | SFT→ORPO improvement | Repetition rate↓ + EOS↑ | n/a | PASS |

Key files:

  • eval/orpo_eval_pipeline.py: ORPO evaluation orchestrator
  • eval/report_generator.py: 3-way comparison report generator (generate_three_way_report())
  • scripts/orpo_eval_watchdog.sh: detects training completion and launches the evaluation

Deployment criteria: greedy repetition rate < 5% AND EOS > 90% AND forgetting < 5% AND KoBEST ≥ 43% → DEPLOY
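
The deployment rule is a simple conjunction of four thresholds, so it can be expressed as a small gate that also reports which checks failed. This is an illustrative sketch; the function name and return shape are assumptions, not the interface of eval/orpo_eval_pipeline.py.

```python
def deploy_decision(repetition_pct, eos_pct, forgetting_pct, kobest_pct):
    """Return (deploy, failed_checks) for the four deployment criteria above."""
    checks = {
        "greedy repetition < 5%": repetition_pct < 5.0,
        "EOS > 90%": eos_pct > 90.0,
        "forgetting < 5%": forgetting_pct < 5.0,
        "KoBEST >= 43%": kobest_pct >= 43.0,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return len(failed) == 0, failed


# SFT results from section 11 fed through the gate.
sft_ok, sft_failed = deploy_decision(72.97, 60.0, 0.9, 43.26)
```

Run on the SFT numbers, the gate fails on repetition and EOS only, which is the same pair of problems ORPO is meant to address.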


13. How to run

Prerequisites

# Do not reinstall PyTorch (NVIDIA custom build)
# Install only the additional packages below
pip install transformers accelerate peft trl deepspeed \
            bitsandbytes sentencepiece wandb

3B pretraining

# Launch 8-GPU training with the NCCL environment variables set
bash scripts/launch_3b_pretrain.sh

# Manual launch (direct control)
torchrun --nproc_per_node=8 \
  --master_port=29500 \
  train/pretrain.py \
  --config configs/korean_3b_fp8.yaml

SFT

bash scripts/launch_3b_sft.sh

# Or launch directly
torchrun --nproc_per_node=8 \
  train/sft.py \
  --config configs/korean_3b_sft.yaml \
  --pretrain_ckpt checkpoints/3b_pretrain_best.pt

ORPO (preference alignment)

# ORPO training
bash scripts/launch_3b_orpo.sh

# Automatic evaluation once training finishes (watchdog)
nohup bash scripts/orpo_eval_watchdog.sh \
  > checkpoints/korean_3b_orpo_v1/watchdog.log 2>&1 &

Evaluation

# Full evaluation of the base model (8 GPUs in parallel)
python eval/full_eval_pipeline.py

# SFT model evaluation (2-way Base vs SFT comparison)
python eval/sft_eval_pipeline.py --skip-phase0 \
  --hf-model-path eval/outputs/hf_3b_sft_best

# ORPO model evaluation (3-way Base vs SFT vs ORPO comparison)
python eval/orpo_eval_pipeline.py           # detects the latest checkpoint automatically
python eval/orpo_eval_pipeline.py --dry-run  # only print the execution plan

# Quick evaluation (kobest_copa + PPL)
bash scripts/run_eval_quick.sh

# Generation parameter search
python eval/test_generation_params.py \
  --checkpoint checkpoints/3b_best.pt

Deployment

# Step 1: GGUF conversion (llama.cpp format)
bash scripts/convert_3b_gguf.sh

# Step 2: register and serve the model with Ollama
bash scripts/deploy_3b_ollama.sh

# Test with Ollama (the prompt asks "Tell me about Korea's steel industry.")
ollama run frankenstallm-3b "한국의 철강 산업에 대해 설명해줘."

Training monitoring

# Live monitor (tail -f style)
bash scripts/monitor_3b.sh

# Check process status
ps aux | grep pretrain

# GPU status
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu \
  --format=csv -l 5

Single-GPU test (development/debugging)

python train/pretrain.py \
  --config configs/korean_3b_fp8.yaml \
  --device cuda:0 \
  --max_steps 100 \
  --debug

14. Roadmap

Short term (March 2026)

| Item | Status | Notes |
|---|---|---|
| Phase 1 (3B Pretrain) | ✅ Done | 57K steps, loss 1.466, 2026-03-05 |
| Phase 2 (SFT) | ✅ Done | 25.5K steps, val_loss 1.8851, 2026-03-06 |
| SFT six-dimension evaluation | ✅ Done | 4/6 PASS, decision to run ORPO |
| Phase 3 (ORPO sweep) | ✅ Done | 6-config sweep finished, best config selected |
| Phase 3 (ORPO main training) | 🔄 In progress | lr=1.2e-5, beta=0.25, 2 epochs, ~9,840 steps |
| Phase 3.5 (ORPO comprehensive evaluation) | 📋 Pending | 10-dimension evaluation (6 base + 4 ORPO-specific), 3-way comparison report |
| GGUF conversion + Ollama deployment | 📋 Pending | Phase 4 (if the ORPO evaluation passes) |

Medium term (Q2 2026)

| Item | Notes |
|---|---|
| Extended pretraining (80~100B tokens) | Reach the Chinchilla-optimal point |
| QKV Fusion | +8~12% MFU expected |
| NUMA affinity setup | +4~9% expected |
| FA2 native RoPE | +3~5% expected |
| Context length extension (4096) | Based on RoPE θ=500K |

Long term (second half of 2026)

| Item | Notes |
|---|---|
| 7B experiment | Requires an FSDP strategy |
| vLLM serving | PagedAttention-based inference server |
| Domain-specific fine-tuning | Steel/manufacturing domain |
| Public release | Upload to HuggingFace Hub |

Known optimizations not yet applied

Optimizations identified in the Phase 0 analysis but not yet applied:

| Optimization | Expected effect | Implementation complexity |
|---|---|---|
| QKV Fusion | +8~12% MFU | Medium |
| NUMA Affinity | +4~9% | Low |
| FA2 Native RoPE | +3~5% | Low |
| HugePages | +1~3% (TLB optimization) | Low (sysctl) |

Applying all of these could raise MFU from the current 33.5% to 45~50%.


15. Reference documents

| Document | Location | Contents |
|---|---|---|
| Full project journey | docs/PROJECT_HISTORY.md | Detailed day-by-day progress log |
| 3B work plan | docs/3B_WORKPLAN.md | Detailed stage-by-stage plan for 3B |
| Justice League argument | eval/debate/justice_league_3b_case.md | Full text of the multi-agent debate on the 1B→3B switch |
| SFT restart verdict | eval/decision/FINAL_DECISION_REPORT.md | Verdict on the SFT v1 failure → v2 design |
| 3B master plan | eval/plan/3B_MASTER_PLAN.md | Master plan for the full training pipeline |
| Phase 0 optimization report | reports/2026-03-02_0200_FRANKENSTALLM_phase0_optimization_report.md | Full VRAM/MFU optimization report |
| 3B Base evaluation report (v1) | reports/2026-03-05_3B_BASE_EVALUATION_REPORT.md | Initial PPL/benchmark/repetition evaluation |
| PPL evaluation report (v1) | reports/2026-03-05_PPL_EVALUATION.md | PPL details for 4 validation sets |
| Benchmark results (v1) | reports/2026-03-05_BENCHMARK_RESULTS.md | belebele and MMLU details |
| Generation quality analysis (v1) | reports/2026-03-05_GENERATION_QUALITY.md | Repetition rate, decoding parameters |
| SFT training report | reports/2026-03-05_3B_SFT_PROGRESS_REPORT.md | Record of the Phase 2 SFT training process |
| SFT completion summary report | reports/2026-03-06_3B_SFT_COMPLETION_AND_EVAL_SUMMARY.md | SFT completion + evaluation + code improvements + ORPO decision (latest) |
| SFT evaluation plan | reports/2026-03-06_3B_SFT_EVAL_PLAN.md | Six-dimension evaluation design |
| SFT evaluation results | reports/2026-03-06_3B_SFT_EVALUATION_REPORT.md | Detailed six-dimension evaluation results |
| 3B next-steps reference | reports/2026-03-05_3B_NEXT_STEPS_REFERENCE.md | Direction after SFT |
| Nemotron Nano feasibility | reports/2026-03-05_NEMOTRON_NANO_FEASIBILITY_STUDY.md | Review of the hybrid architecture |
| v2 comprehensive evaluation report | eval/outputs/3b_reeval_20260305_1451/full_eval_report.md | 13+ benchmarks combined |
| v2 PPL report | eval/outputs/3b_reeval_20260305_1451/reports/01_perplexity_report.md | PPL details for 19 datasets |
| v2 Calibration report | eval/outputs/3b_reeval_20260305_1451/reports/02_calibration_report.md | Top-K accuracy, NLL distribution |
| v2 generation quality report | eval/outputs/3b_reeval_20260305_1451/reports/03_generation_quality.md | 12-combination parameter grid search |
| v2 benchmark report | eval/outputs/3b_reeval_20260305_1451/reports/04_benchmark_report.md | KoBEST, HAE-RAE, MMLU, 0/5-shot |
| Progress log | PROGRESS.md | Dated checkpoints, metrics, decision log |
| ORPO analysis and plan | reports/2026-03-07_ORPO_ANALYSIS_AND_PLAN.md | ORPO rationale, HP design, execution procedure |
| ORPO sweep debugging | reports/2026-03-08_ORPO_SWEEP_DEBUG_REPORT.md | Details of the QKV bug, NCCL timeout, and TRL patches |
| ORPO training journey | reports/2026-03-08_ORPO_TRAINING_JOURNEY.md | The full ORPO process: five failures and the HP sweep (latest) |

16. Tech Stack Summary

| Area | Technology | Version |
|---|---|---|
| Deep learning framework | PyTorch (NVIDIA custom build) | nv25.12 |
| Attention | FlashAttention-2 | 2.7.4.post1+25.12 |
| FP8 / mixed precision | TransformerEngine (MXFP8) | 2.10.0 |
| Distributed training | DDP + NCCL (NVLS) | NCCL 2.28.9 |
| Kernel compilation | Triton | 3.5.1 |
| Tokenizer | SentencePiece Unigram 64K | - |
| Monitoring | Telegram Bot (B200Bot) + cron watchdog | - |
| Inference serving | GGUF + Ollama | - |
| GPU | 8× NVIDIA B200 (NVLink 5.0, NVSwitch) | CUDA 13.1 |
| CPU | 2× AMD EPYC 9365 (Zen 5) | - |

17. Related Projects

EVAFRILL-Mo

A hybrid Mamba-2 + Transformer language model, and the sister project of FRANKENSTALLM.

It is a 3B hybrid model implemented from the ground up, inspired by the NVIDIA Nemotron-H architecture. Where FRANKENSTALLM is purely Transformer-based, EVAFRILL-Mo adopts a hybrid of Mamba-2 SSM and sparse Transformer attention.

| Item | FRANKENSTALLM | EVAFRILL-Mo |
|---|---|---|
| Architecture | Pure Transformer (28L) | Mamba-2 24L + Attention 2L |
| Parameters | 3.17B | 2.94B |
| Core techniques | GQA, FP8, FlashAttention-2 | Selective Scan, SwiGLU FFN in Mamba, GQA |
| Design principle | Proven Transformer architecture | Piecewise adoption of Nemotron-H |
| GPU | 8× B200 | 7× B200 |
| Training strategy | Chinchilla-optimal | Targeting 93% of Chinchilla |

The two projects share the same tokenizer (64K SentencePiece), training data pipeline, and DDP/FP8 infrastructure. With "same ingredients, different recipe," they allow a controlled comparison of how architectural differences affect performance.

*Origin of the name: Bride Eva (the Bride of Frankenstein) + FRIDAY (Iron Man's AI assistant) + LLM + the Mo from Nemotron*


18. Next Optimization Plan: MFU 33.5% → 47% Target

Detailed document: docs/NEXT_OPTIMIZATION_PLAN.md

Current performance diagnosis

Phase 1 pretraining, measured:

  • 57,000 steps, ~38.5B tokens, about 63 hours
  • Throughput: 36–38K tok/s per rank → **292K tok/s** overall (8 GPUs)
  • MFU: ~33.5%

Key bottleneck: NUMA misalignment

AMD EPYC 9365 × 2 sockets:
  GPU 0–3 → NUMA node 0 (cores 0-35)
  GPU 4–7 → NUMA node 1 (cores 36-71)

At the initial DDP launch, 5 of 8 ranks ran on the wrong NUMA node.
69% of DataLoader workers were cross-NUMA, causing ~2x latency.
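
The GPU-to-core layout above can also be enforced from inside the training process. The following is a minimal sketch, assuming that local rank i drives GPU i and that it runs on Linux; the function names are hypothetical and not part of the project's code. It pins CPU affinity only and does not bind memory allocation the way `numactl --membind` does.

```python
import os

# Layout taken from the diagnosis above: 4 GPUs and 36 cores per NUMA node.
GPUS_PER_NODE = 4
CORES_PER_NODE = 36

def cores_for_local_rank(local_rank: int) -> set[int]:
    """Return the CPU cores on the NUMA node that hosts this rank's GPU."""
    node = local_rank // GPUS_PER_NODE
    start = node * CORES_PER_NODE
    return set(range(start, start + CORES_PER_NODE))

def pin_current_process(local_rank: int) -> None:
    """Restrict this process to NUMA-local cores (Linux only).

    Worker processes forked afterwards inherit the same affinity mask,
    so DataLoader workers stay on the local node as well.
    """
    os.sched_setaffinity(0, cores_for_local_rank(local_rank))

# Typical use at the top of a training script launched by torchrun,
# which exports LOCAL_RANK for every worker:
#   pin_current_process(int(os.environ["LOCAL_RANK"]))
```

Calling the pin before constructing the DataLoader matters, because workers only inherit the mask that is in place when they are forked.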

Expected effect per optimization

| Optimization | Expected MFU gain | Difficulty |
|---|---|---|
| Pin NUMA affinity | +4–9% | Low (edit the launch script) |
| QKV fusion (TransformerEngine) | +8–12% | Medium (model code change) |
| FA2 native RoPE | +3–5% | Medium (depends on FA2 version) |
| NCCL environment variable tuning | +1–2% | Low (add one line) |

Expected before/after comparison

| Item | Current | After optimization |
|---|---|---|
| MFU | 33.5% | 45–47% |
| Throughput | 292K tok/s | 390–410K tok/s |
| 50B-token training | ~47 hours | 34–36 hours |
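
The before/after figures follow from simple proportionality. As a sketch, assuming throughput scales linearly with MFU on the same hardware and model (the function names are illustrative only):

```python
def projected_throughput(base_tok_s: float, base_mfu: float, new_mfu: float) -> float:
    """Scale measured throughput by the ratio of target MFU to measured MFU."""
    return base_tok_s * new_mfu / base_mfu

def training_hours(tokens: float, tok_s: float) -> float:
    """Wall-clock hours to process `tokens` at a steady `tok_s`."""
    return tokens / tok_s / 3600

BASE_TOK_S = 292_000   # measured, 8 GPUs
BASE_MFU = 33.5        # measured, percent

for mfu in (33.5, 45.0, 47.0):
    tok_s = projected_throughput(BASE_TOK_S, BASE_MFU, mfu)
    hours = training_hours(50e9, tok_s)
    print(f"MFU {mfu:.1f}% -> {tok_s / 1e3:.0f}K tok/s, 50B tokens in {hours:.1f} h")
```

Run as written, this lands at roughly 392K and 410K tok/s for 45% and 47% MFU, and about 35.4 and 33.9 hours for 50B tokens, in line with the table. It ignores checkpointing, evaluation pauses, and restarts.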

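Code that can be applied immediately

NUMA affinity (launch script). The two launchers must join a single 8-rank job, so each one needs --nnodes=2 and a shared rendezvous endpoint, plus CUDA_VISIBLE_DEVICES so that its four local ranks land on the GPUs attached to its own socket. The address and port shown are placeholder values:

CUDA_VISIBLE_DEVICES=0,1,2,3 numactl --cpunodebind=0 --membind=0 torchrun \
  --nnodes=2 --nproc_per_node=4 --node_rank=0 \
  --master_addr=127.0.0.1 --master_port=29500 train/pretrain.py ... &
CUDA_VISIBLE_DEVICES=4,5,6,7 numactl --cpunodebind=1 --membind=1 torchrun \
  --nnodes=2 --nproc_per_node=4 --node_rank=1 \
  --master_addr=127.0.0.1 --master_port=29500 train/pretrain.py ... &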

NCCL environment variables:

export NCCL_MIN_NCHANNELS=4
export NCCL_SOCKET_NTHREADS=4
export CUDA_DEVICE_MAX_CONNECTIONS=1

Once Phase 3 ORPO completes, applying NUMA affinity first, before the next pretraining run, could shorten training time by ~30%.


19. GPU Hardware & Cost Analysis: 3B × 60B Pretraining

Detailed document: docs/GPU_COST_ANALYSIS.md

Measured baseline

FRANKENSTALLM Phase 1, measured:
  B200 × 8, MFU 33.5%, 292K tok/s
  38.5B tokens → 63 hours
  Scaled to 60B tokens → about 98 hours
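
The 60B-token figure is a straight linear extrapolation of the measured run. A minimal sketch, assuming the same end-to-end rate holds for the longer run (the helper name is illustrative):

```python
def scale_hours(measured_tokens: float, measured_hours: float, target_tokens: float) -> float:
    """Extrapolate wall-clock time linearly from a measured run."""
    return measured_hours * target_tokens / measured_tokens

hours_60b = scale_hours(measured_tokens=38.5e9, measured_hours=63, target_tokens=60e9)
print(f"60B tokens: about {hours_60b:.0f} hours")
```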

Top 3 cloud price-performance options (60B tokens, after optimization)

| Rank | Configuration | Time required | Total cost |
|---|---|---|---|
| 1 | H100×8 Cudo | 44.8 hr | $645 (~930,000 KRW) |
| 2 | H100×8 Vast.ai | 44.8 hr | $670 (~970,000 KRW) |
| 3 | H100×8 RunPod | 44.8 hr | $713 (~1,030,000 KRW) |
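
The totals in the table correspond to hours × GPUs × hourly rate. The sketch below backs out the implied per-GPU hourly rate from each total. These rates are derived from the table itself, not quoted from any provider's price list, and the helper names are illustrative.

```python
def implied_rate_usd(total_usd: float, hours: float, n_gpus: int) -> float:
    """Per-GPU hourly rate implied by a total run cost."""
    return total_usd / (hours * n_gpus)

def run_cost_usd(hours: float, n_gpus: int, usd_per_gpu_hour: float) -> float:
    """Total cost of a run at a flat per-GPU hourly rate."""
    return hours * n_gpus * usd_per_gpu_hour

# Totals and run time taken from the table above.
table = {"Cudo": 645, "Vast.ai": 670, "RunPod": 713}
for provider, total in table.items():
    rate = implied_rate_usd(total, hours=44.8, n_gpus=8)
    print(f"{provider}: ${rate:.2f} per GPU-hour implied")
```

`run_cost_usd` goes the other way, which is handy for re-pricing the same 44.8-hour run when a provider's rate changes.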

B200 (Blackwell) is faster, but its cloud hourly price is 3× that of H100, so H100 comes out 4.3× cheaper in total cost.

Recommended personal GPU configurations

| Configuration | VRAM | NVLink | Price | Rating |
|---|---|---|---|---|
| A6000 Ada × 2 (used) | 96GB (combined) | ✅ | ~10 million KRW | ⭐⭐⭐⭐⭐ |
| L40S × 2 | 96GB (combined) | ✅ | ~14 million KRW | ⭐⭐⭐⭐ |
| RTX Pro 6000 Blackwell | 96GB (single) | ❌ | ~12 million KRW | ⭐⭐⭐ |

Consumer GPUs (RTX 5090/4090) do not support NVLink. If 80GB+ of combined memory is needed, professional-grade cards are required.

Recommended strategy: local + cloud hybrid

[Local] RTX 4090 × 4 (8.8 million KRW): data preprocessing, experiments, SFT/ORPO
[Cloud] H100×8 (~1.03 million KRW per run): main pretraining only

Closing

This project has one motto:

"We record the failures too."

The loss=0.0 failure of SFT v1, torch.compile having no effect, the frustration of an 18% repetition rate: all of it is on the record. And now the tradition continues in Phase 3 ORPO. After five failures (NCCL timeout, config conflict, QKV conversion bug, port conflict, TRL NaN bug), the 6-config HP sweep is finally running.

Just as Frankenstein stitched pieces together to create life, we are stitching together data and techniques from many sources to build a model that understands and speaks Korean. It is not finished yet, but the process itself is the value of this project.

Phase 1 pretraining finished at 57,000 steps with loss 1.466. Phase 2 SFT early-stopped at 25,500 steps (val_loss 1.8851). It passed 4 of 6 dimensions in the 6-dimension comprehensive evaluation.

The good news: knowledge retention is nearly perfect (forgetting 0.9%). SFT did not destroy the base model's knowledge. The EOS termination rate rose from 0% to 60%. MMLU-KO also improved by +3.2pp.

The disappointing news: the greedy repetition rate is 72.97%. SFT alone did not solve the repetition problem; it actually made it worse (Base 60.99% → SFT 72.97%). However, applying just rep_penalty=1.2 brings the repetition rate to 0%. The model is capable of not repeating. It simply has not learned that as its "default behavior."

Now: Phase 3 ORPO main training is in progress. The 6-config HP sweep has fully completed, and the best config by eval_loss (lr=1.2e-5, beta=0.25) was selected. A throughput benchmark confirmed that batch_size=4 with grad_accum=4 is optimal at 80.63 samples/s, and main training was started on all 8×B200 GPUs. It runs for ~9,840 steps, with an estimated duration of ~4.8 hours. When training completes, the watchdog automatically runs the 10-dimension comprehensive evaluation (a 3-way comparison of Base vs SFT vs ORPO).

Can ORPO pull the greedy repetition rate below 5%?

The answer is coming soon. Once training finishes, the 6-dimension re-evaluation will run, and if the model passes, it will be converted to GGUF and run on Ollama. A 3B model that understands and speaks Korean, built from scratch.


Last updated: 2026-03-09 | Current status: Phase 3 ORPO main training in progress (lr=1.2e-5, beta=0.25, step ~1,660/9,840, 17%); the 10-dimension comprehensive evaluation is queued to run automatically when training completes