# FRANKENSTALLM
![Phase 3](https://img.shields.io/badge/Phase_3_ORPO-본_학습_중-blue)
![Model](https://img.shields.io/badge/Model-3B_Korean_LLM-green)
![GPU](https://img.shields.io/badge/GPU-8×_NVIDIA_B200-76b900)
![FP8](https://img.shields.io/badge/Precision-MXFP8-orange)
![SFT](https://img.shields.io/badge/SFT-완료_val__loss_1.8851-success)
![Status](https://img.shields.io/badge/Status-ORPO_Full_Training-ff9900)
> **Building a Korean 3B LLM from scratch on 8× NVIDIA B200.**
> Stitched together piece by piece like Frankenstein, tempered hard like steel.
GitHub: [`pathcosmos/FRANKENSTALLM`](https://github.com/pathcosmos/FRANKENSTALLM)
---
## Table of Contents
1. [Why This Project](#1-why-this-project)
2. [Current Status at a Glance](#2-current-status-at-a-glance)
3. [Hardware Environment](#3-hardware-environment)
4. [Project Structure](#4-project-structure)
5. [Project Journey Timeline](#5-project-journey-timeline)
6. [Model Architecture](#6-model-architecture)
7. [Training Data](#7-training-data)
8. [Training Configuration and Optimization](#8-training-configuration-and-optimization)
9. [Experimental Results: 1B Baseline](#9-experimental-results-1b-baseline)
10. [Experimental Results: 3B Base Comprehensive Evaluation (v2)](#10-experimental-results-3b-base-comprehensive-evaluation-v2)
    - [10.1 Training Curves](#101-training-curves)
    - [10.2 PPL (Perplexity): 19 Datasets](#102-ppl-perplexity-19-datasets)
    - [10.3 Korean Benchmarks](#103-korean-benchmarks)
    - [10.4 English Benchmarks](#104-english-benchmarks)
    - [10.5 Calibration](#105-calibration)
    - [10.6 0-shot vs 5-shot Comparison](#106-0-shot-vs-5-shot-comparison)
    - [10.7 Reference Model Comparison](#107-reference-model-comparison)
    - [10.8 Generation Quality and Parameter Grid Search](#108-generation-quality-and-parameter-grid-search)
    - [10.9 Evaluation Pipeline](#109-evaluation-pipeline)
11. [Experimental Results: 3B SFT Comprehensive Evaluation](#11-experimental-results-3b-sft-comprehensive-evaluation)
    - [11.1 SFT Training Results](#111-sft-training-results)
    - [11.2 Six-Dimension Evaluation Summary](#112-six-dimension-evaluation-summary)
    - [11.3 Base vs SFT Comparison](#113-base-vs-sft-comparison)
    - [11.4 Code Improvements](#114-code-improvements)
    - [11.5 ORPO Go/No-Go Decision](#115-orpo-gono-go-decision)
12. [Phase 3: ORPO (Preference Alignment)](#12-phase-3-orpo-preference-alignment)
    - [12.1 Why ORPO](#121-why-orpo)
    - [12.2 Data](#122-data)
    - [12.3 HP Sweep Design](#123-hp-sweep-design-6-config)
    - [12.4 Attempt History](#124-attempt-history-5-failures)
    - [12.5 Sweep Results](#125-sweep-results-in-progress)
    - [12.7 ORPO Main Training](#127-orpo-main-training-in-progress-2026-03-09)
    - [12.8 ORPO Comprehensive Evaluation Pipeline](#128-orpo-comprehensive-evaluation-pipeline)
13. [How to Run](#13-how-to-run)
14. [Roadmap](#14-roadmap)
15. [Reference Documents](#15-reference-documents)
16. [Tech Stack Summary](#16-tech-stack-summary)
17. [Related Projects](#related-projects)
18. [Next Optimization Plan](#18-next-optimization-plan-mfu-335-47-target)
19. [GPU Hardware & Cost Analysis](#19-gpu-hardware-cost-analysis-3b-60b-pretrain)
---
## 1. Why This Project
The Korean LLM ecosystem is growing quickly. However, most public models either layer Korean fine-tuning on top of English-centric pretraining, or keep their training process private, which makes them impossible to reproduce.
This project is different.
- **From scratch**: every stage is implemented directly, from tokenizer training through pretraining, SFT, and preference alignment.
- **Fully open builder's log**: not only the successes are recorded. Bugs, failures, misjudgments, and the root-cause analysis behind them are all written down.
- **A practical scale**: neither a toy model for an academic paper (125M) nor a 70B model that only a research lab can reproduce. The target is a practical Korean model at the **3B scale**.
- **B200 optimization**: the FP8 Tensor Cores, NVLink 5.0, and FlashAttention-2 on the NVIDIA B200 are used as fully as possible. Squeezing the most out of the latest hardware is itself part of the learning.
This README is not the announcement of a finished product. It is **the log of a build that is still in progress**.
---
## 2. Current Status at a Glance
```
As of 2026-03-09
```
| Phase | Status | Details |
|------|------|-----------|
| Phase 0: Foundation | ✅ Done | OOM fix, GQA FA optimization, NCCL NVLS, pipeline preparation |
| Phase 1: 3B Pretrain | ✅ Done | 57,000 steps, loss 1.466, ~63 hours |
| Phase 2: SFT | ✅ Done | 25,500 steps (early stop), val_loss 1.8851, ~15.5 hours |
| Phase 2.5: SFT evaluation | ✅ Done | Six-dimension evaluation 4/6 PASS, decision to proceed to ORPO |
| Phase 3: ORPO Sweep | ✅ Done | 6-config sweep finished, best: lr=1.2e-5, beta=0.25 |
| **Phase 3: ORPO main training** | **🔄 In progress** | **630K pairs, 2 epochs, ~9,840 steps, ~4.8 hours** |
| Phase 4: Deployment | 📋 Pending | GGUF conversion → Ollama serving |
### Phase 2 (SFT) Final Results
| Item | Value |
|------|-----|
| Final step | **25,500 / 33,000** (77.3%, early stopping) |
| **Val loss (best)** | **1.8851** (step 23,000) |
| Training time | **~15 h 41 min** (2026-03-05 22:15 ~ 2026-03-06 13:56) |
| VRAM usage | **24.2GB** / 183GB per GPU (13.2%) |
| Base model | checkpoint-0057000 (pretrain loss 1.466) |
| SFT data | **2,439,397 samples** (24 sources, 7.48 GB) |
| Incidents | 0 (no OOM, NCCL, or NaN events) |
**SFT val loss over the full run**:
```
Step 500: 2.073
Step 2,000: 1.956 (-0.117)
Step 5,000: 1.911 (-0.045)
Step 10,000: 1.892 (-0.019)
Step 15,000: 1.886 (-0.006)
Step 20,000: 1.885 (-0.001)
Step 23,000: 1.8851 ← BEST
Step 25,500: 1.8851 → Early Stop (patience 5/5)
```
### SFT Six-Dimension Evaluation Summary
| Dimension | Result | Key metric |
|------|------|-----------|
| Perplexity (knowledge retention) | **PASS** | forgetting 0.9% |
| Generation quality | **FAIL** | Greedy repetition rate 72.97% |
| Korean benchmarks | **FAIL** | KoBEST average 43.26% |
| English benchmarks | **PASS** | All tasks above the lower bound |
| Calibration | **PASS** | Top-1 68.59% |
| SFT chat ability | **PASS** | EOS termination rate 60% (Base 0%) |
> **Verdict: proceed to ORPO.** Knowledge retention is excellent (0.9% forgetting); the repetition rate is to be addressed through preference alignment.
> Details: `reports/2026-03-06_3B_SFT_COMPLETION_AND_EVAL_SUMMARY.md`
---
## 3. Hardware Environment
### GPU
| Item | Spec |
|------|------|
| Model | 8× NVIDIA B200 |
| VRAM | 183GB HBM3e per GPU (~1.47TB total) |
| FP8 Tensor Core | 2,250 TFLOPS/GPU (18,000 TFLOPS total) |
| BF16 | 1,125 TFLOPS/GPU |
| HBM3e bandwidth | ~7.67 TB/s per GPU |
| Interconnect | NVLink 5.0 (900 GB/s bidirectional per GPU) |
| Topology | NVSwitch: single-hop all-to-all mesh between every GPU pair |
| Power | 940W measured / 1000W cap |
The B200 supports FP8 natively. Training combines `torch.float8_e4m3fn` with TransformerEngine's MXFP8 recipe. Theoretical compute throughput is 2× that of BF16, and memory efficiency improves as well.
### CPU and System Memory
| Item | Spec |
|------|------|
| CPU | 2× AMD EPYC 9365 (Turin / Zen 5) |
| Physical cores | 72 (36 cores × 2 sockets) |
| NUMA layout | 2 nodes: node0 (core 0-35) / node1 (core 36-71) |
| GPU↔NUMA mapping | GPU 0-3 → NUMA node 0, GPU 4-7 → NUMA node 1 |
| RAM | 2.21TB DDR5 (~2.03TB free) |
| L3 cache | 384MB (12 CCX × 32MB) |
**NUMA caveat**: on the initial DDP launch, 5 of 8 ranks ran on the wrong NUMA node, and 69% of DataLoader workers were cross-NUMA. NUMA affinity optimization has not been applied yet (it is a roadmap item).
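A minimal sketch of what per-rank NUMA pinning could look like, using only the GPU-to-NUMA mapping from the table above. The helper names `cores_for_local_rank` and `pin_current_process` are hypothetical and not part of this repository, and `os.sched_setaffinity` is Linux-only.

```python
import os

CORES_PER_NODE = 36   # from the table: 36 cores per socket
GPUS_PER_NODE = 4     # GPU 0-3 -> NUMA node 0, GPU 4-7 -> NUMA node 1


def cores_for_local_rank(local_rank: int) -> set[int]:
    """Return the CPU cores on the NUMA node that hosts this rank's GPU."""
    node = local_rank // GPUS_PER_NODE
    start = node * CORES_PER_NODE
    return set(range(start, start + CORES_PER_NODE))


def pin_current_process(local_rank: int) -> None:
    """Pin the calling process; child processes forked later inherit the mask."""
    os.sched_setaffinity(0, cores_for_local_rank(local_rank))
```

Calling `pin_current_process(local_rank)` early in each rank, before DataLoader workers are created, would keep those workers on the same node as the GPU they feed.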
### Storage
| Path | Purpose | Free space |
|------|------|-----------|
| `/PROJECT/0325120031_A/ghong/taketimes/llm-bang/` | Main workspace (checkpoints, data) | 2.2TB |
| `/home/ghong/` | Small code only | 5GB (quota) |
> **Note**: checkpoints (tens of GB), training data (82GB+), and intermediate artifacts all go under the `/PROJECT/...` path. Otherwise the home directory risks exceeding its quota.
### Software Environment
| Package | Version |
|--------|------|
| PyTorch | `2.10.0a0+b4e4ee81d3.nv25.12` (NVIDIA custom build) |
| FlashAttention | 2.7.4.post1+25.12 |
| TransformerEngine | 2.10.0 |
| NCCL | 2.28.9 |
| Triton | 3.5.1 |
| CUDA | 13.1 |
| Driver | 580.95.05 |
> **Warning**: PyTorch here is a custom build optimized for the NVIDIA B200. Reinstalling with `pip install torch` breaks the B200 optimizations. **Never reinstall it.**
---
## 4. Project Structure
```
llm-bang/
├── CLAUDE.md                        # Claude Code guide
├── README.md                        # This file
├── PROGRESS.md                      # Progress record (log by date)
├── Modelfile.3b                     # Ollama model file
│
├── configs/
│   ├── korean_3b_fp8.yaml           # 3B FP8 training config (currently in use)
│   ├── 3b_pretrain.yaml             # 3B pretrain config (alternative)
│   ├── korean_1b_fp8.yaml           # 1B FP8 config (archived)
│   ├── korean_3b_sft.yaml           # 3B SFT v1 config (completed)
│   ├── korean_3b_sft_v2.yaml        # 3B SFT v2 config (lr=5e-5, data mixing)
│   ├── korean_3b_orpo.yaml          # 3B ORPO config (lr=5e-6, beta=0.1)
│   ├── hybrid_3b.yaml               # Hybrid 3B (Mamba-2 + Attention)
│   ├── small_fp8.yaml               # 125M FP8 validation
│   ├── medium.yaml                  # Medium model config
│   └── small.yaml                   # Small model config
│
├── data/
│   ├── 3b_train.bin                 # Pretraining data (82GB, 41.12B tokens)
│   ├── 3b_val.bin                   # Validation data (151MB)
│   ├── cc100_ko_train.bin           # CC100 Korean (4.5GB)
│   ├── cosmo_auto_math_text_train.bin  # Math text (2.6GB)
│   └── build scripts, __init__.py
│
├── model/
│   ├── attention.py                 # GQA FlashAttention (Phase 0 optimizations applied)
│   ├── transformer.py               # Main transformer architecture
│   ├── config.py                    # Model config dataclass
│   └── layers.py                    # Custom layers (RMSNorm, SwiGLU, etc.)
│
├── train/
│   ├── pretrain.py                  # Pretraining script (DDP optimized)
│   ├── sft.py                       # SFT training
│   ├── orpo.py                      # ORPO training
│   ├── trainer.py                   # Unified trainer (loss sync optimized)
│   └── utils.py                     # Utilities (NCCL 7200s timeout, etc.)
│
├── scripts/
│   ├── launch_3b_pretrain.sh        # 3B pretrain launcher (includes NCCL env vars)
│   ├── launch_3b_sft.sh             # 3B SFT v1 launcher
│   ├── launch_3b_sft_v2.sh          # 3B SFT v2 launcher (data mixing)
│   ├── launch_3b_orpo.sh            # 3B ORPO launcher
│   ├── monitor_3b.sh                # Real-time training monitor
│   ├── training_watchdog.sh         # Watchdog (10-minute interval, cron)
│   ├── convert_3b_gguf.sh           # GGUF conversion script
│   ├── deploy_3b_ollama.sh          # Ollama deployment
│   ├── quality_gate.sh              # Pre-deployment quality gate
│   ├── telegram_notify.py           # Telegram notifications (uses urllib; curl is blocked)
│   └── hourly_status.sh             # Hourly status report
│
├── eval/
│   ├── debate/
│   │   └── justice_league_3b_case.md   # Case for switching to 3B (Justice League multi-agent)
│   ├── decision/
│   │   └── FINAL_DECISION_REPORT.md    # Verdict on restarting SFT
│   ├── plan/
│   │   └── 3B_MASTER_PLAN.md           # 3B master plan
│   ├── tasks/                       # Modular evaluation tasks
│   │   ├── task_runner.py           # 8-GPU parallel task runner
│   │   ├── ppl_task.py              # Perplexity evaluation task
│   │   ├── lm_eval_task.py          # lm-evaluation-harness wrapper
│   │   ├── calibration_task.py      # Calibration analysis
│   │   ├── generation_task.py       # Generation quality + parameter grid search
│   │   └── token_nll_task.py        # Token NLL distribution analysis
│   ├── outputs/                     # Evaluation results (auto-generated, .gitignore)
│   ├── full_eval_pipeline.py        # v2 comprehensive evaluation pipeline (8-GPU parallel)
│   ├── sft_eval_pipeline.py         # SFT six-dimension evaluation pipeline
│   ├── reeval_pipeline.py           # Re-evaluation pipeline (0-shot then 5-shot)
│   ├── report_generator.py          # Automatic Markdown report generation
│   ├── comprehensive_eval.py        # v1 comprehensive evaluation (legacy)
│   └── test_generation_params.py    # Generation parameter search
│
├── tokenizer/
│   ├── korean_sp/                   # SentencePiece 64K model files
│   ├── tokenizer.json               # HuggingFace format (2.4MB)
│   ├── train_sp_tokenizer.py        # Tokenizer training script
│   └── convert_sp_to_hf.py          # SentencePiece → HF conversion
│
├── checkpoints/                     # Model checkpoints (large, .gitignore)
│
├── docs/
│   ├── PROJECT_HISTORY.md           # Detailed record of the whole project journey
│   └── 3B_WORKPLAN.md               # 3B work plan
│
└── reports/
    ├── 2026-03-02_0200_FRANKENSTALLM_phase0_optimization_report.md
    ├── 2026-03-05_3B_BASE_EVALUATION_REPORT.md
    ├── 2026-03-05_3B_SFT_PROGRESS_REPORT.md              # SFT training report (Phase 2)
    ├── 2026-03-05_3B_NEXT_STEPS_REFERENCE.md
    ├── 2026-03-05_NEMOTRON_NANO_FEASIBILITY_STUDY.md
    ├── 2026-03-05_PPL_EVALUATION.md
    ├── 2026-03-05_BENCHMARK_RESULTS.md
    ├── 2026-03-05_GENERATION_QUALITY.md
    ├── 2026-03-06_3B_SFT_EVAL_PLAN.md                    # SFT six-dimension evaluation plan
    ├── 2026-03-06_3B_SFT_EVALUATION_REPORT.md            # SFT six-dimension evaluation results
    └── 2026-03-06_3B_SFT_COMPLETION_AND_EVAL_SUMMARY.md  # SFT completion + code improvement summary
```
---
## 5. Project Journey Timeline
This section is the core of this README. It records honestly not just the results but **why** each decision was made and **where** things failed.
---
### Day 1 (Feb 25): First Spark, 125M FP8 Validation
The project started from a small question: does FP8 training actually run stably on the B200?
TransformerEngine's MXFP8 recipe was applied to a small 125M model to find out. The conclusion: **it works stably**. Loss convergence was normal, and VRAM efficiency improved clearly over BF16. This validation was the first green light for the whole pipeline.
The infrastructure setup was finished the same day. The 8-GPU DDP environment, NCCL environment variables, checkpoint paths, and a first draft of the Telegram notification system were all in place by the end of the day.
---
### Day 1~2 (Feb 25~26): 1B Pretrain, 34K Steps, PPL 5.67
Right after the 125M validation, pretraining of the 1B model began.
- **Architecture**: d_model=2048, 24 layers, GQA 4:1, SwiGLU, RoPE
- **Data**: based on C4 Korean
- **Training**: 34,000 steps, FP8, 8× B200 DDP
Final results:
- **Loss: 1.904**
- **PPL (C4 Korean): 5.67**
On the numbers alone this looks passable. Actual text generation showed the problems, though: repeated patterns, awkward sentence structure, and drifting off context. That is expected from a pretrained-only model. Next up was SFT.
---
### Day 2 (Feb 26): SFT v1, the 0.0 Disaster
SFT was launched. As soon as training began, the loss started dropping fast. At first this looked like a good sign.
Then the loss reached **0.0**.
Val loss was 0.0 too. The generated output was complete garbage.
The cause turned out to be a **label off-by-one bug**. The input tokens and label tokens were misaligned by one position. Instead of predicting the next token, the model was set up to reproduce an answer it had already been given. A loss of 0 was not "perfect learning" but **label leakage**.
A full day was lost.
---
### Day 3 (Feb 27): Five Bugs, Root-Cause Analysis
To understand the failure, a **5-agent root-cause analysis** was carried out. The conclusion was that there was not just one bug. The whole SFT pipeline had problems.
The five key bugs found:
| Bug | Symptom | Impact |
|------|------|------|
| Static padding (no packing) | Even short samples padded to max_len | Wasted GPU, inefficient training |
| EOS token truncation | No EOS at the end of responses | Model never learns where a response ends |
| Single epoch | Data seen only once | Underfitting |
| No validation split | val_loss cannot be measured | Overfitting cannot be detected |
| Data quality | Noise, duplicates, imbalance | Induces repetitive generation |
The EOS truncation bug is especially subtle. If the model never learns when a response ends, at generation time it keeps repeating the same pattern or appending meaningless tokens. It was one of the causes of the 18% repetition rate.
---
### Day 3 (Feb 27): SFT v2, a Success but 18% Repetition
All five bugs were fixed and SFT v2 was run.
- **val_loss: 2.2062**, a reasonable level
- **Repetition rate: 18%** (after applying rep_penalty=1.1)
Generation quality improved clearly over v1. But an 18% repetition rate is still high. Raising `rep_penalty` reduces repetition but also reduces diversity and makes the output awkward. Decoding parameters alone run into a structural limit here.
kobest_copa came in at 0.646. A decent number, but short of the target.
---
### Day 3 (Feb 27): "Justice League vs Avengers", the Decision to Move to 3B
The 18% repetition rate triggered an internal debate. There was one key question:
> **Can ORPO fix the repetition, or do we have to go to 3B?**
To answer it, a **multi-agent debate** was run (codename: "Justice League vs Avengers"). Each agent argued a different position.
Key findings from the debate:
1. **18% repetition is a structural limit of 1B parameters.** A 1B model does not capture long-range dependencies well enough. Preference alignment such as ORPO helps reduce repetition somewhat, but it does not address the root cause (too few parameters).
2. **Scaling-law analysis**: based on the Chinchilla law and experimental data, the estimate was that a 3B model could bring the repetition rate down to 5~8% on the same data.
3. **Cost-benefit analysis**: investing in 3B pretraining beats investing ORPO effort in the 1B model in terms of final model quality.
**Conclusion: move to 3B.** The 1B model is archived and 3B pretraining begins.
The decision, with the full argument, is recorded in `eval/debate/justice_league_3b_case.md`.
---
### Day 3 (Feb 27): Assembling 640GB+ of Data
As soon as the move to 3B was decided, the data pipeline was started. Far more data is needed than for 1B (Chinchilla-optimal ratio: 3B parameters × 20 = 60B tokens).
The data finally assembled:
- **Total tokens**: 41.12B tokens (final binary file)
- **Raw data**: 640GB+ of multilingual text
- **Sources**: C4 Korean, Namuwiki, Wikipedia Korean, the korean_extra datasets
After preprocessing (tokenization, shuffling, binary conversion), `data/3b_train.bin` is 82GB. The validation set `data/3b_val.bin` is 151MB.
---
### Mar 2: Phase 0, Defeating OOM and Optimizing
The first attempt at 3B training hit OOM (Out of Memory). A 3B model running out of memory on 183GB of VRAM looks odd, but there was a reason.
It was a **problem in the GQA FlashAttention implementation**. The way GQA (Grouped-Query Attention) expanded the KV tensors was copying memory unnecessarily. FlashAttention's native GQA support was not being used properly.
Optimizations carried out in Phase 0:
| Optimization | Method | Effect |
|--------|------|------|
| GQA FA Native | Use the native GQA path of `flash_attn_varlen_func` | VRAM 60.4GB → 48.3GB (**-20%**) |
| DDP optimization | `gradient_as_bucket_view=True` | GPU-CPU sync overhead -87.5% |
| NCCL NVLS | Ring+Tree topology, NVLS enabled | Better AllReduce efficiency |
| Batch size analysis | Identified GPUs 2, 4, 6 as NCCL relay nodes | bs=5 optimal, bs=6 judged risky |
| SIGHUP defense | nohup+setsid + Python signal handler + emergency ckpt | Triple protection |
| Monitoring | Telegram Bot (B200Bot) + cron | 10-minute watchdog, hourly status report |
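As an illustration of the "Python signal handler + emergency ckpt" layer in the table above, here is a minimal sketch. It is not the project's actual code; `StopFlag`, `install_handlers`, `train_loop`, and the `save_emergency_checkpoint` callback are hypothetical names. The handler only sets a flag, and the training loop does the saving at a safe point.

```python
import signal


class StopFlag:
    """Records that a termination signal arrived."""

    def __init__(self):
        self.requested = False
        self.signum = None

    def handle(self, signum, frame):
        # Keep the handler tiny: heavy work (saving) belongs in the main loop.
        self.requested = True
        self.signum = signum


def install_handlers(flag, signals=(signal.SIGHUP, signal.SIGTERM)):
    """Must be called from the main thread of the training process."""
    for sig in signals:
        signal.signal(sig, flag.handle)


def train_loop(flag, num_steps, save_emergency_checkpoint):
    for step in range(num_steps):
        # ... forward / backward / optimizer step would go here ...
        if flag.requested:
            save_emergency_checkpoint(step)
            return step
    return num_steps
```

`nohup` and `setsid` keep the signal from reaching the process in the first place; a handler of this kind is the fallback for when it arrives anyway.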
**torch.compile test**: no effect (1.00x). TransformerEngine's opaque kernels cause graph breaks, and the `/tmp` directory is mounted with the noexec flag, so the compiled kernel cache was never used. The time was effectively wasted, but confirming "no effect" by measurement is also a result.
**Why bs=5**: in the NCCL ring topology, GPUs 2, 4, and 6 act as relay nodes. These GPUs use about 11GB more than the others. There is headroom at bs=5, but at bs=6 the relay GPUs get too close to the 183GB limit. bs=5 is kept as a safety margin.
---
### Mar 2~Mar 5: Phase 1, 3B Pretraining Completed
Phase 1 started once the Phase 0 optimizations were done.
Early metrics (step 3150):
- Loss: 2.38
- Throughput: 36K tok/s per rank
- Whole system: ~292K tok/s (8 GPUs)
- MFU: ~33.5%
An MFU of 33.5% may look low at first. But it was measured with TE MXFP8 already tuned, and it is the effective utilization against the theoretical FP8 peak (18,000 TFLOPS). Remaining optimization headroom: QKV fusion (+8~12%), NUMA affinity (+4~9%), and FA2 native RoPE (+3~5%).
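The README does not state how MFU was computed, so the following is a rough sketch using a common approximation (6 FLOPs per parameter per token for the weight matmuls, plus an attention term). The parameter count is rounded to 3.0e9 and the function names are hypothetical.

```python
def training_flops_per_token(n_params, n_layers, d_model, seq_len):
    """Approximate forward+backward FLOPs per token for a decoder-only model.

    6 * n_params covers the weight matmuls; 12 * n_layers * d_model * seq_len
    is a widely used approximation for the attention score/value matmuls.
    """
    return 6 * n_params + 12 * n_layers * d_model * seq_len


def mfu(tokens_per_sec, flops_per_token, peak_flops):
    """Model FLOPs utilization: achieved FLOP/s divided by hardware peak."""
    return tokens_per_sec * flops_per_token / peak_flops


flops_tok = training_flops_per_token(n_params=3.0e9, n_layers=28, d_model=3072, seq_len=2048)
estimate = mfu(tokens_per_sec=292_000, flops_per_token=flops_tok, peak_flops=18_000e12)
```

Under these assumptions the estimate comes out at roughly 0.33, in the same range as the reported ~33.5%. The small gap is expected, since the project's own accounting may count parameters or attention FLOPs differently.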
**Phase 1 completed (2026-03-05)**:
- **57,000 steps completed**, final loss **1.466**
- 41.12B tokens processed, about 63 hours of total training time
- Finished without incident (no SIGHUP, OOM, or NCCL anomalies)
Summary of the comprehensive evaluation (reflecting the v2 re-evaluation):
| Item | Result |
|------|------|
| PPL (combined validation set) | 5.2263 (initial v1 evaluation: 5.709) |
| PPL (C4 Korean) | 5.717 |
| KoBEST average (5 tasks) | 43.69% |
| MMLU-KO average (6 categories) | 22.75% |
| HAE-RAE | 19.71% |
| winogrande / piqa | 50.59% / 52.50% |
| Calibration Top-1 | 68.75% |
| Greedy 3-gram repetition rate | 60.99% (expected to improve after SFT) |
| Best generation parameters | temp=0.7, rep_penalty=1.3 → repetition rate 0% |
**Decision to proceed to SFT**: a loss of 1.466 signals a healthy, completed pretraining run. PPL, repetition rate, and benchmarks are all areas SFT is meant to address. No sign of a model-architecture problem. → Proceed to Phase 2 SFT.
---
### Mar 5~: Phase 2, 3B SFT Begins (2.44M Samples, val_loss 1.956)
Right after Phase 1 finished, a large-scale SFT dataset was prepared and training started.
**Data pipeline**:
- 6.59M raw samples collected from **24 sources**
- `prepare_sft_combined.sh`: format unification (6 formats → messages), MD5 deduplication, 98:2 split
- `filter_sft_v2.py`: 5-stage quality filter (EOS strip, QA marker removal, length filter, 4-gram repetition filter)
- Final: **2,439,397 train + 49,801 val** (7.48 GB)
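The README names a 4-gram repetition filter but does not define it. One plausible definition, shown here purely as a sketch, is the share of n-grams that duplicate an earlier n-gram in the same sample. The whitespace tokenization and the 0.3 threshold are illustrative assumptions, not values from `filter_sft_v2.py`.

```python
def ngram_repetition_rate(tokens, n=4):
    """Fraction of n-grams that repeat an n-gram seen earlier in the sequence."""
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(total)]
    return 1.0 - len(set(ngrams)) / total


def keep_sample(text, n=4, max_rate=0.3):
    """Keep a sample only if its n-gram repetition rate is at most max_rate.

    Whitespace splitting and max_rate=0.3 are assumptions for illustration.
    """
    return ngram_repetition_rate(text.split(), n) <= max_rate
```

The same function with `n=3` would give a metric in the spirit of the "greedy 3-gram repetition rate" reported in the evaluation tables, though the project's exact formula may differ.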
The data mix was balanced across reasoning/CoT (38%), Korean instructions (22.5%), general-purpose English (16%), math (12%), and dialogue/code (11.5%). This is a **15× increase** over the 161K samples used for the 1B SFT.
**SFT design, applying the lessons from the 1B failures**:
| 1B lesson | Applied in 3B SFT |
|---------|-------------|
| Label off-by-one → loss=0 | Loss masking verified (prompt=-1, train on the response only) |
| EOS truncation → cannot terminate | Chat template `<\|user\|>...<\|assistant\|>...</s>` includes EOS |
| Static padding → wasted GPU | Dynamic padding (aligned to 64 tokens) |
| No validation → overfitting undetected | 49,801 val samples, eval every 500 steps |
| Data noise | 5-stage quality filter (absent in 1B) |
| 18% repetition rate | Added **NEFTune alpha=5.0** (embedding noise injection) |
**Training configuration**:
- LR: **1e-5** (1/15 of the pretraining LR, to prevent catastrophic forgetting)
- Effective batch: 2 × 8 GPUs × 4 accum = 64 sequences
- 33,000 steps (~3.3 epochs)
- MXFP8, gradient checkpointing, NCCL Ring+Tree
**Early results** (step 2,000, 6%):
- Val loss: 2.073 → 2.004 → 1.975 → **1.956** (monotonic decrease)
- Train-val gap ~0.1 (no sign of overfitting)
- VRAM 24.2 GB (13.2%), half of pretraining and very stable
- Grad norm constant at 1.0 (learning rate is appropriate)
Detailed report: `reports/2026-03-05_3B_SFT_PROGRESS_REPORT.md`
---
### Mar 6: Phase 2 Completed, SFT Early Stopping (val_loss 1.8851)
SFT ended by early stopping at **25,500 steps** out of 33,000. Val loss reached 1.8851 at step 23,000, and training stopped automatically after five consecutive evaluations without improvement.
**Total training time**: ~15 h 41 min (2026-03-05 22:15 ~ 2026-03-06 13:56)
This is consistent with the cosine decay from LR 1e-5 having effectively converged to 0 after step 20K. The model learned as much as it could under the given LR schedule.
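The stop at step 25,500 follows directly from the numbers above: evaluation every 500 steps, best value at step 23,000, patience 5. A minimal patience-based early stopper, written here as a sketch rather than the project's `trainer.py` logic, reproduces that arithmetic.

```python
class EarlyStopping:
    """Stop after `patience` consecutive evaluations without improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_step = None
        self.bad_evals = 0

    def update(self, step, val_loss):
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best, self.best_step, self.bad_evals = val_loss, step, 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

With a best value at step 23,000 and flat losses afterwards, the five non-improving evaluations land at steps 23,500, 24,000, 24,500, 25,000, and 25,500, which is where the run ended.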
---
### Mar 6: SFT Six-Dimension Evaluation, 4/6 PASS → ORPO Decision
A six-dimension comprehensive evaluation was run on the SFT checkpoint (`checkpoint-best`, step 23000). It took 49 min 27 s.
**Key results**:
- **Perplexity**: forgetting 0.9% (all 19 datasets PASS), excellent knowledge retention
- **Repetition rate**: greedy 72.97% (**worse** than Base at 60.99%), FAIL
- **EOS termination rate**: 0% → 60%, improved but below the 90% target
- **KoBEST**: 43.26% (nearly identical to Base at 43.69%), FAIL
- **MMLU-KO**: 22.75% → 26.00% (+3.2pp), partial improvement
- **Calibration**: Top-1 68.59%, PASS
**Decision**: a greedy repetition rate of 72.97% cannot be fixed by SFT alone. However, `rep_penalty=1.2` brings the repetition rate to 0%, so internalizing that behavior through ORPO (preference alignment) is the right path.
---
### Mar 6: Code Improvements and ORPO Preparation
In parallel with the SFT evaluation, a number of code improvements and Phase 3 preparations were completed:
| Change | What | Impact |
|------|------|------|
| `train/sft.py` +238 lines | MixingDataLoader (SFT+pretrain interleaving), tokenization on DDP rank 0 | Prevents forgetting, 8× memory reduction |
| `train/trainer.py` +17 lines | DDP early stopping broadcast (prevents hangs), patience 5→10 | DDP stability |
| `train/orpo.py` +30 lines | YAML config support, 3B defaults | Ready to run ORPO |
| `eval/report_generator.py` +831 lines | Automatic Base vs SFT comparison report | Evaluation automation |
| `eval/sft_eval_pipeline.py` new | SFT six-dimension evaluation pipeline | Comprehensive evaluation |
| `eval/tasks/generation_task.py` +75 lines | Chat template, diversity metrics | SFT evaluation |
| `configs/korean_3b_sft_v2.yaml` new | SFT v2 config (lr=5e-5, data mixing 70/30) | Fallback path |
| `configs/korean_3b_orpo.yaml` new | ORPO config (lr=5e-6, beta=0.1) | Phase 3 |
Details: `reports/2026-03-06_3B_SFT_COMPLETION_AND_EVAL_SUMMARY.md`
---
## 6. Model Architecture
### 1B (archived)
| Item | Value |
|------|-----|
| vocab_size | 64,000 |
| d_model | 2,048 |
| n_layers | 24 |
| n_heads | 16 |
| n_kv_heads | 4 (GQA 4:1) |
| d_ffn | 5,461 (SwiGLU) |
| Parameters | ~1.19B |
| context | 2,048 |
| rope_theta | 500,000 |
### 3B (current)
| Item | Value |
|------|-----|
| vocab_size | 64,000 |
| d_model | 3,072 |
| n_layers | 28 |
| n_heads | 24 |
| n_kv_heads | 8 (GQA 3:1) |
| d_ffn | 8,192 (SwiGLU) |
| Parameters | ~3.0B |
| context | 2,048 |
| rope_theta | 500,000 |
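The parameter counts in the two tables can be cross-checked from the other rows. The sketch below assumes tied input/output embeddings and no bias terms, which the README does not state; under those assumptions the totals land on the quoted ~1.19B and ~3.0B. The function name is hypothetical.

```python
def count_params(vocab, d_model, n_layers, n_heads, n_kv_heads, d_ffn,
                 tied_embeddings=True):
    """Rough parameter count for a GQA + SwiGLU decoder without biases."""
    head_dim = d_model // n_heads
    attn = 2 * d_model * d_model                   # Q and output projections
    attn += 2 * d_model * n_kv_heads * head_dim    # K and V projections (GQA)
    ffn = 3 * d_model * d_ffn                      # SwiGLU: gate, up, down
    norms = 2 * d_model                            # two RMSNorm weights per layer
    embed = vocab * d_model
    total = n_layers * (attn + ffn + norms) + embed + d_model  # + final norm
    if not tied_embeddings:
        total += embed                             # separate output head
    return total


params_1b = count_params(64_000, 2048, 24, 16, 4, 5461)
params_3b = count_params(64_000, 3072, 28, 24, 8, 8192)
```

With untied embeddings the 3B total would rise by another `64,000 × 3,072` parameters, so the "~3.0B" figure is most consistent with the tied assumption.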
### Shared Design Principles
| Component | Choice | Reason |
|----------|------|------|
| Normalization | Pre-norm RMSNorm | More stable training than post-norm |
| Activation | SwiGLU FFN | Proven choice in the Llama family |
| Positional encoding | RoPE (θ=500K) | Leaves room for long-context extension |
| Attention | GQA (Grouped-Query Attention) | Reduces KV cache memory |
| Implementation | FlashAttention-2 | IO-aware, VRAM efficient |
| Precision | FP8 (MXFP8 via TransformerEngine) | Makes best use of the B200 |
### Rationale for the GQA Ratio
The 1B model uses GQA 4:1 (16 heads, 4 KV heads) and the 3B model uses GQA 3:1 (24 heads, 8 KV heads). The ratio was relaxed slightly for 3B because, as the parameter count grows, giving up attention quality was judged to be a net loss at the 3B scale. Mistral 7B and Llama 3 8B, which both use 8 KV heads (32 query heads, a 4:1 ratio), were used as references.
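To make the KV cache saving concrete, the sketch below compares the per-sequence cache size of the 3B configuration with 8 KV heads against a hypothetical full multi-head variant with 24 KV heads. It assumes a 2-byte cache dtype (BF16/FP16) and `head_dim = 3072 / 24 = 128`; the function name is illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """K and V for every layer, for one sequence.

    bytes_per_elem=2 assumes a BF16/FP16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


gqa = kv_cache_bytes(n_layers=28, n_kv_heads=8, head_dim=128, seq_len=2048)
mha = kv_cache_bytes(n_layers=28, n_kv_heads=24, head_dim=128, seq_len=2048)
```

Under these assumptions the GQA 3:1 cache is 224 MiB per 2,048-token sequence, one third of the 672 MiB a 24-KV-head layout would need.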
### What rope_theta=500,000 Means
Raising θ from the standard RoPE value of 10,000 to 500,000 is meant to reduce frequency interference at long context lengths. Code Llama (θ=1,000,000) and Llama 3 (θ=500,000) take the same approach of raising the base. With max_seq_len=2048 today the effect is hard to observe right away, but it lays the groundwork for later context-extension fine-tuning.
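A short sketch of why a larger θ matters: RoPE rotates each pair of dimensions at frequency `theta ** (-2i / head_dim)`, so the slowest pair sets the longest positional wavelength. The helper names below are hypothetical, and `head_dim=128` follows from `d_model / n_heads` in the tables above.

```python
import math


def rope_inv_freq(theta, head_dim):
    """Standard RoPE inverse frequencies, one per pair of dimensions."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(theta, head_dim):
    """Positions needed for the slowest-rotating pair to complete one full turn."""
    return 2 * math.pi / rope_inv_freq(theta, head_dim)[-1]


base_10k = longest_wavelength(10_000, 128)
base_500k = longest_wavelength(500_000, 128)
```

Both wavelengths already exceed the current 2,048-token context, which matches the remark that the benefit is not visible yet. The larger base stretches the slowest rotation much further, which is what a later context extension would rely on.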
---
## 7. Training Data
### 7.1 Tokenizer
| Item | Value |
|------|-----|
| Type | SentencePiece Unigram |
| Vocabulary size | 64,000 |
| Korean character coverage | 99.95% |
| Location | `tokenizer/korean_sp/` |
| HF format | `tokenizer/tokenizer.json` (2.4MB) |
A 64K vocabulary is a balance between 32K (too small, heavy fragmentation of Korean subwords) and 128K (too large, more embedding-layer overhead). Llama 3 (128K) and GPT-4 (100K) reflect the trend toward large vocabularies, but in a 3B model a 128K vocabulary makes the embedding layer alone take up too large a share of the parameters.
### 7.2 Pretraining Data: Full Composition
Final training files: `data/3b_train.bin` (77GB, ~38.5B tokens) + `data/3b_val.bin` (145MB)
By the Chinchilla rule, 3B × 20 = **60B tokens** is optimal. The current ~38.5B tokens are consumed over 57,000 steps (batch 5 × accum 8 × seq 2048 × 8 GPUs), with repetition where needed, which is a reasonable range for a first 3B training run.
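The token budget implied by the batch configuration above can be worked out directly. This is plain arithmetic on the numbers quoted in this section; the variable names are illustrative.

```python
def tokens_per_step(micro_batch, grad_accum, seq_len, n_gpus):
    """Tokens consumed by one optimizer step across all GPUs."""
    return micro_batch * grad_accum * seq_len * n_gpus


per_step = tokens_per_step(micro_batch=5, grad_accum=8, seq_len=2048, n_gpus=8)
consumed = per_step * 57_000          # tokens seen over the whole run
chinchilla_target = 20 * 3.0e9        # 20 tokens per parameter for a 3B model
fraction_of_target = consumed / chinchilla_target
```

Each step consumes 655,360 tokens, so 57,000 steps see about 37.4B tokens, roughly 62% of the 60B Chinchilla target.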
#### Korean: Web Crawl
| Dataset | HuggingFace ID | Tokenized file | Size | Est. tokens | Description |
|----------|---------------|------------|------|----------|------|
| C4 Korean | `allenai/c4` (ko subset) | `korean_c4_train.bin` | 15GB | ~7.5B | Korean-filtered Google C4, large-scale clean web text |
| CC-100 Korean | `cc100` (ko subset) | `cc100_ko_train.bin` | 4.3GB | ~2.15B | Monolingual corpus based on Common Crawl |
| HPLT Korean | `HPLT/hplt_monolingual_v2` (ko) | `hplt_ko_train.bin` | 15GB | ~7.5B | High Performance Language Technologies web data |
#### Korean: Encyclopedia
| Dataset | HuggingFace ID | Tokenized file | Size | Est. tokens | Description |
|----------|---------------|------------|------|----------|------|
| Korean Wikipedia | `wikimedia/wikipedia` (20231101.ko) | `wikipedia_ko_train.bin` | 566MB | ~283M | Full Korean Wikipedia, structured written style |
| Korean Wikipedia (v2) | `wikimedia/wikipedia` (ko) | `korean_wiki_train.bin` | 500MB | ~250M | Separate Wikipedia version |
| Namuwiki | `heegyu/namuwiki-extracted` | `korean_namuwiki_train.bin` | 2.1GB | ~1.05B | Namuwiki extract, rich in subculture and current affairs |
| Namuwiki 2023b | `heegyu/namuwiki-extracted` (2023b) | `namuwiki_2023b_train.bin` | 2.5GB | ~1.25B | Updated 2023 snapshot |
#### English/Multilingual: Educational
| Dataset | HuggingFace ID | Tokenized file | Size | Est. tokens | Description |
|----------|---------------|------------|------|----------|------|
| Cosmopedia Stories | `HuggingFaceTB/cosmopedia` | `cosmo_stories_train.bin` | 5.9GB | ~2.95B | Synthetic educational stories |
| Cosmopedia Web v2 | `HuggingFaceTB/cosmopedia` | `cosmo_web_v2_train.bin` | 2.7GB | ~1.35B | Web-based educational text |
| Cosmopedia Stanford | `HuggingFaceTB/cosmopedia` | `cosmo_stanford_train.bin` | 2.1GB | ~1.05B | Based on Stanford courses |
| Cosmopedia WikiHow | `HuggingFaceTB/cosmopedia` | `cosmo_wikihow_train.bin` | 382MB | ~191M | WikiHow guides |
| Cosmopedia OpenStax | `HuggingFaceTB/cosmopedia` | `cosmo_openstax_train.bin` | 224MB | ~112M | Open textbooks |
| Cosmopedia Khan Academy | `HuggingFaceTB/cosmopedia` | `cosmo_khanacademy_train.bin` | 46MB | ~23M | Khan Academy |
#### English/Multilingual: Math & Science
| Dataset | HuggingFace ID | Tokenized file | Size | Est. tokens | Description |
|----------|---------------|------------|------|----------|------|
| Open Web Math | `open-web-math/open-web-math` | `open_web_math_train.bin` | 4.8GB | ~2.4B | Math text extracted from the web |
| MathPile | `GAIR/MathPile` | `mathpile_train.bin` | 2.9GB | ~1.45B | Math textbooks, papers, forums |
| Cosmopedia AutoMath | `HuggingFaceTB/cosmopedia` | `cosmo_auto_math_text_train.bin` | 2.5GB | ~1.25B | Synthetic math problems and solutions |
#### Korean: Legacy Merged
| Dataset | Tokenized file | Size | Est. tokens | Description |
|----------|------------|------|----------|------|
| Initial mix (C4 + Namuwiki + Wikipedia) | `korean_train.bin` | 17GB | ~8.5B | Original mixed data used for 1B training |
| 125M validation | `train.bin` | 1.2GB | ~600M | Used for the first FP8 validation |
#### Collected but Unused Data (korean_extra/, 640GB+)
Large-scale raw data collected into 39 subdirectories under `data/korean_extra/`; tokenization and merging are only partially complete:
| Category | Dataset | Description | Notes |
|------|----------|------|------|
| Web crawl | CulturaX Korean | Korean subset of a large multilingual web corpus | ~50B+ tokens |
| Web crawl | FineWeb2 Educational Korean | Web data filtered for educational quality | 234GB raw |
| Web crawl | Korean Web Collection | KORMo web collection | 175GB raw |
| Web crawl | OSCAR Korean | Korean subset of a multilingual web corpus | |
| Education | Korean Textbooks | Korean textbook text | 45 subcategories |
| Education | FinePDFs Educational Korean | PDF-based educational material | |
| Law | Korean Law | Korean legal text | 15GB |
| News | Korean News Archive | Korean news archive | |
| Public corpus | Korean Public Corpus | KORMo public corpus | 26GB |
| Code | Code Pretrain | Programming code | |
| Academic | Academic Pretrain | Academic papers and reports | |
| General | SlimPajama | Lightweight version of RedPajama | |
> This data is planned for use in the Extended Pretrain stage (80-100B tokens).
#### Pretraining Data Share by Domain
```
┌──────────────────────────────────────────────────────────┐
│ 3b_train.bin token composition (~38.5B)                  │
├──────────────────────────────────────────────────────────┤
│ ████████████████████░░░░░░░░░░  Korean web crawl   44.7% │
│ ██████████░░░░░░░░░░░░░░░░░░░░  Mixed legacy       22.1% │
│ ██████░░░░░░░░░░░░░░░░░░░░░░░░  Educational (EN)   14.7% │
│ █████░░░░░░░░░░░░░░░░░░░░░░░░░  Math & science     13.2% │
│ ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░  Encyclopedia (KO)   5.3% │
└──────────────────────────────────────────────────────────┘
```
### 7.3 SFT Data: 2.44M Samples (currently training)
From **24 sources**: 6.59M raw → merge and deduplicate → quality filtering → **2,439,397 train + 49,801 val**
#### Main SFT Sources (top 12, 96% of the total)
| # | Dataset | Samples | Size | Domain |
|---|---------|---------|------|--------|
| 1 | reasoning_r1_1.4m | 1,400,000 | 14.77 GB | Reasoning (CoT) |
| 2 | openhermes_2.5 | 1,001,551 | 1.82 GB | English general-purpose |
| 3 | AI-MO_NuminaMath-CoT | 859,494 | 2.51 GB | Math CoT |
| 4 | korean_instruction_mix | 515,911 | 1.39 GB | Korean mixed |
| 5 | lemon-mint_smol-koreantalk | 460,281 | 5.23 GB | Korean dialogue |
| 6 | open_korean_instructions | 375,159 | 0.73 GB | Korean instruction |
| 7 | magpie_reasoning_v2 | 249,922 | 3.99 GB | Reasoning (English) |
| 8 | magpie_reasoning_ko | 224,929 | 3.19 GB | Reasoning (Korean) |
| 9 | ultrachat_200k | 207,865 | 1.34 GB | Dialogue |
| 10 | kuotient_orca-math-ko | 193,789 | 0.61 GB | Math (Korean) |
| 11 | data/sft/train.jsonl (original) | 161,848 | 0.27 GB | Original SFT |
| 12 | kullm_v2 | 152,630 | 0.42 GB | Korean instruction |
Other 12 sources: DeepMath-103K, Evol-Instruct-Code-80k-ko, ShareGPT-74k-ko, evol-instruct-korean, alpaca-gpt4-korean, ko_wikidata_QA, Ko.WizardLM, KOR-OpenOrca-Platypus-v3, korean-writing-style-instruct, ko_lima, koalpaca_v1_1a, OpenAssistant_oasst1_ko
#### Data Processing Pipeline
```
24 sources (6.59M raw)
    ↓ prepare_sft_combined.sh (format unification, MD5 deduplication, 98:2 split)
Combined: 2,559,492 train + 52,234 val (7.95 GB)
    ↓ filter_sft_v2.py (5 stages: EOS strip, QA marker removal, length 50~20K, drop samples with >30% 4-gram repetition)
Final: 2,439,397 train + 49,801 val (7.63 GB) ← removal rate 4.69%
```
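The two stages above can be sketched roughly as follows. This is a minimal illustration, not the contents of `prepare_sft_combined.sh` or `filter_sft_v2.py`: the record fields (`prompt`, `response`), the function names, the choice to measure length in characters, and the exact definition of the 4-gram repetition ratio are all assumptions.

```python
import hashlib

def md5_key(sample):
    # Hypothetical dedup key: MD5 over prompt + response text
    text = sample["prompt"] + "\n" + sample["response"]
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def dedup(samples):
    # Keep the first occurrence of each distinct MD5 key
    seen, unique = set(), []
    for s in samples:
        key = md5_key(s)
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique

def ngram_repetition(text, n=4):
    # Share of n-grams (over whitespace tokens) that duplicate another n-gram
    tokens = text.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def keep(sample, min_len=50, max_len=20_000, max_rep=0.30):
    # Length window (assumed to be in characters) plus repetition threshold
    response = sample["response"]
    if not (min_len <= len(response) <= max_len):
        return False
    return ngram_repetition(response) <= max_rep
```

A sample would pass through `dedup` first and then be kept only if `keep(sample)` is true; the thresholds mirror the 50~20K length window and the 30% 4-gram limit listed in the pipeline.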
#### Domain Share
```
Reasoning/CoT        38.0% ████████████████████████
Korean instruction   22.5% ██████████████
English general      16.0% ██████████
Math                 12.0% ████████
Dialogue/code/other  11.5% ███████
```
### 7.4 Preference Data (for ORPO): 795K Pairs
Total: **795,468 preference pairs** (7.9GB, `data/preference/combined_preference.jsonl`)
| HuggingFace ID | Size | Domain | Format |
|---------------|------|------|------|
| `nayohan/preference-collection-ko-full` | 4.9GB | General preference evaluation | instruction + response_A/B + preference |
| `heegyu/orca-math-korean-preference-cleaned` | 1.6GB | Math reasoning | prompt + chosen + rejected |
| `kuotient/orca-math-korean-dpo-pairs` | 750MB | Math DPO | prompt + chosen + rejected |
| `maywell/ko_Ultrafeedback_binarized` | 394MB | Feedback-based alignment | prompt + winning/losing response |
| `tellang/yeji-preference-ko-v1` | 171MB | General preference | prompt + chosen + rejected |
| `jojo0217/korean_rlhf_dataset` | 137MB | RLHF pairs | prompt + chosen + rejected |
| `lemon-mint/korean-realqa-reasoning-v01-preference` | 58MB | QA reasoning | prompt + chosen + rejected |
**Filtering criteria**: minimum length of 20 characters, EOS removal, and format normalization before merging
> ORPO runs in Phase 3 only if the repetition rate exceeds 5%. If the 3B model resolves the 1B model's structural repetition problem on its own, it can be deployed without ORPO.
### 7.5 Data Pipeline Summary
```
[HuggingFace / web collection]
        │
        ▼
┌─── Raw collection ───────────────────────────────────────────────┐
│ korean_extra/ (39 directories, 640GB+)                           │
│ sft_extra/ (27 directories, 1.08M samples)                       │
│ preference/ (7 JSONL files, 795K pairs)                          │
└──────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─── Tokenization (SentencePiece 64K) ─────────────────────────────┐
│ tokenize_extra.py: auto format detection (Arrow/Parquet/JSONL)   │
│ 8 parallel workers, uint16 memmap (.bin) output                  │
└──────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─── Final merge ──────────────────────────────────────────────────┐
│ Pretrain: 3b_train.bin (77GB, ~38.5B tokens)                     │
│ SFT: sft_combined/train_filtered.jsonl (7.48GB, 2.44M samples)   │
│ ORPO: preference/combined_preference.jsonl (7.9GB)               │
└──────────────────────────────────────────────────────────────────┘
```
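Because the tokenized output is a uint16 memmap, each token occupies 2 bytes, which is how the estimated token counts listed for each `.bin` file relate to file size. A small sketch of that arithmetic, assuming decimal units (1GB = 10^9 bytes); the function name is illustrative only:

```python
BYTES_PER_TOKEN = 2  # uint16; a 64,000-entry vocabulary fits below 2**16

def estimate_tokens(size_bytes):
    # Number of uint16 token ids stored in a file of the given size
    return size_bytes // BYTES_PER_TOKEN

print(estimate_tokens(77 * 10**9) / 1e9)   # 3b_train.bin (77GB), in billions
print(estimate_tokens(566 * 10**6) / 1e6)  # wikipedia_ko_train.bin (566MB), in millions
```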
---
## 8. Training Configuration and Optimization
### Current Training Configuration (`configs/korean_3b_fp8.yaml`)
```yaml
model:
  vocab_size: 64000
  d_model: 3072
  n_layers: 28
  n_heads: 24
  n_kv_heads: 8
  d_ffn: 8192
  max_seq_len: 2048
  rope_theta: 500000.0
training:
  batch_size: 5
  gradient_accumulation_steps: 8
  learning_rate: 1.5e-4
  min_lr: 1.5e-5
  warmup_steps: 2000
  max_steps: 57000
  weight_decay: 0.1
  grad_clip: 1.0
  optimizer: adamw
  scheduler: cosine
fp8:
  enabled: true
  recipe: "mxfp8"
  use_transformer_engine: true
distributed:
  strategy: ddp
  gradient_as_bucket_view: true
  find_unused_parameters: false
nccl:
  timeout_seconds: 7200
  nvls_enabled: true
```
Effective batch size = `batch_size(5) × grad_accum(8) × num_gpus(8)` = **320**
LR schedule: 2000 warmup steps → cosine decay → min_lr=1.5e-5 (10% of max_lr)
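The schedule and batch arithmetic can be written out as a short sketch using the values from the config above. The linear warmup shape and the function name `lr_at` are assumptions; the README only states "warmup → cosine decay → min_lr".

```python
import math

MAX_LR, MIN_LR = 1.5e-4, 1.5e-5
WARMUP_STEPS, MAX_STEPS = 2000, 57000

def lr_at(step):
    # Linear warmup from 0 to MAX_LR (the linear shape is an assumption)
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    # Cosine decay from MAX_LR down to MIN_LR, reached at MAX_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

effective_batch = 5 * 8 * 8               # batch_size x grad_accum x num_gpus
tokens_per_step = effective_batch * 2048  # upper bound: every sequence at max_seq_len

for step in (0, 1000, 2000, 30000, 57000):
    print(step, f"{lr_at(step):.2e}")
```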
### Optimization Lessons Learned in Phase 0
#### GQA FlashAttention Native
This optimization produced the largest VRAM savings. The key point is that FlashAttention supports GQA natively. Expanding the KV heads so they can be processed like MHA incurs memory copies, whereas the native path handles the grouping directly inside the kernel.
```python
from flash_attn import flash_attn_func

# Before (inefficient): expand KV heads and process like MHA
# Tensors are laid out as [B, S, H, D], so the head axis is dim=2
k = k.repeat_interleave(n_heads // n_kv_heads, dim=2)
v = v.repeat_interleave(n_heads // n_kv_heads, dim=2)
out = flash_attn_func(q, k, v)

# After (native GQA): flash_attn handles GQA internally
out = flash_attn_func(q, k, v)  # q: [B, S, H, D], k/v: [B, S, Hkv, D]
# VRAM 60.4GB → 48.3GB (-20%)
```
#### DDP Optimization
```python
import torch

# gradient_as_bucket_view=True: gradient tensors are mapped directly as views into bucket memory
# → removes unnecessary memory copies, GPU-CPU synchronization overhead -87.5%
model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    gradient_as_bucket_view=True,
    find_unused_parameters=False,  # all parameters are used
)
```
**Caution**: `static_graph=True` is not used. TransformerEngine's `te.Linear` requires a dynamic graph in some cases, and enabling static_graph raises a runtime error.
#### NCCL NVLS
```bash
export NCCL_ALGO=NVLSTree # enable NVLink SHARP (NVLS)
export NCCL_PROTO=Simple
export NCCL_P2P_DISABLE=0
export NCCL_TIMEOUT=7200 # timeout headroom for long backward passes
```
Because NVSwitch supports all-to-all communication in a single hop, NVLSTree is more efficient than a ring topology.
#### Triple Defense Against SIGHUP
In long-running training, a dropped session (SIGHUP) is fatal. Three layers of protection were put in place:
```bash
# Layer 1: nohup + setsid (new session group)
nohup setsid torchrun --nproc_per_node=8 train/pretrain.py ... &

# Layer 2 (Python): signal handler that ignores SIGHUP at the Python level
import signal
import sys
signal.signal(signal.SIGHUP, signal.SIG_IGN)

# Layer 3 (Python): emergency checkpoint (a checkpoint is saved on SIGTERM as well)
def emergency_save(signum, frame):
    save_checkpoint(model, optimizer, step, "emergency")
    sys.exit(0)
signal.signal(signal.SIGTERM, emergency_save)
```
#### torch.compile: Test Result, No Effect
`torch.compile` was applied in the hope of a speedup, but the measured result was **1.00x (no effect)**. There are two reasons:
1. TransformerEngine's kernels are opaque, which causes graph breaks. `torch.compile` optimizes the Python operation graph, and the TE kernels sit outside that graph.
2. The `/tmp` directory is mounted with the `noexec` flag, so compiled kernels cannot be cached.
**Lesson**: "understand first why it should help" matters more than "just try it".
### Monitoring System
```
Telegram alert system
├── B200Bot (token configured)
├── training_watchdog.sh → cron, every 10 minutes
│   └── loss anomaly or process exit detected → immediate alert
└── hourly_status.sh → cron, every hour
    └── step, loss, speed, VRAM, ETA → periodic report
```
```python
# curl is blocked, so urllib is used instead
import json
import urllib.request

def send_telegram(message):
    # TOKEN and CHAT_ID are expected to be defined elsewhere
    url = f"https://api.telegram.org/bot{TOKEN}/sendMessage"
    data = json.dumps({"chat_id": CHAT_ID, "text": message}).encode()
    req = urllib.request.Request(url, data=data,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)
```
---
## 9. Experimental Results: 1B Baseline
An honest record of the 1B model's experimental results, successes and failures alike.
### Pretraining Results
| Metric | Value |
|------|-----|
| Final loss | 1.904 |
| PPL (C4 Korean) | 5.67 |
| Training steps | 34,000 |
| Training time | ~2 days |
### SFT v1 Result: Failure
| Metric | Value |
|------|-----|
| val_loss | 0.0 (abnormal) |
| Cause | label off-by-one bug (data leakage) |
| Conclusion | Discarded entirely |
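For context on the off-by-one label bug: a causal LM predicts the token at position t+1 from position t, so inputs and labels must be offset by exactly one. One way such a bug can leak the answer is when labels line up with the inputs themselves, letting the model copy its input and drive the loss toward zero. The sketch below shows the intended shift on plain lists. It is a generic illustration rather than the project's `train/sft.py`; the function name and the `-100` ignore index (the usual PyTorch convention) are assumptions.

```python
IGNORE_INDEX = -100  # conventional "do not score this position" label

def build_example(prompt_ids, response_ids):
    tokens = prompt_ids + response_ids
    inputs = tokens[:-1]  # positions 0..T-2
    labels = tokens[1:]   # each label is the NEXT token
    # Mask positions whose target is still part of the prompt
    n_masked = len(prompt_ids) - 1
    labels = [IGNORE_INDEX] * n_masked + labels[n_masked:]
    return inputs, labels

inputs, labels = build_example([11, 12, 13], [21, 22, 99])
print(inputs)  # [11, 12, 13, 21, 22]
print(labels)  # [-100, -100, 21, 22, 99]
```

The last prompt token (13) is paired with the first response token (21), and no position is ever asked to predict its own input.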
### SFT v2 Result: Partial Success
| Metric | Value |
|------|-----|
| val_loss | 2.2062 |
| Repetition rate | 18% (with rep_penalty=1.1) |
| kobest_copa | 0.646 |
| Conclusion | Functional, but with structural limitations |
### 3B Target Values (predicted from scaling laws)
| Benchmark | 1B current | 3B target |
|----------|---------|---------|
| kobest_copa | 0.646 | >0.72 |
| kobest_hellaswag | ~0.42 | >0.52 |
| Repetition rate | 18% | <5% |
| PPL (C4 Korean) | 5.67 | <4.5 |
Scaling up from 1B to 3B is not simply a matter of adding parameters. The repetition rate drops structurally only when the model can remember longer context and learn more diverse patterns. The 3B targets are predictions based on the Chinchilla scaling curve and on benchmarks of similarly sized models.
---
## 10. Experimental Results: 3B Base Comprehensive Evaluation (v2)
Comprehensive evaluation run on checkpoint-0057000 after 3B pretraining completed.
The v2 re-evaluation uses an 8-GPU parallel pipeline and covers 13+ benchmarks, a 0/5-shot comparison, calibration, and a comparison with reference models.
Total elapsed time: 256.6 seconds.
> **Changes from v1 to v2**: v1 (the initial evaluation) measured only PPL on 3 datasets plus 2 benchmarks (belebele/MMLU). v2 covers PPL on 19 datasets, 5 KoBEST tasks, all of HAE-RAE, 6 MMLU-KO categories, 61 MMLU-EN subjects, 5 major English benchmarks, calibration, a 0/5-shot comparison, and a 12-combination parameter grid search.
### 10.1 Training Curve
| Step | Loss | LR | Notes |
|------|------|----|------|
| 10 | 11.657 | 1.50e-06 | Initial (warmup start) |
| 500 | 5.047 | 7.50e-05 | Warmup in progress |
| 2,000 | 2.851 | 3.00e-04 | Warmup complete, peak LR |
| 10,000 | 2.057 | 2.86e-04 | Steady decline |
| 30,000 | 1.789 | 1.61e-04 | Midpoint, entering epoch 1 |
| 57,000 | 1.466 | 3.00e-05 | Final (cosine min) |
> Throughput was stable at 36~38K tok/s throughout. Total training time was about 63 hours.
### Base Model Backup
| Item | Value |
|------|-----|
| Original checkpoint | `checkpoints/korean_3b_fp8_run1/checkpoint-0057000/` (34GB) |
| Backup | `checkpoints/korean_3b_fp8_run1/checkpoint-0057000_BASE_BACKUP/` |
| MD5 verification | `4f493d7bcc843727d32453bb3a4e6b7d` (match confirmed) |
| HF conversion | `eval/outputs/hf_3b_base/` (11GB safetensors) |
### 10.2 PPL (Perplexity): 19 Datasets
**Primary PPL (3b_val combined): 5.2263** (initial v1 evaluation: 5.709)
| Dataset | PPL | Bits/Token | Eval tokens | Elapsed time |
|---------|-----|-----------|---------|---------|
| korean_namuwiki | 25.88 | 4.694 | 6.5M | 63.7s |
| cc100_ko | 21.78 | 4.445 | 13.6M | 133.2s |
| namuwiki_2023b | 18.92 | 4.242 | 7.7M | 75.1s |
| val | 18.30 | 4.194 | 9.1M | 89.4s |
| korean_wiki | 11.84 | 3.565 | 1.6M | 15.5s |
| wikipedia_ko | 10.71 | 3.420 | 1.8M | 17.4s |
| korean | 7.02 | 2.811 | 53.5M | 521.6s |
| open_web_math | 6.93 | 2.792 | 15.7M | 153.5s |
| **korean_c4** | **5.72** | **2.515** | **45.4M** | **443.1s** |
| **3b (combined)** | **5.23** | **2.386** | **226.9M** | **2227.3s** |
| cosmo_web_v2 | 4.17 | 2.059 | 8.6M | 84.6s |
| cosmo_stories | 3.96 | 1.984 | 18.9M | 185.2s |
| cosmo_openstax | 3.87 | 1.951 | 0.7M | 7.2s |
| cosmo_stanford | 3.36 | 1.750 | 6.6M | 65.3s |
| cosmo_wikihow | 3.31 | 1.727 | 1.2M | 11.8s |
| cosmo_auto_math_text | 3.15 | 1.655 | 7.9M | 77.3s |
| cosmo_khanacademy | 2.93 | 1.552 | 0.1M | 1.5s |
| mathpile | 2.72 | 1.446 | 7.1M | 69.9s |
| hplt_ko | 2.40 | 1.265 | 48.5M | 475.9s |
> **Interpretation**: Low PPL on in-distribution data (included in training; hplt_ko: 2.40, mathpile: 2.72) and high PPL on OOD data (low share of training; cc100_ko: 21.78, namuwiki: 25.88) is the expected pattern. korean_c4 at 5.72 matches v1's 5.717, confirming that the evaluation is reproducible.
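The PPL and Bits/Token columns are two views of the same quantity: bits per token is the base-2 logarithm of perplexity, and the natural logarithm gives the mean NLL in nats. A short sketch that reproduces a few table rows (small differences come from the PPL values being rounded):

```python
import math

def bits_per_token(ppl):
    # Perplexity is 2 ** (bits per token)
    return math.log2(ppl)

def nll_nats(ppl):
    # Perplexity is exp(mean negative log-likelihood)
    return math.log(ppl)

for name, ppl in [("3b (combined)", 5.2263), ("korean_namuwiki", 25.88), ("hplt_ko", 2.40)]:
    print(f"{name}: {bits_per_token(ppl):.3f} bits/token, {nll_nats(ppl):.3f} nats")
```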
### 10.3 Korean Benchmarks
#### KoBEST (0-shot): Average 43.69%
| Task | Accuracy | F1 |
|--------|----------|-----|
| kobest_boolq | 50.28% | 0.3457 |
| kobest_copa | 49.30% | 0.4921 |
| kobest_hellaswag | 21.60% | 0.2153 |
| kobest_sentineg | 48.61% | 0.4737 |
| kobest_wic | 48.65% | 0.3286 |
| **Average** | **43.69%** | |
#### HAE-RAE (0-shot): Overall 19.71%
| Subtask | Accuracy |
|-----------|----------|
| haerae_general_knowledge | 21.59% |
| haerae_history | 23.40% |
| haerae_loan_word | 21.30% |
| haerae_rare_word | 18.77% |
| haerae_standard_nomenclature | 13.73% |
| **Overall** | **19.71%** |
#### MMLU-KO (0-shot): 6-Category Average 22.75%
| Category | Accuracy |
|----------|----------|
| medical | 30.56% |
| humanities | 24.51% |
| business | 24.14% |
| social_sciences | 20.59% |
| other | 19.64% |
| stem | 19.57% |
| **Average** | **22.75%** |
> The base model has no instruction-following training and is not optimized for multiple-choice benchmark formats. KoBEST boolq/copa/sentineg/wic hover around 50%, close to the random baseline (50% for 2-choice, 25% for 4-choice tasks); improvement is expected after SFT.
### 10.4 English Benchmarks
#### Main Benchmarks (0-shot)
| Task | Accuracy | Acc (norm) |
|--------|----------|-----------|
| hellaswag | 26.00% | 26.15% |
| arc_easy | 25.63% | 26.64% |
| arc_challenge | 21.67% | 27.90% |
| winogrande | 50.59% | N/A |
| piqa | 52.50% | 48.31% |
> winogrande (50.59%) and piqa (52.50%) are 2-choice tasks and sit close to the 50% random baseline. hellaswag and arc are 4-choice tasks with a 25% random baseline.
#### MMLU-EN (0-shot): 61-Subject Average 25.81%
**Top 10 subjects**:
| Subject | Accuracy |
|------|----------|
| college_physics | 37.25% |
| college_computer_science | 34.00% |
| high_school_statistics | 33.80% |
| us_foreign_policy | 32.00% |
| security_studies | 31.43% |
| world_religions | 30.99% |
| professional_medicine | 30.88% |
| high_school_government_and_politics | 30.57% |
| jurisprudence | 30.56% |
| human_sexuality | 30.53% |
**Bottom 5 subjects**:
| Subject | Accuracy |
|------|----------|
| human_aging | 19.73% |
| college_biology | 19.44% |
| anatomy | 17.04% |
| global_facts | 17.00% |
| abstract_algebra | 15.00% |
### 10.5 Calibration
| Metric | Value |
|--------|-----|
| Top-1 Accuracy | 68.75% |
| Top-5 Accuracy | 81.64% |
| Top-10 Accuracy | 85.93% |
| Mean Correct Prob | 0.6152 |
| Mean Entropy | 1.5682 |
**Token NLL distribution**:
| Statistic | Value |
|------|-----|
| Mean NLL | 1.5561 |
| Std dev | 2.4926 |
| Median | 0.1221 |
| p95 | 7.0312 |
| p99 | 10.3125 |
| Share with NLL > 5 | 10.86% |
| Share with NLL > 10 | 1.18% |
> Top-1 68.75% means the model's most confident prediction is correct about 69% of the time. With a median NLL of 0.12 (≈ e^0.12 = 1.13 PPL), the model predicts most tokens with very high confidence, while a small number of hard tokens pull the mean NLL upward. This is a typical distribution.
### 10.6 0-shot vs 5-shot Comparison
0-shot and 5-shot performance were compared on 18 Korean tasks.
| Task | 0-shot | 5-shot | Change |
|--------|--------|--------|------|
| global_mmlu_ko | 22.75% | 26.75% | **+4.00pp** |
| global_mmlu_ko_business | 24.14% | 31.03% | **+6.90pp** |
| global_mmlu_ko_humanities | 24.51% | 28.43% | +3.92pp |
| global_mmlu_ko_medical | 30.56% | 36.11% | **+5.56pp** |
| global_mmlu_ko_other | 19.64% | 23.21% | +3.57pp |
| global_mmlu_ko_social_sciences | 20.59% | 23.53% | +2.94pp |
| global_mmlu_ko_stem | 19.57% | 21.74% | +2.17pp |
| haerae | 19.71% | 20.26% | +0.55pp |
| haerae_general_knowledge | 21.59% | 22.73% | +1.14pp |
| haerae_history | 23.40% | 14.89% | -8.51pp |
| haerae_loan_word | 21.30% | 24.26% | +2.96pp |
| haerae_rare_word | 18.77% | 18.02% | -0.74pp |
| haerae_standard_nomenclature | 13.73% | 25.49% | **+11.76pp** |
| kobest_boolq | 50.28% | 50.21% | -0.07pp |
| kobest_copa | 49.30% | 46.80% | -2.50pp |
| kobest_hellaswag | 21.60% | 20.80% | -0.80pp |
| kobest_sentineg | 48.61% | 47.86% | -0.76pp |
| kobest_wic | 48.65% | 48.97% | +0.32pp |
**Average change: +1.80pp** (improved: 12, declined: 6)
> MMLU-KO improves consistently at 5-shot (+2~7pp), confirming that in-context learning is working. KoBEST barely changes or declines slightly: the model already pattern-matches at 0-shot, so few-shot examples tend to get in the way. The +11.76pp on haerae_standard_nomenclature results from picking up this task's unusual format from the few-shot examples.
### 10.7 Reference Model Comparison
| Model | Parameters | MMLU-KO | MMLU-EN | KoBEST avg | PPL |
|------|---------|---------|---------|------------|-----|
| **FRANKENSTALLM 3B** | **3B** | **22.75%** | **25.81%** | **43.69%** | **5.2263** |
| Llama-3.2-3B | 3B | ~42% | ~58% | ~55% | N/A |
| Qwen2.5-3B | 3B | ~48% | ~65% | ~60% | N/A |
| EXAONE-3.5-2.4B | 2.4B | ~35% | ~50% | ~50% | N/A |
> The reference models are the product of trillions of training tokens and thousands of GPU-hours. FRANKENSTALLM 3B should be read in light of its budget: 41.12B tokens (~68% of Chinchilla-optimal), 63 hours, 8 GPUs. The gap is expected to narrow after SFT plus extended pretraining (80-100B tokens).
### 10.8 Generation Quality and Parameter Grid Search
#### Repetition Rate Summary
| Setting | 3-gram repetition | 4-gram repetition |
|------|--------------|--------------|
| greedy (temp=0.0) | 60.99% | 57.02% |
| temp=0.5 | 60.12% | 58.68% |
| temp=0.7 | 47.69% | 43.40% |
| temp=1.0 | 3.58% | 2.81% |
> The 71.1% greedy repetition rate in the initial v1 evaluation was measured with `no_repeat_ngram_size=3` applied. v2 standardizes on the raw setting (not applied) and records 60.99%.
#### 12-Combination Parameter Grid Search Results
| Setting | Temp | Rep Pen | 3-gram | 4-gram | Notes |
|------|------|---------|--------|--------|------|
| **t0.7_rep1.3** | **0.70** | **1.30** | **0.00%** | **0.00%** | **Best** |
| t0.9_rep1.2 | 0.90 | 1.20 | 0.00% | 0.00% | Runner-up |
| t0.7_rep1.2 | 0.70 | 1.20 | 0.88% | 0.00% | |
| t0.9_rep1.1 | 0.90 | 1.10 | 0.94% | 0.13% | |
| t1.0_rep1.1 | 1.00 | 1.10 | 1.21% | 0.48% | |
| t0.5_rep1.1 | 0.50 | 1.10 | 1.92% | 1.19% | |
| t1.0 | 1.00 | 1.00 | 3.58% | 2.81% | |
| t0.9 | 0.90 | 1.00 | 8.39% | 4.64% | |
| t0.7_rep1.1 | 0.70 | 1.10 | 8.51% | 5.51% | |
| t0.7 | 0.70 | 1.00 | 47.69% | 43.40% | |
| t0.5 | 0.50 | 1.00 | 60.12% | 58.68% | |
| greedy | 0.00 | 1.00 | 60.99% | 57.02% | |
#### Recommended Inference Parameters (for base-model experiments)
```python
# Best values from the v2 grid search
best_params = dict(temp=0.7, repetition_penalty=1.3)
# Alternative (more diverse generation)
diverse_params = dict(temp=0.9, repetition_penalty=1.2)
```
> Raised from the initial v1 recommendation (`temp=0.9, top_p=0.9, no_repeat_ngram=3, repetition_penalty=1.1`) to `repetition_penalty=1.3`. `no_repeat_ngram_size` is no longer needed, since the grid search confirmed that `repetition_penalty` alone removes repetition sufficiently.
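As background on why `repetition_penalty` suppresses loops: in the Hugging Face `transformers` implementation, the logit of every token already present in the context is divided by the penalty when positive and multiplied by it when negative, so those tokens become less likely either way. A dependency-free sketch of that rule, assuming this project's generation code follows the same convention (the function name is illustrative):

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    # logits: list of floats indexed by token id
    out = list(logits)
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty  # shrink positive logits
        else:
            out[tok] *= penalty  # push negative logits further down
    return out

logits = [2.6, -1.0, 0.5]
# Tokens 0 and 1 were already generated; token 2 is untouched
print(apply_repetition_penalty(logits, generated_ids=[0, 1, 0], penalty=1.3))
```

With `penalty=1.0` the logits are unchanged, which corresponds to the `Rep Pen = 1.00` rows of the grid.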
### 10.9 Evaluation Pipeline
The v2 re-evaluation was run with a modular 8-GPU parallel pipeline (`eval/reeval_pipeline.py`).
#### Architecture
```
reeval_pipeline.py
├── Load model once (HF model on GPU 0)
├── Phase 1: PPL evaluation (19 datasets, sequential)
├── Phase 2: Calibration + Token NLL
├── Phase 3: Generation quality + parameter grid search (12 combinations)
├── Phase 4: lm-evaluation-harness (0-shot, 8-GPU parallel)
├── Phase 5: lm-evaluation-harness (5-shot, 8-GPU parallel)
└── Phase 6: Automatic report generation (5 individual + 1 summary)
```
#### Pipeline Mode
The model is loaded once and the 0-shot and 5-shot runs execute back to back. Compared with the previous approach (two separate processes), this halves model loading time.
#### Task Distribution per GPU
| GPU | 0-shot tasks | 5-shot tasks |
|-----|--------------|--------------|
| 0 | kobest_boolq, kobest_copa, kobest_hellaswag | Same |
| 1 | kobest_sentineg, kobest_wic | Same |
| 2 | haerae (overall + 5 subtasks) | Same |
| 3 | global_mmlu_ko (6 categories) | Same |
| 4 | hellaswag, arc_easy | Same |
| 5 | arc_challenge, winogrande | Same |
| 6 | piqa, global_mmlu_en (61 subjects) | Same |
| 7 | (reserve: dedicated to PPL/calibration) | N/A |
NUMA affinity is applied: GPU 0-3 on NUMA node 0 (cores 0-35), GPU 4-7 on NUMA node 1 (cores 36-71).
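The GPU-to-core mapping can be expressed as a small helper. How the pipeline actually applies affinity (for example `numactl`, `taskset`, or a Python call) is not stated here, so the use of `os.sched_setaffinity` and both function names are assumptions; the call is Linux-only.

```python
import os

CORES_PER_NODE = 36

def cores_for_gpu(gpu_id):
    # GPU 0-3 -> NUMA node 0 (cores 0-35), GPU 4-7 -> NUMA node 1 (cores 36-71)
    node = 0 if gpu_id < 4 else 1
    start = node * CORES_PER_NODE
    return set(range(start, start + CORES_PER_NODE))

def pin_current_process(gpu_id):
    # Restrict to cores that exist and are allowed on this machine
    available = os.sched_getaffinity(0)
    target = cores_for_gpu(gpu_id) & available
    if target:
        os.sched_setaffinity(0, target)
```

Each per-GPU evaluation worker would call `pin_current_process(gpu_id)` once at startup so that its data loading threads stay on the CPU socket closest to its GPU.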
**Total elapsed time: 256.6 seconds** (including model load)
### Decision on Proceeding to SFT
**Conclusion: proceed with SFT.** A loss of 1.466 is a healthy completion signal and there are no structural problems. → **Phase 2 SFT started (2026-03-05)**
Detailed reports:
- v2 comprehensive: `eval/outputs/3b_reeval_20260305_1451/reports/` (5 individual reports + summary)
- v1 legacy: `reports/2026-03-05_3B_BASE_EVALUATION_REPORT.md`
---
## 11. Experimental Results: 3B SFT Comprehensive Evaluation
A six-dimension comprehensive evaluation run after Phase 2 SFT finished via early stopping.
### 11.1 SFT Training Results
| Item | Value |
|------|-----|
| Final step | 25,500 / 33,000 (77.3%, early stopping) |
| Best val_loss | **1.8851** (step 23,000) |
| Training time | ~15 hours 41 minutes |
| Data | 24 sources → 2,439,397 samples (7.48 GB) |
| Settings | LR=1e-5, eff_batch=64, NEFTune alpha=5.0 |
**Val loss trend**:
```
Step 500: 2.0732 (warmup complete)
Step 2,000: 1.9558 (rapid decline)
Step 5,000: 1.9107 (stable convergence)
Step 10,000: 1.8917 (slight decrease)
Step 15,000: 1.8864 (entering plateau)
Step 20,000: 1.8853 (change < 0.001)
Step 23,000: 1.8851 ← BEST (early stopping reference point)
Step 25,500: Early Stop (patience 5/5 exhausted)
```
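The stop at step 25,500 is consistent with a patience counter of 5 and an evaluation every 500 steps after the best value at step 23,000. The sketch below replays that logic. The 500-step eval interval and every loss value after step 23,000 are illustrative assumptions, not logged data, and `EarlyStopper` is a hypothetical class rather than the one in `train/trainer.py`.

```python
class EarlyStopper:
    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.best_step = None
        self.bad_evals = 0

    def update(self, step, val_loss):
        """Return True when training should stop."""
        if val_loss < self.best:
            self.best, self.best_step, self.bad_evals = val_loss, step, 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

stopper = EarlyStopper(patience=5)
# Illustrative values only: 1.8851 at step 23,000 is from the log above, the rest are made up
history = [(22500, 1.8852), (23000, 1.8851), (23500, 1.8853), (24000, 1.8852),
           (24500, 1.8854), (25000, 1.8853), (25500, 1.8852)]
for step, loss in history:
    if stopper.update(step, loss):
        print(f"early stop at step {step}; best {stopper.best} at step {stopper.best_step}")
        break
```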
### 11.2 Six-Dimension Evaluation Summary
| # | Dimension | Result | Key figures |
|---|------|------|-----------|
| 1 | Perplexity (knowledge retention) | **PASS** | Max forgetting 0.9%; all 19 datasets PASS |
| 2 | Generation quality | **FAIL** | Greedy repetition rate 72.97% (target <5%), EOS 60% (target >90%) |
| 3 | Korean benchmarks | **FAIL** | KoBEST average 43.26% (target >55%) |
| 4 | English benchmarks | **PASS** | hellaswag 26.1%, winogrande 50.8%, piqa 52.6% (all above the lower bound) |
| 5 | Calibration | **PASS** | Top-1 68.59%, Top-5 81.55%, Entropy 1.54 |
| 6 | SFT chat ability | **PASS** | EOS termination rate 0%→60%; responds in chat template |
### 11.3 Base vs SFT Comparison
| Metric | Base | SFT | Change | Verdict |
|------|------|-----|------|------|
| PPL (combined) | 5.2263 | 5.2529 | +0.5% forgetting | PASS |
| Greedy 3-gram repetition rate | 60.99% | 72.97% | +12pp (worse) | FAIL |
| EOS termination rate | 0% | 60% | +60pp (major improvement) | Partial PASS |
| KoBEST average | 43.69% | 43.26% | -0.4pp | FAIL |
| MMLU-KO | 22.75% | 26.00% | +3.2pp | Partial improvement |
| English benchmarks | N/A | N/A | within ±0.3pp | PASS (maintained) |
| Calibration Top-1 | 68.75% | 68.59% | -0.2pp | PASS (maintained) |
**Repetition parameter search** (encouraging):
| Setting | Repetition rate | EOS Rate |
|------|--------|----------|
| t0.7_rep1.2 | **0.00%** | **100%** |
| t1.0_rep1.1 | **0.00%** | **100%** |
| greedy (raw) | 72.97% | 60% |
> With rep_penalty 1.1~1.3 applied, the repetition rate reaches 0%, so the model does have the capacity to avoid repetition. ORPO can internalize that behavior.
### 11.4 Code Improvements
Major code changes made in this phase:
| File | Change | Lines | Purpose |
|------|------|-------|------|
| `train/sft.py` | MixingDataLoader, tokenization on DDP rank 0 | +238 | SFT+pretrain interleaving, 8x memory reduction |
| `train/trainer.py` | DDP early stop broadcast | +17 | Prevents DDP hang, patience 5→10 |
| `train/orpo.py` | YAML config, 3B defaults | +30 | ORPO run preparation |
| `eval/report_generator.py` | Automatic SFT comparison report generation | +831 | Evaluation automation |
| `eval/sft_eval_pipeline.py` | Six-dimension evaluation pipeline | New | SFT comprehensive evaluation |
| `eval/tasks/generation_task.py` | Chat template, diversity metrics | +75 | SFT evaluation support |
### 11.5 Decision on Proceeding to ORPO
**Verdict: proceed to Phase 3 ORPO**
| Rationale | Details |
|------|------|
| Knowledge retention is good | forgetting 0.9%: SFT does not destroy base knowledge |
| Repetition unresolved | greedy 72.97%: preference alignment is the direct path to a fix |
| Encouraging signal | 0% with rep_penalty applied → ORPO can internalize it |
| Data ready | 795,468 preference pairs (7.9 GB) |
| Code/config ready | `train/orpo.py` + `configs/korean_3b_orpo.yaml` |
**Decision criteria after ORPO**:
- Repetition rate < 5% AND KoBEST > 50% → deploy via GGUF + Ollama
- Repetition rate 5~15% → retry after hyperparameter tuning
- Repetition rate > 15% → retry after SFT v2 (lr=5e-5, data mixing)
Details: `reports/2026-03-06_3B_SFT_COMPLETION_AND_EVAL_SUMMARY.md`
---
## 12. Phase 3: ORPO (Preference Alignment)
### 12.1 Background for Choosing ORPO
The six-dimension SFT evaluation exposed critical problems: a greedy repetition rate of 72.97% and an EOS termination rate of only 60% (target >90%). SFT only teaches the model to "imitate good responses", so it carries no signal for "suppressing bad responses". Preference optimization is essential for solving the repetition problem.
**ORPO vs DPO**:
| Item | ORPO | DPO |
|------|------|-----|
| Reference model | Not required | Required (2x VRAM) |
| Implementation complexity | Low | Medium |
| Memory efficiency | High (only one 3B model loaded) | Low (two 3B models loaded) |
| Training stability | Medium | High |
ORPO was chosen as the first option, with DPO as Plan B.
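For reference, the ORPO objective as defined in the ORPO paper adds an odds-ratio penalty to the ordinary SFT loss on the chosen response, which is why no reference model is needed. Here $y_w$ is the chosen response, $y_l$ the rejected one, and $P_\theta(y \mid x)$ the length-normalized sequence likelihood. The weight is written as $\beta$ on the assumption that the "Beta" column in the sweep tables below is this coefficient, as in TRL's `ORPOConfig` (the paper calls it $\lambda$).

$$
\mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{NLL}}(y_w \mid x) + \beta \cdot \mathcal{L}_{\text{OR}},
\qquad
\mathcal{L}_{\text{OR}} = -\log \sigma\!\left( \log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)} \right),
\qquad
\text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
$$

A larger $\beta$ weights the preference term more heavily relative to plain imitation of the chosen response.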
### 12.2 Data
- **Original**: 683,181 preference pairs (merged from 7 sources)
- **After filtering**: ~630,000 pairs (NaN-prevention filter applied)
- **Eval split**: 5% (~31,500 pairs, seed=42)
- **Effective batch**: 4 × 8 GPU × 4 accum = 128
### 12.3 HP Sweep Design (6-Config)
Six combinations were chosen over three axes (beta, LR, max_length) by holding a center configuration fixed and varying one axis at a time:
| Run | Name | Beta | LR | Max Length | Purpose |
|-----|------|------|----|-----------|------|
| 1 | baseline_b015 | 0.15 | 8e-6 | 1536 | Weak-beta baseline |
| 2 | baseline_b025 | 0.25 | 8e-6 | 1536 | Medium-beta baseline |
| 3 | strong_b035 | 0.35 | 8e-6 | 1536 | Strong beta: aggressive repetition suppression |
| 4 | fast_lr12e6 | 0.25 | 1.2e-5 | 1536 | High LR: fast convergence |
| 5 | conserv_lr5e6 | 0.25 | 5e-6 | 1536 | Conservative LR: stability |
| 6 | short_1024 | 0.25 | 8e-6 | 1024 | Short max_length: VRAM savings |
Each run: 200 steps, eval_steps=100, 8×B200 DDP.
### 12.4 Attempt History: Five Failures
| # | Problem | Cause | Fix |
|---|------|------|------|
| 1 | NCCL Timeout | Tokenization 30 min > timeout 1800s | ddp_timeout=7200, num_proc=64 |
| 2 | Config conflict | save_steps not a multiple of eval_steps | --no_load_best --save_steps 200 |
| 3 | Port conflict + missing QKV | Zombie process + fused QKV not split | pkill + QKV split logic |
| 4 | TRL NaN bug | tokenize_row truncates both responses at once | Triple patch (clamp, truncation) |
| 5 | Tokenizer compatibility | zip(strict=True) + Korean merge ops | 8 patches to the TRL source |
The most serious was the TRL NaN bug, which set off the chain 0 response tokens → log(0) = -inf → NaN propagation. Details: `reports/2026-03-08_ORPO_TRAINING_JOURNEY.md`
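One way to guard against the zero-response-token case is to drop any pair whose chosen or rejected response would be truncated away entirely. The sketch below assumes right-truncation of the concatenated prompt + response to `max_length`; the function names are hypothetical and this is not the actual patch applied to TRL's `tokenize_row`.

```python
def response_tokens_left(prompt_len, response_len, max_length):
    # Response tokens that survive right-truncation to max_length
    return max(0, min(response_len, max_length - prompt_len))

def is_safe_pair(prompt_ids, chosen_ids, rejected_ids, max_length=1536):
    # Both responses must keep at least one token, otherwise the loss is undefined
    p = len(prompt_ids)
    return (response_tokens_left(p, len(chosen_ids), max_length) > 0
            and response_tokens_left(p, len(rejected_ids), max_length) > 0)

pairs = [
    ([0] * 100, [1] * 50, [2] * 60),    # fits comfortably
    ([0] * 1536, [1] * 50, [2] * 60),   # prompt alone fills max_length
]
safe = [pair for pair in pairs if is_safe_pair(*pair)]
print(len(safe), "of", len(pairs), "pairs kept")
```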
### 12.5 Final Sweep Results
| Run | Name | Beta | LR | MaxLen | Train Loss | Eval Loss | Margin | Status |
|-----|------|------|----|--------|-----------|-----------|--------|--------|
| 1 | baseline_b015 | 0.15 | 8e-6 | 1536 | 1.811 | 1.827 | 0.004 | ✅ |
| 2 | baseline_b025 | 0.25 | 8e-6 | 1536 | 1.890 | 1.906 | 0.009 | ✅ |
| 3 | strong_b035 | 0.35 | 8e-6 | 1536 | 2.055 | 1.985 | 0.007 | ✅ |
| **4** | **fast_lr12e6** | **0.25** | **1.2e-5** | **1536** | **1.917** | **1.862** | **0.009** | **🏆 Best** |
| 5 | conserv_lr5e6 | 0.25 | 5e-6 | 1536 | 1.833 | 1.910 | 0.004 | ✅ |
| 6 | short_1024 | 0.25 | 8e-6 | 1024 | 1.664 | 1.695 | 0.007 | ✅ |
**Best config: Run 4** (lowest eval_loss, 1.862, among the beta=0.25, max_length=1536 runs; margin 0.009, tied for highest; fast convergence).
### 12.6 Throughput Benchmark → Main Training Settings
Before the main run, throughput was measured for several batch/grad_accum combinations to choose the best setting:

| batch_size | grad_accum | eff_batch | Throughput | Notes |
|-----------|-----------|----------|-----------|------|
| **4** | **4** | **128** | **80.63 samples/s** | **Selected** |
| 2 | 8 | 128 | 73.14 samples/s | Previous setting |
| 8 | 2 | 128 | OOM | |
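
The arithmetic behind the table can be checked with a short sketch. It assumes `batch_size` is the per-GPU micro-batch, which is the reading consistent with 4 × 4 × 8 GPUs = 128; the helper name is hypothetical.

```python
NUM_GPUS = 8  # 8x B200, as used throughout this README

def effective_batch(per_gpu_batch, grad_accum, num_gpus=NUM_GPUS):
    """Samples contributing to one optimizer step across all GPUs."""
    return per_gpu_batch * grad_accum * num_gpus

selected = effective_batch(4, 4)
previous = effective_batch(2, 8)
speedup = 80.63 / 73.14  # selected vs. previous setting, samples/s from the table

print(selected, previous, f"{(speedup - 1) * 100:.1f}% faster")
```

All three rows keep the effective batch at 128, so the comparison isolates throughput: a larger micro-batch with fewer accumulation steps runs faster until it no longer fits in memory.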
### 12.7 ORPO Main Training (in progress, 2026-03-09)
| Parameter | Value |
|---------|-----|
| Beta / LR | 0.25 / 1.2e-5 (Sweep Run 4) |
| Batch / Accum / Eff | 4 / 4 / 128 (benchmark optimum) |
| Max length | 1536 |
| Epochs | 2 (~9,840 steps) |
| GPU VRAM | ~52GB / 183GB (28%) |
| Speed | ~1.75 s/step |
| Estimated time | ~4.8 hours |

**Training metrics (as of step ~1,660)**:

| Step | Eval Loss | Pref Accuracy | Reward Margin | NLL Loss |
|-----:|----------:|--------------:|--------------:|---------:|
| ~1,000 | 1.791 | 66.8% | 0.107 | 1.647 |
| ~2,000 | 1.713 | 70.1% | 0.293 | 1.591 |
| ~3,000 | 1.681 | 71.9% | 0.372 | 1.567 |

- Train loss: 2.34 → **1.68** (-0.66)
- rewards/accuracies: 0.43 → **0.74** (the ability to tell chosen from rejected rose sharply)
- rewards/margins: -0.005 → **0.387** (confirms the preference signal is being learned)
- Speed ~1.76 s/step, GPU utilization 92-100%, running stably
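
For readers unfamiliar with these two metrics, the sketch below shows one common way they are derived from per-pair rewards: accuracy is the fraction of pairs where the chosen response scores higher, and margin is the mean score difference. This is an assumption about the convention, not code taken from the training script, and the reward values are invented for illustration.

```python
def preference_metrics(chosen_rewards, rejected_rewards):
    """Return (accuracy, margin) over paired chosen/rejected rewards."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    accuracy = sum(c > r for c, r in pairs) / len(pairs)
    margin = sum(c - r for c, r in pairs) / len(pairs)
    return accuracy, margin

# Illustrative numbers only, not taken from the training log
chosen = [-0.30, -0.42, -0.25, -0.51]
rejected = [-0.45, -0.40, -0.60, -0.80]
accuracy, margin = preference_metrics(chosen, rejected)
print(accuracy, round(margin, 4))
```

An accuracy near 0.5 with a margin near zero, as at the start of training, means the model cannot yet separate the two responses.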

**Automatic evaluation after training**: `scripts/orpo_eval_watchdog.sh` monitors the training process and, once it finishes, automatically runs the 10-dimension evaluation pipeline.
### 12.8 ORPO Comprehensive Evaluation Pipeline
A **10-dimension evaluation** that adds four ORPO-specific dimensions to the six dimensions of the SFT v2 evaluation.
When training completes, `eval/orpo_eval_pipeline.py` runs automatically and generates a 3-way Base vs SFT vs ORPO comparison report.

**Evaluation structure**:

| Phase | Contents | GPU | Estimated time |
|-------|------|-----|----------|
| Pre-phase | Extract training curves from train.log | - | ~1 s |
| Phase 1 | Internal evaluation (PPL on 19 sets, Calibration, Generation, Repetition Grid) | 8 GPUs in parallel | ~30 min |
| Phase 2 | Benchmarks (KoBEST, HAE-RAE, MMLU-KO/EN, hellaswag, arc, piqa) | 8 GPUs in parallel | ~1 hour |
| Phase 3 | Automatic generation of the 3-way comparison report | - | ~10 s |

**10-dimension evaluation items**:

| # | Dimension | Criterion | SFT v2 result | ORPO target |
|---|------|------|------------|----------|
| 1 | Knowledge retention (PPL) | forgetting < 15% | 0.9% | < 5% |
| 2 | Generation quality | greedy repetition rate < 5%, EOS > 90% | **72.97% / 60%** | **< 5% / > 90%** |
| 3 | Korean benchmarks | KoBEST average > 55% | 43.26% | ≥ 43% |
| 4 | English benchmarks | Above the lower bound | PASS | Maintain |
| 5 | Calibration | Top-1 ≥ 65% | 68.59% | ≥ 65% |
| 6 | Chat ability | EOS termination rate | 60% | > 90% |
| 7 | Preference Accuracy | > 65% | n/a | > 65% |
| 8 | Reward Margins | > 0.1 | n/a | > 0.1 |
| 9 | Repetition parameter sensitivity | < 5% even at rep_penalty=1.0 | n/a | PASS |
| 10 | SFT→ORPO improvement | Repetition rate ↓ + EOS ↑ | n/a | PASS |

**Key files**:
- `eval/orpo_eval_pipeline.py`: ORPO evaluation orchestrator
- `eval/report_generator.py`: 3-way comparison report generator (`generate_three_way_report()`)
- `scripts/orpo_eval_watchdog.sh`: detects training completion and launches the automatic evaluation

**Deployment criteria**: greedy repetition rate < 5% AND EOS > 90% AND forgetting < 5% AND KoBEST ≥ 43% → **DEPLOY**
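
The deployment gate is a plain conjunction of four thresholds. The sketch below encodes it and applies it to the SFT v2 numbers reported in this section; the `EvalSummary` class and `should_deploy` function are hypothetical names, not part of the evaluation pipeline.

```python
from dataclasses import dataclass

@dataclass
class EvalSummary:
    greedy_repetition: float  # percent
    eos_rate: float           # percent
    forgetting: float         # percent
    kobest_avg: float         # percent

def should_deploy(s: EvalSummary) -> bool:
    """True only when every deployment criterion holds."""
    return (
        s.greedy_repetition < 5.0
        and s.eos_rate > 90.0
        and s.forgetting < 5.0
        and s.kobest_avg >= 43.0
    )

# SFT v2 results from the table: fails on repetition rate and EOS rate
sft_v2 = EvalSummary(greedy_repetition=72.97, eos_rate=60.0, forgetting=0.9, kobest_avg=43.26)
print(should_deploy(sft_v2))
```

SFT v2 already meets the forgetting and KoBEST thresholds, so the gate reduces to whether ORPO fixes repetition and EOS termination without regressing the other two.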
---
## 13. How to Run
### Prerequisites
```bash
# Do not reinstall PyTorch (NVIDIA custom build)
# Install only the additional packages below
pip install transformers accelerate peft trl deepspeed \
    bitsandbytes sentencepiece wandb
```
### 3B Pretraining
```bash
# Launch 8-GPU training with the NCCL environment variables set
bash scripts/launch_3b_pretrain.sh
# Manual launch (direct control)
torchrun --nproc_per_node=8 \
    --master_port=29500 \
    train/pretrain.py \
    --config configs/korean_3b_fp8.yaml
```
### SFT
```bash
bash scripts/launch_3b_sft.sh
# Or run directly
torchrun --nproc_per_node=8 \
train/sft.py \
--config configs/korean_3b_sft.yaml \
--pretrain_ckpt checkpoints/3b_pretrain_best.pt
```
### ORPO (Preference Alignment)
```bash
# ORPO training
bash scripts/launch_3b_orpo.sh
# Automatic evaluation after training completes (watchdog)
nohup bash scripts/orpo_eval_watchdog.sh \
    > checkpoints/korean_3b_orpo_v1/watchdog.log 2>&1 &
```
### Evaluation
```bash
# Full evaluation of the Base model (8 GPUs in parallel)
python eval/full_eval_pipeline.py
# SFT model evaluation (2-way Base vs SFT comparison)
python eval/sft_eval_pipeline.py --skip-phase0 \
    --hf-model-path eval/outputs/hf_3b_sft_best
# ORPO model evaluation (3-way Base vs SFT vs ORPO comparison)
python eval/orpo_eval_pipeline.py            # auto-detects the latest checkpoint
python eval/orpo_eval_pipeline.py --dry-run  # only print the execution plan
# Quick evaluation (kobest_copa + PPL)
bash scripts/run_eval_quick.sh
# Generation parameter search
python eval/test_generation_params.py \
    --checkpoint checkpoints/3b_best.pt
```
### Deployment
```bash
# Step 1: GGUF conversion (llama.cpp format)
bash scripts/convert_3b_gguf.sh
# Step 2: register and serve the model with Ollama
bash scripts/deploy_3b_ollama.sh
# Test with Ollama (Korean prompt: "Explain Korea's steel industry.")
ollama run frankenstallm-3b "한국의 철강 산업에 대해 설명해줘."
```
### Training Monitoring
```bash
# Live monitor (tail -f style)
bash scripts/monitor_3b.sh
# Check process status
ps aux | grep pretrain
# GPU status
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu \
    --format=csv -l 5
```
### Single-GPU Test (development/debugging)
```bash
python train/pretrain.py \
--config configs/korean_3b_fp8.yaml \
--device cuda:0 \
--max_steps 100 \
--debug
```
---
## 14. Roadmap
### Short Term (March 2026)
| Item | Status | Notes |
|------|------|------|
| Phase 1 (3B Pretrain) | ✅ Done | 57K steps, loss 1.466, 2026-03-05 |
| Phase 2 (SFT) | ✅ Done | 25.5K steps, val_loss 1.8851, 2026-03-06 |
| SFT 6-dimension evaluation | ✅ Done | 4/6 PASS, decision: proceed to ORPO |
| Phase 3 (ORPO Sweep) | ✅ Done | 6-config sweep finished, best config selected |
| **Phase 3 (ORPO main training)** | **🔄 In progress** | **lr=1.2e-5, beta=0.25, 2 epochs, ~9,840 steps** |
| Phase 3.5 (ORPO comprehensive evaluation) | 📋 Pending | 10-dimension evaluation (6 base + 4 ORPO-specific), 3-way comparison report |
| GGUF conversion + Ollama deployment | 📋 Pending | Phase 4 (if the ORPO evaluation passes) |
### Mid Term (Q2 2026)
| Item | Notes |
|------|------|
| Extended pretraining (80-100B tokens) | Reach the Chinchilla-optimal point |
| QKV Fusion | +8-12% MFU expected |
| NUMA affinity setup | +4-9% expected |
| FA2 native RoPE | +3-5% expected |
| Context length extension (4096) | Based on RoPE θ=500K |
### Long Term (H2 2026)
| Item | Notes |
|------|------|
| 7B experiments | Requires an FSDP strategy |
| vLLM serving | PagedAttention-based inference server |
| Domain-specific fine-tuning | Steel/manufacturing domain |
| Public release | Upload to HuggingFace Hub |
### Known Optimizations Not Yet Applied
Optimizations identified in the Phase 0 analysis but not yet applied:

| Optimization | Expected effect | Implementation complexity |
|--------|-----------|-------------|
| QKV Fusion | +8-12% MFU | Medium |
| NUMA Affinity | +4-9% | Low |
| FA2 Native RoPE | +3-5% | Low |
| HugePages | +1-3% (TLB optimization) | Low (sysctl) |

Applying all of these could raise MFU from the current 33.5% to as much as 45-50%.
---
## 15. Reference Documents
| Document | Location | Contents |
|------|------|------|
| Full project journey | `docs/PROJECT_HISTORY.md` | Detailed day-by-day progress log |
| 3B work plan | `docs/3B_WORKPLAN.md` | Detailed stage-by-stage 3B work plan |
| Justice League debate | `eval/debate/justice_league_3b_case.md` | Full transcript of the multi-agent debate on the 1B→3B transition |
| SFT restart verdict | `eval/decision/FINAL_DECISION_REPORT.md` | Verdict on the SFT v1 failure → v2 design |
| 3B master plan | `eval/plan/3B_MASTER_PLAN.md` | Master plan for the full training pipeline |
| Phase 0 optimization report | `reports/2026-03-02_0200_FRANKENSTALLM_phase0_optimization_report.md` | Full report on VRAM/MFU optimization |
| 3B Base evaluation report (v1) | `reports/2026-03-05_3B_BASE_EVALUATION_REPORT.md` | Initial PPL/benchmark/repetition-rate evaluation |
| PPL evaluation report (v1) | `reports/2026-03-05_PPL_EVALUATION.md` | PPL details on 4 validation sets |
| Benchmark results (v1) | `reports/2026-03-05_BENCHMARK_RESULTS.md` | belebele, MMLU details |
| Generation quality analysis (v1) | `reports/2026-03-05_GENERATION_QUALITY.md` | Repetition rate, decoding parameters |
| SFT training report | `reports/2026-03-05_3B_SFT_PROGRESS_REPORT.md` | Record of the Phase 2 SFT training process |
| **SFT completion summary report** | `reports/2026-03-06_3B_SFT_COMPLETION_AND_EVAL_SUMMARY.md` | **SFT completion + evaluation + code improvements + ORPO decision (latest)** |
| SFT evaluation plan | `reports/2026-03-06_3B_SFT_EVAL_PLAN.md` | 6-dimension evaluation design |
| SFT evaluation results | `reports/2026-03-06_3B_SFT_EVALUATION_REPORT.md` | Detailed 6-dimension evaluation results |
| 3B next-steps reference | `reports/2026-03-05_3B_NEXT_STEPS_REFERENCE.md` | Direction after SFT |
| Nemotron Nano feasibility | `reports/2026-03-05_NEMOTRON_NANO_FEASIBILITY_STUDY.md` | Review of the hybrid architecture |
| **v2 comprehensive evaluation report** | `eval/outputs/3b_reeval_20260305_1451/full_eval_report.md` | **Summary of 13+ benchmarks** |
| v2 PPL report | `eval/outputs/3b_reeval_20260305_1451/reports/01_perplexity_report.md` | PPL details on 19 datasets |
| v2 Calibration report | `eval/outputs/3b_reeval_20260305_1451/reports/02_calibration_report.md` | Top-K accuracy, NLL distribution |
| v2 generation quality report | `eval/outputs/3b_reeval_20260305_1451/reports/03_generation_quality.md` | 12-combination parameter grid search |
| v2 benchmark report | `eval/outputs/3b_reeval_20260305_1451/reports/04_benchmark_report.md` | KoBEST, HAE-RAE, MMLU, 0/5-shot |
| Progress log | `PROGRESS.md` | Checkpoints, metrics, and decision log by date |
| **ORPO analysis and plan** | `reports/2026-03-07_ORPO_ANALYSIS_AND_PLAN.md` | **Rationale for ORPO, HP design, execution procedure** |
| **ORPO sweep debugging** | `reports/2026-03-08_ORPO_SWEEP_DEBUG_REPORT.md` | **Details of the QKV bug, NCCL timeout, and TRL patches** |
| **ORPO training journey** | `reports/2026-03-08_ORPO_TRAINING_JOURNEY.md` | **The full ORPO process: five failures and the HP sweep (latest)** |
---
## 16. Tech Stack Summary
| Area | Technology | Version |
|------|------|------|
| Deep learning framework | PyTorch (NVIDIA custom build) | nv25.12 |
| Attention | FlashAttention-2 | 2.7.4.post1+25.12 |
| FP8 / mixed precision | TransformerEngine (MXFP8) | 2.10.0 |
| Distributed training | DDP + NCCL (NVLS) | NCCL 2.28.9 |
| Kernel compilation | Triton | 3.5.1 |
| Tokenizer | SentencePiece Unigram 64K | - |
| Monitoring | Telegram Bot (B200Bot) + cron watchdog | - |
| Inference serving | GGUF + Ollama | - |
| GPU | 8× NVIDIA B200 (NVLink 5.0, NVSwitch) | CUDA 13.1 |
| CPU | 2× AMD EPYC 9365 (Zen 5) | - |
---
## Related Projects
### [EVAFRILL-Mo](https://github.com/pathcosmos/EVAFRILL-Mo)
**A hybrid Mamba-2 + Transformer language model**, the sister project of FRANKENSTALLM.

A 3B hybrid model implemented from scratch, inspired by NVIDIA's [Nemotron-H](https://arxiv.org/abs/2504.03624) architecture. Where FRANKENSTALLM is purely Transformer-based, EVAFRILL-Mo adopts a hybrid structure of **Mamba-2 SSM + sparse Transformer attention**.

| Item | FRANKENSTALLM | EVAFRILL-Mo |
|------|:---:|:---:|
| Architecture | Pure Transformer (28L) | Mamba-2 24L + Attention 2L |
| Parameters | 3.17B | 2.94B |
| Core techniques | GQA, FP8, FlashAttention-2 | Selective Scan, SwiGLU FFN in Mamba, GQA |
| Design principle | Proven Transformer architecture | Piecemeal adoption of Nemotron-H |
| GPU | 8× B200 | 7× B200 |
| Training strategy | Chinchilla-optimal | Target: 93% of Chinchilla |

The two projects share the same tokenizer (64K SentencePiece), training data pipeline, and DDP/FP8 infrastructure. With "the same ingredients, a different recipe", they make it possible to compare experimentally how architectural differences affect performance.

> *Origin of the name: Bride **Eva** (the Bride of Frankenstein) + **FRI**DAY (Iron Man's AI assistant) + **LL**M + the **Mo** of Nemotron*
---
## 18. Next Optimization Plan: MFU 33.5% → 47% Target
> Detailed document: [`docs/NEXT_OPTIMIZATION_PLAN.md`](docs/NEXT_OPTIMIZATION_PLAN.md)
### Current Performance Diagnosis
Measured during Phase 1 pretraining:
- **57,000 steps**, ~38.5B tokens, **about 63 hours**
- Throughput: 36-38K tok/s per rank → **~292K tok/s** overall (8 GPUs)
- **MFU: ~33.5%**
### Key Bottleneck: NUMA Misalignment
```
2× AMD EPYC 9365 (two sockets):
GPU 0-3 → NUMA node 0 (cores 0-35)
GPU 4-7 → NUMA node 1 (cores 36-71)
In the initial DDP launch, 5 of 8 ranks ran on the wrong NUMA node.
69% of DataLoader workers were cross-NUMA, causing ~2× latency.
```
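
One way to keep each rank on its own socket from inside the training process is to pin its CPU affinity at startup. The sketch below is a minimal, Linux-only illustration that hard-codes the core layout listed above. The function names are hypothetical, and it pins CPU cores only, not memory allocation, so it complements `numactl --membind` rather than replacing it.

```python
import os

CORES_PER_NODE = 36  # node 0 = cores 0-35, node 1 = cores 36-71 (layout from this section)
GPUS_PER_NODE = 4    # GPU 0-3 on node 0, GPU 4-7 on node 1

def cores_for_gpu(local_rank):
    """CPU cores on the same NUMA node as the given GPU index."""
    node = local_rank // GPUS_PER_NODE
    start = node * CORES_PER_NODE
    return set(range(start, start + CORES_PER_NODE))

def pin_to_local_numa(local_rank):
    """Restrict this process to its GPU's NUMA node (Linux only).

    Call it early in each rank, before DataLoader workers are created,
    so the worker processes inherit the same affinity mask.
    """
    os.sched_setaffinity(0, cores_for_gpu(local_rank))

print(sorted(cores_for_gpu(0))[:3], sorted(cores_for_gpu(5))[:3])
```

In a `torchrun` job the rank index would typically come from the `LOCAL_RANK` environment variable.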
### Expected Effect per Optimization
| Optimization | Expected MFU gain | Difficulty |
|--------|-------------|--------|
| Fix NUMA affinity | +4-9% | Low (edit the launch script) |
| QKV fusion (TransformerEngine) | +8-12% | Medium (edit the model code) |
| FA2 native RoPE | +3-5% | Medium (depends on the FA2 version) |
| NCCL environment variable tuning | +1-2% | Low (add one line) |
### Expected Before/After Comparison
| Item | Current | After optimization |
|------|------|----------|
| MFU | 33.5% | ~45-47% |
| Throughput | 292K tok/s | ~390-410K tok/s |
| Training on 50B tokens | ~47 hours | ~34-36 hours |
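
The projected figures follow from two relations: on fixed hardware and a fixed model, throughput scales linearly with MFU, and training time is tokens divided by throughput. The sketch below reproduces the table under that assumption; the helper names are hypothetical.

```python
BASE_TOK_S = 292_000  # measured throughput at the current MFU
BASE_MFU = 33.5       # percent

def scaled_throughput(target_mfu, base_tok_s=BASE_TOK_S, base_mfu=BASE_MFU):
    """Throughput at a target MFU, assuming linear scaling on the same hardware."""
    return base_tok_s * target_mfu / base_mfu

def hours_for(tokens, tok_s):
    """Wall-clock hours to process `tokens` at a steady `tok_s`."""
    return tokens / tok_s / 3600

for mfu in (33.5, 45.0, 47.0):
    tok_s = scaled_throughput(mfu)
    print(f"MFU {mfu}%: {tok_s / 1000:.0f}K tok/s, 50B tokens in {hours_for(50e9, tok_s):.1f} h")
```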
### Code That Can Be Applied Immediately
**NUMA affinity (launch script):**
```bash
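# One launcher per NUMA node; --nnodes=2 joins both into a single 8-rank job,
# and CUDA_VISIBLE_DEVICES gives each launcher the GPUs attached to its node.
CUDA_VISIBLE_DEVICES=0,1,2,3 numactl --cpunodebind=0 --membind=0 torchrun \
    --nnodes=2 --nproc_per_node=4 --node_rank=0 train/pretrain.py ... &
CUDA_VISIBLE_DEVICES=4,5,6,7 numactl --cpunodebind=1 --membind=1 torchrun \
    --nnodes=2 --nproc_per_node=4 --node_rank=1 train/pretrain.py ... &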
```
**NCCL environment variables:**
```bash
export NCCL_MIN_NCHANNELS=4
export NCCL_SOCKET_NTHREADS=4
export CUDA_DEVICE_MAX_CONNECTIONS=1
```
> After Phase 3 ORPO completes, applying NUMA affinity first, before the next pretraining run, could cut training time by ~30%.
---
## 19. GPU Hardware & Cost Analysis: 3B × 60B Pretraining
> Detailed document: [`docs/GPU_COST_ANALYSIS.md`](docs/GPU_COST_ANALYSIS.md)
### Measured Baseline
```
FRANKENSTALLM Phase 1, measured:
B200 × 8, MFU 33.5%, 292K tok/s
38.5B tokens → 63 hours
Scaled to 60B tokens → about 98 hours
```
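
The 60B-token estimate is a linear extrapolation of the measured run. A minimal sketch of that calculation, with hypothetical helper names:

```python
MEASURED_TOKENS = 38.5e9  # Phase 1 tokens
MEASURED_HOURS = 63       # Phase 1 wall-clock hours

def projected_hours(target_tokens):
    """Scale the measured wall-clock time linearly with token count."""
    return MEASURED_HOURS * target_tokens / MEASURED_TOKENS

implied_tok_s = MEASURED_TOKENS / (MEASURED_HOURS * 3600)
print(f"{projected_hours(60e9):.1f} h for 60B tokens, {implied_tok_s / 1000:.0f}K tok/s wall-clock average")
```

The wall-clock average this implies (about 170K tok/s) is lower than the 292K tok/s figure in the same baseline. The baseline does not break down the difference, so the projection uses the 63-hour measurement rather than the throughput figure.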
### Top 3 Cloud Options by Cost-Effectiveness (60B tokens, after optimization)
| Rank | Configuration | Duration | Total cost |
|------|------|---------|--------|
| 1 | H100×8 Cudo | 44.8hr | **$645** (~KRW 0.93M) |
| 2 | H100×8 Vast.ai | 44.8hr | $670 (~KRW 0.97M) |
| 3 | H100×8 RunPod | 44.8hr | $713 (~KRW 1.03M) |

> B200 Blackwell is faster, but its cloud unit price is 3× that of H100 → **H100 is 4.3× cheaper in total cost**
### Recommended Personal GPU Configurations
| Configuration | VRAM | NVLink | Price | Rating |
|------|------|--------|------|--------|
| A6000 Ada × 2 (used) | 96GB (pooled) | ✅ | ~KRW 10M | ⭐⭐⭐⭐⭐ |
| L40S × 2 | 96GB (pooled) | ✅ | ~KRW 14M | ⭐⭐⭐⭐ |
| RTX Pro 6000 Blackwell | 96GB (single) | ❌ | ~KRW 12M | ⭐⭐⭐ |

> Consumer GPUs (RTX 5090/4090) do not support NVLink. If 80GB+ of pooled memory is needed, professional-grade cards are required.
### Recommended Strategy: Local + Cloud Hybrid
```
[Local] RTX 4090 × 4 (KRW 8.8M): data preprocessing, experiments, SFT/ORPO
[Cloud] H100×8 (~KRW 1.03M per run): main pretraining only
```
---
## Closing
This project has one motto:

> **"We record the failures too."**

The loss=0.0 failure of SFT v1, torch.compile turning out to have no effect, the frustration of an 18% repetition rate: all of it remains on record. That tradition continues in Phase 3 ORPO. It took **five failures** (NCCL timeout, config conflict, QKV conversion bug, port conflict, TRL NaN bug) before the 6-config HP sweep finally ran.

Just as Frankenstein stitched pieces together to create life, we are stitching together data and techniques from many sources to build a model that understands and speaks Korean. It is not finished yet, but the process itself is the value of this project.

Phase 1 pretraining finished at 57,000 steps with loss 1.466. Phase 2 SFT stopped early at 25,500 steps (val_loss 1.8851) and passed 4 of 6 dimensions in the comprehensive evaluation.

**The good news**: knowledge retention is nearly perfect (forgetting 0.9%). SFT did not destroy the base model's knowledge. The EOS termination rate rose from 0% to 60%, and MMLU-KO improved by +3.2pp.

**The not-so-good news**: the greedy repetition rate is 72.97%. SFT alone did not solve the repetition problem; it made it worse (Base 60.99% → SFT 72.97%). Yet applying `rep_penalty=1.2` alone brings the repetition rate to 0%. The model is capable of not repeating; it simply has not learned that as its default behavior.

**Now**: Phase 3 ORPO main training is in progress. The 6-config HP sweep is complete, and the best config by eval_loss (lr=1.2e-5, beta=0.25) has been selected. The throughput benchmark confirmed that batch_size=4 with grad_accum=4 is the best combination at 80.63 samples/s, and the main run was started on all 8 B200 GPUs: ~9,840 steps, an estimated ~4.8 hours. When training completes, the watchdog automatically runs the 10-dimension comprehensive evaluation (3-way Base vs SFT vs ORPO comparison).

> **Can ORPO bring the greedy repetition rate below 5%?**

The answer is coming soon. Once training ends, the 10-dimension evaluation will run, and if the model passes it will be converted to GGUF and served on Ollama: a 3B model that understands and speaks Korean, built from scratch.

---
*Last updated: 2026-03-09*
*Current status: Phase 3 ORPO main training in progress (lr=1.2e-5, beta=0.25, step ~1,660/9,840, 17%); the 10-dimension comprehensive evaluation is queued to run automatically when training completes*