# Data Gap Analysis Report
> Date: 2026-02-27 | Model: 3B parameter LLM
## 1. Current Data Inventory
### 1.1 Pretrain Data (tokenized, .bin)
| File | Size | Tokens (uint16) |
|------|------|-----------------|
| korean_train.bin | 17GB | **8.9B** |
| korean_c4_train.bin | 15GB | 7.56B |
| korean_namuwiki_train.bin | 2.1GB | 1.08B |
| korean_wiki_train.bin | 500MB | 0.26B |
| train.bin (English) | 1.2GB | 0.60B |
| **Total (tokenized)** | | **~18.4B tokens** |
> ⚠️ `korean_train.bin` is very likely a merge of c4 + namuwiki + wiki (7.56B + 1.08B + 0.26B = 8.9B) → actual unique tokens are around **~9B**
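
The merge suspicion above can be checked with simple arithmetic. The following is a minimal sketch, assuming each token is stored as a 2-byte uint16 (as the table header states); the helper name `tokens_from_bytes` and the 0.05B tolerance are illustrative choices, not part of the original pipeline.

```python
BYTES_PER_TOKEN = 2  # uint16, per the table header


def tokens_from_bytes(n_bytes: int) -> float:
    """Return the token count, in billions, of a uint16 token file."""
    return n_bytes / BYTES_PER_TOKEN / 1e9


# Token counts in billions, copied from the table above.
reported = {
    "korean_train.bin": 8.9,
    "korean_c4_train.bin": 7.56,
    "korean_namuwiki_train.bin": 1.08,
    "korean_wiki_train.bin": 0.26,
    "train.bin": 0.60,
}

components = [
    "korean_c4_train.bin",
    "korean_namuwiki_train.bin",
    "korean_wiki_train.bin",
]
component_sum = sum(reported[name] for name in components)

# If korean_train.bin is a merge, its token count should match the sum of its parts.
TOLERANCE_B = 0.05  # assumed tolerance for rounding in the reported figures
looks_merged = abs(component_sum - reported["korean_train.bin"]) < TOLERANCE_B

naive_total = sum(reported.values())
# Drop the merged file so its components are not counted twice.
unique_total = naive_total - reported["korean_train.bin"] if looks_merged else naive_total

print(f"component sum: {component_sum:.2f}B, merged: {looks_merged}")
print(f"naive total: {naive_total:.1f}B, unique total: {unique_total:.1f}B")
```

With the reported figures the unique total comes to about 9.5B, which this report rounds to ~9B. For an exact count, pass the real byte size of each file (for example from `os.path.getsize`) to `tokens_from_bytes`, since the sizes in the table are rounded.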
### 1.2 Untokenized Raw Data (korean_extra/)
| Source | Disk size | Estimated tokens | Quality grade |
|------|-----------|-------------|---------|
| CulturaX ko | 60GB | ~15B | B+ |
| HPLT ko | 23GB | ~5B | B |
| cc100 ko | 14GB | ~3.5B | C+ |
| OSCAR ko | 9.2GB | ~2.3B | B |
| korean_textbooks | 6.4GB | ~1.5B | A |
| korean_webtext | 4.2GB | ~1B | B+ |
| finepdfs_edu_ko | 2.9GB | ~0.7B | A- |
| namuwiki_extracted | 2.2GB | ~0.5B | A- |
| wikipedia_korean | 1.7GB | ~0.4B | A |
| kovast | 449MB | ~0.1B | B |
| **Subtotal** | **~124GB** | **~30B** | |
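
The token estimates in this table imply a roughly constant bytes-per-token ratio. The sketch below recovers that ratio from the table rows; it assumes decimal gigabytes (1GB = 1e9 bytes), and the figures are copied from the table rather than measured.

```python
# (disk size in GB, estimated tokens in billions), copied from the table above.
sources = {
    "CulturaX ko": (60.0, 15.0),
    "HPLT ko": (23.0, 5.0),
    "cc100 ko": (14.0, 3.5),
    "OSCAR ko": (9.2, 2.3),
    "korean_textbooks": (6.4, 1.5),
    "korean_webtext": (4.2, 1.0),
    "finepdfs_edu_ko": (2.9, 0.7),
    "namuwiki_extracted": (2.2, 0.5),
    "wikipedia_korean": (1.7, 0.4),
    "kovast": (0.449, 0.1),
}


def bytes_per_token(size_gb: float, tokens_b: float) -> float:
    """GB and billions of tokens share the 1e9 factor, so it cancels out."""
    return size_gb / tokens_b


ratios = {name: bytes_per_token(size, tok) for name, (size, tok) in sources.items()}
total_gb = sum(size for size, _ in sources.values())
total_tokens_b = sum(tok for _, tok in sources.values())
overall_ratio = total_gb / total_tokens_b

for name, ratio in ratios.items():
    print(f"{name:20s} {ratio:.2f} bytes/token")
print(f"overall: {total_gb:.0f}GB / {total_tokens_b:.0f}B tokens = {overall_ratio:.2f} bytes/token")
```

Every source lands between about 4.0 and 4.6 bytes per token, so the estimates appear to come from a flat size-based heuristic. Actual counts will depend on the tokenizer and on how much text survives cleaning.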
### 1.3 SFT Data
- train.jsonl: 161,848 samples (276MB)
- val.jsonl: 8,518 samples (15MB)
- Sources: evol_instruct_ko, korean_safe_conv, and others
### 1.4 Preference Data
- **Currently on hand: 0** ❌
### Totals
| Stage | On hand |
|------|--------|
| Pretrain (tokenized) | ~9B tokens |
| Pretrain (unprocessed) | ~30B tokens |
| **Pretrain total** | **~39B tokens** |
| SFT | 170K samples |
| Preference | 0 |
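
The totals above can be reproduced from the per-section figures. This is a small sketch of that arithmetic only; the variable names are illustrative.

```python
# Figures copied from sections 1.1 to 1.3 of this report.
pretrain_tokenized_b = 9.0  # ~9B unique tokenized tokens
pretrain_raw_b = 30.0       # ~30B estimated tokens in raw data
sft_train = 161_848
sft_val = 8_518

pretrain_total_b = pretrain_tokenized_b + pretrain_raw_b
sft_total = sft_train + sft_val
val_fraction = sft_val / sft_total

print(f"Pretrain total: ~{pretrain_total_b:.0f}B tokens")
print(f"SFT total: {sft_total:,} samples ({val_fraction:.1%} held out for validation)")
```

The SFT split works out to 170,366 samples with about 5% held out, which the table rounds to 170K.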
---
## 2. 3B Model Training Requirements vs. Current Holdings
### 2.1 Pretrain
| Criterion | Required tokens | Current | Gap | Status |
|------|----------|------|-----|------|
| Over-trained (×70) | 210B | 39B | -171B | 🔴 Severely short |
| Chinchilla compute-optimal (×20) | 60B | 39B | -21B | 🟡 Short |
| LLaMA-style (×33) | 100B | 39B | -61B | 🔴 Short |
| **Practical target** | **60~80B** | **39B** | **-21~41B** | 🟡 |

**Conclusion:** We are **21B tokens short** of the minimum bar (60B, about 20 tokens per parameter). Realistically, a 60~80B target requires an additional 21~41B tokens.
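
Each required-token figure is the parameter count multiplied by a tokens-per-parameter ratio, and the gap is the current holding minus that requirement. A minimal sketch of the calculation, using the ×20, ×33, and ×70 ratios from the table (the function names are illustrative):

```python
PARAMS_B = 3.0    # model size, in billions of parameters
CURRENT_B = 39.0  # pretrain tokens on hand, in billions

# Tokens-per-parameter ratios taken from the table above.
RATIOS = {"x20": 20, "x33": 33, "x70": 70}


def required_tokens_b(params_b: float, tokens_per_param: float) -> float:
    """Required pretrain tokens, in billions."""
    return params_b * tokens_per_param


def gap_b(required_b: float, current_b: float = CURRENT_B) -> float:
    """Current minus required; negative values mean a shortfall."""
    return current_b - required_b


for label, ratio in RATIOS.items():
    need = required_tokens_b(PARAMS_B, ratio)
    print(f"{label}: need {need:.0f}B, gap {gap_b(need):+.0f}B")
```

The ×33 row computes to 99B, which the table rounds up to 100B (and the gap to -61B).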
### 2.2 SFT
| Criterion | Required | Current | Gap | Status |
|------|--------|------|-----|------|
| Minimum, high quality | 50K | 170K | Sufficient | 🟢 |
| Industry standard | 100~200K | 170K | Sufficient | 🟢 |
| Domain diversity | Diverse tasks | Limited | Needs supplementing | 🟡 |

**Conclusion:** The quantity is sufficient, but domain coverage (math, code, reasoning) needs to be strengthened.
### 2.3 Preference (ORPO/DPO)
| Criterion | Required | Current | Gap | Status |
|------|--------|------|-----|------|
| Minimum | 5K pairs | 0 | -5K | 🔴 |
| Adequate | 20~60K pairs | 0 | -20~60K | 🔴 |

**Conclusion:** **Severe gap.** With zero pairs on hand, ORPO/DPO training cannot be run at all.
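
Once preference data starts arriving, progress against the 5K minimum can be tracked by counting usable pairs. The sketch below assumes a JSONL layout with `prompt`, `chosen`, and `rejected` string fields; that schema is an assumption for illustration, since this report does not define one.

```python
import json

# Assumed schema, not specified in this report.
REQUIRED_KEYS = ("prompt", "chosen", "rejected")
MIN_PAIRS = 5_000  # minimum from the table above


def is_valid_pair(record: dict) -> bool:
    """A pair is usable if all fields are non-empty strings and the answers differ."""
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            return False
    return record["chosen"].strip() != record["rejected"].strip()


def count_valid_pairs(lines) -> int:
    """Count usable preference pairs in an iterable of JSONL lines."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of failing the whole scan
        if isinstance(record, dict) and is_valid_pair(record):
            count += 1
    return count


def shortfall(valid_pairs: int, target: int = MIN_PAIRS) -> int:
    """Pairs still missing to reach the target."""
    return max(0, target - valid_pairs)


sample = [
    '{"prompt": "Q1", "chosen": "good", "rejected": "bad"}',
    '{"prompt": "Q2", "chosen": "same", "rejected": "same"}',
    'not json',
]
print(count_valid_pairs(sample), shortfall(count_valid_pairs(sample)))
```

Rejecting pairs whose chosen and rejected answers are identical is a design choice: such pairs carry no preference signal.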
---
## 3. Positioning vs. Competing Models
| Model | Parameters | Pretrain tokens | vs. ours |
|------|---------|-------------|----------|
| Polyglot-Ko 12.8B | 12.8B | 1.2T | 30× |
| EXAONE 3.0 | 7.8B | 8T | 200× |
| HyperCLOVA X | Undisclosed | Hundreds of B to several T | 10~100× |
| Phi-3 mini 3.8B | 3.8B | 3.3T | 85× |
| StableLM 3B | 3B | 4T | 100× |
| **Ours (target)** | **3B** | **60~80B** | **Baseline** |
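
**Analysis:**
- Our 60~80B target is about 20~27 tokens per parameter, which is at or slightly above the Chinchilla compute-optimal ratio (about 20×) for a 3B model
- Large models use 10~100× more data, but the models themselves are also 2~40× larger
- **60B tokens is a reasonable minimum for 3B**: in academic work, 3B-class models show good results at 50~100B tokens
- Quality filtering + curriculum learning can partly compensate for the smaller token budget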
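
The multiples in the comparison table can be recomputed from the listed token counts. In this sketch the figures are copied from the table; the choice of 39B as the denominator is an inference, since the listed multiples match our current holdings more closely than the 60~80B target.

```python
# (parameters in billions, pretrain tokens in billions), copied from the table above.
models = {
    "Polyglot-Ko 12.8B": (12.8, 1200.0),
    "EXAONE 3.0": (7.8, 8000.0),
    "Phi-3 mini 3.8B": (3.8, 3300.0),
    "StableLM 3B": (3.0, 4000.0),
}

OUR_CURRENT_B = 39.0  # assumed baseline for the "vs. ours" column


def multiple_of_ours(tokens_b: float, ours_b: float = OUR_CURRENT_B) -> float:
    """How many times more pretrain tokens than we hold."""
    return tokens_b / ours_b


def tokens_per_param(params_b: float, tokens_b: float) -> float:
    """Training tokens per model parameter."""
    return tokens_b / params_b


multiples = {name: multiple_of_ours(tok) for name, (_, tok) in models.items()}
per_param = {name: tokens_per_param(p, tok) for name, (p, tok) in models.items()}

for name in models:
    print(f"{name:20s} {multiples[name]:6.1f}x ours, {per_param[name]:7.1f} tokens/param")
print(f"Ours at 60~80B: {tokens_per_param(3.0, 60.0):.0f}~{tokens_per_param(3.0, 80.0):.0f} tokens/param")
```

Looking at tokens per parameter rather than raw token counts makes the comparison independent of model size, and it is the figure the bullets above reason about.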
---
## 4. Data Quality Analysis
### Current Quality Distribution (by estimated tokens)
```
Grade A (high quality): ~3.0B (8%)  - wiki, textbooks, finepdfs_edu
Grade B (good):         ~24B  (61%) - CulturaX, OSCAR, HPLT, webtext
Grade C (noisy):        ~12B  (31%) - cc100, other web crawl
```

**Problems:**
- The high-quality (grade A) share is **very low at 8%**
- Code/math/science data: **none at all**
- The English share is extremely small (0.6B) → weak multilingual ability
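
To make the low grade A share concrete, the sketch below computes each grade's share and the amount of additional grade A data needed to reach a target share. The 20% target is a hypothetical value chosen for illustration; this report does not set one.

```python
# Estimated tokens per grade, in billions, from the distribution above.
grade_tokens_b = {"A": 3.0, "B": 24.0, "C": 12.0}

total_b = sum(grade_tokens_b.values())
shares = {grade: tokens / total_b for grade, tokens in grade_tokens_b.items()}


def extra_grade_a_needed(a_b: float, total_b: float, target_share: float) -> float:
    """Solve (a + x) / (total + x) = target_share for x, in billions of tokens."""
    if not 0.0 < target_share < 1.0:
        raise ValueError("target_share must be strictly between 0 and 1")
    return max(0.0, (target_share * total_b - a_b) / (1.0 - target_share))


TARGET_SHARE = 0.20  # hypothetical target, not from this report
extra_b = extra_grade_a_needed(grade_tokens_b["A"], total_b, TARGET_SHARE)

for grade, share in shares.items():
    print(f"Grade {grade}: {share:.1%}")
print(f"Extra grade A tokens for a {TARGET_SHARE:.0%} share: ~{extra_b:.1f}B")
```

Because new grade A data also increases the total, the required amount grows faster than the target share. Under this hypothetical 20% target it comes to roughly 6B extra tokens.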
---
## 5. Key Conclusions
### Is the current data enough to train a 3B model?
## **No**, it is insufficient for the following reasons:
1. **Not enough pretrain tokens** (39B vs. a 60B minimum, a 21B gap)
2. **No preference data** (ORPO training is not possible)
3. **No code/math data at all** (limits general-purpose ability)
4. **Low high-quality ratio** (8%)
5. **Not enough English data** (limits cross-lingual transfer)
### Summary of Missing Data Types
| Type | Severity | Required action |
|------|--------|----------|
| Pretrain tokens | 🟡 Medium | Secure +21~41B tokens |
| Code data | 🔴 Severe | Add a code corpus (5~10B) |
| Math/science | 🔴 Severe | Add specialized corpora (2~5B) |
| English data | 🟡 Medium | Add 10~20B of high-quality English |
| Preference | 🔴 Severe | Secure 20K+ pairs |
| SFT diversity | 🟡 Medium | Add code/math/reasoning SFT data |
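
The planned domain additions can be compared against the overall pretrain gap. The sketch below assumes that code, math/science, and English tokens all count toward the 21~41B shortfall; the table does not state this explicitly, so the result is an illustration and not a plan.

```python
# Planned additions in billions of tokens (low, high), from the table above.
planned_additions_b = {
    "code": (5.0, 10.0),
    "math_science": (2.0, 5.0),
    "english": (10.0, 20.0),
}
PRETRAIN_GAP_B = (21.0, 41.0)  # shortfall against the 60~80B target

planned_low = sum(low for low, _ in planned_additions_b.values())
planned_high = sum(high for _, high in planned_additions_b.values())

# Assumption: every planned token counts toward the pretrain gap.
# Best case: smallest gap with the largest additions; worst case: the reverse.
remaining_min = max(0.0, PRETRAIN_GAP_B[0] - planned_high)
remaining_max = max(0.0, PRETRAIN_GAP_B[1] - planned_low)

print(f"Planned additions: {planned_low:.0f}~{planned_high:.0f}B")
print(f"Remaining gap after additions: {remaining_min:.0f}~{remaining_max:.0f}B")
```

Under that assumption, the domain additions cover 17~35B of the gap, leaving between 0 and 24B to be filled by other sources.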