# Full Data Inventory: Measured Survey Results

> Survey date: 2026-02-27 | Total disk usage: **195GB**

---
## 1. Pretrain Data (.bin files): Ready to Use

| File | Size | Estimated tokens | Notes |
|------|------|------------------|-------|
| `korean_train.bin` | 17GB | **8.9B** | Merged (c4 + wiki + namuwiki) |
| `korean_val.bin` | 35MB | 17.9M | Merged val |
| `korean_c4_train.bin` | 15GB | **7.5B** | Korean C4 |
| `korean_c4_val.bin` | 29MB | 15.2M | |
| `korean_namuwiki_train.bin` | 2.1GB | **1.1B** | Namuwiki |
| `korean_namuwiki_val.bin` | 4.2MB | 2.2M | |
| `korean_wiki_train.bin` | 500MB | **261.8M** | Korean Wikipedia |
| `korean_wiki_val.bin` | 1.1MB | 524K | |
| `train.bin` | 1.2GB | **605M** | English wiki (Shakespeare, etc.) |
| `val.bin` | 5.8MB | 3.0M | |

### Pretrain Token Totals

- **korean_train.bin (merged)**: 8.9B tokens, the merge of C4 + Wiki + Namuwiki
- **Sum of individual files** (c4 7.5B + wiki 0.26B + namuwiki 1.1B = 8.86B), which matches the merged file after rounding
- **English train.bin**: 605M tokens
- ⚠️ **korean_train.bin is a merge of the individual .bin files, so take care not to count them twice**
- **Non-overlapping Pretrain total: ~9.5B tokens** (Korean 8.9B + English 0.6B)
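
The estimates in the table work out to roughly two bytes per token, which would fit `.bin` files that store token IDs as a flat `uint16` array. That storage format is an assumption here, not something this survey confirms. Under that assumption, a minimal sketch of the size-based estimate and of the non-overlapping total might look like this (the function names are illustrative):

```python
import os

BYTES_PER_TOKEN = 2  # assumption: token IDs are stored as uint16


def estimate_tokens(path: str, bytes_per_token: int = BYTES_PER_TOKEN) -> int:
    """Estimate the token count of a flat binary token file from its size."""
    size = os.path.getsize(path)
    if size % bytes_per_token != 0:
        raise ValueError(f"{path}: size {size} is not a multiple of {bytes_per_token}")
    return size // bytes_per_token


def pretrain_total(token_counts: dict, merged_parts: set) -> int:
    """Sum token counts, skipping files that are already inside a merged file."""
    return sum(n for name, n in token_counts.items() if name not in merged_parts)


# Rounded token counts from the table above.
TOKEN_COUNTS = {
    "korean_train.bin": 8_900_000_000,
    "korean_c4_train.bin": 7_500_000_000,
    "korean_namuwiki_train.bin": 1_100_000_000,
    "korean_wiki_train.bin": 261_800_000,
    "train.bin": 605_000_000,
}
# These three are already merged into korean_train.bin.
MERGED_PARTS = {
    "korean_c4_train.bin",
    "korean_namuwiki_train.bin",
    "korean_wiki_train.bin",
}

total = pretrain_total(TOKEN_COUNTS, MERGED_PARTS)
```

With the rounded figures from the table, `total` comes to about 9.5B tokens, in line with the non-overlapping total listed above.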
---

## 2. korean_extra (HuggingFace downloads): Processing Required

| Directory | Size | Format | Estimated tokens |
|-----------|------|--------|------------------|
| `culturax_ko` | 60GB | parquet | ~15B+ |
| `hplt_ko` | 23GB | parquet | ~6B |
| `cc100_ko` | 14GB | parquet/txt | ~3.5B |
| `oscar_ko` | 9.2GB | parquet | ~2.3B |
| `korean_textbooks` | 6.4GB | parquet | ~1.6B |
| `korean_webtext` | 4.2GB | parquet | ~1B |
| `finepdfs_edu_ko` | 2.9GB | parquet | ~700M |
| `namuwiki_extracted` | 2.2GB | parquet | ~550M |
| `wikipedia_korean` | 1.7GB | parquet | ~400M |
| `kovast` | 449MB | parquet | ~110M |
| `evol_instruct_ko` | 144MB | parquet/json | ~35M (for SFT) |
| `korean_safe_conv` | 51MB | parquet/json | ~12M (for SFT) |

**korean_extra total: ~123GB, estimated ~30B+ tokens** (estimated from the raw text, before tokenization)
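
The per-directory estimates above work out to roughly four bytes of on-disk data per token (for example, 60GB for ~15B tokens). That ratio is inferred from the table rather than stated by the survey, and real counts will depend on the tokenizer and on parquet compression. As a rough sketch under that assumption:

```python
BYTES_PER_TOKEN_RAW = 4  # assumption inferred from the table, not a measured value

# Directory sizes in GB, copied from the table above.
SIZES_GB = {
    "culturax_ko": 60,
    "hplt_ko": 23,
    "cc100_ko": 14,
    "oscar_ko": 9.2,
    "korean_textbooks": 6.4,
    "korean_webtext": 4.2,
    "finepdfs_edu_ko": 2.9,
    "namuwiki_extracted": 2.2,
    "wikipedia_korean": 1.7,
    "kovast": 0.449,
    "evol_instruct_ko": 0.144,
    "korean_safe_conv": 0.051,
}


def estimate_tokens_billion(size_gb: float,
                            bytes_per_token: float = BYTES_PER_TOKEN_RAW) -> float:
    """GB divided by bytes-per-token gives billions of tokens."""
    return size_gb / bytes_per_token


total_gb = sum(SIZES_GB.values())
total_tokens_b = sum(estimate_tokens_billion(s) for s in SIZES_GB.values())
```

The rounded sizes sum to about 124GB and the estimate lands a little above 30B tokens, close to the totals quoted above. These figures should be replaced with real counts once the data has been tokenized.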
---

## 3. SFT Data: Ready to Use

| File | Size | Samples |
|------|------|---------|
| `sft/train.jsonl` | 276MB | **161,848** |
| `sft/val.jsonl` | 15MB | **8,518** |

- **Total SFT samples: 170,366**
- Format: instruction/output pairs, Korean translated data
- Quality: good (natural Korean, diverse topics)
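
The sample counts above can be re-checked by scanning the JSONL files. The sketch below assumes each line is a JSON object with string fields named `instruction` and `output`. Those key names follow the format described above, but the exact schema is an assumption.

```python
import json

# Assumption: field names match the instruction/output format described above.
REQUIRED_KEYS = ("instruction", "output")


def count_valid_samples(path: str, required_keys=REQUIRED_KEYS) -> tuple:
    """Return (valid, invalid) record counts for a JSONL file."""
    valid = invalid = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # blank lines are ignored, not counted as invalid
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                invalid += 1
                continue
            # A record is valid when every required key holds a non-empty string.
            if isinstance(record, dict) and all(
                isinstance(record.get(k), str) and record[k].strip()
                for k in required_keys
            ):
                valid += 1
            else:
                invalid += 1
    return valid, invalid
```

For example, `count_valid_samples("sft/train.jsonl")` would be expected to report 161,848 valid records if every sample follows that schema.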
---

## 4. Raw Text Data: Already Converted to .bin

| Directory | Size | Files | Notes |
|-----------|------|-------|-------|
| `raw/c4_ko/` | 30GB | 50 txt | → converted to `korean_c4_train.bin` |
| `raw/namuwiki_ko/` | 5.7GB | 6 txt | → converted to `korean_namuwiki_train.bin` |
| `raw/ko_wiki_*.txt` | 1.2GB | 5 txt | → converted to `korean_wiki_train.bin` |
| `raw/en_wiki_*.txt` | 1.2GB | 3 txt | → converted to `train.bin` |
| **raw total** | **38GB** | **64** | Can be deleted (saves disk space) |
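
Before deleting anything under `raw/`, it may be worth confirming that each converted `.bin` actually exists. The dry-run sketch below only lists candidates and deletes nothing. It assumes the `.bin` files and the `raw/` directory share one root (the hypothetical `data_root` argument) and that the txt files sit directly inside `raw/c4_ko/` and `raw/namuwiki_ko/`.

```python
import glob
import os

# Raw source pattern -> converted .bin, following the table above.
# The "*.txt" patterns for the two directories are an assumed layout.
RAW_TO_BIN = {
    "raw/c4_ko/*.txt": "korean_c4_train.bin",
    "raw/namuwiki_ko/*.txt": "korean_namuwiki_train.bin",
    "raw/ko_wiki_*.txt": "korean_wiki_train.bin",
    "raw/en_wiki_*.txt": "train.bin",
}


def deletable_raw_files(data_root: str, mapping=RAW_TO_BIN) -> list:
    """List raw files whose converted .bin exists and is non-empty (dry run)."""
    candidates = []
    for pattern, bin_name in mapping.items():
        bin_path = os.path.join(data_root, bin_name)
        if not (os.path.isfile(bin_path) and os.path.getsize(bin_path) > 0):
            continue  # conversion output missing or empty: keep the raw files
        candidates.extend(sorted(glob.glob(os.path.join(data_root, pattern))))
    return candidates
```

The existence check is deliberately weak: it does not prove the conversion was complete, so comparing token counts against the table before removing the raw text would be a sensible extra step.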
---

## 5. Overall Summary

### Ready to Use

| Purpose | Data | Scale |
|---------|------|-------|
| **Pretrain** | korean_train.bin + train.bin | **9.5B tokens** |
| **SFT** | sft/train.jsonl | **161,848 samples** |

### Available After Processing

| Source | Estimated scale | Work needed |
|--------|-----------------|-------------|
| korean_extra (all) | **~30B+ tokens** | Tokenize → convert to .bin |
| evol_instruct_ko + korean_safe_conv | **~47M tokens (SFT)** | Convert to JSONL |

### Reclaimable Disk Space

- `raw/` 38GB: already converted to .bin, can be deleted
- Individual .bin files (c4/wiki/namuwiki): redundant after the merge into korean_train.bin, can be deleted (~18GB)

### Final Potential

- **Pretrain**: current 9.5B + korean_extra 30B+ = **~40B tokens obtainable**
- **SFT**: current 162K + additional conversions = **~200K+ samples possible**