# Full Data Inventory: Measured Results
> Survey date: 2026-02-27 | Total disk usage: **195GB**
---
## 1. Pretrain data (.bin files): ready to use
| File | Size | Estimated tokens | Notes |
|------|------|------------------|-------|
| `korean_train.bin` | 17GB | **8.9B** | Combined (merge of c4 + wiki + namuwiki) |
| `korean_val.bin` | 35MB | 17.9M | Combined val |
| `korean_c4_train.bin` | 15GB | **7.5B** | Korean C4 |
| `korean_c4_val.bin` | 29MB | 15.2M | |
| `korean_namuwiki_train.bin` | 2.1GB | **1.1B** | Namuwiki |
| `korean_namuwiki_val.bin` | 4.2MB | 2.2M | |
| `korean_wiki_train.bin` | 500MB | **261.8M** | Korean Wikipedia |
| `korean_wiki_val.bin` | 1.1MB | 524K | |
| `train.bin` | 1.2GB | **605M** | English wiki (Shakespeare, etc.) |
| `val.bin` | 5.8MB | 3.0M | |
### Pretrain token totals
- **korean_train.bin (combined)**: 8.9B tokens, merged from C4 + Wiki + Namuwiki
- **Sum of the individual files** (c4 7.5B + wiki 0.26B + namuwiki 1.1B = 8.86B): consistent with the combined file
- **English train.bin**: 605M tokens
- ⚠️ **korean_train.bin is a merge of the individual .bin files, so do not count both**
- **Deduplicated pretrain total: ~9.5B tokens** (Korean 8.9B + English 0.6B)
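
The double-counting caveat can be checked with a few lines of arithmetic. The sketch below only re-adds the estimates from the table; the dictionary and function names are illustrative, not part of any existing tooling.

```python
# Estimated token counts in billions, copied from the table above.
TOKENS_B = {
    "korean_train.bin": 8.9,  # merge of the three Korean files below
    "korean_c4_train.bin": 7.5,
    "korean_namuwiki_train.bin": 1.1,
    "korean_wiki_train.bin": 0.2618,
    "train.bin": 0.605,
}

# Files already contained in korean_train.bin; adding them again double counts.
MERGED_INTO_KOREAN_TRAIN = (
    "korean_c4_train.bin",
    "korean_namuwiki_train.bin",
    "korean_wiki_train.bin",
)


def parts_total(tokens=TOKENS_B):
    """Sum of the individual Korean files (should be close to the merged file)."""
    return sum(tokens[name] for name in MERGED_INTO_KOREAN_TRAIN)


def deduplicated_total(tokens=TOKENS_B):
    """Merged Korean file plus the English file, with no file counted twice."""
    return tokens["korean_train.bin"] + tokens["train.bin"]


if __name__ == "__main__":
    print(f"sum of parts:       {parts_total():.2f}B")
    print(f"deduplicated total: {deduplicated_total():.2f}B")
```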
---
## 2. korean_extra (HuggingFace downloads): processing required
| Directory | Size | Format | Estimated tokens |
|-----------|------|--------|------------------|
| `culturax_ko` | 60GB | parquet | ~15B+ |
| `hplt_ko` | 23GB | parquet | ~6B |
| `cc100_ko` | 14GB | parquet/txt | ~3.5B |
| `oscar_ko` | 9.2GB | parquet | ~2.3B |
| `korean_textbooks` | 6.4GB | parquet | ~1.6B |
| `korean_webtext` | 4.2GB | parquet | ~1B |
| `finepdfs_edu_ko` | 2.9GB | parquet | ~700M |
| `namuwiki_extracted` | 2.2GB | parquet | ~550M |
| `wikipedia_korean` | 1.7GB | parquet | ~400M |
| `kovast` | 449MB | parquet | ~110M |
| `evol_instruct_ko` | 144MB | parquet/json | ~35M (for SFT) |
| `korean_safe_conv` | 51MB | parquet/json | ~12M (for SFT) |

**korean_extra total: ~123GB, estimated ~30B+ tokens** (before tokenization, based on raw text)
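
As a sanity check on these totals, the sketch below re-adds the rounded per-directory figures. The rows imply roughly 4 bytes on disk per estimated token (for example 60GB for ~15B); whether that ratio was the rule actually used to produce the estimates is an assumption. All names in the code are illustrative.

```python
# (size in GB, estimated tokens in billions), copied from the table above.
# culturax_ko is listed as "~15B+", so 15.0 is a lower bound.
KOREAN_EXTRA = {
    "culturax_ko": (60.0, 15.0),
    "hplt_ko": (23.0, 6.0),
    "cc100_ko": (14.0, 3.5),
    "oscar_ko": (9.2, 2.3),
    "korean_textbooks": (6.4, 1.6),
    "korean_webtext": (4.2, 1.0),
    "finepdfs_edu_ko": (2.9, 0.7),
    "namuwiki_extracted": (2.2, 0.55),
    "wikipedia_korean": (1.7, 0.4),
    "kovast": (0.449, 0.11),
    "evol_instruct_ko": (0.144, 0.035),
    "korean_safe_conv": (0.051, 0.012),
}

# Directories the table marks as SFT material rather than pretraining text.
SFT_ONLY = ("evol_instruct_ko", "korean_safe_conv")


def totals(table=KOREAN_EXTRA, exclude=()):
    """Return (total size in GB, total estimated tokens in billions)."""
    rows = [v for name, v in table.items() if name not in exclude]
    return sum(size for size, _ in rows), sum(tok for _, tok in rows)


def bytes_per_token(size_gb, tokens_b):
    """GB divided by billions of tokens gives bytes per token."""
    return size_gb / tokens_b


if __name__ == "__main__":
    size_gb, tokens_b = totals()
    print(f"all directories: {size_gb:.1f}GB, {tokens_b:.1f}B tokens")
    size_gb, tokens_b = totals(exclude=SFT_ONLY)
    print(f"pretrain only:   {size_gb:.1f}GB, {tokens_b:.1f}B tokens")
```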
---
## 3. SFT data: ready to use
| File | Size | Samples |
|------|------|---------|
| `sft/train.jsonl` | 276MB | **161,848** |
| `sft/val.jsonl` | 15MB | **8,518** |
- **Total SFT samples: 170,366**
- Format: instruction/output pairs, Korean-translated data
- Quality: good (natural Korean, diverse topics)
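
Sample counts like these can be re-measured by scanning the JSONL files line by line. The sketch below is one way to do it; the field names `instruction` and `output` come from the format note above, but the exact schema (and that both fields are non-empty strings) is an assumption.

```python
import json

REQUIRED_KEYS = ("instruction", "output")  # assumed field names


def count_samples(path):
    """Return (valid, invalid) record counts for a JSONL file."""
    valid = invalid = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # blank lines are not samples
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                invalid += 1
                continue
            # A valid record is a JSON object with non-empty string fields.
            if isinstance(record, dict) and all(
                isinstance(record.get(key), str) and record[key]
                for key in REQUIRED_KEYS
            ):
                valid += 1
            else:
                invalid += 1
    return valid, invalid
```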
---
## 4. Raw text data: already converted to .bin
| Directory | Size | Files | Notes |
|-----------|------|-------|-------|
| `raw/c4_ko/` | 30GB | 50 txt | Converted to `korean_c4_train.bin` |
| `raw/namuwiki_ko/` | 5.7GB | 6 txt | Converted to `korean_namuwiki_train.bin` |
| `raw/ko_wiki_*.txt` | 1.2GB | 5 txt | Converted to `korean_wiki_train.bin` |
| `raw/en_wiki_*.txt` | 1.2GB | 3 txt | Converted to `train.bin` |
| **raw total** | **38GB** | **64** | Can be deleted (frees disk space) |
---
## 5. Overall summary
### Ready to use
| Purpose | Data | Scale |
|---------|------|-------|
| **Pretrain** | korean_train.bin + train.bin | **9.5B tokens** |
| **SFT** | sft/train.jsonl | **161,848 samples** |
### Available after processing
| Source | Estimated scale | Work required |
|--------|-----------------|---------------|
| korean_extra (all) | **~30B+ tokens** | Tokenize, then convert to .bin |
| evol_instruct_ko + korean_safe_conv | **~47M tokens (SFT)** | Convert to JSONL |
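
The "tokenize, then convert to .bin" step could look roughly like the sketch below. It assumes the `.bin` files are flat arrays of 16-bit token ids, which is only inferred from the ratio in section 1 (17GB for 8.9B tokens, about 2 bytes per token). `byte_tokenize` is a toy stand-in for the project's real tokenizer, and `eot_id` is a hypothetical separator id.

```python
import os
from array import array


def byte_tokenize(text):
    """Toy stand-in tokenizer: one token id per UTF-8 byte (ids 0..255)."""
    return list(text.encode("utf-8"))


def write_bin(docs, out_path, tokenize=byte_tokenize, eot_id=256):
    """Write each document's token ids plus eot_id as unsigned 16-bit ints.

    Returns the number of tokens written. Ids must be below 65536 to fit.
    """
    total = 0
    with open(out_path, "wb") as f:
        for doc in docs:
            ids = tokenize(doc) + [eot_id]
            array("H", ids).tofile(f)  # "H" = unsigned short, native byte order
            total += len(ids)
    return total


def count_tokens(bin_path):
    """Token count of a .bin file, derived from its size on disk."""
    return os.path.getsize(bin_path) // array("H").itemsize
```

Deriving the count from the file size, as `count_tokens` does, is also a plausible way the token estimates in section 1 were produced, though the report does not say so.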
### Possible disk savings
- `raw/` 38GB: already converted to .bin, can be deleted
- Individual .bin files (c4/wiki/namuwiki): redundant now that they are merged into korean_train.bin, can be deleted (~18GB)
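
Before deleting anything, it may be worth confirming that each converted file is actually present. The sketch below only reports candidates and deletes nothing; the mapping mirrors two rows of the table in section 4, and the function name and directory layout are assumptions.

```python
import os

# Raw source -> converted .bin, as listed in section 4 (subset for illustration).
CONVERTED = {
    "raw/c4_ko": "korean_c4_train.bin",
    "raw/namuwiki_ko": "korean_namuwiki_train.bin",
}


def safe_to_delete(data_dir, mapping=CONVERTED):
    """Return raw paths whose converted .bin exists in data_dir and is non-empty."""
    candidates = []
    for raw_path, bin_name in mapping.items():
        bin_path = os.path.join(data_dir, bin_name)
        if os.path.isfile(bin_path) and os.path.getsize(bin_path) > 0:
            candidates.append(raw_path)
    return candidates
```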
### Final potential
- **Pretrain**: current 9.5B + korean_extra 30B+ = **~40B tokens attainable**
- **SFT**: current 162K + additional conversion = **~200K+ samples possible**