# Full Data Inventory: Measured Results
> Survey date: 2026-02-27 | Total disk usage: **195GB**
---
## 1. Pretrain data (.bin files): ready to use
| File | Size | Estimated tokens | Notes |
|------|------|------------------|-------|
| `korean_train.bin` | 17GB | **8.9B** | Combined (c4 + wiki + namuwiki merge) |
| `korean_val.bin` | 35MB | 17.9M | Combined val |
| `korean_c4_train.bin` | 15GB | **7.5B** | Korean C4 |
| `korean_c4_val.bin` | 29MB | 15.2M | |
| `korean_namuwiki_train.bin` | 2.1GB | **1.1B** | Namuwiki |
| `korean_namuwiki_val.bin` | 4.2MB | 2.2M | |
| `korean_wiki_train.bin` | 500MB | **261.8M** | Korean Wikipedia |
| `korean_wiki_val.bin` | 1.1MB | 524K | |
| `train.bin` | 1.2GB | **605M** | English wiki (Shakespeare, etc.) |
| `val.bin` | 5.8MB | 3.0M | |
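
The estimated token counts in this table are consistent with each token occupying 2 bytes (for example, 35MB corresponds to about 17.9M tokens). The sketch below reproduces that estimate from file size. It assumes a headerless `uint16` layout, which this report does not state explicitly, and the helper names `estimate_tokens` and `estimate_tokens_in_file` are hypothetical.

```python
import os

BYTES_PER_TOKEN = 2  # assumption: tokens stored as uint16 with no file header


def estimate_tokens(size_bytes: int, bytes_per_token: int = BYTES_PER_TOKEN) -> int:
    """Estimate the token count of a flat binary token file from its size."""
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return size_bytes // bytes_per_token


def estimate_tokens_in_file(path: str) -> int:
    """Same estimate, reading the size from disk."""
    return estimate_tokens(os.path.getsize(path))


# 500 MiB gives roughly 262M tokens, close to the 261.8M listed
# for korean_wiki_train.bin.
print(estimate_tokens(500 * 1024**2))
```

If the files were written with a wider dtype (for example `uint32`), pass `bytes_per_token=4` instead.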
### Pretrain token totals
- **korean_train.bin (combined)**: 8.9B tokens ← merge of C4 + Wiki + Namuwiki
- **Sum of individual files** (c4 7.5B + wiki 0.26B + namuwiki 1.1B = 8.86B) → matches the combined file after rounding
- **English train.bin**: 605M tokens
- ⚠️ **korean_train.bin is a merge of the individual .bin files, so take care not to count those tokens twice**
- **Non-duplicated Pretrain total: ~9.5B tokens** (Korean 8.9B + English 0.6B)
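
The arithmetic behind these totals can be checked with a few lines. This is only an illustrative sketch using the rounded figures from the table in this section; the variable names are hypothetical.

```python
# Token counts in billions, copied from the section 1 table.
individual = {"c4": 7.5, "wiki": 0.2618, "namuwiki": 1.1}
merged_korean = 8.9   # korean_train.bin
english = 0.605       # train.bin

individual_sum = sum(individual.values())

# The merged file and its parts should agree up to rounding of the inputs.
assert abs(individual_sum - merged_korean) < 0.1

# Count the merged file once; do not add the individual files again.
non_duplicate_total = merged_korean + english
double_counted_total = merged_korean + individual_sum + english

print(f"individual sum:      {individual_sum:.2f}B")
print(f"non-duplicate total: {non_duplicate_total:.1f}B")
print(f"if double-counted:   {double_counted_total:.1f}B")
```

Adding the individual files on top of the merged file would nearly double the apparent Korean corpus, which is the mistake the warning above guards against.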
---
## 2. korean_extra (Hugging Face downloads): processing required
| Directory | Size | Format | Estimated tokens |
|----------|------|------|----------|
| `culturax_ko` | 60GB | parquet | ~15B+ |
| `hplt_ko` | 23GB | parquet | ~6B |
| `cc100_ko` | 14GB | parquet/txt | ~3.5B |
| `oscar_ko` | 9.2GB | parquet | ~2.3B |
| `korean_textbooks` | 6.4GB | parquet | ~1.6B |
| `korean_webtext` | 4.2GB | parquet | ~1B |
| `finepdfs_edu_ko` | 2.9GB | parquet | ~700M |
| `namuwiki_extracted` | 2.2GB | parquet | ~550M |
| `wikipedia_korean` | 1.7GB | parquet | ~400M |
| `kovast` | 449MB | parquet | ~110M |
| `evol_instruct_ko` | 144MB | parquet/json | ~35M (for SFT) |
| `korean_safe_conv` | 51MB | parquet/json | ~12M (for SFT) |
**korean_extra total: ~123GB, estimated ~30B+ tokens** (before tokenization, based on raw text)
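
The per-directory estimates in this table work out to roughly 4 bytes of stored data per token (60GB maps to ~15B tokens). The sketch below reproduces the total with that ratio. The ratio is inferred from the table rather than measured; the real figure depends on the tokenizer and on parquet compression, so treat the output as an order-of-magnitude check only.

```python
# Directory sizes in GB, copied from the section 2 table.
SIZES_GB = {
    "culturax_ko": 60.0,
    "hplt_ko": 23.0,
    "cc100_ko": 14.0,
    "oscar_ko": 9.2,
    "korean_textbooks": 6.4,
    "korean_webtext": 4.2,
    "finepdfs_edu_ko": 2.9,
    "namuwiki_extracted": 2.2,
    "wikipedia_korean": 1.7,
    "kovast": 0.449,
    "evol_instruct_ko": 0.144,
    "korean_safe_conv": 0.051,
}

BYTES_PER_TOKEN = 4.0  # assumption inferred from the table (60GB -> ~15B tokens)


def estimate_tokens_billion(size_gb: float, bytes_per_token: float = BYTES_PER_TOKEN) -> float:
    """Rough token estimate, in billions, for size_gb gigabytes of stored text."""
    return size_gb / bytes_per_token


total_gb = sum(SIZES_GB.values())
total_tokens_b = sum(estimate_tokens_billion(s) for s in SIZES_GB.values())

print(f"total size: {total_gb:.1f} GB")
print(f"estimated tokens: {total_tokens_b:.1f}B")
```

The rounded per-directory sizes sum to slightly more than the reported total, presumably because each row was rounded independently.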
---
## 3. SFT data: ready to use
| File | Size | Samples |
|------|------|---------|
| `sft/train.jsonl` | 276MB | **161,848** |
| `sft/val.jsonl` | 15MB | **8,518** |
- **Total SFT samples: 170,366**
- Format: instruction/output pairs, data translated into Korean
- Quality: good (natural-sounding Korean, diverse topics)
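
Sample counts like these can be re-measured by counting JSONL records. The sketch below assumes one JSON object per line with `instruction` and `output` keys; the exact key names are an assumption based on the format description above, and `count_sft_samples` is a hypothetical helper.

```python
import json


def count_sft_samples(path: str, required_keys=("instruction", "output")) -> int:
    """Count JSONL records that contain every key in required_keys."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [k for k in required_keys if k not in record]
            if missing:
                raise ValueError(f"{path}:{line_no} is missing keys {missing}")
            count += 1
    return count
```

For example, `count_sft_samples("sft/train.jsonl")` should match the train count in the table if the file follows this layout; a `ValueError` would indicate that the key names differ from the assumption.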
---
## 4. Raw text data: already converted to .bin
| Directory | Size | Files | Notes |
|-----------|------|-------|-------|
| `raw/c4_ko/` | 30GB | 50 txt | → converted to korean_c4_train.bin |
| `raw/namuwiki_ko/` | 5.7GB | 6 txt | → converted to korean_namuwiki_train.bin |
| `raw/ko_wiki_*.txt` | 1.2GB | 5 txt | → converted to korean_wiki_train.bin |
| `raw/en_wiki_*.txt` | 1.2GB | 3 txt | → converted to train.bin |
| **raw total** | **38GB** | **64** | Can be deleted (saves disk space) |
---
## 5. Overall summary
### Ready to use
| Purpose | Data | Size |
|---------|------|------|
| **Pretrain** | korean_train.bin + train.bin | **9.5B tokens** |
| **SFT** | sft/train.jsonl | **161,848 samples** |
### Available after processing
| Source | Estimated size | Work required |
|--------|----------------|---------------|
| korean_extra (all) | **~30B+ tokens** | Tokenize → convert to .bin |
| evol_instruct_ko + korean_safe_conv | **~47M tokens (SFT)** | Convert to JSONL |
### Possible disk savings
- `raw/` 38GB → already converted to .bin; can be deleted
- Individual .bin files (c4/wiki/namuwiki) → redundant once merged into korean_train.bin; can be deleted (~18GB)
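
Before deleting anything, it may be worth confirming that each converted file is actually present. The sketch below is a read-only check that deletes nothing. The raw-to-bin mapping follows section 4, while the directory layout (all `.bin` files directly under one `data_dir`) and the helper name `deletable_raw_sources` are assumptions.

```python
import os
from typing import Dict, List

# Raw source -> .bin produced from it, as listed in section 4.
RAW_TO_BIN = {
    "raw/c4_ko/": "korean_c4_train.bin",
    "raw/namuwiki_ko/": "korean_namuwiki_train.bin",
    "raw/ko_wiki_*.txt": "korean_wiki_train.bin",
    "raw/en_wiki_*.txt": "train.bin",
}


def deletable_raw_sources(data_dir: str, raw_to_bin: Dict[str, str]) -> List[str]:
    """Return raw sources whose converted .bin exists and is non-empty."""
    ok = []
    for raw, bin_name in raw_to_bin.items():
        bin_path = os.path.join(data_dir, bin_name)
        if os.path.isfile(bin_path) and os.path.getsize(bin_path) > 0:
            ok.append(raw)
    return ok
```

A non-empty file is only a weak signal that conversion completed; comparing token counts against the section 1 table would be a stronger check.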
### Final potential
- **Pretrain**: current 9.5B + korean_extra 30B+ = **~40B tokens attainable**
- **SFT**: current 162K + additional conversions = **~200K+ samples possible**