# Comprehensive Korean LLM Data Report
> Generated: 2026-02-27 | Consolidated findings from 5 subagent surveys
---
## 1. Current Holdings
| Category | Dataset | Disk | Est. tokens | Quality |
|---------|---------|--------|---------|------|
| Educational web | fineweb2_edu_ko | 234G | ~50B | A |
| Web crawl | culturax_ko | 60G | ~24B | B+ |
| Math | open_web_math | 26G | ~10B | A |
| Web crawl | hplt_ko | 23G | ~9B | B |
| Web crawl | cc100_processed | 19G | ~7B | C+ |
| Web crawl | cc100_ko | 14G | ~5.5B | C |
| Web crawl | oscar_ko | 9.2G | ~3.5B | B |
| Education | korean_textbooks | 6.4G | ~1.5B | A |
| Web | korean_webtext | 4.2G | ~1B | B+ |
| Encyclopedia | namuwiki_2023 | 2.9G | ~1B | A- |
| Education | finepdfs_edu_ko | 2.9G | ~0.7B | A- |
| Encyclopedia | namuwiki_extracted | 2.2G | ~0.5B | A- |
| Encyclopedia | wikipedia_korean | 1.7G | ~0.4B | A |
| Encyclopedia | wikipedia_ko_2024 | 1.4G | ~0.3B | A |
| Instruct | kovast | 449M | ~0.1B | B |
| Instruct | evol_instruct_ko | 144M | ~0.03B | B |
| Dialogue | korean_safe_conv | 51M | ~0.01B | B |
| **Total** | | **~410G** | **~114B raw** | |
> ⚠️ Tokenized `.bin` files: korean_train.bin (17G ≈ 8.9B), korean_c4_train (15G ≈ 7.5B), and others. About 39B tokens are actually used for training.
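
The sizes quoted above work out to roughly 2 bytes per token (15G ≈ 7.5B), which would be consistent with token IDs stored as 16-bit integers. The report does not state the storage format, so the `bytes_per_token=2` default below is an assumption, and both function names are hypothetical helpers for illustration:

```python
import os


def tokens_from_bytes(n_bytes: int, bytes_per_token: int = 2) -> int:
    """Estimate the token count of a flat token-ID file from its size in bytes.

    bytes_per_token=2 assumes uint16 token IDs (an assumption, not confirmed
    by the report).
    """
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return n_bytes // bytes_per_token


def estimate_bin_tokens(path: str, bytes_per_token: int = 2) -> int:
    """Same estimate, using the size of an on-disk .bin file."""
    return tokens_from_bytes(os.path.getsize(path), bytes_per_token)


# 15 GB at 2 bytes per token gives 7.5B tokens, the figure quoted
# for korean_c4_train.
print(tokens_from_bytes(15 * 10**9) / 1e9)
```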
---
## 2. Domain Gap Analysis
### 🔴 CRITICAL (no data)
| Domain | Status | Impact |
|--------|------|------|
| **Preference/DPO** | 0 items | ORPO training not possible |
| **Law/case law** | 0 | Legal reasoning not possible |
| **Medical/medicine** | 0 | Healthcare responses not possible |
| **Code (Korean comments)** | 0 | Weak coding support |
| **News/press** | 0 | Weak current-affairs context |
### 🟡 WEAK (severely lacking)
| Domain | Status | Impact |
|--------|------|------|
| **Instruction/SFT** | ~0.6G (644MB) | Weak instruction following |
| **Finance/economics** | 0 | Weak finance-domain responses |
| **Academic papers** | 0 | Weak academic writing |
| **Fiction/literature** | 0 | Weak creative-writing ability |
---
## 3. Top Candidates: Pretraining (Filling Domain Gaps)
### 🥇 Priority 1: KORMo-Team/korean-web-collection
- **Size**: ~50-80GB / ~20-30B tokens
- **Notes**: The largest Korean-only web crawl on HF. Little overlap with the data currently held
- **License**: Public
- **Download**: `huggingface-cli download KORMo-Team/korean-web-collection --repo-type dataset --local-dir ./data/korean-web-collection`
### 🥈 Priority 2: HPLT/HPLT2.0_cleaned (ko)
- **Size**: ~30GB / ~12B tokens
- **Notes**: HPLT v1.2 is already held (23G) → v2.0 is larger and better cleaned, so it adds a net increment of new text
- **License**: Public
- **Download**: `python -c "from datasets import load_dataset; ds = load_dataset('HPLT/HPLT2.0_cleaned', 'ko', split='train'); ds.save_to_disk('./data/hplt2-ko')"`
### 🥉 Priority 3: Legal domain bundle
| Dataset | Size | Contents |
|---------|------|------|
| `joonhok-exo-ai/korean_law_open_data_precedents` | ~1-2G | Full text of court precedents |
| `smhilee/korean-law-dataset` | ~1-3G | Statutes / legal text |
| `Rootpye/korean-lawdata2` | ~0.5-1G | Legal data |
| `Rootpye/korean-lawdata4` | ~0.5-1G | Legal data v4 |
| `ducut91/korean-constitutional-court-decisions` | ~0.5G | Constitutional Court decisions |
- **Total**: ~4-8G / ~1-2B tokens
- **Why it matters**: Law is a completely empty domain. Precise Korean + logical structure → better pretraining quality
### Priority 4: mc4 (ko)
- **Size**: ~50GB / ~20B tokens
- **Notes**: Partially overlaps with CulturaX, but the original mC4 contains additional text
- **License**: Public
- **Download**: `python -c "from datasets import load_dataset; ds = load_dataset('mc4', 'ko', split='train'); ds.save_to_disk('./data/mc4-ko')"`
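
Because mc4 partially overlaps with CulturaX, some form of deduplication is likely needed before the new text is merged. The report does not say which method is used; the sketch below shows one simple option, exact-match deduplication on a hash of whitespace-normalized text. All function names are hypothetical, and near-duplicate detection (for example MinHash) is out of scope here:

```python
import hashlib
from typing import Iterable, Iterator, Set


def text_key(text: str) -> str:
    """Hash of whitespace-normalized text, used as an exact-duplicate key."""
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def build_seen(texts: Iterable[str]) -> Set[str]:
    """Collect keys for documents that are already in the corpus."""
    return {text_key(t) for t in texts}


def drop_seen(candidates: Iterable[str], seen: Set[str]) -> Iterator[str]:
    """Yield candidates whose key is not in `seen`.

    `seen` is updated as documents are yielded, so duplicates inside the
    candidate stream are dropped as well.
    """
    for text in candidates:
        key = text_key(text)
        if key not in seen:
            seen.add(key)
            yield text


existing = build_seen(["안녕하세요  세계", "foo"])
new_docs = list(drop_seen(["안녕하세요 세계", "bar", "bar"], existing))
print(new_docs)
```

For corpora at the tens-of-gigabytes scale the key set may not fit in memory, so a disk-backed store or a Bloom filter would be a reasonable substitute for the in-memory `set`.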
### Priority 5: RedPajama-Data-1T (code + ArXiv)
- **Size**: selected subsets, ~15-20GB / ~8-10B tokens
- **Notes**: Even a Korean model needs code and scientific English data (cross-lingual transfer)
- **Subsets**: `github` (code, 5B) + `arxiv` (science, 3B) + `book` (2B)
- **License**: Public
---
## 4. Top Candidates: SFT
### 🥇 1: kuotient/orca-math-word-problems-193k-korean
- **Size**: 193K samples
- **Contents**: Math problems in Korean, based on Orca Math
- **Why**: Fills the completely empty math domain. Verified high quality
### 🥈 2: dbdu/ShareGPT-74k-ko
- **Size**: 74K samples
- **Contents**: Korean translation of real-world multi-turn ChatGPT conversations
- **Why**: Complements the current single-turn-biased data; covers diverse domains
### 🥉 3: nayohan/Evol-Instruct-Code-80k-v1-ko
- **Size**: 80K samples
- **Contents**: WizardCoder-based coding instructions in Korean
- **Why**: Coding is currently ~5% of the data → substantially strengthened
### 4: nlp-with-deeplearning/Ko.WizardLM_evol_instruct_V2_196k
- **Size**: 196K samples
- **Contents**: WizardLM Evol Instruct in Korean; includes complex reasoning
### 5: FreedomIntelligence/alpaca-gpt4-korean
- **Size**: 52K samples
- **Contents**: GPT-4-generated Alpaca in Korean; high-quality responses
> **Expected after SFT additions**: current 162K + 595K = **~757K** (about 4.7x the current size)
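
The 595K figure is the sum of the five candidate sizes listed in this section. A small check of that arithmetic, using the nominal sample counts from the headings (actual row counts after download may differ):

```python
# Nominal sample counts taken from this section; real counts may differ.
current_sft = 162_000
candidates = {
    "kuotient/orca-math-word-problems-193k-korean": 193_000,
    "dbdu/ShareGPT-74k-ko": 74_000,
    "nayohan/Evol-Instruct-Code-80k-v1-ko": 80_000,
    "nlp-with-deeplearning/Ko.WizardLM_evol_instruct_V2_196k": 196_000,
    "FreedomIntelligence/alpaca-gpt4-korean": 52_000,
}

added = sum(candidates.values())
total = current_sft + added
growth = round(total / current_sft, 1)

print(f"added={added:,} total={total:,} growth={growth}x")
```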
---
## 5. Top Candidates: Preference/ORPO
### 🥇 1: jojo0217/korean_rlhf_dataset
- **Size**: 100K+ pairs
- **Contents**: Comprehensive Korean RLHF data; the most general-purpose option
- **Priority**: Download immediately
### 🥈 2: maywell/ko_Ultrafeedback_binarized
- **Size**: ~60K pairs
- **Contents**: Korean translation of UltraFeedback, binarized (chosen/rejected)
- **Why**: Already in chosen/rejected format, so it can be used for ORPO directly
### 🥉 3: nayohan/preference-collection-ko-full
- **Size**: 100K+ pairs
- **Contents**: Comprehensive Korean preference collection
### 4: kuotient/orca-math-korean-dpo-pairs
- **Size**: 100K+ pairs
- **Contents**: Math-specific DPO pairs
> **Recommended ORPO combination**: jojo0217 + maywell + nayohan = ~260K pairs → training can start immediately
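
Combining several preference sets usually requires mapping each one onto a shared `prompt`/`chosen`/`rejected` layout first. The sketch below shows one way to do that. The source column names of each dataset are not given in this report, so the key arguments are placeholders to be filled in after inspecting the data, and `to_preference_record` is a hypothetical helper:

```python
from typing import Dict, Mapping, Optional


def to_preference_record(
    row: Mapping[str, object],
    prompt_key: str = "prompt",
    chosen_key: str = "chosen",
    rejected_key: str = "rejected",
) -> Optional[Dict[str, str]]:
    """Map one source row onto prompt/chosen/rejected, or return None.

    Rows are rejected when a field is missing, not a string, empty,
    or when chosen and rejected are identical (no preference signal).
    """
    values = [row.get(prompt_key), row.get(chosen_key), row.get(rejected_key)]
    if not all(isinstance(v, str) and v.strip() for v in values):
        return None
    prompt, chosen, rejected = (v.strip() for v in values)
    if chosen == rejected:
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Hypothetical source columns, for illustration only.
row = {"question": "1+1은?", "good": "2입니다.", "bad": "3입니다."}
print(to_preference_record(row, "question", "good", "bad"))
```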
---
## 6. External Sources (Application Required)
| Source | Est. size | Notes |
|------|--------|------|
| AI Hub (aihub.or.kr) | ~60-100GB | Specialized news, dialogue, medical, legal, and finance data; approval required, non-commercial use permitted |
| NIKL Modu Corpus (모두의 말뭉치) | ~35-50GB | Written/spoken corpora; application required for non-commercial research use |
| Korea Law Information Center (국가법령정보센터) | ~5-10GB | Can be crawled (public data) |
| KCI academic papers | ~3-5GB | Paper abstracts; API available |
---
## 7. Download Execution Plan (In Priority Order)
```bash
cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang
# === Phase 1: Preference (enables ORPO immediately; small download) ===
python3 -c "
from datasets import load_dataset
import os
out = 'data/preference'
os.makedirs(out, exist_ok=True)
for name in ['jojo0217/korean_rlhf_dataset', 'maywell/ko_Ultrafeedback_binarized', 'nayohan/preference-collection-ko-full', 'kuotient/orca-math-korean-dpo-pairs']:
    ds = load_dataset(name, split='train')
    # Write one JSONL file per dataset, with '/' in the repo id replaced by '_'
    ds.to_json(f'{out}/{name.replace(\"/\",\"_\")}.jsonl')
    print(f'✅ {name}: {len(ds)} samples')
" 2>&1 | tee /tmp/preference_dl.log &
# === Phase 2: SFT reinforcement (dialogue/math/code) ===
python3 -c "
from datasets import load_dataset
import os
out = 'data/sft_extra'
os.makedirs(out, exist_ok=True)
for name in ['kuotient/orca-math-word-problems-193k-korean','dbdu/ShareGPT-74k-ko','nayohan/Evol-Instruct-Code-80k-v1-ko','nlp-with-deeplearning/Ko.WizardLM_evol_instruct_V2_196k','FreedomIntelligence/alpaca-gpt4-korean']:
    try:
        ds = load_dataset(name, split='train')
        ds.to_json(f'{out}/{name.replace(\"/\",\"_\")}.jsonl')
        print(f'✅ {name}: {len(ds)}')
    except Exception as e:
        # Keep going so one failed download does not stop the rest
        print(f'❌ {name}: {e}')
" 2>&1 | tee /tmp/sft_extra_dl.log &
# === Phase 3: Legal pretrain reinforcement ===
python3 -c "
from datasets import load_dataset
import os
out = 'data/korean_extra/korean_law'
os.makedirs(out, exist_ok=True)
for name in ['joonhok-exo-ai/korean_law_open_data_precedents','smhilee/korean-law-dataset','Rootpye/korean-lawdata2']:
    try:
        ds = load_dataset(name, split='train')
        ds.to_json(f'{out}/{name.replace(\"/\",\"_\")}.jsonl')
        print(f'✅ {name}: {len(ds)}')
    except Exception as e:
        # Keep going so one failed download does not stop the rest
        print(f'❌ {name}: {e}')
" 2>&1 | tee /tmp/law_dl.log &
# === Phase 4: Large-scale pretrain (long-running; run in the background) ===
# mc4 Korean (~50GB)
# python3 -c "from datasets import load_dataset; ds = load_dataset('mc4', 'ko', split='train'); ds.save_to_disk('data/korean_extra/mc4_ko')"
# KORMo Web Collection
# huggingface-cli download KORMo-Team/korean-web-collection --repo-type dataset --local-dir data/korean_extra/korean_web_collection
```
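
Once the background jobs have finished, it may be worth confirming that each exported `.jsonl` file parses cleanly before it is used for training. The sketch below counts valid and invalid records. `count_jsonl_records` is a hypothetical helper, and the example path simply follows the naming scheme used by the script above:

```python
import json
from typing import Iterable, Tuple


def count_jsonl_records(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (valid, invalid) record counts for an iterable of JSONL lines.

    A line is valid when it parses as a JSON object; blank lines are skipped.
    """
    valid = invalid = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            invalid += 1
            continue
        if isinstance(obj, dict):
            valid += 1
        else:
            invalid += 1
    return valid, invalid


# Example usage against one of the Phase 1 outputs:
# with open("data/preference/maywell_ko_Ultrafeedback_binarized.jsonl",
#           encoding="utf-8") as f:
#     print(count_jsonl_records(f))
```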
---
## 8. Expected Data Composition After Additions
| Category | Current tokens | After additions | Notes |
|---------|---------|---------|------|
| Korean pretrain | ~39B (tokenized) | ~60-80B | If mc4 + KORMo + legal are added |
| SFT | 162K | ~757K | After adding the 5 datasets |
| Preference | 0 | ~260K pairs | jojo+maywell+nayohan |
| Code/English | ~0.6B | ~10B | RedPajama github+arxiv |
| Legal | 0 | ~1-2B | Legal bundle |
**Chinchilla minimum (60B) is achievable** ✅
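
The report does not state the target model size behind the 60B figure. Under the commonly cited rule of thumb of about 20 training tokens per parameter, 60B tokens would correspond to a model of roughly 3B parameters. Both the ratio and the implied model size are assumptions in the sketch below:

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Token budget for a model size under a fixed tokens-per-parameter ratio.

    tokens_per_param=20 is a rule of thumb, assumed here for illustration.
    """
    return n_params * tokens_per_param


def params_for_budget(n_tokens: float, tokens_per_param: float = 20.0) -> float:
    """Model size for which a given token budget meets the same ratio."""
    return n_tokens / tokens_per_param


# Model size implied by a 60B-token minimum under the assumed ratio.
print(params_for_budget(60e9) / 1e9)

# Check the projected pretraining range (~60-80B tokens) against that minimum.
for projected in (60e9, 80e9):
    print(projected >= chinchilla_tokens(3e9))
```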
---
_Report saved to: `/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/data_inventory/`_