frankenstallm/source/eval/data_inventory/DOWNLOAD_PRIORITY.md
# Download Priority Plan
> Created: 2026-02-27 | Free disk space: 19 TB
## Top 5 Immediate Downloads (in priority order)
---
### 🥇 Priority 1: FineWeb-Edu (Korean subset)
- **Dataset:** `HuggingFaceFW/fineweb-edu`
- **Why:** web data filtered for educational quality, high quality (grade A). A Korean subset can be extracted
- **Expected:** 5–15B tokens (Korean portion)
- **Access:** ✅ free, not gated
- **Impact:** secures a large volume of high-quality pretraining tokens and strengthens the education domain
```bash
# Download the Korean subset
pip install datasets
python3 -c "
from datasets import load_dataset
ds = load_dataset('HuggingFaceFW/fineweb-edu', 'CC-MAIN-2024-10', split='train', streaming=True)
# language filter needed - fineweb-edu is primarily English
# Alternative: fineweb-edu-score filtered Korean web data
"
```
> ⚠️ Note: fineweb-edu is predominantly English, so the Korean share may be small. It is still valuable as a supplement of high-quality English data.
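Because the corpus is mostly English, a language filter is the first step after streaming. Below is a minimal sketch using a Hangul-ratio heuristic to keep Korean rows; the 0.3 threshold and the generator usage line are illustrative assumptions, not values fixed by this plan:

```python
import re

# Precompiled range of Hangul syllables (U+AC00 to U+D7A3)
HANGUL = re.compile(r'[\uac00-\ud7a3]')

def korean_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Hangul syllables."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if HANGUL.match(c)) / len(chars)

def is_korean(text: str, threshold: float = 0.3) -> bool:
    """Treat a document as Korean if enough of it is Hangul."""
    return korean_ratio(text) >= threshold

# Usage against a streamed dataset `ds` as loaded above (hypothetical):
# ko_rows = (row for row in ds if is_korean(row['text']))
```

A character-range heuristic like this avoids pulling in a language-ID model for a first pass; a proper classifier (e.g. fastText lid) can refine the result later.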
---
### 🥈 Priority 2: Korean preference/DPO data (multiple sources)
- **Datasets:**
- `kuotient/orca-math-korean-preference` βœ…
- `kuotient/orca-math-korean-dpo-pairs` βœ…
- `heegyu/orca-math-korean-preference-cleaned` βœ…
- `ohsuz/dpo-v1010-korean` βœ…
- `ChuGyouk/argilla-distilabel-math-preference-dpo-korean` βœ…
- **Why:** with **zero** preference pairs currently on hand, ORPO training is outright impossible → the most urgent item
- **Expected:** 30–60K pairs in total
- **Access:** ✅ all free
- **Impact:** enables the ORPO/DPO training pipeline
```bash
python3 << 'PYEOF'
from datasets import load_dataset
import json, os
out_dir = "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/preference"
os.makedirs(out_dir, exist_ok=True)
datasets_to_dl = [
    ("kuotient/orca-math-korean-preference", None),
    ("kuotient/orca-math-korean-dpo-pairs", None),
    ("heegyu/orca-math-korean-preference-cleaned", None),
    ("ohsuz/dpo-v1010-korean", None),
    ("ChuGyouk/argilla-distilabel-math-preference-dpo-korean", None),
]
for name, config in datasets_to_dl:
try:
ds = load_dataset(name, config, split="train")
safe_name = name.replace("/", "_")
ds.to_json(f"{out_dir}/{safe_name}.jsonl")
print(f"βœ… {name}: {len(ds)} samples")
except Exception as e:
print(f"❌ {name}: {e}")
PYEOF
```
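The five sources do not necessarily share a column schema. Below is a sketch of normalizing raw rows into the (prompt, chosen, rejected) triple that DPO/ORPO trainers expect; the alias names probed here are common conventions, not verified against each dataset card:

```python
from typing import Optional

def to_dpo_record(row: dict) -> Optional[dict]:
    """Map a raw preference row onto a unified (prompt, chosen, rejected)
    triple. The column aliases below are assumptions; check each dataset
    card and extend the list as needed."""
    prompt = row.get("prompt") or row.get("question") or row.get("instruction")
    chosen = row.get("chosen") or row.get("chosen_response")
    rejected = row.get("rejected") or row.get("rejected_response")
    if not (prompt and chosen and rejected):
        return None  # drop rows missing any of the three fields
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Running every downloaded JSONL through one normalizer like this keeps the later ORPO/DPO pipeline from having to special-case each source.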
---
### 🥉 Priority 3: RedPajama-Data-1T (high-quality English subsets)
- **Dataset:** `togethercomputer/RedPajama-Data-1T`
- **Why:** English data is extremely scarce (0.6B tokens). Selectively download the code/ArXiv/Book/StackExchange subsets
- **Expected:** 10–20B selected tokens (code 5B + ArXiv 3B + Book 2B + SE 2B)
- **Access:** ✅ free
- **Impact:** major boost to code/science/reasoning ability and to cross-lingual transfer
```bash
python3 << 'PYEOF'
from datasets import load_dataset
# Code subset first (github subset)
ds = load_dataset("togethercomputer/RedPajama-Data-1T", "github",
split="train", streaming=True,
cache_dir="/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/redpajama")
# ArXiv subset
ds_arxiv = load_dataset("togethercomputer/RedPajama-Data-1T", "arxiv",
split="train", streaming=True,
cache_dir="/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/redpajama")
PYEOF
```
---
### 4️⃣ Priority 4: Korean SFT diversity reinforcement
- **Datasets:**
  - `kyujinpy/KOR-OpenOrca-Platypus-v3` ✅ (reasoning/math)
  - `maywell/ko_wikidata_QA` ✅ (knowledge QA)
  - `nlpai-lab/kullm-v2` ✅ (general instructions)
- **Why:** the current 170K SFT samples are sufficient in volume but short on the code/math/reasoning domains
- **Expected:** +50–100K samples across diverse domains
- **Access:** ✅ all free
```bash
python3 << 'PYEOF'
from datasets import load_dataset
import os
out_dir = "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/sft_extra"
os.makedirs(out_dir, exist_ok=True)
for name in ["kyujinpy/KOR-OpenOrca-Platypus-v3", "maywell/ko_wikidata_QA", "nlpai-lab/kullm-v2"]:
try:
ds = load_dataset(name, split="train")
safe = name.replace("/","_")
ds.to_json(f"{out_dir}/{safe}.jsonl")
print(f"βœ… {name}: {len(ds)}")
except Exception as e:
print(f"❌ {name}: {e}")
PYEOF
```
---
### 5️⃣ Priority 5: Open-Web-Math (math-specialized)
- **Dataset:** `open-web-math/open-web-math`
- **Why:** no math data at all yet, and math ability is a core area of LLM benchmarks
- **Expected:** ~14B tokens (English math)
- **Access:** ✅ free
- **Impact:** establishes a foundation for mathematical reasoning
```bash
python3 << 'PYEOF'
import json
from datasets import load_dataset

ds = load_dataset('open-web-math/open-web-math', split='train', streaming=True,
                  cache_dir='/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/open-web-math')
# Stream and save to JSONL; interrupt or add a cap if disk budget becomes a concern
with open('/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/open-web-math/train.jsonl', 'w') as f:
    for row in ds:
        f.write(json.dumps({'text': row['text']}, ensure_ascii=False) + '\n')
PYEOF
```
---
## Expected Token Distribution After Downloads
| Category | Current | Added | Total |
|---------|------|------|------|
| Korean pretrain | 39B | +5–10B (fineweb-edu ko) | 44–49B |
| English code | 0 | +5B (RedPajama github) | 5B |
| English science/ArXiv | 0 | +3B (RedPajama arxiv) | 3B |
| English math | 0 | +10B (open-web-math) | 10B |
| Other high-quality English | 0.6B | +5B (RedPajama book+SE) | 5.6B |
| **Pretrain total** | **~39B** | **+28–33B** | **~67–72B** |
| SFT | 170K | +50–100K | 220–270K |
| Preference | 0 | +30–60K pairs | 30–60K pairs |
### Goal Attainment
- ✅ Chinchilla minimum (60B) reachable
- ✅ ORPO/DPO training enabled
- ✅ Code/math/science domains covered
- 🟡 Still short of Chinchilla optimal (210B) → later consider adding all of CulturaX, SlimPajama, etc.
---
## Recommended Data Mix Ratios (for training)
```
Korean text:          50% (~35B tokens)
English code:         15% (~10B tokens)
English math/science: 15% (~10B tokens)
English general:      15% (~10B tokens)
Korean education:      5% (~3B tokens)
```
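In practice this mix can be realized with per-source sampling weights (for streamed corpora, `datasets.interleave_datasets(..., probabilities=...)` does this directly). A dependency-free sketch of weighted source selection using the ratios above; the source names are labels invented for illustration:

```python
import random

# Target mix from the ratios above
MIX = {
    "korean_text":     0.50,
    "en_code":         0.15,
    "en_math_science": 0.15,
    "en_general":      0.15,
    "korean_edu":      0.05,
}

def sample_source(rng: random.Random) -> str:
    """Choose which corpus the next training example is drawn from."""
    return rng.choices(list(MIX), weights=list(MIX.values()), k=1)[0]

rng = random.Random(42)
counts = {name: 0 for name in MIX}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
# counts now roughly tracks the target ratios
```

Per-example sampling like this keeps the mix correct even when the underlying corpora have very different sizes, at the cost of some sources repeating epochs sooner than others.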
## μ£Όμ˜μ‚¬ν•­
1. CulturaXλŠ” gated(auto) β†’ HuggingFaceμ—μ„œ λ™μ˜ ν•„μš” (이미 λ‹€μš΄λ°›μ€ 60GB ν™œμš©)
2. the-stack-dedup도 gated β†’ 승인 ν•„μš”, RedPajama github둜 λŒ€μ²΄
3. λ‹€μš΄λ‘œλ“œ μ „ `huggingface-cli login --token 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` μ‹€ν–‰
4. λŒ€μš©λŸ‰ λ‹€μš΄λ‘œλ“œ μ‹œ `HF_HUB_ENABLE_HF_TRANSFER=1` ν™˜κ²½λ³€μˆ˜ μ„€μ • ꢌμž₯
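Items 3–4 combined into one snippet; the `$HF_TOKEN` variable and the dataset chosen for the download command are illustrative:

```shell
pip install -U huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1   # enable the Rust-based accelerated transfer backend
huggingface-cli login --token "$HF_TOKEN"   # token from the environment, never hard-coded
huggingface-cli download open-web-math/open-web-math \
  --repo-type dataset \
  --local-dir /PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/open-web-math
```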