Download Priority Plan
Created: 2026-02-27 | Free disk space: 19TB

Top 5 Immediate Downloads (in priority order)
🥇 Priority 1: FineWeb-Edu (Korean subset)
- Dataset: HuggingFaceFW/fineweb-edu
- Why: education-quality-filtered web data, high quality (grade A). The Korean subset can be extracted separately.
- Estimated: 5~15B tokens (Korean portion)
- Access: ✅ free, not gated
- Impact: secures a large volume of high-quality pretraining tokens + strengthens the education domain
```bash
# Download the Korean subset
pip install datasets
python3 -c "
from datasets import load_dataset
ds = load_dataset('HuggingFaceFW/fineweb-edu', 'CC-MAIN-2024-10', split='train', streaming=True)
# language filter needed - fineweb-edu is primarily English
# Alternative: fineweb-edu-score filtered Korean web data
"
```
⚠️ Note: fineweb-edu is mostly English, so the Korean share may be small. It is still valuable as a high-quality English supplement.
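Since the corpus is mostly English, the stream needs a language filter before anything is saved. A minimal sketch using a Hangul-character-ratio heuristic; the `hangul_ratio` helper and the 0.3 threshold are assumptions to tune, not part of the dataset or `datasets` API:

```python
# Minimal Korean-text filter for a streamed dataset (heuristic, threshold is an assumption).

def hangul_ratio(text: str) -> float:
    """Fraction of non-space characters that fall in the Hangul syllable block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hangul = sum(1 for c in chars if '\uac00' <= c <= '\ud7a3')
    return hangul / len(chars)

def filter_korean(records, text_key="text", threshold=0.3):
    """Yield only records whose text is predominantly Hangul."""
    for rec in records:
        if hangul_ratio(rec.get(text_key, "")) >= threshold:
            yield rec

# A streamed HF dataset yields dicts the same way as this in-memory sample:
sample = [
    {"text": "교육용 한국어 문서입니다."},
    {"text": "This is an English document."},
]
print([r["text"] for r in filter_korean(sample)])  # → ['교육용 한국어 문서입니다.']
```

A character-range heuristic is crude but cheap enough to run inline on a stream; a fastText language-ID model would be a stronger drop-in if precision matters.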
🥈 Priority 2: Korean preference/DPO data (multiple sources)
- Datasets:
  - kuotient/orca-math-korean-preference ✅
  - kuotient/orca-math-korean-dpo-pairs ✅
  - heegyu/orca-math-korean-preference-cleaned ✅
  - ohsuz/dpo-v1010-korean ✅
  - ChuGyouk/argilla-distilabel-math-preference-dpo-korean ✅
- Why: with 0 preference samples at present, ORPO training is impossible at all → most urgent
- Estimated: 30~60K pairs in total
- Access: ✅ all free
- Impact: enables the ORPO/DPO training pipeline
```bash
python3 << 'PYEOF'
from datasets import load_dataset
import json, os

out_dir = "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/preference"
os.makedirs(out_dir, exist_ok=True)
datasets_to_dl = [
    ("kuotient/orca-math-korean-preference", None),
    ("kuotient/orca-math-korean-dpo-pairs", None),
    ("heegyu/orca-math-korean-preference-cleaned", None),
    ("ohsuz/dpo-v1010-korean", None),
]
for name, config in datasets_to_dl:
    try:
        ds = load_dataset(name, config, split="train")
        safe_name = name.replace("/", "_")
        ds.to_json(f"{out_dir}/{safe_name}.jsonl")
        print(f"✅ {name}: {len(ds)} samples")
    except Exception as e:
        print(f"❌ {name}: {e}")
PYEOF
```
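The downloaded sets are unlikely to share one schema, so before ORPO/DPO training they need normalizing to (prompt, chosen, rejected) triples. A sketch; the candidate column names below are common community conventions (assumptions), not verified fields of each listed dataset:

```python
# Normalize heterogeneous preference records to {"prompt", "chosen", "rejected"}.
# The key candidates are assumed conventions, not checked against each source dataset.

PROMPT_KEYS = ("prompt", "question", "instruction")
CHOSEN_KEYS = ("chosen", "chosen_response", "accepted")
REJECTED_KEYS = ("rejected", "rejected_response")

def pick(record, keys):
    """Return the first non-empty value among the candidate keys, else None."""
    for k in keys:
        if record.get(k):
            return record[k]
    return None

def to_preference_triple(record):
    """Return a normalized triple, or None if a required field is missing."""
    prompt = pick(record, PROMPT_KEYS)
    chosen = pick(record, CHOSEN_KEYS)
    rejected = pick(record, REJECTED_KEYS)
    if not (prompt and chosen and rejected):
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Usage on a toy record:
rec = {"question": "1+1은?", "chosen": "2입니다.", "rejected": "3입니다."}
print(to_preference_triple(rec))
```

Records that come back as None should be counted and inspected per source before training, so a silently wrong column mapping does not empty a dataset.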
🥉 Priority 3: RedPajama-Data-1T (high-quality English subsets)
- Dataset: togethercomputer/RedPajama-Data-1T
- Why: English data is extremely scarce (0.6B). Selectively download the code/ArXiv/Book/StackExchange subsets.
- Estimated: 10~20B selected tokens (code 5B + ArXiv 3B + Book 2B + SE 2B)
- Access: ✅ free
- Impact: greatly strengthens code/science/reasoning ability + cross-lingual transfer
```bash
python3 << 'PYEOF'
from datasets import load_dataset

# Code subset first (github subset)
ds = load_dataset("togethercomputer/RedPajama-Data-1T", "github",
                  split="train", streaming=True,
                  cache_dir="/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/redpajama")

# ArXiv subset
ds_arxiv = load_dataset("togethercomputer/RedPajama-Data-1T", "arxiv",
                        split="train", streaming=True,
                        cache_dir="/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/redpajama")
PYEOF
```
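Streaming alone does not cap volume; to stop at the per-subset budgets (5B code, 3B ArXiv, etc.) the iterator needs a token counter. A sketch with a whitespace-split approximation of token count, which is an assumption for illustration; real budgeting should use the training tokenizer:

```python
import json

def take_token_budget(stream, budget_tokens, text_key="text"):
    """Yield records until an approximate token budget is exhausted.
    Whitespace-split counting is a rough stand-in for the real tokenizer."""
    used = 0
    for rec in stream:
        n = len(rec.get(text_key, "").split())
        if used + n > budget_tokens:
            break
        used += n
        yield rec

def save_jsonl(stream, path):
    """Write each record as one JSON line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in stream:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Toy generator standing in for the streamed RedPajama iterator:
toy = ({"text": f"sample document number {i}"} for i in range(100))
kept = list(take_token_budget(toy, budget_tokens=20))
print(len(kept))  # → 5
```

In practice `save_jsonl(take_token_budget(ds, 5_000_000_000), out_path)` would cap the github subset at roughly the planned 5B tokens.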
4️⃣ Priority 4: Korean SFT diversity reinforcement
- Datasets:
  - kyujinpy/KOR-OpenOrca-Platypus-v3 ✅ (reasoning/math)
  - maywell/ko_wikidata_QA ✅ (knowledge QA)
  - nlpai-lab/kullm-v2 ✅ (general-purpose instructions)
- Why: the current 170K SFT set is sufficient in volume but lacks the code/math/reasoning domains
- Estimated: +50~100K samples across diverse domains
- Access: ✅ all free
```bash
python3 << 'PYEOF'
from datasets import load_dataset
import os

out_dir = "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/sft_extra"
os.makedirs(out_dir, exist_ok=True)
for name in ["kyujinpy/KOR-OpenOrca-Platypus-v3", "maywell/ko_wikidata_QA", "nlpai-lab/kullm-v2"]:
    try:
        ds = load_dataset(name, split="train")
        safe = name.replace("/", "_")
        ds.to_json(f"{out_dir}/{safe}.jsonl")
        print(f"✅ {name}: {len(ds)}")
    except Exception as e:
        print(f"❌ {name}: {e}")
PYEOF
```
5️⃣ Priority 5: Open-Web-Math (math specialization)
- Dataset: open-web-math/open-web-math
- Why: we currently have no math data at all, and math ability is a core element of LLM benchmarks
- Estimated: ~14B tokens (English math)
- Access: ✅ free
- Impact: establishes a base of mathematical reasoning ability
```bash
python3 -c "
from datasets import load_dataset
ds = load_dataset('open-web-math/open-web-math', split='train', streaming=True,
                  cache_dir='/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/open-web-math')
# Stream and save
"
```
Expected token distribution after download

| Category | Current | Added | Total |
|---|---|---|---|
| Korean pretrain | 39B | +5~10B (fineweb-edu ko) | 44~49B |
| English code | 0 | +5B (RedPajama github) | 5B |
| English science/ArXiv | 0 | +3B (RedPajama arxiv) | 3B |
| English math | 0 | +10B (open-web-math) | 10B |
| English other high-quality | 0.6B | +5B (RedPajama book+SE) | 5.6B |
| Pretrain total | ~39B | +28~33B | ~68~73B |
| SFT | 170K | +50~100K | 220~270K |
| Preference | 0 | +30~60K pairs | 30~60K pairs |
Goal achievement status
- ✅ Chinchilla minimum (60B) is achievable
- ✅ ORPO/DPO training becomes possible
- ✅ Code/math/science domains covered
- 🟡 Still short of Chinchilla optimal (210B) → later, consider adding the full CulturaX, SlimPajama, etc.
Recommended data mix ratios (for training)
- Korean text: 50% (~35B tokens)
- English code: 15% (~10B tokens)
- English math/science: 15% (~10B tokens)
- English general: 15% (~10B tokens)
- Korean education: 5% (~3B tokens)
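The ratios above translate directly into data-loader sampling weights and implied epochs over each source. A small sketch using the document's own estimates; the 68B total budget is an assumption (roughly the sum of the per-source token counts listed):

```python
# Per-source sampling weight and implied epochs from the recommended mix.
# Token counts are the plan's estimates; total_budget_b (68B) is an assumed
# training budget roughly equal to the sum of the listed source sizes.

mix = {                      # source: (share %, available tokens in billions)
    "korean_text":     (50, 35),
    "english_code":    (15, 10),
    "english_mathsci": (15, 10),
    "english_general": (15, 10),
    "korean_edu":      (5,  3),
}

total_budget_b = 68

for name, (share, avail_b) in mix.items():
    target_b = total_budget_b * share / 100
    epochs = target_b / avail_b   # > 1.0 means repeating (upsampling) the source
    print(f"{name}: weight={share / 100:.2f}, target={target_b:.1f}B, epochs={epochs:.2f}")
```

Keeping every `epochs` value near or below 1.0 is a quick sanity check that no source is being heavily repeated; here only korean_edu exceeds 1 epoch (3.4B target over 3B available).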
Cautions
- CulturaX is gated (auto) → requires agreeing to terms on HuggingFace (use the 60GB already downloaded)
- the-stack-dedup is also gated → approval needed; substitute RedPajama github
- Before downloading, run `huggingface-cli login --token <YOUR_HF_TOKEN>`
- For large downloads, setting the `HF_HUB_ENABLE_HF_TRANSFER=1` environment variable is recommended
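The login and accelerated-download notes can be combined into a one-time setup; a sketch of the shell configuration, assuming the token is supplied via an `HF_TOKEN` environment variable rather than hard-coded:

```shell
# One-time setup before large downloads.
# $HF_TOKEN is assumed to hold your HuggingFace access token (never hard-code it).
pip install -U "huggingface_hub[cli]" hf_transfer
huggingface-cli login --token "$HF_TOKEN"

# Enable the Rust-based accelerated downloader for this shell session.
export HF_HUB_ENABLE_HF_TRANSFER=1
```

`hf_transfer` must be installed in the same environment as `huggingface_hub`, or downloads will fail with the variable set.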