# Download Priority Plan

> Created: 2026-02-27 | Free disk space: 19TB

## Immediate Downloads: Top 5 (in priority order)

---
### 🥇 Priority 1: FineWeb-Edu (Korean subset)

- **Dataset:** `HuggingFaceFW/fineweb-edu`
- **Why:** Web data filtered for educational quality; high quality (grade A). The Korean subset alone can be extracted.
- **Estimate:** 5-15B tokens (Korean portion)
- **Access:** ✅ Free, not gated
- **Impact:** Secures a large volume of high-quality pretraining tokens and strengthens the education domain
```bash
# Download the Korean subset
pip install datasets
python3 -c "
from datasets import load_dataset
# streaming=True returns a lazy iterator; nothing is fetched until it is iterated
ds = load_dataset('HuggingFaceFW/fineweb-edu', 'CC-MAIN-2024-10', split='train', streaming=True)
# A language filter is needed: fineweb-edu is primarily English
# Alternative: Korean web data filtered by fineweb-edu score
"
```
> ⚠️ Note: fineweb-edu is mostly English (it is derived from the English-filtered FineWeb), so the Korean share may be small. It is still valuable as a high-quality English supplement.
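
The language filter mentioned above could be approximated with a simple Hangul-ratio check. This is a minimal sketch, not the plan's actual filter: the helper names, the `text` field name, and the 0.5 threshold are all assumptions made for illustration, and a proper language-ID model would likely be more accurate.

```python
def hangul_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Hangul syllables (U+AC00..U+D7A3)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hangul = sum(1 for c in chars if "\uac00" <= c <= "\ud7a3")
    return hangul / len(chars)


def is_mostly_korean(text: str, threshold: float = 0.5) -> bool:
    # threshold is an assumed cutoff, not a tuned value
    return hangul_ratio(text) >= threshold


def korean_only(records, text_key="text", threshold=0.5):
    """Lazily yield records whose text passes the Hangul-ratio check.

    text_key is an assumed field name; adjust it to the dataset schema.
    """
    for rec in records:
        if is_mostly_korean(rec.get(text_key, ""), threshold):
            yield rec
```

Because `korean_only` is a generator, it can wrap a streaming dataset iterator without loading the corpus into memory.
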
---

### 🥈 Priority 2: Korean Preference/DPO Data (multiple sources)

- **Datasets:**
  - `kuotient/orca-math-korean-preference` ✅
  - `kuotient/orca-math-korean-dpo-pairs` ✅
  - `heegyu/orca-math-korean-preference-cleaned` ✅
  - `ohsuz/dpo-v1010-korean` ✅
  - `ChuGyouk/argilla-distilabel-math-preference-dpo-korean` ✅
- **Why:** With **zero** preference samples at present, ORPO training cannot run at all, so this is the most urgent item
- **Estimate:** 30-60K pairs in total
- **Access:** ✅ All free
- **Impact:** Enables the ORPO/DPO training pipeline
```bash
python3 << 'PYEOF'
from datasets import load_dataset
import os

out_dir = "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/preference"
os.makedirs(out_dir, exist_ok=True)

# (repo id, config name); None selects the default config
datasets_to_dl = [
    ("kuotient/orca-math-korean-preference", None),
    ("kuotient/orca-math-korean-dpo-pairs", None),
    ("heegyu/orca-math-korean-preference-cleaned", None),
    ("ohsuz/dpo-v1010-korean", None),
    ("ChuGyouk/argilla-distilabel-math-preference-dpo-korean", None),
]

for name, config in datasets_to_dl:
    try:
        ds = load_dataset(name, config, split="train")
        safe_name = name.replace("/", "_")
        ds.to_json(f"{out_dir}/{safe_name}.jsonl")
        print(f"✅ {name}: {len(ds)} samples")
    except Exception as e:
        # Keep going so one failed dataset does not abort the rest
        print(f"❌ {name}: {e}")
PYEOF
```
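
Since these datasets come from different authors, their column names may not match. The sketch below shows one way to normalize records into a common prompt/chosen/rejected shape before ORPO/DPO training. The default key names are assumptions; the actual schema of each dataset should be checked first, and `normalize_pair` is a hypothetical helper rather than part of any library.

```python
def normalize_pair(record, prompt_key="prompt", chosen_key="chosen", rejected_key="rejected"):
    """Return a {'prompt', 'chosen', 'rejected'} dict, or None if the record is unusable.

    The key arguments are assumed column names; pass the real ones per dataset.
    """
    prompt = str(record.get(prompt_key) or "").strip()
    chosen = str(record.get(chosen_key) or "").strip()
    rejected = str(record.get(rejected_key) or "").strip()
    # A pair with a missing side carries no preference signal
    if not prompt or not chosen or not rejected:
        return None
    # Identical answers carry no preference signal either
    if chosen == rejected:
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

For a dataset that stores the prompt under another column, the call would look like `normalize_pair(rec, prompt_key="question")`, where `question` stands in for whatever that dataset actually uses.
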
---

### 🥉 Priority 3: RedPajama-Data-1T (high-quality English subsets)

- **Dataset:** `togethercomputer/RedPajama-Data-1T`
- **Why:** English data is severely lacking (0.6B). Selectively download the code/ArXiv/Book/StackExchange subsets
- **Estimate:** 10-20B tokens after selection (code 5B + ArXiv 3B + Book 2B + SE 2B)
- **Access:** ✅ Free
- **Impact:** Substantially strengthens code/science/reasoning ability and cross-lingual transfer
```bash
python3 << 'PYEOF'
from datasets import load_dataset

cache_dir = "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/redpajama"

# Code subset first (github subset)
# streaming=True returns a lazy iterator; nothing is fetched until it is iterated
ds = load_dataset("togethercomputer/RedPajama-Data-1T", "github",
                  split="train", streaming=True, cache_dir=cache_dir)

# ArXiv subset
ds_arxiv = load_dataset("togethercomputer/RedPajama-Data-1T", "arxiv",
                        split="train", streaming=True, cache_dir=cache_dir)
PYEOF
```
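
To take only the planned number of tokens from each streamed subset, the iterator can be cut off once a budget is reached. The following is a rough sketch under stated assumptions: tokens are approximated by whitespace splitting (a real tokenizer would give different counts), `text` is an assumed field name, and `take_until_budget` is a hypothetical helper.

```python
def whitespace_token_count(text: str) -> int:
    # Crude stand-in for a real tokenizer; counts whitespace-separated pieces
    return len(text.split())


def take_until_budget(records, budget_tokens, text_key="text",
                      count_tokens=whitespace_token_count):
    """Yield records until the cumulative token count reaches budget_tokens.

    The last yielded record may push the total slightly past the budget.
    """
    used = 0
    for rec in records:
        if used >= budget_tokens:
            break
        used += count_tokens(rec.get(text_key, ""))
        yield rec
```

Passing the project's own tokenizer as `count_tokens` would make the budget comparable to the token figures in this plan.
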
---

### 4️⃣ Priority 4: Korean SFT Diversity Augmentation

- **Datasets:**
  - `kyujinpy/KOR-OpenOrca-Platypus-v3` ✅ (reasoning/math)
  - `maywell/ko_wikidata_QA` ✅ (knowledge QA)
  - `nlpai-lab/kullm-v2` ✅ (general-purpose instructions)
- **Why:** The current 170K SFT samples are sufficient in volume but lack code/math/reasoning coverage
- **Estimate:** +50-100K samples across diverse domains
- **Access:** ✅ All free
```bash
python3 << 'PYEOF'
from datasets import load_dataset
import os

out_dir = "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/sft_extra"
os.makedirs(out_dir, exist_ok=True)

for name in ["kyujinpy/KOR-OpenOrca-Platypus-v3", "maywell/ko_wikidata_QA", "nlpai-lab/kullm-v2"]:
    try:
        ds = load_dataset(name, split="train")
        safe = name.replace("/", "_")
        ds.to_json(f"{out_dir}/{safe}.jsonl")
        print(f"✅ {name}: {len(ds)}")
    except Exception as e:
        # Keep going so one failed dataset does not abort the rest
        print(f"❌ {name}: {e}")
PYEOF
```
---

### 5️⃣ Priority 5: Open-Web-Math (math-focused)

- **Dataset:** `open-web-math/open-web-math`
- **Why:** There is no math data at all, and math ability is a core LLM benchmark item
- **Estimate:** ~14B tokens (English math)
- **Access:** ✅ Free
- **Impact:** Establishes a foundation for mathematical reasoning ability
```bash
python3 -c "
from datasets import load_dataset
# streaming=True returns a lazy iterator; nothing is fetched until it is iterated
ds = load_dataset('open-web-math/open-web-math', split='train', streaming=True,
                  cache_dir='/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/open-web-math')
# Stream and save
"
```
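
The "stream and save" step is left open above. One possible approach is to write the stream to fixed-size JSONL shards, sketched below. The shard size, file naming, and the `write_jsonl_shards` helper are assumptions for illustration, not the plan's actual tooling.

```python
import json
import os
from itertools import islice


def write_jsonl_shards(records, out_dir, shard_size=100_000, prefix="shard"):
    """Write an iterable of dict records to numbered JSONL files; return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    it = iter(records)
    paths = []
    while True:
        # Pull at most shard_size records without materializing the whole stream
        batch = list(islice(it, shard_size))
        if not batch:
            break
        path = os.path.join(out_dir, f"{prefix}-{len(paths):05d}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for rec in batch:
                # ensure_ascii=False keeps non-ASCII text readable in the output
                f.write(json.dumps(rec, ensure_ascii=False) + "\n")
        paths.append(path)
    return paths
```

Sharding keeps individual files manageable and lets an interrupted download resume from the last complete shard, if the caller tracks how many records were consumed.
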
---

## Expected Token Distribution After Download

| Category | Current | Added | Total |
|---------|------|------|------|
| Korean pretrain | 39B | +5-10B (fineweb-edu ko) | 44-49B |
| English code | 0 | +5B (RedPajama github) | 5B |
| English science/ArXiv | 0 | +3B (RedPajama arxiv) | 3B |
| English math | 0 | +10B (open-web-math) | 10B |
| Other high-quality English | 0.6B | +5B (RedPajama book+SE) | 5.6B |
| **Pretrain total** | **~39B** | **+28-33B** | **~67-72B** |
| SFT | 170K | +50-100K | 220-270K |
| Preference | 0 | +30-60K pairs | 30-60K pairs |
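
The pretrain totals in the table can be re-derived with a few lines of arithmetic. This sketch copies the per-category figures from the table (in billions of tokens); the dictionary keys are illustrative names chosen here, not identifiers used elsewhere in the plan.

```python
# (low, high) added tokens in billions, copied from the table above
ADDED = {
    "korean_pretrain": (5.0, 10.0),
    "english_code": (5.0, 5.0),
    "english_arxiv": (3.0, 3.0),
    "english_math": (10.0, 10.0),
    "english_other": (5.0, 5.0),
}
# Current tokens in billions; categories not listed currently have 0
CURRENT = {"korean_pretrain": 39.0, "english_other": 0.6}


def pretrain_totals(current=CURRENT, added=ADDED):
    """Return ((added_low, added_high), (total_low, total_high)) in billions of tokens."""
    base = sum(current.values())
    low = sum(lo for lo, _ in added.values())
    high = sum(hi for _, hi in added.values())
    return (low, high), (base + low, base + high)


added_range, total_range = pretrain_totals()
print(f"added: +{added_range[0]:.0f}-{added_range[1]:.0f}B")
print(f"total: ~{total_range[0]:.1f}-{total_range[1]:.1f}B")
```

Even the low end of the total range stays above the 60B figure used as the minimum target.
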
### Goal Attainment

- ✅ Chinchilla minimum (60B) is achievable
- ✅ ORPO/DPO training becomes possible
- ✅ Code/math/science domains are covered
- 🟡 Still short of Chinchilla optimal (210B) → consider adding the full CulturaX, SlimPajama, etc. later

---
## Recommended Data Mix Ratios (at training time)

```
Korean text:            50% (~35B tokens)
English code:           15% (~10B tokens)
English math/science:   15% (~10B tokens)
English general:        15% (~10B tokens)
Korean education:        5% (~3B tokens)
```
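
The ratios above can be turned into per-source token targets and a weighted sampling order. This is a sketch under assumptions: the source names are illustrative labels, the 70B total is implied by "50% is about 35B" rather than stated in the plan, and the actual training loader may mix data differently.

```python
import random

# Mix ratios copied from the block above; keys are illustrative labels
MIX = {
    "korean_text": 0.50,
    "english_code": 0.15,
    "english_math_science": 0.15,
    "english_general": 0.15,
    "korean_education": 0.05,
}


def token_targets(total_tokens_b, mix=MIX):
    """Split a total token budget (in billions) across sources by ratio."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix ratios must sum to 1")
    return {name: total_tokens_b * share for name, share in mix.items()}


def sample_sources(n, mix=MIX, seed=0):
    """Draw n source names with probability proportional to the mix ratios."""
    rng = random.Random(seed)
    names = list(mix)
    return rng.choices(names, weights=[mix[k] for k in names], k=n)
```

With the `datasets` library, the same weights could be passed as the `probabilities` argument of `interleave_datasets` to mix streaming datasets directly.
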
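## Caveats

1. CulturaX is gated (auto-approval) → the terms must be accepted on HuggingFace (reuse the 60GB already downloaded)
2. the-stack-dedup is also gated → approval is required, so RedPajama github is used instead
3. Before downloading, run `huggingface-cli login` and supply the access token at the prompt or from an environment variable (for example `--token "$HF_TOKEN"`); do not store the token in this file
4. For large downloads, setting the `HF_HUB_ENABLE_HF_TRANSFER=1` environment variable is recommended (requires `pip install hf_transfer`)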
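
The login step can also be done from Python while keeping the token out of files and shell history. A minimal sketch, assuming the token has been exported as `HF_TOKEN` (the variable name is an assumption here) and that `huggingface_hub` is installed; `read_hf_token` and `login_from_env` are hypothetical helper names.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Read the access token from the environment instead of hardcoding it."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"{env_var} is not set; export it before downloading")
    return token


def login_from_env(env_var="HF_TOKEN"):
    # Imported lazily so the token check works even where huggingface_hub is absent
    from huggingface_hub import login
    login(token=read_hf_token(env_var))
```

Calling `login_from_env()` once at the start of a download script is enough for the gated datasets listed above, provided access has already been granted on the dataset pages.
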