
# Download Priority Plan

Created: 2026-02-27 | Free disk space: 19TB

## Top 5 Immediate Downloads (in priority order)


### 🥇 Priority 1: FineWeb-Edu (Korean subset)

- Dataset: HuggingFaceFW/fineweb-edu
- Why: education-quality-filtered web data, high quality (grade A). Only the Korean subset needs to be extracted
- Expected: 5~15B tokens (Korean portion)
- Access: ✅ free, not gated
- Impact: secures a large volume of high-quality pretraining tokens and strengthens the education domain
```bash
# Download the Korean subset
pip install datasets
python3 -c "
from datasets import load_dataset
ds = load_dataset('HuggingFaceFW/fineweb-edu', 'CC-MAIN-2024-10', split='train', streaming=True)
# language filter needed - fineweb-edu is primarily English
# Alternative: fineweb-edu-score filtered Korean web data
"
```

⚠️ Note: fineweb-edu is mostly English, so the Korean share may be small. It is still valuable as a supplemental source of high-quality English data.
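Since the stream is mostly English, the rows need a language filter before saving. A minimal sketch of one option, a Hangul-ratio heuristic — the 0.3 threshold and the use of the `text` field are assumptions, not tuned values:

```python
# Sketch: keep only documents whose non-whitespace characters are
# sufficiently Hangul. Threshold 0.3 is an untuned assumption.

def hangul_ratio(text: str) -> float:
    """Fraction of non-space characters in the Hangul syllable block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hangul = sum(1 for c in chars if '\uac00' <= c <= '\ud7a3')
    return hangul / len(chars)

def is_korean(text: str, threshold: float = 0.3) -> bool:
    return hangul_ratio(text) >= threshold

# Applied to the real stream (network required, so shown commented out):
# from datasets import load_dataset
# ds = load_dataset('HuggingFaceFW/fineweb-edu', 'CC-MAIN-2024-10',
#                   split='train', streaming=True)
# korean_docs = (row for row in ds if is_korean(row['text']))

if __name__ == "__main__":
    print(is_korean("안녕하세요. 오늘의 교육 자료입니다."))  # True
    print(is_korean("This page is entirely English text."))  # False
```

A character-block heuristic is crude but cheap enough to run over a 15B-token stream; a fastText language-ID model would be more precise if the false-positive rate matters.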


### 🥈 Priority 2: Korean preference/DPO data (multiple sources)

- Datasets:
  - kuotient/orca-math-korean-preference ✅
  - kuotient/orca-math-korean-dpo-pairs ✅
  - heegyu/orca-math-korean-preference-cleaned ✅
  - ohsuz/dpo-v1010-korean ✅
  - ChuGyouk/argilla-distilabel-math-preference-dpo-korean ✅
- Why: with zero preference pairs at the moment, ORPO training is outright impossible → most urgent
- Expected: 30~60K pairs in total
- Access: ✅ all free
- Impact: enables the ORPO/DPO training pipeline
```bash
python3 << 'PYEOF'
from datasets import load_dataset
import json, os

out_dir = "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/preference"
os.makedirs(out_dir, exist_ok=True)

datasets_to_dl = [
    ("kuotient/orca-math-korean-preference", None),
    ("kuotient/orca-math-korean-dpo-pairs", None),
    ("heegyu/orca-math-korean-preference-cleaned", None),
    ("ohsuz/dpo-v1010-korean", None),
    ("ChuGyouk/argilla-distilabel-math-preference-dpo-korean", None),
]

for name, config in datasets_to_dl:
    try:
        ds = load_dataset(name, config, split="train")
        safe_name = name.replace("/", "_")
        ds.to_json(f"{out_dir}/{safe_name}.jsonl")
        print(f"✅ {name}: {len(ds)} samples")
    except Exception as e:
        print(f"❌ {name}: {e}")
PYEOF
```
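These five sets do not share a single schema, so before ORPO/DPO training the rows need normalizing onto a canonical `{prompt, chosen, rejected}` triple. A sketch under the assumption that the prompt column is one of `prompt`/`question`/`instruction` — check each dataset's actual columns before relying on this:

```python
# Sketch: map heterogeneous preference rows onto the canonical triple.
# The candidate key names below are assumptions, not verified schemas.
PROMPT_KEYS = ("prompt", "question", "instruction")

def normalize(row):
    """Return {prompt, chosen, rejected} for a raw row, or None to drop it."""
    prompt = next((row[k] for k in PROMPT_KEYS if row.get(k)), None)
    chosen, rejected = row.get("chosen"), row.get("rejected")
    if not (prompt and chosen and rejected):
        return None  # skip rows missing either side of the pair
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

if __name__ == "__main__":
    raw = {"question": "1+1은?", "chosen": "2입니다.", "rejected": "3입니다."}
    print(normalize(raw)["prompt"])  # 1+1은?
```

Running this over each downloaded .jsonl before merging keeps the ORPO dataloader to a single format.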

### 🥉 Priority 3: RedPajama-Data-1T (high-quality English subsets)

- Dataset: togethercomputer/RedPajama-Data-1T
- Why: English data is extremely scarce (0.6B). Selectively download the code/ArXiv/Book/StackExchange subsets
- Expected: 10~20B selected tokens (code 5B + ArXiv 3B + Book 2B + SE 2B)
- Access: ✅ free
- Impact: major boost to code/science/reasoning ability and cross-lingual transfer
```bash
python3 << 'PYEOF'
from datasets import load_dataset

# Code subset first (github config)
ds = load_dataset("togethercomputer/RedPajama-Data-1T", "github",
                  split="train", streaming=True,
                  cache_dir="/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/redpajama")
# ArXiv subset
ds_arxiv = load_dataset("togethercomputer/RedPajama-Data-1T", "arxiv",
                        split="train", streaming=True,
                        cache_dir="/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/redpajama")
PYEOF
```
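The full subsets are far larger than the 5B/3B targets above, so rather than materializing whole splits it helps to cap the stream by a budget. A sketch using a character budget as a cheap token proxy — the budget figure and the `text` field name are assumptions:

```python
# Sketch: consume a streamed dataset only until an approximate
# character budget is spent, instead of downloading the whole split.

def take_until_budget(rows, max_chars, text_key="text"):
    """Yield rows until the running character count reaches max_chars."""
    used = 0
    for row in rows:
        yield row
        used += len(row.get(text_key, ""))
        if used >= max_chars:
            break

# With the real stream (network required, so shown commented out);
# ~20B chars is roughly 5B tokens at an assumed ~4 chars/token:
# import json
# from datasets import load_dataset
# ds = load_dataset("togethercomputer/RedPajama-Data-1T", "github",
#                   split="train", streaming=True)
# with open("github_subset.jsonl", "w") as f:
#     for row in take_until_budget(ds, max_chars=20_000_000_000):
#         f.write(json.dumps(row, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    fake = [{"text": "x" * 40} for _ in range(10)]
    print(len(list(take_until_budget(fake, max_chars=100))))  # 3
```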

### 4️⃣ Priority 4: Korean SFT diversity reinforcement

- Datasets:
  - kyujinpy/KOR-OpenOrca-Platypus-v3 ✅ (reasoning/math)
  - maywell/ko_wikidata_QA ✅ (knowledge QA)
  - nlpai-lab/kullm-v2 ✅ (general instructions)
- Why: the current 170K SFT samples are sufficient in quantity but short on code/math/reasoning domains
- Expected: +50~100K samples across diverse domains
- Access: ✅ all free
```bash
python3 << 'PYEOF'
from datasets import load_dataset
import os

out_dir = "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/sft_extra"
os.makedirs(out_dir, exist_ok=True)

for name in ["kyujinpy/KOR-OpenOrca-Platypus-v3", "maywell/ko_wikidata_QA", "nlpai-lab/kullm-v2"]:
    try:
        ds = load_dataset(name, split="train")
        safe = name.replace("/", "_")
        ds.to_json(f"{out_dir}/{safe}.jsonl")
        print(f"✅ {name}: {len(ds)}")
    except Exception as e:
        print(f"❌ {name}: {e}")
PYEOF
```
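These sets overlap with OpenOrca-style data already in the 170K SFT pool, so an exact-match dedup pass before merging is worthwhile. A sketch keyed on a whitespace-normalized, lowercased instruction hash — the `instruction` column name is an assumption (kullm-v2 and the others may use different fields):

```python
# Sketch: drop rows whose normalized instruction text already appeared.
import hashlib

def dedup(rows, key="instruction"):
    """Keep the first row for each normalized value of `key`."""
    seen, out = set(), []
    for row in rows:
        norm = " ".join(str(row.get(key, "")).split()).lower()
        h = hashlib.sha1(norm.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(row)
    return out

if __name__ == "__main__":
    rows = [{"instruction": "What is 2+2?"},
            {"instruction": "what is  2+2?"},   # same after normalization
            {"instruction": "Name a prime."}]
    print(len(dedup(rows)))  # 2
```

Exact-match dedup catches re-uploads and casing/whitespace variants only; near-duplicate paraphrases would need MinHash or embedding-based filtering.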

### 5️⃣ Priority 5: Open-Web-Math (math-focused)

- Dataset: open-web-math/open-web-math
- Why: there is currently no math data at all, and math ability is a core area of LLM benchmarks
- Expected: ~14B tokens (English math)
- Access: ✅ free
- Impact: establishes a foundation for mathematical reasoning
```bash
python3 -c "
from datasets import load_dataset
ds = load_dataset('open-web-math/open-web-math', split='train', streaming=True,
                  cache_dir='/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/open-web-math')
# Stream and save
"
```

## Expected Token Distribution After Download

| Category | Current | Added | Total |
|---|---|---|---|
| Korean pretrain | 39B | +5~10B (fineweb-edu ko) | 44~49B |
| English code | 0 | +5B (RedPajama github) | 5B |
| English science/ArXiv | 0 | +3B (RedPajama arxiv) | 3B |
| English math | 0 | +10B (open-web-math) | 10B |
| English other high-quality | 0.6B | +5B (RedPajama book+SE) | 5.6B |
| **Pretrain total** | ~39B | +28~33B | **67~72B** |
| SFT | 170K | +50~100K | 220~270K |
| Preference | 0 | +30~60K pairs | 30~60K pairs |
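The pretrain totals can be sanity-checked with quick arithmetic; note the per-category "current" column actually sums to 39.6B, which the table rounds to ~39B:

```python
# Per-category figures copied from the table above (billions of tokens).
current = {"ko_pretrain": 39.0, "en_code": 0.0, "en_arxiv": 0.0,
           "en_math": 0.0, "en_other": 0.6}
added_low = {"ko_pretrain": 5, "en_code": 5, "en_arxiv": 3,
             "en_math": 10, "en_other": 5}
added_high = dict(added_low, ko_pretrain=10)  # fineweb-edu ko upper bound

low = sum(current.values()) + sum(added_low.values())
high = sum(current.values()) + sum(added_high.values())
print(f"pretrain total: {low:.1f}B ~ {high:.1f}B")  # 67.6B ~ 72.6B
```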

## Goal Status

- ✅ Chinchilla minimum (60B) achievable
- ✅ ORPO/DPO training enabled
- ✅ Code/math/science domains covered
- 🟡 Still short of the Chinchilla optimum (210B) → later consider adding the full CulturaX, SlimPajama, etc.

## Recommended Data Mix (for training)

| Category | Share | Tokens |
|---|---|---|
| Korean text | 50% | ~35B |
| English code | 15% | ~10B |
| English math/science | 15% | ~10B |
| English general | 15% | ~10B |
| Korean education | 5% | ~3B |
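The mix above can be turned into per-example sampler weights: each category's weight is its target share divided by the tokens available in it. A sketch using the shares and approximate availability figures from this plan, assuming uniform sampling within each category:

```python
# Sketch: convert target batch shares into per-token sampler weights.
mix = {"ko_text": 0.50, "en_code": 0.15, "en_math_sci": 0.15,
       "en_general": 0.15, "ko_edu": 0.05}          # target batch shares
available_b = {"ko_text": 35, "en_code": 10, "en_math_sci": 10,
               "en_general": 10, "ko_edu": 3}       # tokens on disk (B)

# A weighted sampler drawing each token with these weights reproduces
# the target shares in expectation.
weights = {k: mix[k] / available_b[k] for k in mix}

print(round(sum(mix.values()), 2))                    # 1.0: shares sum to 100%
print(round(weights["ko_edu"] / weights["ko_text"], 2))  # 1.17
```

The second ratio shows the smallest pool (Korean education, 3B) gets oversampled relative to the largest (Korean text, 35B) to hit its 5% share, i.e. it will be seen for more epochs during training.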

μ£Όμ˜μ‚¬ν•­

  1. CulturaX is gated (auto) → requires accepting the terms on HuggingFace (reuse the 60GB already downloaded)
  2. the-stack-dedup is also gated → approval required; substitute the RedPajama github subset
  3. Run `huggingface-cli login --token <YOUR_HF_TOKEN>` before downloading (never commit a real token)
  4. For large downloads, set the `HF_HUB_ENABLE_HF_TRANSFER=1` environment variable (requires `pip install hf_transfer`)