
# Comprehensive Report: Korean LLM Data

Generated: 2026-02-27 | Consolidated findings from 5 subagent surveys


## 1. Current Holdings

| Category | Dataset | Disk | Est. tokens | Quality |
|---|---|---|---|---|
| Educational web | fineweb2_edu_ko | 234G | ~50B | A |
| Web crawl | culturax_ko | 60G | ~24B | B+ |
| Math | open_web_math | 26G | ~10B | A |
| Web crawl | hplt_ko | 23G | ~9B | B |
| Web crawl | cc100_processed | 19G | ~7B | C+ |
| Web crawl | cc100_ko | 14G | ~5.5B | C |
| Web crawl | oscar_ko | 9.2G | ~3.5B | B |
| Education | korean_textbooks | 6.4G | ~1.5B | A |
| Web | korean_webtext | 4.2G | ~1B | B+ |
| Encyclopedia | namuwiki_2023 | 2.9G | ~1B | A- |
| Education | finepdfs_edu_ko | 2.9G | ~0.7B | A- |
| Encyclopedia | namuwiki_extracted | 2.2G | ~0.5B | A- |
| Encyclopedia | wikipedia_korean | 1.7G | ~0.4B | A |
| Encyclopedia | wikipedia_ko_2024 | 1.4G | ~0.3B | A |
| Instruct | kovast | 449M | ~0.1B | B |
| Instruct | evol_instruct_ko | 144M | ~0.03B | B |
| Conversation | korean_safe_conv | 51M | ~0.01B | B |
| **Total** | | **~410G** | **~114B (raw)** | |

⚠️ Tokenization already complete for some .bin shards: korean_train.bin (17G ≈ 8.9B tokens), korean_c4_train (15G ≈ 7.5B), etc. — roughly ~39B tokens actually used in training.
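The quoted .bin token counts are consistent with 2-byte token IDs; a minimal sanity-check sketch (assuming a fixed-width uint16 shard format, which the report does not state explicitly):

```python
def tokens_from_bytes(n_bytes: int, bytes_per_token: int = 2) -> float:
    """Token count of a fixed-width .bin shard: file size divided by
    bytes per token (uint16 token IDs -> 2 bytes each)."""
    return n_bytes / bytes_per_token

# 17 GB at 2 bytes/token is ~8.5B tokens, in line with the quoted
# ~8.9B for korean_train.bin (the small gap is likely GiB-vs-GB rounding).
print(tokens_from_bytes(17e9) / 1e9)  # 8.5

# For a real shard on disk:
#   tokens_from_bytes(os.path.getsize('korean_train.bin'))
```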


## 2. Domain Gap Analysis

### 🔴 CRITICAL (none held)

| Domain | Status | Impact |
|---|---|---|
| Preference/DPO | 0 items | ORPO training impossible |
| Law / precedents | 0 | no legal reasoning |
| Medicine | 0 | no healthcare responses |
| Code (Korean comments) | 0 | weak coding support |
| News / media | 0 | weak current-affairs context |

### 🟡 WEAK (severely lacking)

| Domain | Status | Impact |
|---|---|---|
| Instruction/SFT | ~0.6G (644MB) | weak instruction following |
| Finance / economics | 0 | weak finance-domain answers |
| Academic papers | 0 | weak academic writing |
| Fiction / literature | 0 | weak creative writing |

## 3. Top Candidates — Pretraining (filling domain gaps)

### 🥇 #1: KORMo-Team/korean-web-collection

- Size: 50-80GB / 20-30B tokens
- Highlights: the largest Korean-only web crawl on HF; little overlap with currently held data
- License: open
- Download: `huggingface-cli download KORMo-Team/korean-web-collection --repo-type dataset --local-dir ./data/korean-web-collection`

### 🥈 #2: HPLT/HPLT2.0_cleaned (ko)

- Size: ~30GB / ~12B tokens
- Highlights: HPLT v1.2 already held (23G) → v2.0 is larger and better cleaned; a net-new increment exists
- License: open
- Download: `python -c "from datasets import load_dataset; ds = load_dataset('HPLT/HPLT2.0_cleaned', 'ko', split='train'); ds.save_to_disk('./data/hplt2-ko')"`
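At this shard size, `load_dataset(...).save_to_disk(...)` first materializes the full split in the cache; streaming the split and writing JSONL incrementally avoids that. A sketch — the writer is generic, and only the repo/config in the comment come from the report:

```python
import json

def stream_to_jsonl(records, out_path, limit=None):
    """Write an iterable of dict records to JSONL one row at a time,
    so the full dataset never has to fit in memory. Returns rows written."""
    n = 0
    with open(out_path, 'w', encoding='utf-8') as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + '\n')
            n += 1
            if limit is not None and n >= limit:
                break
    return n

# With HF datasets (same repo/config as the command above):
#   from datasets import load_dataset
#   ds = load_dataset('HPLT/HPLT2.0_cleaned', 'ko', split='train', streaming=True)
#   stream_to_jsonl(ds, 'data/hplt2-ko.jsonl')
```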

### 🥉 #3: Legal-domain bundle

| Dataset | Size | Contents |
|---|---|---|
| joonhok-exo-ai/korean_law_open_data_precedents | ~1-2G | full court-precedent texts |
| smhilee/korean-law-dataset | ~1-3G | statutes / legal texts |
| Rootpye/korean-lawdata2 | ~0.5-1G | legal data |
| Rootpye/korean-lawdata4 | ~0.5-1G | legal data v4 |
| ducut91/korean-constitutional-court-decisions | ~0.5G | Constitutional Court decisions |

- Total: ~4-8G / ~1-2B tokens
- Why it matters: law is a completely empty domain; precise Korean plus logical structure → improves pretraining quality

### #4: mc4 (ko)

- Size: ~50GB / ~20B tokens
- Highlights: partial overlap with CulturaX, but the original mC4 contains additional text
- License: open
- Download: `python -c "from datasets import load_dataset; ds = load_dataset('mc4', 'ko', split='train'); ds.save_to_disk('./data/mc4-ko')"`
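Because of the noted overlap with CulturaX, an exact-match pass seeded with hashes of already-held text is a cheap first filter before mixing mC4 in. A sketch; near-duplicate detection (e.g. MinHash) would catch more:

```python
import hashlib

def text_hash(text):
    """Stable digest of whitespace-normalized text."""
    return hashlib.md5(' '.join(text.split()).encode('utf-8')).hexdigest()

def drop_duplicates(texts, seen=None):
    """Keep only texts whose hash is unseen; `seen` can be pre-seeded
    with hashes of corpora already on disk (e.g. culturax_ko)."""
    seen = set() if seen is None else seen
    kept = []
    for t in texts:
        h = text_hash(t)
        if h not in seen:
            seen.add(h)
            kept.append(t)
    return kept
```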

### #5: RedPajama-Data-1T (code + ArXiv)

- Size: 15-20GB selected / 8-10B tokens
- Highlights: even a Korean model needs English code and science data (cross-lingual transfer)
- Subsets: github (code, 5B) + arxiv (science, 3B) + book (2B)
- License: open

## 4. Top Candidates — SFT

### 🥇 #1: kuotient/orca-math-word-problems-193k-korean

- Size: 193K samples
- Contents: Korean math word problems, based on Orca Math
- Why: fills the completely empty math-SFT gap; verified high quality

### 🥈 #2: dbdu/ShareGPT-74k-ko

- Size: 74K samples
- Contents: real multi-turn ChatGPT conversations, translated into Korean
- Why: offsets the single-turn bias of the current data; diverse domains

### 🥉 #3: nayohan/Evol-Instruct-Code-80k-v1-ko

- Size: 80K samples
- Contents: WizardCoder-based coding instructions in Korean
- Why: coding is currently ~5% of the data → major reinforcement

### #4: nlp-with-deeplearning/Ko.WizardLM_evol_instruct_V2_196k

- Size: 196K samples
- Contents: WizardLM Evol-Instruct in Korean, including complex reasoning

### #5: FreedomIntelligence/alpaca-gpt4-korean

- Size: 52K samples
- Contents: GPT-4-generated Alpaca in Korean; high-quality responses

Projected total after the SFT additions: current 162K + 595K new = ~757K samples (a 4.7× increase)
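The five sets above ship with varying column layouts, so a small normalizer into a shared chat format helps before mixing. A sketch — the Alpaca-style `instruction`/`input`/`output` keys are an assumption and should be checked per repo:

```python
def to_chat(rec):
    """Normalize one instruction record into a messages list.
    Assumes Alpaca-style keys; real column names vary per dataset."""
    user = rec.get('instruction', '')
    if rec.get('input'):
        user += '\n\n' + rec['input']
    return {'messages': [
        {'role': 'user', 'content': user},
        {'role': 'assistant', 'content': rec.get('output', '')},
    ]}

# Sample-count check from the report:
# 193 + 74 + 80 + 196 + 52 = 595K new, plus the current 162K -> ~757K.
print(193 + 74 + 80 + 196 + 52)  # 595
```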


## 5. Top Candidates — Preference/ORPO

### 🥇 #1: jojo0217/korean_rlhf_dataset

- Size: 100K+ pairs
- Contents: comprehensive Korean RLHF set — the most general-purpose option
- Priority: download immediately

### 🥈 #2: maywell/ko_Ultrafeedback_binarized

- Size: ~60K pairs
- Contents: Korean translation of UltraFeedback, binarized (chosen/rejected)
- Why: already in chosen/rejected form, so it is directly usable for ORPO

### 🥉 #3: nayohan/preference-collection-ko-full

- Size: 100K+ pairs
- Contents: comprehensive Korean preference collection

### #4: kuotient/orca-math-korean-dpo-pairs

- Size: 100K+ pairs
- Contents: math-focused DPO pairs

Recommended ORPO mix: jojo0217 + maywell + nayohan = ~260K pairs → ready to start immediately
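The recommended mix can be assembled by normalizing each set to the prompt/chosen/rejected schema that ORPO-style trainers consume. A sketch over plain lists of dicts; column names other than `chosen`/`rejected` are assumptions that vary per repo:

```python
def combine_preference(*datasets):
    """Merge preference datasets into one prompt/chosen/rejected list,
    dropping rows that lack either side of the pair."""
    merged = []
    for records in datasets:
        for r in records:
            if r.get('chosen') and r.get('rejected'):
                merged.append({'prompt': r.get('prompt', ''),
                               'chosen': r['chosen'],
                               'rejected': r['rejected']})
    return merged
```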


## 6. External Sources (application required)

| Source | Est. volume | Notes |
|---|---|---|
| AI Hub (aihub.or.kr) | 60-100GB | news, dialogue, medical, legal, finance specialist data — approval required; non-commercial use possible |
| NIKL Modu Corpus | 35-50GB | written/spoken corpora; apply for non-commercial research use |
| National Law Information Center | 5-10GB | crawlable (public data) |
| KCI academic papers | 3-5GB | paper abstracts; API available |

## 7. Download Execution Plan (in priority order)

```bash
cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang

# === Phase 1: Preference (enables ORPO immediately; small volume) ===
python3 -c "
from datasets import load_dataset
import os
out = 'data/preference'
os.makedirs(out, exist_ok=True)
for name in ['jojo0217/korean_rlhf_dataset', 'maywell/ko_Ultrafeedback_binarized', 'nayohan/preference-collection-ko-full', 'kuotient/orca-math-korean-dpo-pairs']:
    try:
        ds = load_dataset(name, split='train')
        ds.to_json(f'{out}/{name.replace(\"/\",\"_\")}.jsonl')
        print(f'✅ {name}: {len(ds)} samples')
    except Exception as e:
        print(f'❌ {name}: {e}')
" 2>&1 | tee /tmp/preference_dl.log &

# === Phase 2: SFT reinforcement (dialogue / math / code) ===
python3 -c "
from datasets import load_dataset
import os
out = 'data/sft_extra'
os.makedirs(out, exist_ok=True)
for name in ['kuotient/orca-math-word-problems-193k-korean','dbdu/ShareGPT-74k-ko','nayohan/Evol-Instruct-Code-80k-v1-ko','nlp-with-deeplearning/Ko.WizardLM_evol_instruct_V2_196k','FreedomIntelligence/alpaca-gpt4-korean']:
    try:
        ds = load_dataset(name, split='train')
        ds.to_json(f'{out}/{name.replace(\"/\",\"_\")}.jsonl')
        print(f'✅ {name}: {len(ds)}')
    except Exception as e:
        print(f'❌ {name}: {e}')
" 2>&1 | tee /tmp/sft_extra_dl.log &

# === Phase 3: Legal pretraining reinforcement ===
python3 -c "
from datasets import load_dataset
import os
out = 'data/korean_extra/korean_law'
os.makedirs(out, exist_ok=True)
for name in ['joonhok-exo-ai/korean_law_open_data_precedents','smhilee/korean-law-dataset','Rootpye/korean-lawdata2']:
    try:
        ds = load_dataset(name, split='train')
        ds.to_json(f'{out}/{name.replace(\"/\",\"_\")}.jsonl')
        print(f'✅ {name}: {len(ds)}')
    except Exception as e:
        print(f'❌ {name}: {e}')
" 2>&1 | tee /tmp/law_dl.log &

# === Phase 4: Large-volume pretraining (long-running; run in background) ===
# mc4 Korean (~50GB)
# python3 -c "from datasets import load_dataset; ds = load_dataset('mc4', 'ko', split='train'); ds.save_to_disk('data/korean_extra/mc4_ko')"
# KORMo Web Collection
# huggingface-cli download KORMo-Team/korean-web-collection --repo-type dataset --local-dir data/korean_extra/korean_web_collection
```

## 8. Projected Data Mix After Additions

| Category | Current tokens | After additions | Notes |
|---|---|---|---|
| Korean pretrain | ~39B (tokenized) | 60-80B | with mc4 + KORMo + legal bundle |
| SFT | 162K | ~757K | after the 5 additions |
| Preference | 0 | ~260K pairs | jojo + maywell + nayohan |
| Code/English | ~0.6B | ~10B | RedPajama github + arxiv |
| Law | 0 | 1-2B | legal bundle |

The Chinchilla minimum (60B tokens) becomes achievable ✅
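The Chinchilla heuristic is roughly 20 training tokens per parameter, so the 60B figure corresponds to a ~3B-parameter target; the report does not state the model size, so that mapping is an inference:

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal token budget under the ~20 tokens/param heuristic."""
    return n_params * tokens_per_param

# 3B parameters -> 60B tokens at the 20:1 ratio.
print(chinchilla_tokens(3e9) / 1e9)  # 60.0
```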


Report location: /PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/data_inventory/