# Comprehensive Report on Korean LLM Data
> Written: 2026-02-27 | Consolidated findings from 5 subagent investigations

---

## 1. Current Holdings

| Category | Dataset | Disk | Est. tokens | Quality |
|---------|---------|--------|---------|------|
| Educational web | fineweb2_edu_ko | 234G | ~50B | A |
| Web crawl | culturax_ko | 60G | ~24B | B+ |
| Math | open_web_math | 26G | ~10B | A |
| Web crawl | hplt_ko | 23G | ~9B | B |
| Web crawl | cc100_processed | 19G | ~7B | C+ |
| Web crawl | cc100_ko | 14G | ~5.5B | C |
| Web crawl | oscar_ko | 9.2G | ~3.5B | B |
| Education | korean_textbooks | 6.4G | ~1.5B | A |
| Web | korean_webtext | 4.2G | ~1B | B+ |
| Encyclopedia | namuwiki_2023 | 2.9G | ~1B | A- |
| Education | finepdfs_edu_ko | 2.9G | ~0.7B | A- |
| Encyclopedia | namuwiki_extracted | 2.2G | ~0.5B | A- |
| Encyclopedia | wikipedia_korean | 1.7G | ~0.4B | A |
| Encyclopedia | wikipedia_ko_2024 | 1.4G | ~0.3B | A |
| Instruct | kovast | 449M | ~0.1B | B |
| Instruct | evol_instruct_ko | 144M | ~0.03B | B |
| Conversation | korean_safe_conv | 51M | ~0.01B | B |
| **Total** | | **~410G** | **~114B raw** | |

> ⚠️ Tokenized `.bin` files: korean_train.bin (17G → 8.9B), korean_c4_train (15G → 7.5B), and others; about 39B tokens are actually used for training.

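The totals row can be cross-checked by summing the per-dataset figures. The following is a minimal sketch, assuming the table values are entered by hand (disk in GB with the `M` entries converted at 1000 MB per GB, tokens in billions); `HOLDINGS` and `totals` are illustrative names, not part of any existing tooling. The summed disk size comes out slightly under the rounded ~410G, and the 39B tokenized figure is roughly a third of the raw token estimate.

```python
# Disk size (GB) and estimated raw tokens (billions), copied from the holdings table.
HOLDINGS = {
    "fineweb2_edu_ko": (234, 50),
    "culturax_ko": (60, 24),
    "open_web_math": (26, 10),
    "hplt_ko": (23, 9),
    "cc100_processed": (19, 7),
    "cc100_ko": (14, 5.5),
    "oscar_ko": (9.2, 3.5),
    "korean_textbooks": (6.4, 1.5),
    "korean_webtext": (4.2, 1),
    "namuwiki_2023": (2.9, 1),
    "finepdfs_edu_ko": (2.9, 0.7),
    "namuwiki_extracted": (2.2, 0.5),
    "wikipedia_korean": (1.7, 0.4),
    "wikipedia_ko_2024": (1.4, 0.3),
    "kovast": (0.449, 0.1),
    "evol_instruct_ko": (0.144, 0.03),
    "korean_safe_conv": (0.051, 0.01),
}

TOKENIZED_B = 39  # tokens (billions) actually used for training, from the note above


def totals(holdings):
    """Return (total disk in GB, total estimated tokens in billions)."""
    disk_gb = sum(disk for disk, _ in holdings.values())
    tokens_b = sum(tokens for _, tokens in holdings.values())
    return disk_gb, tokens_b


disk_gb, tokens_b = totals(HOLDINGS)
tokenized_share = TOKENIZED_B / tokens_b
print(f"disk ~{disk_gb:.0f}G, raw tokens ~{tokens_b:.1f}B, tokenized share ~{tokenized_share:.0%}")
```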
---

## 2. Domain Gap Analysis

### 🔴 CRITICAL (missing)
| Domain | Status | Impact |
|--------|------|------|
| **Preference/DPO** | 0 items | ORPO training not possible |
| **Legal/case law** | 0 | Legal reasoning not possible |
| **Medical/medicine** | 0 | Healthcare responses not possible |
| **Code (Korean comments)** | 0 | Weak coding support |
| **News/press** | 0 | Weak current-affairs context |

### 🟡 WEAK (severely lacking)
| Domain | Status | Impact |
|--------|------|------|
| **Instruction/SFT** | ~0.6G (644MB) | Weak instruction following |
| **Finance/economics** | 0 | Weak finance-domain responses |
| **Academic papers** | 0 | Weak academic writing |
| **Fiction/literature** | 0 | Weak creative ability |

---

## 3. Top Candidates: Pretraining (Filling Domain Gaps)

### 🥇 Priority 1: KORMo-Team/korean-web-collection
- **Size**: ~50-80GB / ~20-30B tokens
- **Notes**: The largest Korean-only web crawl on HF. Little overlap with the data currently held.
- **License**: Public
- **Download**: `huggingface-cli download KORMo-Team/korean-web-collection --repo-type dataset --local-dir ./data/korean-web-collection`

### 🥈 Priority 2: HPLT/HPLT2.0_cleaned (ko)
- **Size**: ~30GB / ~12B tokens
- **Notes**: HPLT v1.2 is already held (23G) → v2.0 is larger and better cleaned, so it contributes a net increase over v1.2.
- **License**: Public
- **Download**: `python -c "from datasets import load_dataset; ds = load_dataset('HPLT/HPLT2.0_cleaned', 'ko', split='train'); ds.save_to_disk('./data/hplt2-ko')"`

### 🥉 Priority 3: Legal-domain bundle
| Dataset | Size | Contents |
|---------|------|------|
| `joonhok-exo-ai/korean_law_open_data_precedents` | ~1-2G | Full text of court precedents |
| `smhilee/korean-law-dataset` | ~1-3G | Statute and legal text |
| `Rootpye/korean-lawdata2` | ~0.5-1G | Legal data |
| `Rootpye/korean-lawdata4` | ~0.5-1G | Legal data v4 |
| `ducut91/korean-constitutional-court-decisions` | ~0.5G | Constitutional Court decisions |

- **Total**: ~4-8G / ~1-2B tokens
- **Why it matters**: Law is a completely empty domain. Precise Korean plus logical structure → better pretraining quality.

### Priority 4: mc4 (ko)
- **Size**: ~50GB / ~20B tokens
- **Notes**: Partially overlaps with CulturaX, but the original mC4 contains additional text.
- **License**: Public
- **Download**: `python -c "from datasets import load_dataset; ds = load_dataset('mc4', 'ko', split='train'); ds.save_to_disk('./data/mc4-ko')"`

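Because mC4 partially overlaps with CulturaX (and HPLT v2.0 with v1.2), only the net-new documents add value. One way to estimate that is exact-match deduplication on normalized text. This is a minimal sketch, not the pipeline used for this inventory: `text_key` and `net_new` are hypothetical helpers, the sample strings are made up, and real web crawls would likely also need near-duplicate detection, which this does not attempt.

```python
import hashlib
import unicodedata


def text_key(text: str) -> str:
    """Hash of NFKC-normalized text with runs of whitespace collapsed."""
    normalized = " ".join(unicodedata.normalize("NFKC", text).split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def net_new(existing_docs, candidate_docs):
    """Yield candidate documents that are not exact duplicates of existing ones
    (or of an earlier candidate)."""
    seen = {text_key(doc) for doc in existing_docs}
    for doc in candidate_docs:
        key = text_key(doc)
        if key not in seen:
            seen.add(key)
            yield doc


# Made-up sample documents standing in for the two corpora.
culturax_sample = ["안녕하세요.  반갑습니다.", "오늘 날씨가 좋습니다."]
mc4_sample = ["안녕하세요. 반갑습니다.", "새로운 문서입니다.", "새로운 문서입니다."]

fresh = list(net_new(culturax_sample, mc4_sample))
print(f"{len(fresh)} of {len(mc4_sample)} candidate documents are net-new")
```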
### Priority 5: RedPajama-Data-1T (code + ArXiv)
- **Size**: selected subsets, ~15-20GB / ~8-10B tokens
- **Notes**: Even a Korean model needs English code and science data (cross-lingual transfer).
- **Subsets**: `github` (code, 5B) + `arxiv` (science, 3B) + `book` (2B)
- **License**: Public

---

## 4. Top Candidates: SFT

### 🥇 1: kuotient/orca-math-word-problems-193k-korean
- **Size**: 193K samples
- **Contents**: Math word problems in Korean, based on Orca Math
- **Why**: Fills a complete gap in the math domain. Verified, high quality.

### 🥈 2: dbdu/ShareGPT-74k-ko
- **Size**: 74K samples
- **Contents**: Korean translation of real multi-turn ChatGPT conversations
- **Why**: Offsets the single-turn bias of the current data; covers diverse domains.

### 🥉 3: nayohan/Evol-Instruct-Code-80k-v1-ko
- **Size**: 80K samples
- **Contents**: WizardCoder-based coding instructions in Korean
- **Why**: Coding currently makes up ~5% of the data → substantially strengthened.

### 4: nlp-with-deeplearning/Ko.WizardLM_evol_instruct_V2_196k
- **Size**: 196K samples
- **Contents**: WizardLM Evol Instruct in Korean; includes complex reasoning

### 5: FreedomIntelligence/alpaca-gpt4-korean
- **Size**: 52K samples
- **Contents**: GPT-4-generated Alpaca in Korean; high-quality responses

> **Projected SFT total after additions**: current 162K + 595K = **~757K** (about 4.7x the current size)

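The projected SFT total follows directly from the five sample counts listed above. A small sketch of that arithmetic, assuming the nominal sizes in each dataset name are accurate (actual row counts after download may differ); the variable names are illustrative only:

```python
CURRENT_SFT = 162_000  # samples currently held, per the report

# Nominal sample counts of the five SFT candidates listed above.
CANDIDATES = {
    "kuotient/orca-math-word-problems-193k-korean": 193_000,
    "dbdu/ShareGPT-74k-ko": 74_000,
    "nayohan/Evol-Instruct-Code-80k-v1-ko": 80_000,
    "nlp-with-deeplearning/Ko.WizardLM_evol_instruct_V2_196k": 196_000,
    "FreedomIntelligence/alpaca-gpt4-korean": 52_000,
}

added = sum(CANDIDATES.values())
total = CURRENT_SFT + added
ratio = total / CURRENT_SFT

print(f"added {added:,} -> total {total:,} ({ratio:.1f}x the current size)")
for name, count in CANDIDATES.items():
    # Share of each new dataset in the projected SFT mix.
    print(f"  {name}: {count / total:.1%}")
```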
---

## 5. Top Candidates: Preference/ORPO

### 🥇 1: jojo0217/korean_rlhf_dataset
- **Size**: 100K+ pairs
- **Contents**: General Korean RLHF collection; the most general-purpose option
- **Priority**: Download immediately

### 🥈 2: maywell/ko_Ultrafeedback_binarized
- **Size**: ~60K pairs
- **Contents**: Korean translation of UltraFeedback, binarized (chosen/rejected)
- **Why**: Already in chosen/rejected format, so it can be used for ORPO directly.

### 🥉 3: nayohan/preference-collection-ko-full
- **Size**: 100K+ pairs
- **Contents**: General Korean preference collection

### 4: kuotient/orca-math-korean-dpo-pairs
- **Size**: 100K+ pairs
- **Contents**: Math-focused DPO pairs

> **Recommended ORPO mix**: jojo0217 + maywell + nayohan = ~260K pairs → ready to start immediately

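Mixing several preference datasets for ORPO requires every record to end up in the same prompt/chosen/rejected shape. The sketch below shows one way to normalize and filter records before training. The column aliases in `ALIASES` are assumptions made for illustration, since this report does not list each dataset's actual column names; check them against the downloaded files before relying on this.

```python
# Hypothetical column aliases; the real column names of each dataset must be verified.
ALIASES = {
    "prompt": ("prompt", "instruction", "question"),
    "chosen": ("chosen", "response_chosen"),
    "rejected": ("rejected", "response_rejected"),
}


def pick(record: dict, field: str) -> str:
    """Return the first non-empty string found under any alias of `field`."""
    for key in ALIASES[field]:
        value = record.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
    raise KeyError(field)


def to_orpo_records(records):
    """Yield {'prompt', 'chosen', 'rejected'} dicts, dropping unusable rows."""
    for record in records:
        try:
            row = {field: pick(record, field) for field in ALIASES}
        except KeyError:
            continue  # a required field is missing or empty
        if row["chosen"] != row["rejected"]:  # identical pairs carry no preference signal
            yield row


# Made-up records illustrating the three cases.
raw = [
    {"instruction": "1+1은?", "chosen": "2입니다.", "rejected": "3입니다."},
    {"prompt": "수도는?", "chosen": "서울", "rejected": "서울"},  # identical pair
    {"prompt": "인사해줘", "chosen": "안녕하세요!"},  # no rejected answer
]
clean = list(to_orpo_records(raw))
print(f"kept {len(clean)} of {len(raw)} records")
```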
---

## 6. External Sources (Application Required)

| Source | Est. volume | Notes |
|------|--------|------|
| AI Hub (aihub.or.kr) | ~60-100GB | Specialized news, dialogue, medical, legal, and finance data; approval required, non-commercial use allowed |
| NIKL Modu Corpus (모두의 말뭉치) | ~35-50GB | Written and spoken corpora; application required, non-commercial research use |
| National Law Information Center (국가법령정보센터) | ~5-10GB | Crawlable (public data) |
| KCI academic papers | ~3-5GB | Paper abstracts; API available |

---

## 7. Download Execution Plan (in Priority Order)

```bash
cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang

# === Phase 1: Preference (enables ORPO right away, small download) ===
# try/except keeps one failed dataset from aborting the rest, as in Phases 2 and 3.
python3 -c "
from datasets import load_dataset
import os
out = 'data/preference'
os.makedirs(out, exist_ok=True)
for name in ['jojo0217/korean_rlhf_dataset', 'maywell/ko_Ultrafeedback_binarized', 'nayohan/preference-collection-ko-full', 'kuotient/orca-math-korean-dpo-pairs']:
    try:
        ds = load_dataset(name, split='train')
        ds.to_json(f'{out}/{name.replace(\"/\",\"_\")}.jsonl')
        print(f'✅ {name}: {len(ds)} samples')
    except Exception as e:
        print(f'❌ {name}: {e}')
" 2>&1 | tee /tmp/preference_dl.log &

# === Phase 2: SFT reinforcement (dialogue/math/code) ===
python3 -c "
from datasets import load_dataset
import os
out = 'data/sft_extra'
os.makedirs(out, exist_ok=True)
for name in ['kuotient/orca-math-word-problems-193k-korean','dbdu/ShareGPT-74k-ko','nayohan/Evol-Instruct-Code-80k-v1-ko','nlp-with-deeplearning/Ko.WizardLM_evol_instruct_V2_196k','FreedomIntelligence/alpaca-gpt4-korean']:
    try:
        ds = load_dataset(name, split='train')
        ds.to_json(f'{out}/{name.replace(\"/\",\"_\")}.jsonl')
        print(f'✅ {name}: {len(ds)}')
    except Exception as e:
        print(f'❌ {name}: {e}')
" 2>&1 | tee /tmp/sft_extra_dl.log &

# === Phase 3: Legal pretraining reinforcement ===
python3 -c "
from datasets import load_dataset
import os
out = 'data/korean_extra/korean_law'
os.makedirs(out, exist_ok=True)
for name in ['joonhok-exo-ai/korean_law_open_data_precedents','smhilee/korean-law-dataset','Rootpye/korean-lawdata2']:
    try:
        ds = load_dataset(name, split='train')
        ds.to_json(f'{out}/{name.replace(\"/\",\"_\")}.jsonl')
        print(f'✅ {name}: {len(ds)}')
    except Exception as e:
        print(f'❌ {name}: {e}')
" 2>&1 | tee /tmp/law_dl.log &

# === Phase 4: Large pretraining corpora (long-running, run in the background) ===
# mc4 Korean (~50GB)
# python3 -c "from datasets import load_dataset; ds = load_dataset('mc4', 'ko', split='train'); ds.save_to_disk('data/korean_extra/mc4_ko')"
# KORMo Web Collection
# huggingface-cli download KORMo-Team/korean-web-collection --repo-type dataset --local-dir data/korean_extra/korean_web_collection
```

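The three inline scripts in the plan repeat the same load, save, and report loop. As an alternative, the loop could live in one reusable helper. This is only a sketch under stated assumptions: `safe_name`, `download_all`, and `hf_loader` are hypothetical names, the loader is injected so the logic can be exercised without network access, and skipping files that already exist is an added convenience the original scripts do not have.

```python
import os
import tempfile


def safe_name(repo_id: str) -> str:
    """Flatten 'org/name' into a file stem, mirroring the inline scripts."""
    return repo_id.replace("/", "_")


def download_all(names, out_dir, loader, skip_existing=True):
    """Save each dataset as JSONL and return {repo_id: status string}."""
    os.makedirs(out_dir, exist_ok=True)
    status = {}
    for name in names:
        path = os.path.join(out_dir, f"{safe_name(name)}.jsonl")
        if skip_existing and os.path.exists(path):
            status[name] = "skipped"
            continue
        try:
            loader(name).to_json(path)
            status[name] = "saved"
        except Exception as exc:  # keep going if one dataset fails
            status[name] = f"failed: {exc}"
    return status


def hf_loader(name):
    """Real loader: requires the `datasets` package and network access."""
    from datasets import load_dataset
    return load_dataset(name, split="train")


# Offline demonstration with a stand-in loader (no Hugging Face access needed).
class FakeDataset:
    def to_json(self, path):
        with open(path, "w", encoding="utf-8") as f:
            f.write('{"text": "demo"}\n')


def fake_loader(name):
    if "missing" in name:
        raise FileNotFoundError(name)
    return FakeDataset()


with tempfile.TemporaryDirectory() as tmp:
    first = download_all(["org/a", "org/missing"], tmp, fake_loader)
    second = download_all(["org/a"], tmp, fake_loader)
print(first)
print(second)
```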
---

## 8. Projected Data Composition After Additions

| Category | Current | After additions | Notes |
|---------|---------|---------|------|
| Korean pretraining | ~39B tokens (tokenized) | ~60-80B tokens | With mc4 + KORMo + legal data added |
| SFT | 162K samples | ~757K samples | After adding the 5 datasets |
| Preference | 0 | ~260K pairs | jojo + maywell + nayohan |
| Code/English | ~0.6B tokens | ~10B tokens | RedPajama github+arxiv |
| Legal | 0 | ~1-2B tokens | Legal bundle |

**Chinchilla minimum (60B) is reachable** ✅

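The 60B figure can be related to the widely quoted Chinchilla rule of thumb of roughly 20 training tokens per model parameter. The report does not state the target model size, so the 3B-parameter value below is an assumption inferred from the 60B target, not a confirmed specification. Under that assumption, the sketch compares the current and projected token counts against the target.

```python
TOKENS_PER_PARAM = 20  # assumption: common Chinchilla rule of thumb


def chinchilla_tokens(n_params: float) -> float:
    """Approximate compute-optimal training tokens for a model of n_params."""
    return TOKENS_PER_PARAM * n_params


ASSUMED_PARAMS = 3e9  # assumption: inferred from the 60B target, not stated in the report
target = chinchilla_tokens(ASSUMED_PARAMS)

scenarios = {
    "current (tokenized)": 39e9,
    "projected (low)": 60e9,
    "projected (high)": 80e9,
}
coverage = {label: tokens / target for label, tokens in scenarios.items()}
for label, fraction in coverage.items():
    print(f"{label}: {fraction:.2f}x of the {target / 1e9:.0f}B target")
```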
---

_Report saved to: `/PROJECT/0325120031_A/ghong/taketimes/llm-bang/eval/data_inventory/`_