3B Pretraining Data Pipeline
Date written: 2026-02-27
Project: /PROJECT/0325120031_A/ghong/taketimes/llm-bang/
1. Current Data Status
Tokenized (ready to use)
| File | Size | Tokens |
|---|---|---|
| korean_c4_train.bin | 15 GB | 7.56B |
| korean_namuwiki_train.bin | 2.1 GB | 1.08B |
| korean_wiki_train.bin | 500 MB | 262M |
| korean_train.bin (merge of the 3 above) | 17 GB | 9.0B |
| train.bin (English, other) | 1.2 GB | 606M |
Ready to use: ~9.6B tokens
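The sizes and token counts in the table work out to roughly two bytes per token, which suggests (the source does not say so) that the .bin files are flat arrays of 16-bit token IDs. Under that assumption, a file's token count can be sanity-checked from its size alone. `BYTES_PER_TOKEN` and both helper names are illustrative, not part of the project.

```python
import os

BYTES_PER_TOKEN = 2  # assumption: token IDs stored as uint16; not stated in the source


def estimate_tokens(path: str, bytes_per_token: int = BYTES_PER_TOKEN) -> int:
    """Estimate the token count of a flat binary token file from its size."""
    return os.path.getsize(path) // bytes_per_token


def format_count(n: int) -> str:
    """Render a token count in the style the table uses (e.g. 7.56B, 262M)."""
    if n >= 1_000_000_000:
        return f"{n / 1e9:.2f}B"
    return f"{n / 1e6:.0f}M"
```

If the estimate for a file disagrees noticeably with the table, the two-byte assumption (or the presence of a file header) is the first thing to check.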
Untokenized raw sources (korean_extra/)
| Source | Size on disk | Format | Est. tokens | Notes |
|---|---|---|---|---|
| CulturaX (ko) | 60 GB (32 parquet) | parquet, text column | ~11.6B | High-quality web + mC4, deduplicated |
| cc100 (ko) | 14 GB xz → 54 GB text | xz-compressed text | ~13.5B | Line-delimited text |
| HPLT (en-ko) | 23 GB (193 parquet) | parallel corpus, tgt_doc.sentences | ~3.7B | Only the Korean side needs to be extracted |
| OSCAR (ko) | 9.2 GB (27 parquet) | nested structure, text[].text | ~2.0B | Web text |
| korean_webtext | 4.2 GB (18 parquet) | parquet | ~1.5B | Korean web |
| korean_textbooks | 6.4 GB | MMLU-style parquet | ~0.5B | Textbooks/exams (structured) |
| finepdfs_edu_ko | 2.9 GB (parquet) | parquet | ~0.8B | Educational PDFs |
| namuwiki_extracted | 2.2 GB | text | ~0.6B | May already overlap with namuwiki_train.bin |
| kovast | 449 MB | - | ~0.1B | Small volume |
| korean_safe_conv | 51 MB | jsonl | ~15M | Conversation data |
| evol_instruct_ko | 144 MB | - | ~40M | For SFT; unsuitable for pretraining |
Estimated untokenized total: ~34B tokens
Grand total: ~43B tokens (before deduplication)
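Because the sources store their text in different shapes, a small normalizing function keeps the rest of the pipeline source-agnostic. The sketch below assumes records have already been loaded as Python dicts (for example by a parquet reader) and that the field shapes match the Format column above. The exact schemas, the source keys, and the function name are assumptions to verify against the real files.

```python
from typing import Any, Dict


def extract_text(source: str, record: Dict[str, Any]) -> str:
    """Return the plain text of one record, by source.

    The field shapes below are assumptions read off the Format column;
    check them against the real parquet schemas before relying on this.
    """
    if source == "oscar":
        # Assumed shape: {"text": [{"text": "..."}, ...]}
        return "\n".join(part["text"] for part in record["text"])
    if source == "hplt":
        # Assumed shape: {"tgt_doc": {"sentences": [...] or "..."}}
        sentences = record["tgt_doc"]["sentences"]
        if isinstance(sentences, str):
            return sentences
        return "\n".join(sentences)
    # Default (CulturaX, korean_webtext, ...): a flat "text" column.
    return record["text"]
```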
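2. Training Targets for the 3B Model
Token targets based on the Chinchilla scaling law:
- Minimum: 3B × 20 = 60B tokens (about 20 tokens per parameter)
- Optimal: 3B × 50 = 150B tokens
- Realistic target: ~35B tokens of available data (after deduplication) → **2-3 epochs of repetition for 60-100B tokens**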
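The arithmetic behind these targets can be checked in a few lines. This is only a worked calculation using the figures above; the function and constant names are illustrative.

```python
def token_target(params: float, tokens_per_param: float) -> float:
    """Training-token target for a model with the given parameter count."""
    return params * tokens_per_param


def epochs_needed(target_tokens: float, available_tokens: float) -> float:
    """How many passes over the available data reach the target."""
    return target_tokens / available_tokens


PARAMS = 3e9       # 3B parameters
AVAILABLE = 35e9   # ~35B tokens of available data, as estimated above

MINIMUM = token_target(PARAMS, 20)   # 60B tokens
OPTIMAL = token_target(PARAMS, 50)   # 150B tokens
```

With these numbers, `epochs_needed(MINIMUM, AVAILABLE)` is about 1.71 and `epochs_needed(OPTIMAL, AVAILABLE)` is about 4.29, so two to three passes clear the minimum but stay well short of the 150B figure.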
Data mixing strategy
| Category | Sources | Share | Rationale |
|---|---|---|---|
| High-quality web | CulturaX | 35% | Already deduplicated, high quality |
| Large-scale web | cc100 + mC4 (existing) | 35% | Volume |
| Encyclopedia | Wikipedia + Namuwiki | 10% | Factual knowledge |
| Auxiliary web | OSCAR + korean_webtext + HPLT | 15% | Diversity |
| Specialized domains | textbooks + finepdfs | 5% | Educational quality |
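To see what these shares imply, the sketch below converts them into per-category token budgets and into the number of passes each category would need. The available-token figures are summed from the estimates in section 1. Grouping the files into categories this way, and treating "mC4 (existing)" as korean_c4_train.bin, are assumptions, as are all names in the code.

```python
from typing import Dict

# Shares from the mixing table.
MIX: Dict[str, float] = {
    "high_quality_web": 0.35,
    "large_scale_web": 0.35,
    "encyclopedia": 0.10,
    "auxiliary_web": 0.15,
    "specialized": 0.05,
}

# Estimated tokens per category (assumed grouping of the section 1 estimates).
AVAILABLE: Dict[str, float] = {
    "high_quality_web": 11.6e9,               # CulturaX
    "large_scale_web": 13.5e9 + 7.56e9,       # cc100 + korean_c4_train.bin
    "encyclopedia": 0.262e9 + 1.08e9,         # wiki + namuwiki
    "auxiliary_web": 2.0e9 + 1.5e9 + 3.7e9,   # OSCAR + korean_webtext + HPLT
    "specialized": 0.5e9 + 0.8e9,             # textbooks + finepdfs
}


def token_budgets(total_tokens: float, mix: Dict[str, float]) -> Dict[str, float]:
    """Split a total training budget across categories by share."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mixing shares must sum to 1")
    return {name: total_tokens * share for name, share in mix.items()}


def implied_epochs(total_tokens: float, mix: Dict[str, float],
                   available: Dict[str, float]) -> Dict[str, float]:
    """Passes over each category's data needed to fill its budget."""
    budgets = token_budgets(total_tokens, mix)
    return {name: budgets[name] / available[name] for name in mix}
```

At a 60B-token budget these estimates put the encyclopedia category at roughly 4.5 passes over its data, while large-scale web stays just under one pass. That is worth knowing before fixing the shares.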
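3. Data Quality Filtering Plan
Phase 1: Basic filters (fast, 1-2 hours)
- Language filter: use langdetect to remove documents that are less than 50% Korean
  - HPLT: a parallel corpus, so extracting the Korean side is enough
  - cc100: already Korean, but check for mixed-language text
- Length filter: remove documents shorter than 50 characters
- Repeated-line removal: remove documents in which the same line repeats 5 or more times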
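The three Phase 1 filters can be sketched as one predicate. To keep the example standard-library only, it swaps in a Hangul character ratio for the language check instead of langdetect. That is a stand-in for illustration, not the tool the plan names. The constant and function names are illustrative.

```python
from collections import Counter

MIN_CHARS = 50            # length filter threshold from the plan
MAX_LINE_REPEATS = 5      # a line seen this many times marks the document as repetitive
MIN_KOREAN_RATIO = 0.5    # language filter threshold from the plan


def korean_ratio(text: str) -> float:
    """Share of alphabetic characters that are Hangul syllables (U+AC00..U+D7A3)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hangul = sum(1 for ch in letters if "\uac00" <= ch <= "\ud7a3")
    return hangul / len(letters)


def has_repeated_line(text: str, limit: int = MAX_LINE_REPEATS) -> bool:
    """True if any non-empty line occurs `limit` or more times."""
    counts = Counter(line.strip() for line in text.splitlines() if line.strip())
    return any(n >= limit for n in counts.values())


def keep_document(text: str) -> bool:
    """Apply the length, repeated-line, and language filters in that order."""
    if len(text) < MIN_CHARS:
        return False
    if has_repeated_line(text):
        return False
    return korean_ratio(text) >= MIN_KOREAN_RATIO
```

The cheap checks run first so that most rejected documents never reach the language test, which would be the slowest step once a real detector is used.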
Phase 2: MinHash deduplication (4-8 hours)
Tool: datasketch (pip install datasketch)
Method: 5-gram MinHash, 128 permutations
Threshold: Jaccard > 0.8 → duplicate
Expected removal rate: 15-25% (mainly between cc100 and CulturaX)
- CulturaX is already deduplicated internally → check only cross-source duplicates against the other sources
- ~4 hours expected with 72-core parallel processing
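To make the method concrete, here is a minimal pure-Python MinHash using the parameters above (5-grams, 128 hash functions, threshold 0.8). It is a teaching sketch, not the datasketch implementation. It uses character 5-grams because the plan does not say whether its 5-grams are characters or words, and it compares documents pairwise. A real run at this scale would need an LSH index, such as the one datasketch provides, to avoid comparing every pair.

```python
import hashlib
from typing import List, Set

NUM_PERM = 128    # number of hash functions ("permutations")
NGRAM = 5         # shingle length; character-level is an assumption
THRESHOLD = 0.8   # estimated Jaccard above this counts as a duplicate


def shingles(text: str, n: int = NGRAM) -> Set[str]:
    """Set of character n-grams after whitespace normalization."""
    text = " ".join(text.split())
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def _hash(seed: int, shingle: str) -> int:
    """64-bit hash of a shingle, salted so each seed acts as a separate hash function."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def signature(text: str, num_perm: int = NUM_PERM) -> List[int]:
    """MinHash signature: the minimum hash value under each hash function."""
    grams = shingles(text)
    if not grams:
        raise ValueError("cannot sign an empty document")
    return [min(_hash(seed, g) for g in grams) for seed in range(num_perm)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def is_duplicate(text_a: str, text_b: str) -> bool:
    return estimated_jaccard(signature(text_a), signature(text_b)) > THRESHOLD
```

Since CulturaX is already deduplicated internally, one option is to index only the CulturaX signatures and query the other sources against them, which matches the cross-source check described above.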
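Phase 3: Perplexity filter (optional, 12-24 hours)
- Compute the perplexity of each document with the current 1B model
- Remove the bottom 5% (too easy = templates/repetition) and the top 5% (noise)
- Recommendation: apply this for the second training run, after the first 3B run (saves time)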
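The percentile cut itself is simple once per-document scores exist. The sketch below assumes the 1B model has already produced natural-log token probabilities for each document. The model call is left out, and the function names and document IDs are illustrative.

```python
import math
from typing import Dict, List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) over a document's tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def keep_middle(scores: Dict[str, float], low: float = 0.05,
                high: float = 0.05) -> List[str]:
    """Return document IDs after dropping the lowest and highest perplexity tails."""
    ranked = sorted(scores, key=scores.get)
    drop_low = int(len(ranked) * low)
    drop_high = int(len(ranked) * high)
    return ranked[drop_low:len(ranked) - drop_high]
```

Ranking needs every score before anything can be cut, so at corpus scale the thresholds would more likely be estimated from a sample and then applied in a streaming pass.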
4. Tokenization Pipeline
Priority and estimated time (on 72 cores)
| Priority | Source | Est. time | Rationale |
|---|---|---|---|
| 1 | CulturaX | 3-4 hours | 60 GB parquet; largest and highest quality |
| 2 | cc100 | 2-3 hours | xz decompression 30 min + tokenization 2 hours |
| 3 | OSCAR | 1 hour | 9.2 GB; nested structure needs parsing |
| 4 | korean_webtext | 30 min | 4.2 GB |
| 5 | HPLT (Korean) | 1-2 hours | Korean extraction + tokenization |
| 6 | textbooks + finepdfs | 30 min | Small volume |
Total estimate: 8-12 hours (with parallel processing)
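Each row of this table ends with writing token IDs to a .bin file. A minimal writer might look like the sketch below. It assumes 16-bit token IDs, which is consistent with the file sizes in section 1 but not stated in the source, and an end-of-document separator token. It takes the tokenizer as a plain `encode` callable because the source does not name one. All names are hypothetical.

```python
from array import array
from typing import Callable, Iterable, List


def write_token_bin(docs: Iterable[str],
                    encode: Callable[[str], List[int]],
                    out_path: str,
                    eos_id: int) -> int:
    """Tokenize documents and append their IDs to a flat uint16 file.

    Returns the number of tokens written, separators included.
    """
    total = 0
    with open(out_path, "wb") as f:
        for doc in docs:
            # "H" is an unsigned 16-bit integer; IDs above 65535 raise OverflowError.
            ids = array("H", encode(doc))
            ids.append(eos_id)  # assumed document separator
            ids.tofile(f)
            total += len(ids)
    return total
```

Writing one document at a time keeps memory flat, so a per-file worker like this can be fanned out across the 72 cores with one output shard per worker.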
5. Timeline
Minimal start option (immediate, 0 hours)
- 3B training can start right away on the existing korean_train.bin (9B tokens)
- One epoch of it is not enough, but it suffices for validating the training code and for early training
Phase A: Fast expansion (8-12 hours)
- Tokenize CulturaX (3-4 hours)
- Decompress and tokenize cc100 (2-3 hours; can run in parallel with CulturaX)
- Tokenize the remaining sources (2 hours)
- Merge → ~35B tokens
Phase B: Quality filtering (an additional 4-8 hours)
- Basic filters (1-2 hours)
- MinHash deduplication (4-8 hours)
- Final merge → ~28-30B tokens (clean data)
Phase C: Training
- 30B tokens × 2 epochs = 60B tokens (Chinchilla minimum)
- Or 30B tokens × 3 epochs = 90B tokens (safer margin)
6. File Structure (final)
data/
├── korean_train.bin       # existing 9B (c4+wiki+namuwiki)
├── korean_val.bin
├── culturax_train.bin     # ~11.6B
├── culturax_val.bin
├── cc100_train.bin        # ~13.5B
├── cc100_val.bin
├── oscar_train.bin        # ~2B
├── oscar_val.bin
├── webtext_train.bin      # ~1.5B
├── webtext_val.bin
├── hplt_ko_train.bin      # ~3.7B
├── hplt_ko_val.bin
├── extra_train.bin        # textbooks + finepdfs + kovast
├── extra_val.bin
├── merged_3b_train.bin    # full merge (~35B)
└── merged_3b_val.bin
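If the per-source files are headerless arrays of 16-bit token IDs (an assumption, as above), producing a merged file is plain byte concatenation. The sketch below shows that step. The function name and `BYTES_PER_TOKEN` are illustrative.

```python
import os
import shutil
from typing import Iterable

BYTES_PER_TOKEN = 2  # assumption: uint16 token IDs, no file header


def merge_bins(paths: Iterable[str], out_path: str) -> int:
    """Concatenate token files in the given order; return the merged token count."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                # Copy in 16 MiB chunks so large files never sit fully in memory.
                shutil.copyfileobj(src, out, 16 * 1024 * 1024)
    return os.path.getsize(out_path) // BYTES_PER_TOKEN
```

Concatenation keeps each source contiguous and in its natural proportion, so the mixing shares from section 2 would still have to be enforced separately, for example by sampling from the per-source files at training time.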
7. Summary
| Item | Value |
|---|---|
| Tokens ready to use | 9.6B |
| After additional tokenization | ~35B (before deduplication) |
| Expected after deduplication | ~28-30B |
| Chinchilla minimum for 3B | 60B (2 epochs) |
| Minimum data preparation time | 0 hours (use existing data) |
| Full pipeline completion | 12-20 hours |
| Recommended way to start training | Start immediately on the existing 9B + prepare CulturaX in parallel |