# SFT Hyperparameter Analysis & Survey of Next Tuning Options
> Created: 2026-02-26
> Model: korean_1b_sft (1.19B params, base: korean_1b_fp8_run1/checkpoint-0034000)
> Training: 5000 steps, 39 minutes, 8× B200
---
## 1. Loss Curve Analysis
### 1-1. Basic Statistics
| Segment | Steps | n | Loss Mean | Loss Stdev | Loss Min | Loss Max | GNorm Mean |
|------|-------|---|-----------|------------|----------|----------|------------|
| Warmup | 10–150 | 15 | 2.3100 | 0.1144 | 2.1129 | 2.5229 | 1.414 |
| Post-warmup (all) | 160–5000 | 485 | 1.9984 | 0.0942 | 1.7305 | 2.3413 | 1.133 |
| Q1 (early) | 160–1360 | 121 | 2.0698 | 0.0860 | 1.8850 | 2.3413 | 1.138 |
| Q2 (middle 1) | 1370–2570 | 121 | 1.9915 | 0.0801 | 1.7960 | 2.2088 | 1.131 |
| Q3 (middle 2) | 2580–3780 | 121 | 1.9583 | 0.0870 | 1.7384 | 2.1293 | 1.119 |
| Q4 (late) | 3790–5000 | 122 | **1.9739** | 0.0835 | 1.7305 | 2.1635 | 1.142 |
### 1-2. Moving-Average Loss at 500-Step Intervals (±50-step window)
| Step | Loss(avg) | GNorm(avg) | Interpretation |
|------|-----------|------------|------|
| ~500 | 2.0658 | 1.098 | Initial descent phase |
| ~1000 | 2.0281 | 1.121 | Rapid descent continues |
| ~1500 | 1.9663 | 1.092 | ✅ First drop below 2.0 |
| ~2000 | 1.9802 | 1.158 | Slight rebound (normal) |
| ~2500 | 1.9882 | 1.140 | Stabilization begins |
| ~3000 | 1.9628 | 1.083 | Near the minimum |
| ~3500 | 1.9668 | 1.151 | Convergence signal |
| ~4000 | 1.9679 | 1.161 | Entering a plateau |
| ~4500 | 1.9555 | 1.142 | Slight descent continues |
| ~5000 | 1.9718 | 1.195 | **Final: 1.9677** |
### 1-3. Interpretation
**Warmup phase (steps 10–150):**
- While the LR rises linearly from 1.33e-6 to 2e-5, the loss moves irregularly within 2.11–2.52
- A loss spike (2.34, 3.6σ) occurs at step 160, right after warmup: the shock of the full LR immediately after warmup ends. This is a normal and common pattern
- Warmup of 150 steps is 3% of the 5000 total steps → appropriate

**Steady training phase (steps 160–5000):**
- Loss falls steadily from 2.07 to 1.96 over Q1→Q3 (a total decrease of 0.11)
- Q3→Q4 goes from 1.958 to 1.974, a **slight increase**: the cosine LR has decayed far enough that learning slows, which is a sign of convergence
- A standard deviation of 0.094 is stable (0.05–0.15 is the normal range for SFT)
**Outlier analysis:**
- Above Mean+2σ = 2.187: 10 of 485 = **2.1%** → a normal level
- Nearly all are concentrated in the early phase (steps 160–800), plus one at step 2190: normal variation caused by data diversity
- They are not accompanied by gnorm spikes, so there is no gradient explosion
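
The outlier count above comes from a mean + 2σ threshold over the post-warmup losses. A minimal sketch of that check, assuming the training log has already been parsed into `(step, loss)` pairs; `summarize_losses` and the demo values are hypothetical and not taken from the actual log, and whether the report used the sample or population standard deviation is not stated:

```python
import statistics

def summarize_losses(records, warmup_end=150, n_sigma=2.0):
    """Return mean, stdev, threshold and outlier steps for post-warmup losses.

    records: iterable of (step, loss) pairs parsed from a training log.
    """
    post = [(step, loss) for step, loss in records if step > warmup_end]
    losses = [loss for _, loss in post]
    mean = statistics.fmean(losses)
    stdev = statistics.stdev(losses)  # sample standard deviation (an assumption)
    threshold = mean + n_sigma * stdev
    outliers = [step for step, loss in post if loss > threshold]
    return {"mean": mean, "stdev": stdev, "threshold": threshold, "outliers": outliers}

# Illustrative data only: a flat curve logged every 10 steps with one spike at step 160.
demo = [(step, 2.0) for step in range(10, 510, 10)]
demo[15] = (160, 2.6)
summary = summarize_losses(demo)
print(summary["outliers"])
```

Plugging the reported post-warmup statistics into the same rule gives 1.9984 + 2 × 0.0942 ≈ 2.187, the threshold quoted above.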
**GNorm pattern:**
- Overall mean is 1.13; max_grad_norm is set to 1.0, yet logged values range over 0.89–1.53
- The logged gnorm is presumably the value **before** clipping; clipping occurs whenever it actually exceeds 1.0
- The warmup phase (mean 1.41) is higher than the rest of training (mean 1.13), which is a normal pattern
- After warmup the gnorm settles and stays roughly flat at 1.12–1.14 across Q1–Q4 (a stable gnorm suggests training is converging)

**Key conclusion:** Training proceeded in a healthy way. Convergence signals appear after step ~3000, but the loss is still edging down. Stopping at 5000 steps was either a reasonable stopping point or leaves room for further training.
---
## 2. Hyperparameter Impact Analysis
### 2-1. Learning Rate: **2e-5** → ✅ Appropriate (within the industry-standard range)
| Model/Framework | LR | Scale |
|---|---|---|
| Stanford Alpaca (Llama 7B) | 2e-5 | 7B |
| WizardLM (Vicuna 13B) | 2e-5 | 13B |
| OpenHermes (Mistral 7B) | 2e-5 | 7B |
| LIMA (65B) | 1e-5 | 65B |
| TinyLlama SFT (1.1B) | 2e-5 | 1.1B |
| **Current setting** | **2e-5** | **1.2B** |

- At the 1B scale, 2e-5 matches the industry-standard value exactly
- It is set to 1/10 of the pretraining LR (2e-4) → satisfies the usual rule for avoiding catastrophic forgetting
- For additional epochs, however, lowering it to 1e-5 is safer

**Improvement direction:** Keep the current setting. Use 1e-5 for a second training round.
### 2-2. Cosine Decay Schedule → ✅ Appropriate (though the final LR is slightly high)
- Final LR: 2.00e-6 (10% of peak)
- Standard cosine schedule: min_lr = 0.1 × peak_lr
- Configuration matched to 5000 steps: warmup 150 + cosine decay 4850 steps
- At step 5000 the LR settles naturally at 2e-6 → training feels properly wound down

**Improvement direction:** Lowering min_lr to 0 or 1e-7 could make convergence in the final stretch more stable. The current setting is also acceptable.
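
The schedule described in this subsection can be written as one function. This is a sketch that assumes the usual linear-warmup plus cosine-decay formula; the training code's actual scheduler is not shown in this report, and `lr_at` is a hypothetical name. With the reported settings it yields 1.33e-6 at step 10 and 2e-6 at step 5000, matching the figures above.

```python
import math

def lr_at(step, peak_lr=2e-5, min_lr=2e-6, warmup_steps=150, total_steps=5000):
    """Linear warmup to peak_lr, then cosine decay down to min_lr at total_steps."""
    if step <= warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

for step in (10, 150, 2575, 5000):
    print(step, f"{lr_at(step):.3e}")
```

Setting `min_lr=0.0` in the same function shows what the suggested lower floor would do to the last few hundred steps.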
### 2-3. Effective Batch Size: **64 sequences** (≤262K tokens/step) → ✅ Appropriate
- 64 seqs × ~500 tokens on average (dynamic padding) ≈ 32,000 tokens/step of actual throughput
- With max_seq_len=4096 the theoretical maximum is 262,144 tok/step, but dynamic padding keeps the real figure well below that
- SFT batch sizes for reference: Alpaca=128 seqs, WizardLM=64 seqs, LIMA=64 seqs
- **64 matches the industry-standard value exactly**

**Improvement direction:** Keep the current setting. An overly large batch can hurt generalization.
### 2-4. Epochs: **~2 epochs** → ⚠️ Possibly insufficient (but safe)
- 5000 steps × 64 seqs = 320,000 examples processed / 159,000 samples = **about 2.0 epochs**
- Industry reference points for SFT:
  - LIMA: 15 epochs (small dataset, 1K examples)
  - Alpaca, WizardLM: **3 epochs**
  - OpenHermes, Hermes: 3–5 epochs
  - Large datasets (>100K): 1–3 epochs
- 2 epochs carries a **risk of undertraining** (in particular, low-frequency data patterns may not be learned well)
- The Q4 loss (1.974) being slightly higher than Q3 (1.958) is consistent with both the cosine LR decay effect and the possibility that training has not yet converged
- With no val loss, overfitting cannot be checked (eval_interval=100 was configured, but no results were produced)

**Improvement direction:** Run an additional experiment at 3–4 epochs (7500–10000 steps), but only after a val split has been secured.
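
The epoch and token figures in sections 2-3 and 2-4 are simple arithmetic on the reported batch size and dataset size. A small sketch that reproduces them; the function names are hypothetical, and the 159,000-sample and 500-token averages are the report's own round numbers:

```python
import math

def epochs_seen(steps, batch_seqs=64, dataset_size=159_000):
    """Number of passes over the dataset after `steps` optimizer steps."""
    return steps * batch_seqs / dataset_size

def steps_for_epochs(epochs, batch_seqs=64, dataset_size=159_000):
    """Optimizer steps needed to cover `epochs` full passes (rounded up)."""
    return math.ceil(epochs * dataset_size / batch_seqs)

def tokens_per_step(avg_len, batch_seqs=64):
    """Tokens processed per step at a given average sequence length."""
    return batch_seqs * avg_len

print(round(epochs_seen(5000), 2))
print(steps_for_epochs(3), steps_for_epochs(4))
print(tokens_per_step(500), tokens_per_step(4096))
```

The 3- and 4-epoch step counts land just under the rounded 7500–10000 range quoted above.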
### 2-5. NEFTune alpha=10 → ✅ Suitable for this dataset size
- Guideline attributed to the original paper (Jain et al., 2023): small (<10K) → 5, medium (10K–500K) → 10, large (>500K) → 15
- 159K samples → **alpha=10 is suitable**
- Noise magnitude = alpha / sqrt(seq_len × d_model) = 10 / sqrt(500 × 2048) ≈ 0.0099
- A reasonable noise ratio relative to actual embedding values
- Judging by the stability of the loss curve (stdev 0.094), NEFTune did not destabilize training

**Improvement direction:** Keep the current setting. Consider raising alpha to 15 if the data grows to 500K+.
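
The noise magnitude quoted above can be checked directly. The sketch below follows the NEFTune recipe of adding uniform noise in [-1, 1] scaled by alpha / sqrt(seq_len × d_model) to the input embeddings; it uses plain Python lists for illustration, and the function names are hypothetical rather than part of the training code:

```python
import math
import random

def neftune_scale(alpha, seq_len, d_model):
    """Per-element noise bound used by NEFTune for one sequence."""
    return alpha / math.sqrt(seq_len * d_model)

def add_neftune_noise(embeddings, alpha, rng=random):
    """embeddings: list of seq_len rows, each a list of d_model floats."""
    seq_len, d_model = len(embeddings), len(embeddings[0])
    scale = neftune_scale(alpha, seq_len, d_model)
    return [[x + scale * rng.uniform(-1.0, 1.0) for x in row] for row in embeddings]

# Reported setting: alpha=10, ~500 tokens per sequence, d_model=2048.
print(round(neftune_scale(10, 500, 2048), 4))
```

Because the scale shrinks with sequence length, short sequences receive proportionally larger noise than the 500-token average used here.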
### 2-6. max_seq_len: **4096** → ✅ Appropriate (but utilization should be checked)
- Setting: max_seq_len=4096 with dynamic padding
- Average length of Korean instruction data: 200–1000 tokens (based on kullm/KoAlpaca)
- Thanks to dynamic padding, short sequences are not actually padded out to 4096 → compute-efficient
- rope_theta=500000 (Llama-3 style) → also supports extrapolation beyond 4096

**Potential issues:**
- If the dataset contains conversations longer than 4096 tokens, truncation occurs → long multi-turn conversations are lost
- The current datasets (kullm, KoAlpaca, LIMA, etc.) are mostly under 2048 tokens, so the practical impact is small

**Improvement direction:** Keep the current setting. Consider 8192 if long-form conversation data is added.
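
Whether truncation matters can be measured rather than assumed. A minimal sketch, assuming per-example token counts have already been computed with the model's tokenizer; `truncation_report` and the sample lengths are hypothetical:

```python
def truncation_report(token_lengths, max_seq_len=4096):
    """Share of examples and of tokens that a max_seq_len cutoff would affect."""
    n = len(token_lengths)
    truncated = sum(1 for length in token_lengths if length > max_seq_len)
    total_tokens = sum(token_lengths)
    kept_tokens = sum(min(length, max_seq_len) for length in token_lengths)
    return {
        "truncated_fraction": truncated / n,
        "token_loss_fraction": 1 - kept_tokens / total_tokens,
        "mean_length": total_tokens / n,
    }

# Hypothetical token counts, not measured from the actual dataset.
lengths = [200, 500, 1000, 2048, 6000]
report = truncation_report(lengths)
print(report)
```

Running the same report with `max_seq_len=8192` would show how much a longer context actually recovers before paying for it.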
---
## 3. Candidate Options for the Next Tuning Round
### A. Additional SFT Epochs (5000 → 10000 steps, up to epoch 4)
**Pros:**
- The loss is still trending down, so there is room for further training
- 3–4 epochs is the SFT industry norm (per Alpaca and WizardLM)
- Training can resume from the existing checkpoint; about 39 more minutes is enough (at B200 speed)
- Implementation: `--resume checkpoints/korean_1b_sft/checkpoint-5000 --max_steps 10000`

**Cons:**
- Without val loss, overfitting cannot be detected
- The cosine schedule was designed around step 5000 → the LR schedule must be reset when resuming
- Overfitting risk beyond epoch 4 (especially memorization of repeated patterns)

**Recommendation:** ✅ **Conditionally recommended.** After securing a 5–10% val split, continue training with a new cosine schedule at LR=1e-5. A fresh run initialized from the checkpoint is preferable to resuming the old one.

**Concrete settings:**
```yaml
max_steps: 5000 # additional 5000 steps (epochs 3-4)
lr: 1.0e-5 # half of the previous value
warmup_steps: 50 # short warmup
```
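
The recommendation above depends on first holding out 5–10% of the data as a validation split. One way to do that reproducibly is to hash each record and route it by the hash value, so the split stays stable across reruns. This is a sketch under the assumption that the SFT data is stored as JSONL; the function names are hypothetical and the project's actual data loader is not shown in this report:

```python
import hashlib
import json

def is_validation(key, val_fraction=0.05):
    """Deterministically assign a record to the validation split by hashing its key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # roughly uniform in [0, 1)
    return bucket < val_fraction

def split_jsonl(in_path, train_path, val_path, val_fraction=0.05):
    """Stream a JSONL file into train/val files; returns (n_train, n_val)."""
    n_train = n_val = 0
    with open(in_path, encoding="utf-8") as src, \
            open(train_path, "w", encoding="utf-8") as train, \
            open(val_path, "w", encoding="utf-8") as val:
        for line in src:
            line = line.strip()
            if not line:
                continue
            # Hash the canonicalized record so exact duplicates land in the same split.
            key = json.dumps(json.loads(line), sort_keys=True, ensure_ascii=False)
            if is_validation(key, val_fraction):
                val.write(line + "\n")
                n_val += 1
            else:
                train.write(line + "\n")
                n_train += 1
    return n_train, n_val
```

A call such as `split_jsonl("data/sft/all.jsonl", "data/sft/train.jsonl", "data/sft/val.jsonl")` would produce the split; the input and train paths are placeholders. Hashing the record content keeps exact duplicates from leaking across the split, though near-duplicates still can.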
---
### B. LR Tuning: 2e-5 vs 1e-5 vs 5e-6
| LR | Pros | Cons | Recommendation |
|----|------|------|------|
| 5e-6 | Very safe, guards against overfitting | Gains over 5000 steps may be small | ❌ Too conservative |
| **1e-5** | **Balanced choice, standard for a second training round** | Half the learning speed of the current setting | ✅ **Recommended** |
| 2e-5 (current) | Good results in the first round | Overfitting risk in additional epochs | ⚠️ Unfavorable for further training |

**Conclusion:** Use **lr=1e-5** for the second training round. The current lr=2e-5 is well suited to the first round.
---
### C. ORPO (Odds Ratio Preference Optimization)
**Overview:** Performs SFT and preference alignment together in a single stage. No reference model is needed.

**Pros:**
- No reference model, so memory is saved (about 40% less VRAM than DPO)
- SFT and preference are optimized together → alignment without degrading model quality
- 1-stage pipeline → simpler operations
- Easy to implement with the `trl` library

**Cons:**
- Requires chosen/rejected pair data (none available at present)
- Choices for Korean preference data are limited

**Korean preference data currently available (on HuggingFace):**
| Dataset | Samples | Characteristics |
|---------|---------|------|
| `maywell/ko_Ultrafeedback` | ~60K | Korean translation of UltraFeedback |
| `ChuGyouk/korean-ultrafeedback-armorm` | ~60K | Includes ArmoRM scores |
| `HAERAE-HUB/K2-Align` | ~10K | Korean RLHF alignment |
| `heegyu/KORANI-v1` | ~20K | Korean RANI (human feedback) |
| `trl-lib/ultrafeedback_binarized` | ~60K | English (translation required) |

**Recommendation:** ✅ **Recommended.** Obtain `maywell/ko_Ultrafeedback` or `ChuGyouk/korean-ultrafeedback-armorm`, then implement with TRL's `ORPOTrainer`. Both applying ORPO after SFT and running ORPO from scratch are possible.

**Implementation example:**
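```python
from trl import ORPOConfig, ORPOTrainer

# Assumes `model` and `preference_data` (prompt/chosen/rejected) are already loaded.
# Other arguments are omitted; "orpo_out" is a placeholder, and the tokenizer must also be passed (argument name depends on the TRL version).
config = ORPOConfig(output_dir="orpo_out", learning_rate=5e-7, num_train_epochs=1)
trainer = ORPOTrainer(model, config, train_dataset=preference_data)
```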
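
To make the "single stage, no reference model" point concrete, the ORPO objective for one preference pair can be written out by hand. The sketch below assumes the standard formulation (SFT loss on the chosen response plus a weighted odds-ratio term) and takes length-normalized log-probabilities as inputs; the weight `lam=0.1` and the example values are illustrative assumptions, not settings from this project:

```python
import math

def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp), the length-normalized sequence probability."""
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(chosen_avg_logp, rejected_avg_logp, lam=0.1):
    """SFT negative log-likelihood on the chosen response plus the odds-ratio penalty."""
    nll = -chosen_avg_logp
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    or_term = math.log1p(math.exp(-log_odds_ratio))  # equals -log(sigmoid(log_odds_ratio))
    return nll + lam * or_term

# Illustrative values: average per-token log-probabilities (must be negative).
print(orpo_loss(-0.5, -0.5))
print(orpo_loss(-0.5, -2.0))
```

Only the policy's own probabilities appear in the loss, which is why no frozen reference copy has to be kept in memory.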
---
### D. DPO (Direct Preference Optimization)
**Overview:** Adds preference alignment training on top of a finished SFT model. Requires a reference model (the frozen SFT model).

**vs ORPO:**
| | DPO | ORPO |
|--|-----|------|
| Reference model | Required (VRAM +40%) | Not required |
| SFT stage | Separate stage required | Can be integrated |
| Stability | Proven method | Relatively new |
| Data | chosen/rejected | chosen/rejected |
| Implementation complexity | Medium | Low |

**Pros:**
- The most widely validated preference optimization method
- Fully supported by the `trl` library
- Widely applied to major Llama- and Mistral-based models

**Cons:**
- Trains with the SFT model held as a reference → two model copies in memory (1.2B × 2, estimated at ~16GB; no problem on a 192GB B200)
- Complexity of a 2-stage training pipeline

**Recommendation:** ✅ **Recommended.** A more proven method than ORPO. No memory issues on 8× B200. Worth A/B testing against ORPO.
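
The role of the reference model is easiest to see in the DPO loss itself. A minimal sketch for a single preference pair, assuming summed sequence log-probabilities from the policy and from the frozen SFT reference are available; `beta=0.1` and the example numbers are illustrative assumptions:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid of the reward margin, with rewards measured relative to the reference."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    return math.log1p(math.exp(-margin))

# At initialization the policy equals the reference, so the margin is zero.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# After the policy shifts probability toward the chosen response, the loss drops.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

Every training step needs the two reference log-probabilities, which is where the extra forward pass and memory for the frozen copy come from.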
---
### E. LoRA/QLoRA
**Context:** Full fine-tuning is already done. What role is left for LoRA?

**Pros:**
- Fast hyperparameter experiments (combinations of LR, epochs, alpha): 3-5x faster than full FT
- Several adaptations can be managed side by side (domain-specific LoRA weights)
- Only the adapter needs to be trained in the DPO/ORPO stage
- Lower VRAM use → larger batch sizes are possible

**Cons:**
- A fully fine-tuned model already exists, so the ceiling on LoRA performance is ≤ full FT
- A 1B model is already small, so the benefit of QLoRA's 4-bit quantization is limited
- Fine-tuning quality generally favors full FT

**Recommendation:** ⚠️ **Conditionally recommended.** Use LoRA for hyperparameter exploration (LR grid search, epoch sweep). The final model should be full FT.

**Practical usage:**
```python
# Fast experiments: LR grid search with LoRA rank=64
# rank=64, alpha=128, dropout=0.05
# About 5-10 minutes per experiment (on B200)
```
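
How small a rank-64 adapter is relative to the 1.19B-parameter model can be estimated with a short calculation. The sketch uses d_model=2048 from section 2-5; the layer count and the number of adapted projections per layer are hypothetical placeholders, since the architecture details are not given in this report, and all adapted matrices are assumed square:

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one weight: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)

def lora_total(d_model, rank, n_layers, targets_per_layer):
    """Total adapter parameters, assuming every adapted matrix is d_model x d_model."""
    return n_layers * targets_per_layer * lora_params(d_model, d_model, rank)

per_matrix = lora_params(2048, 2048, 64)
# n_layers=24 and targets_per_layer=4 are assumed values for illustration only.
total = lora_total(2048, 64, n_layers=24, targets_per_layer=4)
print(per_matrix, total, f"{total / 1.19e9:.1%} of 1.19B")
```

Under these assumptions the adapter is on the order of a few percent of the full model, which is what makes quick grid searches cheap.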
---
### F. Data Quality Improvement
**Current data composition:**
- kullm: large-scale Korean instruction data (mixed quality)
- KoAlpaca: Korean translation of Alpaca (translation quality issues)
- safe_conv: safety conversation data
- LIMA: high-quality English instruction data (1000 examples)
- evol_instruct: GPT-4 generated (high quality)
- kovast: Korean conversations

**Improvement directions:**
1. **Deduplication (MinHash LSH):**
   - Locality-sensitive hashing over the instruction text
   - Expected duplicate removal rate: 5–15% (159K → roughly 135–150K)
   - Quality benefit: prevents memorization of duplicated patterns
2. **Quality Filtering:**
   - Perplexity-based filter: remove examples with very low or very high perplexity
   - Language check: verify the share of Korean (`langdetect`)
   - Length filter: remove responses that are too short (<50 tokens)
   - Repetition removal: based on an `n-gram repetition score`
3. **Domain Mixing adjustment:**
   - LIMA-style: a small amount of high-quality data is more effective than a large amount of low-quality data
   - Increase the share of evol_instruct ↑ (high quality because it is GPT-4 generated)
   - Decrease the share of plainly translated data (KoAlpaca) ↓

**Recommendation:** ✅ **Strongly recommended.** Data quality matters more than the number of epochs. A one-week investment is expected to yield a real performance gain.
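
The deduplication step can be prototyped without extra dependencies. The sketch below builds MinHash signatures over character 3-grams and drops any text whose estimated Jaccard similarity to an already kept text reaches the threshold. Note that it compares all pairs by brute force rather than using LSH banding, so it is suitable only for small samples; at 159K rows a real LSH index would be needed. All function names are hypothetical:

```python
import hashlib

def shingles(text, n=3):
    """Character n-grams after whitespace normalization; no tokenizer required."""
    text = " ".join(text.split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash_signature(text, num_perm=64):
    """One minimum hash value per seeded hash function."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(2, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(), "big")
            for g in grams))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup(texts, threshold=0.8):
    """Keep the first occurrence of each near-duplicate group (O(n^2) comparison)."""
    kept, kept_sigs = [], []
    for text in texts:
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(text)
            kept_sigs.append(sig)
    return kept

samples = [
    "Explain today's weather.",
    "Explain  today's weather.",  # same text with extra whitespace
    "Write a Fibonacci function in Python.",
]
print(dedup(samples))
```

The 0.8 threshold is an assumed starting point; it should be tuned by inspecting which pairs are actually removed.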
---
### G. More SFT Data (159K → 500K+)
**Datasets on HuggingFace that could be added:**
| Dataset | Samples | Language | Quality | Notes |
|---------|---------|------|------|------|
| `HAERAE-HUB/qarv-instruct-100k` | 100K | Korean | Medium-high | Korean instruction, 100K |
| `nayohan/llama3-instruct-ko-dataset` | 58K | Korean | High | Llama-3 instruction, Korean |
| `hPark/orca-ko` | 200K+ | Korean | High | Orca-style Korean |
| `maywell/synatra-orca` | 300K+ | Korean | High | Synthetic data, high quality |
| `FreedomIntelligence/evol-instruct-korean` | 70K | Korean | High | GPT-4 generated Korean |
| `Bingsu/ko_alpaca_data` | 52K | Korean | Medium | Alpaca Korean (translated) |
| `HAERAE-HUB/KoInstruct` | 50K+ | Korean | Medium-high | Korean instruction |
| `Open-Orca/OpenOrca` | 1M+ | English | Top | High-quality English (can be mixed into a Korean model) |

**Path to 500K:**
1. Currently 159K
2. Add `hPark/orca-ko` + `maywell/synatra-orca`: +200K = 359K
3. Add `HAERAE-HUB/qarv-instruct-100k` + `nayohan/llama3-instruct-ko-dataset`: +158K = 517K
4. Retention of ~80% after quality filtering → **about 400K net samples**

**Pros:**
- Broader domain coverage
- More opportunities to learn rare patterns
- Better generalization

**Cons:**
- Data quality must be verified (indiscriminate additions backfire)
- Longer training time (3× for the same number of epochs → 2+ hours)
- Trade-off between a small amount of high-quality data and a large amount of low-quality data

**Recommendation:** ✅ **Recommended (conditional on quality filtering).** Add high-quality synthetic data such as `hPark/orca-ko` or `maywell/synatra-orca` first. Watch the share of plainly translated data.
---
## 4. Top 3 Experiments That Can Be Run Immediately
### 🥇 Priority 1: **Comprehensive evaluation of the current model (run eval)**
**Why:**
- It is unknown whether a loss of 1.9677 actually corresponds to a good model
- A baseline is required before deciding the direction of further training
- `eval/comprehensive_eval.py` already exists

**Run now:**
```bash
cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang
# Perplexity evaluation
python eval/perplexity.py \
  --checkpoint checkpoints/korean_1b_sft/checkpoint-5000 \
  --data data/sft/val.jsonl  # requires a val split
# Quick generation-quality check
python eval/generate.py \
  --checkpoint checkpoints/korean_1b_sft/checkpoint-5000 \
  --prompts "안녕하세요, 저는 AI 모델입니다. 오늘 날씨에 대해 설명해주세요."
```
**Estimated time:** 10–30 minutes
---
### 🥈 Priority 2: **Additional SFT at lr=1e-5 (through epochs 3–4)**
**Why:**
- The loss curve has not yet converged, and 2 epochs falls short of the industry norm
- Minimal implementation cost (existing code is reused)
- About 40–60 additional minutes on 8× B200 (based on 39 minutes per 5000 steps)

**Concrete settings:**
```bash
# Start a new run from checkpoint-5000.
# MAX_STEPS=5000 covers epochs 3-4; WARMUP_STEPS=50 is a short warmup.
RUN_NAME=korean_1b_sft_v2 \
BASE_CHECKPOINT=checkpoints/korean_1b_sft/checkpoint-5000 \
LR=1.0e-5 \
MAX_STEPS=5000 \
WARMUP_STEPS=50 \
bash scripts/launch_sft.sh
```
**Caution:** Without a val split wired into the training run, held-out loss has to be checked by hand over steps 3000–5000 and the early-stop criterion set manually.

**Expected result:** loss of 1.90–1.93 (roughly a 2–4% improvement over the current 1.97), with a noticeable gain in generation quality expected.
---
### 🥉 Priority 3: **Data quality improvement + additional data collection**
**Why:**
- Data quality matters more in the long run than hyperparameter tuning
- The current data may contain duplicates and low-quality examples
- Preference data should be collected at the same time to prepare the ORPO/DPO pipeline

**Tasks that can start immediately:**
```python
# 1. Deduplication (MinHash)
# Install first (in a shell): pip install datasketch
# MinHash dedup on the instruction text, threshold=0.8

# 2. Download additional data
from datasets import load_dataset
ds = load_dataset("hPark/orca-ko")  # ~200K high-quality Korean
ds2 = load_dataset("maywell/synatra-orca")  # ~300K synthetic

# 3. Collect Korean preference data (ORPO/DPO preparation)
pref = load_dataset("maywell/ko_Ultrafeedback")  # ~60K preference pairs
```
**Estimated time:** 2–4 hours for data preparation; retraining follows once the additional configuration is in place.
---
## 5. Overall Assessment Summary
### Assessment of Current Settings
| Item | Value | Assessment | Notes |
|------|--------|------|------|
| Learning Rate | 2e-5 | ✅ Appropriate | Dead center of the industry standard |
| Cosine Decay | 5000 steps | ✅ Appropriate | min_lr ~10% |
| Warmup | 150 steps (3%) | ✅ Appropriate | Within the recommended 3-5% range |
| Effective Batch | 64 seqs | ✅ Appropriate | Industry standard |
| Epochs | ~2 | ⚠️ Possibly insufficient | 3 epochs is standard |
| NEFTune alpha | 10 | ✅ Appropriate | Fits 159K samples |
| max_seq_len | 4096 | ✅ Appropriate | Efficient thanks to dynamic padding |
| Weight Decay | 0.01 | ✅ Appropriate | 1/10 of pretraining (0.1) |
### Recommendation Priority by Option
| Option | Recommendation | Reason |
|------|------|------|
| A. Additional SFT (epoch 4) | ✅ High | Too few epochs, can run immediately |
| B. Retrain at LR 1e-5 | ✅ High | Required for any additional training |
| C. ORPO | ✅ Medium | Data preparation needed |
| D. DPO | ✅ Medium | Alternative to ORPO, better validated |
| E. LoRA | ⚠️ Low | Useful only for hyperparameter exploration |
| F. Data quality improvement | ✅ High | Large payoff for a long-term investment |
| G. More data (500K) | ✅ Medium | Assumes high-quality sources |
### Overall Verdict on the Training Curve
The current SFT run **completed in a healthy state**:
- Gradient norm is stable, with no spikes
- Loss trends downward overall (small-scale fluctuations are normal)
- An outlier rate of 2.1% is within the normal range
- Convergence signals appear after step 3000+, but the curve has not yet plateaued

**Biggest concern:** There is no validation loss → it is unclear whether the model is overfitting. **A val split must be secured immediately.**

---
*Analysis complete. For the next run, use this file as the basis for deciding the experiment direction.*