# Preference Optimization Research Report
**Date:** 2026-02-26
**Purpose:** Survey of preference optimization methods for fixing repetition degeneration after SFT
---
## 1. Current Environment
| Package | Version | Notes |
|---------|---------|-------|
| transformers | 5.2.0 | βœ… installed |
| accelerate | - | needs checking |
| peft | - | needs checking |
| **trl** | **not installed** | ⚠️ `pip install trl` required |
**Infrastructure:** 8Γ— B200 183GB
**Model:** custom 1B-parameter model (Llama-family architecture, FP8 support)
**Latest checkpoints:**
- Pretrain: `checkpoints/korean_1b_fp8_run1/checkpoint-0034000`
- SFT: `checkpoints/korean_1b_sft/` (check the logs for the final checkpoint)
**HF conversion:** `scripts/convert_to_hf.py` exists βœ… β€” converts to LlamaForCausalLM format
---
## 2. ORPO vs DPO vs SimPO Comparison
### ORPO (Odds Ratio Preference Optimization)
- **Paper:** Hong et al. 2024 (arXiv:2403.07691)
- **Reference model:** not required βœ…
- **Core idea:** jointly train the SFT loss and an odds-ratio-based preference loss on a single model
- **Memory:** same as SFT (only 1Γ— model needed)
- **Fit for a 1B model:** very comfortable on 8Γ— B200 (a single GPU would suffice)
- **Implementation:** TRL `ORPOTrainer` (trl >= 0.8.0); see the sketch below
- **Pros:** simplest, memory-efficient, SFT + preference in a single pass
- **Cons:** fewer published stability results than DPO
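A minimal training sketch, assuming the converted checkpoint from Section 5 and the kuotient dataset from Section 4; the column rename, paths, and hyperparameters are illustrative assumptions, not tested settings:
```python
# Minimal ORPO sketch. Paths, hyperparameters, and the column mapping are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model = AutoModelForCausalLM.from_pretrained("outputs/hf_for_orpo")
tokenizer = AutoTokenizer.from_pretrained("outputs/hf_for_orpo")

# ORPOTrainer expects "prompt", "chosen", "rejected" columns.
dataset = load_dataset("kuotient/orca-math-korean-dpo-pairs", split="train")
dataset = dataset.rename_column("question", "prompt")  # per the schema in Section 4

args = ORPOConfig(
    output_dir="outputs/orpo_run1",
    beta=0.1,                        # weight of the odds-ratio term
    per_device_train_batch_size=8,
    num_train_epochs=1,
    bf16=True,
)
# Recent TRL versions take processing_class= instead of tokenizer=.
trainer = ORPOTrainer(model=model, args=args, train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
```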
### DPO (Direct Preference Optimization)
- **Paper:** Rafailov et al. 2023 (arXiv:2305.18290)
- **Reference model:** required (frozen copy, 2Γ— memory; see the objective below)
- **Memory:** 1B model Γ— 2 β‰ˆ 4GB (BF16) β€” still comfortable
- **Fit for a 1B model:** no problem
- **Implementation:** TRL `DPOTrainer`
- **Pros:** best validated, stable, rich literature and case studies
- **Cons:** reference model must be managed
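For reference, the DPO objective from Rafailov et al. 2023; the frozen reference policy $\pi_{\mathrm{ref}}$ in both log-ratios is exactly what costs the 2Γ— memory:

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$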
### SimPO (Simple Preference Optimization)
- **Paper:** Meng et al. 2024 (arXiv:2405.14734)
- **Reference model:** not required
- **Core idea:** length-normalized implicit reward with a target margin (objective below)
- **Implementation:** no dedicated trainer in TRL β†’ available through `CPOTrainer` with `loss_type="simpo"` (trl >= 0.9.0)
- **Pros:** reported to outperform ORPO, reference-free
- **Cons:** comparatively new method
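The SimPO objective from Meng et al. 2024 drops $\pi_{\mathrm{ref}}$ entirely: the implicit reward is the length-normalized log-likelihood, and $\gamma$ is the target margin:

$$
\mathcal{L}_{\mathrm{SimPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x) - \gamma\right)\right]
$$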
### PPO (Proximal Policy Optimization) β€” for reference
- Requires training a separate reward model β†’ high complexity
- Excessive overhead for a 1B model
- **Not recommended** (inefficient relative to the data and infrastructure)
---
## 3. Recommendation: **ORPO β†’ DPO, in that order**
### First choice: ORPO
- No reference model β†’ minimal memory and implementation effort
- Can start directly from the SFT checkpoint
- Preference data targeting repetition degeneration is easy to produce
### Second choice: DPO
- Switch to DPO if ORPO proves insufficient
- At 1B parameters, the reference model is no burden
- More stable, better-validated method
### Rationale
In a 1B-model, 8Γ— B200 environment even DPO's 2Γ— memory is a non-issue, but
ORPO is worth trying first for **implementation speed and simplicity**.
---
## 4. Korean Preference Datasets
### βœ… Accessible (DPO/ORPO-format compatible)
| Dataset | Format | Downloads | Suitability |
|---------|--------|-----------|-------------|
| **kuotient/orca-math-korean-dpo-pairs** | `{system, question, chosen, rejected}` | 111 | ⭐⭐⭐ ready for DPO/ORPO as-is |
| **ChuGyouk/argilla-distilabel-math-preference-dpo-korean** | DPO format | 10 | ⭐⭐⭐ math domain |
| **nayohan/preference-collection-ko-full** | `{response_A, response_B, orig_score_A, orig_score_B, orig_preference}` | 30 | ⭐⭐⭐ needs conversion but plentiful; see the sketch below |
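A conversion sketch for the nayohan dataset; field names are taken from the table above, while the prompt field and the `orig_preference` label convention are guesses that must be checked against the real schema:
```python
# Convert {response_A, response_B, orig_preference} rows into {prompt, chosen, rejected}.
# Field names come from the table above; the label convention is an assumption.
from datasets import load_dataset

def to_pair(row):
    a_wins = row["orig_preference"] == "A"        # assumed label convention
    return {
        "prompt": row.get("instruction", ""),     # hypothetical prompt field
        "chosen": row["response_A"] if a_wins else row["response_B"],
        "rejected": row["response_B"] if a_wins else row["response_A"],
    }

ds = load_dataset("nayohan/preference-collection-ko-full", split="train")
pairs = ds.map(to_pair, remove_columns=ds.column_names)
```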
### βœ… Accessible (SFT format, needs conversion to preference pairs)
| Dataset | Format | Downloads |
|---------|--------|-----------|
| jojo0217/korean_rlhf_dataset | `{instruction, input, output}` | 54 |
| FreedomIntelligence/alpaca-gpt4-korean | SFT format | 158 |
| nlpai-lab/kullm-v2 | SFT format | 730 |
### ❌ Inaccessible
maywell/ko_Ultrafeedback, HAERAE-HUB/KoRA, heegyu/OpenOrca-ko, Bongseok/ko-DPO-v0.1 β€” all 404
### πŸ’‘ Strategy for Generating Our Own Preference Data (targeting repetition degeneration)
The most effective approach: **use the current model's repetitive outputs as `rejected`**. In the (Korean) example below, the chosen response lists several Seoul attractions while the rejected response repeats the same sentence about Gyeongbokgung Palace over and over:
```json
{
"prompt": "μ„œμšΈμ˜ 유λͺ…ν•œ κ΄€κ΄‘μ§€λ₯Ό μΆ”μ²œν•΄μ£Όμ„Έμš”.",
"chosen": "μ„œμšΈμ˜ λŒ€ν‘œμ μΈ κ΄€κ΄‘μ§€λ‘œλŠ” 경볡ꢁ, λΆμ΄Œν•œμ˜₯λ§ˆμ„, λ‚¨μ‚°νƒ€μ›Œ...",
"rejected": "μ„œμšΈμ˜ κ΄€κ΄‘μ§€λ‘œλŠ” 경볡ꢁ이 μžˆμŠ΅λ‹ˆλ‹€. 경볡ꢁ이 μžˆμŠ΅λ‹ˆλ‹€. 경볡ꢁ이 μžˆμŠ΅λ‹ˆλ‹€..."
}
```
1. Generate responses from the current SFT model across diverse prompts (vary the temperature)
2. Responses exhibiting repetition β†’ `rejected`
3. Clean responses (or GPT-4-generated ones) β†’ `chosen`
4. Even 500–2,000 pairs can be effective (see the mining sketch below)
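A mining sketch for steps 1–3, using a repeated-n-gram ratio as the repetition detector; the threshold and n are illustrative, not tuned:
```python
# Mine degenerate SFT outputs as "rejected"; thresholds are illustrative.
from collections import Counter

def ngram_repetition_ratio(text: str, n: int = 4) -> float:
    """Fraction of n-grams that occur more than once (1.0 = fully repetitive)."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

def build_pair(prompt: str, sft_output: str, clean_output: str):
    """Keep only clearly degenerate generations as the rejected side."""
    if ngram_repetition_ratio(sft_output) > 0.3:  # illustrative cutoff
        return {"prompt": prompt, "chosen": clean_output, "rejected": sft_output}
    return None
```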
---
## 5. HF Conversion
`scripts/convert_to_hf.py` already exists and converts to LlamaForCausalLM format:
- Supports both FP8 and BF16 checkpoints
- Outputs: `config.json`, `model.safetensors`, `tokenizer.json`, etc.
**Conversion command:**
```bash
cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang
python scripts/convert_to_hf.py \
    --checkpoint checkpoints/korean_1b_sft/checkpoint-XXXXX \
    --output outputs/hf_for_orpo \
    --tokenizer tokenizer/korean_sp/tokenizer.json
```
After conversion, load with `AutoModelForCausalLM.from_pretrained("outputs/hf_for_orpo")` β†’ ready for TRL's ORPOTrainer.
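A quick smoke test that the converted checkpoint loads (path taken from the command above):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("outputs/hf_for_orpo", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("outputs/hf_for_orpo")
print(model.config.architectures)  # expect ["LlamaForCausalLM"]
```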
---
## 6. Why ORPO Is Effective Against Repetition Degeneration
### Mechanism
ORPO's odds-ratio loss trains the model to:
- **raise the generation probability of chosen responses** (clean, diverse responses)
- **lower the generation probability of rejected responses** (repetitive responses)
Repetition degeneration arises when the probability of particular token sequences becomes self-reinforcing.
ORPO penalizes this pattern directly (loss form below):
1. **Repetitive patterns = rejected** β†’ directly suppresses the model assigning high probability to repeated sequences
2. **Diverse clean responses = chosen** β†’ encourages a more varied token distribution
3. **Trained jointly with the SFT loss** β†’ suppresses repetition while preserving general performance
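In loss form (Hong et al. 2024), with $\lambda$ weighting the odds-ratio term against the SFT term:

$$
\mathcal{L}_{\mathrm{ORPO}} = \mathcal{L}_{\mathrm{SFT}} + \lambda \cdot \mathcal{L}_{\mathrm{OR}}, \qquad
\mathcal{L}_{\mathrm{OR}} = -\log \sigma\!\left(\log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\right), \qquad
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
$$

Because the rejected response $y_l$ sits in the denominator of the odds ratio, repetitive sequences are pushed down directly rather than merely being out-competed by the chosen side.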
### Why SFT Alone Is Not Enough
- SFT only teaches "imitate the good responses"
- There is no signal for "avoid the bad responses"
- Preference optimization teaches "do not do this" explicitly
### Expected Effects
- Even 500–2,000 repetitive-vs-clean preference pairs can sharply reduce repetition degeneration
- More fundamental than decoding tricks like repetition penalty
- Minimal general-performance regression (the SFT loss acts alongside)
---
## 7. Execution Plan
1. Install TRL: `pip install trl --break-system-packages` (or use a venv)
2. HF conversion: `python scripts/convert_to_hf.py --checkpoint ... --output outputs/hf_for_orpo`
3. Prepare preference data:
   a. Download kuotient/orca-math-korean-dpo-pairs (usable immediately)
   b. Generate our own repetition-degeneration data (using eval/generate.py)
4. ORPO training: `python train/orpo.py` (referenced below)
5. Evaluation: repetition-rate measurement + perplexity (sketch below)
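For step 5, the repetition rate can reuse `ngram_repetition_ratio` from Section 4; a rough perplexity sketch follows (weighting each text's mean loss by its token count is an approximation):
```python
# Rough corpus perplexity for step 5; token-count weighting is approximate.
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, texts) -> float:
    total_loss, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt").to(model.device)
        out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].numel()
        total_loss += out.loss.item() * n
        total_tokens += n
    return math.exp(total_loss / total_tokens)
```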
ORPO training script: see `train/orpo.py`