# Preference Optimization Survey Report
**Date:** 2026-02-26
**Purpose:** Survey preference optimization methods for fixing repetition degeneration after SFT
---
## 1. Current Environment
| Package | Version | Notes |
|---------|---------|-------|
| transformers | 5.2.0 | ✅ Installed |
| accelerate | - | To be checked |
| peft | - | To be checked |
| **trl** | **Not installed** | ⚠️ `pip install trl` required |
**Infrastructure:** 8× B200 183GB
**Model:** Custom 1B-parameter model (Llama-family architecture, FP8 support)
**Latest checkpoints:**
- Pretrain: `checkpoints/korean_1b_fp8_run1/checkpoint-0034000`
- SFT: `checkpoints/korean_1b_sft/` (check the log for the final checkpoint)
**HF conversion:** `scripts/convert_to_hf.py` exists ✅ and can convert to the LlamaForCausalLM format
---
## 2. ORPO vs DPO vs SimPO Comparison
### ORPO (Odds Ratio Preference Optimization)
- **Paper:** Hong et al. 2024 (arXiv:2403.07691)
- **Reference model:** Not required ✅
- **Core idea:** Train a single model on the SFT loss plus an odds-ratio-based preference loss at the same time
- **Memory:** Same as SFT (only 1× model needed)
- **Fit for the 1B model:** Ample headroom on 8× B200 (a single GPU is enough)
- **Implementation:** TRL `ORPOTrainer` (trl >= 0.8.0)
- **Pros:** Simplest, memory-efficient, SFT and preference learning in one pass
- **Cons:** Fewer published cases validating its stability than DPO
### DPO (Direct Preference Optimization)
- **Paper:** Rafailov et al. 2023 (arXiv:2305.18290)
- **Reference model:** Required (frozen copy, 2× memory)
- **Memory:** 1B model × 2 ≈ 4GB of weights (BF16), still ample headroom
- **Fit for the 1B model:** No issues
- **Implementation:** TRL `DPOTrainer`
- **Pros:** Best validated, stable, many papers and case studies
- **Cons:** Requires managing the reference model
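
The 4GB figure above follows from simple arithmetic on weight storage. A minimal sketch, assuming exactly 1e9 parameters and counting weights only (gradients, optimizer state, and activations are not included; the helper name `weight_memory_gb` is illustrative):

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2, n_copies: int = 1) -> float:
    """Memory taken by model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param * n_copies / 1e9


# DPO keeps two copies in memory: the trainable policy and the frozen reference.
# BF16 stores each parameter in 2 bytes.
dpo_weights_gb = weight_memory_gb(1e9, bytes_per_param=2, n_copies=2)
print(dpo_weights_gb)  # 4.0
```

The frozen reference copy needs no gradients or optimizer state, so the real overhead of DPO over ORPO is closer to one extra set of weights plus the extra forward pass.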
### SimPO (Simple Preference Optimization)
- **Paper:** Meng et al. 2024 (arXiv:2405.14734)
- **Reference model:** Not required
- **Core idea:** Length-normalized implicit reward with a target margin
- **Implementation:** No dedicated trainer in TRL → available through `CPOTrainer` with `loss_type="simpo"` (trl >= 0.9.0, exact version to be verified)
- **Pros:** Reported to outperform ORPO; reference-free
- **Cons:** Relatively new method
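
To make the "length-normalized implicit reward with a margin" concrete, here is a minimal per-pair sketch of the SimPO objective in plain Python. It assumes the summed token log-probabilities of each response are already available; the `beta` and `gamma` defaults are illustrative values, not tuned settings from this project.

```python
import math


def simpo_loss(logp_chosen: float, len_chosen: int,
               logp_rejected: float, len_rejected: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO loss for one preference pair.

    logp_* are summed token log-probabilities under the policy,
    len_* are the response lengths in tokens.
    """
    # Implicit reward: average log-probability per token, scaled by beta.
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    # The chosen reward must beat the rejected reward by at least gamma.
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# The loss is small when the chosen response is clearly more likely per token.
print(simpo_loss(-10.0, 10, -30.0, 10) < simpo_loss(-30.0, 10, -10.0, 10))  # True
```

Because the reward is divided by length, a long repetitive response cannot win simply by accumulating many moderately likely tokens.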
### PPO (Proximal Policy Optimization): for reference only
- Requires training a separate reward model → high complexity
- Excessive overhead for a 1B model
- **Not recommended** (inefficient relative to the available data and infrastructure)
---
## 3. Recommendation: **ORPO → DPO, in that order**
### First choice: ORPO
- No reference model → minimal memory and implementation effort
- Can start directly from the SFT checkpoint
- Preference data targeting repetition degeneration is easy to build
### Second choice: DPO
- Switch to DPO if ORPO falls short
- With a 1B model, the reference model is not a burden
- More stable and better validated
### Rationale
With a 1B model on 8× B200, DPO's 2× memory is not a problem either,
but ORPO is worth trying first for its **implementation speed and simplicity**.
---
## 4. Korean Preference Datasets
### ✅ Accessible (DPO/ORPO format compatible)
| Dataset | Format | Downloads | Suitability |
|---------|--------|-----------|-------------|
| **kuotient/orca-math-korean-dpo-pairs** | `{system, question, chosen, rejected}` | 111 | ⭐⭐⭐ Usable for DPO/ORPO as-is |
| **ChuGyouk/argilla-distilabel-math-preference-dpo-korean** | DPO format | 10 | ⭐⭐⭐ Math domain |
| **nayohan/preference-collection-ko-full** | `{response_A, response_B, orig_score_A, orig_score_B, orig_preference}` | 30 | ⭐⭐⭐ Needs conversion, but rich |
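
The conversion needed for `nayohan/preference-collection-ko-full` could look like the sketch below. It relies only on the score fields listed in the table. The prompt field name `orig_instruction` is an assumption (the table does not list it), so it is passed as a parameter, and the sample row is made up for illustration.

```python
from typing import Optional


def to_preference_pair(row: dict, prompt_key: str = "orig_instruction") -> Optional[dict]:
    """Turn one A/B-scored row into a {prompt, chosen, rejected} record.

    prompt_key is a hypothetical field name; check the dataset card.
    Returns None for ties, which carry no preference signal.
    """
    score_a = float(row["orig_score_A"])
    score_b = float(row["orig_score_B"])
    if score_a == score_b:
        return None
    a_wins = score_a > score_b
    return {
        "prompt": row[prompt_key],
        "chosen": row["response_A"] if a_wins else row["response_B"],
        "rejected": row["response_B"] if a_wins else row["response_A"],
    }


# Made-up row, for illustration only.
sample = {
    "orig_instruction": "Explain what a tokenizer does.",
    "response_A": "It splits text.",
    "response_B": "A tokenizer maps raw text to a sequence of vocabulary ids.",
    "orig_score_A": 2,
    "orig_score_B": 5,
}
print(to_preference_pair(sample)["chosen"])
```

Deriving the label from the two scores avoids depending on how `orig_preference` is encoded.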
### ✅ Accessible (SFT format, needs conversion to preference pairs)
| Dataset | Format | Downloads |
|---------|--------|-----------|
| jojo0217/korean_rlhf_dataset | `{instruction, input, output}` | 54 |
| FreedomIntelligence/alpaca-gpt4-korean | SFT format | 158 |
| nlpai-lab/kullm-v2 | SFT format | 730 |
### ❌ Not accessible
maywell/ko_Ultrafeedback, HAERAE-HUB/KoRA, heegyu/OpenOrca-ko, Bongseok/ko-DPO-v0.1: all return 404
### 💡 Strategy for Generating Our Own Preference Data (targeting repetition degeneration)
Most effective approach: **use the current model's repetitive outputs as rejected**
```
{
  "prompt": "Please recommend famous tourist attractions in Seoul.",
  "chosen": "Representative attractions in Seoul include Gyeongbokgung, Bukchon Hanok Village, Namsan Tower...",
  "rejected": "Attractions in Seoul include Gyeongbokgung. Include Gyeongbokgung. Include Gyeongbokgung..."
}
```
1. Generate with the current SFT model across a variety of prompts (vary the temperature)
2. Responses that exhibit repetition → rejected
3. Normal responses (or ones generated with GPT-4) → chosen
4. Even 500~2000 pairs are expected to be effective
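
Steps 2 and 3 above need an automatic way to tell repetitive samples from normal ones. One possible sketch uses the share of repeated word n-grams as the signal. The whitespace tokenization, the n-gram size, and the `threshold` value are all assumptions to tune on real model outputs, and the function names are illustrative.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that repeat an earlier n-gram in the text."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def build_pairs(prompt: str, samples: list, threshold: float = 0.3) -> list:
    """Pair each degenerate sample (rejected) with a clean one (chosen)."""
    clean = [s for s in samples if repeated_ngram_ratio(s) < threshold]
    degenerate = [s for s in samples if repeated_ngram_ratio(s) >= threshold]
    if not clean:
        return []  # chosen would have to come from another source, e.g. GPT-4
    return [{"prompt": prompt, "chosen": clean[0], "rejected": bad}
            for bad in degenerate]


samples = ["one two three four five", "a b c a b c a b c"]
print(build_pairs("demo prompt", samples))
```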
---
## 5. HF Conversion
`scripts/convert_to_hf.py` already exists and converts to the LlamaForCausalLM format:
- Supports both FP8 and BF16 checkpoints
- Output: `config.json`, `model.safetensors`, `tokenizer.json`, etc.
**Conversion command:**
```bash
cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang
python scripts/convert_to_hf.py \
    --checkpoint checkpoints/korean_1b_sft/checkpoint-XXXXX \
    --output outputs/hf_for_orpo \
    --tokenizer tokenizer/korean_sp/tokenizer.json
```
After conversion, load with `AutoModelForCausalLM.from_pretrained("outputs/hf_for_orpo")` → ready for use with TRL `ORPOTrainer`.
---
## 6. Why ORPO Is Effective Against Repetition Degeneration
### Mechanism
ORPO's odds ratio loss trains the model to:
- **Raise the generation probability of chosen responses ↑** (normal, diverse responses)
- **Lower the generation probability of rejected responses ↓** (repetitive responses)
Repetition degeneration arises when the probability of a particular token sequence becomes self-reinforcing.
ORPO penalizes this pattern directly:
1. **Repetitive pattern = rejected** → directly suppresses the model from assigning high probability to repetitive sequences
2. **Diverse normal responses = chosen** → encourages a diverse token distribution
3. **Trained jointly with the SFT loss** → suppresses repetition while preserving general performance
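
The three points above correspond to the two terms of the ORPO objective: an SFT term on the chosen response plus a weighted odds-ratio term. A minimal per-pair sketch in plain Python, assuming length-averaged log-probabilities as inputs (both must be strictly negative) and an illustrative weight `lam`:

```python
import math


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) where p = exp(avg_logp) and avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float,
              lam: float = 0.1) -> float:
    """ORPO loss for one pair: SFT term + lam * odds-ratio term."""
    # SFT term: negative log-likelihood of the chosen response.
    sft_term = -avg_logp_chosen
    # Odds-ratio term: -log(sigmoid(log odds ratio)), which shrinks as the
    # chosen response becomes more likely relative to the rejected one.
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    or_term = math.log1p(math.exp(-log_odds_ratio))
    return sft_term + lam * or_term


# Preferring the chosen response gives a lower loss than preferring the rejected one.
print(orpo_loss(-0.5, -2.0) < orpo_loss(-2.0, -0.5))  # True
```

When the rejected response is a repetition loop that the model currently rates as likely, the odds-ratio term is large, so its gradient pushes probability mass away from that loop.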
### Why SFT Alone Is Not Enough
- SFT only learns "imitate good responses"
- It provides no signal for "avoid bad responses"
- Preference optimization explicitly learns "do not do this"
### Expected Effects
- A large reduction in repetition degeneration should be achievable with as few as 500~2000 repetition-vs-normal preference pairs
- A more fundamental fix than decoding tricks such as repetition penalty
- Minimal loss of general performance (the SFT loss acts alongside)
---
## 7. Execution Plan
```
1. Install TRL: pip install trl --break-system-packages (or use a venv)
2. HF conversion: python scripts/convert_to_hf.py --checkpoint ... --output outputs/hf_for_orpo
3. Prepare preference data:
   a. Download kuotient/orca-math-korean-dpo-pairs (usable as-is)
   b. Generate our own repetition-degeneration data (using eval/generate.py)
4. ORPO training: python train/orpo.py (script noted below)
5. Evaluation: measure repetition rate + perplexity
```
ORPO training script: see `train/orpo.py`
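
For step 5 of the plan, the two metrics can be computed from generated outputs and per-token log-probabilities. The sketch below is one possible definition, not the project's evaluation code. The function names are illustrative, and the degeneracy check is passed in as a callable so that any repetition detector can be plugged in.

```python
import math
from typing import Callable, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-probability over the evaluated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def repetition_rate(outputs: Sequence[str],
                    is_degenerate: Callable[[str], bool]) -> float:
    """Share of generated outputs flagged as repetitive."""
    if not outputs:
        raise ValueError("need at least one output")
    return sum(1 for text in outputs if is_degenerate(text)) / len(outputs)


# Toy check: a response made of one repeated word counts as degenerate here.
def single_word_loop(text: str) -> bool:
    return len(set(text.split())) == 1


print(repetition_rate(["a a a a", "b c d e"], single_word_loop))  # 0.5
```

Tracking both numbers before and after ORPO shows whether repetition drops without perplexity on held-out text getting worse.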