# Preference Optimization Survey Report
**Date:** 2026-02-26
**Purpose:** Survey of preference-optimization methods for fixing repetition degeneration after SFT
---
## 1. Current Environment
| Package | Version | Notes |
|---------|---------|-------|
| transformers | 5.2.0 | ✅ installed |
| accelerate | - | needs verification |
| peft | - | needs verification |
| **trl** | **not installed** | ⚠️ `pip install trl` required |

**Infrastructure:** 8× B200 183GB
**Model:** custom 1B-parameter model (Llama-family architecture, FP8 support)
**Latest checkpoints:**
- Pretrain: `checkpoints/korean_1b_fp8_run1/checkpoint-0034000`
- SFT: `checkpoints/korean_1b_sft/` (final checkpoint must be confirmed from the logs)

**HF conversion:** `scripts/convert_to_hf.py` exists → ✅ can convert to the LlamaForCausalLM format
---
## 2. ORPO vs. DPO vs. SimPO
### ORPO (Odds Ratio Preference Optimization)
- **Paper:** Hong et al. 2024 (arXiv:2403.07691)
- **Reference model:** not needed ✅
- **Core idea:** jointly optimize the SFT loss and an odds-ratio-based preference loss with a single model
- **Memory:** same as SFT (only 1× the model)
- **Fit for a 1B model:** very comfortable on 8× B200 (a single GPU would suffice)
- **Implementation:** TRL `ORPOTrainer` (trl >= 0.8.0)
- **Pros:** simplest setup, memory-efficient, SFT + preference in one pass
- **Cons:** fewer published validations of stability than DPO
### DPO (Direct Preference Optimization)
- **Paper:** Rafailov et al. 2023 (arXiv:2305.18290)
- **Reference model:** required (frozen copy, 2× memory)
- **Memory:** 1B model × 2 ≈ 4 GB (BF16) → still plenty of headroom
- **Fit for a 1B model:** no problem
- **Implementation:** TRL `DPOTrainer`
- **Pros:** best-validated, stable, abundant papers and case studies
- **Cons:** must manage the reference model
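For reference, the objective from Rafailov et al. (2023), with $y_w$ = chosen and $y_l$ = rejected:

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

The frozen reference policy $\pi_{\text{ref}}$ appearing in both ratios is exactly why DPO needs the second model copy noted above.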
### SimPO (Simple Preference Optimization)
- **Paper:** Meng et al. 2024 (arXiv:2405.14734)
- **Reference model:** not needed
- **Core idea:** length-normalized implicit reward with a target margin
- **Implementation:** no dedicated trainer in TRL → available through `CPOTrainer` with `loss_type="simpo"` (and `cpo_alpha=0` for the pure SimPO loss; trl >= 0.9.0)
- **Pros:** reported to outperform ORPO in the paper's benchmarks; reference-free
- **Cons:** comparatively new method, less field experience
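The SimPO objective from Meng et al. (2024): the implicit reward is the length-normalized log-probability, there is no $\pi_{\text{ref}}$ term, and $\gamma$ is a target reward margin:

$$
\mathcal{L}_{\text{SimPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma\right)\right]
$$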
### PPO (Proximal Policy Optimization): for reference only
- Requires training a separate reward model → much higher complexity
- Excessive overhead for a 1B model
- **Not recommended** (inefficient relative to our data and infrastructure)
---
## 3. Recommendation: **ORPO first, then DPO**
### First choice: ORPO
- No reference model → minimal memory and implementation effort
- Can start directly from the SFT checkpoint
- Building preference data targeted at repetition degeneration is straightforward
### Second choice: DPO
- Switch to DPO if ORPO proves insufficient
- With a 1B model, the reference-model overhead is negligible
- More stable and better-validated method
### Rationale
With a 1B model on 8× B200, DPO's 2× memory cost is a non-issue, but ORPO is worth trying first for its **implementation speed and simplicity**.
---
## 4. Korean Preference Datasets
### ✅ Accessible (DPO/ORPO-compatible format)
| Dataset | Format | Downloads | Suitability |
|---------|--------|-----------|-------------|
| **kuotient/orca-math-korean-dpo-pairs** | `{system, question, chosen, rejected}` | 111 | ⭐⭐⭐ usable for DPO/ORPO as-is |
| **ChuGyouk/argilla-distilabel-math-preference-dpo-korean** | DPO format | 10 | ⭐⭐⭐ math domain |
| **nayohan/preference-collection-ko-full** | `{response_A, response_B, orig_score_A, orig_score_B, orig_preference}` | 30 | ⭐⭐⭐ needs conversion, but rich |
### ✅ Accessible (SFT format, needs conversion into preference pairs)
| Dataset | Format | Downloads |
|---------|--------|-----------|
| jojo0217/korean_rlhf_dataset | `{instruction, input, output}` | 54 |
| FreedomIntelligence/alpaca-gpt4-korean | SFT format | 158 |
| nlpai-lab/kullm-v2 | SFT format | 730 |
### ❌ Not accessible
maywell/ko_Ultrafeedback, HAERAE-HUB/KoRA, heegyu/OpenOrca-ko, Bongseok/ko-DPO-v0.1 → all return 404
### 💡 Strategy for Generating Our Own Preference Data (targeted at repetition degeneration)
The most effective approach: **use the current model's repetitive outputs as the `rejected` side**
```json
{
  "prompt": "서울의 유명한 관광지를 추천해주세요.",
  "chosen": "서울의 대표적인 관광지로는 경복궁, 북촌한옥마을, 남산타워...",
  "rejected": "서울의 관광지로는 경복궁이 있습니다. 경복궁이 있습니다. 경복궁이 있습니다..."
}
```
(The prompt asks for famous tourist spots in Seoul; `chosen` lists several attractions normally, while `rejected` degenerates into repeating "경복궁이 있습니다", "There is Gyeongbokgung Palace.")
1. Generate with the current SFT model over a diverse set of prompts (varying the temperature)
2. Responses exhibiting repetition → `rejected`
3. Clean responses (or GPT-4-generated ones) → `chosen`
4. Even 500~2000 pairs can be effective
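Steps 2-3 above need an automatic way to label a generation as repetitive. A minimal sketch using a repeated-n-gram heuristic; the `n` and `max_repeats` thresholds are assumptions to be tuned on real model outputs:

```python
from collections import Counter


def has_repetition(text: str, n: int = 5, max_repeats: int = 2) -> bool:
    """Heuristic: flag a response whose most frequent word n-gram occurs
    more than `max_repeats` times. Thresholds are illustrative."""
    words = text.split()
    if len(words) < n:
        return False
    ngrams = Counter(tuple(words[i : i + n]) for i in range(len(words) - n + 1))
    return max(ngrams.values()) > max_repeats


def build_pairs(prompts, degenerate_outputs, clean_outputs):
    """Pair repetitive generations (rejected) with clean references (chosen)
    for the same prompt; skip prompts where the output did not degenerate."""
    pairs = []
    for prompt, bad, good in zip(prompts, degenerate_outputs, clean_outputs):
        if has_repetition(bad) and not has_repetition(good):
            pairs.append({"prompt": prompt, "chosen": good, "rejected": bad})
    return pairs
```

Word-level n-grams are a rough proxy for Korean (subword-level checks with our tokenizer would be tighter), but they are enough to bootstrap the 500~2000 pairs targeted above.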
---
## 5. HF Conversion
`scripts/convert_to_hf.py` already exists and converts to the LlamaForCausalLM format:
- Supports both FP8 and BF16 checkpoints
- Output: `config.json`, `model.safetensors`, `tokenizer.json`, etc.

**Conversion command:**
```bash
cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang
python scripts/convert_to_hf.py \
  --checkpoint checkpoints/korean_1b_sft/checkpoint-XXXXX \
  --output outputs/hf_for_orpo \
  --tokenizer tokenizer/korean_sp/tokenizer.json
```
After conversion, load with `AutoModelForCausalLM.from_pretrained("outputs/hf_for_orpo")` → ready for TRL's `ORPOTrainer`.
---
## 6. Why ORPO Is Effective Against Repetition Degeneration
### Mechanism
ORPO's odds-ratio loss trains the model to:
- **raise** the generation probability of the chosen response (clean, diverse text)
- **lower** the generation probability of the rejected response (repetitive text)

Repetition degeneration arises when the probability of a particular token sequence becomes self-reinforcing. ORPO penalizes exactly this pattern:
1. **Repetitive pattern = rejected** → directly suppresses the model's habit of assigning high probability to repeated sequences
2. **Diverse, clean responses = chosen** → encourages a more diverse token distribution
3. **Joint training with the SFT loss** → suppresses repetition while preserving general capability
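In symbols (Hong et al. 2024), with $y_w$ = chosen and $y_l$ = rejected:

$$
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}, \qquad
\mathcal{L}_{OR} = -\log \sigma\!\left(\log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\right)
$$

$$
\mathcal{L}_{\text{ORPO}} = \mathbb{E}_{(x,\,y_w,\,y_l)}\left[\mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR}\right]
$$

The odds-ratio term pushes probability mass away from the rejected (repetitive) sequences, while $\mathcal{L}_{SFT}$ keeps anchoring the model on the chosen ones; $\lambda$ balances the two.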
### Why SFT Alone Is Not Enough
- SFT only teaches "imitate the good responses"
- It carries no signal for "avoid the bad responses"
- Preference optimization explicitly teaches "do not do this"
### Expected Impact
- Even 500~2000 repetition-vs-clean preference pairs can substantially reduce repetition degeneration
- A more fundamental fix than decoding tricks such as repetition penalty
- Minimal general-performance regression (the SFT loss keeps acting jointly)
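To quantify the improvement before and after training, a simple corpus-level measure is distinct-n (unique n-grams over total n-grams); heavy repetition drives the score toward 0. A minimal sketch, with the choice of $n=3$ being an assumption:

```python
def distinct_n(texts: list[str], n: int = 3) -> float:
    """Corpus-level distinct-n: unique word n-grams / total word n-grams.
    Values near 1.0 indicate diverse output; repetition lowers the score."""
    total = 0
    unique = set()
    for text in texts:
        words = text.split()
        grams = [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```

Tracking distinct-n on a fixed prompt set alongside perplexity (section 7, step 5) separates "less repetitive" from "generally worse".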
---
## 7. Execution Plan
```
1. Install TRL: pip install trl --break-system-packages (or use a venv)
2. HF conversion: python scripts/convert_to_hf.py --checkpoint ... --output outputs/hf_for_orpo
3. Prepare preference data:
   a. Download kuotient/orca-math-korean-dpo-pairs (usable as-is)
   b. Generate our own repetition-degeneration data (reuse eval/generate.py)
4. ORPO training: python train/orpo.py (script below)
5. Evaluation: repetition-rate measurement + perplexity
```
ORPO training script: see `train/orpo.py`
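`train/orpo.py` is not reproduced here; a minimal sketch of what such a script can look like with TRL, assuming the converted checkpoint from section 5 and a `{prompt, chosen, rejected}` JSONL file. The data path and all hyperparameters are illustrative, and `ORPOConfig` field names should be verified against the installed trl version:

```python
# Sketch of an ORPO training script; paths and hyperparameters are
# illustrative and must be adapted to the actual run.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model = AutoModelForCausalLM.from_pretrained("outputs/hf_for_orpo")
tokenizer = AutoTokenizer.from_pretrained("outputs/hf_for_orpo")

# Expects columns: prompt, chosen, rejected (hypothetical file name)
dataset = load_dataset("json", data_files="data/repetition_pairs.jsonl", split="train")

args = ORPOConfig(
    output_dir="checkpoints/korean_1b_orpo",
    per_device_train_batch_size=8,
    learning_rate=8e-6,
    beta=0.1,  # weight (lambda) on the odds-ratio term
    max_length=1024,
    num_train_epochs=1,
    bf16=True,
)

trainer = ORPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` in older trl releases
)
trainer.train()
```

Launch across the 8 GPUs with `accelerate launch train/orpo.py` once `accelerate` is configured.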