
Preference Optimization ์กฐ์‚ฌ ๋ณด๊ณ ์„œ

์ž‘์„ฑ์ผ: 2026-02-26 ๋ชฉ์ : SFT ์ดํ›„ ๋ฐ˜๋ณต ํ‡ดํ™”(repetition degeneration) ํ•ด๊ฒฐ์„ ์œ„ํ•œ Preference Optimization ๋ฐฉ๋ฒ•๋ก  ์กฐ์‚ฌ


1. ํ˜„์žฌ ํ™˜๊ฒฝ

ํŒจํ‚ค์ง€ ๋ฒ„์ „ ๋น„๊ณ 
transformers 5.2.0 โœ… ์„ค์น˜๋จ
accelerate - ํ™•์ธ ํ•„์š”
peft - ํ™•์ธ ํ•„์š”
trl ๋ฏธ์„ค์น˜ โš ๏ธ pip install trl ํ•„์š”

์ธํ”„๋ผ: 8ร— B200 183GB ๋ชจ๋ธ: ์ปค์Šคํ…€ 1B ํŒŒ๋ผ๋ฏธํ„ฐ (Llama ๊ณ„์—ด ์•„ํ‚คํ…์ฒ˜, FP8 ์ง€์›) ์ตœ์‹  ์ฒดํฌํฌ์ธํŠธ:

  • Pretrain: checkpoints/korean_1b_fp8_run1/checkpoint-0034000
  • SFT: checkpoints/korean_1b_sft/ (์ตœ์ข… ์ฒดํฌํฌ์ธํŠธ๋Š” log ํ™•์ธ ํ•„์š”)

HF ๋ณ€ํ™˜: scripts/convert_to_hf.py ์กด์žฌ โœ… โ€” LlamaForCausalLM ํฌ๋งท์œผ๋กœ ๋ณ€ํ™˜ ๊ฐ€๋Šฅ


2. ORPO vs DPO vs SimPO

ORPO (Odds Ratio Preference Optimization)

  • Paper: Hong et al. 2024 (arXiv:2403.07691)
  • Reference model: not required ✅
  • Core idea: train the SFT loss and an odds-ratio-based preference loss jointly on a single model
  • Memory: same as SFT (only 1× model needed)
  • Applying to the 1B model: very comfortable on 8× B200 (a single GPU would suffice)
  • Implementation: TRL ORPOTrainer (trl >= 0.8.0)
  • Pros: simplest, memory-efficient, SFT + preference in one pass
  • Cons: fewer published stability validations than DPO

DPO (Direct Preference Optimization)

  • Paper: Rafailov et al. 2023 (arXiv:2305.18290)
  • Reference model: required (frozen copy, 2× memory)
  • Memory: 1B model × 2 ≈ 4 GB (BF16) — still ample
  • Applying to the 1B model: no problem
  • Implementation: TRL DPOTrainer
  • Pros: best validated, stable, abundant papers and case studies
  • Cons: the reference model must be managed

SimPO (Simple Preference Optimization)

  • Paper: Meng et al. 2024 (arXiv:2405.14734)
  • Reference model: not required
  • Core: length-normalized implicit reward with a target margin
  • Implementation: no dedicated TRL trainer → available via CPOTrainer with loss_type="simpo" (and cpo_alpha=0.0) in recent trl versions
  • Pros: reported to outperform ORPO; reference-free
  • Cons: relatively new method

PPO (Proximal Policy Optimization) — for reference

  • Requires training a separate reward model → high complexity
  • Excessive overhead for a 1B model
  • Not recommended (inefficient relative to our data and infrastructure)

3. ์ถ”์ฒœ: ORPO โ†’ DPO ์ˆœ์„œ

1์ˆœ์œ„: ORPO

  • Reference model ์—†์Œ โ†’ ๋ฉ”๋ชจ๋ฆฌ/๊ตฌํ˜„ ์ตœ์†Œ
  • SFT ์ฒดํฌํฌ์ธํŠธ์—์„œ ๋ฐ”๋กœ ์‹œ์ž‘ ๊ฐ€๋Šฅ
  • ๋ฐ˜๋ณต ํ‡ดํ™”์šฉ preference ๋ฐ์ดํ„ฐ ์ œ์ž‘์ด ๊ฐ„๋‹จ

2์ˆœ์œ„: DPO

  • ORPO๋กœ ๋ถ€์กฑํ•˜๋ฉด DPO๋กœ ์ „ํ™˜
  • 1B ๋ชจ๋ธ์ด๋ผ reference model ๋ถ€๋‹ด ์—†์Œ
  • ๋” ์•ˆ์ •์ ์ด๊ณ  ๊ฒ€์ฆ๋œ ๋ฐฉ๋ฒ•

๊ทผ๊ฑฐ

1B ๋ชจ๋ธ + 8ร— B200 ํ™˜๊ฒฝ์—์„œ๋Š” DPO์˜ 2ร— ๋ฉ”๋ชจ๋ฆฌ๋„ ๋ฌธ์ œ์—†์ง€๋งŒ, ๊ตฌํ˜„ ์†๋„์™€ ๋‹จ์ˆœ์„ฑ ๋ฉด์—์„œ ORPO๊ฐ€ ๋จผ์ € ์‹œ๋„ํ•  ๊ฐ€์น˜๊ฐ€ ์žˆ์Œ.


4. ํ•œ๊ตญ์–ด Preference ๋ฐ์ดํ„ฐ์…‹

โœ… ์ ‘๊ทผ ๊ฐ€๋Šฅ (DPO/ORPO ํ˜•์‹ ํ˜ธํ™˜)

๋ฐ์ดํ„ฐ์…‹ ํ˜•์‹ Downloads ์ ํ•ฉ๋„
kuotient/orca-math-korean-dpo-pairs {system, question, chosen, rejected} 111 โญโญโญ DPO/ORPO ์ฆ‰์‹œ ์‚ฌ์šฉ ๊ฐ€๋Šฅ
ChuGyouk/argilla-distilabel-math-preference-dpo-korean DPO ํ˜•์‹ 10 โญโญโญ ์ˆ˜ํ•™ ๋„๋ฉ”์ธ
nayohan/preference-collection-ko-full {response_A, response_B, orig_score_A, orig_score_B, orig_preference} 30 โญโญโญ ๋ณ€ํ™˜ ํ•„์š”ํ•˜์ง€๋งŒ ํ’๋ถ€
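nayohan/preference-collection-ko-full needs a mapping into the {prompt, chosen, rejected} schema that TRL's preference trainers expect. A minimal conversion sketch, assuming `orig_preference` is the label "A" or "B" and that the prompt lives in an `instruction` field (a hypothetical field name — verify against the dataset card):

```python
def to_preference_pair(row):
    """Map a preference-collection-style row to the {prompt, chosen, rejected}
    schema used by TRL's ORPO/DPO trainers.

    Assumes `orig_preference` is "A" or "B" and the prompt is stored under
    `instruction` (hypothetical name -- check the actual dataset card).
    """
    if row["orig_preference"] == "A":
        chosen, rejected = row["response_A"], row["response_B"]
    else:
        chosen, rejected = row["response_B"], row["response_A"]
    return {"prompt": row["instruction"], "chosen": chosen, "rejected": rejected}

# With the `datasets` library this would be applied as:
# ds = ds.map(to_preference_pair, remove_columns=ds.column_names)
```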

โœ… ์ ‘๊ทผ ๊ฐ€๋Šฅ (SFT ํ˜•์‹, preference ๋ณ€ํ™˜ ํ•„์š”)

๋ฐ์ดํ„ฐ์…‹ ํ˜•์‹ Downloads
jojo0217/korean_rlhf_dataset {instruction, input, output} 54
FreedomIntelligence/alpaca-gpt4-korean SFT ํ˜•์‹ 158
nlpai-lab/kullm-v2 SFT ํ˜•์‹ 730

โŒ ์ ‘๊ทผ ๋ถˆ๊ฐ€

maywell/ko_Ultrafeedback, HAERAE-HUB/KoRA, heegyu/OpenOrca-ko, Bongseok/ko-DPO-v0.1 โ€” ๋ชจ๋‘ 404

๐Ÿ’ก ์ž์ฒด Preference ๋ฐ์ดํ„ฐ ์ƒ์„ฑ ์ „๋žต (๋ฐ˜๋ณต ํ‡ดํ™” ํŠนํ™”)

๊ฐ€์žฅ ํšจ๊ณผ์ ์ธ ๋ฐฉ๋ฒ•: ํ˜„์žฌ ๋ชจ๋ธ์˜ ๋ฐ˜๋ณต ์ถœ๋ ฅ์„ rejected๋กœ ํ™œ์šฉ

{
  "prompt": "์„œ์šธ์˜ ์œ ๋ช…ํ•œ ๊ด€๊ด‘์ง€๋ฅผ ์ถ”์ฒœํ•ด์ฃผ์„ธ์š”.",
  "chosen": "์„œ์šธ์˜ ๋Œ€ํ‘œ์ ์ธ ๊ด€๊ด‘์ง€๋กœ๋Š” ๊ฒฝ๋ณต๊ถ, ๋ถ์ดŒํ•œ์˜ฅ๋งˆ์„, ๋‚จ์‚ฐํƒ€์›Œ...",
  "rejected": "์„œ์šธ์˜ ๊ด€๊ด‘์ง€๋กœ๋Š” ๊ฒฝ๋ณต๊ถ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฒฝ๋ณต๊ถ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฒฝ๋ณต๊ถ์ด ์žˆ์Šต๋‹ˆ๋‹ค..."
}
  1. ํ˜„์žฌ SFT ๋ชจ๋ธ๋กœ ๋‹ค์–‘ํ•œ ํ”„๋กฌํ”„ํŠธ์— ๋Œ€ํ•ด ์ƒ์„ฑ (temperature ๋‹ค์–‘ํ•˜๊ฒŒ)
  2. ๋ฐ˜๋ณต์ด ๋ฐœ์ƒํ•œ ์‘๋‹ต โ†’ rejected
  3. ์ •์ƒ ์‘๋‹ต (๋˜๋Š” GPT-4๋กœ ์ƒ์„ฑ) โ†’ chosen
  4. 500~2000๊ฐœ๋งŒ์œผ๋กœ๋„ ํšจ๊ณผ์ 
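Step 2 needs an automatic repetition check to sort generations into rejected. A minimal word-level sketch (the n-gram window and repeat threshold are arbitrary choices, not from this repo):

```python
def has_repetition(text: str, max_n: int = 8, min_repeats: int = 3) -> bool:
    """Return True if any word n-gram (1 <= n <= max_n) repeats at least
    `min_repeats` times back-to-back -- a cheap proxy for degeneration."""
    tokens = text.split()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n):
            repeats = 1
            j = i + n
            # Count how many times the same n-gram follows itself.
            while tokens[j:j + n] == tokens[i:i + n]:
                repeats += 1
                j += n
            if repeats >= min_repeats:
                return True
    return False
```

For subword-tokenized Korean text, running the same check on tokenizer IDs instead of whitespace words would be more faithful.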

5. HF ๋ณ€ํ™˜

scripts/convert_to_hf.py ๊ฐ€ ์ด๋ฏธ ์กด์žฌํ•˜๋ฉฐ LlamaForCausalLM ํฌ๋งท์œผ๋กœ ๋ณ€ํ™˜:

  • FP8 / BF16 ์ฒดํฌํฌ์ธํŠธ ๋ชจ๋‘ ์ง€์›
  • ์ถœ๋ ฅ: config.json, model.safetensors, tokenizer.json ๋“ฑ

๋ณ€ํ™˜ ๋ช…๋ น:

cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang
python scripts/convert_to_hf.py \
    --checkpoint checkpoints/korean_1b_sft/checkpoint-XXXXX \
    --output outputs/hf_for_orpo \
    --tokenizer tokenizer/korean_sp/tokenizer.json

๋ณ€ํ™˜ ํ›„ AutoModelForCausalLM.from_pretrained("outputs/hf_for_orpo") ๋กœ ๋กœ๋“œ โ†’ TRL ORPOTrainer ์‚ฌ์šฉ ๊ฐ€๋Šฅ.


6. ๋ฐ˜๋ณต ํ‡ดํ™” ํ•ด๊ฒฐ์— ORPO๊ฐ€ ํšจ๊ณผ์ ์ธ ์ด์œ 

๋ฉ”์ปค๋‹ˆ์ฆ˜

ORPO์˜ odds ratio loss๋Š” ๋‹ค์Œ์„ ํ•™์Šต:

  • chosen ์‘๋‹ต์˜ ์ƒ์„ฑ ํ™•๋ฅ  โ†‘ (์ •์ƒ์ ์ด๊ณ  ๋‹ค์–‘ํ•œ ์‘๋‹ต)
  • rejected ์‘๋‹ต์˜ ์ƒ์„ฑ ํ™•๋ฅ  โ†“ (๋ฐ˜๋ณต์ ์ธ ์‘๋‹ต)

๋ฐ˜๋ณต ํ‡ดํ™”๋Š” ํŠน์ • ํ† ํฐ ์‹œํ€€์Šค์˜ ํ™•๋ฅ ์ด ์ž๊ธฐ๊ฐ•ํ™”(self-reinforcing)๋˜๋ฉด์„œ ๋ฐœ์ƒ. ORPO๋Š” ์ด ํŒจํ„ด ์ž์ฒด๋ฅผ ์ง์ ‘์ ์œผ๋กœ ํŽ˜๋„ํ‹ฐ:

  1. ๋ฐ˜๋ณต ํŒจํ„ด = rejected โ†’ ๋ชจ๋ธ์ด ๋ฐ˜๋ณต ์‹œํ€€์Šค์— ๋†’์€ ํ™•๋ฅ ์„ ๋ถ€์—ฌํ•˜๋Š” ๊ฒƒ์„ ์ง์ ‘ ์–ต์ œ
  2. ๋‹ค์–‘ํ•œ ์ •์ƒ ์‘๋‹ต = chosen โ†’ ๋‹ค์–‘ํ•œ ํ† ํฐ ๋ถ„ํฌ๋ฅผ ์œ ๋„
  3. SFT loss์™€ ๋™์‹œ ํ•™์Šต โ†’ ์ผ๋ฐ˜ ์„ฑ๋Šฅ ์œ ์ง€ํ•˜๋ฉด์„œ ๋ฐ˜๋ณต ์–ต์ œ

์™œ SFT๋งŒ์œผ๋กœ ๋ถ€์กฑํ•œ๊ฐ€

  • SFT๋Š” "์ข‹์€ ์‘๋‹ต์„ ๋”ฐ๋ผํ•˜๋ผ"๋งŒ ํ•™์Šต
  • "๋‚˜์œ ์‘๋‹ต์„ ํ”ผํ•˜๋ผ"๋Š” ์‹ ํ˜ธ๊ฐ€ ์—†์Œ
  • Preference optimization์€ "์ด๊ฒƒ์€ ํ•˜์ง€ ๋งˆ๋ผ"๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ ํ•™์Šต

์˜ˆ์ƒ ํšจ๊ณผ

  • 500~2000๊ฐœ์˜ ๋ฐ˜๋ณต-vs-์ •์ƒ preference ์Œ์œผ๋กœ๋„ ๋ฐ˜๋ณต ํ‡ดํ™” ๋Œ€ํญ ๊ฐ์†Œ ๊ฐ€๋Šฅ
  • repetition penalty ๊ฐ™์€ ๋””์ฝ”๋”ฉ ํŠธ๋ฆญ๋ณด๋‹ค ๊ทผ๋ณธ์  ํ•ด๊ฒฐ
  • ์ผ๋ฐ˜ ์„ฑ๋Šฅ ์ €ํ•˜ ์ตœ์†Œ (SFT loss๊ฐ€ ํ•จ๊ป˜ ์ž‘์šฉ)

7. ์‹คํ–‰ ๊ณ„ํš

1. TRL ์„ค์น˜: pip install trl --break-system-packages (๋˜๋Š” venv)
2. HF ๋ณ€ํ™˜: python scripts/convert_to_hf.py --checkpoint ... --output outputs/hf_for_orpo
3. Preference ๋ฐ์ดํ„ฐ ์ค€๋น„:
   a. kuotient/orca-math-korean-dpo-pairs ๋‹ค์šด๋กœ๋“œ (์ฆ‰์‹œ ์‚ฌ์šฉ ๊ฐ€๋Šฅ)
   b. ์ž์ฒด ๋ฐ˜๋ณต ํ‡ดํ™” ๋ฐ์ดํ„ฐ ์ƒ์„ฑ (eval/generate.py ํ™œ์šฉ)
4. ORPO ํ•™์Šต: python train/orpo.py (์•„๋ž˜ ์Šคํฌ๋ฆฝํŠธ)
5. ํ‰๊ฐ€: ๋ฐ˜๋ณต๋ฅ  ์ธก์ • + perplexity

ORPO ํ•™์Šต ์Šคํฌ๋ฆฝํŠธ: train/orpo.py ์ฐธ์กฐ