
# Avengers Team #2: Completing the 1B Model with ORPO + High-Quality Data

์ž‘์„ฑ์ผ: 2026-02-27
์ „๋žต: ํ˜„์žฌ 1B SFT v2 ๋ชจ๋ธ์„ ORPO๋กœ ๋ฐ˜๋ณต๋ฅ  <5% ๋‹ฌ์„ฑ
ํ˜„์žฌ ์ƒํƒœ: ๋ฐ˜๋ณต๋ฅ  18.0%, val_loss 2.2062


## 1. Roadmap: Repetition Rate 18% → <5%

### Step A: Inference Parameter Tuning (immediate, 0 hours)

ํŒŒ๋ผ๋ฏธํ„ฐ ํ˜„์žฌ ๋ณ€๊ฒฝ
repetition_penalty 1.1 1.2
no_repeat_ngram_size 3 4

์˜ˆ์ƒ ๋ฐ˜๋ณต๋ฅ : 18% โ†’ 10~12%

- Rationale: the current eval was measured at repetition_penalty=1.1; raising it to 1.2 directly suppresses n-gram repetition
- Limit: this is the range achievable without degrading generation quality; 1.3 or higher damages contextual coherence
- Independent effect: applies immediately without touching model weights, fully independent of the other steps
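A minimal sketch of Step A, assuming the Hugging Face `generate` API and the SFT checkpoint path used elsewhere in this document; only generation kwargs change, no weights are touched:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the SFT v2 checkpoint path used in Section 2's snippet
path = "checkpoints/korean_1b_sft/checkpoint-best"
model = AutoModelForCausalLM.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)

inputs = tokenizer("<|user|>\nExplain ORPO briefly.\n<|assistant|>\n", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=256,
    repetition_penalty=1.2,   # raised from 1.1
    no_repeat_ngram_size=4,   # raised from 3: blocks repeating any exact 4-gram
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```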

### Step B: ORPO Training (core, 3~5 hours)

์˜ˆ์ƒ ๋ฐ˜๋ณต๋ฅ : 1012% โ†’ 47%

ORPO (Odds Ratio Preference Optimization) unifies SFT and preference alignment into a single objective:

- The SFT loss learns from the chosen responses
- The odds ratio learns the chosen-vs-rejected preference
- No reference model needed, unlike DPO → saves memory and time
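For reference, the single objective from the ORPO paper (Hong et al., 2024, cited in Section 4) combines both terms; λ corresponds to the `--beta` flag in the training command of Section 6:

$$
\mathcal{L}_{\text{ORPO}} = \mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\mathcal{L}_{\text{SFT}} + \lambda \cdot \mathcal{L}_{\text{OR}}\right],
\qquad
\mathcal{L}_{\text{OR}} = -\log \sigma\!\left(\log \frac{\operatorname{odds}_\theta(y_w \mid x)}{\operatorname{odds}_\theta(y_l \mid x)}\right)
$$

where $\operatorname{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}$, $y_w$ is the chosen response, and $y_l$ the rejected one.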

Why ORPO is effective against repetition degeneration:

  1. ๋ฐ˜๋ณต ์‘๋‹ต์„ rejected๋กœ ๋ช…์‹œ์  ํ•™์Šต โ†’ ๋ชจ๋ธ์ด "๋ฐ˜๋ณตํ•˜์ง€ ๋ง๋ผ"๋ฅผ ์ง์ ‘ ๋ฐฐ์›€
  2. SFT๋งŒ์œผ๋กœ๋Š” "๋ญ˜ ํ•˜๋ฉด ์•ˆ ๋˜๋Š”์ง€" ํ•™์Šต ๋ถˆ๊ฐ€ โ†’ preference learning์ด ์œ ์ผํ•œ ํ•ด๋ฒ•
  3. 1B ๋ชจ๋ธ์˜ ๋ฐ˜๋ณต์€ ํŒŒ๋ผ๋ฏธํ„ฐ ๋ถ€์กฑ์ด ์•„๋‹Œ EOS ๊ฒฝ๊ณ„ ํ•™์Šต ์‹คํŒจ + ๋ฐ˜๋ณต ํŒจํ„ด ๋ฏธ๋ฒŒ์น™ โ†’ ORPO๋กœ ์ง์ ‘ ๊ต์ • ๊ฐ€๋Šฅ

ํ•„์š” ๋ฐ์ดํ„ฐ: 500~2000 preference ์Œ (์•„๋ž˜ ์„น์…˜ 2 ์ฐธ์กฐ)

### Step C: Data Cleaning + Additional SFT (optional, 2~4 hours)

์˜ˆ์ƒ ๋ฐ˜๋ณต๋ฅ : 47% โ†’ 35%

- Fix the issues found by data_quality_audit (a sketch follows this list):
  - Remove the 113 samples contaminated with `</s>`
  - Remove the 16,519 samples with short outputs (<80 characters)
  - Remove the ~550 samples with Q/A markers
  - Lower the OpenOrca weight from 5.0 to 2.0
- Run 2~3 additional SFT epochs on the cleaned ~120K samples

Independent effect: data-quality improvements help the base model regardless of ORPO. But without ORPO, this alone cannot reach <5% repetition (the data was already cleaned between SFT v1 and v2, yet the rate stalled at 17.7%→18%).

### Combined Projection

| Stage | Repetition rate | Time | Cumulative |
|---|---|---|---|
| Current | 18.0% | - | - |
| Step A (inference parameters) | 10~12% | 0h | 0h |
| Step B (ORPO) | 4~7% | 3~5h | 3~5h |
| Step C (data-cleaning SFT) | 3~5% | 2~4h | 5~9h |
| Final | 3~5% | - | 5~9h |

2. ์ž์ฒด Preference ๋ฐ์ดํ„ฐ ์ƒ์„ฑ ์ „๋žต

### Method: Self-Play Rejection Sampling

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("checkpoints/korean_1b_sft/checkpoint-best")
# Assumption: the tokenizer ships with the same checkpoint as the model
tokenizer = AutoTokenizer.from_pretrained("checkpoints/korean_1b_sft/checkpoint-best")

def calc_repetition_rate(text, n=10):
    """Share of duplicate n-grams (10-gram basis). Assumed implementation;
    use the same metric as eval so the thresholds stay comparable."""
    tokens = text.split()
    if len(tokens) <= n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

def generate_preference_pair(prompt, n_samples=8, temp=0.9):
    """Generate n_samples responses per prompt, then split into chosen/rejected."""
    responses = []
    for _ in range(n_samples):
        output = model.generate(
            tokenizer.encode(f"<|user|>\n{prompt}\n<|assistant|>\n", return_tensors="pt"),
            max_new_tokens=256, temperature=temp, top_p=0.95,
            do_sample=True, repetition_penalty=1.0  # deliberately no penalty so repetition surfaces
        )
        text = tokenizer.decode(output[0], skip_special_tokens=True)
        rep_rate = calc_repetition_rate(text)  # 10-gram basis
        responses.append((text, rep_rate))

    # Classification
    chosen = [r for r in responses if r[1] < 0.05]    # repetition rate < 5%  -> chosen
    rejected = [r for r in responses if r[1] > 0.15]  # repetition rate > 15% -> rejected

    if chosen and rejected:
        return {"prompt": prompt, "chosen": chosen[0][0], "rejected": rejected[0][0]}
    return None
```
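A hypothetical driver over a prompt pool; `prompts` (roughly 1,250 instructions, per the scale table below) and the output path are assumptions:

```python
import json

pairs = []
for prompt in prompts:        # prompts: ~1,250 instructions (assumption)
    pair = generate_preference_pair(prompt)
    if pair is not None:      # roughly 40% of prompts yield a valid pair
        pairs.append(pair)

with open("data/orpo/self_play.jsonl", "w") as f:  # hypothetical output path
    for p in pairs:
        f.write(json.dumps(p, ensure_ascii=False) + "\n")
```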

### Scale Calculation

| Item | Value |
|---|---|
| Preference pairs needed | 500~1,000 (minimum 500) |
| Samples per prompt | 8 |
| Valid-pair yield | ~40% (at 18% repetition, chosen/rejected separate cleanly) |
| Prompts needed | 500 / 0.4 ≈ 1,250 |
| Generation time per prompt | 8 × 256 tokens × ~0.02s/token ≈ 40s |
| Total generation time | 1,250 × 40s ≈ 14 hours (1 GPU) |

โš ๏ธ ์ž์ฒด ์ƒ์„ฑ์€ ๋А๋ฆผ. ๋Œ€์•ˆ: ๊ธฐ์กด HF preference ๋ฐ์ดํ„ฐ ํ™œ์šฉ (์„น์…˜ 3)

์ž๋™ ํ’ˆ์งˆ ํŒ๋‹จ ๊ธฐ์ค€

  • chosen ์ž„๊ณ„๊ฐ’: 10-gram ๋ฐ˜๋ณต๋ฅ  < 5%, ๊ธธ์ด > 50 tokens, EOS ์ •์ƒ ์ƒ์„ฑ
  • rejected ์ž„๊ณ„๊ฐ’: 10-gram ๋ฐ˜๋ณต๋ฅ  > 15% OR ๋™์ผ ๋ฌธ์žฅ 2ํšŒ ์ด์ƒ ๋ฐ˜๋ณต
  • ์ค‘๊ฐ„ ์˜์—ญ(5~15%)์€ ๋ฒ„๋ฆผ โ†’ contrastive signal ๊ทน๋Œ€ํ™”

### Fast Alternative: Hybrid Strategy (recommended)

1. Download 500~1,000 pairs from HF (immediate)
2. Generate 200~300 additional pairs with our own model (repetition-focused, 3~4 hours)
3. Train ORPO on the combined 700~1,300 pairs

## 3. Korean Preference Data Ready to Use from HuggingFace

ํ™•์ธ๋œ ๋ฐ์ดํ„ฐ์…‹

๋ฐ์ดํ„ฐ์…‹ ํฌ๊ธฐ ํฌ๋งท ์ ํ•ฉ์„ฑ
maywell/ko_Ultrafeedback_binarized 61,966์Œ prompt/chosen/rejected โญโญโญ ์ตœ์  โ€” ๋ฐ”๋กœ ORPO์— ์‚ฌ์šฉ ๊ฐ€๋Šฅ
kuotient/orca-math-korean-dpo-pairs 192,848์Œ question/chosen/rejected โญโญ ์ˆ˜ํ•™ ํŠนํ™”์ง€๋งŒ ์–‘ ํ’๋ถ€
nayohan/preference-collection-ko-full 199,760์Œ ๋ณต์žก ํฌ๋งท (score_A/B) โญโญ ์ „์ฒ˜๋ฆฌ ํ•„์š”
jojo0217/korean_rlhf_dataset ๋ฏธํ™•์ธ ๋ฏธํ™•์ธ โญ ํ™•์ธ ํ•„์š”
heegyu/PKU-SafeRLHF-ko ๋ฏธํ™•์ธ ๋ฏธํ™•์ธ โญ ์•ˆ์ „์„ฑ ํŠนํ™”

### Recommended Combination

```python
from datasets import load_dataset

# 1st priority: sample 2,000 pairs from ko_Ultrafeedback_binarized
ds = load_dataset("maywell/ko_Ultrafeedback_binarized", split="train")
ds = ds.shuffle(seed=42).select(range(2000))  # already prompt/chosen/rejected: usable as-is

# 2nd priority: 500 extra pairs from orca-math (for diversity; conversion below)
ds2 = load_dataset("kuotient/orca-math-korean-dpo-pairs", split="train")
```

Preparation time: under 30 minutes (download + format conversion).
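The conversion itself is small. Continuing the snippet above, a sketch using the column names from the table (question/chosen/rejected for orca-math); `select_columns` drops any extra fields so the two schemas match before merging:

```python
from datasets import concatenate_datasets

# Rename orca-math's question column to the ORPO format
ds2 = ds2.rename_column("question", "prompt")
ds2 = ds2.shuffle(seed=42).select(range(500))      # the 500 extra pairs

# Keep only the shared columns, then merge: 2,000 + 500 = 2,500 pairs
cols = ["prompt", "chosen", "rejected"]
combined = concatenate_datasets([ds.select_columns(cols), ds2.select_columns(cols)])
```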


## 4. The 1B Model's Limits and How Far ORPO Can Push Them

๋ฐ˜๋ณต ํ‡ดํ™”์˜ ๊ทผ๋ณธ ์›์ธ: ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜ vs ํ•™์Šต ๋ฐฉ๋ฒ•

ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๊ฐ€ ์ฃผ ์›์ธ์ด ์•„๋‹Œ ๊ทผ๊ฑฐ:

1. The pretrain-stage repetition rate was 69%; SFT brought it down to 18%, a 51-point improvement with the same 1B parameters
2. Repetition occurs only on certain prompts (0% on short factual questions, 20~33% on long explanatory ones)
3. data_quality_audit identified failed EOS learning as the core cause → a training-data/method problem

1B์—์„œ ๋ฐ˜๋ณต๋ฅ  <5% ํ˜„์‹ค์„ฑ:

- Models of similar scale such as Qwen2.5-0.5B and SmolLM-1.7B have reached <5% repetition after RLHF/DPO in multiple reported cases
- The original ORPO paper (Hong et al., 2024) ran experiments on Phi-2 (2.7B) and Llama-2-7B → consistent gains even at small scale
- Direct 1B-class experiments are rare, but repetition degeneration is an alignment problem, not a capacity problem

ORPO ํŠน์œ ์˜ ์žฅ์  (1B์— ์œ ๋ฆฌ):

- No reference model → saves GPU memory (DPO needs 2× the memory)
- A 1B model can be fully fine-tuned on a single GPU
- SFT and preference are learned at the same time → efficient with little data

ํ˜„์‹ค์  ๊ธฐ๋Œ€์น˜

๋ชฉํ‘œ ๋‹ฌ์„ฑ ๊ฐ€๋Šฅ์„ฑ ์กฐ๊ฑด
๋ฐ˜๋ณต๋ฅ  <10% 95% ORPO 500์Œ + rep_penalty=1.2
๋ฐ˜๋ณต๋ฅ  <5% 70% ORPO 1000์Œ + ๋ฐ์ดํ„ฐ ์ •์ œ SFT
๋ฐ˜๋ณต๋ฅ  <3% 40% ORPO 2000์Œ + ๋ฐ์ดํ„ฐ ์ •์ œ + ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹

5. ์ด ๋น„์šฉ ๊ณ„์‚ฐ

### 1B ORPO Path (this strategy)

| Step | Task | Time |
|---|---|---|
| 1 | Download + preprocess HF preference data | 0.5h |
| 2 | Self-generate preferences (200~300 pairs, optional) | 3~4h |
| 3 | ORPO training (1,000 pairs, 1~2 epochs) | 1~2h |
| 4 | Evaluation + iteration | 0.5h |
| 5 | (Optional) data-cleaning re-SFT | 2~4h |
| Total (required only) | | 2~3h |
| Total (everything) | | 7~11h |

3B ์ฒ˜์Œ๋ถ€ํ„ฐ ๊ฒฝ๋กœ (๋Œ€์•ˆ)

๋‹จ๊ณ„ ์‹œ๊ฐ„
3B pretrain 26h
SFT 1~2h
ํ‰๊ฐ€ 1h
์ดํ•ฉ 28~29h

### Comparison

| Item | 1B ORPO | 3B from scratch |
|---|---|---|
| Time | 2~11h | 28~29h |
| Success probability (<5%) | 70% | 80~90% |
| Cost of failure | 3~11h wasted | 29h wasted |
| Expected cost (time ÷ success probability) | 3~11h / 0.7 = **4~16h** | 29h / 0.85 ≈ 34h |
| Parallelizable | ✅ can run alongside the 3B | occupies the GPU |
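The expected-cost row treats each path as an independent retry until success, so the expected number of attempts is 1/p and the expected total time is:

$$
\mathbb{E}[\text{time}] = \frac{t}{p_{\text{success}}},
\qquad \text{e.g.}\;\; \frac{29\,\text{h}}{0.85} \approx 34\,\text{h},
\quad \frac{3\text{~}11\,\text{h}}{0.7} \approx 4\text{~}16\,\text{h}
$$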

## 6. Final Recommendation: Why ORPO, Right Now

### Core Arguments

1. Time efficiency: the required steps take only 2~3 hours, about a tenth of the 3B path
2. Minimal risk: a failure costs 3 hours; a 3B failure costs 29
3. Data already exists: maywell/ko_Ultrafeedback_binarized has 61K pairs ready on HF; we only need to download it
4. Solves the exact problem: repetition degeneration comes from the model "not knowing what not to do" → preference learning is the precise remedy
5. Parallel strategy possible: since ORPO takes 2~3 hours, it can start at the same time as 3B training; adopt whichever finishes first

### Immediate Execution Plan

```bash
# Step 1: prepare preference data (30 min)
python3 scripts/prepare_orpo_data.py \
  --hf_dataset maywell/ko_Ultrafeedback_binarized \
  --sample_size 2000 \
  --output data/orpo/train.jsonl

# Step 2: ORPO training (1~2 hours)
python3 scripts/train_orpo.py \
  --model checkpoints/korean_1b_sft/checkpoint-best \
  --data data/orpo/train.jsonl \
  --lr 5e-6 --epochs 2 --batch_size 4 --beta 0.1 \
  --output checkpoints/korean_1b_orpo

# Step 3: evaluation (30 min)
python3 eval/comprehensive_eval.py \
  --model checkpoints/korean_1b_orpo \
  --repetition_penalty 1.2 --no_repeat_ngram_size 4
```
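`scripts/train_orpo.py` itself is not included in this document; a minimal sketch of what it might wrap, assuming TRL's `ORPOTrainer` and the hyperparameters from the command above:

```python
# Hypothetical core of scripts/train_orpo.py, using trl's ORPOTrainer.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_path = "checkpoints/korean_1b_sft/checkpoint-best"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# JSONL with prompt/chosen/rejected columns (Sections 2-3)
train_ds = load_dataset("json", data_files="data/orpo/train.jsonl", split="train")

config = ORPOConfig(
    output_dir="checkpoints/korean_1b_orpo",
    learning_rate=5e-6,
    num_train_epochs=2,
    per_device_train_batch_size=4,
    beta=0.1,  # weight on the odds-ratio term
)

trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=train_ds,
    processing_class=tokenizer,  # `tokenizer=` in older trl versions
)
trainer.train()
trainer.save_model()
```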

์„ฑ๊ณต ํŒ์ • ๊ธฐ์ค€

์ง€ํ‘œ ๋ชฉํ‘œ ํ˜„์žฌ
๋ฐ˜๋ณต๋ฅ  <5% 18%
์ž์—ฐ ์ข…๋ฃŒ์œจ >80% 60%
์‘๋‹ต ํ’ˆ์งˆ ์œ ์ง€ ๋˜๋Š” ๊ฐœ์„  baseline

## Summary

| Item | Value |
|---|---|
| Strategy | ORPO + inference parameter tuning |
| Expected repetition rate | 3~7% (70% chance of hitting the <5% target) |
| Total time | 2~3h (required) / 7~11h (everything) |
| vs. 3B | 10~15× faster; 2~3× more efficient in expected cost |
| Required data | immediately available on HF (free, 30 minutes) |
| Key message | SFT alone cannot teach "what not to do." ORPO is the precise remedy. |