
In-Depth Survey of SFT Improvement Options

Project: 1B Korean LLM SFT (188k samples, 8×B200, FP8). Current implementation: NEFTune, dynamic padding, gradient checkpointing, cosine LR, BF16+FP8. Written: 2026-02-26


1. Curriculum Learning

Concept

Train on easy samples first and progressively harder ones to speed up convergence and improve final quality.

Implementation

Method A: Perplexity-based ordering (recommended)

# scripts/compute_difficulty.py
import torch, json
from pathlib import Path
from tokenizers import Tokenizer
from model import LLM

def compute_sample_perplexity(model, tokenizer, data_path, output_path, device="cuda:0"):
    """Compute each sample's perplexity with the current pretrained model."""
    model.eval()
    results = []

    with open(data_path) as f:
        samples = [json.loads(line) for line in f]

    with torch.no_grad():
        for i, sample in enumerate(samples):
            # Tokenize the full conversation as one sequence
            messages = sample["messages"]
            full_text = tokenizer.encode(
                "".join(m["content"] for m in messages)
            )
            input_ids = torch.tensor([full_text.ids[:4096]], device=device)

            logits = model(input_ids)
            # Next-token CE loss over the sequence = perplexity proxy
            shift_logits = logits[:, :-1, :]
            shift_labels = input_ids[:, 1:]
            loss = torch.nn.functional.cross_entropy(
                shift_logits.reshape(-1, shift_logits.size(-1)),
                shift_labels.reshape(-1),
                reduction='mean'
            )
            ppl = loss.exp().item()
            results.append({"idx": i, "ppl": ppl})

            if i % 1000 == 0:
                print(f"  {i}/{len(samples)} done")

    # Sort ascending by ppl = easiest samples first
    results.sort(key=lambda x: x["ppl"])

    with open(output_path, "w") as f:
        json.dump(results, f)
    return results

Method B: Length-based (simplest)

  • Sort samples from short responses to long responses
  • Have SFTDataset's __getitem__ walk the sorted index order
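A minimal sketch of the length-based ordering (the helper name is hypothetical; SFTDataset would map its __getitem__ index through the returned list):

```python
def length_curriculum_order(samples):
    """Return sample indices ordered from shortest to longest assistant response."""
    def resp_len(sample):
        return sum(len(m["content"]) for m in sample["messages"] if m["role"] == "assistant")
    return sorted(range(len(samples)), key=lambda i: resp_len(samples[i]))
```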

Method C: IFD Score

  • Cherry LLM paper (2024): IFD = PPL(output|instruction) / PPL(output)
  • High IFD = the instruction does little to guide the output = a hard sample

Observed effect

  • Curriculum Learning for LLMs (Xu et al., 2024): +0.3~0.5 points on MT-Bench for SFT
  • Caveat: since Bengio et al.'s original work, curriculum results for SFT have been mixed
  • Expectation here: ko_ifeval +1~2%, negligible change in repetition rate

Evaluation

| Item | Value |
|---|---|
| Expected effect | ko_ifeval +1~2% |
| Implementation complexity | 2/5 |
| Time required | 2~3 h PPL computation + 2 h code changes |
| Applicable | ✅ via a modified DataLoader sampler |

2. The "Less is More" Strategy (LIMA, AlpaGasus)

Key papers

  • LIMA (Zhou et al., 2023): 1,000 high-quality samples beat 52k lower-quality ones; validated on a 65B model
  • AlpaGasus (Chen et al., 2023): GPT-4 quality scoring to select a ~9k subset of Alpaca's 52k, outperforming full-data Alpaca
  • DEITA (Liu et al., 2024): filtering along three axes: complexity, quality, and diversity

Computing quality scores (without external APIs)

# scripts/quality_filter.py
import json, torch

def compute_quality_scores(data_path, model, tokenizer, device="cuda:0"):
    """Compute a multi-dimensional quality score per sample."""
    with open(data_path) as f:
        samples = [json.loads(line) for line in f]

    scored = []
    model.eval()

    for i, sample in enumerate(samples):
        msgs = sample["messages"]

        # 1) Length score: penalize responses that are too short or too long
        response = "".join(m["content"] for m in msgs if m["role"] == "assistant")
        resp_len = len(response)
        len_score = min(resp_len / 500, 1.0) * (1.0 if resp_len < 3000 else 3000 / resp_len)

        # 2) Repetition penalty: character-trigram uniqueness ratio
        tokens = list(response)
        if len(tokens) > 10:
            trigrams = [tuple(tokens[j:j+3]) for j in range(len(tokens)-2)]
            unique_ratio = len(set(trigrams)) / len(trigrams)
        else:
            unique_ratio = 1.0
        rep_score = unique_ratio

        # 3) Perplexity score (mid-range is best: too low = trivial, too high = noise);
        #    uses the ppl values precomputed by compute_difficulty.py

        # 4) Instruction complexity: instruction length as a proxy
        instruction = "".join(m["content"] for m in msgs if m["role"] == "user")
        inst_complexity = min(len(instruction) / 200, 1.0)

        # Aggregate score
        quality = 0.3 * len_score + 0.3 * rep_score + 0.2 * inst_complexity + 0.2
        scored.append({"idx": i, "quality": quality, "sample": sample})

    return scored

def select_top_k(scored, k):
    """Keep the top-k samples by quality."""
    scored.sort(key=lambda x: x["quality"], reverse=True)
    return scored[:k]
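The repetition term above can be sanity-checked in isolation; a standalone copy of the character-trigram heuristic:

```python
def trigram_unique_ratio(text):
    """Fraction of distinct character trigrams: near 1.0 for varied text, low for loops."""
    chars = list(text)
    if len(chars) <= 10:
        return 1.0
    trigrams = [tuple(chars[j:j + 3]) for j in range(len(chars) - 2)]
    return len(set(trigrams)) / len(trigrams)
```

Degenerate, repetitive generations score far below varied text, which is what the 0.3 weight in the aggregate score exploits.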

Recommended sample count

  • From the full 188k, keep 50k~80k (top 30~40%)
  • At 1B scale, moderate filtering fits better than LIMA-style extreme reduction
  • Rationale: AlpaGasus peaked around ~30% selection, and a 1B model depends more heavily on data volume than a 65B one

Evaluation

| Item | Value |
|---|---|
| Expected effect | repetition rate -30~50%, ko_ifeval +3~5% |
| Implementation complexity | 2/5 |
| Time required | 3~4 h quality scoring + 1 h filtering code |
| Applicable | ✅ at the data-preprocessing stage |

3. Packing (Sequence Packing)

Concept

Pack several short sequences into a single max_seq_len sequence to eliminate padding waste.

Current project status

  • Batch-level dynamic padding is already implemented via dynamic_collate_fn
  • Packing goes further: concatenate multiple samples into one sequence

Caveat: cross-contamination

  • Attention must not flow between different samples packed into the same sequence
  • Fix: Flash Attention v2's varlen interface (the cu_seqlens parameter)
  • Or a block-diagonal attention mask
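If varlen flash attention is not available, the block-diagonal mask can be built directly from cu_seqlens; a plain-PyTorch sketch (padding positions get id -1, so they never attend to real tokens):

```python
import torch

def block_diagonal_causal_mask(cu_seqlens, total_len):
    """True where attention is allowed: causal AND within the same packed sample."""
    seq_ids = torch.full((total_len,), -1, dtype=torch.long)  # -1 marks padding
    for k, (s, e) in enumerate(zip(cu_seqlens[:-1], cu_seqlens[1:])):
        seq_ids[s:e] = k
    causal = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))
    same_sample = seq_ids.unsqueeze(0) == seq_ids.unsqueeze(1)
    return causal & same_sample
```

A dense mask like this is O(L²) memory, so it only makes sense as a fallback; with Flash Attention the cu_seqlens tensor is passed directly instead.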

Implementation

# data/packed_sft_dataset.py
import torch

class PackedSFTDataset:
    """Pack multiple SFT samples into single fixed-length sequences."""

    def __init__(self, samples, tokenizer, max_seq_len=4096):
        self.packed = []

        buffer_ids = []
        buffer_labels = []
        seq_lens = []

        for sample in samples:
            ids, labels = self._tokenize(sample, tokenizer)  # project tokenization helper
            # Truncate samples that on their own exceed max_seq_len
            ids, labels = ids[:max_seq_len], labels[:max_seq_len]

            if len(buffer_ids) + len(ids) > max_seq_len:
                # Flush the current buffer
                if buffer_ids:
                    self._save_buffer(buffer_ids, buffer_labels, seq_lens, max_seq_len)
                buffer_ids = ids
                buffer_labels = labels
                seq_lens = [len(ids)]
            else:
                buffer_ids.extend(ids)
                buffer_labels.extend(labels)
                seq_lens.append(len(ids))

        if buffer_ids:
            self._save_buffer(buffer_ids, buffer_labels, seq_lens, max_seq_len)

    def _save_buffer(self, ids, labels, seq_lens, max_seq_len):
        # Pad to max_seq_len; padded label positions are ignored (-1)
        pad_len = max_seq_len - len(ids)
        ids = ids + [0] * pad_len
        labels = labels + [-1] * pad_len

        # Cumulative sequence lengths for varlen flash attention
        cu = [0]
        for l in seq_lens:
            cu.append(cu[-1] + l)

        self.packed.append({
            "input_ids": torch.tensor(ids),
            "labels": torch.tensor(labels),
            "cu_seqlens": torch.tensor(cu, dtype=torch.int32),
        })

์†๋„ ๊ฐœ์„  ์˜ˆ์ƒ

  • ํ˜„์žฌ ํ‰๊ท  ์‹œํ€€์Šค ๊ธธ์ด๊ฐ€ max_seq_len(4096)๋ณด๋‹ค ํ›จ์”ฌ ์งง๋‹ค๋ฉด 1.5~3ร— ์†๋„ ํ–ฅ์ƒ
  • SFT ๋ฐ์ดํ„ฐ ํŠน์„ฑ์ƒ ํ‰๊ท  ~1000 ํ† ํฐ์ด๋ฉด ~3ร— ํšจ์œจ ํ–ฅ์ƒ ์˜ˆ์ƒ
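That estimate can be checked against the real length distribution before committing to the work; a rough counter using the same greedy first-fit policy as the packing loop above (the gain it reports is an upper bound, since packed rows cost more attention compute):

```python
def packing_rows(lengths, max_seq_len=4096):
    """Number of max_seq_len rows needed when greedily packing the given sample lengths
    (versus len(lengths) rows when each sample gets its own padded row)."""
    rows, used = 1, 0
    for n in lengths:
        n = min(n, max_seq_len)          # overlong samples are truncated, as in the dataset
        if used + n > max_seq_len:
            rows += 1
            used = n
        else:
            used += n
    return rows

# e.g. 8 samples of ~1000 tokens fit into 2 packed rows instead of 8 padded rows
```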

Evaluation

| Item | Value |
|---|---|
| Expected effect | 1.5~3× training speed; quality unchanged or negligible |
| Implementation complexity | 3/5 (needs Flash Attention varlen integration) |
| Time required | 1~2 days |
| Applicable | ⚠️ the model's attention implementation must support cu_seqlens |

4. Multi-task SFT (Per-domain Loss Weighting)

Concept

Classify samples by data source into domains and apply a different loss weight per domain.

Current data sources (estimated)

  • Under korean_safe_conv/raw/: hatespeech, square, evol, yitingxie, gamseong, koalpaca, conversation
  • Categories: safety, QA, creative writing, general conversation

Implementation

# Add a domain tag to each JSONL record
# {"messages": [...], "domain": "qa"}
# {"messages": [...], "domain": "creative"}

# Per-domain loss weights in the trainer
DOMAIN_WEIGHTS = {
    "qa": 1.0,
    "creative": 0.8,
    "safety": 1.2,
    "code": 1.0,
    "math": 1.0,
    "conversation": 0.6,  # down-weight generic chat
}
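A sketch of how the trainer could fold these weights into the loss (the function name and batching are assumptions, and the weight table is duplicated here so the snippet stands alone): each sample's mean token CE is scaled by its domain's weight before averaging over the batch.

```python
import torch
import torch.nn.functional as F

DOMAIN_WEIGHTS = {"qa": 1.0, "creative": 0.8, "safety": 1.2, "conversation": 0.6}

def domain_weighted_loss(logits, labels, domains, ignore_index=-1):
    """Mean over samples of (domain weight x per-sample mean token CE)."""
    batch = logits.size(0)
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1),
        ignore_index=ignore_index, reduction="none",
    ).view(batch, -1)
    mask = (labels != ignore_index).float()
    per_sample = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    weights = torch.tensor([DOMAIN_WEIGHTS.get(d, 1.0) for d in domains])
    return (weights * per_sample).mean()
```

Averaging per sample before weighting keeps long samples from dominating the weighted mean.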

Evaluation

| Item | Value |
|---|---|
| Expected effect | targeted benchmarks +1~3% |
| Implementation complexity | 3/5 |
| Time required | 0.5 day domain labeling + 0.5 day implementation |
| Applicable | ✅ multiply the weight into the loss |

5. Token-level Loss Weighting / Focal Loss

Concept

Instead of giving every response token the same weight, give higher weight to the tokens the model finds hard to predict.

Focal Loss implementation

# train/focal_loss.py
import torch
import torch.nn.functional as F

def focal_cross_entropy(logits, targets, gamma=2.0, ignore_index=-1):
    """
    Focal loss: down-weight easy tokens, up-weight hard tokens.
    Lin et al., "Focal Loss for Dense Object Detection", ICCV 2017.
    """
    # Standard CE
    ce_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=ignore_index,
        reduction='none'
    )
    
    # p_t = probability of correct class
    log_pt = -ce_loss
    pt = torch.exp(log_pt)
    
    # Focal weight: (1 - p_t)^gamma
    focal_weight = (1 - pt) ** gamma
    loss = focal_weight * ce_loss
    
    # Mask ignored tokens
    mask = (targets.reshape(-1) != ignore_index)
    loss = loss[mask].mean()
    
    return loss

Applying it: modify trainer.py

# In trainer.py's _compute_loss
# before: F.cross_entropy(logits, targets, ignore_index=-1)
# after:  focal_cross_entropy(logits, targets, gamma=2.0, ignore_index=-1)
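One quick sanity check before trusting the swap: in the limit gamma → 0, the focal weight (1 − p_t)^γ is 1, so the loss must reduce exactly to plain cross-entropy. A self-contained restatement for the check (not the project file; ignore_index masking omitted for brevity):

```python
import torch
import torch.nn.functional as F

def focal_ce(logits, targets, gamma):
    """Compact focal CE: (1 - p_t)^gamma x per-token CE, averaged."""
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    )
    pt = torch.exp(-ce)  # probability of the correct token
    return ((1 - pt) ** gamma * ce).mean()
```

With gamma > 0 the value can only shrink, since every focal weight lies in (0, 1].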

Observed effect

  • Published results for focal loss in SFT are limited
  • SelectIT (Liu et al., 2024): token-level selection improved IFEval by +2~4%
  • gamma=2.0 is the common default, but gamma=1.0~1.5 is recommended for SFT (stronger values can destabilize training)

Evaluation

| Item | Value |
|---|---|
| Expected effect | ko_ifeval +1~3%, negligible effect on repetition rate |
| Implementation complexity | 1/5 |
| Time required | 2 h |
| Applicable | ✅ swap the loss function only |

6. Data Augmentation for Korean

Method A: Self-Paraphrase

# Regenerate responses with the current model (or a larger one)
# Keep the instruction, diversify only the output
# → obtain N different responses per instruction

Method B: Back-translation

# Translate high-quality English data (Alpaca, Dolly, OpenAssistant) → Korean
# Options that run locally on this server:
# 1) NLLB-200 3.3B (Meta): offline translation, fits on a single GPU
# 2) Generate translations directly with a Korean-specialized model

# pip install transformers
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-3.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cuda:7")  # use one spare GPU

def translate_en_to_ko(text):
    inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True).to("cuda:7")
    outputs = model.generate(
        **inputs,
        # lang_code_to_id was removed in recent transformers; look the token id up directly
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("kor_Hang"),
        max_new_tokens=1024,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

Evaluation

| Item | Value |
|---|---|
| Expected effect | more diversity; ko_ifeval +2~4% (with good source data) |
| Implementation complexity | 3/5 |
| Time required | 1 day translation pipeline + 1~2 days translation runs |
| Applicable | ✅ runs in parallel on a spare GPU |

7. Training Stability Improvements

FP8 training caveats

Current setup: MXFP8 + BF16. Main points to watch:

  1. Preventing loss spikes

    • max_grad_norm: 1.0 already applied ✅
    • LR 2e-5 is conservative ✅
    • Addition: monitor the gradient norm and cut the LR automatically
    # Halve the LR after 3 consecutive steps with grad_norm above the threshold
    if grad_norm > 5.0:
        spike_count += 1
        if spike_count >= 3:
            for pg in optimizer.param_groups:
                pg['lr'] *= 0.5
            spike_count = 0
    else:
        spike_count = 0

  2. Weight decay

    • Currently 0.01 (appropriate)
    • 0.01~0.05 is the standard range for SFT

  3. Dropout

    • Currently dropout: 0.0 — adding 0.05~0.1 is recommended for SFT
    • Directly counters overfitting
    dropout: 0.05  # configs/korean_1b_sft.yaml

  4. FP8 amax settings

    • fp8_amax_history_len: 16 + fp8_amax_compute_algo: "max" — appropriate
    • MXFP8 is more stable than DelayedScaling

Evaluation

| Item | Value |
|---|---|
| Expected effect | stability ↑, overfitting -10~20%, repetition rate -5~10% |
| Implementation complexity | 1/5 |
| Time required | 1 h |
| Applicable | ✅ config changes only |

8. Evaluation-Driven Data Selection (Self-Play → ORPO)

Pipeline

1. Generate N=4 responses per instruction with the current SFT model
2. Pick best/worst via automatic evaluation (repetition rate, length, coherence)
3. Build (chosen, rejected) pairs
4. Train with ORPO/DPO (train/orpo.py already exists!)

Concrete steps

# scripts/generate_self_play.py
def generate_candidates(model, tokenizer, instructions, n=4, temp=0.8):
    pairs = []
    for inst in instructions:
        responses = []
        for _ in range(n):
            out = model.generate(inst, temperature=temp, max_new_tokens=1024)
            score = auto_evaluate(out)  # repetition rate, length, coherence
            responses.append((out, score))

        responses.sort(key=lambda x: x[1], reverse=True)
        pairs.append({
            "instruction": inst,
            "chosen": responses[0][0],
            "rejected": responses[-1][0],
        })
    return pairs
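auto_evaluate above is an assumed helper; a minimal heuristic version (character-trigram uniqueness plus a length term; the 0.7/0.3 weights and target_len are guesses to tune against held-out human judgments):

```python
def auto_evaluate(text, target_len=500):
    """Heuristic response score in [0, 1]: higher for varied text near the target length."""
    chars = list(text)
    if len(chars) > 10:
        trigrams = [tuple(chars[j:j + 3]) for j in range(len(chars) - 2)]
        rep_score = len(set(trigrams)) / len(trigrams)  # low when the model loops
    else:
        rep_score = 1.0
    len_score = min(len(text) / target_len, 1.0)
    return 0.7 * rep_score + 0.3 * len_score
```

A coherence term (e.g. perplexity under a reference model) would slot in as a third weighted component.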

Evaluation

| Item | Value |
|---|---|
| Expected effect | repetition rate -40~60%, ko_ifeval +3~5% |
| Implementation complexity | 4/5 |
| Time required | 1~2 days generation + 0.5 day ORPO training |
| Applicable | ✅ orpo.py already exists |

Summary Comparison

| Technique | Effect (repetition) | Effect (ko_ifeval) | Complexity | Time | Priority |
|---|---|---|---|---|---|
| 1. Curriculum Learning | – | +1~2% | 2/5 | 5 h | medium-term |
| 2. Less is More | -30~50% | +3~5% | 2/5 | 5 h | immediate |
| 3. Packing | (speed only) | (no change) | 3/5 | 1~2 days | medium-term |
| 4. Multi-task Weighting | – | +1~3% | 3/5 | 1 day | medium-term |
| 5. Focal Loss | – | +1~3% | 1/5 | 2 h | immediate |
| 6. Data Augmentation | – | +2~4% | 3/5 | 2~3 days | medium-term |
| 7. Training stability (dropout) | -5~10% | – | 1/5 | 1 h | immediate |
| 8. Self-Play → ORPO | -40~60% | +3~5% | 4/5 | 2~3 days | medium-term |

🚀 Top 3 to Apply Immediately

#1: "Less is More" data filtering

  • Rationale: consistently validated by the LIMA and AlpaGasus papers; filtering 188k → 50~80k removes low-quality and repetitive samples
  • Expected effect: repetition rate -30~50%, ko_ifeval +3~5%
  • Time: 5 h (PPL computation + filtering script)
  • Risk: low (worst case, roll back to the full dataset)

#2: Focal loss

  • Rationale: focusing on hard tokens should improve instruction following; extremely simple to implement
  • Expected effect: ko_ifeval +1~3%
  • Time: 2 h (one new loss function)
  • Risk: very low (only gamma needs tuning)

#3: Add dropout (0.05)

  • Rationale: dropout=0.0 today risks overfitting; light dropout is standard for SFT
  • Expected effect: less overfitting, repetition rate -5~10%
  • Time: one config line
  • Risk: none

📅 Top 3 for the Medium Term

#1: Self-Play → ORPO (after SFT)

  • Rationale: preference training after SFT is the most effective lever on repetition; orpo.py is already implemented
  • Expected effect: repetition rate -40~60%, ko_ifeval +3~5%
  • Time: 2~3 days (generation + training)

#2: Sequence Packing

  • Rationale: 1.5~3× faster training; essential for the iteration experiments ahead
  • Expected effect: much shorter training runs
  • Time: 1~2 days

#3: Curriculum Learning + Data Augmentation (back-translation)

  • Rationale: improves data diversity and training efficiency at the same time
  • Expected effect: ko_ifeval +2~4%
  • Time: 3~4 days

Recommended Execution Order

Phase 1 (now, before SFT):
  1. Set dropout: 0.05
  2. Quality-filter the data (188k → 60~80k)
  3. Apply focal loss (gamma=1.5)
  → run SFT

Phase 2 (after SFT):
  4. Generate self-play data
  5. Train with ORPO

Phase 3 (next round):
  6. Implement packing (to speed up repeated experiments)
  7. Expand the data via back-translation
  8. Curriculum learning experiments