
SFT Data Quality Audit Report

Date: 2026-02-26
Data: data/sft/train.jsonl (159,125 samples)
Sources: 6 HuggingFace datasets (KOR-OpenOrca-Platypus-v3, kullm-v2, ko-alpaca-12k, korean_safe_conversation, evol-instruct-korean, kovast)


1. Basic data statistics

| Item | Value |
| --- | --- |
| Total samples | 159,125 |
| Mean output length | 608 chars |
| Median output length | 468 chars |
| Min / max output length | 10 / 7,393 chars |
| Duplicates (instruction+output) | 0 (dedup applied) |
| Duplicates (instruction only) | 0 |
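
Statistics like these can be reproduced with a few lines of standard-library Python. The following is a minimal sketch, not the script used for this audit: `output_length_stats` and `count_duplicates` are hypothetical helper names, and it assumes that lengths are counted in characters after stripping surrounding whitespace.

```python
import statistics


def output_length_stats(outputs):
    """Summarize output lengths (in characters) for a non-empty list of strings."""
    lengths = [len(text.strip()) for text in outputs]
    return {
        "count": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


def count_duplicates(samples):
    """Count samples whose (instruction, output) pair was already seen."""
    seen = set()
    duplicates = 0
    for sample in samples:
        key = (
            (sample.get("instruction") or "").strip(),
            (sample.get("output") or "").strip(),
        )
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates
```

Changing the key to the instruction alone would give the "instruction only" duplicate count.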

Output length distribution

| Range (chars) | Count | Share |
| --- | --- | --- |
| < 50 | 16,519 | 10.4% |
| 50-100 | 11,112 | 7.0% |
| 100-500 | 55,550 | 34.9% |
| 500-1000 | 47,023 | 29.6% |
| 1000-2000 | 23,731 | 14.9% |
| 2000-4000 | 5,049 | 3.2% |
| > 4000 | 141 | 0.1% |

2. Quality issues found

🔴 Critical (possible direct cause of the repetition loops)

Issue 1: Special-token contamination (</s> in 113 samples)

  • 113 samples contain the string </s> as a literal inside the output text
  • Impact: during training the chat template appends the EOS token ({output}</s>), so a </s> inside the output teaches a premature EOS. The model then either fails to emit EOS properly or learns to keep generating after EOS
  • Other: <|endoftext|> in 1 sample, EOS in 44 samples, [PAD] in 3 samples
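
Counts of this kind can be obtained with a plain substring scan over the outputs. The sketch below is an assumption about how such a check could be run, not the original audit script. `count_token_hits` is a hypothetical helper, and the bare word EOS is matched with a word-boundary regex so that it does not fire inside longer words.

```python
import re

SPECIAL_TOKENS = ["</s>", "<|endoftext|>", "[PAD]"]
EOS_WORD = re.compile(r"\bEOS\b")


def count_token_hits(outputs):
    """Return, per token, the number of outputs that contain it at least once."""
    hits = {tok: 0 for tok in SPECIAL_TOKENS}
    hits["EOS"] = 0
    for text in outputs:
        for tok in SPECIAL_TOKENS:
            if tok in text:
                hits[tok] += 1
        if EOS_WORD.search(text):
            hits["EOS"] += 1
    return hits
```

Each output is counted once per token, no matter how many times the token occurs in it, which matches the per-sample counts reported above.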

Issue 2: Question/answer markers inside outputs (about 550 samples)

  • "질문:" (question) in 503 samples, "답변:" (answer) in 430 samples (inside the output)
  • "### 답변:" in 141 samples, "### 질문:" in 10 samples
  • "### Instruction:" in 4 samples, "### Response:" in 2 samples
  • Impact: the model learns a "질문:" → "답변:" pattern in the middle of an answer and starts generating Q/A loops on its own
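
The per-marker counts overlap: one sample can contain both "질문:" and "답변:", and "### 답변:" also contains "답변:". That is why the number of affected samples (about 550) is smaller than the sum of the individual counts. A minimal sketch of both views, using hypothetical helper names:

```python
MARKERS = [
    "질문:",
    "답변:",
    "### 답변:",
    "### 질문:",
    "### Instruction:",
    "### Response:",
]


def marker_counts(outputs):
    """Per-marker count of outputs containing that marker (counts may overlap)."""
    return {m: sum(1 for text in outputs if m in text) for m in MARKERS}


def samples_with_any_marker(outputs):
    """Number of distinct outputs containing at least one marker."""
    return sum(1 for text in outputs if any(m in text for m in MARKERS))
```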

Issue 3: Self-repetition patterns (57 samples)

  • 57 outputs in which 50% or more of the 10-grams are repeats
  • Impact: the model learns the repetitive generation pattern directly
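
The report does not spell out how the 10-gram repetition rate was computed. One plausible reading, sketched below as an assumption, is the fraction of whitespace-token n-grams that are not unique. `repeated_ngram_ratio` is a hypothetical name.

```python
def repeated_ngram_ratio(text, n=10):
    """Fraction of word n-grams that duplicate an n-gram seen elsewhere in the text.

    Returns 0.0 when the text has fewer than n words.
    """
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def is_self_repeating(text, n=10, threshold=0.5):
    """Flag outputs whose repeated n-gram ratio reaches the threshold."""
    return repeated_ngram_ratio(text, n) >= threshold
```

Whitespace splitting is a coarse tokenizer for Korean. A character-level or tokenizer-level variant would flag a somewhat different set of samples.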

🟡 Moderate (quality degradation)

Issue 4: Short outputs (16,519 samples, 10.4%)

  • Outputs under 50 characters make up 10.4% of the data
  • 8,833 of them are under 30 characters
  • Impact: the model becomes worse at producing sufficiently long answers. It learns to emit EOS where an answer should end quickly, but for most questions such an answer is too short → EOS not generated → generation continues → loop

Issue 5: Low Korean ratio (21,774 samples, 13.7%)

  • Samples in which Hangul characters make up less than 30% of the text (a mix of code, English, Chinese, and so on)
  • The filter in prepare_sft_data.py already applies a 30% threshold, so this may be an ordering problem: the filter may not cover samples added by weighted sampling
  • Impact: lower consistency as a Korean LLM
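
The Hangul ratio can be measured the same way the filter in section 4 does it, by counting code points in the precomposed Hangul syllable range U+AC00 to U+D7A3. In the sketch below, `hangul_ratio` is a hypothetical helper name. The denominator includes spaces, digits, and punctuation, so code-heavy answers score low even when the prose is Korean.

```python
def hangul_ratio(text):
    """Share of characters that are precomposed Hangul syllables (U+AC00..U+D7A3)."""
    if not text:
        return 0.0
    hangul = sum(1 for ch in text if "\uac00" <= ch <= "\ud7a3")
    return hangul / len(text)


def is_low_korean(text, threshold=0.3):
    """Flag samples below the Hangul-ratio threshold (30% in the current pipeline)."""
    return hangul_ratio(text) < threshold
```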

3. Hypothesis verification results

Hypothesis A: Q/A loop patterns exist in outputs → ⚠️ partially confirmed

  • Exact pattern "### 질문: ... ### 답변:": 4 samples (0.003%)
  • Informal pattern "질문: ... 답변:": 119 samples (0.07%)
  • Outputs that merely contain "질문:" or "답변:": ~550 samples
  • Conclusion: exact loop patterns are extremely rare, but several hundred samples contain the "질문/답변" keywords in the output. This alone is unlikely to be the main cause of the loops.

Hypothesis B: Short outputs → ✅ likely cause

  • The 16,519 samples (10.4%) under 50 characters form a substantial part of the output distribution
  • The model may fail to emit EOS after a short answer and keep generating tokens
  • Combined with the </s> token contamination (113 samples), the model does not learn the EOS boundary accurately

Hypothesis C: Quality varies by source → ✅ confirmed (indirectly)

  • Per prepare_sft_data.py: KOR-OpenOrca-Platypus-v3 is upsampled 5x and kovast is downsampled to 0.8x
  • The weights are very aggressive (5.0x means the same data is repeated 5 times, which risks overfitting)
  • kovast keeps only the first turn of each multi-turn conversation → missing context can produce odd outputs
  • Conclusion: the 5x-upsampled OpenOrca-Platypus dominates the training data. Any problem in that source directly affects the whole model.

🔍 Additional finding: the likely real cause of the repetition loops

The core problem is a failure to learn EOS. Contributing causes:

  1. Literal </s> inside outputs (113 samples) → confusion about the EOS boundary
  2. Short outputs (10.4%) → unstable learning of EOS timing
  3. Training on 159K samples for 5000 steps → each sample is seen fewer than 1.6 times on average → possible underfitting
  4. No repetition_penalty at inference (the eval code sets only top_p/top_k, not repetition_penalty)

4. Data filtering code that can be applied immediately

"""
enhanced_quality_filter.py: enhanced quality filter for SFT data
Usage: python enhanced_quality_filter.py data/sft/train.jsonl data/sft/train_cleaned.jsonl
"""
import json
import re
import sys

def enhanced_filter(sample: dict) -> bool:
    """Return True if the sample passes every quality check."""
    # `or ""` guards against explicit null values in the JSONL.
    instruction = (sample.get("instruction") or "").strip()
    output = (sample.get("output") or "").strip()

    # 1. Basic length filter (tightened)
    if len(output) < 80:  # raised from 50 to 80
        return False
    if len(output) > 3000:  # lowered from 4000 to 3000
        return False
    if len(instruction) < 15:
        return False

    # 2. Drop samples that contain special tokens
    BAD_TOKENS = ["</s>", "<|endoftext|>", "<|end|>", "<s>", "<pad>", "[PAD]", "<unk>"]
    for tok in BAD_TOKENS:
        if tok in output:
            return False

    # 3. Drop samples contaminated with Q/A markers
    QA_PATTERNS = [
        r"###\s*(질문|답변|Instruction|Response|Input|Output)\s*:",
        r"^(질문|답변)\s*:",  # "질문:" (question) or "답변:" (answer) at the start of a line
    ]
    for pat in QA_PATTERNS:
        if re.search(pat, output, re.MULTILINE):
            return False

    # 4. Stricter Korean ratio (30% → 40%), counting Hangul syllables only
    ko_chars = sum(1 for c in output if '\uac00' <= c <= '\ud7a3')
    if len(output) > 0 and ko_chars / len(output) < 0.4:
        return False

    # 5. N-gram repetition filter (tightened)
    words = output.split()
    if len(words) > 15:
        # Check for repeated 5-grams
        fivegrams = [tuple(words[i:i+5]) for i in range(len(words) - 4)]
        if fivegrams:
            unique_ratio = len(set(fivegrams)) / len(fivegrams)
            if unique_ratio < 0.7:  # drop if more than 30% of the 5-grams are repeats
                return False

    # 6. Drop samples that contain the literal word "EOS"
    if re.search(r'\bEOS\b', output):
        return False

    return True


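def main():
    # Print the usage string from the module docstring if arguments are missing.
    if len(sys.argv) != 3:
        sys.exit(__doc__)
    input_path, output_path = sys.argv[1], sys.argv[2]

    kept, dropped = 0, 0
    # Read and write UTF-8 explicitly so Korean text survives on any platform.
    with open(input_path, encoding="utf-8") as fin, \
            open(output_path, "w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            sample = json.loads(line)
            if enhanced_filter(sample):
                fout.write(line)
                kept += 1
            else:
                dropped += 1

    total = kept + dropped
    drop_rate = dropped / total * 100 if total else 0.0  # avoid division by zero
    print(f"Kept: {kept:,} | Dropped: {dropped:,} | Drop rate: {drop_rate:.1f}%")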


if __name__ == "__main__":
    main()

5. Recommended data pipeline improvements

5.1 Re-balancing the weights

The current weights are too aggressive. Recommended changes:

DATASET_WEIGHTS = {
    "KOR-OpenOrca-Platypus-v3": 2.0,   # 5.0 → 2.0 (prevent overfitting)
    "kullm-v2":                 1.0,
    "ko-alpaca-12k":            1.5,   # 2.0 → 1.5
    "korean_safe_conversation": 1.0,   # 1.5 → 1.0
    "evol-instruct-korean":     1.5,
    "kovast":                   0.5,   # 0.8 → 0.5 (quality issues)
}
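
To see how much a weight change shifts the training mix, it helps to compute each source's share after weighting. The report does not list per-source sample counts, so the sizes in this sketch are placeholders chosen purely for illustration, and `effective_share` is a hypothetical helper. It also assumes that a weight simply multiplies the number of samples drawn from a source.

```python
def effective_share(sizes, weights):
    """Share of the weighted training mix contributed by each source.

    sizes:   mapping of source name -> number of raw samples (must not be empty)
    weights: mapping of source name -> sampling weight (defaults to 1.0)
    """
    weighted = {name: sizes[name] * weights.get(name, 1.0) for name in sizes}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


# Placeholder sizes, NOT taken from the audited dataset.
example_sizes = {"source_a": 10_000, "source_b": 10_000}

print(effective_share(example_sizes, {"source_a": 5.0, "source_b": 1.0}))
print(effective_share(example_sizes, {"source_a": 2.0, "source_b": 1.0}))
```

With two equally sized sources, a 5.0 weight gives the upsampled source five sixths of the mix, while a 2.0 weight gives it two thirds.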

5.2 Training configuration changes

# Current: 5000 steps, batch 4×8×2 = 64
# 159K samples / 64 = 2,486 steps/epoch → currently about 2 epochs

# Recommended: 3 epochs over the ~120K samples left after filtering
MAX_STEPS=6000
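
The step and epoch figures follow from simple arithmetic on the effective batch size. The sketch below uses hypothetical helper names and assumes that the effective batch of 64 stated in the comment above is the number of samples consumed per optimizer step.

```python
import math


def epochs_seen(max_steps, effective_batch, num_samples):
    """Average number of passes over the data after max_steps optimizer steps."""
    return max_steps * effective_batch / num_samples


def steps_for_epochs(target_epochs, effective_batch, num_samples):
    """Smallest step count that covers target_epochs full passes."""
    return math.ceil(target_epochs * num_samples / effective_batch)


current = epochs_seen(5000, 64, 159_125)        # about 2.01 passes
recommended = steps_for_epochs(3, 64, 120_000)  # 5625 steps
```

Under these assumptions, 3 epochs over 120K samples need 5,625 steps, so MAX_STEPS=6000 leaves a small margin (about 3.2 epochs).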

5.3 Adding repetition_penalty at inference

# Change in eval/comprehensive_eval.py
repetition_penalty = 1.2  # suppress repetition
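
For intuition about what this setting does, the following is a minimal pure-Python sketch of the usual repetition-penalty rule (the one introduced with CTRL and exposed by common generation libraries as `repetition_penalty`). It is an illustration only, not the code in eval/comprehensive_eval.py, and `apply_repetition_penalty` is a hypothetical name. Logits of tokens that were already generated are divided by the penalty when positive and multiplied by it when negative, so they become less likely either way.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of logits with already-generated tokens made less likely.

    logits:        list of floats, one per vocabulary entry
    generated_ids: token ids produced so far (duplicates are penalized once)
    """
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both lower it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted
```

A penalty of 1.0 leaves the logits unchanged. Larger values suppress repetition more strongly, at the cost of also discouraging legitimate reuse of words.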

6. Recommended high-quality datasets (HuggingFace)

| Dataset | Hugging Face ID | Description | Estimated size |
| --- | --- | --- | --- |
| Open-Orca Korean | kyujinpy/KOR-OpenOrca-Platypus-v3 | Already in use | - |
| ShareGPT Korean | junelee/sharegpt_deepl_ko | Korean translation of ShareGPT | ~90K |
| KoAlpaca v1.1 | beomi/KoAlpaca-v1.1a | High-quality Korean Alpaca | ~21K |
| LIMA Korean | HAERAE-HUB/KMMLU | Korean benchmark (for evaluation) | - |
| Korean HC3 | heegyu/korean_chatgpt_corpus | ChatGPT Korean conversations | ~12K |
| Orca DPO Korean | kyujinpy/orca_dpo_pairs_ko | DPO pairs (usable for SFT+DPO) | ~12K |
| OpenHermes 2.5 Ko | maywell/ko_Ultrafeedback_binarized | Korean Ultrafeedback | ~60K |
| KOpen-platypus | kyujinpy/KOpen-platypus | Korean Platypus | ~25K |

Most recommended additions:

  1. junelee/sharegpt_deepl_ko: multi-turn conversations on varied topics, with sufficiently long outputs
  2. heegyu/korean_chatgpt_corpus: ChatGPT-quality Korean answers
  3. beomi/KoAlpaca-v1.1a: validated Korean instruction data

7. Summary: immediate action items

| Priority | Action | Expected effect |
| --- | --- | --- |
| 🔴 P0 | </s>, <\|endoftext\|> … | … |
| 🔴 P0 | Raise the minimum output length to 80 characters | Prevents EOS from going unlearned because of short answers |
| 🔴 P0 | Add repetition_penalty=1.2 at inference | Immediately mitigates repetition loops |
| 🟡 P1 | Remove samples containing Q/A markers (~550) | Prevents learning of self-generated Q/A loops |
| 🟡 P1 | OpenOrca weight 5.0 → 2.0 | Prevents overfitting, preserves diversity |
| 🟡 P1 | Tighten the Korean ratio filter to 40% | Improves Korean consistency |
| 🟢 P2 | Collect additional high-quality datasets | Improves overall quality |
| 🟢 P2 | Tighten the self-repetition filter (5-gram, 70% threshold) | Blocks repetition patterns at the source |

Expected data after filtering: ~120,000-130,000 samples (18-25% removed relative to the current set)