# SFT Data Quality Audit Report
**Date:** 2026-02-26
**Data:** `data/sft/train.jsonl` (159,125 samples)
**Sources:** 6 HuggingFace datasets (KOR-OpenOrca-Platypus-v3, kullm-v2, ko-alpaca-12k, korean_safe_conversation, evol-instruct-korean, kovast)
---
## 1. Basic Data Statistics
| Item | Value |
|------|-----|
| Total samples | 159,125 |
| Mean output length | 608 chars |
| Median output length | 468 chars |
| Min / max output length | 10 / 7,393 chars |
| Duplicates (instruction+output) | 0 (dedup applied) |
| Duplicates (instruction only) | 0 |
### Output length distribution
| Range | Count | Share |
|------|------|------|
| < 50 chars | 16,519 | 10.4% |
| 50-100 | 11,112 | 7.0% |
| 100-500 | 55,550 | 34.9% |
| 500-1000 | 47,023 | 29.6% |
| 1000-2000 | 23,731 | 14.9% |
| 2000-4000 | 5,049 | 3.2% |
| > 4000 | 141 | 0.1% |
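
The distribution above can be recomputed from the raw outputs with a few lines of Python. This is a minimal sketch, not the script used for the audit: the function names are hypothetical, and the handling of lengths that fall exactly on a bucket edge is an assumption, since the table does not say which side they belong to.

```python
from bisect import bisect_right
from collections import Counter

# Bucket edges taken from the table above. Assumption: a length equal to an
# edge falls into the higher bucket (the report does not specify this).
EDGES = [50, 100, 500, 1000, 2000, 4000]
LABELS = ["< 50", "50-100", "100-500", "500-1000",
          "1000-2000", "2000-4000", "> 4000"]


def bucket_label(length: int) -> str:
    """Return the table bucket for an output length in characters."""
    return LABELS[bisect_right(EDGES, length)]


def length_distribution(outputs):
    """Return {label: (count, percent)} for an iterable of output strings."""
    counts = Counter(bucket_label(len(text)) for text in outputs)
    total = sum(counts.values())
    return {
        label: (counts[label], 100.0 * counts[label] / total if total else 0.0)
        for label in LABELS
    }
```

To run it on the training file, feed it `json.loads(line)["output"]` for each line of `data/sft/train.jsonl`.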
---
## 2. Quality Issues Found
### 🔴 Critical (possible direct cause of repetition loops)
#### Issue 1: Special-token contamination (`</s>`, 113 samples)
- 113 samples contain the literal string `</s>` inside the output text
- **Impact:** During training the chat template appends the EOS token as `{output}</s>`, so a `</s>` inside the output teaches the model a premature EOS. The model then either fails to emit EOS at the right place or learns to keep generating after EOS
- Other literals: `<|endoftext|>` in 1 sample, `EOS` in 44 samples, `[PAD]` in 3 samples
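
A minimal sketch of how these literals could be counted. The matching rules are assumptions, since the report does not state how the counts were produced: plain substring search for the bracketed tokens, and a standalone ASCII word match for `EOS`. The helper names are hypothetical.

```python
import re
from collections import Counter

# Literals reported above. Assumption: substring search for the bracketed
# tokens, and a standalone-word match for "EOS".
SUBSTRING_TOKENS = ["</s>", "<|endoftext|>", "[PAD]"]
# re.ASCII makes \b an ASCII-only word boundary, so "EOS" directly followed
# by a Hangul particle still counts as a standalone word.
EOS_WORD = re.compile(r"\bEOS\b", re.ASCII)


def special_token_hits(output: str) -> set:
    """Return the set of special-token literals found in one output."""
    hits = {tok for tok in SUBSTRING_TOKENS if tok in output}
    if EOS_WORD.search(output):
        hits.add("EOS")
    return hits


def count_contaminated(outputs):
    """Return (per-token sample counts, number of contaminated samples)."""
    per_token = Counter()
    contaminated = 0
    for text in outputs:
        hits = special_token_hits(text)
        per_token.update(hits)
        contaminated += bool(hits)
    return per_token, contaminated
```

Because one sample can contain several literals, the per-token counts may add up to more than the number of contaminated samples.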
#### Issue 2: Question/answer markers inside outputs (about 550 samples)
- `"질문:"` ("Question:") in 503 samples, `"답변:"` ("Answer:") in 430 samples (inside the output)
- `"### 답변:"` in 141 samples, `"### 질문:"` in 10 samples
- `"### Instruction:"` in 4 samples, `"### Response:"` in 2 samples
- **Impact:** The model learns the "질문:" → "답변:" pattern in the middle of an answer and starts generating Q/A loops on its own
#### Issue 3: Self-repetition patterns (57 samples)
- 57 outputs in which 50% or more of the 10-grams are repeats
- **Impact:** The model learns repetitive generation directly from these samples
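
The report does not define how the 10-gram repetition share was measured. One plausible reading, sketched below under that assumption, is the fraction of whitespace-token 10-grams that duplicate an earlier 10-gram in the same output. The function names are hypothetical.

```python
def repeated_ngram_fraction(text: str, n: int = 10) -> float:
    """Fraction of word-level n-grams that duplicate another n-gram.

    Assumption: tokens are whitespace-separated words, and the fraction is
    1 - (unique n-grams / total n-grams).
    """
    words = text.split()
    total = len(words) - n + 1
    if total <= 0:
        return 0.0  # too short to contain a single n-gram
    ngrams = [tuple(words[i:i + n]) for i in range(total)]
    return 1.0 - len(set(ngrams)) / total


def is_self_repetitive(text: str, n: int = 10, threshold: float = 0.5) -> bool:
    """Flag outputs whose repeated n-gram fraction reaches the threshold."""
    return repeated_ngram_fraction(text, n) >= threshold
```

An output that cycles through the same five words is flagged, while ordinary prose with no repeated 10-word span scores 0.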
### 🟡 Medium (quality degradation)
#### Issue 4: Short outputs (16,519 samples, 10.4%)
- Outputs under 50 characters make up 10.4% of the dataset
- 8,833 samples are under 30 characters
- **Impact:** Weakens the model's ability to produce sufficiently long answers. The model does learn EOS from answers that should end quickly, but for most questions such a short answer is inadequate. The suspected chain is: too-short answer → EOS not generated → generation continues → loop
#### Issue 5: Low Korean ratio (21,774 samples, 13.7%)
- Samples in which Hangul characters make up less than 30% of the text (mixed with code, English, Chinese, and so on)
- The filter in `prepare_sft_data.py` already applies a 30% threshold, so there may be an ordering problem between that filter and the weighted sampling step
- **Impact:** Reduces consistency as a Korean LLM
---
## 3. Hypothesis Verification Results
### Hypothesis A: Q/A loop patterns exist in outputs → ⚠️ Partially confirmed
- Exact `### 질문: ... ### 답변:` pattern: **4 samples** (0.003%)
- Informal `질문: ... 답변:` pattern: **119 samples** (0.07%)
- Plain "질문:" or "답변:" anywhere in the output: **~550 samples**
- **Conclusion:** The exact loop pattern is extremely rare, but several hundred samples contain the "질문/답변" keywords in the output. This alone is unlikely to be the main cause of the loops.
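
The three tiers in this hypothesis can be told apart with regular expressions. The sketch below is an assumed reconstruction of those tiers from the bullet list, not the audit's own code; the tier names and `qa_tier` are hypothetical.

```python
import re

# Tier definitions are assumptions reconstructed from the bullets above.
# "질문" means "question" and "답변" means "answer".
EXACT = re.compile(r"###\s*질문\s*:.*?###\s*답변\s*:", re.DOTALL)
INFORMAL = re.compile(r"질문\s*:.*?답변\s*:", re.DOTALL)
KEYWORD = re.compile(r"(질문|답변)\s*:")


def qa_tier(output: str) -> str:
    """Classify an output by the strongest Q/A marker pattern it contains."""
    if EXACT.search(output):
        return "exact"      # "### 질문: ... ### 답변:"
    if INFORMAL.search(output):
        return "informal"   # "질문: ... 답변:" in that order
    if KEYWORD.search(output):
        return "keyword"    # only one marker, or markers in reverse order
    return "none"
```

Counting the result of `qa_tier` over all outputs gives one number per tier, with each sample counted only in its strongest tier.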
### Hypothesis B: Short outputs → ✅ Likely cause
- The 16,519 samples under 50 characters (10.4%) are a substantial part of the output distribution
- The model may fail to emit EOS after a short answer and keep generating tokens
- **Especially in combination with the `</s>` token contamination (113 samples):** the model does not learn the EOS boundary precisely
### Hypothesis C: Quality varies by source → ✅ Confirmed (indirectly)
- Per `prepare_sft_data.py`: KOR-OpenOrca-Platypus-v3 is **upsampled 5x** and kovast is **downsampled to 0.8x**
- The weights are very aggressive (5.0x means the same data is repeated 5 times, which risks overfitting)
- kovast keeps only the first turn of each multi-turn conversation, so outputs can look odd for lack of context
- **Conclusion:** The 5x-upsampled OpenOrca-Platypus dominates the training data. Any problem in that source directly affects the whole model.
### 🔍 Additional finding: estimated real cause of the repetition loops
**The core problem is that EOS is not learned properly.** The contributing factors:
1. Literal `</s>` inside outputs (113 samples) → confusion about the EOS boundary
2. Short outputs (10.4%) → unstable learning of when to emit EOS
3. Training 159K samples for 5,000 steps → at the effective batch size of 64 given in section 5.2, each sample is seen only about 2 times (5,000 × 64 / 159,125 ≈ 2.0 epochs) → possible underfitting
4. **No repetition_penalty at inference** (the eval code sets only top_p/top_k and has no repetition_penalty)
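
The epoch count behind factor 3 is simple arithmetic. The sketch below assumes the effective batch size of 64 stated in section 5.2; `epochs_seen` is a hypothetical helper name.

```python
def epochs_seen(steps: int, effective_batch: int, num_samples: int) -> float:
    """Average number of times each sample is seen during training."""
    return steps * effective_batch / num_samples


# 5,000 steps at an effective batch of 64 over 159,125 samples
print(f"{epochs_seen(5000, 64, 159_125):.2f}")  # 2.01
```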
---
## 4. Data Filtering Code Ready to Apply
```python
"""
enhanced_quality_filter.py โ€” SFT ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ ๊ฐ•ํ™” ํ•„ํ„ฐ
Usage: python enhanced_quality_filter.py data/sft/train.jsonl data/sft/train_cleaned.jsonl
"""
import json
import re
import sys
def enhanced_filter(sample: dict) -> bool:
instruction = sample.get("instruction", "").strip()
output = sample.get("output", "").strip()
# 1. ๊ธฐ๋ณธ ๊ธธ์ด ํ•„ํ„ฐ (๊ฐ•ํ™”)
if len(output) < 80: # 50 โ†’ 80์œผ๋กœ ์ƒํ–ฅ
return False
if len(output) > 3000: # 4000 โ†’ 3000์œผ๋กœ ํ•˜ํ–ฅ
return False
if len(instruction) < 15:
return False
# 2. ํŠน์ˆ˜ ํ† ํฐ ์ œ๊ฑฐ
BAD_TOKENS = ["</s>", "<|endoftext|>", "<|end|>", "<s>", "<pad>", "[PAD]", "<unk>"]
for tok in BAD_TOKENS:
if tok in output:
return False
# 3. Q/A ๋งˆ์ปค ์˜ค์—ผ ์ œ๊ฑฐ
QA_PATTERNS = [
r"###\s*(์งˆ๋ฌธ|๋‹ต๋ณ€|Instruction|Response|Input|Output)\s*:",
r"^(์งˆ๋ฌธ|๋‹ต๋ณ€)\s*:", # ์ค„ ์‹œ์ž‘์—์„œ "์งˆ๋ฌธ:" "๋‹ต๋ณ€:"
]
for pat in QA_PATTERNS:
if re.search(pat, output, re.MULTILINE):
return False
# 4. ํ•œ๊ตญ์–ด ๋น„์œจ ๊ฐ•ํ™” (30% โ†’ 40%)
ko_chars = sum(1 for c in output if '\uac00' <= c <= '\ud7a3')
if len(output) > 0 and ko_chars / len(output) < 0.4:
return False
# 5. N-gram ๋ฐ˜๋ณต ํ•„ํ„ฐ (๊ฐ•ํ™”)
words = output.split()
if len(words) > 15:
# 5-gram ๋ฐ˜๋ณต ์ฒดํฌ
fivegrams = [tuple(words[i:i+5]) for i in range(len(words) - 4)]
if fivegrams:
unique_ratio = len(set(fivegrams)) / len(fivegrams)
if unique_ratio < 0.7: # 30% ์ด์ƒ ๋ฐ˜๋ณต์ด๋ฉด ์ œ๊ฑฐ
return False
# 6. "EOS" ๋ฆฌํ„ฐ๋Ÿด ์ œ๊ฑฐ
if re.search(r'\bEOS\b', output):
return False
return True
def main():
input_path = sys.argv[1]
output_path = sys.argv[2]
kept, dropped = 0, 0
with open(input_path) as fin, open(output_path, "w") as fout:
for line in fin:
sample = json.loads(line)
if enhanced_filter(sample):
fout.write(line)
kept += 1
else:
dropped += 1
print(f"Kept: {kept:,} | Dropped: {dropped:,} | Drop rate: {dropped/(kept+dropped)*100:.1f}%")
if __name__ == "__main__":
main()
```
---
## 5. Recommendations for Improving the Data Pipeline
### 5.1 Rebalance the weights
The current weights are too aggressive. Recommended changes:
```python
DATASET_WEIGHTS = {
    "KOR-OpenOrca-Platypus-v3": 2.0,   # 5.0 → 2.0 (prevent overfitting)
    "kullm-v2": 1.0,
    "ko-alpaca-12k": 1.5,              # 2.0 → 1.5
    "korean_safe_conversation": 1.0,   # 1.5 → 1.0
    "evol-instruct-korean": 1.5,
    "kovast": 0.5,                     # 0.8 → 0.5 (quality issues)
}
```
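
To make the effect of a weight concrete, here is a minimal sketch of one common way to apply a fractional sampling weight: repeat the source once per whole unit and add a random subset for the fractional part. This is an illustration under that assumption, not necessarily how `prepare_sft_data.py` implements it; `apply_weight` and its `seed` parameter are hypothetical.

```python
import random


def apply_weight(samples: list, weight: float, seed: int = 0) -> list:
    """Up- or down-sample a source by a fractional weight.

    Hypothetical helper: the whole part of the weight repeats every sample,
    the fractional part adds a random subset without replacement.
    """
    rng = random.Random(seed)
    whole = int(weight)
    frac = weight - whole
    out = samples * whole                      # full copies of the source
    extra = round(len(samples) * frac)         # size of the partial copy
    out.extend(rng.sample(samples, extra))     # partial copy, no duplicates
    return out
```

Under this scheme a weight of 5.0 shows the model every sample from that source five times per pass, while 2.0 shows it twice, which is why lowering the weight reduces the overfitting risk described in section 3.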
### 5.2 Adjust the training configuration
```bash
# Current: 5000 steps, batch 4×8×2 = 64
# 159K samples / 64 = 2,486 steps/epoch → currently about 2 epochs
# Recommended: 3 epochs on the ~120K samples left after filtering
MAX_STEPS=6000
```
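
The recommended step count can be checked against the expected post-filter dataset size. A small sketch, assuming the effective batch size of 64 from the comment above; `steps_for_epochs` is a hypothetical helper name.

```python
import math


def steps_for_epochs(num_samples: int, effective_batch: int, epochs: float) -> int:
    """Optimizer steps needed to see every sample `epochs` times."""
    return math.ceil(num_samples * epochs / effective_batch)


print(steps_for_epochs(120_000, 64, 3))  # 5625
print(steps_for_epochs(130_000, 64, 3))  # 6094
```

For the 120K to 130K samples expected after filtering, three epochs correspond to roughly 5,600 to 6,100 steps, so `MAX_STEPS=6000` sits inside that range.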
### 5.3 Add repetition_penalty at inference
```python
# Change in eval/comprehensive_eval.py
repetition_penalty = 1.2  # suppress repetition
```
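
The generation call in `eval/comprehensive_eval.py` is not shown in this report, so the following is only a sketch of what a repetition penalty typically does to the logits, assuming the CTRL-style rule used by Hugging Face's `repetition_penalty` generation parameter. The function name and the list-based logits are illustrative assumptions.

```python
def apply_repetition_penalty(logits: list, generated_ids: list,
                             penalty: float = 1.2) -> list:
    """Lower the score of every token that has already been generated.

    Assumed rule (CTRL-style): positive logits are divided by the penalty,
    negative logits are multiplied by it, so both become less likely.
    """
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out
```

With `penalty=1.2`, a previously generated token with logit 2.4 drops to 2.0 and one with logit -1.0 drops to -1.2, while tokens that have not been generated keep their scores. This dampens loops at decoding time but does not fix the EOS learning problem in the data.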
---
## 6. Recommended High-Quality Datasets (HuggingFace)
| Dataset | Repo ID | Description | Estimated size |
|----------|-----|------|-----------|
| Open-Orca Korean | `kyujinpy/KOR-OpenOrca-Platypus-v3` | Already in use | - |
| ShareGPT Korean | `junelee/sharegpt_deepl_ko` | Korean translation of ShareGPT | ~90K |
| KoAlpaca v1.1 | `beomi/KoAlpaca-v1.1a` | High-quality Korean Alpaca | ~21K |
| KMMLU | `HAERAE-HUB/KMMLU` | Korean benchmark (for evaluation) | - |
| Korean HC3 | `heegyu/korean_chatgpt_corpus` | Korean ChatGPT conversations | ~12K |
| Orca DPO Korean | `kyujinpy/orca_dpo_pairs_ko` | DPO pairs (usable for SFT+DPO) | ~12K |
| Ultrafeedback Korean | `maywell/ko_Ultrafeedback_binarized` | Korean Ultrafeedback | ~60K |
| KOpen-platypus | `kyujinpy/KOpen-platypus` | Korean Platypus | ~25K |
**Most recommended additions:**
1. `junelee/sharegpt_deepl_ko`: multi-turn conversations on diverse topics, with sufficiently long outputs
2. `heegyu/korean_chatgpt_corpus`: ChatGPT-quality Korean answers
3. `beomi/KoAlpaca-v1.1a`: validated Korean instruction data
---
## 7. Summary: Immediate Actions
| Priority | Action | Expected effect |
|----------|------|-----------|
| 🔴 P0 | Remove samples containing `</s>`, `<\|endoftext\|>`, `EOS`, or `[PAD]` (161 samples) | Removes the confusion in EOS learning |
| 🔴 P0 | Raise the minimum output length to 80 characters | Prevents EOS from being mislearned because of short answers |
| 🔴 P0 | Add `repetition_penalty=1.2` at inference | Immediately mitigates repetition loops |
| 🟡 P1 | Remove samples containing Q/A markers (~550 samples) | Prevents the model from learning self-generated Q/A loops |
| 🟡 P1 | OpenOrca weight 5.0 → 2.0 | Prevents overfitting and preserves diversity |
| 🟡 P1 | Tighten the Korean ratio filter to 40% | Improves Korean consistency |
| 🟢 P2 | Collect additional high-quality datasets | Improves overall quality |
| 🟢 P2 | Tighten the self-repetition filter (5-gram, 70% threshold) | Blocks repetition patterns at the source |
**Expected data after filtering:** ~120,000-130,000 samples (18-25% removed relative to the current set)