SFT Data Quality Audit Report
Date: 2026-02-26
Data: data/sft/train.jsonl (159,125 samples)
Sources: 6 HuggingFace datasets
(KOR-OpenOrca-Platypus-v3, kullm-v2, ko-alpaca-12k, korean_safe_conversation, evol-instruct-korean, kovast)
1. Basic Data Statistics
| Item | Value |
|---|---|
| Total samples | 159,125 |
| Mean output length | 608 chars |
| Median output length | 468 chars |
| Min / max output length | 10 / 7,393 chars |
| Duplicates (instruction+output) | 0 (dedup applied) |
| Duplicates (instruction only) | 0 |

Output length distribution
| Range | Samples | Share |
|---|---|---|
| < 50 chars | 16,519 | 10.4% |
| 50-100 | 11,112 | 7.0% |
| 100-500 | 55,550 | 34.9% |
| 500-1000 | 47,023 | 29.6% |
| 1000-2000 | 23,731 | 14.9% |
| 2000-4000 | 5,049 | 3.2% |
| > 4000 | 141 | 0.1% |
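
The distribution above could be reproduced with a small bucketing helper. This is a minimal sketch, not the script used for the audit: `length_histogram`, `EDGES`, and `LABELS` are illustrative names, and it assumes half-open buckets (a 50-char output lands in 50-100), which the report does not specify.

```python
from bisect import bisect_right
from collections import Counter

# Upper edges of the buckets in the table; the last bucket is open-ended.
EDGES = [50, 100, 500, 1000, 2000, 4000]
LABELS = ["< 50", "50-100", "100-500", "500-1000",
          "1000-2000", "2000-4000", "> 4000"]


def length_histogram(outputs):
    """Count outputs per length bucket, using half-open ranges [lo, hi)."""
    counts = Counter()
    for text in outputs:
        counts[LABELS[bisect_right(EDGES, len(text))]] += 1
    # Return every bucket, including empty ones, in table order.
    return {label: counts.get(label, 0) for label in LABELS}
```

Dividing each count by the total sample count gives the share column.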
2. Quality Issues Found
🔴 Critical (possible direct cause of the repetition loop)
Problem 1: Special-token contamination (`</s>`, 113 samples)
- 113 samples contain the string `</s>` as a literal inside the output text.
- Impact: during training the chat template appends `{output}</s>`, so a `</s>` inside the output teaches a premature EOS. The model may then fail to emit EOS properly, or learn to keep generating after EOS.
- Other: `<|endoftext|>` 1 sample, `EOS` 44 samples, `[PAD]` 3 samples.
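
The counts in Problem 1 could be gathered with a plain substring scan. The sketch below assumes each JSONL record has an `output` field (as in the filter code later in this report); `count_special_tokens` and `SPECIAL_STRINGS` are hypothetical names introduced here.

```python
from collections import Counter

# Literal strings the report found inside outputs.
SPECIAL_STRINGS = ["</s>", "<|endoftext|>", "[PAD]"]


def count_special_tokens(samples):
    """Return how many samples contain each special string in their output."""
    hits = Counter()
    for sample in samples:
        output = sample.get("output") or ""
        for tok in SPECIAL_STRINGS:
            if tok in output:
                hits[tok] += 1  # counted once per sample, not per occurrence
    return hits
```

A sample that contains two different special strings is counted once under each, so the per-token counts can add up to more than the number of contaminated samples.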
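Problem 2: Question/answer markers inside the output (about 550 samples)
- `"질문:"` ("Question:") 503 samples, `"답변:"` ("Answer:") 430 samples (inside the output)
- `"### 답변:"` 141 samples, `"### 질문:"` 10 samples
- `"### Instruction:"` 4 samples, `"### Response:"` 2 samples
- Impact: the model learns a "질문:" → "답변:" pattern in the middle of an answer and starts generating a Q/A loop on its own.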
Problem 3: Self-repetition pattern (57 samples)
- 57 outputs are at least 50% repeated when measured on 10-grams.
- Impact: the model learns repetitive generation directly from these samples.
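
One way to measure the repetition in Problem 3 is the share of n-grams that duplicate an earlier n-gram in the same output. This is a sketch under assumptions: the report does not say whether its 10-grams are word-level or character-level, so word-level is assumed here, and `repeated_ngram_ratio` is an illustrative name.

```python
def repeated_ngram_ratio(text, n=10):
    """Fraction of word n-grams that duplicate an earlier n-gram in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n words cannot repeat an n-gram
    return 1.0 - len(set(ngrams)) / len(ngrams)
```

Under this definition, an output would be flagged when `repeated_ngram_ratio(output) >= 0.5`.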
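🟡 Moderate (quality degradation)
Problem 4: Short outputs (16,519 samples, 10.4%)
- Outputs under 50 chars make up 10.4% of the dataset.
- 8,833 of them are under 30 chars.
- Impact: the model becomes less able to produce sufficiently long answers. It learns to emit EOS where an answer should end quickly, but on most questions the suspected chain is: answer too short → no EOS generated → generation continues → loop.
Problem 5: Low Korean ratio (21,774 samples, 13.7%)
- Samples whose Hangul character ratio is below 30% (mixed with code, English, Chinese, etc.).
- The filter in `prepare_sft_data.py` already applies the 30% threshold, so there may be an ordering problem in where it runs relative to weighted sampling.
- Impact: reduced consistency as a Korean LLM.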
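
The Hangul ratio in Problem 5 can be computed by counting characters in the Hangul Syllables block, the same range the filter code later in this report uses. A minimal sketch; `hangul_ratio` is an illustrative name, not a function from `prepare_sft_data.py`.

```python
def hangul_ratio(text):
    """Share of characters in the Hangul Syllables block (U+AC00 to U+D7A3)."""
    if not text:
        return 0.0
    hangul = sum(1 for ch in text if "\uac00" <= ch <= "\ud7a3")
    return hangul / len(text)
```

Spaces, digits, and punctuation count toward the denominator, so even fully Korean prose scores below 1.0. That is worth keeping in mind when choosing between a 30% and a 40% threshold.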
3. Hypothesis Verification Results
Hypothesis A: Outputs contain Q/A loop patterns (⚠️ partially confirmed)
- Exact pattern `### 질문: ... ### 답변:`: 4 samples (0.003%)
- Informal pattern `질문: ... 답변:`: 119 samples (0.07%)
- Plain "질문:" or "답변:" present: ~550 samples
- Conclusion: the exact loop pattern is extremely rare, but several hundred samples contain the "질문/답변" keywords in the output. This alone is unlikely to be the main cause of the loop.
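
The three tiers counted for Hypothesis A could be separated with regular expressions along these lines. The exact patterns behind the audit numbers are not given in the report, so the expressions below are assumptions, and `classify_qa_markers` is a hypothetical helper.

```python
import re

# Tier 1: formal "### 질문: ... ### 답변:" sequence inside one output.
EXACT = re.compile(r"###\s*질문\s*:.*?###\s*답변\s*:", re.DOTALL)
# Tier 2: informal "질문: ... 답변:" sequence without requiring the ### prefix.
INFORMAL = re.compile(r"질문\s*:.*?답변\s*:", re.DOTALL)
# Tier 3: either keyword on its own.
KEYWORD = re.compile(r"(질문|답변)\s*:")


def classify_qa_markers(output):
    """Return the strictest tier the output matches, or None if it is clean."""
    if EXACT.search(output):
        return "exact"
    if INFORMAL.search(output):
        return "informal"
    if KEYWORD.search(output):
        return "keyword"
    return None
```

The checks run from strictest to loosest because every exact match also satisfies the informal and keyword patterns.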
Hypothesis B: Short outputs (✅ likely cause)
- The 16,519 outputs under 50 chars (10.4%) are a substantial part of the output distribution.
- The model may fail to emit EOS after a short answer and keep generating tokens.
- Combined with the `</s>` token contamination (113 samples), the model does not learn the EOS boundary accurately.
Hypothesis C: Quality varies by source (✅ confirmed, indirectly)
- Per `prepare_sft_data.py`: KOR-OpenOrca-Platypus-v3 is upsampled 5x and kovast is downsampled 0.8x.
- The weights are very aggressive (5.0x means the same data is repeated 5 times, which is an overfitting risk).
- kovast keeps only the first turn of each multi-turn conversation, so outputs can read awkwardly for lack of context.
- Conclusion: at 5x upsampling, OpenOrca-Platypus dominates the training data. Any problem in that source directly affects the whole model.
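🔍 Additional finding: estimated real cause of the repetition loop
Failure to learn EOS is the core issue. Contributing factors:
- Literal `</s>` inside outputs (113 samples) → confusion about the EOS boundary
- 10.4% short outputs → unstable learning of EOS timing
- Training 159K samples for 5,000 steps → each sample is seen only about 2 times on average (5,000 × 64 / 159,125 ≈ 2.0 epochs) → possible underfitting
- No repetition_penalty at inference (the eval code sets only top_p/top_k, with no repetition_penalty)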
4. Data Filtering Code Ready to Apply
"""
enhanced_quality_filter.py: stricter quality filter for SFT data
Usage: python enhanced_quality_filter.py data/sft/train.jsonl data/sft/train_cleaned.jsonl
"""
import json
import re
import sys

# Special-token strings that must never appear literally inside an output.
BAD_TOKENS = ["</s>", "<|endoftext|>", "<|end|>", "<s>", "<pad>", "[PAD]", "<unk>"]

# Q/A marker patterns that can teach the model to start a new Q/A turn.
QA_PATTERNS = [
    r"###\s*(질문|답변|Instruction|Response|Input|Output)\s*:",
    r"^(질문|답변)\s*:",  # "질문:" / "답변:" at the start of a line
]


def enhanced_filter(sample: dict) -> bool:
    """Return True if the sample passes every quality check."""
    instruction = (sample.get("instruction") or "").strip()
    output = (sample.get("output") or "").strip()

    # 1. Basic length filter (tightened)
    if len(output) < 80:  # raised from 50 to 80
        return False
    if len(output) > 3000:  # lowered from 4000 to 3000
        return False
    if len(instruction) < 15:
        return False

    # 2. Drop samples containing special tokens
    if any(tok in output for tok in BAD_TOKENS):
        return False

    # 3. Drop samples contaminated with Q/A markers
    if any(re.search(pat, output, re.MULTILINE) for pat in QA_PATTERNS):
        return False

    # 4. Stricter Korean ratio (30% → 40%); output is non-empty after step 1
    ko_chars = sum(1 for c in output if "\uac00" <= c <= "\ud7a3")
    if ko_chars / len(output) < 0.4:
        return False

    # 5. N-gram repetition filter (tightened): word-level 5-grams
    words = output.split()
    if len(words) > 15:
        fivegrams = [tuple(words[i:i + 5]) for i in range(len(words) - 4)]
        unique_ratio = len(set(fivegrams)) / len(fivegrams)
        if unique_ratio < 0.7:  # drop if more than 30% is repeated
            return False

    # 6. Drop samples containing the literal "EOS"
    if re.search(r"\bEOS\b", output):
        return False

    return True


def main():
    if len(sys.argv) != 3:
        sys.exit(__doc__)  # print the usage text and exit with an error
    input_path, output_path = sys.argv[1], sys.argv[2]
    kept, dropped = 0, 0
    with open(input_path, encoding="utf-8") as fin, \
            open(output_path, "w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():  # skip blank lines
                continue
            sample = json.loads(line)
            if enhanced_filter(sample):
                fout.write(line)
                kept += 1
            else:
                dropped += 1
    total = kept + dropped
    rate = dropped / total * 100 if total else 0.0
    print(f"Kept: {kept:,} | Dropped: {dropped:,} | Drop rate: {rate:.1f}%")


if __name__ == "__main__":
    main()
5. Recommendations for Improving the Data Pipeline
5.1 Rebalance the weights
The current weights are too aggressive. Recommended changes:
DATASET_WEIGHTS = {
    "KOR-OpenOrca-Platypus-v3": 2.0,  # 5.0 → 2.0 (prevent overfitting)
    "kullm-v2": 1.0,
    "ko-alpaca-12k": 1.5,             # 2.0 → 1.5
    "korean_safe_conversation": 1.0,  # 1.5 → 1.0
    "evol-instruct-korean": 1.5,
    "kovast": 0.5,                    # 0.8 → 0.5 (quality issues)
}
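
To make the effect of a weight concrete, here is one plausible way such weights could be applied to a single source. How `prepare_sft_data.py` actually applies them is not shown in this report, so `apply_weight` is a hypothetical helper and the integer-copies-plus-fractional-subset scheme is an assumption.

```python
import random


def apply_weight(samples, weight, seed=0):
    """Resample one source: whole copies for the integer part of the weight,
    plus a random subset (without replacement) for the fractional part."""
    rng = random.Random(seed)
    pool = list(samples)
    whole = int(weight)
    frac = weight - whole
    result = pool * whole                 # e.g. weight 5.0 gives 5 exact copies
    extra = round(len(pool) * frac)       # e.g. weight 0.5 keeps half the source
    result.extend(rng.sample(pool, extra))
    return result
```

Under this scheme a weight of 5.0 places five identical copies of every sample in the training set, which is the overfitting concern raised above; 2.0 cuts that to two.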
5.2 Adjust the training settings
# Current: 5000 steps, batch 4×8×2 = 64
# 159K samples / 64 = 2,486 steps/epoch → currently about 2 epochs
# Recommended: 3 epochs over the ~120K samples left after filtering
MAX_STEPS=6000
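
The step arithmetic in the comments above can be checked with two small helpers. This is a sketch: the function names are illustrative, and the effective batch size of 64 is taken from the `4×8×2` figure in the comments.

```python
import math


def steps_for_epochs(num_samples, batch_size, epochs):
    """Optimizer steps needed to see every sample `epochs` times."""
    return math.ceil(num_samples / batch_size) * epochs


def epochs_seen(num_samples, batch_size, steps):
    """Average number of times each sample is seen after `steps` steps."""
    return steps * batch_size / num_samples


print(round(epochs_seen(159_125, 64, 5000), 2))  # current setup
print(steps_for_epochs(120_000, 64, 3))          # low end of filtered size
print(steps_for_epochs(130_000, 64, 3))          # high end of filtered size
```

For the filtered range of 120K to 130K samples, three epochs works out to 5,625 to 6,096 steps, so `MAX_STEPS=6000` sits inside that range.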
5.3 Add repetition_penalty at inference
# Edit eval/comprehensive_eval.py
repetition_penalty = 1.2  # suppress repetition
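
For intuition, a repetition penalty is commonly applied to the logits of tokens that have already been generated, using the rule introduced with the CTRL model. The sketch below is a pure-Python illustration of that rule, not code from `eval/comprehensive_eval.py`; whether the generation library used there follows exactly this formulation is an assumption.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Penalize tokens already generated: divide positive logits by `penalty`
    and multiply negative logits by it, so both become less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted
```

With `penalty=1.2`, a previously generated token with logit 2.4 drops to 2.0, while tokens not yet generated are untouched. This reduces loops at decoding time but does not fix the training data, so it complements the filtering above and does not replace it.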
6. Recommended High-Quality Datasets (HuggingFace)
| Dataset | HuggingFace repo ID | Description | Estimated size |
|---|---|---|---|
| Open-Orca Korean | kyujinpy/KOR-OpenOrca-Platypus-v3 | Already in use | - |
| ShareGPT Korean | junelee/sharegpt_deepl_ko | Korean translation of ShareGPT | ~90K |
| KoAlpaca v1.1 | beomi/KoAlpaca-v1.1a | High-quality Korean Alpaca | ~21K |
| LIMA Korean | HAERAE-HUB/KMMLU | Korean benchmark (for evaluation) | - |
| Korean HC3 | heegyu/korean_chatgpt_corpus | ChatGPT conversations in Korean | ~12K |
| Orca DPO Korean | kyujinpy/orca_dpo_pairs_ko | DPO pairs (enables SFT+DPO) | ~12K |
| OpenHermes 2.5 Ko | maywell/ko_Ultrafeedback_binarized | Korean Ultrafeedback | ~60K |
| KOpen-platypus | kyujinpy/KOpen-platypus | Korean Platypus | ~25K |

Most recommended additions:
- junelee/sharegpt_deepl_ko: multi-turn conversations on varied topics, with sufficiently long outputs
- heegyu/korean_chatgpt_corpus: ChatGPT-quality Korean answers
- beomi/KoAlpaca-v1.1a: validated Korean instruction data
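7. Summary: Immediate Actions
| Priority | Action | Expected effect |
|---|---|---|
| 🔴 P0 | Remove samples containing `</s>`, `<\|endoftext\|>`, and other special tokens | Prevents learning a premature EOS |
| 🔴 P0 | Raise the minimum output length to 80 chars | Prevents EOS from going unlearned because of short answers |
| 🔴 P0 | Add repetition_penalty=1.2 at inference | Immediately mitigates the repetition loop |
| 🟡 P1 | Remove samples containing Q/A markers (~550) | Prevents learning a self-generated Q/A loop pattern |
| 🟡 P1 | Lower the OpenOrca weight from 5.0 to 2.0 | Prevents overfitting and preserves diversity |
| 🟡 P1 | Tighten the Korean ratio filter to 40% | Improves Korean consistency |
| 🟢 P2 | Collect additional high-quality datasets | Improves overall quality |
| 🟢 P2 | Tighten the self-repetition filter (5-gram, 70% threshold) | Blocks repetition patterns at the source |

Expected data after filtering: ~120,000-130,000 samples (18-25% removed relative to the current set)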