# Preference/RLHF + Benchmark Dataset Survey

> Survey date: 2026-02-27

---
## Part 1: Korean Preference/DPO Data

| Dataset | Size | Downloads | Notes |
|----------|------|----------|------|
| `kuotient/orca-math-korean-dpo-pairs` | 100K~1M | 111 | Korean math DPO; large-scale |
| `nayohan/preference-collection-ko-full` | 100K~1M | 30 | General-purpose Korean preference |
| `jojo0217/korean_rlhf_dataset` | 100K~1M | 54 | Korean RLHF |
| `maywell/ko_Ultrafeedback_binarized` | 10K~100K | 108 | Korean translation of UltraFeedback |
| `ChuGyouk/argilla-distilabel-math-preference-dpo-korean` | 1K~10K | 10 | Math DPO, Korean |
| `ohsuz/dpo-v1010-korean` | 10K~100K | 3 | Korean DPO |
| `ohsuz/dpo-v1010-korean-without-finance` | 10K~100K | 3 | Version with finance excluded |
| `tellang/yeji-preference-ko-v1` | 10K~100K | 13 | Korean preference |
| `AnonymousLLMer/Safety_preference-ko-cleaned` | 1K~10K | 4 | Safety preference |
| `mncai/distilabel-math-preference-dpo-ko` | 1K~10K | 4 | Math DPO, Korean |
| `vaiv/ko-rag-preference` | <1K | 2 | RAG preference (small) |

### Inaccessible (404)

- `Bongseok/ko-DPO-v0.1`: deleted
- `HAERAE-HUB/KoRA`: deleted
- `maywell/ko_Ultrafeedback`: deleted (only the binarized version exists)

---
## Part 2: English Preference Data (Worth Translating)

| Dataset | Size | Downloads | Translation value |
|----------|------|----------|-----------|
| `HuggingFaceH4/ultrafeedback_binarized` | 100K~1M (~62K pairs) | 5,158 | ⭐⭐⭐ Top pick. A Korean translation already exists (maywell) |
| `Anthropic/hh-rlhf` | 100K~1M | 17,609 | ⭐⭐⭐ Human preferences; conversational |
| `nvidia/HelpSteer2` | 10K~100K | 15,448 | ⭐⭐⭐ High-quality, fine-grained scores |
| `openbmb/UltraFeedback` | 10K~100K | 2,317 | ⭐⭐ Original (the binarized version is more useful) |
| `argilla/distilabel-math-preference-dpo` | 1K~10K | 328 | ⭐⭐ Math-focused (a Korean translation already exists) |
| `snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset` | 10K~100K | 71 | ⭐ Automatically generated |
| `HuggingFaceH4/stack-exchange-preferences` | 10M~100M | 3,873 | ⭐ Too large; biased toward code |
| `allenai/preference-test-sets` | 10K~100K | 2,777 | For evaluation (unsuitable for training) |

---
## Part 3: Benchmark/Evaluation Data

| Dataset | Size | Downloads | Purpose |
|----------|------|----------|------|
| **`HAERAE-HUB/KMMLU`** | 100K~1M | 10,537 | Korean MMLU. Core benchmark |
| `skt/kobest_v1` | 10K~100K | 3,194 | KoBEST, 5 tasks (BoolQ, COPA, WiC, HellaSwag, SentiNeg) |
| `HAERAE-HUB/HAE_RAE_BENCH_1.0` | 1K~10K | 457 | HAE-RAE Bench |
| `HAERAE-HUB/K2-Eval` | <1K | 76 | K2 evaluation |
| `openai/gsm8k` | 10K~100K | 465,032 | Math reasoning (English) |
| `HuggingFaceH4/MATH-500` | <1K | 94,894 | Math benchmark (English) |
| `Rowan/hellaswag` | 10K~100K | 213,419 | Commonsense reasoning (English) |
| `google/IFEval` | <1K | 60,319 | Instruction-following evaluation (English) |

### Inaccessible (404)

- `coastalcph/mimir`, `kuotient/korean-gsm8k`, `HAERAE-HUB/KorNAT-CV`, `HAERAE-HUB/KorNAT-NL2SQL`, `snunlp/korean-hate-speech`

---
## Part 4: Feasibility of Generating Preference Data In-House

**Self-play approach based on the SFT v2 model (18% repetition rate):**

### Method

1. From the prompt pool of the SFT data, sample N=4~8 responses per prompt (temperature 0.7~1.0)
2. Select chosen/rejected responses using automatic quality judgments
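
The sampling step above could be sketched as follows. This is only an illustration: `generate` is a hypothetical callable standing in for whatever inference call the SFT v2 model exposes, and drawing each temperature uniformly from the 0.7~1.0 range is an assumption, since the notes give only the range.

```python
import random
from typing import Callable, Dict, List, Tuple


def sample_candidates(
    prompts: List[str],
    generate: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> response
    n: int = 4,
    temp_range: Tuple[float, float] = (0.7, 1.0),
    seed: int = 0,
) -> List[Dict]:
    """Draw n responses per prompt, each at a temperature inside temp_range."""
    rng = random.Random(seed)
    results = []
    for prompt in prompts:
        responses = []
        for _ in range(n):
            temperature = rng.uniform(*temp_range)  # assumption: uniform draw
            responses.append(generate(prompt, temperature))
        results.append({"prompt": prompt, "responses": responses})
    return results
```

Each returned row keeps the prompt next to its candidate responses, so the later chosen/rejected selection can work one prompt at a time.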

### Automatic Quality Criteria

- **Repetition detection**: n-gram repetition rate > 20% → rejected
- **Length filter**: too short (<50 characters) or too long (>2000 characters) → rejected
- **Perplexity-based**: score each response with an external judge model (GPT-4 or a larger model)
- **Self-consistency**: compare reward model scores across responses to the same prompt
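
The two rule-based criteria (repetition and length) can be sketched in plain Python. The notes do not define "n-gram repetition rate", so the definition here is an assumption: the share of n-gram occurrences that repeat an n-gram already seen, using whitespace tokens and n=3. For Korean text, a morpheme- or character-level tokenizer may be a better fit.

```python
from collections import Counter


def ngram_repetition_rate(text: str, n: int = 3) -> float:
    """Fraction of n-gram occurrences that duplicate an earlier n-gram (assumed definition)."""
    tokens = text.split()  # assumption: whitespace tokenization
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(count - 1 for count in counts.values())  # occurrences beyond the first
    return repeated / len(ngrams)


def auto_reject(
    text: str,
    max_repetition: float = 0.20,
    min_chars: int = 50,
    max_chars: int = 2000,
    n: int = 3,
) -> bool:
    """Return True when a response fails the length or repetition rule."""
    if len(text) < min_chars or len(text) > max_chars:
        return True
    return ngram_repetition_rate(text, n) > max_repetition
```

Responses that pass both rules would still need ranking by the judge or reward-model scores listed above. These filters only remove clear failures.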

### Expected Yield

- 10K SFT prompts × 4 samples = 40K responses
- chosen/rejected pairs: ~10K~20K pairs (top 25% vs bottom 25%)
- **Caution**: with a model whose repetition rate is 18%, rejected responses may be too low in quality → the meaningful learning signal may weaken
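
One reading of the yield estimate: if the top and bottom 25% are taken per prompt, N=4 gives one pair per prompt and N=8 gives two, so 10K prompts yield roughly 10K to 20K pairs. The sketch below follows that per-prompt interpretation, which is an assumption; the notes do not say whether the quartiles are computed per prompt or over the whole pool. Scores are assumed to come from a judge or reward model, with higher meaning better.

```python
from typing import List, Tuple


def build_pairs(
    scored: List[Tuple[str, float]],  # (response, score) for ONE prompt
    fraction: float = 0.25,
) -> List[Tuple[str, str]]:
    """Pair top-fraction responses (chosen) with bottom-fraction ones (rejected)."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top, bottom = ranked[:k], ranked[-k:]
    pairs = []
    # Match the best with the worst, the second best with the second worst, etc.
    for chosen, rejected in zip(top, reversed(bottom)):
        if chosen[1] > rejected[1]:  # skip ties and single-response prompts
            pairs.append((chosen[0], rejected[0]))
    return pairs


def pairs_per_prompt(n_samples: int, fraction: float = 0.25) -> int:
    """Upper bound on pairs one prompt can contribute under this scheme."""
    return max(1, int(n_samples * fraction))


# Upper-bound arithmetic for 10K prompts under the per-prompt assumption.
estimate_n4 = 10_000 * pairs_per_prompt(4)
estimate_n8 = 10_000 * pairs_per_prompt(8)
```

Prompts where every sample is filtered out or scores tie contribute no pair, so the actual yield would fall below these upper bounds.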

### Recommendation

- **Prioritize existing Korean data** over in-house generation (see the recommendations below)
- In-house generation is more effective as a second-round self-play pass with the improved model, after a first round of ORPO training

---
## Recommended Data Combinations for Starting ORPO Immediately

### Tier 1: Use immediately (Korean, minimal conversion)

| Data | Estimated pairs | Priority |
|--------|-----------|----------|
| `jojo0217/korean_rlhf_dataset` | ~100K+ | Most general-purpose |
| `maywell/ko_Ultrafeedback_binarized` | ~60K | Korean UltraFeedback, high quality |
| `nayohan/preference-collection-ko-full` | ~100K+ | General-purpose preference |
| `kuotient/orca-math-korean-dpo-pairs` | ~100K+ | Math-focused |

### Tier 2: Supplementary

| Data | Estimated pairs | Purpose |
|--------|-----------|------|
| `ohsuz/dpo-v1010-korean` | ~10K+ | Additional diversity |
| `tellang/yeji-preference-ko-v1` | ~10K+ | Additional diversity |
| `ChuGyouk/argilla-distilabel-math-preference-dpo-korean` | ~5K | Math supplement |

### Recommended Combination

```
About 200K~300K pairs obtainable in total
Round 1: jojo0217 + maywell + nayohan combined → ~260K pairs (estimated)
Round 2: add kuotient math → stronger math ability
```
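
Combining the sources requires bringing them into one layout first. The sketch below merges records into a unified `prompt`/`chosen`/`rejected` schema and drops exact duplicates. The target schema and all column names in the mappings are placeholders; the real column names of each dataset were not checked in this survey and should be read from the dataset cards.

```python
from typing import Dict, Iterable, List, Tuple

Mapping = Dict[str, str]  # unified field -> source column name (placeholders)


def normalize(record: dict, mapping: Mapping, source: str) -> dict:
    """Rename one source record into the unified prompt/chosen/rejected layout."""
    return {
        "prompt": record[mapping["prompt"]].strip(),
        "chosen": record[mapping["chosen"]].strip(),
        "rejected": record[mapping["rejected"]].strip(),
        "source": source,  # kept so per-source sampling ratios can be tuned later
    }


def merge_sources(sources: Dict[str, Tuple[Iterable[dict], Mapping]]) -> List[dict]:
    """Concatenate sources in order, dropping degenerate and duplicate pairs."""
    seen = set()
    merged = []
    for name, (records, mapping) in sources.items():
        for record in records:
            row = normalize(record, mapping, name)
            key = (row["prompt"], row["chosen"], row["rejected"])
            if row["chosen"] == row["rejected"] or key in seen:
                continue
            seen.add(key)
            merged.append(row)
    return merged
```

The ~260K round-1 figure is the sum of the Tier 1 estimates (~100K + ~60K + ~100K), so deduplication and filtering would lower the final count.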

### Benchmark Evaluation Pipeline

- **KMMLU** (Korean knowledge) + **KoBEST** (Korean NLU): required
- **GSM8K** (math) + **IFEval** (instruction following): auxiliary
- **HAE_RAE_BENCH**: comprehensive Korean evaluation
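
The tiers above can be written down as a small plan so that an evaluation run is flagged when a required benchmark is missing. This is a hypothetical sketch: the tier keys and the shape of the `results` dictionary (benchmark name mapped to a score) are assumptions, not the output format of any particular evaluation tool.

```python
from typing import Dict, List

# Tiers as listed in the pipeline above; the key names are illustrative.
BENCHMARK_PLAN: Dict[str, List[str]] = {
    "required": ["KMMLU", "KoBEST"],
    "auxiliary": ["GSM8K", "IFEval"],
    "comprehensive": ["HAE_RAE_BENCH"],
}


def missing_required(results: Dict[str, float]) -> List[str]:
    """List required benchmarks that have no score in `results`."""
    return [name for name in BENCHMARK_PLAN["required"] if name not in results]


def is_complete(results: Dict[str, float]) -> bool:
    """A run counts as complete once every required benchmark has a score."""
    return not missing_required(results)
```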