
Comprehensive Survey of Korean SFT/Instruction Datasets

Survey date: 2026-02-27. Scope: Korean SFT/Instruction datasets on HuggingFace Hub.


1. Current SFT Data Status

| Item | Value |
| --- | --- |
| File | /PROJECT/.../data/sft/train.jsonl |
| Total records | 161,848 |
| Format | instruction / input / output (Alpaca format) |
| Source field | ❌ None (no source key) |

⚠️ Record sources cannot be traced, which makes duplicate and provenance verification difficult. Making the source field mandatory is recommended whenever data is added in the future.
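The check behind this warning can be sketched as follows: count how many records in an Alpaca-style JSONL file carry a `source` key. This is a minimal sketch, not the project's actual tooling. The function name `source_coverage` and the sample records are hypothetical, and it assumes one JSON object per line.

```python
import json
from collections import Counter

def source_coverage(lines):
    """Count records per `source` value; records without the key fall under None."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[record.get("source")] += 1
    return counts

# Hypothetical sample standing in for lines read from train.jsonl.
sample = [
    '{"instruction": "a", "input": "", "output": "b"}',
    '{"instruction": "c", "input": "", "output": "d", "source": "example/dataset"}',
]
coverage = source_coverage(sample)
print(coverage[None], "of", sum(coverage.values()), "records lack a source")
```

Passing an open file handle instead of the sample list should work the same way, since the function only iterates over lines.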


2. Korean SFT Datasets on HuggingFace

Tier 1: Highest quality (human-written / heavily filtered / GPT-4 generated and verified)

| Dataset | Size | Language | Description | Downloads (DL) |
| --- | --- | --- | --- | --- |
| nlpai-lab/kullm-v2 | 10K~100K | 🇰🇷 | GPT-4-based Korean instructions, community-verified | 730 |
| FreedomIntelligence/alpaca-gpt4-korean | ~52K | 🇰🇷 | Korean Alpaca generated with GPT-4 | 158 |
| dbdu/ShareGPT-74k-ko | 10K~100K | 🇰🇷 | Korean translation of ShareGPT, multi-turn dialogue | 169 |
| squarelike/sharegpt_deepl_ko_translation | ~50K+ | 🇰🇷 | ShareGPT translated with DeepL, high-quality translated style | 41 |
| kuotient/orca-math-word-problems-193k-korean | 100K~1M | 🇰🇷 | Korean translation of math problems, large-scale | 396 |
| HuggingFaceH4/no_robots | ~10K | 🇬🇧 | Human-written, high quality (English; high value for translation) | 5,211 |
| allenai/tulu-3-sft-mixture | 100K~1M | Multilingual | Allen AI's latest SFT mix, high-quality curation | 22,453 |
| HAERAE-HUB/K2-Feedback | A few thousand | 🇰🇷 | Korean evaluation/feedback data | 54 |

Tier 2: Medium quality (GPT-3.5/4 generated, partially verified)

| Dataset | Size | Language | Description | Downloads (DL) |
| --- | --- | --- | --- | --- |
| beomi/KoAlpaca-v1.1a | ~52K | 🇰🇷 | Korean Alpaca, widely used | 3,096 |
| kyujinpy/KOR-OpenOrca-Platypus-v3 | 10K~50K | 🇰🇷 | Korean merge of OpenOrca + Platypus | 612 |
| kyujinpy/OpenOrca-KO | 10K~50K | 🇰🇷 | Korean translation of OpenOrca | 139 |
| squarelike/OpenOrca-gugugo-ko | 10M~100M | 🇰🇷 | Very large-scale Korean translation of OpenOrca | 82 |
| nlp-with-deeplearning/Ko.WizardLM_evol_instruct_V2_196k | ~196K | 🇰🇷 | WizardLM Evol Instruct in Korean | 20 |
| heegyu/open-korean-instructions | Varies | 🇰🇷 | Aggregation of several Korean instruction sets | 214 |
| nayohan/instruction_en_ko_translation_1.4m | 1.4M | 🇰🇷 | Large-scale English→Korean instruction translation | 11 |
| nayohan/Evol-Instruct-Code-80k-v1-ko | ~80K | 🇰🇷 | Code instructions in Korean | 23 |
| changpt/ko-lima-vicuna | <1K | 🇰🇷 | LIMA + Vicuna in Korean (small, high quality) | 43 |
| OpenLab-NLP/tiny-instruct-ko | Tens of thousands | 🇰🇷 | Small-scale Korean instructions | 127 |
| nlpai-lab/openassistant-guanaco-ko | 1K~10K | 🇰🇷 | OpenAssistant Guanaco in Korean | 48 |
| HuggingFaceH4/ultrachat_200k | 100K~1M | 🇬🇧 | High-quality dialogue (English; worth translating) | 33,729 |
| kyujinpy/KOpen-platypus | ~25K | 🇰🇷🇬🇧 | Platypus in Korean | 306 |

Tier 3: For reference (possibly noisy, additional filtering required)

| Dataset | Size | Language | Description | Downloads (DL) |
| --- | --- | --- | --- | --- |
| CarrotAI/ko-instruction-dataset | 1K~10K | 🇰🇷 | Small-scale | 71 |
| CarrotAI/ko-code-alpaca-QA | Small | 🇰🇷 | Code QA | 71 |
| causal-lm/instructions-ko | Unknown | 🇰🇷 | | 21 |
| junelee/sharegpt_deepl_ko | Tens of thousands | 🇰🇷 | DeepL translation | 86 |
| neuralfoundry-coder/aihub-korean-education-instruct-sample | Sample | 🇰🇷 | Education domain | 32 |
| neuralfoundry-coder/korean-legal-instruction-sample | Sample | 🇰🇷 | Legal domain | 30 |

Large-scale English datasets (usable through a translation pipeline)

| Dataset | Size | Description | Downloads (DL) |
| --- | --- | --- | --- |
| Open-Orca/OpenOrca | ~4M | Large-scale, FLAN-based | - |
| teknium/OpenHermes-2.5 | ~1M | High-quality mix | - |
| WizardLM/WizardLM_evol_instruct_V2_196k | 196K | Evol Instruct | - |
| stingning/ultrachat | 1M~10M | Conversational | 2,838 |
| iamtarun/python_code_instructions_18k_alpaca | 18K | Code | 6,499 |
| sahil2801/CodeAlpaca-20k | 20K | Code | 12,060 |

3. Domain Coverage Analysis

Estimated domain distribution of the current data (~162K)

Because the data has no source field, an exact breakdown is not possible. The figures below are estimates based on sampling the data contents:

| Domain | Estimated share | Status |
| --- | --- | --- |
| General knowledge/QA | ~40% | ✅ Sufficient |
| Translated-style dialogue | ~25% | ✅ Sufficient |
| Creative work/writing | ~15% | ⚠️ Moderate |
| Coding | ~5% | ❌ Insufficient |
| Math/science | ~5% | ❌ Insufficient |
| Korea-specific (culture/history/law) | ~5% | ❌ Insufficient |
| Role-play/persona | ~5% | ⚠️ Moderate |
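
To make the estimated shares concrete, the sketch below converts them into approximate record counts against the 161,848 total from section 1. The shares are the sampled estimates from the table, so the resulting counts are rough figures, not measurements. The English domain labels are shorthand chosen for this example.

```python
TOTAL_RECORDS = 161_848  # total record count from section 1

# Estimated shares (percent) copied from the table above.
estimated_share = {
    "general knowledge/QA": 40,
    "translated-style dialogue": 25,
    "creative work/writing": 15,
    "coding": 5,
    "math/science": 5,
    "Korea-specific": 5,
    "role-play/persona": 5,
}

# The estimates should account for the whole dataset.
assert sum(estimated_share.values()) == 100

approx_counts = {
    domain: round(TOTAL_RECORDS * pct / 100)
    for domain, pct in estimated_share.items()
}

for domain, count in approx_counts.items():
    print(f"{domain}: ~{count:,}")
```

Under these estimates, each 5% domain corresponds to roughly 8,000 records.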

Domain gaps (underrepresented areas)

  1. Math/logical reasoning: almost none at present. Can be filled immediately with kuotient/orca-math-word-problems-193k-korean (193K).
  2. Coding: very few Korean code instructions. nayohan/Evol-Instruct-Code-80k-v1-ko (80K) should be used.
  3. Korea-specific knowledge: domain-specific data on Korean culture, history, law, the CSAT (Suneung), and similar topics is lacking.
  4. Multi-turn dialogue: the data is mostly single-turn QA. Supplement with dbdu/ShareGPT-74k-ko and a translation of ultrachat_200k.
  5. Safety/refusal responses: there is no training data for refusing harmful requests.

4. Top 5 Recommended for Immediate Download

🥇 1. kuotient/orca-math-word-problems-193k-korean

  • Size: ~193K
  • Reason: Fully fills the math-domain gap. Native-level Korean translation. Large-scale.
  • Quality: Tier 1-2 (based on Orca Math, verified)
  • Priority: ★★★★★

🥈 2. dbdu/ShareGPT-74k-ko

  • Size: ~74K
  • Reason: Multi-turn data based on real ChatGPT conversations. Diverse domains. Good translation quality.
  • Quality: Tier 1 (based on real user conversations)
  • Priority: ★★★★★

🥉 3. nayohan/Evol-Instruct-Code-80k-v1-ko

  • Size: ~80K
  • Reason: The only large-scale Korean dataset in the coding domain. Based on WizardCoder.
  • Quality: Tier 2
  • Priority: ★★★★☆

4️⃣ 4. nlp-with-deeplearning/Ko.WizardLM_evol_instruct_V2_196k

  • Size: ~196K
  • Reason: Evol Instruct yields varied difficulty. Includes complex instructions. Large-scale.
  • Quality: Tier 2
  • Priority: ★★★★☆

5️⃣ 5. FreedomIntelligence/alpaca-gpt4-korean

  • Size: ~52K
  • Reason: GPT-4 generation gives high response quality. Complementary to the existing Alpaca data.
  • Quality: Tier 1
  • Priority: ★★★☆☆

5. Additional Recommendations

Immediate actions

  1. Add a source field to the current train.jsonl (either by back-tracing existing records or starting with newly added data).
  2. Download the Top 5 datasets → deduplicate → tag with source, then merge.
  3. Expected additional data: ~595K (193K + 74K + 80K + 196K + 52K)
  4. Total size after merging: ~757K (current 162K + 595K), before deduplication
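
The download → deduplicate → tag → merge steps above could look roughly like the sketch below. It assumes the records are already loaded as Python dictionaries in Alpaca format, and it uses exact-match deduplication on whitespace- and Unicode-normalized text, which is only one possible choice (near-duplicate detection would need something stronger). The helper names `record_key` and `merge_with_source`, and the `"unknown"` label for legacy records, are hypothetical.

```python
import hashlib
import unicodedata

FIELDS = ("instruction", "input", "output")

def record_key(record):
    """Return a hash of the normalized Alpaca fields, used as the duplicate key."""
    parts = []
    for field in FIELDS:
        text = unicodedata.normalize("NFKC", record.get(field) or "")
        parts.append(" ".join(text.split()))  # collapse runs of whitespace
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()

def merge_with_source(existing, new_by_source):
    """Merge datasets, dropping exact duplicates and tagging each record with a source."""
    seen = set()
    merged = []
    # Existing records come first; they keep their own source or get "unknown".
    batches = [(None, existing)] + list(new_by_source.items())
    for source, records in batches:
        for record in records:
            key = record_key(record)
            if key in seen:
                continue  # duplicate of a record already kept
            seen.add(key)
            tag = source if source is not None else record.get("source", "unknown")
            merged.append({**record, "source": tag})
    return merged

# Tiny illustrative inputs (not real dataset rows).
existing = [{"instruction": "1+1?", "input": "", "output": "2"}]
new_by_source = {
    "kuotient/orca-math-word-problems-193k-korean": [
        {"instruction": "1+1?  ", "input": "", "output": "2"},  # duplicate after normalization
        {"instruction": "2+3?", "input": "", "output": "5"},
    ],
}
merged = merge_with_source(existing, new_by_source)
print(len(merged), [r["source"] for r in merged])
```

Because duplicates are dropped, the merged total would be at or below the ~757K sum.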

Mid-term plan

  • nayohan/instruction_en_ko_translation_1.4m: large at 1.4M records, but quality needs verification
  • squarelike/OpenOrca-gugugo-ko: very large (10M+), but noise filtering is essential
  • allenai/tulu-3-sft-mixture: multilingual; the Korean portion is worth extracting
  • Build Safety data in-house (scenarios for refusing harmful requests)
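
For the multilingual and noisy sources listed above, one simple way to pull out Korean records is a Hangul-ratio heuristic. The sketch below is an assumption about how such a filter might work, not a method described in this survey. The function names and the 0.5 threshold are illustrative and would need tuning (code-heavy Korean samples, for example, would score low).

```python
def hangul_ratio(text):
    """Fraction of alphabetic characters that are precomposed Hangul syllables."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hangul = sum(1 for ch in letters if "\uac00" <= ch <= "\ud7a3")
    return hangul / len(letters)

def is_mostly_korean(text, threshold=0.5):
    """Heuristic language check; the threshold is an illustrative default."""
    return hangul_ratio(text) >= threshold

samples = ["한국어 데이터셋 SFT", "An English-only instruction"]
korean_only = [s for s in samples if is_mostly_korean(s)]
print(korean_only)
```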

Domain-specific augmentation

  • Legal: neuralfoundry-coder/korean-legal-instruction-sample (only a sample is public; the AI Hub original needs to be checked)
  • Education: neuralfoundry-coder/aihub-korean-education-instruct-sample
  • Medical: squarelike/ko_medical_chat (25 downloads, small-scale)

6. 404 (Deleted/Private) Datasets

The following datasets are currently inaccessible:

  • Bingsu/ko-alpaca-cleaned ❌
  • naver-clova-ix/koco-v1-5 (needs separate verification)
  • kuotient/korean-conversation-dataset (needs separate verification)
  • HAERAE-HUB/K2-Bench-Instruction ❌
  • nayohan/llama3-instruct-ko ❌
  • Bongseok/Kor-Platypus2 ❌
  • kuotient/orca-math-word-problems-korean ❌ (→ the correct name is orca-math-word-problems-193k-korean)
  • kyujinpy/Kor-Platypus2-T70k ❌
  • HAERAE-HUB/qarv-instruct-100k ❌