
Korean LLM Benchmark Pipeline

์ž‘์„ฑ: 2026-02-26 | ์„œ๋ฒ„: 8ร— NVIDIA B200 183GB | PyTorch 2.10 (NV custom), CUDA 13.1


1. lm-eval ์„ค์น˜ ์ƒํƒœ

lm-eval 0.4.11 ์„ค์น˜๋จ (/usr/local/lib/python3.12/dist-packages/)
์„ค์น˜ ๋ช…๋ น: pip install lm-eval --break-system-packages

โš ๏ธ lm-eval[ko] extra๋Š” 0.4.11์— ์—†์Œ. ๊ธฐ๋ณธ lm-eval๋กœ ์„ค์น˜ํ•˜๋ฉด ๋จ. Korean ๊ด€๋ จ ํƒœ์Šคํฌ๋Š” ๊ธฐ๋ณธ ํŒจํ‚ค์ง€์— ๋ชจ๋‘ ํฌํ•จ๋ผ ์žˆ์Œ.


2. Analysis of the 9 Open Ko-LLM Leaderboard Tasks

❌ Conclusion: cannot be run locally (private datasets)

The 9 tasks of Open Ko-LLM Leaderboard 2 use dedicated private datasets:

  • Ko-GPQA, Ko-WinoGrande, Ko-GSM8K, Ko-EQ-Bench → provided by Flitto (private)
  • KorNAT-CKA, KorNAT-SVA, Ko-Harmlessness, Ko-Helpfulness → SELECTSTAR + KAIST AI (private)
  • Ko-IFEval → private translation

The leaderboard runs on lm-evaluation-harness, but the datasets themselves are not directly accessible.

Task details (metrics, from analysis of the leaderboard result data)

ํƒœ์Šคํฌ ๋ ˆ์ด๋ธ” ๋ฉ”ํŠธ๋ฆญ Few-shot ํŠน์ง•
ko_eqbench Ko-EQ Bench eqbench,none 0-shot ๊ฐ์ •์ง€๋Šฅ ํ‰๊ฐ€, ํŒŒ์‹ฑ ํ•„์š”
ko_gpqa_diamond_zeroshot Ko-GPQA Diamond acc_norm,none 0-shot ๋Œ€ํ•™์› ์ˆ˜์ค€ ๊ณผํ•™
ko_gsm8k Ko-GSM8K exact_match,strict-match 0-shot ์ดˆ๋“ฑ ์ˆ˜ํ•™ ์ถ”๋ก 
ko_ifeval Ko-IFEval prompt_level_strict_acc,none + inst_level_strict_acc,none (ํ‰๊ท ) 0-shot ์ง€์‹œ ๋”ฐ๋ฅด๊ธฐ
ko_winogrande Ko-Winogrande acc,none 0-shot ์ƒ์‹ ์ถ”๋ก 
kornat_common KorNAT-CKA acc_norm,none 0-shot ํ•œ๊ตญ ๋ฌธํ™”ยท์ง€์‹
kornat_harmless Ko-Harmlessness acc_norm,none 0-shot ๋ฌดํ•ด์„ฑ
kornat_helpful Ko-Helpfulness acc_norm,none 0-shot ์œ ์šฉ์„ฑ
kornat_social KorNAT-SVA A-SVA,none 0-shot ์‚ฌํšŒ์  ๊ฐ€์น˜
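
As the table notes, the Ko-IFEval score is the average of the prompt-level and instruction-level strict accuracies. A minimal sketch of that aggregation (the metric keys follow lm-eval's naming; the values are invented, not real results):

```python
# Average the two strict-accuracy metrics that lm-eval reports for ko_ifeval.
# The numbers below are illustrative only.
results = {
    "prompt_level_strict_acc,none": 0.42,
    "inst_level_strict_acc,none": 0.51,
}

ko_ifeval = (results["prompt_level_strict_acc,none"]
             + results["inst_level_strict_acc,none"]) / 2
print(f"Ko-IFEval (averaged strict acc): {ko_ifeval:.3f}")  # 0.465
```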

๋Œ€์•ˆ: ๊ณต๊ฐœ ์œ ์‚ฌ ํƒœ์Šคํฌ๋กœ ๊ฐ„์ ‘ ์ธก์ •

์›๋ž˜ ํƒœ์Šคํฌ ๊ณต๊ฐœ ๋Œ€์•ˆ (lm-eval)
Ko-GSM8K global_mmlu_ko + ์ˆ˜ํ•™ ์„œ๋ธŒ์…‹
Ko-WinoGrande paws_ko (์œ ์‚ฌ ์ƒ์‹)
KorNAT-CKA haerae_general_knowledge, haerae_history
Ko-IFEval ๋ณ„๋„ IFEval ์Šคํฌ๋ฆฝํŠธ ํ•„์š”

3. ์‹ค์ œ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ํ•œ๊ตญ์–ด ๋ฒค์น˜๋งˆํฌ

3-1. KoBEST โœ… (lm-eval ๋‚ด์žฅ)

  • HF ๋ฐ์ดํ„ฐ์…‹: skt/kobest_v1
  • lm-eval ํƒœ์Šคํฌ ๊ทธ๋ฃน: kobest
  • 5๊ฐœ ์„œ๋ธŒํƒœ์Šคํฌ:
    • kobest_boolq: True/False ์ด์ง„ ๋ถ„๋ฅ˜ (~950 test)
    • kobest_copa: ์›์ธยท๊ฒฐ๊ณผ ์ถ”๋ก  (~500 test)
    • kobest_hellaswag: ๋ฌธ์žฅ ์™„์„ฑ ์ƒ์‹ (~500 test)
    • kobest_sentineg: ๊ฐ์„ฑ ๋ถ„์„ ๋ถ€์ •๋ฌธ (~500 test)
    • kobest_wic: ๋‹จ์–ด ์˜๋ฏธ ํŒŒ์•… (~638 test)
  • ์‹คํ–‰ ๋ช…๋ น:
    lm_eval --model hf --model_args pretrained=<HF_MODEL_PATH> \
      --tasks kobest --num_fewshot 0 --batch_size auto
    
  • ์˜ˆ์ƒ ์†Œ์š”: 1B ๋ชจ๋ธ ๊ธฐ์ค€ GPU 1์žฅ ~15-30๋ถ„

3-2. HAE-RAE Bench ✅ (built into lm-eval)

  • HF dataset: HAERAE-HUB/HAE_RAE_BENCH_1.0
  • lm-eval task group: haerae
  • 6 subtasks (5 supported in lm-eval; reading_comprehension is excluded):
    • haerae_general_knowledge: Korean general knowledge (~430 test)
    • haerae_history: history (~100 test)
    • haerae_loan_word: loanwords (~200 test)
    • haerae_rare_word: rare words (~200 test)
    • haerae_standard_nomenclature: standard nomenclature (~200 test)
  • Run command:
    lm_eval --model hf --model_args pretrained=<HF_MODEL_PATH> \
      --tasks haerae --num_fewshot 0 --batch_size auto
    
  • ์˜ˆ์ƒ ์†Œ์š”: ~5-10๋ถ„

3-3. Global MMLU (Korean) ✅ (built into lm-eval)

  • HF dataset: CohereForAI/Global-MMLU
  • lm-eval task group: global_mmlu_ko
  • Korean translations across 57 domains
  • Run command:
    lm_eval --model hf --model_args pretrained=<HF_MODEL_PATH> \
      --tasks global_mmlu_ko --num_fewshot 0 --batch_size auto
    
  • ์˜ˆ์ƒ ์†Œ์š”: 1B ๋ชจ๋ธ ๊ธฐ์ค€ ~60-90๋ถ„

3-4. K2-Eval ⚠️ (requires separate evaluation)

  • HF dataset: HAERAE-HUB/K2-Eval ✅ (publicly accessible)
  • Format: open-ended instruction following
  • Categories: Korean History, Geography, Social Issues, Numerical Estimation, Creative Writing, etc.
  • lm-eval support: ❌ (requires LLM-as-a-Judge: GPT-4 or Claude)
  • Alternative: generate with vLLM, then score with a separate judge script

3-5. LogiKor ❌ (not found on HuggingFace)

  • Could not locate a public LogiKor dataset on HF
  • Needs to be checked directly via the paper/GitHub
  • Will be added if found later

3-6. PAWS-Ko ✅ (built into lm-eval)

  • Task: paws_ko (paraphrase detection)
  • A quick probe of basic language understanding

4. ๋น ๋ฅธ ์ฒดํฌ vs ์ „์ฒด ํ‰๊ฐ€ ํƒœ์Šคํฌ์…‹

โšก ๋น ๋ฅธ ์ฒดํฌ (๋ชฉํ‘œ: 30๋ถ„ ์ด๋‚ด)

kobest_boolq, kobest_copa, haerae_general_knowledge, haerae_history, paws_ko
  • ์ด ์ƒ˜ํ”Œ ์ˆ˜: ~2,000๊ฐœ ์ดํ•˜
  • 1B ๋ชจ๋ธ + 8ร—B200 โ†’ ์•ฝ 10-20๋ถ„ ์˜ˆ์ƒ
  • ๋‹ค์–‘์„ฑ: ๋ถ„๋ฅ˜, ์ถ”๋ก , ์ƒ์‹, ํŒจ๋Ÿฌํ”„๋ ˆ์ด์ฆˆ

๐Ÿ“Š ์ „์ฒด ํ‰๊ฐ€ (๋ชฉํ‘œ: 2-4์‹œ๊ฐ„)

kobest (5) + haerae (5) + global_mmlu_ko (์ „์ฒด) + paws_ko
  • ์ด ์ƒ˜ํ”Œ ์ˆ˜: ~15,000๊ฐœ
  • 1B ๋ชจ๋ธ + 8ร—B200 โ†’ ์•ฝ 1.5-3์‹œ๊ฐ„ ์˜ˆ์ƒ
  • tensor_parallel ๋ฏธ์ง€์› ์‹œ ๋‹จ์ผ GPU ์‚ฌ์šฉ โ†’ ๋” ๊ธธ์–ด์งˆ ์ˆ˜ ์žˆ์Œ

5. ๋ชจ๋ธ ์„œ๋น™ ๋ฐฉ๋ฒ• ๊ฒฐ๋ก 

ํ˜„ํ™ฉ

  • ์ฒดํฌํฌ์ธํŠธ: checkpoints/korean_1b_sft/checkpoint-0005000/
  • ๋‚ด์šฉ: model.pt, config.yaml, optimizer.pt, scheduler.pt, train_state.pt
  • ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜: ์ปค์Šคํ…€ LLaMA-like (FP8, d_model=2048, n_layers=24, n_heads=16)
  • lm-eval ๊ธฐ๋ณธ ํฌ๋งท: HuggingFace AutoModelForCausalLM

โœ… ์ถ”์ฒœ ๋ฐฉ๋ฒ•: HF ๋ณ€ํ™˜ ํ›„ ํ‰๊ฐ€

scripts/convert_to_hf.py๊ฐ€ ์ด๋ฏธ ๊ตฌํ˜„๋˜์–ด ์žˆ์Œ. LlamaForCausalLM์œผ๋กœ ๋ณ€ํ™˜.

# Step 1: convert to HF format
cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang
python scripts/convert_to_hf.py \
    --checkpoint checkpoints/korean_1b_sft/checkpoint-0005000 \
    --output outputs/hf_korean_1b_sft_5000 \
    --tokenizer tokenizer/korean_sp/tokenizer.json

# Step 2: run lm-eval
lm_eval --model hf \
    --model_args pretrained=outputs/hf_korean_1b_sft_5000 \
    --tasks kobest \
    --device cuda:0
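
Adding --output_path to the lm_eval command above makes it write its scores to a JSON file whose top-level "results" key maps task → metric → value (lm-eval 0.4.x layout). A minimal sketch of pulling per-task accuracies out of such a file; the sample numbers are invented:

```python
import json

# Illustrative results file in lm-eval 0.4.x layout; a real one comes from
# running `lm_eval ... --output_path results/`. Numbers are invented.
sample = {
    "results": {
        "kobest_boolq": {"acc,none": 0.61, "f1,none": 0.58},
        "kobest_copa": {"acc,none": 0.55},
    }
}
with open("results.json", "w") as f:
    json.dump(sample, f)

with open("results.json") as f:
    results = json.load(f)["results"]

for task, metrics in sorted(results.items()):
    acc = metrics.get("acc,none")
    if acc is not None:
        print(f"{task}: acc={acc:.3f}")
```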

์ฃผ์˜์‚ฌํ•ญ:

  • FP8 ๊ฐ€์ค‘์น˜๋ฅผ float32๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ณผ์ • ํฌํ•จ (convert_to_hf.py ๋‚ด๋ถ€ ์ฒ˜๋ฆฌ)
  • ์ปค์Šคํ…€ ์–ดํœ˜(vocab_size=64000) โ†’ sentencepiece_unigram ๋ฐฉ์‹
  • lm-eval์ด tokenizer๋ฅผ ์ธ์‹ํ•˜๋ ค๋ฉด tokenizer_config.json์— "model_type": "llama" ํ•„์š” (์Šคํฌ๋ฆฝํŠธ์— ์ด๋ฏธ ํฌํ•จ)

๋Œ€์•ˆ ๋ฐฉ๋ฒ• B: API ์„œ๋น™ + local-completions

# vLLM์œผ๋กœ ๋ณ€ํ™˜๋œ ๋ชจ๋ธ ์„œ๋น™
python -m vllm.entrypoints.openai.api_server \
    --model outputs/hf_korean_1b_sft_5000 --port 8000

# Evaluate via lm-eval's API backend
lm_eval --model local-completions \
    --model_args model=outputs/hf_korean_1b_sft_5000,base_url=http://localhost:8000/v1,num_concurrent=8 \
    --tasks kobest
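
Before pointing lm-eval at the server, it can help to smoke-test the OpenAI-compatible /v1/completions endpoint that vLLM exposes. A minimal sketch that only builds the request (the prompt is arbitrary; actually sending it assumes the server above is running):

```python
import json
import urllib.request

# Smoke-test payload for vLLM's OpenAI-compatible /v1/completions endpoint.
payload = {
    "model": "outputs/hf_korean_1b_sft_5000",
    "prompt": "한국의 수도는",
    "max_tokens": 16,
    "temperature": 0.0,
}

def completion_request(base_url="http://localhost:8000/v1"):
    """Build (but do not send) the POST request for a quick smoke test."""
    return urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = completion_request()
print(req.full_url)  # http://localhost:8000/v1/completions
# To actually send it once the server is up: urllib.request.urlopen(req).read()
```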

โŒ ๋ฐฉ๋ฒ• C: ์ปค์Šคํ…€ ๋ž˜ํผ (๊ถŒ์žฅ ์•ˆ ํ•จ)

lm-eval ModelWrapper ์ž‘์„ฑ ํ•„์š” โ†’ ๋ณต์žก๋„ ๋†’์Œ, ์œ ์ง€๋ณด์ˆ˜ ์–ด๋ ค์›€.


6. ์„ค์น˜ ๊ฐ€์ด๋“œ

# ํ˜„์žฌ ํ™˜๊ฒฝ (Python 3.12, externally managed)
pip install lm-eval --break-system-packages

# ๋˜๋Š” ๊ฐ€์ƒํ™˜๊ฒฝ ์‚ฌ์šฉ (๊ถŒ์žฅ)
python3 -m venv /PROJECT/0325120031_A/ghong/taketimes/llm-bang/venv
source /PROJECT/0325120031_A/ghong/taketimes/llm-bang/venv/bin/activate
pip install lm-eval

# ์ถ”๊ฐ€ ์˜์กด์„ฑ
pip install safetensors transformers torch accelerate

7. ์Šคํฌ๋ฆฝํŠธ ์œ„์น˜

์Šคํฌ๋ฆฝํŠธ ์šฉ๋„
scripts/run_eval_quick.sh ๋น ๋ฅธ ์ฒดํฌ (10-20๋ถ„)
scripts/run_eval_full.sh ์ „์ฒด ํ‰๊ฐ€ (1.5-3์‹œ๊ฐ„)
scripts/convert_to_hf.py ์ปค์Šคํ…€ ์ฒดํฌํฌ์ธํŠธ โ†’ HF ๋ณ€ํ™˜

8. ์ฐธ๊ณ  ์ž๋ฃŒ