KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations
Paper • 2403.01469 • Published
A LoRA fine-tuned model based on Qwen3-8B, specialized for the Korean medical/bio domain. It achieves 69.06% accuracy on the KorMedMCQA benchmark, exceeding the 65% target.

Training Configuration
| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen3-8B |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.1 |
| LoRA Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning Rate | 2e-4 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.03 |
| Epochs | 3 |
| Per-device Batch Size | 2 |
| Gradient Accumulation Steps | 16 |
| Effective Batch Size | 32 |
| Max Sequence Length | 2,048 |
| Attention Implementation | SDPA |
| Precision | bf16 |
| Optimizer | AdamW (fused) |
| Seed | 42 |
| DoRA | False |
| RSLoRA | False |
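The adapter settings in the table above can be expressed as a PEFT configuration. A minimal sketch, assuming the `peft` library's `LoraConfig` API (`use_dora`/`use_rslora` mirror the DoRA/RSLoRA rows; learning rate, scheduler, and batch settings live in the trainer, not here):

```python
from peft import LoraConfig

# Sketch of the adapter configuration from the table above (assumption:
# standard peft LoraConfig fields; this is not the authors' training script).
lora_config = LoraConfig(
    r=16,                      # LoRA rank
    lora_alpha=32,             # scaling factor
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_dora=False,            # DoRA disabled
    use_rslora=False,          # RSLoRA disabled
    task_type="CAUSAL_LM",
)
```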
Evaluation

The model was evaluated on a multiple-choice QA benchmark based on Korean healthcare professional licensing examinations.
Overall Performance
| Metric | Value |
|---|---|
| Overall Accuracy | 69.06% |
| Total Samples | 3,009 |
| Correct | 2,078 |
| Extract Fail Rate | 0.00% |
| Evaluation Mode | Direct (zero-shot) |
Per-Subject Performance
| Subject | Correct | Total | Accuracy |
|---|---|---|---|
| Nurse | 687 | 878 | 78.25% |
| Pharmacist | 198 | 271 | 73.06% |
| Pharm Science | 422 | 614 | 68.73% |
| Doctor | 297 | 435 | 68.28% |
| Dentist | 474 | 811 | 58.45% |
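As a sanity check, the overall numbers follow directly from the per-subject (correct, total) counts above; plain Python, no model required:

```python
# Recompute overall accuracy from the per-subject counts in the table above.
subjects = {
    "Nurse": (687, 878),
    "Pharmacist": (198, 271),
    "Pharm Science": (422, 614),
    "Doctor": (297, 435),
    "Dentist": (474, 811),
}

correct = sum(c for c, _ in subjects.values())
total = sum(t for _, t in subjects.values())
overall = 100 * correct / total
print(f"{correct}/{total} = {overall:.2f}%")  # 2078/3009 = 69.06%
```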
Ablation Experiments

| Experiment | Accuracy | Note |
|---|---|---|
| qwen3_baseline | 60.39% | Qwen3-8B baseline (no fine-tuning) |
| qwen3_sft_001 | 64.27% | Initial SFT (2 epochs) |
| qwen3_sft_002 | 67.50% | Extended training (3 epochs) |
| qwen3_sft_v3 | 69.06% | Dataset v3 applied (best performance) |
Model Comparison

| Model | Accuracy | Parameters | License |
|---|---|---|---|
| WeIN_bio_Qwen3-8B (this model) | 69.06% | 8B | Apache 2.0 |
| Qwen3-8B (baseline) | 60.39% | 8B | Apache 2.0 |
| EXAONE 7.8B | 56.10% | 7.8B | Non-Commercial |
| Random Guess | 20.00% | - | - |
Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load base model and adapter
base_model_id = "Qwen/Qwen3-8B"
adapter_id = "dhkim0324/WeIN_bio_Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(adapter_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
base_model_id,
device_map="auto",
torch_dtype="auto",
trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, adapter_id)
# Inference
# Korean multiple-choice prompt (translation: "Answer the following medical
# multiple-choice question. Q: What is the most common cause of myocardial
# infarction? 1. Coronary atherosclerosis 2. Valvular heart disease
# 3. Myocarditis 4. Aortic dissection  Answer:")
prompt = "다음 의료 관련 객관식 문제에 답하시오.\n\n문제: 심근경색의 가장 흔한 원인은?\n1. 관상동맥 죽상경화증\n2. 심장판막질환\n3. 심근염\n4. 대동맥박리\n\n정답:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
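The 0.00% extract-fail rate reported above implies the model's answers are reliably parseable from the generated text. A minimal sketch of choice extraction (a hypothetical helper, not the authors' evaluation code; the regex and marker handling are assumptions):

```python
import re

def extract_choice(generation):
    """Return the chosen option number (1-5) from a decoded generation.

    Hypothetical helper: it looks at the text after the last '정답:'
    ("answer:") marker and returns the first digit 1-5, or None if
    nothing parseable is found.
    """
    tail = generation.rsplit("정답:", 1)[-1]  # text after the answer marker
    match = re.search(r"[1-5]", tail)
    return int(match.group()) if match else None

print(extract_choice("... 정답: 1"))  # 1
```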
Merging the Adapter

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Fold the LoRA weights into the base model for standalone deployment
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="auto")
model = PeftModel.from_pretrained(base_model, "dhkim0324/WeIN_bio_Qwen3-8B")
merged_model = model.merge_and_unload()
merged_model.save_pretrained("merged_model")
Citation

@misc{wein_bio_qwen3_2026,
title={WeIN_bio_Qwen3-8B: Korean Medical Domain LoRA Adapter for Qwen3-8B},
author={dhkim0324},
year={2026},
publisher={Hugging Face}
}