---
library_name: transformers
tags:
  - llama
  - medical
  - usmle
  - multiple-choice
  - text-classification
  - causal-lm
  - qlora
  - peft
  - instruction-tuned
---
Model Card for david125tran/llama31-medqa-qlora
This model is a QLoRA-fine-tuned version of Meta-Llama-3.1-8B-Instruct on the MedQA USMLE 4-option multiple-choice dataset. It is trained to answer USMLE-style medical questions by selecting one of four answer choices: A, B, C, or D.
Model Details
Model Description
- Model type: Causal language model (decoder-only) used as a multiple-choice classifier
- Base model: meta-llama/Llama-3.1-8B-Instruct
- Fine-tuning method: QLoRA via PEFT / SFTTrainer (TRL)
- Language: English
- Task: Medical multiple-choice question answering (USMLE-style)
- Author: David Tran (david125tran)
- License: Inherits the license and use restrictions of meta-llama/Llama-3.1-8B-Instruct and the MedQA dataset.
- Finetuned from model: meta-llama/Llama-3.1-8B-Instruct
This model is intended as a research and learning artifact in my “LLM-Fine-Tuning-Lab” repo, demonstrating how an instruction-tuned base model can be adapted into a domain-specific multiple-choice classifier using modern fine-tuning techniques (QLoRA, LoRA, 4-bit quantization).
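The adapter can be loaded on top of the 4-bit-quantized base model with transformers, bitsandbytes, and peft. The snippet below is a minimal sketch, assuming the adapter weights live in this repository (david125tran/llama31-medqa-qlora) and that you have accepted the license for the gated Llama 3.1 base model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "meta-llama/Llama-3.1-8B-Instruct"
adapter_id = "david125tran/llama31-medqa-qlora"  # assumed adapter repo id

# 4-bit NF4 quantization, matching a typical QLoRA setup.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the LoRA adapter produced by the QLoRA fine-tuning run.
model = PeftModel.from_pretrained(base_model, adapter_id)
model.eval()
```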
Model Sources
- Training code / experiments: https://github.com/david125tran/LLM-Fine-Tuning-Lab (GitHub)
- Dataset: GBaker/MedQA-USMLE-4-options on the Hugging Face Hub
Uses
Direct Use
The model can be used to:
- Answer USMLE-style medical multiple-choice questions with 4 options (A–D).
- Serve as a research baseline or teaching example for:
- QLoRA fine-tuning
- PEFT/LoRA adapters on top of an instruction-tuned base model
- Logits-based multiple-choice scoring
Typical usage pattern:
- Format the question + options into a single prompt.
- Feed it through the model in inference mode.
- Compare the final-token logits for the tokens "A", "B", "C", "D" and choose the argmax.
Downstream Use
Downstream, the model may be:
- Integrated into offline analysis pipelines for medical education research (e.g., question difficulty analysis, distractor quality, or model-assisted item generation).
- Used as a starting point for further fine-tuning (see the sketch after this list) on:
- Specific medical subdomains (cardiology, neurology, etc.)
- Different exam formats or MCQ banks
- Chain-of-thought or rationale-style training
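As a hedged sketch of that starting point, the published adapter can be reloaded in trainable mode with peft before launching a new run; the actual trl/SFTTrainer training loop is omitted because its API differs across versions:

```python
from peft import PeftModel

# Reload the adapter with its LoRA weights marked as trainable, then continue
# training on a new dataset (training loop not shown).
model = PeftModel.from_pretrained(
    base_model,                          # 4-bit base model, loaded as in the snippet above
    "david125tran/llama31-medqa-qlora",  # assumed adapter repo id
    is_trainable=True,
)
model.print_trainable_parameters()  # only the LoRA parameters should be trainable
```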
Out-of-Scope Use
This model must not be used for:
- Real-time clinical decision making or patient care.
- Providing medical advice to patients or the public.
- Any safety-critical application where incorrect outputs could cause harm.
- Automation of high-stakes licensing or certification decisions.
It is a research/educational artifact and should be treated as such.
Bias, Risks, and Limitations
- The model is trained entirely on MedQA (USMLE-style) questions, which are exam-style, synthetic, and biased toward specific curricula and styles of reasoning.
- It may:
- Overfit to exam-style phrasing and perform worse on free-form clinical narratives.
- Reflect dataset biases in disease prevalence, demographics, and clinical practice norms.
- Produce confident but incorrect answers without uncertainty estimates.
The model has only been evaluated on a small subset (~200 examples) of the MedQA distribution during this experiment and has not been rigorously validated for robustness, calibration, or fairness.
Recommendations
- Treat outputs as unverified suggestions, not as ground truth.
- Do not deploy in clinical settings.
- If used in research, combine with:
- Human expert review
- Calibration analysis (e.g., confidence vs. accuracy; see the sketch below)
- Error analysis, especially on high-confidence wrong answers
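As an illustration of the calibration analysis suggested above, the sketch below bins the softmax probability of the chosen letter (over the four options) against accuracy; the function name and inputs are hypothetical, not part of the original evaluation code:

```python
import numpy as np

def calibration_table(confidences, correct, n_bins=10):
    """Bucket predictions by confidence and report per-bucket accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)

    # Assign each prediction to a confidence bin; a confidence of exactly 1.0
    # goes into the last bin.
    idx = np.minimum((confidences * n_bins).astype(int), n_bins - 1)

    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((b / n_bins, (b + 1) / n_bins, int(mask.sum()),
                         float(confidences[mask].mean()), float(correct[mask].mean())))
    return rows  # (bin_lo, bin_hi, count, mean_confidence, accuracy) per bin
```

Well-calibrated predictions show per-bin accuracy close to the mean confidence; large gaps, especially in high-confidence bins, correspond to the "confident but incorrect" failure mode noted above.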