
---
library_name: transformers
tags:
- llama
- medical
- usmle
- multiple-choice
- text-classification
- causal-lm
- qlora
- peft
- instruction-tuned
---

Model Card for david125tran/llama31-medqa-qlora

This model is a QLoRA-fine-tuned version of Meta-Llama-3.1-8B-Instruct on the MedQA USMLE 4-option multiple-choice dataset. It is trained to answer USMLE-style medical questions by selecting one of four answer choices: A, B, C, or D.


Model Details

Model Description

  • Model type: Causal language model (decoder-only) used as a multiple-choice classifier
  • Base model: meta-llama/Llama-3.1-8B-Instruct
  • Fine-tuning method: QLoRA via PEFT / SFTTrainer (TRL)
  • Language: English
  • Task: Medical multiple-choice question answering (USMLE-style)
  • Author: David Tran (david125tran)
  • License: Inherits the license and use restrictions of meta-llama/Llama-3.1-8B-Instruct and the MedQA dataset.
  • Finetuned from model: meta-llama/Llama-3.1-8B-Instruct

This model is intended as a research and learning artifact in my “LLM-Fine-Tuning-Lab” repo, demonstrating how an instruction-tuned base model can be adapted into a domain-specific multiple-choice classifier using modern fine-tuning techniques (QLoRA, LoRA, 4-bit quantization).

Model Sources

  • Training code / experiments: (GitHub) https://github.com/david125tran/LLM-Fine-Tuning-Lab
  • Dataset: GBaker/MedQA-USMLE-4-options on the Hugging Face Hub

Uses

Direct Use

The model can be used to:

  • Answer USMLE-style medical multiple-choice questions with 4 options (A–D).
  • Serve as a research baseline or teaching example for:
    • QLoRA fine-tuning
    • PEFT/LoRA adapters on top of an instruction-tuned base model
    • Logits-based multiple-choice scoring

Typical usage pattern:

  1. Format the question + options into a single prompt.
  2. Feed it through the model in inference mode.
  3. Compare the final-token logits for the tokens "A", "B", "C", and "D", and select the letter with the highest logit (argmax).
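The steps above can be sketched as follows. `format_prompt` and `pick_answer` are illustrative helper names (not from the training repo), and the commented section assumes a standard transformers-style model/tokenizer:

```python
# Illustrative sketch of logits-based multiple-choice scoring.
# `format_prompt` and `pick_answer` are hypothetical helpers, not part of
# the original repo.

def format_prompt(question: str, options: dict[str, str]) -> str:
    """Format a question and its A-D options into a single prompt string."""
    lines = [question, ""]
    for letter in ("A", "B", "C", "D"):
        lines.append(f"{letter}. {options[letter]}")
    lines += ["", "Answer:"]
    return "\n".join(lines)

def pick_answer(letter_logits: dict[str, float]) -> str:
    """Return the answer letter with the highest final-token logit."""
    return max(letter_logits, key=letter_logits.get)

# With a loaded model and tokenizer, the per-letter logits come from the
# final position of a single forward pass, e.g. (assumed usage):
#
#   import torch
#   inputs = tokenizer(format_prompt(q, opts), return_tensors="pt").to(model.device)
#   with torch.no_grad():
#       logits = model(**inputs).logits[0, -1]  # logits at the final token
#   letter_logits = {
#       c: logits[tokenizer.encode(c, add_special_tokens=False)[0]].item()
#       for c in "ABCD"
#   }
#   answer = pick_answer(letter_logits)
```

Because only four token logits are compared, this avoids free-form generation entirely and makes the prediction deterministic.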

Downstream Use

Downstream, the model may be:

  • Integrated into offline analysis pipelines for medical education research (e.g., question difficulty analysis, distractor quality, or model-assisted item generation).
  • Used as a starting point for further fine-tuning on:
    • Specific medical subdomains (cardiology, neurology, etc.)
    • Different exam formats or MCQ banks
    • Chain-of-thought or rationale-style training

Out-of-Scope Use

This model must not be used for:

  • Real-time clinical decision making or patient care.
  • Providing medical advice to patients or the public.
  • Any safety-critical application where incorrect outputs could cause harm.
  • Automation of high-stakes licensing or certification decisions.

It is a research/educational artifact and should be treated as such.


Bias, Risks, and Limitations

  • The model is trained entirely on MedQA (USMLE-style) questions, which are exam-style, synthetic, and biased toward specific curricula and styles of reasoning.
  • It may:
    • Overfit to exam-style phrasing and perform worse on free-form clinical narratives.
    • Reflect dataset biases in disease prevalence, demographics, and clinical practice norms.
    • Produce confident but incorrect answers without uncertainty estimates.

The model has only been evaluated on a small subset (~200 examples) of the MedQA distribution during this experiment and has not been rigorously validated for robustness, calibration, or fairness.

Recommendations

  • Treat outputs as unverified suggestions, not as ground truth.
  • Do not deploy in clinical settings.
  • If used in research, combine with:
    • Human expert review
    • Calibration analysis (e.g., confidence vs. accuracy)
    • Error analysis, especially on high-confidence wrong answers
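The calibration analysis suggested above can be sketched as a simple reliability-bin computation. `softmax` and `calibration_bins` are illustrative helpers (assumptions, not repo code), and the inputs are assumed to be per-question confidences (e.g., the softmax probability of the chosen letter) plus correctness flags:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def calibration_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins and return
    (mean confidence, accuracy, count) per bin; empty bins yield (None, None, 0)."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    report = []
    for bucket in bins:
        if bucket:
            mean_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
            report.append((mean_conf, accuracy, len(bucket)))
        else:
            report.append((None, None, 0))
    return report
```

A large gap between mean confidence and accuracy in the high-confidence bins is exactly the "confident but incorrect" failure mode flagged in the limitations section.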
