---
library_name: transformers
tags:
  - llama
  - medical
  - usmle
  - multiple-choice
  - text-classification
  - causal-lm
  - qlora
  - peft
  - instruction-tuned
---
Model Card for david125tran/llama31-medqa-qlora
This model is a QLoRA-fine-tuned version of Meta-Llama-3.1-8B-Instruct on the MedQA USMLE 4-option multiple-choice dataset. It is trained to answer USMLE-style medical questions by selecting one of four answer choices: A, B, C, or D.
Model Details
Model Description
- Model type: Causal language model (decoder-only) used as a multiple-choice classifier
- Base model: meta-llama/Llama-3.1-8B-Instruct
- Fine-tuning method: QLoRA via PEFT / SFTTrainer (TRL)
- Language: English
- Task: Medical multiple-choice question answering (USMLE-style)
- Author: David Tran (david125tran)
- License: Inherits the license and use restrictions of meta-llama/Llama-3.1-8B-Instruct and the MedQA dataset.
- Finetuned from model: meta-llama/Llama-3.1-8B-Instruct
This model is intended as a research and learning artifact in my “LLM-Fine-Tuning-Lab” repo, demonstrating how an instruction-tuned base model can be adapted into a domain-specific multiple-choice classifier using modern fine-tuning techniques (QLoRA, LoRA, 4-bit quantization).
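The adapter can be loaded on top of the 4-bit-quantized base model with transformers, bitsandbytes, and peft. The snippet below is a minimal sketch, assuming the adapter weights live in this repository (david125tran/llama31-medqa-qlora) and that you have accepted the license for the gated Llama 3.1 base model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "meta-llama/Llama-3.1-8B-Instruct"
adapter_id = "david125tran/llama31-medqa-qlora"  # assumed adapter repo id

# 4-bit NF4 quantization, matching a typical QLoRA setup.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the LoRA adapter produced by the QLoRA fine-tuning run.
model = PeftModel.from_pretrained(base_model, adapter_id)
model.eval()
```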
Model Sources
- Training code / experiments: https://github.com/david125tran/LLM-Fine-Tuning-Lab (GitHub)
- Dataset: GBaker/MedQA-USMLE-4-options on the Hugging Face Hub
Uses
Direct Use
The model can be used to:
- Answer USMLE-style medical multiple-choice questions with 4 options (A–D).
- Serve as a research baseline or teaching example for:
- QLoRA fine-tuning
- PEFT/LoRA adapters on top of an instruction-tuned base model
- Logits-based multiple-choice scoring
Typical usage pattern:
- Format the question + options into a single prompt.
- Feed it through the model in inference mode.
- Compare the final-token logits for the tokens "A", "B", "C", "D" and choose the argmax.
Downstream Use
Downstream, the model may be:
- Integrated into offline analysis pipelines for medical education research (e.g., question difficulty analysis, distractor quality, or model-assisted item generation).
- Used as a starting point for further fine-tuning (see the sketch after this list) on:
- Specific medical subdomains (cardiology, neurology, etc.)
- Different exam formats or MCQ banks
- Chain-of-thought or rationale-style training
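As a hedged sketch of that starting point, the published adapter can be reloaded in trainable mode with peft before launching a new run; the actual trl/SFTTrainer training loop is omitted because its API differs across versions:

```python
from peft import PeftModel

# Reload the adapter with its LoRA weights marked as trainable, then continue
# training on a new dataset (training loop not shown).
model = PeftModel.from_pretrained(
    base_model,                          # 4-bit base model, loaded as in the snippet above
    "david125tran/llama31-medqa-qlora",  # assumed adapter repo id
    is_trainable=True,
)
model.print_trainable_parameters()  # only the LoRA parameters should be trainable
```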
Out-of-Scope Use
This model must not be used for:
- Real-time clinical decision making or patient care.
- Providing medical advice to patients or the public.
- Any safety-critical application where incorrect outputs could cause harm.
- Automation of high-stakes licensing or certification decisions.
It is a research/educational artifact and should be treated as such.
Bias, Risks, and Limitations
- The model is trained entirely on MedQA (USMLE-style) questions, which are exam-style, synthetic, and biased toward specific curricula and styles of reasoning.
- It may:
- Overfit to exam-style phrasing and perform worse on free-form clinical narratives.
- Reflect dataset biases in disease prevalence, demographics, and clinical practice norms.
- Produce confident but incorrect answers without uncertainty estimates.
The model has only been evaluated on a small subset (~200 examples) of the MedQA distribution during this experiment and has not been rigorously validated for robustness, calibration, or fairness.
Recommendations
- Treat outputs as unverified suggestions, not as ground truth.
- Do not deploy in clinical settings.
- If used in research, combine with:
- Human expert review
- Calibration analysis (e.g., confidence vs. accuracy; see the sketch below)
- Error analysis, especially on high-confidence wrong answers
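As an illustration of the calibration analysis suggested above, the sketch below bins the softmax probability of the chosen letter (over the four options) against accuracy; the function name and inputs are hypothetical, not part of the original evaluation code:

```python
import numpy as np

def calibration_table(confidences, correct, n_bins=10):
    """Bucket predictions by confidence and report per-bucket accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)

    # Assign each prediction to a confidence bin; a confidence of exactly 1.0
    # goes into the last bin.
    idx = np.minimum((confidences * n_bins).astype(int), n_bins - 1)

    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((b / n_bins, (b + 1) / n_bins, int(mask.sum()),
                         float(confidences[mask].mean()), float(correct[mask].mean())))
    return rows  # (bin_lo, bin_hi, count, mean_confidence, accuracy) per bin
```

Well-calibrated predictions show per-bin accuracy close to the mean confidence; large gaps, especially in high-confidence bins, correspond to the "confident but incorrect" failure mode noted above.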