Benchmarking and Adapting On-Device Large Language Models for Clinical Decision Support
Authors
Alif Munim* 1, Jun Ma* 1,2, Omar Ibrahim* 1, Alhusain Abdalla* 1, Shuolin Yin3, Leo Chen4, Bo Wang† 1,5,6,7,8
* Equal contribution † Corresponding author
1AI Collaborative Centre, University Health Network, Toronto, Canada
2Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
3Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada
4Division of Urology, Department of Surgery, St. Michael's Hospital, Unity Health Toronto and University of Toronto, Toronto, Canada
5Peter Munk Cardiac Centre, University Health Network, Toronto, Canada
6Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Canada
7Department of Computer Science, University of Toronto, Toronto, Canada
8Vector Institute for Artificial Intelligence, Toronto, Canada
Highlights
- LoRA fine-tuned GPT-OSS 20B for structured radiology differential diagnosis
- Trained on 1,894 EuroRad medical cases spanning diverse imaging modalities and specialties
- Generates systematic chain-of-thought reasoning: symptom mapping → differential analysis → diagnosis
- Lightweight adapter (2.27 GB) compatible with 4-bit quantization for on-device deployment
- Part of a broader benchmark study comparing on-device LLMs across medical tasks
Model Overview
This model is a LoRA fine-tuned version of unsloth/gpt-oss-20b for medical radiology diagnosis, developed as part of a study benchmarking and adapting on-device large language models for clinical decision support. Trained on EuroRad clinical cases, it generates step-by-step diagnostic reasoning from patient history and imaging findings, mapping symptoms to differentials and converging on a final diagnosis with supporting evidence.
The model employs a systematic diagnostic framework: (1) relating clinical history to imaging findings, (2) mapping findings to each differential, (3) systematic elimination of alternatives, and (4) converging on a final diagnosis with confidence reasoning.
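The four steps above also shape the prompts used at inference time. As a minimal illustrative sketch (the `build_prompt` helper and its argument names are hypothetical, not part of the released code), a prompt following this framework could be assembled like so:

```python
def build_prompt(combined_description: str, differentials: list[str]) -> str:
    """Assemble a diagnostic-reasoning prompt (hypothetical helper that
    mirrors the four-step framework described above)."""
    dd_formatted = "\n".join(f"- {d}" for d in differentials)
    return (
        "You are an expert radiologist demonstrating step-by-step "
        "diagnostic reasoning.\n\n"
        f"Case presentation:\n{combined_description}\n\n"
        f"Differential diagnoses to consider:\n{dd_formatted}\n\n"
        "Generate systematic Chain-of-Thought reasoning:\n"
        "1. Connect symptoms to findings\n"
        "2. Map to differentials\n"
        "3. Systematic elimination\n"
        "4. Converge to answer"
    )

# Example with a made-up case description and differential list
prompt = build_prompt(
    "58-year-old with flank pain; CT shows a 4 cm enhancing renal mass.",
    ["Renal cell carcinoma", "Oncocytoma", "Angiomyolipoma"],
)
```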
Model Details
| Base Model | unsloth/gpt-oss-20b |
| Fine-tuning Method | LoRA (Low-Rank Adaptation) |
| Training Framework | Unsloth |
| Task | Medical diagnosis from radiology reports |
| Training Dataset | wanglab/eurorad-gpt-oss-training-data |
| Sequence Length | 4,096 tokens |
| Quantization | 4-bit |
| Adapter Size | 2.27 GB |
| License | Apache-2.0 (research use only) |
Installation
```
pip install unsloth peft transformers accelerate bitsandbytes
```
Usage
```python
from unsloth import FastLanguageModel
from peft import PeftModel

# Load the 4-bit base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b",
    dtype=None,  # auto-detect
    max_seq_length=4096,
    load_in_4bit=True,
    full_finetuning=False,
)

# Attach the LoRA adapter
model = PeftModel.from_pretrained(
    model,
    "wanglab/on-device-LLM-gpt-oss-20b",
    is_trainable=False,
)

# Enable Unsloth's optimized inference mode
FastLanguageModel.for_inference(model)

# Example inference
prompt = """You are an expert radiologist demonstrating step-by-step diagnostic reasoning.

Case presentation:
{combined_description}

Differential diagnoses to consider:
{dd_formatted}

Generate systematic Chain-of-Thought reasoning that shows how clinicians think through cases:
1. **Connect symptoms to findings**: Link clinical presentation with imaging observations
2. **Map to differentials**: Show how findings support or contradict each differential diagnosis
3. **Systematic elimination**: Explicitly rule out less likely options with reasoning
4. **Converge to answer**: Demonstrate the logical path to the correct diagnosis"""

inputs = tokenizer(
    prompt.format(
        combined_description="...",  # clinical history + imaging findings
        dd_formatted="Diagnosis A, Diagnosis B, Diagnosis C",
    ),
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,  # sampling must be enabled for temperature to take effect
    temperature=0.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Training Details
| Framework | Unsloth |
| Dataset | wanglab/eurorad-gpt-oss-training-data |
| Cases | 1,894 EuroRad radiology cases |
| Input | Text only (clinical history + imaging report) |
| Optimization | 4-bit quantization + LoRA |
| Sequence Length | 4,096 tokens |
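For intuition on why the adapter stays lightweight, LoRA replaces each full weight update with two low-rank factors A (d×r) and B (r×k) per targeted matrix. The count below uses purely illustrative dimensions (NOT the actual GPT-OSS 20B configuration); the real adapter size also depends on rank, which modules are targeted, and the dtype used when saving:

```python
def lora_param_count(d: int, k: int, r: int, n_layers: int, mats_per_layer: int) -> int:
    """Parameters added by LoRA: each adapted d x k weight gains A (d x r) and B (r x k)."""
    per_matrix = d * r + r * k
    return per_matrix * mats_per_layer * n_layers

# Illustrative numbers only (not the real GPT-OSS 20B config):
# hidden size 4096, rank 16, 24 layers, 4 attention projections adapted.
params = lora_param_count(d=4096, k=4096, r=16, n_layers=24, mats_per_layer=4)
print(f"{params:,} trainable parameters")  # 12,582,912
```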
Citation
Citation will be updated upon arXiv submission and journal publication.
```bibtex
@article{munim2025ondevice,
  title={Benchmarking and Adapting On-Device Large Language Models for Clinical Decision Support},
  author={Munim, Alif and Ma, Jun and Ibrahim, Omar and Abdalla, Alhusain and Yin, Shuolin and Chen, Leo and Wang, Bo},
  journal={},
  year={2025}
}
```
Limitations
- Not Clinically Validated: This model has not undergone clinical validation and must not be used for actual patient diagnosis or direct patient care
- Research Purposes Only: Designed for research in medical AI and diagnostic systems
- May reflect biases present in the EuroRad training data
- Performance may vary across imaging modalities and medical specialties
- Like all LLMs, may generate plausible but incorrect information ("hallucinations")
Contact
For issues and questions, please open a discussion in this repository.
Corresponding author: Bo Wang — bowang@vectorinstitute.ai
Disclaimer: This model is for research purposes only and has not been approved for clinical use. Always consult qualified healthcare professionals for medical decisions.