# Expert-Specific Large Language Models for Radiology

This repository contains the model weights for "Expert-Specific Large Language Models for Radiology: Achieving Clinical-Grade Performance with Small Datasets".
## 🎯 Key Findings

- Expert-specific models trained on 2,175–9,016 reports achieve performance comparable to benchmark models trained on 520,442 reports
- Up to 95.05% time-efficiency gains in prospective clinical deployment
- 97.8% concordance with radiologist-finalised impressions (BERTScore F1: 0.95–1.00)
- Significantly outperform public LLMs (GPT-4, Baidu Qianfan) on all CHARM dimensions
## 📦 Available Models

### Expert-Specific Models (BLOOMZ-7B based)

| Model | Base | Training Data | Description |
|---|---|---|---|
| 7b_radiologist1 | BLOOMZ-7B | ~3,000 reports | Expert-specific model for Radiologist 1 |
| 7b_radiologist4 | BLOOMZ-7B | ~5,000 reports | Expert-specific model for Radiologist 4 |
| 7b_radiologist5 | BLOOMZ-7B | ~2,175 reports | Expert-specific model for Radiologist 5 |
### Expert-Specific Models (BLOOMZ-3B based)

| Model | Base | Training Data | Description |
|---|---|---|---|
| 3b_radiologist1 | BLOOMZ-3B | ~3,000 reports | Compact expert-specific model for Radiologist 1 |
| 3b_radiologist4 | BLOOMZ-3B | ~5,000 reports | Compact expert-specific model for Radiologist 4 |
| 3b_radiologist5 | BLOOMZ-3B | ~2,175 reports | Compact expert-specific model for Radiologist 5 |
### Benchmark SFT Models (trained on 520,442 reports)

| Model | Base | Epochs | Description |
|---|---|---|---|
| bloom_1b1_3 | BLOOMZ-1B | 3 | Benchmark SFT model (1B params, 3 epochs) |
| bloom_1b1_16 | BLOOMZ-1B | 16 | Benchmark SFT model (1B params, 16 epochs) |
| bloom_3b_3 | BLOOMZ-3B | 3 | Benchmark SFT model (3B params, 3 epochs) |
| bloom_3b_16 | BLOOMZ-3B | 16 | Benchmark SFT model (3B params, 16 epochs) |
### RLHF Models (refined with human feedback)

| Model | Base | PPO Steps | Description |
|---|---|---|---|
| rlhf_checkpoint-80 | BLOOMZ-3B | 80 | RLHF-refined model (early checkpoint) |
| rlhf_checkpoint-120 | BLOOMZ-3B | 120 | RLHF-refined model (optimal checkpoint) |
## 🚀 Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-org/7b_radiologist1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example CT findings in Chinese: "The liver is normal in size and shape, with no
# definite abnormal density seen in the parenchyma. The gallbladder is normal in
# size with a non-thickened wall; no definite abnormal density is seen in the lumen."
findings = "肝脏大小形态正常，实质内未见明确异常密度影。胆囊大小正常，壁不厚，腔内未见明确异常密度影。"

prompt = f"According to the following medical imaging description: {findings} Generate a corresponding CT image impression:"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

impression = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(impression)
```
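Note that a decoder-only model echoes the prompt, so the decoded string contains the instruction followed by the generated impression. A small pair of helpers (function names hypothetical, template copied from the example above) can build the prompt and strip it back out:

```python
PROMPT_TEMPLATE = (
    "According to the following medical imaging description: {findings} "
    "Generate a corresponding CT image impression:"
)

def build_prompt(findings: str) -> str:
    # Wrap a findings paragraph in the instruction format used in the Quick Start.
    return PROMPT_TEMPLATE.format(findings=findings.strip())

def extract_impression(decoded: str, prompt: str) -> str:
    # The decoded output begins with the prompt; keep only the generated tail.
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    return decoded.strip()
```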
## 📋 Training Details

### Hyperparameters

| Parameter | SFT (Benchmark/Expert) | RLHF (PPO) |
|---|---|---|
| Base Model | BLOOMZ-1B/3B/7B | BLOOMZ-3B |
| Learning Rate | 2×10⁻⁵ | 1.41×10⁻⁵ |
| LR Schedule | Cosine decay | Constant |
| Batch Size | 8 | 256 |
| Gradient Accumulation | 16 | 1 |
| Max Sequence Length | 2048 | 2048 |
| Weight Decay | 0.01 | 0 |
| Dropout | 0.1 | 0.1 |
| Epochs | 16 | N/A |
| PPO Epochs | N/A | 4 |
| PPO Clip Range | N/A | 0.2 |
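The SFT column specifies cosine decay from a base learning rate of 2×10⁻⁵. A minimal sketch of that schedule in pure Python (no warmup phase, which the table does not specify; `min_lr` is an assumed floor of 0):

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float = 2e-5, min_lr: float = 0.0) -> float:
    """Cosine-decay schedule: base_lr at step 0, decaying smoothly to min_lr."""
    progress = step / max(1, total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```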
### Hardware

- 8× NVIDIA A100 (40GB) GPUs
- DeepSpeed ZeRO-3 for distributed training
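For orientation, a minimal DeepSpeed ZeRO-3 JSON of the kind this setup implies (illustrative values only, not the authors' exact configuration; the gradient-accumulation value mirrors the hyperparameter table, and bf16 assumes A100-class hardware):

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": 16,
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_clipping": 1.0
}
```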
## 📈 Performance

### NLP Metrics (31,434 test reports)

| Model Type | BLEU-4 | ROUGE-L F1 | BERTScore F1 |
|---|---|---|---|
| Expert-Specific (7B) | 0.58–0.65 | 0.71–0.77 | 0.69 |
| Benchmark SFT (3B) | 0.68–0.70 | 0.81–0.82 | 0.69–0.70 |
| RLHF (3B) | 0.65–0.68 | 0.78–0.80 | 0.97 |
| GPT-4 | 0.03 | 0.13 | 0.74 |
| Baidu Qianfan | 0.05 | 0.20 | 0.69 |
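The scores above were presumably computed with standard evaluation toolkits; purely as an illustration of what the ROUGE-L column measures, here is a self-contained ROUGE-L F1 over whitespace tokens (Chinese reports would need character- or word-level segmentation first):

```python
def rouge_l_f1(reference: str, candidate: str) -> float:
    """Token-level ROUGE-L F1: harmonic mean of LCS-based recall and precision."""
    ref, cand = reference.split(), candidate.split()
    # Dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == c else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(cand)
    return 2 * recall * precision / (recall + precision)
```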
### CHARM Clinical Evaluation

Expert-specific models achieved up to 95.0% performance overlap with benchmark RLHF models on CHARM metrics (Clarity, Helpfulness, Accuracy, Redundancy, Misleading) despite using 58–239× less training data.
## ⚠️ Intended Use & Limitations

### Intended Use
- Research purposes in medical AI and radiology NLP
- Educational demonstrations of expert-specific fine-tuning approaches
- Baseline comparisons for radiology report generation systems
## 📄 License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
## 🙏 Acknowledgements
This work was supported by:
- National Natural Science Foundation of China (82472065, W2432049)
- National Key Research and Development Program of China (2022YFC2409501)
- National Center for Translational Medicine Shanghai (NRCTM(SH)-2025-11)
- Shanghai Explorer Program (24TS1414900)
- Shanghai Pujiang Program (2023PJD053)