EuroLLM-22B-MeditronFO
EuroLLM-22B-MeditronFO is a 22B-parameter medical specialist LLM, produced by supervised fine-tuning of EuroLLM-22B-Instruct on the Fully Open Meditron Corpus.
This model is part of the Fully Open Meditron family, the first end-to-end auditable pipeline for clinical LLMs, with open weights, open data, an open training recipe, and clinician-vetted corpus construction.
EuroLLM-22B-MeditronFO is preferred over its base in 67.2% of Auto-MOOVE pairwise comparisons (adjusted win rate).
- Paper: Fully Open Meditron: An Auditable Pipeline for Clinical LLMs
- Code: github.com/EPFLiGHT/FullyOpenMeditron
- Collection: MeditronFO
- Training corpus: EPFLiGHT/fully-open-meditron
Performance
Accuracy (%) on standard medical benchmarks. See the paper for full evaluation details, confidence intervals, and open-ended Auto-MOOVE results.
| Benchmark | EuroLLM-22B-Instruct | EuroLLM-22B-MeditronFO | Δ |
|---|---|---|---|
| MedMCQA | 54.94 | 54.79 | -0.15 |
| MedQA | 66.61 | 63.16 | -3.45 |
| PubMedQA | 73.60 | 78.00 | +4.40 |
| MedXpertQA | 14.61 | 14.61 | +0.00 |
| HealthBench Hard | 34.79 | 37.38 | +2.59 |
| Average | 48.91 | 49.59 | +0.68 |
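The last column of the table (fine-tuned minus base) and the Average row can be recomputed from the per-benchmark scores. The following is a small sketch that does so; the scores are copied from the table, the variable names are illustrative, and the average is assumed to be an unweighted mean over the five benchmarks.

```python
# Scores copied from the table above: (base, fine-tuned) accuracy in percent.
scores = {
    "MedMCQA": (54.94, 54.79),
    "MedQA": (66.61, 63.16),
    "PubMedQA": (73.60, 78.00),
    "MedXpertQA": (14.61, 14.61),
    "HealthBench Hard": (34.79, 37.38),
}

# Per-benchmark difference: fine-tuned minus base, rounded to two decimals.
deltas = {name: round(ft - base, 2) for name, (base, ft) in scores.items()}

# Unweighted mean over the five benchmarks for each model (an assumption).
base_avg = round(sum(base for base, _ in scores.values()) / len(scores), 2)
ft_avg = round(sum(ft for _, ft in scores.values()) / len(scores), 2)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
print(f"Average: {base_avg:.2f} -> {ft_avg:.2f} ({ft_avg - base_avg:+.2f})")
```

The gains on PubMedQA and HealthBench Hard outweigh the drop on MedQA, which is why the average moves up slightly even though two of the five benchmarks regress.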
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EPFLiGHT/EuroLLM-22B-MeditronFO"

# Load the tokenizer and the model in bfloat16, spreading layers over available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "A 62-year-old woman presents with a three-day history of dyspnea on exertion and a productive cough. What is the differential diagnosis?"},
]

# Render the conversation with the model's chat template, then tokenize the prompt.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; decode only the newly generated tokens, not the prompt.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
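Before loading the model, it can help to estimate how much accelerator memory the weights alone require. The sketch below assumes exactly 22 billion parameters (the nominal figure; the true count may differ slightly) and 2 bytes per parameter for bfloat16. It ignores activations, the KV cache, and framework overhead, so real usage will be higher. The function name is illustrative.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

# Nominal parameter count (assumption: exactly 22e9).
nominal_params = 22e9

bf16_gb = weight_memory_gb(nominal_params)      # bfloat16: 2 bytes per parameter
fp32_gb = weight_memory_gb(nominal_params, 4)   # float32: 4 bytes per parameter

print(f"bf16 weights: ~{bf16_gb:.0f} GB")
print(f"fp32 weights: ~{fp32_gb:.0f} GB")
```

Under these assumptions the bfloat16 weights take roughly 44 GB, so a single smaller GPU will generally not hold them; `device_map="auto"` lets the loader distribute layers across the devices it finds.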
Training
- Base model: EuroLLM-22B-Instruct
- Corpus: Fully Open Meditron, 601k examples (150M tokens), aggregating eight public medical QA datasets with three clinician-vetted synthetic components: exam-style QA, guideline-grounded QA from 46,469 clinical practice guidelines, and open-ended clinical vignettes
- Hardware: NVIDIA GH200 nodes
- Framework: Axolotl with FSDP v2 / DeepSpeed ZeRO-3, Flash Attention 2, bf16 mixed precision
- Decontamination: System-wide two-stage n-gram and token-alignment decontamination against all evaluation benchmarks
Full hyperparameters are in Appendix I of the paper.
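The n-gram stage of the decontamination step listed above can be illustrated with a generic overlap check. This is a minimal sketch, not the pipeline's actual implementation: the whitespace tokenization, the window size `n=8`, the function names, and the example texts are all assumptions made for illustration, and the token-alignment stage is not shown.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_texts: list[str], n: int = 8) -> set[tuple[str, ...]]:
    """Collect every n-gram that appears in any benchmark item."""
    index: set[tuple[str, ...]] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index

def is_contaminated(example: str, index: set[tuple[str, ...]], n: int = 8) -> bool:
    """Flag a training example that shares at least one n-gram with the benchmarks."""
    return not ngrams(example, n).isdisjoint(index)

# Hypothetical benchmark item and training examples.
benchmark = [
    "A 45-year-old man presents with chest pain radiating to the left arm after exertion",
]
train = [
    "Case: a 45-year-old man presents with chest pain radiating to the left arm after exertion today",
    "Describe first-line management of community-acquired pneumonia in adults",
]

index = build_benchmark_index(benchmark)
clean = [ex for ex in train if not is_contaminated(ex, index)]
print(f"kept {len(clean)} of {len(train)} training examples")
```

Exact n-gram matching only catches verbatim or near-verbatim leakage, which is presumably why a second, alignment-based stage is paired with it; see the paper for the method actually used.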
Intended Use
Research only. This model is intended to support research on medical LLMs, auditing of clinical AI systems, and reproducibility of the Fully Open Meditron pipeline.
It is not validated for clinical deployment, individual patient advice, autonomous decision-making, or any other deployment-adjacent use. Conduct independent domain-specific safety evaluation before any such use.
Citation
todo
License
Released under the Apache-2.0 license, which permits use, including commercial use, subject to attribution.