EnMed-Unified — French Medical LLM (Multi-Task)
Headline system of the EnMed family. A Qwen3-14B decoder adapted for French medical question answering through domain-adaptive continual pre-training (DAPT) on a large French health corpus, followed by multi-task LoRA fine-tuning across three QA formats simultaneously.
Phase 1 evaluation establishes 4 statistically significant wins over the un-adapted Qwen3-14B-vanilla baseline (BH-corrected, q = 0.05) with zero significant losses across nine independent (task × shot) evaluation cells.
Model Family Overview
The EnMed family consists of five variants, all built on Qwen3-14B:
| Model | Adapter | Description |
|---|---|---|
| EnMed-Unified ⭐ | DAPT + Mixed LoRA | Headline system. Multi-task adapter trained jointly on all three QA tasks. Best deployment choice — never significantly worse than the base model on any task/shot combination. |
| EnMed-DAPT | DAPT only | Domain-adapted backbone, no task-specific LoRA. Statistically indistinguishable from Qwen3-14B-vanilla — confirms DAPT does not cause catastrophic forgetting. |
| EnMed-MCQA | DAPT + MCQA LoRA | Specialised for French medical multiple-choice QA. Safe specialist: 2 significant wins on its home task, zero losses. |
| EnMed-ExtQA | DAPT + ExtQA LoRA | Specialised for clinical span extraction. Gains on MCQA and 0-shot ExtQA but degrades abstractive QA. |
| EnMed-AbsQA | DAPT + AbsQA LoRA | Specialised for abstractive generation. Paradoxically degrades its home task under LLM-as-judge scoring while improving MCQA. See Limitations. |
Intended Uses
Supported tasks
- French Medical Multiple-Choice QA — select the best answer from 4–5 candidates (e.g., medical licensing exam questions from FrenchMedMCQA / DrBenchmark)
- French Clinical Extractive QA — identify and return verbatim answer spans from French clinical case narratives (CAS corpus format)
- French Medical Abstractive QA — generate free-form answers to open-ended French medical questions (MediQAl format)
Out-of-scope uses
- ⚠️ Clinical decision support / patient-facing deployment — this is a research prototype. It has not been validated for real clinical use. Do not use outputs to guide patient care.
- English-only medical QA — the DAPT stage targets French; English capability may have drifted from the base model.
- Languages other than French — not evaluated.
- NER, summarisation, or classification — not part of the training or evaluation protocol.
Quick Start
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "brice-eloundou/EnMed-Unified"  # replace with your actual HF repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Multiple-choice QA
prompt = """Tu es un expert médical francophone. Réponds à la question suivante
en choisissant la meilleure réponse parmi les options proposées.
Question: Quelle est la principale cause d'insuffisance rénale aiguë en réanimation ?
A) Glomérulonéphrite aiguë
B) Nécrose tubulaire aiguë ischémique
C) Pyélonéphrite aiguë
D) Lithiase urinaire
Réponse:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Greedy decoding: with do_sample=False a temperature setting is ignored, so none is passed.
    out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Log-probability decoding (recommended for MCQA)
For evaluation and benchmarking, score each option under teacher forcing and select the highest-likelihood option. This matches the evaluation protocol used in the paper and avoids format-compliance failures.
```python
import torch
import torch.nn.functional as F

def score_option(model, tokenizer, prefix, option_text):
    """Return the summed token log-probability of option_text given prefix."""
    text = prefix + option_text
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    # Assumes the prefix tokenises the same way alone as it does inside `text`.
    prefix_len = tokenizer(prefix, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        # Logits at position i predict token i + 1, hence the shift by one.
        logits = model(**enc).logits[0, prefix_len - 1:-1]
    option_ids = enc["input_ids"][0, prefix_len:]
    lp = F.log_softmax(logits.float(), dim=-1)
    # Pick the log-probability of each actual option token and sum them.
    return lp.gather(-1, option_ids.unsqueeze(-1)).sum().item()

options = {"A": "Glomérulonéphrite aiguë",
           "B": "Nécrose tubulaire aiguë ischémique",
           "C": "Pyélonéphrite aiguë",
           "D": "Lithiase urinaire"}
scores = {k: score_option(model, tokenizer, prefix=prompt, option_text=v)
          for k, v in options.items()}
print("Predicted:", max(scores, key=scores.get))
```
Training Details
Base model
Qwen/Qwen3-14B — instruction-tuned release.
Stage 1 — Domain-Adaptive Continual Pre-training (DAPT)
The backbone undergoes continual pre-training on the French health corpus introduced by Mannion et al. (2026), a large openly licensed collection of French clinical and biomedical text. This stage uses no task supervision; it exposes the model to French medical vocabulary and discourse without committing to a downstream task format.
Stage 2 — Multi-Task LoRA Fine-tuning
A single LoRA adapter is trained jointly on all three downstream QA tasks, with task identifiers embedded in the prompt. This design mitigates the length/style register over-fitting that degrades single-task adapters under LLM-as-judge evaluation (see Limitations).
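The joint-training setup described above can be pictured with a small sketch. The tag strings, the field names, and the helper functions `format_example` and `build_mixed_dataset` are hypothetical: the card does not document the actual task identifiers or prompt templates, so this only illustrates the general idea of mixing tagged examples from several tasks into one training set.

```python
import random

# Hypothetical task tags; the identifiers actually used in training are not documented.
TASK_TAGS = {"mcqa": "[MCQA]", "extqa": "[ExtQA]", "absqa": "[AbsQA]"}

def format_example(task, question, answer, context=""):
    """Build one prompt/completion pair with the task identifier on the first line."""
    parts = [TASK_TAGS[task]]
    if context:
        parts.append(f"Contexte: {context}")
    parts.append(f"Question: {question}")
    parts.append("Réponse:")
    return {"task": task, "prompt": "\n".join(parts), "completion": " " + answer}

def build_mixed_dataset(per_task_examples, seed=0):
    """Pool examples from all tasks and shuffle them so batches mix task formats."""
    mixed = [
        format_example(task, **example)
        for task, examples in per_task_examples.items()
        for example in examples
    ]
    random.Random(seed).shuffle(mixed)
    return mixed

# Toy data, for illustration only.
toy = {
    "mcqa": [{"question": "Quel organe filtre le sang ? A) Rein B) Poumon", "answer": "A"}],
    "extqa": [{"question": "Quel est l'âge du patient ?", "answer": "54 ans",
               "context": "Un homme de 54 ans consulte pour une toux."}],
    "absqa": [{"question": "Qu'est-ce que l'hypertension ?",
               "answer": "Une pression artérielle durablement élevée."}],
}
dataset = build_mixed_dataset(toy)
```

Shuffling the pooled examples is one simple way to keep any single task format from dominating consecutive batches; the sampling strategy actually used is not stated in the card.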
| Hyperparameter | Value |
|---|---|
| LoRA rank r | 16 |
| LoRA scaling α | 32 |
| LoRA dropout | 0.05 |
| Target modules | Attention + MLP projection matrices |
| Quantisation | 4-bit NormalFloat (QLoRA / bitsandbytes) |
| Optimiser | AdamW (paged) |
| LR schedule | Cosine with linear warmup (3 % of steps) |
| Peak learning rate | 2 × 10⁻⁴ |
| Effective batch size | 16 (gradient accumulation) |
| Hardware | 1 × NVIDIA A100 80 GB |
| Framework | Unsloth + HuggingFace PEFT |
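As a rough illustration of the learning-rate schedule in the table, the sketch below computes a cosine decay with linear warmup over the first 3 % of steps and a peak of 2 × 10⁻⁴. The function name `lr_at` and the total step count are assumptions made for the example; the actual number of training steps is not reported.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_frac=0.03):
    """Learning rate at `step` for linear warmup followed by cosine decay to zero."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 1000  # assumed value, for illustration only
schedule = [lr_at(s, TOTAL_STEPS) for s in range(TOTAL_STEPS + 1)]
```

With these settings the warmup covers the first 30 steps. The LoRA values in the same table give an update scaling factor of α / r = 32 / 16 = 2.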
Evaluation
All eight systems were evaluated on three French medical QA tasks under 0-shot, 3-shot, and 5-shot prompting, giving a 3 × 3 grid of nine independent (task, shot) cells. For each cell, an item-level paired t-test was run against Qwen3-14B-vanilla, with Benjamini–Hochberg FDR control (q = 0.05); the Bonferroni bound is reported alongside.
| Task | Dataset | N (test) | Primary metric |
|---|---|---|---|
| Multiple-choice QA (MCQA) | FrenchMedMCQA / DrBenchmark | 622 | Accuracy |
| Extractive QA (ExtQA) | CAS clinical cases | 207 | Token-level F₁ |
| Abstractive QA (AbsQA) | MediQAl | 247–248 | LLM-as-judge 1–5 (Gemma) |
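For readers unfamiliar with the ExtQA metric, here is a minimal sketch of a token-level F₁ in the style commonly used for extractive QA. It assumes lowercasing and whitespace tokenisation; the exact normalisation used in the evaluation is not specified in this card, and `token_f1` is an illustrative name.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a predicted span and a reference span."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = token_f1("nécrose tubulaire aiguë", "nécrose tubulaire aiguë ischémique")
```

In this example three of the four reference tokens are recovered with no spurious tokens, so precision is 1, recall is 0.75, and F₁ is 6/7.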
Raw scores across all models and shot counts
The dotted line marks the Qwen3-14B-vanilla 0-shot reference. EnMed variants consistently sit above or on the reference for MCQA and ExtQA; the AbsQA panel reveals the EnMed-AbsQA collapse discussed in Limitations.
Per-task means (averaged over 0 / 3 / 5-shot)
| Model | MCQA acc. ↑ | ExtQA F₁ ↑ | AbsQA judge ↑ |
|---|---|---|---|
| EnMed-Unified ⭐ | 0.575 | 0.529 | 3.195 |
| EnMed-MCQA | 0.569 | 0.507 | 3.242 |
| EnMed-ExtQA | 0.572 | 0.533 | 3.082 |
| EnMed-DAPT | 0.546 | 0.504 | 3.242 |
| EnMed-AbsQA | 0.582 | 0.506 | 2.997 |
| Qwen3-14B-vanilla (reference) | 0.548 | 0.502 | 3.240 |
| Qwen3-8B | 0.466 | 0.511 | 3.144 |
| Mistral-7B-Instruct-v0.3 | 0.277 | 0.445 | 2.926 |
Global descriptive ranking (normalised, 9 cells)
| Model | Mean | Std |
|---|---|---|
| EnMed-Unified | 0.551 | 0.026 |
| EnMed-MCQA | 0.545 | 0.035 |
| EnMed-ExtQA | 0.542 | 0.028 |
| EnMed-DAPT | 0.537 | 0.034 |
| Qwen3-14B-vanilla | 0.537 | 0.034 |
| EnMed-AbsQA | 0.529 | 0.043 |
| Qwen3-8B | 0.505 | 0.041 |
| Mistral-7B-Instruct-v0.3 | 0.401 | 0.103 |
This ranking is descriptive only — normalisation across incomparable metric scales does not constitute a significance test.
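The card does not state how the three metrics are put on a common scale. The sketch below assumes the 1–5 judge score is min-max rescaled to [0, 1] as (score − 1) / 4 while accuracy and F₁ are used as is. Under that assumption, averaging the rounded per-task means reproduces the Mean column above to within about ±0.001; the Std column cannot be recomputed without the nine per-cell scores.

```python
# Per-task means copied from the per-task table: (MCQA accuracy, ExtQA F1, AbsQA judge score).
PER_TASK_MEANS = {
    "EnMed-Unified": (0.575, 0.529, 3.195),
    "EnMed-MCQA": (0.569, 0.507, 3.242),
    "EnMed-ExtQA": (0.572, 0.533, 3.082),
    "EnMed-DAPT": (0.546, 0.504, 3.242),
    "Qwen3-14B-vanilla": (0.548, 0.502, 3.240),
    "EnMed-AbsQA": (0.582, 0.506, 2.997),
    "Qwen3-8B": (0.466, 0.511, 3.144),
    "Mistral-7B-Instruct-v0.3": (0.277, 0.445, 2.926),
}

def normalised_mean(mcqa_acc, extqa_f1, absqa_judge):
    """Average the three metrics after mapping the judge score onto [0, 1]."""
    # Assumption: min-max rescaling of the 1-5 judge scale.
    judge_01 = (absqa_judge - 1.0) / 4.0
    return (mcqa_acc + extqa_f1 + judge_01) / 3.0

ranking = sorted(
    ((name, normalised_mean(*scores)) for name, scores in PER_TASK_MEANS.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name:26s} {score:.3f}")
```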
Normalised scores across all 9 (task × shot) cells
Per-cell deltas versus Qwen3-14B-vanilla
Item-level paired t-tests with 95 % confidence intervals
Positive bars mean the EnMed variant outperforms the reference; negative bars mean the opposite. Only starred bars represent statistically significant differences.
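A per-cell delta with its confidence interval can be sketched as follows. This is not the authors' evaluation code: `paired_delta_ci` is an illustrative helper, and it uses the normal critical value 1.96 in place of the exact Student t quantile, which is a reasonable approximation only for test sets of a few hundred items such as those used here.

```python
import math
import statistics

def paired_delta_ci(model_scores, ref_scores, z=1.96):
    """Mean per-item difference (model minus reference), approximate 95 % CI, and t statistic."""
    if len(model_scores) != len(ref_scores):
        raise ValueError("paired test needs one score per item for both systems")
    diffs = [m - r for m, r in zip(model_scores, ref_scores)]
    mean = statistics.fmean(diffs)
    # Standard error of the mean difference (sample standard deviation, n - 1 denominator).
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    t_stat = mean / se if se > 0 else float("inf")
    return mean, (mean - z * se, mean + z * se), t_stat

# Toy per-item correctness (1 = correct); real cells have 207 to 622 items.
model_items = [1, 1, 0, 1, 1, 0, 1, 1]
ref_items = [0, 1, 0, 1, 0, 0, 1, 0]
delta, (ci_low, ci_high), t_stat = paired_delta_ci(model_items, ref_items)
```

A bar is drawn at `delta`, and the interval excluding zero corresponds to an uncorrected significant difference; the multiple-comparison correction is applied separately across cells.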
Significance heatmap — per-cell annotated deltas
Statistical significance record vs. Qwen3-14B-vanilla
(9 independent item-level paired t-tests; α = 0.05; BH-corrected wins marked)
| Model | Sig. wins / 9 | Sig. losses / 9 | Verdict |
|---|---|---|---|
| EnMed-Unified ⭐ | 4 ✅ BH-robust | 0 | Significantly better on MCQA-0, MCQA-3, ExtQA-0, ExtQA-3; never worse |
| EnMed-MCQA | 2 | 0 | Safe MCQA specialist |
| EnMed-ExtQA | 3 | 3 | Mixed: wins MCQA + ExtQA-0, loses all AbsQA cells |
| EnMed-AbsQA | 3 | 3 | Mixed: wins all MCQA, loses all AbsQA |
| EnMed-DAPT | 0 | 0 | Indistinguishable from reference — confirms DAPT safety |
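For reference, the two corrections named in the evaluation protocol can be sketched in a few lines. The p-values below are toy numbers chosen for illustration, not the reported results, and the function names are illustrative.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    rejected = [False] * m
    # Reject every hypothesis ranked at or below the cutoff.
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected

def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Toy p-values for nine (task, shot) cells; illustrative only.
toy_p = [0.001, 0.004, 0.010, 0.020, 0.030, 0.200, 0.400, 0.600, 0.900]
bh_flags = benjamini_hochberg(toy_p)
bonf_flags = bonferroni(toy_p)
```

With nine tests, Bonferroni requires p ≤ 0.05 / 9 ≈ 0.0056, which is why it is the stricter of the two bounds.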
Best model at every (task × shot) cell
No single system wins all nine cells: EnMed-AbsQA leads MCQA, EnMed-ExtQA leads 0- and 5-shot ExtQA, and AbsQA cells split across EnMed-DAPT, Qwen3-14B-vanilla and EnMed-MCQA. EnMed-Unified does not lead any single cell but is never the worst.
Critical Difference diagrams — rank analysis per shot count
Average rank across the three tasks (lower = better). Critical difference CD = 6.06.
The CD (6.06) exceeds the observed rank spread, so these diagrams are descriptive consensus rankings — they corroborate but do not independently prove the item-level findings above.
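The reported value is consistent with the standard Nemenyi critical difference, CD = q_α · sqrt(k(k + 1) / (6N)), for k = 8 systems ranked over N = 3 tasks at α = 0.05. Both the choice of test and the tabulated critical value q_0.05 = 3.031 for eight systems are assumptions here, since the card does not name the procedure.

```python
import math

def nemenyi_cd(num_systems, num_datasets, q_alpha):
    """Critical difference in average rank for the Nemenyi post-hoc test."""
    return q_alpha * math.sqrt(num_systems * (num_systems + 1) / (6 * num_datasets))

# Assumed inputs: 8 systems, 3 tasks per shot count, q_0.05 = 3.031 for k = 8.
cd = nemenyi_cd(num_systems=8, num_datasets=3, q_alpha=3.031)
print(f"CD = {cd:.2f}")
```

Because average ranks among eight systems can differ by at most 7, a critical difference above 6 leaves almost no room for a significant rank gap with only three tasks.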
Limitations
Multiplicity. Benjamini–Hochberg correction at q = 0.05 confirms EnMed-Unified's four headline wins. Weaker cells (e.g., ExtQA-3, MCQA-5) do not survive correction and should be treated as suggestive.
Distributional assumptions. Paired t-tests assume approximately normal per-item differences, which may not hold for binary MCQA outcomes or ordinal 1–5 judge scores. A fully ordinal-aware treatment remains future work.
Single-judge evaluation. AbsQA scores were generated by a single Gemma-family LLM-as-judge. Single-judge evaluations are susceptible to judge-specific biases; a predominantly English-trained judge may under-reward answers correct under French clinical conventions. Judge diversity and order-invariance checks have not been conducted.
Task-specific adapter paradox. EnMed-AbsQA improves MCQA while significantly degrading its own nominal home task under LLM-as-judge scoring, and EnMed-ExtQA likewise loses all AbsQA cells despite gains on MCQA and 0-shot ExtQA. We attribute this to over-fitting to a length/style register the judge penalises. Multi-task training (EnMed-Unified) mitigates this.
Phase 2 not yet released. This is the Phase 1 model. The full cross-lingual continual pre-training pipeline (English biomedical → French medical transfer) will be released as EnMed-Phase2.
⚠️ Not for clinical deployment. This model has not been clinically validated. Do not use it for patient-facing applications or clinical decision support.
Citation
The associated paper has been submitted to Springer Lecture Notes in Computer Science (LNCS) and is currently under review. If you use EnMed-Unified or any member of the EnMed family, please cite the preprint version:
@unpublished{abodoeloundou2025enmed,
title = {Cross-Lingual Domain Adaptation and Multi-Task Fine-Tuning
for High-Fidelity Medical Language Models},
author = {Abodo Eloundou, Brice Donald and Malykh, Valentin},
note = {Submitted to Springer Lecture Notes in Computer Science (LNCS).
Under review. ITMO University / MTS Web Services,
Saint Petersburg, Russia},
year = {2026}
}
This entry will be updated to a full @inproceedings citation upon acceptance.
If you use the French health pre-training corpus, please also cite:
@article{mannion2026biomedical,
title = {Is biomedical specialization still worth it?
Insights from domain-adaptive language modelling
with a new French health corpus},
author = {Mannion, A. and Macaire, C. and Violle, A. and
Ohayon, S. and Tannier, X. and Schwab, D. and others},
journal = {arXiv preprint arXiv:2604.06903},
year = {2026}
}
Acknowledgements
Research conducted at ITMO University, Saint Petersburg, Russia and MTS Web Services, Saint Petersburg, Russia.
Authors:
- Brice Donald Abodo Eloundou — ITMO University | ORCID: 0009-0009-1845-5867
- Valentin Malykh — MTS Web Services / ITMO University
Evaluation benchmarks: DrBenchmark (Labrak et al., 2024), FrenchMedMCQA (Labrak et al., 2022), MediQAl (Bazoge, 2025), CAS corpus (Grabar et al., 2020).
License
Released under Apache 2.0, consistent with the Qwen3-14B base model license. The pre-training corpus license follows Mannion et al. (2026); users are responsible for compliance with that corpus's terms.
Clinical use warning: This model is a research artefact. Any use in clinical or patient-facing settings requires independent clinical validation and regulatory approval in the applicable jurisdiction.