How to use from the Transformers library
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="boods/EnToFrMedicaLLM-Multilingual")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("boods/EnToFrMedicaLLM-Multilingual")
model = AutoModelForCausalLM.from_pretrained("boods/EnToFrMedicaLLM-Multilingual")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

EnMed-Unified — French Medical LLM (Multi-Task)

Headline system of the EnMed family. A Qwen3-14B decoder adapted for French medical question answering through domain-adaptive continual pre-training (DAPT) on a large French health corpus, followed by multi-task LoRA fine-tuning across three QA formats simultaneously.

Phase 1 evaluation establishes 4 statistically significant wins over the un-adapted Qwen3-14B-vanilla baseline (BH-corrected, q = 0.05) with zero significant losses across nine independent (task × shot) evaluation cells.


Model Family Overview

The EnMed family consists of five variants, all built on Qwen3-14B:

| Model | Adapter | Description |
| --- | --- | --- |
| EnMed-Unified | DAPT + Mixed LoRA | Headline system. Multi-task adapter trained jointly on all three QA tasks. Best deployment choice: never significantly worse than the base model on any task/shot combination. |
| EnMed-DAPT | DAPT only | Domain-adapted backbone, no task-specific LoRA. Statistically indistinguishable from Qwen3-14B-vanilla, which confirms DAPT does not cause catastrophic forgetting. |
| EnMed-MCQA | DAPT + MCQA LoRA | Specialised for French medical multiple-choice QA. Safe specialist: 2 significant wins on its home task, zero losses. |
| EnMed-ExtQA | DAPT + ExtQA LoRA | Specialised for clinical span extraction. Gains on MCQA and 0-shot ExtQA but degrades abstractive QA. |
| EnMed-AbsQA | DAPT + AbsQA LoRA | Specialised for abstractive generation. Paradoxically degrades its home task under LLM-as-judge scoring while improving MCQA. See Limitations. |

Intended Uses

Supported tasks

  • French Medical Multiple-Choice QA — select the best answer from 4–5 candidates (e.g., medical licensing exam questions from FrenchMedMCQA / DrBenchmark)
  • French Clinical Extractive QA — identify and return verbatim answer spans from French clinical case narratives (CAS corpus format)
  • French Medical Abstractive QA — generate free-form answers to open-ended French medical questions (MediQAl format)

Out-of-scope uses

  • ⚠️ Clinical decision support / patient-facing deployment — this is a research prototype. It has not been validated for real clinical use. Do not use outputs to guide patient care.
  • English-only medical QA — the DAPT stage targets French; English capability may have drifted from the base model.
  • Languages other than French — not evaluated.
  • NER, summarisation, or classification — not part of the training or evaluation protocol.

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "brice-eloundou/EnMed-Unified"   # replace with your actual HF repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# ── Multiple-Choice QA ───────────────────────────────────────────────────────
prompt = """Tu es un expert médical francophone. Réponds à la question suivante
en choisissant la meilleure réponse parmi les options proposées.

Question: Quelle est la principale cause d'insuffisance rénale aiguë en réanimation ?
A) Glomérulonéphrite aiguë
B) Nécrose tubulaire aiguë ischémique
C) Pyélonéphrite aiguë
D) Lithiase urinaire

Réponse:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=16, do_sample=False)  # greedy decoding; temperature has no effect without sampling
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
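
Free-form generation may return more than a bare letter, for example "B) Nécrose tubulaire aiguë ischémique" or "La réponse est C.". The sketch below shows one way to recover the chosen option from such output. The helper name `extract_choice` and the regular expression are illustrative assumptions, not part of the released evaluation code, and the pattern can misfire on answers that start with a standalone capital "A" used as a word.

```python
import re
from typing import Optional

# Hypothetical helper: matches a standalone capital letter A-E.
_CHOICE_RE = re.compile(r"\b([A-E])\b")

def extract_choice(generated: str, valid: str = "ABCDE") -> Optional[str]:
    """Return the first standalone option letter found in `generated`, or None."""
    for match in _CHOICE_RE.finditer(generated.strip()):
        letter = match.group(1)
        if letter in valid:
            return letter
    return None

print(extract_choice("B) Nécrose tubulaire aiguë ischémique"))
print(extract_choice("La réponse est C."))
print(extract_choice("Je ne sais pas."))
```

Returning `None` for unparseable output lets the caller count format-compliance failures separately from wrong answers.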

Log-probability decoding (recommended for MCQA)

For evaluation and benchmarking, score each option under teacher forcing and select the option with the highest log-likelihood. This matches the evaluation protocol used in the paper and avoids format-compliance failures.

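import torch
import torch.nn.functional as F

def score_option(model, tokenizer, prefix, option_text):
    """Sum of log-probabilities of the option tokens given the prefix (teacher forcing)."""
    text = prefix + option_text
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    # Number of prefix tokens. Assumes tokenizing the prefix alone yields the
    # same leading tokens as tokenizing prefix + option_text.
    prefix_len = tokenizer(prefix, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        # Logits at position i predict token i + 1, hence the shift by one.
        logits = model(**enc).logits[0, prefix_len-1:-1]
        option_ids = enc["input_ids"][0, prefix_len:]
        # Compute log-probabilities in float32 to limit bfloat16 rounding error.
        lp = F.log_softmax(logits.float(), dim=-1)
        return lp.gather(-1, option_ids.unsqueeze(-1)).sum().item()

options = {"A": "Glomérulonéphrite aiguë",
           "B": "Nécrose tubulaire aiguë ischémique",
           "C": "Pyélonéphrite aiguë",
           "D": "Lithiase urinaire"}
# The leading space keeps the option separate from "Réponse:" and leaves the
# prefix tokenization unchanged. Scores are summed, not length-normalised.
scores = {k: score_option(model, tokenizer, prefix=prompt, option_text=" " + v)
          for k, v in options.items()}
print("Predicted:", max(scores, key=scores.get))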

Training Details

Base model

Qwen/Qwen3-14B — instruction-tuned release.

Stage 1 — Domain-Adaptive Continual Pre-training (DAPT)

The backbone undergoes continual pre-training on the French health corpus introduced by Mannion et al. (2026), a large openly licensed collection of French clinical and biomedical text. This stage uses no task supervision; it exposes the model to French medical vocabulary and discourse without committing to a downstream task format.

Stage 2 — Multi-Task LoRA Fine-tuning

A single LoRA adapter is trained jointly on all three downstream QA tasks, with task identifiers embedded in the prompt. This design prevents the length/style register over-fitting that degrades single-task adapters under LLM-as-judge evaluation (see Limitations).
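
As a rough illustration of what "task identifiers embedded in the prompt" could look like, the sketch below prefixes each example with a tag before the question. The tag strings, the `Contexte:` field name, and the `build_prompt` helper are hypothetical; the identifiers actually used during training are not given here.

```python
# Hypothetical task tags; the real identifiers used in training are not documented here.
TASK_TAGS = {"mcqa": "[MCQA]", "extqa": "[ExtQA]", "absqa": "[AbsQA]"}

def build_prompt(task, question, context=None, options=None):
    """Prefix a task identifier so one adapter can tell the three QA formats apart."""
    if task not in TASK_TAGS:
        raise ValueError(f"unknown task: {task}")
    parts = [TASK_TAGS[task]]
    if context:                       # clinical case narrative for extractive QA
        parts.append(f"Contexte: {context}")
    parts.append(f"Question: {question}")
    if options:                       # lettered candidates for multiple-choice QA
        parts.extend(f"{letter}) {text}" for letter, text in options.items())
    parts.append("Réponse:")
    return "\n".join(parts)

example = build_prompt(
    "mcqa",
    "Quelle est la principale cause d'insuffisance rénale aiguë en réanimation ?",
    options={"A": "Glomérulonéphrite aiguë", "B": "Nécrose tubulaire aiguë ischémique"},
)
print(example)
```

Mixing examples built this way from all three tasks into one training set is what lets a single adapter see every answer format during fine-tuning.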

| Hyperparameter | Value |
| --- | --- |
| LoRA rank r | 16 |
| LoRA scaling α | 32 |
| LoRA dropout | 0.05 |
| Target modules | Attention + MLP projection matrices |
| Quantisation | 4-bit NormalFloat (QLoRA / bitsandbytes) |
| Optimiser | AdamW (paged) |
| LR schedule | Cosine with linear warmup (3 % of steps) |
| Peak learning rate | 2 × 10⁻⁴ |
| Effective batch size | 16 (gradient accumulation) |
| Hardware | 1 × NVIDIA A100 80 GB |
| Framework | Unsloth + HuggingFace PEFT |
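
To make the schedule row concrete, here is a minimal sketch of a cosine schedule with 3 % linear warmup and a peak of 2 × 10⁻⁴. The total step count and the decay floor of zero are assumptions, since neither is stated above, and the exact warmup rounding used by the training framework may differ.

```python
import math

PEAK_LR = 2e-4        # from the hyperparameter table
WARMUP_FRAC = 0.03    # from the hyperparameter table: 3 % of steps

def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_frac=WARMUP_FRAC):
    """Linear warmup to peak_lr, then cosine decay to zero (the floor is an assumption)."""
    warmup_steps = max(1, int(round(warmup_frac * total_steps)))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 1000    # illustrative only; the real step count is not stated
for s in (0, 29, 30, 515, 999):
    print(s, f"{lr_at(s, TOTAL_STEPS):.2e}")

# In the standard LoRA formulation the low-rank update is scaled by alpha / r.
lora_scale = 32 / 16
print("LoRA scale:", lora_scale)
```

With r = 16 and α = 32 the update is scaled by a factor of 2, and an effective batch size of 16 on one GPU would typically come from a smaller per-device batch multiplied by the number of gradient accumulation steps.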

Evaluation

All eight systems were evaluated on three French medical QA tasks under 0-shot, 3-shot, and 5-shot prompting — a 3 × 3 grid of nine independent (task, shot) cells. Item-level paired t-tests were conducted per cell against Qwen3-14B-vanilla, with Benjamini–Hochberg FDR control (q = 0.05) and Bonferroni bound reported alongside.
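
For readers unfamiliar with the correction, the following is a minimal sketch of the Benjamini–Hochberg step-up procedure next to the Bonferroni bound for nine tests. The p-values are made up for illustration and are not the values obtained in the evaluation.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    # Find the largest 1-based rank k such that p_(k) <= (k / m) * q.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected

# Illustrative p-values for nine cells; NOT the values from the paper.
p_cells = [0.001, 0.004, 0.010, 0.020, 0.030, 0.200, 0.450, 0.700, 0.900]
bh = benjamini_hochberg(p_cells, q=0.05)
bonferroni = [p <= 0.05 / len(p_cells) for p in p_cells]
print("BH rejections:", sum(bh))
print("Bonferroni rejections:", sum(bonferroni))
```

Bonferroni controls the family-wise error rate and is therefore stricter, which is why a cell can survive BH correction yet fall outside the Bonferroni bound.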

| Task | Dataset | N (test) | Primary metric |
| --- | --- | --- | --- |
| Multiple-choice QA (MCQA) | FrenchMedMCQA / DrBenchmark | 622 | Accuracy |
| Extractive QA (ExtQA) | CAS clinical cases | 207 | Token-level F₁ |
| Abstractive QA (AbsQA) | MediQAl | 247–248 | LLM-as-judge 1–5 (Gemma) |
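
Token-level F₁ for extractive QA is commonly computed SQuAD-style, as the harmonic mean of token precision and recall between the predicted and reference spans. The sketch below assumes lowercasing and whitespace tokenization; the exact normalisation used in this evaluation is not specified here and may differ.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1 on lowercased, whitespace-split tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = token_f1("nécrose tubulaire aiguë", "nécrose tubulaire aiguë ischémique")
print(round(score, 3))
```

In this example the prediction has perfect precision but misses one of four reference tokens, so the score is penalised only through recall.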

Raw scores across all models and shot counts

Raw scores per model per shot count across MCQA (accuracy), ExtQA (token-F1) and AbsQA (LLM-as-judge). The dotted line marks Qwen3-14B-vanilla 0-shot performance.

EnMed variants consistently sit above or on the reference for MCQA and ExtQA; the AbsQA panel reveals the EnMed-AbsQA collapse discussed in Limitations.


Per-task means (averaged over 0 / 3 / 5-shot)

| Model | MCQA acc. ↑ | ExtQA F₁ ↑ | AbsQA judge ↑ |
| --- | --- | --- | --- |
| EnMed-Unified | 0.575 | 0.529 | 3.195 |
| EnMed-MCQA | 0.569 | 0.507 | 3.242 |
| EnMed-ExtQA | 0.572 | 0.533 | 3.082 |
| EnMed-DAPT | 0.546 | 0.504 | 3.242 |
| EnMed-AbsQA | 0.582 | 0.506 | 2.997 |
| Qwen3-14B-vanilla (reference) | 0.548 | 0.502 | 3.240 |
| Qwen3-8B | 0.466 | 0.511 | 3.144 |
| Mistral-7B-Instruct-v0.3 | 0.277 | 0.445 | 2.926 |

Per-task means ± 1 std across the three shot counts. Hatched bar = Qwen3-14B-vanilla reference; red dashed line = its mean. Descriptive only.


Global descriptive ranking (normalised, 9 cells)

Global descriptive ranking: mean normalised score across the 9 (task, shot) cells ± 1 std. The dashed line marks the Qwen3-14B-vanilla mean of 0.537. EnMed-Unified leads with mean 0.551 and the smallest standard deviation.

| Model | Mean | Std |
| --- | --- | --- |
| EnMed-Unified | 0.551 | 0.026 |
| EnMed-MCQA | 0.545 | 0.035 |
| EnMed-ExtQA | 0.542 | 0.028 |
| EnMed-DAPT | 0.537 | 0.034 |
| Qwen3-14B-vanilla | 0.537 | 0.034 |
| EnMed-AbsQA | 0.529 | 0.043 |
| Qwen3-8B | 0.505 | 0.041 |
| Mistral-7B-Instruct-v0.3 | 0.401 | 0.103 |

This ranking is descriptive only — normalisation across incomparable metric scales does not constitute a significance test.
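
The Mean column can be recovered to within about 0.001 from the per-task means if accuracy and F₁ are used as-is and the 1–5 judge score is mapped linearly onto 0–1. That mapping is an inference from the published numbers, not a documented procedure, and the Std column cannot be reproduced this way because it is computed over the nine individual cells.

```python
# Per-task means copied from the table above: (MCQA acc., ExtQA F1, AbsQA judge 1-5).
PER_TASK = {
    "EnMed-Unified": (0.575, 0.529, 3.195),
    "EnMed-MCQA": (0.569, 0.507, 3.242),
    "EnMed-ExtQA": (0.572, 0.533, 3.082),
    "EnMed-DAPT": (0.546, 0.504, 3.242),
    "EnMed-AbsQA": (0.582, 0.506, 2.997),
    "Qwen3-14B-vanilla": (0.548, 0.502, 3.240),
    "Qwen3-8B": (0.466, 0.511, 3.144),
    "Mistral-7B-Instruct-v0.3": (0.277, 0.445, 2.926),
}

def global_mean(mcqa, extqa, judge):
    """Assumed normalisation: accuracy and F1 unchanged, judge mapped from [1, 5] to [0, 1]."""
    return (mcqa + extqa + (judge - 1.0) / 4.0) / 3.0

# Print systems from best to worst under the assumed normalisation.
ranking = sorted(PER_TASK, key=lambda name: -global_mean(*PER_TASK[name]))
for name in ranking:
    print(f"{name:26s} {global_mean(*PER_TASK[name]):.3f}")
```

Small residual differences are expected because this averages three rounded task means rather than the nine underlying cells.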


Normalised scores across all 9 (task × shot) cells

Normalised scores across the 9 (task, shot) cells. Each cell is rescaled so that the worst-performing system maps to 0 and the best to 1. Rows sorted by descending global mean.
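
The per-cell rescaling described in this caption is ordinary min-max normalisation. The sketch below applies it to the shot-averaged MCQA accuracies from the per-task table purely as an illustration; the figure itself normalises each of the nine (task, shot) cells separately, and those per-cell values are not listed here.

```python
def min_max(cell_scores):
    """Rescale one cell so the worst system maps to 0 and the best to 1."""
    lo, hi = min(cell_scores.values()), max(cell_scores.values())
    if hi == lo:                       # all systems tied: avoid division by zero
        return {name: 0.0 for name in cell_scores}
    return {name: (s - lo) / (hi - lo) for name, s in cell_scores.items()}

# Shot-averaged MCQA accuracies, used here only as example input.
mcqa = {
    "EnMed-Unified": 0.575, "EnMed-MCQA": 0.569, "EnMed-ExtQA": 0.572,
    "EnMed-DAPT": 0.546, "EnMed-AbsQA": 0.582, "Qwen3-14B-vanilla": 0.548,
    "Qwen3-8B": 0.466, "Mistral-7B-Instruct-v0.3": 0.277,
}
normalised = min_max(mcqa)
for name, value in sorted(normalised.items(), key=lambda kv: -kv[1]):
    print(f"{name:26s} {value:.3f}")
```

Because the scale is anchored to the weakest system in each cell, one very low outlier compresses the differences among the remaining systems.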


Per-cell deltas versus Qwen3-14B-vanilla

Per-cell delta of each EnMed candidate against Qwen3-14B-vanilla. Positive (red) = candidate outperforms reference. Three panels: MCQA accuracy, ExtQA token-F1, AbsQA LLM-as-judge.


Item-level paired t-tests with 95 % confidence intervals

Item-level paired t-tests against Qwen3-14B-vanilla. Each bar is the mean delta ± 95% CI computed from N=622 (MCQA), N=207 (ExtQA), N≈248 (AbsQA) paired observations. Stars: * p<0.05, ** p<0.01, *** p<0.001. Inferential figure.

Positive bars mean the EnMed variant outperforms the reference; negative bars mean the opposite. Only starred bars represent statistically significant differences.
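
A minimal sketch of how one such bar could be computed from per-item scores follows. It uses the normal quantile 1.96 in place of the exact Student-t quantile, which is a close approximation at the sample sizes reported here but not at the toy size used below. The data are invented and the function name is illustrative.

```python
import math
from statistics import mean, stdev

def paired_delta_ci(candidate, reference, z=1.96):
    """Mean paired difference, t statistic, and an approximate 95 % CI."""
    if len(candidate) != len(reference):
        raise ValueError("paired samples must have equal length")
    deltas = [c - r for c, r in zip(candidate, reference)]
    n = len(deltas)
    d_bar = mean(deltas)
    se = stdev(deltas) / math.sqrt(n)            # standard error of the mean difference
    t_stat = d_bar / se if se > 0 else float("nan")
    return d_bar, t_stat, (d_bar - z * se, d_bar + z * se)

# Toy per-item correctness (1 = correct); NOT data from the evaluation.
cand = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
ref = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
d_bar, t_stat, ci = paired_delta_ci(cand, ref)
print(f"delta={d_bar:.3f}  t={t_stat:.3f}  CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

A bar is starred when the corresponding p-value falls below the stated thresholds, which for large N roughly coincides with the interval excluding zero.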


Significance heatmap — per-cell annotated deltas

Per-cell signed delta of each EnMed candidate against Qwen3-14B-vanilla annotated with paired-t significance (* p<0.05, ** p<0.01, *** p<0.001; ns otherwise). Reading a row gives the per-system win/loss record.


Statistical significance record vs. Qwen3-14B-vanilla

(9 independent item-level paired t-tests; α = 0.05; BH-corrected wins marked)

| Model | Sig. wins / 9 | Sig. losses / 9 | Verdict |
| --- | --- | --- | --- |
| EnMed-Unified | 4 ✅ BH-robust | 0 | Significantly better on MCQA-0, MCQA-3, ExtQA-0, ExtQA-3; never worse |
| EnMed-MCQA | 2 | 0 | Safe MCQA specialist |
| EnMed-ExtQA | 3 | 3 | Mixed: wins MCQA + ExtQA-0, loses all AbsQA cells |
| EnMed-AbsQA | 3 | 3 | Mixed: wins all MCQA, loses all AbsQA |
| EnMed-DAPT | 0 | 0 | Indistinguishable from reference; confirms DAPT safety |

Significance record across all 9 (task, shot) cells per system: dark green = sig. wins, light green = numeric wins, light red = numeric losses, dark red = sig. losses. Dotted line = 4.5-cell majority threshold.


Best model at every (task × shot) cell

Best-performing system at every (task, shot) cell. Each cell is coloured by system identity and labelled with the winning raw score. No single model wins all 9 cells.

No single system wins all nine cells: EnMed-AbsQA leads MCQA, EnMed-ExtQA leads 0- and 5-shot ExtQA, and AbsQA cells split across EnMed-DAPT, Qwen3-14B-vanilla and EnMed-MCQA. EnMed-Unified does not lead any single cell but is never the worst.


Critical Difference diagrams — rank analysis per shot count

Average rank across the three tasks (lower = better). Critical difference CD = 6.06.

Critical Difference diagram, 0-shot. Average rank of each system across 3 tasks. CD=6.06. EnMed-Unified and EnMed-ExtQA are tied best-ranked at 3.00; Mistral-7B is worst at 7.67.

Critical Difference diagram, 3-shot. EnMed-Unified leads at 2.83; Mistral-7B is worst at 8.00. CD=6.06.

Critical Difference diagram, 5-shot. EnMed-MCQA leads at 2.33; EnMed-Unified second at 3.00. Mistral-7B worst at 8.00. CD=6.06.

The CD (6.06) exceeds the observed rank spread, so these diagrams are descriptive consensus rankings — they corroborate but do not independently prove the item-level findings above.
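
The reported value is consistent with the standard Nemenyi critical difference, CD = q_α · sqrt(k(k+1) / (6N)), for k = 8 systems ranked over N = 3 tasks at α = 0.05. The sketch below assumes that formula and uses the critical values tabulated by Demšar (2006); whether the authors computed the CD exactly this way is not stated.

```python
import math

# Nemenyi critical values q_alpha at alpha = 0.05, indexed by the number of
# systems k, as tabulated in Demšar (2006).
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
              7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}

def critical_difference(k, n_datasets, q_table=Q_ALPHA_05):
    """Minimum average-rank gap for two systems to differ significantly."""
    return q_table[k] * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

cd = critical_difference(k=8, n_datasets=3)
print(f"CD = {cd:.2f}")

# Largest observed rank spread per shot count, from the captions above.
spreads = {"0-shot": 7.67 - 3.00, "3-shot": 8.00 - 2.83, "5-shot": 8.00 - 2.33}
print({shot: round(gap, 2) for shot, gap in spreads.items()})
```

With only three tasks the square-root term is large, so the CD exceeds every observed gap between best and worst average rank.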


Limitations

Multiplicity. Benjamini–Hochberg correction at q = 0.05 confirms EnMed-Unified's four headline wins. Weaker cells (e.g., ExtQA-3, MCQA-5) do not survive correction and should be treated as suggestive.

Distributional assumptions. Paired t-tests assume approximately normal per-item differences, which may not hold for binary MCQA outcomes or ordinal 1–5 judge scores. A fully ordinal-aware treatment remains future work.
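
One assumption-light alternative to the paired t-test is a sign-flip permutation test on the per-item differences, which does not require normality. The sketch below is offered only as an illustration of that technique; it was not used in the reported evaluation, and the data are invented.

```python
import random

def sign_flip_permutation_test(deltas, n_resamples=10_000, seed=0):
    """Two-sided Monte Carlo p-value for mean(deltas) == 0 under random sign flips."""
    rng = random.Random(seed)
    n = len(deltas)
    observed = abs(sum(deltas) / n)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to have either sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in deltas)
        if abs(flipped / n) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)

# Toy per-item differences (candidate minus reference); NOT evaluation data.
toy = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
p = sign_flip_permutation_test(toy)
print(f"p = {p:.3f}")
```

Zero differences contribute nothing under sign flips, so the result depends only on the items where the two systems disagree, much like McNemar's test for binary outcomes.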

Single-judge evaluation. AbsQA scores were generated by a single Gemma-family LLM-as-judge. Single-judge evaluations are susceptible to judge-specific biases; a predominantly English-trained judge may under-reward answers correct under French clinical conventions. Judge diversity and order-invariance checks have not been conducted.

Task-specific adapter paradox. EnMed-AbsQA improves MCQA while significantly degrading its own home task under LLM-as-judge scoring, and EnMed-ExtQA likewise improves MCQA but loses every AbsQA cell. We attribute this to over-fitting to a length/style register that the judge penalises. Multi-task training (EnMed-Unified) mitigates this.

Phase 2 not yet released. This is the Phase 1 model. The full cross-lingual continual pre-training pipeline (English biomedical → French medical transfer) will be released as EnMed-Phase2.

⚠️ Not for clinical deployment. This model has not been clinically validated. Do not use it for patient-facing applications or clinical decision support.


Citation

The associated paper has been submitted to Springer Lecture Notes in Computer Science (LNCS) and is currently under review. If you use EnMed-Unified or any member of the EnMed family, please cite the preprint version:

@unpublished{abodoeloundou2025enmed,
  title  = {Cross-Lingual Domain Adaptation and Multi-Task Fine-Tuning
            for High-Fidelity Medical Language Models},
  author = {Abodo Eloundou, Brice Donald and Malykh, Valentin},
  note   = {Submitted to Springer Lecture Notes in Computer Science (LNCS).
            Under review. ITMO University / MTS Web Services,
            Saint Petersburg, Russia},
  year   = {2026}
}

This entry will be updated to a full @inproceedings citation upon acceptance.

If you use the French health pre-training corpus, please also cite:

@article{mannion2026biomedical,
  title   = {Is biomedical specialization still worth it?
             Insights from domain-adaptive language modelling
             with a new French health corpus},
  author  = {Mannion, A. and Macaire, C. and Violle, A. and
             Ohayon, S. and Tannier, X. and Schwab, D. and others},
  journal = {arXiv preprint arXiv:2604.06903},
  year    = {2026}
}

Acknowledgements

Research conducted at ITMO University, Saint Petersburg, Russia and MTS Web Services, Saint Petersburg, Russia.

Authors:

  • Brice Donald Abodo Eloundou — ITMO University  |  ORCID: 0009-0009-1845-5867
  • Valentin Malykh — MTS Web Services / ITMO University

Evaluation benchmarks: DrBenchmark (Labrak et al., 2024), FrenchMedMCQA (Labrak et al., 2022), MediQAl (Bazoge, 2025), CAS corpus (Grabar et al., 2020).


License

Released under Apache 2.0, consistent with the Qwen3-14B base model license. The pre-training corpus license follows Mannion et al. (2026); users are responsible for compliance with that corpus's terms.

Clinical use warning: This model is a research artefact. Any use in clinical or patient-facing settings requires independent clinical validation and regulatory approval in the applicable jurisdiction.
