DT4H_XLM-R_mtl_multilingual_multilabel

Model Description

This multilingual clinical Named Entity Recognition (NER) model identifies mentions of diseases, symptoms, and clinical procedures in biomedical and clinical text. It is based on xlm-roberta-base and fine-tuned, following a multi-task learning (MTL) approach with the BIO tagging scheme for sequence labeling, on translated variants of the clinical NER datasets DisTEMIST, SympTEMIST, MedProcNER, and CardioCCC, which consist of clinical case reports with manually annotated mentions of the three entity types.

The model consists of a shared multilingual encoder and a set of entity-specific token classification heads, each one being responsible for a different task. In this configuration, each classification head is trained on entity-specific data from all supported languages.

  • Architecture: Multi-task learning (MTL)
  • Training setup: Multilingual, Multilabel (DISEASE, SYMPTOM, PROCEDURE)
  • Tasks: disease, symptom, procedure
  • Supported languages:
    • Spanish (es)
    • Italian (it)
    • Romanian (ro)
    • English (en)
    • Dutch (nl)
    • Swedish (sv)
    • Czech (cs)
  • Base model: xlm-roberta-base
  • Task: Token classification (NER)
  • Label scheme: BIO
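
Since each entity-specific head only distinguishes its own entity type, its label set under the BIO scheme reduces to three tags. The sketch below (variable names are our own, not taken from the released code) illustrates how these per-task label sets look:

```python
# Illustration of the BIO scheme used by each entity-specific head.
# Each head only sees its own entity type, so every head has three labels:
# O, B-<TYPE>, I-<TYPE>. Label ids here are illustrative, not the checkpoint's.
TASKS = ["DISEASE", "SYMPTOM", "PROCEDURE"]

def bio_labels(entity_type):
    """Return the BIO label set for one entity-specific head."""
    return ["O", f"B-{entity_type}", f"I-{entity_type}"]

label2id = {task: {lab: i for i, lab in enumerate(bio_labels(task))}
            for task in TASKS}

# Example: tagging "acute myocardial infarction was ruled out" for DISEASE.
tokens = ["acute", "myocardial", "infarction", "was", "ruled", "out"]
tags   = ["B-DISEASE", "I-DISEASE", "I-DISEASE", "O", "O", "O"]
ids    = [label2id["DISEASE"][t] for t in tags]
print(ids)  # [1, 2, 2, 0, 0, 0]
```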

Training Data

The model is trained on multilingual clinical NER data combining DisTEMIST, SympTEMIST, MedProcNER, and CardioCCC across the supported languages. In this MTL setup, each classification head performs a single task: detecting mentions of one entity type in biomedical and clinical text. Accordingly, each head is fine-tuned on entity-specific data from all supported languages.

The training data is provided as part of the MultiClinNER subtask of the MultiClinAI shared task, an initiative of the DataTools4Heart (DT4H) project, which provides translated and annotation-projected clinical corpora.

Training and test splits correspond to the MultiClinNER task at the 11th SMM4H-HeaRD Workshop (ACL 2026).

Model loading

This model uses a custom MTL architecture, and therefore cannot be loaded with:

AutoModelForTokenClassification.from_pretrained(...)

Instead, this repository provides a PyTorch checkpoint (.pt) which includes:

  • Encoder weights
  • All entity-specific heads
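
Since the custom MTL class itself is not shown in this card, the following is only a hedged sketch of the shared-encoder/per-head design described above. The class name `MTLTokenClassifier` and the tiny stand-in encoder are our own assumptions; in practice the encoder would be xlm-roberta-base and the weights would come from the released .pt checkpoint:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the MTL architecture: one shared encoder plus one
# token-classification head per entity type. A small embedding layer stands
# in for the real xlm-roberta-base encoder to keep the example self-contained.
class MTLTokenClassifier(nn.Module):
    def __init__(self, encoder, hidden_size,
                 tasks=("DISEASE", "SYMPTOM", "PROCEDURE")):
        super().__init__()
        self.encoder = encoder
        # One 3-way (O / B-X / I-X) head per task, all sharing the encoder.
        self.heads = nn.ModuleDict({t: nn.Linear(hidden_size, 3) for t in tasks})

    def forward(self, input_ids, task):
        hidden = self.encoder(input_ids)   # (batch, seq, hidden)
        return self.heads[task](hidden)    # (batch, seq, 3)

hidden_size = 16
encoder = nn.Embedding(250002, hidden_size)  # stand-in; xlm-roberta-base vocab size
model = MTLTokenClassifier(encoder, hidden_size)

# Loading the released checkpoint would look roughly like:
# state = torch.load("checkpoint.pt", map_location="cpu")
# model.load_state_dict(state)

logits = model(torch.randint(0, 1000, (1, 8)), task="DISEASE")
print(logits.shape)
```

The key design point is that gradients from all three tasks update the shared encoder, while each linear head only sees its own entity type.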

How to use

To use the model:

  1. Download the .pt file
  2. Load it using the custom architecture

To facilitate this process, we provide an inference script in a GitHub repository that:

  • Loads the model from the checkpoint using the custom architecture
  • Processes .txt files from an input directory
  • Extracts mentions of the entity type corresponding to the selected classification head (task)
  • Exports predictions as a TSV file in the format required for the MultiClinAI evaluation library:
filename                              label      start_span    end_span    text
MultiClinNER-en-test-disease-00019    DISEASE    154           165         myocarditis
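
The conversion from per-token BIO predictions to these character-span rows can be sketched as below. This is not the provided inference script; the helper `bio_to_mentions` and the example document id are hypothetical:

```python
# Hypothetical sketch: turn BIO predictions over pre-tokenized words into
# (filename, label, start_span, end_span, text) TSV rows as shown above.
def bio_to_mentions(words, offsets, tags, text):
    """words: tokens; offsets: (start, end) char spans; tags: BIO labels."""
    mentions, current = [], None
    for (start, end), tag in zip(offsets, tags):
        if tag.startswith("B-"):
            if current:
                mentions.append(current)
            current = [tag[2:], start, end]      # open a new mention
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[2] = end                     # extend the open mention
        else:
            if current:
                mentions.append(current)
            current = None
    if current:
        mentions.append(current)
    return [(label, s, e, text[s:e]) for label, s, e in mentions]

text = "Patient was diagnosed with myocarditis."
words = ["Patient", "was", "diagnosed", "with", "myocarditis", "."]
offsets = [(0, 7), (8, 11), (12, 21), (22, 26), (27, 38), (38, 39)]
tags = ["O", "O", "O", "O", "B-DISEASE", "O"]

for label, s, e, mention in bio_to_mentions(words, offsets, tags, text):
    print(f"doc-001\t{label}\t{s}\t{e}\t{mention}")
# doc-001	DISEASE	27	38	myocarditis
```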

Note: We recommend pre-tokenizing the input text into words, as this matches the training setup. Providing raw text directly may lead to slightly degraded performance.

Limitations and bias

At the time of submission, no formal bias or fairness evaluation had been conducted. We intend to study these aspects in future work and will update this model card with any findings.

Evaluation

Evaluation was conducted using strict (exact match) and character-level metrics on the MultiClinNER test set.
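
To make the two metric families concrete, the sketch below contrasts strict (exact span match) and character-level scoring on a single partial-overlap prediction. It is our own simplified illustration, not the official MultiClinAI evaluation library, which may differ in details such as normalization and document grouping:

```python
# Hedged sketch of the two evaluation modes reported below.
def strict_prf(gold, pred):
    """gold, pred: sets of (start, end) spans. Exact-match P/R/F1."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def char_prf(gold, pred):
    """Character-level P/R/F1 over the covered character positions."""
    gold_chars = {i for s, e in gold for i in range(s, e)}
    pred_chars = {i for s, e in pred for i in range(s, e)}
    tp = len(gold_chars & pred_chars)
    p = tp / len(pred_chars) if pred_chars else 0.0
    r = tp / len(gold_chars) if gold_chars else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {(154, 165)}
pred = {(154, 160)}             # partial overlap: fails strict, scores char-level
print(strict_prf(gold, pred))  # (0.0, 0.0, 0.0)
print(char_prf(gold, pred))    # char P = 1.0, char R = 6/11
```

This is why the character-level scores in the tables below are consistently higher than the strict ones: partially correct boundaries still earn character-level credit.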

Evaluation Results (MTL, Multilingual + Multilabel - DISEASE)

Language   Strict P   Strict R   Strict F1   Char P   Char R   Char F1
es         0.5024     0.5949     0.5448      0.6437   0.7488   0.6923
it         0.5202     0.5087     0.5144      0.6614   0.6380   0.6495
ro         0.5536     0.5693     0.5613      0.6938   0.7071   0.7004
en         0.5476     0.4907     0.5176      0.6825   0.5996   0.6384
nl         0.5118     0.5283     0.5199      0.6323   0.6417   0.6370
sv         0.4821     0.5416     0.5101      0.6027   0.6635   0.6316
cs         0.4802     0.5007     0.4902      0.6107   0.6242   0.6174

Average F1 across languages: 0.5226 (strict), 0.6524 (character-level)

Evaluation Results (MTL, Multilingual + Multilabel - SYMPTOM)

Language   Strict P   Strict R   Strict F1   Char P   Char R   Char F1
es         0.1862     0.1734     0.1796      0.3518   0.3208   0.3356
it         0.2087     0.1478     0.1730      0.3714   0.2564   0.3033
ro         0.1819     0.1430     0.1602      0.3452   0.2644   0.2995
en         0.2501     0.1568     0.1927      0.4264   0.2618   0.3244
nl         0.2609     0.2120     0.2339      0.3985   0.3149   0.3518
sv         0.2429     0.2080     0.2241      0.3932   0.3257   0.3563
cs         0.1872     0.1718     0.1792      0.3296   0.2917   0.3095

Average F1 across languages: 0.1918 (strict), 0.3258 (character-level)

Evaluation Results (MTL, Multilingual + Multilabel - PROCEDURE)

Language   Strict P   Strict R   Strict F1   Char P   Char R   Char F1
es         0.6119     0.6336     0.6226      0.7575   0.7769   0.7671
it         0.6094     0.5043     0.5519      0.7561   0.6209   0.6819
ro         0.6565     0.6351     0.6456      0.7914   0.7599   0.7753
en         0.6169     0.5757     0.5956      0.7757   0.7192   0.7464
nl         0.6266     0.6203     0.6234      0.7526   0.7404   0.7465
sv         0.6131     0.6317     0.6223      0.7448   0.7614   0.7530
cs         0.6104     0.6269     0.6185      0.7527   0.7644   0.7585

Average F1 across languages: 0.6114 (strict), 0.7470 (character-level)

Additional information

Authors

NLP4BIA team at the Barcelona Supercomputing Center (nlp4bia@bsc.es).

Contact information

judith.rosell [at] bsc.es

Funding

This model is part of the DataTools4Heart project, funded by the European Union's Horizon Europe Framework Programme under Grant Agreement No. 101057849.
