---
language:
- en
- zh
- es
- ja
- ru
tags:
- fill-mask
- clinical-nlp
- multilingual
- bert
license: mit
---
# MultiClinicalBERT: A Multilingual Transformer Pretrained on Real-World Clinical Notes
MultiClinicalBERT is a multilingual transformer model pretrained on real-world clinical notes across multiple languages. It is designed to provide strong and consistent performance for clinical NLP tasks in both high-resource and low-resource settings.
To the best of our knowledge, this is the first open-source BERT model pretrained specifically on multilingual real-world clinical notes.
## Model Overview
MultiClinicalBERT is initialized from `bert-base-multilingual-cased` and further pretrained using a two-stage domain-adaptive strategy on a large-scale multilingual clinical corpus (BRIDGE), combined with biomedical and general-domain data.
The model captures:
- Clinical terminology and documentation patterns
- Cross-lingual representations for medical text
- Robust performance across diverse healthcare datasets
## Pretraining Data
The model is trained on a mixture of three data sources:
### 1. Clinical Data (BRIDGE Corpus)
- 87 multilingual clinical datasets
- ~1.42M documents
- ~995M tokens
- Languages: English, Chinese, Spanish, Japanese, Russian
This dataset reflects real-world clinical practice and is the core contribution of this work.
### 2. Biomedical Literature (PubMed)
- ~1.25M documents
- ~194M tokens
Provides domain knowledge and medical terminology.
### 3. General-Domain Text (Wikipedia)
- ~5.8K documents
- ~43M tokens
- Languages: Spanish, Japanese, Russian
Improves general linguistic coverage.
### Total
- ~2.7M documents
- >1.2B tokens
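The totals follow from the per-source counts listed above; a quick sanity check (all figures approximate, as stated in each section):

```python
# Approximate per-source counts from the sections above
tokens = {"BRIDGE": 995e6, "PubMed": 194e6, "Wikipedia": 43e6}
docs = {"BRIDGE": 1.42e6, "PubMed": 1.25e6, "Wikipedia": 5.8e3}

total_tokens = sum(tokens.values())  # ~1.23B, i.e. >1.2B
total_docs = sum(docs.values())      # ~2.68M, i.e. ~2.7M
```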
## Pretraining Strategy
We adopt a **two-stage domain-adaptive pretraining approach**:
### Stage 1: Mixed-domain pretraining
- Data: BRIDGE + PubMed + Wikipedia
- Goal: Inject biomedical and multilingual knowledge
### Stage 2: Clinical-specific adaptation
- Data: BRIDGE only
- Goal: Learn fine-grained clinical language patterns
### Objective
- Masked Language Modeling (MLM)
- 15% token masking
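For illustration, the masking step can be sketched in plain Python. The card specifies only the 15% masking rate; the 80/10/10 replacement split and the `[MASK]`/vocabulary ids below are the standard BERT defaults, assumed here rather than stated in the card.

```python
import random

MASK_ID = 103        # [MASK] id in BERT-style vocabularies (assumption)
VOCAB_SIZE = 119547  # mBERT-cased vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """Select ~15% of tokens for prediction; of those, replace 80% with
    [MASK], 10% with a random token, and leave 10% unchanged."""
    rng = rng or random.Random(0)
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)
            # else: keep the original token unchanged
        else:
            labels.append(-100)  # convention: ignored by the MLM loss
    return inputs, labels
```

In practice this logic is handled by a data collator (e.g. `DataCollatorForLanguageModeling` in `transformers` with `mlm_probability=0.15`) rather than written by hand.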
## Evaluation
We evaluate MultiClinicalBERT on **11 clinical NLP tasks across 5 languages**:
- English: MIMIC-III Mortality, MedNLI, MIMIC-IV CDM
- Chinese: CEMR, IMCS-V2 NER
- Japanese: IFMIR NER, IFMIR Incident Type
- Russian: RuMedNLI, RuCCoNNER
- Spanish: De-identification, PPTS
### Key Results
- Consistently outperforms multilingual BERT (mBERT)
- Matches or exceeds strong language-specific models
- Largest gains observed in low-resource settings
- Statistically significant improvements (Welch’s t-test, p < 0.05)
Example:
- MedNLI: **83.90% accuracy**
- CEMR: **86.38% accuracy**
- IFMIR NER: **85.53 F1**
- RuMedNLI: **78.31% accuracy**
## Key Contributions
- First BERT model pretrained on **multilingual real-world clinical notes**
- Large-scale clinical corpus (BRIDGE) with diverse languages
- Effective **two-stage domain adaptation strategy**
- Strong performance across **multiple languages and tasks**
- Suitable for:
- Clinical NLP
- Multilingual medical text understanding
- Retrieval-augmented generation (RAG)
- Clinical decision support systems
## Usage
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("YLab-Open/MultiClinicalBERT")
model = AutoModel.from_pretrained("YLab-Open/MultiClinicalBERT")

# Encode a clinical sentence and obtain contextual embeddings
inputs = tokenizer("The patient was admitted with acute chest pain.", return_tensors="pt")
outputs = model(**inputs)
```