---
language:
- en
- zh
- es
- ja
- ru
tags:
- fill-mask
- clinical-nlp
- multilingual
- bert
license: mit
---
# MultiClinicalBERT: A Multilingual Transformer Pretrained on Real-World Clinical Notes
MultiClinicalBERT is a multilingual transformer model pretrained on real-world clinical notes across multiple languages. It is designed to provide strong and consistent performance for clinical NLP tasks in both high-resource and low-resource settings.
To the best of our knowledge, this is the first open-source BERT model pretrained specifically on multilingual real-world clinical notes.
## Model Overview
MultiClinicalBERT is initialized from `bert-base-multilingual-cased` and further pretrained using a two-stage domain-adaptive strategy on a large-scale multilingual clinical corpus (BRIDGE), combined with biomedical and general-domain data.
The model captures:
- Clinical terminology and documentation patterns
- Cross-lingual representations for medical text
- Robust performance across diverse healthcare datasets
## Pretraining Data
The model is trained on a mixture of three data sources:
### 1. Clinical Data (BRIDGE Corpus)
- 87 multilingual clinical datasets
- ~1.42M documents
- ~995M tokens
- Languages: English, Chinese, Spanish, Japanese, Russian
This dataset reflects real-world clinical practice and is the core contribution of this work.
### 2. Biomedical Literature (PubMed)
- ~1.25M documents
- ~194M tokens
Provides domain knowledge and medical terminology.
### 3. General-Domain Text (Wikipedia)
- ~5.8K documents
- ~43M tokens
- Languages: Spanish, Japanese, Russian
Improves general linguistic coverage.
### Total
- ~2.7M documents
- >1.2B tokens
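The totals follow from the per-source counts listed above; a quick sanity check (all figures approximate, as stated in each section):

```python
# Approximate per-source counts from the sections above
tokens = {"BRIDGE": 995e6, "PubMed": 194e6, "Wikipedia": 43e6}
docs = {"BRIDGE": 1.42e6, "PubMed": 1.25e6, "Wikipedia": 5.8e3}

total_tokens = sum(tokens.values())  # ~1.23B, i.e. >1.2B
total_docs = sum(docs.values())      # ~2.68M, i.e. ~2.7M
```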
## Pretraining Strategy
We adopt a **two-stage domain-adaptive pretraining approach**:
### Stage 1: Mixed-domain pretraining
- Data: BRIDGE + PubMed + Wikipedia
- Goal: Inject biomedical and multilingual knowledge
### Stage 2: Clinical-specific adaptation
- Data: BRIDGE only
- Goal: Learn fine-grained clinical language patterns
### Objective
- Masked Language Modeling (MLM)
- 15% token masking
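For illustration, the masking step can be sketched in plain Python. The card specifies only the 15% masking rate; the 80/10/10 replacement split and the `[MASK]`/vocabulary ids below are the standard BERT defaults, assumed here rather than stated in the card.

```python
import random

MASK_ID = 103        # [MASK] id in BERT-style vocabularies (assumption)
VOCAB_SIZE = 119547  # mBERT-cased vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """Select ~15% of tokens for prediction; of those, replace 80% with
    [MASK], 10% with a random token, and leave 10% unchanged."""
    rng = rng or random.Random(0)
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)
            # else: keep the original token unchanged
        else:
            labels.append(-100)  # convention: ignored by the MLM loss
    return inputs, labels
```

In practice this logic is handled by a data collator (e.g. `DataCollatorForLanguageModeling` in `transformers` with `mlm_probability=0.15`) rather than written by hand.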
## Evaluation
We evaluate MultiClinicalBERT on **11 clinical NLP tasks across 5 languages**:
- English: MIMIC-III Mortality, MedNLI, MIMIC-IV CDM
- Chinese: CEMR, IMCS-V2 NER
- Japanese: IFMIR NER, IFMIR Incident Type
- Russian: RuMedNLI, RuCCoNNER
- Spanish: De-identification, PPTS
### Key Results
- Consistently outperforms multilingual BERT (mBERT)
- Matches or exceeds strong language-specific models
- Largest gains observed in low-resource settings
- Statistically significant improvements (Welch’s t-test, p < 0.05)
Example:
- MedNLI: **83.90% accuracy**
- CEMR: **86.38% accuracy**
- IFMIR NER: **85.53 F1**
- RuMedNLI: **78.31% accuracy**
## Key Contributions
- First BERT model pretrained on **multilingual real-world clinical notes**
- Large-scale clinical corpus (BRIDGE) with diverse languages
- Effective **two-stage domain adaptation strategy**
- Strong performance across **multiple languages and tasks**
- Suitable for:
- Clinical NLP
- Multilingual medical text understanding
- Retrieval-augmented generation (RAG)
- Clinical decision support systems
## Usage
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("YLab-Open/MultiClinicalBERT")
model = AutoModel.from_pretrained("YLab-Open/MultiClinicalBERT")

# Encode a clinical sentence and obtain contextual embeddings
inputs = tokenizer("The patient was admitted with acute chest pain.", return_tensors="pt")
outputs = model(**inputs)
```