---
language:
- en
- zh
- es
- ja
- ru
tags:
- fill-mask
- clinical-nlp
- multilingual
- bert
license: mit
---

# MultiClinicalBERT: A Multilingual Transformer Pretrained on Real-World Clinical Notes

MultiClinicalBERT is a multilingual transformer model pretrained on real-world clinical notes across multiple languages. It is designed to provide strong and consistent performance for clinical NLP tasks in both high-resource and low-resource settings.

To the best of our knowledge, this is the first open-source BERT model pretrained specifically on multilingual real-world clinical notes.

## Model Overview

MultiClinicalBERT is initialized from `bert-base-multilingual-cased` and further pretrained using a two-stage domain-adaptive strategy on a large-scale multilingual clinical corpus (BRIDGE), combined with biomedical and general-domain data.

Key strengths:
- Clinical terminology and documentation patterns
- Cross-lingual representations for medical text
- Robust performance across diverse healthcare datasets

## Pretraining Data

The model is trained on a mixture of three data sources:

### 1. Clinical Data (BRIDGE Corpus)
- 87 multilingual clinical datasets
- ~1.42M documents
- ~995M tokens
- Languages: English, Chinese, Spanish, Japanese, Russian

This dataset reflects real-world clinical practice and is the core contribution of this work.

### 2. Biomedical Literature (PubMed)
- ~1.25M documents
- ~194M tokens

Provides domain knowledge and medical terminology.

### 3. General-Domain Text (Wikipedia)
- ~5.8K documents
- ~43M tokens
- Languages: Spanish, Japanese, Russian

Improves general linguistic coverage.

### Total
- ~2.7M documents
- >1.2B tokens
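
As a quick sanity check, the per-source figures above sum to the stated totals (a sketch using the approximate counts listed in this section):

```python
# Approximate per-source figures from the sections above.
docs = {"BRIDGE": 1_420_000, "PubMed": 1_250_000, "Wikipedia": 5_800}
tokens_m = {"BRIDGE": 995, "PubMed": 194, "Wikipedia": 43}  # in millions

total_docs = sum(docs.values())          # 2,675,800 ≈ 2.7M documents
total_tokens_m = sum(tokens_m.values())  # 1,232M ≈ 1.23B tokens (>1.2B)
print(f"~{total_docs / 1e6:.2f}M documents, ~{total_tokens_m / 1000:.2f}B tokens")
# prints "~2.68M documents, ~1.23B tokens"
```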

## Pretraining Strategy

We adopt a **two-stage domain-adaptive pretraining approach**:

### Stage 1: Mixed-domain pretraining
- Data: BRIDGE + PubMed + Wikipedia
- Goal: Inject biomedical and multilingual knowledge

### Stage 2: Clinical-specific adaptation
- Data: BRIDGE only
- Goal: Learn fine-grained clinical language patterns
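
Schematically, the two stages differ only in which corpora feed the MLM objective. A minimal sketch (the document pools below are placeholders, not the actual data-loading code):

```python
# Placeholder pools standing in for the real corpora (illustrative only).
bridge = ["clinical note"] * 6   # BRIDGE (clinical)
pubmed = ["abstract"] * 4        # PubMed (biomedical)
wiki = ["article"] * 2           # Wikipedia (general)

def stage_corpus(stage: int) -> list[str]:
    """Stage 1 mixes all domains; stage 2 adapts on clinical text only."""
    return bridge + pubmed + wiki if stage == 1 else bridge

for stage in (1, 2):
    print(f"stage {stage}: {len(stage_corpus(stage))} documents")
```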

### Objective
- Masked Language Modeling (MLM)
- 15% token masking
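
The MLM objective selects 15% of tokens as prediction targets. The sketch below follows the standard BERT recipe for what happens to a selected token (the 80/10/10 replacement split is BERT's published default and an assumption here, not stated in this card):

```python
import random

def mlm_mask(tokens, vocab, mask_prob=0.15, seed=0):
    """Select ~15% of tokens as prediction targets, BERT-style:
    80% -> [MASK], 10% -> random vocab token, 10% -> kept unchanged."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)              # model must recover the original
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(rng.choice(vocab))
            else:
                masked.append(tok)
        else:
            masked.append(tok)
            labels.append(None)             # not a prediction target
    return masked, labels

text = ("patient reports fever and cough for three days " * 20).split()
masked, labels = mlm_mask(text, vocab=sorted(set(text)))
```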

## Evaluation

We evaluate MultiClinicalBERT on **11 clinical NLP tasks across 5 languages**:

- English: MIMIC-III Mortality, MedNLI, MIMIC-IV CDM
- Chinese: CEMR, IMCS-V2 NER
- Japanese: IFMIR NER, IFMIR Incident Type
- Russian: RuMedNLI, RuCCoNNER
- Spanish: De-identification, PPTS

### Key Results
- Consistently outperforms multilingual BERT (mBERT)
- Matches or exceeds strong language-specific models
- Largest gains observed in low-resource settings
- Statistically significant improvements (Welch's t-test, p < 0.05)
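
Welch's t-test is the two-sample t-test variant that does not assume equal variances between the two systems' score distributions. A minimal illustration with SciPy (the per-run accuracies below are made-up numbers for demonstration, not results from this work):

```python
from scipy import stats

# Hypothetical accuracies over 5 runs for each model (illustrative only).
ours = [83.9, 84.1, 83.6, 84.3, 83.8]
mbert = [81.2, 81.5, 80.9, 81.8, 81.1]

# equal_var=False selects Welch's t-test (unequal-variance form).
t_stat, p_value = stats.ttest_ind(ours, mbert, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.6f}")
```

For these clearly separated samples the test reports a large positive t-statistic and p well below 0.05.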

Examples:
- MedNLI: **83.90% accuracy**
- CEMR: **86.38% accuracy**
- IFMIR NER: **85.53 F1**
- RuMedNLI: **78.31% accuracy**

## Key Contributions

- First BERT model pretrained on **multilingual real-world clinical notes**
- Large-scale clinical corpus (BRIDGE) with diverse languages
- Effective **two-stage domain adaptation strategy**
- Strong performance across **multiple languages and tasks**
- Suitable for:
  - Clinical NLP
  - Multilingual medical text understanding
  - Retrieval-augmented generation (RAG)
  - Clinical decision support systems

## Usage

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("YLab-Open/MultiClinicalBERT")
model = AutoModel.from_pretrained("YLab-Open/MultiClinicalBERT")
```