---
language:
- en
- zh
- es
- ja
- ru
tags:
- fill-mask
- clinical-nlp
- multilingual
- bert
license: mit
---

# MultiClinicalBERT: A Multilingual Transformer Pretrained on Real-World Clinical Notes

MultiClinicalBERT is a multilingual transformer model pretrained on real-world clinical notes in five languages. It is designed to deliver strong, consistent performance on clinical NLP tasks in both high-resource and low-resource settings. To the best of our knowledge, it is the first open-source BERT model pretrained specifically on multilingual real-world clinical notes.

## Model Overview

MultiClinicalBERT is initialized from `bert-base-multilingual-cased` and further pretrained with a two-stage domain-adaptive strategy on a large-scale multilingual clinical corpus (BRIDGE), combined with biomedical and general-domain data. The model captures:

- Clinical terminology and documentation patterns
- Cross-lingual representations for medical text
- Robust performance across diverse healthcare datasets

## Pretraining Data

The model is trained on a mixture of three data sources:

### 1. Clinical Data (BRIDGE Corpus)

- 87 multilingual clinical datasets
- ~1.42M documents
- ~995M tokens
- Languages: English, Chinese, Spanish, Japanese, Russian

This corpus reflects real-world clinical practice and is the core contribution of this work.

### 2. Biomedical Literature (PubMed)

- ~1.25M documents
- ~194M tokens

Provides domain knowledge and medical terminology.

### 3. General-Domain Text (Wikipedia)

- ~5.8K documents
- ~43M tokens
- Languages: Spanish, Japanese, Russian

Improves general linguistic coverage.
### Total

- ~2.7M documents
- >1.2B tokens

## Pretraining Strategy

We adopt a **two-stage domain-adaptive pretraining approach**:

### Stage 1: Mixed-domain pretraining

- Data: BRIDGE + PubMed + Wikipedia
- Goal: Inject biomedical and multilingual knowledge

### Stage 2: Clinical-specific adaptation

- Data: BRIDGE only
- Goal: Learn fine-grained clinical language patterns

### Objective

- Masked Language Modeling (MLM)
- 15% token masking

## Evaluation

We evaluate MultiClinicalBERT on **11 clinical NLP tasks across 5 languages**:

- English: MIMIC-III Mortality, MedNLI, MIMIC-IV CDM
- Chinese: CEMR, IMCS-V2 NER
- Japanese: IFMIR NER, IFMIR Incident Type
- Russian: RuMedNLI, RuCCoNNER
- Spanish: De-identification, PPTS

### Key Results

- Consistently outperforms multilingual BERT (mBERT)
- Matches or exceeds strong language-specific models
- Largest gains observed in low-resource settings
- Statistically significant improvements (Welch's t-test, p < 0.05)

Examples:

- MedNLI: **83.90% accuracy**
- CEMR: **86.38% accuracy**
- IFMIR NER: **85.53 F1**
- RuMedNLI: **78.31% accuracy**

## Key Contributions

- First BERT model pretrained on **multilingual real-world clinical notes**
- Large-scale multilingual clinical corpus (BRIDGE)
- Effective **two-stage domain adaptation strategy**
- Strong performance across **multiple languages and tasks**
- Suitable for:
  - Clinical NLP
  - Multilingual medical text understanding
  - Retrieval-augmented generation (RAG)
  - Clinical decision support systems

## Usage

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("YLab-Open/MultiClinicalBERT")
model = AutoModel.from_pretrained("YLab-Open/MultiClinicalBERT")
```
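For intuition, the 15% MLM masking used in pretraining can be sketched in plain Python. This is a minimal sketch, assuming BERT's standard 80/10/10 replacement split (replace with `[MASK]` / replace with a random token / keep unchanged), which is the conventional recipe but is not stated explicitly in this card; `mask_tokens` and the tiny vocabulary are illustrative names, not part of the released model.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, vocab=None, rng=None):
    """BERT-style MLM masking (illustrative, not the model's actual code).

    Each position is selected with probability mask_prob. Of the selected
    positions: 80% become [MASK], 10% become a random vocabulary token,
    and 10% keep the original token. Returns (masked_tokens, labels),
    where labels[i] holds the original token at selected positions and
    None elsewhere (only selected positions contribute to the MLM loss).
    """
    rng = rng or random.Random()
    # Toy multilingual vocabulary for the "random token" branch.
    vocab = vocab or ["patient", "fever", "dosis", "発熱", "боль"]
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                masked.append(MASK_TOKEN)          # 80%: mask
            elif r < 0.9:
                masked.append(rng.choice(vocab))   # 10%: random token
            else:
                masked.append(tok)                 # 10%: keep original
        else:
            labels.append(None)
            masked.append(tok)
    return masked, labels

tokens = "the patient presented with acute chest pain".split()
masked, labels = mask_tokens(tokens, rng=random.Random(0))
```

In actual pretraining this operates on subword IDs from the tokenizer rather than whitespace tokens, but the selection logic is the same.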