---
language:
  - en
  - zh
  - es
  - ja
  - ru
tags:
  - fill-mask
  - clinical-nlp
  - multilingual
  - bert
license: mit
---

# MultiClinicalBERT: A Multilingual Transformer Pretrained on Real-World Clinical Notes

MultiClinicalBERT is a multilingual transformer model pretrained on real-world clinical notes across multiple languages. It is designed to provide strong and consistent performance for clinical NLP tasks in both high-resource and low-resource settings.

To the best of our knowledge, this is the first open-source BERT model pretrained specifically on multilingual real-world clinical notes.

## Model Overview

MultiClinicalBERT is initialized from `bert-base-multilingual-cased` and further pretrained using a two-stage domain-adaptive strategy on a large-scale multilingual clinical corpus (BRIDGE), combined with biomedical and general-domain data.

The model is designed to provide:

- Clinical terminology and documentation patterns
- Cross-lingual representations for medical text
- Robust performance across diverse healthcare datasets

## Pretraining Data

The model is trained on a mixture of three data sources:

### 1. Clinical Data (BRIDGE Corpus)

- 87 multilingual clinical datasets
- ~1.42M documents
- ~995M tokens
- Languages: English, Chinese, Spanish, Japanese, Russian

This dataset reflects real-world clinical practice and is the core contribution of this work.

### 2. Biomedical Literature (PubMed)

- ~1.25M documents
- ~194M tokens

Provides domain knowledge and medical terminology.

### 3. General-Domain Text (Wikipedia)

- ~5.8K documents
- ~43M tokens
- Languages: Spanish, Japanese, Russian

Improves general linguistic coverage.

### Total

- ~2.7M documents
- \>1.2B tokens

## Pretraining Strategy

We adopt a two-stage domain-adaptive pretraining approach:

### Stage 1: Mixed-domain pretraining

- Data: BRIDGE + PubMed + Wikipedia
- Goal: Inject biomedical and multilingual knowledge

### Stage 2: Clinical-specific adaptation

- Data: BRIDGE only
- Goal: Learn fine-grained clinical language patterns
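The two stages amount to a simple data schedule: a broad mixed-domain pass followed by a clinical-only pass. A minimal sketch of that schedule (the `continue_pretraining` function and corpus names below are illustrative placeholders, not a real API):

```python
# Two-stage domain-adaptive pretraining schedule (illustrative sketch).
STAGES = [
    ("stage1_mixed_domain", ["BRIDGE", "PubMed", "Wikipedia"]),  # inject biomedical + multilingual knowledge
    ("stage2_clinical_only", ["BRIDGE"]),                        # adapt to clinical language patterns
]

def continue_pretraining(checkpoint, corpora):
    """Placeholder for one MLM continued-pretraining run over `corpora`."""
    return f"{checkpoint}+{'+'.join(corpora)}"

checkpoint = "bert-base-multilingual-cased"  # initialization point per the model card
for name, corpora in STAGES:
    checkpoint = continue_pretraining(checkpoint, corpora)
```

The key design point is that the final stage sees only BRIDGE, so the last gradient updates specialize the mixed-domain checkpoint to clinical text.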

### Objective

- Masked Language Modeling (MLM)
- 15% token masking
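The 15% masking can be illustrated with a small pure-Python sketch of BERT's standard corruption scheme (of the selected 15%, 80% become `[MASK]`, 10% a random token, 10% stay unchanged). The token IDs and vocabulary size below are made-up values for illustration, not the model's real vocabulary:

```python
import random

MASK_ID = 103        # assumed [MASK] id, for illustration only
VOCAB_SIZE = 119547  # mBERT-sized vocabulary (illustrative)

def mask_tokens(token_ids, mlm_prob=0.15, rng=random.Random(0)):
    """Return (inputs, labels) for MLM. 15% of positions are selected;
    of those, 80% -> [MASK], 10% -> random id, 10% -> unchanged.
    Labels are -100 (ignored by the loss) at unselected positions."""
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels.append(tok)  # predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)
            # else: keep the original token unchanged
        else:
            labels.append(-100)
    return inputs, labels

ids = list(range(1000, 1020))
inp, lab = mask_tokens(ids)
```

In practice this corruption is handled by a data collator; the sketch just makes the 15% / 80-10-10 split concrete.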

## Evaluation

We evaluate MultiClinicalBERT on 11 clinical NLP tasks across 5 languages:

- English: MIMIC-III Mortality, MedNLI, MIMIC-IV CDM
- Chinese: CEMR, IMCS-V2 NER
- Japanese: IFMIR NER, IFMIR Incident Type
- Russian: RuMedNLI, RuCCoNNER
- Spanish: De-identification, PPTS

### Key Results

- Consistently outperforms multilingual BERT (mBERT)
- Matches or exceeds strong language-specific models
- Largest gains observed in low-resource settings
- Statistically significant improvements (Welch’s t-test, p < 0.05)

Examples:

- MedNLI: 83.90% accuracy
- CEMR: 86.38% accuracy
- IFMIR NER: 85.53 F1
- RuMedNLI: 78.31% accuracy

## Key Contributions

- First BERT model pretrained on multilingual real-world clinical notes
- Large-scale clinical corpus (BRIDGE) with diverse languages
- Effective two-stage domain-adaptation strategy
- Strong performance across multiple languages and tasks
- Suitable for:
  - Clinical NLP
  - Multilingual medical text understanding
  - Retrieval-augmented generation (RAG)
  - Clinical decision support systems

## Usage

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("YLab-Open/MultiClinicalBERT")
model = AutoModel.from_pretrained("YLab-Open/MultiClinicalBERT")
```
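For sentence-level downstream use, such as retrieval in a RAG pipeline, a common pattern is attention-mask-aware mean pooling over the encoder's last hidden states. A minimal NumPy sketch with dummy arrays standing in for the model output (the shapes and values are assumptions for illustration; the real hidden size is 768):

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(axis=1)                   # (batch, dim)
    counts = mask.sum(axis=1).clip(min=1e-9)                          # avoid division by zero
    return summed / counts

# Dummy batch: 2 sequences, 4 tokens, hidden size 3 (a real model outputs 768).
hidden = np.arange(24, dtype=np.float64).reshape(2, 4, 3)
mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])
emb = mean_pool(hidden, mask)
print(emb.shape)  # (2, 3)
```

With the real model, `last_hidden_state` comes from `model(**tokenizer(texts, return_tensors="pt", padding=True)).last_hidden_state` and the same pooling applies.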