---
language:
  - en
  - zh
  - es
  - ja
  - ru
tags:
  - fill-mask
  - clinical-nlp
  - multilingual
  - bert
license: mit
---

# MultiClinicalBERT: A Multilingual Transformer Pretrained on Real-World Clinical Notes

MultiClinicalBERT is a multilingual transformer model pretrained on real-world clinical notes across multiple languages. It is designed to provide strong and consistent performance for clinical NLP tasks in both high-resource and low-resource settings.

To the best of our knowledge, this is the first open-source BERT model pretrained specifically on multilingual real-world clinical notes.

## Model Overview

MultiClinicalBERT is initialized from `bert-base-multilingual-cased` and further pretrained using a two-stage domain-adaptive strategy on a large-scale multilingual clinical corpus (BRIDGE), combined with biomedical and general-domain data.

The model is designed to provide:

- Clinical terminology and documentation patterns
- Cross-lingual representations for medical text
- Robust performance across diverse healthcare datasets

## Pretraining Data

The model is trained on a mixture of three data sources:

### 1. Clinical Data (BRIDGE Corpus)

- 87 multilingual clinical datasets
- ~1.42M documents
- ~995M tokens
- Languages: English, Chinese, Spanish, Japanese, Russian

This dataset reflects real-world clinical practice and is the core contribution of this work.

### 2. Biomedical Literature (PubMed)

- ~1.25M documents
- ~194M tokens

Provides domain knowledge and medical terminology.

### 3. General-Domain Text (Wikipedia)

- ~5.8K documents
- ~43M tokens
- Languages: Spanish, Japanese, Russian

Improves general linguistic coverage.

### Total

- ~2.7M documents
- \>1.2B tokens

## Pretraining Strategy

We adopt a two-stage domain-adaptive pretraining approach:

### Stage 1: Mixed-domain pretraining

- Data: BRIDGE + PubMed + Wikipedia
- Goal: Inject biomedical and multilingual knowledge

### Stage 2: Clinical-specific adaptation

- Data: BRIDGE only
- Goal: Learn fine-grained clinical language patterns
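The two stages amount to a simple data schedule: a broad mixed-domain pass followed by a clinical-only pass. A minimal sketch of that schedule (the `continue_pretraining` function and corpus names below are illustrative placeholders, not a real API):

```python
# Two-stage domain-adaptive pretraining schedule (illustrative sketch).
STAGES = [
    ("stage1_mixed_domain", ["BRIDGE", "PubMed", "Wikipedia"]),  # inject biomedical + multilingual knowledge
    ("stage2_clinical_only", ["BRIDGE"]),                        # adapt to clinical language patterns
]

def continue_pretraining(checkpoint, corpora):
    """Placeholder for one MLM continued-pretraining run over `corpora`."""
    return f"{checkpoint}+{'+'.join(corpora)}"

checkpoint = "bert-base-multilingual-cased"  # initialization point per the model card
for name, corpora in STAGES:
    checkpoint = continue_pretraining(checkpoint, corpora)
```

The key design point is that the final stage sees only BRIDGE, so the last gradient updates specialize the mixed-domain checkpoint to clinical text.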

### Objective

- Masked Language Modeling (MLM)
- 15% token masking
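The 15% masking can be illustrated with a small pure-Python sketch of BERT's standard corruption scheme (of the selected 15%, 80% become `[MASK]`, 10% a random token, 10% stay unchanged). The token IDs and vocabulary size below are made-up values for illustration, not the model's real vocabulary:

```python
import random

MASK_ID = 103        # assumed [MASK] id, for illustration only
VOCAB_SIZE = 119547  # mBERT-sized vocabulary (illustrative)

def mask_tokens(token_ids, mlm_prob=0.15, rng=random.Random(0)):
    """Return (inputs, labels) for MLM. 15% of positions are selected;
    of those, 80% -> [MASK], 10% -> random id, 10% -> unchanged.
    Labels are -100 (ignored by the loss) at unselected positions."""
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels.append(tok)  # predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)
            # else: keep the original token unchanged
        else:
            labels.append(-100)
    return inputs, labels

ids = list(range(1000, 1020))
inp, lab = mask_tokens(ids)
```

In practice this corruption is handled by a data collator; the sketch just makes the 15% / 80-10-10 split concrete.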

## Evaluation

We evaluate MultiClinicalBERT on 11 clinical NLP tasks across 5 languages:

- English: MIMIC-III Mortality, MedNLI, MIMIC-IV CDM
- Chinese: CEMR, IMCS-V2 NER
- Japanese: IFMIR NER, IFMIR Incident Type
- Russian: RuMedNLI, RuCCoNNER
- Spanish: De-identification, PPTS

### Key Results

- Consistently outperforms multilingual BERT (mBERT)
- Matches or exceeds strong language-specific models
- Largest gains observed in low-resource settings
- Statistically significant improvements (Welch’s t-test, p < 0.05)

Examples:

- MedNLI: 83.90% accuracy
- CEMR: 86.38% accuracy
- IFMIR NER: 85.53 F1
- RuMedNLI: 78.31% accuracy

## Key Contributions

- First BERT model pretrained on multilingual real-world clinical notes
- Large-scale clinical corpus (BRIDGE) with diverse languages
- Effective two-stage domain-adaptation strategy
- Strong performance across multiple languages and tasks
- Suitable for:
  - Clinical NLP
  - Multilingual medical text understanding
  - Retrieval-augmented generation (RAG)
  - Clinical decision support systems

## Usage

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("YLab-Open/MultiClinicalBERT")
model = AutoModel.from_pretrained("YLab-Open/MultiClinicalBERT")
```
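For sentence-level downstream use, such as retrieval in a RAG pipeline, a common pattern is attention-mask-aware mean pooling over the encoder's last hidden states. A minimal NumPy sketch with dummy arrays standing in for the model output (the shapes and values are assumptions for illustration; the real hidden size is 768):

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(axis=1)                   # (batch, dim)
    counts = mask.sum(axis=1).clip(min=1e-9)                          # avoid division by zero
    return summed / counts

# Dummy batch: 2 sequences, 4 tokens, hidden size 3 (a real model outputs 768).
hidden = np.arange(24, dtype=np.float64).reshape(2, 4, 3)
mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])
emb = mean_pool(hidden, mask)
print(emb.shape)  # (2, 3)
```

With the real model, `last_hidden_state` comes from `model(**tokenizer(texts, return_tensors="pt", padding=True)).last_hidden_state` and the same pooling applies.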