---
language:
- en
- zh
- es
- ja
- ru
tags:
- fill-mask
- clinical-nlp
- multilingual
- bert
license: mit
---

# MultiClinicalBERT: A Multilingual Transformer Pretrained on Real-World Clinical Notes

MultiClinicalBERT is a multilingual transformer model pretrained on real-world clinical notes across multiple languages. It is designed to provide strong and consistent performance for clinical NLP tasks in both high-resource and low-resource settings.

To the best of our knowledge, this is the first open-source BERT model pretrained specifically on multilingual real-world clinical notes.

## Model Overview

MultiClinicalBERT is initialized from `bert-base-multilingual-cased` and further pretrained using a two-stage domain-adaptive strategy on a large-scale multilingual clinical corpus (BRIDGE), combined with biomedical and general-domain data.

Key strengths:
- Clinical terminology and documentation patterns
- Cross-lingual representations for medical text
- Robust performance across diverse healthcare datasets

## Pretraining Data

The model is trained on a mixture of three data sources:

### 1. Clinical Data (BRIDGE Corpus)
- 87 multilingual clinical datasets
- ~1.42M documents
- ~995M tokens
- Languages: English, Chinese, Spanish, Japanese, Russian

This dataset reflects real-world clinical practice and is the core contribution of this work.

### 2. Biomedical Literature (PubMed)
- ~1.25M documents
- ~194M tokens

Provides domain knowledge and medical terminology.

### 3. General-Domain Text (Wikipedia)
- ~5.8K documents
- ~43M tokens
- Languages: Spanish, Japanese, Russian

Improves general linguistic coverage.

### Total
- ~2.7M documents
- >1.2B tokens
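
As a quick sanity check, the per-source figures above sum to the stated totals (a sketch using the approximate counts listed in this section):

```python
# Approximate per-source figures from the sections above.
docs = {"BRIDGE": 1_420_000, "PubMed": 1_250_000, "Wikipedia": 5_800}
tokens_m = {"BRIDGE": 995, "PubMed": 194, "Wikipedia": 43}  # in millions

total_docs = sum(docs.values())          # 2,675,800 ≈ 2.7M documents
total_tokens_m = sum(tokens_m.values())  # 1,232M ≈ 1.23B tokens (>1.2B)
print(f"~{total_docs / 1e6:.2f}M documents, ~{total_tokens_m / 1000:.2f}B tokens")
# prints "~2.68M documents, ~1.23B tokens"
```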

## Pretraining Strategy

We adopt a **two-stage domain-adaptive pretraining approach**:

### Stage 1: Mixed-domain pretraining
- Data: BRIDGE + PubMed + Wikipedia
- Goal: Inject biomedical and multilingual knowledge

### Stage 2: Clinical-specific adaptation
- Data: BRIDGE only
- Goal: Learn fine-grained clinical language patterns
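
Schematically, the two stages differ only in which corpora feed the MLM objective. A minimal sketch (the document pools below are placeholders, not the actual data-loading code):

```python
# Placeholder pools standing in for the real corpora (illustrative only).
bridge = ["clinical note"] * 6   # BRIDGE (clinical)
pubmed = ["abstract"] * 4        # PubMed (biomedical)
wiki = ["article"] * 2           # Wikipedia (general)

def stage_corpus(stage: int) -> list[str]:
    """Stage 1 mixes all domains; stage 2 adapts on clinical text only."""
    return bridge + pubmed + wiki if stage == 1 else bridge

for stage in (1, 2):
    print(f"stage {stage}: {len(stage_corpus(stage))} documents")
```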

### Objective
- Masked Language Modeling (MLM)
- 15% token masking
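
The MLM objective selects 15% of tokens as prediction targets. The sketch below follows the standard BERT recipe for what happens to a selected token (the 80/10/10 replacement split is BERT's published default and an assumption here, not stated in this card):

```python
import random

def mlm_mask(tokens, vocab, mask_prob=0.15, seed=0):
    """Select ~15% of tokens as prediction targets, BERT-style:
    80% -> [MASK], 10% -> random vocab token, 10% -> kept unchanged."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)              # model must recover the original
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(rng.choice(vocab))
            else:
                masked.append(tok)
        else:
            masked.append(tok)
            labels.append(None)             # not a prediction target
    return masked, labels

text = ("patient reports fever and cough for three days " * 20).split()
masked, labels = mlm_mask(text, vocab=sorted(set(text)))
```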

## Evaluation

We evaluate MultiClinicalBERT on **11 clinical NLP tasks across 5 languages**:

- English: MIMIC-III Mortality, MedNLI, MIMIC-IV CDM
- Chinese: CEMR, IMCS-V2 NER
- Japanese: IFMIR NER, IFMIR Incident Type
- Russian: RuMedNLI, RuCCoNNER
- Spanish: De-identification, PPTS

### Key Results
- Consistently outperforms multilingual BERT (mBERT)
- Matches or exceeds strong language-specific models
- Largest gains observed in low-resource settings
- Statistically significant improvements (Welch's t-test, p < 0.05)
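
Welch's t-test is the two-sample t-test variant that does not assume equal variances between the two systems' score distributions. A minimal illustration with SciPy (the per-run accuracies below are made-up numbers for demonstration, not results from this work):

```python
from scipy import stats

# Hypothetical accuracies over 5 runs for each model (illustrative only).
ours = [83.9, 84.1, 83.6, 84.3, 83.8]
mbert = [81.2, 81.5, 80.9, 81.8, 81.1]

# equal_var=False selects Welch's t-test (unequal-variance form).
t_stat, p_value = stats.ttest_ind(ours, mbert, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.6f}")
```

For these clearly separated samples the test reports a large positive t-statistic and p well below 0.05.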

Examples:
- MedNLI: **83.90% accuracy**
- CEMR: **86.38% accuracy**
- IFMIR NER: **85.53 F1**
- RuMedNLI: **78.31% accuracy**

## Key Contributions

- First BERT model pretrained on **multilingual real-world clinical notes**
- Large-scale clinical corpus (BRIDGE) with diverse languages
- Effective **two-stage domain adaptation strategy**
- Strong performance across **multiple languages and tasks**
- Suitable for:
  - Clinical NLP
  - Multilingual medical text understanding
  - Retrieval-augmented generation (RAG)
  - Clinical decision support systems

## Usage

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("YLab-Open/MultiClinicalBERT")
model = AutoModel.from_pretrained("YLab-Open/MultiClinicalBERT")
```