---
language:
- en
- zh
- es
- ja
- ru
tags:
- fill-mask
- clinical-nlp
- multilingual
- bert
license: mit
---

# MultiClinicalBERT: A Multilingual Transformer Pretrained on Real-World Clinical Notes

MultiClinicalBERT is a multilingual transformer model pretrained on real-world clinical notes in multiple languages. It is designed to deliver strong, consistent performance on clinical NLP tasks in both high-resource and low-resource settings.

To the best of our knowledge, this is the first open-source BERT model pretrained specifically on multilingual real-world clinical notes.


## Model Overview

MultiClinicalBERT is initialized from `bert-base-multilingual-cased` and further pretrained using a two-stage domain-adaptive strategy on a large-scale multilingual clinical corpus (BRIDGE), combined with biomedical and general-domain data.

The model:
- Captures clinical terminology and documentation patterns
- Learns cross-lingual representations for medical text
- Delivers robust performance across diverse healthcare datasets


## Pretraining Data

The model is trained on a mixture of three data sources:

### 1. Clinical Data (BRIDGE Corpus)
- 87 multilingual clinical datasets
- ~1.42M documents
- ~995M tokens
- Languages: English, Chinese, Spanish, Japanese, Russian

This dataset reflects real-world clinical practice and is the core contribution of this work.

### 2. Biomedical Literature (PubMed)
- ~1.25M documents
- ~194M tokens

Provides domain knowledge and medical terminology.

### 3. General-Domain Text (Wikipedia)
- ~5.8K documents
- ~43M tokens
- Languages: Spanish, Japanese, Russian

Improves general linguistic coverage.

### Total
- ~2.7M documents, >1.2B tokens
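
The totals follow from the three sources; a quick sanity check using the approximate figures above (the exact counts here are assumptions based on this card's rounded numbers):

```python
# Approximate (documents, tokens) per source, from the sections above
corpora = {
    "BRIDGE":    (1_420_000, 995_000_000),
    "PubMed":    (1_250_000, 194_000_000),
    "Wikipedia": (5_800,      43_000_000),
}

total_docs = sum(docs for docs, _ in corpora.values())
total_tokens = sum(toks for _, toks in corpora.values())
print(total_docs, total_tokens)  # ~2.7M documents, >1.2B tokens
```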


## Pretraining Strategy

We adopt a **two-stage domain-adaptive pretraining approach**:

### Stage 1: Mixed-domain pretraining
- Data: BRIDGE + PubMed + Wikipedia
- Goal: Inject biomedical and multilingual knowledge

### Stage 2: Clinical-specific adaptation
- Data: BRIDGE only
- Goal: Learn fine-grained clinical language patterns

### Objective
- Masked Language Modeling (MLM)
- 15% token masking
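
In practice this objective is handled by a data collator (e.g. `DataCollatorForLanguageModeling(mlm_probability=0.15)` in `transformers`). A framework-free sketch of the standard BERT masking recipe, which we assume applies here (of the selected 15%: 80% become `[MASK]`, 10% a random token, 10% unchanged):

```python
import random

def mask_for_mlm(token_ids, mask_id, vocab_size, mlm_prob=0.15, seed=0):
    """BERT-style masking: select ~mlm_prob of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> left unchanged.
    Returns (masked_inputs, labels); labels are -100 at unselected positions."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(inputs)              # -100 = ignored by the MLM loss
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels[i] = tok                    # predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id            # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels
```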


## Evaluation

We evaluate MultiClinicalBERT on **11 clinical NLP tasks across 5 languages**:

- English: MIMIC-III Mortality, MedNLI, MIMIC-IV CDM
- Chinese: CEMR, IMCS-V2 NER
- Japanese: IFMIR NER, IFMIR Incident Type
- Russian: RuMedNLI, RuCCoNNER
- Spanish: De-identification, PPTS

### Key Results
- Consistently outperforms multilingual BERT (mBERT)
- Matches or exceeds strong language-specific models
- Largest gains observed in low-resource settings
- Statistically significant improvements (Welch’s t-test, p < 0.05)
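
The significance claim relies on Welch’s unequal-variance t-test across repeated runs. A minimal sketch of the statistic (the sample scores in the test are illustrative, not results from this work):

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic and degrees of freedom for two samples,
    e.g. per-seed accuracy scores of two models."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)          # sample variances (n - 1)
    se2 = va / na + vb / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch–Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```

A two-sided p-value can then be read from a t-distribution with `df` degrees of freedom (e.g. `scipy.stats.t.sf(abs(t), df) * 2`).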

Examples:
- MedNLI: **83.90% accuracy**
- CEMR: **86.38% accuracy**
- IFMIR NER: **85.53 F1**
- RuMedNLI: **78.31% accuracy**


## Key Contributions

- First BERT model pretrained on **multilingual real-world clinical notes**
- Large-scale clinical corpus (BRIDGE) with diverse languages
- Effective **two-stage domain adaptation strategy**
- Strong performance across **multiple languages and tasks**
- Suitable for:
  - Clinical NLP
  - Multilingual medical text understanding
  - Retrieval-augmented generation (RAG)
  - Clinical decision support systems
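
For the retrieval-style uses above, one common recipe (an assumption on our part, not prescribed by this card) is to mean-pool the encoder's final hidden states over non-padding tokens to obtain one vector per note. The pooling step itself is framework-agnostic; in `transformers` the hidden states would come from `model(**inputs).last_hidden_state`:

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors, ignoring padding positions.
    hidden_states: list of per-token vectors (list[float]) for one sequence.
    attention_mask: list of 0/1 flags, 1 = real token."""
    dim = len(hidden_states[0])
    totals = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                totals[j] += vec[j]
    return [t / count for t in totals]
```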


## Usage

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("YLab-Open/MultiClinicalBERT")
model = AutoModel.from_pretrained("YLab-Open/MultiClinicalBERT")

# Encode a clinical sentence; last_hidden_state is (batch, seq_len, 768)
inputs = tokenizer("Patient denies chest pain.", return_tensors="pt")
outputs = model(**inputs)
```