Siddharth63 committed
Commit e37091f · verified · 1 Parent(s): 552aa23

Update README.md

  1. README.md +204 -1
README.md CHANGED
@@ -1,3 +1,206 @@
  ---
- license: artistic-2.0
  ---
  ---
+ language:
+ - en
+ license: apache-2.0
+ tags:
+ - biomedical
+ - clinical
+ - ul2
+ - t5
+ - encoder-decoder
+ - pretraining
+ - text2text-generation
+ - medical
  ---
+
+ # PubMedUL2 & MedUL2
+
+ ## Model Description
+
+ **PubMedUL2** and **MedUL2** are a family of **domain-specific UL2/T5-style encoder–decoder language models** pretrained on large-scale biomedical and medical corpora using the **UL2 (Mixture-of-Denoisers)** objective.
+
+ - **PubMedUL2** models are pretrained on **25 million PubMed abstracts**
+ - **MedUL2** models are pretrained on **PubMed abstracts + clinical notes + additional medical documents**
+ - All models use a **T5-efficient architecture**, inspired by Google’s efficient T5 variants
+
+ These checkpoints are **pretraining-only models** and **must be fine-tuned** before use on downstream tasks.
+
+ ---
+
+ ## Pretraining Objective: UL2 (Mixture-of-Denoisers)
+
+ These models were pretrained using **UL2**, a unified framework that formulates language modeling objectives as **denoising tasks**.
+
+ UL2 introduces a **Mixture-of-Denoisers (MoD)** approach that samples from multiple denoising paradigms during pretraining.
+
+ ### Denoising Tasks
+
+ UL2 pretraining uses a mixture of three denoising tasks:
+
+ 1. **R-denoising (Regular Span Corruption)**
+    - Equivalent to standard T5 span corruption
+    - Optimized for language understanding tasks
+
+ 2. **X-denoising (Extreme Span Corruption)**
+    - Uses very large masked spans
+    - Encourages long-form generation and abstraction
+
+ 3. **S-denoising (Sequential / PrefixLM)**
+    - Prefix language modeling similar to causal LM
+    - Suitable for sequence-to-sequence and generative tasks
+
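The three denoising tasks above can be illustrated with a small, self-contained sketch. It is not the pretraining code used for these checkpoints: the `<extra_id_N>` sentinel names follow the standard T5 vocabulary (an assumption here), the helper names `corrupt_spans` and `prefix_lm_split` are hypothetical, the spans are hand-picked rather than sampled, and the S-denoising split is simplified to a plain prefix/continuation pair.

```python
from typing import List, Tuple


def corrupt_spans(tokens: List[str], spans: List[Tuple[int, int]]) -> Tuple[str, str]:
    """Replace each (start, end) token span with a sentinel; return (input, target)."""
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"         # assumed T5-style sentinel name
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted text
        inputs.append(sentinel)              # one sentinel per masked span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the decoder reconstructs the span
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)


def prefix_lm_split(tokens: List[str], split: int) -> Tuple[str, str]:
    """Simplified S-denoising: the prefix is the input, the rest is the target."""
    return " ".join(tokens[:split]), " ".join(tokens[split:])


tokens = "Aspirin inhibits platelet aggregation in patients with coronary disease".split()

# R-denoising: a few short spans are masked.
r_input, r_target = corrupt_spans(tokens, [(1, 2), (5, 6)])
# X-denoising: one long span covering most of the sequence is masked.
x_input, x_target = corrupt_spans(tokens, [(1, 7)])
# S-denoising: continue the text from a prefix.
s_input, s_target = prefix_lm_split(tokens, 4)

print(r_input, "->", r_target)
print(x_input, "->", x_target)
print(s_input, "->", s_target)
```

In actual UL2 pretraining the span lengths and corruption rates are sampled per denoiser; the fixed spans here only show the shape of the input/target pairs.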
+ ### Paradigm Tokens (Mode Switching)
+
+ During pretraining, a **paradigm token** is inserted at the beginning of each input:
+
+ | Token | Mode | Recommended Use |
+ |------|------|------------------|
+ | `[NLU]` | R-denoising | Classification, QA, retrieval |
+ | `[NLG]` | X-denoising | Mixed understanding & generation |
+ | `[S2S]` | S-denoising | Generative / causal tasks |
+
+ **Important:**
+ For best performance, the same token should be **prepended during fine-tuning and inference**.
+
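A minimal sketch of prepending the paradigm token consistently at fine-tuning and inference time. The helper name `with_paradigm_token` and the single space between token and text are assumptions; the README does not specify the exact separator.

```python
# Paradigm tokens from the table above, keyed by a short mode name.
PARADIGM_TOKENS = {"nlu": "[NLU]", "nlg": "[NLG]", "s2s": "[S2S]"}


def with_paradigm_token(text: str, mode: str) -> str:
    """Prepend the paradigm token for `mode`, unless it is already present."""
    token = PARADIGM_TOKENS[mode.lower()]  # raises KeyError for unknown modes
    if text.startswith(token):
        return text
    return f"{token} {text}"  # assumed separator: one space


example = with_paradigm_token("Metformin is used to treat type 2 diabetes.", "nlu")
print(example)
```

Applying the same helper in both the training data pipeline and the inference path keeps the two from drifting apart; its output is what would then be passed to the tokenizer.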
+ ---
+
+ ## Architecture
+
+ - Encoder–decoder Transformer (T5-style)
+ - Uses **T5-efficient architecture**
+ - Compatible with Hugging Face `T5ForConditionalGeneration`
+
+ ---
+
+ ## Intended Uses
+
+ These models are intended to be **fine-tuned** for:
+
+ - Biomedical and clinical **text classification**
+ - **Question answering**
+ - **Summarization** of medical literature or clinical notes
+ - **Text generation** in medical contexts
+
+ ---
+
+ ## Limitations
+
+ - ❌ Not instruction-tuned
+ - ❌ No supervised training
+ - ❌ Not suitable for zero-shot use
+
+ These checkpoints are **self-supervised pretraining models only** and require task-specific fine-tuning.
+
+ ---
+
+ ## Fine-Tuning Recommendations
+
+ - **Avoid mixed precision** (fp16 / bf16) initially
+ - Fine-tuning is more stable in **fp32**
+ - Always prepend one of `[NLU]`, `[NLG]`, or `[S2S]` to input text
+ - Suggested defaults:
+   - Classification / QA → `[NLU]`
+   - Causal or generative tasks → `[S2S]`
+   - Mixed tasks → `[NLG]`
+
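As a sketch of the precision recommendation above, the settings can be collected as keyword arguments meant for `transformers.Seq2SeqTrainingArguments`. Only `fp16=False` and `bf16=False` reflect the README; the output path, learning rate, batch size, and epoch count are placeholder values chosen for illustration, not recommendations from the authors.

```python
# Keyword arguments intended for transformers.Seq2SeqTrainingArguments(**training_kwargs).
training_kwargs = {
    "output_dir": "medul2-finetune",    # hypothetical output path
    "fp16": False,                      # README: avoid mixed precision initially
    "bf16": False,                      # README: fine-tuning is more stable in fp32
    "learning_rate": 1e-4,              # placeholder, tune per task
    "per_device_train_batch_size": 8,   # placeholder
    "num_train_epochs": 3,              # placeholder
}

# Quick guard that can be kept in a training script.
uses_mixed_precision = training_kwargs["fp16"] or training_kwargs["bf16"]
print("mixed precision enabled:", uses_mixed_precision)
```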
+ ---
+
+ ## Model Parameter Summary
+
+ | Model Name | Parameter Count | Description | Access |
+ |-----------|----------------|------------|------------|
+ | `pubmedul2-tiny-nl6` | **19.26M** | Tiny UL2-style model with 6 layers | Open |
+ | `pubmedul2-mini-nl8` | **50.12M** | Mini UL2 with 8 layers | Open |
+ | `pubmedul2-small` | **60.52M** | Small UL2 variant | Open |
+ | `pubmedul2-small-nl24` | **192.73M** | Small UL2 with 24 layers | Open |
+ | `medul2-base` | **222.93M** | Base UL2/T5-style model | Open |
+ | `pubmedul2-base` | **222.93M** | Base UL2/T5-style model | Open |
+ | `medul2-base-nl36` | **619.44M** | Base UL2 with 36 layers | Gated commercial |
+ | `pubmedul2-base-nl36` | **619.44M** | Base UL2 with 36 layers | Gated commercial |
+ | `medul2-large` | **737.72M** | Large UL2/T5-style model | Gated non-commercial |
+ | `pubmedul2-large` | **737.72M** | Large UL2/T5-style model | Gated non-commercial |
+ | `medul2-large-nl36` | **1090.14M** | Very large UL2 with 36 layers | Access on Request |
+
+ ---
+
+ ## Named Entity Recognition (NER) Evaluation
+
+ We evaluate PubMedUL2 and MedUL2 models on a biomedical **Named Entity Recognition (NER)** task using multiple matching criteria to better capture boundary-level performance.
+
+ The evaluation reports **entity-level F1 scores** across different biomedical entity types and model sizes.
+
+ ### Exact Match F1
+
+ An entity prediction is considered correct only if both the **entity span and label exactly match** the gold annotation.
+
+ | entity_type | medul2-base | pubmedul2-base | pubmedul2-mini-nl8 | pubmedul2-small | pubmedul2-tiny-nl6 |
+ |:------------|------------:|---------------:|-------------------:|----------------:|-------------------:|
+ | cell_line | 0.42 | 0.43 | 0.44 | 0.43 | 0.35 |
+ | cell_type | 0.59 | 0.58 | 0.59 | 0.58 | 0.52 |
+ | chemical | 0.76 | 0.75 | 0.72 | 0.72 | 0.56 |
+ | disease | 0.70 | 0.73 | 0.70 | 0.68 | 0.63 |
+ | dna | 0.59 | 0.55 | 0.54 | 0.55 | 0.45 |
+ | gene | 0.62 | 0.59 | 0.60 | 0.59 | 0.55 |
+ | protein | 0.59 | 0.58 | 0.58 | 0.59 | 0.55 |
+ | rna | 0.60 | 0.56 | 0.55 | 0.60 | 0.56 |
+ | species | 0.66 | 0.67 | 0.58 | 0.63 | 0.54 |
+
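The exact-match criterion can be sketched as follows. This is an illustrative implementation, not the evaluation script behind the table: entities are assumed to be `(start, end, label)` character offsets with an exclusive end, and the function name `exact_match_f1` is hypothetical.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, str]  # (start, end, label), end exclusive (assumed format)


def exact_match_f1(gold: Set[Entity], pred: Set[Entity]) -> float:
    """Entity-level F1 where a prediction counts only if span and label both match."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {(0, 7, "chemical"), (27, 34, "disease")}
pred = {(0, 7, "chemical"), (27, 30, "disease")}  # second span is cut short

score = exact_match_f1(gold, pred)
print(score)  # 0.5: the truncated disease span earns no credit
```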
+ ---
+
+ ### Partial Match F1
+
+ A prediction is counted as correct if it **partially overlaps** with a gold entity of the same type.
+
+ | entity_type | medul2-base | pubmedul2-base | pubmedul2-mini-nl8 | pubmedul2-small | pubmedul2-tiny-nl6 |
+ |:------------|------------:|---------------:|-------------------:|----------------:|-------------------:|
+ | cell_line | 0.48 | 0.49 | 0.48 | 0.48 | 0.41 |
+ | cell_type | 0.66 | 0.64 | 0.66 | 0.65 | 0.59 |
+ | chemical | 0.79 | 0.78 | 0.76 | 0.75 | 0.60 |
+ | disease | 0.82 | 0.84 | 0.80 | 0.79 | 0.74 |
+ | dna | 0.65 | 0.61 | 0.60 | 0.61 | 0.53 |
+ | gene | 0.76 | 0.74 | 0.74 | 0.73 | 0.68 |
+ | protein | 0.66 | 0.66 | 0.66 | 0.67 | 0.64 |
+ | rna | 0.68 | 0.63 | 0.64 | 0.66 | 0.65 |
+ | species | 0.68 | 0.70 | 0.61 | 0.65 | 0.56 |
+
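One possible reading of the partial-match criterion is sketched below. The README does not say how overlapping predictions are counted toward precision and recall, so the counting here (a prediction is a hit if it overlaps any same-type gold entity, and a gold entity is found if any same-type prediction overlaps it) is an assumption, as are the `(start, end, label)` format and the function names.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start, end, label), end exclusive (assumed format)


def overlaps(a: Entity, b: Entity) -> bool:
    """True if both entities share a label and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def partial_match_f1(gold: List[Entity], pred: List[Entity]) -> float:
    """Entity-level F1 where any same-type overlap counts as a match."""
    if not gold or not pred:
        return 0.0
    hit_preds = sum(any(overlaps(p, g) for g in gold) for p in pred)
    found_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    precision = hit_preds / len(pred)
    recall = found_gold / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [(0, 7, "chemical"), (27, 34, "disease")]
truncated = [(0, 7, "chemical"), (27, 30, "disease")]  # right type, short span
mislabeled = [(0, 7, "chemical"), (27, 30, "gene")]    # overlap but wrong type

print(partial_match_f1(gold, truncated))   # 1.0
print(partial_match_f1(gold, mislabeled))  # 0.5
```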
+ ---
+
+ ### IoU Match F1
+
+ Predictions are evaluated using **Intersection-over-Union (IoU)** overlap between predicted and gold spans, providing a softer boundary-based metric.
+
+ | entity_type | medul2-base | pubmedul2-base | pubmedul2-mini-nl8 | pubmedul2-small | pubmedul2-tiny-nl6 |
+ |:------------|------------:|---------------:|-------------------:|----------------:|-------------------:|
+ | cell_line | 0.50 | 0.50 | 0.50 | 0.50 | 0.42 |
+ | cell_type | 0.67 | 0.66 | 0.68 | 0.67 | 0.62 |
+ | chemical | 0.83 | 0.83 | 0.82 | 0.82 | 0.72 |
+ | disease | 0.85 | 0.86 | 0.86 | 0.85 | 0.82 |
+ | dna | 0.65 | 0.62 | 0.62 | 0.62 | 0.55 |
+ | gene | 0.76 | 0.75 | 0.75 | 0.74 | 0.71 |
+ | protein | 0.67 | 0.66 | 0.67 | 0.67 | 0.66 |
+ | rna | 0.68 | 0.65 | 0.66 | 0.67 | 0.67 |
+ | species | 0.72 | 0.74 | 0.65 | 0.69 | 0.58 |
+
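Span-level IoU itself is straightforward to compute; how it is turned into an F1 score is not stated in the README. The sketch below shows the IoU of two character spans and one hypothetical way to use it, a thresholded match with an explicit `threshold` parameter. The reported numbers may well use a different aggregation, so treat `iou_match_f1` as an illustration only.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start, end, label), end exclusive (assumed format)


def span_iou(a: Entity, b: Entity) -> float:
    """Intersection-over-Union of two spans; 0.0 if the labels differ."""
    if a[2] != b[2]:
        return 0.0
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def iou_match_f1(gold: List[Entity], pred: List[Entity], threshold: float = 0.5) -> float:
    """F1 where a pair matches if its IoU reaches `threshold` (assumed rule)."""
    if not gold or not pred:
        return 0.0
    hit_preds = sum(any(span_iou(p, g) >= threshold for g in gold) for p in pred)
    found_gold = sum(any(span_iou(g, p) >= threshold for p in pred) for g in gold)
    precision = hit_preds / len(pred)
    recall = found_gold / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [(0, 7, "chemical"), (27, 34, "disease")]
pred = [(0, 7, "chemical"), (27, 30, "disease")]

print(span_iou(gold[1], pred[1]))                # 3 / 7, about 0.43
print(iou_match_f1(gold, pred, threshold=0.5))   # 0.5
print(iou_match_f1(gold, pred, threshold=0.4))   # 1.0
```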
+ ---
+
+ ### Observations
+
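+ - At base size, `medul2-base` leads `pubmedul2-base` on *dna*, *gene*, and *rna* under all three criteria and is ahead by at most 0.01 on *chemical*, while `pubmedul2-base` leads on *disease* and *species* by 0.01 to 0.03
+ - Scores generally improve from **tiny → base models** for most entity types; the largest gap is between `pubmedul2-tiny-nl6` and the other models, while `pubmedul2-mini-nl8`, `pubmedul2-small`, and the base models are often within a few points of each other
+ - Boundary-tolerant metrics (Partial / IoU) score higher than Exact Match for every entity type and model (for example, *disease* with `medul2-base`: 0.70 exact, 0.82 partial, 0.85 IoU), highlighting boundary ambiguity in biomedical NER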
+
+ ---
+
+ ## Acknowledgements
+
+ This project would not have been possible without compute generously provided by **Google TPU Research Cloud**.
+
+ Thanks to:
+ - The **Finnish-NLP** authors for releasing the UL2 objective code, task definitions, and guidance
+ - **Yeb Havinga** for help getting started with the **t5x** framework
+
+ ---
+
+ ## License
+
+ Please refer to the individual model repositories for **license and access details**, which may vary depending on training data sources.