---
language:
- en
license: apache-2.0
tags:
- biomedical
- clinical
- ul2
- t5
- encoder-decoder
- pretraining
- text2text-generation
- medical
---
# PubMedUL2 & MedUL2
## Model Description
**PubMedUL2** and **MedUL2** are a family of **domain-specific UL2/T5-style encoder–decoder language models** pretrained on large-scale biomedical and medical corpora using the **UL2 (Mixture-of-Denoisers)** objective.
- **PubMedUL2** models are pretrained on **25 million PubMed abstracts**
- **MedUL2** models are pretrained on **PubMed abstracts + clinical notes + additional medical documents**
- All models use a **T5-efficient architecture**, inspired by Google’s efficient T5 variants
These checkpoints are **pretraining-only models** and **must be fine-tuned** before use on downstream tasks.
---
## Pretraining Objective: UL2 (Mixture-of-Denoisers)
These models were pretrained using **UL2**, a unified framework that formulates language modeling objectives as **denoising tasks**.
UL2 introduces a **Mixture-of-Denoisers (MoD)** approach that samples from multiple denoising paradigms during pretraining.
### Denoising Tasks
UL2 pretraining uses a mixture of three denoising tasks:
1. **R-denoising (Regular Span Corruption)**
   - Equivalent to standard T5 span corruption
   - Optimized for language understanding tasks
2. **X-denoising (Extreme Span Corruption)**
   - Uses very large masked spans
   - Encourages long-form generation and abstraction
3. **S-denoising (Sequential / PrefixLM)**
   - Prefix language modeling similar to causal LM
   - Suitable for sequence-to-sequence and generative tasks
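
The three denoisers differ mainly in how the input is split into visible and hidden parts. The toy sketch below illustrates that split on whitespace tokens. It is not the pretraining code: the `<extra_id_N>` sentinel names follow the usual T5 convention and are an assumption here, the span positions are hand-picked rather than sampled, and the closing sentinel that T5 appends to targets is omitted for brevity.

```python
from typing import List, Tuple


def span_corrupt(tokens: List[str], spans: List[Tuple[int, int]]) -> Tuple[List[str], List[str]]:
    """Replace each [start, end) span with a sentinel; the target lists what was removed."""
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{sentinel_id}>"  # assumed T5-style sentinel name
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    return inputs, targets


def prefix_lm_split(tokens: List[str], prefix_len: int) -> Tuple[List[str], List[str]]:
    """S-denoising viewed as a prefix split: the prefix is visible, the rest is predicted."""
    return tokens[:prefix_len], tokens[prefix_len:]


tokens = "aspirin reduces the risk of myocardial infarction in adults".split()

# R-denoising: a few short spans are masked.
r_in, r_tgt = span_corrupt(tokens, [(1, 2), (5, 7)])
# X-denoising: one very long span is masked.
x_in, x_tgt = span_corrupt(tokens, [(1, 8)])
# S-denoising: the sequence is cut into a prefix and a continuation.
s_in, s_tgt = prefix_lm_split(tokens, 4)

print(" ".join(r_in), "->", " ".join(r_tgt))
print(" ".join(x_in), "->", " ".join(x_tgt))
print(" ".join(s_in), "->", " ".join(s_tgt))
```

In the R case most of the sentence stays visible, while in the X case the model has to reconstruct nearly everything from two remaining words.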
### Paradigm Tokens (Mode Switching)
During pretraining, a **paradigm token** is inserted at the beginning of each input:
| Token | Mode | Recommended Use |
|------|------|------------------|
| `[NLU]` | R-denoising | Classification, QA, retrieval |
| `[NLG]` | X-denoising | Mixed understanding & generation |
| `[S2S]` | S-denoising | Generative / causal tasks |
**Important:**
For best performance, the same token should be **prepended during fine-tuning and inference**.
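
A minimal sketch of prepending the paradigm token to raw input text before tokenization. The helper name `with_paradigm_token` is hypothetical, and joining the token and the text with a single space is an assumption; check how your fine-tuning data was formatted and keep it identical at inference time.

```python
PARADIGM_TOKENS = ("[NLU]", "[NLG]", "[S2S]")


def with_paradigm_token(text: str, token: str) -> str:
    """Return the input text with one of the three paradigm tokens in front."""
    if token not in PARADIGM_TOKENS:
        raise ValueError(f"token must be one of {PARADIGM_TOKENS}, got {token!r}")
    # Assumption: token and text are separated by a single space.
    return f"{token} {text.strip()}"


example = with_paradigm_token("Metformin is a first-line therapy for type 2 diabetes.", "[NLU]")
print(example)
```
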
---
## Architecture
- Encoder–decoder Transformer (T5-style)
- Uses **T5-efficient architecture**
- Compatible with Hugging Face `T5ForConditionalGeneration`
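
A hedged sketch of loading a checkpoint with the class named above. The helper names are hypothetical, and it assumes the repository ships a tokenizer that `AutoTokenizer` can load. The repository ID is passed in by the caller; something like `"<namespace>/medul2-base"` is a placeholder, not a verified path. Nothing is downloaded until `load_checkpoint` is called, and `count_parameters(model)` can then be compared against the parameter table in this card.

```python
def load_checkpoint(repo_id: str):
    """Load tokenizer and model from the Hub; requires `transformers` and `torch`."""
    # Imported lazily so that defining this helper needs no heavy dependencies.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(repo_id)  # assumes a tokenizer is included
    model = T5ForConditionalGeneration.from_pretrained(repo_id)
    return tokenizer, model


def count_parameters(model) -> int:
    """Total number of parameters in a PyTorch-style model."""
    return sum(p.numel() for p in model.parameters())
```
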
---
## Intended Uses
These models are intended to be **fine-tuned** for:
- Biomedical and clinical **text classification**
- **Question answering**
- **Summarization** of medical literature or clinical notes
- **Text generation** in medical contexts
---
## Limitations
- ❌ Not instruction-tuned
- ❌ No supervised training
- ❌ Not suitable for zero-shot use
These checkpoints are **self-supervised pretraining models only** and require task-specific fine-tuning.
---
## Fine-Tuning Recommendations
- **Avoid mixed precision** (fp16 / bf16) initially
- Fine-tuning is more stable in **fp32**
- Always prepend one of `[NLU]`, `[NLG]`, or `[S2S]` to input text
- Suggested defaults:
  - Classification / QA → `[NLU]`
  - Causal or generative tasks → `[S2S]`
  - Mixed tasks → `[NLG]`
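
The recommendations above could be captured in a small configuration sketch like the following. The task keys in `DEFAULT_TOKEN_BY_TASK` and the helper name `build_training_args` are hypothetical. `fp16` and `bf16` are existing flags of Hugging Face `Seq2SeqTrainingArguments`; leaving both off keeps training in fp32. All other hyperparameters are left at library defaults here and are not a recommendation.

```python
# Default paradigm token per task family, following the list above.
DEFAULT_TOKEN_BY_TASK = {
    "classification": "[NLU]",
    "qa": "[NLU]",
    "generation": "[S2S]",
    "mixed": "[NLG]",
}

# Mixed precision disabled, so fine-tuning runs in fp32.
FP32_FLAGS = {"fp16": False, "bf16": False}


def build_training_args(output_dir: str):
    """Create fp32 training arguments; requires `transformers` and `torch`."""
    # Imported lazily so the constants above can be used without the library.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **FP32_FLAGS)
```
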
---
## Model Parameter Summary
| Model Name | Parameter Count | Description | Access |
|-----------|----------------|------------|------------|
| `pubmedul2-tiny-nl6` | **19.26M** | Tiny UL2-style model with 6 layers | Open |
| `pubmedul2-mini-nl8` | **50.12M** | Mini UL2 with 8 layers | Open |
| `pubmedul2-small` | **60.52M** | Small UL2 variant | Open |
| `pubmedul2-small-nl24` | **192.73M** | Small UL2 with 24 layers | Open |
| `medul2-base` | **222.93M** | Base UL2/T5-style model | Open |
| `pubmedul2-base` | **222.93M** | Base UL2/T5-style model | Open |
| `medul2-base-nl36` | **619.44M** | Base UL2 with 36 layers | Gated commercial |
| `pubmedul2-base-nl36` | **619.44M** | Base UL2 with 36 layers | Gated commercial |
| `medul2-large` | **737.72M** | Large UL2/T5-style model | Gated non-commercial |
| `pubmedul2-large` | **737.72M** | Large UL2/T5-style model | Gated non-commercial |
| `medul2-large-nl36` | **1090.14M** | Very large UL2 with 36 layers | Access on Request |
---
## Named Entity Recognition (NER) Evaluation
We evaluate PubMedUL2 and MedUL2 models on a biomedical **Named Entity Recognition (NER)** task using multiple matching criteria to better capture boundary-level performance.
The evaluation reports **entity-level F1 scores** across different biomedical entity types and model sizes.
### Exact Match F1
An entity prediction is considered correct only if both the **entity span and label exactly match** the gold annotation.
| entity_type | medul2-base | pubmedul2-base | pubmedul2-mini-nl8 | pubmedul2-small | pubmedul2-tiny-nl6 |
|:--------------|--------------:|-----------------:|---------------------:|------------------:|---------------------:|
| cell_line | 0.42 | 0.43 | 0.44 | 0.43 | 0.35 |
| cell_type | 0.59 | 0.58 | 0.59 | 0.58 | 0.52 |
| chemical | 0.76 | 0.75 | 0.72 | 0.72 | 0.56 |
| disease | 0.7 | 0.73 | 0.7 | 0.68 | 0.63 |
| dna | 0.59 | 0.55 | 0.54 | 0.55 | 0.45 |
| gene | 0.62 | 0.59 | 0.6 | 0.59 | 0.55 |
| protein | 0.59 | 0.58 | 0.58 | 0.59 | 0.55 |
| rna | 0.6 | 0.56 | 0.55 | 0.6 | 0.56 |
| species | 0.66 | 0.67 | 0.58 | 0.63 | 0.54 |
---
### Partial Match F1
A prediction is counted as correct if it **partially overlaps** with a gold entity of the same type.
| entity_type | medul2-base | pubmedul2-base | pubmedul2-mini-nl8 | pubmedul2-small | pubmedul2-tiny-nl6 |
|:--------------|--------------:|-----------------:|---------------------:|------------------:|---------------------:|
| cell_line | 0.48 | 0.49 | 0.48 | 0.48 | 0.41 |
| cell_type | 0.66 | 0.64 | 0.66 | 0.65 | 0.59 |
| chemical | 0.79 | 0.78 | 0.76 | 0.75 | 0.6 |
| disease | 0.82 | 0.84 | 0.8 | 0.79 | 0.74 |
| dna | 0.65 | 0.61 | 0.6 | 0.61 | 0.53 |
| gene | 0.76 | 0.74 | 0.74 | 0.73 | 0.68 |
| protein | 0.66 | 0.66 | 0.66 | 0.67 | 0.64 |
| rna | 0.68 | 0.63 | 0.64 | 0.66 | 0.65 |
| species | 0.68 | 0.7 | 0.61 | 0.65 | 0.56 |
---
### IoU Match F1
Predictions are evaluated using **Intersection-over-Union (IoU)** overlap between predicted and gold spans, providing a softer boundary-based metric.
| entity_type | medul2-base | pubmedul2-base | pubmedul2-mini-nl8 | pubmedul2-small | pubmedul2-tiny-nl6 |
|:--------------|--------------:|-----------------:|---------------------:|------------------:|---------------------:|
| cell_line | 0.5 | 0.5 | 0.5 | 0.5 | 0.42 |
| cell_type | 0.67 | 0.66 | 0.68 | 0.67 | 0.62 |
| chemical | 0.83 | 0.83 | 0.82 | 0.82 | 0.72 |
| disease | 0.85 | 0.86 | 0.86 | 0.85 | 0.82 |
| dna | 0.65 | 0.62 | 0.62 | 0.62 | 0.55 |
| gene | 0.76 | 0.75 | 0.75 | 0.74 | 0.71 |
| protein | 0.67 | 0.66 | 0.67 | 0.67 | 0.66 |
| rna | 0.68 | 0.65 | 0.66 | 0.67 | 0.67 |
| species | 0.72 | 0.74 | 0.65 | 0.69 | 0.58 |
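
The three matching criteria can be illustrated with the sketch below. It is one possible reading, not the evaluation script behind the tables: spans are assumed to be `(start, end, label)` character offsets with an exclusive end, matching is greedy one-to-one, and the IoU criterion is implemented as a hard threshold (`iou_threshold`, default 0.5), which is an assumption. The card does not state a threshold or whether IoU was used as a soft score, so this sketch should not be expected to reproduce the reported numbers.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def overlap(a: Span, b: Span) -> int:
    """Number of characters shared by two spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def iou(a: Span, b: Span) -> float:
    """Intersection-over-Union of two spans."""
    inter = overlap(a, b)
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def is_match(pred: Span, gold: Span, mode: str, iou_threshold: float = 0.5) -> bool:
    """Decide whether a prediction counts as correct under the given criterion."""
    if pred[2] != gold[2]:
        return False  # labels must agree in every mode
    if mode == "exact":
        return pred[:2] == gold[:2]
    if mode == "partial":
        return overlap(pred, gold) > 0
    if mode == "iou":
        return iou(pred, gold) >= iou_threshold  # assumed hard threshold
    raise ValueError(f"unknown mode: {mode!r}")


def entity_f1(preds: List[Span], golds: List[Span], mode: str, iou_threshold: float = 0.5) -> float:
    """Entity-level F1 with greedy one-to-one matching of predictions to gold spans."""
    unmatched = list(golds)
    true_positives = 0
    for pred in preds:
        for gold in unmatched:
            if is_match(pred, gold, mode, iou_threshold):
                unmatched.remove(gold)
                true_positives += 1
                break
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(golds) if golds else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: the second prediction misses the first six characters of the gold span.
gold = [(0, 7, "chemical"), (20, 38, "disease")]
pred = [(0, 7, "chemical"), (26, 38, "disease")]

for mode in ("exact", "partial", "iou"):
    print(mode, entity_f1(pred, gold, mode))
```
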
---
### Observations
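- At base size, **MedUL2** and **PubMedUL2** are close: `medul2-base` leads on *chemical*, *dna*, *gene*, and *rna*, while `pubmedul2-base` leads on *disease* and *species* under all three criteria
- Scores generally rise with model size: `pubmedul2-tiny-nl6` is lowest on most entity types (*rna* is the main exception), while the mini, small, and base models are often within a few points of each other
- The boundary-tolerant criteria (Partial / IoU) score higher than Exact Match for every entity type and model (for `medul2-base` on *disease*: 0.70 exact, 0.82 partial, 0.85 IoU), which points to boundary ambiguity in biomedical NER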
---
## Acknowledgements
This project would not have been possible without compute generously provided by **Google TPU Research Cloud**.
Thanks to:
- The **Finnish-NLP** authors for releasing the UL2 objective code, task definitions, and guidance
- **Yeb Havinga** for help getting started with the **t5x** framework
---
## License
Please refer to the individual model repositories for **license and access details**, which may vary depending on training data sources.