---
language:
- en
license: apache-2.0
tags:
- biomedical
- clinical
- ul2
- t5
- encoder-decoder
- pretraining
- text2text-generation
- medical
---

# PubMedUL2 & MedUL2

## Model Description

**PubMedUL2** and **MedUL2** are a family of **domain-specific UL2/T5-style encoder–decoder language models** pretrained on large-scale biomedical and medical corpora using the **UL2 (Mixture-of-Denoisers)** objective.

- **PubMedUL2** models are pretrained on **25 million PubMed abstracts**
- **MedUL2** models are pretrained on **PubMed abstracts + clinical notes + additional medical documents**
- All models use a **T5-efficient architecture**, inspired by Google's efficient T5 variants

These checkpoints are **pretraining-only models** and **must be fine-tuned** before use on downstream tasks.

---

## Pretraining Objective: UL2 (Mixture-of-Denoisers)

These models were pretrained using **UL2**, a unified framework that formulates language modeling objectives as **denoising tasks**. UL2 introduces a **Mixture-of-Denoisers (MoD)** approach that samples from multiple denoising paradigms during pretraining.

### Denoising Tasks

UL2 pretraining uses a mixture of three denoising tasks:

1. **R-denoising (Regular Span Corruption)**
   - Equivalent to standard T5 span corruption
   - Optimized for language understanding tasks
2. **X-denoising (Extreme Span Corruption)**
   - Uses very large masked spans
   - Encourages long-form generation and abstraction
3. **S-denoising (Sequential / PrefixLM)**
   - Prefix language modeling similar to causal LM
   - Suitable for sequence-to-sequence and generative tasks

### Paradigm Tokens (Mode Switching)

During pretraining, a **paradigm token** is inserted at the beginning of each input:

| Token | Mode | Recommended Use |
|-------|------|-----------------|
| `[NLU]` | R-denoising | Classification, QA, retrieval |
| `[NLG]` | X-denoising | Mixed understanding & generation |
| `[S2S]` | S-denoising | Generative / causal tasks |

**Important:** For best performance, the same token should be **prepended during fine-tuning and inference**.

---

## Architecture

- Encoder–decoder Transformer (T5-style)
- Uses the **T5-efficient architecture**
- Compatible with Hugging Face `T5ForConditionalGeneration`

---

## Intended Uses

These models are intended to be **fine-tuned** for:

- Biomedical and clinical **text classification**
- **Question answering**
- **Summarization** of medical literature or clinical notes
- **Text generation** in medical contexts

---

## Limitations

- ❌ Not instruction-tuned
- ❌ No supervised training
- ❌ Not suitable for zero-shot use

These checkpoints are **self-supervised pretraining models only** and require task-specific fine-tuning.
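The paradigm-token convention described above can be wrapped in a small helper. A minimal sketch — the helper name and task labels below are illustrative, not part of the released code:

```python
# Paradigm tokens from the UL2 Mixture-of-Denoisers setup.
PARADIGM_TOKENS = {
    "understanding": "[NLU]",  # R-denoising: classification, QA, retrieval
    "mixed": "[NLG]",          # X-denoising: mixed understanding & generation
    "generation": "[S2S]",     # S-denoising: generative / causal tasks
}

def with_paradigm_token(text: str, task: str = "understanding") -> str:
    """Prepend the paradigm token matching the downstream task type."""
    return f"{PARADIGM_TOKENS[task]} {text}"
```

Whichever token is used during fine-tuning should also be prepended at inference time, per the recommendation above.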
---

## Fine-Tuning Recommendations

- **Avoid mixed precision** (fp16 / bf16) initially
- Fine-tuning is more stable in **fp32**
- Always prepend one of `[NLU]`, `[NLG]`, or `[S2S]` to input text
- Suggested defaults:
  - Classification / QA → `[NLU]`
  - Causal or generative tasks → `[S2S]`
  - Mixed tasks → `[NLG]`

---

## Model Parameter Summary

| Model Name | Parameter Count | Description | Access |
|------------|----------------|-------------|--------|
| `pubmedul2-tiny-nl6` | **19.26M** | Tiny UL2-style model with 6 layers | Open |
| `pubmedul2-mini-nl8` | **50.12M** | Mini UL2 with 8 layers | Open |
| `pubmedul2-small` | **60.52M** | Small UL2 variant | Open |
| `pubmedul2-small-nl24` | **192.73M** | Small UL2 with 24 layers | Open |
| `medul2-base` | **222.93M** | Base UL2/T5-style model | Open |
| `pubmedul2-base` | **222.93M** | Base UL2/T5-style model | Open |
| `medul2-base-nl36` | **619.44M** | Base UL2 with 36 layers | Gated commercial |
| `pubmedul2-base-nl36` | **619.44M** | Base UL2 with 36 layers | Gated commercial |
| `medul2-large` | **737.72M** | Large UL2/T5-style model | Gated non-commercial |
| `pubmedul2-large` | **737.72M** | Large UL2/T5-style model | Gated non-commercial |
| `medul2-large-nl36` | **1090.14M** | Very large UL2 with 36 layers | Access on request |

---

## Named Entity Recognition (NER) Evaluation

We evaluate PubMedUL2 and MedUL2 models on a biomedical **Named Entity Recognition (NER)** task using multiple matching criteria to better capture boundary-level performance. The evaluation reports **entity-level F1 scores** across different biomedical entity types and model sizes.

### Exact Match F1

An entity prediction is considered correct only if both the **entity span and label exactly match** the gold annotation.
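As a rough sketch (not the actual evaluation code), entity-level exact-match F1 over `(start, end, label)` triples can be computed as:

```python
def exact_match_f1(gold, pred):
    """Entity-level exact-match F1: a predicted entity counts as a true
    positive only if its (start, end, label) triple appears in the gold set."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```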
| entity_type | medul2-base | pubmedul2-base | pubmedul2-mini-nl8 | pubmedul2-small | pubmedul2-tiny-nl6 |
|:------------|------------:|---------------:|-------------------:|----------------:|-------------------:|
| cell_line | 0.42 | 0.43 | 0.44 | 0.43 | 0.35 |
| cell_type | 0.59 | 0.58 | 0.59 | 0.58 | 0.52 |
| chemical | 0.76 | 0.75 | 0.72 | 0.72 | 0.56 |
| disease | 0.70 | 0.73 | 0.70 | 0.68 | 0.63 |
| dna | 0.59 | 0.55 | 0.54 | 0.55 | 0.45 |
| gene | 0.62 | 0.59 | 0.60 | 0.59 | 0.55 |
| protein | 0.59 | 0.58 | 0.58 | 0.59 | 0.55 |
| rna | 0.60 | 0.56 | 0.55 | 0.60 | 0.56 |
| species | 0.66 | 0.67 | 0.58 | 0.63 | 0.54 |

---

### Partial Match F1

A prediction is counted as correct if it **partially overlaps** with a gold entity of the same type.

| entity_type | medul2-base | pubmedul2-base | pubmedul2-mini-nl8 | pubmedul2-small | pubmedul2-tiny-nl6 |
|:------------|------------:|---------------:|-------------------:|----------------:|-------------------:|
| cell_line | 0.48 | 0.49 | 0.48 | 0.48 | 0.41 |
| cell_type | 0.66 | 0.64 | 0.66 | 0.65 | 0.59 |
| chemical | 0.79 | 0.78 | 0.76 | 0.75 | 0.60 |
| disease | 0.82 | 0.84 | 0.80 | 0.79 | 0.74 |
| dna | 0.65 | 0.61 | 0.60 | 0.61 | 0.53 |
| gene | 0.76 | 0.74 | 0.74 | 0.73 | 0.68 |
| protein | 0.66 | 0.66 | 0.66 | 0.67 | 0.64 |
| rna | 0.68 | 0.63 | 0.64 | 0.66 | 0.65 |
| species | 0.68 | 0.70 | 0.61 | 0.65 | 0.56 |

---

### IoU Match F1

Predictions are evaluated using **Intersection-over-Union (IoU)** overlap between predicted and gold spans, providing a softer boundary-based metric.
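A minimal sketch of the per-span IoU criterion (illustrative only; end offsets are assumed exclusive):

```python
def span_iou(pred_span, gold_span):
    """Intersection-over-Union of two (start, end) spans, end-exclusive."""
    inter = max(0, min(pred_span[1], gold_span[1]) - max(pred_span[0], gold_span[0]))
    union = (pred_span[1] - pred_span[0]) + (gold_span[1] - gold_span[0]) - inter
    return inter / union if union else 0.0
```

A prediction would then count as correct when its IoU with a same-type gold span exceeds a chosen threshold.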
| entity_type | medul2-base | pubmedul2-base | pubmedul2-mini-nl8 | pubmedul2-small | pubmedul2-tiny-nl6 |
|:------------|------------:|---------------:|-------------------:|----------------:|-------------------:|
| cell_line | 0.50 | 0.50 | 0.50 | 0.50 | 0.42 |
| cell_type | 0.67 | 0.66 | 0.68 | 0.67 | 0.62 |
| chemical | 0.83 | 0.83 | 0.82 | 0.82 | 0.72 |
| disease | 0.85 | 0.86 | 0.86 | 0.85 | 0.82 |
| dna | 0.65 | 0.62 | 0.62 | 0.62 | 0.55 |
| gene | 0.76 | 0.75 | 0.75 | 0.74 | 0.71 |
| protein | 0.67 | 0.66 | 0.67 | 0.67 | 0.66 |
| rna | 0.68 | 0.65 | 0.66 | 0.67 | 0.67 |
| species | 0.72 | 0.74 | 0.65 | 0.69 | 0.58 |

---

### Observations

- **MedUL2 models** generally outperform PubMedUL2 on clinical-heavy entity types such as *disease* and *chemical*
- Performance improves consistently from **tiny → base** models
- The softer matching criteria (Partial / IoU) score markedly higher than Exact Match, highlighting boundary ambiguity in biomedical NER

---

## Acknowledgements

This project would not have been possible without compute generously provided by the **Google TPU Research Cloud**. Thanks to:

- The **Finnish-NLP** authors for releasing the UL2 objective code, task definitions, and guidance
- **Yeb Havinga** for help getting started with the **t5x** framework

---

## License

Please refer to the individual model repositories for **license and access details**, which may vary depending on training data sources.