---
language:
- en
license: apache-2.0
tags:
- biomedical
- clinical
- ul2
- t5
- encoder-decoder
- pretraining
- text2text-generation
- medical
---

# PubMedUL2 & MedUL2

## Model Description

**PubMedUL2** and **MedUL2** are a family of **domain-specific UL2/T5-style encoder–decoder language models** pretrained on large-scale biomedical and medical corpora using the **UL2 (Mixture-of-Denoisers)** objective.

- **PubMedUL2** models are pretrained on **25 million PubMed abstracts**
- **MedUL2** models are pretrained on **PubMed abstracts + clinical notes + additional medical documents**
- All models use a **T5-efficient architecture**, inspired by Google’s efficient T5 variants

These checkpoints are **pretraining-only models** and **must be fine-tuned** before use on downstream tasks.

---

## Pretraining Objective: UL2 (Mixture-of-Denoisers)

These models were pretrained using **UL2**, a unified framework that formulates language modeling objectives as **denoising tasks**.

UL2 introduces a **Mixture-of-Denoisers (MoD)** approach that samples from multiple denoising paradigms during pretraining.

### Denoising Tasks

UL2 pretraining uses a mixture of three denoising tasks, illustrated in the sketch after this list:

1. **R-denoising (Regular Span Corruption)**
   - Equivalent to standard T5 span corruption
   - Optimized for language understanding tasks

2. **X-denoising (Extreme Span Corruption)**
   - Uses very large masked spans
   - Encourages long-form generation and abstraction

3. **S-denoising (Sequential / PrefixLM)**
   - Prefix language modeling similar to causal LM
   - Suitable for sequence-to-sequence and generative tasks
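
To make the three modes concrete, here is a toy sketch of the input/target formats. It is illustrative only: the actual pretraining pipeline, corruption rates, and span-length distributions follow the UL2 recipe, and the `<extra_id_n>` sentinels assume the standard T5 convention.

```python
# Toy illustration of the three UL2 denoisers (NOT the exact pretraining
# pipeline; span lengths and sentinel names are assumptions).
import random

random.seed(0)  # reproducible toy output

SENTINELS = [f"<extra_id_{i}>" for i in range(100)]  # standard T5 sentinels

def span_corrupt(words, span_len, n_spans):
    """Mask `n_spans` non-overlapping spans of `span_len` words each;
    return (corrupted_input, denoising_target) as strings."""
    starts = []
    while len(starts) < n_spans:  # rejection-sample non-overlapping starts
        s = random.randrange(0, len(words) - span_len + 1)
        if all(abs(s - t) >= span_len for t in starts):
            starts.append(s)
    starts.sort()
    inp, tgt, prev = [], [], 0
    for i, s in enumerate(starts):
        inp += words[prev:s] + [SENTINELS[i]]
        tgt += [SENTINELS[i]] + words[s:s + span_len]
        prev = s + span_len
    inp += words[prev:]
    return " ".join(inp), " ".join(tgt)

text = ("the patient was started on metformin after a new diagnosis "
        "of type 2 diabetes mellitus with elevated hba1c").split()

# R-denoising: a few short spans, modest corruption (T5-style)
print("[NLU]", *span_corrupt(text, span_len=3, n_spans=2), sep=" | ")

# X-denoising: extreme corruption via one long span
print("[NLG]", *span_corrupt(text, span_len=8, n_spans=1), sep=" | ")

# S-denoising: PrefixLM -- keep a prefix, predict the continuation
half = len(text) // 2
print("[S2S]", " ".join(text[:half]), " ".join(text[half:]), sep=" | ")
```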

### Paradigm Tokens (Mode Switching)

During pretraining, a **paradigm token** is inserted at the beginning of each input:

| Token | Mode | Recommended Use |
|-------|------|-----------------|
| `[NLU]` | R-denoising | Classification, QA, retrieval |
| `[NLG]` | X-denoising | Mixed understanding & generation |
| `[S2S]` | S-denoising | Generative / causal tasks |

**Important:** For best performance, the same token should be **prepended during fine-tuning and inference**.
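
For example, a minimal (hypothetical) helper for prepending the paradigm token; whether `[NLU]`/`[NLG]`/`[S2S]` are registered as special tokens may vary per checkpoint, so they are simply prepended as text here:

```python
# Hypothetical helper: prepend the UL2 paradigm token matching the task.
MODE_TOKENS = {"nlu": "[NLU]", "nlg": "[NLG]", "s2s": "[S2S]"}

def with_mode(text: str, mode: str = "nlu") -> str:
    """Prepend the paradigm token for `mode` to raw input text."""
    return f"{MODE_TOKENS[mode]} {text}"

print(with_mode("does metformin lower hba1c?", mode="nlu"))
# -> [NLU] does metformin lower hba1c?
```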

---

## Architecture

- Encoder–decoder Transformer (T5-style)
- Uses **T5-efficient architecture**
- Compatible with Hugging Face `T5ForConditionalGeneration` (see the loading sketch below)
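
Since the checkpoints are compatible with `T5ForConditionalGeneration`, loading should look roughly like this. The repository id is a placeholder; substitute one of the model names from the parameter summary below, prefixed with its namespace:

```python
# Loading sketch; repo id is a placeholder, not a real published path.
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "<namespace>/pubmedul2-base"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Remember the paradigm token; untuned checkpoints will only denoise.
inputs = tokenizer("[S2S] the patient was admitted with", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```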

---

## Intended Uses

These models are intended to be **fine-tuned** for:

- Biomedical and clinical **text classification**
- **Question answering**
- **Summarization** of medical literature or clinical notes
- **Text generation** in medical contexts

---

## Limitations

- ❌ Not instruction-tuned
- ❌ No supervised training
- ❌ Not suitable for zero-shot use

These checkpoints are **self-supervised pretraining models only** and require task-specific fine-tuning.

---

## Fine-Tuning Recommendations

- **Avoid mixed precision** (fp16 / bf16) initially; fine-tuning is more stable in **fp32**
- Always prepend one of `[NLU]`, `[NLG]`, or `[S2S]` to the input text
- Suggested defaults (see the sketch after this list):
  - Classification / QA → `[NLU]`
  - Causal or generative tasks → `[S2S]`
  - Mixed tasks → `[NLG]`
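
Putting these recommendations together, a minimal fine-tuning sketch assuming fp32 weights and a paradigm token on every input; the repository id, toy data, and learning rate are placeholders, not tuned values:

```python
# Fine-tuning sketch: fp32 weights (per the stability note above) and a
# paradigm token on every input. Repo id, data, and lr are placeholders.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "<namespace>/pubmedul2-base"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float32)  # keep fp32; skip fp16/bf16 at first
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

train_pairs = [("[NLU] classify: crushing chest pain radiating to the left arm",
                "cardiac")]  # toy classification-as-text2text example

model.train()
for src, tgt in train_pairs:
    batch = tokenizer(src, return_tensors="pt")
    labels = tokenizer(tgt, return_tensors="pt").input_ids
    loss = model(**batch, labels=labels).loss  # standard seq2seq cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```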

---

## Model Parameter Summary

| Model Name | Parameter Count | Description | Access |
|------------|-----------------|-------------|--------|
| `pubmedul2-tiny-nl6` | **19.26M** | Tiny UL2-style model with 6 layers | Open |
| `pubmedul2-mini-nl8` | **50.12M** | Mini UL2 with 8 layers | Open |
| `pubmedul2-small` | **60.52M** | Small UL2 variant | Open |
| `pubmedul2-small-nl24` | **192.73M** | Small UL2 with 24 layers | Open |
| `medul2-base` | **222.93M** | Base UL2/T5-style model | Open |
| `pubmedul2-base` | **222.93M** | Base UL2/T5-style model | Open |
| `medul2-base-nl36` | **619.44M** | Base UL2 with 36 layers | Gated commercial |
| `pubmedul2-base-nl36` | **619.44M** | Base UL2 with 36 layers | Gated commercial |
| `medul2-large` | **737.72M** | Large UL2/T5-style model | Gated non-commercial |
| `pubmedul2-large` | **737.72M** | Large UL2/T5-style model | Gated non-commercial |
| `medul2-large-nl36` | **1090.14M** | Very large UL2 with 36 layers | Access on request |

---

## Named Entity Recognition (NER) Evaluation

We evaluate PubMedUL2 and MedUL2 models on a biomedical **Named Entity Recognition (NER)** task using multiple matching criteria to better capture boundary-level performance.

The evaluation reports **entity-level F1 scores** across different biomedical entity types and model sizes.

### Exact Match F1

An entity prediction is considered correct only if both the **entity span and label exactly match** the gold annotation.

| entity_type | medul2-base | pubmedul2-base | pubmedul2-mini-nl8 | pubmedul2-small | pubmedul2-tiny-nl6 |
|:------------|------------:|---------------:|-------------------:|----------------:|-------------------:|
| cell_line | 0.42 | 0.43 | 0.44 | 0.43 | 0.35 |
| cell_type | 0.59 | 0.58 | 0.59 | 0.58 | 0.52 |
| chemical | 0.76 | 0.75 | 0.72 | 0.72 | 0.56 |
| disease | 0.70 | 0.73 | 0.70 | 0.68 | 0.63 |
| dna | 0.59 | 0.55 | 0.54 | 0.55 | 0.45 |
| gene | 0.62 | 0.59 | 0.60 | 0.59 | 0.55 |
| protein | 0.59 | 0.58 | 0.58 | 0.59 | 0.55 |
| rna | 0.60 | 0.56 | 0.55 | 0.60 | 0.56 |
| species | 0.66 | 0.67 | 0.58 | 0.63 | 0.54 |

---

### Partial Match F1

A prediction is counted as correct if it **partially overlaps** with a gold entity of the same type.

| entity_type | medul2-base | pubmedul2-base | pubmedul2-mini-nl8 | pubmedul2-small | pubmedul2-tiny-nl6 |
|:------------|------------:|---------------:|-------------------:|----------------:|-------------------:|
| cell_line | 0.48 | 0.49 | 0.48 | 0.48 | 0.41 |
| cell_type | 0.66 | 0.64 | 0.66 | 0.65 | 0.59 |
| chemical | 0.79 | 0.78 | 0.76 | 0.75 | 0.60 |
| disease | 0.82 | 0.84 | 0.80 | 0.79 | 0.74 |
| dna | 0.65 | 0.61 | 0.60 | 0.61 | 0.53 |
| gene | 0.76 | 0.74 | 0.74 | 0.73 | 0.68 |
| protein | 0.66 | 0.66 | 0.66 | 0.67 | 0.64 |
| rna | 0.68 | 0.63 | 0.64 | 0.66 | 0.65 |
| species | 0.68 | 0.70 | 0.61 | 0.65 | 0.56 |

---

### IoU Match F1

Predictions are evaluated using **Intersection-over-Union (IoU)** overlap between predicted and gold spans, providing a softer boundary-based metric (see the sketch after the table).

| entity_type | medul2-base | pubmedul2-base | pubmedul2-mini-nl8 | pubmedul2-small | pubmedul2-tiny-nl6 |
|:------------|------------:|---------------:|-------------------:|----------------:|-------------------:|
| cell_line | 0.50 | 0.50 | 0.50 | 0.50 | 0.42 |
| cell_type | 0.67 | 0.66 | 0.68 | 0.67 | 0.62 |
| chemical | 0.83 | 0.83 | 0.82 | 0.82 | 0.72 |
| disease | 0.85 | 0.86 | 0.86 | 0.85 | 0.82 |
| dna | 0.65 | 0.62 | 0.62 | 0.62 | 0.55 |
| gene | 0.76 | 0.75 | 0.75 | 0.74 | 0.71 |
| protein | 0.67 | 0.66 | 0.67 | 0.67 | 0.66 |
| rna | 0.68 | 0.65 | 0.66 | 0.67 | 0.67 |
| species | 0.72 | 0.74 | 0.65 | 0.69 | 0.58 |
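
For reference, a small sketch of how the three matching criteria relate. The 0.5 IoU threshold is an assumption; this card does not state the threshold used in the evaluation.

```python
def iou(a, b):
    """Intersection-over-union of two half-open token spans (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

def matches(pred, gold, criterion="exact", iou_threshold=0.5):
    """Decide whether a predicted span counts as correct under a criterion
    (assumes the entity types already agree)."""
    if criterion == "exact":
        return pred == gold           # identical boundaries required
    if criterion == "partial":
        return iou(pred, gold) > 0.0  # any overlap counts
    return iou(pred, gold) >= iou_threshold  # "iou" criterion (assumed 0.5)

pred, gold = (3, 7), (4, 7)               # token spans [3,7) vs [4,7)
print(matches(pred, gold, "exact"))       # False
print(matches(pred, gold, "partial"))     # True
print(matches(pred, gold, "iou"))         # True (IoU = 3/4)
```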

---

### Observations

- **MedUL2 models** generally edge out PubMedUL2 on entity types such as *chemical*, *dna*, and *rna*, while PubMedUL2 leads on *disease* and *species*
- Performance improves consistently from **tiny → base** models
- Relaxed matching criteria (Partial / IoU) show significantly higher scores than Exact Match, highlighting boundary ambiguity in biomedical NER

---

## Acknowledgements

This project would not have been possible without compute generously provided by the **Google TPU Research Cloud**.

Thanks to:
- The **Finnish-NLP** authors for releasing the UL2 objective code, task definitions, and guidance
- **Yeb Havinga** for help getting started with the **t5x** framework

---

## License

Please refer to the individual model repositories for **license and access details**, which may vary depending on training data sources.