	---
	language:
	- en
	license: apache-2.0
	tags:
	- biomedical
	- clinical
	- ul2
	- t5
	- encoder-decoder
	- pretraining
	- text2text-generation
	- medical
	---

	# PubMedUL2 & MedUL2

	## Model Description

	PubMedUL2 and MedUL2 are a family of domain-specific UL2/T5-style encoder–decoder language models pretrained on large-scale biomedical and medical corpora using the UL2 (Mixture-of-Denoisers) objective.

	- PubMedUL2 models are pretrained on 25 million PubMed abstracts
	- MedUL2 models are pretrained on PubMed abstracts + clinical notes + additional medical documents
	- All models use a T5-efficient architecture, inspired by Google’s efficient T5 variants

	These checkpoints are pretraining-only models and must be fine-tuned before use on downstream tasks.

	---

	## Pretraining Objective: UL2 (Mixture-of-Denoisers)

	These models were pretrained using UL2, a unified framework that formulates language modeling objectives as denoising tasks.

	UL2 introduces a Mixture-of-Denoisers (MoD) approach that samples from multiple denoising paradigms during pretraining.

	### Denoising Tasks

	UL2 pretraining uses a mixture of three denoising tasks:

	1. R-denoising (Regular Span Corruption)
	   - Equivalent to standard T5 span corruption
	   - Optimized for language understanding tasks

	2. X-denoising (Extreme Span Corruption)
	   - Uses very large masked spans
	   - Encourages long-form generation and abstraction

	3. S-denoising (Sequential / PrefixLM)
	   - Prefix language modeling similar to causal LM
	   - Suitable for sequence-to-sequence and generative tasks
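The three denoisers differ mainly in how the input is split into a corrupted source and a target. The following is a minimal sketch on whitespace tokens, not the pretraining code used for these checkpoints. The `<extra_id_N>` sentinel format follows the usual T5 convention and is an assumption here, as are the span positions and the helper names `span_corrupt` and `prefix_lm_split`.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel and return (inputs, targets).

    Spans are end-exclusive and must not overlap. The closing sentinel that T5
    normally appends to the targets is omitted to keep the sketch short.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"  # assumed T5-style sentinel name
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    return inputs, targets


def prefix_lm_split(tokens, prefix_len):
    """S-denoising style split: condition on a prefix, predict the continuation."""
    return tokens[:prefix_len], tokens[prefix_len:]


tokens = "aspirin reduces the risk of myocardial infarction in adults".split()

# R-denoising style: a few short masked spans.
r_inputs, r_targets = span_corrupt(tokens, [(1, 2), (5, 7)])

# X-denoising style: one very long masked span.
x_inputs, x_targets = span_corrupt(tokens, [(1, 8)])

# S-denoising style: prefix in, continuation out.
s_inputs, s_targets = prefix_lm_split(tokens, 4)
```

In the R-style example the model sees most of the sentence and reconstructs two short gaps, while in the X-style example it must regenerate almost the whole sentence from two remaining words.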

	### Paradigm Tokens (Mode Switching)

	During pretraining, a paradigm token is inserted at the beginning of each input:

	| Token | Mode | Recommended Use |
	|------|------|------------------|
	| `[NLU]` | R-denoising | Classification, QA, retrieval |
	| `[NLG]` | X-denoising | Mixed understanding & generation |
	| `[S2S]` | S-denoising | Generative / causal tasks |

	Important:
	For best performance, prepend the paradigm token that matches the task, and use the same token during both fine-tuning and inference.
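A small helper can keep the paradigm token consistent between fine-tuning and inference. This is a sketch only: the helper name `add_paradigm_token`, the single-letter mode keys, and the single space used as separator are assumptions, not something these checkpoints prescribe.

```python
# Mode letter -> paradigm token, taken from the table above.
PARADIGM_TOKENS = {"R": "[NLU]", "X": "[NLG]", "S": "[S2S]"}


def add_paradigm_token(text: str, mode: str) -> str:
    """Prepend the paradigm token for `mode` unless the text already starts with it."""
    token = PARADIGM_TOKENS[mode]  # raises KeyError for an unknown mode
    if text.startswith(token):
        return text
    return f"{token} {text}"  # separator (one space) is an assumption


example = add_paradigm_token("Which gene is mutated in cystic fibrosis?", "R")
```

Routing every training and inference input through the same function avoids the case where a model is fine-tuned with one token and queried with another.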

	---

	## Architecture

	- Encoder–decoder Transformer (T5-style)
	- Uses T5-efficient architecture
	- Compatible with Hugging Face `T5ForConditionalGeneration`
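Because the checkpoints are compatible with `T5ForConditionalGeneration`, loading one should look roughly like the sketch below. It assumes the repository also ships a tokenizer that `AutoTokenizer` can load; the function name `load_checkpoint` is hypothetical, and the repository id must be supplied by the caller since the hosting namespace is not stated here.

```python
def load_checkpoint(repo_id: str):
    """Load the tokenizer and model for one of the checkpoints.

    `repo_id` is the full Hugging Face Hub id of the model repository.
    """
    # Imported here so the heavy dependency is only required when the function runs.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = T5ForConditionalGeneration.from_pretrained(repo_id)
    return tokenizer, model
```

A call would then be `tokenizer, model = load_checkpoint("<namespace>/medul2-base")`, with `<namespace>` replaced by the actual owner of the repository.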

	---

	## Intended Uses

	These models are intended to be fine-tuned for:

	- Biomedical and clinical text classification
	- Question answering
	- Summarization of medical literature or clinical notes
	- Text generation in medical contexts

	---

	## Limitations

	- ❌ Not instruction-tuned
	- ❌ No supervised training
	- ❌ Not suitable for zero-shot use

	These checkpoints are self-supervised pretraining models only and require task-specific fine-tuning.

	---

	## Fine-Tuning Recommendations

	- Avoid mixed precision (fp16 / bf16) initially
	- Fine-tuning is more stable in fp32
	- Always prepend one of `[NLU]`, `[NLG]`, or `[S2S]` to input text
	- Suggested defaults:
	  - Classification / QA → `[NLU]`
	  - Causal or generative tasks → `[S2S]`
	  - Mixed tasks → `[NLG]`
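These recommendations can be captured as plain configuration. The sketch below assumes fine-tuning with the Hugging Face `Trainer`, whose `TrainingArguments` accept `fp16` and `bf16` flags; the task names used as dictionary keys and the helper `default_token` are hypothetical labels chosen for illustration.

```python
# Keyword arguments intended for transformers.TrainingArguments
# (or Seq2SeqTrainingArguments): keep both mixed-precision modes off.
FP32_TRAINING_KWARGS = {"fp16": False, "bf16": False}

# Suggested default paradigm token per task family (keys are illustrative).
DEFAULT_TOKEN_BY_TASK = {
    "classification": "[NLU]",
    "qa": "[NLU]",
    "generation": "[S2S]",
    "mixed": "[NLG]",
}


def default_token(task: str) -> str:
    """Return the suggested paradigm token for a task family."""
    try:
        return DEFAULT_TOKEN_BY_TASK[task]
    except KeyError:
        known = ", ".join(sorted(DEFAULT_TOKEN_BY_TASK))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}") from None
```

The dictionary can be unpacked into the arguments object, for example `TrainingArguments(output_dir=..., **FP32_TRAINING_KWARGS)`, and mixed precision can be revisited once an fp32 run is known to be stable.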

	---

	## Model Parameter Summary

	| Model Name | Parameter Count | Description | Access |
	|-----------|----------------|------------|------------|
	| `pubmedul2-tiny-nl6` | 19.26M | Tiny UL2-style model with 6 layers | Open |
	| `pubmedul2-mini-nl8` | 50.12M | Mini UL2 with 8 layers | Open |
	| `pubmedul2-small` | 60.52M | Small UL2 variant | Open |
	| `pubmedul2-small-nl24` | 192.73M | Small UL2 with 24 layers | Open |
	| `medul2-base` | 222.93M | Base UL2/T5-style model | Open |
	| `pubmedul2-base` | 222.93M | Base UL2/T5-style model | Open |
	| `medul2-base-nl36` | 619.44M | Base UL2 with 36 layers | Gated commercial |
	| `pubmedul2-base-nl36` | 619.44M | Base UL2 with 36 layers | Gated commercial |
	| `medul2-large` | 737.72M | Large UL2/T5-style model | Gated non-commercial |
	| `pubmedul2-large` | 737.72M | Large UL2/T5-style model | Gated non-commercial |
	| `medul2-large-nl36` | 1090.14M | Very large UL2 with 36 layers | Access on Request |
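To compare a loaded checkpoint against the table, a parameter count in millions can be computed as sketched below. The helper name `count_parameters_millions` is hypothetical, and it is an assumption that the table's figures were obtained this way (for example with tied embeddings counted once, as PyTorch's `parameters()` does).

```python
def count_parameters_millions(model) -> float:
    """Total parameter count in millions for a PyTorch-style model.

    Works with any object exposing `parameters()` whose items have `numel()`.
    """
    total = sum(p.numel() for p in model.parameters())
    return round(total / 1e6, 2)
```

Small differences from the listed values are possible if the counting convention differs.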

	---

	## Named Entity Recognition (NER) Evaluation

	We evaluate PubMedUL2 and MedUL2 models on a biomedical Named Entity Recognition (NER) task using multiple matching criteria to better capture boundary-level performance.

	The evaluation reports entity-level F1 scores across different biomedical entity types and model sizes.

	### Exact Match F1

	An entity prediction is considered correct only if both the entity span and label exactly match the gold annotation.

	| entity_type | medul2-base | pubmedul2-base | pubmedul2-mini-nl8 | pubmedul2-small | pubmedul2-tiny-nl6 |
	|:--------------|--------------:|-----------------:|---------------------:|------------------:|---------------------:|
	| cell_line | 0.42 | 0.43 | 0.44 | 0.43 | 0.35 |
	| cell_type | 0.59 | 0.58 | 0.59 | 0.58 | 0.52 |
	| chemical | 0.76 | 0.75 | 0.72 | 0.72 | 0.56 |
	| disease | 0.70 | 0.73 | 0.70 | 0.68 | 0.63 |
	| dna | 0.59 | 0.55 | 0.54 | 0.55 | 0.45 |
	| gene | 0.62 | 0.59 | 0.60 | 0.59 | 0.55 |
	| protein | 0.59 | 0.58 | 0.58 | 0.59 | 0.55 |
	| rna | 0.60 | 0.56 | 0.55 | 0.60 | 0.56 |
	| species | 0.66 | 0.67 | 0.58 | 0.63 | 0.54 |
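The exact-match criterion can be illustrated with a short sketch. It assumes entities are represented as `(start, end, label)` tuples with end-exclusive character offsets, which is an illustrative choice and not necessarily the format used in this evaluation; the offsets below are made up.

```python
def exact_match_f1(gold, pred):
    """Entity-level F1 where a prediction counts only if span and label both match."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for one sentence.
gold = [(0, 7, "chemical"), (23, 44, "disease")]
pred = [(0, 7, "chemical"), (34, 44, "disease")]  # second span is too short

score = exact_match_f1(gold, pred)
```

Here the truncated disease span earns no credit, so one of two predictions is correct and one of two gold entities is found.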

	---

	### Partial Match F1

	A prediction is counted as correct if it partially overlaps with a gold entity of the same type.

	| entity_type | medul2-base | pubmedul2-base | pubmedul2-mini-nl8 | pubmedul2-small | pubmedul2-tiny-nl6 |
	|:--------------|--------------:|-----------------:|---------------------:|------------------:|---------------------:|
	| cell_line | 0.48 | 0.49 | 0.48 | 0.48 | 0.41 |
	| cell_type | 0.66 | 0.64 | 0.66 | 0.65 | 0.59 |
	| chemical | 0.79 | 0.78 | 0.76 | 0.75 | 0.60 |
	| disease | 0.82 | 0.84 | 0.80 | 0.79 | 0.74 |
	| dna | 0.65 | 0.61 | 0.60 | 0.61 | 0.53 |
	| gene | 0.76 | 0.74 | 0.74 | 0.73 | 0.68 |
	| protein | 0.66 | 0.66 | 0.66 | 0.67 | 0.64 |
	| rna | 0.68 | 0.63 | 0.64 | 0.66 | 0.65 |
	| species | 0.68 | 0.70 | 0.61 | 0.65 | 0.56 |
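One way to turn the partial-match criterion into an F1 score is sketched below. The `(start, end, label)` representation with end-exclusive offsets and the way matches are aggregated into precision and recall are assumptions; the exact counting rules behind the table are not specified here.

```python
def overlaps(a, b):
    """True if two (start, end, label) spans share a label and at least one character."""
    return a[2] == b[2] and max(a[0], b[0]) < min(a[1], b[1])


def partial_match_f1(gold, pred):
    """F1 where any same-label overlap between a prediction and a gold span counts."""
    matched_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
    matched_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: one truncated span and one spurious prediction.
gold = [(0, 7, "chemical"), (23, 44, "disease")]
pred = [(0, 7, "chemical"), (34, 44, "disease"), (50, 55, "gene")]

score = partial_match_f1(gold, pred)
```

The truncated disease span now counts as a hit, so only the spurious gene prediction lowers precision.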

	---

	### IoU Match F1

	Predictions are evaluated using Intersection-over-Union (IoU) overlap between predicted and gold spans, providing a softer boundary-based metric.

	| entity_type | medul2-base | pubmedul2-base | pubmedul2-mini-nl8 | pubmedul2-small | pubmedul2-tiny-nl6 |
	|:--------------|--------------:|-----------------:|---------------------:|------------------:|---------------------:|
	| cell_line | 0.50 | 0.50 | 0.50 | 0.50 | 0.42 |
	| cell_type | 0.67 | 0.66 | 0.68 | 0.67 | 0.62 |
	| chemical | 0.83 | 0.83 | 0.82 | 0.82 | 0.72 |
	| disease | 0.85 | 0.86 | 0.86 | 0.85 | 0.82 |
	| dna | 0.65 | 0.62 | 0.62 | 0.62 | 0.55 |
	| gene | 0.76 | 0.75 | 0.75 | 0.74 | 0.71 |
	| protein | 0.67 | 0.66 | 0.67 | 0.67 | 0.66 |
	| rna | 0.68 | 0.65 | 0.66 | 0.67 | 0.67 |
	| species | 0.72 | 0.74 | 0.65 | 0.69 | 0.58 |
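For reference, span-level IoU can be computed as in the sketch below. Spans are assumed to be end-exclusive character offsets, and the `threshold` in `is_iou_match` is a hypothetical parameter: how IoU values were converted into the F1 scores above (a fixed threshold or a soft score) is not stated here.

```python
def span_iou(a, b):
    """Intersection-over-Union of two (start, end) spans with end-exclusive offsets."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union else 0.0


def is_iou_match(gold, pred, threshold=0.5):
    """True if labels agree and span IoU reaches `threshold` (an assumed cut-off)."""
    return gold[2] == pred[2] and span_iou(gold[:2], pred[:2]) >= threshold


gold = (23, 44, "disease")
loose_pred = (34, 44, "disease")   # covers 10 of 21 gold characters
close_pred = (25, 44, "disease")   # covers 19 of 21 gold characters

loose_iou = span_iou(gold[:2], loose_pred[:2])
close_iou = span_iou(gold[:2], close_pred[:2])
```

Unlike the partial-match rule, IoU distinguishes a prediction that covers most of the gold span from one that barely touches it.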

	---

	### Observations

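	- At base size, `medul2-base` scores higher than `pubmedul2-base` on dna, gene, and rna under all three criteria (and marginally on chemical), while `pubmedul2-base` scores higher on disease and species
	- Scores generally improve from tiny to base, but not monotonically: `pubmedul2-tiny-nl6` is the weakest model on most entity types, while `pubmedul2-mini-nl8` matches or exceeds `pubmedul2-small` on several entity types under Exact Match (for example cell_line, cell_type, disease, and gene)
	- Boundary-tolerant metrics (Partial / IoU) yield noticeably higher scores than Exact Match (for example, disease with `medul2-base`: 0.70 exact vs 0.82 partial and 0.85 IoU), suggesting that many errors are boundary mismatches rather than missed entities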

	---

	## Acknowledgements

	This project would not have been possible without compute generously provided by Google TPU Research Cloud.

	Thanks to:
	- The Finnish-NLP authors for releasing the UL2 objective code, task definitions, and guidance
	- Yeb Havinga for help getting started with the t5x framework

	---

	## License

	Please refer to the individual model repositories for license and access details, which may vary depending on training data sources.