	---
	language:
	- rna
	library_name: transformers
	tags:
	- RNA
	- language-model
	- UTR
	- genomics
	- biology
	license: gpl-3.0
	---

	# UTR-LM-MLMSS

	UTR-LM is a 5' UTR RNA language model based on ESM2, pretrained on endogenous 5' UTRs from five species and a large synthetic library. This checkpoint (`UTR-LM-MLMSS`) was trained with masked language modeling (MLM) plus secondary structure prediction as a supervised auxiliary objective.

	## Architecture

	| Parameter | Value |
	|---|---|
	| Layers | 6 |
	| Attention heads | 16 |
	| Embedding dimension | 128 |
	| Vocabulary size | 10 |
	| Positional encoding | Rotary (RoPE) |
	| Architecture | ESM2-style pre-LN Transformer |

	Vocabulary: `<pad>` (0), `<eos>` (1), `<unk>` (2), `A` (3), `G` (4), `C` (5), `T` (6), `<cls>` (7), `<mask>` (8), `<sep>` (9)
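
To make the ID layout concrete, here is a minimal sketch that maps a nucleotide string to IDs using only the table above. The function name `encode_nucleotides` is hypothetical, and the U-to-T rewrite is an assumption motivated by the vocabulary having no `U` token; whether the shipped tokenizer does the same, and which special tokens it adds, is not stated here.

```python
# Token-to-ID table copied from the vocabulary listed above.
VOCAB = {
    "<pad>": 0, "<eos>": 1, "<unk>": 2,
    "A": 3, "G": 4, "C": 5, "T": 6,
    "<cls>": 7, "<mask>": 8, "<sep>": 9,
}


def encode_nucleotides(seq: str) -> list[int]:
    """Map each nucleotide to its ID; unrecognised characters become <unk>.

    Assumption: U is rewritten to T because the vocabulary has no U token.
    No special tokens (<cls>, <eos>) are added here.
    """
    seq = seq.upper().replace("U", "T")
    return [VOCAB.get(ch, VOCAB["<unk>"]) for ch in seq]


print(encode_nucleotides("AUGCN"))  # [3, 6, 4, 5, 2]
```

For real inputs, prefer the tokenizer loaded with `AutoTokenizer`; this sketch is only meant to illustrate the mapping.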

	## Pretraining

	- Objective: Masked language modeling + per-token secondary structure prediction (3-class: unpaired, stem, loop)
	- Data: Endogenous 5' UTRs from five species (human, mouse, zebrafish, Drosophila, yeast) combined with the Cao et al. random 5' UTR synthetic library
	- Source checkpoint: `ESM2SS_FS4.1_fiveSpeciesCao_6layers_16heads_128embedsize_4096batchToks_lr1e-05_structureweight1.0_MLMLossMin_epoch200.pkl`

	Only one `ESM2SS` checkpoint (secondary structure only, no MFE regression) was available, so no checkpoint selection was needed.
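
The objective listed above combines two per-token losses. One plausible way to write it, assuming a simple weighted sum, is shown below. The `structureweight1.0` fragment of the checkpoint name suggests a weight of 1.0, but the exact form used in training is not stated here, and the symbols are introduced only for illustration.

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{MLM}} \;+\; \lambda_{\mathrm{SS}}\,\mathcal{L}_{\mathrm{SS}},
\qquad
\mathcal{L}_{\mathrm{SS}} \;=\; -\frac{1}{N}\sum_{i=1}^{N} \log p_\theta\!\left(s_i \mid x\right),
\qquad
\lambda_{\mathrm{SS}} = 1.0
```

Here $x$ is the input sequence, $s_i \in \{\text{unpaired}, \text{stem}, \text{loop}\}$ is the structure label at position $i$, and $N$ is the number of labeled positions.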

	## Parity Verification

	Hidden-state representations produced by this HF model match the original ESM2-based implementation exactly (maximum absolute difference of 0.00) at all 7 representation levels (initial embedding + 6 transformer layers). Verified on GPU with PyTorch 2.8 / CUDA 12.6.
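
A check of this kind can be sketched as follows. This is not the verification script used for the port. The helper name `max_abs_diff` is hypothetical, and random tensors stand in for the hidden states of the two implementations.

```python
import torch


def max_abs_diff(reference, ported):
    """Return the largest element-wise absolute difference across paired tensors."""
    if len(reference) != len(ported):
        raise ValueError("both implementations must return the same number of levels")
    return max((r - p).abs().max().item() for r, p in zip(reference, ported))


# Stand-in data: 7 levels (embedding + 6 layers), batch 2, length 12, dim 128.
torch.manual_seed(0)
reference = [torch.randn(2, 12, 128) for _ in range(7)]
ported = [t.clone() for t in reference]

print(max_abs_diff(reference, ported))  # 0.0
```

In a real comparison, `ported` would come from `output_hidden_states=True` on the HF model and `reference` from the original implementation run on the same inputs.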

	## Related Models

	See the full [UTR-LM collection](https://huggingface.co/collections/Taykhoom/utr-lm-6a173a96ae7c070c3a84ebb4).

	| Model | Pretraining Objective | Notes |
	|---|---|---|
	| [UTR-LM-MLM](https://huggingface.co/Taykhoom/UTR-LM-MLM) | MLM | Base model |
	| [UTR-LM-MLMSI](https://huggingface.co/Taykhoom/UTR-LM-MLMSI) | MLM + MFE regression | Recommended for TE / EL tasks |
	| [UTR-LM-MLMSS](https://huggingface.co/Taykhoom/UTR-LM-MLMSS) | MLM + secondary structure | This model |
	| [UTR-LM-MLMSISS](https://huggingface.co/Taykhoom/UTR-LM-MLMSISS) | MLM + MFE + secondary structure | Recommended for MRL tasks |

	## Usage

	### Embedding generation

	```python
	import torch
	from transformers import AutoTokenizer, AutoModel

	tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLMSS", trust_remote_code=True)
	model = AutoModel.from_pretrained("Taykhoom/UTR-LM-MLMSS", trust_remote_code=True)
	model.eval()

	sequences = ["ATGCATGCATGC", "GCTAGCTAGCTAGCTA"]
	enc = tokenizer(sequences, return_tensors="pt", padding=True)

	with torch.no_grad():
	    out = model(**enc)

	# CLS token embedding (position 0) - recommended for sequence-level tasks
	cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 128)

	# All-token embeddings
	token_emb = out.last_hidden_state  # (batch, seq_len, 128)

	# Intermediate layer representations (index 0 is the initial embedding)
	with torch.no_grad():
	    out_all = model(**enc, output_hidden_states=True)
	layer3_emb = out_all.hidden_states[3]  # after layer 3, shape (batch, seq_len, 128)
	```

	### MLM logits

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForMaskedLM

	tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLMSS", trust_remote_code=True)
	model = AutoModelForMaskedLM.from_pretrained("Taykhoom/UTR-LM-MLMSS", trust_remote_code=True)
	model.eval()

	enc = tokenizer(["ATGC<mask>ATGC"], return_tensors="pt")
	with torch.no_grad():
	    logits = model(**enc).logits  # (1, seq_len, 10)
	```

	### Fine-tuning

	The model follows standard HF conventions and can be fine-tuned with any Trainer-compatible setup. For sequence regression tasks, use the CLS token embedding as input to a prediction head (as done in the original UTR-LM paper).
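
As a rough illustration of the CLS-based setup described above, the sketch below defines a small regression head and runs one backward pass. `ClsRegressionHead`, its hidden size, dropout rate, and the MSE loss are hypothetical choices for illustration, not the head used in the UTR-LM paper. Random tensors stand in for the CLS embeddings and labels so the block runs without downloading the model.

```python
import torch
from torch import nn


class ClsRegressionHead(nn.Module):
    """Hypothetical MLP head mapping a CLS embedding to one scalar per sequence."""

    def __init__(self, embed_dim: int = 128, hidden_dim: int = 64, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, cls_emb: torch.Tensor) -> torch.Tensor:
        # (batch, embed_dim) -> (batch,)
        return self.net(cls_emb).squeeze(-1)


torch.manual_seed(0)
head = ClsRegressionHead()
cls_emb = torch.randn(4, 128)  # stand-in for out.last_hidden_state[:, 0, :]
targets = torch.randn(4)       # stand-in regression labels

preds = head(cls_emb)
loss = nn.functional.mse_loss(preds, targets)
loss.backward()  # gradients now populate the head's parameters
```

In an actual fine-tuning run, `cls_emb` would be computed by the backbone with gradients enabled (or frozen, if only the head is trained), and the loop would be driven by an optimizer or `Trainer`.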

	## Citation

	```bibtex
	@article{chu2024utrlm,
	title = {A 5' UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions},
	author = {Chu, Yanyi and Yu, Dan and Li, Yupeng and Huang, Kaixuan and Shen, Yue and Cong, Le and Zhang, Jason and Wang, Mengdi},
	journal = {Nature Machine Intelligence},
	volume = {6},
	number = {4},
	pages = {449--460},
	year = {2024},
	doi = {10.1038/s42256-024-00823-9}
	}
	```

	## Implementation Notes

	The original UTR-LM implementation uses standard scaled dot-product attention. This HF port adds support for `attn_implementation="sdpa"` (PyTorch `F.scaled_dot_product_attention`) and `attn_implementation="flash_attention_2"` (requires `pip install flash-attn --no-build-isolation`), which were not part of the original codebase.
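
Selecting one of these backends at load time might look like the sketch below. The helper names `pick_attn_implementation` and `load_model` are hypothetical. The sketch defaults to `sdpa` because FlashAttention-2 in Transformers additionally needs a CUDA GPU and a half-precision dtype, which this helper does not check.

```python
import importlib.util


def pick_attn_implementation(prefer_flash: bool = False) -> str:
    """Return "flash_attention_2" only if requested and flash_attn is importable."""
    if prefer_flash and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


def load_model(repo_id: str = "Taykhoom/UTR-LM-MLMSS", prefer_flash: bool = False):
    """Load the model with the chosen attention backend (downloads weights)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModel

    return AutoModel.from_pretrained(
        repo_id,
        trust_remote_code=True,
        attn_implementation=pick_attn_implementation(prefer_flash),
    )


print(pick_attn_implementation())  # sdpa
```

Calling `load_model()` fetches the checkpoint from the Hub, so it is left uncalled here.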

	## Credits

	Original model and code by Yanyi Chu et al. (Stanford). Source code: [UTR-LM GitHub repository](https://github.com/a96123155/UTR-LM). The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code) and reviewed manually by Taykhoom Dalal.

	## License

	GPL-3.0, following the original UTR-LM repository.