---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
- UTR
- genomics
- biology
license: gpl-3.0
---
# UTR-LM-MLMSI
UTR-LM is a 5' UTR RNA language model based on ESM2, pretrained on endogenous 5' UTRs from five species and a large synthetic library. This checkpoint (`UTR-LM-MLMSI`) was trained with **MLM + minimum free energy (MFE) regression** as a supervised auxiliary objective.
## Architecture
| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 128 |
| Vocabulary size | 10 |
| Positional encoding | Rotary (RoPE) |
| Architecture | ESM2-style pre-LN Transformer |
**Vocabulary:** `<pad>` (0), `<eos>` (1), `<unk>` (2), `A` (3), `G` (4), `C` (5), `T` (6), `<cls>` (7), `<mask>` (8), `<sep>` (9)
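The mapping above can be written out as a plain dictionary. The sketch below is illustrative only: `VOCAB` and `encode_bases` are hypothetical names, and the helper handles bare nucleotide characters without adding special tokens such as `<cls>` or `<eos>`, whose placement is decided by the released tokenizer rather than by this snippet.

```python
# Token-to-ID table copied from the vocabulary listed above.
VOCAB = {
    "<pad>": 0, "<eos>": 1, "<unk>": 2,
    "A": 3, "G": 4, "C": 5, "T": 6,
    "<cls>": 7, "<mask>": 8, "<sep>": 9,
}


def encode_bases(seq: str) -> list[int]:
    """Map each nucleotide character to its ID; unknown characters become <unk>.

    Hypothetical helper for illustration; it adds no special tokens.
    """
    return [VOCAB.get(base, VOCAB["<unk>"]) for base in seq.upper()]


print(encode_bases("AGCT"))  # [3, 4, 5, 6]
```

Because the vocabulary contains `T` but not `U`, a sequence written with `U` would fall through to `<unk>` in this sketch.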
## Pretraining
- **Objective:** Masked language modeling + MFE (minimum free energy) regression
- **Data:** Endogenous 5' UTRs from five species (human, mouse, zebrafish, *Drosophila*, yeast) combined with the Cao et al. random 5' UTR synthetic library
- **Source checkpoint:** `ESM2SI_3.1_fiveSpeciesCao_6layers_16heads_128embedsize_4096batchToks_MLMLossMin.pkl`
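A joint objective of this kind can be sketched as a masked-token cross-entropy plus a regression term on the predicted MFE. This is a minimal illustration, not the original training code: the use of mean squared error, the `mfe_weight` hyperparameter, and the function name `joint_loss` are all assumptions.

```python
import torch
import torch.nn.functional as F


def joint_loss(mlm_logits, mlm_labels, mfe_pred, mfe_target, mfe_weight=1.0):
    """Cross-entropy on masked tokens plus a weighted MSE on the MFE prediction.

    mlm_logits: (batch, seq_len, vocab); mlm_labels: (batch, seq_len) with -100
    at positions that were not masked. mfe_pred / mfe_target: (batch,).
    mfe_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    mlm = F.cross_entropy(
        mlm_logits.reshape(-1, mlm_logits.size(-1)),
        mlm_labels.reshape(-1),
        ignore_index=-100,  # skip unmasked positions
    )
    mfe = F.mse_loss(mfe_pred, mfe_target)
    return mlm + mfe_weight * mfe


# Toy tensors using this model's vocabulary size (10).
torch.manual_seed(0)
logits = torch.randn(2, 6, 10)
labels = torch.full((2, 6), -100)
labels[0, 2] = 3  # one masked position per sequence
labels[1, 4] = 5
loss = joint_loss(logits, labels, torch.randn(2), torch.randn(2))
print(loss.item())
```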
### Checkpoint selection
Multiple `ESM2SI` checkpoints were available (versions 3.1, FS4.1, FS4.4, FS4.7). The `3.1` checkpoint was selected because it is the version specified in the original UTR-LM paper for translation efficiency (TE) and expression level (EL) downstream tasks (used in the `MJ4_Finetune` evaluation scripts). The FS4.x variants are later training runs but were not the ones reported in the original publication.
## Parity Verification
Hidden-state representations produced by this HF model are verified to be **exactly identical** (max absolute difference = 0.00) to the original ESM2-based implementation at all 7 representation levels (initial embedding + 6 transformer layers). Verified on GPU with PyTorch 2.8 / CUDA 12.6.
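A comparison of this kind can be sketched as below. The verification script itself is not part of this card, so `max_abs_diff` is a hypothetical helper and the tensors are random stand-ins for the two implementations' outputs.

```python
import torch


def max_abs_diff(hidden_a, hidden_b):
    """Return the largest element-wise absolute difference across all levels."""
    assert len(hidden_a) == len(hidden_b), "different number of levels"
    return max((a - b).abs().max().item() for a, b in zip(hidden_a, hidden_b))


# Toy stand-ins: 7 levels (embedding + 6 layers), each (batch, seq_len, 128).
torch.manual_seed(0)
reference = [torch.randn(2, 12, 128) for _ in range(7)]
ported = [t.clone() for t in reference]
print(max_abs_diff(reference, ported))  # 0.0 for identical tensors
```

With the real models, `hidden_b` would be the `hidden_states` tuple returned when calling the HF model with `output_hidden_states=True`.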
## Related Models
See the full [UTR-LM collection](https://huggingface.co/collections/Taykhoom/utr-lm-6a173a96ae7c070c3a84ebb4).
| Model | Pretraining Objective | Notes |
|---|---|---|
| [UTR-LM-MLM](https://huggingface.co/Taykhoom/UTR-LM-MLM) | MLM | Base model |
| **[UTR-LM-MLMSI](https://huggingface.co/Taykhoom/UTR-LM-MLMSI)** | MLM + MFE regression | This model — recommended for TE / EL tasks |
| [UTR-LM-MLMSS](https://huggingface.co/Taykhoom/UTR-LM-MLMSS) | MLM + secondary structure | — |
| [UTR-LM-MLMSISS](https://huggingface.co/Taykhoom/UTR-LM-MLMSISS) | MLM + MFE + secondary structure | Recommended for MRL tasks |
## Usage
### Embedding generation
```python
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLMSI", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/UTR-LM-MLMSI", trust_remote_code=True)
model.eval()
sequences = ["ATGCATGCATGC", "GCTAGCTAGCTAGCTA"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**enc)
# CLS token embedding (position 0), recommended for sequence-level tasks
cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 128)
# All-token embeddings
token_emb = out.last_hidden_state  # (batch, seq_len, 128)
# Intermediate layer representations (hidden_states[0] is the initial embedding)
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer3_emb = out_all.hidden_states[3]  # after layer 3, shape (batch, seq_len, 128)
```
### MLM logits
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLMSI", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/UTR-LM-MLMSI", trust_remote_code=True)
model.eval()
enc = tokenizer(["ATGC<mask>ATGC"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, 10)
```
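To turn these logits into a base prediction, one option is to take the argmax at each `<mask>` position. The sketch below assumes the token IDs from the vocabulary table (`<mask>` = 8) and uses synthetic logits so that it runs without downloading the model. `predict_masked` is a hypothetical helper, and the special-token layout of the toy input is illustrative only.

```python
import torch

# Index i holds the token with ID i, following the vocabulary table.
ID_TO_TOKEN = ["<pad>", "<eos>", "<unk>", "A", "G", "C", "T", "<cls>", "<mask>", "<sep>"]
MASK_ID = 8


def predict_masked(input_ids, logits):
    """Return (position, token) for the top-scoring token at each <mask>.

    input_ids: (seq_len,) token IDs; logits: (seq_len, 10) MLM scores.
    """
    positions = (input_ids == MASK_ID).nonzero(as_tuple=True)[0]
    best = logits[positions].argmax(dim=-1)
    return [(int(p), ID_TO_TOKEN[int(t)]) for p, t in zip(positions, best)]


# Synthetic example: position 2 is masked and the scores favour "G" (ID 4).
ids = torch.tensor([7, 3, 8, 5, 1])
scores = torch.zeros(5, 10)
scores[2, 4] = 5.0
print(predict_masked(ids, scores))  # [(2, 'G')]
```

With the real model output, the call would be `predict_masked(enc["input_ids"][0], logits[0])`.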
### Fine-tuning
The model follows standard HF conventions and can be fine-tuned with any Trainer-compatible setup. For sequence regression tasks, use the CLS token embedding as input to a prediction head (as done in the original UTR-LM paper).
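As a rough sketch of such a prediction head, the module below maps the CLS embedding to a single scalar. The class name `ClsRegressionHead` and the single linear layer are assumptions for illustration; the head used in the original paper may differ. A random tensor stands in for `out.last_hidden_state`.

```python
import torch
from torch import nn


class ClsRegressionHead(nn.Module):
    """Hypothetical head: maps the CLS embedding (position 0) to one scalar."""

    def __init__(self, hidden_size: int = 128):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        cls_emb = last_hidden_state[:, 0, :]   # (batch, hidden_size)
        return self.proj(cls_emb).squeeze(-1)  # (batch,)


# Random stand-in for out.last_hidden_state with shape (batch, seq_len, 128).
torch.manual_seed(0)
head = ClsRegressionHead()
hidden = torch.randn(2, 14, 128)
pred = head(hidden)
loss = nn.functional.mse_loss(pred, torch.tensor([0.5, -1.2]))
print(pred.shape, loss.item())
```

During fine-tuning, the head's parameters would be optimized together with (or on top of a frozen copy of) the encoder, depending on the task and data size.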
## Citation
```bibtex
@article{chu2024_utrlm,
title = {A 5' {UTR} Language Model for Decoding Untranslated Regions of {mRNA} and Function Predictions},
author = {Chu, Yanyi and Yu, Dan and Li, Yupeng and Huang, Kaixuan and Shen, Yue and Cong, Le and Zhang, Jason and Wang, Mengdi},
journal = {Nature Machine Intelligence},
volume = {6},
number = {4},
pages = {449--460},
year = {2024},
doi = {10.1038/s42256-024-00823-9}
}
```
## Implementation Notes
The original UTR-LM implementation uses standard scaled dot-product attention. This HF port adds support for `attn_implementation="sdpa"` (PyTorch `F.scaled_dot_product_attention`) and `attn_implementation="flash_attention_2"` (requires `pip install flash-attn --no-build-isolation`), which were not part of the original codebase.
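For orientation, `F.scaled_dot_product_attention` computes the same quantity as a hand-written `softmax(QK^T / sqrt(d)) V`. The sketch below checks this on random tensors shaped like this model's attention (16 heads, head dimension 128 / 16 = 8). It does not reproduce the port's attention module: rotary embeddings and padding masks are left out.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, heads, seq_len, head_dim = 2, 16, 12, 128 // 16
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# Hand-written scaled dot-product attention.
scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
manual = torch.softmax(scores, dim=-1) @ v

# Fused PyTorch kernel (available in PyTorch 2.0 and later).
fused = F.scaled_dot_product_attention(q, k, v)

# Compare with a tolerance, since fused kernels may round differently.
print(torch.allclose(manual, fused, atol=1e-5))
```

To select a backend when loading the model, pass `attn_implementation="sdpa"` or `attn_implementation="flash_attention_2"` to `from_pretrained` alongside `trust_remote_code=True`.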
## Credits
Original model and code by Yanyi Chu et al. (Stanford). Source code: [UTR-LM GitHub repository](https://github.com/a96123155/UTR-LM). The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code) and reviewed manually by Taykhoom Dalal.
## License
GPL-3.0, following the original UTR-LM repository.