---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
- UTR
- genomics
- biology
license: gpl-3.0
---

# UTR-LM-MLMSS

UTR-LM is a 5' UTR RNA language model based on ESM2, pretrained on endogenous 5' UTRs from five species and a large synthetic library. This checkpoint (`UTR-LM-MLMSS`) was trained with **MLM + secondary structure prediction**, where secondary structure prediction serves as a supervised auxiliary objective alongside masked language modeling.

## Architecture

| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 128 |
| Vocabulary size | 10 |
| Positional encoding | Rotary (RoPE) |
| Architecture | ESM2-style pre-LN Transformer |

**Vocabulary:** `<pad>` (0), `<eos>` (1), `<unk>` (2), `A` (3), `G` (4), `C` (5), `T` (6), `<cls>` (7), `<mask>` (8), `<sep>` (9)
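
To make the ID mapping concrete, here is a minimal sketch that encodes a sequence by hand. The `encode` helper is hypothetical and is not the tokenizer shipped with the model. It assumes ESM-style framing (`<cls>` first, `<eos>` last) and that RNA `U` should be written as `T`, since the vocabulary has no `U` token.

```python
# Token-to-ID table copied from the vocabulary listed above.
VOCAB = {
    "<pad>": 0, "<eos>": 1, "<unk>": 2,
    "A": 3, "G": 4, "C": 5, "T": 6,
    "<cls>": 7, "<mask>": 8, "<sep>": 9,
}


def encode(sequence: str) -> list[int]:
    """Hypothetical helper: map a nucleotide string to token IDs.

    Assumption: <cls> is prepended and <eos> appended, as in ESM-style
    tokenizers. 'U' is rewritten to 'T'; unknown symbols map to <unk>.
    """
    normalized = sequence.upper().replace("U", "T")
    body = [VOCAB.get(base, VOCAB["<unk>"]) for base in normalized]
    return [VOCAB["<cls>"], *body, VOCAB["<eos>"]]


ids = encode("AUGC")
print(ids)  # [7, 3, 6, 4, 5, 1]
```

In practice the real tokenizer should be used; the sketch only illustrates why sequences in the examples below are written with `T` rather than `U`.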

## Pretraining

- **Objective:** Masked language modeling + per-token secondary structure prediction (3-class: unpaired, stem, loop)
- **Data:** Endogenous 5' UTRs from five species (human, mouse, zebrafish, *Drosophila*, yeast) combined with the Cao et al. random 5' UTR synthetic library
- **Source checkpoint:** `ESM2SS_FS4.1_fiveSpeciesCao_6layers_16heads_128embedsize_4096batchToks_lr1e-05_structureweight1.0_MLMLossMin_epoch200.pkl`

Only one `ESM2SS` checkpoint (secondary structure only, no MFE regression) was available, so no checkpoint selection was needed.
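
The combined objective can be read as a weighted sum of the two losses. The form below is a sketch rather than the original training code: it assumes cross-entropy for both terms, and the weight $\lambda = 1.0$ is inferred from `structureweight1.0` in the checkpoint filename. All symbols are introduced here for illustration.

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{MLM}} + \lambda \, \mathcal{L}_{\mathrm{SS}},
\qquad \lambda = 1.0

\mathcal{L}_{\mathrm{MLM}} = -\frac{1}{|M|} \sum_{i \in M} \log p_\theta\!\left(x_i \mid \tilde{x}\right),
\qquad
\mathcal{L}_{\mathrm{SS}} = -\frac{1}{|S|} \sum_{i \in S} \log q_\theta\!\left(s_i \mid \tilde{x}\right)
```

Here $\tilde{x}$ is the masked input sequence, $M$ is the set of masked positions, $x_i$ is the true nucleotide at position $i$, $S$ is the set of positions that carry a structure label, and $s_i$ is one of the three structure classes. Whether $S$ covers all tokens or only a subset is not stated in this card.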

## Parity Verification

Hidden-state representations produced by this HF model match the original ESM2-based implementation **exactly** (max absolute difference = 0.00) at all 7 representation levels (initial embedding + 6 transformer layers). Verified on GPU with PyTorch 2.8 / CUDA 12.6.
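
A check of this kind can be sketched as follows. This is not the script used for the verification above; `max_abs_diff` is a hypothetical helper, and random tensors stand in for the two models' hidden states.

```python
import torch


def max_abs_diff(states_a, states_b):
    """Largest element-wise absolute difference across paired tensors."""
    if len(states_a) != len(states_b):
        raise ValueError("both models must expose the same number of levels")
    return max((a - b).abs().max().item() for a, b in zip(states_a, states_b))


# Stand-in tensors: 7 levels (embedding + 6 layers), each (batch, seq_len, 128).
torch.manual_seed(0)
reference = [torch.randn(2, 14, 128) for _ in range(7)]
ported = [t.clone() for t in reference]

print(max_abs_diff(reference, ported))  # 0.0
```

With real models, `reference` would hold the original implementation's per-layer outputs and `ported` the `hidden_states` tuple returned with `output_hidden_states=True`.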

## Related Models

See the full [UTR-LM collection](https://huggingface.co/collections/Taykhoom/utr-lm-6a173a96ae7c070c3a84ebb4).

| Model | Pretraining Objective | Notes |
|---|---|---|
| [UTR-LM-MLM](https://huggingface.co/Taykhoom/UTR-LM-MLM) | MLM | Base model |
| [UTR-LM-MLMSI](https://huggingface.co/Taykhoom/UTR-LM-MLMSI) | MLM + MFE regression | Recommended for TE / EL tasks |
| **[UTR-LM-MLMSS](https://huggingface.co/Taykhoom/UTR-LM-MLMSS)** | MLM + secondary structure | This model |
| [UTR-LM-MLMSISS](https://huggingface.co/Taykhoom/UTR-LM-MLMSISS) | MLM + MFE + secondary structure | Recommended for MRL tasks |

## Usage

### Embedding generation

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLMSS", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/UTR-LM-MLMSS", trust_remote_code=True)
model.eval()

sequences = ["ATGCATGCATGC", "GCTAGCTAGCTAGCTA"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

# CLS token embedding (position 0) - recommended for sequence-level tasks
cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 128)

# All-token embeddings (seq_len includes special and padding tokens)
token_emb = out.last_hidden_state  # (batch, seq_len, 128)

# Intermediate layer representations (index 0 is the initial embedding)
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer3_emb = out_all.hidden_states[3]  # after layer 3, shape (batch, seq_len, 128)
```
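
Because the batch above is padded, averaging token embeddings directly would mix in padding positions. The sketch below shows masked mean pooling as one possible alternative to the CLS embedding; it is not something this card or the original paper prescribes. `masked_mean_pool` is a hypothetical helper, and the tensors are stand-ins for `out.last_hidden_state` and `enc["attention_mask"]`.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over non-padding positions only."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Stand-in inputs: the second sequence has two padded positions.
hidden = torch.ones(2, 4, 128)
hidden[1, 2:, :] = 100.0  # values at padded positions must not leak into the mean
attention_mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])

pooled = masked_mean_pool(hidden, attention_mask)
print(pooled.shape)  # torch.Size([2, 128])
```

Note that positions holding special tokens usually have an attention mask of 1, so they are included in this average unless they are masked out separately.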

### MLM logits

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLMSS", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/UTR-LM-MLMSS", trust_remote_code=True)
model.eval()

# One position is replaced by <mask>; the model scores all 10 vocabulary tokens there
enc = tokenizer(["ATGC<mask>ATGC"], return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, 10)
```
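
To turn the logits into predicted nucleotides, one option is to take a softmax at each `<mask>` position and read off the top tokens. The sketch below uses a hypothetical helper, `top_predictions`, and synthetic tensors so that it runs without downloading the model. The mask ID and token order come from the vocabulary listed in this card, and the framing of the stand-in `input_ids` with `<cls>` and `<eos>` is an assumption.

```python
import torch

MASK_ID = 8  # <mask> in the vocabulary listed in this card
ID_TO_TOKEN = ["<pad>", "<eos>", "<unk>", "A", "G", "C", "T", "<cls>", "<mask>", "<sep>"]


def top_predictions(logits, input_ids, k=4):
    """Return the k most probable tokens at each <mask> position of one sequence.

    logits: (seq_len, vocab_size); input_ids: (seq_len,)
    """
    results = []
    positions = (input_ids == MASK_ID).nonzero(as_tuple=True)[0]
    for pos in positions.tolist():
        probs = torch.softmax(logits[pos], dim=-1)
        top = torch.topk(probs, k)
        ranked = [
            (ID_TO_TOKEN[idx], round(prob, 3))
            for prob, idx in zip(top.values.tolist(), top.indices.tolist())
        ]
        results.append((pos, ranked))
    return results


# Stand-in inputs; with the real model use enc["input_ids"][0] and logits[0].
input_ids = torch.tensor([7, 3, 6, 4, 5, 8, 3, 6, 4, 5, 1])
fake_logits = torch.zeros(11, 10)
fake_logits[5, 4] = 5.0  # pretend the model strongly prefers "G" at the masked position

print(top_predictions(fake_logits, input_ids, k=1))
```

Restricting the softmax to the four nucleotide tokens (IDs 3 to 6) may be preferable when special tokens should never be predicted.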

### Fine-tuning

The model follows standard HF conventions and can be fine-tuned with the `Trainer` API or any compatible training loop. For sequence regression tasks, use the CLS token embedding as input to a prediction head (as done in the original UTR-LM paper).
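
A prediction head on the CLS embedding can be sketched as below. `CLSRegressionHead` is a hypothetical module: its layer sizes and dropout rate are illustrative choices, not the head used in the UTR-LM paper, and random tensors stand in for the backbone output and the labels.

```python
import torch
from torch import nn


class CLSRegressionHead(nn.Module):
    """Hypothetical head: maps the <cls> embedding to one scalar per sequence."""

    def __init__(self, hidden_size: int = 128, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        cls_emb = last_hidden_state[:, 0, :]  # (batch, hidden_size)
        return self.net(cls_emb).squeeze(-1)  # (batch,)


head = CLSRegressionHead()
dummy_hidden = torch.randn(4, 18, 128)  # stand-in for out.last_hidden_state
targets = torch.randn(4)                # stand-in regression labels

loss = nn.functional.mse_loss(head(dummy_hidden), targets)
loss.backward()  # gradients reach the head; with a real backbone they would flow into it too
```

In a full setup, the backbone and head would typically be wrapped in one module whose forward pass returns the loss, so that it can be passed to a training loop.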

## Citation

```bibtex
@article{chu2024utrlm,
  title   = {A 5' UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions},
  author  = {Chu, Yanyi and Yu, Dan and Li, Yupeng and Huang, Kaixuan and Shen, Yue and Cong, Le and Zhang, Jason and Wang, Mengdi},
  journal = {Nature Machine Intelligence},
  volume  = {6},
  number  = {4},
  pages   = {449--460},
  year    = {2024},
  doi     = {10.1038/s42256-024-00823-9}
}
```

## Implementation Notes

The original UTR-LM implementation uses standard scaled dot-product attention. This HF port additionally supports `attn_implementation="sdpa"` (PyTorch `F.scaled_dot_product_attention`) and `attn_implementation="flash_attention_2"` (requires `pip install flash-attn --no-build-isolation`). Neither backend was part of the original codebase.
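
Selecting a backend at load time might look like the sketch below. Both helpers are hypothetical; the only library behavior relied on is the `attn_implementation` argument of `from_pretrained`. FlashAttention-2 generally expects a CUDA device and half-precision weights, so a `dtype` argument may also be needed in that case.

```python
import importlib.util


def pick_attn_implementation(has_cuda: bool) -> str:
    """Hypothetical helper: prefer FlashAttention-2 when it looks usable, else SDPA."""
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    return "flash_attention_2" if (has_cuda and flash_installed) else "sdpa"


def load_model(attn_implementation: str):
    """Load the checkpoint with the requested attention backend (downloads weights)."""
    # Imported lazily so the helper above stays dependency-free.
    from transformers import AutoModel

    return AutoModel.from_pretrained(
        "Taykhoom/UTR-LM-MLMSS",
        trust_remote_code=True,
        attn_implementation=attn_implementation,
    )


print(pick_attn_implementation(has_cuda=False))  # sdpa
```

A typical call would be `load_model(pick_attn_implementation(torch.cuda.is_available()))`.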

## Credits

Original model and code by Yanyi Chu et al. (Stanford). Source code: [UTR-LM GitHub repository](https://github.com/a96123155/UTR-LM). The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code) and reviewed manually by Taykhoom Dalal.

## License

GPL-3.0, following the original UTR-LM repository.