UTR-LM-MLM
UTR-LM is a 5' UTR RNA language model based on ESM2, pretrained on endogenous 5' UTR sequences from five species and a large synthetic library. This checkpoint (UTR-LM-MLM) was trained with a masked language modeling (MLM) objective only, with no auxiliary supervised signal.
Architecture
| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 128 |
| Vocabulary size | 10 |
| Positional encoding | Rotary (RoPE) |
| Architecture | ESM2-style pre-LN Transformer |
Vocabulary: <pad> (0), <eos> (1), <unk> (2), A (3), G (4), C (5), T (6), <cls> (7), <mask> (8), <sep> (9)
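The mapping above can be written out directly, which is handy when inspecting encoded inputs. The sketch below is an illustration, not the repository's tokenizer: `encode_utr` is a hypothetical helper, and both the U-to-T conversion and the placement of <cls> at the start and <eos> at the end (the usual ESM layout) are assumptions that should be checked against the real tokenizer output.

```python
# Token-to-id table copied from the vocabulary line above.
VOCAB = {"<pad>": 0, "<eos>": 1, "<unk>": 2, "A": 3, "G": 4, "C": 5, "T": 6,
         "<cls>": 7, "<mask>": 8, "<sep>": 9}


def encode_utr(seq, add_special_tokens=True):
    """Hypothetical helper: map a nucleotide string to token ids."""
    # Assumption: RNA input written with U is folded onto the T token.
    seq = seq.upper().replace("U", "T")
    # Any character outside A/G/C/T falls back to <unk>.
    ids = [VOCAB.get(base, VOCAB["<unk>"]) for base in seq]
    if add_special_tokens:
        # Assumption: ESM-style layout, <cls> first and <eos> last.
        ids = [VOCAB["<cls>"]] + ids + [VOCAB["<eos>"]]
    return ids
```

Comparing `encode_utr(seq)` with `tokenizer(seq)["input_ids"]` on a few sequences is a quick way to confirm or refute the two assumptions.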
Pretraining
- Objective: Masked language modeling (15% token masking)
- Data: Endogenous 5' UTRs from five species (human, mouse, zebrafish, Drosophila, yeast) combined with the Cao et al. random 5' UTR synthetic library
- Source checkpoint: ESM2_1.4_five_species_TrainLossMin_6layers_16heads_128embedsize_4096batchToks.pkl
This is the base MLM-only model. It serves as a pure language model prior with no supervised task augmentation, making it a clean baseline for embedding-based transfer tasks.
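To make the 15% masking objective concrete, here is a minimal sketch of how an MLM training example could be built from token ids. The card states only the masking rate; whether the original training always substituted <mask> or used a BERT-style mix of replacements is not stated, so this sketch assumes every selected position becomes <mask>. `mask_tokens` is a hypothetical name, and the -100 label follows the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import random

MASK_ID = 8                        # <mask> in the vocabulary listed above
SPECIAL_IDS = {0, 1, 2, 7, 8, 9}   # never selected for masking


def mask_tokens(ids, mask_prob=0.15, seed=None):
    """Hypothetical helper: return (masked_ids, labels) for one sequence."""
    rng = random.Random(seed)
    # Only nucleotide positions are candidates for masking.
    candidates = [i for i, t in enumerate(ids) if t not in SPECIAL_IDS]
    n_mask = max(1, round(mask_prob * len(candidates))) if candidates else 0
    chosen = set(rng.sample(candidates, n_mask))
    # Assumption: every chosen position is replaced by <mask>.
    masked = [MASK_ID if i in chosen else t for i, t in enumerate(ids)]
    # Labels keep the original id at masked positions and -100 elsewhere,
    # so the loss is computed on masked positions only.
    labels = [t if i in chosen else -100 for i, t in enumerate(ids)]
    return masked, labels
```

For a 20-nucleotide sequence this masks 3 positions and leaves the special tokens untouched.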
Parity Verification
Hidden-state representations produced by this HF model match the original ESM2-based implementation exactly (maximum absolute difference of 0.00) at all 7 representation levels (the initial embedding plus the 6 transformer layers). The check was run on GPU with PyTorch 2.8 / CUDA 12.6.
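A check of this kind can be sketched as a per-level comparison of two lists of hidden states. This is not the script used for the verification above; `max_abs_diff_per_level` is a hypothetical helper, and the random tensors stand in for the outputs of the original and ported models.

```python
import torch


def max_abs_diff_per_level(hidden_a, hidden_b):
    """Hypothetical helper: max |a - b| for each pair of hidden-state tensors."""
    if len(hidden_a) != len(hidden_b):
        raise ValueError("both models must return the same number of levels")
    return [(a.float() - b.float()).abs().max().item()
            for a, b in zip(hidden_a, hidden_b)]


# Synthetic stand-ins: 7 levels (embedding + 6 layers), batch 2, length 12, dim 128.
torch.manual_seed(0)
reference = [torch.randn(2, 12, 128) for _ in range(7)]
ported = [t.clone() for t in reference]

diffs = max_abs_diff_per_level(reference, ported)
```

With real models, the two lists would be the `hidden_states` tuples returned for the same tokenized batch.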
Related Models
See the full UTR-LM collection.
| Model | Pretraining Objective | Notes |
|---|---|---|
| UTR-LM-MLM | MLM | This model |
| UTR-LM-MLMSI | MLM + minimum free energy (MFE) regression | Recommended for translation efficiency (TE) and expression level (EL) tasks |
| UTR-LM-MLMSS | MLM + secondary structure | — |
| UTR-LM-MLMSISS | MLM + MFE + secondary structure | Recommended for mean ribosome loading (MRL) tasks |
Usage
Embedding generation
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLM", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/UTR-LM-MLM", trust_remote_code=True)
model.eval()

sequences = ["ATGCATGCATGC", "GCTAGCTAGCTAGCTA"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

# CLS token embedding (position 0) - recommended for sequence-level tasks
cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 128)

# All-token embeddings
token_emb = out.last_hidden_state  # (batch, seq_len, 128)

# Intermediate layer representations: hidden_states[0] is the initial embedding,
# hidden_states[i] is the output of transformer layer i
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer3_emb = out_all.hidden_states[3]  # after layer 3, shape (batch, seq_len, 128)
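The card recommends the CLS embedding for sequence-level tasks. Padding-aware mean pooling is a common alternative for transformer encoders; it is shown here as a different technique, not as the method used by the UTR-LM authors. `masked_mean_pool` is a hypothetical helper, and the sketch assumes the tokenizer returns an `attention_mask` with 1 for real tokens and 0 for padding.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Hypothetical helper: average token embeddings, ignoring padded positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so it broadcasts over the hidden dim.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)          # avoid division by zero
    return summed / counts                           # (batch, hidden)
```

With the variables from the example above, this would be called as `masked_mean_pool(out.last_hidden_state, enc["attention_mask"])`. Special tokens that the attention mask marks as real are included in the average.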
MLM logits
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLM", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/UTR-LM-MLM", trust_remote_code=True)
model.eval()

enc = tokenizer(["ATGC<mask>ATGC"], return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, 10)
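To turn these logits into a nucleotide prediction, one option is to take a softmax over the vocabulary at each <mask> position. The following is a minimal sketch: `predict_masked` is a hypothetical helper, the id-to-token list is copied from the vocabulary given earlier in this card, and the synthetic tensors only stand in for real model output.

```python
import torch

# Index i holds the token with id i, per the vocabulary listed in this card.
ID_TO_TOKEN = ["<pad>", "<eos>", "<unk>", "A", "G", "C", "T", "<cls>", "<mask>", "<sep>"]
MASK_ID = 8


def predict_masked(logits, input_ids):
    """Hypothetical helper: (batch, position, token, probability) per <mask>."""
    probs = logits.softmax(dim=-1)
    preds = []
    # nonzero() yields one [batch_index, position] pair per masked token.
    for b, pos in (input_ids == MASK_ID).nonzero().tolist():
        p, tok = probs[b, pos].max(dim=-1)
        preds.append((b, pos, ID_TO_TOKEN[tok.item()], p.item()))
    return preds


# Synthetic stand-in for model output: the logits favour G at the masked position.
demo_ids = torch.tensor([[7, 3, 6, 4, 5, 8, 3, 6, 4, 5, 1]])
demo_logits = torch.zeros(1, 11, 10)
demo_logits[0, 5, 4] = 5.0
demo_preds = predict_masked(demo_logits, demo_ids)
```

With the real model, the call would be `predict_masked(logits, enc["input_ids"])`.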
Fine-tuning
The model follows standard HF conventions and can be fine-tuned with any Trainer-compatible setup. For sequence regression tasks, use the CLS token embedding as input to a prediction head (as done in the original UTR-LM paper).
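A regression head on the CLS embedding can be sketched as a small wrapper module. This is an illustrative outline, not the head used in the UTR-LM paper: `CLSRegressor` and `DummyEncoder` are hypothetical names, a single linear layer is assumed, and the encoder is assumed to accept `input_ids` and `attention_mask` and to return an object with `last_hidden_state`. The dummy encoder only stands in for the pretrained model so the sketch runs without a download.

```python
from types import SimpleNamespace

import torch
from torch import nn


class CLSRegressor(nn.Module):
    """Hypothetical wrapper: encoder + linear head on the CLS embedding."""

    def __init__(self, encoder, hidden_size=128):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask=None, labels=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0, :]          # (batch, hidden)
        preds = self.head(cls).squeeze(-1)            # (batch,)
        result = {"logits": preds}
        if labels is not None:
            # Mean squared error for scalar regression targets.
            result = {"loss": nn.functional.mse_loss(preds, labels.float()), **result}
        return result


class DummyEncoder(nn.Module):
    """Stand-in for the pretrained encoder, for demonstration only."""

    def __init__(self, vocab_size=10, hidden_size=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None):
        return SimpleNamespace(last_hidden_state=self.emb(input_ids))


reg_model = CLSRegressor(DummyEncoder())
batch_ids = torch.tensor([[7, 3, 6, 1], [7, 4, 5, 1]])
reg_out = reg_model(batch_ids, labels=torch.tensor([0.5, 1.0]))
```

In practice `DummyEncoder()` would be replaced by the model loaded with `AutoModel.from_pretrained`, and the returned dictionary with a `loss` entry is the shape the HF Trainer expects from a custom model.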
Citation
@article{chu2023utrlm,
title = {A 5'UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions},
author = {Chu, Yanyi and others},
journal = {bioRxiv},
year = {2023},
doi = {10.1101/2023.10.11.561938}
}
Implementation Notes
The original UTR-LM implementation uses standard scaled dot-product attention. This HF port adds support for attn_implementation="sdpa" (PyTorch F.scaled_dot_product_attention) and attn_implementation="flash_attention_2" (requires pip install flash-attn --no-build-isolation), which were not part of the original codebase.
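Selecting between the two backends at load time might look like the sketch below. `pick_attn_implementation` and `load_utr_lm` are hypothetical helpers; the only library behaviour relied on is the `attn_implementation` argument of `from_pretrained` named above. The sketch falls back to "sdpa" whenever the `flash_attn` package is not importable.

```python
import importlib.util


def pick_attn_implementation(prefer_flash=True):
    """Hypothetical helper: use FlashAttention-2 only if flash-attn is installed."""
    has_flash = importlib.util.find_spec("flash_attn") is not None
    return "flash_attention_2" if prefer_flash and has_flash else "sdpa"


def load_utr_lm(repo_id="Taykhoom/UTR-LM-MLM"):
    """Hypothetical helper: load the model with the selected attention backend."""
    # Imported lazily so the selection helper works without transformers installed.
    from transformers import AutoModel

    # Note: FlashAttention-2 kernels need a CUDA device and fp16/bf16 weights,
    # so the model may also need a half-precision dtype when that backend is chosen.
    return AutoModel.from_pretrained(
        repo_id,
        trust_remote_code=True,
        attn_implementation=pick_attn_implementation(),
    )
```

Calling `load_utr_lm()` downloads the checkpoint; `pick_attn_implementation(prefer_flash=False)` forces the SDPA path.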
Credits
Original model and code by Yanyi Chu et al. (Stanford). Source code: UTR-LM GitHub repository. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.
License
GPL-3.0, following the original UTR-LM repository.