UTR-LM-MLMSISS

UTR-LM is a 5' UTR RNA language model based on ESM2, pretrained on endogenous 5' UTRs from five species and a large synthetic library. This checkpoint (UTR-LM-MLMSISS) was trained with masked language modeling (MLM) plus two supervised auxiliary objectives: minimum free energy (MFE) regression and secondary structure prediction.

Architecture

Parameter	Value
Layers	6
Attention heads	16
Embedding dimension	128
Vocabulary size	10
Positional encoding	Rotary (RoPE)
Architecture	ESM2-style pre-LN Transformer
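
As a rough sanity check on the table above, the sketch below derives the per-head dimension and a back-of-the-envelope parameter count. The feed-forward width of 4x the embedding dimension is an assumption borrowed from the usual ESM2 layout, not something stated in this card, and the output heads and final normalization layer are ignored.

```python
# Values taken from the architecture table.
LAYERS, HEADS, EMBED_DIM, VOCAB = 6, 16, 128, 10

# Assumption: feed-forward width is 4x the embedding dimension (typical for ESM2).
FFN_DIM = 4 * EMBED_DIM

# Each attention head works on an equal slice of the embedding.
head_dim = EMBED_DIM // HEADS


def layer_params(d: int, ffn: int) -> int:
    """Parameters in one pre-LN Transformer layer (weights and biases)."""
    attention = 4 * (d * d + d)                      # Q, K, V and output projections
    feed_forward = (d * ffn + ffn) + (ffn * d + d)   # two linear layers
    layer_norms = 2 * (2 * d)                        # two LayerNorms, weight + bias each
    return attention + feed_forward + layer_norms


encoder_params = LAYERS * layer_params(EMBED_DIM, FFN_DIM)
embedding_params = VOCAB * EMBED_DIM                 # RoPE adds no learned position table
approx_total = encoder_params + embedding_params

print(f"head_dim={head_dim}, approx_params={approx_total:,}")
```

Under these assumptions the encoder plus token embedding comes to roughly 1.19M parameters, with 8 dimensions per attention head.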

Vocabulary: <pad> (0), <eos> (1), <unk> (2), A (3), G (4), C (5), T (6), <cls> (7), <mask> (8), <sep> (9)
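
To make the mapping concrete, here is a minimal pure-Python sketch that encodes a sequence using the ids listed above. It is an illustration, not the shipped tokenizer: the trailing <eos> token and the U-to-T conversion are assumptions (the vocabulary contains T but no U), and `encode` is a hypothetical helper name.

```python
# Token-to-id mapping copied from the vocabulary line of this card.
VOCAB = {
    "<pad>": 0, "<eos>": 1, "<unk>": 2,
    "A": 3, "G": 4, "C": 5, "T": 6,
    "<cls>": 7, "<mask>": 8, "<sep>": 9,
}


def encode(sequence: str) -> list[int]:
    """Illustrative encoder: <cls> + nucleotides + <eos>.

    Assumptions: the <eos> suffix and the U -> T conversion are not confirmed
    by the card; unknown characters fall back to <unk>.
    """
    dna = sequence.upper().replace("U", "T")
    ids = [VOCAB.get(base, VOCAB["<unk>"]) for base in dna]
    return [VOCAB["<cls>"], *ids, VOCAB["<eos>"]]


print(encode("AUGC"))
```

If your input is written with U (RNA alphabet), it is worth checking how the real tokenizer handles it before relying on the embeddings.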

Pretraining

Objective: Masked language modeling + MFE regression + per-token secondary structure prediction (3-class: unpaired, stem, loop)
Data: Endogenous 5' UTRs from five species (human, mouse, zebrafish, Drosophila, yeast) combined with the Cao et al. random 5' UTR synthetic library
Source checkpoint: ESM2SISS_FS4.1_fiveSpeciesCao_6layers_16heads_128embedsize_4096batchToks_lr1e-05_supervisedweight1.0_structureweight1.0_MLMLossMin_epoch93.pkl
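
The checkpoint name records supervisedweight1.0 and structureweight1.0, which suggests the three objectives were summed with those weights. The sketch below shows one plausible form of such a combined loss. The specific loss functions (cross-entropy for MLM and structure, mean squared error for MFE) and the function name `combined_loss` are assumptions for illustration, not the original training code.

```python
import torch
import torch.nn.functional as F


def combined_loss(mlm_logits, mlm_labels, mfe_pred, mfe_target,
                  ss_logits, ss_labels,
                  supervised_weight=1.0, structure_weight=1.0):
    """Weighted sum of MLM, MFE-regression and structure losses (assumed form)."""
    # MLM: cross-entropy over the vocabulary; positions labelled -100 are ignored.
    mlm = F.cross_entropy(mlm_logits.reshape(-1, mlm_logits.size(-1)),
                          mlm_labels.reshape(-1), ignore_index=-100)
    # MFE: one scalar regression target per sequence.
    mfe = F.mse_loss(mfe_pred, mfe_target)
    # Structure: per-token classification over the structure classes.
    ss = F.cross_entropy(ss_logits.reshape(-1, ss_logits.size(-1)),
                         ss_labels.reshape(-1), ignore_index=-100)
    return mlm + supervised_weight * mfe + structure_weight * ss


# Dummy tensors standing in for model outputs and labels.
torch.manual_seed(0)
batch, seq_len = 2, 12
mlm_labels = torch.full((batch, seq_len), -100, dtype=torch.long)
mlm_labels[:, 3] = torch.tensor([3, 5])          # one masked position per sequence
loss = combined_loss(
    torch.randn(batch, seq_len, 10), mlm_labels,
    torch.randn(batch), torch.randn(batch),
    torch.randn(batch, seq_len, 3), torch.randint(0, 3, (batch, seq_len)),
)
print(loss.item())
```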

Checkpoint selection

Multiple ESM2SISS checkpoints were available (FS4.1, FS4.4, FS4.7, FS4.10, FS4.13, FS4.16, FS4.19, FS4.22). The FS4.1 checkpoint at epoch 93 was selected because it is the version specified in the original UTR-LM paper for the mean ribosome load (MRL) downstream fine-tuning task (used in the MJ3_Finetune evaluation scripts with --prefix ESM2SISS_FS4.1.ep93).
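
When comparing several of these checkpoints, it can help to pull the hyperparameters out of the filename. The following sketch parses the source checkpoint name quoted in this card. The field names and their interpretation are inferred from the filename alone, so treat them as assumptions.

```python
import re

CHECKPOINT = (
    "ESM2SISS_FS4.1_fiveSpeciesCao_6layers_16heads_128embedsize_"
    "4096batchToks_lr1e-05_supervisedweight1.0_structureweight1.0_"
    "MLMLossMin_epoch93.pkl"
)

# Field meanings are inferred from the filename, not from documentation.
PATTERN = re.compile(
    r"^(?P<model>[^_]+)_(?P<version>FS[\d.]+)_.*?"
    r"_(?P<layers>\d+)layers_(?P<heads>\d+)heads_(?P<embed_dim>\d+)embedsize"
    r"_(?P<batch_tokens>\d+)batchToks_lr(?P<lr>[\d.eE+-]+)"
    r"_supervisedweight(?P<supervised_weight>[\d.]+)"
    r"_structureweight(?P<structure_weight>[\d.]+)"
    r"_.*epoch(?P<epoch>\d+)\.pkl$"
)

INT_FIELDS = {"layers", "heads", "embed_dim", "batch_tokens", "epoch"}
FLOAT_FIELDS = {"lr", "supervised_weight", "structure_weight"}


def parse_checkpoint_name(name: str) -> dict:
    """Split a checkpoint filename into typed hyperparameter fields."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognised checkpoint name: {name}")
    info = match.groupdict()
    for key in INT_FIELDS:
        info[key] = int(info[key])
    for key in FLOAT_FIELDS:
        info[key] = float(info[key])
    return info


info = parse_checkpoint_name(CHECKPOINT)
print(info)
```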

Parity Verification

Hidden-state representations produced by this HF model are verified to be exactly identical (max absolute difference = 0.00) to the original ESM2-based implementation at all 7 representation levels (initial embedding + 6 transformer layers). Verified on GPU with PyTorch 2.8 / CUDA 12.9.
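
A parity check of this kind can be sketched as follows: collect the hidden states from both implementations for the same input and report the largest absolute difference at each level. The helper name `max_abs_diff_per_level` is hypothetical, and the tensors here are random stand-ins rather than real model outputs.

```python
import torch


def max_abs_diff_per_level(reference_states, ported_states):
    """Largest absolute element-wise difference at each representation level."""
    if len(reference_states) != len(ported_states):
        raise ValueError("Both models must expose the same number of levels")
    return [
        (ref - new).abs().max().item()
        for ref, new in zip(reference_states, ported_states)
    ]


# Stand-in tensors: 7 levels (initial embedding + 6 layers), shape (batch, seq_len, 128).
torch.manual_seed(0)
reference = [torch.randn(2, 14, 128) for _ in range(7)]
ported = [t.clone() for t in reference]

diffs = max_abs_diff_per_level(reference, ported)
print(diffs)
```

With real models, `reference` would come from the original implementation and `ported` from `hidden_states` of the HF model called with output_hidden_states=True.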

Related Models

See the full UTR-LM collection.

Model	Pretraining Objective	Notes
UTR-LM-MLM	MLM	Base model
UTR-LM-MLMSI	MLM + MFE regression	Recommended for TE / EL tasks
UTR-LM-MLMSS	MLM + secondary structure	—
UTR-LM-MLMSISS	MLM + MFE + secondary structure	This model — recommended for MRL tasks

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLMSISS", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/UTR-LM-MLMSISS", trust_remote_code=True)
model.eval()

sequences = ["ATGCATGCATGC", "GCTAGCTAGCTAGCTA"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

# CLS token embedding (position 0) - recommended for sequence-level tasks
cls_emb = out.last_hidden_state[:, 0, :]   # (batch, 128)

# All-token embeddings
token_emb = out.last_hidden_state           # (batch, seq_len, 128)

# Intermediate layer representations (hidden_states[0] is the initial embedding)
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer3_emb = out_all.hidden_states[3]       # after layer 3, shape (batch, seq_len, 128)
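
Because the batch above is padded, any pooling over the all-token embeddings should skip pad positions. The CLS embedding remains the recommended sequence-level representation; the sketch below only shows how a padding-aware mean could be computed if you want one. `masked_mean_pool` is a hypothetical helper, and the tensors are stand-ins for `out.last_hidden_state` and `enc["attention_mask"]`.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over positions where attention_mask == 1."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Stand-in data: the second sequence has two pad positions with large values
# that would distort an unmasked mean.
hidden = torch.ones(2, 5, 128)
hidden[1, 3:] = 100.0
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])

pooled = masked_mean_pool(hidden, attention_mask)
print(pooled.shape)
```

Note that the attention mask still covers special tokens such as <cls>, so they contribute to this average unless you mask them out as well.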

MLM logits

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLMSISS", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/UTR-LM-MLMSISS", trust_remote_code=True)
model.eval()

enc = tokenizer(["ATGC<mask>ATGC"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits   # (1, seq_len, 10)
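
To turn these logits into a prediction for the masked base, one option is to restrict the softmax to the four nucleotide ids from the vocabulary. The sketch below does this on stand-in tensors. `masked_nucleotide_probs` is a hypothetical helper, and renormalizing over A, G, C and T only (dropping special tokens) is a design choice of this example.

```python
import torch

# Ids copied from the vocabulary section of this card.
NUCLEOTIDE_IDS = {"A": 3, "G": 4, "C": 5, "T": 6}
MASK_ID = 8


def masked_nucleotide_probs(logits, input_ids):
    """Return one {nucleotide: probability} dict per <mask> position."""
    ids = list(NUCLEOTIDE_IDS.values())
    results = []
    for batch_idx, pos in (input_ids == MASK_ID).nonzero().tolist():
        # Softmax over the four nucleotide logits only.
        probs = torch.softmax(logits[batch_idx, pos, ids], dim=-1)
        results.append(dict(zip(NUCLEOTIDE_IDS, probs.tolist())))
    return results


# Stand-in for tokenizer output and model logits: <cls> ATGC <mask> ATGC <eos>.
input_ids = torch.tensor([[7, 3, 6, 4, 5, 8, 3, 6, 4, 5, 1]])
logits = torch.zeros(1, input_ids.size(1), 10)
logits[0, 5, NUCLEOTIDE_IDS["G"]] = 5.0   # pretend the model favours G at the mask

predictions = masked_nucleotide_probs(logits, input_ids)
print(predictions)
```

With the real model, pass `enc["input_ids"]` and the `logits` tensor from the snippet above.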

Fine-tuning

The model follows standard HF conventions and can be fine-tuned with any Trainer-compatible setup. For sequence regression tasks, use the CLS token embedding as input to a prediction head (as done in the original UTR-LM paper).
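
A minimal sketch of such a setup is shown below: a wrapper module that feeds the CLS embedding into a small regression head. The single linear layer is a placeholder rather than the head used in the original paper, and `CLSRegressionHead` and `DummyBackbone` are hypothetical names. The dummy backbone only exists so the example runs offline; in practice you would pass the model returned by AutoModel.from_pretrained.

```python
import torch
from torch import nn
from types import SimpleNamespace


class CLSRegressionHead(nn.Module):
    """Backbone + linear head on the CLS embedding (position 0)."""

    def __init__(self, backbone, hidden_size=128, dropout=0.1):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Sequential(nn.Dropout(dropout), nn.Linear(hidden_size, 1))

    def forward(self, input_ids, attention_mask=None):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls_emb = out.last_hidden_state[:, 0, :]      # (batch, hidden_size)
        return self.head(cls_emb).squeeze(-1)         # (batch,)


class DummyBackbone(nn.Module):
    """Offline stand-in that mimics the output shape of the real encoder."""

    def __init__(self, vocab_size=10, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None):
        return SimpleNamespace(last_hidden_state=self.embed(input_ids))


model = CLSRegressionHead(DummyBackbone())
input_ids = torch.tensor([[7, 3, 6, 4, 5, 1]])        # <cls> A T G C <eos>
pred = model(input_ids)

# One regression step against a made-up target value.
loss = nn.functional.mse_loss(pred, torch.tensor([0.5]))
loss.backward()
print(pred.shape, loss.item())
```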

Citation

@article{chu2024_utrlm,
  title   = {A 5' {UTR} Language Model for Decoding Untranslated Regions of {mRNA} and Function Predictions},
  author  = {Chu, Yanyi and Yu, Dan and Li, Yupeng and Huang, Kaixuan and Shen, Yue and Cong, Le and Zhang, Jason and Wang, Mengdi},
  journal = {Nature Machine Intelligence},
  volume  = {6},
  number  = {4},
  pages   = {449--460},
  year    = {2024},
  doi     = {10.1038/s42256-024-00823-9}
}

Implementation Notes

The original UTR-LM implementation uses a standard scaled dot-product attention implementation. In addition to that default, this HF port supports attn_implementation="sdpa" (PyTorch F.scaled_dot_product_attention) and attn_implementation="flash_attention_2" (requires pip install flash-attn --no-build-isolation), neither of which was part of the original codebase.
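
One way to choose between these backends at load time is sketched below: prefer FlashAttention-2 when the flash_attn package is importable and fall back to SDPA otherwise. The helper names `pick_attn_implementation` and `load_model` are hypothetical. The attn_implementation keyword is the standard from_pretrained argument named above.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Prefer FlashAttention-2 when the flash_attn package is installed."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


def load_model(repo_id: str = "Taykhoom/UTR-LM-MLMSISS"):
    """Load the encoder with the selected attention backend (downloads weights)."""
    # Imported lazily so the selection helper works without transformers installed.
    from transformers import AutoModel

    return AutoModel.from_pretrained(
        repo_id,
        trust_remote_code=True,
        attn_implementation=pick_attn_implementation(),
    )


attn_impl = pick_attn_implementation()
print(attn_impl)
```

FlashAttention-2 kernels run only on CUDA devices in half precision (fp16 or bf16), so the model and inputs would also need to be moved and cast accordingly before that backend can be used.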

Credits

Original model and code by Yanyi Chu et al. (Stanford). Source code: UTR-LM GitHub repository. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

GPL-3.0, following the original UTR-LM repository.

Model size: 1.21M params (Safetensors, tensor type F32)
