UTR-LM-MLMSI

UTR-LM is a language model for 5' UTR RNA sequences, built on the ESM2 architecture and pretrained on endogenous 5' UTRs from five species plus a large synthetic library. This checkpoint (UTR-LM-MLMSI) was trained with masked language modeling (MLM) plus minimum free energy (MFE) regression as a supervised auxiliary objective.

Architecture

| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 128 |
| Vocabulary size | 10 |
| Positional encoding | Rotary (RoPE) |
| Architecture | ESM2-style pre-LN Transformer |

Vocabulary: <pad> (0), <eos> (1), <unk> (2), A (3), G (4), C (5), T (6), <cls> (7), <mask> (8), <sep> (9)
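
To make the id mapping concrete, here is a minimal sketch that encodes a sequence by hand with the vocabulary above. It is not the shipped tokenizer: the `encode` helper is hypothetical, and the choices to prepend <cls>, append <eos>, treat U as T, and map unknown characters to <unk> are assumptions modeled on ESM-style tokenizers.

```python
from typing import List

# Token ids copied from the vocabulary listed above.
VOCAB = {
    "<pad>": 0, "<eos>": 1, "<unk>": 2,
    "A": 3, "G": 4, "C": 5, "T": 6,
    "<cls>": 7, "<mask>": 8, "<sep>": 9,
}

def encode(seq: str) -> List[int]:
    """Hypothetical helper: map a nucleotide string to token ids.

    Assumptions (not confirmed by this card): <cls> is prepended, <eos> is
    appended, U is treated as T, and unknown characters map to <unk>.
    """
    ids = [VOCAB["<cls>"]]
    for ch in seq.upper().replace("U", "T"):
        ids.append(VOCAB.get(ch, VOCAB["<unk>"]))
    ids.append(VOCAB["<eos>"])
    return ids

print(encode("ATGC"))  # [7, 3, 6, 4, 5, 1]
```

This per-character sketch does not parse special tokens such as <mask> embedded in a string; for real inputs, use the tokenizer loaded with AutoTokenizer as shown under Usage.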

Pretraining

  • Objective: Masked language modeling + MFE (minimum free energy) regression
  • Data: Endogenous 5' UTRs from five species (human, mouse, zebrafish, Drosophila, yeast) combined with the Cao et al. random 5' UTR synthetic library
  • Source checkpoint: ESM2SI_3.1_fiveSpeciesCao_6layers_16heads_128embedsize_4096batchToks_MLMLossMin.pkl
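
This card does not state how the two objectives are combined. A minimal sketch, assuming a weighted sum of token-level cross-entropy on masked positions and mean squared error on a scalar MFE prediction. The function name `joint_loss`, the weight `mfe_weight`, and the use of -100 as the ignore label are illustrative assumptions, not details taken from the original training code:

```python
import torch
import torch.nn.functional as F

def joint_loss(mlm_logits, mlm_labels, mfe_pred, mfe_target, mfe_weight=1.0):
    """Hypothetical combined objective: MLM cross-entropy + weighted MFE MSE.

    mlm_logits: (batch, seq_len, vocab) token scores
    mlm_labels: (batch, seq_len) target ids, -100 where the token was not masked
    mfe_pred, mfe_target: (batch,) predicted and reference free energies
    """
    mlm = F.cross_entropy(
        mlm_logits.reshape(-1, mlm_logits.size(-1)),
        mlm_labels.reshape(-1),
        ignore_index=-100,
    )
    mfe = F.mse_loss(mfe_pred, mfe_target)
    return mlm + mfe_weight * mfe

# Toy shapes only; the values are random, not real data.
torch.manual_seed(0)
logits = torch.randn(2, 6, 10)
labels = torch.full((2, 6), -100, dtype=torch.long)
labels[0, 2] = 3   # pretend position 2 of sequence 0 was masked; target is A
labels[1, 4] = 6   # pretend position 4 of sequence 1 was masked; target is T
loss = joint_loss(logits, labels, torch.randn(2), torch.randn(2))
print(loss.item())
```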

Checkpoint selection

Multiple ESM2SI checkpoints were available (versions 3.1, FS4.1, FS4.4, FS4.7). The 3.1 checkpoint was selected because it is the version the original UTR-LM paper specifies for the translation efficiency (TE) and expression level (EL) downstream tasks; it is the one used in the MJ4_Finetune evaluation scripts. The FS4.x variants come from later training runs and were not reported in the original publication.

Parity Verification

Hidden-state representations produced by this HF model are verified to be exactly identical (max absolute difference = 0.00) to the original ESM2-based implementation at all 7 representation levels (initial embedding + 6 transformer layers). Verified on GPU with PyTorch 2.8 / CUDA 12.6.
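
A check of this kind can be sketched as follows. This is not the script used for the verification above; `max_abs_diff_per_layer` is a hypothetical helper that takes two lists of hidden-state tensors, one from each implementation, and reports the largest elementwise difference at each level. Random tensors stand in for real model outputs.

```python
import torch

def max_abs_diff_per_layer(ref_states, hf_states):
    """Hypothetical helper: largest absolute elementwise difference per level.

    ref_states, hf_states: sequences of tensors with matching shapes, e.g.
    the embedding output followed by each transformer layer's output.
    """
    if len(ref_states) != len(hf_states):
        raise ValueError("both implementations must expose the same number of levels")
    return [(a - b).abs().max().item() for a, b in zip(ref_states, hf_states)]

# Toy stand-ins for 7 levels of (batch, seq_len, 128) hidden states.
torch.manual_seed(0)
ref = [torch.randn(2, 14, 128) for _ in range(7)]
hf = [t.clone() for t in ref]
diffs = max_abs_diff_per_layer(ref, hf)
print(diffs)  # seven zeros, since the copies are exact
```

With the HF model, the second argument would be the `hidden_states` tuple returned when calling the model with `output_hidden_states=True`.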

Related Models

See the full UTR-LM collection.

| Model | Pretraining Objective | Notes |
|---|---|---|
| UTR-LM-MLM | MLM | Base model |
| UTR-LM-MLMSI | MLM + MFE regression | This model; recommended for TE / EL tasks |
| UTR-LM-MLMSS | MLM + secondary structure | - |
| UTR-LM-MLMSISS | MLM + MFE + secondary structure | Recommended for mean ribosome load (MRL) tasks |

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLMSI", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/UTR-LM-MLMSI", trust_remote_code=True)
model.eval()

sequences = ["ATGCATGCATGC", "GCTAGCTAGCTAGCTA"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

# CLS token embedding (position 0) - recommended for sequence-level tasks
cls_emb = out.last_hidden_state[:, 0, :]   # (batch, 128)

# All-token embeddings
token_emb = out.last_hidden_state           # (batch, seq_len, 128)

# Intermediate layer representations (hidden_states[0] is the initial embedding)
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer3_emb = out_all.hidden_states[3]       # after layer 3, shape (batch, seq_len, 128)

MLM logits

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLMSI", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/UTR-LM-MLMSI", trust_remote_code=True)
model.eval()

enc = tokenizer(["ATGC<mask>ATGC"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits   # (1, seq_len, 10)
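
To turn the logits into a prediction for the masked position, one option is to take the argmax at each <mask> token. The sketch below uses the id-to-token mapping from the Vocabulary line and random logits in place of a real forward pass. `predict_masked` is a hypothetical helper, not part of the model's API, and the <cls> ... <eos> framing of the toy input is an assumption.

```python
import torch

# Id-to-token table, following the vocabulary listed in this card.
ID_TO_TOKEN = ["<pad>", "<eos>", "<unk>", "A", "G", "C", "T", "<cls>", "<mask>", "<sep>"]
MASK_ID = 8

def predict_masked(input_ids, logits):
    """Hypothetical helper: top-scoring token at every <mask> position.

    input_ids: (batch, seq_len) token ids; logits: (batch, seq_len, 10).
    Returns a list of (batch_index, position, token) tuples.
    """
    positions = (input_ids == MASK_ID).nonzero(as_tuple=False)
    best = logits.argmax(dim=-1)
    return [(b, p, ID_TO_TOKEN[best[b, p].item()]) for b, p in positions.tolist()]

# Toy input: <cls> A T G C <mask> A T G C <eos>, with random logits.
ids = torch.tensor([[7, 3, 6, 4, 5, 8, 3, 6, 4, 5, 1]])
fake_logits = torch.randn(1, ids.size(1), 10)
fake_logits[0, 5, 4] = 100.0  # force G to win at the masked position
print(predict_masked(ids, fake_logits))  # [(0, 5, 'G')]
```

With the real model, the call would be `predict_masked(enc["input_ids"], logits)`.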

Fine-tuning

The model follows standard HF conventions and can be fine-tuned with any Trainer-compatible setup. For sequence regression tasks, use the CLS token embedding as input to a prediction head (as done in the original UTR-LM paper).
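
A minimal sketch of such a head, assuming a small MLP on the position-0 embedding. The class name `ClsRegressionHead`, the hidden width, and the dropout rate are illustrative choices rather than the configuration used in the paper, and a random tensor stands in for `out.last_hidden_state`:

```python
import torch
from torch import nn

class ClsRegressionHead(nn.Module):
    """Hypothetical prediction head: CLS embedding -> one scalar per sequence."""

    def __init__(self, embed_dim: int = 128, hidden_dim: int = 64, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        cls_emb = last_hidden_state[:, 0, :]      # (batch, embed_dim)
        return self.net(cls_emb).squeeze(-1)      # (batch,)

# Stand-in for out.last_hidden_state with batch=4, seq_len=20.
head = ClsRegressionHead()
hidden = torch.randn(4, 20, 128)
targets = torch.randn(4)
loss = nn.functional.mse_loss(head(hidden), targets)
loss.backward()  # gradients flow into the head's parameters
print(head(hidden).shape)  # torch.Size([4])
```

In a full fine-tuning run, `hidden` would come from the backbone's forward pass, and whether to freeze or update the backbone is a separate design choice.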

Citation

@article{chu2023utrlm,
  title   = {A 5'UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions},
  author  = {Chu, Yanyi and others},
  journal = {bioRxiv},
  year    = {2023},
  doi     = {10.1101/2023.10.11.561938}
}

Implementation Notes

The original UTR-LM implementation uses standard scaled dot-product attention. This HF port adds support for attn_implementation="sdpa" (PyTorch F.scaled_dot_product_attention) and attn_implementation="flash_attention_2" (requires pip install flash-attn --no-build-isolation), which were not part of the original codebase.
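
Selecting a backend at load time might look like the following. `attn_implementation` is the standard keyword accepted by `from_pretrained` in recent transformers releases. The half-precision dtype and CUDA placement on the flash-attention path are assumptions based on the usual requirements of the flash-attn kernels, not statements from this card, and the `load_utr_lm` wrapper is hypothetical.

```python
def load_utr_lm(attn_implementation: str = "sdpa"):
    """Hypothetical wrapper: load the checkpoint with a chosen attention backend."""
    import torch
    from transformers import AutoModel

    kwargs = {"trust_remote_code": True, "attn_implementation": attn_implementation}
    if attn_implementation == "flash_attention_2":
        # Assumption: flash-attn kernels need fp16/bf16 weights on a CUDA device.
        kwargs["torch_dtype"] = torch.float16
    model = AutoModel.from_pretrained("Taykhoom/UTR-LM-MLMSI", **kwargs)
    if attn_implementation == "flash_attention_2":
        model = model.to("cuda")
    return model.eval()
```

Calling `load_utr_lm("sdpa")` downloads the checkpoint on first use, so it needs network access.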

Credits

Original model and code by Yanyi Chu et al. (Stanford). Source code: UTR-LM GitHub repository. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

GPL-3.0, following the original UTR-LM repository.
