---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
- UTR
- genomics
- biology
license: gpl-3.0
---

# UTR-LM-MLM

UTR-LM is a 5' UTR RNA language model based on ESM2, pretrained on endogenous 5' UTR sequences from five species and a large synthetic library. This checkpoint (`UTR-LM-MLM`) was trained with a **masked language modeling (MLM)** objective only - no auxiliary supervised signal.

## Architecture

| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 128 |
| Vocabulary size | 10 |
| Positional encoding | Rotary (RoPE) |
| Architecture | ESM2-style pre-LN Transformer |

**Vocabulary:** `` (0), `` (1), `` (2), `A` (3), `G` (4), `C` (5), `T` (6), `` (7), `` (8), `` (9)

## Pretraining

- **Objective:** Masked language modeling (15% token masking)
- **Data:** Endogenous 5' UTRs from five species (human, mouse, zebrafish, *Drosophila*, yeast) combined with the Cao et al. random 5' UTR synthetic library
- **Source checkpoint:** `ESM2_1.4_five_species_TrainLossMin_6layers_16heads_128embedsize_4096batchToks.pkl`

This is the base MLM-only model. It serves as a pure language model prior with no supervised task augmentation, making it a clean baseline for embedding-based transfer tasks.

## Parity Verification

Hidden-state representations produced by this HF model are verified to be **exactly identical** (max absolute difference = 0.00) to the original ESM2-based implementation at all 7 representation levels (initial embedding + 6 transformer layers). Verified on GPU with PyTorch 2.8 / CUDA 12.6.

## Related Models

See the full [UTR-LM collection](https://huggingface.co/collections/Taykhoom/utr-lm-6a173a96ae7c070c3a84ebb4).

| Model | Pretraining Objective | Notes |
|---|---|---|
| **[UTR-LM-MLM](https://huggingface.co/Taykhoom/UTR-LM-MLM)** | MLM | This model |
| [UTR-LM-MLMSI](https://huggingface.co/Taykhoom/UTR-LM-MLMSI) | MLM + MFE regression | Recommended for TE / EL tasks |
| [UTR-LM-MLMSS](https://huggingface.co/Taykhoom/UTR-LM-MLMSS) | MLM + secondary structure | - |
| [UTR-LM-MLMSISS](https://huggingface.co/Taykhoom/UTR-LM-MLMSISS) | MLM + MFE + secondary structure | Recommended for MRL tasks |

## Usage

### Embedding generation

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLM", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/UTR-LM-MLM", trust_remote_code=True)
model.eval()

sequences = ["ATGCATGCATGC", "GCTAGCTAGCTAGCTA"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

# CLS token embedding (position 0) - recommended for sequence-level tasks
cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 128)

# All-token embeddings
token_emb = out.last_hidden_state  # (batch, seq_len, 128)

# Intermediate layer representations
out_all = model(**enc, output_hidden_states=True)
layer3_emb = out_all.hidden_states[3]  # after layer 3, shape (batch, seq_len, 128)
```

### MLM logits

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLM", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/UTR-LM-MLM", trust_remote_code=True)
model.eval()

enc = tokenizer(["ATGCATGC"], return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, 10)
```

### Fine-tuning

The model follows standard HF conventions and can be fine-tuned with any Trainer-compatible setup. For sequence regression tasks, use the CLS token embedding as input to a prediction head (as done in the original UTR-LM paper).

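The CLS-based regression setup described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the head used in the paper: the single linear layer, the MSE loss, the AdamW learning rate, and the names `ClsRegressor` and `ToyEncoder` are all hypothetical. `ToyEncoder` is only a stand-in so the snippet runs offline; in practice it would be replaced by the encoder loaded with `AutoModel.from_pretrained(...)` as shown in the embedding example.

```python
from types import SimpleNamespace

import torch
from torch import nn


class ClsRegressor(nn.Module):
    """Encoder plus a linear head on the CLS (position 0) embedding."""

    def __init__(self, encoder: nn.Module, hidden_size: int = 128):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, 1)  # assumed head; one scalar per sequence

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_emb = out.last_hidden_state[:, 0, :]  # (batch, hidden_size)
        return self.head(cls_emb).squeeze(-1)     # (batch,)


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for the real encoder; it only embeds token ids."""

    def __init__(self, vocab_size: int = 10, hidden_size: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None):
        return SimpleNamespace(last_hidden_state=self.embed(input_ids))


# Swap ToyEncoder() for the pretrained encoder when running for real.
model = ClsRegressor(ToyEncoder())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # assumed hyperparameter

input_ids = torch.randint(0, 10, (4, 12))  # placeholder token ids, batch of 4
targets = torch.randn(4)                   # placeholder regression labels

# One training step: forward pass, MSE loss, backward pass, parameter update.
model.train()
pred = model(input_ids)
loss = nn.functional.mse_loss(pred, targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

With the real encoder, `input_ids` and `attention_mask` would come from the tokenizer output (`enc["input_ids"]`, `enc["attention_mask"]`), and the single step would sit inside a loop over batches.
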
## Citation

```bibtex
@article{chu2024utrlm,
  title   = {A 5' UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions},
  author  = {Chu, Yanyi and Yu, Dan and Li, Yupeng and Huang, Kaixuan and Shen, Yue and Cong, Le and Zhang, Jason and Wang, Mengdi},
  journal = {Nature Machine Intelligence},
  volume  = {6},
  number  = {4},
  pages   = {449--460},
  year    = {2024},
  doi     = {10.1038/s42256-024-00823-9}
}
```

## Implementation Notes

The original UTR-LM implementation uses standard scaled dot-product attention. This HF port adds support for `attn_implementation="sdpa"` (PyTorch `F.scaled_dot_product_attention`) and `attn_implementation="flash_attention_2"` (requires `pip install flash-attn --no-build-isolation`), which were not part of the original codebase.

## Credits

Original model and code by Yanyi Chu et al. (Stanford). Source code: [UTR-LM GitHub repository](https://github.com/a96123155/UTR-LM).

The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code) and reviewed manually by Taykhoom Dalal.

## License

GPL-3.0, following the original UTR-LM repository.