---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
- UTR
- genomics
- biology
license: gpl-3.0
---

# UTR-LM-MLMSS

UTR-LM is a 5' UTR RNA language model based on ESM2, pretrained on endogenous 5' UTRs from five species and a large synthetic library. This checkpoint (`UTR-LM-MLMSS`) was trained with masked language modeling (MLM) plus **secondary structure prediction** as a supervised auxiliary objective.

## Architecture

| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 128 |
| Vocabulary size | 10 |
| Positional encoding | Rotary (RoPE) |
| Architecture | ESM2-style pre-LN Transformer |

**Vocabulary:** `<pad>` (0), `<eos>` (1), `<unk>` (2), `A` (3), `G` (4), `C` (5), `T` (6), `<cls>` (7), `<mask>` (8), `<sep>` (9)
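
As a rough illustration of the id mapping above, the following sketch encodes a sequence by hand. It is not the repository's tokenizer: the `encode` helper is hypothetical, and wrapping the sequence as `<cls>` ... `<eos>` is an assumption based on ESM-style tokenizers (this card only confirms that position 0 holds the CLS token).

```python
# Token-to-id table copied from the vocabulary listed above.
VOCAB = {
    "<pad>": 0, "<eos>": 1, "<unk>": 2,
    "A": 3, "G": 4, "C": 5, "T": 6,
    "<cls>": 7, "<mask>": 8, "<sep>": 9,
}


def encode(sequence: str) -> list[int]:
    """Hypothetical helper: map a nucleotide string to token ids.

    Assumption: the sequence is wrapped as <cls> ... <eos>, and any
    character outside the vocabulary falls back to <unk>.
    """
    ids = [VOCAB["<cls>"]]
    ids.extend(VOCAB.get(base, VOCAB["<unk>"]) for base in sequence.upper())
    ids.append(VOCAB["<eos>"])
    return ids


print(encode("ATGC"))  # [7, 3, 6, 4, 5, 1]
```

The vocabulary contains `T` rather than `U`, so in this sketch a `U` maps to `<unk>`. For real inputs, use the tokenizer loaded with `AutoTokenizer` as shown under Usage.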

## Pretraining

- **Objective:** Masked language modeling + per-token secondary structure prediction (3-class: unpaired, stem, loop)
- **Data:** Endogenous 5' UTRs from five species (human, mouse, zebrafish, *Drosophila*, yeast) combined with the Cao et al. random 5' UTR synthetic library
- **Source checkpoint:** `ESM2SS_FS4.1_fiveSpeciesCao_6layers_16heads_128embedsize_4096batchToks_lr1e-05_structureweight1.0_MLMLossMin_epoch200.pkl`

Only one `ESM2SS` checkpoint (secondary structure objective only, no MFE regression) was available, so no checkpoint selection was needed.
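
The combined objective can be sketched as a weighted sum of the two losses. This form is inferred from the `structureweight1.0` field in the checkpoint name, not taken from the original training code; the symbols $\mathcal{L}_{\text{MLM}}$, $\mathcal{L}_{\text{SS}}$, and $\lambda$ are introduced here for illustration only.

$$
\mathcal{L} = \mathcal{L}_{\text{MLM}} + \lambda \, \mathcal{L}_{\text{SS}}, \qquad \lambda = 1.0
$$

Here $\mathcal{L}_{\text{MLM}}$ denotes the loss over masked tokens and $\mathcal{L}_{\text{SS}}$ the per-token loss over the three structure classes. Both would typically be cross-entropy losses, which is an assumption rather than something this card states.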

## Parity Verification

Hidden-state representations from this HF model are **exactly identical** (max absolute difference = 0.00) to those of the original ESM2-based implementation at all 7 representation levels (initial embedding + 6 transformer layers). The check was run on GPU with PyTorch 2.8 / CUDA 12.6.
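
A check of this kind can be sketched as follows. This is not the script used for the verification above: `max_abs_diff` is a hypothetical helper, and the random tensors are placeholders standing in for the per-level outputs of the two implementations.

```python
import torch


def max_abs_diff(reference, ported):
    """Return the largest element-wise absolute difference at each level."""
    assert len(reference) == len(ported), "both models must expose the same levels"
    return [(r - p).abs().max().item() for r, p in zip(reference, ported)]


# Placeholder tensors for the 7 representation levels
# (initial embedding + 6 layers), each of shape (batch, seq_len, 128).
torch.manual_seed(0)
reference = [torch.randn(2, 14, 128) for _ in range(7)]
ported = [t.clone() for t in reference]  # stand-in for the HF model's outputs

diffs = max_abs_diff(reference, ported)
print(max(diffs))  # 0.0, because the placeholders are exact copies
```

In practice, `ported` would come from `output_hidden_states=True` on the HF model, and `reference` from the corresponding per-layer outputs of the original implementation.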

## Related Models

See the full [UTR-LM collection](https://huggingface.co/collections/Taykhoom/utr-lm-6a173a96ae7c070c3a84ebb4).

| Model | Pretraining Objective | Notes |
|---|---|---|
| [UTR-LM-MLM](https://huggingface.co/Taykhoom/UTR-LM-MLM) | MLM | Base model |
| [UTR-LM-MLMSI](https://huggingface.co/Taykhoom/UTR-LM-MLMSI) | MLM + MFE regression | Recommended for TE / EL tasks |
| **[UTR-LM-MLMSS](https://huggingface.co/Taykhoom/UTR-LM-MLMSS)** | MLM + secondary structure | This model |
| [UTR-LM-MLMSISS](https://huggingface.co/Taykhoom/UTR-LM-MLMSISS) | MLM + MFE + secondary structure | Recommended for MRL tasks |

## Usage

### Embedding generation

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLMSS", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/UTR-LM-MLMSS", trust_remote_code=True)
model.eval()

sequences = ["ATGCATGCATGC", "GCTAGCTAGCTAGCTA"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    # Request hidden states in the same no-grad forward pass so that
    # no autograd graph is built for inference
    out = model(**enc, output_hidden_states=True)

# CLS token embedding (position 0) - recommended for sequence-level tasks
cls_emb = out.last_hidden_state[:, 0, :]   # (batch, 128)

# All-token embeddings
token_emb = out.last_hidden_state           # (batch, seq_len, 128)

# Intermediate layer representations:
# hidden_states[0] is the initial embedding, hidden_states[i] the output of layer i
layer3_emb = out.hidden_states[3]           # after layer 3, shape (batch, seq_len, 128)
```

### MLM logits

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLMSS", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/UTR-LM-MLMSS", trust_remote_code=True)
model.eval()

enc = tokenizer(["ATGC<mask>ATGC"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits   # (1, seq_len, 10)
```
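
To turn the logits into a nucleotide prediction, one option is to take the argmax at each `<mask>` position. The sketch below uses dummy tensors so it runs without downloading the model; `predict_masked` is a hypothetical helper, and the token ids come from the vocabulary listed under Architecture.

```python
import torch

# Index i holds the token with id i, per the vocabulary in this card.
ID_TO_TOKEN = ["<pad>", "<eos>", "<unk>", "A", "G", "C", "T", "<cls>", "<mask>", "<sep>"]
MASK_ID = 8


def predict_masked(input_ids, logits):
    """Return (batch_index, position, predicted_token) for every <mask>."""
    positions = (input_ids == MASK_ID).nonzero(as_tuple=False)
    results = []
    for b, pos in positions.tolist():
        token_id = int(logits[b, pos].argmax())
        results.append((b, pos, ID_TO_TOKEN[token_id]))
    return results


# Dummy inputs (assumed layout): <cls> A T G C <mask> A T G C <eos>
input_ids = torch.tensor([[7, 3, 6, 4, 5, 8, 3, 6, 4, 5, 1]])
logits = torch.zeros(1, 11, 10)
logits[0, 5, 4] = 5.0  # pretend the model favours "G" at the masked position

print(predict_masked(input_ids, logits))  # [(0, 5, 'G')]
```

With the real model, the call would be `predict_masked(enc["input_ids"], logits)` using the tensors from the block above.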

### Fine-tuning

The model follows standard HF conventions and can be fine-tuned with any Trainer-compatible setup. For sequence regression tasks, use the CLS token embedding as input to a prediction head (as done in the original UTR-LM paper).
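
A minimal sketch of such a prediction head is shown below. The class name `ClsRegressionHead`, the two-layer MLP, the dropout rate, and the MSE loss are illustrative choices, not the head used in the UTR-LM paper; the dummy tensor stands in for `out.last_hidden_state`.

```python
import torch
from torch import nn


class ClsRegressionHead(nn.Module):
    """Hypothetical head: CLS embedding -> one scalar per sequence."""

    def __init__(self, hidden_size: int = 128, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        cls_emb = last_hidden_state[:, 0, :]   # (batch, hidden_size)
        return self.net(cls_emb).squeeze(-1)   # (batch,)


head = ClsRegressionHead()
dummy_hidden = torch.randn(4, 20, 128)         # stand-in for out.last_hidden_state
targets = torch.randn(4)                       # placeholder regression labels

loss = nn.functional.mse_loss(head(dummy_hidden), targets)
loss.backward()                                # gradients reach the head's weights
print(loss.item())
```

During fine-tuning, the backbone output would replace `dummy_hidden`, and the backbone parameters can either be updated jointly with the head or kept frozen.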

## Citation

```bibtex
@article{chu2024utrlm,
  title   = {A 5' UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions},
  author  = {Chu, Yanyi and Yu, Dan and Li, Yupeng and Huang, Kaixuan and Shen, Yue and Cong, Le and Zhang, Jason and Wang, Mengdi},
  journal = {Nature Machine Intelligence},
  volume  = {6},
  number  = {4},
  pages   = {449--460},
  year    = {2024},
  doi     = {10.1038/s42256-024-00823-9}
}
```

## Implementation Notes

The original UTR-LM implementation uses standard scaled dot-product attention. This HF port adds support for `attn_implementation="sdpa"` (PyTorch `F.scaled_dot_product_attention`) and `attn_implementation="flash_attention_2"` (requires `pip install flash-attn --no-build-isolation`), which were not part of the original codebase.
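
The sketch below illustrates why switching to the `sdpa` path should not change results beyond floating-point tolerance: PyTorch's fused kernel computes the same softmax(QK^T / sqrt(d)) V as an explicit implementation. The tensor shapes are assumptions, with a head dimension of 128 / 16 = 8 derived from the Architecture table under the usual even split across heads. This is a generic demonstration, not code from either codebase.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, heads, seq_len, head_dim = 2, 16, 12, 128 // 16   # head_dim = 8
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# Explicit scaled dot-product attention
scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
manual = torch.softmax(scores, dim=-1) @ v

# Fused PyTorch kernel, the function the "sdpa" path is described as using
fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(manual, fused, atol=1e-4))
```

An attention backend is normally selected at load time by passing `attn_implementation="sdpa"` to `from_pretrained`.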

## Credits

Original model and code by Yanyi Chu et al. (Stanford). Source code: [UTR-LM GitHub repository](https://github.com/a96123155/UTR-LM). The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code) and reviewed manually by Taykhoom Dalal.

## License

GPL-3.0, following the original UTR-LM repository.