---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
- UTR
- genomics
- biology
license: gpl-3.0
---

# UTR-LM-MLMSI

UTR-LM is a 5' UTR RNA language model based on ESM2, pretrained on endogenous 5' UTRs from five species and a large synthetic library. This checkpoint (`UTR-LM-MLMSI`) was trained with **masked language modeling (MLM) + minimum free energy (MFE) regression**, where MFE regression serves as a supervised auxiliary objective.

## Architecture

| Parameter | Value |
|---|---|
| Layers | 6 |
| Attention heads | 16 |
| Embedding dimension | 128 |
| Vocabulary size | 10 |
| Positional encoding | Rotary (RoPE) |
| Architecture | ESM2-style pre-LN Transformer |

**Vocabulary:** `<pad>` (0), `<eos>` (1), `<unk>` (2), `A` (3), `G` (4), `C` (5), `T` (6), `<cls>` (7), `<mask>` (8), `<sep>` (9)
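
To make the id mapping concrete, here is a minimal pure-Python sketch of how a sequence could be turned into ids with this vocabulary. The `encode` helper is hypothetical and is not the tokenizer shipped with the checkpoint. It assumes that `U` is rewritten to `T`, that input is upper-cased, and that `<eos>` is appended; only the `<cls>` token at position 0 is implied by the usage examples in this card. For real inputs, the tokenizer loaded via `AutoTokenizer` is authoritative.

```python
from typing import List

# Token-to-id table copied from the vocabulary line above.
VOCAB = {
    "<pad>": 0, "<eos>": 1, "<unk>": 2,
    "A": 3, "G": 4, "C": 5, "T": 6,
    "<cls>": 7, "<mask>": 8, "<sep>": 9,
}


def encode(sequence: str) -> List[int]:
    """Map a nucleotide string to ids, wrapped as <cls> ... <eos>.

    Assumptions (not confirmed by the model card): U is rewritten to T,
    input is upper-cased, and <eos> is appended at the end.
    """
    bases = sequence.upper().replace("U", "T")
    ids = [VOCAB["<cls>"]]
    # Any character outside A/G/C/T falls back to <unk>.
    ids += [VOCAB.get(base, VOCAB["<unk>"]) for base in bases]
    ids.append(VOCAB["<eos>"])
    return ids


print(encode("AUGC"))  # [7, 3, 6, 4, 5, 1]
```

Note that the vocabulary contains `T` rather than `U`, so RNA sequences written with `U` likely need converting before tokenization.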

## Pretraining

- **Objective:** Masked language modeling + MFE (minimum free energy) regression
- **Data:** Endogenous 5' UTRs from five species (human, mouse, zebrafish, *Drosophila*, yeast) combined with the Cao et al. random 5' UTR synthetic library
- **Source checkpoint:** `ESM2SI_3.1_fiveSpeciesCao_6layers_16heads_128embedsize_4096batchToks_MLMLossMin.pkl`

### Checkpoint selection

Multiple `ESM2SI` checkpoints were available (versions 3.1, FS4.1, FS4.4, FS4.7). The `3.1` checkpoint was selected because it is the version the original UTR-LM paper specifies for the translation efficiency (TE) and expression level (EL) downstream tasks, and it is the one used in the `MJ4_Finetune` evaluation scripts. The FS4.x variants come from later training runs and are not the ones reported in the original publication.

## Parity Verification

Hidden-state representations produced by this HF model are verified to be **exactly identical** (max absolute difference = 0.00) to the original ESM2-based implementation at all 7 representation levels (initial embedding + 6 transformer layers). Verified on GPU with PyTorch 2.8 / CUDA 12.6.
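
A check of this kind can be sketched as follows. This is not the verification script used for this port: the helper `max_abs_diff_per_layer` is hypothetical, and random stand-in tensors replace the outputs of the two implementations so that the snippet runs without downloading any weights. In a real check, the two lists would be the `hidden_states` tuples returned by the original model and by the HF model for the same input.

```python
import torch


def max_abs_diff_per_layer(hidden_a, hidden_b):
    """Return the max absolute difference for each pair of hidden-state tensors."""
    if len(hidden_a) != len(hidden_b):
        raise ValueError("both models must expose the same number of levels")
    return [(a - b).abs().max().item() for a, b in zip(hidden_a, hidden_b)]


# Stand-in data: 7 levels (initial embedding + 6 layers),
# batch of 2, sequence length 12, embedding dimension 128.
torch.manual_seed(0)
reference = [torch.randn(2, 12, 128) for _ in range(7)]
ported = [t.clone() for t in reference]

diffs = max_abs_diff_per_layer(reference, ported)
print(diffs)  # every entry is 0.0 because the tensors are exact copies
```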

## Related Models

See the full [UTR-LM collection](https://huggingface.co/collections/Taykhoom/utr-lm-6a173a96ae7c070c3a84ebb4).

| Model | Pretraining Objective | Notes |
|---|---|---|
| [UTR-LM-MLM](https://huggingface.co/Taykhoom/UTR-LM-MLM) | MLM | Base model |
| **[UTR-LM-MLMSI](https://huggingface.co/Taykhoom/UTR-LM-MLMSI)** | MLM + MFE regression | This model (recommended for TE / EL tasks) |
| [UTR-LM-MLMSS](https://huggingface.co/Taykhoom/UTR-LM-MLMSS) | MLM + secondary structure | — |
| [UTR-LM-MLMSISS](https://huggingface.co/Taykhoom/UTR-LM-MLMSISS) | MLM + MFE + secondary structure | Recommended for MRL tasks |

## Usage

### Embedding generation

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLMSI", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/UTR-LM-MLMSI", trust_remote_code=True)
model.eval()

sequences = ["ATGCATGCATGC", "GCTAGCTAGCTAGCTA"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

# CLS token embedding (position 0) - recommended for sequence-level tasks
cls_emb = out.last_hidden_state[:, 0, :]   # (batch, 128)

# All-token embeddings
token_emb = out.last_hidden_state           # (batch, seq_len, 128)

# Intermediate layer representations (index 0 is the initial embedding)
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer3_emb = out_all.hidden_states[3]       # after layer 3, shape (batch, seq_len, 128)
```

### MLM logits

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTR-LM-MLMSI", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/UTR-LM-MLMSI", trust_remote_code=True)
model.eval()

enc = tokenizer(["ATGC<mask>ATGC"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits   # (1, seq_len, 10)
```

### Fine-tuning

The model follows standard HF conventions and can be fine-tuned with any Trainer-compatible setup. For sequence regression tasks, use the CLS token embedding as input to a prediction head (as done in the original UTR-LM paper).
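
As a rough illustration of the CLS-based setup described above, the sketch below defines a small regression head and runs one backward pass. The class name `ClsRegressionHead`, the two-layer MLP, the dropout rate, and the MSE loss are all assumptions chosen for illustration; they are not taken from the UTR-LM paper or code. A random tensor stands in for `out.last_hidden_state` so the snippet runs without loading the checkpoint.

```python
import torch
from torch import nn


class ClsRegressionHead(nn.Module):
    """Hypothetical head: maps the CLS embedding to one scalar per sequence."""

    def __init__(self, hidden_size: int = 128, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        cls_emb = last_hidden_state[:, 0, :]   # (batch, hidden_size)
        return self.net(cls_emb).squeeze(-1)   # (batch,)


head = ClsRegressionHead()
fake_hidden = torch.randn(4, 20, 128)  # stand-in for out.last_hidden_state
targets = torch.randn(4)               # stand-in for TE / EL labels

preds = head(fake_hidden)
loss = nn.functional.mse_loss(preds, targets)
loss.backward()                        # gradients flow into the head parameters
print(preds.shape, loss.item())
```

In an actual fine-tuning run, `fake_hidden` would be replaced by the encoder output, and the encoder parameters would either be frozen or passed to the optimizer together with the head.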

## Citation

```bibtex
@article{chu2024_utrlm,
  title   = {A 5' {UTR} Language Model for Decoding Untranslated Regions of {mRNA} and Function Predictions},
  author  = {Chu, Yanyi and Yu, Dan and Li, Yupeng and Huang, Kaixuan and Shen, Yue and Cong, Le and Zhang, Jason and Wang, Mengdi},
  journal = {Nature Machine Intelligence},
  volume  = {6},
  number  = {4},
  pages   = {449--460},
  year    = {2024},
  doi     = {10.1038/s42256-024-00823-9}
}
```

## Implementation Notes

The original UTR-LM implementation uses standard scaled dot-product attention. This HF port adds support for `attn_implementation="sdpa"` (PyTorch `F.scaled_dot_product_attention`) and `attn_implementation="flash_attention_2"` (requires `pip install flash-attn --no-build-isolation`), which were not part of the original codebase.
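
Selecting a backend could look like the sketch below. The helpers `attention_kwargs` and `load_model` are hypothetical wrappers around the standard `from_pretrained` call. Accepting `"eager"` is an assumption about this port, and the `bfloat16` choice for FlashAttention-2 is one reasonable option rather than a documented requirement of this repository (FlashAttention-2 kernels need half precision and a CUDA device). The imports are deferred so that nothing is downloaded until `load_model` is actually called.

```python
REPO_ID = "Taykhoom/UTR-LM-MLMSI"

# "eager" is assumed to be accepted by this port; the card only names the other two.
ALLOWED_BACKENDS = ("eager", "sdpa", "flash_attention_2")


def attention_kwargs(implementation: str = "sdpa") -> dict:
    """Build from_pretrained keyword arguments for one attention backend."""
    if implementation not in ALLOWED_BACKENDS:
        raise ValueError(f"unknown attention backend: {implementation!r}")
    return {"trust_remote_code": True, "attn_implementation": implementation}


def load_model(implementation: str = "sdpa"):
    """Load the checkpoint with the requested attention backend."""
    import torch
    from transformers import AutoModel

    kwargs = attention_kwargs(implementation)
    if implementation == "flash_attention_2":
        # FlashAttention-2 runs only in half precision; the model must also
        # be moved to a CUDA device before the forward pass.
        kwargs["torch_dtype"] = torch.bfloat16
    return AutoModel.from_pretrained(REPO_ID, **kwargs)
```

Calling `load_model("sdpa")` should then return the model with PyTorch's fused attention path enabled, while `load_model("flash_attention_2")` additionally requires the `flash-attn` package mentioned above.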

## Credits

Original model and code by Yanyi Chu et al. (Stanford). Source code: [UTR-LM GitHub repository](https://github.com/a96123155/UTR-LM). The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code) and reviewed manually by Taykhoom Dalal.

## License

GPL-3.0, following the original UTR-LM repository.