---
language:
- rna
library_name: transformers
tags:
- RNA
- language-model
license: mit
---
# mRNA-FM
A 12-layer BERT-style transformer pre-trained on 45 million mRNA coding sequences (CDS) using
codon (3-mer) tokenisation.
## Architecture
| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 20 |
| Embedding dimension | 1280 |
| FFN dimension | 5120 |
| Vocabulary size | 73 |
| Positional encoding | Learned |
| Architecture | ESM-1b-style pre-LN Transformer |
| Max sequence length | 1024 tokens including `<cls>` and `<eos>` (1022 codons = 3066 nucleotides) |
Vocabulary: `<cls>` (0), `<pad>` (1), `<eos>` (2), `<unk>` (3), 64 standard RNA codons
(indices 4-67), 4 null-padding tokens (68-71), `<mask>` (72).
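
The index ranges above add up to the vocabulary size of 73, which the following minimal sketch checks. The codon ordering and the names of the null-padding tokens are not stated in this card, so the ordering from `product("ACGU", repeat=3)` and the `<null_i>` names are placeholders for illustration only; the real mapping is whatever the tokenizer's vocabulary defines.

```python
from itertools import product

# Special tokens at the indices listed in this card.
special_head = ["<cls>", "<pad>", "<eos>", "<unk>"]        # indices 0-3

# All 64 RNA codons. ASSUMPTION: this ordering is illustrative, not the real one.
codons = ["".join(c) for c in product("ACGU", repeat=3)]   # indices 4-67

# ASSUMPTION: placeholder names for the 4 null-padding tokens.
null_tokens = [f"<null_{i}>" for i in range(4)]            # indices 68-71

vocab = special_head + codons + null_tokens + ["<mask>"]   # <mask> at index 72
token_to_id = {tok: i for i, tok in enumerate(vocab)}

print(len(vocab), token_to_id["<mask>"])
```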
## Pretraining
- **Objective:** Masked language modelling (codon-level, 15% masking rate)
- **Data:** RefSeq -- 45 million mRNA coding sequences
- **Source checkpoint:** `mRNA-FM_pretrained.pth` from [cuhkaih/rnafm](https://huggingface.co/cuhkaih/rnafm)
### Tokenisation
mRNA-FM uses **codon (3-mer) tokenisation**: the input sequence is split into consecutive
non-overlapping codons (triplets) and each codon is mapped to a single token. Sequence length
should therefore be a **multiple of 3**; trailing nucleotides that do not form a complete codon
are silently discarded by the tokeniser.
Input sequences must use **RNA notation (U, not T)**. Convert before tokenising:
```python
seq = seq.replace("T", "U")
```
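
To make the codon split concrete, here is a small pure-Python sketch of the preprocessing described in this section. `to_codons` is a hypothetical helper written for illustration, not part of the tokenizer; uppercasing the input is an added assumption.

```python
def to_codons(seq: str) -> list[str]:
    """Split a nucleotide sequence into non-overlapping RNA codons."""
    rna = seq.upper().replace("T", "U")   # DNA -> RNA notation (uppercasing is an assumption)
    usable = len(rna) - len(rna) % 3      # ignore an incomplete trailing codon
    return [rna[i:i + 3] for i in range(0, usable, 3)]


print(to_codons("ATGGCTTAA"))  # ['AUG', 'GCU', 'UAA']
print(to_codons("AUGGCUUA"))   # ['AUG', 'GCU'] (trailing 'UA' is dropped)
```

Checking `len(seq) % 3 == 0` before tokenising is a cheap way to catch frame errors that would otherwise go unnoticed.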
## Parity Verification
Hidden-state representations were verified to be identical (max abs diff = 0.00) to those of the
original implementation at all 13 representation levels (embedding output + 12 transformer layers).
Verification was run on GPU (CUDA) with PyTorch 2.7 / transformers 4.57.6. With SDPA attention,
small numerical differences are expected (~3e-4 max diff over 12 layers) and are not a correctness issue.
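
A comparison of this kind can be sketched as follows. This is not the script used for the verification above; `max_abs_diff_per_level` is a hypothetical helper, and the tensors are random stand-ins for the 13 hidden-state levels of two implementations.

```python
import torch


def max_abs_diff_per_level(ref_states, port_states):
    """Largest absolute element-wise difference at each representation level."""
    if len(ref_states) != len(port_states):
        raise ValueError("both implementations must return the same number of levels")
    return [
        (a.float() - b.float()).abs().max().item()
        for a, b in zip(ref_states, port_states)
    ]


# Toy stand-ins: 13 levels (embedding output + 12 layers) of shape (batch, tokens, dim).
torch.manual_seed(0)
ref = [torch.randn(1, 8, 16) for _ in range(13)]
port = [t.clone() for t in ref]
port[-1] = port[-1] + 3e-4   # simulate a small numerical drift in the last layer

diffs = max_abs_diff_per_level(ref, port)
print(diffs)
```

In a real check, the two lists would be the `hidden_states` tuples returned by each implementation for the same input.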
## Related Models
See the full [RNA-FM collection](https://huggingface.co/collections/Taykhoom/rna-fm-6a22c8c778d29e6dd3d437af).
| Model | Training data | Embedding dim | Notes |
|---|---|---|---|
| [RNA-FM](https://huggingface.co/Taykhoom/RNA-FM) | 23.7 M ncRNA | 640 | Character tokenisation |
| **[mRNA-FM](https://huggingface.co/Taykhoom/mRNA-FM)** | 45 M CDS | 1280 | This model |
## Usage
### Embedding generation
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/mRNA-FM", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/mRNA-FM", trust_remote_code=True)
model.eval()

# Sequences must be RNA (U not T) and length divisible by 3 (codons)
sequences = [
    "AUGGGGUGCGAUCAUACCAGCACUAAUGCCCUCCUGGGAAGUCCUCGUGUUGCA",  # 54 nt = 18 codons
    "AUGCUAGCUAGCUAGCUA",  # 18 nt = 6 codons
]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)
cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 1280) -- CLS token
token_emb = out.last_hidden_state         # (batch, n_codons+2, 1280) -- per-codon

# Intermediate layers
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
# hidden_states[0] is the embedding output, so index 6 is the output of layer 6
layer6_emb = out_all.hidden_states[6]
```
### CDS-aware embedding (mRNA sequences)
For mRNA sequences with a CDS track, use `batch_encode_with_cds` to apply T→U conversion,
extract only the coding region, chunk to codon boundaries, and encode — all in one call.
```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/mRNA-FM", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/mRNA-FM", trust_remote_code=True)
model.eval()

# Binary CDS track: 1 at the first nucleotide of each codon in the CDS, 0 elsewhere
sequences = ["ATGCTAGCTAGCTAGCTATGCTAGCTAGCTAGCT"]  # 34 nt
# Example track: 5 non-coding nt, 9 codons (27 nt), 2 non-coding nt
cds = [np.array([0]*5 + [1, 0, 0]*9 + [0]*2)]

enc, chunk_counts = tokenizer.batch_encode_with_cds(
    sequences, cds, return_tensors="pt", padding=True, add_special_tokens=True
)

with torch.no_grad():
    out = model(**enc)

# chunk_counts[i] = number of chunks produced for sequences[i]
# To get one embedding per sequence, mean-pool the non-special tokens of its chunks.
hidden = out.last_hidden_state  # (total_chunks, seq_len, 1280)
```
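
The per-sequence mean-pooling mentioned in the comments above could look like the sketch below. It assumes that the chunks of each sequence occupy consecutive rows in the batch, in input order, and that `<cls>`, `<pad>` and `<eos>` have ids 0, 1 and 2 as in the vocabulary listing; `pool_by_sequence` is a hypothetical helper, shown here on toy tensors rather than real model output.

```python
import torch


def pool_by_sequence(hidden, input_ids, chunk_counts, special_ids=(0, 1, 2)):
    """Mean-pool non-special token embeddings over all chunks of each sequence."""
    keep = torch.ones_like(input_ids, dtype=torch.bool)
    for tok_id in special_ids:
        keep &= (input_ids != tok_id)        # drop <cls>, <pad>, <eos> positions

    pooled, start = [], 0
    for n in chunk_counts:
        n = int(n)
        h = hidden[start:start + n]          # (n, seq_len, dim)
        m = keep[start:start + n]            # (n, seq_len)
        pooled.append(h[m].mean(dim=0))      # average over all kept tokens
        start += n
    return torch.stack(pooled)               # (n_sequences, dim)


# Toy example: 3 chunks belonging to 2 sequences (2 chunks + 1 chunk), dim 4.
input_ids = torch.tensor([[0, 5, 6, 2], [0, 7, 2, 1], [0, 8, 9, 2]])
hidden = torch.arange(3 * 4 * 4, dtype=torch.float32).reshape(3, 4, 4)
seq_emb = pool_by_sequence(hidden, input_ids, [2, 1])
print(seq_emb.shape)  # torch.Size([2, 4])
```

With real outputs, the call would be along the lines of `pool_by_sequence(out.last_hidden_state, enc["input_ids"], chunk_counts)`, assuming `enc` exposes `input_ids` like a standard tokenizer output. A sequence with no coding tokens would produce NaN here and needs separate handling.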
### MLM logits
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/mRNA-FM", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Taykhoom/mRNA-FM", trust_remote_code=True)
model.eval()

enc = tokenizer(["AUG<mask>GCUAUG"], return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits  # (1, n_codons+2, 73)
```
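
To read predictions out of the logits, one option is to select the masked positions and take the top-scoring tokens there. The sketch below runs on synthetic logits so it works without downloading the model; `top_k_at_masks` is a hypothetical helper, and the mask id of 72 is taken from the vocabulary listing in this card.

```python
import torch

MASK_ID = 72  # `<mask>` index according to the vocabulary listing


def top_k_at_masks(logits, input_ids, k=5, mask_id=MASK_ID):
    """Return the k most likely token ids (and probabilities) at each masked position."""
    mask_positions = input_ids == mask_id        # (batch, seq_len) bool
    masked_logits = logits[mask_positions]       # (n_masks, vocab)
    probs = torch.softmax(masked_logits, dim=-1)
    return torch.topk(probs, k=k, dim=-1)        # .values and .indices: (n_masks, k)


# Synthetic example: one sequence, the third token is masked.
input_ids = torch.tensor([[0, 4, MASK_ID, 5, 2]])
logits = torch.zeros(1, 5, 73)
logits[0, 2, 10] = 5.0                           # make token id 10 the clear winner
top = top_k_at_masks(logits, input_ids, k=3)
print(top.indices[0].tolist(), top.values[0].tolist())
```

With the real model, `logits` and `enc["input_ids"]` from the snippet above would be passed in, and the resulting ids can usually be mapped back to codons with the tokenizer's `convert_ids_to_tokens`, assuming the custom tokenizer follows the standard Transformers interface.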
### Fine-tuning
The model follows standard Hugging Face conventions. For sequence-level tasks, use the CLS token
embedding (`out.last_hidden_state[:, 0, :]`) as input to a classification or regression head.
Alternatively, mean-pool over codon positions (excluding CLS, EOS and padding) to aggregate
per-codon embeddings into a single vector.
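
As an illustration of the CLS-based setup, the following sketch defines a small regression head on top of the 1280-dimensional embedding. `ClsRegressionHead`, its layer sizes and its dropout rate are assumptions chosen for the example, not something shipped with this model; a random tensor stands in for `out.last_hidden_state`.

```python
import torch
from torch import nn


class ClsRegressionHead(nn.Module):
    """Hypothetical head: maps the CLS embedding to one scalar per sequence."""

    def __init__(self, hidden_dim: int = 1280, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        cls_emb = last_hidden_state[:, 0, :]   # (batch, hidden_dim)
        return self.net(cls_emb).squeeze(-1)   # (batch,)


head = ClsRegressionHead()
dummy_hidden = torch.randn(2, 20, 1280)        # stand-in for out.last_hidden_state
preds = head(dummy_hidden)
print(preds.shape)  # torch.Size([2])
```

During fine-tuning, the head's output would be compared against labels with a loss such as `nn.MSELoss`, with the backbone either frozen or trained jointly.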
## Implementation Notes
The original implementation uses `F.multi_head_attention_forward` (eager). This HF port adds
`attn_implementation="sdpa"` and `attn_implementation="flash_attention_2"` support, which were
not part of the original codebase.
Each codon token represents exactly one triplet of nucleotides. The tokeniser splits the raw
sequence into non-overlapping codons; any trailing nucleotides that do not form a complete codon
are silently discarded.
## Citation
```bibtex
@article{chen2022_rnafm,
title = {Interpretable {RNA} Foundation Model from Unannotated Data for Highly Accurate {RNA} Structure and Function Predictions},
author = {Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and Shen, Tao and King, Irwin and Li, Yu},
journal = {arXiv preprint arXiv:2204.00300},
year = {2022},
doi = {10.48550/arXiv.2204.00300}
}
```
## Credits
Original model and code by Chen et al. Source: [GitHub](https://github.com/ml4bio/RNA-FM).
The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
and reviewed manually by Taykhoom Dalal.
## License
MIT, following the original repository.