---
language:
- rna
library_name: transformers
tags:
- RNA
- mRNA
- codon
- language-model
license: other
---

# CodonBERT

CodonBERT is a BERT-based RNA language model pretrained with masked language modeling
on codon-level representations of more than 10 million mRNA sequences from mammals,
bacteria, and human viruses. It is designed for predicting mRNA-specific properties
such as translation efficiency and mRNA stability.

## Architecture

| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Intermediate size | 3072 |
| Vocabulary size | 69 (5 special tokens + 64 codons) |
| Positional encoding | Learned absolute |
| Architecture | Standard post-LN BERT Transformer |
| Max sequence length | 1024 tokens (codons) |

### Vocabulary

The tokenizer operates at the codon level. It normalizes raw nucleotide strings and
splits them into codons itself, so no manual pre-splitting is needed (see Usage below).
The 64 codon tokens cover all combinations of {A, U, G, C}^3 in RNA space, i.e. the
61 sense codons and 3 stop codons of the standard genetic code.
Special tokens follow standard BERT convention: `[PAD]=0`, `[UNK]=1`,
`[CLS]=2`, `[SEP]=3`, `[MASK]=4`.
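
As a rough illustration of how the 69-entry vocabulary decomposes, the sketch below rebuilds it from the special-token IDs listed above plus all 64 RNA triplets. The names `SPECIAL_TOKENS` and `build_vocab` are hypothetical, and the codon ordering (`itertools.product` over `AUGC`) is an assumption; the released vocabulary may order codons differently.

```python
from itertools import product

# Special tokens in ID order, as listed above: [PAD]=0 ... [MASK]=4.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(alphabet="AUGC"):
    """Map each token to an ID: 5 special tokens, then all 64 RNA triplets.

    The codon order used here is an assumption, not the released ordering.
    """
    codons = ["".join(triplet) for triplet in product(alphabet, repeat=3)]
    return {token: idx for idx, token in enumerate(SPECIAL_TOKENS + codons)}


vocab = build_vocab()
print(len(vocab))       # 69
print(vocab["[MASK]"])  # 4
```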

## Pretraining

- **Objective:** Masked language modeling (MLM) on codon-level tokens
- **Data:** >10 million mRNA sequences from mammals, bacteria, and human viruses
- **Focus:** Coding sequences (CDS) only
- **Source checkpoint:** `model.safetensors` converted from the original
  [Sanofi-Public/CodonBERT](https://github.com/Sanofi-Public/CodonBERT) release
  (`BertForPreTraining` format)

### Checkpoint selection

There is a single publicly released checkpoint from the original authors. The backbone
weights (`bert.*` prefix) are extracted directly, the MLM head is converted for use with
`AutoModelForMaskedLM`, and the NSP head is discarded.

## Parity Verification

All checks were run on GPU with PyTorch 2.7 / CUDA 12:

- **Hidden states (eager, sdpa):** match the original at all 13 levels, i.e. the embedding output plus 12 layers (max abs diff < 8e-6)
- **MLM logits:** `BertForMaskedLM` logits match the original `BertForPreTraining` (max abs diff < 9e-6)
- **Flash Attention 2:** verified against eager in bf16 at non-padding positions (max abs diff < 0.25, as expected from BF16 rounding accumulated across 12 layers)
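
The figures above are maximum absolute differences, and the Flash Attention 2 comparison is restricted to non-padding positions. The pure-Python sketch below shows that metric on toy values. The function name `max_abs_diff` and the nested-list inputs are illustrative assumptions; the checks reported here were run on PyTorch tensors, and this is not the verification script itself.

```python
def max_abs_diff(reference, candidate, attention_mask):
    """Largest |reference - candidate| over positions where the mask is 1.

    reference, candidate: per-token feature vectors, shape [seq_len][hidden].
    attention_mask: 1 for real tokens, 0 for padding.
    """
    worst = 0.0
    for ref_vec, cand_vec, keep in zip(reference, candidate, attention_mask):
        if not keep:
            continue  # padding positions are excluded from the comparison
        for r, c in zip(ref_vec, cand_vec):
            worst = max(worst, abs(r - c))
    return worst


ref = [[0.5, -1.0], [2.0, 0.25], [0.0, 0.0]]
new = [[0.5, -1.0], [2.0, 0.5], [9.0, 9.0]]  # last row is a padding position
print(max_abs_diff(ref, new, [1, 1, 0]))  # 0.25
```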

## Related Models

See the full [CodonBERT collection](https://huggingface.co/collections/Taykhoom/codonbert-6a2215ba01c589ad8eac8a2d).

| Model | Notes |
|---|---|
| **[CodonBERT](https://huggingface.co/Taykhoom/CodonBERT)** | This model |

## Usage

CodonBERT operates on CDS sequences. The tokenizer handles T->U conversion and codon
splitting automatically — pass raw nucleotide strings directly.
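
Conceptually, the normalization and splitting step amounts to something like the sketch below. The function name `to_codons` is hypothetical, and dropping a trailing partial codon is an assumption; the shipped tokenizer may treat leftover nucleotides differently.

```python
def to_codons(sequence):
    """Normalize a CDS string (uppercase, T->U) and split it into codon 3-mers.

    Sketch only: a trailing partial codon is dropped here, which is an
    assumption about behavior, not a description of the real tokenizer.
    """
    rna = sequence.strip().upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3  # length rounded down to a multiple of 3
    return [rna[i:i + 3] for i in range(0, usable, 3)]


print(to_codons("atgaaaggccctTAA"))  # ['AUG', 'AAA', 'GGC', 'CCU', 'UAA']
```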

### Embedding generation

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
model.eval()

# Raw CDS nucleotide strings — T or U both accepted
cds_sequences = ["ATGAAAGGCCCTTAA", "ATGTTTGGG"]

enc = tokenizer(cds_sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    # One forward pass returns the final layer and all intermediate layers
    out = model(**enc, output_hidden_states=True)

cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 768), [CLS] token

# Mean over non-padding positions
mask = enc["attention_mask"].unsqueeze(-1).to(out.last_hidden_state.dtype)  # (batch, seq_len, 1)
mean_emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)  # (batch, 768)

# Intermediate layers: hidden_states[0] is the embedding output, [1]..[12] are the layers
layer6_emb = out.hidden_states[6]  # (batch, seq_len, 768)
```

### CDS-aware encoding (full mRNA input)

For full mRNA sequences where the CDS region must be extracted first:

```python
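import numpy as np
import torch

# Reuses `tokenizer` and `model` from the previous example.
# Toy input for illustration: 3-nt 5' UTR, 4-codon CDS, 3-nt 3' UTR
mrna_sequences = ["GGCAUGAAAGGCUAAGCC"]

# cds: binary array with 1 at the first nucleotide of each codon (dtype is an assumption)
cds = np.zeros(len(mrna_sequences[0]), dtype=int)
cds[[3, 6, 9, 12]] = 1
cds_tracks = [cds]  # list of numpy arrays, one per sequence

enc, chunk_counts = tokenizer.batch_encode_with_cds(
    mrna_sequences,
    cds_tracks,
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    out = model(**enc)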
```
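
To make the idea of CDS extraction and codon-boundary-aligned chunking concrete, here is a minimal pure-Python sketch. The helper names `extract_codons` and `chunk_codons`, the `max_codons` parameter, and the use of plain lists instead of numpy arrays are all assumptions for illustration; this is not the implementation behind `batch_encode_with_cds`.

```python
def extract_codons(mrna, cds_track):
    """Collect one codon per position flagged 1 in the CDS track."""
    rna = mrna.upper().replace("T", "U")
    starts = [i for i, flag in enumerate(cds_track) if flag == 1]
    # Skip a flagged start that would run past the end of the sequence.
    return [rna[i:i + 3] for i in starts if i + 3 <= len(rna)]


def chunk_codons(codons, max_codons):
    """Split a codon list into consecutive chunks of at most max_codons."""
    return [codons[i:i + max_codons] for i in range(0, len(codons), max_codons)]


mrna = "GGCATGAAAGGCTAAGCC"  # 3-nt 5' UTR, 4 codons, 3-nt 3' UTR
track = [0] * len(mrna)
for start in (3, 6, 9, 12):
    track[start] = 1

codons = extract_codons(mrna, track)
chunks = chunk_codons(codons, max_codons=3)
print(codons)       # ['AUG', 'AAA', 'GGC', 'UAA']
print(len(chunks))  # 2
```

Because chunks are cut from the codon list rather than from the nucleotide string, every chunk starts on a codon boundary. In this sketch `len(chunks)` is the kind of per-sequence value that `chunk_counts` plausibly reports, though that reading is an assumption.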

### SDPA and Flash Attention 2

```python
from transformers import AutoModel

model_sdpa = AutoModel.from_pretrained(
    "Taykhoom/CodonBERT", trust_remote_code=True, attn_implementation="sdpa"
)
# Flash Attention 2 needs the flash-attn package and a CUDA GPU
model_flash = AutoModel.from_pretrained(
    "Taykhoom/CodonBERT", trust_remote_code=True, attn_implementation="flash_attention_2"
)
```

### MLM logits

```python
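import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
model_mlm = AutoModelForMaskedLM.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
model_mlm.eval()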

seq = "AUG [MASK] GGG"
enc = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    logits = model_mlm(**enc).logits  # (1, seq_len, 69)
```

The MLM head weights are fully preserved: the prediction transform (dense + GELU +
LayerNorm), the decoder weight (tied to the word embedding in the original, stored
explicitly here), and the output bias are all converted from the original checkpoint.
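
One way to read the logits is to take a softmax at each `[MASK]` position and rank the token IDs. The sketch below does this on plain Python lists. The helper name `top_k_at_masks` and the toy 6-token vocabulary are assumptions for illustration; mapping IDs back to codons depends on the released vocabulary and is not shown.

```python
import math

MASK_ID = 4  # [MASK] ID from the Vocabulary section


def top_k_at_masks(input_ids, logits, k=3):
    """For each [MASK] position, return the k most probable (token_id, prob) pairs.

    input_ids: token IDs for one sequence, length seq_len.
    logits: per-position scores, shape [seq_len][vocab_size].
    """
    results = {}
    for pos, token_id in enumerate(input_ids):
        if token_id != MASK_ID:
            continue
        row = logits[pos]
        peak = max(row)  # subtract the max for a numerically stable softmax
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        ranked = sorted(range(len(row)), key=lambda i: exps[i], reverse=True)
        results[pos] = [(i, exps[i] / total) for i in ranked[:k]]
    return results


# Toy example with a 6-token vocabulary (the real model has 69 tokens)
input_ids = [2, 5, 4, 5, 3]
logits = [[0.0] * 6 for _ in input_ids]
logits[2][5] = 2.0
logits[2][1] = 1.0
predictions = top_k_at_masks(input_ids, logits, k=2)
print([token_id for token_id, _ in predictions[2]])  # [5, 1]
```

With the real model, the inputs could be obtained as `enc["input_ids"][0].tolist()` and `logits[0].tolist()` from the example above.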

### Fine-tuning

Standard HF conventions apply. For sequence-level tasks, use the CLS token embedding
as input to a classification/regression head.

## Implementation Notes

Two key differences from the original CodonBERT release:

**1. Integrated codon tokenization.** The original repository requires users to
manually pre-process sequences into space-separated codons before passing them to
the tokenizer. This port ships `CodonBertTokenizer`, a `BertTokenizer` subclass
whose `_tokenize` method automatically normalizes sequences (T->U, uppercase) and
splits them into codon 3-mers. Users can pass raw nucleotide strings directly:
`tokenizer("AUGAAAGGG")` works without any pre-processing. A
`batch_encode_with_cds(sequences, cds_tracks)` method handles full mRNA input with
CDS extraction and codon-boundary-aligned chunking, matching the mRNABench
preprocessing exactly.

**2. SDPA and Flash Attention 2 support.** The original release used the standard
HuggingFace `BertModel`, which does not support `attn_implementation="sdpa"` or
`attn_implementation="flash_attention_2"`. This port inherits from
[Taykhoom/BERT-updated](https://huggingface.co/Taykhoom/BERT-updated), a minimal
BERT re-implementation with all three backends (`eager`, `sdpa`,
`flash_attention_2`). Parity against the original eager implementation is verified
at every layer.

## Citation

```bibtex
@article{li2024_codonbert,
  title   = {{CodonBERT} large language model for {mRNA} vaccines},
  author  = {Li, Sizhen and Moayedpour, Saeed and Li, Ruijiang and Bailey, Michael and Riahi, Saleh and Kogler-Anele, Lorenzo and Miladi, Milad and Miner, Jacob and Pertuy, Fabien and Zheng, Dinghai and Wang, Jun and Balsubramani, Akshay and Tran, Khang and Zacharia, Minnie and Wu, Monica and Gu, Xiaobo and Clinton, Ryan and Asquith, Carla and Skaleski, Joseph and Boeglin, Lianne and Chivukula, Sudha and Dias, Anusha and Strugnell, Tod and Ulloa Montoya, Fernando and Agarwal, Vikram and Bar-Joseph, Ziv and Jager, Sven},
  journal = {Genome Research},
  volume  = {34},
  number  = {7},
  pages   = {1027--1035},
  year    = {2024},
  doi     = {10.1101/gr.278870.123}
}
```

## Credits

Original model and code by Li et al. Source: [GitHub](https://github.com/Sanofi-Public/CodonBERT).
The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
and reviewed manually by Taykhoom Dalal.

## License

Academic/non-commercial use only, following the original repository license:

Permission is hereby granted, free of charge, for academic research purposes only
and for non-commercial use only, to any person from an academic research or non-profit
organization obtaining a copy of these models, software, datasets and/or algorithms.
For purposes of this notice, "non-commercial use" excludes uses foreseeably resulting
in a commercial benefit or monetary gain. All other rights are reserved.