Commit ecabe7a (verified), committed by Taykhoom
1 Parent(s): 8a60514

Upload folder using huggingface_hub

Files changed (6)
  1. README.md +193 -0
  2. config.json +25 -0
  3. model.safetensors +3 -0
  4. special_tokens_map.json +7 -0
  5. tokenizer_config.json +55 -0
  6. vocab.txt +69 -0
README.md ADDED
@@ -0,0 +1,193 @@
+ ---
+ language:
+ - rna
+ library_name: transformers
+ tags:
+ - RNA
+ - mRNA
+ - codon
+ - language-model
+ license: other
+ ---
+
+ # CodonBERT
+
+ BERT-based RNA language model pretrained on codon-level representations of more than
+ 10 million mRNA sequences from mammals, bacteria, and human viruses using masked language
+ modeling. Designed for predicting mRNA-specific properties such as translation efficiency
+ and mRNA stability.
+
+ ## Architecture
+
+ | Parameter | Value |
+ |---|---|
+ | Layers | 12 |
+ | Attention heads | 12 |
+ | Embedding dimension | 768 |
+ | Intermediate size | 3072 |
+ | Vocabulary size | 69 (5 special tokens + 64 codons) |
+ | Positional encoding | Learned absolute |
+ | Architecture | Standard post-LN BERT Transformer |
+ | Max sequence length | 1024 tokens (codons plus special tokens) |
+
+ ### Vocabulary
+
+ The tokenizer operates at the codon level. Sequences must be pre-split into
+ space-separated codons before being passed to the tokenizer (see Usage below).
+ The 64 codon tokens cover every triplet over {A, U, G, C} in RNA space, including
+ the three stop codons (UAA, UAG, UGA). Special tokens use the standard BERT names
+ and occupy the first five IDs: `[PAD]=0`, `[UNK]=1`, `[CLS]=2`, `[SEP]=3`, `[MASK]=4`.
+
+ ## Pretraining
+
+ - **Objective:** Masked language modeling (MLM) on codon-level tokens
+ - **Data:** >10 million mRNA sequences from mammals, bacteria, and human viruses
+ - **Focus:** Coding sequences (CDS) only
+ - **Source checkpoint:** `model.safetensors` converted from the original
+   [Sanofi-Public/CodonBERT](https://github.com/Sanofi-Public/CodonBERT) release
+   (`BertForPreTraining` format)
+
+ ### Checkpoint selection
+
+ There is a single publicly released checkpoint from the original authors. The backbone
+ weights (`bert.*` prefix) are extracted directly; the MLM and NSP heads are discarded.
+
+ ## Parity Verification
+
+ Hidden-state representations match the original implementation (max abs diff < 8e-6)
+ at all 13 representation levels (embedding + 12 transformer layers). Parity was
+ verified with the eager and sdpa backends on GPU with PyTorch 2.7 / CUDA 12.
+
+ ## Related Models
+
+ See the full [CodonBERT collection](https://huggingface.co/collections/Taykhoom/codonbert-TODO).
+
+ | Model | Notes |
+ |---|---|
+ | **[CodonBERT](https://huggingface.co/Taykhoom/CodonBERT)** | This model |
+
+ ## Usage
+
+ CodonBERT operates on CDS sequences. Input nucleotide sequences must be:
+ 1. In RNA space (U, not T)
+ 2. A coding region (CDS) whose length is a multiple of 3 nucleotides
+ 3. Converted to space-separated codons before tokenization
+
+ ### Embedding generation
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+
+
+ def nt_to_codons(seq: str) -> str:
+     """Convert a nucleotide string to space-separated RNA codons."""
+     seq = seq.upper().replace("T", "U")
+     n = len(seq) - len(seq) % 3  # trailing nucleotides beyond a full codon are dropped
+     return " ".join(seq[i:i + 3] for i in range(0, n, 3))
+
+
+ tokenizer = AutoTokenizer.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
+ model = AutoModel.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
+ model.eval()
+
+ cds_sequences = ["AUGAAAGGGCCCUAA", "AUGUUUGGG"]
+ codon_sequences = [nt_to_codons(s) for s in cds_sequences]
+
+ enc = tokenizer(codon_sequences, return_tensors="pt", padding=True)
+
+ with torch.no_grad():
+     out = model(**enc)
+
+ cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 768) -- CLS token
+ mean_emb = (out.last_hidden_state * enc["attention_mask"].unsqueeze(-1)).sum(1) / \
+     enc["attention_mask"].sum(1, keepdim=True)  # mean over non-padding
+
+ # Intermediate layers
+ with torch.no_grad():
+     out_all = model(**enc, output_hidden_states=True)
+ layer6_emb = out_all.hidden_states[6]  # (batch, seq_len, 768)
+ ```
+
+ ### SDPA and Flash Attention 2
+
+ ```python
+ model_sdpa = AutoModel.from_pretrained(
+     "Taykhoom/CodonBERT", trust_remote_code=True, attn_implementation="sdpa"
+ )
+ model_flash = AutoModel.from_pretrained(
+     "Taykhoom/CodonBERT", trust_remote_code=True, attn_implementation="flash_attention_2"
+ )
+ ```
+
+ ### MLM logits
+
+ ```python
+ from transformers import AutoModelForMaskedLM
+
+ model_mlm = AutoModelForMaskedLM.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
+ model_mlm.eval()
+
+ # Reuses `torch` and `tokenizer` from the embedding example above
+ seq = "AUG [MASK] GGG"
+ enc = tokenizer(seq, return_tensors="pt")
+ with torch.no_grad():
+     logits = model_mlm(**enc).logits  # (1, seq_len, 69)
+ ```
+
+ Note: the MLM head (`cls`) is randomly re-initialized in this port, so these logits are
+ not meaningful until the head is retrained. The backbone weights are exact; only tasks
+ that rely on MLM predictions require retraining the head.
+
+ ### Fine-tuning
+
+ Standard HF conventions apply. For sequence-level tasks, use the CLS token embedding
+ as input to a classification/regression head.
+
+ ## Implementation Notes
+
+ Two key differences from the original CodonBERT release:
+
+ **1. Integrated codon tokenization.** The original repository requires users to
+ manually pre-process sequences into space-separated codons before tokenizing. This
+ port ships the same `BertTokenizer`-based tokenizer with a corrected
+ `model_max_length` (1024, matching the model's positional embedding table) and
+ `do_basic_tokenize=true` so that whitespace-split codon strings are correctly
+ mapped to codon IDs. Users still need to convert nucleotide sequences to
+ space-separated codons (see `nt_to_codons` above), but the tokenizer is
+ self-contained and directly loadable via `AutoTokenizer`.
+
+ **2. SDPA and Flash Attention 2 support.** The original release used the standard
+ HuggingFace `BertModel`, which does not support `attn_implementation="sdpa"` or
+ `attn_implementation="flash_attention_2"`. This port inherits from
+ [Taykhoom/BERT-updated](https://huggingface.co/Taykhoom/BERT-updated), a minimal
+ BERT re-implementation with all three backends (`eager`, `sdpa`,
+ `flash_attention_2`). Parity against the original eager implementation is verified
+ at every layer.
+
+ ## Citation
+
+ ```bibtex
+ @article{li2024_codonbert,
+   title = {{CodonBERT} large language model for {mRNA} vaccines},
+   author = {Li, Sizhen and Moayedpour, Saeed and Li, Ruijiang and Bailey, Michael and Riahi, Saleh and Kogler-Anele, Lorenzo and Miladi, Milad and Miner, Jacob and Pertuy, Fabien and Zheng, Dinghai and Wang, Jun and Balsubramani, Akshay and Tran, Khang and Zacharia, Minnie and Wu, Monica and Gu, Xiaobo and Clinton, Ryan and Asquith, Carla and Skaleski, Joseph and Boeglin, Lianne and Chivukula, Sudha and Dias, Anusha and Strugnell, Tod and Ulloa Montoya, Fernando and Agarwal, Vikram and Bar-Joseph, Ziv and Jager, Sven},
+   journal = {Genome Research},
+   volume = {34},
+   number = {7},
+   pages = {1027--1035},
+   year = {2024},
+   doi = {10.1101/gr.278870.123}
+ }
+ ```
+
+ ## Credits
+
+ Original model and code by Li et al. Source: [GitHub](https://github.com/Sanofi-Public/CodonBERT).
+ The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
+ and reviewed manually by Taykhoom Dalal.
+
+ ## License
+
+ Academic/non-commercial use only, following the original repository license:
+
+ Permission is hereby granted, free of charge, for academic research purposes only
+ and for non-commercial use only, to any person from an academic research or non-profit
+ organization obtaining a copy of these models, software, datasets and/or algorithms.
+ For purposes of this notice, "non-commercial use" excludes uses foreseeably resulting
+ in a commercial benefit or monetary gain. All other rights are reserved.
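
The three input requirements listed in the README's Usage section can be checked before tokenization. The following is a minimal sketch and not part of the commit: `validate_cds`, `MAX_TOKENS`, and `MAX_CODONS` are hypothetical names, and the 1022-codon budget assumes the tokenizer adds one `[CLS]` and one `[SEP]` token within the 1024-token `model_max_length`. Unlike `nt_to_codons` in the README, which silently drops trailing nucleotides, this version raises on malformed input.

```python
MAX_TOKENS = 1024            # model_max_length from tokenizer_config.json
MAX_CODONS = MAX_TOKENS - 2  # assumption: one [CLS] and one [SEP] are added


def validate_cds(seq: str) -> str:
    """Return space-separated RNA codons, or raise ValueError on bad input."""
    rna = seq.strip().upper().replace("T", "U")  # DNA -> RNA space
    bad = sorted(set(rna) - set("AUGC"))
    if bad:
        raise ValueError(f"non-AUGC characters: {bad}")
    if not rna or len(rna) % 3 != 0:
        raise ValueError(f"length {len(rna)} is not a positive multiple of 3")
    codons = [rna[i:i + 3] for i in range(0, len(rna), 3)]
    if len(codons) > MAX_CODONS:
        raise ValueError(f"{len(codons)} codons exceed the {MAX_CODONS}-codon budget")
    return " ".join(codons)


print(validate_cds("atgaaagggccctaa"))  # AUG AAA GGG CCC UAA
```

Whether to raise or to truncate over-long sequences is a design choice; raising is used here only so that length problems are visible rather than silently altering the input.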
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "architectures": [
+     "BertModel"
+   ],
+   "auto_map": {
+     "AutoConfig": "Taykhoom/BERT-updated--configuration_bert_updated.BertUpdatedConfig",
+     "AutoModel": "Taykhoom/BERT-updated--modeling_bert.BertModel",
+     "AutoModelForMaskedLM": "Taykhoom/BERT-updated--modeling_bert.BertForMaskedLM"
+   },
+   "attention_probs_dropout_prob": 0.1,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 1024,
+   "model_type": "bert_updated",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "type_vocab_size": 2,
+   "vocab_size": 69,
+   "transformers_version": "4.57.6"
+ }
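
As a rough consistency check, the values in `config.json` can be used to estimate the parameter count and compare it with the size recorded for `model.safetensors`. This sketch assumes the standard BERT backbone layout (embedding tables with LayerNorm, 12 encoder layers, and a pooler) stored as float32. The helper name `bert_backbone_params` is hypothetical, and the actual layer layout of `Taykhoom/BERT-updated` has not been checked here.

```python
def bert_backbone_params(vocab, hidden, layers, intermediate, max_pos, type_vocab):
    """Parameter count of a standard BERT backbone (assumed layout, with pooler)."""
    # Word, position, and token-type embedding tables plus one LayerNorm
    embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden
    # Q, K, V, and output projections (weights + biases) plus one LayerNorm
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Two feed-forward projections (weights + biases) plus one LayerNorm
    ffn = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden) + 2 * hidden
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + ffn) + pooler


n_params = bert_backbone_params(
    vocab=69, hidden=768, layers=12, intermediate=3072, max_pos=1024, type_vocab=2
)
weight_bytes = n_params * 4       # assumption: float32 weights
file_bytes = 345972416            # size recorded in the model.safetensors LFS pointer
print(n_params)                   # 86487552
print(file_bytes - weight_bytes)  # 22208
```

Under these assumptions the backbone has about 86.5 million parameters, and the roughly 22 KB left over is plausibly the safetensors header. That last point is an inference from the arithmetic, not something verified against the file.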
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:203e1dbf3aa7b7c038b25998b8fed977245361e85ded0a08184d80d8eb809898
+ size 345972416
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": false,
+   "mask_token": "[MASK]",
+   "model_max_length": 1024,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": false,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
@@ -0,0 +1,69 @@
+ [PAD]
+ [UNK]
+ [CLS]
+ [SEP]
+ [MASK]
+ AAA
+ AAU
+ AAG
+ AAC
+ AUA
+ AUU
+ AUG
+ AUC
+ AGA
+ AGU
+ AGG
+ AGC
+ ACA
+ ACU
+ ACG
+ ACC
+ UAA
+ UAU
+ UAG
+ UAC
+ UUA
+ UUU
+ UUG
+ UUC
+ UGA
+ UGU
+ UGG
+ UGC
+ UCA
+ UCU
+ UCG
+ UCC
+ GAA
+ GAU
+ GAG
+ GAC
+ GUA
+ GUU
+ GUG
+ GUC
+ GGA
+ GGU
+ GGG
+ GGC
+ GCA
+ GCU
+ GCG
+ GCC
+ CAA
+ CAU
+ CAG
+ CAC
+ CUA
+ CUU
+ CUG
+ CUC
+ CGA
+ CGU
+ CGG
+ CGC
+ CCA
+ CCU
+ CCG
+ CCC
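
The ordering in `vocab.txt` is regular: five special tokens followed by every triplet over the bases taken in the order A, U, G, C. A minimal sketch that reproduces the table and maps a codon string to IDs is shown below. `build_vocab` and `encode` are hypothetical helpers, not the `BertTokenizer` implementation, and the `[CLS] ... [SEP]` wrapping is assumed from the standard BERT single-sequence template.

```python
from itertools import product

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab() -> dict:
    """Token -> ID table in the same order as vocab.txt (IDs are 0-based)."""
    codons = ["".join(t) for t in product("AUGC", repeat=3)]
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + codons)}


def encode(codon_string: str, vocab: dict) -> list:
    """Whitespace-split, map to IDs, and wrap in [CLS] ... [SEP] (assumed template)."""
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in codon_string.split()]
    return [vocab["[CLS]"], *ids, vocab["[SEP]"]]


vocab = build_vocab()
print(len(vocab))                    # 69
print(encode("AUG AAA GGG", vocab))  # [2, 11, 5, 47, 3]
```

Because IDs are 0-based, a token's ID is its line number in `vocab.txt` minus one; for example, `AUG` sits on line 12 and maps to ID 11.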