Upload folder using huggingface_hub

Files changed (6)

README.md +193 -0
config.json +25 -0
model.safetensors +3 -0
special_tokens_map.json +7 -0
tokenizer_config.json +55 -0
vocab.txt +69 -0

README.md ADDED

	@@ -0,0 +1,193 @@

+---
+language:
+- rna
+library_name: transformers
+tags:
+- RNA
+- mRNA
+- codon
+- language-model
+license: other
+---
+# CodonBERT
+CodonBERT is a BERT-based RNA language model pretrained with masked language modeling
+on codon-level representations of more than 10 million mRNA sequences from mammals,
+bacteria, and human viruses. It is designed for predicting mRNA-specific properties such
+as translation efficiency and mRNA stability.
+## Architecture
+| Parameter | Value |
+|---|---|
+| Layers | 12 |
+| Attention heads | 12 |
+| Embedding dimension | 768 |
+| Intermediate size | 3072 |
+| Vocabulary size | 69 (5 special tokens + 64 codons: 61 sense + 3 stop) |
+| Positional encoding | Learned absolute |
+| Architecture | Standard post-LN BERT Transformer |
+| Max sequence length | 1024 tokens (codons) |
+### Vocabulary
+The tokenizer operates at the codon level. Sequences must be pre-split into
+space-separated codons before passing to the tokenizer (see Usage below).
+The 64 codons cover every combination in {A, U, G, C}^3 in RNA space: 61 sense codons
+plus the three stop codons (UAA, UAG, UGA). The special tokens are the usual BERT set
+and occupy the first five IDs: `[PAD]=0`, `[UNK]=1`, `[CLS]=2`, `[SEP]=3`, `[MASK]=4`.
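The layout of `vocab.txt` in this upload can be reproduced programmatically. The sketch below is an illustration, not the tokenizer's actual implementation: it assumes the five special tokens come first and the codons follow in `itertools.product` order over the bases A, U, G, C, which matches the listing in this commit. `build_vocab` and `encode` are hypothetical helper names.

```python
from itertools import product

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
BASES = "AUGC"  # base order as it appears in vocab.txt
STOP_CODONS = {"UAA", "UAG", "UGA"}


def build_vocab() -> dict[str, int]:
    """Map each token to its ID: special tokens first, then all 64 codons."""
    codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
    return {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS + codons)}


def encode(codon_string: str, vocab: dict[str, int]) -> list[int]:
    """Whitespace-split a codon string and wrap it in [CLS] ... [SEP]."""
    unk = vocab["[UNK]"]
    ids = [vocab.get(tok, unk) for tok in codon_string.split()]
    return [vocab["[CLS]"], *ids, vocab["[SEP]"]]


vocab = build_vocab()
sense_codons = [t for t in vocab if len(t) == 3 and t not in STOP_CODONS]
print(len(vocab), len(sense_codons))  # 69 61
print(encode("AUG AAA GGG", vocab))   # [2, 11, 5, 47, 3]
```

Any token outside the 64 codons, such as a triplet containing `N`, falls back to `[UNK]` (ID 1) in this sketch.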
+## Pretraining
+- **Objective:** Masked language modeling (MLM) on codon-level tokens
+- **Data:** >10 million mRNA sequences from mammals, bacteria, and human viruses
+- **Focus:** Coding sequences (CDS) only
+- **Source checkpoint:** `model.safetensors` converted from the original
+  [Sanofi-Public/CodonBERT](https://github.com/Sanofi-Public/CodonBERT) release
+  (`BertForPreTraining` format)
+### Checkpoint selection
+There is a single publicly released checkpoint from the original authors. The backbone
+weights (`bert.*` prefix) are extracted directly; the MLM and NSP heads are discarded.
+## Parity Verification
+Hidden-state representations match the original implementation to within a maximum
+absolute difference of 8e-6 at all 13 representation levels (embedding output + 12
+transformer layers). Verified with the eager and sdpa backends on GPU with PyTorch 2.7 / CUDA 12.
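A check of this kind can be sketched as follows. This is not the script used for the verification above; it is a framework-agnostic illustration that works on nested Python lists (for example, tensors converted with `.tolist()`), and `max_abs_diff` and `check_parity` are hypothetical names.

```python
def max_abs_diff(a, b) -> float:
    """Largest elementwise |a - b| over two equally shaped nested lists."""
    if isinstance(a, (int, float)):
        return abs(a - b)
    if len(a) != len(b):
        raise ValueError("shape mismatch between reference and port")
    return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)


def check_parity(ref_layers, port_layers, tol=8e-6):
    """Compare hidden states layer by layer; return (all_within_tol, per_layer_diffs)."""
    if len(ref_layers) != len(port_layers):
        raise ValueError("different number of representation levels")
    diffs = [max_abs_diff(r, p) for r, p in zip(ref_layers, port_layers)]
    return all(d < tol for d in diffs), diffs


# Toy example: one representation level with a tiny numerical discrepancy
ref = [[[0.10, 0.20], [0.30, 0.40]]]
port = [[[0.10, 0.20], [0.30, 0.400001]]]
ok, diffs = check_parity(ref, port)
print(ok)  # True
```

With PyTorch tensors, the per-layer difference would normally be computed directly as `(a - b).abs().max()` instead of recursing over lists.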
+## Related Models
+See the full [CodonBERT collection](https://huggingface.co/collections/Taykhoom/codonbert-TODO).
+| Model | Notes |
+|---|---|
+| **[CodonBERT](https://huggingface.co/Taykhoom/CodonBERT)** | This model |
+## Usage
+CodonBERT operates on CDS sequences. Input nucleotide sequences must be:
+1. In RNA space (U, not T)
+2. A coding region (CDS) that is a multiple of 3 nucleotides
+3. Pre-converted to space-separated codons before tokenization
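The first two requirements can be checked before tokenization. The helper below is a hypothetical pre-flight check, not part of this repository. Its 1022-codon limit assumes the tokenizer adds `[CLS]` and `[SEP]` to every sequence, leaving 1024 - 2 positions for codons.

```python
MAX_TOKENS = 1024            # model_max_length / max_position_embeddings
MAX_CODONS = MAX_TOKENS - 2  # assumption: [CLS] and [SEP] take two positions


def validate_cds(seq: str) -> list[str]:
    """Return the problems found in a CDS; an empty list means it looks usable."""
    problems = []
    s = seq.upper()
    if "T" in s:
        problems.append("contains T: convert to RNA space (T -> U)")
    unexpected = sorted(set(s) - set("AUGCT"))
    if unexpected:
        problems.append(f"unexpected characters: {''.join(unexpected)}")
    if len(s) % 3 != 0:
        problems.append(f"length {len(s)} is not a multiple of 3")
    if len(s) // 3 > MAX_CODONS:
        problems.append(f"{len(s) // 3} codons exceeds the {MAX_CODONS}-codon limit")
    return problems


print(validate_cds("AUGAAAGGGCCCUAA"))  # []
print(validate_cds("ATGAAN"))           # two problems: T present, unexpected N
```

Running a check like this first makes length and alphabet problems visible instead of letting them surface later as truncated inputs or `[UNK]` tokens.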
+### Embedding generation
+```python
+import torch
+from transformers import AutoTokenizer, AutoModel
+def nt_to_codons(seq: str) -> str:
+    """Convert a nucleotide string to space-separated RNA codons."""
+    seq = seq.upper().replace("T", "U")
+    n = len(seq) - len(seq) % 3  # drop any trailing partial codon
+    return " ".join(seq[i:i + 3] for i in range(0, n, 3))
+tokenizer = AutoTokenizer.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
+model = AutoModel.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
+model.eval()
+cds_sequences = ["AUGAAAGGGCCCUAA", "AUGUUUGGG"]
+codon_sequences = [nt_to_codons(s) for s in cds_sequences]
+enc = tokenizer(codon_sequences, return_tensors="pt", padding=True)
+# One forward pass without gradients; also return every layer's hidden states
+with torch.no_grad():
+    out = model(**enc, output_hidden_states=True)
+cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 768) -- CLS token
+# Mean over non-padding positions (these include [CLS] and [SEP])
+mask = enc["attention_mask"].unsqueeze(-1).to(out.last_hidden_state.dtype)
+mean_emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)  # (batch, 768)
+# Intermediate layers: hidden_states[0] is the embedding output, [1..12] the layers
+layer6_emb = out.hidden_states[6]  # (batch, seq_len, 768)
+```
+### SDPA and Flash Attention 2
+```python
+model_sdpa = AutoModel.from_pretrained(
+    "Taykhoom/CodonBERT", trust_remote_code=True, attn_implementation="sdpa"
+)
+model_flash = AutoModel.from_pretrained(
+    "Taykhoom/CodonBERT", trust_remote_code=True, attn_implementation="flash_attention_2"
+)
+```
+### MLM logits
+```python
+from transformers import AutoModelForMaskedLM
+model_mlm = AutoModelForMaskedLM.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
+model_mlm.eval()
+seq = "AUG [MASK] GGG"
+enc = tokenizer(seq, return_tensors="pt")
+with torch.no_grad():
+    logits = model_mlm(**enc).logits  # (1, seq_len, 69)
+```
+Note: the MLM head (`cls`) is randomly re-initialized in this port, so the logits above
+are not meaningful until the head is re-trained. The backbone weights are exact; only
+tasks that need MLM predictions require re-training the head.
+### Fine-tuning
+Standard HF conventions apply. For sequence-level tasks, use the CLS token embedding
+as input to a classification/regression head.
+## Implementation Notes
+Two key differences from the original CodonBERT release:
+**1. Self-contained codon tokenizer.** As in the original repository, users must
+convert nucleotide sequences to space-separated codons before tokenizing (see
+`nt_to_codons` above). This port ships the same `BertTokenizer`-based tokenizer
+with a corrected `model_max_length` (1024, matching the model's positional
+embedding table) and `do_basic_tokenize=true`, so that whitespace-separated codon
+strings map correctly to codon IDs. The tokenizer is self-contained and loads
+directly via `AutoTokenizer`.
+**2. SDPA and Flash Attention 2 support.** The original release used the standard
+HuggingFace `BertModel`, which does not support `attn_implementation="sdpa"` or
+`attn_implementation="flash_attention_2"`. This port inherits from
+[Taykhoom/BERT-updated](https://huggingface.co/Taykhoom/BERT-updated), a minimal
+BERT re-implementation with all three backends (`eager`, `sdpa`,
+`flash_attention_2`). Parity against the original eager implementation is verified
+at every layer.
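The agreement between tokenizer and model limits described in point 1 can be checked mechanically. The sketch below uses values copied inline from `config.json` and `tokenizer_config.json` in this commit rather than reading the files, and `check_consistency` is a hypothetical helper.

```python
import json

# Values copied from config.json and tokenizer_config.json in this commit
CONFIG = json.loads(
    '{"hidden_size": 768, "num_attention_heads": 12,'
    ' "max_position_embeddings": 1024, "vocab_size": 69}'
)
TOKENIZER_CONFIG = json.loads('{"model_max_length": 1024}')
VOCAB_LINES = 69  # number of entries in vocab.txt


def check_consistency(config: dict, tokenizer_config: dict, vocab_lines: int) -> dict:
    """Return named consistency checks mapped to True/False."""
    return {
        "max_length_matches_positions":
            tokenizer_config["model_max_length"] == config["max_position_embeddings"],
        "vocab_size_matches_vocab_file": config["vocab_size"] == vocab_lines,
        "hidden_divisible_by_heads":
            config["hidden_size"] % config["num_attention_heads"] == 0,
    }


checks = check_consistency(CONFIG, TOKENIZER_CONFIG, VOCAB_LINES)
print(all(checks.values()))                                    # True
print(CONFIG["hidden_size"] // CONFIG["num_attention_heads"])  # 64 (per-head dimension)
```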
+## Citation
+```bibtex
+@article{li2024_codonbert,
+  title   = {{CodonBERT} large language model for {mRNA} vaccines},
+  author  = {Li, Sizhen and Moayedpour, Saeed and Li, Ruijiang and Bailey, Michael and Riahi, Saleh and Kogler-Anele, Lorenzo and Miladi, Milad and Miner, Jacob and Pertuy, Fabien and Zheng, Dinghai and Wang, Jun and Balsubramani, Akshay and Tran, Khang and Zacharia, Minnie and Wu, Monica and Gu, Xiaobo and Clinton, Ryan and Asquith, Carla and Skaleski, Joseph and Boeglin, Lianne and Chivukula, Sudha and Dias, Anusha and Strugnell, Tod and Ulloa Montoya, Fernando and Agarwal, Vikram and Bar-Joseph, Ziv and Jager, Sven},
+  journal = {Genome Research},
+  volume  = {34},
+  number  = {7},
+  pages   = {1027--1035},
+  year    = {2024},
+  doi     = {10.1101/gr.278870.123}
+}
+```
+## Credits
+Original model and code by Li et al. Source: [GitHub](https://github.com/Sanofi-Public/CodonBERT).
+The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
+and reviewed manually by Taykhoom Dalal.
+## License
+Academic/non-commercial use only, following the original repository license:
+Permission is hereby granted, free of charge, for academic research purposes only
+and for non-commercial use only, to any person from an academic research or non-profit
+organization obtaining a copy of these models, software, datasets and/or algorithms.
+For purposes of this notice, "non-commercial use" excludes uses foreseeably resulting
+in a commercial benefit or monetary gain. All other rights are reserved.

config.json ADDED

	@@ -0,0 +1,25 @@

+{
+  "architectures": [
+    "BertModel"
+  ],
+  "auto_map": {
+    "AutoConfig": "Taykhoom/BERT-updated--configuration_bert_updated.BertUpdatedConfig",
+    "AutoModel": "Taykhoom/BERT-updated--modeling_bert.BertModel",
+    "AutoModelForMaskedLM": "Taykhoom/BERT-updated--modeling_bert.BertForMaskedLM"
+  },
+  "attention_probs_dropout_prob": 0.1,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 1024,
+  "model_type": "bert_updated",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "type_vocab_size": 2,
+  "vocab_size": 69,
+  "transformers_version": "4.57.6"
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:203e1dbf3aa7b7c038b25998b8fed977245361e85ded0a08184d80d8eb809898
+size 345972416
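As a rough sanity check, the pointer's `size` can be compared with a parameter count derived from `config.json`. The sketch below assumes a standard BERT layout (embeddings with LayerNorm, 12 encoder layers, and a pooler), float32 storage, and no stored heads. None of this is read from the actual file, so treat it as an estimate. `bert_param_count` is a hypothetical helper.

```python
def bert_param_count(vocab=69, hidden=768, layers=12, intermediate=3072,
                     max_pos=1024, type_vocab=2, with_pooler=True) -> int:
    """Count the parameters of a standard BERT encoder from its config values."""
    # Word, position, and token-type embeddings plus the embedding LayerNorm
    embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden
    # Q, K, V, and output projections (weights + biases) plus LayerNorm
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Two feed-forward projections (weights + biases) plus LayerNorm
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden + 2 * hidden)
    pooler = hidden * hidden + hidden if with_pooler else 0
    return embeddings + layers * (attention + feed_forward) + pooler


params = bert_param_count()
tensor_bytes = params * 4          # assumption: float32 weights
file_bytes = 345_972_416           # size from the LFS pointer above
print(params)                      # 86487552
print(file_bytes - tensor_bytes)   # 22208
```

Under these assumptions the tensors account for all but about 22 KB of the file, which would be consistent with the size of a safetensors metadata header.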

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer_config.json ADDED

	@@ -0,0 +1,55 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": false,
+  "mask_token": "[MASK]",
+  "model_max_length": 1024,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": false,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}

vocab.txt ADDED

	@@ -0,0 +1,69 @@

+[PAD]
+[UNK]
+[CLS]
+[SEP]
+[MASK]
+AAA
+AAU
+AAG
+AAC
+AUA
+AUU
+AUG
+AUC
+AGA
+AGU
+AGG
+AGC
+ACA
+ACU
+ACG
+ACC
+UAA
+UAU
+UAG
+UAC
+UUA
+UUU
+UUG
+UUC
+UGA
+UGU
+UGG
+UGC
+UCA
+UCU
+UCG
+UCC
+GAA
+GAU
+GAG
+GAC
+GUA
+GUU
+GUG
+GUC
+GGA
+GGU
+GGG
+GGC
+GCA
+GCU
+GCG
+GCC
+CAA
+CAU
+CAG
+CAC
+CUA
+CUU
+CUG
+CUC
+CGA
+CGU
+CGG
+CGC
+CCA
+CCU
+CCG
+CCC