---
language:
- rna
library_name: transformers
tags:
- RNA
- mRNA
- codon
- language-model
license: other
---
# CodonBERT

BERT-based RNA language model pretrained on codon-level representations of more than
10 million mRNA sequences from mammals, bacteria, and human viruses using masked language
modeling. Designed for predicting mRNA-specific properties such as translation efficiency
and mRNA stability.
## Architecture

| Parameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Intermediate size | 3072 |
| Vocabulary size | 69 (5 special tokens + 64 codons) |
| Positional encoding | Learned absolute |
| Architecture | Standard post-LN BERT Transformer |
| Max sequence length | 1024 tokens (codons) |
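
As a rough sanity check on model size, the table values can be plugged into the usual BERT parameter arithmetic. This is a minimal sketch, not a count read from the checkpoint: it assumes a standard BERT layout with 1024 learned positions, 2 token-type embeddings, and a pooler layer, none of which is stated in the table above. The function name `bert_param_count` is hypothetical.

```python
def bert_param_count(vocab=69, hidden=768, layers=12, intermediate=3072,
                     max_positions=1024, type_vocab=2, with_pooler=True):
    """Estimate backbone parameters for a standard post-LN BERT encoder.

    max_positions, type_vocab and with_pooler are assumptions, not values
    taken from the model card.
    """
    layer_norm = 2 * hidden  # weight + bias
    # Word, position and token-type embeddings plus the embedding LayerNorm
    embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
    # Q, K, V and output projections (with biases) plus the attention LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two feed-forward projections (with biases) plus the output LayerNorm
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    pooler = hidden * hidden + hidden if with_pooler else 0
    return embeddings + layers * (attention + feed_forward) + pooler


if __name__ == "__main__":
    print(f"{bert_param_count():,} parameters (with pooler)")
    print(f"{bert_param_count(with_pooler=False):,} parameters (without pooler)")
```

Under these assumptions the backbone comes to roughly 86 million parameters. Almost all of them sit in the 12 encoder layers, because the 69-token vocabulary makes the word embedding table very small.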
### Vocabulary

The tokenizer operates at the codon level. `CodonBertTokenizer` splits raw nucleotide
strings into codons itself, so sequences do not need to be pre-split (see Usage below).
The 64 codon tokens cover all combinations of {A, U, G, C}^3 in RNA space, which
includes the three stop codons.
Special tokens use the standard BERT names with the following IDs: `[PAD]=0`, `[UNK]=1`,
`[CLS]=2`, `[SEP]=3`, `[MASK]=4`.
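
The vocabulary arithmetic and the codon-level view of a sequence can be illustrated in a few lines of plain Python. This is a sketch only: the helper names are hypothetical, the ordering of codons after the special tokens is an assumption (the shipped vocabulary file is authoritative), and dropping a trailing partial codon is also an assumption about behavior the card does not describe.

```python
from itertools import product

# Special tokens and their IDs as listed in this section
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab():
    """Build a 69-entry token -> ID map: 5 special tokens + 4^3 codons."""
    # Codon ordering here is an assumption, not the released vocabulary order
    codons = ["".join(p) for p in product("AUGC", repeat=3)]
    return {token: idx for idx, token in enumerate(SPECIAL_TOKENS + codons)}


def split_codons(seq):
    """Normalize to uppercase RNA (T -> U) and split into 3-mers."""
    rna = seq.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3  # assumption: drop a trailing partial codon
    return [rna[i:i + 3] for i in range(0, usable, 3)]


if __name__ == "__main__":
    vocab = build_vocab()
    print(len(vocab))
    print(split_codons("ATGAAAGGCCCTTAA"))
```

For real inputs, use the tokenizer shipped with the model rather than this sketch.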
## Pretraining

- **Objective:** Masked language modeling (MLM) on codon-level tokens
- **Data:** >10 million mRNA sequences from mammals, bacteria, and human viruses
- **Focus:** Coding sequences (CDS) only
- **Source checkpoint:** `model.safetensors` converted from the original
  [Sanofi-Public/CodonBERT](https://github.com/Sanofi-Public/CodonBERT) release
  (`BertForPreTraining` format)
### Checkpoint selection

There is a single publicly released checkpoint from the original authors. The backbone
weights (`bert.*` prefix) are extracted directly and the MLM head is converted alongside
them (see MLM logits below); only the NSP head is discarded.
## Parity Verification

All checks were run on GPU with PyTorch 2.7 / CUDA 12:

- **Hidden states (eager, sdpa):** identical to original at all 13 levels, i.e. the embedding output plus 12 layers (max abs diff < 8e-6)
- **MLM logits:** `BertForMaskedLM` logits identical to original `BertForPreTraining` (max abs diff < 9e-6)
- **Flash attention 2:** verified against eager (bf16) at non-padding positions (max abs diff < 0.25, consistent with bf16 rounding error accumulating across 12 layers)
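
A check of this kind can be sketched as a maximum absolute difference over paired hidden-state tensors, optionally restricted to non-padding positions. This is an illustration of the metric quoted above, not the verification script used for this port; the function name `max_abs_diff` is hypothetical.

```python
import torch


def max_abs_diff(ref_states, new_states, attention_mask=None):
    """Largest absolute difference across paired hidden-state tensors.

    ref_states / new_states: sequences of (batch, seq_len, hidden) tensors,
    for example the `hidden_states` tuples of two models run on the same batch.
    attention_mask: optional (batch, seq_len) tensor; positions where it is 0
    (padding) are excluded from the comparison.
    """
    if len(ref_states) != len(new_states):
        raise ValueError("both models must return the same number of levels")
    worst = 0.0
    for ref, new in zip(ref_states, new_states):
        # Compare in float32 so bf16/fp16 outputs are not rounded again
        diff = (ref.float() - new.float()).abs()
        if attention_mask is not None:
            diff = diff * attention_mask.unsqueeze(-1).to(diff.dtype)
        worst = max(worst, diff.max().item())
    return worst
```

Passing the attention mask matters for the Flash Attention 2 comparison, since outputs at padding positions are not expected to match between backends.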
## Related Models

See the full [CodonBERT collection](https://huggingface.co/collections/Taykhoom/codonbert-6a2215ba01c589ad8eac8a2d).

| Model | Notes |
|---|---|
| **[CodonBERT](https://huggingface.co/Taykhoom/CodonBERT)** | This model |
## Usage

CodonBERT operates on CDS sequences. The tokenizer handles T->U conversion and codon
splitting automatically, so raw nucleotide strings can be passed directly.
### Embedding generation

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
model.eval()

# Raw CDS nucleotide strings; T or U both accepted
cds_sequences = ["ATGAAAGGCCCTTAA", "ATGTTTGGG"]
enc = tokenizer(cds_sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

cls_emb = out.last_hidden_state[:, 0, :]  # (batch, 768) -- CLS token
mask = enc["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
mean_emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)  # mean over non-padding

# Intermediate layers (index 0 is the embedding output)
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer6_emb = out_all.hidden_states[6]  # (batch, seq_len, 768)
```
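### CDS-aware encoding (full mRNA input)

For full mRNA sequences where the CDS region must be extracted first:

```python
import numpy as np
import torch

# Reuses `tokenizer` and `model` from the embedding example above.
# Toy input for illustration only: 3-nt 5' UTR, CDS AUG AAA GGG UAA, 3-nt 3' UTR
mrna_sequences = ["GGGAUGAAAGGGUAAGGG"]

# cds: binary array with 1 at the first nucleotide of each codon
cds = np.zeros(len(mrna_sequences[0]), dtype=int)  # dtype is an assumption
cds[3:15:3] = 1  # codon starts at positions 3, 6, 9, 12
cds_tracks = [cds]  # list of numpy arrays, one per sequence

enc, chunk_counts = tokenizer.batch_encode_with_cds(
    mrna_sequences,
    cds_tracks,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    out = model(**enc)
```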
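
To make the track format concrete, the sketch below shows one way a binary CDS track could be turned into a codon list. It is not the tokenizer's implementation: the helper `extract_cds_codons` is hypothetical and only mirrors the description "1 at the first nucleotide of each codon". It accepts any sequence of 0/1 values, including a numpy array.

```python
def extract_cds_codons(mrna, cds_track):
    """Collect the 3-mer starting at every position flagged with 1.

    mrna: nucleotide string (T or U).
    cds_track: sequence of 0/1 values, same length as mrna, with 1 marking
    the first nucleotide of each codon.
    """
    if len(mrna) != len(cds_track):
        raise ValueError("sequence and CDS track must have the same length")
    rna = mrna.upper().replace("T", "U")
    codons = []
    for start, flag in enumerate(cds_track):
        # Skip flagged positions too close to the end to hold a full codon
        if flag == 1 and start + 3 <= len(rna):
            codons.append(rna[start:start + 3])
    return codons


if __name__ == "__main__":
    seq = "GGGATGAAAGGGTAAGGG"
    track = [0] * len(seq)
    for pos in (3, 6, 9, 12):
        track[pos] = 1
    print(extract_cds_codons(seq, track))
```

In the toy sequence the untranslated flanks (`GGG` on each side) are dropped and only the four flagged codons remain.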
### SDPA and Flash Attention 2

```python
from transformers import AutoModel

model_sdpa = AutoModel.from_pretrained(
    "Taykhoom/CodonBERT", trust_remote_code=True, attn_implementation="sdpa"
)

# Flash Attention 2 typically needs a CUDA GPU, the flash-attn package,
# and half-precision (fp16/bf16) weights
model_flash = AutoModel.from_pretrained(
    "Taykhoom/CodonBERT", trust_remote_code=True, attn_implementation="flash_attention_2"
)
```
### MLM logits

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
model_mlm = AutoModelForMaskedLM.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
model_mlm.eval()

seq = "AUG [MASK] GGG"
enc = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    logits = model_mlm(**enc).logits  # (1, seq_len, 69)
```

The MLM head weights are fully preserved: the prediction transform (dense + GELU +
LayerNorm), the decoder weight (tied to the word embedding in the original, stored
explicitly here), and the output bias are all converted from the original checkpoint.
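
To read predictions out of the logits, one option is to take a softmax at the masked position and keep the top-k token IDs. This is a minimal sketch for a single sequence. The helper `top_k_at_mask` is hypothetical, and the mask ID of 4 comes from the Vocabulary section.

```python
import torch

MASK_ID = 4  # [MASK] token ID listed in the Vocabulary section


def top_k_at_mask(logits, input_ids, k=5, mask_id=MASK_ID):
    """Return (token_ids, probabilities) at the first [MASK] position.

    logits: (1, seq_len, vocab_size) tensor from the MLM model.
    input_ids: (1, seq_len) tensor that produced those logits.
    """
    positions = (input_ids[0] == mask_id).nonzero(as_tuple=True)[0]
    if positions.numel() == 0:
        raise ValueError("input contains no [MASK] token")
    probs = torch.softmax(logits[0, positions[0]], dim=-1)
    top = torch.topk(probs, k)
    return top.indices.tolist(), top.values.tolist()
```

With the objects from the example above, `top_k_at_mask(logits, enc["input_ids"])` returns candidate token IDs, which `tokenizer.convert_ids_to_tokens` maps back to codon strings.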
### Fine-tuning

Standard HF conventions apply. For sequence-level tasks, use the CLS token embedding
as input to a classification/regression head.
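
As an illustration of that pattern, a regression head over the CLS embedding might look like the sketch below. The class name, the dropout rate, and the single linear layer are illustrative choices, not something prescribed by the original authors or by this port.

```python
import torch
from torch import nn


class ClsRegressionHead(nn.Module):
    """Map the CLS embedding of each sequence to one scalar prediction."""

    def __init__(self, hidden_size=768, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)  # dropout rate is an illustrative default
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state):
        cls = last_hidden_state[:, 0, :]  # (batch, hidden) -- CLS token
        return self.out(self.dropout(cls)).squeeze(-1)  # (batch,)
```

In a training loop, the head would receive `model(**enc).last_hidden_state` and be optimized jointly with the backbone (or with the backbone frozen) against a loss such as `nn.MSELoss` on the target property.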
## Implementation Notes

Two key differences from the original CodonBERT release:

**1. Integrated codon tokenization.** The original repository requires users to
manually pre-process sequences into space-separated codons before passing them to
the tokenizer. This port ships `CodonBertTokenizer`, a `BertTokenizer` subclass
whose `_tokenize` method automatically normalizes sequences (T->U, uppercase) and
splits them into codon 3-mers. Users can pass raw nucleotide strings directly:
`tokenizer("AUGAAAGGG")` works without any pre-processing. A
`batch_encode_with_cds(sequences, cds_tracks)` method handles full mRNA input with
CDS extraction and codon-boundary-aligned chunking, matching the mRNABench
preprocessing exactly.
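
Codon-boundary-aligned chunking can be pictured as cutting the codon list, rather than the nucleotide string, into windows that fit the context length. The sketch below is an assumption-laden illustration: it reserves two positions for `[CLS]` and `[SEP]` and uses non-overlapping windows, while the actual chunking rules in `batch_encode_with_cds` are not spelled out here. The helper `chunk_codons` is hypothetical.

```python
def chunk_codons(codons, max_tokens=1024, num_special=2):
    """Split a codon list into consecutive chunks that fit the context window.

    max_tokens: model context length in tokens.
    num_special: positions reserved per chunk for [CLS] and [SEP] (assumption).
    """
    budget = max_tokens - num_special
    if budget <= 0:
        raise ValueError("max_tokens must exceed num_special")
    # Slicing the codon list guarantees every chunk starts on a codon boundary
    return [codons[i:i + budget] for i in range(0, len(codons), budget)]


if __name__ == "__main__":
    chunks = chunk_codons(["AUG"] * 2500)
    print([len(c) for c in chunks])
```

Because the split happens after codon extraction, no chunk can begin mid-codon, which is the property the paragraph above refers to.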

**2. SDPA and Flash Attention 2 support.** The original release used the standard
HuggingFace `BertModel`, which does not support `attn_implementation="sdpa"` or
`attn_implementation="flash_attention_2"`. This port inherits from
[Taykhoom/BERT-updated](https://huggingface.co/Taykhoom/BERT-updated), a minimal
BERT re-implementation with all three backends (`eager`, `sdpa`,
`flash_attention_2`). Parity against the original eager implementation is verified
at every layer.
## Citation

```bibtex
@article{li2024_codonbert,
  title = {{CodonBERT} large language model for {mRNA} vaccines},
  author = {Li, Sizhen and Moayedpour, Saeed and Li, Ruijiang and Bailey, Michael and Riahi, Saleh and Kogler-Anele, Lorenzo and Miladi, Milad and Miner, Jacob and Pertuy, Fabien and Zheng, Dinghai and Wang, Jun and Balsubramani, Akshay and Tran, Khang and Zacharia, Minnie and Wu, Monica and Gu, Xiaobo and Clinton, Ryan and Asquith, Carla and Skaleski, Joseph and Boeglin, Lianne and Chivukula, Sudha and Dias, Anusha and Strugnell, Tod and Ulloa Montoya, Fernando and Agarwal, Vikram and Bar-Joseph, Ziv and Jager, Sven},
  journal = {Genome Research},
  volume = {34},
  number = {7},
  pages = {1027--1035},
  year = {2024},
  doi = {10.1101/gr.278870.123}
}
```
## Credits

Original model and code by Li et al. Source: [GitHub](https://github.com/Sanofi-Public/CodonBERT).
The HF conversion code was authored primarily by [Claude Code](https://claude.ai/code)
and reviewed manually by Taykhoom Dalal.
## License

Academic/non-commercial use only, following the original repository license:

Permission is hereby granted, free of charge, for academic research purposes only
and for non-commercial use only, to any person from an academic research or non-profit
organization obtaining a copy of these models, software, datasets and/or algorithms.
For purposes of this notice, "non-commercial use" excludes uses foreseeably resulting
in a commercial benefit or monetary gain. All other rights are reserved.