Taykhoom/CodonBERT

@@ -57,6 +57,8 @@ weights (`bert.*` prefix) are extracted directly; the MLM and NSP heads are disc
 Hidden-state representations verified identical (max abs diff < 8e-6) to the original
 implementation at all 13 representation levels (embedding + 12 transformer layers).
 Verified with eager and sdpa backends on GPU with PyTorch 2.7 / CUDA 12.
 ## Related Models
@@ -68,10 +70,8 @@ See the full [CodonBERT collection](https://huggingface.co/collections/Taykhoom/
 ## Usage
-CodonBERT operates on CDS sequences. Input nucleotide sequences must be:
-1. In RNA space (U, not T)
-2. A coding region (CDS) that is a multiple of 3 nucleotides
-3. Pre-converted to space-separated codons before tokenization
 ### Embedding generation
@@ -79,21 +79,14 @@ CodonBERT operates on CDS sequences. Input nucleotide sequences must be:
 import torch
 from transformers import AutoTokenizer, AutoModel
-def nt_to_codons(seq: str) -> str:
-    seq = seq.upper().replace("T", "U")
-    n = len(seq) - len(seq) % 3
-    return " ".join(seq[i:i + 3] for i in range(0, n, 3))
 tokenizer = AutoTokenizer.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
 model = AutoModel.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
 model.eval()
-cds_sequences = ["AUGAAAGGGCCCUAA", "AUGUUUGGG"]
-codon_sequences = [nt_to_codons(s) for s in cds_sequences]
-enc = tokenizer(codon_sequences, return_tensors="pt", padding=True)
 with torch.no_grad():
     out = model(**enc)
@@ -107,6 +100,24 @@ out_all = model(**enc, output_hidden_states=True)
 layer6_emb = out_all.hidden_states[6]  # (batch, seq_len, 768)
 ```
 ### SDPA and Flash Attention 2
 ```python
@@ -145,13 +156,14 @@ as input to a classification/regression head.
 Two key differences from the original CodonBERT release:
 **1. Integrated codon tokenization.** The original repository requires users to
-manually pre-process sequences into space-separated codons before tokenizing. This
-port ships the same `BertTokenizer`-based tokenizer with a corrected
-`model_max_length` (1024, matching the model's positional embedding table) and
-`do_basic_tokenize=true` so that whitespace-split codon strings are correctly
-mapped to codon IDs. Users still need to convert nucleotide sequences to
-space-separated codons (see `nt_to_codons` above), but the tokenizer is
-self-contained and directly loadable via `AutoTokenizer`.
 **2. SDPA and Flash Attention 2 support.** The original release used the standard
 HuggingFace `BertModel`, which does not support `attn_implementation="sdpa"` or

 Hidden-state representations verified identical (max abs diff < 8e-6) to the original
 implementation at all 13 representation levels (embedding + 12 transformer layers).
 Verified with eager and sdpa backends on GPU with PyTorch 2.7 / CUDA 12.
+Flash Attention 2 verified against eager in bf16 at non-padding positions (max abs diff < 0.25,
+consistent with bf16 rounding error accumulated across 12 layers).
 ## Related Models
 ## Usage
+CodonBERT operates on CDS sequences. The tokenizer converts T to U and splits the
+sequence into codons automatically, so raw nucleotide strings can be passed directly.
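As a rough illustration of the normalization described above, the sketch below uppercases a sequence, maps T to U, and splits it into 3-mers. The function name `normalize_and_split` is hypothetical, and dropping a trailing incomplete codon is an assumption carried over from the removed `nt_to_codons` helper; `CodonBertTokenizer._tokenize` may handle that case differently.

```python
def normalize_and_split(seq: str) -> list[str]:
    """Hypothetical stand-in for the tokenizer's normalization step."""
    # Uppercase and move from DNA space (T) to RNA space (U).
    rna = seq.upper().replace("T", "U")
    # Assumption: ignore trailing nucleotides that do not form a full codon.
    usable = len(rna) - len(rna) % 3
    return [rna[i:i + 3] for i in range(0, usable, 3)]


print(normalize_and_split("atgaaaggccctTAA"))
# ['AUG', 'AAA', 'GGC', 'CCU', 'UAA']
```

This only mirrors the string handling; mapping codons to vocabulary IDs and adding special tokens is still left to the tokenizer.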
 ### Embedding generation
 import torch
 from transformers import AutoTokenizer, AutoModel
 tokenizer = AutoTokenizer.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
 model = AutoModel.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
 model.eval()
+# Raw CDS nucleotide strings; both T and U are accepted
+cds_sequences = ["ATGAAAGGCCCTTAA", "ATGTTTGGG"]
+enc = tokenizer(cds_sequences, return_tensors="pt", padding=True)
 with torch.no_grad():
     out = model(**enc)
 layer6_emb = out_all.hidden_states[6]  # (batch, seq_len, 768)
 ```
+### CDS-aware encoding (full mRNA input)
+For full mRNA input, where the CDS must first be extracted from the transcript, use
+`batch_encode_with_cds`:
+```python
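+import numpy as np
+
+# Illustrative inputs (not part of the original example): one mRNA with a
+# 6-nt 5' UTR, a 12-nt CDS, and a 3-nt 3' UTR. The integer dtype is an assumption.
+mrna_sequences = ["GCCACCAUGAAAGGGUAAGCU"]
+# Each CDS track is a binary array with 1 at the first nucleotide of each codon
+cds_track = np.zeros(len(mrna_sequences[0]), dtype=int)
+cds_track[[6, 9, 12, 15]] = 1
+cds_tracks = [cds_track]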
+enc, chunk_counts = tokenizer.batch_encode_with_cds(
+    mrna_sequences,
+    cds_tracks,       # list of numpy arrays
+    return_tensors="pt",
+    padding=True,
+)
+with torch.no_grad():
+    out = model(**enc)
+```
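To make the `cds_tracks` convention concrete, the sketch below shows one way codons could be pulled out of an mRNA given a binary track that marks the first nucleotide of each codon. `extract_codons` is a hypothetical helper written for illustration; it is not the implementation behind `batch_encode_with_cds`, and it leaves out the chunking step entirely.

```python
def extract_codons(mrna: str, cds_track) -> list[str]:
    """Hypothetical helper: collect the codons marked by a binary CDS track."""
    rna = mrna.upper().replace("T", "U")
    if len(rna) != len(cds_track):
        raise ValueError("sequence and CDS track must have the same length")
    codons = []
    for i, flag in enumerate(cds_track):
        if flag == 1:
            codon = rna[i:i + 3]
            # Skip a marked position too close to the 3' end to hold a full codon.
            if len(codon) == 3:
                codons.append(codon)
    return codons


# Toy transcript: 6-nt 5' UTR, 12-nt CDS, 3-nt 3' UTR.
mrna = "GCCACCATGAAAGGGTAAGCT"
track = [0] * len(mrna)
for start in (6, 9, 12, 15):
    track[start] = 1

print(extract_codons(mrna, track))
# ['AUG', 'AAA', 'GGG', 'UAA']
```

A plain list is used for the track here to keep the sketch dependency-free; a NumPy array of 0s and 1s iterates the same way.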
 ### SDPA and Flash Attention 2
 ```python
 Two key differences from the original CodonBERT release:
 **1. Integrated codon tokenization.** The original repository requires users to
+manually pre-process sequences into space-separated codons before passing them to
+the tokenizer. This port ships `CodonBertTokenizer`, a `BertTokenizer` subclass
+whose `_tokenize` method automatically normalizes sequences (T->U, uppercase) and
+splits them into codon 3-mers. Users can pass raw nucleotide strings directly:
+`tokenizer("AUGAAAGGG")` works without any pre-processing. A
+`batch_encode_with_cds(sequences, cds_tracks)` method handles full mRNA input with
+CDS extraction and codon-boundary-aligned chunking, matching the mRNABench
+preprocessing exactly.
 **2. SDPA and Flash Attention 2 support.** The original release used the standard
 HuggingFace `BertModel`, which does not support `attn_implementation="sdpa"` or