Upload README.md with huggingface_hub
README.md CHANGED
@@ -57,6 +57,8 @@ weights (`bert.*` prefix) are extracted directly; the MLM and NSP heads are disc
 Hidden-state representations verified identical (max abs diff < 8e-6) to the original
 implementation at all 13 representation levels (embedding + 12 transformer layers).
 Verified with eager and sdpa backends on GPU with PyTorch 2.7 / CUDA 12.

 ## Related Models

@@ -68,10 +70,8 @@ See the full [CodonBERT collection](https://huggingface.co/collections/Taykhoom/

 ## Usage

-CodonBERT operates on CDS sequences. Input nucleotide sequences must be:
-
-2. A coding region (CDS) that is a multiple of 3 nucleotides
-3. Pre-converted to space-separated codons before tokenization

 ### Embedding generation

@@ -79,21 +79,14 @@ CodonBERT operates on CDS sequences. Input nucleotide sequences must be:
 import torch
 from transformers import AutoTokenizer, AutoModel

-
-def nt_to_codons(seq: str) -> str:
-    seq = seq.upper().replace("T", "U")
-    n = len(seq) - len(seq) % 3
-    return " ".join(seq[i:i + 3] for i in range(0, n, 3))
-
-
 tokenizer = AutoTokenizer.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
 model = AutoModel.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
 model.eval()

-
-

-enc = tokenizer(

 with torch.no_grad():
     out = model(**enc)
@@ -107,6 +100,24 @@ out_all = model(**enc, output_hidden_states=True)
 layer6_emb = out_all.hidden_states[6]  # (batch, seq_len, 768)
 ```

 ### SDPA and Flash Attention 2

 ```python
@@ -145,13 +156,14 @@ as input to a classification/regression head.
 Two key differences from the original CodonBERT release:

 **1. Integrated codon tokenization.** The original repository requires users to
-manually pre-process sequences into space-separated codons before
-port ships
-`
-
-
-
-

 **2. SDPA and Flash Attention 2 support.** The original release used the standard
 HuggingFace `BertModel`, which does not support `attn_implementation="sdpa"` or

@@ -57,6 +57,8 @@
 Hidden-state representations verified identical (max abs diff < 8e-6) to the original
 implementation at all 13 representation levels (embedding + 12 transformer layers).
 Verified with eager and sdpa backends on GPU with PyTorch 2.7 / CUDA 12.
+Flash attention 2 verified against eager (bf16) at non-padding positions (max diff < 0.25,
+expected BF16 rounding across 12 layers).
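
As an illustration of what a "max diff at non-padding positions" check measures, here is a framework-free sketch. It assumes the hidden states of one sequence are available as nested lists shaped (seq_len, hidden_size) and that the attention mask marks real tokens with 1. The function name `max_abs_diff_unpadded` is hypothetical and not part of this repository; with PyTorch tensors the same reduction would normally be written with tensor operations.

```python
def max_abs_diff_unpadded(ref, other, attention_mask):
    """Largest absolute element-wise difference over non-padding positions.

    ref, other: nested lists shaped (seq_len, hidden_size).
    attention_mask: list of 0/1 flags, 1 = real token, 0 = padding.
    """
    worst = 0.0
    for ref_row, other_row, keep in zip(ref, other, attention_mask):
        if not keep:
            continue  # padding positions are excluded from the comparison
        for x, y in zip(ref_row, other_row):
            worst = max(worst, abs(x - y))
    return worst


# Toy data: the large gap at the padded last position is ignored.
eager = [[0.10, 0.20], [0.30, 0.40], [0.00, 0.00]]
flash = [[0.10, 0.25], [0.30, 0.40], [9.00, 9.00]]
mask = [1, 1, 0]
diff = max_abs_diff_unpadded(eager, flash, mask)
```

Masking matters because padded positions carry no meaningful representation, so backends may legitimately disagree there.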

 ## Related Models

@@ -68,10 +70,8 @@

 ## Usage

+CodonBERT operates on CDS sequences. The tokenizer handles T->U conversion and codon
+splitting automatically — pass raw nucleotide strings directly.
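
A minimal sketch of the normalization described here (uppercase, T->U, split into 3-mers). It follows the `nt_to_codons` helper that this commit removes from the README; the name `split_into_codons` is hypothetical, this is not the tokenizer's actual `_tokenize` code, and dropping trailing nucleotides that do not fill a codon is carried over from that helper rather than confirmed tokenizer behavior.

```python
def split_into_codons(seq: str) -> list[str]:
    """Normalize a CDS string and split it into codon 3-mers."""
    # Uppercase and convert DNA-style T to RNA-style U.
    seq = seq.upper().replace("T", "U")
    # Assumption: ignore trailing nucleotides that do not complete a codon.
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


codons = split_into_codons("atgaaaggcccttaa")
```

For the input above, `codons` holds the five codons `AUG`, `AAA`, `GGC`, `CCU`, `UAA`.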

 ### Embedding generation

@@ -79,21 +79,14 @@
 import torch
 from transformers import AutoTokenizer, AutoModel

 tokenizer = AutoTokenizer.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
 model = AutoModel.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
 model.eval()

+# Raw CDS nucleotide strings — T or U both accepted
+cds_sequences = ["ATGAAAGGCCCTTAA", "ATGTTTGGG"]

+enc = tokenizer(cds_sequences, return_tensors="pt", padding=True)

 with torch.no_grad():
     out = model(**enc)
@@ -107,6 +100,24 @@
 layer6_emb = out_all.hidden_states[6]  # (batch, seq_len, 768)
 ```

+### CDS-aware encoding (full mRNA input)
+
+For full mRNA sequences where the CDS region must be extracted first:
+
+```python
+import numpy as np
+
+# cds: binary array with 1 at the first nucleotide of each codon
+enc, chunk_counts = tokenizer.batch_encode_with_cds(
+    mrna_sequences,
+    cds_tracks,  # list of numpy arrays
+    return_tensors="pt",
+    padding=True,
+)
+with torch.no_grad():
+    out = model(**enc)
+```
+
 ### SDPA and Flash Attention 2

 ```python
@@ -145,13 +156,14 @@
 Two key differences from the original CodonBERT release:

 **1. Integrated codon tokenization.** The original repository requires users to
+manually pre-process sequences into space-separated codons before passing them to
+the tokenizer. This port ships `CodonBertTokenizer`, a `BertTokenizer` subclass
+whose `_tokenize` method automatically normalizes sequences (T->U, uppercase) and
+splits them into codon 3-mers. Users can pass raw nucleotide strings directly:
+`tokenizer("AUGAAAGGG")` works without any pre-processing. A
+`batch_encode_with_cds(sequences, cds_tracks)` method handles full mRNA input with
+CDS extraction and codon-boundary-aligned chunking, matching the mRNABench
+preprocessing exactly.
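
To make the CDS-track convention concrete, here is a rough sketch of pulling codons out of a full mRNA using a binary track with 1 at the first nucleotide of each codon, as the README's code comment describes. The helper `extract_cds_codons` is hypothetical and uses a plain list in place of a numpy array; the real `batch_encode_with_cds` also performs codon-boundary-aligned chunking and may differ in details not shown in this diff.

```python
def extract_cds_codons(mrna: str, cds_track) -> list[str]:
    """Collect the codon starting at every position flagged with 1."""
    seq = mrna.upper().replace("T", "U")
    codons = []
    for start, flag in enumerate(cds_track):
        # Assumption: skip a flagged start that cannot fit a full codon.
        if flag == 1 and start + 3 <= len(seq):
            codons.append(seq[start:start + 3])
    return codons


# 5' UTR "GG", CDS "ATGAAATAA" starting at index 2, 3' UTR "CC".
mrna = "GGATGAAATAACC"
track = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
cds_codons = extract_cds_codons(mrna, track)
```

Only positions flagged in the track contribute, so the untranslated regions on either side never reach the codon list.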

 **2. SDPA and Flash Attention 2 support.** The original release used the standard
 HuggingFace `BertModel`, which does not support `attn_implementation="sdpa"` or