Taykhoom committed on
Commit
4acf5b5
·
verified ·
1 Parent(s): 6509a75

Upload README.md with huggingface_hub

  1. README.md +33 -21
README.md CHANGED
@@ -57,6 +57,8 @@ weights (`bert.*` prefix) are extracted directly; the MLM and NSP heads are disc
57
  Hidden-state representations verified identical (max abs diff < 8e-6) to the original
58
  implementation at all 13 representation levels (embedding + 12 transformer layers).
59
  Verified with eager and sdpa backends on GPU with PyTorch 2.7 / CUDA 12.
 
 
60
 
61
  ## Related Models
62
 
@@ -68,10 +70,8 @@ See the full [CodonBERT collection](https://huggingface.co/collections/Taykhoom/
68
 
69
  ## Usage
70
 
71
- CodonBERT operates on CDS sequences. Input nucleotide sequences must be:
72
- 1. In RNA space (U, not T)
73
- 2. A coding region (CDS) that is a multiple of 3 nucleotides
74
- 3. Pre-converted to space-separated codons before tokenization
75
 
76
  ### Embedding generation
77
 
@@ -79,21 +79,14 @@ CodonBERT operates on CDS sequences. Input nucleotide sequences must be:
79
  import torch
80
  from transformers import AutoTokenizer, AutoModel
81
 
82
-
83
- def nt_to_codons(seq: str) -> str:
84
-     seq = seq.upper().replace("T", "U")
85
-     n = len(seq) - len(seq) % 3
86
-     return " ".join(seq[i:i + 3] for i in range(0, n, 3))
87
-
88
-
89
  tokenizer = AutoTokenizer.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
90
  model = AutoModel.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
91
  model.eval()
92
 
93
- cds_sequences = ["AUGAAAGGGCCCUAA", "AUGUUUGGG"]
94
- codon_sequences = [nt_to_codons(s) for s in cds_sequences]
95
 
96
- enc = tokenizer(codon_sequences, return_tensors="pt", padding=True)
97
 
98
  with torch.no_grad():
99
      out = model(**enc)
@@ -107,6 +100,24 @@ out_all = model(**enc, output_hidden_states=True)
107
  layer6_emb = out_all.hidden_states[6] # (batch, seq_len, 768)
108
  ```
109
 
110
  ### SDPA and Flash Attention 2
111
 
112
  ```python
@@ -145,13 +156,14 @@ as input to a classification/regression head.
145
  Two key differences from the original CodonBERT release:
146
 
147
  **1. Integrated codon tokenization.** The original repository requires users to
148
- manually pre-process sequences into space-separated codons before tokenizing. This
149
- port ships the same `BertTokenizer`-based tokenizer with a corrected
150
- `model_max_length` (1024, matching the model's positional embedding table) and
151
- `do_basic_tokenize=true` so that whitespace-split codon strings are correctly
152
- mapped to codon IDs. Users still need to convert nucleotide sequences to
153
- space-separated codons (see `nt_to_codons` above), but the tokenizer is
154
- self-contained and directly loadable via `AutoTokenizer`.
 
155
 
156
  **2. SDPA and Flash Attention 2 support.** The original release used the standard
157
  HuggingFace `BertModel`, which does not support `attn_implementation="sdpa"` or
 
57
  Hidden-state representations verified identical (max abs diff < 8e-6) to the original
58
  implementation at all 13 representation levels (embedding + 12 transformer layers).
59
  Verified with eager and sdpa backends on GPU with PyTorch 2.7 / CUDA 12.
60
+ Flash Attention 2 verified against eager (bf16) at non-padding positions (max diff < 0.25,
61
+ expected BF16 rounding across 12 layers).
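
The comparison "at non-padding positions" can be illustrated with a minimal sketch. The helper name `max_abs_diff_non_padding` and the toy arrays are hypothetical and not part of this commit; the sketch assumes hidden states of shape (batch, seq_len, hidden) and an attention mask of shape (batch, seq_len) with 1 marking real tokens.

```python
import numpy as np


def max_abs_diff_non_padding(a: np.ndarray, b: np.ndarray, attention_mask: np.ndarray) -> float:
    """Hypothetical helper: largest |a - b| over positions where attention_mask == 1."""
    mask = attention_mask.astype(bool)  # (batch, seq_len)
    # Boolean indexing keeps only non-padding rows, giving shape (n_real_tokens, hidden).
    return float(np.abs(a[mask] - b[mask]).max())


# Toy data: two "hidden state" tensors that differ at one real and one padded position.
a = np.zeros((2, 3, 4))
b = a.copy()
b[0, 1, 2] = 0.5  # difference at a real token
b[1, 2, 0] = 9.0  # difference at a padded position, which the mask excludes
attention_mask = np.array([[1, 1, 1], [1, 1, 0]])

diff = max_abs_diff_non_padding(a, b, attention_mask)
```

Masking first matters because padded positions may hold arbitrary values that differ between attention backends without affecting real tokens.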
62
 
63
  ## Related Models
64
 
 
70
 
71
  ## Usage
72
 
73
+ CodonBERT operates on CDS sequences. The tokenizer handles T->U conversion and codon
74
+ splitting automatically; pass raw nucleotide strings directly.
 
 
75
 
76
  ### Embedding generation
77
 
 
79
  import torch
80
  from transformers import AutoTokenizer, AutoModel
81
 
82
  tokenizer = AutoTokenizer.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
83
  model = AutoModel.from_pretrained("Taykhoom/CodonBERT", trust_remote_code=True)
84
  model.eval()
85
 
86
+ # Raw CDS nucleotide strings — T or U both accepted
87
+ cds_sequences = ["ATGAAAGGCCCTTAA", "ATGTTTGGG"]
88
 
89
+ enc = tokenizer(cds_sequences, return_tensors="pt", padding=True)
90
 
91
  with torch.no_grad():
92
      out = model(**enc)
 
100
  layer6_emb = out_all.hidden_states[6] # (batch, seq_len, 768)
101
  ```
102
 
103
+ ### CDS-aware encoding (full mRNA input)
104
+
105
+ For full mRNA sequences where the CDS region must be extracted first:
106
+
107
+ ```python
108
+ import numpy as np
109
+
110
+ # cds: binary array with 1 at the first nucleotide of each codon
111
+ enc, chunk_counts = tokenizer.batch_encode_with_cds(
112
+     mrna_sequences,
113
+     cds_tracks, # list of numpy arrays
114
+     return_tensors="pt",
115
+     padding=True,
116
+ )
117
+ with torch.no_grad():
118
+     out = model(**enc)
119
+ ```
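
The snippet above leaves `mrna_sequences` and `cds_tracks` undefined. As a rough sketch of what a CDS track could look like, the following builds a binary array with 1 at the first nucleotide of each codon and reads the codons back out. The helpers `make_cds_track` and `codons_from_track`, the `int8` dtype, and the toy sequence are assumptions for illustration; the dtype and exact conventions expected by `batch_encode_with_cds` are not stated in this diff.

```python
import numpy as np


def make_cds_track(seq_len: int, cds_start: int, cds_end: int) -> np.ndarray:
    """Hypothetical helper: mark the first nucleotide of each codon with 1."""
    if (cds_end - cds_start) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    track = np.zeros(seq_len, dtype=np.int8)  # dtype is an assumption
    track[cds_start:cds_end:3] = 1
    return track


def codons_from_track(seq: str, track: np.ndarray) -> list[str]:
    """Hypothetical helper: read the 3-mer starting at every marked position."""
    return [seq[i:i + 3] for i in np.flatnonzero(track).tolist()]


# Toy mRNA: 2-nt 5' UTR, 12-nt CDS (AUG AAA GGG UAA), 2-nt 3' UTR.
mrna = "GGAUGAAAGGGUAACC"
track = make_cds_track(len(mrna), cds_start=2, cds_end=14)
codons = codons_from_track(mrna, track)
```

Under these assumptions, `mrna_sequences = [mrna]` and `cds_tracks = [track]` would be inputs of the shape the snippet above describes.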
120
+
121
  ### SDPA and Flash Attention 2
122
 
123
  ```python
 
156
  Two key differences from the original CodonBERT release:
157
 
158
  **1. Integrated codon tokenization.** The original repository requires users to
159
+ manually pre-process sequences into space-separated codons before passing them to
160
+ the tokenizer. This port ships `CodonBertTokenizer`, a `BertTokenizer` subclass
161
+ whose `_tokenize` method automatically normalizes sequences (T->U, uppercase) and
162
+ splits them into codon 3-mers. Users can pass raw nucleotide strings directly:
163
+ `tokenizer("AUGAAAGGG")` works without any pre-processing. A
164
+ `batch_encode_with_cds(sequences, cds_tracks)` method handles full mRNA input with
165
+ CDS extraction and codon-boundary-aligned chunking, matching the mRNABench
166
+ preprocessing exactly.
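
The normalization and codon splitting described above can be sketched in plain Python. This is not the `CodonBertTokenizer._tokenize` implementation; `normalize_and_split` is a hypothetical stand-in, and dropping a trailing partial codon is an assumption carried over from the `nt_to_codons` helper that this commit removes.

```python
def normalize_and_split(seq: str) -> list[str]:
    """Hypothetical sketch: uppercase, map T to U, then split into codon 3-mers."""
    seq = seq.upper().replace("T", "U")
    # Assumption: a trailing partial codon (1 or 2 leftover nucleotides) is dropped.
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


codons = normalize_and_split("atgaaagggcc")
```

Each resulting 3-mer would then be looked up in the codon vocabulary, which is the step the tokenizer performs after splitting.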
167
 
168
  **2. SDPA and Flash Attention 2 support.** The original release used the standard
169
  HuggingFace `BertModel`, which does not support `attn_implementation="sdpa"` or