Taykhoom/CodonBERT

@@ -54,11 +54,11 @@ weights (`bert.*` prefix) are extracted directly; the MLM and NSP heads are disc
 ## Parity Verification
-Hidden-state representations verified identical (max abs diff < 8e-6) to the original
-implementation at all 13 representation levels (embedding + 12 transformer layers).
-Verified with eager and sdpa backends on GPU with PyTorch 2.7 / CUDA 12.
-Flash attention 2 verified against eager (bf16) at non-padding positions (max diff < 0.25,
-expected BF16 rounding across 12 layers).
 ## Related Models
@@ -143,8 +143,9 @@ with torch.no_grad():
     logits = model_mlm(**enc).logits  # (1, seq_len, 69)
 ```
-Note: the MLM head (`cls`) is re-initialized randomly in this port. The backbone
-weights are exact; only MLM fine-tuning tasks would require re-training the head.
 ### Fine-tuning

 ## Parity Verification
+All verified on GPU with PyTorch 2.7 / CUDA 12:
+- **Hidden states (eager, sdpa):** identical to original at all 13 levels (max abs diff < 8e-6)
+- **MLM logits:** `BertForMaskedLM` logits identical to original `BertForPreTraining` (max abs diff < 9e-6)
+- **Flash attention 2:** verified against eager (bf16) at non-padding positions (max diff < 0.25, expected BF16 accumulation across 12 layers)
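
The per-level comparison behind these numbers can be sketched as below. This is a minimal illustration, not the script used for the actual verification: the helper name `max_abs_diff_per_level` is hypothetical, and the arrays are synthetic stand-ins. In practice the inputs would presumably be the hidden states of the original and ported models (e.g. obtained with `output_hidden_states=True` and converted to NumPy). The optional mask restricts the comparison to non-padding positions, as described for the flash attention check.

```python
import numpy as np


def max_abs_diff_per_level(ref_states, port_states, attention_mask=None):
    """Return max |ref - port| for each representation level.

    ref_states / port_states: sequences of arrays shaped (batch, seq_len, hidden).
    attention_mask: optional (batch, seq_len) array, 1 = real token, 0 = padding.
    """
    if len(ref_states) != len(port_states):
        raise ValueError("level count mismatch")
    diffs = []
    for ref, port in zip(ref_states, port_states):
        delta = np.abs(
            np.asarray(ref, dtype=np.float64) - np.asarray(port, dtype=np.float64)
        )
        if attention_mask is not None:
            keep = np.asarray(attention_mask).astype(bool)
            delta = delta[keep]  # keep real tokens only -> (n_real_tokens, hidden)
        diffs.append(float(delta.max()))
    return diffs


# Synthetic stand-in data (assumption): 13 levels, toy hidden size of 8.
rng = np.random.default_rng(0)
ref = [rng.standard_normal((2, 5, 8)) for _ in range(13)]
port = [h + 1e-6 for h in ref]           # small uniform offset at every level
mask = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 0, 0]])       # last two tokens of sample 1 are padding
port[3][1, 4, :] += 1.0                  # large error, but only at a padding position

diffs = max_abs_diff_per_level(ref, port, mask)
assert max(diffs) < 8e-6                 # padding error is ignored when masked
```

Without the mask, the deliberately perturbed padding position dominates the result, which is why padded positions are excluded when comparing attention backends.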
 ## Related Models
     logits = model_mlm(**enc).logits  # (1, seq_len, 69)
 ```
+The MLM head weights are fully preserved: the prediction transform (dense + GELU +
+LayerNorm), the decoder weight (tied to the word embedding in the original, stored
+explicitly here), and the output bias are all converted from the original checkpoint.
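
One way to sanity-check the explicitly stored decoder weight is to compare it against the word embedding matrix. The sketch below is hedged: the key names follow the usual Hugging Face BERT layout and are an assumption about this checkpoint, the helper `decoder_matches_embedding` is hypothetical, and the state dict is a toy stand-in with the vocabulary size of 69 from the example above and an arbitrary hidden size.

```python
import numpy as np

# Assumed key names (standard Hugging Face BERT layout, not confirmed for this port).
EMB_KEY = "bert.embeddings.word_embeddings.weight"
DEC_KEY = "cls.predictions.decoder.weight"


def decoder_matches_embedding(state_dict, emb_key=EMB_KEY, dec_key=DEC_KEY):
    """True if the decoder weight equals the word embedding element for element."""
    emb = np.asarray(state_dict[emb_key])
    dec = np.asarray(state_dict[dec_key])
    return bool(emb.shape == dec.shape and np.array_equal(emb, dec))


# Toy stand-in state dict: vocab size 69, hidden size 8 (hidden size is arbitrary).
rng = np.random.default_rng(0)
embedding = rng.standard_normal((69, 8))
tied_state = {EMB_KEY: embedding, DEC_KEY: embedding.copy()}

# A decoder that drifted away from the embedding (e.g. re-initialized) fails the check.
untied_state = {EMB_KEY: embedding, DEC_KEY: rng.standard_normal((69, 8))}

tied_ok = decoder_matches_embedding(tied_state)
untied_ok = decoder_matches_embedding(untied_state)
```

With a real checkpoint, the same comparison could be run on the loaded state dict after converting the two tensors to NumPy.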
 ### Fine-tuning