Taykhoom committed on
Commit 3022652 · verified · 1 Parent(s): 0cfdd44

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +8 -7
README.md CHANGED
@@ -54,11 +54,11 @@ weights (`bert.*` prefix) are extracted directly; the MLM and NSP heads are disc
 
  ## Parity Verification
 
- Hidden-state representations verified identical (max abs diff < 8e-6) to the original
- implementation at all 13 representation levels (embedding + 12 transformer layers).
- Verified with eager and sdpa backends on GPU with PyTorch 2.7 / CUDA 12.
- Flash attention 2 verified against eager (bf16) at non-padding positions (max diff < 0.25,
- expected BF16 rounding across 12 layers).
 
  ## Related Models
 
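A parity check of the kind this hunk describes (maximum absolute difference below a tolerance at every representation level, optionally ignoring padding positions) could be sketched as follows. This is a minimal, framework-agnostic illustration and not the script used for this commit: the names `max_abs_diff` and `check_levels`, the `keep` mask convention, and the toy values are all hypothetical. With PyTorch tensors, each level would first be flattened, for example with `tensor.flatten().tolist()`.

```python
from typing import Optional, Sequence


def max_abs_diff(
    ref: Sequence[float],
    port: Sequence[float],
    keep: Optional[Sequence[bool]] = None,
) -> float:
    """Largest |ref[i] - port[i]| over positions where keep[i] is True.

    `keep` is a hypothetical per-element mask, e.g. to skip padding positions.
    """
    if len(ref) != len(port):
        raise ValueError("length mismatch between reference and port")
    if keep is None:
        keep = [True] * len(ref)
    diffs = [abs(r - p) for r, p, k in zip(ref, port, keep) if k]
    return max(diffs, default=0.0)


def check_levels(ref_levels, port_levels, tol, keep=None):
    """Return (passed, per_level_max_diff) for paired representation levels."""
    if len(ref_levels) != len(port_levels):
        raise ValueError("different number of representation levels")
    per_level = [max_abs_diff(r, p, keep) for r, p in zip(ref_levels, port_levels)]
    return all(d < tol for d in per_level), per_level


# Toy data: 2 levels of 3 values each (the README reports 13 levels).
ref = [[0.10, -0.20, 0.30], [1.00, 2.00, 3.00]]
port = [[0.10, -0.20, 0.30], [1.00, 2.00, 3.50]]

# The last element of level 1 differs, so the unmasked check fails.
ok, diffs = check_levels(ref, port, tol=8e-6)

# Masking out the last position (as one would for padding) makes it pass.
ok_masked, _ = check_levels(ref, port, tol=8e-6, keep=[True, True, False])
```

Reporting the per-level maxima rather than a single boolean makes it easier to see whether a discrepancy appears at the embedding level or grows through the layers.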
@@ -143,8 +143,9 @@ with torch.no_grad():
  logits = model_mlm(**enc).logits # (1, seq_len, 69)
  ```
 
- Note: the MLM head (`cls`) is re-initialized randomly in this port. The backbone
- weights are exact; only MLM fine-tuning tasks would require re-training the head.
 
  ### Fine-tuning
 
 
 
  ## Parity Verification
 
+ All verified on GPU with PyTorch 2.7 / CUDA 12:
+
+ - **Hidden states (eager, sdpa):** identical to original at all 13 levels (max abs diff < 8e-6)
+ - **MLM logits:** `BertForMaskedLM` logits identical to original `BertForPreTraining` (max abs diff < 9e-6)
+ - **Flash attention 2:** verified against eager (bf16) at non-padding positions (max diff < 0.25, expected BF16 accumulation across 12 layers)
 
  ## Related Models
 
 
  logits = model_mlm(**enc).logits # (1, seq_len, 69)
  ```
 
+ The MLM head weights are fully preserved: the prediction transform (dense + GELU +
+ LayerNorm), the decoder weight (tied to the word embedding in the original, stored
+ explicitly here), and the output bias are all converted from the original checkpoint.
 
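The phrase "tied to the word embedding in the original, stored explicitly here" can be illustrated with a small sketch. This is a toy under stated assumptions, not the conversion script behind this commit: the helper `materialize_decoder_weight` is hypothetical, the key names follow the Hugging Face `BertForMaskedLM` state-dict layout and may not match the original checkpoint, and plain nested lists stand in for tensors.

```python
import copy

# Key names follow the Hugging Face BERT layout; treat them as assumptions
# about the checkpoint rather than facts about this repository.
EMBED_KEY = "bert.embeddings.word_embeddings.weight"
DECODER_KEY = "cls.predictions.decoder.weight"


def materialize_decoder_weight(state_dict: dict) -> dict:
    """Return a copy of state_dict with the decoder weight stored explicitly.

    If the decoder weight is absent (i.e. it was tied to the word embedding),
    an independent copy of the embedding matrix is written under DECODER_KEY.
    """
    out = dict(state_dict)
    if DECODER_KEY not in out:
        out[DECODER_KEY] = copy.deepcopy(out[EMBED_KEY])
    return out


# Toy checkpoint: vocabulary of 3 tokens, hidden size 2, decoder weight tied.
original = {
    EMBED_KEY: [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    "cls.predictions.bias": [0.0, 0.0, 0.0],
}

converted = materialize_decoder_weight(original)
```

Copying rather than aliasing the embedding matrix means the saved file carries both entries independently, which is one plausible reading of "stored explicitly".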
  ### Fine-tuning