toranb/theo-bert-base

Shows theo-bert-base (94.7%) vs general-purpose bert-base-uncased
(47.8%) on the 546-case theological MLM benchmark. Includes
per-difficulty and per-category breakdowns plus contrastive
confidence analysis.
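
The contrastive confidence analysis can be illustrated with a minimal sketch. It assumes each contrastive case yields two probabilities at the masked position, one for the expected token and one for a contrasting foil, and that "margin" means their difference; the eval's actual definition is not given here. The `ContrastiveCase` and `summarize` names are hypothetical, and only the 0.10 threshold comes from the README text.

```python
from dataclasses import dataclass

# Threshold taken from the README wording "confident (margin > 0.10)".
MARGIN_THRESHOLD = 0.10


@dataclass
class ContrastiveCase:
    """One contrastive item (hypothetical structure, not the eval's own schema)."""
    p_expected: float  # model probability for the expected token at [MASK]
    p_foil: float      # model probability for the contrasting token at [MASK]


def summarize(cases, threshold=MARGIN_THRESHOLD):
    """Return the fraction of cases that are right and the fraction that are confident.

    Assumption: margin = p_expected - p_foil, so a case is "right" when the
    margin is positive and "confident" when the margin exceeds the threshold.
    """
    if not cases:
        raise ValueError("at least one case is required")
    margins = [c.p_expected - c.p_foil for c in cases]
    right = sum(m > 0 for m in margins)
    confident = sum(m > threshold for m in margins)
    return {"accuracy": right / len(cases), "confident_rate": confident / len(cases)}


if __name__ == "__main__":
    # Made-up probabilities, purely for illustration.
    demo = [
        ContrastiveCase(0.60, 0.10),  # right and confident
        ContrastiveCase(0.30, 0.25),  # right, but margin is below the threshold
        ContrastiveCase(0.10, 0.40),  # wrong
        ContrastiveCase(0.50, 0.50),  # tie, counted as neither right nor confident
    ]
    print(summarize(demo))
```

Under this reading a model can be right without being confident, which is the pattern the README reports for bert-base-uncased on contrastive cases. In practice the two probabilities might come from a fill-mask run restricted to the two candidate tokens, though how the benchmark scores them is not stated in this PR.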

Files changed (1)

README.md +36 -0

@@ -78,6 +78,42 @@ Per-category highlights:
 | Theology proper | 91.3% |
 | Canonical knowledge | 88.4% |
+### Comparison with bert-base-uncased
+
+General-purpose BERT frequently produces theologically incoherent completions on biblical text. Running `google-bert/bert-base-uncased` through the same 546-case eval quantifies the gap:
+
+| Metric | bert-base-uncased | **theo-bert-base** |
+|---|---|---|
+| Overall pass rate | 47.8% | **94.7%** |
+| Doctrinal association | 39.4% | **95.9%** |
+| Canonical knowledge | 37.7% | **88.4%** |
+| Contrastive theology | 65.2% | **97.9%** |
+| Difficulty-weighted | 46.5% | **94.6%** |
+| Critical failure rate | 26.9% | **15.6%** |
+
+By difficulty, theo-bert-base on **hard** cases (94.2%) outperforms bert-base-uncased on **easy** cases (56.6%):
+
+| Difficulty | bert-base-uncased | **theo-bert-base** |
+|---|---|---|
+| Easy | 56.6% | **94.9%** |
+| Medium | 46.9% | **94.9%** |
+| Hard | 44.2% | **94.2%** |
+
+By category:
+
+| Category | bert-base-uncased | **theo-bert-base** |
+|---|---|---|
+| Pneumatology | 45.2% | **100%** |
+| Soteriology | 55.0% | **98.2%** |
+| Ecclesiology | 62.5% | **97.5%** |
+| Hamartiology | 61.8% | **97.1%** |
+| Christology | 41.7% | **96.4%** |
+| Eschatology | 55.6% | **94.4%** |
+| Theology proper | 43.5% | **91.3%** |
+| Canonical knowledge | 37.7% | **88.4%** |
+
+On contrastive theology, the most discriminative test type, bert-base-uncased is correct on 65.2% of cases but confident (margin > 0.10) on only 23%. By contrast, theo-bert-base is correct on 97.9% of cases and confident on 91%.
+
 Residual failures cluster around Old Testament proper-noun recall (Jeremiah, Jonah, Job, Nebuchadnezzar) and multi-piece subword reconstruction (`sabachthani`, `iniquity`, `Nebuchadnezzar`). The benchmark suggests strong domain-specific MLM behavior on this suite; broader generalization beyond the eval distribution has not been independently verified.
 ## Tokenizer