Commit fa10de7 by toranb · 1 Parent(s): a64c547

docs: add bert-base-uncased baseline comparison to README

Shows theo-bert-base (94.7%) vs general-purpose bert-base-uncased
(47.8%) on the 546-case theological MLM benchmark. Includes
per-difficulty and per-category breakdowns plus contrastive
confidence analysis.

Files changed (1)
  1. README.md +36 -0
README.md CHANGED
@@ -78,6 +78,42 @@ Per-category highlights:
  | Theology proper | 91.3% |
  | Canonical knowledge | 88.4% |
 
+ ### Comparison with bert-base-uncased
+
+ General-purpose BERT produces theologically incoherent completions on biblical text. Running `google-bert/bert-base-uncased` through the same 546-case eval shows the gap:
+
+ | Metric | bert-base-uncased | **theo-bert-base** |
+ |---|---|---|
+ | Overall pass rate | 47.8% | **94.7%** |
+ | Doctrinal association | 39.4% | **95.9%** |
+ | Canonical knowledge | 37.7% | **88.4%** |
+ | Contrastive theology | 65.2% | **97.9%** |
+ | Difficulty-weighted | 46.5% | **94.6%** |
+ | Critical failure rate | 26.9% | **15.6%** |
+
+ By difficulty — theo-bert-base on **hard** cases (94.2%) outperforms bert-base-uncased on **easy** cases (56.6%):
+
+ | Difficulty | bert-base-uncased | **theo-bert-base** |
+ |---|---|---|
+ | Easy | 56.6% | **94.9%** |
+ | Medium | 46.9% | **94.9%** |
+ | Hard | 44.2% | **94.2%** |
+
+ By category:
+
+ | Category | bert-base-uncased | **theo-bert-base** |
+ |---|---|---|
+ | Pneumatology | 45.2% | **100%** |
+ | Soteriology | 55.0% | **98.2%** |
+ | Ecclesiology | 62.5% | **97.5%** |
+ | Hamartiology | 61.8% | **97.1%** |
+ | Christology | 41.7% | **96.4%** |
+ | Eschatology | 55.6% | **94.4%** |
+ | Theology proper | 43.5% | **91.3%** |
+ | Canonical knowledge | 37.7% | **88.4%** |
+
+ On contrastive theology — the most discriminative test type — bert-base-uncased is right 65% of the time but only confident (margin > 0.10) on 23% of cases. Theo-bert-base is right 98% of the time and confident on 91% of cases.
+
  Residual failures cluster around Old Testament proper-noun recall (Jeremiah, Jonah, Job, Nebuchadnezzar) and multi-piece subword reconstruction (`sabachthani`, `iniquity`, `Nebuchadnezzar`). The benchmark suggests strong domain-specific MLM behavior on this suite; broader generalization beyond the eval distribution has not been independently verified.
 
  ## Tokenizer
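
Reviewer note on the contrastive confidence figures in the added README section: they depend on a margin threshold of 0.10, but the README does not define the margin. The sketch below assumes it is the difference between the model's probability for the expected token and its probability for the contrasting token at the masked position. The names `ContrastiveCase`, `p_expected`, `p_contrast`, and `summarize` are hypothetical and do not come from the repository, so treat this as one plausible reading rather than the actual eval code.

```python
from dataclasses import dataclass
from typing import Dict, List

# Threshold quoted in the README ("margin > 0.10").
CONFIDENCE_MARGIN = 0.10


@dataclass
class ContrastiveCase:
    """One contrastive test case (hypothetical structure).

    p_expected: assumed probability the model assigns to the expected token.
    p_contrast: assumed probability it assigns to the contrasting token.
    """
    p_expected: float
    p_contrast: float


def summarize(cases: List[ContrastiveCase]) -> Dict[str, float]:
    """Return the fraction of cases that are right and the fraction that are confident.

    A case counts as right when the expected token outscores the contrasting
    token, and as confident when it does so by more than CONFIDENCE_MARGIN.
    """
    if not cases:
        raise ValueError("cases must not be empty")
    right = sum(1 for c in cases if c.p_expected > c.p_contrast)
    confident = sum(
        1 for c in cases if c.p_expected - c.p_contrast > CONFIDENCE_MARGIN
    )
    return {"accuracy": right / len(cases), "confident": confident / len(cases)}


# Made-up probabilities for illustration only; these are not benchmark data.
sample = [
    ContrastiveCase(p_expected=0.6, p_contrast=0.1),
    ContrastiveCase(p_expected=0.3, p_contrast=0.25),
    ContrastiveCase(p_expected=0.1, p_contrast=0.4),
]
print(summarize(sample))
```

Under this reading, a model can be right on a case without being confident on it, which is the pattern the README reports for bert-base-uncased. In a real run the two probabilities would presumably come from the softmax over the vocabulary at the `[MASK]` position.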