| # Evaluation Suite: TheoBERT Base MLM Benchmark | |
| A 546-case domain-specific evaluation suite measuring masked language modeling performance on biblical and theological text. | |
| ## Overview | |
| The evaluation tests whether TheoBERT Base has internalized domain-specific semantics — not just surface-level co-occurrence. It goes beyond a standard perplexity measurement by testing the model against carefully constructed test cases that probe theological precision, canonical recall, and doctrinal discrimination. | |
| **Result:** 94.7% pass rate (517 / 546), difficulty-weighted score 94.6%. | |
| ## Test Types | |
| The suite uses three evaluation strategies, each targeting a different aspect of model competence: | |
| ### 1. Doctrinal Association (221 cases, `target_in_top_k`) | |
| > *Does the model know what belongs in the blank?* | |
| A sentence from scripture or doctrinal writing with one `[MASK]` token. The correct answer(s) must appear in the model's top-k predictions. | |
| **Pass condition:** At least one correct or acceptable-alternative token appears in the top-k (k=5) predictions. | |
| **Example:** | |
| ``` | |
| Input: "For as often as you eat this bread and drink the cup, | |
| you proclaim the Lord's [MASK] until he comes." | |
| Target: "death" | |
| Pass: "death" is in top-5 predicted tokens → ✅ | |
| ``` | |
| These cases span 11 theological categories (bibliology, christology, ecclesiology, eschatology, hamartiology, pneumatology, soteriology, theology proper, spiritual warfare, kingdom theology, Romans road). | |
| ### 2. Canonical Knowledge (138 cases, `target_in_top_k`) | |
| > *Does the model know the exact wording of specific verses?* | |
| Direct recall of well-known biblical passages. No theological reasoning required — this tests pure memorization of canonical text. | |
| > **Train/eval overlap note.** The training corpus includes bible text, so canonical-knowledge cases are not held out from the training distribution. The 88.4% pass rate on this category should be interpreted as in-distribution recall, not as evidence of generalization to unseen text. | |
| **Pass condition:** Same as doctrinal association (correct token in top-5). | |
| **Example:** | |
| ``` | |
| Input: "Love bears all things, [MASK] all things, hopes all things, | |
| endures all things." | |
| Target: "believes" | |
| Pass: "believes" is in top-5 → ✅ | |
| ``` | |
| This is the hardest category for the model (88.4% pass rate), with most failures clustering around Old Testament proper nouns. | |
| ### 3. Contrastive Theology (187 cases, `correct_beats_foil`) | |
| > *Can the model tell the difference between theologically correct and incorrect completions?* | |
| A sentence with one `[MASK]` token, a theologically correct target, and a deliberately wrong *foil* — a word that is semantically plausible in general English but theologically incorrect in context. The model must assign higher probability to the target than the foil. | |
| **Pass condition:** P(target) > P(foil). The margin is additionally classified as high (>0.10), medium (>0.02), or low (≤0.02) confidence. | |
| **Example:** | |
| ``` | |
| Input: "Paul teaches that believers are justified freely by God's [MASK], | |
| not by human merit or works of the law." | |
| Target: "grace" | |
| Foil: "law" | |
| Pass: P("grace") > P("law") → ✅ | |
| ``` | |
| The foil "law" is lexically primed by the later phrase "works of the law," making this a hard case (the model must override surface-level co-occurrence with doctrinal understanding). Failure examples like "power", "will", "plan" — generic divine attributes that miss the specific grace/law antithesis — are tracked as `critical_failure` signals. | |
| ## Categories | |
| | Category | Cases | Pass Rate | Description | | |
| |---|---|---|---| | |
| | Pneumatology | 42 | 100% | Holy Spirit: person, work, gifts, indwelling | | |
| | Soteriology | 109 | 98.2% | Salvation: justification, sanctification, atonement | | |
| | Ecclesiology | 40 | 97.5% | Church: body of Christ, ordinances, unity | | |
| | Hamartiology | 34 | 97.1% | Sin: nature, origin, consequences | | |
| | Christology | 84 | 96.4% | Christ: person, natures, incarnation, atonement | | |
| | Eschatology | 36 | 94.4% | Last things: return of Christ, judgment, resurrection | | |
| | Theology proper | 46 | 91.3% | God: attributes, trinity, sovereignty | | |
| | Canonical knowledge | 138 | 88.4% | Verse-level recall from specific biblical passages | | |
| | Bibliology | 6 | — | Scripture: inspiration, authority, sufficiency | | |
| | Romans road | 8 | — | Evangelistic verses from Romans | | |
| | Spiritual warfare | 2 | — | Cosmic conflict, armor of God | | |
| | Kingdom theology | 1 | — | Already/not-yet kingdom framework | | |
| Categories with very few cases (bibliology, Romans road, spiritual warfare, kingdom theology) are present in the eval but counted only in the aggregate results; the table omits their individual pass rates, which would be unreliable at these sample sizes. | |
| ## Difficulty Levels | |
| Each test case is assigned a difficulty rating: | |
| | Difficulty | Cases | Pass Rate | Criteria | | |
| |---|---|---|---| | |
| | Easy | 99 | 94.9% | High-frequency verse, single obvious completion | | |
| | Medium | 275 | 94.9% | Requires domain context but has strong lexical cues | | |
| | Hard | 172 | 94.2% | Multiple plausible completions, adversarial foil priming, rare vocabulary, or multi-token targets | | |
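The headline numbers can be reproduced from this table with a few lines of arithmetic. In the sketch below, the pass counts are back-derived from the rounded percentages (they are the only integers consistent with them, and they sum to the reported 517). The 1/2/3 weights are an assumption for illustration only; this document does not state how the difficulty-weighted score is computed.

```python
# Case totals are taken from the difficulty table.
CASES = {"easy": 99, "medium": 275, "hard": 172}

# Pass counts back-derived from the rounded pass rates (94.9%, 94.9%, 94.2%).
PASSED = {"easy": 94, "medium": 261, "hard": 162}

# ASSUMPTION: illustrative weights, not taken from the eval script.
WEIGHTS = {"easy": 1, "medium": 2, "hard": 3}


def pass_rate(passed: dict, cases: dict) -> float:
    """Unweighted fraction of cases passed."""
    return sum(passed.values()) / sum(cases.values())


def weighted_score(passed: dict, cases: dict, weights: dict) -> float:
    """Fraction of difficulty-weighted points earned."""
    earned = sum(weights[d] * passed[d] for d in cases)
    possible = sum(weights[d] * cases[d] for d in cases)
    return earned / possible


print(f"pass rate:      {pass_rate(PASSED, CASES):.1%}")
print(f"weighted score: {weighted_score(PASSED, CASES, WEIGHTS):.1%}")
```

Under these assumed weights the weighted score rounds to 94.6%, which matches the reported figure, but that agreement does not confirm the eval script uses this exact weighting.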
| ### What Makes a Case "Hard" | |
| - **Foil priming** — The foil word appears elsewhere in the same sentence, creating lexical pressure toward the wrong answer (e.g., "law" in a sentence that ends with "works of the law") | |
| - **Multi-piece targets** — Words that tokenize into 2+ wordpieces require sequence-level probability scoring across mask positions | |
| - **Rare proper nouns** — Old Testament names, places, and transliterated terms with low training frequency | |
| - **Abstract theological distinctions** — Cases where the model must discriminate between near-synonyms with different doctrinal implications | |
| ## Multi-Token and Multi-Piece Handling | |
| When a target word tokenizes into multiple wordpieces (e.g., "Nebuchadnezzar" → 5 subword tokens), the eval automatically: | |
| 1. Expands `[MASK]` into the required number of mask tokens | |
| 2. Uses beam search over the mask positions to find the target sequence | |
| 3. Matches either exact sequence equality or subsequence containment (for targets that span fewer positions than available masks) | |
| This handles cases like: | |
| - `"sabachthani"` (multi-piece from Aramaic transliteration) | |
| - `"iniquity"` (3 wordpieces in bert-base-uncased) | |
| - `"propitiation"` (4 wordpieces) | |
| Failure examples are also tracked for multi-piece targets — if the model produces a generic failure sequence (e.g., "power", "will") instead of the theologically correct multi-piece target, it's flagged. | |
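The three steps above can be sketched in plain Python. This is a minimal illustration, not the eval script's implementation: it assumes each mask position is scored independently from a single forward pass, it treats "subsequence containment" as a contiguous run, and the function names and the wordpieces in the usage example are made up for illustration.

```python
import heapq
import math

MASK = "[MASK]"


def expand_mask(text: str, n_pieces: int) -> str:
    """Replace the first [MASK] with n_pieces space-separated [MASK] tokens."""
    return text.replace(MASK, " ".join([MASK] * n_pieces), 1)


def beam_search(position_logprobs, beam_width=5):
    """Return up to beam_width (score, pieces) pairs, best first.

    position_logprobs: one dict per mask position, wordpiece -> log-probability.
    ASSUMPTION: positions are independent, so a sequence score is a plain sum.
    """
    beams = [(0.0, [])]
    for dist in position_logprobs:
        candidates = [
            (score + logp, pieces + [piece])
            for score, pieces in beams
            for piece, logp in dist.items()
        ]
        # key= avoids comparing the piece lists when scores tie
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beams


def matches_target(candidate, target):
    """True if target equals candidate or occurs in it as a contiguous run."""
    n = len(target)
    return any(candidate[i:i + n] == target for i in range(len(candidate) - n + 1))


# Hypothetical wordpieces and probabilities, for illustration only.
dists = [
    {"in": math.log(0.6), "the": math.log(0.4)},
    {"##iq": math.log(0.7), "##to": math.log(0.3)},
    {"##uity": math.log(0.9), "##ity": math.log(0.1)},
]
best_score, best_pieces = beam_search(dists, beam_width=3)[0]
print(expand_mask("He was crushed for our [MASK].", 3))
print(best_pieces, matches_target(best_pieces, ["in", "##iq", "##uity"]))
```

A real masked language model conditions every position on the full context, so a faithful implementation might re-run the model as positions are filled in; the independent-position scoring here is only the simplest variant.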
| ## Scoring Protocol | |
| ### target_in_top_k (357 cases) | |
| ``` | |
| pass = correct_token_id ∈ top_k(token_ids) | |
| mrr = 1 / (rank of first correct token), or 0 if not found | |
| ``` | |
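As a rough Python rendering of the pseudocode above, the following sketch scores one single-mask case. The function name and arguments are hypothetical. The pseudocode does not say whether "not found" means outside the top-k or outside the whole ranking, so this version assumes MRR is computed over whatever ranked list the caller supplies.

```python
from typing import Sequence, Set, Tuple


def score_target_in_top_k(
    ranked_token_ids: Sequence[int],
    correct_token_ids: Set[int],
    k: int = 5,
) -> Tuple[bool, float]:
    """Return (passed, mrr) for one case.

    ranked_token_ids: vocabulary ids sorted by descending model probability.
    correct_token_ids: ids of the target plus any acceptable alternatives.
    """
    passed = any(tok in correct_token_ids for tok in ranked_token_ids[:k])

    # Reciprocal rank of the first correct token, 0.0 if it never appears.
    mrr = 0.0
    for rank, tok in enumerate(ranked_token_ids, start=1):
        if tok in correct_token_ids:
            mrr = 1.0 / rank
            break
    return passed, mrr


# Made-up token ids: the correct id 99 sits at rank 4, so the case passes at k=5.
print(score_target_in_top_k([11, 42, 7, 99, 3, 8], {99}, k=5))
```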
| ### correct_beats_foil (187 cases) | |
| ``` | |
| pass = P(target) > P(foil) | |
| margin = P(target) - P(foil) | |
| confidence = high if margin > 0.10 | |
| | medium if margin > 0.02 | |
| | low otherwise | |
| ``` | |
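The same rule can be written as a small Python function. This is a sketch with a hypothetical function name; it follows the pseudocode literally, which means a failing case (negative margin) is also labeled `low`.

```python
def score_correct_beats_foil(p_target: float, p_foil: float) -> dict:
    """Score one contrastive case from the two token probabilities."""
    margin = p_target - p_foil
    if margin > 0.10:
        confidence = "high"
    elif margin > 0.02:
        confidence = "medium"
    else:
        confidence = "low"  # includes negative margins, i.e. failures
    return {
        "passed": p_target > p_foil,
        "margin": margin,
        "confidence": confidence,
    }


# Made-up probabilities for illustration.
print(score_correct_beats_foil(0.40, 0.05))
print(score_correct_beats_foil(0.03, 0.20))
```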
| ### all_top_k_in_target_set (2 cases) | |
| ``` | |
| pass = (valid_tokens_in_top_k / k) ≥ 0.8 | |
| ``` | |
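| Used when multiple targets are equally acceptable and the model should surface the theological cluster rather than a single token. These 2 cases appear to be drawn from the doctrinal-association and canonical-knowledge types, which would explain why `target_in_top_k` covers 357 cases rather than 221 + 138 = 359. | |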
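A minimal Python sketch of this cluster check, assuming the top-k predictions and the set of valid tokens are already available as strings (the function name is hypothetical):

```python
def score_all_top_k_in_target_set(top_k_tokens, valid_tokens, threshold=0.8):
    """Pass if at least `threshold` of the top-k predictions are valid tokens."""
    k = len(top_k_tokens)
    if k == 0:
        raise ValueError("top_k_tokens must not be empty")
    hits = sum(1 for tok in top_k_tokens if tok in valid_tokens)
    return hits / k >= threshold


# Made-up example: 4 of the 5 predictions fall inside the accepted cluster.
cluster = {"love", "joy", "peace", "patience", "kindness"}
print(score_all_top_k_in_target_set(
    ["love", "joy", "peace", "power", "kindness"], cluster
))
```

With k = 5 the 0.8 threshold tolerates exactly one prediction outside the cluster.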
| ## Error Classification | |
| When a test case fails, the eval attempts to classify *why*: | |
| | Error type | Meaning | | |
| |---|---| | |
| | `near_miss` | Correct token ranked at k+1 or k+2 — nearly passed | | |
| | `generic_over_theological` | Top predictions are generic/universal words (e.g., "power", "love", "will") rather than theologically specific terms | | |
| | `wrong_semantic_cluster` | Wrong token is still semantically related but theologically incorrect | | |
| | `total_miss` | Correct token ranked below position 20 — model has essentially no signal | | |
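One way these labels could be assigned is sketched below. The table does not specify which label wins when several apply, how "generic" is decided, or how semantic relatedness is measured, so everything beyond the rank thresholds is an assumption: rank-based checks run first, only the top-1 prediction is inspected, the generic word list holds just the examples named above, and related tokens are supplied by the caller.

```python
# Only the examples named in the table; a real list would be longer.
GENERIC_WORDS = frozenset({"power", "love", "will"})


def classify_error(rank_of_correct, top_predictions, k=5,
                   related_tokens=frozenset()):
    """Label a failed case.

    rank_of_correct: 1-based rank of the correct token, or None if unranked.
    top_predictions: predicted tokens, best first.
    related_tokens: caller-supplied tokens deemed semantically related (ASSUMPTION).
    """
    if rank_of_correct in (k + 1, k + 2):
        return "near_miss"
    if rank_of_correct is None or rank_of_correct > 20:
        return "total_miss"
    top = top_predictions[0] if top_predictions else None
    if top in GENERIC_WORDS:
        return "generic_over_theological"
    if top in related_tokens:
        return "wrong_semantic_cluster"
    return "unclassified"  # fallback label, not part of the original table


print(classify_error(6, ["mercy", "grace"]))
print(classify_error(12, ["power", "grace"]))
```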
| ## Critical Failures | |
| A **critical failure** is a test case where, regardless of pass/fail status, one of the explicitly listed `failure_examples` tokens appears in the top-3 predictions. This signals that the model is drifting toward generic religious language rather than precise theological vocabulary. | |
| The critical failure rate is not reported in this document; it can be extracted from the full results JSON. | |
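The critical-failure check itself is simple to express. The sketch below assumes each result record carries the model's ranked predictions alongside the case's `failure_examples`; the `top_predictions` key and the function names are hypothetical, since the layout of the results JSON is not documented here.

```python
def is_critical_failure(top_predictions, failure_examples, top_n=3):
    """True if any listed failure example appears in the top_n predictions."""
    flagged = {tok.lower() for tok in failure_examples}
    return any(tok.lower() in flagged for tok in top_predictions[:top_n])


def critical_failure_rate(records):
    """Fraction of records flagged, regardless of their pass/fail status."""
    if not records:
        return 0.0
    hits = sum(
        is_critical_failure(r["top_predictions"], r["failure_examples"])
        for r in records
    )
    return hits / len(records)


# Made-up records: the first passes on "cross" but still has "gospel" in the top 3.
records = [
    {"top_predictions": ["cross", "gospel", "death", "law", "world"],
     "failure_examples": ["church", "gospel", "law", "bible", "world"]},
    {"top_predictions": ["grace", "mercy", "kindness"],
     "failure_examples": ["power", "will", "plan"]},
]
print(critical_failure_rate(records))
```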
| ## Running the Evaluation | |
| Install the dependencies first. `uv pip install` is recommended for speed and resolver behavior, but plain `pip install` works too: | |
| ```bash | |
| uv pip install -r requirements.txt | |
| # or: pip install -r requirements.txt | |
| ``` | |
| ```bash | |
| # Default: load model.safetensors from repo root, eval against eval.json | |
| python scripts/mlm_eval_safetensors.py | |
| # With GPU | |
| python scripts/mlm_eval_safetensors.py --device cuda | |
| # Compare against a previous run (checks fp16 round-trip fidelity) | |
| python scripts/mlm_eval_safetensors.py --compare eval_results/d12_encoder_mlm_eval.json | |
| # Custom paths | |
| python scripts/mlm_eval_safetensors.py \ | |
| --repo-dir /path/to/repo \ | |
| --eval-path /path/to/eval.json \ | |
| --device cuda | |
| # Adjust top-k and sampling | |
| python scripts/mlm_eval_safetensors.py --k 10 --n-samples 5 | |
| ``` | |
| The script writes results to `eval_results/safetensors_mlm_eval.json` by default. | |
| ### Verifying fp16 Fidelity | |
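| The `--compare` flag diffs the safetensors (fp16→fp32) results against a prior evaluation of the original fp32 `.pt` checkpoint. If every test case produces the same pass/fail outcome, the fp16 storage is behaviorally equivalent to the fp32 checkpoint on this suite: rounding the weights to fp16 does not flip any prediction across a pass/fail boundary. Matching outcomes do not show that the weights or logits are bit-identical, since casting fp32 to fp16 discards precision in general. | |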
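The comparison described above amounts to a per-case diff of outcomes. Because the results JSON layout is not documented here, the sketch below takes two plain mappings from case id to pass/fail; the function name and the example ids are hypothetical, and this is not the code behind `--compare`.

```python
def diff_outcomes(baseline: dict, candidate: dict) -> list:
    """Return the case ids whose outcome differs between two runs.

    Each argument maps case id -> bool (passed). Ids present in only one run
    are reported as differences too.
    """
    all_ids = sorted(set(baseline) | set(candidate))
    return [cid for cid in all_ids if baseline.get(cid) != candidate.get(cid)]


# Made-up outcomes for three cases.
fp32_run = {"DOC_001": True, "DOC_002": False, "DOC_003": True}
fp16_run = {"DOC_001": True, "DOC_002": False, "DOC_003": True}

changed = diff_outcomes(fp32_run, fp16_run)
print("identical outcomes" if not changed else f"changed: {changed}")
```

An empty diff supports the claim that fp16 storage leaves the suite's pass/fail outcomes unchanged; comparing margins or ranks as well would be a stricter check.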
| ## Test Case Schema | |
| Each test case in `eval.json` has this structure: | |
| ```json | |
| { | |
| "id": "DOC_001", | |
| "type": "doctrinal_association", | |
| "category": "soteriology", | |
| "difficulty": "medium", | |
| "input": "Paul teaches that the message of the [MASK] is foolishness...", | |
| "targets": ["cross"], | |
| "foils": [], | |
| "acceptable_alternatives": [], | |
| "failure_examples": ["church", "gospel", "law", "bible", "world"], | |
| "pass_condition": "target_in_top_k", | |
| "k": 5, | |
| "reference": "1 Corinthians 1:18", | |
| "reasoning": "The cross as the central message of the gospel...", | |
| "surface_confounder": "" | |
| } | |
| ``` | |
| | Field | Description | | |
| |---|---| | |
| | `id` | Unique identifier within the suite | | |
| | `type` | One of: `doctrinal_association`, `canonical_knowledge`, `contrastive_theology` | | |
| | `category` | Theological category (see table above) | | |
| | `difficulty` | `easy`, `medium`, or `hard` | | |
| | `input` | The masked sentence. Must contain at least one `[MASK]` | | |
| | `targets` | Correct completion(s) for the masked position(s) | | |
| | `foils` | Deliberately incorrect but plausible completions (contrastive only) | | |
| | `acceptable_alternatives` | Also-correct completions beyond the primary target | | |
| | `failure_examples` | Tokens that would indicate the model failed to internalize the domain, even if the primary target is predicted | | |
| | `pass_condition` | Scoring strategy: `target_in_top_k`, `correct_beats_foil`, or `all_top_k_in_target_set` | | |
| | `k` | Number of top predictions to consider | | |
| | `reference` | Source verse or doctrinal concept | | |
| | `reasoning` | Human-readable explanation of what the case tests and why the foil is wrong (if applicable) | | |
| | `surface_confounder` | Linguistic surface feature that could mislead a shallow model (if any) | | |
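A small validator can catch malformed cases before a run. The sketch below is an illustration rather than part of the repository: it assumes every field in the table is required and that `correct_beats_foil` cases need at least one foil, neither of which is stated explicitly.

```python
MASK = "[MASK]"
REQUIRED_FIELDS = {
    "id", "type", "category", "difficulty", "input", "targets", "foils",
    "acceptable_alternatives", "failure_examples", "pass_condition", "k",
    "reference", "reasoning", "surface_confounder",
}
TYPES = {"doctrinal_association", "canonical_knowledge", "contrastive_theology"}
DIFFICULTIES = {"easy", "medium", "hard"}
PASS_CONDITIONS = {"target_in_top_k", "correct_beats_foil", "all_top_k_in_target_set"}


def validate_case(case: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    missing = REQUIRED_FIELDS - set(case)
    if missing:
        return [f"missing fields: {sorted(missing)}"]
    problems = []
    if case["type"] not in TYPES:
        problems.append(f"unknown type: {case['type']}")
    if case["difficulty"] not in DIFFICULTIES:
        problems.append(f"unknown difficulty: {case['difficulty']}")
    if case["pass_condition"] not in PASS_CONDITIONS:
        problems.append(f"unknown pass_condition: {case['pass_condition']}")
    if MASK not in case["input"]:
        problems.append("input has no [MASK]")
    if not case["targets"]:
        problems.append("targets is empty")
    if not isinstance(case["k"], int) or case["k"] < 1:
        problems.append("k must be a positive integer")
    # ASSUMPTION: contrastive scoring needs at least one foil to compare against.
    if case["pass_condition"] == "correct_beats_foil" and not case["foils"]:
        problems.append("correct_beats_foil case has no foils")
    return problems


# The example case from the schema above, with its elided text left as-is.
example = {
    "id": "DOC_001", "type": "doctrinal_association", "category": "soteriology",
    "difficulty": "medium",
    "input": "Paul teaches that the message of the [MASK] is foolishness...",
    "targets": ["cross"], "foils": [], "acceptable_alternatives": [],
    "failure_examples": ["church", "gospel", "law", "bible", "world"],
    "pass_condition": "target_in_top_k", "k": 5,
    "reference": "1 Corinthians 1:18",
    "reasoning": "The cross as the central message of the gospel...",
    "surface_confounder": "",
}
print(validate_case(example))
```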
| ## Design Philosophy | |
| This eval was designed to probe **domain-specific MLM behavior**, not general linguistic fluency. A general-purpose BERT model may score well on standard MLM benchmarks while producing theologically incoherent completions on biblical text. The three test types target different aspects of that behavior: | |
| 1. **Doctrinal association** checks whether the model has absorbed domain-specific co-occurrence patterns — the "language" of theology | |
| 2. **Canonical knowledge** checks whether the model has memorized specific verses — the "data" of scripture | |
| 3. **Contrastive theology** checks whether the model prefers doctrinally correct completions over plausible foils | |
| The foil-based contrastive cases are the most discriminative: they test whether the model assigns higher probability to a doctrinally correct target than to a surface-level lexical confounder. Results on this suite should be read as evidence about behavior on cases of this shape, not as a general measure of theological understanding. The training corpus and eval suite were authored privately and have not been externally audited, so some train/eval distributional overlap (especially for canonical recall) is expected. | |