Evaluation Suite: TheoBERT Base MLM Benchmark
A 546-case domain-specific evaluation suite measuring masked language modeling performance on biblical and theological text.
Overview
The evaluation tests whether TheoBERT Base has internalized domain-specific semantics, not just surface-level co-occurrence. It goes beyond a standard perplexity measurement by testing the model against carefully constructed test cases that probe theological precision, canonical recall, and doctrinal discrimination.
Result: 94.7% pass rate (517 / 546), difficulty-weighted score 94.6%.
Test Types
The suite uses three evaluation strategies, each targeting a different aspect of model competence:
1. Doctrinal Association (221 cases, target_in_top_k)
Does the model know what belongs in the blank?
A sentence from scripture or doctrinal writing with one [MASK] token. The correct answer(s) must appear in the model's top-k predictions.
Pass condition: At least one correct or acceptable-alternative token appears in the top-k (k=5) predictions.
Example:
Input: "For as often as you eat this bread and drink the cup,
you proclaim the Lord's [MASK] until he comes."
Target: "death"
Pass: "death" is in top-5 predicted tokens ✓
These cases span 11 theological categories (bibliology, christology, ecclesiology, eschatology, hamartiology, pneumatology, soteriology, theology proper, spiritual warfare, kingdom theology, Romans road).
2. Canonical Knowledge (138 cases, target_in_top_k)
Does the model know the exact wording of specific verses?
Direct recall of well-known biblical passages. No theological reasoning is required; this tests pure memorization of canonical text.
Train/eval overlap note. The training corpus includes bible text, so canonical-knowledge cases are not held out from the training distribution. The 88.4% pass rate on this category should be interpreted as in-distribution recall, not as evidence of generalization to unseen text.
Pass condition: Same as doctrinal association (correct token in top-5).
Example:
Input: "Love bears all things, [MASK] all things, hopes all things,
endures all things."
Target: "believes"
Pass: "believes" is in top-5 ✓
This is the hardest category for the model (88.4% pass rate), with most failures clustering around Old Testament proper nouns.
3. Contrastive Theology (187 cases, correct_beats_foil)
Can the model tell the difference between theologically correct and incorrect completions?
A sentence with one [MASK] token, a theologically correct target, and a deliberately wrong foil: a word that is semantically plausible in general English but theologically incorrect in context. The model must assign higher probability to the target than the foil.
Pass condition: P(target) > P(foil). The margin is additionally classified as high (>0.10), medium (>0.02), or low (≤0.02) confidence.
Example:
Input: "Paul teaches that believers are justified freely by God's [MASK],
not by human merit or works of the law."
Target: "grace"
Foil: "law"
Pass: P("grace") > P("law") ✓
The foil "law" is lexically primed by the later phrase "works of the law," making this a hard case: the model must override surface-level co-occurrence with doctrinal understanding. Failure examples such as "power", "will", and "plan" (generic divine attributes that miss the specific grace/law antithesis) are tracked as critical_failure signals.
Categories
| Category | Cases | Pass Rate | Description |
|---|---|---|---|
| Pneumatology | 42 | 100% | Holy Spirit: person, work, gifts, indwelling |
| Soteriology | 109 | 98.2% | Salvation: justification, sanctification, atonement |
| Ecclesiology | 40 | 97.5% | Church: body of Christ, ordinances, unity |
| Hamartiology | 34 | 97.1% | Sin: nature, origin, consequences |
| Christology | 84 | 96.4% | Christ: person, natures, incarnation, atonement |
| Eschatology | 36 | 94.4% | Last things: return of Christ, judgment, resurrection |
| Theology proper | 46 | 91.3% | God: attributes, trinity, sovereignty |
| Canonical knowledge | 138 | 88.4% | Verse-level recall from specific biblical passages |
| Bibliology | 6 | n/a | Scripture: inspiration, authority, sufficiency |
| Romans road | 8 | n/a | Evangelistic verses from Romans |
| Spiritual warfare | 2 | n/a | Cosmic conflict, armor of God |
| Kingdom theology | 1 | n/a | Already/not-yet kingdom framework |
Categories with very few cases (bibliology, Romans road, spiritual warfare, kingdom theology) are included in the eval but reported only in aggregate, so the table shows no individual pass rate for them; with so few cases, any per-category rate would be too noisy to interpret.
Difficulty Levels
Each test case is assigned a difficulty rating:
| Difficulty | Cases | Pass Rate | Criteria |
|---|---|---|---|
| Easy | 99 | 94.9% | High-frequency verse, single obvious completion |
| Medium | 275 | 94.9% | Requires domain context but has strong lexical cues |
| Hard | 172 | 94.2% | Multiple plausible completions, adversarial foil priming, rare vocabulary, or multi-token targets |
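The weighting behind the difficulty-weighted score is not spelled out in this document. As a rough sketch only, assuming hypothetical weights of 1, 2, and 3 for easy, medium, and hard cases, and pass counts back-derived from the rounded pass rates in the table above (94, 261, and 162), a weighted pass rate could be computed like this:

```python
# Sketch only: these weights are hypothetical, not taken from the eval script.
HYPOTHETICAL_WEIGHTS = {"easy": 1, "medium": 2, "hard": 3}

# (cases, passes) per difficulty. Pass counts are back-derived from the
# rounded pass rates in the difficulty table, not read from a results file.
RESULTS = {"easy": (99, 94), "medium": (275, 261), "hard": (172, 162)}


def weighted_pass_rate(results, weights):
    """Return sum(weight * passes) / sum(weight * cases)."""
    earned = sum(weights[d] * passed for d, (_, passed) in results.items())
    possible = sum(weights[d] * cases for d, (cases, _) in results.items())
    return earned / possible


def plain_pass_rate(results):
    """Return the unweighted pass rate across all difficulty levels."""
    total = sum(cases for cases, _ in results.values())
    passed = sum(passed for _, passed in results.values())
    return passed / total


if __name__ == "__main__":
    print(f"unweighted: {plain_pass_rate(RESULTS):.1%}")
    print(f"weighted:   {weighted_pass_rate(RESULTS, HYPOTHETICAL_WEIGHTS):.1%}")
```

Because the pass rates are nearly flat across difficulty levels, almost any reasonable weighting lands close to the unweighted figure, so agreement with the reported score would not confirm which weights the suite actually uses.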
What Makes a Case "Hard"
- Foil priming: The foil word appears elsewhere in the same sentence, creating lexical pressure toward the wrong answer (e.g., "law" in a sentence that ends with "works of the law")
- Multi-piece targets: Words that tokenize into 2+ wordpieces require sequence-level probability scoring across mask positions
- Rare proper nouns: Old Testament names, places, and transliterated terms with low training frequency
- Abstract theological distinctions: Cases where the model must discriminate between near-synonyms with different doctrinal implications
Multi-Token and Multi-Piece Handling
When a target word tokenizes into multiple wordpieces (e.g., "Nebuchadnezzar" → 5 subword tokens), the eval automatically:
- Expands [MASK] into the required number of mask tokens
- Uses beam search over the mask positions to find the target sequence
- Matches either exact sequence equality or subsequence containment (for targets that span fewer positions than available masks)
This handles cases like:
- "sabachthani" (multi-piece from Aramaic transliteration)
- "iniquity" (3 wordpieces in bert-base-uncased)
- "propitiation" (4 wordpieces)
Failure examples are also tracked for multi-piece targets: if the model produces a generic failure sequence (e.g., "power", "will") instead of the theologically correct multi-piece target, it's flagged.
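The mask expansion and matching steps can be illustrated with a minimal sketch. The function names here are hypothetical, the beam search itself is left out, and "subsequence containment" is assumed to mean a contiguous run of wordpieces; the actual script may differ on all three points.

```python
from typing import Sequence

MASK = "[MASK]"


def expand_mask(text: str, n_pieces: int) -> str:
    """Replace the first [MASK] with n_pieces space-separated [MASK] tokens."""
    if MASK not in text:
        raise ValueError("input must contain a [MASK] token")
    if n_pieces < 1:
        raise ValueError("n_pieces must be at least 1")
    return text.replace(MASK, " ".join([MASK] * n_pieces), 1)


def contains_run(predicted: Sequence[str], target: Sequence[str]) -> bool:
    """True if target occurs as a contiguous run inside predicted (assumption)."""
    n, m = len(predicted), len(target)
    if m == 0 or m > n:
        return False
    return any(list(predicted[i:i + m]) == list(target) for i in range(n - m + 1))


def sequence_matches(predicted: Sequence[str], target: Sequence[str]) -> bool:
    """Exact equality, or containment when the target spans fewer positions."""
    return list(predicted) == list(target) or contains_run(predicted, target)
```

For a target that needs three wordpieces, `expand_mask("the [MASK] of sins", 3)` yields a sentence with three consecutive mask tokens, and the predicted piece sequence at those positions is then passed to `sequence_matches` together with the target's pieces.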
Scoring Protocol
target_in_top_k (357 cases)
pass = correct_token_id ∈ top_k(token_ids)
mrr = 1 / (rank of first correct token), or 0 if not found
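As a minimal sketch of this rule, assuming the model's predictions for the mask position are already available as a best-first list of token strings (the function name and argument layout are illustrative, not the script's API):

```python
from typing import Iterable, List, Tuple


def score_target_in_top_k(
    ranked_tokens: List[str], correct: Iterable[str], k: int = 5
) -> Tuple[bool, float]:
    """Return (passed, mrr) for one test case.

    ranked_tokens: predictions for the mask position, best first.
    correct: the targets plus any acceptable alternatives.
    """
    correct_set = set(correct)
    # 1-based rank of the first correct token in the supplied ranking, if any.
    rank = next(
        (i for i, tok in enumerate(ranked_tokens, start=1) if tok in correct_set),
        None,
    )
    passed = rank is not None and rank <= k
    # Assumption: "not found" means absent from the supplied ranking.
    mrr = 1.0 / rank if rank is not None else 0.0
    return passed, mrr
```

Note that in this sketch a token ranked outside the top k still contributes a nonzero reciprocal rank if the supplied ranking is longer than k; whether the real script truncates the ranking first is not stated here.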
correct_beats_foil (187 cases)
pass = P(target) > P(foil)
margin = P(target) - P(foil)
confidence = high if margin > 0.10
| medium if margin > 0.02
| low otherwise
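The same rule can be written as a small function. This is a sketch that takes the two probabilities as plain floats; how the script obtains them from the model is not shown, and the names are illustrative.

```python
from typing import NamedTuple


class FoilResult(NamedTuple):
    passed: bool
    margin: float
    confidence: str


def score_correct_beats_foil(p_target: float, p_foil: float) -> FoilResult:
    """Apply the pass rule and the margin-based confidence bands."""
    margin = p_target - p_foil
    if margin > 0.10:
        confidence = "high"
    elif margin > 0.02:
        confidence = "medium"
    else:
        # Covers small positive margins and, per "low otherwise", failures too.
        confidence = "low"
    return FoilResult(passed=p_target > p_foil, margin=margin, confidence=confidence)
```

A case with P(target) = 0.40 and P(foil) = 0.05 passes with high confidence, while a failing case (negative margin) falls into the low band under this reading of the rule.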
all_top_k_in_target_set (2 cases)
pass = (valid_tokens_in_top_k / k) ≥ 0.8
Used when multiple targets are equally acceptable and the model should surface the theological cluster rather than a single token.
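A minimal sketch of this check, assuming a best-first list of predicted tokens and a set of valid tokens (the function name and the `threshold` parameter are illustrative):

```python
from typing import Iterable, List, Tuple


def score_all_top_k_in_target_set(
    ranked_tokens: List[str],
    valid_tokens: Iterable[str],
    k: int = 5,
    threshold: float = 0.8,
) -> Tuple[bool, float]:
    """Return (passed, fraction) where fraction = valid tokens in top-k / k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    valid = set(valid_tokens)
    hits = sum(1 for tok in ranked_tokens[:k] if tok in valid)
    # Divide by k (not by the number of predictions), as in the formula above.
    fraction = hits / k
    return fraction >= threshold, fraction
```

With k=5, four valid tokens in the top five is exactly on the 0.8 threshold and passes; three is not enough.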
Error Classification
When a test case fails, the eval attempts to classify why:
| Error type | Meaning |
|---|---|
| near_miss | Correct token ranked at k+1 or k+2 (nearly passed) |
| generic_over_theological | Top predictions are generic/universal words (e.g., "power", "love", "will") rather than theologically specific terms |
| wrong_semantic_cluster | Wrong token is still semantically related but theologically incorrect |
| total_miss | Correct token ranked below position 20; the model has essentially no signal |
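The rank-based error types lend themselves to a short sketch. Everything beyond the rank thresholds is an assumption here: the order in which the checks run, the "majority of the top-k are generic" test, and the `unclassified` fallback are all hypothetical. Detecting wrong_semantic_cluster needs some measure of semantic relatedness, so this sketch does not attempt it.

```python
from typing import Optional, Sequence, Set


def classify_failure(
    rank: Optional[int],
    top_tokens: Sequence[str],
    k: int,
    generic_words: Set[str],
) -> str:
    """Classify a failed case.

    rank: 1-based rank of the correct token, or None if it was not ranked.
    top_tokens: the model's predictions, best first.
    generic_words: a caller-supplied list of generic terms (hypothetical).
    """
    if rank is not None and k < rank <= k + 2:
        return "near_miss"
    if rank is None or rank > 20:
        return "total_miss"
    top = list(top_tokens[:k])
    # Assumption: "generic" means more than half of the top-k are generic words.
    if top and sum(tok in generic_words for tok in top) * 2 > len(top):
        return "generic_over_theological"
    # wrong_semantic_cluster would need a semantic resource; left unclassified.
    return "unclassified"
```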
Critical Failures
A critical failure is a test case where, regardless of pass/fail status, one of the explicitly listed failure_examples tokens appears in the top-3 predictions. This signals that the model is drifting toward generic religious language rather than precise theological vocabulary.
Critical failure rate: not reported here; it can be extracted from the full results JSON.
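The critical-failure check described above is simple enough to sketch directly. The layout of the results JSON is not documented here, so this example works on in-memory lists; the function names are illustrative.

```python
from typing import Iterable, List, Sequence, Tuple


def is_critical_failure(
    ranked_tokens: Sequence[str], failure_examples: Iterable[str], top_n: int = 3
) -> bool:
    """True if any listed failure example appears in the top_n predictions."""
    flagged = set(failure_examples)
    return any(tok in flagged for tok in ranked_tokens[:top_n])


def critical_failure_rate(cases: Iterable[Tuple[Sequence[str], List[str]]]) -> float:
    """cases: (ranked_tokens, failure_examples) pairs; returns the flagged share."""
    cases = list(cases)
    if not cases:
        return 0.0
    flagged = sum(is_critical_failure(ranked, examples) for ranked, examples in cases)
    return flagged / len(cases)
```

Because the check ignores pass/fail status, a case can pass its scoring rule (for example, the target at rank 1) and still be flagged when a failure example sits at rank 2 or 3.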
Running the Evaluation
Install the dependencies first. uv pip install is recommended for speed and resolver behavior, but plain pip install works too:
uv pip install -r requirements.txt
# or: pip install -r requirements.txt
# Default: load model.safetensors from repo root, eval against eval.json
python scripts/mlm_eval_safetensors.py
# With GPU
python scripts/mlm_eval_safetensors.py --device cuda
# Compare against a previous run (checks fp16 round-trip fidelity)
python scripts/mlm_eval_safetensors.py --compare eval_results/d12_encoder_mlm_eval.json
# Custom paths
python scripts/mlm_eval_safetensors.py \
--repo-dir /path/to/repo \
--eval-path /path/to/eval.json \
--device cuda
# Adjust top-k and sampling
python scripts/mlm_eval_safetensors.py --k 10 --n-samples 5
The script writes results to eval_results/safetensors_mlm_eval.json by default.
Verifying fp16 Fidelity
The --compare flag diffs the safetensors (fp16→fp32) results against a prior evaluation of the original fp32 .pt checkpoint. If every test case produces the same pass/fail outcome, storing the weights in fp16 changed no outcome on this suite. That is evidence that the precision loss does not affect predictions at the pass/fail level; it does not show that the weights or the predicted probabilities are bit-identical.
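The comparison itself amounts to a per-case diff of outcomes. The sketch below assumes each run has already been reduced to a mapping from test case id to a pass/fail boolean; that mapping is a hypothetical intermediate form, since the structure of the results JSON is not described here.

```python
from typing import Dict, List, Tuple


def diff_outcomes(
    baseline: Dict[str, bool], candidate: Dict[str, bool]
) -> Tuple[List[str], List[str]]:
    """Return (changed_ids, missing_ids) between two runs.

    changed_ids: cases present in both runs whose pass/fail outcome differs.
    missing_ids: cases present in only one of the two runs.
    """
    base_ids, cand_ids = set(baseline), set(candidate)
    changed = sorted(cid for cid in base_ids & cand_ids if baseline[cid] != candidate[cid])
    missing = sorted(base_ids ^ cand_ids)
    return changed, missing


if __name__ == "__main__":
    fp32_run = {"DOC_001": True, "DOC_002": False}
    fp16_run = {"DOC_001": True, "DOC_002": False}
    changed, missing = diff_outcomes(fp32_run, fp16_run)
    print("identical outcomes" if not changed and not missing else (changed, missing))
```

A stricter variant would also compare margins or ranks within a tolerance, since two runs can agree on pass/fail while differing in the underlying probabilities.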
Test Case Schema
Each test case in eval.json has this structure:
{
"id": "DOC_001",
"type": "doctrinal_association",
"category": "soteriology",
"difficulty": "medium",
"input": "Paul teaches that the message of the [MASK] is foolishness...",
"targets": ["cross"],
"foils": [],
"acceptable_alternatives": [],
"failure_examples": ["church", "gospel", "law", "bible", "world"],
"pass_condition": "target_in_top_k",
"k": 5,
"reference": "1 Corinthians 1:18",
"reasoning": "The cross as the central message of the gospel...",
"surface_confounder": ""
}
| Field | Description |
|---|---|
| id | Unique identifier within the suite |
| type | One of: doctrinal_association, canonical_knowledge, contrastive_theology |
| category | Theological category (see table above) |
| difficulty | easy, medium, or hard |
| input | The masked sentence. Must contain at least one [MASK] |
| targets | Correct completion(s) for the masked position(s) |
| foils | Deliberately incorrect but plausible completions (contrastive only) |
| acceptable_alternatives | Also-correct completions beyond the primary target |
| failure_examples | Tokens that would indicate the model failed to internalize the domain, even if the primary target is predicted |
| pass_condition | Scoring strategy: target_in_top_k, correct_beats_foil, or all_top_k_in_target_set |
| k | Number of top predictions to consider |
| reference | Source verse or doctrinal concept |
| reasoning | Human-readable explanation of what the case tests and why the foil is wrong (if applicable) |
| surface_confounder | Linguistic surface feature that could mislead a shallow model (if any) |
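A lightweight validator for this schema might look like the sketch below. Which fields are strictly required is an assumption (the document does not say), as is the rule that correct_beats_foil cases need at least one foil; the allowed values are taken from the field descriptions.

```python
REQUIRED_FIELDS = {"id", "type", "category", "difficulty", "input",
                   "targets", "pass_condition", "k"}  # assumed required set
VALID_TYPES = {"doctrinal_association", "canonical_knowledge", "contrastive_theology"}
VALID_DIFFICULTIES = {"easy", "medium", "hard"}
VALID_PASS_CONDITIONS = {"target_in_top_k", "correct_beats_foil",
                         "all_top_k_in_target_set"}


def validate_case(case: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - set(case))]
    if errors:
        return errors
    if case["type"] not in VALID_TYPES:
        errors.append(f"unknown type: {case['type']}")
    if case["difficulty"] not in VALID_DIFFICULTIES:
        errors.append(f"unknown difficulty: {case['difficulty']}")
    if case["pass_condition"] not in VALID_PASS_CONDITIONS:
        errors.append(f"unknown pass_condition: {case['pass_condition']}")
    if "[MASK]" not in case["input"]:
        errors.append("input has no [MASK]")
    if not case["targets"]:
        errors.append("targets is empty")
    if not isinstance(case["k"], int) or case["k"] < 1:
        errors.append("k must be a positive integer")
    # Assumption: contrastive scoring needs something to compare against.
    if case["pass_condition"] == "correct_beats_foil" and not case.get("foils"):
        errors.append("correct_beats_foil requires at least one foil")
    return errors
```

Running it over every entry loaded with `json.load` before an evaluation run would catch malformed cases early, before any model inference is spent on them.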
Design Philosophy
This eval was designed to probe domain-specific MLM behavior, not general linguistic fluency. A general-purpose BERT model may score well on standard MLM benchmarks while producing theologically incoherent completions on biblical text. The three test types target different aspects of that behavior:
- Doctrinal association checks whether the model has absorbed domain-specific co-occurrence patterns: the "language" of theology
- Canonical knowledge checks whether the model has memorized specific verses: the "data" of scripture
- Contrastive theology checks whether the model prefers doctrinally correct completions over plausible foils
The foil-based contrastive cases are the most discriminative: they test whether the model assigns higher probability to a doctrinally correct target than to a surface-level lexical confounder. Results on this suite should be read as evidence about behavior on cases of this shape, not as a general measure of theological understanding. The training corpus and eval suite were authored privately and have not been externally audited, so some train/eval distributional overlap (especially for canonical recall) is expected.