Evaluation Suite: TheoBERT Base MLM Benchmark

A 546-case domain-specific evaluation suite measuring masked language modeling performance on biblical and theological text.

Overview

The evaluation tests whether TheoBERT Base has internalized domain-specific semantics, not just surface-level co-occurrence. It goes beyond a standard perplexity measurement by testing the model against carefully constructed test cases that probe theological precision, canonical recall, and doctrinal discrimination.

Result: 94.7% pass rate (517 / 546), difficulty-weighted score 94.6%.

Test Types

The suite uses three evaluation strategies, each targeting a different aspect of model competence:

1. Doctrinal Association (221 cases, target_in_top_k)

Does the model know what belongs in the blank?

A sentence from scripture or doctrinal writing with one [MASK] token. The correct answer(s) must appear in the model's top-k predictions.

Pass condition: At least one correct or acceptable-alternative token appears in the top-k (k=5) predictions.

Example:

Input:  "For as often as you eat this bread and drink the cup,
        you proclaim the Lord's [MASK] until he comes."
Target: "death"
Pass:   "death" is in top-5 predicted tokens → ✅

These cases span 11 theological categories (bibliology, christology, ecclesiology, eschatology, hamartiology, pneumatology, soteriology, theology proper, spiritual warfare, kingdom theology, Romans road).

2. Canonical Knowledge (138 cases, target_in_top_k)

Does the model know the exact wording of specific verses?

Direct recall of well-known biblical passages. No theological reasoning is required; this tests pure memorization of canonical text.

Train/eval overlap note. The training corpus includes bible text, so canonical-knowledge cases are not held out from the training distribution. The 88.4% pass rate on this category should be interpreted as in-distribution recall, not as evidence of generalization to unseen text.

Pass condition: Same as doctrinal association (correct token in top-5).

Example:

Input:  "Love bears all things, [MASK] all things, hopes all things,
        endures all things."
Target: "believes"
Pass:   "believes" is in top-5 → ✅

This is the hardest category for the model (88.4% pass rate), with most failures clustering around Old Testament proper nouns.

3. Contrastive Theology (187 cases, correct_beats_foil)

Can the model tell the difference between theologically correct and incorrect completions?

A sentence with one [MASK] token, a theologically correct target, and a deliberately wrong foil: a word that is semantically plausible in general English but theologically incorrect in context. The model must assign higher probability to the target than the foil.

Pass condition: P(target) > P(foil). The margin is additionally classified as high (>0.10), medium (>0.02), or low (≤0.02) confidence.

Example:

Input:   "Paul teaches that believers are justified freely by God's [MASK],
         not by human merit or works of the law."
Target:  "grace"
Foil:    "law"
Pass:    P("grace") > P("law") → ✅

The foil "law" is lexically primed by the later phrase "works of the law," making this a hard case: the model must override surface-level co-occurrence with doctrinal understanding. Failure examples like "power", "will", and "plan" (generic divine attributes that miss the specific grace/law antithesis) are tracked as critical_failure signals.

Categories

| Category | Cases | Pass Rate | Description |
|---|---|---|---|
| Pneumatology | 42 | 100% | Holy Spirit: person, work, gifts, indwelling |
| Soteriology | 109 | 98.2% | Salvation: justification, sanctification, atonement |
| Ecclesiology | 40 | 97.5% | Church: body of Christ, ordinances, unity |
| Hamartiology | 34 | 97.1% | Sin: nature, origin, consequences |
| Christology | 84 | 96.4% | Christ: person, natures, incarnation, atonement |
| Eschatology | 36 | 94.4% | Last things: return of Christ, judgment, resurrection |
| Theology proper | 46 | 91.3% | God: attributes, trinity, sovereignty |
| Canonical knowledge | 138 | 88.4% | Verse-level recall from specific biblical passages |
| Bibliology | 6 | n/a | Scripture: inspiration, authority, sufficiency |
| Romans road | 8 | n/a | Evangelistic verses from Romans |
| Spiritual warfare | 2 | n/a | Cosmic conflict, armor of God |
| Kingdom theology | 1 | n/a | Already/not-yet kingdom framework |

Categories with very few cases (bibliology, Romans road, spiritual warfare, kingdom theology) are present in the eval but reported in aggregate; their individual pass rates should be interpreted cautiously.

Difficulty Levels

Each test case is assigned a difficulty rating:

| Difficulty | Cases | Pass Rate | Criteria |
|---|---|---|---|
| Easy | 99 | 94.9% | High-frequency verse, single obvious completion |
| Medium | 275 | 94.9% | Requires domain context but has strong lexical cues |
| Hard | 172 | 94.2% | Multiple plausible completions, adversarial foil priming, rare vocabulary, or multi-token targets |

What Makes a Case "Hard"

  • Foil priming: The foil word appears elsewhere in the same sentence, creating lexical pressure toward the wrong answer (e.g., "law" in a sentence that ends with "works of the law")
  • Multi-piece targets: Words that tokenize into 2+ wordpieces require sequence-level probability scoring across mask positions
  • Rare proper nouns: Old Testament names, places, and transliterated terms with low training frequency
  • Abstract theological distinctions: Cases where the model must discriminate between near-synonyms with different doctrinal implications

Multi-Token and Multi-Piece Handling

When a target word tokenizes into multiple wordpieces (e.g., "Nebuchadnezzar" → 5 subword tokens), the eval automatically:

  1. Expands [MASK] into the required number of mask tokens
  2. Uses beam search over the mask positions to find the target sequence
  3. Matches either exact sequence equality or subsequence containment (for targets that span fewer positions than available masks)

This handles cases like:

  • "sabachthani" (multi-piece from Aramaic transliteration)
  • "iniquity" (3 wordpieces in bert-base-uncased)
  • "propitiation" (4 wordpieces)

Failure examples are also tracked for multi-piece targets: if the model produces a generic failure sequence (e.g., "power", "will") instead of the theologically correct multi-piece target, it's flagged.
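The expansion, search, and matching steps described in this section can be sketched roughly as follows. This is a minimal illustration, not the eval script's code: the function names are hypothetical, the beam search assumes each mask position comes with its own independent log-probability table (the real script may re-score positions jointly), "subsequence containment" is assumed to mean a contiguous run, and the wordpieces in the usage note are made up rather than real tokenizer output.

```python
from typing import Dict, List, Sequence, Tuple

MASK = "[MASK]"


def expand_mask(text: str, n_pieces: int) -> str:
    """Replace the first [MASK] with n_pieces space-separated [MASK] tokens."""
    if MASK not in text:
        raise ValueError("input must contain a [MASK] token")
    return text.replace(MASK, " ".join([MASK] * n_pieces), 1)


def beam_search(
    position_logprobs: List[Dict[str, float]], beam_width: int = 3
) -> List[Tuple[Tuple[str, ...], float]]:
    """Return the best piece sequences, best first.

    Assumption: position_logprobs[i] maps candidate wordpieces at mask
    position i to log-probabilities, and positions are scored independently.
    """
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    for dist in position_logprobs:
        expanded = [
            (seq + (piece,), score + logprob)
            for seq, score in beams
            for piece, logprob in dist.items()
        ]
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]
    return beams


def matches_target(candidate: Sequence[str], target: Sequence[str]) -> bool:
    """True if target equals candidate or occurs in it as a contiguous run."""
    cand, tgt = list(candidate), list(target)
    if not tgt:
        return False
    return any(
        cand[i:i + len(tgt)] == tgt for i in range(len(cand) - len(tgt) + 1)
    )
```

For instance, `expand_mask("the king [MASK] ruled", 3)` yields a sentence with three mask tokens, and `matches_target(["the", "ini", "##qui", "##ty"], ["ini", "##qui", "##ty"])` is true because the (made-up) target pieces appear as a contiguous run.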

Scoring Protocol

target_in_top_k (357 cases)

pass  = correct_token_id ∈ top_k(token_ids)
mrr   = 1 / (rank of first correct token), or 0 if not found
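As an illustration, the two formulas above could be computed from a ranked prediction list like this. The function name and argument layout are hypothetical, and the sketch assumes "not found" means the correct token is absent from whatever ranked list is passed in (the script may instead rank over the full vocabulary).

```python
from typing import Dict, Sequence, Set, Union


def score_target_in_top_k(
    ranked_token_ids: Sequence[int],
    correct_token_ids: Set[int],
    k: int = 5,
) -> Dict[str, Union[bool, float]]:
    """Score one case from predicted token ids ordered best-first.

    correct_token_ids holds the ids of the targets and any acceptable
    alternatives (an assumption about how alternatives are folded in).
    """
    # 1-based rank of the first prediction that counts as correct.
    first_rank = next(
        (
            rank
            for rank, token_id in enumerate(ranked_token_ids, start=1)
            if token_id in correct_token_ids
        ),
        None,
    )
    return {
        "pass": first_rank is not None and first_rank <= k,
        "mrr": 0.0 if first_rank is None else 1.0 / first_rank,
    }
```

With a ranked list `[7, 3, 9, 1, 4, 8]` and correct ids `{9, 8}`, the first correct token sits at rank 3, so the case passes at k=5 with a reciprocal rank of 1/3.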

correct_beats_foil (187 cases)

pass     = P(target) > P(foil)
margin   = P(target) - P(foil)
confidence = high   if margin > 0.10
           | medium if margin > 0.02
           | low    otherwise
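A direct transcription of these rules into Python might look like the sketch below. The function name is hypothetical; the thresholds are the ones stated above. Read literally, the rules assign "low" confidence to every failing case as well, since a negative margin is not greater than 0.02.

```python
from typing import Dict, Union


def score_correct_beats_foil(
    p_target: float, p_foil: float
) -> Dict[str, Union[bool, float, str]]:
    """Apply the pass rule and the margin-based confidence label."""
    margin = p_target - p_foil
    if margin > 0.10:
        confidence = "high"
    elif margin > 0.02:
        confidence = "medium"
    else:
        confidence = "low"  # also covers failing cases (margin <= 0)
    return {
        "pass": p_target > p_foil,
        "margin": margin,
        "confidence": confidence,
    }
```

For example, probabilities of 0.42 for the target and 0.05 for the foil give a passing case with high confidence, while 0.06 against 0.03 passes with medium confidence.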

all_top_k_in_target_set (2 cases)

pass = (valid_tokens_in_top_k / k) ≥ 0.8

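Used when multiple targets are equally acceptable and the model should surface the theological cluster rather than a single token. These 2 cases are counted within the 359 doctrinal-association and canonical-knowledge cases, which is why target_in_top_k covers 357 cases rather than 359.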
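A small sketch of this rule, with a hypothetical function name and an invented token set used only for illustration:

```python
from typing import Dict, Sequence, Set, Union


def score_all_top_k_in_target_set(
    top_k_tokens: Sequence[str],
    valid_tokens: Set[str],
    threshold: float = 0.8,
) -> Dict[str, Union[bool, float]]:
    """Pass when at least `threshold` of the top-k tokens are valid targets."""
    k = len(top_k_tokens)
    if k == 0:
        raise ValueError("top_k_tokens must not be empty")
    hits = sum(1 for token in top_k_tokens if token in valid_tokens)
    fraction = hits / k
    return {"pass": fraction >= threshold, "fraction": fraction}
```

With k=5, four valid tokens out of five gives a fraction of 0.8, which meets the threshold; three out of five does not.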

Error Classification

When a test case fails, the eval attempts to classify why:

| Error type | Meaning |
|---|---|
| near_miss | Correct token ranked at k+1 or k+2 (nearly passed) |
| generic_over_theological | Top predictions are generic/universal words (e.g., "power", "love", "will") rather than theologically specific terms |
| wrong_semantic_cluster | Wrong token is still semantically related but theologically incorrect |
| total_miss | Correct token ranked below position 20 (model has essentially no signal) |
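One way these labels could be assigned is sketched below. Several parts are assumptions rather than documented behavior: the order in which the rules are checked, the "majority of the top-k are generic" test, the contents of the generic word list (seeded here with the three examples above), and the use of wrong_semantic_cluster as the fallback label, since judging semantic relatedness would need a resource this document does not describe.

```python
from typing import Optional, Sequence, Set

# Assumption: seeded from the examples in the table; the real list is not documented here.
GENERIC_WORDS: Set[str] = {"power", "love", "will"}


def classify_failure(
    target_rank: Optional[int],
    top_predictions: Sequence[str],
    k: int = 5,
    generic_words: Set[str] = GENERIC_WORDS,
) -> str:
    """Label a failed case. target_rank is 1-based, or None if never ranked."""
    # near_miss: correct token landed at k+1 or k+2.
    if target_rank is not None and k < target_rank <= k + 2:
        return "near_miss"
    # generic_over_theological: most of the top-k are generic words (assumed rule).
    window = top_predictions[:k]
    generic_hits = sum(1 for token in window if token in generic_words)
    if generic_hits * 2 > len(window):
        return "generic_over_theological"
    # total_miss: correct token below position 20, or not ranked at all.
    if target_rank is None or target_rank > 20:
        return "total_miss"
    # Fallback label (assumed): the wrong answer is treated as a related-but-wrong term.
    return "wrong_semantic_cluster"
```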

Critical Failures

A critical failure is a test case where, regardless of pass/fail status, one of the explicitly listed failure_examples tokens appears in the top-3 predictions. This signals that the model is drifting toward generic religious language rather than precise theological vocabulary.

Critical failure rate: not listed in this document; it is extracted from the full results JSON.
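The check itself is simple enough to sketch. The function names and the `ranked_tokens` key below are hypothetical; the layout of the actual results JSON is not documented here, so the per-case dictionaries are an assumed shape.

```python
from typing import Dict, Iterable, List, Sequence


def is_critical_failure(
    ranked_tokens: Sequence[str],
    failure_examples: Iterable[str],
    top_n: int = 3,
) -> bool:
    """True if any listed failure example appears in the top-n predictions."""
    listed = set(failure_examples)
    return any(token in listed for token in ranked_tokens[:top_n])


def critical_failure_rate(cases: List[Dict[str, list]]) -> float:
    """Fraction of cases flagged as critical failures (0.0 for an empty list).

    Assumption: each case dict has "ranked_tokens" (best first) and
    "failure_examples"; these keys are illustrative only.
    """
    if not cases:
        return 0.0
    flagged = sum(
        is_critical_failure(case["ranked_tokens"], case["failure_examples"])
        for case in cases
    )
    return flagged / len(cases)
```

Note that a case can pass and still be flagged: if the top-3 predictions were "cross", "gospel", "word" and "gospel" is a listed failure example, the target is ranked first but the case still counts as a critical failure.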

Running the Evaluation

Install the dependencies first. uv pip install is recommended for speed and resolver behavior, but plain pip install works too:

uv pip install -r requirements.txt
# or: pip install -r requirements.txt

# Default: load model.safetensors from repo root, eval against eval.json
python scripts/mlm_eval_safetensors.py

# With GPU
python scripts/mlm_eval_safetensors.py --device cuda

# Compare against a previous run (checks fp16 round-trip fidelity)
python scripts/mlm_eval_safetensors.py --compare eval_results/d12_encoder_mlm_eval.json

# Custom paths
python scripts/mlm_eval_safetensors.py \
  --repo-dir /path/to/repo \
  --eval-path /path/to/eval.json \
  --device cuda

# Adjust top-k and sampling
python scripts/mlm_eval_safetensors.py --k 10 --n-samples 5

The script writes results to eval_results/safetensors_mlm_eval.json by default.

Verifying fp16 Fidelity

The --compare flag diffs the safetensors (fp16→fp32) results against a prior evaluation of the original fp32 .pt checkpoint. If every test case produces the same pass/fail outcome, the fp16 round trip changed no decision on this suite. That is evidence that fp16 storage does not affect the model's predictions on these cases; it does not show that the weights or logits are bit-identical.
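The idea behind the comparison can be illustrated with a small sketch. It does not reproduce what --compare does internally: the function name is hypothetical, and it assumes each run has already been reduced to a mapping from case id to a pass/fail boolean, which is not necessarily how the results JSON is laid out.

```python
from typing import Dict, List


def diff_outcomes(
    baseline: Dict[str, bool], candidate: Dict[str, bool]
) -> Dict[str, List[str]]:
    """Compare per-case pass/fail outcomes between two evaluation runs."""
    shared = baseline.keys() & candidate.keys()
    return {
        # Cases present in both runs whose outcome changed.
        "flipped": sorted(cid for cid in shared if baseline[cid] != candidate[cid]),
        # Cases that appear in only one run, which would also make the runs differ.
        "only_in_baseline": sorted(baseline.keys() - candidate.keys()),
        "only_in_candidate": sorted(candidate.keys() - baseline.keys()),
    }
```

An empty result in all three lists means the two runs agree on every case. Agreement at this level says nothing about how close the underlying probabilities are, so a stricter check would also compare margins or ranks.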

Test Case Schema

Each test case in eval.json has this structure:

{
  "id": "DOC_001",
  "type": "doctrinal_association",
  "category": "soteriology",
  "difficulty": "medium",
  "input": "Paul teaches that the message of the [MASK] is foolishness...",
  "targets": ["cross"],
  "foils": [],
  "acceptable_alternatives": [],
  "failure_examples": ["church", "gospel", "law", "bible", "world"],
  "pass_condition": "target_in_top_k",
  "k": 5,
  "reference": "1 Corinthians 1:18",
  "reasoning": "The cross as the central message of the gospel...",
  "surface_confounder": ""
}

| Field | Description |
|---|---|
| id | Unique identifier within the suite |
| type | One of: doctrinal_association, canonical_knowledge, contrastive_theology |
| category | Theological category (see table above) |
| difficulty | easy, medium, or hard |
| input | The masked sentence. Must contain at least one [MASK] |
| targets | Correct completion(s) for the masked position(s) |
| foils | Deliberately incorrect but plausible completions (contrastive only) |
| acceptable_alternatives | Also-correct completions beyond the primary target |
| failure_examples | Tokens that would indicate the model failed to internalize the domain, even if the primary target is predicted |
| pass_condition | Scoring strategy: target_in_top_k, correct_beats_foil, or all_top_k_in_target_set |
| k | Number of top predictions to consider |
| reference | Source verse or doctrinal concept |
| reasoning | Human-readable explanation of what the case tests and why the foil is wrong (if applicable) |
| surface_confounder | Linguistic surface feature that could mislead a shallow model (if any) |
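A lightweight validator for this schema could look like the sketch below. It is not part of the repository as far as this document shows: the function name is hypothetical, and two checks are assumptions rather than documented rules, namely that every field is required and that a correct_beats_foil case must list at least one foil.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = {
    "id", "type", "category", "difficulty", "input", "targets", "foils",
    "acceptable_alternatives", "failure_examples", "pass_condition", "k",
    "reference", "reasoning", "surface_confounder",
}
TEST_TYPES = {"doctrinal_association", "canonical_knowledge", "contrastive_theology"}
DIFFICULTIES = {"easy", "medium", "hard"}
PASS_CONDITIONS = {"target_in_top_k", "correct_beats_foil", "all_top_k_in_target_set"}


def validate_case(case: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one test case (empty if none)."""
    problems: List[str] = []
    missing = REQUIRED_FIELDS - case.keys()
    if missing:  # assumption: all documented fields are mandatory
        problems.append("missing fields: " + ", ".join(sorted(missing)))
    if case.get("type") not in TEST_TYPES:
        problems.append("unknown type")
    if case.get("difficulty") not in DIFFICULTIES:
        problems.append("unknown difficulty")
    if case.get("pass_condition") not in PASS_CONDITIONS:
        problems.append("unknown pass_condition")
    if "[MASK]" not in str(case.get("input", "")):
        problems.append("input has no [MASK]")
    if not case.get("targets"):
        problems.append("targets is empty")
    k = case.get("k")
    if not isinstance(k, int) or k < 1:
        problems.append("k must be a positive integer")
    # Assumption: contrastive scoring needs at least one foil to compare against.
    if case.get("pass_condition") == "correct_beats_foil" and not case.get("foils"):
        problems.append("correct_beats_foil case has no foils")
    return problems
```

Running it over every entry after `json.load` on eval.json and printing the non-empty results would surface malformed cases before any model is loaded.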

Design Philosophy

This eval was designed to probe domain-specific MLM behavior, not general linguistic fluency. A general-purpose BERT model may score well on standard MLM benchmarks while producing theologically incoherent completions on biblical text. The three test types target different aspects of that behavior:

  1. Doctrinal association checks whether the model has absorbed domain-specific co-occurrence patterns: the "language" of theology
  2. Canonical knowledge checks whether the model has memorized specific verses: the "data" of scripture
  3. Contrastive theology checks whether the model prefers doctrinally correct completions over plausible foils

The foil-based contrastive cases are the most discriminative: they test whether the model assigns higher probability to a doctrinally correct target than to a surface-level lexical confounder. Results on this suite should be read as evidence about behavior on cases of this shape, not as a general measure of theological understanding. The training corpus and eval suite were authored privately and have not been externally audited, so some train/eval distributional overlap (especially for canonical recall) is expected.