Initial release: TheoBERT Base — biblical-domain masked language model

Evaluation Suite: TheoBERT Base MLM Benchmark

A 546-case domain-specific evaluation suite measuring masked language modeling performance on biblical and theological text.

Overview

The evaluation tests whether TheoBERT Base has internalized domain-specific semantics — not just surface-level co-occurrence. It goes beyond a standard perplexity measurement by testing the model against carefully constructed test cases that probe theological precision, canonical recall, and doctrinal discrimination.

Result: 94.7% pass rate (517 / 546), difficulty-weighted score 94.6%.

Test Types

The suite uses three evaluation strategies, each targeting a different aspect of model competence:

1. Doctrinal Association (221 cases, `target_in_top_k`)

Does the model know what belongs in the blank?

A sentence from scripture or doctrinal writing with one [MASK] token. The correct answer(s) must appear in the model's top-k predictions.

Pass condition: At least one correct or acceptable-alternative token appears in the top-k (k=5) predictions.

Example:

Input:  "For as often as you eat this bread and drink the cup,
        you proclaim the Lord's [MASK] until he comes."
Target: "death"
Pass:   "death" is in top-5 predicted tokens → ✅

These cases span 11 theological categories (bibliology, christology, ecclesiology, eschatology, hamartiology, pneumatology, soteriology, theology proper, spiritual warfare, kingdom theology, Romans road).

2. Canonical Knowledge (138 cases, `target_in_top_k`)

Does the model know the exact wording of specific verses?

Direct recall of well-known biblical passages. No theological reasoning required — this tests pure memorization of canonical text.

Train/eval overlap note. The training corpus includes Bible text, so canonical-knowledge cases are not held out from the training distribution. The 88.4% pass rate on this category should be interpreted as in-distribution recall, not as evidence of generalization to unseen text.

Pass condition: Same as doctrinal association (correct token in top-5).

Example:

Input:  "Love bears all things, [MASK] all things, hopes all things,
        endures all things."
Target: "believes"
Pass:   "believes" is in top-5 → ✅

This is the hardest category for the model (88.4% pass rate), with most failures clustering around Old Testament proper nouns.

3. Contrastive Theology (187 cases, `correct_beats_foil`)

Can the model tell the difference between theologically correct and incorrect completions?

A sentence with one [MASK] token, a theologically correct target, and a deliberately wrong foil — a word that is semantically plausible in general English but theologically incorrect in context. The model must assign higher probability to the target than the foil.

Pass condition: P(target) > P(foil). The margin is additionally classified as high (>0.10), medium (>0.02), or low (≤0.02) confidence.

Example:

Input:   "Paul teaches that believers are justified freely by God's [MASK],
         not by human merit or works of the law."
Target:  "grace"
Foil:    "law"
Pass:    P("grace") > P("law") → ✅

The foil "law" is lexically primed by the later phrase "works of the law," making this a hard case (the model must override surface-level co-occurrence with doctrinal understanding). Failure examples like "power", "will", "plan" — generic divine attributes that miss the specific grace/law antithesis — are tracked as critical_failure signals.

Categories

| Category | Cases | Pass Rate | Description |
|---|---|---|---|
| Pneumatology | 42 | 100% | Holy Spirit: person, work, gifts, indwelling |
| Soteriology | 109 | 98.2% | Salvation: justification, sanctification, atonement |
| Ecclesiology | 40 | 97.5% | Church: body of Christ, ordinances, unity |
| Hamartiology | 34 | 97.1% | Sin: nature, origin, consequences |
| Christology | 84 | 96.4% | Christ: person, natures, incarnation, atonement |
| Eschatology | 36 | 94.4% | Last things: return of Christ, judgment, resurrection |
| Theology proper | 46 | 91.3% | God: attributes, trinity, sovereignty |
| Canonical knowledge | 138 | 88.4% | Verse-level recall from specific biblical passages |
| Bibliology | 6 | — | Scripture: inspiration, authority, sufficiency |
| Romans road | 8 | — | Evangelistic verses from Romans |
| Spiritual warfare | 2 | — | Cosmic conflict, armor of God |
| Kingdom theology | 1 | — | Already/not-yet kingdom framework |

Difficulty Levels

Each test case is assigned a difficulty rating:

| Difficulty | Cases | Pass Rate | Criteria |
|---|---|---|---|
| Easy | 99 | 94.9% | High-frequency verse, single obvious completion |
| Medium | 275 | 94.9% | Requires domain context but has strong lexical cues |
| Hard | 172 | 94.2% | Multiple plausible completions, adversarial foil priming, rare vocabulary, or multi-token targets |
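The per-difficulty rates can be cross-checked against the headline result. The sketch below infers a pass count for each level by rounding cases × rate; those counts are not stated in this README, so treat them as an inference rather than reported data. It then checks that they add up to the 517 / 546 figure given in the Overview.

```python
# Case counts and pass rates copied from the difficulty table.
levels = {
    "easy": (99, 0.949),
    "medium": (275, 0.949),
    "hard": (172, 0.942),
}

# Inferred, not reported: nearest whole number of passing cases per level.
implied_passes = {name: round(n * rate) for name, (n, rate) in levels.items()}

total_cases = sum(n for n, _ in levels.values())
total_passes = sum(implied_passes.values())
overall_rate = total_passes / total_cases

print(implied_passes)                                   # {'easy': 94, 'medium': 261, 'hard': 162}
print(total_cases, total_passes, f"{overall_rate:.1%}")  # 546 517 94.7%
```

The difficulty-weighted score (94.6%) is not reproduced here because the weights are not documented in this file.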

What Makes a Case "Hard"

- Foil priming: the foil word appears elsewhere in the same sentence, creating lexical pressure toward the wrong answer (e.g., "law" in a sentence that ends with "works of the law")
- Multi-piece targets: words that tokenize into 2+ wordpieces require sequence-level probability scoring across mask positions
- Rare proper nouns: Old Testament names, places, and transliterated terms with low training frequency
- Abstract theological distinctions: cases where the model must discriminate between near-synonyms with different doctrinal implications

Multi-Token and Multi-Piece Handling

When a target word tokenizes into multiple wordpieces (e.g., "Nebuchadnezzar" → 5 subword tokens), the eval automatically:

- Expands [MASK] into the required number of mask tokens
- Uses beam search over the mask positions to find the target sequence
- Matches either exact sequence equality or subsequence containment (for targets that span fewer positions than available masks)

This handles cases like:

- "sabachthani" (multi-piece from Aramaic transliteration)
- "iniquity" (3 wordpieces in bert-base-uncased)
- "propitiation" (4 wordpieces)

Failure examples are also tracked for multi-piece targets — if the model produces a generic failure sequence (e.g., "power", "will") instead of the theologically correct multi-piece target, it's flagged.
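The three steps described in this section (mask expansion, beam search, sequence matching) might look roughly like the sketch below. It is not the code in scripts/mlm_eval_safetensors.py. The function names are hypothetical, the per-position probabilities are invented, each mask position is scored independently (a simplification; the real script may re-score after filling each position), "subsequence containment" is read as a contiguous run, and the wordpiece split shown for "iniquity" is illustrative rather than the actual bert-base-uncased split.

```python
import math

MASK = "[MASK]"


def expand_mask(text, n_pieces):
    """Replace the first [MASK] with n_pieces space-separated mask tokens."""
    return text.replace(MASK, " ".join([MASK] * n_pieces), 1)


def beam_search(position_probs, beam_width=3):
    """Return the best wordpiece sequences across mask positions, best first.

    position_probs holds one {wordpiece: probability} dict per mask position.
    Assumption: positions are scored independently of each other.
    """
    beams = [((), 0.0)]  # (wordpiece sequence, summed log-probability)
    for probs in position_probs:
        candidates = [
            (seq + (piece,), score + math.log(p))
            for seq, score in beams
            for piece, p in probs.items()
            if p > 0.0
        ]
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def matches_target(predicted, target):
    """True if target equals predicted or occurs as a contiguous run inside it."""
    predicted, target = tuple(predicted), tuple(target)
    n = len(target)
    return any(predicted[i:i + n] == target for i in range(len(predicted) - n + 1))


# Hypothetical case and invented probabilities, for illustration only.
target_pieces = ("in", "##iq", "##uity")
masked = expand_mask(
    "Wash me thoroughly from my [MASK], and cleanse me from my sin.",
    len(target_pieces),
)
position_probs = [
    {"in": 0.5, "sin": 0.4, "power": 0.1},
    {"##iq": 0.6, "##s": 0.3, "##ful": 0.1},
    {"##uity": 0.7, "##ness": 0.2, "##s": 0.1},
]
best_sequence = beam_search(position_probs)[0][0]
print(masked)
print(best_sequence, matches_target(best_sequence, target_pieces))
```

Summing log-probabilities rather than multiplying raw probabilities keeps long sequences from underflowing.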

Scoring Protocol

target_in_top_k (357 cases)

pass  = correct_token_id ∈ top_k(token_ids)
mrr   = 1 / (rank of first correct token), or 0 if not found
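As a minimal sketch of the two formulas above, assuming predictions arrive as a list of token strings ordered from most to least probable (the real script works on token IDs, and the function names here are hypothetical):

```python
def rank_of_first_correct(ranked_tokens, correct_tokens):
    """Return the 1-based rank of the first correct token, or None if absent."""
    correct = set(correct_tokens)
    for rank, token in enumerate(ranked_tokens, start=1):
        if token in correct:
            return rank
    return None


def score_target_in_top_k(ranked_tokens, targets, acceptable_alternatives=(), k=5):
    """Pass if any correct token is within the top k; MRR comes from its rank."""
    correct = list(targets) + list(acceptable_alternatives)
    rank = rank_of_first_correct(ranked_tokens, correct)
    return {
        "pass": rank is not None and rank <= k,
        # Assumption: MRR uses the rank within whatever list is supplied,
        # so a token ranked just outside the top k still earns partial credit.
        "mrr": 0.0 if rank is None else 1.0 / rank,
    }


# Invented ranking for the "proclaim the Lord's [MASK]" example.
ranked = ["body", "death", "name", "coming", "word", "supper"]
result = score_target_in_top_k(ranked, targets=["death"], k=5)
print(result)  # {'pass': True, 'mrr': 0.5}
```

Whether the script computes MRR over the full vocabulary ranking or only over the top k is not stated here; the sketch leaves that to the caller by scoring the list it is given.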

correct_beats_foil (187 cases)

pass     = P(target) > P(foil)
margin   = P(target) - P(foil)
confidence = high   if margin > 0.10
           | medium if margin > 0.02
           | low    otherwise
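The same rule, written as a small Python function. This is a sketch of the pseudocode above, with a hypothetical function name and invented probabilities:

```python
def score_correct_beats_foil(p_target, p_foil):
    """Compare target and foil probabilities and bucket the margin."""
    margin = p_target - p_foil
    if margin > 0.10:
        confidence = "high"
    elif margin > 0.02:
        confidence = "medium"
    else:
        confidence = "low"  # includes every failing case (margin <= 0)
    return {"pass": p_target > p_foil, "margin": margin, "confidence": confidence}


# Invented probabilities for the grace/law example.
outcome = score_correct_beats_foil(p_target=0.62, p_foil=0.07)
print(outcome["pass"], outcome["confidence"])  # True high
```

Note that a case can pass with low confidence: any positive margin passes, even one of 0.02 or less.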

all_top_k_in_target_set (2 cases)

pass = (valid_tokens_in_top_k / k) ≥ 0.8

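Used when multiple targets are equally acceptable and the model should surface the theological cluster rather than a single token. Together with the 357 `target_in_top_k` cases, these 2 cases account for the 359 doctrinal-association and canonical-knowledge cases (221 + 138), so 2 of the cases under those two test types use this condition instead of `target_in_top_k`.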
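A sketch of this condition, assuming the top-k predictions and the set of valid tokens are both given as strings. The function name and the example word lists are hypothetical and not taken from eval.json:

```python
def score_all_top_k_in_target_set(top_k_tokens, valid_tokens, threshold=0.8):
    """Pass if at least `threshold` of the top-k predictions are valid targets."""
    k = len(top_k_tokens)
    if k == 0:
        raise ValueError("top_k_tokens must not be empty")
    valid = set(valid_tokens)
    hits = sum(1 for token in top_k_tokens if token in valid)
    fraction = hits / k
    return {"pass": fraction >= threshold, "fraction": fraction}


# Invented example: a cluster of equally acceptable completions.
valid_cluster = ["love", "joy", "peace", "patience", "kindness", "goodness"]
top_5 = ["love", "joy", "peace", "patience", "power"]
cluster_result = score_all_top_k_in_target_set(top_5, valid_cluster)
print(cluster_result)  # {'pass': True, 'fraction': 0.8}
```

With k=5 the 0.8 threshold tolerates exactly one off-cluster prediction.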

Error Classification

When a test case fails, the eval attempts to classify why:

| Error type | Meaning |
|---|---|
| `near_miss` | Correct token ranked at k+1 or k+2 — nearly passed |
| `generic_over_theological` | Top predictions are generic/universal words (e.g., "power", "love", "will") rather than theologically specific terms |
| `wrong_semantic_cluster` | Wrong token is still semantically related but theologically incorrect |
| `total_miss` | Correct token ranked below position 20 — model has essentially no signal |
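One way these labels could be assigned is sketched below. Only the rank thresholds (k+1 or k+2, and below position 20) come from the table. The order in which the checks run, the "more than half of the top k are generic" rule, the generic word list, and the fallback to `wrong_semantic_cluster` are all assumptions made for illustration, and the function name is hypothetical.

```python
# The three words quoted in the table; the script's real list is not documented here.
GENERIC_WORDS = frozenset({"power", "love", "will"})


def classify_failure(rank, top_predictions, k=5, generic_words=GENERIC_WORDS):
    """Label a failed case.

    rank is the 1-based rank of the correct token, or None if it was not found.
    """
    if rank is not None and k < rank <= k + 2:
        return "near_miss"
    if rank is None or rank > 20:
        return "total_miss"
    window = top_predictions[:k]
    generic_hits = sum(1 for token in window if token in generic_words)
    if generic_hits * 2 > len(window):  # assumption: a majority counts as generic drift
        return "generic_over_theological"
    return "wrong_semantic_cluster"  # assumption: used as the fallback label


print(classify_failure(6, ["mercy", "law", "spirit", "faith", "hope"]))   # near_miss
print(classify_failure(12, ["power", "love", "will", "spirit", "hope"]))  # generic_over_theological
```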

Critical Failures

A critical failure is a test case where, regardless of pass/fail status, one of the explicitly listed failure_examples tokens appears in the top-3 predictions. This signals that the model is drifting toward generic religious language rather than precise theological vocabulary.
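The definition above is independent of the pass condition, which the following sketch makes explicit. The function names and the ranked predictions are hypothetical; the failure_examples list is the one shown for DOC_001 in the schema section.

```python
def is_critical_failure(ranked_tokens, failure_examples, top_n=3):
    """True if any listed failure example appears in the top-n predictions."""
    flagged = set(failure_examples)
    return any(token in flagged for token in ranked_tokens[:top_n])


def critical_failure_rate(cases):
    """cases is a list of (ranked_tokens, failure_examples) pairs."""
    if not cases:
        return 0.0
    flagged = sum(1 for ranked, examples in cases if is_critical_failure(ranked, examples))
    return flagged / len(cases)


failure_examples = ["church", "gospel", "law", "bible", "world"]

# Invented rankings: the first puts the target "cross" at rank 1 (a pass)
# but is still flagged, because "gospel" sits at rank 2.
passing_but_flagged = ["cross", "gospel", "lord", "word", "kingdom"]
clean = ["cross", "lord", "spirit", "gospel", "kingdom"]

print(is_critical_failure(passing_but_flagged, failure_examples))  # True
print(is_critical_failure(clean, failure_examples))                # False
```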

Critical failure rate: not reported in this document; it can be extracted from the full results JSON.

Running the Evaluation

Install the dependencies first. uv pip install is recommended for speed and resolver behavior, but plain pip install works too:

uv pip install -r requirements.txt
# or: pip install -r requirements.txt

# Default: load model.safetensors from repo root, eval against eval.json
python scripts/mlm_eval_safetensors.py

# With GPU
python scripts/mlm_eval_safetensors.py --device cuda

# Compare against a previous run (checks fp16 round-trip fidelity)
python scripts/mlm_eval_safetensors.py --compare eval_results/d12_encoder_mlm_eval.json

# Custom paths
python scripts/mlm_eval_safetensors.py \
  --repo-dir /path/to/repo \
  --eval-path /path/to/eval.json \
  --device cuda

# Adjust top-k and sampling
python scripts/mlm_eval_safetensors.py --k 10 --n-samples 5

The script writes results to eval_results/safetensors_mlm_eval.json by default.

Verifying fp16 Fidelity

The --compare flag diffs the safetensors (fp16→fp32) results against a prior evaluation of the original fp32 .pt checkpoint. If every test case produces the same pass/fail outcome, storing the weights in fp16 changed no outcome on this suite. That is evidence that fp16 rounding does not affect the evaluated predictions; it does not show that the stored weights are bit-identical to the fp32 originals.
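A comparison of this kind can be sketched as follows. The layout of the results JSON is not documented in this README, so the sketch assumes a list of records with hypothetical "id" and "passed" keys; the function names and the second case ID are also made up for the example.

```python
import json


def outcome_by_id(results):
    """Map each case id to its boolean pass/fail outcome."""
    return {record["id"]: bool(record["passed"]) for record in results}


def diff_outcomes(baseline, candidate):
    """Report cases whose outcome flipped and cases present in only one run."""
    base, cand = outcome_by_id(baseline), outcome_by_id(candidate)
    flipped = sorted(cid for cid in base.keys() & cand.keys() if base[cid] != cand[cid])
    missing = sorted(base.keys() ^ cand.keys())
    return {
        "flipped": flipped,
        "missing": missing,
        "identical": not flipped and not missing,
    }


# Invented results standing in for the fp32 and fp16 runs.
fp32_run = json.loads('[{"id": "DOC_001", "passed": true}, {"id": "CAN_014", "passed": false}]')
fp16_run = json.loads('[{"id": "DOC_001", "passed": true}, {"id": "CAN_014", "passed": false}]')

report = diff_outcomes(fp32_run, fp16_run)
print(report)  # {'flipped': [], 'missing': [], 'identical': True}
```

Comparing pass/fail outcomes is deliberately coarse. A stricter check would also compare ranks or probability margins per case, which could reveal small shifts that never cross a pass threshold.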

Test Case Schema

Each test case in eval.json has this structure:

{
  "id": "DOC_001",
  "type": "doctrinal_association",
  "category": "soteriology",
  "difficulty": "medium",
  "input": "Paul teaches that the message of the [MASK] is foolishness...",
  "targets": ["cross"],
  "foils": [],
  "acceptable_alternatives": [],
  "failure_examples": ["church", "gospel", "law", "bible", "world"],
  "pass_condition": "target_in_top_k",
  "k": 5,
  "reference": "1 Corinthians 1:18",
  "reasoning": "The cross as the central message of the gospel...",
  "surface_confounder": ""
}

| Field | Description |
|---|---|
| `id` | Unique identifier within the suite |
| `type` | One of: `doctrinal_association`, `canonical_knowledge`, `contrastive_theology` |
| `category` | Theological category (see table above) |
| `difficulty` | `easy`, `medium`, or `hard` |
| `input` | The masked sentence. Must contain at least one `[MASK]` |
| `targets` | Correct completion(s) for the masked position(s) |
| `foils` | Deliberately incorrect but plausible completions (contrastive only) |
| `acceptable_alternatives` | Also-correct completions beyond the primary target |
| `failure_examples` | Tokens that would indicate the model failed to internalize the domain, even if the primary target is predicted |
| `pass_condition` | Scoring strategy: `target_in_top_k`, `correct_beats_foil`, or `all_top_k_in_target_set` |
| `k` | Number of top predictions to consider |
| `reference` | Source verse or doctrinal concept |
| `reasoning` | Human-readable explanation of what the case tests and why the foil is wrong (if applicable) |
| `surface_confounder` | Linguistic surface feature that could mislead a shallow model (if any) |
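When adding cases, a quick structural check against the field descriptions can catch mistakes early. The validator below is a hypothetical helper, not part of the repository. It assumes every listed field is required and that a `correct_beats_foil` case needs at least one foil, neither of which is stated explicitly in this document.

```python
REQUIRED_FIELDS = {
    "id", "type", "category", "difficulty", "input", "targets", "foils",
    "acceptable_alternatives", "failure_examples", "pass_condition", "k",
    "reference", "reasoning", "surface_confounder",
}
TYPES = {"doctrinal_association", "canonical_knowledge", "contrastive_theology"}
DIFFICULTIES = {"easy", "medium", "hard"}
PASS_CONDITIONS = {"target_in_top_k", "correct_beats_foil", "all_top_k_in_target_set"}


def validate_case(case):
    """Return a list of problems; an empty list means the case looks well-formed."""
    problems = []
    missing = REQUIRED_FIELDS - set(case)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if case.get("type") not in TYPES:
        problems.append("unknown type")
    if case.get("difficulty") not in DIFFICULTIES:
        problems.append("unknown difficulty")
    if case.get("pass_condition") not in PASS_CONDITIONS:
        problems.append("unknown pass_condition")
    if "[MASK]" not in case.get("input", ""):
        problems.append("input has no [MASK]")
    if not case.get("targets"):
        problems.append("no targets")
    if case.get("pass_condition") == "correct_beats_foil" and not case.get("foils"):
        problems.append("correct_beats_foil case has no foils")
    k = case.get("k")
    if not isinstance(k, int) or k < 1:
        problems.append("k must be a positive integer")
    return problems


# The DOC_001 example from above, with its elided strings kept as-is.
doc_001 = {
    "id": "DOC_001", "type": "doctrinal_association", "category": "soteriology",
    "difficulty": "medium",
    "input": "Paul teaches that the message of the [MASK] is foolishness...",
    "targets": ["cross"], "foils": [], "acceptable_alternatives": [],
    "failure_examples": ["church", "gospel", "law", "bible", "world"],
    "pass_condition": "target_in_top_k", "k": 5,
    "reference": "1 Corinthians 1:18",
    "reasoning": "The cross as the central message of the gospel...",
    "surface_confounder": "",
}
print(validate_case(doc_001))  # []
```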

Design Philosophy

This eval was designed to probe domain-specific MLM behavior, not general linguistic fluency. A general-purpose BERT model may score well on standard MLM benchmarks while producing theologically incoherent completions on biblical text. The three test types target different aspects of that behavior:

- Doctrinal association checks whether the model has absorbed domain-specific co-occurrence patterns (the "language" of theology)
- Canonical knowledge checks whether the model has memorized specific verses (the "data" of scripture)
- Contrastive theology checks whether the model prefers doctrinally correct completions over plausible foils

The foil-based contrastive cases are the most discriminative: they test whether the model assigns higher probability to a doctrinally correct target than to a surface-level lexical confounder. Results on this suite should be read as evidence about behavior on cases of this shape, not as a general measure of theological understanding. The training corpus and eval suite were authored privately and have not been externally audited, so some train/eval distributional overlap (especially for canonical recall) is expected.

toranb/theo-bert-base
