Initial release: TheoBERT Base — biblical-domain masked language model

	# Evaluation Suite: TheoBERT Base MLM Benchmark

	A 546-case domain-specific evaluation suite measuring masked language modeling performance on biblical and theological text.

	## Overview

	The evaluation tests whether TheoBERT Base has internalized domain-specific semantics — not just surface-level co-occurrence. It goes beyond a standard perplexity measurement by testing the model against carefully constructed test cases that probe theological precision, canonical recall, and doctrinal discrimination.

	Result: 94.7% pass rate (517 / 546), difficulty-weighted score 94.6%.

	## Test Types

	The suite uses three evaluation strategies, each targeting a different aspect of model competence:

	### 1. Doctrinal Association (221 cases, `target_in_top_k`)

	> Does the model know what belongs in the blank?

	A sentence from scripture or doctrinal writing with one `[MASK]` token. The correct answer(s) must appear in the model's top-k predictions.

	Pass condition: At least one correct or acceptable-alternative token appears in the top-k (k=5) predictions.

	Example:
	```
	Input: "For as often as you eat this bread and drink the cup,
	you proclaim the Lord's [MASK] until he comes."
	Target: "death"
	Pass: "death" is in top-5 predicted tokens → ✅
	```

	These cases span 11 theological categories (bibliology, christology, ecclesiology, eschatology, hamartiology, pneumatology, soteriology, theology proper, spiritual warfare, kingdom theology, Romans road).

	### 2. Canonical Knowledge (138 cases, `target_in_top_k`)

	> Does the model know the exact wording of specific verses?

	Direct recall of well-known biblical passages. No theological reasoning required — this tests pure memorization of canonical text.

	> Train/eval overlap note. The training corpus includes Bible text, so canonical-knowledge cases are not held out from the training distribution. The 88.4% pass rate on this category should be interpreted as in-distribution recall, not as evidence of generalization to unseen text.

	Pass condition: Same as doctrinal association (correct token in top-5).

	Example:
	```
	Input: "Love bears all things, [MASK] all things, hopes all things,
	endures all things."
	Target: "believes"
	Pass: "believes" is in top-5 → ✅
	```

	This is the hardest category for the model (88.4% pass rate), with most failures clustering around Old Testament proper nouns.

	### 3. Contrastive Theology (187 cases, `correct_beats_foil`)

	> Can the model tell the difference between theologically correct and incorrect completions?

	A sentence with one `[MASK]` token, a theologically correct target, and a deliberately wrong foil — a word that is semantically plausible in general English but theologically incorrect in context. The model must assign higher probability to the target than the foil.

	Pass condition: P(target) > P(foil). The margin is additionally classified as high (>0.10), medium (>0.02), or low (≤0.02) confidence.

	Example:
	```
	Input: "Paul teaches that believers are justified freely by God's [MASK],
	not by human merit or works of the law."
	Target: "grace"
	Foil: "law"
	Pass: P("grace") > P("law") → ✅
	```

	The foil "law" is lexically primed by the later phrase "works of the law," making this a hard case (the model must override surface-level co-occurrence with doctrinal understanding). Failure examples like "power", "will", "plan" — generic divine attributes that miss the specific grace/law antithesis — are tracked as `critical_failure` signals.

	## Categories

	| Category | Cases | Pass Rate | Description |
	|---|---|---|---|
	| Pneumatology | 42 | 100% | Holy Spirit: person, work, gifts, indwelling |
	| Soteriology | 109 | 98.2% | Salvation: justification, sanctification, atonement |
	| Ecclesiology | 40 | 97.5% | Church: body of Christ, ordinances, unity |
	| Hamartiology | 34 | 97.1% | Sin: nature, origin, consequences |
	| Christology | 84 | 96.4% | Christ: person, natures, incarnation, atonement |
	| Eschatology | 36 | 94.4% | Last things: return of Christ, judgment, resurrection |
	| Theology proper | 46 | 91.3% | God: attributes, trinity, sovereignty |
	| Canonical knowledge | 138 | 88.4% | Verse-level recall from specific biblical passages |
	| Bibliology | 6 | — | Scripture: inspiration, authority, sufficiency |
	| Romans road | 8 | — | Evangelistic verses from Romans |
	| Spiritual warfare | 2 | — | Cosmic conflict, armor of God |
	| Kingdom theology | 1 | — | Already/not-yet kingdom framework |

	Categories with very few cases (bibliology, Romans road, spiritual warfare, kingdom theology) are included in the eval and in the aggregate results, but the table does not list their individual pass rates: with 1 to 8 cases each, a per-category rate would be too noisy to interpret.

	## Difficulty Levels

	Each test case is assigned a difficulty rating:

	| Difficulty | Cases | Pass Rate | Criteria |
	|---|---|---|---|
	| Easy | 99 | 94.9% | High-frequency verse, single obvious completion |
	| Medium | 275 | 94.9% | Requires domain context but has strong lexical cues |
	| Hard | 172 | 94.2% | Multiple plausible completions, adversarial foil priming, rare vocabulary, or multi-token targets |
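
The document reports a difficulty-weighted score but does not state the weights. The sketch below shows one plausible way such a score could be computed. The 1/2/3 weights and the function name are assumptions for illustration, not the suite's actual scheme. The per-difficulty pass counts (94/99, 261/275, 162/172) are the only integers consistent with the rounded rates in the table and the 517 total.

```python
# Hypothetical weights: the source does not specify the actual weighting scheme.
ASSUMED_WEIGHTS = {"easy": 1.0, "medium": 2.0, "hard": 3.0}


def difficulty_weighted_score(results, weights=ASSUMED_WEIGHTS):
    """results: iterable of (difficulty, passed) pairs. Returns a score in [0, 1]."""
    earned = 0.0
    possible = 0.0
    for difficulty, passed in results:
        weight = weights[difficulty]
        possible += weight
        if passed:
            earned += weight
    return earned / possible if possible else 0.0


# (passed, total) per difficulty level, implied by the table above.
counts = {"easy": (94, 99), "medium": (261, 275), "hard": (162, 172)}

# Expand the counts into one (difficulty, passed) record per test case.
results = [
    (level, index < passed)
    for level, (passed, total) in counts.items()
    for index in range(total)
]

unweighted = sum(passed for _, passed in results) / len(results)
weighted = difficulty_weighted_score(results)
```

Under these assumed weights the score rounds to 94.6%, but other weightings (for example 1/1.5/2) round to the same figure, so the match does not identify the scheme the suite actually uses.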

	### What Makes a Case "Hard"

	- Foil priming — The foil word appears elsewhere in the same sentence, creating lexical pressure toward the wrong answer (e.g., "law" in a sentence that ends with "works of the law")
	- Multi-piece targets — Words that tokenize into 2+ wordpieces require sequence-level probability scoring across mask positions
	- Rare proper nouns — Old Testament names, places, and transliterated terms with low training frequency
	- Abstract theological distinctions — Cases where the model must discriminate between near-synonyms with different doctrinal implications

	## Multi-Token and Multi-Piece Handling

	When a target word tokenizes into multiple wordpieces (e.g., "Nebuchadnezzar" → 5 subword tokens), the eval automatically:

	1. Expands `[MASK]` into the required number of mask tokens
	2. Uses beam search over the mask positions to find the target sequence
	3. Matches either exact sequence equality or subsequence containment (for targets that span fewer positions than available masks)

	This handles cases like:
	- `"sabachthani"` (multi-piece from Aramaic transliteration)
	- `"iniquity"` (3 wordpieces in bert-base-uncased)
	- `"propitiation"` (4 wordpieces)

	Failure examples are also tracked for multi-piece targets — if the model produces a generic failure sequence (e.g., "power", "will") instead of the theologically correct multi-piece target, it's flagged.
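
The three steps above can be sketched in plain Python. This is a simplified illustration, not the eval script's implementation: it assumes each mask position already has its own log-probability table from a single forward pass, it treats "subsequence containment" as a contiguous match, and the wordpieces in the toy data are made up.

```python
import math


def expand_mask(text, n_pieces, mask_token="[MASK]"):
    """Replace the first mask with n_pieces consecutive mask tokens."""
    return text.replace(mask_token, " ".join([mask_token] * n_pieces), 1)


def beam_search(position_log_probs, beam_width=5):
    """position_log_probs: one {piece: log_prob} dict per mask position.

    Returns up to beam_width (sequence, total_log_prob) pairs, best first.
    Positions are scored independently here, which is an assumption.
    """
    beams = [((), 0.0)]
    for distribution in position_log_probs:
        candidates = [
            (sequence + (piece,), score + log_prob)
            for sequence, score in beams
            for piece, log_prob in distribution.items()
        ]
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def contains_subsequence(sequence, target):
    """True if target occurs as a contiguous run inside sequence."""
    sequence, target = tuple(sequence), tuple(target)
    width = len(target)
    return any(
        sequence[start:start + width] == target
        for start in range(len(sequence) - width + 1)
    )


# Toy per-position distributions with made-up wordpieces.
toy_distributions = [
    {"aa": math.log(0.7), "zz": math.log(0.3)},
    {"##bb": math.log(0.6), "##yy": math.log(0.4)},
    {"##cc": math.log(0.9), "##xx": math.log(0.1)},
]
best_sequence, best_score = beam_search(toy_distributions, beam_width=3)[0]
matched = contains_subsequence(best_sequence, ("aa", "##bb", "##cc"))
```

A real run would take the per-position tables from the model's logits at the expanded mask positions. Whether the script re-scores later positions after filling earlier ones is not stated in this document.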

	## Scoring Protocol

	### target_in_top_k (357 cases)

	```
	pass = correct_token_id ∈ top_k(token_ids)
	mrr = 1 / (rank of first correct token), or 0 if not found
	```

	### correct_beats_foil (187 cases)

	```
	pass = P(target) > P(foil)
	margin = P(target) - P(foil)
	confidence = high   if margin > 0.10
	           | medium if margin > 0.02
	           | low    otherwise
	```
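
The same rule as a small Python function. The function name and return layout are assumptions. One consequence worth noting: a failing case has a non-positive margin, so under this reading it always falls into the "low" band.

```python
def score_correct_beats_foil(p_target, p_foil):
    """Compare target and foil probabilities at the masked position."""
    margin = p_target - p_foil
    if margin > 0.10:
        confidence = "high"
    elif margin > 0.02:
        confidence = "medium"
    else:
        confidence = "low"
    return {"pass": p_target > p_foil, "margin": margin, "confidence": confidence}


# Made-up probabilities for the grace/law example; not model output.
clear_win = score_correct_beats_foil(0.50, 0.20)
narrow_win = score_correct_beats_foil(0.30, 0.25)
loss = score_correct_beats_foil(0.10, 0.30)
```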

	### all_top_k_in_target_set (2 cases)

	```
	pass = (valid_tokens_in_top_k / k) ≥ 0.8
	```

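	Used when multiple targets are equally acceptable and the model should surface the theological cluster rather than a single token. Together with the 357 `target_in_top_k` and 187 `correct_beats_foil` cases, these 2 cases account for all 546 cases in the suite; they are counted within the test-type totals listed earlier, which is why the two `target_in_top_k` test types add up to 359 cases.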
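
A sketch of this check, assuming the denominator is always `k` (as the formula states) even if fewer than `k` predictions are available. The function name and the tokens in the example are illustrative only.

```python
def score_all_top_k_in_target_set(ranked_tokens, target_set, k=5, threshold=0.8):
    """Pass when at least `threshold` of the top-k predictions are valid targets."""
    valid_targets = set(target_set)
    top_k = list(ranked_tokens)[:k]
    valid = sum(1 for token in top_k if token in valid_targets)
    fraction = valid / k
    return {"pass": fraction >= threshold, "valid": valid, "fraction": fraction}


# Made-up ranking and target cluster; not taken from eval.json.
cluster = {"faith", "hope", "love", "joy", "peace"}
four_of_five = score_all_top_k_in_target_set(
    ["love", "joy", "peace", "power", "faith"], cluster, k=5
)
three_of_five = score_all_top_k_in_target_set(
    ["love", "power", "will", "joy", "peace"], cluster, k=5
)
```

With k=5 and a 0.8 threshold, four valid tokens out of five is the minimum passing result.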

	## Error Classification

	When a test case fails, the eval attempts to classify why:

	| Error type | Meaning |
	|---|---|
	| `near_miss` | Correct token ranked at k+1 or k+2 — nearly passed |
	| `generic_over_theological` | Top predictions are generic/universal words (e.g., "power", "love", "will") rather than theologically specific terms |
	| `wrong_semantic_cluster` | Wrong token is still semantically related but theologically incorrect |
	| `total_miss` | Correct token ranked below position 20 — model has essentially no signal |
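
The table does not say how the four labels are prioritized or how "generic" and "semantically related" are decided, so the sketch below fills those gaps with labeled assumptions: rank-based labels are checked first, the generic word list contains only the three examples from the table, semantic relatedness is supplied by the caller, and anything left over is returned as `unclassified` (a label that is not part of the suite).

```python
# Only the examples named in the table; the suite's real list is unknown.
ASSUMED_GENERIC_WORDS = frozenset({"power", "love", "will"})


def classify_failure(ranked_tokens, correct_tokens, k=5,
                     related_tokens=frozenset(),
                     generic_words=ASSUMED_GENERIC_WORDS):
    """Label a failed case. Call only for cases that did not pass."""
    correct = set(correct_tokens)
    rank = next(
        (i for i, token in enumerate(ranked_tokens, start=1) if token in correct),
        None,
    )
    if rank in (k + 1, k + 2):
        return "near_miss"
    if rank is None or rank > 20:
        return "total_miss"
    top = list(ranked_tokens)[:k]
    # Assumed rule: "generic" means more than half of the top-k are generic words.
    if sum(token in generic_words for token in top) > len(top) / 2:
        return "generic_over_theological"
    # Assumed rule: the top prediction is in a caller-supplied related set.
    if top and top[0] in related_tokens:
        return "wrong_semantic_cluster"
    return "unclassified"


near = classify_failure(["a", "b", "c", "d", "e", "grace"], ["grace"])
generic = classify_failure(
    ["power", "love", "will", "x", "y", "z", "q", "r", "grace"], ["grace"]
)
```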

	## Critical Failures

	A critical failure is a test case where, regardless of pass/fail status, one of the explicitly listed `failure_examples` tokens appears in the top-3 predictions. This signals that the model is drifting toward generic religious language rather than precise theological vocabulary.

	Critical failure rate: not reported here; it can be extracted from the full results JSON.
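
A sketch of the critical-failure check as defined above. Because the layout of the results JSON is not documented here, the functions take already-loaded rankings and `failure_examples` lists instead of reading the file; the function names and the ranking in the example are assumptions.

```python
def is_critical_failure(ranked_tokens, failure_examples, top_n=3):
    """True if any listed failure example appears in the top_n predictions."""
    flagged = set(failure_examples)
    return any(token in flagged for token in list(ranked_tokens)[:top_n])


def critical_failure_rate(cases, top_n=3):
    """cases: iterable of (ranked_tokens, failure_examples) pairs."""
    cases = list(cases)
    if not cases:
        return 0.0
    hits = sum(
        is_critical_failure(ranked, examples, top_n) for ranked, examples in cases
    )
    return hits / len(cases)


# failure_examples taken from the DOC_001 schema example; the ranking is made up.
doc_001_failures = ["church", "gospel", "law", "bible", "world"]
passes_but_critical = is_critical_failure(
    ["cross", "gospel", "spirit", "church"], doc_001_failures
)
```

The example shows why the flag is independent of pass/fail: the target "cross" is ranked first, so the case would pass, yet "gospel" in the top 3 still marks it as a critical failure.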

	## Running the Evaluation

	Install the dependencies first. `uv pip install` is recommended for speed and resolver behavior, but plain `pip install` works too:

	```bash
	uv pip install -r requirements.txt
	# or: pip install -r requirements.txt
	```

	```bash
	# Default: load model.safetensors from repo root, eval against eval.json
	python scripts/mlm_eval_safetensors.py

	# With GPU
	python scripts/mlm_eval_safetensors.py --device cuda

	# Compare against a previous run (checks fp16 round-trip fidelity)
	python scripts/mlm_eval_safetensors.py --compare eval_results/d12_encoder_mlm_eval.json

	# Custom paths
	python scripts/mlm_eval_safetensors.py \
	  --repo-dir /path/to/repo \
	  --eval-path /path/to/eval.json \
	  --device cuda

	# Adjust top-k and sampling
	python scripts/mlm_eval_safetensors.py --k 10 --n-samples 5
	```

	The script writes results to `eval_results/safetensors_mlm_eval.json` by default.

	### Verifying fp16 Fidelity

	The `--compare` flag diffs the safetensors (fp16→fp32) results against a prior evaluation of the original fp32 `.pt` checkpoint. If every test case produces the same pass/fail outcome, fp16 storage did not change any outcome on this suite. That is evidence that rounding the weights to fp16 does not affect the model's predictions here; it does not show that the stored weights are bit-identical to the fp32 originals.
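
The comparison can be pictured as a per-case diff of pass/fail outcomes. The sketch below is not the script's `--compare` implementation: it assumes each run has been loaded into a list of records with hypothetical `id` and `pass` keys, since the actual results JSON layout is not described in this document.

```python
def diff_outcomes(baseline, candidate):
    """Compare two runs given as lists of {"id": ..., "pass": ...} records."""
    base = {record["id"]: bool(record["pass"]) for record in baseline}
    cand = {record["id"]: bool(record["pass"]) for record in candidate}
    # Cases present in both runs whose outcome changed.
    flipped = sorted(i for i in base.keys() & cand.keys() if base[i] != cand[i])
    # Cases present in only one of the two runs.
    missing = sorted(base.keys() ^ cand.keys())
    return {
        "flipped": flipped,
        "missing": missing,
        "identical": not flipped and not missing,
    }


# Made-up records for illustration.
fp32_run = [{"id": "DOC_001", "pass": True}, {"id": "DOC_002", "pass": False}]
fp16_run = [{"id": "DOC_001", "pass": True}, {"id": "DOC_002", "pass": False}]
report = diff_outcomes(fp32_run, fp16_run)
```

A stricter check could also compare per-case ranks or margins, since two runs can agree on pass/fail while differing in the underlying probabilities.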

	## Test Case Schema

	Each test case in `eval.json` has this structure:

	```json
	{
	  "id": "DOC_001",
	  "type": "doctrinal_association",
	  "category": "soteriology",
	  "difficulty": "medium",
	  "input": "Paul teaches that the message of the [MASK] is foolishness...",
	  "targets": ["cross"],
	  "foils": [],
	  "acceptable_alternatives": [],
	  "failure_examples": ["church", "gospel", "law", "bible", "world"],
	  "pass_condition": "target_in_top_k",
	  "k": 5,
	  "reference": "1 Corinthians 1:18",
	  "reasoning": "The cross as the central message of the gospel...",
	  "surface_confounder": ""
	}
	```

	| Field | Description |
	|---|---|
	| `id` | Unique identifier within the suite |
	| `type` | One of: `doctrinal_association`, `canonical_knowledge`, `contrastive_theology` |
	| `category` | Theological category (see table above) |
	| `difficulty` | `easy`, `medium`, or `hard` |
	| `input` | The masked sentence. Must contain at least one `[MASK]` |
	| `targets` | Correct completion(s) for the masked position(s) |
	| `foils` | Deliberately incorrect but plausible completions (contrastive only) |
	| `acceptable_alternatives` | Also-correct completions beyond the primary target |
	| `failure_examples` | Tokens that would indicate the model failed to internalize the domain, even if the primary target is predicted |
	| `pass_condition` | Scoring strategy: `target_in_top_k`, `correct_beats_foil`, or `all_top_k_in_target_set` |
	| `k` | Number of top predictions to consider |
	| `reference` | Source verse or doctrinal concept |
	| `reasoning` | Human-readable explanation of what the case tests and why the foil is wrong (if applicable) |
	| `surface_confounder` | Linguistic surface feature that could mislead a shallow model (if any) |
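
When adding cases to `eval.json`, a small validator can catch schema mistakes early. The sketch below checks only the constraints stated in the table. Which fields are strictly required, and the rule that `correct_beats_foil` cases need at least one foil, are assumptions made for this example; the validator itself is not part of the repo.

```python
ASSUMED_REQUIRED = ("id", "type", "category", "difficulty", "input",
                    "targets", "pass_condition", "k")
TYPES = {"doctrinal_association", "canonical_knowledge", "contrastive_theology"}
DIFFICULTIES = {"easy", "medium", "hard"}
PASS_CONDITIONS = {"target_in_top_k", "correct_beats_foil", "all_top_k_in_target_set"}


def validate_case(case):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"missing field: {name}" for name in ASSUMED_REQUIRED if name not in case]
    if problems:
        return problems
    if case["type"] not in TYPES:
        problems.append(f"unknown type: {case['type']}")
    if case["difficulty"] not in DIFFICULTIES:
        problems.append(f"unknown difficulty: {case['difficulty']}")
    if case["pass_condition"] not in PASS_CONDITIONS:
        problems.append(f"unknown pass_condition: {case['pass_condition']}")
    if "[MASK]" not in case["input"]:
        problems.append("input has no [MASK]")
    if not case["targets"]:
        problems.append("targets is empty")
    if not isinstance(case["k"], int) or case["k"] < 1:
        problems.append("k must be a positive integer")
    # Assumed rule: a foil comparison needs at least one foil.
    if case["pass_condition"] == "correct_beats_foil" and not case.get("foils"):
        problems.append("correct_beats_foil requires at least one foil")
    return problems


# The DOC_001 example from the schema above.
doc_001 = {
    "id": "DOC_001",
    "type": "doctrinal_association",
    "category": "soteriology",
    "difficulty": "medium",
    "input": "Paul teaches that the message of the [MASK] is foolishness...",
    "targets": ["cross"],
    "foils": [],
    "acceptable_alternatives": [],
    "failure_examples": ["church", "gospel", "law", "bible", "world"],
    "pass_condition": "target_in_top_k",
    "k": 5,
    "reference": "1 Corinthians 1:18",
    "reasoning": "The cross as the central message of the gospel...",
    "surface_confounder": "",
}
doc_001_problems = validate_case(doc_001)
```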

	## Design Philosophy

	This eval was designed to probe domain-specific MLM behavior, not general linguistic fluency. A general-purpose BERT model may score well on standard MLM benchmarks while producing theologically incoherent completions on biblical text. The three test types target different aspects of that behavior:

	1. Doctrinal association checks whether the model has absorbed domain-specific co-occurrence patterns — the "language" of theology
	2. Canonical knowledge checks whether the model has memorized specific verses — the "data" of scripture
	3. Contrastive theology checks whether the model prefers doctrinally correct completions over plausible foils

	The foil-based contrastive cases are the most discriminative: they test whether the model assigns higher probability to a doctrinally correct target than to a surface-level lexical confounder. Results on this suite should be read as evidence about behavior on cases of this shape, not as a general measure of theological understanding. The training corpus and eval suite were authored privately and have not been externally audited, so some train/eval distributional overlap (especially for canonical recall) is expected.