	---
	language:
	- en
	license: apache-2.0
	library_name: transformers
	pipeline_tag: fill-mask
	tags:
	- masked-language-modeling
	- bible
	- theology
	- christianity
	- trust-remote-code
	model-index:
	- name: theo-bert-base
	  results:
	  - task:
	      type: fill-mask
	      name: Masked Language Modeling
	    metrics:
	    - type: accuracy
	      value: 0.947
	      name: Pass rate on 546-case theological MLM eval
	---

	# TheoBERT Base

	`theo-bert-base` is a domain-specialized masked language model for biblical and theological text. It is a custom bidirectional encoder pretrained from scratch on Bible text and closely related doctrinal material, exported in a Hugging Face–compatible format.

	This repository ships the MLM-shaped artifact: an encoder body paired with a working MLM head. It is the right checkpoint if you want fill-mask, token-level scoring, or a strong base for further domain-specific fine-tuning where token-level prediction matters.


	## What This Model Is For

	Recommended use cases:

	- Masked token prediction and token-level scoring in biblical-domain text
	- Initialization for continued domain adaptation or supervised downstream fine-tuning
	- Encoder hidden states for downstream task heads (classification, NER, etc.)

	## Training Pipeline

	This release is the output of a two-stage pretraining pipeline.

	Stage 1 — MLM pretraining from scratch (`encoder`)
	- 24 epochs of masked language modeling at 256-token context
	- 270,000 sequences from Bible text, Christian books, biblical commentaries, and synthetic data
	- Final train loss `1.0679`, train accuracy `76.42%`

	Stage 2 — Whole-word-masking continued pretraining (`mlmcontinued`) — this release
	- 25 additional epochs of continued pretraining on top of Stage 1
	- 18% whole-word-masking rate (all wordpieces of a selected word are masked together, not individual pieces)
	- Final train loss `0.8958`, train accuracy `79.66%`

	The MLM head was trained jointly with the body throughout both stages and is preserved in this release.
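
The whole-word masking used in Stage 2 can be illustrated with a minimal sketch. This is not the training code from this release: the function name `whole_word_mask`, the word-level application of the rate, the fixed seed, and the example pieces are all assumptions made for illustration. Real pipelines often budget by token count and mix in random or unchanged replacements, which this sketch leaves out.

```python
import random

def whole_word_mask(pieces, rate=0.18, mask_token="[MASK]", seed=0):
    """Mask whole words in a WordPiece sequence (illustrative sketch).

    Pieces that start with '##' continue the previous word, so they are
    grouped with it and masked (or kept) together.
    """
    # Group piece indices into words.
    words = []
    for i, piece in enumerate(pieces):
        if piece.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    # Assumption: the rate is applied per word, and at least one word is masked.
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(words) * rate))
    chosen = rng.sample(words, n_to_mask)

    masked = list(pieces)
    for word in chosen:
        for i in word:
            masked[i] = mask_token
    positions = sorted(i for word in chosen for i in word)
    return masked, positions

# Hypothetical pieces; the real tokenizer may split these words differently.
pieces = ["the", "in", "##iq", "##uity", "of", "us", "all"]
masked, positions = whole_word_mask(pieces)
print(masked, positions)
```

With single-piece masking, a model can often recover `##uity` from the visible `in` and `##iq`; masking the whole word removes that shortcut.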

	## Evaluation

	Evaluated on a 546-case domain-specific MLM benchmark covering bibliology, christology, ecclesiology, eschatology, hamartiology, pneumatology, soteriology, theology proper, and canonical knowledge. Full methodology and test case schema in [`EVAL.md`](EVAL.md).

	| Metric | Value |
	|---|---|
	| Overall pass rate | 94.7% (517 / 546) |
	| Difficulty-weighted | 94.6% |
	| Easy | 94.9% |
	| Medium | 94.9% |
	| Hard | 94.2% |

	Per-category highlights:

	| Category | Pass rate |
	|---|---|
	| Pneumatology | 100% |
	| Soteriology | 98.2% |
	| Ecclesiology | 97.5% |
	| Hamartiology | 97.1% |
	| Christology | 96.4% |
	| Eschatology | 94.4% |
	| Theology proper | 91.3% |
	| Canonical knowledge | 88.4% |
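
As a rough illustration of how per-group pass rates like these can be aggregated, here is a small sketch. The record fields (`category`, `difficulty`, `passed`) and the sample records are hypothetical and are not taken from `EVAL.md`; only the 517 / 546 ratio comes from the table above.

```python
from collections import defaultdict

def pass_rates(records, key):
    """Return {group: (passed, total, rate)} for records grouped by `key`."""
    counts = defaultdict(lambda: [0, 0])
    for rec in records:
        counts[rec[key]][0] += int(rec["passed"])
        counts[rec[key]][1] += 1
    return {g: (p, t, p / t) for g, (p, t) in counts.items()}

# Hypothetical records for illustration only (not real eval cases).
records = [
    {"category": "soteriology", "difficulty": "easy", "passed": True},
    {"category": "soteriology", "difficulty": "hard", "passed": True},
    {"category": "canonical knowledge", "difficulty": "hard", "passed": False},
    {"category": "canonical knowledge", "difficulty": "easy", "passed": True},
]
print(pass_rates(records, "category"))
print(pass_rates(records, "difficulty"))

# The headline figure is the same ratio taken over all cases.
overall = 517 / 546
print(f"{overall:.1%}")  # 94.7%
```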

	### Comparison with bert-base-uncased

	General-purpose BERT often produces theologically incoherent completions on biblical text. Running `google-bert/bert-base-uncased` through the same 546-case eval shows the gap:

	| Metric | bert-base-uncased | theo-bert-base |
	|---|---|---|
	| Overall pass rate | 47.8% | 94.7% |
	| Doctrinal association | 39.4% | 95.9% |
	| Canonical knowledge | 37.7% | 88.4% |
	| Contrastive theology | 65.2% | 97.9% |
	| Difficulty-weighted | 46.5% | 94.6% |
	| Critical failure rate | 26.9% | 15.6% |

	By difficulty — theo-bert-base on hard cases (94.2%) outperforms bert-base-uncased on easy cases (56.6%):

	| Difficulty | bert-base-uncased | theo-bert-base |
	|---|---|---|
	| Easy | 56.6% | 94.9% |
	| Medium | 46.9% | 94.9% |
	| Hard | 44.2% | 94.2% |

	By category:

	| Category | bert-base-uncased | theo-bert-base |
	|---|---|---|
	| Pneumatology | 45.2% | 100% |
	| Soteriology | 55.0% | 98.2% |
	| Ecclesiology | 62.5% | 97.5% |
	| Hamartiology | 61.8% | 97.1% |
	| Christology | 41.7% | 96.4% |
	| Eschatology | 55.6% | 94.4% |
	| Theology proper | 43.5% | 91.3% |
	| Canonical knowledge | 37.7% | 88.4% |
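
To make the per-category gap easier to read, the snippet below computes the percentage-point difference for each row. The numbers are copied from the category table; the variable names are illustrative.

```python
# (bert-base-uncased, theo-bert-base) pass rates in percent, copied from the table.
by_category = {
    "Pneumatology": (45.2, 100.0),
    "Soteriology": (55.0, 98.2),
    "Ecclesiology": (62.5, 97.5),
    "Hamartiology": (61.8, 97.1),
    "Christology": (41.7, 96.4),
    "Eschatology": (55.6, 94.4),
    "Theology proper": (43.5, 91.3),
    "Canonical knowledge": (37.7, 88.4),
}

# Gap in percentage points, rounded to one decimal like the source figures.
gaps = {name: round(theo - base, 1) for name, (base, theo) in by_category.items()}

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} +{gap:.1f} pp")

largest = max(gaps, key=gaps.get)
smallest = min(gaps, key=gaps.get)
print(largest, smallest)
```

On these figures the widest gap is in Pneumatology and the narrowest in Ecclesiology, where the general-purpose baseline is already at its strongest.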

	On contrastive theology (the most discriminative test type), bert-base-uncased is right 65.2% of the time but confident (margin > 0.10) on only 23% of cases. theo-bert-base is right 97.9% of the time and confident on 91% of cases.
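
The distinction between "right" and "confident" can be sketched as follows. This assumes the margin is the difference between the softmax probabilities of the correct and the contrasting token at the masked position; the authoritative definition is in `EVAL.md`, and the function name, token ids, and logits here are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_result(logits, correct_id, distractor_id, threshold=0.10):
    """Assumed scoring rule: margin = P(correct) - P(distractor)."""
    probs = softmax(logits)
    margin = probs[correct_id] - probs[distractor_id]
    return {
        "right": margin > 0,
        "confident": margin > threshold,
        "margin": margin,
    }

# Hypothetical logits over a 4-token vocabulary.
logits = [2.0, 1.8, 0.1, -1.0]
result = contrastive_result(logits, correct_id=0, distractor_id=1)
print(result)
```

In this toy case the correct token wins, but only by a margin of about 0.09, so it would count as right without counting as confident.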

	Residual failures cluster around Old Testament proper-noun recall (Jeremiah, Jonah, Job, Nebuchadnezzar) and multi-piece subword reconstruction (`sabachthani`, `iniquity`, `Nebuchadnezzar`). The benchmark suggests strong domain-specific MLM behavior on this suite; broader generalization beyond the eval distribution has not been independently verified.
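
Multi-piece reconstruction is harder because a word only counts as recovered when every one of its wordpieces is predicted correctly. A minimal sketch of how `##` continuation pieces are merged back into words is shown below; the split used in the example is hypothetical and may not match what the `bert-base-uncased` tokenizer actually produces.

```python
def join_wordpieces(pieces):
    """Merge WordPiece tokens back into words ('##' marks a continuation)."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return words

# Hypothetical segmentation, for illustration only.
pieces = ["eli", "eli", "lama", "sa", "##bach", "##than", "##i"]
print(join_wordpieces(pieces))
```

Here a single wrong prediction among the four pieces of the last word would be enough to break the reconstructed word.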

	## Tokenizer

	`theo-bert-base` uses the `google-bert/bert-base-uncased` tokenizer. The fast-tokenizer files (`tokenizer.json`, `tokenizer_config.json`) are bundled in this repo so `AutoTokenizer.from_pretrained("toranb/theo-bert-base")` and the Hub `fill-mask` widget work out of the box.

	Tokenizer files are redistributed unmodified from [`google-bert/bert-base-uncased`](https://huggingface.co/google-bert/bert-base-uncased), released by Google under the Apache License 2.0.

	## Architecture

	- 12 transformer blocks
	- Hidden size 768
	- 8 attention heads (head dim 96)
	- Training sequence length 256 (rotary cache supports up to 2,560 tokens)
	- Vocabulary size 30,522 via `bert-base-uncased`
	- RoPE positional encoding applied to query and key projections
	- RMS normalization on Q and K (no learnable gain)
	- ReLU-squared MLP activation
	- Gated value embeddings on even-indexed layers
	- Learned residual interpolation between each block output and the initial token-embedding state
	- MLM head: `Linear → GELU → RMSNorm → Linear`
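
Three of the components listed above can be sketched in plain Python to show what they compute. This is an illustrative sketch, not the code in `modeling_theo_bert_base.py`: the RoPE base of 10000, the pairing of adjacent dimensions, the epsilon, and the order in which normalization and rotation are applied are all assumptions.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMS normalization without a learnable gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def relu_squared(x):
    """ReLU-squared activation: max(v, 0) ** 2, elementwise."""
    return [max(v, 0.0) ** 2 for v in x]

def rope(x, position, base=10000.0):
    """Rotate pairs (x[0], x[1]), (x[2], x[3]), ... by position-dependent angles."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

head_dim = 768 // 8  # 96, matching the list above

# Toy 4-dimensional query and key vectors.
q = rms_norm([0.5, -1.0, 2.0, 0.25])
k = rms_norm([1.0, 0.5, -0.5, 1.5])

# With RoPE, the attention score depends only on the relative offset (2 here).
score_a = dot(rope(q, 3), rope(k, 1))
score_b = dot(rope(q, 12), rope(k, 10))
print(abs(score_a - score_b) < 1e-9)
print(relu_squared([-1.0, 0.5, 2.0]))
```

The relative-offset property is what lets a rotary cache extend past the training length without learned position embeddings, although quality at those lengths is a separate question.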

	Parameter count: 273,051,864 (≈273M).
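
As a quick sanity check, the parameter count translates into storage size as follows. This is back-of-the-envelope arithmetic that ignores the small `safetensors` header.

```python
params = 273_051_864

fp16_bytes = params * 2  # 2 bytes per parameter in half precision
fp32_bytes = params * 4  # 4 bytes per parameter in single precision

print(f"fp16: {fp16_bytes / 1e6:.0f} MB")  # fp16: 546 MB
print(f"fp32: {fp32_bytes / 1e6:.0f} MB")  # fp32: 1092 MB
```

Loading in float32 therefore needs roughly twice the memory of the fp16 file on disk.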

	## Quick Start — Fill-Mask

	```python
	import torch
	from transformers import AutoModelForMaskedLM, AutoTokenizer

	repo = "toranb/theo-bert-base"

	tokenizer = AutoTokenizer.from_pretrained(repo)
	# trust_remote_code=True is required to register the custom architecture
	model = AutoModelForMaskedLM.from_pretrained(repo, trust_remote_code=True)
	model.eval()

	inputs = tokenizer(
	    "For God so loved the [MASK] that he gave his only Son.",
	    return_tensors="pt",
	)
	with torch.no_grad():  # inference only, no gradients needed
	    outputs = model(**inputs)

	# Column index of the first [MASK] token in the single input sequence
	mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=False)[0, 1]
	top_ids = outputs.logits[0, mask_index].topk(5).indices.tolist()
	print(tokenizer.convert_ids_to_tokens(top_ids))
	# → ['world', 'universe', 'son', 'church', 'earth']
	```

	## Quick Start — Encoder Hidden States

	```python
	import torch
	from transformers import AutoModel, AutoTokenizer

	repo = "toranb/theo-bert-base"

	tokenizer = AutoTokenizer.from_pretrained(repo)
	model = AutoModel.from_pretrained(repo, trust_remote_code=True)
	model.eval()

	batch = tokenizer(
	    ["faith working through love", "the kingdom of God"],
	    padding=True, truncation=True, max_length=256, return_tensors="pt",
	)
	with torch.no_grad():  # inference only, no gradients needed
	    hidden = model(**batch).last_hidden_state  # [B, T, 768]

	# Mean-pool over real tokens only; padded positions are zeroed by the mask
	mask = batch["attention_mask"].unsqueeze(-1).float()
	pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)  # [B, 768]
	```
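
The pooling expression at the end of the block above is a masked mean. The plain-Python sketch below reproduces the same computation for one sequence so the role of the attention mask is easy to see; the function name and toy values are illustrative.

```python
def masked_mean(hidden, attention_mask):
    """Average the rows of `hidden` whose mask entry is 1.

    hidden: list of T vectors (each a list of floats)
    attention_mask: list of T ints (1 = real token, 0 = padding)
    """
    width = len(hidden[0])
    total = [0.0] * width
    for vec, keep in zip(hidden, attention_mask):
        if keep:
            for j, v in enumerate(vec):
                total[j] += v
    count = max(sum(attention_mask), 1)  # mirrors clamp(min=1)
    return [t / count for t in total]

# Toy example: the last row is padding and must not affect the result.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean(hidden, mask))  # [2.0, 3.0]
```

Without the mask, the padded row would pull the average toward values that carry no information about the input.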


	## Repository Contents

	| File | Purpose |
	|---|---|
	| `configuration_theo_bert_base.py` | Hugging Face config class |
	| `modeling_theo_bert_base.py` | `AutoModel` and `AutoModelForMaskedLM` implementations |
	| `muon.py` | Local Muon optimizer (retained for self-contained fine-tuning) |
	| `config.json` | Generated from the source checkpoint configuration |
	| `model.safetensors` | Released fp16 weights |
	| `checkpoint_metadata.json` | Source checkpoint and per-stage training metadata |
	| `LICENSE` | Apache-2.0 |

	### Scripts

	| Script | Purpose |
	|---|---|
	| `scripts/mlm_eval_safetensors.py` | Loads `model.safetensors` + `eval.json` and runs the full 546-case MLM evaluation suite |

	## Limitations

	- Specialized for biblical and theological language; may underperform on broad general-domain NLP tasks.
	- Tokenizer inherited from `bert-base-uncased`, so wordpiece behavior follows general English conventions rather than a theology-specific tokenizer.
	- Trained at 256-token context. Longer inputs work within the rotary cache (up to 2,560 tokens), but extended-context behavior is not a primary target of this release.
	- Training data is private, so external auditing of corpus composition is limited. The canonical-knowledge eval cases overlap by design with biblical text that appears in the training corpus, so the 88.4% recall on that category should be read as in-distribution recall, not held-out generalization.
	- Encoder MLM — not an autoregressive decoder.
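
One way to stay within the 256-token training context on long passages is to split the input into overlapping windows. The sketch below is one possible approach, not something shipped in this repository: the function name and the overlap of 64 tokens are assumptions, while `101` and `102` are the `[CLS]` and `[SEP]` ids in the `bert-base-uncased` vocabulary. Hugging Face tokenizers can also do this directly through `return_overflowing_tokens=True` together with `stride`.

```python
def chunk_ids(token_ids, max_len=256, stride=64, cls_id=101, sep_id=102):
    """Split token ids (without special tokens) into overlapping windows.

    Each window is wrapped in [CLS] ... [SEP] and is at most `max_len` long.
    Consecutive windows share `stride` tokens of context.
    """
    body = max_len - 2  # room left after [CLS] and [SEP]
    if not 0 <= stride < body:
        raise ValueError("stride must be non-negative and smaller than max_len - 2")
    step = body - stride

    chunks = []
    start = 0
    while True:
        window = token_ids[start:start + body]
        chunks.append([cls_id] + window + [sep_id])
        if start + body >= len(token_ids):
            break
        start += step
    return chunks

# 600 placeholder ids standing in for a long tokenized passage.
ids = list(range(1000, 1600))
chunks = chunk_ids(ids)
print(len(chunks), [len(c) for c in chunks])
```

Per-window outputs still need to be merged afterwards, for example by keeping predictions from the window where a token sits furthest from the edges.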

	## Release Details

	- Exported from `mlmcontinued/latest.pt` (Stage 2 final epoch, training accuracy 79.66%)
	- Source checkpoint loss `0.8958`
	- Released weights in fp16 for bandwidth efficiency (546 MB)
	- Release format uses `safetensors`
	- Loading requires `trust_remote_code=True` to register the custom architecture
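	- `config.json` declares `torch_dtype: float32`, so default loads upcast the fp16 weights to fp32 on read. Disk weights stay fp16 (small download), and CPU inference is numerically safe by default. For GPU fp16 inference, pass `dtype=torch.float16` to `from_pretrained` (`torch_dtype=torch.float16` on older `transformers` releases that predate the `dtype` argument).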
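
The fp16-on-disk, fp32-on-load arrangement can be illustrated with the standard library alone. The sketch below uses `struct` to round a value to IEEE 754 half precision and then widen it to single precision. The helper names and the sample weight are illustrative, and this is not how `from_pretrained` performs the conversion internally.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

weight = 0.1234567          # hypothetical full-precision weight
stored = to_fp16(weight)    # what an fp16 checkpoint would hold
loaded = to_fp32(stored)    # the value after upcasting on load

print(stored == loaded)      # True: every fp16 value is exact in fp32
print(abs(weight - stored))  # rounding error introduced at export time
```

Upcasting does not recover the precision lost at export, but it lets subsequent arithmetic run in fp32, which avoids the narrower range and coarser rounding of fp16 during inference.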