---
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: fill-mask
tags:
- masked-language-modeling
- bible
- theology
- christianity
- trust-remote-code
model-index:
- name: theo-bert-base
  results:
  - task:
      type: fill-mask
      name: Masked Language Modeling
    metrics:
    - type: accuracy
      value: 0.947
      name: Pass rate on 546-case theological MLM eval
---
# TheoBERT Base
`theo-bert-base` is a domain-specialized **masked language model** for biblical and theological text. It is a custom bidirectional encoder pretrained from scratch on Bible text and closely related doctrinal material, exported in a Hugging Face–compatible format.
This repository ships the MLM-shaped artifact: an encoder body paired with a working MLM head. It is the right checkpoint if you want fill-mask, token-level scoring, or a strong base for further domain-specific fine-tuning where token-level prediction matters.
## What This Model Is For
Recommended use cases:
- Masked token prediction and token-level scoring in biblical-domain text
- Initialization for continued domain adaptation or supervised downstream fine-tuning
- Encoder hidden states for downstream task heads (classification, NER, etc.)
## Training Pipeline
This release is the output of a two-stage pretraining pipeline.
**Stage 1: MLM pretraining from scratch (`encoder`)**
- 24 epochs of masked language modeling at 256-token context
- 270,000 sequences from Bible text, Christian books, biblical commentaries, and synthetic data
- Final train loss `1.0679`, train accuracy `76.42%`
**Stage 2: Whole-word-masking continued pretraining (`mlmcontinued`)** (*this release*)
- 25 additional epochs of continued pretraining on top of Stage 1
- 18% masking rate applied at the whole-word level (all wordpieces of a selected word are masked together, not single pieces)
- Final train loss `0.8958`, train accuracy `79.66%`
The MLM head was trained jointly with the body throughout both stages and is preserved in this release.
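The whole-word masking used in Stage 2 can be illustrated with a minimal sketch. This is not the training code from this release: the function names, the seed handling, and the choice to always substitute `[MASK]` (rather than mixing in random or unchanged tokens) are assumptions made for illustration, and the wordpiece split in the example is hypothetical.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}


def group_whole_words(tokens):
    """Group token indices so '##' continuation pieces stay with their word."""
    words = []
    for i, tok in enumerate(tokens):
        if tok in SPECIAL:
            continue  # special tokens are never masking candidates
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    return words


def whole_word_mask(tokens, rate=0.18, seed=0):
    """Mask about `rate` of the wordpieces, always covering whole words.

    Assumption: every selected piece becomes [MASK]; the actual training
    recipe may use a different replacement policy.
    """
    rng = random.Random(seed)
    words = group_whole_words(tokens)
    rng.shuffle(words)
    budget = max(1, round(rate * sum(len(w) for w in words)))
    masked = list(tokens)
    chosen = []
    for word in words:
        # Once something is masked, skip words that would overshoot the budget.
        if chosen and len(chosen) + len(word) > budget:
            continue
        chosen.extend(word)
        for i in word:
            masked[i] = MASK
        if len(chosen) >= budget:
            break
    return masked, sorted(chosen)


# Hypothetical wordpiece split, for illustration only.
example = ["[CLS]", "lam", "##a", "sab", "##ach", "##thani", "[SEP]"]
print(whole_word_mask(example, rate=0.18, seed=0))
```

Because selection happens per word, a multi-piece word is either fully masked or left intact, which forces the model to reconstruct the entire word from context instead of completing it from a visible neighboring piece.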
## Evaluation
Evaluated on a 546-case domain-specific MLM benchmark covering bibliology, christology, ecclesiology, eschatology, hamartiology, pneumatology, soteriology, theology proper, and canonical knowledge. Full methodology and the test case schema are in [`EVAL.md`](EVAL.md).
| Metric | Value |
|---|---|
| Overall pass rate | **94.7%** (517 / 546) |
| Difficulty-weighted | 94.6% |
| Easy | 94.9% |
| Medium | 94.9% |
| Hard | 94.2% |
Per-category highlights:
| Category | Pass rate |
|---|---|
| Pneumatology | 100% |
| Soteriology | 98.2% |
| Ecclesiology | 97.5% |
| Hamartiology | 97.1% |
| Christology | 96.4% |
| Eschatology | 94.4% |
| Theology proper | 91.3% |
| Canonical knowledge | 88.4% |
### Comparison with bert-base-uncased
General-purpose BERT often produces theologically incoherent completions on biblical text. Running `google-bert/bert-base-uncased` through the same 546-case eval shows the gap:
| Metric | bert-base-uncased | **theo-bert-base** |
|---|---|---|
| Overall pass rate | 47.8% | **94.7%** |
| Doctrinal association | 39.4% | **95.9%** |
| Canonical knowledge | 37.7% | **88.4%** |
| Contrastive theology | 65.2% | **97.9%** |
| Difficulty-weighted | 46.5% | **94.6%** |
| Critical failure rate | 26.9% | **15.6%** |
By difficulty, theo-bert-base on **hard** cases (94.2%) outperforms bert-base-uncased on **easy** cases (56.6%):
| Difficulty | bert-base-uncased | **theo-bert-base** |
|---|---|---|
| Easy | 56.6% | **94.9%** |
| Medium | 46.9% | **94.9%** |
| Hard | 44.2% | **94.2%** |
By category:
| Category | bert-base-uncased | **theo-bert-base** |
|---|---|---|
| Pneumatology | 45.2% | **100%** |
| Soteriology | 55.0% | **98.2%** |
| Ecclesiology | 62.5% | **97.5%** |
| Hamartiology | 61.8% | **97.1%** |
| Christology | 41.7% | **96.4%** |
| Eschatology | 55.6% | **94.4%** |
| Theology proper | 43.5% | **91.3%** |
| Canonical knowledge | 37.7% | **88.4%** |
On contrastive theology, the most discriminative test type, bert-base-uncased is right 65% of the time but confident (margin > 0.10) on only 23% of cases. theo-bert-base is right 98% of the time and confident on 91% of cases.
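The distinction between "right" and "confident" can be made concrete with a small sketch. It assumes that the margin is the difference in mask probability between the expected token and the contrasting token; that definition, the function names, and the sample probabilities are illustrative assumptions, and `EVAL.md` remains the authoritative description of the scoring.

```python
def contrastive_outcome(p_expected, p_contrast, margin_threshold=0.10):
    """Score one contrastive case from two mask probabilities.

    Assumption: margin = p_expected - p_contrast. A case counts as correct
    when the margin is positive and as confident when it exceeds the threshold.
    """
    margin = p_expected - p_contrast
    return {
        "margin": margin,
        "correct": margin > 0,
        "confident": margin > margin_threshold,
    }


def summarize(cases, margin_threshold=0.10):
    """Aggregate accuracy and confident rate over (p_expected, p_contrast) pairs."""
    outcomes = [contrastive_outcome(pe, pc, margin_threshold) for pe, pc in cases]
    n = len(outcomes)
    return {
        "accuracy": sum(o["correct"] for o in outcomes) / n,
        "confident_rate": sum(o["confident"] for o in outcomes) / n,
    }


# Made-up probabilities: one confident hit, one narrow hit, one miss.
print(summarize([(0.62, 0.05), (0.30, 0.27), (0.10, 0.40)]))
```

Under this reading, a model can rank the expected token first while separating it from the contrast token by only a few points of probability, which is how accuracy and confident rate can diverge.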
Residual failures cluster around Old Testament proper-noun recall (Jeremiah, Jonah, Job, Nebuchadnezzar) and multi-piece subword reconstruction (`sabachthani`, `iniquity`, `Nebuchadnezzar`). The benchmark suggests strong domain-specific MLM behavior on this suite; broader generalization beyond the eval distribution has not been independently verified.
## Tokenizer
`theo-bert-base` uses the **`google-bert/bert-base-uncased` tokenizer**. The fast-tokenizer files (`tokenizer.json`, `tokenizer_config.json`) are bundled in this repo so `AutoTokenizer.from_pretrained("toranb/theo-bert-base")` and the Hub `fill-mask` widget work out of the box.
Tokenizer files are redistributed unmodified from [`google-bert/bert-base-uncased`](https://huggingface.co/google-bert/bert-base-uncased), released by Google under the Apache License 2.0.
## Architecture
- 12 transformer blocks
- Hidden size 768
- 8 attention heads (head dim 96)
- Training sequence length 256 (rotary cache supports up to 2,560 tokens)
- Vocabulary size 30,522 via `bert-base-uncased`
- RoPE positional encoding applied to query and key projections
- RMS normalization on Q and K (no learnable gain)
- ReLU-squared MLP activation
- Gated value embeddings on even-indexed layers
- Learned residual interpolation between each block output and the initial token-embedding state
- MLM head: `Linear → GELU → RMSNorm → Linear`
Parameter count: **273,051,864** (≈273M).
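Two of the components listed above, gain-free RMS normalization and the ReLU-squared activation, are simple enough to sketch in plain Python. This is an illustrative reference, not the code in `modeling_theo_bert_base.py`; the `eps` value and the function names are assumptions.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMS normalization with no learnable gain.

    Each element is divided by the root mean square of the vector.
    The eps value is an assumption for numerical stability.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]


def relu_squared(x):
    """ReLU-squared activation: max(v, 0) ** 2, applied elementwise."""
    return [max(v, 0.0) ** 2 for v in x]


# Head dimension follows from the listed sizes: 768 hidden / 8 heads.
hidden_size, num_heads = 768, 8
head_dim = hidden_size // num_heads

print(head_dim)
print(rms_norm([3.0, 4.0]))
print(relu_squared([-2.0, 0.5, 3.0]))
```

After `rms_norm`, the vector has a root mean square of approximately 1 regardless of its input scale, which is the property the Q and K normalization relies on.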
## Quick Start: Fill-Mask
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "toranb/theo-bert-base"
tokenizer = AutoTokenizer.from_pretrained(repo)
# trust_remote_code=True is required to register the custom architecture.
model = AutoModelForMaskedLM.from_pretrained(repo, trust_remote_code=True)
model.eval()

inputs = tokenizer(
    "For God so loved the [MASK] that he gave his only Son.",
    return_tensors="pt",
)
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)

# Position of the first [MASK] token in the single input sequence.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=False)[0, 1]
top_ids = outputs.logits[0, mask_index].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
# → ['world', 'universe', 'son', 'church', 'earth']
```
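For token-level scoring, the raw logits at the mask position can be turned into probabilities with a softmax. The helper below is a minimal, dependency-free sketch; `top_k_probs` is a hypothetical name, not part of this repository or of `transformers`.

```python
import math


def top_k_probs(logits, k=5):
    """Return the k highest-scoring (index, probability) pairs.

    Uses a numerically stable softmax: the maximum logit is subtracted
    before exponentiating, which leaves the probabilities unchanged.
    """
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return [(i, exps[i] / total) for i in ranked]


# Toy logits over a three-token vocabulary.
for index, prob in top_k_probs([2.0, 0.0, 1.0], k=2):
    print(index, round(prob, 4))
```

With the model from the fill-mask example, the input would be `outputs.logits[0, mask_index].tolist()`, and the returned indices can be mapped back to tokens with `tokenizer.convert_ids_to_tokens`.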
## Quick Start: Encoder Hidden States
```python
import torch
from transformers import AutoModel, AutoTokenizer

repo = "toranb/theo-bert-base"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
model.eval()

batch = tokenizer(
    ["faith working through love", "the kingdom of God"],
    padding=True, truncation=True, max_length=256, return_tensors="pt",
)
with torch.no_grad():  # inference only, so skip gradient tracking
    hidden = model(**batch).last_hidden_state  # [B, T, 768]

# Mean-pool over real tokens only; the attention mask zeroes out padding.
mask = batch["attention_mask"].unsqueeze(-1).float()
pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)  # [B, 768]
```
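The pooling step above is a masked mean, and a common next step is to compare the pooled vectors with cosine similarity. The sketch below restates both operations in plain Python to make the arithmetic explicit; the function names are illustrative, and using cosine similarity for comparison is an assumption about downstream use, not something this release evaluates.

```python
import math


def masked_mean(hidden, attention_mask):
    """Average token vectors where attention_mask is 1, ignoring padding."""
    dim = len(hidden[0])
    count = max(sum(attention_mask), 1)  # mirrors clamp(min=1)
    totals = [0.0] * dim
    for vector, keep in zip(hidden, attention_mask):
        if keep:
            for j, value in enumerate(vector):
                totals[j] += value
    return [t / count for t in totals]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy sequence: two real tokens and one padding position.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
print(masked_mean(hidden, [1, 1, 0]))
```

With the tensors from the example above, `cosine_similarity(pooled[0].tolist(), pooled[1].tolist())` would compare the two input phrases.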
## Repository Contents
| File | Purpose |
|---|---|
| `configuration_theo_bert_base.py` | Hugging Face config class |
| `modeling_theo_bert_base.py` | `AutoModel` and `AutoModelForMaskedLM` implementations |
| `muon.py` | Local Muon optimizer (retained for self-contained fine-tuning) |
| `config.json` | Generated from the source checkpoint configuration |
| `model.safetensors` | Released fp16 weights |
| `checkpoint_metadata.json` | Source checkpoint and per-stage training metadata |
| `LICENSE` | Apache-2.0 |
### Scripts
| Script | Purpose |
|---|---|
| `scripts/mlm_eval_safetensors.py` | Loads `model.safetensors` + `eval.json` and runs the full 546-case MLM evaluation suite |
## Limitations
- Specialized for biblical and theological language; may underperform on broad general-domain NLP tasks.
- Tokenizer inherited from `bert-base-uncased`, so wordpiece behavior follows general English conventions rather than a theology-specific tokenizer.
- Trained at 256-token context. Longer inputs work within the rotary cache (up to 2,560 tokens), but extended-context behavior is not a primary target of this release.
- Training data is private, so external auditing of corpus composition is limited. The canonical-knowledge eval cases overlap by design with biblical text that appears in the training corpus, so the 88.4% recall on that category should be read as in-distribution recall, not held-out generalization.
- Encoder-only MLM, not an autoregressive decoder.
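One way to stay within the 256-token training context on long documents is to split the token IDs into overlapping windows and run each window separately. The sketch below is a generic approach, not a feature of this release; the function name and the `stride` default are assumptions.

```python
def sliding_windows(token_ids, window=256, stride=128):
    """Split token IDs into overlapping chunks of at most `window` tokens.

    Assumption: special tokens such as [CLS] and [SEP] are added per chunk
    afterwards, so callers may want to reserve room for them in `window`.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(token_ids) <= window:
        return [list(token_ids)]
    chunks = []
    start = 0
    while True:
        chunks.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # this chunk reached the end of the input
        start += stride
    return chunks


chunks = sliding_windows(list(range(600)))
print(len(chunks), [len(c) for c in chunks])
```

A stride smaller than the window gives every token some left and right context in at least one chunk, at the cost of running overlapping positions more than once.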
## Release Details
- Exported from `mlmcontinued/latest.pt` (Stage 2 final epoch, training accuracy 79.66%)
- Source checkpoint loss `0.8958`
- Released weights in fp16 for bandwidth efficiency (546 MB)
- Release format uses `safetensors`
- Loading requires `trust_remote_code=True` to register the custom architecture
- `config.json` declares `torch_dtype: float32` so default loads upcast on read. Disk weights stay fp16 (small download); CPU inference is numerically safe by default. For GPU fp16 inference, pass `dtype=torch.float16` to `from_pretrained` (older Transformers releases use the `torch_dtype` keyword instead).
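The 546 MB figure is consistent with the stated parameter count at two bytes per fp16 weight. The short check below assumes 1 MB = 10^6 bytes and ignores the small safetensors header; the helper name is illustrative.

```python
PARAM_COUNT = 273_051_864  # from the Architecture section


def weight_size_mb(num_params, bytes_per_param):
    """Approximate tensor storage in megabytes (assuming 1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


fp16_mb = weight_size_mb(PARAM_COUNT, 2)  # on-disk weights
fp32_mb = weight_size_mb(PARAM_COUNT, 4)  # in memory after the default upcast

print(round(fp16_mb), round(fp32_mb))
```

The fp32 figure is a rough estimate of the memory the weights occupy after a default load, which is about twice the download size.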