TheoBERT Base
theo-bert-base is a domain-specialized masked language model for biblical and theological text. It is a custom bidirectional encoder pretrained from scratch on the Bible and closely related doctrinal material, exported in a Hugging Face-compatible format.
This repository ships the MLM-shaped artifact: an encoder body paired with a working MLM head. It is the right checkpoint if you want fill-mask, token-level scoring, or a strong base for further domain-specific fine-tuning where token-level prediction matters.
What This Model Is For
Recommended use cases:
- Masked token prediction and token-level scoring in biblical-domain text
- Initialization for continued domain adaptation or supervised downstream fine-tuning
- Encoder hidden states for downstream task heads (classification, NER, etc.)
Training Pipeline
This release is the output of a two-stage pretraining pipeline.
Stage 1: MLM pretraining from scratch (encoder)
- 24 epochs of masked language modeling at 256-token context
- 270,000 sequences from Bible text, Christian books, biblical commentaries, and synthetic data
- Final train loss 1.0679, train accuracy 76.42%
Stage 2: Whole-word-masking continued pretraining (mlmcontinued), this release
- 25 additional epochs of continued pretraining on top of Stage 1
- 18% whole-word-masking rate (whole-word, not single-piece)
- Final train loss 0.8958, train accuracy 79.66%
The MLM head was trained jointly with the body throughout both stages and is preserved in this release.
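Whole-word masking, as used in Stage 2, can be illustrated with a minimal sketch. It assumes BERT-style WordPiece tokens in which continuation pieces start with `##`. The function name `whole_word_mask`, the way the 18% budget is rounded, and the example token split are illustrative assumptions, not the training code behind this release.

```python
import random

def whole_word_mask(tokens, mask_rate=0.18, mask_token="[MASK]", seed=0):
    """Mask whole words: all WordPieces of a selected word are masked together."""
    rng = random.Random(seed)
    special = {"[CLS]", "[SEP]", "[PAD]"}

    # Group token indices into words; a piece starting with "##" continues the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok in special:
            continue
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    # The budget is counted in tokens, so long words consume more of it.
    n_candidates = sum(len(w) for w in words)
    budget = max(1, round(n_candidates * mask_rate))

    rng.shuffle(words)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None marks positions that are not scored
    used = 0
    for word in words:
        if used >= budget:
            break
        if used > 0 and used + len(word) > budget:
            continue  # skip words that would overshoot the budget
        for i in word:
            labels[i] = tokens[i]
            masked[i] = mask_token
        used += len(word)
    return masked, labels

# Hypothetical WordPiece split, for illustration only.
tokens = ["[CLS]", "neb", "##uch", "##ad", "##nez", "##zar", "besieged", "jerusalem", "[SEP]"]
masked, labels = whole_word_mask(tokens)
print(masked)
```

In contrast to single-piece masking, the multi-piece word in this example is either fully masked or left fully visible, so the model cannot recover a masked piece from its unmasked neighbours within the same word.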
Evaluation
Evaluated on a 546-case domain-specific MLM benchmark covering bibliology, christology, ecclesiology, eschatology, hamartiology, pneumatology, soteriology, theology proper, and canonical knowledge. The full methodology and test case schema are in EVAL.md.
| Metric | Value |
|---|---|
| Overall pass rate | 94.7% (517 / 546) |
| Difficulty-weighted | 94.6% |
| Easy | 94.9% |
| Medium | 94.9% |
| Hard | 94.2% |
Per-category highlights:
| Category | Pass rate |
|---|---|
| Pneumatology | 100% |
| Soteriology | 98.2% |
| Ecclesiology | 97.5% |
| Hamartiology | 97.1% |
| Christology | 96.4% |
| Eschatology | 94.4% |
| Theology proper | 91.3% |
| Canonical knowledge | 88.4% |
Comparison with bert-base-uncased
General-purpose BERT produces theologically incoherent completions on biblical text. Running google-bert/bert-base-uncased through the same 546-case eval shows the gap:
| Metric | bert-base-uncased | theo-bert-base |
|---|---|---|
| Overall pass rate | 47.8% | 94.7% |
| Doctrinal association | 39.4% | 95.9% |
| Canonical knowledge | 37.7% | 88.4% |
| Contrastive theology | 65.2% | 97.9% |
| Difficulty-weighted | 46.5% | 94.6% |
| Critical failure rate | 26.9% | 15.6% |
By difficulty, theo-bert-base on hard cases (94.2%) outperforms bert-base-uncased on easy cases (56.6%):
| Difficulty | bert-base-uncased | theo-bert-base |
|---|---|---|
| Easy | 56.6% | 94.9% |
| Medium | 46.9% | 94.9% |
| Hard | 44.2% | 94.2% |
By category:
| Category | bert-base-uncased | theo-bert-base |
|---|---|---|
| Pneumatology | 45.2% | 100% |
| Soteriology | 55.0% | 98.2% |
| Ecclesiology | 62.5% | 97.5% |
| Hamartiology | 61.8% | 97.1% |
| Christology | 41.7% | 96.4% |
| Eschatology | 55.6% | 94.4% |
| Theology proper | 43.5% | 91.3% |
| Canonical knowledge | 37.7% | 88.4% |
On contrastive theology, the most discriminative test type, bert-base-uncased is right 65% of the time but confident (margin > 0.10) on only 23% of cases. By contrast, theo-bert-base is right 98% of the time and confident on 91% of cases.
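One way to read the distinction between "right" and "confident" is as a comparison of the probabilities assigned to a correct and an incorrect candidate at the mask position. The sketch below assumes the margin is the simple difference between those two probabilities; the authoritative definition is in EVAL.md. The function name `score_contrastive` and the toy probabilities are hypothetical.

```python
def score_contrastive(p_correct, p_distractor, margin_threshold=0.10):
    """Classify one contrastive case from two mask-position probabilities."""
    margin = p_correct - p_distractor  # assumed definition of the margin
    return {
        "margin": margin,
        "right": margin > 0,
        "confident": margin > margin_threshold,
    }

# Toy probabilities, not model outputs.
cases = [(0.62, 0.05), (0.21, 0.18), (0.03, 0.40)]
results = [score_contrastive(pc, pd) for pc, pd in cases]

right_rate = sum(r["right"] for r in results) / len(results)
confident_rate = sum(r["confident"] for r in results) / len(results)
print(f"right: {right_rate:.0%}, confident: {confident_rate:.0%}")
```

Under this reading, the second toy case counts as right but not confident, which is the situation the 65% versus 23% gap describes for bert-base-uncased.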
Residual failures cluster around Old Testament proper-noun recall (Jeremiah, Jonah, Job, Nebuchadnezzar) and multi-piece subword reconstruction (sabachthani, iniquity, Nebuchadnezzar). The benchmark suggests strong domain-specific MLM behavior on this suite; broader generalization beyond the eval distribution has not been independently verified.
Tokenizer
theo-bert-base uses the google-bert/bert-base-uncased tokenizer. The fast-tokenizer files (tokenizer.json, tokenizer_config.json) are bundled in this repo so AutoTokenizer.from_pretrained("toranb/theo-bert-base") and the Hub fill-mask widget work out of the box.
Tokenizer files are redistributed unmodified from google-bert/bert-base-uncased, released by Google under the Apache License 2.0.
Architecture
- 12 transformer blocks
- Hidden size 768
- 8 attention heads (head dim 96)
- Training sequence length 256 (rotary cache supports up to 2,560 tokens)
- Vocabulary size 30,522 via bert-base-uncased
- RoPE positional encoding applied to query and key projections
- RMS normalization on Q and K (no learnable gain)
- ReLU-squared MLP activation
- Gated value embeddings on even-indexed layers
- Learned residual interpolation between each block output and the initial token-embedding state
- MLM head: Linear → GELU → RMSNorm → Linear
Parameter count: 273,051,864 (≈273M).
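Three of the components listed above can be sketched in plain Python to show what each one computes. This is a didactic sketch under assumptions: the epsilon value, the RoPE base of 10000, and the interleaved pair layout are common defaults, not values confirmed by this repository. The actual implementation lives in modeling_theo_bert_base.py and may differ.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMS normalization without a learnable gain: x / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def relu_squared(x):
    """ReLU-squared activation: max(0, x)^2, applied elementwise."""
    return [max(0.0, v) ** 2 for v in x]

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of x by position-dependent angles (len(x) must be even)."""
    d = len(x)
    out = list(x)
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# With RoPE on both query and key, the score depends only on the
# relative offset (2 in both calls), not on the absolute positions.
print(dot(rope(q, 3), rope(k, 1)), dot(rope(q, 7), rope(k, 5)))

# Head dimension implied by hidden size 768 and 8 heads.
print(768 // 8)
```

The relative-offset property is what lets a rotary cache longer than the training length be used at all, although quality beyond 256 tokens is a separate question.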
Quick Start: Fill-Mask
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "toranb/theo-bert-base"
tokenizer = AutoTokenizer.from_pretrained(repo)
# trust_remote_code=True is required to register the custom architecture
model = AutoModelForMaskedLM.from_pretrained(repo, trust_remote_code=True)
model.eval()

inputs = tokenizer(
    "For God so loved the [MASK] that he gave his only Son.",
    return_tensors="pt",
)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Position of the [MASK] token in the first (and only) sequence
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=False)[0, 1]
top_ids = outputs.logits[0, mask_index].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
# → ['world', 'universe', 'son', 'church', 'earth']
Quick Start: Encoder Hidden States
import torch
from transformers import AutoModel, AutoTokenizer

repo = "toranb/theo-bert-base"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
model.eval()

batch = tokenizer(
    ["faith working through love", "the kingdom of God"],
    padding=True, truncation=True, max_length=256, return_tensors="pt",
)
with torch.no_grad():  # inference only, no gradients needed
    hidden = model(**batch).last_hidden_state  # [B, T, 768]

# Mean-pool over real tokens only; padding positions are zeroed by the mask
mask = batch["attention_mask"].unsqueeze(-1).float()
pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
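The pooling step above is a masked mean: padded positions are excluded from both the sum and the count. A plain-Python sketch of the same arithmetic for a single sequence makes that explicit. The function name `masked_mean_pool` and the toy vectors are illustrative, not part of the repository.

```python
def masked_mean_pool(hidden, attention_mask):
    """hidden: T token vectors of width H; attention_mask: T values of 0 or 1."""
    count = max(sum(attention_mask), 1)  # same role as clamp(min=1)
    width = len(hidden[0])
    return [
        sum(vec[j] * m for vec, m in zip(hidden, attention_mask)) / count
        for j in range(width)
    ]

# The last row stands in for a [PAD] position and must not affect the result.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padded row would pull the average toward its values, which is why a plain mean over the time axis is not equivalent for batches of unequal length.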
Repository Contents
| File | Purpose |
|---|---|
| configuration_theo_bert_base.py | Hugging Face config class |
| modeling_theo_bert_base.py | AutoModel and AutoModelForMaskedLM implementations |
| muon.py | Local Muon optimizer (retained for self-contained fine-tuning) |
| config.json | Generated from the source checkpoint configuration |
| model.safetensors | Released fp16 weights |
| checkpoint_metadata.json | Source checkpoint and per-stage training metadata |
| LICENSE | Apache-2.0 |
Scripts
| Script | Purpose |
|---|---|
| scripts/mlm_eval_safetensors.py | Loads model.safetensors + eval.json and runs the full 546-case MLM evaluation suite |
Limitations
- Specialized for biblical and theological language; may underperform on broad general-domain NLP tasks.
- Tokenizer inherited from bert-base-uncased, so wordpiece behavior follows general English conventions rather than a theology-specific tokenizer.
- Trained at 256-token context. Longer inputs work within the rotary cache (up to 2,560 tokens), but extended-context behavior is not a primary target of this release.
- Training data is private, so external auditing of corpus composition is limited. The canonical-knowledge eval cases overlap by design with biblical text that appears in the training corpus, so the 88.4% recall on that category should be read as in-distribution recall, not held-out generalization.
- Encoder MLM, not an autoregressive decoder.
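Since the model was trained at 256 tokens, one conservative option for longer documents is to split them into overlapping windows that each stay within the training length. The sketch below is an assumption-laden illustration, not a method shipped with this repository. The function name `chunk_token_ids` and the stride of 128 are hypothetical. The ids 101 and 102 are the [CLS] and [SEP] ids of the bert-base-uncased vocabulary, and the input is assumed to be tokenized without special tokens.

```python
def chunk_token_ids(token_ids, window=256, stride=128, cls_id=101, sep_id=102):
    """Split a long id sequence into overlapping windows wrapped in [CLS] ... [SEP]."""
    if not 0 < stride <= window - 2:
        raise ValueError("stride must be between 1 and window - 2")
    body = window - 2  # room left after the two special tokens
    chunks = []
    start = 0
    while True:
        piece = token_ids[start:start + body]
        chunks.append([cls_id] + piece + [sep_id])
        if start + body >= len(token_ids):
            break  # this window reached the end of the sequence
        start += stride
    return chunks

# 600 placeholder ids standing in for a tokenized passage.
ids = list(range(1000, 1600))
chunks = chunk_token_ids(ids)
print(len(chunks), [len(c) for c in chunks])
```

Each window can then be scored or encoded independently. How to merge predictions from overlapping regions is a design choice left open here.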
Release Details
- Exported from mlmcontinued/latest.pt (Stage 2 final epoch, training accuracy 79.66%)
- Source checkpoint loss 0.8958
- Released weights in fp16 for bandwidth efficiency (546 MB)
- Release format uses safetensors
- Loading requires trust_remote_code=True to register the custom architecture
- config.json declares torch_dtype: float32 so default loads upcast on read. Disk weights stay fp16 (small download); CPU inference is numerically safe by default. For GPU fp16 inference, pass dtype=torch.float16 to from_pretrained.
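The 546 MB figure follows from the stated parameter count at two bytes per fp16 value. The short calculation below checks this and shows the approximate in-memory size after the default float32 upcast. It assumes every parameter is stored in fp16, counts 1 MB as 10^6 bytes, and ignores the small safetensors header.

```python
PARAMS = 273_051_864  # parameter count stated in the Architecture section

def weight_bytes(n_params, bytes_per_param):
    """Raw tensor storage, ignoring file headers and runtime overhead."""
    return n_params * bytes_per_param

fp16 = weight_bytes(PARAMS, 2)  # on-disk weights
fp32 = weight_bytes(PARAMS, 4)  # after the default upcast on load
print(f"fp16: {fp16 / 1e6:.0f} MB, fp32: {fp32 / 1e6:.0f} MB")
# → fp16: 546 MB, fp32: 1092 MB
```

So a default load needs roughly twice the download size in memory for the weights alone.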