TheoBERT Base

theo-bert-base is a domain-specialized masked language model for biblical and theological text. It is a custom bidirectional encoder pretrained from scratch on Bible text and closely related doctrinal material, exported in a Hugging Face–compatible format.

This repository ships the MLM-shaped artifact: an encoder body paired with a working MLM head. It is the right checkpoint if you want fill-mask, token-level scoring, or a strong base for further domain-specific fine-tuning where token-level prediction matters.

What This Model Is For

Recommended use cases:

  • Masked token prediction and token-level scoring in biblical-domain text
  • Initialization for continued domain adaptation or supervised downstream fine-tuning
  • Encoder hidden states for downstream task heads (classification, NER, etc.)

Training Pipeline

This release is the output of a two-stage pretraining pipeline.

Stage 1: MLM pretraining from scratch (encoder)

  • 24 epochs of masked language modeling at 256-token context
  • 270,000 sequences from Bible text, Christian books, biblical commentaries, and synthetic data
  • Final train loss 1.0679, train accuracy 76.42%

Stage 2: Whole-word-masking continued pretraining (mlmcontinued), this release

  • 25 additional epochs of continued pretraining on top of Stage 1
  • 18% masking rate with whole-word masking (all wordpieces of a selected word are masked together, rather than single pieces)
  • Final train loss 0.8958, train accuracy 79.66%

The MLM head was trained jointly with the body throughout both stages and is preserved in this release.
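To illustrate what whole-word masking means at the wordpiece level, here is a minimal sketch. It is not the training code from this release: the function name, the example pieces, and the budget rule are assumptions, and it skips details a real recipe may include (such as random or unchanged replacements).

```python
import random

def whole_word_mask(pieces, rate=0.18, mask_token="[MASK]", seed=0):
    """Mask whole words in a WordPiece sequence (illustrative sketch).

    Pieces that start with '##' continue the previous word, so they are
    grouped with it and masked (or kept) together.
    """
    rng = random.Random(seed)
    special = {"[CLS]", "[SEP]", "[PAD]"}

    # Group piece indices into words; special tokens are never masked.
    words, open_word = [], False
    for i, piece in enumerate(pieces):
        if piece in special:
            open_word = False
            continue
        if piece.startswith("##") and open_word:
            words[-1].append(i)
        else:
            words.append([i])
            open_word = True

    # Pick random words until roughly `rate` of the maskable pieces are covered.
    n_maskable = sum(len(w) for w in words)
    budget = max(1, round(rate * n_maskable))
    rng.shuffle(words)
    masked = set()
    for word in words:
        if len(masked) >= budget:
            break
        masked.update(word)

    out = [mask_token if i in masked else p for i, p in enumerate(pieces)]
    return out, sorted(masked)
```

With a hypothetical split such as ["ne", "##bu", "##chad", "##nez", "##zar"], either all five pieces are masked or none are, which is the property that distinguishes this from single-piece masking.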

Evaluation

The model was evaluated on a 546-case domain-specific MLM benchmark covering bibliology, christology, ecclesiology, eschatology, hamartiology, pneumatology, soteriology, theology proper, and canonical knowledge. The full methodology and test-case schema are in EVAL.md.

| Metric | Value |
|---|---|
| Overall pass rate | 94.7% (517 / 546) |
| Difficulty-weighted | 94.6% |
| Easy | 94.9% |
| Medium | 94.9% |
| Hard | 94.2% |

Per-category highlights:

| Category | Pass rate |
|---|---|
| Pneumatology | 100% |
| Soteriology | 98.2% |
| Ecclesiology | 97.5% |
| Hamartiology | 97.1% |
| Christology | 96.4% |
| Eschatology | 94.4% |
| Theology proper | 91.3% |
| Canonical knowledge | 88.4% |

Comparison with bert-base-uncased

General-purpose BERT produces theologically incoherent completions on biblical text. Running google-bert/bert-base-uncased through the same 546-case eval shows the gap:

| Metric | bert-base-uncased | theo-bert-base |
|---|---|---|
| Overall pass rate | 47.8% | 94.7% |
| Doctrinal association | 39.4% | 95.9% |
| Canonical knowledge | 37.7% | 88.4% |
| Contrastive theology | 65.2% | 97.9% |
| Difficulty-weighted | 46.5% | 94.6% |
| Critical failure rate | 26.9% | 15.6% |

By difficulty: theo-bert-base on hard cases (94.2%) outperforms bert-base-uncased on easy cases (56.6%):

| Difficulty | bert-base-uncased | theo-bert-base |
|---|---|---|
| Easy | 56.6% | 94.9% |
| Medium | 46.9% | 94.9% |
| Hard | 44.2% | 94.2% |

By category:

| Category | bert-base-uncased | theo-bert-base |
|---|---|---|
| Pneumatology | 45.2% | 100% |
| Soteriology | 55.0% | 98.2% |
| Ecclesiology | 62.5% | 97.5% |
| Hamartiology | 61.8% | 97.1% |
| Christology | 41.7% | 96.4% |
| Eschatology | 55.6% | 94.4% |
| Theology proper | 43.5% | 91.3% |
| Canonical knowledge | 37.7% | 88.4% |

On contrastive theology, the most discriminative test type, bert-base-uncased is right 65% of the time but confident (margin > 0.10) on only 23% of cases. theo-bert-base is right 98% of the time and confident on 91% of cases.
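As a rough illustration of how a margin-based confidence check like this can work, the sketch below compares the probability of a correct candidate against a distractor at the masked position. The definition of margin as a probability difference, and all function names, are assumptions for illustration; the actual scoring rules are described in EVAL.md.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_margin(mask_logits, correct_id, distractor_id):
    """Probability of the correct token minus that of the distractor."""
    probs = softmax(mask_logits)
    return probs[correct_id] - probs[distractor_id]

def judge(mask_logits, correct_id, distractor_id, threshold=0.10):
    """Classify one contrastive case as correct and/or confident."""
    margin = contrastive_margin(mask_logits, correct_id, distractor_id)
    return {
        "correct": margin > 0,            # correct candidate ranked higher
        "confident": margin > threshold,  # and by a clear probability gap
        "margin": margin,
    }
```

Under this reading, a model can be "right" while preferring the correct token by only a sliver of probability, which is why the correct rate and the confident rate can diverge as sharply as they do for bert-base-uncased.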

Residual failures cluster around Old Testament proper-noun recall (Jeremiah, Jonah, Job, Nebuchadnezzar) and multi-piece subword reconstruction (sabachthani, iniquity, Nebuchadnezzar). The benchmark suggests strong domain-specific MLM behavior on this suite; broader generalization beyond the eval distribution has not been independently verified.

Tokenizer

theo-bert-base uses the google-bert/bert-base-uncased tokenizer. The fast-tokenizer files (tokenizer.json, tokenizer_config.json) are bundled in this repo so AutoTokenizer.from_pretrained("toranb/theo-bert-base") and the Hub fill-mask widget work out of the box.

Tokenizer files are redistributed unmodified from google-bert/bert-base-uncased, released by Google under the Apache License 2.0.
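Because the tokenizer is a general-English WordPiece vocabulary, rarer theological or proper-noun terms can split into several pieces. The sketch below shows greedy longest-match-first WordPiece splitting on a tiny, hypothetical vocabulary; the real bert-base-uncased vocabulary has 30,522 entries and may split these words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of one lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"grace", "sin", "in", "##iq", "##uity"}
print(wordpiece("grace", toy_vocab))
print(wordpiece("iniquity", toy_vocab))
```

A word that splits into several pieces needs every piece predicted correctly to be reconstructed, which makes it a harder masked-prediction target than a single-piece word.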

Architecture

  • 12 transformer blocks
  • Hidden size 768
  • 8 attention heads (head dim 96)
  • Training sequence length 256 (rotary cache supports up to 2,560 tokens)
  • Vocabulary size 30,522 via bert-base-uncased
  • RoPE positional encoding applied to query and key projections
  • RMS normalization on Q and K (no learnable gain)
  • ReLU-squared MLP activation
  • Gated value embeddings on even-indexed layers
  • Learned residual interpolation between each block output and the initial token-embedding state
  • MLM head: Linear → GELU → RMSNorm → Linear
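Several of the components listed above are small enough to sketch in plain Python. The following is an illustrative sketch, not the code in modeling_theo_bert_base.py: the function names, the RoPE base of 10000, the consecutive-pair rotation layout, and the exact form of the residual interpolation are all assumptions.

```python
import math

HIDDEN, HEADS = 768, 8
HEAD_DIM = HIDDEN // HEADS  # 96, matching the head dim listed above

def rms_norm(x, eps=1e-6):
    """RMS-normalize a vector with no learnable gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def relu_squared(x):
    """ReLU followed by squaring, applied elementwise."""
    return [max(v, 0.0) ** 2 for v in x]

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs (x[i], x[i+1]) by position-dependent angles."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def residual_mix(block_out, x0, lam):
    """Interpolate between a block output and the initial embedding state.

    `lam` stands in for a learned scalar; the real parameterization may differ.
    """
    return [lam * b + (1.0 - lam) * z for b, z in zip(block_out, x0)]
```

One property worth noting: because RoPE is applied to both queries and keys, their dot product depends only on the relative offset between positions, not on the absolute positions.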

Parameter count: 273,051,864 (≈273M).

Quick Start: Fill-Mask

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "toranb/theo-bert-base"

# trust_remote_code=True is required to register the custom architecture.
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForMaskedLM.from_pretrained(repo, trust_remote_code=True)
model.eval()

inputs = tokenizer(
    "For God so loved the [MASK] that he gave his only Son.",
    return_tensors="pt",
)
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)

# Column index of the first [MASK] token in the single input sequence.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=False)[0, 1]
top_ids = outputs.logits[0, mask_index].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
# → ['world', 'universe', 'son', 'church', 'earth']

Quick Start: Encoder Hidden States

import torch
from transformers import AutoModel, AutoTokenizer

repo = "toranb/theo-bert-base"

# trust_remote_code=True is required to register the custom architecture.
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
model.eval()

batch = tokenizer(
    ["faith working through love", "the kingdom of God"],
    padding=True, truncation=True, max_length=256, return_tensors="pt",
)
with torch.no_grad():  # inference only, so skip gradient tracking
    hidden = model(**batch).last_hidden_state  # [B, T, 768]

# Mean-pool over real tokens only; padding positions are zeroed by the mask.
mask = batch["attention_mask"].unsqueeze(-1).float()
pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)  # [B, 768]

Repository Contents

| File | Purpose |
|---|---|
| configuration_theo_bert_base.py | Hugging Face config class |
| modeling_theo_bert_base.py | AutoModel and AutoModelForMaskedLM implementations |
| muon.py | Local Muon optimizer (retained for self-contained fine-tuning) |
| config.json | Generated from the source checkpoint configuration |
| model.safetensors | Released fp16 weights |
| checkpoint_metadata.json | Source checkpoint and per-stage training metadata |
| LICENSE | Apache-2.0 |

Scripts

| Script | Purpose |
|---|---|
| scripts/mlm_eval_safetensors.py | Loads model.safetensors + eval.json and runs the full 546-case MLM evaluation suite |

Limitations

  • Specialized for biblical and theological language; may underperform on broad general-domain NLP tasks.
  • Tokenizer inherited from bert-base-uncased, so wordpiece behavior follows general English conventions rather than a theology-specific tokenizer.
  • Trained at 256-token context. Longer inputs work within the rotary cache (up to 2,560 tokens), but extended-context behavior is not a primary target of this release.
  • Training data is private, so external auditing of corpus composition is limited. The canonical-knowledge eval cases overlap by design with biblical text that appears in the training corpus, so the 88.4% recall on that category should be read as in-distribution recall, not held-out generalization.
  • This is an encoder-only masked language model, not an autoregressive decoder, so it is not suited to free-form text generation.
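For documents longer than the 256-token training context, one common workaround is to split the token ids into overlapping windows and run each window separately. The helper below is a hypothetical sketch, not part of this repository; it assumes the standard bert-base-uncased special-token ids ([CLS] = 101, [SEP] = 102) and an arbitrary overlap of 32 tokens.

```python
def sliding_windows(token_ids, max_len=256, overlap=32, cls_id=101, sep_id=102):
    """Split content token ids into overlapping windows of at most max_len.

    `token_ids` should exclude special tokens; each window is wrapped in
    [CLS] ... [SEP] so it looks like a normal single-sequence input.
    """
    body = max_len - 2  # room left after [CLS] and [SEP]
    if not 0 <= overlap < body:
        raise ValueError("overlap must be in [0, max_len - 2)")
    step = body - overlap

    windows, start = [], 0
    while True:
        chunk = token_ids[start:start + body]
        windows.append([cls_id] + chunk + [sep_id])
        if start + body >= len(token_ids):
            break  # this window reached the end of the input
        start += step
    return windows
```

Per-window outputs then need to be merged by the caller, for example by averaging pooled vectors or by keeping token predictions from the window where the token sits furthest from an edge. Which strategy works best for this model has not been evaluated here.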

Release Details

  • Exported from mlmcontinued/latest.pt (Stage 2 final epoch, training accuracy 79.66%)
  • Source checkpoint loss 0.8958
  • Released weights in fp16 for bandwidth efficiency (546 MB)
  • Release format uses safetensors
  • Loading requires trust_remote_code=True to register the custom architecture
  • config.json declares torch_dtype: float32, so default loads upcast the weights on read. The weights on disk stay fp16 (small download), and CPU inference is numerically safe by default. For GPU fp16 inference, pass dtype=torch.float16 to from_pretrained (older transformers releases use the torch_dtype argument instead).
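The 546 MB figure can be checked against the parameter count stated under Architecture. The arithmetic below is a back-of-the-envelope sketch that ignores the small safetensors header and any runtime overhead such as activations; the helper name is hypothetical.

```python
PARAMS = 273_051_864  # parameter count stated under Architecture

def weights_size_mb(n_params, bytes_per_param):
    """Raw weight storage in megabytes (1 MB = 10**6 bytes)."""
    return n_params * bytes_per_param / 1e6

fp16_mb = weights_size_mb(PARAMS, 2)  # on-disk fp16 weights
fp32_mb = weights_size_mb(PARAMS, 4)  # in-memory size after the default upcast

print(f"fp16: {fp16_mb:.0f} MB, fp32: {fp32_mb:.0f} MB")
```

This shows the trade-off behind the default: the download stays at roughly 546 MB, while a default float32 load needs about twice that in memory for the weights alone.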