---
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: fill-mask
tags:
- masked-language-modeling
- bible
- theology
- christianity
- trust-remote-code
model-index:
- name: theo-bert-base
  results:
  - task:
      type: fill-mask
      name: Masked Language Modeling
    metrics:
    - type: accuracy
      value: 0.947
      name: Pass rate on 546-case theological MLM eval
---

# TheoBERT Base

`theo-bert-base` is a domain-specialized **masked language model** for biblical and theological text. It is a custom bidirectional encoder pretrained from scratch on Bible text and closely related doctrinal material, exported in a Hugging Face-compatible format.

This repository ships the MLM-shaped artifact: an encoder body paired with a working MLM head. It is the right checkpoint if you want fill-mask, token-level scoring, or a strong base for further domain-specific fine-tuning where token-level prediction matters.

## What This Model Is For

Recommended use cases:

- Masked token prediction and token-level scoring in biblical-domain text
- Initialization for continued domain adaptation or supervised downstream fine-tuning
- Encoder hidden states for downstream task heads (classification, NER, etc.)

## Training Pipeline

This release is the output of a two-stage pretraining pipeline.

**Stage 1: MLM pretraining from scratch (`encoder`)**

- 24 epochs of masked language modeling at 256-token context
- 270,000 sequences from Bible text, Christian books, biblical commentaries, and synthetic data
- Final train loss `1.0679`, train accuracy `76.42%`

**Stage 2: Whole-word-masking continued pretraining (`mlmcontinued`)** (*this release*)

- 25 additional epochs of continued pretraining on top of Stage 1
- 18% whole-word-masking rate (all pieces of a selected word are masked together, not single pieces)
- Final train loss `0.8958`, train accuracy `79.66%`

The MLM head was trained jointly with the body throughout both stages and is preserved in this release.
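
The whole-word masking used in Stage 2 can be illustrated with a small sketch. The training code is not part of this card, so everything below is an assumption made for illustration: the helper name `whole_word_mask`, selecting each word independently with probability `rate`, and replacing every selected piece with `[MASK]` (real MLM recipes often also keep or randomize a fraction of the selected tokens). It relies only on the WordPiece convention that continuation pieces start with `##`. The token list is hand-written and is not claimed to be the real tokenizer output.

```python
import random


def whole_word_mask(tokens, rate=0.18, seed=None, mask_token="[MASK]"):
    """Mask whole words in a WordPiece token list.

    Hypothetical helper, not the released training code. Each word is
    selected with probability `rate`; all of its pieces are masked together.
    """
    rng = random.Random(seed)
    special = {"[CLS]", "[SEP]", "[PAD]"}

    # Group token indices into words: a '##' piece continues the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok in special:
            continue
        if tok.startswith("##") and words and words[-1][-1] == i - 1:
            words[-1].append(i)
        else:
            words.append([i])

    masked = list(tokens)
    labels = [None] * len(tokens)  # None means the position is not scored
    for word in words:
        if rng.random() < rate:
            for i in word:
                labels[i] = tokens[i]  # original piece becomes the target
                masked[i] = mask_token
    return masked, labels


# Illustrative token list (assumed split, not real tokenizer output).
tokens = ["[CLS]", "our", "in", "##iq", "##uity", "is", "forgiven", "[SEP]"]
masked, labels = whole_word_mask(tokens, rate=0.5, seed=0)
print(masked)
```

Because selection happens per word rather than per piece, a multi-piece word is either fully visible or fully hidden, so the model cannot recover it from its own neighboring pieces.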

## Evaluation

Evaluated on a 546-case domain-specific MLM benchmark covering bibliology, christology, ecclesiology, eschatology, hamartiology, pneumatology, soteriology, theology proper, and canonical knowledge. Full methodology and test case schema in [`EVAL.md`](EVAL.md).

| Metric | Value |
|---|---|
| Overall pass rate | **94.7%** (517 / 546) |
| Difficulty-weighted | 94.6% |
| Easy | 94.9% |
| Medium | 94.9% |
| Hard | 94.2% |

Per-category highlights:

| Category | Pass rate |
|---|---|
| Pneumatology | 100% |
| Soteriology | 98.2% |
| Ecclesiology | 97.5% |
| Hamartiology | 97.1% |
| Christology | 96.4% |
| Eschatology | 94.4% |
| Theology proper | 91.3% |
| Canonical knowledge | 88.4% |

### Comparison with bert-base-uncased

General-purpose BERT often produces theologically incoherent completions on biblical text. Running `google-bert/bert-base-uncased` through the same 546-case eval shows the gap:

| Metric | bert-base-uncased | **theo-bert-base** |
|---|---|---|
| Overall pass rate | 47.8% | **94.7%** |
| Doctrinal association | 39.4% | **95.9%** |
| Canonical knowledge | 37.7% | **88.4%** |
| Contrastive theology | 65.2% | **97.9%** |
| Difficulty-weighted | 46.5% | **94.6%** |
| Critical failure rate | 26.9% | **15.6%** |

By difficulty, theo-bert-base on **hard** cases (94.2%) outperforms bert-base-uncased on **easy** cases (56.6%):

| Difficulty | bert-base-uncased | **theo-bert-base** |
|---|---|---|
| Easy | 56.6% | **94.9%** |
| Medium | 46.9% | **94.9%** |
| Hard | 44.2% | **94.2%** |

By category:

| Category | bert-base-uncased | **theo-bert-base** |
|---|---|---|
| Pneumatology | 45.2% | **100%** |
| Soteriology | 55.0% | **98.2%** |
| Ecclesiology | 62.5% | **97.5%** |
| Hamartiology | 61.8% | **97.1%** |
| Christology | 41.7% | **96.4%** |
| Eschatology | 55.6% | **94.4%** |
| Theology proper | 43.5% | **91.3%** |
| Canonical knowledge | 37.7% | **88.4%** |

On contrastive theology, the most discriminative test type, bert-base-uncased is right 65% of the time but confident (margin > 0.10) on only 23% of cases. theo-bert-base is right 98% of the time and confident on 91% of cases.
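
A confidence margin of this kind can be computed from the model's distribution at the mask position. The sketch below is an assumption about the general shape of such a check, not the procedure documented in `EVAL.md`: it takes the softmax probability of the expected token minus that of the contrasting token and calls the case confident when the margin exceeds 0.10. The function names and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_margin(logits, expected_id, contrast_id):
    """Probability of the expected token minus that of the contrasting one."""
    probs = softmax(logits)
    return probs[expected_id] - probs[contrast_id]


def judge(logits, expected_id, contrast_id, threshold=0.10):
    """Hypothetical pass/confidence rule built on the margin."""
    margin = contrastive_margin(logits, expected_id, contrast_id)
    return {
        "correct": margin > 0,            # expected token outranks the contrast
        "confident": margin > threshold,  # and does so by a clear gap
        "margin": margin,
    }


# Toy logits over a 4-token vocabulary (made up for illustration).
toy_logits = [2.0, 1.0, 0.1, -1.0]
result = judge(toy_logits, expected_id=0, contrast_id=1)
print(result["correct"], result["confident"])
```

Separating "correct" from "confident" is what lets the two models be compared on how decisively they prefer the right completion, not only on whether they rank it first.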

Residual failures cluster around Old Testament proper-noun recall (Jeremiah, Jonah, Job, Nebuchadnezzar) and multi-piece subword reconstruction (`sabachthani`, `iniquity`, `Nebuchadnezzar`). The benchmark indicates strong domain-specific MLM behavior on this suite; generalization beyond the eval distribution has not been independently verified.

## Tokenizer

`theo-bert-base` uses the **`google-bert/bert-base-uncased` tokenizer**. The fast-tokenizer files (`tokenizer.json`, `tokenizer_config.json`) are bundled in this repo so `AutoTokenizer.from_pretrained("toranb/theo-bert-base")` and the Hub `fill-mask` widget work out of the box.

Tokenizer files are redistributed unmodified from [`google-bert/bert-base-uncased`](https://huggingface.co/google-bert/bert-base-uncased), released by Google under the Apache License 2.0.

## Architecture

- 12 transformer blocks
- Hidden size 768
- 8 attention heads (head dim 96)
- Training sequence length 256 (rotary cache supports up to 2,560 tokens)
- Vocabulary size 30,522 (from the `bert-base-uncased` tokenizer)
- RoPE positional encoding applied to query and key projections
- RMS normalization on Q and K (no learnable gain)
- ReLU-squared MLP activation
- Gated value embeddings on even-indexed layers
- Learned residual interpolation between each block output and the initial token-embedding state
- MLM head: `Linear → GELU → RMSNorm → Linear`

Parameter count: **273,051,864** (≈273M).
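
A few of the listed components are easy to state in code. The sketch below is written in plain Python for readability and is not taken from `modeling_theo_bert_base.py`; the function names, the `eps` value, the RoPE base of 10000, and the half-split pairing of dimensions are all assumptions. It shows RMS normalization without a learnable gain, the ReLU-squared activation, and a rotary embedding whose query-key dot product depends only on the relative position.

```python
import math

HIDDEN_SIZE = 768
NUM_HEADS = 8
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS  # 96, matching the list above


def rms_norm(x, eps=1e-6):
    """Scale x to unit root-mean-square; no learnable gain (eps is assumed)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def relu_squared(x):
    """ReLU followed by squaring, applied elementwise."""
    return [max(v, 0.0) ** 2 for v in x]


def rope(x, position, base=10000.0):
    """Rotate pairs (x[i], x[i + half]) by position-dependent angles.

    The half-split pairing and the base are assumptions for this sketch.
    """
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        angle = position * base ** (-i / half)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# Tiny 4-dimensional query/key vectors (toy values).
q = rms_norm([0.5, -1.0, 2.0, 0.25])
k = rms_norm([1.5, 0.5, -0.5, 1.0])

# Same relative offset (3) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 13), rope(k, 10))
print(HEAD_DIM, abs(score_a - score_b) < 1e-9)
```

The two scores agree because each rotation only shifts the phase of a dimension pair, so the attention logit sees the offset between positions rather than the positions themselves.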

## Quick Start: Fill-Mask

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "toranb/theo-bert-base"
tokenizer = AutoTokenizer.from_pretrained(repo)
# trust_remote_code=True is required to load the custom architecture
model = AutoModelForMaskedLM.from_pretrained(repo, trust_remote_code=True)
model.eval()

inputs = tokenizer(
    "For God so loved the [MASK] that he gave his only Son.",
    return_tensors="pt",
)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Position of the [MASK] token in the first (and only) sequence
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=False)[0, 1]
top_ids = outputs.logits[0, mask_index].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
# → ['world', 'universe', 'son', 'church', 'earth']
```

## Quick Start: Encoder Hidden States

```python
import torch
from transformers import AutoModel, AutoTokenizer

repo = "toranb/theo-bert-base"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
model.eval()

batch = tokenizer(
    ["faith working through love", "the kingdom of God"],
    padding=True, truncation=True, max_length=256, return_tensors="pt",
)
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # [B, T, 768]

# Mean-pool over real tokens only, ignoring padding positions
mask = batch["attention_mask"].unsqueeze(-1).float()
pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)  # [B, 768]
```

## Repository Contents

| File | Purpose |
|---|---|
| `configuration_theo_bert_base.py` | Hugging Face config class |
| `modeling_theo_bert_base.py` | `AutoModel` and `AutoModelForMaskedLM` implementations |
| `muon.py` | Local Muon optimizer (retained for self-contained fine-tuning) |
| `config.json` | Generated from the source checkpoint configuration |
| `model.safetensors` | Released fp16 weights |
| `checkpoint_metadata.json` | Source checkpoint and per-stage training metadata |
| `LICENSE` | Apache-2.0 |

### Scripts

| Script | Purpose |
|---|---|
| `scripts/mlm_eval_safetensors.py` | Loads `model.safetensors` + `eval.json` and runs the full 546-case MLM evaluation suite |

## Limitations

- Specialized for biblical and theological language; may underperform on broad general-domain NLP tasks.
- Tokenizer inherited from `bert-base-uncased`, so wordpiece behavior follows general English conventions rather than a theology-specific tokenizer.
- Trained at 256-token context. Longer inputs work within the rotary cache (up to 2,560 tokens), but extended-context behavior is not a primary target of this release.
- Training data is private, so external auditing of corpus composition is limited. The canonical-knowledge eval cases overlap by design with biblical text that appears in the training corpus, so the 88.4% pass rate on that category should be read as in-distribution recall, not held-out generalization.
- Encoder-only MLM, not an autoregressive decoder.

## Release Details

- Exported from `mlmcontinued/latest.pt` (Stage 2 final epoch, training accuracy 79.66%)
- Source checkpoint loss `0.8958`
- Released weights in fp16 for bandwidth efficiency (546 MB)
- Release format uses `safetensors`
- Loading requires `trust_remote_code=True` to register the custom architecture
- `config.json` declares `torch_dtype: float32`, so a default load upcasts the weights to float32 on read. Disk weights stay fp16 (small download), and CPU inference is numerically safe by default. For GPU fp16 inference, pass `dtype=torch.float16` to `from_pretrained`.
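
The fp16 download size follows directly from the parameter count given in the Architecture section. The arithmetic below is a quick check, assuming 2 bytes per fp16 parameter, 4 bytes per fp32 parameter, decimal megabytes, and a negligible `safetensors` header; the names are illustrative.

```python
NUM_PARAMS = 273_051_864  # parameter count stated in this model card


def weight_bytes(num_params, bytes_per_param):
    """Raw tensor storage, ignoring file headers and runtime overhead."""
    return num_params * bytes_per_param


fp16_mb = weight_bytes(NUM_PARAMS, 2) / 1e6  # on-disk size of the release
fp32_mb = weight_bytes(NUM_PARAMS, 4) / 1e6  # after the default float32 upcast

print(f"fp16 on disk:   {fp16_mb:.0f} MB")   # 546 MB
print(f"fp32 in memory: {fp32_mb:.0f} MB")   # 1092 MB
```

A default load therefore needs roughly twice the download size in memory for the weights alone, which is the trade-off behind loading in fp16 on GPU.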