# MultiReligionBERT
MultiReligionBERT is a domain-adapted multilingual BERT model produced by continued masked language modelling (MLM) pre-training on a 12-language Bible corpus. Starting from bert-base-multilingual-cased (178M parameters), the model was trained for 30,000 steps on 372,652 verses spanning European and African languages. It is designed for cross-lingual religious NLP tasks, including zero-shot transfer to African languages that have limited representation in general-purpose multilingual pre-training corpora.
A companion English-only model, ReligionBERT, covers monolingual English religious NLP with stronger performance on English-specific tasks.
## Model Details
| Field | Details |
|---|---|
| Model type | BERT (encoder-only, masked language model) |
| Base model | bert-base-multilingual-cased (178M parameters) |
| Pre-training objective | Continued MLM (15% token masking) |
| Pre-training corpus | 12-language Bible corpus, 372,652 verses |
| Training steps | 30,000 |
| Final validation loss | 1.392 |
| Languages | English, French, Spanish, Portuguese, German, Amharic, Shona, Xhosa, Malagasy, Somali, Zarma, Ewe (eval only), Swahili (eval only) |
| License | Apache 2.0 |
| Developed by | Lucas Licht |
| Institution | Koforidua Technical University, Ghana |
## Intended Use
MultiReligionBERT is intended for multilingual and cross-lingual NLP tasks on religious and biblical text, including:
- Zero-shot cross-lingual classification of religious text for African languages
- Multilingual semantic similarity between Bible verses across languages
- Book and section classification of biblical passages in multiple languages
- Feature extraction for multilingual religious NLP pipelines
- Multilingual masked language modelling on religious corpora
It is particularly suited to low-resource African language settings where general-purpose multilingual models such as mBERT have near-zero performance on domain-relevant tasks.
## How to Get Started
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("LucasLicht/multi-religion-bert")
model = AutoModelForMaskedLM.from_pretrained("LucasLicht/multi-religion-bert")
model.eval()

# English (John 3:16)
text_en = "For God so loved the [MASK] that he gave his only begotten Son."
# French: "For God so loved the [MASK] that he gave his only Son."
text_fr = "Car Dieu a tant aimé le [MASK] qu'il a donné son Fils unique."
# Amharic (John 3:16)
text_am = "እግዚአብሔር ዓለሙን እጅግ [MASK] ስለ አፈቀረ።"

for text in [text_en, text_fr, text_am]:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Find the [MASK] position(s) and decode the highest-scoring prediction
    masked_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    logits = outputs.logits[0, masked_index]
    predicted_token = tokenizer.decode(torch.argmax(logits, dim=-1))
    print(f"{text[:40]}... => {predicted_token}")
```
For multilingual sentence embeddings or fine-tuning:
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("LucasLicht/multi-religion-bert")
model = AutoModel.from_pretrained("LucasLicht/multi-religion-bert")
model.eval()

sentences = [
    "The LORD is my shepherd; I shall not want.",          # English (Psalm 23:1)
    "L'Éternel est mon berger: je ne manquerai de rien.",  # French
    "Yehova ndisafudzi wangu; handishayiwi chinhu.",       # Shona
]

for sent in sentences:
    inputs = tokenizer(sent, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the final hidden state of the [CLS] token as a sentence embedding
    cls_embedding = outputs.last_hidden_state[:, 0, :]
    print(f"Embedding shape: {cls_embedding.shape}")  # torch.Size([1, 768])
```
## Training Details
### Pre-Training Corpus
The corpus was sourced from the christos-c/bible-corpus repository. Twelve languages were selected to provide geographic and typological diversity across European and African language families.
| Language | Family | Script | Verses | Role |
|---|---|---|---|---|
| English (KJV + WEB) | Germanic | Latin | ~62,000 | Pre-training |
| French | Romance | Latin | ~31,000 | Pre-training |
| Spanish | Romance | Latin | ~31,000 | Pre-training |
| Portuguese | Romance | Latin | ~31,000 | Pre-training |
| German | Germanic | Latin | ~31,000 | Pre-training |
| Amharic | Semitic | Ethiopic | ~31,000 | Pre-training |
| Shona | Bantu | Latin | ~31,000 | Pre-training |
| Xhosa | Bantu | Latin | ~31,000 | Pre-training |
| Malagasy | Austronesian | Latin | ~31,000 | Pre-training |
| Somali | Cushitic | Latin | ~31,000 | Pre-training |
| Zarma | Nilo-Saharan | Latin | ~31,000 | Pre-training |
| Ewe | Gbe | Latin | NT only | Evaluation only |
| Swahili | Bantu | Latin | NT only | Evaluation only |
Total pre-training corpus: 372,652 verses
A RAM-exhaustion issue during tokenisation of the full corpus was resolved by processing the corpus in 50,000-line chunks saved as HuggingFace dataset shards before concatenation.
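A minimal sketch of that chunked workflow, assuming a verse-per-line text file and the mBERT tokenizer (the file name and the tokenize_fn helper are illustrative, not from the original training code):

```python
from datasets import Dataset, concatenate_datasets, load_from_disk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
CHUNK = 50_000  # lines per shard, as described above

def tokenize_fn(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

# Read the verse-per-line corpus (file name is illustrative)
with open("bible_corpus.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

shard_paths = []
for i in range(0, len(lines), CHUNK):
    shard = Dataset.from_dict({"text": lines[i : i + CHUNK]})
    shard = shard.map(tokenize_fn, batched=True, remove_columns=["text"])
    path = f"shards/shard_{i // CHUNK}"
    shard.save_to_disk(path)  # spill the tokenised shard to disk, freeing RAM
    shard_paths.append(path)

# Reloaded shards are memory-mapped, so concatenation stays cheap on RAM
dataset = concatenate_datasets([load_from_disk(p) for p in shard_paths])
```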
### Training Procedure
| Hyperparameter | Value |
|---|---|
| Base model | bert-base-multilingual-cased |
| Training steps | 30,000 |
| Effective batch size | 32 (16 per device, 2 gradient accumulation steps) |
| Learning rate | 3e-5 (linear warmup, 500 steps) |
| Weight decay | 0.01 (AdamW) |
| MLM masking probability | 15% |
| Max sequence length | 128 tokens |
| Precision | FP16 mixed precision |
| Hardware | NVIDIA Tesla T4 / A100 (Google Colab) |
| Framework | HuggingFace Transformers 5.0.0 |
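For reference, the hyperparameters above map onto a HuggingFace Trainer configuration roughly as follows. This is a hedged reconstruction, not the released training script; the output path and logging settings are assumptions, and `dataset` is the tokenised corpus from the chunked step above:

```python
from transformers import (
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15  # 15% token masking
)

args = TrainingArguments(
    output_dir="multi-religion-bert",  # assumed path
    max_steps=30_000,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,     # effective batch size 32
    learning_rate=3e-5,
    warmup_steps=500,                  # linear warmup (Trainer's default schedule)
    weight_decay=0.01,                 # AdamW is the Trainer default optimiser
    fp16=True,
    report_to="wandb",                 # metrics logged to Weights and Biases
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset)
trainer.train()
```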
Training was conducted across multiple sessions with checkpoint recovery. Due to session interruptions, full logging begins at step 18,500. A PermanentDeleteCallback retained only the two most recent checkpoints to prevent storage exhaustion. All metrics were logged to Weights and Biases.
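The PermanentDeleteCallback itself was not published with this card; the sketch below shows what such a checkpoint-pruning callback could look like. Note that `TrainingArguments(save_total_limit=2)` provides similar behaviour out of the box:

```python
import os
import re
import shutil

from transformers import TrainerCallback

class PermanentDeleteCallback(TrainerCallback):
    """Keep only the `keep` most recent checkpoints (assumed reimplementation)."""

    def __init__(self, keep: int = 2):
        self.keep = keep

    def on_save(self, args, state, control, **kwargs):
        # Checkpoints are saved as <output_dir>/checkpoint-<step>
        ckpts = [
            d for d in os.listdir(args.output_dir)
            if re.fullmatch(r"checkpoint-\d+", d)
        ]
        ckpts.sort(key=lambda d: int(d.split("-")[1]))
        for stale in ckpts[: -self.keep]:  # everything but the newest `keep`
            shutil.rmtree(os.path.join(args.output_dir, stale))
        return control
```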
### Validation Loss Curve
| Step | Validation Loss |
|---|---|
| 18,500 | 1.532 |
| 20,000 | 1.516 |
| 22,000 | 1.474 |
| 24,000 | 1.501 |
| 26,000 | 1.450 |
| 28,000 | 1.430 |
| 29,500 | 1.368 (best) |
| 30,000 | 1.392 (final) |
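For context, the final validation loss of 1.392 corresponds to a masked-token perplexity of exp(1.392) ≈ 4.0, i.e. the model is on average choosing among roughly four equally likely candidates per masked token.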
## Evaluation
### Downstream Task Results
Three fine-tuning tasks were evaluated using automatically constructed datasets. All results are on held-out test sets. MultiReligionBERT is compared to its generic baseline (bert-base-multilingual-cased).
#### Semantic Similarity (21,994 verse pairs)
| Model | Pearson | Spearman |
|---|---|---|
| mBERT | 0.9624 | 0.6772 |
| MultiReligionBERT | 0.9635 | 0.6669 |

The two models are effectively tied on this task: MultiReligionBERT scores marginally higher on Pearson and marginally lower on Spearman correlation.
#### Book Classification (7,726 samples, 66 classes)
| Model | Accuracy | Macro F1 |
|---|---|---|
| mBERT | 0.3972 | 0.2948 |
| MultiReligionBERT | 0.4360 | 0.3369 |
MultiReligionBERT outperforms mBERT by +3.88 accuracy points and +4.21 macro F1 points.
#### Extractive Question Answering (1,199 LLM-assisted examples, English)
| Model | Exact Match (%) | Token F1 (%) |
|---|---|---|
| mBERT | 48.33 | 70.54 |
| MultiReligionBERT | 40.83 | 65.39 |
Note: MultiReligionBERT underperforms mBERT on English extractive QA. Multilingual domain adaptation shifts the model's representations in ways that are counterproductive for English-only span extraction on a small dataset, consistent with the known risk of catastrophic forgetting of language-specific capabilities during continued pre-training.
### Zero-Shot Cross-Lingual Transfer
MultiReligionBERT, mBERT, and XLM-RoBERTa were evaluated on zero-shot book classification applied to five African language Bible corpora. All models were fine-tuned exclusively on English classification data; no target-language training was performed.
#### Accuracy
| Language | Family | Samples | mBERT | MultiReligionBERT | XLM-RoBERTa |
|---|---|---|---|---|---|
| Amharic | Semitic | 264 | 0.0152 | 0.0152 | 0.1326 |
| Shona | Bantu | 252 | 0.0556 | 0.0714 | 0.0556 |
| Xhosa | Bantu | 264 | 0.0189 | 0.0227 | 0.0530 |
| Ewe | Gbe | 297 | 0.0000 | 0.0337 | 0.0034 |
| Swahili | Bantu | 286 | 0.0280 | 0.0559 | 0.0874 |
#### Macro F1
| Language | mBERT | MultiReligionBERT | XLM-RoBERTa |
|---|---|---|---|
| Amharic | 0.0005 | 0.0005 | 0.0886 |
| Shona | 0.0267 | 0.0402 | 0.0381 |
| Xhosa | 0.0040 | 0.0072 | 0.0396 |
| Ewe | 0.0000 | 0.0187 | 0.0011 |
| Swahili | 0.0098 | 0.0262 | 0.0364 |
MultiReligionBERT outperforms mBERT on 4 of 5 languages. The most notable result is on Ewe: mBERT predicts zero correct book labels across all 297 test samples, while MultiReligionBERT achieves 3.37% accuracy and 0.0187 macro F1. Ewe is a severely low-resource Gbe language with minimal representation in standard multilingual pre-training corpora, but it is covered by a New Testament translation included in the Bible pre-training data. This demonstrates that domain-specific corpus coverage of a low-resource language provides transfer signal that general-purpose multilingual pre-training at scale alone does not supply.
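To reproduce this protocol, the evaluation amounts to English-only fine-tuning followed by direct prediction on target-language verses. The sketch below assumes pre-built english_train and ewe_test datasets with text and label columns (neither is shipped with this card), and the epoch count is an assumption:

```python
import numpy as np
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("LucasLicht/multi-religion-bert")
model = AutoModelForSequenceClassification.from_pretrained(
    "LucasLicht/multi-religion-bert", num_labels=66  # 66 Bible books
)

def tok(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_en = english_train.map(tok, batched=True)  # assumed English verse/book dataset
test_ewe = ewe_test.map(tok, batched=True)       # assumed Ewe verse/book dataset

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="zeroshot-book-clf", num_train_epochs=3),
    train_dataset=train_en,
)
trainer.train()  # fine-tune on English only; no Ewe labels are seen

# Zero-shot evaluation: predict book labels for Ewe verses directly
preds = trainer.predict(test_ewe)
accuracy = (preds.predictions.argmax(-1) == preds.label_ids).mean()
print(f"Zero-shot Ewe accuracy: {accuracy:.4f}")
```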
## Datasets
| Dataset | Task | Size | Notes |
|---|---|---|---|
| Verse Similarity | Semantic similarity (STS) | 21,994 pairs | Cross-translation and intra-corpus pairs; balanced subset of 6,392 pairs |
| Bible Book Classification | Text classification | 7,726 samples | 66 classes; 80/10/10 split |
| Bible QA | Extractive QA | 1,199 examples | LLM-assisted generation via Llama 3.3 70B; SQuAD v2 format; all examples human-verified |
## Limitations
- Pre-training covers 12 languages. Performance on other religious traditions (Quran, Vedas, Buddhist sutras) and languages not in the pre-training corpus has not been evaluated.
- Ewe and Swahili are included only as evaluation languages (New Testament-only translations); their pre-training signal is limited to the NT portion.
- The model underperforms mBERT on English extractive QA due to domain shift during multilingual continued pre-training.
- The model inherits biases present in bert-base-multilingual-cased and may reflect translation-specific theological perspectives in the source Bible texts.
- Cross-lingual accuracy on all African languages remains low in absolute terms; results should be interpreted relative to mBERT rather than as production-ready performance.
## Citation

If you use this model, please cite:

```bibtex
@misc{licht2025multireligionbert,
  title  = {ReligionBERT: Domain-Adaptive Pre-Training of BERT on Biblical Corpora for Religious NLP Tasks},
  author = {Licht, Lucas},
  year   = {2025},
  note   = {Koforidua Technical University, Ghana. Model available at https://huggingface.co/LucasLicht/multi-religion-bert}
}
```
## Related Model
| Model | Languages | Best Use Case |
|---|---|---|
| ReligionBERT | English only | Monolingual English religious NLP; stronger QA performance |
| MultiReligionBERT | 12 languages | Cross-lingual transfer; African language zero-shot tasks |
## Contact
For questions or collaboration, reach out via HuggingFace or GitHub: @Licht005