# MultiReligionBERT

MultiReligionBERT is a domain-adapted multilingual BERT model produced by continued masked language modelling (MLM) pre-training on a 12-language Bible corpus. Starting from bert-base-multilingual-cased (178M parameters), the model was trained for 30,000 steps on 372,652 verses spanning European and African languages. It is designed for cross-lingual religious NLP tasks, including zero-shot transfer to African languages that have limited representation in general-purpose multilingual pre-training corpora.

A companion English-only model, ReligionBERT, targets monolingual English religious NLP and offers stronger performance on English-specific tasks.


## Model Details

| Field | Details |
|---|---|
| Model type | BERT (encoder-only, masked language model) |
| Base model | bert-base-multilingual-cased (178M parameters) |
| Pre-training objective | Continued MLM (15% token masking) |
| Pre-training corpus | 12-language Bible corpus, 372,652 verses |
| Training steps | 30,000 |
| Final validation loss | 1.392 |
| Languages | English, French, Spanish, Portuguese, German, Amharic, Shona, Xhosa, Malagasy, Somali, Zarma; Ewe and Swahili (evaluation only) |
| License | Apache 2.0 |
| Developed by | Lucas Licht |
| Institution | Koforidua Technical University, Ghana |

## Intended Use

MultiReligionBERT is intended for multilingual and cross-lingual NLP tasks on religious and biblical text, including:

- Zero-shot cross-lingual classification of religious text for African languages
- Multilingual semantic similarity between Bible verses across languages
- Book and section classification of biblical passages in multiple languages
- Feature extraction for multilingual religious NLP pipelines
- Multilingual masked language modelling on religious corpora

It is particularly suited to low-resource African language settings, where general-purpose multilingual models such as mBERT achieve near-zero performance on domain-relevant tasks (see the zero-shot results below).


## How to Get Started

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("LucasLicht/multi-religion-bert")
model = AutoModelForMaskedLM.from_pretrained("LucasLicht/multi-religion-bert")

# English
text_en = "For God so loved the [MASK] that he gave his only begotten Son."
# French
text_fr = "Car Dieu a tant aimé le [MASK] qu'il a donné son Fils unique."
# Amharic
text_am = "እግዚአብሔር ዓለሙን እጅግ [MASK] ስለ አፈቀረ።"

for text in [text_en, text_fr, text_am]:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Locate the [MASK] position and decode the highest-scoring prediction.
    masked_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    logits = outputs.logits[0, masked_index]
    predicted_token = tokenizer.decode(torch.argmax(logits, dim=-1))
    print(f"{text[:40]}... => {predicted_token}")
```

For multilingual sentence embeddings or fine-tuning:

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("LucasLicht/multi-religion-bert")
model = AutoModel.from_pretrained("LucasLicht/multi-religion-bert")

sentences = [
    "The LORD is my shepherd; I shall not want.",         # English
    "L'Éternel est mon berger: je ne manquerai de rien.", # French
    "Yehova ndisafudzi wangu; handishayiwi chinhu.",      # Shona
]

for sent in sentences:
    inputs = tokenizer(sent, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token embedding
    print(f"Embedding shape: {cls_embedding.shape}")  # [1, 768]
```

## Training Details

### Pre-Training Corpus

The corpus was sourced from the christos-c/bible-corpus repository. Twelve languages were selected to provide geographic and typological diversity across European and African language families.

| Language | Family | Script | Verses | Role |
|---|---|---|---|---|
| English (KJV + WEB) | Germanic | Latin | ~62,000 | Pre-training |
| French | Romance | Latin | ~31,000 | Pre-training |
| Spanish | Romance | Latin | ~31,000 | Pre-training |
| Portuguese | Romance | Latin | ~31,000 | Pre-training |
| German | Germanic | Latin | ~31,000 | Pre-training |
| Amharic | Semitic | Ethiopic | ~31,000 | Pre-training |
| Shona | Bantu | Latin | ~31,000 | Pre-training |
| Xhosa | Bantu | Latin | ~31,000 | Pre-training |
| Malagasy | Austronesian | Latin | ~31,000 | Pre-training |
| Somali | Cushitic | Latin | ~31,000 | Pre-training |
| Zarma | Nilo-Saharan | Latin | ~31,000 | Pre-training |
| Ewe | Gbe | Latin | NT only | Evaluation only |
| Swahili | Bantu | Latin | NT only | Evaluation only |

Total pre-training corpus: 372,652 verses

Tokenising the full corpus in one pass exhausted available RAM; this was resolved by processing the corpus in 50,000-line chunks, saving each chunk as a HuggingFace dataset shard, and concatenating the shards afterwards, as sketched below.
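
A minimal sketch of that chunking strategy (the corpus file name and the `text` column are assumptions; the exact preprocessing script is not published):

```python
from datasets import Dataset, concatenate_datasets
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

shards, buffer = [], []
with open("bible_corpus.txt", encoding="utf-8") as f:  # hypothetical one-verse-per-line file
    for line in f:
        buffer.append(line.strip())
        if len(buffer) == 50_000:  # tokenise in 50,000-line chunks
            shard = Dataset.from_dict({"text": buffer}).map(
                tokenize, batched=True, remove_columns=["text"])
            shard.save_to_disk(f"shards/shard_{len(shards):03d}")  # persist shard to keep RAM flat
            shards.append(shard)
            buffer = []
if buffer:  # final partial chunk
    shards.append(Dataset.from_dict({"text": buffer}).map(
        tokenize, batched=True, remove_columns=["text"]))
corpus = concatenate_datasets(shards)
```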

### Training Procedure

| Hyperparameter | Value |
|---|---|
| Base model | bert-base-multilingual-cased |
| Training steps | 30,000 |
| Effective batch size | 32 (16 per device, 2 gradient accumulation steps) |
| Learning rate | 3e-5 (linear warmup, 500 steps) |
| Weight decay | 0.01 (AdamW) |
| MLM masking probability | 15% |
| Max sequence length | 128 tokens |
| Precision | FP16 mixed precision |
| Hardware | NVIDIA Tesla T4 / A100 (Google Colab) |
| Framework | HuggingFace Transformers 5.0.0 |

Training was conducted across multiple sessions with checkpoint recovery. Due to session interruptions, full logging begins at step 18,500. A PermanentDeleteCallback retained only the two most recent checkpoints to prevent storage exhaustion. All metrics were logged to Weights and Biases.
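
For reference, a sketch of a Trainer configuration matching the table above, continuing the `tokenizer` and `corpus` from the preprocessing sketch; the output directory is a placeholder, and `save_total_limit=2` is the built-in analogue we substitute for the custom PermanentDeleteCallback:

```python
from transformers import (AutoModelForMaskedLM, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Dynamic masking at the standard 15% probability.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="multi-religion-bert",  # placeholder
    max_steps=30_000,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,     # effective batch size 32
    learning_rate=3e-5,
    warmup_steps=500,                  # linear warmup
    weight_decay=0.01,                 # AdamW is the Trainer's default optimiser
    fp16=True,
    save_total_limit=2,                # keep only the two most recent checkpoints
    report_to="wandb",
)

trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=corpus)  # `corpus` from the chunked pipeline above
trainer.train(resume_from_checkpoint=True)  # resume across interrupted sessions
```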

### Validation Loss Curve

| Step | Validation Loss |
|---|---|
| 18,500 | 1.532 |
| 20,000 | 1.516 |
| 22,000 | 1.474 |
| 24,000 | 1.501 |
| 26,000 | 1.450 |
| 28,000 | 1.430 |
| 29,500 | 1.368 (best) |
| 30,000 | 1.392 (final) |

## Evaluation

### Downstream Task Results

Three fine-tuning tasks were evaluated using automatically constructed datasets. All results are on held-out test sets. MultiReligionBERT is compared to its generic baseline (bert-base-multilingual-cased).

#### Semantic Similarity (21,994 verse pairs)

| Model | Pearson | Spearman |
|---|---|---|
| mBERT | 0.9624 | 0.6772 |
| MultiReligionBERT | 0.9635 | 0.6669 |

The two models are essentially at parity on this task: MultiReligionBERT edges ahead on Pearson correlation while mBERT retains a slight lead on Spearman.

#### Book Classification (7,726 samples, 66 classes)

| Model | Accuracy | Macro F1 |
|---|---|---|
| mBERT | 0.3972 | 0.2948 |
| MultiReligionBERT | 0.4360 | 0.3369 |

MultiReligionBERT outperforms mBERT by +3.88 accuracy points and +4.21 macro F1 points.

#### Extractive Question Answering (1,199 LLM-assisted examples, English)

| Model | Exact Match (%) | Token F1 (%) |
|---|---|---|
| mBERT | 48.33 | 70.54 |
| MultiReligionBERT | 40.83 | 65.39 |

Note: MultiReligionBERT underperforms mBERT on English extractive QA. Multilingual domain adaptation shifts the model's representations in ways that are counterproductive for English-only span extraction on a small dataset, consistent with the known risk of catastrophic forgetting of language-specific capabilities during continued pre-training.


### Zero-Shot Cross-Lingual Transfer

MultiReligionBERT, mBERT, and XLM-RoBERTa were evaluated on zero-shot book classification applied to five African language Bible corpora. All models were fine-tuned exclusively on English classification data; no target-language training was performed.
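
A sketch of this protocol (`english_train` and `ewe_test` are hypothetical tokenized splits with integer book labels; the training settings are placeholders, not the original configuration):

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Fine-tune on English only; the classification head is newly initialised.
model = AutoModelForSequenceClassification.from_pretrained(
    "LucasLicht/multi-religion-bert", num_labels=66)  # 66 Bible books

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="book-clf", num_train_epochs=3),  # placeholder settings
    train_dataset=english_train,  # hypothetical tokenized English split
)
trainer.train()

# Zero-shot evaluation: no Ewe examples were seen during fine-tuning.
preds = trainer.predict(ewe_test)  # hypothetical tokenized Ewe split
accuracy = (preds.predictions.argmax(-1) == preds.label_ids).mean()
print(f"Zero-shot accuracy: {accuracy:.4f}")
```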

#### Accuracy

| Language | Family | Samples | mBERT | MultiReligionBERT | XLM-RoBERTa |
|---|---|---|---|---|---|
| Amharic | Semitic | 264 | 0.0152 | 0.0152 | 0.1326 |
| Shona | Bantu | 252 | 0.0556 | 0.0714 | 0.0556 |
| Xhosa | Bantu | 264 | 0.0189 | 0.0227 | 0.0530 |
| Ewe | Gbe | 297 | 0.0000 | 0.0337 | 0.0034 |
| Swahili | Bantu | 286 | 0.0280 | 0.0559 | 0.0874 |

#### Macro F1

| Language | mBERT | MultiReligionBERT | XLM-RoBERTa |
|---|---|---|---|
| Amharic | 0.0005 | 0.0005 | 0.0886 |
| Shona | 0.0267 | 0.0402 | 0.0381 |
| Xhosa | 0.0040 | 0.0072 | 0.0396 |
| Ewe | 0.0000 | 0.0187 | 0.0011 |
| Swahili | 0.0098 | 0.0262 | 0.0364 |

MultiReligionBERT outperforms mBERT on 4 of 5 languages. The most notable result is on Ewe: mBERT predicts zero correct book labels across all 297 test samples, while MultiReligionBERT achieves 3.37% accuracy and 0.0187 macro F1. Ewe is a severely low-resource Gbe language with minimal representation in standard multilingual pre-training corpora, but it is covered by a New Testament translation included in the Bible pre-training data. This demonstrates that domain-specific corpus coverage of a low-resource language provides transfer signal that general-purpose multilingual pre-training at scale alone does not supply.


## Datasets

| Dataset | Task | Size | Notes |
|---|---|---|---|
| Verse Similarity | Semantic similarity (STS) | 21,994 pairs | Cross-translation and intra-corpus pairs; balanced subset of 6,392 pairs |
| Bible Book Classification | Text classification | 7,726 samples | 66 classes; 80/10/10 split |
| Bible QA | Extractive QA | 1,199 examples | LLM-assisted via Llama 3.3 70B; SQuAD v2 format; fully human-verified |
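
For reference, each QA record follows the SQuAD v2 schema; the record below is illustrative only, not an actual example from the dataset:

```python
example = {
    "id": "qa-0001",  # invented identifier for illustration
    "context": "In the beginning God created the heaven and the earth.",
    "question": "What did God create in the beginning?",
    "answers": {"text": ["the heaven and the earth"], "answer_start": [29]},
    # SQuAD v2 additionally permits unanswerable questions:
    "is_impossible": False,
}
```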

## Limitations

- Pre-training covers 12 languages. Performance on other religious traditions (Quran, Vedas, Buddhist sutras) and on languages outside the pre-training corpus has not been evaluated.
- Ewe and Swahili are included only as evaluation languages (New Testament-only translations); their pre-training signal is limited to the NT portion.
- The model underperforms mBERT on English extractive QA due to domain shift during multilingual continued pre-training.
- The model inherits biases present in bert-base-multilingual-cased and may reflect translation-specific theological perspectives in the source Bible texts.
- Cross-lingual accuracy on all African languages remains low in absolute terms; results should be interpreted relative to mBERT rather than as production-ready performance.

## Citation

If you use this model, please cite:

```bibtex
@misc{licht2025multireligionbert,
  title     = {ReligionBERT: Domain-Adaptive Pre-Training of BERT on Biblical Corpora for Religious NLP Tasks},
  author    = {Licht, Lucas},
  year      = {2025},
  note      = {Koforidua Technical University, Ghana. Model available at https://huggingface.co/LucasLicht/multi-religion-bert}
}
```

## Related Model

| Model | Languages | Best Use Case |
|---|---|---|
| ReligionBERT | English only | Monolingual English religious NLP; stronger QA performance |
| MultiReligionBERT | 12 languages | Cross-lingual transfer; African language zero-shot tasks |

## Contact

For questions or collaboration, reach out via HuggingFace or GitHub: @Licht005
