# MultiReligionBERT

MultiReligionBERT is a domain-adapted multilingual BERT model produced by continued masked language modelling (MLM) pre-training on a 12-language Bible corpus. Starting from bert-base-multilingual-cased (178M parameters), the model was trained for 30,000 steps on 372,652 verses spanning European and African languages. It is designed for cross-lingual religious NLP tasks, including zero-shot transfer to African languages that have limited representation in general-purpose multilingual pre-training corpora.

A companion English-only model, ReligionBERT, targets monolingual English religious NLP and offers stronger performance on English-specific tasks.


## Model Details

| Field | Details |
|---|---|
| Model type | BERT (encoder-only, masked language model) |
| Base model | bert-base-multilingual-cased (178M parameters) |
| Pre-training objective | Continued MLM (15% token masking) |
| Pre-training corpus | 12-language Bible corpus, 372,652 verses |
| Training steps | 30,000 |
| Final validation loss | 1.392 |
| Languages | English, French, Spanish, Portuguese, German, Amharic, Shona, Xhosa, Malagasy, Somali, Zarma; Ewe and Swahili (evaluation only) |
| License | Apache 2.0 |
| Developed by | Lucas Licht |
| Institution | Koforidua Technical University, Ghana |

## Intended Use

MultiReligionBERT is intended for multilingual and cross-lingual NLP tasks on religious and biblical text, including:

- Zero-shot cross-lingual classification of religious text for African languages
- Multilingual semantic similarity between Bible verses across languages
- Book and section classification of biblical passages in multiple languages
- Feature extraction for multilingual religious NLP pipelines
- Multilingual masked language modelling on religious corpora

It is particularly suited to low-resource African language settings, where general-purpose multilingual models such as mBERT achieve near-zero performance on domain-relevant tasks (see the zero-shot results below).


## How to Get Started

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("LucasLicht/multi-religion-bert")
model = AutoModelForMaskedLM.from_pretrained("LucasLicht/multi-religion-bert")

# English
text_en = "For God so loved the [MASK] that he gave his only begotten Son."
# French
text_fr = "Car Dieu a tant aimé le [MASK] qu'il a donné son Fils unique."
# Amharic
text_am = "እግዚአብሔር ዓለሙን እጅግ [MASK] ስለ አፈቀረ።"

for text in [text_en, text_fr, text_am]:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Locate the [MASK] position and decode the highest-scoring prediction.
    masked_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    logits = outputs.logits[0, masked_index]
    predicted_token = tokenizer.decode(torch.argmax(logits, dim=-1))
    print(f"{text[:40]}... => {predicted_token}")
```

For multilingual sentence embeddings or fine-tuning:

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("LucasLicht/multi-religion-bert")
model = AutoModel.from_pretrained("LucasLicht/multi-religion-bert")

sentences = [
    "The LORD is my shepherd; I shall not want.",         # English
    "L'Éternel est mon berger: je ne manquerai de rien.", # French
    "Yehova ndisafudzi wangu; handishayiwi chinhu.",      # Shona
]

for sent in sentences:
    inputs = tokenizer(sent, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token embedding
    print(f"Embedding shape: {cls_embedding.shape}")  # [1, 768]
```

## Training Details

### Pre-Training Corpus

The corpus was sourced from the christos-c/bible-corpus repository. Twelve languages were selected to provide geographic and typological diversity across European and African language families.

| Language | Family | Script | Verses | Role |
|---|---|---|---|---|
| English (KJV + WEB) | Germanic | Latin | ~62,000 | Pre-training |
| French | Romance | Latin | ~31,000 | Pre-training |
| Spanish | Romance | Latin | ~31,000 | Pre-training |
| Portuguese | Romance | Latin | ~31,000 | Pre-training |
| German | Germanic | Latin | ~31,000 | Pre-training |
| Amharic | Semitic | Ethiopic | ~31,000 | Pre-training |
| Shona | Bantu | Latin | ~31,000 | Pre-training |
| Xhosa | Bantu | Latin | ~31,000 | Pre-training |
| Malagasy | Austronesian | Latin | ~31,000 | Pre-training |
| Somali | Cushitic | Latin | ~31,000 | Pre-training |
| Zarma | Nilo-Saharan | Latin | ~31,000 | Pre-training |
| Ewe | Gbe | Latin | NT only | Evaluation only |
| Swahili | Bantu | Latin | NT only | Evaluation only |

Total pre-training corpus: 372,652 verses

Tokenising the full corpus in one pass exhausted available RAM; this was resolved by processing the corpus in 50,000-line chunks, saving each chunk as a HuggingFace dataset shard, and concatenating the shards afterwards, as sketched below.
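
A minimal sketch of that chunking strategy (the corpus file name and the `text` column are assumptions; the exact preprocessing script is not published):

```python
from datasets import Dataset, concatenate_datasets
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

shards, buffer = [], []
with open("bible_corpus.txt", encoding="utf-8") as f:  # hypothetical one-verse-per-line file
    for line in f:
        buffer.append(line.strip())
        if len(buffer) == 50_000:  # tokenise in 50,000-line chunks
            shard = Dataset.from_dict({"text": buffer}).map(
                tokenize, batched=True, remove_columns=["text"])
            shard.save_to_disk(f"shards/shard_{len(shards):03d}")  # persist shard to keep RAM flat
            shards.append(shard)
            buffer = []
if buffer:  # final partial chunk
    shards.append(Dataset.from_dict({"text": buffer}).map(
        tokenize, batched=True, remove_columns=["text"]))
corpus = concatenate_datasets(shards)
```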

### Training Procedure

| Hyperparameter | Value |
|---|---|
| Base model | bert-base-multilingual-cased |
| Training steps | 30,000 |
| Effective batch size | 32 (16 per device, 2 gradient accumulation steps) |
| Learning rate | 3e-5 (linear warmup, 500 steps) |
| Weight decay | 0.01 (AdamW) |
| MLM masking probability | 15% |
| Max sequence length | 128 tokens |
| Precision | FP16 mixed precision |
| Hardware | NVIDIA Tesla T4 / A100 (Google Colab) |
| Framework | HuggingFace Transformers 5.0.0 |

Training was conducted across multiple sessions with checkpoint recovery. Due to session interruptions, full logging begins at step 18,500. A PermanentDeleteCallback retained only the two most recent checkpoints to prevent storage exhaustion. All metrics were logged to Weights and Biases.
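
For reference, a sketch of a Trainer configuration matching the table above, continuing the `tokenizer` and `corpus` from the preprocessing sketch; the output directory is a placeholder, and `save_total_limit=2` is the built-in analogue we substitute for the custom PermanentDeleteCallback:

```python
from transformers import (AutoModelForMaskedLM, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Dynamic masking at the standard 15% probability.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="multi-religion-bert",  # placeholder
    max_steps=30_000,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,     # effective batch size 32
    learning_rate=3e-5,
    warmup_steps=500,                  # linear warmup
    weight_decay=0.01,                 # AdamW is the Trainer's default optimiser
    fp16=True,
    save_total_limit=2,                # keep only the two most recent checkpoints
    report_to="wandb",
)

trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=corpus)  # `corpus` from the chunked pipeline above
trainer.train(resume_from_checkpoint=True)  # resume across interrupted sessions
```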

### Validation Loss Curve

| Step | Validation Loss |
|---|---|
| 18,500 | 1.532 |
| 20,000 | 1.516 |
| 22,000 | 1.474 |
| 24,000 | 1.501 |
| 26,000 | 1.450 |
| 28,000 | 1.430 |
| 29,500 | 1.368 (best) |
| 30,000 | 1.392 (final) |

## Evaluation

### Downstream Task Results

Three fine-tuning tasks were evaluated using automatically constructed datasets. All results are on held-out test sets. MultiReligionBERT is compared to its generic baseline (bert-base-multilingual-cased).

#### Semantic Similarity (21,994 verse pairs)

| Model | Pearson | Spearman |
|---|---|---|
| mBERT | 0.9624 | 0.6772 |
| MultiReligionBERT | 0.9635 | 0.6669 |

The two models are essentially at parity on this task: MultiReligionBERT edges ahead on Pearson correlation while mBERT retains a slight lead on Spearman.

#### Book Classification (7,726 samples, 66 classes)

| Model | Accuracy | Macro F1 |
|---|---|---|
| mBERT | 0.3972 | 0.2948 |
| MultiReligionBERT | 0.4360 | 0.3369 |

MultiReligionBERT outperforms mBERT by +3.88 accuracy points and +4.21 macro F1 points.

#### Extractive Question Answering (1,199 LLM-assisted examples, English)

| Model | Exact Match (%) | Token F1 (%) |
|---|---|---|
| mBERT | 48.33 | 70.54 |
| MultiReligionBERT | 40.83 | 65.39 |

Note: MultiReligionBERT underperforms mBERT on English extractive QA. Multilingual domain adaptation shifts the model's representations in ways that are counterproductive for English-only span extraction on a small dataset, consistent with the known risk of catastrophic forgetting of language-specific capabilities during continued pre-training.


### Zero-Shot Cross-Lingual Transfer

MultiReligionBERT, mBERT, and XLM-RoBERTa were evaluated on zero-shot book classification applied to five African language Bible corpora. All models were fine-tuned exclusively on English classification data; no target-language training was performed.
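
A sketch of this protocol (`english_train` and `ewe_test` are hypothetical tokenized splits with integer book labels; the training settings are placeholders, not the original configuration):

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Fine-tune on English only; the classification head is newly initialised.
model = AutoModelForSequenceClassification.from_pretrained(
    "LucasLicht/multi-religion-bert", num_labels=66)  # 66 Bible books

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="book-clf", num_train_epochs=3),  # placeholder settings
    train_dataset=english_train,  # hypothetical tokenized English split
)
trainer.train()

# Zero-shot evaluation: no Ewe examples were seen during fine-tuning.
preds = trainer.predict(ewe_test)  # hypothetical tokenized Ewe split
accuracy = (preds.predictions.argmax(-1) == preds.label_ids).mean()
print(f"Zero-shot accuracy: {accuracy:.4f}")
```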

#### Accuracy

| Language | Family | Samples | mBERT | MultiReligionBERT | XLM-RoBERTa |
|---|---|---|---|---|---|
| Amharic | Semitic | 264 | 0.0152 | 0.0152 | 0.1326 |
| Shona | Bantu | 252 | 0.0556 | 0.0714 | 0.0556 |
| Xhosa | Bantu | 264 | 0.0189 | 0.0227 | 0.0530 |
| Ewe | Gbe | 297 | 0.0000 | 0.0337 | 0.0034 |
| Swahili | Bantu | 286 | 0.0280 | 0.0559 | 0.0874 |

#### Macro F1

| Language | mBERT | MultiReligionBERT | XLM-RoBERTa |
|---|---|---|---|
| Amharic | 0.0005 | 0.0005 | 0.0886 |
| Shona | 0.0267 | 0.0402 | 0.0381 |
| Xhosa | 0.0040 | 0.0072 | 0.0396 |
| Ewe | 0.0000 | 0.0187 | 0.0011 |
| Swahili | 0.0098 | 0.0262 | 0.0364 |

MultiReligionBERT outperforms mBERT on 4 of 5 languages. The most notable result is on Ewe: mBERT predicts zero correct book labels across all 297 test samples, while MultiReligionBERT achieves 3.37% accuracy and 0.0187 macro F1. Ewe is a severely low-resource Gbe language with minimal representation in standard multilingual pre-training corpora, but it is covered by a New Testament translation included in the Bible pre-training data. This demonstrates that domain-specific corpus coverage of a low-resource language provides transfer signal that general-purpose multilingual pre-training at scale alone does not supply.


## Datasets

| Dataset | Task | Size | Notes |
|---|---|---|---|
| Verse Similarity | Semantic similarity (STS) | 21,994 pairs | Cross-translation and intra-corpus pairs; balanced subset of 6,392 pairs |
| Bible Book Classification | Text classification | 7,726 samples | 66 classes; 80/10/10 split |
| Bible QA | Extractive QA | 1,199 examples | LLM-assisted via Llama 3.3 70B; SQuAD v2 format; fully human-verified |
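
For reference, each QA record follows the SQuAD v2 schema; the record below is illustrative only, not an actual example from the dataset:

```python
example = {
    "id": "qa-0001",  # invented identifier for illustration
    "context": "In the beginning God created the heaven and the earth.",
    "question": "What did God create in the beginning?",
    "answers": {"text": ["the heaven and the earth"], "answer_start": [29]},
    # SQuAD v2 additionally permits unanswerable questions:
    "is_impossible": False,
}
```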

## Limitations

- Pre-training covers 12 languages. Performance on other religious traditions (Quran, Vedas, Buddhist sutras) and on languages outside the pre-training corpus has not been evaluated.
- Ewe and Swahili are included only as evaluation languages (New Testament-only translations); their pre-training signal is limited to the NT portion.
- The model underperforms mBERT on English extractive QA due to domain shift during multilingual continued pre-training.
- The model inherits biases present in bert-base-multilingual-cased and may reflect translation-specific theological perspectives in the source Bible texts.
- Cross-lingual accuracy on all African languages remains low in absolute terms; results should be interpreted relative to mBERT rather than as production-ready performance.

## Citation

If you use this model, please cite:

```bibtex
@misc{licht2025multireligionbert,
  title     = {ReligionBERT: Domain-Adaptive Pre-Training of BERT on Biblical Corpora for Religious NLP Tasks},
  author    = {Licht, Lucas},
  year      = {2025},
  note      = {Koforidua Technical University, Ghana. Model available at https://huggingface.co/LucasLicht/multi-religion-bert}
}
```

## Related Model

| Model | Languages | Best Use Case |
|---|---|---|
| ReligionBERT | English only | Monolingual English religious NLP; stronger QA performance |
| MultiReligionBERT | 12 languages | Cross-lingual transfer; African language zero-shot tasks |

## Contact

For questions or collaboration, reach out via HuggingFace or GitHub: @Licht005
