# Indo-Religiolect-BERT V2
A fine-tuned BERT model for classifying Indonesian text into three distinct religious denominations: Islam, Catholicism, and Protestantism.
## Model Description
This model uses IndoBERT (Indonesian BERT) as the base model and is fine-tuned to identify distinctive "religiolects" (religious dialects) used by different faith communities in Indonesia. The model distinguishes between the three groups with high accuracy, even where Catholic and Protestant discourse share much of their vocabulary.
- Base Model: indolem/indobert-base-uncased
- Task: Text Classification (3-class)
- Language: Indonesian
- Classes: Islam (0), Catholic (1), Protestant (2)
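The class indices above can be expressed as the `id2label` / `label2id` maps that a Transformers config conventionally carries (a sketch using the labels listed here; the actual map names stored in the released config are an assumption):

```python
# id2label / label2id maps, sketched from the class list above.
# These dicts mirror the standard Transformers config convention.
id2label = {0: "Islam", 1: "Catholic", 2: "Protestant"}
label2id = {name: idx for idx, name in id2label.items()}

print(label2id["Protestant"])  # -> 2
```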
## Training Details
- Training Strategy: Balanced undersampling to ensure equal representation across all three classes
- Architecture: BERT-based sequence classification
- Max Sequence Length: 128 tokens
- Training Data: ~3 million sentences from 100+ authoritative religious websites
### Training Data Sources
- 30 Catholic websites (e.g., Mirifica, KAS)
- 27 Islamic websites (e.g., NU Online)
- 44 Protestant websites (e.g., PGI)
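The balanced-undersampling step mentioned above can be sketched as follows (a minimal illustration with toy data, assuming a pandas DataFrame with `text` and `label` columns; the actual training pipeline may differ):

```python
import pandas as pd

# Toy corpus: class 0 is over-represented, as web-scraped data typically is.
df = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f", "g"],
    "label": [0, 0, 0, 1, 1, 2, 2],  # 0=Islam, 1=Catholic, 2=Protestant
})

# Undersample every class down to the size of the smallest class.
n_min = df["label"].value_counts().min()
balanced = (
    df.groupby("label", group_keys=False)
      .sample(n=n_min, random_state=42)
)

# Each denomination now contributes exactly n_min sentences.
print(balanced["label"].value_counts().to_dict())
```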
## How to Use
### Direct Inference
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load model and tokenizer
MODEL_NAME = "dansachs/indo-religiolect-bert-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

# Predict
text = "Allah adalah Tuhan yang Maha Esa"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
probs = F.softmax(logits, dim=1).numpy()[0]

labels = ['Islam', 'Catholic', 'Protestant']
prediction = labels[probs.argmax()]
print(f"Prediction: {prediction}")
print(f"Confidence: {probs.max():.1%}")
```
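The final softmax-and-argmax step can be checked in isolation with dummy logits, without downloading the model (the logit values here are made up purely for illustration):

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for one sentence, to show how raw class scores
# become probabilities and a predicted label.
logits = torch.tensor([[2.0, 0.5, 0.1]])
probs = F.softmax(logits, dim=1)[0]  # probabilities sum to 1

labels = ["Islam", "Catholic", "Protestant"]
print(labels[probs.argmax()])  # -> Islam (index 0 has the largest logit)
```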
### Using the Interactive Scripts
Clone the repository and use the provided scripts:
```bash
# Interactive mode
python interactive/predict.py

# Batch processing
python interactive/predict_batch.py --file texts.txt --output results.csv
```
## Dataset
The model was trained on the Indonesian Religious Corpus dataset:
**Dataset:** dansachs/indonesian-religious-corpus
The dataset contains ~3 million clean sentences scraped from authoritative religious websites, with metadata including denomination, location, date, and source links.
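Working with the corpus metadata can be sketched as below (toy rows only; the exact column names are assumptions based on the metadata fields listed above, not the dataset's published schema):

```python
import pandas as pd

# Toy rows mimicking the corpus metadata fields named above:
# denomination, location, date, and source link per sentence.
corpus = pd.DataFrame({
    "sentence": ["kalimat satu", "kalimat dua", "kalimat tiga"],
    "denomination": ["Islam", "Catholic", "Protestant"],
    "source": ["https://example.org/a", "https://example.org/b", "https://example.org/c"],
})

# Filter to a single denomination's sentences.
catholic_only = corpus[corpus["denomination"] == "Catholic"]
print(len(catholic_only))  # -> 1
```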
## Repository
**GitHub Repository:** dansachs/indo-religiolects
The repository includes:
- Training scripts and notebooks
- Interactive inference tools
- Data collection pipeline
- Full documentation
## Limitations and Bias
- The model is trained on web-scraped content and may reflect biases present in online religious discourse
- Performance may vary for texts from sources not represented in the training data
- The model is designed for Indonesian text and may not perform well on other languages
- Religious classification is a sensitive task; use responsibly and consider the context
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{indo-religiolect-bert-v2,
  title={Indo-Religiolect-BERT V2: A Fine-tuned Model for Indonesian Religious Text Classification},
  author={Sachs, Dan},
  year={2025},
  howpublished={\url{https://huggingface.co/dansachs/indo-religiolect-bert-v2}}
}
```
## Acknowledgments
- Base model: IndoBERT by IndoLEM
- Built with Hugging Face Transformers
- Training data collected from 100+ authoritative religious websites
## License
MIT License. Intended for academic research purposes.