Indo-Religiolect-BERT V2

A fine-tuned BERT model for classifying Indonesian text into three distinct religious denominations: Islam, Catholicism, and Protestantism.

Model Description

This model uses IndoBERT (Indonesian BERT) as its base and is fine-tuned to identify the distinctive "religiolects" (religious dialects) used by different faith communities in Indonesia. It distinguishes the three groups with high accuracy, even separating Catholic from Protestant discourse despite their largely shared vocabulary.

  • Base Model: indolem/indobert-base-uncased
  • Task: Text Classification (3-class)
  • Language: Indonesian
  • Classes: Islam (0), Catholic (1), Protestant (2)
  • Size: ~0.1B parameters (float32, Safetensors)
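The class indices above map to labels as follows. This is a minimal standalone sketch; the same mapping may also be exposed via the model config's id2label field:

```python
# Class-index-to-label mapping, per the list above. The model's
# config may also expose this as `id2label`; this dict restates it
# for illustration.
ID2LABEL = {0: "Islam", 1: "Catholic", 2: "Protestant"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

print(ID2LABEL[2])        # Protestant
print(LABEL2ID["Islam"])  # 0
```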

Training Details

  • Training Strategy: Balanced undersampling to ensure equal representation across all three classes
  • Architecture: BERT-based sequence classification
  • Max Sequence Length: 128 tokens
  • Training Data: ~3 million sentences from 100+ authoritative religious websites
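The exact undersampling code is not reproduced here; the idea (downsampling every class to the size of the smallest class) can be sketched as follows, with a hypothetical list-of-dicts record structure:

```python
import random
from collections import defaultdict

def balanced_undersample(records, label_key="label", seed=42):
    """Downsample each class to the size of the smallest class."""
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)
    n_min = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced

# Toy example: 3 / 2 / 5 sentences per class -> 2 per class
data = (
    [{"text": f"i{i}", "label": 0} for i in range(3)]
    + [{"text": f"c{i}", "label": 1} for i in range(2)]
    + [{"text": f"p{i}", "label": 2} for i in range(5)]
)
print(len(balanced_undersample(data)))  # 6
```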

Training Data Sources

  • 30 Catholic websites (e.g., Mirifica, KAS)
  • 27 Islamic websites (e.g., NU Online)
  • 44 Protestant websites (e.g., PGI)

How to Use

Direct Inference

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load model and tokenizer
MODEL_NAME = "dansachs/indo-religiolect-bert-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()  # disable dropout for deterministic inference

# Predict
text = "Allah adalah Tuhan yang Maha Esa"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

probs = F.softmax(logits, dim=1).numpy()[0]
labels = ['Islam', 'Catholic', 'Protestant']
prediction = labels[probs.argmax()]

print(f"Prediction: {prediction}")
print(f"Confidence: {probs.max():.1%}")

Using the Interactive Scripts

Clone the repository and use the provided scripts:

# Interactive mode
python interactive/predict.py

# Batch processing
python interactive/predict_batch.py --file texts.txt --output results.csv

Dataset

The model was trained on the Indonesian Religious Corpus dataset:

🔗 Dataset: dansachs/indonesian-religious-corpus

The dataset contains ~3 million clean sentences scraped from authoritative religious websites, with metadata including denomination, location, date, and source links.

Repository

🔗 GitHub Repository: dansachs/indo-religiolects

The repository includes:

  • Training scripts and notebooks
  • Interactive inference tools
  • Data collection pipeline
  • Full documentation

Limitations and Bias

  • The model is trained on web-scraped content and may reflect biases present in online religious discourse
  • Performance may vary for texts from sources not represented in the training data
  • The model is designed for Indonesian text and may not perform well on other languages
  • Religious classification is a sensitive task; use responsibly and consider the context

Citation

If you use this model in your research, please cite:

@misc{indo-religiolect-bert-v2,
  title={Indo-Religiolect-BERT V2: A Fine-tuned Model for Indonesian Religious Text Classification},
  author={Sachs, Dan},
  year={2025},
  howpublished={\url{https://huggingface.co/dansachs/indo-religiolect-bert-v2}}
}

License

MIT License. Intended for academic research purposes.
