Indo-Religiolect-BERT V2

A fine-tuned BERT model for classifying Indonesian text into three distinct religious denominations: Islam, Catholicism, and Protestantism.

Model Description

This model uses IndoBERT (Indonesian BERT) as its base and is fine-tuned to identify the distinctive "religiolects" (religious dialects) used by different faith communities in Indonesia. It distinguishes the three groups with high accuracy, even separating Catholic from Protestant discourse despite their largely shared vocabulary.

  • Base Model: indolem/indobert-base-uncased
  • Task: Text Classification (3-class)
  • Language: Indonesian
  • Classes: Islam (0), Catholic (1), Protestant (2)
  • Size: ~0.1B parameters (float32, Safetensors)
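The class indices above map to labels as follows. This is a minimal standalone sketch; the same mapping may also be exposed via the model config's id2label field:

```python
# Class-index-to-label mapping, per the list above. The model's
# config may also expose this as `id2label`; this dict restates it
# for illustration.
ID2LABEL = {0: "Islam", 1: "Catholic", 2: "Protestant"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

print(ID2LABEL[2])        # Protestant
print(LABEL2ID["Islam"])  # 0
```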

Training Details

  • Training Strategy: Balanced undersampling to ensure equal representation across all three classes
  • Architecture: BERT-based sequence classification
  • Max Sequence Length: 128 tokens
  • Training Data: ~3 million sentences from 100+ authoritative religious websites
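The exact undersampling code is not reproduced here; the idea (downsampling every class to the size of the smallest class) can be sketched as follows, with a hypothetical list-of-dicts record structure:

```python
import random
from collections import defaultdict

def balanced_undersample(records, label_key="label", seed=42):
    """Downsample each class to the size of the smallest class."""
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)
    n_min = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced

# Toy example: 3 / 2 / 5 sentences per class -> 2 per class
data = (
    [{"text": f"i{i}", "label": 0} for i in range(3)]
    + [{"text": f"c{i}", "label": 1} for i in range(2)]
    + [{"text": f"p{i}", "label": 2} for i in range(5)]
)
print(len(balanced_undersample(data)))  # 6
```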

Training Data Sources

  • 30 Catholic websites (e.g., Mirifica, KAS)
  • 27 Islamic websites (e.g., NU Online)
  • 44 Protestant websites (e.g., PGI)

How to Use

Direct Inference

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load model and tokenizer
MODEL_NAME = "dansachs/indo-religiolect-bert-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()  # disable dropout for deterministic inference

# Predict
text = "Allah adalah Tuhan yang Maha Esa"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

probs = F.softmax(logits, dim=1).numpy()[0]
labels = ['Islam', 'Catholic', 'Protestant']
prediction = labels[probs.argmax()]

print(f"Prediction: {prediction}")
print(f"Confidence: {probs.max():.1%}")

Using the Interactive Scripts

Clone the repository and use the provided scripts:

# Interactive mode
python interactive/predict.py

# Batch processing
python interactive/predict_batch.py --file texts.txt --output results.csv

Dataset

The model was trained on the Indonesian Religious Corpus dataset:

🔗 Dataset: dansachs/indonesian-religious-corpus

The dataset contains ~3 million clean sentences scraped from authoritative religious websites, with metadata including denomination, location, date, and source links.

Repository

🔗 GitHub Repository: dansachs/indo-religiolects

The repository includes:

  • Training scripts and notebooks
  • Interactive inference tools
  • Data collection pipeline
  • Full documentation

Limitations and Bias

  • The model is trained on web-scraped content and may reflect biases present in online religious discourse
  • Performance may vary for texts from sources not represented in the training data
  • The model is designed for Indonesian text and may not perform well on other languages
  • Religious classification is a sensitive task; use responsibly and consider the context

Citation

If you use this model in your research, please cite:

@misc{indo-religiolect-bert-v2,
  title={Indo-Religiolect-BERT V2: A Fine-tuned Model for Indonesian Religious Text Classification},
  author={Sachs, Dan},
  year={2025},
  howpublished={\url{https://huggingface.co/dansachs/indo-religiolect-bert-v2}}
}

License

MIT License. Intended for academic research purposes.
