---
license: mit
tags:
- indonesian
- nlp
- classification
- religiolect
- bert
- indobert
- text-classification
- multilingual
datasets:
- dansachs/indonesian-religious-corpus
model-index:
- name: indo-religiolect-bert-v2
  results: []
---

# Indo-Religiolect-BERT V2

A fine-tuned BERT model for classifying Indonesian text into three religious denominations: **Islam**, **Catholicism**, and **Protestantism**.

## Model Description

This model uses **IndoBERT** (Indonesian BERT) as its base and is fine-tuned to identify the distinct "religiolects" (religious dialects) used by different faith communities in Indonesia. It separates the three groups with high accuracy, even navigating the vocabulary shared between Catholic and Protestant discourse.

- **Base Model:** [indolem/indobert-base-uncased](https://huggingface.co/indolem/indobert-base-uncased)
- **Task:** Text Classification (3-class)
- **Language:** Indonesian
- **Classes:** Islam (0), Catholic (1), Protestant (2)

## Training Details

- **Training Strategy:** Balanced undersampling to ensure equal representation across all three classes
- **Architecture:** BERT-based sequence classification
- **Max Sequence Length:** 128 tokens
- **Training Data:** ~3 million sentences from 100+ authoritative religious websites

### Training Data Sources

- **30 Catholic websites** (e.g., Mirifica, KAS)
- **27 Islamic websites** (e.g., NU Online)
- **44 Protestant websites** (e.g., PGI)

## How to Use

### Direct Inference

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load model and tokenizer
MODEL_NAME = "dansachs/indo-religiolect-bert-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

# Tokenize the input and run a forward pass without gradient tracking
text = "Allah adalah Tuhan yang Maha Esa"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Convert logits to probabilities and map to the class labels
probs = F.softmax(logits, dim=-1).numpy()[0]
labels = ['Islam', 'Catholic', 'Protestant']
prediction = labels[probs.argmax()]

print(f"Prediction: {prediction}")
print(f"Confidence: {probs.max():.1%}")
```
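For quick experiments, the same checkpoint can also be loaded through the `transformers` `pipeline` API, which handles tokenization and softmax internally. This is a minimal sketch; note that the label strings it returns come from the `id2label` mapping stored in the model config, so they may appear as generic `LABEL_0`/`LABEL_1`/`LABEL_2` if a custom mapping was not saved with the model:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint as a text-classification pipeline
classifier = pipeline(
    "text-classification",
    model="dansachs/indo-religiolect-bert-v2",
)

# top_k=None returns scores for all classes instead of only the best one
results = classifier("Allah adalah Tuhan yang Maha Esa", top_k=None)
for r in results:
    print(f"{r['label']}: {r['score']:.1%}")
```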

### Using the Interactive Scripts

Clone the repository and use the provided scripts:

```bash
# Interactive mode
python interactive/predict.py

# Batch processing
python interactive/predict_batch.py --file texts.txt --output results.csv
```
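If you prefer not to clone the repository, a batch loop can also be written directly against the model. This is a sketch of the idea rather than the repository's `predict_batch.py` (the `classify_batch` helper below is hypothetical):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_NAME = "dansachs/indo-religiolect-bert-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

labels = ["Islam", "Catholic", "Protestant"]

def classify_batch(texts, batch_size=32):
    """Classify a list of Indonesian sentences, returning (label, confidence) pairs."""
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", truncation=True,
                           max_length=128, padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)
        conf, idx = probs.max(dim=-1)
        results += [(labels[j], c.item()) for j, c in zip(idx.tolist(), conf)]
    return results
```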

## Dataset

The model was trained on the **Indonesian Religious Corpus** dataset:

🔗 **Dataset:** [dansachs/indonesian-religious-corpus](https://huggingface.co/datasets/dansachs/indonesian-religious-corpus)

The dataset contains ~3 million clean sentences scraped from authoritative religious websites, with metadata including denomination, location, date, and source links.

## Repository

🔗 **GitHub Repository:** [dansachs/indo-religiolects](https://github.com/dansachs/indo-religiolects)

The repository includes:

- Training scripts and notebooks
- Interactive inference tools
- Data collection pipeline
- Full documentation

## Limitations and Bias

- The model is trained on web-scraped content and may reflect biases present in online religious discourse
- Performance may vary for texts from sources not represented in the training data
- The model is designed for Indonesian text and may not perform well on other languages
- Religious classification is a sensitive task; use responsibly and consider the context

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{indo-religiolect-bert-v2,
  title={Indo-Religiolect-BERT V2: A Fine-tuned Model for Indonesian Religious Text Classification},
  author={Sachs, Dan},
  year={2025},
  howpublished={\url{https://huggingface.co/dansachs/indo-religiolect-bert-v2}}
}
```

## Acknowledgments

- Base model: [IndoBERT by IndoLEM](https://huggingface.co/indolem/indobert-base-uncased)
- Built with [Hugging Face Transformers](https://huggingface.co/transformers/)
- Training data collected from 100+ authoritative religious websites

## License

MIT License. Released for academic research purposes.