dansachs committed · commit 1bc979e (verified) · 1 parent: 820dba3

Upload README.md with huggingface_hub

Files changed (1): README.md added (+134 −0)
---
license: mit
tags:
- indonesian
- nlp
- classification
- religiolect
- bert
- indobert
- text-classification
- multilingual
datasets:
- dansachs/indonesian-religious-corpus
model-index:
- name: indo-religiolect-bert-v2
  results: []
---

# Indo-Religiolect-BERT V2

A fine-tuned BERT model for classifying Indonesian text into three religious denominations: **Islam**, **Catholicism**, and **Protestantism**.

## Model Description

This model uses **IndoBERT** (Indonesian BERT) as the base model and is fine-tuned to identify the distinct "religiolects" (religious dialects) used by different faith communities in Indonesia. It distinguishes between the three groups with high accuracy, even navigating the shared vocabulary of Catholic and Protestant discourse.

- **Base Model:** [indolem/indobert-base-uncased](https://huggingface.co/indolem/indobert-base-uncased)
- **Task:** Text Classification (3-class)
- **Language:** Indonesian
- **Classes:** Islam (0), Catholic (1), Protestant (2)

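The class indices above can be turned into readable predictions even when a checkpoint's config ships generic `LABEL_<i>` names. A minimal sketch (the `resolve_label` helper is hypothetical, not part of the model's API):

```python
# Index-to-name mapping documented above: 0 = Islam, 1 = Catholic, 2 = Protestant.
ID2LABEL = {0: "Islam", 1: "Catholic", 2: "Protestant"}

def resolve_label(raw_label: str) -> str:
    """Map a generic 'LABEL_<i>' string (or an already-readable name)
    to the denomination names used by this model card."""
    if raw_label.startswith("LABEL_"):
        return ID2LABEL[int(raw_label.split("_", 1)[1])]
    return raw_label

print(resolve_label("LABEL_1"))  # Catholic
```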
## Training Details

- **Training Strategy:** Balanced undersampling to ensure equal representation across all three classes
- **Architecture:** BERT-based sequence classification
- **Max Sequence Length:** 128 tokens
- **Training Data:** ~3 million sentences from 100+ authoritative religious websites

### Training Data Sources

- **30 Catholic websites** (e.g., Mirifica, KAS)
- **27 Islamic websites** (e.g., NU Online)
- **44 Protestant websites** (e.g., PGI)

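The balanced-undersampling strategy can be sketched as follows; the actual training pipeline is not published in this card, so the helper below is illustrative only:

```python
import random
from collections import defaultdict

def balanced_undersample(rows, seed=0):
    """Downsample every class to the size of the smallest class.

    `rows` is a list of (text, label) pairs. This mirrors the balancing
    strategy described above, not the exact training code.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced

rows = [("a", 0)] * 5 + [("b", 1)] * 3 + [("c", 2)] * 9
print(len(balanced_undersample(rows)))  # 9 (3 per class)
```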
## How to Use

### Direct Inference

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load model and tokenizer
MODEL_NAME = "dansachs/indo-religiolect-bert-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

# Predict
text = "Allah adalah Tuhan yang Maha Esa"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

probs = F.softmax(logits, dim=1).numpy()[0]
labels = ['Islam', 'Catholic', 'Protestant']
prediction = labels[probs.argmax()]

print(f"Prediction: {prediction}")
print(f"Confidence: {probs.max():.1%}")
```

### Using the Interactive Scripts

Clone the repository and use the provided scripts:

```bash
# Interactive mode
python interactive/predict.py

# Batch processing
python interactive/predict_batch.py --file texts.txt --output results.csv
```

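For reference, a batch script with the interface shown above might parse its flags like this. This is only a sketch: the real `interactive/predict_batch.py` lives in the GitHub repository, and the help text and parser details below are assumptions.

```python
import argparse

def build_parser():
    # Flags mirror the usage shown above; description and help text are assumptions.
    parser = argparse.ArgumentParser(description="Classify one Indonesian text per line.")
    parser.add_argument("--file", required=True, help="input file, one text per line")
    parser.add_argument("--output", required=True, help="destination CSV of predictions")
    return parser

args = build_parser().parse_args(["--file", "texts.txt", "--output", "results.csv"])
print(args.file, args.output)  # texts.txt results.csv
```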
## Dataset

The model was trained on the **Indonesian Religious Corpus** dataset:

🔗 **Dataset:** [dansachs/indonesian-religious-corpus](https://huggingface.co/datasets/dansachs/indonesian-religious-corpus)

The dataset contains ~3 million clean sentences scraped from authoritative religious websites, with metadata including denomination, location, date, and source links.

## Repository

🔗 **GitHub Repository:** [dansachs/indo-religiolects](https://github.com/dansachs/indo-religiolects)

The repository includes:
- Training scripts and notebooks
- Interactive inference tools
- Data collection pipeline
- Full documentation

## Limitations and Bias

- The model is trained on web-scraped content and may reflect biases present in online religious discourse
- Performance may vary for texts from sources not represented in the training data
- The model is designed for Indonesian text and may not perform well on other languages
- Religious classification is a sensitive task; use responsibly and consider the context

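Given the sensitivity noted above, one simple safeguard is to abstain when the model's top probability is low. A minimal sketch (the 0.6 threshold is an illustrative choice, not a tuned value):

```python
def classify_with_abstain(probs, labels, threshold=0.6):
    """Return the top label only when its probability clears the threshold;
    otherwise abstain. `probs` is a softmax vector such as the one computed
    in the inference example above."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else "uncertain"

print(classify_with_abstain([0.12, 0.81, 0.07], ["Islam", "Catholic", "Protestant"]))  # Catholic
```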
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{indo-religiolect-bert-v2,
  title={Indo-Religiolect-BERT V2: A Fine-tuned Model for Indonesian Religious Text Classification},
  author={Sachs, Dan},
  year={2025},
  howpublished={\url{https://huggingface.co/dansachs/indo-religiolect-bert-v2}}
}
```

## Acknowledgments

- Base model: [IndoBERT by IndoLEM](https://huggingface.co/indolem/indobert-base-uncased)
- Built with [Hugging Face Transformers](https://huggingface.co/transformers/)
- Training data collected from 100+ authoritative religious websites

## License

MIT License. Released for academic research purposes.