Update README.md
README.md
CHANGED
|
@@ -30,152 +30,280 @@ model-index:
|
|
| 30 |
value: 0.9389
|
| 31 |
---
|
| 32 |
|
| 33 |
-
# 📊
|
| 34 |
|
| 35 |
-
Model
|
| 36 |
-
|
|
|
|
| 37 |
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
- **UMUM** → General information / questions
|
| 41 |
-
- **LAINNYA** → Other complaints that do not fit the categories above
|
| 42 |
|
| 43 |
---
|
| 44 |
|
| 45 |
-
##
|
| 46 |
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
|
| 52 |
---
|
| 53 |
|
| 54 |
-
##
|
| 55 |
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
- **Augmentation** → 3,600 samples (balanced at 900 per class)
|
| 62 |
-
- **Split** → 80% Train (2880) | 20% Validation (720)
|
| 63 |
-
- **Base model** → `indobenchmark/indobert-base-p1`
|
| 64 |
-
- **Device training** → NVIDIA RTX 3050 Laptop GPU (CUDA)
|
| 65 |
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
## 📈 Evaluation Results
|
| 69 |
-
|
| 70 |
-
- **Best Epoch** → 3
|
| 71 |
-
- **Validation Accuracy** → **93.89%**
|
| 72 |
-
- **Macro F1-score** → **0.9389**
|
| 73 |
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
| Darurat
|
| 78 |
-
| Prioritas
|
| 79 |
-
| Umum
|
| 80 |
-
| Lainnya
|
| 81 |
-
| **Macro Avg** | 0.9401 | 0.9389 | 0.9389 |
|
| 82 |
|
| 83 |
-
###
|
| 84 |
```
|
| 85 |
|
| 86 |
-
|
| 87 |
-
[ 6 162 11 1] # Prioritas
|
| 88 |
-
[ 1 2 176 1] # Umum
|
| 89 |
-
[ 3 1 5 171]] # Lainnya
|
| 90 |
|
| 91 |
-
|
| 92 |
|
| 93 |
---
|
| 94 |
|
| 95 |
-
##
|
| 96 |
|
| 97 |
-
###
|
| 98 |
```python
|
| 99 |
from transformers import AutoTokenizer, AutoModelForSequenceClassification
|
| 100 |
import torch
|
| 101 |
|
|
|
|
| 102 |
model_name = "Zulkifli1409/aduan-model"
|
| 103 |
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
| 104 |
model = AutoModelForSequenceClassification.from_pretrained(model_name)
|
| 105 |
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
probs = torch.nn.functional.softmax(outputs.logits, dim=1)
|
| 110 |
|
| 111 |
-
|
| 112 |
-
|
|
|
|
|
|
|
|
|
|
| 113 |
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
````
|
| 117 |
|
| 118 |
-
|
| 119 |
|
|
|
|
| 120 |
```
|
| 121 |
Prediksi: DARURAT
|
| 122 |
-
|
| 123 |
```
|
| 124 |
|
| 125 |
---
|
| 126 |
|
| 127 |
-
##
|
| 128 |
|
| 129 |
-
|
|
| 130 |
-
|
| 131 |
-
|
|
| 132 |
-
|
|
| 133 |
-
|
|
| 134 |
-
|
|
| 135 |
-
|
|
| 136 |
-
|
|
|
|
|
|
|
|
| 137 |
|
| 138 |
---
|
| 139 |
|
| 140 |
-
##
|
| 141 |
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
```
|
| 147 |
|
| 148 |
-
|
|
|
|
|
|
|
| 149 |
|
| 150 |
```bash
|
| 151 |
curl -X POST https://api-klasifikasi-aduan.up.railway.app/predict \
|
| 152 |
-H "Content-Type: application/json" \
|
| 153 |
-d '{"text": "Ada kebakaran di pasar"}'
|
| 154 |
```
|
| 155 |
|
| 156 |
-
Response
|
| 157 |
-
|
| 158 |
```json
|
| 159 |
{
|
| 160 |
"label": "DARURAT",
|
| 161 |
-
"confidence": 0.
|
| 162 |
"all_scores": {
|
| 163 |
-
"
|
| 164 |
-
"
|
| 165 |
-
"
|
| 166 |
-
"
|
|
|
|
| 167 |
}
|
| 168 |
}
|
| 169 |
```
|
| 170 |
|
| 171 |
---
|
| 172 |
|
| 173 |
-
##
|
| 174 |
|
| 175 |
-
|
| 176 |
-
|
| 177 |
|
| 178 |
---
|
| 179 |
|
| 180 |
-
|
| 181 |
|
|
|
|
|
|
| 30 |
value: 0.9389
|
| 31 |
---
|
| 32 |
|
| 33 |
+
# 📊 Indonesian Complaint Classification Model (IndoBERT)
|
| 34 |
|
| 35 |
+
[Hugging Face model page](https://huggingface.co/Zulkifli1409/aduan-model)
|
| 36 |
+
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
|
| 37 |
+
[Language: Indonesian](https://en.wikipedia.org/wiki/Indonesian_language)
|
| 38 |
|
| 39 |
+
A text classification model for public complaints written in Indonesian, built on **IndoBERT (`indobenchmark/indobert-base-p1`)**.
|
| 40 |
+
The model sorts complaints into **5 categories** with a validation accuracy of **96.10%**.
|
| 41 |
|
| 42 |
---
|
| 43 |
|
| 44 |
+
## 📑 Classification Categories
|
| 45 |
|
| 46 |
+
| Label | Description | Example inputs (Indonesian) |
|
| 47 |
+
|-------|-----------|--------|
|
| 48 |
+
| **PINALTI** | Content containing profanity, ethnic, religious, or racial (SARA) attacks, pornography, hate speech, or other norm violations | "Kampret pejabat koruptor!", "Konten porno beredar", "Rasis banget pemerintah" |
|
| 49 |
+
| **DARURAT** | Emergencies that need an immediate response (fire, accident, disaster, threat to life) | "Ada kebakaran besar di pasar!", "Kecelakaan beruntun di tol", "Banjir bandang melanda desa" |
|
| 50 |
+
| **PRIORITAS** | Problems that need prompt handling (damaged infrastructure, sanitation, public services) | "Jalan berlubang berbahaya", "Sampah menumpuk seminggu", "Lampu jalan mati semua" |
|
| 51 |
+
| **UMUM** | Requests for information, suggestions, or non-urgent complaints | "Bagaimana cara mengurus KTP?", "Kapan jadwal posyandu?", "Saran untuk program desa" |
|
| 52 |
+
| **LAINNYA** | Complaints that do not fit any of the categories above | "Terima kasih atas pelayanannya", "Hanya ingin menyampaikan apresiasi" |
|
| 53 |
|
| 54 |
---
|
| 55 |
|
| 56 |
+
## 🎯 Model Performance
|
| 57 |
|
| 58 |
+
### **Overall Metrics**
|
| 59 |
+
- **Validation Accuracy**: **96.10%**
|
| 60 |
+
- **Macro F1-Score**: **0.9608**
|
| 61 |
+
- **Weighted F1-Score**: **0.9610**
|
| 62 |
+
- **Average Confidence**: **93.90%**
|
| 63 |
|
| 64 |
+
### **Per-Class Performance**
|
| 65 |
|
| 66 |
+
| Label | Precision | Recall | F1-Score | Support |
|
| 67 |
+
|-------|-----------|--------|----------|---------|
|
| 68 |
+
| Pinalti | 0.9588 | 0.9645 | 0.9617 | 169 |
|
| 69 |
+
| Darurat | 0.9453 | 0.9603 | 0.9528 | 126 |
|
| 70 |
+
| Prioritas | 0.9675 | 0.9675 | 0.9675 | 123 |
|
| 71 |
+
| Umum | 0.9752 | 0.9593 | 0.9672 | 123 |
|
| 72 |
+
| Lainnya | 0.9596 | 0.9500 | 0.9548 | 100 |
|
|
|
|
| 73 |
|
| 74 |
+
### **Confusion Matrix**
|
| 75 |
```
|
| 76 |
+
Predicted
|
| 77 |
+
Pin Dar Pri Umu Lai
|
| 78 |
+
Actual Pin 163 2 1 0 3
|
| 79 |
+
Dar 2 121 2 0 1
|
| 80 |
+
Pri 0 3 119 1 0
|
| 81 |
+
Umu 2 2 1 118 0
|
| 82 |
+
Lai 3 0 0 2 95
|
| 83 |
+
```
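The per-class table can be re-derived from the confusion matrix. The sketch below is a plain-Python check that assumes rows are actual classes and columns are predicted classes, as the matrix header suggests. The function names are illustrative and not part of the model repository.

```python
# Confusion matrix copied from above (rows = actual, columns = predicted).
LABELS = ["Pinalti", "Darurat", "Prioritas", "Umum", "Lainnya"]
CM = [
    [163, 2, 1, 0, 3],
    [2, 121, 2, 0, 1],
    [0, 3, 119, 1, 0],
    [2, 2, 1, 118, 0],
    [3, 0, 0, 2, 95],
]


def per_class_metrics(cm):
    """Return {class index: (precision, recall, f1, support)} for a square matrix."""
    n = len(cm)
    metrics = {}
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum, i.e. support
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[k] = (precision, recall, f1, actual_k)
    return metrics


def accuracy(cm):
    """Fraction of samples on the diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total


METRICS = per_class_metrics(CM)
for k, (p, r, f1, support) in METRICS.items():
    print(f"{LABELS[k]:<10} P={p:.4f} R={r:.4f} F1={f1:.4f} support={support}")
print(f"accuracy={accuracy(CM):.4f}")
```

Precision divides the diagonal entry by its column sum and recall divides it by its row sum, so the row sums also reproduce the Support column.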
|
| 84 |
+
|
| 85 |
+
---
|
| 86 |
|
| 87 |
+
## 📊 Dataset Information
|
| 88 |
|
| 89 |
+
- **Total Samples**: 3,204 (before the train/validation split)
  - Pinalti: 844
  - Darurat: 630
  - Prioritas: 612
  - Umum: 616
  - Lainnya: 502
|
| 95 |
+
- **Train/Val Split**: 80% / 20% (2,563 / 641)
|
| 96 |
+
- **Augmentation**: Applied to balance classes
|
| 97 |
+
- **Language**: Indonesian (Bahasa Indonesia)
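The dataset figures above are consistent with each other, as the short arithmetic check below shows. It assumes the validation size is the total multiplied by 0.20 and rounded to the nearest integer. The README does not state the rounding rule the author actually used.

```python
# Class counts as listed above.
CLASS_COUNTS = {
    "Pinalti": 844,
    "Darurat": 630,
    "Prioritas": 612,
    "Umum": 616,
    "Lainnya": 502,
}
VAL_FRACTION = 0.20  # 80% / 20% split

total = sum(CLASS_COUNTS.values())
val_size = round(total * VAL_FRACTION)  # assumed rounding rule
train_size = total - val_size

print(f"total={total} train={train_size} val={val_size}")
for label, count in CLASS_COUNTS.items():
    # Share of each class in the full dataset.
    print(f"{label:<10} {count:>4} ({count / total:.1%})")
```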
|
| 98 |
|
| 99 |
---
|
| 100 |
|
| 101 |
+
## 🚀 Quick Start
|
| 102 |
|
| 103 |
+
### Installation
|
| 104 |
+
```bash
|
| 105 |
+
pip install transformers torch
|
| 106 |
+
```
|
| 107 |
+
|
| 108 |
+
### Basic Usage
|
| 109 |
```python
|
| 110 |
from transformers import AutoTokenizer, AutoModelForSequenceClassification
|
| 111 |
import torch
|
| 112 |
|
| 113 |
+
# Load model and tokenizer
|
| 114 |
model_name = "Zulkifli1409/aduan-model"
|
| 115 |
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
| 116 |
model = AutoModelForSequenceClassification.from_pretrained(model_name)
|
| 117 |
|
| 118 |
+
# Prepare input
|
| 119 |
+
text = "Ada kebakaran besar di pasar, tolong kirim pemadam segera!"
|
| 120 |
+
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
|
|
|
|
| 121 |
|
| 122 |
+
# Predict
|
| 123 |
+
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=1)
    pred_idx = torch.argmax(probs).item()
|
| 127 |
|
| 128 |
+
# Labels
|
| 129 |
+
labels = ["PINALTI", "DARURAT", "PRIORITAS", "UMUM", "LAINNYA"]
|
|
|
|
| 130 |
|
| 131 |
+
print(f"Prediksi: {labels[pred_idx]}")
|
| 132 |
+
print(f"Confidence: {probs[0][pred_idx].item():.2%}")
|
| 133 |
+
print(f"\nAll probabilities:")
|
| 134 |
+
for label, prob in zip(labels, probs[0]):
    print(f"  {label}: {prob.item():.2%}")
|
| 136 |
+
```
|
| 137 |
|
| 138 |
+
**Output:**
|
| 139 |
```
|
| 140 |
Prediksi: DARURAT
|
| 141 |
+
Confidence: 96.03%
|
| 142 |
+
|
| 143 |
+
All probabilities:
|
| 144 |
+
PINALTI: 0.21%
|
| 145 |
+
DARURAT: 96.03%
|
| 146 |
+
PRIORITAS: 2.89%
|
| 147 |
+
UMUM: 0.45%
|
| 148 |
+
LAINNYA: 0.42%
|
| 149 |
```
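The percentages printed above come from applying softmax to the model's logits. The sketch below shows that step in plain Python, without `torch`. The logit values are hypothetical and chosen only for illustration. They are not output from the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


LABELS = ["PINALTI", "DARURAT", "PRIORITAS", "UMUM", "LAINNYA"]
example_logits = [-2.0, 4.1, 0.6, -1.3, -1.4]  # hypothetical, not real model output

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability

print(f"Prediction: {LABELS[best]} ({probs[best]:.2%})")
for label, prob in zip(LABELS, probs):
    print(f"  {label}: {prob:.2%}")
```

The order of `LABELS` is copied from the usage example. If the uploaded model config defines `id2label`, reading the mapping from `model.config.id2label` would avoid hard-coding it.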
|
| 150 |
|
| 151 |
---
|
| 152 |
|
| 153 |
+
## 🧪 Example Predictions
|
| 154 |
|
| 155 |
+
| Input Text | Prediction | Confidence |
|
| 156 |
+
|------------|------------|------------|
|
| 157 |
+
| "Brengsek! Pejabat korup semua!" | **PINALTI** | 94.23% |
|
| 158 |
+
| "Ada orang kecelakaan parah butuh ambulans" | **DARURAT** | 95.67% |
|
| 159 |
+
| "Jalan berlubang perlu diperbaiki segera" | **PRIORITAS** | 92.34% |
|
| 160 |
+
| "Bagaimana cara mengurus surat izin usaha?" | **UMUM** | 89.45% |
|
| 161 |
+
| "Terima kasih atas bantuannya" | **LAINNYA** | 88.91% |
|
| 162 |
+
| "Konten porno tersebar di grup WhatsApp" | **PINALTI** | 91.78% |
|
| 163 |
+
| "Banjir tinggi merendam rumah warga" | **DARURAT** | 93.12% |
|
| 164 |
+
| "Sampah menumpuk di jalan sejak seminggu lalu" | **PRIORITAS** | 90.56% |
|
| 165 |
|
| 166 |
---
|
| 167 |
|
| 168 |
+
## 🔧 Batch Prediction
|
| 169 |
|
| 170 |
+
```python
|
| 171 |
+
texts = [
|
| 172 |
+
"Ada kebakaran di gedung!",
|
| 173 |
+
"Jalan rusak parah",
|
| 174 |
+
"Dasar bodoh pemerintah!",
|
| 175 |
+
"Kapan jadwal vaksinasi?"
|
| 176 |
+
]
|
| 177 |
+
|
| 178 |
+
# Tokenize batch
|
| 179 |
+
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=128)
|
| 180 |
+
|
| 181 |
+
# Predict
|
| 182 |
+
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=1)
    predictions = torch.argmax(probs, dim=1)
|
| 186 |
+
|
| 187 |
+
labels = ["PINALTI", "DARURAT", "PRIORITAS", "UMUM", "LAINNYA"]
|
| 188 |
+
|
| 189 |
+
for text, pred_idx, prob in zip(texts, predictions, probs):
    pred_label = labels[pred_idx.item()]  # convert the 0-dim tensor to a plain int
    confidence = prob[pred_idx].item()
    print(f"Text: {text}")
    print(f"Prediction: {pred_label} ({confidence:.2%})\n")
|
| 194 |
```
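For larger inputs, tokenizing every text in one call can exhaust memory. A minimal sketch of splitting the list into fixed-size chunks follows. The helper name `chunked` and the chunk size are assumptions and do not come from the model repository.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


texts = [
    "Ada kebakaran di gedung!",
    "Jalan rusak parah",
    "Dasar bodoh pemerintah!",
    "Kapan jadwal vaksinasi?",
]

for batch in chunked(texts, size=3):
    # Each batch would be passed to tokenizer(...) and model(...) as in the example above.
    print(len(batch), batch)
```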
|
| 195 |
|
| 196 |
+
---
|
| 197 |
+
|
| 198 |
+
## 🌐 API Deployment
|
| 199 |
|
| 200 |
+
The model is also available as a REST API hosted on Railway:
|
| 201 |
+
|
| 202 |
+
**Base URL**: `https://api-klasifikasi-aduan.up.railway.app`
|
| 203 |
+
|
| 204 |
+
### cURL Example
|
| 205 |
```bash
|
| 206 |
curl -X POST https://api-klasifikasi-aduan.up.railway.app/predict \
|
| 207 |
-H "Content-Type: application/json" \
|
| 208 |
-d '{"text": "Ada kebakaran di pasar"}'
|
| 209 |
```
|
| 210 |
|
| 211 |
+
### Response
|
|
|
|
| 212 |
```json
|
| 213 |
{
|
| 214 |
"label": "DARURAT",
|
| 215 |
+
"confidence": 0.9603,
|
| 216 |
"all_scores": {
|
| 217 |
+
"PINALTI": 0.0021,
|
| 218 |
+
"DARURAT": 0.9603,
|
| 219 |
+
"PRIORITAS": 0.0289,
|
| 220 |
+
"UMUM": 0.0045,
|
| 221 |
+
"LAINNYA": 0.0042
|
| 222 |
}
|
| 223 |
}
|
| 224 |
```
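The same request can be issued from Python with the standard library alone. This is a sketch that assumes the endpoint accepts and returns exactly the JSON shapes shown above. Whether the service is currently reachable is not verified here, and the function names are illustrative.

```python
import json
import urllib.request

API_URL = "https://api-klasifikasi-aduan.up.railway.app/predict"


def build_request(text: str) -> urllib.request.Request:
    """Build the POST request that mirrors the cURL example."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(body: str):
    """Return (label, confidence, top-scoring label) from a response body."""
    data = json.loads(body)
    scores = data["all_scores"]
    top_label = max(scores, key=scores.get)
    return data["label"], float(data["confidence"]), top_label


def classify(text: str, timeout: float = 10.0):
    """Send `text` to the API and parse the reply (performs a network call)."""
    with urllib.request.urlopen(build_request(text), timeout=timeout) as resp:
        return parse_prediction(resp.read().decode("utf-8"))


# Response body copied from the example above, used to exercise the parser offline.
SAMPLE_BODY = (
    '{"label": "DARURAT", "confidence": 0.9603, "all_scores": '
    '{"PINALTI": 0.0021, "DARURAT": 0.9603, "PRIORITAS": 0.0289, '
    '"UMUM": 0.0045, "LAINNYA": 0.0042}}'
)
print(parse_prediction(SAMPLE_BODY))
```

Calling `classify("Ada kebakaran di pasar")` would perform the actual request. Callers should handle `urllib.error.URLError` in case the service is unavailable.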
|
| 225 |
|
| 226 |
---
|
| 227 |
|
| 228 |
+
## 🛠️ Training Details
|
| 229 |
+
|
| 230 |
+
### Model Architecture
|
| 231 |
+
- **Base Model**: `indobenchmark/indobert-base-p1`
|
| 232 |
+
- **Task**: Sequence Classification (5 classes)
|
| 233 |
+
- **Max Sequence Length**: 128 tokens
|
| 234 |
+
- **Hidden Size**: 768
|
| 235 |
+
- **Attention Heads**: 12
|
| 236 |
+
- **Layers**: 12
|
| 237 |
+
|
| 238 |
+
### Training Configuration
|
| 239 |
+
- **GPU**: Tesla T4 (14.74 GB VRAM)
|
| 240 |
+
- **Precision**: FP16 (Mixed Precision)
|
| 241 |
+
- **Gradient Checkpointing**: Enabled
|
| 242 |
+
- **Batch Size**: 2
|
| 243 |
+
- **Learning Rate**: 1.5e-5
|
| 244 |
+
- **Epochs**: 5
|
| 245 |
+
- **Optimizer**: AdamW
|
| 246 |
+
- **Best Epoch**: 5
|
| 247 |
+
|
| 248 |
+
### Training Progress
|
| 249 |
+
| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | Val F1 |
|
| 250 |
+
|-------|------------|-----------|----------|---------|--------|
|
| 251 |
+
| 1 | 0.3688 | 74.87% | 0.0825 | 93.45% | 0.9346 |
|
| 252 |
+
| 2 | 0.0586 | 95.86% | 0.0604 | 96.10% | 0.9609 |
|
| 253 |
+
| 3 | 0.0179 | 98.52% | 0.0635 | 96.41% | 0.9641 |
|
| 254 |
+
| 4 | 0.0069 | 99.38% | 0.0668 | 96.10% | 0.9611 |
|
| 255 |
+
| 5 | 0.0021 | 99.88% | 0.0623 | **96.10%** | **0.9610** |
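The rule used to pick the reported best epoch is not stated in this README. The sketch below only shows how the rows of the table could be ranked under a chosen metric. The `best_epoch` helper is illustrative and is not the author's training code.

```python
# Rows copied from the training progress table (accuracies as fractions).
HISTORY = [
    {"epoch": 1, "train_loss": 0.3688, "val_loss": 0.0825, "val_acc": 0.9345, "val_f1": 0.9346},
    {"epoch": 2, "train_loss": 0.0586, "val_loss": 0.0604, "val_acc": 0.9610, "val_f1": 0.9609},
    {"epoch": 3, "train_loss": 0.0179, "val_loss": 0.0635, "val_acc": 0.9641, "val_f1": 0.9641},
    {"epoch": 4, "train_loss": 0.0069, "val_loss": 0.0668, "val_acc": 0.9610, "val_f1": 0.9611},
    {"epoch": 5, "train_loss": 0.0021, "val_loss": 0.0623, "val_acc": 0.9610, "val_f1": 0.9610},
]


def best_epoch(history, metric, higher_is_better=True):
    """Return the epoch number whose `metric` is best under the given direction."""
    pick = max if higher_is_better else min
    return pick(history, key=lambda row: row[metric])["epoch"]


print("by val_acc :", best_epoch(HISTORY, "val_acc"))
print("by val_f1  :", best_epoch(HISTORY, "val_f1"))
print("by val_loss:", best_epoch(HISTORY, "val_loss", higher_is_better=False))
```

Different criteria can select different checkpoints, so it helps to record which metric a "best epoch" refers to.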
|
| 256 |
|
| 257 |
+
---
|
| 258 |
+
|
| 259 |
+
## ⚠️ Important Notes
|
| 260 |
+
|
| 261 |
+
### Content Moderation (PINALTI)
|
| 262 |
+
The model can detect inappropriate content, but it is **not perfect**. For sensitive production applications, consider:
|
| 263 |
+
- An additional moderation layer
|
| 264 |
+
- Human review for borderline cases
|
| 265 |
+
- A whitelist/blacklist of explicit keywords
|
| 266 |
+
- Combining the model with rule-based filtering
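One way to combine these safeguards is a small routing step after the model call. The sketch below is only an illustration: the keyword list, the confidence threshold, and the `HUMAN_REVIEW` outcome are hypothetical values, not part of the model or its API.

```python
import re

# Hypothetical values for illustration; tune them for a real deployment.
BLOCKLIST = {"porno"}
REVIEW_THRESHOLD = 0.80


def route(text: str, label: str, confidence: float) -> str:
    """Apply a keyword rule first, then fall back to the model's label.

    `label` and `confidence` are the model's prediction for `text`.
    """
    words = set(re.findall(r"\w+", text.lower()))
    if words & BLOCKLIST:
        return "PINALTI"       # an explicit keyword overrides the model
    if confidence < REVIEW_THRESHOLD:
        return "HUMAN_REVIEW"  # borderline case, send to a moderator
    return label


print(route("Konten porno beredar", "UMUM", 0.95))
print(route("Jalan rusak", "PRIORITAS", 0.55))
print(route("Jalan rusak", "PRIORITAS", 0.93))
```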
|
| 267 |
+
|
| 268 |
+
### Limitations
|
| 269 |
+
- The model was trained on public complaint data from Indonesia
|
| 270 |
+
- Performance is best on texts of 10-100 words
|
| 271 |
+
- Slang or certain regional dialects may be classified less accurately
|
| 272 |
+
- Ambiguous context can lead to less accurate predictions
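The length guidance above can be checked before a text is sent to the model. The sketch below counts whitespace-separated words and returns a warning outside the 10-100 word range. The helper name and the warning wording are assumptions for illustration.

```python
from typing import Optional


def length_warning(text: str, min_words: int = 10, max_words: int = 100) -> Optional[str]:
    """Return a warning string if `text` is outside the recommended length, else None."""
    n_words = len(text.split())
    if n_words < min_words:
        return f"short input ({n_words} words): prediction may be less reliable"
    if n_words > max_words:
        return f"long input ({n_words} words): text may be truncated at 128 tokens"
    return None


print(length_warning("Jalan rusak parah"))
print(length_warning(" ".join(["kata"] * 20)))
```

A warning does not block classification. It only flags inputs for which the reported accuracy may not hold.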
|
| 273 |
+
|
| 274 |
+
---
|
| 275 |
+
|
| 276 |
+
## 📄 License
|
| 277 |
+
|
| 278 |
+
This model is licensed under **Apache 2.0 License**.
|
| 279 |
+
|
| 280 |
+
---
|
| 281 |
+
|
| 282 |
+
## 📧 Citation & Contact
|
| 283 |
+
|
| 284 |
+
**Developer**: Zulkifli1409
|
| 285 |
+
**Hugging Face**: [@Zulkifli1409](https://huggingface.co/Zulkifli1409)
|
| 286 |
+
|
| 287 |
+
If you use this model in research or an application, please give appropriate credit.
|
| 288 |
+
|
| 289 |
+
### BibTeX
|
| 290 |
+
```bibtex
|
| 291 |
+
@misc{zulkifli2025aduan,
|
| 292 |
+
author = {Zulkifli},
|
| 293 |
+
title = {Indonesian Complaint Classification Model with IndoBERT},
|
| 294 |
+
year = {2025},
|
| 295 |
+
publisher = {Hugging Face},
|
| 296 |
+
howpublished = {\url{https://huggingface.co/Zulkifli1409/aduan-model}}
|
| 297 |
+
}
|
| 298 |
+
```
|
| 299 |
|
| 300 |
---
|
| 301 |
|
| 302 |
+
## 🤝 Contributing
|
| 303 |
+
|
| 304 |
+
Feedback, bug reports, and contributions are very welcome!
|
| 305 |
+
Please open an *issue* in the repository or get in touch via Hugging Face.
|
| 306 |
+
|
| 307 |
+
---
|
| 308 |
|
| 309 |
+
**© 2025 - Klasifikasi Aduan Model**
|