---
license: mit
language:
- id
metrics:
- accuracy
- f1
base_model: indobenchmark/indobert-base-p1
pipeline_tag: text-classification
library_name: transformers
tags:
- indoBERT
- classification
- aduan
- indonesian
model-index:
- name: aduan-model
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: Custom Labeled Aduan Dataset
      type: private
      split: validation
    metrics:
    - type: accuracy
      value: 0.9610
    - type: f1
      value: 0.9610
---

# 📊 Indonesian Complaint Classification Model (IndoBERT)

A text classification model for public complaints (*aduan*) written in Indonesian, fine-tuned from **IndoBERT (indobenchmark/indobert-base-p1)**.
It sorts complaints into **5 categories** and reaches **96.10%** validation accuracy.

---

## 📑 Classification Categories

| Label | Description | Examples |
|-------|-------------|----------|
| **PINALTI** (penalty) | Content containing profanity, ethnic, religious, or racial (SARA) attacks, pornography, hate speech, or other norm violations | "Kampret pejabat koruptor!", "Konten porno beredar", "Rasis banget pemerintah" |
| **DARURAT** (emergency) | Emergencies that need an immediate response (fire, accidents, disasters, threats to life) | "Ada kebakaran besar di pasar!", "Kecelakaan beruntun di tol", "Banjir bandang melanda desa" |
| **PRIORITAS** (priority) | Problems that need prompt handling (damaged infrastructure, sanitation, public services) | "Jalan berlubang berbahaya", "Sampah menumpuk seminggu", "Lampu jalan mati semua" |
| **UMUM** (general) | Requests for information, suggestions, or non-urgent complaints | "Bagaimana cara mengurus KTP?", "Kapan jadwal posyandu?", "Saran untuk program desa" |
| **LAINNYA** (other) | Complaints that fit none of the categories above | "Terima kasih atas pelayanannya", "Hanya ingin menyampaikan apresiasi" |

---

## 🎯 Model Performance

### **Overall Metrics**
- **Validation Accuracy**: **96.10%**
- **Macro F1-Score**: **0.9608**
- **Weighted F1-Score**: **0.9610**
- **Average Confidence**: **93.90%**

### **Per-Class Performance**

| Label | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| Pinalti | 0.9588 | 0.9645 | 0.9617 | 169 |
| Darurat | 0.9453 | 0.9603 | 0.9528 | 126 |
| Prioritas | 0.9675 | 0.9675 | 0.9675 | 123 |
| Umum | 0.9752 | 0.9593 | 0.9672 | 123 |
| Lainnya | 0.9596 | 0.9500 | 0.9548 | 100 |

### **Confusion Matrix**
```
                 Predicted
            Pin  Dar  Pri  Umu  Lai
Actual Pin  163    2    1    0    3
       Dar    2  121    2    0    1
       Pri    0    3  119    1    0
       Umu    2    2    1  118    0
       Lai    3    0    0    2   95
```
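
The per-class and overall figures reported above can be recomputed from this confusion matrix. The sketch below is plain Python and only illustrates the arithmetic; the function names are made up for this example and it is not the evaluation script used for the model.

```python
# Rows are actual classes, columns are predicted classes,
# in the order Pinalti, Darurat, Prioritas, Umum, Lainnya.
LABELS = ["Pinalti", "Darurat", "Prioritas", "Umum", "Lainnya"]
CONFUSION = [
    [163, 2, 1, 0, 3],
    [2, 121, 2, 0, 1],
    [0, 3, 119, 1, 0],
    [2, 2, 1, 118, 0],
    [3, 0, 0, 2, 95],
]


def per_class_metrics(matrix):
    """Return a list of (precision, recall, f1, support), one tuple per class."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        support = sum(matrix[k])                           # row sum
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1, support))
    return results


def accuracy(matrix):
    """Share of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    scores = [f1 for _, _, f1, _ in per_class_metrics(matrix)]
    return sum(scores) / len(scores)


for name, (p, r, f1, support) in zip(LABELS, per_class_metrics(CONFUSION)):
    print(f"{name:<10} P={p:.4f} R={r:.4f} F1={f1:.4f} n={support}")
print(f"Accuracy: {accuracy(CONFUSION):.4f}  Macro F1: {macro_f1(CONFUSION):.4f}")
```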

---

## 📊 Dataset Information

- **Total Samples**: 3,204 (before the train/validation split)
  - Pinalti: 844
  - Darurat: 630
  - Prioritas: 612
  - Umum: 616
  - Lainnya: 502
- **Train/Val Split**: 80% / 20% (2,563 / 641)
- **Augmentation**: Applied to balance classes
- **Language**: Indonesian (Bahasa Indonesia)

---

## 🚀 Quick Start

### Installation
```bash
pip install transformers torch
```

### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "Zulkifli1409/aduan-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)

# Prepare input
text = "Ada kebakaran besar di pasar, tolong kirim pemadam segera!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=1)
    pred_idx = torch.argmax(probs, dim=1).item()

# Labels, listed in class-index order
labels = ["PINALTI", "DARURAT", "PRIORITAS", "UMUM", "LAINNYA"]

print(f"Prediction: {labels[pred_idx]}")
print(f"Confidence: {probs[0][pred_idx].item():.2%}")
print("\nAll probabilities:")
for label, prob in zip(labels, probs[0]):
    print(f"  {label}: {prob.item():.2%}")
```

**Output:**
```
Prediction: DARURAT
Confidence: 96.03%

All probabilities:
  PINALTI: 0.21%
  DARURAT: 96.03%
  PRIORITAS: 2.89%
  UMUM: 0.45%
  LAINNYA: 0.42%
```

---

## 🧪 Example Predictions

| Input Text | Prediction | Confidence |
|------------|------------|------------|
| "Brengsek! Pejabat korup semua!" | **PINALTI** | 94.23% |
| "Ada orang kecelakaan parah butuh ambulans" | **DARURAT** | 95.67% |
| "Jalan berlubang perlu diperbaiki segera" | **PRIORITAS** | 92.34% |
| "Bagaimana cara mengurus surat izin usaha?" | **UMUM** | 89.45% |
| "Terima kasih atas bantuannya" | **LAINNYA** | 88.91% |
| "Konten porno tersebar di grup WhatsApp" | **PINALTI** | 91.78% |
| "Banjir tinggi merendam rumah warga" | **DARURAT** | 93.12% |
| "Sampah menumpuk di jalan sejak seminggu lalu" | **PRIORITAS** | 90.56% |

---

## 🔧 Batch Prediction

```python
# Reuses `torch`, `tokenizer`, and `model` from the Basic Usage example
texts = [
    "Ada kebakaran di gedung!",
    "Jalan rusak parah",
    "Dasar bodoh pemerintah!",
    "Kapan jadwal vaksinasi?"
]

# Tokenize batch (padding aligns all texts to the longest one)
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=128)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=1)
    predictions = torch.argmax(probs, dim=1)

labels = ["PINALTI", "DARURAT", "PRIORITAS", "UMUM", "LAINNYA"]

# Convert predictions to plain ints so they can index the label list
for text, pred_idx, prob in zip(texts, predictions.tolist(), probs):
    pred_label = labels[pred_idx]
    confidence = prob[pred_idx].item()
    print(f"Text: {text}")
    print(f"Prediction: {pred_label} ({confidence:.2%})\n")
```

---

## 🌐 API Deployment

The model is also available as a REST API hosted on Railway:

**Base URL**: `https://api-klasifikasi-aduan.up.railway.app`

### cURL Example
```bash
curl -X POST https://api-klasifikasi-aduan.up.railway.app/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "Ada kebakaran di pasar"}'
```

### Response
```json
{
  "label": "DARURAT",
  "confidence": 0.9603,
  "all_scores": {
    "PINALTI": 0.0021,
    "DARURAT": 0.9603,
    "PRIORITAS": 0.0289,
    "UMUM": 0.0045,
    "LAINNYA": 0.0042
  }
}
```
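
The same request can be made from Python. The sketch below uses only the standard library and assumes the endpoint behaves exactly as in the cURL example and sample response above; the helper names (`build_request`, `parse_response`, `classify`) are invented for this example and are not part of the API. Calling `classify("Ada kebakaran di pasar")` would send the request, while the first two helpers can be tried without network access.

```python
import json
import urllib.request

# Base URL taken from this section of the card.
BASE_URL = "https://api-klasifikasi-aduan.up.railway.app"


def build_request(text: str) -> urllib.request.Request:
    """Build the POST /predict request shown in the cURL example."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_response(body: bytes) -> tuple[str, float]:
    """Extract (label, confidence) from a body shaped like the sample response."""
    data = json.loads(body)
    return data["label"], float(data["confidence"])


def classify(text: str, timeout: float = 10.0) -> tuple[str, float]:
    """Send one complaint to the API. Raises urllib.error.URLError on network failure."""
    with urllib.request.urlopen(build_request(text), timeout=timeout) as resp:
        return parse_response(resp.read())
```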

---

## 🛠️ Training Details

### Model Architecture
- **Base Model**: `indobenchmark/indobert-base-p1`
- **Task**: Sequence Classification (5 classes)
- **Max Sequence Length**: 128 tokens
- **Hidden Size**: 768
- **Attention Heads**: 12
- **Layers**: 12

### Training Configuration
- **GPU**: Tesla T4 (14.74 GB VRAM)
- **Precision**: FP16 (Mixed Precision)
- **Gradient Checkpointing**: Enabled
- **Batch Size**: 2
- **Learning Rate**: 1.5e-5
- **Epochs**: 5
- **Optimizer**: AdamW
- **Best Epoch**: 5
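
To get a feel for what this configuration means in optimizer steps, the arithmetic can be sketched as follows. It assumes no gradient accumulation and that the last partial batch is kept; the card states neither, so treat the result as an estimate.

```python
import math

TRAIN_SAMPLES = 2563   # training split size from the dataset section
BATCH_SIZE = 2
EPOCHS = 5
GRAD_ACCUM_STEPS = 1   # assumption: gradient accumulation is not mentioned in the card

# One optimizer step consumes BATCH_SIZE * GRAD_ACCUM_STEPS samples.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / (BATCH_SIZE * GRAD_ACCUM_STEPS))
total_steps = steps_per_epoch * EPOCHS

print(f"Optimizer steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
```

Under these assumptions, training runs 1,282 steps per epoch and 6,410 steps in total.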

### Training Progress
| Epoch | Train Loss | Train Acc | Val Loss | Val Acc | Val F1 |
|-------|------------|-----------|----------|---------|--------|
| 1 | 0.3688 | 74.87% | 0.0825 | 93.45% | 0.9346 |
| 2 | 0.0586 | 95.86% | 0.0604 | 96.10% | 0.9609 |
| 3 | 0.0179 | 98.52% | 0.0635 | 96.41% | 0.9641 |
| 4 | 0.0069 | 99.38% | 0.0668 | 96.10% | 0.9611 |
| 5 | 0.0021 | 99.88% | 0.0623 | **96.10%** | **0.9610** |

---

## ⚠️ Important Notes

### Content Moderation (PINALTI)
The model can detect inappropriate content, but it is **not perfect**. For sensitive production applications, consider:
- An additional moderation layer
- Human review for borderline cases
- Explicit keyword whitelists/blacklists
- Combining the model with rule-based filtering
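
One way to combine these suggestions is sketched below: a keyword rule overrides the model for explicit words, and low-confidence predictions are flagged for human review. The threshold, the blocklist contents, and the `Decision` and `moderate` names are all hypothetical choices for illustration, not part of the model or its API.

```python
from dataclasses import dataclass

# Hypothetical values for illustration; tune both on your own data.
REVIEW_THRESHOLD = 0.80
BLOCKLIST = {"kampret", "brengsek"}  # example words that appear in this card


@dataclass
class Decision:
    label: str
    confidence: float
    needs_review: bool
    reason: str


def moderate(text: str, scores: dict[str, float]) -> Decision:
    """Combine model scores with a keyword rule and a confidence threshold.

    `scores` maps each label to its probability, like `all_scores` in the API response.
    """
    label = max(scores, key=scores.get)
    confidence = scores[label]
    words = {w.strip(".,!?\"'").lower() for w in text.split()}

    # Rule-based override: a blocklisted word forces PINALTI and a human check.
    if words & BLOCKLIST and label != "PINALTI":
        return Decision("PINALTI", scores.get("PINALTI", 0.0), True, "blocklist override")
    # Borderline predictions go to a human reviewer.
    if confidence < REVIEW_THRESHOLD:
        return Decision(label, confidence, True, "low confidence")
    return Decision(label, confidence, False, "model prediction")
```

For example, `moderate("Lampu mati", {"PRIORITAS": 0.55, "UMUM": 0.45})` keeps the PRIORITAS label but marks it for review because 0.55 is below the assumed threshold.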

### Limitations
- The model was trained on public complaint data from Indonesia
- Performance is best on texts of 10-100 words
- Slang and certain regional dialects may be classified less accurately
- Ambiguous context can lead to less accurate predictions
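
A simple pre-check can flag inputs that fall outside the 10-100 word range mentioned above. This is a minimal sketch that counts whitespace-separated words; the function name and the three status strings are invented for this example.

```python
MIN_WORDS, MAX_WORDS = 10, 100  # range quoted in the limitations above


def length_status(text: str) -> str:
    """Return 'short', 'ok', or 'long' based on the whitespace word count."""
    n_words = len(text.split())
    if n_words < MIN_WORDS:
        return "short"
    if n_words > MAX_WORDS:
        return "long"
    return "ok"
```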

---

## 📄 License

This model is licensed under the **Apache 2.0 License**.

---

## 📧 Citation & Contact

**Developer**: Zulkifli1409
**Hugging Face**: [@Zulkifli1409](https://huggingface.co/Zulkifli1409)

If you use this model in research or an application, please give appropriate credit.

### BibTeX
```bibtex
@misc{zulkifli2025aduan,
  author = {Zulkifli},
  title = {Indonesian Complaint Classification Model with IndoBERT},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Zulkifli1409/aduan-model}}
}
```

---

## 🤝 Contributing

Feedback, bug reports, and contributions are very welcome!
Please open an *issue* in the repository or get in touch via Hugging Face.

---

**© 2025 - Klasifikasi Aduan Model**