Zulkifli1409 committed
Commit a336501 · verified · 1 Parent(s): cb419d8

Update README.md

Files changed (1)
  1. README.md +210 -82

README.md CHANGED
@@ -30,152 +30,280 @@ model-index:
  value: 0.9389
  ---

- # 📊 Aduan Classification Model (IndoBERT)

- This model is trained to **classify public complaint texts** in Indonesian using **IndoBERT (indobenchmark/indobert-base-p1)**.
- The model sorts complaints into 4 categories:

- - **DARURAT** → Emergencies (fires, accidents, disasters)
- - **PRIORITAS** → Needs prompt handling (damaged roads, sanitation, infrastructure)
- - **UMUM** → General information or questions
- - **LAINNYA** → Other complaints that do not fit the categories above

  ---

- ## 📂 Files

- - `model.safetensors` → trained model (498MB)
- - `aduan_model.pt` → backup in pickle format
- - `config.json`, `tokenizer.json`, `vocab.txt` → configuration and tokenizer
- - `special_tokens_map.json`, `tokenizer_config.json` → tokenizer mappings

  ---

- ## 📊 Dataset & Training

- - **Total data (raw)**: 3,373
-   - Darurat: 900
-   - Prioritas: 875
-   - Umum: 880
-   - Lainnya: 718
- - **Augmentation** → 3,600 (balanced at 900 per class)
- - **Split** → 80% Train (2,880) | 20% Validation (720)
- - **Base model** → `indobenchmark/indobert-base-p1`
- - **Training device** → NVIDIA RTX 3050 Laptop GPU (CUDA)

- ---
-
- ## 📈 Evaluation Results
-
- - **Best Epoch** → 3
- - **Validation Accuracy** → **93.89%**
- - **Macro F1-score** → **0.9389**

- ### 📑 Classification Report
- | Label | Precision | Recall | F1-score |
- |------------|-----------|--------|----------|
- | Darurat | 0.9435 | 0.9278 | 0.9356 |
- | Prioritas | 0.9257 | 0.9000 | 0.9127 |
- | Umum | 0.9026 | 0.9778 | 0.9387 |
- | Lainnya | 0.9884 | 0.9500 | 0.9688 |
- | **Macro Avg** | 0.9401 | 0.9389 | 0.9389 |

- ### 🔢 Confusion Matrix
  ```

- [[167  10   3   0]   # Darurat
-  [  6 162  11   1]   # Prioritas
-  [  1   2 176   1]   # Umum
-  [  3   1   5 171]]  # Lainnya

- ````

  ---

- ## 🧪 Example Predictions

- ### Single Input
  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import torch

  model_name = "Zulkifli1409/aduan-model"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForSequenceClassification.from_pretrained(model_name)

- text = "Ada kebakaran besar di jalan sudirman, tolong kirim pemadam!"
- inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
- outputs = model(**inputs)
- probs = torch.nn.functional.softmax(outputs.logits, dim=1)

- pred_idx = torch.argmax(probs).item()
- labels = ["DARURAT", "PRIORITAS", "UMUM", "LAINNYA"]

- print("Prediksi:", labels[pred_idx])
- print("Probabilitas:", probs.tolist())
- ````

- ### Output:

  ```
  Prediksi: DARURAT
- Probabilitas: [[0.9823, 0.0145, 0.0021, 0.0011]]
  ```

  ---

- ## 📦 Advanced Prediction Tests

- | Complaint Text | Prediction | Confidence |
- | ----------------------------------------- | --------- | ---------- |
- | ada kebakaran besar di pasar tolong cepat | DARURAT | 60.62% |
- | jalan berlubang perlu diperbaiki | PRIORITAS | 78.47% |
- | mohon pencerahan tentang program desa | UMUM | 72.09% |
- | ada orang kecelakaan parah butuh ambulans | DARURAT | 74.29% |
- | sampah menumpuk di jalan | PRIORITAS | 71.17% |
- | banjir tinggi merendam rumah warga | DARURAT | 58.01% |

  ---

- ## 🚀 Deployment

- The model is also available as an API on Railway:
-
- ```
- Base URL: https://api-klasifikasi-aduan.up.railway.app
  ```

- Example request:

  ```bash
  curl -X POST https://api-klasifikasi-aduan.up.railway.app/predict \
    -H "Content-Type: application/json" \
    -d '{"text": "Ada kebakaran di pasar"}'
  ```

- Response:
-
  ```json
  {
    "label": "DARURAT",
-   "confidence": 0.9823,
    "all_scores": {
-     "DARURAT": 0.9823,
-     "PRIORITAS": 0.0145,
-     "UMUM": 0.0021,
-     "LAINNYA": 0.0011
    }
  }
  ```

  ---

- ## 📧 Contact

- Developed by **Zulkifli1409**
- For questions or suggestions, please open an *issue* or reach out via the [Hugging Face profile](https://huggingface.co/Zulkifli1409).

  ---

- **© 2025 Klasifikasi Aduan Model**

  value: 0.9389
  ---

+ # 📊 Indonesian Complaint Classification Model (IndoBERT)

+ [![Model](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-yellow)](https://huggingface.co/Zulkifli1409/aduan-model)
+ [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+ [![Language](https://img.shields.io/badge/Language-Indonesian-red.svg)](https://en.wikipedia.org/wiki/Indonesian_language)

+ A text classification model for public complaints in Indonesian, built on **IndoBERT (indobenchmark/indobert-base-p1)**.
+ The model sorts complaints into **5 categories** with **96.10%** validation accuracy.

  ---

+ ## 📑 Classification Categories

+ | Label | Description | Examples |
+ |-------|-------------|----------|
+ | **PINALTI** | Content containing profanity, SARA (ethnic, religious, racial, or inter-group) attacks, pornography, hate speech, or norm violations | "Kampret pejabat koruptor!", "Konten porno beredar", "Rasis banget pemerintah" |
+ | **DARURAT** | Emergencies that need an immediate response (fires, accidents, disasters, threats to life) | "Ada kebakaran besar di pasar!", "Kecelakaan beruntun di tol", "Banjir bandang melanda desa" |
+ | **PRIORITAS** | Problems that need prompt handling (damaged infrastructure, sanitation, public services) | "Jalan berlubang berbahaya", "Sampah menumpuk seminggu", "Lampu jalan mati semua" |
+ | **UMUM** | Information requests, suggestions, or non-urgent complaints | "Bagaimana cara mengurus KTP?", "Kapan jadwal posyandu?", "Saran untuk program desa" |
+ | **LAINNYA** | Complaints that do not fit the categories above | "Terima kasih atas pelayanannya", "Hanya ingin menyampaikan apresiasi" |

  ---

+ ## 🎯 Model Performance

+ ### **Overall Metrics**
+ - **Validation Accuracy**: **96.10%**
+ - **Macro F1-Score**: **0.9608**
+ - **Weighted F1-Score**: **0.9610**
+ - **Average Confidence**: **93.90%**

+ ### **Per-Class Performance**

+ | Label | Precision | Recall | F1-Score | Support |
+ |-------|-----------|--------|----------|---------|
+ | Pinalti | 0.9588 | 0.9645 | 0.9617 | 169 |
+ | Darurat | 0.9453 | 0.9603 | 0.9528 | 126 |
+ | Prioritas | 0.9675 | 0.9675 | 0.9675 | 123 |
+ | Umum | 0.9752 | 0.9593 | 0.9672 | 123 |
+ | Lainnya | 0.9596 | 0.9500 | 0.9548 | 100 |

+ ### **Confusion Matrix**
  ```
+                 Predicted
+              Pin  Dar  Pri  Umu  Lai
+ Actual Pin   163    2    1    0    3
+        Dar     2  121    2    0    1
+        Pri     0    3  119    1    0
+        Umu     2    2    1  118    0
+        Lai     3    0    0    2   95
+ ```
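
The per-class figures can be re-derived from the confusion matrix itself. The following is a minimal sketch in plain Python, assuming rows are actual classes and columns are predicted classes (as the matrix header indicates). The helper names `per_class_metrics` and `accuracy` are illustrative and do not come from the repository.

```python
# Confusion matrix copied from the README (rows = actual, columns = predicted).
LABELS = ["Pinalti", "Darurat", "Prioritas", "Umum", "Lainnya"]
CONFUSION = [
    [163, 2, 1, 0, 3],
    [2, 121, 2, 0, 1],
    [0, 3, 119, 1, 0],
    [2, 2, 1, 118, 0],
    [3, 0, 0, 2, 95],
]


def per_class_metrics(matrix):
    """Return one (precision, recall, f1, support) tuple per class."""
    n = len(matrix)
    rows = []
    for k in range(n):
        true_pos = matrix[k][k]
        support = sum(matrix[k])                         # actual samples of class k
        predicted = sum(matrix[r][k] for r in range(n))  # times class k was predicted
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((precision, recall, f1, support))
    return rows


def accuracy(matrix):
    """Share of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


metrics = per_class_metrics(CONFUSION)
macro_f1 = sum(m[2] for m in metrics) / len(metrics)

for label, (p, r, f1, support) in zip(LABELS, metrics):
    print(f"{label:<10} P={p:.4f} R={r:.4f} F1={f1:.4f} n={support}")
print(f"accuracy={accuracy(CONFUSION):.4f} macro_f1={macro_f1:.4f}")
```

Rounded to four decimals, these values agree with the per-class table and with the reported validation accuracy and macro F1, so the matrix and the table are mutually consistent.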
+
+ ---

+ ## 📊 Dataset Information

+ - **Total Training Samples**: 3,204
+   - Pinalti: 844
+   - Darurat: 630
+   - Prioritas: 612
+   - Umum: 616
+   - Lainnya: 502
+ - **Train/Val Split**: 80% / 20% (2,563 / 641)
+ - **Augmentation**: Applied to balance classes
+ - **Language**: Indonesian (Bahasa Indonesia)
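
As a quick sanity check, the split sizes can be reproduced from the class counts listed here and the validation support column of the per-class table. This is only arithmetic on numbers already in the README. How the split was actually produced is not stated, so the "stratified" reading below is an assumption.

```python
# Class counts as listed in the dataset section of the README.
CLASS_COUNTS = {"Pinalti": 844, "Darurat": 630, "Prioritas": 612, "Umum": 616, "Lainnya": 502}
# Validation support per class, from the per-class performance table.
VAL_SUPPORT = {"Pinalti": 169, "Darurat": 126, "Prioritas": 123, "Umum": 123, "Lainnya": 100}

total = sum(CLASS_COUNTS.values())
val_size = sum(VAL_SUPPORT.values())
train_size = total - val_size

print(f"total={total} train={train_size} val={val_size}")
for label, count in CLASS_COUNTS.items():
    share = VAL_SUPPORT[label] / count  # fraction of this class held out for validation
    print(f"{label:<10} {VAL_SUPPORT[label]:>3}/{count} = {share:.1%} held out")
```

Each class contributes close to 20% of its samples to validation, which is what a stratified 80/20 split would give. The listed class sizes still differ (844 versus 502), so the augmentation step does not appear to have equalised them.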

  ---

+ ## 🚀 Quick Start

+ ### Installation
+ ```bash
+ pip install transformers torch
+ ```
+
+ ### Basic Usage
  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import torch

+ # Load model and tokenizer
  model_name = "Zulkifli1409/aduan-model"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForSequenceClassification.from_pretrained(model_name)

+ # Prepare input
+ text = "Ada kebakaran besar di pasar, tolong kirim pemadam segera!"
+ inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)

+ # Predict
+ with torch.no_grad():
+     outputs = model(**inputs)
+     probs = torch.nn.functional.softmax(outputs.logits, dim=1)
+     pred_idx = torch.argmax(probs).item()

+ # Labels
+ labels = ["PINALTI", "DARURAT", "PRIORITAS", "UMUM", "LAINNYA"]

+ print(f"Prediksi: {labels[pred_idx]}")
+ print(f"Confidence: {probs[0][pred_idx].item():.2%}")
+ print(f"\nAll probabilities:")
+ for label, prob in zip(labels, probs[0]):
+     print(f"  {label}: {prob.item():.2%}")
+ ```

+ **Output:**
  ```
  Prediksi: DARURAT
+ Confidence: 96.03%
+
+ All probabilities:
+   PINALTI: 0.21%
+   DARURAT: 96.03%
+   PRIORITAS: 2.89%
+   UMUM: 0.45%
+   LAINNYA: 0.42%
  ```

  ---

+ ## 🧪 Example Predictions

+ | Input Text | Prediction | Confidence |
+ |------------|------------|------------|
+ | "Brengsek! Pejabat korup semua!" | **PINALTI** | 94.23% |
+ | "Ada orang kecelakaan parah butuh ambulans" | **DARURAT** | 95.67% |
+ | "Jalan berlubang perlu diperbaiki segera" | **PRIORITAS** | 92.34% |
+ | "Bagaimana cara mengurus surat izin usaha?" | **UMUM** | 89.45% |
+ | "Terima kasih atas bantuannya" | **LAINNYA** | 88.91% |
+ | "Konten porno tersebar di grup WhatsApp" | **PINALTI** | 91.78% |
+ | "Banjir tinggi merendam rumah warga" | **DARURAT** | 93.12% |
+ | "Sampah menumpuk di jalan sejak seminggu lalu" | **PRIORITAS** | 90.56% |

  ---

+ ## 🔧 Batch Prediction

+ ```python
+ texts = [
+     "Ada kebakaran di gedung!",
+     "Jalan rusak parah",
+     "Dasar bodoh pemerintah!",
+     "Kapan jadwal vaksinasi?"
+ ]
+
+ # Tokenize batch
+ inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=128)
+
+ # Predict
+ with torch.no_grad():
+     outputs = model(**inputs)
+     probs = torch.nn.functional.softmax(outputs.logits, dim=1)
+     predictions = torch.argmax(probs, dim=1)
+
+ labels = ["PINALTI", "DARURAT", "PRIORITAS", "UMUM", "LAINNYA"]
+
+ for text, pred_idx, prob in zip(texts, predictions, probs):
+     pred_label = labels[pred_idx]
+     confidence = prob[pred_idx].item()
+     print(f"Text: {text}")
+     print(f"Prediction: {pred_label} ({confidence:.2%})\n")
  ```

+ ---
+
+ ## 🌐 API Deployment

+ The model is also available as a REST API on Railway:
+
+ **Base URL**: `https://api-klasifikasi-aduan.up.railway.app`
+
+ ### cURL Example
  ```bash
  curl -X POST https://api-klasifikasi-aduan.up.railway.app/predict \
    -H "Content-Type: application/json" \
    -d '{"text": "Ada kebakaran di pasar"}'
  ```

+ ### Response
  ```json
  {
    "label": "DARURAT",
+   "confidence": 0.9603,
    "all_scores": {
+     "PINALTI": 0.0021,
+     "DARURAT": 0.9603,
+     "PRIORITAS": 0.0289,
+     "UMUM": 0.0045,
+     "LAINNYA": 0.0042
    }
  }
  ```
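
The same request can be issued from Python. The sketch below uses only the standard library and mirrors the cURL example. It assumes that the `/predict` endpoint accepts the JSON body shown and returns the response shape shown. The function names are illustrative, and error handling (HTTP errors, timeouts, malformed bodies) is left out.

```python
import json
import urllib.request

# Base URL and /predict path are taken from the README's cURL example.
API_URL = "https://api-klasifikasi-aduan.up.railway.app/predict"


def build_request(text: str) -> urllib.request.Request:
    """Build the POST request that mirrors the cURL example."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(body: str) -> tuple[str, float, dict[str, float]]:
    """Extract label, confidence, and per-class scores from a response body.

    Assumes the response has the shape documented in the README.
    """
    data = json.loads(body)
    return data["label"], float(data["confidence"]), dict(data["all_scores"])


def classify(text: str, timeout: float = 10.0):
    """Send one complaint to the API (requires network access)."""
    with urllib.request.urlopen(build_request(text), timeout=timeout) as resp:
        return parse_prediction(resp.read().decode("utf-8"))


# Offline check against the sample response documented in the README.
SAMPLE = (
    '{"label": "DARURAT", "confidence": 0.9603, "all_scores": '
    '{"PINALTI": 0.0021, "DARURAT": 0.9603, "PRIORITAS": 0.0289, '
    '"UMUM": 0.0045, "LAINNYA": 0.0042}}'
)
label, confidence, scores = parse_prediction(SAMPLE)
print(label, confidence, max(scores, key=scores.get))
```

Calling `classify("Ada kebakaran di pasar")` would perform the live request. Whether the hosted service is currently reachable is not something the README guarantees.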

  ---

+ ## 🛠️ Training Details
+
+ ### Model Architecture
+ - **Base Model**: `indobenchmark/indobert-base-p1`
+ - **Task**: Sequence Classification (5 classes)
+ - **Max Sequence Length**: 128 tokens
+ - **Hidden Size**: 768
+ - **Attention Heads**: 12
+ - **Layers**: 12
+
+ ### Training Configuration
+ - **GPU**: Tesla T4 (14.74 GB VRAM)
+ - **Precision**: FP16 (Mixed Precision)
+ - **Gradient Checkpointing**: Enabled
+ - **Batch Size**: 2
+ - **Learning Rate**: 1.5e-5
+ - **Epochs**: 5
+ - **Optimizer**: AdamW
+ - **Best Epoch**: 5
+
+ ### Training Progress
+ | Epoch | Train Loss | Train Acc | Val Loss | Val Acc | Val F1 |
+ |-------|------------|-----------|----------|---------|--------|
+ | 1 | 0.3688 | 74.87% | 0.0825 | 93.45% | 0.9346 |
+ | 2 | 0.0586 | 95.86% | 0.0604 | 96.10% | 0.9609 |
+ | 3 | 0.0179 | 98.52% | 0.0635 | 96.41% | 0.9641 |
+ | 4 | 0.0069 | 99.38% | 0.0668 | 96.10% | 0.9611 |
+ | 5 | 0.0021 | 99.88% | 0.0623 | **96.10%** | **0.9610** |
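
The README reports epoch 5 as the best epoch but does not say which criterion was used. As a hedged illustration, the sketch below ranks the epochs in the table above under three common criteria. The `HISTORY` rows are copied from the table, and nothing here reflects the author's actual checkpoint-selection code.

```python
# (epoch, train_loss, train_acc, val_loss, val_acc, val_f1) rows from the training table.
HISTORY = [
    (1, 0.3688, 0.7487, 0.0825, 0.9345, 0.9346),
    (2, 0.0586, 0.9586, 0.0604, 0.9610, 0.9609),
    (3, 0.0179, 0.9852, 0.0635, 0.9641, 0.9641),
    (4, 0.0069, 0.9938, 0.0668, 0.9610, 0.9611),
    (5, 0.0021, 0.9988, 0.0623, 0.9610, 0.9610),
]

best_by_val_acc = max(HISTORY, key=lambda row: row[4])[0]   # highest validation accuracy
best_by_val_f1 = max(HISTORY, key=lambda row: row[5])[0]    # highest validation F1
best_by_val_loss = min(HISTORY, key=lambda row: row[3])[0]  # lowest validation loss

print(f"best by val acc:  epoch {best_by_val_acc}")
print(f"best by val F1:   epoch {best_by_val_f1}")
print(f"best by val loss: epoch {best_by_val_loss}")

# The widening train/validation gap is a rough indicator of overfitting.
for epoch, _, train_acc, _, val_acc, _ in HISTORY:
    print(f"epoch {epoch}: train-val accuracy gap = {train_acc - val_acc:+.2%}")
```

On the table's own numbers, validation accuracy and F1 peak at epoch 3 and validation loss is lowest at epoch 2, so the reported choice of epoch 5 presumably follows some other rule, such as keeping the final checkpoint. That explanation is an assumption, since the source does not state the rule.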

+ ---
+
+ ## ⚠️ Important Notes
+
+ ### Content Moderation (PINALTI)
+ The model can detect inappropriate content, but it is **not perfect**. For sensitive production applications, consider:
+ - An additional moderation layer
+ - Human review for borderline cases
+ - An explicit keyword whitelist/blacklist
+ - Combining the model with rule-based filtering
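
One simple way to implement the human-review suggestion is a confidence threshold: predictions below it are flagged for a person to check. The sketch below is a minimal, framework-free illustration. `REVIEW_THRESHOLD` and `route` are hypothetical names, the 0.80 cut-off is an arbitrary placeholder that would need tuning on validation data, and the label order is the one used in the README's usage examples.

```python
import math

LABELS = ["PINALTI", "DARURAT", "PRIORITAS", "UMUM", "LAINNYA"]  # order from the README examples
REVIEW_THRESHOLD = 0.80  # hypothetical cut-off; tune on validation data


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, threshold=REVIEW_THRESHOLD):
    """Return (label, confidence, needs_review) for one set of logits."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[idx]
    return LABELS[idx], confidence, confidence < threshold


print(route([0.1, 6.0, 1.5, 0.2, 0.0]))  # one dominant logit: not flagged
print(route([1.2, 1.0, 0.9, 0.3, 0.1]))  # near-uniform logits: flagged for review
```

In practice the logits would come from `outputs.logits[i].tolist()` in the usage examples. A stricter threshold could be applied to PINALTI alone if false positives there are costly.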
+
+ ### Limitations
+ - The model was trained on Indonesian public complaint data
+ - Performance is best on texts of 10-100 words
+ - Slang or certain regional dialects may be classified less accurately
+ - Ambiguous context can lead to inaccurate predictions
+
+ ---
+
+ ## 📄 License
+
+ This model is licensed under the **Apache 2.0 License**.
+
+ ---
+
+ ## 📧 Citation & Contact
+
+ **Developer**: Zulkifli1409
+ **Hugging Face**: [@Zulkifli1409](https://huggingface.co/Zulkifli1409)
+
+ If you use this model in research or an application, please give appropriate credit.
+
+ ### BibTeX
+ ```bibtex
+ @misc{zulkifli2025aduan,
+   author = {Zulkifli},
+   title = {Indonesian Complaint Classification Model with IndoBERT},
+   year = {2025},
+   publisher = {Hugging Face},
+   howpublished = {\url{https://huggingface.co/Zulkifli1409/aduan-model}}
+ }
+ ```

  ---

+ ## 🤝 Contributing
+
+ Feedback, bug reports, and contributions are very welcome!
+ Please open an *issue* in the repository or get in touch via Hugging Face.
+
+ ---

+ **© 2025 - Klasifikasi Aduan Model**