---
title: IndoHoaxDetector
emoji: 🌖
colorFrom: pink
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
short_description: Indonesian News Detector
tags: ["nlp", "text-classification", "indonesian", "hoax-detection", "machine-learning", "fact-checking", "transformers", "tfidf"]
---

# IndoHoaxDetector

IndoHoaxDetector is a machine learning system for detecting hoax-style news articles written in Indonesian. It provides several models trained on linguistic features of Indonesian news to flag articles whose writing style is typical of hoaxes or fake news. The models analyze writing patterns, sensationalism, and other stylistic indicators; they do not check factual accuracy.

## Features

- **Multiple Models**: Logistic Regression, Linear SVM, Random Forest, Naive Bayes, and IndoBERT
- **High Accuracy**: The best model (IndoBERT) reaches 99.89% accuracy on the held-out test set
- **Web Interface**: Interactive Gradio app for easy prediction
- **Lightweight Options**: TF-IDF based models for fast inference
- **Transformer Support**: IndoBERT for the highest accuracy among the models compared

## Model Comparison

All models were trained on a stratified 80/20 split of 62,972 Indonesian news articles and evaluated on the 12,595 held-out test samples:

| Model | Accuracy | Precision (Macro) | Recall (Macro) | F1 (Macro) | Type |
|-------------------------|----------|-------------------|----------------|------------|------|
| **IndoBERT** | **0.9989** | **0.9989** | **0.9989** | **0.9989** | Transformer |
| Linear SVM | 0.9819 | 0.9820 | 0.9817 | 0.9818 | TF-IDF + SVM |
| Logistic Regression | 0.9782 | 0.9787 | 0.9777 | 0.9781 | TF-IDF + LR |
| Random Forest | 0.9765 | 0.9768 | 0.9760 | 0.9764 | TF-IDF + RF |
| Multinomial Naive Bayes | 0.9398 | 0.9414 | 0.9381 | 0.9393 | TF-IDF + NB |
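
For readers unfamiliar with the "Macro" columns, the sketch below shows one common way to compute accuracy and macro-averaged precision, recall, and F1 from true and predicted labels. It is an illustration only, not the project's evaluation code; the function name `macro_metrics` and the toy labels are assumptions made for this example.

```python
from collections import Counter


def macro_metrics(y_true, y_pred, labels=(0, 1)):
    """Return accuracy and macro-averaged precision, recall and F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted, but wrongly
            fn[t] += 1  # an instance of class t was missed

    def safe_div(a, b):
        return a / b if b else 0.0

    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = safe_div(tp[c], tp[c] + fp[c])
        rec = safe_div(tp[c], tp[c] + fn[c])
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(safe_div(2 * prec * rec, prec + rec))

    n = len(labels)
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "precision_macro": sum(precisions) / n,  # unweighted mean over classes
        "recall_macro": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
    }


# Toy labels for illustration only: 0 = FAKTA, 1 = HOAX
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]
scores = macro_metrics(y_true, y_pred)
print(scores)
```

Macro averaging weights both classes equally, so on a balanced dataset it stays close to the plain accuracy, as in the table above.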

## Model Details

### IndoBERT (Recommended)
- **Type**: Fine-tuned `IndoBenchmark/indobert-base-p1`
- **Training**: 3 epochs, `batch_size=16`, `learning_rate=2e-5`, `max_length=128`
- **Accuracy**: 99.89%
- **Best For**: Highest accuracy, contextual understanding
- **Requirements**: `transformers`, `torch` (GPU recommended)

### TF-IDF Based Models
- **Vectorizer**: TF-IDF (unigrams, `max_features=5000`, `sublinear_tf=True`)
- **Preprocessing**: Lowercasing, URL removal, punctuation removal, Indonesian stopword removal, Sastrawi stemming
- **Logistic Regression**: L2 regularization, `max_iter=1000` (97.82% accuracy)
- **Linear SVM**: `C=1.0`, `max_iter=10000` (98.19% accuracy)
- **Random Forest**: 100 trees, `random_state=42` (97.65% accuracy)
- **Naive Bayes**: Multinomial, `alpha=1.0` (93.98% accuracy)
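
The preprocessing steps listed above could look roughly like the following sketch. It is an approximation under stated assumptions, not the project's actual pipeline: the stopword set is a tiny illustrative sample, the `preprocess` function and its `stem` parameter are hypothetical, and the order of the steps is a guess.

```python
import re
import string

# Tiny illustrative stopword sample (assumption); a real run would use a
# full Indonesian stopword list.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu", "untuk"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Map every ASCII punctuation character to a space
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text, stem=None):
    """Lowercase, strip URLs and punctuation, drop stopwords, optionally stem."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    if stem is not None:
        # `stem` is any callable mapping a token to its stem
        tokens = [stem(tok) for tok in tokens]
    return " ".join(tokens)


sample = "VIRAL!!! Vaksin ini berbahaya, baca di https://contoh.example/berita"
cleaned = preprocess(sample)
print(cleaned)
```

Stemming is left as an optional callable so the sketch runs without third-party packages; with Sastrawi installed, the `stem` method of a stemmer created by its `StemmerFactory` could be passed in.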

## Usage

### Web Interface
Visit the Hugging Face Space to use the interactive app:
1. Select a model from the dropdown
2. Enter Indonesian news text
3. Click "Detect"
4. View the prediction and confidence

### Local Usage
To run locally:

```bash
pip install gradio scikit-learn transformers torch
python app.py
```

### API Usage
Load and use models programmatically:

```python
import pickle

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For TF-IDF models (Logistic Regression, SVM, RF, NB)
with open('logreg_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Load the fitted TF-IDF vectorizer
with open('tfidf_vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)

text = "Your Indonesian news text here"
X = vectorizer.transform([text])  # sparse matrix with a single row
prediction = model.predict(X)  # 0 for FAKTA, 1 for HOAX

# For IndoBERT (note: this rebinds `model` to the transformer)
tokenizer = AutoTokenizer.from_pretrained('indobert_model')
model = AutoModelForSequenceClassification.from_pretrained('indobert_model')

# Truncate to the same max_length used during fine-tuning
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():  # no gradients needed for inference
    outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=-1).item()
```
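
The snippet above returns only a class index. To also report a confidence, as the web interface does, the logits can be passed through a softmax. The sketch below uses only the standard library and hypothetical logits. The `LABELS` mapping reuses the 0 = FAKTA, 1 = HOAX convention from the TF-IDF comment above; that IndoBERT uses the same order is an assumption, and the app may compute confidence differently.

```python
import math

# Assumed label order (taken from the TF-IDF comment; not confirmed for IndoBERT)
LABELS = {0: "FAKTA", 1: "HOAX"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits):
    """Return the predicted label and its softmax probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]


# Hypothetical logits, e.g. what outputs.logits[0].tolist() might return
label, confidence = label_with_confidence([-1.2, 2.3])
print(f"{label} ({confidence:.2%})")
```

A softmax probability measures how strongly the model prefers one class over the other. It is not a calibrated estimate of whether the article is true.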

## Training Data

- **Source**: TurnBackHoax fact-checks, Kompas/CekFakta.com articles
- **Size**: 62,972 labeled articles (balanced HOAX/FAKTA)
- **Preprocessing**: Indonesian text normalization, stemming, stopword removal
- **Split**: 80% train, 20% test (stratified)
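
A stratified split keeps the HOAX/FAKTA ratio the same in the train and test sets. The sketch below shows the idea with the standard library only. It is not the project's splitting code: the `stratified_split` helper, the seed, and the toy data are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.2, seed=42):
    """Split (sample, label) pairs so each class keeps the same test share."""
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)

    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    train_idx, test_idx = [], []
    for y in sorted(by_label):
        idx = by_label[y]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])

    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test


# Toy data: 5 FAKTA (0) and 5 HOAX (1) articles
texts = [f"artikel {i}" for i in range(10)]
labels = [0] * 5 + [1] * 5
train, test = stratified_split(texts, labels)
print(len(train), len(test))
```

With a balanced dataset, as described above, both partitions stay balanced, so accuracy and the macro-averaged metrics remain directly comparable.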

## Limitations

- **Stylistic Analysis Only**: Detects hoax-like writing style, not factual accuracy
- **Language Specific**: Trained on Indonesian news patterns
- **Domain Shift**: May perform differently on social media text than on formal news
- **False Positives/Negatives**: Sensational but legitimate news may be flagged as hoax, and soberly written hoaxes may pass as factual
- **Transformer Requirements**: IndoBERT is considerably slower without a GPU, for both training and inference

## Ethical Considerations

- **Responsible Use**: Intended to augment human fact-checkers, not replace them
- **Bias Awareness**: Training data may reflect the biases of its sources
- **Transparency**: All models and code are open-source
- **Misuse Prevention**: Clearly labeled as a stylistic detector, not a truth verifier

## Contributing

Contributions are welcome! Open an issue or pull request to suggest improvements.

## License

MIT License. See the LICENSE file for details.

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference