	---
	title: IndoHoaxDetector
	emoji: 🌖
	colorFrom: pink
	colorTo: indigo
	sdk: gradio
	sdk_version: 5.49.1
	app_file: app.py
	pinned: false
	license: mit
	short_description: Indonesian News Detector
	tags: ["nlp", "text-classification", "indonesian", "hoax-detection", "machine-learning", "fact-checking", "transformers", "tfidf"]
	---

	# IndoHoaxDetector

	A machine learning system for detecting hoax-style news articles written in Indonesian. The project provides several models trained on linguistic features of Indonesian news to flag articles whose style is typical of hoaxes or fake news. The models analyze writing patterns, sensationalism, and other stylistic indicators; they do not check factual accuracy.

	## Features

	- Multiple Models: Logistic Regression, Linear SVM, Random Forest, Naive Bayes, and IndoBERT
	- High Accuracy: Best model (IndoBERT) achieves 99.89% accuracy on test data
	- Web Interface: Interactive Gradio app for easy prediction
	- Lightweight Options: TF-IDF based models for fast inference
	- Transformer Support: IndoBERT for state-of-the-art performance

	## Model Comparison

	The dataset contains 62,972 labeled Indonesian news articles. All models were trained on the 80% training split and evaluated on the remaining 12,595 held-out test samples:

	| Model | Accuracy | Precision (Macro) | Recall (Macro) | F1 (Macro) | Type |
	|-------------------------|----------|-------------------|----------------|------------|--------------|
	| IndoBERT | 0.9989 | 0.9989 | 0.9989 | 0.9989 | Transformer |
	| Linear SVM | 0.9819 | 0.9820 | 0.9817 | 0.9818 | TF-IDF + SVM |
	| Logistic Regression | 0.9782 | 0.9787 | 0.9777 | 0.9781 | TF-IDF + LR |
	| Random Forest | 0.9765 | 0.9768 | 0.9760 | 0.9764 | TF-IDF + RF |
	| Multinomial Naive Bayes | 0.9398 | 0.9414 | 0.9381 | 0.9393 | TF-IDF + NB |
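
The macro-averaged columns weight both classes equally: precision, recall, and F1 are computed per class and then averaged. A minimal pure-Python sketch of that computation follows. The `macro_scores` function and the toy labels are illustrative assumptions, not the project's evaluation code.

```python
def macro_scores(y_true, y_pred, labels=(0, 1)):
    """Return (precision, recall, f1), each macro-averaged over `labels`."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data (hypothetical): 0 = FAKTA, 1 = HOAX
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
print(macro_p, macro_r, macro_f1)
```

On a balanced test set, macro averages stay close to plain accuracy, which is consistent with the near-identical columns in the table.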

	## Model Details

	### IndoBERT (Recommended)
	- Type: Fine-tuned `indobenchmark/indobert-base-p1`
	- Training: 3 epochs, batch_size=16, learning_rate=2e-5, max_length=128
	- Accuracy: 99.89%
	- Best For: Highest accuracy, contextual understanding
	- Requirements: transformers, torch (GPU recommended)
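
For a rough sense of scale, the number of optimizer steps implied by these settings can be estimated from the dataset figures quoted in this README (62,972 articles, 12,595 of them held out). This is only a sketch: it assumes every non-test article is used for training (no separate validation split), a single device, no gradient accumulation, and that the final partial batch is kept. None of these assumptions is confirmed by the project.

```python
import math

TOTAL_ARTICLES = 62_972  # dataset size stated in this README
TEST_ARTICLES = 12_595   # held-out test size stated in this README
BATCH_SIZE = 16
EPOCHS = 3

# Assumption: everything outside the test set is used for training.
train_articles = TOTAL_ARTICLES - TEST_ARTICLES
# Assumption: the last, smaller batch of each epoch is not dropped.
steps_per_epoch = math.ceil(train_articles / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(train_articles, steps_per_epoch, total_steps)
```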

	### TF-IDF Based Models
	- Vectorizer: TF-IDF (unigrams, max_features=5000, sublinear_tf=True)
	- Preprocessing: Lowercasing, URL removal, punctuation removal, Indonesian stopword removal, Sastrawi stemming
	- Logistic Regression: L2 regularization, max_iter=1000 (97.82% accuracy)
	- Linear SVM: C=1.0, max_iter=10000 (98.19% accuracy)
	- Random Forest: 100 trees, random_state=42 (97.65% accuracy)
	- Naive Bayes: Multinomial, alpha=1.0 (93.98% accuracy)
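
The preprocessing steps listed above can be sketched roughly as follows. This is an illustration rather than the project's actual code: the `preprocess` function, the regular expression, and the tiny `STOPWORDS` set are assumptions, and stemming is applied only when the Sastrawi package is installed.

```python
import re
import string

try:
    # Optional dependency: Sastrawi's stemmer factory.
    from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
    _stemmer = StemmerFactory().create_stemmer()
except ImportError:
    _stemmer = None  # fall back to no stemming

# Tiny illustrative stopword set; a real Indonesian list is much longer.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Lowercase, strip URLs and punctuation, drop stopwords, then stem."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    cleaned = " ".join(tokens)
    if _stemmer is not None:
        cleaned = _stemmer.stem(cleaned)
    return cleaned


print(preprocess("HEBOH!!! Kabar yang beredar, baca di https://example.com sekarang"))
```

Text passed to the TF-IDF models at inference time should go through the same steps used during training, otherwise the vocabulary will not line up.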

	## Usage

	### Web Interface
	Visit the Hugging Face Space to use the interactive app:
	1. Select a model from the dropdown
	2. Enter Indonesian news text
	3. Click "Detect"
	4. View the prediction and confidence

	### Local Usage
	To run locally:

	```bash
	pip install gradio scikit-learn transformers torch
	python app.py
	```

	### API Usage
	Load and use models programmatically:

	```python
	import pickle

	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	# TF-IDF models (Logistic Regression, SVM, RF, NB).
	# Only unpickle files you trust: pickle can run arbitrary code on load.
	with open('logreg_model.pkl', 'rb') as f:
	    clf = pickle.load(f)

	# Load the fitted TF-IDF vectorizer
	with open('tfidf_vectorizer.pkl', 'rb') as f:
	    vectorizer = pickle.load(f)

	text = "Your Indonesian news text here"
	# If the vectorizer does not preprocess internally, apply the
	# training-time preprocessing to `text` before transforming it.
	X = vectorizer.transform([text])
	prediction = clf.predict(X)[0]  # 0 for FAKTA, 1 for HOAX

	# IndoBERT
	tokenizer = AutoTokenizer.from_pretrained('indobert_model')
	bert_model = AutoModelForSequenceClassification.from_pretrained('indobert_model')

	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
	with torch.no_grad():
	    outputs = bert_model(**inputs)
	bert_prediction = torch.argmax(outputs.logits, dim=-1).item()
	```
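
The web interface reports a confidence alongside the prediction. For a classifier that returns raw logits, such as IndoBERT, one common way to obtain such a number is a softmax over the logits. The sketch below uses plain Python so it runs without `torch`. The `label_with_confidence` helper is hypothetical, the 0 = FAKTA / 1 = HOAX mapping comes from the comment in the example above, and whether the app computes confidence this way is an assumption.

```python
import math

LABELS = {0: "FAKTA", 1: "HOAX"}  # mapping taken from the README example


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits):
    """Return (label name, probability of that label)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Hypothetical logits for one article
label, confidence = label_with_confidence([-1.2, 2.3])
print(label, round(confidence, 4))
```

With `torch` available, `torch.softmax(outputs.logits, dim=-1)` yields the same probabilities directly from the model output.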

	## Training Data

	- Source: TurnBackHoax fact-checks, Kompas/CekFakta.com articles
	- Size: 62,972 labeled articles (balanced HOAX/FAKTA)
	- Preprocessing: Indonesian text normalization, stemming, stopword removal
	- Split: 80% train, 20% test (stratified)
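
A stratified 80/20 split keeps the HOAX/FAKTA ratio the same in both parts. The sketch below shows the idea in plain Python on toy data. `stratified_split` and the seed are illustrative assumptions; the project may well have used a library routine such as scikit-learn's `train_test_split(..., stratify=y)` instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) with per-class proportions kept."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy balanced dataset (hypothetical): 50 FAKTA (0) and 50 HOAX (1)
labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

The held-out size of 12,595 quoted in this README equals 20% of 62,972 (12,594.4) rounded up.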

	## Limitations

	- Stylistic Analysis Only: Detects hoax-like writing style, not factual accuracy
	- Language Specific: Trained on Indonesian news patterns
	- Domain Shift: May perform differently on social media vs. formal news
	- False Positives/Negatives: Sensational but legitimate news may be flagged as hoax, and soberly written hoaxes may be missed
	- Transformer Requirements: IndoBERT is considerably slower without a GPU; a GPU is recommended for training and inference

	## Ethical Considerations

	- Responsible Use: Use the models to augment human fact-checkers, not to replace them
	- Bias Awareness: Training data may reflect source biases
	- Transparency: All models and code are open-source
	- Misuse Prevention: Clearly labeled as stylistic detector, not truth verifier

	## Contributing

	Contributions welcome! Open issues or PRs for improvements.

	## License

	MIT License - see LICENSE file for details

	Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference