Model Card for IndoHoaxDetector
Model Details
Model Description
IndoHoaxDetector is a binary classification model designed to detect hoax-style news articles in the Indonesian language. It uses logistic regression trained on linguistic features of Indonesian news to classify text as either legitimate or hoax-like writing. This model analyzes writing style and patterns, not factual accuracy or truthfulness of the content.
- Developed by: Gareth Aurelius Harrison
- Model type: Logistic Regression (scikit-learn)
- Language(s): Indonesian
- License: MIT
- Finetuned from model: N/A (trained from scratch)
Model Sources
- Repository: https://huggingface.co/theonegareth/IndoHoaxDetector
- Paper or resources: N/A
Uses
Direct Use
This model can be used to analyze Indonesian news articles and determine if they are written in a hoax-like style. It identifies linguistic patterns typical of fake news but does not verify factual accuracy. It is intended for educational, research, and journalistic purposes to help identify potentially sensational or misleading writing styles.
Downstream Use
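- News verification tools, as a stylistic signal rather than a verdict on truthfulness
- Fact-checking applications, to prioritize articles for human fact-checkers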
- Educational resources on misinformation
- Research on Indonesian media landscape
Out-of-Scope Use
- Automated content moderation without human oversight
- Legal or judicial decisions
- Real-time censorship
- Hoax detection in languages other than Indonesian
Bias, Risks, and Limitations
Recommendations
Users should be aware that this model:
- Is trained on specific datasets and may not generalize to all Indonesian news
- Can produce both false positives and false negatives
- Should not be used as the sole basis for important decisions
- Requires human verification for critical applications
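One way to act on these recommendations is to treat the model's hoax probability as a triage signal instead of a final decision. The sketch below is only an illustration: the function name `triage` and the 0.3 / 0.7 cut-offs are hypothetical, not part of the released model, and would need tuning on labelled data.

```python
def triage(p_hoax, review_low=0.3, review_high=0.7):
    """Map a hoax-style probability to a coarse bucket.

    The default cut-offs are hypothetical placeholders, not values
    published with the model.
    """
    if not 0.0 <= p_hoax <= 1.0:
        raise ValueError("p_hoax must be a probability in [0, 1]")
    if p_hoax >= review_high:
        return "likely_hoax_style"
    if p_hoax <= review_low:
        return "likely_legitimate_style"
    # Scores between the two cut-offs are too ambiguous to act on.
    return "uncertain"


# Made-up probabilities, for illustration only.
buckets = [triage(p) for p in (0.05, 0.50, 0.92)]
```

In a critical application, anything in the "uncertain" or "likely_hoax_style" bucket would still go to a human reviewer before any action is taken.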
Known Limitations
- Stylistic vs. Factual Analysis: This model detects writing style typical of hoaxes, not factual inaccuracies. Legitimate news written sensationally may be flagged as a hoax, and factually false hoaxes written in a professional style may be missed.
- Data Bias: The model is trained on a limited dataset; performance may vary with different topics or writing styles
- Language Specificity: Only works for Indonesian text
- Temporal Limitations: News patterns change over time; the model may become less accurate with newer data
- Binary Classification: Does not provide nuanced assessments of credibility
Ethical Considerations
- Misinformation Detection: While helpful for identifying hoaxes, this technology could be misused to suppress legitimate dissenting views
- Privacy: Text analysis may involve sensitive content
- Accessibility: Should be used to empower users, not to restrict information access
- Transparency: Model decisions should be explainable and verifiable
Training Details
Training Data
- Dataset: Indonesian news articles dataset (details not publicly available)
- Preprocessing: Text cleaning, tokenization, and feature extraction (likely TF-IDF; the exact feature set is not documented)
- Size: Not specified
- Distribution: Balanced between hoax and legitimate classes (assumed)
Training Procedure
- Training Date: October 29, 2024
- Hardware: Not specified
- Software: scikit-learn
- Hyperparameters: Default scikit-learn LogisticRegression parameters
- Carbon Footprint: Not calculated
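The procedure above can be approximated with a short scikit-learn pipeline. This is a minimal sketch, not the author's training script. It assumes TF-IDF features (the card only says "likely TF-IDF") and default `LogisticRegression` settings, and it uses a handful of made-up sentences in place of the non-public training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy placeholder data, NOT the model's training set (which is not public).
texts = [
    "Pemerintah mengumumkan jadwal vaksinasi tahap kedua.",
    "Bank Indonesia mempertahankan suku bunga acuan bulan ini.",
    "VIRAL!!! Sebarkan sebelum dihapus, rahasia ini bikin kaget!",
    "AWAS!!! Minuman ini menyebabkan kematian mendadak, bagikan segera!",
]
labels = [0, 0, 1, 1]  # 0: legitimate, 1: hoax (the card's label convention)

# Assumption: TF-IDF features feeding a default logistic regression.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

sample = ["Sebarkan segera!!! Kabar ini bikin kaget!"]
predicted = pipeline.predict(sample)            # array of 0/1 labels
probabilities = pipeline.predict_proba(sample)  # one row per text, one column per class
```

Wrapping the vectorizer and classifier in one pipeline lets the fitted object accept raw text directly, which matches the input format this card describes.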
Evaluation
Testing Data
- Dataset: Held-out test split drawn from the same dataset as the training data
- Size: Not specified
- Distribution: Balanced (assumed)
Metrics
- Accuracy: 97.83%
- Other metrics: Not provided (precision, recall, F1-score unknown)
Results
The model reaches 97.83% accuracy on the held-out test set. Per-class precision, recall, and F1-score are not available, so the balance between false positives and false negatives is unknown.
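Because only accuracy is reported, it may help to see how the missing metrics are computed and why they can tell a different story. The sketch below uses a hypothetical helper, `binary_metrics`, and made-up labels. It does not reproduce this model's results.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for illustration: 1 = hoax, 0 = legitimate.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

In this toy case accuracy is 0.9 while recall on the hoax class is only 0.5, which is the kind of gap a single accuracy figure cannot reveal.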
Technical Specifications
- Input Format: Raw Indonesian text
- Output Format: Binary classification (0: legitimate, 1: hoax) with probability scores
- Model Size: Small (pickle file of roughly a few MB)
- Inference Time: Fast (under 1 second per prediction; hardware not specified)
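The input and output formats above suggest a thin wrapper like the following. It is a sketch under assumptions: the released pickle is assumed to be a scikit-learn style object that accepts raw text and exposes `classes_`, `predict`, and `predict_proba`. `StubModel` is a hypothetical stand-in so the example runs without the real file, whose name this card does not state.

```python
def classify(model, texts, hoax_label=1):
    """Return one dict per text with the predicted label and hoax probability."""
    # predict_proba columns follow the order of model.classes_.
    hoax_col = list(model.classes_).index(hoax_label)
    labels = model.predict(texts)
    probabilities = model.predict_proba(texts)
    return [
        {"text": text, "label": int(label), "p_hoax": float(row[hoax_col])}
        for text, label, row in zip(texts, labels, probabilities)
    ]


class StubModel:
    """Hypothetical stand-in that mimics the assumed model interface."""
    classes_ = [0, 1]

    def predict_proba(self, texts):
        # Fake scores: text containing "!!!" is treated as hoax-like.
        return [[0.2, 0.8] if "!!!" in t else [0.9, 0.1] for t in texts]

    def predict(self, texts):
        return [1 if row[1] >= 0.5 else 0 for row in self.predict_proba(texts)]


results = classify(StubModel(), ["Berita biasa hari ini.", "SEBARKAN!!!"])
```

With the real model, the stub would be replaced by the object loaded from the published pickle file, for example via `pickle.load` or `joblib.load`.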
Model Card Authors
Gareth Aurelius Harrison
Model Card Contact
For questions or issues, please open an issue on the Hugging Face repository.