Model Card for IndoHoaxDetector

Model Details

Model Description

IndoHoaxDetector is a binary classification model that detects hoax-style news articles written in Indonesian. It is a logistic regression classifier trained on linguistic features of Indonesian news, and it labels text as either legitimate-style or hoax-style writing. The model analyzes writing style and patterns; it does not assess the factual accuracy or truthfulness of the content.

Developed by: Gareth Aurelius Harrison
Model type: Logistic Regression (scikit-learn)
Language(s): Indonesian
License: MIT
Finetuned from model: N/A (trained from scratch)

Model Sources

Repository: https://huggingface.co/theonegareth/IndoHoaxDetector
Paper or resources: N/A

Uses

Direct Use

This model can be used to analyze Indonesian news articles and estimate whether they are written in a hoax-like style. It identifies linguistic patterns typical of fake news but does not verify factual accuracy. It is intended for educational, research, and journalistic use, as an aid for spotting potentially sensational or misleading writing styles.

Downstream Use

News verification tools
Fact-checking applications
Educational resources on misinformation
Research on Indonesian media landscape

Out-of-Scope Use

Automated content moderation without human oversight
Legal or judicial decisions
Real-time censorship
Detection in other languages

Bias, Risks, and Limitations

Recommendations

Users should be aware that this model:

Is trained on specific datasets and may not generalize to all Indonesian news
Can produce false positives/negatives
Should not be used as the sole basis for important decisions
Requires human verification for critical applications
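
One way to act on these recommendations is to route uncertain predictions to a person instead of acting on the label directly. The sketch below is illustrative only: the thresholds REVIEW_LOW and REVIEW_HIGH and the function name triage are hypothetical, not part of IndoHoaxDetector, and any real cutoffs would need tuning on labeled data.

```python
from dataclasses import dataclass

# Hypothetical thresholds chosen for illustration only; they are not
# values published with IndoHoaxDetector and should be tuned on real data.
REVIEW_LOW = 0.35
REVIEW_HIGH = 0.65


@dataclass
class Triage:
    hoax_probability: float
    decision: str  # one of the three strings assigned in triage()


def triage(hoax_probability: float) -> Triage:
    """Map a hoax-class probability to a triage decision.

    Scores near 0.5 are the least certain, so they are routed to a
    human reviewer instead of being labelled automatically.
    """
    if not 0.0 <= hoax_probability <= 1.0:
        raise ValueError("hoax_probability must be between 0 and 1")
    if hoax_probability >= REVIEW_HIGH:
        decision = "likely_hoax_style"
    elif hoax_probability <= REVIEW_LOW:
        decision = "likely_legitimate_style"
    else:
        decision = "needs_human_review"
    return Triage(hoax_probability, decision)


if __name__ == "__main__":
    for p in (0.05, 0.50, 0.92):
        print(triage(p))
```

Even the two confident outcomes describe writing style only, so critical applications would still need a human to check the underlying claims.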

Known Limitations

Stylistic vs. Factual Analysis: This model detects writing style typical of hoaxes, not factual inaccuracies. Legitimate news written sensationally may be flagged as a hoax, and hoaxes written in a professional style may be missed.
Data Bias: The model is trained on a limited dataset; performance may vary across topics and writing styles.
Language Specificity: Only works for Indonesian text.
Temporal Limitations: News patterns change over time; the model may become less accurate on newer data.
Binary Classification: Outputs only two classes and does not provide a nuanced assessment of credibility.

Ethical Considerations

Misinformation Detection: While helpful for flagging hoax-style writing, this technology could be misused to suppress legitimate dissenting views
Privacy: Text analysis may involve sensitive content
Accessibility: Should be used to empower users, not to restrict information access
Transparency: Model decisions should be explainable and verifiable

Training Details

Training Data

Dataset: Indonesian news articles dataset (details not publicly available)
Preprocessing: Text cleaning, tokenization, feature extraction (likely TF-IDF)
Size: Not specified
Distribution: Balanced between hoax and legitimate classes (assumed)

Training Procedure

Training Date: October 29, 2024
Hardware: Not specified
Software: scikit-learn
Hyperparameters: Default logistic regression parameters
Carbon Footprint: Not calculated
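
The training script is not published with this card. As a minimal sketch, assuming TF-IDF features (which the card lists only as likely) and scikit-learn's default LogisticRegression settings, the procedure might look like the following. The four sentences are made-up toy examples, not the actual training data, and the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, made-up sentences for illustration only; this is NOT the training data.
texts = [
    "Pemerintah mengumumkan anggaran pendidikan tahun depan",
    "Bank sentral mempertahankan suku bunga acuan",
    "VIRAL!!! Sebarkan sebelum dihapus, obat ajaib ini sembuhkan semua penyakit",
    "AWAS!!! Bagikan sekarang juga, rahasia yang disembunyikan dari publik",
]
labels = [0, 0, 1, 1]  # 0: legitimate, 1: hoax (the card's output encoding)

# Assumption: TF-IDF features. LogisticRegression() with no arguments
# uses scikit-learn's default hyperparameters, as the card states.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Raw text in, label and hoax-class probability out.
sample = ["Sebarkan sekarang!!! Obat ajaib ini disembunyikan"]
predicted_label = int(model.predict(sample)[0])
# Column 1 of predict_proba corresponds to class 1 (hoax).
hoax_probability = float(model.predict_proba(sample)[0][1])
print(predicted_label, round(hoax_probability, 3))
```

Wrapping the vectorizer and classifier in a single pipeline keeps the preprocessing and the model together, so the same object accepts raw text at inference time.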

Evaluation

Testing Data

Dataset: Held-out test split of the same dataset used for training
Size: Not specified
Distribution: Balanced (assumed)

Metrics

Accuracy: 97.83%
Other metrics: Not provided (precision, recall, F1-score unknown)

Results

The model reaches 97.83% accuracy on the held-out test set. Per-class precision, recall, and F1-score were not reported, so the error rate on each class is unknown.
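
Accuracy alone can hide weak performance on one class, which is why per-class metrics matter here. The sketch below computes accuracy, precision, recall, and F1 from confusion-matrix counts. The counts in the example are hypothetical numbers picked to illustrate the point on an imbalanced set; they are not results from IndoHoaxDetector, and the function name classification_metrics is illustrative.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute binary metrics with 'hoax' as the positive class.

    tp: hoaxes flagged as hoax      fp: legitimate flagged as hoax
    fn: hoaxes missed               tn: legitimate passed as legitimate
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one example is required")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only (1000 articles, 50 hoaxes).
    metrics = classification_metrics(tp=30, fp=2, fn=20, tn=948)
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```

With these made-up counts, accuracy is 0.978 while recall on the hoax class is only 0.6, so a high headline accuracy does not by itself show that most hoax-style articles are caught.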

Technical Specifications

Input Format: Raw Indonesian text
Output Format: Binary classification (0: legitimate, 1: hoax) with probability scores
Model Size: Small (pickle file, roughly a few MB)
Inference Time: Fast (under 1 second per prediction)

Model Card Authors

Gareth Aurelius Harrison

Model Card Contact

For questions or issues, please open an issue on the Hugging Face repository.