| # Model Card for IndoHoaxDetector | |
| ## Model Details | |
| ### Model Description | |
| IndoHoaxDetector is a binary classification model designed to detect hoax-style news articles in the Indonesian language. It uses logistic regression trained on linguistic features of Indonesian news to classify text as either legitimate or hoax-like writing. **This model analyzes writing style and patterns, not factual accuracy or truthfulness of the content.** | |
| - **Developed by**: Gareth Aurelius Harrison | |
| - **Model type**: Logistic Regression (scikit-learn) | |
| - **Language(s)**: Indonesian | |
| - **License**: MIT | |
| - **Finetuned from model**: N/A (trained from scratch) | |
| ### Model Sources | |
| - **Repository**: https://huggingface.co/theonegareth/IndoHoaxDetector | |
| - **Paper or resources**: N/A | |
| ## Uses | |
| ### Direct Use | |
| This model can be used to analyze Indonesian news articles and determine if they are written in a hoax-like style. It identifies linguistic patterns typical of fake news but does **not verify factual accuracy**. It is intended for educational, research, and journalistic purposes to help identify potentially sensational or misleading writing styles. | |
| ### Downstream Use | |
| - News verification tools | |
| - Fact-checking applications | |
| - Educational resources on misinformation | |
| - Research on Indonesian media landscape | |
| ### Out-of-Scope Use | |
| - Automated content moderation without human oversight | |
| - Legal or judicial decisions | |
| - Real-time censorship | |
| - Detection in other languages | |
| ## Bias, Risks, and Limitations | |
| ### Recommendations | |
| Users should be aware that this model: | |
| - Is trained on specific datasets and may not generalize to all Indonesian news | |
| - Can produce false positives/negatives | |
| - Should not be used as the sole basis for important decisions | |
| - Requires human verification for critical applications | |
| ### Known Limitations | |
| - **Stylistic vs. Factual Analysis**: This model detects the writing style typical of hoaxes, not factual inaccuracies. Legitimate news written in a sensational style may be flagged as a hoax, and false claims written in a professional style may be missed. | |
| - **Data Bias**: The model is trained on a limited dataset; performance may vary with different topics or writing styles | |
| - **Language Specificity**: Only works for Indonesian text | |
| - **Temporal Limitations**: News patterns change over time; the model may become less accurate with newer data | |
| - **Binary Classification**: Does not provide nuanced assessments of credibility | |
| ### Ethical Considerations | |
| - **Misinformation Detection**: While helpful for identifying hoaxes, this technology could be misused to suppress legitimate dissenting views | |
| - **Privacy**: Text analysis may involve sensitive content | |
| - **Accessibility**: Should be used to empower users, not to restrict information access | |
| - **Transparency**: Model decisions should be explainable and verifiable | |
| ## Training Details | |
| ### Training Data | |
| - **Dataset**: Indonesian news articles dataset (details not publicly available) | |
| - **Preprocessing**: Text cleaning, tokenization, feature extraction (likely TF-IDF) | |
| - **Size**: Not specified | |
| - **Distribution**: Balanced between hoax and legitimate classes (assumed) | |
| ### Training Procedure | |
| - **Training Date**: October 29, 2024 | |
| - **Hardware**: Not specified | |
| - **Software**: scikit-learn | |
| - **Hyperparameters**: Default logistic regression parameters | |
| - **Carbon Footprint**: Not calculated | |
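The training setup described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual training script: it assumes TF-IDF features (the card only says "likely TF-IDF"), uses scikit-learn's default `LogisticRegression` settings, and trains on four made-up sentences that stand in for the undisclosed dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy examples for illustration only; the real training data
# is not publicly available.
texts = [
    "Pemerintah mengumumkan anggaran pendidikan tahun depan",
    "Bank sentral mempertahankan suku bunga acuan",
    "VIRAL!!! Sebarkan sebelum dihapus, rahasia ini disembunyikan",
    "AWAS!!! Minuman ini bikin kebal semua penyakit, sebarkan",
]
labels = [0, 0, 1, 1]  # 0: legitimate, 1: hoax

# Assumed pipeline: TF-IDF features followed by logistic regression
# with default hyperparameters.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Classify a new piece of raw text and read the hoax-class probability.
sample = ["Sebarkan!!! Rahasia ini disembunyikan dari publik"]
predicted_label = int(model.predict(sample)[0])
hoax_probability = float(model.predict_proba(sample)[0][1])
print(predicted_label, round(hoax_probability, 3))
```

Wrapping the vectorizer and classifier in one pipeline keeps preprocessing and prediction together, so the same object accepts raw text at inference time.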
| ## Evaluation | |
| ### Testing Data | |
| - **Dataset**: Held-out test split of the same dataset used for training | |
| - **Size**: Not specified | |
| - **Distribution**: Balanced (assumed) | |
| ### Metrics | |
| - **Accuracy**: 97.83% | |
| - **Other metrics**: Not provided (precision, recall, F1-score unknown) | |
| ### Results | |
| The model reaches 97.83% accuracy on the held-out test set. Per-class precision, recall, and F1-score are not available, so it is unknown how the errors split between false positives and false negatives. | |
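Because only accuracy is reported, the following sketch shows how the missing per-class metrics could be computed if the test-set labels and predictions were available. The `classification_metrics` helper and the example labels are hypothetical and do not reflect the model's real results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels (1: hoax, 0: legitimate), for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case accuracy is 0.7 while recall on the hoax class is only 0.5, which illustrates why a single accuracy figure can hide how many hoaxes are missed.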
| ## Technical Specifications | |
| - **Input Format**: Raw Indonesian text | |
| - **Output Format**: Binary classification (0: legitimate, 1: hoax) with probability scores | |
| - **Model Size**: Small (pickle file, roughly a few MB) | |
| - **Inference Time**: Fast (< 1 second per prediction) | |
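As an illustration of the output format, the sketch below turns a hoax-class probability into the documented label (0: legitimate, 1: hoax) and flags borderline scores for human review. The `interpret` function, the 0.5 threshold, and the 0.2 review margin are assumptions made for this example, not part of the released model.

```python
LABEL_NAMES = {0: "legitimate", 1: "hoax"}


def interpret(hoax_probability, threshold=0.5, review_margin=0.2):
    """Map a hoax-class probability to a label and a review flag.

    threshold and review_margin are illustrative defaults, not values
    taken from the model itself.
    """
    if not 0.0 <= hoax_probability <= 1.0:
        raise ValueError("hoax_probability must be between 0 and 1")
    label = 1 if hoax_probability >= threshold else 0
    # Scores close to the threshold are the least reliable, so route
    # them to a human reviewer instead of acting on them automatically.
    needs_review = abs(hoax_probability - threshold) < review_margin
    return {
        "label": label,
        "label_name": LABEL_NAMES[label],
        "hoax_probability": hoax_probability,
        "needs_human_review": needs_review,
    }


confident = interpret(0.93)
borderline = interpret(0.55)
print(confident)
print(borderline)
```

A review band like this is one way to follow the recommendation that the model should not be the sole basis for important decisions.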
| ## Model Card Authors | |
| Gareth Aurelius Harrison | |
| ## Model Card Contact | |
| For questions or issues, please open an issue on the Hugging Face repository. |