Model Card for IndoHoaxDetector

Model Details

Model Description

IndoHoaxDetector is a binary classification model designed to detect hoax-style news articles in the Indonesian language. It uses logistic regression trained on linguistic features of Indonesian news to classify text as written in either a legitimate or a hoax-like style. The model analyzes writing style and patterns; it does not assess the factual accuracy or truthfulness of the content.

  • Developed by: Gareth Aurelius Harrison
  • Model type: Logistic Regression (scikit-learn)
  • Language(s): Indonesian
  • License: MIT
  • Finetuned from model: N/A (trained from scratch)

Model Sources

Uses

Direct Use

This model can be used to analyze Indonesian news articles and determine if they are written in a hoax-like style. It identifies linguistic patterns typical of fake news but does not verify factual accuracy. It is intended for educational, research, and journalistic purposes to help identify potentially sensational or misleading writing styles.

Downstream Use

  • News verification tools
  • Fact-checking applications
  • Educational resources on misinformation
  • Research on the Indonesian media landscape

Out-of-Scope Use

  • Automated content moderation without human oversight
  • Legal or judicial decisions
  • Real-time censorship
  • Detection in other languages

Bias, Risks, and Limitations

Recommendations

Users should be aware that this model:

  • Is trained on specific datasets and may not generalize to all Indonesian news
  • Can produce false positives and false negatives
  • Should not be used as the sole basis for important decisions
  • Requires human verification for critical applications
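One way to act on these recommendations is to treat the model's probability score as a routing signal rather than a verdict. The sketch below is a minimal illustration under stated assumptions: the function name `triage`, the band names, and the thresholds 0.3 and 0.7 are hypothetical and do not come from the model or its author.

```python
def triage(hoax_probability, low=0.3, high=0.7):
    """Map a hoax-style probability to a review band.

    The thresholds are hypothetical placeholders; they should be tuned
    on labeled data for the application at hand.
    """
    if not 0.0 <= hoax_probability <= 1.0:
        raise ValueError("hoax_probability must be between 0 and 1")
    if hoax_probability >= high:
        return "high_priority_review"   # strong hoax-style signal, a human checks first
    if hoax_probability >= low:
        return "needs_human_review"     # uncertain band, a human decides
    return "low_priority"               # weak signal, still not a statement about facts


# Example scores (invented for illustration)
bands = [triage(p) for p in (0.1, 0.5, 0.9)]
print(bands)
```

None of the bands is a final decision: even the lowest band only says the writing style looks ordinary, which is consistent with the limitation that the model does not verify facts.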

Known Limitations

  • Stylistic vs Factual Analysis: This model detects writing style typical of hoaxes, not factual inaccuracies. Legitimate news written in a sensational style may be flagged as a hoax, and false stories written in a professional style may be missed.
  • Data Bias: The model is trained on a limited dataset; performance may vary with different topics or writing styles
  • Language Specificity: Only works for Indonesian text
  • Temporal Limitations: News patterns change over time; the model may become less accurate with newer data
  • Binary Classification: Does not provide nuanced assessments of credibility

Ethical Considerations

  • Misinformation Detection: While helpful for identifying hoax-style writing, this technology could be misused to suppress legitimate dissenting views
  • Privacy: Text analysis may involve sensitive content
  • Accessibility: Should be used to empower users, not to restrict information access
  • Transparency: Model decisions should be explainable and verifiable

Training Details

Training Data

  • Dataset: Indonesian news articles dataset (details not publicly available)
  • Preprocessing: Text cleaning, tokenization, feature extraction (likely TF-IDF)
  • Size: Not specified
  • Distribution: Balanced between hoax and legitimate classes (assumed)

Training Procedure

  • Training Date: October 29, 2024
  • Hardware: Not specified
  • Software: scikit-learn
  • Hyperparameters: Default logistic regression parameters
  • Carbon Footprint: Not calculated
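The training setup described above can be sketched as a scikit-learn pipeline. This is an illustration, not the author's actual training script: TF-IDF is an assumption (the card only says "likely TF-IDF"), the four example sentences are invented toy data rather than the real dataset, and the text cleaning step is omitted because its details are not documented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy examples for illustration only; NOT the model's training data.
texts = [
    "VIRAL!!! Sebarkan sebelum dihapus, obat ajaib ini menyembuhkan semua penyakit",
    "AWAS! Pemerintah menyembunyikan fakta mengejutkan ini, bagikan sekarang juga",
    "Bank Indonesia mempertahankan suku bunga acuan pada rapat bulanan",
    "Kementerian Kesehatan merilis data vaksinasi terbaru hari ini",
]
labels = [1, 1, 0, 0]  # 1 = hoax-style, 0 = legitimate (the card's label mapping)

# TF-IDF features are an assumption; LogisticRegression() keeps the
# scikit-learn defaults, matching the "default parameters" noted in the card.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

new_texts = [
    "Bagikan sekarang sebelum dihapus!!!",
    "Pemerintah merilis data ekonomi terbaru",
]
# One row per input text; columns follow the order of pipeline.classes_.
proba = pipeline.predict_proba(new_texts)
print(pipeline.classes_, proba)
```

Bundling the vectorizer and classifier in one pipeline means the saved object can accept raw text directly, which fits the raw-text input format described in this card. Whether the released pickle is packaged this way is not stated.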

Evaluation

Testing Data

  • Dataset: Held-out test split drawn from the same dataset as the training data
  • Size: Not specified
  • Distribution: Balanced (assumed)

Metrics

  • Accuracy: 97.83%
  • Other metrics: Not provided (precision, recall, F1-score unknown)

Results

The model achieves 97.83% accuracy on the held-out test set. Per-class metrics are not available, so it is unclear how the errors are split between false positives and false negatives.
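Since only accuracy is reported, anyone re-evaluating the model may want to compute precision, recall, and F1 for the hoax class as well. The sketch below shows the arithmetic from confusion-matrix counts. The counts are hypothetical and chosen only to show that accuracy can look high while recall is noticeably lower; they are not results for this model.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 for the positive (hoax) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, NOT measured on IndoHoaxDetector:
# 90 hoaxes caught, 10 hoaxes missed, 5 legitimate articles flagged, 895 passed.
metrics = binary_metrics(tp=90, fp=5, fn=10, tn=895)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these invented counts, accuracy is 0.985 while recall for the hoax class is 0.900, which is why per-class metrics matter when the classes are not evenly balanced.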

Technical Specifications

  • Input Format: Raw Indonesian text
  • Output Format: Binary label (0: legitimate, 1: hoax-style) with probability scores
  • Model Size: Small (pickle file, approximately a few MB)
  • Inference Time: Fast (< 1 second per prediction)
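The input and output contract listed above could be wrapped as follows. This is a sketch under assumptions: it presumes the loaded object exposes a scikit-learn style `predict_proba` that accepts raw text and returns columns ordered as [P(0), P(1)]. `describe_prediction` and `StubModel` are hypothetical names, and the stub stands in for the real pickle file, whose name and packaging are not documented here.

```python
LABEL_NAMES = {0: "legitimate", 1: "hoax"}


def describe_prediction(model, text):
    """Return the label, its name, and the hoax-style probability for one text."""
    # predict_proba takes a list of texts and returns one row per text.
    # Assumed column order: [P(label 0), P(label 1)].
    p_legit, p_hoax = model.predict_proba([text])[0]
    label = 1 if p_hoax >= 0.5 else 0
    return {
        "label": label,
        "label_name": LABEL_NAMES[label],
        "hoax_probability": float(p_hoax),
    }


class StubModel:
    """Stand-in so the sketch runs without the real model file."""

    def predict_proba(self, texts):
        # Fixed, invented scores; a real model would compute these from the text.
        return [[0.2, 0.8] for _ in texts]


result = describe_prediction(StubModel(), "Contoh teks berita")
print(result)
```

With the real model, the stub would be replaced by the object loaded from the pickle file. Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.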

Model Card Authors

Gareth Aurelius Harrison

Model Card Contact

For questions or issues, please open an issue on the Hugging Face repository.