---
title: IndoHoaxDetector
emoji: 🌖
colorFrom: pink
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
short_description: Indonesian News Detector
tags: ["nlp", "text-classification", "indonesian", "hoax-detection", "machine-learning", "fact-checking", "transformers", "tfidf"]
---

# IndoHoaxDetector

A comprehensive machine learning system for detecting hoax-style news articles written in Indonesian. This project provides multiple models trained on linguistic features of Indonesian news to identify articles written in a style typical of hoaxes or fake news. The system analyzes writing patterns, sensationalism, and other stylistic indicators rather than checking factual accuracy.

## Features

- **Multiple Models**: Logistic Regression, Linear SVM, Random Forest, Naive Bayes, and IndoBERT
- **High Accuracy**: Best model (IndoBERT) achieves 99.89% accuracy on test data
- **Web Interface**: Interactive Gradio app for easy prediction
- **Lightweight Options**: TF-IDF based models for fast inference
- **Transformer Support**: IndoBERT for state-of-the-art performance

## Model Comparison

All models were trained and evaluated on a corpus of 62,972 Indonesian news articles, with 12,595 held-out test samples:

| Model                   | Accuracy   | Precision (Macro) | Recall (Macro) | F1 (Macro) | Type         |
|-------------------------|------------|-------------------|----------------|------------|--------------|
| **IndoBERT**            | **0.9989** | **0.9989**        | **0.9989**     | **0.9989** | Transformer  |
| Linear SVM              | 0.9819     | 0.9820            | 0.9817         | 0.9818     | TF-IDF + SVM |
| Logistic Regression     | 0.9782     | 0.9787            | 0.9777         | 0.9781     | TF-IDF + LR  |
| Random Forest           | 0.9765     | 0.9768            | 0.9760         | 0.9764     | TF-IDF + RF  |
| Multinomial Naive Bayes | 0.9398     | 0.9414            | 0.9381         | 0.9393     | TF-IDF + NB  |

## Model Details

### IndoBERT (Recommended)

- **Type**: Fine-tuned IndoBenchmark/indobert-base-p1
- **Training**: 3 epochs, batch_size=16, learning_rate=2e-5, max_length=128
- **Accuracy**: 99.89%
- **Best For**:
Highest accuracy, contextual understanding
- **Requirements**: transformers, torch (GPU recommended)

### TF-IDF Based Models

- **Vectorizer**: TF-IDF (unigrams, max_features=5000, sublinear_tf=True)
- **Preprocessing**: Lowercasing, URL removal, punctuation removal, Indonesian stopword removal, Sastrawi stemming
- **Logistic Regression**: L2 regularization, max_iter=1000 (97.82% accuracy)
- **Linear SVM**: C=1.0, max_iter=10000 (98.19% accuracy)
- **Random Forest**: 100 trees, random_state=42 (97.65% accuracy)
- **Naive Bayes**: Multinomial, alpha=1.0 (93.98% accuracy)

## Usage

### Web Interface

Visit the Hugging Face Space to use the interactive app:

1. Select a model from the dropdown
2. Enter Indonesian news text
3. Click "Detect"
4. View the prediction and confidence

### Local Usage

To run locally:

```bash
pip install gradio scikit-learn transformers torch
python app.py
```

### API Usage

Load and use the models programmatically:

```python
import pickle

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For TF-IDF models (Logistic Regression, SVM, RF, NB)
with open('logreg_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Load the matching vectorizer
with open('tfidf_vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)

text = "Your Indonesian news text here"
X = vectorizer.transform([text])
prediction = model.predict(X)  # 0 for FAKTA, 1 for HOAX

# For IndoBERT
tokenizer = AutoTokenizer.from_pretrained('indobert_model')
model = AutoModelForSequenceClassification.from_pretrained('indobert_model')

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=-1).item()
```

## Training Data

- **Source**: TurnBackHoax fact-checks, Kompas/CekFakta.com articles
- **Size**: 62,972 labeled articles (balanced HOAX/FAKTA)
- **Preprocessing**: Indonesian text normalization, stemming, stopword removal
- **Split**: 80% train, 20% test (stratified)
## Limitations

- **Stylistic Analysis Only**: Detects hoax-like writing style, not factual accuracy
- **Language Specific**: Trained on Indonesian news patterns
- **Domain Shift**: May perform differently on social media text than on formal news
- **False Positives/Negatives**: Sensational but legitimate news may be flagged as hoax
- **Transformer Requirements**: IndoBERT needs a GPU for practical training/inference

## Ethical Considerations

- **Responsible Use**: Augment human fact-checkers, do not replace them
- **Bias Awareness**: Training data may reflect source biases
- **Transparency**: All models and code are open source
- **Misuse Prevention**: Clearly labeled as a stylistic detector, not a truth verifier

## Contributing

Contributions are welcome! Open issues or PRs for improvements.

## License

MIT License; see the LICENSE file for details.

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference