🔍 VeriLang: Vernacular Misinformation Detector


AI-powered fake news detector for WhatsApp forwards in Hindi, Gujarati, Marathi and Telugu using Google's MuRIL model.


🎯 Problem Statement

Misinformation spread through WhatsApp in Indian regional languages affects millions of people daily, yet most existing fake news detectors target English. VeriLang addresses this gap by detecting misinformation in four major Indian languages: Hindi, Gujarati, Marathi and Telugu.


✨ What Makes This Unique

| Feature | Other Projects | VeriLang |
| --- | --- | --- |
| Languages | English only | 4 Indian languages |
| Model | Generic BERT | MuRIL, an Indian-language specialist |
| Target | General news | WhatsApp forwards specifically |
| Explainability | None | SHAP word importance |
| Deployment | Jupyter only | Live web app |

📊 Dataset

  • Source: Four language subsets extracted from the Zenodo dataset "Multilingual Fake News Detection" (https://zenodo.org/records/11408513)
  • Size: 49,426 articles
  • Labels: Fake (0) / Real (1)
  • Languages: Hindi, Gujarati, Marathi, Telugu

| Language | Total | Fake | Real |
| --- | --- | --- | --- |
| Hindi | 15,051 | 7,599 | 7,452 |
| Gujarati | 14,830 | 6,145 | 8,685 |
| Telugu | 11,424 | 4,795 | 6,629 |
| Marathi | 8,121 | 707 | 7,414 |

🔬 Methodology

Phase 1: Data Cleaning

  • Removed URLs, HTML tags, extra spaces
  • Kept Unicode characters of the Indian scripts intact
  • No duplicates, no missing values
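
The cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_text` and the exact regular expressions are assumptions.

```python
import re

# Hypothetical patterns; the project's real cleaning rules are not shown in this README.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")

def clean_text(text: str) -> str:
    """Strip URLs, HTML tags and extra whitespace; non-ASCII script characters are left untouched."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()

sample = "<p>यह खबर   पढ़ें https://example.com/news </p>"
print(clean_text(sample))
```

Because none of the patterns match letters or combining marks, Devanagari, Gujarati and Telugu text passes through unchanged.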

Phase 2: Baseline ML Models

  • TF-IDF over character n-grams (2 to 4 characters)
  • Character n-grams worked better than word n-grams here, likely because they capture sub-word patterns shared across inflected forms in Indian scripts
  • Trained Logistic Regression, SVM, Naive Bayes
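
To illustrate why character n-grams help, here is a small standard-library sketch that extracts 2 to 4 character n-grams. The helper `char_ngrams` is hypothetical and only mirrors the idea; the baselines themselves presumably relied on scikit-learn's TF-IDF vectorizer.

```python
from collections import Counter

def char_ngrams(text: str, n_min: int = 2, n_max: int = 4) -> Counter:
    """Count every character n-gram with n_min <= n <= n_max (code points, not grapheme clusters)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

# Two inflected forms of the same Hindi word are different word tokens,
# but they still share several character n-grams.
a = char_ngrams("लड़का")
b = char_ngrams("लड़के")
shared = set(a) & set(b)
print(sorted(shared))
```

In scikit-learn, the equivalent features would come from `TfidfVectorizer(analyzer="char", ngram_range=(2, 4))`; whether the project used exactly these settings is an assumption.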

Phase 3: MuRIL Fine-tuning

  • Used google/muril-base-cased from HuggingFace
  • MuRIL is pre-trained on 17 Indian languages
  • Fine-tuned for 3 epochs on a Google Colab T4 GPU
  • Applied class weights to fix Marathi imbalance
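
The README does not say how the class weights were computed. One common choice is the inverse-frequency "balanced" heuristic, sketched below using the Marathi counts from the dataset table. The function name and the choice of this particular formula are assumptions.

```python
def balanced_class_weights(counts: dict) -> dict:
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

# Marathi counts from the dataset table: 0 = Fake, 1 = Real.
marathi_counts = {0: 707, 1: 7414}
weights = balanced_class_weights(marathi_counts)
print({label: round(w, 3) for label, w in weights.items()})
```

The rare Fake class receives a weight roughly ten times that of the Real class. In PyTorch, such weights could be passed as a tensor to `torch.nn.CrossEntropyLoss(weight=...)` so that mistakes on the minority class cost more during fine-tuning.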

📈 Model Performance

Overall Accuracy

| Model | Accuracy |
| --- | --- |
| Logistic Regression | 99.63% |
| Naive Bayes | 97.77% |
| SVM | 99.84% |
| MuRIL (final) | 99.92% |

Per Language Accuracy

| Language | SVM | MuRIL |
| --- | --- | --- |
| Hindi | 99.66% | 99.80% |
| Gujarati | 99.86% | 99.93% |
| Marathi | 99.94% | 100.00% |
| Telugu | 99.96% | 100.00% |
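
A per-language breakdown like this can be computed by grouping test predictions by language. The sketch below uses toy records for illustration only, not the project's data, and the function name is hypothetical.

```python
from collections import defaultdict

def accuracy_by_language(records):
    """records: iterable of (language, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, y_true, y_pred in records:
        total[lang] += 1
        correct[lang] += int(y_true == y_pred)
    return {lang: correct[lang] / total[lang] for lang in total}

# Toy records, made up for this example.
toy = [("Hindi", 0, 0), ("Hindi", 1, 0), ("Telugu", 1, 1), ("Telugu", 0, 0)]
print(accuracy_by_language(toy))  # {'Hindi': 0.5, 'Telugu': 1.0}
```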

Overfitting Check

| Model | Train | Test | Gap | Status |
| --- | --- | --- | --- | --- |
| Logistic Regression | 99.60% | 99.63% | 0.03% | No overfit |
| SVM | 99.99% | 99.84% | 0.15% | No overfit |
| MuRIL | - | 99.92% | - | CV std 0.05% |

🛠 Tech Stack

| Category | Tools |
| --- | --- |
| Deep Learning | PyTorch, HuggingFace Transformers |
| Model | google/muril-base-cased |
| ML Baseline | Scikit-learn, TF-IDF |
| Explainability | SHAP |
| Web App | Streamlit |
| Model Hosting | HuggingFace Hub |
| Training | Google Colab T4 GPU |

💻 How to Run Locally

git clone https://github.com/Maitry09/verilang-misinformation-detector
cd verilang-misinformation-detector
pip install -r requirements.txt
streamlit run app.py

Create the file .streamlit/secrets.toml containing your HuggingFace token:

HF_TOKEN = "your_token_here"

⚖️ Ethical Considerations

  • This tool is for awareness only; it is not a final judgment system
  • Always verify claims against official sources such as PIB and established news agencies such as ANI and PTI
  • Model may have bias toward certain writing styles
  • Detection of fake Marathi news improved with augmentation, but the Marathi fake class still has far fewer training samples (707) than any other class

🚀 Future Work

  • Add Bengali, Tamil, Kannada support
  • Handle code-mixed text (messages that mix multiple languages)
  • Build WhatsApp bot integration using Twilio API
  • Add image-based misinformation detection
  • Real-time news verification via fact-check APIs
  • Mobile app for direct WhatsApp forward checking

πŸ‘€ Author

Maitry

  • GitHub: @Maitry09
  • Live App: verilang.streamlit.app

πŸ“„ License

MIT License


⭐ Star this repo if you found it useful!
