
# 🛡️ TruthLens: BERT-Based Fake News Detector



A full-stack web application that detects fake news using a large language model (LLM) as the primary classifier, backed by a fine-tuned BERT transformer model, real-time Google News RSS validation, image OCR analysis, API rate limiting, and a fully animated React interface with MongoDB-backed user authentication.

🌐 Live Demo

Link
πŸ–₯️ Frontend (React App) https://truth-lens-bert-based-fake-news-and.vercel.app
βš™οΈ Backend API https://suryakf-truthlens-backend.hf.space
πŸ“– Swagger / API Docs https://suryakf-truthlens-backend.hf.space/docs

The backend runs on Hugging Face Spaces (CPU Basic β€” 2 vCPU, 16 GB RAM). The frontend is deployed on Vercel with global CDN. The database is MongoDB Atlas (M0 free cluster).

✨ Features

Core Detection Pipeline

  • LLM fact-checker (primary) with fine-tuned BERT fallback: a PyTorch BERT model (~95% accuracy) takes over only when the LLM is unavailable.
  • Three-label output: REAL / FAKE / UNVERIFIED. The LLM outputs UNVERIFIED when evidence is inconclusive, which avoids flagging recent real news as fake.
  • Confidence Scoring: Per-prediction probability distribution visualised as a live pie chart.
  • Batch Analysis: Submit up to 10 news texts in one request.
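A client holding more than 10 texts has to split them before calling the batch endpoint. The sketch below shows one way to do that. The helper name `chunk_texts` is hypothetical and not part of the project; only the limit of 10 comes from this README.

```python
from typing import Iterator, List

MAX_BATCH_SIZE = 10  # documented limit for one batch request


def chunk_texts(texts: List[str], size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


headlines = [f"headline {i}" for i in range(23)]
batches = list(chunk_texts(headlines))
print([len(batch) for batch in batches])  # [10, 10, 3]
```

Each yielded list can then be sent as one batch request, keeping in mind the separate per-minute rate limit on that endpoint.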

News Source Validation

  • Google News RSS: Free real-time headline search (no API key required). Retrieves title, source, publish date, and article description.
  • NewsAPI Integration: Extended article lookup with source attribution.
  • SerpAPI Integration: Fallback search-engine news verification.
  • Live context injection: All retrieved articles (headline + summary + URL + publish date) are passed directly into the LLM's prompt so it cross-references the claim against real-world evidence.
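The RSS lookup and context injection described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `news_validator.py`: the function names, the sample feed, and the context layout are all hypothetical, and real feeds may embed HTML in the description field that needs stripping first.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List
from urllib.parse import urlencode


def google_news_rss_url(query: str) -> str:
    """Build a Google News RSS search URL for `query`."""
    return "https://news.google.com/rss/search?" + urlencode({"q": query})


def parse_rss(xml_text: str) -> List[Dict[str, str]]:
    """Extract title, source, publish date, description and URL from RSS items."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": item.findtext("title", default="").strip(),
            "source": item.findtext("source", default="").strip(),
            "published": item.findtext("pubDate", default="").strip(),
            "description": item.findtext("description", default="").strip(),
            "url": item.findtext("link", default="").strip(),
        })
    return articles


def build_context(claim: str, articles: List[Dict[str, str]]) -> str:
    """Format the claim and retrieved articles as plain text for an LLM prompt."""
    lines = [f"Claim: {claim}", "Retrieved articles:"]
    for index, article in enumerate(articles, start=1):
        lines.append(f"{index}. {article['title']} ({article['source']}, {article['published']})")
        lines.append(f"   {article['description']}")
        lines.append(f"   {article['url']}")
    return "\n".join(lines)


# Hypothetical feed content, used here instead of a live network call.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <item>
    <title>Example headline</title>
    <link>https://example.com/a</link>
    <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
    <description>Short summary.</description>
    <source url="https://example.com">Example News</source>
  </item>
</channel></rss>"""

print(google_news_rss_url("new planet"))
print(build_context("Scientists discover new planet", parse_rss(SAMPLE_RSS)))
```

Passing the publish date alongside each headline gives the model a way to judge whether the coverage is recent enough to bear on the claim.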

Image & OCR

  • Screenshot Upload: Paste or upload a screenshot of a news headline/article.
  • Mistral OCR: Extracts title, body text, source, and date from the image.
  • Same pipeline as text: After OCR, the extracted headline goes through the same LLM-primary flow (news search → LLM with context → BERT fallback).

Rate Limiting

API rate limits are enforced via slowapi, per client IP:

| Endpoint | Limit |
|---|---|
| POST /api/predict | 30 / minute |
| POST /api/batch-predict | 5 / minute |
| POST /api/image-predict | 10 / minute |
| POST /api/extract-image-text | 10 / minute |
| POST /api/auth/login | 5 / minute |
| POST /api/auth/register | 3 / minute |
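To make the per-IP limits concrete, here is a small standard-library sliding-window limiter loaded with the same numbers. It is a stand-in for illustration only: the project uses slowapi, whose windowing strategy may differ, and the class name `SlidingWindowLimiter` is hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple

# Requests allowed per client IP per 60-second window (from the table above).
LIMITS: Dict[str, int] = {
    "/api/predict": 30,
    "/api/batch-predict": 5,
    "/api/image-predict": 10,
    "/api/extract-image-text": 10,
    "/api/auth/login": 5,
    "/api/auth/register": 3,
}
WINDOW_SECONDS = 60.0


class SlidingWindowLimiter:
    """Tracks request timestamps per (client IP, endpoint) pair."""

    def __init__(self, limits: Dict[str, int], clock: Callable[[], float] = time.monotonic):
        self._limits = limits
        self._clock = clock
        self._hits: Dict[Tuple[str, str], Deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str, endpoint: str) -> bool:
        limit = self._limits.get(endpoint)
        if limit is None:
            return True  # endpoint has no configured limit
        now = self._clock()
        hits = self._hits[(client_ip, endpoint)]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= WINDOW_SECONDS:
            hits.popleft()
        if len(hits) >= limit:
            return False
        hits.append(now)
        return True


# Demo with a controllable clock so the result does not depend on real time.
fake_now = [0.0]
limiter = SlidingWindowLimiter(LIMITS, clock=lambda: fake_now[0])
print([limiter.allow("203.0.113.7", "/api/auth/register") for _ in range(4)])
# [True, True, True, False]
```

Keying on the (IP, endpoint) pair means a client that exhausts its registration quota can still call the other endpoints.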

Authentication & History

  • JWT Authentication: 24-hour access tokens, bcrypt-hashed passwords.
  • Prediction History: Every analysis stored with timestamp and label in MongoDB.
  • User Dashboard: Live stats, streak counter, accuracy breakdown.
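The 24-hour access token mentioned above can be illustrated with a hand-rolled HS256 JWT built from the standard library. This is only a sketch of the mechanism: the project lists python-jose for token handling, the function names here are hypothetical, and the verifier deliberately skips checks (such as the header algorithm) that a real library performs.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Any, Dict, Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> bytes:
    return hmac.new(secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256).digest()


def create_token(subject: str, secret: str, expires_minutes: int = 1440,
                 now: Optional[float] = None) -> str:
    """Create an HS256-signed token that expires after `expires_minutes`."""
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": subject, "iat": issued, "exp": issued + expires_minutes * 60}
    parts = [_b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
             for part in (header, payload)]
    signing_input = ".".join(parts)
    return signing_input + "." + _b64url(_sign(signing_input, secret))


def verify_token(token: str, secret: str, now: Optional[float] = None) -> Dict[str, Any]:
    """Return the payload if the signature matches and the token has not expired."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    expected = _sign(f"{header_b64}.{payload_b64}", secret)
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if current >= payload["exp"]:
        raise ValueError("token expired")
    return payload


token = create_token("user@example.com", "change-me", now=1_000)
print(verify_token(token, "change-me", now=1_000 + 3_600)["sub"])  # user@example.com
```

The default of 1440 minutes matches `ACCESS_TOKEN_EXPIRE_MINUTES` in the environment section. Password hashing with bcrypt is a separate concern and is not shown here.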

Developer Experience

  • Rotating Log Files: All API activity written to logs/app.log (10 MB cap, 5 backups).
  • Swagger / ReDoc: Auto-generated interactive API docs at /docs and /redoc.
  • Environment-Driven Config: All secrets via .env.
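The rotating log setup can be approximated with Python's `logging.handlers.RotatingFileHandler`. The factory below is a sketch, assuming a function named `get_logger` and a particular log format; the actual `logger.py` may be organised differently. Only the file name, the 10 MB cap, and the 5 backups come from this README.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def get_logger(name: str, log_dir: str = "logs") -> logging.Logger:
    """Return a logger that writes to <log_dir>/app.log with size-based rotation."""
    logger = logging.getLogger(name)
    if logger.handlers:
        return logger  # already configured; avoid attaching duplicate handlers
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(
        Path(log_dir) / "app.log",
        maxBytes=10 * 1024 * 1024,  # 10 MB cap per file
        backupCount=5,              # keep app.log.1 through app.log.5
        encoding="utf-8",
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s | %(levelname)s | %(name)s | %(message)s")
    )
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

Calling `get_logger("truthlens.api").info("POST /api/predict 200")` would create the logs directory on first use and append one formatted line to app.log.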

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    FRONTEND (React + Vite)                      │
│   Home  │  Login  │  Register  │  Dashboard                     │
│   GSAP ScrollTrigger · Framer Motion · TailwindCSS · Recharts   │
└────────────────────────────┬────────────────────────────────────┘
                             │ HTTPS / JWT
┌────────────────────────────▼────────────────────────────────────┐
│                      BACKEND (FastAPI)                          │
│  Rate Limiting (slowapi) → Logging Middleware → logs/app.log    │
│  /api/predict   /api/batch-predict   /api/image-predict         │
└──────┬──────────────────────────────────────┬───────────────────┘
       │                                      │
       ▼  STEP 1                              ▼  STEP 1 (image)
┌─────────────────┐                  ┌─────────────────────┐
│ News Validator  │                  │   Mistral OCR       │
│ Google News RSS │                  │ Extracts title +    │
│ NewsAPI         │                  │ text from image     │
│ SerpAPI         │                  └──────────┬──────────┘
└────────┬────────┘                             │
         │ articles (title+desc+date+url)       │ extracted headline
         ▼  STEP 2 (PRIMARY)                    ▼  STEP 2 (PRIMARY)
┌─────────────────────────────────────────────────────────────────┐
│                    LLM Fact-Checker                             │
│        Primary model → Fallback 1 → Fallback 2                  │
│  Output: REAL / FAKE / UNVERIFIED + confidence + reasoning      │
└───────────────────────────────┬─────────────────────────────────┘
                                │ (only if ALL Gemini models fail)
                                ▼  STEP 3 (FALLBACK)
                   ┌────────────────────────┐
                   │  Fine-tuned BERT       │
                   │  PyTorch + HF ~95% acc │
                   └────────────────────────┘
                                │
┌───────────────────────────────▼─────────────────────────────────┐
│                  MongoDB Atlas (Motor async)                    │
│          users collection · predictions collection              │
└─────────────────────────────────────────────────────────────────┘
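The three steps in the diagram amount to a short orchestration loop: fetch articles, try each LLM tier in order, and use BERT only if every tier fails. The sketch below expresses that control flow with injected callables. All names (`Verdict`, `classify`, the stub functions, and the `source` strings other than `llm_primary`) are hypothetical, and catching every exception is a simplification of whatever quota-error handling the backend actually does.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Article = Dict[str, str]


@dataclass
class Verdict:
    label: str         # "REAL", "FAKE" or "UNVERIFIED"
    confidence: float
    source: str        # which stage produced the verdict


def classify(
    claim: str,
    search_news: Callable[[str], List[Article]],
    llm_tiers: Sequence[Callable[[str, List[Article]], Verdict]],
    bert_predict: Callable[[str], Verdict],
) -> Verdict:
    # Step 1: gather live articles to use as evidence.
    articles = search_news(claim)
    # Step 2: try the primary LLM, then each fallback tier in order.
    for call_llm in llm_tiers:
        try:
            return call_llm(claim, articles)
        except Exception:
            continue  # e.g. quota exhausted; move on to the next tier
    # Step 3: every LLM tier failed, so fall back to the local BERT model.
    return bert_predict(claim)


# Stubs standing in for the real news search, LLM calls and BERT wrapper.
def no_news(claim: str) -> List[Article]:
    return []


def exhausted_llm(claim: str, articles: List[Article]) -> Verdict:
    raise RuntimeError("quota exceeded")


def backup_llm(claim: str, articles: List[Article]) -> Verdict:
    return Verdict("UNVERIFIED", 0.62, "llm_tier_2")


def bert_stub(claim: str) -> Verdict:
    return Verdict("REAL", 0.91, "bert_fallback")


print(classify("example claim", no_news, [exhausted_llm, backup_llm], bert_stub).source)
print(classify("example claim", no_news, [exhausted_llm], bert_stub).source)
```

Injecting the stages as callables keeps the ordering logic testable without network access or model weights.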

Hybrid Model Architecture (Mermaid)

flowchart TD
    A[Input Text] --> B[Tokenizer<br/>bert-base-uncased]
    B --> C[input_ids, attention_mask]

    C --> D[BERT Encoder<br/>Hidden Size: 768]
    D --> E[Dropout]
    E --> F[BiLSTM<br/>2 layers, hidden=256, bidirectional]
    F --> G[LayerNorm<br/>Output dim: 512]
    G --> H[Multi-Head Self-Attention<br/>8 heads]

    H --> I[Global Max Pooling<br/>across sequence]
    I --> J[MLP Classifier]

    J --> J1[Linear 512->256 + ReLU + Dropout]
    J1 --> J2[Linear 256->128 + ReLU + Dropout]
    J2 --> J3[Linear 128->2]

    J3 --> K[Logits: Real vs Fake]
    K --> L[Softmax / Argmax Prediction]

    subgraph Training
        M[CrossEntropyLoss<br/>class weights + label smoothing]
        N[AdamW + LR Scheduler<br/>Warmup + Weight Decay]
        O[Early Stopping<br/>monitor val F1]
    end

    J3 --> M
    M --> N
    N --> O
📁 Project Structure

FinalYearProject/
├── app/
│   ├── main.py              # FastAPI app, CORS, rate limiter, logging middleware
│   ├── auth.py              # JWT token logic, bcrypt helpers
│   ├── database.py          # Motor async MongoDB client
│   ├── limiter.py           # Shared slowapi Limiter instance
│   ├── api/
│   │   ├── routes.py        # Prediction endpoints (/api/predict, /api/batch-predict, /api/image-predict)
│   │   └── auth_routes.py   # Auth endpoints (/api/auth/*)
│   ├── models/
│   │   └── bert_model.py    # BERT inference wrapper (fallback only)
│   ├── schemas/
│   │   ├── prediction.py    # Pydantic request/response models
│   │   └── auth.py          # User & token schemas
│   └── utils/
│       ├── ai_verification.py # LLM fact-checker (primary classifier)
│       ├── news_validator.py  # Multi-source news validation + RSS parser
│       ├── image_ocr.py       # Mistral OCR: image upload + text extraction
│       └── logger.py          # RotatingFileHandler logger factory
├── enhanced_bert_liar_model/    # BERT fine-tuned on LIAR dataset (fallback)
├── enhanced_bert_welfake_model/ # BERT fine-tuned on WELFake dataset (fallback)
├── frontend/
│   └── src/
│       ├── App.jsx
│       ├── api/index.js
│       ├── context/AuthContext.jsx
│       ├── motion/           # GSAP + Framer Motion helpers
│       └── pages/            # Home, Dashboard, Login, Register
├── logs/                     # Auto-created: rotating app.log
├── Data/WELFake_Dataset.csv
├── Notebook/
│   ├── bert_finetune_notebook.ipynb
│   └── wel-fakebert-finetune-notebook.ipynb
├── run_api.py
├── pyproject.toml
└── README.md

🚀 Production Deployment

Browser
  └──▶  Vercel (React/Vite frontend)
              └── VITE_API_URL ──▶  Hugging Face Spaces (FastAPI + BERT + LLM)
                                          └── MONGODB_URL ──▶  MongoDB Atlas

| Layer | Platform | Plan |
|---|---|---|
| Frontend | Vercel | Free |
| Backend | Hugging Face Spaces | CPU Basic (Free) |
| Database | MongoDB Atlas | M0 Free |

Deploy your own copy

Backend (HF Spaces)

  1. Fork this repo and create a new Space (SDK: Docker).
  2. Copy app/, enhanced_bert_*/, run_api.py, and Dockerfile.huggingface (renamed to Dockerfile) into the Space.
  3. Add secrets in Space Settings:

| Secret | Description |
|---|---|
| MONGODB_URL | MongoDB Atlas connection string |
| SECRET_KEY | JWT signing secret |
| AI_API_KEY | LLM API key for the primary fact-checker |
| MISTRAL_API_KEY | Mistral API key (for image OCR) |
| ALLOWED_ORIGINS | Comma-separated frontend URLs |

Frontend (Vercel)

  1. Import your GitHub repo and set Root Directory to frontend.
  2. Add env var: VITE_API_URL=https://YOUR_HF_USER-your-space.hf.space/api

💻 Local Development

Prerequisites

  • Python 3.11+, Node.js 18+
  • UV package manager
  • MongoDB Atlas account
  • LLM API key (for the primary fact-checker)
  • Mistral API key (free at mistral.ai), for image OCR

1. Install Backend

git clone <your-repo-url>
cd FinalYearProject
pip install uv
uv sync

2. Configure Environment

Create .env in the project root:

# MongoDB Atlas
MONGODB_URL=mongodb+srv://username:password@cluster.mongodb.net/?retryWrites=true&w=majority
DATABASE_NAME=fake_news_detector

# JWT
SECRET_KEY=your-super-secret-jwt-key-change-in-production
ACCESS_TOKEN_EXPIRE_MINUTES=1440

# LLM API key (primary fact-checker)
AI_API_KEY=your_api_key_here

# Mistral AI (image OCR)
MISTRAL_API_KEY=your_mistral_api_key_here

# News Validation (optional; Google News RSS is free)
NEWSAPI_KEY=your_newsapi_key
SERPAPI_KEY=your_serpapi_key

# CORS
ALLOWED_ORIGINS=http://localhost:5173,http://localhost:3000

3. Start the Backend

python run_api.py

4. Start the Frontend

cd frontend
npm install
npm run dev

Frontend: http://localhost:5173


API Reference

Authentication

| Method | Endpoint | Rate Limit | Description |
|---|---|---|---|
| POST | /api/auth/register | 3/min | Create a new user account |
| POST | /api/auth/login | 5/min | Login and receive a JWT token |
| GET | /api/auth/me | - | Get current authenticated user |
| GET | /api/auth/history | - | Retrieve prediction history |
| GET | /api/auth/stats | - | Get total/real/fake counts |
| POST | /api/auth/logout | - | Logout |

Predictions (JWT required)

| Method | Endpoint | Rate Limit | Description |
|---|---|---|---|
| POST | /api/predict | 30/min | Analyse a single news headline |
| POST | /api/batch-predict | 5/min | Analyse up to 10 texts in one call |
| POST | /api/image-predict | 10/min | OCR + analyse a news screenshot |
| POST | /api/extract-image-text | 10/min | OCR only (no prediction) |

Example: Single Prediction

Request:

curl -X POST http://localhost:8000/api/predict \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"title": "Scientists discover new planet in solar system"}'

Response:

{
  "text": "Scientists discover new planet in solar system",
  "prediction": "unverified",
  "confidence": 0.62,
  "probabilities": { "real": 0.62, "fake": 0.38 },
  "is_fake": false,
  "prediction_source": "llm_primary",
  "context_articles_used": 2,
  "news_insight": "ℹ️ Limited related news coverage found."
}
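The same request can be issued from Python with the standard library. The sketch below mirrors the curl call above; the helper names `build_predict_request` and `predict` are hypothetical, and the base URL assumes the local server on port 8000 used in that example.

```python
import json
import urllib.request
from typing import Any, Dict


def build_predict_request(base_url: str, token: str, title: str) -> urllib.request.Request:
    """Prepare the POST /api/predict request without sending it."""
    body = json.dumps({"title": title}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/predict",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def predict(base_url: str, token: str, title: str, timeout: float = 30.0) -> Dict[str, Any]:
    """Send the request and decode the JSON response body."""
    request = build_predict_request(base_url, token, title)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_predict_request(
    "http://localhost:8000", "YOUR_JWT_TOKEN",
    "Scientists discover new planet in solar system",
)
print(request.get_method(), request.full_url)  # POST http://localhost:8000/api/predict
```

With the backend running and a valid token, `predict(...)` would return a dictionary shaped like the response above; an HTTP 429 from the rate limiter surfaces as `urllib.error.HTTPError`.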

🔧 Technology Stack

Backend

| Library | Purpose |
|---|---|
| FastAPI | Async REST API framework |
| Uvicorn | ASGI server |
| google-genai | LLM SDK: primary fact-checker |
| mistralai | Mistral OCR: image text extraction |
| slowapi | Per-IP API rate limiting |
| PyTorch | BERT model inference (fallback) |
| Transformers (HuggingFace) | Tokeniser + BERT model architecture |
| Motor | Async MongoDB driver |
| python-jose | JWT token generation & validation |
| passlib[bcrypt] | Password hashing |
| requests + beautifulsoup4 | News RSS scraping |
| newsapi-python | NewsAPI client |
| serpapi | SerpAPI client |

Frontend

| Library | Purpose |
|---|---|
| React 18 | UI component library |
| Vite | Build tool & dev server |
| TailwindCSS 3 | Utility-first styling |
| GSAP + ScrollTrigger | Scroll-driven animations |
| Framer Motion | Page transition system |
| Recharts | Pie chart visualisation |
| Axios | HTTP client with interceptors |

🤖 Classification Details

LLM Fact-Checker (Primary)

| Property | Value |
|---|---|
| Input | User claim + live news articles (headline, summary, date, URL) |
| Output labels | REAL / FAKE / UNVERIFIED |
| Fallback chain | Multiple model tiers tried automatically on quota errors |
| Context | Receives live Google News articles before deciding |

UNVERIFIED is returned when the LLM cannot confirm or deny the claim from available evidence (e.g. very recent events not yet widely reported). It maps to is_fake: false with capped confidence (≤ 68%).

FAKE is only returned when retrieved articles directly contradict the specific factual assertion, not merely because the claim is surprising or uses dramatic language.
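The label rules above can be captured in a few lines. This sketch assumes a helper named `to_response_fields` (hypothetical) and lower-case prediction strings as in the example response; how the backend actually applies the cap is not shown in this README.

```python
from typing import Dict

UNVERIFIED_CONFIDENCE_CAP = 0.68  # documented ceiling for UNVERIFIED verdicts
VALID_LABELS = {"REAL", "FAKE", "UNVERIFIED"}


def to_response_fields(label: str, confidence: float) -> Dict[str, object]:
    """Map an LLM verdict to the prediction, confidence and is_fake fields."""
    normalised = label.strip().upper()
    if normalised not in VALID_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if normalised == "UNVERIFIED":
        # Inconclusive evidence never reports more than the capped confidence.
        confidence = min(confidence, UNVERIFIED_CONFIDENCE_CAP)
    return {
        "prediction": normalised.lower(),
        "confidence": confidence,
        "is_fake": normalised == "FAKE",  # UNVERIFIED and REAL both map to False
    }


print(to_response_fields("UNVERIFIED", 0.90))
# {'prediction': 'unverified', 'confidence': 0.68, 'is_fake': False}
```

Treating UNVERIFIED as not fake keeps recent, thinly covered stories from being shown to users as misinformation.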

BERT (Fallback)

| Property | Value |
|---|---|
| Architecture | BERT (bert-base-uncased) |
| Training | LIAR dataset (binarised) |
| Max token length | 512 |
| Accuracy | ~95% |
| When used | Only when all Gemini models fail |

🔒 Security

  • JWT tokens with configurable expiry (default 24 hours)
  • Bcrypt password hashing
  • Per-IP rate limiting on the login, registration, and prediction endpoints
  • CORS middleware (configurable via ALLOWED_ORIGINS)
  • Pydantic input validation on all endpoints
  • Environment-variable-driven secrets

🔧 Environment Variables Reference

| Variable | Required | Description |
|---|---|---|
| MONGODB_URL | ✅ | MongoDB Atlas connection string |
| DATABASE_NAME | ✅ | Target database name |
| SECRET_KEY | ✅ | Secret used to sign JWT tokens |
| AI_API_KEY | ✅ | LLM API key (primary fact-checker) |
| MISTRAL_API_KEY | ✅ | Mistral API key (image OCR) |
| ACCESS_TOKEN_EXPIRE_MINUTES | ❌ | Token TTL (default: 1440) |
| NEWSAPI_KEY | ❌ | NewsAPI key |
| SERPAPI_KEY | ❌ | SerpAPI key |
| ALLOWED_ORIGINS | ❌ | Comma-separated CORS origins |
| ENABLE_AI_CHECK | ❌ | Set false to force BERT-only mode |
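A loader for these variables might look like the following sketch. The `Settings` dataclass and `load_settings` function are hypothetical names, and two defaults are assumptions not stated in this README: an empty origin list when ALLOWED_ORIGINS is unset, and the LLM check being enabled unless ENABLE_AI_CHECK is explicitly `false`.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional

REQUIRED_VARS = ("MONGODB_URL", "DATABASE_NAME", "SECRET_KEY", "AI_API_KEY", "MISTRAL_API_KEY")


@dataclass(frozen=True)
class Settings:
    mongodb_url: str
    database_name: str
    secret_key: str
    ai_api_key: str
    mistral_api_key: str
    access_token_expire_minutes: int
    newsapi_key: Optional[str]
    serpapi_key: Optional[str]
    allowed_origins: List[str]
    enable_ai_check: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from `env` (defaults to os.environ), failing fast on gaps."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    origins = [o.strip() for o in env.get("ALLOWED_ORIGINS", "").split(",") if o.strip()]
    return Settings(
        mongodb_url=env["MONGODB_URL"],
        database_name=env["DATABASE_NAME"],
        secret_key=env["SECRET_KEY"],
        ai_api_key=env["AI_API_KEY"],
        mistral_api_key=env["MISTRAL_API_KEY"],
        access_token_expire_minutes=int(env.get("ACCESS_TOKEN_EXPIRE_MINUTES", "1440")),
        newsapi_key=env.get("NEWSAPI_KEY") or None,
        serpapi_key=env.get("SERPAPI_KEY") or None,
        allowed_origins=origins,
        # Assumption: anything other than the literal "false" leaves the LLM check on.
        enable_ai_check=env.get("ENABLE_AI_CHECK", "true").strip().lower() != "false",
    )
```

Failing at startup when a required variable is absent gives a clearer error than a connection or authentication failure on the first request.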

📂 Datasets

LIAR Dataset

| Property | Detail |
|---|---|
| Source | W. Wang, 2017 (UCSB) |
| Size | ~12,800 labelled statements |
| Labels | 6-class, binarised to fake / real |
| Domain | Political statements (PolitiFact) |
| License | Public domain |

WELFake Dataset

| Property | Detail |
|---|---|
| Source | Verma et al., 2021 |
| Size | 72,134 articles (35,028 real · 37,106 fake) |
| Domain | Mixed: Kaggle, Reuters, BuzzFeed |
| License | CC BY 4.0 |

🧪 Training Notebooks

| Notebook | Description |
|---|---|
| Notebook/bert_finetune_notebook.ipynb | BERT fine-tuning on LIAR dataset |
| Notebook/wel-fakebert-finetune-notebook.ipynb | BERT fine-tuning on WELFake dataset |

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/my-feature
  3. Commit: git commit -m "feat: add my feature"
  4. Push: git push origin feature/my-feature
  5. Open a Pull Request

📄 License

MIT License


🙏 Acknowledgements


🛡️ Built to fight misinformation: TruthLens