Spaces:

Ravi1212
/

Fake

Configuration error

App Files Files Community

Fake / README.md

Ravi1212

Upload 12 files

2bdf377 verified about 1 month ago

preview code

raw

history blame contribute delete

19.7 kB

# 🛡️ TruthLens — BERT-Based Fake News Detector

A full-stack web application that detects fake news using a large language model (LLM) as the primary classifier, backed by a fine-tuned BERT transformer model, real-time Google News RSS validation, image OCR analysis, API rate limiting, and a fully animated React interface with MongoDB-backed user authentication.

🌐 Live Demo

	Link
🖥️ Frontend (React App)	https://truth-lens-bert-based-fake-news-and.vercel.app
⚙️ Backend API	https://suryakf-truthlens-backend.hf.space
📖 Swagger / API Docs	https://suryakf-truthlens-backend.hf.space/docs

The backend runs on Hugging Face Spaces (CPU Basic — 2 vCPU, 16 GB RAM). The frontend is deployed on Vercel with global CDN. The database is MongoDB Atlas (M0 free cluster).

✨ Features

Core Detection Pipeline

Fine-tuned BERT (Primary) — PyTorch BERT model (~95% accuracy)
Three-label output — REAL / FAKE / UNVERIFIED. The LLM outputs UNVERIFIED when evidence is inconclusive, avoiding over-flagging real recent news as fake.
Confidence Scoring — Per-prediction probability distribution visualised as a live pie chart.
Batch Analysis — Submit up to 10 news texts in one request.

News Source Validation

Google News RSS — Free real-time headline search (no API key required). Retrieves title, source, publish date, and article description.
NewsAPI Integration — Extended article lookup with source attribution.
SerpAPI Integration — Fallback search-engine news verification.
Live context injection — All retrieved articles (headline + summary + URL + publish date) are passed directly into the LLM's prompt so it cross-references the claim against real-world evidence.

Image & OCR

Screenshot Upload — Paste or upload a screenshot of a news headline/article.
Mistral OCR — Extracts title, body text, source, and date from the image.
Same pipeline as text — After OCR, the extracted headline goes through the same LLM-primary flow (news search → LLM with context → BERT fallback).

Rate Limiting

API rate limits enforced via slowapi (per client IP):

Endpoint	Limit
`POST /api/predict`	30 / minute
`POST /api/batch-predict`	5 / minute
`POST /api/image-predict`	10 / minute
`POST /api/extract-image-text`	10 / minute
`POST /api/auth/login`	5 / minute
`POST /api/auth/register`	3 / minute

Authentication & History

JWT Authentication — 24-hour access tokens, bcrypt-hashed passwords.
Prediction History — Every analysis stored with timestamp and label in MongoDB.
User Dashboard — Live stats, streak counter, accuracy breakdown.

Developer Experience

Rotating Log Files — All API activity written to logs/app.log (10 MB cap, 5 backups).
Swagger / ReDoc — Auto-generated interactive API docs at /docs and /redoc.
Environment-Driven Config — All secrets via .env.

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    FRONTEND (React + Vite)                      │
│   Home  │  Login  │  Register  │  Dashboard                     │
│   GSAP ScrollTrigger · Framer Motion · TailwindCSS · Recharts   │
└────────────────────────────┬────────────────────────────────────┘
                             │ HTTPS / JWT
┌────────────────────────────▼────────────────────────────────────┐
│                      BACKEND (FastAPI)                          │
│  Rate Limiting (slowapi) → Logging Middleware → logs/app.log    │
│  /api/predict   /api/batch-predict   /api/image-predict         │
└──────┬──────────────────────────────────────┬───────────────────┘
       │                                      │
       ▼  STEP 1                              ▼  STEP 1 (image)
┌─────────────────┐                  ┌─────────────────────┐
│ News Validator  │                  │   Mistral OCR       │
│ Google News RSS │                  │ Extracts title +    │
│ NewsAPI         │                  │ text from image     │
│ SerpAPI         │                  └──────────┬──────────┘
└────────┬────────┘                             │
         │ articles (title+desc+date+url)       │ extracted headline
         ▼  STEP 2 (PRIMARY)                    ▼  STEP 2 (PRIMARY)
┌─────────────────────────────────────────────────────────────────┐
│                    LLM Fact-Checker                             │
│        Primary model → Fallback 1 → Fallback 2                  │
│  Output: REAL / FAKE / UNVERIFIED + confidence + reasoning      │
└───────────────────────────────┬─────────────────────────────────┘
                                │ (only if ALL Gemini models fail)
                                ▼  STEP 3 (FALLBACK)
                   ┌────────────────────────┐
                   │  Fine-tuned BERT       │
                   │  PyTorch + HF ~95% acc │
                   └────────────────────────┘
                                │
┌───────────────────────────────▼─────────────────────────────────┐
│                  MongoDB Atlas (Motor async)                    │
│          users collection · predictions collection              │
└─────────────────────────────────────────────────────────────────┘

Hybrid Model Architecture (Mermaid)

flowchart TD
  A[Input Text] --> B[Tokenizer<br/>bert-base-uncased]
    B --> C[input_ids, attention_mask]

  C --> D[BERT Encoder<br/>Hidden Size: 768]
    D --> E[Dropout]
  E --> F[BiLSTM<br/>2 layers, hidden=256, bidirectional]
  F --> G[LayerNorm<br/>Output dim: 512]
  G --> H[Multi-Head Self-Attention<br/>8 heads]

  H --> I[Global Max Pooling<br/>across sequence]
    I --> J[MLP Classifier]

    J --> J1[Linear 512->256 + ReLU + Dropout]
    J1 --> J2[Linear 256->128 + ReLU + Dropout]
    J2 --> J3[Linear 128->2]

    J3 --> K[Logits: Real vs Fake]
    K --> L[Softmax / Argmax Prediction]

    subgraph Training
      M[CrossEntropyLoss<br/>class weights + label smoothing]
      N[AdamW + LR Scheduler<br/>Warmup + Weight Decay]
      O[Early Stopping<br/>monitor val F1]
    end

    J3 --> M
    M --> N
    N --> O

📁 Project Structure

FinalYearProject/
├── app/
│   ├── main.py              # FastAPI app, CORS, rate limiter, logging middleware
│   ├── auth.py              # JWT token logic, bcrypt helpers
│   ├── database.py          # Motor async MongoDB client
│   ├── limiter.py           # Shared slowapi Limiter instance
│   ├── api/
│   │   ├── routes.py        # Prediction endpoints (/api/predict, /api/batch-predict, /api/image-predict)
│   │   └── auth_routes.py   # Auth endpoints (/api/auth/*)
│   ├── models/
│   │   └── bert_model.py    # BERT inference wrapper (fallback only)
│   ├── schemas/
│   │   ├── prediction.py    # Pydantic request/response models
│   │   └── auth.py          # User & token schemas
│   └── utils/
│       ├── ai_verification.py # LLM fact-checker (primary classifier)
│       ├── news_validator.py  # Multi-source news validation + RSS parser
│       ├── image_ocr.py       # Mistral OCR — image upload + text extraction
│       └── logger.py          # RotatingFileHandler logger factory
├── enhanced_bert_liar_model/   # BERT fine-tuned on LIAR dataset (fallback)
├── enhanced_bert_welfake_model/ # BERT fine-tuned on WELFake dataset (fallback)
├── frontend/
│   └── src/
│       ├── App.jsx
│       ├── api/index.js
│       ├── context/AuthContext.jsx
│       ├── motion/           # GSAP + Framer Motion helpers
│       └── pages/            # Home, Dashboard, Login, Register
├── logs/                     # Auto-created — rotating app.log
├── Data/WELFake_Dataset.csv
├── Notebook/
│   ├── bert_finetune_notebook.ipynb
│   └── wel-fakebert-finetune-notebook.ipynb
├── run_api.py
├── pyproject.toml
└── README.md

🚀 Production Deployment

Browser
  └──▶  Vercel (React/Vite frontend)
              └── VITE_API_URL ──▶  Hugging Face Spaces (FastAPI + BERT + LLM)
                                          └── MONGODB_URL ──▶  MongoDB Atlas

Layer	Platform	Plan
Frontend	Vercel	Free
Backend	Hugging Face Spaces	CPU Basic (Free)
Database	MongoDB Atlas	M0 Free

Deploy your own copy

Backend (HF Spaces)

Fork this repo and create a new Space (SDK: Docker)
Copy app/, enhanced_bert_*/, run_api.py, Dockerfile.huggingface (rename to Dockerfile)
Add secrets in Space Settings:

Secret	Description
`MONGODB_URL`	MongoDB Atlas connection string
`SECRET_KEY`	JWT signing secret
`AI_API_KEY`	LLM API key for the primary fact-checker
`MISTRAL_API_KEY`	Mistral API key (for image OCR)
`ALLOWED_ORIGINS`	Comma-separated frontend URLs

Frontend (Vercel)

Import your GitHub repo → set Root Directory to frontend
Add env var: VITE_API_URL=https://YOUR_HF_USER-your-space.hf.space/api

💻 Local Development

Prerequisites

Python 3.11+, Node.js 18+
UV package manager
MongoDB Atlas account
LLM API key (for the primary fact-checker)
Mistral API key (free at mistral.ai) — for image OCR

1. Install Backend

git clone <your-repo-url>
cd FinalYearProject
pip install uv
uv sync

2. Configure Environment

Create .env in the project root:

# MongoDB Atlas
MONGODB_URL=mongodb+srv://username:password@cluster.mongodb.net/?retryWrites=true&w=majority
DATABASE_NAME=fake_news_detector

# JWT
SECRET_KEY=your-super-secret-jwt-key-change-in-production
ACCESS_TOKEN_EXPIRE_MINUTES=1440

# LLM API key (primary fact-checker)
AI_API_KEY=your_api_key_here

# Mistral AI (image OCR)
MISTRAL_API_KEY=your_mistral_api_key_here

# News Validation (optional — Google News RSS is free)
NEWSAPI_KEY=your_newsapi_key
SERPAPI_KEY=your_serpapi_key

# CORS
ALLOWED_ORIGINS=http://localhost:5173,http://localhost:3000

3. Start the Backend

python run_api.py

API: http://localhost:8000
Swagger: http://localhost:8000/docs

4. Start the Frontend

cd frontend
npm install
npm run dev

Frontend: http://localhost:5173

🔐 API Reference

Authentication

Method	Endpoint	Rate Limit	Description
`POST`	`/api/auth/register`	3/min	Create a new user account
`POST`	`/api/auth/login`	5/min	Login and receive a JWT token
`GET`	`/api/auth/me`	—	Get current authenticated user
`GET`	`/api/auth/history`	—	Retrieve prediction history
`GET`	`/api/auth/stats`	—	Get total/real/fake counts
`POST`	`/api/auth/logout`	—	Logout

Predictions (JWT required)

Method	Endpoint	Rate Limit	Description
`POST`	`/api/predict`	30/min	Analyse a single news headline
`POST`	`/api/batch-predict`	5/min	Analyse up to 10 texts in one call
`POST`	`/api/image-predict`	10/min	OCR + analyse a news screenshot
`POST`	`/api/extract-image-text`	10/min	OCR only (no prediction)

Example — Single Prediction

Request:

curl -X POST http://localhost:8000/api/predict \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"title": "Scientists discover new planet in solar system"}'

Response:

{
  "text": "Scientists discover new planet in solar system",
  "prediction": "unverified",
  "confidence": 0.62,
  "probabilities": { "real": 0.62, "fake": 0.38 },
  "is_fake": false,
  "prediction_source": "llm_primary",
  "context_articles_used": 2,
  "news_insight": "ℹ️ Limited related news coverage found."
}

🔧 Technology Stack

Backend

Library	Purpose
FastAPI	Async REST API framework
Uvicorn	ASGI server
google-genai	LLM SDK — primary fact-checker
mistralai	Mistral OCR — image text extraction
slowapi	Per-IP API rate limiting
PyTorch	BERT model inference (fallback)
Transformers (HuggingFace)	Tokeniser + BERT model architecture
Motor	Async MongoDB driver
python-jose	JWT token generation & validation
passlib[bcrypt]	Password hashing
requests + beautifulsoup4	News RSS scraping
newsapi-python	NewsAPI client
serpapi	SerpAPI client

Frontend

Library	Purpose
React 18	UI component library
Vite	Build tool & dev server
TailwindCSS 3	Utility-first styling
GSAP + ScrollTrigger	Scroll-driven animations
Framer Motion	Page transition system
Recharts	Pie chart visualisation
Axios	HTTP client with interceptors

🤖 Classification Details

LLM Fact-Checker (Primary)

Property	Value
Input	User claim + live news articles (headline, summary, date, URL)
Output labels	`REAL` / `FAKE` / `UNVERIFIED`
Fallback chain	Multiple model tiers tried automatically on quota errors
Context	Receives live Google News articles before deciding

UNVERIFIED is returned when the LLM cannot confirm or deny the claim from available evidence (e.g. very recent events not yet widely reported). It maps to is_fake: false with capped confidence (≤ 68%).

FAKE is only returned when retrieved articles directly contradict the specific factual assertion — not merely because the claim is surprising or uses dramatic language.

BERT (Fallback)

Property	Value
Architecture	BERT (bert-base-uncased)
Training	LIAR dataset (binarised)
Max token length	512
Accuracy	~95%
When used	Only when all Gemini models fail

🔒 Security

JWT tokens with configurable expiry (default 24 hours)
Bcrypt password hashing
Per-IP rate limiting on all public endpoints
CORS middleware (configurable via ALLOWED_ORIGINS)
Pydantic input validation on all endpoints
Environment-variable-driven secrets

🔧 Environment Variables Reference

Variable	Required	Description
`MONGODB_URL`	✅	MongoDB Atlas connection string
`DATABASE_NAME`	✅	Target database name
`SECRET_KEY`	✅	Secret used to sign JWT tokens
`AI_API_KEY`	✅	LLM API key (primary fact-checker)
`MISTRAL_API_KEY`	✅	Mistral API key (image OCR)
`ACCESS_TOKEN_EXPIRE_MINUTES`	❌	Token TTL (default: 1440)
`NEWSAPI_KEY`	❌	NewsAPI key
`SERPAPI_KEY`	❌	SerpAPI key
`ALLOWED_ORIGINS`	❌	Comma-separated CORS origins
`ENABLE_AI_CHECK`	❌	Set `false` to force BERT-only mode

📂 Datasets

LIAR Dataset

Property	Detail
Source	W. Wang, 2017 — UCSB
Size	~12,800 labelled statements
Labels	6-class → binarised to fake / real
Domain	Political statements (PolitiFact)
License	Public domain

WELFake Dataset

Property	Detail
Source	Verma et al., 2021
Size	72,134 articles (35,028 fake · 37,106 real)
Domain	Mixed: Kaggle, Reuters, BuzzFeed
License	CC BY 4.0

🧪 Training Notebooks

Notebook	Description
`Notebook/bert_finetune_notebook.ipynb`	BERT fine-tuning on LIAR dataset
`Notebook/wel-fakebert-finetune-notebook.ipynb`	BERT fine-tuning on WELFake dataset

🤝 Contributing

Fork the repository
Create a feature branch: git checkout -b feature/my-feature
Commit: git commit -m "feat: add my feature"
Push: git push origin feature/my-feature
Open a Pull Request

📄 License

MIT License

🙏 Acknowledgements

LIAR Dataset — W. Wang, 2017
WELFake Dataset — Verma et al., 2021
Hugging Face Transformers — BERT tokeniser and model utilities
Primary LLM fact-checker — contextual claim verification against live news
Mistral AI — Image OCR

🛡️ Built to fight misinformation — TruthLens