Spaces:

Ravi1212
/

Fake

Configuration error

App Files Files Community

Fake / README.md

Ravi1212

Upload 12 files

2bdf377 verified about 1 month ago

preview code

raw

history blame contribute delete

19.7 kB

	# 🛡️ TruthLens — BERT-Based Fake News Detector



	![Python](https://img.shields.io/badge/Python-3.11+-blue.svg)
	![FastAPI](https://img.shields.io/badge/FastAPI-0.104+-green.svg)
	![React](https://img.shields.io/badge/React-18.2+-61DAFB.svg)
	![MongoDB](https://img.shields.io/badge/MongoDB-Atlas-47A248.svg)
	![TailwindCSS](https://img.shields.io/badge/TailwindCSS-3.3+-38B2AC.svg)
	![HuggingFace](https://img.shields.io/badge/HuggingFace-Spaces-FFD21E.svg)
	![Vercel](https://img.shields.io/badge/Deployed%20on-Vercel-black.svg)
	![License](https://img.shields.io/badge/License-MIT-yellow.svg)

	---

	A full-stack web application that detects fake news using a large language model (LLM) as the primary classifier, backed by a fine-tuned BERT transformer model, real-time Google News RSS validation, image OCR analysis, API rate limiting, and a fully animated React interface with MongoDB-backed user authentication.


	## 🌐 Live Demo

	\| \| Link \|
	\|---\|---\|
	\| 🖥️ Frontend (React App) \| [https://truth-lens-bert-based-fake-news-and.vercel.app](https://truth-lens-bert-based-fake-news-and.vercel.app) \|
	\| ⚙️ Backend API \| [https://suryakf-truthlens-backend.hf.space](https://suryakf-truthlens-backend.hf.space) \|
	\| 📖 Swagger / API Docs \| [https://suryakf-truthlens-backend.hf.space/docs](https://suryakf-truthlens-backend.hf.space/docs) \|

	> The backend runs on Hugging Face Spaces (CPU Basic — 2 vCPU, 16 GB RAM).
	> The frontend is deployed on Vercel with global CDN.
	> The database is MongoDB Atlas (M0 free cluster).



	## ✨ Features

	### Core Detection Pipeline
	- Fine-tuned BERT (Primary) — PyTorch BERT model (~95% accuracy)
	- Three-label output — `REAL` / `FAKE` / `UNVERIFIED`. The LLM outputs UNVERIFIED when evidence is inconclusive, avoiding over-flagging real recent news as fake.
	- Confidence Scoring — Per-prediction probability distribution visualised as a live pie chart.
	- Batch Analysis — Submit up to 10 news texts in one request.

	### News Source Validation
	- Google News RSS — Free real-time headline search (no API key required). Retrieves title, source, publish date, and article description.
	- NewsAPI Integration — Extended article lookup with source attribution.
	- SerpAPI Integration — Fallback search-engine news verification.
	- Live context injection — All retrieved articles (headline + summary + URL + publish date) are passed directly into the LLM's prompt so it cross-references the claim against real-world evidence.

	### Image & OCR
	- Screenshot Upload — Paste or upload a screenshot of a news headline/article.
	- Mistral OCR — Extracts title, body text, source, and date from the image.
	- Same pipeline as text — After OCR, the extracted headline goes through the same LLM-primary flow (news search → LLM with context → BERT fallback).

	### Rate Limiting
	API rate limits enforced via slowapi (per client IP):

	\| Endpoint \| Limit \|
	\|---\|---\|
	\| `POST /api/predict` \| 30 / minute \|
	\| `POST /api/batch-predict` \| 5 / minute \|
	\| `POST /api/image-predict` \| 10 / minute \|
	\| `POST /api/extract-image-text` \| 10 / minute \|
	\| `POST /api/auth/login` \| 5 / minute \|
	\| `POST /api/auth/register` \| 3 / minute \|

	### Authentication & History
	- JWT Authentication — 24-hour access tokens, bcrypt-hashed passwords.
	- Prediction History — Every analysis stored with timestamp and label in MongoDB.
	- User Dashboard — Live stats, streak counter, accuracy breakdown.

	### Developer Experience
	- Rotating Log Files — All API activity written to `logs/app.log` (10 MB cap, 5 backups).
	- Swagger / ReDoc — Auto-generated interactive API docs at `/docs` and `/redoc`.
	- Environment-Driven Config — All secrets via `.env`.

	---

	## 🏗️ Architecture

	```
	┌─────────────────────────────────────────────────────────────────┐
	│ FRONTEND (React + Vite) │
	│ Home │ Login │ Register │ Dashboard │
	│ GSAP ScrollTrigger · Framer Motion · TailwindCSS · Recharts │
	└────────────────────────────┬────────────────────────────────────┘
	│ HTTPS / JWT
	┌────────────────────────────▼────────────────────────────────────┐
	│ BACKEND (FastAPI) │
	│ Rate Limiting (slowapi) → Logging Middleware → logs/app.log │
	│ /api/predict /api/batch-predict /api/image-predict │
	└──────┬──────────────────────────────────────┬───────────────────┘
	│ │
	▼ STEP 1 ▼ STEP 1 (image)
	┌─────────────────┐ ┌─────────────────────┐
	│ News Validator │ │ Mistral OCR │
	│ Google News RSS │ │ Extracts title + │
	│ NewsAPI │ │ text from image │
	│ SerpAPI │ └──────────┬──────────┘
	└────────┬────────┘ │
	│ articles (title+desc+date+url) │ extracted headline
	▼ STEP 2 (PRIMARY) ▼ STEP 2 (PRIMARY)
	┌─────────────────────────────────────────────────────────────────┐
	│ LLM Fact-Checker │
	│ Primary model → Fallback 1 → Fallback 2 │
	│ Output: REAL / FAKE / UNVERIFIED + confidence + reasoning │
	└───────────────────────────────┬─────────────────────────────────┘
	│ (only if ALL Gemini models fail)
	▼ STEP 3 (FALLBACK)
	┌────────────────────────┐
	│ Fine-tuned BERT │
	│ PyTorch + HF ~95% acc │
	└────────────────────────┘
	│
	┌───────────────────────────────▼─────────────────────────────────┐
	│ MongoDB Atlas (Motor async) │
	│ users collection · predictions collection │
	└─────────────────────────────────────────────────────────────────┘
	```

	### Hybrid Model Architecture (Mermaid)

	```mermaid
	flowchart TD
	A[Input Text] --> B[Tokenizer<br/>bert-base-uncased]
	B --> C[input_ids, attention_mask]

	C --> D[BERT Encoder<br/>Hidden Size: 768]
	D --> E[Dropout]
	E --> F[BiLSTM<br/>2 layers, hidden=256, bidirectional]
	F --> G[LayerNorm<br/>Output dim: 512]
	G --> H[Multi-Head Self-Attention<br/>8 heads]

	H --> I[Global Max Pooling<br/>across sequence]
	I --> J[MLP Classifier]

	J --> J1[Linear 512->256 + ReLU + Dropout]
	J1 --> J2[Linear 256->128 + ReLU + Dropout]
	J2 --> J3[Linear 128->2]

	J3 --> K[Logits: Real vs Fake]
	K --> L[Softmax / Argmax Prediction]

	subgraph Training
	M[CrossEntropyLoss<br/>class weights + label smoothing]
	N[AdamW + LR Scheduler<br/>Warmup + Weight Decay]
	O[Early Stopping<br/>monitor val F1]
	end

	J3 --> M
	M --> N
	N --> O
	```

	---

	## 📁 Project Structure

	```
	FinalYearProject/
	├── app/
	│ ├── main.py # FastAPI app, CORS, rate limiter, logging middleware
	│ ├── auth.py # JWT token logic, bcrypt helpers
	│ ├── database.py # Motor async MongoDB client
	│ ├── limiter.py # Shared slowapi Limiter instance
	│ ├── api/
	│ │ ├── routes.py # Prediction endpoints (/api/predict, /api/batch-predict, /api/image-predict)
	│ │ └── auth_routes.py # Auth endpoints (/api/auth/*)
	│ ├── models/
	│ │ └── bert_model.py # BERT inference wrapper (fallback only)
	│ ├── schemas/
	│ │ ├── prediction.py # Pydantic request/response models
	│ │ └── auth.py # User & token schemas
	│ └── utils/
	│ ├── ai_verification.py # LLM fact-checker (primary classifier)
	│ ├── news_validator.py # Multi-source news validation + RSS parser
	│ ├── image_ocr.py # Mistral OCR — image upload + text extraction
	│ └── logger.py # RotatingFileHandler logger factory
	├── enhanced_bert_liar_model/ # BERT fine-tuned on LIAR dataset (fallback)
	├── enhanced_bert_welfake_model/ # BERT fine-tuned on WELFake dataset (fallback)
	├── frontend/
	│ └── src/
	│ ├── App.jsx
	│ ├── api/index.js
	│ ├── context/AuthContext.jsx
	│ ├── motion/ # GSAP + Framer Motion helpers
	│ └── pages/ # Home, Dashboard, Login, Register
	├── logs/ # Auto-created — rotating app.log
	├── Data/WELFake_Dataset.csv
	├── Notebook/
	│ ├── bert_finetune_notebook.ipynb
	│ └── wel-fakebert-finetune-notebook.ipynb
	├── run_api.py
	├── pyproject.toml
	└── README.md
	```

	---

	## 🚀 Production Deployment

	```
	Browser
	└──▶ Vercel (React/Vite frontend)
	└── VITE_API_URL ──▶ Hugging Face Spaces (FastAPI + BERT + LLM)
	└── MONGODB_URL ──▶ MongoDB Atlas
	```

	\| Layer \| Platform \| Plan \|
	\|-------\|----------\|------\|
	\| Frontend \| [Vercel](https://vercel.com) \| Free \|
	\| Backend \| [Hugging Face Spaces](https://huggingface.co/spaces) \| CPU Basic (Free) \|
	\| Database \| [MongoDB Atlas](https://cloud.mongodb.com) \| M0 Free \|

	### Deploy your own copy

	Backend (HF Spaces)
	1. Fork this repo and create a new Space (SDK: Docker)
	2. Copy `app/`, `enhanced_bert_*/`, `run_api.py`, `Dockerfile.huggingface` (rename to `Dockerfile`)
	3. Add secrets in Space Settings:

	\| Secret \| Description \|
	\|--------\|-------------\|
	\| `MONGODB_URL` \| MongoDB Atlas connection string \|
	\| `SECRET_KEY` \| JWT signing secret \|
	\| `AI_API_KEY` \| LLM API key for the primary fact-checker \|
	\| `MISTRAL_API_KEY` \| Mistral API key (for image OCR) \|
	\| `ALLOWED_ORIGINS` \| Comma-separated frontend URLs \|

	Frontend (Vercel)
	1. Import your GitHub repo → set Root Directory to `frontend`
	2. Add env var: `VITE_API_URL=https://YOUR_HF_USER-your-space.hf.space/api`

	---

	## 💻 Local Development

	### Prerequisites
	- Python 3.11+, Node.js 18+
	- [UV](https://github.com/astral-sh/uv) package manager
	- MongoDB Atlas account
	- LLM API key (for the primary fact-checker)
	- Mistral API key (free at [mistral.ai](https://mistral.ai)) — for image OCR

	### 1. Install Backend

	```bash
	git clone <your-repo-url>
	cd FinalYearProject
	pip install uv
	uv sync
	```

	### 2. Configure Environment

	Create `.env` in the project root:

	```env
	# MongoDB Atlas
	MONGODB_URL=mongodb+srv://username:password@cluster.mongodb.net/?retryWrites=true&w=majority
	DATABASE_NAME=fake_news_detector

	# JWT
	SECRET_KEY=your-super-secret-jwt-key-change-in-production
	ACCESS_TOKEN_EXPIRE_MINUTES=1440

	# LLM API key (primary fact-checker)
	AI_API_KEY=your_api_key_here

	# Mistral AI (image OCR)
	MISTRAL_API_KEY=your_mistral_api_key_here

	# News Validation (optional — Google News RSS is free)
	NEWSAPI_KEY=your_newsapi_key
	SERPAPI_KEY=your_serpapi_key

	# CORS
	ALLOWED_ORIGINS=http://localhost:5173,http://localhost:3000
	```

	### 3. Start the Backend

	```bash
	python run_api.py
	```

	- API: http://localhost:8000
	- Swagger: http://localhost:8000/docs

	### 4. Start the Frontend

	```bash
	cd frontend
	npm install
	npm run dev
	```

	Frontend: http://localhost:5173

	---

	## 🔐 API Reference

	### Authentication

	\| Method \| Endpoint \| Rate Limit \| Description \|
	\|--------\|----------\|------------\|-------------\|
	\| `POST` \| `/api/auth/register` \| 3/min \| Create a new user account \|
	\| `POST` \| `/api/auth/login` \| 5/min \| Login and receive a JWT token \|
	\| `GET` \| `/api/auth/me` \| — \| Get current authenticated user \|
	\| `GET` \| `/api/auth/history` \| — \| Retrieve prediction history \|
	\| `GET` \| `/api/auth/stats` \| — \| Get total/real/fake counts \|
	\| `POST` \| `/api/auth/logout` \| — \| Logout \|

	### Predictions (JWT required)

	\| Method \| Endpoint \| Rate Limit \| Description \|
	\|--------\|----------\|------------\|-------------\|
	\| `POST` \| `/api/predict` \| 30/min \| Analyse a single news headline \|
	\| `POST` \| `/api/batch-predict` \| 5/min \| Analyse up to 10 texts in one call \|
	\| `POST` \| `/api/image-predict` \| 10/min \| OCR + analyse a news screenshot \|
	\| `POST` \| `/api/extract-image-text` \| 10/min \| OCR only (no prediction) \|

	### Example — Single Prediction

	Request:
	```bash
	curl -X POST http://localhost:8000/api/predict \
	-H "Authorization: Bearer YOUR_JWT_TOKEN" \
	-H "Content-Type: application/json" \
	-d '{"title": "Scientists discover new planet in solar system"}'
	```

	Response:
	```json
	{
	"text": "Scientists discover new planet in solar system",
	"prediction": "unverified",
	"confidence": 0.62,
	"probabilities": { "real": 0.62, "fake": 0.38 },
	"is_fake": false,
	"prediction_source": "llm_primary",
	"context_articles_used": 2,
	"news_insight": "ℹ️ Limited related news coverage found."
	}
	```

	---

	## 🔧 Technology Stack

	### Backend
	\| Library \| Purpose \|
	\|---------\|---------\|
	\| FastAPI \| Async REST API framework \|
	\| Uvicorn \| ASGI server \|
	\| google-genai \| LLM SDK — primary fact-checker \|
	\| mistralai \| Mistral OCR — image text extraction \|
	\| slowapi \| Per-IP API rate limiting \|
	\| PyTorch \| BERT model inference (fallback) \|
	\| Transformers (HuggingFace) \| Tokeniser + BERT model architecture \|
	\| Motor \| Async MongoDB driver \|
	\| python-jose \| JWT token generation & validation \|
	\| passlib[bcrypt] \| Password hashing \|
	\| requests + beautifulsoup4 \| News RSS scraping \|
	\| newsapi-python \| NewsAPI client \|
	\| serpapi \| SerpAPI client \|

	### Frontend
	\| Library \| Purpose \|
	\|---------\|---------\|
	\| React 18 \| UI component library \|
	\| Vite \| Build tool & dev server \|
	\| TailwindCSS 3 \| Utility-first styling \|
	\| GSAP + ScrollTrigger \| Scroll-driven animations \|
	\| Framer Motion \| Page transition system \|
	\| Recharts \| Pie chart visualisation \|
	\| Axios \| HTTP client with interceptors \|

	---

	## 🤖 Classification Details

	### LLM Fact-Checker (Primary)
	\| Property \| Value \|
	\|----------\|-------\|
	\| Input \| User claim + live news articles (headline, summary, date, URL) \|
	\| Output labels \| `REAL` / `FAKE` / `UNVERIFIED` \|
	\| Fallback chain \| Multiple model tiers tried automatically on quota errors \|
	\| Context \| Receives live Google News articles before deciding \|

	UNVERIFIED is returned when the LLM cannot confirm or deny the claim from available evidence (e.g. very recent events not yet widely reported). It maps to `is_fake: false` with capped confidence (≤ 68%).

	FAKE is only returned when retrieved articles directly contradict the specific factual assertion — not merely because the claim is surprising or uses dramatic language.

	### BERT (Fallback)
	\| Property \| Value \|
	\|----------\|-------\|
	\| Architecture \| BERT (bert-base-uncased) \|
	\| Training \| LIAR dataset (binarised) \|
	\| Max token length \| 512 \|
	\| Accuracy \| ~95% \|
	\| When used \| Only when all Gemini models fail \|

	---

	## 🔒 Security

	- JWT tokens with configurable expiry (default 24 hours)
	- Bcrypt password hashing
	- Per-IP rate limiting on all public endpoints
	- CORS middleware (configurable via `ALLOWED_ORIGINS`)
	- Pydantic input validation on all endpoints
	- Environment-variable-driven secrets

	---

	## 🔧 Environment Variables Reference

	\| Variable \| Required \| Description \|
	\|----------\|----------\|-------------\|
	\| `MONGODB_URL` \| ✅ \| MongoDB Atlas connection string \|
	\| `DATABASE_NAME` \| ✅ \| Target database name \|
	\| `SECRET_KEY` \| ✅ \| Secret used to sign JWT tokens \|
	\| `AI_API_KEY` \| ✅ \| LLM API key (primary fact-checker) \|
	\| `MISTRAL_API_KEY` \| ✅ \| Mistral API key (image OCR) \|
	\| `ACCESS_TOKEN_EXPIRE_MINUTES` \| ❌ \| Token TTL (default: 1440) \|
	\| `NEWSAPI_KEY` \| ❌ \| NewsAPI key \|
	\| `SERPAPI_KEY` \| ❌ \| SerpAPI key \|
	\| `ALLOWED_ORIGINS` \| ❌ \| Comma-separated CORS origins \|
	\| `ENABLE_AI_CHECK` \| ❌ \| Set `false` to force BERT-only mode \|

	---

	## 📂 Datasets

	### LIAR Dataset
	\| Property \| Detail \|
	\|----------\|--------\|
	\| Source \| [W. Wang, 2017](https://aclanthology.org/P17-2067/) — UCSB \|
	\| Size \| ~12,800 labelled statements \|
	\| Labels \| 6-class → binarised to fake / real \|
	\| Domain \| Political statements (PolitiFact) \|
	\| License \| Public domain \|

	### WELFake Dataset
	\| Property \| Detail \|
	\|----------\|--------\|
	\| Source \| [Verma et al., 2021](https://doi.org/10.1109/TVCG.2021.3071339) \|
	\| Size \| 72,134 articles (35,028 fake · 37,106 real) \|
	\| Domain \| Mixed: Kaggle, Reuters, BuzzFeed \|
	\| License \| CC BY 4.0 \|

	---

	## 🧪 Training Notebooks

	\| Notebook \| Description \|
	\|----------\|-------------\|
	\| `Notebook/bert_finetune_notebook.ipynb` \| BERT fine-tuning on LIAR dataset \|
	\| `Notebook/wel-fakebert-finetune-notebook.ipynb` \| BERT fine-tuning on WELFake dataset \|

	---

	## 🤝 Contributing

	1. Fork the repository
	2. Create a feature branch: `git checkout -b feature/my-feature`
	3. Commit: `git commit -m "feat: add my feature"`
	4. Push: `git push origin feature/my-feature`
	5. Open a Pull Request

	---

	## 📄 License

	MIT License

	---

	## 🙏 Acknowledgements

	- [LIAR Dataset](https://www.cs.ucsb.edu/~william/data/liar_dataset.zip) — W. Wang, 2017
	- [WELFake Dataset](https://zenodo.org/record/4561253) — Verma et al., 2021
	- [Hugging Face Transformers](https://huggingface.co/) — BERT tokeniser and model utilities
	- Primary LLM fact-checker — contextual claim verification against live news
	- [Mistral AI](https://mistral.ai/) — Image OCR

	---

	<p align="center">🛡️ Built to fight misinformation — TruthLens</p>