# 🛡️ TruthLens: BERT-Based Fake News Detector
![Python](https://img.shields.io/badge/Python-3.11+-blue.svg)
![FastAPI](https://img.shields.io/badge/FastAPI-0.104+-green.svg)
![React](https://img.shields.io/badge/React-18.2+-61DAFB.svg)
![MongoDB](https://img.shields.io/badge/MongoDB-Atlas-47A248.svg)
![TailwindCSS](https://img.shields.io/badge/TailwindCSS-3.3+-38B2AC.svg)
![HuggingFace](https://img.shields.io/badge/HuggingFace-Spaces-FFD21E.svg)
![Vercel](https://img.shields.io/badge/Deployed%20on-Vercel-black.svg)
![License](https://img.shields.io/badge/License-MIT-yellow.svg)
---
A full-stack web application that detects fake news using a **large language model (LLM)** as the primary classifier, backed by a fine-tuned BERT transformer model, real-time Google News RSS validation, image OCR analysis, API rate limiting, and a fully animated React interface with MongoDB-backed user authentication.
## 🌐 Live Demo
| | Link |
|---|---|
| **🖥️ Frontend (React App)** | **[https://truth-lens-bert-based-fake-news-and.vercel.app](https://truth-lens-bert-based-fake-news-and.vercel.app)** |
| **⚙️ Backend API** | [https://suryakf-truthlens-backend.hf.space](https://suryakf-truthlens-backend.hf.space) |
| **📖 Swagger / API Docs** | [https://suryakf-truthlens-backend.hf.space/docs](https://suryakf-truthlens-backend.hf.space/docs) |
> The backend runs on **Hugging Face Spaces** (CPU Basic: 2 vCPU, 16 GB RAM).
> The frontend is deployed on **Vercel** with global CDN.
> The database is **MongoDB Atlas** (M0 free cluster).
## ✨ Features
### Core Detection Pipeline
- **Fine-tuned BERT (Fallback)**: PyTorch BERT model (~95% accuracy), used only when the LLM fact-checker is unavailable.
- **Three-label output**: `REAL` / `FAKE` / `UNVERIFIED`. The LLM returns `UNVERIFIED` when the evidence is inconclusive, which avoids flagging genuine recent news as fake.
- **Confidence Scoring**: Per-prediction probability distribution visualised as a live pie chart.
- **Batch Analysis**: Submit up to 10 news texts in one request.
### News Source Validation
- **Google News RSS**: Free real-time headline search (no API key required). Retrieves title, source, publish date, and article description.
- **NewsAPI Integration**: Extended article lookup with source attribution.
- **SerpAPI Integration**: Fallback search-engine news verification.
- **Live context injection**: All retrieved articles (headline + summary + URL + publish date) are passed directly into the LLM's prompt so that it cross-references the claim against real-world evidence.
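
The retrieval and context-injection steps above can be sketched with the standard library alone. This is a minimal illustration rather than the project's `news_validator.py`: the function names, the `hl`/`gl`/`ceid` locale values, and the prompt layout are assumptions. Only the Google News RSS search endpoint and the list of retrieved fields come from the description above.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Public Google News RSS search endpoint (no API key needed).
GOOGLE_NEWS_RSS = "https://news.google.com/rss/search"


def build_rss_url(query: str, lang: str = "en-US", country: str = "US") -> str:
    """Build a search URL; the locale values are illustrative assumptions."""
    params = {
        "q": query,
        "hl": lang,
        "gl": country,
        "ceid": f"{country}:{lang.split('-')[0]}",
    }
    return f"{GOOGLE_NEWS_RSS}?{urllib.parse.urlencode(params)}"


def parse_rss_items(xml_text: str, limit: int = 5) -> list[dict]:
    """Extract title, source, publish date, description and URL per <item>."""
    articles = []
    for item in ET.fromstring(xml_text).iter("item"):
        articles.append({
            "title": item.findtext("title", default=""),
            "source": item.findtext("source", default=""),
            "published": item.findtext("pubDate", default=""),
            "summary": item.findtext("description", default=""),
            "url": item.findtext("link", default=""),
        })
        if len(articles) >= limit:
            break
    return articles


def format_context(claim: str, articles: list[dict]) -> str:
    """Render the claim plus retrieved articles as plain-text prompt context."""
    lines = [f"Claim: {claim}", "Evidence:"]
    for index, article in enumerate(articles, start=1):
        lines.append(
            f"{index}. {article['title']} ({article['source']}, {article['published']})"
        )
        lines.append(f"   {article['summary']}")
        lines.append(f"   {article['url']}")
    if not articles:
        lines.append("(no related articles found)")
    return "\n".join(lines)
```

Fetching is left out on purpose: the XML text can come from any HTTP client, and keeping parsing separate from I/O makes the parser easy to test against a saved feed.
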
### Image & OCR
- **Screenshot Upload**: Paste or upload a screenshot of a news headline/article.
- **Mistral OCR**: Extracts title, body text, source, and date from the image.
- **Same pipeline as text**: After OCR, the extracted headline goes through the same LLM-primary flow (news search → LLM with context → BERT fallback).
### Rate Limiting
API rate limits enforced via **slowapi** (per client IP):
| Endpoint | Limit |
|---|---|
| `POST /api/predict` | 30 / minute |
| `POST /api/batch-predict` | 5 / minute |
| `POST /api/image-predict` | 10 / minute |
| `POST /api/extract-image-text` | 10 / minute |
| `POST /api/auth/login` | 5 / minute |
| `POST /api/auth/register` | 3 / minute |
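
To make the per-IP limits in the table concrete, here is a standard-library sketch of a sliding-window limiter keyed by client IP. It is an illustration of the idea only: the project delegates this to slowapi, and the class name, the `parse_limit` helper, and the injectable clock are all assumptions, not slowapi's API or algorithm.

```python
import time
from collections import defaultdict, deque

PERIOD_SECONDS = {"second": 1, "minute": 60, "hour": 3600}


def parse_limit(spec: str) -> tuple[int, int]:
    """Turn a spec such as '30 / minute' into (max_calls, window_seconds)."""
    count, period = (part.strip() for part in spec.split("/"))
    return int(count), PERIOD_SECONDS[period]


class SlidingWindowLimiter:
    """Allow at most max_calls per client within the trailing time window."""

    def __init__(self, spec: str, clock=time.monotonic):
        self.max_calls, self.window = parse_limit(spec)
        self._clock = clock
        self._hits = defaultdict(deque)  # client IP -> timestamps of allowed calls

    def allow(self, client_ip: str) -> bool:
        now = self._clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_calls:
            return False
        hits.append(now)
        return True
```

A handler would call `allow(ip)` before doing any work and answer with HTTP 429 when it returns `False`. Passing a fake `clock` keeps the behaviour deterministic in tests.
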
### Authentication & History
- **JWT Authentication**: 24-hour access tokens, bcrypt-hashed passwords.
- **Prediction History**: Every analysis stored with timestamp and label in MongoDB.
- **User Dashboard**: Live stats, streak counter, accuracy breakdown.
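
The 24-hour token lifetime can be illustrated with a small HS256 sketch built on the standard library. This is not the project's `auth.py`, which relies on python-jose; the function names and the `sub`/`iat`/`exp` payload layout are assumptions chosen to show how an expiry claim is issued and checked.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> bytes:
    return hmac.new(secret.encode(), signing_input.encode("ascii"), hashlib.sha256).digest()


def create_token(subject: str, secret: str, expires_minutes: int = 1440,
                 now: Optional[float] = None) -> str:
    """Issue an HS256 token that expires after expires_minutes (1440 = 24 h)."""
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": subject, "iat": issued, "exp": issued + expires_minutes * 60}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    return f"{signing_input}.{_b64url(_sign(signing_input, secret))}"


def verify_token(token: str, secret: str, now: Optional[float] = None) -> dict:
    """Return the payload if the signature is valid and the token is unexpired."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, signature_b64 = parts
    expected = _sign(f"{header_b64}.{payload_b64}", secret)
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("invalid signature")
    payload = json.loads(_b64url_decode(payload_b64))
    if (time.time() if now is None else now) >= payload["exp"]:
        raise ValueError("token expired")
    return payload
```

In a real deployment a maintained JWT library is the safer choice; the sketch only shows why a token issued at login stops working 24 hours later.
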
### Developer Experience
- **Rotating Log Files**: All API activity written to `logs/app.log` (10 MB cap, 5 backups).
- **Swagger / ReDoc**: Auto-generated interactive API docs at `/docs` and `/redoc`.
- **Environment-Driven Config**: All secrets via `.env`.
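
The rotating log file described above maps directly onto Python's `logging.handlers.RotatingFileHandler`. The sketch below shows one way a logger factory could be written; the function name, logger name, and log format are assumptions, while the 10 MB cap, 5 backups, and `logs/app.log` path follow the feature list.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def get_logger(name: str = "truthlens", log_dir: str = "logs") -> logging.Logger:
    """Return a logger writing to <log_dir>/app.log with size-based rotation."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)  # auto-create logs/
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = RotatingFileHandler(
            Path(log_dir) / "app.log",
            maxBytes=10 * 1024 * 1024,  # rotate once the file reaches 10 MB
            backupCount=5,              # keep app.log.1 ... app.log.5
            encoding="utf-8",
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s | %(levelname)s | %(name)s | %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

With these settings the log directory holds at most six files, so disk usage stays bounded at roughly 60 MB.
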
---
## 🏗️ Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                     FRONTEND (React + Vite)                     │
│               Home │ Login │ Register │ Dashboard               │
│   GSAP ScrollTrigger · Framer Motion · TailwindCSS · Recharts   │
└────────────────────────────┬────────────────────────────────────┘
                             │ HTTPS / JWT
┌────────────────────────────▼────────────────────────────────────┐
│                        BACKEND (FastAPI)                        │
│   Rate Limiting (slowapi) → Logging Middleware → logs/app.log   │
│   /api/predict   /api/batch-predict   /api/image-predict        │
└──────┬──────────────────────────────────────┬───────────────────┘
       │                                      │
       ▼ STEP 1                               ▼ STEP 1 (image)
┌─────────────────┐                ┌─────────────────────┐
│ News Validator  │                │     Mistral OCR     │
│ Google News RSS │                │  Extracts title +   │
│ NewsAPI         │                │  text from image    │
│ SerpAPI         │                └──────────┬──────────┘
└────────┬────────┘                           │
         │ articles (title+desc+date+url)     │ extracted headline
         ▼ STEP 2 (PRIMARY)                   ▼ STEP 2 (PRIMARY)
┌─────────────────────────────────────────────────────────────────┐
│                        LLM Fact-Checker                         │
│             Primary model → Fallback 1 → Fallback 2             │
│    Output: REAL / FAKE / UNVERIFIED + confidence + reasoning    │
└───────────────────────────────┬─────────────────────────────────┘
                                │ (only if ALL Gemini models fail)
                                ▼ STEP 3 (FALLBACK)
                    ┌────────────────────────┐
                    │ Fine-tuned BERT        │
                    │ PyTorch + HF ~95% acc  │
                    └────────────────────────┘
                                │
┌───────────────────────────────▼─────────────────────────────────┐
│                   MongoDB Atlas (Motor async)                   │
│            users collection · predictions collection            │
└─────────────────────────────────────────────────────────────────┘
```
### Hybrid Model Architecture (Mermaid)
```mermaid
flowchart TD
A[Input Text] --> B[Tokenizer<br/>bert-base-uncased]
B --> C[input_ids, attention_mask]
C --> D[BERT Encoder<br/>Hidden Size: 768]
D --> E[Dropout]
E --> F[BiLSTM<br/>2 layers, hidden=256, bidirectional]
F --> G[LayerNorm<br/>Output dim: 512]
G --> H[Multi-Head Self-Attention<br/>8 heads]
H --> I[Global Max Pooling<br/>across sequence]
I --> J[MLP Classifier]
J --> J1[Linear 512->256 + ReLU + Dropout]
J1 --> J2[Linear 256->128 + ReLU + Dropout]
J2 --> J3[Linear 128->2]
J3 --> K[Logits: Real vs Fake]
K --> L[Softmax / Argmax Prediction]
subgraph Training
M[CrossEntropyLoss<br/>class weights + label smoothing]
N[AdamW + LR Scheduler<br/>Warmup + Weight Decay]
O[Early Stopping<br/>monitor val F1]
end
J3 --> M
M --> N
N --> O
```
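
The dimensions in the flowchart can be checked with a little arithmetic. The sketch below traces tensor shapes through the stages and counts the parameters of the MLP head. It is plain Python rather than PyTorch, the function names are hypothetical, and it assumes each `Linear` layer has a bias term; the layer sizes themselves are taken from the diagram.

```python
def trace_shapes(batch: int, seq_len: int, bert_hidden: int = 768,
                 lstm_hidden: int = 256, num_heads: int = 8,
                 num_classes: int = 2) -> list:
    """List (stage, output shape) pairs for the hybrid model in the diagram."""
    lstm_out = 2 * lstm_hidden  # forward and backward states are concatenated
    if lstm_out % num_heads != 0:
        raise ValueError("attention width must be divisible by the head count")
    return [
        ("input_ids / attention_mask", (batch, seq_len)),
        ("BERT encoder", (batch, seq_len, bert_hidden)),
        ("BiLSTM", (batch, seq_len, lstm_out)),
        ("LayerNorm", (batch, seq_len, lstm_out)),
        ("multi-head self-attention", (batch, seq_len, lstm_out)),
        ("global max pooling", (batch, lstm_out)),
        ("logits", (batch, num_classes)),
    ]


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def mlp_head_params(dims=(512, 256, 128, 2)) -> int:
    """Total trainable parameters of the 512 -> 256 -> 128 -> 2 classifier."""
    return sum(linear_params(a, b) for a, b in zip(dims, dims[1:]))
```

The BiLSTM output width of 512 is what the diagram's LayerNorm expects, and it splits evenly into 8 attention heads of 64 dimensions each. Under the bias assumption the classifier head adds 164,482 parameters, which is small next to the BERT encoder.
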
---
## 📁 Project Structure
```
FinalYearProject/
├── app/
│   ├── main.py                  # FastAPI app, CORS, rate limiter, logging middleware
│   ├── auth.py                  # JWT token logic, bcrypt helpers
│   ├── database.py              # Motor async MongoDB client
│   ├── limiter.py               # Shared slowapi Limiter instance
│   ├── api/
│   │   ├── routes.py            # Prediction endpoints (/api/predict, /api/batch-predict, /api/image-predict)
│   │   └── auth_routes.py       # Auth endpoints (/api/auth/*)
│   ├── models/
│   │   └── bert_model.py        # BERT inference wrapper (fallback only)
│   ├── schemas/
│   │   ├── prediction.py        # Pydantic request/response models
│   │   └── auth.py              # User & token schemas
│   └── utils/
│       ├── ai_verification.py   # LLM fact-checker (primary classifier)
│       ├── news_validator.py    # Multi-source news validation + RSS parser
│       ├── image_ocr.py         # Mistral OCR: image upload + text extraction
│       └── logger.py            # RotatingFileHandler logger factory
├── enhanced_bert_liar_model/    # BERT fine-tuned on LIAR dataset (fallback)
├── enhanced_bert_welfake_model/ # BERT fine-tuned on WELFake dataset (fallback)
├── frontend/
│   └── src/
│       ├── App.jsx
│       ├── api/index.js
│       ├── context/AuthContext.jsx
│       ├── motion/              # GSAP + Framer Motion helpers
│       └── pages/               # Home, Dashboard, Login, Register
├── logs/                        # Auto-created: rotating app.log
├── Data/WELFake_Dataset.csv
├── Notebook/
│   ├── bert_finetune_notebook.ipynb
│   └── wel-fakebert-finetune-notebook.ipynb
├── run_api.py
├── pyproject.toml
└── README.md
```
---
## 🚀 Production Deployment
```
Browser
  └──▶ Vercel (React/Vite frontend)
         └── VITE_API_URL ──▶ Hugging Face Spaces (FastAPI + BERT + LLM)
                                └── MONGODB_URL ──▶ MongoDB Atlas
```
| Layer | Platform | Plan |
|-------|----------|------|
| Frontend | [Vercel](https://vercel.com) | Free |
| Backend | [Hugging Face Spaces](https://huggingface.co/spaces) | CPU Basic (Free) |
| Database | [MongoDB Atlas](https://cloud.mongodb.com) | M0 Free |
### Deploy your own copy
**Backend (HF Spaces)**
1. Fork this repo and create a new Space (SDK: **Docker**)
2. Copy `app/`, `enhanced_bert_*/`, `run_api.py`, `Dockerfile.huggingface` (rename to `Dockerfile`)
3. Add secrets in Space Settings:
| Secret | Description |
|--------|-------------|
| `MONGODB_URL` | MongoDB Atlas connection string |
| `SECRET_KEY` | JWT signing secret |
| `AI_API_KEY` | LLM API key for the primary fact-checker |
| `MISTRAL_API_KEY` | Mistral API key (for image OCR) |
| `ALLOWED_ORIGINS` | Comma-separated frontend URLs |
**Frontend (Vercel)**
1. Import your GitHub repo → set **Root Directory** to `frontend`
2. Add env var: `VITE_API_URL=https://YOUR_HF_USER-your-space.hf.space/api`
---
## 💻 Local Development
### Prerequisites
- Python 3.11+, Node.js 18+
- [UV](https://github.com/astral-sh/uv) package manager
- MongoDB Atlas account
- LLM API key (for the primary fact-checker)
- Mistral API key (free at [mistral.ai](https://mistral.ai)), used for image OCR
### 1. Install Backend
```bash
git clone <your-repo-url>
cd FinalYearProject
pip install uv
uv sync
```
### 2. Configure Environment
Create `.env` in the project root:
```env
# MongoDB Atlas
MONGODB_URL=mongodb+srv://username:password@cluster.mongodb.net/?retryWrites=true&w=majority
DATABASE_NAME=fake_news_detector
# JWT
SECRET_KEY=your-super-secret-jwt-key-change-in-production
ACCESS_TOKEN_EXPIRE_MINUTES=1440
# LLM API key (primary fact-checker)
AI_API_KEY=your_api_key_here
# Mistral AI (image OCR)
MISTRAL_API_KEY=your_mistral_api_key_here
# News Validation (optional; Google News RSS is free)
NEWSAPI_KEY=your_newsapi_key
SERPAPI_KEY=your_serpapi_key
# CORS
ALLOWED_ORIGINS=http://localhost:5173,http://localhost:3000
```
### 3. Start the Backend
```bash
python run_api.py
```
- API: **http://localhost:8000**
- Swagger: **http://localhost:8000/docs**
### 4. Start the Frontend
```bash
cd frontend
npm install
npm run dev
```
Frontend: **http://localhost:5173**
---
## API Reference
### Authentication
| Method | Endpoint | Rate Limit | Description |
|--------|----------|------------|-------------|
| `POST` | `/api/auth/register` | 3/min | Create a new user account |
| `POST` | `/api/auth/login` | 5/min | Login and receive a JWT token |
| `GET` | `/api/auth/me` | None | Get current authenticated user |
| `GET` | `/api/auth/history` | None | Retrieve prediction history |
| `GET` | `/api/auth/stats` | None | Get total/real/fake counts |
| `POST` | `/api/auth/logout` | None | Logout |
### Predictions (JWT required)
| Method | Endpoint | Rate Limit | Description |
|--------|----------|------------|-------------|
| `POST` | `/api/predict` | 30/min | Analyse a single news headline |
| `POST` | `/api/batch-predict` | 5/min | Analyse up to 10 texts in one call |
| `POST` | `/api/image-predict` | 10/min | OCR + analyse a news screenshot |
| `POST` | `/api/extract-image-text` | 10/min | OCR only (no prediction) |
### Example: Single Prediction
**Request:**
```bash
curl -X POST http://localhost:8000/api/predict \
-H "Authorization: Bearer YOUR_JWT_TOKEN" \
-H "Content-Type: application/json" \
-d '{"title": "Scientists discover new planet in solar system"}'
```
**Response:**
```json
{
"text": "Scientists discover new planet in solar system",
"prediction": "unverified",
"confidence": 0.62,
"probabilities": { "real": 0.62, "fake": 0.38 },
"is_fake": false,
"prediction_source": "llm_primary",
"context_articles_used": 2,
"news_insight": "ℹ️ Limited related news coverage found."
}
```
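
The same call can be made from Python. The sketch below mirrors the curl request using only the standard library; the helper names are hypothetical, and the base URL, the `title` field, and the bearer-token header are taken from the example above.

```python
import json
import urllib.request


def build_predict_request(title: str, token: str,
                          base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Prepare the POST /api/predict request without sending it."""
    body = json.dumps({"title": title}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/predict",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def predict(title: str, token: str, base_url: str = "http://localhost:8000",
            timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_predict_request(title, token, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `predict("Scientists discover new planet in solar system", token)` against a running backend should return a dictionary with the fields shown in the response above. An exhausted rate limit or an expired token surfaces as `urllib.error.HTTPError`.
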
---
## 🔧 Technology Stack
### Backend
| Library | Purpose |
|---------|---------|
| FastAPI | Async REST API framework |
| Uvicorn | ASGI server |
| **google-genai** | **LLM SDK: primary fact-checker** |
| **mistralai** | **Mistral OCR: image text extraction** |
| **slowapi** | **Per-IP API rate limiting** |
| PyTorch | BERT model inference (fallback) |
| Transformers (HuggingFace) | Tokeniser + BERT model architecture |
| Motor | Async MongoDB driver |
| python-jose | JWT token generation & validation |
| passlib[bcrypt] | Password hashing |
| requests + beautifulsoup4 | News RSS scraping |
| newsapi-python | NewsAPI client |
| serpapi | SerpAPI client |
### Frontend
| Library | Purpose |
|---------|---------|
| React 18 | UI component library |
| Vite | Build tool & dev server |
| TailwindCSS 3 | Utility-first styling |
| GSAP + ScrollTrigger | Scroll-driven animations |
| Framer Motion | Page transition system |
| Recharts | Pie chart visualisation |
| Axios | HTTP client with interceptors |
---
## 🤖 Classification Details
### LLM Fact-Checker (Primary)
| Property | Value |
|----------|-------|
| Input | User claim + live news articles (headline, summary, date, URL) |
| Output labels | `REAL` / `FAKE` / `UNVERIFIED` |
| Fallback chain | Multiple model tiers tried automatically on quota errors |
| Context | Receives live Google News articles before deciding |
**UNVERIFIED** is returned when the LLM cannot confirm or deny the claim from available evidence (e.g. very recent events not yet widely reported). It maps to `is_fake: false` with capped confidence (≤ 68%).
**FAKE** is only returned when retrieved articles **directly contradict** the specific factual assertion, not merely because the claim is surprising or uses dramatic language.
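
The label rules above can be expressed as a small mapping function. This is a sketch under stated assumptions: the function and constant names are hypothetical, and clamping confidence to the 0 to 1 range is an added safeguard. The 0.68 cap, the lowercase `prediction` value, and the `is_fake` semantics follow the text and the example response.

```python
UNVERIFIED_CONFIDENCE_CAP = 0.68
VALID_LABELS = {"REAL", "FAKE", "UNVERIFIED"}


def to_response_fields(label: str, confidence: float) -> dict:
    """Map an LLM verdict to the prediction, confidence and is_fake fields."""
    normalised = label.strip().upper()
    if normalised not in VALID_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    confidence = min(max(confidence, 0.0), 1.0)  # keep within [0, 1]
    if normalised == "UNVERIFIED":
        # Inconclusive evidence must never be reported with high certainty.
        confidence = min(confidence, UNVERIFIED_CONFIDENCE_CAP)
    return {
        "prediction": normalised.lower(),
        "confidence": confidence,
        "is_fake": normalised == "FAKE",  # UNVERIFIED is deliberately not fake
    }
```

Treating `UNVERIFIED` as `is_fake: false` means a breaking story with little coverage is shown as uncertain instead of being flagged.
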
### BERT (Fallback)
| Property | Value |
|----------|-------|
| Architecture | BERT (bert-base-uncased) |
| Training | LIAR dataset (binarised) |
| Max token length | 512 |
| Accuracy | ~95% |
| When used | Only when all Gemini models fail |
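
The tiered fallback can be sketched as a loop over callables, with BERT consulted only after every LLM tier has failed. All names here are hypothetical: `LLMUnavailable` stands in for whatever error the SDK raises on quota exhaustion, and `"bert_fallback"` is an assumed source tag. Only `"llm_primary"` appears in the example response.

```python
from typing import Callable, Iterable


class LLMUnavailable(RuntimeError):
    """Hypothetical error raised when a model tier fails (e.g. quota exhausted)."""


def classify_with_fallback(
    claim: str,
    llm_tiers: Iterable[Callable[[str], dict]],
    bert_classifier: Callable[[str], dict],
) -> dict:
    """Try each LLM tier in order; use BERT only if every tier fails."""
    for call_tier in llm_tiers:
        try:
            result = call_tier(claim)
        except LLMUnavailable:
            continue  # move on to the next model tier
        return {**result, "prediction_source": "llm_primary"}
    return {**bert_classifier(claim), "prediction_source": "bert_fallback"}
```

Because the tiers are plain callables, the order of models can be changed through configuration without touching the control flow, and each tier can be stubbed in tests.
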
---
## 🔒 Security
- JWT tokens with configurable expiry (default 24 hours)
- Bcrypt password hashing
- Per-IP rate limiting on all public endpoints
- CORS middleware (configurable via `ALLOWED_ORIGINS`)
- Pydantic input validation on all endpoints
- Environment-variable-driven secrets
---
## 🔧 Environment Variables Reference
| Variable | Required | Description |
|----------|----------|-------------|
| `MONGODB_URL` | ✅ | MongoDB Atlas connection string |
| `DATABASE_NAME` | ✅ | Target database name |
| `SECRET_KEY` | ✅ | Secret used to sign JWT tokens |
| `AI_API_KEY` | ✅ | LLM API key (primary fact-checker) |
| `MISTRAL_API_KEY` | ✅ | Mistral API key (image OCR) |
| `ACCESS_TOKEN_EXPIRE_MINUTES` | ❌ | Token TTL (default: 1440) |
| `NEWSAPI_KEY` | ❌ | NewsAPI key |
| `SERPAPI_KEY` | ❌ | SerpAPI key |
| `ALLOWED_ORIGINS` | ❌ | Comma-separated CORS origins |
| `ENABLE_AI_CHECK` | ❌ | Set `false` to force BERT-only mode |
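
One way to load and validate these variables at start-up is sketched below. The `Settings` dataclass and `load_settings` function are hypothetical names, and treating any value other than `false` as enabled for `ENABLE_AI_CHECK` is an assumption; the variable names, required flags, and the 1440-minute default come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED_VARS = (
    "MONGODB_URL", "DATABASE_NAME", "SECRET_KEY", "AI_API_KEY", "MISTRAL_API_KEY",
)


@dataclass(frozen=True)
class Settings:
    mongodb_url: str
    database_name: str
    secret_key: str
    ai_api_key: str
    mistral_api_key: str
    access_token_expire_minutes: int
    allowed_origins: tuple
    enable_ai_check: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read configuration from the environment, failing fast on missing keys."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    origins = tuple(
        origin.strip()
        for origin in env.get("ALLOWED_ORIGINS", "").split(",")
        if origin.strip()
    )
    return Settings(
        mongodb_url=env["MONGODB_URL"],
        database_name=env["DATABASE_NAME"],
        secret_key=env["SECRET_KEY"],
        ai_api_key=env["AI_API_KEY"],
        mistral_api_key=env["MISTRAL_API_KEY"],
        access_token_expire_minutes=int(env.get("ACCESS_TOKEN_EXPIRE_MINUTES", "1440")),
        allowed_origins=origins,
        enable_ai_check=env.get("ENABLE_AI_CHECK", "true").strip().lower() != "false",
    )
```

Failing at import time with a list of the missing names is usually easier to debug than a connection error that appears on the first request.
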
---
## 📂 Datasets
### LIAR Dataset
| Property | Detail |
|----------|--------|
| **Source** | [W. Wang, 2017](https://aclanthology.org/P17-2067/), UCSB |
| **Size** | ~12,800 labelled statements |
| **Labels** | 6-class → binarised to fake / real |
| **Domain** | Political statements (PolitiFact) |
| **License** | Public domain |
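
The table does not say how the six LIAR classes are collapsed into two, so the sketch below uses one common convention as an assumption: the three least truthful ratings become `fake` and the rest become `real`. The six label strings are the ones used in the LIAR data files; the function name and the split are illustrative and may differ from the training notebook.

```python
# The six truthfulness ratings used in the LIAR dataset.
LIAR_LABELS = ("pants-fire", "false", "barely-true", "half-true", "mostly-true", "true")

# Assumed split: the three least truthful ratings count as fake.
FAKE_LABELS = frozenset({"pants-fire", "false", "barely-true"})


def binarise_liar_label(label: str) -> str:
    """Collapse a 6-class LIAR rating into 'fake' or 'real'."""
    normalised = label.strip().lower()
    if normalised not in LIAR_LABELS:
        raise ValueError(f"unknown LIAR label: {label!r}")
    return "fake" if normalised in FAKE_LABELS else "real"
```

Where `half-true` lands is the main judgment call: moving it to the fake side changes the class balance and therefore the class weights used during training.
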
### WELFake Dataset
| Property | Detail |
|----------|--------|
| **Source** | [Verma et al., 2021](https://doi.org/10.1109/TCSS.2021.3068519) |
| **Size** | 72,134 articles (35,028 real · 37,106 fake) |
| **Domain** | Mixed: Kaggle, Reuters, BuzzFeed |
| **License** | CC BY 4.0 |
---
## 🧪 Training Notebooks
| Notebook | Description |
|----------|-------------|
| `Notebook/bert_finetune_notebook.ipynb` | BERT fine-tuning on LIAR dataset |
| `Notebook/wel-fakebert-finetune-notebook.ipynb` | BERT fine-tuning on WELFake dataset |
---
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/my-feature`
3. Commit: `git commit -m "feat: add my feature"`
4. Push: `git push origin feature/my-feature`
5. Open a Pull Request
---
## 📄 License
MIT License
---
## 🙏 Acknowledgements
- [LIAR Dataset](https://www.cs.ucsb.edu/~william/data/liar_dataset.zip): W. Wang, 2017
- [WELFake Dataset](https://zenodo.org/record/4561253): Verma et al., 2021
- [Hugging Face Transformers](https://huggingface.co/): BERT tokeniser and model utilities
- Primary LLM fact-checker: contextual claim verification against live news
- [Mistral AI](https://mistral.ai/): Image OCR
---
<p align="center">🛡️ Built to fight misinformation - TruthLens</p>