| ο»Ώ# π‘οΈ TruthLens β BERT-Based Fake News Detector | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
| --- | |
| A full-stack web application that detects fake news using a **large language model (LLM)** as the primary classifier, backed by a fine-tuned BERT transformer model, real-time Google News RSS validation, image OCR analysis, API rate limiting, and a fully animated React interface with MongoDB-backed user authentication. | |
| ## π Live Demo | |
| | | Link | | |
| |---|---| | |
| | **π₯οΈ Frontend (React App)** | **[https://truth-lens-bert-based-fake-news-and.vercel.app](https://truth-lens-bert-based-fake-news-and.vercel.app)** | | |
| | **βοΈ Backend API** | [https://suryakf-truthlens-backend.hf.space](https://suryakf-truthlens-backend.hf.space) | | |
| | **π Swagger / API Docs** | [https://suryakf-truthlens-backend.hf.space/docs](https://suryakf-truthlens-backend.hf.space/docs) | | |
| > The backend runs on **Hugging Face Spaces** (CPU Basic β 2 vCPU, 16 GB RAM). | |
| > The frontend is deployed on **Vercel** with global CDN. | |
| > The database is **MongoDB Atlas** (M0 free cluster). | |
| ## β¨ Features | |
| ### Core Detection Pipeline | |
| - **Fine-tuned BERT (Primary)** β PyTorch BERT model (~95% accuracy) | |
| - **Three-label output** β `REAL` / `FAKE` / `UNVERIFIED`. The LLM outputs UNVERIFIED when evidence is inconclusive, avoiding over-flagging real recent news as fake. | |
| - **Confidence Scoring** β Per-prediction probability distribution visualised as a live pie chart. | |
| - **Batch Analysis** β Submit up to 10 news texts in one request. | |
| ### News Source Validation | |
| - **Google News RSS** β Free real-time headline search (no API key required). Retrieves title, source, publish date, and article description. | |
| - **NewsAPI Integration** β Extended article lookup with source attribution. | |
| - **SerpAPI Integration** β Fallback search-engine news verification. | |
| - **Live context injection** β All retrieved articles (headline + summary + URL + publish date) are passed directly into the LLM's prompt so it cross-references the claim against real-world evidence. | |
| ### Image & OCR | |
| - **Screenshot Upload** β Paste or upload a screenshot of a news headline/article. | |
| - **Mistral OCR** β Extracts title, body text, source, and date from the image. | |
| - **Same pipeline as text** β After OCR, the extracted headline goes through the same LLM-primary flow (news search β LLM with context β BERT fallback). | |
| ### Rate Limiting | |
| API rate limits enforced via **slowapi** (per client IP): | |
| | Endpoint | Limit | | |
| |---|---| | |
| | `POST /api/predict` | 30 / minute | | |
| | `POST /api/batch-predict` | 5 / minute | | |
| | `POST /api/image-predict` | 10 / minute | | |
| | `POST /api/extract-image-text` | 10 / minute | | |
| | `POST /api/auth/login` | 5 / minute | | |
| | `POST /api/auth/register` | 3 / minute | | |
| ### Authentication & History | |
| - **JWT Authentication** β 24-hour access tokens, bcrypt-hashed passwords. | |
| - **Prediction History** β Every analysis stored with timestamp and label in MongoDB. | |
| - **User Dashboard** β Live stats, streak counter, accuracy breakdown. | |
| ### Developer Experience | |
| - **Rotating Log Files** β All API activity written to `logs/app.log` (10 MB cap, 5 backups). | |
| - **Swagger / ReDoc** β Auto-generated interactive API docs at `/docs` and `/redoc`. | |
| - **Environment-Driven Config** β All secrets via `.env`. | |
| --- | |
| ## ποΈ Architecture | |
| ``` | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β FRONTEND (React + Vite) β | |
| β Home β Login β Register β Dashboard β | |
| β GSAP ScrollTrigger Β· Framer Motion Β· TailwindCSS Β· Recharts β | |
| ββββββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββ | |
| β HTTPS / JWT | |
| ββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββββββββββββββ | |
| β BACKEND (FastAPI) β | |
| β Rate Limiting (slowapi) β Logging Middleware β logs/app.log β | |
| β /api/predict /api/batch-predict /api/image-predict β | |
| ββββββββ¬βββββββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββ | |
| β β | |
| βΌ STEP 1 βΌ STEP 1 (image) | |
| βββββββββββββββββββ βββββββββββββββββββββββ | |
| β News Validator β β Mistral OCR β | |
| β Google News RSS β β Extracts title + β | |
| β NewsAPI β β text from image β | |
| β SerpAPI β ββββββββββββ¬βββββββββββ | |
| ββββββββββ¬βββββββββ β | |
| β articles (title+desc+date+url) β extracted headline | |
| βΌ STEP 2 (PRIMARY) βΌ STEP 2 (PRIMARY) | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β LLM Fact-Checker β | |
| β Primary model β Fallback 1 β Fallback 2 β | |
| β Output: REAL / FAKE / UNVERIFIED + confidence + reasoning β | |
| βββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββ | |
| β (only if ALL Gemini models fail) | |
| βΌ STEP 3 (FALLBACK) | |
| ββββββββββββββββββββββββββ | |
| β Fine-tuned BERT β | |
| β PyTorch + HF ~95% acc β | |
| ββββββββββββββββββββββββββ | |
| β | |
| βββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββ | |
| β MongoDB Atlas (Motor async) β | |
| β users collection Β· predictions collection β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| ``` | |
| ### Hybrid Model Architecture (Mermaid) | |
| ```mermaid | |
| flowchart TD | |
| A[Input Text] --> B[Tokenizer<br/>bert-base-uncased] | |
| B --> C[input_ids, attention_mask] | |
| C --> D[BERT Encoder<br/>Hidden Size: 768] | |
| D --> E[Dropout] | |
| E --> F[BiLSTM<br/>2 layers, hidden=256, bidirectional] | |
| F --> G[LayerNorm<br/>Output dim: 512] | |
| G --> H[Multi-Head Self-Attention<br/>8 heads] | |
| H --> I[Global Max Pooling<br/>across sequence] | |
| I --> J[MLP Classifier] | |
| J --> J1[Linear 512->256 + ReLU + Dropout] | |
| J1 --> J2[Linear 256->128 + ReLU + Dropout] | |
| J2 --> J3[Linear 128->2] | |
| J3 --> K[Logits: Real vs Fake] | |
| K --> L[Softmax / Argmax Prediction] | |
| subgraph Training | |
| M[CrossEntropyLoss<br/>class weights + label smoothing] | |
| N[AdamW + LR Scheduler<br/>Warmup + Weight Decay] | |
| O[Early Stopping<br/>monitor val F1] | |
| end | |
| J3 --> M | |
| M --> N | |
| N --> O | |
| ``` | |
| --- | |
| ## π Project Structure | |
| ``` | |
| FinalYearProject/ | |
| βββ app/ | |
| β βββ main.py # FastAPI app, CORS, rate limiter, logging middleware | |
| β βββ auth.py # JWT token logic, bcrypt helpers | |
| β βββ database.py # Motor async MongoDB client | |
| β βββ limiter.py # Shared slowapi Limiter instance | |
| β βββ api/ | |
| β β βββ routes.py # Prediction endpoints (/api/predict, /api/batch-predict, /api/image-predict) | |
| β β βββ auth_routes.py # Auth endpoints (/api/auth/*) | |
| β βββ models/ | |
| β β βββ bert_model.py # BERT inference wrapper (fallback only) | |
| β βββ schemas/ | |
| β β βββ prediction.py # Pydantic request/response models | |
| β β βββ auth.py # User & token schemas | |
| β βββ utils/ | |
| β βββ ai_verification.py # LLM fact-checker (primary classifier) | |
| β βββ news_validator.py # Multi-source news validation + RSS parser | |
| β βββ image_ocr.py # Mistral OCR β image upload + text extraction | |
| β βββ logger.py # RotatingFileHandler logger factory | |
| βββ enhanced_bert_liar_model/ # BERT fine-tuned on LIAR dataset (fallback) | |
| βββ enhanced_bert_welfake_model/ # BERT fine-tuned on WELFake dataset (fallback) | |
| βββ frontend/ | |
| β βββ src/ | |
| β βββ App.jsx | |
| β βββ api/index.js | |
| β βββ context/AuthContext.jsx | |
| β βββ motion/ # GSAP + Framer Motion helpers | |
| β βββ pages/ # Home, Dashboard, Login, Register | |
| βββ logs/ # Auto-created β rotating app.log | |
| βββ Data/WELFake_Dataset.csv | |
| βββ Notebook/ | |
| β βββ bert_finetune_notebook.ipynb | |
| β βββ wel-fakebert-finetune-notebook.ipynb | |
| βββ run_api.py | |
| βββ pyproject.toml | |
| βββ README.md | |
| ``` | |
| --- | |
| ## π Production Deployment | |
| ``` | |
| Browser | |
| ββββΆ Vercel (React/Vite frontend) | |
| βββ VITE_API_URL βββΆ Hugging Face Spaces (FastAPI + BERT + LLM) | |
| βββ MONGODB_URL βββΆ MongoDB Atlas | |
| ``` | |
| | Layer | Platform | Plan | | |
| |-------|----------|------| | |
| | Frontend | [Vercel](https://vercel.com) | Free | | |
| | Backend | [Hugging Face Spaces](https://huggingface.co/spaces) | CPU Basic (Free) | | |
| | Database | [MongoDB Atlas](https://cloud.mongodb.com) | M0 Free | | |
| ### Deploy your own copy | |
| **Backend (HF Spaces)** | |
| 1. Fork this repo and create a new Space (SDK: **Docker**) | |
| 2. Copy `app/`, `enhanced_bert_*/`, `run_api.py`, `Dockerfile.huggingface` (rename to `Dockerfile`) | |
| 3. Add secrets in Space Settings: | |
| | Secret | Description | | |
| |--------|-------------| | |
| | `MONGODB_URL` | MongoDB Atlas connection string | | |
| | `SECRET_KEY` | JWT signing secret | | |
| | `AI_API_KEY` | LLM API key for the primary fact-checker | | |
| | `MISTRAL_API_KEY` | Mistral API key (for image OCR) | | |
| | `ALLOWED_ORIGINS` | Comma-separated frontend URLs | | |
| **Frontend (Vercel)** | |
| 1. Import your GitHub repo β set **Root Directory** to `frontend` | |
| 2. Add env var: `VITE_API_URL=https://YOUR_HF_USER-your-space.hf.space/api` | |
| --- | |
| ## π» Local Development | |
| ### Prerequisites | |
| - Python 3.11+, Node.js 18+ | |
| - [UV](https://github.com/astral-sh/uv) package manager | |
| - MongoDB Atlas account | |
| - LLM API key (for the primary fact-checker) | |
| - Mistral API key (free at [mistral.ai](https://mistral.ai)) β for image OCR | |
| ### 1. Install Backend | |
| ```bash | |
| git clone <your-repo-url> | |
| cd FinalYearProject | |
| pip install uv | |
| uv sync | |
| ``` | |
| ### 2. Configure Environment | |
| Create `.env` in the project root: | |
| ```env | |
| # MongoDB Atlas | |
| MONGODB_URL=mongodb+srv://username:password@cluster.mongodb.net/?retryWrites=true&w=majority | |
| DATABASE_NAME=fake_news_detector | |
| # JWT | |
| SECRET_KEY=your-super-secret-jwt-key-change-in-production | |
| ACCESS_TOKEN_EXPIRE_MINUTES=1440 | |
| # LLM API key (primary fact-checker) | |
| AI_API_KEY=your_api_key_here | |
| # Mistral AI (image OCR) | |
| MISTRAL_API_KEY=your_mistral_api_key_here | |
| # News Validation (optional β Google News RSS is free) | |
| NEWSAPI_KEY=your_newsapi_key | |
| SERPAPI_KEY=your_serpapi_key | |
| # CORS | |
| ALLOWED_ORIGINS=http://localhost:5173,http://localhost:3000 | |
| ``` | |
| ### 3. Start the Backend | |
| ```bash | |
| python run_api.py | |
| ``` | |
| - API: **http://localhost:8000** | |
| - Swagger: **http://localhost:8000/docs** | |
| ### 4. Start the Frontend | |
| ```bash | |
| cd frontend | |
| npm install | |
| npm run dev | |
| ``` | |
| Frontend: **http://localhost:5173** | |
| --- | |
| ## π API Reference | |
| ### Authentication | |
| | Method | Endpoint | Rate Limit | Description | | |
| |--------|----------|------------|-------------| | |
| | `POST` | `/api/auth/register` | 3/min | Create a new user account | | |
| | `POST` | `/api/auth/login` | 5/min | Login and receive a JWT token | | |
| | `GET` | `/api/auth/me` | β | Get current authenticated user | | |
| | `GET` | `/api/auth/history` | β | Retrieve prediction history | | |
| | `GET` | `/api/auth/stats` | β | Get total/real/fake counts | | |
| | `POST` | `/api/auth/logout` | β | Logout | | |
| ### Predictions (JWT required) | |
| | Method | Endpoint | Rate Limit | Description | | |
| |--------|----------|------------|-------------| | |
| | `POST` | `/api/predict` | 30/min | Analyse a single news headline | | |
| | `POST` | `/api/batch-predict` | 5/min | Analyse up to 10 texts in one call | | |
| | `POST` | `/api/image-predict` | 10/min | OCR + analyse a news screenshot | | |
| | `POST` | `/api/extract-image-text` | 10/min | OCR only (no prediction) | | |
| ### Example β Single Prediction | |
| **Request:** | |
| ```bash | |
| curl -X POST http://localhost:8000/api/predict \ | |
| -H "Authorization: Bearer YOUR_JWT_TOKEN" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"title": "Scientists discover new planet in solar system"}' | |
| ``` | |
| **Response:** | |
| ```json | |
| { | |
| "text": "Scientists discover new planet in solar system", | |
| "prediction": "unverified", | |
| "confidence": 0.62, | |
| "probabilities": { "real": 0.62, "fake": 0.38 }, | |
| "is_fake": false, | |
| "prediction_source": "llm_primary", | |
| "context_articles_used": 2, | |
| "news_insight": "βΉοΈ Limited related news coverage found." | |
| } | |
| ``` | |
| --- | |
| ## π§ Technology Stack | |
| ### Backend | |
| | Library | Purpose | | |
| |---------|---------| | |
| | FastAPI | Async REST API framework | | |
| | Uvicorn | ASGI server | | |
| | **google-genai** | **LLM SDK β primary fact-checker** | | |
| | **mistralai** | **Mistral OCR β image text extraction** | | |
| | **slowapi** | **Per-IP API rate limiting** | | |
| | PyTorch | BERT model inference (fallback) | | |
| | Transformers (HuggingFace) | Tokeniser + BERT model architecture | | |
| | Motor | Async MongoDB driver | | |
| | python-jose | JWT token generation & validation | | |
| | passlib[bcrypt] | Password hashing | | |
| | requests + beautifulsoup4 | News RSS scraping | | |
| | newsapi-python | NewsAPI client | | |
| | serpapi | SerpAPI client | | |
| ### Frontend | |
| | Library | Purpose | | |
| |---------|---------| | |
| | React 18 | UI component library | | |
| | Vite | Build tool & dev server | | |
| | TailwindCSS 3 | Utility-first styling | | |
| | GSAP + ScrollTrigger | Scroll-driven animations | | |
| | Framer Motion | Page transition system | | |
| | Recharts | Pie chart visualisation | | |
| | Axios | HTTP client with interceptors | | |
| --- | |
| ## π€ Classification Details | |
| ### LLM Fact-Checker (Primary) | |
| | Property | Value | | |
| |----------|-------| | |
| | Input | User claim + live news articles (headline, summary, date, URL) | | |
| | Output labels | `REAL` / `FAKE` / `UNVERIFIED` | | |
| | Fallback chain | Multiple model tiers tried automatically on quota errors | | |
| | Context | Receives live Google News articles before deciding | | |
| **UNVERIFIED** is returned when the LLM cannot confirm or deny the claim from available evidence (e.g. very recent events not yet widely reported). It maps to `is_fake: false` with capped confidence (β€ 68%). | |
| **FAKE** is only returned when retrieved articles **directly contradict** the specific factual assertion β not merely because the claim is surprising or uses dramatic language. | |
| ### BERT (Fallback) | |
| | Property | Value | | |
| |----------|-------| | |
| | Architecture | BERT (bert-base-uncased) | | |
| | Training | LIAR dataset (binarised) | | |
| | Max token length | 512 | | |
| | Accuracy | ~95% | | |
| | When used | Only when all Gemini models fail | | |
| --- | |
| ## π Security | |
| - JWT tokens with configurable expiry (default 24 hours) | |
| - Bcrypt password hashing | |
| - Per-IP rate limiting on all public endpoints | |
| - CORS middleware (configurable via `ALLOWED_ORIGINS`) | |
| - Pydantic input validation on all endpoints | |
| - Environment-variable-driven secrets | |
| --- | |
| ## π§ Environment Variables Reference | |
| | Variable | Required | Description | | |
| |----------|----------|-------------| | |
| | `MONGODB_URL` | β | MongoDB Atlas connection string | | |
| | `DATABASE_NAME` | β | Target database name | | |
| | `SECRET_KEY` | β | Secret used to sign JWT tokens | | |
| | `AI_API_KEY` | β | LLM API key (primary fact-checker) | | |
| | `MISTRAL_API_KEY` | β | Mistral API key (image OCR) | | |
| | `ACCESS_TOKEN_EXPIRE_MINUTES` | β | Token TTL (default: 1440) | | |
| | `NEWSAPI_KEY` | β | NewsAPI key | | |
| | `SERPAPI_KEY` | β | SerpAPI key | | |
| | `ALLOWED_ORIGINS` | β | Comma-separated CORS origins | | |
| | `ENABLE_AI_CHECK` | β | Set `false` to force BERT-only mode | | |
| --- | |
| ## π Datasets | |
| ### LIAR Dataset | |
| | Property | Detail | | |
| |----------|--------| | |
| | **Source** | [W. Wang, 2017](https://aclanthology.org/P17-2067/) β UCSB | | |
| | **Size** | ~12,800 labelled statements | | |
| | **Labels** | 6-class β binarised to fake / real | | |
| | **Domain** | Political statements (PolitiFact) | | |
| | **License** | Public domain | | |
| ### WELFake Dataset | |
| | Property | Detail | | |
| |----------|--------| | |
| | **Source** | [Verma et al., 2021](https://doi.org/10.1109/TVCG.2021.3071339) | | |
| | **Size** | 72,134 articles (35,028 fake Β· 37,106 real) | | |
| | **Domain** | Mixed: Kaggle, Reuters, BuzzFeed | | |
| | **License** | CC BY 4.0 | | |
| --- | |
| ## π§ͺ Training Notebooks | |
| | Notebook | Description | | |
| |----------|-------------| | |
| | `Notebook/bert_finetune_notebook.ipynb` | BERT fine-tuning on LIAR dataset | | |
| | `Notebook/wel-fakebert-finetune-notebook.ipynb` | BERT fine-tuning on WELFake dataset | | |
| --- | |
| ## π€ Contributing | |
| 1. Fork the repository | |
| 2. Create a feature branch: `git checkout -b feature/my-feature` | |
| 3. Commit: `git commit -m "feat: add my feature"` | |
| 4. Push: `git push origin feature/my-feature` | |
| 5. Open a Pull Request | |
| --- | |
| ## π License | |
| MIT License | |
| --- | |
| ## π Acknowledgements | |
| - [LIAR Dataset](https://www.cs.ucsb.edu/~william/data/liar_dataset.zip) β W. Wang, 2017 | |
| - [WELFake Dataset](https://zenodo.org/record/4561253) β Verma et al., 2021 | |
| - [Hugging Face Transformers](https://huggingface.co/) β BERT tokeniser and model utilities | |
| - Primary LLM fact-checker β contextual claim verification against live news | |
| - [Mistral AI](https://mistral.ai/) β Image OCR | |
| --- | |
| <p align="center">π‘οΈ Built to fight misinformation β TruthLens</p> | |