---
title: Proofly API
emoji: 🛡️
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
app_port: 7860
---
# Proofly
An AI-powered claim verification system that gathers evidence from 7 sources, builds a semantic vector index, and uses Natural Language Inference (NLI) to produce a **True / False / Mixture/Uncertain** verdict. It includes full user authentication, history tracking, and a premium responsive UI.
---
## Features
- **JWT Authentication**: Register, login, and logout with bcrypt-hashed, peppered passwords and HttpOnly cookie tokens
- **Per-User History**: Every fact-check is saved to MongoDB Atlas; view, delete, or clear your history
- **7 Evidence Sources**
  - Static Knowledge Base (local, instant, no network needed)
  - Wikidata (free entity facts, no API key)
  - 12 RSS Feeds (BBC, CNN, Al Jazeera, NYT, The Hindu, NDTV, …)
  - GDELT Project (global news events, no API key)
  - NewsAPI (quality English headlines, requires free API key)
  - Wikipedia REST API (encyclopedic summaries)
  - DuckDuckGo HTML scrape (automatic fallback)
- **AI Pipeline**: `all-MiniLM-L6-v2` for semantic embeddings + FAISS vector search + `facebook/bart-large-mnli` for NLI
- **KB Short-Circuit**: Skips slow live fetches when the knowledge base already has a strong match (≥ 0.65 similarity)
- **Image OCR**: Upload an image → EasyOCR extracts text → auto-fills the claim field
- **Security**: Flask-Talisman security headers, Flask-Limiter rate limiting, JWT blocklist on logout
- **Responsive UI**: Premium dark/light theme, permanent sidebar on all screen sizes
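
The KB short-circuit and the DuckDuckGo fallback listed above come down to a small piece of control flow. The following is a minimal sketch, not the project's actual code: `gather_evidence`, `kb_lookup`, `live_fetchers`, and `fallback_fetcher` are hypothetical names, and only the 0.65 threshold and the "fewer than 3 items" rule come from this README.

```python
from typing import Callable, List, Sequence, Tuple

KB_THRESHOLD = 0.65  # similarity at which the knowledge base is trusted (from this README)
MIN_EVIDENCE = 3     # below this many items, the fallback source is tried (from this README)


def gather_evidence(
    claim: str,
    kb_lookup: Callable[[str], Tuple[List[str], float]],
    live_fetchers: Sequence[Callable[[str], List[str]]],
    fallback_fetcher: Callable[[str], List[str]],
) -> List[str]:
    """Collect evidence texts for a claim (hypothetical helper)."""
    kb_items, kb_score = kb_lookup(claim)
    if kb_score >= KB_THRESHOLD:
        # Strong local match: skip every network call.
        return list(kb_items)

    # Assumption: weak knowledge-base matches are discarded rather than mixed in.
    evidence: List[str] = []
    for fetch in live_fetchers:
        try:
            evidence.extend(fetch(claim))
        except Exception:
            # One failing source should not abort the whole check.
            continue

    if len(evidence) < MIN_EVIDENCE:
        evidence.extend(fallback_fetcher(claim))
    return evidence
```

Passing the fetchers in as callables keeps the control flow testable without network access.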
---
## Setup
### Prerequisites
- Python 3.8+
- MongoDB Atlas account (free tier works)
- (Optional) NewsAPI key: https://newsapi.org
### 1. Clone
```bash
git clone https://github.com/yourusername/proofly.git
cd proofly
```
### 2. Install dependencies
```bash
pip install -r requirements.txt
```
> PyTorch + Transformers models (~1–2 GB) download automatically on first run.
### 3. Configure `.env`
Copy `.env.example` to `.env` and fill in:
```env
# MongoDB Atlas
MONGO_URI=mongodb+srv://<user>:<password>@<cluster>.mongodb.net/?appName=<app>
MONGO_DB_NAME=factcheck
# FAISS index file path
FAISS_FILE=faiss.index
# NewsAPI (free key at newsapi.org)
NEWS_API_KEY=your_key_here
# Flask
FLASK_SECRET_KEY=your_long_random_secret_key
# JWT
JWT_SECRET_KEY=your_jwt_secret
JWT_ACCESS_TOKEN_MINS=15
JWT_REFRESH_TOKEN_DAYS=7
# Password pepper: keep secret, never commit
BCRYPT_PEPPER=your_pepper_string
# Bot identity header
USER_AGENT=ProoflyBot/1.0
```
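
To illustrate how these variables might be consumed, here is a minimal sketch of a settings loader. It assumes the `.env` file has already been loaded into the process environment (for example by `python-dotenv`). The `Settings` class and `load_settings` function are hypothetical and may not match `project/config.py`; treating the example values above as defaults is also an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    mongo_uri: str
    mongo_db_name: str
    faiss_file: str
    news_api_key: Optional[str]  # NewsAPI is optional per the prerequisites
    flask_secret_key: str
    jwt_secret_key: str
    jwt_access_token_mins: int
    jwt_refresh_token_days: int
    bcrypt_pepper: str
    user_agent: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, failing fast on missing secrets."""

    def required(name: str) -> str:
        value = env.get(name, "").strip()
        if not value:
            raise RuntimeError(f"Missing required setting: {name}")
        return value

    return Settings(
        mongo_uri=required("MONGO_URI"),
        mongo_db_name=env.get("MONGO_DB_NAME", "factcheck"),
        faiss_file=env.get("FAISS_FILE", "faiss.index"),
        news_api_key=env.get("NEWS_API_KEY") or None,
        flask_secret_key=required("FLASK_SECRET_KEY"),
        jwt_secret_key=required("JWT_SECRET_KEY"),
        # Numeric values arrive as strings and are converted here.
        jwt_access_token_mins=int(env.get("JWT_ACCESS_TOKEN_MINS", "15")),
        jwt_refresh_token_days=int(env.get("JWT_REFRESH_TOKEN_DAYS", "7")),
        bcrypt_pepper=required("BCRYPT_PEPPER"),
        user_agent=env.get("USER_AGENT", "ProoflyBot/1.0"),
    )
```

Failing at startup on a missing secret is usually easier to debug than a failure on the first login request.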
### 4. Initialise MongoDB collections & indexes
```bash
python setup_db.py
```
Creates all 4 collections (`users`, `history`, `evidence`, `revoked_tokens`) with validators and indexes on Atlas.
### 5. Pre-populate evidence index *(recommended before first use)*
```bash
python update_data.py
```
Fetches from all sources across 24 broad topics and builds the FAISS index. Re-run weekly to keep evidence fresh.
### 6. Run
```bash
python app.py
```
Open `http://localhost:5000`, register an account, and start fact-checking.
---
## Project Structure
```
proofly/
├── app.py              # Flask app: routes, JWT config, security middleware
├── auth.py             # Auth Blueprint: register / login / logout / refresh
├── api_wrapper.py      # Per-request pipeline: evidence → FAISS → NLI → verdict
├── model.py            # AI models + 7 evidence fetchers
├── update_data.py      # Offline bulk evidence updater + FAISS index builder
├── knowledge_base.py   # ~80 curated static facts (no network required)
├── setup_db.py         # One-time MongoDB Atlas collection + index setup
├── project/
│   ├── config.py       # All settings from .env (single source of truth)
│   └── database.py     # MongoDB helpers (Borg singleton, CRUD, TTL)
├── templates/
│   ├── index.html      # Dashboard / claim submission
│   ├── results.html    # Verdict + evidence + NLI breakdown
│   ├── history.html    # User claim history
│   ├── login.html      # Login page
│   └── register.html   # Register page
├── static/
│   └── style.css       # Full design system (dark/light theme, responsive)
├── .env                # Local secrets (never commit)
├── .env.example        # Template
├── requirements.txt    # Python dependencies
└── faiss.index         # Vector index (built by update_data.py)
```
---
## How the Verdict Works
```
Claim → Embed (MiniLM) → Knowledge Base check
    ↓ if score ≥ 0.65 → skip live fetches
Wikidata + RSS + GDELT + NewsAPI + Wikipedia
    ↓ if < 3 items → DuckDuckGo fallback
Build FAISS index
    ↓
Top-5 most similar evidence items
    ↓
NLI (BART-MNLI) on each piece
    ↓
Majority vote → True / False / Mixture/Uncertain
```
| Condition | Verdict |
|---|---|
| More entailment results than contradiction | ✅ **True** |
| More contradiction results than entailment | ❌ **False** |
| Tied or average scores below 0.4 | ⚠️ **Mixture/Uncertain** |
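
The voting rule in the table can be sketched as a small function. This is an illustration under assumptions, not the project's code: `decide_verdict` is a hypothetical name, each evidence item is assumed to yield one `(label, score)` pair from the NLI model, and "average scores" is read as the mean confidence across those pairs.

```python
from typing import List, Tuple

UNCERTAIN = "Mixture/Uncertain"
MIN_AVG_SCORE = 0.4  # threshold from the table above


def decide_verdict(results: List[Tuple[str, float]]) -> str:
    """Map per-evidence NLI results to a verdict.

    Each result is (label, score), where label is 'entailment',
    'contradiction' or 'neutral' and score is the model's confidence.
    """
    if not results:
        # Assumption: no evidence means no verdict either way.
        return UNCERTAIN

    average = sum(score for _, score in results) / len(results)
    if average < MIN_AVG_SCORE:
        return UNCERTAIN

    entailments = sum(1 for label, _ in results if label == "entailment")
    contradictions = sum(1 for label, _ in results if label == "contradiction")
    if entailments > contradictions:
        return "True"
    if contradictions > entailments:
        return "False"
    return UNCERTAIN
```

Neutral results do not vote in this sketch; they only affect the average confidence.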
---
## MongoDB Collections
| Collection | Purpose | Auto-cleanup |
|---|---|---|
| `users` | Accounts with hashed passwords | None |
| `history` | Per-user fact-check records | None |
| `evidence` | Scraped text for FAISS | TTL 30 days |
| `revoked_tokens` | JWT logout blocklist | TTL at token expiry |
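
The two auto-cleanup rules map naturally onto MongoDB TTL indexes. The sketch below shows one way they could be declared through pymongo's `create_index`. The helper name `ensure_ttl_indexes` and the field names `fetched_at` and `expires_at` are assumptions, and `setup_db.py` may do this differently.

```python
THIRTY_DAYS_SECONDS = 30 * 24 * 60 * 60


def ensure_ttl_indexes(db) -> None:
    """Create TTL indexes on a pymongo Database (or any mapping of collections).

    Both indexed fields (hypothetical names) must hold datetime values,
    otherwise MongoDB never expires the documents.
    """
    # Evidence expires 30 days after the time stored in `fetched_at`.
    db["evidence"].create_index("fetched_at", expireAfterSeconds=THIRTY_DAYS_SECONDS)
    # With expireAfterSeconds=0, a document expires at the exact time stored
    # in `expires_at`, i.e. when the revoked token itself would have expired.
    db["revoked_tokens"].create_index("expires_at", expireAfterSeconds=0)
```

With pymongo this would be called as `ensure_ttl_indexes(MongoClient(uri)["factcheck"])`. MongoDB's TTL monitor runs periodically in the background, so expired documents may linger briefly before they are removed.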
---
## Dependencies
| Package | Purpose |
|---|---|
| `flask` | Web framework |
| `flask-jwt-extended` | JWT access + refresh tokens via cookies |
| `flask-bcrypt` | Password hashing |
| `flask-limiter` | Rate limiting on auth endpoints |
| `flask-talisman` | HTTP security headers |
| `pymongo` | MongoDB Atlas driver |
| `python-dotenv` | `.env` loading |
| `sentence-transformers` | MiniLM-L6 embeddings |
| `transformers` | BART-MNLI NLI pipeline |
| `faiss-cpu` | Vector similarity search |
| `requests` | HTTP calls to APIs |
| `beautifulsoup4` | DuckDuckGo HTML scraping |
| `feedparser` | RSS feed parsing |
| `numpy` | Numerical operations |
| `torch` | Deep learning backend |
| `easyocr` | Image OCR |
| `Pillow` | Image processing |
---
## Security Notes
- Passwords are hashed with **bcrypt** plus a server-side **pepper**, so a leaked database alone is not enough to brute-force them
- JWT tokens are stored in **HttpOnly** cookies, which JavaScript cannot read; this limits token theft via XSS
- The `SameSite=Strict` cookie policy mitigates CSRF
- Rate limiting: 5 login attempts / minute, 3 register attempts / minute per IP
- HTTP security headers are set by Flask-Talisman
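
As an illustration of the pepper idea, one common approach is to HMAC the password with the pepper before handing the result to bcrypt. The sketch below shows only that peppering step using the standard library. `apply_pepper` is a hypothetical helper, and the README does not say which peppering scheme the project actually uses.

```python
import base64
import hashlib
import hmac


def apply_pepper(password: str, pepper: str) -> bytes:
    """Mix a server-side pepper into a password before it is hashed."""
    digest = hmac.new(
        pepper.encode("utf-8"), password.encode("utf-8"), hashlib.sha256
    ).digest()
    # Base64 keeps the output free of NUL bytes and 44 bytes long,
    # which stays under bcrypt's 72-byte input limit.
    return base64.b64encode(digest)


# The peppered bytes would then go to bcrypt, for example with the `bcrypt` package:
#   hashed = bcrypt.hashpw(apply_pepper(password, pepper), bcrypt.gensalt())
#   ok = bcrypt.checkpw(apply_pepper(candidate, pepper), hashed)
```

Because the pepper lives only in the server environment (`BCRYPT_PEPPER`), an attacker holding just the database dump cannot recompute these inputs.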
---
## Contributing
Pull requests welcome. Please open an issue first for major changes.
---
## License
Open-source. See repository for license details.