---
title: Proofly API
emoji: 🛡️
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
app_port: 7860
---

# Proofly

An AI-powered claim verification system that gathers evidence from 7 sources (six live, one local), builds a semantic vector index, and uses Natural Language Inference (NLI) to produce a **True / False / Mixture/Uncertain** verdict. It includes user authentication, history tracking, and a responsive UI.

---

## Features

- **JWT Authentication**: Register, login, logout with bcrypt-peppered passwords and HttpOnly cookie tokens
- **Per-User History**: Every fact-check is saved to MongoDB Atlas; view, delete, or clear your history
- **7 Evidence Sources**
  - Static Knowledge Base (local and instant, no network needed)
  - Wikidata (free entity facts, no API key)
  - 12 RSS Feeds (BBC, CNN, Al Jazeera, NYT, The Hindu, NDTV, …)
  - GDELT Project (global news events, no API key)
  - NewsAPI (quality English headlines, requires free API key)
  - Wikipedia REST API (encyclopedic summaries)
  - DuckDuckGo HTML scrape (automatic fallback)
- **AI Pipeline**: `all-MiniLM-L6-v2` for semantic embeddings + FAISS vector search + `facebook/bart-large-mnli` for NLI
- **KB Short-Circuit**: Skips slow live fetches when the knowledge base already has a strong match (≥ 0.65 similarity)
- **Image OCR**: Upload an image → EasyOCR extracts text → auto-fills the claim field
- **Security**: Flask-Talisman security headers, Flask-Limiter rate limiting, JWT blocklist on logout
- **Responsive UI**: Premium dark/light theme, permanent sidebar on all screen sizes

---

## Setup

### Prerequisites
- Python 3.8+
- MongoDB Atlas account (free tier works)
- (Optional) NewsAPI key: https://newsapi.org

### 1. Clone
```bash
git clone https://github.com/yourusername/proofly.git
cd proofly
```

### 2. Install dependencies
```bash
pip install -r requirements.txt
```
> PyTorch + Transformers models (~1–2 GB) download automatically on first run.

### 3. Configure `.env`
Copy `.env.example` to `.env` and fill in:

```env
# MongoDB Atlas
MONGO_URI=mongodb+srv://<user>:<password>@<cluster>.mongodb.net/?appName=<app>
MONGO_DB_NAME=factcheck

# FAISS index file path
FAISS_FILE=faiss.index

# NewsAPI (free key at newsapi.org)
NEWS_API_KEY=your_key_here

# Flask
FLASK_SECRET_KEY=your_long_random_secret_key

# JWT
JWT_SECRET_KEY=your_jwt_secret
JWT_ACCESS_TOKEN_MINS=15
JWT_REFRESH_TOKEN_DAYS=7

# Password pepper: keep secret, never commit
BCRYPT_PEPPER=your_pepper_string

# Bot identity header
USER_AGENT=ProoflyBot/1.0
```
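As a rough sketch of how these variables might be consumed: the project's actual `project/config.py` is not shown here, so the `load_settings` helper, its return shape, and its defaults are assumptions for illustration only. The point it shows is that the token lifetimes arrive as strings and need explicit conversion.

```python
import os
from datetime import timedelta


def load_settings(env=None):
    """Read a few Proofly settings from an environment mapping.

    Hypothetical helper; the defaults mirror the sample .env above.
    """
    env = os.environ if env is None else env
    return {
        "mongo_db_name": env.get("MONGO_DB_NAME", "factcheck"),
        "faiss_file": env.get("FAISS_FILE", "faiss.index"),
        # Environment values are strings, so convert lifetimes explicitly.
        "access_ttl": timedelta(minutes=int(env.get("JWT_ACCESS_TOKEN_MINS", "15"))),
        "refresh_ttl": timedelta(days=int(env.get("JWT_REFRESH_TOKEN_DAYS", "7"))),
        "user_agent": env.get("USER_AGENT", "ProoflyBot/1.0"),
    }


if __name__ == "__main__":
    settings = load_settings({"JWT_ACCESS_TOKEN_MINS": "30"})
    print(settings["access_ttl"])  # 0:30:00
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test without touching the real environment.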

### 4. Initialise MongoDB collections & indexes
```bash
python setup_db.py
```
Creates all 4 collections (`users`, `history`, `evidence`, `revoked_tokens`) with validators and indexes on Atlas.

### 5. Pre-populate evidence index *(recommended before first use)*
```bash
python update_data.py
```
Fetches from all sources across 24 broad topics and builds the FAISS index. Re-run weekly to keep evidence fresh.

### 6. Run
```bash
python app.py
```
Open `http://localhost:5000`, register an account, and start fact-checking.

---

## Project Structure

```
newsXX/
├── app.py                  # Flask app: routes, JWT config, security middleware
├── auth.py                 # Auth Blueprint: register / login / logout / refresh
├── api_wrapper.py          # Per-request pipeline: evidence → FAISS → NLI → verdict
├── model.py                # AI models + 7 evidence fetchers
├── update_data.py          # Offline bulk evidence updater + FAISS index builder
├── knowledge_base.py       # ~80 curated static facts (no network required)
├── setup_db.py             # One-time MongoDB Atlas collection + index setup
├── project/
│   ├── config.py           # All settings from .env (single source of truth)
│   └── database.py         # MongoDB helpers (Borg singleton, CRUD, TTL)
├── templates/
│   ├── index.html          # Dashboard / claim submission
│   ├── results.html        # Verdict + evidence + NLI breakdown
│   ├── history.html        # User claim history
│   ├── login.html          # Login page
│   └── register.html       # Register page
├── static/
│   └── style.css           # Full design system (dark/light theme, responsive)
├── .env                    # Local secrets (never commit)
├── .env.example            # Template
├── requirements.txt        # Python dependencies
└── faiss.index             # Vector index (built by update_data.py)
```

---

## How the Verdict Works

```
Claim → Embed (MiniLM) → Knowledge Base check
                              ↓ if score ≥ 0.65 → skip live fetches
                         Wikidata + RSS + GDELT + NewsAPI + Wikipedia
                              ↓ if < 3 items → DuckDuckGo fallback
                         Build FAISS index
                              ↓
                         Top-5 most similar evidence items
                              ↓
                         NLI (BART-MNLI) on each piece
                              ↓
                    Majority vote → True / False / Mixture/Uncertain
```
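The evidence-gathering part of this flow can be sketched roughly as follows. This is not the code in `model.py` or `api_wrapper.py`: `gather_evidence`, `kb_lookup`, and the fetcher callables are hypothetical stand-ins, and two details are assumptions, namely that the "< 3 items" check counts live results only and that weak knowledge-base matches are kept alongside live results.

```python
KB_THRESHOLD = 0.65   # similarity at which the knowledge base short-circuits
MIN_LIVE_ITEMS = 3    # below this many live items, the fallback is consulted


def gather_evidence(claim, kb_lookup, live_fetchers, fallback_fetcher):
    """Collect evidence texts for a claim.

    kb_lookup(claim) returns (texts, best_score); every fetcher(claim)
    returns a list of texts. All callables are hypothetical stand-ins.
    """
    kb_texts, best_score = kb_lookup(claim)
    if best_score >= KB_THRESHOLD:
        # Strong local match: skip the slow network sources entirely.
        return list(kb_texts)

    evidence = list(kb_texts)
    live_count = 0
    for fetch in live_fetchers:
        try:
            items = fetch(claim)
        except Exception:
            # One failing source should not sink the whole request.
            continue
        evidence.extend(items)
        live_count += len(items)

    if live_count < MIN_LIVE_ITEMS:
        evidence.extend(fallback_fetcher(claim))
    return evidence
```

Injecting the fetchers as callables keeps the control flow testable without any network access.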

| Condition | Verdict |
|---|---|
| More entailment results than contradiction | ✅ **True** |
| More contradiction results than entailment | ❌ **False** |
| Tied or average scores below 0.4 | ⚠️ **Mixture/Uncertain** |
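
A minimal sketch of these voting rules, assuming each NLI result is a `(label, score)` pair and that "average scores" means the mean confidence across all results. The function name `decide_verdict` and that reading of the threshold are assumptions, not taken from the project's code.

```python
def decide_verdict(results, min_avg_score=0.4):
    """Majority vote over NLI results.

    results: list of (label, score) pairs, where label is one of
    "entailment", "contradiction", or "neutral" and score is in [0, 1].
    """
    if not results:
        return "Mixture/Uncertain"

    entail = sum(1 for label, _ in results if label == "entailment")
    contra = sum(1 for label, _ in results if label == "contradiction")
    avg_score = sum(score for _, score in results) / len(results)

    # A tie or weak overall confidence yields the uncertain verdict.
    if entail == contra or avg_score < min_avg_score:
        return "Mixture/Uncertain"
    return "True" if entail > contra else "False"


if __name__ == "__main__":
    sample = [("entailment", 0.9), ("entailment", 0.8), ("contradiction", 0.7)]
    print(decide_verdict(sample))  # True
```

Neutral results do not vote in this sketch, but they still contribute to the average score.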

---

## MongoDB Collections

| Collection | Purpose | Auto-cleanup |
|---|---|---|
| `users` | Accounts with hashed passwords | None |
| `history` | Per-user fact-check records | None |
| `evidence` | Scraped text for FAISS | TTL 30 days |
| `revoked_tokens` | JWT logout blocklist | TTL at token expiry |
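
The two auto-cleanup rules map naturally onto MongoDB TTL indexes. The sketch below shows one way `setup_db.py` might declare them; the helper name `ensure_ttl_indexes` and the field names `created_at` and `expires_at` are assumptions, since the actual schema is not shown here. It takes any pymongo-style database handle and relies only on `create_index` with `expireAfterSeconds`.

```python
THIRTY_DAYS = 30 * 24 * 60 * 60  # seconds


def ensure_ttl_indexes(db):
    """Create the two TTL indexes on a pymongo-style database handle.

    The field names are assumptions; both fields must hold BSON dates
    for MongoDB's TTL monitor to act on them.
    """
    # Evidence documents expire a fixed 30 days after their timestamp.
    db["evidence"].create_index("created_at", expireAfterSeconds=THIRTY_DAYS)
    # Revoked tokens expire at the time stored in the document itself.
    db["revoked_tokens"].create_index("expires_at", expireAfterSeconds=0)
```

With `expireAfterSeconds=0`, each document is removed once the date in its indexed field has passed, which matches the "TTL at token expiry" behaviour in the table.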

---

## Dependencies

| Package | Purpose |
|---|---|
| `flask` | Web framework |
| `flask-jwt-extended` | JWT access + refresh tokens via cookies |
| `flask-bcrypt` | Password hashing |
| `flask-limiter` | Rate limiting on auth endpoints |
| `flask-talisman` | HTTP security headers |
| `pymongo` | MongoDB Atlas driver |
| `python-dotenv` | `.env` loading |
| `sentence-transformers` | MiniLM-L6 embeddings |
| `transformers` | BART-MNLI NLI pipeline |
| `faiss-cpu` | Vector similarity search |
| `requests` | HTTP calls to APIs |
| `beautifulsoup4` | DuckDuckGo HTML scraping |
| `feedparser` | RSS feed parsing |
| `numpy` | Numerical operations |
| `torch` | Deep learning backend |
| `easyocr` | Image OCR |
| `Pillow` | Image processing |

---

## Security Notes

- Passwords are hashed with **bcrypt** plus a server-side **pepper**, so a leaked database alone is not enough to crack them
- JWT tokens are stored in **HttpOnly** cookies, which JavaScript cannot read (this limits token theft via XSS)
- `SameSite=Strict` cookie policy mitigates CSRF
- Rate limiting: 5 login attempts / minute, 3 register attempts / minute per IP
- All security headers enforced by Flask-Talisman
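
To illustrate the peppering idea, here is a standard-library sketch. How Proofly actually combines the pepper with the password is not documented above, so the HMAC-SHA256 step is an assumption, and PBKDF2 is used here purely as a stand-in for bcrypt so the example runs without third-party packages. All function names are hypothetical.

```python
import hashlib
import hmac
import os


def pepper_password(password, pepper):
    """Mix the server-side pepper into the password with HMAC-SHA256."""
    return hmac.new(pepper.encode(), password.encode(), hashlib.sha256).digest()


def hash_password(password, pepper, salt=None):
    """Return (salt, digest). PBKDF2 stands in for bcrypt in this sketch."""
    salt = os.urandom(16) if salt is None else salt
    peppered = pepper_password(password, pepper)
    digest = hashlib.pbkdf2_hmac("sha256", peppered, salt, 100_000)
    return salt, digest


def verify_password(password, pepper, salt, expected):
    """Recompute the digest and compare it in constant time."""
    _, digest = hash_password(password, pepper, salt)
    return hmac.compare_digest(digest, expected)
```

Because the pepper lives in the server environment (`BCRYPT_PEPPER`) rather than in the database, stored salts and digests alone are not enough to test password guesses.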

---

## Contributing

Pull requests welcome. Please open an issue first for major changes.

---

## License

Open-source. See repository for license details.