# VeriFile-X Architecture
## System Overview
VeriFile-X is a privacy-preserving digital forensics platform that analyzes images for authenticity using statistical analysis and metadata extraction.
```
┌─────────────────────────────────────────────────────────────┐
│                           Client                            │
│                      (Browser / curl)                       │
└──────────────────────────────┬──────────────────────────────┘
                               │ HTTPS
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       FastAPI Backend                       │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Rate Limiter (10 req/min)                            │  │
│  └───────────────────────────┬───────────────────────────┘  │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  API Routes (/api/v1/...)                             │  │
│  │  • /upload/validate - File validation                 │  │
│  │  • /analyze/image   - Forensic analysis               │  │
│  └───────────────────────────┬───────────────────────────┘  │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Validation Layer                                     │  │
│  │  • MIME type check (python-magic)                     │  │
│  │  • Size limit (10MB)                                  │  │
│  │  • Content-type header validation                     │  │
│  └───────────────────────────┬───────────────────────────┘  │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  SHA-256 Hash Cache                                   │  │
│  │  • Check for duplicate (cache hit)                    │  │
│  │  • Return cached result if found                      │  │
│  └───────────────────────────┬───────────────────────────┘  │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Forensics Analysis Pipeline                          │  │
│  │                                                       │  │
│  │  ┌─────────────────────────────────────────────────┐  │  │
│  │  │ 1. Metadata Extraction (EXIF, GPS)              │  │  │
│  │  │    - Camera make/model                          │  │  │
│  │  │    - GPS coordinates                            │  │  │
│  │  │    - Software used                              │  │  │
│  │  │    - Timestamps                                 │  │  │
│  │  └─────────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────────┐  │  │
│  │  │ 2. Hash Generation                              │  │  │
│  │  │    - SHA-256 (cryptographic)                    │  │  │
│  │  │    - MD5 (legacy)                               │  │  │
│  │  │    - Perceptual hash (similarity)               │  │  │
│  │  └─────────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────────┐  │  │
│  │  │ 3. AI Detection (Statistical)                   │  │  │
│  │  │    - Noise pattern analysis (Laplacian)         │  │  │
│  │  │    - Frequency domain (2D FFT)                  │  │  │
│  │  │    - JPEG artifacts (DCT blocks)                │  │  │
│  │  │    - Color distribution (HSV entropy)           │  │  │
│  │  └─────────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────────┐  │  │
│  │  │ 4. Tampering Detection                          │  │  │
│  │  │    - Missing EXIF indicators                    │  │  │
│  │  │    - Software manipulation traces               │  │  │
│  │  │    - AI generation signatures                   │  │  │
│  │  └─────────────────────────────────────────────────┘  │  │
│  │                                                       │  │
│  └───────────────────────────┬───────────────────────────┘  │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Response Generation                                  │  │
│  │  • Compile forensic report (JSON)                     │  │
│  │  • Cache result for duplicates                        │  │
│  │  • Return to client                                   │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```
## Core Components
### 1. API Layer (`backend/api/`)
- **Purpose:** HTTP endpoint routing and request handling
- **Technology:** FastAPI (async ASGI)
- **Security:** Rate limiting (slowapi), CORS, input validation
- **Routes:**
  - `POST /api/v1/upload/validate` - Fast file validation
  - `POST /api/v1/analyze/image` - Full forensic analysis
### 2. Validation Layer (`backend/utils/`)
- **Purpose:** Multi-layer file security validation
- **Components:**
  - Content-type header check (fast fail)
  - python-magic MIME verification (reads file signature)
  - Size limit enforcement (10MB max)
  - Malicious file rejection
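
The layered checks above can be sketched as follows. This is a minimal illustration, not the project's implementation: `validate_upload`, `sniff_mime`, and the `SIGNATURES` allow-list are hypothetical names, and a small table of file signatures stands in for python-magic so the example runs with the standard library alone.

```python
from typing import Optional, Tuple

MAX_FILE_SIZE = 10 * 1024 * 1024  # 10MB limit stated in this document

# Leading "magic" bytes for a few image formats (assumed allow-list;
# a stdlib stand-in for the python-magic lookup).
SIGNATURES = {
    "image/jpeg": (b"\xff\xd8\xff",),
    "image/png": (b"\x89PNG\r\n\x1a\n",),
    "image/gif": (b"GIF87a", b"GIF89a"),
}


def sniff_mime(data: bytes) -> Optional[str]:
    """Return the MIME type implied by the file signature, if recognized."""
    for mime, prefixes in SIGNATURES.items():
        if data.startswith(prefixes):
            return mime
    return None


def validate_upload(data: bytes, declared_type: str) -> Tuple[bool, str]:
    """Run the checks cheapest-first and stop at the first failure."""
    if declared_type not in SIGNATURES:
        return False, "unsupported content-type header"
    if len(data) > MAX_FILE_SIZE:
        return False, "file exceeds 10MB limit"
    detected = sniff_mime(data)
    if detected is None:
        return False, "unrecognized file signature"
    if detected != declared_type:
        return False, "content-type header does not match file signature"
    return True, "ok"
```

Checking the header first keeps the fast-fail path cheap; the signature check then catches files whose declared type does not match their actual bytes.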
### 3. Cache System (`backend/core/cache.py`)
- **Purpose:** Performance optimization via deduplication
- **Implementation:** In-memory SHA-256 keyed cache
- **Strategy:** LRU eviction, 60-minute TTL, 500-entry limit
- **Privacy:** Stores analysis results only, never file bytes
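
A cache with these properties might look roughly like the sketch below. The class name `ResultCache`, its methods, and the injectable `clock` parameter are assumptions made for illustration; the actual `cache.py` may be organized differently.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class ResultCache:
    """In-memory LRU cache with a TTL, keyed by the SHA-256 of the file bytes."""

    def __init__(
        self,
        max_entries: int = 500,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._entries: "OrderedDict[str, tuple]" = OrderedDict()
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock

    @staticmethod
    def key_for(data: bytes) -> str:
        # Only the digest is kept; the file bytes themselves are never stored.
        return hashlib.sha256(data).hexdigest()

    def get(self, key: str) -> Optional[Any]:
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, result = item
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired entry
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return result

    def put(self, key: str, result: Any) -> None:
        self._entries[key] = (self._clock(), result)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used
```

Passing the clock in as a parameter is a testing convenience: expiry can be exercised without sleeping.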
### 4. Forensics Engine (`backend/services/`)
#### Image Forensics (`image_forensics.py`)
- **EXIF Extraction:** Pillow + ExifTags parsing
- **GPS Decoding:** DMS to decimal degrees conversion
- **Hash Generation:** SHA-256, MD5, perceptual (imagehash)
- **Tampering Detection:** Software traces, missing metadata
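
The GPS decoding step is plain arithmetic: degrees, plus minutes divided by 60, plus seconds divided by 3600, negated for southern and western hemispheres. A minimal sketch follows; the function name `dms_to_decimal` is hypothetical, and the inputs are assumed to be already converted to numbers (EXIF stores them as rationals).

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, ref: str) -> float:
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    decimal = float(degrees) + float(minutes) / 60.0 + float(seconds) / 3600.0
    # EXIF carries the hemisphere separately (GPSLatitudeRef / GPSLongitudeRef);
    # south and west map to negative values.
    if ref.strip().upper() in ("S", "W"):
        decimal = -decimal
    return decimal
```

For example, 40° 26′ 46″ N is roughly 40.4461 and 79° 58′ 56″ W is roughly -79.9822.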
#### AI Detector (`ai_detector.py`)
- **Approach:** Statistical analysis (no heavy ML models)
- **Signals:**
  1. **Noise Analysis:** Laplacian operator + local variance
     - Metric: `consistency = σ_local / μ_local`
     - Real photos: Higher variance diversity
  2. **Frequency Domain:** 2D FFT spectral analysis
     - Metric: `ratio = LowFreq / HighFreq`
     - AI images: Abnormal spectral signatures
  3. **JPEG Artifacts:** DCT block boundary analysis
     - Metric: Blockiness + edge density
     - AI images: Over-smoothed or missing artifacts
  4. **Color Distribution:** HSV histogram entropy
     - Metric: `H(X) = -Σ p(x) log p(x)`
     - AI images: Lower entropy, oversaturation
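
Two of these metrics are simple enough to show in pure Python. The sketch below is illustrative only: the function names are hypothetical, the logarithm base (2, giving bits) is an assumption since the document does not state one, and `noise_consistency` interprets the noise metric as the standard deviation of per-patch variances divided by their mean. The real detector presumably computes its inputs with OpenCV and NumPy.

```python
import math
from statistics import mean, pstdev
from typing import Sequence


def shannon_entropy(histogram: Sequence[int]) -> float:
    """H(X) = -sum(p(x) * log2 p(x)) over the non-empty histogram bins."""
    total = sum(histogram)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in histogram:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def noise_consistency(local_variances: Sequence[float]) -> float:
    """Ratio of the std. dev. to the mean of per-patch noise variances."""
    mu = mean(local_variances)
    if mu == 0:
        return 0.0  # perfectly flat input: no variation to measure
    return pstdev(local_variances) / mu
```

A uniform four-bin histogram gives the maximum entropy of 2 bits, while a histogram concentrated in a single bin gives 0.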
### 5. Configuration (`backend/core/config.py`)
- **Pydantic Settings:** Type-safe env var management
- **Environment-based:** DEBUG, CORS_ORIGINS, file size limits
- **Security:** No secrets in code, .env for local dev
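
The idea of typed, environment-driven settings can be sketched without any dependencies. This is a standard-library stand-in for Pydantic Settings, not the project's `config.py`: `Settings`, `load_settings`, and the `MAX_FILE_SIZE` variable name are assumptions, while `DEBUG` and `CORS_ORIGINS` come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple

DEFAULT_MAX_FILE_SIZE = 10 * 1024 * 1024  # 10MB


@dataclass(frozen=True)
class Settings:
    debug: bool = False
    cors_origins: Tuple[str, ...] = ()
    max_file_size: int = DEFAULT_MAX_FILE_SIZE


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Parse and type-check settings from environment-style key/value pairs."""
    debug = env.get("DEBUG", "false").strip().lower() in ("1", "true", "yes")
    origins = tuple(
        origin.strip()
        for origin in env.get("CORS_ORIGINS", "").split(",")
        if origin.strip()
    )
    # MAX_FILE_SIZE is a hypothetical variable name used for illustration.
    max_size = int(env.get("MAX_FILE_SIZE", str(DEFAULT_MAX_FILE_SIZE)))
    return Settings(debug=debug, cors_origins=origins, max_file_size=max_size)
```

Accepting the environment as a parameter lets tests pass a plain dictionary instead of mutating `os.environ`.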
### 6. Testing (`backend/tests/`)
- **Framework:** pytest + pytest-asyncio
- **Coverage:** 31 tests across all modules
- **Strategy:** Unit tests per component + integration tests for API
## Data Flow
### Typical Request Flow
1. **Client uploads image** → `POST /api/v1/analyze/image`
2. **Rate limiter** checks IP (10 req/min limit)
3. **Validation** checks content-type, size, MIME
4. **Cache lookup** via SHA-256 hash
   - **Cache hit:** Return cached result (~5ms)
   - **Cache miss:** Continue to analysis
5. **Forensics pipeline:**
   - Extract EXIF metadata
   - Generate hashes (SHA-256, MD5, perceptual)
   - Run AI detection (4 statistical signals)
   - Detect tampering indicators
6. **Report generation** (JSON)
7. **Cache storage** for future duplicates
8. **Response** to client with complete analysis
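
Steps 4 through 8 amount to a cache-aside pattern, sketched below under stated assumptions: `analyze_with_cache` is a hypothetical helper, a plain dictionary stands in for the cache, `pipeline` is any callable that produces the report, and the `cached` flag in the response is added here for illustration. Rate limiting and validation (steps 2 and 3) are omitted.

```python
import hashlib
from typing import Callable, Dict


def analyze_with_cache(
    data: bytes,
    cache: Dict[str, dict],
    pipeline: Callable[[bytes], dict],
) -> dict:
    """Return a cached report for duplicate uploads, else run the pipeline."""
    key = hashlib.sha256(data).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return {**cached, "cached": True}  # cache hit: skip the analysis
    report = pipeline(data)  # cache miss: run the full forensic analysis
    cache[key] = report  # store the result only, never the file bytes
    return {**report, "cached": False}
```

Uploading the same bytes twice runs the pipeline once; the second call is served from the dictionary.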
## Security Architecture
### Defense Layers
1. **Rate Limiting:** 10 requests/minute per IP
2. **Input Validation:**
   - Content-type header check
   - python-magic MIME type verification
   - 10MB size limit
3. **Memory Safety:**
   - All processing in-memory
   - No disk writes (privacy-first)
   - File handles closed in `finally` blocks
4. **Type Safety:** Pydantic models for all I/O
5. **Logging:** Structured logs without PII
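
To make the "10 requests/minute per IP" policy concrete, here is a small sliding-window limiter. It illustrates the policy only and does not describe how slowapi works internally; the class name `SlidingWindowLimiter` and its `clock` parameter are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable, DefaultDict, Deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within the last `window_seconds`."""

    def __init__(
        self,
        limit: int = 10,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._hits: DefaultDict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        now = self._clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._limit:
            return False  # over the limit: the caller would respond with HTTP 429
        hits.append(now)
        return True
```

Each client IP gets its own window, so one noisy client does not consume another client's budget.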
### Privacy Guarantees
- **Zero file storage:** Files never touch disk
- **In-memory only:** Bytes processed in RAM
- **No PII logging:** File content never logged
- **Cache privacy:** Stores results only, not file data
- **Auto-cleanup:** Cache clears on restart
## Performance Characteristics
### Timing Breakdown (100x100 image)
| Operation | Time | Cacheable |
|-----------|------|-----------|
| File validation | ~5ms | No |
| EXIF extraction | ~10ms | Yes |
| Hash generation | ~15ms | Yes |
| AI detection | ~200ms | Yes |
| Tampering check | ~5ms | Yes |
| **Total (cache miss)** | **~235ms** | - |
| **Total (cache hit)** | **~5ms** | - |
### Scalability Considerations
- **Bottleneck:** AI detection (CPU-intensive FFT)
- **Cache benefit:** ~47x speedup on duplicates (~235ms → ~5ms)
- **Rate limiting:** Prevents DoS on CPU-heavy ops
- **Async I/O:** Non-blocking file reads
- **Horizontal scaling:** Stateless design (cache per instance)
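
The cache benefit depends on how often uploads repeat. As a rough worked example using the approximate figures from the timing table (the function name and the hit rates are illustrative, not measurements):

```python
MISS_MS = 235.0  # approximate total on a cache miss (from the timing table)
HIT_MS = 5.0     # approximate total on a cache hit


def expected_latency_ms(hit_rate: float) -> float:
    """Average latency for a given fraction of requests served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * HIT_MS + (1.0 - hit_rate) * MISS_MS


speedup = MISS_MS / HIT_MS  # 235 / 5 = 47
```

With half of all requests being duplicates, the average drops from about 235ms to about 120ms; with no duplicates, the cache provides no benefit.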
## Technology Stack
### Core Dependencies
| Package | Version | Purpose |
|---------|---------|---------|
| FastAPI | 0.109.0 | Async web framework |
| Pydantic | 2.5.3 | Data validation |
| python-magic | 0.4.27 | MIME type detection |
| Pillow | 10.2.0 | Image processing + EXIF |
| OpenCV | 4.9.0 | Computer vision operations |
| NumPy | 1.26.3 | Numerical computing |
| SciPy | 1.11.4 | Scientific computing (FFT) |
| imagehash | 4.3.1 | Perceptual hashing |
| slowapi | 0.1.9 | Rate limiting |
### Development Tools
- **Testing:** pytest, pytest-asyncio, httpx
- **Linting:** (recommended: ruff, black)
- **CI/CD:** GitHub Actions
- **Python:** 3.11+ (for performance)
## Design Decisions
### Why Statistical AI Detection (Not Deep Learning)?
**Pros:**
- ✅ No heavy model downloads (TensorFlow/PyTorch)
- ✅ Fast inference (<1 second)
- ✅ Interpretable results (signal breakdown)
- ✅ Works offline
- ✅ Lower memory footprint
**Cons:**
- ⚠️ ~70-80% accuracy vs ~90%+ with CNNs
- ⚠️ Vulnerable to adversarial attacks
**Justification:** For a portfolio/learning project, the statistical approach demonstrates understanding of signal processing, computer vision fundamentals, and engineering tradeoffs without requiring GPU infrastructure.
### Why In-Memory Only (No Database)?
**Pros:**
- ✅ True privacy (nothing persisted)
- ✅ Simpler deployment (no DB management)
- ✅ Faster (no I/O overhead)
- ✅ GDPR/privacy compliant by design
**Cons:**
- ⚠️ No historical analysis
- ⚠️ Cache lost on restart
- ⚠️ No user accounts/sessions
**Justification:** Privacy-first design is the core value proposition. For a forensics tool, users may not want their files tracked.
### Why FastAPI (Not Flask/Django)?
- **Async support:** Non-blocking I/O for file uploads
- **Auto documentation:** OpenAPI/Swagger UI
- **Type safety:** Pydantic integration
- **Performance:** Typically higher throughput than Flask for I/O-bound workloads (ASGI vs. WSGI)
- **Modern:** Python 3.11+ features
## Future Enhancements
### Short-term (Next Iterations)
1. Frontend UI (React/Vue)
2. Video forensics support
3. Document analysis (PDF tampering)
4. Batch processing endpoint
### Long-term (Production)
1. CNN-based AI detection (higher accuracy)
2. Redis cache (persistent across restarts)
3. PostgreSQL for audit logs (optional)
4. Kubernetes deployment
5. WebSocket real-time progress
6. Fine-tuned models on custom dataset
## Development Setup
See main README for full setup instructions.
## Testing
```bash
# Run all tests
pytest backend/tests/ -v
# Run with coverage
pytest backend/tests/ --cov=backend --cov-report=html
# Run specific module
pytest backend/tests/test_ai_detector.py -v
```
## Deployment
Currently designed for single-instance deployment (Render, Railway, Fly.io). For production scale, consider:
- Load balancer (Nginx)
- Redis for shared cache
- Separate worker processes for CPU-heavy operations
- CDN for static assets
---
**Last Updated:** February 2026
**Version:** 1.0.0
**Author:** Abinaze Binoy