| # VeriFile-X Architecture | |
| ## System Overview | |
| VeriFile-X is a privacy-preserving digital forensics platform that analyzes images for authenticity using statistical analysis and metadata extraction. | |
| ``` | |
| ┌─────────────────────────────────────────────────────────────┐ | |
| │                           Client                            │ | |
| │                      (Browser / curl)                       │ | |
| └─────────────────────────┬───────────────────────────────────┘ | |
|                           │ HTTPS | |
|                           ▼ | |
| ┌─────────────────────────────────────────────────────────────┐ | |
| │                       FastAPI Backend                       │ | |
| │  ┌──────────────────────────────────────────────────────┐   │ | |
| │  │              Rate Limiter (10 req/min)               │   │ | |
| │  └──────────────────────┬───────────────────────────────┘   │ | |
| │                         ▼                                   │ | |
| │  ┌──────────────────────────────────────────────────────┐   │ | |
| │  │               API Routes (/api/v1/...)               │   │ | |
| │  │  • /upload/validate - File validation                │   │ | |
| │  │  • /analyze/image   - Forensic analysis              │   │ | |
| │  └──────────────────────┬───────────────────────────────┘   │ | |
| │                         ▼                                   │ | |
| │  ┌──────────────────────────────────────────────────────┐   │ | |
| │  │                   Validation Layer                   │   │ | |
| │  │  • MIME type check (python-magic)                    │   │ | |
| │  │  • Size limit (10MB)                                 │   │ | |
| │  │  • Content-type header validation                    │   │ | |
| │  └──────────────────────┬───────────────────────────────┘   │ | |
| │                         ▼                                   │ | |
| │  ┌──────────────────────────────────────────────────────┐   │ | |
| │  │                  SHA-256 Hash Cache                  │   │ | |
| │  │  • Check for duplicate (cache hit)                   │   │ | |
| │  │  • Return cached result if found                     │   │ | |
| │  └──────────────────────┬───────────────────────────────┘   │ | |
| │                         ▼                                   │ | |
| │  ┌──────────────────────────────────────────────────────┐   │ | |
| │  │             Forensics Analysis Pipeline              │   │ | |
| │  │                                                      │   │ | |
| │  │  ┌────────────────────────────────────────────────┐  │   │ | |
| │  │  │ 1. Metadata Extraction (EXIF, GPS)             │  │   │ | |
| │  │  │    - Camera make/model                         │  │   │ | |
| │  │  │    - GPS coordinates                           │  │   │ | |
| │  │  │    - Software used                             │  │   │ | |
| │  │  │    - Timestamps                                │  │   │ | |
| │  │  └────────────────────────────────────────────────┘  │   │ | |
| │  │  ┌────────────────────────────────────────────────┐  │   │ | |
| │  │  │ 2. Hash Generation                             │  │   │ | |
| │  │  │    - SHA-256 (cryptographic)                   │  │   │ | |
| │  │  │    - MD5 (legacy)                              │  │   │ | |
| │  │  │    - Perceptual hash (similarity)              │  │   │ | |
| │  │  └────────────────────────────────────────────────┘  │   │ | |
| │  │  ┌────────────────────────────────────────────────┐  │   │ | |
| │  │  │ 3. AI Detection (Statistical)                  │  │   │ | |
| │  │  │    - Noise pattern analysis (Laplacian)        │  │   │ | |
| │  │  │    - Frequency domain (2D FFT)                 │  │   │ | |
| │  │  │    - JPEG artifacts (DCT blocks)               │  │   │ | |
| │  │  │    - Color distribution (HSV entropy)          │  │   │ | |
| │  │  └────────────────────────────────────────────────┘  │   │ | |
| │  │  ┌────────────────────────────────────────────────┐  │   │ | |
| │  │  │ 4. Tampering Detection                         │  │   │ | |
| │  │  │    - Missing EXIF indicators                   │  │   │ | |
| │  │  │    - Software manipulation traces              │  │   │ | |
| │  │  │    - AI generation signatures                  │  │   │ | |
| │  │  └────────────────────────────────────────────────┘  │   │ | |
| │  │                                                      │   │ | |
| │  └──────────────────────┬───────────────────────────────┘   │ | |
| │                         ▼                                   │ | |
| │  ┌──────────────────────────────────────────────────────┐   │ | |
| │  │                 Response Generation                  │   │ | |
| │  │  • Compile forensic report (JSON)                    │   │ | |
| │  │  • Cache result for duplicates                       │   │ | |
| │  │  • Return to client                                  │   │ | |
| │  └──────────────────────────────────────────────────────┘   │ | |
| └─────────────────────────────────────────────────────────────┘ | |
| ``` | |
| ## Core Components | |
| ### 1. API Layer (`backend/api/`) | |
| - **Purpose:** HTTP endpoint routing and request handling | |
| - **Technology:** FastAPI (async ASGI) | |
| - **Security:** Rate limiting (slowapi), CORS, input validation | |
| - **Routes:** | |
| - `POST /api/v1/upload/validate` - Fast file validation | |
| - `POST /api/v1/analyze/image` - Full forensic analysis | |
| ### 2. Validation Layer (`backend/utils/`) | |
| - **Purpose:** Multi-layer file security validation | |
| - **Components:** | |
| - Content-type header check (fast fail) | |
| - python-magic MIME verification (reads file signature) | |
| - Size limit enforcement (10MB max) | |
| - Malicious file rejection | |
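The layered checks above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it uses a small hand-written signature table as a stand-in for python-magic, and the names `sniff_mime`, `validate_upload`, and `SIGNATURES` are hypothetical.

```python
from typing import Optional

MAX_BYTES = 10 * 1024 * 1024  # 10MB limit stated in this document

# Leading byte signatures for a few image formats (illustrative subset only;
# the real project delegates this step to python-magic).
SIGNATURES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def sniff_mime(data: bytes) -> Optional[str]:
    """Return a MIME type based on the file signature, or None if unknown."""
    for signature, mime in SIGNATURES.items():
        if data.startswith(signature):
            return mime
    return None


def validate_upload(declared_type: str, data: bytes) -> str:
    """Run the checks in order and raise ValueError on the first failure."""
    # 1. Fast fail on the client-supplied header before inspecting the bytes.
    if not declared_type.startswith("image/"):
        raise ValueError(f"unsupported content type: {declared_type}")
    # 2. Enforce the size limit.
    if len(data) > MAX_BYTES:
        raise ValueError("file exceeds 10MB limit")
    # 3. Verify the real type from the bytes, since the header can be spoofed.
    detected = sniff_mime(data)
    if detected is None:
        raise ValueError("unrecognized file signature")
    if detected != declared_type:
        raise ValueError(f"declared {declared_type} but content is {detected}")
    return detected
```

Checking the header first keeps the cheap rejection path cheap; the signature check is what actually guards against a renamed or mislabeled file.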
| ### 3. Cache System (`backend/core/cache.py`) | |
| - **Purpose:** Performance optimization via deduplication | |
| - **Implementation:** In-memory SHA-256 keyed cache | |
| - **Strategy:** LRU eviction, 60-minute TTL, 500-entry limit | |
| - **Privacy:** Stores analysis results only, never file bytes | |
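One way such a cache could look, as a sketch under assumptions: the class name `ResultCache`, its method names, and the injectable `clock` parameter are hypothetical and not taken from `backend/core/cache.py`. The defaults mirror the limits listed above (500 entries, 60 minutes).

```python
import hashlib
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class ResultCache:
    """LRU cache with a per-entry TTL, keyed by the SHA-256 of the file bytes."""

    def __init__(self, max_entries: int = 500, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._entries: "OrderedDict[str, tuple]" = OrderedDict()
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time

    @staticmethod
    def key_for(data: bytes) -> str:
        """Derive the cache key; the bytes themselves are never stored."""
        return hashlib.sha256(data).hexdigest()

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, result = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # entry has expired
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return result

    def put(self, key: str, result: Any) -> None:
        self._entries[key] = (self._clock(), result)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```

Because only the digest and the report are held, dropping the process drops everything, which matches the privacy note above.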
| ### 4. Forensics Engine (`backend/services/`) | |
| #### Image Forensics (`image_forensics.py`) | |
| - **EXIF Extraction:** Pillow + ExifTags parsing | |
| - **GPS Decoding:** DMS to decimal degrees conversion | |
| - **Hash Generation:** SHA-256, MD5, perceptual (imagehash) | |
| - **Tampering Detection:** Software traces, missing metadata | |
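The GPS conversion and the two cryptographic digests are simple enough to sketch with the standard library. The function names here are hypothetical, and the EXIF parsing itself (done with Pillow in the project) is left out; the sketch assumes the degrees, minutes, and seconds have already been read as numbers.

```python
import hashlib
from typing import Dict, Sequence


def dms_to_decimal(dms: Sequence[float], ref: str) -> float:
    """Convert (degrees, minutes, seconds) plus an N/S/E/W reference to decimal degrees."""
    degrees, minutes, seconds = (float(part) for part in dms)
    decimal = degrees + minutes / 60.0 + seconds / 3600.0
    # Southern and western hemispheres are negative by convention.
    return -decimal if ref.upper() in ("S", "W") else decimal


def crypto_hashes(data: bytes) -> Dict[str, str]:
    """SHA-256 for integrity; MD5 only for matching legacy hash databases."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        # usedforsecurity=False (Python 3.9+) flags MD5 as a non-security use.
        "md5": hashlib.md5(data, usedforsecurity=False).hexdigest(),
    }
```

For example, `dms_to_decimal((40, 26, 46), "N")` gives roughly 40.4461, and the same magnitude with `"S"` would be negative.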
| #### AI Detector (`ai_detector.py`) | |
| - **Approach:** Statistical analysis (no heavy ML models) | |
| - **Signals:** | |
| 1. **Noise Analysis:** Laplacian operator + local variance | |
| - Metric: `consistency = σ_local / μ_local` | |
| - Real photos: Local noise variance differs more from region to region | |
| 2. **Frequency Domain:** 2D FFT spectral analysis | |
| - Metric: `ratio = LowFreq / HighFreq` | |
| - AI images: Abnormal spectral signatures | |
| 3. **JPEG Artifacts:** DCT block boundary analysis | |
| - Metric: Blockiness + edge density | |
| - AI images: Over-smoothed or missing artifacts | |
| 4. **Color Distribution:** HSV histogram entropy | |
| - Metric: `H(X) = -Σ p(x) log p(x)` | |
| - AI images: Lower entropy, oversaturation | |
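The three metrics that have explicit formulas above can be illustrated in pure Python. This is a sketch, not the code in `ai_detector.py`: it swaps the 2D FFT for a naive DFT that is only practical on tiny grids, assumes a square grayscale grid, uses base-2 logarithms for the entropy, and all function names and the `cutoff` parameter are hypothetical.

```python
import cmath
import math
from collections import Counter
from statistics import mean, pstdev
from typing import List, Sequence

Grid = List[List[float]]


def noise_consistency(local_variances: Sequence[float]) -> float:
    """consistency = sigma_local / mu_local (coefficient of variation)."""
    mu = mean(local_variances)
    return pstdev(local_variances) / mu if mu else 0.0


def low_high_ratio(gray: Grid, cutoff: int = 1) -> float:
    """ratio = LowFreq / HighFreq, using spectral energy from a naive 2D DFT."""
    n = len(gray)
    low = high = 0.0
    for u in range(n):
        for v in range(n):
            coeff = sum(
                gray[y][x] * cmath.exp(-2j * math.pi * (u * y + v * x) / n)
                for y in range(n) for x in range(n)
            )
            # Fold indices so that n-1 counts as frequency 1, n-2 as 2, etc.
            fu, fv = min(u, n - u), min(v, n - v)
            if max(fu, fv) <= cutoff:
                low += abs(coeff) ** 2
            else:
                high += abs(coeff) ** 2
    return low / high if high else math.inf


def shannon_entropy(values: Sequence[int]) -> float:
    """H(X) = -sum p(x) log2 p(x) over the histogram of the given values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A flat patch puts all of its energy in the lowest frequency and has zero entropy, while a checkerboard splits energy evenly between the lowest and highest frequencies; real images fall somewhere in between, which is what makes the ratios usable as signals.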
| ### 5. Configuration (`backend/core/config.py`) | |
| - **Pydantic Settings:** Type-safe env var management | |
| - **Environment-based:** DEBUG, CORS_ORIGINS, file size limits | |
| - **Security:** No secrets in code, .env for local dev | |
| ### 6. Testing (`backend/tests/`) | |
| - **Framework:** pytest + pytest-asyncio | |
| - **Coverage:** 31 tests across all modules | |
| - **Strategy:** Unit tests per component + integration tests for API | |
| ## Data Flow | |
| ### Typical Request Flow | |
| 1. **Client uploads image** → POST /api/v1/analyze/image | |
| 2. **Rate limiter** checks IP (10 req/min limit) | |
| 3. **Validation** checks content-type, size, MIME | |
| 4. **Cache lookup** via SHA-256 hash | |
| - **Cache hit:** Return cached result (~5ms, analysis skipped) | |
| - **Cache miss:** Continue to analysis | |
| 5. **Forensics pipeline:** | |
| - Extract EXIF metadata | |
| - Generate hashes (SHA-256, MD5, perceptual) | |
| - Run AI detection (4 statistical signals) | |
| - Detect tampering indicators | |
| 6. **Report generation** (JSON) | |
| 7. **Cache storage** for future duplicates | |
| 8. **Response** to client with complete analysis | |
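Steps 4 through 8 of the flow above reduce to a short function. The sketch below assumes rate limiting and validation have already passed; `handle_upload`, the plain-dict cache, and the `cache_hit` field are hypothetical names used only for illustration.

```python
import hashlib
from typing import Callable, Dict

Report = Dict[str, object]


def handle_upload(data: bytes, cache: Dict[str, Report],
                  analyze: Callable[[bytes], Report]) -> Report:
    """Hash the upload, return a cached report on a hit, otherwise analyze and store."""
    key = hashlib.sha256(data).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return {**cached, "cache_hit": True}
    report = analyze(data)   # EXIF, hashes, AI signals, tampering indicators
    cache[key] = report      # only the report is kept, never the file bytes
    return {**report, "cache_hit": False}
```

Passing the analyzer in as a callable keeps the flow testable without real images: a stub can count how often the expensive path actually runs.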
| ## Security Architecture | |
| ### Defense Layers | |
| 1. **Rate Limiting:** 10 requests/minute per IP | |
| 2. **Input Validation:** | |
| - Content-type header check | |
| - python-magic MIME type verification | |
| - 10MB size limit | |
| 3. **Memory Safety:** | |
| - All processing in-memory | |
| - No disk writes (privacy-first) | |
| - File handles closed in `finally` blocks | |
| 4. **Type Safety:** Pydantic models for all I/O | |
| 5. **Logging:** Structured logs without PII | |
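To make the first defense layer concrete, here is a sliding-window limiter written with the standard library. The project uses slowapi for this; the sketch is a stand-in to show the behavior of a "10 requests/minute per IP" rule and is not slowapi's implementation. The class name and the `clock` parameter are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within a rolling time window."""

    def __init__(self, limit: int = 10, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = limit
        self._window = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        now = self._clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._limit:
            return False  # caller would respond with HTTP 429
        hits.append(now)
        return True
```

Each client gets its own queue of timestamps, so one noisy IP cannot consume another client's budget.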
| ### Privacy Guarantees | |
| - **Zero file storage:** Files never touch disk | |
| - **In-memory only:** Bytes processed in RAM | |
| - **No PII logging:** File content never logged | |
| - **Cache privacy:** Stores results only, not file data | |
| - **Auto-cleanup:** Cache clears on restart | |
| ## Performance Characteristics | |
| ### Timing Breakdown (100x100 image) | |
| | Operation | Time | Cacheable | | |
| |-----------|------|-----------| | |
| | File validation | ~5ms | No | | |
| | EXIF extraction | ~10ms | Yes | | |
| | Hash generation | ~15ms | Yes | | |
| | AI detection | ~200ms | Yes | | |
| | Tampering check | ~5ms | Yes | | |
| | **Total (cache miss)** | **~235ms** | - | | |
| | **Total (cache hit)** | **~5ms** | - | | |
| ### Scalability Considerations | |
| - **Bottleneck:** AI detection (CPU-intensive FFT) | |
| - **Cache benefit:** ~47x speedup on duplicates (~235ms → ~5ms) | |
| - **Rate limiting:** Prevents DoS on CPU-heavy ops | |
| - **Async I/O:** Non-blocking file reads | |
| - **Horizontal scaling:** Request handling is stateless, but each instance keeps its own cache, so a duplicate only hits when routed to the same instance | |
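The speedup figure follows directly from the timing table, and the same numbers give a rough expected latency for a given cache hit rate. The hit rates used below are illustrative inputs, not measurements.

```python
# Per-stage timings from the table above, in milliseconds.
MISS_MS = 5 + 10 + 15 + 200 + 5  # validation + EXIF + hashes + AI detection + tampering
HIT_MS = 5                       # total for a cache hit, per the table


def expected_latency_ms(hit_rate: float) -> float:
    """Average latency when a fraction `hit_rate` of requests are duplicates."""
    return hit_rate * HIT_MS + (1.0 - hit_rate) * MISS_MS


speedup = MISS_MS / HIT_MS  # 235 / 5
```

With half of the uploads being duplicates, the average drops from 235ms to 120ms; the benefit scales linearly with the hit rate.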
| ## Technology Stack | |
| ### Core Dependencies | |
| | Package | Version | Purpose | | |
| |---------|---------|---------| | |
| | FastAPI | 0.109.0 | Async web framework | | |
| | Pydantic | 2.5.3 | Data validation | | |
| | python-magic | 0.4.27 | MIME type detection | | |
| | Pillow | 10.2.0 | Image processing + EXIF | | |
| | OpenCV | 4.9.0 | Computer vision operations | | |
| | NumPy | 1.26.3 | Numerical computing | | |
| | SciPy | 1.11.4 | Scientific computing (FFT) | | |
| | imagehash | 4.3.1 | Perceptual hashing | | |
| | slowapi | 0.1.9 | Rate limiting | | |
| ### Development Tools | |
| - **Testing:** pytest, pytest-asyncio, httpx | |
| - **Linting:** (recommended: ruff, black) | |
| - **CI/CD:** GitHub Actions | |
| - **Python:** 3.11+ (for performance) | |
| ## Design Decisions | |
| ### Why Statistical AI Detection (Not Deep Learning)? | |
| **Pros:** | |
| - ✅ No heavy model downloads (TensorFlow/PyTorch) | |
| - ✅ Fast inference (<1 second) | |
| - ✅ Interpretable results (signal breakdown) | |
| - ✅ Works offline | |
| - ✅ Lower memory footprint | |
| **Cons:** | |
| - ⚠️ ~70-80% accuracy vs ~90%+ with CNNs | |
| - ⚠️ Vulnerable to adversarial attacks | |
| **Justification:** For a portfolio/learning project, a statistical approach demonstrates understanding of signal processing, computer vision fundamentals, and engineering tradeoffs without requiring GPU infrastructure. | |
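The interpretability point can be illustrated with a small scoring function that returns the per-signal breakdown alongside the combined score. This document does not say how the four signals are combined, so the weighted average, the weights themselves, and the names `WEIGHTS` and `combine_signals` are all assumptions made for the example.

```python
from typing import Dict

# Hypothetical weights, chosen only for illustration; they sum to 1.0.
WEIGHTS = {"noise": 0.3, "frequency": 0.3, "jpeg": 0.2, "color": 0.2}


def combine_signals(scores: Dict[str, float]) -> Dict[str, object]:
    """Weighted average of per-signal scores in [0, 1], plus the breakdown."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing signals: {sorted(missing)}")
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    # Returning the inputs lets a reader see which signal drove the result.
    return {"ai_probability": round(total, 3), "breakdown": dict(scores)}
```

Unlike a CNN's single output, every number in the result traces back to one named measurement, which is the tradeoff this section argues for.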
| ### Why In-Memory Only (No Database)? | |
| **Pros:** | |
| - ✅ True privacy (nothing persisted) | |
| - ✅ Simpler deployment (no DB management) | |
| - ✅ Faster (no I/O overhead) | |
| - ✅ Data minimization by design (no uploads retained), which simplifies GDPR-style compliance | |
| **Cons:** | |
| - ⚠️ No historical analysis | |
| - ⚠️ Cache lost on restart | |
| - ⚠️ No user accounts/sessions | |
| **Justification:** Privacy-first design is the core value proposition. For a forensics tool, users may not want their files tracked. | |
| ### Why FastAPI (Not Flask/Django)? | |
| - **Async support:** Non-blocking I/O for file uploads | |
| - **Auto documentation:** OpenAPI/Swagger UI | |
| - **Type safety:** Pydantic integration | |
| - **Performance:** ASGI-based; typically higher throughput than Flask's WSGI model for I/O-bound endpoints | |
| - **Modern:** Built around Python type hints and async/await | |
| ## Future Enhancements | |
| ### Short-term (Next Iterations) | |
| 1. Frontend UI (React/Vue) | |
| 2. Video forensics support | |
| 3. Document analysis (PDF tampering) | |
| 4. Batch processing endpoint | |
| ### Long-term (Production) | |
| 1. CNN-based AI detection (higher accuracy) | |
| 2. Redis cache (persistent across restarts) | |
| 3. PostgreSQL for audit logs (optional) | |
| 4. Kubernetes deployment | |
| 5. WebSocket real-time progress | |
| 6. Fine-tuned models on custom dataset | |
| ## Development Setup | |
| See main README for full setup instructions. | |
| ## Testing | |
| ```bash | |
| # Run all tests | |
| pytest backend/tests/ -v | |
| # Run with coverage | |
| pytest backend/tests/ --cov=backend --cov-report=html | |
| # Run specific module | |
| pytest backend/tests/test_ai_detector.py -v | |
| ``` | |
| ## Deployment | |
| Currently designed for single-instance deployment (Render, Railway, fly.io). For production scale, consider: | |
| - Load balancer (Nginx) | |
| - Redis for shared cache | |
| - Separate worker processes for CPU-heavy operations | |
| - CDN for static assets | |
| --- | |
| **Last Updated:** February 2026 | |
| **Version:** 1.0.0 | |
| **Author:** Abinaze Binoy | |