VeriFile-X Architecture
System Overview
VeriFile-X is a privacy-preserving digital forensics platform that analyzes images for authenticity using statistical analysis and metadata extraction.
┌─────────────────────────────────────────────────────────────┐
│                           Client                            │
│                      (Browser / curl)                       │
└──────────────────────────────┬──────────────────────────────┘
                               │ HTTPS
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       FastAPI Backend                       │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Rate Limiter (10 req/min)                             │  │
│  └───────────────────────────┬───────────────────────────┘  │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ API Routes (/api/v1/...)                              │  │
│  │ • /upload/validate - File validation                  │  │
│  │ • /analyze/image - Forensic analysis                  │  │
│  └───────────────────────────┬───────────────────────────┘  │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Validation Layer                                      │  │
│  │ • MIME type check (python-magic)                      │  │
│  │ • Size limit (10MB)                                   │  │
│  │ • Content-type header validation                      │  │
│  └───────────────────────────┬───────────────────────────┘  │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ SHA-256 Hash Cache                                    │  │
│  │ • Check for duplicate (cache hit)                     │  │
│  │ • Return cached result if found                       │  │
│  └───────────────────────────┬───────────────────────────┘  │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Forensics Analysis Pipeline                           │  │
│  │                                                       │  │
│  │  ┌─────────────────────────────────────────────────┐  │  │
│  │  │ 1. Metadata Extraction (EXIF, GPS)              │  │  │
│  │  │    - Camera make/model                          │  │  │
│  │  │    - GPS coordinates                            │  │  │
│  │  │    - Software used                              │  │  │
│  │  │    - Timestamps                                 │  │  │
│  │  └─────────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────────┐  │  │
│  │  │ 2. Hash Generation                              │  │  │
│  │  │    - SHA-256 (cryptographic)                    │  │  │
│  │  │    - MD5 (legacy)                               │  │  │
│  │  │    - Perceptual hash (similarity)               │  │  │
│  │  └─────────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────────┐  │  │
│  │  │ 3. AI Detection (Statistical)                   │  │  │
│  │  │    - Noise pattern analysis (Laplacian)         │  │  │
│  │  │    - Frequency domain (2D FFT)                  │  │  │
│  │  │    - JPEG artifacts (DCT blocks)                │  │  │
│  │  │    - Color distribution (HSV entropy)           │  │  │
│  │  └─────────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────────┐  │  │
│  │  │ 4. Tampering Detection                          │  │  │
│  │  │    - Missing EXIF indicators                    │  │  │
│  │  │    - Software manipulation traces               │  │  │
│  │  │    - AI generation signatures                   │  │  │
│  │  └─────────────────────────────────────────────────┘  │  │
│  │                                                       │  │
│  └───────────────────────────┬───────────────────────────┘  │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Response Generation                                   │  │
│  │ • Compile forensic report (JSON)                      │  │
│  │ • Cache result for duplicates                         │  │
│  │ • Return to client                                    │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
Core Components
1. API Layer (backend/api/)
- Purpose: HTTP endpoint routing and request handling
- Technology: FastAPI (async ASGI)
- Security: Rate limiting (slowapi), CORS, input validation
- Routes:
  - `POST /api/v1/upload/validate` - Fast file validation
  - `POST /api/v1/analyze/image` - Full forensic analysis
2. Validation Layer (backend/utils/)
- Purpose: Multi-layer file security validation
- Components:
- Content-type header check (fast fail)
- python-magic MIME verification (reads file signature)
- Size limit enforcement (10MB max)
- Malicious file rejection
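The layered checks above can be sketched in a few lines. This is a minimal illustration, not the project's code: the project uses python-magic for signature detection, whereas this sketch substitutes a small hand-written signature table so it runs with the standard library alone. The names `validate_upload`, `sniff_mime`, `SIGNATURES`, and `ValidationResult` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MAX_BYTES = 10 * 1024 * 1024  # 10MB limit stated in the text

# Assumption: a tiny allow-list of leading "magic" bytes standing in for
# python-magic. The real library inspects far more formats.
SIGNATURES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


@dataclass
class ValidationResult:
    ok: bool
    reason: str


def sniff_mime(data: bytes) -> Optional[str]:
    """Return the MIME type implied by the file's leading bytes, if known."""
    for magic, mime in SIGNATURES.items():
        if data.startswith(magic):
            return mime
    return None


def validate_upload(declared_type: str, data: bytes) -> ValidationResult:
    # Layer 1: cheap header check, fails fast before touching the payload.
    if declared_type not in set(SIGNATURES.values()):
        return ValidationResult(False, "unsupported content-type")
    # Layer 2: size limit.
    if len(data) > MAX_BYTES:
        return ValidationResult(False, "file too large")
    # Layer 3: the bytes themselves must agree with the declared type.
    detected = sniff_mime(data)
    if detected is None:
        return ValidationResult(False, "unrecognized file signature")
    if detected != declared_type:
        return ValidationResult(False, "content-type does not match file signature")
    return ValidationResult(True, "ok")
```

Ordering the checks from cheapest to most expensive means a mislabeled or oversized upload is rejected before any signature inspection happens.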
3. Cache System (backend/core/cache.py)
- Purpose: Performance optimization via deduplication
- Implementation: In-memory SHA-256 keyed cache
- Strategy: LRU eviction, 60min TTL, 500 entry limit
- Privacy: Stores analysis results only, never file bytes
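One way to combine the stated strategy (SHA-256 keys, LRU eviction, 60-minute TTL, 500-entry limit) is an `OrderedDict` that records insertion time per entry. The sketch below is an assumption about how such a cache could look, not the contents of `backend/core/cache.py`; the class name `ResultCache` and the injectable `clock` parameter are hypothetical.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class ResultCache:
    """In-memory cache of analysis results keyed by SHA-256 of the file bytes."""

    def __init__(
        self,
        max_entries: int = 500,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._items: "OrderedDict[str, tuple]" = OrderedDict()

    @staticmethod
    def key_for(data: bytes) -> str:
        # Only the digest is kept; the bytes themselves are never stored.
        return hashlib.sha256(data).hexdigest()

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._items[key]  # expired: drop and report a miss
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: str, value: Any) -> None:
        self._items[key] = (self._clock(), value)
        self._items.move_to_end(key)
        while len(self._items) > self._max:
            self._items.popitem(last=False)  # evict least recently used
```

Because the cache lives in process memory, every instance holds its own copy and a restart empties it.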
4. Forensics Engine (backend/services/)
Image Forensics (image_forensics.py)
- EXIF Extraction: Pillow + ExifTags parsing
- GPS Decoding: DMS to decimal degrees conversion
- Hash Generation: SHA-256, MD5, perceptual (imagehash)
- Tampering Detection: Software traces, missing metadata
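Two of these steps are simple enough to show directly. The sketch below converts EXIF-style degrees/minutes/seconds to signed decimal degrees and computes the two cryptographic digests with `hashlib`. The function names `dms_to_decimal` and `file_digests` are hypothetical, and the sketch assumes the DMS values have already been read out of the EXIF GPS tags as plain numbers.

```python
import hashlib
from typing import Dict


def dms_to_decimal(degrees: float, minutes: float, seconds: float, ref: str) -> float:
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees.

    South and West are negative by convention.
    """
    decimal = degrees + minutes / 60.0 + seconds / 3600.0
    return -decimal if ref.upper() in ("S", "W") else decimal


def file_digests(data: bytes) -> Dict[str, str]:
    """Return hex digests of the raw bytes; MD5 is included only for legacy lookups."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "md5": hashlib.md5(data).hexdigest(),
    }
```

For instance, 40° 26′ 46″ N becomes roughly 40.4461, and 79° 58′ 56″ W becomes roughly -79.9822.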
AI Detector (ai_detector.py)
- Approach: Statistical analysis (no heavy ML models)
- Signals:
  - Noise Analysis: Laplacian operator + local variance
    - Metric: $\text{consistency} = \sigma_{\text{local}} / \mu_{\text{local}}$
    - Real photos: Higher variance diversity
  - Frequency Domain: 2D FFT spectral analysis
    - Metric: $\text{ratio} = \text{LowFreq} / \text{HighFreq}$
    - AI images: Abnormal spectral signatures
  - JPEG Artifacts: DCT block boundary analysis
    - Metric: Blockiness + edge density
    - AI images: Over-smoothed or missing artifacts
  - Color Distribution: HSV histogram entropy
    - Metric: $H(X) = -\sum_x p(x) \log p(x)$
    - AI images: Lower entropy, oversaturation
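Three of the metrics above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the code in `ai_detector.py`: the local statistics are taken to be the standard deviation and mean of per-block Laplacian variances, the low/high frequency split uses a hypothetical radial `cutoff` of 0.125 cycles per pixel, and entropy is computed in bits (log base 2), since the source does not fix the base. The project lists OpenCV and SciPy for these operations; plain NumPy is swapped in here to keep the example self-contained.

```python
import numpy as np


def noise_consistency(gray: np.ndarray, block: int = 8) -> float:
    """sigma_local / mu_local over per-block variances of the Laplacian response."""
    g = gray.astype(np.float64)
    # 4-neighbour Laplacian evaluated on interior pixels only.
    lap = g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:] - 4.0 * g[1:-1, 1:-1]
    h = (lap.shape[0] // block) * block
    w = (lap.shape[1] // block) * block
    tiles = lap[:h, :w].reshape(h // block, block, w // block, block)
    local_var = tiles.var(axis=(1, 3)).ravel()
    if local_var.size == 0:
        return 0.0  # image too small to form a single block
    mu = local_var.mean()
    return float(local_var.std() / mu) if mu > 0 else 0.0


def frequency_ratio(gray: np.ndarray, cutoff: float = 0.125) -> float:
    """LowFreq / HighFreq spectral energy, split at a radial cutoff (cycles/pixel)."""
    g = gray.astype(np.float64)
    energy = np.abs(np.fft.fft2(g - g.mean())) ** 2  # remove DC before the FFT
    fy = np.fft.fftfreq(g.shape[0])[:, None]
    fx = np.fft.fftfreq(g.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    low = energy[radius <= cutoff].sum()
    high = energy[radius > cutoff].sum()
    return float(low / high) if high > 0 else float("inf")


def histogram_entropy(values: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy H(X) = -sum p(x) log2 p(x) of an 8-bit channel histogram."""
    counts, _ = np.histogram(values, bins=bins, range=(0, 256))
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())
```

A constant channel has zero entropy and a channel that uses all 256 levels equally has 8 bits. How these raw scores are thresholded or combined into a verdict is not specified in this document.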
5. Configuration (backend/core/config.py)
- Pydantic Settings: Type-safe env var management
- Environment-based: DEBUG, CORS_ORIGINS, file size limits
- Security: No secrets in code, .env for local dev
6. Testing (backend/tests/)
- Framework: pytest + pytest-asyncio
- Coverage: 31 tests across all modules
- Strategy: Unit tests per component + integration tests for API
Data Flow
Typical Request Flow
- Client uploads image → POST /api/v1/analyze/image
- Rate limiter checks IP (10 req/min limit)
- Validation checks content-type, size, MIME
- Cache lookup via SHA-256 hash
  - Cache hit: Return cached result (~5ms, analysis skipped)
  - Cache miss: Continue to analysis
- Forensics pipeline:
  - Extract EXIF metadata
  - Generate hashes (SHA-256, MD5, perceptual)
  - Run AI detection (4 statistical signals)
  - Detect tampering indicators
- Report generation (JSON)
- Cache storage for future duplicates
- Response to client with complete analysis
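The cache-hit and cache-miss branches of this flow can be sketched as a single function. This is a simplified illustration that assumes rate limiting and validation have already passed; a plain `dict` stands in for the cache, `pipeline` stands in for the forensics engine, and the names `analyze_with_cache` and `cache_hit` are hypothetical.

```python
import hashlib
from typing import Any, Callable, Dict

Report = Dict[str, Any]


def analyze_with_cache(
    data: bytes,
    cache: Dict[str, Report],
    pipeline: Callable[[bytes], Report],
) -> Report:
    """Return a forensic report, running the pipeline only on a cache miss."""
    key = hashlib.sha256(data).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        # Cache hit: skip the expensive analysis entirely.
        return {**cached, "cache_hit": True}
    report = pipeline(data)
    # Only the report is stored, never the uploaded bytes.
    cache[key] = report
    return {**report, "cache_hit": False}
```

Uploading the same bytes twice therefore invokes the pipeline once; the second call is answered from the stored report.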
Security Architecture
Defense Layers
- Rate Limiting: 10 requests/minute per IP
- Input Validation:
- Content-type header check
- python-magic MIME type verification
- 10MB size limit
- Memory Safety:
- All processing in-memory
- No disk writes (privacy-first)
- File handles closed in `finally` blocks
- Type Safety: Pydantic models for all I/O
- Logging: Structured logs without PII
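To make the first defense layer concrete, here is a minimal sliding-window limiter enforcing 10 requests per minute per IP. The project delegates this to slowapi; the sketch below is not slowapi's implementation or API, only an illustration of the policy. The class name `SlidingWindowLimiter` and the injectable `clock` are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within any `window_seconds` span."""

    def __init__(
        self,
        limit: int = 10,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        now = self._clock()
        hits = self._hits[client_ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._limit:
            return False  # caller would typically respond with HTTP 429
        hits.append(now)
        return True
```

Rejected requests are not recorded, so a client that keeps retrying while blocked does not extend its own lockout.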
Privacy Guarantees
- Zero file storage: Files never touch disk
- In-memory only: Bytes processed in RAM
- No PII logging: File content never logged
- Cache privacy: Stores results only, not file data
- Auto-cleanup: Cache clears on restart
Performance Characteristics
Timing Breakdown (100x100 image)
| Operation | Time | Cacheable |
|---|---|---|
| File validation | ~5ms | No |
| EXIF extraction | ~10ms | Yes |
| Hash generation | ~15ms | Yes |
| AI detection | ~200ms | Yes |
| Tampering check | ~5ms | Yes |
| Total (cache miss) | ~235ms | - |
| Total (cache hit) | ~5ms | - |
Scalability Considerations
- Bottleneck: AI detection (CPU-intensive FFT)
- Cache benefit: 47x speedup on duplicates
- Rate limiting: Prevents DoS on CPU-heavy ops
- Async I/O: Non-blocking file reads
- Horizontal scaling: Stateless design (cache per instance)
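The 47x figure follows from the timing table: 235 ms on a miss divided by 5 ms on a hit. The average latency a deployment sees depends on how often uploads repeat, which this document does not measure. The short sketch below works through that arithmetic; the function names and any hit rate passed in are hypothetical.

```python
def cache_speedup(miss_ms: float = 235.0, hit_ms: float = 5.0) -> float:
    """Ratio of cache-miss latency to cache-hit latency."""
    return miss_ms / hit_ms


def expected_latency_ms(hit_rate: float, hit_ms: float = 5.0, miss_ms: float = 235.0) -> float:
    """Average latency for a given fraction of requests served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms
```

With no repeats the average stays at 235 ms; if half of all uploads were duplicates it would fall to 120 ms.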
Technology Stack
Core Dependencies
| Package | Version | Purpose |
|---|---|---|
| FastAPI | 0.109.0 | Async web framework |
| Pydantic | 2.5.3 | Data validation |
| python-magic | 0.4.27 | MIME type detection |
| Pillow | 10.2.0 | Image processing + EXIF |
| OpenCV | 4.9.0 | Computer vision operations |
| NumPy | 1.26.3 | Numerical computing |
| SciPy | 1.11.4 | Scientific computing (FFT) |
| imagehash | 4.3.1 | Perceptual hashing |
| slowapi | 0.1.9 | Rate limiting |
Development Tools
- Testing: pytest, pytest-asyncio, httpx
- Linting: ruff, black (recommended)
- CI/CD: GitHub Actions
- Python: 3.11+ (for performance)
Design Decisions
Why Statistical AI Detection (Not Deep Learning)?
Pros:
- ✅ No heavy model downloads (TensorFlow/PyTorch)
- ✅ Fast inference (<1 second)
- ✅ Interpretable results (signal breakdown)
- ✅ Works offline
- ✅ Lower memory footprint
Cons:
- ⚠️ ~70-80% accuracy vs ~90%+ with CNNs
- ⚠️ Vulnerable to adversarial attacks
Justification: For a portfolio/learning project, the statistical approach demonstrates an understanding of signal processing, computer vision fundamentals, and engineering tradeoffs without requiring GPU infrastructure.
Why In-Memory Only (No Database)?
Pros:
- ✅ True privacy (nothing persisted)
- ✅ Simpler deployment (no DB management)
- ✅ Faster (no I/O overhead)
- ✅ GDPR/privacy compliant by design
Cons:
- ⚠️ No historical analysis
- ⚠️ Cache lost on restart
- ⚠️ No user accounts/sessions
Justification: Privacy-first design is the core value proposition. For a forensics tool, users may not want their files tracked.
Why FastAPI (Not Flask/Django)?
- Async support: Non-blocking I/O for file uploads
- Auto documentation: OpenAPI/Swagger UI
- Type safety: Pydantic integration
- Performance: Typically higher throughput than Flask on I/O-bound endpoints (async ASGI vs. synchronous WSGI)
- Modern: Python 3.11+ features
Future Enhancements
Short-term (Next Iterations)
- Frontend UI (React/Vue)
- Video forensics support
- Document analysis (PDF tampering)
- Batch processing endpoint
Long-term (Production)
- CNN-based AI detection (higher accuracy)
- Redis cache (persistent across restarts)
- PostgreSQL for audit logs (optional)
- Kubernetes deployment
- WebSocket real-time progress
- Fine-tuned models on custom dataset
Development Setup
See main README for full setup instructions.
Testing
# Run all tests
pytest backend/tests/ -v
# Run with coverage
pytest backend/tests/ --cov=backend --cov-report=html
# Run specific module
pytest backend/tests/test_ai_detector.py -v
Deployment
Currently designed for single-instance deployment (Render, Railway, fly.io). For production scale, consider:
- Load balancer (Nginx)
- Redis for shared cache
- Separate worker processes for CPU-heavy operations
- CDN for static assets
Last Updated: February 2026
Version: 1.0.0
Author: Abinaze Binoy