# VeriFile-X Architecture
## System Overview
VeriFile-X is a privacy-preserving digital forensics platform that analyzes images for authenticity using statistical analysis and metadata extraction.
```
┌──────────────────────────────────────────────────────────────┐
│                            Client                            │
│                       (Browser / curl)                       │
└──────────────────────────────┬───────────────────────────────┘
                               │ HTTPS
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                       FastAPI Backend                        │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ Rate Limiter (10 req/min)                              │  │
│  └───────────────────────────┬────────────────────────────┘  │
│                              ▼                               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ API Routes (/api/v1/...)                               │  │
│  │ • /upload/validate - File validation                   │  │
│  │ • /analyze/image - Forensic analysis                   │  │
│  └───────────────────────────┬────────────────────────────┘  │
│                              ▼                               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ Validation Layer                                       │  │
│  │ • MIME type check (python-magic)                       │  │
│  │ • Size limit (10MB)                                    │  │
│  │ • Content-type header validation                       │  │
│  └───────────────────────────┬────────────────────────────┘  │
│                              ▼                               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ SHA-256 Hash Cache                                     │  │
│  │ • Check for duplicate (cache hit)                      │  │
│  │ • Return cached result if found                        │  │
│  └───────────────────────────┬────────────────────────────┘  │
│                              ▼                               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ Forensics Analysis Pipeline                            │  │
│  │                                                        │  │
│  │  ┌──────────────────────────────────────────────────┐  │  │
│  │  │ 1. Metadata Extraction (EXIF, GPS)               │  │  │
│  │  │    - Camera make/model                           │  │  │
│  │  │    - GPS coordinates                             │  │  │
│  │  │    - Software used                               │  │  │
│  │  │    - Timestamps                                  │  │  │
│  │  └──────────────────────────────────────────────────┘  │  │
│  │  ┌──────────────────────────────────────────────────┐  │  │
│  │  │ 2. Hash Generation                               │  │  │
│  │  │    - SHA-256 (cryptographic)                     │  │  │
│  │  │    - MD5 (legacy)                                │  │  │
│  │  │    - Perceptual hash (similarity)                │  │  │
│  │  └──────────────────────────────────────────────────┘  │  │
│  │  ┌──────────────────────────────────────────────────┐  │  │
│  │  │ 3. AI Detection (Statistical)                    │  │  │
│  │  │    - Noise pattern analysis (Laplacian)          │  │  │
│  │  │    - Frequency domain (2D FFT)                   │  │  │
│  │  │    - JPEG artifacts (DCT blocks)                 │  │  │
│  │  │    - Color distribution (HSV entropy)            │  │  │
│  │  └──────────────────────────────────────────────────┘  │  │
│  │  ┌──────────────────────────────────────────────────┐  │  │
│  │  │ 4. Tampering Detection                           │  │  │
│  │  │    - Missing EXIF indicators                     │  │  │
│  │  │    - Software manipulation traces                │  │  │
│  │  │    - AI generation signatures                    │  │  │
│  │  └──────────────────────────────────────────────────┘  │  │
│  │                                                        │  │
│  └───────────────────────────┬────────────────────────────┘  │
│                              ▼                               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ Response Generation                                    │  │
│  │ • Compile forensic report (JSON)                       │  │
│  │ • Cache result for duplicates                          │  │
│  │ • Return to client                                     │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘
```
## Core Components
### 1. API Layer (`backend/api/`)
- **Purpose:** HTTP endpoint routing and request handling
- **Technology:** FastAPI (async ASGI)
- **Security:** Rate limiting (slowapi), CORS, input validation
- **Routes:**
  - `POST /api/v1/upload/validate` - Fast file validation
  - `POST /api/v1/analyze/image` - Full forensic analysis
### 2. Validation Layer (`backend/utils/`)
- **Purpose:** Multi-layer file security validation
- **Components:**
  - Content-type header check (fast fail)
  - python-magic MIME verification (reads file signature)
  - Size limit enforcement (10MB max)
  - Malicious file rejection
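
The layered checks above can be sketched as follows. This is a minimal illustration rather than the project's code: `validate_upload`, `sniff_mime`, `ALLOWED_SIGNATURES`, and the error messages are hypothetical names, and a small hard-coded magic-byte table stands in for python-magic so that the sketch runs on the standard library alone.

```python
from __future__ import annotations

MAX_SIZE_BYTES = 10 * 1024 * 1024  # 10 MB limit from the component list

# Hypothetical signature table; the real layer delegates this to python-magic.
ALLOWED_SIGNATURES = {
    "image/jpeg": (b"\xff\xd8\xff",),
    "image/png": (b"\x89PNG\r\n\x1a\n",),
}


def sniff_mime(data: bytes) -> str | None:
    """Return the MIME type whose signature matches the leading bytes, if any."""
    for mime, signatures in ALLOWED_SIGNATURES.items():
        if any(data.startswith(sig) for sig in signatures):
            return mime
    return None


def validate_upload(declared_type: str, data: bytes) -> tuple[bool, str]:
    """Run the cheapest checks first so that bad uploads fail fast."""
    if declared_type not in ALLOWED_SIGNATURES:
        return False, "unsupported content-type header"
    if len(data) > MAX_SIZE_BYTES:
        return False, "file exceeds 10 MB limit"
    detected = sniff_mime(data)
    if detected is None:
        return False, "unrecognized file signature"
    if detected != declared_type:
        return False, "content-type header does not match file signature"
    return True, "ok"
```

Ordering the checks from cheapest to most expensive means a request with a wrong header is rejected before any bytes are inspected.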
### 3. Cache System (`backend/core/cache.py`)
- **Purpose:** Performance optimization via deduplication
- **Implementation:** In-memory SHA-256 keyed cache
- **Strategy:** LRU eviction, 60-minute TTL, 500-entry limit
- **Privacy:** Stores analysis results only, never file bytes
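
One way to combine LRU eviction, a TTL, and SHA-256 keys is sketched below. The class name `ResultCache`, its methods, and the injectable `clock` parameter are assumptions made for illustration; the actual `cache.py` may be structured differently.

```python
from __future__ import annotations

import hashlib
import time
from collections import OrderedDict
from typing import Any, Callable


class ResultCache:
    """LRU cache with a per-entry TTL, keyed by the SHA-256 of the file bytes."""

    def __init__(self, max_entries: int = 500, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._entries: OrderedDict = OrderedDict()  # key -> (stored_at, result)
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping

    @staticmethod
    def key_for(data: bytes) -> str:
        """Only the digest is kept; the file bytes themselves are never stored."""
        return hashlib.sha256(data).hexdigest()

    def get(self, key: str) -> Any | None:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, result = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # entry has expired
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return result

    def put(self, key: str, result: Any) -> None:
        self._entries[key] = (self._clock(), result)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used
```

Because the key is a digest and the value is the analysis result, the cache holds nothing from which the uploaded file could be reconstructed.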
### 4. Forensics Engine (`backend/services/`)
#### Image Forensics (`image_forensics.py`)
- **EXIF Extraction:** Pillow + ExifTags parsing
- **GPS Decoding:** DMS to decimal degrees conversion
- **Hash Generation:** SHA-256, MD5, perceptual (imagehash)
- **Tampering Detection:** Software traces, missing metadata
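
The GPS decoding step is plain arithmetic. The helper below is a minimal sketch; the name `dms_to_decimal` and its signature are hypothetical, and it assumes the EXIF rationals have already been converted to floats.

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, ref: str) -> float:
    """Convert EXIF-style degrees/minutes/seconds to signed decimal degrees."""
    decimal = degrees + minutes / 60.0 + seconds / 3600.0
    # South latitudes and west longitudes are negative by convention.
    return -decimal if ref.upper() in ("S", "W") else decimal
```

For example, `dms_to_decimal(40, 26, 46, "N")` gives roughly 40.4461, and the same magnitude with `"S"` gives the negative value.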
#### AI Detector (`ai_detector.py`)
- **Approach:** Statistical analysis (no heavy ML models)
- **Signals:**
  1. **Noise Analysis:** Laplacian operator + local variance
     - Metric: `consistency = σ_local / μ_local`
     - Real photos: Higher variance diversity
  2. **Frequency Domain:** 2D FFT spectral analysis
     - Metric: `ratio = LowFreq / HighFreq`
     - AI images: Abnormal spectral signatures
  3. **JPEG Artifacts:** DCT block boundary analysis
     - Metric: Blockiness + edge density
     - AI images: Over-smoothed or missing artifacts
  4. **Color Distribution:** HSV histogram entropy
     - Metric: `H(X) = -Σ p(x)log p(x)`
     - AI images: Lower entropy, oversaturation
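
Two of the metrics above reduce to short formulas. The sketch below computes them in pure Python under stated assumptions: the function names are hypothetical, the entropy uses base-2 logarithms (the formula above leaves the base unspecified), and the inputs are a precomputed histogram and a precomputed list of per-patch variances rather than an image.

```python
from __future__ import annotations

import math
from statistics import mean, pstdev


def shannon_entropy(histogram: list[int]) -> float:
    """H(X) = -sum(p * log2(p)) over the non-empty bins, in bits."""
    total = sum(histogram)
    if total == 0:
        return 0.0
    probs = [count / total for count in histogram if count > 0]
    return -sum(p * math.log2(p) for p in probs)


def noise_consistency(local_variances: list[float]) -> float:
    """Coefficient of variation of the local variances: sigma_local / mu_local."""
    mu = mean(local_variances)
    return pstdev(local_variances) / mu if mu else 0.0
```

A flat histogram maximizes the entropy and a single occupied bin gives zero, which is why a narrow color distribution lowers this signal.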
### 5. Configuration (`backend/core/config.py`)
- **Pydantic Settings:** Type-safe env var management
- **Environment-based:** DEBUG, CORS_ORIGINS, file size limits
- **Security:** No secrets in code, .env for local dev
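
The idea of typed, environment-driven settings can be illustrated without Pydantic. The sketch below uses a frozen dataclass as a stand-in; `Settings`, `load_settings`, and the `MAX_FILE_SIZE_MB` variable name are assumptions, and only `DEBUG` and `CORS_ORIGINS` are named in this document.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    debug: bool
    cors_origins: tuple[str, ...]
    max_file_size_bytes: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Parse and validate settings once at startup; fail loudly on bad values."""
    max_mb = int(env.get("MAX_FILE_SIZE_MB", "10"))  # hypothetical variable name
    if max_mb <= 0:
        raise ValueError("MAX_FILE_SIZE_MB must be positive")
    raw_origins = env.get("CORS_ORIGINS", "")
    origins = tuple(o.strip() for o in raw_origins.split(",") if o.strip())
    return Settings(
        debug=env.get("DEBUG", "false").strip().lower() in ("1", "true", "yes"),
        cors_origins=origins,
        max_file_size_bytes=max_mb * 1024 * 1024,
    )
```

Passing the environment as a parameter keeps the loader testable without mutating `os.environ`.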
### 6. Testing (`backend/tests/`)
- **Framework:** pytest + pytest-asyncio
- **Coverage:** 31 tests across all modules
- **Strategy:** Unit tests per component + integration tests for API
## Data Flow
### Typical Request Flow
1. **Client uploads image** → `POST /api/v1/analyze/image`
2. **Rate limiter** checks IP (10 req/min limit)
3. **Validation** checks content-type, size, MIME
4. **Cache lookup** via SHA-256 hash
   - **Cache hit:** Return cached result (instant)
   - **Cache miss:** Continue to analysis
5. **Forensics pipeline:**
   - Extract EXIF metadata
   - Generate hashes (SHA-256, perceptual)
   - Run AI detection (4 statistical signals)
   - Detect tampering indicators
6. **Report generation** (JSON)
7. **Cache storage** for future duplicates
8. **Response** to client with complete analysis
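
The control flow of steps 4 to 8 can be sketched as a single function. Everything here is illustrative: `analyze`, `run_forensics`, the plain-dict cache, and the `cached` field in the response are hypothetical, and the forensics step is a stub that only records that it ran.

```python
from __future__ import annotations

import hashlib

_cache: dict = {}          # sha256 hex digest -> report (results only, no file bytes)
analysis_log: list = []    # records each time the expensive pipeline actually runs


def run_forensics(data: bytes) -> dict:
    """Stub for steps 5 and 6; the real pipeline extracts metadata and signals."""
    analysis_log.append(len(data))
    return {"size_bytes": len(data)}


def analyze(data: bytes) -> dict:
    key = hashlib.sha256(data).hexdigest()   # step 4: derive the cache key
    cached = _cache.get(key)
    if cached is not None:
        return {**cached, "cached": True}    # cache hit: skip the pipeline
    report = run_forensics(data)             # steps 5 and 6
    report["sha256"] = key
    _cache[key] = report                     # step 7: store for future duplicates
    return {**report, "cached": False}       # step 8: respond
```

Uploading the same bytes twice runs the stub once; the second call is served from the dictionary.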
## Security Architecture
### Defense Layers
1. **Rate Limiting:** 10 requests/minute per IP
2. **Input Validation:**
   - Content-type header check
   - python-magic MIME type verification
   - 10MB size limit
3. **Memory Safety:**
   - All processing in-memory
   - No disk writes (privacy-first)
   - File handles closed in `finally` blocks
4. **Type Safety:** Pydantic models for all I/O
5. **Logging:** Structured logs without PII
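
The project relies on slowapi for the first layer. To show the underlying idea only, here is a minimal sliding-window limiter written against the standard library; `SlidingWindowLimiter` and its parameters are hypothetical and do not reflect how slowapi is implemented or configured.

```python
from __future__ import annotations

import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within any `window_seconds` span."""

    def __init__(self, limit: int = 10, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = limit
        self._window = window_seconds
        self._clock = clock  # injectable for deterministic tests
        self._hits = defaultdict(deque)  # client IP -> timestamps of recent requests

    def allow(self, client_ip: str) -> bool:
        now = self._clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._limit:
            return False
        hits.append(now)
        return True
```

Each client IP gets its own window, so one noisy client does not consume another client's quota.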
### Privacy Guarantees
- **Zero file storage:** Files never touch disk
- **In-memory only:** Bytes processed in RAM
- **No PII logging:** File content never logged
- **Cache privacy:** Stores results only, not file data
- **Auto-cleanup:** Cache clears on restart
## Performance Characteristics
### Timing Breakdown (100x100 image)
| Operation | Time | Cacheable |
|-----------|------|-----------|
| File validation | ~5ms | No |
| EXIF extraction | ~10ms | Yes |
| Hash generation | ~15ms | Yes |
| AI detection | ~200ms | Yes |
| Tampering check | ~5ms | Yes |
| **Total (cache miss)** | **~235ms** | - |
| **Total (cache hit)** | **~5ms** | - |
### Scalability Considerations
- **Bottleneck:** AI detection (CPU-intensive FFT)
- **Cache benefit:** 47x speedup on duplicates
- **Rate limiting:** Prevents DoS on CPU-heavy ops
- **Async I/O:** Non-blocking file reads
- **Horizontal scaling:** Stateless design (cache per instance)
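
The 47x figure follows directly from the timing table (235 ms / 5 ms). The short sketch below reproduces that arithmetic and extends it to an average latency for a given cache hit rate; the hit rate is an assumed input, not a measured value, and the names are hypothetical.

```python
# Per-stage timings in milliseconds, copied from the timing table.
STAGE_MS = {"validation": 5, "exif": 10, "hashing": 15, "ai_detection": 200, "tampering": 5}

MISS_MS = sum(STAGE_MS.values())   # every stage runs on a cache miss
HIT_MS = STAGE_MS["validation"]    # only validation is not cacheable
SPEEDUP = MISS_MS / HIT_MS


def expected_latency_ms(hit_rate: float) -> float:
    """Average latency for an assumed fraction of requests served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * HIT_MS + (1.0 - hit_rate) * MISS_MS
```

With these numbers, a workload where half the uploads are duplicates would average 120 ms per request.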
## Technology Stack
### Core Dependencies
| Package | Version | Purpose |
|---------|---------|---------|
| FastAPI | 0.109.0 | Async web framework |
| Pydantic | 2.5.3 | Data validation |
| python-magic | 0.4.27 | MIME type detection |
| Pillow | 10.2.0 | Image processing + EXIF |
| OpenCV | 4.9.0 | Computer vision operations |
| NumPy | 1.26.3 | Numerical computing |
| SciPy | 1.11.4 | Scientific computing (FFT) |
| imagehash | 4.3.1 | Perceptual hashing |
| slowapi | 0.1.9 | Rate limiting |
### Development Tools
- **Testing:** pytest, pytest-asyncio, httpx
- **Linting:** (recommended: ruff, black)
- **CI/CD:** GitHub Actions
- **Python:** 3.11+ (for performance)
## Design Decisions
### Why Statistical AI Detection (Not Deep Learning)?
**Pros:**
- ✅ No heavy model downloads (TensorFlow/PyTorch)
- ✅ Fast inference (<1 second)
- ✅ Interpretable results (signal breakdown)
- ✅ Works offline
- ✅ Lower memory footprint
**Cons:**
- ⚠️ ~70-80% accuracy vs ~90%+ with CNNs
- ⚠️ Vulnerable to adversarial attacks
**Justification:** For a portfolio/learning project, a statistical approach demonstrates understanding of signal processing, computer vision fundamentals, and engineering tradeoffs without requiring GPU infrastructure.
### Why In-Memory Only (No Database)?
**Pros:**
- ✅ True privacy (nothing persisted)
- ✅ Simpler deployment (no DB management)
- ✅ Faster (no I/O overhead)
- ✅ GDPR/privacy compliant by design
**Cons:**
- ⚠️ No historical analysis
- ⚠️ Cache lost on restart
- ⚠️ No user accounts/sessions
**Justification:** Privacy-first design is the core value proposition. For a forensics tool, users may not want their files tracked.
### Why FastAPI (Not Flask/Django)?
- **Async support:** Non-blocking I/O for file uploads
- **Auto documentation:** OpenAPI/Swagger UI
- **Type safety:** Pydantic integration
- **Performance:** Typically higher throughput than Flask for I/O-bound workloads (ASGI vs. WSGI)
- **Modern:** Python 3.11+ features
## Future Enhancements
### Short-term (Next Iterations)
1. Frontend UI (React/Vue)
2. Video forensics support
3. Document analysis (PDF tampering)
4. Batch processing endpoint
### Long-term (Production)
1. CNN-based AI detection (higher accuracy)
2. Redis cache (persistent across restarts)
3. PostgreSQL for audit logs (optional)
4. Kubernetes deployment
5. WebSocket real-time progress
6. Fine-tuned models on custom dataset
## Development Setup
See main README for full setup instructions.
## Testing
```bash
# Run all tests
pytest backend/tests/ -v
# Run with coverage
pytest backend/tests/ --cov=backend --cov-report=html
# Run specific module
pytest backend/tests/test_ai_detector.py -v
```
## Deployment
Currently designed for single-instance deployment (Render, Railway, fly.io). For production scale, consider:
- Load balancer (Nginx)
- Redis for shared cache
- Separate worker processes for CPU-heavy operations
- CDN for static assets
---
**Last Updated:** February 2026
**Version:** 1.0.0
**Author:** Abinaze Binoy