	# VeriFile-X Architecture

	## System Overview

	VeriFile-X is a privacy-preserving digital forensics platform that analyzes images for authenticity using statistical analysis and metadata extraction.
```
┌────────────────────────────────────────────────────────────┐
│                           Client                           │
│                      (Browser / curl)                      │
└─────────────────────────────┬──────────────────────────────┘
                              │ HTTPS
                              ▼
┌────────────────────────────────────────────────────────────┐
│                      FastAPI Backend                       │
│    ┌──────────────────────────────────────────────────┐    │
│    │ Rate Limiter (10 req/min)                        │    │
│    └────────────────────────┬─────────────────────────┘    │
│                             ▼                              │
│    ┌──────────────────────────────────────────────────┐    │
│    │ API Routes (/api/v1/...)                         │    │
│    │ • /upload/validate - File validation             │    │
│    │ • /analyze/image   - Forensic analysis           │    │
│    └────────────────────────┬─────────────────────────┘    │
│                             ▼                              │
│    ┌──────────────────────────────────────────────────┐    │
│    │ Validation Layer                                 │    │
│    │ • MIME type check (python-magic)                 │    │
│    │ • Size limit (10MB)                              │    │
│    │ • Content-type header validation                 │    │
│    └────────────────────────┬─────────────────────────┘    │
│                             ▼                              │
│    ┌──────────────────────────────────────────────────┐    │
│    │ SHA-256 Hash Cache                               │    │
│    │ • Check for duplicate (cache hit)                │    │
│    │ • Return cached result if found                  │    │
│    └────────────────────────┬─────────────────────────┘    │
│                             ▼                              │
│    ┌──────────────────────────────────────────────────┐    │
│    │ Forensics Analysis Pipeline                      │    │
│    │                                                  │    │
│    │    ┌────────────────────────────────────────┐    │    │
│    │    │ 1. Metadata Extraction (EXIF, GPS)     │    │    │
│    │    │   - Camera make/model                  │    │    │
│    │    │   - GPS coordinates                    │    │    │
│    │    │   - Software used                      │    │    │
│    │    │   - Timestamps                         │    │    │
│    │    └────────────────────────────────────────┘    │    │
│    │    ┌────────────────────────────────────────┐    │    │
│    │    │ 2. Hash Generation                     │    │    │
│    │    │   - SHA-256 (cryptographic)            │    │    │
│    │    │   - MD5 (legacy)                       │    │    │
│    │    │   - Perceptual hash (similarity)       │    │    │
│    │    └────────────────────────────────────────┘    │    │
│    │    ┌────────────────────────────────────────┐    │    │
│    │    │ 3. AI Detection (Statistical)          │    │    │
│    │    │   - Noise pattern analysis (Laplacian) │    │    │
│    │    │   - Frequency domain (2D FFT)          │    │    │
│    │    │   - JPEG artifacts (DCT blocks)        │    │    │
│    │    │   - Color distribution (HSV entropy)   │    │    │
│    │    └────────────────────────────────────────┘    │    │
│    │    ┌────────────────────────────────────────┐    │    │
│    │    │ 4. Tampering Detection                 │    │    │
│    │    │   - Missing EXIF indicators            │    │    │
│    │    │   - Software manipulation traces       │    │    │
│    │    │   - AI generation signatures           │    │    │
│    │    └────────────────────────────────────────┘    │    │
│    │                                                  │    │
│    └────────────────────────┬─────────────────────────┘    │
│                             ▼                              │
│    ┌──────────────────────────────────────────────────┐    │
│    │ Response Generation                              │    │
│    │ • Compile forensic report (JSON)                 │    │
│    │ • Cache result for duplicates                    │    │
│    │ • Return to client                               │    │
│    └──────────────────────────────────────────────────┘    │
└────────────────────────────────────────────────────────────┘
```

	## Core Components

	### 1. API Layer (`backend/api/`)
- Purpose: HTTP endpoint routing and request handling
- Technology: FastAPI (async ASGI)
- Security: Rate limiting (slowapi), CORS, input validation
- Routes:
  - `POST /api/v1/upload/validate` - Fast file validation
  - `POST /api/v1/analyze/image` - Full forensic analysis

	### 2. Validation Layer (`backend/utils/`)
- Purpose: Multi-layer file security validation
- Components:
  - Content-type header check (fast fail)
  - python-magic MIME verification (reads file signature)
  - Size limit enforcement (10MB max)
  - Malicious file rejection
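
The layered checks above can be sketched with the standard library alone. This is a simplified stand-in, not the project's implementation: the real validation layer uses python-magic for signature detection, whereas the hypothetical `SIGNATURES` table and `validate_upload` function below only compare a few leading bytes for two assumed image types.

```python
from dataclasses import dataclass

MAX_FILE_SIZE = 10 * 1024 * 1024  # 10MB limit stated in this document

# Hypothetical allow-list: declared MIME type -> expected leading bytes.
SIGNATURES = {
    "image/jpeg": b"\xff\xd8\xff",
    "image/png": b"\x89PNG\r\n\x1a\n",
}


@dataclass
class ValidationResult:
    ok: bool
    reason: str


def validate_upload(content_type: str, data: bytes) -> ValidationResult:
    """Run the cheap checks first, then inspect the bytes themselves."""
    # 1. Content-type header check (fast fail, no byte inspection).
    if content_type not in SIGNATURES:
        return ValidationResult(False, "unsupported content type")
    # 2. Size limit enforcement.
    if len(data) > MAX_FILE_SIZE:
        return ValidationResult(False, "file too large")
    # 3. Signature check: the header is client-supplied, so verify the bytes.
    if not data.startswith(SIGNATURES[content_type]):
        return ValidationResult(False, "content does not match declared type")
    return ValidationResult(True, "ok")
```

Ordering the checks from cheapest to most expensive means a request with a bad header is rejected before any bytes are inspected.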

	### 3. Cache System (`backend/core/cache.py`)
	- Purpose: Performance optimization via deduplication
	- Implementation: In-memory SHA-256 keyed cache
	- Strategy: LRU eviction, 60min TTL, 500 entry limit
	- Privacy: Stores analysis results only, never file bytes
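
A minimal sketch of such a cache, assuming an `OrderedDict` for LRU ordering and an injectable clock for TTL checks. The class and method names (`ResultCache`, `key_for`, `get`, `put`) are illustrative and are not taken from `backend/core/cache.py`.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class ResultCache:
    """In-memory cache keyed by SHA-256, with LRU eviction and a TTL."""

    def __init__(self, max_entries: int = 500, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._entries: OrderedDict = OrderedDict()  # key -> (stored_at, result)
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock

    @staticmethod
    def key_for(data: bytes) -> str:
        # Only the digest is kept; the file bytes are never stored.
        return hashlib.sha256(data).hexdigest()

    def get(self, key: str) -> Optional[Any]:
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, result = item
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired entry
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return result

    def put(self, key: str, result: Any) -> None:
        self._entries[key] = (self._clock(), result)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used
```

Passing the clock in as a parameter keeps TTL behavior testable without sleeping.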

	### 4. Forensics Engine (`backend/services/`)

	#### Image Forensics (`image_forensics.py`)
	- EXIF Extraction: Pillow + ExifTags parsing
	- GPS Decoding: DMS to decimal degrees conversion
	- Hash Generation: SHA-256, MD5, perceptual (imagehash)
	- Tampering Detection: Software traces, missing metadata
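
Two of these steps are small enough to illustrate directly. The sketch below assumes EXIF GPS values have already been read as plain numbers plus a hemisphere reference letter; the function names are hypothetical, and the perceptual hash (which needs imagehash and a decoded image) is left out.

```python
import hashlib


def dms_to_decimal(degrees: float, minutes: float, seconds: float, ref: str) -> float:
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60.0 + seconds / 3600.0
    # South latitudes and west longitudes are negative.
    return -value if ref.upper() in ("S", "W") else value


def crypto_hashes(data: bytes) -> dict:
    """Compute the cryptographic digests of the raw file bytes."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "md5": hashlib.md5(data).hexdigest(),  # legacy digest, not collision-resistant
    }
```

For example, `dms_to_decimal(40, 26, 46, "N")` evaluates to roughly 40.4461 and the same magnitude with `"S"` is negative.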

	#### AI Detector (`ai_detector.py`)
- Approach: Statistical analysis (no heavy ML models)
- Signals:
  1. Noise Analysis: Laplacian operator + local variance
     - Metric: `consistency = σ_local / μ_local`
     - Real photos: Noise variance differs more from region to region
  2. Frequency Domain: 2D FFT spectral analysis
     - Metric: `ratio = LowFreq / HighFreq`
     - AI images: Abnormal spectral signatures
  3. JPEG Artifacts: DCT block boundary analysis
     - Metric: Blockiness + edge density
     - AI images: Over-smoothed or missing artifacts
  4. Color Distribution: HSV histogram entropy
     - Metric: `H(X) = -Σ p(x) log p(x)`
     - AI images: Lower entropy, oversaturation
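
Two of the metrics above reduce to short formulas. The sketch below is illustrative only: it assumes the per-region variances and the histogram counts have already been computed (the project does that with OpenCV and NumPy), it uses a base-2 logarithm since the document does not specify a base, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import Sequence


def noise_consistency(local_variances: Sequence[float]) -> float:
    """sigma_local / mu_local: spread of per-region variances relative to their mean."""
    mu = mean(local_variances)
    if mu == 0:
        return 0.0  # perfectly flat image: no variation to measure
    return pstdev(local_variances) / mu


def shannon_entropy(histogram: Sequence[int]) -> float:
    """H(X) = -sum p(x) log2 p(x), in bits, over the non-empty bins."""
    total = sum(histogram)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in histogram if c > 0)
```

A uniform four-bin histogram such as `[1, 1, 1, 1]` gives 2 bits of entropy, while a histogram with all mass in one bin gives 0, which matches the intuition that a narrow color distribution scores lower.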

	### 5. Configuration (`backend/core/config.py`)
	- Pydantic Settings: Type-safe env var management
	- Environment-based: DEBUG, CORS_ORIGINS, file size limits
	- Security: No secrets in code, .env for local dev

	### 6. Testing (`backend/tests/`)
	- Framework: pytest + pytest-asyncio
	- Coverage: 31 tests across all modules
	- Strategy: Unit tests per component + integration tests for API

	## Data Flow

	### Typical Request Flow

1. Client uploads image → `POST /api/v1/analyze/image`
2. Rate limiter checks IP (10 req/min limit)
3. Validation checks content-type, size, MIME
4. Cache lookup via SHA-256 hash
   - Cache hit: Return cached result immediately
   - Cache miss: Continue to analysis
5. Forensics pipeline:
   - Extract EXIF metadata
   - Generate hashes (SHA-256, MD5, perceptual)
   - Run AI detection (4 statistical signals)
   - Detect tampering indicators
6. Report generation (JSON)
7. Cache storage for future duplicates
8. Response to client with complete analysis
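
Steps 4 to 7 amount to a lookup-or-compute pattern. The following is a rough sketch under stated assumptions: a plain dict stands in for the cache, `run_pipeline` is a hypothetical callable representing the forensics pipeline, and rate limiting and validation are assumed to have already passed.

```python
import hashlib
from typing import Callable, Dict, Tuple


def handle_analysis(data: bytes, cache: Dict[str, dict],
                    run_pipeline: Callable[[bytes], dict]) -> Tuple[dict, bool]:
    """Return (report, cache_hit) for already-validated image bytes."""
    key = hashlib.sha256(data).hexdigest()  # step 4: cache lookup key
    cached = cache.get(key)
    if cached is not None:
        return cached, True                 # cache hit: skip the pipeline
    report = run_pipeline(data)             # steps 5-6: analysis and report
    cache[key] = report                     # step 7: store the report, never the bytes
    return report, False
```

Because the key is derived from the file contents, a re-upload of identical bytes under a different filename still hits the cache.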

	## Security Architecture

	### Defense Layers

1. Rate Limiting: 10 requests/minute per IP
2. Input Validation:
   - Content-type header check
   - python-magic MIME type verification
   - 10MB size limit
3. Memory Safety:
   - All processing in-memory
   - No disk writes (privacy-first)
   - File handles closed in `finally` blocks
4. Type Safety: Pydantic models for all I/O
5. Logging: Structured logs without PII
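
To make the first layer concrete, here is a minimal sliding-window limiter. It is not how slowapi works internally and is not the project's code; the class name and the injectable clock are assumptions made for illustration.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window_seconds`."""

    def __init__(self, limit: int = 10, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        now = self._clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._limit:
            return False  # over the limit: the API would answer HTTP 429
        hits.append(now)
        return True
```

Each IP gets its own queue of timestamps, so one noisy client does not consume another client's budget.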

	### Privacy Guarantees

	- Zero file storage: Files never touch disk
	- In-memory only: Bytes processed in RAM
	- No PII logging: File content never logged
	- Cache privacy: Stores results only, not file data
	- Auto-cleanup: Cache clears on restart

	## Performance Characteristics

	### Timing Breakdown (100x100 image)

| Operation | Time | Cacheable |
|-----------|------|-----------|
| File validation | ~5ms | No |
| EXIF extraction | ~10ms | Yes |
| Hash generation | ~15ms | Yes |
| AI detection | ~200ms | Yes |
| Tampering check | ~5ms | Yes |
| Total (cache miss) | ~235ms | - |
| Total (cache hit) | ~5ms | - |

	### Scalability Considerations

	- Bottleneck: AI detection (CPU-intensive FFT)
- Cache benefit: ~47x speedup on duplicates (~235ms → ~5ms)
	- Rate limiting: Prevents DoS on CPU-heavy ops
	- Async I/O: Non-blocking file reads
- Horizontal scaling: Request handling is stateless, but each instance keeps its own cache, so duplicates only hit on the instance that analyzed them
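
The totals and the speedup figure follow from the approximate per-stage timings in the table above. The short calculation below just reproduces that arithmetic; it assumes, as the table implies, that a cache hit costs about as much as validation alone.

```python
# Approximate per-stage timings in milliseconds, taken from the timing table.
STAGE_MS = {
    "validation": 5,
    "exif": 10,
    "hashing": 15,
    "ai_detection": 200,
    "tampering": 5,
}

cache_miss_ms = sum(STAGE_MS.values())   # every stage runs
cache_hit_ms = STAGE_MS["validation"]    # assumed: only validation runs on a hit
speedup = cache_miss_ms / cache_hit_ms
ai_share = STAGE_MS["ai_detection"] / cache_miss_ms

print(f"miss={cache_miss_ms}ms hit={cache_hit_ms}ms "
      f"speedup={speedup:.0f}x ai_share={ai_share:.0%}")
# prints: miss=235ms hit=5ms speedup=47x ai_share=85%
```

AI detection accounts for roughly 85% of a cache-miss request, which is why it is listed as the bottleneck.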

	## Technology Stack

	### Core Dependencies

| Package | Version | Purpose |
|---------|---------|---------|
| FastAPI | 0.109.0 | Async web framework |
| Pydantic | 2.5.3 | Data validation |
| python-magic | 0.4.27 | MIME type detection |
| Pillow | 10.2.0 | Image processing + EXIF |
| OpenCV | 4.9.0 | Computer vision operations |
| NumPy | 1.26.3 | Numerical computing |
| SciPy | 1.11.4 | Scientific computing (FFT) |
| imagehash | 4.3.1 | Perceptual hashing |
| slowapi | 0.1.9 | Rate limiting |

	### Development Tools

	- Testing: pytest, pytest-asyncio, httpx
- Linting: ruff, black (recommended)
	- CI/CD: GitHub Actions
	- Python: 3.11+ (for performance)

	## Design Decisions

	### Why Statistical AI Detection (Not Deep Learning)?

	Pros:
	- ✅ No heavy model downloads (TensorFlow/PyTorch)
	- ✅ Fast inference (<1 second)
	- ✅ Interpretable results (signal breakdown)
	- ✅ Works offline
	- ✅ Lower memory footprint

	Cons:
- ⚠️ Lower accuracy (an estimated ~70-80% vs. ~90%+ for CNN-based detectors)
	- ⚠️ Vulnerable to adversarial attacks

Justification: For a portfolio/learning project, the statistical approach demonstrates understanding of signal processing, computer vision fundamentals, and engineering tradeoffs without requiring GPU infrastructure.

	### Why In-Memory Only (No Database)?

	Pros:
	- ✅ True privacy (nothing persisted)
	- ✅ Simpler deployment (no DB management)
	- ✅ Faster (no I/O overhead)
- ✅ Data minimization by design (no uploaded files retained), which eases GDPR-style privacy compliance

	Cons:
	- ⚠️ No historical analysis
	- ⚠️ Cache lost on restart
	- ⚠️ No user accounts/sessions

	Justification: Privacy-first design is the core value proposition. For a forensics tool, users may not want their files tracked.

	### Why FastAPI (Not Flask/Django)?

	- Async support: Non-blocking I/O for file uploads
	- Auto documentation: OpenAPI/Swagger UI
	- Type safety: Pydantic integration
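- Performance: Async ASGI request handling, which generally outperforms Flask's synchronous WSGI model for I/O-bound workloads
- Modern: Built around standard Python type hints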

	## Future Enhancements

	### Short-term (Next Iterations)
	1. Frontend UI (React/Vue)
	2. Video forensics support
	3. Document analysis (PDF tampering)
	4. Batch processing endpoint

	### Long-term (Production)
	1. CNN-based AI detection (higher accuracy)
	2. Redis cache (persistent across restarts)
	3. PostgreSQL for audit logs (optional)
	4. Kubernetes deployment
	5. WebSocket real-time progress
	6. Fine-tuned models on custom dataset

	## Development Setup

	See main README for full setup instructions.

	## Testing
	```bash
	# Run all tests
	pytest backend/tests/ -v

	# Run with coverage
	pytest backend/tests/ --cov=backend --cov-report=html

	# Run specific module
	pytest backend/tests/test_ai_detector.py -v
	```

	## Deployment

	Currently designed for single-instance deployment (Render, Railway, fly.io). For production scale, consider:
	- Load balancer (Nginx)
	- Redis for shared cache
	- Separate worker processes for CPU-heavy operations
	- CDN for static assets

	---

	Last Updated: February 2026
	Version: 1.0.0
	Author: Abinaze Binoy