
SpandanM110

Round 2: fraud ring graph, AI-gen detector, provenance ledger, architecture doc

e97f963 6 days ago


	# DocSentry — System Architecture

	Real-time document anomaly detection for Indian bank underwriting.

	DocSentry is the operational realisation of the Round-1 submission idea:
	catch tampered, forged, and AI-generated documents at the moment of
	upload, score them on a calibrated risk scale, and hand the underwriter
	a defensible audit trail. Round-2 turns that idea into a robust,
	production-grade platform.

	---

	## 1. Architectural principles

	| Principle | What it means in DocSentry |
	|---|---|
	| Defence in depth | Six independent detection layers (rule, image, PDF, OCR, ML, AI-generated). No single bypass defeats the system. |
	| Explainability first | Every verdict ships with sub-scores, evidence bullets, and visual heatmaps. Black-box outputs are unacceptable in regulated finance. |
	| Tamper-evident provenance | Every analysis is appended to a SHA-256 hash chain. Retroactive edits are mathematically detectable. |
	| Portfolio-level vision | Single-document forensics is necessary but insufficient. Real fraud is organised; the system reasons across applicants. |
	| Zero data egress | All inference runs locally. No applicant PII leaves the bank's perimeter. |
	| RBI-aligned output | Compliance reports follow Master Direction on KYC formatting so they can be filed directly. |

	---

	## 2. Layered architecture

	```
	             ┌─────────────────────────────────────────┐
	             │  PRESENTATION (Streamlit / future PWA)  │
	             │                                         │
	             │  Tab 1  Single-doc analysis             │
	             │  Tab 2  Cross-document KYC              │
	             │  Tab 3  Batch underwriter audit         │
	             │  Tab 4  RBI compliance + Provenance     │
	             │  Tab 5  Live Tamper Forge Studio        │
	             │  Tab 6  Fraud Ring Network              │
	             └────────────────────┬────────────────────┘
	                                  │
	             ┌────────────────────▼────────────────────┐
	             │  API GATEWAY (planned FastAPI front)    │
	             │  /analyse     /verify   /forge-test     │
	             │  /compliance  /batch    /webhook        │
	             └────────────────────┬────────────────────┘
	                                  │
	         ┌────────────────────────┼──────────────────────────┐
	         ▼                        ▼                          ▼
	┌─────────────────┐   ┌───────────────────────┐   ┌─────────────────────┐
	│ INGESTION       │   │ FORENSICS CORE        │   │ COMPLIANCE CORE     │
	│ - Direct upload │   │ Rule layer (ELA,      │   │ RBI IFSC lookup     │
	│ - Watch folder  │   │   copy-move, noise,   │   │ PAN entity check    │
	│ - PDF / image   │   │   EXIF, PDF struct,   │   │ Aadhaar Verhoeff    │
	│ - Future Kafka  │   │   OCR rules)          │   │ PII redaction       │
	└────────┬────────┘   │ RF classifier (11-d)  │   │ DPDP-aligned        │
	         │            │ CNN (MobileNetV2)     │   └──────────┬──────────┘
	         ▼            │ AI-gen detector (FFT) │              │
	┌─────────────────┐   └───────────┬───────────┘              │
	│ PROVENANCE      │               │                          │
	│ SHA-256 chain   │               ▼                          │
	│ SQLite ledger   │     ┌───────────────────┐                │
	│ verify_chain()  │     │ ENSEMBLE FUSION   │                │
	└────────┬────────┘     │ Weighted blend    │                │
	         │              │ per sub-detector  │                │
	         │              └─────────┬─────────┘                │
	         └────────────────────────┼──────────────────────────┘
	                                  ▼
	                 ┌─────────────────────────────────┐
	                 │  FRAUD RING DETECTOR            │
	                 │  NetworkX similarity graph      │
	                 │  Clique-based ring discovery    │
	                 │  Cross-applicant correlation    │
	                 └────────────────┬────────────────┘
	                                  ▼
	                 ┌─────────────────────────────────┐
	                 │  RISK ORCHESTRATOR              │
	                 │  Score -> band -> action        │
	                 │  (LOW / MEDIUM / HIGH / CRIT)   │
	                 └────────────────┬────────────────┘
	                                  ▼
	                 ┌─────────────────────────────────┐
	                 │  OUTPUT LAYER                   │
	                 │  - Streamlit dashboard          │
	                 │  - Bank-letterhead audit PDF    │
	                 │  - RBI compliance pack PDF      │
	                 │  - Audit JSON                   │
	                 │  - Webhook alerts (planned)     │
	                 │  - Provenance ledger entry      │
	                 └─────────────────────────────────┘
	```

	---

	## 3. Component reference

	### 3.1 Forensics core (`forensics.py`)

	Six independent detectors blended via `analyse_document(path)`:

	| Detector | Method | Sub-score key |
	|---|---|---|
	| Error Level Analysis | JPEG re-save diff | `ela` |
	| Copy-move | ORB keypoints + cross-matching | `copy_move` |
	| Noise inconsistency | Per-block Laplacian variance | `noise` |
	| EXIF audit | Metadata + software-tag fingerprint | `exif` |
	| OCR + text rules | Tesseract + IFSC/PAN/date/amount regex | `text_rules` |
	| AI-generated | Radial FFT spectral analysis (new) | `ai_generated` |
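
As one illustration of the detector table above, the noise-inconsistency row can be sketched in plain Python. This is a minimal sketch, not the code in `forensics.py`: the function names, the 8-pixel block size, and the use of the coefficient of variation as the score are all assumptions made for the example, and a production version would more likely run OpenCV or NumPy over a decoded image.

```python
import random
from statistics import pstdev, pvariance

def laplacian(img):
    """4-neighbour Laplacian over the interior pixels of a 2-D list of floats."""
    h, w = len(img), len(img[0])
    return [[img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1]
             - 4 * img[y][x]
             for x in range(1, w - 1)]
            for y in range(1, h - 1)]

def block_variances(img, block=8):
    """Variance of the Laplacian response inside each non-overlapping block."""
    lap = laplacian(img)
    h, w = len(lap), len(lap[0])
    out = []
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            vals = [lap[y][x]
                    for y in range(by, by + block)
                    for x in range(bx, bx + block)]
            out.append(pvariance(vals))
    return out

def noise_inconsistency(img, block=8):
    """Hypothetical score in [0, 1]: spread of block variances relative to their mean."""
    variances = block_variances(img, block)
    if len(variances) < 2:
        return 0.0
    mean = sum(variances) / len(variances)
    if mean == 0:
        return 0.0
    return min(1.0, pstdev(variances) / mean)

# Synthetic demo: uniform sensor-like noise vs. the same image with a flat pasted patch.
rng = random.Random(0)
clean = [[rng.gauss(128, 5) for _ in range(34)] for _ in range(34)]
spliced = [row[:] for row in clean]
for y in range(17):
    for x in range(17):
        spliced[y][x] = 128.0  # pasted region carries no noise at all

clean_score = noise_inconsistency(clean)
spliced_score = noise_inconsistency(spliced)
```

In this synthetic case the spliced image should score higher than the clean one, because the pasted patch has a very different noise level from its surroundings. How the raw value is scaled into the `noise` sub-score is not specified in this document.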

	Optional ML overlays:
	- Random Forest (`predict_with_model`) — 11 features (4 forensics + 4 GLCM texture + 3 colour entropy).
	- MobileNetV2 CNN (`predict_with_cnn`) — fine-tuned on CASIA v2; weight grows with measured validation AUC.

	Final score: `risk_score = weighted_blend(sub_scores) -> RF overlay -> CNN overlay -> AI-gen overlay`.

	### 3.2 AI-generated detector (`ai_detector.py`)

	The 2026 threat model is Sora / Midjourney / Stable Diffusion outputs, not Photoshop. This module catches them in the frequency domain:

	| Signal | Detection |
	|---|---|
	| High-frequency suppression | Ratio of low- to high-frequency FFT energy. |
	| Periodic spectral peaks | Spike count in high-frequency band. |
	| JPEG quantization absence | PIL `img.quantization` table inspection. |

	The detector's output is blended into the main risk score with its contribution capped at +20%, so it never dominates the classical signals but still surfaces synthetic media.
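
The first signal in the table, the ratio of low- to high-frequency energy, can be sketched as follows. This is an assumption-laden illustration rather than the contents of `ai_detector.py`: it uses a naive pure-Python DFT in place of a real FFT library so that it runs without dependencies, and the function names and the 0.25 radial cutoff are hypothetical. The other two signals (spectral spikes and quantization tables) are not covered.

```python
import cmath
import math

def dft_1d(seq):
    """Naive O(n^2) discrete Fourier transform of a short sequence."""
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def dft_2d(img):
    """2-D DFT: transform every row, then every column. Returns spec[v][u]."""
    rows = [dft_1d(row) for row in img]
    cols = [dft_1d(list(col)) for col in zip(*rows)]  # indexed cols[u][v]
    return [list(r) for r in zip(*cols)]              # transposed back to spec[v][u]

def low_high_energy_ratio(img, cutoff=0.25):
    """Energy at radial frequency <= cutoff divided by energy above it.

    Frequencies are in cycles per pixel; the DC term is ignored.
    The cutoff value is an assumption for this sketch.
    """
    h, w = len(img), len(img[0])
    spec = dft_2d(img)
    low = high = 0.0
    for v in range(h):
        for u in range(w):
            if u == 0 and v == 0:
                continue  # skip the DC component (mean brightness)
            fu = min(u, w - u) / w  # fold negative frequencies
            fv = min(v, h - v) / h
            energy = abs(spec[v][u]) ** 2
            if math.hypot(fu, fv) <= cutoff:
                low += energy
            else:
                high += energy
    return low / (high + 1e-12)

# Tiny synthetic inputs: a smooth gradient-like wave vs. a pixel-level checkerboard.
n = 8
smooth = [[math.cos(2 * math.pi * x / n) for x in range(n)] for y in range(n)]
checker = [[(-1) ** (x + y) for x in range(n)] for y in range(n)]

smooth_ratio = low_high_energy_ratio(smooth)
checker_ratio = low_high_energy_ratio(checker)
```

A large ratio means little high-frequency content, which is the suppression pattern the table describes. Any threshold that turns the ratio into an `ai_generated` probability would have to be calibrated on real data and is not given here.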

	### 3.3 Cross-document KYC (`compliance.py`)

	- IFSC validation against 36 RBI bank codes
	- PAN entity-type character + Luhn-like structural check
	- Aadhaar UIDAI Verhoeff checksum (rejects 0/1 prefix)
	- PII redaction via PyMuPDF text-bbox overlays
	- RBI-format compliance audit PDF (5 sections, ReportLab Platypus)
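
The identifier checks in the list above can be sketched with the standard library. The Verhoeff tables are the published ones for the algorithm; everything else (function names, the exact regular expressions, and the set of PAN entity-type letters) is an assumption for illustration and may differ from `compliance.py`. The lookup against the 36 bank codes is not reproduced; only the IFSC format is checked.

```python
import re

# Verhoeff tables: dihedral-group multiplication, position permutation, inverse.
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6], [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8], [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2], [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2], [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0], [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5], [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
_INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]

def verhoeff_check_digit(payload: str) -> str:
    """Return the check digit that makes payload + digit Verhoeff-valid."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = _D[c][_P[(i + 1) % 8][int(ch)]]
    return str(_INV[c])

def verhoeff_valid(number: str) -> bool:
    """True if the all-digit string passes the Verhoeff checksum."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0

def is_valid_aadhaar(number: str) -> bool:
    """12 digits, first digit 2-9 (0/1 prefixes rejected), Verhoeff checksum."""
    digits = re.sub(r"\s", "", number)
    return bool(re.fullmatch(r"[2-9][0-9]{11}", digits)) and verhoeff_valid(digits)

# Assumed formats: PAN is 5 letters + 4 digits + 1 letter, with the fourth
# letter encoding the holder type; IFSC is 4 letters + '0' + 6 alphanumerics.
PAN_RE = re.compile(r"[A-Z]{3}[ABCFGHJLPT][A-Z][0-9]{4}[A-Z]")
IFSC_RE = re.compile(r"[A-Z]{4}0[A-Z0-9]{6}")

def is_valid_pan(pan: str) -> bool:
    return bool(PAN_RE.fullmatch(pan))

def is_valid_ifsc(code: str) -> bool:
    return bool(IFSC_RE.fullmatch(code))

# Build a fictitious, structurally valid 12-digit number for the demo.
payload = "23456789012"
sample = payload + verhoeff_check_digit(payload)
```

Because the Verhoeff scheme detects every single-digit error, changing any one digit of `sample` makes `is_valid_aadhaar` return `False`.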

	### 3.4 Fraud Ring Detector (`fraud_ring.py`) — new headline feature

	Single-document forensics misses organised fraud. This module fixes that.

	Pipeline:

	1. Extract identity signals from each applicant: name, DOB, address, phone, IFSC, account, employer (OCR + regex).
	2. Build a weighted similarity graph (NetworkX). Edge weight is a sum of per-signal match weights, with field-specific fraud significance:
	   - account number 0.25 (highest: same account = same person)
	   - DOB 0.15, address 0.20, phone 0.20
	   - name 0.10, IFSC 0.05, employer 0.05
	3. Detect rings = connected components above a configurable similarity threshold, size ≥ 3.
	4. Score each ring: CRITICAL (≥5 applicants), HIGH (3-4), MEDIUM (2).
	5. Visualise as an interactive force-directed graph; ring members rendered in red with thick edges.
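
Steps 2 to 4 of the pipeline above can be sketched without any dependencies. This is a hedged illustration, not `fraud_ring.py`: it uses a plain adjacency dict and a depth-first search where the module uses NetworkX, it matches fields by exact equality, and the names `similarity`, `find_rings`, `ring_severity` and the 0.35 threshold are assumptions. Because the text gives both a minimum ring size of 3 and a MEDIUM band for 2 applicants, the minimum size is left as a parameter.

```python
from itertools import combinations

# Per-field match weights as listed in step 2 (they sum to 1.0).
FIELD_WEIGHTS = {
    "account": 0.25, "address": 0.20, "phone": 0.20, "dob": 0.15,
    "name": 0.10, "ifsc": 0.05, "employer": 0.05,
}

def similarity(a, b):
    """Sum the weights of fields that are present and identical in both records."""
    return sum(w for field, w in FIELD_WEIGHTS.items()
               if a.get(field) and a.get(field) == b.get(field))

def find_rings(applicants, threshold=0.35, min_size=3):
    """Connected components of the graph whose edges have similarity >= threshold."""
    adj = {key: set() for key in applicants}
    for u, v in combinations(applicants, 2):
        if similarity(applicants[u], applicants[v]) >= threshold:
            adj[u].add(v)
            adj[v].add(u)
    seen, rings = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        if len(component) >= min_size:
            rings.append(sorted(component))
    return rings

def ring_severity(size):
    """Band from step 4: CRITICAL (>= 5), HIGH (3-4), MEDIUM (2)."""
    if size >= 5:
        return "CRITICAL"
    if size >= 3:
        return "HIGH"
    return "MEDIUM"

# Fictitious applicants: A-B share phone and address, B-C share account and DOB.
applicants = {
    "A": {"name": "Asha", "phone": "111", "address": "X", "account": "a1", "dob": "d1"},
    "B": {"name": "Binod", "phone": "111", "address": "X", "account": "a2", "dob": "d2"},
    "C": {"name": "Chitra", "phone": "222", "address": "Y", "account": "a2", "dob": "d2"},
    "D": {"name": "Dev", "phone": "333", "address": "Z", "account": "a4", "dob": "d4"},
}
rings = find_rings(applicants)
```

Here A and C share no field directly but land in the same ring through B, which is the behaviour that distinguishes connected components from clique-based discovery.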

	Banking impact: detects identity-recycling rings, address farms, mule-account networks — the patterns that cost banks ~₹3,000 crore/year (RBI Annual Report).

	### 3.5 Tamper Forge Studio (`tampering.py`)

	Adversarial validation: live UI to apply copy-move, splice, text-edit, compression, metadata-strip, custom-region, or chained tampering operations to a clean sample, then immediately re-run detection. Visual before/after with bounding boxes, per-detector scorecard, ELA + noise heatmap overlays. Doubles as a continuous test harness for the forensics layer.

	### 3.6 Provenance Ledger (`provenance.py`) — new compliance feature

	Tamper-evident SHA-256 hash chain over every analysis:

	```
	record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)
	```

	- Stored in SQLite (single file, zero-deploy)
	- `verify_chain()` walks every record in O(N) and pinpoints the first broken record
	- Satisfies RBI Master Direction on KYC (2016), Para 67 record-retention requirements
	- Downloadable as JSON for external auditors

	Conceptually a baby blockchain: append-only, hash-linked, mathematically verifiable.
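
A minimal sketch of the hash chain, assuming an in-memory list in place of the SQLite table. The field order follows the formula above; the `|` delimiter, the four-decimal score formatting, the all-zero genesis hash, and the helper name `append_record` are assumptions for illustration and may not match `provenance.py`.

```python
import hashlib

GENESIS = "0" * 64  # assumed prev_hash for the first record

def record_hash(timestamp, doc_sha256, risk_band, risk_score, prev_hash):
    """SHA-256 over the pipe-joined record fields plus the previous hash."""
    payload = "|".join([timestamp, doc_sha256, risk_band,
                        f"{risk_score:.4f}", prev_hash])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_record(chain, timestamp, doc_sha256, risk_band, risk_score):
    """Append a record linked to the current tail of the chain."""
    prev = chain[-1]["record_hash"] if chain else GENESIS
    rec = {
        "timestamp": timestamp,
        "doc_sha256": doc_sha256,
        "risk_band": risk_band,
        "risk_score": risk_score,
        "prev_hash": prev,
    }
    rec["record_hash"] = record_hash(timestamp, doc_sha256, risk_band,
                                     risk_score, prev)
    chain.append(rec)
    return rec

def verify_chain(chain):
    """Walk the chain once; return the index of the first broken record, or None."""
    prev = GENESIS
    for i, rec in enumerate(chain):
        expected = record_hash(rec["timestamp"], rec["doc_sha256"],
                               rec["risk_band"], rec["risk_score"], prev)
        if rec["prev_hash"] != prev or rec["record_hash"] != expected:
            return i
        prev = rec["record_hash"]
    return None

# Demo with fictitious documents.
chain = []
for n, (band, score) in enumerate([("LOW", 0.12), ("HIGH", 0.61), ("CRITICAL", 0.88)]):
    doc_digest = hashlib.sha256(f"document-{n}".encode()).hexdigest()
    append_record(chain, f"2026-01-0{n + 1}T10:00:00Z", doc_digest, band, score)
```

Editing any stored field of an earlier record changes its expected hash, so `verify_chain` reports that record's index; every later record would also fail because its `prev_hash` no longer matches.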

	### 3.7 Audit report (`audit_report.py`)

	Bank-letterhead PDF with:
	- Metadata table (file, SHA-256, analysed timestamp)
	- Risk verdict box (colour-coded by band)
	- Sub-score table with ASCII bars
	- Evidence bullets
	- Embedded forensic heatmaps

	### 3.8 Dashboard (`app.py`)

	Six-tab Streamlit UI. Sample documents bundled (`sample_data/`) for instant demo.

	---

	## 4. Data assets

	| Asset | Purpose | Volume (files) |
	|---|---|---|
	| AgamiAI Indian Bank Statements (HF) | Real Indian bank statement PDFs | 217 |
	| IDRBT Cheque Image Dataset | Cheque images, Indian banking format | 112 |
	| CASIA v2 | CNN training (forged/authentic) | ~12 k |
	| `sample_data/` bundled | Demo fixtures | 26 |

	---

	## 5. Ensemble fusion logic

	```
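	# Base weights for the five classical detectors (they sum to 1.0).
	# ai_generated is a separate overlay, not in the base weights.
	WEIGHTS = {"ela": 0.20, "copy_move": 0.25, "noise": 0.20,
	           "exif": 0.15, "text_rules": 0.20}

	def clamp(x, lo, hi):
	    return max(lo, min(hi, x))

	# `fuse` is an illustrative name. rf_prob stands for rf.predict_proba(features)
	# and cnn_prob for cnn.predict(image); all inputs are in [0, 1].
	def fuse(sub_scores, rf_prob, cnn_prob, cnn_val_auc, ai_gen_prob):
	    base_score = sum(WEIGHTS[k] * sub_scores[k] for k in WEIGHTS)
	    score_with_rf = 0.5 * base_score + 0.5 * rf_prob
	    w = clamp(cnn_val_auc, 0.4, 0.7)  # CNN weight tracks validation AUC
	    score_with_cnn = (1 - w) * score_with_rf + w * cnn_prob
	    # AI-gen overlay: 0.1 * 2.0 caps its contribution at +0.20,
	    # so the result can reach 1.10 before any final clamping.
	    return 0.9 * score_with_cnn + 0.1 * ai_gen_prob * 2.0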
	```

	Band mapping: `0-0.30 LOW · 0.30-0.50 MEDIUM · 0.50-0.75 HIGH · 0.75+ CRITICAL`
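
The band mapping can be written as a threshold lookup. This sketch assumes each band is half-open, so a score exactly on a boundary falls into the higher band; the document does not say which side the boundaries belong to, and the name `risk_band` is chosen for the example.

```python
from bisect import bisect_right

# Upper edges of LOW, MEDIUM and HIGH; anything at or above 0.75 is CRITICAL.
_THRESHOLDS = [0.30, 0.50, 0.75]
_BANDS = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]

def risk_band(score: float) -> str:
    """Map a fused risk score to its band (boundaries go to the higher band)."""
    return _BANDS[bisect_right(_THRESHOLDS, score)]

bands = [risk_band(s) for s in (0.10, 0.30, 0.62, 0.75)]
```

Scores above 1.0, which the overlay arithmetic can produce, still map to CRITICAL under this lookup.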

	---

	## 6. Roadmap — what's next

	The architecture above is wired for these extensions; they ship in subsequent rounds.

	| Capability | Status | Notes |
	|---|---|---|
	| FastAPI gateway + webhook alerts | planned | Push to LOS / CRM on HIGH or CRITICAL |
	| Federated learning across banks | planned | Flower (`flwr`); no raw data leaves |
	| LLM-based document reasoning | planned | Local Phi-3 / Gemma over OCR text |
	| Real-time drift monitoring | planned | Track per-detector confidence over time |
	| Kubernetes deployment | planned | For multi-tenant bank hosting |
	| Multilingual OCR (Hindi / Bengali) | planned | Tesseract + IndicOCR models |

	---

	## 7. Mapping to Round-1 submission

	| Round-1 idea | Round-2 realisation |
	|---|---|
	| Image forensics (ELA, copy-move, noise, EXIF) | `forensics.py`: fully implemented + AI-gen FFT detector |
	| PDF structural auditing | `forensics.pdf_structural_audit` + `pdf_font_audit` |
	| OCR + financial validation | `forensics.text_rule_checks` + IFSC/PAN/Aadhaar full validators |
	| Random Forest risk scoring | `forensics.predict_with_model`: trained on 11-d feature set |
	| Real-time underwriter dashboard | Streamlit app, 6 tabs, bank-letterhead PDF output |
	| CNN with MobileNetV2 (future) | Delivered: fine-tuned on CASIA v2 |
	| LLM reasoning (future) | Roadmap (see § 6) |
	| API deployment (future) | Roadmap: FastAPI gateway scaffolded |
	| NEW: Fraud Ring Network | Cross-applicant graph + clique discovery |
	| NEW: Provenance ledger | SHA-256 hash chain, RBI Para 67 compliant |
	| NEW: Tamper Forge Studio | Adversarial-validation harness |

	The Round-1 pillars remain the visible centre of the system. The new
	pillars extend each axis without breaking the original framing:
	forensics gets AI-gen detection, scoring gets a cross-applicant view,
	the dashboard gets a tamper-evident audit trail.

	---

	## 8. Repository layout

	```
	.
	├── app.py                    Streamlit dashboard (6 tabs)
	├── forensics.py              Core analysis pipeline
	├── ai_detector.py            AI-generated content detector (FFT)
	├── fraud_ring.py             Cross-applicant graph + clique detection
	├── provenance.py             Tamper-evident SHA-256 hash chain
	├── compliance.py             IFSC / PAN / Aadhaar / PII redaction
	├── tampering.py              Adversarial harness for Forge Studio
	├── audit_report.py           Bank-letterhead PDF builder
	├── docsentry_master.ipynb    Notebook source of truth
	├── models/                   RF + CNN model files
	├── sample_data/              26 demo documents
	├── requirements.txt          Python dependencies
	├── packages.txt              apt-get packages (HF Spaces)
	├── README.md                 Reference + install guide
	├── ARCHITECTURE.md           This document
	└── LICENSE                   MIT + third-party notices
	```

	---

	*This architecture document is the technical reference for DocSentry Round 2.
	It accompanies the live demo at https://huggingface.co/spaces/SpandanM110/DocSentry
	and the source code at https://github.com/SpandanM110/Doc-Sentry.*