# DocSentry: System Architecture
**Real-time document anomaly detection for Indian bank underwriting.**
DocSentry is the operational realisation of the Round-1 submission idea:
catch tampered, forged, and AI-generated documents at the moment of
upload, score them on a calibrated risk scale, and hand the underwriter
a defensible audit trail. Round-2 turns that idea into a robust,
production-grade platform.
---
## 1. Architectural principles
| Principle | What it means in DocSentry |
|---|---|
| **Defence in depth** | Six independent detection layers (rule, image, PDF, OCR, ML, AI-generated). No single bypass defeats the system. |
| **Explainability first** | Every verdict ships with sub-scores, evidence bullets, and visual heatmaps. Black-box outputs are unacceptable in regulated finance. |
| **Tamper-evident provenance** | Every analysis is appended to a SHA-256 hash chain. Retroactive edits are mathematically detectable. |
| **Portfolio-level vision** | Single-document forensics is necessary but insufficient. Real fraud is *organised*; the system reasons across applicants. |
| **Zero data egress** | All inference runs locally. No applicant PII leaves the bank's perimeter. |
| **RBI-aligned output** | Compliance reports follow Master Direction on KYC formatting so they can be filed directly. |
---
## 2. Layered architecture
```
             ┌───────────────────────────────────────┐
             │ PRESENTATION (Streamlit / future PWA) │
             │                                       │
             │  Tab 1  Single-doc analysis           │
             │  Tab 2  Cross-document KYC            │
             │  Tab 3  Batch underwriter audit       │
             │  Tab 4  RBI compliance + Provenance   │
             │  Tab 5  Live Tamper Forge Studio      │
             │  Tab 6  Fraud Ring Network            │
             └───────────────────┬───────────────────┘
                                 │
             ┌───────────────────▼───────────────────┐
             │ API GATEWAY (planned FastAPI front)   │
             │ /analyse     /verify    /forge-test   │
             │ /compliance  /batch     /webhook      │
             └───────────────────┬───────────────────┘
                                 │
         ┌───────────────────────┼─────────────────────────┐
         ▼                       ▼                         ▼
┌─────────────────┐  ┌───────────────────────┐  ┌─────────────────────┐
│ INGESTION       │  │ FORENSICS CORE        │  │ COMPLIANCE CORE     │
│ - Direct upload │  │ Rule layer (ELA,      │  │ RBI IFSC lookup     │
│ - Watch folder  │  │ copy-move, noise,     │  │ PAN entity check    │
│ - PDF / image   │  │ EXIF, PDF struct,     │  │ Aadhaar Verhoeff    │
│ - Future Kafka  │  │ OCR rules)            │  │ PII redaction       │
└────────┬────────┘  │ RF classifier (11-d)  │  │ DPDP-aligned        │
         │           │ CNN (MobileNetV2)     │  └──────────┬──────────┘
         ▼           │ AI-gen detector (FFT) │             │
┌─────────────────┐  └───────────┬───────────┘             │
│ PROVENANCE      │              │                         │
│ SHA-256 chain   │              ▼                         │
│ SQLite ledger   │    ┌───────────────────┐               │
│ verify_chain()  │    │ ENSEMBLE FUSION   │               │
└────────┬────────┘    │ Weighted blend    │               │
         │             │ per sub-detector  │               │
         │             └─────────┬─────────┘               │
         └───────────────────────┼─────────────────────────┘
                                 ▼
                ┌─────────────────────────────────┐
                │ FRAUD RING DETECTOR             │
                │ NetworkX similarity graph       │
                │ Clique-based ring discovery     │
                │ Cross-applicant correlation     │
                └────────────────┬────────────────┘
                                 ▼
                ┌─────────────────────────────────┐
                │ RISK ORCHESTRATOR               │
                │ Score -> band -> action         │
                │ (LOW / MEDIUM / HIGH / CRIT)    │
                └────────────────┬────────────────┘
                                 ▼
                ┌─────────────────────────────────┐
                │ OUTPUT LAYER                    │
                │ - Streamlit dashboard           │
                │ - Bank-letterhead audit PDF     │
                │ - RBI compliance pack PDF       │
                │ - Audit JSON                    │
                │ - Webhook alerts (planned)      │
                │ - Provenance ledger entry       │
                └─────────────────────────────────┘
```
---
## 3. Component reference
### 3.1 Forensics core (`forensics.py`)
Six independent detectors blended via `analyse_document(path)`:
| Detector | Method | Sub-score key |
|---------------------|--------------------------------------------|-----------------|
| Error Level Analysis| JPEG re-save diff | `ela` |
| Copy-move | ORB keypoints + cross-matching | `copy_move` |
| Noise inconsistency | Per-block Laplacian variance | `noise` |
| EXIF audit | Metadata + software-tag fingerprint | `exif` |
| OCR + text rules | Tesseract + IFSC/PAN/date/amount regex | `text_rules` |
| **AI-generated** | **Radial FFT spectral analysis (new)** | `ai_generated` |
Optional ML overlays:
- **Random Forest** (`predict_with_model`): 11 features (4 forensics + 4 GLCM texture + 3 colour entropy).
- **MobileNetV2 CNN** (`predict_with_cnn`): fine-tuned on CASIA v2; weight grows with measured validation AUC.
Final score: `risk_score = weighted_blend(sub_scores) -> RF overlay -> CNN overlay -> AI-gen overlay`.
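To make the "per-block Laplacian variance" row concrete, here is a minimal pure-Python sketch of a noise-inconsistency measure. It is an illustration, not the code in `forensics.py`: the function names, the 8-pixel block size, and the use of the coefficient of variation as the score are all assumptions.

```python
import random
from statistics import mean, pstdev, pvariance

def laplacian(img):
    """4-neighbour Laplacian of a 2-D list of grey levels (border pixels dropped)."""
    h, w = len(img), len(img[0])
    return [[img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1]
             - 4 * img[y][x]
             for x in range(1, w - 1)]
            for y in range(1, h - 1)]

def block_variances(img, block=8):
    """Variance of the Laplacian response inside each full block x block tile."""
    lap = laplacian(img)
    h, w = len(lap), len(lap[0])
    out = []
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            tile = [lap[y][x]
                    for y in range(by, by + block)
                    for x in range(bx, bx + block)]
            out.append(pvariance(tile))
    return out

def noise_inconsistency(img, block=8):
    """Hypothetical score: coefficient of variation of the per-block variances."""
    v = block_variances(img, block)
    m = mean(v)
    return pstdev(v) / m if m > 0 else 0.0

# Synthetic demo: sensor-like noise everywhere vs. a flat patch pasted on the right.
rng = random.Random(0)
uniform = [[128 + rng.randint(-10, 10) for _ in range(34)] for _ in range(34)]
spliced = [row[:17] + [128] * 17 for row in uniform]
```

The idea is that a genuine scan carries roughly the same noise level in every block, while a pasted or synthetically filled region does not, so `noise_inconsistency(spliced)` comes out larger than `noise_inconsistency(uniform)`.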
### 3.2 AI-generated detector (`ai_detector.py`)
The 2026 threat model is Sora / Midjourney / Stable Diffusion outputs, not Photoshop. This module catches them in the frequency domain:
| Signal | Detection |
|--------------------------------|-----------------------------------------------|
| High-frequency suppression | Ratio of low- to high-frequency FFT energy. |
| Periodic spectral peaks | Spike count in high-frequency band. |
| JPEG quantization absence | PIL `img.quantization` table inspection. |
The detector's probability is blended into the main risk score with its contribution capped at +20%, so it never dominates the classical signals but still reliably surfaces synthetic media.
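The first signal in the table, the low- to high-frequency energy ratio, can be sketched as follows. This is a self-contained illustration that uses a naive DFT so it runs on the standard library alone; a real detector would presumably call an FFT library instead. The function names and the `cutoff` of half the Nyquist radius are assumptions, not values taken from `ai_detector.py`.

```python
import cmath
import math

def power_spectrum(img):
    """|DFT|^2 of a square 2-D list, via a naive separable DFT (rows, then columns)."""
    n = len(img)
    tw = [[cmath.exp(-2j * math.pi * k * x / n) for x in range(n)] for k in range(n)]
    rows = [[sum(row[x] * tw[k][x] for x in range(n)) for k in range(n)]
            for row in img]
    return [[abs(sum(rows[y][k] * tw[m][y] for y in range(n))) ** 2
             for k in range(n)]
            for m in range(n)]

def low_high_ratio(img, cutoff=0.5):
    """Low- to high-frequency energy ratio; cutoff is a fraction of the Nyquist radius."""
    n = len(img)
    spec = power_spectrum(img)
    low = high = 0.0
    for m in range(n):
        for k in range(n):
            fy = m if m <= n // 2 else m - n   # signed vertical frequency
            fx = k if k <= n // 2 else k - n   # signed horizontal frequency
            r = math.hypot(fx, fy)             # radial distance from DC
            if r == 0:
                continue                       # skip the DC term (mean brightness)
            if r <= cutoff * (n / 2):
                low += spec[m][k]
            else:
                high += spec[m][k]
    return low / (high + 1e-9)                 # epsilon avoids division by zero

# Synthetic demo: a smooth gradient-like image vs. a pixel-level checkerboard.
n = 16
smooth = [[128 + 100 * math.cos(2 * math.pi * x / n) for x in range(n)]
          for _ in range(n)]
checker = [[255 if (x + y) % 2 else 0 for x in range(n)] for y in range(n)]
```

An image whose high-frequency band is suppressed, as in `smooth`, yields a large ratio; one dominated by fine detail, as in `checker`, yields a ratio near zero. Where to put the decision threshold between the two is an empirical choice this sketch does not address.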
### 3.3 Cross-document KYC (`compliance.py`)
- IFSC validation against 36 RBI bank codes
- PAN entity-type character + Luhn-like structural check
- Aadhaar UIDAI Verhoeff checksum (rejects 0/1 prefix)
- PII redaction via PyMuPDF text-bbox overlays
- RBI-format compliance audit PDF (5 sections, ReportLab Platypus)
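The structural checks above can be sketched with the standard library. The Verhoeff tables are the standard published ones; the helper names, the regular expressions, and the list of PAN entity-type letters are assumptions made for illustration and may differ from what `compliance.py` does. The lookup against the 36 RBI bank codes is not reproduced here.

```python
import re

# Standard Verhoeff tables: dihedral-group multiplication, permutation, inverse.
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6], [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8], [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2], [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2], [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0], [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5], [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
_INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]

def verhoeff_check_digit(base: str) -> int:
    """Check digit to append to `base` (a string of digits)."""
    c = 0
    for i, ch in enumerate(reversed(base)):
        c = _D[c][_P[(i + 1) % 8][int(ch)]]
    return _INV[c]

def verhoeff_valid(number: str) -> bool:
    """True if `number` (digits, check digit last) passes the Verhoeff checksum."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0

def is_valid_aadhaar(number: str) -> bool:
    """12 digits, first digit 2-9 (rejects a 0/1 prefix), valid Verhoeff checksum."""
    digits = number.replace(" ", "")
    return bool(re.fullmatch(r"[2-9]\d{11}", digits)) and verhoeff_valid(digits)

# IFSC: 4-letter bank code, a literal 0, then a 6-character branch code.
IFSC_RE = re.compile(r"[A-Z]{4}0[A-Z0-9]{6}")
# PAN: the 4th character encodes the entity type (assumed letter set).
PAN_RE = re.compile(r"[A-Z]{3}[ABCFGHJLPT][A-Z]\d{4}[A-Z]")
```

For example, `verhoeff_valid("2363")` holds while `verhoeff_valid("2364")` does not, because the checksum catches every single-digit substitution.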
### 3.4 Fraud Ring Detector (`fraud_ring.py`) - *new headline feature*
Single-document forensics misses **organised** fraud. This module fixes that.
**Pipeline:**
1. **Extract identity signals** from each applicant: name, DOB, address, phone, IFSC, account, employer (OCR + regex).
2. **Build a weighted similarity graph** (NetworkX). Edge weight is a sum of per-signal match weights, with field-specific fraud significance:
- account number 0.25 (highest: same account = same person)
- DOB 0.15, address 0.20, phone 0.20
- name 0.10, IFSC 0.05, employer 0.05
3. **Detect rings** = connected components above a configurable similarity threshold, size ≥ 3.
4. **Score each ring**: CRITICAL (≥5 applicants), HIGH (3-4), MEDIUM (2).
5. **Visualise** as an interactive force-directed graph; ring members rendered in red with thick edges.
Banking impact: detects identity-recycling rings, address farms, and mule-account networks, the patterns that cost banks ~₹3,000 crore/year (RBI Annual Report).
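Steps 2 to 4 of the pipeline can be sketched in plain Python. The per-signal weights are the ones listed above; everything else is an assumption for illustration: the function names, the 0.35 edge threshold, exact string equality as the match test, and a hand-written depth-first search standing in for NetworkX.

```python
from itertools import combinations

# Per-signal match weights quoted in the pipeline above (they sum to 1.0).
WEIGHTS = {"account": 0.25, "address": 0.20, "phone": 0.20, "dob": 0.15,
           "name": 0.10, "ifsc": 0.05, "employer": 0.05}

def similarity(a: dict, b: dict) -> float:
    """Sum the weights of every field that is present and identical in both records."""
    return sum(w for f, w in WEIGHTS.items() if a.get(f) and a.get(f) == b.get(f))

def find_rings(applicants: dict, threshold: float = 0.35, min_size: int = 3):
    """Connected components of the thresholded similarity graph, largest first."""
    adj = {k: set() for k in applicants}
    for u, v in combinations(applicants, 2):
        if similarity(applicants[u], applicants[v]) >= threshold:
            adj[u].add(v)
            adj[v].add(u)
    seen, rings = set(), []
    for start in applicants:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                      # iterative depth-first search
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        if len(comp) >= min_size:
            rings.append(sorted(comp))
    return sorted(rings, key=len, reverse=True)

def ring_band(size: int) -> str:
    """Severity bands from step 4."""
    return "CRITICAL" if size >= 5 else "HIGH" if size >= 3 else "MEDIUM"

# Made-up applicants: A1-A2 share phone + address, A2-A3 share address + account.
applicants = {
    "A1": {"name": "R Kumar", "phone": "9000000001", "address": "12 MG Road", "account": "111"},
    "A2": {"name": "S Rao", "phone": "9000000001", "address": "12 MG Road", "account": "222"},
    "A3": {"name": "P Shah", "phone": "9000000003", "address": "12 MG Road", "account": "222"},
    "A4": {"name": "T Das", "phone": "9000000004", "address": "7 Park St", "account": "444"},
}
rings = find_rings(applicants)
```

In this toy data A1 and A3 are not directly similar enough to share an edge, yet they land in the same ring through A2. That is the practical difference between connected components and a stricter clique requirement.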
### 3.5 Tamper Forge Studio (`tampering.py`)
Adversarial validation: live UI to apply copy-move, splice, text-edit, compression, metadata-strip, custom-region, or chained tampering operations to a clean sample, then immediately re-run detection. Visual before/after with bounding boxes, per-detector scorecard, ELA + noise heatmap overlays. Doubles as a continuous test harness for the forensics layer.
### 3.6 Provenance Ledger (`provenance.py`) - *new compliance feature*
Tamper-evident SHA-256 hash chain over every analysis:
```
record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)
```
- Stored in SQLite (single file, zero-deploy)
- `verify_chain()` walks every record in O(N) and pinpoints the first broken record
- Satisfies RBI Master Direction on KYC (2016), Para 67 record-retention requirements
- Downloadable as JSON for external auditors
Conceptually a baby blockchain: append-only, hash-linked, mathematically verifiable.
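A minimal sketch of such a hash chain on top of SQLite, following the `record_hash` formula above. The table schema, the all-zero genesis hash, the four-decimal score formatting, and the `append` and `open_ledger` helpers are assumptions for illustration; only `verify_chain()` is a name taken from this document.

```python
import hashlib
import sqlite3

GENESIS = "0" * 64  # assumed prev_hash for the very first record

def record_hash(ts, doc_sha256, band, score, prev_hash):
    """SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)."""
    payload = "|".join([ts, doc_sha256, band, f"{score:.4f}", prev_hash])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def open_ledger(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS ledger ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, ts TEXT, doc_sha256 TEXT, "
        "band TEXT, score REAL, prev_hash TEXT, record_hash TEXT)")
    return db

def append(db, ts, doc_sha256, band, score):
    """Append one analysis, linking it to the hash of the previous record."""
    row = db.execute(
        "SELECT record_hash FROM ledger ORDER BY id DESC LIMIT 1").fetchone()
    prev = row[0] if row else GENESIS
    h = record_hash(ts, doc_sha256, band, score, prev)
    db.execute(
        "INSERT INTO ledger (ts, doc_sha256, band, score, prev_hash, record_hash) "
        "VALUES (?, ?, ?, ?, ?, ?)", (ts, doc_sha256, band, score, prev, h))
    db.commit()
    return h

def verify_chain(db):
    """Walk the ledger once; return None if intact, else the id of the first broken record."""
    prev = GENESIS
    rows = db.execute(
        "SELECT id, ts, doc_sha256, band, score, prev_hash, record_hash "
        "FROM ledger ORDER BY id")
    for rid, ts, doc, band, score, prev_hash, h in rows:
        if prev_hash != prev or record_hash(ts, doc, band, score, prev) != h:
            return rid
        prev = h
    return None

# Demo with made-up documents and timestamps.
db = open_ledger()
for i, (band, score) in enumerate([("LOW", 0.12), ("CRITICAL", 0.82), ("MEDIUM", 0.41)]):
    doc = hashlib.sha256(f"doc-{i}".encode()).hexdigest()
    append(db, f"2026-01-0{i + 1}T10:00:00Z", doc, band, score)
```

Editing any stored field after the fact, for example downgrading record 2 from CRITICAL to LOW, makes `verify_chain(db)` return that record's id, because the recomputed hash no longer matches the stored one.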
### 3.7 Audit report (`audit_report.py`)
Bank-letterhead PDF with:
- Metadata table (file, SHA-256, analysed timestamp)
- Risk verdict box (colour-coded by band)
- Sub-score table with ASCII bars
- Evidence bullets
- Embedded forensic heatmaps
### 3.8 Dashboard (`app.py`)
Six-tab Streamlit UI. Sample documents bundled (`sample_data/`) for instant demo.
---
## 4. Data assets
| Asset | Purpose | Volume |
|--------------------------------------|--------------------------------------|--------|
| AgamiAI Indian Bank Statements (HF) | Real Indian bank statement PDFs | 217 |
| IDRBT Cheque Image Dataset | Cheque images, Indian banking format | 112 |
| CASIA v2 | CNN training (forged/authentic) | ~12 k |
| `sample_data/` bundled | Demo fixtures | 26 |
---
## 5. Ensemble fusion logic
```
sub_scores = {ela, copy_move, noise, exif, text_rules, ai_generated}
weights    = {ela: 0.20, copy_move: 0.25, noise: 0.20, exif: 0.15,
              text_rules: 0.20}        # sums to 1.00
# ai_generated is a separate overlay, not in base weights
base_score     = sum(weights[k] * sub_scores[k] for k in weights)
score_with_rf  = 0.5 * base_score + 0.5 * rf.predict_proba(features)
w              = clamp(cnn.val_auc, 0.4, 0.7)
score_with_cnn = (1 - w) * score_with_rf + w * cnn.predict(image)
final_score    = 0.9 * score_with_cnn + 0.1 * ai_gen_prob * 2.0
# AI-gen contribution is at most 0.1 * 1.0 * 2.0 = +0.20 (the +20% cap)
```
Band mapping: `0-0.30 LOW · 0.30-0.50 MEDIUM · 0.50-0.75 HIGH · 0.75+ CRITICAL`
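The fusion arithmetic and band mapping can be written as a small runnable function. This is a sketch under stated assumptions: model outputs are passed in as plain probabilities rather than taken from live `rf` and `cnn` objects, the final clamp to [0, 1] is an addition (the formula as written can reach 1.10), and each band is assumed to include its lower boundary.

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

# Base weights from the fusion logic above; ai_generated is handled as an overlay.
BASE_WEIGHTS = {"ela": 0.20, "copy_move": 0.25, "noise": 0.20,
                "exif": 0.15, "text_rules": 0.20}

def fuse(sub_scores, rf_prob, cnn_prob, cnn_val_auc, ai_gen_prob):
    """Blend rule-based sub-scores with the RF, CNN and AI-gen overlays."""
    base = sum(w * sub_scores[k] for k, w in BASE_WEIGHTS.items())
    with_rf = 0.5 * base + 0.5 * rf_prob
    w = clamp(cnn_val_auc, 0.4, 0.7)          # CNN weight tracks validation AUC
    with_cnn = (1 - w) * with_rf + w * cnn_prob
    final = 0.9 * with_cnn + 0.1 * ai_gen_prob * 2.0
    return clamp(final, 0.0, 1.0)             # assumption: keep the score in [0, 1]

def band(score):
    """Assumed convention: each band includes its lower boundary."""
    if score >= 0.75:
        return "CRITICAL"
    if score >= 0.50:
        return "HIGH"
    if score >= 0.30:
        return "MEDIUM"
    return "LOW"

# Worked example: every signal at 0.5, no AI-gen evidence.
neutral = {k: 0.5 for k in BASE_WEIGHTS}
score = fuse(neutral, rf_prob=0.5, cnn_prob=0.5, cnn_val_auc=0.9, ai_gen_prob=0.0)
```

With every signal at 0.5 the blended score before the overlay is 0.5, so the final score is 0.9 × 0.5 = 0.45 (MEDIUM). Raising `ai_gen_prob` to 1.0 adds the full 0.20 and moves the same document to 0.65 (HIGH).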
---
## 6. Roadmap: what's next
The architecture above is **wired** for these extensions; they ship in subsequent rounds.
| Capability | Status | Notes |
|---------------------------------------|---------|----------------------------------------|
| FastAPI gateway + webhook alerts | planned | Push to LOS / CRM on HIGH or CRITICAL |
| Federated learning across banks | planned | Flower (`flwr`); no raw data leaves |
| LLM-based document reasoning | planned | Local Phi-3 / Gemma over OCR text |
| Real-time drift monitoring | planned | Track per-detector confidence over time|
| Kubernetes deployment | planned | For multi-tenant bank hosting |
| Multilingual OCR (Hindi / Bengali) | planned | Tesseract + IndicOCR models |
---
## 7. Mapping to Round-1 submission
| Round-1 idea | Round-2 realisation |
|------------------------------------|-------------------------------------------------------|
| Image forensics (ELA, copy-move, noise, EXIF) | `forensics.py`: fully implemented + **AI-gen FFT detector** |
| PDF structural auditing | `forensics.pdf_structural_audit` + `pdf_font_audit` |
| OCR + financial validation | `forensics.text_rule_checks` + IFSC/PAN/Aadhaar full validators |
| Random Forest risk scoring | `forensics.predict_with_model`, trained on 11-d feature set |
| Real-time underwriter dashboard | Streamlit app, 6 tabs, bank-letterhead PDF output |
| CNN with MobileNetV2 (future) | **Delivered**: fine-tuned on CASIA v2 |
| LLM reasoning (future) | Roadmap (see § 6) |
| API deployment (future) | Roadmap: FastAPI gateway scaffolded |
| **NEW: Fraud Ring Network** | Cross-applicant graph + clique discovery |
| **NEW: Provenance ledger** | SHA-256 hash chain, RBI Para 67 compliant |
| **NEW: Tamper Forge Studio** | Adversarial-validation harness |
The Round-1 pillars remain the visible centre of the system. The new
pillars extend each axis without breaking the original framing:
forensics gets AI-gen detection, scoring gets a cross-applicant view,
the dashboard gets a tamper-evident audit trail.
---
## 8. Repository layout
```
.
├── app.py                  Streamlit dashboard (6 tabs)
├── forensics.py            Core analysis pipeline
├── ai_detector.py          AI-generated content detector (FFT)
├── fraud_ring.py           Cross-applicant graph + clique detection
├── provenance.py           Tamper-evident SHA-256 hash chain
├── compliance.py           IFSC / PAN / Aadhaar / PII redaction
├── tampering.py            Adversarial harness for Forge Studio
├── audit_report.py         Bank-letterhead PDF builder
├── docsentry_master.ipynb  Notebook source of truth
├── models/                 RF + CNN model files
├── sample_data/            26 demo documents
├── requirements.txt        Python dependencies
├── packages.txt            apt-get packages (HF Spaces)
├── README.md               Reference + install guide
├── ARCHITECTURE.md         This document
└── LICENSE                 MIT + third-party notices
```
---
*This architecture document is the technical reference for DocSentry Round 2.
It accompanies the live demo at https://huggingface.co/spaces/SpandanM110/DocSentry
and the source code at https://github.com/SpandanM110/Doc-Sentry.*