# DocSentry: System Architecture
**Real-time document anomaly detection for Indian bank underwriting.**
DocSentry is the operational realisation of the Round-1 submission idea:
catch tampered, forged, and AI-generated documents at the moment of
upload, score them on a calibrated risk scale, and hand the underwriter
a defensible audit trail. Round-2 turns that idea into a robust,
production-grade platform.
---
## 1. Architectural principles
| Principle | What it means in DocSentry |
|---|---|
| **Defence in depth** | Six independent detection layers (rule, image, PDF, OCR, ML, AI-generated). No single bypass defeats the system. |
| **Explainability first** | Every verdict ships with sub-scores, evidence bullets, and visual heatmaps. Black-box outputs are unacceptable in regulated finance. |
| **Tamper-evident provenance** | Every analysis is appended to a SHA-256 hash chain. Retroactive edits are mathematically detectable. |
| **Portfolio-level vision** | Single-document forensics is necessary but insufficient. Real fraud is *organised*; the system reasons across applicants. |
| **Zero data egress** | All inference runs locally. No applicant PII leaves the bank's perimeter. |
| **RBI-aligned output** | Compliance reports follow the format of the RBI Master Direction on KYC so they can be filed directly. |
---
## 2. Layered architecture
```
              ┌─────────────────────────────────────────┐
              │  PRESENTATION (Streamlit / future PWA)  │
              │                                         │
              │  Tab 1  Single-doc analysis             │
              │  Tab 2  Cross-document KYC              │
              │  Tab 3  Batch underwriter audit         │
              │  Tab 4  RBI compliance + Provenance     │
              │  Tab 5  Live Tamper Forge Studio        │
              │  Tab 6  Fraud Ring Network              │
              └────────────────────┬────────────────────┘
                                   │
              ┌────────────────────▼────────────────────┐
              │  API GATEWAY (planned FastAPI front)    │
              │  /analyse      /verify   /forge-test    │
              │  /compliance   /batch    /webhook       │
              └────────────────────┬────────────────────┘
                                   │
         ┌─────────────────────────┼───────────────────────────┐
         ▼                         ▼                           ▼
┌─────────────────┐    ┌───────────────────────┐    ┌─────────────────────┐
│ INGESTION       │    │ FORENSICS CORE        │    │ COMPLIANCE CORE     │
│ - Direct upload │    │  Rule layer (ELA,     │    │ RBI IFSC lookup     │
│ - Watch folder  │    │  copy-move, noise,    │    │ PAN entity check    │
│ - PDF / image   │    │  EXIF, PDF struct,    │    │ Aadhaar Verhoeff    │
│ - Future Kafka  │    │  OCR rules)           │    │ PII redaction       │
└────────┬────────┘    │ RF classifier (11-d)  │    │ DPDP-aligned        │
         │             │ CNN (MobileNetV2)     │    └──────────┬──────────┘
         ▼             │ AI-gen detector (FFT) │               │
┌─────────────────┐    └───────────┬───────────┘               │
│ PROVENANCE      │                │                           │
│ SHA-256 chain   │                ▼                           │
│ SQLite ledger   │      ┌───────────────────┐                 │
│ verify_chain()  │      │ ENSEMBLE FUSION   │                 │
└────────┬────────┘      │ Weighted blend    │                 │
         │               │ per sub-detector  │                 │
         │               └─────────┬─────────┘                 │
         └──────────────────┬──────┴───────────────────────────┘
                            ▼
            ┌───────────────────────────────┐
            │ FRAUD RING DETECTOR           │
            │  NetworkX similarity graph    │
            │  Clique-based ring discovery  │
            │  Cross-applicant correlation  │
            └───────────────┬───────────────┘
                            ▼
            ┌───────────────────────────────┐
            │ RISK ORCHESTRATOR             │
            │  Score -> band -> action      │
            │  (LOW / MEDIUM / HIGH / CRIT) │
            └───────────────┬───────────────┘
                            ▼
            ┌───────────────────────────────┐
            │ OUTPUT LAYER                  │
            │  - Streamlit dashboard        │
            │  - Bank-letterhead audit PDF  │
            │  - RBI compliance pack PDF    │
            │  - Audit JSON                 │
            │  - Webhook alerts (planned)   │
            │  - Provenance ledger entry    │
            └───────────────────────────────┘
```
---
## 3. Component reference
### 3.1 Forensics core (`forensics.py`)
Six independent detectors blended via `analyse_document(path)`:
| Detector | Method | Sub-score key |
|---------------------|--------------------------------------------|-----------------|
| Error Level Analysis| JPEG re-save diff | `ela` |
| Copy-move | ORB keypoints + cross-matching | `copy_move` |
| Noise inconsistency | Per-block Laplacian variance | `noise` |
| EXIF audit | Metadata + software-tag fingerprint | `exif` |
| OCR + text rules | Tesseract + IFSC/PAN/date/amount regex | `text_rules` |
| **AI-generated** | **Radial FFT spectral analysis (new)** | `ai_generated` |
Optional ML overlays:
- **Random Forest** (`predict_with_model`): 11 features (4 forensics + 4 GLCM texture + 3 colour entropy).
- **MobileNetV2 CNN** (`predict_with_cnn`): fine-tuned on CASIA v2; weight grows with measured validation AUC.
Final score: `risk_score = weighted_blend(sub_scores) -> RF overlay -> CNN overlay -> AI-gen overlay`.
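To make the "per-block Laplacian variance" row concrete, here is a minimal, dependency-free sketch of that idea. It is not the code in `forensics.py`: the function names, the 8-pixel block size, and the use of the coefficient of variation as the inconsistency score are all assumptions chosen for illustration.

```python
from statistics import pvariance


def laplacian(gray):
    """4-neighbour Laplacian of a 2-D list of intensities (interior pixels only)."""
    h, w = len(gray), len(gray[0])
    return [
        [gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
         - 4 * gray[y][x]
         for x in range(1, w - 1)]
        for y in range(1, h - 1)
    ]


def block_variances(gray, block=8):
    """Variance of the Laplacian response inside each non-overlapping block."""
    lap = laplacian(gray)
    h, w = len(lap), len(lap[0])
    variances = []
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            values = [lap[y][x]
                      for y in range(by, by + block)
                      for x in range(bx, bx + block)]
            variances.append(pvariance(values))
    return variances


def noise_inconsistency(gray, block=8):
    """Hypothetical score: spread of block noise levels relative to their mean.

    A genuine scan tends to have similar sensor noise everywhere, so the
    score stays low; a pasted or smoothed region pushes it up.
    """
    variances = block_variances(gray, block)
    if not variances:
        return 0.0
    mean = sum(variances) / len(variances)
    if mean == 0:
        return 0.0
    return pvariance(variances) ** 0.5 / mean
```

An image whose noise is uniform scores near zero, while one where a quadrant has been replaced by a flat patch scores much higher. A production version would likely run the same computation on NumPy or OpenCV arrays rather than nested lists.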
### 3.2 AI-generated detector (`ai_detector.py`)
The 2026 threat model is Sora / Midjourney / Stable Diffusion outputs, not Photoshop. This module catches them in the frequency domain:
| Signal | Detection |
|--------------------------------|-----------------------------------------------|
| High-frequency suppression | Ratio of low- to high-frequency FFT energy. |
| Periodic spectral peaks | Spike count in high-frequency band. |
| JPEG quantization absence | PIL `img.quantization` table inspection. |
Blended into the main risk score with a +20% cap so it never dominates classical signals, but reliably surfaces synthetic media.
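The first signal in the table (low- to high-frequency energy ratio) can be sketched as follows. This is an illustration under stated assumptions, not the contents of `ai_detector.py`: it uses a naive DFT from the standard library instead of a real FFT so that it runs without dependencies, it skips the DC term, and the `cutoff` fraction separating "low" from "high" radial frequency is a hypothetical parameter.

```python
import cmath
import math


def dft2(img):
    """Naive 2-D DFT of a square 2-D list; only suitable for tiny patches."""
    n = len(img)
    twiddle = [[cmath.exp(-2j * math.pi * k * t / n) for t in range(n)]
               for k in range(n)]
    # The 2-D DFT is separable: transform every row, then every column.
    rows = [[sum(row[x] * twiddle[v][x] for x in range(n)) for v in range(n)]
            for row in img]
    return [[sum(rows[y][v] * twiddle[u][y] for y in range(n)) for v in range(n)]
            for u in range(n)]


def low_high_energy_ratio(img, cutoff=0.5, eps=1e-12):
    """Ratio of spectral energy below vs. above a radial frequency cutoff.

    `cutoff` is a fraction of the Nyquist radius (assumed value). The DC
    term is excluded so overall brightness does not affect the ratio.
    """
    n = len(img)
    spectrum = dft2(img)
    low = high = 0.0
    for u in range(n):
        for v in range(n):
            if u == 0 and v == 0:
                continue
            # Map DFT bin indices to signed frequencies.
            fu = u if u <= n // 2 else u - n
            fv = v if v <= n // 2 else v - n
            energy = abs(spectrum[u][v]) ** 2
            if math.hypot(fu, fv) <= cutoff * (n / 2):
                low += energy
            else:
                high += energy
    return low / (high + eps)
```

A large ratio means high frequencies are suppressed relative to low ones, which is the pattern the table associates with synthetic media. How the ratio is thresholded or converted into `ai_gen_prob` is not specified here.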
### 3.3 Cross-document KYC (`compliance.py`)
- IFSC validation against 36 RBI bank codes
- PAN entity-type character + Luhn-like structural check
- Aadhaar UIDAI Verhoeff checksum (rejects 0/1 prefix)
- PII redaction via PyMuPDF text-bbox overlays
- RBI-format compliance audit PDF (5 sections, ReportLab Platypus)
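For reference, the Aadhaar check in the list above can be sketched with the standard Verhoeff tables. The function names are illustrative and may not match `compliance.py`; the 12-digit length and 0/1 prefix rejection follow the bullet above.

```python
# Standard Verhoeff tables: dihedral-group multiplication, position
# permutation, and inverse.
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
_INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_checksum(number: str) -> int:
    """Return 0 when `number` (including its check digit) is valid."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c


def verhoeff_check_digit(payload: str) -> int:
    """Compute the check digit to append to `payload`."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = _D[c][_P[(i + 1) % 8][int(ch)]]
    return _INV[c]


def is_valid_aadhaar(number: str) -> bool:
    """12 ASCII digits, first digit not 0 or 1, Verhoeff checksum passes."""
    digits = number.replace(" ", "")
    if len(digits) != 12 or not (digits.isascii() and digits.isdigit()):
        return False
    if digits[0] in "01":
        return False
    return verhoeff_checksum(digits) == 0
```

Verhoeff detects every single-digit error and every adjacent transposition, which is why it is preferred over a simple modulus check for identity numbers.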
### 3.4 Fraud Ring Detector (`fraud_ring.py`): *new headline feature*
Single-document forensics misses **organised** fraud. This module fixes that.
**Pipeline:**
1. **Extract identity signals** from each applicant: name, DOB, address, phone, IFSC, account, employer (OCR + regex).
2. **Build a weighted similarity graph** (NetworkX). Edge weight is a sum of per-signal match weights, with field-specific fraud significance:
- account number 0.25 (highest: same account = same person)
- DOB 0.15, address 0.20, phone 0.20
- name 0.10, IFSC 0.05, employer 0.05
3. **Detect rings** = connected components above a configurable similarity threshold, size ≥ 3.
4. **Score each ring**: CRITICAL (≥5 applicants), HIGH (3-4), MEDIUM (2).
5. **Visualise** as an interactive force-directed graph; ring members rendered in red with thick edges.
Banking impact: detects identity-recycling rings, address farms, mule-account networks, the patterns that cost banks ~₹3,000 crore/year (RBI Annual Report).
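Steps 2 to 4 of the pipeline can be sketched as below. The field weights and severity bands come from the list above; everything else is an assumption for illustration: the dictionary field names, exact-match comparison, and the 0.40 default threshold are hypothetical, and a plain adjacency dict with a depth-first search stands in for NetworkX so the sketch has no dependencies.

```python
from collections import defaultdict
from itertools import combinations

# Per-field match weights from the pipeline description (they sum to 1.0).
WEIGHTS = {"account": 0.25, "dob": 0.15, "address": 0.20, "phone": 0.20,
           "name": 0.10, "ifsc": 0.05, "employer": 0.05}


def similarity(a, b):
    """Sum the weights of every field that is present and identical in both."""
    return sum(w for field, w in WEIGHTS.items()
               if a.get(field) and a.get(field) == b.get(field))


def find_rings(applicants, threshold=0.40, min_size=3):
    """Connected components of the graph whose edges meet `threshold`."""
    adjacency = defaultdict(set)
    for (id_a, a), (id_b, b) in combinations(applicants.items(), 2):
        if similarity(a, b) >= threshold:
            adjacency[id_a].add(id_b)
            adjacency[id_b].add(id_a)

    seen, rings = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) >= min_size:
            rings.append(sorted(component))
    return rings


def ring_severity(size):
    """Severity bands as listed in step 4."""
    if size >= 5:
        return "CRITICAL"
    if size >= 3:
        return "HIGH"
    return "MEDIUM"
```

With NetworkX the component search would typically be a call to `networkx.connected_components` on a graph filtered by edge weight. Real identity fields would also need normalisation or fuzzy matching before comparison, which this sketch omits.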
### 3.5 Tamper Forge Studio (`tampering.py`)
Adversarial validation: live UI to apply copy-move, splice, text-edit, compression, metadata-strip, custom-region, or chained tampering operations to a clean sample, then immediately re-run detection. Visual before/after with bounding boxes, per-detector scorecard, ELA + noise heatmap overlays. Doubles as a continuous test harness for the forensics layer.
### 3.6 Provenance Ledger (`provenance.py`): *new compliance feature*
Tamper-evident SHA-256 hash chain over every analysis:
```
record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)
```
- Stored in SQLite (single file, zero-deploy)
- `verify_chain()` walks every record in O(N) and pinpoints the first broken record
- Satisfies RBI Master Direction on KYC (2016), Para 67 record-retention requirements
- Downloadable as JSON for external auditors
Conceptually a baby blockchain: append-only, hash-linked, mathematically verifiable.
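A minimal sketch of the chain, following the `record_hash` formula above. It keeps records in a Python list rather than SQLite to stay short, and the all-zero genesis hash, the `|` separator as a literal delimiter, and the four-decimal score formatting are assumptions rather than details confirmed by `provenance.py`.

```python
import hashlib

GENESIS = "0" * 64  # assumed prev_hash for the first record


def record_hash(timestamp, doc_sha256, risk_band, risk_score, prev_hash):
    """SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)."""
    payload = "|".join(
        [timestamp, doc_sha256, risk_band, f"{risk_score:.4f}", prev_hash]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(chain, timestamp, doc_bytes, risk_band, risk_score):
    """Append one analysis result, linking it to the previous record."""
    prev_hash = chain[-1]["record_hash"] if chain else GENESIS
    doc_sha256 = hashlib.sha256(doc_bytes).hexdigest()
    record = {
        "timestamp": timestamp,
        "doc_sha256": doc_sha256,
        "risk_band": risk_band,
        "risk_score": risk_score,
        "prev_hash": prev_hash,
        "record_hash": record_hash(
            timestamp, doc_sha256, risk_band, risk_score, prev_hash
        ),
    }
    chain.append(record)
    return record


def verify_chain(chain):
    """Walk the chain in O(N); return the index of the first broken record,
    or None when every link is intact."""
    prev_hash = GENESIS
    for index, rec in enumerate(chain):
        expected = record_hash(
            rec["timestamp"], rec["doc_sha256"], rec["risk_band"],
            rec["risk_score"], prev_hash,
        )
        if rec["prev_hash"] != prev_hash or rec["record_hash"] != expected:
            return index
        prev_hash = rec["record_hash"]
    return None
```

Editing any stored field changes that record's expected hash, so the verifier stops at the edited row. Recomputing the row's hash to hide the edit only moves the break to the next record, whose `prev_hash` no longer matches.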
### 3.7 Audit report (`audit_report.py`)
Bank-letterhead PDF with:
- Metadata table (file, SHA-256, analysed timestamp)
- Risk verdict box (colour-coded by band)
- Sub-score table with ASCII bars
- Evidence bullets
- Embedded forensic heatmaps
### 3.8 Dashboard (`app.py`)
Six-tab Streamlit UI. Sample documents bundled (`sample_data/`) for instant demo.
---
## 4. Data assets
| Asset | Purpose | Volume |
|--------------------------------------|--------------------------------------|--------|
| AgamiAI Indian Bank Statements (HF) | Real Indian bank statement PDFs | 217 |
| IDRBT Cheque Image Dataset | Cheque images, Indian banking format | 112 |
| CASIA v2 | CNN training (forged/authentic) | ~12 k |
| `sample_data/` bundled | Demo fixtures | 26 |
---
## 5. Ensemble fusion logic
```
sub_scores = {ela, copy_move, noise, exif, text_rules, ai_generated}
weights    = {ela: 0.20, copy_move: 0.25, noise: 0.20, exif: 0.15,
              text_rules: 0.20}
# ai_generated is a separate overlay, not in base weights

base_score     = sum(weights[k] * sub_scores[k] for k in weights)
score_with_rf  = 0.5 * base_score + 0.5 * rf.predict_proba(features)
score_with_cnn = (1 - w) * score_with_rf + w * cnn.predict(image)
                 where w = clamp(cnn.val_auc, 0.4, 0.7)
final_score    = 0.9 * score_with_cnn + 0.1 * ai_gen_prob * 2.0
                 (AI-gen capped at +20%)
```
Band mapping: `0-0.30 LOW · 0.30-0.50 MEDIUM · 0.50-0.75 HIGH · 0.75+ CRITICAL`
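The fusion and band mapping can be written as runnable Python for experimentation. This sketch takes the model outputs as plain floats instead of calling the RF and CNN. Two details are assumptions not stated above: each band's lower bound is treated as inclusive, and the final score is clamped to [0, 1], because the formula as written can reach 1.1 when every input is 1.0.

```python
# Base weights for the classical detectors; ai_generated is applied
# afterwards as a separate overlay.
WEIGHTS = {"ela": 0.20, "copy_move": 0.25, "noise": 0.20,
           "exif": 0.15, "text_rules": 0.20}

# (inclusive lower bound, band), checked from highest to lowest.
BANDS = [(0.75, "CRITICAL"), (0.50, "HIGH"), (0.30, "MEDIUM")]


def clamp(value, low, high):
    return max(low, min(high, value))


def fuse(sub_scores, rf_prob, cnn_prob, cnn_val_auc, ai_gen_prob):
    """Blend detector sub-scores with the RF, CNN and AI-gen overlays."""
    base_score = sum(WEIGHTS[k] * sub_scores[k] for k in WEIGHTS)
    score_with_rf = 0.5 * base_score + 0.5 * rf_prob
    # The CNN weight tracks validation AUC, bounded to [0.4, 0.7].
    w = clamp(cnn_val_auc, 0.4, 0.7)
    score_with_cnn = (1 - w) * score_with_rf + w * cnn_prob
    # The AI-gen overlay adds at most 0.1 * 1.0 * 2.0 = 0.20.
    final_score = 0.9 * score_with_cnn + 0.1 * ai_gen_prob * 2.0
    return clamp(final_score, 0.0, 1.0)  # assumption: keep score in [0, 1]


def risk_band(score):
    """Map a fused score to LOW / MEDIUM / HIGH / CRITICAL."""
    for lower_bound, band in BANDS:
        if score >= lower_bound:
            return band
    return "LOW"
```

For example, with every sub-score, the RF probability, and the CNN probability at 0.5 and no AI-gen signal, the fused score is 0.45 (MEDIUM). Raising `ai_gen_prob` to 1.0 adds the full 0.20 overlay and moves the same document to 0.65 (HIGH).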
---
## 6. Roadmap: what's next
The architecture is **wired** for the extensions below; they ship in subsequent rounds.
| Capability | Status | Notes |
|---------------------------------------|---------|----------------------------------------|
| FastAPI gateway + webhook alerts | planned | Push to LOS / CRM on HIGH or CRITICAL |
| Federated learning across banks | planned | Flower (`flwr`); no raw data leaves |
| LLM-based document reasoning | planned | Local Phi-3 / Gemma over OCR text |
| Real-time drift monitoring | planned | Track per-detector confidence over time|
| Kubernetes deployment | planned | For multi-tenant bank hosting |
| Multilingual OCR (Hindi / Bengali) | planned | Tesseract + IndicOCR models |
---
## 7. Mapping to Round-1 submission
| Round-1 idea | Round-2 realisation |
|------------------------------------|-------------------------------------------------------|
| Image forensics (ELA, copy-move, noise, EXIF) | `forensics.py`: fully implemented + **AI-gen FFT detector** |
| PDF structural auditing | `forensics.pdf_structural_audit` + `pdf_font_audit` |
| OCR + financial validation | `forensics.text_rule_checks` + IFSC/PAN/Aadhaar full validators |
| Random Forest risk scoring | `forensics.predict_with_model`: trained on 11-d feature set |
| Real-time underwriter dashboard | Streamlit app, 6 tabs, bank-letterhead PDF output |
| CNN with MobileNetV2 (future) | **Delivered**: fine-tuned on CASIA v2 |
| LLM reasoning (future) | Roadmap (see § 6) |
| API deployment (future) | Roadmap: FastAPI gateway scaffolded |
| **NEW: Fraud Ring Network** | Cross-applicant graph + clique discovery |
| **NEW: Provenance ledger** | SHA-256 hash chain, RBI Para 67 compliant |
| **NEW: Tamper Forge Studio** | Adversarial-validation harness |
The Round-1 pillars remain the visible centre of the system. The new
pillars extend each axis without breaking the original framing:
forensics gets AI-gen detection, scoring gets a cross-applicant view,
the dashboard gets a tamper-evident audit trail.
---
## 8. Repository layout
```
.
├── app.py                    Streamlit dashboard (6 tabs)
├── forensics.py              Core analysis pipeline
├── ai_detector.py            AI-generated content detector (FFT)
├── fraud_ring.py             Cross-applicant graph + clique detection
├── provenance.py             Tamper-evident SHA-256 hash chain
├── compliance.py             IFSC / PAN / Aadhaar / PII redaction
├── tampering.py              Adversarial harness for Forge Studio
├── audit_report.py           Bank-letterhead PDF builder
├── docsentry_master.ipynb    Notebook source of truth
├── models/                   RF + CNN model files
├── sample_data/              26 demo documents
├── requirements.txt          Python dependencies
├── packages.txt              apt-get packages (HF Spaces)
├── README.md                 Reference + install guide
├── ARCHITECTURE.md           This document
└── LICENSE                   MIT + third-party notices
```
---
*This architecture document is the technical reference for DocSentry Round 2.
It accompanies the live demo at https://huggingface.co/spaces/SpandanM110/DocSentry
and the source code at https://github.com/SpandanM110/Doc-Sentry.*