# DocSentry: System Architecture

**Real-time document anomaly detection for Indian bank underwriting.**

DocSentry is the operational realisation of the Round-1 submission idea: catch tampered, forged, and AI-generated documents at the moment of upload, score them on a calibrated risk scale, and hand the underwriter a defensible audit trail. Round-2 turns that idea into a robust, production-grade platform.

---

## 1. Architectural principles

| Principle | What it means in DocSentry |
|---|---|
| **Defence in depth** | Six independent detection layers (rule, image, PDF, OCR, ML, AI-generated). No single bypass defeats the system. |
| **Explainability first** | Every verdict ships with sub-scores, evidence bullets, and visual heatmaps. Black-box outputs are unacceptable in regulated finance. |
| **Tamper-evident provenance** | Every analysis is appended to a SHA-256 hash chain. Retroactive edits are mathematically detectable. |
| **Portfolio-level vision** | Single-document forensics is necessary but insufficient. Real fraud is *organised*; the system reasons across applicants. |
| **Zero data egress** | All inference runs locally. No applicant PII leaves the bank's perimeter. |
| **RBI-aligned output** | Compliance reports follow Master Direction on KYC formatting so they can be filed directly. |

---

## 2. Layered architecture

```
               ┌──────────────────────────────────────────┐
               │  PRESENTATION (Streamlit / future PWA)   │
               │                                          │
               │  Tab 1  Single-doc analysis              │
               │  Tab 2  Cross-document KYC               │
               │  Tab 3  Batch underwriter audit          │
               │  Tab 4  RBI compliance + Provenance      │
               │  Tab 5  Live Tamper Forge Studio         │
               │  Tab 6  Fraud Ring Network               │
               └────────────────────┬─────────────────────┘
                                    │
               ┌────────────────────▼─────────────────────┐
               │  API GATEWAY (planned FastAPI front)     │
               │  /analyse     /verify   /forge-test      │
               │  /compliance  /batch    /webhook         │
               └────────────────────┬─────────────────────┘
                                    │
         ┌──────────────────────────┼────────────────────────────┐
         ▼                          ▼                            ▼
┌─────────────────┐     ┌───────────────────────┐     ┌─────────────────────┐
│ INGESTION       │     │ FORENSICS CORE        │     │ COMPLIANCE CORE     │
│ - Direct upload │     │ Rule layer (ELA,      │     │ RBI IFSC lookup     │
│ - Watch folder  │     │   copy-move, noise,   │     │ PAN entity check    │
│ - PDF / image   │     │   EXIF, PDF struct,   │     │ Aadhaar Verhoeff    │
│ - Future Kafka  │     │   OCR rules)          │     │ PII redaction       │
└────────┬────────┘     │ RF classifier (11-d)  │     │ DPDP-aligned        │
         │              │ CNN (MobileNetV2)     │     └──────────┬──────────┘
         ▼              │ AI-gen detector (FFT) │                │
┌─────────────────┐     └───────────┬───────────┘                │
│ PROVENANCE      │                 │                            │
│ SHA-256 chain   │                 ▼                            │
│ SQLite ledger   │       ┌───────────────────┐                  │
│ verify_chain()  │       │ ENSEMBLE FUSION   │                  │
└────────┬────────┘       │ Weighted blend    │                  │
         │                │ per sub-detector  │                  │
         │                └─────────┬─────────┘                  │
         └──────────────────────────┼────────────────────────────┘
                                    ▼
                   ┌─────────────────────────────────┐
                   │  FRAUD RING DETECTOR            │
                   │  NetworkX similarity graph      │
                   │  Clique-based ring discovery    │
                   │  Cross-applicant correlation    │
                   └────────────────┬────────────────┘
                                    ▼
                   ┌─────────────────────────────────┐
                   │  RISK ORCHESTRATOR              │
                   │  Score -> band -> action        │
                   │  (LOW / MEDIUM / HIGH / CRIT)   │
                   └────────────────┬────────────────┘
                                    ▼
                   ┌─────────────────────────────────┐
                   │  OUTPUT LAYER                   │
                   │  - Streamlit dashboard          │
                   │  - Bank-letterhead audit PDF    │
                   │  - RBI compliance pack PDF      │
                   │  - Audit JSON                   │
                   │  - Webhook alerts (planned)     │
                   │  - Provenance ledger entry      │
                   └─────────────────────────────────┘
```

---

## 3. Component reference

### 3.1 Forensics core (`forensics.py`)

Six independent detectors blended via `analyse_document(path)`:

| Detector             | Method                                 | Sub-score key  |
|----------------------|----------------------------------------|----------------|
| Error Level Analysis | JPEG re-save diff                      | `ela`          |
| Copy-move            | ORB keypoints + cross-matching         | `copy_move`    |
| Noise inconsistency  | Per-block Laplacian variance           | `noise`        |
| EXIF audit           | Metadata + software-tag fingerprint    | `exif`         |
| OCR + text rules     | Tesseract + IFSC/PAN/date/amount regex | `text_rules`   |
| **AI-generated**     | **Radial FFT spectral analysis (new)** | `ai_generated` |

Optional ML overlays:

- **Random Forest** (`predict_with_model`): 11 features (4 forensics + 4 GLCM texture + 3 colour entropy).
- **MobileNetV2 CNN** (`predict_with_cnn`): fine-tuned on CASIA v2; weight grows with measured validation AUC.

Final score: `risk_score = weighted_blend(sub_scores) -> RF overlay -> CNN overlay -> AI-gen overlay`.

### 3.2 AI-generated detector (`ai_detector.py`)

The 2026 threat model is Sora / Midjourney / Stable Diffusion outputs, not Photoshop. This module catches them in the frequency domain:

| Signal                     | Detection                                   |
|----------------------------|---------------------------------------------|
| High-frequency suppression | Ratio of low- to high-frequency FFT energy. |
| Periodic spectral peaks    | Spike count in high-frequency band.         |
| JPEG quantization absence  | PIL `img.quantization` table inspection.    |

Blended into the main risk score with a +20% cap so it never dominates classical signals, but reliably surfaces synthetic media.
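The first two frequency-domain signals in § 3.2 can be illustrated with a minimal NumPy sketch, assuming the input is a 2-D grayscale array. The function names, the `cutoff` radius, and the `k` threshold are illustrative assumptions, not the values or API of `ai_detector.py`.

```python
import numpy as np


def _power_and_radius(gray):
    """Return the centred power spectrum and a normalised radial frequency grid."""
    gray = np.asarray(gray, dtype=np.float64)
    gray = gray - gray.mean()  # remove DC so overall brightness does not dominate
    power = np.abs(np.fft.fftshift(np.fft.fft2(gray))) ** 2
    h, w = gray.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]  # cycles per pixel
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.hypot(fx, fy) / 0.5  # 1.0 corresponds to the axis Nyquist limit
    return power, radius


def low_high_energy_ratio(gray, cutoff=0.25):
    """Low-band over high-band spectral energy; `cutoff` is an assumed radius.

    A larger value means high frequencies are more strongly suppressed.
    """
    power, radius = _power_and_radius(gray)
    low = power[radius <= cutoff].sum()
    high = power[radius > cutoff].sum()
    return float(low / (high + 1e-12))  # epsilon avoids division by zero


def high_band_spike_count(gray, cutoff=0.25, k=6.0):
    """Count high-band bins whose log-power exceeds mean + k * std (assumed rule)."""
    power, radius = _power_and_radius(gray)
    band = np.log1p(power[radius > cutoff])
    return int((band > band.mean() + k * band.std()).sum())
```

In this sketch a smooth, low-detail image yields a much larger energy ratio than white noise, and a strong periodic pattern in the high band shows up as a non-zero spike count. How those raw numbers map to `ai_generated` probabilities is left open here.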
### 3.3 Cross-document KYC (`compliance.py`)

- IFSC validation against 36 RBI bank codes
- PAN entity-type character + Luhn-like structural check
- Aadhaar UIDAI Verhoeff checksum (rejects 0/1 prefix)
- PII redaction via PyMuPDF text-bbox overlays
- RBI-format compliance audit PDF (5 sections, ReportLab Platypus)

### 3.4 Fraud Ring Detector (`fraud_ring.py`): *new headline feature*

Single-document forensics misses **organised** fraud. This module fixes that.

**Pipeline:**

1. **Extract identity signals** from each applicant: name, DOB, address, phone, IFSC, account, employer (OCR + regex).
2. **Build a weighted similarity graph** (NetworkX). Edge weight is a sum of per-signal match weights, with field-specific fraud significance:
   - account number 0.25 (highest: same account = same person)
   - DOB 0.15, address 0.20, phone 0.20
   - name 0.10, IFSC 0.05, employer 0.05
3. **Detect rings** = connected components above a configurable similarity threshold, size ≥ 3.
4. **Score each ring**: CRITICAL (≥5 applicants), HIGH (3-4), MEDIUM (2).
5. **Visualise** as an interactive force-directed graph; ring members rendered in red with thick edges.

Banking impact: detects identity-recycling rings, address farms, mule-account networks, the patterns that cost banks ~₹3,000 crore/year (RBI Annual Report).

### 3.5 Tamper Forge Studio (`tampering.py`)

Adversarial validation: live UI to apply copy-move, splice, text-edit, compression, metadata-strip, custom-region, or chained tampering operations to a clean sample, then immediately re-run detection. Visual before/after with bounding boxes, per-detector scorecard, ELA + noise heatmap overlays. Doubles as a continuous test harness for the forensics layer.
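Returning to the ring-discovery pipeline of § 3.4, steps 2 to 4 can be sketched in plain Python. This is an illustration under stated assumptions: it uses a hand-written component search instead of NetworkX to stay dependency-free, it treats a signal as matching only on exact equality, and the names `edge_weight`, `find_rings`, `ring_band` and the default `threshold` of 0.40 are hypothetical rather than taken from `fraud_ring.py`.

```python
from itertools import combinations

# Per-signal weights as listed in section 3.4 (they sum to 1.0).
WEIGHTS = {
    "account": 0.25, "address": 0.20, "phone": 0.20, "dob": 0.15,
    "name": 0.10, "ifsc": 0.05, "employer": 0.05,
}


def edge_weight(a, b):
    """Sum the weights of every signal on which two applicants agree exactly."""
    return sum(
        w for key, w in WEIGHTS.items()
        if a.get(key) is not None and a.get(key) == b.get(key)
    )


def find_rings(applicants, threshold=0.40, min_size=3):
    """Return connected components of the thresholded similarity graph.

    `applicants` maps an applicant id to a dict of extracted signals.
    `threshold` and `min_size` are assumed defaults for illustration.
    """
    adjacency = {aid: set() for aid in applicants}
    for u, v in combinations(applicants, 2):
        if edge_weight(applicants[u], applicants[v]) >= threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen, rings = set(), []
    for start in applicants:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # iterative depth-first search over one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) >= min_size:
            rings.append(sorted(component))
    return rings


def ring_band(size):
    """Map ring size to the severity bands given in step 4."""
    if size >= 5:
        return "CRITICAL"
    return "HIGH" if size >= 3 else "MEDIUM"
```

Because components are linked transitively, two applicants can land in the same ring without sharing any signal directly, as long as a third applicant connects them. With `min_size=3` the MEDIUM band is never reached; lowering `min_size` to 2 would let linked pairs surface there.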
### 3.6 Provenance Ledger (`provenance.py`): *new compliance feature*

Tamper-evident SHA-256 hash chain over every analysis:

```
record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)
```

- Stored in SQLite (single file, zero-deploy)
- `verify_chain()` walks every record in O(N) and pinpoints the first broken record
- Satisfies RBI Master Direction on KYC (2016), Para 67 record-retention requirements
- Downloadable as JSON for external auditors

Conceptually a baby blockchain: append-only, hash-linked, mathematically verifiable.

### 3.7 Audit report (`audit_report.py`)

Bank-letterhead PDF with:

- Metadata table (file, SHA-256, analysed timestamp)
- Risk verdict box (colour-coded by band)
- Sub-score table with ASCII bars
- Evidence bullets
- Embedded forensic heatmaps

### 3.8 Dashboard (`app.py`)

Six-tab Streamlit UI. Sample documents bundled (`sample_data/`) for instant demo.

---

## 4. Data assets

| Asset                               | Purpose                              | Volume |
|-------------------------------------|--------------------------------------|--------|
| AgamiAI Indian Bank Statements (HF) | Real Indian bank statement PDFs      | 217    |
| IDRBT Cheque Image Dataset          | Cheque images, Indian banking format | 112    |
| CASIA v2                            | CNN training (forged/authentic)      | ~12 k  |
| `sample_data/` bundled              | Demo fixtures                        | 26     |

---

## 5. Ensemble fusion logic

```
sub_scores = {ela, copy_move, noise, exif, text_rules, ai_generated}
weights    = {ela: 0.20, copy_move: 0.25, noise: 0.20, exif: 0.15, text_rules: 0.20}
# ai_generated is a separate overlay, not in base weights

base_score     = sum(weights[k] * sub_scores[k] for k in weights)
score_with_rf  = 0.5 * base_score + 0.5 * rf.predict_proba(features)
w              = clamp(cnn.val_auc, 0.4, 0.7)
score_with_cnn = (1 - w) * score_with_rf + w * cnn.predict(image)
final_score    = 0.9 * score_with_cnn + 0.1 * ai_gen_prob * 2.0   # AI-gen capped at +20%
```

Band mapping: `0-0.30 LOW · 0.30-0.50 MEDIUM · 0.50-0.75 HIGH · 0.75+ CRITICAL`

---

## 6. Roadmap: what's next

The architecture above is **wired** for these extensions; they ship in subsequent rounds.

| Capability                         | Status  | Notes                                   |
|------------------------------------|---------|-----------------------------------------|
| FastAPI gateway + webhook alerts   | planned | Push to LOS / CRM on HIGH or CRITICAL   |
| Federated learning across banks    | planned | Flower (`flwr`); no raw data leaves     |
| LLM-based document reasoning       | planned | Local Phi-3 / Gemma over OCR text       |
| Real-time drift monitoring         | planned | Track per-detector confidence over time |
| Kubernetes deployment              | planned | For multi-tenant bank hosting           |
| Multilingual OCR (Hindi / Bengali) | planned | Tesseract + IndicOCR models             |

---

## 7. Mapping to Round-1 submission

| Round-1 idea | Round-2 realisation |
|---|---|
| Image forensics (ELA, copy-move, noise, EXIF) | `forensics.py`, fully implemented + **AI-gen FFT detector** |
| PDF structural auditing | `forensics.pdf_structural_audit` + `pdf_font_audit` |
| OCR + financial validation | `forensics.text_rule_checks` + IFSC/PAN/Aadhaar full validators |
| Random Forest risk scoring | `forensics.predict_with_model`, trained on 11-d feature set |
| Real-time underwriter dashboard | Streamlit app, 6 tabs, bank-letterhead PDF output |
| CNN with MobileNetV2 (future) | **Delivered**: fine-tuned on CASIA v2 |
| LLM reasoning (future) | Roadmap (see § 6) |
| API deployment (future) | Roadmap: FastAPI gateway scaffolded |
| **NEW: Fraud Ring Network** | Cross-applicant graph + clique discovery |
| **NEW: Provenance ledger** | SHA-256 hash chain, RBI Para 67 compliant |
| **NEW: Tamper Forge Studio** | Adversarial-validation harness |

The Round-1 pillars remain the visible centre of the system.
The new pillars extend each axis without breaking the original framing: forensics gets AI-gen detection, scoring gets a cross-applicant view, the dashboard gets a tamper-evident audit trail.

---

## 8. Repository layout

```
.
├── app.py                    Streamlit dashboard (6 tabs)
├── forensics.py              Core analysis pipeline
├── ai_detector.py            AI-generated content detector (FFT)
├── fraud_ring.py             Cross-applicant graph + clique detection
├── provenance.py             Tamper-evident SHA-256 hash chain
├── compliance.py             IFSC / PAN / Aadhaar / PII redaction
├── tampering.py              Adversarial harness for Forge Studio
├── audit_report.py           Bank-letterhead PDF builder
├── docsentry_master.ipynb    Notebook source of truth
├── models/                   RF + CNN model files
├── sample_data/              26 demo documents
├── requirements.txt          Python dependencies
├── packages.txt              apt-get packages (HF Spaces)
├── README.md                 Reference + install guide
├── ARCHITECTURE.md           This document
└── LICENSE                   MIT + third-party notices
```

---

*This architecture document is the technical reference for DocSentry Round 2. It accompanies the live demo at https://huggingface.co/spaces/SpandanM110/DocSentry and the source code at https://github.com/SpandanM110/Doc-Sentry.*