| # DocSentry: System Architecture | |
| **Real-time document anomaly detection for Indian bank underwriting.** | |
| DocSentry is the operational realisation of the Round-1 submission idea: | |
| catch tampered, forged, and AI-generated documents at the moment of | |
| upload, score them on a calibrated risk scale, and hand the underwriter | |
| a defensible audit trail. Round-2 turns that idea into a robust, | |
| production-grade platform. | |
| --- | |
| ## 1. Architectural principles | |
| | Principle | What it means in DocSentry | | |
| |---|---| | |
| | **Defence in depth** | Six independent detection layers (rule, image, PDF, OCR, ML, AI-generated). No single bypass defeats the system. | | |
| | **Explainability first** | Every verdict ships with sub-scores, evidence bullets, and visual heatmaps. Black-box outputs are unacceptable in regulated finance. | | |
| | **Tamper-evident provenance** | Every analysis is appended to a SHA-256 hash chain. Retroactive edits are mathematically detectable. | | |
| | **Portfolio-level vision** | Single-document forensics is necessary but insufficient. Real fraud is *organised*; the system reasons across applicants. | | |
| | **Zero data egress** | All inference runs locally. No applicant PII leaves the bank's perimeter. | | |
| | **RBI-aligned output** | Compliance reports follow Master Direction on KYC formatting so they can be filed directly. | | |
| --- | |
| ## 2. Layered architecture | |
| ``` | |
|             ┌─────────────────────────────────────────┐ | |
|             │ PRESENTATION (Streamlit / future PWA)   │ | |
|             │                                         │ | |
|             │ Tab 1  Single-doc analysis              │ | |
|             │ Tab 2  Cross-document KYC               │ | |
|             │ Tab 3  Batch underwriter audit          │ | |
|             │ Tab 4  RBI compliance + Provenance      │ | |
|             │ Tab 5  Live Tamper Forge Studio         │ | |
|             │ Tab 6  Fraud Ring Network               │ | |
|             └────────────────────┬────────────────────┘ | |
|                                  │ | |
|             ┌────────────────────▼────────────────────┐ | |
|             │ API GATEWAY (planned FastAPI front)     │ | |
|             │ /analyse  /verify  /forge-test          │ | |
|             │ /compliance  /batch  /webhook           │ | |
|             └────────────────────┬────────────────────┘ | |
|                                  │ | |
|          ┌───────────────────────┼─────────────────────────┐ | |
|          ▼                       ▼                         ▼ | |
| ┌─────────────────┐  ┌───────────────────────┐  ┌─────────────────────┐ | |
| │ INGESTION       │  │ FORENSICS CORE        │  │ COMPLIANCE CORE     │ | |
| │ - Direct upload │  │ Rule layer (ELA,      │  │ RBI IFSC lookup     │ | |
| │ - Watch folder  │  │ copy-move, noise,     │  │ PAN entity check    │ | |
| │ - PDF / image   │  │ EXIF, PDF struct,     │  │ Aadhaar Verhoeff    │ | |
| │ - Future Kafka  │  │ OCR rules)            │  │ PII redaction       │ | |
| └────────┬────────┘  │ RF classifier (11-d)  │  │ DPDP-aligned        │ | |
|          │           │ CNN (MobileNetV2)     │  └──────────┬──────────┘ | |
|          ▼           │ AI-gen detector (FFT) │             │ | |
| ┌─────────────────┐  └───────────┬───────────┘             │ | |
| │ PROVENANCE      │              │                         │ | |
| │ SHA-256 chain   │              ▼                         │ | |
| │ SQLite ledger   │    ┌───────────────────┐               │ | |
| │ verify_chain()  │    │ ENSEMBLE FUSION   │               │ | |
| └────────┬────────┘    │ Weighted blend    │               │ | |
|          │             │ per sub-detector  │               │ | |
|          │             └─────────┬─────────┘               │ | |
|          └───────────────────────┼─────────────────────────┘ | |
|                                  ▼ | |
|                  ┌───────────────────────────────┐ | |
|                  │ FRAUD RING DETECTOR           │ | |
|                  │ NetworkX similarity graph     │ | |
|                  │ Clique-based ring discovery   │ | |
|                  │ Cross-applicant correlation   │ | |
|                  └───────────────┬───────────────┘ | |
|                                  ▼ | |
|                  ┌───────────────────────────────┐ | |
|                  │ RISK ORCHESTRATOR             │ | |
|                  │ Score -> band -> action       │ | |
|                  │ (LOW / MEDIUM / HIGH / CRIT)  │ | |
|                  └───────────────┬───────────────┘ | |
|                                  ▼ | |
|                  ┌───────────────────────────────┐ | |
|                  │ OUTPUT LAYER                  │ | |
|                  │ - Streamlit dashboard         │ | |
|                  │ - Bank-letterhead audit PDF   │ | |
|                  │ - RBI compliance pack PDF     │ | |
|                  │ - Audit JSON                  │ | |
|                  │ - Webhook alerts (planned)    │ | |
|                  │ - Provenance ledger entry     │ | |
|                  └───────────────────────────────┘ | |
| ``` | |
| --- | |
| ## 3. Component reference | |
| ### 3.1 Forensics core (`forensics.py`) | |
| Six independent detectors blended via `analyse_document(path)`: | |
| | Detector | Method | Sub-score key | | |
| |---------------------|--------------------------------------------|-----------------| | |
| | Error Level Analysis| JPEG re-save diff | `ela` | | |
| | Copy-move | ORB keypoints + cross-matching | `copy_move` | | |
| | Noise inconsistency | Per-block Laplacian variance | `noise` | | |
| | EXIF audit | Metadata + software-tag fingerprint | `exif` | | |
| | OCR + text rules | Tesseract + IFSC/PAN/date/amount regex | `text_rules` | | |
| | **AI-generated** | **Radial FFT spectral analysis (new)** | `ai_generated` | | |
| Optional ML overlays: | |
| - **Random Forest** (`predict_with_model`): 11 features (4 forensics + 4 GLCM texture + 3 colour entropy). | |
| - **MobileNetV2 CNN** (`predict_with_cnn`): fine-tuned on CASIA v2; weight grows with measured validation AUC. | |
| Final score: `risk_score = weighted_blend(sub_scores) -> RF overlay -> CNN overlay -> AI-gen overlay`. | |
| ### 3.2 AI-generated detector (`ai_detector.py`) | |
| The 2026 threat model is Sora / Midjourney / Stable Diffusion outputs, not Photoshop. This module catches them in the frequency domain: | |
| | Signal | Detection | | |
| |--------------------------------|-----------------------------------------------| | |
| | High-frequency suppression | Ratio of low- to high-frequency FFT energy. | | |
| | Periodic spectral peaks | Spike count in high-frequency band. | | |
| | JPEG quantization absence | PIL `img.quantization` table inspection. | | |
| Blended into the main risk score with a +20% cap so it never dominates classical signals, but reliably surfaces synthetic media. | |
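The first signal in the table, high-frequency suppression, can be illustrated with a radial split of the 2-D power spectrum. This is a minimal sketch rather than the code in `ai_detector.py`: the function name `radial_energy_ratio`, the `cutoff` default, and the use of NumPy are all assumptions made for illustration.

```python
import numpy as np

def radial_energy_ratio(gray: np.ndarray, cutoff: float = 0.5) -> float:
    """Ratio of low- to high-frequency spectral energy of a 2-D grayscale image.

    `cutoff` (hypothetical default) is the fraction of the maximum radius
    that separates the "low" band from the "high" band.
    """
    pixels = gray.astype(np.float64)
    pixels = pixels - pixels.mean()              # drop the DC term so it cannot dominate
    spectrum = np.fft.fftshift(np.fft.fft2(pixels))
    power = np.abs(spectrum) ** 2

    h, w = pixels.shape
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)  # distance from the spectrum centre
    r_norm = radius / radius.max()

    low = power[r_norm <= cutoff].sum()
    high = power[r_norm > cutoff].sum()
    return float(low / (high + 1e-12))           # epsilon guards against division by zero
```

A higher ratio means less energy in the outer band. Where to place the decision threshold is an empirical question that this sketch does not answer.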
| ### 3.3 Cross-document KYC (`compliance.py`) | |
| - IFSC validation against 36 RBI bank codes | |
| - PAN entity-type character + Luhn-like structural check | |
| - Aadhaar UIDAI Verhoeff checksum (rejects 0/1 prefix) | |
| - PII redaction via PyMuPDF text-bbox overlays | |
| - RBI-format compliance audit PDF (5 sections, ReportLab Platypus) | |
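The identifier checks listed above can be sketched with the standard library alone. The Verhoeff tables are the standard published ones; the function names and regular expressions are illustrative assumptions, not the API of `compliance.py`. The IFSC check here covers format only, whereas the module is described as validating against a list of bank codes.

```python
import re

# Standard Verhoeff tables: D5 group multiplication and position permutation.
_D = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), (1, 2, 3, 4, 0, 6, 7, 8, 9, 5),
    (2, 3, 4, 0, 1, 7, 8, 9, 5, 6), (3, 4, 0, 1, 2, 8, 9, 5, 6, 7),
    (4, 0, 1, 2, 3, 9, 5, 6, 7, 8), (5, 9, 8, 7, 6, 0, 4, 3, 2, 1),
    (6, 5, 9, 8, 7, 1, 0, 4, 3, 2), (7, 6, 5, 9, 8, 2, 1, 0, 4, 3),
    (8, 7, 6, 5, 9, 3, 2, 1, 0, 4), (9, 8, 7, 6, 5, 4, 3, 2, 1, 0),
)
_P = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), (1, 5, 7, 6, 2, 8, 3, 0, 9, 4),
    (5, 8, 0, 3, 7, 9, 6, 1, 4, 2), (8, 9, 1, 6, 0, 4, 3, 5, 2, 7),
    (9, 4, 5, 3, 1, 2, 6, 8, 7, 0), (4, 2, 8, 6, 5, 7, 3, 9, 0, 1),
    (2, 7, 9, 3, 8, 0, 6, 4, 1, 5), (7, 0, 4, 6, 9, 1, 3, 2, 5, 8),
)

def verhoeff_valid(number: str) -> bool:
    """True if the ASCII digit string (check digit last) passes Verhoeff."""
    if not re.fullmatch(r"[0-9]+", number):
        return False
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0

def aadhaar_valid(number: str) -> bool:
    """12 digits, first digit 2-9 (0/1 prefixes rejected), valid Verhoeff checksum."""
    digits = number.replace(" ", "")
    return re.fullmatch(r"[2-9][0-9]{11}", digits) is not None and verhoeff_valid(digits)

def ifsc_format_ok(code: str) -> bool:
    """Format only: 4-letter bank code, a literal 0, 6-character branch code."""
    return re.fullmatch(r"[A-Z]{4}0[A-Z0-9]{6}", code) is not None

def pan_format_ok(pan: str) -> bool:
    """Format only: the 4th character must be a known entity-type letter."""
    return re.fullmatch(r"[A-Z]{3}[ABCFGHJLPT][A-Z][0-9]{4}[A-Z]", pan) is not None
```

Format checks of this kind only reject malformed identifiers; they say nothing about whether the identifier was actually issued.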
| ### 3.4 Fraud Ring Detector (`fraud_ring.py`), *new headline feature* | |
| Single-document forensics misses **organised** fraud. This module fixes that. | |
| **Pipeline:** | |
| 1. **Extract identity signals** from each applicant: name, DOB, address, phone, IFSC, account, employer (OCR + regex). | |
| 2. **Build a weighted similarity graph** (NetworkX). Edge weight is a sum of per-signal match weights, with field-specific fraud significance: | |
| - account number 0.25 (highest: same account = same person) | |
| - DOB 0.15, address 0.20, phone 0.20 | |
| - name 0.10, IFSC 0.05, employer 0.05 | |
| 3. **Detect rings** = connected components above a configurable similarity threshold, size ≥ 3. | |
| 4. **Score each ring**: CRITICAL (≥5 applicants), HIGH (3-4), MEDIUM (2). | |
| 5. **Visualise** as an interactive force-directed graph; ring members rendered in red with thick edges. | |
| Banking impact: detects identity-recycling rings, address farms, mule-account networks, the patterns that cost banks ~₹3,000 crore/year (RBI Annual Report). | |
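Steps 2 to 4 of the pipeline can be sketched in plain Python. The field weights are the ones listed above. Everything else is an assumption for illustration: exact-match field comparison, the `threshold=0.35` default, the helper names, and a small union-find standing in for the NetworkX connected-components call that the module is described as using.

```python
from itertools import combinations

# Per-field match weights from the pipeline description (they sum to 1.0).
WEIGHTS = {"account": 0.25, "address": 0.20, "phone": 0.20, "dob": 0.15,
           "name": 0.10, "ifsc": 0.05, "employer": 0.05}

def similarity(a: dict, b: dict) -> float:
    """Edge weight: sum of weights for fields present and identical in both."""
    return sum(w for field, w in WEIGHTS.items()
               if a.get(field) is not None and a.get(field) == b.get(field))

def find_rings(applicants: dict, threshold: float = 0.35, min_size: int = 3) -> list:
    """Connected components over edges with similarity >= threshold."""
    parent = {key: key for key in applicants}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in combinations(applicants, 2):
        if similarity(applicants[u], applicants[v]) >= threshold:
            parent[find(u)] = find(v)

    groups = {}
    for key in applicants:
        groups.setdefault(find(key), set()).add(key)
    return [members for members in groups.values() if len(members) >= min_size]

def ring_severity(size: int) -> str:
    """Severity bands from step 4."""
    if size >= 5:
        return "CRITICAL"
    return "HIGH" if size >= 3 else "MEDIUM"

# Made-up applicants: A-B share account + phone, B-C share address + phone.
applicants = {
    "A": {"account": "111", "phone": "900", "address": "1 Park St"},
    "B": {"account": "111", "phone": "900", "address": "7 Lake Rd"},
    "C": {"account": "333", "phone": "900", "address": "7 Lake Rd"},
    "D": {"account": "444", "phone": "901", "address": "9 Hill Ln"},
}
rings = find_rings(applicants)
```

Here A and C are only weakly linked directly (phone alone, 0.20), yet they land in the same ring through B. That transitive grouping is what separates connected components from a stricter clique test.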
| ### 3.5 Tamper Forge Studio (`tampering.py`) | |
| Adversarial validation: live UI to apply copy-move, splice, text-edit, compression, metadata-strip, custom-region, or chained tampering operations to a clean sample, then immediately re-run detection. Visual before/after with bounding boxes, per-detector scorecard, ELA + noise heatmap overlays. Doubles as a continuous test harness for the forensics layer. | |
| ### 3.6 Provenance Ledger (`provenance.py`), *new compliance feature* | |
| Tamper-evident SHA-256 hash chain over every analysis: | |
| ``` | |
| record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash) | |
| ``` | |
| - Stored in SQLite (single file, zero-deploy) | |
| - `verify_chain()` walks every record in O(N) and pinpoints the first broken record | |
| - Satisfies RBI Master Direction on KYC (2016), Para 67 record-retention requirements | |
| - Downloadable as JSON for external auditors | |
| Conceptually a baby blockchain: append-only, hash-linked, mathematically verifiable. | |
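The hash-chain idea can be sketched with `hashlib` and `sqlite3`. Only the `record_hash` formula and the name `verify_chain()` come from the description above. The table schema, the all-zero genesis `prev_hash`, the four-decimal score formatting, and the other helper names are assumptions for illustration.

```python
import hashlib
import sqlite3

GENESIS = "0" * 64  # assumed prev_hash for the very first record

def record_hash(timestamp, doc_sha256, risk_band, risk_score, prev_hash):
    """SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)."""
    # Fixed-precision formatting keeps the digest stable across float round-trips.
    fields = [timestamp, doc_sha256, risk_band, f"{risk_score:.4f}", prev_hash]
    return hashlib.sha256("|".join(fields).encode("utf-8")).hexdigest()

def open_ledger(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ledger ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, timestamp TEXT, doc_sha256 TEXT, "
        "risk_band TEXT, risk_score REAL, prev_hash TEXT, record_hash TEXT)"
    )
    return conn

def append_record(conn, timestamp, doc_sha256, risk_band, risk_score):
    """Append one analysis, linking it to the previous record's hash."""
    last = conn.execute(
        "SELECT record_hash FROM ledger ORDER BY id DESC LIMIT 1"
    ).fetchone()
    prev_hash = last[0] if last else GENESIS
    digest = record_hash(timestamp, doc_sha256, risk_band, risk_score, prev_hash)
    conn.execute(
        "INSERT INTO ledger (timestamp, doc_sha256, risk_band, risk_score, "
        "prev_hash, record_hash) VALUES (?, ?, ?, ?, ?, ?)",
        (timestamp, doc_sha256, risk_band, risk_score, prev_hash, digest),
    )
    conn.commit()
    return digest

def verify_chain(conn):
    """Walk the ledger once; return the id of the first broken record, else None."""
    expected_prev = GENESIS
    rows = conn.execute(
        "SELECT id, timestamp, doc_sha256, risk_band, risk_score, prev_hash, "
        "record_hash FROM ledger ORDER BY id"
    )
    for rec_id, ts, doc, band, score, prev_hash, stored in rows:
        if prev_hash != expected_prev or record_hash(ts, doc, band, score, prev_hash) != stored:
            return rec_id
        expected_prev = stored
    return None

# Demo with made-up records in an in-memory ledger.
ledger = open_ledger()
for i, (band, score) in enumerate([("LOW", 0.12), ("HIGH", 0.61), ("CRITICAL", 0.88)]):
    doc_hash = hashlib.sha256(f"document-{i}".encode()).hexdigest()
    append_record(ledger, f"2026-01-0{i + 1}T10:00:00Z", doc_hash, band, score)
```

Editing any stored field changes the recomputed hash for that row, so `verify_chain()` reports its id. A chain of this kind detects edits but does not prevent them, and someone able to rewrite every later hash could still forge a consistent chain unless the head hash is anchored elsewhere.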
| ### 3.7 Audit report (`audit_report.py`) | |
| Bank-letterhead PDF with: | |
| - Metadata table (file, SHA-256, analysed timestamp) | |
| - Risk verdict box (colour-coded by band) | |
| - Sub-score table with ASCII bars | |
| - Evidence bullets | |
| - Embedded forensic heatmaps | |
| ### 3.8 Dashboard (`app.py`) | |
| Six-tab Streamlit UI. Sample documents bundled (`sample_data/`) for instant demo. | |
| --- | |
| ## 4. Data assets | |
| | Asset | Purpose | Volume | | |
| |--------------------------------------|--------------------------------------|--------| | |
| | AgamiAI Indian Bank Statements (HF) | Real Indian bank statement PDFs | 217 | | |
| | IDRBT Cheque Image Dataset | Cheque images, Indian banking format | 112 | | |
| | CASIA v2 | CNN training (forged/authentic) | ~12 k | | |
| | `sample_data/` bundled | Demo fixtures | 26 | | |
| --- | |
| ## 5. Ensemble fusion logic | |
| ``` | |
| sub_scores = {ela, copy_move, noise, exif, text_rules, ai_generated} | |
| weights = {ela: 0.20, copy_move: 0.25, noise: 0.20, exif: 0.15, text_rules: 0.20} | |
| # ai_generated is a separate overlay, not in base weights | |
| base_score = sum(weights[k] * sub_scores[k] for k in weights) | |
| score_with_rf = 0.5 * base_score + 0.5 * rf.predict_proba(features) | |
| w = clamp(cnn.val_auc, 0.4, 0.7)  # CNN weight tracks validation AUC | |
| score_with_cnn = (1 - w) * score_with_rf + w * cnn.predict(image) | |
| # AI-gen overlay contributes at most 0.1 * 1.0 * 2.0 = 0.20 (the +20% cap) | |
| final_score = 0.9 * score_with_cnn + 0.1 * ai_gen_prob * 2.0 | |
| ``` | |
| Band mapping: `0-0.30 LOW · 0.30-0.50 MEDIUM · 0.50-0.75 HIGH · 0.75+ CRITICAL` | |
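The fusion formula and band mapping can be written as a small runnable function. This sketch takes the model outputs as plain probabilities instead of calling the RF or CNN. Two details are assumptions the document does not state: the final clamp to `[0, 1]` (the raw formula can reach 1.1) and treating each band's lower bound as inclusive.

```python
# Base weights from the fusion description; ai_generated is applied as an overlay.
WEIGHTS = {"ela": 0.20, "copy_move": 0.25, "noise": 0.20,
           "exif": 0.15, "text_rules": 0.20}

def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))

def fuse(sub_scores: dict, rf_prob: float, cnn_prob: float,
         cnn_val_auc: float, ai_gen_prob: float) -> float:
    """Weighted blend -> RF overlay -> CNN overlay -> AI-gen overlay."""
    base = sum(WEIGHTS[k] * sub_scores[k] for k in WEIGHTS)
    with_rf = 0.5 * base + 0.5 * rf_prob
    w = clamp(cnn_val_auc, 0.4, 0.7)           # CNN weight tracks validation AUC
    with_cnn = (1 - w) * with_rf + w * cnn_prob
    final = 0.9 * with_cnn + 0.1 * ai_gen_prob * 2.0  # overlay adds at most 0.20
    return clamp(final, 0.0, 1.0)              # assumption: keep the score in [0, 1]

def band(score: float) -> str:
    """Map a score to a risk band (lower bounds assumed inclusive)."""
    if score >= 0.75:
        return "CRITICAL"
    if score >= 0.50:
        return "HIGH"
    if score >= 0.30:
        return "MEDIUM"
    return "LOW"

# Made-up inputs: every detector and model sits at 0.5, no AI-gen signal.
mid_scores = {k: 0.5 for k in WEIGHTS}
example = fuse(mid_scores, rf_prob=0.5, cnn_prob=0.5, cnn_val_auc=0.9, ai_gen_prob=0.0)
```

With every input at 0.5 and no AI-gen signal, the overlays leave the blend at 0.5 and the final 0.9 factor brings it to about 0.45, which maps to MEDIUM.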
| --- | |
| ## 6. Roadmap: what's next | |
| The architecture is **wired** for the extensions below; they ship in subsequent rounds. | |
| | Capability | Status | Notes | | |
| |---------------------------------------|---------|----------------------------------------| | |
| | FastAPI gateway + webhook alerts | planned | Push to LOS / CRM on HIGH or CRITICAL | | |
| | Federated learning across banks | planned | Flower (`flwr`); no raw data leaves | | |
| | LLM-based document reasoning | planned | Local Phi-3 / Gemma over OCR text | | |
| | Real-time drift monitoring | planned | Track per-detector confidence over time| | |
| | Kubernetes deployment | planned | For multi-tenant bank hosting | | |
| | Multilingual OCR (Hindi / Bengali) | planned | Tesseract + IndicOCR models | | |
| --- | |
| ## 7. Mapping to Round-1 submission | |
| | Round-1 idea | Round-2 realisation | | |
| |------------------------------------|-------------------------------------------------------| | |
| | Image forensics (ELA, copy-move, noise, EXIF) | `forensics.py`: fully implemented + **AI-gen FFT detector** | | |
| | PDF structural auditing | `forensics.pdf_structural_audit` + `pdf_font_audit` | | |
| | OCR + financial validation | `forensics.text_rule_checks` + IFSC/PAN/Aadhaar full validators | | |
| | Random Forest risk scoring | `forensics.predict_with_model`: trained on 11-d feature set | | |
| | Real-time underwriter dashboard | Streamlit app, 6 tabs, bank-letterhead PDF output | | |
| | CNN with MobileNetV2 (future) | **Delivered** β fine-tuned on CASIA v2 | | |
| | LLM reasoning (future) | Roadmap (see § 6) | | |
| | API deployment (future) | Roadmap: FastAPI gateway scaffolded | | |
| | **NEW: Fraud Ring Network** | Cross-applicant graph + clique discovery | | |
| | **NEW: Provenance ledger** | SHA-256 hash chain, RBI Para 67 compliant | | |
| | **NEW: Tamper Forge Studio** | Adversarial-validation harness | | |
| The Round-1 pillars remain the visible centre of the system. The new | |
| pillars extend each axis without breaking the original framing: | |
| forensics gets AI-gen detection, scoring gets a cross-applicant view, | |
| the dashboard gets a tamper-evident audit trail. | |
| --- | |
| ## 8. Repository layout | |
| ``` | |
| . | |
| ├── app.py                  Streamlit dashboard (6 tabs) | |
| ├── forensics.py            Core analysis pipeline | |
| ├── ai_detector.py          AI-generated content detector (FFT) | |
| ├── fraud_ring.py           Cross-applicant graph + clique detection | |
| ├── provenance.py           Tamper-evident SHA-256 hash chain | |
| ├── compliance.py           IFSC / PAN / Aadhaar / PII redaction | |
| ├── tampering.py            Adversarial harness for Forge Studio | |
| ├── audit_report.py         Bank-letterhead PDF builder | |
| ├── docsentry_master.ipynb  Notebook source of truth | |
| ├── models/                 RF + CNN model files | |
| ├── sample_data/            26 demo documents | |
| ├── requirements.txt        Python dependencies | |
| ├── packages.txt            apt-get packages (HF Spaces) | |
| ├── README.md               Reference + install guide | |
| ├── ARCHITECTURE.md         This document | |
| └── LICENSE                 MIT + third-party notices | |
| ``` | |
| --- | |
| *This architecture document is the technical reference for DocSentry Round 2. | |
| It accompanies the live demo at https://huggingface.co/spaces/SpandanM110/DocSentry | |
| and the source code at https://github.com/SpandanM110/Doc-Sentry.* | |