DocSentry — System Architecture

Real-time document anomaly detection for Indian bank underwriting.

DocSentry is the operational realisation of the Round-1 submission idea: catch tampered, forged, and AI-generated documents at the moment of upload, score them on a calibrated risk scale, and hand the underwriter a defensible audit trail. Round-2 turns that idea into a robust, production-grade platform.

1. Architectural principles

| Principle | What it means in DocSentry |
| --- | --- |
| Defence in depth | Six independent detection layers (rule, image, PDF, OCR, ML, AI-generated). No single bypass defeats the system. |
| Explainability first | Every verdict ships with sub-scores, evidence bullets, and visual heatmaps. Black-box outputs are unacceptable in regulated finance. |
| Tamper-evident provenance | Every analysis is appended to a SHA-256 hash chain. Retroactive edits are mathematically detectable. |
| Portfolio-level vision | Single-document forensics is necessary but insufficient. Real fraud is organised; the system reasons across applicants. |
| Zero data egress | All inference runs locally. No applicant PII leaves the bank's perimeter. |
| RBI-aligned output | Compliance reports follow Master Direction on KYC formatting so they can be filed directly. |

2. Layered architecture

                    ┌─────────────────────────────────────────┐
                    │     PRESENTATION (Streamlit / future PWA)│
                    │                                          │
                    │  Tab 1  Single-doc analysis              │
                    │  Tab 2  Cross-document KYC               │
                    │  Tab 3  Batch underwriter audit          │
                    │  Tab 4  RBI compliance + Provenance      │
                    │  Tab 5  Live Tamper Forge Studio         │
                    │  Tab 6  Fraud Ring Network               │
                    └────────────────────┬─────────────────────┘
                                         │
                    ┌────────────────────▼─────────────────────┐
                    │   API GATEWAY (planned FastAPI front)    │
                    │   /analyse  /verify  /forge-test         │
                    │   /compliance  /batch  /webhook          │
                    └────────────────────┬─────────────────────┘
                                         │
   ┌─────────────────────────────────────┼───────────────────────────────┐
   ▼                                     ▼                               ▼
┌─────────────────┐         ┌───────────────────────┐         ┌─────────────────────┐
│  INGESTION      │         │   FORENSICS CORE      │         │  COMPLIANCE CORE    │
│  - Direct upload│         │  Rule layer (ELA,     │         │  RBI IFSC lookup    │
│  - Watch folder │         │   copy-move, noise,   │         │  PAN entity check   │
│  - PDF / image  │         │   EXIF, PDF struct,   │         │  Aadhaar Verhoeff   │
│  - Future Kafka │         │   OCR rules)          │         │  PII redaction      │
└────────┬────────┘         │  RF classifier (11-d) │         │  DPDP-aligned       │
         │                  │  CNN (MobileNetV2)    │         └──────────┬──────────┘
         ▼                  │  AI-gen detector (FFT)│                    │
┌─────────────────┐         └───────────┬───────────┘                    │
│  PROVENANCE     │                     │                                │
│  SHA-256 chain  │                     ▼                                │
│  SQLite ledger  │           ┌───────────────────┐                      │
│  verify_chain() │           │  ENSEMBLE FUSION  │                      │
└────────┬────────┘           │  Weighted blend   │                      │
         │                    │  per sub-detector │                      │
         │                    └───────────┬───────┘                      │
         └──────────────────┬─────────────┴──────────────────────────────┘
                            ▼
              ┌─────────────────────────────────┐
              │   FRAUD RING DETECTOR            │
              │   NetworkX similarity graph      │
              │   Clique-based ring discovery    │
              │   Cross-applicant correlation    │
              └────────────────┬─────────────────┘
                               ▼
              ┌─────────────────────────────────┐
              │   RISK ORCHESTRATOR              │
              │   Score -> band -> action        │
              │   (LOW / MEDIUM / HIGH / CRIT)   │
              └────────────────┬─────────────────┘
                               ▼
              ┌─────────────────────────────────┐
              │   OUTPUT LAYER                   │
              │   - Streamlit dashboard          │
              │   - Bank-letterhead audit PDF    │
              │   - RBI compliance pack PDF      │
              │   - Audit JSON                   │
              │   - Webhook alerts (planned)     │
              │   - Provenance ledger entry      │
              └─────────────────────────────────┘

3. Component reference

3.1 Forensics core (`forensics.py`)

Six independent detectors blended via analyse_document(path):

| Detector | Method | Sub-score key |
| --- | --- | --- |
| Error Level Analysis | JPEG re-save diff | `ela` |
| Copy-move | ORB keypoints + cross-matching | `copy_move` |
| Noise inconsistency | Per-block Laplacian variance | `noise` |
| EXIF audit | Metadata + software-tag fingerprint | `exif` |
| OCR + text rules | Tesseract + IFSC/PAN/date/amount regex | `text_rules` |
| AI-generated | Radial FFT spectral analysis (new) | `ai_generated` |

Optional ML overlays:

- Random Forest (`predict_with_model`): 11 features (4 forensics + 4 GLCM texture + 3 colour entropy).
- MobileNetV2 CNN (`predict_with_cnn`): fine-tuned on CASIA v2; weight grows with measured validation AUC.

Final score: risk_score = weighted_blend(sub_scores) -> RF overlay -> CNN overlay -> AI-gen overlay.

3.2 AI-generated detector (`ai_detector.py`)

The 2026 threat model is Sora / Midjourney / Stable Diffusion outputs, not Photoshop. This module catches them in the frequency domain:

| Signal | Detection |
| --- | --- |
| High-frequency suppression | Ratio of low- to high-frequency FFT energy. |
| Periodic spectral peaks | Spike count in high-frequency band. |
| JPEG quantization absence | PIL `img.quantization` table inspection. |

The detector's probability is blended into the main risk score with its contribution capped at +0.20 (see § 5), so it never dominates the classical signals but still surfaces synthetic media.
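The high-frequency suppression signal can be illustrated with a small, dependency-free sketch. It is not the code in `ai_detector.py`: the function names, the `cutoff` parameter, and the naive DFT are assumptions made for illustration, and a real implementation would more plausibly use `numpy.fft.fft2` on the full-resolution image.

```python
import cmath
import math


def dft_1d(values):
    """Naive O(n^2) discrete Fourier transform of a 1-D sequence."""
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def dft_2d(image):
    """2-D DFT: transform every row, then every column."""
    rows = [dft_1d(row) for row in image]
    cols = [dft_1d(list(col)) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # back to [row][col] order


def low_to_high_energy_ratio(image, cutoff=0.5):
    """Ratio of low- to high-frequency spectral energy, DC term excluded.

    `cutoff` (hypothetical) is the fraction of the Nyquist frequency that
    separates the low band from the high band.
    """
    h, w = len(image), len(image[0])
    spectrum = dft_2d(image)
    low = high = 0.0
    for v in range(h):
        for u in range(w):
            if u == 0 and v == 0:
                continue  # skip the mean-brightness term
            # Fold indices so distance is measured from DC, in Nyquist units.
            fv = min(v, h - v) / (h / 2)
            fu = min(u, w - u) / (w / 2)
            energy = abs(spectrum[v][u]) ** 2
            if math.hypot(fu, fv) <= cutoff:
                low += energy
            else:
                high += energy
    return low / (high + 1e-12)  # epsilon avoids division by zero


N = 16
# Smooth horizontal gradient wave: almost all energy at low frequency.
smooth = [[128 + 100 * math.cos(2 * math.pi * x / N) for x in range(N)] for _ in range(N)]
# Pixel-level checkerboard: all non-DC energy at the highest frequency.
checker = [[128 + 100 * (-1) ** (x + y) for x in range(N)] for y in range(N)]

smooth_ratio = low_to_high_energy_ratio(smooth)
checker_ratio = low_to_high_energy_ratio(checker)
```

Under these assumptions a larger ratio means less high-frequency detail. Any threshold that turns the ratio into an AI-generated probability would need calibration on real documents.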

3.3 Cross-document KYC (`compliance.py`)

- IFSC validation against 36 RBI bank codes
- PAN entity-type character + Luhn-like structural check
- Aadhaar UIDAI Verhoeff checksum (rejects 0/1 prefix)
- PII redaction via PyMuPDF text-bbox overlays
- RBI-format compliance audit PDF (5 sections, ReportLab Platypus)
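As a rough illustration of the Aadhaar and PAN checks listed above, the sketch below uses the standard Verhoeff tables and a simple PAN pattern. The function names are hypothetical, the set of PAN entity-type letters is the commonly documented one rather than something taken from `compliance.py`, and only the PAN structure is checked here.

```python
import re

# Standard Verhoeff tables: D is the dihedral-group multiplication table,
# P the position-dependent permutation, INV the inverse table.
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6], [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8], [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2], [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2], [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0], [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5], [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
_INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_is_valid(number: str) -> bool:
    """True if the digit string (check digit last) passes Verhoeff."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0


def verhoeff_check_digit(payload: str) -> int:
    """Check digit to append to `payload` so the result validates."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = _D[c][_P[(i + 1) % 8][int(ch)]]
    return _INV[c]


def is_valid_aadhaar(value: str) -> bool:
    """12 digits, first digit 2-9, Verhoeff checksum correct."""
    digits = re.sub(r"[\s-]", "", value)
    if not re.fullmatch(r"[2-9]\d{11}", digits):
        return False
    return verhoeff_is_valid(digits)


# Assumed entity-type letters for the 4th PAN character.
PAN_ENTITY_TYPES = set("PCHFATBLJG")


def is_well_formed_pan(value: str) -> bool:
    """Structure only: 5 letters, 4 digits, 1 letter, known entity type."""
    pan = value.strip().upper()
    return bool(re.fullmatch(r"[A-Z]{5}\d{4}[A-Z]", pan)) and pan[3] in PAN_ENTITY_TYPES


# Build a synthetic (not real) 12-digit number that passes the checksum.
payload = "23456789012"
synthetic_aadhaar = payload + str(verhoeff_check_digit(payload))
```

A passing checksum only shows that the number is well formed; it says nothing about whether the identity exists.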

3.4 Fraud Ring Detector (`fraud_ring.py`) — new headline feature

Single-document forensics misses organised fraud. This module fixes that.

Pipeline:

1. Extract identity signals from each applicant: name, DOB, address, phone, IFSC, account, employer (OCR + regex).
2. Build a weighted similarity graph (NetworkX). Edge weight is the sum of per-signal match weights, each reflecting the field's fraud significance (the weights sum to 1.00):
   - account number 0.25 (highest: same account = same person)
   - DOB 0.15, address 0.20, phone 0.20
   - name 0.10, IFSC 0.05, employer 0.05
3. Detect rings: keep only edges above a configurable similarity threshold, then take the connected components of size ≥ 3.
4. Score each group by size: CRITICAL (≥5 applicants), HIGH (3-4), MEDIUM (2, i.e. a linked pair that falls below the three-member ring cutoff).
5. Visualise as an interactive force-directed graph; ring members rendered in red with thick edges.
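The pipeline above can be sketched in a few lines. The source builds its graph with NetworkX; to stay dependency-free this sketch swaps in a small union-find to compute the connected components. The exact-match comparison, the `threshold=0.30` default, the field names, and the function names are all assumptions for illustration, not the behaviour of `fraud_ring.py`.

```python
from itertools import combinations

# Per-field match weights as listed in the pipeline description.
FIELD_WEIGHTS = {
    "account": 0.25, "address": 0.20, "phone": 0.20, "dob": 0.15,
    "name": 0.10, "ifsc": 0.05, "employer": 0.05,
}


def similarity(a: dict, b: dict) -> float:
    """Sum the weights of fields that are present and identical in both."""
    return sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if a.get(field) is not None and a.get(field) == b.get(field)
    )


def find_rings(applicants: dict, threshold: float = 0.30, min_size: int = 3):
    """Connected components over edges whose weight meets the threshold."""
    parent = {key: key for key in applicants}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in combinations(applicants, 2):
        if similarity(applicants[u], applicants[v]) >= threshold:
            parent[find(u)] = find(v)  # merge the two components

    groups = {}
    for key in applicants:
        groups.setdefault(find(key), set()).add(key)
    return [members for members in groups.values() if len(members) >= min_size]


def ring_severity(size: int) -> str:
    """Size-based severity bands from the scoring step."""
    if size >= 5:
        return "CRITICAL"
    if size >= 3:
        return "HIGH"
    return "MEDIUM"


# Synthetic applicants: A1-A2 share account and phone, A2-A3 share address
# and DOB, A4 shares only an IFSC code with A1 (too weak to link).
applicants = {
    "A1": {"account": "111", "phone": "900", "address": "4 Park St",
           "dob": "1990-01-01", "ifsc": "HDFC0000001"},
    "A2": {"account": "111", "phone": "900", "address": "12 MG Road",
           "dob": "1985-05-05"},
    "A3": {"account": "333", "phone": "903", "address": "12 MG Road",
           "dob": "1985-05-05"},
    "A4": {"account": "444", "phone": "904", "address": "9 Lake View",
           "dob": "1970-07-07", "ifsc": "HDFC0000001"},
}
rings = find_rings(applicants)
```

Note that A1 and A3 share nothing directly; they end up in the same ring only through A2, which is the property that connected components give and that pairwise comparison alone would miss.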

Banking impact: detects identity-recycling rings, address farms, mule-account networks — the patterns that cost banks ~₹3,000 crore/year (RBI Annual Report).

3.5 Tamper Forge Studio (`tampering.py`)

Adversarial validation: live UI to apply copy-move, splice, text-edit, compression, metadata-strip, custom-region, or chained tampering operations to a clean sample, then immediately re-run detection. Visual before/after with bounding boxes, per-detector scorecard, ELA + noise heatmap overlays. Doubles as a continuous test harness for the forensics layer.

3.6 Provenance Ledger (`provenance.py`) — new compliance feature

Tamper-evident SHA-256 hash chain over every analysis:

record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)

- Stored in SQLite (single file, zero-deploy)
- `verify_chain()` walks every record in O(N) and pinpoints the first broken record
- Satisfies RBI Master Direction on KYC (2016), Para 67 record-retention requirements
- Downloadable as JSON for external auditors

Conceptually a baby blockchain: append-only, hash-linked, mathematically verifiable.
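A minimal sketch of such a hash chain, using only `hashlib` and `sqlite3`, might look like the following. The table schema, the all-zero genesis hash, the four-decimal score formatting, and the function signatures are assumptions; only the hashed fields and their order come from the formula above.

```python
import hashlib
import sqlite3

GENESIS = "0" * 64  # assumed prev_hash for the very first record


def record_hash(timestamp, doc_sha256, risk_band, risk_score, prev_hash):
    """SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)."""
    payload = "|".join(
        [timestamp, doc_sha256, risk_band, f"{risk_score:.4f}", prev_hash]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def open_ledger(path=":memory:"):
    """Open (or create) the ledger; schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ledger (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               timestamp TEXT NOT NULL,
               doc_sha256 TEXT NOT NULL,
               risk_band TEXT NOT NULL,
               risk_score REAL NOT NULL,
               prev_hash TEXT NOT NULL,
               record_hash TEXT NOT NULL)"""
    )
    return conn


def append_record(conn, timestamp, doc_sha256, risk_band, risk_score):
    """Append one analysis, linking it to the previous record's hash."""
    row = conn.execute(
        "SELECT record_hash FROM ledger ORDER BY id DESC LIMIT 1"
    ).fetchone()
    prev_hash = row[0] if row else GENESIS
    digest = record_hash(timestamp, doc_sha256, risk_band, risk_score, prev_hash)
    conn.execute(
        "INSERT INTO ledger (timestamp, doc_sha256, risk_band, risk_score,"
        " prev_hash, record_hash) VALUES (?, ?, ?, ?, ?, ?)",
        (timestamp, doc_sha256, risk_band, risk_score, prev_hash, digest),
    )
    conn.commit()
    return digest


def verify_chain(conn):
    """Walk the ledger once; return the id of the first broken record or None."""
    expected_prev = GENESIS
    rows = conn.execute(
        "SELECT id, timestamp, doc_sha256, risk_band, risk_score,"
        " prev_hash, record_hash FROM ledger ORDER BY id"
    )
    for rid, ts, doc, band, score, prev_hash, stored in rows:
        recomputed = record_hash(ts, doc, band, score, prev_hash)
        if prev_hash != expected_prev or recomputed != stored:
            return rid
        expected_prev = stored
    return None


ledger = open_ledger()
for i, (band, score) in enumerate([("LOW", 0.12), ("HIGH", 0.61), ("CRITICAL", 0.82)]):
    doc_digest = hashlib.sha256(f"document-{i}".encode()).hexdigest()
    append_record(ledger, f"2026-01-0{i + 1}T10:00:00Z", doc_digest, band, score)
```

Editing any stored field changes the recomputed hash for that row, and rewriting the row's hash to match would break the `prev_hash` link of the next row, so a single pass is enough to locate the first inconsistency.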

3.7 Audit report (`audit_report.py`)

Bank-letterhead PDF with:

- Metadata table (file, SHA-256, analysed timestamp)
- Risk verdict box (colour-coded by band)
- Sub-score table with ASCII bars
- Evidence bullets
- Embedded forensic heatmaps

3.8 Dashboard (`app.py`)

Six-tab Streamlit UI. Sample documents bundled (sample_data/) for instant demo.

4. Data assets

| Asset | Purpose | Volume |
| --- | --- | --- |
| AgamiAI Indian Bank Statements (HF) | Real Indian bank statement PDFs | 217 |
| IDRBT Cheque Image Dataset | Cheque images, Indian banking format | 112 |
| CASIA v2 | CNN training (forged/authentic) | ~12 k |
| `sample_data/` bundled | Demo fixtures | 26 |

5. Ensemble fusion logic

sub_scores = {ela, copy_move, noise, exif, text_rules, ai_generated}
weights    = {ela: 0.20, copy_move: 0.25, noise: 0.20, exif: 0.15,
              text_rules: 0.20}
              # ai_generated is a separate overlay, not in base weights

base_score     = sum(weights[k] * sub_scores[k] for k in weights)
score_with_rf  = 0.5 * base_score + 0.5 * rf.predict_proba(features)
w              = clamp(cnn.val_auc, 0.4, 0.7)
score_with_cnn = (1 - w) * score_with_rf + w * cnn.predict(image)
final_score    = 0.9 * score_with_cnn + 0.1 * ai_gen_prob * 2.0
                 # AI-gen contribution capped at +0.20

Band mapping: 0-0.30 LOW · 0.30-0.50 MEDIUM · 0.50-0.75 HIGH · 0.75+ CRITICAL
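The fusion steps and band mapping can be written as a short runnable sketch. The weights and formulas are the ones given above; the function names, the final clamp to [0, 1], and the choice to treat each band's lower bound as inclusive are assumptions, since the source does not say how boundary values or scores above 1.0 are handled.

```python
BASE_WEIGHTS = {
    "ela": 0.20, "copy_move": 0.25, "noise": 0.20,
    "exif": 0.15, "text_rules": 0.20,
}

# (inclusive lower bound, band), checked from highest to lowest.
BANDS = [(0.75, "CRITICAL"), (0.50, "HIGH"), (0.30, "MEDIUM"), (0.0, "LOW")]


def clamp(value, lower, upper):
    return max(lower, min(upper, value))


def fuse(sub_scores, rf_prob, cnn_prob, cnn_val_auc, ai_gen_prob):
    """Blend rule scores, RF overlay, CNN overlay, and AI-gen overlay."""
    base = sum(BASE_WEIGHTS[k] * sub_scores[k] for k in BASE_WEIGHTS)
    with_rf = 0.5 * base + 0.5 * rf_prob
    w = clamp(cnn_val_auc, 0.4, 0.7)  # CNN weight follows validation AUC
    with_cnn = (1 - w) * with_rf + w * cnn_prob
    final = 0.9 * with_cnn + 0.1 * ai_gen_prob * 2.0  # AI-gen adds at most 0.20
    return clamp(final, 0.0, 1.0)  # assumption: keep the score in [0, 1]


def risk_band(score):
    """Map a fused score to LOW / MEDIUM / HIGH / CRITICAL."""
    for lower, name in BANDS:
        if score >= lower:
            return name
    return "LOW"


example_scores = {
    "ela": 0.8, "copy_move": 0.6, "noise": 0.4, "exif": 0.2, "text_rules": 0.5,
}
example_final = fuse(example_scores, rf_prob=0.7, cnn_prob=0.8,
                     cnn_val_auc=0.85, ai_gen_prob=0.1)
example_band = risk_band(example_final)
```

For the example inputs the base score is 0.52, the RF overlay gives 0.61, the CNN overlay (with w clamped to 0.7) gives 0.743, and the final score is 0.6887, which falls in the HIGH band.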

6. Roadmap — what's next

The architecture above is wired for these extensions; they ship in subsequent rounds.

| Capability | Status | Notes |
| --- | --- | --- |
| FastAPI gateway + webhook alerts | planned | Push to LOS / CRM on HIGH or CRITICAL |
| Federated learning across banks | planned | Flower (`flwr`); no raw data leaves |
| LLM-based document reasoning | planned | Local Phi-3 / Gemma over OCR text |
| Real-time drift monitoring | planned | Track per-detector confidence over time |
| Kubernetes deployment | planned | For multi-tenant bank hosting |
| Multilingual OCR (Hindi / Bengali) | planned | Tesseract + IndicOCR models |

7. Mapping to Round-1 submission

| Round-1 idea | Round-2 realisation |
| --- | --- |
| Image forensics (ELA, copy-move, noise, EXIF) | `forensics.py`: fully implemented + AI-gen FFT detector |
| PDF structural auditing | `forensics.pdf_structural_audit` + `pdf_font_audit` |
| OCR + financial validation | `forensics.text_rule_checks` + IFSC/PAN/Aadhaar full validators |
| Random Forest risk scoring | `forensics.predict_with_model`: trained on 11-d feature set |
| Real-time underwriter dashboard | Streamlit app, 6 tabs, bank-letterhead PDF output |
| CNN with MobileNetV2 (future) | Delivered: fine-tuned on CASIA v2 |
| LLM reasoning (future) | Roadmap (see § 6) |
| API deployment (future) | Roadmap: FastAPI gateway scaffolded |
| NEW: Fraud Ring Network | Cross-applicant graph + clique discovery |
| NEW: Provenance ledger | SHA-256 hash chain, RBI Para 67 compliant |
| NEW: Tamper Forge Studio | Adversarial-validation harness |

The Round-1 pillars remain the visible centre of the system. The new pillars extend each axis without breaking the original framing: forensics gets AI-gen detection, scoring gets a cross-applicant view, the dashboard gets a tamper-evident audit trail.

8. Repository layout

.
├── app.py                  Streamlit dashboard (6 tabs)
├── forensics.py            Core analysis pipeline
├── ai_detector.py          AI-generated content detector (FFT)
├── fraud_ring.py           Cross-applicant graph + clique detection
├── provenance.py           Tamper-evident SHA-256 hash chain
├── compliance.py           IFSC / PAN / Aadhaar / PII redaction
├── tampering.py            Adversarial harness for Forge Studio
├── audit_report.py         Bank-letterhead PDF builder
├── docsentry_master.ipynb  Notebook source of truth
├── models/                 RF + CNN model files
├── sample_data/            26 demo documents
├── requirements.txt        Python dependencies
├── packages.txt            apt-get packages (HF Spaces)
├── README.md               Reference + install guide
├── ARCHITECTURE.md         This document
└── LICENSE                 MIT + third-party notices

This architecture document is the technical reference for DocSentry Round 2. It accompanies the live demo at https://huggingface.co/spaces/SpandanM110/DocSentry and the source code at https://github.com/SpandanM110/Doc-Sentry.
