DocSentry: System Architecture

Real-time document anomaly detection for Indian bank underwriting.

DocSentry is the operational realisation of the Round-1 submission idea: catch tampered, forged, and AI-generated documents at the moment of upload, score them on a calibrated risk scale, and hand the underwriter a defensible audit trail. Round-2 turns that idea into a robust, production-grade platform.


1. Architectural principles

| Principle | What it means in DocSentry |
| --- | --- |
| Defence in depth | Six independent detection layers (rule, image, PDF, OCR, ML, AI-generated). No single bypass defeats the system. |
| Explainability first | Every verdict ships with sub-scores, evidence bullets, and visual heatmaps. Black-box outputs are unacceptable in regulated finance. |
| Tamper-evident provenance | Every analysis is appended to a SHA-256 hash chain. Retroactive edits are mathematically detectable. |
| Portfolio-level vision | Single-document forensics is necessary but insufficient. Real fraud is organised; the system reasons across applicants. |
| Zero data egress | All inference runs locally. No applicant PII leaves the bank's perimeter. |
| RBI-aligned output | Compliance reports follow Master Direction on KYC formatting so they can be filed directly. |

2. Layered architecture

                    ┌──────────────────────────────────────────┐
                    │     PRESENTATION (Streamlit / future PWA)│
                    │                                          │
                    │  Tab 1  Single-doc analysis              │
                    │  Tab 2  Cross-document KYC               │
                    │  Tab 3  Batch underwriter audit          │
                    │  Tab 4  RBI compliance + Provenance      │
                    │  Tab 5  Live Tamper Forge Studio         │
                    │  Tab 6  Fraud Ring Network               │
                    └────────────────────┬─────────────────────┘
                                         │
                    ┌────────────────────▼─────────────────────┐
                    │   API GATEWAY (planned FastAPI front)    │
                    │   /analyse  /verify  /forge-test         │
                    │   /compliance  /batch  /webhook          │
                    └────────────────────┬─────────────────────┘
                                         │
   ┌─────────────────────────────────────┼───────────────────────────────┐
   ▼                                     ▼                               ▼
┌─────────────────┐         ┌───────────────────────┐         ┌─────────────────────┐
│  INGESTION      │         │   FORENSICS CORE      │         │  COMPLIANCE CORE    │
│  - Direct upload│         │  Rule layer (ELA,     │         │  RBI IFSC lookup    │
│  - Watch folder │         │   copy-move, noise,   │         │  PAN entity check   │
│  - PDF / image  │         │   EXIF, PDF struct,   │         │  Aadhaar Verhoeff   │
│  - Future Kafka │         │   OCR rules)          │         │  PII redaction      │
└────────┬────────┘         │  RF classifier (11-d) │         │  DPDP-aligned       │
         │                  │  CNN (MobileNetV2)    │         └──────────┬──────────┘
         ▼                  │  AI-gen detector (FFT)│                    │
┌─────────────────┐         └───────────┬───────────┘                    │
│  PROVENANCE     │                     │                                │
│  SHA-256 chain  │                     ▼                                │
│  SQLite ledger  │           ┌───────────────────┐                      │
│  verify_chain() │           │  ENSEMBLE FUSION  │                      │
└────────┬────────┘           │  Weighted blend   │                      │
         │                    │  per sub-detector │                      │
         │                    └─────────┬─────────┘                      │
         └──────────────────┬───────────┴────────────────────────────────┘
                            ▼
              ┌──────────────────────────────────┐
              │   FRAUD RING DETECTOR            │
              │   NetworkX similarity graph      │
              │   Clique-based ring discovery    │
              │   Cross-applicant correlation    │
              └────────────────┬─────────────────┘
                               ▼
              ┌──────────────────────────────────┐
              │   RISK ORCHESTRATOR              │
              │   Score -> band -> action        │
              │   (LOW / MEDIUM / HIGH / CRIT)   │
              └────────────────┬─────────────────┘
                               ▼
              ┌──────────────────────────────────┐
              │   OUTPUT LAYER                   │
              │   - Streamlit dashboard          │
              │   - Bank-letterhead audit PDF    │
              │   - RBI compliance pack PDF      │
              │   - Audit JSON                   │
              │   - Webhook alerts (planned)     │
              │   - Provenance ledger entry      │
              └──────────────────────────────────┘

3. Component reference

3.1 Forensics core (forensics.py)

Six independent detectors blended via analyse_document(path):

| Detector | Method | Sub-score key |
| --- | --- | --- |
| Error Level Analysis | JPEG re-save diff | ela |
| Copy-move | ORB keypoints + cross-matching | copy_move |
| Noise inconsistency | Per-block Laplacian variance | noise |
| EXIF audit | Metadata + software-tag fingerprint | exif |
| OCR + text rules | Tesseract + IFSC/PAN/date/amount regex | text_rules |
| AI-generated | Radial FFT spectral analysis (new) | ai_generated |
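
To make one row of the table concrete, here is a minimal sketch of the per-block Laplacian variance idea behind the noise detector. It is an illustration under stated assumptions, not the code in forensics.py: the function names, the 8-pixel block size, and the final "spread over median" score are all hypothetical, and plain lists stand in for image arrays so the example has no dependencies.

```python
from statistics import pvariance


def laplacian(gray):
    """4-neighbour Laplacian of a greyscale image given as a list of rows."""
    h, w = len(gray), len(gray[0])
    return [
        [gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
         - 4 * gray[y][x]
         for x in range(1, w - 1)]
        for y in range(1, h - 1)
    ]


def block_noise_map(gray, block=8):
    """Variance of the Laplacian inside each non-overlapping block."""
    lap = laplacian(gray)
    h, w = len(lap), len(lap[0])
    scores = {}
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            values = [lap[y][x]
                      for y in range(by, by + block)
                      for x in range(bx, bx + block)]
            scores[(by, bx)] = pvariance(values)
    return scores


def noise_inconsistency(gray, block=8):
    """Hypothetical score: spread of block variances over their (upper) median.

    A pasted or repainted region usually carries a different noise level from
    the rest of the scan, which widens this spread.
    """
    variances = sorted(block_noise_map(gray, block).values())
    median = variances[len(variances) // 2]
    return (variances[-1] - variances[0]) / (median + 1e-9)
```

A uniformly noisy scan yields similar variances in every block, while a flat patch pasted into one corner drives that block's variance to zero and raises the score.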

Optional ML overlays:

  • Random Forest (predict_with_model): 11 features (4 forensics + 4 GLCM texture + 3 colour entropy).
  • MobileNetV2 CNN (predict_with_cnn): fine-tuned on CASIA v2; weight grows with measured validation AUC.

Final score: risk_score = weighted_blend(sub_scores) -> RF overlay -> CNN overlay -> AI-gen overlay.

3.2 AI-generated detector (ai_detector.py)

The 2026 threat model is Sora / Midjourney / Stable Diffusion outputs, not Photoshop. This module catches them in the frequency domain:

| Signal | Detection |
| --- | --- |
| High-frequency suppression | Ratio of low- to high-frequency FFT energy. |
| Periodic spectral peaks | Spike count in high-frequency band. |
| JPEG quantization absence | PIL img.quantization table inspection. |

The detector's output is blended into the main risk score as an overlay whose contribution is capped at +20%, so it never dominates the classical signals while still surfacing synthetic media.
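
The first signal, the ratio of low- to high-frequency energy, can be sketched as follows. This is a hedged illustration rather than the contents of ai_detector.py: the function names and the 0.25 cycles-per-pixel cutoff are assumptions, and a naive pure-Python DFT is used only to keep the example dependency-free (a real implementation would more likely call numpy.fft.fft2 on the full image).

```python
import cmath
import math


def dft2(image):
    """Naive 2-D DFT (rows, then columns) of a square list-of-lists image."""
    n = len(image)
    twiddle = [[cmath.exp(-2j * math.pi * k * t / n) for t in range(n)]
               for k in range(n)]
    rows = [[sum(row[t] * twiddle[k][t] for t in range(n)) for k in range(n)]
            for row in image]
    return [[sum(rows[t][c] * twiddle[k][t] for t in range(n))
             for c in range(n)]
            for k in range(n)]


def low_to_high_energy_ratio(image, cutoff=0.25):
    """Spectral energy below the radial cutoff divided by energy above it.

    Frequencies are in cycles per pixel (0 to 0.5 per axis); the DC term is
    skipped so overall brightness does not affect the ratio. The cutoff value
    is an assumed placeholder, not a tuned threshold.
    """
    n = len(image)
    spectrum = dft2(image)
    low = high = 0.0
    for u in range(n):
        for v in range(n):
            if u == 0 and v == 0:
                continue
            fu = min(u, n - u) / n  # fold negative frequencies
            fv = min(v, n - v) / n
            energy = abs(spectrum[u][v]) ** 2
            if math.hypot(fu, fv) <= cutoff:
                low += energy
            else:
                high += energy
    return low / (high + 1e-12)
```

An unusually high ratio means high-frequency detail is suppressed relative to a camera or scanner capture, which is the pattern the table's first row describes.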

3.3 Cross-document KYC (compliance.py)

  • IFSC validation against 36 RBI bank codes
  • PAN entity-type character + Luhn-like structural check
  • Aadhaar UIDAI Verhoeff checksum (rejects 0/1 prefix)
  • PII redaction via PyMuPDF text-bbox overlays
  • RBI-format compliance audit PDF (5 sections, ReportLab Platypus)
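
Two of the checks above are compact enough to sketch. The Verhoeff tables below are the standard published ones; everything else (function names, the PAN_ENTITY mapping as a dictionary, the exact regular expressions) is an assumption for illustration and may differ from compliance.py.

```python
import re

# Standard Verhoeff tables: dihedral-group multiplication, position
# permutations, and group inverses.
_D = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), (1, 2, 3, 4, 0, 6, 7, 8, 9, 5),
    (2, 3, 4, 0, 1, 7, 8, 9, 5, 6), (3, 4, 0, 1, 2, 8, 9, 5, 6, 7),
    (4, 0, 1, 2, 3, 9, 5, 6, 7, 8), (5, 9, 8, 7, 6, 0, 4, 3, 2, 1),
    (6, 5, 9, 8, 7, 1, 0, 4, 3, 2), (7, 6, 5, 9, 8, 2, 1, 0, 4, 3),
    (8, 7, 6, 5, 9, 3, 2, 1, 0, 4), (9, 8, 7, 6, 5, 4, 3, 2, 1, 0),
)
_P = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), (1, 5, 7, 6, 2, 8, 3, 0, 9, 4),
    (5, 8, 0, 3, 7, 9, 6, 1, 4, 2), (8, 9, 1, 6, 0, 4, 3, 5, 2, 7),
    (9, 4, 5, 3, 1, 2, 6, 8, 7, 0), (4, 2, 8, 6, 5, 7, 3, 9, 0, 1),
    (2, 7, 9, 3, 8, 0, 6, 4, 1, 5), (7, 0, 4, 6, 9, 1, 3, 2, 5, 8),
)
_INV = (0, 4, 3, 2, 1, 5, 6, 7, 8, 9)


def verhoeff_check_digit(payload: str) -> int:
    """Check digit to append to a string of decimal digits."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = _D[c][_P[(i + 1) % 8][int(ch)]]
    return _INV[c]


def verhoeff_is_valid(number: str) -> bool:
    """True if the last digit is a correct Verhoeff check digit."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0


def is_valid_aadhaar(number: str) -> bool:
    """12 digits, first digit 2-9 (rejects the 0/1 prefix), valid checksum."""
    digits = number.replace(" ", "")
    return bool(re.fullmatch(r"[2-9][0-9]{11}", digits)) and verhoeff_is_valid(digits)


# Fourth PAN character encodes the holder's entity type.
PAN_ENTITY = {
    "P": "individual", "C": "company", "H": "HUF", "F": "firm",
    "A": "association of persons", "T": "trust", "B": "body of individuals",
    "L": "local authority", "J": "artificial juridical person", "G": "government",
}
PAN_RE = re.compile(r"[A-Z]{3}([ABCFGHJLPT])[A-Z][0-9]{4}[A-Z]")


def pan_entity_type(pan: str):
    """Entity type for a structurally valid PAN, else None."""
    match = PAN_RE.fullmatch(pan.upper())
    return PAN_ENTITY[match.group(1)] if match else None
```

The PAN check here is purely structural: it confirms the layout and entity character, not that the number was actually issued.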

3.4 Fraud Ring Detector (fraud_ring.py): new headline feature

Single-document forensics misses organised fraud. This module fixes that.

Pipeline:

  1. Extract identity signals from each applicant: name, DOB, address, phone, IFSC, account, employer (OCR + regex).
  2. Build a weighted similarity graph (NetworkX). Edge weight is a sum of per-signal match weights, with field-specific fraud significance:
    • account number 0.25 (highest: same account = same person)
    • DOB 0.15, address 0.20, phone 0.20
    • name 0.10, IFSC 0.05, employer 0.05
  3. Detect rings = connected components above a configurable similarity threshold, size ≥ 3.
  4. Score each ring: CRITICAL (≥5 applicants), HIGH (3-4), MEDIUM (2).
  5. Visualise as an interactive force-directed graph; ring members rendered in red with thick edges.
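
Steps 2 to 4 of the pipeline can be sketched in a few dozen lines. The signal weights and severity bands come from the list above; the rest is assumed for illustration: the function names, the 0.35 edge threshold, exact-match comparison of already-normalised fields, and a plain adjacency dictionary standing in for the NetworkX graph so the example runs without dependencies.

```python
from itertools import combinations

# Per-signal weights from the pipeline description (they sum to 1.0).
SIGNAL_WEIGHTS = {
    "account": 0.25, "address": 0.20, "phone": 0.20, "dob": 0.15,
    "name": 0.10, "ifsc": 0.05, "employer": 0.05,
}


def edge_weight(a, b):
    """Sum the weights of every signal on which two applicants match."""
    return sum(
        weight for key, weight in SIGNAL_WEIGHTS.items()
        if a.get(key) is not None and a.get(key) == b.get(key)
    )


def find_rings(applicants, threshold=0.35, min_size=3):
    """Connected components of the thresholded similarity graph.

    `applicants` maps an applicant id to a dict of extracted signals.
    `threshold` is a hypothetical default; the real value is configurable.
    """
    adjacency = {aid: set() for aid in applicants}
    for u, v in combinations(applicants, 2):
        if edge_weight(applicants[u], applicants[v]) >= threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)

    rings, seen = [], set()
    for start in applicants:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) >= min_size:
            rings.append(sorted(component))
    return rings


def ring_severity(size):
    """Severity bands from step 4."""
    if size >= 5:
        return "CRITICAL"
    if size >= 3:
        return "HIGH"
    return "MEDIUM"
```

With NetworkX, the traversal collapses to nx.connected_components on the same thresholded graph; lowering min_size to 2 is what would make the MEDIUM band reachable.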

Banking impact: detects identity-recycling rings, address farms, mule-account networks: the patterns that cost banks ~₹3,000 crore/year (RBI Annual Report).

3.5 Tamper Forge Studio (tampering.py)

Adversarial validation: live UI to apply copy-move, splice, text-edit, compression, metadata-strip, custom-region, or chained tampering operations to a clean sample, then immediately re-run detection. Visual before/after with bounding boxes, per-detector scorecard, ELA + noise heatmap overlays. Doubles as a continuous test harness for the forensics layer.

3.6 Provenance Ledger (provenance.py): new compliance feature

Tamper-evident SHA-256 hash chain over every analysis:

record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)
  • Stored in SQLite (single file, zero-deploy)
  • verify_chain() walks every record in O(N) and pinpoints the first broken record
  • Satisfies RBI Master Direction on KYC (2016), Para 67 record-retention requirements
  • Downloadable as JSON for external auditors

Conceptually a baby blockchain: append-only, hash-linked, mathematically verifiable.
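
The hash-chain idea can be sketched with the standard library alone. The record_hash fields follow the formula above, and verify_chain is named after the function in the text, but its signature and return value here are guesses. The table schema, the all-zero genesis hash, the four-decimal score formatting, and the helper names open_ledger and append_record are likewise assumptions rather than the contents of provenance.py.

```python
import hashlib
import sqlite3

GENESIS = "0" * 64  # assumed prev_hash for the very first record


def record_hash(timestamp, doc_sha256, risk_band, risk_score, prev_hash):
    """SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)."""
    payload = "|".join(
        [timestamp, doc_sha256, risk_band, f"{risk_score:.4f}", prev_hash]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def open_ledger(path=":memory:"):
    """Open (or create) the single-file SQLite ledger."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ledger (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               timestamp TEXT, doc_sha256 TEXT, risk_band TEXT,
               risk_score REAL, prev_hash TEXT, record_hash TEXT)"""
    )
    return conn


def append_record(conn, timestamp, doc_sha256, risk_band, risk_score):
    """Append one analysis, linking it to the previous record's hash."""
    row = conn.execute(
        "SELECT record_hash FROM ledger ORDER BY id DESC LIMIT 1"
    ).fetchone()
    prev_hash = row[0] if row else GENESIS
    digest = record_hash(timestamp, doc_sha256, risk_band, risk_score, prev_hash)
    conn.execute(
        "INSERT INTO ledger (timestamp, doc_sha256, risk_band, risk_score,"
        " prev_hash, record_hash) VALUES (?, ?, ?, ?, ?, ?)",
        (timestamp, doc_sha256, risk_band, risk_score, prev_hash, digest),
    )
    conn.commit()
    return digest


def verify_chain(conn):
    """Walk the ledger once; return the id of the first broken record or None."""
    expected_prev = GENESIS
    rows = conn.execute(
        "SELECT id, timestamp, doc_sha256, risk_band, risk_score,"
        " prev_hash, record_hash FROM ledger ORDER BY id"
    )
    for rid, ts, doc, band, score, prev_hash, stored in rows:
        recomputed = record_hash(ts, doc, band, score, expected_prev)
        if prev_hash != expected_prev or recomputed != stored:
            return rid
        expected_prev = stored
    return None
```

Editing any stored field changes that record's recomputed hash, and rewriting its hash breaks the link held by the next record, so a retroactive edit surfaces at the first affected id.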

3.7 Audit report (audit_report.py)

Bank-letterhead PDF with:

  • Metadata table (file, SHA-256, analysed timestamp)
  • Risk verdict box (colour-coded by band)
  • Sub-score table with ASCII bars
  • Evidence bullets
  • Embedded forensic heatmaps

3.8 Dashboard (app.py)

Six-tab Streamlit UI. Sample documents bundled (sample_data/) for instant demo.


4. Data assets

| Asset | Purpose | Volume |
| --- | --- | --- |
| AgamiAI Indian Bank Statements (HF) | Real Indian bank statement PDFs | 217 |
| IDRBT Cheque Image Dataset | Cheque images, Indian banking format | 112 |
| CASIA v2 | CNN training (forged/authentic) | ~12 k |
| sample_data/ bundled | Demo fixtures | 26 |

5. Ensemble fusion logic

sub_scores = {ela, copy_move, noise, exif, text_rules, ai_generated}
weights    = {ela: 0.20, copy_move: 0.25, noise: 0.20, exif: 0.15,
              text_rules: 0.20}
              # ai_generated is a separate overlay, not in base weights

base_score     = sum(weights[k] * sub_scores[k] for k in weights)
score_with_rf  = 0.5 * base_score + 0.5 * rf.predict_proba(features)
w              = clamp(cnn.val_auc, 0.4, 0.7)
score_with_cnn = (1 - w) * score_with_rf + w * cnn.predict(image)
final_score    = 0.9 * score_with_cnn + 0.1 * ai_gen_prob * 2.0
                 # AI-gen contribution capped at +20%

Band mapping: 0-0.30 LOW · 0.30-0.50 MEDIUM · 0.50-0.75 HIGH · 0.75+ CRITICAL
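
A runnable version of the fusion logic and band mapping might look like the sketch below. The weights and blend factors are taken from this section; the function names, the treatment of each overlay as optional, the final clip to [0, 1], and the choice of half-open bands (a score of exactly 0.30 counts as MEDIUM) are assumptions the text does not settle.

```python
BASE_WEIGHTS = {
    "ela": 0.20, "copy_move": 0.25, "noise": 0.20,
    "exif": 0.15, "text_rules": 0.20,
}


def clamp(x, lo, hi):
    return max(lo, min(hi, x))


def fuse(sub_scores, rf_prob=None, cnn_prob=None, cnn_val_auc=None,
         ai_gen_prob=None):
    """Weighted blend -> RF overlay -> CNN overlay -> AI-gen overlay."""
    score = sum(w * sub_scores[k] for k, w in BASE_WEIGHTS.items())
    if rf_prob is not None:
        score = 0.5 * score + 0.5 * rf_prob
    if cnn_prob is not None and cnn_val_auc is not None:
        # CNN weight tracks validation AUC, bounded to [0.4, 0.7].
        w = clamp(cnn_val_auc, 0.4, 0.7)
        score = (1 - w) * score + w * cnn_prob
    if ai_gen_prob is not None:
        # AI-gen contribution is capped at +0.20.
        score = 0.9 * score + min(0.1 * ai_gen_prob * 2.0, 0.20)
    # Assumption: clip, since 0.9 * 1.0 + 0.20 can exceed 1.0.
    return clamp(score, 0.0, 1.0)


def risk_band(score):
    """Map a fused score to a band, using half-open intervals (assumed)."""
    if score < 0.30:
        return "LOW"
    if score < 0.50:
        return "MEDIUM"
    if score < 0.75:
        return "HIGH"
    return "CRITICAL"
```

For example, sub-scores of 0.4 across the board with rf_prob=0.8, cnn_prob=1.0, cnn_val_auc=0.9, and ai_gen_prob=0.5 give 0.4 -> 0.6 -> 0.88 -> 0.892, which lands in CRITICAL.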


6. Roadmap: what's next

The architecture described above is wired for these extensions; they ship in subsequent rounds.

| Capability | Status | Notes |
| --- | --- | --- |
| FastAPI gateway + webhook alerts | planned | Push to LOS / CRM on HIGH or CRITICAL |
| Federated learning across banks | planned | Flower (flwr); no raw data leaves |
| LLM-based document reasoning | planned | Local Phi-3 / Gemma over OCR text |
| Real-time drift monitoring | planned | Track per-detector confidence over time |
| Kubernetes deployment | planned | For multi-tenant bank hosting |
| Multilingual OCR (Hindi / Bengali) | planned | Tesseract + IndicOCR models |

7. Mapping to Round-1 submission

| Round-1 idea | Round-2 realisation |
| --- | --- |
| Image forensics (ELA, copy-move, noise, EXIF) | forensics.py: fully implemented + AI-gen FFT detector |
| PDF structural auditing | forensics.pdf_structural_audit + pdf_font_audit |
| OCR + financial validation | forensics.text_rule_checks + IFSC/PAN/Aadhaar full validators |
| Random Forest risk scoring | forensics.predict_with_model: trained on 11-d feature set |
| Real-time underwriter dashboard | Streamlit app, 6 tabs, bank-letterhead PDF output |
| CNN with MobileNetV2 (future) | Delivered: fine-tuned on CASIA v2 |
| LLM reasoning (future) | Roadmap (see § 6) |
| API deployment (future) | Roadmap: FastAPI gateway scaffolded |
| NEW: Fraud Ring Network | Cross-applicant graph + clique discovery |
| NEW: Provenance ledger | SHA-256 hash chain, RBI Para 67 compliant |
| NEW: Tamper Forge Studio | Adversarial-validation harness |

The Round-1 pillars remain the visible centre of the system. The new pillars extend each axis without breaking the original framing: forensics gets AI-gen detection, scoring gets a cross-applicant view, the dashboard gets a tamper-evident audit trail.


8. Repository layout

.
├── app.py                  Streamlit dashboard (6 tabs)
├── forensics.py            Core analysis pipeline
├── ai_detector.py          AI-generated content detector (FFT)
├── fraud_ring.py           Cross-applicant graph + clique detection
├── provenance.py           Tamper-evident SHA-256 hash chain
├── compliance.py           IFSC / PAN / Aadhaar / PII redaction
├── tampering.py            Adversarial harness for Forge Studio
├── audit_report.py         Bank-letterhead PDF builder
├── docsentry_master.ipynb  Notebook source of truth
├── models/                 RF + CNN model files
├── sample_data/            26 demo documents
├── requirements.txt        Python dependencies
├── packages.txt            apt-get packages (HF Spaces)
├── README.md               Reference + install guide
├── ARCHITECTURE.md         This document
└── LICENSE                 MIT + third-party notices

This architecture document is the technical reference for DocSentry Round 2. It accompanies the live demo at https://huggingface.co/spaces/SpandanM110/DocSentry and the source code at https://github.com/SpandanM110/Doc-Sentry.