DocSentry: System Architecture
Real-time document anomaly detection for Indian bank underwriting.
DocSentry is the operational realisation of the Round-1 submission idea: catch tampered, forged, and AI-generated documents at the moment of upload, score them on a calibrated risk scale, and hand the underwriter a defensible audit trail. Round-2 turns that idea into a robust, production-grade platform.
1. Architectural principles
| Principle | What it means in DocSentry |
|---|---|
| Defence in depth | Six independent detection layers (rule, image, PDF, OCR, ML, AI-generated). No single bypass defeats the system. |
| Explainability first | Every verdict ships with sub-scores, evidence bullets, and visual heatmaps. Black-box outputs are unacceptable in regulated finance. |
| Tamper-evident provenance | Every analysis is appended to a SHA-256 hash chain. Retroactive edits are mathematically detectable. |
| Portfolio-level vision | Single-document forensics is necessary but insufficient. Real fraud is organised; the system reasons across applicants. |
| Zero data egress | All inference runs locally. No applicant PII leaves the bank's perimeter. |
| RBI-aligned output | Compliance reports follow Master Direction on KYC formatting so they can be filed directly. |
2. Layered architecture
            ┌───────────────────────────────────────────┐
            │  PRESENTATION (Streamlit / future PWA)    │
            │                                           │
            │  Tab 1  Single-doc analysis               │
            │  Tab 2  Cross-document KYC                │
            │  Tab 3  Batch underwriter audit           │
            │  Tab 4  RBI compliance + Provenance       │
            │  Tab 5  Live Tamper Forge Studio          │
            │  Tab 6  Fraud Ring Network                │
            └─────────────────────┬─────────────────────┘
                                  │
            ┌─────────────────────▼─────────────────────┐
            │  API GATEWAY (planned FastAPI front)      │
            │  /analyse   /verify   /forge-test         │
            │  /compliance  /batch  /webhook            │
            └─────────────────────┬─────────────────────┘
                                  │
         ┌────────────────────────┼──────────────────────────┐
         ▼                        ▼                          ▼
┌─────────────────┐   ┌───────────────────────┐   ┌─────────────────────┐
│  INGESTION      │   │  FORENSICS CORE       │   │  COMPLIANCE CORE    │
│  - Direct upload│   │  Rule layer (ELA,     │   │  RBI IFSC lookup    │
│  - Watch folder │   │  copy-move, noise,    │   │  PAN entity check   │
│  - PDF / image  │   │  EXIF, PDF struct,    │   │  Aadhaar Verhoeff   │
│  - Future Kafka │   │  OCR rules)           │   │  PII redaction      │
└────────┬────────┘   │  RF classifier (11-d) │   │  DPDP-aligned       │
         │            │  CNN (MobileNetV2)    │   └──────────┬──────────┘
         ▼            │  AI-gen detector (FFT)│              │
┌─────────────────┐   └───────────┬───────────┘              │
│  PROVENANCE     │               │                          │
│  SHA-256 chain  │               ▼                          │
│  SQLite ledger  │    ┌─────────────────────┐               │
│  verify_chain() │    │  ENSEMBLE FUSION    │               │
└────────┬────────┘    │  Weighted blend     │               │
         │             │  per sub-detector   │               │
         │             └──────────┬──────────┘               │
         └────────────────────────┼──────────────────────────┘
                                  ▼
                 ┌─────────────────────────────────┐
                 │  FRAUD RING DETECTOR            │
                 │  NetworkX similarity graph      │
                 │  Clique-based ring discovery    │
                 │  Cross-applicant correlation    │
                 └────────────────┬────────────────┘
                                  ▼
                 ┌─────────────────────────────────┐
                 │  RISK ORCHESTRATOR              │
                 │  Score -> band -> action        │
                 │  (LOW / MEDIUM / HIGH / CRIT)   │
                 └────────────────┬────────────────┘
                                  ▼
                 ┌─────────────────────────────────┐
                 │  OUTPUT LAYER                   │
                 │  - Streamlit dashboard          │
                 │  - Bank-letterhead audit PDF    │
                 │  - RBI compliance pack PDF      │
                 │  - Audit JSON                   │
                 │  - Webhook alerts (planned)     │
                 │  - Provenance ledger entry      │
                 └─────────────────────────────────┘
3. Component reference
3.1 Forensics core (forensics.py)
Six independent detectors blended via analyse_document(path):
| Detector | Method | Sub-score key |
|---|---|---|
| Error Level Analysis | JPEG re-save diff | ela |
| Copy-move | ORB keypoints + cross-matching | copy_move |
| Noise inconsistency | Per-block Laplacian variance | noise |
| EXIF audit | Metadata + software-tag fingerprint | exif |
| OCR + text rules | Tesseract + IFSC/PAN/date/amount regex | text_rules |
| AI-generated | Radial FFT spectral analysis (new) | ai_generated |
Optional ML overlays:
- Random Forest (predict_with_model): 11 features (4 forensics + 4 GLCM texture + 3 colour entropy).
- MobileNetV2 CNN (predict_with_cnn): fine-tuned on CASIA v2; weight grows with measured validation AUC.
Final score: risk_score = weighted_blend(sub_scores) -> RF overlay -> CNN overlay -> AI-gen overlay.
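The "OCR + text rules" row of the detector table can be illustrated with a few regular expressions. This is a minimal sketch, not the contents of forensics.py. The function name `find_identifier_candidates` is hypothetical, and the patterns are assumptions based on the published identifier layouts: an IFSC is four letters, a zero, then six alphanumerics; a PAN is five letters, four digits, one letter, with the fourth letter encoding the holder type.

```python
import re

# Assumed patterns derived from the public identifier layouts; the real
# text-rule implementation may use different or stricter checks.
IFSC_RE = re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b")
PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")

# The fourth PAN character identifies the type of holder.
PAN_ENTITY_TYPES = {
    "P": "individual", "C": "company", "H": "HUF", "F": "firm",
    "A": "association of persons", "T": "trust", "B": "body of individuals",
    "L": "local authority", "J": "artificial juridical person",
    "G": "government",
}


def find_identifier_candidates(ocr_text: str) -> dict:
    """Return IFSC and PAN candidates found in OCR output (hypothetical helper)."""
    text = ocr_text.upper()
    pans = [
        {"pan": pan, "entity": PAN_ENTITY_TYPES.get(pan[3], "unknown")}
        for pan in PAN_RE.findall(text)
    ]
    return {"ifsc": IFSC_RE.findall(text), "pan": pans}
```

A candidate whose fourth PAN character maps to "unknown" would be a natural input to the text_rules sub-score, although how the real module weights such findings is not specified here.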
3.2 AI-generated detector (ai_detector.py)
The 2026 threat model is Sora / Midjourney / Stable Diffusion outputs, not Photoshop. This module catches them in the frequency domain:
| Signal | Detection |
|---|---|
| High-frequency suppression | Ratio of low- to high-frequency FFT energy. |
| Periodic spectral peaks | Spike count in high-frequency band. |
| JPEG quantization absence | PIL img.quantization table inspection. |
Blended into the main risk score with a +20% cap so it never dominates classical signals, but reliably surfaces synthetic media.
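The first signal in the table, the ratio of low- to high-frequency energy, can be sketched with NumPy's FFT routines. This is an illustration under stated assumptions rather than the code in ai_detector.py: the function name `low_high_energy_ratio`, the normalised-radius definition, and the 0.25 cutoff are all hypothetical choices.

```python
import numpy as np


def low_high_energy_ratio(gray, cutoff: float = 0.25) -> float:
    """Ratio of spectral power inside vs. outside a normalised radius.

    `gray` is a 2-D array of pixel intensities. `cutoff` is an assumed
    tuning parameter, not a value taken from DocSentry.
    """
    gray = np.asarray(gray, dtype=np.float64)
    # Remove the mean so the DC term does not swamp the low band.
    spectrum = np.fft.fftshift(np.fft.fft2(gray - gray.mean()))
    power = np.abs(spectrum) ** 2

    h, w = gray.shape
    yy, xx = np.indices((h, w))
    # Radius is 0 at the spectrum centre and 1 at the Nyquist frequency
    # along either axis.
    radius = np.hypot((yy - h // 2) / (h / 2), (xx - w // 2) / (w / 2))

    low = power[radius < cutoff].sum()
    high = power[radius >= cutoff].sum()
    return float(low / (high + 1e-12))  # small epsilon avoids division by zero
```

An unusually high ratio indicates suppressed high-frequency content. How that ratio is mapped to the ai_generated sub-score (thresholds, calibration) is left open in this sketch.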
3.3 Cross-document KYC (compliance.py)
- IFSC validation against 36 RBI bank codes
- PAN entity-type character + Luhn-like structural check
- Aadhaar UIDAI Verhoeff checksum (rejects 0/1 prefix)
- PII redaction via PyMuPDF text-bbox overlays
- RBI-format compliance audit PDF (5 sections, ReportLab Platypus)
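The Aadhaar check in the list above rests on the Verhoeff checksum, which can be sketched in a few lines. The tables below are the standard Verhoeff multiplication, permutation, and inverse tables. The helper names (`verhoeff_valid`, `verhoeff_check_digit`, `is_plausible_aadhaar`) are illustrative and are not taken from compliance.py.

```python
# Standard Verhoeff tables: D is the dihedral-group multiplication table,
# P the position-dependent permutation, INV the group inverse.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6], [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8], [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2], [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2], [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0], [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5], [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_check_digit(payload: str) -> int:
    """Check digit to append to a string of decimal digits."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return INV[c]


def verhoeff_valid(number: str) -> bool:
    """True if the last digit is a correct Verhoeff check digit."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def is_plausible_aadhaar(number: str) -> bool:
    """Structural check only: 12 digits, no 0/1 prefix, valid checksum."""
    digits = number.replace(" ", "")
    if len(digits) != 12 or not (digits.isascii() and digits.isdigit()):
        return False
    if digits[0] in "01":
        return False
    return verhoeff_valid(digits)
```

A passing result only means the number is well formed; it says nothing about whether UIDAI ever issued it.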
3.4 Fraud Ring Detector (fraud_ring.py): new headline feature
Single-document forensics misses organised fraud. This module fixes that.
Pipeline:
- Extract identity signals from each applicant: name, DOB, address, phone, IFSC, account, employer (OCR + regex).
- Build a weighted similarity graph (NetworkX). Edge weight is a sum of per-signal match weights, with field-specific fraud significance:
  - account number 0.25 (highest: same account = same person)
  - DOB 0.15, address 0.20, phone 0.20
  - name 0.10, IFSC 0.05, employer 0.05
- Detect rings = connected components above a configurable similarity threshold, size ≥ 3.
- Score each ring: CRITICAL (≥ 5 applicants), HIGH (3-4), MEDIUM (2).
- Visualise as an interactive force-directed graph; ring members rendered in red with thick edges.
Banking impact: detects identity-recycling rings, address farms, and mule-account networks, the patterns that cost banks ~₹3,000 crore/year (RBI Annual Report).
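The pipeline can be sketched without any dependencies. In the example below a small union-find stands in for the NetworkX connected-components step so that the code runs on the standard library alone. The weights are the ones listed in the pipeline; exact string matching, the 0.4 threshold, and all function names are illustrative assumptions.

```python
from itertools import combinations

# Per-signal weights from the pipeline description; they sum to 1.0.
WEIGHTS = {"account": 0.25, "address": 0.20, "phone": 0.20, "dob": 0.15,
           "name": 0.10, "ifsc": 0.05, "employer": 0.05}


def edge_weight(a: dict, b: dict) -> float:
    """Sum the weights of every signal on which both applicants agree.

    Exact equality is an assumption; a real matcher would likely
    normalise and fuzzy-match names and addresses.
    """
    return sum(w for key, w in WEIGHTS.items()
               if a.get(key) and a.get(key) == b.get(key))


def find_rings(applicants: dict, threshold: float = 0.4, min_size: int = 3):
    """Connected components of the graph whose edges meet the threshold."""
    parent = {name: name for name in applicants}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(applicants, 2):
        if edge_weight(applicants[a], applicants[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for name in applicants:
        groups.setdefault(find(name), set()).add(name)
    return [ring for ring in groups.values() if len(ring) >= min_size]


def ring_severity(size: int) -> str:
    """Severity bands from the scoring step."""
    if size >= 5:
        return "CRITICAL"
    if size >= 3:
        return "HIGH"
    return "MEDIUM"
```

Because rings are connected components, two applicants can land in the same ring without sharing any signal directly, as long as a chain of strong matches links them. That is the behaviour wanted for mule-account networks, where each hop reuses only part of an identity.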
3.5 Tamper Forge Studio (tampering.py)
Adversarial validation: live UI to apply copy-move, splice, text-edit, compression, metadata-strip, custom-region, or chained tampering operations to a clean sample, then immediately re-run detection. Visual before/after with bounding boxes, per-detector scorecard, ELA + noise heatmap overlays. Doubles as a continuous test harness for the forensics layer.
3.6 Provenance Ledger (provenance.py): new compliance feature
Tamper-evident SHA-256 hash chain over every analysis:
record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)
- Stored in SQLite (single file, zero-deploy)
- verify_chain() walks every record in O(N) and pinpoints the first broken record
- Satisfies RBI Master Direction on KYC (2016), Para 67 record-retention requirements
- Downloadable as JSON for external auditors
Conceptually a baby blockchain: append-only, hash-linked, mathematically verifiable.
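The hash-chain formula can be sketched with hashlib. This is an illustration, not the contents of provenance.py: an in-memory list stands in for the SQLite ledger, and the all-zero genesis hash, the four-decimal score formatting, the `append_record` helper, and the return convention of `verify_chain` are all assumptions.

```python
import hashlib

GENESIS = "0" * 64  # assumed prev_hash sentinel for the first record


def record_hash(rec: dict, prev_hash: str) -> str:
    """SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)."""
    payload = "|".join([
        rec["timestamp"], rec["doc_sha256"], rec["risk_band"],
        f'{rec["risk_score"]:.4f}',  # fixed formatting keeps hashes stable
        prev_hash,
    ])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(chain: list, rec: dict) -> None:
    """Link a new analysis record to the tail of the chain."""
    prev = chain[-1]["record_hash"] if chain else GENESIS
    chain.append({**rec, "prev_hash": prev,
                  "record_hash": record_hash(rec, prev)})


def verify_chain(chain: list):
    """Return the index of the first broken record, or None if intact."""
    prev = GENESIS
    for i, rec in enumerate(chain):
        if rec["prev_hash"] != prev or rec["record_hash"] != record_hash(rec, prev):
            return i
        prev = rec["record_hash"]
    return None
```

Editing any field of an earlier record changes its recomputed hash, so the single O(N) pass stops at that record. An attacker who also rewrites every later hash defeats this check unless the latest hash is anchored somewhere outside the ledger, which this sketch does not attempt.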
3.7 Audit report (audit_report.py)
Bank-letterhead PDF with:
- Metadata table (file, SHA-256, analysed timestamp)
- Risk verdict box (colour-coded by band)
- Sub-score table with ASCII bars
- Evidence bullets
- Embedded forensic heatmaps
3.8 Dashboard (app.py)
Six-tab Streamlit UI. Sample documents bundled (sample_data/) for instant demo.
4. Data assets
| Asset | Purpose | Volume |
|---|---|---|
| AgamiAI Indian Bank Statements (HF) | Real Indian bank statement PDFs | 217 |
| IDRBT Cheque Image Dataset | Cheque images, Indian banking format | 112 |
| CASIA v2 | CNN training (forged/authentic) | ~12 k |
| sample_data/ (bundled) | Demo fixtures | 26 |
5. Ensemble fusion logic
sub_scores = {ela, copy_move, noise, exif, text_rules, ai_generated}
weights = {ela: 0.20, copy_move: 0.25, noise: 0.20, exif: 0.15, text_rules: 0.20}
# ai_generated is a separate overlay, not in base weights
base_score = sum(weights[k] * sub_scores[k] for k in weights)
score_with_rf = 0.5 * base_score + 0.5 * rf.predict_proba(features)
w = clamp(cnn.val_auc, 0.4, 0.7)  # CNN weight follows measured validation AUC
score_with_cnn = (1 - w) * score_with_rf + w * cnn.predict(image)
final_score = 0.9 * score_with_cnn + 0.1 * ai_gen_prob * 2.0  # AI-gen adds at most +0.20
Band mapping: 0-0.30 LOW · 0.30-0.50 MEDIUM · 0.50-0.75 HIGH · 0.75+ CRITICAL
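The fusion arithmetic and band mapping can be written as a small runnable sketch, with plain probabilities in place of the model objects. Two details are assumptions that the formulas above do not state. First, the final score is clamped to [0, 1], since 0.9 × 1.0 + 0.2 would otherwise reach 1.1. Second, each band includes its lower bound. The function names are illustrative.

```python
BASE_WEIGHTS = {"ela": 0.20, "copy_move": 0.25, "noise": 0.20,
                "exif": 0.15, "text_rules": 0.20}


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def fuse(sub_scores: dict, rf_prob: float, cnn_prob: float,
         cnn_val_auc: float, ai_gen_prob: float) -> float:
    """Weighted blend -> RF overlay -> CNN overlay -> AI-gen overlay."""
    base = sum(w * sub_scores[k] for k, w in BASE_WEIGHTS.items())
    with_rf = 0.5 * base + 0.5 * rf_prob
    w = clamp(cnn_val_auc, 0.4, 0.7)  # CNN weight tracks validation AUC
    with_cnn = (1 - w) * with_rf + w * cnn_prob
    final = 0.9 * with_cnn + 0.1 * ai_gen_prob * 2.0  # at most +0.20
    return clamp(final, 0.0, 1.0)  # assumption: keep the score in [0, 1]


def band(score: float) -> str:
    """Map a score to a risk band; lower bounds are inclusive (assumed)."""
    if score < 0.30:
        return "LOW"
    if score < 0.50:
        return "MEDIUM"
    if score < 0.75:
        return "HIGH"
    return "CRITICAL"
```

With every input at 0.5 and no AI-generation signal, the score is 0.45 (MEDIUM). Raising ai_gen_prob to 1.0 lifts the same document to 0.65 (HIGH), which shows the size of the capped overlay.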
6. Roadmap: what's next
The architecture above is wired for these extensions; they ship in subsequent rounds.
| Capability | Status | Notes |
|---|---|---|
| FastAPI gateway + webhook alerts | planned | Push to LOS / CRM on HIGH or CRITICAL |
| Federated learning across banks | planned | Flower (flwr); no raw data leaves |
| LLM-based document reasoning | planned | Local Phi-3 / Gemma over OCR text |
| Real-time drift monitoring | planned | Track per-detector confidence over time |
| Kubernetes deployment | planned | For multi-tenant bank hosting |
| Multilingual OCR (Hindi / Bengali) | planned | Tesseract + IndicOCR models |
7. Mapping to Round-1 submission
| Round-1 idea | Round-2 realisation |
|---|---|
| Image forensics (ELA, copy-move, noise, EXIF) | forensics.py: fully implemented + AI-gen FFT detector |
| PDF structural auditing | forensics.pdf_structural_audit + pdf_font_audit |
| OCR + financial validation | forensics.text_rule_checks + IFSC/PAN/Aadhaar full validators |
| Random Forest risk scoring | forensics.predict_with_model: trained on 11-d feature set |
| Real-time underwriter dashboard | Streamlit app, 6 tabs, bank-letterhead PDF output |
| CNN with MobileNetV2 (future) | Delivered: fine-tuned on CASIA v2 |
| LLM reasoning (future) | Roadmap (see § 6) |
| API deployment (future) | Roadmap: FastAPI gateway scaffolded |
| NEW: Fraud Ring Network | Cross-applicant graph + clique discovery |
| NEW: Provenance ledger | SHA-256 hash chain, RBI Para 67 compliant |
| NEW: Tamper Forge Studio | Adversarial-validation harness |
The Round-1 pillars remain the visible centre of the system. The new pillars extend each axis without breaking the original framing: forensics gets AI-gen detection, scoring gets a cross-applicant view, the dashboard gets a tamper-evident audit trail.
8. Repository layout
.
├── app.py                  Streamlit dashboard (6 tabs)
├── forensics.py            Core analysis pipeline
├── ai_detector.py          AI-generated content detector (FFT)
├── fraud_ring.py           Cross-applicant graph + clique detection
├── provenance.py           Tamper-evident SHA-256 hash chain
├── compliance.py           IFSC / PAN / Aadhaar / PII redaction
├── tampering.py            Adversarial harness for Forge Studio
├── audit_report.py         Bank-letterhead PDF builder
├── docsentry_master.ipynb  Notebook source of truth
├── models/                 RF + CNN model files
├── sample_data/            26 demo documents
├── requirements.txt        Python dependencies
├── packages.txt            apt-get packages (HF Spaces)
├── README.md               Reference + install guide
├── ARCHITECTURE.md         This document
└── LICENSE                 MIT + third-party notices
This architecture document is the technical reference for DocSentry Round 2. It accompanies the live demo at https://huggingface.co/spaces/SpandanM110/DocSentry and the source code at https://github.com/SpandanM110/Doc-Sentry.