# DocSentry — System Architecture

**Real-time document anomaly detection for Indian bank underwriting.**

DocSentry is the operational realisation of the Round-1 submission idea:
catch tampered, forged, and AI-generated documents at the moment of
upload, score them on a calibrated risk scale, and hand the underwriter
a defensible audit trail. Round-2 turns that idea into a robust,
production-grade platform.

---

## 1. Architectural principles

| Principle | What it means in DocSentry |
|---|---|
| **Defence in depth** | Six independent detection layers (rule, image, PDF, OCR, ML, AI-generated). No single bypass defeats the system. |
| **Explainability first** | Every verdict ships with sub-scores, evidence bullets, and visual heatmaps. Black-box outputs are unacceptable in regulated finance. |
| **Tamper-evident provenance** | Every analysis is appended to a SHA-256 hash chain. Retroactive edits are mathematically detectable. |
| **Portfolio-level vision** | Single-document forensics is necessary but insufficient. Real fraud is *organised*; the system reasons across applicants. |
| **Zero data egress** | All inference runs locally. No applicant PII leaves the bank's perimeter. |
| **RBI-aligned output** | Compliance reports follow the formatting of the RBI Master Direction on KYC, so they can be filed directly. |

---

## 2. Layered architecture

```
                    ┌─────────────────────────────────────────┐
                    │     PRESENTATION (Streamlit / future PWA)│
                    │                                          │
                    │  Tab 1  Single-doc analysis              │
                    │  Tab 2  Cross-document KYC               │
                    │  Tab 3  Batch underwriter audit          │
                    │  Tab 4  RBI compliance + Provenance      │
                    │  Tab 5  Live Tamper Forge Studio         │
                    │  Tab 6  Fraud Ring Network               │
                    └────────────────────┬─────────────────────┘
                                         │
                    ┌────────────────────▼─────────────────────┐
                    │   API GATEWAY (planned FastAPI front)    │
                    │   /analyse  /verify  /forge-test         │
                    │   /compliance  /batch  /webhook          │
                    └────────────────────┬─────────────────────┘
                                         │
   ┌─────────────────────────────────────┼───────────────────────────────┐
   ▼                                     ▼                               ▼
┌─────────────────┐         ┌───────────────────────┐         ┌─────────────────────┐
│  INGESTION      │         │   FORENSICS CORE      │         │  COMPLIANCE CORE    │
│  - Direct upload│         │  Rule layer (ELA,     │         │  RBI IFSC lookup    │
│  - Watch folder │         │   copy-move, noise,   │         │  PAN entity check   │
│  - PDF / image  │         │   EXIF, PDF struct,   │         │  Aadhaar Verhoeff   │
│  - Future Kafka │         │   OCR rules)          │         │  PII redaction      │
└────────┬────────┘         │  RF classifier (11-d) │         │  DPDP-aligned       │
         │                  │  CNN (MobileNetV2)    │         └──────────┬──────────┘
         ▼                  │  AI-gen detector (FFT)│                    │
┌─────────────────┐         └───────────┬───────────┘                    │
│  PROVENANCE     │                     │                                │
│  SHA-256 chain  │                     ▼                                │
│  SQLite ledger  │           ┌───────────────────┐                      │
│  verify_chain() │           │  ENSEMBLE FUSION  │                      │
└────────┬────────┘           │  Weighted blend   │                      │
         │                    │  per sub-detector │                      │
         │                    └───────────┬───────┘                      │
         └──────────────────┬─────────────┴──────────────────────────────┘
                            ▼
              ┌─────────────────────────────────┐
              │   FRAUD RING DETECTOR            │
              │   NetworkX similarity graph      │
              │   Clique-based ring discovery    │
              │   Cross-applicant correlation    │
              └────────────────┬─────────────────┘
                               ▼
              ┌─────────────────────────────────┐
              │   RISK ORCHESTRATOR              │
              │   Score -> band -> action        │
              │   (LOW / MEDIUM / HIGH / CRIT)   │
              └────────────────┬─────────────────┘
                               ▼
              ┌─────────────────────────────────┐
              │   OUTPUT LAYER                   │
              │   - Streamlit dashboard          │
              │   - Bank-letterhead audit PDF    │
              │   - RBI compliance pack PDF      │
              │   - Audit JSON                   │
              │   - Webhook alerts (planned)     │
              │   - Provenance ledger entry      │
              └─────────────────────────────────┘
```

---

## 3. Component reference

### 3.1 Forensics core (`forensics.py`)

Six independent detectors blended via `analyse_document(path)`:

| Detector            | Method                                     | Sub-score key   |
|---------------------|--------------------------------------------|-----------------|
| Error Level Analysis| JPEG re-save diff                          | `ela`           |
| Copy-move           | ORB keypoints + cross-matching             | `copy_move`     |
| Noise inconsistency | Per-block Laplacian variance               | `noise`         |
| EXIF audit          | Metadata + software-tag fingerprint        | `exif`          |
| OCR + text rules    | Tesseract + IFSC/PAN/date/amount regex     | `text_rules`    |
| **AI-generated**    | **Radial FFT spectral analysis (new)**     | `ai_generated`  |

Optional ML overlays:
- **Random Forest** (`predict_with_model`) — 11 features (4 forensics + 4 GLCM texture + 3 colour entropy).
- **MobileNetV2 CNN** (`predict_with_cnn`) — fine-tuned on CASIA v2; weight grows with measured validation AUC.

Final score: `risk_score = weighted_blend(sub_scores) -> RF overlay -> CNN overlay -> AI-gen overlay`.
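
As one illustration of the rule layer, the noise-inconsistency row in the table above can be sketched with NumPy alone. This is a minimal sketch under stated assumptions, not the code in `forensics.py`: the function names, the 16-pixel block size, and the coefficient-of-variation score are all hypothetical choices made for the example.

```python
import numpy as np


def laplacian(gray: np.ndarray) -> np.ndarray:
    """4-neighbour discrete Laplacian, computed on interior pixels only."""
    g = gray.astype(np.float64)
    return (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
            - 4.0 * g[1:-1, 1:-1])


def block_variances(gray: np.ndarray, block: int = 16) -> np.ndarray:
    """Variance of the Laplacian response inside each block x block tile."""
    lap = laplacian(gray)
    rows, cols = lap.shape[0] // block, lap.shape[1] // block
    tiles = lap[:rows * block, :cols * block].reshape(rows, block, cols, block)
    return tiles.var(axis=(1, 3))


def noise_score(gray: np.ndarray, block: int = 16) -> float:
    """Hypothetical score in [0, 1): coefficient of variation of the block
    variances, squashed with x / (1 + x). Uniform noise scores near 0."""
    v = block_variances(gray, block)
    cv = v.std() / (v.mean() + 1e-9)
    return float(cv / (1.0 + cv))


# Synthetic demo: the right half is replaced by a patch with much weaker noise,
# which is what a pasted region from another source tends to look like.
rng = np.random.default_rng(0)
clean = rng.normal(128, 10, size=(128, 128))
spliced = clean.copy()
spliced[:, 64:] = rng.normal(128, 1, size=(128, 64))
print(noise_score(clean), noise_score(spliced))
```

A real document image has legitimately smooth and busy regions, so a production score would likely need to normalise for content; the sketch only shows why per-block variance separates regions with different noise histories.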

### 3.2 AI-generated detector (`ai_detector.py`)

The 2026 threat model is Sora / Midjourney / Stable Diffusion outputs, not Photoshop. This module catches them in the frequency domain:

| Signal                         | Detection                                     |
|--------------------------------|-----------------------------------------------|
| High-frequency suppression     | Ratio of low- to high-frequency FFT energy.   |
| Periodic spectral peaks        | Spike count in high-frequency band.           |
| JPEG quantization absence      | PIL `img.quantization` table inspection.      |

The detector's output is blended into the main risk score with its contribution capped at +0.20 (20 points on the 0-1 scale), so it never dominates the classical signals but still surfaces synthetic media.
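
The first signal in the table, the ratio of low- to high-frequency energy, can be sketched as a radially averaged power spectrum. This is an illustrative sketch, not the code in `ai_detector.py`: the function names, the 32-ring binning, and the 50% split between "low" and "high" frequencies are assumptions, and no decision threshold is implied.

```python
import numpy as np


def radial_power_profile(gray: np.ndarray, n_bins: int = 32) -> np.ndarray:
    """Mean FFT power in concentric rings around the zero frequency."""
    g = gray.astype(np.float64)
    # Subtract the mean so the DC term does not swamp the lowest ring.
    power = np.abs(np.fft.fftshift(np.fft.fft2(g - g.mean()))) ** 2
    h, w = power.shape
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)
    ring = (radius / (min(h, w) / 2) * n_bins).astype(int)
    ring = np.minimum(ring, n_bins)  # corners beyond the last ring go to an overflow bin
    sums = np.bincount(ring.ravel(), weights=power.ravel(), minlength=n_bins + 1)
    counts = np.bincount(ring.ravel(), minlength=n_bins + 1)
    return sums[:n_bins] / np.maximum(counts[:n_bins], 1)


def low_high_energy_ratio(gray: np.ndarray, split: float = 0.5) -> float:
    """Energy in the inner rings divided by energy in the outer rings."""
    profile = radial_power_profile(gray)
    cut = int(len(profile) * split)
    return float(profile[:cut].sum() / (profile[cut:].sum() + 1e-12))


# Synthetic demo: white noise has a flat spectrum; a box-blurred copy has
# suppressed high frequencies and therefore a larger ratio.
rng = np.random.default_rng(0)
noisy = rng.normal(0.0, 1.0, size=(64, 64))
smooth = noisy.copy()
for _ in range(2):
    smooth = (smooth + np.roll(smooth, 1, 0) + np.roll(smooth, 1, 1)
              + np.roll(smooth, (1, 1), (0, 1))) / 4.0
print(low_high_energy_ratio(noisy), low_high_energy_ratio(smooth))
```

The same radial profile could feed the spike count in the second row of the table, since periodic peaks show up as rings whose mean power is far above their neighbours.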

### 3.3 Cross-document KYC (`compliance.py`)

- IFSC validation against 36 RBI bank codes
- PAN entity-type character + Luhn-like structural check
- Aadhaar UIDAI Verhoeff checksum (rejects 0/1 prefix)
- PII redaction via PyMuPDF text-bbox overlays
- RBI-format compliance audit PDF (5 sections, ReportLab Platypus)
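
The identifier checks in this list are simple enough to sketch with the standard library. The Verhoeff tables below are the standard ones for the algorithm; everything else (function names, the PAN entity-type set, and the `known_banks` argument standing in for the 36-code list) is an assumption for illustration and may differ from `compliance.py`.

```python
import re

# Standard Verhoeff tables: D is the dihedral-group D5 multiplication table,
# P the position-dependent permutation, INV the inverse of each element.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6], [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8], [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2], [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2], [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0], [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5], [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_valid(number: str) -> bool:
    """True if the digit string (check digit last) passes the Verhoeff test."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def verhoeff_check_digit(payload: str) -> str:
    """Check digit to append to `payload` so the result validates."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return str(INV[c])


def aadhaar_valid(value: str) -> bool:
    """12 digits, first digit 2-9 (0/1 prefixes rejected), Verhoeff-valid."""
    digits = re.sub(r"[\s-]", "", value)
    return bool(re.fullmatch(r"[2-9][0-9]{11}", digits)) and verhoeff_valid(digits)


# Fourth PAN character encodes the holder type (P = individual, C = company, ...).
PAN_ENTITY_TYPES = set("ABCFGHJLPT")


def pan_valid(value: str) -> bool:
    """Format check only: 5 letters, 4 digits, 1 letter, known entity type."""
    pan = value.strip().upper()
    return bool(re.fullmatch(r"[A-Z]{5}[0-9]{4}[A-Z]", pan)) and pan[3] in PAN_ENTITY_TYPES


def ifsc_valid(value: str, known_banks: set[str]) -> bool:
    """4-letter bank code, a literal 0, then a 6-character branch code."""
    code = value.strip().upper()
    return bool(re.fullmatch(r"[A-Z]{4}0[A-Z0-9]{6}", code)) and code[:4] in known_banks


# Synthetic test number, built by appending the computed check digit.
base = "23456789012"
sample_aadhaar = base + verhoeff_check_digit(base)
print(aadhaar_valid(sample_aadhaar), pan_valid("ABCPE1234F"))
```

Verhoeff detects every single-digit error and every adjacent transposition, which is why it suits a hand-typed 12-digit identifier better than a plain modulus check.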

### 3.4 Fraud Ring Detector (`fraud_ring.py`) — *new headline feature*

Single-document forensics misses **organised** fraud. This module fixes that.

**Pipeline:**

1. **Extract identity signals** from each applicant: name, DOB, address, phone, IFSC, account, employer (OCR + regex).
2. **Build a weighted similarity graph** (NetworkX). Edge weight is a sum of per-signal match weights, with field-specific fraud significance:
   - account number 0.25 (highest — same account = same person)
   - DOB 0.15, address 0.20, phone 0.20
   - name 0.10, IFSC 0.05, employer 0.05
3. **Detect rings** = connected components above a configurable similarity threshold, size ≥ 3.
4. **Score each ring**: CRITICAL (≥5 applicants), HIGH (3-4), MEDIUM (2).
5. **Visualise** as an interactive force-directed graph; ring members rendered in red with thick edges.
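
Steps 2 to 4 can be sketched in plain Python. The module itself is described as using NetworkX; this sketch swaps in a hand-rolled adjacency map and depth-first search so it has no dependencies. The field weights come from step 2, while exact-match comparison, the 0.35 threshold, and all function names are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Per-field match weights from step 2 (they sum to 1.0).
FIELD_WEIGHTS = {
    "account": 0.25, "address": 0.20, "phone": 0.20, "dob": 0.15,
    "name": 0.10, "ifsc": 0.05, "employer": 0.05,
}


def similarity(a: dict[str, str], b: dict[str, str]) -> float:
    """Sum of weights for fields that are present and identical.
    Exact matching is an assumption; real data would need fuzzy matching."""
    return sum(w for f, w in FIELD_WEIGHTS.items() if a.get(f) and a.get(f) == b.get(f))


def find_rings(applicants: dict[str, dict[str, str]],
               threshold: float = 0.35, min_size: int = 3) -> list[list[str]]:
    """Connected components of the thresholded similarity graph."""
    adjacency: dict[str, set[str]] = defaultdict(set)
    for (id_a, a), (id_b, b) in combinations(applicants.items(), 2):
        if similarity(a, b) >= threshold:
            adjacency[id_a].add(id_b)
            adjacency[id_b].add(id_a)

    rings, seen = [], set()
    for start in list(adjacency):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) >= min_size:
            rings.append(sorted(component))
    return rings


def ring_band(size: int) -> str:
    """Severity bands from step 4."""
    return "CRITICAL" if size >= 5 else "HIGH" if size >= 3 else "MEDIUM"


# Synthetic applicants: A1-A3 share an address and phone; A4 only shares an IFSC.
applicants = {
    "A1": {"name": "R Kumar", "address": "12 MG Road", "phone": "9000000001", "ifsc": "SBIN0001234"},
    "A2": {"name": "S Verma", "address": "12 MG Road", "phone": "9000000001", "ifsc": "HDFC0000111"},
    "A3": {"name": "T Singh", "address": "12 MG Road", "phone": "9000000001", "ifsc": "ICIC0000222"},
    "A4": {"name": "U Rao", "address": "7 Park Lane", "phone": "9000000004", "ifsc": "SBIN0001234"},
    "A5": {"name": "V Nair", "address": "3 Lake View", "phone": "9000000005", "ifsc": "HDFC0000999"},
}
for ring in find_rings(applicants):
    print(ring_band(len(ring)), ring)
```

With NetworkX, the adjacency map and search would collapse to `nx.Graph` plus `nx.connected_components`; the thresholding and weighting logic stays the same.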

Banking impact: detects identity-recycling rings, address farms, mule-account networks — the patterns that cost banks ~₹3,000 crore/year (RBI Annual Report).

### 3.5 Tamper Forge Studio (`tampering.py`)

Adversarial validation: live UI to apply copy-move, splice, text-edit, compression, metadata-strip, custom-region, or chained tampering operations to a clean sample, then immediately re-run detection. Visual before/after with bounding boxes, per-detector scorecard, ELA + noise heatmap overlays. Doubles as a continuous test harness for the forensics layer.
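
The tamper-then-redetect loop can be sketched as follows. This is a hypothetical harness, not the code in `tampering.py`: `copy_move`, `apply_chain`, and `forge_test` are invented names, and `duplicate_block_score` is a toy stand-in for the real ORB-based copy-move detector that only catches block-aligned copies.

```python
from collections import Counter
from typing import Callable

import numpy as np

Box = tuple[int, int, int, int]  # (top, left, height, width)
TamperOp = Callable[[np.ndarray], tuple[np.ndarray, Box]]


def copy_move(image: np.ndarray, src: Box, dst_top: int, dst_left: int) -> tuple[np.ndarray, Box]:
    """Paste the `src` region at a new location; return the copy and the
    bounding box of the tampered area for the before/after overlay."""
    top, left, h, w = src
    out = image.copy()
    out[dst_top:dst_top + h, dst_left:dst_left + w] = image[top:top + h, left:left + w]
    return out, (dst_top, dst_left, h, w)


def apply_chain(image: np.ndarray, ops: list[TamperOp]) -> tuple[np.ndarray, list[Box]]:
    """Apply tampering operations in order, collecting their bounding boxes."""
    boxes = []
    for op in ops:
        image, box = op(image)
        boxes.append(box)
    return image, boxes


def duplicate_block_score(image: np.ndarray, block: int = 8) -> float:
    """Toy detector: fraction of aligned blocks whose pixels occur more than once."""
    h, w = image.shape
    keys = [image[r:r + block, c:c + block].tobytes()
            for r in range(0, h - block + 1, block)
            for c in range(0, w - block + 1, block)]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] > 1) / len(keys)


def forge_test(image: np.ndarray, ops: list[TamperOp],
               detector: Callable[[np.ndarray], float]) -> dict:
    """Score the clean sample, tamper it, and score it again."""
    tampered, boxes = apply_chain(image, ops)
    return {"before": detector(image), "after": detector(tampered), "boxes": boxes}


rng = np.random.default_rng(1)
sample = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
ops = [lambda im: copy_move(im, (8, 8, 16, 16), 40, 40)]
print(forge_test(sample, ops, duplicate_block_score))
```

Because each operation returns its own bounding box, the same structure would also serve as a regression test: any detector change that lowers the `after` score on a known tampering is a candidate regression.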

### 3.6 Provenance Ledger (`provenance.py`) — *new compliance feature*

Tamper-evident SHA-256 hash chain over every analysis:

```
record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)
```

- Stored in SQLite (single file, zero-deploy)
- `verify_chain()` walks every record in O(N) and pinpoints the first broken record
- Satisfies RBI Master Direction on KYC (2016), Para 67 record-retention requirements
- Downloadable as JSON for external auditors

Conceptually a baby blockchain: append-only, hash-linked, mathematically verifiable.
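
A minimal sketch of such a ledger, using only `hashlib` and `sqlite3`, might look like this. The field order and `|` separator follow the formula above; the table schema, the all-zero genesis hash, the four-decimal score formatting, and the exact signatures of `append_record` and `verify_chain` are assumptions and may not match `provenance.py`.

```python
import hashlib
import sqlite3

GENESIS = "0" * 64  # assumed prev_hash for the first record


def record_hash(timestamp: str, doc_sha256: str, risk_band: str,
                risk_score: float, prev_hash: str) -> str:
    """SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)."""
    payload = "|".join([timestamp, doc_sha256, risk_band, f"{risk_score:.4f}", prev_hash])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ledger ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, timestamp TEXT, doc_sha256 TEXT, "
        "risk_band TEXT, risk_score REAL, prev_hash TEXT, record_hash TEXT)"
    )
    return conn


def append_record(conn: sqlite3.Connection, timestamp: str, doc_sha256: str,
                  risk_band: str, risk_score: float) -> str:
    """Link the new record to the hash of the most recent one."""
    row = conn.execute("SELECT record_hash FROM ledger ORDER BY id DESC LIMIT 1").fetchone()
    prev = row[0] if row else GENESIS
    digest = record_hash(timestamp, doc_sha256, risk_band, risk_score, prev)
    conn.execute(
        "INSERT INTO ledger (timestamp, doc_sha256, risk_band, risk_score, prev_hash, record_hash) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (timestamp, doc_sha256, risk_band, risk_score, prev, digest),
    )
    conn.commit()
    return digest


def verify_chain(conn: sqlite3.Connection):
    """Walk the ledger once; return None if intact, else the id of the first broken record."""
    prev = GENESIS
    rows = conn.execute(
        "SELECT id, timestamp, doc_sha256, risk_band, risk_score, prev_hash, record_hash "
        "FROM ledger ORDER BY id"
    )
    for rid, ts, doc, band, score, prev_hash, stored in rows:
        if prev_hash != prev or record_hash(ts, doc, band, score, prev) != stored:
            return rid
        prev = stored
    return None


ledger = open_ledger()
append_record(ledger, "2025-01-01T00:00:00Z", "a" * 64, "LOW", 0.12)
append_record(ledger, "2025-01-01T00:01:00Z", "b" * 64, "HIGH", 0.61)
print(verify_chain(ledger))
```

Editing any stored field changes that record's recomputed hash, and rewriting its `record_hash` to match breaks the `prev_hash` link of the next record, so a silent edit must rewrite every later record as well.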

### 3.7 Audit report (`audit_report.py`)

Bank-letterhead PDF with:
- Metadata table (file, SHA-256, analysed timestamp)
- Risk verdict box (colour-coded by band)
- Sub-score table with ASCII bars
- Evidence bullets
- Embedded forensic heatmaps

### 3.8 Dashboard (`app.py`)

Six-tab Streamlit UI. Sample documents bundled (`sample_data/`) for instant demo.

---

## 4. Data assets

| Asset                                | Purpose                              | Volume |
|--------------------------------------|--------------------------------------|--------|
| AgamiAI Indian Bank Statements (HF)  | Real Indian bank statement PDFs      | 217    |
| IDRBT Cheque Image Dataset           | Cheque images, Indian banking format | 112    |
| CASIA v2                             | CNN training (forged/authentic)      | ~12 k  |
| `sample_data/` bundled               | Demo fixtures                        | 26     |

---

## 5. Ensemble fusion logic

```
sub_scores = {ela, copy_move, noise, exif, text_rules, ai_generated}
weights    = {ela:0.20, copy_move:0.25, noise:0.20, exif:0.15,
              text_rules:0.20}
              # ai_generated is a separate overlay, not in base weights

base_score    = sum(weights[k] * sub_scores[k] for k in weights)
score_with_rf = 0.5 * base_score + 0.5 * rf.predict_proba(features)
w              = clamp(cnn.val_auc, 0.4, 0.7)
score_with_cnn = (1 - w) * score_with_rf + w * cnn.predict(image)
final_score    = 0.9 * score_with_cnn + 0.1 * ai_gen_prob * 2.0
                 # AI-gen contribution capped at +0.20
```

Band mapping: `0-0.30 LOW · 0.30-0.50 MEDIUM · 0.50-0.75 HIGH · 0.75+ CRITICAL`
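
The fusion steps and band mapping above translate directly into a small pure-Python function. The weights and coefficients are taken from this section; the function names, treating the RF and CNN overlays as optional arguments, clipping the result to [0, 1], and assigning boundary values to the higher band are assumptions.

```python
from typing import Optional

BASE_WEIGHTS = {"ela": 0.20, "copy_move": 0.25, "noise": 0.20,
                "exif": 0.15, "text_rules": 0.20}


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def fuse(sub_scores: dict[str, float],
         rf_prob: Optional[float] = None,
         cnn_prob: Optional[float] = None,
         cnn_val_auc: float = 0.5,
         ai_gen_prob: float = 0.0) -> float:
    """Weighted blend, then RF, CNN, and AI-gen overlays in that order."""
    score = sum(BASE_WEIGHTS[k] * sub_scores[k] for k in BASE_WEIGHTS)
    if rf_prob is not None:
        score = 0.5 * score + 0.5 * rf_prob
    if cnn_prob is not None:
        w = clamp(cnn_val_auc, 0.4, 0.7)  # CNN weight tracks validation AUC
        score = (1 - w) * score + w * cnn_prob
    score = 0.9 * score + 0.1 * ai_gen_prob * 2.0  # AI-gen adds at most 0.20
    # Assumption: clip, because 0.9 * 1.0 + 0.20 would otherwise exceed 1.
    return clamp(score, 0.0, 1.0)


def band(score: float) -> str:
    """Assumption: a score exactly on a boundary falls in the higher band."""
    if score >= 0.75:
        return "CRITICAL"
    if score >= 0.50:
        return "HIGH"
    if score >= 0.30:
        return "MEDIUM"
    return "LOW"


sub_scores = {"ela": 0.8, "copy_move": 0.9, "noise": 0.6, "exif": 0.4, "text_rules": 0.7}
final = fuse(sub_scores, rf_prob=0.85, cnn_prob=0.9, cnn_val_auc=0.92, ai_gen_prob=0.1)
print(round(final, 3), band(final))
```

One consequence of these coefficients is that a document with a maximal AI-gen probability but otherwise clean signals scores 0.20, which stays in the LOW band; the overlay shifts a verdict rather than deciding it.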

---

## 6. Roadmap — what's next

The architecture above is **wired** for these extensions; they ship in subsequent rounds.

| Capability                            | Status  | Notes                                  |
|---------------------------------------|---------|----------------------------------------|
| FastAPI gateway + webhook alerts       | planned | Push to LOS / CRM on HIGH or CRITICAL  |
| Federated learning across banks        | planned | Flower (`flwr`); no raw data leaves    |
| LLM-based document reasoning           | planned | Local Phi-3 / Gemma over OCR text      |
| Real-time drift monitoring             | planned | Track per-detector confidence over time|
| Kubernetes deployment                  | planned | For multi-tenant bank hosting          |
| Multilingual OCR (Hindi / Bengali)     | planned | Tesseract + IndicOCR models            |

---

## 7. Mapping to Round-1 submission

| Round-1 idea                       | Round-2 realisation                                   |
|------------------------------------|-------------------------------------------------------|
| Image forensics (ELA, copy-move, noise, EXIF) | `forensics.py` — fully implemented + **AI-gen FFT detector** |
| PDF structural auditing            | `forensics.pdf_structural_audit` + `pdf_font_audit`  |
| OCR + financial validation         | `forensics.text_rule_checks` + IFSC/PAN/Aadhaar full validators |
| Random Forest risk scoring         | `forensics.predict_with_model` — trained on 11-d feature set |
| Real-time underwriter dashboard    | Streamlit app, 6 tabs, bank-letterhead PDF output    |
| CNN with MobileNetV2 (future)      | **Delivered** — fine-tuned on CASIA v2               |
| LLM reasoning (future)             | Roadmap (see § 6)                                     |
| API deployment (future)            | Roadmap — FastAPI gateway scaffolded                  |
| **NEW — Fraud Ring Network**       | Cross-applicant graph + clique discovery             |
| **NEW — Provenance ledger**        | SHA-256 hash chain, RBI Para 67 compliant            |
| **NEW — Tamper Forge Studio**      | Adversarial-validation harness                       |

The Round-1 pillars remain the visible centre of the system. The new
pillars extend each axis without breaking the original framing:
forensics gets AI-gen detection, scoring gets a cross-applicant view,
the dashboard gets a tamper-evident audit trail.

---

## 8. Repository layout

```
.
├── app.py                  Streamlit dashboard (6 tabs)
├── forensics.py            Core analysis pipeline
├── ai_detector.py          AI-generated content detector (FFT)
├── fraud_ring.py           Cross-applicant graph + clique detection
├── provenance.py           Tamper-evident SHA-256 hash chain
├── compliance.py           IFSC / PAN / Aadhaar / PII redaction
├── tampering.py            Adversarial harness for Forge Studio
├── audit_report.py         Bank-letterhead PDF builder
├── docsentry_master.ipynb  Notebook source of truth
├── models/                 RF + CNN model files
├── sample_data/            26 demo documents
├── requirements.txt        Python dependencies
├── packages.txt            apt-get packages (HF Spaces)
├── README.md               Reference + install guide
├── ARCHITECTURE.md         This document
└── LICENSE                 MIT + third-party notices
```

---

*This architecture document is the technical reference for DocSentry Round 2.
It accompanies the live demo at https://huggingface.co/spaces/SpandanM110/DocSentry
and the source code at https://github.com/SpandanM110/Doc-Sentry.*