# DocSentry: System Architecture

**Real-time document anomaly detection for Indian bank underwriting.**

DocSentry is the operational realisation of the Round-1 submission idea:
catch tampered, forged, and AI-generated documents at the moment of
upload, score them on a calibrated risk scale, and hand the underwriter
a defensible audit trail. Round-2 turns that idea into a robust,
production-grade platform.

---

## 1. Architectural principles

| Principle | What it means in DocSentry |
|---|---|
| **Defence in depth** | Six independent detection layers (rule, image, PDF, OCR, ML, AI-generated). No single bypass defeats the system. |
| **Explainability first** | Every verdict ships with sub-scores, evidence bullets, and visual heatmaps. Black-box outputs are unacceptable in regulated finance. |
| **Tamper-evident provenance** | Every analysis is appended to a SHA-256 hash chain. Retroactive edits are mathematically detectable. |
| **Portfolio-level vision** | Single-document forensics is necessary but insufficient. Real fraud is *organised*; the system reasons across applicants. |
| **Zero data egress** | All inference runs locally. No applicant PII leaves the bank's perimeter. |
| **RBI-aligned output** | Compliance reports follow Master Direction on KYC formatting so they can be filed directly. |

---

## 2. Layered architecture

```
                    ┌──────────────────────────────────────────┐
                    │     PRESENTATION (Streamlit / future PWA)│
                    │                                          │
                    │  Tab 1  Single-doc analysis              │
                    │  Tab 2  Cross-document KYC               │
                    │  Tab 3  Batch underwriter audit          │
                    │  Tab 4  RBI compliance + Provenance      │
                    │  Tab 5  Live Tamper Forge Studio         │
                    │  Tab 6  Fraud Ring Network               │
                    └────────────────────┬─────────────────────┘
                                         │
                    ┌────────────────────▼─────────────────────┐
                    │   API GATEWAY (planned FastAPI front)    │
                    │   /analyse  /verify  /forge-test         │
                    │   /compliance  /batch  /webhook          │
                    └────────────────────┬─────────────────────┘
                                         │
   ┌─────────────────────────────────────┼───────────────────────────────┐
   ▼                                     ▼                               ▼
┌─────────────────┐         ┌───────────────────────┐         ┌─────────────────────┐
│  INGESTION      │         │   FORENSICS CORE      │         │  COMPLIANCE CORE    │
│  - Direct upload│         │  Rule layer (ELA,     │         │  RBI IFSC lookup    │
│  - Watch folder │         │   copy-move, noise,   │         │  PAN entity check   │
│  - PDF / image  │         │   EXIF, PDF struct,   │         │  Aadhaar Verhoeff   │
│  - Future Kafka │         │   OCR rules)          │         │  PII redaction      │
└────────┬────────┘         │  RF classifier (11-d) │         │  DPDP-aligned       │
         │                  │  CNN (MobileNetV2)    │         └──────────┬──────────┘
         ▼                  │  AI-gen detector (FFT)│                    │
┌─────────────────┐         └───────────┬───────────┘                    │
│  PROVENANCE     │                     │                                │
│  SHA-256 chain  │                     ▼                                │
│  SQLite ledger  │           ┌───────────────────┐                      │
│  verify_chain() │           │  ENSEMBLE FUSION  │                      │
└────────┬────────┘           │  Weighted blend   │                      │
         │                    │  per sub-detector │                      │
         │                    └───────────┬───────┘                      │
         └──────────────────┬─────────────┴──────────────────────────────┘
                            ▼
              ┌──────────────────────────────────┐
              │   FRAUD RING DETECTOR            │
              │   NetworkX similarity graph      │
              │   Clique-based ring discovery    │
              │   Cross-applicant correlation    │
              └────────────────┬─────────────────┘
                               ▼
              ┌──────────────────────────────────┐
              │   RISK ORCHESTRATOR              │
              │   Score -> band -> action        │
              │   (LOW / MEDIUM / HIGH / CRIT)   │
              └────────────────┬─────────────────┘
                               ▼
              ┌──────────────────────────────────┐
              │   OUTPUT LAYER                   │
              │   - Streamlit dashboard          │
              │   - Bank-letterhead audit PDF    │
              │   - RBI compliance pack PDF      │
              │   - Audit JSON                   │
              │   - Webhook alerts (planned)     │
              │   - Provenance ledger entry      │
              └──────────────────────────────────┘
```

---

## 3. Component reference

### 3.1 Forensics core (`forensics.py`)

Six independent detectors blended via `analyse_document(path)`:

| Detector            | Method                                     | Sub-score key   |
|---------------------|--------------------------------------------|-----------------|
| Error Level Analysis| JPEG re-save diff                          | `ela`           |
| Copy-move           | ORB keypoints + cross-matching             | `copy_move`     |
| Noise inconsistency | Per-block Laplacian variance               | `noise`         |
| EXIF audit          | Metadata + software-tag fingerprint        | `exif`          |
| OCR + text rules    | Tesseract + IFSC/PAN/date/amount regex     | `text_rules`    |
| **AI-generated**    | **Radial FFT spectral analysis (new)**     | `ai_generated`  |

Optional ML overlays:
- **Random Forest** (`predict_with_model`): 11 features (4 forensics + 4 GLCM texture + 3 colour entropy).
- **MobileNetV2 CNN** (`predict_with_cnn`): fine-tuned on CASIA v2; weight grows with measured validation AUC.

Final score: `risk_score = weighted_blend(sub_scores) -> RF overlay -> CNN overlay -> AI-gen overlay`.

### 3.2 AI-generated detector (`ai_detector.py`)

The 2026 threat model is Sora / Midjourney / Stable Diffusion outputs, not Photoshop. This module catches them in the frequency domain:

| Signal                         | Detection                                     |
|--------------------------------|-----------------------------------------------|
| High-frequency suppression     | Ratio of low- to high-frequency FFT energy.   |
| Periodic spectral peaks        | Spike count in high-frequency band.           |
| JPEG quantization absence      | PIL `img.quantization` table inspection.      |

Blended into the main risk score with a +20% cap so it never dominates classical signals, but reliably surfaces synthetic media.

### 3.3 Cross-document KYC (`compliance.py`)

- IFSC validation against 36 RBI bank codes
- PAN entity-type character + Luhn-like structural check
- Aadhaar UIDAI Verhoeff checksum (rejects 0/1 prefix)
- PII redaction via PyMuPDF text-bbox overlays
- RBI-format compliance audit PDF (5 sections, ReportLab Platypus)
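
The identifier checks listed above can be illustrated with a minimal, standard-library sketch. It is not the code in `compliance.py`: the function names and the PAN and IFSC regular expressions are assumptions made for illustration, and the lookup against the 36 RBI bank codes is omitted. The Verhoeff tables are the standard published ones.

```python
import re

# Standard Verhoeff tables: dihedral-group multiplication, position permutation, inverse.
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
_INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]

# Assumed formats: PAN is 5 letters, 4 digits, 1 letter, with the 4th letter
# encoding the holder type; IFSC is 4 letters, a literal 0, then 6 alphanumerics.
PAN_RE = re.compile(r"[A-Z]{3}[ABCFGHJLPT][A-Z][0-9]{4}[A-Z]")
IFSC_RE = re.compile(r"[A-Z]{4}0[A-Z0-9]{6}")


def verhoeff_ok(number: str) -> bool:
    """Return True if a digit string (check digit last) passes Verhoeff."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0


def verhoeff_check_digit(payload: str) -> str:
    """Compute the Verhoeff check digit for a digit string without one."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = _D[c][_P[(i + 1) % 8][int(ch)]]
    return str(_INV[c])


def is_valid_aadhaar(value: str) -> bool:
    """12 digits, first digit 2-9 (rejects 0/1 prefix), valid Verhoeff checksum."""
    digits = value.replace(" ", "")
    if not re.fullmatch(r"[2-9][0-9]{11}", digits):
        return False
    return verhoeff_ok(digits)


def is_valid_pan(value: str) -> bool:
    """Format and entity-type character check only."""
    return PAN_RE.fullmatch(value.strip().upper()) is not None


def is_valid_ifsc(value: str) -> bool:
    """Format check only; the bank-code lookup is left out of this sketch."""
    return IFSC_RE.fullmatch(value.strip().upper()) is not None
```

Because Verhoeff detects every single-digit substitution, a mistyped Aadhaar digit fails `is_valid_aadhaar` even when the length and prefix look right.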

### 3.4 Fraud Ring Detector (`fraud_ring.py`): *new headline feature*

Single-document forensics misses **organised** fraud. This module fixes that.

**Pipeline:**

1. **Extract identity signals** from each applicant: name, DOB, address, phone, IFSC, account, employer (OCR + regex).
2. **Build a weighted similarity graph** (NetworkX). Edge weight is a sum of per-signal match weights, with field-specific fraud significance:
   - account number 0.25 (highest: same account = same person)
   - DOB 0.15, address 0.20, phone 0.20
   - name 0.10, IFSC 0.05, employer 0.05
3. **Detect rings** = connected components above a configurable similarity threshold, size ≥ 3.
4. **Score each ring**: CRITICAL (≥5 applicants), HIGH (3-4), MEDIUM (2).
5. **Visualise** as an interactive force-directed graph; ring members rendered in red with thick edges.

Banking impact: detects identity-recycling rings, address farms, and mule-account networks, the patterns that cost banks ~₹3,000 crore/year (RBI Annual Report).
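
Steps 2 to 4 of the pipeline can be sketched without any dependencies. The module itself builds its graph with NetworkX; the version below swaps in a small union-find to obtain the same connected components. The exact-equality field matching, the `0.40` threshold, and all function and field names are assumptions for illustration, while the weights and severity bands are the ones listed above.

```python
from itertools import combinations

# Per-field match weights as listed in the pipeline above.
FIELD_WEIGHTS = {
    "account": 0.25, "address": 0.20, "phone": 0.20, "dob": 0.15,
    "name": 0.10, "ifsc": 0.05, "employer": 0.05,
}


def similarity(a: dict, b: dict) -> float:
    """Sum the weights of fields that are present and identical in both records.

    Exact equality is an assumption; a real matcher would likely be fuzzy.
    """
    return sum(w for field, w in FIELD_WEIGHTS.items()
               if a.get(field) and a.get(field) == b.get(field))


def find_rings(applicants: dict, threshold: float = 0.40, min_size: int = 3) -> list:
    """Return connected components of the graph whose edges score >= threshold.

    The default threshold is illustrative; the source only says it is configurable.
    """
    parent = {key: key for key in applicants}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in combinations(applicants, 2):
        if similarity(applicants[u], applicants[v]) >= threshold:
            parent[find(u)] = find(v)

    groups = {}
    for key in applicants:
        groups.setdefault(find(key), []).append(key)
    return [sorted(g) for g in groups.values() if len(g) >= min_size]


def ring_severity(size: int) -> str:
    """Map ring size to the severity bands given in step 4."""
    if size >= 5:
        return "CRITICAL"
    if size >= 3:
        return "HIGH"
    return "MEDIUM"


# Hypothetical applicants: A1-A2 share account and phone, A2-A3 share
# phone, address and DOB, so all three land in one component.
sample = {
    "A1": {"account": "111", "phone": "999"},
    "A2": {"account": "111", "phone": "999", "address": "12 MG Road", "dob": "1990-01-01"},
    "A3": {"account": "222", "phone": "999", "address": "12 MG Road", "dob": "1990-01-01"},
    "A4": {"account": "444", "phone": "777", "address": "5 Park St"},
}
rings = find_rings(sample)
```

Note that A1 and A3 are not directly similar enough to share an edge; they join the same ring only through A2, which is the behaviour that distinguishes connected components from stricter clique-based grouping.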

### 3.5 Tamper Forge Studio (`tampering.py`)

Adversarial validation: live UI to apply copy-move, splice, text-edit, compression, metadata-strip, custom-region, or chained tampering operations to a clean sample, then immediately re-run detection. Visual before/after with bounding boxes, per-detector scorecard, ELA + noise heatmap overlays. Doubles as a continuous test harness for the forensics layer.

### 3.6 Provenance Ledger (`provenance.py`): *new compliance feature*

Tamper-evident SHA-256 hash chain over every analysis:

```
record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)
```

- Stored in SQLite (single file, zero-deploy)
- `verify_chain()` walks every record in O(N) and pinpoints the first broken record
- Satisfies RBI Master Direction on KYC (2016), Para 67 record-retention requirements
- Downloadable as JSON for external auditors

Conceptually a baby blockchain: append-only, hash-linked, mathematically verifiable.
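
A minimal sketch of such a ledger, using only `hashlib` and `sqlite3`, might look like the following. The table schema, the all-zero genesis hash, the four-decimal score formatting, and the function signatures are assumptions; only the hashed field order and the `verify_chain` name come from the description above.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

GENESIS = "0" * 64  # assumed prev_hash for the very first record


def _record_hash(ts, doc_sha256, band, score, prev_hash):
    # Field order follows the formula above; the "|" join and the
    # fixed-precision score formatting are assumptions.
    payload = "|".join([ts, doc_sha256, band, f"{score:.4f}", prev_hash])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def open_ledger(path=":memory:"):
    """Open (or create) the ledger database; the schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ledger ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, ts TEXT, doc_sha256 TEXT, "
        "risk_band TEXT, risk_score REAL, prev_hash TEXT, record_hash TEXT)"
    )
    return conn


def append_record(conn, doc_bytes, band, score):
    """Append one analysis result, linking it to the previous record's hash."""
    row = conn.execute(
        "SELECT record_hash FROM ledger ORDER BY id DESC LIMIT 1").fetchone()
    prev_hash = row[0] if row else GENESIS
    ts = datetime.now(timezone.utc).isoformat()
    doc_sha = hashlib.sha256(doc_bytes).hexdigest()
    rec = _record_hash(ts, doc_sha, band, score, prev_hash)
    conn.execute(
        "INSERT INTO ledger (ts, doc_sha256, risk_band, risk_score, prev_hash, record_hash) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (ts, doc_sha, band, score, prev_hash, rec),
    )
    conn.commit()
    return rec


def verify_chain(conn):
    """Walk the chain once; return None if intact, else the id of the first broken record."""
    expected_prev = GENESIS
    rows = conn.execute(
        "SELECT id, ts, doc_sha256, risk_band, risk_score, prev_hash, record_hash "
        "FROM ledger ORDER BY id")
    for rid, ts, doc_sha, band, score, prev_hash, rec in rows:
        if prev_hash != expected_prev:
            return rid
        if _record_hash(ts, doc_sha, band, score, prev_hash) != rec:
            return rid
        expected_prev = rec
    return None
```

Editing any stored field changes that record's recomputed hash, so `verify_chain` reports it; rewriting the stored hash as well would instead break the link held by the next record.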

### 3.7 Audit report (`audit_report.py`)

Bank-letterhead PDF with:
- Metadata table (file, SHA-256, analysed timestamp)
- Risk verdict box (colour-coded by band)
- Sub-score table with ASCII bars
- Evidence bullets
- Embedded forensic heatmaps

### 3.8 Dashboard (`app.py`)

Six-tab Streamlit UI. Sample documents bundled (`sample_data/`) for instant demo.

---

## 4. Data assets

| Asset                                | Purpose                              | Volume |
|--------------------------------------|--------------------------------------|--------|
| AgamiAI Indian Bank Statements (HF)  | Real Indian bank statement PDFs      | 217    |
| IDRBT Cheque Image Dataset           | Cheque images, Indian banking format | 112    |
| CASIA v2                             | CNN training (forged/authentic)      | ~12 k  |
| `sample_data/` bundled               | Demo fixtures                        | 26     |

---

## 5. Ensemble fusion logic

```
sub_scores = {ela, copy_move, noise, exif, text_rules, ai_generated}
weights    = {ela:0.20, copy_move:0.25, noise:0.20, exif:0.15,
              text_rules:0.20}
              # ai_generated is a separate overlay, not in base weights

base_score    = sum(weights[k] * sub_scores[k] for k in weights)
score_with_rf = 0.5 * base_score + 0.5 * rf.predict_proba(features)
score_with_cnn = (1-w) * score_with_rf + w * cnn.predict(image)
                 where w = clamp(cnn.val_auc, 0.4, 0.7)
final_score    = 0.9 * score_with_cnn + 0.1 * ai_gen_prob * 2.0
                 (AI-gen capped at +20%)
```

Band mapping: `0-0.30 LOW · 0.30-0.50 MEDIUM · 0.50-0.75 HIGH · 0.75+ CRITICAL`
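
The fusion steps and the band mapping translate into a few lines of Python. This is a sketch under stated assumptions rather than the code in `forensics.py`: the function names are hypothetical, the model outputs are passed in as plain probabilities, the final clamp to `[0, 1]` is added here because the AI-gen term can push the raw sum above 1, and each band is assumed to include its lower bound.

```python
# Base weights from the fusion logic above; ai_generated is applied as an overlay.
FUSION_WEIGHTS = {
    "ela": 0.20, "copy_move": 0.25, "noise": 0.20,
    "exif": 0.15, "text_rules": 0.20,
}


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def fuse(sub_scores: dict, rf_prob: float, cnn_prob: float,
         cnn_val_auc: float, ai_gen_prob: float) -> float:
    """Blend rule sub-scores with the RF, CNN and AI-gen overlays."""
    base = sum(w * sub_scores[k] for k, w in FUSION_WEIGHTS.items())
    with_rf = 0.5 * base + 0.5 * rf_prob
    w = clamp(cnn_val_auc, 0.4, 0.7)          # CNN weight tracks validation AUC
    with_cnn = (1 - w) * with_rf + w * cnn_prob
    final = 0.9 * with_cnn + 0.1 * ai_gen_prob * 2.0  # AI-gen adds at most 0.20
    return clamp(final, 0.0, 1.0)              # assumption: keep the score in [0, 1]


def risk_band(score: float) -> str:
    """Map a fused score to a band; lower bounds are assumed inclusive."""
    if score < 0.30:
        return "LOW"
    if score < 0.50:
        return "MEDIUM"
    if score < 0.75:
        return "HIGH"
    return "CRITICAL"
```

As a worked example, with every sub-score, the RF probability, and the CNN probability at 0.5 and no AI-gen signal, the blend stays at 0.5 through each stage and the final step scales it to 0.45, which falls in the MEDIUM band.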

---

## 6. Roadmap: what's next

The architecture above is **wired** for these extensions; they ship in subsequent rounds.

| Capability                            | Status  | Notes                                  |
|---------------------------------------|---------|----------------------------------------|
| FastAPI gateway + webhook alerts       | planned | Push to LOS / CRM on HIGH or CRITICAL  |
| Federated learning across banks        | planned | Flower (`flwr`); no raw data leaves    |
| LLM-based document reasoning           | planned | Local Phi-3 / Gemma over OCR text      |
| Real-time drift monitoring             | planned | Track per-detector confidence over time|
| Kubernetes deployment                  | planned | For multi-tenant bank hosting          |
| Multilingual OCR (Hindi / Bengali)     | planned | Tesseract + IndicOCR models            |

---

## 7. Mapping to Round-1 submission

| Round-1 idea                       | Round-2 realisation                                   |
|------------------------------------|-------------------------------------------------------|
| Image forensics (ELA, copy-move, noise, EXIF) | `forensics.py`: fully implemented + **AI-gen FFT detector** |
| PDF structural auditing            | `forensics.pdf_structural_audit` + `pdf_font_audit`  |
| OCR + financial validation         | `forensics.text_rule_checks` + IFSC/PAN/Aadhaar full validators |
| Random Forest risk scoring         | `forensics.predict_with_model`: trained on 11-d feature set |
| Real-time underwriter dashboard    | Streamlit app, 6 tabs, bank-letterhead PDF output    |
| CNN with MobileNetV2 (future)      | **Delivered**: fine-tuned on CASIA v2                |
| LLM reasoning (future)             | Roadmap (see § 6)                                     |
| API deployment (future)            | Roadmap: FastAPI gateway scaffolded                   |
| **NEW: Fraud Ring Network**        | Cross-applicant graph + clique discovery             |
| **NEW: Provenance ledger**         | SHA-256 hash chain, RBI Para 67 compliant            |
| **NEW: Tamper Forge Studio**       | Adversarial-validation harness                       |

The Round-1 pillars remain the visible centre of the system. The new
pillars extend each axis without breaking the original framing:
forensics gets AI-gen detection, scoring gets a cross-applicant view,
the dashboard gets a tamper-evident audit trail.

---

## 8. Repository layout

```
.
├── app.py                  Streamlit dashboard (6 tabs)
├── forensics.py            Core analysis pipeline
├── ai_detector.py          AI-generated content detector (FFT)
├── fraud_ring.py           Cross-applicant graph + clique detection
├── provenance.py           Tamper-evident SHA-256 hash chain
├── compliance.py           IFSC / PAN / Aadhaar / PII redaction
├── tampering.py            Adversarial harness for Forge Studio
├── audit_report.py         Bank-letterhead PDF builder
├── docsentry_master.ipynb  Notebook source of truth
├── models/                 RF + CNN model files
├── sample_data/            26 demo documents
├── requirements.txt        Python dependencies
├── packages.txt            apt-get packages (HF Spaces)
├── README.md               Reference + install guide
├── ARCHITECTURE.md         This document
└── LICENSE                 MIT + third-party notices
```

---

*This architecture document is the technical reference for DocSentry Round 2.
It accompanies the live demo at https://huggingface.co/spaces/SpandanM110/DocSentry
and the source code at https://github.com/SpandanM110/Doc-Sentry.*