Committed by SpandanM110
Commit b2ca303 · 1 parent: e97f963

Round 2: README + ignore pitch artefacts and runtime ledger

Files changed (2)
  1. .gitignore +13 -0
  2. README.md +178 -137
.gitignore CHANGED
@@ -34,3 +34,16 @@ Thumbs.db
34
 
35
  # Archive (the old notebooks we moved out)
36
  archive/
37
+
38
+ # Hackathon pitch artefacts - keep local, don't commit
39
+ BankShield_Pitch.pptx
40
+ SUBMISSION.md
41
+ *.pptx
42
+
43
+ # Runtime state - SQLite ledger fills up at runtime, never commit
44
+ provenance.db
45
+ provenance.db-journal
46
+ *.db
47
+ *.db-journal
48
+ *.sqlite
49
+ *.sqlite3
README.md CHANGED
@@ -8,15 +8,35 @@ sdk_version: 1.32.0
8
  app_file: app.py
9
  pinned: false
10
  license: mit
11
- short_description: Document forensics + KYC compliance for bank underwriting
12
  ---
13
 
14
- # DocSentry
15
 
16
- Document forensics and KYC compliance pipeline for bank underwriting workflows. Detects tampering and forgery in land records, legal documents, financial statements, and cheques. Validates KYC fields against RBI rules. Produces explainable risk scores and regulator-ready audit reports.
17
 
18
- 100% open-source. No paid APIs. No LLM calls. CPU-only by default.
19
- <img width="1915" height="709" alt="image" src="https://github.com/user-attachments/assets/4567694f-b07e-4367-afa6-174069e2e48f" />
20
 
21
  ---
22
 
@@ -24,14 +44,18 @@ Document forensics and KYC compliance pipeline for bank underwriting workflows.
24
 
25
  ```
26
  Doc-Sentry/
27
- ├── app.py                    Streamlit web UI (4 tabs)
28
- ├── forensics.py              Core detection engine
29
- ├── audit_report.py           Bank-letterhead PDF report builder
30
  ├── compliance.py             KYC validators, PII redaction, RBI report builder
31
  ├── docsentry_master.ipynb    Single source-of-truth Jupyter notebook
32
  │
33
  ├── requirements.txt          Python dependencies
34
- ├── packages.txt              System packages (Tesseract) for Streamlit Cloud
35
  ├── .streamlit/config.toml    Streamlit theme + server config
36
  │
37
  ├── sample_data/              26 demo files for the live app
@@ -40,9 +64,13 @@ Doc-Sentry/
40
  │   └── pdfs/                 2 PDFs (1 genuine, 1 tampered)
41
  │
42
  ├── models/                   Trained model artefacts
43
- │   └── forgery_rf.joblib     Random Forest classifier
44
  │
45
- ├── README.md DEPLOY.md RUN_APP.md DATASETS.md PUSH.md LICENSE
46
  └── data/                     (gitignored) full training data + downloaded datasets
47
  ```
48
 
@@ -56,166 +84,182 @@ The core analytical module. Stateless functions; all logic is independently test
56
 
57
  | Function | Returns | Description |
58
  |---|---|---|
59
- | `analyse_document(path)` | dict | End-to-end pipeline. Auto-detects type (image vs PDF), runs all relevant detectors, blends Random Forest + CNN predictions when available. Primary entry point. |
60
  | `score_image(path)` | (float, dict, list) | Composite forensic score for an image. Returns total, sub-scores by detector, and EXIF flags. |
61
- | `error_level_analysis(path, quality=90)` | (PIL.Image, float) | ELA visualisation + scalar suspicion score. Re-saves at given JPEG quality; tampered regions diverge from the rest of the image. |
62
- | `copy_move_detect(path)` | (np.ndarray, int, list) | Detects regions duplicated within the same image using ORB keypoint matching. Returns annotated visualisation, match count, and raw matches. |
63
- | `noise_inconsistency(path, block=32)` | (np.ndarray, float) | Per-block Laplacian variance. Returns a heatmap of outlier blocks and a normalised ratio. Useful for splicing detection. |
64
- | `exif_sanity(path)` | list of str | EXIF metadata audit. Flags missing EXIF, photo-editor signatures (Photoshop/GIMP/Snapseed), and timestamp inconsistencies. |
65
- | `pdf_structural_audit(path)` | dict | Counts `%%EOF` markers (incremental edits), compares producer vs creator, flags consumer-tool fingerprints (iLovePDF, Smallpdf, etc.). |
66
- | `pdf_font_audit(path)` | dict | Lists embedded fonts and flags unusually high font counts (a signal of inserted text). |
67
- | `ocr_text(path)` | str | Tesseract OCR with auto-fallback. Returns empty string if Tesseract isn't installed. |
68
- | `text_rule_checks(text)` | dict | Validates date monotonicity, amount sanity, IFSC format, account number patterns. |
69
- | `extract_features(path)` | dict | Feature vector for the Random Forest: 11 features (ELA, copy-move count, noise ratio, EXIF flag, 4 GLCM texture features, 3 colour histogram entropies). |
70
- | `predict_with_model(path)` | dict or None | Loads `models/forgery_rf.joblib` and returns tamper probability + verdict. None if model isn't present. |
71
- | `predict_with_cnn(path)` | dict or None | Lazy-loads `models/forgery_cnn.keras` (TensorFlow). None if model isn't present, so the app starts fast without TF. |
72
- | `extract_identity_fields(path)` | (dict, str) | Pulls name, DOB, address, account number, IFSC, and amounts from any document. |
73
- | `cross_doc_consistency(paths)` | dict | Compares identity fields across 2+ documents using `difflib.SequenceMatcher`. Returns per-field match status and an aggregate consistency risk. |
74
- | `generate_insights(score, sub, flags)` | dict | Converts numeric sub-scores into underwriter-readable bullets, risk band, and recommended action. |
75
- | `band(score)` | str | Maps a float to LOW / MEDIUM / HIGH / CRITICAL. Boundaries at 0.25, 0.50, 0.75. |
76
-
77
- Constants of interest: `WEIGHTS`, `INSIGHT_RULES`, `ACTIONS`, `MODEL_PATH`, `CNN_MODEL_PATH`, `TESSERACT_OK`.
78
-
79
- ### `app.py`: Streamlit UI
80
-
81
- Four-tab web app. Imports `forensics`, `compliance`, and `audit_report`.
82
 
83
- | Tab | Function |
84
  |---|---|
85
- | Single-document analysis | Drag-drop or pick a sample; shows risk band, sub-score breakdown, evidence list, ELA / copy-move / noise visualisations, PDF audit details, ML/CNN predictions, downloadable JSON + PDF reports. |
86
- | Cross-document check | Upload 2–4 documents for one applicant; the system extracts identity fields and shows a coloured comparison table with similarity scores. |
87
- | Batch audit | Point at a folder; scans every supported file and produces a sortable risk table + CSV. |
88
- | Compliance & Audit Pack | Three sub-tabs: KYC field validation (manual or doc-extracted), PII auto-redaction (PDF + text), RBI-style compliance report generation. |
 
89
 
90
- The sample picker auto-populates from `sample_data/`; useful for the deployed demo where users can't browse the local filesystem.
91
 
92
- ### `audit_report.py`: bank-letterhead PDF
93
 
94
- Single public function: `build_pdf_report(report, source_path)` → `bytes`.
95
 
96
- Generates a multi-page PDF with header letterhead, metadata table, coloured risk-verdict box, sub-score breakdown table (with ASCII bar chart), evidence list, embedded heatmaps for image documents, structural audit details for PDFs, ML model verdict block, and footer disclaimer. Uses ReportLab Platypus.
 
 
97
 
98
  ### `compliance.py`: KYC + regulatory
99
 
100
  | Function | Description |
101
  |---|---|
102
- | `validate_ifsc(code)` | Format check (`^[A-Z]{4}0[A-Z0-9]{6}$`) + lookup against an embedded RBI bank-code list (~36 major Indian banks). Returns bank name and branch code on success. |
103
- | `validate_pan(code)` | Format check (`^[A-Z]{5}\d{4}[A-Z]$`) + entity-type character validation (P=Individual, F=Firm, C=Company, etc.). |
104
- | `validate_aadhaar(num)` | 12-digit format + UIDAI Verhoeff checksum verification. Aadhaar numbers cannot start with 0 or 1 per UIDAI spec. |
105
- | `redact_text(text)` | Masks IFSC, PAN, Aadhaar, and account numbers in arbitrary text. |
106
- | `redact_pdf(input_path, output_path)` | Renders each PDF page, locates PII bounding boxes via `page.search_for`, overlays opaque black rectangles. |
107
- | `extract_pii_fields(path)` | Pulls all PII candidates from any document (PDF or image via OCR). |
108
- | `build_compliance_report(forensic_report, source_path, kyc_results)` | Generates a 5-section regulator-ready PDF: document ID + SHA-256, KYC verification table, fraud-screening verdict, recommended RBI risk treatment, auditor sign-off block. References specific RBI Master Directions. |
109
-
110
- ### `docsentry_master.ipynb`
111
-
112
- Single source of truth. Sections:
113
-
114
- 1. Environment auto-detection (Colab vs local)
115
- 2. Datasets (synthetic generator + Kaggle CASIA v2 hook + manual download references)
116
- 3. Image forensics
117
- 4. PDF forensics
118
- 5. OCR + text rules
119
- 6. Random Forest training + saving
120
- 7. (Optional) CNN training on Colab GPU
121
- 8. End-to-end pipeline
122
- 9. Cross-document consistency
123
- 10. Dashboard + batch audit
124
- 11. PDF report generator
125
- 12. Export cell: writes `forensics.py`, `app.py`, `audit_report.py` to disk for the Streamlit demo
126
- 13. Launch instructions
127
-
128
- Edit the notebook, re-run section 12, and the `.py` files used by Streamlit regenerate automatically.
129
 
130
  ---
131
 
132
  ## Pipeline architecture
133
 
134
  ```
135
-                  ┌─────────────────────────┐
136
-                  │ Document (PNG/PDF/JPG)  │
137
-                  └────────────┬────────────┘
138
-                               │
139
-          ┌─────────────┬──────┴──────┬─────────────┐
140
-          ▼             ▼             ▼             ▼
141
-     ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
142
-     │ ELA     │   │ Copy-   │   │ Noise   │   │ EXIF    │
143
-     │ analysis│   │ move    │   │ heatmap │   │ audit   │
144
-     └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘
145
-          └────┬────────┴─────────────┴──────┬──────┘
146
-               │ (images)                    │
147
-               │                             ▼
148
-               │                  ┌─────────────────────┐
149
-               │                  │ OCR + text rules    │
150
-               │                  │ dates · IFSC · math │
151
-               │                  └──────────┬──────────┘
152
-               ▼                             │
153
- ┌───────────────────────────┐               │
154
- │ Feature vector (11-dim)   │               │
155
- └─────────────┬─────────────┘               │
156
-               ▼                             │
157
- ┌───────────────────────────┐               │
158
- │ Random Forest classifier  │               │
159
- │ (forgery_rf.joblib)       │               │
160
- └─────────────┬─────────────┘               │
161
-               │                             │
162
-               ▼                             │
163
- ┌───────────────────────────┐               │
164
- │ (Optional) CNN inference  │               │
165
- │ MobileNetV2 fine-tuned    │               │
166
- │ (forgery_cnn.keras)       │               │
167
- └─────────────┬─────────────┘               │
168
-               │                             │
169
-               └──────────────┬──────────────┘
170
-                              ▼
171
-               ┌─────────────────────────────┐
172
-               │ Weighted ensemble scorer    │
173
-               │ (rule + RF + CNN blend)     │
174
-               └──────────────┬──────────────┘
175
-                              ▼
176
-               ┌─────────────────────────────┐
177
-               │ Risk band + Evidence list   │
178
-               │ Recommended action          │
179
-               │ Audit JSON + PDF report     │
180
-               └─────────────────────────────┘
181
  ```
182
 
183
- Each detector outputs a sub-score in `[0, 1]`. The default weight vector (in `forensics.WEIGHTS`) is `{ela: 0.20, copy_move: 0.25, noise: 0.15, exif: 0.10, pdf_struct: 0.15, text_rules: 0.10, math: 0.05}`. The Random Forest probability, when available, is blended 50/50 with the rule-based score. The CNN probability, when available, is blended at a weight between 0.4 and 0.7 depending on the CNN's reported validation AUC.
 
 
 
 
184
 
185
  ---
186
 
187
  ## Detection coverage
188
 
189
  **Image tampering**
 
190
  - Copy-move forgery: ORB keypoint matching with distance filter
191
  - Image splicing: block-wise noise inconsistency via Laplacian variance
192
  - Text edits / amount tampering: Error Level Analysis
193
  - Photoshop / GIMP / Snapseed edits: EXIF Software-tag string match
194
  - Timestamp inconsistencies: DateTime vs DateTimeOriginal comparison
195
 
 
 
 
 
 
 
 
196
  **PDF tampering**
 
197
  - Incremental edits: multi-`%%EOF` marker counting
198
- - Consumer-tool fingerprints: iLovePDF, Smallpdf, PDFescape, Sejda, Foxit Phantom strings in producer/creator
199
  - Producer/Creator mismatch: flags re-processed PDFs
200
- - Inserted text: embedded font count anomalies
201
 
202
- **Text-level**
203
- - Date sequence violations: monotonic check on extracted dates
204
- - Round-number anomalies: counts large amounts that are exact multiples of ₹1 lakh
205
- - Missing IFSC with account number present: invalid bank document
206
 
207
- **Cross-document**
208
  - Name / DOB / address fuzzy match across multiple documents
209
- - Per-field similarity scoring with green/yellow/red status
 
 
210
 
211
  **KYC validation**
 
212
  - IFSC: format + RBI bank-code list (36 banks)
213
  - PAN: format + entity-type character (10 types per income-tax dept spec)
214
  - Aadhaar: 12-digit format + UIDAI Verhoeff checksum
215
 
216
- **PII redaction**
 
217
  - Aadhaar, PAN, IFSC, account-number masking
218
- - PDF redaction with black rectangle overlays via `page.search_for` bounding boxes
 
219
 
220
  ---
221
 
@@ -231,37 +275,34 @@ streamlit run app.py
231
  Browser opens at `http://localhost:8501`.
232
 
233
  For full OCR text-rule support, install Tesseract OCR:
 
234
  - Windows: https://github.com/UB-Mannheim/tesseract/wiki
235
  - macOS: `brew install tesseract`
236
- - Linux: `sudo apt-get install tesseract-ocr`
237
 
238
  The app auto-detects Tesseract on standard Windows install paths; no environment variable required.
239
 
240
- See `RUN_APP.md` for a more detailed walkthrough.
241
-
242
  ---
243
 
244
  ## Deployment
245
 
246
- Push to GitHub, connect at https://share.streamlit.io, point at `app.py`, deploy. The `packages.txt` ensures Tesseract is installed on the Streamlit Cloud VM; `requirements.txt` covers Python dependencies.
247
 
248
- See `DEPLOY.md` for step-by-step instructions including troubleshooting.
249
 
250
  ---
251
 
252
  ## Training your own model
253
 
254
- Drop labelled data into `data/images/originals/` and `data/images/tampered/`, open `docsentry_master.ipynb`, run section 6. A Random Forest auto-trains on whatever you put there and saves to `models/forgery_rf.joblib`. The Streamlit app picks it up automatically on next restart, with no code changes.
255
-
256
- For a CNN upgrade, set `TRAIN_CNN = True` in section 7 and run on a Colab T4 GPU (free tier). Saves `models/forgery_cnn.keras` + `models/forgery_cnn.meta.json`. The app loads this lazily on first request.
257
 
258
- See `DATASETS.md` for public datasets you can use.
259
 
260
  ---
261
 
262
  ## Dependencies
263
 
264
- OpenCV (cv2), Pillow (PIL), scikit-image, scikit-learn, joblib, PyMuPDF (fitz), pdfplumber, pikepdf, pytesseract, python-dateutil, Streamlit, ReportLab, NumPy, pandas, matplotlib. Optional: TensorFlow (only required for the CNN path).
265
 
266
  All pip-installable. No GPU required for the default pipeline.
267
 
@@ -269,7 +310,7 @@ All pip-installable. No GPU required for the default pipeline.
269
 
270
  ## License
271
 
272
- MIT; see `LICENSE`. The MIT license covers the source code in this repository. Third-party datasets and pretrained models bundled or referenced (CASIA v2, IDRBT cheque dataset, AgamiAI Indian Bank Statements, MobileNetV2 ImageNet weights, etc.) are governed by their own terms; those notices are reproduced in `LICENSE` below the MIT block.
273
 
274
  ---
275
 
 
8
  app_file: app.py
9
  pinned: false
10
  license: mit
11
+ short_description: BankShield - document forensics + fraud-ring detection for Indian bank underwriting
12
  ---
13
 
14
+ # BankShield
15
 
16
+ **Real-Time Document Forensics, AI-Generated Forgery Detection, and Cross-Applicant Fraud-Ring Intelligence for Indian Bank Underwriting.**
17
 
18
+ BankShield flags tampered, forged, and AI-generated documents the moment they reach the underwriter, and surfaces organised fraud rings that span multiple applicants. Six independent detection layers fuse into a single calibrated risk score, with explainable evidence, tamper-evident audit trails, and RBI-format compliance reports out of the box.
19
+
20
+ 100% open source. No paid APIs. No external LLM calls. CPU-only by default. Runs locally inside the bank's perimeter, so PII never leaves it.
21
+
22
+ - **Live demo:** https://huggingface.co/spaces/SpandanM110/DocSentry
23
+ - **Source:** https://github.com/SpandanM110/Doc-Sentry
24
+ - **Architecture reference:** see [`ARCHITECTURE.md`](ARCHITECTURE.md)
25
+
26
+ ---
27
+
28
+ ## The six pillars
29
+
30
+ | Pillar | Module | What it does |
31
+ |---|---|---|
32
+ | **Image Forensics** | `forensics.py` | ELA, copy-move (ORB), Laplacian noise inconsistency, EXIF audit |
33
+ | **PDF Structural Audit** | `forensics.py` | EOF marker counting, producer/creator drift, embedded-font anomalies, consumer-tool fingerprints |
34
+ | **OCR + Financial Rules** | `forensics.py` | Tesseract OCR + IFSC / PAN / Aadhaar / date monotonicity / amount sanity |
35
+ | **AI-Generated Detection** *(new)* | `ai_detector.py` | Radial FFT spectral analysis; targets Sora / Midjourney / Stable Diffusion outputs |
36
+ | **Fraud Ring Network** *(new)* | `fraud_ring.py` | NetworkX similarity graph across applicants; clique discovery flags organised fraud rings |
37
+ | **Provenance Ledger** *(new)* | `provenance.py` | SHA-256 hash chain over every analysis; O(N) verifiable; RBI Para 67 compliant |
38
+
39
+ Plus the **Live Tamper Forge Studio** (`tampering.py`), an adversarial-validation harness built directly into the dashboard.
40
 
41
  ---
42
 
 
44
 
45
  ```
46
  Doc-Sentry/
47
+ ├── app.py                    Streamlit web UI (6 tabs)
48
+ ├── forensics.py              Core detection engine + ensemble fusion
49
+ ├── ai_detector.py            AI-generated forgery detector (FFT spectral)
50
+ ├── fraud_ring.py             Cross-applicant similarity graph + clique detection
51
+ ├── provenance.py             Tamper-evident SHA-256 hash chain
52
+ ├── tampering.py              Forge Studio adversarial harness
53
  ├── compliance.py             KYC validators, PII redaction, RBI report builder
54
+ ├── audit_report.py           Bank-letterhead PDF report builder
55
  ├── docsentry_master.ipynb    Single source-of-truth Jupyter notebook
56
  │
57
  ├── requirements.txt          Python dependencies
58
+ ├── packages.txt              System packages (Tesseract) for Streamlit Cloud / HF Spaces
59
  ├── .streamlit/config.toml    Streamlit theme + server config
60
  │
61
  ├── sample_data/              26 demo files for the live app
 
64
  │   └── pdfs/                 2 PDFs (1 genuine, 1 tampered)
65
  │
66
  ├── models/                   Trained model artefacts
67
+ │   ├── forgery_rf.joblib     Random Forest classifier
68
+ │   └── forgery_cnn.keras     MobileNetV2 fine-tuned on CASIA v2 (optional)
69
  │
70
+ ├── ARCHITECTURE.md           Full architecture reference
71
+ ├── SUBMISSION.md             Hackathon submission packet
72
+ ├── BankShield_Pitch.pptx     Pitch deck (15 slides)
73
+ ├── README.md LICENSE
74
  └── data/                     (gitignored) full training data + downloaded datasets
75
  ```
76
 
 
84
 
85
  | Function | Returns | Description |
86
  |---|---|---|
87
+ | `analyse_document(path)` | dict | End-to-end pipeline. Auto-detects type, runs all relevant detectors, blends Random Forest + CNN + AI-gen predictions, auto-logs to provenance ledger. Primary entry point. |
88
  | `score_image(path)` | (float, dict, list) | Composite forensic score for an image. Returns total, sub-scores by detector, and EXIF flags. |
89
+ | `error_level_analysis(path, quality=90)` | (PIL.Image, float) | ELA visualisation + scalar suspicion score. |
90
+ | `copy_move_detect(path)` | (np.ndarray, int, list) | ORB-based copy-move detection. Returns annotated viz, match count, raw matches. |
91
+ | `noise_inconsistency(path, block=32)` | (np.ndarray, float) | Per-block Laplacian variance heatmap + outlier ratio. |
92
+ | `exif_sanity(path)` | list of str | EXIF audit: missing EXIF, editor signatures, timestamp inconsistencies. |
93
+ | `pdf_structural_audit(path)` | dict | `%%EOF` markers, producer/creator drift, consumer-tool fingerprints. |
94
+ | `pdf_font_audit(path)` | dict | Embedded font listing + count anomalies. |
95
+ | `ocr_text(path)` | str | Tesseract OCR with auto-fallback. |
96
+ | `text_rule_checks(text)` | dict | Date monotonicity, amount sanity, IFSC format, account-number patterns. |
97
+ | `extract_features(path)` | dict | 11-feature vector for the Random Forest. |
98
+ | `predict_with_model(path)` | dict / None | Random Forest tamper probability + verdict. |
99
+ | `predict_with_cnn(path)` | dict / None | MobileNetV2 CNN inference (lazy-loaded). |
100
+ | `extract_identity_fields(path)` | (dict, str) | Pulls name, DOB, address, IFSC, account, amounts. |
101
+ | `cross_doc_consistency(paths)` | dict | Per-field similarity across 2+ documents. |
102
+ | `generate_insights(score, sub, flags)` | dict | Numeric sub-scores → underwriter-readable bullets + recommended action. |
103
+ | `band(score)` | str | Maps a float to LOW / MEDIUM / HIGH / CRITICAL. |
104
+
105
+ ### `ai_detector.py`: AI-generated forgery detection
 
 
 
 
106
 
107
+ | Function | Description |
108
  |---|---|
109
+ | `detect_ai_generated(path)` | Full pipeline → probability + verdict + flags + FFT profile. |
110
+ | `radial_fft_profile(gray)` | Radially-averaged log-magnitude FFT spectrum. |
111
+ | `high_freq_attenuation(profile)` | Smoothness score: low for real scans, high for AI outputs. |
112
+ | `spectral_peak_score(profile)` | Counts checkerboard-stride peaks in the high-frequency band. |
113
+ | `jpeg_quantization_check(path)` | Inspects JPEG quantization tables for synthetic-media signatures. |
114
 
115
+ The AI-gen probability is blended into the main risk score as an overlay capped at +20%, so that AI-gen signals can surface synthetic media without dominating the classical detectors.
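To make the spectral idea behind `radial_fft_profile` and `high_freq_attenuation` concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the code in `ai_detector.py`: the bin count, the quarter-band split and the helper name `attenuation_score` are hypothetical, and the two test images are synthetic stand-ins rather than real scans or generator outputs.

```python
import numpy as np


def radial_fft_profile(gray: np.ndarray, n_bins: int = 32) -> np.ndarray:
    """Radially averaged log-magnitude spectrum of a 2-D grayscale array."""
    spectrum = np.fft.fftshift(np.fft.fft2(gray.astype(np.float64)))
    log_mag = np.log1p(np.abs(spectrum))

    h, w = gray.shape
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)

    # Map each pixel to a radial bin; corners beyond the inscribed circle
    # fall into an overflow bin (index n_bins) that is discarded below.
    r_max = min(h, w) / 2.0
    bin_idx = np.minimum((radius / r_max * n_bins).astype(int), n_bins)
    sums = np.bincount(bin_idx.ravel(), weights=log_mag.ravel(), minlength=n_bins + 1)
    counts = np.bincount(bin_idx.ravel(), minlength=n_bins + 1)
    return sums[:n_bins] / np.maximum(counts[:n_bins], 1)


def attenuation_score(profile: np.ndarray) -> float:
    """Hypothetical score: 1 - (high-band mean / low-band mean), clipped to [0, 1]."""
    quarter = max(1, len(profile) // 4)
    low, high = profile[:quarter].mean(), profile[-quarter:].mean()
    if low <= 0:
        return 0.0
    return float(np.clip(1.0 - high / low, 0.0, 1.0))


# Synthetic stand-ins: white noise keeps its high frequencies,
# a Gaussian blob has almost none.
rng = np.random.default_rng(0)
noisy = rng.normal(size=(128, 128))
yy, xx = np.indices((128, 128))
smooth = np.exp(-((yy - 64) ** 2 + (xx - 64) ** 2) / (2 * 10.0 ** 2))

noisy_score = attenuation_score(radial_fft_profile(noisy))
smooth_score = attenuation_score(radial_fft_profile(smooth))
```

In this toy setup the over-smooth image scores higher than the noisy one, which matches the direction given in the table (low for real scans, high for AI outputs); real thresholds would have to be tuned on actual data.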
116
 
117
+ ### `fraud_ring.py`: cross-applicant fraud-ring detection
118
+
119
+ | Function | Description |
120
+ |---|---|
121
+ | `extract_applicant_fields(path)` | OCR + regex pull of name / DOB / address / phone / IFSC / account / employer. |
122
+ | `compare_applicants(a, b)` | Per-field similarity + weighted score. |
123
+ | `build_fraud_graph(applicants)` | NetworkX similarity graph (edges weighted by shared signals). |
124
+ | `detect_rings(G, min_size=3)` | Connected components above threshold → suspected fraud rings. |
125
+ | `visualize_graph(G, rings)` | Force-directed graph with ring members in red. |
126
+ | `fraud_summary(G, rings, applicants)` | Structured summary for the Streamlit UI. |
127
+
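As a rough illustration of the graph step described in the table above, the sketch below links applicants whose weighted field similarity crosses a threshold and reports connected components of at least `min_size` members. It is not the code in `fraud_ring.py`: it uses `difflib.SequenceMatcher` and a plain adjacency dict so that it runs on the standard library alone (with NetworkX, `nx.Graph` and `nx.connected_components` would play the same role), and the field weights, the 0.8 threshold and the `_sketch` names are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical weights; the README does not state the real ones.
FIELD_WEIGHTS = {"address": 0.4, "phone": 0.3, "employer": 0.3}


def compare_applicants_sketch(a: dict, b: dict) -> float:
    """Weighted fuzzy similarity over the fields both applicants provide."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        left, right = a.get(field, ""), b.get(field, "")
        if left and right:
            score += weight * SequenceMatcher(None, left.lower(), right.lower()).ratio()
    return score


def detect_rings_sketch(applicants: dict, threshold: float = 0.8, min_size: int = 3) -> list:
    """Connected components of the similarity graph with >= min_size members."""
    adjacency = {name: set() for name in applicants}
    for u, v in combinations(applicants, 2):
        if compare_applicants_sketch(applicants[u], applicants[v]) >= threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen, rings = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) >= min_size:
            rings.append(sorted(component))
    return rings


shared = {"address": "12 MG Road, Bengaluru", "phone": "9876543210", "employer": "Acme Textiles"}
applicants = {
    "A101": dict(shared),
    "A102": dict(shared),
    "A103": dict(shared),
    "A104": {"address": "Plot 77, Sector 5, Noida", "phone": "7011223344", "employer": "Zenith Exports"},
}
rings = detect_rings_sketch(applicants)
```

Here three applicants share an address, phone number and employer and end up in one component, while the fourth stays isolated.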
128
+ ### `provenance.py`: tamper-evident audit ledger
129
+
130
+ | Function | Description |
131
+ |---|---|
132
+ | `log_analysis(...)` | Appends a SHA-256 hash-chained record to the SQLite ledger. |
133
+ | `verify_chain()` | Walks every record in O(N); pinpoints the first broken record. |
134
+ | `chain_stats()` | Count, first/last timestamps, breakdown by risk band, chain status. |
135
+ | `fetch_ledger(limit)` | Returns the latest N entries. |
136
+ | `ledger_dataframe(limit)` | Pandas DataFrame view (for Streamlit display). |
137
 
138
+ Each record stores `record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)`, so a retroactive edit to any record no longer matches its stored hash and breaks the chain from that point on.
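The hash-chain rule can be sketched in a few lines. This is a minimal illustration, not the code in `provenance.py`: it keeps records in an in-memory list instead of the SQLite ledger, and the `GENESIS` value, the `|` separator, the four-decimal score formatting and the helper names `append_record` / `verify_chain_sketch` are assumptions made for the example.

```python
import hashlib

GENESIS = "0" * 64  # assumed prev_hash for the first record


def record_hash(timestamp, doc_sha256, risk_band, risk_score, prev_hash):
    """SHA-256 over the pipe-joined fields named in the README."""
    payload = "|".join([timestamp, doc_sha256, risk_band, f"{risk_score:.4f}", prev_hash])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(ledger, timestamp, doc_sha256, risk_band, risk_score):
    """Append a record whose hash commits to the previous record's hash."""
    prev_hash = ledger[-1]["record_hash"] if ledger else GENESIS
    record = {
        "timestamp": timestamp,
        "doc_sha256": doc_sha256,
        "risk_band": risk_band,
        "risk_score": risk_score,
        "prev_hash": prev_hash,
    }
    record["record_hash"] = record_hash(timestamp, doc_sha256, risk_band, risk_score, prev_hash)
    ledger.append(record)
    return record


def verify_chain_sketch(ledger):
    """Walk the ledger once; return the index of the first broken record, or None."""
    prev_hash = GENESIS
    for index, rec in enumerate(ledger):
        expected = record_hash(
            rec["timestamp"], rec["doc_sha256"], rec["risk_band"], rec["risk_score"], prev_hash
        )
        if rec["prev_hash"] != prev_hash or rec["record_hash"] != expected:
            return index
        prev_hash = rec["record_hash"]
    return None


ledger = []
for i, (band_name, score) in enumerate([("LOW", 0.12), ("HIGH", 0.61), ("CRITICAL", 0.88)]):
    doc_digest = hashlib.sha256(f"document-{i}".encode()).hexdigest()
    append_record(ledger, f"2024-01-0{i + 1}T10:00:00Z", doc_digest, band_name, score)
```

Changing any stored field afterwards, for example downgrading the second record's band, makes `verify_chain_sketch` return that record's index.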
139
 
140
+ ### `tampering.py`: adversarial Forge Studio
141
+
142
+ `tamper_copy_move`, `tamper_text_edit`, `tamper_splice`, `tamper_compression`, `tamper_metadata_strip`, `tamper_custom_region`, `tamper_chain`, `annotate_before_after`, `overlay_heatmap_on_image`, `detector_scorecard`. Used by Tab 5 to apply controlled forgeries and immediately re-run detection.
143
 
144
  ### `compliance.py`: KYC + regulatory
145
 
146
  | Function | Description |
147
  |---|---|
148
+ | `validate_ifsc(code)` | Format check + RBI bank-code lookup (36 banks). |
149
+ | `validate_pan(code)` | Format + entity-type character validation. |
150
+ | `validate_aadhaar(num)` | 12-digit format + UIDAI Verhoeff checksum. |
151
+ | `redact_text(text)` | Masks IFSC, PAN, Aadhaar, account numbers. |
152
+ | `redact_pdf(input_path, output_path)` | PII black-box overlays via PyMuPDF text-bbox. |
153
+ | `extract_pii_fields(path)` | Pulls all PII candidates from any document. |
154
+ | `build_compliance_report(...)` | RBI Master-Direction-format audit PDF (5 sections). |
155
+
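The Verhoeff checksum named for `validate_aadhaar` is compact enough to show in full. The sketch below uses the standard Verhoeff tables (the permutation table is generated from its first non-identity row); the function names are hypothetical and it is not taken from `compliance.py`. The leading-digit rule (no 0 or 1) follows the earlier description of the validator in this diff.

```python
import re

# Multiplication table of the dihedral group D5.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]

# Position-dependent permutations: row k is the base permutation applied k times.
_BASE = [1, 5, 7, 6, 2, 8, 3, 0, 9, 4]
P = [list(range(10))]
for _ in range(7):
    P.append([_BASE[d] for d in P[-1]])


def verhoeff_check_digit(payload: str) -> str:
    """Check digit to append to a string of decimal digits."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return str(INV[c])


def verhoeff_valid(number: str) -> bool:
    """True if the last digit is a correct Verhoeff check digit."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def validate_aadhaar_sketch(num: str) -> bool:
    """12 digits, first digit 2-9, valid Verhoeff checksum."""
    digits = re.sub(r"[\s-]", "", num)
    return bool(re.fullmatch(r"[2-9]\d{11}", digits)) and verhoeff_valid(digits)


base = "23456789012"  # arbitrary 11-digit payload, not a real Aadhaar number
candidate = base + verhoeff_check_digit(base)
```

Because the check digit is computed from the payload, `validate_aadhaar_sketch(candidate)` holds, and altering any single digit of `candidate` makes it fail.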
156
+ ### `audit_report.py`: bank-letterhead PDF
157
+
158
+ `build_pdf_report(report, source_path) → bytes`. Multi-page PDF with header letterhead, metadata table, colour-coded risk verdict box, sub-score breakdown table, evidence list, embedded forensic heatmaps. Built with ReportLab Platypus.
159
+
160
+ ### `app.py`: Streamlit UI (6 tabs)
161
+
162
+ | Tab | Function |
163
+ |---|---|
164
+ | 1. Single-document analysis | Risk band, sub-score chart, ELA / copy-move / noise heatmaps, AI-gen FFT profile, ML/CNN predictions, downloadable JSON + PDF. |
165
+ | 2. Cross-document KYC | Upload 2–4 docs for one applicant; identity-field consistency table. |
166
+ | 3. Batch audit | Scan a folder; sortable risk table + CSV download. |
167
+ | 4. Compliance & Audit Pack | KYC validation, PII auto-redaction, RBI compliance PDF, **provenance ledger view with chain re-verify**. |
168
+ | 5. Live Tamper Forge Studio | Pick clean sample → choose technique + intensity → watch BankShield localise the tamper with per-detector scorecard + heatmap overlays. |
169
+ | 6. Fraud Ring Network | Upload N applicants → similarity graph with red ring members + ring summary cards. |
 
 
 
 
 
170
 
171
  ---
172
 
173
  ## Pipeline architecture
174
 
175
  ```
176
+               ┌─────────────────────────────────────────────┐
177
+               │      PRESENTATION (Streamlit, 6 tabs)       │
178
+               └──────────────────────┬──────────────────────┘
179
+                                      ▼
180
+               ┌─────────────────────────────────────────────┐
181
+               │               FORENSICS CORE                │
182
+               │ ELA · Copy-move · Noise · EXIF · OCR · PDF  │
183
+               │ + Random Forest (11-d feature vector)       │
184
+               │ + MobileNetV2 CNN (CASIA v2 fine-tuned)     │
185
+               │ + AI-Gen Detector (radial FFT)              │
186
+               └──────────────────────┬──────────────────────┘
187
+                                      ▼
188
+               ┌─────────────────────────────────────────────┐
189
+               │               ENSEMBLE FUSION               │
190
+               │ weighted blend → RF overlay → CNN overlay   │
191
+               │ → AI-gen overlay (capped at +20%)           │
192
+               └──────────────────────┬──────────────────────┘
193
+                                      │
194
+         ┌──────────────────┬─────────┴────────┬──────────────────┐
195
+         ▼                  ▼                  ▼                  ▼
196
+ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐
197
+ │ COMPLIANCE     │ │ FRAUD-RING     │ │ PROVENANCE     │ │ TAMPER FORGE   │
198
+ │ IFSC · PAN ·   │ │ NetworkX graph │ │ SHA-256 hash   │ │ Adversarial    │
199
+ │ Aadhaar · PII  │ │ clique detect  │ │ chain ledger   │ │ validation     │
200
+ └───────┬────────┘ └───────┬────────┘ └───────┬────────┘ └────────────────┘
201
+         │                  │                  │
202
+         └─────────┬────────┴──────────────────┘
203
+                   ▼
204
+  ┌─────────────────────────────────┐
205
+  │             OUTPUT              │
206
+  │ Risk band · Evidence list       │
207
+  │ Bank-letterhead audit PDF       │
208
+  │ RBI compliance PDF · Audit JSON │
209
+  │ Tamper-evident ledger entry     │
210
+  └─────────────────────────────────┘
211
  ```
212
 
213
+ Default weight vector (`forensics.WEIGHTS`): `{ela: 0.20, copy_move: 0.25, noise: 0.20, exif: 0.15, text_rules: 0.20}`. The Random Forest probability, when available, is blended 50/50 with the rule-based score. The CNN probability is blended at a weight between 0.4 and 0.7 based on the CNN's reported validation AUC. The AI-gen probability is applied as a final overlay capped at +20%.
214
+
215
+ Band mapping: `0–0.30 LOW · 0.30–0.50 MEDIUM · 0.50–0.75 HIGH · 0.75+ CRITICAL`.
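A small worked sketch may help when reading the fusion and banding rules. It is one possible interpretation, not the code in `forensics.py`: the README gives the weights, the 50/50 Random Forest blend, the 0.4 to 0.7 CNN weight range, the +20% cap and the band edges, while the linear AUC-to-weight mapping, the additive form of the AI-gen overlay, the lower-inclusive band edges and the function names `fuse` / `cnn_weight` are assumptions.

```python
WEIGHTS = {"ela": 0.20, "copy_move": 0.25, "noise": 0.20, "exif": 0.15, "text_rules": 0.20}


def rule_score(sub_scores: dict) -> float:
    """Weighted sum of detector sub-scores, each expected in [0, 1]."""
    return sum(weight * sub_scores.get(name, 0.0) for name, weight in WEIGHTS.items())


def cnn_weight(val_auc: float) -> float:
    """Assumed mapping: AUC 0.5 -> weight 0.4, AUC 1.0 -> weight 0.7, linear in between."""
    fraction = min(1.0, max(0.0, (val_auc - 0.5) / 0.5))
    return 0.4 + 0.3 * fraction


def fuse(sub_scores, rf_prob=None, cnn_prob=None, cnn_auc=0.5, ai_prob=None) -> float:
    score = rule_score(sub_scores)
    if rf_prob is not None:               # 50/50 blend with the rule-based score
        score = 0.5 * score + 0.5 * rf_prob
    if cnn_prob is not None:              # weight depends on reported validation AUC
        w = cnn_weight(cnn_auc)
        score = (1.0 - w) * score + w * cnn_prob
    if ai_prob is not None:               # assumed additive overlay, at most +0.20
        score += min(0.20, 0.20 * ai_prob)
    return min(1.0, max(0.0, score))


def band(score: float) -> str:
    """Band edges from the README; each edge is assumed to belong to the higher band."""
    if score < 0.30:
        return "LOW"
    if score < 0.50:
        return "MEDIUM"
    if score < 0.75:
        return "HIGH"
    return "CRITICAL"


clean = {name: 0.0 for name in WEIGHTS}
example = fuse({"ela": 0.9, "copy_move": 0.8, "noise": 0.4, "exif": 1.0, "text_rules": 0.0})
```

For the `example` sub-scores the rule-based total is 0.18 + 0.20 + 0.08 + 0.15 + 0.00 = 0.61, which falls in the HIGH band.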
216
+
217
+ See [`ARCHITECTURE.md`](ARCHITECTURE.md) for the full reference.
218
 
219
  ---
220
 
221
  ## Detection coverage
222
 
223
  **Image tampering**
224
+
225
  - Copy-move forgery: ORB keypoint matching with distance filter
226
  - Image splicing: block-wise noise inconsistency via Laplacian variance
227
  - Text edits / amount tampering: Error Level Analysis
228
  - Photoshop / GIMP / Snapseed edits: EXIF Software-tag string match
229
  - Timestamp inconsistencies: DateTime vs DateTimeOriginal comparison
230
 
231
+ **AI-generated content**
232
+
233
+ - Sora / Midjourney / Stable Diffusion / DALL-E outputs: FFT spectral analysis
234
+ - High-frequency suppression (1/f decay deviation)
235
+ - Periodic checkerboard peaks from upsampling stride
236
+ - Non-standard JPEG quantization tables
237
+
238
  **PDF tampering**
239
+
240
  - Incremental edits: multi-`%%EOF` marker counting
241
+ - Consumer-tool fingerprints: iLovePDF, Smallpdf, PDFescape, Sejda, Foxit Phantom
242
  - Producer/Creator mismatch: flags re-processed PDFs
243
+ - Inserted text: embedded-font count anomalies
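Two of the PDF signals listed above need nothing beyond byte and string matching, so they are easy to sketch. The helpers below are illustrative only (hypothetical names, not the body of `pdf_structural_audit`); they work on raw bytes and on producer/creator strings that the caller has already extracted.

```python
# Lower-cased fingerprints taken from the list above.
CONSUMER_TOOLS = ("ilovepdf", "smallpdf", "pdfescape", "sejda", "foxit phantom")


def count_eof_markers(pdf_bytes: bytes) -> int:
    """Number of %%EOF markers; each incremental save appends another one."""
    return pdf_bytes.count(b"%%EOF")


def tool_fingerprints(producer: str, creator: str) -> list:
    """Consumer-tool names found in the producer or creator metadata strings."""
    haystack = f"{producer} {creator}".lower()
    return [tool for tool in CONSUMER_TOOLS if tool in haystack]


def producer_creator_mismatch(producer: str, creator: str) -> bool:
    """True when both fields are present but name different software."""
    p, c = producer.strip().lower(), creator.strip().lower()
    return bool(p) and bool(c) and p != c


original = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"
edited = original + b"2 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"
```

A count above one is best treated as a signal rather than proof, since some legitimately produced files (for example linearized PDFs) may also carry more than one marker.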
244
 
245
+ **Cross-document & fraud-ring**
 
 
 
246
 
 
247
  - Name / DOB / address fuzzy match across multiple documents
248
+ - Per-field weighted scoring with green / yellow / red status
249
+ - Cross-applicant similarity graph; cliques of ≥3 applicants = suspected fraud ring
250
+ - Ring bands: CRITICAL (≥5 members) / HIGH (3–4) / MEDIUM (2)
251
 
252
  **KYC validation**
253
+
254
  - IFSC: format + RBI bank-code list (36 banks)
255
  - PAN: format + entity-type character (10 types per income-tax dept spec)
256
  - Aadhaar: 12-digit format + UIDAI Verhoeff checksum
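The IFSC and PAN format checks can be sketched with the regular expressions quoted in the earlier version of the function table in this diff (`^[A-Z]{4}0[A-Z0-9]{6}$` and `^[A-Z]{5}\d{4}[A-Z]$`). The helper names and return shapes are hypothetical, the bank-code lookup is omitted, and the entity map lists only the three types named in this document out of the ten it mentions.

```python
import re

IFSC_RE = re.compile(r"^[A-Z]{4}0[A-Z0-9]{6}$")
PAN_RE = re.compile(r"^[A-Z]{5}\d{4}[A-Z]$")

# Partial map: the fourth PAN character encodes the holder type.
PAN_ENTITY = {"P": "Individual", "F": "Firm", "C": "Company"}


def check_ifsc(code: str):
    """Return bank and branch codes if the format is valid, else None."""
    code = code.strip().upper()
    if not IFSC_RE.match(code):
        return None
    return {"bank_code": code[:4], "branch_code": code[5:]}


def check_pan(code: str):
    """Return the entity type if the format is valid, else None."""
    code = code.strip().upper()
    if not PAN_RE.match(code):
        return None
    return {"entity": PAN_ENTITY.get(code[3], "other/unknown")}


ifsc_result = check_ifsc("sbin0001234")   # made-up code in a valid format
pan_result = check_pan("ABCPD1234E")      # made-up PAN in a valid format
```

A format match alone says nothing about whether the code exists; that is what the embedded bank-code list adds in the real validator.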
257
 
258
+ **PII redaction & audit**
259
+
260
  - Aadhaar, PAN, IFSC, account-number masking
261
+ - PDF redaction with black rectangle overlays
262
+ - SHA-256 hash-chained provenance ledger (RBI Para 67 compliant)
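Text-level masking of the four identifier types can be approximated with ordered regular-expression substitutions. This is a sketch under assumptions, not the implementation of `redact_text`: the placeholder format, the optional space or hyphen grouping for Aadhaar and the 9 to 18 digit range used for account numbers are all choices made for the example.

```python
import re

# Order matters: the more specific patterns run before the generic digit run.
PII_PATTERNS = [
    ("PAN", re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b")),
    ("IFSC", re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b")),
    ("AADHAAR", re.compile(r"\b[2-9]\d{3}[ -]?\d{4}[ -]?\d{4}\b")),
    ("ACCOUNT", re.compile(r"\b\d{9,18}\b")),  # assumed length range
]


def redact_text_sketch(text: str) -> str:
    """Replace each match with a labelled placeholder."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text


sample = "PAN ABCPD1234E, IFSC SBIN0001234, Aadhaar 2345 6789 0123, A/c 123456789012345"
masked = redact_text_sketch(sample)
```

An unformatted 12-digit number starting with 2 to 9 is labelled as Aadhaar here even if it is really an account number; the value is masked either way, only the label may be wrong.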
263
 
264
  ---
265
 
 
275
  Browser opens at `http://localhost:8501`.
276
 
277
  For full OCR text-rule support, install Tesseract OCR:
278
+
279
  - Windows: https://github.com/UB-Mannheim/tesseract/wiki
280
  - macOS: `brew install tesseract`
281
+ - Linux: `sudo apt-get install tesseract-ocr libtesseract-dev`
282
 
283
  The app auto-detects Tesseract on standard Windows install paths; no environment variable required.
284
 
 
 
285
  ---
286
 
287
  ## Deployment
288
 
289
+ The repository is deployment-ready for both **Streamlit Community Cloud** and **Hugging Face Spaces**. The YAML frontmatter at the top of this README configures the HF Space; `packages.txt` ensures Tesseract is installed on the build VM; `requirements.txt` covers Python dependencies.
290
 
291
+ Live deployment: https://huggingface.co/spaces/SpandanM110/DocSentry
292
 
293
  ---
294
 
295
  ## Training your own model
296
 
297
+ Drop labelled data into `data/images/originals/` and `data/images/tampered/`, open `docsentry_master.ipynb`, run section 6. A Random Forest auto-trains on whatever you put there and saves to `models/forgery_rf.joblib`. The Streamlit app picks it up automatically on next restart.
 
 
298
 
299
+ For a CNN upgrade, set `TRAIN_CNN = True` in section 7 and run on a Colab T4 GPU (free tier). Saves `models/forgery_cnn.keras` + `models/forgery_cnn.meta.json`. The app loads it lazily on first request.
300
 
301
  ---
302
 
303
  ## Dependencies
304
 
305
+ OpenCV (cv2), Pillow (PIL), scikit-image, scikit-learn, joblib, PyMuPDF (fitz), pdfplumber, pikepdf, pytesseract, python-dateutil, Streamlit, streamlit-drawable-canvas, ReportLab, NumPy, pandas, matplotlib, NetworkX. Optional: TensorFlow (only required for the CNN path).
306
 
307
  All pip-installable. No GPU required for the default pipeline.
308
 
 
310
 
311
  ## License
312
 
313
+ MIT; see `LICENSE`. The MIT license covers the source code in this repository. Third-party datasets and pretrained models bundled or referenced (CASIA v2, IDRBT cheque dataset, AgamiAI Indian Bank Statements, MobileNetV2 ImageNet weights) are governed by their own terms; those notices are reproduced in `LICENSE` below the MIT block.
314
 
315
  ---
316