Commit b2ca303
Parent(s): e97f963
Round 2: README + ignore pitch artefacts and runtime ledger
- .gitignore +13 -0
- README.md +178 -137
.gitignore
CHANGED
|
@@ -34,3 +34,16 @@ Thumbs.db
|
|
| 34 |
|
| 35 |
# Archive (the old notebooks we moved out)
|
| 36 |
archive/
|
| 34 |
|
| 35 |
# Archive (the old notebooks we moved out)
|
| 36 |
archive/
|
| 37 |
+
|
| 38 |
+
# Hackathon pitch artefacts - keep local, don't commit
|
| 39 |
+
BankShield_Pitch.pptx
|
| 40 |
+
SUBMISSION.md
|
| 41 |
+
*.pptx
|
| 42 |
+
|
| 43 |
+
# Runtime state - SQLite ledger fills up at runtime, never commit
|
| 44 |
+
provenance.db
|
| 45 |
+
provenance.db-journal
|
| 46 |
+
*.db
|
| 47 |
+
*.db-journal
|
| 48 |
+
*.sqlite
|
| 49 |
+
*.sqlite3
|
README.md
CHANGED
|
@@ -8,15 +8,35 @@ sdk_version: 1.32.0
|
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
license: mit
|
| 11 |
-
short_description:
|
| 12 |
---
|
| 13 |
|
| 14 |
-
#
|
| 15 |
|
| 16 |
-
Document
|
| 17 |
|
| 18 |
-
|
| 19 |
-
|
| 20 |
|
| 21 |
---
|
| 22 |
|
|
@@ -24,14 +44,18 @@ Document forensics and KYC compliance pipeline for bank underwriting workflows.
|
|
| 24 |
|
| 25 |
```
|
| 26 |
Doc-Sentry/
|
| 27 |
-
├── app.py                     Streamlit web UI (
|
| 28 |
-
├── forensics.py               Core detection engine
|
| 29 |
-
├──
|
| 30 |
├── compliance.py              KYC validators, PII redaction, RBI report builder
|
| 31 |
├── docsentry_master.ipynb     Single source-of-truth Jupyter notebook
|
| 32 |
│
|
| 33 |
├── requirements.txt           Python dependencies
|
| 34 |
-
├── packages.txt               System packages (Tesseract) for Streamlit Cloud
|
| 35 |
├── .streamlit/config.toml     Streamlit theme + server config
|
| 36 |
│
|
| 37 |
├── sample_data/               26 demo files for the live app
|
|
@@ -40,9 +64,13 @@ Doc-Sentry/
|
|
| 40 |
│   └── pdfs/                  2 PDFs (1 genuine, 1 tampered)
|
| 41 |
│
|
| 42 |
├── models/                    Trained model artefacts
|
| 43 |
-
│
|
| 44 |
│
|
| 45 |
-
├──
|
| 46 |
└── data/                      (gitignored) full training data + downloaded datasets
|
| 47 |
```
|
| 48 |
|
|
@@ -56,166 +84,182 @@ The core analytical module. Stateless functions; all logic is independently test
|
|
| 56 |
|
| 57 |
| Function | Returns | Description |
|
| 58 |
|---|---|---|
|
| 59 |
-
| `analyse_document(path)` | dict | End-to-end pipeline. Auto-detects type
|
| 60 |
| `score_image(path)` | (float, dict, list) | Composite forensic score for an image. Returns total, sub-scores by detector, and EXIF flags. |
|
| 61 |
-
| `error_level_analysis(path, quality=90)` | (PIL.Image, float) | ELA visualisation + scalar suspicion score.
|
| 62 |
-
| `copy_move_detect(path)` | (np.ndarray, int, list) |
|
| 63 |
-
| `noise_inconsistency(path, block=32)` | (np.ndarray, float) | Per-block Laplacian variance
|
| 64 |
-
| `exif_sanity(path)` | list of str | EXIF
|
| 65 |
-
| `pdf_structural_audit(path)` | dict |
|
| 66 |
-
| `pdf_font_audit(path)` | dict |
|
| 67 |
-
| `ocr_text(path)` | str | Tesseract OCR with auto-fallback.
|
| 68 |
-
| `text_rule_checks(text)` | dict |
|
| 69 |
-
| `extract_features(path)` | dict |
|
| 70 |
-
| `predict_with_model(path)` | dict
|
| 71 |
-
| `predict_with_cnn(path)` | dict
|
| 72 |
-
| `extract_identity_fields(path)` | (dict, str) | Pulls name, DOB, address,
|
| 73 |
-
| `cross_doc_consistency(paths)` | dict |
|
| 74 |
-
| `generate_insights(score, sub, flags)` | dict |
|
| 75 |
-
| `band(score)` | str | Maps a float to LOW / MEDIUM / HIGH / CRITICAL.
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
### `app.py` - Streamlit UI
|
| 80 |
-
|
| 81 |
-
Four-tab web app. Imports `forensics`, `compliance`, and `audit_report`.
|
| 82 |
|
| 83 |
-
|
|
| 84 |
|---|---|
|
| 85 |
-
|
|
| 86 |
-
|
|
| 87 |
-
|
|
| 88 |
-
|
|
|
|
|
| 89 |
|
| 90 |
-
|
| 91 |
|
| 92 |
-
### `
|
| 93 |
|
| 94 |
-
|
| 95 |
|
| 96 |
-
|
|
|
|
|
|
|
| 97 |
|
| 98 |
### `compliance.py` - KYC + regulatory
|
| 99 |
|
| 100 |
| Function | Description |
|
| 101 |
|---|---|
|
| 102 |
-
| `validate_ifsc(code)` | Format check
|
| 103 |
-
| `validate_pan(code)` | Format
|
| 104 |
-
| `validate_aadhaar(num)` | 12-digit format + UIDAI Verhoeff checksum
|
| 105 |
-
| `redact_text(text)` | Masks IFSC, PAN, Aadhaar,
|
| 106 |
-
| `redact_pdf(input_path, output_path)` |
|
| 107 |
-
| `extract_pii_fields(path)` | Pulls all PII candidates from any document
|
| 108 |
-
| `build_compliance_report(
|
| 109 |
-
|
| 110 |
-
### `
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
11. PDF report generator
|
| 125 |
-
12. Export cell β writes `forensics.py`, `app.py`, `audit_report.py` to disk for the Streamlit demo
|
| 126 |
-
13. Launch instructions
|
| 127 |
-
|
| 128 |
-
Edit the notebook, re-run section 12, and the `.py` files used by Streamlit regenerate automatically.
|
| 129 |
|
| 130 |
---
|
| 131 |
|
| 132 |
## Pipeline architecture
|
| 133 |
|
| 134 |
```
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
βββββββββββββββββββ
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
ββββββββββββββββββββββββββββββ
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
|
| 158 |
-
|
| 159 |
-
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
                ▼
|
| 171 |
-
┌──────────────────────────────┐
|
| 172 |
-
│  Weighted ensemble scorer    │
|
| 173 |
-
│  (rule + RF + CNN blend)     │
|
| 174 |
-
└───────────────┬──────────────┘
|
| 175 |
-
                ▼
|
| 176 |
-
┌──────────────────────────────┐
|
| 177 |
-
│  Risk band + Evidence list   │
|
| 178 |
-
│  Recommended action          │
|
| 179 |
-
│  Audit JSON + PDF report     │
|
| 180 |
-
└──────────────────────────────┘
|
| 181 |
```
|
| 182 |
|
| 183 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 184 |
|
| 185 |
---
|
| 186 |
|
| 187 |
## Detection coverage
|
| 188 |
|
| 189 |
**Image tampering**
|
|
|
|
| 190 |
- Copy-move forgery: ORB keypoint matching with distance filter
|
| 191 |
- Image splicing: block-wise noise inconsistency via Laplacian variance
|
| 192 |
- Text edits / amount tampering: Error Level Analysis
|
| 193 |
- Photoshop / GIMP / Snapseed edits: EXIF Software-tag string match
|
| 194 |
- Timestamp inconsistencies: DateTime vs DateTimeOriginal comparison
|
| 195 |
|
| 196 |
**PDF tampering**
|
|
|
|
| 197 |
- Incremental edits: multi-`%%EOF` marker counting
|
| 198 |
-
- Consumer-tool fingerprints: iLovePDF, Smallpdf, PDFescape, Sejda, Foxit Phantom
|
| 199 |
- Producer/Creator mismatch: flags re-processed PDFs
|
| 200 |
-
- Inserted text: embedded
|
| 201 |
|
| 202 |
-
**
|
| 203 |
-
- Date sequence violations: monotonic check on extracted dates
|
| 204 |
-
- Round-number anomalies: counts mega-amounts that are multiples of ₹1 lakh
|
| 205 |
-
- Missing IFSC with account number present: invalid bank document
|
| 206 |
|
| 207 |
-
**Cross-document**
|
| 208 |
- Name / DOB / address fuzzy match across multiple documents
|
| 209 |
-
- Per-field
|
|
|
|
|
|
|
| 210 |
|
| 211 |
**KYC validation**
|
|
|
|
| 212 |
- IFSC: format + RBI bank-code list (36 banks)
|
| 213 |
- PAN: format + entity-type character (10 types per income-tax dept spec)
|
| 214 |
- Aadhaar: 12-digit format + UIDAI Verhoeff checksum
|
| 215 |
|
| 216 |
-
**PII redaction**
|
|
|
|
| 217 |
- Aadhaar, PAN, IFSC, account-number masking
|
| 218 |
-
- PDF redaction with black rectangle overlays
|
|
|
|
| 219 |
|
| 220 |
---
|
| 221 |
|
|
@@ -231,37 +275,34 @@ streamlit run app.py
|
|
| 231 |
Browser opens at `http://localhost:8501`.
|
| 232 |
|
| 233 |
For full OCR text-rule support, install Tesseract OCR:
|
|
|
|
| 234 |
- Windows: https://github.com/UB-Mannheim/tesseract/wiki
|
| 235 |
- macOS: `brew install tesseract`
|
| 236 |
-
- Linux: `sudo apt-get install tesseract-ocr`
|
| 237 |
|
| 238 |
The app auto-detects Tesseract on standard Windows install paths; no environment variable required.
|
| 239 |
|
| 240 |
-
See `RUN_APP.md` for a more detailed walkthrough.
|
| 241 |
-
|
| 242 |
---
|
| 243 |
|
| 244 |
## Deployment
|
| 245 |
|
| 246 |
-
|
| 247 |
|
| 248 |
-
|
| 249 |
|
| 250 |
---
|
| 251 |
|
| 252 |
## Training your own model
|
| 253 |
|
| 254 |
-
Drop labelled data into `data/images/originals/` and `data/images/tampered/`, open `docsentry_master.ipynb`, run section 6. A Random Forest auto-trains on whatever you put there and saves to `models/forgery_rf.joblib`. The Streamlit app picks it up automatically on next restart
|
| 255 |
-
|
| 256 |
-
For a CNN upgrade, set `TRAIN_CNN = True` in section 7 and run on a Colab T4 GPU (free tier). Saves `models/forgery_cnn.keras` + `models/forgery_cnn.meta.json`. The app loads this lazily on first request.
|
| 257 |
|
| 258 |
-
|
| 259 |
|
| 260 |
---
|
| 261 |
|
| 262 |
## Dependencies
|
| 263 |
|
| 264 |
-
OpenCV (cv2), Pillow (PIL), scikit-image, scikit-learn, joblib, PyMuPDF (fitz), pdfplumber, pikepdf, pytesseract, python-dateutil, Streamlit, ReportLab, NumPy, pandas, matplotlib. Optional: TensorFlow (only required for the CNN path).
|
| 265 |
|
| 266 |
All pip-installable. No GPU required for the default pipeline.
|
| 267 |
|
|
@@ -269,7 +310,7 @@ All pip-installable. No GPU required for the default pipeline.
|
|
| 269 |
|
| 270 |
## License
|
| 271 |
|
| 272 |
-
MIT; see `LICENSE`. The MIT license covers the source code in this repository. Third-party datasets and pretrained models bundled or referenced (CASIA v2, IDRBT cheque dataset, AgamiAI Indian Bank Statements, MobileNetV2 ImageNet weights
|
| 273 |
|
| 274 |
---
|
| 275 |
|
|
|
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
license: mit
|
| 11 |
+
short_description: BankShield - document forensics + fraud-ring detection for Indian bank underwriting
|
| 12 |
---
|
| 13 |
|
| 14 |
+
# BankShield
|
| 15 |
|
| 16 |
+
**Real-Time Document Forensics, AI-Generated Forgery Detection, and Cross-Applicant Fraud-Ring Intelligence for Indian Bank Underwriting.**
|
| 17 |
|
| 18 |
+
BankShield catches tampered, forged, and AI-generated documents the moment they reach the underwriter, and surfaces organised fraud rings that span multiple applicants. Six independent detection layers fuse into a single calibrated risk score, with explainable evidence, tamper-evident audit trails, and RBI-format compliance reports out of the box.
|
| 19 |
+
|
| 20 |
+
100% open source. No paid APIs. No external LLM calls. CPU-only by default. Runs locally inside the bank's perimeter, so PII never leaves.
|
| 21 |
+
|
| 22 |
+
- **Live demo:** https://huggingface.co/spaces/SpandanM110/DocSentry
|
| 23 |
+
- **Source:** https://github.com/SpandanM110/Doc-Sentry
|
| 24 |
+
- **Architecture reference:** see [`ARCHITECTURE.md`](ARCHITECTURE.md)
|
| 25 |
+
|
| 26 |
+
---
|
| 27 |
+
|
| 28 |
+
## The six pillars
|
| 29 |
+
|
| 30 |
+
| Pillar | Module | What it does |
|
| 31 |
+
|---|---|---|
|
| 32 |
+
| **Image Forensics** | `forensics.py` | ELA, copy-move (ORB), Laplacian noise inconsistency, EXIF audit |
|
| 33 |
+
| **PDF Structural Audit** | `forensics.py` | EOF marker counting, producer/creator drift, embedded-font anomalies, consumer-tool fingerprints |
|
| 34 |
+
| **OCR + Financial Rules** | `forensics.py` | Tesseract OCR + IFSC / PAN / Aadhaar / date monotonicity / amount sanity |
|
| 35 |
+
| **AI-Generated Detection** *(new)* | `ai_detector.py` | Radial FFT spectral analysis; catches Sora / Midjourney / Stable Diffusion outputs |
|
| 36 |
+
| **Fraud Ring Network** *(new)* | `fraud_ring.py` | NetworkX similarity graph across applicants; clique discovery flags organised fraud rings |
|
| 37 |
+
| **Provenance Ledger** *(new)* | `provenance.py` | SHA-256 hash chain over every analysis; O(N) verifiable; RBI Para 67 compliant |
|
| 38 |
+
|
| 39 |
+
Plus the **Live Tamper Forge Studio** (`tampering.py`), an adversarial-validation harness built directly into the dashboard.
|
| 40 |
|
| 41 |
---
|
| 42 |
|
|
|
|
| 44 |
|
| 45 |
```
|
| 46 |
Doc-Sentry/
|
| 47 |
+
├── app.py                     Streamlit web UI (6 tabs)
|
| 48 |
+
├── forensics.py               Core detection engine + ensemble fusion
|
| 49 |
+
├── ai_detector.py             AI-generated forgery detector (FFT spectral)
|
| 50 |
+
├── fraud_ring.py              Cross-applicant similarity graph + clique detection
|
| 51 |
+
├── provenance.py              Tamper-evident SHA-256 hash chain
|
| 52 |
+
├── tampering.py               Forge Studio adversarial harness
|
| 53 |
├── compliance.py              KYC validators, PII redaction, RBI report builder
|
| 54 |
+
├── audit_report.py            Bank-letterhead PDF report builder
|
| 55 |
├── docsentry_master.ipynb     Single source-of-truth Jupyter notebook
|
| 56 |
│
|
| 57 |
├── requirements.txt           Python dependencies
|
| 58 |
+
├── packages.txt               System packages (Tesseract) for Streamlit Cloud / HF Spaces
|
| 59 |
├── .streamlit/config.toml     Streamlit theme + server config
|
| 60 |
│
|
| 61 |
├── sample_data/               26 demo files for the live app
|
|
|
|
| 64 |
│   └── pdfs/                  2 PDFs (1 genuine, 1 tampered)
|
| 65 |
│
|
| 66 |
├── models/                    Trained model artefacts
|
| 67 |
+
│   ├── forgery_rf.joblib      Random Forest classifier
|
| 68 |
+
│   └── forgery_cnn.keras      MobileNetV2 fine-tuned on CASIA v2 (optional)
|
| 69 |
│
|
| 70 |
+
├── ARCHITECTURE.md            Full architecture reference
|
| 71 |
+
├── SUBMISSION.md              Hackathon submission packet
|
| 72 |
+
├── BankShield_Pitch.pptx      Pitch deck (15 slides)
|
| 73 |
+
├── README.md LICENSE
|
| 74 |
└── data/                      (gitignored) full training data + downloaded datasets
|
| 75 |
```
|
| 76 |
|
|
|
|
| 84 |
|
| 85 |
| Function | Returns | Description |
|
| 86 |
|---|---|---|
|
| 87 |
+
| `analyse_document(path)` | dict | End-to-end pipeline. Auto-detects type, runs all relevant detectors, blends Random Forest + CNN + AI-gen predictions, auto-logs to provenance ledger. Primary entry point. |
|
| 88 |
| `score_image(path)` | (float, dict, list) | Composite forensic score for an image. Returns total, sub-scores by detector, and EXIF flags. |
|
| 89 |
+
| `error_level_analysis(path, quality=90)` | (PIL.Image, float) | ELA visualisation + scalar suspicion score. |
|
| 90 |
+
| `copy_move_detect(path)` | (np.ndarray, int, list) | ORB-based copy-move detection. Returns annotated viz, match count, raw matches. |
|
| 91 |
+
| `noise_inconsistency(path, block=32)` | (np.ndarray, float) | Per-block Laplacian variance heatmap + outlier ratio. |
|
| 92 |
+
| `exif_sanity(path)` | list of str | EXIF audit: missing EXIF, editor signatures, timestamp inconsistencies. |
|
| 93 |
+
| `pdf_structural_audit(path)` | dict | `%%EOF` markers, producer/creator drift, consumer-tool fingerprints. |
|
| 94 |
+
| `pdf_font_audit(path)` | dict | Embedded font listing + count anomalies. |
|
| 95 |
+
| `ocr_text(path)` | str | Tesseract OCR with auto-fallback. |
|
| 96 |
+
| `text_rule_checks(text)` | dict | Date monotonicity, amount sanity, IFSC format, account-number patterns. |
|
| 97 |
+
| `extract_features(path)` | dict | 11-feature vector for the Random Forest. |
|
| 98 |
+
| `predict_with_model(path)` | dict / None | Random Forest tamper probability + verdict. |
|
| 99 |
+
| `predict_with_cnn(path)` | dict / None | MobileNetV2 CNN inference (lazy-loaded). |
|
| 100 |
+
| `extract_identity_fields(path)` | (dict, str) | Pulls name, DOB, address, IFSC, account, amounts. |
|
| 101 |
+
| `cross_doc_consistency(paths)` | dict | Per-field similarity across 2+ documents. |
|
| 102 |
+
| `generate_insights(score, sub, flags)` | dict | Numeric → underwriter-readable bullets + recommended action. |
|
| 103 |
+
| `band(score)` | str | Maps a float to LOW / MEDIUM / HIGH / CRITICAL. |
|
| 104 |
+
|
| 105 |
+
### `ai_detector.py` - AI-generated forgery detection
|
|
|
|
|
|
|
|
|
|
|
|
|
| 106 |
|
| 107 |
+
| Function | Description |
|
| 108 |
|---|---|
|
| 109 |
+
| `detect_ai_generated(path)` | Full pipeline: probability + verdict + flags + FFT profile. |
|
| 110 |
+
| `radial_fft_profile(gray)` | Radially-averaged log-magnitude FFT spectrum. |
|
| 111 |
+
| `high_freq_attenuation(profile)` | Smoothness score: low for real scans, high for AI outputs. |
|
| 112 |
+
| `spectral_peak_score(profile)` | Counts checkerboard-stride peaks in the high-frequency band. |
|
| 113 |
+
| `jpeg_quantization_check(path)` | Inspects JPEG quantization tables for synthetic-media signatures. |
|
| 114 |
|
| 115 |
+
The AI-gen probability is blended into the main risk score as an overlay capped at +20%, so it can raise the score for synthetic media without dominating the classical detectors.
|
| 116 |
|
| 117 |
+
### `fraud_ring.py` - cross-applicant fraud-ring detection
|
| 118 |
+
|
| 119 |
+
| Function | Description |
|
| 120 |
+
|---|---|
|
| 121 |
+
| `extract_applicant_fields(path)` | OCR + regex pull of name / DOB / address / phone / IFSC / account / employer. |
|
| 122 |
+
| `compare_applicants(a, b)` | Per-field similarity + weighted score. |
|
| 123 |
+
| `build_fraud_graph(applicants)` | NetworkX similarity graph (edges weighted by shared signals). |
|
| 124 |
+
| `detect_rings(G, min_size=3)` | Connected components above threshold → suspected fraud rings. |
|
| 125 |
+
| `visualize_graph(G, rings)` | Force-directed graph with ring members in red. |
|
| 126 |
+
| `fraud_summary(G, rings, applicants)` | Structured summary for the Streamlit UI. |
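
The ring-detection idea in the table above can be sketched with the standard library alone. This is an illustrative stand-in, not the code in `fraud_ring.py`: the module itself uses NetworkX, and the names `similarity` and `find_rings`, the field weights, and the 0.5 edge threshold are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def similarity(a, b, weights):
    """Sum the weights of fields that are present in both records and equal."""
    score = 0.0
    for field, weight in weights.items():
        va, vb = a.get(field), b.get(field)
        if va and vb and va.strip().lower() == vb.strip().lower():
            score += weight
    return score


def find_rings(applicants, weights, threshold=0.5, min_size=3):
    """Link applicants whose similarity reaches the threshold, then return
    every connected component with at least min_size members."""
    adjacency = defaultdict(set)
    for (id_a, a), (id_b, b) in combinations(applicants.items(), 2):
        if similarity(a, b, weights) >= threshold:
            adjacency[id_a].add(id_b)
            adjacency[id_b].add(id_a)

    seen, rings = set(), []
    for start in applicants:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # iterative depth-first walk of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) >= min_size:
            rings.append(sorted(component))
    return rings


# Hypothetical field weights and applicants, for illustration only.
WEIGHTS = {"phone": 0.4, "address": 0.4, "employer": 0.2}
APPLICANTS = {
    "A": {"phone": "9000000001", "address": "12 Lake Road", "employer": "Acme"},
    "B": {"phone": "9000000001", "address": "12 lake road", "employer": "Zen"},
    "C": {"phone": "9000000001", "address": "12 Lake Road", "employer": "Orb"},
    "D": {"phone": "9000000002", "address": "7 Hill Street", "employer": "Acme"},
    "E": {"phone": "9000000003", "address": "3 Park Lane", "employer": "Kite"},
}
rings = find_rings(APPLICANTS, WEIGHTS)
```

Connected components are a looser criterion than cliques: a chain A-B-C counts as a ring even if A and C share nothing directly, which is worth keeping in mind when choosing the threshold.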
|
| 127 |
+
|
| 128 |
+
### `provenance.py` - tamper-evident audit ledger
|
| 129 |
+
|
| 130 |
+
| Function | Description |
|
| 131 |
+
|---|---|
|
| 132 |
+
| `log_analysis(...)` | Appends a SHA-256 hash-chained record to the SQLite ledger. |
|
| 133 |
+
| `verify_chain()` | Walks every record in O(N); pinpoints the first broken record. |
|
| 134 |
+
| `chain_stats()` | Count, first/last timestamps, breakdown by risk band, chain status. |
|
| 135 |
+
| `fetch_ledger(limit)` | Returns the latest N entries. |
|
| 136 |
+
| `ledger_dataframe(limit)` | Pandas DataFrame view (for Streamlit display). |
|
| 137 |
|
| 138 |
+
Each record's `record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)`, so a retroactive edit to any record changes its hash and breaks every later link in the chain.
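
A minimal in-memory sketch of that hash chain is shown here. It follows the field order in the formula, but the all-zero genesis hash, the four-decimal score formatting, and the helper names `append_record` and `verify_records` are assumptions for illustration; the real ledger in `provenance.py` is SQLite-backed and may serialise fields differently.

```python
import hashlib

GENESIS = "0" * 64  # assumed prev_hash for the very first record


def record_hash(timestamp, doc_sha256, risk_band, risk_score, prev_hash):
    """SHA-256 over the pipe-joined fields, in the order given by the README formula."""
    payload = "|".join(
        [timestamp, doc_sha256, risk_band, f"{risk_score:.4f}", prev_hash]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(chain, timestamp, doc_sha256, risk_band, risk_score):
    """Append a record whose hash commits to the previous record's hash."""
    prev_hash = chain[-1]["record_hash"] if chain else GENESIS
    record = {
        "timestamp": timestamp,
        "doc_sha256": doc_sha256,
        "risk_band": risk_band,
        "risk_score": risk_score,
        "prev_hash": prev_hash,
    }
    record["record_hash"] = record_hash(
        timestamp, doc_sha256, risk_band, risk_score, prev_hash
    )
    chain.append(record)
    return record


def verify_records(chain):
    """Walk the chain once (O(N)). Returns (True, None) when intact,
    otherwise (False, index_of_first_broken_record)."""
    prev_hash = GENESIS
    for index, rec in enumerate(chain):
        expected = record_hash(
            rec["timestamp"], rec["doc_sha256"], rec["risk_band"],
            rec["risk_score"], rec["prev_hash"],
        )
        if rec["prev_hash"] != prev_hash or rec["record_hash"] != expected:
            return False, index
        prev_hash = rec["record_hash"]
    return True, None


ledger = []
for n, (band_name, score) in enumerate([("LOW", 0.12), ("HIGH", 0.61), ("CRITICAL", 0.88)]):
    doc_digest = hashlib.sha256(f"document-{n}".encode()).hexdigest()
    append_record(ledger, f"2024-01-0{n + 1}T10:00:00Z", doc_digest, band_name, score)
```

Editing any stored field after the fact, for example downgrading the second record from HIGH to LOW, makes verification fail at that record's index.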
|
| 139 |
|
| 140 |
+
### `tampering.py` - adversarial Forge Studio
|
| 141 |
+
|
| 142 |
+
`tamper_copy_move`, `tamper_text_edit`, `tamper_splice`, `tamper_compression`, `tamper_metadata_strip`, `tamper_custom_region`, `tamper_chain`, `annotate_before_after`, `overlay_heatmap_on_image`, `detector_scorecard`. Used by Tab 5 to apply controlled forgeries and immediately re-run detection.
|
| 143 |
|
| 144 |
### `compliance.py` - KYC + regulatory
|
| 145 |
|
| 146 |
| Function | Description |
|
| 147 |
|---|---|
|
| 148 |
+
| `validate_ifsc(code)` | Format check + RBI bank-code lookup (36 banks). |
|
| 149 |
+
| `validate_pan(code)` | Format + entity-type character validation. |
|
| 150 |
+
| `validate_aadhaar(num)` | 12-digit format + UIDAI Verhoeff checksum. |
|
| 151 |
+
| `redact_text(text)` | Masks IFSC, PAN, Aadhaar, account numbers. |
|
| 152 |
+
| `redact_pdf(input_path, output_path)` | PII black-box overlays via PyMuPDF text-bbox. |
|
| 153 |
+
| `extract_pii_fields(path)` | Pulls all PII candidates from any document. |
|
| 154 |
+
| `build_compliance_report(...)` | RBI Master-Direction-format audit PDF (5 sections). |
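
The Verhoeff check that `validate_aadhaar` relies on can be sketched as below. The three lookup tables are the standard Verhoeff tables; the function names are hypothetical and the code is not taken from `compliance.py`, which may apply further format rules.

```python
# Standard Verhoeff tables: D is the dihedral-group multiplication table,
# P the position-dependent permutation, INV the inverse of each digit under D.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_check_digit(payload: str) -> int:
    """Check digit to append to a string of decimal digits."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return INV[c]


def verhoeff_valid(number: str) -> bool:
    """True when the last digit is a correct Verhoeff check digit for the rest."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def looks_like_aadhaar(value: str) -> bool:
    """12 ASCII digits (spaces ignored) that pass the Verhoeff check."""
    digits = value.replace(" ", "")
    return (
        len(digits) == 12
        and digits.isascii()
        and digits.isdigit()
        and verhoeff_valid(digits)
    )
```

Because Verhoeff detects every single-digit error and every adjacent transposition, a mistyped or casually altered number fails this check without any lookup against UIDAI.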
|
| 155 |
+
|
| 156 |
+
### `audit_report.py` - bank-letterhead PDF
|
| 157 |
+
|
| 158 |
+
`build_pdf_report(report, source_path) → bytes`. Multi-page PDF with header letterhead, metadata table, colour-coded risk verdict box, sub-score breakdown table, evidence list, embedded forensic heatmaps. Built with ReportLab Platypus.
|
| 159 |
+
|
| 160 |
+
### `app.py` - Streamlit UI (6 tabs)
|
| 161 |
+
|
| 162 |
+
| Tab | Function |
|
| 163 |
+
|---|---|
|
| 164 |
+
| 1. Single-document analysis | Risk band, sub-score chart, ELA / copy-move / noise heatmaps, AI-gen FFT profile, ML/CNN predictions, downloadable JSON + PDF. |
|
| 165 |
+
| 2. Cross-document KYC | Upload 2–4 docs for one applicant; identity-field consistency table. |
|
| 166 |
+
| 3. Batch audit | Scan a folder; sortable risk table + CSV download. |
|
| 167 |
+
| 4. Compliance & Audit Pack | KYC validation, PII auto-redaction, RBI compliance PDF, **provenance ledger view with chain re-verify**. |
|
| 168 |
+
| 5. Live Tamper Forge Studio | Pick clean sample → choose technique + intensity → watch BankShield localise the tamper with per-detector scorecard + heatmap overlays. |
|
| 169 |
+
| 6. Fraud Ring Network | Upload N applicants → similarity graph with red ring members + ring summary cards. |
|
| 170 |
|
| 171 |
---
|
| 172 |
|
| 173 |
## Pipeline architecture
|
| 174 |
|
| 175 |
```
|
| 176 |
+
               ┌────────────────────────────────────────┐
|
| 177 |
+
               │    PRESENTATION (Streamlit, 6 tabs)    │
|
| 178 |
+
               └───────────────────┬────────────────────┘
|
| 179 |
+
                                   ▼
|
| 180 |
+
               ┌──────────────────────────────────────────────┐
|
| 181 |
+
               │  FORENSICS CORE                              │
|
| 182 |
+
               │  ELA · Copy-move · Noise · EXIF · OCR · PDF  │
|
| 183 |
+
               │  + Random Forest (11-d feature vector)       │
|
| 184 |
+
               │  + MobileNetV2 CNN (CASIA v2 fine-tuned)     │
|
| 185 |
+
               │  + AI-Gen Detector (radial FFT)              │
|
| 186 |
+
               └───────────────────┬──────────────────────────┘
|
| 187 |
+
                                   ▼
|
| 188 |
+
               ┌──────────────────────────────────────────────┐
|
| 189 |
+
               │  ENSEMBLE FUSION                             │
|
| 190 |
+
               │  weighted blend → RF overlay → CNN overlay   │
|
| 191 |
+
               │  → AI-gen overlay (capped at +20%)           │
|
| 192 |
+
               └───────────────────┬──────────────────────────┘
|
| 193 |
+
                                   ▼
|
| 194 |
+
       ┌──────────────────┬────────┴───────┬─────────────────┐
|
| 195 |
+
       ▼                  ▼                ▼                 ▼
|
| 196 |
+
┌──────────────┐ ┌────────────────┐ ┌──────────────┐ ┌────────────────┐
|
| 197 |
+
│ COMPLIANCE   │ │ FRAUD-RING     │ │ PROVENANCE   │ │ TAMPER FORGE   │
|
| 198 |
+
│ IFSC · PAN · │ │ NetworkX graph │ │ SHA-256 hash │ │ Adversarial    │
|
| 199 |
+
│ Aadhaar · PII│ │ clique detect  │ │ chain ledger │ │ validation     │
|
| 200 |
+
└──────┬───────┘ └────────┬───────┘ └──────┬───────┘ └────────────────┘
|
| 201 |
+
       │                  │                │
|
| 202 |
+
       └───────────┬──────┴────────────────┘
|
| 203 |
+
                   ▼
|
| 204 |
+
  ┌──────────────────────────────────┐
|
| 205 |
+
  │  OUTPUT                          │
|
| 206 |
+
  │  Risk band · Evidence list       │
|
| 207 |
+
  │  Bank-letterhead audit PDF       │
|
| 208 |
+
  │  RBI compliance PDF · Audit JSON │
|
| 209 |
+
  │  Tamper-evident ledger entry     │
|
| 210 |
+
  └──────────────────────────────────┘
|
| 211 |
```
|
| 212 |
|
| 213 |
+
Default weight vector (`forensics.WEIGHTS`): `{ela: 0.20, copy_move: 0.25, noise: 0.20, exif: 0.15, text_rules: 0.20}`. The Random Forest probability, when available, is blended 50/50 with the rule-based score. The CNN probability is blended at a weight between 0.4 and 0.7 based on the CNN's reported validation AUC. The AI-gen probability is applied as a final overlay capped at +20%.
|
| 214 |
+
|
| 215 |
+
Band mapping: `0–0.30 LOW · 0.30–0.50 MEDIUM · 0.50–0.75 HIGH · 0.75+ CRITICAL`.
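
The fusion and banding rules can be sketched as follows. The weights and band edges are the ones stated in this README; everything else is an assumption made for the example: the function names `fuse` and `cnn_weight`, the linear ramp from validation AUC to CNN weight, reading "+20%" as an absolute +0.20 on the score, and treating each band's lower edge as inclusive. `forensics.py` may implement these differently.

```python
WEIGHTS = {"ela": 0.20, "copy_move": 0.25, "noise": 0.20, "exif": 0.15, "text_rules": 0.20}


def band(score):
    """Map a 0..1 score to a risk band (lower edge inclusive: an assumption)."""
    if score >= 0.75:
        return "CRITICAL"
    if score >= 0.50:
        return "HIGH"
    if score >= 0.30:
        return "MEDIUM"
    return "LOW"


def cnn_weight(val_auc):
    """Assumed mapping: 0.4 at AUC 0.5, rising linearly to 0.7 at AUC 1.0."""
    t = min(max((val_auc - 0.5) / 0.5, 0.0), 1.0)
    return 0.4 + 0.3 * t


def fuse(sub_scores, rf_prob=None, cnn_prob=None, cnn_auc=0.5, ai_prob=None):
    """Rule-based weighted blend, then optional RF, CNN, and AI-gen overlays."""
    score = sum(w * sub_scores.get(name, 0.0) for name, w in WEIGHTS.items())
    if rf_prob is not None:
        score = 0.5 * score + 0.5 * rf_prob      # 50/50 blend with the Random Forest
    if cnn_prob is not None:
        w = cnn_weight(cnn_auc)
        score = (1.0 - w) * score + w * cnn_prob
    if ai_prob is not None:
        score += 0.20 * ai_prob                  # overlay can add at most 0.20
    return min(max(score, 0.0), 1.0)


example = fuse(
    {"ela": 0.6, "copy_move": 0.8, "noise": 0.4, "exif": 0.2, "text_rules": 0.5},
    rf_prob=0.7,
)
example_band = band(example)
```

Applying the overlays in sequence means later stages can outweigh earlier ones; with a CNN weight of 0.7, for instance, the rule-based score contributes at most 15% of the result once the Random Forest blend is also present.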
|
| 216 |
+
|
| 217 |
+
See [`ARCHITECTURE.md`](ARCHITECTURE.md) for the full reference.
|
| 218 |
|
| 219 |
---
|
| 220 |
|
| 221 |
## Detection coverage
|
| 222 |
|
| 223 |
**Image tampering**
|
| 224 |
+
|
| 225 |
- Copy-move forgery: ORB keypoint matching with distance filter
|
| 226 |
- Image splicing: block-wise noise inconsistency via Laplacian variance
|
| 227 |
- Text edits / amount tampering: Error Level Analysis
|
| 228 |
- Photoshop / GIMP / Snapseed edits: EXIF Software-tag string match
|
| 229 |
- Timestamp inconsistencies: DateTime vs DateTimeOriginal comparison
|
| 230 |
|
| 231 |
+
**AI-generated content**
|
| 232 |
+
|
| 233 |
+
- Sora / Midjourney / Stable Diffusion / DALL-E outputs: FFT spectral analysis
|
| 234 |
+
- High-frequency suppression (1/f decay deviation)
|
| 235 |
+
- Periodic checkerboard peaks from upsampling stride
|
| 236 |
+
- Non-standard JPEG quantization tables
|
| 237 |
+
|
| 238 |
**PDF tampering**
|
| 239 |
+
|
| 240 |
- Incremental edits: multi-`%%EOF` marker counting
|
| 241 |
+
- Consumer-tool fingerprints: iLovePDF, Smallpdf, PDFescape, Sejda, Foxit Phantom
|
| 242 |
- Producer/Creator mismatch: flags re-processed PDFs
|
| 243 |
+
- Inserted text: embedded-font count anomalies
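
The `%%EOF` signal from the list above is simple to illustrate: each incremental save appends a new trailer ending in its own `%%EOF`, so more than one marker suggests the file was modified after first being written. The helper names below are hypothetical and the byte strings are toy stand-ins for real PDFs; this is not the implementation of `pdf_structural_audit`.

```python
def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count end-of-file markers in the raw PDF byte stream."""
    return pdf_bytes.count(b"%%EOF")


def looks_incrementally_updated(pdf_bytes: bytes) -> bool:
    """More than one marker suggests at least one incremental save."""
    return count_eof_markers(pdf_bytes) > 1


# Toy byte strings that mimic the overall layout of a PDF file.
original = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\nstartxref\n9\n%%EOF\n"
edited = original + b"2 0 obj\n<<>>\nendobj\ntrailer\n<<>>\nstartxref\n70\n%%EOF\n"
```

Digital signing and form filling also produce incremental updates, so a count above one is a reason to look closer rather than proof of tampering.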
|
| 244 |
|
| 245 |
+
**Cross-document & fraud-ring**
|
|
|
|
|
|
|
|
|
|
| 246 |
|
|
|
|
| 247 |
- Name / DOB / address fuzzy match across multiple documents
|
| 248 |
+
- Per-field weighted scoring with green / yellow / red status
|
| 249 |
+
- Cross-applicant similarity graph; cliques ≥3 = suspected fraud ring
|
| 250 |
+
- Ring bands: CRITICAL (≥5 members) / HIGH (3–4) / MEDIUM (2)
|
| 251 |
|
| 252 |
**KYC validation**
|
| 253 |
+
|
| 254 |
- IFSC: format + RBI bank-code list (36 banks)
|
| 255 |
- PAN: format + entity-type character (10 types per income-tax dept spec)
|
| 256 |
- Aadhaar: 12-digit format + UIDAI Verhoeff checksum
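
The IFSC and PAN format rules above can be sketched with regular expressions. The patterns follow the published formats (IFSC: four letters, a literal zero, six alphanumerics; PAN: five letters, four digits, one letter, with the fourth character giving the entity type). The function names are hypothetical, and `KNOWN_BANK_CODES` is a three-entry illustrative subset, not the 36-bank list used by `compliance.py`.

```python
import re

IFSC_RE = re.compile(r"^[A-Z]{4}0[A-Z0-9]{6}$")
# Fourth character is the entity type: one of the 10 letters A B C F G H J L P T.
PAN_RE = re.compile(r"^[A-Z]{3}[ABCFGHJLPT][A-Z][0-9]{4}[A-Z]$")

KNOWN_BANK_CODES = {"SBIN", "HDFC", "ICIC"}  # illustrative subset only


def is_valid_ifsc(code: str, check_bank: bool = True) -> bool:
    """Format check, optionally followed by a bank-code lookup."""
    code = code.strip().upper()
    if not IFSC_RE.match(code):
        return False
    return code[:4] in KNOWN_BANK_CODES if check_bank else True


def is_valid_pan(code: str) -> bool:
    """Format check including the entity-type character."""
    return bool(PAN_RE.match(code.strip().upper()))
```

Separating the format check from the bank-code lookup lets a caller report "well-formed but unknown bank" differently from "malformed", which is a more useful message for an underwriter.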
|
| 257 |
|
| 258 |
+
**PII redaction & audit**
|
| 259 |
+
|
| 260 |
- Aadhaar, PAN, IFSC, account-number masking
|
| 261 |
+
- PDF redaction with black rectangle overlays
|
| 262 |
+
- SHA-256 hash-chained provenance ledger (RBI Para 67 compliant)
|
| 263 |
|
| 264 |
---
|
| 265 |
|
|
|
|
| 275 |
Browser opens at `http://localhost:8501`.
|
| 276 |
|
| 277 |
For full OCR text-rule support, install Tesseract OCR:
|
| 278 |
+
|
| 279 |
- Windows: https://github.com/UB-Mannheim/tesseract/wiki
|
| 280 |
- macOS: `brew install tesseract`
|
| 281 |
+
- Linux: `sudo apt-get install tesseract-ocr libtesseract-dev`
|
| 282 |
|
| 283 |
The app auto-detects Tesseract on standard Windows install paths; no environment variable required.
|
| 284 |
|
|
|
|
|
|
|
| 285 |
---
|
| 286 |
|
| 287 |
## Deployment
|
| 288 |
|
| 289 |
+
The repository is deployment-ready for both **Streamlit Community Cloud** and **Hugging Face Spaces**. The YAML frontmatter at the top of this README configures the HF Space; `packages.txt` ensures Tesseract is installed on the build VM; `requirements.txt` covers Python dependencies.
|
| 290 |
|
| 291 |
+
Live deployment: https://huggingface.co/spaces/SpandanM110/DocSentry
|
| 292 |
|
| 293 |
---
|
| 294 |
|
| 295 |
## Training your own model
|
| 296 |
|
| 297 |
+
Drop labelled data into `data/images/originals/` and `data/images/tampered/`, open `docsentry_master.ipynb`, run section 6. A Random Forest auto-trains on whatever you put there and saves to `models/forgery_rf.joblib`. The Streamlit app picks it up automatically on next restart.
|
|
|
|
|
|
|
| 298 |
|
| 299 |
+
For a CNN upgrade, set `TRAIN_CNN = True` in section 7 and run on a Colab T4 GPU (free tier). Saves `models/forgery_cnn.keras` + `models/forgery_cnn.meta.json`. The app loads it lazily on first request.
|
| 300 |
|
| 301 |
---
|
| 302 |
|
| 303 |
## Dependencies
|
| 304 |
|
| 305 |
+
OpenCV (cv2), Pillow (PIL), scikit-image, scikit-learn, joblib, PyMuPDF (fitz), pdfplumber, pikepdf, pytesseract, python-dateutil, Streamlit, streamlit-drawable-canvas, ReportLab, NumPy, pandas, matplotlib, NetworkX. Optional: TensorFlow (only required for the CNN path).
|
| 306 |
|
| 307 |
All pip-installable. No GPU required for the default pipeline.
|
| 308 |
|
|
|
|
| 310 |
|
| 311 |
## License
|
| 312 |
|
| 313 |
+
MIT; see `LICENSE`. The MIT license covers the source code in this repository. Third-party datasets and pretrained models bundled or referenced (CASIA v2, IDRBT cheque dataset, AgamiAI Indian Bank Statements, MobileNetV2 ImageNet weights) are governed by their own terms; those notices are reproduced in `LICENSE` below the MIT block.
|
| 314 |
|
| 315 |
---
|
| 316 |
|