---
title: DocSentry
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.32.0
app_file: app.py
pinned: false
license: mit
short_description: Document forensics + fraud-ring detection for Indian banks
---

# BankShield

**Real-Time Document Forensics, AI-Generated Forgery Detection, and Cross-Applicant Fraud-Ring Intelligence for Indian Bank Underwriting.**

BankShield catches tampered, forged, and AI-generated documents the moment they reach the underwriter — and surfaces organised fraud rings that span multiple applicants. Six independent detection layers fuse into a single calibrated risk score, with explainable evidence, tamper-evident audit trails, and RBI-format compliance reports out of the box.

100% open source. No paid APIs. No external LLM calls. CPU-only by default. Runs locally inside the bank's perimeter, so PII never leaves it.

- **Live demo:** https://huggingface.co/spaces/SpandanM110/DocSentry
- **Source:** https://github.com/SpandanM110/Doc-Sentry
- **Architecture reference:** see [`ARCHITECTURE.md`](ARCHITECTURE.md)

---

## The six pillars

| Pillar | Module | What it does |
|---|---|---|
| **Image Forensics** | `forensics.py` | ELA, copy-move (ORB), Laplacian noise inconsistency, EXIF audit |
| **PDF Structural Audit** | `forensics.py` | EOF marker counting, producer/creator drift, embedded-font anomalies, consumer-tool fingerprints |
| **OCR + Financial Rules** | `forensics.py` | Tesseract OCR + IFSC / PAN / Aadhaar / date monotonicity / amount sanity |
| **AI-Generated Detection** *(new)* | `ai_detector.py` | Radial FFT spectral analysis — catches Sora / Midjourney / Stable Diffusion outputs |
| **Fraud Ring Network** *(new)* | `fraud_ring.py` | NetworkX similarity graph across applicants; clique discovery flags organised fraud rings |
| **Provenance Ledger** *(new)* | `provenance.py` | SHA-256 hash chain over every analysis; O(N) verifiable; RBI Para 67 compliant |

Plus the **Live Tamper Forge Studio** (`tampering.py`) — an adversarial-validation harness built directly into the dashboard.

---

## Repository layout

```
Doc-Sentry/
├── app.py                       Streamlit web UI (6 tabs)
├── forensics.py                 Core detection engine + ensemble fusion
├── ai_detector.py               AI-generated forgery detector (FFT spectral)
├── fraud_ring.py                Cross-applicant similarity graph + clique detection
├── provenance.py                Tamper-evident SHA-256 hash chain
├── tampering.py                 Forge Studio adversarial harness
├── compliance.py                KYC validators, PII redaction, RBI report builder
├── audit_report.py              Bank-letterhead PDF report builder
├── docsentry_master.ipynb       Single source-of-truth Jupyter notebook
│
├── requirements.txt             Python dependencies
├── packages.txt                 System packages (Tesseract) for Streamlit Cloud / HF Spaces
├── .streamlit/config.toml       Streamlit theme + server config
│
├── sample_data/                 26 demo files for the live app
│   ├── originals/               12 genuine documents
│   ├── tampered/                12 tampered documents
│   └── pdfs/                    2 PDFs (1 genuine, 1 tampered)
│
├── models/                      Trained model artefacts
│   ├── forgery_rf.joblib        Random Forest classifier
│   └── forgery_cnn.keras        MobileNetV2 fine-tuned on CASIA v2 (optional)
│
├── ARCHITECTURE.md              Full architecture reference
├── SUBMISSION.md                Hackathon submission packet
├── BankShield_Pitch.pptx        Pitch deck (15 slides)
├── README.md                    This file
├── LICENSE                      MIT license + third-party notices
└── data/                        (gitignored) full training data + downloaded datasets
```

---

## Module reference

### `forensics.py` — detection engine

The core analytical module. Stateless functions; all logic is independently testable.

| Function | Returns | Description |
|---|---|---|
| `analyse_document(path)` | dict | End-to-end pipeline. Auto-detects type, runs all relevant detectors, blends Random Forest + CNN + AI-gen predictions, auto-logs to provenance ledger. Primary entry point. |
| `score_image(path)` | (float, dict, list) | Composite forensic score for an image. Returns total, sub-scores by detector, and EXIF flags. |
| `error_level_analysis(path, quality=90)` | (PIL.Image, float) | ELA visualisation + scalar suspicion score. |
| `copy_move_detect(path)` | (np.ndarray, int, list) | ORB-based copy-move detection. Returns annotated viz, match count, raw matches. |
| `noise_inconsistency(path, block=32)` | (np.ndarray, float) | Per-block Laplacian variance heatmap + outlier ratio. |
| `exif_sanity(path)` | list of str | EXIF audit: missing EXIF, editor signatures, timestamp inconsistencies. |
| `pdf_structural_audit(path)` | dict | `%%EOF` markers, producer/creator drift, consumer-tool fingerprints. |
| `pdf_font_audit(path)` | dict | Embedded font listing + count anomalies. |
| `ocr_text(path)` | str | Tesseract OCR with auto-fallback. |
| `text_rule_checks(text)` | dict | Date monotonicity, amount sanity, IFSC format, account-number patterns. |
| `extract_features(path)` | dict | 11-feature vector for the Random Forest. |
| `predict_with_model(path)` | dict / None | Random Forest tamper probability + verdict. |
| `predict_with_cnn(path)` | dict / None | MobileNetV2 CNN inference (lazy-loaded). |
| `extract_identity_fields(path)` | (dict, str) | Pulls name, DOB, address, IFSC, account, amounts. |
| `cross_doc_consistency(paths)` | dict | Per-field similarity across 2+ documents. |
| `generate_insights(score, sub, flags)` | dict | Numeric → underwriter-readable bullets + recommended action. |
| `band(score)` | str | Maps a float to LOW / MEDIUM / HIGH / CRITICAL. |
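
As an illustration of the date-monotonicity rule listed for `text_rule_checks`, here is a minimal sketch. The function name, the `DD/MM/YYYY` date format, and the returned keys are assumptions made for this example and are not taken from `forensics.py`. It also assumes the statement is listed oldest-first.

```python
import re
from datetime import datetime

# Assumed date format: DD/MM/YYYY or DD-MM-YYYY.
DATE_RE = re.compile(r"\b(\d{2})[/-](\d{2})[/-](\d{4})\b")

def date_monotonicity(text):
    """Report positions where a date is earlier than the date before it."""
    dates = []
    for m in DATE_RE.finditer(text):
        day, month, year = (int(g) for g in m.groups())
        try:
            dates.append(datetime(year, month, day))
        except ValueError:
            continue  # skip impossible dates such as 31/02/2024
    breaks = [i for i in range(1, len(dates)) if dates[i] < dates[i - 1]]
    return {
        "n_dates": len(dates),
        "out_of_order_at": breaks,
        "monotonic": not breaks,
    }

statement = (
    "01/04/2024 NEFT 5,000\n"
    "03/04/2024 UPI 250\n"
    "02/04/2024 ATM 1,000\n"
)
print(date_monotonicity(statement))
```

An out-of-order row is only a hint: a statement printed newest-first would need the comparison reversed.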

### `ai_detector.py` — AI-generated forgery detection

| Function | Description |
|---|---|
| `detect_ai_generated(path)` | Full pipeline → probability + verdict + flags + FFT profile. |
| `radial_fft_profile(gray)` | Radially-averaged log-magnitude FFT spectrum. |
| `high_freq_attenuation(profile)` | Smoothness score — low for real scans, high for AI outputs. |
| `spectral_peak_score(profile)` | Counts checkerboard-stride peaks in the high-frequency band. |
| `jpeg_quantization_check(path)` | Inspects JPEG quantization tables for synthetic-media signatures. |

The AI-gen probability is blended into the main risk score as an overlay capped at +20%, so AI-gen signals can surface synthetic media without dominating the classical detectors.
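
The radially averaged log-magnitude spectrum described for `radial_fft_profile` can be sketched with NumPy as below. This is an illustrative approximation, not the code in `ai_detector.py`: the function names, the integer radius binning, and the band boundaries used in `high_freq_ratio` are all assumptions.

```python
import numpy as np

def radial_log_spectrum(gray):
    """Average log(1 + |FFT|) over rings of equal integer radius."""
    spectrum = np.fft.fftshift(np.fft.fft2(gray.astype(np.float64)))
    log_mag = np.log1p(np.abs(spectrum))
    h, w = gray.shape
    cy, cx = h // 2, w // 2  # DC term sits here after fftshift
    y, x = np.indices((h, w))
    radius = np.hypot(y - cy, x - cx).astype(np.int64)
    max_r = min(cy, cx)
    inside = radius < max_r
    sums = np.bincount(radius[inside], weights=log_mag[inside], minlength=max_r)
    counts = np.bincount(radius[inside], minlength=max_r)
    return sums / counts

def high_freq_ratio(profile):
    """Mean of the outer quarter of the profile over the inner quarter.

    The band limits are an assumption. A lower ratio means the high
    frequencies are weaker relative to the low ones.
    """
    n = len(profile)
    outer = profile[int(0.75 * n):].mean()
    inner = profile[1:max(2, n // 4)].mean()  # skip the DC bin
    return outer / (inner + 1e-12)

rng = np.random.default_rng(0)
profile = radial_log_spectrum(rng.normal(size=(64, 64)))
print(len(profile), round(float(high_freq_ratio(profile)), 3))
```

White noise has a roughly flat spectrum, so its ratio stays near 1. Natural images and scans fall off with frequency, and any threshold separating them from synthetic images would need to be tuned on labelled data.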

### `fraud_ring.py` — cross-applicant fraud-ring detection

| Function | Description |
|---|---|
| `extract_applicant_fields(path)` | OCR + regex pull of name / DOB / address / phone / IFSC / account / employer. |
| `compare_applicants(a, b)` | Per-field similarity + weighted score. |
| `build_fraud_graph(applicants)` | NetworkX similarity graph (edges weighted by shared signals). |
| `detect_rings(G, min_size=3)` | Connected components above threshold → suspected fraud rings. |
| `visualize_graph(G, rings)` | Force-directed graph with ring members in red. |
| `fraud_summary(G, rings, applicants)` | Structured summary for the Streamlit UI. |
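
The ring-detection idea (link applicants whose fields are similar, then report connected components of at least `min_size` members) can be sketched without any dependencies. The module itself is described as using NetworkX. The version below swaps in `difflib` and a plain adjacency dict so that it runs on the standard library alone. The field weights, the 0.8 threshold, and all function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical field weights; the real weighting is not documented here.
FIELD_WEIGHTS = {"address": 0.4, "phone": 0.4, "employer": 0.2}

def similarity(a, b):
    """Weighted fuzzy similarity over fields that both applicants have."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        va, vb = a.get(field, ""), b.get(field, "")
        if va and vb:
            total += weight * SequenceMatcher(None, va.lower(), vb.lower()).ratio()
    return total

def build_adjacency(applicants, threshold=0.8):
    """Connect two applicants when their similarity reaches the threshold."""
    adj = {name: set() for name in applicants}
    for u, v in combinations(applicants, 2):
        if similarity(applicants[u], applicants[v]) >= threshold:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def find_rings(adj, min_size=3):
    """Return connected components with at least min_size members."""
    seen, rings = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        if len(component) >= min_size:
            rings.append(sorted(component))
    return rings

shared = {"address": "12 MG Road, Pune", "phone": "9876543210", "employer": "Acme Exports"}
applicants = {
    "A": dict(shared),
    "B": dict(shared),
    "C": dict(shared),
    "D": {"address": "7 Park Street, Kolkata", "employer": "Zed Textiles"},
}
print(find_rings(build_adjacency(applicants)))
```

Connected components are more permissive than cliques: a chain A–B–C counts as a ring even if A and C share nothing directly.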

### `provenance.py` — tamper-evident audit ledger

| Function | Description |
|---|---|
| `log_analysis(...)` | Appends a SHA-256 hash-chained record to the SQLite ledger. |
| `verify_chain()` | Walks every record in O(N); pinpoints the first broken record. |
| `chain_stats()` | Count, first/last timestamps, breakdown by risk band, chain status. |
| `fetch_ledger(limit)` | Returns the latest N entries. |
| `ledger_dataframe(limit)` | Pandas DataFrame view (for Streamlit display). |

Each record's `record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)` — retroactive edits break the chain mathematically.
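
A minimal in-memory sketch of that hash chain follows. It uses a Python list instead of the SQLite ledger, and the genesis value, the `|` join, and the four-decimal score formatting are assumptions rather than the exact serialisation in `provenance.py`.

```python
import hashlib

GENESIS = "0" * 64  # assumed prev_hash for the first record

def record_hash(timestamp, doc_sha256, risk_band, risk_score, prev_hash):
    """SHA-256 over the pipe-joined record fields."""
    payload = "|".join(
        [timestamp, doc_sha256, risk_band, f"{risk_score:.4f}", prev_hash]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_record(chain, timestamp, doc_sha256, risk_band, risk_score):
    """Append a record linked to the hash of the previous one."""
    prev_hash = chain[-1]["record_hash"] if chain else GENESIS
    rec = {
        "timestamp": timestamp,
        "doc_sha256": doc_sha256,
        "risk_band": risk_band,
        "risk_score": risk_score,
        "prev_hash": prev_hash,
    }
    rec["record_hash"] = record_hash(
        timestamp, doc_sha256, risk_band, risk_score, prev_hash
    )
    chain.append(rec)
    return rec

def first_broken_index(chain):
    """Walk the chain once; return the index of the first bad record or None."""
    prev = GENESIS
    for i, rec in enumerate(chain):
        expected = record_hash(
            rec["timestamp"], rec["doc_sha256"], rec["risk_band"],
            rec["risk_score"], prev,
        )
        if rec["prev_hash"] != prev or rec["record_hash"] != expected:
            return i
        prev = rec["record_hash"]
    return None

chain = []
for n, (band, score) in enumerate([("LOW", 0.12), ("HIGH", 0.61), ("LOW", 0.08)]):
    doc_hash = hashlib.sha256(f"doc-{n}".encode()).hexdigest()
    append_record(chain, f"2024-04-0{n + 1}T10:00:00Z", doc_hash, band, score)

print(first_broken_index(chain))   # intact chain
chain[1]["risk_band"] = "LOW"      # retroactive edit
print(first_broken_index(chain))   # index of the edited record
```

Editing a stored field changes the hash that record should have, so verification stops at that record. Rewriting the record's own hash to match would then break the `prev_hash` link of the next record.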

### `tampering.py` — adversarial Forge Studio

Exposes `tamper_copy_move`, `tamper_text_edit`, `tamper_splice`, `tamper_compression`, `tamper_metadata_strip`, `tamper_custom_region`, `tamper_chain`, `annotate_before_after`, `overlay_heatmap_on_image`, and `detector_scorecard`. Tab 5 uses these to apply controlled forgeries and immediately re-run detection.

### `compliance.py` — KYC + regulatory

| Function | Description |
|---|---|
| `validate_ifsc(code)` | Format check + RBI bank-code lookup (36 banks). |
| `validate_pan(code)` | Format + entity-type character validation. |
| `validate_aadhaar(num)` | 12-digit format + UIDAI Verhoeff checksum. |
| `redact_text(text)` | Masks IFSC, PAN, Aadhaar, account numbers. |
| `redact_pdf(input_path, output_path)` | PII black-box overlays via PyMuPDF text-bbox. |
| `extract_pii_fields(path)` | Pulls all PII candidates from any document. |
| `build_compliance_report(...)` | RBI Master-Direction-format audit PDF (5 sections). |

### `audit_report.py` — bank-letterhead PDF

`build_pdf_report(report, source_path) → bytes`. Multi-page PDF with header letterhead, metadata table, colour-coded risk verdict box, sub-score breakdown table, evidence list, embedded forensic heatmaps. Built with ReportLab Platypus.

### `app.py` — Streamlit UI (6 tabs)

| Tab | Function |
|---|---|
| 1. Single-document analysis | Risk band, sub-score chart, ELA / copy-move / noise heatmaps, AI-gen FFT profile, ML/CNN predictions, downloadable JSON + PDF. |
| 2. Cross-document KYC | Upload 2–4 docs for one applicant; identity-field consistency table. |
| 3. Batch audit | Scan a folder; sortable risk table + CSV download. |
| 4. Compliance & Audit Pack | KYC validation, PII auto-redaction, RBI compliance PDF, **provenance ledger view with chain re-verify**. |
| 5. Live Tamper Forge Studio | Pick clean sample → choose technique + intensity → watch BankShield localise the tamper with per-detector scorecard + heatmap overlays. |
| 6. Fraud Ring Network | Upload N applicants → similarity graph with red ring members + ring summary cards. |

---

## Pipeline architecture

```
                    ┌────────────────────────────────────────┐
                    │   PRESENTATION (Streamlit, 6 tabs)     │
                    └──────────────────┬─────────────────────┘
                                       ▼
              ┌──────────────────────────────────────────────┐
              │   FORENSICS CORE                             │
              │   ELA · Copy-move · Noise · EXIF · OCR · PDF │
              │   + Random Forest (11-d feature vector)      │
              │   + MobileNetV2 CNN (CASIA v2 fine-tuned)    │
              │   + AI-Gen Detector (radial FFT)             │
              └──────────────────┬───────────────────────────┘
                                 ▼
              ┌──────────────────────────────────────────────┐
              │   ENSEMBLE FUSION                            │
              │   weighted blend → RF overlay → CNN overlay  │
              │   → AI-gen overlay (capped at +20%)          │
              └──────────────────┬───────────────────────────┘
                                 ▼
        ┌──────────────────┬─────┴─────┬──────────────────┐
        ▼                  ▼           ▼                  ▼
┌──────────────┐  ┌────────────────┐ ┌──────────────┐ ┌────────────────┐
│ COMPLIANCE   │  │ FRAUD-RING     │ │ PROVENANCE   │ │ TAMPER FORGE   │
│ IFSC · PAN · │  │ NetworkX graph │ │ SHA-256 hash │ │ Adversarial    │
│ Aadhaar · PII│  │ clique detect  │ │ chain ledger │ │ validation     │
└──────┬───────┘  └────────┬───────┘ └──────┬───────┘ └────────────────┘
       │                   │                │
       └────────────┬──────┴────────────────┘
                    ▼
         ┌────────────────────────────────────┐
         │   OUTPUT                           │
         │   Risk band · Evidence list        │
         │   Bank-letterhead audit PDF        │
         │   RBI compliance PDF · Audit JSON  │
         │   Tamper-evident ledger entry      │
         └────────────────────────────────────┘
```

Default weight vector (`forensics.WEIGHTS`): `{ela: 0.20, copy_move: 0.25, noise: 0.20, exif: 0.15, text_rules: 0.20}`. The Random Forest probability, when available, is blended 50/50 with the rule-based score. The CNN probability is blended at a weight between 0.4 and 0.7 based on the CNN's reported validation AUC. The AI-gen probability is applied as a final overlay capped at +20%.
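
The fusion order described above can be sketched as follows. The weight vector and the 50/50 Random Forest blend come from the text. The linear AUC-to-weight mapping and the reading of "+20%" as an additive 0.20 on a 0–1 score are assumptions, and `fuse` and `cnn_weight` are hypothetical names.

```python
WEIGHTS = {"ela": 0.20, "copy_move": 0.25, "noise": 0.20,
           "exif": 0.15, "text_rules": 0.20}

def cnn_weight(val_auc, lo=0.4, hi=0.7):
    """Assumed mapping: AUC 0.5 -> lo, AUC 1.0 -> hi, linear in between."""
    t = min(max((val_auc - 0.5) / 0.5, 0.0), 1.0)
    return lo + t * (hi - lo)

def fuse(sub_scores, rf_prob=None, cnn_prob=None, cnn_auc=None, ai_prob=None):
    """Weighted rule score, then RF, CNN, and AI-gen overlays in that order."""
    score = sum(w * sub_scores.get(name, 0.0) for name, w in WEIGHTS.items())
    if rf_prob is not None:
        score = 0.5 * score + 0.5 * rf_prob           # 50/50 blend
    if cnn_prob is not None and cnn_auc is not None:
        w = cnn_weight(cnn_auc)
        score = (1.0 - w) * score + w * cnn_prob      # AUC-dependent blend
    if ai_prob is not None:
        score += min(0.20 * ai_prob, 0.20)            # overlay capped at +0.20
    return min(max(score, 0.0), 1.0)

sub = {"ela": 0.5, "copy_move": 0.5, "noise": 0.5, "exif": 0.5, "text_rules": 0.5}
print(round(fuse(sub), 3))
print(round(fuse(sub, rf_prob=1.0), 3))
print(round(fuse(sub, ai_prob=1.0), 3))
```

Because the AI-gen term is added last and capped, a confident AI-gen signal can lift a document by at most 0.20 in this sketch, however low the classical detectors score it.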

Band mapping: `0–0.30 LOW · 0.30–0.50 MEDIUM · 0.50–0.75 HIGH · 0.75+ CRITICAL`.
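
As a sketch, the band mapping reduces to three threshold checks. The README does not say which band owns a boundary value, so treating each lower bound as inclusive is an assumption here, and `risk_band` is a hypothetical name.

```python
def risk_band(score):
    """Map a 0-1 risk score to a band; lower bounds assumed inclusive."""
    if score >= 0.75:
        return "CRITICAL"
    if score >= 0.50:
        return "HIGH"
    if score >= 0.30:
        return "MEDIUM"
    return "LOW"

for s in (0.10, 0.30, 0.62, 0.90):
    print(s, risk_band(s))
```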

See [`ARCHITECTURE.md`](ARCHITECTURE.md) for the full reference.

---

## Detection coverage

**Image tampering**

- Copy-move forgery — ORB keypoint matching with distance filter
- Image splicing — block-wise noise inconsistency via Laplacian variance
- Text edits / amount tampering — Error Level Analysis
- Photoshop / GIMP / Snapseed edits — EXIF Software-tag string match
- Timestamp inconsistencies — DateTime vs DateTimeOriginal comparison

**AI-generated content**

- Sora / Midjourney / Stable Diffusion / DALL-E outputs — FFT spectral analysis
- High-frequency suppression (1/f decay deviation)
- Periodic checkerboard peaks from upsampling stride
- Non-standard JPEG quantization tables

**PDF tampering**

- Incremental edits — multi-`%%EOF` marker counting
- Consumer-tool fingerprints — iLovePDF, Smallpdf, PDFescape, Sejda, Foxit Phantom
- Producer/Creator mismatch — flags re-processed PDFs
- Inserted text — embedded-font count anomalies
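
The multi-`%%EOF` check can be illustrated in a few lines. This sketch works on raw bytes and uses hypothetical function names. It is not the code in `pdf_structural_audit`.

```python
def count_eof_markers(pdf_bytes):
    """Count end-of-file markers in the raw PDF bytes."""
    return pdf_bytes.count(b"%%EOF")

def looks_incrementally_updated(pdf_bytes):
    """More than one marker suggests at least one appended revision."""
    return count_eof_markers(pdf_bytes) > 1

original = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"
edited = original + b"2 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"

print(count_eof_markers(original), looks_incrementally_updated(original))
print(count_eof_markers(edited), looks_incrementally_updated(edited))
```

An extra marker is a signal, not proof of tampering: legitimate workflows such as digital signing or form filling also append incremental revisions.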

**Cross-document & fraud-ring**

- Name / DOB / address fuzzy match across multiple documents
- Per-field weighted scoring with green / yellow / red status
- Cross-applicant similarity graph; cliques ≥3 = suspected fraud ring
- Ring bands: CRITICAL (≥5 members) / HIGH (3–4) / MEDIUM (2)

**KYC validation**

- IFSC: format + RBI bank-code list (36 banks)
- PAN: format + entity-type character (10 types per income-tax dept spec)
- Aadhaar: 12-digit format + UIDAI Verhoeff checksum
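
A hedged sketch of the three format checks follows. The Verhoeff tables are the standard published ones. The function names are hypothetical, the regular expressions are one reasonable reading of the IFSC and PAN formats, and the RBI bank-code lookup is not reproduced.

```python
import re

IFSC_RE = re.compile(r"^[A-Z]{4}0[A-Z0-9]{6}$")
# Fourth character is the entity type (10 allowed letters).
PAN_RE = re.compile(r"^[A-Z]{3}[PCHFATBLJG][A-Z][0-9]{4}[A-Z]$")

# Standard Verhoeff multiplication, permutation, and inverse tables.
D = [[0,1,2,3,4,5,6,7,8,9],[1,2,3,4,0,6,7,8,9,5],[2,3,4,0,1,7,8,9,5,6],
     [3,4,0,1,2,8,9,5,6,7],[4,0,1,2,3,9,5,6,7,8],[5,9,8,7,6,0,4,3,2,1],
     [6,5,9,8,7,1,0,4,3,2],[7,6,5,9,8,2,1,0,4,3],[8,7,6,5,9,3,2,1,0,4],
     [9,8,7,6,5,4,3,2,1,0]]
P = [[0,1,2,3,4,5,6,7,8,9],[1,5,7,6,2,8,3,0,9,4],[5,8,0,3,7,9,6,1,4,2],
     [8,9,1,6,0,4,3,5,2,7],[9,4,5,3,1,2,6,8,7,0],[4,2,8,6,5,7,3,9,0,1],
     [2,7,9,3,8,0,6,4,1,5],[7,0,4,6,9,1,3,2,5,8]]
INV = [0,4,3,2,1,5,6,7,8,9]

def verhoeff_check_digit(payload):
    """Check digit to append to a string of digits."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return INV[c]

def verhoeff_valid(number):
    """True when the trailing digit is a correct Verhoeff check digit."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0

def ifsc_format_ok(code):
    return bool(IFSC_RE.match(code))

def pan_format_ok(code):
    return bool(PAN_RE.match(code))

def aadhaar_ok(number):
    digits = number.replace(" ", "")
    return len(digits) == 12 and digits.isdigit() and verhoeff_valid(digits)

payload = "23456789012"  # made-up 11 digits, not a real Aadhaar number
candidate = payload + str(verhoeff_check_digit(payload))
print(aadhaar_ok(candidate), ifsc_format_ok("SBIN0001234"), pan_format_ok("ABCPE1234F"))
```

The Verhoeff scheme detects every single-digit error and every adjacent transposition, which is why a mistyped or hand-edited Aadhaar number usually fails the checksum.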

**PII redaction & audit**

- Aadhaar, PAN, IFSC, account-number masking
- PDF redaction with black rectangle overlays
- SHA-256 hash-chained provenance ledger (RBI Para 67 compliant)
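
Text-level masking of these identifiers can be sketched with regular expressions. The patterns, the replacement labels, and the 9–18 digit account-number range are assumptions for illustration and do not describe the behaviour of `redact_text`.

```python
import re

# Order matters: the Aadhaar pattern runs before the generic account pattern.
PII_PATTERNS = {
    "AADHAAR": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
    "PAN": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),
    "IFSC": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
    "ACCOUNT": re.compile(r"\b\d{9,18}\b"),  # assumed length range
}

def redact_pii(text):
    """Replace every match with a bracketed label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

sample = ("PAN ABCPE1234F, IFSC SBIN0001234, "
          "Aadhaar 1234 5678 9012, A/c 123456789012345")
print(redact_pii(sample))
```

A 12-digit account number is indistinguishable from an Aadhaar number at the pattern level, so this sketch would label it as Aadhaar. Both are masked either way.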

---

## Running locally

```bash
git clone https://github.com/SpandanM110/Doc-Sentry.git
cd Doc-Sentry
pip install -r requirements.txt
streamlit run app.py
```

The app opens in your browser at `http://localhost:8501`.

For full OCR text-rule support, install Tesseract OCR:

- Windows: https://github.com/UB-Mannheim/tesseract/wiki
- macOS: `brew install tesseract`
- Linux: `sudo apt-get install tesseract-ocr libtesseract-dev`

The app auto-detects Tesseract on standard Windows install paths; no environment variable required.

---

## Deployment

The repository is deployment-ready for both **Streamlit Community Cloud** and **Hugging Face Spaces**. The YAML frontmatter at the top of this README configures the HF Space; `packages.txt` ensures Tesseract is installed on the build VM; `requirements.txt` covers Python dependencies.

Live deployment: https://huggingface.co/spaces/SpandanM110/DocSentry

---

## Training your own model

Drop labelled data into `data/images/originals/` and `data/images/tampered/`, open `docsentry_master.ipynb`, and run section 6. A Random Forest trains automatically on whatever you put there and saves to `models/forgery_rf.joblib`. The Streamlit app picks it up on the next restart.

For a CNN upgrade, set `TRAIN_CNN = True` in section 7 and run the notebook on a Colab T4 GPU (free tier). This saves `models/forgery_cnn.keras` and `models/forgery_cnn.meta.json`. The app loads the model lazily on first request.

---

## Dependencies

OpenCV (cv2), Pillow (PIL), scikit-image, scikit-learn, joblib, PyMuPDF (fitz), pdfplumber, pikepdf, pytesseract, python-dateutil, Streamlit, streamlit-drawable-canvas, ReportLab, NumPy, pandas, matplotlib, NetworkX. Optional: TensorFlow (only required for the CNN path).

All pip-installable. No GPU required for the default pipeline.

---

## License

MIT — see `LICENSE`. The MIT license covers the source code in this repository. Third-party datasets and pretrained models bundled or referenced (CASIA v2, IDRBT cheque dataset, AgamiAI Indian Bank Statements, MobileNetV2 ImageNet weights) are governed by their own terms; those notices are reproduced in `LICENSE` below the MIT block.

---

## Acknowledgements

- **AgamiAI Indian Bank Statements** (Hugging Face) — Apache 2.0
- **IDRBT Cheque Image Dataset** — Institute for Development and Research in Banking Technology, India
- **CASIA v2** image tampering dataset — Chinese Academy of Sciences
- **MICC-F220** copy-move benchmark — University of Florence
- **CoMoFoD** dataset — University of Zagreb
- **Tobacco-3482** document corpus — University of Maryland