---
title: DocSentry
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.32.0
app_file: app.py
pinned: false
license: mit
short_description: Document forensics + fraud-ring detection for Indian banks
---
# BankShield
**Real-Time Document Forensics, AI-Generated Forgery Detection, and Cross-Applicant Fraud-Ring Intelligence for Indian Bank Underwriting.**
BankShield catches tampered, forged, and AI-generated documents the moment they reach the underwriter, and surfaces organised fraud rings that span multiple applicants. Six independent detection layers fuse into a single calibrated risk score, with explainable evidence, tamper-evident audit trails, and RBI-format compliance reports out of the box.
100% open source. No paid APIs. No external LLM calls. CPU-only by default. Runs locally on the bank's perimeter, so PII never leaves.
- **Live demo:** https://huggingface.co/spaces/SpandanM110/DocSentry
- **Source:** https://github.com/SpandanM110/Doc-Sentry
- **Architecture reference:** see [`ARCHITECTURE.md`](ARCHITECTURE.md)
---
## The six pillars
| Pillar | Module | What it does |
|---|---|---|
| **Image Forensics** | `forensics.py` | ELA, copy-move (ORB), Laplacian noise inconsistency, EXIF audit |
| **PDF Structural Audit** | `forensics.py` | EOF marker counting, producer/creator drift, embedded-font anomalies, consumer-tool fingerprints |
| **OCR + Financial Rules** | `forensics.py` | Tesseract OCR + IFSC / PAN / Aadhaar / date monotonicity / amount sanity |
| **AI-Generated Detection** *(new)* | `ai_detector.py` | Radial FFT spectral analysis; catches Sora / Midjourney / Stable Diffusion outputs |
| **Fraud Ring Network** *(new)* | `fraud_ring.py` | NetworkX similarity graph across applicants; clique discovery flags organised fraud rings |
| **Provenance Ledger** *(new)* | `provenance.py` | SHA-256 hash chain over every analysis; O(N) verifiable; RBI Para 67 compliant |
Plus the **Live Tamper Forge Studio** (`tampering.py`), an adversarial-validation harness built directly into the dashboard.
---
## Repository layout
```
Doc-Sentry/
├── app.py                     Streamlit web UI (6 tabs)
├── forensics.py               Core detection engine + ensemble fusion
├── ai_detector.py             AI-generated forgery detector (FFT spectral)
├── fraud_ring.py              Cross-applicant similarity graph + clique detection
├── provenance.py              Tamper-evident SHA-256 hash chain
├── tampering.py               Forge Studio adversarial harness
├── compliance.py              KYC validators, PII redaction, RBI report builder
├── audit_report.py            Bank-letterhead PDF report builder
├── docsentry_master.ipynb     Single source-of-truth Jupyter notebook
│
├── requirements.txt           Python dependencies
├── packages.txt               System packages (Tesseract) for Streamlit Cloud / HF Spaces
├── .streamlit/config.toml     Streamlit theme + server config
│
├── sample_data/               26 demo files for the live app
│   ├── originals/             12 genuine documents
│   ├── tampered/              12 tampered documents
│   └── pdfs/                  2 PDFs (1 genuine, 1 tampered)
│
├── models/                    Trained model artefacts
│   ├── forgery_rf.joblib      Random Forest classifier
│   └── forgery_cnn.keras      MobileNetV2 fine-tuned on CASIA v2 (optional)
│
├── ARCHITECTURE.md            Full architecture reference
├── SUBMISSION.md              Hackathon submission packet
├── BankShield_Pitch.pptx      Pitch deck (15 slides)
├── README.md LICENSE
└── data/                      (gitignored) full training data + downloaded datasets
```
---
## Module reference
### `forensics.py`: detection engine
The core analytical module. Stateless functions; all logic is independently testable.
| Function | Returns | Description |
|---|---|---|
| `analyse_document(path)` | dict | End-to-end pipeline. Auto-detects type, runs all relevant detectors, blends Random Forest + CNN + AI-gen predictions, auto-logs to provenance ledger. Primary entry point. |
| `score_image(path)` | (float, dict, list) | Composite forensic score for an image. Returns total, sub-scores by detector, and EXIF flags. |
| `error_level_analysis(path, quality=90)` | (PIL.Image, float) | ELA visualisation + scalar suspicion score. |
| `copy_move_detect(path)` | (np.ndarray, int, list) | ORB-based copy-move detection. Returns annotated viz, match count, raw matches. |
| `noise_inconsistency(path, block=32)` | (np.ndarray, float) | Per-block Laplacian variance heatmap + outlier ratio. |
| `exif_sanity(path)` | list of str | EXIF audit: missing EXIF, editor signatures, timestamp inconsistencies. |
| `pdf_structural_audit(path)` | dict | `%%EOF` markers, producer/creator drift, consumer-tool fingerprints. |
| `pdf_font_audit(path)` | dict | Embedded font listing + count anomalies. |
| `ocr_text(path)` | str | Tesseract OCR with auto-fallback. |
| `text_rule_checks(text)` | dict | Date monotonicity, amount sanity, IFSC format, account-number patterns. |
| `extract_features(path)` | dict | 11-feature vector for the Random Forest. |
| `predict_with_model(path)` | dict / None | Random Forest tamper probability + verdict. |
| `predict_with_cnn(path)` | dict / None | MobileNetV2 CNN inference (lazy-loaded). |
| `extract_identity_fields(path)` | (dict, str) | Pulls name, DOB, address, IFSC, account, amounts. |
| `cross_doc_consistency(paths)` | dict | Per-field similarity across 2+ documents. |
| `generate_insights(score, sub, flags)` | dict | Converts numeric scores into underwriter-readable bullets + recommended action. |
| `band(score)` | str | Maps a float to LOW / MEDIUM / HIGH / CRITICAL. |
### `ai_detector.py`: AI-generated forgery detection
| Function | Description |
|---|---|
| `detect_ai_generated(path)` | Full pipeline; returns probability + verdict + flags + FFT profile. |
| `radial_fft_profile(gray)` | Radially-averaged log-magnitude FFT spectrum. |
| `high_freq_attenuation(profile)` | Smoothness score: low for real scans, high for AI outputs. |
| `spectral_peak_score(profile)` | Counts checkerboard-stride peaks in the high-frequency band. |
| `jpeg_quantization_check(path)` | Inspects JPEG quantization tables for synthetic-media signatures. |
Blended into the main risk score with a capped +20% overlay so AI-gen signals reliably surface synthetic media without dominating classical detectors.
### `fraud_ring.py`: cross-applicant fraud-ring detection
| Function | Description |
|---|---|
| `extract_applicant_fields(path)` | OCR + regex pull of name / DOB / address / phone / IFSC / account / employer. |
| `compare_applicants(a, b)` | Per-field similarity + weighted score. |
| `build_fraud_graph(applicants)` | NetworkX similarity graph (edges weighted by shared signals). |
| `detect_rings(G, min_size=3)` | Returns connected components above threshold as suspected fraud rings. |
| `visualize_graph(G, rings)` | Force-directed graph with ring members in red. |
| `fraud_summary(G, rings, applicants)` | Structured summary for the Streamlit UI. |
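The ring-detection idea in the table above can be sketched without any dependencies. This is a minimal illustration only, not the code in `fraud_ring.py`: the function names `similarity` and `find_rings`, the equal-weight field scoring, and the `threshold=0.5` default are all assumptions made for the example. It uses plain breadth-first search where the project uses NetworkX (which offers `connected_components` and `find_cliques` for the same job).

```python
from collections import defaultdict, deque
from itertools import combinations

# Hypothetical field list and equal weighting, chosen only for this sketch.
FIELDS = ("phone", "address", "ifsc", "employer")


def similarity(a, b):
    """Fraction of FIELDS where both applicants share the same non-empty value."""
    shared = sum(1 for f in FIELDS if a.get(f) and a.get(f) == b.get(f))
    return shared / len(FIELDS)


def find_rings(applicants, threshold=0.5, min_size=3):
    """Link applicants whose similarity meets the threshold, then return
    connected components with at least min_size members (as sorted index lists)."""
    adjacency = defaultdict(set)
    for (i, a), (j, b) in combinations(enumerate(applicants), 2):
        if similarity(a, b) >= threshold:
            adjacency[i].add(j)
            adjacency[j].add(i)

    seen, rings = set(), []
    for start in range(len(applicants)):
        if start in seen or start not in adjacency:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first walk over one component
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        if len(component) >= min_size:
            rings.append(sorted(component))
    return rings


# Made-up applicants: the first three share a phone number and an address.
applicants = [
    {"phone": "P1", "address": "A1", "ifsc": "I0", "employer": "E0"},
    {"phone": "P1", "address": "A1", "ifsc": "I1", "employer": "E1"},
    {"phone": "P1", "address": "A1", "ifsc": "I2", "employer": "E2"},
    {"phone": "P3", "address": "A3", "ifsc": "I3", "employer": "E3"},
    {"phone": "P4", "address": "A4", "ifsc": "I4", "employer": "E4"},
]
rings = find_rings(applicants)
```

Connected components are a looser criterion than cliques: a chain A-B-C counts as one component even if A and C share nothing, so a clique-based check would flag fewer, tighter groups.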
### `provenance.py`: tamper-evident audit ledger
| Function | Description |
|---|---|
| `log_analysis(...)` | Appends a SHA-256 hash-chained record to the SQLite ledger. |
| `verify_chain()` | Walks every record in O(N); pinpoints the first broken record. |
| `chain_stats()` | Count, first/last timestamps, breakdown by risk band, chain status. |
| `fetch_ledger(limit)` | Returns the latest N entries. |
| `ledger_dataframe(limit)` | Pandas DataFrame view (for Streamlit display). |
Each record's `record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)`, so a retroactive edit to any record no longer matches its stored hash and breaks every later link in the chain.
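A minimal sketch of that hash chain, using an in-memory list instead of the SQLite ledger. The helper names (`compute_record_hash`, `append_record`, `first_broken_index`), the all-zero genesis hash, the literal `|` separator, and the four-decimal score formatting are assumptions made for illustration; the exact serialisation in `provenance.py` may differ.

```python
import hashlib

GENESIS_HASH = "0" * 64  # assumed prev_hash for the very first record


def compute_record_hash(timestamp, doc_sha256, risk_band, risk_score, prev_hash):
    """SHA-256 over the five fields of the formula, joined with '|' (assumed encoding)."""
    payload = "|".join([timestamp, doc_sha256, risk_band, f"{risk_score:.4f}", prev_hash])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(chain, timestamp, doc_sha256, risk_band, risk_score):
    """Append a record whose hash commits to the previous record's hash."""
    prev_hash = chain[-1]["record_hash"] if chain else GENESIS_HASH
    record = {
        "timestamp": timestamp,
        "doc_sha256": doc_sha256,
        "risk_band": risk_band,
        "risk_score": risk_score,
        "prev_hash": prev_hash,
        "record_hash": compute_record_hash(
            timestamp, doc_sha256, risk_band, risk_score, prev_hash
        ),
    }
    chain.append(record)
    return record


def first_broken_index(chain):
    """Walk the chain once (O(N)); return the index of the first record that
    fails verification, or None if the whole chain is intact."""
    prev_hash = GENESIS_HASH
    for index, rec in enumerate(chain):
        expected = compute_record_hash(
            rec["timestamp"], rec["doc_sha256"], rec["risk_band"],
            rec["risk_score"], prev_hash,
        )
        if rec["prev_hash"] != prev_hash or rec["record_hash"] != expected:
            return index
        prev_hash = rec["record_hash"]
    return None
```

Changing any stored field of record *k* makes `first_broken_index` return *k*; recomputing that record's hash to hide the edit only moves the break to record *k + 1*, whose `prev_hash` no longer matches.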
### `tampering.py`: adversarial Forge Studio
`tamper_copy_move`, `tamper_text_edit`, `tamper_splice`, `tamper_compression`, `tamper_metadata_strip`, `tamper_custom_region`, `tamper_chain`, `annotate_before_after`, `overlay_heatmap_on_image`, `detector_scorecard`. Used by Tab 5 to apply controlled forgeries and immediately re-run detection.
### `compliance.py`: KYC + regulatory
| Function | Description |
|---|---|
| `validate_ifsc(code)` | Format check + RBI bank-code lookup (36 banks). |
| `validate_pan(code)` | Format + entity-type character validation. |
| `validate_aadhaar(num)` | 12-digit format + UIDAI Verhoeff checksum. |
| `redact_text(text)` | Masks IFSC, PAN, Aadhaar, account numbers. |
| `redact_pdf(input_path, output_path)` | PII black-box overlays via PyMuPDF text-bbox. |
| `extract_pii_fields(path)` | Pulls all PII candidates from any document. |
| `build_compliance_report(...)` | RBI Master-Direction-format audit PDF (5 sections). |
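The Verhoeff checksum mentioned for `validate_aadhaar` is a standard algorithm over the dihedral group D5. The sketch below shows the general technique; the function names are hypothetical and this is not the body of `compliance.py`. The permutation table is derived by repeatedly applying the base permutation rather than typed out.

```python
# Multiplication table of the dihedral group D5 (standard Verhoeff "d" table).
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]        # inverse of each element in D5
BASE_PERM = [1, 5, 7, 6, 2, 8, 3, 0, 9, 4]  # base permutation, applied per position


def _build_perm_table():
    """Rows 0..7: identity, then the base permutation applied 1..7 times."""
    rows = [list(range(10))]
    for _ in range(7):
        rows.append([BASE_PERM[x] for x in rows[-1]])
    return rows


P = _build_perm_table()


def verhoeff_is_valid(number: str) -> bool:
    """True if the digit string (check digit last) passes the Verhoeff check."""
    if not number.isdigit():
        return False
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def verhoeff_check_digit(payload: str) -> str:
    """Check digit that makes payload + digit pass verhoeff_is_valid."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return str(INV[c])


def looks_like_aadhaar(value: str) -> bool:
    """12 digits (spaces ignored) plus a valid Verhoeff checksum."""
    digits = value.replace(" ", "")
    return len(digits) == 12 and verhoeff_is_valid(digits)
```

A passing checksum only shows the number is well-formed; it says nothing about whether the number was actually issued.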
### `audit_report.py`: bank-letterhead PDF
`build_pdf_report(report, source_path) → bytes`. Multi-page PDF with header letterhead, metadata table, colour-coded risk verdict box, sub-score breakdown table, evidence list, embedded forensic heatmaps. Built with ReportLab Platypus.
### `app.py`: Streamlit UI (6 tabs)
| Tab | Function |
|---|---|
| 1. Single-document analysis | Risk band, sub-score chart, ELA / copy-move / noise heatmaps, AI-gen FFT profile, ML/CNN predictions, downloadable JSON + PDF. |
| 2. Cross-document KYC | Upload 2–4 docs for one applicant; identity-field consistency table. |
| 3. Batch audit | Scan a folder; sortable risk table + CSV download. |
| 4. Compliance & Audit Pack | KYC validation, PII auto-redaction, RBI compliance PDF, **provenance ledger view with chain re-verify**. |
| 5. Live Tamper Forge Studio | Pick clean sample → choose technique + intensity → watch BankShield localise the tamper with per-detector scorecard + heatmap overlays. |
| 6. Fraud Ring Network | Upload N applicants → similarity graph with red ring members + ring summary cards. |
---
## Pipeline architecture
```
               ┌────────────────────────────────────────┐
               │ PRESENTATION (Streamlit, 6 tabs)       │
               └──────────────────┬─────────────────────┘
                                  ▼
               ┌──────────────────────────────────────────────┐
               │ FORENSICS CORE                               │
               │ ELA · Copy-move · Noise · EXIF · OCR · PDF   │
               │ + Random Forest (11-d feature vector)        │
               │ + MobileNetV2 CNN (CASIA v2 fine-tuned)      │
               │ + AI-Gen Detector (radial FFT)               │
               └──────────────────┬───────────────────────────┘
                                  ▼
               ┌──────────────────────────────────────────────┐
               │ ENSEMBLE FUSION                              │
               │ weighted blend → RF overlay → CNN overlay    │
               │ → AI-gen overlay (capped at +20%)            │
               └──────────────────┬───────────────────────────┘
                                  ▼
       ┌──────────────────┬───────┴────────┬─────────────────┐
       ▼                  ▼                ▼                 ▼
┌──────────────┐ ┌────────────────┐ ┌──────────────┐ ┌────────────────┐
│ COMPLIANCE   │ │ FRAUD-RING     │ │ PROVENANCE   │ │ TAMPER FORGE   │
│ IFSC · PAN · │ │ NetworkX graph │ │ SHA-256 hash │ │ Adversarial    │
│ Aadhaar · PII│ │ clique detect  │ │ chain ledger │ │ validation     │
└──────┬───────┘ └────────┬───────┘ └──────┬───────┘ └────────────────┘
       │                  │                │
       └────────────┬─────┴────────────────┘
                    ▼
 ┌────────────────────────────────────┐
 │ OUTPUT                             │
 │ Risk band · Evidence list          │
 │ Bank-letterhead audit PDF          │
 │ RBI compliance PDF · Audit JSON    │
 │ Tamper-evident ledger entry        │
 └────────────────────────────────────┘
```
Default weight vector (`forensics.WEIGHTS`): `{ela: 0.20, copy_move: 0.25, noise: 0.20, exif: 0.15, text_rules: 0.20}`. The Random Forest probability, when available, is blended 50/50 with the rule-based score. The CNN probability is blended at a weight between 0.4 and 0.7 based on the CNN's reported validation AUC. The AI-gen probability is applied as a final overlay capped at +20%.
Band mapping: `0–0.30 LOW · 0.30–0.50 MEDIUM · 0.50–0.75 HIGH · 0.75+ CRITICAL`.
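The fusion order and band mapping described above can be sketched as follows. Several details are not specified in this README and are assumptions here: the linear mapping from validation AUC to a CNN weight in the 0.4 to 0.7 range, treating the AI-gen overlay as an additive `0.20 * ai_prob`, clamping the result to [0, 1], and assigning boundary values (0.30, 0.50, 0.75) to the higher band. The names `fuse_scores`, `cnn_weight`, and `risk_band` are illustrative only.

```python
# Default weights as listed for forensics.WEIGHTS.
WEIGHTS = {"ela": 0.20, "copy_move": 0.25, "noise": 0.20, "exif": 0.15, "text_rules": 0.20}


def cnn_weight(val_auc):
    """Assumed mapping: AUC 0.5 gives weight 0.4, AUC 1.0 gives 0.7, linear between."""
    clipped = min(max(val_auc, 0.5), 1.0)
    return 0.4 + (clipped - 0.5) / 0.5 * 0.3


def fuse_scores(sub_scores, rf_prob=None, cnn_prob=None, cnn_auc=None, ai_prob=None):
    """Weighted blend, then RF overlay, then CNN overlay, then capped AI-gen overlay."""
    score = sum(w * sub_scores.get(name, 0.0) for name, w in WEIGHTS.items())
    if rf_prob is not None:
        score = 0.5 * score + 0.5 * rf_prob           # 50/50 blend with the Random Forest
    if cnn_prob is not None and cnn_auc is not None:
        w = cnn_weight(cnn_auc)
        score = (1.0 - w) * score + w * cnn_prob      # AUC-dependent CNN blend
    if ai_prob is not None:
        score += min(0.20, 0.20 * ai_prob)            # overlay never adds more than 0.20
    return min(max(score, 0.0), 1.0)


def risk_band(score):
    """Map a fused score to LOW / MEDIUM / HIGH / CRITICAL."""
    if score >= 0.75:
        return "CRITICAL"
    if score >= 0.50:
        return "HIGH"
    if score >= 0.30:
        return "MEDIUM"
    return "LOW"


example = fuse_scores(
    {"ela": 0.5, "copy_move": 0.5, "noise": 0.5, "exif": 0.5, "text_rules": 0.5},
    rf_prob=0.9,
    ai_prob=1.0,
)
```

With every sub-score at 0.5 the rule-based blend is 0.5; the Random Forest overlay moves it to about 0.7, and a maximal AI-gen signal adds the full 0.20, which lands in the CRITICAL band.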
See [`ARCHITECTURE.md`](ARCHITECTURE.md) for the full reference.
---
## Detection coverage
**Image tampering**
- Copy-move forgery: ORB keypoint matching with distance filter
- Image splicing: block-wise noise inconsistency via Laplacian variance
- Text edits / amount tampering: Error Level Analysis
- Photoshop / GIMP / Snapseed edits: EXIF Software-tag string match
- Timestamp inconsistencies: DateTime vs DateTimeOriginal comparison
**AI-generated content**
- Sora / Midjourney / Stable Diffusion / DALL-E outputs: FFT spectral analysis
- High-frequency suppression (1/f decay deviation)
- Periodic checkerboard peaks from upsampling stride
- Non-standard JPEG quantization tables
**PDF tampering**
- Incremental edits: multi-`%%EOF` marker counting
- Consumer-tool fingerprints: iLovePDF, Smallpdf, PDFescape, Sejda, Foxit Phantom
- Producer/Creator mismatch: flags re-processed PDFs
- Inserted text: embedded-font count anomalies
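The `%%EOF` counting check in the list above is simple enough to sketch on raw bytes. Each incremental save appends a new trailer ending in `%%EOF`, so more than one marker suggests the file was modified after it was first written. The function names are hypothetical and the threshold of one marker is an assumption; legitimate workflows such as digital signing also append incremental updates, so this is a signal to weigh rather than proof of tampering.

```python
def count_eof_markers(pdf_bytes: bytes) -> int:
    """Number of %%EOF markers in the raw file contents."""
    return pdf_bytes.count(b"%%EOF")


def looks_incrementally_updated(pdf_bytes: bytes) -> bool:
    """True when the file carries more than one %%EOF marker."""
    return count_eof_markers(pdf_bytes) > 1


# Synthetic byte strings standing in for real files (not valid PDFs).
single_save = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"
edited_later = single_save + b"2 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"

single_count = count_eof_markers(single_save)
edited_count = count_eof_markers(edited_later)
```

For a file on disk, the same functions can be fed `open(path, "rb").read()`.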
**Cross-document & fraud-ring**
- Name / DOB / address fuzzy match across multiple documents
- Per-field weighted scoring with green / yellow / red status
- Cross-applicant similarity graph; cliques ≥3 = suspected fraud ring
- Ring bands: CRITICAL (≥5 members) / HIGH (3–4) / MEDIUM (2)
**KYC validation**
- IFSC: format + RBI bank-code list (36 banks)
- PAN: format + entity-type character (10 types per income-tax dept spec)
- Aadhaar: 12-digit format + UIDAI Verhoeff checksum
**PII redaction & audit**
- Aadhaar, PAN, IFSC, account-number masking
- PDF redaction with black rectangle overlays
- SHA-256 hash-chained provenance ledger (RBI Para 67 compliant)
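Text masking of the kind listed above can be sketched with regular expressions. The PAN pattern (five letters, four digits, one letter) and the IFSC pattern (four letters, a zero, six alphanumerics) follow the published formats; the Aadhaar pattern, the keep-last-four masking style, and the names `redact_identifiers` and `_mask` are assumptions for illustration, not the behaviour of `redact_text`. Account numbers are left out because their length varies by bank.

```python
import re

PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")
IFSC_RE = re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b")
AADHAAR_RE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b")  # 12 digits, optional grouping


def _mask(match, keep=4):
    """Replace every letter or digit with 'X' except the last `keep` characters;
    separators such as spaces are preserved."""
    value = match.group(0)
    head, tail = value[:-keep], value[-keep:]
    return re.sub(r"[A-Za-z0-9]", "X", head) + tail


def redact_identifiers(text: str) -> str:
    """Mask PAN, IFSC, and Aadhaar-shaped tokens in free text."""
    for pattern in (PAN_RE, IFSC_RE, AADHAAR_RE):
        text = pattern.sub(_mask, text)
    return text


# Made-up identifiers in the right shapes, not real ones.
sample = "PAN ABCDE1234F, IFSC SBIN0001234, Aadhaar 1234 5678 9012"
redacted = redact_identifiers(sample)
```

Pattern-only masking will also hit any unrelated 12-digit number, so a production redactor would normally confirm candidates (for example with the Verhoeff checksum) before masking.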
---
## Running locally
```bash
git clone https://github.com/SpandanM110/Doc-Sentry.git
cd Doc-Sentry
pip install -r requirements.txt
streamlit run app.py
```
Browser opens at `http://localhost:8501`.
For full OCR text-rule support, install Tesseract OCR:
- Windows: https://github.com/UB-Mannheim/tesseract/wiki
- macOS: `brew install tesseract`
- Linux: `sudo apt-get install tesseract-ocr libtesseract-dev`
The app auto-detects Tesseract on standard Windows install paths; no environment variable required.
---
## Deployment
The repository is deployment-ready for both **Streamlit Community Cloud** and **Hugging Face Spaces**. The YAML frontmatter at the top of this README configures the HF Space; `packages.txt` ensures Tesseract is installed on the build VM; `requirements.txt` covers Python dependencies.
Live deployment: https://huggingface.co/spaces/SpandanM110/DocSentry
---
## Training your own model
Drop labelled data into `data/images/originals/` and `data/images/tampered/`, open `docsentry_master.ipynb`, run section 6. A Random Forest auto-trains on whatever you put there and saves to `models/forgery_rf.joblib`. The Streamlit app picks it up automatically on next restart.
For a CNN upgrade, set `TRAIN_CNN = True` in section 7 and run on a Colab T4 GPU (free tier). Saves `models/forgery_cnn.keras` + `models/forgery_cnn.meta.json`. The app loads it lazily on first request.
---
## Dependencies
OpenCV (cv2), Pillow (PIL), scikit-image, scikit-learn, joblib, PyMuPDF (fitz), pdfplumber, pikepdf, pytesseract, python-dateutil, Streamlit, streamlit-drawable-canvas, ReportLab, NumPy, pandas, matplotlib, NetworkX. Optional: TensorFlow (only required for the CNN path).
All pip-installable. No GPU required for the default pipeline.
---
## License
MIT; see `LICENSE`. The MIT license covers the source code in this repository. Third-party datasets and pretrained models bundled or referenced (CASIA v2, IDRBT cheque dataset, AgamiAI Indian Bank Statements, MobileNetV2 ImageNet weights) are governed by their own terms; those notices are reproduced in `LICENSE` below the MIT block.
---
## Acknowledgements
- **AgamiAI Indian Bank Statements** (Hugging Face): Apache 2.0
- **IDRBT Cheque Image Dataset**: Institute for Development and Research in Banking Technology, India
- **CASIA v2** image tampering dataset: Chinese Academy of Sciences
- **MICC-F220** copy-move benchmark: University of Florence
- **CoMoFoD** dataset: University of Zagreb
- **Tobacco-3482** document corpus: University of Maryland