	---
	title: DocSentry
	emoji: 🛡️
	colorFrom: blue
	colorTo: indigo
	sdk: streamlit
	sdk_version: 1.32.0
	app_file: app.py
	pinned: false
	license: mit
	short_description: Document forensics + fraud-ring detection for Indian banks
	---

	# BankShield

	Real-Time Document Forensics, AI-Generated Forgery Detection, and Cross-Applicant Fraud-Ring Intelligence for Indian Bank Underwriting.

	BankShield catches tampered, forged, and AI-generated documents the moment they reach the underwriter — and surfaces organised fraud rings that span multiple applicants. Six independent detection layers fuse into a single calibrated risk score, with explainable evidence, tamper-evident audit trails, and RBI-format compliance reports out of the box.

	100% open source. No paid APIs. No external LLM calls. CPU-only by default. Runs locally on the bank's perimeter — PII never leaves.

	- Live demo: https://huggingface.co/spaces/SpandanM110/DocSentry
	- Source: https://github.com/SpandanM110/Doc-Sentry
	- Architecture reference: see [`ARCHITECTURE.md`](ARCHITECTURE.md)

	---

	## The six pillars

	| Pillar | Module | What it does |
	|---|---|---|
	| Image Forensics | `forensics.py` | ELA, copy-move (ORB), Laplacian noise inconsistency, EXIF audit |
	| PDF Structural Audit | `forensics.py` | EOF marker counting, producer/creator drift, embedded-font anomalies, consumer-tool fingerprints |
	| OCR + Financial Rules | `forensics.py` | Tesseract OCR + IFSC / PAN / Aadhaar / date monotonicity / amount sanity |
	| AI-Generated Detection (new) | `ai_detector.py` | Radial FFT spectral analysis; catches Sora / Midjourney / Stable Diffusion outputs |
	| Fraud Ring Network (new) | `fraud_ring.py` | NetworkX similarity graph across applicants; clique discovery flags organised fraud rings |
	| Provenance Ledger (new) | `provenance.py` | SHA-256 hash chain over every analysis; O(N) verifiable; RBI Para 67 compliant |

	Plus the Live Tamper Forge Studio (`tampering.py`) — an adversarial-validation harness built directly into the dashboard.

	---

	## Repository layout

	```
	Doc-Sentry/
	├── app.py                    Streamlit web UI (6 tabs)
	├── forensics.py              Core detection engine + ensemble fusion
	├── ai_detector.py            AI-generated forgery detector (FFT spectral)
	├── fraud_ring.py             Cross-applicant similarity graph + clique detection
	├── provenance.py             Tamper-evident SHA-256 hash chain
	├── tampering.py              Forge Studio adversarial harness
	├── compliance.py             KYC validators, PII redaction, RBI report builder
	├── audit_report.py           Bank-letterhead PDF report builder
	├── docsentry_master.ipynb    Single source-of-truth Jupyter notebook
	│
	├── requirements.txt          Python dependencies
	├── packages.txt              System packages (Tesseract) for Streamlit Cloud / HF Spaces
	├── .streamlit/config.toml    Streamlit theme + server config
	│
	├── sample_data/              26 demo files for the live app
	│   ├── originals/            12 genuine documents
	│   ├── tampered/             12 tampered documents
	│   └── pdfs/                 2 PDFs (1 genuine, 1 tampered)
	│
	├── models/                   Trained model artefacts
	│   ├── forgery_rf.joblib     Random Forest classifier
	│   └── forgery_cnn.keras     MobileNetV2 fine-tuned on CASIA v2 (optional)
	│
	├── ARCHITECTURE.md           Full architecture reference
	├── SUBMISSION.md             Hackathon submission packet
	├── BankShield_Pitch.pptx     Pitch deck (15 slides)
	├── README.md
	├── LICENSE
	└── data/                     (gitignored) full training data + downloaded datasets
	```

	---

	## Module reference

	### `forensics.py` — detection engine

	The core analytical module. Stateless functions; all logic is independently testable.

	| Function | Returns | Description |
	|---|---|---|
	| `analyse_document(path)` | dict | End-to-end pipeline. Auto-detects type, runs all relevant detectors, blends Random Forest + CNN + AI-gen predictions, auto-logs to provenance ledger. Primary entry point. |
	| `score_image(path)` | (float, dict, list) | Composite forensic score for an image. Returns total, sub-scores by detector, and EXIF flags. |
	| `error_level_analysis(path, quality=90)` | (PIL.Image, float) | ELA visualisation + scalar suspicion score. |
	| `copy_move_detect(path)` | (np.ndarray, int, list) | ORB-based copy-move detection. Returns annotated viz, match count, raw matches. |
	| `noise_inconsistency(path, block=32)` | (np.ndarray, float) | Per-block Laplacian variance heatmap + outlier ratio. |
	| `exif_sanity(path)` | list of str | EXIF audit: missing EXIF, editor signatures, timestamp inconsistencies. |
	| `pdf_structural_audit(path)` | dict | `%%EOF` markers, producer/creator drift, consumer-tool fingerprints. |
	| `pdf_font_audit(path)` | dict | Embedded font listing + count anomalies. |
	| `ocr_text(path)` | str | Tesseract OCR with auto-fallback. |
	| `text_rule_checks(text)` | dict | Date monotonicity, amount sanity, IFSC format, account-number patterns. |
	| `extract_features(path)` | dict | 11-feature vector for the Random Forest. |
	| `predict_with_model(path)` | dict / None | Random Forest tamper probability + verdict. |
	| `predict_with_cnn(path)` | dict / None | MobileNetV2 CNN inference (lazy-loaded). |
	| `extract_identity_fields(path)` | (dict, str) | Pulls name, DOB, address, IFSC, account, amounts. |
	| `cross_doc_consistency(paths)` | dict | Per-field similarity across 2+ documents. |
	| `generate_insights(score, sub, flags)` | dict | Numeric → underwriter-readable bullets + recommended action. |
	| `band(score)` | str | Maps a float to LOW / MEDIUM / HIGH / CRITICAL. |
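
As an illustration of one of the text rules, date monotonicity can be sketched in a few lines. This is a minimal sketch rather than the code in `forensics.py`: the helper name `dates_non_decreasing`, the `DD/MM/YYYY` format, and the assumption that statement rows should appear in non-decreasing date order are all hypothetical.

```python
from datetime import datetime


def dates_non_decreasing(date_strings, fmt="%d/%m/%Y"):
    """Check that dates appear in non-decreasing order.

    Returns (True, None) when the order holds, otherwise
    (False, index_of_first_out_of_order_row).
    The date format is an assumption; real statements vary.
    """
    parsed = [datetime.strptime(s, fmt) for s in date_strings]
    for i in range(1, len(parsed)):
        if parsed[i] < parsed[i - 1]:
            return False, i
    return True, None


rows = ["01/04/2024", "03/04/2024", "02/04/2024"]
print(dates_non_decreasing(rows))  # (False, 2)
```

A row that goes back in time is only a signal for review; some statements legitimately list value dates out of order.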

	### `ai_detector.py` — AI-generated forgery detection

	| Function | Description |
	|---|---|
	| `detect_ai_generated(path)` | Full pipeline → probability + verdict + flags + FFT profile. |
	| `radial_fft_profile(gray)` | Radially-averaged log-magnitude FFT spectrum. |
	| `high_freq_attenuation(profile)` | Smoothness score: low for real scans, high for AI outputs. |
	| `spectral_peak_score(profile)` | Counts checkerboard-stride peaks in the high-frequency band. |
	| `jpeg_quantization_check(path)` | Inspects JPEG quantization tables for synthetic-media signatures. |

	The AI-gen probability is blended into the main risk score as an overlay capped at +20%, so synthetic media surfaces in the final score without the AI-gen signal dominating the classical detectors.
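
To make the peak-counting idea concrete, here is a minimal sketch that counts isolated spikes in the upper half of a radial spectrum profile. It is not the implementation of `spectral_peak_score`: the function name `count_high_freq_peaks`, the band boundary, and the prominence threshold are illustrative assumptions, and the input is a plain list standing in for the output of `radial_fft_profile`.

```python
def count_high_freq_peaks(profile, band_start=0.5, min_prominence=0.5):
    """Count local maxima in the high-frequency part of a radial profile.

    profile: log-magnitudes ordered from low to high frequency.
    band_start: fraction of the profile where the high band begins (assumed).
    min_prominence: how far a bin must rise above both neighbours (assumed).
    """
    n = len(profile)
    first = max(1, int(n * band_start))
    peaks = 0
    for i in range(first, n - 1):
        higher_neighbour = max(profile[i - 1], profile[i + 1])
        if profile[i] - higher_neighbour >= min_prominence:
            peaks += 1
    return peaks


# A smoothly decaying spectrum versus one with two injected spikes.
smooth = [10.0 - 0.5 * i for i in range(20)]
spiky = list(smooth)
spiky[14] += 3.0
spiky[17] += 3.0
print(count_high_freq_peaks(smooth), count_high_freq_peaks(spiky))  # 0 2
```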

	### `fraud_ring.py` — cross-applicant fraud-ring detection

	| Function | Description |
	|---|---|
	| `extract_applicant_fields(path)` | OCR + regex pull of name / DOB / address / phone / IFSC / account / employer. |
	| `compare_applicants(a, b)` | Per-field similarity + weighted score. |
	| `build_fraud_graph(applicants)` | NetworkX similarity graph (edges weighted by shared signals). |
	| `detect_rings(G, min_size=3)` | Connected components above threshold → suspected fraud rings. |
	| `visualize_graph(G, rings)` | Force-directed graph with ring members in red. |
	| `fraud_summary(G, rings, applicants)` | Structured summary for the Streamlit UI. |
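
The graph-then-components flow can be sketched without any dependencies. This is an assumption-laden illustration, not the code in `fraud_ring.py`: it uses a plain adjacency dict and an iterative depth-first search in place of NetworkX, exact equality in place of weighted similarity, and a hypothetical field list and `min_shared` threshold.

```python
from itertools import combinations

# Hypothetical signal fields; the README does not list the exact fields or weights.
SIGNAL_FIELDS = ("phone", "address", "ifsc", "employer")


def shared_signals(a, b):
    """Count fields present in both records with exactly equal values."""
    return sum(1 for f in SIGNAL_FIELDS if a.get(f) and a.get(f) == b.get(f))


def build_adjacency(applicants, min_shared=2):
    """Link two applicants when they share at least min_shared signals."""
    adjacency = {name: set() for name in applicants}
    for x, y in combinations(applicants, 2):
        if shared_signals(applicants[x], applicants[y]) >= min_shared:
            adjacency[x].add(y)
            adjacency[y].add(x)
    return adjacency


def find_rings(adjacency, min_size=3):
    """Return connected components with at least min_size members."""
    seen, rings = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) >= min_size:
            rings.append(sorted(component))
    return rings


applicants = {
    "A": {"phone": "9000000001", "address": "12 MG Road", "employer": "Acme"},
    "B": {"phone": "9000000001", "address": "12 MG Road", "employer": "Beta"},
    "C": {"phone": "9000000001", "address": "12 MG Road", "employer": "Gamma"},
    "D": {"phone": "9000000002", "address": "7 Park Street", "employer": "Acme"},
}
print(find_rings(build_adjacency(applicants)))  # [['A', 'B', 'C']]
```

Applicant D shares only an employer with A, which falls below the assumed threshold, so D stays outside the ring.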

	### `provenance.py` — tamper-evident audit ledger

	| Function | Description |
	|---|---|
	| `log_analysis(...)` | Appends a SHA-256 hash-chained record to the SQLite ledger. |
	| `verify_chain()` | Walks every record in O(N); pinpoints the first broken record. |
	| `chain_stats()` | Count, first/last timestamps, breakdown by risk band, chain status. |
	| `fetch_ledger(limit)` | Returns the latest N entries. |
	| `ledger_dataframe(limit)` | Pandas DataFrame view (for Streamlit display). |

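	Each record stores `record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)`. Because every hash covers the previous record's hash, a retroactive edit to any record changes its hash and breaks the link to every record after it.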
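
The chaining and verification logic can be sketched with `hashlib` alone. This is a minimal in-memory sketch, not the SQLite-backed code in `provenance.py`: the literal `|` separator, the four-decimal score formatting, the all-zero genesis hash, and the helper names `append_record` and `first_broken_index` are assumptions made for illustration.

```python
import hashlib

GENESIS = "0" * 64  # assumed placeholder for the first record's prev_hash


def record_hash(timestamp, doc_sha256, risk_band, risk_score, prev_hash):
    """Hash the record fields together with the previous record's hash."""
    payload = "|".join(
        [timestamp, doc_sha256, risk_band, f"{risk_score:.4f}", prev_hash]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(chain, timestamp, doc_sha256, risk_band, risk_score):
    """Append a record whose hash covers the hash of the record before it."""
    prev = chain[-1]["record_hash"] if chain else GENESIS
    chain.append({
        "timestamp": timestamp,
        "doc_sha256": doc_sha256,
        "risk_band": risk_band,
        "risk_score": risk_score,
        "prev_hash": prev,
        "record_hash": record_hash(timestamp, doc_sha256, risk_band, risk_score, prev),
    })


def first_broken_index(chain):
    """Walk the chain once; return the index of the first bad record, or None."""
    prev = GENESIS
    for i, rec in enumerate(chain):
        expected = record_hash(
            rec["timestamp"], rec["doc_sha256"], rec["risk_band"],
            rec["risk_score"], prev,
        )
        if rec["prev_hash"] != prev or rec["record_hash"] != expected:
            return i
        prev = rec["record_hash"]
    return None


chain = []
for n, (band_name, score) in enumerate([("LOW", 0.12), ("HIGH", 0.61), ("MEDIUM", 0.40)]):
    doc_digest = hashlib.sha256(f"doc-{n}".encode()).hexdigest()
    append_record(chain, f"2024-04-0{n + 1}T10:00:00", doc_digest, band_name, score)

print(first_broken_index(chain))  # None
chain[1]["risk_band"] = "LOW"     # retroactive edit
print(first_broken_index(chain))  # 1
```

Verification is a single pass over the records, which is where the O(N) cost comes from.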

	### `tampering.py` — adversarial Forge Studio

	`tamper_copy_move`, `tamper_text_edit`, `tamper_splice`, `tamper_compression`, `tamper_metadata_strip`, `tamper_custom_region`, `tamper_chain`, `annotate_before_after`, `overlay_heatmap_on_image`, `detector_scorecard`. Used by Tab 5 to apply controlled forgeries and immediately re-run detection.

	### `compliance.py` — KYC + regulatory

	| Function | Description |
	|---|---|
	| `validate_ifsc(code)` | Format check + RBI bank-code lookup (36 banks). |
	| `validate_pan(code)` | Format + entity-type character validation. |
	| `validate_aadhaar(num)` | 12-digit format + UIDAI Verhoeff checksum. |
	| `redact_text(text)` | Masks IFSC, PAN, Aadhaar, account numbers. |
	| `redact_pdf(input_path, output_path)` | PII black-box overlays via PyMuPDF text-bbox. |
	| `extract_pii_fields(path)` | Pulls all PII candidates from any document. |
	| `build_compliance_report(...)` | RBI Master-Direction-format audit PDF (5 sections). |
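
The Verhoeff checksum behind the Aadhaar check can be sketched as follows. The three tables are the standard Verhoeff tables; the function names are hypothetical and this is not necessarily how `validate_aadhaar` is written. The example builds a synthetic 12-digit number instead of using a real one.

```python
# Standard Verhoeff tables: D is the dihedral-group multiplication table,
# P the position-dependent permutation table, INV the inverse table.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_valid(number):
    """True when a digit string, check digit included, passes Verhoeff."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def verhoeff_check_digit(payload):
    """Compute the check digit to append to a digit string."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return str(INV[c])


def aadhaar_format_ok(number):
    """12 ASCII digits (spaces ignored) that pass the Verhoeff check."""
    digits = number.replace(" ", "")
    if len(digits) != 12 or not (digits.isascii() and digits.isdigit()):
        return False
    return verhoeff_valid(digits)


payload = "23456789012"  # synthetic 11-digit payload, not a real number
full = payload + verhoeff_check_digit(payload)
print(aadhaar_format_ok(full))  # True
```

A passing checksum only shows the number is well formed; it says nothing about whether the number was ever issued.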

	### `audit_report.py` — bank-letterhead PDF

	`build_pdf_report(report, source_path) → bytes`. Multi-page PDF with header letterhead, metadata table, colour-coded risk verdict box, sub-score breakdown table, evidence list, embedded forensic heatmaps. Built with ReportLab Platypus.

	### `app.py` — Streamlit UI (6 tabs)

	| Tab | Function |
	|---|---|
	| 1. Single-document analysis | Risk band, sub-score chart, ELA / copy-move / noise heatmaps, AI-gen FFT profile, ML/CNN predictions, downloadable JSON + PDF. |
	| 2. Cross-document KYC | Upload 2–4 docs for one applicant; identity-field consistency table. |
	| 3. Batch audit | Scan a folder; sortable risk table + CSV download. |
	| 4. Compliance & Audit Pack | KYC validation, PII auto-redaction, RBI compliance PDF, provenance ledger view with chain re-verify. |
	| 5. Live Tamper Forge Studio | Pick clean sample → choose technique + intensity → watch BankShield localise the tamper with per-detector scorecard + heatmap overlays. |
	| 6. Fraud Ring Network | Upload N applicants → similarity graph with red ring members + ring summary cards. |

	---

	## Pipeline architecture

	```
	┌────────────────────────────────────────┐
	│ PRESENTATION (Streamlit, 6 tabs) │
	└──────────────────┬─────────────────────┘
	▼
	┌──────────────────────────────────────────────┐
	│ FORENSICS CORE │
	│ ELA · Copy-move · Noise · EXIF · OCR · PDF │
	│ + Random Forest (11-d feature vector) │
	│ + MobileNetV2 CNN (CASIA v2 fine-tuned) │
	│ + AI-Gen Detector (radial FFT) │
	└──────────────────┬───────────────────────────┘
	▼
	┌──────────────────────────────────────────────┐
	│ ENSEMBLE FUSION │
	│ weighted blend → RF overlay → CNN overlay │
	│ → AI-gen overlay (capped at +20%) │
	└──────────────────┬───────────────────────────┘
	▼
	┌──────────────────┬─────┴─────┬──────────────────┐
	▼ ▼ ▼ ▼
	┌──────────────┐ ┌────────────────┐ ┌──────────────┐ ┌────────────────┐
	│ COMPLIANCE │ │ FRAUD-RING │ │ PROVENANCE │ │ TAMPER FORGE │
	│ IFSC · PAN · │ │ NetworkX graph │ │ SHA-256 hash │ │ Adversarial │
	│ Aadhaar · PII│ │ clique detect │ │ chain ledger │ │ validation │
	└──────┬───────┘ └────────┬───────┘ └──────┬───────┘ └────────────────┘
	│ │ │
	└────────────┬──────┴────────────────┘
	▼
	┌────────────────────────────────────┐
	│ OUTPUT │
	│ Risk band · Evidence list │
	│ Bank-letterhead audit PDF │
	│ RBI compliance PDF · Audit JSON │
	│ Tamper-evident ledger entry │
	└────────────────────────────────────┘
	```

	Default weight vector (`forensics.WEIGHTS`): `{ela: 0.20, copy_move: 0.25, noise: 0.20, exif: 0.15, text_rules: 0.20}`. The Random Forest probability, when available, is blended 50/50 with the rule-based score. The CNN probability is blended at a weight between 0.4 and 0.7 based on the CNN's reported validation AUC. The AI-gen probability is applied as a final overlay capped at +20%.

	Band mapping: `0–0.30 LOW · 0.30–0.50 MEDIUM · 0.50–0.75 HIGH · 0.75+ CRITICAL`.
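
The fusion arithmetic described above can be sketched as follows. The weights and band edges come from this README; everything else is an assumption: the AI-gen overlay is modelled as an additive `0.20 * ai_prob`, the CNN blend is left out because its weight formula is not given here, boundary values are assigned to the higher band, and the function names `fuse` and `band_for` are hypothetical.

```python
WEIGHTS = {"ela": 0.20, "copy_move": 0.25, "noise": 0.20, "exif": 0.15, "text_rules": 0.20}


def fuse(sub_scores, rf_prob=None, ai_prob=None, ai_cap=0.20):
    """Weighted rule score, optional 50/50 RF blend, capped AI-gen overlay."""
    score = sum(WEIGHTS[name] * sub_scores.get(name, 0.0) for name in WEIGHTS)
    if rf_prob is not None:
        score = 0.5 * score + 0.5 * rf_prob
    if ai_prob is not None:
        # Assumed overlay form: scales with the probability, never exceeds ai_cap.
        score += ai_cap * min(1.0, max(0.0, ai_prob))
    return min(1.0, max(0.0, score))


def band_for(score):
    """Map a score to a band; a score on an edge goes to the higher band (assumed)."""
    if score < 0.30:
        return "LOW"
    if score < 0.50:
        return "MEDIUM"
    if score < 0.75:
        return "HIGH"
    return "CRITICAL"


subs = {"ela": 0.5, "copy_move": 0.5, "noise": 0.5, "exif": 0.5, "text_rules": 0.5}
total = fuse(subs, rf_prob=0.9, ai_prob=0.5)
print(round(total, 2), band_for(total))  # 0.8 CRITICAL
```

In this example the rule score is 0.5, the Random Forest blend lifts it to 0.7, and the overlay adds 0.1.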

	See [`ARCHITECTURE.md`](ARCHITECTURE.md) for the full reference.

	---

	## Detection coverage

	Image tampering

	- Copy-move forgery — ORB keypoint matching with distance filter
	- Image splicing — block-wise noise inconsistency via Laplacian variance
	- Text edits / amount tampering — Error Level Analysis
	- Photoshop / GIMP / Snapseed edits — EXIF Software-tag string match
	- Timestamp inconsistencies — DateTime vs DateTimeOriginal comparison

	AI-generated content

	- Sora / Midjourney / Stable Diffusion / DALL-E outputs — FFT spectral analysis
	- High-frequency suppression (1/f decay deviation)
	- Periodic checkerboard peaks from upsampling stride
	- Non-standard JPEG quantization tables

	PDF tampering

	- Incremental edits — multi-`%%EOF` marker counting
	- Consumer-tool fingerprints — iLovePDF, Smallpdf, PDFescape, Sejda, Foxit Phantom
	- Producer/Creator mismatch — flags re-processed PDFs
	- Inserted text — embedded-font count anomalies
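
The marker-counting check rests on how PDF incremental updates work: each update appends a new body and trailer that ends with its own `%%EOF`. A minimal sketch over raw bytes, with hypothetical function names and hand-built byte strings standing in for real files:

```python
def count_eof_markers(pdf_bytes):
    """Count occurrences of the PDF end-of-file marker in the raw bytes."""
    return pdf_bytes.count(b"%%EOF")


def incremental_update_flag(pdf_bytes):
    """Flag files with more than one marker as candidates for review."""
    n = count_eof_markers(pdf_bytes)
    return {"eof_count": n, "suspicious": n > 1}


# Synthetic stand-ins: an original file, then the same file with an appended update.
original = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\ntrailer\n<< >>\nstartxref\n9\n%%EOF\n"
edited = original + b"2 0 obj\n<< >>\nendobj\ntrailer\n<< >>\nstartxref\n70\n%%EOF\n"

print(incremental_update_flag(original))  # {'eof_count': 1, 'suspicious': False}
print(incremental_update_flag(edited))    # {'eof_count': 2, 'suspicious': True}
```

More than one marker is a signal rather than proof, since some tools save legitimate files with several markers, so it makes sense to weigh it alongside the producer and font checks.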

	Cross-document & fraud-ring

	- Name / DOB / address fuzzy match across multiple documents
	- Per-field weighted scoring with green / yellow / red status
	- Cross-applicant similarity graph; cliques ≥3 = suspected fraud ring
	- Ring bands: CRITICAL (≥5 members) / HIGH (3–4) / MEDIUM (2)
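
A small sketch of the fuzzy field match and the ring-size banding listed above. The ring-size cut-offs are the ones stated in the list; the use of `difflib.SequenceMatcher`, the 0.90 and 0.70 colour thresholds, and all function names are assumptions for illustration.

```python
import re
from difflib import SequenceMatcher


def normalise(value):
    """Lower-case and collapse whitespace before comparing."""
    return re.sub(r"\s+", " ", value.strip().lower())


def field_similarity(a, b):
    """Similarity in [0, 1] between two field values."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def status(similarity, green=0.90, yellow=0.70):
    """Traffic-light status; both thresholds are illustrative."""
    if similarity >= green:
        return "green"
    if similarity >= yellow:
        return "yellow"
    return "red"


def ring_band(size):
    """Band a ring by member count, using the cut-offs listed above."""
    if size >= 5:
        return "CRITICAL"
    if size >= 3:
        return "HIGH"
    if size == 2:
        return "MEDIUM"
    return None


print(status(field_similarity("Ravi  Kumar", "ravi kumar")))   # green
print(status(field_similarity("Ravi Kumar", "Ravi Kumaar")))   # green
print(ring_band(4))                                            # HIGH
```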

	KYC validation

	- IFSC: format + RBI bank-code list (36 banks)
	- PAN: format + entity-type character (10 types per income-tax dept spec)
	- Aadhaar: 12-digit format + UIDAI Verhoeff checksum
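
The format portions of the IFSC and PAN checks can be sketched with regular expressions. This covers only the format rules (IFSC: four letters, a zero, six alphanumerics; PAN: five letters, four digits, one letter, with the fourth letter giving the entity type). The 36-bank lookup is not reproduced, the function names are hypothetical, and the sample values are synthetic.

```python
import re

IFSC_RE = re.compile(r"[A-Z]{4}0[A-Z0-9]{6}")
PAN_RE = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")

# Fourth PAN character: A, B, C, F, G, H, J, L, P, T (10 entity types).
PAN_ENTITY_TYPES = set("ABCFGHJLPT")


def ifsc_format_ok(code):
    """Format-only IFSC check; does not confirm the bank or branch exists."""
    return IFSC_RE.fullmatch(code.strip().upper()) is not None


def pan_format_ok(code):
    """Format check plus entity-type character check."""
    code = code.strip().upper()
    return PAN_RE.fullmatch(code) is not None and code[3] in PAN_ENTITY_TYPES


print(ifsc_format_ok("SBIN0001234"), ifsc_format_ok("SBIN1001234"))  # True False
print(pan_format_ok("ABCPK1234L"), pan_format_ok("ABCXK1234L"))      # True False
```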

	PII redaction & audit

	- Aadhaar, PAN, IFSC, account-number masking
	- PDF redaction with black rectangle overlays
	- SHA-256 hash-chained provenance ledger (RBI Para 67 compliant)
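
Text masking of this kind can be sketched with a few patterns. This is an illustrative sketch, not the project's `redact_text`: the helper name `mask_pii`, the patterns, and the replacement labels are assumptions, and account-number masking is left out because its pattern is not described here.

```python
import re

# Order matters only for readability; the patterns do not overlap on these formats.
PII_PATTERNS = {
    "AADHAAR": re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"),
    "PAN": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),
    "IFSC": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
}


def mask_pii(text):
    """Replace each matched identifier with a labelled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text


sample = "PAN ABCPK1234L, Aadhaar 2345 6789 0123, IFSC SBIN0001234"
print(mask_pii(sample))
# PAN [PAN REDACTED], Aadhaar [AADHAAR REDACTED], IFSC [IFSC REDACTED]
```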

	---

	## Running locally

	```bash
	git clone https://github.com/SpandanM110/Doc-Sentry.git
	cd Doc-Sentry
	pip install -r requirements.txt
	streamlit run app.py
	```

	The app opens in your browser at `http://localhost:8501`.

	For full OCR text-rule support, install Tesseract OCR:

	- Windows: https://github.com/UB-Mannheim/tesseract/wiki
	- macOS: `brew install tesseract`
	- Linux: `sudo apt-get install tesseract-ocr libtesseract-dev`

	The app auto-detects Tesseract on standard Windows install paths; no environment variable required.
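
Auto-detection of this kind can be sketched with the standard library. This is a guess at the approach, not the app's code: the function name `find_tesseract` is hypothetical and the candidate paths are the usual defaults of the Windows installer, which may differ on a given machine.

```python
import os
import shutil

# Assumed default install locations on Windows.
WINDOWS_CANDIDATES = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
]


def find_tesseract(candidates=WINDOWS_CANDIDATES):
    """Return a path to the tesseract binary, or None if it cannot be found."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


path = find_tesseract()
print(path if path else "Tesseract not found; OCR text rules will be skipped")
```

If a path is found, it can be handed to pytesseract by setting `pytesseract.pytesseract.tesseract_cmd`.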

	---

	## Deployment

	The repository is deployment-ready for both Streamlit Community Cloud and Hugging Face Spaces. The YAML frontmatter at the top of this README configures the HF Space; `packages.txt` ensures Tesseract is installed on the build VM; `requirements.txt` covers Python dependencies.

	Live deployment: https://huggingface.co/spaces/SpandanM110/DocSentry

	---

	## Training your own model

	Drop labelled data into `data/images/originals/` and `data/images/tampered/`, open `docsentry_master.ipynb`, and run section 6. A Random Forest trains automatically on whatever images are in those folders and is saved to `models/forgery_rf.joblib`. The Streamlit app picks it up on the next restart.

	For a CNN upgrade, set `TRAIN_CNN = True` in section 7 and run the notebook on a Colab T4 GPU (free tier). This saves `models/forgery_cnn.keras` and `models/forgery_cnn.meta.json`. The app loads the CNN lazily on the first request.

	---

	## Dependencies

	OpenCV (cv2), Pillow (PIL), scikit-image, scikit-learn, joblib, PyMuPDF (fitz), pdfplumber, pikepdf, pytesseract, python-dateutil, Streamlit, streamlit-drawable-canvas, ReportLab, NumPy, pandas, matplotlib, NetworkX. Optional: TensorFlow (only required for the CNN path).

	All pip-installable. No GPU required for the default pipeline.

	---

	## License

	MIT — see `LICENSE`. The MIT license covers the source code in this repository. Third-party datasets and pretrained models bundled or referenced (CASIA v2, IDRBT cheque dataset, AgamiAI Indian Bank Statements, MobileNetV2 ImageNet weights) are governed by their own terms; those notices are reproduced in `LICENSE` below the MIT block.

	---

	## Acknowledgements

	- AgamiAI Indian Bank Statements (Hugging Face) — Apache 2.0
	- IDRBT Cheque Image Dataset — Institute for Development and Research in Banking Technology, India
	- CASIA v2 image tampering dataset — Chinese Academy of Sciences
	- MICC-F220 copy-move benchmark — University of Florence
	- CoMoFoD dataset — University of Zagreb
	- Tobacco-3482 document corpus — University of Maryland