
title: DocSentry
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.32.0
app_file: app.py
pinned: false
license: mit
short_description: Document forensics + fraud-ring detection for Indian banks

BankShield

Real-Time Document Forensics, AI-Generated Forgery Detection, and Cross-Applicant Fraud-Ring Intelligence for Indian Bank Underwriting.

BankShield catches tampered, forged, and AI-generated documents the moment they reach the underwriter — and surfaces organised fraud rings that span multiple applicants. Six independent detection layers fuse into a single calibrated risk score, with explainable evidence, tamper-evident audit trails, and RBI-format compliance reports out of the box.

100% open source. No paid APIs. No external LLM calls. CPU-only by default. Runs locally inside the bank's perimeter, so PII never leaves it.

Live demo: https://huggingface.co/spaces/SpandanM110/DocSentry
Source: https://github.com/SpandanM110/Doc-Sentry
Architecture reference: see ARCHITECTURE.md

The six pillars

Pillar	Module	What it does
Image Forensics	`forensics.py`	ELA, copy-move (ORB), Laplacian noise inconsistency, EXIF audit
PDF Structural Audit	`forensics.py`	EOF marker counting, producer/creator drift, embedded-font anomalies, consumer-tool fingerprints
OCR + Financial Rules	`forensics.py`	Tesseract OCR + IFSC / PAN / Aadhaar / date monotonicity / amount sanity
AI-Generated Detection (new)	`ai_detector.py`	Radial FFT spectral analysis — catches Sora / Midjourney / Stable Diffusion outputs
Fraud Ring Network (new)	`fraud_ring.py`	NetworkX similarity graph across applicants; clique discovery flags organised fraud rings
Provenance Ledger (new)	`provenance.py`	SHA-256 hash chain over every analysis; O(N) verifiable; RBI Para 67 compliant

Plus the Live Tamper Forge Studio (tampering.py) — an adversarial-validation harness built directly into the dashboard.

Repository layout

Doc-Sentry/
├── app.py                       Streamlit web UI (6 tabs)
├── forensics.py                 Core detection engine + ensemble fusion
├── ai_detector.py               AI-generated forgery detector (FFT spectral)
├── fraud_ring.py                Cross-applicant similarity graph + clique detection
├── provenance.py                Tamper-evident SHA-256 hash chain
├── tampering.py                 Forge Studio adversarial harness
├── compliance.py                KYC validators, PII redaction, RBI report builder
├── audit_report.py              Bank-letterhead PDF report builder
├── docsentry_master.ipynb       Single source-of-truth Jupyter notebook
│
├── requirements.txt             Python dependencies
├── packages.txt                 System packages (Tesseract) for Streamlit Cloud / HF Spaces
├── .streamlit/config.toml       Streamlit theme + server config
│
├── sample_data/                 26 demo files for the live app
│   ├── originals/               12 genuine documents
│   ├── tampered/                12 tampered documents
│   └── pdfs/                    2 PDFs (1 genuine, 1 tampered)
│
├── models/                      Trained model artefacts
│   ├── forgery_rf.joblib        Random Forest classifier
│   └── forgery_cnn.keras        MobileNetV2 fine-tuned on CASIA v2 (optional)
│
├── ARCHITECTURE.md              Full architecture reference
├── SUBMISSION.md                Hackathon submission packet
├── BankShield_Pitch.pptx        Pitch deck (15 slides)
├── README.md                    This file
├── LICENSE                      MIT license + third-party notices
└── data/                        (gitignored) full training data + downloaded datasets

Module reference

`forensics.py` — detection engine

The core analytical module. Stateless functions; all logic is independently testable.

Function	Returns	Description
`analyse_document(path)`	dict	End-to-end pipeline. Auto-detects type, runs all relevant detectors, blends Random Forest + CNN + AI-gen predictions, auto-logs to provenance ledger. Primary entry point.
`score_image(path)`	(float, dict, list)	Composite forensic score for an image. Returns total, sub-scores by detector, and EXIF flags.
`error_level_analysis(path, quality=90)`	(PIL.Image, float)	ELA visualisation + scalar suspicion score.
`copy_move_detect(path)`	(np.ndarray, int, list)	ORB-based copy-move detection. Returns annotated viz, match count, raw matches.
`noise_inconsistency(path, block=32)`	(np.ndarray, float)	Per-block Laplacian variance heatmap + outlier ratio.
`exif_sanity(path)`	list of str	EXIF audit: missing EXIF, editor signatures, timestamp inconsistencies.
`pdf_structural_audit(path)`	dict	`%%EOF` markers, producer/creator drift, consumer-tool fingerprints.
`pdf_font_audit(path)`	dict	Embedded font listing + count anomalies.
`ocr_text(path)`	str	Tesseract OCR with auto-fallback.
`text_rule_checks(text)`	dict	Date monotonicity, amount sanity, IFSC format, account-number patterns.
`extract_features(path)`	dict	11-feature vector for the Random Forest.
`predict_with_model(path)`	dict / None	Random Forest tamper probability + verdict.
`predict_with_cnn(path)`	dict / None	MobileNetV2 CNN inference (lazy-loaded).
`extract_identity_fields(path)`	(dict, str)	Pulls name, DOB, address, IFSC, account, amounts.
`cross_doc_consistency(paths)`	dict	Per-field similarity across 2+ documents.
`generate_insights(score, sub, flags)`	dict	Numeric → underwriter-readable bullets + recommended action.
`band(score)`	str	Maps a float to LOW / MEDIUM / HIGH / CRITICAL.

`ai_detector.py` — AI-generated forgery detection

Function	Description
`detect_ai_generated(path)`	Full pipeline → probability + verdict + flags + FFT profile.
`radial_fft_profile(gray)`	Radially-averaged log-magnitude FFT spectrum.
`high_freq_attenuation(profile)`	Smoothness score — low for real scans, high for AI outputs.
`spectral_peak_score(profile)`	Counts checkerboard-stride peaks in the high-frequency band.
`jpeg_quantization_check(path)`	Inspects JPEG quantization tables for synthetic-media signatures.

The detector's AI-generated probability is blended into the main risk score as an overlay capped at +20%, so AI-gen signals reliably surface synthetic media without dominating the classical detectors.
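As a rough illustration of the spectral idea behind radial_fft_profile, the sketch below computes a radially averaged log-magnitude spectrum with NumPy. It is not taken from ai_detector.py: the n_bins parameter and the high_freq_ratio helper are assumptions made for this example, and the real detector may bin and score the spectrum differently.

```python
import numpy as np


def radial_fft_profile(gray: np.ndarray, n_bins: int = 64) -> np.ndarray:
    """Radially averaged log-magnitude spectrum of a 2-D grayscale array."""
    # Move the zero-frequency component to the centre, then take log magnitude.
    spectrum = np.fft.fftshift(np.fft.fft2(gray.astype(np.float64)))
    log_mag = np.log1p(np.abs(spectrum))

    # Distance of every frequency sample from the spectrum centre.
    h, w = gray.shape
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)

    # Assign each sample to one of n_bins concentric rings; samples beyond
    # the inscribed circle fall into an overflow bin that is discarded.
    max_r = min(h, w) / 2
    ring = np.minimum((radius / max_r * n_bins).astype(int), n_bins)
    sums = np.bincount(ring.ravel(), weights=log_mag.ravel(), minlength=n_bins + 1)
    counts = np.bincount(ring.ravel(), minlength=n_bins + 1)
    return sums[:n_bins] / np.maximum(counts[:n_bins], 1)


def high_freq_ratio(profile: np.ndarray) -> float:
    """Hypothetical helper: mean of the outer quarter over the inner quarter."""
    q = len(profile) // 4
    return float(profile[-q:].mean() / profile[:q].mean())
```

A ratio well below that of comparable genuine scans would indicate suppressed high frequencies; what counts as "well below" has to be calibrated on real data.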

`fraud_ring.py` — cross-applicant fraud-ring detection

Function	Description
`extract_applicant_fields(path)`	OCR + regex pull of name / DOB / address / phone / IFSC / account / employer.
`compare_applicants(a, b)`	Per-field similarity + weighted score.
`build_fraud_graph(applicants)`	NetworkX similarity graph (edges weighted by shared signals).
`detect_rings(G, min_size=3)`	Connected components above threshold → suspected fraud rings.
`visualize_graph(G, rings)`	Force-directed graph with ring members in red.
`fraud_summary(G, rings, applicants)`	Structured summary for the Streamlit UI.
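The connected-components approach listed for detect_rings can be sketched without NetworkX. The version below is a minimal stand-in, assuming the graph is given as a list of (applicant_a, applicant_b, similarity) tuples; the find_rings name and the 0.6 threshold are hypothetical and not taken from fraud_ring.py.

```python
from collections import defaultdict


def find_rings(edges, threshold=0.6, min_size=3):
    """Group applicants linked by similarity >= threshold into components."""
    # Keep only edges strong enough to count as a shared signal.
    adjacency = defaultdict(set)
    for a, b, score in edges:
        if score >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, rings = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first walk collects one connected component.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        # Components smaller than min_size are not reported as rings.
        if len(component) >= min_size:
            rings.append(sorted(component))
    return rings
```

For example, with edges A-B (0.9), B-C (0.8), D-E (0.95) and E-F (0.2), only A, B and C form a component of at least three members.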

`provenance.py` — tamper-evident audit ledger

Function	Description
`log_analysis(...)`	Appends a SHA-256 hash-chained record to the SQLite ledger.
`verify_chain()`	Walks every record in O(N); pinpoints the first broken record.
`chain_stats()`	Count, first/last timestamps, breakdown by risk band, chain status.
`fetch_ledger(limit)`	Returns the latest N entries.
`ledger_dataframe(limit)`	Pandas DataFrame view (for Streamlit display).

Each record's hash is computed as record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash), so a retroactive edit to any record changes its hash and breaks the link to every record after it.
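The hash-chain rule above can be sketched in a few lines of standard-library Python. This is an in-memory illustration, not the SQLite-backed code in provenance.py: the all-zero genesis hash, the "|" join, the four-decimal score format and the append_record helper are all assumptions made for the example.

```python
import hashlib

GENESIS = "0" * 64  # assumed prev_hash for the very first record


def record_hash(timestamp, doc_sha256, risk_band, risk_score, prev_hash):
    """SHA-256 over the pipe-joined fields named in the formula."""
    payload = "|".join(
        [timestamp, doc_sha256, risk_band, f"{risk_score:.4f}", prev_hash]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(ledger, timestamp, doc_sha256, risk_band, risk_score):
    """Append a record linked to the previous record's hash."""
    prev_hash = ledger[-1]["record_hash"] if ledger else GENESIS
    entry = {
        "timestamp": timestamp,
        "doc_sha256": doc_sha256,
        "risk_band": risk_band,
        "risk_score": risk_score,
        "prev_hash": prev_hash,
    }
    entry["record_hash"] = record_hash(
        timestamp, doc_sha256, risk_band, risk_score, prev_hash
    )
    ledger.append(entry)
    return entry


def verify_chain(ledger):
    """Return the index of the first broken record, or None if intact."""
    prev_hash = GENESIS
    for index, rec in enumerate(ledger):
        expected = record_hash(
            rec["timestamp"], rec["doc_sha256"], rec["risk_band"],
            rec["risk_score"], prev_hash,
        )
        if rec["prev_hash"] != prev_hash or rec["record_hash"] != expected:
            return index
        prev_hash = rec["record_hash"]
    return None
```

Verification is a single pass over the ledger, which is where the O(N) cost comes from.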

`tampering.py` — adversarial Forge Studio

Functions: tamper_copy_move, tamper_text_edit, tamper_splice, tamper_compression, tamper_metadata_strip, tamper_custom_region, tamper_chain, annotate_before_after, overlay_heatmap_on_image, detector_scorecard. Tab 5 uses them to apply controlled forgeries and immediately re-run detection.

`compliance.py` — KYC + regulatory

Function	Description
`validate_ifsc(code)`	Format check + RBI bank-code lookup (36 banks).
`validate_pan(code)`	Format + entity-type character validation.
`validate_aadhaar(num)`	12-digit format + UIDAI Verhoeff checksum.
`redact_text(text)`	Masks IFSC, PAN, Aadhaar, account numbers.
`redact_pdf(input_path, output_path)`	PII black-box overlays via PyMuPDF text-bbox.
`extract_pii_fields(path)`	Pulls all PII candidates from any document.
`build_compliance_report(...)`	RBI Master-Direction-format audit PDF (5 sections).
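The format checks for PAN and IFSC can be illustrated with two regular expressions. This is a sketch under assumptions, not the code in compliance.py: the function names are hypothetical, the three bank codes are an illustrative subset rather than the 36-bank list, and the sample values in the usage note are made up.

```python
import re

# PAN: 5 letters, 4 digits, 1 letter; the 4th letter encodes the entity type.
PAN_RE = re.compile(r"[A-Z]{3}[PCHFATBLJG][A-Z][0-9]{4}[A-Z]")

# IFSC: 4-letter bank code, a literal 0, then 6 alphanumeric branch characters.
IFSC_RE = re.compile(r"[A-Z]{4}0[A-Z0-9]{6}")

# Illustrative subset only; the real lookup table is assumed to be larger.
KNOWN_BANK_CODES = {"SBIN", "HDFC", "ICIC"}


def pan_format_ok(code: str) -> bool:
    """True when the string has PAN shape and a valid entity-type letter."""
    return PAN_RE.fullmatch(code.strip().upper()) is not None


def ifsc_format_ok(code: str) -> bool:
    """True when the string has IFSC shape and a known bank prefix."""
    code = code.strip().upper()
    return IFSC_RE.fullmatch(code) is not None and code[:4] in KNOWN_BANK_CODES
```

Under these rules pan_format_ok("ABCPD1234E") passes, while "ABCXD1234E" fails because X is not an entity-type letter, and "SBIN1001234" fails the IFSC check because the fifth character must be 0.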

`audit_report.py` — bank-letterhead PDF

build_pdf_report(report, source_path) → bytes. Multi-page PDF with header letterhead, metadata table, colour-coded risk verdict box, sub-score breakdown table, evidence list, embedded forensic heatmaps. Built with ReportLab Platypus.

`app.py` — Streamlit UI (6 tabs)

Tab	Function
1. Single-document analysis	Risk band, sub-score chart, ELA / copy-move / noise heatmaps, AI-gen FFT profile, ML/CNN predictions, downloadable JSON + PDF.
2. Cross-document KYC	Upload 2–4 docs for one applicant; identity-field consistency table.
3. Batch audit	Scan a folder; sortable risk table + CSV download.
4. Compliance & Audit Pack	KYC validation, PII auto-redaction, RBI compliance PDF, provenance ledger view with chain re-verify.
5. Live Tamper Forge Studio	Pick clean sample → choose technique + intensity → watch BankShield localise the tamper with per-detector scorecard + heatmap overlays.
6. Fraud Ring Network	Upload N applicants → similarity graph with red ring members + ring summary cards.

Pipeline architecture

                    ┌────────────────────────────────────────┐
                    │   PRESENTATION (Streamlit, 6 tabs)     │
                    └──────────────────┬─────────────────────┘
                                       ▼
              ┌──────────────────────────────────────────────┐
              │   FORENSICS CORE                             │
              │   ELA · Copy-move · Noise · EXIF · OCR · PDF │
              │   + Random Forest (11-d feature vector)      │
              │   + MobileNetV2 CNN (CASIA v2 fine-tuned)    │
              │   + AI-Gen Detector (radial FFT)             │
              └──────────────────┬───────────────────────────┘
                                 ▼
              ┌──────────────────────────────────────────────┐
              │   ENSEMBLE FUSION                            │
              │   weighted blend → RF overlay → CNN overlay  │
              │   → AI-gen overlay (capped at +20%)          │
              └──────────────────┬───────────────────────────┘
                                 ▼
        ┌──────────────────┬─────┴─────┬──────────────────┐
        ▼                  ▼           ▼                  ▼
┌──────────────┐  ┌────────────────┐ ┌──────────────┐ ┌────────────────┐
│ COMPLIANCE   │  │ FRAUD-RING     │ │ PROVENANCE   │ │ TAMPER FORGE   │
│ IFSC · PAN · │  │ NetworkX graph │ │ SHA-256 hash │ │ Adversarial    │
│ Aadhaar · PII│  │ clique detect  │ │ chain ledger │ │ validation     │
└──────┬───────┘  └────────┬───────┘ └──────┬───────┘ └────────────────┘
       │                   │                │
       └────────────┬──────┴────────────────┘
                    ▼
         ┌────────────────────────────────────┐
         │   OUTPUT                           │
         │   Risk band · Evidence list        │
         │   Bank-letterhead audit PDF        │
         │   RBI compliance PDF · Audit JSON  │
         │   Tamper-evident ledger entry      │
         └────────────────────────────────────┘

Default weight vector (forensics.WEIGHTS): {ela: 0.20, copy_move: 0.25, noise: 0.20, exif: 0.15, text_rules: 0.20}. The Random Forest probability, when available, is blended 50/50 with the rule-based score. The CNN probability is blended at a weight between 0.4 and 0.7 based on the CNN's reported validation AUC. The AI-gen probability is applied as a final overlay capped at +20%.

Band mapping: 0–0.30 LOW · 0.30–0.50 MEDIUM · 0.50–0.75 HIGH · 0.75+ CRITICAL.
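The fusion and banding rules above can be sketched as follows. Several details are assumptions for illustration: the cap is read as an additive overlay of at most 0.20, each band includes its lower bound, the CNN blend is omitted because its weight depends on the reported validation AUC, and the fuse function is a hypothetical name rather than the one used in forensics.py.

```python
WEIGHTS = {"ela": 0.20, "copy_move": 0.25, "noise": 0.20,
           "exif": 0.15, "text_rules": 0.20}


def fuse(sub_scores, rf_prob=None, ai_prob=None, ai_cap=0.20):
    """Weighted rule score, optional 50/50 RF blend, capped AI-gen overlay."""
    # Rule-based score: weighted sum of the detector sub-scores (each in 0..1).
    score = sum(WEIGHTS[name] * sub_scores.get(name, 0.0) for name in WEIGHTS)

    # Random Forest probability, when available, is blended 50/50.
    if rf_prob is not None:
        score = 0.5 * score + 0.5 * rf_prob

    # Assumption: the AI-gen overlay adds at most ai_cap to the score.
    if ai_prob is not None:
        score += ai_cap * ai_prob

    return min(max(score, 0.0), 1.0)


def band(score):
    """Map a score to a risk band; lower bounds are assumed inclusive."""
    if score < 0.30:
        return "LOW"
    if score < 0.50:
        return "MEDIUM"
    if score < 0.75:
        return "HIGH"
    return "CRITICAL"
```

With every sub-score at 0.5 the rule score is 0.5; a Random Forest probability of 0.9 lifts it to 0.7 (HIGH), and a fully confident AI-gen signal pushes it to 0.9 (CRITICAL).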

See ARCHITECTURE.md for the full reference.

Detection coverage

Image tampering

Copy-move forgery — ORB keypoint matching with distance filter
Image splicing — block-wise noise inconsistency via Laplacian variance
Text edits / amount tampering — Error Level Analysis
Photoshop / GIMP / Snapseed edits — EXIF Software-tag string match
Timestamp inconsistencies — DateTime vs DateTimeOriginal comparison
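To make the splicing check concrete, here is a minimal NumPy sketch of block-wise Laplacian variance. It is an illustration under assumptions rather than the forensics.py implementation: the 4-neighbour Laplacian is computed with array slicing instead of OpenCV, and the median-absolute-deviation rule in outlier_ratio (with k=3) is a guess at what "outlier ratio" could mean.

```python
import numpy as np


def block_laplacian_variance(gray: np.ndarray, block: int = 32) -> np.ndarray:
    """Variance of the Laplacian response inside each block x block tile."""
    g = gray.astype(np.float64)
    # 4-neighbour Laplacian on interior pixels (output is 2 px smaller per axis).
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    rows, cols = lap.shape[0] // block, lap.shape[1] // block
    # Crop to whole tiles, then reduce each tile to its variance.
    tiles = lap[:rows * block, :cols * block].reshape(rows, block, cols, block)
    return tiles.var(axis=(1, 3))


def outlier_ratio(heatmap: np.ndarray, k: float = 3.0) -> float:
    """Share of tiles whose variance is far from the median (assumed rule)."""
    median = np.median(heatmap)
    mad = np.median(np.abs(heatmap - median)) + 1e-9
    return float(np.mean(np.abs(heatmap - median) > k * mad))
```

A region pasted from another source usually carries a different noise level, so its tiles stand out in the heatmap even when the seam is invisible to the eye.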

AI-generated content

Sora / Midjourney / Stable Diffusion / DALL-E outputs — FFT spectral analysis
High-frequency suppression (1/f decay deviation)
Periodic checkerboard peaks from upsampling stride
Non-standard JPEG quantization tables

PDF tampering

Incremental edits — multi-%%EOF marker counting
Consumer-tool fingerprints — iLovePDF, Smallpdf, PDFescape, Sejda, Foxit Phantom
Producer/Creator mismatch — flags re-processed PDFs
Inserted text — embedded-font count anomalies
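The %%EOF check is simple enough to show in full. The sketch below is an assumption-level illustration of the idea, not the body of pdf_structural_audit; both function names are hypothetical.

```python
def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count end-of-file markers; each incremental save appends another one."""
    return pdf_bytes.count(b"%%EOF")


def looks_incrementally_edited(pdf_bytes: bytes) -> bool:
    """Flag files that contain more than one %%EOF marker."""
    return count_eof_markers(pdf_bytes) > 1
```

Some legitimately produced files (for example linearized PDFs) can also contain more than one marker, so the count is best treated as one signal among several rather than proof of tampering.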

Cross-document & fraud-ring

Name / DOB / address fuzzy match across multiple documents
Per-field weighted scoring with green / yellow / red status
Cross-applicant similarity graph; cliques ≥3 = suspected fraud ring
Ring bands: CRITICAL (≥5 members) / HIGH (3–4) / MEDIUM (2)
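The fuzzy matching and ring banding in this list can be sketched with the standard library. The green / yellow thresholds (0.90 and 0.70), the use of difflib, and all three function names are assumptions for illustration; only the ring-size bands come from the list above.

```python
from difflib import SequenceMatcher


def _normalise(value: str) -> str:
    """Lower-case and collapse whitespace before comparing."""
    return " ".join(value.lower().split())


def field_similarity(a: str, b: str) -> float:
    """Similarity in 0..1 between two field values."""
    return SequenceMatcher(None, _normalise(a), _normalise(b)).ratio()


def field_status(similarity: float, green: float = 0.90, yellow: float = 0.70) -> str:
    """Traffic-light status; the thresholds are assumed values."""
    if similarity >= green:
        return "green"
    if similarity >= yellow:
        return "yellow"
    return "red"


def ring_band(size: int):
    """Band a group of linked applicants by member count."""
    if size >= 5:
        return "CRITICAL"
    if size >= 3:
        return "HIGH"
    if size == 2:
        return "MEDIUM"
    return None
```

Normalising case and whitespace first means "Rahul  Sharma" and "rahul sharma" compare as identical, while unrelated names fall into the red band.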

KYC validation

IFSC: format + RBI bank-code list (36 banks)
PAN: format + entity-type character (10 types per income-tax dept spec)
Aadhaar: 12-digit format + UIDAI Verhoeff checksum
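For reference, the Verhoeff checksum named above can be written with its three standard lookup tables. This is a generic implementation of the published algorithm, offered as a sketch; the function names are hypothetical and compliance.py may structure the check differently.

```python
import re

# Standard Verhoeff tables: dihedral-group multiplication, permutation, inverse.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6], [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8], [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2], [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2], [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0], [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5], [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def _verhoeff_state(digits: str, start: int) -> int:
    """Fold the digits right-to-left through the permutation and group tables."""
    state = 0
    for position, char in enumerate(reversed(digits), start=start):
        state = D[state][P[position % 8][int(char)]]
    return state


def verhoeff_check_digit(payload: str) -> str:
    """Check digit to append to a string of digits."""
    return str(INV[_verhoeff_state(payload, 1)])


def verhoeff_valid(number: str) -> bool:
    """True when the trailing digit is a correct Verhoeff check digit."""
    return _verhoeff_state(number, 0) == 0


def aadhaar_checksum_ok(num: str) -> bool:
    """12 digits (spaces ignored) with a valid Verhoeff check digit."""
    digits = num.replace(" ", "")
    return re.fullmatch(r"[0-9]{12}", digits) is not None and verhoeff_valid(digits)
```

The checksum catches every single-digit error, which is why it suits numbers that are typed or OCR'd from scanned documents.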

PII redaction & audit

Aadhaar, PAN, IFSC, account-number masking
PDF redaction with black rectangle overlays
SHA-256 hash-chained provenance ledger (RBI Para 67 compliant)
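Text-level masking of the identifiers listed above might look like the sketch below. The patterns, the choice to keep the last four characters visible, the 9 to 18 digit range for account numbers, and the mask_pii name are all assumptions for illustration, not the behaviour of redact_text in compliance.py.

```python
import re

# Applied in order; Aadhaar runs first so spaced 12-digit groups are caught whole.
PII_PATTERNS = [
    ("aadhaar", re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b")),
    ("account", re.compile(r"\b\d{9,18}\b")),  # assumed length range
    ("pan", re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")),
    ("ifsc", re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b")),
]


def _mask(match, keep_last):
    """Replace all but the last keep_last characters with X."""
    raw = match.group(0)
    hidden = max(len(raw) - keep_last, 0)
    return "X" * hidden + raw[hidden:]


def mask_pii(text: str, keep_last: int = 4) -> str:
    """Mask every PII match in the text, pattern by pattern."""
    for _name, pattern in PII_PATTERNS:
        text = pattern.sub(lambda m: _mask(m, keep_last), text)
    return text
```

Passing keep_last=0 hides each identifier entirely, which may be preferable when the masked text is shared outside the underwriting team.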

Running locally

git clone https://github.com/SpandanM110/Doc-Sentry.git
cd Doc-Sentry
pip install -r requirements.txt
streamlit run app.py

The browser opens at http://localhost:8501.

For full OCR text-rule support, install Tesseract OCR:

Windows: https://github.com/UB-Mannheim/tesseract/wiki
macOS: brew install tesseract
Linux: sudo apt-get install tesseract-ocr libtesseract-dev

The app auto-detects Tesseract on standard Windows install paths; no environment variable required.
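Auto-detection of this kind is typically a PATH lookup followed by a check of well-known install locations. The sketch below shows one way to do it; the find_tesseract name and the two candidate paths are assumptions, and the app's own lookup may differ.

```python
import os
import shutil

# Assumed default locations for a Windows Tesseract install.
WINDOWS_CANDIDATES = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
]


def find_tesseract(candidates=WINDOWS_CANDIDATES):
    """Return a path to the tesseract binary, or None if it cannot be found."""
    # Prefer whatever is already on PATH (the usual case on macOS and Linux).
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise fall back to the candidate install locations.
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None
```

When a path is found, it can be handed to pytesseract by setting pytesseract.pytesseract.tesseract_cmd before the first OCR call.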

Deployment

The repository is deployment-ready for both Streamlit Community Cloud and Hugging Face Spaces. The YAML frontmatter at the top of this README configures the HF Space; packages.txt ensures Tesseract is installed on the build VM; requirements.txt covers Python dependencies.

Live deployment: https://huggingface.co/spaces/SpandanM110/DocSentry

Training your own model

Drop labelled data into data/images/originals/ and data/images/tampered/, open docsentry_master.ipynb, and run section 6. A Random Forest trains on whatever images those folders contain and is saved to models/forgery_rf.joblib. The Streamlit app picks it up automatically on the next restart.

For a CNN upgrade, set TRAIN_CNN = True in section 7 and run the notebook on a Colab T4 GPU (free tier). Training saves models/forgery_cnn.keras and models/forgery_cnn.meta.json, and the app loads the model lazily on first request.

Dependencies

OpenCV (cv2), Pillow (PIL), scikit-image, scikit-learn, joblib, PyMuPDF (fitz), pdfplumber, pikepdf, pytesseract, python-dateutil, Streamlit, streamlit-drawable-canvas, ReportLab, NumPy, pandas, matplotlib, NetworkX. Optional: TensorFlow (only required for the CNN path).

All pip-installable. No GPU required for the default pipeline.

License

MIT — see LICENSE. The MIT license covers the source code in this repository. Third-party datasets and pretrained models bundled or referenced (CASIA v2, IDRBT cheque dataset, AgamiAI Indian Bank Statements, MobileNetV2 ImageNet weights) are governed by their own terms; those notices are reproduced in LICENSE below the MIT block.

Acknowledgements

AgamiAI Indian Bank Statements (Hugging Face) — Apache 2.0
IDRBT Cheque Image Dataset — Institute for Development and Research in Banking Technology, India
CASIA v2 image tampering dataset — Chinese Academy of Sciences
MICC-F220 copy-move benchmark — University of Florence
CoMoFoD dataset — University of Zagreb
Tobacco-3482 document corpus — University of Maryland