---
title: DocSentry
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.32.0
app_file: app.py
pinned: false
license: mit
short_description: Document forensics + fraud-ring detection for Indian banks
---

# BankShield

**Real-Time Document Forensics, AI-Generated Forgery Detection, and Cross-Applicant Fraud-Ring Intelligence for Indian Bank Underwriting.**

BankShield catches tampered, forged, and AI-generated documents the moment they reach the underwriter, and surfaces organised fraud rings that span multiple applicants. Six independent detection layers fuse into a single calibrated risk score, with explainable evidence, tamper-evident audit trails, and RBI-format compliance reports out of the box.

100% open source. No paid APIs. No external LLM calls. CPU-only by default. Runs locally on the bank's perimeter; PII never leaves.

- **Live demo:** https://huggingface.co/spaces/SpandanM110/DocSentry
- **Source:** https://github.com/SpandanM110/Doc-Sentry
- **Architecture reference:** see [`ARCHITECTURE.md`](ARCHITECTURE.md)

---

## The six pillars

| Pillar | Module | What it does |
|---|---|---|
| **Image Forensics** | `forensics.py` | ELA, copy-move (ORB), Laplacian noise inconsistency, EXIF audit |
| **PDF Structural Audit** | `forensics.py` | EOF marker counting, producer/creator drift, embedded-font anomalies, consumer-tool fingerprints |
| **OCR + Financial Rules** | `forensics.py` | Tesseract OCR + IFSC / PAN / Aadhaar / date monotonicity / amount sanity |
| **AI-Generated Detection** *(new)* | `ai_detector.py` | Radial FFT spectral analysis; catches Sora / Midjourney / Stable Diffusion outputs |
| **Fraud Ring Network** *(new)* | `fraud_ring.py` | NetworkX similarity graph across applicants; clique discovery flags organised fraud rings |
| **Provenance Ledger** *(new)* | `provenance.py` | SHA-256 hash chain over every analysis; O(N) verifiable; RBI Para 67 compliant |

Plus the **Live Tamper Forge Studio** (`tampering.py`), an adversarial-validation harness built directly into the dashboard.

---

## Repository layout

```
Doc-Sentry/
├── app.py                    Streamlit web UI (6 tabs)
├── forensics.py              Core detection engine + ensemble fusion
├── ai_detector.py            AI-generated forgery detector (FFT spectral)
├── fraud_ring.py             Cross-applicant similarity graph + clique detection
├── provenance.py             Tamper-evident SHA-256 hash chain
├── tampering.py              Forge Studio adversarial harness
├── compliance.py             KYC validators, PII redaction, RBI report builder
├── audit_report.py           Bank-letterhead PDF report builder
├── docsentry_master.ipynb    Single source-of-truth Jupyter notebook
│
├── requirements.txt          Python dependencies
├── packages.txt              System packages (Tesseract) for Streamlit Cloud / HF Spaces
├── .streamlit/config.toml    Streamlit theme + server config
│
├── sample_data/              26 demo files for the live app
│   ├── originals/            12 genuine documents
│   ├── tampered/             12 tampered documents
│   └── pdfs/                 2 PDFs (1 genuine, 1 tampered)
│
├── models/                   Trained model artefacts
│   ├── forgery_rf.joblib     Random Forest classifier
│   └── forgery_cnn.keras     MobileNetV2 fine-tuned on CASIA v2 (optional)
│
├── ARCHITECTURE.md           Full architecture reference
├── SUBMISSION.md             Hackathon submission packet
├── BankShield_Pitch.pptx     Pitch deck (15 slides)
├── README.md  LICENSE
└── data/                     (gitignored) full training data + downloaded datasets
```

---

## Module reference

### `forensics.py`: detection engine

The core analytical module. Stateless functions; all logic is independently testable.

| Function | Returns | Description |
|---|---|---|
| `analyse_document(path)` | dict | End-to-end pipeline. Auto-detects type, runs all relevant detectors, blends Random Forest + CNN + AI-gen predictions, auto-logs to provenance ledger. Primary entry point. |
| `score_image(path)` | (float, dict, list) | Composite forensic score for an image. Returns total, sub-scores by detector, and EXIF flags. |
| `error_level_analysis(path, quality=90)` | (PIL.Image, float) | ELA visualisation + scalar suspicion score. |
| `copy_move_detect(path)` | (np.ndarray, int, list) | ORB-based copy-move detection. Returns annotated viz, match count, raw matches. |
| `noise_inconsistency(path, block=32)` | (np.ndarray, float) | Per-block Laplacian variance heatmap + outlier ratio. |
| `exif_sanity(path)` | list of str | EXIF audit: missing EXIF, editor signatures, timestamp inconsistencies. |
| `pdf_structural_audit(path)` | dict | `%%EOF` markers, producer/creator drift, consumer-tool fingerprints. |
| `pdf_font_audit(path)` | dict | Embedded font listing + count anomalies. |
| `ocr_text(path)` | str | Tesseract OCR with auto-fallback. |
| `text_rule_checks(text)` | dict | Date monotonicity, amount sanity, IFSC format, account-number patterns. |
| `extract_features(path)` | dict | 11-feature vector for the Random Forest. |
| `predict_with_model(path)` | dict / None | Random Forest tamper probability + verdict. |
| `predict_with_cnn(path)` | dict / None | MobileNetV2 CNN inference (lazy-loaded). |
| `extract_identity_fields(path)` | (dict, str) | Pulls name, DOB, address, IFSC, account, amounts. |
| `cross_doc_consistency(paths)` | dict | Per-field similarity across 2+ documents. |
| `generate_insights(score, sub, flags)` | dict | Numeric → underwriter-readable bullets + recommended action. |
| `band(score)` | str | Maps a float to LOW / MEDIUM / HIGH / CRITICAL. |

### `ai_detector.py`: AI-generated forgery detection

| Function | Description |
|---|---|
| `detect_ai_generated(path)` | Full pipeline → probability + verdict + flags + FFT profile. |
| `radial_fft_profile(gray)` | Radially-averaged log-magnitude FFT spectrum. |
| `high_freq_attenuation(profile)` | Smoothness score: low for real scans, high for AI outputs. |
| `spectral_peak_score(profile)` | Counts checkerboard-stride peaks in the high-frequency band. |
| `jpeg_quantization_check(path)` | Inspects JPEG quantization tables for synthetic-media signatures. |

Blended into the main risk score with a capped +20% overlay so AI-gen signals reliably surface synthetic media without dominating classical detectors.

### `fraud_ring.py`: cross-applicant fraud-ring detection

| Function | Description |
|---|---|
| `extract_applicant_fields(path)` | OCR + regex pull of name / DOB / address / phone / IFSC / account / employer. |
| `compare_applicants(a, b)` | Per-field similarity + weighted score. |
| `build_fraud_graph(applicants)` | NetworkX similarity graph (edges weighted by shared signals). |
| `detect_rings(G, min_size=3)` | Connected components above threshold → suspected fraud rings. |
| `visualize_graph(G, rings)` | Force-directed graph with ring members in red. |
| `fraud_summary(G, rings, applicants)` | Structured summary for the Streamlit UI. |

### `provenance.py`: tamper-evident audit ledger

| Function | Description |
|---|---|
| `log_analysis(...)` | Appends a SHA-256 hash-chained record to the SQLite ledger. |
| `verify_chain()` | Walks every record in O(N); pinpoints the first broken record. |
| `chain_stats()` | Count, first/last timestamps, breakdown by risk band, chain status. |
| `fetch_ledger(limit)` | Returns the latest N entries. |
| `ledger_dataframe(limit)` | Pandas DataFrame view (for Streamlit display). |

Each record's `record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)`, so retroactive edits break the chain mathematically.

### `tampering.py`: adversarial Forge Studio

`tamper_copy_move`, `tamper_text_edit`, `tamper_splice`, `tamper_compression`, `tamper_metadata_strip`, `tamper_custom_region`, `tamper_chain`, `annotate_before_after`, `overlay_heatmap_on_image`, `detector_scorecard`.

Used by Tab 5 to apply controlled forgeries and immediately re-run detection.

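
To make the `record_hash` formula from the `provenance.py` section concrete, here is a minimal sketch of a hash-chained ledger. It is an illustration under stated assumptions, not the code in `provenance.py`: it keeps records in an in-memory list rather than SQLite, and the names `GENESIS`, `compute_record_hash`, `append_record`, and `first_broken_record`, the `|` join, and the four-decimal score formatting are all hypothetical choices.

```python
import hashlib
from typing import Dict, List, Optional

# Assumed placeholder used as prev_hash for the very first record.
GENESIS = "0" * 64


def compute_record_hash(timestamp: str, doc_sha256: str, risk_band: str,
                        risk_score: float, prev_hash: str) -> str:
    """SHA-256 over the five fields, joined with '|' (assumed serialisation)."""
    payload = "|".join([timestamp, doc_sha256, risk_band,
                        f"{risk_score:.4f}", prev_hash])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(ledger: List[Dict], timestamp: str, doc_bytes: bytes,
                  risk_band: str, risk_score: float) -> Dict:
    """Append a record whose hash covers the previous record's hash."""
    prev_hash = ledger[-1]["record_hash"] if ledger else GENESIS
    record = {
        "timestamp": timestamp,
        "doc_sha256": hashlib.sha256(doc_bytes).hexdigest(),
        "risk_band": risk_band,
        "risk_score": risk_score,
        "prev_hash": prev_hash,
    }
    record["record_hash"] = compute_record_hash(
        timestamp, record["doc_sha256"], risk_band, risk_score, prev_hash)
    ledger.append(record)
    return record


def first_broken_record(ledger: List[Dict]) -> Optional[int]:
    """Walk the chain once (O(N)); return the index of the first bad record."""
    prev_hash = GENESIS
    for index, record in enumerate(ledger):
        expected = compute_record_hash(
            record["timestamp"], record["doc_sha256"], record["risk_band"],
            record["risk_score"], prev_hash)
        if record["prev_hash"] != prev_hash or record["record_hash"] != expected:
            return index
        prev_hash = record["record_hash"]
    return None
```

Because every hash covers the previous record's hash, editing any stored field in an earlier record makes its recomputed hash differ from the stored one, and the walk stops at that record.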
### `compliance.py`: KYC + regulatory

| Function | Description |
|---|---|
| `validate_ifsc(code)` | Format check + RBI bank-code lookup (36 banks). |
| `validate_pan(code)` | Format + entity-type character validation. |
| `validate_aadhaar(num)` | 12-digit format + UIDAI Verhoeff checksum. |
| `redact_text(text)` | Masks IFSC, PAN, Aadhaar, account numbers. |
| `redact_pdf(input_path, output_path)` | PII black-box overlays via PyMuPDF text-bbox. |
| `extract_pii_fields(path)` | Pulls all PII candidates from any document. |
| `build_compliance_report(...)` | RBI Master-Direction-format audit PDF (5 sections). |

### `audit_report.py`: bank-letterhead PDF

`build_pdf_report(report, source_path) → bytes`. Multi-page PDF with header letterhead, metadata table, colour-coded risk verdict box, sub-score breakdown table, evidence list, embedded forensic heatmaps. Built with ReportLab Platypus.

### `app.py`: Streamlit UI (6 tabs)

| Tab | Function |
|---|---|
| 1. Single-document analysis | Risk band, sub-score chart, ELA / copy-move / noise heatmaps, AI-gen FFT profile, ML/CNN predictions, downloadable JSON + PDF. |
| 2. Cross-document KYC | Upload 2–4 docs for one applicant; identity-field consistency table. |
| 3. Batch audit | Scan a folder; sortable risk table + CSV download. |
| 4. Compliance & Audit Pack | KYC validation, PII auto-redaction, RBI compliance PDF, **provenance ledger view with chain re-verify**. |
| 5. Live Tamper Forge Studio | Pick clean sample → choose technique + intensity → watch BankShield localise the tamper with per-detector scorecard + heatmap overlays. |
| 6. Fraud Ring Network | Upload N applicants → similarity graph with red ring members + ring summary cards. |

---

## Pipeline architecture

```
             ┌──────────────────────────────────────────────┐
             │       PRESENTATION (Streamlit, 6 tabs)       │
             └───────────────────────┬──────────────────────┘
                                     ▼
             ┌──────────────────────────────────────────────┐
             │                FORENSICS CORE                │
             │  ELA · Copy-move · Noise · EXIF · OCR · PDF  │
             │  + Random Forest (11-d feature vector)       │
             │  + MobileNetV2 CNN (CASIA v2 fine-tuned)     │
             │  + AI-Gen Detector (radial FFT)              │
             └───────────────────────┬──────────────────────┘
                                     ▼
             ┌──────────────────────────────────────────────┐
             │               ENSEMBLE FUSION                │
             │  weighted blend → RF overlay → CNN overlay   │
             │  → AI-gen overlay (capped at +20%)           │
             └───────────────────────┬──────────────────────┘
                                     ▼
        ┌──────────────────┬─────────┴────────┬──────────────────┐
        ▼                  ▼                  ▼                  ▼
┌────────────────┐ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ COMPLIANCE     │ │ FRAUD-RING     │ │ PROVENANCE     │ │ TAMPER FORGE   │
│ IFSC · PAN ·   │ │ NetworkX graph │ │ SHA-256 hash   │ │ Adversarial    │
│ Aadhaar · PII  │ │ clique detect  │ │ chain ledger   │ │ validation     │
└───────┬────────┘ └───────┬────────┘ └───────┬────────┘ └────────────────┘
        │                  │                  │
        └─────────┬────────┴──────────────────┘
                  ▼
┌────────────────────────────────────┐
│               OUTPUT               │
│  Risk band · Evidence list         │
│  Bank-letterhead audit PDF         │
│  RBI compliance PDF · Audit JSON   │
│  Tamper-evident ledger entry       │
└────────────────────────────────────┘
```

Default weight vector (`forensics.WEIGHTS`): `{ela: 0.20, copy_move: 0.25, noise: 0.20, exif: 0.15, text_rules: 0.20}`. The Random Forest probability, when available, is blended 50/50 with the rule-based score. The CNN probability is blended at a weight between 0.4 and 0.7 based on the CNN's reported validation AUC. The AI-gen probability is applied as a final overlay capped at +20%.

Band mapping: `0–0.30 LOW · 0.30–0.50 MEDIUM · 0.50–0.75 HIGH · 0.75+ CRITICAL`.

See [`ARCHITECTURE.md`](ARCHITECTURE.md) for the full reference.

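
The fusion and band rules above can be sketched as follows. This is a rough illustration rather than the logic in `forensics.py`: the weights and band thresholds are the ones quoted above, but the helper names (`fuse_scores`, `cnn_blend_weight`, `band_for`), the linear AUC-to-weight mapping, the additive form of the +20% overlay, and the treatment of boundary values as belonging to the higher band are all assumptions.

```python
from typing import Dict, Optional

# Weights as quoted for forensics.WEIGHTS.
WEIGHTS = {"ela": 0.20, "copy_move": 0.25, "noise": 0.20,
           "exif": 0.15, "text_rules": 0.20}


def clamp01(value: float) -> float:
    """Clip a value into the [0, 1] range."""
    return max(0.0, min(1.0, value))


def cnn_blend_weight(val_auc: float) -> float:
    """Assumed mapping: AUC 0.5 gives 0.4, AUC 1.0 gives 0.7, linear between."""
    return 0.4 + 0.3 * clamp01((val_auc - 0.5) / 0.5)


def fuse_scores(sub: Dict[str, float],
                rf_prob: Optional[float] = None,
                cnn_prob: Optional[float] = None,
                cnn_auc: Optional[float] = None,
                ai_prob: Optional[float] = None) -> float:
    """Weighted blend, then RF, CNN, and AI-gen overlays in that order."""
    score = sum(w * clamp01(sub.get(name, 0.0)) for name, w in WEIGHTS.items())
    if rf_prob is not None:
        # 50/50 blend with the rule-based score.
        score = 0.5 * score + 0.5 * clamp01(rf_prob)
    if cnn_prob is not None and cnn_auc is not None:
        w = cnn_blend_weight(cnn_auc)
        score = (1.0 - w) * score + w * clamp01(cnn_prob)
    if ai_prob is not None:
        # Assumed additive overlay: at most +0.20 when ai_prob is 1.0.
        score += 0.20 * clamp01(ai_prob)
    return clamp01(score)


def band_for(score: float) -> str:
    """Map a fused score to a risk band (boundaries go to the higher band)."""
    if score < 0.30:
        return "LOW"
    if score < 0.50:
        return "MEDIUM"
    if score < 0.75:
        return "HIGH"
    return "CRITICAL"
```

For example, sub-scores of 0.5 across all five detectors give a rule-based score of about 0.5; a Random Forest probability of 0.9 moves that to about 0.7, which this sketch maps to HIGH.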
---

## Detection coverage

**Image tampering**

- Copy-move forgery: ORB keypoint matching with distance filter
- Image splicing: block-wise noise inconsistency via Laplacian variance
- Text edits / amount tampering: Error Level Analysis
- Photoshop / GIMP / Snapseed edits: EXIF Software-tag string match
- Timestamp inconsistencies: DateTime vs DateTimeOriginal comparison

**AI-generated content**

- Sora / Midjourney / Stable Diffusion / DALL-E outputs: FFT spectral analysis
- High-frequency suppression (1/f decay deviation)
- Periodic checkerboard peaks from upsampling stride
- Non-standard JPEG quantization tables

**PDF tampering**

- Incremental edits: multi-`%%EOF` marker counting
- Consumer-tool fingerprints: iLovePDF, Smallpdf, PDFescape, Sejda, Foxit Phantom
- Producer/Creator mismatch: flags re-processed PDFs
- Inserted text: embedded-font count anomalies

**Cross-document & fraud-ring**

- Name / DOB / address fuzzy match across multiple documents
- Per-field weighted scoring with green / yellow / red status
- Cross-applicant similarity graph; cliques ≥3 = suspected fraud ring
- Ring bands: CRITICAL (≥5 members) / HIGH (3–4) / MEDIUM (2)

**KYC validation**

- IFSC: format + RBI bank-code list (36 banks)
- PAN: format + entity-type character (10 types per income-tax dept spec)
- Aadhaar: 12-digit format + UIDAI Verhoeff checksum

**PII redaction & audit**

- Aadhaar, PAN, IFSC, account-number masking
- PDF redaction with black rectangle overlays
- SHA-256 hash-chained provenance ledger (RBI Para 67 compliant)

---

## Running locally

```bash
git clone https://github.com/SpandanM110/Doc-Sentry.git
cd Doc-Sentry
pip install -r requirements.txt
streamlit run app.py
```

Browser opens at `http://localhost:8501`.

For full OCR text-rule support, install Tesseract OCR:

- Windows: https://github.com/UB-Mannheim/tesseract/wiki
- macOS: `brew install tesseract`
- Linux: `sudo apt-get install tesseract-ocr libtesseract-dev`

The app auto-detects Tesseract on standard Windows install paths; no environment variable is required.

---

## Deployment

The repository is deployment-ready for both **Streamlit Community Cloud** and **Hugging Face Spaces**. The YAML frontmatter at the top of this README configures the HF Space; `packages.txt` ensures Tesseract is installed on the build VM; `requirements.txt` covers Python dependencies.

Live deployment: https://huggingface.co/spaces/SpandanM110/DocSentry

---

## Training your own model

Drop labelled data into `data/images/originals/` and `data/images/tampered/`, open `docsentry_master.ipynb`, and run section 6. A Random Forest auto-trains on whatever you put there and saves to `models/forgery_rf.joblib`. The Streamlit app picks it up automatically on the next restart.

For a CNN upgrade, set `TRAIN_CNN = True` in section 7 and run on a Colab T4 GPU (free tier). This saves `models/forgery_cnn.keras` and `models/forgery_cnn.meta.json`. The app loads the model lazily on first request.

---

## Dependencies

OpenCV (cv2), Pillow (PIL), scikit-image, scikit-learn, joblib, PyMuPDF (fitz), pdfplumber, pikepdf, pytesseract, python-dateutil, Streamlit, streamlit-drawable-canvas, ReportLab, NumPy, pandas, matplotlib, NetworkX. Optional: TensorFlow (only required for the CNN path).

All pip-installable. No GPU required for the default pipeline.

---

## License

MIT; see `LICENSE`.

The MIT license covers the source code in this repository. Third-party datasets and pretrained models bundled or referenced (CASIA v2, IDRBT cheque dataset, AgamiAI Indian Bank Statements, MobileNetV2 ImageNet weights) are governed by their own terms; those notices are reproduced in `LICENSE` below the MIT block.

---

## Acknowledgements

- **AgamiAI Indian Bank Statements** (Hugging Face), Apache 2.0
- **IDRBT Cheque Image Dataset**, Institute for Development and Research in Banking Technology, India
- **CASIA v2** image tampering dataset, Chinese Academy of Sciences
- **MICC-F220** copy-move benchmark, University of Florence
- **CoMoFoD** dataset, University of Zagreb
- **Tobacco-3482** document corpus, University of Maryland