---
title: DocSentry
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.32.0
app_file: app.py
pinned: false
license: mit
short_description: Document forensics + fraud-ring detection for Indian banks
---

# BankShield

**Real-Time Document Forensics, AI-Generated Forgery Detection, and Cross-Applicant Fraud-Ring Intelligence for Indian Bank Underwriting.**

BankShield catches tampered, forged, and AI-generated documents the moment they reach the underwriter, and surfaces organised fraud rings that span multiple applicants. Six independent detection layers fuse into a single calibrated risk score, with explainable evidence, tamper-evident audit trails, and RBI-format compliance reports out of the box.

100% open source. No paid APIs. No external LLM calls. CPU-only by default. Runs locally inside the bank's perimeter, so PII never leaves.

- **Live demo:** https://huggingface.co/spaces/SpandanM110/DocSentry
- **Source:** https://github.com/SpandanM110/Doc-Sentry
- **Architecture reference:** see [`ARCHITECTURE.md`](ARCHITECTURE.md)

---

## The six pillars

| Pillar | Module | What it does |
|---|---|---|
| **Image Forensics** | `forensics.py` | ELA, copy-move (ORB), Laplacian noise inconsistency, EXIF audit |
| **PDF Structural Audit** | `forensics.py` | EOF marker counting, producer/creator drift, embedded-font anomalies, consumer-tool fingerprints |
| **OCR + Financial Rules** | `forensics.py` | Tesseract OCR + IFSC / PAN / Aadhaar / date monotonicity / amount sanity |
| **AI-Generated Detection** *(new)* | `ai_detector.py` | Radial FFT spectral analysis; catches Sora / Midjourney / Stable Diffusion outputs |
| **Fraud Ring Network** *(new)* | `fraud_ring.py` | NetworkX similarity graph across applicants; clique discovery flags organised fraud rings |
| **Provenance Ledger** *(new)* | `provenance.py` | SHA-256 hash chain over every analysis; O(N) verifiable; RBI Para 67 compliant |

Plus the **Live Tamper Forge Studio** (`tampering.py`), an adversarial-validation harness built directly into the dashboard.

---

## Repository layout

```
Doc-Sentry/
├── app.py                    Streamlit web UI (6 tabs)
├── forensics.py              Core detection engine + ensemble fusion
├── ai_detector.py            AI-generated forgery detector (FFT spectral)
├── fraud_ring.py             Cross-applicant similarity graph + clique detection
├── provenance.py             Tamper-evident SHA-256 hash chain
├── tampering.py              Forge Studio adversarial harness
├── compliance.py             KYC validators, PII redaction, RBI report builder
├── audit_report.py           Bank-letterhead PDF report builder
├── docsentry_master.ipynb    Single source-of-truth Jupyter notebook
│
├── requirements.txt          Python dependencies
├── packages.txt              System packages (Tesseract) for Streamlit Cloud / HF Spaces
├── .streamlit/config.toml    Streamlit theme + server config
│
├── sample_data/              26 demo files for the live app
│   ├── originals/            12 genuine documents
│   ├── tampered/             12 tampered documents
│   └── pdfs/                 2 PDFs (1 genuine, 1 tampered)
│
├── models/                   Trained model artefacts
│   ├── forgery_rf.joblib     Random Forest classifier
│   └── forgery_cnn.keras     MobileNetV2 fine-tuned on CASIA v2 (optional)
│
├── ARCHITECTURE.md           Full architecture reference
├── SUBMISSION.md             Hackathon submission packet
├── BankShield_Pitch.pptx     Pitch deck (15 slides)
├── README.md  LICENSE
└── data/                     (gitignored) full training data + downloaded datasets
```

---

## Module reference

### `forensics.py`: detection engine

The core analytical module. Stateless functions; all logic is independently testable.

| Function | Returns | Description |
|---|---|---|
| `analyse_document(path)` | dict | End-to-end pipeline. Auto-detects type, runs all relevant detectors, blends Random Forest + CNN + AI-gen predictions, auto-logs to provenance ledger. Primary entry point. |
| `score_image(path)` | (float, dict, list) | Composite forensic score for an image. Returns total, sub-scores by detector, and EXIF flags. |
| `error_level_analysis(path, quality=90)` | (PIL.Image, float) | ELA visualisation + scalar suspicion score. |
| `copy_move_detect(path)` | (np.ndarray, int, list) | ORB-based copy-move detection. Returns annotated viz, match count, raw matches. |
| `noise_inconsistency(path, block=32)` | (np.ndarray, float) | Per-block Laplacian variance heatmap + outlier ratio. |
| `exif_sanity(path)` | list of str | EXIF audit: missing EXIF, editor signatures, timestamp inconsistencies. |
| `pdf_structural_audit(path)` | dict | `%%EOF` markers, producer/creator drift, consumer-tool fingerprints. |
| `pdf_font_audit(path)` | dict | Embedded font listing + count anomalies. |
| `ocr_text(path)` | str | Tesseract OCR with auto-fallback. |
| `text_rule_checks(text)` | dict | Date monotonicity, amount sanity, IFSC format, account-number patterns. |
| `extract_features(path)` | dict | 11-feature vector for the Random Forest. |
| `predict_with_model(path)` | dict / None | Random Forest tamper probability + verdict. |
| `predict_with_cnn(path)` | dict / None | MobileNetV2 CNN inference (lazy-loaded). |
| `extract_identity_fields(path)` | (dict, str) | Pulls name, DOB, address, IFSC, account, amounts. |
| `cross_doc_consistency(paths)` | dict | Per-field similarity across 2+ documents. |
| `generate_insights(score, sub, flags)` | dict | Numeric → underwriter-readable bullets + recommended action. |
| `band(score)` | str | Maps a float to LOW / MEDIUM / HIGH / CRITICAL. |
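
The date-monotonicity rule listed for `text_rule_checks` can be illustrated with a small standalone sketch. It is not the module's code: the function name `find_date_regressions`, the `DD/MM/YYYY` pattern, and the assumption that statement rows run oldest to newest are all hypothetical choices made for the example.

```python
import re
from datetime import datetime

# Assumed date layout: DD/MM/YYYY, as commonly printed on Indian bank statements.
DATE_RE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")


def find_date_regressions(text):
    """Return (previous, current) pairs where a date is earlier than the one before it."""
    dates = []
    for raw in DATE_RE.findall(text):
        try:
            dates.append(datetime.strptime(raw, "%d/%m/%Y"))
        except ValueError:
            continue  # skip impossible dates such as 31/02/2024
    return [(prev, cur) for prev, cur in zip(dates, dates[1:]) if cur < prev]


ocr_output = "01/04/2024 NEFT 5,000\n03/04/2024 UPI 250\n02/04/2024 ATM 2,000"
regressions = find_date_regressions(ocr_output)
print(len(regressions))
```

A non-empty result only means the row order looks suspicious; a statement printed newest-first would need the comparison reversed.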

### `ai_detector.py`: AI-generated forgery detection

| Function | Description |
|---|---|
| `detect_ai_generated(path)` | Full pipeline: probability + verdict + flags + FFT profile. |
| `radial_fft_profile(gray)` | Radially-averaged log-magnitude FFT spectrum. |
| `high_freq_attenuation(profile)` | Smoothness score: low for real scans, high for AI outputs. |
| `spectral_peak_score(profile)` | Counts checkerboard-stride peaks in the high-frequency band. |
| `jpeg_quantization_check(path)` | Inspects JPEG quantization tables for synthetic-media signatures. |

Blended into the main risk score with a capped +20% overlay so AI-gen signals reliably surface synthetic media without dominating classical detectors.

### `fraud_ring.py`: cross-applicant fraud-ring detection

| Function | Description |
|---|---|
| `extract_applicant_fields(path)` | OCR + regex pull of name / DOB / address / phone / IFSC / account / employer. |
| `compare_applicants(a, b)` | Per-field similarity + weighted score. |
| `build_fraud_graph(applicants)` | NetworkX similarity graph (edges weighted by shared signals). |
| `detect_rings(G, min_size=3)` | Connected components at or above the size threshold, reported as suspected fraud rings. |
| `visualize_graph(G, rings)` | Force-directed graph with ring members in red. |
| `fraud_summary(G, rings, applicants)` | Structured summary for the Streamlit UI. |
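
The graph-then-components idea behind `build_fraud_graph` and `detect_rings` can be sketched without any dependencies. Everything below is illustrative: the field weights, the `0.5` edge threshold, and the names `similarity` and `find_rings` are assumptions, and the module itself is described as using NetworkX (whose `connected_components` performs the same traversal).

```python
from itertools import combinations

# Hypothetical per-field weights; the real weighting lives in fraud_ring.py.
FIELD_WEIGHTS = {"phone": 0.4, "address": 0.3, "employer": 0.3}


def similarity(a, b):
    """Sum the weights of fields that are present and identical in both applicants."""
    return sum(w for f, w in FIELD_WEIGHTS.items() if a.get(f) and a.get(f) == b.get(f))


def find_rings(applicants, threshold=0.5, min_size=3):
    """Link applicants whose similarity meets the threshold, then return
    connected components with at least min_size members."""
    ids = list(applicants)
    neighbours = {i: set() for i in ids}
    for i, j in combinations(ids, 2):
        if similarity(applicants[i], applicants[j]) >= threshold:
            neighbours[i].add(j)
            neighbours[j].add(i)

    seen, rings = set(), []
    for start in ids:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        if len(component) >= min_size:
            rings.append(sorted(component))
    return rings


applicants = {
    "A": {"phone": "111", "address": "12 MG Road", "employer": "Acme"},
    "B": {"phone": "111", "address": "12 MG Road", "employer": "Zenith"},
    "C": {"phone": "222", "address": "12 MG Road", "employer": "Zenith"},
    "D": {"phone": "333", "address": "9 Park St", "employer": "Orbit"},
}
rings = find_rings(applicants)
print(rings)
```

Note that A and C are not directly linked here (they share only an address), yet they land in the same ring through B. That is the difference between a connected component and a strict clique.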

### `provenance.py`: tamper-evident audit ledger

| Function | Description |
|---|---|
| `log_analysis(...)` | Appends a SHA-256 hash-chained record to the SQLite ledger. |
| `verify_chain()` | Walks every record in O(N); pinpoints the first broken record. |
| `chain_stats()` | Count, first/last timestamps, breakdown by risk band, chain status. |
| `fetch_ledger(limit)` | Returns the latest N entries. |
| `ledger_dataframe(limit)` | Pandas DataFrame view (for Streamlit display). |

Each record stores `record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)`, so a retroactive edit to any record invalidates the chain from that record onward.
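
A minimal in-memory sketch of that hash chain follows. It stands in for the SQLite-backed ledger rather than reproducing it: the all-zero genesis hash, the `|`-joined string encoding, the four-decimal score formatting, and the helper names `append_record` and `find_first_break` are assumptions made for illustration.

```python
import hashlib

GENESIS = "0" * 64  # assumed prev_hash for the very first record


def record_hash(timestamp, doc_sha256, risk_band, risk_score, prev_hash):
    """SHA-256 over the pipe-joined fields, in the order given in the formula above."""
    payload = "|".join([timestamp, doc_sha256, risk_band, f"{risk_score:.4f}", prev_hash])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(ledger, timestamp, doc_sha256, risk_band, risk_score):
    """Append a record whose hash commits to the previous record's hash."""
    prev = ledger[-1]["record_hash"] if ledger else GENESIS
    rec = {
        "timestamp": timestamp,
        "doc_sha256": doc_sha256,
        "risk_band": risk_band,
        "risk_score": risk_score,
        "prev_hash": prev,
        "record_hash": record_hash(timestamp, doc_sha256, risk_band, risk_score, prev),
    }
    ledger.append(rec)
    return rec


def find_first_break(ledger):
    """Walk the chain once (O(N)); return the index of the first bad record, or None."""
    prev = GENESIS
    for i, rec in enumerate(ledger):
        expected = record_hash(rec["timestamp"], rec["doc_sha256"],
                               rec["risk_band"], rec["risk_score"], prev)
        if rec["prev_hash"] != prev or rec["record_hash"] != expected:
            return i
        prev = rec["record_hash"]
    return None


ledger = []
append_record(ledger, "2024-04-01T10:00:00Z", "a" * 64, "LOW", 0.12)
append_record(ledger, "2024-04-01T10:05:00Z", "b" * 64, "HIGH", 0.61)
append_record(ledger, "2024-04-01T10:09:00Z", "c" * 64, "CRITICAL", 0.90)
intact = find_first_break(ledger)

ledger[1]["risk_band"] = "LOW"  # simulate a retroactive edit
broken_at = find_first_break(ledger)
```

Even if an attacker recomputed the edited record's own `record_hash`, the next record's `prev_hash` would no longer match, so the break is still detected one position later.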

### `tampering.py`: adversarial Forge Studio

`tamper_copy_move`, `tamper_text_edit`, `tamper_splice`, `tamper_compression`, `tamper_metadata_strip`, `tamper_custom_region`, `tamper_chain`, `annotate_before_after`, `overlay_heatmap_on_image`, `detector_scorecard`. Tab 5 uses these to apply controlled forgeries and immediately re-run detection.

### `compliance.py`: KYC + regulatory

| Function | Description |
|---|---|
| `validate_ifsc(code)` | Format check + RBI bank-code lookup (36 banks). |
| `validate_pan(code)` | Format + entity-type character validation. |
| `validate_aadhaar(num)` | 12-digit format + UIDAI Verhoeff checksum. |
| `redact_text(text)` | Masks IFSC, PAN, Aadhaar, account numbers. |
| `redact_pdf(input_path, output_path)` | PII black-box overlays via PyMuPDF text-bbox. |
| `extract_pii_fields(path)` | Pulls all PII candidates from any document. |
| `build_compliance_report(...)` | RBI Master-Direction-format audit PDF (5 sections). |
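
The Verhoeff checksum named for `validate_aadhaar` is a published algorithm, so it can be shown standalone. The lookup tables below are the standard ones; the function names and the 12-digit wrapper are hypothetical and are not taken from `compliance.py`.

```python
# Standard Verhoeff tables: D is the dihedral-group multiplication table,
# P the position-dependent permutation, INV the inverse of each digit.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_valid(number):
    """True if the digit string (check digit last) passes the Verhoeff check."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def verhoeff_check_digit(payload):
    """Compute the check digit to append to a digit string."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return str(INV[c])


def aadhaar_checksum_ok(num):
    """Hypothetical wrapper: exactly 12 digits plus a passing Verhoeff check."""
    return len(num) == 12 and num.isdigit() and verhoeff_valid(num)


synthetic = "23456789012"                      # made-up 11-digit payload, not a real ID
candidate = synthetic + verhoeff_check_digit(synthetic)
print(aadhaar_checksum_ok(candidate))
```

Verhoeff detects every single-digit error and every adjacent transposition, which is why it suits hand-typed or OCR-read identifiers.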

### `audit_report.py`: bank-letterhead PDF

`build_pdf_report(report, source_path) → bytes`. Multi-page PDF with header letterhead, metadata table, colour-coded risk verdict box, sub-score breakdown table, evidence list, embedded forensic heatmaps. Built with ReportLab Platypus.

### `app.py`: Streamlit UI (6 tabs)

| Tab | Function |
|---|---|
| 1. Single-document analysis | Risk band, sub-score chart, ELA / copy-move / noise heatmaps, AI-gen FFT profile, ML/CNN predictions, downloadable JSON + PDF. |
| 2. Cross-document KYC | Upload 2–4 docs for one applicant; identity-field consistency table. |
| 3. Batch audit | Scan a folder; sortable risk table + CSV download. |
| 4. Compliance & Audit Pack | KYC validation, PII auto-redaction, RBI compliance PDF, **provenance ledger view with chain re-verify**. |
| 5. Live Tamper Forge Studio | Pick clean sample → choose technique + intensity → watch BankShield localise the tamper with per-detector scorecard + heatmap overlays. |
| 6. Fraud Ring Network | Upload N applicants → similarity graph with red ring members + ring summary cards. |

---

## Pipeline architecture

```
               ┌────────────────────────────────────────┐
               │    PRESENTATION (Streamlit, 6 tabs)    │
               └────────────────────┬───────────────────┘
                                    ▼
             ┌──────────────────────────────────────────────┐
             │                FORENSICS CORE                │
             │  ELA · Copy-move · Noise · EXIF · OCR · PDF  │
             │  + Random Forest (11-d feature vector)       │
             │  + MobileNetV2 CNN (CASIA v2 fine-tuned)     │
             │  + AI-Gen Detector (radial FFT)              │
             └──────────────────────┬───────────────────────┘
                                    ▼
             ┌──────────────────────────────────────────────┐
             │               ENSEMBLE FUSION                │
             │  weighted blend → RF overlay → CNN overlay   │
             │  → AI-gen overlay (capped at +20%)           │
             └──────────────────────┬───────────────────────┘
                                    ▼
       ┌──────────────────┬─────────┴─────────┬──────────────────┐
       ▼                  ▼                   ▼                  ▼
┌──────────────┐  ┌────────────────┐   ┌──────────────┐  ┌────────────────┐
│ COMPLIANCE   │  │ FRAUD-RING     │   │ PROVENANCE   │  │ TAMPER FORGE   │
│ IFSC · PAN · │  │ NetworkX graph │   │ SHA-256 hash │  │ Adversarial    │
│ Aadhaar · PII│  │ clique detect  │   │ chain ledger │  │ validation     │
└──────┬───────┘  └───────┬────────┘   └──────┬───────┘  └────────────────┘
       │                  │                   │
       └──────────────────┴─────────┬─────────┘
                                    ▼
                  ┌──────────────────────────────────┐
                  │              OUTPUT              │
                  │  Risk band · Evidence list       │
                  │  Bank-letterhead audit PDF       │
                  │  RBI compliance PDF · Audit JSON │
                  │  Tamper-evident ledger entry     │
                  └──────────────────────────────────┘
```

Default weight vector (`forensics.WEIGHTS`): `{ela: 0.20, copy_move: 0.25, noise: 0.20, exif: 0.15, text_rules: 0.20}`. The Random Forest probability, when available, is blended 50/50 with the rule-based score. The CNN probability is blended at a weight between 0.4 and 0.7 based on the CNN's reported validation AUC. The AI-gen probability is applied as a final overlay capped at +20%.

Band mapping: `0–0.30 LOW · 0.30–0.50 MEDIUM · 0.50–0.75 HIGH · 0.75+ CRITICAL`.
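
The fusion order described above can be sketched as follows. Only the weight vector, the 50/50 Random Forest blend, the 0.4 to 0.7 CNN weight range, the +20% cap, and the band thresholds come from this README. The linear AUC-to-weight mapping, the additive reading of the "+20%" overlay, the inclusive lower band edges, and the function name `fuse` are assumptions; the real logic is in `forensics.py`.

```python
WEIGHTS = {"ela": 0.20, "copy_move": 0.25, "noise": 0.20, "exif": 0.15, "text_rules": 0.20}


def band(score):
    """Map a score in [0, 1] to a risk band (lower edges assumed inclusive)."""
    if score >= 0.75:
        return "CRITICAL"
    if score >= 0.50:
        return "HIGH"
    if score >= 0.30:
        return "MEDIUM"
    return "LOW"


def fuse(sub_scores, rf_prob=None, cnn_prob=None, cnn_auc=None, ai_prob=None):
    """Rule-based blend, then optional RF, CNN, and AI-gen overlays, in that order."""
    score = sum(w * sub_scores.get(name, 0.0) for name, w in WEIGHTS.items())

    if rf_prob is not None:
        score = 0.5 * score + 0.5 * rf_prob          # 50/50 blend with the Random Forest

    if cnn_prob is not None:
        # Assumed mapping: AUC 0.5 -> weight 0.4, AUC 1.0 -> weight 0.7, clamped.
        auc = 0.5 if cnn_auc is None else cnn_auc
        w = min(0.7, max(0.4, 0.4 + 0.6 * (auc - 0.5)))
        score = (1 - w) * score + w * cnn_prob

    if ai_prob is not None:
        score += min(0.20, 0.20 * ai_prob)           # additive overlay, capped at +0.20

    return min(1.0, max(0.0, score))


subs = {name: 0.5 for name in WEIGHTS}
base = fuse(subs)
with_rf = fuse(subs, rf_prob=1.0)
with_ai = fuse(subs, ai_prob=1.0)
with_cnn = fuse(subs, cnn_prob=1.0, cnn_auc=1.0)
print(band(with_rf))
```

Because the AI-gen term is added last and capped, a confident synthetic-media signal can lift a document by at most one band or so, but cannot by itself push a clean document to CRITICAL.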

See [`ARCHITECTURE.md`](ARCHITECTURE.md) for the full reference.

---

## Detection coverage

**Image tampering**

- Copy-move forgery: ORB keypoint matching with distance filter
- Image splicing: block-wise noise inconsistency via Laplacian variance
- Text edits / amount tampering: Error Level Analysis
- Photoshop / GIMP / Snapseed edits: EXIF Software-tag string match
- Timestamp inconsistencies: DateTime vs DateTimeOriginal comparison

**AI-generated content**

- Sora / Midjourney / Stable Diffusion / DALL-E outputs: FFT spectral analysis
- High-frequency suppression (1/f decay deviation)
- Periodic checkerboard peaks from upsampling stride
- Non-standard JPEG quantization tables

**PDF tampering**

- Incremental edits: multi-`%%EOF` marker counting
- Consumer-tool fingerprints: iLovePDF, Smallpdf, PDFescape, Sejda, Foxit Phantom
- Producer/Creator mismatch: flags re-processed PDFs
- Inserted text: embedded-font count anomalies
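
The multi-`%%EOF` check in the list above rests on a simple property of the PDF format: each incremental save appends a new trailer ending in its own `%%EOF`. A minimal sketch, with hypothetical function names and synthetic bytes standing in for a real file:

```python
def count_eof_markers(pdf_bytes):
    """Count end-of-file markers in the raw bytes of a PDF."""
    return pdf_bytes.count(b"%%EOF")


def looks_incrementally_updated(pdf_bytes):
    """More than one marker suggests the file was saved again after creation."""
    return count_eof_markers(pdf_bytes) > 1


# Synthetic stand-ins for file contents, not valid PDFs.
original = b"%PDF-1.7\n...objects...\ntrailer\n%%EOF\n"
edited = original + b"...appended objects...\ntrailer\n%%EOF\n"

print(count_eof_markers(original), count_eof_markers(edited))
```

Treat the count as a signal rather than proof. Legitimate workflows such as digital signing also append an incremental update, which is presumably why the audit combines it with producer/creator and font checks.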

**Cross-document & fraud-ring**

- Name / DOB / address fuzzy match across multiple documents
- Per-field weighted scoring with green / yellow / red status
- Cross-applicant similarity graph; cliques of ≥3 members = suspected fraud ring
- Ring bands: CRITICAL (≥5 members) / HIGH (3–4) / MEDIUM (2)

**KYC validation**

- IFSC: format + RBI bank-code list (36 banks)
- PAN: format + entity-type character (10 types per income-tax dept spec)
- Aadhaar: 12-digit format + UIDAI Verhoeff checksum
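
The format half of the IFSC and PAN checks in the list above can be sketched with two regular expressions. The patterns reflect the publicly documented layouts (IFSC: four letters, a zero, six alphanumerics; PAN: five letters, four digits, one letter, with the fourth letter encoding the holder type). The function names are hypothetical, and the 36-bank code lookup is omitted because that list is not reproduced in this README.

```python
import re

IFSC_RE = re.compile(r"^[A-Z]{4}0[A-Z0-9]{6}$")
PAN_RE = re.compile(r"^[A-Z]{5}[0-9]{4}[A-Z]$")

# Fourth PAN character: A, B, C, F, G, H, J, L, P, T (ten holder types).
PAN_ENTITY_TYPES = set("ABCFGHJLPT")


def ifsc_format_ok(code):
    """Format-only check; does not confirm the bank code exists."""
    return bool(IFSC_RE.match(code.strip().upper()))


def pan_format_ok(code):
    """Format check plus a valid entity-type letter in position four."""
    code = code.strip().upper()
    return bool(PAN_RE.match(code)) and code[3] in PAN_ENTITY_TYPES


print(ifsc_format_ok("SBIN0001234"), pan_format_ok("ABCPE1234F"))
```

Format checks like these are cheap pre-filters for OCR output; a code that passes can still be fictitious, so they complement rather than replace the lookup and checksum steps.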

**PII redaction & audit**

- Aadhaar, PAN, IFSC, account-number masking
- PDF redaction with black rectangle overlays
- SHA-256 hash-chained provenance ledger (RBI Para 67 compliant)

---

## Running locally

```bash
git clone https://github.com/SpandanM110/Doc-Sentry.git
cd Doc-Sentry
pip install -r requirements.txt
streamlit run app.py
```

The browser opens at `http://localhost:8501`.

For full OCR text-rule support, install Tesseract OCR:

- Windows: https://github.com/UB-Mannheim/tesseract/wiki
- macOS: `brew install tesseract`
- Linux: `sudo apt-get install tesseract-ocr libtesseract-dev`

The app auto-detects Tesseract on standard Windows install paths; no environment variable required.

---

## Deployment

The repository is deployment-ready for both **Streamlit Community Cloud** and **Hugging Face Spaces**. The YAML frontmatter at the top of this README configures the HF Space; `packages.txt` ensures Tesseract is installed on the build VM; `requirements.txt` covers Python dependencies.

Live deployment: https://huggingface.co/spaces/SpandanM110/DocSentry

---

## Training your own model

Drop labelled data into `data/images/originals/` and `data/images/tampered/`, open `docsentry_master.ipynb`, and run section 6. A Random Forest trains on whatever is in those folders and saves to `models/forgery_rf.joblib`. The Streamlit app picks it up automatically on the next restart.

For a CNN upgrade, set `TRAIN_CNN = True` in section 7 and run on a Colab T4 GPU (free tier). This saves `models/forgery_cnn.keras` and `models/forgery_cnn.meta.json`. The app loads the CNN lazily on first request.

---

## Dependencies

OpenCV (cv2), Pillow (PIL), scikit-image, scikit-learn, joblib, PyMuPDF (fitz), pdfplumber, pikepdf, pytesseract, python-dateutil, Streamlit, streamlit-drawable-canvas, ReportLab, NumPy, pandas, matplotlib, NetworkX. Optional: TensorFlow (only required for the CNN path).

All pip-installable. No GPU required for the default pipeline.

---

## License

MIT; see `LICENSE`. The MIT license covers the source code in this repository. Third-party datasets and pretrained models bundled or referenced (CASIA v2, IDRBT cheque dataset, AgamiAI Indian Bank Statements, MobileNetV2 ImageNet weights) are governed by their own terms; those notices are reproduced in `LICENSE` below the MIT block.

---

## Acknowledgements

- **AgamiAI Indian Bank Statements** (Hugging Face), Apache 2.0
- **IDRBT Cheque Image Dataset**, Institute for Development and Research in Banking Technology, India
- **CASIA v2** image tampering dataset, Chinese Academy of Sciences
- **MICC-F220** copy-move benchmark, University of Florence
- **CoMoFoD** dataset, University of Zagreb
- **Tobacco-3482** document corpus, University of Maryland