metadata
title: DocSentry
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.32.0
app_file: app.py
pinned: false
license: mit
short_description: Document forensics + fraud-ring detection for Indian banks

BankShield

Real-Time Document Forensics, AI-Generated Forgery Detection, and Cross-Applicant Fraud-Ring Intelligence for Indian Bank Underwriting.

BankShield catches tampered, forged, and AI-generated documents the moment they reach the underwriter, and surfaces organised fraud rings that span multiple applicants. Six independent detection layers fuse into a single calibrated risk score, with explainable evidence, tamper-evident audit trails, and RBI-format compliance reports out of the box.

100% open source. No paid APIs. No external LLM calls. CPU-only by default. Runs locally inside the bank's perimeter, so PII never leaves.


The six pillars

| Pillar | Module | What it does |
| --- | --- | --- |
| Image Forensics | forensics.py | ELA, copy-move (ORB), Laplacian noise inconsistency, EXIF audit |
| PDF Structural Audit | forensics.py | EOF marker counting, producer/creator drift, embedded-font anomalies, consumer-tool fingerprints |
| OCR + Financial Rules | forensics.py | Tesseract OCR + IFSC / PAN / Aadhaar / date monotonicity / amount sanity |
| AI-Generated Detection (new) | ai_detector.py | Radial FFT spectral analysis; catches Sora / Midjourney / Stable Diffusion outputs |
| Fraud Ring Network (new) | fraud_ring.py | NetworkX similarity graph across applicants; clique discovery flags organised fraud rings |
| Provenance Ledger (new) | provenance.py | SHA-256 hash chain over every analysis; O(N) verifiable; RBI Para 67 compliant |

Plus the Live Tamper Forge Studio (tampering.py), an adversarial-validation harness built directly into the dashboard.


Repository layout

Doc-Sentry/
├── app.py                       Streamlit web UI (6 tabs)
├── forensics.py                 Core detection engine + ensemble fusion
├── ai_detector.py               AI-generated forgery detector (FFT spectral)
├── fraud_ring.py                Cross-applicant similarity graph + clique detection
├── provenance.py                Tamper-evident SHA-256 hash chain
├── tampering.py                 Forge Studio adversarial harness
├── compliance.py                KYC validators, PII redaction, RBI report builder
├── audit_report.py              Bank-letterhead PDF report builder
├── docsentry_master.ipynb       Single source-of-truth Jupyter notebook
│
├── requirements.txt             Python dependencies
├── packages.txt                 System packages (Tesseract) for Streamlit Cloud / HF Spaces
├── .streamlit/config.toml       Streamlit theme + server config
│
├── sample_data/                 26 demo files for the live app
│   ├── originals/               12 genuine documents
│   ├── tampered/                12 tampered documents
│   └── pdfs/                    2 PDFs (1 genuine, 1 tampered)
│
├── models/                      Trained model artefacts
│   ├── forgery_rf.joblib        Random Forest classifier
│   └── forgery_cnn.keras        MobileNetV2 fine-tuned on CASIA v2 (optional)
│
├── ARCHITECTURE.md              Full architecture reference
├── SUBMISSION.md                Hackathon submission packet
├── BankShield_Pitch.pptx        Pitch deck (15 slides)
├── README.md  LICENSE
└── data/                        (gitignored) full training data + downloaded datasets

Module reference

forensics.py: detection engine

The core analytical module. Stateless functions; all logic is independently testable.

| Function | Returns | Description |
| --- | --- | --- |
| analyse_document(path) | dict | End-to-end pipeline. Auto-detects type, runs all relevant detectors, blends Random Forest + CNN + AI-gen predictions, auto-logs to provenance ledger. Primary entry point. |
| score_image(path) | (float, dict, list) | Composite forensic score for an image. Returns total, sub-scores by detector, and EXIF flags. |
| error_level_analysis(path, quality=90) | (PIL.Image, float) | ELA visualisation + scalar suspicion score. |
| copy_move_detect(path) | (np.ndarray, int, list) | ORB-based copy-move detection. Returns annotated viz, match count, raw matches. |
| noise_inconsistency(path, block=32) | (np.ndarray, float) | Per-block Laplacian variance heatmap + outlier ratio. |
| exif_sanity(path) | list of str | EXIF audit: missing EXIF, editor signatures, timestamp inconsistencies. |
| pdf_structural_audit(path) | dict | %%EOF markers, producer/creator drift, consumer-tool fingerprints. |
| pdf_font_audit(path) | dict | Embedded font listing + count anomalies. |
| ocr_text(path) | str | Tesseract OCR with auto-fallback. |
| text_rule_checks(text) | dict | Date monotonicity, amount sanity, IFSC format, account-number patterns. |
| extract_features(path) | dict | 11-feature vector for the Random Forest. |
| predict_with_model(path) | dict / None | Random Forest tamper probability + verdict. |
| predict_with_cnn(path) | dict / None | MobileNetV2 CNN inference (lazy-loaded). |
| extract_identity_fields(path) | (dict, str) | Pulls name, DOB, address, IFSC, account, amounts. |
| cross_doc_consistency(paths) | dict | Per-field similarity across 2+ documents. |
| generate_insights(score, sub, flags) | dict | Converts numeric scores into underwriter-readable bullets + recommended action. |
| band(score) | str | Maps a float to LOW / MEDIUM / HIGH / CRITICAL. |
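
As an illustration of one of the text rules listed for text_rule_checks, here is a minimal sketch of a date-monotonicity check. It assumes the statement lists transactions oldest-first and that dates have already been extracted as DD/MM/YYYY strings; the function name date_monotonicity and the fmt parameter are hypothetical and are not taken from forensics.py.

```python
from datetime import datetime


def date_monotonicity(date_strings, fmt="%d/%m/%Y"):
    """Return the indices of dates that are earlier than the date before them.

    Assumption: rows are expected in chronological order, so any index
    returned here marks a row that may have been inserted or edited.
    """
    parsed = [datetime.strptime(s, fmt) for s in date_strings]
    return [i for i in range(1, len(parsed)) if parsed[i] < parsed[i - 1]]


rows = ["01/04/2024", "03/04/2024", "02/04/2024", "05/04/2024"]
violations = date_monotonicity(rows)
```

A statement printed newest-first would need the comparison reversed, so a real check would first have to detect the ordering.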

ai_detector.py: AI-generated forgery detection

| Function | Description |
| --- | --- |
| detect_ai_generated(path) | Full pipeline; returns probability + verdict + flags + FFT profile. |
| radial_fft_profile(gray) | Radially-averaged log-magnitude FFT spectrum. |
| high_freq_attenuation(profile) | Smoothness score: low for real scans, high for AI outputs. |
| spectral_peak_score(profile) | Counts checkerboard-stride peaks in the high-frequency band. |
| jpeg_quantization_check(path) | Inspects JPEG quantization tables for synthetic-media signatures. |

The AI-generation probability is blended into the main risk score as an overlay capped at +20%, so AI-gen signals surface synthetic media without dominating the classical detectors.

fraud_ring.py: cross-applicant fraud-ring detection

| Function | Description |
| --- | --- |
| extract_applicant_fields(path) | OCR + regex pull of name / DOB / address / phone / IFSC / account / employer. |
| compare_applicants(a, b) | Per-field similarity + weighted score. |
| build_fraud_graph(applicants) | NetworkX similarity graph (edges weighted by shared signals). |
| detect_rings(G, min_size=3) | Connected components above threshold, reported as suspected fraud rings. |
| visualize_graph(G, rings) | Force-directed graph with ring members in red. |
| fraud_summary(G, rings, applicants) | Structured summary for the Streamlit UI. |
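
The idea behind compare_applicants and detect_rings can be sketched with the standard library alone. This is not the project's code: it swaps NetworkX for a small union-find, and the field weights, the 0.5 threshold, and the helper field_similarity are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical weights; the real per-field weights are not documented here.
FIELD_WEIGHTS = {"phone": 0.3, "account": 0.3, "address": 0.2,
                 "employer": 0.1, "name": 0.1}


def field_similarity(a, b):
    """Fuzzy similarity in [0, 1]; a missing value never matches."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare_applicants(a, b):
    """Weighted sum of per-field similarities between two applicant dicts."""
    return sum(w * field_similarity(a.get(f), b.get(f))
               for f, w in FIELD_WEIGHTS.items())


def detect_rings(applicants, threshold=0.5, min_size=3):
    """Link applicants scoring >= threshold, then return the connected
    components with at least min_size members."""
    parent = {name: name for name in applicants}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in combinations(applicants, 2):
        if compare_applicants(applicants[u], applicants[v]) >= threshold:
            parent[find(u)] = find(v)

    groups = {}
    for name in applicants:
        groups.setdefault(find(name), set()).add(name)
    return [g for g in groups.values() if len(g) >= min_size]


applicants = {
    "A": {"name": "Ravi Kumar", "phone": "9876543210", "account": "123456789012"},
    "B": {"name": "Sunil Shah", "phone": "9876543210", "account": "123456789012"},
    "C": {"name": "Anil Bose", "phone": "9876543210", "account": "123456789012"},
    "D": {"name": "Meera Iyer", "phone": "1111111111", "account": "000000000000"},
}
rings = detect_rings(applicants)
```

Here A, B, and C share a phone and an account number, so they form one component; D stays outside it. Connected components are looser than cliques: a chain A-B-C counts as a ring even if A and C are not directly similar.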

provenance.py: tamper-evident audit ledger

| Function | Description |
| --- | --- |
| log_analysis(...) | Appends a SHA-256 hash-chained record to the SQLite ledger. |
| verify_chain() | Walks every record in O(N); pinpoints the first broken record. |
| chain_stats() | Count, first/last timestamps, breakdown by risk band, chain status. |
| fetch_ledger(limit) | Returns the latest N entries. |
| ledger_dataframe(limit) | Pandas DataFrame view (for Streamlit display). |

Each record's record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash). Because every hash covers the previous record's hash, a retroactive edit to any record no longer matches its stored hash and verify_chain() reports it.
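
The chaining rule above can be sketched as follows. This is a minimal in-memory version, not the SQLite-backed ledger: the all-zero genesis hash, the exact field serialisation, and the append_record helper are assumptions.

```python
import hashlib

GENESIS = "0" * 64  # assumed prev_hash for the very first record


def record_hash(timestamp, doc_sha256, risk_band, risk_score, prev_hash):
    """SHA-256 over the pipe-joined fields, including the previous hash."""
    payload = "|".join(
        [timestamp, doc_sha256, risk_band, f"{risk_score:.4f}", prev_hash]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(ledger, timestamp, doc_sha256, risk_band, risk_score):
    """Append a record whose hash is chained to the last record in ledger."""
    prev_hash = ledger[-1]["record_hash"] if ledger else GENESIS
    rec = {
        "timestamp": timestamp,
        "doc_sha256": doc_sha256,
        "risk_band": risk_band,
        "risk_score": risk_score,
        "prev_hash": prev_hash,
    }
    rec["record_hash"] = record_hash(
        timestamp, doc_sha256, risk_band, risk_score, prev_hash
    )
    ledger.append(rec)
    return rec


def verify_chain(ledger):
    """Walk the ledger once; return the index of the first broken record,
    or None if every link is intact."""
    prev = GENESIS
    for i, rec in enumerate(ledger):
        expected = record_hash(
            rec["timestamp"], rec["doc_sha256"], rec["risk_band"],
            rec["risk_score"], prev,
        )
        if rec["prev_hash"] != prev or rec["record_hash"] != expected:
            return i
        prev = rec["record_hash"]
    return None


ledger = []
for n, (band_name, score) in enumerate([("LOW", 0.12), ("HIGH", 0.61), ("MEDIUM", 0.40)]):
    doc_hash = hashlib.sha256(f"doc-{n}".encode()).hexdigest()
    append_record(ledger, f"2024-04-0{n + 1}T10:00:00Z", doc_hash, band_name, score)

intact = verify_chain(ledger)
ledger[1]["risk_band"] = "LOW"  # simulate a retroactive edit
broken_at = verify_chain(ledger)
```

Verification is a single pass, which is where the O(N) cost comes from. An attacker who can rewrite every later record as well would still produce a consistent chain, so the latest hash has to be anchored somewhere the attacker cannot reach.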

tampering.py: adversarial Forge Studio

Exposes tamper_copy_move, tamper_text_edit, tamper_splice, tamper_compression, tamper_metadata_strip, tamper_custom_region, tamper_chain, annotate_before_after, overlay_heatmap_on_image, and detector_scorecard. Tab 5 uses them to apply controlled forgeries and immediately re-run detection.

compliance.py: KYC + regulatory

| Function | Description |
| --- | --- |
| validate_ifsc(code) | Format check + RBI bank-code lookup (36 banks). |
| validate_pan(code) | Format + entity-type character validation. |
| validate_aadhaar(num) | 12-digit format + UIDAI Verhoeff checksum. |
| redact_text(text) | Masks IFSC, PAN, Aadhaar, account numbers. |
| redact_pdf(input_path, output_path) | PII black-box overlays via PyMuPDF text-bbox. |
| extract_pii_fields(path) | Pulls all PII candidates from any document. |
| build_compliance_report(...) | RBI Master-Direction-format audit PDF (5 sections). |
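
A minimal sketch of the format half of validate_ifsc and validate_pan follows. It covers only the publicly known layouts (IFSC: four letters, a zero, six alphanumerics; PAN: five letters, four digits, one letter, with the fourth letter encoding the holder type). The 36-bank RBI lookup is not reproduced, and the function names with the _format and _entity suffixes are hypothetical.

```python
import re

IFSC_RE = re.compile(r"[A-Z]{4}0[A-Z0-9]{6}")
PAN_RE = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")

# Fourth PAN character: the ten holder types.
PAN_ENTITY_TYPES = {
    "P": "Individual",
    "C": "Company",
    "H": "Hindu Undivided Family",
    "F": "Firm",
    "A": "Association of Persons",
    "T": "Trust",
    "B": "Body of Individuals",
    "L": "Local Authority",
    "J": "Artificial Juridical Person",
    "G": "Government",
}


def validate_ifsc_format(code):
    """True if code has the 11-character IFSC layout (format check only)."""
    return IFSC_RE.fullmatch(code.strip().upper()) is not None


def pan_entity(code):
    """Return the holder type for a well-formed PAN, or None if invalid."""
    pan = code.strip().upper()
    if PAN_RE.fullmatch(pan) is None:
        return None
    return PAN_ENTITY_TYPES.get(pan[3])


ifsc_ok = validate_ifsc_format("SBIN0001234")
holder = pan_entity("ABCPE1234F")
```

A code that passes these checks is only well-formed; it may still not correspond to a real branch or taxpayer.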

audit_report.py: bank-letterhead PDF

build_pdf_report(report, source_path) → bytes. Multi-page PDF with header letterhead, metadata table, colour-coded risk verdict box, sub-score breakdown table, evidence list, embedded forensic heatmaps. Built with ReportLab Platypus.

app.py: Streamlit UI (6 tabs)

| Tab | Function |
| --- | --- |
| 1. Single-document analysis | Risk band, sub-score chart, ELA / copy-move / noise heatmaps, AI-gen FFT profile, ML/CNN predictions, downloadable JSON + PDF. |
| 2. Cross-document KYC | Upload 2–4 docs for one applicant; identity-field consistency table. |
| 3. Batch audit | Scan a folder; sortable risk table + CSV download. |
| 4. Compliance & Audit Pack | KYC validation, PII auto-redaction, RBI compliance PDF, provenance ledger view with chain re-verify. |
| 5. Live Tamper Forge Studio | Pick clean sample → choose technique + intensity → watch BankShield localise the tamper with per-detector scorecard + heatmap overlays. |
| 6. Fraud Ring Network | Upload N applicants → similarity graph with red ring members + ring summary cards. |

Pipeline architecture

                    ┌────────────────────────────────────────┐
                    │   PRESENTATION (Streamlit, 6 tabs)     │
                    └──────────────────┬─────────────────────┘
                                       ▼
              ┌──────────────────────────────────────────────┐
              │   FORENSICS CORE                             │
              │   ELA · Copy-move · Noise · EXIF · OCR · PDF │
              │   + Random Forest (11-d feature vector)      │
              │   + MobileNetV2 CNN (CASIA v2 fine-tuned)    │
              │   + AI-Gen Detector (radial FFT)             │
              └──────────────────┬───────────────────────────┘
                                 ▼
              ┌──────────────────────────────────────────────┐
              │   ENSEMBLE FUSION                            │
              │   weighted blend → RF overlay → CNN overlay  │
              │   → AI-gen overlay (capped at +20%)          │
              └──────────────────┬───────────────────────────┘
                                 ▼
        ┌──────────────────┬─────┴─────┬──────────────────┐
        ▼                  ▼           ▼                  ▼
┌──────────────┐  ┌────────────────┐ ┌──────────────┐ ┌────────────────┐
│ COMPLIANCE   │  │ FRAUD-RING     │ │ PROVENANCE   │ │ TAMPER FORGE   │
│ IFSC · PAN · │  │ NetworkX graph │ │ SHA-256 hash │ │ Adversarial    │
│ Aadhaar · PII│  │ clique detect  │ │ chain ledger │ │ validation     │
└──────┬───────┘  └────────┬───────┘ └──────┬───────┘ └────────────────┘
       │                   │                │
       └────────────┬──────┴────────────────┘
                    ▼
         ┌────────────────────────────────────┐
         │   OUTPUT                           │
         │   Risk band · Evidence list        │
         │   Bank-letterhead audit PDF        │
         │   RBI compliance PDF · Audit JSON  │
         │   Tamper-evident ledger entry      │
         └────────────────────────────────────┘

Default weight vector (forensics.WEIGHTS): {ela: 0.20, copy_move: 0.25, noise: 0.20, exif: 0.15, text_rules: 0.20}. The Random Forest probability, when available, is blended 50/50 with the rule-based score. The CNN probability is blended at a weight between 0.4 and 0.7 based on the CNN's reported validation AUC. The AI-gen probability is applied as a final overlay capped at +20%.
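
The blending order described above can be sketched as below. Only the weight vector, the 50/50 Random Forest blend, the 0.4 to 0.7 CNN weight range, and the +20% cap come from this README. The linear AUC-to-weight mapping, the additive form of the AI-gen overlay, and the names fuse and cnn_weight are assumptions.

```python
WEIGHTS = {"ela": 0.20, "copy_move": 0.25, "noise": 0.20,
           "exif": 0.15, "text_rules": 0.20}


def cnn_weight(val_auc):
    """Assumed mapping: AUC 0.5 gives weight 0.4, AUC 1.0 gives 0.7,
    linear in between and clamped outside that range."""
    auc = min(max(val_auc, 0.5), 1.0)
    return 0.4 + (auc - 0.5) / 0.5 * 0.3


def fuse(sub_scores, rf_prob=None, cnn_prob=None, cnn_val_auc=None, ai_prob=None):
    """Weighted rule score, then RF, CNN, and AI-gen overlays in that order."""
    score = sum(w * sub_scores.get(name, 0.0) for name, w in WEIGHTS.items())
    if rf_prob is not None:
        score = 0.5 * score + 0.5 * rf_prob          # 50/50 blend
    if cnn_prob is not None and cnn_val_auc is not None:
        w = cnn_weight(cnn_val_auc)
        score = (1.0 - w) * score + w * cnn_prob
    if ai_prob is not None:
        score += 0.20 * min(max(ai_prob, 0.0), 1.0)  # overlay capped at +0.20
    return min(max(score, 0.0), 1.0)


sub = {name: 0.5 for name in WEIGHTS}
rules_only = fuse(sub)
with_rf = fuse(sub, rf_prob=0.9)
with_ai = fuse(sub, ai_prob=1.0)
```

With every sub-score at 0.5, the rule score is 0.5, the RF blend with probability 0.9 gives 0.7, and a certain AI-gen detection alone lifts 0.5 to 0.7.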

Band mapping: 0–0.30 LOW · 0.30–0.50 MEDIUM · 0.50–0.75 HIGH · 0.75+ CRITICAL.
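
A compact way to express this mapping is a sorted list of edges and a bisection lookup. The ranges above share their endpoints, so this sketch assumes each lower bound is inclusive (a score of exactly 0.30 is MEDIUM), which matches the "0.75+" wording for CRITICAL; the actual band() in forensics.py may treat boundaries differently.

```python
from bisect import bisect_right

BAND_EDGES = [0.30, 0.50, 0.75]
BAND_NAMES = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def band(score):
    """Map a risk score to its band; lower bounds are treated as inclusive."""
    return BAND_NAMES[bisect_right(BAND_EDGES, score)]


labels = [band(s) for s in (0.10, 0.30, 0.62, 0.75, 0.99)]
```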

See ARCHITECTURE.md for the full reference.


Detection coverage

Image tampering

  • Copy-move forgery: ORB keypoint matching with distance filter
  • Image splicing: block-wise noise inconsistency via Laplacian variance
  • Text edits / amount tampering: Error Level Analysis
  • Photoshop / GIMP / Snapseed edits: EXIF Software-tag string match
  • Timestamp inconsistencies: DateTime vs DateTimeOriginal comparison

AI-generated content

  • Sora / Midjourney / Stable Diffusion / DALL-E outputs: FFT spectral analysis
  • High-frequency suppression (1/f decay deviation)
  • Periodic checkerboard peaks from upsampling stride
  • Non-standard JPEG quantization tables

PDF tampering

  • Incremental edits: multi-%%EOF marker counting
  • Consumer-tool fingerprints: iLovePDF, Smallpdf, PDFescape, Sejda, Foxit Phantom
  • Producer/Creator mismatch: flags re-processed PDFs
  • Inserted text: embedded-font count anomalies
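
The %%EOF counting idea from the first bullet can be sketched on raw bytes. A PDF saved incrementally appends a new trailer ending in another %%EOF marker, so more than one marker suggests the file was modified after it was first written. The function names here are hypothetical, and the synthetic byte strings stand in for real files.

```python
def count_eof_markers(pdf_bytes):
    """Count %%EOF markers in the raw bytes of a PDF."""
    return pdf_bytes.count(b"%%EOF")


def incremental_update_flag(pdf_bytes):
    """Flag files that carry more than one %%EOF marker."""
    n = count_eof_markers(pdf_bytes)
    return {"eof_markers": n, "suspicious": n > 1}


original = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"
edited = original + b"2 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"

clean_result = incremental_update_flag(original)
edited_result = incremental_update_flag(edited)
```

For a file on disk, pathlib.Path(path).read_bytes() supplies the input. Legitimate workflows such as digital signing and form filling also append incremental updates, so a count above one is a hint to weigh with the other signals, not proof of tampering.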

Cross-document & fraud-ring

  • Name / DOB / address fuzzy match across multiple documents
  • Per-field weighted scoring with green / yellow / red status
  • Cross-applicant similarity graph; cliques ≥3 = suspected fraud ring
  • Ring bands: CRITICAL (≥5 members) / HIGH (3–4) / MEDIUM (2)

KYC validation

  • IFSC: format + RBI bank-code list (36 banks)
  • PAN: format + entity-type character (10 types per income-tax dept spec)
  • Aadhaar: 12-digit format + UIDAI Verhoeff checksum
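
The Verhoeff checksum behind the Aadhaar check can be sketched as follows. The three tables are the standard ones for the algorithm; the function names and the decision to ignore spaces are assumptions, and this is not the code in compliance.py.

```python
import re

# Multiplication table of the dihedral group D5.
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
# Position-dependent permutation table.
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
# Inverse table.
_INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_check_digit(base):
    """Compute the check digit to append to a string of digits."""
    c = 0
    for i, ch in enumerate(reversed(base)):
        c = _D[c][_P[(i + 1) % 8][int(ch)]]
    return _INV[c]


def verhoeff_valid(number):
    """True if the digit string, check digit included, passes Verhoeff."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0


def validate_aadhaar(num):
    """12 ASCII digits (spaces ignored) with a valid Verhoeff check digit."""
    digits = num.replace(" ", "")
    return re.fullmatch(r"[0-9]{12}", digits) is not None and verhoeff_valid(digits)


base = "23456789012"  # arbitrary 11 digits, not a real Aadhaar number
candidate = base + str(verhoeff_check_digit(base))
is_valid = validate_aadhaar(candidate)
```

Verhoeff detects every single-digit error and every adjacent transposition, which is why changing any one digit of a valid number makes validation fail.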

PII redaction & audit

  • Aadhaar, PAN, IFSC, account-number masking
  • PDF redaction with black rectangle overlays
  • SHA-256 hash-chained provenance ledger (RBI Para 67 compliant)
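
Text-level masking of the identifiers listed above can be sketched with regular expressions. The patterns and the replacement labels are assumptions, in particular the 9 to 18 digit account-number rule, and they do not reproduce redact_text in compliance.py.

```python
import re

# Order matters: specific layouts run before the generic digit-run rule.
PATTERNS = [
    ("PAN", re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")),
    ("IFSC", re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b")),
    ("AADHAAR", re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b")),
    ("ACCOUNT", re.compile(r"\b\d{9,18}\b")),  # hypothetical length range
]


def redact_text(text):
    """Replace each recognised identifier with a bracketed label."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text


sample = "PAN ABCPE1234F, IFSC SBIN0001234, Aadhaar 1234 5678 9012, A/c 50100123456789"
masked = redact_text(sample)
```

A 12-digit account number is indistinguishable from an Aadhaar number at this level and would be labelled AADHAAR; either way the digits are masked.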

Running locally

git clone https://github.com/SpandanM110/Doc-Sentry.git
cd Doc-Sentry
pip install -r requirements.txt
streamlit run app.py

Browser opens at http://localhost:8501.

For full OCR text-rule support, install Tesseract OCR. The app auto-detects Tesseract on standard Windows install paths; no environment variable is required.


Deployment

The repository is deployment-ready for both Streamlit Community Cloud and Hugging Face Spaces. The YAML frontmatter at the top of this README configures the HF Space; packages.txt ensures Tesseract is installed on the build VM; requirements.txt covers Python dependencies.

Live deployment: https://huggingface.co/spaces/SpandanM110/DocSentry


Training your own model

Drop labelled data into data/images/originals/ and data/images/tampered/, open docsentry_master.ipynb, and run section 6. A Random Forest trains automatically on whatever you put there and is saved to models/forgery_rf.joblib. The Streamlit app picks it up automatically on the next restart.

For a CNN upgrade, set TRAIN_CNN = True in section 7 and run on a Colab T4 GPU (free tier). This saves models/forgery_cnn.keras and models/forgery_cnn.meta.json, and the app loads the model lazily on first request.


Dependencies

OpenCV (cv2), Pillow (PIL), scikit-image, scikit-learn, joblib, PyMuPDF (fitz), pdfplumber, pikepdf, pytesseract, python-dateutil, Streamlit, streamlit-drawable-canvas, ReportLab, NumPy, pandas, matplotlib, NetworkX. Optional: TensorFlow (only required for the CNN path).

All pip-installable. No GPU required for the default pipeline.


License

MIT; see LICENSE. The MIT license covers the source code in this repository. Third-party datasets and pretrained models bundled or referenced (CASIA v2, IDRBT cheque dataset, AgamiAI Indian Bank Statements, MobileNetV2 ImageNet weights) are governed by their own terms; those notices are reproduced in LICENSE below the MIT block.


Acknowledgements

  • AgamiAI Indian Bank Statements (Hugging Face), Apache 2.0
  • IDRBT Cheque Image Dataset, Institute for Development and Research in Banking Technology, India
  • CASIA v2 image tampering dataset, Chinese Academy of Sciences
  • MICC-F220 copy-move benchmark, University of Florence
  • CoMoFoD dataset, University of Zagreb
  • Tobacco-3482 document corpus, University of Maryland