title: DocSentry
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.32.0
app_file: app.py
pinned: false
license: mit
short_description: Document forensics + fraud-ring detection for Indian banks
BankShield
Real-Time Document Forensics, AI-Generated Forgery Detection, and Cross-Applicant Fraud-Ring Intelligence for Indian Bank Underwriting.
BankShield catches tampered, forged, and AI-generated documents the moment they reach the underwriter, and surfaces organised fraud rings that span multiple applicants. Six independent detection layers fuse into a single calibrated risk score, with explainable evidence, tamper-evident audit trails, and RBI-format compliance reports out of the box.
100% open source. No paid APIs. No external LLM calls. CPU-only by default. Runs locally inside the bank's perimeter, so PII never leaves.
- Live demo: https://huggingface.co/spaces/SpandanM110/DocSentry
- Source: https://github.com/SpandanM110/Doc-Sentry
- Architecture reference: see ARCHITECTURE.md
The six pillars
| Pillar | Module | What it does |
|---|---|---|
| Image Forensics | `forensics.py` | ELA, copy-move (ORB), Laplacian noise inconsistency, EXIF audit |
| PDF Structural Audit | `forensics.py` | EOF marker counting, producer/creator drift, embedded-font anomalies, consumer-tool fingerprints |
| OCR + Financial Rules | `forensics.py` | Tesseract OCR + IFSC / PAN / Aadhaar / date monotonicity / amount sanity |
| AI-Generated Detection (new) | `ai_detector.py` | Radial FFT spectral analysis; targets Sora / Midjourney / Stable Diffusion outputs |
| Fraud Ring Network (new) | `fraud_ring.py` | NetworkX similarity graph across applicants; clique discovery flags organised fraud rings |
| Provenance Ledger (new) | `provenance.py` | SHA-256 hash chain over every analysis; O(N) verifiable; RBI Para 67 compliant |
Plus the Live Tamper Forge Studio (tampering.py), an adversarial-validation harness built directly into the dashboard.
Repository layout
Doc-Sentry/
├── app.py                    Streamlit web UI (6 tabs)
├── forensics.py              Core detection engine + ensemble fusion
├── ai_detector.py            AI-generated forgery detector (FFT spectral)
├── fraud_ring.py             Cross-applicant similarity graph + clique detection
├── provenance.py             Tamper-evident SHA-256 hash chain
├── tampering.py              Forge Studio adversarial harness
├── compliance.py             KYC validators, PII redaction, RBI report builder
├── audit_report.py           Bank-letterhead PDF report builder
├── docsentry_master.ipynb    Single source-of-truth Jupyter notebook
│
├── requirements.txt          Python dependencies
├── packages.txt              System packages (Tesseract) for Streamlit Cloud / HF Spaces
├── .streamlit/config.toml    Streamlit theme + server config
│
├── sample_data/              26 demo files for the live app
│   ├── originals/            12 genuine documents
│   ├── tampered/             12 tampered documents
│   └── pdfs/                 2 PDFs (1 genuine, 1 tampered)
│
├── models/                   Trained model artefacts
│   ├── forgery_rf.joblib     Random Forest classifier
│   └── forgery_cnn.keras     MobileNetV2 fine-tuned on CASIA v2 (optional)
│
├── ARCHITECTURE.md           Full architecture reference
├── SUBMISSION.md             Hackathon submission packet
├── BankShield_Pitch.pptx     Pitch deck (15 slides)
├── README.md  LICENSE
└── data/                     (gitignored) full training data + downloaded datasets
Module reference
forensics.py: detection engine
The core analytical module. Stateless functions; all logic is independently testable.
| Function | Returns | Description |
|---|---|---|
| `analyse_document(path)` | dict | End-to-end pipeline. Auto-detects type, runs all relevant detectors, blends Random Forest + CNN + AI-gen predictions, auto-logs to provenance ledger. Primary entry point. |
| `score_image(path)` | (float, dict, list) | Composite forensic score for an image. Returns total, sub-scores by detector, and EXIF flags. |
| `error_level_analysis(path, quality=90)` | (PIL.Image, float) | ELA visualisation + scalar suspicion score. |
| `copy_move_detect(path)` | (np.ndarray, int, list) | ORB-based copy-move detection. Returns annotated viz, match count, raw matches. |
| `noise_inconsistency(path, block=32)` | (np.ndarray, float) | Per-block Laplacian variance heatmap + outlier ratio. |
| `exif_sanity(path)` | list of str | EXIF audit: missing EXIF, editor signatures, timestamp inconsistencies. |
| `pdf_structural_audit(path)` | dict | %%EOF markers, producer/creator drift, consumer-tool fingerprints. |
| `pdf_font_audit(path)` | dict | Embedded font listing + count anomalies. |
| `ocr_text(path)` | str | Tesseract OCR with auto-fallback. |
| `text_rule_checks(text)` | dict | Date monotonicity, amount sanity, IFSC format, account-number patterns. |
| `extract_features(path)` | dict | 11-feature vector for the Random Forest. |
| `predict_with_model(path)` | dict / None | Random Forest tamper probability + verdict. |
| `predict_with_cnn(path)` | dict / None | MobileNetV2 CNN inference (lazy-loaded). |
| `extract_identity_fields(path)` | (dict, str) | Pulls name, DOB, address, IFSC, account, amounts. |
| `cross_doc_consistency(paths)` | dict | Per-field similarity across 2+ documents. |
| `generate_insights(score, sub, flags)` | dict | Numeric scores → underwriter-readable bullets + recommended action. |
| `band(score)` | str | Maps a float to LOW / MEDIUM / HIGH / CRITICAL. |
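To make the "date monotonicity" rule listed for `text_rule_checks` concrete, here is a minimal sketch of one way such a check could work on OCR'd statement text. The helper names (`extract_dates`, `dates_are_monotonic`), the DD/MM/YYYY pattern, and the "dates must never go backwards" rule are assumptions for illustration, not the module's actual implementation.

```python
import re
from datetime import date, datetime

# Assumed layout: DD/MM/YYYY or DD-MM-YYYY, as commonly printed on statements.
DATE_RE = re.compile(r"\b(\d{2})[/-](\d{2})[/-](\d{4})\b")


def extract_dates(text: str) -> list[date]:
    """Return every parseable date in reading order."""
    found = []
    for d, m, y in DATE_RE.findall(text):
        try:
            found.append(datetime(int(y), int(m), int(d)).date())
        except ValueError:
            continue  # impossible date (e.g. an OCR misread); skip it
    return found


def dates_are_monotonic(text: str) -> tuple[bool, list[int]]:
    """Return (ok, indices of dates that are earlier than their predecessor)."""
    dates = extract_dates(text)
    bad = [i for i in range(1, len(dates)) if dates[i] < dates[i - 1]]
    return (not bad, bad)
```

A transaction row whose date precedes the row above it is a cheap hint that a line was inserted or edited after the statement was generated; it is a signal to weigh alongside the other detectors rather than proof of tampering.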
ai_detector.py: AI-generated forgery detection
| Function | Description |
|---|---|
| `detect_ai_generated(path)` | Full pipeline: probability + verdict + flags + FFT profile. |
| `radial_fft_profile(gray)` | Radially-averaged log-magnitude FFT spectrum. |
| `high_freq_attenuation(profile)` | Smoothness score: low for real scans, high for AI outputs. |
| `spectral_peak_score(profile)` | Counts checkerboard-stride peaks in the high-frequency band. |
| `jpeg_quantization_check(path)` | Inspects JPEG quantization tables for synthetic-media signatures. |
The AI-gen probability is blended into the main risk score as an overlay capped at +20%, so AI-gen signals can surface synthetic media without dominating the classical detectors.
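A radially-averaged log-magnitude spectrum, as described for `radial_fft_profile`, can be sketched with NumPy as follows. This is an illustrative version under stated assumptions (integer-radius binning, `log1p` scaling, radii limited to the largest full circle), not necessarily how `ai_detector.py` computes it.

```python
import numpy as np


def radial_fft_profile(gray: np.ndarray) -> np.ndarray:
    """Average the log-magnitude 2-D spectrum over rings of equal radius."""
    spectrum = np.fft.fftshift(np.fft.fft2(gray.astype(np.float64)))
    log_mag = np.log1p(np.abs(spectrum))

    h, w = gray.shape
    cy, cx = h // 2, w // 2                  # DC term sits here after fftshift
    y, x = np.indices((h, w))
    radius = np.hypot(y - cy, x - cx).astype(np.int64)

    r_max = min(cy, cx)                      # largest radius with a full ring
    inside = radius < r_max
    ring_sum = np.bincount(radius[inside], weights=log_mag[inside], minlength=r_max)
    ring_count = np.bincount(radius[inside], minlength=r_max)
    return ring_sum / ring_count             # index 0 = DC, last = highest frequency
```

The tail of the returned profile is where the high-frequency attenuation and checkerboard-peak checks listed above would look: a downstream score could compare energy in the last part of the profile against the middle band.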
fraud_ring.py: cross-applicant fraud-ring detection
| Function | Description |
|---|---|
| `extract_applicant_fields(path)` | OCR + regex pull of name / DOB / address / phone / IFSC / account / employer. |
| `compare_applicants(a, b)` | Per-field similarity + weighted score. |
| `build_fraud_graph(applicants)` | NetworkX similarity graph (edges weighted by shared signals). |
| `detect_rings(G, min_size=3)` | Connected components above threshold → suspected fraud rings. |
| `visualize_graph(G, rings)` | Force-directed graph with ring members in red. |
| `fraud_summary(G, rings, applicants)` | Structured summary for the Streamlit UI. |
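The graph-then-components idea behind `build_fraud_graph` and `detect_rings` can be sketched without any dependencies. The version below deliberately swaps NetworkX for a plain-Python adjacency map and breadth-first search so it runs anywhere; the field weights, the 0.3 edge threshold, the exact-match comparison, and the names `similarity` and `find_rings` are all hypothetical.

```python
from collections import deque
from itertools import combinations

# Hypothetical weights; the real weighting in fraud_ring.py is not documented here.
FIELD_WEIGHTS = {"phone": 0.4, "address": 0.3, "account": 0.2, "employer": 0.1}


def similarity(a: dict, b: dict) -> float:
    """Sum the weights of fields that are present and identical in both applicants."""
    return sum(w for f, w in FIELD_WEIGHTS.items() if a.get(f) and a.get(f) == b.get(f))


def find_rings(applicants: dict, threshold: float = 0.3, min_size: int = 3) -> list:
    """Link similar applicants, then return connected components of at least min_size."""
    adjacency = {name: set() for name in applicants}
    for u, v in combinations(applicants, 2):
        if similarity(applicants[u], applicants[v]) >= threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)

    rings, seen = [], set()
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:                          # breadth-first walk of one component
            for neighbour in adjacency[queue.popleft()]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) >= min_size:
            rings.append(component)
    return rings
```

Connected components are a looser criterion than cliques: three applicants chained by two shared signals form a component even if the first and third share nothing directly. Which criterion is appropriate depends on how aggressive the ring flagging should be.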
provenance.py: tamper-evident audit ledger
| Function | Description |
|---|---|
| `log_analysis(...)` | Appends a SHA-256 hash-chained record to the SQLite ledger. |
| `verify_chain()` | Walks every record in O(N); pinpoints the first broken record. |
| `chain_stats()` | Count, first/last timestamps, breakdown by risk band, chain status. |
| `fetch_ledger(limit)` | Returns the latest N entries. |
| `ledger_dataframe(limit)` | Pandas DataFrame view (for Streamlit display). |
Each record's record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash), so a retroactive edit to any record invalidates the chain from that record onward.
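The hash-chain formula above can be exercised with a small in-memory sketch. It uses a Python list instead of the SQLite ledger, and the all-zero genesis hash, the `|` join, and the four-decimal score formatting are assumptions made for the example.

```python
import hashlib

GENESIS = "0" * 64  # assumed prev_hash for the very first record


def record_hash(ts, doc_sha256, risk_band, risk_score, prev_hash):
    """SHA-256 over the pipe-joined fields, mirroring the formula above."""
    payload = "|".join([ts, doc_sha256, risk_band, f"{risk_score:.4f}", prev_hash])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(ledger, ts, doc_sha256, risk_band, risk_score):
    """Append a record whose hash commits to the previous record's hash."""
    prev = ledger[-1]["record_hash"] if ledger else GENESIS
    rec = {
        "timestamp": ts, "doc_sha256": doc_sha256,
        "risk_band": risk_band, "risk_score": risk_score,
        "prev_hash": prev,
        "record_hash": record_hash(ts, doc_sha256, risk_band, risk_score, prev),
    }
    ledger.append(rec)
    return rec


def verify_chain(ledger):
    """Return the index of the first broken record, or None if intact (one O(N) pass)."""
    prev = GENESIS
    for i, rec in enumerate(ledger):
        expected = record_hash(rec["timestamp"], rec["doc_sha256"],
                               rec["risk_band"], rec["risk_score"], prev)
        if rec["prev_hash"] != prev or rec["record_hash"] != expected:
            return i
        prev = rec["record_hash"]
    return None
```

Changing any stored field makes the recomputed hash disagree with the stored one, and rewriting the stored hash to match would break the next record's `prev_hash` link, so the first inconsistency is always detectable in a single pass.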
tampering.py: adversarial Forge Studio
Functions: tamper_copy_move, tamper_text_edit, tamper_splice, tamper_compression, tamper_metadata_strip, tamper_custom_region, tamper_chain, annotate_before_after, overlay_heatmap_on_image, detector_scorecard. Tab 5 uses them to apply controlled forgeries and immediately re-run detection.
compliance.py: KYC + regulatory
| Function | Description |
|---|---|
| `validate_ifsc(code)` | Format check + RBI bank-code lookup (36 banks). |
| `validate_pan(code)` | Format + entity-type character validation. |
| `validate_aadhaar(num)` | 12-digit format + UIDAI Verhoeff checksum. |
| `redact_text(text)` | Masks IFSC, PAN, Aadhaar, account numbers. |
| `redact_pdf(input_path, output_path)` | PII black-box overlays via PyMuPDF text-bbox. |
| `extract_pii_fields(path)` | Pulls all PII candidates from any document. |
| `build_compliance_report(...)` | RBI Master-Direction-format audit PDF (5 sections). |
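The format and checksum rules named in the table can be sketched as follows. These are format-only illustrations with hypothetical names (`ifsc_format_ok`, `pan_format_ok`, `aadhaar_checksum_ok`); they omit the bank-code lookup that `validate_ifsc` is described as doing, and the regular expressions reflect the commonly published IFSC and PAN layouts rather than the module's own patterns.

```python
import re

IFSC_RE = re.compile(r"[A-Z]{4}0[A-Z0-9]{6}")                # 4 letters, '0', 6 alphanumerics
PAN_RE = re.compile(r"[A-Z]{3}[ABCFGHJLPT][A-Z][0-9]{4}[A-Z]")  # 4th char = entity type

# Standard Verhoeff multiplication (D) and permutation (P) tables.
D = ((0,1,2,3,4,5,6,7,8,9), (1,2,3,4,0,6,7,8,9,5), (2,3,4,0,1,7,8,9,5,6),
     (3,4,0,1,2,8,9,5,6,7), (4,0,1,2,3,9,5,6,7,8), (5,9,8,7,6,0,4,3,2,1),
     (6,5,9,8,7,1,0,4,3,2), (7,6,5,9,8,2,1,0,4,3), (8,7,6,5,9,3,2,1,0,4),
     (9,8,7,6,5,4,3,2,1,0))
P = ((0,1,2,3,4,5,6,7,8,9), (1,5,7,6,2,8,3,0,9,4), (5,8,0,3,7,9,6,1,4,2),
     (8,9,1,6,0,4,3,5,2,7), (9,4,5,3,1,2,6,8,7,0), (4,2,8,6,5,7,3,9,0,1),
     (2,7,9,3,8,0,6,4,1,5), (7,0,4,6,9,1,3,2,5,8))


def ifsc_format_ok(code: str) -> bool:
    return IFSC_RE.fullmatch(code) is not None


def pan_format_ok(code: str) -> bool:
    return PAN_RE.fullmatch(code) is not None


def verhoeff_valid(number: str) -> bool:
    """True if the digit string (check digit last) passes the Verhoeff checksum."""
    if re.fullmatch(r"[0-9]+", number) is None:
        return False
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def aadhaar_checksum_ok(num: str) -> bool:
    digits = re.sub(r"\s", "", num)          # tolerate "1234 5678 9012" spacing
    return len(digits) == 12 and verhoeff_valid(digits)
```

Passing these checks only shows that an identifier is well-formed; it says nothing about whether the identifier was actually issued or belongs to the applicant.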
audit_report.py: bank-letterhead PDF
build_pdf_report(report, source_path) → bytes. Produces a multi-page PDF with header letterhead, metadata table, colour-coded risk verdict box, sub-score breakdown table, evidence list, and embedded forensic heatmaps. Built with ReportLab Platypus.
app.py: Streamlit UI (6 tabs)
| Tab | Function |
|---|---|
| 1. Single-document analysis | Risk band, sub-score chart, ELA / copy-move / noise heatmaps, AI-gen FFT profile, ML/CNN predictions, downloadable JSON + PDF. |
| 2. Cross-document KYC | Upload 2–4 docs for one applicant; identity-field consistency table. |
| 3. Batch audit | Scan a folder; sortable risk table + CSV download. |
| 4. Compliance & Audit Pack | KYC validation, PII auto-redaction, RBI compliance PDF, provenance ledger view with chain re-verify. |
| 5. Live Tamper Forge Studio | Pick clean sample → choose technique + intensity → watch BankShield localise the tamper with per-detector scorecard + heatmap overlays. |
| 6. Fraud Ring Network | Upload N applicants → similarity graph with red ring members + ring summary cards. |
Pipeline architecture
               ┌────────────────────────────────────────┐
               │    PRESENTATION (Streamlit, 6 tabs)    │
               └───────────────────┬────────────────────┘
                                   ▼
            ┌──────────────────────────────────────────────┐
            │                FORENSICS CORE                │
            │  ELA · Copy-move · Noise · EXIF · OCR · PDF  │
            │  + Random Forest (11-d feature vector)       │
            │  + MobileNetV2 CNN (CASIA v2 fine-tuned)     │
            │  + AI-Gen Detector (radial FFT)              │
            └──────────────────────┬───────────────────────┘
                                   ▼
            ┌──────────────────────────────────────────────┐
            │               ENSEMBLE FUSION                │
            │  weighted blend → RF overlay → CNN overlay   │
            │  → AI-gen overlay (capped at +20%)           │
            └──────────────────────┬───────────────────────┘
                                   ▼
       ┌──────────────────┬────────┴───────┬─────────────────┐
       ▼                  ▼                ▼                 ▼
┌──────────────┐ ┌────────────────┐ ┌──────────────┐ ┌────────────────┐
│ COMPLIANCE   │ │ FRAUD-RING     │ │ PROVENANCE   │ │ TAMPER FORGE   │
│ IFSC · PAN · │ │ NetworkX graph │ │ SHA-256 hash │ │ Adversarial    │
│ Aadhaar · PII│ │ clique detect  │ │ chain ledger │ │ validation     │
└──────┬───────┘ └────────┬───────┘ └──────┬───────┘ └────────────────┘
       │                  │                │
       └───────────┬──────┴────────────────┘
                   ▼
  ┌──────────────────────────────────┐
  │              OUTPUT              │
  │  Risk band · Evidence list       │
  │  Bank-letterhead audit PDF       │
  │  RBI compliance PDF · Audit JSON │
  │  Tamper-evident ledger entry     │
  └──────────────────────────────────┘
Default weight vector (forensics.WEIGHTS): {ela: 0.20, copy_move: 0.25, noise: 0.20, exif: 0.15, text_rules: 0.20}. The Random Forest probability, when available, is blended 50/50 with the rule-based score. The CNN probability is blended at a weight between 0.4 and 0.7 based on the CNN's reported validation AUC. The AI-gen probability is applied as a final overlay capped at +20%.
Band mapping: 0–0.30 LOW · 0.30–0.50 MEDIUM · 0.50–0.75 HIGH · 0.75+ CRITICAL.
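The fusion order and band thresholds described above can be put together in a short sketch. The weight vector and thresholds come from this README; everything else is an assumption made for illustration: the linear CNN blend, treating the AI-gen overlay as an additive bump of at most 0.20, clamping to [0, 1], and treating each band's lower bound as inclusive. The name `fuse` is hypothetical.

```python
WEIGHTS = {"ela": 0.20, "copy_move": 0.25, "noise": 0.20, "exif": 0.15, "text_rules": 0.20}
AI_OVERLAY_CAP = 0.20  # "capped at +20%", read here as at most +0.20 on the score


def band(score: float) -> str:
    """Map a score to a risk band (lower bounds assumed inclusive)."""
    if score >= 0.75:
        return "CRITICAL"
    if score >= 0.50:
        return "HIGH"
    if score >= 0.30:
        return "MEDIUM"
    return "LOW"


def fuse(sub, rf_prob=None, cnn_prob=None, cnn_weight=0.5, ai_prob=None) -> float:
    """Weighted rule score, then RF, CNN, and AI-gen overlays in that order."""
    score = sum(WEIGHTS[k] * sub.get(k, 0.0) for k in WEIGHTS)
    if rf_prob is not None:
        score = 0.5 * score + 0.5 * rf_prob          # 50/50 blend with the Random Forest
    if cnn_prob is not None:
        w = min(max(cnn_weight, 0.4), 0.7)           # weight kept within the stated 0.4-0.7 range
        score = (1 - w) * score + w * cnn_prob
    if ai_prob is not None:
        score += AI_OVERLAY_CAP * ai_prob            # overlay can only raise the score
    return min(max(score, 0.0), 1.0)
```

For example, sub-scores of 0.5 across all five detectors give a rule score of 0.5 (HIGH); a confident AI-gen signal can lift that by at most 0.20, which is not enough on its own to reach CRITICAL.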
See ARCHITECTURE.md for the full reference.
Detection coverage
Image tampering
- Copy-move forgery: ORB keypoint matching with distance filter
- Image splicing: block-wise noise inconsistency via Laplacian variance
- Text edits / amount tampering: Error Level Analysis
- Photoshop / GIMP / Snapseed edits: EXIF Software-tag string match
- Timestamp inconsistencies: DateTime vs DateTimeOriginal comparison
AI-generated content
- Sora / Midjourney / Stable Diffusion / DALL-E outputs: FFT spectral analysis
- High-frequency suppression (1/f decay deviation)
- Periodic checkerboard peaks from upsampling stride
- Non-standard JPEG quantization tables
PDF tampering
- Incremental edits: multi-`%%EOF` marker counting
- Consumer-tool fingerprints: iLovePDF, Smallpdf, PDFescape, Sejda, Foxit Phantom
- Producer/Creator mismatch: flags re-processed PDFs
- Inserted text: embedded-font count anomalies
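The `%%EOF` counting idea from the list above is simple enough to show directly. This sketch works on raw bytes and uses hypothetical helper names; it is not the body of `pdf_structural_audit`.

```python
def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count end-of-file markers; each incremental save appends another one."""
    return pdf_bytes.count(b"%%EOF")


def looks_incrementally_edited(pdf_bytes: bytes) -> bool:
    """More than one marker means the file was saved in more than one revision."""
    return count_eof_markers(pdf_bytes) > 1
```

To check a file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes in. Legitimate workflows such as form filling or digital signing also append revisions, so a count above one is a reason to look closer rather than a verdict.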
Cross-document & fraud-ring
- Name / DOB / address fuzzy match across multiple documents
- Per-field weighted scoring with green / yellow / red status
- Cross-applicant similarity graph; cliques of ≥3 = suspected fraud ring
- Ring bands: CRITICAL (≥5 members) / HIGH (3–4) / MEDIUM (2)
KYC validation
- IFSC: format + RBI bank-code list (36 banks)
- PAN: format + entity-type character (10 types per income-tax dept spec)
- Aadhaar: 12-digit format + UIDAI Verhoeff checksum
PII redaction & audit
- Aadhaar, PAN, IFSC, account-number masking
- PDF redaction with black rectangle overlays
- SHA-256 hash-chained provenance ledger (RBI Para 67 compliant)
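Text-level masking of the identifiers listed above can be sketched with regular expressions. The patterns, the "keep the last four characters" policy, and the names `mask_value` and `mask_pii` are assumptions for illustration; account numbers are left out because their format varies by bank, and this is not the behaviour of `redact_text` itself.

```python
import re

# Assumed patterns based on the commonly published formats.
PII_PATTERNS = {
    "PAN": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "IFSC": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
    "AADHAAR": re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"),
}


def mask_value(value: str, keep: int = 4) -> str:
    """Replace all but the last `keep` letters/digits with 'X', preserving spacing."""
    chars = list(value)
    visible = 0
    for i in range(len(chars) - 1, -1, -1):
        if chars[i].isalnum():
            if visible < keep:
                visible += 1
            else:
                chars[i] = "X"
    return "".join(chars)


def mask_pii(text: str) -> str:
    """Mask every PAN, IFSC, and Aadhaar-shaped token found in the text."""
    for pattern in PII_PATTERNS.values():
        text = pattern.sub(lambda m: mask_value(m.group()), text)
    return text
```

Keeping a short suffix visible lets an auditor match a redacted report back to a case file without exposing the full identifier; a stricter policy could set `keep=0`.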
Running locally
git clone https://github.com/SpandanM110/Doc-Sentry.git
cd Doc-Sentry
pip install -r requirements.txt
streamlit run app.py
The app opens in your browser at http://localhost:8501.
For full OCR text-rule support, install Tesseract OCR:
- Windows: https://github.com/UB-Mannheim/tesseract/wiki
- macOS: `brew install tesseract`
- Linux: `sudo apt-get install tesseract-ocr libtesseract-dev`
The app auto-detects Tesseract on standard Windows install paths; no environment variable required.
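Auto-detection of this kind usually means checking the PATH first and then a few well-known install locations. The sketch below shows that pattern; the function name `find_tesseract` and the candidate paths (the usual defaults of the UB-Mannheim Windows installer) are assumptions, not the app's actual lookup code.

```python
import os
import shutil
from typing import Optional

# Assumed default locations for the UB-Mannheim Windows installer.
WINDOWS_CANDIDATES = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
]


def find_tesseract() -> Optional[str]:
    """Return a path to the tesseract binary, or None if it cannot be found."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for candidate in WINDOWS_CANDIDATES:
        if os.path.isfile(candidate):
            return candidate
    return None
```

If a path is found, it can be handed to pytesseract by setting `pytesseract.pytesseract.tesseract_cmd` before the first OCR call; if nothing is found, the OCR-dependent rules can be skipped rather than failing the whole analysis.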
Deployment
The repository is deployment-ready for both Streamlit Community Cloud and Hugging Face Spaces. The YAML frontmatter at the top of this README configures the HF Space; packages.txt ensures Tesseract is installed on the build VM; requirements.txt covers Python dependencies.
Live deployment: https://huggingface.co/spaces/SpandanM110/DocSentry
Training your own model
Drop labelled data into data/images/originals/ and data/images/tampered/, open docsentry_master.ipynb, and run section 6. A Random Forest trains automatically on whatever you put there and is saved to models/forgery_rf.joblib. The Streamlit app picks it up on the next restart.
For a CNN upgrade, set TRAIN_CNN = True in section 7 and run the notebook on a Colab T4 GPU (free tier). This saves models/forgery_cnn.keras and models/forgery_cnn.meta.json, which the app loads lazily on first request.
Dependencies
OpenCV (cv2), Pillow (PIL), scikit-image, scikit-learn, joblib, PyMuPDF (fitz), pdfplumber, pikepdf, pytesseract, python-dateutil, Streamlit, streamlit-drawable-canvas, ReportLab, NumPy, pandas, matplotlib, NetworkX. Optional: TensorFlow (only required for the CNN path).
All pip-installable. No GPU required for the default pipeline.
License
MIT; see LICENSE. The MIT license covers the source code in this repository. Third-party datasets and pretrained models bundled or referenced (CASIA v2, IDRBT cheque dataset, AgamiAI Indian Bank Statements, MobileNetV2 ImageNet weights) are governed by their own terms; those notices are reproduced in LICENSE below the MIT block.
Acknowledgements
- AgamiAI Indian Bank Statements (Hugging Face), Apache 2.0
- IDRBT Cheque Image Dataset (Institute for Development and Research in Banking Technology, India)
- CASIA v2 image tampering dataset (Chinese Academy of Sciences)
- MICC-F220 copy-move benchmark (University of Florence)
- CoMoFoD dataset (University of Zagreb)
- Tobacco-3482 document corpus (University of Maryland)