Spaces: SpandanM110/DocSentry


SpandanM110 committed 8 days ago

Commit b2ca303 (1 parent: e97f963)

Round 2: README + ignore pitch artefacts and runtime ledger


Files changed (2)

.gitignore +13 -0
README.md +178 -137

.gitignore CHANGED

@@ -34,3 +34,16 @@ Thumbs.db
 # Archive (the old notebooks we moved out)
 archive/

+# Hackathon pitch artefacts - keep local, don't commit
+BankShield_Pitch.pptx
+SUBMISSION.md
+*.pptx
+# Runtime state - SQLite ledger fills up at runtime, never commit
+provenance.db
+provenance.db-journal
+*.db
+*.db-journal
+*.sqlite
+*.sqlite3

README.md CHANGED

@@ -8,15 +8,35 @@ sdk_version: 1.32.0
 app_file: app.py
 pinned: false
 license: mit
-short_description: Document forensics + KYC compliance for bank underwriting
 ---
-# DocSentry
-Document forensics and KYC compliance pipeline for bank underwriting workflows. Detects tampering and forgery in land records, legal documents, financial statements, and cheques. Validates KYC fields against RBI rules. Produces explainable risk scores and regulator-ready audit reports.
-100% open-source. No paid APIs. No LLM calls. CPU-only by default.
-<img width="1915" height="709" alt="image" src="https://github.com/user-attachments/assets/4567694f-b07e-4367-afa6-174069e2e48f" />
 ---
@@ -24,14 +44,18 @@ Document forensics and KYC compliance pipeline for bank underwriting workflows.
 ```
 Doc-Sentry/
-├── app.py                       Streamlit web UI (4 tabs)
-├── forensics.py                 Core detection engine
-├── audit_report.py              Bank-letterhead PDF report builder
 ├── compliance.py                KYC validators, PII redaction, RBI report builder
 ├── docsentry_master.ipynb       Single source-of-truth Jupyter notebook
 │
 ├── requirements.txt             Python dependencies
-├── packages.txt                 System packages (Tesseract) for Streamlit Cloud
 ├── .streamlit/config.toml       Streamlit theme + server config
 │
 ├── sample_data/                 26 demo files for the live app
@@ -40,9 +64,13 @@ Doc-Sentry/
 │   └── pdfs/                    2 PDFs (1 genuine, 1 tampered)
 │
 ├── models/                      Trained model artefacts
-│   └── forgery_rf.joblib        Random Forest classifier
 │
-├── README.md  DEPLOY.md  RUN_APP.md  DATASETS.md  PUSH.md  LICENSE
 └── data/                        (gitignored) full training data + downloaded datasets
 ```
@@ -56,166 +84,182 @@ The core analytical module. Stateless functions; all logic is independently test
 | Function | Returns | Description |
 |---|---|---|
-| `analyse_document(path)` | dict | End-to-end pipeline. Auto-detects type (image vs PDF), runs all relevant detectors, blends Random Forest + CNN predictions when available. Primary entry point. |
 | `score_image(path)` | (float, dict, list) | Composite forensic score for an image. Returns total, sub-scores by detector, and EXIF flags. |
-| `error_level_analysis(path, quality=90)` | (PIL.Image, float) | ELA visualisation + scalar suspicion score. Re-saves at given JPEG quality; tampered regions diverge from the rest of the image. |
-| `copy_move_detect(path)` | (np.ndarray, int, list) | Detects regions duplicated within the same image using ORB keypoint matching. Returns annotated visualisation, match count, and raw matches. |
-| `noise_inconsistency(path, block=32)` | (np.ndarray, float) | Per-block Laplacian variance. Returns a heatmap of outlier blocks and a normalised ratio. Useful for splicing detection. |
-| `exif_sanity(path)` | list of str | EXIF metadata audit. Flags missing EXIF, photo-editor signatures (Photoshop/GIMP/Snapseed), and timestamp inconsistencies. |
-| `pdf_structural_audit(path)` | dict | Counts `%%EOF` markers (incremental edits), compares producer vs creator, flags consumer-tool fingerprints (iLovePDF, Smallpdf, etc.). |
-| `pdf_font_audit(path)` | dict | Lists embedded fonts and flags unusually high font counts (a signal of inserted text). |
-| `ocr_text(path)` | str | Tesseract OCR with auto-fallback. Returns empty string if Tesseract isn't installed. |
-| `text_rule_checks(text)` | dict | Validates date monotonicity, amount sanity, IFSC format, account number patterns. |
-| `extract_features(path)` | dict | Feature vector for the Random Forest: 11 features (ELA, copy-move count, noise ratio, EXIF flag, 4 GLCM texture features, 3 colour histogram entropies). |
-| `predict_with_model(path)` | dict or None | Loads `models/forgery_rf.joblib` and returns tamper probability + verdict. None if model isn't present. |
-| `predict_with_cnn(path)` | dict or None | Lazy-loads `models/forgery_cnn.keras` (TensorFlow). None if model isn't present, so the app starts fast without TF. |
-| `extract_identity_fields(path)` | (dict, str) | Pulls name, DOB, address, account number, IFSC, and amounts from any document. |
-| `cross_doc_consistency(paths)` | dict | Compares identity fields across 2+ documents using `difflib.SequenceMatcher`. Returns per-field match status and an aggregate consistency risk. |
-| `generate_insights(score, sub, flags)` | dict | Converts numeric sub-scores into underwriter-readable bullets, risk band, and recommended action. |
-| `band(score)` | str | Maps a float to LOW / MEDIUM / HIGH / CRITICAL. Boundaries at 0.25, 0.50, 0.75. |
-Constants of interest: `WEIGHTS`, `INSIGHT_RULES`, `ACTIONS`, `MODEL_PATH`, `CNN_MODEL_PATH`, `TESSERACT_OK`.
-### `app.py` — Streamlit UI
-Four-tab web app. Imports `forensics`, `compliance`, and `audit_report`.
-| Tab | Function |
 |---|---|
-| Single-document analysis | Drag-drop or pick a sample; shows risk band, sub-score breakdown, evidence list, ELA / copy-move / noise visualisations, PDF audit details, ML/CNN predictions, downloadable JSON + PDF reports. |
-| Cross-document check | Upload 2–4 documents for one applicant; the system extracts identity fields and shows a coloured comparison table with similarity scores. |
-| Batch audit | Point at a folder; scans every supported file and produces a sortable risk table + CSV. |
-| Compliance & Audit Pack | Three sub-tabs: KYC field validation (manual or doc-extracted), PII auto-redaction (PDF + text), RBI-style compliance report generation. |
-The sample picker auto-populates from `sample_data/`; useful for the deployed demo where users can't browse the local filesystem.
-### `audit_report.py` — bank-letterhead PDF
-Single public function: `build_pdf_report(report, source_path)` → `bytes`.
-Generates a multi-page PDF with header letterhead, metadata table, coloured risk-verdict box, sub-score breakdown table (with ASCII bar chart), evidence list, embedded heatmaps for image documents, structural audit details for PDFs, ML model verdict block, and footer disclaimer. Uses ReportLab Platypus.
 ### `compliance.py` — KYC + regulatory
 | Function | Description |
 |---|---|
-| `validate_ifsc(code)` | Format check (`^[A-Z]{4}0[A-Z0-9]{6}$`) + lookup against an embedded RBI bank-code list (~36 major Indian banks). Returns bank name and branch code on success. |
-| `validate_pan(code)` | Format check (`^[A-Z]{5}\d{4}[A-Z]$`) + entity-type character validation (P=Individual, F=Firm, C=Company, etc.). |
-| `validate_aadhaar(num)` | 12-digit format + UIDAI Verhoeff checksum verification. Aadhaar numbers cannot start with 0 or 1 per UIDAI spec. |
-| `redact_text(text)` | Masks IFSC, PAN, Aadhaar, and account numbers in arbitrary text. |
-| `redact_pdf(input_path, output_path)` | Renders each PDF page, locates PII bounding boxes via `page.search_for`, overlays opaque black rectangles. |
-| `extract_pii_fields(path)` | Pulls all PII candidates from any document (PDF or image via OCR). |
-| `build_compliance_report(forensic_report, source_path, kyc_results)` | Generates a 5-section regulator-ready PDF: document ID + SHA-256, KYC verification table, fraud-screening verdict, recommended RBI risk treatment, auditor sign-off block. References specific RBI Master Directions. |
-### `docsentry_master.ipynb`
-Single source of truth. Sections:
-1. Environment auto-detection (Colab vs local)
-2. Datasets (synthetic generator + Kaggle CASIA v2 hook + manual download references)
-3. Image forensics
-4. PDF forensics
-5. OCR + text rules
-6. Random Forest training + saving
-7. (Optional) CNN training on Colab GPU
-8. End-to-end pipeline
-9. Cross-document consistency
-10. Dashboard + batch audit
-11. PDF report generator
-12. Export cell — writes `forensics.py`, `app.py`, `audit_report.py` to disk for the Streamlit demo
-13. Launch instructions
-Edit the notebook, re-run section 12, and the `.py` files used by Streamlit regenerate automatically.
 ---
 ## Pipeline architecture
 ```
-                          ┌──────────────────────────┐
-                          │  Document (PNG/PDF/JPG)  │
-                          └────────────┬─────────────┘
-                                       │
-              ┌──────────────────┬─────┴─────┬──────────────────┐
-              ▼                  ▼           ▼                  ▼
-        ┌─────────┐         ┌─────────┐ ┌─────────┐         ┌─────────┐
-        │  ELA    │         │ Copy-   │ │ Noise   │         │  EXIF   │
-        │ analysis│         │  move   │ │ heatmap │         │  audit  │
-        └────┬────┘         └────┬────┘ └────┬────┘         └────┬────┘
-             └───────┬───────────┴───────────┴──────────┬────────┘
-                     │  (images)                        │
-                     │                                  ▼
-                     │                       ┌─────────────────────┐
-                     │                       │  OCR + text rules   │
-                     │                       │ dates · IFSC · math │
-                     │                       └──────────┬──────────┘
-                     ▼                                  │
-        ┌────────────────────────────┐                  │
-        │  Feature vector (11-dim)   │                  │
-        └──────────────┬─────────────┘                  │
-                       ▼                                │
-        ┌────────────────────────────┐                  │
-        │ Random Forest classifier   │                  │
-        │ (forgery_rf.joblib)        │                  │
-        └──────────────┬─────────────┘                  │
-                       │                                │
-                       ▼                                │
-        ┌────────────────────────────┐                  │
-        │ (Optional) CNN inference   │                  │
-        │ MobileNetV2 fine-tuned     │                  │
-        │ (forgery_cnn.keras)        │                  │
-        └──────────────┬─────────────┘                  │
-                       │                                │
-                       └──────────────┬─────────────────┘
-                                      ▼
-                      ┌──────────────────────────────┐
-                      │  Weighted ensemble scorer    │
-                      │  (rule + RF + CNN blend)     │
-                      └──────────────┬───────────────┘
-                                     ▼
-                      ┌──────────────────────────────┐
-                      │  Risk band + Evidence list   │
-                      │  Recommended action          │
-                      │  Audit JSON + PDF report     │
-                      └──────────────────────────────┘
 ```
-Each detector outputs a sub-score in `[0, 1]`. The default weight vector (in `forensics.WEIGHTS`) is `{ela: 0.20, copy_move: 0.25, noise: 0.15, exif: 0.10, pdf_struct: 0.15, text_rules: 0.10, math: 0.05}`. The Random Forest probability, when available, is blended 50/50 with the rule-based score. The CNN probability, when available, is blended at a weight between 0.4 and 0.7 depending on the CNN's reported validation AUC.
 ---
 ## Detection coverage
 **Image tampering**
 - Copy-move forgery — ORB keypoint matching with distance filter
 - Image splicing — block-wise noise inconsistency via Laplacian variance
 - Text edits / amount tampering — Error Level Analysis
 - Photoshop / GIMP / Snapseed edits — EXIF Software-tag string match
 - Timestamp inconsistencies — DateTime vs DateTimeOriginal comparison
 **PDF tampering**
 - Incremental edits — multi-`%%EOF` marker counting
-- Consumer-tool fingerprints — iLovePDF, Smallpdf, PDFescape, Sejda, Foxit Phantom strings in producer/creator
 - Producer/Creator mismatch — flags re-processed PDFs
-- Inserted text — embedded font count anomalies
-**Text-level**
-- Date sequence violations — monotonic check on extracted dates
-- Round-number anomalies — counts mega-amounts that are multiples of ₹1 lakh
-- Missing IFSC with account number present — invalid bank document
-**Cross-document**
 - Name / DOB / address fuzzy match across multiple documents
-- Per-field similarity scoring with green/yellow/red status
 **KYC validation**
 - IFSC: format + RBI bank-code list (36 banks)
 - PAN: format + entity-type character (10 types per income-tax dept spec)
 - Aadhaar: 12-digit format + UIDAI Verhoeff checksum
-**PII redaction**
 - Aadhaar, PAN, IFSC, account-number masking
-- PDF redaction with black rectangle overlays via `page.search_for` bounding boxes
 ---
@@ -231,37 +275,34 @@ streamlit run app.py
 Browser opens at `http://localhost:8501`.
 For full OCR text-rule support, install Tesseract OCR:
 - Windows: https://github.com/UB-Mannheim/tesseract/wiki
 - macOS: `brew install tesseract`
-- Linux: `sudo apt-get install tesseract-ocr`
 The app auto-detects Tesseract on standard Windows install paths; no environment variable required.
-See `RUN_APP.md` for a more detailed walkthrough.
 ---
 ## Deployment
-Push to GitHub, connect at https://share.streamlit.io, point at `app.py`, deploy. The `packages.txt` ensures Tesseract is installed on the Streamlit Cloud VM; `requirements.txt` covers Python dependencies.
-See `DEPLOY.md` for step-by-step instructions including troubleshooting.
 ---
 ## Training your own model
-Drop labelled data into `data/images/originals/` and `data/images/tampered/`, open `docsentry_master.ipynb`, run section 6. A Random Forest auto-trains on whatever you put there and saves to `models/forgery_rf.joblib`. The Streamlit app picks it up automatically on next restart — no code changes.
-For a CNN upgrade, set `TRAIN_CNN = True` in section 7 and run on a Colab T4 GPU (free tier). Saves `models/forgery_cnn.keras` + `models/forgery_cnn.meta.json`. The app loads this lazily on first request.
-See `DATASETS.md` for public datasets you can use.
 ---
 ## Dependencies
-OpenCV (cv2), Pillow (PIL), scikit-image, scikit-learn, joblib, PyMuPDF (fitz), pdfplumber, pikepdf, pytesseract, python-dateutil, Streamlit, ReportLab, NumPy, pandas, matplotlib. Optional: TensorFlow (only required for the CNN path).
 All pip-installable. No GPU required for the default pipeline.
@@ -269,7 +310,7 @@ All pip-installable. No GPU required for the default pipeline.
 ## License
-MIT — see `LICENSE`. The MIT license covers the source code in this repository. Third-party datasets and pretrained models bundled or referenced (CASIA v2, IDRBT cheque dataset, AgamiAI Indian Bank Statements, MobileNetV2 ImageNet weights, etc.) are governed by their own terms; those notices are reproduced in `LICENSE` below the MIT block.
 ---

 app_file: app.py
 pinned: false
 license: mit
+short_description: BankShield — document forensics + fraud-ring detection for Indian bank underwriting
 ---
+# BankShield
+**Real-Time Document Forensics, AI-Generated Forgery Detection, and Cross-Applicant Fraud-Ring Intelligence for Indian Bank Underwriting.**
+BankShield catches tampered, forged, and AI-generated documents the moment they reach the underwriter — and surfaces organised fraud rings that span multiple applicants. Six independent detection layers fuse into a single calibrated risk score, with explainable evidence, tamper-evident audit trails, and RBI-format compliance reports out of the box.
+100% open source. No paid APIs. No external LLM calls. CPU-only by default. Runs locally on the bank's perimeter — PII never leaves.
+- **Live demo:** https://huggingface.co/spaces/SpandanM110/DocSentry
+- **Source:** https://github.com/SpandanM110/Doc-Sentry
+- **Architecture reference:** see [`ARCHITECTURE.md`](ARCHITECTURE.md)
+---
+## The six pillars
+| Pillar | Module | What it does |
+|---|---|---|
+| **Image Forensics** | `forensics.py` | ELA, copy-move (ORB), Laplacian noise inconsistency, EXIF audit |
+| **PDF Structural Audit** | `forensics.py` | EOF marker counting, producer/creator drift, embedded-font anomalies, consumer-tool fingerprints |
+| **OCR + Financial Rules** | `forensics.py` | Tesseract OCR + IFSC / PAN / Aadhaar / date monotonicity / amount sanity |
+| **AI-Generated Detection** *(new)* | `ai_detector.py` | Radial FFT spectral analysis — catches Sora / Midjourney / Stable Diffusion outputs |
+| **Fraud Ring Network** *(new)* | `fraud_ring.py` | NetworkX similarity graph across applicants; clique discovery flags organised fraud rings |
+| **Provenance Ledger** *(new)* | `provenance.py` | SHA-256 hash chain over every analysis; O(N) verifiable; RBI Para 67 compliant |
+Plus the **Live Tamper Forge Studio** (`tampering.py`) — an adversarial-validation harness built directly into the dashboard.
 ---
 ```
 Doc-Sentry/
+├── app.py                       Streamlit web UI (6 tabs)
+├── forensics.py                 Core detection engine + ensemble fusion
+├── ai_detector.py               AI-generated forgery detector (FFT spectral)
+├── fraud_ring.py                Cross-applicant similarity graph + clique detection
+├── provenance.py                Tamper-evident SHA-256 hash chain
+├── tampering.py                 Forge Studio adversarial harness
 ├── compliance.py                KYC validators, PII redaction, RBI report builder
+├── audit_report.py              Bank-letterhead PDF report builder
 ├── docsentry_master.ipynb       Single source-of-truth Jupyter notebook
 │
 ├── requirements.txt             Python dependencies
+├── packages.txt                 System packages (Tesseract) for Streamlit Cloud / HF Spaces
 ├── .streamlit/config.toml       Streamlit theme + server config
 │
 ├── sample_data/                 26 demo files for the live app
 │   └── pdfs/                    2 PDFs (1 genuine, 1 tampered)
 │
 ├── models/                      Trained model artefacts
+│   ├── forgery_rf.joblib        Random Forest classifier
+│   └── forgery_cnn.keras        MobileNetV2 fine-tuned on CASIA v2 (optional)
 │
+├── ARCHITECTURE.md              Full architecture reference
+├── SUBMISSION.md                Hackathon submission packet
+├── BankShield_Pitch.pptx        Pitch deck (15 slides)
+├── README.md  LICENSE
 └── data/                        (gitignored) full training data + downloaded datasets
 ```
 | Function | Returns | Description |
 |---|---|---|
+| `analyse_document(path)` | dict | End-to-end pipeline. Auto-detects type, runs all relevant detectors, blends Random Forest + CNN + AI-gen predictions, auto-logs to provenance ledger. Primary entry point. |
 | `score_image(path)` | (float, dict, list) | Composite forensic score for an image. Returns total, sub-scores by detector, and EXIF flags. |
+| `error_level_analysis(path, quality=90)` | (PIL.Image, float) | ELA visualisation + scalar suspicion score. |
+| `copy_move_detect(path)` | (np.ndarray, int, list) | ORB-based copy-move detection. Returns annotated viz, match count, raw matches. |
+| `noise_inconsistency(path, block=32)` | (np.ndarray, float) | Per-block Laplacian variance heatmap + outlier ratio. |
+| `exif_sanity(path)` | list of str | EXIF audit: missing EXIF, editor signatures, timestamp inconsistencies. |
+| `pdf_structural_audit(path)` | dict | `%%EOF` markers, producer/creator drift, consumer-tool fingerprints. |
+| `pdf_font_audit(path)` | dict | Embedded font listing + count anomalies. |
+| `ocr_text(path)` | str | Tesseract OCR with auto-fallback. |
+| `text_rule_checks(text)` | dict | Date monotonicity, amount sanity, IFSC format, account-number patterns. |
+| `extract_features(path)` | dict | 11-feature vector for the Random Forest. |
+| `predict_with_model(path)` | dict / None | Random Forest tamper probability + verdict. |
+| `predict_with_cnn(path)` | dict / None | MobileNetV2 CNN inference (lazy-loaded). |
+| `extract_identity_fields(path)` | (dict, str) | Pulls name, DOB, address, IFSC, account, amounts. |
+| `cross_doc_consistency(paths)` | dict | Per-field similarity across 2+ documents. |
+| `generate_insights(score, sub, flags)` | dict | Numeric → underwriter-readable bullets + recommended action. |
+| `band(score)` | str | Maps a float to LOW / MEDIUM / HIGH / CRITICAL. |
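A minimal sketch of the per-field comparison behind `cross_doc_consistency`, assuming the `difflib.SequenceMatcher` approach named in the previous README text. The helper names and the 0.90 / 0.70 status thresholds are illustrative assumptions, not values taken from `forensics.py`.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def status(similarity: float, green: float = 0.90, yellow: float = 0.70) -> str:
    """Map a similarity to a traffic-light status (thresholds are assumed)."""
    if similarity >= green:
        return "green"
    if similarity >= yellow:
        return "yellow"
    return "red"


def compare_fields(doc_a: dict, doc_b: dict) -> dict:
    """Compare only the identity fields that both documents provide."""
    result = {}
    for field in sorted(set(doc_a) & set(doc_b)):
        sim = field_similarity(doc_a[field], doc_b[field])
        result[field] = (round(sim, 3), status(sim))
    return result


if __name__ == "__main__":
    a = {"name": "Ravi Kumar", "dob": "1990-01-01"}
    b = {"name": "ravi kumar ", "dob": "1990-01-01", "ifsc": "SBIN0001234"}
    print(compare_fields(a, b))
```

Fields present in only one document are skipped here; a real pipeline would probably want to report them as missing rather than ignore them.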
+### `ai_detector.py` — AI-generated forgery detection
+| Function | Description |
 |---|---|
+| `detect_ai_generated(path)` | Full pipeline → probability + verdict + flags + FFT profile. |
+| `radial_fft_profile(gray)` | Radially-averaged log-magnitude FFT spectrum. |
+| `high_freq_attenuation(profile)` | Smoothness score — low for real scans, high for AI outputs. |
+| `spectral_peak_score(profile)` | Counts checkerboard-stride peaks in the high-frequency band. |
+| `jpeg_quantization_check(path)` | Inspects JPEG quantization tables for synthetic-media signatures. |
+Blended into the main risk score with a capped +20% overlay so AI-gen signals reliably surface synthetic media without dominating classical detectors.
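The radially averaged log-magnitude spectrum described for `radial_fft_profile` can be sketched with NumPy as follows. This illustrates the general technique and is not the code in `ai_detector.py`; the function name `radial_profile_sketch` and the integer-radius binning are assumptions.

```python
import numpy as np


def radial_profile_sketch(gray: np.ndarray) -> np.ndarray:
    """Return the mean log-magnitude of the 2-D FFT at each integer radius.

    `gray` is a 2-D array (a grayscale image). Index 0 of the result is the
    DC term; higher indices correspond to higher spatial frequencies.
    """
    # Shift the zero-frequency component to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    log_mag = np.log1p(np.abs(spectrum))

    h, w = gray.shape
    y, x = np.indices((h, w))
    # Integer distance of every frequency bin from the spectrum centre.
    radius = np.hypot(y - h // 2, x - w // 2).astype(int)

    sums = np.bincount(radius.ravel(), weights=log_mag.ravel())
    counts = np.bincount(radius.ravel())
    return sums / np.maximum(counts, 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    profile = radial_profile_sketch(rng.random((64, 64)))
    print(profile.shape)
```

A detector along the lines of `high_freq_attenuation` would then compare the tail of this profile against its low-frequency part; how the repository scores that is not shown in this diff.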
+### `fraud_ring.py` — cross-applicant fraud-ring detection
+| Function | Description |
+|---|---|
+| `extract_applicant_fields(path)` | OCR + regex pull of name / DOB / address / phone / IFSC / account / employer. |
+| `compare_applicants(a, b)` | Per-field similarity + weighted score. |
+| `build_fraud_graph(applicants)` | NetworkX similarity graph (edges weighted by shared signals). |
+| `detect_rings(G, min_size=3)` | Connected components above threshold → suspected fraud rings. |
+| `visualize_graph(G, rings)` | Force-directed graph with ring members in red. |
+| `fraud_summary(G, rings, applicants)` | Structured summary for the Streamlit UI. |
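As a rough illustration of the `detect_rings` row (connected components above a threshold), the sketch below groups applicants with a small union-find instead of NetworkX so that it runs with the standard library alone. The function name `find_rings`, the edge format, and the 0.6 similarity threshold are assumptions; only `min_size=3` comes from the table.

```python
from collections import defaultdict


def find_rings(edges, threshold=0.6, min_size=3):
    """Group applicants linked by edges with weight >= threshold.

    `edges` is an iterable of (applicant_a, applicant_b, similarity).
    Returns a sorted list of rings, each a sorted list of applicant ids,
    keeping only groups with at least `min_size` members.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b, weight in edges:
        root_a, root_b = find(a), find(b)  # registers both nodes
        if weight >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)

    rings = [sorted(members) for members in groups.values() if len(members) >= min_size]
    return sorted(rings)


if __name__ == "__main__":
    edges = [("A", "B", 0.9), ("B", "C", 0.8), ("C", "D", 0.2), ("E", "F", 0.95)]
    print(find_rings(edges))
```

Connected components are a looser criterion than cliques: A and C end up in the same ring here even though they share no direct edge.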
+### `provenance.py` — tamper-evident audit ledger
+| Function | Description |
+|---|---|
+| `log_analysis(...)` | Appends a SHA-256 hash-chained record to the SQLite ledger. |
+| `verify_chain()` | Walks every record in O(N); pinpoints the first broken record. |
+| `chain_stats()` | Count, first/last timestamps, breakdown by risk band, chain status. |
+| `fetch_ledger(limit)` | Returns the latest N entries. |
+| `ledger_dataframe(limit)` | Pandas DataFrame view (for Streamlit display). |
+Each record's `record_hash = SHA256(timestamp | doc_sha256 | risk_band | risk_score | prev_hash)` — retroactive edits break the chain mathematically.
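The hash-chain formula above can be sketched with `hashlib` alone. This version keeps the ledger in a Python list rather than SQLite, and the field separator, score formatting, all-zero genesis hash, and function names are assumptions made for illustration, not details of `provenance.py`.

```python
import hashlib

GENESIS_HASH = "0" * 64  # assumed placeholder for the first record's prev_hash


def compute_record_hash(timestamp, doc_sha256, risk_band, risk_score, prev_hash):
    """SHA-256 over the pipe-joined record fields plus the previous hash."""
    payload = "|".join([timestamp, doc_sha256, risk_band, f"{risk_score:.4f}", prev_hash])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(ledger, timestamp, doc_sha256, risk_band, risk_score):
    """Append a record whose hash covers the previous record's hash."""
    prev_hash = ledger[-1]["record_hash"] if ledger else GENESIS_HASH
    record = {
        "timestamp": timestamp,
        "doc_sha256": doc_sha256,
        "risk_band": risk_band,
        "risk_score": risk_score,
        "prev_hash": prev_hash,
    }
    record["record_hash"] = compute_record_hash(
        timestamp, doc_sha256, risk_band, risk_score, prev_hash
    )
    ledger.append(record)
    return record


def first_broken_index(ledger):
    """Walk the chain once; return the index of the first bad record, or None."""
    prev_hash = GENESIS_HASH
    for index, rec in enumerate(ledger):
        expected = compute_record_hash(
            rec["timestamp"], rec["doc_sha256"], rec["risk_band"],
            rec["risk_score"], rec["prev_hash"],
        )
        if rec["prev_hash"] != prev_hash or rec["record_hash"] != expected:
            return index
        prev_hash = rec["record_hash"]
    return None


if __name__ == "__main__":
    ledger = []
    for i, band in enumerate(["LOW", "HIGH", "CRITICAL"]):
        doc_hash = hashlib.sha256(f"doc-{i}".encode()).hexdigest()
        append_record(ledger, f"2024-01-0{i + 1}T00:00:00Z", doc_hash, band, 0.1 + 0.3 * i)
    print(first_broken_index(ledger))
```

Editing any stored field changes that record's recomputed hash, and replacing the stored hash to match would break the `prev_hash` link of the next record, which is why a single linear pass is enough to locate the first inconsistency.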
+### `tampering.py` — adversarial Forge Studio
+`tamper_copy_move`, `tamper_text_edit`, `tamper_splice`, `tamper_compression`, `tamper_metadata_strip`, `tamper_custom_region`, `tamper_chain`, `annotate_before_after`, `overlay_heatmap_on_image`, `detector_scorecard`. Used by Tab 5 to apply controlled forgeries and immediately re-run detection.
 ### `compliance.py` — KYC + regulatory
 | Function | Description |
 |---|---|
+| `validate_ifsc(code)` | Format check + RBI bank-code lookup (36 banks). |
+| `validate_pan(code)` | Format + entity-type character validation. |
+| `validate_aadhaar(num)` | 12-digit format + UIDAI Verhoeff checksum. |
+| `redact_text(text)` | Masks IFSC, PAN, Aadhaar, account numbers. |
+| `redact_pdf(input_path, output_path)` | PII black-box overlays via PyMuPDF text-bbox. |
+| `extract_pii_fields(path)` | Pulls all PII candidates from any document. |
+| `build_compliance_report(...)` | RBI Master-Direction-format audit PDF (5 sections). |
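A hedged sketch of the format-level checks listed above. The IFSC and PAN patterns are the regular expressions quoted in the previous README text, and the Aadhaar rule (12 digits, not starting with 0 or 1, Verhoeff checksum) follows the same text. The function names are hypothetical, and the bank-code lookup and PAN entity-type check are left out.

```python
import re

IFSC_RE = re.compile(r"^[A-Z]{4}0[A-Z0-9]{6}$")
PAN_RE = re.compile(r"^[A-Z]{5}\d{4}[A-Z]$")

# Standard Verhoeff tables: dihedral-group multiplication, permutation, inverse.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6], [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8], [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2], [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2], [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0], [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5], [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_check_digit(digits: str) -> int:
    """Check digit to append to `digits` so the result passes verhoeff_ok."""
    c = 0
    for i, ch in enumerate(reversed(digits)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return INV[c]


def verhoeff_ok(digits: str) -> bool:
    """True if `digits` (including its trailing check digit) is Verhoeff-valid."""
    c = 0
    for i, ch in enumerate(reversed(digits)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def ifsc_format_ok(code: str) -> bool:
    return bool(IFSC_RE.match(code))


def pan_format_ok(code: str) -> bool:
    return bool(PAN_RE.match(code))


def aadhaar_format_ok(number: str) -> bool:
    """12 digits, first digit 2-9, valid Verhoeff checksum."""
    return bool(re.fullmatch(r"[2-9]\d{11}", number)) and verhoeff_ok(number)


if __name__ == "__main__":
    base = "23456789012"  # synthetic 11-digit stem, not a real Aadhaar number
    print(aadhaar_format_ok(base + str(verhoeff_check_digit(base))))
```

These checks confirm that a value is well formed; they say nothing about whether the identifier was actually issued.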
+### `audit_report.py` — bank-letterhead PDF
+`build_pdf_report(report, source_path) → bytes`. Multi-page PDF with header letterhead, metadata table, colour-coded risk verdict box, sub-score breakdown table, evidence list, embedded forensic heatmaps. Built with ReportLab Platypus.
+### `app.py` — Streamlit UI (6 tabs)
+| Tab | Function |
+|---|---|
+| 1. Single-document analysis | Risk band, sub-score chart, ELA / copy-move / noise heatmaps, AI-gen FFT profile, ML/CNN predictions, downloadable JSON + PDF. |
+| 2. Cross-document KYC | Upload 2–4 docs for one applicant; identity-field consistency table. |
+| 3. Batch audit | Scan a folder; sortable risk table + CSV download. |
+| 4. Compliance & Audit Pack | KYC validation, PII auto-redaction, RBI compliance PDF, **provenance ledger view with chain re-verify**. |
+| 5. Live Tamper Forge Studio | Pick clean sample → choose technique + intensity → watch BankShield localise the tamper with per-detector scorecard + heatmap overlays. |
+| 6. Fraud Ring Network | Upload N applicants → similarity graph with red ring members + ring summary cards. |
 ---
 ## Pipeline architecture
 ```
+                    ┌────────────────────────────────────────┐
+                    │   PRESENTATION (Streamlit, 6 tabs)     │
+                    └──────────────────┬─────────────────────┘
+                                       ▼
+              ┌──────────────────────────────────────────────┐
+              │   FORENSICS CORE                             │
+              │   ELA · Copy-move · Noise · EXIF · OCR · PDF │
+              │   + Random Forest (11-d feature vector)      │
+              │   + MobileNetV2 CNN (CASIA v2 fine-tuned)    │
+              │   + AI-Gen Detector (radial FFT)             │
+              └──────────────────┬───────────────────────────┘
+                                 ▼
+              ┌──────────────────────────────────────────────┐
+              │   ENSEMBLE FUSION                            │
+              │   weighted blend → RF overlay → CNN overlay  │
+              │   → AI-gen overlay (capped at +20%)          │
+              └──────────────────┬───────────────────────────┘
+                                 ▼
+        ┌──────────────────┬─────┴─────┬──────────────────┐
+        ▼                  ▼           ▼                  ▼
+┌──────────────┐  ┌────────────────┐ ┌──────────────┐ ┌────────────────┐
+│ COMPLIANCE   │  │ FRAUD-RING     │ │ PROVENANCE   │ │ TAMPER FORGE   │
+│ IFSC · PAN · │  │ NetworkX graph │ │ SHA-256 hash │ │ Adversarial    │
+│ Aadhaar · PII│  │ clique detect  │ │ chain ledger │ │ validation     │
+└──────┬───────┘  └────────┬───────┘ └──────┬───────┘ └────────────────┘
+       │                   │                │
+       └────────────┬──────┴────────────────┘
+                    ▼
+         ┌────────────────────────────────────┐
+         │   OUTPUT                           │
+         │   Risk band · Evidence list        │
+         │   Bank-letterhead audit PDF        │
+         │   RBI compliance PDF · Audit JSON  │
+         │   Tamper-evident ledger entry      │
+         └────────────────────────────────────┘
 ```
+Default weight vector (`forensics.WEIGHTS`): `{ela: 0.20, copy_move: 0.25, noise: 0.20, exif: 0.15, text_rules: 0.20}`. The Random Forest probability, when available, is blended 50/50 with the rule-based score. The CNN probability is blended at a weight between 0.4 and 0.7 based on the CNN's reported validation AUC. The AI-gen probability is applied as a final overlay capped at +20%.
+Band mapping: `0–0.30 LOW · 0.30–0.50 MEDIUM · 0.50–0.75 HIGH · 0.75+ CRITICAL`.
+See [`ARCHITECTURE.md`](ARCHITECTURE.md) for the full reference.
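The fusion order and band mapping described above can be sketched as follows. The weights and band boundaries are the ones stated in this section. The rest is assumed for illustration: how the CNN weight is derived from validation AUC is not given here, so it is passed in and clamped to the 0.4 to 0.7 range, and the AI-gen overlay is modelled as an additive bump of at most 0.20.

```python
WEIGHTS = {"ela": 0.20, "copy_move": 0.25, "noise": 0.20, "exif": 0.15, "text_rules": 0.20}


def band(score: float) -> str:
    """Map a score in [0, 1] to a risk band using the stated boundaries."""
    if score < 0.30:
        return "LOW"
    if score < 0.50:
        return "MEDIUM"
    if score < 0.75:
        return "HIGH"
    return "CRITICAL"


def fuse(sub_scores, rf_prob=None, cnn_prob=None, cnn_weight=0.5, ai_prob=None):
    """Weighted rule score, then RF, CNN and AI-gen overlays, in that order."""
    # Rule-based score: missing detectors contribute 0.
    score = sum(w * sub_scores.get(name, 0.0) for name, w in WEIGHTS.items())

    if rf_prob is not None:
        score = 0.5 * score + 0.5 * rf_prob  # 50/50 blend

    if cnn_prob is not None:
        w = min(0.7, max(0.4, cnn_weight))  # clamp to the stated range
        score = (1.0 - w) * score + w * cnn_prob

    if ai_prob is not None:
        score += min(0.20, 0.20 * ai_prob)  # assumed additive overlay, capped

    return min(1.0, max(0.0, score))


if __name__ == "__main__":
    subs = {"ela": 0.6, "copy_move": 0.8, "noise": 0.3, "exif": 0.0, "text_rules": 0.5}
    total = fuse(subs, rf_prob=0.9, ai_prob=0.4)
    print(round(total, 3), band(total))
```

Applying the overlays sequentially means later stages can outweigh earlier ones, so the order shown in the pipeline diagram matters.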
 ---
 ## Detection coverage
 **Image tampering**
 - Copy-move forgery — ORB keypoint matching with distance filter
 - Image splicing — block-wise noise inconsistency via Laplacian variance
 - Text edits / amount tampering — Error Level Analysis
 - Photoshop / GIMP / Snapseed edits — EXIF Software-tag string match
 - Timestamp inconsistencies — DateTime vs DateTimeOriginal comparison
+**AI-generated content**
+- Sora / Midjourney / Stable Diffusion / DALL-E outputs — FFT spectral analysis
+- High-frequency suppression (1/f decay deviation)
+- Periodic checkerboard peaks from upsampling stride
+- Non-standard JPEG quantization tables
 **PDF tampering**
 - Incremental edits — multi-`%%EOF` marker counting
+- Consumer-tool fingerprints — iLovePDF, Smallpdf, PDFescape, Sejda, Foxit Phantom
 - Producer/Creator mismatch — flags re-processed PDFs
+- Inserted text — embedded-font count anomalies
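The incremental-edit check in the list above comes down to counting end-of-file markers in the raw bytes. A minimal sketch, with hypothetical function names and no claim to match `pdf_structural_audit`:

```python
def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count occurrences of the PDF end-of-file marker in the raw file bytes."""
    return pdf_bytes.count(b"%%EOF")


def looks_incrementally_edited(pdf_bytes: bytes) -> bool:
    """More than one marker suggests at least one appended revision."""
    return count_eof_markers(pdf_bytes) > 1


if __name__ == "__main__":
    original = b"%PDF-1.7\n...objects...\n%%EOF\n"
    edited = original + b"...appended objects...\n%%EOF\n"
    print(count_eof_markers(original), count_eof_markers(edited))
```

An extra marker is a signal rather than proof of tampering, since legitimate workflows such as digital signing also append revisions to a PDF.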
+**Cross-document & fraud-ring**
 - Name / DOB / address fuzzy match across multiple documents
+- Per-field weighted scoring with green / yellow / red status
+- Cross-applicant similarity graph; cliques ≥3 = suspected fraud ring
+- Ring bands: CRITICAL (≥5 members) / HIGH (3–4) / MEDIUM (2)
 **KYC validation**
 - IFSC: format + RBI bank-code list (36 banks)
 - PAN: format + entity-type character (10 types per income-tax dept spec)
 - Aadhaar: 12-digit format + UIDAI Verhoeff checksum
+**PII redaction & audit**
 - Aadhaar, PAN, IFSC, account-number masking
+- PDF redaction with black rectangle overlays
+- SHA-256 hash-chained provenance ledger (RBI Para 67 compliant)
 ---
 Browser opens at `http://localhost:8501`.
 For full OCR text-rule support, install Tesseract OCR:
 - Windows: https://github.com/UB-Mannheim/tesseract/wiki
 - macOS: `brew install tesseract`
+- Linux: `sudo apt-get install tesseract-ocr libtesseract-dev`
 The app auto-detects Tesseract on standard Windows install paths; no environment variable required.
 ---
 ## Deployment
+The repository is deployment-ready for both **Streamlit Community Cloud** and **Hugging Face Spaces**. The YAML frontmatter at the top of this README configures the HF Space; `packages.txt` ensures Tesseract is installed on the build VM; `requirements.txt` covers Python dependencies.
+Live deployment: https://huggingface.co/spaces/SpandanM110/DocSentry
 ---
 ## Training your own model
+Drop labelled data into `data/images/originals/` and `data/images/tampered/`, open `docsentry_master.ipynb`, run section 6. A Random Forest auto-trains on whatever you put there and saves to `models/forgery_rf.joblib`. The Streamlit app picks it up automatically on next restart.
+For a CNN upgrade, set `TRAIN_CNN = True` in section 7 and run on a Colab T4 GPU (free tier). Saves `models/forgery_cnn.keras` + `models/forgery_cnn.meta.json`. The app loads it lazily on first request.
 ---
 ## Dependencies
+OpenCV (cv2), Pillow (PIL), scikit-image, scikit-learn, joblib, PyMuPDF (fitz), pdfplumber, pikepdf, pytesseract, python-dateutil, Streamlit, streamlit-drawable-canvas, ReportLab, NumPy, pandas, matplotlib, NetworkX. Optional: TensorFlow (only required for the CNN path).
 All pip-installable. No GPU required for the default pipeline.
 ## License
+MIT — see `LICENSE`. The MIT license covers the source code in this repository. Third-party datasets and pretrained models bundled or referenced (CASIA v2, IDRBT cheque dataset, AgamiAI Indian Bank Statements, MobileNetV2 ImageNet weights) are governed by their own terms; those notices are reproduced in `LICENSE` below the MIT block.
 ---