Datasets for Document Anomaly / Forgery Detection

All datasets below are free. The only requirement is a free Kaggle account for the Kaggle-hosted ones. The notebook automatically picks up whatever you drop into the right folder, so you do not need every dataset to get started.

Folder layout the notebook expects

data/
├── images/
│   ├── originals/        <-- genuine scans (.png/.jpg)
│   └── tampered/         <-- forged scans (.png/.jpg)
├── pdfs/
│   ├── originals/        <-- genuine legal PDFs
│   └── tampered/         <-- forged legal PDFs
└── statements/           <-- bank statements, ITRs, receipts (any format)

Run validate_data_layout() in the notebook to confirm everything is in place.
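
The notebook's validate_data_layout() is not reproduced in this file. As a rough stand-alone illustration of the same kind of check, the sketch below walks the folder tree shown above and counts the files in each folder. The names `check_layout` and `EXPECTED_DIRS` are hypothetical and are not taken from the notebook.

```python
from pathlib import Path

# Folder names copied from the layout above. The helper itself is a
# hypothetical stand-in, not the notebook's validate_data_layout().
EXPECTED_DIRS = [
    "images/originals",
    "images/tampered",
    "pdfs/originals",
    "pdfs/tampered",
    "statements",
]


def check_layout(root="data"):
    """Return {relative_dir: file_count}, with None for missing folders."""
    report = {}
    for rel in EXPECTED_DIRS:
        folder = Path(root) / rel
        if folder.is_dir():
            report[rel] = sum(1 for p in folder.iterdir() if p.is_file())
        else:
            report[rel] = None
    return report


if __name__ == "__main__":
    for rel, count in check_layout().items():
        status = "missing" if count is None else f"{count} file(s)"
        print(f"data/{rel}: {status}")
```

A folder reported as missing is not necessarily a problem, since not every dataset is required to get started.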


1. Image tampering datasets

CASIA v2: Gold-standard image tampering benchmark

The most widely used dataset for splicing and copy-move detection research. ~12.6k images (about 7.5k genuine + 5.1k tampered).

MICC-F220: Classic copy-move benchmark

220 images, perfect for testing copy-move detection.

CoMoFoD: Copy-move with ground-truth masks

260 image sets with masks. Ideal for training a CNN with pixel-level supervision.

Coverage: Genuine + tampered pairs

100 original/forged pairs in which the originals already contain similar-looking genuine objects, the toughest case for copy-move detection.

Columbia Uncompressed Image Splicing

180 spliced + 183 authentic images, stored without lossy compression.


2. Document / Legal PDF datasets

Tobacco-3482: Real scanned legal docs

3,482 real-world scanned documents from tobacco-litigation archives (letters, memos, forms, reports and similar); a clean baseline of "genuine" docs.

ICDAR Find-It: Document forgery challenge

Competition dataset of genuine and forged scanned receipts, released for the Find it! fraud detection contest.

DocVQA / RVL-CDIP: Real bank/govt docs

Large collections of real-world business documents; RVL-CDIP alone contains 400,000 scanned images across 16 document classes.

FUNSD: Form understanding

199 fully-annotated forms (good for layout-anomaly training).


3. Financial statements / receipts / cheques

Receipts Fraud Detection

500+ tampered and genuine receipt images.

Bank statements dataset

Realistic bank statement PDFs and images.

IDRBT / Indian bank cheques

Cheque images (Indian banking context).

SROIE: Scanned receipts

Receipt OCR + key-information extraction challenge.


4. Land records (India-specific)

There is no large public dataset for Indian land records, so you have three practical options:

  1. Synthesise. The notebook already includes a make_demo_pair() function that generates realistic land-record images and tampered copies. You can extend this to produce hundreds of synthetic examples in minutes.
  2. Use government open data. Some state portals publish anonymised RoR (Record of Rights) samples.
  3. Use Tobacco-3482 or DocVQA as a proxy for general scanned-document forensics. The same forensic signals (ELA, copy-move, font mix) transfer directly.

5. Kaggle CLI setup (one-time, free)

pip install kaggle
  1. Sign up at https://www.kaggle.com
  2. Open https://www.kaggle.com/settings → Create New API Token
  3. A file kaggle.json will download. Place it at:
    • Windows: C:\Users\<you>\.kaggle\kaggle.json
    • Linux/Mac: ~/.kaggle/kaggle.json
  4. On Linux/Mac: chmod 600 ~/.kaggle/kaggle.json

After that, all the kaggle datasets download … commands above will just work.
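
If a download fails with an authentication error, it can help to confirm that the token file from the steps above is where the CLI expects it. The sketch below is a hypothetical helper (`check_kaggle_token` is not part of the Kaggle package) and assumes the token is a JSON object with `username` and `key` fields, which is the format Kaggle currently issues.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(path=None):
    """Return a list of problems found with the kaggle.json token file."""
    token = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not token.is_file():
        return [f"{token} not found"]
    try:
        data = json.loads(token.read_text())
    except json.JSONDecodeError:
        return [f"{token} is not valid JSON"]
    if not isinstance(data, dict):
        return [f"{token} does not contain a JSON object"]
    problems = []
    # Assumed token format: {"username": "...", "key": "..."}
    for field in ("username", "key"):
        if not data.get(field):
            problems.append(f"missing '{field}' field")
    # Permission bits only matter on Linux/Mac (step 4 above).
    if os.name == "posix":
        mode = stat.S_IMODE(token.stat().st_mode)
        if mode & 0o077:
            problems.append(f"permissions are {oct(mode)}; run chmod 600")
    return problems


if __name__ == "__main__":
    issues = check_kaggle_token()
    print("Token looks fine" if not issues else "\n".join(issues))
```

An empty list means the file exists, parses, and (on Linux/Mac) is readable only by you; it does not verify that the key is still valid on Kaggle's side.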


6. Minimum data needed to train

The Random Forest in Section 7.5 of the notebook will give meaningful results with:

  • ~50 genuine + 50 tampered images: workable baseline
  • ~200 + 200: good results, ROC-AUC typically 0.85+
  • ~1000 + 1000 or more (e.g. full CASIA v2): production-grade results

For the optional CNN in Section 7.6, target at least 200 images per class.
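
To see which of these tiers your current data falls into, you can count the images in each class folder before training. The sketch below is only an illustration: `count_images` and `data_tier` are hypothetical helpers, the approximate figures above are treated as hard cutoffs, and the tier is decided by the smaller of the two classes.

```python
from pathlib import Path

# Extensions from the folder layout (.png/.jpg); .jpeg is added as an
# assumption. Extend this set if your dataset uses other formats.
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}

# Per-class thresholds mirroring the list above, largest first.
TIERS = [(1000, "production-grade"), (200, "good"), (50, "workable baseline")]


def count_images(folder):
    """Count image files directly inside folder (0 if it does not exist)."""
    folder = Path(folder)
    if not folder.is_dir():
        return 0
    return sum(1 for p in folder.iterdir() if p.suffix.lower() in IMAGE_EXTS)


def data_tier(n_genuine, n_tampered):
    """Map class counts to a tier label, using the smaller class."""
    smaller = min(n_genuine, n_tampered)
    for threshold, label in TIERS:
        if smaller >= threshold:
            return label
    return "too few images"


if __name__ == "__main__":
    genuine = count_images("data/images/originals")
    tampered = count_images("data/images/tampered")
    print(f"{genuine} genuine, {tampered} tampered: {data_tier(genuine, tampered)}")
```

Using the smaller class is a deliberately conservative choice, since a heavily imbalanced set tends to behave more like its minority class size suggests.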


7. Quick-start recipe (fastest path to working demo)

  1. pip install kaggle and set up the API token (Section 5 above)
  2. Download CASIA v2:
    kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \
        -p data/images --unzip
    
  3. Rename the extracted folders: Au → originals and Tp → tampered
  4. Open anomaly_detection_banking.ipynb and run all cells
  5. Section 7.5 will train automatically on the data you just placed
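
The rename in step 3 can also be scripted. The sketch below assumes the archive unpacks Au and Tp either directly into data/images or one level down inside a wrapper directory; that layout is an assumption, so check what the download actually produced. `rename_casia_folders` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path

# Source folder name -> name the notebook expects (from step 3 above).
RENAMES = {"Au": "originals", "Tp": "tampered"}


def rename_casia_folders(images_root="data/images"):
    """Move Au/Tp to originals/tampered; return the (src, dst) pairs moved."""
    root = Path(images_root)
    moved = []
    for old, new in RENAMES.items():
        # Look at the top level first, then one level down, in case the
        # archive unpacked into a wrapper directory (unverified assumption).
        candidates = [root / old] + sorted(root.glob(f"*/{old}"))
        src = next((c for c in candidates if c.is_dir()), None)
        dst = root / new
        if src is None or dst.exists():
            continue  # nothing to rename, or refuse to overwrite
        src.rename(dst)
        moved.append((str(src), str(dst)))
    return moved


if __name__ == "__main__":
    for src, dst in rename_casia_folders():
        print(f"{src} -> {dst}")
```

The function skips any target that already exists, so running it twice is harmless.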