Datasets for Document Anomaly / Forgery Detection

All datasets below are free. The Kaggle-hosted ones need a free Kaggle account, and a few of the others require free registration or a manual download. The notebook automatically picks up whatever you drop into the right folder, so you do not need every dataset to get started.

Folder layout the notebook expects

data/
├── images/
│   ├── originals/        <-- genuine scans (.png/.jpg)
│   └── tampered/         <-- forged scans (.png/.jpg)
├── pdfs/
│   ├── originals/        <-- genuine legal PDFs
│   └── tampered/         <-- forged legal PDFs
└── statements/           <-- bank statements, ITRs, receipts (any format)

Run validate_data_layout() in the notebook to confirm everything is in place.
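If you want to check the folders before opening the notebook, a minimal standalone sketch might look like the following. It is not the notebook's validate_data_layout(); the name check_layout and its return format are hypothetical, and only the folder names come from the layout above.

```python
from pathlib import Path

# Folder names copied from the layout above. check_layout() is a hypothetical
# stand-in, not the notebook's validate_data_layout().
EXPECTED_DIRS = [
    "images/originals",
    "images/tampered",
    "pdfs/originals",
    "pdfs/tampered",
    "statements",
]


def check_layout(root="data"):
    """Map each expected folder to its file count, or None if the folder is missing."""
    report = {}
    for rel in EXPECTED_DIRS:
        folder = Path(root) / rel
        if folder.is_dir():
            # Count regular files at any depth below the folder.
            report[rel] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            report[rel] = None
    return report


if __name__ == "__main__":
    for rel, count in check_layout().items():
        status = "missing" if count is None else f"{count} file(s)"
        print(f"data/{rel}: {status}")
```

A folder reported as missing or empty is not necessarily an error, since the notebook only needs the datasets you actually plan to use.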

1. Image tampering datasets

CASIA v2 — Gold-standard image tampering benchmark

One of the most widely used benchmarks for splicing and copy-move detection research. About 12,600 images (7,491 authentic + 5,123 tampered).

Source: https://www.kaggle.com/datasets/divg07/casia-20-image-tampering-detection-dataset
Size: ~2 GB

Download:

kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \
    -p data/images --unzip

After download: rename Au/ → originals/ and Tp/ → tampered/
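The rename can also be scripted. The sketch below assumes the archive may unpack Au/ and Tp/ inside a top-level folder, so it searches for them recursively; the function name rename_casia_dirs is hypothetical and the exact archive structure is an assumption worth checking after extraction.

```python
import shutil
from pathlib import Path

# Mapping taken from the rename step above.
RENAMES = {"Au": "originals", "Tp": "tampered"}


def rename_casia_dirs(images_root="data/images"):
    """Move Au/ and Tp/ (wherever they were extracted) to originals/ and tampered/."""
    root = Path(images_root)
    if not root.is_dir():
        return []
    moved = []
    for old_name, new_name in RENAMES.items():
        target = root / new_name
        if target.exists():
            continue  # never overwrite a folder that is already in place
        # Assumption: the archive may nest Au/ and Tp/, so search at any depth.
        matches = sorted(p for p in root.rglob(old_name) if p.is_dir())
        if matches:
            shutil.move(str(matches[0]), str(target))
            moved.append(new_name)
    return moved


if __name__ == "__main__":
    print("moved:", rename_casia_dirs())
```

Running it twice is harmless: once originals/ and tampered/ exist, the function leaves them alone.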

MICC-F220 — Classic copy-move benchmark

220 images (110 tampered + 110 original), small enough for quick copy-move detection tests.

Source: http://lci.micc.unifi.it/labd/2015/01/copy-move-forgery-detection-and-localization/
Size: ~50 MB
Download: manual (form on the page)

CoMoFoD — Copy-move with ground-truth masks

260 image sets with masks. Ideal for training a CNN with pixel-level supervision.

Source: https://www.vcl.fer.hr/comofod/
Size: ~1 GB
Download: manual

Coverage — Genuine + tampered pairs

100 pairs with similar-but-genuine objects (toughest case).

Source: https://github.com/wenbihan/coverage
Size: ~600 MB
Download: git clone https://github.com/wenbihan/coverage.git

Columbia Uncompressed Image Splicing

180 spliced + 183 authentic images, stored uncompressed.

Source: https://www.ee.columbia.edu/ln/dvmm/downloads/AuthSplicedDataSet/AuthSplicedDataSet.htm
Size: ~1 GB
Download: manual (registration required, free)

2. Document / Legal PDF datasets

Tobacco-3482 — Real scanned legal docs

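3,482 real-world scanned document images (letters, memos, forms, reports and similar) drawn from tobacco-litigation archives, a clean baseline of "genuine" docs. The Kaggle copy ships as JPG images rather than PDFs.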

Source: https://www.kaggle.com/datasets/patrickaudriaz/tobacco3482jpg
Size: ~200 MB

Download:

kaggle datasets download -d patrickaudriaz/tobacco3482jpg \
    -p data/pdfs/originals --unzip
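Tobacco-3482 is organised into one subfolder per document class. If the notebook only reads files directly inside originals/ (an assumption worth checking against the notebook), a small flattening step such as this hypothetical flatten_into() helper can move everything up one level while keeping the class name in the file name.

```python
import shutil
from pathlib import Path


def flatten_into(folder):
    """Move files from nested subfolders up into `folder`, prefixing the parent name."""
    folder = Path(folder)
    # Build the list first so moving files does not disturb the traversal.
    nested = [p for p in folder.rglob("*") if p.is_file() and p.parent != folder]
    moved = 0
    for src in nested:
        # e.g. Letter/0001.jpg becomes Letter_0001.jpg, keeping the class visible
        dest = folder / f"{src.parent.name}_{src.name}"
        if dest.exists():
            continue  # skip rather than overwrite on a name clash
        shutil.move(str(src), str(dest))
        moved += 1
    return moved


if __name__ == "__main__":
    print("files moved:", flatten_into("data/pdfs/originals"))
```

The emptied class folders are left in place; delete them by hand if they get in the way.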

ICDAR Find-It — Document forgery challenge

Competition dataset of genuine and forged scanned receipts from the Find it! fraud-detection contest, held at ICPR 2018.

Source: https://findit.univ-lr.fr/
Size: ~500 MB
Download: manual (registration required, free)

DocVQA / RVL-CDIP — Real bank/govt docs

Two large collections of real-world business documents. RVL-CDIP alone holds 400,000 grayscale scans in 16 classes.

Source: https://www.docvqa.org/datasets and https://www.cs.cmu.edu/~aharley/rvl-cdip/
Size: ~3 GB (DocVQA) / ~37 GB (RVL-CDIP)
Use case: populate originals/ with realistic genuine documents

FUNSD — Form understanding

199 fully-annotated forms (good for layout-anomaly training).

Source: https://guillaumejaume.github.io/FUNSD/
Size: ~50 MB

3. Financial statements / receipts / cheques

Receipts Fraud Detection

500+ tampered and genuine receipt images.

Source: https://www.kaggle.com/datasets/trainingdatapro/receipts-fraud-detection-dataset
Size: ~100 MB

Download:

kaggle datasets download -d trainingdatapro/receipts-fraud-detection-dataset \
    -p data/statements --unzip

Bank statements dataset

Realistic bank statement PDFs and images.

Source: https://www.kaggle.com/datasets/dedeikhsandwisaputra/bank-statements-dataset
Size: ~80 MB

Download:

kaggle datasets download -d dedeikhsandwisaputra/bank-statements-dataset \
    -p data/statements --unzip

IDRBT / Indian bank cheques

Cheque images (Indian banking context).

Source: https://www.kaggle.com/datasets/arsh1207/bank-cheque-image-dataset
Size: ~50 MB

Download:

kaggle datasets download -d arsh1207/bank-cheque-image-dataset \
    -p data/statements --unzip

SROIE — Scanned receipts

Dataset from the ICDAR 2019 challenge on scanned-receipt OCR and key-information extraction.

Source: https://rrc.cvc.uab.es/?ch=13
Size: ~150 MB

4. Land records (India-specific)

There is no large public dataset for Indian land records, so you have three practical options:

1. Synthesise. The notebook already includes a make_demo_pair() function that generates realistic land-record images and tampered copies. You can extend this to produce hundreds of synthetic examples in minutes.
2. Use government open data. Some state portals publish anonymised RoR (Record of Rights) samples, e.g.:
   - Bhulekh portals (state-wise): https://bhulekh.gov.in/ (varies by state)
   - DigiLocker sample certificates: https://www.digilocker.gov.in/
3. Use Tobacco-3482 or DocVQA as a proxy for general scanned-document forensics. The same forensic signals (error level analysis (ELA), copy-move, font mixing) should largely carry over.

5. Kaggle CLI setup (one-time, free)

pip install kaggle

1. Sign up at https://www.kaggle.com
2. Open https://www.kaggle.com/settings → Create New API Token
3. A file kaggle.json will download. Place it at:
   - Windows: C:\Users\<you>\.kaggle\kaggle.json
   - Linux/Mac: ~/.kaggle/kaggle.json
4. On Linux/Mac: chmod 600 ~/.kaggle/kaggle.json
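To confirm the token is where the CLI expects it, a quick check along these lines may help. It is a sketch only: check_kaggle_token is a hypothetical helper, and it assumes the classic kaggle.json format with "username" and "key" fields.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(path=None):
    """Return a list of problems with kaggle.json (an empty list means it looks fine)."""
    token = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not token.is_file():
        return [f"{token} not found"]
    try:
        data = json.loads(token.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        return [f"{token} is not readable JSON: {exc}"]
    problems = []
    # Assumption: the token file holds "username" and "key" entries.
    for field in ("username", "key"):
        if not isinstance(data, dict) or not data.get(field):
            problems.append(f"missing '{field}' field")
    if os.name == "posix":
        mode = stat.S_IMODE(token.stat().st_mode)
        if mode & 0o077:  # readable or writable by group/others
            problems.append(f"permissions are {oct(mode)}; run chmod 600 on the file")
    return problems


if __name__ == "__main__":
    issues = check_kaggle_token()
    print("token looks fine" if not issues else "\n".join(issues))
```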

After that, all the kaggle datasets download … commands above will just work.
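If you would rather fetch every Kaggle-hosted dataset in one go, the commands from this document can be collected in a short script. The sketch below defaults to a dry run that only prints each command; the helper names build_command and download_all are hypothetical, while the dataset slugs, target folders and CLI flags are copied from the commands above.

```python
import shlex
import subprocess

# Dataset slugs and destination folders copied from the commands in this document.
KAGGLE_DATASETS = {
    "divg07/casia-20-image-tampering-detection-dataset": "data/images",
    "patrickaudriaz/tobacco3482jpg": "data/pdfs/originals",
    "trainingdatapro/receipts-fraud-detection-dataset": "data/statements",
    "dedeikhsandwisaputra/bank-statements-dataset": "data/statements",
    "arsh1207/bank-cheque-image-dataset": "data/statements",
}


def build_command(slug, dest):
    """Build the same CLI call shown above, as an argument list."""
    return ["kaggle", "datasets", "download", "-d", slug, "-p", dest, "--unzip"]


def download_all(datasets=KAGGLE_DATASETS, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for slug, dest in datasets.items():
        cmd = build_command(slug, dest)
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises if the CLI exits non-zero


if __name__ == "__main__":
    download_all(dry_run=True)
```

Call download_all(dry_run=False) once the printed commands look right. CASIA v2 still needs the Au/Tp rename afterwards.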

6. Minimum data needed to train

The Random Forest in Section 7.5 of the notebook will give meaningful results with:

- ~50 genuine + 50 tampered images: workable baseline
- ~200 + 200: good results, ROC-AUC typically 0.85+
- ~1000 + 1000 (e.g. full CASIA v2): production-grade results

For the optional CNN in Section 7.6, target at least 200 images per class.
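To see which of these tiers your current data reaches, you could count the images per class before training. This is a rough sketch: the thresholds come from the list above, but the helper names, the tier labels, and the inclusion of the .jpeg suffix are assumptions, and the smaller class is used because it limits what the model can learn.

```python
from pathlib import Path

# .png/.jpg come from the folder layout; .jpeg is an added assumption.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}
# Images-per-class thresholds from the list above, largest first.
TIERS = [(1000, "large"), (200, "good"), (50, "baseline")]


def count_images(folder):
    """Count image files at any depth below `folder` (0 if the folder is missing)."""
    folder = Path(folder)
    if not folder.is_dir():
        return 0
    return sum(
        1 for p in folder.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def tier_for(n_genuine, n_tampered):
    """Pick the tier reached by the smaller of the two classes."""
    smaller = min(n_genuine, n_tampered)
    for threshold, label in TIERS:
        if smaller >= threshold:
            return label
    return "below minimum"


if __name__ == "__main__":
    genuine = count_images("data/images/originals")
    tampered = count_images("data/images/tampered")
    print(f"{genuine} genuine, {tampered} tampered -> {tier_for(genuine, tampered)}")
```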

7. Quick-start recipe (fastest path to working demo)

1. pip install kaggle and set up the API token (Section 5 above)
2. Download CASIA v2:

kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \
    -p data/images --unzip

3. Rename the extracted Au → originals and Tp → tampered
4. Open anomaly_detection_banking.ipynb and run all cells
5. Section 7.5 will train automatically on the data you just placed