Datasets for Document Anomaly / Forgery Detection
All datasets below are free. The only requirement is a free Kaggle account for the Kaggle-hosted ones. The notebook automatically picks up whatever you drop into the right folder; you do not need every dataset to get started.
Folder layout the notebook expects
data/
├── images/
│   ├── originals/   <-- genuine scans (.png/.jpg)
│   └── tampered/    <-- forged scans (.png/.jpg)
├── pdfs/
│   ├── originals/   <-- genuine legal PDFs
│   └── tampered/    <-- forged legal PDFs
└── statements/      <-- bank statements, ITRs, receipts (any format)
Run validate_data_layout() in the notebook to confirm everything is in place.
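The notebook's validate_data_layout() is not reproduced here. As a rough stand-in, the folder layout above could be created and inspected with a sketch like the following. The helper names ensure_layout and summarize_layout are hypothetical and are not part of the notebook; the folder list is copied from the tree above.

```python
from pathlib import Path

# Folders taken from the layout above, relative to the data root.
EXPECTED_DIRS = [
    "images/originals",
    "images/tampered",
    "pdfs/originals",
    "pdfs/tampered",
    "statements",
]


def ensure_layout(root="data"):
    """Create any missing folders from the expected layout (hypothetical helper)."""
    root = Path(root)
    for rel in EXPECTED_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)
    return root


def summarize_layout(root="data"):
    """Return {relative_folder: file_count}; a missing folder counts as 0."""
    root = Path(root)
    counts = {}
    for rel in EXPECTED_DIRS:
        folder = root / rel
        if folder.is_dir():
            # Count files recursively so nested dataset folders are included.
            counts[rel] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            counts[rel] = 0
    return counts


if __name__ == "__main__":
    ensure_layout("data")
    for rel, n in summarize_layout("data").items():
        status = "ok" if n else "empty"
        print(f"data/{rel}: {n} file(s) [{status}]")
```

Running this once before downloading anything gives you empty target folders for the download commands below; running it again afterwards shows which folders are still empty.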
1. Image tampering datasets
CASIA v2: Gold-standard image tampering benchmark
One of the most widely used datasets in splicing and copy-move detection research. ~12k images (roughly 7k genuine + 5k tampered).
- Source: https://www.kaggle.com/datasets/divg07/casia-20-image-tampering-detection-dataset
- Size: ~2 GB
- Download:
kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \
  -p data/images --unzip
- After download: rename Au/ → originals/ and Tp/ → tampered/
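The rename step can also be scripted. The sketch below assumes the extracted Au/ and Tp/ folders sit directly under data/images; some archives nest them one level deeper, in which case pass that nested path instead. The function name rename_casia_folders is hypothetical.

```python
from pathlib import Path

# CASIA v2 folder names mapped to the names the notebook expects.
RENAMES = {"Au": "originals", "Tp": "tampered"}


def rename_casia_folders(images_dir="data/images"):
    """Rename Au/ -> originals/ and Tp/ -> tampered/ (hypothetical helper).

    Returns the list of (old_path, new_path) pairs that were renamed.
    """
    images_dir = Path(images_dir)
    done = []
    for old_name, new_name in RENAMES.items():
        src = images_dir / old_name
        dst = images_dir / new_name
        if not src.is_dir():
            continue  # already renamed, or not extracted at this level
        if dst.is_dir() and not any(dst.iterdir()):
            dst.rmdir()  # drop an empty placeholder folder first
        elif dst.exists():
            raise FileExistsError(f"{dst} already exists; refusing to overwrite")
        src.rename(dst)
        done.append((src, dst))
    return done


if __name__ == "__main__":
    for old, new in rename_casia_folders("data/images"):
        print(f"renamed {old} -> {new}")
```

The function refuses to overwrite a non-empty target folder, so it is safe to run a second time.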
MICC-F220: Classic copy-move benchmark
220 images, perfect for testing copy-move detection.
- Source: http://lci.micc.unifi.it/labd/2015/01/copy-move-forgery-detection-and-localization/
- Size: ~50 MB
- Download: manual (form on the page)
CoMoFoD: Copy-move with ground-truth masks
260 image sets with masks. Ideal for training a CNN with pixel-level supervision.
- Source: https://www.vcl.fer.hr/comofod/
- Size: ~1 GB
- Download: manual
Coverage: Genuine + tampered pairs
100 pairs with similar-but-genuine objects (toughest case).
- Source: https://github.com/wenbihan/coverage
- Size: ~600 MB
- Download:
git clone https://github.com/wenbihan/coverage.git
Columbia Uncompressed Image Splicing
180 spliced + 183 authentic images, stored uncompressed.
- Source: https://www.ee.columbia.edu/ln/dvmm/downloads/AuthSplicedDataSet/AuthSplicedDataSet.htm
- Size: ~1 GB
- Download: manual (registration required, free)
2. Document / Legal PDF datasets
Tobacco-3482: Real scanned legal docs
3,482 real-world scanned legal documents (clean baseline of "genuine" docs).
- Source: https://www.kaggle.com/datasets/patrickaudriaz/tobacco3482jpg
- Size: ~200 MB
- Download:
kaggle datasets download -d patrickaudriaz/tobacco3482jpg \
  -p data/pdfs/originals --unzip
ICDAR Find-It: Document forgery challenge
Official competition dataset of genuine and forged receipt images (the "Find it!" fraud detection contest).
- Source: https://findit.univ-lr.fr/
- Size: ~500 MB
- Download: manual (registration required, free)
DocVQA / RVL-CDIP: Real bank/govt docs
Two large collections of real-world business documents.
- Source: https://www.docvqa.org/datasets and https://www.cs.cmu.edu/~aharley/rvl-cdip/
- Size: ~3 GB / 37 GB
- Use case: populate originals/ with realistic genuine documents
FUNSD β Form understanding
199 fully-annotated forms (good for layout-anomaly training).
- Source: https://guillaumejaume.github.io/FUNSD/
- Size: ~50 MB
3. Financial statements / receipts / cheques
Receipts Fraud Detection
500+ tampered and genuine receipt images.
- Source: https://www.kaggle.com/datasets/trainingdatapro/receipts-fraud-detection-dataset
- Size: ~100 MB
- Download:
kaggle datasets download -d trainingdatapro/receipts-fraud-detection-dataset \
  -p data/statements --unzip
Bank statements dataset
Realistic bank statement PDFs and images.
- Source: https://www.kaggle.com/datasets/dedeikhsandwisaputra/bank-statements-dataset
- Size: ~80 MB
- Download:
kaggle datasets download -d dedeikhsandwisaputra/bank-statements-dataset \
  -p data/statements --unzip
IDRBT / Indian bank cheques
Cheque images (Indian banking context).
- Source: https://www.kaggle.com/datasets/arsh1207/bank-cheque-image-dataset
- Size: ~50 MB
- Download:
kaggle datasets download -d arsh1207/bank-cheque-image-dataset \
  -p data/statements --unzip
SROIE β Scanned receipts
Receipt OCR + key-information extraction challenge.
- Source: https://rrc.cvc.uab.es/?ch=13
- Size: ~150 MB
4. Land records (India-specific)
There is no large public dataset for Indian land records, so you have three practical options:
- Synthesise. The notebook already includes a make_demo_pair() function that generates realistic land-record images and tampered copies. You can extend this to produce hundreds of synthetic examples in minutes.
- Use government open data. Some state portals publish anonymised RoR (Record of Rights) samples, e.g.:
  - Bhulekh portals (state-wise): https://bhulekh.gov.in/ (varies by state)
  - DigiLocker sample certificates: https://www.digilocker.gov.in/
- Use Tobacco-3482 or DocVQA as a proxy for general scanned-document forensics; the same forensic signals (ELA, copy-move, font mix) largely carry over.
5. Kaggle CLI setup (one-time, free)
pip install kaggle
- Sign up at https://www.kaggle.com
- Open https://www.kaggle.com/settings → Create New API Token
- A file kaggle.json will download. Place it at:
  - Windows: C:\Users\<you>\.kaggle\kaggle.json
  - Linux/Mac: ~/.kaggle/kaggle.json
- On Linux/Mac:
chmod 600 ~/.kaggle/kaggle.json
After that, all the kaggle datasets download … commands above will just work.
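If a download command fails with an authentication error, the token file is the usual suspect. The sketch below checks the points from the setup steps: the file exists at the default location, it parses as JSON with username and key fields, and on Linux/Mac it is not readable by other users. The function name check_kaggle_token is hypothetical; it is not part of the Kaggle CLI.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(path=None):
    """Return a list of problems found with kaggle.json (empty list = looks fine)."""
    path = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return [f"{path} not found"]
    try:
        token = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{path} does not contain a JSON object"]
    problems = []
    for key in ("username", "key"):
        if not token.get(key):
            problems.append(f"missing '{key}' field")
    # Permission bits are only meaningful on POSIX systems (Linux/Mac).
    if os.name == "posix":
        mode = stat.S_IMODE(path.stat().st_mode)
        if mode & 0o077:
            problems.append(f"permissions are {oct(mode)}; run chmod 600 {path}")
    return problems


if __name__ == "__main__":
    issues = check_kaggle_token()
    print("kaggle.json looks fine" if not issues else "\n".join(issues))
```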
6. Minimum data needed to train
The Random Forest in Section 7.5 of the notebook will give meaningful results with:
- ~50 genuine + 50 tampered images: workable baseline
- ~200 + 200: good results, ROC-AUC typically 0.85+
- ~1000 + 1000 or more (e.g. CASIA v2, which has several thousand images per class): production-grade results
For the optional CNN in Section 7.6, target at least 200 images per class.
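To see where your current data falls against these thresholds, you can count the images per class before training. This is a minimal sketch: the extension set is an assumption based on the folder layout (.png/.jpg, plus .jpeg) and should be extended if your dataset ships other formats, and the helper names count_images and data_tier are hypothetical.

```python
from pathlib import Path

# Assumed image extensions; extend if your dataset uses other formats.
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}


def count_images(folder):
    """Count image files under folder, recursively; a missing folder counts as 0."""
    folder = Path(folder)
    if not folder.is_dir():
        return 0
    return sum(
        1
        for p in folder.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTS
    )


def data_tier(n_genuine, n_tampered):
    """Map per-class counts to the rough tiers listed above, using the smaller class."""
    n = min(n_genuine, n_tampered)
    if n >= 1000:
        return "~1000 per class or more"
    if n >= 200:
        return "~200 per class"
    if n >= 50:
        return "~50 per class (workable baseline)"
    return "below the ~50 per class baseline"


if __name__ == "__main__":
    genuine = count_images("data/images/originals")
    tampered = count_images("data/images/tampered")
    print(f"genuine={genuine}, tampered={tampered}: {data_tier(genuine, tampered)}")
```

The tier is based on the smaller class because a large surplus in one class does not compensate for too few examples in the other.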
7. Quick-start recipe (fastest path to working demo)
- pip install kaggle and set up the API token (Section 5 above)
- Download CASIA v2:
kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \
  -p data/images --unzip
- Rename the extracted Au → originals and Tp → tampered
- Open anomaly_detection_banking.ipynb and run all cells
- Section 7.5 will train automatically on the data you just placed