# Datasets for Document Anomaly / Forgery Detection

All datasets below are **free**. The Kaggle-hosted ones need only a free Kaggle account; a few others require free registration or a download form. The notebook auto-picks up whatever you drop into the right folder, so you do not need every dataset to get started.

## Folder layout the notebook expects

```
data/
├── images/
│   ├── originals/   <-- genuine scans (.png/.jpg)
│   └── tampered/    <-- forged scans (.png/.jpg)
├── pdfs/
│   ├── originals/   <-- genuine legal PDFs
│   └── tampered/    <-- forged legal PDFs
└── statements/      <-- bank statements, ITRs, receipts (any format)
```

Run `validate_data_layout()` in the notebook to confirm everything is in place.

---

## 1. Image tampering datasets

### CASIA v2 — Gold-standard image tampering benchmark

One of the most widely used datasets for splicing and copy-move detection research. ~12k images (7k genuine + 5k tampered).

- **Source:** https://www.kaggle.com/datasets/divg07/casia-20-image-tampering-detection-dataset
- **Size:** ~2 GB
- **Download:**
  ```bash
  kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \
      -p data/images --unzip
  ```
- **After download:** rename `Au/` → `originals/` and `Tp/` → `tampered/`

### MICC-F220 — Classic copy-move benchmark

220 images (110 genuine + 110 tampered), well suited to testing copy-move detection.

- **Source:** http://lci.micc.unifi.it/labd/2015/01/copy-move-forgery-detection-and-localization/
- **Size:** ~50 MB
- **Download:** manual (form on the page)

### CoMoFoD — Copy-move with ground-truth masks

260 image sets with ground-truth masks, useful for training a CNN with pixel-level supervision.

- **Source:** https://www.vcl.fer.hr/comofod/
- **Size:** ~1 GB
- **Download:** manual

### Coverage — Genuine + tampered pairs

100 original/forged pairs in which the originals already contain similar-but-genuine objects, the hardest case for copy-move detectors.

- **Source:** https://github.com/wenbihan/coverage
- **Size:** ~600 MB
- **Download:** `git clone https://github.com/wenbihan/coverage.git`

### Columbia Uncompressed Image Splicing

180 spliced + 183 authentic images, stored uncompressed.

- **Source:** https://www.ee.columbia.edu/ln/dvmm/downloads/AuthSplicedDataSet/AuthSplicedDataSet.htm
- **Size:** ~1 GB
- **Download:** manual (registration required, free)

---

## 2. Document / Legal PDF datasets

### Tobacco-3482 — Real scanned legal docs

3,482 real-world scanned documents (letters, memos, forms, reports and similar) released through tobacco-industry litigation. They serve as a clean baseline of "genuine" docs. Judging by its name, this Kaggle copy ships JPG images rather than PDFs.

- **Source:** https://www.kaggle.com/datasets/patrickaudriaz/tobacco3482jpg
- **Size:** ~200 MB
- **Download:**
  ```bash
  kaggle datasets download -d patrickaudriaz/tobacco3482jpg \
      -p data/pdfs/originals --unzip
  ```

### Find it! — Document forgery challenge

Competition dataset of genuine and forged receipts from the Find it! fraud detection contest (ICPR 2018).

- **Source:** https://findit.univ-lr.fr/
- **Size:** ~500 MB
- **Download:** manual (registration required, free)

### DocVQA / RVL-CDIP — Real-world business docs

Two large collections of real-world scanned business documents.

- **Source:** https://www.docvqa.org/datasets and https://www.cs.cmu.edu/~aharley/rvl-cdip/
- **Size:** ~3 GB (DocVQA) / ~37 GB (RVL-CDIP)
- **Use case:** populate `originals/` with realistic genuine documents

### FUNSD — Form understanding

199 fully annotated forms (good for layout-anomaly training).

- **Source:** https://guillaumejaume.github.io/FUNSD/
- **Size:** ~50 MB

---

## 3. Financial statements / receipts / cheques

### Receipts Fraud Detection

500+ tampered and genuine receipt images.

- **Source:** https://www.kaggle.com/datasets/trainingdatapro/receipts-fraud-detection-dataset
- **Size:** ~100 MB
- **Download:**
  ```bash
  kaggle datasets download -d trainingdatapro/receipts-fraud-detection-dataset \
      -p data/statements --unzip
  ```

### Bank statements dataset

Realistic bank statement PDFs and images.

- **Source:** https://www.kaggle.com/datasets/dedeikhsandwisaputra/bank-statements-dataset
- **Size:** ~80 MB
- **Download:**
  ```bash
  kaggle datasets download -d dedeikhsandwisaputra/bank-statements-dataset \
      -p data/statements --unzip
  ```

### IDRBT / Indian bank cheques

Cheque images (Indian banking context).

- **Source:** https://www.kaggle.com/datasets/arsh1207/bank-cheque-image-dataset
- **Size:** ~50 MB
- **Download:**
  ```bash
  kaggle datasets download -d arsh1207/bank-cheque-image-dataset \
      -p data/statements --unzip
  ```

### SROIE — Scanned receipts

Receipt OCR + key-information extraction challenge.

- **Source:** https://rrc.cvc.uab.es/?ch=13
- **Size:** ~150 MB

---

## 4. Land records (India-specific)

There is no large public dataset for Indian land records, so you have three practical options:

1. **Synthesise.** The notebook already includes a `make_demo_pair()` function that generates realistic land-record images and tampered copies. You can extend this to produce hundreds of synthetic examples in minutes.
2. **Use government open data.** Some state portals publish anonymised RoR (Record of Rights) samples, e.g.:
   - Bhulekh portals (state-wise): https://bhulekh.gov.in/ (varies by state)
   - DigiLocker sample certificates: https://www.digilocker.gov.in/
3. **Use Tobacco-3482 or DocVQA as proxy** for general scanned-document forensics. The same forensic signals (error level analysis (ELA), copy-move, font mix) should largely carry over to land records.

---

## 5. Kaggle CLI setup (one-time, free)

```bash
pip install kaggle
```

1. Sign up at https://www.kaggle.com
2. Open https://www.kaggle.com/settings → **Create New API Token**
3. A file `kaggle.json` will download. Place it at:
   - Windows: `C:\Users\<username>\.kaggle\kaggle.json`
   - Linux/Mac: `~/.kaggle/kaggle.json`
4. On Linux/Mac: `chmod 600 ~/.kaggle/kaggle.json`

After that, all the `kaggle datasets download …` commands above should work without further setup.

---

## 6. Minimum data needed to train

The Random Forest in Section 7.5 of the notebook will give meaningful results with:

- **~50 genuine + 50 tampered images** — workable baseline
- **~200 + 200** — good results, ROC-AUC typically 0.85+
- **1000+ per class** (e.g. full CASIA v2) — production-grade results

For the optional CNN in Section 7.6, target at least 200 images per class.

---

## 7. Quick-start recipe (fastest path to working demo)

1. `pip install kaggle` and set up the API token (Section 5 above)
2. Download CASIA v2:
   ```bash
   kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \
       -p data/images --unzip
   ```
3. Rename the extracted `Au` → `originals` and `Tp` → `tampered`
4. Open `anomaly_detection_banking.ipynb` and run all cells
5. Section 7.5 will train automatically on the data you just placed
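
The rename in step 3 and a quick per-class count (to compare against the thresholds in Section 6) can also be scripted. The following is a minimal sketch, not the notebook's own `validate_data_layout()`: the helper names `rename_casia_folders` and `count_images` are hypothetical, and it assumes that `Au/` and `Tp/` end up directly under `data/images/` after unzipping (the archive may nest them one level deeper, in which case move them first). The extension set follows the folder-layout comment above and may need extending for other formats.

```python
from pathlib import Path

# Hypothetical helpers for illustration; these names are not from the notebook.
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}  # assumption: formats the notebook reads
CASIA_RENAMES = {"Au": "originals", "Tp": "tampered"}


def rename_casia_folders(images_dir):
    """Rename Au/ -> originals/ and Tp/ -> tampered/ where that is safe."""
    images_dir = Path(images_dir)
    renamed = []
    for old_name, new_name in CASIA_RENAMES.items():
        src = images_dir / old_name
        dst = images_dir / new_name
        # Skip when the source is missing or the target already exists,
        # so re-running the script never overwrites existing data.
        if src.is_dir() and not dst.exists():
            src.rename(dst)
            renamed.append(new_name)
    return renamed


def count_images(images_dir):
    """Count image files directly inside each class folder (non-recursive)."""
    images_dir = Path(images_dir)
    counts = {}
    for name in CASIA_RENAMES.values():
        folder = images_dir / name
        if folder.is_dir():
            counts[name] = sum(
                1
                for p in folder.iterdir()
                if p.is_file() and p.suffix.lower() in IMAGE_EXTS
            )
        else:
            counts[name] = 0
    return counts
```

Calling `rename_casia_folders("data/images")` followed by `count_images("data/images")` from the project root returns a dictionary with one count per class, which you can check against the 50 / 200 / 1000 guidance in Section 6 before running the notebook.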