# Datasets for Document Anomaly / Forgery Detection
All datasets below are **free**. The Kaggle-hosted ones require a free Kaggle account, and a few others require free registration on their own sites. The notebook automatically picks up whatever you drop into the right folder, so you do not need every dataset to get started.
## Folder layout the notebook expects
```
data/
├── images/
│   ├── originals/   <-- genuine scans (.png/.jpg)
│   └── tampered/    <-- forged scans (.png/.jpg)
├── pdfs/
│   ├── originals/   <-- genuine legal PDFs
│   └── tampered/    <-- forged legal PDFs
└── statements/      <-- bank statements, ITRs, receipts (any format)
```
Run `validate_data_layout()` in the notebook to confirm everything is in place.
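The notebook's `validate_data_layout()` is not reproduced here. As a rough stand-in for checking the layout outside the notebook, something along these lines should work. The name `check_layout`, the recursive file count, and the return shape are assumptions made for illustration, not the notebook's actual code.

```python
from pathlib import Path

# Folders from the layout above, relative to the data root.
EXPECTED_DIRS = [
    "images/originals",
    "images/tampered",
    "pdfs/originals",
    "pdfs/tampered",
    "statements",
]


def check_layout(root="data"):
    """Return {folder: file_count}; a missing folder maps to None."""
    report = {}
    for rel in EXPECTED_DIRS:
        folder = Path(root) / rel
        if folder.is_dir():
            # Count regular files at any depth, since archives often
            # unzip into nested subfolders.
            report[rel] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            report[rel] = None
    return report


if __name__ == "__main__":
    for rel, count in check_layout().items():
        status = "MISSING" if count is None else f"{count} files"
        print(f"{rel:<20} {status}")
```

An empty or missing folder is not necessarily an error, since the notebook only needs the datasets you actually intend to use.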
---
## 1. Image tampering datasets
### CASIA v2 - Gold-standard image tampering benchmark
The most-used dataset for splicing/copy-move detection research. ~12k images (7k genuine + 5k tampered).
- **Source:** https://www.kaggle.com/datasets/divg07/casia-20-image-tampering-detection-dataset
- **Size:** ~2 GB
- **Download:**
```bash
kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \
-p data/images --unzip
```
- **After download:** rename `Au/` → `originals/` and `Tp/` → `tampered/`
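The rename step can also be scripted. This is a minimal sketch that assumes the archive unpacks `Au/` and `Tp/` directly into `data/images`; if it unpacks into a nested folder, move the two folders up first. The helper name `rename_casia_folders` is hypothetical.

```python
from pathlib import Path

# CASIA v2 folder names -> names used in the folder layout.
RENAMES = {"Au": "originals", "Tp": "tampered"}


def rename_casia_folders(images_dir="data/images"):
    """Rename Au/ and Tp/ in place; return the list of renames performed."""
    base = Path(images_dir)
    done = []
    for old, new in RENAMES.items():
        src, dst = base / old, base / new
        if not src.is_dir():
            continue  # not extracted here, or already renamed
        if dst.exists():
            # Refuse to overwrite or silently merge an existing folder.
            raise FileExistsError(f"{dst} already exists; merge it by hand")
        src.rename(dst)
        done.append((old, new))
    return done
```

Calling `rename_casia_folders()` a second time is harmless: it finds nothing left to rename and returns an empty list.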
### MICC-F220 - Classic copy-move benchmark
220 images, perfect for testing copy-move detection.
- **Source:** http://lci.micc.unifi.it/labd/2015/01/copy-move-forgery-detection-and-localization/
- **Size:** ~50 MB
- **Download:** manual (form on the page)
### CoMoFoD - Copy-move with ground-truth masks
260 image sets with masks. Ideal for training a CNN with pixel-level supervision.
- **Source:** https://www.vcl.fer.hr/comofod/
- **Size:** ~1 GB
- **Download:** manual
### Coverage - Genuine + tampered pairs
100 pairs with similar-but-genuine objects (toughest case).
- **Source:** https://github.com/wenbihan/coverage
- **Size:** ~600 MB
- **Download:** `git clone https://github.com/wenbihan/coverage.git`
### Columbia Uncompressed Image Splicing
180 spliced + 183 authentic images, all uncompressed.
- **Source:** https://www.ee.columbia.edu/ln/dvmm/downloads/AuthSplicedDataSet/AuthSplicedDataSet.htm
- **Size:** ~1 GB
- **Download:** manual (registration required, free)
---
## 2. Document / Legal PDF datasets
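### Tobacco-3482 - Real scanned documents
3,482 real-world scanned documents drawn from tobacco-litigation archives (letters, memos, forms, and similar classes). Use them as a clean baseline of "genuine" docs. The Kaggle copy ships JPG images, not PDFs.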
- **Source:** https://www.kaggle.com/datasets/patrickaudriaz/tobacco3482jpg
- **Size:** ~200 MB
- **Download:**
```bash
kaggle datasets download -d patrickaudriaz/tobacco3482jpg \
-p data/pdfs/originals --unzip
```
### Find it! (ICPR 2018) - Document forgery challenge
Official competition dataset of genuine and forged scanned receipts, from the Find it! fraud detection contest held at ICPR 2018.
- **Source:** https://findit.univ-lr.fr/
- **Size:** ~500 MB
- **Download:** manual (registration required, free)
### DocVQA / RVL-CDIP - Real bank/govt docs
Massive dataset of real-world business documents.
- **Source:** https://www.docvqa.org/datasets and https://www.cs.cmu.edu/~aharley/rvl-cdip/
- **Size:** ~3 GB / 37 GB
- **Use case:** populate `originals/` with realistic genuine documents
### FUNSD - Form understanding
199 fully-annotated forms (good for layout-anomaly training).
- **Source:** https://guillaumejaume.github.io/FUNSD/
- **Size:** ~50 MB
---
## 3. Financial statements / receipts / cheques
### Receipts Fraud Detection
500+ tampered and genuine receipt images.
- **Source:** https://www.kaggle.com/datasets/trainingdatapro/receipts-fraud-detection-dataset
- **Size:** ~100 MB
- **Download:**
```bash
kaggle datasets download -d trainingdatapro/receipts-fraud-detection-dataset \
-p data/statements --unzip
```
### Bank statements dataset
Realistic bank statement PDFs and images.
- **Source:** https://www.kaggle.com/datasets/dedeikhsandwisaputra/bank-statements-dataset
- **Size:** ~80 MB
- **Download:**
```bash
kaggle datasets download -d dedeikhsandwisaputra/bank-statements-dataset \
-p data/statements --unzip
```
### IDRBT / Indian bank cheques
Cheque images (Indian banking context).
- **Source:** https://www.kaggle.com/datasets/arsh1207/bank-cheque-image-dataset
- **Size:** ~50 MB
- **Download:**
```bash
kaggle datasets download -d arsh1207/bank-cheque-image-dataset \
-p data/statements --unzip
```
### SROIE - Scanned receipts
Receipt OCR + key-information extraction challenge.
- **Source:** https://rrc.cvc.uab.es/?ch=13
- **Size:** ~150 MB
---
## 4. Land records (India-specific)
There is no large public dataset for Indian land records, so you have three practical options:
1. **Synthesise.** The notebook already includes a `make_demo_pair()` function that generates realistic land-record images and tampered copies. You can extend this to produce hundreds of synthetic examples in minutes.
2. **Use government open data.** Some state portals publish anonymised RoR (Record of Rights) samples, e.g.:
   - Bhulekh portals (state-wise): https://bhulekh.gov.in/ (varies by state)
   - DigiLocker sample certificates: https://www.digilocker.gov.in/
3. **Use Tobacco-3482 or DocVQA as a proxy** for general scanned-document forensics; the same forensic signals (ELA, copy-move, font mix) generally transfer.
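To make option 1 concrete, here is a toy sketch of how a genuine/tampered pair with a ground-truth mask can be synthesised. It is not the notebook's `make_demo_pair()`: the functions `make_blank_page` and `copy_move` are hypothetical, and the "page" is a plain nested list of grey values rather than a real image, so that the example needs only the standard library.

```python
import random


def make_blank_page(width=64, height=48, seed=0):
    """Toy grayscale 'scan': white background with darker 'text' rows."""
    rng = random.Random(seed)
    page = [[255] * width for _ in range(height)]
    for y in range(4, height - 4, 6):        # one text line every 6 rows
        for x in range(4, width - 4):
            if rng.random() < 0.6:
                page[y][x] = rng.randint(0, 80)
    return page


def copy_move(page, src, dst, size):
    """Copy a patch from src to dst; return (tampered, mask)."""
    (sx, sy), (dx, dy), (w, h) = src, dst, size
    tampered = [row[:] for row in page]          # leave the original untouched
    mask = [[0] * len(page[0]) for _ in page]    # 1 marks forged pixels
    for j in range(h):
        for i in range(w):
            tampered[dy + j][dx + i] = page[sy + j][sx + i]
            mask[dy + j][dx + i] = 1
    return tampered, mask


original = make_blank_page()
# Paste 16 pixels of the first text line over part of a later text line.
tampered, mask = copy_move(original, src=(4, 4), dst=(30, 22), size=(16, 1))
```

Varying `seed`, `src`, `dst`, and `size` in a loop gives as many labelled pairs as needed; a real generator would render actual text and save image files instead.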
---
## 5. Kaggle CLI setup (one-time, free)
```bash
pip install kaggle
```
1. Sign up at https://www.kaggle.com
2. Open https://www.kaggle.com/settings → **Create New API Token**
3. A file `kaggle.json` will download. Place it at:
   - Windows: `C:\Users\<you>\.kaggle\kaggle.json`
   - Linux/Mac: `~/.kaggle/kaggle.json`
4. On Linux/Mac: `chmod 600 ~/.kaggle/kaggle.json`
After that, all the `kaggle datasets download ...` commands above will just work.
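If a download fails with an authentication error, a quick check of the token file can narrow the problem down. The sketch below assumes the token is a JSON object with `username` and `key` fields; the helper `check_kaggle_token` is hypothetical and not part of the Kaggle CLI.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(path=None):
    """Return a list of problems with the token file (empty = looks fine)."""
    token = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not token.is_file():
        return [f"{token} not found"]
    try:
        creds = json.loads(token.read_text())
    except json.JSONDecodeError:
        return [f"{token} is not valid JSON"]
    if not isinstance(creds, dict):
        return [f"{token} does not contain a JSON object"]
    # Assumed field names for the downloaded token file.
    problems = [f"missing field: {k}" for k in ("username", "key") if k not in creds]
    if os.name == "posix":
        # Any group/other permission bit means the file is not mode 600.
        mode = stat.S_IMODE(token.stat().st_mode)
        if mode & 0o077:
            problems.append(f"permissions are {oct(mode)}; run chmod 600 {token}")
    return problems
```

Run `print(check_kaggle_token())` to check the default location; an empty list means the file is present, parseable, and private.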
---
## 6. Minimum data needed to train
The Random Forest in Section 7.5 of the notebook will give meaningful results with:
- **~50 genuine + 50 tampered images**: workable baseline
- **~200 + 200**: good results, ROC-AUC typically 0.85+
- **~1000 + 1000 or more** (e.g. full CASIA v2): production-grade results
For the optional CNN in Section 7.6, target at least 200 images per class.
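These thresholds are easy to encode as a quick sanity check before training. The sketch below is illustrative only: the function `data_tier` and the tier labels are hypothetical, and it assumes the smaller of the two classes is what limits you.

```python
def data_tier(n_genuine, n_tampered):
    """Map per-class image counts to (tier, cnn_ok) using the figures above."""
    n = min(n_genuine, n_tampered)   # assumption: the smaller class is the bottleneck
    if n >= 1000:
        tier = "production-grade"
    elif n >= 200:
        tier = "good"
    elif n >= 50:
        tier = "baseline"
    else:
        tier = "too few"
    cnn_ok = n >= 200                # CNN target: at least 200 images per class
    return tier, cnn_ok
```

For example, `data_tier(220, 180)` reports a baseline tier with too little data for the CNN, because the tampered class falls short of 200.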
---
## 7. Quick-start recipe (fastest path to working demo)
1. `pip install kaggle` and set up the API token (Section 5 above)
2. Download CASIA v2:
```bash
kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \
-p data/images --unzip
```
3. Rename the extracted `Au` → `originals` and `Tp` → `tampered`
4. Open `anomaly_detection_banking.ipynb` and run all cells
5. Section 7.5 will train automatically on the data you just placed