DocSentry / DATASETS.md

SpandanM110

DocSentry - bank document forensics with 4 tabs

05b69f8 14 days ago

	# Datasets for Document Anomaly / Forgery Detection

	All datasets below are free. The Kaggle-hosted ones need a free Kaggle account, and a few others ask for free registration or a request form. The notebook automatically picks up whatever you drop into the right folder, so you do not need every dataset to get started.

	## Folder layout the notebook expects

	```
	data/
	├── images/
	│ ├── originals/ <-- genuine scans (.png/.jpg)
	│ └── tampered/ <-- forged scans (.png/.jpg)
	├── pdfs/
	│ ├── originals/ <-- genuine legal PDFs
	│ └── tampered/ <-- forged legal PDFs
	└── statements/ <-- bank statements, ITRs, receipts (any format)
	```

	Run `validate_data_layout()` in the notebook to confirm everything is in place.
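
The notebook's `validate_data_layout()` is not reproduced here. As a rough idea of what such a check involves, the sketch below walks the folder tree shown above and counts the files in each expected folder. The helper name `check_layout` and its return format are assumptions made for illustration, not the notebook's API.

```python
from pathlib import Path

# Folders from the layout above, relative to the data root.
EXPECTED_DIRS = [
    "images/originals",
    "images/tampered",
    "pdfs/originals",
    "pdfs/tampered",
    "statements",
]


def check_layout(root="data"):
    """Return {folder: file_count}, with None for folders that are missing.

    Hypothetical helper for illustration; the notebook's own
    validate_data_layout() may check different things.
    """
    report = {}
    for rel in EXPECTED_DIRS:
        folder = Path(root) / rel
        if folder.is_dir():
            # Count regular files only, at any depth below the folder.
            report[rel] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            report[rel] = None
    return report


if __name__ == "__main__":
    for rel, count in check_layout().items():
        status = "MISSING" if count is None else f"{count} file(s)"
        print(f"data/{rel}: {status}")
```

A folder reported with 0 files exists but is empty, which is fine for datasets you have chosen to skip.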

	---

	## 1. Image tampering datasets

	### CASIA v2 — Gold-standard image tampering benchmark
	One of the most widely used datasets in splicing and copy-move detection research. ~12k images (7k genuine + 5k tampered).

	- Source: https://www.kaggle.com/datasets/divg07/casia-20-image-tampering-detection-dataset
	- Size: ~2 GB
	- Download:
	```bash
	kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \
	-p data/images --unzip
	```
	- After download: rename `Au/` → `originals/` and `Tp/` → `tampered/`
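
The rename step can also be scripted. This is a minimal sketch that assumes the archive unpacked `Au/` and `Tp/` directly under `data/images/`; if they ended up inside a subfolder, move them up first. The function name `rename_casia_folders` is made up for this example.

```python
from pathlib import Path

# CASIA v2 folder names mapped to the names the notebook expects.
RENAMES = {"Au": "originals", "Tp": "tampered"}


def rename_casia_folders(images_dir="data/images"):
    """Rename Au/ and Tp/ under images_dir; return the names that were renamed.

    Assumes Au/ and Tp/ sit directly under images_dir. A folder is skipped
    if it is absent or if the target name already exists.
    """
    base = Path(images_dir)
    renamed = []
    for old, new in RENAMES.items():
        src, dst = base / old, base / new
        if src.is_dir() and not dst.exists():
            src.rename(dst)
            renamed.append(old)
    return renamed


if __name__ == "__main__":
    print("Renamed:", rename_casia_folders())
```

Running it a second time is harmless: nothing is left to rename, so it returns an empty list.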

	### MICC-F220 — Classic copy-move benchmark
	220 images (110 tampered + 110 original). A small, quick test set for copy-move detection.

	- Source: http://lci.micc.unifi.it/labd/2015/01/copy-move-forgery-detection-and-localization/
	- Size: ~50 MB
	- Download: manual (form on the page)

	### CoMoFoD — Copy-move with ground-truth masks
	260 image sets with masks. Ideal for training a CNN with pixel-level supervision.

	- Source: https://www.vcl.fer.hr/comofod/
	- Size: ~1 GB
	- Download: manual

	### Coverage — Genuine + tampered pairs
	100 original/forged pairs in which the scene contains similar-but-genuine objects, the toughest case for copy-move detectors.

	- Source: https://github.com/wenbihan/coverage
	- Size: ~600 MB
	- Download: `git clone https://github.com/wenbihan/coverage.git`

	### Columbia Uncompressed Image Splicing
	180 spliced + 183 authentic images, stored uncompressed.

	- Source: https://www.ee.columbia.edu/ln/dvmm/downloads/AuthSplicedDataSet/AuthSplicedDataSet.htm
	- Size: ~1 GB
	- Download: manual (registration required, free)

	---

	## 2. Document / Legal PDF datasets

	### Tobacco-3482 — Real scanned legal docs
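	3,482 real-world scanned documents (letters, memos, forms and similar) from tobacco-litigation archives, useful as a clean baseline of "genuine" docs. The files are JPG scans rather than PDFs.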

	- Source: https://www.kaggle.com/datasets/patrickaudriaz/tobacco3482jpg
	- Size: ~200 MB
	- Download:
	```bash
	kaggle datasets download -d patrickaudriaz/tobacco3482jpg \
	-p data/pdfs/originals --unzip
	```

	### ICDAR Find-It — Document forgery challenge
	Competition dataset of genuine and forged scanned receipts from the Find it! fraud-detection contest.

	- Source: https://findit.univ-lr.fr/
	- Size: ~500 MB
	- Download: manual (registration required, free)

	### DocVQA / RVL-CDIP — Real bank/govt docs
	Two large collections of real-world business documents.

	- Source: https://www.docvqa.org/datasets and https://www.cs.cmu.edu/~aharley/rvl-cdip/
	- Size: ~3 GB (DocVQA) / ~37 GB (RVL-CDIP)
	- Use case: populate `originals/` with realistic genuine documents

	### FUNSD — Form understanding
	199 fully-annotated forms (good for layout-anomaly training).

	- Source: https://guillaumejaume.github.io/FUNSD/
	- Size: ~50 MB

	---

	## 3. Financial statements / receipts / cheques

	### Receipts Fraud Detection
	500+ tampered and genuine receipt images.

	- Source: https://www.kaggle.com/datasets/trainingdatapro/receipts-fraud-detection-dataset
	- Size: ~100 MB
	- Download:
	```bash
	kaggle datasets download -d trainingdatapro/receipts-fraud-detection-dataset \
	-p data/statements --unzip
	```

	### Bank statements dataset
	Realistic bank statement PDFs and images.

	- Source: https://www.kaggle.com/datasets/dedeikhsandwisaputra/bank-statements-dataset
	- Size: ~80 MB
	- Download:
	```bash
	kaggle datasets download -d dedeikhsandwisaputra/bank-statements-dataset \
	-p data/statements --unzip
	```

	### IDRBT / Indian bank cheques
	Cheque images (Indian banking context).

	- Source: https://www.kaggle.com/datasets/arsh1207/bank-cheque-image-dataset
	- Size: ~50 MB
	- Download:
	```bash
	kaggle datasets download -d arsh1207/bank-cheque-image-dataset \
	-p data/statements --unzip
	```

	### SROIE — Scanned receipts
	Receipt OCR + key-information extraction challenge.

	- Source: https://rrc.cvc.uab.es/?ch=13
	- Size: ~150 MB

	---

	## 4. Land records (India-specific)

	There is no large public dataset for Indian land records. You have three practical options:

	1. Synthesise. The notebook already includes a `make_demo_pair()` function that generates realistic land-record images and tampered copies. You can extend it to produce hundreds of synthetic examples in minutes.
	2. Use government open data. Some state portals publish anonymised RoR (Record of Rights) samples, for example:
	   - Bhulekh portals (state-wise): https://bhulekh.gov.in/ (varies by state)
	   - DigiLocker sample certificates: https://www.digilocker.gov.in/
	3. Use Tobacco-3482 or DocVQA as a proxy for general scanned-document forensics. The same forensic signals (ELA, copy-move, font mix) should largely carry over.
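
To make the synthesis option concrete, here is a small standard-library sketch of one way to produce an original/tampered pair: draw a fake "page" as a grayscale pixel grid, then apply a copy-move edit by pasting one block over another. It is not the notebook's `make_demo_pair()`. The names `make_synthetic_pair` and `save_pgm`, the image size, and the stroke pattern are all assumptions chosen for illustration.

```python
import random


def make_synthetic_pair(width=64, height=48, block=12, seed=0):
    """Return (original, tampered) grayscale images as lists of pixel rows.

    Hypothetical stand-in for illustration only; it does not reproduce
    the notebook's make_demo_pair().
    """
    rng = random.Random(seed)
    # White page with random dark horizontal strokes standing in for text.
    original = [[255] * width for _ in range(height)]
    for _ in range(40):
        y = rng.randrange(height)
        x = rng.randrange(width - 8)
        for dx in range(8):
            original[y][x + dx] = rng.randrange(0, 80)

    # Copy-move: paste the top-left block over the bottom-right corner.
    tampered = [row[:] for row in original]
    for dy in range(block):
        for dx in range(block):
            tampered[height - block + dy][width - block + dx] = original[dy][dx]
    return original, tampered


def save_pgm(pixels, path):
    """Write a grayscale image as binary PGM (P5)."""
    height, width = len(pixels), len(pixels[0])
    with open(path, "wb") as f:
        f.write(f"P5 {width} {height} 255\n".encode("ascii"))
        for row in pixels:
            f.write(bytes(row))


if __name__ == "__main__":
    original, tampered = make_synthetic_pair()
    save_pgm(original, "demo_original.pgm")
    save_pgm(tampered, "demo_tampered.pgm")
```

Varying `seed` gives a different page each time, so a loop over seeds yields as many pairs as needed. The PGM files would still have to be converted to `.png` or `.jpg` before they match the folder layout the notebook expects.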

	---

	## 5. Kaggle CLI setup (one-time, free)

	```bash
	pip install kaggle
	```

	1. Sign up at https://www.kaggle.com
	2. Open https://www.kaggle.com/settings → Create New API Token
	3. A file `kaggle.json` will download. Place it at:
	- Windows: `C:\Users\<you>\.kaggle\kaggle.json`
	- Linux/Mac: `~/.kaggle/kaggle.json`
	4. On Linux/Mac: `chmod 600 ~/.kaggle/kaggle.json`

	After that, all the `kaggle datasets download …` commands above will just work.
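
If a download fails with an authentication error, the token file is the first thing to check. The sketch below looks for the file at the default location from step 3, confirms it parses as JSON, and checks the permissions from step 4. It assumes the token uses the usual `username` and `key` fields; `check_kaggle_token` is a name invented for this example and is not part of the Kaggle CLI.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(path=None):
    """Return a list of problems found with the Kaggle API token file.

    An empty list means the file exists, parses as a JSON object with
    "username" and "key" fields (assumed token layout), and on Linux/Mac
    is not accessible to other users.
    """
    token = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not token.is_file():
        return [f"{token} not found"]

    try:
        data = json.loads(token.read_text())
    except json.JSONDecodeError:
        return [f"{token} is not valid JSON"]
    if not isinstance(data, dict):
        return [f"{token} does not contain a JSON object"]

    problems = [f"missing field: {name}" for name in ("username", "key") if name not in data]

    # Permission bits are only meaningful on Linux/Mac.
    if os.name == "posix":
        mode = stat.S_IMODE(token.stat().st_mode)
        if mode & 0o077:
            problems.append(f"permissions are {oct(mode)}; run chmod 600")
    return problems


if __name__ == "__main__":
    issues = check_kaggle_token()
    print("Token looks fine" if not issues else "\n".join(issues))
```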

	---

	## 6. Minimum data needed to train

	The Random Forest in Section 7.5 of the notebook will give meaningful results with:

	- ~50 genuine + 50 tampered images: workable baseline
	- ~200 + 200: good results, ROC-AUC typically 0.85+
	- ~1000 + 1000 or more (e.g. CASIA v2): the most reliable results

	For the optional CNN in Section 7.6, target at least 200 images per class.
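
Before training, it can help to see which of the three tiers above your data falls into. The sketch below counts image files per class and maps the smaller class to a tier. The thresholds come from the list above; the function names, the tier labels, and the inclusion of `.jpeg` alongside `.png`/`.jpg` are assumptions made for this example.

```python
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg"}

# (minimum images per class, label), from the three tiers listed above.
TIERS = [(1000, "large"), (200, "good"), (50, "baseline")]


def count_images(folder):
    """Count image files (by extension) anywhere below folder."""
    folder = Path(folder)
    if not folder.is_dir():
        return 0
    return sum(
        1 for p in folder.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTS
    )


def data_tier(n_genuine, n_tampered):
    """Map per-class counts to a tier label, judged by the smaller class."""
    smallest = min(n_genuine, n_tampered)
    for minimum, label in TIERS:
        if smallest >= minimum:
            return label
    return "below minimum"


if __name__ == "__main__":
    genuine = count_images("data/images/originals")
    tampered = count_images("data/images/tampered")
    print(f"{genuine} genuine, {tampered} tampered: {data_tier(genuine, tampered)}")
```

Judging by the smaller class is a deliberate choice here: 5,000 genuine images do not compensate for having only a handful of tampered ones.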

	---

	## 7. Quick-start recipe (fastest path to working demo)

	1. `pip install kaggle` and set up the API token (Section 5 above)
	2. Download CASIA v2:
	```bash
	kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \
	-p data/images --unzip
	```
	3. Rename the extracted `Au` → `originals` and `Tp` → `tampered`
	4. Open `anomaly_detection_banking.ipynb` and run all cells
	5. Section 7.5 will train automatically on the data you just placed