# Datasets for Document Anomaly / Forgery Detection

All datasets below are **free**. The Kaggle-hosted ones require a free Kaggle account, and a few others require free registration on the host site. The notebook automatically picks up whatever you place in the right folder, so you do not need every dataset to get started.

## Folder layout the notebook expects

```
data/
├── images/
│   ├── originals/        <-- genuine scans (.png/.jpg)
│   └── tampered/         <-- forged scans (.png/.jpg)
├── pdfs/
│   ├── originals/        <-- genuine legal PDFs
│   └── tampered/         <-- forged legal PDFs
└── statements/           <-- bank statements, ITRs, receipts (any format)
```

Run `validate_data_layout()` in the notebook to confirm everything is in place.
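
The notebook's own `validate_data_layout()` is not reproduced here. As a rough stand-in, the sketch below shows one way to create and check the expected folders outside the notebook. The names `create_layout` and `missing_folders` are hypothetical, and the default root assumes `data/` sits in the current working directory.

```python
from pathlib import Path

# Sub-folders from the layout above, relative to the data root.
EXPECTED_DIRS = [
    "images/originals",
    "images/tampered",
    "pdfs/originals",
    "pdfs/tampered",
    "statements",
]


def create_layout(root="data"):
    """Create any expected folder that does not exist yet."""
    for rel in EXPECTED_DIRS:
        (Path(root) / rel).mkdir(parents=True, exist_ok=True)


def missing_folders(root="data"):
    """Return the expected folders that are absent under root."""
    return [rel for rel in EXPECTED_DIRS if not (Path(root) / rel).is_dir()]
```

Calling `create_layout()` once and then `missing_folders()` should give an empty list; a non-empty list names the folders that still need to be created.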

---

## 1. Image tampering datasets

### CASIA v2 — Gold-standard image tampering benchmark
The most-used dataset for splicing/copy-move detection research. ~12k images (7k genuine + 5k tampered).

- **Source:** https://www.kaggle.com/datasets/divg07/casia-20-image-tampering-detection-dataset
- **Size:** ~2 GB
- **Download:**
  ```bash
  kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \
      -p data/images --unzip
  ```
- **After download:** rename `Au/` → `originals/` and `Tp/` → `tampered/`
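
The rename can also be scripted. The sketch below is a minimal example, assuming the archive extracted `Au/` and `Tp/` somewhere beneath `data/images`. The exact nesting inside the Kaggle archive is not stated here, so the helper (a hypothetical `rename_casia_folders`) searches for the folders instead of hard-coding a path.

```python
import shutil
from pathlib import Path

# CASIA folder name -> folder name the notebook expects.
RENAMES = {"Au": "originals", "Tp": "tampered"}


def rename_casia_folders(images_root="data/images"):
    """Move the extracted Au/ and Tp/ folders to originals/ and tampered/."""
    root = Path(images_root)
    moved = {}
    for old_name, new_name in RENAMES.items():
        target = root / new_name
        if target.exists():
            continue  # never overwrite an existing folder
        # Search at any depth, since the archive's nesting is an assumption.
        matches = sorted(p for p in root.rglob(old_name) if p.is_dir())
        if matches:
            shutil.move(str(matches[0]), str(target))
            moved[old_name] = target
    return moved
```

The function returns a dict of the folders it moved, so an empty result means either nothing was found or the target folders already exist.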

### MICC-F220 — Classic copy-move benchmark
220 images, perfect for testing copy-move detection.

- **Source:** http://lci.micc.unifi.it/labd/2015/01/copy-move-forgery-detection-and-localization/
- **Size:** ~50 MB
- **Download:** manual (form on the page)

### CoMoFoD — Copy-move with ground-truth masks
260 image sets with masks. Ideal for training a CNN with pixel-level supervision.

- **Source:** https://www.vcl.fer.hr/comofod/
- **Size:** ~1 GB
- **Download:** manual

### Coverage — Genuine + tampered pairs
100 pairs with similar-but-genuine objects (toughest case).

- **Source:** https://github.com/wenbihan/coverage
- **Size:** ~600 MB
- **Download:** `git clone https://github.com/wenbihan/coverage.git`

### Columbia Uncompressed Image Splicing
180 spliced + 183 authentic images, lossless.

- **Source:** https://www.ee.columbia.edu/ln/dvmm/downloads/AuthSplicedDataSet/AuthSplicedDataSet.htm
- **Size:** ~1 GB
- **Download:** manual (registration required, free)

---

## 2. Document / Legal PDF datasets

### Tobacco-3482 — Real scanned legal docs
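3,482 real-world scanned document images (letters, memos, forms, reports and similar) drawn from tobacco-industry litigation archives; a clean baseline of "genuine" docs. The files are JPEGs, not PDFs.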

- **Source:** https://www.kaggle.com/datasets/patrickaudriaz/tobacco3482jpg
- **Size:** ~200 MB
- **Download:**
  ```bash
  kaggle datasets download -d patrickaudriaz/tobacco3482jpg \
      -p data/pdfs/originals --unzip
  ```

### ICDAR Find-It — Document forgery challenge
Competition dataset of scanned receipts, a portion of them deliberately forged, from the "Find it!" fraud detection contest (held at ICPR 2018).

- **Source:** https://findit.univ-lr.fr/
- **Size:** ~500 MB
- **Download:** manual (registration required, free)

### DocVQA / RVL-CDIP — Real bank/govt docs
Massive dataset of real-world business documents.

- **Source:** https://www.docvqa.org/datasets and https://www.cs.cmu.edu/~aharley/rvl-cdip/
- **Size:** ~3 GB / 37 GB
- **Use case:** populate `originals/` with realistic genuine documents

### FUNSD — Form understanding
199 fully-annotated forms (good for layout-anomaly training).

- **Source:** https://guillaumejaume.github.io/FUNSD/
- **Size:** ~50 MB

---

## 3. Financial statements / receipts / cheques

### Receipts Fraud Detection
500+ tampered and genuine receipt images.

- **Source:** https://www.kaggle.com/datasets/trainingdatapro/receipts-fraud-detection-dataset
- **Size:** ~100 MB
- **Download:**
  ```bash
  kaggle datasets download -d trainingdatapro/receipts-fraud-detection-dataset \
      -p data/statements --unzip
  ```

### Bank statements dataset
Realistic bank statement PDFs and images.

- **Source:** https://www.kaggle.com/datasets/dedeikhsandwisaputra/bank-statements-dataset
- **Size:** ~80 MB
- **Download:**
  ```bash
  kaggle datasets download -d dedeikhsandwisaputra/bank-statements-dataset \
      -p data/statements --unzip
  ```

### IDRBT / Indian bank cheques
Cheque images (Indian banking context).

- **Source:** https://www.kaggle.com/datasets/arsh1207/bank-cheque-image-dataset
- **Size:** ~50 MB
- **Download:**
  ```bash
  kaggle datasets download -d arsh1207/bank-cheque-image-dataset \
      -p data/statements --unzip
  ```

### SROIE — Scanned receipts
Receipt OCR + key-information extraction challenge.

- **Source:** https://rrc.cvc.uab.es/?ch=13
- **Size:** ~150 MB

---

## 4. Land records (India-specific)

There is no large public dataset for Indian land records, so you have three practical options:

1. **Synthesise.** The notebook already includes a `make_demo_pair()` function that generates realistic land-record images and tampered copies. You can extend this to produce hundreds of synthetic examples in minutes.
2. **Use government open data.** Some state portals publish anonymised RoR (Record of Rights) samples — e.g.:
   - Bhulekh portals (state-wise): https://bhulekh.gov.in/ (varies by state)
   - DigiLocker sample certificates: https://www.digilocker.gov.in/
3. **Use Tobacco-3482 or DocVQA as proxy** for general scanned-document forensics — the same forensic signals (ELA, copy-move, font mix) transfer directly.
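
To illustrate the first option, here is a minimal sketch of how a synthetic genuine/tampered pair could be produced with a copy-move edit. It is not the notebook's `make_demo_pair()`, whose implementation is not shown here. The function name `make_copy_move_pair`, the noisy blank-page background, and the patch size are all assumptions chosen for illustration.

```python
import numpy as np


def make_copy_move_pair(height=128, width=128, patch=32, seed=0):
    """Return (original, tampered, mask) for a synthetic grayscale page."""
    if patch > height or patch > width // 2:
        raise ValueError("patch must fit inside one half of the image")

    rng = np.random.default_rng(seed)
    # Stand-in for a scanned page: light background with scanner-like noise.
    noise = rng.normal(230, 8, size=(height, width))
    original = noise.clip(0, 255).astype(np.uint8)

    # Source patch comes from the left half, destination lies in the
    # right half, so the two regions can never overlap.
    half = width // 2
    src_y = int(rng.integers(0, height - patch + 1))
    src_x = int(rng.integers(0, half - patch + 1))
    dst_y = int(rng.integers(0, height - patch + 1))
    dst_x = int(rng.integers(half, width - patch + 1))

    tampered = original.copy()
    tampered[dst_y:dst_y + patch, dst_x:dst_x + patch] = (
        original[src_y:src_y + patch, src_x:src_x + patch]
    )

    # Pixel-level ground truth: True where the pasted patch landed.
    mask = np.zeros((height, width), dtype=bool)
    mask[dst_y:dst_y + patch, dst_x:dst_x + patch] = True
    return original, tampered, mask
```

Looping over different `seed` values gives as many pairs as needed. The arrays could then be written to `data/images/originals/` and `data/images/tampered/` with an image library such as Pillow; a real land-record generator would also render text, stamps, and table lines instead of plain noise.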

---

## 5. Kaggle CLI setup (one-time, free)

```bash
pip install kaggle
```

1. Sign up at https://www.kaggle.com
2. Open https://www.kaggle.com/settings → **Create New API Token**
3. A file `kaggle.json` will download. Place it at:
   - Windows: `C:\Users\<you>\.kaggle\kaggle.json`
   - Linux/Mac: `~/.kaggle/kaggle.json`
4. On Linux/Mac: `chmod 600 ~/.kaggle/kaggle.json`

After that, all the `kaggle datasets download …` commands above will just work.
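
If the commands fail with an authentication error, a quick check of the token file can help. The sketch below is a hypothetical helper (`check_kaggle_token` is not part of the Kaggle package) and assumes the token file is a JSON object with `username` and `key` fields.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(path=None):
    """Return a list of problems found with the Kaggle API token file."""
    token = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not token.is_file():
        return [f"{token} not found"]

    try:
        creds = json.loads(token.read_text())
    except json.JSONDecodeError:
        creds = None
    if not isinstance(creds, dict):
        return [f"{token} is not a JSON object"]

    problems = []
    for field in ("username", "key"):  # assumed field names
        if not creds.get(field):
            problems.append(f"missing '{field}' in {token}")

    # Mirrors the chmod 600 step: flag files accessible to group or others.
    if os.name != "nt" and stat.S_IMODE(token.stat().st_mode) & 0o077:
        problems.append(f"{token} is accessible to other users; run chmod 600")
    return problems
```

An empty list means the file is present, parseable, and (on Linux/Mac) private to your user.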

---

## 6. Minimum data needed to train

The Random Forest in Section 7.5 of the notebook will give meaningful results with:

- **~50 genuine + 50 tampered images:** workable baseline
- **~200 + 200:** good results, ROC-AUC typically 0.85+
- **~1000 + 1000 or more** (e.g. CASIA v2, which has roughly 7k genuine + 5k tampered): production-grade results

For the optional CNN in Section 7.6, target at least 200 images per class.
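
To see where your current folders fall against these thresholds, something like the following sketch can be used. The helper names are hypothetical, and two choices are assumptions rather than notebook behaviour: files are counted recursively (with `.jpeg` accepted alongside `.png`/`.jpg`), and the smaller of the two classes decides the tier.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}

# (minimum images per class, label), largest first, from the list above.
TIERS = [
    (1000, "production-grade results"),
    (200, "good results"),
    (50, "workable baseline"),
]


def count_images(folder):
    """Count image files anywhere under folder (case-insensitive suffix)."""
    return sum(
        1
        for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def data_tier(n_genuine, n_tampered):
    """Map per-class counts to a tier, judged by the smaller class."""
    smallest = min(n_genuine, n_tampered)
    for minimum, label in TIERS:
        if smallest >= minimum:
            return label
    return "below the suggested minimum"
```

For example, `data_tier(count_images("data/images/originals"), count_images("data/images/tampered"))` reports the tier for the image folders.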

---

## 7. Quick-start recipe (fastest path to working demo)

1. `pip install kaggle` and set up the API token (Section 5 above)
2. Download CASIA v2:
   ```bash
   kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \
       -p data/images --unzip
   ```
3. Rename the extracted `Au` → `originals` and `Tp` → `tampered`
4. Open `anomaly_detection_banking.ipynb` and run all cells
5. Section 7.5 will train automatically on the data you just placed