# Datasets for Document Anomaly / Forgery Detection

All datasets below are **free**. The Kaggle-hosted ones need a free Kaggle account, and a few others need a free registration on the host site. The notebook automatically picks up whatever you drop into the right folder, so you do not need every dataset to get started.

## Folder layout the notebook expects

```
data/
├── images/
│   ├── originals/        <-- genuine scans (.png/.jpg)
│   └── tampered/         <-- forged scans (.png/.jpg)
├── pdfs/
│   ├── originals/        <-- genuine legal PDFs
│   └── tampered/         <-- forged legal PDFs
└── statements/           <-- bank statements, ITRs, receipts (any format)
```

Run `validate_data_layout()` in the notebook to confirm everything is in place.
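
To check the layout before opening the notebook, a standalone script along these lines may help. It is a minimal sketch, not the notebook's `validate_data_layout()` (whose implementation is not shown here); the name `check_layout` is hypothetical, and the folder list is copied from the tree above.

```python
from pathlib import Path

# Folders taken from the layout above, relative to the data root.
EXPECTED_DIRS = [
    "images/originals",
    "images/tampered",
    "pdfs/originals",
    "pdfs/tampered",
    "statements",
]


def check_layout(root="data"):
    """Return {relative_dir: file_count}; a missing folder maps to None."""
    report = {}
    for rel in EXPECTED_DIRS:
        folder = Path(root) / rel
        if folder.is_dir():
            # Count files at any depth, since unzipped datasets are often nested.
            report[rel] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            report[rel] = None
    return report


if __name__ == "__main__":
    for rel, count in check_layout().items():
        status = "missing" if count is None else f"{count} files"
        print(f"{rel:<20} {status}")
```

A folder reported as `missing` or `0 files` is not necessarily a problem, since the notebook works with whichever datasets are present.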

---

## 1. Image tampering datasets

### CASIA v2: Gold-standard image tampering benchmark
One of the most widely used datasets in splicing and copy-move detection research. ~12k images (7k genuine + 5k tampered).

- **Source:** https://www.kaggle.com/datasets/divg07/casia-20-image-tampering-detection-dataset
- **Size:** ~2 GB
- **Download:**
  ```bash
  kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \
      -p data/images --unzip
  ```
- **After download:** rename `Au/` → `originals/` and `Tp/` → `tampered/`
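
The rename can also be scripted. The sketch below assumes the archive unpacks `Au/` and `Tp/` somewhere under `data/images`, possibly inside a wrapper folder; the exact nesting of the Kaggle archive is not documented here, so the function searches for the two folder names instead of hard-coding a path. `rename_casia_folders` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path

# Folder names as shipped in CASIA v2 -> names the notebook expects.
RENAMES = {"Au": "originals", "Tp": "tampered"}


def rename_casia_folders(images_root="data/images"):
    """Move Au/ and Tp/ (found at any depth) to <images_root>/originals and /tampered."""
    root = Path(images_root)
    moved = []
    for old_name, new_name in RENAMES.items():
        target = root / new_name
        if target.exists():
            continue  # already renamed, or populated from another dataset
        # sorted() materialises the search before anything is moved.
        candidates = sorted(p for p in root.rglob(old_name) if p.is_dir())
        if candidates:
            candidates[0].rename(target)
            moved.append((str(candidates[0]), str(target)))
    return moved


if __name__ == "__main__":
    for old, new in rename_casia_folders():
        print(f"{old} -> {new}")
```

Existing `originals/` or `tampered/` folders are left untouched, so running the script twice is harmless.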

### MICC-F220: Classic copy-move benchmark
220 images, perfect for testing copy-move detection.

- **Source:** http://lci.micc.unifi.it/labd/2015/01/copy-move-forgery-detection-and-localization/
- **Size:** ~50 MB
- **Download:** manual (form on the page)

### CoMoFoD: Copy-move with ground-truth masks
260 image sets with masks. Ideal for training a CNN with pixel-level supervision.

- **Source:** https://www.vcl.fer.hr/comofod/
- **Size:** ~1 GB
- **Download:** manual

### Coverage: Genuine + tampered pairs
100 genuine/tampered pairs in which the scene also contains similar-but-genuine objects, the toughest case for copy-move detection.

- **Source:** https://github.com/wenbihan/coverage
- **Size:** ~600 MB
- **Download:** `git clone https://github.com/wenbihan/coverage.git`

### Columbia Uncompressed Image Splicing
180 spliced + 180 authentic images, lossless.

- **Source:** https://www.ee.columbia.edu/ln/dvmm/downloads/AuthSplicedDataSet/AuthSplicedDataSet.htm
- **Size:** ~1 GB
- **Download:** manual (registration required, free)

---

## 2. Document / Legal PDF datasets

### Tobacco-3482: Real scanned legal docs
3,482 real-world scanned legal documents (a clean baseline of "genuine" docs). As the dataset name suggests, the files are JPEG scans rather than PDFs.

- **Source:** https://www.kaggle.com/datasets/patrickaudriaz/tobacco3482jpg
- **Size:** ~200 MB
- **Download:**
  ```bash
  kaggle datasets download -d patrickaudriaz/tobacco3482jpg \
      -p data/pdfs/originals --unzip
  ```

### ICDAR Find-It: Document forgery challenge
Official competition dataset for forged-document detection; the Find it! contest data consists of scanned receipts rather than scientific papers.

- **Source:** https://findit.univ-lr.fr/
- **Size:** ~500 MB
- **Download:** manual (registration required, free)

### DocVQA / RVL-CDIP: Real bank/govt docs
Two large datasets of real-world business documents.

- **Source:** https://www.docvqa.org/datasets and https://www.cs.cmu.edu/~aharley/rvl-cdip/
- **Size:** ~3 GB / 37 GB
- **Use case:** populate `originals/` with realistic genuine documents

### FUNSD: Form understanding
199 fully-annotated forms (good for layout-anomaly training).

- **Source:** https://guillaumejaume.github.io/FUNSD/
- **Size:** ~50 MB

---

## 3. Financial statements / receipts / cheques

### Receipts Fraud Detection
500+ tampered and genuine receipt images.

- **Source:** https://www.kaggle.com/datasets/trainingdatapro/receipts-fraud-detection-dataset
- **Size:** ~100 MB
- **Download:**
  ```bash
  kaggle datasets download -d trainingdatapro/receipts-fraud-detection-dataset \
      -p data/statements --unzip
  ```

### Bank statements dataset
Realistic bank statement PDFs and images.

- **Source:** https://www.kaggle.com/datasets/dedeikhsandwisaputra/bank-statements-dataset
- **Size:** ~80 MB
- **Download:**
  ```bash
  kaggle datasets download -d dedeikhsandwisaputra/bank-statements-dataset \
      -p data/statements --unzip
  ```

### IDRBT / Indian bank cheques
Cheque images (Indian banking context).

- **Source:** https://www.kaggle.com/datasets/arsh1207/bank-cheque-image-dataset
- **Size:** ~50 MB
- **Download:**
  ```bash
  kaggle datasets download -d arsh1207/bank-cheque-image-dataset \
      -p data/statements --unzip
  ```

### SROIE: Scanned receipts
Receipt OCR + key-information extraction challenge.

- **Source:** https://rrc.cvc.uab.es/?ch=13
- **Size:** ~150 MB

---

## 4. Land records (India-specific)

There is no large public dataset for Indian land records, so you have three practical options:

1. **Synthesise.** The notebook already includes a `make_demo_pair()` function that generates realistic land-record images and tampered copies. You can extend this to produce hundreds of synthetic examples in minutes.
2. **Use government open data.** Some state portals publish anonymised RoR (Record of Rights) samples, e.g.:
   - Bhulekh portals (state-wise): https://bhulekh.gov.in/ (varies by state)
   - DigiLocker sample certificates: https://www.digilocker.gov.in/
3. **Use Tobacco-3482 or DocVQA as a proxy** for general scanned-document forensics. The same forensic signals (error level analysis (ELA), copy-move, mixed fonts) transfer directly.
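
To make the synthesis option concrete, the sketch below shows the core of a copy-move forgery: one block of pixels is pasted over another region of the same image. It works on a plain 2D list of grayscale values so that it needs no imaging library, and it is only an illustration; the notebook's `make_demo_pair()` may work quite differently. The function name `copy_move` and its parameters are hypothetical.

```python
import copy


def copy_move(pixels, src, dst, size):
    """Return a tampered copy of `pixels` with one block pasted elsewhere.

    pixels: 2D list of grayscale values (rows of ints)
    src, dst: (row, col) of the top-left corner of the source / destination block
    size: (height, width) of the block
    """
    height, width = size
    rows, cols = len(pixels), len(pixels[0])
    for top, left in (src, dst):
        if top < 0 or left < 0 or top + height > rows or left + width > cols:
            raise ValueError("block does not fit inside the image")
    tampered = copy.deepcopy(pixels)  # leave the genuine image untouched
    for dr in range(height):
        for dc in range(width):
            # Read from the original so overlapping blocks behave predictably.
            tampered[dst[0] + dr][dst[1] + dc] = pixels[src[0] + dr][src[1] + dc]
    return tampered


# Toy 4x4 "scan": each pixel value encodes its position (10 * row + col).
genuine = [[10 * r + c for c in range(4)] for r in range(4)]
forged = copy_move(genuine, src=(0, 0), dst=(2, 2), size=(2, 2))
```

Each `(genuine, forged)` pair produced this way gives one sample for `originals/` and one for `tampered/` once the grids are written out as images.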

---

## 5. Kaggle CLI setup (one-time, free)

```bash
pip install kaggle
```

1. Sign up at https://www.kaggle.com
2. Open https://www.kaggle.com/settings → **Create New API Token**
3. A file `kaggle.json` will download. Place it at:
   - Windows: `C:\Users\<you>\.kaggle\kaggle.json`
   - Linux/Mac: `~/.kaggle/kaggle.json`
4. On Linux/Mac: `chmod 600 ~/.kaggle/kaggle.json`

After that, all the `kaggle datasets download …` commands above will just work.
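
If a download fails with an authentication error, the token file is the first thing to check. The sketch below looks for the file at the default location listed in step 3, confirms that it holds the `username` and `key` fields, and on Linux/Mac confirms the permissions set in step 4. `check_kaggle_token` is a hypothetical helper and does not call the Kaggle API.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(path=None):
    """Return a list of problems found with kaggle.json (empty list means OK)."""
    token = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not token.is_file():
        return [f"{token} not found"]
    try:
        data = json.loads(token.read_text(encoding="utf-8"))
    except ValueError:
        return [f"{token} is not valid JSON"]
    if not isinstance(data, dict):
        return [f"{token} does not contain a JSON object"]
    problems = []
    for field in ("username", "key"):
        if not data.get(field):
            problems.append(f"missing field: {field}")
    # Permission bits are only meaningful on Linux/Mac.
    if os.name == "posix" and stat.S_IMODE(token.stat().st_mode) & 0o077:
        problems.append("file is readable by other users; run chmod 600")
    return problems


if __name__ == "__main__":
    issues = check_kaggle_token()
    print("kaggle.json looks fine" if not issues else "\n".join(issues))
```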

---

## 6. Minimum data needed to train

The Random Forest in Section 7.5 of the notebook will give meaningful results with:

- **~50 genuine + 50 tampered images:** workable baseline
- **~200 + 200:** good results, ROC-AUC typically 0.85+
- **~1000 + 1000** (e.g. full CASIA v2): production-grade results

For the optional CNN in Section 7.6, target at least 200 images per class.
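
To see where your current data falls against these thresholds, something like the following could be run before training. It is a rough sketch: the thresholds are the approximate figures from the list above treated as hard cut-offs, the set of image suffixes is an assumption, and `count_images` and `data_tier` are hypothetical helpers rather than notebook functions.

```python
from pathlib import Path

# Assumed image extensions; extend this set if your scans use other formats.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}

# (minimum images per class, label), highest tier first; numbers from the list above.
TIERS = [
    (1000, "production-grade"),
    (200, "good"),
    (50, "workable baseline"),
]


def count_images(folder):
    """Count image files under `folder`, including nested subfolders."""
    folder = Path(folder)
    if not folder.is_dir():
        return 0
    return sum(
        1 for p in folder.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def data_tier(n_genuine, n_tampered):
    """Map per-class counts to a tier label, judged by the smaller class."""
    smallest = min(n_genuine, n_tampered)
    for minimum, label in TIERS:
        if smallest >= minimum:
            return label
    return "too little data"


if __name__ == "__main__":
    genuine = count_images("data/images/originals")
    tampered = count_images("data/images/tampered")
    print(f"{genuine} genuine, {tampered} tampered: {data_tier(genuine, tampered)}")
```

The tier is judged by the smaller class because a large surplus of genuine images does not compensate for a handful of tampered ones.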

---

## 7. Quick-start recipe (fastest path to working demo)

1. `pip install kaggle` and set up the API token (Section 5 above)
2. Download CASIA v2:
   ```bash
   kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \
       -p data/images --unzip
   ```
3. Rename the extracted `Au` → `originals` and `Tp` → `tampered`
4. Open `anomaly_detection_banking.ipynb` and run all cells
5. Section 7.5 will train automatically on the data you just placed