# Datasets for Document Anomaly / Forgery Detection
All datasets below are **free**. The Kaggle-hosted ones need a free Kaggle account, and a few others need free registration on the host site. The notebook auto-picks up whatever you drop into the right folder, so you do not need every dataset to get started.
## Folder layout the notebook expects
```
data/
├── images/
│   ├── originals/   <-- genuine scans (.png/.jpg)
│   └── tampered/    <-- forged scans (.png/.jpg)
├── pdfs/
│   ├── originals/   <-- genuine legal PDFs
│   └── tampered/    <-- forged legal PDFs
└── statements/      <-- bank statements, ITRs, receipts (any format)
```
Run `validate_data_layout()` in the notebook to confirm everything is in place.
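If the folders do not exist yet, the tree above can be created in one step. This is a minimal sketch that assumes you run it from the project root (the directory that holds the notebook); it is not part of the notebook itself.

```bash
# Create the folder tree the notebook expects.
# -p makes this safe to re-run: existing folders are left untouched.
mkdir -p data/images/originals data/images/tampered \
         data/pdfs/originals data/pdfs/tampered \
         data/statements

# List the resulting directories to confirm the layout.
find data -type d | sort
```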
---
## 1. Image tampering datasets
### CASIA v2: Gold-standard image tampering benchmark
The most-used dataset for splicing/copy-move detection research. ~12k images (7k genuine + 5k tampered).
- **Source:** https://www.kaggle.com/datasets/divg07/casia-20-image-tampering-detection-dataset
- **Size:** ~2 GB
- **Download:**
```bash
kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \
-p data/images --unzip
```
- **After download:** rename `Au/` → `originals/` and `Tp/` → `tampered/`
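The rename step can be scripted. The sketch below assumes the archive unpacks `Au/` and `Tp/` directly under `data/images/`; if your copy unpacks into a wrapper folder, pass that folder instead. The function name `rename_casia` is only illustrative.

```bash
# Move CASIA's Au/ and Tp/ folders into the names the notebook expects.
# $1: directory that contains Au/ and Tp/ (assumed default: data/images)
rename_casia() {
  local root="${1:-data/images}"
  local pair from to
  for pair in "Au:originals" "Tp:tampered"; do
    from="$root/${pair%%:*}"
    to="$root/${pair##*:}"
    if [ ! -d "$from" ]; then
      echo "skip: $from not found"
      continue
    fi
    mkdir -p "$to"
    # Move the contents rather than the folder itself, so an already
    # existing originals/ or tampered/ does not end up with Au/ nested inside.
    find "$from" -mindepth 1 -maxdepth 1 -exec mv {} "$to/" \;
    rmdir "$from"
  done
}

rename_casia data/images
```

Moving the contents instead of renaming the folder keeps the script idempotent when the target folders were created beforehand.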
### MICC-F220: Classic copy-move benchmark
220 images, perfect for testing copy-move detection.
- **Source:** http://lci.micc.unifi.it/labd/2015/01/copy-move-forgery-detection-and-localization/
- **Size:** ~50 MB
- **Download:** manual (form on the page)
### CoMoFoD: Copy-move with ground-truth masks
260 image sets with masks. Ideal for training a CNN with pixel-level supervision.
- **Source:** https://www.vcl.fer.hr/comofod/
- **Size:** ~1 GB
- **Download:** manual
### Coverage: Genuine + tampered pairs
100 pairs with similar-but-genuine objects (toughest case).
- **Source:** https://github.com/wenbihan/coverage
- **Size:** ~600 MB
- **Download:** `git clone https://github.com/wenbihan/coverage.git`
### Columbia Uncompressed Image Splicing
180 spliced + 183 authentic images, lossless.
- **Source:** https://www.ee.columbia.edu/ln/dvmm/downloads/AuthSplicedDataSet/AuthSplicedDataSet.htm
- **Size:** ~1 GB
- **Download:** manual (registration required, free)
---
## 2. Document / Legal PDF datasets
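### Tobacco-3482: Real scanned documents from litigation archives
3,482 real-world scanned documents (letters, memos, forms, reports and similar) released through tobacco-industry litigation. Use them as a clean baseline of "genuine" docs.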
- **Source:** https://www.kaggle.com/datasets/patrickaudriaz/tobacco3482jpg
- **Size:** ~200 MB
- **Download:**
```bash
kaggle datasets download -d patrickaudriaz/tobacco3482jpg \
-p data/pdfs/originals --unzip
```
### Find it! (ICPR 2018): Document forgery challenge
Official competition dataset of genuine and forged receipt images.
- **Source:** https://findit.univ-lr.fr/
- **Size:** ~500 MB
- **Download:** manual (registration required, free)
### DocVQA / RVL-CDIP: Real-world business documents
Large collections of real-world scanned business documents (RVL-CDIP alone has 400,000 images in 16 classes).
- **Source:** https://www.docvqa.org/datasets and https://www.cs.cmu.edu/~aharley/rvl-cdip/
- **Size:** ~3 GB / 37 GB
- **Use case:** populate `originals/` with realistic genuine documents
### FUNSD: Form understanding
199 fully-annotated forms (good for layout-anomaly training).
- **Source:** https://guillaumejaume.github.io/FUNSD/
- **Size:** ~50 MB
---
## 3. Financial statements / receipts / cheques
### Receipts Fraud Detection
500+ tampered and genuine receipt images.
- **Source:** https://www.kaggle.com/datasets/trainingdatapro/receipts-fraud-detection-dataset
- **Size:** ~100 MB
- **Download:**
```bash
kaggle datasets download -d trainingdatapro/receipts-fraud-detection-dataset \
-p data/statements --unzip
```
### Bank statements dataset
Realistic bank statement PDFs and images.
- **Source:** https://www.kaggle.com/datasets/dedeikhsandwisaputra/bank-statements-dataset
- **Size:** ~80 MB
- **Download:**
```bash
kaggle datasets download -d dedeikhsandwisaputra/bank-statements-dataset \
-p data/statements --unzip
```
### IDRBT / Indian bank cheques
Cheque images (Indian banking context).
- **Source:** https://www.kaggle.com/datasets/arsh1207/bank-cheque-image-dataset
- **Size:** ~50 MB
- **Download:**
```bash
kaggle datasets download -d arsh1207/bank-cheque-image-dataset \
-p data/statements --unzip
```
### SROIE: Scanned receipts
Receipt OCR + key-information extraction challenge.
- **Source:** https://rrc.cvc.uab.es/?ch=13
- **Size:** ~150 MB
---
## 4. Land records (India-specific)
There is no large public dataset for Indian land records, so you have three practical options:
1. **Synthesise.** The notebook already includes a `make_demo_pair()` function that generates realistic land-record images and tampered copies. You can extend this to produce hundreds of synthetic examples in minutes.
2. **Use government open data.** Some state portals publish anonymised RoR (Record of Rights) samples, e.g.:
   - Bhulekh portals (state-wise): https://bhulekh.gov.in/ (varies by state)
   - DigiLocker sample certificates: https://www.digilocker.gov.in/
3. **Use Tobacco-3482 or DocVQA as a proxy** for general scanned-document forensics; the same forensic signals (ELA, copy-move, font mix) transfer directly.
---
## 5. Kaggle CLI setup (one-time, free)
```bash
pip install kaggle
```
1. Sign up at https://www.kaggle.com
2. Open https://www.kaggle.com/settings → **Create New API Token**
3. A file `kaggle.json` will download. Place it at:
   - Windows: `C:\Users\<you>\.kaggle\kaggle.json`
   - Linux/Mac: `~/.kaggle/kaggle.json`
4. On Linux/Mac: `chmod 600 ~/.kaggle/kaggle.json`

After that, all the `kaggle datasets download …` commands above will just work.
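On Linux/Mac, steps 3 and 4 can be wrapped in a small helper. This is a sketch, not an official Kaggle tool: the function name `install_kaggle_token` is hypothetical, and the default source path `~/Downloads/kaggle.json` is an assumption about where your browser saves files.

```bash
# Copy a downloaded kaggle.json into place and restrict its permissions.
# $1: path to the downloaded token (assumed default: ~/Downloads/kaggle.json)
# $2: target directory (default: ~/.kaggle, as in step 3)
install_kaggle_token() {
  local src="${1:-$HOME/Downloads/kaggle.json}"
  local dest_dir="${2:-$HOME/.kaggle}"
  if [ ! -f "$src" ]; then
    echo "kaggle.json not found at $src" >&2
    return 1
  fi
  mkdir -p "$dest_dir"
  cp "$src" "$dest_dir/kaggle.json"
  chmod 600 "$dest_dir/kaggle.json"   # owner read/write only (step 4)
  echo "token installed at $dest_dir/kaggle.json"
}
```

Call `install_kaggle_token` with no arguments to use the defaults, or pass the actual download path as the first argument.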
---
## 6. Minimum data needed to train
The Random Forest in Section 7.5 of the notebook will give meaningful results with:
- **~50 genuine + 50 tampered images**: workable baseline
- **~200 + 200**: good results, ROC-AUC typically 0.85+
- **~1000 + 1000 or more** (e.g. full CASIA v2, ~7k + 5k): production-grade results
For the optional CNN in Section 7.6, target at least 200 images per class.
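To see where your current data falls against these thresholds, you can count the images per class. The sketch below assumes the images live in `data/images/originals` and `data/images/tampered` and counts only `.png`, `.jpg` and `.jpeg` files; the helper names are illustrative, and the notebook may count files differently.

```bash
# Count image files (.png/.jpg/.jpeg, case-insensitive) under a folder.
count_images() {
  [ -d "$1" ] || { echo 0; return; }
  find "$1" -type f \( -iname '*.png' -o -iname '*.jpg' -o -iname '*.jpeg' \) \
    | wc -l | tr -d ' '
}

# Map the smaller class size onto the rough tiers listed above.
data_tier() {
  if   [ "$1" -ge 1000 ]; then echo "~1000+ per class"
  elif [ "$1" -ge 200 ];  then echo "~200+ per class"
  elif [ "$1" -ge 50 ];   then echo "~50+ per class (workable baseline)"
  else                         echo "fewer than ~50 per class"
  fi
}

genuine="$(count_images data/images/originals)"
tampered="$(count_images data/images/tampered)"
# The smaller class limits what the classifier can learn.
smaller=$(( genuine < tampered ? genuine : tampered ))
echo "genuine=$genuine tampered=$tampered -> $(data_tier "$smaller")"
```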
---
## 7. Quick-start recipe (fastest path to working demo)
1. `pip install kaggle` and set up the API token (Section 5 above)
2. Download CASIA v2:
```bash
kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \
-p data/images --unzip
```
3. Rename the extracted `Au` → `originals` and `Tp` → `tampered`
4. Open `anomaly_detection_banking.ipynb` and run all cells
5. Section 7.5 will train automatically on the data you just placed