| # Datasets for Document Anomaly / Forgery Detection | |
| All datasets below are **free**. The Kaggle-hosted ones need a free Kaggle account, and a few others need a free registration on the host site. The notebook automatically picks up whatever you drop into the right folder, so you do not need every dataset to get started. | |
| ## Folder layout the notebook expects | |
| ``` | |
| data/ | |
| ├── images/ | |
| │   ├── originals/   <-- genuine scans (.png/.jpg) | |
| │   └── tampered/    <-- forged scans (.png/.jpg) | |
| ├── pdfs/ | |
| │   ├── originals/   <-- genuine legal PDFs | |
| │   └── tampered/    <-- forged legal PDFs | |
| └── statements/      <-- bank statements, ITRs, receipts (any format) | |
| ``` | |
| Run `validate_data_layout()` in the notebook to confirm everything is in place. | |
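If you want to check the layout outside the notebook, a minimal sketch along these lines may help. It is not the notebook's `validate_data_layout()`; `check_layout` and `EXPECTED_DIRS` are hypothetical names, and only the folder names are taken from the layout above.

```python
from pathlib import Path

# Folder names come from the layout above; everything else is illustrative.
EXPECTED_DIRS = [
    "images/originals",
    "images/tampered",
    "pdfs/originals",
    "pdfs/tampered",
    "statements",
]


def check_layout(root="data"):
    """Return {relative_dir: file_count}, with None for a missing folder.

    Hypothetical helper; the notebook's validate_data_layout() may differ.
    """
    root = Path(root)
    report = {}
    for rel in EXPECTED_DIRS:
        folder = root / rel
        if not folder.is_dir():
            report[rel] = None
            continue
        # Count regular files at any depth, since unzipped datasets often nest.
        report[rel] = sum(1 for p in folder.rglob("*") if p.is_file())
    return report


if __name__ == "__main__":
    for rel, count in check_layout().items():
        status = "MISSING" if count is None else f"{count} files"
        print(f"data/{rel}: {status}")
```

A folder reported as `MISSING` or with `0 files` is simply one you have not populated yet; that is fine as long as at least one originals/tampered pair has data.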
| --- | |
| ## 1. Image tampering datasets | |
| ### CASIA v2: Gold-standard image tampering benchmark | |
| The most-used dataset for splicing/copy-move detection research. ~12k images (7k genuine + 5k tampered). | |
| - **Source:** https://www.kaggle.com/datasets/divg07/casia-20-image-tampering-detection-dataset | |
| - **Size:** ~2 GB | |
| - **Download:** | |
| ```bash | |
| kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \ | |
| -p data/images --unzip | |
| ``` | |
| - **After download:** rename `Au/` → `originals/` and `Tp/` → `tampered/` | |
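The rename can also be scripted. The sketch below assumes `Au/` and `Tp/` end up directly under `data/images`; if the archive unpacks into a wrapper folder, move them up first or pass that folder instead. `rename_casia_folders` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path

# Mapping taken from the rename step above.
RENAMES = {"Au": "originals", "Tp": "tampered"}


def rename_casia_folders(images_dir="data/images"):
    """Rename Au/ and Tp/ under images_dir; return the names that were renamed."""
    images_dir = Path(images_dir)
    renamed = []
    for old, new in RENAMES.items():
        src, dst = images_dir / old, images_dir / new
        if not src.is_dir():
            continue  # not extracted here, or already renamed
        if dst.exists():
            raise FileExistsError(f"{dst} already exists; merge or remove it first")
        src.rename(dst)
        renamed.append(old)
    return renamed
```

Running it a second time is harmless: once the folders are renamed it returns an empty list.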
| ### MICC-F220: Classic copy-move benchmark | |
| 220 images, perfect for testing copy-move detection. | |
| - **Source:** http://lci.micc.unifi.it/labd/2015/01/copy-move-forgery-detection-and-localization/ | |
| - **Size:** ~50 MB | |
| - **Download:** manual (form on the page) | |
| ### CoMoFoD: Copy-move with ground-truth masks | |
| 260 image sets with masks. Ideal for training a CNN with pixel-level supervision. | |
| - **Source:** https://www.vcl.fer.hr/comofod/ | |
| - **Size:** ~1 GB | |
| - **Download:** manual | |
| ### Coverage: Genuine + tampered pairs | |
| 100 pairs with similar-but-genuine objects (toughest case). | |
| - **Source:** https://github.com/wenbihan/coverage | |
| - **Size:** ~600 MB | |
| - **Download:** `git clone https://github.com/wenbihan/coverage.git` | |
| ### Columbia Uncompressed Image Splicing | |
| 180 spliced + 183 authentic images, lossless. | |
| - **Source:** https://www.ee.columbia.edu/ln/dvmm/downloads/AuthSplicedDataSet/AuthSplicedDataSet.htm | |
| - **Size:** ~1 GB | |
| - **Download:** manual (registration required, free) | |
| --- | |
| ## 2. Document / Legal PDF datasets | |
| ### Tobacco-3482: Real scanned legal docs | |
| 3,482 real-world scanned legal documents (clean baseline of "genuine" docs). | |
| - **Source:** https://www.kaggle.com/datasets/patrickaudriaz/tobacco3482jpg | |
| - **Size:** ~200 MB | |
| - **Download:** | |
| ```bash | |
| kaggle datasets download -d patrickaudriaz/tobacco3482jpg \ | |
| -p data/pdfs/originals --unzip | |
| ``` | |
| ### ICDAR Find-It: Document forgery challenge | |
| Official competition dataset from the Find it! fraud detection contest (scanned receipts, some of them forged). | |
| - **Source:** https://findit.univ-lr.fr/ | |
| - **Size:** ~500 MB | |
| - **Download:** manual (registration required, free) | |
| ### DocVQA / RVL-CDIP: Real bank/govt docs | |
| Massive dataset of real-world business documents. | |
| - **Source:** https://www.docvqa.org/datasets and https://www.cs.cmu.edu/~aharley/rvl-cdip/ | |
| - **Size:** ~3 GB (DocVQA) / ~37 GB (RVL-CDIP) | |
| - **Use case:** populate `originals/` with realistic genuine documents | |
| ### FUNSD: Form understanding | |
| 199 fully-annotated forms (good for layout-anomaly training). | |
| - **Source:** https://guillaumejaume.github.io/FUNSD/ | |
| - **Size:** ~50 MB | |
| --- | |
| ## 3. Financial statements / receipts / cheques | |
| ### Receipts Fraud Detection | |
| 500+ tampered and genuine receipt images. | |
| - **Source:** https://www.kaggle.com/datasets/trainingdatapro/receipts-fraud-detection-dataset | |
| - **Size:** ~100 MB | |
| - **Download:** | |
| ```bash | |
| kaggle datasets download -d trainingdatapro/receipts-fraud-detection-dataset \ | |
| -p data/statements --unzip | |
| ``` | |
| ### Bank statements dataset | |
| Realistic bank statement PDFs and images. | |
| - **Source:** https://www.kaggle.com/datasets/dedeikhsandwisaputra/bank-statements-dataset | |
| - **Size:** ~80 MB | |
| - **Download:** | |
| ```bash | |
| kaggle datasets download -d dedeikhsandwisaputra/bank-statements-dataset \ | |
| -p data/statements --unzip | |
| ``` | |
| ### IDRBT / Indian bank cheques | |
| Cheque images (Indian banking context). | |
| - **Source:** https://www.kaggle.com/datasets/arsh1207/bank-cheque-image-dataset | |
| - **Size:** ~50 MB | |
| - **Download:** | |
| ```bash | |
| kaggle datasets download -d arsh1207/bank-cheque-image-dataset \ | |
| -p data/statements --unzip | |
| ``` | |
| ### SROIE: Scanned receipts | |
| Receipt OCR + key-information extraction challenge. | |
| - **Source:** https://rrc.cvc.uab.es/?ch=13 | |
| - **Size:** ~150 MB | |
| --- | |
| ## 4. Land records (India-specific) | |
| There is no large public dataset for Indian land records, so you have three practical options: | |
| 1. **Synthesise.** The notebook already includes a `make_demo_pair()` function that generates realistic land-record images and tampered copies. You can extend this to produce hundreds of synthetic examples in minutes. | |
| 2. **Use government open data.** Some state portals publish anonymised RoR (Record of Rights) samples, for example: | |
| - Bhulekh portals (state-wise): https://bhulekh.gov.in/ (varies by state) | |
| - DigiLocker sample certificates: https://www.digilocker.gov.in/ | |
| 3. **Use Tobacco-3482 or DocVQA as a proxy** for general scanned-document forensics. The same forensic signals (error level analysis (ELA), copy-move, mixed fonts) apply to land-record scans as well. | |
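To make the synthesis option concrete, here is a minimal sketch of one kind of tampering, a copy-move forgery with its ground-truth mask. It is not the notebook's `make_demo_pair()`; `copy_move` is a hypothetical helper, and the image is a plain list of rows of grayscale values so the example needs no extra libraries. A real pipeline would more likely use NumPy or Pillow arrays.

```python
def copy_move(pixels, src, dst, size):
    """Copy a size x size patch from src to dst; return (forged, mask).

    pixels is a list of rows of grayscale values, src and dst are
    (row, col) corners. The input image is not modified.
    """
    (sr, sc), (dr, dc) = src, dst
    h, w = len(pixels), len(pixels[0])
    if not (0 <= sr and sr + size <= h and 0 <= sc and sc + size <= w):
        raise ValueError("source patch falls outside the image")
    if not (0 <= dr and dr + size <= h and 0 <= dc and dc + size <= w):
        raise ValueError("destination patch falls outside the image")
    forged = [row[:] for row in pixels]
    mask = [[0] * w for _ in range(h)]
    for i in range(size):
        for j in range(size):
            # Read from the untouched original so overlapping patches stay correct.
            forged[dr + i][dc + j] = pixels[sr + i][sc + j]
            mask[dr + i][dc + j] = 1  # 1 marks a tampered pixel
    return forged, mask


if __name__ == "__main__":
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    forged, mask = copy_move(image, src=(0, 0), dst=(2, 2), size=2)
    for row in forged:
        print(row)
```

The mask is what lets you use the synthetic pairs for pixel-level supervision as well as for image-level genuine/tampered labels.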
| --- | |
| ## 5. Kaggle CLI setup (one-time, free) | |
| ```bash | |
| pip install kaggle | |
| ``` | |
| 1. Sign up at https://www.kaggle.com | |
| 2. Open https://www.kaggle.com/settings → **Create New API Token** | |
| 3. A file `kaggle.json` will download. Place it at: | |
| - Windows: `C:\Users\<you>\.kaggle\kaggle.json` | |
| - Linux/Mac: `~/.kaggle/kaggle.json` | |
| 4. On Linux/Mac: `chmod 600 ~/.kaggle/kaggle.json` | |
| After that, all the `kaggle datasets download ...` commands above will just work. | |
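If a download still fails with an authentication error, a quick sanity check of the token file can narrow things down. The sketch below assumes the classic `kaggle.json` format with `username` and `key` fields; `check_kaggle_token` is a hypothetical helper and does not contact Kaggle.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(path=None):
    """Return a list of problems found with the token file (empty list = OK)."""
    # Path.home() resolves to C:\Users\<you> on Windows and ~ on Linux/Mac.
    path = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return [f"{path} not found"]
    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(creds, dict):
        return [f"{path} does not contain a JSON object"]
    problems = []
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing field: {field}")
    # Permission bits are only meaningful on POSIX systems (Linux/Mac).
    if os.name == "posix":
        mode = stat.S_IMODE(path.stat().st_mode)
        if mode & 0o077:
            problems.append(f"permissions are {oct(mode)}; run chmod 600")
    return problems


if __name__ == "__main__":
    issues = check_kaggle_token()
    print("token looks fine" if not issues else "\n".join(issues))
```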
| --- | |
| ## 6. Minimum data needed to train | |
| The Random Forest in Section 7.5 of the notebook will give meaningful results with: | |
| - **~50 genuine + 50 tampered images**: workable baseline | |
| - **~200 + 200**: good results, ROC-AUC typically 0.85+ | |
| - **~1000 + 1000** (e.g. full CASIA v2): production-grade results | |
| For the optional CNN in Section 7.6, target at least 200 images per class. | |
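To see which of these tiers your current data falls into, you could count the images per class before training. This is a rough sketch: the thresholds are the ones listed above, while `count_images`, `data_tier` and the suffix list are assumptions for illustration (extend the suffixes if your dataset ships `.tif` or `.bmp` files).

```python
from pathlib import Path

# Assumed image extensions, based on the .png/.jpg hint in the folder layout.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}

# (minimum images per class, label) from the list above, largest first.
TIERS = [
    (1000, "production-grade"),
    (200, "good"),
    (50, "workable baseline"),
]


def count_images(folder):
    """Count image files at any depth under folder (0 if it does not exist)."""
    folder = Path(folder)
    if not folder.is_dir():
        return 0
    return sum(
        1
        for p in folder.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def data_tier(n_genuine, n_tampered):
    """Map per-class counts to a tier label; the smaller class decides."""
    smaller = min(n_genuine, n_tampered)
    for minimum, label in TIERS:
        if smaller >= minimum:
            return label
    return "too small"


if __name__ == "__main__":
    genuine = count_images("data/images/originals")
    tampered = count_images("data/images/tampered")
    print(f"{genuine} genuine, {tampered} tampered -> {data_tier(genuine, tampered)}")
```

Using the smaller class as the deciding count is a deliberate choice here: a heavily imbalanced set behaves more like its minority class than its total size suggests.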
| --- | |
| ## 7. Quick-start recipe (fastest path to working demo) | |
| 1. `pip install kaggle` and set up the API token (Section 5 above) | |
| 2. Download CASIA v2: | |
| ```bash | |
| kaggle datasets download -d divg07/casia-20-image-tampering-detection-dataset \ | |
| -p data/images --unzip | |
| ``` | |
| 3. Rename the extracted `Au` → `originals` and `Tp` → `tampered` | |
| 4. Open `anomaly_detection_banking.ipynb` and run all cells | |
| 5. Section 7.5 will train automatically on the data you just placed | |