# VeriFile-X Dataset Registry
## Real Photo Sources
| Name | Images | Size | URL | License |
|------|--------|------|-----|---------|
| RAISE-1k | 1,000 | 3GB | http://loki.disi.unitn.it/RAISE/ | Free research |
| RAISE-8k | 8,156 | 85GB | http://loki.disi.unitn.it/RAISE/ | Free research |
| COCO 2017 val | 5,000 | 1GB | http://images.cocodataset.org/zips/val2017.zip | CC BY 4.0 |
| FFHQ | 70,000 | 90GB | https://github.com/NVlabs/ffhq-dataset | CC BY-NC-SA 4.0 |
| Unsplash Lite | 25,000 | 3GB | https://github.com/unsplash/datasets | Unsplash license |
| Open Images v7 | 9M | streaming | https://storage.googleapis.com/openimages/web/index.html | CC BY 4.0 |
| DIV2K | 1,000 | 7GB | https://data.vision.ee.ethz.ch/cvl/DIV2K/ | Free research |
| Flickr30K | 31,783 | 10GB | https://shannon.cs.illinois.edu/DenotationGraph/ | Free research |
## AI-Generated Sources
| Name | Images | Size | URL | Generator | License |
|------|--------|------|-----|-----------|---------|
| CIFAKE | 120,000 | 1.2GB | https://www.kaggle.com/datasets/birdy654/cifake-real-and-ai-generated-synthetic-images | SD 1.4 | CC BY 4.0 |
| GenImage | 1.3M | 200GB | https://github.com/GenImage-Dataset/GenImage | 8 generators | CC BY-NC |
| DiffusionDB | 14M | subset | https://huggingface.co/datasets/poloclub/diffusiondb | SD 1.4/2.0 | CC0 1.0 |
| JourneyDB | 4M | subset | https://journeydb.github.io | Midjourney v5 | CC BY-NC |
| TPDNE archive | 70,000 | 8GB | https://www.kaggle.com/datasets/xhlulu/140k-real-and-fake-faces | StyleGAN2 | Free |
| CNNDetection | 362,000 | 45GB | https://github.com/peterwang512/CNNDetection | 11 GANs | Free research |
| ArtiFact | 964,000 | 120GB | https://github.com/awsaf49/artifact | 26 generators | CC BY-NC |
| AIGI NeurIPS2023 | 300,000 | 40GB | https://github.com/Ekko-zn/AIGC-Forensics | 14 generators | Free research |
| FaceForensics++ | frames | 16GB | https://github.com/ondyari/FaceForensics | DeepFake | Free research |
| Fake-vs-Real Faces | 140,000 | 2.4GB | https://www.kaggle.com/datasets/hamzaboulahia/hardfakevsrealfaces | StyleGAN/ProGAN | CC0 |
## Target Split
- Train: 80,000 real + 80,000 AI = 160,000
- Val: 10,000 real + 10,000 AI = 20,000
- Test: 10,000 real + 10,000 AI = 20,000
- Total: 200,000 balanced images
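
The split above can be sketched as a seeded shuffle per class followed by fixed-size slices. This is a minimal illustration, not the project's actual splitting code: `balanced_split`, `TARGETS`, and the idea of passing in lists of image IDs are all assumptions made for the example.

```python
import random

# Per-class targets from the split above; each count applies to real and to AI.
TARGETS = {"train": 80_000, "val": 10_000, "test": 10_000}

def balanced_split(real_ids, ai_ids, targets=TARGETS, seed=0):
    """Return {split: {"real": [...], "ai": [...]}} with equal class counts per split."""
    need = sum(targets.values())
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits = {name: {} for name in targets}
    for label, ids in (("real", real_ids), ("ai", ai_ids)):
        if len(ids) < need:
            raise ValueError(f"need {need} {label} images, got {len(ids)}")
        pool = list(ids)
        rng.shuffle(pool)
        start = 0
        for name, count in targets.items():
            # Consecutive slices of the shuffled pool never overlap.
            splits[name][label] = pool[start:start + count]
            start += count
    return splits
```

With the default targets, each class contributes 100,000 images, which gives the 200,000 total listed above. Splitting by image ID alone does not prevent near-duplicates from the same source landing in different splits; that would need grouping by source before shuffling.
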
## Quality Rules for Real Photos
- Must have EXIF with camera make/model OR verified raw source (RAISE)
- No editing software in EXIF
- Minimum 256x256 pixels
- JPEG quality >= 50
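
The rules for real photos could be expressed as a single predicate over a per-image metadata record. The sketch below assumes a hypothetical record layout (`width`, `height`, `exif`, `source`, `jpeg_quality`) and a hypothetical list of editor names. It also assumes that "camera make/model" means both EXIF tags are present and that the JPEG quality estimate has been computed elsewhere.

```python
# Hypothetical substrings used to spot editing tools in the EXIF Software tag.
# Cameras also write firmware names to this tag, so a plain presence check is too strict.
EDITOR_HINTS = ("photoshop", "lightroom", "gimp", "snapseed")

def passes_real_photo_rules(rec, verified_raw_sources=("RAISE",)):
    """Check one metadata record (a dict with assumed keys) against the four rules."""
    exif = rec.get("exif") or {}
    has_camera = bool(exif.get("Make")) and bool(exif.get("Model"))
    from_raw = rec.get("source") in verified_raw_sources
    if not (has_camera or from_raw):
        return False  # rule 1: camera EXIF or a verified raw source
    software = str(exif.get("Software", "")).lower()
    if any(hint in software for hint in EDITOR_HINTS):
        return False  # rule 2: no editing software recorded
    if min(rec["width"], rec["height"]) < 256:
        return False  # rule 3: at least 256 pixels on each side
    quality = rec.get("jpeg_quality")  # None when the file is not a JPEG
    return quality is None or quality >= 50  # rule 4
```

Treating a missing `jpeg_quality` as a pass is a design choice for lossless formats such as PNG or TIFF; a stricter pipeline might reject records where the estimate is unavailable.
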
## Quality Rules for AI Images
- Must have verified generator label
- No camera EXIF
- Minimum 256x256 pixels
- Known source dataset (not random scrape)
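
A matching predicate for AI images might look like the sketch below. The record keys (`generator`, `source`, `exif`, `width`, `height`) are hypothetical, and `KNOWN_SOURCES` simply repeats the names from the AI-Generated Sources table. Whether a generator label is actually verified cannot be decided in code, so the example only checks that a label is present.

```python
# Dataset names copied from the AI-Generated Sources table above.
KNOWN_SOURCES = frozenset({
    "CIFAKE", "GenImage", "DiffusionDB", "JourneyDB", "TPDNE archive",
    "CNNDetection", "ArtiFact", "AIGI NeurIPS2023", "FaceForensics++",
    "Fake-vs-Real Faces",
})

def passes_ai_image_rules(rec, known_sources=KNOWN_SOURCES):
    """Check one metadata record (a dict with assumed keys) against the four rules."""
    if not rec.get("generator"):
        return False  # rule 1: a generator label must be present
    exif = rec.get("exif") or {}
    if exif.get("Make") or exif.get("Model"):
        return False  # rule 2: no camera EXIF
    if min(rec["width"], rec["height"]) < 256:
        return False  # rule 3: at least 256 pixels on each side
    return rec.get("source") in known_sources  # rule 4: known source dataset
```

For instance, a 32x32 record would be rejected by the size rule regardless of which dataset it came from.
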
## Manifest
`data/manifest.csv` is generated locally by running `python scripts/datasets/index_manual.py`. It is not tracked in git because of its size (82 MB).
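
Because the manifest is not in git, code that depends on it may want to regenerate it on demand. The helper below is a hypothetical sketch: `ensure_manifest` is not part of the repository, and it assumes the process runs from the repository root and that the indexing script writes `data/manifest.csv` without needing extra arguments.

```python
import subprocess
import sys
from pathlib import Path

MANIFEST = Path("data/manifest.csv")
INDEX_SCRIPT = Path("scripts/datasets/index_manual.py")

def ensure_manifest(manifest=MANIFEST, script=INDEX_SCRIPT, run=subprocess.run):
    """Run the indexing script if the manifest is missing; return True if it ran."""
    if manifest.exists():
        return False
    # Use the current interpreter so the script sees the same environment.
    run([sys.executable, str(script)], check=True)
    if not manifest.exists():
        raise FileNotFoundError(f"{script} finished but {manifest} was not created")
    return True
```

The `run` parameter exists only so the helper can be exercised without executing the real script.
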