# **CLIP_aievals: AI-Generated Image Detector**

This model is a CLIP-based classifier fine-tuned to detect AI-generated images across a wide range of generative models. It is trained on a mixture of real datasets (FFHQ, COCO, ImageNet, AFHQ, etc.) and synthetic datasets produced by diffusion, GAN, and hybrid architectures.

## Overview

`CLIP_aievals` is designed for robust AI-vs-real detection, combining a CLIP Vision Transformer backbone with a lightweight classification head. It is optimized for generalization to unseen generative sources and for large-scale evaluation pipelines.

This repository contains the model weights (`clip_vith14_argus.pt`) and the supporting configuration files used for inference.

---

# **Model Architecture**

### **Backbone**

* CLIP ViT-H/14 vision encoder
* Pretrained on LAION-2B
* Frozen or partially unfrozen, depending on the training configuration

### **Classifier Head**

* Two-layer MLP (see the sketch below):

  * Input: CLIP image embedding (1024-d)
  * Hidden layer: 512 units with GELU activation
  * Output layer: 1-unit sigmoid classifier producing the probability that an image is AI-generated
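
For concreteness, a minimal PyTorch sketch of a head with this shape. The class name `ClassifierHead` and the exact layer ordering are illustrative, not the repository's actual `src.model` implementation:

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Two-layer MLP mapping a CLIP embedding to P(AI-generated). Illustrative sketch."""

    def __init__(self, embed_dim: int = 1024, hidden_dim: int = 512, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),  # 1024 -> 512
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),          # 512 -> 1 logit
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        # Sigmoid turns the single logit into a probability in [0, 1].
        return torch.sigmoid(self.net(clip_embedding))
```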

### **Regularization and Calibration**

* Dropout: 0.1
* Weight decay: 1e-4
* Temperature calibration performed post hoc on validation logits (see the sketch below)
* Optional threshold tuning based on evaluation metrics or unknown-source analysis
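
A minimal sketch of what post-hoc temperature scaling on validation logits can look like. The helper name `fit_temperature` and the use of `LBFGS` are assumptions; the card does not specify the exact calibration routine:

```python
import torch

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor) -> float:
    """Fit a single scalar temperature T minimizing BCE on held-out logits."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log(T) so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)
    loss_fn = torch.nn.BCEWithLogitsLoss()

    def closure():
        optimizer.zero_grad()
        loss = loss_fn(val_logits / log_t.exp(), val_labels.float())
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# At inference time, the calibrated probability is sigmoid(logit / T).
```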

### **Training Objective**

* Binary cross-entropy
* Oversampling and class balancing across the multi-source synthetic datasets (see the sketch below)
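
A minimal sketch of inverse-frequency class balancing with PyTorch's `WeightedRandomSampler`, using placeholder tensors in place of the actual multi-source image dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Placeholder data: 0 = real, 1 = fake (stand-ins for the actual datasets).
features = torch.randn(10_000, 1024)        # e.g. precomputed CLIP embeddings
labels = (torch.rand(10_000) < 0.3).long()  # imbalanced: ~30% fake
dataset = TensorDataset(features, labels)

# Inverse-frequency weights: the rarer class is sampled more often.
class_counts = torch.bincount(labels, minlength=2).float()
sample_weights = (1.0 / class_counts)[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)

loader = DataLoader(dataset, batch_size=256, sampler=sampler)
loss_fn = torch.nn.BCEWithLogitsLoss()  # binary cross-entropy on the head's logit
```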

---

# **Datasets**

The training pipeline uses a mixture of curated datasets:

### **Real Data**

* FFHQ (70k)
* COCO (160k)
* ImageNet (90k+)
* AFHQ v1/v2 (cats, dogs, wildlife)
* DIV2K
* OpenImages

### **Fake Data**

* Stable Diffusion (v1.x, v2.x)
* Latent Diffusion Models
* StyleGAN3
* CIPS
* BigGAN
* GANformer
* CycleGAN (horse2zebra, monet2photo)
* DDPM and DDGAN
* Face Synthetics
* GLIDE
* Generative Inpainting (partial and full)

Labels are binary: `0 = real`, `1 = fake`.

---

# **Performance Summary**

Evaluated on 850k+ mixed-source images:

* ROC-AUC: 0.764
* PR-AUC (AI class): 0.612
* Global FPR (real images): 0.0073
* Accuracy: 0.693
* Precision (AI): 0.853
* Recall (AI): 0.086

Performance is dataset-dependent: the model is confident on many synthetic sources but shows lower recall on recent diffusion models with strong photorealism. The very low false-positive rate combined with low recall reflects a conservative operating point.
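
These metrics can be reproduced with scikit-learn given ground-truth labels and predicted probabilities; a sketch follows. The arrays and the 0.5 threshold are placeholders, since the card does not state the operating threshold used:

```python
import numpy as np
from sklearn.metrics import (
    roc_auc_score, average_precision_score,
    accuracy_score, precision_score, recall_score,
)

# Placeholders: y_true in {0 = real, 1 = fake}, y_prob = model probabilities.
y_true = np.random.randint(0, 2, 1000)
y_prob = np.random.rand(1000)
y_pred = (y_prob >= 0.5).astype(int)  # assumed decision threshold

print("ROC-AUC:   ", roc_auc_score(y_true, y_prob))
print("PR-AUC:    ", average_precision_score(y_true, y_prob))
print("Accuracy:  ", accuracy_score(y_true, y_pred))
print("Precision: ", precision_score(y_true, y_pred))
print("Recall:    ", recall_score(y_true, y_pred))
# FPR on real images: fraction of real (0) images predicted fake (1).
fpr = ((y_pred == 1) & (y_true == 0)).sum() / (y_true == 0).sum()
print("FPR (real):", fpr)
```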

---

# **Intended Use**

### **Primary**

* Detect whether an image is AI-generated
* Large-scale offline evaluation of generative models
* Data filtering for dataset curation
* Quality and authenticity control in multimedia pipelines

### **Secondary**

* Research on generative model detection
* Cross-model robustness evaluation

### **Not Intended For**

* Legal or forensic verification
* High-stakes decision systems
* Per-pixel or localized artifact detection

---

# **Limitations**

* Lower recall on highly realistic diffusion models.
* The model can produce false positives on:

  * Overprocessed images
  * Heavily JPEG-compressed images
  * Images with artistic filters
* Not calibrated for forensic authenticity analysis.

---

# **How to Use**

## In Python

```python
from src.model import AIImageDetector
from PIL import Image
import torch

# Build the detector around the CLIP ViT-H/14 backbone.
model = AIImageDetector(
    clip_model_name="ViT-H-14",
    device="cuda",
    dropout=0.1,
)

# Load the released weights and switch to inference mode.
model.load_state_dict(torch.load("clip_vith14_argus.pt", map_location="cpu"))
model.eval()

img = Image.open("your_image.jpg").convert("RGB")  # ensure a 3-channel input
prob = model.predict(img)  # probability that the image is AI-generated
print(prob)
```
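
For the dataset-curation use case listed under Intended Use, a sketch of batch filtering built on the example above. The `images/` directory and the 0.5 cutoff are placeholders; tune the threshold against your own FPR and recall requirements:

```python
from pathlib import Path
from PIL import Image

THRESHOLD = 0.5  # assumed cutoff; lower it to trade precision for recall

kept, flagged = [], []
for path in Path("images/").glob("*.jpg"):
    img = Image.open(path).convert("RGB")
    prob = model.predict(img)  # reuses the model loaded above
    (flagged if prob >= THRESHOLD else kept).append(path)

print(f"kept {len(kept)} images, flagged {len(flagged)} as likely AI-generated")
```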