# **CLIP_aievals: AI–Generated Image Detector**
This model is a CLIP-based classifier fine-tuned to detect AI-generated images across a wide range of generative models. It is trained on a mixture of real datasets (FFHQ, COCO, ImageNet, AFHQ, etc.) and synthetic datasets produced by diffusion models, GANs, and hybrid architectures.
## Overview
`CLIP_aievals` is designed for robust AI-vs-Real detection by leveraging a CLIP Vision Transformer backbone and a lightweight classification head. It is optimized for generalization across unseen generative sources and large-scale evaluation pipelines.
This repository contains the model weights (`clip_vith14_argus.pt`) and supporting configuration files used for inference.
---
# **Model Architecture**
### **Backbone**
* CLIP ViT-H/14 vision encoder
* Pretrained on LAION-2B
* Frozen or partially unfrozen depending on training configuration
### **Classifier Head**
* Two-layer MLP:
  * Input: CLIP image embedding (1024-d)
  * Hidden layer: 512 units with GELU activation
  * Output layer: 1-unit sigmoid classifier producing the probability of AI-generated content
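The head described above can be sketched in PyTorch as follows. This is an illustrative sketch only: the class name `ClassifierHead` and the exact layer ordering are assumptions, not the repository's implementation.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Two-layer MLP mapping a 1024-d CLIP embedding to P(AI-generated).

    Illustrative sketch; names and layer order are assumptions.
    """

    def __init__(self, embed_dim: int = 1024, hidden_dim: int = 512,
                 dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # The sigmoid turns the single logit into a probability in [0, 1]
        return torch.sigmoid(self.net(embeddings)).squeeze(-1)
```

In this setup the CLIP backbone only produces embeddings; all trainable detection capacity lives in this small head, which keeps fine-tuning cheap when the backbone is frozen.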
### **Regularization and Calibration**
* Dropout: 0.1
* Weight decay: 1e-4
* Temperature calibration performed post-hoc using validation logits
* Optional threshold tuning using evaluation metrics or unknown-source analysis
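Post-hoc temperature calibration fits a single temperature on held-out validation logits so that the predicted probabilities match observed frequencies. A minimal pure-Python sketch, using grid search over the negative log-likelihood (the repository's calibration code may differ, e.g. by using gradient-based fitting):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def nll(logits, labels, temperature: float) -> float:
    """Mean negative log-likelihood of binary labels under temperature-scaled logits."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z / temperature)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

def fit_temperature(logits, labels, grid=None) -> float:
    """Grid-search the temperature that minimizes validation NLL.

    T > 1 softens overconfident predictions; T < 1 sharpens them.
    """
    if grid is None:
        grid = [0.25 * k for k in range(1, 41)]  # 0.25 .. 10.0
    return min(grid, key=lambda t: nll(logits, labels, t))
```

Because calibration only rescales logits, it changes the reported probabilities without changing the ranking, so ROC-AUC is unaffected while thresholds become more interpretable.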
### **Training Objective**
* Binary cross-entropy
* Oversampling and class-balancing for multi-source synthetic datasets
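Oversampling across multi-source synthetic data can be implemented by weighting each sample inversely to the size of its source, so every generator contributes equally in expectation. A hedged sketch (the helper name is illustrative; in practice the weights would feed a sampler such as PyTorch's `WeightedRandomSampler`):

```python
from collections import Counter

def balanced_sample_weights(source_labels):
    """Per-sample weights that equalize the expected contribution of each source.

    Illustrative sketch: rare sources are oversampled, common ones downsampled.
    """
    counts = Counter(source_labels)
    return [1.0 / counts[s] for s in source_labels]
```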
---
# **Datasets**
The training pipeline uses a mixture of curated datasets:
### **Real Data**
* FFHQ (70k)
* COCO (160k)
* ImageNet (90k+)
* AFHQ v1/v2 (cats, dogs, wildlife)
* DIV2K
* OpenImages
### **Fake Data**
* Stable Diffusion (v1.x, v2.x)
* Latent Diffusion Models
* StyleGAN3
* CIPS
* BigGAN
* GANformer
* CycleGAN (horse2zebra, monet2photo)
* DDPM and DDGAN
* Face Synthetics
* Glide
* Generative Inpainting (partial and full)
Labels are binary: `0 = real`, `1 = fake`.
---
# **Performance Summary**
Evaluated on 850k+ mixed-source images:
* ROC-AUC: 0.764
* PR-AUC (AI class): 0.612
* Global FPR (real images): 0.0073
* Accuracy: 0.693
* Precision (AI): 0.853
* Recall (AI): 0.086
Performance is dataset-dependent: high confidence on many synthetic sources, lower recall on advanced diffusion models exhibiting strong photorealism.
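The summary numbers above follow the standard binary-classification definitions. For reference, a sketch that computes them at a fixed decision threshold (the 0.5 default here is a placeholder; the threshold used for the reported figures is not part of this card):

```python
def binary_metrics(probs, labels, threshold: float = 0.5):
    """Accuracy, precision, recall (AI class), and FPR at a decision threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0  # 1 = AI-generated, 0 = real
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }
```

The high precision (0.853) paired with low recall (0.086) indicates a conservative operating point: the model rarely flags real images, but misses many AI-generated ones.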
---
# **Intended Use**
### **Primary**
* Detect whether an image is AI-generated
* Large-scale offline evaluation of generative models
* Data filtering for dataset curation
* Quality and authenticity control in multimedia pipelines
### **Secondary**
* Research on generative model detection
* Cross-model robustness evaluation
### **Not Intended For**
* Legal or forensic verification
* High-stakes decision systems
* Per-pixel or localized artifact detection
---
# **Limitations**
* Lower recall on highly realistic diffusion models.
* Can produce false positives on:
  * Overprocessed images
  * Heavy JPEG compression
  * Artistic filters
* Not calibrated for forensic authenticity analysis.
---
# **How to Use**
## In Python
```python
from src.model import AIImageDetector
from PIL import Image
import torch

# Instantiate the detector with the same settings used during training
model = AIImageDetector(
    clip_model_name="ViT-H-14",
    device="cuda",
    dropout=0.1,
)

# Load the fine-tuned weights and switch to inference mode
model.load_state_dict(torch.load("clip_vith14_argus.pt", map_location="cpu"))
model.eval()

img = Image.open("your_image.jpg")
prob = model.predict(img)  # probability that the image is AI-generated
print(prob)
```
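Since `predict` returns a probability rather than a hard label, downstream use requires a decision threshold. Given the low default recall reported above, the cutoff is worth tuning on data representative of your deployment; the 0.5 in this sketch is a placeholder, not a recommended value:

```python
def classify(prob: float, threshold: float = 0.5) -> str:
    """Map the detector's probability to a label.

    The threshold is a placeholder and should be tuned per use case.
    """
    return "ai-generated" if prob >= threshold else "real"
```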