# **CLIP_aievals: AI-Generated Image Detector**

This model is a CLIP-based classifier fine-tuned to detect AI-generated images across a wide range of generative models. It is trained on a mixture of real datasets (FFHQ, COCO, ImageNet, AFHQ, etc.) and synthetic datasets produced by diffusion, GAN, and hybrid architectures.

## Overview

`CLIP_aievals` is designed for robust AI-vs-real detection, combining a CLIP Vision Transformer backbone with a lightweight classification head. It is optimized for generalization to unseen generative sources and for large-scale evaluation pipelines.

This repository contains the model weights (`clip_vith14_argus.pt`) and supporting configuration files used for inference.

---
# **Model Architecture**

### **Backbone**

* CLIP ViT-H/14 vision encoder
* Pretrained on LAION-2B
* Frozen or partially unfrozen depending on training configuration

### **Classifier Head**

* Two-layer MLP:
  * Input: CLIP image embedding (1024-d)
  * Hidden layer: 512 units with GELU activation
  * Output layer: 1-unit sigmoid classifier producing the probability of AI-generated content
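The head described above can be sketched in PyTorch. This is a minimal illustration of the stated dimensions (1024 → 512 → 1 with GELU and sigmoid); the class name and exact wiring in `src.model` may differ:

```python
import torch
import torch.nn as nn


class DetectorHead(nn.Module):
    """Two-layer MLP over CLIP image embeddings (illustrative sketch)."""

    def __init__(self, embed_dim: int = 1024, hidden_dim: int = 512, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        # Returns the probability that each image is AI-generated.
        return torch.sigmoid(self.net(clip_embedding)).squeeze(-1)
```

The backbone's pooled image embedding is fed directly into this head; dropout sits between the hidden and output layers as listed under Regularization below.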
### **Regularization and Calibration**

* Dropout: 0.1
* Weight decay: 1e-4
* Temperature calibration performed post hoc on validation logits
* Optional threshold tuning using evaluation metrics or unknown-source analysis
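Post-hoc temperature calibration divides logits by a learned scalar T before the sigmoid, fitted on held-out validation logits. A minimal sketch (the actual calibration code is not part of this repository):

```python
import torch
import torch.nn.functional as F


def fit_temperature(logits: torch.Tensor, labels: torch.Tensor,
                    steps: int = 200, lr: float = 0.01) -> float:
    """Fit a scalar temperature T > 0 minimizing BCE on validation logits."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log(T) so T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        scaled = logits / log_t.exp()
        loss = F.binary_cross_entropy_with_logits(scaled, labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()
```

T > 1 softens overconfident predictions; the classifier's logits are divided by the fitted T at inference time before applying the sigmoid.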
### **Training Objective**

* Binary cross-entropy
* Oversampling and class balancing across multi-source synthetic datasets
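Class balancing of the kind listed above is commonly implemented with a weighted sampler; a sketch assuming per-example binary labels (the repository's training code is not included here):

```python
import torch
from torch.utils.data import WeightedRandomSampler


def balanced_sampler(labels: torch.Tensor) -> WeightedRandomSampler:
    """Oversample the minority class so real/fake examples are drawn ~equally often."""
    class_counts = torch.bincount(labels, minlength=2).float()
    # Each example is weighted inversely to the size of its class.
    weights = 1.0 / class_counts[labels]
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
```

Passing the sampler to a `DataLoader` yields roughly class-balanced batches even when one source dominates the training mix.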
---

# **Datasets**

The training pipeline uses a mixture of curated datasets:

### **Real Data**

* FFHQ (70k)
* COCO (160k)
* ImageNet (90k+)
* AFHQ v1/v2 (cats, dogs, wildlife)
* DIV2K
* OpenImages

### **Fake Data**

* Stable Diffusion (v1.x, v2.x)
* Latent Diffusion Models
* StyleGAN3
* CIPS
* BigGAN
* GANformer
* CycleGAN (horse2zebra, monet2photo)
* DDPM and DDGAN
* Face Synthetics
* GLIDE
* Generative Inpainting (partial and full)

Labels are binary: `0 = real`, `1 = fake`.

---
# **Performance Summary**

Evaluated on 850k+ mixed-source images:

* ROC-AUC: 0.764
* PR-AUC (AI class): 0.612
* Global FPR (real images): 0.0073
* Accuracy: 0.693
* Precision (AI): 0.853
* Recall (AI): 0.086

Performance is dataset-dependent: the model flags many synthetic sources with high confidence, but recall drops markedly on recent diffusion models with strong photorealism.
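The summary metrics above follow the standard definitions; for reference, precision and recall for the AI class can be computed from binary predictions as:

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (AI) class from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

The high-precision/low-recall profile reported here is consistent with a conservative decision threshold: the model rarely mislabels real images (low FPR) at the cost of missing many AI-generated ones.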
---

# **Intended Use**

### **Primary**

* Detect whether an image is AI-generated
* Large-scale offline evaluation of generative models
* Data filtering for dataset curation
* Quality and authenticity control in multimedia pipelines

### **Secondary**

* Research on generative model detection
* Cross-model robustness evaluation

### **Not Intended For**

* Legal or forensic verification
* High-stakes decision systems
* Per-pixel or localized artifact detection

---
# **Limitations**

* Lower recall on highly realistic diffusion models.
* The model can produce false positives on:
  * Overprocessed images
  * Heavy JPEG compression
  * Artistic filters
* Not calibrated for forensic authenticity analysis.

---
# **How to Use**

## In Python

```python
from src.model import AIImageDetector
from PIL import Image
import torch

# Build the detector and load the released checkpoint.
model = AIImageDetector(
    clip_model_name="ViT-H-14",
    device="cuda",
    dropout=0.1,
)
model.load_state_dict(torch.load("clip_vith14_argus.pt", map_location="cpu"))
model.eval()

# Convert to RGB so grayscale or RGBA inputs match the CLIP preprocessing.
img = Image.open("your_image.jpg").convert("RGB")

with torch.no_grad():
    prob = model.predict(img)  # probability that the image is AI-generated
print(prob)
```