---
tags:
- image-classification
- ai-image-detection
- deepfake-detection
- frequency-analysis
- computer-vision
- pytorch
- swinv2
- srm
- dct
- fft
license: apache-2.0
datasets:
- OwensLab/CommunityForensics-Small
metrics:
- accuracy
pipeline_tag: image-classification
---
# AI-Generated Image Detector
**Multi-Branch Frequency-Aware Detector: SwinV2 + SRM + DCT + FFT**
A robust AI-generated image detector that combines **semantic understanding** with **frequency-domain forensic analysis** to detect AI-generated images from any source, including high-quality outputs from Stable Diffusion, DALL-E, Midjourney, Flux, and 4,800+ other generators.
## Architecture
This model uses a novel **4-branch fusion architecture** for maximum detection robustness:
```
                 Input Image (256×256)
                           │
     ┌──────────────┬──────┴───────┬──────────────┐
     │              │              │              │
     ▼              ▼              ▼              ▼
  SwinV2         SRM HPF     DCT Analyzer   FFT Analyzer
  (768d)         (256d)          (22d)          (36d)
     │              │              │              │
     │              └──────────────┼──────────────┘
     │                             │
     │                   Freq Features (314d)
     │                             │
     │                  Freq Projection (128d)
     │                             │
     └──────────────┬──────────────┘
                    │
    Fusion MLP (896d → 512 → 128 → 2)
                    │
           Real / AI-Generated
```
### Branch 1: SwinV2-Tiny Backbone (Semantic Features)
- Pretrained `microsoft/swinv2-tiny-patch4-window8-256`
- Captures high-level semantic inconsistencies (e.g., unnatural textures, impossible geometry)
- 768-dimensional feature vector
### Branch 2: SRM High-Pass Filter Bank (Forensic Residuals)
- **30 fixed Spatial Rich Model (SRM) filters** from image forensics literature
- Based on [Fridrich & Kodovsky (2012)](https://ieeexplore.ieee.org/document/6197267), the gold standard in steganalysis
- Includes: 1st/2nd/3rd order derivatives, Laplacians, SPAM filters, edge detectors, Gabor-like directional filters
- Detects subtle manipulation artifacts **invisible in RGB space**
- **Zero learnable parameters** in the filter bank, for maximum generalization
- Processed through a lightweight CNN encoder (30→64→128→256 channels)
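To make the idea of a fixed forensic residual concrete, here is a minimal NumPy sketch that applies a single, widely used 3×3 second-order SRM kernel. It is an illustration only, not the 30-filter bank in `train.py`; the helper name `srm_residual` is hypothetical and the loop-based filtering favors clarity over speed.

```python
import numpy as np

# One commonly used 3x3 SRM high-pass kernel (second-order).
# Its coefficients sum to zero, so flat regions produce no response.
SRM_3X3 = np.array([[-1.0,  2.0, -1.0],
                    [ 2.0, -4.0,  2.0],
                    [-1.0,  2.0, -1.0]]) / 4.0


def srm_residual(gray: np.ndarray, kernel: np.ndarray = SRM_3X3) -> np.ndarray:
    """Filter a 2D grayscale image with a fixed kernel ("valid" region only)."""
    kh, kw = kernel.shape
    out_h, out_w = gray.shape[0] - kh + 1, gray.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    # Accumulate shifted copies of the image, weighted by each kernel tap.
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * gray[i:i + out_h, j:j + out_w]
    return out


flat = np.full((8, 8), 100.0)            # constant image: no high-frequency content
ramp = np.tile(np.arange(8.0), (8, 1))   # linear gradient: cancelled by a second-order filter
noisy = flat + np.random.default_rng(0).normal(0.0, 1.0, flat.shape)

flat_res = srm_residual(flat)
ramp_res = srm_residual(ramp)
noisy_res = srm_residual(noisy)          # only the noise survives the filter
```

Because the kernel suppresses smooth image content and keeps only the noise-like remainder, the downstream CNN encoder sees the residual rather than the scene itself.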
### Branch 3: DCT Frequency Band Analysis
- **2D Discrete Cosine Transform** on 32×32 image patches
- Extracts 8 frequency band energy statistics (mean + std per band)
- Computes **spectral centroid** (center of mass of frequency distribution)
- Measures **high-to-low frequency energy ratio**, since AI images often have anomalous ratios
- Captures **DC component statistics** across patches
- 22-dimensional feature vector
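The band statistics above can be sketched in NumPy as follows. This is a rough illustration under stated assumptions: the card does not say how the 8 bands are defined, so the sketch groups coefficients by `u + v` into equal-width diagonal bands, and the function names `dct_matrix` and `dct_band_energies` are hypothetical.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis: row k holds the k-th cosine basis vector."""
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)
    return mat


def dct_band_energies(patch: np.ndarray, n_bands: int = 8) -> np.ndarray:
    """Mean squared DCT coefficient in each of `n_bands` diagonal bands."""
    n = patch.shape[0]
    d = dct_matrix(n)
    coeffs = d @ patch @ d.T                       # separable 2D DCT
    u, v = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    # Assumption: the band index grows with u + v (low to high frequency).
    band = np.minimum((u + v) * n_bands // (2 * n - 1), n_bands - 1)
    energy = coeffs ** 2
    return np.array([energy[band == b].mean() for b in range(n_bands)])


rng = np.random.default_rng(0)
smooth = np.outer(np.linspace(0, 1, 32), np.linspace(0, 1, 32))  # low-frequency patch
noise = rng.normal(size=(32, 32))                                # broadband patch

smooth_bands = dct_band_energies(smooth)
noise_bands = dct_band_energies(noise)
# High-to-low energy ratio: top half of the bands over the bottom half.
smooth_ratio = smooth_bands[4:].sum() / smooth_bands[:4].sum()
noise_ratio = noise_bands[4:].sum() / noise_bands[:4].sum()
```

A smooth patch concentrates its energy in the lowest bands, so its ratio is far smaller than that of the noise patch. Per-band mean and standard deviation across all patches of an image would then give the 16 band statistics mentioned in the list.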
### Branch 4: FFT Radial Power Spectrum
- **2D Fast Fourier Transform** with Hanning window (reduces spectral leakage)
- Azimuthally averaged power spectrum in 32 radial bins
- Measures **deviation from the natural 1/f² power law**: natural images approximately follow this law, while AI-generated images tend to deviate from it
- Extracts: log spectrum, spectral slope, intercept, residual std, residual max
- Detects **upsampling artifacts** and periodic patterns from generator architectures
- 36-dimensional feature vector
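The following NumPy sketch shows one way to compute a windowed, azimuthally averaged power spectrum and fit its log-log slope. It is an approximation of the steps listed above, not the code in `train.py`: the function names, the equal-width radial bins, and the synthetic test images are all assumptions made for illustration.

```python
import numpy as np


def radial_power_spectrum(gray, n_bins=32):
    """Return (bin centers, azimuthally averaged power) for a square grayscale image."""
    n = gray.shape[0]
    window = np.outer(np.hanning(n), np.hanning(n))    # 2D Hann window
    spectrum = np.fft.fftshift(np.fft.fft2((gray - gray.mean()) * window))
    power = np.abs(spectrum) ** 2
    yy, xx = np.indices(power.shape)
    radius = np.hypot(yy - n // 2, xx - n // 2)         # distance from zero frequency
    edges = np.linspace(0.0, n // 2, n_bins + 1)
    idx = np.digitize(radius, edges) - 1                # radial bin of every sample
    profile = np.array([power[idx == b].mean() for b in range(n_bins)])
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, profile


def spectral_slope(centers, profile):
    """Fit log-power against log-frequency, skipping the bin that contains DC."""
    log_f = np.log(centers[1:])
    log_p = np.log(profile[1:] + 1e-12)
    slope, intercept = np.polyfit(log_f, log_p, deg=1)
    residual = log_p - (slope * log_f + intercept)
    return slope, intercept, residual


# Synthetic test images: white noise, and noise shaped to a 1/f^2 power spectrum.
n = 128
rng = np.random.default_rng(0)
white = rng.normal(size=(n, n))
freq = np.hypot(np.fft.fftfreq(n)[:, None], np.fft.fftfreq(n)[None, :])
freq[0, 0] = 1.0                                        # avoid dividing by zero at DC
natural_like = np.fft.ifft2(np.fft.fft2(white) / freq).real

white_slope, _, _ = spectral_slope(*radial_power_spectrum(white))
natural_slope, _, natural_residual = spectral_slope(*radial_power_spectrum(natural_like))
```

The 1/f²-shaped image yields a slope close to -2 while white noise yields a slope close to 0. The 32 log-spectrum values plus slope, intercept, residual standard deviation, and residual maximum account for the 36 dimensions listed above.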
### Fusion
- Frequency features (SRM + DCT + FFT = 314d) → projected to 128d
- Concatenated with SwinV2 semantic features (768d) → 896d
- MLP classifier with dropout (0.3, 0.1) and label smoothing (0.1)
**Total parameters: ~28.6M** (compact enough for real-time inference)
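As a sanity check on the dimensions, here is a hedged PyTorch sketch of a fusion head with the widths given above. The class name `FusionHead`, the GELU activations, and the single-layer projection are assumptions; the card only specifies layer widths, dropout rates, and label smoothing.

```python
import torch
from torch import nn


class FusionHead(nn.Module):
    """Hypothetical fusion head mirroring the dimensions listed above."""

    def __init__(self, sem_dim=768, freq_dim=314, proj_dim=128, num_classes=2):
        super().__init__()
        # Assumption: a single linear layer projects the frequency features.
        self.freq_proj = nn.Sequential(nn.Linear(freq_dim, proj_dim), nn.GELU())
        self.classifier = nn.Sequential(
            nn.Linear(sem_dim + proj_dim, 512), nn.GELU(), nn.Dropout(0.3),
            nn.Linear(512, 128), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(128, num_classes),
        )

    def forward(self, semantic, srm, dct, fft):
        freq = torch.cat([srm, dct, fft], dim=1)                     # 256 + 22 + 36 = 314
        fused = torch.cat([semantic, self.freq_proj(freq)], dim=1)   # 768 + 128 = 896
        return self.classifier(fused)


head = FusionHead().eval()
batch = 4
logits = head(torch.randn(batch, 768), torch.randn(batch, 256),
              torch.randn(batch, 22), torch.randn(batch, 36))
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)   # label smoothing as listed above
loss = loss_fn(logits, torch.tensor([0, 1, 0, 1]))
```

Projecting the 314 frequency features down to 128 before concatenation keeps the 768-dimensional semantic vector from being the only strong signal while limiting the size of the first classifier layer.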
## Training Dataset
**[OwensLab/CommunityForensics-Small](https://huggingface.co/datasets/OwensLab/CommunityForensics-Small)** (CVPR 2025)
- **556,000 images** (278K real + 278K AI-generated)
- **4,803 different AI generators**, making it one of the most diverse training sets available
- Real images from: LAION, ImageNet, COCO, FFHQ, CelebA, MetFaces, AFHQ, and more
- AI images from: All Stable Diffusion variants, DeepFloyd, StyleGAN 1/2/3, BigGAN, VQDM, and thousands of community models
### Social Media Robustness Augmentation
During training, images are augmented with:
- **Random JPEG compression** (QF 30-95): simulates Instagram/Twitter/WhatsApp compression
- **Gaussian blur** (σ 0.1-2.0): simulates re-encoding artifacts
- **Downscale-upscale** (0.5x-0.9x): simulates re-upload quality loss
- Standard color jitter, random crops, and horizontal flips
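The three degradations above can be approximated with Pillow as in the sketch below. It is not the augmentation code from `train.py`: the function name `social_media_augment` is hypothetical, all three steps are applied unconditionally (the real pipeline may apply each with some probability), and Pillow's blur `radius` is used as a stand-in for σ.

```python
import io
import random

from PIL import Image, ImageFilter


def social_media_augment(img: Image.Image, rng: random.Random) -> Image.Image:
    """Apply downscale-upscale, blur, and JPEG re-compression in sequence."""
    w, h = img.size
    # Downscale to 50-90% of the original size, then restore the resolution.
    scale = rng.uniform(0.5, 0.9)
    small = img.resize((max(1, int(w * scale)), max(1, int(h * scale))), Image.BILINEAR)
    img = small.resize((w, h), Image.BILINEAR)
    # Gaussian blur with a radius drawn from the sigma range listed above.
    img = img.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.1, 2.0)))
    # Round-trip through an in-memory JPEG at a random quality factor.
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=rng.randint(30, 95))
    buffer.seek(0)
    return Image.open(buffer).convert("RGB")


original = Image.new("RGB", (256, 256), (120, 60, 200))
augmented = social_media_augment(original, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible per worker, which helps when debugging a training run.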
## Training
### Requirements
```bash
pip install transformers torch torchvision datasets evaluate accelerate trackio pillow scikit-learn
```
### Run Training
```bash
# Full training on GPU (recommended: A10G 24GB or better)
python train.py \
    --num_train_epochs 5 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-5 \
    --hub_model_id your-username/ai-image-detector

# Quick test run
python train.py --test_mode

# Custom settings
python train.py \
    --max_train_samples 50000 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --image_size 256
```
### Training Hyperparameters
| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Batch size | 16 per device × 4 gradient accumulation steps = 64 effective |
| Epochs | 5 |
| Precision | bf16 |
| Label smoothing | 0.1 |
| Gradient checkpointing | ✓ |
| Image size | 256×256 |
## Inference
### Single Image
```python
import torch
from train import FrequencyAwareDetector
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize
from PIL import Image

# Load model
model = FrequencyAwareDetector()
state_dict = torch.load("model_state_dict.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

# Preprocess: resize, center-crop to 256x256, then apply ImageNet normalization
transform = Compose([
    Resize((288, 288)),
    CenterCrop((256, 256)),
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = Image.open("test.jpg").convert("RGB")
pixel_values = transform(img).unsqueeze(0)  # add a batch dimension

# Predict
with torch.no_grad():
    output = model(pixel_values=pixel_values)
    probs = torch.softmax(output["logits"], dim=1)
    pred = probs.argmax(dim=1).item()

labels = {0: "Real", 1: "AI-Generated"}
print(f"Prediction: {labels[pred]} ({probs[0][pred]:.2%} confidence)")
```
### Command Line
```bash
# Single image
python inference.py --image photo.jpg
# URL
python inference.py --image https://example.com/image.png
# Batch (entire directory)
python inference.py --image_dir ./photos/
```
## Scientific Background
### Why Frequency Analysis?
AI-generated images contain subtle artifacts that are invisible to the human eye but detectable in the frequency domain:
1. **Upsampling Artifacts**: Diffusion models and GANs use transposed convolutions and upsampling layers that leave periodic patterns in the frequency spectrum
2. **1/f² Deviation**: Natural images follow a characteristic 1/f² Fourier power spectrum. AI images deviate from this, especially at mid-to-high frequencies
3. **DCT Block Patterns**: The generation process creates non-natural distributions of DCT coefficients across image patches
4. **Noise Residuals**: SRM filters reveal that AI images have fundamentally different noise patterns than camera-captured images
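The upsampling point in the list above can be checked numerically. The sketch below uses zero-insertion 2× upsampling, which is the first step of a stride-2 transposed convolution, and shows that it tiles the original spectrum; the array sizes and variable names are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
low_res = rng.normal(size=(32, 32))

# Zero-insertion 2x upsampling: keep the samples, put zeros in between.
up = np.zeros((64, 64))
up[::2, ::2] = low_res

low_spec = np.fft.fft2(low_res)
up_spec = np.fft.fft2(up)

# The 64x64 spectrum is the 32x32 spectrum tiled 2x2: every component reappears
# shifted by half the sampling rate, which shows up as a periodic grid.
tiled = np.tile(low_spec, (2, 2))
replicas_match = np.allclose(up_spec, tiled)
```

In a generator, the filters that follow the zero insertion attenuate these replicas to varying degrees, and any leftover periodic energy is the kind of pattern the FFT branch is meant to pick up.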
### Key References
1. **AIDE** (2024): "A Sanity Check for AI-generated Image Detection", [arxiv:2406.19435](https://arxiv.org/abs/2406.19435). DCT patch selection + SRM + CLIP fusion achieves 92.77% on AIGCDetectBenchmark.
2. **CommunityForensics** (CVPR 2025): "Using Thousands of Generators to Train Fake Image Detectors", [arxiv:2411.04125](https://arxiv.org/abs/2411.04125). Training on diverse generators (4,803+) dramatically improves cross-generator generalization.
3. **SRM Filters**: Fridrich & Kodovsky (2012), "Rich Models for Steganalysis of Digital Images". The standard filter bank for image forensics.
4. **UnivFD**: Ojha et al. (2023), "Towards Universal Fake Image Detectors that Generalize Across Generative Models". CLIP features for zero-shot detection.
## Repository Structure
```
├── train.py              # Full training script with model architecture
├── inference.py          # Easy-to-use inference script
├── detector_config.json  # Model configuration
├── model_state_dict.pt   # Trained weights (after training)
└── README.md             # This file
```
## License
Apache 2.0