🎬 3-Minute Live Demo
https://drive.google.com/file/d/1IhzPm4rtJI3Ok8STFGe_VkJ9gYpvpdsT
Summary
This project detects whether an image is real or AI-generated / AI-deepfake using ConvNeXtV2-Base (256×256), pretrained on ImageNet-1K.
The model was trained in two phases: first on ~400,000 images (AI vs. real) to build a strong foundation, then through a continual learning phase to adapt to the latest generative AI models.
The model detects images generated by many state-of-the-art (SOTA) generative models, including DALL-E3, Flux, Nano Banana Pro, Diffusion Models (SDXL, SD3.5...), Midjourney V6, and more.
Inference with Streamlit
Containerization with Docker (optional)
Test Score (OOD):
EvalGen (FLUX, GoT, Infinity, OmniGen, Nova. 11k samples each): 90.40% fake detection rate
Training Summary:
Training follows techniques from deep learning research literature, such as:
Layer-wise Learning Rate Decay (LLRD) to protect early layers during fine-tuning,
Cosine Annealing with a 5-7% Warmup (LinearLR),
Rehearsal Buffer (stratified datasets) for the Continual Learning phase,
Gradient Clipping, Label Smoothing...
Training and testing are optimized with Automatic Mixed Precision (AMP), leveraging Tensor Cores and other efficient CUDA optimizations.
(On a CPU, the code simply runs without AMP; if the GPU doesn't support AMP, it falls back to FP32.)
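The AMP fallback logic can be sketched roughly like this (a minimal illustration, not the project's actual `engine.py` code; the helper name is hypothetical):

```python
import torch

def autocast_context(device: torch.device):
    """Pick a mixed-precision setup that matches the hardware.

    On a CUDA device, autocast runs matmuls/convs in reduced precision on
    Tensor Cores and GradScaler guards against FP16 gradient underflow.
    On CPU (or when AMP is unusable), we return a disabled context and no
    scaler, i.e. plain FP32 execution.
    """
    if device.type == "cuda" and torch.cuda.is_available():
        return torch.amp.autocast("cuda"), torch.amp.GradScaler("cuda")
    # CPU or unsupported GPU: no-op autocast, no scaler -> pure FP32.
    return torch.amp.autocast("cpu", enabled=False), None
```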
Data augmentations include random resampling and random JPEG re-encoding (each applied with some probability) to simulate real-world images.
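Those two augmentations might look like the following sketch (function names and probability/quality ranges are illustrative assumptions, not the values in `transforms.py`):

```python
import io
import random
from PIL import Image

def random_jpeg_reencode(img: Image.Image, p: float = 0.3) -> Image.Image:
    """With probability p, round-trip the image through JPEG at a random
    quality to simulate web compression artifacts."""
    if random.random() < p:
        buf = io.BytesIO()
        img.convert("RGB").save(buf, format="JPEG",
                                quality=random.randint(40, 95))
        buf.seek(0)
        img = Image.open(buf).copy()
    return img

def random_resample(img: Image.Image, p: float = 0.3) -> Image.Image:
    """With probability p, downscale then upscale back with a random filter
    to mimic images that were resized somewhere on the web."""
    if random.random() < p:
        w, h = img.size
        scale = random.uniform(0.5, 0.9)
        small = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
        img = small.resize((w, h), resample=random.choice(
            [Image.BILINEAR, Image.BICUBIC, Image.NEAREST]))
    return img
```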
Real images consist primarily of natural photographs (COCO-style), human faces, and real-world web images from the listed datasets
Side Note: The LLRD/Cosine recipe was originally designed for fine-tuning ViTs (MAE and BEiT research papers), but it works well on ConvNeXt even though it's a CNN architecture.
In short, that's because ConvNeXt was designed to behave like a ViT while keeping the efficiency of a CNN architecture.
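The idea behind LLRD is simple: the classifier head gets the full learning rate and each earlier layer gets it multiplied by a decay factor, so pretrained early features move the least. A minimal sketch (the layer-naming scheme and `depth_of` mapping are assumed for illustration, not taken from the project's code):

```python
import torch

def llrd_param_groups(model, head_lr=2e-4, decay=0.8, weight_decay=0.02):
    """Build optimizer param groups with layer-wise LR decay.

    Each parameter gets lr = head_lr * decay ** depth, where depth grows
    toward the input. Biases and 1-D (norm) params skip weight decay,
    matching the 'filtering Bias/BN on' setting described above.
    """
    def depth_of(name: str) -> int:
        # Hypothetical naming: 'head.*' is the classifier, 'stages.N.*'
        # are the backbone stages, anything else is the stem.
        if name.startswith("head"):
            return 0
        if name.startswith("stages."):
            return 4 - int(name.split(".")[1])  # later stage -> higher LR
        return 5                                # stem: most decayed

    groups = {}
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        d = depth_of(name)
        no_decay = p.ndim == 1 or name.endswith(".bias")
        key = (d, no_decay)
        groups.setdefault(key, {
            "params": [],
            "lr": head_lr * decay ** d,
            "weight_decay": 0.0 if no_decay else weight_decay,
        })
        groups[key]["params"].append(p)
    return list(groups.values())
```

The resulting list plugs straight into `torch.optim.AdamW(groups)`.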
Project Structure
├── train/
│   ├── train.py                # Phase 1 training
│   └── Continual Learning.py   # Phase 2 continual learning
├── test/
│   └── test.py                 # Model evaluation
├── model/
│   └── convnext.py             # ConvNeXtV2 model architecture
├── checkpoints/
│   ├── checkpoint_phase1.pth
│   └── checkpoint_phase2.pth
├── inference.py                # Streamlit web interface
├── data_loaders.py             # Dataset loading utilities
├── engine.py                   # Training/testing loops
└── transforms.py               # Data augmentation pipelines
Training Setup:
Phase 1:
400k images from 11 different datasets, 8 epochs with batch-size 30 using:
AdamW(wd 0.02),
LLRD(headlr: 2e-4, lr_decay:0.8, filtering Bias/BN on),
Cosine Annealing (eta_min 0) with 5% Warmup (linearLR, lr * 0.01),
Label Smoothing (0.05),
Gradient Clipping (max norm = 1.0)
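The Phase 1 schedule (linear warmup into cosine annealing, plus label smoothing and gradient clipping) can be wired together like this (a sketch with a stand-in model and an illustrative step count, not the actual training script):

```python
import torch

model = torch.nn.Linear(8, 2)   # stand-in for the ConvNeXtV2 detector
opt = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.02)

total_steps = 1000                       # illustrative, not the real count
warmup_steps = int(0.05 * total_steps)   # 5% warmup as in Phase 1

# LinearLR ramps from lr * 0.01 up to lr; cosine then anneals to eta_min=0.
warmup = torch.optim.lr_scheduler.LinearLR(
    opt, start_factor=0.01, total_iters=warmup_steps)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    opt, T_max=total_steps - warmup_steps, eta_min=0.0)
sched = torch.optim.lr_scheduler.SequentialLR(
    opt, schedulers=[warmup, cosine], milestones=[warmup_steps])

# Label smoothing (0.05) and gradient clipping (max norm 1.0) per step:
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.05)
loss = criterion(model(torch.randn(4, 8)), torch.tensor([0, 1, 0, 1]))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
sched.step()
```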
Phase 2:
Continual Learning (I used 20k images), 5 epochs with batch-size 28 using:
Rehearsal Buffer (stratified replay 1:8),
AdamW(wd 0.02),
LLRD(headlr: 7.5e-5, lr_decay:0.85, filtering Bias/BN on),
Cosine Annealing (eta_min 1e-7; even the most-decayed LLRD layer's LR stays above that) with 7% Warmup (linearLR, lr * 0.01),
Label Smoothing (0.05),
Gradient Clipping (max norm = 1.0)
(Out of the numerous Phase 2 hyperparameter setups I experimented with, this one worked best for my 20k dataset, especially with that 1:8 replay ratio.)
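The 1:8 stratified rehearsal buffer means roughly one replayed Phase 1 image for every eight new images, sampled per class so real/fake stay balanced. A minimal sketch (the helper name and sampling details are assumptions, not the project's actual implementation):

```python
import random
from torch.utils.data import ConcatDataset, Subset

def build_replay_mix(new_ds, old_ds, old_labels, ratio=8, seed=0):
    """Mix a new dataset with a stratified rehearsal buffer.

    For every `ratio` new samples, one old sample is replayed; old samples
    are drawn per class (old_labels: 0=real, 1=fake, aligned with old_ds)
    so the buffer stays balanced.
    """
    rng = random.Random(seed)
    n_replay = max(1, len(new_ds) // ratio)
    by_class = {}
    for idx, y in enumerate(old_labels):
        by_class.setdefault(y, []).append(idx)
    picked = []
    per_class = max(1, n_replay // len(by_class))
    for idxs in by_class.values():
        picked += rng.sample(idxs, min(per_class, len(idxs)))
    return ConcatDataset([new_ds, Subset(old_ds, picked)])
```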
Datasets Used (0:real 1:fake)
Phase 1 [~400k images] (I halved the number of samples in some large datasets):
- DDA-Training-Set (arxiv:2505.14359, COCO + SD2 generated pairs) [huggingface.co/datasets/Junwei-Xi/DDA-Training-Set]
- Defactify (MS COCOAI: SD21, SDXL, SD3, DALL-E3, MidjourneyV6) [huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset]
- VisCounter_COCOAI: [huggingface.co/datasets/NasrinImp/COCO_AI]
- genimage_tiny (Midjourney, BigGAN, VQDM, SDv5, Wukong, ADM, GLIDE) [kaggle.com/datasets/yangsangtai/tiny-genimage]
- art_artai [kaggle.com/datasets/cashbowman/ai-generated-images-vs-real-images]
- Midjourney_small [kaggle.com/datasets/mariammarioma/midjourney-imagenet-real-vs-synth]
- DF40 (Deepfake) [github.com/YZY-stack/DF40]
- Gravex200k [kaggle.com/datasets/muhammadbilal6305/200k-real-vs-ai-visuals-by-mbilal]
- StyleGan2 [kaggle.com/datasets/kshitizbhargava/deepfake-face-images]
- human_faces_hass [kaggle.com/datasets/hassnainzaidi/human-faces-data-set]
- dfk_oldmonk [kaggle.com/datasets/saurabhbagchi/deepfake-image-detection]
See data_loaders.py for the train/val/test splits and sizes.
Phase 2 Continual Learning [~20k new images]:
- Super_GenAI_Dataset [kaggle.com/datasets/hiddenplant/sut-project?select=Super_GenAI_Dataset]
- midjourney-dalle-sd-nanobananapro-dataset [huggingface.co/datasets/julienlucas/midjourney-dalle-sd-nanobananapro-dataset?utm_source=chatgpt.com]
Usage/How to Run
Model Checkpoint: (Checkpoints are too large for GitHub)
- Download the checkpoint (checkpoint_phase2.pth) from [huggingface.co/xRayon/convnext-ai-images-detector/tree/main/AI%20Images%20Detector/checkpoints]
- Place it inside the checkpoints/ folder.
Inference (Web Interface):
- Open CMD in the project directory (where inference.py and requirements.txt are located).
- Install the requirements: pip install -r requirements.txt
- Run: streamlit run inference.py
(The inference interface allows threshold adjustment to control strictness)
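Threshold adjustment boils down to a decision rule like the following (a hypothetical sketch of the idea, not the code in `inference.py`; it assumes the model outputs two logits ordered [real, fake]):

```python
import torch

def classify(logits: torch.Tensor, threshold: float = 0.5) -> str:
    """Flag an image as AI-generated only if P(fake) clears the threshold.

    Raising the threshold makes the detector stricter: it needs more
    confidence before calling an image 'fake'.
    """
    p_fake = torch.softmax(logits, dim=-1)[1].item()
    return "fake" if p_fake >= threshold else "real"
```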
Docker Containerization (Optional):
- Make sure you have checkpoint_phase2.pth in your AI Images Detector/checkpoints/
- Build the image (make sure you're in the repo root, where the Dockerfile is): docker image build -t ai-image-detector .
- Run the container: docker container run -p 8501:8501 -v "PATH_HERE:/app/checkpoints" ai-image-detector (replace PATH_HERE with the full path to your local checkpoints folder; the checkpoint is mounted at runtime instead of baked into the image, which is faster and keeps the image smaller)
Requirements
- Python 3.x
- PyTorch 2.7.1
- torchvision 0.22.1
- timm 1.0.24
- streamlit 1.54.0
- See requirements.txt for the full list
Limitations
- Not trained to detect partially AI-generated or hybrid images.
- The model uses a CNN architecture rather than a large ViT, and was trained on ~400k images (a medium-sized dataset). While it performs well, it is not designed to reach SOTA-level accuracy.
- Not designed to be robust against adversarial attacks or intentional evasion.
- Performance may degrade on future generative models that were not represented in the training or continual learning phases.
If you have any feedback or notice anything that could be improved, I'd genuinely appreciate it.
Model tree for xRayon/convnext-ai-images-detector
Base model: timm/convnextv2_base.fcmae_ft_in1k