Project Thumbnail

🎬 3-Minute Live Demo

https://drive.google.com/file/d/1IhzPm4rtJI3Ok8STFGe_VkJ9gYpvpdsT

Summary

This project detects whether an image is real or AI-generated / AI-deepfake using ConvNeXtV2-Base (256×256), pretrained on ImageNet-1K.
The model was trained in two phases: first on ~400,000 images (AI vs. real) to build a strong foundation, then a continual learning phase to adapt to the latest generative AI models.

The model detects images generated by many state-of-the-art (SOTA) generative models, including DALL-E 3, FLUX, Nano Banana Pro, diffusion models (SDXL, SD3.5, ...), Midjourney V6, and more.

  • Inference with Streamlit
  • Containerization with Docker (optional)

Test Score (OOD):

EvalGen (FLUX, GoT, Infinity, OmniGen, Nova. 11k samples each): 90.40% fake detection rate

Training Summary:

Training follows techniques from the deep learning research literature, such as:
  • Layer-wise Learning Rate Decay (LLRD) to protect early layers during fine-tuning,
  • Cosine Annealing with a 5-7% warmup (LinearLR),
  • a Rehearsal Buffer (stratified datasets) for the continual learning phase,
  • Gradient Clipping and Label Smoothing.

Training/testing are optimized with Automatic Mixed Precision (AMP) using Tensor Cores and other efficient CUDA optimizations.
(On a CPU the code falls back to the non-GPU path; if the GPU doesn't support AMP, it falls back to FP32.)
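As a rough sketch of that fallback logic (illustrative names only, not the project's actual engine.py):

```python
import contextlib
import torch

# Illustrative AMP sketch: autocast + GradScaler on CUDA, plain FP32 elsewhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # FP32 fallback on CPU / unsupported GPUs

scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

def train_step(model, images, labels, optimizer, criterion):
    optimizer.zero_grad(set_to_none=True)
    ctx = torch.autocast("cuda", dtype=torch.float16) if use_amp else contextlib.nullcontext()
    with ctx:
        loss = criterion(model(images), labels)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # so gradient clipping sees the real gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

With `enabled=False` the scaler and autocast become no-ops, so the same loop runs unchanged in FP32.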
Data augmentations include random resampling and random JPEG re-encoding (each applied with some probability) to simulate real-world images.
Real images consist primarily of natural photographs (COCO-style), human faces, and real-world web images from the listed datasets.
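Those two augmentations can be sketched roughly like this (illustrative helpers, not the project's transforms.py; the probabilities, quality range, and scale range are placeholders):

```python
import io
import random
from PIL import Image

def jpeg_recompress(img: Image.Image, p: float = 0.3) -> Image.Image:
    """With probability p, re-encode as JPEG at a random quality
    to simulate web compression artifacts (illustrative sketch)."""
    if random.random() < p:
        buf = io.BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=random.randint(40, 95))
        buf.seek(0)
        img = Image.open(buf).convert("RGB")
    return img

def random_resample(img: Image.Image, p: float = 0.3) -> Image.Image:
    """With probability p, downscale then upscale back with a random
    filter to simulate resizing in the wild (illustrative sketch)."""
    if random.random() < p:
        w, h = img.size
        scale = random.uniform(0.5, 0.9)
        filt = random.choice([Image.BILINEAR, Image.BICUBIC, Image.LANCZOS])
        img = img.resize((max(1, int(w * scale)), max(1, int(h * scale))), filt)
        img = img.resize((w, h), filt)
    return img
```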

Side note: the LLRD/cosine recipe was originally designed for fine-tuning ViTs (see the MAE and BEiT papers), but it works well on ConvNeXt even though it's a CNN architecture.
In short, that's because ConvNeXt was designed to behave like a ViT while keeping the efficiency of a CNN.
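A rough sketch of LLRD parameter grouping (the "stages.<i>" naming assumes a timm-style ConvNeXt and the depth heuristic is my assumption, not the project's exact code):

```python
import torch

def llrd_param_groups(model, head_lr=2e-4, decay=0.8, weight_decay=0.02):
    """LLRD sketch: deeper (earlier) layers get lr * decay**depth;
    biases and 1-D (norm) params are excluded from weight decay."""
    groups = {}
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Crude depth heuristic: head is depth 0, stem is deepest.
        if name.startswith("head"):
            depth = 0
        elif name.startswith("stages."):
            depth = 4 - int(name.split(".")[1])  # assumes 4 stages
        else:  # stem / downsample layers
            depth = 5
        no_decay = param.ndim <= 1 or name.endswith(".bias")
        key = (depth, no_decay)
        groups.setdefault(key, {
            "params": [],
            "lr": head_lr * decay ** depth,
            "weight_decay": 0.0 if no_decay else weight_decay,
        })
        groups[key]["params"].append(param)
    return list(groups.values())

# usage: optimizer = torch.optim.AdamW(llrd_param_groups(model))
```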

Project Structure

├── train/
│   ├── train.py                  # Phase 1 training
│   └── Continual Learning.py     # Phase 2 continual learning
├── test/
│   └── test.py                   # Model evaluation
├── model/
│   └── convnext.py               # ConvNeXtV2 model architecture
├── checkpoints/
│   ├── checkpoint_phase1.pth
│   └── checkpoint_phase2.pth
├── inference.py                  # Streamlit web interface
├── data_loaders.py               # Dataset loading utilities
├── engine.py                     # Training/testing loops
└── transforms.py                 # Data augmentation pipelines

Training Setup:

Phase 1:
400k images from 11 different datasets, 8 epochs with batch size 30, using:
AdamW (weight decay 0.02),
LLRD (head lr 2e-4, lr decay 0.8, bias/norm filtering on),
Cosine Annealing (eta_min 0) with 5% warmup (LinearLR, lr × 0.01),
Label Smoothing (0.05),
Gradient Clipping (max norm = 1.0)
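The warmup-then-cosine schedule above can be sketched with stock PyTorch schedulers, using the Phase 1 numbers (total_steps is a placeholder; this is not the project's train.py):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

# Placeholder model/optimizer just to drive the schedulers.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.02)

total_steps = 10_000                    # e.g. epochs * steps_per_epoch (placeholder)
warmup_steps = int(0.05 * total_steps)  # 5% warmup

scheduler = SequentialLR(
    optimizer,
    schedulers=[
        # Linear ramp from lr * 0.01 up to the base lr.
        LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps),
        # Then cosine decay down to eta_min = 0.
        CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=0.0),
    ],
    milestones=[warmup_steps],
)
```

Each training step would call `optimizer.step()` followed by `scheduler.step()`.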

Phase 2:
Continual learning (I used 20k images), 5 epochs with batch size 28, using:
Rehearsal Buffer (stratified replay, 1:8),
AdamW (weight decay 0.02),
LLRD (head lr 7.5e-5, lr decay 0.85, bias/norm filtering on),
Cosine Annealing (eta_min 1e-7; with LLRD even the earliest layer's lr stays above that) with 7% warmup (LinearLR, lr × 0.01),
Label Smoothing (0.05),
Gradient Clipping (max norm = 1.0)

(Out of the many Phase 2 hyperparameter setups I experimented with, this one worked best for my 20k dataset, especially with that 1:8 replay.)
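One way to read the 1:8 replay, as a sketch (the helper name and my interpretation of the ratio, one rehearsal sample per eight new ones, are assumptions, not the project's code):

```python
import random

def mix_with_rehearsal(new_samples, buffer, replay_ratio=1 / 8, seed=None):
    """Rehearsal sketch: for roughly every 8 new samples, draw 1 from the
    Phase 1 buffer so old distributions stay represented. The buffer itself
    would be built by stratified sampling per source dataset and class."""
    rng = random.Random(seed)
    n_replay = max(1, int(len(new_samples) * replay_ratio))
    replay = rng.sample(buffer, min(n_replay, len(buffer)))
    mixed = list(new_samples) + replay
    rng.shuffle(mixed)
    return mixed
```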

Datasets Used (0:real 1:fake)

Phase 1 [~400k images] (I halved the number of samples in some large datasets):

  • DDA-Training-Set (arxiv:2505.14359, COCO + SD2 generated pairs)    [huggingface.co/datasets/Junwei-Xi/DDA-Training-Set]
  • Defactify (MS COCOAI: SD21, SDXL, SD3, DALL-E3, MidjourneyV6)   [huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset]
  • VisCounter_COCOAI:   [huggingface.co/datasets/NasrinImp/COCO_AI]
  • genimage_tiny (Midjourney, BigGAN, VQDM, SDv5, Wukong, ADM, GLIDE)   [kaggle.com/datasets/yangsangtai/tiny-genimage]
  • art_artai   [kaggle.com/datasets/cashbowman/ai-generated-images-vs-real-images]
  • Midjourney_small   [kaggle.com/datasets/mariammarioma/midjourney-imagenet-real-vs-synth]
  • DF40 (Deepfake)   [github.com/YZY-stack/DF40]
  • Gravex200k   [kaggle.com/datasets/muhammadbilal6305/200k-real-vs-ai-visuals-by-mbilal]
  • StyleGan2   [kaggle.com/datasets/kshitizbhargava/deepfake-face-images]
  • human_faces_hass   [kaggle.com/datasets/hassnainzaidi/human-faces-data-set ]
  • dfk_oldmonk   [kaggle.com/datasets/saurabhbagchi/deepfake-image-detection] See data_loaders.py for the train/val/test splits and sizes.

Phase 2 Continual Learning [~20k new images]:

  • Super_GenAI_Dataset   [kaggle.com/datasets/hiddenplant/sut-project?select=Super_GenAI_Dataset]
  • midjourney-dalle-sd-nanobananapro-dataset   [huggingface.co/datasets/julienlucas/midjourney-dalle-sd-nanobananapro-dataset?utm_source=chatgpt.com]

Usage/How to Run

Model Checkpoint: (Checkpoints are too large for GitHub)

  • Download my checkpoint ("checkpoint_phase2") from my [huggingface.co/xRayon/convnext-ai-images-detector/tree/main/AI%20Images%20Detector/checkpoints]
  • Place it inside my checkpoints/ folder.

Inference (Web Interface):

  • Open CMD in the project directory (where inference.py and requirements.txt are located).
  • Install the requirements: pip install -r requirements.txt
  • Run: streamlit run inference.py

(The inference interface allows threshold adjustment to control strictness)
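Conceptually, the threshold works like this (an illustrative sketch; the actual interface logic lives in inference.py):

```python
import torch

def classify(logits, threshold=0.5):
    """Threshold sketch: convert 2-class logits to P(fake) via softmax
    (class 1 = fake, matching the 0:real / 1:fake labels below) and flag
    the image when P(fake) exceeds the threshold. Raising the threshold
    makes the detector stricter about calling an image fake."""
    p_fake = torch.softmax(logits, dim=-1)[..., 1]
    return p_fake, p_fake >= threshold
```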

Docker Containerization (Optional):

  • Make sure you have checkpoint_phase2.pthin your AI Images Detector/checkpoints/
  • Build Image: docker image build -t ai-image-detector . make sure you're in the repo root (where the Dockerfile is)
  • Run Container : docker container run -p 8501:8501 -v "PATH_HERE:/app/checkpoints" ai-image-detector replace PATH_HERE with the checkpoint's full path, we'll be mounting the checkpoint instead of building it with the image, faster and better.

Requirements

  • Python 3.x
  • PyTorch 2.7.1
  • torchvision 0.22.1
  • timm 1.0.24
  • streamlit 1.54.0
  • See requirements.txt for full list

Limitations

  • Not trained to detect partially AI-generated or hybrid images.
  • The model uses a CNN architecture rather than a large ViT, and was trained on ~400k images (a medium-sized dataset). While it performs well, it is not designed to reach SOTA-level accuracy
  • Not designed to be robust against adversarial attacks or intentional evasion.
  • Performance may degrade on future generative models that were not represented in the training or continual learning phases.

If you have any feedback or notice anything that could be improved, I’d genuinely appreciate it.
