---
license: apache-2.0
pipeline_tag: image-classification
tags:
- computer-vision
- image-classification
- pytorch
library_name: pytorch
---

Official repository for the paper ["Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models"](https://arxiv.org/pdf/2602.01738).

If you have any questions, please feel free to open a discussion in the Community tab. For direct inquiries, you can also reach out to us via email at 2450042008@mails.szu.edu.cn.

# VFM Baselines Release

This directory contains the 7 vision foundation model baselines used in the paper:

- `MetaCLIP-Linear`
- `MetaCLIP2-Linear`
- `SigLIP-Linear`
- `SigLIP2-Linear`
- `PE-CLIP-Linear`
- `DINOv2-Linear`
- `DINOv3-Linear`

## Contents

- `models.py`: unified model-loading code for all 7 baselines
- `test_vfm_baselines.py`: unified evaluation script
- `weights/`: released checkpoints
- `core/vision_encoder/`: vendored PE vision encoder code required by `PE-CLIP-Linear`

## Model Names

The unified loader and test script accept these names:

- `metacliplin`
- `metaclip2lin`
- `sigliplin`
- `siglip2lin`
- `pelin`
- `dinov2lin`
- `dinov3lin`

The paper names, such as `MetaCLIP-Linear` and `DINOv3-Linear`, are also accepted.
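For intuition, the `-Linear` suffix indicates a single linear classification head trained on top of frozen backbone features. The sketch below illustrates that pattern only; the feature dimension (768), the class count (2: real vs. fake), and the `LinearProbe` name are illustrative assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of a linear probe: one linear layer over frozen backbone
# features. Dimensions here are assumptions for illustration.
class LinearProbe(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int = 2):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim) features from a frozen vision encoder
        return self.head(feats)

probe = LinearProbe(feat_dim=768)
feats = torch.randn(4, 768)   # stand-in for backbone features
logits = probe(feats)
print(tuple(logits.shape))    # (4, 2)
```

During training, only the head's parameters would be optimized while the backbone stays frozen.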
## Usage

Evaluate a single model:

```bash
python test_vfm_baselines.py \
  --model sigliplin \
  --real-dir /path/to/0_real \
  --fake-dir /path/to/1_fake \
  --max-samples 100
```

Evaluate all 7 models:

```bash
python test_vfm_baselines.py \
  --model all \
  --real-dir /path/to/0_real \
  --fake-dir /path/to/1_fake \
  --max-samples 100
```

Optional arguments:

- `--checkpoint`: override the default checkpoint for single-model evaluation
- `--batch-size`: batch size for evaluation
- `--num-workers`: number of dataloader workers
- `--device`: explicit device such as `cuda:0` or `cpu`
- `--save-json`: save results to a JSON file

## Dependencies

The release code expects these Python packages:

- `torch`
- `torchvision`
- `transformers`
- `scikit-learn`
- `Pillow`
- `timm`
- `einops`
- `ftfy`
- `regex`
- `huggingface_hub`

## Notes

- The CLIP-family and DINO-family baselines instantiate the backbone from Hugging Face model configs and then load the released checkpoint.
- `PE-CLIP-Linear` uses the vendored `core/vision_encoder` code in this directory.
- The checkpoints in `weights/` are arranged locally for packaging convenience. For public release, they can be uploaded under the same filenames.
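The `--real-dir`/`--fake-dir` flags imply a simple dataset convention: real images under one folder (label 0) and fake images under another (label 1). The helper below is a hypothetical sketch of that convention, not part of the released scripts; `collect_samples` and the extension list are assumptions.

```python
from pathlib import Path

# Assumed image extensions; the released evaluation script may differ.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}

def collect_samples(real_dir: str, fake_dir: str):
    """Gather (path, label) pairs: 0 for real images, 1 for fake images."""
    samples = []
    for label, root in ((0, real_dir), (1, fake_dir)):
        for path in sorted(Path(root).rglob("*")):
            if path.suffix.lower() in IMAGE_EXTS:
                samples.append((path, label))
    return samples
```

A dataset built this way pairs naturally with `--max-samples`, which would simply truncate the collected list before evaluation.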