---
license: mit
tags:
- multimodal
- classification
- content detection
---

# ImageBind-MLP Model

## Model Description

This is a fine-tuned ImageBind model for detecting machine-generated content across multiple modalities (text, image, and audio). The model is part of the **RU-AI** project, which introduces a large multimodal dataset for AI-generated content detection.

This model leverages ImageBind's unified embedding space to identify whether content is human-generated or machine-generated across different modalities.

## Model Details

- **Model Type:** Multimodal classification model based on ImageBind
- **Architecture:** ImageBind with MLP classifier head
- **Paper:** [RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection](https://arxiv.org/abs/2406.04906)
- **GitHub Repository:** [ZhihaoZhang97/RU-AI](https://github.com/ZhihaoZhang97/RU-AI)
- **Accepted at:** WWW'25 Resource Track
- **Modalities Supported:** Text, Image, and Audio

## Intended Use

This model is designed for detecting AI-generated content in:

- **Text:** Identifying AI-written articles, essays, responses, and general text
- **Images:** Detecting images generated by models such as Stable Diffusion, DALL-E, etc.
- **Audio:** Identifying synthetic speech from TTS models

### Use Cases

- Content moderation and authenticity verification
- Academic integrity checking
- Media forensics and fact-checking
- Research on AI-generated content detection

## Training Data

The model was trained on the **RU-AI dataset**, which includes:

- **245,895** real/human-generated samples
- **1,229,475** machine-generated samples
- Multiple data sources: COCO, Flickr8k, and the Places dataset
- AI-generated content from various models:
  - Images: Stable Diffusion (v1.5, v6.0, XL v3.0, AbsoluteReality, EpicRealism)
  - Audio: EfficientSpeech, StyleTTS2, VITS, XTTS2, YourTTS
  - Text: Various LLM-generated captions and descriptions

The dataset is publicly available on [Zenodo](https://zenodo.org/records/11406538).
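The architecture pairs ImageBind's frozen embedding space with a small MLP head that scores each embedding as human- or machine-generated. A minimal sketch of such a head in PyTorch is shown below; the hidden size, dropout rate, and layer count are illustrative assumptions, not the trained model's exact configuration (see the GitHub repository for the real implementation):

```python
import torch
import torch.nn as nn


class MLPClassifierHead(nn.Module):
    """Binary classifier head over ImageBind embeddings.

    Assumes 1024-dimensional ImageBind-Huge embeddings; the hidden
    size and dropout below are illustrative, not the trained values.
    """

    def __init__(self, embed_dim: int = 1024, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, 2),  # logits: [human, machine-generated]
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.net(embeddings)


# Toy usage: random stand-ins for a batch of 4 ImageBind embeddings.
head = MLPClassifierHead()
dummy_embeddings = torch.randn(4, 1024)
logits = head(dummy_embeddings)
print(logits.shape)  # torch.Size([4, 2])
```

Because ImageBind maps text, image, and audio into one shared space, a single head like this can in principle score inputs from any of the three modalities.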
## Requirements

### Hardware

- NVIDIA GPU with at least **16GB VRAM** (RTX 3090 24GB or higher recommended)
- At least **500GB** of disk space for the full dataset

### Software

- Python >= 3.8
- PyTorch >= 1.13.1
- CUDA >= 11.6

## Installation

```bash
# Clone the repository
git clone https://github.com/ZhihaoZhang97/RU-AI.git
cd RU-AI

# Create a virtual environment
conda create -n ruai python=3.8
conda activate ruai

# Install dependencies
pip3 install -r requirements.txt
```

## Usage

### Model Inference

```bash
# See infer_imagebind_model.py in the GitHub repository
python infer_imagebind_model.py
```

Before running inference, you need to:

1. Download the dataset or prepare your own data
2. Update the data paths in `infer_imagebind_model.py`:
   - `image_data_paths`
   - `audio_data_paths`
   - `text_data`

### Quick Start with Sample Data

```bash
# Download the Flickr8k sample data
python ./download_flickr.py

# Or download the full dataset (157GB compressed, 500GB uncompressed)
python ./download_all.py
```

## Model Performance

This model is designed to detect AI-generated content across multiple modalities simultaneously, leveraging ImageBind's unified embedding space to create joint representations across vision, text, and audio. For detailed performance metrics and evaluation results, please refer to the [paper](https://arxiv.org/abs/2406.04906).

## Limitations

- The model's performance depends on the quality and diversity of the training data
- May not generalize well to AI models or techniques not represented in the training set
- Detection accuracy may vary across different modalities
- Requires significant computational resources for inference

## Ethical Considerations

This model is intended for research and legitimate content verification purposes.
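Assuming the three variables in `infer_imagebind_model.py` are plain Python lists (their names suggest this, but check the script itself), an illustrative configuration might look like the following. The file names and directory layout here are placeholders, not part of the dataset:

```python
# Illustrative values for the variables to edit in infer_imagebind_model.py.
# The actual file names and directory layout depend on where you placed
# the downloaded data.
image_data_paths = [
    "./data/image/sample_0001.jpg",
    "./data/image/sample_0002.jpg",
]
audio_data_paths = [
    "./data/audio/sample_0001.wav",
    "./data/audio/sample_0002.wav",
]
text_data = [
    "A dog runs across a grassy field.",
    "A child flies a kite on the beach.",
]

print(len(image_data_paths), len(audio_data_paths), len(text_data))
```

Each list should contain paths (or strings, for text) for the samples you want the model to score.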
Users should:

- Consider privacy implications when analyzing user-generated content
- Be aware of potential biases in training data
- Use the model responsibly and not for censorship without human oversight
- Understand that detection is probabilistic and may produce false positives/negatives

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{huang2024ruai,
      title={RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection},
      author={Liting Huang and Zhihao Zhang and Yiran Zhang and Xiyue Zhou and Shoujin Wang},
      year={2024},
      eprint={2406.04906},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## Acknowledgments

This work builds upon:

- [ImageBind: One Embedding Space To Bind Them All](https://openaccess.thecvf.com/content/CVPR2023/papers/Girdhar_ImageBind_One_Embedding_Space_To_Bind_Them_All_CVPR_2023_paper.pdf)
- [LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment](https://arxiv.org/abs/2310.01852)

We appreciate the open-source community for the datasets and models that made this work possible.

## License

Please refer to the [GitHub repository](https://github.com/ZhihaoZhang97/RU-AI) for license information.

## Contact

For questions and issues:

- Open an issue on the [GitHub repository](https://github.com/ZhihaoZhang97/RU-AI)
- Refer to the paper for contact information of the authors