---
license: mit
tags:
- multimodal
- classification
- content detection
---

# ImageBind-MLP Model

## Model Description

This is a fine-tuned ImageBind model for detecting machine-generated content across multiple modalities (text, image, and audio). The model is part of the **RU-AI** project, which introduces a large multimodal dataset for AI-generated content detection.

This model leverages ImageBind's unified embedding space to identify whether content is human-generated or machine-generated across different modalities.

## Model Details

- **Model Type:** Multimodal classification model based on ImageBind
- **Architecture:** ImageBind with MLP classifier head
- **Paper:** [RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection](https://arxiv.org/abs/2406.04906)
- **GitHub Repository:** [ZhihaoZhang97/RU-AI](https://github.com/ZhihaoZhang97/RU-AI)
- **Accepted at:** WWW'25 Resource Track
- **Modalities Supported:** Text, Image, and Audio

## Intended Use

This model is designed for detecting AI-generated content in:

- **Text:** Identifying AI-written articles, essays, responses, and general text
- **Images:** Detecting images generated by models such as Stable Diffusion, DALL-E, etc.
- **Audio:** Identifying synthetic speech from TTS models

### Use Cases

- Content moderation and authenticity verification
- Academic integrity checking
- Media forensics and fact-checking
- Research on AI-generated content detection

## Training Data

The model was trained on the **RU-AI dataset**, which includes:

- **245,895** real/human-generated samples
- **1,229,475** machine-generated samples
- Multiple data sources: COCO, Flickr8k, and the Places dataset
- AI-generated content from various models:
  - Images: Stable Diffusion (v1.5, v6.0, XL v3.0, AbsoluteReality, EpicRealism)
  - Audio: EfficientSpeech, StyleTTS2, VITS, XTTS2, YourTTS
  - Text: Various LLM-generated captions and descriptions

The dataset is publicly available on [Zenodo](https://zenodo.org/records/11406538).
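The architecture pairs ImageBind's frozen embedding space with a small MLP head that scores each embedding as human- or machine-generated. A minimal sketch of such a head in PyTorch is shown below; the hidden size, dropout rate, and layer count are illustrative assumptions, not the trained model's exact configuration (see the GitHub repository for the real implementation):

```python
import torch
import torch.nn as nn


class MLPClassifierHead(nn.Module):
    """Binary classifier head over ImageBind embeddings.

    Assumes 1024-dimensional ImageBind-Huge embeddings; the hidden
    size and dropout below are illustrative, not the trained values.
    """

    def __init__(self, embed_dim: int = 1024, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, 2),  # logits: [human, machine-generated]
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.net(embeddings)


# Toy usage: random stand-ins for a batch of 4 ImageBind embeddings.
head = MLPClassifierHead()
dummy_embeddings = torch.randn(4, 1024)
logits = head(dummy_embeddings)
print(logits.shape)  # torch.Size([4, 2])
```

Because ImageBind maps text, image, and audio into one shared space, a single head like this can in principle score inputs from any of the three modalities.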
## Requirements

### Hardware

- NVIDIA GPU with at least **16GB VRAM** (RTX 3090 24GB or higher recommended)
- At least **500GB** of disk space for the full dataset

### Software

- Python >= 3.8
- PyTorch >= 1.13.1
- CUDA >= 11.6

## Installation

```bash
# Clone the repository
git clone https://github.com/ZhihaoZhang97/RU-AI.git
cd RU-AI

# Create a virtual environment
conda create -n ruai python=3.8
conda activate ruai

# Install dependencies
pip3 install -r requirements.txt
```

## Usage

### Model Inference

```bash
# See infer_imagebind_model.py in the GitHub repository
python infer_imagebind_model.py
```

Before running inference, you need to:

1. Download the dataset or prepare your own data
2. Update the data paths in `infer_imagebind_model.py`:
   - `image_data_paths`
   - `audio_data_paths`
   - `text_data`

### Quick Start with Sample Data

```bash
# Download the Flickr8k sample data
python ./download_flickr.py

# Or download the full dataset (157GB compressed, 500GB uncompressed)
python ./download_all.py
```

## Model Performance

This model is designed to detect AI-generated content across multiple modalities simultaneously, leveraging ImageBind's unified embedding space to create joint representations across vision, text, and audio. For detailed performance metrics and evaluation results, please refer to the [paper](https://arxiv.org/abs/2406.04906).

## Limitations

- The model's performance depends on the quality and diversity of the training data
- May not generalize well to AI models or techniques not represented in the training set
- Detection accuracy may vary across different modalities
- Requires significant computational resources for inference

## Ethical Considerations

This model is intended for research and legitimate content verification purposes.
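Assuming the three variables in `infer_imagebind_model.py` are plain Python lists (their names suggest this, but check the script itself), an illustrative configuration might look like the following. The file names and directory layout here are placeholders, not part of the dataset:

```python
# Illustrative values for the variables to edit in infer_imagebind_model.py.
# The actual file names and directory layout depend on where you placed
# the downloaded data.
image_data_paths = [
    "./data/image/sample_0001.jpg",
    "./data/image/sample_0002.jpg",
]
audio_data_paths = [
    "./data/audio/sample_0001.wav",
    "./data/audio/sample_0002.wav",
]
text_data = [
    "A dog runs across a grassy field.",
    "A child flies a kite on the beach.",
]

print(len(image_data_paths), len(audio_data_paths), len(text_data))
```

Each list should contain paths (or strings, for text) for the samples you want the model to score.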
Users should:

- Consider privacy implications when analyzing user-generated content
- Be aware of potential biases in training data
- Use the model responsibly and not for censorship without human oversight
- Understand that detection is probabilistic and may produce false positives/negatives

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{huang2024ruai,
      title={RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection},
      author={Liting Huang and Zhihao Zhang and Yiran Zhang and Xiyue Zhou and Shoujin Wang},
      year={2024},
      eprint={2406.04906},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## Acknowledgments

This work builds upon:

- [ImageBind: One Embedding Space To Bind Them All](https://openaccess.thecvf.com/content/CVPR2023/papers/Girdhar_ImageBind_One_Embedding_Space_To_Bind_Them_All_CVPR_2023_paper.pdf)
- [LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment](https://arxiv.org/abs/2310.01852)

We appreciate the open-source community for the datasets and models that made this work possible.

## License

Please refer to the [GitHub repository](https://github.com/ZhihaoZhang97/RU-AI) for license information.

## Contact

For questions and issues:

- Open an issue on the [GitHub repository](https://github.com/ZhihaoZhang97/RU-AI)
- Refer to the paper for contact information of the authors