---
license: mit
tags:
  - multimodal
  - classification
  - content detection
---

# LanguageBind-MLP Model

## Model Description

This is a fine-tuned LanguageBind model for detecting machine-generated content across multiple modalities (text, image, and audio). The model is part of the **RU-AI** project, which introduces a large multimodal dataset for AI-generated content detection.

This model leverages LanguageBind's multi-modal semantic alignment capabilities to identify whether content is human-generated or machine-generated across different modalities.

## Model Details

- **Model Type:** Multi-modal classification model based on LanguageBind
- **Architecture:** LanguageBind with MLP classifier head
- **Paper:** [RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection](https://arxiv.org/abs/2406.04906)
- **GitHub Repository:** [ZhihaoZhang97/RU-AI](https://github.com/ZhihaoZhang97/RU-AI)
- **Accepted at:** WWW'25 Resource Track
- **Modalities Supported:** Text, Image, and Audio

## Intended Use

This model is designed for detecting AI-generated content in:

- **Text:** Identifying AI-written articles, essays, responses, and general text
- **Images:** Detecting images generated by models such as Stable Diffusion, DALL-E, etc.
- **Audio:** Identifying synthetic speech produced by TTS models

### Use Cases

- Content moderation and authenticity verification
- Academic integrity checking
- Media forensics and fact-checking
- Research on AI-generated content detection

## Training Data

The model was trained on the **RU-AI dataset**, which includes:

- **245,895** real/human-generated samples
- **1,229,475** machine-generated samples
- Multiple data sources: COCO, Flickr8k, and the Places dataset
- AI-generated content from various models:
  - Images: Stable Diffusion (v1.5, v6.0, XL v3.0, AbsoluteReality, EpicRealism)
  - Audio: EfficientSpeech, StyleTTS2, VITS, XTTS2, YourTTS
  - Text: Various LLM-generated captions and descriptions

The dataset is publicly available on [Zenodo](https://zenodo.org/records/11406538).

## Requirements

### Hardware

- NVIDIA GPU with at least **16GB VRAM** (RTX 3090 24GB or higher recommended)
- At least **500GB** of disk space for the full dataset

### Software

- Python >= 3.8
- PyTorch >= 1.13.1
- CUDA >= 11.6

## Installation

```bash
# Clone the repository
git clone https://github.com/ZhihaoZhang97/RU-AI.git
cd RU-AI

# Create virtual environment
conda create -n ruai python=3.8
conda activate ruai

# Install dependencies
pip3 install -r requirements.txt
```

## Usage

### Model Inference

```bash
# See infer_languagebind_model.py in the GitHub repository
python infer_languagebind_model.py
```

Before running inference, you need to:

1. Download the dataset or prepare your own data
2. Update the data paths in `infer_languagebind_model.py`:
   - `image_data_paths`
   - `audio_data_paths`
   - `text_data`

### Quick Start with Sample Data

```bash
# Download the Flickr8k sample data
python ./download_flickr.py

# Or download the full dataset (157GB compressed, 500GB uncompressed)
python ./download_all.py
```

## Model Performance

This model is designed to detect AI-generated content across multiple modalities simultaneously, leveraging LanguageBind's language-based semantic alignment to create unified representations. For detailed performance metrics and evaluation results, please refer to the [paper](https://arxiv.org/abs/2406.04906).
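Conceptually, the model name describes the pattern at work: a LanguageBind encoder maps text, image, or audio into a shared embedding space, and a small MLP head classifies that embedding as human- or machine-generated. The sketch below is a minimal illustration of that classification-head pattern only; the embedding dimension (768), the hidden size, and the `MLPClassifierHead` class are assumptions for illustration, not the repository's actual implementation, and a random tensor stands in for a real LanguageBind embedding. See `infer_languagebind_model.py` for the actual pipeline.

```python
import torch
import torch.nn as nn

class MLPClassifierHead(nn.Module):
    """Hypothetical binary head over a multimodal embedding (human vs. machine)."""
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256, num_classes: int = 2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        # Returns unnormalized logits over the two classes.
        return self.mlp(embedding)

# In the real pipeline, `embedding` would be produced by the LanguageBind encoder
# for the relevant modality (text, image, or audio); a random tensor stands in here.
embedding = torch.randn(1, 768)

head = MLPClassifierHead()
with torch.no_grad():
    probs = torch.softmax(head(embedding), dim=-1)

labels = ["human-generated", "machine-generated"]
print(labels[int(probs.argmax(dim=-1))], probs.tolist())
```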
## Limitations

- The model's performance depends on the quality and diversity of the training data
- May not generalize well to AI models or generation techniques not represented in the training set
- Detection accuracy may vary across modalities
- Requires significant computational resources for inference

## Ethical Considerations

This model is intended for research and legitimate content verification purposes. Users should:

- Consider privacy implications when analyzing user-generated content
- Be aware of potential biases in the training data
- Use the model responsibly and not for censorship without human oversight
- Understand that detection is probabilistic and may produce false positives and false negatives

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{huang2024ruai,
      title={RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection},
      author={Liting Huang and Zhihao Zhang and Yiran Zhang and Xiyue Zhou and Shoujin Wang},
      year={2024},
      eprint={2406.04906},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## Acknowledgments

This work builds upon:

- [LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment](https://arxiv.org/abs/2310.01852)
- [ImageBind: One Embedding Space To Bind Them All](https://openaccess.thecvf.com/content/CVPR2023/papers/Girdhar_ImageBind_One_Embedding_Space_To_Bind_Them_All_CVPR_2023_paper.pdf)

We appreciate the open-source community for the datasets and models that made this work possible.

## License

Please refer to the [GitHub repository](https://github.com/ZhihaoZhang97/RU-AI) for license information.

## Contact

For questions and issues:

- Open an issue on the [GitHub repository](https://github.com/ZhihaoZhang97/RU-AI)
- Refer to the paper for author contact information