---
language:
- en
license: cc-by-nc-4.0
pipeline_tag: text-to-audio
tags:
- text-video-to-audio
- text-controlled-video-to-audio
- audio-controlled-video-to-audio
- audio-generation
---

# ControlFoley: Unified and Controllable Video-to-Audio Generation

[**Paper**](https://huggingface.co/papers/2604.15086) | [**Code**](https://github.com/xiaomi-research/controlfoley) | [**Project Page**](https://yjx-research.github.io/ControlFoley_web_page/) | [**Demo Page**](https://yjx-research.github.io/ControlFoley/)

**ControlFoley** is a unified and controllable multimodal video-to-audio (V2A) generation framework. It enables precise control over generated audio using video, text, and reference audio. Unlike previous methods, ControlFoley is specifically designed to handle complex cross-modal conflicts (e.g., when text descriptions and visual content disagree) and allows for precise timbre control using reference audio while maintaining temporal synchronization with the video.

## Capabilities

ControlFoley supports a wide range of applications through a unified framework:

- 🎬 **Text-Video-to-Audio (TV2A)**: Synchronized sound effect generation based on video content and text guidance.
- 📝 **Text-Controlled Video-to-Audio (TC-V2A)**: Prioritizes text semantics even when they conflict with the visual content.
- 🎧 **Audio-Controlled Video-to-Audio (AC-V2A)**: Generates audio where the timbre is derived from a reference audio file, synchronized with the target video.
- 🔊 **Text-to-Audio (T2A)**: Direct audio generation from text prompts without video input.

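
The capabilities above differ mainly in which inputs are supplied. The following is a minimal sketch of that mapping, not part of the ControlFoley codebase: the function name `select_task` and its parameters are hypothetical, and it covers only the input combinations described in this README. Whether other combinations are supported is not stated here, so the sketch rejects them.

```python
from typing import Optional


def select_task(
    video: Optional[str] = None,
    prompt: Optional[str] = None,
    ref_audio: Optional[str] = None,
) -> str:
    """Return the capability name matching the supplied inputs.

    Hypothetical helper for illustration only; it mirrors the capability
    list in this README and is not an API of the ControlFoley repository.
    """
    has_video = video is not None
    has_prompt = prompt is not None
    has_audio = ref_audio is not None

    # Video plus text: TV2A and TC-V2A take the same inputs; they differ in
    # whether the text agrees with or overrides the visual content.
    if has_video and has_prompt and not has_audio:
        return "TV2A or TC-V2A"
    # Video plus reference audio: timbre comes from the reference clip.
    if has_video and has_audio and not has_prompt:
        return "AC-V2A"
    # Text only: no video input.
    if has_prompt and not has_video and not has_audio:
        return "T2A"
    raise ValueError("input combination not described in this README")
```

For example, `select_task(video="assets/003.mp4", ref_audio="assets/003.wav")` returns `"AC-V2A"`.
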
## Quick Start

### Installation

```bash
# Clone the repository
git clone https://github.com/xiaomi-research/controlfoley
cd controlfoley

# Create conda environment
conda create -n controlfoley python=3.10.16
conda activate controlfoley

# Install dependencies
pip install -r requirements.txt

# Download pretrained weights
pip install huggingface-hub==0.26.2
huggingface-cli download YJX-Xiaomi/ControlFoley --resume-download --local-dir model_weights --local-dir-use-symlinks False
```

### Inference

You can run various tasks using the provided `demo.py` script:

**Text-Video-to-Audio (TV2A):**

```bash
python demo.py --video "assets/001.mp4" --prompt "the skateboard wheels scraping and grinding on the ground." --duration 8.0 --output "./output"
```

**Audio-Controlled Video-to-Audio (AC-V2A):**

```bash
python demo.py --video "assets/003.mp4" --audio "assets/003.wav" --duration 8.0 --output "./output"
```

**Text-to-Audio (T2A):**

```bash
python demo.py --prompt "A bird sings melodically in a forest." --duration 8.0 --output "./output"
```

## Key Innovations

- **Joint Visual Encoding**: Combines CLIP and CAV-MAE-ST representations to improve robustness under modality conflict.
- **Temporal-Timbre Decoupling**: Extracts acoustic style from reference audio while suppressing temporal cues to avoid affecting video synchronization.
- **Modality-Robust Training**: Uses unified representation alignment (REPA) and random modality dropout to handle diverse input combinations.

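
The random modality dropout listed above can be illustrated roughly as follows. This is a minimal sketch of the general technique, not ControlFoley's training code: the function name `drop_modalities`, the single per-modality probability `drop_prob`, and the use of `None` as a stand-in for a null condition are all assumptions made for illustration.

```python
import random
from typing import Any, Dict, Optional


def drop_modalities(
    conditions: Dict[str, Any],
    drop_prob: float = 0.3,
    rng: Optional[random.Random] = None,
) -> Dict[str, Any]:
    """Independently replace each conditioning input with None.

    Illustrative only: `drop_prob` and the None placeholder are assumed,
    not taken from the ControlFoley paper or repository.
    """
    rng = rng or random.Random()
    dropped = {}
    for name, value in conditions.items():
        # rng.random() is in [0.0, 1.0), so drop_prob=0.0 never drops
        # and drop_prob=1.0 always drops.
        dropped[name] = None if rng.random() < drop_prob else value
    return dropped
```

Called once per training sample with a dictionary such as `{"video": ..., "text": ..., "audio": ...}`, this exposes a model to every subset of inputs, which is the intuition behind handling diverse input combinations at inference time.
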
## Citation

If you find this project useful, please consider citing the following paper:

```bibtex
@misc{yang2026controlfoleyunifiedcontrollablevideotoaudio,
      title={ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling},
      author={Jianxuan Yang and Xinyue Guo and Zhi Cheng and Kai Wang and Lipan Zhang and Jinjie Hu and Qiang Ji and Yihua Cao and Yihao Meng and Zhaoyue Cui and Mengmei Liu and Meng Meng and Jian Luan},
      year={2026},
      eprint={2604.15086},
      archivePrefix={arXiv},
      primaryClass={cs.MM},
      url={https://arxiv.org/abs/2604.15086},
}
```

## License

The model weights are licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). The code is licensed under the Apache License 2.0.