	---
	language:
	- en
	license: cc-by-nc-4.0
	pipeline_tag: text-to-audio
	tags:
	- text-video-to-audio
	- text-controlled-video-to-audio
	- audio-controlled-video-to-audio
	- audio-generation
	---

	# ControlFoley: Unified and Controllable Video-to-Audio Generation

	[Paper](https://huggingface.co/papers/2604.15086) \| [Code](https://github.com/xiaomi-research/controlfoley) \| [Project Page](https://yjx-research.github.io/ControlFoley_web_page/) \| [Demo Page](https://yjx-research.github.io/ControlFoley/)

	ControlFoley is a unified, controllable multimodal video-to-audio (V2A) generation framework that conditions the generated audio on video, text, and reference audio. Unlike previous methods, it is designed to handle cross-modal conflicts (e.g., when the text description and the visual content disagree), and it can take its timbre from a reference audio clip while keeping the output temporally synchronized with the video.

	## Capabilities

	ControlFoley supports a wide range of applications through a unified framework:

	- 🎬 Text-Video-to-Audio (TV2A): Synchronized sound effect generation based on video content and text guidance.
	- 📝 Text-Controlled Video-to-Audio (TC-V2A): Prioritizes text semantics even when they conflict with the visual content.
	- 🎧 Audio-Controlled Video-to-Audio (AC-V2A): Generates audio whose timbre is derived from a reference audio file while staying synchronized with the target video.
	- 🔊 Text-to-Audio (T2A): Direct audio generation from text prompts without video input.

	## Quick Start

	### Installation

	```bash
	# Clone the repository
	git clone https://github.com/xiaomi-research/controlfoley
	cd controlfoley

	# Create conda environment
	conda create -n controlfoley python=3.10.16
	conda activate controlfoley

	# Install dependencies
	pip install -r requirements.txt

	# Download pretrained weights
	pip install huggingface-hub==0.26.2
	huggingface-cli download YJX-Xiaomi/ControlFoley --resume-download --local-dir model_weights --local-dir-use-symlinks False
	```
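
The weight download can also be scripted from Python. The following is a minimal sketch, assuming `huggingface_hub` is installed as in the step above; `snapshot_download` is that library's function for fetching a whole repository, while `download_kwargs` and `download_weights` are hypothetical helper names introduced here for illustration. The `--resume-download` and `--local-dir-use-symlinks` flags have no counterpart in the sketch, since recent releases of the library appear to treat them as deprecated.

```python
from pathlib import Path

# Repository named in the huggingface-cli command above.
REPO_ID = "YJX-Xiaomi/ControlFoley"


def download_kwargs(local_dir="model_weights"):
    """Return arguments mirroring the `huggingface-cli download` call."""
    return {"repo_id": REPO_ID, "local_dir": str(Path(local_dir))}


def download_weights(local_dir="model_weights"):
    """Fetch the full repository and return the local path."""
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(local_dir))
```

Calling `download_weights()` from the repository root should place the files in `model_weights`, the same directory the CLI command uses.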

	### Inference

	You can run various tasks using the provided `demo.py` script:

	Text-Video-to-Audio (TV2A):
	```bash
	python demo.py --video "assets/001.mp4" --prompt "the skateboard wheels scraping and grinding on the ground." --duration 8.0 --output "./output"
	```

	Audio-Controlled Video-to-Audio (AC-V2A):
	```bash
	python demo.py --video "assets/003.mp4" --audio "assets/003.wav" --duration 8.0 --output "./output"
	```

	Text-to-Audio (T2A):
	```bash
	python demo.py --prompt "A bird sings melodically in a forest." --duration 8.0 --output "./output"
	```
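
For batch runs, the commands above can be assembled programmatically. This is a minimal sketch that uses only the flags shown in the examples (`--video`, `--prompt`, `--audio`, `--duration`, `--output`); the helper names `build_demo_command` and `run_demo` are hypothetical, and the input checks are assumptions drawn from the three examples rather than rules enforced by `demo.py`.

```python
import subprocess
import sys


def build_demo_command(video=None, prompt=None, audio=None,
                       duration=8.0, output="./output"):
    """Assemble a demo.py invocation from the flags shown in the examples."""
    # Assumption: every example above passes a video, a prompt, or both.
    if video is None and prompt is None:
        raise ValueError("provide a video, a prompt, or both")
    # Assumption: reference audio appears above only alongside a video.
    if audio is not None and video is None:
        raise ValueError("reference audio requires a video in this sketch")

    cmd = [sys.executable, "demo.py"]
    if video is not None:
        cmd += ["--video", video]
    if prompt is not None:
        cmd += ["--prompt", prompt]
    if audio is not None:
        cmd += ["--audio", audio]
    cmd += ["--duration", str(duration), "--output", output]
    return cmd


def run_demo(**kwargs):
    """Run demo.py; call this from the repository root."""
    return subprocess.run(build_demo_command(**kwargs), check=True)
```

For example, `run_demo(video="assets/003.mp4", audio="assets/003.wav")` would issue the AC-V2A command shown above.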

	## Key Innovations

	- Joint Visual Encoding: Combines CLIP and CAV-MAE-ST representations to improve robustness under modality conflict.
	- Temporal-Timbre Decoupling: Extracts acoustic style from the reference audio while suppressing its temporal cues, so that they do not interfere with video synchronization.
	- Modality-Robust Training: Uses unified representation alignment (REPA) and random modality dropout to handle diverse input combinations.
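
To illustrate the general idea of random modality dropout mentioned in the last item, here is a generic sketch, not the authors' implementation. The function name `drop_modalities`, the drop probability, the use of `None` as a placeholder for a dropped input, and the rule of always keeping at least one modality are all assumptions made for illustration; the source does not specify any of them.

```python
import random


def drop_modalities(inputs, drop_prob=0.3, rng=random):
    """Randomly replace conditioning modalities with None.

    inputs maps a modality name (e.g. "video", "text", "audio") to its
    features, or to None when the sample lacks that modality.
    """
    present = [name for name, value in inputs.items() if value is not None]
    # Each present modality survives independently with probability 1 - drop_prob.
    kept = [name for name in present if rng.random() >= drop_prob]
    if present and not kept:
        # Assumption of this sketch: never drop every available modality.
        kept = [rng.choice(present)]
    return {name: (value if name in kept else None)
            for name, value in inputs.items()}
```

Applied once per training sample, a scheme like this exposes the model to every input combination (video only, text only, video plus audio, and so on), which is the kind of diversity the bullet describes.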

	## Citation

	If you find this project useful, please consider citing the following paper:

	```bibtex
	@misc{yang2026controlfoleyunifiedcontrollablevideotoaudio,
	  title={ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling},
	  author={Jianxuan Yang and Xinyue Guo and Zhi Cheng and Kai Wang and Lipan Zhang and Jinjie Hu and Qiang Ji and Yihua Cao and Yihao Meng and Zhaoyue Cui and Mengmei Liu and Meng Meng and Jian Luan},
	  year={2026},
	  eprint={2604.15086},
	  archivePrefix={arXiv},
	  primaryClass={cs.MM},
	  url={https://arxiv.org/abs/2604.15086},
	}
	```

	## License

	The model weights are licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). The code is licensed under the Apache License 2.0.