language:
  - en
license: cc-by-nc-4.0
pipeline_tag: text-to-audio
tags:
  - text-video-to-audio
  - text-controlled-video-to-audio
  - audio-controlled-video-to-audio
  - audio-generation

ControlFoley: Unified and Controllable Video-to-Audio Generation

Paper | Code | Project Page | Demo Page

ControlFoley is a unified, controllable multimodal video-to-audio (V2A) generation framework that conditions audio generation on video, text, and reference audio. Unlike previous methods, it is designed to handle cross-modal conflicts (e.g., when the text description and the visual content disagree), and it supports timbre control from a reference audio clip while keeping the generated audio temporally synchronized with the video.

Capabilities

ControlFoley supports a wide range of applications through a unified framework:

🎬 Text-Video-to-Audio (TV2A): Generates sound effects synchronized with the video, guided by a text prompt.
📝 Text-Controlled Video-to-Audio (TC-V2A): Follows the text semantics even when they conflict with the visual content.
🎧 Audio-Controlled Video-to-Audio (AC-V2A): Generates audio whose timbre comes from a reference audio file while staying synchronized with the target video.
🔊 Text-to-Audio (T2A): Generates audio directly from a text prompt, without video input.

Quick Start

Installation

# Clone the repository
git clone https://github.com/xiaomi-research/controlfoley
cd controlfoley

# Create conda environment
conda create -n controlfoley python=3.10.16
conda activate controlfoley

# Install dependencies
pip install -r requirements.txt

# Download pretrained weights
pip install huggingface-hub==0.26.2
huggingface-cli download YJX-Xiaomi/ControlFoley --resume-download --local-dir model_weights --local-dir-use-symlinks False
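
For scripted setups, the same download can also be done from Python. The sketch below assumes the `snapshot_download` function from the `huggingface_hub` package installed above; the `download_weights` helper and the `REPO_ID` constant are illustrative names, not part of the repository.

```python
# Minimal sketch: fetch the ControlFoley weights from Python instead of the CLI.
# Assumes huggingface_hub is installed (see the pip command above).

REPO_ID = "YJX-Xiaomi/ControlFoley"  # same repository as the CLI command


def download_weights(local_dir="model_weights"):
    """Download the full model repository into local_dir and return its path."""
    # Imported lazily so this module can be loaded without the package present.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_weights()` from the repository root should leave the files in `model_weights`, the same directory the CLI command writes to.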

Inference

You can run various tasks using the provided demo.py script:

Text-Video-to-Audio (TV2A):

python demo.py --video "assets/001.mp4" --prompt "the skateboard wheels scraping and grinding on the ground." --duration 8.0 --output "./output"

Audio-Controlled Video-to-Audio (AC-V2A):

python demo.py --video "assets/003.mp4" --audio "assets/003.wav" --duration 8.0 --output "./output"

Text-to-Audio (T2A):

python demo.py --prompt "A bird sings melodically in a forest." --duration 8.0 --output "./output"
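
The three commands above differ only in which inputs are passed. As a rough sketch, a small Python helper can assemble the matching `demo.py` command line for batch runs. The flag names come from the commands above; the `build_demo_command` helper is hypothetical, and its input checks (at least one of video or prompt, reference audio only together with a video) are assumptions drawn from the examples shown here, not documented behavior of `demo.py`.

```python
import shlex


def build_demo_command(output, duration=8.0, video=None, prompt=None, audio=None):
    """Return the demo.py argument list for the given combination of inputs."""
    # Assumption: every example above passes a video, a prompt, or both.
    if video is None and prompt is None:
        raise ValueError("pass at least one of video or prompt")
    # Assumption: reference audio is only shown together with a target video.
    if audio is not None and video is None:
        raise ValueError("reference audio requires a target video")

    cmd = ["python", "demo.py"]
    if video is not None:
        cmd += ["--video", video]
    if prompt is not None:
        cmd += ["--prompt", prompt]
    if audio is not None:
        cmd += ["--audio", audio]
    cmd += ["--duration", str(duration), "--output", output]
    return cmd


# Print shell-ready commands for the AC-V2A and T2A examples.
ac_v2a = build_demo_command("./output", video="assets/003.mp4", audio="assets/003.wav")
t2a = build_demo_command("./output", prompt="A bird sings melodically in a forest.")
print(shlex.join(ac_v2a))
print(shlex.join(t2a))
```

To execute a command rather than print it, pass the returned list to `subprocess.run(cmd, check=True)` from the repository root.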

Key Innovations

Joint Visual Encoding: Combines CLIP and CAV-MAE-ST representations to improve robustness under modality conflict.
Temporal-Timbre Decoupling: Extracts acoustic style (timbre) from the reference audio while suppressing its temporal cues, so that the reference does not disturb synchronization with the video.
Modality-Robust Training: Uses unified representation alignment (REPA) and random modality dropout to handle diverse input combinations.
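
To illustrate the idea behind random modality dropout, here is a minimal sketch, not the authors' implementation. It assumes each conditioning input is dropped independently with a fixed probability and that a dropped input is represented by `None`; the `drop_modalities` name, the `p_drop` value, and the placeholder feature strings are all hypothetical. Real training code would more likely substitute a learned or zero "empty" embedding for a dropped modality.

```python
import random


def drop_modalities(conditions, p_drop=0.3, rng=None):
    """Independently replace each conditioning input with None with probability p_drop.

    conditions maps a modality name (e.g. "video", "text", "audio") to its features.
    """
    if rng is None:
        rng = random
    return {
        name: (None if rng.random() < p_drop else value)
        for name, value in conditions.items()
    }


# Example: one training sample with all three conditions present.
sample = {"video": "video_features", "text": "text_features", "audio": "ref_audio_features"}
rng = random.Random(0)  # seeded so repeated runs drop the same modalities
for step in range(3):
    print(step, drop_modalities(sample, p_drop=0.3, rng=rng))
```

Across many steps the model sees every subset of inputs, including none at all, which is what lets a single network serve the TV2A, AC-V2A, and T2A settings at inference time.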

Citation

If you find this project useful, please consider citing the following paper:

@misc{yang2026controlfoleyunifiedcontrollablevideotoaudio,
  title={ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling}, 
  author={Jianxuan Yang and Xinyue Guo and Zhi Cheng and Kai Wang and Lipan Zhang and Jinjie Hu and Qiang Ji and Yihua Cao and Yihao Meng and Zhaoyue Cui and Mengmei Liu and Meng Meng and Jian Luan},
  year={2026},
  eprint={2604.15086},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2604.15086}, 
}

License

The model weights are licensed under CC BY-NC 4.0. The code is licensed under the Apache License 2.0.