---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: text-to-audio
tags:
- text-video-to-audio
- text-controlled-video-to-audio
- audio-controlled-video-to-audio
- audio-generation
library_name: diffusers
---




<!-- ## **ControlFoley** -->


<div align="center">

# ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

<p align="center">
  <a href="https://arxiv.org/abs/2604.15086" style="text-decoration:none"><img src="https://img.shields.io/badge/arXiv-2604.15086-b31b1b.svg" alt="arXiv"/></a>
  &nbsp;
  <a href="https://github.com/xiaomi-research/controlfoley" style="text-decoration:none"><img src="https://img.shields.io/badge/GitHub.io-Code-blue?logo=Github&style=flat-square" alt="GitHub"/></a>
  &nbsp;
  <a href="https://yjx-research.github.io/ControlFoley_web_page/" style="text-decoration:none"><img src="https://img.shields.io/badge/Project Page-Project-blue" alt="Project Page"/></a>
  &nbsp;
  <a href="https://yjx-research.github.io/ControlFoley/" style="text-decoration:none"><img src="https://img.shields.io/badge/Demo Page-Demo-blue" alt="Demo Page"/></a>
  &nbsp;
  <a href="https://huggingface.co/YJX-Xiaomi/ControlFoley" style="text-decoration:none"><img src="https://img.shields.io/badge/HuggingFace-Models-orange?logo=huggingface" alt="Hugging Face"/></a>
  &nbsp;
  <a href="https://clawhub.ai/yjx-research/controlfoley-audio-generator" style="text-decoration:none"><img src="https://img.shields.io/badge/ClawHub-ClawHub-red" alt="ClawHub"/></a>
</p>

</div>

<p align="center">
If you find this project useful, please consider giving it a star ⭐️
</p>


<div align="center">

<hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">

### 👥 **Authors**

<div>
    <!-- Row 1: 6 authors -->
    <div style="margin-bottom: 2px;">
        Jianxuan Yang<sup>1*†</sup>,&nbsp;
        Xinyue Guo<sup>1*</sup>,&nbsp;
        Zhi Cheng<sup>1,2</sup>,&nbsp;
        Kai Wang<sup>1,2</sup>,&nbsp;
        Lipan Zhang<sup>1</sup>,&nbsp;
        Jinjie Hu<sup>1</sup>
    </div>
    <!-- Row 2: 7 authors -->
    <div>
        Qiang Ji<sup>1</sup>,&nbsp;
        Yihua Cao<sup>1</sup>,&nbsp;
        Yihao Meng<sup>1,2</sup>,&nbsp;
        Zhaoyue Cui<sup>1,2</sup>,&nbsp;
        Mengmei Liu<sup>1</sup>,&nbsp;
        Meng Meng<sup>1</sup>,&nbsp;
        Jian Luan<sup>1</sup>
    </div>
</div>
<!-- Affiliations -->
<div>
    <sup>1</sup>MiLM Plus, Xiaomi Inc. &nbsp;&nbsp; <sup>2</sup>Wuhan University
    <br>
    *Equal contribution &nbsp;&nbsp; †Corresponding author
</div>
</div>

<hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">

## 📰 **News**

- [2026-04] Technical report released on [arXiv](https://arxiv.org/abs/2604.15086).
- [2026-04] [Project page](https://yjx-research.github.io/ControlFoley_web_page/) is now live.
- [2026-04] [Inference code](https://github.com/xiaomi-research/controlfoley) and [pretrained models](https://huggingface.co/YJX-Xiaomi/ControlFoley) are released.
- [2026-04] Online demo is available on [Project Inference Interface](https://yjx-research.github.io/ControlFoley_web_page/#try-gen).
- [2026-04] Skill [ControlFoley Audio Generator](https://clawhub.ai/yjx-research/controlfoley-audio-generator) released.

<hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">

## 🔄 **Updates**

- [x] Release technical report on arXiv.
- [x] Launch project page.
- [x] Release inference code and pretrained models.
- [x] Launch online inference demo (available on project page).
- [x] Release skill.

<hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">

## 📺 **Intro Video**

https://github.com/user-attachments/assets/d63e9837-a568-4521-9009-58b4105214a9

For more results, see the [Project Page](https://yjx-research.github.io/ControlFoley_web_page/). For comparisons with other methods, see the [Demo Page](https://yjx-research.github.io/ControlFoley/).

<hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">

## 🎧 **Overview**

ControlFoley is a unified and controllable multimodal video-to-audio (V2A) generation framework that enables precise control over generated audio using video, text, and reference audio.

Unlike existing methods that rely on a single modality or struggle under conflicting inputs, ControlFoley is designed to handle complex multimodal interactions and maintain strong controllability even when modalities are inconsistent.

<hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">

## 🎨 **Teaser Figure**

<div align="center">
    <img src="assets/tease.png" width="100%">
    <p style="margin-top: 8px; text-align: center; font-style: italic;">
        Left: Overview of the ControlFoley framework with three multimodal conditioning modes for controllable video-synchronized audio generation. Right: Performance radar chart of Video-to-Audio models.
    </p>
</div>

<hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">

## 🚀 **Capabilities**

ControlFoley supports a wide range of applications:

- 🎬 <strong>Text-Video-to-Audio Generation (TV2A)</strong><br>
  Video-content-adaptive dubbing and synchronized sound effect generation under text guidance.

- 📝 <strong>Text-Controlled Video-to-Audio (TC-V2A)</strong><br>
  Audio generation under video–text conflicts, with semantics that follow the text prompt and timing that stays synchronized with the video content.

- 🎧 <strong>Audio-Controlled Video-to-Audio (AC-V2A)</strong><br>
  Audio generation conditioned on reference audio, with timbre that matches the reference and timing that stays synchronized with the video content.

- 📝 <strong>Text-to-Audio Generation (T2A)</strong><br>
  Audio generation directly from text prompts, as an additional capability of the unified framework.

<hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">

## 🧠 **Key Innovations**

<div align="center">
    <img src="assets/controlfoley.png" width="100%">
</div>

- <strong>Joint Visual Encoding for Robust Multimodal Control:</strong>
  Combines CLIP and CAV-MAE-ST representations to capture both vision-language and audio-visual correlations, improving robustness under modality conflict.

- <strong>Timbre-Focused Reference Audio Control:</strong>
  Extracts global timbre representations while suppressing temporal cues, enabling precise acoustic style control without affecting synchronization.

- <strong>Modality-Robust Training with Unified Alignment:</strong>
  Introduces all-modality dropout and a unified REPA objective to improve robustness across diverse modality combinations.

- <strong>VGGSound-TVC Benchmark:</strong>
  A new benchmark for evaluating textual controllability under visual-text semantic conflicts.
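
The modality dropout idea mentioned above can be illustrated with a minimal sketch. It assumes each conditioning input is dropped independently with a fixed probability; the function name `drop_modalities`, the probability `p_drop`, the placeholder feature strings, and the independence assumption are all illustrative and are not taken from the ControlFoley code or paper, whose actual scheme may differ.

```python
import random


def drop_modalities(conditions, p_drop=0.1, rng=random):
    """Replace each conditioning input with None with probability p_drop.

    Hypothetical helper: each modality is dropped independently, so any
    combination (including all modalities at once) can occur during training.
    """
    return {
        name: (None if rng.random() < p_drop else value)
        for name, value in conditions.items()
    }


# Placeholder strings stand in for real feature tensors.
batch = {
    "video": "video_features",
    "text": "text_features",
    "audio": "reference_audio_features",
}

rng = random.Random(0)  # seeded generator for reproducible sampling
print(drop_modalities(batch, p_drop=0.5, rng=rng))
```

Training on such randomly masked inputs is one common way to make a model tolerate missing conditions at inference time, for example a video-only or text-only request.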

<hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">

## 🧪 **VGGSound-TVC Benchmark**

We propose VGGSound-TVC to evaluate text controllability under varying levels of visual-text conflict. In this dataset, the textual description of each video is rewritten according to the following rules.

- L0 → No conflict, where the textual description is consistent with the video content.
- L1_subject → A mild semantic conflict introduced at the subject level, where the action description remains unchanged while the sounding subject is replaced.
- L1_action → A mild semantic conflict introduced at the action level, where the subject remains unchanged while the action description is modified.
- L2 → A moderate semantic conflict in which the textual description belongs to a different semantic category while still maintaining a similar temporal structure or acoustic rhythm.
- L3 → Strong conflict, where the textual description is randomly substituted.

This enables systematic analysis of modality dominance and controllability under increasing inconsistency. Example samples from VGGSound-TVC are shown below.
<div align="center">
    <img src="assets/benchmark.png" width="100%">
</div>
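
As a rough illustration of the L3 level, the sketch below replaces each caption with one drawn at random from a different sample. It assumes that "randomly substituted" means drawing from the other samples' descriptions; the helper `random_substitute` and the example captions are hypothetical and are not part of the released benchmark tooling.

```python
import random


def random_substitute(captions, rng):
    """Return a list where each caption is replaced by another sample's caption."""
    if len(captions) < 2:
        raise ValueError("need at least two captions to substitute")
    substituted = []
    for i in range(len(captions)):
        # Draw an index from the other len(captions) - 1 positions.
        j = rng.randrange(len(captions) - 1)
        if j >= i:
            j += 1  # shift past the sample's own caption
        substituted.append(captions[j])
    return substituted


# Hypothetical captions, for illustration only.
captions = ["dog barking", "man whistling", "bird singing"]
conflicting = random_substitute(captions, random.Random(0))
for original, replaced in zip(captions, conflicting):
    print(f"{original!r} -> {replaced!r}")
```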

<hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">

## 📊 **Performance**

ControlFoley achieves strong performance across multiple V2A tasks, demonstrating both high generation quality and robust controllability.

🎬 <strong>TV2A</strong>

ControlFoley achieves state-of-the-art performance across multiple benchmarks, including VGGSound-Test, Kling-Audio-Eval, and MovieGen-Audio-Bench.

- Highest CLAP scores (better semantic alignment)
- Lowest DeSync (better temporal synchronization)
- Best overall IS (better audio quality), with up to 27% relative improvement (22.08 vs. 17.36 on VGGSound).
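
The 27% figure follows from the two IS values quoted above. A short check, with `relative_improvement` as an illustrative helper name:

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


# IS values quoted above for VGGSound.
gain = relative_improvement(22.08, 17.36)
print(f"{gain:.1%}")  # prints 27.2%
```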

<div align="center">
    <img src="assets/result1.png" width="80%">
</div>

📝 <strong>TC-V2A</strong>

ControlFoley demonstrates strong textual controllability under increasing visual-text conflict.

- Maintains high CLAP (text alignment) across conflict levels  
- Effectively reduces IB under conflict (less reliance on visual bias)  
- Achieves better balance between controllability and generation quality  

<div align="center">
    <img src="assets/result2.png" width="60%">
</div>

🎧 <strong>AC-V2A</strong>

ControlFoley achieves the best performance across all evaluation metrics on the Greatest Hits dataset.

- Better timbre similarity (Resemblyzer)  
- Better synchronization (DeSync)  
- Higher audio quality (IS)  
  
Notably, it outperforms CondFoleyGen, a specialized in-domain baseline, demonstrating strong generalization ability.

<div align="center">
    <img src="assets/result3.png" width="50%">
</div>

##
ControlFoley also demonstrates competitive or superior performance compared to strong proprietary systems such as Kling-Foley, highlighting its effectiveness as an open and controllable solution.

<hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">

## 🛠 **Quick Start**

### 🔑 **Prerequisites**

- Python 3.10+
- PyTorch 2.5.1+
- CUDA 11.8+
- FFmpeg (`conda install -c conda-forge ffmpeg`)

### 🧱 **Installation**

```bash
# Clone the repository
git clone https://github.com/xiaomi-research/controlfoley
cd controlfoley

# Create conda environment
conda create -n controlfoley python=3.10.16
conda activate controlfoley

# Install dependencies
pip install -r requirements.txt

# Download pretrained weights
pip install huggingface-hub==0.26.2
huggingface-cli download YJX-Xiaomi/ControlFoley --resume-download --local-dir model_weights --local-dir-use-symlinks False
```

Alternatively, download the weights manually from [here](https://huggingface.co/YJX-Xiaomi/ControlFoley/tree/main/) and place them in the `model_weights` folder.

### 🎨 **Inference**

```
python demo.py [OPTIONS]

Options:
  --video            TEXT       Path to the input video file. (default: None)
  --audio            TEXT       Path to the input reference audio file. (default: None)
  --prompt           TEXT       Textual prompt for audio generation. (default: None)
  --negative_prompt  TEXT       Negative textual prompt for audio generation. (default: None)
  --duration         FLOAT      Duration of the generated audio in seconds. (default: 8.0)
  --output           TEXT       Output directory for generated audio files. (default: ./output)
```

### 📌 **Supported Tasks**

| Task   | video      | audio      | prompt   |
|--------|------------|------------|----------|
| TV2A   | required   | None       | required |
| TC-V2A | required   | None       | required |
| AC-V2A | required   | required   | optional |
| V2A    | required   | None       | None     |
| T2A    | None       | None       | required |
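
The table can be read as a simple decision rule on which inputs are supplied. The sketch below encodes it; `infer_task` is a hypothetical helper and is not part of `demo.py`. TV2A and TC-V2A take the same inputs and differ only in whether the prompt agrees with the video, so the helper cannot tell them apart.

```python
def infer_task(video=None, audio=None, prompt=None):
    """Map the supplied inputs to a task name, following the table above."""
    if video is None:
        if prompt is not None and audio is None:
            return "T2A"
        raise ValueError("without a video, only a text prompt (T2A) is supported")
    if audio is not None:
        return "AC-V2A"  # the prompt is optional for this task
    if prompt is not None:
        return "TV2A or TC-V2A"  # same inputs; depends on prompt/video agreement
    return "V2A"


print(infer_task(video="assets/003.mp4", audio="assets/003.wav"))
print(infer_task(prompt="A bird sings melodically in a forest."))
```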

### 📋 **Usage Examples**

- TV2A

```bash
python demo.py --video "assets/001.mp4" --prompt "the skateboard wheels scraping and grinding on the ground." --duration 8.0 --output "./output"
```

- TC-V2A

```bash
python demo.py --video "assets/002.mp4" --prompt "man whistling." --duration 8.0 --output "./output"
```

- AC-V2A

```bash
python demo.py --video "assets/003.mp4" --audio "assets/003.wav" --duration 8.0 --output "./output"
```

- V2A

```bash
python demo.py --video "assets/004.mp4" --duration 8.0 --output "./output"
```

- T2A

```bash
python demo.py --prompt "A bird sings melodically in a forest." --duration 8.0 --output "./output"
```
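
For scripted or batch use, the same commands can be assembled in Python. The sketch below uses only the flags documented above; `build_command` is a hypothetical wrapper rather than something the repository provides, and it assumes the command is run from the repository root.

```python
import shlex


def build_command(video=None, audio=None, prompt=None, negative_prompt=None,
                  duration=8.0, output="./output"):
    """Build the argument list for demo.py from the documented options."""
    cmd = ["python", "demo.py"]
    optional = [
        ("--video", video),
        ("--audio", audio),
        ("--prompt", prompt),
        ("--negative_prompt", negative_prompt),
    ]
    for flag, value in optional:
        if value is not None:  # omit flags whose value was not supplied
            cmd += [flag, value]
    cmd += ["--duration", str(duration), "--output", output]
    return cmd


cmd = build_command(video="assets/004.mp4")
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with prompts that contain spaces or punctuation.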

<hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">

## 📝 **Citation**

If you find this repository useful, please consider citing our paper:

```bibtex
@misc{yang2026controlfoleyunifiedcontrollablevideotoaudio,
  title={ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling}, 
  author={Jianxuan Yang and Xinyue Guo and Zhi Cheng and Kai Wang and Lipan Zhang and Jinjie Hu and Qiang Ji and Yihua Cao and Yihao Meng and Zhaoyue Cui and Mengmei Liu and Meng Meng and Jian Luan},
  year={2026},
  eprint={2604.15086},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2604.15086}, 
}
```

<hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">

## 🔒 **License**

This repository is licensed under the [Apache License 2.0](./LICENSE), and the [model weights](https://huggingface.co/YJX-Xiaomi/ControlFoley/tree/main/) are licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).

<hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">

## 🙏 **Acknowledgments**

This project uses the following datasets:<br>
VGGSound, Kling-Audio-Eval, The Greatest Hits (<a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" style="color:#007bff; text-decoration:none;">CC BY 4.0</a>),
and MovieGen-Audio-Bench (<a href="https://creativecommons.org/licenses/by-nc/4.0/" target="_blank" style="color:#dc3545; text-decoration:none;">CC BY-NC 4.0</a>).<br>
All resources are used for <strong>academic and non-commercial demonstration purposes only</strong>.

This project is inspired by the following works:<br>
[stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools), [MMAudio](https://github.com/hkchengrex/MMAudio), [Make-An-Audio 2](https://github.com/bytedance/Make-An-Audio-2), [Synchformer](https://github.com/v-iashin/Synchformer), and [audiocraft](https://github.com/facebookresearch/audiocraft).<br>
Thanks for their contributions.

<hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">

## 📞 **Contact**

If you have any questions or suggestions, please feel free to contact us at yangjianxuan@xiaomi.com.

<hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">

<div align="center">

2026 ControlFoley Project. All Rights Reserved.

</div>