YJX-Xiaomi committed on
Commit 1d4b4e4 · verified · 1 Parent(s): 61b3cad

Create README.md

  1. README.md +348 -3
README.md CHANGED
@@ -1,3 +1,348 @@
1
- ---
2
- license: cc-by-nc-4.0
3
- ---
1
+
2
+
3
+ <!-- ## **ControlFoley** -->
4
+
5
+ <div align="center">
6
+
7
+ # ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
8
+
9
+ <p align="center">
10
+ <a href="xxx" style="text-decoration:none"><img src="https://img.shields.io/badge/arXiv-2506.21448-b31b1b.svg" alt="arXiv"/></a>
11
+ &nbsp;
12
+ <a href="https://github.com/xiaomi-research/controlfoley" style="text-decoration:none"><img src="https://img.shields.io/badge/GitHub.io-Code-blue?logo=Github&style=flat-square" alt="GitHub"/></a>
13
+ &nbsp;
14
+ <a href="https://yjx-research.github.io/ControlFoley_web_page/" style="text-decoration:none"><img src="https://img.shields.io/badge/Project%20Page-Project-blue" alt="Project Page"/></a>
15
+ &nbsp;
16
+ <a href="https://yjx-research.github.io/ControlFoley/" style="text-decoration:none"><img src="https://img.shields.io/badge/Demo%20Page-Demo-blue" alt="Demo Page"/></a>
17
+ &nbsp;
18
+ <a href="https://huggingface.co/YJX-Xiaomi/ControlFoley" style="text-decoration:none"><img src="https://img.shields.io/badge/HuggingFace-Models-orange?logo=huggingface" alt="Hugging Face"/></a>
19
+ </p>
20
+
21
+ </div>
22
+
23
+ <p align="center">
24
+ If you find this project useful, please consider giving it a star ⭐️
25
+ </p>
26
+
27
+
28
+ <div align="center">
29
+
30
+ <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
31
+
32
+ ### 👥 **Authors**
33
+
34
+ <div>
35
+ <!-- Row 1: 6 authors -->
36
+ <div style="margin-bottom: 2px;">
37
+ Jianxuan Yang<sup>1*†</sup>,&nbsp;
38
+ Xinyue Guo<sup>1*</sup>,&nbsp;
39
+ Zhi Cheng<sup>1,2</sup>,&nbsp;
40
+ Kai Wang<sup>1,2</sup>,&nbsp;
41
+ Lipan Zhang<sup>1</sup>,&nbsp;
42
+ Jinjie Hu<sup>1</sup>
43
+ </div>
44
+ <!-- Row 2: 7 authors -->
45
+ <div>
46
+ Qiang Ji<sup>1</sup>,&nbsp;
47
+ Yihua Cao<sup>1</sup>,&nbsp;
48
+ Yihao Meng<sup>1,2</sup>,&nbsp;
49
+ Zhaoyue Cui<sup>1,2</sup>,&nbsp;
50
+ Mengmei Liu<sup>1</sup>,&nbsp;
51
+ Meng Meng<sup>1</sup>,&nbsp;
52
+ Jian Luan<sup>1</sup>
53
+ </div>
54
+ </div>
55
+ <!-- Affiliations -->
56
+ <div>
57
+ <sup>1</sup> MiLM Plus, Xiaomi Inc. &nbsp;&nbsp; <sup>2</sup> Wuhan University
58
+ <br>
59
+ * Equal contribution &nbsp;&nbsp; † Corresponding author
60
+ </div>
61
+ </div>
62
+
63
+ <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
64
+
65
+ ## 📰 **News**
66
+
67
+ - [2026-04] Technical report released on [arXiv](xxx).
68
+ - [2026-04] [Project page](https://yjx-research.github.io/ControlFoley_web_page/) is now live.
69
+ - [2026-04] [Inference code](https://github.com/xiaomi-research/controlfoley) and [pretrained models](https://huggingface.co/YJX-Xiaomi/ControlFoley) are released.
70
+ - [2026-04] An online demo is available on the [Project Page](https://yjx-research.github.io/ControlFoley_web_page/); click "Try Now" to use it right away.
71
+ - [Coming Soon] Skill will be released.
72
+
73
+ <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
74
+
75
+ ## 🔄 **Updates**
76
+
77
+ - [x] Release technical report on arXiv.
78
+ - [x] Launch project page.
79
+ - [x] Release inference code and pretrained models.
80
+ - [x] Launch online inference demo (available on project page).
81
+ - [ ] Release skill.
82
+
83
+ <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
84
+
85
+ ## 📺 **Intro Video**
86
+
87
+ https://cdn-uploads.huggingface.co/production/uploads/67510ec5d5d2963818c3155c/BE-iBEKBJ_pGclr32oTk_.mp4
88
+ For more results from our model, visit the [Project Page](https://yjx-research.github.io/ControlFoley_web_page/). For comparisons with other methods, visit the [Demo Page](https://yjx-research.github.io/ControlFoley/).
89
+
90
+ <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
91
+
92
+ ## 🎧 **Overview**
93
+
94
+ ControlFoley is a unified and controllable multimodal video-to-audio (V2A) generation framework that enables precise control over generated audio using video, text, and reference audio.
95
+
96
+ Unlike existing methods that rely on a single modality or struggle under conflicting inputs, ControlFoley is designed to handle complex multimodal interactions and maintain strong controllability even when modalities are inconsistent.
97
+
98
+ <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
99
+
100
+ ## 🎨 **Teaser Figure**
101
+
102
+ <div align="center">
103
+ <img src="assets/tease.png" width="100%">
104
+ <p style="margin-top: 8px; text-align: center; font-style: italic;">
105
+ Left: Overview of the ControlFoley framework with three multimodal conditioning modes for controllable video-synchronized audio generation. Right: Performance radar chart of Video-to-Audio models.
106
+ </p>
107
+ </div>
108
+
109
+ <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
110
+
111
+ ## 🚀 **Capabilities**
112
+
113
+ ControlFoley supports a wide range of applications:
114
+
115
+ - 🎬 <strong>Text-Video-to-Audio Generation (TV2A)</strong><br>
116
+ Video-content-adaptive dubbing and synchronized sound effect generation under text guidance.
117
+
118
+ - 📝 <strong>Text-Controlled Video-to-Audio (TC-V2A)</strong><br>
119
+ Audio generation under video–text conflicts, with semantics that follow the text prompt and timing that stays synchronized with the video content.
120
+
121
+ - 🎧 <strong>Audio-Controlled Video-to-Audio (AC-V2A)</strong><br>
122
+ Audio generation conditioned on reference audio, with timbre that matches the reference and timing that stays synchronized with the video content.
123
+
124
+ - 📝 <strong>Text-to-Audio Generation (T2A)</strong><br>
125
+ Audio generation directly from text prompts, offered as an additional capability of the unified framework.
126
+
127
+ <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
128
+
129
+ ## 🧠 **Key Innovations**
130
+
131
+ <div align="center">
132
+ <img src="assets/controlfoley.png" width="100%">
133
+ </div>
134
+
135
+ - <strong>Joint Visual Encoding for Robust Multimodal Control:</strong>
136
+ Combines CLIP and CAV-MAE-ST representations to capture both vision-language and audio-visual correlations, improving robustness under modality conflict.
137
+
138
+ - <strong>Timbre-Focused Reference Audio Control:</strong>
139
+ Extracts global timbre representations while suppressing temporal cues, enabling precise acoustic style control without affecting synchronization.
140
+
141
+ - <strong>Modality-Robust Training with Unified Alignment:</strong>
142
+ Introduces all-modality dropout and a unified REPA objective to improve robustness across diverse modality combinations.
143
+
144
+ - <strong>VGGSound-TVC Benchmark:</strong>
145
+ A new benchmark for evaluating textual controllability under visual-text semantic conflicts.
146
+
147
+ <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
148
+
149
+ ## 🧪 **VGGSound-TVC Benchmark**
150
+
151
+ We propose VGGSound-TVC to evaluate text controllability under varying levels of visual-text conflict. In this benchmark, the textual descriptions of the videos are rewritten according to the rules below.
152
+
153
+ - L0 → No conflict, where the textual description is consistent with the video content.
154
+ - L1_subject → A mild semantic conflict introduced at the subject level, where the action description remains unchanged while the sounding subject is replaced.
155
+ - L1_action → A mild semantic conflict introduced at the action level, where the subject remains unchanged while the action description is modified.
156
+ - L2 → A moderate semantic conflict in which the textual description belongs to a different semantic category while still maintaining a similar temporal structure or acoustic rhythm.
157
+ - L3 → Strong conflict, where the textual description is randomly substituted.
158
+
159
+ This enables systematic analysis of modality dominance and controllability under increasing inconsistency. Example samples from VGGSound-TVC are as follows.
160
+ <div align="center">
161
+ <img src="assets/benchmark.png" width="80%">
162
+ </div>
163
+
164
+ <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
165
+
166
+ ## 📊 **Performance**
167
+
168
+ ControlFoley achieves strong performance across multiple V2A tasks, demonstrating both high generation quality and robust controllability.
169
+
170
+ 🎬 <strong>TV2A</strong>
171
+
172
+ ControlFoley achieves state-of-the-art performance across multiple benchmarks, including VGGSound-Test, Kling-Audio-Eval, and MovieGen-Audio-Bench.
173
+
174
+ - Highest CLAP scores (better semantic alignment)
175
+ - Lowest DeSync (better temporal synchronization)
176
+ - Best overall IS (better audio quality). Up to 27% relative improvement (22.08 vs. 17.36 on VGGSound)
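
The 27% figure in the list above is a relative improvement: the gain divided by the baseline score. A minimal check of that arithmetic, using only the two IS values quoted there (the function name is illustrative and not part of the repository):

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Return the gain of `new` over `baseline` as a fraction of `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


# IS values quoted above: 22.08 vs. 17.36 on VGGSound.
gain = relative_improvement(22.08, 17.36)
print(f"{gain:.1%}")  # prints 27.2%
```
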
177
+
178
+ <div align="center">
179
+ <img src="assets/result1.png" width="80%">
180
+ </div>
181
+
182
+ 📝 <strong>TC-V2A</strong>
183
+
184
+ ControlFoley demonstrates strong textual controllability under increasing visual-text conflict.
185
+
186
+ - Maintains high CLAP (text alignment) across conflict levels
187
+ - Effectively reduces IB under conflict (less reliance on visual bias)
188
+ - Achieves better balance between controllability and generation quality
189
+
190
+ <div align="center">
191
+ <img src="assets/result2.png" width="60%">
192
+ </div>
193
+
194
+ 🎧 <strong>AC-V2A</strong>
195
+
196
+ ControlFoley achieves the best performance across all evaluation metrics on the Greatest Hits dataset.
197
+
198
+ - Better timbre similarity (Resemblyzer)
199
+ - Better synchronization (DeSync)
200
+ - Higher audio quality (IS)
201
+
202
+ Notably, it outperforms CondFoleyGen, a specialized in-domain baseline, demonstrating strong generalization ability.
203
+
204
+ <div align="center">
205
+ <img src="assets/result3.png" width="50%">
206
+ </div>
207
+
208
+ ##
209
+ ControlFoley also demonstrates competitive or superior performance compared to strong proprietary systems such as Kling-Foley, highlighting its effectiveness as an open and controllable solution.
210
+
211
+ <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
212
+
213
+ ## 🛠 **Quick Start**
214
+
215
+ ### 🔑 **Prerequisites**
216
+
217
+ - Python 3.10+
218
+ - PyTorch 2.5.1+
219
+ - CUDA 11.8+
220
+ - FFmpeg (`conda install -c conda-forge ffmpeg`)
221
+
222
+ ### 🧱 **Installation**
223
+
224
+ ```bash
225
+ # Clone the repository
226
+ git clone https://github.com/xiaomi-research/controlfoley
227
+ cd controlfoley
228
+
229
+ # Create conda environment
230
+ conda create -n controlfoley python=3.10.16
231
+ conda activate controlfoley
232
+
233
+ # Install dependencies
234
+ pip install -r requirements.txt
235
+
236
+ # Download pretrained weights
237
+ pip install huggingface-hub==0.26.2
238
+ huggingface-cli download YJX-Xiaomi/ControlFoley --resume-download --local-dir model_weights --local-dir-use-symlinks False
239
+ ```
240
+
241
+ Alternatively, download the weights manually from [here](https://huggingface.co/YJX-Xiaomi/ControlFoley/tree/main/) and place them in the `model_weights` folder.
242
+
243
+ ### 🎨 **Inference**
244
+
245
+ ```
246
+ python demo.py [OPTIONS]
247
+
248
+ Options:
249
+ --video TEXT Path to the input video file. (default: None)
250
+ --audio TEXT Path to the input reference audio file. (default: None)
251
+ --prompt TEXT Textual prompt for audio generation. (default: None)
252
+ --negative_prompt TEXT Negative textual prompt for audio generation. (default: None)
253
+ --duration FLOAT Duration of the generated audio in seconds. (default: 8.0)
254
+ --output TEXT Output directory for generated audio files. (default: ./output)
255
+ ```
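
When `demo.py` is driven from another script, it can help to assemble the argument list programmatically. The sketch below is a hypothetical helper, not part of the repository; it assumes only the option names and defaults listed above and omits any option left as `None`.

```python
import shlex
from typing import List, Optional


def build_demo_command(video: Optional[str] = None,
                       audio: Optional[str] = None,
                       prompt: Optional[str] = None,
                       negative_prompt: Optional[str] = None,
                       duration: float = 8.0,
                       output: str = "./output") -> List[str]:
    """Assemble an argv list for demo.py, skipping options left as None."""
    cmd = ["python", "demo.py"]
    optional = {
        "--video": video,
        "--audio": audio,
        "--prompt": prompt,
        "--negative_prompt": negative_prompt,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, value]
    # Duration and output directory are always passed explicitly.
    cmd += ["--duration", str(duration), "--output", output]
    return cmd


cmd = build_demo_command(video="assets/004.mp4")
print(shlex.join(cmd))
# prints: python demo.py --video assets/004.mp4 --duration 8.0 --output ./output
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell-quoting issues in prompts that contain spaces or punctuation.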
256
+
257
+ ### 📌 **Supported Tasks**
258
+
259
+ | Task | video | audio | prompt |
260
+ |--------|------------|------------|----------|
261
+ | TV2A | required | None | required |
262
+ | TC-V2A | required | None | required |
263
+ | AC-V2A | required | required | optional |
264
+ | V2A | required | None | None |
265
+ | T2A | None | None | required |
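
The table above can be read as a mapping from the inputs that are present to the task that runs. The following sketch encodes that reading; `infer_task` is a hypothetical helper, and treating combinations missing from the table as errors is an assumption rather than documented behavior of `demo.py`.

```python
from typing import Optional


def infer_task(video: Optional[str] = None,
               audio: Optional[str] = None,
               prompt: Optional[str] = None) -> str:
    """Map the provided inputs to a task name from the table above."""
    if video is None:
        if audio is not None:
            # Audio without video is not listed in the table.
            raise ValueError("reference audio without a video is not a listed task")
        if prompt is None:
            raise ValueError("T2A requires a prompt")
        return "T2A"
    if audio is not None:
        return "AC-V2A"  # the prompt is optional for this task
    if prompt is not None:
        # TV2A and TC-V2A take the same inputs; they differ only in
        # whether the prompt agrees or conflicts with the video content.
        return "TV2A/TC-V2A"
    return "V2A"


print(infer_task(video="assets/003.mp4", audio="assets/003.wav"))  # prints AC-V2A
```
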
266
+
267
+ ### 📋 **Usage Examples**
268
+
269
+ - TV2A
270
+
271
+ ```bash
272
+ python demo.py --video "assets/001.mp4" --prompt "the skateboard wheels scraping and grinding on the ground." --duration 8.0 --output "./output"
273
+ ```
274
+
275
+ - TC-V2A
276
+
277
+ ```bash
278
+ python demo.py --video "assets/002.mp4" --prompt "man whistling." --duration 8.0 --output "./output"
279
+ ```
280
+
281
+ - AC-V2A
282
+
283
+ ```bash
284
+ python demo.py --video "assets/003.mp4" --audio "assets/003.wav" --duration 8.0 --output "./output"
285
+ ```
286
+
287
+ - V2A
288
+
289
+ ```bash
290
+ python demo.py --video "assets/004.mp4" --duration 8.0 --output "./output"
291
+ ```
292
+
293
+ - T2A
294
+
295
+ ```bash
296
+ python demo.py --prompt "A bird sings melodically in a forest." --duration 8.0 --output "./output"
297
+ ```
298
+
299
+ <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
300
+
301
+ ## 📝 **Citation**
302
+
303
+ If you find this repository useful, please consider citing our paper:
304
+
305
+ ```bibtex
306
+ @misc{xxx,
307
+ title={xxx},
308
+ author={xxx},
309
+ year={2026},
310
+ eprint={xxx},
311
+ archivePrefix={arXiv},
312
+ primaryClass={cs.CV},
313
+ url={https://arxiv.org/abs/xxx}
314
+ }
315
+ ```
316
+
317
+ <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
318
+
319
+ ## 🔒 **License**
320
+
321
+ This repository is licensed under the [Apache License 2.0](./LICENSE), and the [model weights](https://huggingface.co/YJX-Xiaomi/ControlFoley/tree/main/) are licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
322
+
323
+ <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
324
+
325
+ ## 🙏 **Acknowledgments**
326
+
327
+ This project uses the following datasets:<br>
328
+ VGGSound, Kling-Audio-Eval, The Greatest Hits (<a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" style="color:#007bff; text-decoration:none;">CC BY 4.0</a>),
329
+ and MovieGen-Audio-Bench (<a href="https://creativecommons.org/licenses/by-nc/4.0/" target="_blank" style="color:#dc3545; text-decoration:none;">CC BY-NC 4.0</a>).<br>
330
+ All resources are used for <strong>academic and non-commercial demonstration purposes only</strong>.
331
+
332
+ This project is inspired by the following works:<br>
333
+ [stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools), [MMAudio](https://github.com/hkchengrex/MMAudio), [Make-An-Audio 2](https://github.com/bytedance/Make-An-Audio-2), [Synchformer](https://github.com/v-iashin/Synchformer), and [audiocraft](https://github.com/facebookresearch/audiocraft).<br>
334
+ We thank the authors for their contributions.
335
+
336
+ <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
337
+
338
+ ## 📞 **Contact**
339
+
340
+ If you have any questions or suggestions, please feel free to contact us at yangjianxuan@xiaomi.com.
341
+
342
+ <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
343
+
344
+ <div align="center">
345
+
346
+ 2026 ControlFoley Project. All Rights Reserved.
347
+
348
+ </div>