Text-to-Audio
Diffusers
English
text-video-to-audio
text-controlled-video-to-audio
audio-controlled-video-to-audio
audio-generation
Update README.md
README.md
CHANGED
@@ -7,12 +7,14 @@ language:

 <!-- ## **ControlFoley** -->

 <div align="center">

 # ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

 <p align="center">
-<a href="

 <a href="https://github.com/xiaomi-research/controlfoley" style="text-decoration:none"><img src="https://img.shields.io/badge/GitHub.io-Code-blue?logo=Github&style=flat-square" alt="GitHub"/></a>

@@ -59,9 +61,9 @@ If you find this project useful, please consider giving a star ⭐️~
 </div>
 <!-- Affiliations -->
 <div>
-<sup>1</sup>
 <br>
-*
 </div>
 </div>

@@ -69,7 +71,7 @@ If you find this project useful, please consider giving a star ⭐️~

 ## 📰 **News**

-- [2026-04] Technical report released on [arXiv](
 - [2026-04] [Project page](https://yjx-research.github.io/ControlFoley_web_page/) is now live.
 - [2026-04] [Inference code](https://github.com/xiaomi-research/controlfoley) and [pretrained models](https://huggingface.co/YJX-Xiaomi/ControlFoley) are released.
 - [2026-04] Online demo is available on the [Project Page](https://yjx-research.github.io/ControlFoley_web_page/); click "Try Now" to try it.
@@ -89,7 +91,8 @@ If you find this project useful, please consider giving a star ⭐️~

 ## 📺 **Intro Video**

-https://
 For more results of our model, visit [Project Page](https://yjx-research.github.io/ControlFoley_web_page/). For comparison with other methods, visit [Demo Page](https://yjx-research.github.io/ControlFoley/).

 <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
@@ -157,13 +160,13 @@ We propose VGGSound-TVC to evaluate text controllability under varying levels of

 - L0 → No conflict, where the textual description is consistent with the video content.
 - L1_subject → A mild semantic conflict introduced at the subject level, where the action description remains unchanged while the sounding subject is replaced.
--
 - L2 → A moderate semantic conflict in which the textual description belongs to a different semantic category while still maintaining a similar temporal structure or acoustic rhythm.
 - L3 → Strong conflict, where the textual description is randomly substituted.

 This enables systematic analysis of modality dominance and controllability under increasing inconsistency. Example samples from VGGSound-TVC are as follows.
 <div align="center">
-<img src="assets/benchmark.png" width="
 </div>

 <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
@@ -178,7 +181,7 @@ ControlFoley achieves state-of-the-art performance across multiple benchmarks, i

 - Highest CLAP scores (better semantic alignment)
 - Lowest DeSync (better temporal synchronization)
-- Best overall IS (better audio quality)

 <div align="center">
 <img src="assets/result1.png" width="80%">
@@ -308,14 +311,14 @@ python demo.py --prompt "A bird sings melodically in a forest." --duration 8.0 -
 If you find this repository useful, please consider citing our paper:

 ```bibtex
-@misc{
-title={
-author={
 year={2026},
-eprint={
 archivePrefix={arXiv},
-primaryClass={cs.
-url={https://arxiv.org/abs/
 }
 ```

@@ -350,4 +353,4 @@ If you have any questions or suggestions, please feel free to contact us at yang

 2026 ControlFoley Project. All Rights Reserved.

-</div>
@@ -7,12 +7,14 @@ language:

 <!-- ## **ControlFoley** -->

+[中文阅读](./README_zh.md)
+
 <div align="center">

 # ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

 <p align="center">
+<a href="https://arxiv.org/abs/2604.15086" style="text-decoration:none"><img src="https://img.shields.io/badge/arXiv-2604.15086-b31b1b.svg" alt="arXiv"/></a>

 <a href="https://github.com/xiaomi-research/controlfoley" style="text-decoration:none"><img src="https://img.shields.io/badge/GitHub.io-Code-blue?logo=Github&style=flat-square" alt="GitHub"/></a>

@@ -59,9 +61,9 @@ If you find this project useful, please consider giving a star ⭐️~
 </div>
 <!-- Affiliations -->
 <div>
+<sup>1</sup>MiLM Plus, Xiaomi Inc. <sup>2</sup>Wuhan University
 <br>
+*Equal contribution †Corresponding author
 </div>
 </div>

@@ -69,7 +71,7 @@ If you find this project useful, please consider giving a star ⭐️~

 ## 📰 **News**

+- [2026-04] Technical report released on [arXiv](https://arxiv.org/abs/2604.15086).
 - [2026-04] [Project page](https://yjx-research.github.io/ControlFoley_web_page/) is now live.
 - [2026-04] [Inference code](https://github.com/xiaomi-research/controlfoley) and [pretrained models](https://huggingface.co/YJX-Xiaomi/ControlFoley) are released.
 - [2026-04] Online demo is available on the [Project Page](https://yjx-research.github.io/ControlFoley_web_page/); click "Try Now" to try it.
@@ -89,7 +91,8 @@ If you find this project useful, please consider giving a star ⭐️~

 ## 📺 **Intro Video**

+https://github.com/user-attachments/assets/d63e9837-a568-4521-9009-58b4105214a9
+
 For more results of our model, visit [Project Page](https://yjx-research.github.io/ControlFoley_web_page/). For comparison with other methods, visit [Demo Page](https://yjx-research.github.io/ControlFoley/).

 <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
@@ -157,13 +160,13 @@ We propose VGGSound-TVC to evaluate text controllability under varying levels of

 - L0 → No conflict, where the textual description is consistent with the video content.
 - L1_subject → A mild semantic conflict introduced at the subject level, where the action description remains unchanged while the sounding subject is replaced.
+- L1_action → A mild semantic conflict introduced at the action level, where the subject remains unchanged while the action description is modified.
 - L2 → A moderate semantic conflict in which the textual description belongs to a different semantic category while still maintaining a similar temporal structure or acoustic rhythm.
 - L3 → Strong conflict, where the textual description is randomly substituted.

 This enables systematic analysis of modality dominance and controllability under increasing inconsistency. Example samples from VGGSound-TVC are as follows.
 <div align="center">
+<img src="assets/benchmark.png" width="100%">
 </div>

 <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
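Reviewer note: the conflict levels listed in this hunk (L0, L1_subject, L1_action, L2, L3) lend themselves to a small enumeration when reporting metrics per level. The sketch below is only an illustration under stated assumptions: `ConflictLevel`, `TVCSample`, and `group_by_level` are hypothetical names, and the record fields are not taken from the actual VGGSound-TVC files, whose schema may differ.

```python
from dataclasses import dataclass
from enum import Enum


class ConflictLevel(Enum):
    """Text-video conflict levels as described in the README list."""
    L0 = "no conflict: text is consistent with the video"
    L1_SUBJECT = "mild: sounding subject replaced, action unchanged"
    L1_ACTION = "mild: action modified, subject unchanged"
    L2 = "moderate: different semantic category, similar temporal structure or rhythm"
    L3 = "strong: text randomly substituted"


@dataclass(frozen=True)
class TVCSample:
    """Hypothetical record layout; the real benchmark schema may differ."""
    video_path: str
    prompt: str
    level: ConflictLevel


def group_by_level(samples):
    """Bucket samples by conflict level so metrics can be reported per level."""
    buckets = {level: [] for level in ConflictLevel}
    for sample in samples:
        buckets[sample.level].append(sample)
    return buckets


if __name__ == "__main__":
    demo = [
        TVCSample("clip_001.mp4", "a dog barks", ConflictLevel.L0),
        TVCSample("clip_001.mp4", "a cat barks", ConflictLevel.L1_SUBJECT),
    ]
    for level, items in group_by_level(demo).items():
        print(level.name, len(items))
```

Keeping every level as a key, even when its bucket is empty, makes it easy to spot levels that a given evaluation run did not cover.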
@@ -178,7 +181,7 @@ ControlFoley achieves state-of-the-art performance across multiple benchmarks, i

 - Highest CLAP scores (better semantic alignment)
 - Lowest DeSync (better temporal synchronization)
+- Best overall IS (better audio quality), with up to 27% relative improvement (22.08 vs. 17.36 on VGGSound).

 <div align="center">
 <img src="assets/result1.png" width="80%">
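Reviewer note: the 27% figure added to the IS bullet can be checked from the two scores it quotes. The snippet below is a minimal sketch; `relative_improvement` is a hypothetical helper, and it assumes that 17.36 is the score of the method being compared against, which this hunk does not name.

```python
def relative_improvement(new: float, reference: float) -> float:
    """Return the relative gain of `new` over `reference`, in percent."""
    return (new - reference) / reference * 100.0


# Scores quoted in the README bullet (Inception Score on VGGSound).
is_controlfoley = 22.08
is_reference = 17.36  # assumed to be the compared method's score

gain = relative_improvement(is_controlfoley, is_reference)
print(f"{gain:.1f}%")  # prints "27.2%"
```

The result rounds to 27%, which matches the "up to 27%" wording in the bullet.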
@@ -308,14 +311,14 @@ python demo.py --prompt "A bird sings melodically in a forest." --duration 8.0 -
 If you find this repository useful, please consider citing our paper:

 ```bibtex
+@misc{yang2026controlfoleyunifiedcontrollablevideotoaudio,
+title={ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling},
+author={Jianxuan Yang and Xinyue Guo and Zhi Cheng and Kai Wang and Lipan Zhang and Jinjie Hu and Qiang Ji and Yihua Cao and Yihao Meng and Zhaoyue Cui and Mengmei Liu and Meng Meng and Jian Luan},
 year={2026},
+eprint={2604.15086},
 archivePrefix={arXiv},
+primaryClass={cs.MM},
+url={https://arxiv.org/abs/2604.15086},
 }
 ```

@@ -350,4 +353,4 @@ If you have any questions or suggestions, please feel free to contact us at yang

 2026 ControlFoley Project. All Rights Reserved.

+</div>