Commit fe6e8b7 (verified), committed by YJX-Xiaomi · 1 Parent(s): 762861b

Update README.md

  1. README.md +18 -15
README.md CHANGED
@@ -7,12 +7,14 @@ language:
 
  <!-- ## **ControlFoley** -->
 
  <div align="center">
 
  # ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
 
  <p align="center">
- <a href="xxx" style="text-decoration:none"><img src="https://img.shields.io/badge/arXiv-2506.21448-b31b1b.svg" alt="arXiv"/></a>
  &nbsp;
  <a href="https://github.com/xiaomi-research/controlfoley" style="text-decoration:none"><img src="https://img.shields.io/badge/GitHub.io-Code-blue?logo=Github&style=flat-square" alt="GitHub"/></a>
  &nbsp;
@@ -59,9 +61,9 @@ If you find this project useful, please consider giving a star ⭐️~
  </div>
  <!-- Affiliations -->
  <div>
- <sup>1</sup> MiLM Plus, Xiaomi Inc. &nbsp;&nbsp; <sup>2</sup> Wuhan University
  <br>
- * Equal contribution &nbsp;&nbsp; † Corresponding author
  </div>
  </div>
 
@@ -69,7 +71,7 @@ If you find this project useful, please consider giving a star ⭐️~
 
  ## 📰 **News**
 
- - [2026-04] Technical report released on [arXiv](xxx).
  - [2026-04] [Project page](https://yjx-research.github.io/ControlFoley_web_page/) is now live.
  - [2026-04] [Inference code](https://github.com/xiaomi-research/controlfoley) and [pretrained models](https://huggingface.co/YJX-Xiaomi/ControlFoley) are released.
  - [2026-04] Online demo is available on [Project Page](https://yjx-research.github.io/ControlFoley_web_page/), click "Try Now" to experience it immediately.
@@ -89,7 +91,8 @@ If you find this project useful, please consider giving a star ⭐️~
 
  ## 📺 **Intro Video**
 
- https://cdn-uploads.huggingface.co/production/uploads/67510ec5d5d2963818c3155c/BE-iBEKBJ_pGclr32oTk_.mp4
  For more results of our model, visit [Project Page](https://yjx-research.github.io/ControlFoley_web_page/). For comparison with other methods, visit [Demo Page](https://yjx-research.github.io/ControlFoley/).
 
  <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
@@ -157,13 +160,13 @@ We propose VGGSound-TVC to evaluate text controllability under varying levels of
 
  - L0 → No conflict, where the textual description is consistent with the video content.
  - L1_subject → A mild semantic conflict introduced at the subject level, where the action description remains unchanged while the sounding subject is replaced.
- - L1_subject → A mild semantic conflict introduced at the action level, where the subject remains unchanged while the action description is modified.
  - L2 → A moderate semantic conflict in which the textual description belongs to a different semantic category while still maintaining a similar temporal structure or acoustic rhythm.
  - L3 → Strong conflict, where the textual description is randomly substituted.
 
  This enables systematic analysis of modality dominance and controllability under increasing inconsistency. Example samples from VGGSound-TVC are as follows.
  <div align="center">
- <img src="assets/benchmark.png" width="80%">
  </div>
 
  <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
@@ -178,7 +181,7 @@ ControlFoley achieves state-of-the-art performance across multiple benchmarks, i
 
  - Highest CLAP scores (better semantic alignment)
  - Lowest DeSync (better temporal synchronization)
- - Best overall IS (better audio quality). Up to 27% relative improvement (22.08 vs. 17.36 on VGGSound)
 
  <div align="center">
  <img src="assets/result1.png" width="80%">
@@ -308,14 +311,14 @@ python demo.py --prompt "A bird sings melodically in a forest." --duration 8.0 -
  If you find this repository useful, please consider citing our paper:
 
  ```bibtex
- @misc{xxx,
- title={xxx},
- author={xxx},
  year={2026},
- eprint={xxx},
  archivePrefix={arXiv},
- primaryClass={cs.CV},
- url={https://arxiv.org/abs/xxx}
  }
  ```
 
@@ -350,4 +353,4 @@ If you have any questions or suggestions, please feel free to contact us at yang
 
  2026 ControlFoley Project. All Rights Reserved.
 
- </div>
 
@@ -7,12 +7,14 @@ language:
 
  <!-- ## **ControlFoley** -->
 
+ [中文阅读](./README_zh.md)
+
  <div align="center">
 
  # ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
 
  <p align="center">
+ <a href="https://arxiv.org/abs/2604.15086" style="text-decoration:none"><img src="https://img.shields.io/badge/arXiv-2506.21448-b31b1b.svg" alt="arXiv"/></a>
  &nbsp;
  <a href="https://github.com/xiaomi-research/controlfoley" style="text-decoration:none"><img src="https://img.shields.io/badge/GitHub.io-Code-blue?logo=Github&style=flat-square" alt="GitHub"/></a>
  &nbsp;
 
@@ -59,9 +61,9 @@ If you find this project useful, please consider giving a star ⭐️~
  </div>
  <!-- Affiliations -->
  <div>
+ <sup>1</sup>MiLM Plus, Xiaomi Inc. &nbsp;&nbsp; <sup>2</sup>Wuhan University
  <br>
+ *Equal contribution &nbsp;&nbsp; †Corresponding author
  </div>
  </div>
 
 
@@ -69,7 +71,7 @@ If you find this project useful, please consider giving a star ⭐️~
 
  ## 📰 **News**
 
+ - [2026-04] Technical report released on [arXiv](https://arxiv.org/abs/2604.15086).
  - [2026-04] [Project page](https://yjx-research.github.io/ControlFoley_web_page/) is now live.
  - [2026-04] [Inference code](https://github.com/xiaomi-research/controlfoley) and [pretrained models](https://huggingface.co/YJX-Xiaomi/ControlFoley) are released.
  - [2026-04] Online demo is available on [Project Page](https://yjx-research.github.io/ControlFoley_web_page/), click "Try Now" to experience it immediately.
 
@@ -89,7 +91,8 @@ If you find this project useful, please consider giving a star ⭐️~
 
  ## 📺 **Intro Video**
 
+ https://github.com/user-attachments/assets/d63e9837-a568-4521-9009-58b4105214a9
+
  For more results of our model, visit [Project Page](https://yjx-research.github.io/ControlFoley_web_page/). For comparison with other methods, visit [Demo Page](https://yjx-research.github.io/ControlFoley/).
 
  <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
 
@@ -157,13 +160,13 @@ We propose VGGSound-TVC to evaluate text controllability under varying levels of
 
  - L0 → No conflict, where the textual description is consistent with the video content.
  - L1_subject → A mild semantic conflict introduced at the subject level, where the action description remains unchanged while the sounding subject is replaced.
+ - L1_action → A mild semantic conflict introduced at the action level, where the subject remains unchanged while the action description is modified.
  - L2 → A moderate semantic conflict in which the textual description belongs to a different semantic category while still maintaining a similar temporal structure or acoustic rhythm.
  - L3 → Strong conflict, where the textual description is randomly substituted.
 
  This enables systematic analysis of modality dominance and controllability under increasing inconsistency. Example samples from VGGSound-TVC are as follows.
  <div align="center">
+ <img src="assets/benchmark.png" width="100%">
  </div>
 
  <hr style="border: none; border-top: 3px solid #333; margin: 16px 0;">
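
Note on the hunk above: the fix renames the duplicated `L1_subject` entry to `L1_action`, which gives five distinct conflict levels. As a rough illustration of how such labels could be kept unambiguous in evaluation code, here is a minimal Python sketch. The `ConflictLevel` enum and the `severity` helper are hypothetical names chosen for this note; they are not taken from the ControlFoley repository.

```python
from enum import Enum


class ConflictLevel(Enum):
    """Text-video conflict levels as listed in the README (illustrative names)."""
    L0 = "no conflict"
    L1_SUBJECT = "mild conflict, sounding subject replaced"
    L1_ACTION = "mild conflict, action description modified"
    L2 = "moderate conflict, different semantic category"
    L3 = "strong conflict, randomly substituted text"


def severity(level: ConflictLevel) -> int:
    """Return the ordinal severity encoded in the level name.

    Both L1 variants share rank 1, so they sort together but stay distinct.
    """
    return int(level.name[1])


for level in ConflictLevel:
    print(level.name, severity(level), level.value)
```

Because enum member names must be unique, a duplicated label like the one this commit corrects would raise an error at class definition time instead of silently merging two levels.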
 
@@ -178,7 +181,7 @@ ControlFoley achieves state-of-the-art performance across multiple benchmarks, i
 
  - Highest CLAP scores (better semantic alignment)
  - Lowest DeSync (better temporal synchronization)
+ - Best overall IS (better audio quality)Up to 27% relative improvement (22.08 vs. 17.36 on VGGSound).
 
  <div align="center">
  <img src="assets/result1.png" width="80%">
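
Note on the hunk above: the "up to 27% relative improvement" figure can be checked from the two IS values quoted in the same line. The sketch below assumes that 22.08 is the ControlFoley score and 17.36 is the baseline it is compared against, which the README implies but does not state outright. The function name `relative_improvement` is illustrative.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# IS values quoted in the README for VGGSound (higher is better).
is_gain = relative_improvement(22.08, 17.36)
print(f"{is_gain:.1f}%")  # prints "27.2%"
```

The result, about 27.2%, is consistent with the rounded 27% claim.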
 
@@ -308,14 +311,14 @@ python demo.py --prompt "A bird sings melodically in a forest." --duration 8.0 -
  If you find this repository useful, please consider citing our paper:
 
  ```bibtex
+ @misc{yang2026controlfoleyunifiedcontrollablevideotoaudio,
+ title={ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling},
+ author={Jianxuan Yang and Xinyue Guo and Zhi Cheng and Kai Wang and Lipan Zhang and Jinjie Hu and Qiang Ji and Yihua Cao and Yihao Meng and Zhaoyue Cui and Mengmei Liu and Meng Meng and Jian Luan},
  year={2026},
+ eprint={2604.15086},
  archivePrefix={arXiv},
+ primaryClass={cs.MM},
+ url={https://arxiv.org/abs/2604.15086},
  }
  ```
 
 
@@ -350,4 +353,4 @@ If you have any questions or suggestions, please feel free to contact us at yang
 
  2026 ControlFoley Project. All Rights Reserved.
 
+ </div>