Update README.md
<h1 align="center">PrismAudio</h1>

---

**PrismAudio** is the first framework to bring reinforcement learning into video-to-audio (V2A) generation, equipped with dedicated Chain-of-Thought (CoT) planning. Building on [ThinkSound](https://arxiv.org/pdf/2506.21448), the pioneering CoT-based V2A framework, PrismAudio further decomposes monolithic reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial), each paired with a targeted reward function, enabling multi-dimensional reinforcement learning optimization that improves reasoning across all perceptual dimensions simultaneously.
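The "one targeted reward per CoT module" idea can be sketched as follows. This is a minimal illustration with made-up reward values and a hypothetical `grpo_advantages` helper, not the paper's implementation: per-dimension rewards for a group of rollouts are combined into a scalar and then group-normalized into GRPO-style advantages.

```python
def grpo_advantages(rewards, weights=None, eps=1e-8):
    """Weighted-sum the per-dimension rewards of each rollout, then
    normalize within the group (subtract mean, divide by std), GRPO-style.
    `rewards` is a list of rows, one row of dimension scores per rollout."""
    n_dims = len(rewards[0])
    if weights is None:
        weights = [1.0 / n_dims] * n_dims  # equal weighting (an assumption)
    scalars = [sum(w * r for w, r in zip(weights, row)) for row in rewards]
    mean = sum(scalars) / len(scalars)
    std = (sum((s - mean) ** 2 for s in scalars) / len(scalars)) ** 0.5
    return [(s - mean) / (std + eps) for s in scalars]

# Hypothetical rewards for 4 rollouts of the same video prompt;
# columns: (semantic, temporal, aesthetic, spatial). Values illustrative.
rewards = [
    [0.8, 0.6, 0.7, 0.5],
    [0.4, 0.9, 0.6, 0.6],
    [0.7, 0.7, 0.8, 0.4],
    [0.5, 0.5, 0.5, 0.9],
]
advantages = grpo_advantages(rewards)
```

Because the advantages are centered within the group, rollouts that score above the group average (across all four dimensions jointly) are reinforced while below-average ones are penalized.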

---

## 📰 News

- **2026.03.22** 🔥 **PrismAudio** is officially released, our next-generation video-to-audio generation model!
- **2026.01.26** 🎉 PrismAudio has been accepted to the **ICLR 2026 Main Conference**!
- **2025.11.25** 🔥 The [PrismAudio online demo](http://prismaudio-project.github.io/) is live!
- **2025.11.25** 🔥 The [PrismAudio paper](https://arxiv.org/pdf/2511.18833) is released on arXiv!
- **2025.09.19** 🎉 ThinkSound has been accepted to the **NeurIPS 2025 Main Conference**!
- **2025.09.01** The AudioCoT dataset is open-sourced on [Hugging Face](https://huggingface.co/datasets/liuhuadai/AudioCoT)!
- **2025.07.17** 🧠 Fine-tuning opened up: the training and fine-tuning code is officially released!
- **2025.07.15** 📦 Simplified installation, with one-click Windows setup via `.bat` scripts!
- **2025.07.08** 🔧 Major update: lighter models with optimized VRAM and GPU usage, supporting large-scale, high-throughput audio generation!
- **2025.07.01** Online demos launched on [Hugging Face Spaces](https://huggingface.co/spaces/FunAudioLLM/ThinkSound) and [ModelScope](https://modelscope.cn/studios/iic/ThinkSound)!
- **2025.07.01** Inference scripts and a web interface released!
- **2025.06** The [ThinkSound paper](https://arxiv.org/pdf/2506.21448) is released on arXiv!
- **2025.06** The [ThinkSound online demo](http://thinksound-project.github.io/) is live!

---

## 🚀 Highlights

- **State-of-the-art V2A performance**: best results across all four perceptual dimensions on the VGGSound and AudioCanvas benchmarks.
- **Decomposed CoT reasoning**: four specialized CoT modules (Semantic, Temporal, Aesthetic, Spatial), each providing focused, interpretable reasoning.
- **Multi-dimensional reinforcement learning**: Fast-GRPO enables efficient multi-dimensional reward optimization without sacrificing generation quality.
- **New benchmark, AudioCanvas**: a rigorous V2A benchmark with 300 single-event categories and 501 multi-event samples.
- **Efficient and lightweight**: only 518M parameters, with faster inference than prior state-of-the-art methods.
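One way to picture the decomposed CoT plan is as four structured fields flattened into a single prompt for the text encoder. The serialization below is hypothetical (field tags, class name, and example contents are made up for illustration; the actual CoT format is defined by PrismAudio):

```python
from dataclasses import dataclass

@dataclass
class CoTPlan:
    """Illustrative container for the four decomposed CoT dimensions."""
    semantic: str   # what should sound (events, sources)
    temporal: str   # when it should sound (onsets, durations)
    aesthetic: str  # how it should sound (quality, style)
    spatial: str    # where it should sound (position, space)

    def to_prompt(self) -> str:
        """Serialize the plan into one structured string for a text encoder."""
        return (f"[Semantic] {self.semantic} "
                f"[Temporal] {self.temporal} "
                f"[Aesthetic] {self.aesthetic} "
                f"[Spatial] {self.spatial}")

plan = CoTPlan(
    semantic="dog barking near a wooden fence",
    temporal="two barks, around 0.5s and 1.8s",
    aesthetic="dry outdoor acoustics, no music",
    spatial="source slightly left of center",
)
prompt = plan.to_prompt()
```

Keeping the four fields separate is what allows each dimension to be scored (and rewarded) independently during RL.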

---

## ✨ Method Overview

PrismAudio consists of three main components:

1. **CoT-Aware Audio Foundation Model**: Built on a Multimodal Diffusion Transformer with flow matching, enhanced with VideoPrism for video understanding and T5-Gemma for structured CoT text encoding.
2. **Decomposed Multi-Dimensional CoT Reasoning**: Four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial), each providing targeted reasoning for its corresponding perceptual dimension.
3. **Fast-GRPO Multi-Dimensional RL Framework**: A hybrid ODE-SDE sampling strategy that dramatically reduces training overhead while enabling multi-dimensional reward optimization across all perceptual dimensions.
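The hybrid ODE-SDE idea can be sketched in a toy 1-D setting. This is a sketch under assumed notation with a made-up velocity field, not the actual sampler (which operates on latent audio representations): most integration steps are deterministic Euler ODE steps, while a designated few inject Gaussian noise so that rollouts stay stochastic enough for RL.

```python
import random

def velocity(x, t):
    """Toy stand-in for a learned flow-matching velocity field v(x, t).
    This linear field simply pushes x toward a target value of 1.0."""
    return 1.0 - x

def hybrid_sample(x0, n_steps=10, sde_steps=(0, 1), noise_scale=0.1, seed=0):
    """Integrate x from t=0 to t=1 with Euler steps. Steps listed in
    `sde_steps` add scaled Gaussian noise (SDE-like); all other steps
    are plain deterministic ODE steps."""
    rng = random.Random(seed)
    dt = 1.0 / n_steps
    x = x0
    for k in range(n_steps):
        t = k * dt
        x += velocity(x, t) * dt                      # deterministic ODE update
        if k in sde_steps:                            # occasional stochastic step
            x += noise_scale * rng.gauss(0, 1) * dt ** 0.5
    return x

sample = hybrid_sample(0.0)
```

With `sde_steps=()` this reduces to a fully deterministic ODE solve; restricting noise injection to a few steps is what keeps the rollout cost close to ODE sampling while still producing distinct trajectories for reward comparison.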

---

## 📄 License

This project is released under the Apache 2.0 License.

> **Note:**
> The code, models, and dataset are **for research and educational purposes only**.
> **Commercial use is NOT permitted.**
> For commercial licensing, please contact the authors.

---

## 📖 Citation

If you find PrismAudio helpful in your research, please cite our paper:

```bibtex
@misc{liu2025prismaudiodecomposedchainofthoughtsmultidimensional,
      title={PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation},
      author={Huadai Liu and Kaicheng Luo and Wen Wang and Qian Chen and Peiwen Sun and Rongjie Huang and Xiangang Li and Jieping Ye and Wei Xue},
      year={2025},
      eprint={2511.18833},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2511.18833},
}
```

---

## 📬 Contact

✨ For any questions or suggestions, feel free to [open an issue](https://github.com/liuhuadai/ThinkSound/issues) or reach us by email: [huadai.liu@connect.ust.hk](mailto:huadai.liu@connect.ust.hk)