Update README.md
README.md
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
---
|
| 2 |
-
license:
|
| 3 |
base_model:
|
| 4 |
- google/videoprism-large-f8r288
|
| 5 |
- google/t5gemma-l-l-ul2-it
|
|
@@ -10,7 +10,6 @@ tags:
|
|
| 10 |
- video2audio
|
| 11 |
---
|
| 12 |
<h1 align="center">PrismAudio</h1>
|
| 13 |
-
|
| 14 |
<p align="center">
|
| 15 |
<img src="https://img.shields.io/badge/ICLR 2026-Main Conference-blue.svg" alt="ICLR 2026"/>
|
| 16 |
</p>
|
|
@@ -37,100 +36,70 @@ tags:
|
|
| 37 |
</a>
|
| 38 |
</p>
|
| 39 |
|
| 40 |
-
<p align="center">
|
| 41 |
-
If you find this project useful,<br>
|
| 42 |
-
a star ⭐ on GitHub would be greatly appreciated!
|
| 43 |
-
</p>
|
| 44 |
-
|
| 45 |
-
|
| 46 |
---
|
| 47 |
|
| 48 |
-
**PrismAudio** is the first framework to integrate
|
| 49 |
-
|
| 50 |
-
---
|
| 51 |
-
|
| 52 |
-
## 📰 News
|
| 53 |
-
|
| 54 |
-
- **2026.03.22** 🔥 We have released **PrismAudio**, our next-generation video-to-audio generation model! Model weights are available on [Hugging Face](https://huggingface.co/FunAudioLLM/PrismAudio) and [ModelScope](https://www.modelscope.cn/models/iic/PrismAudio). For more details, please refer to the [`prismaudio`](https://github.com/liuhuadai/ThinkSound/tree/prismaudio) branch!
|
| 55 |
-
- **2026.01.26** PrismAudio has been accepted to the **ICLR 2026 Main Conference**!
|
| 56 |
-
- **2025.11.25** 🔥 [PrismAudio Online Demo](http://prismaudio-project.github.io/) is live!
|
| 57 |
-
- **2025.11.25** 🔥 [PrismAudio paper](https://arxiv.org/pdf/2511.18833) released on arXiv!
|
| 58 |
-
- **2025.09.19** ThinkSound has been accepted to the **NeurIPS 2025 Main Conference**!
|
| 59 |
-
- **2025.09.01** AudioCoT dataset is now open-sourced on [Hugging Face](https://huggingface.co/datasets/liuhuadai/AudioCoT)!
|
| 60 |
-
- **2025.07.17** 🔧 Finetuning enabled: training and finetuning code is now publicly available!
|
| 61 |
-
- **2025.07.15** 📦 Simplified installation with Windows `.bat` scripts for one-click setup!
|
| 62 |
-
- **2025.07.08** 🔧 Major update: model lightweighted, optimized memory and GPU usage, supports large-scale high-throughput audio generation!
|
| 63 |
-
- **2025.07.01** Online demo on [Hugging Face Spaces](https://huggingface.co/spaces/FunAudioLLM/ThinkSound) and [ModelScope](https://modelscope.cn/studios/iic/ThinkSound)!
|
| 64 |
-
- **2025.07.01** Released inference scripts and web interface!
|
| 65 |
-
- **2025.06** [ThinkSound paper](https://arxiv.org/pdf/2506.21448) released on arXiv!
|
| 66 |
-
- **2025.06** [Online Demo](http://thinksound-project.github.io/) is live!
|
| 67 |
-
|
| 68 |
-
---
|
| 69 |
-
|
| 70 |
-
## ⚡ Quick Start
|
| 71 |
-
|
| 72 |
-
For detailed training and inference code, please refer to [ThinkSound (prismaudio branch)](https://github.com/FunAudioLLM/ThinkSound/tree/prismaudio).
|
| 73 |
-
|
| 74 |
---
|
| 75 |
|
| 76 |
-
##
|
| 77 |
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
- **Efficient**: 518M parameters with faster inference than prior SOTAs.
|
| 83 |
|
| 84 |
-
-
|
| 85 |
-
|
| 86 |
-
|
|
|
|
| 87 |
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
|
|
|
| 93 |
|
| 94 |
---
|
| 95 |
|
| 96 |
-
## License
|
| 97 |
|
| 98 |
-
|
| 99 |
|
| 100 |
-
|
| 101 |
-
> The code, models, and dataset are **for research and educational purposes only**.
|
| 102 |
-
> **Commercial use is NOT permitted.**
|
| 103 |
-
> For commercial licensing, please contact the authors.
|
| 104 |
|
| 105 |
-
**
|
| 106 |
-
|
| 107 |
-
- **Stable Audio Open VAE** (by Stability AI): Licensed under the [Stability AI Community License](./third_party/LICENSE_StabilityAI.md). **Commercial use and redistribution require prior permission from Stability AI.**
|
| 108 |
-
- **All other code and models** are released under the Apache License 2.0.
|
| 109 |
|
| 110 |
---
|
| 111 |
|
| 112 |
-
## Acknowledgements
|
| 113 |
-
|
| 114 |
-
Many thanks to:
|
| 115 |
|
| 116 |
-
|
| 117 |
|
| 118 |
-
|
| 119 |
|
| 120 |
-
## Citation
|
| 121 |
-
|
| 122 |
-
If you find PrismAudio useful in your research or work, please cite our paper:
|
| 123 |
|
| 124 |
```bibtex
|
| 125 |
@misc{liu2025prismaudiodecomposedchainofthoughtsmultidimensional,
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
}
|
| 134 |
```
|
| 135 |
-
|
| 136 |
-
|
| 1 |
---
|
| 2 |
+
license: mit
|
| 3 |
base_model:
|
| 4 |
- google/videoprism-large-f8r288
|
| 5 |
- google/t5gemma-l-l-ul2-it
|
|
|
|
| 10 |
- video2audio
|
| 11 |
---
|
| 12 |
<h1 align="center">PrismAudio</h1>
|
|
|
|
| 13 |
<p align="center">
|
| 14 |
<img src="https://img.shields.io/badge/ICLR 2026-Main Conference-blue.svg" alt="ICLR 2026"/>
|
| 15 |
</p>
|
|
|
|
| 36 |
</a>
|
| 37 |
</p>
|
| 38 |
|
| 39 |
---
|
| 40 |
|
| 41 |
+
**PrismAudio** is the first framework to integrate reinforcement learning into video-to-audio (V2A) generation, equipped with a dedicated Chain-of-Thought (CoT) planning mechanism. Building on the pioneering CoT-based V2A framework of ThinkSound, PrismAudio further decomposes single-step reasoning into four specialized CoT modules (**semantic**, **temporal**, **aesthetic**, and **spatial**), each with targeted reward functions, enabling multi-dimensional RL optimization that simultaneously improves reasoning across all perceptual dimensions.
|
| 42 |
---
|
| 43 |
|
| 44 |
+
## Quick Start
|
| 45 |
|
| 46 |
+
For full training and inference details, please refer to the [ThinkSound `prismaudio` branch](https://github.com/FunAudioLLM/ThinkSound/tree/prismaudio).
|
| 47 |
+
```bash
|
| 48 |
+
git clone -b prismaudio https://github.com/liuhuadai/ThinkSound.git
|
| 49 |
+
cd ThinkSound
|
|
|
|
| 50 |
|
| 51 |
+
conda create -n prismaudio python=3.10
|
| 52 |
+
conda activate prismaudio
|
| 53 |
+
chmod +x scripts/PrismAudio/setup/build_env.sh
|
| 54 |
+
./scripts/PrismAudio/setup/build_env.sh
|
| 55 |
|
| 56 |
+
# Download pretrained weights to ckpts/
|
| 57 |
+
# From Hugging Face: https://huggingface.co/FunAudioLLM/PrismAudio
|
| 58 |
+
# From ModelScope: https://www.modelscope.cn/models/iic/PrismAudio
|
| 59 |
+
git lfs install
|
| 60 |
+
git clone https://huggingface.co/FunAudioLLM/PrismAudio ckpts
|
| 61 |
+
```
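After the clone step above, it can be worth checking that `ckpts/` holds real weights rather than Git LFS pointer stubs, which is what a clone leaves behind when `git lfs` was not active. The following is a minimal sketch and not part of the ThinkSound repository: the helper name `find_lfs_pointers` is hypothetical, and it assumes only the `ckpts` directory used above and the standard first line of an LFS pointer file.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with ThinkSound): print every file under a
# directory that is still a Git LFS pointer stub instead of the real payload.
find_lfs_pointers() {
  local dir="$1"
  # Pointer stubs are tiny text files, so only inspect files of at most 1 KiB
  # and skip Git's own metadata directory.
  find "$dir" -type f ! -path '*/.git/*' -size -2k -print0 |
    while IFS= read -r -d '' f; do
      # Every LFS pointer file starts with this spec line.
      if head -c 64 "$f" | grep -q '^version https://git-lfs.github.com/spec/v1'; then
        printf '%s\n' "$f"
      fi
    done
}

# Only run the check when the checkpoint directory from the steps above exists.
if [ -d ckpts ]; then
  stubs="$(find_lfs_pointers ckpts)"
  if [ -n "$stubs" ]; then
    echo "These files are still LFS pointers; try 'git lfs pull' inside ckpts/:" >&2
    printf '%s\n' "$stubs" >&2
  fi
fi
```

If the check prints nothing, no pointer stubs were found. It does not verify that the weights are complete or uncorrupted.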
|
| 62 |
|
| 63 |
---
|
| 64 |
|
|
|
|
| 65 |
|
| 66 |
+
## License
|
| 67 |
|
| 68 |
+
This project is released under the [MIT License](https://opensource.org/licenses/MIT).
|
| 69 |
|
| 70 |
+
> **Note:** The code, model weights, and datasets are intended for **research and educational purposes only**. Commercial use is not permitted without explicit authorization from the authors.
|
| 71 |
|
| 72 |
---
|
| 73 |
|
| 74 |
|
| 75 |
+
## Citation
|
| 76 |
|
| 77 |
+
If you find PrismAudio useful in your research, please consider citing our papers:
|
| 78 |
|
| 79 |
|
| 80 |
```bibtex
|
| 81 |
+
@misc{liu2025thinksoundchainofthoughtreasoningmultimodal,
|
| 82 |
+
title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing},
|
| 83 |
+
author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},
|
| 84 |
+
year={2025},
|
| 85 |
+
eprint={2506.21448},
|
| 86 |
+
archivePrefix={arXiv},
|
| 87 |
+
primaryClass={eess.AS},
|
| 88 |
+
url={https://arxiv.org/abs/2506.21448},
|
| 89 |
+
}
|
| 90 |
+
|
| 91 |
@misc{liu2025prismaudiodecomposedchainofthoughtsmultidimensional,
|
| 92 |
+
title={PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation},
|
| 93 |
+
author={Huadai Liu and Kaicheng Luo and Wen Wang and Qian Chen and Peiwen Sun and Rongjie Huang and Xiangang Li and Jieping Ye and Wei Xue},
|
| 94 |
+
year={2025},
|
| 95 |
+
eprint={2511.18833},
|
| 96 |
+
archivePrefix={arXiv},
|
| 97 |
+
primaryClass={cs.SD},
|
| 98 |
+
url={https://arxiv.org/abs/2511.18833},
|
| 99 |
}
|
| 100 |
```
|
| 101 |
+
|
| 102 |
+
---
|
| 103 |
+
## Contact
|
| 104 |
+
|
| 105 |
+
If you have any questions or suggestions, feel free to [open an issue](https://github.com/liuhuadai/ThinkSound/issues) or reach out via email: [huadai.liu@connect.ust.hk](mailto:huadai.liu@connect.ust.hk)
|