---
title: PrismAudio
emoji: 🎵
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: "5.42.0"
python_version: "3.10"
app_file: app.py
pinned: false
---

# PrismAudio

ICLR 2026

arXiv · Online Demo · GitHub · Hugging Face · ModelScope

---

**PrismAudio** is the first framework to integrate reinforcement learning into video-to-audio (V2A) generation, equipped with a dedicated Chain-of-Thought (CoT) planning mechanism. Building on the pioneering CoT-based V2A framework ThinkSound, PrismAudio further decomposes single-step reasoning into four specialized CoT modules — **semantic**, **temporal**, **aesthetic**, and **spatial** — each with a targeted reward function, enabling multi-dimensional RL optimization that improves reasoning across all perceptual dimensions simultaneously.

---

## Quick Start

For full training and inference details, please refer to the [ThinkSound `prismaudio` branch](https://github.com/FunAudioLLM/ThinkSound/tree/prismaudio).

```bash
git clone -b prismaudio https://github.com/liuhuadai/ThinkSound.git
cd ThinkSound

conda create -n prismaudio python=3.10
conda activate prismaudio
chmod +x scripts/PrismAudio/setup/build_env.sh
./scripts/PrismAudio/setup/build_env.sh

# Download pretrained weights to ckpts/
# From Hugging Face: https://huggingface.co/FunAudioLLM/PrismAudio
# From ModelScope: https://www.modelscope.cn/models/iic/PrismAudio
git lfs install
git clone https://huggingface.co/FunAudioLLM/PrismAudio ckpts
```

---

## License

This project is released under the [MIT License](https://opensource.org/licenses/MIT).

> **Note:** The code, model weights, and datasets are intended for **research and educational purposes only**. Commercial use is not permitted without explicit authorization from the authors.
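To make the multi-dimensional RL idea concrete, here is a minimal sketch of combining four per-dimension reward scores into a single scalar training signal. All names, weights, and scores below are illustrative placeholders, not the released implementation; the actual reward functions live in the repository linked above.

```python
# Hypothetical sketch: aggregating PrismAudio-style per-dimension rewards.
# The four dimension names mirror the framework's CoT modules; the values
# and weights here are placeholders for illustration only.
from typing import Dict


def combine_rewards(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-dimension reward scores into one scalar reward."""
    return sum(weights[k] * scores[k] for k in scores)


# Placeholder scores in [0, 1], one per CoT dimension.
scores = {"semantic": 0.8, "temporal": 0.6, "aesthetic": 0.7, "spatial": 0.5}
weights = {"semantic": 0.3, "temporal": 0.3, "aesthetic": 0.2, "spatial": 0.2}

total = combine_rewards(scores, weights)
print(round(total, 2))  # 0.3*0.8 + 0.3*0.6 + 0.2*0.7 + 0.2*0.5 = 0.66
```

In an RL loop, a scalar like `total` would serve as the optimization target, so improvements along any one dimension contribute in proportion to its weight.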
---

## Citation

If you find PrismAudio useful in your research, please consider citing our papers:

```bibtex
@misc{liu2025thinksoundchainofthoughtreasoningmultimodal,
  title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing},
  author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},
  year={2025},
  eprint={2506.21448},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2506.21448},
}

@misc{liu2025prismaudiodecomposedchainofthoughtsmultidimensional,
  title={PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation},
  author={Huadai Liu and Kaicheng Luo and Wen Wang and Qian Chen and Peiwen Sun and Rongjie Huang and Xiangang Li and Jieping Ye and Wei Xue},
  year={2025},
  eprint={2511.18833},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2511.18833},
}
```

---

## Contact

If you have any questions or suggestions, feel free to [open an issue](https://github.com/liuhuadai/ThinkSound/issues).