<h1 align="center">UniMoE-Audio</h1>

**UniMoE-Audio** is a unified framework that seamlessly combines speech and music generation, powered by a novel Dynamic-Capacity Mixture-of-Experts architecture.

<div align="center" style="display: flex; justify-content: center; margin-top: 10px;">
<a href="https://mukioxun.github.io/Uni-MoE-site/home.html"><img src="https://img.shields.io/badge/📰 -Website-228B22" style="margin-right: 5px;"></a>
</div>
## Model Information
- **Base Model**: Qwen2.5-VL with MoE extensions
- **Audio Codec**: DAC (Descript Audio Codec) with 12 channels
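
A 12-channel codec means each audio frame is represented by 12 discrete token IDs, one per residual codebook. The sketch below is purely illustrative of that layout; the frame count and codebook size are made-up numbers, not taken from this model card.

```python
# Illustrative only: a DAC-style residual codec with 12 codebooks emits
# 12 discrete token IDs per audio frame. NUM_CODEBOOKS matches the model
# card; CODEBOOK_SIZE and num_frames are assumed for the example.
NUM_CODEBOOKS = 12
CODEBOOK_SIZE = 1024
num_frames = 4

# codes[f][c] = token ID chosen by codebook c for frame f
codes = [
    [(f * NUM_CODEBOOKS + c) % CODEBOOK_SIZE for c in range(NUM_CODEBOOKS)]
    for f in range(num_frames)
]

print(len(codes), len(codes[0]))  # 4 12
```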
- [x] Technical Report: [UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE](https://arxiv.org/abs/2510.13344)
## Evaluation
### Speech Synthesis

### Text to Music Generation

### Video-Text to Music Generation

## Requirements

Since UniMoE-Audio builds on the Qwen2.5-VL model, we advise installing `transformers>=4.53.1`; otherwise you might encounter the following error:

```
KeyError: 'qwen2_vl'
```
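
One way to catch this before loading the model is a quick version check. The helper below is our own stdlib-only sketch, not part of the release, and it ignores pre-release suffixes:

```python
def meets_min_version(installed: str, required: str = "4.53.1") -> bool:
    """Naive dotted-version comparison (ignores pre-release suffixes)."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(required)

# transformers releases older than 4.53.1 do not register the 'qwen2_vl'
# model type and fail with KeyError: 'qwen2_vl'
print(meets_min_version("4.52.0"))  # False
print(meets_min_version("4.53.1"))  # True
```

In practice you would pass `transformers.__version__` as `installed`.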

## Quickstart

We use `qwen-vl-utils` to handle various types of visual input. You can install it using the following command:

```
pip install qwen-vl-utils
```

We use the Descript Audio Codec (DAC) for audio compression. You can install it using the following command:

```
pip install descript-audio-codec
```

The model weights will be downloaded automatically on first run.
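
For video-text prompts, `qwen-vl-utils` consumes chat-style messages. The sketch below shows only that message layout; the video path and prompt text are placeholders, not taken from this model card:

```python
# Chat-style message layout consumed by qwen_vl_utils.process_vision_info
# (video path and prompt text below are placeholders)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/clip.mp4"},
            {"type": "text", "text": "Compose music that matches this video."},
        ],
    }
]

content_types = [item["type"] for item in messages[0]["content"]]
print(content_types)  # ['video', 'text']
```

With the package installed, `image_inputs, video_inputs = process_vision_info(messages)` then extracts the media inputs for the processor.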
## Usage

Here is a code snippet to show you how to use UniMoE-Audio with `transformers`:

```python
import torch
import deepspeed_utils  # This line is important, do not delete it
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
# Import from utils modules