foggyforest committed on
Commit 1aed72e · verified · 1 Parent(s): 6a140f5

Delete README (1).md

Files changed (1):
- README (1).md +0 −216

README (1).md DELETED
---
license: mit
language:
- en
- zh
base_model:
- Qwen/Qwen2-0.5B
pipeline_tag: feature-extraction
library_name: sentence-transformers
tags:
- MoE
- Unified Generation
- Speech and Music
- Multi-modal
---

<h1 align="center">UniMoE-Audio</h1>

**UniMoE-Audio** is a unified framework that seamlessly combines speech and music generation. Powered by a novel dynamic-capacity Mixture-of-Experts design, it adapts intelligently to input complexity, enabling high-fidelity voice and expressive music within a single model.

## Key Innovations

#### **Top-P Dynamic Routing Strategy**
We introduce a **Top-P routing strategy** that overcomes the limitations of conventional static Top-K routing:

- **Dynamic Expert Allocation**: Instead of assigning a fixed number of experts to every token, our approach dynamically determines the number of experts based on token complexity
- **Resource Efficiency**: Simple tokens don't consume unnecessary resources, while complex tokens receive sufficient processing power
- **Performance Optimization**: Results in improved overall efficiency and performance

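As an illustration, the Top-P selection rule can be sketched in a few lines of PyTorch. This is a minimal, hypothetical sketch: the function name, tensor shapes, and the `max_experts` cap are our assumptions, not the released implementation.

```python
import torch

def top_p_route(router_logits: torch.Tensor, p: float = 0.7, max_experts: int = 8):
    """Select a variable number of experts per token: the smallest set of
    highest-probability experts whose cumulative probability reaches p."""
    probs = torch.softmax(router_logits, dim=-1)                 # (tokens, experts)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Keep an expert if the cumulative mass *before* it is still below p,
    # so the first expert is always kept; cap at max_experts.
    keep = (cumulative - sorted_probs) < p
    keep[..., max_experts:] = False
    # Scatter the keep-mask back to the original expert order.
    mask = torch.zeros_like(probs, dtype=torch.bool).scatter(-1, sorted_idx, keep)
    # Renormalize the kept probabilities into routing weights.
    weights = torch.where(mask, probs, torch.zeros_like(probs))
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, mask
```

Compared with static Top-K, the number of selected experts now varies per token with how peaked or flat the router distribution is.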
#### **Three-Stage Training Curriculum**
We employ a comprehensive training approach to enable effective joint learning from imbalanced data:

1. **Independent Specialist Training** - Initial expert specialization
2. **Integration with Warm-up** - Gradual system integration
3. **Synergistic Joint Training** - Collaborative optimization

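The staging above can be sketched as a freeze/unfreeze schedule. This is a toy illustration with hypothetical parameter-group names; the actual recipe, losses, and data mixing are not shown here.

```python
from dataclasses import dataclass, field

@dataclass
class MoEModel:
    """Toy stand-in for the MoE model: named parameter groups we can freeze."""
    groups: dict = field(default_factory=lambda: {
        "speech_experts": True, "music_experts": True,
        "router": True, "shared_experts": True})

    def set_trainable(self, names, trainable):
        for name in names:
            self.groups[name] = trainable

    def trainable(self):
        return sorted(n for n, t in self.groups.items() if t)

def curriculum_stages(model):
    """Yield (stage_name, trainable_groups) for each of the three stages."""
    # Stage 1: independent specialist training - only the expert groups learn.
    model.set_trainable(list(model.groups), False)
    model.set_trainable(["speech_experts", "music_experts"], True)
    yield "independent_specialist", model.trainable()
    # Stage 2: integration warm-up - freeze specialists, warm up router/shared.
    model.set_trainable(["speech_experts", "music_experts"], False)
    model.set_trainable(["router", "shared_experts"], True)
    yield "integration_warmup", model.trainable()
    # Stage 3: synergistic joint training - everything trainable together.
    model.set_trainable(list(model.groups), True)
    yield "joint_training", model.trainable()
```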
## Model Information
- **Base Model**: Qwen2.5-VL with MoE extensions
- **Audio Codec**: DAC (Descript Audio Codec) with 12 channels
- **Expert Configuration**: 8 dynamic experts + 2 shared experts
- **Audio Sampling Rate**: 16 kHz
- **Supported Tasks**:
  - Text-to-Speech (TTS)
  - Speech-to-Text (STT)
  - Music Generation
- **GPU Requirements**:
  - Memory: 16 GB+
  - CUDA-enabled GPU

## Open-source Plan
- [x] Model Checkpoint
  - [x] [UniMoE-Audio-preview](https://huggingface.co/foggyforest/UniMoE-Audio-preview)
- [x] Inference Code: [HITsz-TMG/UniMoE-Audio](https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs/tree/master/UniMoE-Audio)
- [x] Technical Report: [UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE]()

## Evaluation
### Speech Synthesis
![Speech Synthesis](./imgs/Speech_Generation.png)
### Text to Music Generation
![Text to Music Generation](./imgs/T2M.png)
### Video-Text to Music Generation
![Video-Text to Music Generation](./imgs/VT2M.png)

## Requirements
We recommend using conda to install the environment:
```bash
conda env create -f configs/enviroment.yml  # add -n <name> to override the default env name
conda activate unimoe-audio  # default name
```
Then install the PyTorch packages:
```bash
# Use the official index
pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu121

# Or use the Tsinghua mirror
pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 -i https://pypi.tuna.tsinghua.edu.cn/simple/ --extra-index-url https://download.pytorch.org/whl/cu121

# Or use the Alibaba Cloud mirror
pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 -i https://mirrors.aliyun.com/pypi/simple/ --extra-index-url https://download.pytorch.org/whl/cu121
```
A DAC model checkpoint is also required at `/path/to/UniMoE-Audio/utils/dac_model`; it is downloaded automatically on the first run.

## Usage
Copy the `utils` folder into your working directory. Then you can use the model like this:

```python
from modeling import UniMoEAudio

MODEL_NAME = "HIT-TMG/UniMoE-Audio-Preview"

# Load the model
unimoe_audio = UniMoEAudio.from_pretrained(
    MODEL_NAME,
    cache_dir='./cache',
    torch_dtype='bfloat16',
    device_id=0
)
```

### TTS Example
```python
# TTS / voice cloning
target_text = "Target Text"
prompt_audio = "/path/to/your/prompt_audio.wav"
prompt_text = "Prompt Text"

# Encode the prompt audio into codec tokens
prompt_codec = unimoe_audio.dac.encode(prompt_audio)

prompt_codec_input_ids = unimoe_audio._preprocess_codec(
    codec=prompt_codec,
    codec_delay_pattern=unimoe_audio.model.config.codec_delay_pattern,
    codec_channels=unimoe_audio.model.num_channels,
    codec_bos_value=unimoe_audio.model.config.codec_bos_value,
    codec_eos_value=unimoe_audio.model.config.codec_eos_value,
    codec_pad_value=unimoe_audio.model.config.codec_pad_value
)

# Construct the prompt text
text_input, _, _ = unimoe_audio._prepare_prompt(
    task="speech", caption=target_text, prompt_text=prompt_text,
    prompt_codec_input_ids=prompt_codec_input_ids
)

# Tokenize the input text
source_input = unimoe_audio.tokenizer(text_input, add_special_tokens=False, return_tensors="pt", padding=True)
prompt_codec_input_ids = prompt_codec_input_ids.unsqueeze(0).expand(len(text_input), -1, -1).reshape(-1, prompt_codec_input_ids.shape[1])

# Speech generation
unimoe_audio._generate_core(
    source_input,
    prompt_codec_input_ids,
    save_name="speech",
    output_dir="./",
    cfg_scale=1.0,
    temperature=1.0,
    top_p=1.0,
    cfg_filter_top_k=45,
    eos_prob_mul_factor=1.0,
    do_sample=True,
    debug_guidance_step=-1,
    use_cache=True
)
```
### T2M Example
```python
caption = "music description"

# Construct the prompt text
text_input, _, _ = unimoe_audio._prepare_prompt(task="music", caption=caption)

# Tokenize the input text
source_input = unimoe_audio.tokenizer(text_input, add_special_tokens=False, return_tensors="pt", padding=True)

# Music generation from the prompt text
unimoe_audio._generate_core(
    source_input,
    None,
    save_name="music",
    output_dir="./",
    cfg_scale=10.0,
    temperature=1.0,
    top_p=1.0,
    cfg_filter_top_k=45,
    eos_prob_mul_factor=0.6,
    do_sample=True,
    debug_guidance_step=-1,
    use_cache=True
)
```

### VT2M Example
```python
# VT2M
caption = "music description"
prompt_video = "/path/to/your/video.mp4"

# Prepare the prompt (text plus sampled video frames)
text_input, video_inputs, fps_inputs = unimoe_audio._prepare_prompt(
    task="music", caption=caption, video=prompt_video, fps=1, sampling_fps=1, max_frames=1
)

# Run the multimodal input processor
source_input = unimoe_audio.processor(
    text=text_input,
    images=None,
    videos=video_inputs,
    fps=fps_inputs,
    padding=True,
    return_tensors="pt",
    do_resize=False
)

# Music generation conditioned on the prompt video
unimoe_audio._generate_core(
    source_input,
    None,
    save_name="video_music",
    output_dir="./",
    rebuild_codec=None,
    cfg_scale=10.0,
    temperature=1.0,
    top_p=1.0,
    cfg_filter_top_k=45,
    eos_prob_mul_factor=0.6,
    do_sample=True,
    debug_guidance_step=-1,
    use_cache=True
)
```