ASLP-lab committed on
Commit 36f5333 · verified · 1 Parent(s): ab78ed1

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ images/I-OSUM-Pangu.png filter=lfs diff=lfs merge=lfs -text
+ images/Strategy.png filter=lfs diff=lfs merge=lfs -text
+ images/structure.png filter=lfs diff=lfs merge=lfs -text
+ images/table1.png filter=lfs diff=lfs merge=lfs -text
+ images/table4.png filter=lfs diff=lfs merge=lfs -text
images/I-OSUM-Pangu.png ADDED

Git LFS Details

  • SHA256: d82a5f9d15f7fa733072edf59e69fbddb6583dc64e35f118f774bd8ae4d6e3df
  • Pointer size: 131 Bytes
  • Size of remote file: 448 kB
images/Strategy.png ADDED

Git LFS Details

  • SHA256: f84fef23419ec79ca56707090045d0f4fa647238b14dc5aac3a3b67d93490142
  • Pointer size: 131 Bytes
  • Size of remote file: 173 kB
images/structure.png ADDED

Git LFS Details

  • SHA256: 83ea860ad7e4a5525e2ee52d83ba87dc88a0502ba330ff0a97f00cbf13a7913c
  • Pointer size: 131 Bytes
  • Size of remote file: 124 kB
images/table1.png ADDED

Git LFS Details

  • SHA256: 194bc0efc0f96cd523419ef4f935ca30a69e231dab3784f3e0dfe4d91ea56868
  • Pointer size: 131 Bytes
  • Size of remote file: 169 kB
images/table2.png ADDED
images/table3.png ADDED
images/table4.png ADDED

Git LFS Details

  • SHA256: e48941656514344aa09e30071a4a249fcab330b7f2aa9f7994525a20cdba72a3
  • Pointer size: 131 Bytes
  • Size of remote file: 124 kB
images/table5.png ADDED
images/table6.png ADDED
only_encder_ckpt.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e25b838f9da41c9a29c057799d4d8335f9f5ee0e33c09ea176348e4840353e72
+ size 6979084750
readme.md ADDED
@@ -0,0 +1,282 @@
<p align="center">
<h1>I-OSUM-Pangu: Intent-Aware Open-Source Speech Understanding Framework</h1>
</p>

Yujie Liao, Xuelong Geng, Shuiyuan Wang, Lei Xie

<p align="center">
<img src="images/I-OSUM-Pangu.png" width="400"/>
</p>

<p align="center">
<a href="https://github.com/ASLP-lab/I-OSUM-Pangu">Code</a>
</p>

In recent years, the development of large-scale audio-language models has enabled multi-dimensional speech understanding. However, most existing open-source models rely on fixed templates or task tags, while more powerful systems are often closed-source or require massive amounts of training data.

We propose **I-OSUM-Pangu**, an efficient, controllable, and fully open-source speech understanding framework.

The model is built upon:

- The Whisper-medium speech encoder (from OpenAI's Whisper series)
- The OpenPangu-7B large language model backbone

The core objective of our framework is to enable the model to:

- Understand user instructions expressed in natural language
- Automatically identify user intent
- Route the request to the corresponding speech understanding task
- Work without relying on fixed prompt templates

Experimental results show that:

- The Instruction Following Rate (IFR) exceeds **90%**
- Task performance remains comparable to traditional fixed-tag approaches

This project releases both code and model weights, aiming to provide a **reproducible and extensible open-source framework** for speech understanding research.

---

## Architecture

The overall architecture of I-OSUM-Pangu is shown below:

<p align="center">
<img src="images/structure.png" width="80%"/>
</p>

The model mainly consists of three components:

### 1. Speech Encoder

Whisper-medium, responsible for extracting speech representations.

### 2. Adapter

Transforms acoustic features into tokens compatible with the LLM input space.

### 3. Intent-Aware LLM

OpenPangu-7B, responsible for:

- Parsing natural language instructions
- Identifying user intent
- Determining which speech task to execute
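
The three components above can be sketched as a minimal data-flow pipeline. All names and transformations below are illustrative stand-ins (the real encoder is Whisper-medium and the real backbone is OpenPangu-7B; the repository's actual interfaces will differ):

```python
# Toy stand-ins showing the encoder -> adapter -> intent-aware LLM data flow.

def speech_encoder(audio):
    """Stand-in for the Whisper-medium encoder: raw samples -> frame features."""
    return [[s, s * 0.5] for s in audio]

def adapter(features):
    """Stand-in adapter: projects acoustic features into the LLM token space."""
    return [sum(frame) / len(frame) for frame in features]

def intent_aware_llm(instruction, speech_tokens):
    """Stand-in for the intent-aware LLM: parses the instruction, routes the task."""
    text = instruction.lower()
    if "transcribe" in text:
        task = "ASR"
    elif "emotion" in text:
        task = "SER"
    else:
        task = "OTHER"
    return {"task": task, "num_speech_tokens": len(speech_tokens)}

result = intent_aware_llm("Please transcribe this audio.",
                          adapter(speech_encoder([0.1, -0.2, 0.3])))
print(result)  # {'task': 'ASR', 'num_speech_tokens': 3}
```

The key point the sketch illustrates is that no task tag is passed in: the task is inferred from the free-form instruction alone.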

---

## Training Strategy

We propose a **Decoupled-then-Integrated Training Strategy**, illustrated below:

<p align="center">
<img src="images/Strategy.png" width="80%"/>
</p>

### Stage 1: Speech Understanding Alignment

Goal: Equip the model with multi-task speech understanding capability.

Characteristics:

- Only speech-related modules are trained
- Establishes strong acoustic representation ability

---

### Stage 2: Intent Understanding

Goal: Enable the model to understand natural language user instructions.

Examples:

- Please transcribe this audio.
- Analyze the speaker's emotion.
- Identify what event happens in the audio.

The model learns:

- Instruction semantic understanding
- Task mapping capability

---

### Stage 3: Joint Instruction Tuning

In the final stage, joint training allows the model to:

- Automatically parse user instructions
- Identify task types
- Execute the corresponding speech understanding tasks

without requiring fixed templates. Given any of:

- What is the emotion of this speech?
- Can you transcribe this audio?
- What event happens in the audio?

the model can correctly understand and execute all of them.

---

## Inference Results

### Dataset Configuration

The model is trained on **47,000 hours** of multi-task speech data, covering seven core speech tasks. Additionally, a dedicated dataset is constructed to enhance instruction-following ability.

<p align="center">
<img src="images/table1.png" width="65%"/>
</p>

---

### Instruction Following Performance (IFR)

The Instruction Following Rate (IFR) measures the ability of the model to parse natural language instructions and execute the corresponding tasks.

The metric is defined as:

\[
IFR = \left( \frac{N_{correct}}{N_{total}} \right) \times 100\%
\]

where:

- \(N_{correct}\) is the number of correctly executed instructions
- \(N_{total}\) is the total number of evaluation samples
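
The computation itself reduces to a simple ratio over per-sample judgments (a minimal sketch; deciding whether each instruction was "correctly executed" is the job of the evaluation protocol, not this function):

```python
def instruction_following_rate(executed_correctly):
    """IFR = (N_correct / N_total) * 100, over boolean per-sample judgments."""
    n_correct = sum(1 for ok in executed_correctly if ok)
    return 100.0 * n_correct / len(executed_correctly)

# 90 of 100 instructions correctly executed -> IFR of 90.0
print(instruction_following_rate([True] * 90 + [False] * 10))  # 90.0
```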

Compared with mainstream open-source models, **I-OSUM-Pangu achieves significantly better performance**:

<p align="center">
<img src="images/table2.png" width="65%"/>
</p>

---

### Flexibility vs. Accuracy

We evaluate whether natural language instructions (NL) degrade performance compared to fixed instructions (FI).

Results show that the model maintains strong flexibility while preserving task accuracy.

<p align="center">
<img src="images/table3.png" width="65%"/>
</p>

Conclusion:

Only minor performance drops appear, and only in relatively niche tasks such as:

- Style recognition
- Event detection

Core tasks such as:

- ASR
- SER
- SAP

remain almost unchanged, validating the effectiveness of the **Decoupled-then-Integrated strategy**.

---

### Multi-task Speech Understanding Performance

On public benchmarks, the model demonstrates competitive performance across multiple tasks, particularly in:

- Age prediction
- Emotion recognition (MER2023)

<p align="center">
<img src="images/table4.png" width="65%"/>
</p>

---

### Speech-to-Text Chat (STTC) Capability

We further evaluate the model in conversational reasoning scenarios.

I-OSUM-Pangu outperforms GLM-4-Voice on the TriviaQA and WebQ benchmarks.

<p align="center">
<img src="images/table5.png" width="65%"/>
</p>

---

### Ablation Study: Importance of the Decoupled Training Strategy

We compare direct joint training with our decoupled-then-integrated strategy to verify the effectiveness of our core design.

<p align="center">
<img src="images/table6.png" width="65%"/>
</p>

Conclusion:

Text-domain intent pretraining (Stage 2) establishes a strong semantic prior for the model and is crucial for improving instruction-following stability.

---

## How to Use the I-OSUM-Pangu Framework for Training and Inference

### Environment Setup

Before starting, please ensure that your device supports **NPU** and that the Python environment is properly configured.

We recommend running the code on a Linux system.

If Conda is not installed, please refer to:
https://blog.csdn.net/qq_41636123/article/details/130266232

```bash
# Create a new conda environment
conda create -n iosum python=3.10
conda activate iosum

# Clone the repository
git clone https://github.com/ASLP-lab/I-OSUM-Pangu.git
cd I-OSUM-Pangu

# Install dependencies
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```

### Model Download

```python
from huggingface_hub import snapshot_download

# Download the I-OSUM-Pangu model
snapshot_download(
    repo_id="ASLP-lab/I-OSUM-Pangu",
    local_dir="path",
    local_dir_use_symlinks=False,
    endpoint="https://hf-mirror.com"
)
```

### Inference

This project provides batch inference scripts for all tasks under `I-OSUM-Pangu/infer_code`:

```shell
python infer_ASR.py
```

### Training

To ensure a smooth training process, please follow the steps below.

#### 1. Data Preparation

Data can be prepared in three formats:

raw, shard, combine

Recommended: the shard format.

After preparing the dataset, write the generated data index into the following configuration file:

```
I-OSUM-Pangu/conf/data_s2t_tmp.yaml
```
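
As a rough illustration only, a shard-style data index in that file might look like the following. Every key and path here is hypothetical; the authoritative schema is whatever `conf/data_s2t_tmp.yaml` ships with in the repository:

```yaml
# Hypothetical sketch — consult the repository's data_s2t_tmp.yaml for the real schema.
train_data:
  format: shard                          # one of: raw, shard, combine
  data_list: data/train/shard_list.txt   # index of packed shard files
dev_data:
  format: shard
  data_list: data/dev/shard_list.txt
```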
#### 2. Start Training

Run the main training script:

```bash
bash I-OSUM-Pangu/train.sh
```
step_832499.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:80917360757afeee1deff1864055c12b851f75dae7b0a8452eef30ab0c4da67b
+ size 17180426102