ASLP-lab committed on
Commit 36f5333 · verified · 1 Parent(s): ab78ed1

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ images/I-OSUM-Pangu.png filter=lfs diff=lfs merge=lfs -text
+ images/Strategy.png filter=lfs diff=lfs merge=lfs -text
+ images/structure.png filter=lfs diff=lfs merge=lfs -text
+ images/table1.png filter=lfs diff=lfs merge=lfs -text
+ images/table4.png filter=lfs diff=lfs merge=lfs -text
images/I-OSUM-Pangu.png ADDED

Git LFS Details

  • SHA256: d82a5f9d15f7fa733072edf59e69fbddb6583dc64e35f118f774bd8ae4d6e3df
  • Pointer size: 131 Bytes
  • Size of remote file: 448 kB
images/Strategy.png ADDED

Git LFS Details

  • SHA256: f84fef23419ec79ca56707090045d0f4fa647238b14dc5aac3a3b67d93490142
  • Pointer size: 131 Bytes
  • Size of remote file: 173 kB
images/structure.png ADDED

Git LFS Details

  • SHA256: 83ea860ad7e4a5525e2ee52d83ba87dc88a0502ba330ff0a97f00cbf13a7913c
  • Pointer size: 131 Bytes
  • Size of remote file: 124 kB
images/table1.png ADDED

Git LFS Details

  • SHA256: 194bc0efc0f96cd523419ef4f935ca30a69e231dab3784f3e0dfe4d91ea56868
  • Pointer size: 131 Bytes
  • Size of remote file: 169 kB
images/table2.png ADDED
images/table3.png ADDED
images/table4.png ADDED

Git LFS Details

  • SHA256: e48941656514344aa09e30071a4a249fcab330b7f2aa9f7994525a20cdba72a3
  • Pointer size: 131 Bytes
  • Size of remote file: 124 kB
images/table5.png ADDED
images/table6.png ADDED
only_encder_ckpt.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e25b838f9da41c9a29c057799d4d8335f9f5ee0e33c09ea176348e4840353e72
+ size 6979084750
readme.md ADDED
@@ -0,0 +1,282 @@
<p align="center">
<h1>I-OSUM-Pangu: Intent-Aware Open-Source Speech Understanding Framework</h1>
</p>

Yujie Liao, Xuelong Geng, Shuiyuan Wang, Lei Xie

<p align="center">
<img src="images/I-OSUM-Pangu.png" width="400"/>
</p>

<p align="center">
<a href="https://github.com/ASLP-lab/I-OSUM-Pangu">Code</a>
</p>

In recent years, the development of large-scale audio-language models has enabled multi-dimensional speech understanding. However, most existing open-source models rely on fixed templates or task tags, while more powerful systems are often closed-source or require massive amounts of training data.

We propose **I-OSUM-Pangu**, an efficient, controllable, and fully open-source speech understanding framework.

The model is built upon:

- The Whisper-medium speech encoder (from OpenAI's Whisper series)
- The OpenPangu-7B large language model backbone

The core objective of our framework is to enable the model to:

- Understand user instructions expressed in natural language
- Automatically identify user intent
- Route the request to the corresponding speech understanding task
- Work without relying on fixed prompt templates

Experimental results show that:

- The Instruction Following Rate (IFR) exceeds **90%**
- Task performance remains comparable to traditional fixed-tag approaches

This project releases both code and model weights, aiming to provide a **reproducible and extensible open-source framework** for speech understanding research.

---

## Architecture

The overall architecture of I-OSUM-Pangu is shown below:

<p align="center">
<img src="images/structure.png" width="80%"/>
</p>

The model mainly consists of three components:

### 1. Speech Encoder

Whisper-medium, responsible for extracting speech representations.

### 2. Adapter

Transforms acoustic features into tokens compatible with the LLM input space.

### 3. Intent-Aware LLM

OpenPangu-7B, responsible for:

- Parsing natural language instructions
- Identifying user intent
- Determining which speech task to execute
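
The three components above can be sketched as a minimal data-flow pipeline. All names and transformations below are illustrative stand-ins (the real encoder is Whisper-medium and the real backbone is OpenPangu-7B; the repository's actual interfaces will differ):

```python
# Toy stand-ins showing the encoder -> adapter -> intent-aware LLM data flow.

def speech_encoder(audio):
    """Stand-in for the Whisper-medium encoder: raw samples -> frame features."""
    return [[s, s * 0.5] for s in audio]

def adapter(features):
    """Stand-in adapter: projects acoustic features into the LLM token space."""
    return [sum(frame) / len(frame) for frame in features]

def intent_aware_llm(instruction, speech_tokens):
    """Stand-in for the intent-aware LLM: parses the instruction, routes the task."""
    text = instruction.lower()
    if "transcribe" in text:
        task = "ASR"
    elif "emotion" in text:
        task = "SER"
    else:
        task = "OTHER"
    return {"task": task, "num_speech_tokens": len(speech_tokens)}

result = intent_aware_llm("Please transcribe this audio.",
                          adapter(speech_encoder([0.1, -0.2, 0.3])))
print(result)  # {'task': 'ASR', 'num_speech_tokens': 3}
```

The key point the sketch illustrates is that no task tag is passed in: the task is inferred from the free-form instruction alone.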

---

## Training Strategy

We propose a **Decoupled-then-Integrated Training Strategy**, illustrated below:

<p align="center">
<img src="images/Strategy.png" width="80%"/>
</p>

### Stage 1: Speech Understanding Alignment

Goal: Equip the model with multi-task speech understanding capability.

Characteristics:

- Only speech-related modules are trained
- Establishes strong acoustic representation ability

---

### Stage 2: Intent Understanding

Goal: Enable the model to understand natural language user instructions.

Examples:

- Please transcribe this audio.
- Analyze the speaker's emotion.
- Identify what event happens in the audio.

The model learns:

- Instruction semantic understanding
- Task mapping capability

---

### Stage 3: Joint Instruction Tuning

In the final stage, joint training allows the model to:

- Automatically parse user instructions
- Identify task types
- Execute the corresponding speech understanding tasks

without requiring fixed templates. Given any of:

- What is the emotion of this speech?
- Can you transcribe this audio?
- What event happens in the audio?

the model can correctly understand and execute all of them.

---

## Inference Results

### Dataset Configuration

The model is trained on **47,000 hours** of multi-task speech data, covering seven core speech tasks. Additionally, a dedicated dataset is constructed to enhance instruction-following ability.

<p align="center">
<img src="images/table1.png" width="65%"/>
</p>

---

### Instruction Following Performance (IFR)

The Instruction Following Rate (IFR) measures the ability of the model to parse natural language instructions and execute the corresponding tasks.

The metric is defined as:

\[
IFR = \left( \frac{N_{correct}}{N_{total}} \right) \times 100\%
\]

where:

- \(N_{correct}\) is the number of correctly executed instructions
- \(N_{total}\) is the total number of evaluation samples
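
The computation itself reduces to a simple ratio over per-sample judgments (a minimal sketch; deciding whether each instruction was "correctly executed" is the job of the evaluation protocol, not this function):

```python
def instruction_following_rate(executed_correctly):
    """IFR = (N_correct / N_total) * 100, over boolean per-sample judgments."""
    n_correct = sum(1 for ok in executed_correctly if ok)
    return 100.0 * n_correct / len(executed_correctly)

# 90 of 100 instructions correctly executed -> IFR of 90.0
print(instruction_following_rate([True] * 90 + [False] * 10))  # 90.0
```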

Compared with mainstream open-source models, **I-OSUM-Pangu achieves significantly better performance**:

<p align="center">
<img src="images/table2.png" width="65%"/>
</p>

---

### Flexibility vs. Accuracy

We evaluate whether natural language instructions (NL) degrade performance compared to fixed instructions (FI).

Results show that the model maintains strong flexibility while preserving task accuracy.

<p align="center">
<img src="images/table3.png" width="65%"/>
</p>

Conclusion:

Only minor performance drops appear, and only in relatively niche tasks such as:

- Style recognition
- Event detection

Core tasks such as:

- ASR
- SER
- SAP

remain almost unchanged, validating the effectiveness of the **Decoupled-then-Integrated strategy**.

---

### Multi-task Speech Understanding Performance

On public benchmarks, the model demonstrates competitive performance across multiple tasks, particularly in:

- Age prediction
- Emotion recognition (MER2023)

<p align="center">
<img src="images/table4.png" width="65%"/>
</p>

---

### Speech-to-Text Chat (STTC) Capability

We further evaluate the model in conversational reasoning scenarios.

I-OSUM-Pangu outperforms GLM-4-Voice on the TriviaQA and WebQ benchmarks.

<p align="center">
<img src="images/table5.png" width="65%"/>
</p>

---

### Ablation Study: Importance of the Decoupled Training Strategy

We compare direct joint training with our decoupled-then-integrated strategy to verify the effectiveness of our core design.

<p align="center">
<img src="images/table6.png" width="65%"/>
</p>

Conclusion:

Text-domain intent pretraining (Stage 2) establishes a strong semantic prior for the model and is crucial for improving instruction-following stability.

---

## How to Use the I-OSUM-Pangu Framework for Training and Inference

### Environment Setup

Before starting, please ensure that your device supports **NPU** and that the Python environment is properly configured.

We recommend running the code on a Linux system.

If Conda is not installed, please refer to:
https://blog.csdn.net/qq_41636123/article/details/130266232

```bash
# Create a new conda environment
conda create -n iosum python=3.10
conda activate iosum

# Clone the repository
git clone https://github.com/ASLP-lab/I-OSUM-Pangu.git
cd I-OSUM-Pangu

# Install dependencies
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```

### Model Download

```python
from huggingface_hub import snapshot_download

# Download the I-OSUM-Pangu model
snapshot_download(
    repo_id="ASLP-lab/I-OSUM-Pangu",
    local_dir="path",
    local_dir_use_symlinks=False,
    endpoint="https://hf-mirror.com"
)
```

### Inference

This project provides batch inference scripts for all tasks under `I-OSUM-Pangu/infer_code`:

```shell
python infer_ASR.py
```

### Training

To ensure a smooth training process, please follow the steps below.

#### 1. Data Preparation

Data can be prepared in three formats:

raw, shard, combine

Recommended: the shard format.

After preparing the dataset, write the generated data index into the following configuration file:

```
I-OSUM-Pangu/conf/data_s2t_tmp.yaml
```
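
As a rough illustration only, a shard-style data index in that file might look like the following. Every key and path here is hypothetical; the authoritative schema is whatever `conf/data_s2t_tmp.yaml` ships with in the repository:

```yaml
# Hypothetical sketch — consult the repository's data_s2t_tmp.yaml for the real schema.
train_data:
  format: shard                          # one of: raw, shard, combine
  data_list: data/train/shard_list.txt   # index of packed shard files
dev_data:
  format: shard
  data_list: data/dev/shard_list.txt
```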
#### 2. Start Training

Run the main training script:

```bash
bash I-OSUM-Pangu/train.sh
```
step_832499.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:80917360757afeee1deff1864055c12b851f75dae7b0a8452eef30ab0c4da67b
+ size 17180426102