yshao18 committed on commit e9392d3 (verified) · 1 parent: 625f14f

Update README.md

Files changed (1): README.md (+163 −3)
---
license: apache-2.0
datasets:
- wenetspeech
- gigaspeech
- common_voice
- iemocap
- crema-d
- meld
- ravdess
- tess
- dailytalk
- aishell-1
- emotiontalk
- cs-dialogue
- voxceleb2
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: audio-text-to-text
tags:
- speech
- speech-llm
- audio
- instruction-free
- paralinguistic
---

# AZeroS

**AZeroS** (Auden Zero-instruction-tuned Speech-LLM) extends a frozen LLM to speech via
**Self-Generated Instruction-Free Tuning (SIFT)**. It keeps both the LLM and the audio encoders frozen and
trains only lightweight projection modules on speech–text pairs, achieving strong semantic and
paralinguistic performance at modest training cost while generalizing well to unseen instructions.

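The SIFT idea above can be sketched roughly: because the frozen LLM generates its own training targets from a speech sample's transcript and metadata, no human-written instruction data is needed. Everything below is an illustrative assumption, not the paper's exact recipe; `llm_generate`, the prompt format, and the metadata fields are hypothetical stand-ins.

```python
# Rough sketch of self-generated, instruction-free target construction.
# `llm_generate` is a hypothetical stub for a call to the frozen LLM;
# the real recipe is described in the AZeroS paper and may differ.

def llm_generate(prompt: str) -> str:
    # Hypothetical stub: a real system would query the frozen Qwen2.5 LLM here.
    return f"[response conditioned on: {prompt}]"

def make_sift_target(transcript: str, paralinguistics: dict) -> dict:
    """Build one instruction-free training pair from a speech-text sample."""
    meta = ", ".join(f"{k}={v}" for k, v in paralinguistics.items())
    # The frozen LLM self-generates the target text from transcript + metadata,
    # so no instruction annotation is needed.
    prompt = f"Transcript: {transcript} (speaker: {meta})"
    return {"audio_text": transcript, "target": llm_generate(prompt)}

pair = make_sift_target("I can't believe we won!", {"emotion": "happy", "gender": "female"})
print(pair["target"])
```

Presumably each self-generated target is then paired with the original audio, so the trained projection modules learn to map speech features into inputs the frozen LLM already responds to well.
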
🔗 **Paper**: https://arxiv.org/pdf/2601.06086
🔗 **Code**: https://github.com/AudenAI/Auden/tree/main/examples/azeros
🔗 **Model**: https://huggingface.co/AudenAI/azeros
🔗 **Auden Repo**: https://github.com/AudenAI/Auden

## 🔍 What Can This Model Do?

- 🎙️ **Speech understanding** (semantic content understanding and dialogue)
- 😊 **Paralinguistic analysis** (emotion, age, gender, etc.)

## Quick Start

```python
import torch
from model import AZerosModel  # resolves when run from examples/azeros (see Auden Setup below)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AZerosModel.from_pretrained("AudenAI/azeros").to(device)

# One chat-style message list per input wav; the audio placeholder token
# marks where the waveform is injected into the prompt.
wav_files = ["speech1.wav", "speech2.wav"]
messages = [
    [
        {
            "role": "user",
            "content": f"{model.audio_token_wrapped} Please analyze speech content and paralinguistic information.",
        }
    ]
    for _ in wav_files
]

# Greedy decoding: single beam, no sampling.
generate_config = {
    "max_new_tokens": 200,
    "num_beams": 1,
    "do_sample": False,
    "min_length": 1,
    "repetition_penalty": 1.0,
    "length_penalty": 1.0,
    "top_p": None,
    "top_k": None,
    "temperature": None,
}

outputs = model.generate(wav_files, messages, **generate_config)
print(outputs)
```
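
The model expects 16 kHz waveforms (see Model Characteristics). If your files use another rate, resample them first. A minimal numpy sketch using linear interpolation follows; in practice prefer `torchaudio.functional.resample` or librosa, which apply proper anti-aliasing filtering.

```python
import numpy as np

def resample_to_16k(wav: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Naive linear-interpolation resampling; fine as a sketch, but it has
    no anti-aliasing filter, so use torchaudio/librosa for real audio."""
    if orig_sr == target_sr:
        return wav
    n_out = int(round(len(wav) / orig_sr * target_sr))
    old_t = np.arange(len(wav)) / orig_sr     # original sample times (s)
    new_t = np.arange(n_out) / target_sr      # target sample times (s)
    return np.interp(new_t, old_t, wav).astype(wav.dtype)

# Example: 1 second of a 440 Hz tone at 44.1 kHz -> 16 kHz
sr = 44100
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)
out = resample_to_16k(tone, sr)
print(len(out))  # 16000
```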

## Auden Setup (Required)

This model relies on the Auden codebase for loading and inference:

```bash
git clone https://github.com/AudenAI/Auden.git
cd Auden
pip install -e .
cd examples/azeros
```

## 📌 Model Characteristics

- Input: Raw audio waveform (16 kHz) or text
- Output: Text responses to the input
- Backend LLM: Qwen2.5-7B-Instruct
- Encoders: [TTA](https://huggingface.co/AudenAI/auden-encoder-tta-m10) and [Auden-Voice](https://huggingface.co/AudenAI/auden-encoder-voice)
- Architecture: Frozen LLM + frozen audio encoders + lightweight projection modules
- Training paradigm: Self-Generated Instruction-Free Tuning (SIFT)

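The projection modules are the only trained components. As a rough illustration of what such a module might do, here is a numpy sketch assuming frame stacking (temporal downsampling) followed by a single linear map into the LLM embedding space; the actual module design and dimensions are defined in the Auden code, and the toy sizes here are for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; in the real model the encoder feature size and the LLM
# hidden size (3584 for Qwen2.5-7B) are much larger.
enc_dim, llm_dim, stack = 8, 16, 4  # stack 4 encoder frames per LLM token

def project(features: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Map encoder frames (T, enc_dim) to LLM-space embeddings (T//stack, llm_dim)."""
    t = (features.shape[0] // stack) * stack             # drop the ragged tail
    stacked = features[:t].reshape(-1, stack * enc_dim)  # concat adjacent frames
    return stacked @ w + b                               # one linear projection

w = rng.standard_normal((stack * enc_dim, llm_dim)) * 0.01
b = np.zeros(llm_dim)
feats = rng.standard_normal((50, enc_dim))  # 50 encoder frames
print(project(feats, w, b).shape)  # (12, 16)
```

Because only `w` and `b` (per module) would be trained, the LLM and both encoders stay frozen, which is what keeps the training cost modest.
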
## 📊 Evaluation

### VoiceBench

| Model | Alpaca Eval | Comm Eval | Wild Voice | SD-QA | BBH | Adv Bench | IF Eval | OBQA | MMSU | Overall |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Text Only Model** | | | | | | | | | | |
| Qwen2.5 | 4.66 | 4.55 | 4.62 | 62.03 | 80.00 | 99.04 | 70.14 | 84.84 | 71.57 | 82.69 |
| Qwen2.5 (TN) | 4.61 | 4.53 | 4.56 | 63.84 | 56.30 | 98.85 | 66.11 | 74.07 | 64.51 | 77.52 |
| **Cascaded System** | | | | | | | | | | |
| Whisper+GPT-4o | 4.80 | 4.47 | 4.62 | 75.77 | 87.20 | 98.27 | 76.51 | 92.97 | 81.69 | 87.80 |
| Whisper+Qwen2.5 | 4.64 | 4.33 | 4.21 | 58.50 | 52.85 | 98.27 | 63.99 | 78.24 | 69.00 | 76.05 |
| **End-to-end Speech-LLM** | | | | | | | | | | |
| GPT-4o | 4.78 | 4.49 | 4.58 | 75.50 | 84.10 | 98.65 | 76.02 | 89.23 | 80.25 | 86.75 |
| Moshi | 2.01 | 1.60 | 1.30 | 15.64 | 47.40 | 44.23 | 10.12 | 25.93 | 24.04 | 29.51 |
| Phi-4-multimodal | 3.81 | 3.82 | 3.56 | 39.78 | 61.80 | 100.00 | 45.35 | 65.93 | 42.19 | 64.32 |
| GLM-4-Voice | 3.97 | 3.42 | 3.18 | 36.98 | 52.80 | 88.08 | 25.92 | 53.41 | 39.75 | 56.48 |
| Qwen2-Audio | 3.42 | 3.29 | 2.76 | 31.65 | 53.00 | 99.04 | 26.35 | 48.35 | 36.14 | 53.77 |
| DeSTA2.5 | 3.73 | 2.52 | 3.30 | 46.47 | 62.40 | 97.69 | 65.47 | 72.75 | 58.56 | 66.04 |
| Qwen2.5-Omni | 3.88 | 3.77 | 3.52 | 46.75 | 63.70 | 97.31 | 40.19 | 81.54 | 61.45 | 68.26 |
| Qwen3-Omni-30B | 4.74 | 4.54 | 4.58 | 76.90 | 80.40 | 99.30 | 77.80 | 89.70 | 68.10 | **85.49** |
| **AZeroS (ours)** | 4.44 | 4.18 | 3.91 | 60.22 | 56.30 | 98.65 | 61.29 | 72.09 | 59.01 | **73.13** |

### AIRBench

| Model | Gender | Emotion | Age | LID | Entity | Intent | Avg | Chat |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Cascaded System** | | | | | | | | |
| Whisper+GPT-4o | 21.90 | 59.50 | 41.10 | 96.80 | 69.80 | 87.70 | 62.80 | 7.54 |
| Whisper+Qwen2.5 | 28.36 | 50.80 | 36.40 | 88.00 | 73.60 | 82.70 | 59.98 | 7.34 |
| **End-to-end Speech-LLM** | | | | | | | | |
| GPT-4o | * | 49.10 | * | 76.00 | 61.60 | 85.80 | * | 7.53 |
| Gemini2.5-pro | 90.70 | 60.70 | 34.10 | 99.10 | 68.50 | 92.20 | 74.22 | 8.52 |
| SALMONN | 35.50 | 29.90 | 48.70 | 28.10 | 51.70 | 36.70 | 38.43 | 6.16 |
| GLM-4-Voice | 23.91 | 22.95 | 18.70 | 25.40 | 27.90 | 21.10 | 23.33 | 5.53 |
| Qwen2-Audio | 64.71 | 48.15 | 23.10 | 77.80 | 87.00 | 84.70 | 64.24 | 7.20 |
| DeSTA2.5 | 84.24 | 64.30 | 65.60 | 97.30 | 65.20 | 83.70 | 76.72 | 7.57 |
| Qwen2.5-Omni | 89.76 | 54.85 | 44.80 | 89.70 | 79.70 | 88.60 | 74.57 | 6.97 |
| Qwen3-Omni-30B | 91.11 | 62.20 | 36.90 | 97.70 | 80.40 | 90.70 | **76.50** | **7.85** |
| **AZeroS (ours)** | 86.75 | 71.45 | 61.30 | 84.80 | 73.60 | 85.60 | **77.25** | **8.28** |

*For multiple-choice tasks, an additional prompt is appended to keep the output format stable: “Please make your choice among A/B/C/D and do not output other texts.”*

## ⚠️ Limitations

- Trained on public datasets; performance may degrade on out-of-domain audio.
- Not designed for safety-critical applications.

## Citation

If you use AZeroS in your research, please cite:

```bibtex
@article{shao2026azeros,
  title={AZEROS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning},
  author={Shao, Yiwen and Liu, Wei and Li, Jiahong and Wang, Tianzi and Wei, Kun and Yu, Meng and Yu, Dong},
  journal={arXiv preprint arXiv:2601.06086},
  year={2026}
}
```