---
license: apache-2.0
datasets:
- wenetspeech
- gigaspeech
- common_voice
- iemocap
- crema-d
- meld
- ravdess
- tess
- dailytalk
- aishell-1
- emotiontalk
- cs-dialogue
- voxceleb2
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: audio-text-to-text
tags:
- speech
- speech-llm
- audio
- instruction-free
- paralinguistic
---

# AZeroS

**AZeroS** (Auden Zero-instruction-tuned Speech-LLM) extends a frozen LLM to speech via
**Self-Generated Instruction-Free Tuning (SIFT)**. It keeps the LLM and audio encoders frozen and
trains lightweight projection modules on speech–text pairs, achieving strong semantic and
paralinguistic performance at modest training cost while generalizing well to unseen instructions.

🔗 **Paper**: https://arxiv.org/pdf/2601.06086
🔗 **Code**: https://github.com/AudenAI/Auden/tree/main/examples/azeros
🔗 **Model**: https://huggingface.co/AudenAI/azeros
🔗 **Auden Repo**: https://github.com/AudenAI/Auden

## 🔍 What Can This Model Do?

- 🎙️ **Speech understanding** (semantic content understanding and dialog)
- 😊 **Paralinguistic analysis** (emotion, age, gender, etc.)

## Quick Start

```python
import torch
from model import AZerosModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AZerosModel.from_pretrained("AudenAI/azeros").to(device)

# One single-turn conversation per audio file
wav_files = ["speech1.wav", "speech2.wav"]
messages = [
    [
        {
            "role": "user",
            "content": f"{model.audio_token_wrapped} Please analyze speech content and paralinguistic information.",
        }
    ]
    for _ in wav_files
]

# Greedy decoding settings
generate_config = {
    "max_new_tokens": 200,
    "num_beams": 1,
    "do_sample": False,
    "min_length": 1,
    "repetition_penalty": 1.0,
    "length_penalty": 1.0,
    "top_p": None,
    "top_k": None,
    "temperature": None,
}

outputs = model.generate(wav_files, messages, **generate_config)
print(outputs)
```

## Auden Setup (Required)

This model relies on the Auden codebase for loading and inference:

```bash
git clone https://github.com/AudenAI/Auden.git
cd Auden
pip install -e .
cd examples/azeros
```

## 📌 Model Characteristics

- Input: Raw audio waveform (16 kHz) or text
- Output: Text responses to the input
- Backend LLM: Qwen2.5-7B-Instruct
- Encoders: [TTA](https://huggingface.co/AudenAI/auden-encoder-tta-m10) and [Auden-Voice](https://huggingface.co/AudenAI/auden-encoder-voice)
- Architecture: Frozen LLM + frozen audio encoders + lightweight projection modules
- Training paradigm: Self-Generated Instruction-Free Tuning (SIFT)

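The quick start passes file paths directly, but the model expects 16 kHz input; audio recorded at other rates needs resampling first. A minimal preprocessing sketch using plain NumPy linear interpolation — the `to_16k_mono` helper and the mono mixdown are illustrative assumptions, not part of the Auden API:

```python
import numpy as np

def to_16k_mono(waveform: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Mix down to mono and resample by linear interpolation (illustrative only)."""
    if waveform.ndim == 2:  # (samples, channels) -> average channels to mono
        waveform = waveform.mean(axis=1)
    if orig_sr == target_sr:
        return waveform
    n_out = int(round(waveform.shape[0] / orig_sr * target_sr))
    # Express both sample grids in seconds, then interpolate onto the output grid
    t_in = np.arange(waveform.shape[0]) / orig_sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, waveform).astype(np.float32)

# Example: 1 s of stereo audio at 48 kHz becomes 16,000 mono samples
stereo_48k = np.zeros((48000, 2), dtype=np.float32)
mono_16k = to_16k_mono(stereo_48k, orig_sr=48000)
print(mono_16k.shape)  # (16000,)
```

In practice a dedicated resampler (e.g. from `torchaudio` or `librosa`) will give better quality; the sketch only illustrates the expected input format.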
## 📊 Evaluation

### VoiceBench

| Model | Alpaca Eval | Comm Eval | Wild Voice | SD-QA | BBH | Adv Bench | IF Eval | OBQA | MMSU | Overall |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Text Only Model** | | | | | | | | | | |
| Qwen2.5 | 4.66 | 4.55 | 4.62 | 62.03 | 80.00 | 99.04 | 70.14 | 84.84 | 71.57 | 82.69 |
| Qwen2.5 (TN) | 4.61 | 4.53 | 4.56 | 63.84 | 56.30 | 98.85 | 66.11 | 74.07 | 64.51 | 77.52 |
| **Cascaded System** | | | | | | | | | | |
| Whisper+GPT-4o | 4.80 | 4.47 | 4.62 | 75.77 | 87.20 | 98.27 | 76.51 | 92.97 | 81.69 | 87.80 |
| Whisper+Qwen2.5 | 4.64 | 4.33 | 4.21 | 58.50 | 52.85 | 98.27 | 63.99 | 78.24 | 69.00 | 76.05 |
| **End-to-end Speech-LLM** | | | | | | | | | | |
| GPT-4o | 4.78 | 4.49 | 4.58 | 75.50 | 84.10 | 98.65 | 76.02 | 89.23 | 80.25 | 86.75 |
| Moshi | 2.01 | 1.60 | 1.30 | 15.64 | 47.40 | 44.23 | 10.12 | 25.93 | 24.04 | 29.51 |
| Phi-4-multimodal | 3.81 | 3.82 | 3.56 | 39.78 | 61.80 | 100.00 | 45.35 | 65.93 | 42.19 | 64.32 |
| GLM-4-Voice | 3.97 | 3.42 | 3.18 | 36.98 | 52.80 | 88.08 | 25.92 | 53.41 | 39.75 | 56.48 |
| Qwen2-Audio | 3.42 | 3.29 | 2.76 | 31.65 | 53.00 | 99.04 | 26.35 | 48.35 | 36.14 | 53.77 |
| DeSTA2.5 | 3.73 | 2.52 | 3.30 | 46.47 | 62.40 | 97.69 | 65.47 | 72.75 | 58.56 | 66.04 |
| Qwen2.5-Omni | 3.88 | 3.77 | 3.52 | 46.75 | 63.70 | 97.31 | 40.19 | 81.54 | 61.45 | 68.26 |
| Qwen3-Omni-30B | 4.74 | 4.54 | 4.58 | 76.90 | 80.40 | 99.30 | 77.80 | 89.70 | 68.10 | **85.49** |
| **AZeroS (ours)** | 4.44 | 4.18 | 3.91 | 60.22 | 56.30 | 98.65 | 61.29 | 72.09 | 59.01 | **73.13** |

### AIRBench

| Model | Gender | Emotion | Age | LID | Entity | Intent | Avg | Chat |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Cascaded System** | | | | | | | | |
| Whisper+GPT-4o | 21.90 | 59.50 | 41.10 | 96.80 | 69.80 | 87.70 | 62.80 | 7.54 |
| Whisper+Qwen2.5 | 28.36 | 50.80 | 36.40 | 88.00 | 73.60 | 82.70 | 59.98 | 7.34 |
| **End-to-end Speech-LLM** | | | | | | | | |
| GPT-4o | * | 49.10 | * | 76.00 | 61.60 | 85.80 | * | 7.53 |
| Gemini2.5-pro | 90.70 | 60.70 | 34.10 | 99.10 | 68.50 | 92.20 | 74.22 | 8.52 |
| SALMONN | 35.50 | 29.90 | 48.70 | 28.10 | 51.70 | 36.70 | 38.43 | 6.16 |
| GLM-4-Voice | 23.91 | 22.95 | 18.70 | 25.40 | 27.90 | 21.10 | 23.33 | 5.53 |
| Qwen2-Audio | 64.71 | 48.15 | 23.10 | 77.80 | 87.00 | 84.70 | 64.24 | 7.20 |
| DeSTA2.5 | 84.24 | 64.30 | 65.60 | 97.30 | 65.20 | 83.70 | 76.72 | 7.57 |
| Qwen2.5-Omni | 89.76 | 54.85 | 44.80 | 89.70 | 79.70 | 88.60 | 74.57 | 6.97 |
| Qwen3-Omni-30B | 91.11 | 62.20 | 36.90 | 97.70 | 80.40 | 90.70 | **76.50** | **7.85** |
| **AZeroS (ours)** | 86.75 | 71.45 | 61.30 | 84.80 | 73.60 | 85.60 | **77.25** | **8.28** |

*An additional prompt is appended so that the model reliably outputs a choice: “Please make your choice among A/B/C/D and do not output other texts.”*

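The choice-forcing instruction above can simply be appended to the user turn when building `messages` for the quick-start API. A small sketch — the `build_mc_message` helper and the `<AUDIO>` placeholder are illustrative assumptions, not part of the released evaluation code:

```python
# Prompt text taken verbatim from the evaluation note above
CHOICE_SUFFIX = "Please make your choice among A/B/C/D and do not output other texts."

def build_mc_message(audio_token: str, question: str) -> list:
    """One single-turn conversation: audio placeholder + question + choice-forcing suffix."""
    return [
        {
            "role": "user",
            "content": f"{audio_token} {question} {CHOICE_SUFFIX}",
        }
    ]

# Hypothetical usage; in practice the placeholder comes from model.audio_token_wrapped
msg = build_mc_message(
    "<AUDIO>",
    "What is the speaker's emotion? A. happy B. sad C. angry D. neutral",
)
print(msg[0]["content"])
```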
## ⚠️ Limitations

- Trained on public datasets; performance may degrade on out-of-domain audio.
- Not designed for safety-critical applications.

## Citation

If you use AZeroS in your research, please cite:

```bibtex
@article{shao2026azeros,
  title={AZEROS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning},
  author={Shao, Yiwen and Liu, Wei and Li, Jiahong and Wang, Tianzi and Wei, Kun and Yu, Meng and Yu, Dong},
  journal={arXiv preprint arXiv:2601.06086},
  year={2026}
}
```