---
license: apache-2.0
datasets:
- wenetspeech
- gigaspeech
- common_voice
- iemocap
- crema-d
- meld
- ravdess
- tess
- dailytalk
- aishell-1
- emotiontalk
- cs-dialogue
- voxceleb2
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: audio-text-to-text
tags:
- speech
- speech-llm
- audio
- instruction-free
- paralinguistic
---

# AZeroS

**AZeroS** (Auden Zero-instruction-tuned Speech-LLM) extends a frozen LLM to speech via
**Self-Generated Instruction-Free Tuning (SIFT)**. It keeps the LLM and audio encoders frozen and
trains lightweight projection modules on speech–text pairs, achieving strong semantic and
paralinguistic performance at modest training cost while generalizing well to unseen instructions.

🔗 **Paper**: https://arxiv.org/pdf/2601.06086
🔗 **Code**: https://github.com/AudenAI/Auden/tree/main/examples/azeros
🔗 **Model**: https://huggingface.co/AudenAI/azeros
🔗 **Auden Repo**: https://github.com/AudenAI/Auden

## 🔍 What Can This Model Do?

- 🎙️ **Speech understanding** (semantic content understanding and dialog)
- 😊 **Paralinguistic analysis** (emotion, age, gender, etc.)

## Quick Start

```python
import torch
from model import AZerosModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AZerosModel.from_pretrained("AudenAI/azeros").to(device)

# One single-turn conversation per audio file
wav_files = ["speech1.wav", "speech2.wav"]
messages = [
    [
        {
            "role": "user",
            "content": f"{model.audio_token_wrapped} Please analyze speech content and paralinguistic information.",
        }
    ]
    for _ in wav_files
]

# Greedy decoding settings
generate_config = {
    "max_new_tokens": 200,
    "num_beams": 1,
    "do_sample": False,
    "min_length": 1,
    "repetition_penalty": 1.0,
    "length_penalty": 1.0,
    "top_p": None,
    "top_k": None,
    "temperature": None,
}

outputs = model.generate(wav_files, messages, **generate_config)
print(outputs)
```

## Auden Setup (Required)

This model relies on the Auden codebase for loading and inference:

```bash
git clone https://github.com/AudenAI/Auden.git
cd Auden
pip install -e .
cd examples/azeros
```

## 📌 Model Characteristics

- Input: Raw audio waveform (16 kHz) or text
- Output: Text responses to the input
- Backend LLM: Qwen2.5-7B-Instruct
- Encoders: [TTA](https://huggingface.co/AudenAI/auden-encoder-tta-m10) and [Auden-Voice](https://huggingface.co/AudenAI/auden-encoder-voice)
- Architecture: Frozen LLM + frozen audio encoders + lightweight projection modules
- Training paradigm: Self-Generated Instruction-Free Tuning (SIFT)

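The quick start passes file paths directly, but the model expects 16 kHz input; audio recorded at other rates needs resampling first. A minimal preprocessing sketch using plain NumPy linear interpolation — the `to_16k_mono` helper and the mono mixdown are illustrative assumptions, not part of the Auden API:

```python
import numpy as np

def to_16k_mono(waveform: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Mix down to mono and resample by linear interpolation (illustrative only)."""
    if waveform.ndim == 2:  # (samples, channels) -> average channels to mono
        waveform = waveform.mean(axis=1)
    if orig_sr == target_sr:
        return waveform
    n_out = int(round(waveform.shape[0] / orig_sr * target_sr))
    # Express both sample grids in seconds, then interpolate onto the output grid
    t_in = np.arange(waveform.shape[0]) / orig_sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, waveform).astype(np.float32)

# Example: 1 s of stereo audio at 48 kHz becomes 16,000 mono samples
stereo_48k = np.zeros((48000, 2), dtype=np.float32)
mono_16k = to_16k_mono(stereo_48k, orig_sr=48000)
print(mono_16k.shape)  # (16000,)
```

In practice a dedicated resampler (e.g. from `torchaudio` or `librosa`) will give better quality; the sketch only illustrates the expected input format.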
## 📊 Evaluation

### VoiceBench

| Model | Alpaca Eval | Comm Eval | Wild Voice | SD-QA | BBH | Adv Bench | IF Eval | OBQA | MMSU | Overall |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Text Only Model** | | | | | | | | | | |
| Qwen2.5 | 4.66 | 4.55 | 4.62 | 62.03 | 80.00 | 99.04 | 70.14 | 84.84 | 71.57 | 82.69 |
| Qwen2.5 (TN) | 4.61 | 4.53 | 4.56 | 63.84 | 56.30 | 98.85 | 66.11 | 74.07 | 64.51 | 77.52 |
| **Cascaded System** | | | | | | | | | | |
| Whisper+GPT-4o | 4.80 | 4.47 | 4.62 | 75.77 | 87.20 | 98.27 | 76.51 | 92.97 | 81.69 | 87.80 |
| Whisper+Qwen2.5 | 4.64 | 4.33 | 4.21 | 58.50 | 52.85 | 98.27 | 63.99 | 78.24 | 69.00 | 76.05 |
| **End-to-end Speech-LLM** | | | | | | | | | | |
| GPT-4o | 4.78 | 4.49 | 4.58 | 75.50 | 84.10 | 98.65 | 76.02 | 89.23 | 80.25 | 86.75 |
| Moshi | 2.01 | 1.60 | 1.30 | 15.64 | 47.40 | 44.23 | 10.12 | 25.93 | 24.04 | 29.51 |
| Phi-4-multimodal | 3.81 | 3.82 | 3.56 | 39.78 | 61.80 | 100.00 | 45.35 | 65.93 | 42.19 | 64.32 |
| GLM-4-Voice | 3.97 | 3.42 | 3.18 | 36.98 | 52.80 | 88.08 | 25.92 | 53.41 | 39.75 | 56.48 |
| Qwen2-Audio | 3.42 | 3.29 | 2.76 | 31.65 | 53.00 | 99.04 | 26.35 | 48.35 | 36.14 | 53.77 |
| DeSTA2.5 | 3.73 | 2.52 | 3.30 | 46.47 | 62.40 | 97.69 | 65.47 | 72.75 | 58.56 | 66.04 |
| Qwen2.5-Omni | 3.88 | 3.77 | 3.52 | 46.75 | 63.70 | 97.31 | 40.19 | 81.54 | 61.45 | 68.26 |
| Qwen3-Omni-30B | 4.74 | 4.54 | 4.58 | 76.90 | 80.40 | 99.30 | 77.80 | 89.70 | 68.10 | **85.49** |
| **AZeroS (ours)** | 4.44 | 4.18 | 3.91 | 60.22 | 56.30 | 98.65 | 61.29 | 72.09 | 59.01 | **73.13** |

### AIRBench

| Model | Gender | Emotion | Age | LID | Entity | Intent | Avg | Chat |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Cascaded System** | | | | | | | | |
| Whisper+GPT-4o | 21.90 | 59.50 | 41.10 | 96.80 | 69.80 | 87.70 | 62.80 | 7.54 |
| Whisper+Qwen2.5 | 28.36 | 50.80 | 36.40 | 88.00 | 73.60 | 82.70 | 59.98 | 7.34 |
| **End-to-end Speech-LLM** | | | | | | | | |
| GPT-4o | * | 49.10 | * | 76.00 | 61.60 | 85.80 | * | 7.53 |
| Gemini2.5-pro | 90.70 | 60.70 | 34.10 | 99.10 | 68.50 | 92.20 | 74.22 | 8.52 |
| SALMONN | 35.50 | 29.90 | 48.70 | 28.10 | 51.70 | 36.70 | 38.43 | 6.16 |
| GLM-4-Voice | 23.91 | 22.95 | 18.70 | 25.40 | 27.90 | 21.10 | 23.33 | 5.53 |
| Qwen2-Audio | 64.71 | 48.15 | 23.10 | 77.80 | 87.00 | 84.70 | 64.24 | 7.20 |
| DeSTA2.5 | 84.24 | 64.30 | 65.60 | 97.30 | 65.20 | 83.70 | 76.72 | 7.57 |
| Qwen2.5-Omni | 89.76 | 54.85 | 44.80 | 89.70 | 79.70 | 88.60 | 74.57 | 6.97 |
| Qwen3-Omni-30B | 91.11 | 62.20 | 36.90 | 97.70 | 80.40 | 90.70 | **76.50** | **7.85** |
| **AZeroS (ours)** | 86.75 | 71.45 | 61.30 | 84.80 | 73.60 | 85.60 | **77.25** | **8.28** |

*An additional prompt is appended so that the model reliably outputs a choice: “Please make your choice among A/B/C/D and do not output other texts.”*

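The choice-forcing instruction above can simply be appended to the user turn when building `messages` for the quick-start API. A small sketch — the `build_mc_message` helper and the `<AUDIO>` placeholder are illustrative assumptions, not part of the released evaluation code:

```python
# Prompt text taken verbatim from the evaluation note above
CHOICE_SUFFIX = "Please make your choice among A/B/C/D and do not output other texts."

def build_mc_message(audio_token: str, question: str) -> list:
    """One single-turn conversation: audio placeholder + question + choice-forcing suffix."""
    return [
        {
            "role": "user",
            "content": f"{audio_token} {question} {CHOICE_SUFFIX}",
        }
    ]

# Hypothetical usage; in practice the placeholder comes from model.audio_token_wrapped
msg = build_mc_message(
    "<AUDIO>",
    "What is the speaker's emotion? A. happy B. sad C. angry D. neutral",
)
print(msg[0]["content"])
```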
## ⚠️ Limitations

- Trained on public datasets; performance may degrade on out-of-domain audio.
- Not designed for safety-critical applications.

## Citation

If you use AZeroS in your research, please cite:

```bibtex
@article{shao2026azeros,
  title={AZEROS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning},
  author={Shao, Yiwen and Liu, Wei and Li, Jiahong and Wang, Tianzi and Wei, Kun and Yu, Meng and Yu, Dong},
  journal={arXiv preprint arXiv:2601.06086},
  year={2026}
}
```