<h1 align="center">Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR</h1>
<div align="center">
<div style="text-align: center;">
<img src="https://img.shields.io/badge/Python-3.10+-blue" alt="Python">
<img src="https://img.shields.io/badge/License-Apache%202.0-green" alt="License">
<a href="https://arxiv.org/abs/2604.03074">
<img src="https://img.shields.io/badge/arXiv-paper-red" alt="arXiv Paper">
</a>
<a href="https://huggingface.co/collections/ASLP-lab/speaker-reasoner">
<img src="https://img.shields.io/badge/HuggingFace-Models-ffd21e" alt="HuggingFace">
</a>
<a href="https://github.com/ASLP-lab/Speaker-Reasoner">
<img src="https://img.shields.io/badge/GitHub-repo-black" alt="GitHub">
</a>
<a href="http://www.npu-aslp.org/">
<img src="https://img.shields.io/badge/🏫-ASLP-grey?labelColor=lightgrey" alt="lab">
</a>
</div>
</div>
<div align="center">
<h3>
Zhennan Lin<sup>1</sup>, Shuai Wang<sup>2</sup>, Zhaokai Sun<sup>1</sup>, Pengyuan Xie<sup>3</sup>, Chuan Xie<sup>3</sup>, Jie Liu<sup>3</sup>, Qiang Zhang<sup>3</sup>, Lei Xie<sup>1†</sup>
</h3>
<p>
<sup>†</sup>Corresponding author
</p>
<p>
<sup>1</sup>Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University<br>
<sup>2</sup>School of Intelligence Science and Technology, Nanjing University<br>
<sup>3</sup>Shanghai Lingguang Zhaxian Technology
</p>
</div>
----
Speaker-Reasoner is an end-to-end Speech LLM for **timestamped speaker-attributed ASR** featuring agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window.

## 🌟 Highlights
- **Agentic multi-turn reasoning**: iterative global-to-local inference along the temporal axis — global speaker summary → boundary prediction → fine-grained segment decoding
- **Speaker-aware context cache**: extends processing to long-form audio beyond the training context window while preserving speaker consistency across chunks
- **Three-stage progressive training**: multi-task foundation → temporal interaction learning → cache-conditioned decoding
- **State-of-the-art performance**: outperforms strong baselines including closed-source Gemini-2.5-Pro on AliMeeting and AISHELL-4
- 🔥 **Bilingual & Scaled up**: extended training on 4,194 hours of multi-domain data, natively supporting English and Mandarin across complex multi-speaker scenarios
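
The global-to-local loop described above can be pictured with a minimal control-flow sketch. This is not the repository's API: `StubSpeechLLM`, its three methods, the fixed 20-second boundary step, and the canned outputs are all hypothetical placeholders standing in for real model turns, and the speaker-aware cache is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    speaker: str   # e.g. "spk1"
    gender: str
    text: str


class StubSpeechLLM:
    """Hypothetical stand-in for the model; returns canned answers, not real ASR."""

    def global_summary(self, duration: float) -> dict[str, str]:
        # Turn 1: coarse pass over the whole chunk -> speaker id to gender.
        return {"spk1": "female", "spk2": "male"}

    def next_boundary(self, cursor: float, duration: float) -> float:
        # Turn 2: the model decides where the next segment should end.
        # A fixed 20 s step is used here purely for illustration.
        return min(cursor + 20.0, duration)

    def decode(self, start: float, end: float, speakers: dict[str, str]) -> list[Segment]:
        # Turn 3: fine-grained decoding inside [start, end].
        spk = "spk1" if int(start // 20) % 2 == 0 else "spk2"
        return [Segment(start, end, spk, speakers[spk], f"<text {start:.0f}-{end:.0f}s>")]


def transcribe_chunk(model: StubSpeechLLM, duration: float) -> list[Segment]:
    """Run summary -> boundary -> decode until the chunk is fully covered."""
    speakers = model.global_summary(duration)
    cursor, results = 0.0, []
    while cursor < duration:
        end = model.next_boundary(cursor, duration)
        results.extend(model.decode(cursor, end, speakers))
        cursor = end
    return results


segments = transcribe_chunk(StubSpeechLLM(), duration=50.0)
for s in segments:
    print(f"[{s.start:.1f}-{s.end:.1f}] {s.speaker} ({s.gender}): {s.text}")
```

In the real system each method would correspond to one interaction turn with the Speech LLM, and the boundaries would come from the model rather than a fixed step.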
## 📊 Results
### Comprehensive Multi-Domain Evaluation
<p>We further scaled up Speaker-Reasoner with 4,194 hours of bilingual (ZH/EN) training data. The scaled-up model achieves the best score on every metric for seven of the eight test sets below, spanning challenging video domains and public meeting datasets; on MLC-SLM-Eval-2, VibeVoice-ASR leads on WER, cpWER, and DER. Best results are in bold.</p>
<div style="overflow-x: auto;">
<table style="white-space: nowrap;">
<thead>
<tr>
<th rowspan="2">Model</th>
<th colspan="4" align="center">Video-Internal-Eval</th>
<th colspan="4" align="center">Video-Internal-Eval-zh</th>
<th colspan="4" align="center">Video-Internal-Eval-en</th>
<th colspan="4" align="center">AISHELL4-Eval</th>
<th colspan="4" align="center">Alimeeting-Far</th>
<th colspan="4" align="center">AMI-SDM</th>
<th colspan="4" align="center">MLC-SLM-Eval-1</th>
<th colspan="4" align="center">MLC-SLM-Eval-2</th>
</tr>
<tr>
<th>WER↓</th><th>cpWER↓</th><th>DER↓</th><th>∆cp↓</th>
<th>WER↓</th><th>cpWER↓</th><th>DER↓</th><th>∆cp↓</th>
<th>WER↓</th><th>cpWER↓</th><th>DER↓</th><th>∆cp↓</th>
<th>WER↓</th><th>cpWER↓</th><th>DER↓</th><th>∆cp↓</th>
<th>WER↓</th><th>cpWER↓</th><th>DER↓</th><th>∆cp↓</th>
<th>WER↓</th><th>cpWER↓</th><th>DER↓</th><th>∆cp↓</th>
<th>WER↓</th><th>cpWER↓</th><th>DER↓</th><th>∆cp↓</th>
<th>WER↓</th><th>cpWER↓</th><th>DER↓</th><th>∆cp↓</th>
</tr>
</thead>
<tbody>
<tr>
<td>Gemini-2.5-Pro</td>
<td>22.47</td><td>44.13</td><td>74.05</td><td>21.66</td>
<td>18.28</td><td>40.97</td><td>69.35</td><td>22.69</td>
<td>55.40</td><td>68.82</td><td>100.95</td><td>13.42</td>
<td>19.81</td><td>25.11</td><td>36.07</td><td>5.30</td>
<td>30.16</td><td>39.29</td><td>56.39</td><td>9.13</td>
<td>31.66</td><td>39.98</td><td>50.28</td><td>8.32</td>
<td>36.87</td><td>41.88</td><td>42.33</td><td>5.01</td>
<td>26.73</td><td>32.19</td><td>46.19</td><td>5.46</td>
</tr>
<tr>
<td>VibeVoice-ASR</td>
<td>16.45</td><td>58.60</td><td>47.18</td><td>42.15</td>
<td>17.70</td><td>62.06</td><td>47.65</td><td>44.36</td>
<td>7.11</td><td>32.65</td><td>44.62</td><td>25.54</td>
<td>22.19</td><td>26.16</td><td>8.94</td><td>3.97</td>
<td>34.31</td><td>39.92</td><td>19.62</td><td>5.61</td>
<td>30.53</td><td>35.86</td><td>21.00</td><td>5.33</td>
<td>10.30</td><td>13.45</td><td>6.27</td><td>3.15</td>
<td><b>7.97</b></td><td><b>11.38</b></td><td><b>3.14</b></td><td>3.41</td>
</tr>
<tr>
<td><b>Speaker-Reasoner Multi-turn</b></td>
<td><b>6.27</b></td><td><b>24.43</b></td><td><b>15.33</b></td><td><b>18.16</b></td>
<td><b>6.50</b></td><td><b>25.81</b></td><td><b>16.68</b></td><td><b>19.31</b></td>
<td><b>4.42</b></td><td><b>16.31</b></td><td><b>7.58</b></td><td><b>11.89</b></td>
<td><b>7.13</b></td><td><b>8.14</b></td><td><b>3.38</b></td><td><b>1.01</b></td>
<td><b>19.72</b></td><td><b>19.92</b></td><td><b>6.70</b></td><td><b>0.20</b></td>
<td><b>23.29</b></td><td><b>25.16</b></td><td><b>13.56</b></td><td><b>1.87</b></td>
<td><b>9.17</b></td><td><b>11.74</b></td><td><b>4.76</b></td><td><b>2.57</b></td>
<td>8.54</td><td>11.76</td><td>4.35</td><td><b>3.22</b></td>
</tr>
</tbody>
</table>
</div>
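
The ∆cp column is not defined in this README. The reported values are consistent with ∆cp = cpWER − WER, that is, the extra error introduced by speaker attribution on top of plain transcription. The short check below assumes that definition and verifies it against a few Speaker-Reasoner rows copied from the table; `delta_cp` is an illustrative helper, not part of the repository.

```python
# (test set, WER, cpWER, reported ∆cp) for Speaker-Reasoner Multi-turn,
# copied from the table above.
rows = [
    ("Video-Internal-Eval", 6.27, 24.43, 18.16),
    ("AISHELL4-Eval", 7.13, 8.14, 1.01),
    ("Alimeeting-Far", 19.72, 19.92, 0.20),
    ("AMI-SDM", 23.29, 25.16, 1.87),
]


def delta_cp(wer: float, cpwer: float) -> float:
    """Assumed definition: speaker-attributed error minus speaker-agnostic error."""
    return cpwer - wer


# True for each row when the recomputed gap matches the reported one to 2 decimals.
matches = [abs(delta_cp(wer, cpwer) - reported) < 0.005 for _, wer, cpwer, reported in rows]

for (name, wer, cpwer, reported), ok in zip(rows, matches):
    print(f"{name}: cpWER - WER = {delta_cp(wer, cpwer):.2f}, reported {reported:.2f}, match={ok}")
```

A small ∆cp means most of the remaining error comes from recognition itself rather than from assigning words to the wrong speaker.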
### Segmented Evaluation (40–50s segments)
<div style="overflow-x: auto;">
<table style="white-space: nowrap;">
<thead>
<tr>
<th rowspan="2">Model</th>
<th colspan="4" align="center">AISHELL4-Eval</th>
<th colspan="4" align="center">Alimeeting-Far</th>
</tr>
<tr>
<th>DER↓</th><th>CER↓</th><th>cpCER↓</th><th>∆cp↓</th>
<th>DER↓</th><th>CER↓</th><th>cpCER↓</th><th>∆cp↓</th>
</tr>
</thead>
<tbody>
<tr><td colspan="9"><b>Cascade Baselines</b></td></tr>
<tr><td>Pyannote3.1 + Paraformer</td><td>8.10</td><td>19.18</td><td>26.24</td><td>7.06</td><td>19.13</td><td>30.15</td><td>45.39</td><td>15.24</td></tr>
<tr><td colspan="9"><b>End-to-End Baselines</b></td></tr>
<tr><td>Gemini-2.5-Pro†</td><td>36.07</td><td>19.81</td><td>25.11</td><td>5.30</td><td>56.39</td><td>30.16</td><td>39.29</td><td>9.13</td></tr>
<tr><td>Qwen3-Omni-30B-A3B-Instruct</td><td>32.42</td><td>14.46</td><td>22.22</td><td>7.76</td><td>37.15</td><td>25.40</td><td>36.28</td><td>10.88</td></tr>
<tr><td>Qwen2.5-Omni-7B</td><td>85.68</td><td>33.37</td><td>60.45</td><td>27.08</td><td>91.77</td><td>38.13</td><td>73.38</td><td>35.25</td></tr>
<tr><td>SpeakerLM (212.25h)</td><td>–</td><td>17.75</td><td>26.14</td><td>8.39</td><td>–</td><td>18.63</td><td>32.22</td><td>13.59</td></tr>
<tr><td>SpeakerLM (7638.95h)</td><td>–</td><td>17.17</td><td>18.37</td><td>1.20</td><td>–</td><td>13.97</td><td>16.05</td><td>2.08</td></tr>
<tr><td>VibeVoice-ASR</td><td>10.88</td><td>22.30</td><td>26.30</td><td>4.00</td><td>20.70</td><td>34.67</td><td>40.54</td><td>5.87</td></tr>
<tr><td>TagSpeech-Alimeeting</td><td>37.51</td><td>35.70</td><td>53.44</td><td>17.74</td><td>52.46</td><td>47.11</td><td>68.74</td><td>21.63</td></tr>
<tr><td colspan="9"><b>Ours</b></td></tr>
<tr><td>Qwen3-Omni + SOT sft (Stage 1)</td><td>–</td><td>17.65</td><td>19.59</td><td>1.94</td><td>–</td><td>24.24</td><td>26.03</td><td>1.79</td></tr>
<tr><td>Speaker-Reasoner Base (Stage 1)</td><td>6.24</td><td>14.04</td><td>16.54</td><td>2.50</td><td>8.96</td><td>21.16</td><td>22.64</td><td>1.48</td></tr>
<tr><td>Speaker-Reasoner Multi-turn (Stage 2)</td><td>5.19</td><td>13.83</td><td>14.93</td><td>1.10</td><td>7.47</td><td>20.34</td><td>20.29</td><td>−0.05</td></tr>
<tr><td><b>Speaker-Reasoner Multi-turn w/ SAC (Stage 3)</b></td><td><b>5.26</b></td><td><b>13.83</b></td><td><b>14.73</b></td><td><b>0.90</b></td><td><b>7.34</b></td><td><b>20.57</b></td><td><b>20.43</b></td><td><b>−0.14</b></td></tr>
<tr><td>Speaker-Reasoner Base 7B</td><td>12.00</td><td>15.65</td><td>25.60</td><td>9.95</td><td>18.43</td><td>24.97</td><td>38.12</td><td>13.15</td></tr>
<tr><td>Speaker-Reasoner Multi-turn 7B</td><td>9.38</td><td>15.31</td><td>22.91</td><td>7.60</td><td>15.56</td><td>24.33</td><td>34.81</td><td>10.48</td></tr>
</tbody>
</table>
</div>
† Closed-source model. DER unavailable for SpeakerLM and SOT-based models due to incompatible output formats.
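
For readers unfamiliar with cpCER (concatenated minimum-permutation character error rate), here is a minimal sketch of the standard definition: concatenate each speaker's text, try every mapping between hypothesis and reference speakers, and keep the mapping with the fewest character errors. This is an illustration only. The scoring scripts and text normalization behind the numbers above are not described in this README, and the brute-force permutation search is practical only for a handful of speakers.

```python
from itertools import permutations


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def cp_cer(ref: dict[str, str], hyp: dict[str, str]) -> float:
    """ref/hyp map a speaker label to that speaker's concatenated transcript."""
    refs = [list(text) for text in ref.values()]
    hyps = [list(text) for text in hyp.values()]
    # Pad the shorter side with empty streams so every speaker gets a partner.
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))
    hyps += [[]] * (n - len(hyps))
    # Speaker labels are arbitrary, so score the best label assignment.
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return 100.0 * best / sum(len(r) for r in refs)


ref = {"A": "你好吗", "B": "我很好"}
# Labels differ from the reference and one character is wrong.
swapped = cp_cer(ref, {"spk1": "我很好", "spk2": "你好啊"})
# Every character is correct, but all text is attributed to a single speaker.
merged = cp_cer(ref, {"spk1": "你好吗我很好"})
print(f"swapped labels: {swapped:.2f}%  merged speakers: {merged:.2f}%")
```

The second example shows why cpCER can sit far above CER: the transcript is perfect character by character, yet attributing it all to one speaker is penalized heavily.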
### Long-form Evaluation (without segmentation)
<div style="overflow-x: auto;">
<table style="white-space: nowrap;">
<thead>
<tr>
<th>Model</th>
<th>AISHELL4-Eval DER↓</th>
<th>AISHELL4-Eval cpCER↓</th>
</tr>
</thead>
<tbody>
<tr><td>Gemini-2.5-Pro</td><td>15.32</td><td>31.59</td></tr>
<tr><td>Speaker-Reasoner Multi-turn w/ SAC</td><td>21.60</td><td>36.20</td></tr>
</tbody>
</table>
</div>
### Speaker Attribute Evaluation (AISHELL4-Eval)
<div style="overflow-x: auto;">
<table style="white-space: nowrap;">
<thead>
<tr>
<th>Model</th>
<th>Gender ACC↑</th>
<th>Speaker Count ACC (SCA)↑</th>
</tr>
</thead>
<tbody>
<tr><td>Gemini-2.5-Pro</td><td>94.80</td><td>67.03</td></tr>
<tr><td>Qwen3-Omni-30B-A3B-Instruct</td><td><b>97.12</b></td><td>60.49</td></tr>
<tr><td>Speaker-Reasoner Multi-turn</td><td>96.80</td><td><b>69.03</b></td></tr>
</tbody>
</table>
</div>
## Installation
### Environment Setup
```bash
git clone https://github.com/ASLP-lab/Speaker-Reasoner.git
cd Speaker-Reasoner
conda create -n speaker-reasoner python=3.10 -y
conda activate speaker-reasoner
```
Install MS-Swift and dependencies:
```bash
pip install ms-swift
```
## Model Download
Pre-trained model weights are available on Hugging Face. Choose the version that matches your use case:
| Model Version | Description | Language | Download |
| :--- | :--- | :---: | :---: |
| **Speaker-Reasoner** | The standard multi-turn model evaluated in the main paper. | ZH | [🤗 Hugging Face](https://huggingface.co/ASLP-lab/Speaker-Reasoner) |
| **Speaker-Reasoner-4194h** | Scaled-up version trained on 4,194 hours of multi-domain data. | ZH/EN | [🤗 Hugging Face](https://huggingface.co/ASLP-lab/Speaker-Reasoner-4194h) |
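
As one possible way to fetch the weights programmatically, the sketch below uses `snapshot_download` from `huggingface_hub`. The repository IDs are taken from the table above; the variant keys (`"zh"`, `"zh-en"`) and the `download_model` helper are illustrative names, not part of this project.

```python
from typing import Optional

# Repository IDs copied from the table above; the dictionary keys are made up here.
MODEL_REPOS = {
    "zh": "ASLP-lab/Speaker-Reasoner",
    "zh-en": "ASLP-lab/Speaker-Reasoner-4194h",
}


def download_model(variant: str = "zh-en", local_dir: Optional[str] = None) -> str:
    """Download one checkpoint and return the local directory that holds it."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=MODEL_REPOS[variant], local_dir=local_dir)
```

Calling `download_model("zh-en", local_dir="./Speaker-Reasoner-4194h")` would place the files in that directory; with `local_dir=None` they go to the default Hugging Face cache.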
## Training
Coming soon.
## Inference
### vLLM
Speaker-Reasoner is built on top of [Qwen3-Omni-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct). To run it, you will need to install a custom branch of vLLM from source.
```bash
git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install -e . -v --no-build-isolation
# If you hit an "undefined symbol" error with VLLM_USE_PRECOMPILED=1, build from source instead: pip install -e . -v
# Install Transformers from source
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation
```
> For more details on compiling vLLM from source, refer to the [vLLM official documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#set-up-using-python-only-build-without-compilation).
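
After installation, a quick sanity check can confirm that the packages from the commands above are visible to the active environment. The sketch below only checks that each module can be located; it does not import them or verify versions. The pip-to-module name mapping is an assumption based on the usual import names of these packages.

```python
from importlib.util import find_spec

# pip package name -> import name (assumed to be the usual ones).
REQUIRED = {
    "vllm": "vllm",
    "transformers": "transformers",
    "accelerate": "accelerate",
    "qwen-omni-utils": "qwen_omni_utils",
    "flash-attn": "flash_attn",
}


def missing_packages(required: dict[str, str] = REQUIRED) -> list[str]:
    """Return the pip names whose module cannot be found in this environment."""
    return [pip for pip, module in required.items() if find_spec(module) is None]


missing = missing_packages()
print("all packages found" if not missing else f"missing: {', '.join(missing)}")
```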
## Citation
If you find this work useful, please cite:
```bibtex
@article{lin2026speakerreasoner,
title={Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR},
author={Zhennan Lin and Shuai Wang and Zhaokai Sun and Pengyuan Xie and Chuan Xie and Jie Liu and Qiang Zhang and Lei Xie},
year={2026},
eprint={2604.03074},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2604.03074},
}
```
## License
The code in this repository is released under the **Apache 2.0 License**.
## Contact
- **Issues**: Please open a GitHub Issue for bug reports or suggestions.
- **Email**: znlin@mail.nwpu.edu.cn, lxie@nwpu.edu.cn
<p align="center">
<a href="http://www.nwpu-aslp.org/">
<img src="figs/aslp.png" width="400"/>
</a>
</p>