Update README.md

08d1715 verified 26 days ago

13.1 kB

	<h1 align="center">Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR</h1>

	<div align="center">

	<div style="text-align: center;">
	<img src="https://img.shields.io/badge/Python-3.10+-blue" alt="Python">
	<img src="https://img.shields.io/badge/License-Apache%202.0-green" alt="License">
	<a href="https://arxiv.org/abs/2604.03074">
	<img src="https://img.shields.io/badge/arXiv-paper-red" alt="arXiv Paper">
	</a>
	<a href="https://huggingface.co/collections/ASLP-lab/speaker-reasoner">
	<img src="https://img.shields.io/badge/HuggingFace-Models-ffd21e" alt="HuggingFace">
	</a>
	<a href="https://github.com/ASLP-lab/Speaker-Reasoner">
	<img src="https://img.shields.io/badge/GitHub-repo-black" alt="GitHub">
	</a>
	<a href="http://www.npu-aslp.org/">
	<img src="https://img.shields.io/badge/🏫-ASLP-grey?labelColor=lightgrey" alt="lab">
	</a>
	</div>

	</div>

	<div align="center">
	<h3>
	Zhennan Lin<sup>1</sup>, Shuai Wang<sup>2</sup>, Zhaokai Sun<sup>1</sup>, Pengyuan Xie<sup>3</sup>, Chuan Xie<sup>3</sup>, Jie Liu<sup>3</sup>, Qiang Zhang<sup>3</sup>, Lei Xie<sup>1†</sup>
	</h3>

	<p>
	<sup>†</sup>Corresponding author
	</p>

	<p>
	<sup>1</sup>Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University<br>
	<sup>2</sup>School of Intelligence Science and Technology, Nanjing University<br>
	<sup>3</sup>Shanghai Lingguang Zhaxian Technology
	</p>
	</div>

	----

	Speaker-Reasoner is an end-to-end Speech LLM for timestamped speaker-attributed ASR featuring agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window.

	![](figs/speaker_reasoner.png)

	## 🌟 Highlights

	- Agentic multi-turn reasoning: iterative global-to-local inference along the temporal axis — global speaker summary → boundary prediction → fine-grained segment decoding
	- Speaker-aware context cache: extends processing to long-form audio beyond the training context window while preserving speaker consistency across chunks
	- Three-stage progressive training: multi-task foundation → temporal interaction learning → cache-conditioned decoding
	- State-of-the-art performance: outperforms strong baselines including closed-source Gemini-2.5-Pro on AliMeeting and AISHELL-4
	- 🔥 Bilingual & Scaled up: extended training on 4,194 hours of multi-domain data, natively supporting English and Mandarin across complex multi-speaker scenarios

	## 📊 Results

	### Comprehensive Multi-Domain Evaluation

	<p>We further scaled up Speaker-Reasoner with 4,194 hours of bilingual (ZH/EN) training data. The model demonstrates superior performance across diverse scenarios, including challenging video domains and various public meeting datasets.</p>

	<div style="overflow-x: auto;">
	<table style="white-space: nowrap;">
	<thead>
	<tr>
	<th rowspan="2">Model</th>
	<th colspan="4" align="center">Video-Internal-Eval</th>
	<th colspan="4" align="center">Video-Internal-Eval-zh</th>
	<th colspan="4" align="center">Video-Internal-Eval-en</th>
	<th colspan="4" align="center">AISHELL4-Eval</th>
	<th colspan="4" align="center">Alimeeting-Far</th>
	<th colspan="4" align="center">AMI-SDM</th>
	<th colspan="4" align="center">MLC-SLM-Eval-1</th>
	<th colspan="4" align="center">MLC-SLM-Eval-2</th>
	</tr>
	<tr>
	<th>WER↓</th><th>cpWER↓</th><th>DER↓</th><th>∆cp↓</th>
	<th>WER↓</th><th>cpWER↓</th><th>DER↓</th><th>∆cp↓</th>
	<th>WER↓</th><th>cpWER↓</th><th>DER↓</th><th>∆cp↓</th>
	<th>WER↓</th><th>cpWER↓</th><th>DER↓</th><th>∆cp↓</th>
	<th>WER↓</th><th>cpWER↓</th><th>DER↓</th><th>∆cp↓</th>
	<th>WER↓</th><th>cpWER↓</th><th>DER↓</th><th>∆cp↓</th>
	<th>WER↓</th><th>cpWER↓</th><th>DER↓</th><th>∆cp↓</th>
	<th>WER↓</th><th>cpWER↓</th><th>DER↓</th><th>∆cp↓</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td>Gemini-2.5-Pro</td>
	<td>22.47</td><td>44.13</td><td>74.05</td><td>21.66</td>
	<td>18.28</td><td>40.97</td><td>69.35</td><td>22.69</td>
	<td>55.40</td><td>68.82</td><td>100.95</td><td>13.42</td>
	<td>19.81</td><td>25.11</td><td>36.07</td><td>5.30</td>
	<td>30.16</td><td>39.29</td><td>56.39</td><td>9.13</td>
	<td>31.66</td><td>39.98</td><td>50.28</td><td>8.32</td>
	<td>36.87</td><td>41.88</td><td>42.33</td><td>5.01</td>
	<td>26.73</td><td>32.19</td><td>46.19</td><td>5.46</td>
	</tr>
	<tr>
	<td>VibeVoice-ASR</td>
	<td>16.45</td><td>58.60</td><td>47.18</td><td>42.15</td>
	<td>17.70</td><td>62.06</td><td>47.65</td><td>44.36</td>
	<td>7.11</td><td>32.65</td><td>44.62</td><td>25.54</td>
	<td>22.19</td><td>26.16</td><td>8.94</td><td>3.97</td>
	<td>34.31</td><td>39.92</td><td>19.62</td><td>5.61</td>
	<td>30.53</td><td>35.86</td><td>21.00</td><td>5.33</td>
	<td>10.30</td><td>13.45</td><td>6.27</td><td>3.15</td>
	<td><b>7.97</b></td><td><b>11.38</b></td><td><b>3.14</b></td><td>3.41</td>
	</tr>
	<tr>
	<td><b>Speaker-Reasoner Multi-turn</b></td>
	<td><b>6.27</b></td><td><b>24.43</b></td><td><b>15.33</b></td><td><b>18.16</b></td>
	<td><b>6.50</b></td><td><b>25.81</b></td><td><b>16.68</b></td><td><b>19.31</b></td>
	<td><b>4.42</b></td><td><b>16.31</b></td><td><b>7.58</b></td><td><b>11.89</b></td>
	<td><b>7.13</b></td><td><b>8.14</b></td><td><b>3.38</b></td><td><b>1.01</b></td>
	<td><b>19.72</b></td><td><b>19.92</b></td><td><b>6.70</b></td><td><b>0.20</b></td>
	<td><b>23.29</b></td><td><b>25.16</b></td><td><b>13.56</b></td><td><b>1.87</b></td>
	<td><b>9.17</b></td><td><b>11.74</b></td><td><b>4.76</b></td><td><b>2.57</b></td>
	<td>8.54</td><td>11.76</td><td>4.35</td><td><b>3.22</b></td>
	</tr>
	</tbody>
	</table>
	</div>

	### Segmented Evaluation (40–50s segments)

	<div style="overflow-x: auto;">
	<table style="white-space: nowrap;">
	<thead>
	<tr>
	<th rowspan="2">Model</th>
	<th colspan="4" align="center">AISHELL4-Eval</th>
	<th colspan="4" align="center">Alimeeting-Far</th>
	</tr>
	<tr>
	<th>DER↓</th><th>CER↓</th><th>cpCER↓</th><th>∆cp↓</th>
	<th>DER↓</th><th>CER↓</th><th>cpCER↓</th><th>∆cp↓</th>
	</tr>
	</thead>
	<tbody>
	<tr><td colspan="9"><b>Cascade Baselines</b></td></tr>
	<tr><td>Pyannote3.1 + Paraformer</td><td>8.10</td><td>19.18</td><td>26.24</td><td>7.06</td><td>19.13</td><td>30.15</td><td>45.39</td><td>15.24</td></tr>
	<tr><td colspan="9"><b>End-to-End Baselines</b></td></tr>
	<tr><td>Gemini-2.5-Pro†</td><td>36.07</td><td>19.81</td><td>25.11</td><td>5.30</td><td>56.39</td><td>30.16</td><td>39.29</td><td>9.13</td></tr>
	<tr><td>Qwen3-Omni-30B-A3B-Instruct</td><td>32.42</td><td>14.46</td><td>22.22</td><td>7.76</td><td>37.15</td><td>25.40</td><td>36.28</td><td>10.88</td></tr>
	<tr><td>Qwen2.5-Omni-7B</td><td>85.68</td><td>33.37</td><td>60.45</td><td>27.08</td><td>91.77</td><td>38.13</td><td>73.38</td><td>35.25</td></tr>
	<tr><td>SpeakerLM (212.25h)</td><td>–</td><td>17.75</td><td>26.14</td><td>8.39</td><td>–</td><td>18.63</td><td>32.22</td><td>13.59</td></tr>
	<tr><td>SpeakerLM (7638.95h)</td><td>–</td><td>17.17</td><td>18.37</td><td>1.20</td><td>–</td><td>13.97</td><td>16.05</td><td>2.08</td></tr>
	<tr><td>VibeVoice-ASR</td><td>10.88</td><td>22.30</td><td>26.30</td><td>4.00</td><td>20.70</td><td>34.67</td><td>40.54</td><td>5.87</td></tr>
	<tr><td>TagSpeech-Alimeeting</td><td>37.51</td><td>35.70</td><td>53.44</td><td>17.74</td><td>52.46</td><td>47.11</td><td>68.74</td><td>21.63</td></tr>
	<tr><td colspan="9"><b>Ours</b></td></tr>
	<tr><td>Qwen3-Omni + SOT sft (Stage 1)</td><td>–</td><td>17.65</td><td>19.59</td><td>1.94</td><td>–</td><td>24.24</td><td>26.03</td><td>1.79</td></tr>
	<tr><td>Speaker-Reasoner Base (Stage 1)</td><td>6.24</td><td>14.04</td><td>16.54</td><td>2.50</td><td>8.96</td><td>21.16</td><td>22.64</td><td>1.48</td></tr>
	<tr><td>Speaker-Reasoner Multi-turn (Stage 2)</td><td>5.19</td><td>13.83</td><td>14.93</td><td>1.10</td><td>7.47</td><td>20.34</td><td>20.29</td><td>−0.05</td></tr>
	<tr><td><b>Speaker-Reasoner Multi-turn w/ SAC (Stage 3)</b></td><td><b>5.26</b></td><td><b>13.83</b></td><td><b>14.73</b></td><td><b>0.90</b></td><td><b>7.34</b></td><td><b>20.57</b></td><td><b>20.43</b></td><td><b>−0.14</b></td></tr>
	<tr><td>Speaker-Reasoner Base 7B</td><td>12.00</td><td>15.65</td><td>25.60</td><td>9.95</td><td>18.43</td><td>24.97</td><td>38.12</td><td>13.15</td></tr>
	<tr><td>Speaker-Reasoner Multi-turn 7B</td><td>9.38</td><td>15.31</td><td>22.91</td><td>7.60</td><td>15.56</td><td>24.33</td><td>34.81</td><td>10.48</td></tr>
	</tbody>
	</table>
	</div>

	† Closed-source model. DER unavailable for SpeakerLM and SOT-based models due to incompatible output formats.

	### Long-form Evaluation (without segmentation)

	<div style="overflow-x: auto;">
	<table style="white-space: nowrap;">
	<thead>
	<tr>
	<th>Model</th>
	<th>AISHELL4-Eval DER↓</th>
	<th>AISHELL4-Eval cpCER↓</th>
	</tr>
	</thead>
	<tbody>
	<tr><td>Gemini-2.5-Pro</td><td>15.32</td><td>31.59</td></tr>
	<tr><td>Speaker-Reasoner Multi-turn w/ SAC</td><td>21.60</td><td>36.20</td></tr>
	</tbody>
	</table>
	</div>

	### Speaker Attribute Evaluation (AISHELL4-Eval)

	<div style="overflow-x: auto;">
	<table style="white-space: nowrap;">
	<thead>
	<tr>
	<th>Model</th>
	<th>Gender ACC↑</th>
	<th>Speaker Count ACC (SCA)↑</th>
	</tr>
	</thead>
	<tbody>
	<tr><td>Gemini-2.5-Pro</td><td>94.80</td><td>67.03</td></tr>
	<tr><td>Qwen3-Omni-30B-A3B-Instruct</td><td>97.12</td><td>60.49</td></tr>
	<tr><td>Speaker-Reasoner Multi-turn</td><td><b>96.80</b></td><td><b>69.03</b></td></tr>
	</tbody>
	</table>
	</div>

	## Installation

	### Environment Setup

	```bash
	git clone https://github.com/ASLP-lab/Speaker-Reasoner.git
	cd Speaker-Reasoner

	conda create -n speaker-reasoner python=3.10 -y
	conda activate speaker-reasoner
	```

	Install MS-Swift and dependencies:

	```bash
	pip install ms-swift
	```

	## Model Download

	We provide the pre-trained model weights on Hugging Face. You can download the corresponding versions based on your requirements:

	\| Model Version \| Description \| Language \| Download \|
	\| :--- \| :--- \| :---: \| :---: \|
	\| Speaker-Reasoner \| The standard multi-turn model evaluated in the main paper. \| ZH \| [🤗 Hugging Face](https://huggingface.co/ASLP-lab/Speaker-Reasoner) \|
	\| Speaker-Reasoner-4194h \| Scaled-up version trained on 4,194 hours of multi-domain data. \| ZH/EN \| [🤗 Hugging Face](https://huggingface.co/ASLP-lab/Speaker-Reasoner-4194h) \|

	## Training

	Coming soon.

	## Inference

	### vLLM

	Speaker-Reasoner is built on top of [Qwen3-Omni-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct). To run it, you will need to install a custom branch of vLLM from source.

	```bash
	git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
	cd vllm
	pip install -r requirements/build.txt
	pip install -r requirements/cuda.txt
	export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
	VLLM_USE_PRECOMPILED=1 pip install -e . -v --no-build-isolation
	# If you meet an "Undefined symbol" error while using VLLM_USE_PRECOMPILED=1, please use "pip install -e . -v" to build from source.
	# Install the Transformers
	pip install git+https://github.com/huggingface/transformers
	pip install accelerate
	pip install qwen-omni-utils -U
	pip install -U flash-attn --no-build-isolation
	```

	> For more details on compiling vLLM from source, refer to the [vLLM official documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#set-up-using-python-only-build-without-compilation).

	## Citation

	If you find this work useful, please cite:

	```bibtex
	@article{lin2026speakerreasoner,
	title={Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR},
	author={Zhennan Lin and Shuai Wang and Zhaokai Sun and Pengyuan Xie and Chuan Xie and Jie Liu and Qiang Zhang and Lei Xie},
	year={2026},
	eprint={2604.03074},
	archivePrefix={arXiv},
	primaryClass={eess.AS},
	url={https://arxiv.org/abs/2604.03074},
	}
	```

	## License

	The code in this repository is released under the Apache 2.0 License.

	## Contact

	- Issues: Please open a GitHub Issue for bug reports or suggestions.
	- Email: znlin@mail.nwpu.edu.cn, lxie@nwpu.edu.cn

	<p align="center">
	<a href="http://www.nwpu-aslp.org/">
	<img src="figs/aslp.png" width="400"/>
	</a>
	</p>