# Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
Zhennan Lin<sup>1</sup>, Shuai Wang<sup>2</sup>, Zhaokai Sun<sup>1</sup>, Pengyuan Xie<sup>3</sup>, Chuan Xie<sup>3</sup>, Jie Liu<sup>3</sup>, Qiang Zhang<sup>3</sup>, Lei Xie<sup>1†</sup>

<sup>†</sup>Corresponding author

<sup>1</sup>Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University

<sup>2</sup>School of Intelligence Science and Technology, Nanjing University

<sup>3</sup>Shanghai Lingguang Zhaxian Technology

----
Speaker-Reasoner is an end-to-end Speech LLM for **timestamped speaker-attributed ASR** featuring agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window.
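
The global-to-local flow described above can be pictured as a small control loop. The sketch below is illustrative only: `StubModel`, its three methods, and the `Segment` fields are hypothetical names chosen for this example, not the repository's API. The stub returns canned values so the control flow can be run without audio or model weights.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # segment start, in seconds
    end: float     # segment end, in seconds
    speaker: str
    gender: str
    text: str


class StubModel:
    """Hypothetical stand-in for the Speech LLM; every answer is canned."""

    def summarize_speakers(self, audio):
        # Turn 1: global pass over the whole recording.
        return {"spk1": "female", "spk2": "male"}

    def predict_boundaries(self, audio, speakers):
        # Turn 2: the model proposes the temporal boundaries itself.
        return [(0.0, 4.2), (4.2, 9.8)]

    def decode_segment(self, audio, span, speakers):
        # Later turns: fine-grained analysis of a single time span.
        speaker = "spk1" if span[0] < 4.0 else "spk2"
        return Segment(span[0], span[1], speaker, speakers[speaker], "<transcript>")


def multi_turn_transcribe(model, audio):
    """Run the three reasoning steps in order and collect the segments."""
    speakers = model.summarize_speakers(audio)
    spans = model.predict_boundaries(audio, speakers)
    return [model.decode_segment(audio, span, speakers) for span in spans]


segments = multi_turn_transcribe(StubModel(), audio=None)
for seg in segments:
    print(f"[{seg.start:.1f}-{seg.end:.1f}] {seg.speaker} ({seg.gender}): {seg.text}")
```

In the real system each step is a turn in one conversation with the model rather than a separate Python call; the sketch only shows the order in which information is produced and reused.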

## 🌟 Highlights
- **Agentic multi-turn reasoning**: iterative global-to-local inference along the temporal axis — global speaker summary → boundary prediction → fine-grained segment decoding
- **Speaker-aware context cache**: extends processing to long-form audio beyond the training context window while preserving speaker consistency across chunks
- **Three-stage progressive training**: multi-task foundation → temporal interaction learning → cache-conditioned decoding
- **State-of-the-art performance**: outperforms strong baselines, including the closed-source Gemini-2.5-Pro, on the segmented AliMeeting and AISHELL-4 evaluations
- 🔥 **Bilingual & Scaled up**: extended training on 4,194 hours of multi-domain data, natively supporting English and Mandarin across complex multi-speaker scenarios
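
As a rough illustration of the speaker-aware cache idea, the sketch below splits a long recording into windows and keeps speaker labels stable across them. Everything here is an assumption made for the example: `chunk_spans`, `SpeakerCache`, `decode_long_audio`, and the `voice_key` strings are hypothetical, and a dictionary lookup stands in for whatever context the actual model conditions on. The 45-second window is picked only because it falls in the 40–50s range used in the segmented evaluation.

```python
def chunk_spans(duration, window):
    """Split [0, duration) into consecutive windows of at most `window` seconds."""
    spans = []
    start = 0.0
    while start < duration:
        end = min(start + window, duration)
        spans.append((start, end))
        start = end
    return spans


class SpeakerCache:
    """Hypothetical cache mapping a per-speaker key to a stable global label."""

    def __init__(self):
        self._labels = {}

    def resolve(self, voice_key):
        # Reuse the label from an earlier chunk, or allocate the next one.
        if voice_key not in self._labels:
            self._labels[voice_key] = f"spk{len(self._labels) + 1}"
        return self._labels[voice_key]


def decode_long_audio(chunk_outputs, cache):
    """Relabel per-chunk (voice_key, text) pairs with cache-consistent speaker IDs."""
    results = []
    for chunk in chunk_outputs:
        for voice_key, text in chunk:
            results.append((cache.resolve(voice_key), text))
    return results


spans = chunk_spans(100.0, 45.0)
# Toy per-chunk outputs; "voice_b" appears in both chunks and must keep its label.
chunk_outputs = [
    [("voice_a", "hello"), ("voice_b", "hi")],
    [("voice_b", "again"), ("voice_c", "new")],
]
labelled = decode_long_audio(chunk_outputs, SpeakerCache())
print(spans)
print(labelled)
```

The point of the design is that the second chunk never renumbers a speaker already seen in the first, which is what keeps speaker identity consistent beyond a single context window.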
## 📊 Results
### Comprehensive Multi-Domain Evaluation
We further scaled up Speaker-Reasoner with 4,194 hours of bilingual (ZH/EN) training data. In the table below, the scaled-up model achieves the lowest cpWER on seven of the eight test sets, covering challenging video domains and public meeting datasets; VibeVoice-ASR is marginally better on MLC-SLM-Eval-2 (11.38 vs. 11.76).
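| Test set | Model | WER↓ | cpWER↓ | DER↓ | ∆cp↓ |
| :--- | :--- | :---: | :---: | :---: | :---: |
| Video-Internal-Eval | Gemini-2.5-Pro | 22.47 | 44.13 | 74.05 | 21.66 |
| Video-Internal-Eval | VibeVoice-ASR | 16.45 | 58.60 | 47.18 | 42.15 |
| Video-Internal-Eval | Speaker-Reasoner Multi-turn | 6.27 | 24.43 | 15.33 | 18.16 |
| Video-Internal-Eval-zh | Gemini-2.5-Pro | 18.28 | 40.97 | 69.35 | 22.69 |
| Video-Internal-Eval-zh | VibeVoice-ASR | 17.70 | 62.06 | 47.65 | 44.36 |
| Video-Internal-Eval-zh | Speaker-Reasoner Multi-turn | 6.50 | 25.81 | 16.68 | 19.31 |
| Video-Internal-Eval-en | Gemini-2.5-Pro | 55.40 | 68.82 | 100.95 | 13.42 |
| Video-Internal-Eval-en | VibeVoice-ASR | 7.11 | 32.65 | 44.62 | 25.54 |
| Video-Internal-Eval-en | Speaker-Reasoner Multi-turn | 4.42 | 16.31 | 7.58 | 11.89 |
| AISHELL4-Eval | Gemini-2.5-Pro | 19.81 | 25.11 | 36.07 | 5.30 |
| AISHELL4-Eval | VibeVoice-ASR | 22.19 | 26.16 | 8.94 | 3.97 |
| AISHELL4-Eval | Speaker-Reasoner Multi-turn | 7.13 | 8.14 | 3.38 | 1.01 |
| Alimeeting-Far | Gemini-2.5-Pro | 30.16 | 39.29 | 56.39 | 9.13 |
| Alimeeting-Far | VibeVoice-ASR | 34.31 | 39.92 | 19.62 | 5.61 |
| Alimeeting-Far | Speaker-Reasoner Multi-turn | 19.72 | 19.92 | 6.70 | 0.20 |
| AMI-SDM | Gemini-2.5-Pro | 31.66 | 39.98 | 50.28 | 8.32 |
| AMI-SDM | VibeVoice-ASR | 30.53 | 35.86 | 21.00 | 5.33 |
| AMI-SDM | Speaker-Reasoner Multi-turn | 23.29 | 25.16 | 13.56 | 1.87 |
| MLC-SLM-Eval-1 | Gemini-2.5-Pro | 36.87 | 41.88 | 42.33 | 5.01 |
| MLC-SLM-Eval-1 | VibeVoice-ASR | 10.30 | 13.45 | 6.27 | 3.15 |
| MLC-SLM-Eval-1 | Speaker-Reasoner Multi-turn | 9.17 | 11.74 | 4.76 | 2.57 |
| MLC-SLM-Eval-2 | Gemini-2.5-Pro | 26.73 | 32.19 | 46.19 | 5.46 |
| MLC-SLM-Eval-2 | VibeVoice-ASR | 7.97 | 11.38 | 3.14 | 3.41 |
| MLC-SLM-Eval-2 | Speaker-Reasoner Multi-turn | 8.54 | 11.76 | 4.35 | 3.22 |
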
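
In these results ∆cp is the gap between cpWER and WER; for Gemini-2.5-Pro on Video-Internal-Eval, 44.13 − 22.47 = 21.66. The sketch below shows how such numbers can be computed on a toy example, assuming the standard definition of cpWER (concatenate each speaker's words, score every speaker permutation, keep the minimum). It is not the repository's scoring script: the function names are made up for this example, and real scoring also involves text normalization and, for Mandarin, character-level tokens (cpCER).

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Token-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def wer(ref_words, hyp_words):
    """Speaker-agnostic error rate in percent (assumes a non-empty reference)."""
    return 100.0 * edit_distance(ref_words, hyp_words) / len(ref_words)


def cp_wer(ref_by_spk, hyp_by_spk):
    """Minimum-permutation error rate; inputs map speaker -> concatenated word list."""
    refs = list(ref_by_spk.values())
    hyps = list(hyp_by_spk.values())
    # Pad the shorter side with empty streams so speaker counts match.
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))
    hyps += [[]] * (n - len(hyps))
    total_ref = sum(len(r) for r in refs)
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return 100.0 * best / total_ref


ref = {"A": "hello how are you".split(), "B": "fine thanks".split()}
# Every word is recognized, but "you" is attributed to the wrong speaker.
hyp = {"s1": "hello how are".split(), "s2": "you fine thanks".split()}

plain = wer(ref["A"] + ref["B"], hyp["s1"] + hyp["s2"])
cp = cp_wer(ref, hyp)
print(f"WER={plain:.2f} cpWER={cp:.2f} delta_cp={cp - plain:.2f}")
# prints "WER=0.00 cpWER=33.33 delta_cp=33.33"
```

The toy case isolates what ∆cp measures: the words are all correct, so WER is zero, and the whole cpWER comes from a single speaker-attribution mistake. The permutation search is factorial in the number of speakers, which is acceptable for meetings with a handful of participants.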
### Segmented Evaluation (40–50s segments)
| Model | AISHELL4-Eval DER↓ | AISHELL4-Eval CER↓ | AISHELL4-Eval cpCER↓ | AISHELL4-Eval ∆cp↓ | Alimeeting-Far DER↓ | Alimeeting-Far CER↓ | Alimeeting-Far cpCER↓ | Alimeeting-Far ∆cp↓ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Cascade Baselines** | | | | | | | | |
| Pyannote3.1 + Paraformer | 8.10 | 19.18 | 26.24 | 7.06 | 19.13 | 30.15 | 45.39 | 15.24 |
| **End-to-End Baselines** | | | | | | | | |
| Gemini-2.5-Pro† | 36.07 | 19.81 | 25.11 | 5.30 | 56.39 | 30.16 | 39.29 | 9.13 |
| Qwen3-Omni-30B-A3B-Instruct | 32.42 | 14.46 | 22.22 | 7.76 | 37.15 | 25.40 | 36.28 | 10.88 |
| Qwen2.5-Omni-7B | 85.68 | 33.37 | 60.45 | 27.08 | 91.77 | 38.13 | 73.38 | 35.25 |
| SpeakerLM (212.25h) | – | 17.75 | 26.14 | 8.39 | – | 18.63 | 32.22 | 13.59 |
| SpeakerLM (7638.95h) | – | 17.17 | 18.37 | 1.20 | – | 13.97 | 16.05 | 2.08 |
| VibeVoice-ASR | 10.88 | 22.30 | 26.30 | 4.00 | 20.70 | 34.67 | 40.54 | 5.87 |
| TagSpeech-Alimeeting | 37.51 | 35.70 | 53.44 | 17.74 | 52.46 | 47.11 | 68.74 | 21.63 |
| **Ours** | | | | | | | | |
| Qwen3-Omni + SOT sft (Stage 1) | – | 17.65 | 19.59 | 1.94 | – | 24.24 | 26.03 | 1.79 |
| Speaker-Reasoner Base (Stage 1) | 6.24 | 14.04 | 16.54 | 2.50 | 8.96 | 21.16 | 22.64 | 1.48 |
| Speaker-Reasoner Multi-turn (Stage 2) | 5.19 | 13.83 | 14.93 | 1.10 | 7.47 | 20.34 | 20.29 | −0.05 |
| Speaker-Reasoner Multi-turn w/ SAC (Stage 3) | 5.26 | 13.83 | 14.73 | 0.90 | 7.34 | 20.57 | 20.43 | −0.14 |
| Speaker-Reasoner Base 7B | 12.00 | 15.65 | 25.60 | 9.95 | 18.43 | 24.97 | 38.12 | 13.15 |
| Speaker-Reasoner Multi-turn 7B | 9.38 | 15.31 | 22.91 | 7.60 | 15.56 | 24.33 | 34.81 | 10.48 |

† Closed-source model. DER is unavailable for SpeakerLM and SOT-based models because their output formats are incompatible.

### Long-form Evaluation (without segmentation)
| Model | AISHELL4-Eval DER↓ | AISHELL4-Eval cpCER↓ |
| :--- | :---: | :---: |
| Gemini-2.5-Pro | 15.32 | 31.59 |
| Speaker-Reasoner Multi-turn w/ SAC | 21.60 | 36.20 |

### Speaker Attribute Evaluation (AISHELL4-Eval)
| Model | Gender ACC↑ | Speaker Count ACC (SCA)↑ |
| :--- | :---: | :---: |
| Gemini-2.5-Pro | 94.80 | 67.03 |
| Qwen3-Omni-30B-A3B-Instruct | 97.12 | 60.49 |
| Speaker-Reasoner Multi-turn | 96.80 | 69.03 |

## Installation
### Environment Setup
```bash
git clone https://github.com/ASLP-lab/Speaker-Reasoner.git
cd Speaker-Reasoner
conda create -n speaker-reasoner python=3.10 -y
conda activate speaker-reasoner
```
Install MS-Swift and dependencies:
```bash
pip install ms-swift
```
## Model Download
Pre-trained model weights are available on Hugging Face. Choose the version that matches your language and domain requirements:
| Model Version | Description | Language | Download |
| :--- | :--- | :---: | :---: |
| **Speaker-Reasoner** | The standard multi-turn model evaluated in the main paper. | ZH | [🤗 Hugging Face](https://huggingface.co/ASLP-lab/Speaker-Reasoner) |
| **Speaker-Reasoner-4194h** | Scaled-up version trained on 4,194 hours of multi-domain data. | ZH/EN | [🤗 Hugging Face](https://huggingface.co/ASLP-lab/Speaker-Reasoner-4194h) |
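
The weights can also be fetched programmatically. The snippet below is a minimal sketch that assumes the `huggingface_hub` package is installed (`pip install huggingface_hub`); the `MODEL_REPOS` keys and the `download_weights` helper are names invented for this example, while the repository IDs come from the table above.

```python
MODEL_REPOS = {
    "base": "ASLP-lab/Speaker-Reasoner",         # ZH model evaluated in the main paper
    "4194h": "ASLP-lab/Speaker-Reasoner-4194h",  # scaled-up ZH/EN model
}


def download_weights(version="base", local_dir=None):
    """Fetch a full snapshot of the chosen model repo and return its local path."""
    # Imported lazily so that defining this helper does not require the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=MODEL_REPOS[version], local_dir=local_dir)
```

Calling `download_weights("4194h", local_dir="./checkpoints/Speaker-Reasoner-4194h")` would place the files in that directory (the path is only an example); with `local_dir=None` the files stay in the default Hugging Face cache.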
## Training
Coming soon.
## Inference
### vLLM
Speaker-Reasoner is built on top of [Qwen3-Omni-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct). To run it, you will need to install a custom branch of vLLM from source.
```bash
git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install -e . -v --no-build-isolation
# If you hit an "undefined symbol" error with VLLM_USE_PRECOMPILED=1, build from source instead: pip install -e . -v
# Install Transformers from source
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation
```
> For more details on compiling vLLM from source, refer to the [vLLM official documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#set-up-using-python-only-build-without-compilation).
## Citation
If you find this work useful, please cite:
```bibtex
@article{lin2026speakerreasoner,
title={Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR},
author={Zhennan Lin and Shuai Wang and Zhaokai Sun and Pengyuan Xie and Chuan Xie and Jie Liu and Qiang Zhang and Lei Xie},
year={2026},
eprint={2604.03074},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2604.03074},
}
```
## License
The code in this repository is released under the **Apache 2.0 License**.
## Contact
- **Issues**: Please open a GitHub Issue for bug reports or suggestions.
- **Email**: znlin@mail.nwpu.edu.cn, lxie@nwpu.edu.cn