# Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR


Zhennan Lin<sup>1</sup>, Shuai Wang<sup>2</sup>, Zhaokai Sun<sup>1</sup>, Pengyuan Xie<sup>3</sup>, Chuan Xie<sup>3</sup>, Jie Liu<sup>3</sup>, Qiang Zhang<sup>3</sup>, Lei Xie<sup>1†</sup>

<sup>†</sup> Corresponding author

<sup>1</sup> Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
<sup>2</sup> School of Intelligence Science and Technology, Nanjing University
<sup>3</sup> Shanghai Lingguang Zhaxian Technology

---

Speaker-Reasoner is an end-to-end Speech LLM for **timestamped speaker-attributed ASR** featuring agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window.

![](figs/speaker_reasoner.png)

## 🌟 Highlights

- **Agentic multi-turn reasoning**: iterative global-to-local inference along the temporal axis: global speaker summary → boundary prediction → fine-grained segment decoding
- **Speaker-aware context cache**: extends processing to long-form audio beyond the training context window while preserving speaker consistency across chunks
- **Three-stage progressive training**: multi-task foundation → temporal interaction learning → cache-conditioned decoding
- **State-of-the-art performance**: outperforms strong baselines including closed-source Gemini-2.5-Pro on AliMeeting and AISHELL-4
- 🔥 **Bilingual & Scaled up**: extended training on 4,194 hours of multi-domain data, natively supporting English and Mandarin across complex multi-speaker scenarios

## 📊 Results

### Comprehensive Multi-Domain Evaluation

We further scaled up Speaker-Reasoner with 4,194 hours of bilingual (ZH/EN) training data. The scaled model achieves the lowest cpWER on seven of the eight evaluation sets below, which span challenging video domains and public meeting datasets; VibeVoice-ASR is slightly ahead only on MLC-SLM-Eval-2.

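| Test set | Model | WER↓ | cpWER↓ | DER↓ | ∆cp↓ |
| :--- | :--- | ---: | ---: | ---: | ---: |
| Video-Internal-Eval | Gemini-2.5-Pro | 22.47 | 44.13 | 74.05 | 21.66 |
| Video-Internal-Eval | VibeVoice-ASR | 16.45 | 58.60 | 47.18 | 42.15 |
| Video-Internal-Eval | Speaker-Reasoner Multi-turn | 6.27 | 24.43 | 15.33 | 18.16 |
| Video-Internal-Eval-zh | Gemini-2.5-Pro | 18.28 | 40.97 | 69.35 | 22.69 |
| Video-Internal-Eval-zh | VibeVoice-ASR | 17.70 | 62.06 | 47.65 | 44.36 |
| Video-Internal-Eval-zh | Speaker-Reasoner Multi-turn | 6.50 | 25.81 | 16.68 | 19.31 |
| Video-Internal-Eval-en | Gemini-2.5-Pro | 55.40 | 68.82 | 100.95 | 13.42 |
| Video-Internal-Eval-en | VibeVoice-ASR | 7.11 | 32.65 | 44.62 | 25.54 |
| Video-Internal-Eval-en | Speaker-Reasoner Multi-turn | 4.42 | 16.31 | 7.58 | 11.89 |
| AISHELL4-Eval | Gemini-2.5-Pro | 19.81 | 25.11 | 36.07 | 5.30 |
| AISHELL4-Eval | VibeVoice-ASR | 22.19 | 26.16 | 8.94 | 3.97 |
| AISHELL4-Eval | Speaker-Reasoner Multi-turn | 7.13 | 8.14 | 3.38 | 1.01 |
| Alimeeting-Far | Gemini-2.5-Pro | 30.16 | 39.29 | 56.39 | 9.13 |
| Alimeeting-Far | VibeVoice-ASR | 34.31 | 39.92 | 19.62 | 5.61 |
| Alimeeting-Far | Speaker-Reasoner Multi-turn | 19.72 | 19.92 | 6.70 | 0.20 |
| AMI-SDM | Gemini-2.5-Pro | 31.66 | 39.98 | 50.28 | 8.32 |
| AMI-SDM | VibeVoice-ASR | 30.53 | 35.86 | 21.00 | 5.33 |
| AMI-SDM | Speaker-Reasoner Multi-turn | 23.29 | 25.16 | 13.56 | 1.87 |
| MLC-SLM-Eval-1 | Gemini-2.5-Pro | 36.87 | 41.88 | 42.33 | 5.01 |
| MLC-SLM-Eval-1 | VibeVoice-ASR | 10.30 | 13.45 | 6.27 | 3.15 |
| MLC-SLM-Eval-1 | Speaker-Reasoner Multi-turn | 9.17 | 11.74 | 4.76 | 2.57 |
| MLC-SLM-Eval-2 | Gemini-2.5-Pro | 26.73 | 32.19 | 46.19 | 5.46 |
| MLC-SLM-Eval-2 | VibeVoice-ASR | 7.97 | 11.38 | 3.14 | 3.41 |
| MLC-SLM-Eval-2 | Speaker-Reasoner Multi-turn | 8.54 | 11.76 | 4.35 | 3.22 |
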
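
The ∆cp column is not defined explicitly in this README, but in every row of the table it equals cpWER minus WER, that is, the extra error incurred once words must also be attributed to the right speaker. The sketch below reproduces the AISHELL4-Eval values under that assumption; the helper name `delta_cp` is illustrative and not part of the released code.

```python
def delta_cp(wer: float, cp_wer: float) -> float:
    """Gap (in percentage points) between speaker-attributed and plain error rate.

    Assumption: delta_cp = cpWER - WER, inferred from the numbers in the table.
    """
    return round(cp_wer - wer, 2)


# (WER, cpWER) pairs copied from the AISHELL4-Eval rows of the table.
aishell4_rows = {
    "Gemini-2.5-Pro": (19.81, 25.11),
    "VibeVoice-ASR": (22.19, 26.16),
    "Speaker-Reasoner Multi-turn": (7.13, 8.14),
}

for model, (wer, cp_wer) in aishell4_rows.items():
    print(f"{model}: delta_cp = {delta_cp(wer, cp_wer):.2f}")
```

A small ∆cp means that most remaining errors are recognition errors rather than speaker-attribution errors.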
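### Segmented Evaluation (40–50s segments)

| Model | AISHELL4-Eval DER↓ | AISHELL4-Eval CER↓ | AISHELL4-Eval cpCER↓ | AISHELL4-Eval ∆cp↓ | Alimeeting-Far DER↓ | Alimeeting-Far CER↓ | Alimeeting-Far cpCER↓ | Alimeeting-Far ∆cp↓ |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| **Cascade Baselines** | | | | | | | | |
| Pyannote3.1 + Paraformer | 8.10 | 19.18 | 26.24 | 7.06 | 19.13 | 30.15 | 45.39 | 15.24 |
| **End-to-End Baselines** | | | | | | | | |
| Gemini-2.5-Pro† | 36.07 | 19.81 | 25.11 | 5.30 | 56.39 | 30.16 | 39.29 | 9.13 |
| Qwen3-Omni-30B-A3B-Instruct | 32.42 | 14.46 | 22.22 | 7.76 | 37.15 | 25.40 | 36.28 | 10.88 |
| Qwen2.5-Omni-7B | 85.68 | 33.37 | 60.45 | 27.08 | 91.77 | 38.13 | 73.38 | 35.25 |
| SpeakerLM (212.25h) | - | 17.75 | 26.14 | 8.39 | - | 18.63 | 32.22 | 13.59 |
| SpeakerLM (7638.95h) | - | 17.17 | 18.37 | 1.20 | - | 13.97 | 16.05 | 2.08 |
| VibeVoice-ASR | 10.88 | 22.30 | 26.30 | 4.00 | 20.70 | 34.67 | 40.54 | 5.87 |
| TagSpeech-Alimeeting | 37.51 | 35.70 | 53.44 | 17.74 | 52.46 | 47.11 | 68.74 | 21.63 |
| **Ours** | | | | | | | | |
| Qwen3-Omni + SOT sft (Stage 1) | - | 17.65 | 19.59 | 1.94 | - | 24.24 | 26.03 | 1.79 |
| Speaker-Reasoner Base (Stage 1) | 6.24 | 14.04 | 16.54 | 2.50 | 8.96 | 21.16 | 22.64 | 1.48 |
| Speaker-Reasoner Multi-turn (Stage 2) | 5.19 | 13.83 | 14.93 | 1.10 | 7.47 | 20.34 | 20.29 | −0.05 |
| Speaker-Reasoner Multi-turn w/ SAC (Stage 3) | 5.26 | 13.83 | 14.73 | 0.90 | 7.34 | 20.57 | 20.43 | −0.14 |
| Speaker-Reasoner Base 7B | 12.00 | 15.65 | 25.60 | 9.95 | 18.43 | 24.97 | 38.12 | 13.15 |
| Speaker-Reasoner Multi-turn 7B | 9.38 | 15.31 | 22.91 | 7.60 | 15.56 | 24.33 | 34.81 | 10.48 |

† Closed-source model. DER unavailable for SpeakerLM and SOT-based models due to incompatible output formats.

### Long-form Evaluation (without segmentation)
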
| Model | AISHELL4-Eval DER↓ | AISHELL4-Eval cpCER↓ |
| :--- | ---: | ---: |
| Gemini-2.5-Pro | 15.32 | 31.59 |
| Speaker-Reasoner Multi-turn w/ SAC | 21.60 | 36.20 |

### Speaker Attribute Evaluation (AISHELL4-Eval)

| Model | Gender ACC↑ | Speaker Count ACC (SCA)↑ |
| :--- | ---: | ---: |
| Gemini-2.5-Pro | 94.80 | 67.03 |
| Qwen3-Omni-30B-A3B-Instruct | 97.12 | 60.49 |
| Speaker-Reasoner Multi-turn | 96.80 | 69.03 |

## Installation

### Environment Setup

```bash
git clone https://github.com/ASLP-lab/Speaker-Reasoner.git
cd Speaker-Reasoner
conda create -n speaker-reasoner python=3.10 -y
conda activate speaker-reasoner
```

Install MS-Swift and dependencies:

```bash
pip install ms-swift
```

## Model Download

We provide the pre-trained model weights on Hugging Face. You can download the corresponding versions based on your requirements:

| Model Version | Description | Language | Download |
| :--- | :--- | :---: | :---: |
| **Speaker-Reasoner** | The standard multi-turn model evaluated in the main paper. | ZH | [🤗 Hugging Face](https://huggingface.co/ASLP-lab/Speaker-Reasoner) |
| **Speaker-Reasoner-4194h** | Scaled-up version trained on 4,194 hours of multi-domain data. | ZH/EN | [🤗 Hugging Face](https://huggingface.co/ASLP-lab/Speaker-Reasoner-4194h) |

## Training

Coming soon.

## Inference

### vLLM

Speaker-Reasoner is built on top of [Qwen3-Omni-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct). To run it, you will need to install a custom branch of vLLM from source.

```bash
git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install -e . -v --no-build-isolation
# If you meet an "Undefined symbol" error while using VLLM_USE_PRECOMPILED=1, please use "pip install -e . -v" to build from source.
# Install the Transformers
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation
```

> For more details on compiling vLLM from source, refer to the [vLLM official documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#set-up-using-python-only-build-without-compilation).

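
Before launching vLLM, the checkpoints listed under Model Download can also be fetched programmatically. The following is a minimal sketch, assuming the `huggingface_hub` package is installed (`pip install huggingface_hub`); the helper `download_weights`, the `MODEL_REPOS` mapping, and the `checkpoints` directory are illustrative names, not part of this repository.

```python
from pathlib import Path

# Repository IDs copied from the Model Download table.
MODEL_REPOS = {
    "base": "ASLP-lab/Speaker-Reasoner",
    "4194h": "ASLP-lab/Speaker-Reasoner-4194h",
}


def download_weights(version: str = "base", local_root: str = "checkpoints") -> Path:
    """Download one released checkpoint and return its local directory.

    `version` and `local_root` are hypothetical parameters for this sketch.
    """
    if version not in MODEL_REPOS:
        raise ValueError(f"unknown version {version!r}; choose from {sorted(MODEL_REPOS)}")

    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    repo_id = MODEL_REPOS[version]
    target = Path(local_root) / repo_id.split("/")[-1]
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `download_weights("4194h")` would place the bilingual checkpoint under `checkpoints/Speaker-Reasoner-4194h`. Note that the weights are large, since the model is built on a 30B-parameter base.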
## Citation

If you find this work useful, please cite:

```bibtex
@article{lin2026speakerreasoner,
  title={Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR},
  author={Zhennan Lin and Shuai Wang and Zhaokai Sun and Pengyuan Xie and Chuan Xie and Jie Liu and Qiang Zhang and Lei Xie},
  year={2026},
  eprint={2604.03074},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2604.03074},
}
```

## License

The code in this repository is released under the **Apache 2.0 License**.

## Contact

- **Issues**: Please open a GitHub Issue for bug reports or suggestions.
- **Email**: znlin@mail.nwpu.edu.cn, lxie@nwpu.edu.cn