Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

Zhennan Lin¹, Shuai Wang², Zhaokai Sun¹, Pengyuan Xie³, Chuan Xie³, Jie Liu³, Qiang Zhang³, Lei Xie¹†

†Corresponding author

¹Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
²School of Intelligence Science and Technology, Nanjing University
³Shanghai Lingguang Zhaxian Technology

---

Speaker-Reasoner is an end-to-end Speech LLM for **timestamped speaker-attributed ASR** featuring agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window.

![](figs/speaker_reasoner.png)

## 🌟 Highlights

- **Agentic multi-turn reasoning**: iterative global-to-local inference along the temporal axis (global speaker summary → boundary prediction → fine-grained segment decoding)
- **Speaker-aware context cache**: extends processing to long-form audio beyond the training context window while preserving speaker consistency across chunks
- **Three-stage progressive training**: multi-task foundation → temporal interaction learning → cache-conditioned decoding
- **State-of-the-art performance**: outperforms strong baselines, including the closed-source Gemini-2.5-Pro, on AliMeeting and AISHELL-4
- 🔥 **Bilingual & Scaled up**: extended training on 4,194 hours of multi-domain data, natively supporting English and Mandarin across complex multi-speaker scenarios

## 📊 Results

### Comprehensive Multi-Domain Evaluation

We further scaled up Speaker-Reasoner with 4,194 hours of bilingual (ZH/EN) training data. The resulting model attains the lowest WER, cpWER, and DER on seven of the eight evaluation sets below, which span internal video data and public multi-speaker benchmarks; VibeVoice-ASR is ahead on those three metrics on MLC-SLM-Eval-2.

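| Dataset | Model | WER↓ | cpWER↓ | DER↓ | ∆cp↓ |
| :--- | :--- | :---: | :---: | :---: | :---: |
| Video-Internal-Eval | Gemini-2.5-Pro | 22.47 | 44.13 | 74.05 | 21.66 |
| Video-Internal-Eval | VibeVoice-ASR | 16.45 | 58.60 | 47.18 | 42.15 |
| Video-Internal-Eval | Speaker-Reasoner Multi-turn | 6.27 | 24.43 | 15.33 | 18.16 |
| Video-Internal-Eval-zh | Gemini-2.5-Pro | 18.28 | 40.97 | 69.35 | 22.69 |
| Video-Internal-Eval-zh | VibeVoice-ASR | 17.70 | 62.06 | 47.65 | 44.36 |
| Video-Internal-Eval-zh | Speaker-Reasoner Multi-turn | 6.50 | 25.81 | 16.68 | 19.31 |
| Video-Internal-Eval-en | Gemini-2.5-Pro | 55.40 | 68.82 | 100.95 | 13.42 |
| Video-Internal-Eval-en | VibeVoice-ASR | 7.11 | 32.65 | 44.62 | 25.54 |
| Video-Internal-Eval-en | Speaker-Reasoner Multi-turn | 4.42 | 16.31 | 7.58 | 11.89 |
| AISHELL4-Eval | Gemini-2.5-Pro | 19.81 | 25.11 | 36.07 | 5.30 |
| AISHELL4-Eval | VibeVoice-ASR | 22.19 | 26.16 | 8.94 | 3.97 |
| AISHELL4-Eval | Speaker-Reasoner Multi-turn | 7.13 | 8.14 | 3.38 | 1.01 |
| Alimeeting-Far | Gemini-2.5-Pro | 30.16 | 39.29 | 56.39 | 9.13 |
| Alimeeting-Far | VibeVoice-ASR | 34.31 | 39.92 | 19.62 | 5.61 |
| Alimeeting-Far | Speaker-Reasoner Multi-turn | 19.72 | 19.92 | 6.70 | 0.20 |
| AMI-SDM | Gemini-2.5-Pro | 31.66 | 39.98 | 50.28 | 8.32 |
| AMI-SDM | VibeVoice-ASR | 30.53 | 35.86 | 21.00 | 5.33 |
| AMI-SDM | Speaker-Reasoner Multi-turn | 23.29 | 25.16 | 13.56 | 1.87 |
| MLC-SLM-Eval-1 | Gemini-2.5-Pro | 36.87 | 41.88 | 42.33 | 5.01 |
| MLC-SLM-Eval-1 | VibeVoice-ASR | 10.30 | 13.45 | 6.27 | 3.15 |
| MLC-SLM-Eval-1 | Speaker-Reasoner Multi-turn | 9.17 | 11.74 | 4.76 | 2.57 |
| MLC-SLM-Eval-2 | Gemini-2.5-Pro | 26.73 | 32.19 | 46.19 | 5.46 |
| MLC-SLM-Eval-2 | VibeVoice-ASR | 7.97 | 11.38 | 3.14 | 3.41 |
| MLC-SLM-Eval-2 | Speaker-Reasoner Multi-turn | 8.54 | 11.76 | 4.35 | 3.22 |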
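
This README does not define cpWER or ∆cp. The sketch below assumes the usual concatenated minimum-permutation definition of cpWER (each speaker's utterances are concatenated, and the speaker mapping that minimizes the total edit distance is chosen) and assumes ∆cp = cpWER - WER, which is consistent with every row of the table above (for example, 8.14 - 7.13 = 1.01 for Speaker-Reasoner on AISHELL4-Eval). The function names are illustrative and are not part of this repository; for Mandarin, tokens would be characters instead of words.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Assumed cpWER: minimum over speaker permutations, in percent.

    Both arguments map a speaker label to that speaker's concatenated tokens.
    Brute force over permutations, so only suitable for a handful of speakers.
    """
    refs = list(ref_by_speaker.values())
    hyps = list(hyp_by_speaker.values())
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))  # pad missing speakers with empty streams
    hyps += [[]] * (n - len(hyps))
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference must contain at least one token")
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return 100.0 * best / total_ref


def delta_cp(cpwer, wer):
    """Assumed definition: extra error attributable to speaker attribution."""
    return round(cpwer - wer, 2)


# Toy example: the hypothesis uses different speaker labels and has one word error.
ref = {"A": "hello world".split(), "B": "good morning".split()}
hyp = {"spk1": "good morning".split(), "spk2": "hello word".split()}
toy_cpwer = cp_wer(ref, hyp)       # 25.0: one substitution over four reference words
toy_delta = delta_cp(8.14, 7.13)   # 1.01, the AISHELL4-Eval row above
```

A small ∆cp indicates that most remaining errors come from recognition, not from assigning words to the wrong speaker.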

### Segmented Evaluation (40–50s segments)

| Model | AISHELL4-Eval DER↓ | AISHELL4-Eval CER↓ | AISHELL4-Eval cpCER↓ | AISHELL4-Eval ∆cp↓ | Alimeeting-Far DER↓ | Alimeeting-Far CER↓ | Alimeeting-Far cpCER↓ | Alimeeting-Far ∆cp↓ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Cascade Baselines** | | | | | | | | |
| Pyannote3.1 + Paraformer | 8.10 | 19.18 | 26.24 | 7.06 | 19.13 | 30.15 | 45.39 | 15.24 |
| **End-to-End Baselines** | | | | | | | | |
| Gemini-2.5-Pro† | 36.07 | 19.81 | 25.11 | 5.30 | 56.39 | 30.16 | 39.29 | 9.13 |
| Qwen3-Omni-30B-A3B-Instruct | 32.42 | 14.46 | 22.22 | 7.76 | 37.15 | 25.40 | 36.28 | 10.88 |
| Qwen2.5-Omni-7B | 85.68 | 33.37 | 60.45 | 27.08 | 91.77 | 38.13 | 73.38 | 35.25 |
| SpeakerLM (212.25h) | – | 17.75 | 26.14 | 8.39 | – | 18.63 | 32.22 | 13.59 |
| SpeakerLM (7638.95h) | – | 17.17 | 18.37 | 1.20 | – | 13.97 | 16.05 | 2.08 |
| VibeVoice-ASR | 10.88 | 22.30 | 26.30 | 4.00 | 20.70 | 34.67 | 40.54 | 5.87 |
| TagSpeech-Alimeeting | 37.51 | 35.70 | 53.44 | 17.74 | 52.46 | 47.11 | 68.74 | 21.63 |
| **Ours** | | | | | | | | |
| Qwen3-Omni + SOT sft (Stage 1) | – | 17.65 | 19.59 | 1.94 | – | 24.24 | 26.03 | 1.79 |
| Speaker-Reasoner Base (Stage 1) | 6.24 | 14.04 | 16.54 | 2.50 | 8.96 | 21.16 | 22.64 | 1.48 |
| Speaker-Reasoner Multi-turn (Stage 2) | 5.19 | 13.83 | 14.93 | 1.10 | 7.47 | 20.34 | 20.29 | −0.05 |
| Speaker-Reasoner Multi-turn w/ SAC (Stage 3) | 5.26 | 13.83 | 14.73 | 0.90 | 7.34 | 20.57 | 20.43 | −0.14 |
| Speaker-Reasoner Base 7B | 12.00 | 15.65 | 25.60 | 9.95 | 18.43 | 24.97 | 38.12 | 13.15 |
| Speaker-Reasoner Multi-turn 7B | 9.38 | 15.31 | 22.91 | 7.60 | 15.56 | 24.33 | 34.81 | 10.48 |

† Closed-source model. DER is unavailable for SpeakerLM and SOT-based models due to incompatible output formats.

### Long-form Evaluation (without segmentation)

| Model | AISHELL4-Eval DER↓ | AISHELL4-Eval cpCER↓ |
| :--- | :---: | :---: |
| Gemini-2.5-Pro | 15.32 | 31.59 |
| Speaker-Reasoner Multi-turn w/ SAC | 21.60 | 36.20 |

### Speaker Attribute Evaluation (AISHELL4-Eval)

| Model | Gender ACC↑ | Speaker Count ACC (SCA)↑ |
| :--- | :---: | :---: |
| Gemini-2.5-Pro | 94.80 | 67.03 |
| Qwen3-Omni-30B-A3B-Instruct | 97.12 | 60.49 |
| Speaker-Reasoner Multi-turn | 96.80 | 69.03 |
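
The README does not spell out how the two accuracies above are computed. As a rough illustration, the sketch below assumes both are plain exact-match percentages: SCA over per-recording speaker counts, and gender accuracy over speakers that have already been aligned between hypothesis and reference. The `accuracy` helper and the toy data are hypothetical and do not come from this repository's evaluation scripts.

```python
def accuracy(predictions, references):
    """Percentage of items whose prediction exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one item is required")
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Hypothetical speaker counts, one entry per evaluated recording.
ref_counts = [4, 5, 6, 4]
hyp_counts = [4, 5, 5, 4]
sca = accuracy(hyp_counts, ref_counts)         # 75.0: three of four counts match

# Hypothetical gender labels, one entry per aligned speaker.
ref_gender = ["F", "M", "M", "F", "M"]
hyp_gender = ["F", "M", "M", "F", "M"]
gender_acc = accuracy(hyp_gender, ref_gender)  # 100.0: all labels match
```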

## Installation

### Environment Setup

```bash
git clone https://github.com/ASLP-lab/Speaker-Reasoner.git
cd Speaker-Reasoner
conda create -n speaker-reasoner python=3.10 -y
conda activate speaker-reasoner
```

Install MS-Swift and dependencies:

```bash
pip install ms-swift
```

## Model Download

We provide the pre-trained model weights on Hugging Face. You can download the corresponding versions based on your requirements:

| Model Version | Description | Language | Download |
| :--- | :--- | :---: | :---: |
| **Speaker-Reasoner** | The standard multi-turn model evaluated in the main paper. | ZH | [🤗 Hugging Face](https://huggingface.co/ASLP-lab/Speaker-Reasoner) |
| **Speaker-Reasoner-4194h** | Scaled-up version trained on 4,194 hours of multi-domain data. | ZH/EN | [🤗 Hugging Face](https://huggingface.co/ASLP-lab/Speaker-Reasoner-4194h) |

## Training

Coming soon.

## Inference

### vLLM

Speaker-Reasoner is built on top of [Qwen3-Omni-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct). To run it, you will need to install a custom branch of vLLM from source.

```bash
git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install -e . -v --no-build-isolation
# If you meet an "Undefined symbol" error while using VLLM_USE_PRECOMPILED=1, please use "pip install -e . -v" to build from source.
# Install the Transformers
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation
```

> For more details on compiling vLLM from source, refer to the [vLLM official documentation](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#set-up-using-python-only-build-without-compilation).

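
The two checkpoints listed under Model Download can also be fetched from Python. The snippet below is a minimal sketch, not an official script of this repository: it relies on `snapshot_download` from the `huggingface_hub` package (installed as a dependency of Transformers), and the helper names `repos_supporting` and `download_weights` as well as the local directory in the usage comment are hypothetical.

```python
# Repository IDs and languages are taken from the Model Download table.
MODEL_REPOS = {
    "Speaker-Reasoner": ("ASLP-lab/Speaker-Reasoner", {"zh"}),
    "Speaker-Reasoner-4194h": ("ASLP-lab/Speaker-Reasoner-4194h", {"zh", "en"}),
}


def repos_supporting(language):
    """Return the repo IDs whose listed languages include `language` ("zh" or "en")."""
    return [repo for repo, langs in MODEL_REPOS.values() if language in langs]


def download_weights(repo_id, local_dir):
    """Download a full model snapshot and return the local path."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Example usage (requires network access; the target directory is an assumption):
# path = download_weights(repos_supporting("en")[0], "./checkpoints/speaker-reasoner-4194h")
```

Only the 4,194-hour checkpoint is listed as supporting English, so `repos_supporting("en")` returns a single repository.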
## Citation

If you find this work useful, please cite:

```bibtex
@article{lin2026speakerreasoner,
  title={Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR},
  author={Zhennan Lin and Shuai Wang and Zhaokai Sun and Pengyuan Xie and Chuan Xie and Jie Liu and Qiang Zhang and Lei Xie},
  year={2026},
  eprint={2604.03074},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2604.03074},
}
```

## License

The code in this repository is released under the **Apache 2.0 License**.

## Contact

- **Issues**: Please open a GitHub Issue for bug reports or suggestions.
- **Email**: znlin@mail.nwpu.edu.cn, lxie@nwpu.edu.cn
