SoulX-Duplug
Official code for enabling full-duplex speech interaction with
SoulX-Duplug: Plug-and-Play Streaming State Prediction Module for Realtime Full-Duplex Speech Conversation
✨ Overview
SoulX-Duplug is a plug-and-play streaming semantic VAD model designed for real-time full-duplex speech conversation. Through text-guided streaming state prediction, SoulX-Duplug enables low-latency, semantic-aware streaming dialogue management. In addition to the core model, we also open-source a dialogue system built on top of SoulX-Duplug, which demonstrates the practicality of our model in real-world applications.
To facilitate benchmarking and research in this area, we also release SoulX-Duplug-Eval, a complementary evaluation set for benchmarking full-duplex spoken dialogue systems.
🔥 Demo
Visit our GitHub page for a demo. You can also try the online interactive demo here:
👉 https://soulx-duplug.sjtuxlance.com/
🛠️ Install
Clone and Install
Here are instructions for installing on Linux.
- Clone the repo
git clone https://github.com/Soul-AILab/SoulX-Duplug.git
cd SoulX-Duplug
- Install system dependencies
sudo apt-get update
sudo apt-get install ffmpeg sox libsox-dev -y
Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
Create Conda env
conda create -n soulx-duplug -y python=3.10
conda activate soulx-duplug
pip install -r requirements.txt
# If you are in mainland China, you can set the mirror as follows:
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
Model Download
Download via huggingface-cli:
# If you are in mainland China, please first set the mirror:
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download Soul-AILab/SoulX-Duplug-0.6B --local-dir pretrained_models
Download via python:
from huggingface_hub import snapshot_download
snapshot_download("Soul-AILab/SoulX-Duplug-0.6B", local_dir="pretrained_models")
Download via git clone:
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/Soul-AILab/SoulX-Duplug-0.6B pretrained_models
Basic Usage
We provide a streaming inference server for SoulX-Duplug. Start the server:
bash run.sh
To use it (see example_client.py for reference), stream your audio query to the server in chunks; the server returns its prediction of the current dialogue state as a dict:
Format:
{
    "type": "turn_state",
    "session_id": ,       # session_id
    "state": {
        "state": ,        # predicted state: "idle", "nonidle", "speak", or "blank"
        "text": ,         # (optional) ASR result of the user's turn
        "asr_segment": ,  # (optional) ASR result of the current chunk
        "asr_buffer": ,   # (optional) ASR result of the last 3.2s
    },
    "ts": time.time(),    # timestamp
}
- "idle" indicates that the current audio chunk contains no semantic content (e.g., silence, noise, or backchannel).
- "nonidle" indicates that the current audio chunk contains semantic content. In this case, "asr_segment" returns the ASR result of the current chunk, and "asr_buffer" returns the ASR result of the accumulated audio over the past 3.2 seconds.
- "speak" indicates that, up to the current chunk, the user is judged to have stopped speaking and the utterance is semantically complete, meaning the system can take the turn. In this case, "asr_segment" returns the ASR result of the current chunk, "asr_buffer" returns the ASR result of the accumulated audio over the past 3.2 seconds, and "text" returns the complete transcription of the user's utterance for this turn.
- "blank" indicates that the current unprocessed streaming input does not yet fill a full chunk; the server has cached the input and is waiting for the next query.
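As a minimal sketch of how a client might act on these messages: the `handle_turn_state` helper and its action strings below are illustrative assumptions, not part of the released API; see example_client.py for the actual client logic.

```python
def handle_turn_state(msg: dict) -> str:
    """Map a server turn_state message to a client-side action.

    The action names ("wait", "keep_listening", "respond_to: ...") are
    hypothetical; only the message fields follow the format documented above.
    """
    state = msg["state"]["state"]
    if state == "blank":
        return "wait"            # chunk not yet full; keep streaming audio
    if state == "idle":
        return "keep_listening"  # silence, noise, or backchannel: do nothing
    if state == "nonidle":
        # user is still speaking; partial ASR is in asr_segment / asr_buffer
        return "keep_listening"
    if state == "speak":
        # utterance complete: take the turn using the full transcription
        return f"respond_to: {msg['state'].get('text', '')}"
    raise ValueError(f"unknown state: {state}")

msg = {"type": "turn_state", "session_id": "s1",
       "state": {"state": "speak", "text": "hello there"}, "ts": 0.0}
print(handle_turn_state(msg))  # → respond_to: hello there
```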
📖 Citation
If you find this work useful in your research, please consider citing:
@misc{yan2026soulxduplug,
title={SoulX-Duplug: Plug-and-Play Streaming State Prediction Module for Realtime Full-Duplex Speech Conversation},
author={Ruiqi Yan and Wenxi Chen and Zhanxun Liu and Ziyang Ma and Haopeng Lin and Hanlin Wen and Hanke Xie and Jun Wu and Yuzhe Liang and Yuxiang Zhao and Pengchao Feng and Jiale Qian and Hao Meng and Yuhang Dai and Shunshun Yin and Ming Tao and Lei Xie and Kai Yu and Xinsheng Wang and Xie Chen},
year={2026},
eprint={2603.14877},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2603.14877},
}
📄 License
This project is licensed under the Apache 2.0 License.
🙏 Acknowledgment
We thank the following open-source projects for their contributions: