SoulX-Duplug
Official code for enabling full-duplex speech interaction with
SoulX-Duplug: Plug-and-Play Streaming State Prediction Module for Realtime Full-Duplex Speech Conversation
✨ Overview
SoulX-Duplug is a plug-and-play streaming semantic VAD model designed for real-time full-duplex speech conversation. Through text-guided streaming state prediction, SoulX-Duplug enables low-latency, semantic-aware streaming dialogue management. In addition to the core model, we also open-source a dialogue system built on top of SoulX-Duplug, which demonstrates the practicality of our model in real-world applications.
To facilitate benchmarking and research in this area, we also release SoulX-Duplug-Eval, a complementary evaluation set for benchmarking full-duplex spoken dialogue systems.
🔥 Demo
Visit our GitHub page for a demo. You can also try the online interactive demo here:
👉 https://soulx-duplug.sjtuxlance.com/
🛠️ Install
Clone and Install
Here are instructions for installing on Linux.
- Clone the repo
git clone https://github.com/Soul-AILab/SoulX-Duplug.git
cd SoulX-Duplug
- Install system dependencies
sudo apt-get update
sudo apt-get install ffmpeg sox libsox-dev -y
Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
Create Conda env
conda create -n soulx-duplug -y python=3.10
conda activate soulx-duplug
pip install -r requirements.txt
# If you are in mainland China, you can set the mirror as follows:
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
Model Download
Download via huggingface-cli:
# If you are in mainland China, please first set the mirror:
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download Soul-AILab/SoulX-Duplug-0.6B --local-dir pretrained_models
Download via python:
from huggingface_hub import snapshot_download
snapshot_download("Soul-AILab/SoulX-Duplug-0.6B", local_dir="pretrained_models")
Download via git clone:
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/Soul-AILab/SoulX-Duplug-0.6B pretrained_models
Basic Usage
We provide a streaming inference server for SoulX-Duplug. Start the server:
bash run.sh
To use it (see example_client.py for reference), stream your audio query to the server in chunks; the server returns its prediction of the current dialogue state as a dict:
Format:
{
    "type": "turn_state",
    "session_id": ,       # session_id
    "state": {
        "state": ,        # predicted state: "idle", "nonidle", "speak", or "blank"
        "text": ,         # (optional) ASR result of the user's turn
        "asr_segment": ,  # (optional) ASR result of the current chunk
        "asr_buffer": ,   # (optional) ASR result of the last 3.2s
    },
    "ts": time.time(),    # timestamp
}
- "idle" indicates that the current audio chunk contains no semantic content (e.g., silence, noise, or backchannel).
- "nonidle" indicates that the current audio chunk contains semantic content. In this case, "asr_segment" returns the ASR result of the current chunk, and "asr_buffer" returns the ASR result of the accumulated audio over the past 3.2 seconds.
- "speak" indicates that, up to the current chunk, the user is judged to have stopped speaking and the utterance is semantically complete, meaning the system can take the turn. In this case, "asr_segment" returns the ASR result of the current chunk, "asr_buffer" returns the ASR result of the accumulated audio over the past 3.2 seconds, and "text" returns the complete transcription of the user's utterance for this turn.
- "blank" indicates that the current unprocessed streaming input does not yet fill a full chunk; the server has cached the input and is waiting for the next query.
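As a minimal sketch of how a client might act on these messages: the `handle_turn_state` helper and its action strings below are illustrative assumptions, not part of the released API; see example_client.py for the actual client logic.

```python
def handle_turn_state(msg: dict) -> str:
    """Map a server turn_state message to a client-side action.

    The action names ("wait", "keep_listening", "respond_to: ...") are
    hypothetical; only the message fields follow the format documented above.
    """
    state = msg["state"]["state"]
    if state == "blank":
        return "wait"            # chunk not yet full; keep streaming audio
    if state == "idle":
        return "keep_listening"  # silence, noise, or backchannel: do nothing
    if state == "nonidle":
        # user is still speaking; partial ASR is in asr_segment / asr_buffer
        return "keep_listening"
    if state == "speak":
        # utterance complete: take the turn using the full transcription
        return f"respond_to: {msg['state'].get('text', '')}"
    raise ValueError(f"unknown state: {state}")

msg = {"type": "turn_state", "session_id": "s1",
       "state": {"state": "speak", "text": "hello there"}, "ts": 0.0}
print(handle_turn_state(msg))  # → respond_to: hello there
```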
📖 Citation
If you find this work useful in your research, please consider citing:
@misc{yan2026soulxduplug,
title={SoulX-Duplug: Plug-and-Play Streaming State Prediction Module for Realtime Full-Duplex Speech Conversation},
author={Ruiqi Yan and Wenxi Chen and Zhanxun Liu and Ziyang Ma and Haopeng Lin and Hanlin Wen and Hanke Xie and Jun Wu and Yuzhe Liang and Yuxiang Zhao and Pengchao Feng and Jiale Qian and Hao Meng and Yuhang Dai and Shunshun Yin and Ming Tao and Lei Xie and Kai Yu and Xinsheng Wang and Xie Chen},
year={2026},
eprint={2603.14877},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2603.14877},
}
📄 License
This project is licensed under the Apache 2.0 License.
🙏 Acknowledgment
We thank the following open-source projects for their contributions: