---
license: apache-2.0
pipeline_tag: audio-to-audio
library_name: transformers
---

# OpenS2S: Advancing Open-Source End-to-End Empathetic Large Speech Language Model

This repository contains the OpenS2S model presented in the paper [OpenS2S: Advancing Open-Source End-to-End Empathetic Large Speech Language Model](https://huggingface.co/papers/2507.05177).

-   [Paper](https://huggingface.co/papers/2507.05177)
-   [Project Page](https://casia-lm.github.io/OpenS2S)
-   [GitHub Repository](https://github.com/CASIA-LM/OpenS2S)
-   [Hugging Face Dataset](https://huggingface.co/datasets/CASIA-LM/OpenS2S_Datasets)

## Abstract

Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into the LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, pre-training and fine-tuning codes, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at https://casia-lm.github.io/OpenS2S.

## Model Architecture

![architecture](https://huggingface.co/CASIA-LM/OpenS2S/resolve/main/figures/Architecture.png)

As shown in the figure, OpenS2S consists of the following main components:

*   **Audio Encoder**: The Audio Encoder transforms the raw audio signal into a compact, meaningful representation for the language model.

*   **Instruction-Following LLM**: The audio embeddings and text embeddings are concatenated to form interleaved input sequences for the large language model. We select Qwen3-8B-Instruct as the LLM, leveraging its robust text processing capabilities.

*   **Streaming Speech Decoder**: The speech response is first converted into discrete tokens using a supervised semantic speech tokenizer. Then, an autoregressive text-to-speech language model is used to generate speech tokens conditioned on the hidden states of the LLM, enabling real-time generation.
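
The streaming interleaved decoding pattern described above can be sketched in a few lines. This is a purely illustrative simulation, not the model's actual decoding loop: the chunk sizes, token strings, and function name are hypothetical, chosen only to show how alternating text and speech chunks let audio playback begin before the full response is generated.

```python
from itertools import islice

def interleave_stream(text_tokens, speech_tokens, text_chunk=2, speech_chunk=6):
    """Yield text and speech tokens in alternating fixed-size chunks.

    Illustrative only: emits a few text tokens, then the speech tokens
    that render them, so downstream audio synthesis can start streaming
    before the response is complete.
    """
    text_it, speech_it = iter(text_tokens), iter(speech_tokens)
    while True:
        t = list(islice(text_it, text_chunk))
        s = list(islice(speech_it, speech_chunk))
        if not t and not s:
            break
        yield from (("text", tok) for tok in t)
        yield from (("speech", tok) for tok in s)

# Two text tokens are followed by the speech units covering them, repeatedly.
stream = list(interleave_stream(
    ["hel", "lo", " wor", "ld"],
    [f"<|audio_{i}|>" for i in range(12)],
))
```

With these hypothetical chunk sizes, the first two items of `stream` are text tokens and the third is the first speech unit, mirroring the low-latency interleaving the decoder performs.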

## Example

More examples can be found in the [project page](https://casia-lm.github.io/OpenS2S).

## Usage

### Setup

```bash
pip install -r requirements.txt
```

### Prepare the pretrained OpenS2S checkpoint

Download the pretrained OpenS2S model from [Huggingface](https://huggingface.co/CASIA-LM/OpenS2S).

### Inference

1.  Start the controller
    ```bash
    python controller.py
    ```
2.  Start the model server
    ```bash
    python model_worker.py
    ```

3.  Launch the web service locally
    ```bash
    python web_demo.py --port 8888
    ```

## Training

### Data Preparation

This code requires input data in JSON Lines (jsonl) format. Each line of the file must be a valid JSON object containing exactly one key: `messages`.

Here is an example of a valid line in the jsonl file:
```json
{
    "messages": [
        {
            "role": "user",
            "content": [
                {"text": "continue the following sentence", "audio": "", "speech_units": "", "spk_emb": ""},
                {"text": "", "audio": "/path/to/audio", "speech_units": "", "spk_emb": ""}
            ]
        },
        {
            "role": "assistant",
            "content": [
                {"text": "hello", "audio": "", "speech_units": "<|audio_0|><|audio_1|>", "spk_emb": ""}
            ]
        }
    ]
}
```
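
A small helper for writing and validating lines in this format might look as follows. The helper names (`make_segment`, `validate_line`) are illustrative; only the `messages` schema and its four segment fields come from the example above.

```python
import json

def make_segment(text="", audio="", speech_units="", spk_emb=""):
    """One content segment; unused fields stay as empty strings."""
    return {"text": text, "audio": audio,
            "speech_units": speech_units, "spk_emb": spk_emb}

def validate_line(line):
    """Check that a jsonl line is an object with exactly one key: messages."""
    obj = json.loads(line)
    assert set(obj) == {"messages"}, "each line must contain only 'messages'"
    for msg in obj["messages"]:
        assert msg["role"] in ("user", "assistant")
        for seg in msg["content"]:
            assert set(seg) == {"text", "audio", "speech_units", "spk_emb"}
    return obj

# Build the example from the documentation and round-trip it through jsonl.
sample = {"messages": [
    {"role": "user", "content": [
        make_segment(text="continue the following sentence"),
        make_segment(audio="/path/to/audio"),
    ]},
    {"role": "assistant", "content": [
        make_segment(text="hello", speech_units="<|audio_0|><|audio_1|>"),
    ]},
]}
line = json.dumps(sample, ensure_ascii=False)  # one line per training example
parsed = validate_line(line)
```

Note that each training example must occupy a single physical line in the jsonl file, even though the documentation example above is pretty-printed for readability.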

If you want to construct continuation writing based on ASR data, please refer to [text_generation.py](https://github.com/CASIA-LM/OpenS2S/blob/main/text_generation.py). If you want to convert audio waveform into speech units, please refer to [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice/blob/main/speech_tokenizer/utils.py#L40).

### Train from scratch

1.  Obtain the Audio Encoder, LLM backbone, and autoregressive TTS LM.

2.  Offline process training data
    ```bash
    export llm_path=/path/to/llm_backbone
    export tts_path=/path/to/ar_tts
    export audio_path=/path/to/audio_encoder
    python src/instruction_dataset.py offline \
        --dataroot /path/to/raw_data_dir \
        --manifest_files "*.jsonl" \
        --llm_path ${llm_path} \
        --tts_path ${tts_path} \
        --save_dir /path/to/processed_data_dir \
        --num_proc 64
    ```

3.  Train the model (connecting the different modules)
    ```bash
    export data_dir=/path/to/processed_data_dir
    export SAVE_ROOT=/path/to/checkpoints

    bash scripts/train_from_scratch.sh
    ```

### Fine-tuning

1.  Obtain pretrained checkpoints

2.  Offline process
    ```bash
    export omnispeech_path=/path/to/omnispeech

    python src/instruction_dataset.py offline \
        --dataroot /path/to/raw_data_dir \
        --manifest_files "*.jsonl" \
        --llm_path ${omnispeech_path} \
        --tts_path ${omnispeech_path}/tts/ \
        --save_dir /path/to/processed_data_dir \
        --num_proc 64
    ```

3.  Fine-tune the pretrained model
    ```bash
    bash scripts/train_continue.sh
    ```

## Acknowledgements

We would like to thank the following projects and individuals for their contributions to the development of OpenS2S:
*   [LLaMA-Omni2](https://github.com/ictnlp/LLaMA-Omni2)
*   [Transformers](https://github.com/huggingface/transformers)
*   [GLM-4-Voice](https://github.com/THUDM/GLM-4-Voice)
*   [Qwen3](https://github.com/QwenLM/Qwen3)
*   [Qwen2-Audio](https://github.com/QwenLM/Qwen2-Audio)

Thank you to all the open-source projects for their contributions to this project!

## License
The license of our project is [Apache License 2.0](https://github.com/CASIA-LM/OpenS2S/blob/main/LICENSE).

## Citation
If you find our project useful, please consider starring our repo and citing our paper:
```bibtex
@article{wang2025opens2s,
  title={OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model},
  author={Wang, Chen and Peng, Tianyu and Yang, Wen and Bai, Yinan and Wang, Guangfu and Lin, Jun and Jia, Lanpeng and Wu, Lingxiang and Wang, Jinqiao and Zong, Chengqing and Zhang, Jiajun},
  journal={arXiv preprint arXiv:2507.05177},
  year={2025}
}
```