TurnGuide Checkpoint: `turnguide_loss_2_1`

This repository hosts a fine-tuned TurnGuide checkpoint for use with the inference code in the TurnGuide GitHub repository.

This checkpoint was trained with a text:speech token loss ratio of 2:1.

TurnGuide is introduced in:

TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving

🎉 TurnGuide has been accepted to the Interspeech 2026 Long Paper Track!

What This Checkpoint Is

turnguide_loss_2_1 is a GLM-4-Voice-based checkpoint for TurnGuide inference. It is one of two released TurnGuide checkpoints:

qqjz/turnguide_loss_2_1: text:speech token loss ratio = 2:1
qqjz/turnguide_loss_3_1: text:speech token loss ratio = 3:1
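
As a rough illustration of what a text:speech loss ratio could mean in practice, the sketch below weights per-token losses by token type and takes a weighted mean. The weighting scheme, the normalization by total weight, and the function name `weighted_token_loss` are assumptions made for illustration only; they are not taken from the TurnGuide code, and the paper defines the actual training objective.

```python
def weighted_token_loss(losses, is_text, text_weight=2.0, speech_weight=1.0):
    """Weighted mean of per-token losses (hypothetical helper, not from TurnGuide)."""
    if len(losses) != len(is_text):
        raise ValueError("losses and is_text must have the same length")
    # Text tokens get text_weight, speech tokens get speech_weight.
    weights = [text_weight if flag else speech_weight for flag in is_text]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one token with non-zero weight is required")
    return sum(w * l for w, l in zip(weights, losses)) / total_weight


# One text token with loss 1.0 and two speech tokens with loss 4.0 each.
loss = weighted_token_loss([1.0, 4.0, 4.0], [True, False, False])
print(loss)  # (2*1.0 + 1*4.0 + 1*4.0) / (2 + 1 + 1) = 2.5
```

Under this assumed scheme, passing `text_weight=3.0` would correspond to the 3:1 setting.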

Both checkpoints can be used with the same TurnGuide inference script by changing --model-path.

This checkpoint is intended to be used with:

TurnGuide code: dreamtheater123/TurnGuide
GLM-4-Voice speech tokenizer: zai-org/glm-4-voice-tokenizer
GLM-4-Voice decoder: zai-org/glm-4-voice-decoder

The decoder is not included in this checkpoint repository and should be downloaded separately.

Installation

Clone the TurnGuide code repository:

git clone https://github.com/dreamtheater123/TurnGuide.git
cd TurnGuide

Create the tested environment:

conda env create -f environment.yml
conda activate turnguide

The tested core environment uses:

Python 3.10.16
PyTorch 2.5.0
CUDA 12.1
torchaudio 2.5.0
transformers 4.44.1
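
To confirm that an activated environment matches the tested versions listed above, a small check along the following lines may help. This is a minimal sketch using only the standard library; the helper names `installed_versions` and `version_mismatches` are hypothetical, and the comparison ignores local build tags such as `+cu121`.

```python
from importlib import metadata

# Package versions listed in the tested environment.
TESTED_VERSIONS = {
    "torch": "2.5.0",
    "torchaudio": "2.5.0",
    "transformers": "4.44.1",
}


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def version_mismatches(installed, tested=TESTED_VERSIONS):
    """Return {package: (installed, tested)} for every package that differs."""
    mismatches = {}
    for name, want in tested.items():
        have = installed.get(name)
        # Drop a local build tag such as "+cu121" before comparing.
        base = have.split("+")[0] if have else None
        if base != want:
            mismatches[name] = (have, want)
    return mismatches


if __name__ == "__main__":
    report = version_mismatches(installed_versions(TESTED_VERSIONS))
    for name, (have, want) in report.items():
        print(f"{name}: installed {have}, tested with {want}")
```

A mismatch does not necessarily mean inference will fail; it only flags a departure from the tested configuration.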

Download the GLM-4-Voice decoder:

git clone https://huggingface.co/zai-org/glm-4-voice-decoder
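
Hugging Face repositories typically store large weight files with Git LFS, so a clone made without Git LFS installed can leave small pointer stubs in place of the real files. The sketch below scans a cloned directory for such stubs. It assumes the decoder repository uses Git LFS, and the helper name `find_lfs_pointers` is hypothetical.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(repo_dir):
    """Return files under repo_dir that are still Git LFS pointer stubs."""
    root = Path(repo_dir)
    pointers = []
    for path in sorted(root.rglob("*")):
        # Skip directories and Git's own metadata.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        with path.open("rb") as handle:
            if handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                pointers.append(path)
    return pointers
```

If `find_lfs_pointers("./glm-4-voice-decoder")` returns any paths, running `git lfs install` followed by `git lfs pull` inside the cloned directory should fetch the actual files.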

Inference

Run TurnGuide inference from the TurnGuide repository:

python turnguide_inference.py \
  --input-audio path/to/input.wav \
  --model-path qqjz/turnguide_loss_2_1 \
  --tokenizer-path zai-org/glm-4-voice-tokenizer \
  --flow-path ./glm-4-voice-decoder \
  --output-dir ./turnguide_demo_output

To use the 3:1 checkpoint instead, replace --model-path qqjz/turnguide_loss_2_1 with --model-path qqjz/turnguide_loss_3_1.
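
For batch runs, the same command can be assembled programmatically. The sketch below builds the argument list from the flags shown above and switches checkpoints by loss ratio. The helper names `build_inference_command` and `run_inference` are hypothetical, and the default paths simply mirror the example command.

```python
import subprocess
import sys

# The two released checkpoints, keyed by text:speech loss ratio.
CHECKPOINTS = {
    "2:1": "qqjz/turnguide_loss_2_1",
    "3:1": "qqjz/turnguide_loss_3_1",
}


def build_inference_command(
    input_audio,
    loss_ratio="2:1",
    tokenizer_path="zai-org/glm-4-voice-tokenizer",
    flow_path="./glm-4-voice-decoder",
    output_dir="./turnguide_demo_output",
):
    """Return the argv list for turnguide_inference.py."""
    if loss_ratio not in CHECKPOINTS:
        raise ValueError(f"unknown loss ratio: {loss_ratio!r}")
    return [
        sys.executable, "turnguide_inference.py",
        "--input-audio", str(input_audio),
        "--model-path", CHECKPOINTS[loss_ratio],
        "--tokenizer-path", tokenizer_path,
        "--flow-path", flow_path,
        "--output-dir", output_dir,
    ]


def run_inference(input_audio, loss_ratio="2:1", repo_dir="."):
    """Run the script from the TurnGuide repository root (repo_dir)."""
    command = build_inference_command(input_audio, loss_ratio)
    subprocess.run(command, cwd=repo_dir, check=True)
```

Calling `run_inference("path/to/input.wav", loss_ratio="3:1", repo_dir="TurnGuide")` would then run the 3:1 checkpoint with otherwise identical settings.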

The script writes:

assistant.wav: generated assistant-channel speech
stereo_user_left_assistant_right.wav: stereo audio with user speech on the left channel and assistant speech on the right channel
a JSON file containing interleaved decoded text information
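
To inspect the two channels of the stereo output separately, the file can be split with the standard-library `wave` module, as sketched below. This assumes the script writes integer PCM WAV data; `wave` raises an error on other encodings such as 32-bit float, in which case an audio library like torchaudio from the tested environment would be needed instead. The helper name `split_stereo` is hypothetical.

```python
import wave


def split_stereo(stereo_path, left_path, right_path):
    """Split a stereo PCM WAV into two mono files; return (sample_rate, n_frames)."""
    with wave.open(str(stereo_path), "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a stereo file")
        width = src.getsampwidth()  # bytes per sample
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # Frames are interleaved: one left sample followed by one right sample.
    frame_size = 2 * width
    left, right = bytearray(), bytearray()
    for start in range(0, len(frames), frame_size):
        left += frames[start:start + width]
        right += frames[start + width:start + frame_size]

    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(str(path), "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(width)
            dst.setframerate(rate)
            dst.writeframes(bytes(data))
    return rate, len(left) // width
```

With the channel layout described above, the left output holds the user speech and the right output holds the assistant speech.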

Notes

This checkpoint uses custom GLM-4-Voice code and should be loaded with trust_remote_code=True.
The checkpoint is designed for research use with the TurnGuide inference pipeline.
Model weights from GLM-4-Voice and related assets are governed by their respective licenses. Please follow the license terms of the original GLM-4-Voice models and decoder.
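
For loading the checkpoint outside the provided script, the `trust_remote_code=True` note above might translate into something like the following sketch. The choice of `AutoModel`, the bfloat16 dtype, and the `cuda` device are assumptions; the TurnGuide inference script may load the model differently, and the helper name `load_turnguide` is hypothetical.

```python
MODEL_ID = "qqjz/turnguide_loss_2_1"


def load_turnguide(model_id=MODEL_ID, device="cuda"):
    """Load the checkpoint with its custom model code enabled (assumed usage)."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    import torch
    from transformers import AutoModel

    model = AutoModel.from_pretrained(
        model_id,
        trust_remote_code=True,      # required: the checkpoint ships custom code
        torch_dtype=torch.bfloat16,  # assumption: load in bfloat16
    )
    return model.to(device).eval()
```

Because `trust_remote_code=True` executes Python code from the model repository, it is worth reviewing that code before loading.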

Citation

@article{turnguide2026,
  title={TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving},
  author={Cui, Wenqian and Zhu, Lei and Li, Xiao-Hui and Guo, Zhihan and Bai, Haoli and Hou, Lu and King, Irwin},
  journal={arXiv preprint arXiv:2508.07375},
  year={2026}
}

Model size: 10B params
Tensor type: BF16
Weights format: Safetensors
Base model: zai-org/glm-4-voice-9b
