Pure ONNX Runtime inference pipeline for Qwen3-TTS-12Hz-0.6B-Base, enabling streaming text-to-speech without PyTorch dependency at runtime.
This repository provides:
- `qwen3_tts_inferencer_onnx.py` — Core streaming TTS engine that orchestrates six ONNX models (talker LLM, local talker transformer, codec decoder, speaker encoder, talker codec embedding, text embedding projection) using only NumPy and ONNX Runtime.
- `test_qwen3-tts-streaming_onnx.py` — End-to-end test script that simulates LLM streaming text and produces a WAV file.

```
Reference Audio ──► Speaker Encoder ──► Speaker Embedding Vector (voice-clone context)
                                                  │
                                                  ▼
Text Deltas ──► Talker LLM (Qwen3-0.6B) ──► [Hidden States, VQ Token]
                                                  │
                                                  ▼
                         Local Transformer ──► 15-codebook RVQ Tokens
                                                  │
                                                  ▼
                 VQ Token ──► Codec Decoder ──► 24 kHz Waveform
```
| Component | ONNX Model | Description |
|---|---|---|
| Talker LLM | `talker_model.onnx` | Qwen3-based talker LM mapping interleaved text+audio token embeddings to hidden states and a VQ token. Maintains a growing KV-cache across the entire generation. |
| Local Talker | `talker_local_model.onnx` | Depth-wise decoder generating 15 RVQ codebook entries per frame from talker hidden states and the VQ token. Creates and discards a fresh KV-cache per frame. |
| Codec Decoder | `codec_decoder_model.onnx` | Decodes VQ+RVQ audio codes back to a 24 kHz waveform. Maintains KV-caches and convolutional caches for streaming decode. |
| Speaker Encoder | `speaker_encoder_model.onnx` | ECAPA-TDNN-based speaker encoder. Produces a 1024-dim speaker embedding vector for voice-identity cloning. |
| Talker Codec Embed | `talker_codec_embed_model.onnx` | VQ embedding table for the talker model (2048-token vocabulary). |
| Text Embed Projection | `text_embed_proj_model.onnx` | Text embedding and projection for the talker model (151,936-token vocabulary). |
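To make the per-frame bookkeeping in the table concrete, here is a minimal sketch of the token/sample accounting, assuming the 12.5 Hz token rate and 24 kHz output stated in this README. The constant and function names are illustrative only, not part of the repository's API:

```python
# Per-frame accounting implied by the component table above (illustrative).
FRAME_RATE_HZ = 12.5       # one talker frame every 80 ms
SAMPLE_RATE_HZ = 24_000    # codec decoder output rate
NUM_RVQ_CODEBOOKS = 15     # local talker RVQ codebooks per frame

SAMPLES_PER_FRAME = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)  # 1920 samples per frame

def codes_per_frame() -> int:
    # 1 coarse VQ token from the talker LLM + 15 RVQ refinements per frame
    return 1 + NUM_RVQ_CODEBOOKS

def chunk_duration_s(chunk_frames: int) -> float:
    # Audio duration produced per codec-decoder call,
    # e.g. the default of 4 frames yields 0.32 s
    return chunk_frames / FRAME_RATE_HZ
```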
```
librosa
numpy
onnxruntime
python-box
soundfile
transformers==4.57.3
```
Example installation with a conda environment:

```shell
conda create --name qwen3-tts-streaming-onnx-1 python=3.12
conda activate qwen3-tts-streaming-onnx-1
pip install -r requirements.txt
```
```
.
├── test_qwen3-tts-streaming_onnx.py     # End-to-end test script
├── README.md
├── requirements.txt
├── qwen3-tts_onnx/                      # FP32
│   ├── talker_model.onnx
│   ├── talker_local_model.onnx
│   ├── codec_decoder_model.onnx
│   ├── speaker_encoder_model.onnx
│   ├── talker_codec_embed_model.onnx
│   └── text_embed_proj_model.onnx
├── configs/
│   ├── config.json                      # Talker, Local Talker, Speaker Encoder config
│   ├── speech_tokenizer_config.json     # Codec config
│   ├── preprocessor_config.json         # Text processor config
│   ├── tokenizer_config.json
│   ├── vocab.json
│   └── merges.txt
├── src/
│   ├── core/
│   │   ├── configuration_qwen3_tts.py
│   │   └── processing_qwen3_tts.py
│   ├── inference/
│   │   └── qwen3_tts_inferencer_onnx.py # Core ONNX inference engine
│   └── utils/
│       └── audio_utils.py
├── logs/
│   └── <log_synth>.txt
├── audio_ref/
│   └── <reference_speaker>.[wav|mp3|flac]
└── audio_synth/
    └── <synthesized_example>.wav
```
```shell
python -u test_qwen3-tts-streaming_onnx.py >& logs/log_test-streaming-onnx-1.txt
# Audio is automatically saved in audio_synth/ with the default parameters, text, and language.
```
```shell
python test_qwen3-tts-streaming_onnx.py \
    --talker_model_path qwen3-tts_onnx/talker_model.onnx \
    --talker_local_model_path qwen3-tts_onnx/talker_local_model.onnx \
    --codec_decoder_model_path qwen3-tts_onnx/codec_decoder_model.onnx \
    --speaker_encoder_model_path qwen3-tts_onnx/speaker_encoder_model.onnx \
    --talker_codec_embed_model_path qwen3-tts_onnx/talker_codec_embed_model.onnx \
    --text_embed_proj_model_path qwen3-tts_onnx/text_embed_proj_model.onnx \
    --model_config_path configs/config.json \
    --codec_config_path configs/speech_tokenizer_config.json \
    --backbone_config_path configs/config_backbone.json \
    --preprocessor_config_dir configs/ \
    --temperature 0.85 \
    --top_p 0.8 \
    --top_k 50 \
    --repetition_penalty 1.9 \
    --repetition_window 50 \
    --num_threads 4 \
    --chunk_frames 4 \
    --prompt_wav audio_ref/speaker.[wav|flac|mp3] \
    --out_wav output.wav \
    --text "Text to be synthesized" \
    --language "english"
```
"chinese", "english", "german", "italian", "portuguese",
"spanish", "japanese", "korean", "french", "russian"
```python
from src.inference import Qwen3TTSInferencerONNX

# Create the inferencer
inferencer = Qwen3TTSInferencerONNX(
    talker_llm, talker_local, codec_decoder,
    speaker_encoder, talker_codec_embed, text_embed_proj,
    preprocessor_config_dir, model_config, codec_config,
    audio_ref_path, language,
)
inferencer.reset_turn(reset_cache=True)

# Stream text deltas and collect audio
for delta in your_llm_stream():          # your_llm_stream() is a placeholder for your LLM's text stream
    audio_frames = inferencer.push_text(delta)
    for audio_tokens in audio_frames:
        inferencer.push_tokens(audio_tokens)
        for wav in inferencer.audio_chunks():
            yield wav
```
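A hypothetical consumer for the streamed `wav` chunks above: it appends each float32 chunk to a 16-bit PCM WAV file as it arrives. This sketch uses the standard-library `wave` module for illustration; the repository's own scripts use `soundfile`, and the function name is an assumption, not part of the API:

```python
import wave

import numpy as np

def write_stream(chunks, out_path="output.wav", sample_rate=24_000):
    """Incrementally write float32 mono chunks in [-1, 1] to a 16-bit PCM WAV file."""
    total = 0
    with wave.open(out_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)           # 16-bit PCM
        wf.setframerate(sample_rate) # 24 kHz, matching the codec decoder output
        for chunk in chunks:
            pcm = np.clip(np.asarray(chunk, dtype=np.float32), -1.0, 1.0)
            wf.writeframes((pcm * 32767.0).astype("<i2").tobytes())
            total += len(pcm)
    return total  # number of samples written
```

Writing chunk-by-chunk keeps memory flat during long syntheses, at the cost of not knowing the total length until the stream ends.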
| Argument | Type | Default | Description |
|---|---|---|---|
| `--talker_model_path` | str | `qwen3-tts_onnx/talker_model.onnx` | Path to talker LLM model |
| `--talker_local_model_path` | str | `qwen3-tts_onnx/talker_local_model.onnx` | Path to local talker transformer model |
| `--codec_decoder_model_path` | str | `qwen3-tts_onnx/codec_decoder_model.onnx` | Path to codec decoder model |
| `--speaker_encoder_model_path` | str | `qwen3-tts_onnx/speaker_encoder_model.onnx` | Path to speaker encoder model |
| `--talker_codec_embed_model_path` | str | `qwen3-tts_onnx/talker_codec_embed_model.onnx` | Path to talker codec embedding model |
| `--text_embed_proj_model_path` | str | `qwen3-tts_onnx/text_embed_proj_model.onnx` | Path to text embedding and projection model |
| `--preprocessor_config_dir` | str | `configs/` | Directory containing the configuration files for the Qwen3 text tokenizer |
| `--model_config_path` | str | `configs/config.json` | Path to the original model configuration file for Qwen3-TTS-12Hz-0.6B-Base |
| `--codec_config_path` | str | `configs/speech_tokenizer_config.json` | Path to the original configuration file for the codec of Qwen3-TTS-12Hz-0.6B-Base |
| `--temperature` | float | 0.85 | Sampling temperature |
| `--top_p` | float | 0.8 | Nucleus sampling threshold |
| `--top_k` | int | 50 | Top-k sampling cutoff |
| `--repetition_penalty` | float | 1.9 | Repetition penalty coefficient |
| `--repetition_window` | int | 50 | Window for repetition penalty |
| `--delta_chunk_chars` | int | 1 | Characters per simulated LLM delta |
| `--delta_delay_s` | float | 0.0 | Delay between simulated deltas (seconds) |
| `--num_threads` | int | 4 | Number of threads set via `intra_op_num_threads` in the ONNX Runtime session options |
| `--chunk_frames` | int | 4 | Number of frames passed to the codec decoder per forward call (the default 4 frames is 0.32 s at the 12.5 Hz token rate) |
| `--prompt_wav` | str | `audio_ref/female_shadowheart.flac` | Reference speaker audio for voice cloning |
| `--out_wav` | str | `out_streaming.wav` | Output WAV file path |
| `--text` | str | (Russian text) | Text to synthesize |
| `--language` | str | `"russian"` | Language of the text to synthesize |
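The sampling arguments in the table above typically combine as temperature scaling, then repetition penalty, then top-k and top-p filtering. The sketch below shows one standard way to chain them in NumPy; it is illustrative only, and the repository's actual sampling code may differ in ordering or details:

```python
import numpy as np

def sample_token(logits, recent_tokens, temperature=0.85, top_p=0.8,
                 top_k=50, repetition_penalty=1.9, rng=None):
    """Illustrative temperature / repetition-penalty / top-k / top-p sampler."""
    rng = rng or np.random.default_rng()
    logits = logits.astype(np.float64) / temperature  # temperature scaling (copy)

    # Repetition penalty over the tokens in the recent window
    for t in set(recent_tokens):
        if logits[t] > 0:
            logits[t] /= repetition_penalty
        else:
            logits[t] *= repetition_penalty

    # Top-k: keep only the k highest logits
    if 0 < top_k < logits.size:
        cutoff = np.sort(logits)[-top_k]
        logits[logits < cutoff] = -np.inf

    # Top-p (nucleus): keep the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalize and sample
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    mask /= mask.sum()
    return int(rng.choice(probs.size, p=mask))
```

With the aggressive default `repetition_penalty` of 1.9 and a 50-token window, recently emitted codes are strongly discouraged, which helps prevent the talker from looping on the same audio tokens.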
If you use this system in your research, please cite:
```bibtex
@misc{vertoxai2026streamingspeechtranslation,
  title={Qwen3-TTS-Streaming-ONNX - VertoX-AI},
  author={Tobing, P. L., VertoX-AI},
  year={2026},
  publisher={HuggingFace},
}
```
This project is licensed under Apache-2.0, the same license as the original Qwen3-TTS.
Created by: Patrick Lumbantobing, Vertox-AI
Copyright (c) 2026 Vertox-AI. All rights reserved.
This work is licensed under the Apache License, Version 2.0.
To view a copy of this license, visit [LICENSE](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md).
Base model: Qwen/Qwen3-TTS-12Hz-0.6B-Base