Qwen3-TTS-Realtime ONNX Inference

Pure ONNX Runtime inference pipeline for Qwen3-TTS-12Hz-0.6B-Base, enabling streaming text-to-speech without PyTorch dependency at runtime.

Overview

This repository provides:

  • qwen3_tts_inferencer_onnx.py β€” Core streaming TTS engine that orchestrates six ONNX models (talker LLM, local talker transformer, codec decoder, speaker encoder, talker codec embedding, text embedding projection) using only NumPy and ONNX Runtime.
  • test_qwen3-tts-streaming_onnx.py β€” End-to-end test script that simulates LLM streaming text and produces a WAV file.

Architecture

Reference Audio ──► Speaker Encoder ──► Speaker Embedding Vector (voice clone context)
                                           β”‚
                                           β–Ό
Text Deltas ──► Talker LLM (Qwen3-0.6B) ──► [Hidden States, VQ Token]
                                                          β”‚
                                                          β–Ό
                                                Local Transformer ──► 15-codebook RVQ Tokens
                                                                            β”‚
                                                                            β–Ό
                                                 VQ + RVQ Tokens ──► Codec Decoder ──► 24 kHz Waveform
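
The timing implied by the diagram can be worked out directly; a small sketch, assuming the 12.5 Hz token rate and 24 kHz output rate stated elsewhere in this README:

```python
# Frame timing arithmetic for the streaming pipeline (illustrative;
# 12.5 Hz token rate and 24 kHz sample rate are taken from this README).
TOKEN_RATE_HZ = 12.5   # codec frames per second
SAMPLE_RATE = 24_000   # output waveform sample rate

frame_duration_s = 1.0 / TOKEN_RATE_HZ                    # 0.08 s per frame
samples_per_frame = round(SAMPLE_RATE * frame_duration_s)  # 1920 samples

chunk_frames = 4  # default --chunk_frames
chunk_duration_s = chunk_frames * frame_duration_s         # ~0.32 s
chunk_samples = chunk_frames * samples_per_frame           # 7680 samples

print(samples_per_frame, chunk_samples)
```

So with the default chunk size, each codec-decoder call emits roughly a third of a second of audio.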
| Component | ONNX Model | Description |
|---|---|---|
| Talker LLM | talker_model.onnx | Qwen3-based talker LM mapping interleaved text+audio token embeddings to hidden states and a VQ token. Maintains a growing KV-cache across the entire generation. |
| Local Talker | talker_local_model.onnx | Depth-wise decoder generating 15 RVQ codebook entries per frame from the talker hidden states and VQ token. Creates and discards a fresh KV-cache per frame. |
| Codec Decoder | codec_decoder_model.onnx | Decodes VQ+RVQ audio codes back to a 24 kHz waveform. Maintains KV-caches and convolutional caches for streaming decode. |
| Speaker Encoder | speaker_encoder_model.onnx | ECAPA-TDNN-based speaker encoder. Produces a 1024-dim speaker embedding vector for voice identity cloning. |
| Talker Codec Embed | talker_codec_embed_model.onnx | VQ embedding table for the talker model, covering a 2048-token vocabulary. |
| Text Embed Projection | text_embed_proj_model.onnx | Text embedding and projection for the talker model; the text embedding covers a 151,936-token vocabulary. |
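
The two KV-cache patterns above (a cache that grows across the whole generation for the talker LLM, versus a fresh per-frame cache for the local talker) can be illustrated with plain NumPy. The shapes and variable names below are illustrative only, not the actual ONNX tensor names used by this repo:

```python
import numpy as np

# Illustrative head count and head dimension, not the real model config.
NUM_HEADS, HEAD_DIM = 16, 128

def empty_cache():
    # (batch, heads, seq_len, head_dim) with seq_len = 0
    return np.zeros((1, NUM_HEADS, 0, HEAD_DIM), dtype=np.float32)

# Talker LLM pattern: one cache grows across the entire generation.
talker_cache = empty_cache()
for step in range(3):
    new_kv = np.random.randn(1, NUM_HEADS, 1, HEAD_DIM).astype(np.float32)
    talker_cache = np.concatenate([talker_cache, new_kv], axis=2)
print(talker_cache.shape)  # (1, 16, 3, 128) -- keeps growing

# Local talker pattern: a fresh cache per frame, discarded afterwards.
for frame in range(2):
    local_cache = empty_cache()
    for codebook in range(15):  # 15 RVQ codebooks per frame
        new_kv = np.random.randn(1, NUM_HEADS, 1, HEAD_DIM).astype(np.float32)
        local_cache = np.concatenate([local_cache, new_kv], axis=2)
    # local_cache reaches seq_len 15, then is thrown away
```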

Requirements

librosa
numpy
onnxruntime
python-box
soundfile
transformers==4.57.3

Example installation with conda env:

conda create --name qwen3-tts-streaming-onnx-1 python=3.12
conda activate qwen3-tts-streaming-onnx-1
pip install -r requirements.txt

Directory Structure

.
β”œβ”€β”€ test_qwen3-tts-streaming_onnx.py        # End-to-end test script
β”œβ”€β”€ README.md
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ qwen3-tts_onnx/  # FP32
β”‚   β”œβ”€β”€ talker_model.onnx
β”‚   β”œβ”€β”€ talker_local_model.onnx
β”‚   β”œβ”€β”€ codec_decoder_model.onnx
β”‚   β”œβ”€β”€ speaker_encoder_model.onnx
β”‚   β”œβ”€β”€ talker_codec_embed_model.onnx
β”‚   └── text_embed_proj_model.onnx
β”œβ”€β”€ configs/
β”‚   β”œβ”€β”€ config.json                         # Talker, Local Talker, Speaker Encoder config
β”‚   β”œβ”€β”€ speech_tokenizer_config.json        # Codec config
β”‚   β”œβ”€β”€ preprocessor_config.json            # Text Processor configs
β”‚   β”œβ”€β”€ tokenizer_config.json
β”‚   β”œβ”€β”€ vocab.json
β”‚   └── merges.txt
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ core/
β”‚   β”‚   β”œβ”€β”€ configuration_qwen3_tts.py
β”‚   β”‚   └── processing_qwen3_tts.py
β”‚   β”œβ”€β”€ inference/
β”‚   β”‚   └── qwen3_tts_inferencer_onnx.py    # Core ONNX inference engine 
β”‚   └── utils/
β”‚       └── audio_utils.py
β”œβ”€β”€ logs/
β”‚   └── <log_synth>.txt
β”œβ”€β”€ audio_ref/
β”‚   └── <reference_speaker>.[wav|mp3|flac]
└── audio_synth/
    └── <synthesized_example>.wav

Usage

Basic streaming TTS usage

python -u test_qwen3-tts-streaming_onnx.py >& logs/log_test-streaming-onnx-1.txt
# Audio is saved automatically in audio_synth/ using the default parameters, text, and language.

Usage with parameters

python test_qwen3-tts-streaming_onnx.py \
    --talker_model_path qwen3-tts_onnx/talker_model.onnx \
    --talker_local_model_path qwen3-tts_onnx/talker_local_model.onnx \
    --codec_decoder_model_path qwen3-tts_onnx/codec_decoder_model.onnx \
    --speaker_encoder_model_path qwen3-tts_onnx/speaker_encoder_model.onnx \
    --talker_codec_embed_model_path qwen3-tts_onnx/talker_codec_embed_model.onnx \
    --text_embed_proj_model_path qwen3-tts_onnx/text_embed_proj_model.onnx \
    --model_config_path configs/config.json \
    --codec_config_path configs/speech_tokenizer_config.json \
    --backbone_config_path configs/config_backbone.json \
    --preprocessor_config_dir configs/ \
    --temperature 0.85 \
    --top_p 0.8 \
    --top_k 50 \
    --repetition_penalty 1.9 \
    --repetition_window 50 \
    --num_threads 4 \
    --chunk_frames 4 \
    --prompt_wav audio_ref/speaker.[wav|flac|mp3] \
    --out_wav output.wav \
    --text "Text to be synthesized" \
    --language "english"

Available Languages

"chinese", "english", "german", "italian", "portuguese",
"spanish", "japanese", "korean", "french", "russian"

Programmatic Usage

from src.inference import Qwen3TTSInferencerONNX

# Create inferencer
inferencer = Qwen3TTSInferencerONNX(
    talker_llm, talker_local, codec_decoder,
    speaker_encoder, talker_codec_embed, text_embed_proj,
    preprocessor_config_dir, model_config, codec_config,
    audio_ref_path, language,
)
inferencer.reset_turn(reset_cache=True)

# Stream text and collect audio
for delta in your_llm_stream():
    audio_frames = inferencer.push_text(delta)
    ...
    for audio_tokens in audio_frames:
        ...
        inferencer.push_tokens(audio_tokens)
        for wav in inferencer.audio_chunks():
            ...
            yield wav
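
`your_llm_stream()` above stands for whatever produces text deltas in your application; the test script simulates it by slicing a fixed string. A minimal stand-in, mirroring the --delta_chunk_chars / --delta_delay_s options (function name is hypothetical):

```python
import time
from typing import Iterator

def simulate_llm_stream(text: str, chunk_chars: int = 1,
                        delay_s: float = 0.0) -> Iterator[str]:
    """Yield `text` in chunk_chars-sized deltas, like a streaming LLM.

    Illustrative stand-in for a real LLM stream; mirrors the
    --delta_chunk_chars / --delta_delay_s test-script options.
    """
    for i in range(0, len(text), chunk_chars):
        if delay_s > 0:
            time.sleep(delay_s)  # simulate LLM generation latency
        yield text[i:i + chunk_chars]

deltas = list(simulate_llm_stream("Hello, world.", chunk_chars=4))
print(deltas)  # ['Hell', 'o, w', 'orld', '.']
```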

Command-Line Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| --talker_model_path | str | "qwen3-tts_onnx/talker_model.onnx" | Path to the talker LLM model |
| --talker_local_model_path | str | "qwen3-tts_onnx/talker_local_model.onnx" | Path to the local talker transformer model |
| --codec_decoder_model_path | str | "qwen3-tts_onnx/codec_decoder_model.onnx" | Path to the codec decoder model |
| --speaker_encoder_model_path | str | "qwen3-tts_onnx/speaker_encoder_model.onnx" | Path to the speaker encoder model |
| --talker_codec_embed_model_path | str | "qwen3-tts_onnx/talker_codec_embed_model.onnx" | Path to the talker codec embedding model |
| --text_embed_proj_model_path | str | "qwen3-tts_onnx/text_embed_proj_model.onnx" | Path to the text embedding and projection model |
| --preprocessor_config_dir | str | "configs/" | Directory containing the Qwen3 text tokenizer configuration files |
| --model_config_path | str | "configs/config.json" | Path to the original Qwen3-TTS-12Hz-0.6B-Base model configuration file |
| --codec_config_path | str | "configs/speech_tokenizer_config.json" | Path to the original configuration file for the Qwen3-TTS-12Hz-0.6B-Base codec |
| --temperature | float | 0.85 | Sampling temperature |
| --top_p | float | 0.8 | Nucleus sampling threshold |
| --top_k | int | 50 | Top-k sampling cutoff |
| --repetition_penalty | float | 1.9 | Repetition penalty coefficient |
| --repetition_window | int | 50 | Window for the repetition penalty |
| --delta_chunk_chars | int | 1 | Characters per simulated LLM delta |
| --delta_delay_s | float | 0.0 | Delay between simulated deltas (seconds) |
| --num_threads | int | 4 | Thread count passed to intra_op_num_threads in the ONNX Runtime session options |
| --chunk_frames | int | 4 | Frames passed to the codec decoder per forward call (default 4 frames = 0.32 s at the 12.5 Hz token rate) |
| --prompt_wav | str | audio_ref/female_shadowheart.flac | Reference speaker audio for voice cloning |
| --out_wav | str | out_streaming.wav | Output WAV file path |
| --text | str | (Russian text) | Text to synthesize |
| --language | str | "russian" | Language of the text to synthesize |
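
The sampling options (temperature, top-k, top-p, and a repetition penalty over a sliding window) compose in the usual way for autoregressive decoding. A plain-NumPy sketch of one decoding step under the common conventions for these knobs; the repo's exact implementation may differ:

```python
import numpy as np

def sample_next_token(logits, history, temperature=0.85, top_k=50,
                      top_p=0.8, repetition_penalty=1.9,
                      repetition_window=50, rng=None):
    """One sampling step. A sketch of the usual semantics of these
    parameters, not necessarily this repo's exact implementation."""
    logits = logits.astype(np.float64).copy()
    # Repetition penalty over the most recent `repetition_window` tokens.
    for tok in set(history[-repetition_window:]):
        logits[tok] = (logits[tok] / repetition_penalty if logits[tok] > 0
                       else logits[tok] * repetition_penalty)
    logits = logits / temperature
    # Top-k: keep only the k highest logits.
    if 0 < top_k < logits.size:
        kth = np.partition(logits, -top_k)[-top_k]
        logits[logits < kth] = -np.inf
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p (nucleus): smallest prefix of sorted probs with mass >= top_p.
    order = np.argsort(-probs)
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]
    p = probs[keep] / probs[keep].sum()
    if rng is None:
        rng = np.random.default_rng(0)
    return int(keep[rng.choice(len(keep), p=p)])

logits = np.full(2048, -5.0)       # 2048 = talker VQ vocabulary size
logits[5], logits[6] = 10.0, 7.0
print(sample_next_token(logits, history=[]))   # token 5 dominates -> 5
print(sample_next_token(logits, history=[5]))  # 5 is penalized   -> 6
```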


Citation

If you use this system in your research, please cite:

@misc{vertoxai2026streamingspeechtranslation,
  title={Qwen3-TTS-Streaming-ONNX β€” VertoX-AI},
  author={Tobing, P. L., VertoX-AI},
  year={2026},
  publisher={HuggingFace},
}

License

This project is licensed under Apache-2.0, the same license as the original Qwen3-TTS.

Created by: Patrick Lumbantobing, VertoX-AI
Copyright (c) 2026 VertoX-AI. All rights reserved.

This work is licensed under the Apache License, Version 2.0.
To view a copy of this license, visit [LICENSE](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md).
