Qwen3-TTS 0.6B Base: Voice Clone (ONNX)
ONNX export of Qwen/Qwen3-TTS-12Hz-0.6B-Base for voice cloning with ONNX Runtime. No PyTorch required at inference time.
Provide a short reference audio clip and its transcript, and the model synthesizes new text in the same voice using In-Context Learning (ICL).
Both FP32 and INT4 (weight-only, RTN) variants are included.
Exported and maintained by WaveKat as part of the wavekat-tts voice pipeline.
Quick Start
pip install -r requirements.txt
# FP32
python generate_clone_onnx.py \
--ref-audio ref.wav --ref-text "Transcript of the reference audio." \
--text "New text to synthesize in the cloned voice." \
-o output_fp32.wav
# INT4 (~4x smaller, faster)
python generate_clone_onnx.py --variant int4 \
--ref-audio ref.wav --ref-text "Transcript of the reference audio." \
--text "New text to synthesize in the cloned voice." \
-o output_int4.wav
Reference audio should be a clear, single-speaker recording, ideally a mono 24 kHz WAV (3 to 10 seconds works well). If the file uses a different sample rate, the script resamples it automatically via librosa.
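To check a clip against these recommendations before running the script, something like the following could be used. It is a minimal sketch, not part of this repository: the function name `inspect_reference_wav` and its thresholds are illustrative assumptions, and Python's standard `wave` module only reads PCM WAV files.

```python
import wave


def inspect_reference_wav(path, target_sr=24000, min_s=3.0, max_s=10.0):
    """Return a list of warnings for a PCM WAV reference clip (hypothetical helper)."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        sample_rate = wf.getframerate()
        duration = wf.getnframes() / float(sample_rate)

    warnings = []
    if channels != 1:
        warnings.append(f"expected mono, found {channels} channels")
    if sample_rate != target_sr:
        # The generation script is described as resampling automatically,
        # so this is informational rather than an error.
        warnings.append(f"sample rate is {sample_rate} Hz, not {target_sr} Hz")
    if not (min_s <= duration <= max_s):
        warnings.append(f"duration is {duration:.1f} s, outside {min_s:.0f} to {max_s:.0f} s")
    return warnings
```

An empty list means the clip already matches the recommended format.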
Model Architecture
Qwen3-TTS 0.6B Base uses a 6-model voice-clone pipeline:
Ref audio --> [Speaker Encoder] ECAPA-TDNN -> 1024-d speaker embedding
\-> [Tokenizer Encoder] Mimi encoder -> 16-group ref codes (12 Hz)
Text + Ref text + Speaker embed + Ref codes
|
v (ICL prefill)
[Talker LM] 28 layers, 1024 hidden
predicts codebook group 0
|
v
[Code Predictor] 5 layers, 1024 hidden
predicts groups 1-15
|
v
[Vocoder] single forward pass
concat(ref_codes, gen_codes) -> 24 kHz waveform -> trim ref portion
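The rates in the diagram imply some simple shape arithmetic, sketched below. The helper names are hypothetical, and the 12 Hz frame rate is the nominal figure quoted in this document; if the codec's actual rate differs (for example 12.5 Hz), pass that value instead. Rounding the frame count up is also an assumption.

```python
import math


def codec_shape(duration_s, frame_rate_hz=12.0, num_groups=16):
    """Approximate (frames, groups) shape of the codec codes for a clip."""
    frames = math.ceil(duration_s * frame_rate_hz)
    return frames, num_groups


def samples_per_frame(sample_rate=24000, frame_rate_hz=12.0):
    """Number of output waveform samples covered by one codec frame."""
    return sample_rate / frame_rate_hz


# A 5-second reference clip at the nominal rate:
frames, groups = codec_shape(5.0)
print(frames, groups, samples_per_frame())
```

Under these assumptions, each generated frame costs one talker step (group 0) plus one code predictor call (groups 1-15).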
The pipeline is split into 6 ONNX models:
| Model | Description | Precision |
|---|---|---|
| `speaker_encoder.onnx` | ECAPA-TDNN: mel -> 1024-d speaker embedding | FP32 only |
| `tokenizer_encoder.onnx` | Mimi encoder: audio -> 16-group codec codes | FP32 only |
| `talker_prefill.onnx` | Full sequence prefill with KV cache output | FP32 / INT4 |
| `talker_decode.onnx` | Single-step decode with KV cache | FP32 / INT4 |
| `code_predictor.onnx` | Predict codebook groups 1-15 | FP32 / INT4 |
| `vocoder.onnx` | Codes to 24 kHz waveform | FP32 / INT4 |
The speaker encoder and tokenizer encoder are always FP32: they are small and run only once per request.
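For readers unfamiliar with weight-only round-to-nearest (RTN) quantization, the idea behind the INT4 variants can be sketched in plain Python. This is an illustration of the general technique only: the symmetric scheme, the block size of 32, and the function names are assumptions, and the actual export may use different settings (for example asymmetric quantization with zero points).

```python
def rtn_quantize_block(weights, bits=4):
    """Symmetric round-to-nearest quantization of one block of weights."""
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (bits - 1))    # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    # One floating-point scale per block; fall back to 1.0 for an all-zero block.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def rtn_dequantize_block(codes, scale):
    """Recover approximate weights from integer codes and the block scale."""
    return [c * scale for c in codes]


def rtn_quantize(weights, block_size=32, bits=4):
    """Quantize a flat weight list block by block (block_size is an assumption)."""
    return [
        rtn_quantize_block(weights[start:start + block_size], bits)
        for start in range(0, len(weights), block_size)
    ]
```

Only the weights are stored as 4-bit codes plus one scale per block; activations stay in floating point, which is why the scheme is called weight-only. No calibration data is needed, since each weight is simply rounded to the nearest representable level.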
Repository Structure
.
├── config.json                # Model config (dimensions, token IDs, language map)
├── speaker_encoder.onnx       # ECAPA-TDNN speaker encoder (FP32)
├── tokenizer_encoder.onnx     # Mimi speech tokenizer encoder (FP32)
├── tokenizer/                 # Text tokenizer (vocab, merges)
├── embeddings/                # Pre-extracted embedding weights (.npy)
├── fp32/                      # FP32 ONNX models
│   ├── talker_prefill.onnx
│   ├── talker_decode.onnx
│   ├── code_predictor.onnx
│   └── vocoder.onnx
├── int4/                      # INT4 weight-only quantized models
│   ├── talker_prefill.onnx
│   ├── talker_decode.onnx
│   ├── code_predictor.onnx
│   └── vocoder.onnx
├── generate_clone_onnx.py     # Reference ONNX-only voice clone script
└── requirements.txt           # Inference dependencies
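Given the layout above, a custom loader needs to pick the two shared encoders from the repository root and the remaining four models from the chosen variant directory. A minimal sketch follows; `resolve_model_paths` is a hypothetical helper, not a function provided by this repository.

```python
from pathlib import Path

# Encoders exist only as FP32 files at the repository root.
SHARED_MODELS = ("speaker_encoder", "tokenizer_encoder")
# These four models exist once per precision variant.
VARIANT_MODELS = ("talker_prefill", "talker_decode", "code_predictor", "vocoder")


def resolve_model_paths(repo_root, variant="fp32"):
    """Map each of the six model names to its .onnx path for a variant."""
    if variant not in ("fp32", "int4"):
        raise ValueError(f"unknown variant: {variant!r}")
    root = Path(repo_root)
    paths = {name: root / f"{name}.onnx" for name in SHARED_MODELS}
    paths.update({name: root / variant / f"{name}.onnx" for name in VARIANT_MODELS})
    return paths
```

Each path can then be passed to `onnxruntime.InferenceSession(str(path), providers=["CPUExecutionProvider"])`. The input and output names of each model are not documented here, so inspect them with `session.get_inputs()` and `session.get_outputs()` rather than assuming them.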
How It Works
- Speaker encoding: the reference audio is converted to a log-mel spectrogram and passed through an ECAPA-TDNN encoder to produce a 1024-d speaker embedding.
- Reference code extraction: the same audio is encoded by a Mimi tokenizer encoder into 16-group discrete codes at 12 Hz.
- ICL prefill: the talker LM is prefilled with an interleaved sequence: text embeddings (ref transcript + target text) paired with codec embeddings (speaker embed + reference codes).
- Autoregressive decode: the talker generates group-0 codec tokens, and the code predictor fills in groups 1-15 per frame.
- Vocoder: reference codes are prepended to generated codes, the vocoder decodes the combined sequence, and the leading reference portion is trimmed proportionally.
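The control flow of the last two steps can be sketched with the models replaced by plain callables. This is a simplified illustration under stated assumptions, not the logic of `generate_clone_onnx.py`: the names `next_group0`, `predict_rest`, `eos_id`, and `max_frames` are hypothetical, and treating end-of-speech as a special group-0 token is an assumption.

```python
def generate_codes(next_group0, predict_rest, eos_id, max_frames=600):
    """Autoregressive decode: build one 16-group frame per step until EOS."""
    frames = []
    for _ in range(max_frames):
        g0 = next_group0(frames)         # stands in for the talker (group 0)
        if g0 == eos_id:
            break
        rest = predict_rest(g0, frames)  # stands in for the code predictor (groups 1-15)
        frames.append([g0] + list(rest))
    return frames


def trim_reference(waveform, ref_frames, total_frames):
    """Drop the leading share of samples that corresponds to the reference codes."""
    cut = int(ref_frames / max(total_frames, 1) * len(waveform))
    return waveform[cut:]


# Toy stand-ins: emit three frames, then the hypothetical EOS token 99.
toy_frames = generate_codes(
    next_group0=lambda frames: len(frames) if len(frames) < 3 else 99,
    predict_rest=lambda g0, frames: [g0] * 15,
    eos_id=99,
)
# Pretend the vocoder returned 100 samples for 25 reference + 25 generated frames.
audio = trim_reference(list(range(100)), ref_frames=25, total_frames=50)
```

The trim is proportional because the vocoder output length scales with the number of frames, so the reference share of the frames equals the reference share of the samples.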
Supported Languages
English, Chinese, Japanese, Korean, German, French, Spanish, Italian, Portuguese, Russian.
Reproducing the Export
The export scripts are in the wavekat-tts repository:
cd tools/qwen3-tts-onnx
pip install -r requirements.txt
# Export FP32, quantize INT4, and package for HF
make clone-all
About WaveKat
WaveKat builds open-source voice pipeline components in Rust. This ONNX export is maintained as part of wavekat-tts, which provides unified TTS inference across multiple backends.
Acknowledgements
- Qwen3-TTS by the Qwen team at Alibaba Cloud