WaveKat TTS

Qwen3-TTS 0.6B Base β€” Voice Clone (ONNX)

ONNX export of Qwen/Qwen3-TTS-12Hz-0.6B-Base for voice cloning with ONNX Runtime. No PyTorch required at inference time.

Provide a short reference audio clip and its transcript, and the model synthesizes new text in the same voice using In-Context Learning (ICL).

Both FP32 and INT4 (weight-only, RTN) variants are included.

Exported and maintained by WaveKat as part of the wavekat-tts voice pipeline.

Quick Start

pip install -r requirements.txt

# FP32
python generate_clone_onnx.py \
  --ref-audio ref.wav --ref-text "Transcript of the reference audio." \
  --text "New text to synthesize in the cloned voice." \
  -o output_fp32.wav

# INT4 (~4x smaller, faster)
python generate_clone_onnx.py --variant int4 \
  --ref-audio ref.wav --ref-text "Transcript of the reference audio." \
  --text "New text to synthesize in the cloned voice." \
  -o output_int4.wav

Reference audio should be mono 24 kHz WAV with a clear, single-speaker recording (3–10 seconds works well). The script will resample automatically via librosa if needed.

Model Architecture

Qwen3-TTS 0.6B Base uses a 6-model voice-clone pipeline:

Ref audio --> [Speaker Encoder]      ECAPA-TDNN β†’ 1024-d speaker embedding
         \-> [Tokenizer Encoder]     Mimi encoder β†’ 16-group ref codes (12 Hz)

Text + Ref text + Speaker embed + Ref codes
             |
             v  (ICL prefill)
     [Talker LM]           28 layers, 1024 hidden
     predicts codebook group 0
             |
             v
     [Code Predictor]      5 layers, 1024 hidden
     predicts groups 1-15
             |
             v
     [Vocoder]             single forward pass
     concat(ref_codes, gen_codes) β†’ 24 kHz waveform β†’ trim ref portion

The pipeline is split into 6 ONNX models:

Model Description Precision
speaker_encoder.onnx ECAPA-TDNN: mel β†’ 1024-d speaker embedding FP32 only
tokenizer_encoder.onnx Mimi encoder: audio β†’ 16-group codec codes FP32 only
talker_prefill.onnx Full sequence prefill with KV cache output FP32 / INT4
talker_decode.onnx Single-step decode with KV cache FP32 / INT4
code_predictor.onnx Predict codebook groups 1-15 FP32 / INT4
vocoder.onnx Codes to 24 kHz waveform FP32 / INT4

Speaker encoder and tokenizer encoder are always FP32 β€” they run once per request and are small.

Repository Structure

.
β”œβ”€β”€ config.json                # Model config (dimensions, token IDs, language map)
β”œβ”€β”€ speaker_encoder.onnx       # ECAPA-TDNN speaker encoder (FP32)
β”œβ”€β”€ tokenizer_encoder.onnx     # Mimi speech tokenizer encoder (FP32)
β”œβ”€β”€ tokenizer/                 # Text tokenizer (vocab, merges)
β”œβ”€β”€ embeddings/                # Pre-extracted embedding weights (.npy)
β”œβ”€β”€ fp32/                      # FP32 ONNX models
β”‚   β”œβ”€β”€ talker_prefill.onnx
β”‚   β”œβ”€β”€ talker_decode.onnx
β”‚   β”œβ”€β”€ code_predictor.onnx
β”‚   └── vocoder.onnx
β”œβ”€β”€ int4/                      # INT4 weight-only quantized models
β”‚   β”œβ”€β”€ talker_prefill.onnx
β”‚   β”œβ”€β”€ talker_decode.onnx
β”‚   β”œβ”€β”€ code_predictor.onnx
β”‚   └── vocoder.onnx
β”œβ”€β”€ generate_clone_onnx.py     # Reference ONNX-only voice clone script
└── requirements.txt           # Inference dependencies

How It Works

  1. Speaker encoding β€” the reference audio is converted to a log-mel spectrogram and passed through an ECAPA-TDNN encoder to produce a 1024-d speaker embedding.
  2. Reference code extraction β€” the same audio is encoded by a Mimi tokenizer encoder into 16-group discrete codes at 12 Hz.
  3. ICL prefill β€” the talker LM is prefilled with an interleaved sequence: text embeddings (ref transcript + target text) paired with codec embeddings (speaker embed + reference codes).
  4. Autoregressive decode β€” the talker generates group-0 codec tokens, and the code predictor fills in groups 1-15 per frame.
  5. Vocoder β€” reference codes are prepended to generated codes, the vocoder decodes the combined sequence, and the leading reference portion is trimmed proportionally.

Supported Languages

English, Chinese, Japanese, Korean, German, French, Spanish, Italian, Portuguese, Russian.

Reproducing the Export

The export scripts are in the wavekat-tts repository:

cd tools/qwen3-tts-onnx
pip install -r requirements.txt

# Export FP32, quantize INT4, and package for HF
make clone-all

About WaveKat

WaveKat builds open-source voice pipeline components in Rust. This ONNX export is maintained as part of wavekat-tts, which provides unified TTS inference across multiple backends.

Acknowledgements

  • Qwen3-TTS by the Qwen team at Alibaba Cloud
Downloads last month
34
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for wavekat/Qwen3-TTS-0.6B-Base-ONNX

Quantized
(14)
this model