# XTTSv2 Streaming ONNX

Streaming text-to-speech inference for XTTSv2 using ONNX Runtime, with no PyTorch required.
This repository provides a complete, CPU-friendly, streaming TTS pipeline built on ONNX-exported XTTSv2 models. It replaces the original PyTorch inference path with pure Python/NumPy logic while preserving full compatibility with the XTTSv2 architecture.
## Features

- Zero-shot voice cloning from a short (≤ 6 s) reference audio clip.
- Streaming audio output: audio chunks are yielded as they are generated, enabling low-latency playback.
- Pure ONNX Runtime + NumPy, with no PyTorch dependency at inference time.
- INT8-quantised GPT model option for a reduced memory footprint and faster CPU inference.
- Cross-fade chunk stitching for seamless audio across vocoder boundaries.
- Speed control via linear interpolation of GPT latents.
- Multilingual support for 17 languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu), Korean (ko), Hindi (hi).
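Speed control via latent interpolation can be sketched in NumPy as follows. This is an illustrative standalone version, not the pipeline's exact `time_scale_gpt_latents_numpy()` implementation; the function name `time_scale_latents` is hypothetical:

```python
import numpy as np

def time_scale_latents(latents: np.ndarray, speed: float) -> np.ndarray:
    """Linearly interpolate GPT latents [1, T, 1024] along the time axis.

    speed > 1.0 shortens the latent sequence (faster speech);
    speed < 1.0 lengthens it (slower speech).
    """
    _, T, dim = latents.shape
    new_T = max(1, int(round(T / speed)))
    # Fractional source positions in the original time axis, one per output frame
    src_pos = np.linspace(0, T - 1, new_T)
    grid = np.arange(T)
    out = np.empty((1, new_T, dim), dtype=latents.dtype)
    for d in range(dim):  # per-channel 1-D linear interpolation
        out[0, :, d] = np.interp(src_pos, grid, latents[0, :, d])
    return out

latents = np.random.randn(1, 100, 1024).astype(np.float32)
fast = time_scale_latents(latents, speed=2.0)   # ~50 frames
slow = time_scale_latents(latents, speed=0.5)   # ~200 frames
```

Because the scaling happens on the 1024-dim GPT latents before the vocoder, the pitch of the voice is preserved while the duration changes.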
## Architecture Overview
XTTSv2 is composed of four main neural network components, each exported as a separate ONNX model:
| Component | ONNX File | Description |
|---|---|---|
| Conditioning Encoder | `conditioning_encoder.onnx` | Six 16-head attention layers + Perceiver Resampler. Compresses a reference mel-spectrogram into 32 × 1024 conditioning latents. |
| Speaker Encoder | `speaker_encoder.onnx` | H/ASP speaker verification network. Extracts a 512-dim speaker embedding from 16 kHz audio. |
| GPT-2 Decoder | `gpt_model.onnx` / `gpt_model_int8.onnx` | 30-layer, 1024-dim decoder-only transformer with KV-cache. Autoregressively predicts VQ-VAE audio codes conditioned on text tokens and conditioning latents. |
| HiFi-GAN Vocoder | `hifigan_vocoder.onnx` | 26M-parameter neural vocoder. Converts GPT-2 hidden states + speaker embedding into a 24 kHz waveform. |

Pre-exported embedding tables (text, mel, positional) are stored as `.npy` files in the `embeddings/` directory.
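These tables are consumed as plain lookup matrices when building the GPT prefix. A minimal sketch of the usual GPT-2 style gather, using random stand-ins for the real `.npy` files so the snippet runs on its own (the helper `embed_text` is hypothetical, not a function from this repository):

```python
import numpy as np

# Shapes as listed in the repository tree; random stand-ins for illustration.
# With the real assets you would instead load from disk, e.g.
#   text_embedding = np.load("xtts_onnx/embeddings/text_embedding.npy")
text_embedding = np.random.randn(6681, 1024).astype(np.float32)
text_pos_embedding = np.random.randn(404, 1024).astype(np.float32)

def embed_text(token_ids: np.ndarray) -> np.ndarray:
    """Token embedding + positional embedding, GPT-2 style."""
    tok = text_embedding[token_ids]             # [T, 1024] row gather
    pos = text_pos_embedding[: len(token_ids)]  # [T, 1024] first T positions
    return (tok + pos)[None]                    # [1, T, 1024] batch axis

ids = np.array([5, 42, 7])
prefix = embed_text(ids)  # part of the GPT decoder's input prefix
```

Keeping the tables as `.npy` files lets the Python side do the embedding lookups, so the exported GPT ONNX graph can take dense `[1, T, 1024]` inputs directly.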
```text
┌──────────────┐  mel @ 22 kHz    ┌──────────────────────┐
│  Reference   │ ────────────────►│ Conditioning Encoder │───► cond_latents [1,32,1024]
│  Audio Clip  │                  └──────────────────────┘
│              │  audio @ 16 kHz  ┌──────────────────────┐
│              │ ────────────────►│   Speaker Encoder    │───► speaker_emb [1,512,1]
└──────────────┘                  └──────────────────────┘

┌──────────┐  BPE tokens  ┌────────────────────────────────────────────┐
│   Text   │ ────────────►│ GPT-2 Decoder (autoregressive + KV-cache)  │───► latents [1,T,1024]
└──────────┘              │ prefix = [cond | text+pos | start_mel]     │
                          └────────────────────────────────────────────┘
                                              │
                                              ▼
                                  ┌──────────────────────┐
                                  │  HiFi-GAN Vocoder    │───► waveform @ 24 kHz
                                  │   (+ speaker_emb)    │
                                  └──────────────────────┘
```
## Repository Structure

```text
.
├── README.md                      # This file
├── requirements.txt               # Python dependencies
├── xtts_streaming_pipeline.py     # Top-level streaming TTS pipeline
├── xtts_onnx_orchestrator.py      # Low-level ONNX AR loop orchestrator
├── xtts_tokenizer.py              # BPE tokenizer wrapper
├── zh_num2words.py                # Chinese number-to-words utility
├── xtts_onnx/                     # ONNX models & assets
│   ├── metadata.json              # Model architecture metadata
│   ├── vocab.json                 # BPE vocabulary
│   ├── mel_stats.npy              # Per-channel mel normalisation stats
│   ├── conditioning_encoder.onnx  # Conditioning encoder
│   ├── speaker_encoder.onnx       # H/ASP speaker encoder
│   ├── gpt_model.onnx             # GPT-2 decoder (FP32)
│   ├── gpt_model_int8.onnx        # GPT-2 decoder (INT8 quantised)
│   ├── hifigan_vocoder.onnx       # HiFi-GAN vocoder
│   └── embeddings/                # Pre-exported embedding tables
│       ├── mel_embedding.npy      # [1026, 1024] audio code embeddings
│       ├── mel_pos_embedding.npy  # [608, 1024] mel positional embeddings
│       ├── text_embedding.npy     # [6681, 1024] BPE text embeddings
│       └── text_pos_embedding.npy # [404, 1024] text positional embeddings
├── audio_ref/                     # Reference audio clips for voice cloning
└── audio_synth/                   # Directory for synthesised output
```
## Installation

### Prerequisites

- Python ≥ 3.10
- A C compiler may be needed for some dependencies (e.g. `tokenizers`).

### Install dependencies

```bash
pip install -r requirements.txt
```

### Clone from Hugging Face Hub

```bash
# Install Git LFS (required for large model files)
git lfs install

# Clone the repository
git clone https://huggingface.co/pltobing/XTTSv2-Streaming-ONNX
cd XTTSv2-Streaming-ONNX
```
## Quick Start

### Streaming TTS (command line)

```bash
python -u xtts_streaming_pipeline.py \
    --model_dir xtts_onnx/ \
    --vocab_path xtts_onnx/vocab.json \
    --mel_norms_path xtts_onnx/mel_stats.npy \
    --ref_audio audio_ref/male_stewie.mp3 \
    --language en \
    --output output_streaming.wav
```
### Python API

```python
import numpy as np
import soundfile as sf

from xtts_streaming_pipeline import StreamingTTSPipeline

# Initialise the pipeline
pipeline = StreamingTTSPipeline(
    model_dir="xtts_onnx/",
    vocab_path="xtts_onnx/vocab.json",
    mel_norms_path="xtts_onnx/mel_stats.npy",
    use_int8_gpt=True,   # Use INT8-quantised GPT for faster CPU inference
    num_threads_gpt=4,   # Adjust to your CPU core count
)

# Compute speaker conditioning (one-time per speaker)
gpt_cond_latent, speaker_embedding = pipeline.get_conditioning_latents(
    "audio_ref/male_stewie.mp3"
)

# Stream synthesis
all_chunks = []
for audio_chunk in pipeline.inference_stream(
    text="Hello, this is a streaming text-to-speech demo.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    stream_chunk_size=20,  # AR tokens per vocoder call
    speed=1.0,             # 1.0 = normal speed
):
    all_chunks.append(audio_chunk)
    # In a real application, you would play or stream each chunk here.

# Concatenate all chunks into a single waveform and save to file
full_audio = np.concatenate(all_chunks, axis=0)
sf.write("output.wav", full_audio, 24000)
```
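Playback devices and network sinks often expect 16-bit PCM rather than float samples. Assuming each yielded chunk is a float array in [-1, 1] (the usual convention; check the pipeline's actual dtype), per-chunk conversion can be sketched as:

```python
import numpy as np

def to_pcm16(chunk: np.ndarray) -> bytes:
    """Convert a float waveform in [-1, 1] to little-endian 16-bit PCM bytes."""
    clipped = np.clip(chunk, -1.0, 1.0)          # guard against overshoot
    return (clipped * 32767.0).astype("<i2").tobytes()

chunk = np.array([0.0, 0.5, -1.5], dtype=np.float32)  # -1.5 gets clipped
pcm = to_pcm16(chunk)                                  # 3 samples -> 6 bytes
```

Calling `to_pcm16(audio_chunk)` inside the streaming loop gives you bytes ready to hand to an audio device or socket without waiting for the full utterance.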
## Configuration

### SamplingConfig

Control the autoregressive token sampling behaviour:

| Parameter | Default | Description |
|---|---|---|
| `temperature` | 0.75 | Softmax temperature. Lower = more deterministic. |
| `top_k` | 50 | Keep only the top-k most probable tokens. |
| `top_p` | 0.85 | Nucleus sampling cumulative probability threshold. |
| `repetition_penalty` | 10.0 | Penalise previously generated tokens. |
| `do_sample` | True | `True` = multinomial sampling; `False` = greedy argmax. |
```python
from xtts_onnx_orchestrator import SamplingConfig

sampling = SamplingConfig(
    temperature=0.65,
    top_k=30,
    top_p=0.90,
    repetition_penalty=10.0,
    do_sample=True,
)

for chunk in pipeline.inference_stream(text, "en", cond, spk, sampling=sampling):
    ...
```
### GPTConfig

Model architecture parameters are loaded automatically from `metadata.json`. Key fields:

| Parameter | Value | Description |
|---|---|---|
| `n_layer` | 30 | Number of GPT-2 transformer layers |
| `embed_dim` | 1024 | Hidden dimension |
| `num_heads` | 16 | Number of attention heads |
| `head_dim` | 64 | Per-head dimension |
| `num_audio_tokens` | 1026 | Audio vocabulary (1024 VQ codes + start + stop) |
| `perceiver_output_len` | 32 | Conditioning latent sequence length |
| `max_gen_mel_tokens` | 605 | Maximum generated audio tokens |
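These numbers let you estimate the FP32 KV-cache footprint. A back-of-the-envelope sketch, assuming the standard layout of one key and one value tensor of shape `[num_heads, seq_len, head_dim]` per layer (the exact cache layout in the exported graph may differ):

```python
n_layer, num_heads, head_dim = 30, 16, 64
# Illustrative worst-case prefix: cond latents + max text positions + max mel tokens
seq_len = 32 + 404 + 605
bytes_fp32 = 4

# 2 tensors (K and V) per layer, each [num_heads, seq_len, head_dim]
kv_bytes = 2 * n_layer * num_heads * seq_len * head_dim * bytes_fp32
print(f"KV-cache @ seq_len={seq_len}: {kv_bytes / 1e6:.0f} MB")  # ~256 MB
```

This is why the KV-cache, not the weights, can dominate per-stream memory when serving many concurrent utterances.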
## Module Reference

### xtts_streaming_pipeline.py

Top-level streaming pipeline.

| Class / Function | Description |
|---|---|
| `StreamingTTSPipeline` | Main pipeline class. Owns sessions, tokenizer, orchestrator. |
| `StreamingTTSPipeline.get_conditioning_latents()` | Extract GPT conditioning + speaker embedding from reference audio. |
| `StreamingTTSPipeline.inference_stream()` | Generator that yields audio chunks for a text segment. |
| `StreamingTTSPipeline.time_scale_gpt_latents_numpy()` | Linearly time-scale GPT latents for speed control. |
| `wav_to_mel_cloning_numpy()` | Compute normalised log-mel spectrogram (NumPy, 22 kHz). |
| `crossfade_chunks()` | Cross-fade consecutive vocoder waveform chunks. |
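Chunk stitching can be illustrated with a linear cross-fade. This is a simplified stand-in for `crossfade_chunks()`; the actual fade shape and overlap length used by the pipeline may differ:

```python
import numpy as np

def crossfade(prev_tail: np.ndarray, next_chunk: np.ndarray, overlap: int) -> np.ndarray:
    """Linearly cross-fade the end of prev_tail into the start of next_chunk."""
    fade = np.linspace(0.0, 1.0, overlap, dtype=np.float32)
    # Ramp the old chunk down while the new chunk ramps up
    mixed = prev_tail[-overlap:] * (1.0 - fade) + next_chunk[:overlap] * fade
    return np.concatenate([prev_tail[:-overlap], mixed, next_chunk[overlap:]])

a = np.ones(100, dtype=np.float32)
b = np.zeros(100, dtype=np.float32)
out = crossfade(a, b, overlap=20)  # 180 samples, smooth transition from 1 to 0
```

Blending the overlapping region hides the phase discontinuities that would otherwise click at each vocoder-call boundary.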
### xtts_onnx_orchestrator.py

Low-level ONNX autoregressive loop.

| Class / Function | Description |
|---|---|
| `ONNXSessionManager` | Loads and manages all ONNX sessions + embedding tables. |
| `XTTSOrchestratorONNX` | Drives the GPT-2 AR loop with KV-cache and logits processing. |
| `GPTConfig` | Model architecture hyper-parameters (from `metadata.json`). |
| `SamplingConfig` | Token sampling hyper-parameters. |
| `apply_repetition_penalty()` | NumPy repetition penalty on logits. |
| `apply_temperature()` | Temperature scaling on logits. |
| `apply_top_k()` | Top-k filtering on logits. |
| `apply_top_p()` | Nucleus (top-p) filtering on logits. |
| `numpy_softmax()` | Numerically stable softmax in NumPy. |
| `numpy_multinomial()` | Inverse-CDF multinomial sampling. |
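The logits-processing helpers compose into a standard sampling step. A condensed NumPy sketch of top-k filtering, stable softmax, and inverse-CDF sampling (simplified illustrations, not the module's exact implementations):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax: subtract the max before exponentiating."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def top_k_filter(logits: np.ndarray, k: int) -> np.ndarray:
    """Mask everything outside the k largest logits with -inf."""
    out = np.full_like(logits, -np.inf)
    idx = np.argpartition(logits, -k)[-k:]
    out[idx] = logits[idx]
    return out

def sample(logits: np.ndarray, k: int, rng: np.random.Generator) -> int:
    """Top-k filter, softmax, then inverse-CDF multinomial draw."""
    probs = softmax(top_k_filter(logits, k))
    return int(np.searchsorted(np.cumsum(probs), rng.random()))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.1, -5.0])
token = sample(logits, k=2, rng=rng)  # only tokens 0 or 1 can be drawn
```

Masking with `-inf` before the softmax drives the excluded tokens' probabilities to exactly zero, so the cumulative-sum draw can never select them.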
Performance Notes
stream_chunk_sizecontrols the latencyβquality trade-off: smaller values yield audio sooner but run the vocoder more often (on all accumulated latents).- Thread count (
num_threads_gpt) should be tuned to your CPU. Start with the number of physical cores. - First call to
get_conditioning_latents()is an expensive step (resampling + mel computation + encoder inference). Cache the results for repeated synthesis with the same speaker.
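Caching the conditioning outputs is straightforward with NumPy's archive format. A sketch, where the file path and key names are arbitrary and random stand-ins replace the real `get_conditioning_latents()` outputs so it runs on its own:

```python
import os
import tempfile

import numpy as np

# Illustrative stand-ins for the outputs of get_conditioning_latents()
gpt_cond_latent = np.random.randn(1, 32, 1024).astype(np.float32)
speaker_embedding = np.random.randn(1, 512, 1).astype(np.float32)

path = os.path.join(tempfile.gettempdir(), "speaker_cache.npz")

# Cache once per speaker...
np.savez(path, cond=gpt_cond_latent, spk=speaker_embedding)

# ...then reload on subsequent runs, skipping the expensive encoder pass
cache = np.load(path)
gpt_cond_latent, speaker_embedding = cache["cond"], cache["spk"]
```

Since the latents are plain arrays, the same cache can also live in Redis or an object store for multi-process serving.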
## License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Copyright 2025 Patrick Lumbantobing, Vertox-AI
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
## Acknowledgements
- Coqui AI for the original XTTSv2 model and training recipe.
- XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model (Casanova et al., 2024).
- ONNX Runtime for high-performance cross-platform inference.
## Base Model

This model is derived from coqui/XTTS-v2.