XTTSv2 Streaming ONNX

Streaming text-to-speech inference for XTTSv2 using ONNX Runtime β€” no PyTorch required.

This repository provides a complete, CPU-friendly, streaming TTS pipeline built on ONNX-exported XTTSv2 models. It replaces the original PyTorch inference path with pure Python/NumPy logic while preserving full compatibility with the XTTSv2 architecture.


Features

  • Zero-shot voice cloning from a short (≀ 6 s) reference audio clip.
  • Streaming audio output β€” audio chunks are yielded as they are generated, enabling low-latency playback.
  • Pure ONNX Runtime + NumPy β€” no PyTorch dependency at inference time.
  • INT8-quantised GPT model option for reduced memory footprint and faster CPU inference.
  • Cross-fade chunk stitching for seamless audio across vocoder boundaries.
  • Speed control via linear interpolation of GPT latents.
  • Multilingual support β€” 17 languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu), Korean (ko), Hindi (hi).
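
The latent-interpolation speed control can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the repository's exact `time_scale_gpt_latents_numpy()` implementation:

```python
import numpy as np

def time_scale_latents(latents: np.ndarray, speed: float) -> np.ndarray:
    """Linearly interpolate GPT latents along the time axis.

    latents: [1, T, 1024]; speed > 1.0 shortens the sequence (faster
    speech), speed < 1.0 lengthens it (slower speech).
    """
    T = latents.shape[1]
    new_T = max(1, int(round(T / speed)))
    # Fractional source positions for each target frame
    src = np.linspace(0.0, T - 1, new_T)
    lo = np.floor(src).astype(np.int64)
    hi = np.minimum(lo + 1, T - 1)
    frac = (src - lo)[None, :, None]
    return latents[:, lo] * (1.0 - frac) + latents[:, hi] * frac
```

Because the scaling happens on GPT latents before vocoding, the pitch of the voice is preserved; only the duration changes.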

Architecture Overview

XTTSv2 is composed of four main neural network components, each exported as a separate ONNX model:

| Component | ONNX File | Description |
|---|---|---|
| Conditioning Encoder | conditioning_encoder.onnx | Six 16-head attention layers + Perceiver Resampler. Compresses a reference mel-spectrogram into 32 × 1024 conditioning latents. |
| Speaker Encoder | speaker_encoder.onnx | H/ASP speaker verification network. Extracts a 512-dim speaker embedding from 16 kHz audio. |
| GPT-2 Decoder | gpt_model.onnx / gpt_model_int8.onnx | 30-layer, 1024-dim decoder-only transformer with KV-cache. Autoregressively predicts VQ-VAE audio codes conditioned on text tokens and conditioning latents. |
| HiFi-GAN Vocoder | hifigan_vocoder.onnx | 26M-parameter neural vocoder. Converts GPT-2 hidden states + speaker embedding into a 24 kHz waveform. |

Pre-exported embedding tables (text, mel, positional) are stored as .npy files in the embeddings/ directory.
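
Assembling the GPT input prefix from these tables is plain NumPy indexing. A sketch with random stand-in tables (the real ones are loaded from embeddings/*.npy; the shapes and the token ids here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the pre-exported .npy tables (shapes as in embeddings/)
text_embedding = rng.standard_normal((6681, 1024)).astype(np.float32)
text_pos_embedding = rng.standard_normal((404, 1024)).astype(np.float32)

def embed_text(token_ids: np.ndarray) -> np.ndarray:
    """Look up BPE token embeddings and add positional embeddings."""
    T = len(token_ids)
    return text_embedding[token_ids] + text_pos_embedding[:T]

token_ids = np.array([261, 14, 62, 84])  # hypothetical BPE ids
cond_latents = rng.standard_normal((32, 1024)).astype(np.float32)
# GPT prefix = [conditioning latents | embedded text | ...]
prefix = np.concatenate([cond_latents, embed_text(token_ids)], axis=0)
```

The resulting prefix (conditioning latents followed by embedded text) is what the GPT-2 decoder attends to before it starts emitting audio codes.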

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   mel @ 22 kHz    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Reference   β”‚ ───────────────► β”‚ Conditioning Encoder │──► cond_latents [1,32,1024]
β”‚  Audio Clip  β”‚                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚              β”‚   audio @ 16 kHz  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              β”‚ ───────────────► β”‚   Speaker Encoder    │──► speaker_emb  [1,512,1]
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   BPE tokens   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Text   β”‚ ─────────────► β”‚  GPT-2 Decoder (autoregressive + KV$)   │──► latents [1,T,1024]
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                 β”‚  prefix = [cond | text+pos | start_mel] β”‚
                             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                           β”‚
                                           β–Ό
                             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                             β”‚   HiFi-GAN Vocoder   │──► waveform @ 24 kHz
                             β”‚  (+ speaker_emb)     β”‚
                             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Repository Structure

.
β”œβ”€β”€ README.md                         # This file
β”œβ”€β”€ requirements.txt                  # Python dependencies
β”œβ”€β”€ xtts_streaming_pipeline.py        # Top-level streaming TTS pipeline
β”œβ”€β”€ xtts_onnx_orchestrator.py         # Low-level ONNX AR loop orchestrator
β”œβ”€β”€ xtts_tokenizer.py                 # BPE tokenizer wrapper
β”œβ”€β”€ zh_num2words.py                   # Chinese number-to-words utility
β”œβ”€β”€ xtts_onnx/                        # ONNX models & assets
β”‚   β”œβ”€β”€ metadata.json                 # Model architecture metadata
β”‚   β”œβ”€β”€ vocab.json                    # BPE vocabulary
β”‚   β”œβ”€β”€ mel_stats.npy                 # Per-channel mel normalisation stats
β”‚   β”œβ”€β”€ conditioning_encoder.onnx     # Conditioning encoder
β”‚   β”œβ”€β”€ speaker_encoder.onnx          # H/ASP speaker encoder
β”‚   β”œβ”€β”€ gpt_model.onnx               # GPT-2 decoder (FP32)
β”‚   β”œβ”€β”€ gpt_model_int8.onnx          # GPT-2 decoder (INT8 quantised)
β”‚   β”œβ”€β”€ hifigan_vocoder.onnx          # HiFi-GAN vocoder
β”‚   └── embeddings/                   # Pre-exported embedding tables
β”‚       β”œβ”€β”€ mel_embedding.npy         # [1026, 1024] audio code embeddings
β”‚       β”œβ”€β”€ mel_pos_embedding.npy     # [608, 1024]  mel positional embeddings
β”‚       β”œβ”€β”€ text_embedding.npy        # [6681, 1024] BPE text embeddings
β”‚       └── text_pos_embedding.npy    # [404, 1024]  text positional embeddings
β”œβ”€β”€ audio_ref/                        # Reference audio clips for voice cloning
└── audio_synth/                      # Directory for synthesised output

Installation

Prerequisites

  • Python β‰₯ 3.10
  • A compiler toolchain (C and/or Rust) may be needed to build some dependencies from source (e.g. tokenizers) when pre-built wheels are unavailable.

Install dependencies

pip install -r requirements.txt

Clone from Hugging Face Hub

# Install Git LFS (required for large model files)
git lfs install

# Clone the repository
git clone https://huggingface.co/pltobing/XTTSv2-Streaming-ONNX
cd XTTSv2-Streaming-ONNX

Quick Start

Streaming TTS (command-line)

python -u xtts_streaming_pipeline.py \
    --model_dir xtts_onnx/ \
    --vocab_path xtts_onnx/vocab.json \
    --mel_norms_path xtts_onnx/mel_stats.npy \
    --ref_audio audio_ref/male_stewie.mp3 \
    --language en \
    --output output_streaming.wav

Python API

import numpy as np
from xtts_streaming_pipeline import StreamingTTSPipeline

# Initialise the pipeline
pipeline = StreamingTTSPipeline(
    model_dir="xtts_onnx/",
    vocab_path="xtts_onnx/vocab.json",
    mel_norms_path="xtts_onnx/mel_stats.npy",
    use_int8_gpt=True,       # Use INT8-quantised GPT for faster CPU inference
    num_threads_gpt=4,        # Adjust to your CPU core count
)

# Compute speaker conditioning (one-time per speaker)
gpt_cond_latent, speaker_embedding = pipeline.get_conditioning_latents(
    "audio_ref/male_stewie.mp3"
)

# Stream synthesis
all_chunks = []
for audio_chunk in pipeline.inference_stream(
    text="Hello, this is a streaming text-to-speech demo.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    stream_chunk_size=20,     # AR tokens per vocoder call
    speed=1.0,                # 1.0 = normal speed
):
    all_chunks.append(audio_chunk)
    # In a real application, you would play or stream each chunk here.

# Concatenate all chunks into a single waveform
full_audio = np.concatenate(all_chunks, axis=0)

# Save to file
import soundfile as sf
sf.write("output.wav", full_audio, 24000)

Configuration

SamplingConfig

Control the autoregressive token sampling behaviour:

| Parameter | Default | Description |
|---|---|---|
| temperature | 0.75 | Softmax temperature. Lower = more deterministic. |
| top_k | 50 | Keep only the top-k most probable tokens. |
| top_p | 0.85 | Nucleus sampling cumulative probability threshold. |
| repetition_penalty | 10.0 | Penalise previously generated tokens. |
| do_sample | True | True = multinomial sampling; False = greedy argmax. |

from xtts_onnx_orchestrator import SamplingConfig

sampling = SamplingConfig(
    temperature=0.65,
    top_k=30,
    top_p=0.90,
    repetition_penalty=10.0,
    do_sample=True,
)

for chunk in pipeline.inference_stream(text, "en", cond, spk, sampling=sampling):
    ...

GPTConfig

Model architecture parameters are automatically loaded from metadata.json. Key fields:

| Parameter | Value | Description |
|---|---|---|
| n_layer | 30 | Number of GPT-2 transformer layers |
| embed_dim | 1024 | Hidden dimension |
| num_heads | 16 | Number of attention heads |
| head_dim | 64 | Per-head dimension |
| num_audio_tokens | 1026 | Audio vocabulary (1024 VQ codes + start + stop) |
| perceiver_output_len | 32 | Conditioning latent sequence length |
| max_gen_mel_tokens | 605 | Maximum generated audio tokens |
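
These numbers let you estimate the FP32 KV-cache footprint with a back-of-the-envelope calculation (assuming one key and one value tensor of shape [num_heads, seq, head_dim] per layer, which is the standard GPT-2 cache layout):

```python
# Architecture values from metadata.json (see table above)
n_layer, num_heads, head_dim = 30, 16, 64

# Each cached token stores K and V across all layers, 4 bytes per FP32 value
bytes_per_token = n_layer * 2 * num_heads * head_dim * 4

# Worst-case context: 32 conditioning + up to 404 text + 605 mel tokens
seq_len = 32 + 404 + 605
total_mb = bytes_per_token * seq_len / 1024**2
print(f"{bytes_per_token} bytes/token, ~{total_mb:.0f} MB at {seq_len} tokens")
```

This is one reason the INT8-quantised GPT option matters on CPU: the weights shrink, even though the KV-cache itself stays FP32 unless the runtime quantises it too.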

Module Reference

xtts_streaming_pipeline.py

Top-level streaming pipeline.

| Class / Function | Description |
|---|---|
| StreamingTTSPipeline | Main pipeline class. Owns sessions, tokenizer, orchestrator. |
| StreamingTTSPipeline.get_conditioning_latents() | Extract GPT conditioning + speaker embedding from reference audio. |
| StreamingTTSPipeline.inference_stream() | Generator that yields audio chunks for a text segment. |
| StreamingTTSPipeline.time_scale_gpt_latents_numpy() | Linearly time-scale GPT latents for speed control. |
| wav_to_mel_cloning_numpy() | Compute normalised log-mel spectrogram (NumPy, 22 kHz). |
| crossfade_chunks() | Cross-fade consecutive vocoder waveform chunks. |
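
The cross-fade stitching can be sketched as a linear fade over an overlap region. This is an illustrative stand-in for crossfade_chunks(), assuming consecutive vocoder chunks share `overlap` samples:

```python
import numpy as np

def crossfade(prev: np.ndarray, nxt: np.ndarray, overlap: int) -> np.ndarray:
    """Blend the tail of `prev` into the head of `nxt` with linear ramps."""
    fade = np.linspace(0.0, 1.0, overlap)
    # Fade out the end of the previous chunk while fading in the next one
    blended = prev[-overlap:] * (1.0 - fade) + nxt[:overlap] * fade
    return np.concatenate([prev[:-overlap], blended, nxt[overlap:]])
```

The output is len(prev) + len(nxt) - overlap samples long; the blend removes the audible discontinuity that would otherwise occur at each vocoder boundary.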

xtts_onnx_orchestrator.py

Low-level ONNX autoregressive loop.

| Class / Function | Description |
|---|---|
| ONNXSessionManager | Loads and manages all ONNX sessions + embedding tables. |
| XTTSOrchestratorONNX | Drives the GPT-2 AR loop with KV-cache and logits processing. |
| GPTConfig | Model architecture hyper-parameters (from metadata.json). |
| SamplingConfig | Token sampling hyper-parameters. |
| apply_repetition_penalty() | NumPy repetition penalty on logits. |
| apply_temperature() | Temperature scaling on logits. |
| apply_top_k() | Top-k filtering on logits. |
| apply_top_p() | Nucleus (top-p) filtering on logits. |
| numpy_softmax() | Numerically-stable softmax in NumPy. |
| numpy_multinomial() | Inverse-CDF multinomial sampling. |
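
A compact NumPy sketch of the logits-processing chain these helpers implement, fused into one function for readability (an illustrative re-implementation, not the repository's exact code):

```python
import numpy as np

def process_logits(logits, prev_tokens, temperature=0.75, top_k=50,
                   top_p=0.85, repetition_penalty=10.0):
    """Apply penalty -> temperature -> top-k -> softmax -> top-p, then sample."""
    logits = logits.astype(np.float64).copy()
    # Repetition penalty: shrink positive / amplify negative logits of seen tokens
    for t in set(prev_tokens):
        logits[t] = logits[t] / repetition_penalty if logits[t] > 0 \
            else logits[t] * repetition_penalty
    logits /= temperature
    # Top-k: mask everything below the k-th largest logit
    if top_k > 0:
        kth = np.sort(logits)[-top_k]
        logits[logits < kth] = -np.inf
    # Numerically stable softmax (exp(-inf) -> 0 for masked entries)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p: keep the smallest prefix of tokens whose cumulative prob >= top_p
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = probs[order[:cutoff]]
    probs = kept / kept.sum()
    # Inverse-CDF multinomial draw
    idx = np.searchsorted(np.cumsum(probs), np.random.random())
    return int(min(idx, len(probs) - 1))
```

With top_k=1 this degenerates to greedy argmax, which is what do_sample=False selects in SamplingConfig.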

Performance Notes

  • stream_chunk_size controls the latency–quality trade-off: smaller values yield audio sooner but run the vocoder more often (on all accumulated latents).
  • Thread count (num_threads_gpt) should be tuned to your CPU. Start with the number of physical cores.
  • The first call to get_conditioning_latents() is expensive (resampling + mel computation + encoder inference). Cache the results for repeated synthesis with the same speaker.
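
Because chunks arrive incrementally, you can write (or play) them as they are produced instead of buffering the whole utterance. A sketch using the standard-library wave module, with a synthetic generator standing in for pipeline.inference_stream() output:

```python
import wave
import numpy as np

SAMPLE_RATE = 24000  # XTTSv2 vocoder output rate

def fake_stream():
    """Stand-in for pipeline.inference_stream(): yields float32 chunks."""
    rng = np.random.default_rng(0)
    for _ in range(3):
        yield (rng.standard_normal(4800) * 0.1).astype(np.float32)

with wave.open("output_streaming.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)          # 16-bit PCM
    wav.setframerate(SAMPLE_RATE)
    for chunk in fake_stream():
        # Convert float32 [-1, 1] to int16 and append immediately
        pcm = np.clip(chunk, -1.0, 1.0)
        wav.writeframes((pcm * 32767).astype(np.int16).tobytes())
```

For live playback, the same loop body would instead push each int16 buffer to an audio output stream.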

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Copyright 2025 Patrick Lumbantobing, Vertox-AI

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Acknowledgements

This project is an ONNX export of coqui/XTTS-v2, the original XTTSv2 model by Coqui AI.