---
language:
- en
- es
- fr
- de
- it
- pt
- pl
- tr
- ru
- nl
- cs
- ar
- zh
- ja
- hu
- ko
- hi
tags:
- tts
- onnx
- xtts
- xttsv2
- voice clone
- streaming
- gpt2
- hifigan
- multilingual
- vq
- perceiver encoder
license: apache-2.0
base_model: coqui/XTTS-v2
---
# XTTSv2 Streaming ONNX
**Streaming text-to-speech inference for [XTTSv2](https://arxiv.org/abs/2406.04904) using ONNX Runtime — no PyTorch required.**
This repository provides a complete, CPU-friendly, streaming TTS pipeline built on ONNX-exported XTTSv2 models. It replaces the original PyTorch inference path with pure Python/NumPy logic while preserving full compatibility with the XTTSv2 architecture.
---
## Features
- **Zero-shot voice cloning** from a short (≤ 6 s) reference audio clip.
- **Streaming audio output** — audio chunks are yielded as they are generated, enabling low-latency playback.
- **Pure ONNX Runtime + NumPy** — no PyTorch dependency at inference time.
- **INT8-quantised GPT model** option for reduced memory footprint and faster CPU inference.
- **Cross-fade chunk stitching** for seamless audio across vocoder boundaries.
- **Speed control** via linear interpolation of GPT latents.
- **Multilingual support** — 17 languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu), Korean (ko), Hindi (hi).
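The speed control listed above works by linearly interpolating the GPT latent sequence along its time axis before vocoding. The actual logic lives in `StreamingTTSPipeline.time_scale_gpt_latents_numpy()`; the helper below is a simplified NumPy sketch of the idea (function name and exact handling are illustrative):

```python
import numpy as np

def time_scale_latents(latents: np.ndarray, speed: float) -> np.ndarray:
    """Linearly interpolate GPT latents [T, D] along time to change speed.

    speed > 1.0 shortens the latent sequence (faster speech);
    speed < 1.0 lengthens it (slower speech).
    """
    T, D = latents.shape
    new_T = max(1, int(round(T / speed)))
    # Fractional source positions in the original time axis for each output frame.
    src = np.linspace(0.0, T - 1, new_T)
    out = np.empty((new_T, D), dtype=latents.dtype)
    for d in range(D):
        out[:, d] = np.interp(src, np.arange(T), latents[:, d])
    return out
```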
---
## Architecture Overview
XTTSv2 is composed of four main neural network components, each exported as a separate ONNX model:
| Component | ONNX File | Description |
|---|---|---|
| **Conditioning Encoder** | `conditioning_encoder.onnx` | Six 16-head attention layers + Perceiver Resampler. Compresses a reference mel-spectrogram into 32 × 1024 conditioning latents. |
| **Speaker Encoder** | `speaker_encoder.onnx` | H/ASP speaker verification network. Extracts a 512-dim speaker embedding from 16 kHz audio. |
| **GPT-2 Decoder** | `gpt_model.onnx` / `gpt_model_int8.onnx` | 30-layer, 1024-dim decoder-only transformer with KV-cache. Autoregressively predicts VQ-VAE audio codes conditioned on text tokens and conditioning latents. |
| **HiFi-GAN Vocoder** | `hifigan_vocoder.onnx` | 26M-parameter neural vocoder. Converts GPT-2 hidden states + speaker embedding into a 24 kHz waveform. |
Pre-exported embedding tables (text, mel, positional) are stored as `.npy` files in the `embeddings/` directory.
```
┌─────────────┐ mel @ 22 kHz ┌─────────────────────┐
│ Reference │ ───────────────► │ Conditioning Encoder │──► cond_latents [1,32,1024]
│ Audio Clip │ └─────────────────────┘
│ │ audio @ 16 kHz ┌─────────────────────┐
│ │ ───────────────► │ Speaker Encoder │──► speaker_emb [1,512,1]
└─────────────┘ └─────────────────────┘
┌──────────┐ BPE tokens ┌──────────────────────────────────────────┐
│ Text │ ─────────────► │ GPT-2 Decoder (autoregressive + KV$) │──► latents [1,T,1024]
└──────────┘ │ prefix = [cond | text+pos | start_mel] │
└──────────────────────────────────────────┘
┌─────────────────────┐
│ HiFi-GAN Vocoder │──► waveform @ 24 kHz
│ (+ speaker_emb) │
└─────────────────────┘
```
---
## Repository Structure
```
.
├── README.md # This file
├── requirements.txt # Python dependencies
├── xtts_streaming_pipeline.py # Top-level streaming TTS pipeline
├── xtts_onnx_orchestrator.py # Low-level ONNX AR loop orchestrator
├── xtts_tokenizer.py # BPE tokenizer wrapper
├── zh_num2words.py # Chinese number-to-words utility
├── xtts_onnx/ # ONNX models & assets
│ ├── metadata.json # Model architecture metadata
│ ├── vocab.json # BPE vocabulary
│ ├── mel_stats.npy # Per-channel mel normalisation stats
│ ├── conditioning_encoder.onnx # Conditioning encoder
│ ├── speaker_encoder.onnx # H/ASP speaker encoder
│ ├── gpt_model.onnx # GPT-2 decoder (FP32)
│ ├── gpt_model_int8.onnx # GPT-2 decoder (INT8 quantised)
│ ├── hifigan_vocoder.onnx # HiFi-GAN vocoder
│ └── embeddings/ # Pre-exported embedding tables
│ ├── mel_embedding.npy # [1026, 1024] audio code embeddings
│ ├── mel_pos_embedding.npy # [608, 1024] mel positional embeddings
│ ├── text_embedding.npy # [6681, 1024] BPE text embeddings
│ └── text_pos_embedding.npy # [404, 1024] text positional embeddings
├── audio_ref/ # Reference audio clips for voice cloning
└── audio_synth/ # Directory for synthesised output
```
---
## Installation
### Prerequisites
- Python ≥ 3.10
- A C compiler may be needed for some dependencies (e.g. `tokenizers`).
### Install dependencies
```bash
pip install -r requirements.txt
```
### Clone from Hugging Face Hub
```bash
# Install Git LFS (required for large model files)
git lfs install
# Clone the repository
git clone https://huggingface.co/pltobing/XTTSv2-Streaming-ONNX
cd XTTSv2-Streaming-ONNX
```
---
## Quick Start
### Streaming TTS (command-line)
```bash
python -u xtts_streaming_pipeline.py \
--model_dir xtts_onnx/ \
--vocab_path xtts_onnx/vocab.json \
--mel_norms_path xtts_onnx/mel_stats.npy \
--ref_audio audio_ref/male_stewie.mp3 \
--language en \
--output output_streaming.wav
```
### Python API
```python
import numpy as np
from xtts_streaming_pipeline import StreamingTTSPipeline
# Initialise the pipeline
pipeline = StreamingTTSPipeline(
model_dir="xtts_onnx/",
vocab_path="xtts_onnx/vocab.json",
mel_norms_path="xtts_onnx/mel_stats.npy",
use_int8_gpt=True, # Use INT8-quantised GPT for faster CPU inference
num_threads_gpt=4, # Adjust to your CPU core count
)
# Compute speaker conditioning (one-time per speaker)
gpt_cond_latent, speaker_embedding = pipeline.get_conditioning_latents(
"audio_ref/male_stewie.mp3"
)
# Stream synthesis
all_chunks = []
for audio_chunk in pipeline.inference_stream(
text="Hello, this is a streaming text-to-speech demo.",
language="en",
gpt_cond_latent=gpt_cond_latent,
speaker_embedding=speaker_embedding,
stream_chunk_size=20, # AR tokens per vocoder call
speed=1.0, # 1.0 = normal speed
):
all_chunks.append(audio_chunk)
# In a real application, you would play or stream each chunk here.
# Concatenate all chunks into a single waveform
full_audio = np.concatenate(all_chunks, axis=0)
# Save to file
import soundfile as sf
sf.write("output.wav", full_audio, 24000)
```
---
## Configuration
### SamplingConfig
Control the autoregressive token sampling behaviour:
| Parameter | Default | Description |
|---|---|---|
| `temperature` | `0.75` | Softmax temperature. Lower = more deterministic. |
| `top_k` | `50` | Keep only the top-*k* most probable tokens. |
| `top_p` | `0.85` | Nucleus sampling cumulative probability threshold. |
| `repetition_penalty` | `10.0` | Penalise previously generated tokens. |
| `do_sample` | `True` | `True` = multinomial sampling; `False` = greedy argmax. |
```python
from xtts_onnx_orchestrator import SamplingConfig
sampling = SamplingConfig(
temperature=0.65,
top_k=30,
top_p=0.90,
repetition_penalty=10.0,
do_sample=True,
)
for chunk in pipeline.inference_stream(text, "en", cond, spk, sampling=sampling):
...
```
### GPTConfig
Model architecture parameters are automatically loaded from `metadata.json`. Key fields:
| Parameter | Value | Description |
|---|---|---|
| `n_layer` | 30 | Number of GPT-2 transformer layers |
| `embed_dim` | 1024 | Hidden dimension |
| `num_heads` | 16 | Number of attention heads |
| `head_dim` | 64 | Per-head dimension |
| `num_audio_tokens` | 1026 | Audio vocabulary (1024 VQ codes + start + stop) |
| `perceiver_output_len` | 32 | Conditioning latent sequence length |
| `max_gen_mel_tokens` | 605 | Maximum generated audio tokens |
---
## Module Reference
### `xtts_streaming_pipeline.py`
Top-level streaming pipeline.
| Class / Function | Description |
|---|---|
| `StreamingTTSPipeline` | Main pipeline class. Owns sessions, tokenizer, orchestrator. |
| `StreamingTTSPipeline.get_conditioning_latents()` | Extract GPT conditioning + speaker embedding from reference audio. |
| `StreamingTTSPipeline.inference_stream()` | Generator that yields audio chunks for a text segment. |
| `StreamingTTSPipeline.time_scale_gpt_latents_numpy()` | Linearly time-scale GPT latents for speed control. |
| `wav_to_mel_cloning_numpy()` | Compute normalised log-mel spectrogram (NumPy, 22 kHz). |
| `crossfade_chunks()` | Cross-fade consecutive vocoder waveform chunks. |
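The cross-fade step can be sketched as a linear fade between the tail of one vocoder chunk and the head of the next. The real `crossfade_chunks()` may use a different signature or fade shape; this is a minimal illustration of the technique:

```python
import numpy as np

def crossfade(prev_chunk: np.ndarray, next_chunk: np.ndarray, fade_len: int) -> np.ndarray:
    """Linearly cross-fade the end of one waveform chunk into the start of the next."""
    fade_len = min(fade_len, len(prev_chunk), len(next_chunk))
    fade_out = np.linspace(1.0, 0.0, fade_len)   # ramp applied to the outgoing chunk
    fade_in = 1.0 - fade_out                     # complementary ramp for the incoming chunk
    blended = prev_chunk[-fade_len:] * fade_out + next_chunk[:fade_len] * fade_in
    return np.concatenate([prev_chunk[:-fade_len], blended, next_chunk[fade_len:]])
```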
### `xtts_onnx_orchestrator.py`
Low-level ONNX autoregressive loop.
| Class / Function | Description |
|---|---|
| `ONNXSessionManager` | Loads and manages all ONNX sessions + embedding tables. |
| `XTTSOrchestratorONNX` | Drives the GPT-2 AR loop with KV-cache and logits processing. |
| `GPTConfig` | Model architecture hyper-parameters (from `metadata.json`). |
| `SamplingConfig` | Token sampling hyper-parameters. |
| `apply_repetition_penalty()` | NumPy repetition penalty on logits. |
| `apply_temperature()` | Temperature scaling on logits. |
| `apply_top_k()` | Top-*k* filtering on logits. |
| `apply_top_p()` | Nucleus (top-*p*) filtering on logits. |
| `numpy_softmax()` | Numerically stable softmax in NumPy. |
| `numpy_multinomial()` | Inverse-CDF multinomial sampling. |
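The logits processors above compose into a single autoregressive sampling step. A condensed NumPy sketch of that composition follows; the function name and exact masking details are illustrative, not the module's actual API:

```python
import numpy as np

def sample_token(logits, generated, temperature=0.75, top_k=50, top_p=0.85,
                 repetition_penalty=10.0, rng=None):
    """One AR sampling step on a 1-D logits vector (simplified sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    logits = logits.astype(np.float64).copy()
    # Repetition penalty: shrink positive logits, amplify negative ones.
    for tok in set(generated):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    logits /= temperature
    # Top-k: mask everything below the k-th largest logit.
    if 0 < top_k < len(logits):
        kth = np.sort(logits)[-top_k]
        logits[logits < kth] = -np.inf
    # Numerically stable softmax.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p (nucleus): keep the smallest prefix of tokens whose mass exceeds top_p.
    order = np.argsort(probs)[::-1]
    cutoff = order[np.cumsum(probs[order]) > top_p]
    if len(cutoff) > 1:
        probs[cutoff[1:]] = 0.0   # always keep the token that crosses the threshold
        probs /= probs.sum()
    # Inverse-CDF multinomial draw.
    idx = int(np.searchsorted(np.cumsum(probs), rng.random(), side="right"))
    return min(idx, len(probs) - 1)
```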
---
## Performance Notes
- **`stream_chunk_size`** controls the latency–quality trade-off: smaller values yield the first audio sooner, but the vocoder runs more often and each call processes all accumulated latents.
- **Thread count** (`num_threads_gpt`) should be tuned to your CPU. Start with the number of physical cores.
- **`get_conditioning_latents()`** is an expensive step (resampling + mel computation + encoder inference). Cache its results for repeated synthesis with the same speaker.
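A per-speaker cache around `get_conditioning_latents()` can be as simple as a dictionary keyed on the reference audio path. The helper below is hypothetical, not part of the module:

```python
# Hypothetical per-speaker cache around get_conditioning_latents().
_cond_cache = {}

def cached_conditioning(pipeline, ref_audio_path):
    """Compute conditioning latents once per reference clip, then reuse them."""
    if ref_audio_path not in _cond_cache:
        _cond_cache[ref_audio_path] = pipeline.get_conditioning_latents(ref_audio_path)
    return _cond_cache[ref_audio_path]
```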
---
## License
This project is licensed under the **Apache License 2.0**. See the [LICENSE](LICENSE) file for details.
```
Copyright 2025 Patrick Lumbantobing, Vertox-AI
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
```
---
## Acknowledgements
- [Coqui AI](https://github.com/coqui-ai/TTS) for the original XTTSv2 model and training recipe.
- [XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model](https://arxiv.org/abs/2406.04904) (Casanova et al., 2024).
- [ONNX Runtime](https://onnxruntime.ai/) for high-performance cross-platform inference.