Pocket TTS ONNX
ONNX export of Pocket TTS for lightweight text-to-speech with zero-shot voice cloning.
Model Description
Pocket TTS is a compact text-to-speech model from Kyutai that supports voice cloning from short audio samples. This ONNX export provides:
- Zero-shot voice cloning from any audio reference
- INT8 quantized models for fast CPU inference
- Streaming support with adaptive chunking for real-time playback
- Temperature control for generation diversity
- Dual model architecture for flexible flow matching
Architecture
This export uses a dual model split for the Flow LM:
- flow_lm_main: Transformer/conditioner (produces conditioning vectors)
- flow_lm_flow: Flow network only (Euler integration for latent sampling)
This architecture enables:
- Temperature control: Adjust generation diversity via noise scaling
- Variable LSD steps: Trade off speed vs quality
- External flow loop: Full control over the sampling process
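The external flow loop described above can be sketched roughly as follows. This is a minimal illustration, not the wrapper's actual code: `velocity_fn` is a hypothetical stand-in for one call to the flow_lm_flow network, `toy_velocity` is invented so the example runs without any model files, and scaling the starting noise by the square root of the temperature is an assumption about how the noise scaling is applied.

```python
import math
import numpy as np

def sample_latent(velocity_fn, cond, latent_dim, steps=10, temperature=0.7, rng=None):
    """Euler-integrate one latent from t=0 to t=1 (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: temperature scales the variance of the starting noise.
    noise = rng.standard_normal(latent_dim).astype(np.float32)
    x = noise * math.sqrt(temperature)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        # One Euler step: follow the predicted velocity for a time slice dt.
        x = x + dt * velocity_fn(cond, t, x)
    return x

def toy_velocity(cond, t, x):
    """Hypothetical stand-in for flow_lm_flow: pulls x toward cond."""
    return cond - x

cond = np.ones(4, dtype=np.float32)
one_step = sample_latent(toy_velocity, cond, latent_dim=4, steps=1, temperature=0.0)
ten_steps = sample_latent(toy_velocity, cond, latent_dim=4, steps=10, temperature=0.0)
```

Because the loop lives outside the ONNX graphs, both `steps` and `temperature` can be changed per call without re-exporting anything, which is the flexibility the split is meant to provide.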
Performance
| Precision | Flow Main | Flow Net | Decoder | Total Size | RTFx (CPU) |
|---|---|---|---|---|---|
| INT8 | 76 MB | 10 MB | 23 MB | ~200 MB | ~3x |
| FP32 | 303 MB | 39 MB | 42 MB | ~475 MB | ~2x |
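RTFx = inverse real-time factor: seconds of audio generated per second of compute (>1.0 means faster than real-time).
Total Size also counts the shared voice encoder (73 MB) and text conditioner (16 MB) listed under Files.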
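To reproduce an RTFx number on your own hardware, something like the sketch below may help. The helper names `rtfx` and `timed_rtfx` are hypothetical, and the 24 kHz sample rate is an assumption inferred from the resampling note in the requirements.

```python
import time

SAMPLE_RATE = 24_000  # Assumption: the model outputs 24 kHz audio

def rtfx(num_samples, elapsed_seconds, sample_rate=SAMPLE_RATE):
    """Seconds of audio produced per second of wall-clock compute."""
    return (num_samples / sample_rate) / elapsed_seconds

def timed_rtfx(synthesize, sample_rate=SAMPLE_RATE):
    """Time a zero-argument callable that returns a sequence of samples."""
    start = time.perf_counter()
    audio = synthesize()
    elapsed = time.perf_counter() - start
    return rtfx(len(audio), elapsed, sample_rate)

# 72,000 samples at 24 kHz is 3 s of audio; produced in 1 s of compute.
example = rtfx(num_samples=72_000, elapsed_seconds=1.0)
```

With the wrapper loaded, `synthesize` could be `lambda: tts.generate(text="Hello", voice="reference_sample.wav")`, assuming `generate` returns a one-dimensional sample array.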
Decoding Strategy
This library employs optimized decoding to prevent audio artifacts while maximizing speed:
- Offline (generate): Uses chunked decoding (15 frames). This prevents state initialization artifacts found in full-batch decoding while being ~5.5x faster than frame-by-frame.
- Streaming (stream): Uses adaptive chunking (starts at 2 frames). This ensures instant start (low TTFB) while scaling up chunk sizes for throughput.
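The two strategies can be pictured as different chunk-size schedules over the same sequence of latent frames. The sketch below is illustrative only: the source states the starting size of 2 frames and the offline size of 15 frames, while the doubling growth rule and the cap of 15 for streaming are assumptions, and `chunk_schedule` is a hypothetical helper.

```python
def chunk_schedule(total_frames, first=2, growth=2, cap=15):
    """Return the list of chunk sizes used to decode total_frames frames."""
    sizes = []
    size = first
    remaining = total_frames
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
        # Assumption: chunks grow geometrically until they reach the cap.
        size = min(size * growth, cap)
    return sizes

streaming = chunk_schedule(40)                             # small first chunk
offline = chunk_schedule(40, first=15, growth=1, cap=15)   # fixed 15-frame chunks
```

A small first chunk keeps time-to-first-audio low, while later chunks approach the offline size so that per-call overhead is amortized over more frames.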
Usage
from pocket_tts_onnx import PocketTTSOnnx
# Load model (INT8 by default)
tts = PocketTTSOnnx()
# Generate speech with voice cloning
audio = tts.generate(
    text="Hello, this is a test of voice cloning.",
    voice="reference_sample.wav"
)
# Save output
tts.save_audio(audio, "output.wav")
Temperature Control
Adjust generation diversity with the temperature parameter:
# More deterministic (lower temperature)
tts = PocketTTSOnnx(temperature=0.3)
# Default balance
tts = PocketTTSOnnx(temperature=0.7)
# More diverse/expressive (higher temperature)
tts = PocketTTSOnnx(temperature=1.0)
LSD Steps
Trade off speed vs quality with lsd_steps:
# Default (10 steps)
tts = PocketTTSOnnx(lsd_steps=10)
# Faster (fewer steps, lower quality)
tts = PocketTTSOnnx(lsd_steps=1)
Streaming Mode
For real-time applications with low time-to-first-audio:
for chunk in tts.stream("Hello world!", voice="reference_sample.wav"):
    play_audio(chunk)  # Process each chunk as it arrives (play_audio is a placeholder for your own playback function)
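If you want to keep the streamed audio and also check how quickly the first chunk arrives, a helper along these lines may be useful. It is a sketch under assumptions: `collect_stream` and `fake_stream` are hypothetical names, and chunks are assumed to be one-dimensional arrays of float samples, which the source does not confirm.

```python
import time
import numpy as np

def collect_stream(chunks):
    """Consume an iterator of audio chunks.

    Returns (audio, seconds_to_first_chunk); the latency is None if the
    iterator yields nothing.
    """
    start = time.perf_counter()
    first_chunk_latency = None
    parts = []
    for chunk in chunks:
        if first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start
        parts.append(np.asarray(chunk, dtype=np.float32))
    audio = np.concatenate(parts) if parts else np.zeros(0, dtype=np.float32)
    return audio, first_chunk_latency

def fake_stream():
    """Hypothetical stand-in for tts.stream(...): yields chunks of silence."""
    for n in (2, 4, 8):
        yield np.zeros(n, dtype=np.float32)

audio, latency = collect_stream(fake_stream())
```

In real use, `fake_stream()` would be replaced by `tts.stream("Hello world!", voice="reference_sample.wav")`.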
Command Line
python generate.py "Hello, this is a test." reference_sample.wav output.wav
Files
pocket-tts-onnx/
├── onnx/
│   ├── flow_lm_main.onnx        # 303 MB - Flow LM transformer (FP32)
│   ├── flow_lm_main_int8.onnx   # 76 MB - Flow LM transformer (INT8)
│   ├── flow_lm_flow.onnx        # 39 MB - Flow network (FP32)
│   ├── flow_lm_flow_int8.onnx   # 10 MB - Flow network (INT8)
│   ├── mimi_decoder.onnx        # 42 MB - Audio decoder (FP32)
│   ├── mimi_decoder_int8.onnx   # 23 MB - Audio decoder (INT8)
│   ├── mimi_encoder.onnx        # 73 MB - Voice encoder
│   └── text_conditioner.onnx    # 16 MB - Text embeddings
├── reference_sample.wav         # Example voice reference
├── tokenizer.model              # SentencePiece tokenizer
├── pocket_tts_onnx.py           # Inference wrapper
├── generate.py                  # CLI script
├── requirements.txt             # Python dependencies
└── README.md
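The listing above also explains the totals in the Performance table: each precision needs its own three model files plus the two files that have no INT8 variant. The sketch below only does that arithmetic; `files_for` and `total_mb` are hypothetical helpers, and the assumption that the encoder and text conditioner are loaded for both precisions is inferred from the listing rather than stated in the source.

```python
# Sizes in MB, copied from the file listing.
SHARED = {"mimi_encoder.onnx": 73, "text_conditioner.onnx": 16}
PER_PRECISION = {
    "int8": {
        "flow_lm_main_int8.onnx": 76,
        "flow_lm_flow_int8.onnx": 10,
        "mimi_decoder_int8.onnx": 23,
    },
    "fp32": {
        "flow_lm_main.onnx": 303,
        "flow_lm_flow.onnx": 39,
        "mimi_decoder.onnx": 42,
    },
}

def files_for(precision):
    """Return {filename: size_mb} for one precision, including shared files."""
    return {**PER_PRECISION[precision], **SHARED}

def total_mb(precision):
    """Sum the sizes of every file needed for one precision."""
    return sum(files_for(precision).values())

int8_total = total_mb("int8")
fp32_total = total_mb("fp32")
```

The sums land close to the rounded ~200 MB and ~475 MB figures in the table.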
Requirements
onnxruntime>=1.16.0
numpy
soundfile
sentencepiece
scipy # Only needed if resampling from non-24kHz audio
Install with:
pip install -r requirements.txt
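For GPU acceleration, use the onnxruntime-gpu package in place of onnxruntime (the two are not meant to be installed side by side):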
pip install onnxruntime-gpu
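After installing the GPU package, it can be worth confirming that the CUDA execution provider is actually visible before expecting a speedup. The sketch below uses ONNX Runtime directly; `choose_providers` is a hypothetical helper, and whether `PocketTTSOnnx` exposes a provider option is not stated in this README, so the session call is shown only as a comment.

```python
def choose_providers(available):
    """Prefer CUDA when present, always keeping CPU as the fallback."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]

# With ONNX Runtime installed, the available list would come from:
#   import onnxruntime as ort
#   providers = choose_providers(ort.get_available_providers())
#   session = ort.InferenceSession("onnx/flow_lm_flow_int8.onnx", providers=providers)
cpu_only = choose_providers(["CPUExecutionProvider"])
with_gpu = choose_providers(
    ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
)
```

If `CUDAExecutionProvider` is missing from the available list, ONNX Runtime will run on the CPU even though the GPU package is installed.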
License
- Models: CC BY 4.0 (inherited from kyutai/pocket-tts)
- Code: Apache 2.0
Prohibited Use
Use of our model must comply with all applicable laws and regulations and must not result in, involve, or facilitate any illegal, harmful, deceptive, fraudulent, or unauthorized activity. Prohibited uses include, without limitation, voice impersonation or cloning without explicit and lawful consent; misinformation, disinformation, or deception (including fake news, fraudulent calls, or presenting generated content as genuine recordings of real people or events); and the generation of unlawful, harmful, libelous, abusive, harassing, discriminatory, hateful, or privacy-invasive content. We disclaim all liability for any non-compliant use.
Acknowledgments
- Kyutai for the original Pocket TTS model