metadata
title: Pocket TTS ONNX Web Demo
emoji: π
colorFrom: yellow
colorTo: pink
sdk: static
app_file: index.html
pinned: false
license: cc-by-4.0
short_description: Real-time voice cloning entirely in your browser! (CPU)
models:
- KevinAHM/pocket-tts-onnx
custom_headers:
  cross-origin-embedder-policy: require-corp
  cross-origin-opener-policy: same-origin
  cross-origin-resource-policy: cross-origin
Pocket TTS Web Demo
Real-time neural text-to-speech with voice cloning, running entirely in your browser.
Features
- Voice Cloning: Clone any voice from a short audio sample
- Predefined Voices: 3 bundled voices (Cosette, Jean, Fantine)
- Streaming Audio: Real-time audio generation with low latency
- Pure Browser: No inference server required; everything runs in WebAssembly
Model Files
The demo requires the following ONNX models in the onnx/ directory:
| File | Size | Purpose |
|---|---|---|
| mimi_encoder.onnx | ~70 MB | Voice audio → embeddings |
| text_conditioner.onnx | ~16 MB | Text tokens → embeddings |
| flow_lm_main_int8.onnx | ~73 MB | AR transformer (INT8) |
| flow_lm_flow_int8.onnx | ~10 MB | Flow matching network (INT8) |
| mimi_decoder_int8.onnx | ~22 MB | Latents → audio decoder (INT8) |
Additional files:
- tokenizer.model: SentencePiece tokenizer (~60 KB)
- voices.bin: Predefined voice embeddings (~1.5 MB)
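To get a feel for the total download, the files above can be listed in a small manifest and summed. This is only a sketch: the `ASSETS` array and the `totalSizeMB` helper are hypothetical names that are not part of the demo's code, and the sizes are the approximate figures quoted above.

```javascript
// Approximate sizes in MB, copied from the model table and the
// additional-files list. ASSETS and totalSizeMB are illustrative names.
const ASSETS = [
  { path: "onnx/mimi_encoder.onnx", sizeMB: 70 },
  { path: "onnx/text_conditioner.onnx", sizeMB: 16 },
  { path: "onnx/flow_lm_main_int8.onnx", sizeMB: 73 },
  { path: "onnx/flow_lm_flow_int8.onnx", sizeMB: 10 },
  { path: "onnx/mimi_decoder_int8.onnx", sizeMB: 22 },
  { path: "tokenizer.model", sizeMB: 0.06 }, // ~60 KB
  { path: "voices.bin", sizeMB: 1.5 },
];

// Sum the approximate sizes of a list of assets.
function totalSizeMB(assets) {
  return assets.reduce((sum, asset) => sum + asset.sizeMB, 0);
}

console.log(`Approximate total: ${Math.round(totalSizeMB(ASSETS))} MB`);
```

The total comes to a little under 200 MB, almost all of it in the five ONNX models.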
Browser Requirements
- Modern browser with WebAssembly support
- Chrome, Edge, Firefox, or Safari (latest versions)
- ~200 MB RAM for model loading
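A page can check these requirements before loading any models. The sketch below is not taken from the demo; `checkEnvironment` is a hypothetical helper. It tests for WebAssembly and for cross-origin isolation, which the `cross-origin-embedder-policy` and `cross-origin-opener-policy` headers in the Space metadata enable and which browsers require before exposing `SharedArrayBuffer`. Whether the demo actually relies on shared memory (for example for multithreaded WebAssembly) is an assumption here.

```javascript
// Report which browser capabilities are available.
// `env` defaults to globalThis so the function can also be run against a stub.
function checkEnvironment(env = globalThis) {
  const hasWasm =
    typeof env.WebAssembly === "object" &&
    typeof env.WebAssembly.instantiate === "function";

  // crossOriginIsolated is true only when the COOP/COEP headers are in effect.
  const isolated = env.crossOriginIsolated === true;

  // Browsers expose SharedArrayBuffer only to cross-origin isolated pages.
  const hasSharedMemory =
    isolated && typeof env.SharedArrayBuffer === "function";

  return { hasWasm, isolated, hasSharedMemory };
}

console.log(checkEnvironment());
```

If `hasWasm` is false the demo cannot run at all; if only `hasSharedMemory` is false, the missing headers are the first thing to check when self-hosting.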
Voice Cloning
- Click "Upload Voice" or select "Custom (Upload)" from the dropdown
- Upload an audio file (WAV, MP3, etc.) with clear speech
- Best results with 3-10 seconds of clean audio
- The voice will be encoded and used for all subsequent generations
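The steps above imply some preprocessing between the upload and the encoder. A minimal sketch of what that could look like follows, using the standard Web Audio API to decode the file, mixing it down to mono, and keeping at most 10 seconds. The helper names (`toMono`, `trimToSeconds`, `prepareVoiceSample`) and the 10-second cap are assumptions for illustration; the demo's actual preprocessing, including any resampling to the encoder's expected sample rate, may differ.

```javascript
// Average all channels into a single mono Float32Array.
function toMono(channels) {
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}

// Keep at most maxSeconds of audio from the start of the clip.
function trimToSeconds(samples, sampleRate, maxSeconds = 10) {
  const maxSamples = Math.floor(sampleRate * maxSeconds);
  return samples.length > maxSamples ? samples.subarray(0, maxSamples) : samples;
}

// Browser-only: decode an uploaded File into trimmed mono samples.
async function prepareVoiceSample(file) {
  const ctx = new AudioContext();
  const buffer = await ctx.decodeAudioData(await file.arrayBuffer());
  const channels = [];
  for (let c = 0; c < buffer.numberOfChannels; c++) {
    channels.push(buffer.getChannelData(c));
  }
  await ctx.close();
  const samples = trimToSeconds(toMono(channels), buffer.sampleRate);
  return { samples, sampleRate: buffer.sampleRate };
}
```

`prepareVoiceSample` would be called with the `File` from an `<input type="file">` element; the two pure helpers can be used on their own wherever raw samples are already available.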
File Structure
pocket-tts-web/
├── index.html              # Main HTML page
├── onnx-streaming.js       # Main thread controller
├── inference-worker.js     # Web Worker for ONNX inference
├── PCMPlayerWorklet.js     # Audio playback worklet
├── EventEmitter.js         # Event utilities
├── sentencepiece.js        # SentencePiece tokenizer library
├── style.css               # Styles
├── tokenizer.model         # SentencePiece model
├── voices.bin              # Predefined voice embeddings
└── onnx/
    ├── mimi_encoder.onnx
    ├── text_conditioner.onnx
    ├── flow_lm_main_int8.onnx
    ├── flow_lm_flow_int8.onnx
    └── mimi_decoder_int8.onnx
License
- Models & Voice Embeddings: CC BY 4.0 (inherited from kyutai/pocket-tts)
- Code: Apache 2.0