metadata
title: Pocket TTS ONNX Web Demo
emoji: 🌖
colorFrom: yellow
colorTo: pink
sdk: static
app_file: index.html
pinned: false
license: cc-by-4.0
short_description: Real-time voice cloning entirely in your browser! (CPU)
models:
  - KevinAHM/pocket-tts-onnx
custom_headers:
  cross-origin-embedder-policy: require-corp
  cross-origin-opener-policy: same-origin
  cross-origin-resource-policy: cross-origin

Pocket TTS Web Demo

Real-time neural text-to-speech with voice cloning, running entirely in your browser.

Features

  • Voice Cloning: Clone any voice from a short audio sample
  • Predefined Voices: 3 bundled voices (Cosette, Jean, Fantine)
  • Streaming Audio: Real-time audio generation with low latency
  • Pure Browser: No inference server required; everything runs in WebAssembly

Model Files

The demo requires the following ONNX models in the onnx/ directory:

| File | Size | Purpose |
|------|------|---------|
| mimi_encoder.onnx | ~70 MB | Voice audio → embeddings |
| text_conditioner.onnx | ~16 MB | Text tokens → embeddings |
| flow_lm_main_int8.onnx | ~73 MB | AR transformer (INT8) |
| flow_lm_flow_int8.onnx | ~10 MB | Flow matching network (INT8) |
| mimi_decoder_int8.onnx | ~22 MB | Latents → audio decoder (INT8) |

Additional files:

  • tokenizer.model - SentencePiece tokenizer (~60 KB)
  • voices.bin - Predefined voice embeddings (~1.5 MB)

Browser Requirements

  • Modern browser with WebAssembly support
  • Chrome, Edge, Firefox, or Safari (latest versions)
  • ~200 MB RAM for model loading
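
The Space metadata sets `cross-origin-opener-policy` and `cross-origin-embedder-policy` headers. Browsers report the result through the `crossOriginIsolated` global, which gates `SharedArrayBuffer` and therefore multi-threaded WebAssembly. The check below is a minimal sketch of how a page might detect these capabilities; `detectCapabilities` and the cap of 4 threads are assumptions for illustration, not taken from the demo.

```javascript
// Inspect an environment object (defaults to the real global scope) and
// report whether WebAssembly and threaded WebAssembly look usable.
function detectCapabilities(env = globalThis) {
  const hasWasm =
    typeof env.WebAssembly === "object" &&
    env.WebAssembly !== null &&
    typeof env.WebAssembly.instantiate === "function";
  const hasSharedMemory = typeof env.SharedArrayBuffer === "function";
  const isolated = env.crossOriginIsolated === true;
  const cores = (env.navigator && env.navigator.hardwareConcurrency) || 1;

  // Threads need WebAssembly, shared memory, and cross-origin isolation.
  const canUseThreads = hasWasm && hasSharedMemory && isolated;

  // The upper bound of 4 is an arbitrary choice for this sketch.
  const suggestedThreads = canUseThreads ? Math.min(cores, 4) : 1;
  return { hasWasm, canUseThreads, suggestedThreads };
}
```

If the page uses ONNX Runtime Web (an assumption here), `suggestedThreads` is the kind of value that could be assigned to `ort.env.wasm.numThreads` before creating sessions.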

Voice Cloning

  1. Click "Upload Voice" or select "Custom (Upload)" from the dropdown
  2. Upload an audio file (WAV, MP3, etc.) with clear speech
  3. For best results, use 3-10 seconds of clean audio
  4. The voice will be encoded and used for all subsequent generations
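
To make the 3-10 second guideline concrete, here is a hedged sketch of how an uploaded clip might be prepared before encoding. Whether the demo downmixes or trims this way is an assumption; `prepareVoiceSample` is a hypothetical helper. In a browser, the channel arrays would typically come from `AudioContext.decodeAudioData` followed by `AudioBuffer.getChannelData`.

```javascript
// Downmix per-channel Float32Array data to mono and keep at most
// `maxSeconds` of audio. Channels are assumed to have equal length,
// as they do in a decoded AudioBuffer.
function prepareVoiceSample(channels, sampleRate, maxSeconds = 10, minSeconds = 3) {
  if (channels.length === 0) throw new Error("no audio channels");
  const frames = Math.min(channels[0].length, Math.floor(sampleRate * maxSeconds));
  const mono = new Float32Array(frames);
  for (const channel of channels) {
    for (let i = 0; i < frames; i++) {
      mono[i] += channel[i] / channels.length; // average across channels
    }
  }
  const seconds = frames / sampleRate;
  // Flag clips shorter than the recommended minimum instead of rejecting them.
  return { mono, seconds, inRecommendedRange: seconds >= minSeconds };
}
```

Returning a flag rather than throwing leaves the decision to the caller, for example showing a warning while still allowing the upload.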

File Structure

pocket-tts-web/
├── index.html              # Main HTML page
├── onnx-streaming.js       # Main thread controller
├── inference-worker.js     # Web Worker for ONNX inference
├── PCMPlayerWorklet.js     # Audio playback worklet
├── EventEmitter.js         # Event utilities
├── sentencepiece.js        # SentencePiece tokenizer library
├── style.css               # Styles
├── tokenizer.model         # SentencePiece model
├── voices.bin              # Predefined voice embeddings
└── onnx/
    ├── mimi_encoder.onnx
    ├── text_conditioner.onnx
    ├── flow_lm_main_int8.onnx
    ├── flow_lm_flow_int8.onnx
    └── mimi_decoder_int8.onnx
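
The layout above splits work between a main-thread controller, an inference worker, and an audio playback worklet. A common way to connect such pieces is a small queue: the worker side pushes PCM chunks as they are generated, and the audio side drains fixed-size blocks, padding with silence when generation falls behind. The sketch below illustrates that pattern only; `PcmQueue` is a hypothetical class and not the contents of `PCMPlayerWorklet.js`.

```javascript
// FIFO of Float32Array chunks that can be drained into fixed-size buffers.
class PcmQueue {
  constructor() {
    this.chunks = [];  // pending chunks, oldest first
    this.offset = 0;   // read position inside chunks[0]
    this.buffered = 0; // total samples waiting to be played
  }

  // Enqueue a chunk of samples produced by the generator.
  push(chunk) {
    if (chunk.length === 0) return;
    this.chunks.push(chunk);
    this.buffered += chunk.length;
  }

  // Fill `out` with queued samples, zero-pad any shortfall (underrun),
  // and return how many real samples were written.
  pull(out) {
    let written = 0;
    while (written < out.length && this.chunks.length > 0) {
      const head = this.chunks[0];
      const n = Math.min(out.length - written, head.length - this.offset);
      out.set(head.subarray(this.offset, this.offset + n), written);
      written += n;
      this.offset += n;
      if (this.offset === head.length) {
        this.chunks.shift();
        this.offset = 0;
      }
    }
    out.fill(0, written); // silence for the part we could not fill
    this.buffered -= written;
    return written;
  }
}
```

Inside an `AudioWorkletProcessor`, `pull` would be called from `process()` with the output channel buffer, while `push` would run in the port's message handler.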

License

  • Models & Voice Embeddings: CC BY 4.0 (inherited from kyutai/pocket-tts)
  • Code: Apache 2.0