---
title: Pocket TTS ONNX Web Demo
emoji: 🌖
colorFrom: yellow
colorTo: pink
sdk: static
app_file: index.html
pinned: false
license: cc-by-4.0
short_description: Real-time voice cloning entirely in your browser! (CPU)
models:
  - KevinAHM/pocket-tts-onnx
custom_headers:
  cross-origin-embedder-policy: require-corp
  cross-origin-opener-policy: same-origin
  cross-origin-resource-policy: cross-origin
---

# Pocket TTS Web Demo

Real-time neural text-to-speech with voice cloning, running entirely in your browser.

## Features

- **Voice Cloning**: Clone any voice from a short audio sample
- **Predefined Voices**: 3 bundled voices (Cosette, Jean, Fantine)
- **Streaming Audio**: Real-time audio generation with low latency
- **Pure Browser**: No inference server required; everything runs in the browser via WebAssembly

## Model Files

The demo requires the following ONNX models in the `onnx/` directory:

| File | Size | Purpose |
|------|------|---------|
| `mimi_encoder.onnx` | ~70 MB | Voice audio → embeddings |
| `text_conditioner.onnx` | ~16 MB | Text tokens → embeddings |
| `flow_lm_main_int8.onnx` | ~73 MB | Autoregressive transformer (INT8) |
| `flow_lm_flow_int8.onnx` | ~10 MB | Flow matching network (INT8) |
| `mimi_decoder_int8.onnx` | ~22 MB | Latents → audio decoder (INT8) |

Additional files:
- `tokenizer.model` - SentencePiece tokenizer (~60 KB)
- `voices.bin` - Predefined voice embeddings (~1.5 MB)
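
As a rough aid when deploying, the file list above can be written down as a manifest and checked before the demo starts. The sketch below is not taken from the demo's source: `MODEL_FILES`, `EXTRA_FILES`, `totalDownloadMB`, and `findMissingFiles` are hypothetical names, the sizes are the approximate figures from the table, and it assumes the host answers `HEAD` requests for static files.

```javascript
// Approximate sizes in megabytes, copied from the table above.
const MODEL_FILES = [
  { path: 'onnx/mimi_encoder.onnx', approxMB: 70 },
  { path: 'onnx/text_conditioner.onnx', approxMB: 16 },
  { path: 'onnx/flow_lm_main_int8.onnx', approxMB: 73 },
  { path: 'onnx/flow_lm_flow_int8.onnx', approxMB: 10 },
  { path: 'onnx/mimi_decoder_int8.onnx', approxMB: 22 },
];

const EXTRA_FILES = [
  { path: 'tokenizer.model', approxMB: 0.06 },
  { path: 'voices.bin', approxMB: 1.5 },
];

// Sum the approximate sizes of a list of manifest entries.
function totalDownloadMB(files) {
  return files.reduce((sum, file) => sum + file.approxMB, 0);
}

// Hypothetical helper: returns the paths that the host does not serve.
// `fetchFn` is injectable so the function can be exercised without a network.
async function findMissingFiles(files, baseUrl, fetchFn = fetch) {
  const missing = [];
  for (const file of files) {
    const response = await fetchFn(new URL(file.path, baseUrl), { method: 'HEAD' });
    if (!response.ok) missing.push(file.path);
  }
  return missing;
}
```

In a page, `findMissingFiles([...MODEL_FILES, ...EXTRA_FILES], location.href)` would resolve to an empty array when every file is in place.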

## Browser Requirements

- Modern browser with WebAssembly support
- Chrome, Edge, Firefox, or Safari (latest versions)
- At least ~200 MB of free RAM for model loading (the ONNX files alone total ~190 MB)
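
These requirements can be probed at startup. The front matter of this README sets `cross-origin-embedder-policy` and `cross-origin-opener-policy`, which is what makes a page cross-origin isolated and lets browsers expose `SharedArrayBuffer`. Whether the demo actually relies on shared memory (for example, for multi-threaded WebAssembly) is an assumption here, and `checkBrowserSupport` is a hypothetical helper, not part of the demo's code.

```javascript
// Feature-detect the capabilities listed above.
// `env` defaults to the global object and is injectable for testing.
function checkBrowserSupport(env = globalThis) {
  const hasWasm =
    typeof env.WebAssembly === 'object' &&
    env.WebAssembly !== null &&
    typeof env.WebAssembly.instantiate === 'function';

  // True only when the COOP/COEP headers are actually in effect for the page.
  const isolated = env.crossOriginIsolated === true;

  // Browsers expose SharedArrayBuffer only on cross-origin isolated pages.
  const hasSharedMemory = typeof env.SharedArrayBuffer === 'function';

  return { hasWasm, isolated, hasSharedMemory };
}
```

A page could call `checkBrowserSupport()` before downloading any model and show a message if `hasWasm` is `false`.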

## Voice Cloning

1. Click "Upload Voice" or select "Custom (Upload)" from the dropdown.
2. Upload an audio file (WAV, MP3, etc.) containing clear speech. For best results, use 3-10 seconds of clean audio.
3. The voice is then encoded and used for all subsequent generations.
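
The 3-10 second guideline above can be turned into an advisory check on decoded audio. This is a minimal sketch, not the demo's implementation: `checkVoiceSample` and `downmixToMono` are hypothetical helpers, the result is a hint rather than a hard rejection, and the downmix assumes the encoder expects single-channel input, which this README does not state.

```javascript
const MIN_SECONDS = 3;
const MAX_SECONDS = 10;

// Advisory check: is the clip within the recommended 3-10 second range?
function checkVoiceSample(numSamples, sampleRate) {
  if (!(sampleRate > 0)) throw new RangeError('sampleRate must be positive');
  const seconds = numSamples / sampleRate;
  if (seconds < MIN_SECONDS) return { ok: false, seconds, reason: 'too short' };
  if (seconds > MAX_SECONDS) return { ok: false, seconds, reason: 'too long' };
  return { ok: true, seconds, reason: null };
}

// Average equal-length channels into one mono Float32Array
// (assumption: the voice encoder takes mono audio).
function downmixToMono(channels) {
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}
```

With a decoded `AudioBuffer` named `buffer`, the check would be `checkVoiceSample(buffer.length, buffer.sampleRate)`.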

## File Structure

```
pocket-tts-web/
├── index.html              # Main HTML page
├── onnx-streaming.js       # Main thread controller
├── inference-worker.js     # Web Worker for ONNX inference
├── PCMPlayerWorklet.js     # Audio playback worklet
├── EventEmitter.js         # Event utilities
├── sentencepiece.js        # SentencePiece tokenizer library
├── style.css               # Styles
├── tokenizer.model         # SentencePiece model
├── voices.bin              # Predefined voice embeddings
└── onnx/
    ├── mimi_encoder.onnx
    ├── text_conditioner.onnx
    ├── flow_lm_main_int8.onnx
    ├── flow_lm_flow_int8.onnx
    └── mimi_decoder_int8.onnx
```
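
The split between an inference worker and a playback worklet implies some buffer between the two: audio arrives in chunks of arbitrary size, while an `AudioWorkletProcessor` has to fill small fixed-size output blocks (typically 128 frames per `process()` call). The sketch below illustrates one way such a buffer could work. It does not reproduce `PCMPlayerWorklet.js`; the `PcmQueue` class and its methods are hypothetical.

```javascript
// FIFO of Float32Array chunks that can be drained in fixed-size blocks.
class PcmQueue {
  constructor() {
    this.chunks = []; // pending chunks, oldest first
    this.offset = 0;  // read position inside chunks[0]
  }

  // Queue a chunk of generated samples.
  push(chunk) {
    if (chunk.length > 0) this.chunks.push(chunk);
  }

  // Fill `out` from the queue, zero-padding if the queue runs dry.
  // Returns the number of real samples copied.
  pull(out) {
    let written = 0;
    while (written < out.length && this.chunks.length > 0) {
      const head = this.chunks[0];
      const n = Math.min(head.length - this.offset, out.length - written);
      out.set(head.subarray(this.offset, this.offset + n), written);
      written += n;
      this.offset += n;
      if (this.offset === head.length) {
        this.chunks.shift();
        this.offset = 0;
      }
    }
    out.fill(0, written); // silence on underrun
    return written;
  }
}
```

Padding with silence on underrun keeps playback running when generation briefly falls behind, at the cost of an audible gap.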

## License

- **Models & Voice Embeddings**: CC BY 4.0 (inherited from [kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts))
- **Code**: Apache 2.0