---
license: apache-2.0
tags:
- text-to-speech
- tts
- onnx
- voice-cloning
- browser
- webassembly
- webgpu
language:
- en
- de
- zh
- ja
- fr
- es
- multilingual
library_name: onnxruntime
base_model: k2-fsa/OmniVoice
---

# VocoLoco: OmniVoice ONNX Models

ONNX exports of [k2-fsa/OmniVoice](https://github.com/k2-fsa/OmniVoice) for browser-based text-to-speech inference via ONNX Runtime Web.

## Models

| File | Size | Description |
|------|------|-------------|
| `omnivoice-main-split.onnx` + `_data_00`-`_04` | 2.3 GB | Main TTS model (FP32, sharded) |
| `omnivoice-main-int8.onnx` | 586 MB | Main TTS model (INT8 quantized, for mobile/low-memory) |
| `omnivoice-decoder.onnx` | 83 MB | Audio token decoder (tokens to waveform) |
| `omnivoice-encoder-fixed.onnx` | 624 MB | Audio encoder for voice cloning |
| `tokenizer.json` | 11 MB | Qwen2 BPE text tokenizer |
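
To make the size trade-off in the table concrete, here is a rough download-budget sketch. It assumes that the decoder and tokenizer are always needed, that the encoder is only fetched when voice cloning is used, and that 2.3 GB is counted as 2300 MB. The names `SIZES_MB` and `downloadBudgetMB` are hypothetical helpers for illustration, not part of VocoLoco.

```typescript
// Sizes in MB, copied from the table above (2.3 GB taken as 2300 MB; an assumption).
const SIZES_MB = {
  mainFp32: 2300,
  mainInt8: 586,
  decoder: 83,
  encoder: 624,
  tokenizer: 11,
} as const;

type MainVariant = "fp32" | "int8";

/**
 * Approximate total download in MB for one configuration.
 * Assumes the encoder is only required for voice cloning.
 */
function downloadBudgetMB(variant: MainVariant, voiceCloning: boolean): number {
  const main = variant === "fp32" ? SIZES_MB.mainFp32 : SIZES_MB.mainInt8;
  const encoder = voiceCloning ? SIZES_MB.encoder : 0;
  return main + SIZES_MB.decoder + SIZES_MB.tokenizer + encoder;
}

console.log(downloadBudgetMB("int8", false)); // 680
console.log(downloadBudgetMB("int8", true)); // 1304
console.log(downloadBudgetMB("fp32", true)); // 3018
```

Under these assumptions, the INT8 variant without voice cloning is the smallest usable set at roughly 680 MB.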

## Usage

These models are designed to run in the browser via [VocoLoco](https://github.com/YOUR_USERNAME/vocoloco), a fully client-side TTS application. No server is required.
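
For readers who want to load the files outside VocoLoco, the following is a minimal sketch of how session settings might be prepared for ONNX Runtime Web. It is not VocoLoco's actual loading code. `planSessions`, `pickExecutionProviders`, and the `https://example.com/models/` base URL are hypothetical; only the file names come from the table above, and `"webgpu"` and `"wasm"` are execution provider names accepted by ONNX Runtime Web.

```typescript
// Hypothetical description of one model to load: where it lives and
// which execution providers to try, in order of preference.
interface SessionPlan {
  url: string;
  options: { executionProviders: string[] };
}

/** Prefer WebGPU when available, and always keep WebAssembly as a fallback. */
function pickExecutionProviders(hasWebGPU: boolean): string[] {
  return hasWebGPU ? ["webgpu", "wasm"] : ["wasm"];
}

/** Build one plan per model file under a base URL (trailing slashes are tolerated). */
function planSessions(baseUrl: string, files: string[], hasWebGPU: boolean): SessionPlan[] {
  const root = baseUrl.replace(/\/+$/, "");
  const executionProviders = pickExecutionProviders(hasWebGPU);
  return files.map((file) => ({
    url: `${root}/${file}`,
    options: { executionProviders: [...executionProviders] },
  }));
}

// Example base URL is an assumption; substitute wherever the files are hosted.
const plans = planSessions(
  "https://example.com/models/",
  ["omnivoice-main-int8.onnx", "omnivoice-decoder.onnx"],
  false,
);
console.log(plans);
```

In a browser, each plan could then be passed to `ort.InferenceSession.create(plan.url, plan.options)` from the `onnxruntime-web` package, with WebGPU support detected by checking `"gpu" in navigator`. The sharded FP32 model needs its external data files supplied as well, which this sketch does not cover.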

## Architecture

- **Backbone**: Qwen3-0.6B (28 transformer layers)
- **Audio codec**: HiggsAudioV2 (8 codebooks, 24 kHz output)
- **Generation**: Iterative masked diffusion (configurable, 8 to 32 steps)
- **Voice cloning**: Zero-shot via reference audio encoding
- **Voice design**: Text-based control (gender, pitch, accent)
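
To illustrate what the step count controls in iterative masked generation, the sketch below computes how many audio tokens would be unmasked at each step. It assumes a cosine masking schedule of the kind used in MaskGIT-style decoders; the schedule OmniVoice actually uses is not stated here and may differ. `unmaskSchedule` and the example token count are hypothetical.

```typescript
/**
 * Number of tokens revealed at each step under a cosine schedule (an assumption).
 * After step t, floor(total * cos(pi/2 * t/steps)) tokens remain masked,
 * and the final step always reveals whatever is left.
 */
function unmaskSchedule(totalTokens: number, steps: number): number[] {
  if (!Number.isInteger(totalTokens) || totalTokens < 0) {
    throw new RangeError("totalTokens must be a non-negative integer");
  }
  if (!Number.isInteger(steps) || steps < 1) {
    throw new RangeError("steps must be a positive integer");
  }
  const counts: number[] = [];
  let masked = totalTokens;
  for (let t = 1; t <= steps; t++) {
    // Force the ratio to exactly 0 on the last step to avoid floating-point residue.
    const ratio = t === steps ? 0 : Math.cos((Math.PI / 2) * (t / steps));
    const nextMasked = Math.floor(totalTokens * ratio);
    counts.push(masked - nextMasked);
    masked = nextMasked;
  }
  return counts;
}

// Hypothetical example: 100 codec frames across 8 codebooks gives 800 tokens.
const schedule8 = unmaskSchedule(800, 8);
const schedule32 = unmaskSchedule(800, 32);
console.log(schedule8, schedule32.length);
```

With this schedule, early steps commit only a few tokens and later steps commit many. More steps spread the same total over smaller increments, at the cost of one extra forward pass of the backbone per step.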

## License

Apache 2.0, the same license as the original OmniVoice model.

## Attribution

Based on [OmniVoice](https://github.com/k2-fsa/OmniVoice) by Xiaomi Corp (k2-fsa).