---
license: apache-2.0
tags:
- text-to-speech
- tts
- onnx
- voice-cloning
- browser
- webassembly
- webgpu
language:
- en
- de
- zh
- ja
- fr
- es
- multilingual
library_name: onnxruntime
base_model: k2-fsa/OmniVoice
---
# VocoLoco — OmniVoice ONNX Models
ONNX exports of [k2-fsa/OmniVoice](https://github.com/k2-fsa/OmniVoice) for browser-based text-to-speech inference via ONNX Runtime Web.
## Models
| File | Size | Description |
|------|------|-------------|
| `omnivoice-main-split.onnx` + `_data_00`-`_04` | 2.3 GB | Main TTS model (FP32, sharded) |
| `omnivoice-main-int8.onnx` | 586 MB | Main TTS model (INT8 quantized, for mobile/low-memory) |
| `omnivoice-decoder.onnx` | 83 MB | Audio token decoder (tokens to waveform) |
| `omnivoice-encoder-fixed.onnx` | 624 MB | Audio encoder for voice cloning |
| `tokenizer.json` | 11 MB | Qwen2 BPE text tokenizer |
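To make the download cost of each configuration concrete, the sizes in the table can be summed per use case. The sketch below is illustrative only: `bundleSizeMb` and `BundleOptions` are hypothetical names, 2.3 GB is rounded to 2300 MB, and it assumes the encoder is needed only when voice cloning is used (the table describes it as the encoder "for voice cloning", but the application may load it differently).

```typescript
// Sizes in MB, copied from the table above (2.3 GB is taken as 2300 MB).
const SIZES_MB = {
  mainFp32: 2300,
  mainInt8: 586,
  decoder: 83,
  encoder: 624,
  tokenizer: 11,
} as const;

interface BundleOptions {
  quantized: boolean;    // true = INT8 main model, false = sharded FP32 main model
  voiceCloning: boolean; // true = also fetch the audio encoder (assumption: optional otherwise)
}

// Hypothetical helper: approximate total download size in MB for one configuration.
function bundleSizeMb(opts: BundleOptions): number {
  const main = opts.quantized ? SIZES_MB.mainInt8 : SIZES_MB.mainFp32;
  const encoder = opts.voiceCloning ? SIZES_MB.encoder : 0;
  return main + SIZES_MB.decoder + SIZES_MB.tokenizer + encoder;
}

console.log(bundleSizeMb({ quantized: true, voiceCloning: false }));  // 680
console.log(bundleSizeMb({ quantized: false, voiceCloning: true }));  // 3018
```

Under these assumptions, the INT8 setup without voice cloning is roughly 0.7 GB, while the full FP32 setup with voice cloning is roughly 3 GB.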
## Usage
These models are designed to run in the browser via [VocoLoco](https://github.com/YOUR_USERNAME/vocoloco), a fully client-side TTS application. No server required.
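For readers wiring the models into their own page, a client might first decide which main model and which ONNX Runtime Web execution providers to request. The following is a minimal sketch, not VocoLoco's actual logic: `planRuntime`, `DeviceInfo`, `RuntimePlan`, and the 8 GB cutoff are all hypothetical. The provider names `"webgpu"` and `"wasm"` are the ones ONNX Runtime Web accepts in the `executionProviders` session option.

```typescript
type ExecutionProvider = "webgpu" | "wasm";

interface DeviceInfo {
  hasWebGpu: boolean;      // e.g. the result of `"gpu" in navigator`
  deviceMemoryGb?: number; // e.g. navigator.deviceMemory, which not every browser exposes
}

interface RuntimePlan {
  mainModel: string;
  executionProviders: ExecutionProvider[];
}

// Hypothetical cutoff below which the INT8 model is preferred.
const LOW_MEMORY_THRESHOLD_GB = 8;

// Hypothetical helper: choose a main model file and a provider preference order.
function planRuntime(device: DeviceInfo): RuntimePlan {
  // Unknown memory is treated as low memory, so the smaller model is the default.
  const lowMemory =
    device.deviceMemoryGb === undefined ||
    device.deviceMemoryGb < LOW_MEMORY_THRESHOLD_GB;
  return {
    mainModel: lowMemory ? "omnivoice-main-int8.onnx" : "omnivoice-main-split.onnx",
    // Prefer WebGPU when present and keep WebAssembly as the fallback.
    executionProviders: device.hasWebGpu ? ["webgpu", "wasm"] : ["wasm"],
  };
}

const plan = planRuntime({ hasWebGpu: true, deviceMemoryGb: 8 });
console.log(plan.mainModel, plan.executionProviders);
```

The resulting `executionProviders` array could then be passed as a session option when creating an ONNX Runtime Web inference session for the chosen file.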
## Architecture
- **Backbone**: Qwen3-0.6B (28 transformer layers)
- **Audio codec**: HiggsAudioV2 (8 codebooks, 24 kHz output)
- **Generation**: Iterative masked diffusion (configurable, 8 to 32 steps)
- **Voice cloning**: Zero-shot via reference audio encoding
- **Voice design**: Text-based control (gender, pitch, accent)
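To illustrate what the step count controls in iterative masked generation, the sketch below computes how many audio tokens would be revealed at each step. It uses a cosine schedule of the kind popularized by MaskGIT, which is an assumption made for illustration and not necessarily the schedule OmniVoice uses. `unmaskSchedule` is a hypothetical name; the 8 to 32 range check mirrors the range stated above.

```typescript
// Illustrative only: a MaskGIT-style cosine schedule, not necessarily OmniVoice's.
// Returns the number of tokens unmasked at each step; the entries sum to totalTokens.
function unmaskSchedule(totalTokens: number, steps: number): number[] {
  if (!Number.isInteger(steps) || steps < 8 || steps > 32) {
    throw new RangeError("steps must be an integer in [8, 32]");
  }
  const counts: number[] = [];
  let masked = totalTokens; // every token starts masked
  for (let t = 1; t <= steps; t++) {
    // Tokens that stay masked after step t; the last step is forced to 0.
    const remaining =
      t === steps
        ? 0
        : Math.floor(totalTokens * Math.cos((Math.PI / 2) * (t / steps)));
    counts.push(masked - remaining);
    masked = remaining;
  }
  return counts;
}

const schedule = unmaskSchedule(800, 8);
console.log(schedule, schedule.reduce((a, b) => a + b, 0));
```

With this schedule, early steps commit only a few tokens and later steps commit progressively more. More steps mean more forward passes of the main model per utterance, so the step count trades generation time against how gradually the output is refined.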
## License
Apache 2.0, the same license as the original OmniVoice model.
## Attribution
Based on [OmniVoice](https://github.com/k2-fsa/OmniVoice) by Xiaomi Corp (k2-fsa).