---
license: apache-2.0
tags:
  - text-to-speech
  - tts
  - onnx
  - voice-cloning
  - browser
  - webassembly
  - webgpu
language:
  - en
  - de
  - zh
  - ja
  - fr
  - es
  - multilingual
library_name: onnxruntime
base_model: k2-fsa/OmniVoice
---

# VocoLoco — OmniVoice ONNX Models

ONNX exports of [k2-fsa/OmniVoice](https://github.com/k2-fsa/OmniVoice) for browser-based text-to-speech inference via ONNX Runtime Web.

## Models

| File | Size | Description |
|------|------|-------------|
| `omnivoice-main-split.onnx` + `_data_00` to `_04` | 2.3 GB | Main TTS model (FP32, sharded) |
| `omnivoice-main-int8.onnx` | 586 MB | Main TTS model (INT8 quantized, for mobile/low-memory) |
| `omnivoice-decoder.onnx` | 83 MB | Audio token decoder (tokens to waveform) |
| `omnivoice-encoder-fixed.onnx` | 624 MB | Audio encoder for voice cloning |
| `tokenizer.json` | 11 MB | Qwen2 BPE text tokenizer |
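
As a rough guide to how much a browser has to fetch, the sizes in the table can be added up per configuration. The sketch below is only arithmetic on the listed figures: `totalDownloadMB` and `SIZE_MB` are hypothetical names, 2.3 GB is rounded to 2300 MB, and counting the encoder only when voice cloning is enabled is an assumption about how an application would load the files.

```typescript
// Sizes in MB, copied from the table above (2.3 GB taken as 2300 MB).
type Variant = "fp32" | "int8";

const SIZE_MB = {
  mainFp32: 2300, // omnivoice-main-split.onnx plus its data shards
  mainInt8: 586, // omnivoice-main-int8.onnx
  decoder: 83, // omnivoice-decoder.onnx
  encoder: 624, // omnivoice-encoder-fixed.onnx
  tokenizer: 11, // tokenizer.json
} as const;

// Total download for one configuration. The encoder is counted only when
// voice cloning is requested (an assumption, not stated by the model card).
function totalDownloadMB(variant: Variant, withVoiceCloning: boolean): number {
  const main = variant === "fp32" ? SIZE_MB.mainFp32 : SIZE_MB.mainInt8;
  const encoder = withVoiceCloning ? SIZE_MB.encoder : 0;
  return main + SIZE_MB.decoder + SIZE_MB.tokenizer + encoder;
}

console.log(totalDownloadMB("int8", false)); // 680
console.log(totalDownloadMB("int8", true)); // 1304
console.log(totalDownloadMB("fp32", true)); // 3018
```

Under these assumptions, the INT8 configuration without voice cloning is the smallest at about 680 MB, which is consistent with the table's recommendation of INT8 for mobile and low-memory devices.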

## Usage

These models are designed to run in the browser via [VocoLoco](https://github.com/YOUR_USERNAME/vocoloco), a fully client-side TTS application, so no server is required.
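
For readers who want to load a model directly with ONNX Runtime Web rather than through VocoLoco, a minimal sketch of choosing execution providers might look like the following. It is not taken from VocoLoco: `sessionOptionsFor` and `modelUrl` are hypothetical names, and the preference for WebGPU with a WebAssembly fallback is an assumption based on this card's tags.

```typescript
// Hypothetical helper: builds the options object that would be passed to
// ort.InferenceSession.create() in onnxruntime-web.
interface SessionOptionsSketch {
  executionProviders: string[];
}

function sessionOptionsFor(hasWebGPU: boolean): SessionOptionsSketch {
  // Providers are listed in priority order, so "wasm" serves as the fallback.
  return { executionProviders: hasWebGPU ? ["webgpu", "wasm"] : ["wasm"] };
}

// In a browser this might be used as follows (not executed here; depending on
// the onnxruntime-web version, the WebGPU build may need a different import path):
//   import * as ort from "onnxruntime-web";
//   const opts = sessionOptionsFor("gpu" in navigator);
//   const session = await ort.InferenceSession.create(modelUrl, opts);
console.log(sessionOptionsFor(false).executionProviders);
console.log(sessionOptionsFor(true).executionProviders);
```

Taking WebGPU availability as a parameter keeps the helper free of browser globals, so it can be exercised outside a browser.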

## Architecture

- **Backbone**: Qwen3-0.6B (28 transformer layers)
- **Audio codec**: HiggsAudioV2 (8 codebooks, 24 kHz output)
- **Generation**: Iterative masked diffusion (configurable 8-32 steps)
- **Voice cloning**: Zero-shot via reference audio encoding
- **Voice design**: Text-based control (gender, pitch, accent)
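
To illustrate what a configurable step count means for iterative masked generation, the sketch below splits a number of masked token positions evenly across the unmasking steps. This is an illustration only: `unmaskSchedule` is a hypothetical function, the even split is an assumed schedule, and the schedule OmniVoice actually uses is not described in this card. Only the 8 to 32 step range comes from the list above.

```typescript
// Illustrative only: splits `totalTokens` masked positions evenly over `steps`
// unmasking iterations. This is an assumed schedule, not OmniVoice's own.
function unmaskSchedule(totalTokens: number, steps: number): number[] {
  if (!Number.isInteger(steps) || steps < 8 || steps > 32) {
    throw new RangeError("steps must be an integer between 8 and 32");
  }
  if (!Number.isInteger(totalTokens) || totalTokens < 0) {
    throw new RangeError("totalTokens must be a non-negative integer");
  }
  const base = Math.floor(totalTokens / steps);
  const remainder = totalTokens % steps;
  // The first `remainder` steps reveal one extra token so the counts sum exactly.
  return Array.from({ length: steps }, (_, i) => base + (i < remainder ? 1 : 0));
}

const schedule = unmaskSchedule(100, 8);
console.log(schedule); // [13, 13, 13, 13, 12, 12, 12, 12]
```

With any schedule of this kind, more steps mean fewer tokens are committed per iteration, at the cost of more forward passes through the main model.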

## License

Apache 2.0, the same license as the original OmniVoice model.

## Attribution

Based on [OmniVoice](https://github.com/k2-fsa/OmniVoice) by Xiaomi Corp (k2-fsa).