	---
	license: apache-2.0
	tags:
	- text-to-speech
	- tts
	- onnx
	- voice-cloning
	- browser
	- webassembly
	- webgpu
	language:
	- en
	- de
	- zh
	- ja
	- fr
	- es
	- multilingual
	library_name: onnxruntime
	base_model: k2-fsa/OmniVoice
	---

	# VocoLoco — OmniVoice ONNX Models

	ONNX exports of [k2-fsa/OmniVoice](https://github.com/k2-fsa/OmniVoice) for browser-based text-to-speech inference via ONNX Runtime Web.

	## Models

	| File | Size | Description |
	|------|------|-------------|
	| `omnivoice-main-split.onnx` + `_data_00`-`_04` | 2.3 GB | Main TTS model (FP32, sharded) |
	| `omnivoice-main-int8.onnx` | 586 MB | Main TTS model (INT8 quantized, for mobile/low-memory) |
	| `omnivoice-decoder.onnx` | 83 MB | Audio token decoder (tokens to waveform) |
	| `omnivoice-encoder-fixed.onnx` | 624 MB | Audio encoder for voice cloning |
	| `tokenizer.json` | 11 MB | Qwen2 BPE text tokenizer |
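
To get a feel for what a client has to download, the sizes in the table can be added up per configuration. The sketch below is illustrative only: `estimateDownloadMB` and `DownloadOptions` are hypothetical names, 2.3 GB is rounded to 2300 MB, and it assumes the encoder is fetched only when voice cloning is enabled.

```typescript
// Sizes in MB copied from the table above; 2.3 GB is treated as 2300 MB.
const SIZES_MB = {
  mainFp32: 2300,
  mainInt8: 586,
  decoder: 83,
  encoder: 624,
  tokenizer: 11,
} as const;

// Hypothetical options type, not part of any published API.
interface DownloadOptions {
  precision: "fp32" | "int8";
  voiceCloning: boolean; // assumption: the encoder is only needed for cloning
}

// Sum the files a given configuration would need to fetch.
function estimateDownloadMB(opts: DownloadOptions): number {
  const main =
    opts.precision === "fp32" ? SIZES_MB.mainFp32 : SIZES_MB.mainInt8;
  const encoder = opts.voiceCloning ? SIZES_MB.encoder : 0;
  return main + SIZES_MB.decoder + SIZES_MB.tokenizer + encoder;
}

const mobileMB = estimateDownloadMB({ precision: "int8", voiceCloning: false });
console.log(`INT8, no cloning: ${mobileMB} MB`);
```

Under these assumptions, the INT8 configuration without voice cloning is the smallest download, which matches its stated purpose for mobile and low-memory devices.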

	## Usage

	These models are designed to run in the browser via [VocoLoco](https://github.com/YOUR_USERNAME/vocoloco), a fully client-side TTS application. No server required.
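
For readers who want to load the files outside the VocoLoco app, the following sketch builds the URLs and session options that could be handed to ONNX Runtime Web. It is not taken from the VocoLoco source. `buildSessionPlans` and `SessionPlan` are hypothetical names, the base URL assumes the standard Hugging Face `resolve/main` path for a repository named `Gigsu/vocoloco-onnx`, and the sketch uses the single-file INT8 model because the exact shard file names for the FP32 model are not spelled out here.

```typescript
// Assumption: files are served from the usual Hugging Face "resolve" URL.
const DEFAULT_BASE_URL =
  "https://huggingface.co/Gigsu/vocoloco-onnx/resolve/main";

// Hypothetical description of one session to create.
interface SessionPlan {
  name: "main" | "decoder" | "encoder";
  url: string;
  options: { executionProviders: string[] };
}

// Build the list of model files to load and the backends to try, in order.
function buildSessionPlans(
  hasWebGpu: boolean,
  voiceCloning: boolean,
  baseUrl: string = DEFAULT_BASE_URL,
): SessionPlan[] {
  // Prefer WebGPU when available and fall back to WebAssembly.
  const executionProviders = hasWebGpu ? ["webgpu", "wasm"] : ["wasm"];
  const files: Array<[SessionPlan["name"], string]> = [
    ["main", "omnivoice-main-int8.onnx"],
    ["decoder", "omnivoice-decoder.onnx"],
  ];
  if (voiceCloning) {
    files.push(["encoder", "omnivoice-encoder-fixed.onnx"]);
  }
  return files.map(([name, file]) => ({
    name,
    url: `${baseUrl}/${file}`,
    options: { executionProviders: [...executionProviders] },
  }));
}

// In a browser with onnxruntime-web imported as `ort`, each plan could be used as:
//   const session = await ort.InferenceSession.create(plan.url, plan.options);
const plans = buildSessionPlans(false, true);
console.log(plans.map((p) => p.url));
```

Depending on the onnxruntime-web version, the WebGPU backend may need to be imported from the `onnxruntime-web/webgpu` entry point; whether the INT8 model runs well on WebGPU is not stated here and would need to be checked.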

	## Architecture

	- Backbone: Qwen3-0.6B (28 transformer layers)
	- Audio codec: HiggsAudioV2 (8 codebooks, 24 kHz output)
	- Generation: Iterative masked diffusion (configurable, 8 to 32 steps)
	- Voice cloning: Zero-shot via reference audio encoding
	- Voice design: Text-based control (gender, pitch, accent)
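
To illustrate what a configurable step count means for iterative masked generation, here is a minimal sketch of a cosine unmasking schedule of the kind commonly used in masked-token generators. It is an assumption for illustration, not OmniVoice's actual schedule: `clampSteps` and `unmaskSchedule` are hypothetical helpers, and only the 8 to 32 step range comes from the list above.

```typescript
// Keep the requested step count inside the documented 8 to 32 range.
function clampSteps(requested: number): number {
  return Math.min(32, Math.max(8, Math.round(requested)));
}

// Return how many masked tokens to reveal at each step.
// Assumed schedule: the fraction still masked after step t is cos(pi/2 * t/steps).
function unmaskSchedule(totalTokens: number, steps: number): number[] {
  const counts: number[] = [];
  let revealed = 0;
  for (let t = 1; t <= steps; t++) {
    const maskedFraction = Math.cos((Math.PI / 2) * (t / steps));
    // Force the final step to reveal everything, avoiding float round-off.
    const targetRevealed =
      t === steps
        ? totalTokens
        : Math.floor(totalTokens * (1 - maskedFraction));
    counts.push(targetRevealed - revealed);
    revealed = targetRevealed;
  }
  return counts;
}

const steps = clampSteps(12);
console.log(unmaskSchedule(200, steps));
```

With this kind of schedule, few tokens are committed in the early steps and more in the later ones; fewer steps trade quality for speed, which is presumably why the step count is exposed as a setting.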

	## License

	Apache 2.0, the same license as the original OmniVoice model.

	## Attribution

	Based on [OmniVoice](https://github.com/k2-fsa/OmniVoice) by Xiaomi Corp (k2-fsa).