	---
	license: apache-2.0
	library_name: onnx
	tags:
	- audio
	- audio-tokenizer
	- neural-codec
	- moss-tts-family
	- MOSS Audio Tokenizer
	- speech-tokenizer
	- onnx
	- tensorrt
	---

	# MOSS-Audio-Tokenizer-ONNX

	This repository provides the ONNX exports of [MOSS-Audio-Tokenizer](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer) (encoder & decoder), enabling torch-free audio encoding/decoding for the [MOSS-TTS](https://github.com/OpenMOSS/MOSS-TTS) family.

	## Overview

	MOSS-Audio-Tokenizer is the unified discrete audio interface for the entire MOSS-TTS Family. It is based on the Cat (Causal Audio Tokenizer with Transformer) architecture: a 1.6B-parameter, pure causal Transformer audio tokenizer trained on 3M hours of diverse audio.

	This ONNX repository is designed for lightweight, torch-free deployment scenarios. It serves as the audio tokenizer component in the [MOSS-TTS llama.cpp inference backend](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_delay/llama_cpp/README.md), which combines [llama.cpp](https://github.com/ggerganov/llama.cpp) (for the Qwen3 backbone) with ONNX Runtime or TensorRT (for the audio tokenizer) to achieve fully PyTorch-free TTS inference.

	### Supported Backends

	\| Backend \| Runtime \| Use Case \|
	\|---------\|---------\|----------\|
	\| ONNX Runtime (GPU) \| `onnxruntime-gpu` \| Recommended starting point \|
	\| ONNX Runtime (CPU) \| `onnxruntime` \| CPU-only / no CUDA \|
	\| TensorRT \| Build from ONNX \| Maximum throughput (user-built engines) \|

	> Note: We do not provide pre-built TensorRT engines, as they are tied to your specific GPU architecture and TensorRT version. To use TensorRT, build the engines from the ONNX models yourself; see `moss_audio_tokenizer/trt/build_engine.sh` in the main repository.

	## Repository Contents

	\| File \| Description \|
	\|------\|-------------\|
	\| `encoder.onnx` \| ONNX model for audio encoding (waveform → discrete codes) \|
	\| `decoder.onnx` \| ONNX model for audio decoding (discrete codes → waveform) \|

	## Quick Start

	```bash
	# Download
	huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX \
	--local-dir weights/MOSS-Audio-Tokenizer-ONNX
	```

	This is typically used together with [MOSS-TTS-GGUF](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-GGUF) for the llama.cpp inference pipeline. See the [llama.cpp Backend documentation](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_delay/llama_cpp/README.md) for the full end-to-end setup.
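
After the download, a quick sanity check can confirm that both models are present and show which tensors they expect. The sketch below is a minimal example, not the official loader. The helper names (`missing_files`, `pick_providers`, `describe_model`) are hypothetical, and the input/output tensor names are read from the model rather than assumed, since this README does not list them. It relies only on the ONNX Runtime calls `get_available_providers`, `InferenceSession`, `get_inputs`, and `get_outputs`.

```python
from pathlib import Path

# File names taken from the "Repository Contents" table.
EXPECTED_FILES = ("encoder.onnx", "decoder.onnx")

# Prefer CUDA when onnxruntime-gpu is installed, otherwise fall back to CPU.
PREFERRED_PROVIDERS = ("CUDAExecutionProvider", "CPUExecutionProvider")


def missing_files(local_dir):
    """Return the expected model files that are absent from local_dir."""
    root = Path(local_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def pick_providers(available):
    """Keep the preferred providers that this ONNX Runtime build offers."""
    chosen = [p for p in PREFERRED_PROVIDERS if p in available]
    return chosen or ["CPUExecutionProvider"]


def describe_model(model_path):
    """Open a session and print the tensor names, shapes, and types it reports."""
    import onnxruntime as ort  # imported lazily so the helpers above work without it

    session = ort.InferenceSession(
        str(model_path),
        providers=pick_providers(ort.get_available_providers()),
    )
    for kind, tensors in (("input", session.get_inputs()),
                          ("output", session.get_outputs())):
        for tensor in tensors:
            print(f"{kind}: {tensor.name} shape={tensor.shape} type={tensor.type}")
    return session
```

For example, `missing_files("weights/MOSS-Audio-Tokenizer-ONNX")` should return an empty list after a successful download, and `describe_model("weights/MOSS-Audio-Tokenizer-ONNX/encoder.onnx")` prints the encoder's interface as reported by the model itself.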

	## Main Repositories

	\| Repository \| Description \|
	\|------------\|-------------\|
	\| [OpenMOSS/MOSS-TTS](https://github.com/OpenMOSS/MOSS-TTS) \| MOSS-TTS Family main repository (includes llama.cpp backend, PyTorch inference, and all models) \|
	\| [OpenMOSS/MOSS-Audio-Tokenizer](https://github.com/OpenMOSS/MOSS-Audio-Tokenizer) \| MOSS-Audio-Tokenizer source code, PyTorch weights, ONNX/TRT export scripts, and evaluation \|
	\| [OpenMOSS-Team/MOSS-Audio-Tokenizer](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer) \| PyTorch weights on Hugging Face (for `trust_remote_code=True` usage) \|
	\| [OpenMOSS-Team/MOSS-TTS-GGUF](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-GGUF) \| Pre-quantized GGUF backbone weights (companion to this ONNX repo) \|

	## About MOSS-Audio-Tokenizer

	MOSS-Audio-Tokenizer compresses 24 kHz raw audio to a 12.5 Hz frame rate using a 32-layer Residual Vector Quantizer (RVQ), supporting high-fidelity reconstruction at bitrates from 0.125 kbps to 4 kbps. It is trained from scratch on 3 million hours of speech, sound effects, and music, achieving state-of-the-art reconstruction quality among open-source audio tokenizers.
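
The bitrate range follows from the frame rate and the number of RVQ layers in use. Below is a rough worked example with hypothetical helper names. It assumes each codebook index costs 10 bits (1024 entries per codebook); this value is inferred from the 0.125 kbps figure at a single quantizer and is not stated in this README.

```python
SAMPLE_RATE_HZ = 24_000   # input audio sample rate
FRAME_RATE_HZ = 12.5      # token frames per second
MAX_QUANTIZERS = 32       # RVQ depth
BITS_PER_CODE = 10        # assumption: 1024-entry codebooks (inferred, not documented)


def samples_per_frame():
    """Number of waveform samples summarized by one token frame."""
    return int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)


def bitrate_bps(num_quantizers):
    """Bits per second when only the first num_quantizers RVQ layers are used."""
    if not 1 <= num_quantizers <= MAX_QUANTIZERS:
        raise ValueError(f"num_quantizers must be in 1..{MAX_QUANTIZERS}")
    return FRAME_RATE_HZ * num_quantizers * BITS_PER_CODE


if __name__ == "__main__":
    print(samples_per_frame())   # 1920
    print(bitrate_bps(1))        # 125.0
    print(bitrate_bps(32))       # 4000.0
```

Under this assumption, one quantizer gives 125 bps and all 32 give 4000 bps, matching the stated range. Eight quantizers give 1000 bps, which agrees with the Nq and bps columns of the evaluation table.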

	For the full model description, architecture details, and evaluation metrics, please refer to:
	- [MOSS-Audio-Tokenizer GitHub Repository](https://github.com/OpenMOSS/MOSS-Audio-Tokenizer)
	- [MOSS-TTS README — Audio Tokenizer Section](https://github.com/OpenMOSS/MOSS-TTS#moss-audio-tokenizer)

	## Evaluation Metrics

	The table below compares the reconstruction quality of open-source audio tokenizers with MOSS-Audio-Tokenizer on speech and audio/music data.

	- Speech metrics are evaluated on LibriSpeech test-clean (English) and AISHELL-2 (Chinese), reported as EN/ZH.
	- Audio metrics are evaluated on the AudioSet evaluation subset, while music metrics are evaluated on MUSDB, reported as audio/music.
	- STFT-Dist. denotes the STFT distance.
	- Higher is better for speech metrics, while lower is better for audio/music metrics (Mel-Loss, STFT-Dist.).
	- Nq denotes the number of quantizers.

	\| Model \| bps \| Frame rate \| Nq \| Speech: SIM ↑ (EN/ZH) \| Speech: STOI ↑ (EN/ZH) \| Speech: PESQ-NB ↑ (EN/ZH) \| Speech: PESQ-WB ↑ (EN/ZH) \| Audio/Music: Mel-Loss ↓ \| Audio/Music: STFT-Dist. ↓ \|
	\| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \|
	\| XCodec2.0 \| 800 \| 50 \| 1 \| 0.82 / 0.74 \| 0.92 / 0.86 \| 3.04 / 2.46 \| 2.43 / 1.96 \| -- / -- \| -- / -- \|
	\| MiMo Audio Tokenizer \| 850 \| 25 \| 4 \| 0.80 / 0.74 \| 0.91 / 0.87 \| 2.94 / 2.62 \| 2.39 / 2.14 \| 0.82 / 0.81 \| 2.33 / 2.23 \|
	\| Higgs Audio Tokenizer \| 1000 \| 25 \| 4 \| 0.77 / 0.68 \| 0.83 / 0.82 \| 3.03 / 2.61 \| 2.48 / 2.14 \| 0.83 / 0.80 \| 2.20 / 2.05 \|
	\| SpeechTokenizer \| 1000 \| 50 \| 2 \| 0.36 / 0.25 \| 0.77 / 0.68 \| 1.59 / 1.38 \| 1.25 / 1.17 \| -- / -- \| -- / -- \|
	\| XY-Tokenizer \| 1000 \| 12.5 \| 8 \| 0.85 / 0.79 \| 0.92 / 0.87 \| 3.10 / 2.63 \| 2.50 / 2.12 \| -- / -- \| -- / -- \|
	\| BigCodec \| 1040 \| 80 \| 1 \| 0.84 / 0.69 \| 0.93 / 0.88 \| 3.27 / 2.55 \| 2.68 / 2.06 \| -- / -- \| -- / -- \|
	\| Mimi \| 1100 \| 12.5 \| 8 \| 0.74 / 0.59 \| 0.91 / 0.85 \| 2.80 / 2.24 \| 2.25 / 1.78 \| 1.24 / 1.19 \| 2.62 / 2.49 \|
	\| MOSS Audio Tokenizer (Ours) \| 750 \| 12.5 \| 6 \| 0.82 / 0.75 \| 0.93 / 0.89 \| 3.14 / 2.73 \| 2.60 / 2.22 \| 0.86 / 0.85 \| 2.21 / 2.10 \|
	\| MOSS Audio Tokenizer (Ours) \| 1000 \| 12.5 \| 8 \| 0.88 / 0.81 \| 0.94 / 0.91 \| 3.38 / 2.96 \| 2.87 / 2.43 \| 0.82 / 0.80 \| 2.16 / 2.04 \|
	\| — \| — \| — \| — \| — \| — \| — \| — \| — \| — \|
	\| DAC \| 1500 \| 75 \| 2 \| 0.48 / 0.41 \| 0.83 / 0.79 \| 1.87 / 1.67 \| 1.48 / 1.37 \| -- / -- \| -- / -- \|
	\| Encodec \| 1500 \| 75 \| 2 \| 0.60 / 0.45 \| 0.85 / 0.81 \| 1.94 / 1.80 \| 1.56 / 1.48 \| 1.12 / 1.04 \| 2.60 / 2.42 \|
	\| Higgs Audio Tokenizer \| 2000 \| 25 \| 8 \| 0.90 / 0.83 \| 0.85 / 0.85 \| 3.59 / 3.22 \| 3.11 / 2.73 \| 0.74 / 0.70 \| 2.07 / 1.92 \|
	\| SpeechTokenizer \| 2000 \| 50 \| 4 \| 0.66 / 0.50 \| 0.88 / 0.80 \| 2.38 / 1.79 \| 1.92 / 1.49 \| -- / -- \| -- / -- \|
	\| Qwen3 TTS Tokenizer \| 2200 \| 12.5 \| 16 \| 0.95 / 0.88 \| 0.96 / 0.93 \| 3.66 / 3.10 \| 3.19 / 2.62 \| -- / -- \| -- / -- \|
	\| MiMo Audio Tokenizer \| 2250 \| 25 \| 12 \| 0.89 / 0.83 \| 0.95 / 0.92 \| 3.57 / 3.25 \| 3.05 / 2.71 \| 0.70 / 0.68 \| 2.21 / 2.10 \|
	\| Mimi \| 2475 \| 12.5 \| 18 \| 0.89 / 0.76 \| 0.94 / 0.91 \| 3.49 / 2.90 \| 2.97 / 2.35 \| 1.10 / 1.06 \| 2.45 / 2.32 \|
	\| MOSS Audio Tokenizer (Ours) \| 1500 \| 12.5 \| 12 \| 0.92 / 0.86 \| 0.95 / 0.93 \| 3.64 / 3.27 \| 3.20 / 2.74 \| 0.77 / 0.74 \| 2.08 / 1.96 \|
	\| MOSS Audio Tokenizer (Ours) \| 2000 \| 12.5 \| 16 \| 0.95 / 0.89 \| 0.96 / 0.94 \| 3.78 / 3.46 \| 3.41 / 2.96 \| 0.73 / 0.70 \| 2.03 / 1.90 \|
	\| — \| — \| — \| — \| — \| — \| — \| — \| — \| — \|
	\| DAC \| 3000 \| 75 \| 4 \| 0.74 / 0.67 \| 0.90 / 0.88 \| 2.76 / 2.47 \| 2.31 / 2.07 \| 0.86 / 0.83 \| 2.23 / 2.10 \|
	\| MiMo Audio Tokenizer \| 3650 \| 25 \| 20 \| 0.91 / 0.85 \| 0.95 / 0.93 \| 3.73 / 3.44 \| 3.25 / 2.89 \| 0.66 / 0.65 \| 2.17 / 2.06 \|
	\| SpeechTokenizer \| 4000 \| 50 \| 8 \| 0.85 / 0.69 \| 0.92 / 0.85 \| 3.05 / 2.20 \| 2.60 / 1.87 \| -- / -- \| -- / -- \|
	\| Mimi \| 4400 \| 12.5 \| 32 \| 0.94 / 0.83 \| 0.96 / 0.94 \| 3.80 / 3.31 \| 3.43 / 2.78 \| 1.02 / 0.98 \| 2.34 / 2.21 \|
	\| Encodec \| 4500 \| 75 \| 6 \| 0.86 / 0.75 \| 0.92 / 0.91 \| 2.91 / 2.63 \| 2.46 / 2.15 \| 0.91 / 0.84 \| 2.33 / 2.17 \|
	\| DAC \| 6000 \| 75 \| 8 \| 0.89 / 0.84 \| 0.95 / 0.94 \| 3.75 / 3.57 \| 3.41 / 3.20 \| 0.65 / 0.63 \| 1.97 / 1.87 \|
	\| MOSS Audio Tokenizer (Ours) \| 3000 \| 12.5 \| 24 \| 0.96 / 0.92 \| 0.97 / 0.96 \| 3.90 / 3.64 \| 3.61 / 3.20 \| 0.69 / 0.66 \| 1.98 / 1.84 \|
	\| MOSS Audio Tokenizer (Ours) \| 4000 \| 12.5 \| 32 \| 0.97 / 0.93 \| 0.97 / 0.96 \| 3.95 / 3.71 \| 3.69 / 3.30 \| 0.68 / 0.64 \| 1.96 / 1.82 \|

	### LibriSpeech Speech Metrics (MOSS Audio Tokenizer vs. Open-source Tokenizers)

	The plots below compare our MOSS Audio Tokenizer model with other open-source speech tokenizers on the LibriSpeech dataset, evaluated with SIM, STOI, PESQ-NB, and PESQ-WB (higher is better).
	We control the bps of the same model by adjusting the number of RVQ codebooks used during inference.
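
Selecting a bitrate this way amounts to keeping only the first few RVQ codebooks of each frame, since later codebooks refine the residual left by earlier ones. Here is a minimal sketch, assuming the codes are stored as one list per quantizer (layout `[Nq][T]`). The tensor layout expected by `decoder.onnx`, and whether it accepts a reduced `Nq`, are not specified in this README, and `truncate_codes` is a hypothetical helper.

```python
def truncate_codes(codes, num_quantizers):
    """Keep the first num_quantizers codebooks (coarse-to-fine order).

    codes: sequence of per-quantizer code sequences, assumed layout [Nq][T].
    Returns a new list of lists; the input is left unmodified.
    """
    if not 1 <= num_quantizers <= len(codes):
        raise ValueError(f"num_quantizers must be in 1..{len(codes)}")
    return [list(layer) for layer in codes[:num_quantizers]]


if __name__ == "__main__":
    # Toy example: 3 quantizers, 4 frames.
    toy = [[5, 1, 7, 2], [9, 9, 0, 3], [4, 6, 8, 1]]
    print(truncate_codes(toy, 2))  # [[5, 1, 7, 2], [9, 9, 0, 3]]
```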

	<table>
	<tr>
	<td align="center"><b>SIM</b><br><img src="images/sim.png" width="100%"></td>
	<td align="center"><b>STOI</b><br><img src="images/stoi.png" width="100%"></td>
	</tr>
	<tr>
	<td align="center"><b>PESQ-NB</b><br><img src="images/pesq-nb.png" width="100%"></td>
	<td align="center"><b>PESQ-WB</b><br><img src="images/pesq-wb.png" width="100%"></td>
	</tr>
	</table>


	## Citation
	If you use this code or these results in your paper, please cite our work as:
	```tex

	```