---
license: apache-2.0
library_name: onnx
tags:
- audio
- audio-tokenizer
- neural-codec
- moss-tts-family
- MOSS Audio Tokenizer
- speech-tokenizer
- onnx
- tensorrt
---
# MOSS-Audio-Tokenizer-ONNX
This repository provides the ONNX exports of MOSS-Audio-Tokenizer (encoder & decoder), enabling torch-free audio encoding/decoding for the MOSS-TTS family.
## Overview
MOSS-Audio-Tokenizer is the unified discrete audio interface for the entire MOSS-TTS Family, based on the Cat (Causal Audio Tokenizer with Transformer) architecture: a 1.6B-parameter, pure causal Transformer audio tokenizer trained on 3 million hours of diverse audio.
This ONNX repository is designed for lightweight, torch-free deployment scenarios. It serves as the audio tokenizer component in the MOSS-TTS llama.cpp inference backend, which combines llama.cpp (for the Qwen3 backbone) with ONNX Runtime or TensorRT (for the audio tokenizer) to achieve fully PyTorch-free TTS inference.
## Supported Backends
| Backend | Runtime | Use Case |
|---|---|---|
| ONNX Runtime (GPU) | `onnxruntime-gpu` | Recommended starting point |
| ONNX Runtime (CPU) | `onnxruntime` | CPU-only / no CUDA |
| TensorRT | Build from ONNX | Maximum throughput (user-built engines) |
Note: We do not provide pre-built TensorRT engines, as they are tied to your specific GPU architecture and TensorRT version. To use TensorRT, build engines from the ONNX models yourself; see `moss_audio_tokenizer/trt/build_engine.sh` in the main repository.
## Repository Contents
| File | Description |
|---|---|
| `encoder.onnx` | ONNX model for audio encoding (waveform → discrete codes) |
| `decoder.onnx` | ONNX model for audio decoding (discrete codes → waveform) |
## Quick Start

```bash
# Download
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX \
  --local-dir weights/MOSS-Audio-Tokenizer-ONNX
```
This is typically used together with MOSS-TTS-GGUF for the llama.cpp inference pipeline. See the llama.cpp Backend documentation for the full end-to-end setup.
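For standalone use without the llama.cpp pipeline, the encoder can be run directly with ONNX Runtime. The sketch below is a minimal example, not the official API: the input tensor layout `(batch, channels, samples)` is an assumption about the export, so the tensor name is read from the graph rather than hard-coded, and the session is only created if the downloaded weights are present. The frames-per-second arithmetic follows from the documented 24 kHz input and 12.5 Hz token rates.

```python
import os

import numpy as np

# One token frame covers 24000 / 12.5 = 1920 samples
# (derived from the 24 kHz sample rate and 12.5 Hz frame rate).
SAMPLES_PER_FRAME = int(24000 / 12.5)  # 1920

def num_frames(num_samples: int) -> int:
    """Expected number of code frames for a waveform of `num_samples` samples."""
    return num_samples // SAMPLES_PER_FRAME

MODEL = "weights/MOSS-Audio-Tokenizer-ONNX/encoder.onnx"
if os.path.exists(MODEL):  # only runs once the weights are downloaded
    import onnxruntime as ort

    sess = ort.InferenceSession(MODEL, providers=["CPUExecutionProvider"])
    # ASSUMPTION: (batch, channels, samples) float32 input; 1 s of mono silence.
    wav = np.zeros((1, 1, 24000), dtype=np.float32)
    input_name = sess.get_inputs()[0].name  # tensor names are not documented here
    codes = sess.run(None, {input_name: wav})[0]
    print(codes.shape)
```

Swap `CPUExecutionProvider` for `CUDAExecutionProvider` with `onnxruntime-gpu` installed.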
## Main Repositories
| Repository | Description |
|---|---|
| OpenMOSS/MOSS-TTS | MOSS-TTS Family main repository (includes llama.cpp backend, PyTorch inference, and all models) |
| OpenMOSS/MOSS-Audio-Tokenizer | MOSS-Audio-Tokenizer source code, PyTorch weights, ONNX/TRT export scripts, and evaluation |
| OpenMOSS-Team/MOSS-Audio-Tokenizer | PyTorch weights on Hugging Face (for trust_remote_code=True usage) |
| OpenMOSS-Team/MOSS-TTS-GGUF | Pre-quantized GGUF backbone weights (companion to this ONNX repo) |
## About MOSS-Audio-Tokenizer
MOSS-Audio-Tokenizer compresses 24 kHz raw audio to a 12.5 Hz frame rate using a 32-layer Residual Vector Quantizer (RVQ), supporting high-fidelity reconstruction from 0.125 kbps to 4 kbps. It is trained from scratch on 3 million hours of speech, sound effects, and music, achieving state-of-the-art reconstruction quality among open-source audio tokenizers.
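The bitrates in the evaluation table follow directly from frame rate × number of quantizers × bits per code. The 1024-entry codebook size below is inferred from the table (1000 bps at 12.5 Hz with 8 quantizers implies 10 bits per code), not stated in this card; treat it as an assumption.

```python
import math

FRAME_RATE_HZ = 12.5
CODEBOOK_SIZE = 1024  # inferred: 1000 bps / (12.5 Hz * 8 quantizers) = 10 bits

def bitrate_bps(n_quantizers: int) -> float:
    """Bitrate when keeping the first `n_quantizers` RVQ levels."""
    return FRAME_RATE_HZ * n_quantizers * math.log2(CODEBOOK_SIZE)

for nq in (6, 8, 12, 16, 24, 32):
    print(nq, bitrate_bps(nq))  # matches the 750/1000/1500/2000/3000/4000 bps rows
```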
For the full model description, architecture details, and evaluation metrics, please refer to the OpenMOSS/MOSS-Audio-Tokenizer repository listed below.
## Evaluation Metrics
The table below compares the reconstruction quality of open-source audio tokenizers with MOSS-Audio-Tokenizer on speech and audio/music data.
- Speech metrics are evaluated on LibriSpeech test-clean (English) and AISHELL-2 (Chinese), reported as EN/ZH.
- Audio metrics are evaluated on the AudioSet evaluation subset, while music metrics are evaluated on MUSDB, reported as audio/music.
- STFT-Dist. denotes the STFT distance.
- Higher is better for speech metrics, while lower is better for audio/music metrics (Mel-Loss, STFT-Dist.).
- Nq denotes the number of quantizers.
| Model | bps | Frame rate | Nq | Speech: SIM ↑ (EN/ZH) | Speech: STOI ↑ (EN/ZH) | Speech: PESQ-NB ↑ (EN/ZH) | Speech: PESQ-WB ↑ (EN/ZH) | Audio/Music: Mel-Loss ↓ | Audio/Music: STFT-Dist. ↓ |
|---|---|---|---|---|---|---|---|---|---|
| XCodec2.0 | 800 | 50 | 1 | 0.82 / 0.74 | 0.92 / 0.86 | 3.04 / 2.46 | 2.43 / 1.96 | -- / -- | -- / -- |
| MiMo Audio Tokenizer | 850 | 25 | 4 | 0.80 / 0.74 | 0.91 / 0.87 | 2.94 / 2.62 | 2.39 / 2.14 | 0.82 / 0.81 | 2.33 / 2.23 |
| Higgs Audio Tokenizer | 1000 | 25 | 4 | 0.77 / 0.68 | 0.83 / 0.82 | 3.03 / 2.61 | 2.48 / 2.14 | 0.83 / 0.80 | 2.20 / 2.05 |
| SpeechTokenizer | 1000 | 50 | 2 | 0.36 / 0.25 | 0.77 / 0.68 | 1.59 / 1.38 | 1.25 / 1.17 | -- / -- | -- / -- |
| XY-Tokenizer | 1000 | 12.5 | 8 | 0.85 / 0.79 | 0.92 / 0.87 | 3.10 / 2.63 | 2.50 / 2.12 | -- / -- | -- / -- |
| BigCodec | 1040 | 80 | 1 | 0.84 / 0.69 | 0.93 / 0.88 | 3.27 / 2.55 | 2.68 / 2.06 | -- / -- | -- / -- |
| Mimi | 1100 | 12.5 | 8 | 0.74 / 0.59 | 0.91 / 0.85 | 2.80 / 2.24 | 2.25 / 1.78 | 1.24 / 1.19 | 2.62 / 2.49 |
| MOSS Audio Tokenizer (Ours) | 750 | 12.5 | 6 | 0.82 / 0.75 | 0.93 / 0.89 | 3.14 / 2.73 | 2.60 / 2.22 | 0.86 / 0.85 | 2.21 / 2.10 |
| MOSS Audio Tokenizer (Ours) | 1000 | 12.5 | 8 | 0.88 / 0.81 | 0.94 / 0.91 | 3.38 / 2.96 | 2.87 / 2.43 | 0.82 / 0.80 | 2.16 / 2.04 |
| | | | | | | | | | |
| DAC | 1500 | 75 | 2 | 0.48 / 0.41 | 0.83 / 0.79 | 1.87 / 1.67 | 1.48 / 1.37 | -- / -- | -- / -- |
| Encodec | 1500 | 75 | 2 | 0.60 / 0.45 | 0.85 / 0.81 | 1.94 / 1.80 | 1.56 / 1.48 | 1.12 / 1.04 | 2.60 / 2.42 |
| Higgs Audio Tokenizer | 2000 | 25 | 8 | 0.90 / 0.83 | 0.85 / 0.85 | 3.59 / 3.22 | 3.11 / 2.73 | 0.74 / 0.70 | 2.07 / 1.92 |
| SpeechTokenizer | 2000 | 50 | 4 | 0.66 / 0.50 | 0.88 / 0.80 | 2.38 / 1.79 | 1.92 / 1.49 | -- / -- | -- / -- |
| Qwen3 TTS Tokenizer | 2200 | 12.5 | 16 | 0.95 / 0.88 | 0.96 / 0.93 | 3.66 / 3.10 | 3.19 / 2.62 | -- / -- | -- / -- |
| MiMo Audio Tokenizer | 2250 | 25 | 12 | 0.89 / 0.83 | 0.95 / 0.92 | 3.57 / 3.25 | 3.05 / 2.71 | 0.70 / 0.68 | 2.21 / 2.10 |
| Mimi | 2475 | 12.5 | 18 | 0.89 / 0.76 | 0.94 / 0.91 | 3.49 / 2.90 | 2.97 / 2.35 | 1.10 / 1.06 | 2.45 / 2.32 |
| MOSS Audio Tokenizer (Ours) | 1500 | 12.5 | 12 | 0.92 / 0.86 | 0.95 / 0.93 | 3.64 / 3.27 | 3.20 / 2.74 | 0.77 / 0.74 | 2.08 / 1.96 |
| MOSS Audio Tokenizer (Ours) | 2000 | 12.5 | 16 | 0.95 / 0.89 | 0.96 / 0.94 | 3.78 / 3.46 | 3.41 / 2.96 | 0.73 / 0.70 | 2.03 / 1.90 |
| | | | | | | | | | |
| DAC | 3000 | 75 | 4 | 0.74 / 0.67 | 0.90 / 0.88 | 2.76 / 2.47 | 2.31 / 2.07 | 0.86 / 0.83 | 2.23 / 2.10 |
| MiMo Audio Tokenizer | 3650 | 25 | 20 | 0.91 / 0.85 | 0.95 / 0.93 | 3.73 / 3.44 | 3.25 / 2.89 | 0.66 / 0.65 | 2.17 / 2.06 |
| SpeechTokenizer | 4000 | 50 | 8 | 0.85 / 0.69 | 0.92 / 0.85 | 3.05 / 2.20 | 2.60 / 1.87 | -- / -- | -- / -- |
| Mimi | 4400 | 12.5 | 32 | 0.94 / 0.83 | 0.96 / 0.94 | 3.80 / 3.31 | 3.43 / 2.78 | 1.02 / 0.98 | 2.34 / 2.21 |
| Encodec | 4500 | 75 | 6 | 0.86 / 0.75 | 0.92 / 0.91 | 2.91 / 2.63 | 2.46 / 2.15 | 0.91 / 0.84 | 2.33 / 2.17 |
| DAC | 6000 | 75 | 8 | 0.89 / 0.84 | 0.95 / 0.94 | 3.75 / 3.57 | 3.41 / 3.20 | 0.65 / 0.63 | 1.97 / 1.87 |
| MOSS Audio Tokenizer (Ours) | 3000 | 12.5 | 24 | 0.96 / 0.92 | 0.97 / 0.96 | 3.90 / 3.64 | 3.61 / 3.20 | 0.69 / 0.66 | 1.98 / 1.84 |
| MOSS Audio Tokenizer (Ours) | 4000 | 12.5 | 32 | 0.97 / 0.93 | 0.97 / 0.96 | 3.95 / 3.71 | 3.69 / 3.30 | 0.68 / 0.64 | 1.96 / 1.82 |
### LibriSpeech Speech Metrics (MOSS Audio Tokenizer vs. Open-source Tokenizers)
The plots below compare our MOSS Audio Tokenizer model with other open-source speech tokenizers on the LibriSpeech dataset, evaluated with SIM, STOI, PESQ-NB, and PESQ-WB (higher is better). We control the bps of the same model by adjusting the number of RVQ codebooks used during inference.
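Trading bitrate for quality this way needs no re-encoding: with RVQ, dropping the finer quantizer levels from already-computed codes lowers the bps. The sketch below illustrates the idea on a toy array; the `(batch, quantizer, frame)` layout and coarse-to-fine level ordering are assumptions about the export, not documented in this card.

```python
import numpy as np

# Toy stand-in for encoder output: (batch, n_quantizers, frames).
# ASSUMPTION: quantizer axis 1 is ordered coarse -> fine, as is usual for RVQ.
codes = np.random.randint(0, 1024, size=(1, 32, 25), dtype=np.int64)

def truncate_codes(codes: np.ndarray, n_quantizers: int) -> np.ndarray:
    """Keep only the first (coarsest) RVQ levels to lower the bitrate."""
    return codes[:, :n_quantizers, :]

low_bitrate = truncate_codes(codes, 8)  # ~1000 bps instead of 4000 bps
print(low_bitrate.shape)                # (1, 8, 25)
```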
*(Figure: four comparison plots, one per metric: SIM, STOI, PESQ-NB, PESQ-WB.)*
## Citation
If you use this code or these results in your paper, please cite our work.



