---
license: apache-2.0
library_name: onnx
tags:
  - audio
  - audio-tokenizer
  - neural-codec
  - moss-tts-family
  - MOSS Audio Tokenizer
  - speech-tokenizer
  - onnx
  - tensorrt
---

# MOSS-Audio-Tokenizer-ONNX

This repository provides the **ONNX exports** of [MOSS-Audio-Tokenizer](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer) (encoder & decoder), enabling **torch-free** audio encoding/decoding for the [MOSS-TTS](https://github.com/OpenMOSS/MOSS-TTS) family.

## Overview

**MOSS-Audio-Tokenizer** is the unified discrete audio interface for the entire MOSS-TTS Family, based on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture, a 1.6B-parameter, pure Causal Transformer audio tokenizer trained on 3M hours of diverse audio.

This ONNX repository is designed for **lightweight, torch-free deployment** scenarios. It serves as the audio tokenizer component in the [MOSS-TTS llama.cpp inference backend](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_delay/llama_cpp/README.md), which combines [llama.cpp](https://github.com/ggerganov/llama.cpp) (for the Qwen3 backbone) with ONNX Runtime or TensorRT (for the audio tokenizer) to achieve fully **PyTorch-free** TTS inference.

### Supported Backends

| Backend | Runtime | Use Case |
|---------|---------|----------|
| **ONNX Runtime (GPU)** | `onnxruntime-gpu` | Recommended starting point |
| **ONNX Runtime (CPU)** | `onnxruntime` | CPU-only / no CUDA |
| **TensorRT** | Build from ONNX | Maximum throughput (user-built engines) |

> **Note:** We do **not** provide pre-built TensorRT engines, as they are tied to your specific GPU architecture and TensorRT version. To use TRT, build engines from the ONNX models yourself; see `moss_audio_tokenizer/trt/build_engine.sh` in the main repository.
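
As a rough illustration of how the backend table maps onto ONNX Runtime, the sketch below picks execution providers in a preference order. The provider strings are ONNX Runtime's standard names; the helper `pick_providers` and its preference order are assumptions made for illustration and are not part of this repository.

```python
# Hypothetical helper: choose ONNX Runtime execution providers by preference.
PREFERRED = (
    "CUDAExecutionProvider",  # exposed by the onnxruntime-gpu package
    "CPUExecutionProvider",   # exposed by every onnxruntime build
)


def pick_providers(available, preferred=PREFERRED):
    """Return the preferred providers that are actually available, in order."""
    chosen = [name for name in preferred if name in available]
    # Always keep the CPU provider as the last-resort fallback.
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


if __name__ == "__main__":
    try:
        import onnxruntime as ort

        print(pick_providers(ort.get_available_providers()))
    except ImportError:
        print("onnxruntime is not installed")
```

The resulting list can be passed as the `providers` argument when an `onnxruntime.InferenceSession` is created.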

## Repository Contents

| File | Description |
|------|-------------|
| `encoder.onnx` | ONNX model for audio encoding (waveform → discrete codes) |
| `decoder.onnx` | ONNX model for audio decoding (discrete codes → waveform) |

## Quick Start

```bash
# Download
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX \
    --local-dir weights/MOSS-Audio-Tokenizer-ONNX
```

This is typically used together with [MOSS-TTS-GGUF](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-GGUF) for the llama.cpp inference pipeline. See the [llama.cpp Backend documentation](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_delay/llama_cpp/README.md) for the full end-to-end setup.
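
For a quick sanity check outside the llama.cpp pipeline, the same download can be done from Python and the exported graphs inspected with ONNX Runtime. This is a minimal sketch: it assumes `huggingface_hub` and `onnxruntime` are installed and that `encoder.onnx` and `decoder.onnx` sit at the repository root. The helper names are illustrative. Because this README does not document the models' input and output tensor names, the sketch reads them from the files rather than guessing.

```python
from pathlib import Path

REPO_ID = "OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX"
LOCAL_DIR = Path("weights/MOSS-Audio-Tokenizer-ONNX")


def download(local_dir=LOCAL_DIR):
    """Fetch a snapshot of the repository (needs network access)."""
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=REPO_ID, local_dir=str(local_dir)))


def model_paths(local_dir=LOCAL_DIR):
    """Expected paths of the two files listed under Repository Contents."""
    return {name: Path(local_dir) / f"{name}.onnx" for name in ("encoder", "decoder")}


def describe(model_path, providers=("CPUExecutionProvider",)):
    """Print the tensor names, shapes and dtypes declared by an ONNX model."""
    import onnxruntime as ort

    session = ort.InferenceSession(str(model_path), providers=list(providers))
    for tensor in session.get_inputs():
        print("input ", tensor.name, tensor.shape, tensor.type)
    for tensor in session.get_outputs():
        print("output", tensor.name, tensor.shape, tensor.type)
    return session
```

Calling `download()` and then `describe(path)` for each entry of `model_paths()` lists the tensors that each model expects and produces.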

## Main Repositories

| Repository | Description |
|------------|-------------|
| [OpenMOSS/MOSS-TTS](https://github.com/OpenMOSS/MOSS-TTS) | MOSS-TTS Family main repository (includes llama.cpp backend, PyTorch inference, and all models) |
| [OpenMOSS/MOSS-Audio-Tokenizer](https://github.com/OpenMOSS/MOSS-Audio-Tokenizer) | MOSS-Audio-Tokenizer source code, PyTorch weights, ONNX/TRT export scripts, and evaluation |
| [OpenMOSS-Team/MOSS-Audio-Tokenizer](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer) | PyTorch weights on Hugging Face (for `trust_remote_code=True` usage) |
| [OpenMOSS-Team/MOSS-TTS-GGUF](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-GGUF) | Pre-quantized GGUF backbone weights (companion to this ONNX repo) |

## About MOSS-Audio-Tokenizer

**MOSS-Audio-Tokenizer** compresses 24 kHz raw audio to a 12.5 Hz frame rate using a 32-layer Residual Vector Quantizer (RVQ), supporting high-fidelity reconstruction at bitrates from 0.125 kbps to 4 kbps. It is trained from scratch on 3 million hours of speech, sound effects, and music, achieving state-of-the-art reconstruction quality among open-source audio tokenizers.
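
These figures fit together arithmetically. The sketch below assumes each RVQ layer emits one code per frame and that each code carries 10 bits (a 1024-entry codebook). The 10-bit value is inferred from 4 kbps / (12.5 Hz × 32 layers) and is not stated in this README.

```python
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 12.5
MAX_QUANTIZERS = 32
BITS_PER_CODE = 10  # assumption: inferred from 4000 / (12.5 * 32)


def samples_per_frame():
    """Number of waveform samples summarized by one token frame."""
    return SAMPLE_RATE_HZ / FRAME_RATE_HZ


def bitrate_bps(num_quantizers):
    """Bitrate in bits per second when the first `num_quantizers` layers are kept."""
    if not 1 <= num_quantizers <= MAX_QUANTIZERS:
        raise ValueError("num_quantizers must be between 1 and 32")
    return FRAME_RATE_HZ * num_quantizers * BITS_PER_CODE


print(samples_per_frame())  # 1920.0 samples per frame
for nq in (1, 6, 8, 16, 32):
    print(nq, bitrate_bps(nq))  # 125.0, 750.0, 1000.0, 2000.0, 4000.0
```

Under this assumption, a single layer gives the 0.125 kbps lower bound and all 32 layers give the 4 kbps upper bound.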

For the full model description, architecture details, and evaluation metrics, please refer to:
- [MOSS-Audio-Tokenizer GitHub Repository](https://github.com/OpenMOSS/MOSS-Audio-Tokenizer)
- [MOSS-TTS README: Audio Tokenizer Section](https://github.com/OpenMOSS/MOSS-TTS#moss-audio-tokenizer)

## Evaluation Metrics

The table below compares the reconstruction quality of open-source audio tokenizers with MOSS Audio Tokenizer on speech and audio/music data.

- Speech metrics are evaluated on LibriSpeech test-clean (English) and AISHELL-2 (Chinese), reported as EN/ZH.
- Audio metrics are evaluated on the AudioSet evaluation subset, while music metrics are evaluated on MUSDB, reported as audio/music.
- STFT-Dist. denotes the STFT distance.
- Higher is better for speech metrics, while lower is better for audio/music metrics (Mel-Loss, STFT-Dist.).
- Nq denotes the number of quantizers.

| Model | bps | Frame rate | Nq | Speech: SIM ↑ (EN/ZH) | Speech: STOI ↑ (EN/ZH) | Speech: PESQ-NB ↑ (EN/ZH) | Speech: PESQ-WB ↑ (EN/ZH) | Audio/Music: Mel-Loss ↓ | Audio/Music: STFT-Dist. ↓ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **XCodec2.0** | 800 | 50 | 1 | 0.82 / 0.74 | 0.92 / 0.86 | 3.04 / 2.46 | 2.43 / 1.96 | -- / -- | -- / -- |
| **MiMo Audio Tokenizer** | 850 | 25 | 4 | 0.80 / 0.74 | 0.91 / 0.87 | 2.94 / 2.62 | 2.39 / 2.14 | **0.82** / 0.81 | 2.33 / 2.23 |
| **Higgs Audio Tokenizer** | 1000 | 25 | 4 | 0.77 / 0.68 | 0.83 / 0.82 | 3.03 / 2.61 | 2.48 / 2.14 | 0.83 / **0.80** | 2.20 / 2.05 |
| **SpeechTokenizer** | 1000 | 50 | 2 | 0.36 / 0.25 | 0.77 / 0.68 | 1.59 / 1.38 | 1.25 / 1.17 | -- / -- | -- / -- |
| **XY-Tokenizer** | 1000 | 12.5 | 8 | 0.85 / 0.79 | 0.92 / 0.87 | 3.10 / 2.63 | 2.50 / 2.12 | -- / -- | -- / -- |
| **BigCodec** | 1040 | 80 | 1 | 0.84 / 0.69 | 0.93 / 0.88 | 3.27 / 2.55 | 2.68 / 2.06 | -- / -- | -- / -- |
| **Mimi** | 1100 | 12.5 | 8 | 0.74 / 0.59 | 0.91 / 0.85 | 2.80 / 2.24 | 2.25 / 1.78 | 1.24 / 1.19 | 2.62 / 2.49 |
| **MOSS Audio Tokenizer (Ours)** | 750 | 12.5 | 6 | 0.82 / 0.75 | 0.93 / 0.89 | 3.14 / 2.73 | 2.60 / 2.22 | 0.86 / 0.85 | 2.21 / 2.10 |
| **MOSS Audio Tokenizer (Ours)** | 1000 | 12.5 | 8 | **0.88** / **0.81** | **0.94** / **0.91** | **3.38** / **2.96** | **2.87** / **2.43** | **0.82** / **0.80** | **2.16** / **2.04** |
| **—** | **—** | **—** | **—** | **—** | **—** | **—** | **—** | **—** | **—** |
| **DAC** | 1500 | 75 | 2 | 0.48 / 0.41 | 0.83 / 0.79 | 1.87 / 1.67 | 1.48 / 1.37 | -- / -- | -- / -- |
| **Encodec** | 1500 | 75 | 2 | 0.60 / 0.45 | 0.85 / 0.81 | 1.94 / 1.80 | 1.56 / 1.48 | 1.12 / 1.04 | 2.60 / 2.42 |
| **Higgs Audio Tokenizer** | 2000 | 25 | 8 | 0.90 / 0.83 | 0.85 / 0.85 | 3.59 / 3.22 | 3.11 / 2.73 | 0.74 / 0.70 | 2.07 / 1.92 |
| **SpeechTokenizer** | 2000 | 50 | 4 | 0.66 / 0.50 | 0.88 / 0.80 | 2.38 / 1.79 | 1.92 / 1.49 | -- / -- | -- / -- |
| **Qwen3 TTS Tokenizer** | 2200 | 12.5 | 16 | **0.95** / 0.88 | **0.96** / 0.93 | 3.66 / 3.10 | 3.19 / 2.62 | -- / -- | -- / -- |
| **MiMo Audio Tokenizer** | 2250 | 25 | 12 | 0.89 / 0.83 | 0.95 / 0.92 | 3.57 / 3.25 | 3.05 / 2.71 | **0.70** / **0.68** | 2.21 / 2.10 |
| **Mimi** | 2475 | 12.5 | 18 | 0.89 / 0.76 | 0.94 / 0.91 | 3.49 / 2.90 | 2.97 / 2.35 | 1.10 / 1.06 | 2.45 / 2.32 |
| **MOSS Audio Tokenizer (Ours)** | 1500 | 12.5 | 12 | 0.92 / 0.86 | 0.95 / 0.93 | 3.64 / 3.27 | 3.20 / 2.74 | 0.77 / 0.74 | 2.08 / 1.96 |
| **MOSS Audio Tokenizer (Ours)** | 2000 | 12.5 | 16 | **0.95** / **0.89** | **0.96** / **0.94** | **3.78** / **3.46** | **3.41** / **2.96** | 0.73 / 0.70 | **2.03** / **1.90** |
| **—** | **—** | **—** | **—** | **—** | **—** | **—** | **—** | **—** | **—** |
| **DAC** | 3000 | 75 | 4 | 0.74 / 0.67 | 0.90 / 0.88 | 2.76 / 2.47 | 2.31 / 2.07 | 0.86 / 0.83 | 2.23 / 2.10 |
| **MiMo Audio Tokenizer** | 3650 | 25 | 20 | 0.91 / 0.85 | 0.95 / 0.93 | 3.73 / 3.44 | 3.25 / 2.89 | 0.66 / 0.65 | 2.17 / 2.06 |
| **SpeechTokenizer** | 4000 | 50 | 8 | 0.85 / 0.69 | 0.92 / 0.85 | 3.05 / 2.20 | 2.60 / 1.87 | -- / -- | -- / -- |
| **Mimi** | 4400 | 12.5 | 32 | 0.94 / 0.83 | 0.96 / 0.94 | 3.80 / 3.31 | 3.43 / 2.78 | 1.02 / 0.98 | 2.34 / 2.21 |
| **Encodec** | 4500 | 75 | 6 | 0.86 / 0.75 | 0.92 / 0.91 | 2.91 / 2.63 | 2.46 / 2.15 | 0.91 / 0.84 | 2.33 / 2.17 |
| **DAC** | 6000 | 75 | 8 | 0.89 / 0.84 | 0.95 / 0.94 | 3.75 / 3.57 | 3.41 / 3.20 | **0.65** / **0.63** | 1.97 / 1.87 |
| **MOSS Audio Tokenizer (Ours)** | 3000 | 12.5 | 24 | 0.96 / 0.92 | **0.97** / **0.96** | 3.90 / 3.64 | 3.61 / 3.20 | 0.69 / 0.66 | 1.98 / 1.84 |
| **MOSS Audio Tokenizer (Ours)** | 4000 | 12.5 | 32 | **0.97** / **0.93** | **0.97** / **0.96** | **3.95** / **3.71** | **3.69** / **3.30** | 0.68 / 0.64 | **1.96** / **1.82** |

### LibriSpeech Speech Metrics (MOSS Audio Tokenizer vs. Open-source Tokenizers)

The plots below compare our MOSS Audio Tokenizer model with other open-source speech tokenizers on the LibriSpeech dataset, evaluated with SIM, STOI, PESQ-NB, and PESQ-WB (higher is better).
We control the bps of the same model by adjusting the number of RVQ codebooks used during inference.
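
One way to picture this is to treat the encoder output as a grid of code indices with one row per RVQ layer and one column per frame. Because RVQ layers refine the residual left by earlier layers, keeping only the first rows lowers the bitrate at the cost of detail. The `[layer][frame]` layout and the helper `keep_first_codebooks` are assumptions for illustration, not the documented interface of the ONNX models.

```python
def keep_first_codebooks(codes, num_quantizers):
    """Keep the first `num_quantizers` RVQ layers of a [layer][frame] code grid."""
    if not 1 <= num_quantizers <= len(codes):
        raise ValueError("num_quantizers must be between 1 and the number of layers")
    # Copy the rows so the caller's grid is left untouched.
    return [list(row) for row in codes[:num_quantizers]]


# Toy grid: 4 layers x 3 frames of made-up code indices.
codes = [
    [5, 9, 2],
    [7, 1, 4],
    [3, 3, 8],
    [6, 0, 1],
]
coarse = keep_first_codebooks(codes, 2)
print(len(coarse), "layers,", len(coarse[0]), "frames")
```

The frame count is unchanged by truncation; only the number of codes per frame, and therefore the bps, goes down.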

<table>
  <tr>
    <td align="center"><b>SIM</b><br><img src="images/sim.png" width="100%"></td>
    <td align="center"><b>STOI</b><br><img src="images/stoi.png" width="100%"></td>
  </tr>
  <tr>
    <td align="center"><b>PESQ-NB</b><br><img src="images/pesq-nb.png" width="100%"></td>
    <td align="center"><b>PESQ-WB</b><br><img src="images/pesq-wb.png" width="100%"></td>
  </tr>
</table>


## Citation
If you use this code or these results in your paper, please cite our work as:
```tex

```