metadata
license: apache-2.0
library_name: onnx
tags:
  - audio
  - audio-tokenizer
  - neural-codec
  - moss-tts-family
  - MOSS Audio Tokenizer
  - speech-tokenizer
  - onnx
  - tensorrt

MOSS-Audio-Tokenizer-ONNX

This repository provides the ONNX exports of MOSS-Audio-Tokenizer (encoder & decoder), enabling torch-free audio encoding/decoding for the MOSS-TTS family.

Overview

MOSS-Audio-Tokenizer is the unified discrete audio interface for the entire MOSS-TTS Family, based on the Cat (Causal Audio Tokenizer with Transformer) architecture, a 1.6B-parameter, pure Causal Transformer audio tokenizer trained on 3M hours of diverse audio.

This ONNX repository is designed for lightweight, torch-free deployment scenarios. It serves as the audio tokenizer component in the MOSS-TTS llama.cpp inference backend, which combines llama.cpp (for the Qwen3 backbone) with ONNX Runtime or TensorRT (for the audio tokenizer) to achieve fully PyTorch-free TTS inference.

Supported Backends

| Backend | Runtime | Use Case |
| --- | --- | --- |
| ONNX Runtime (GPU) | onnxruntime-gpu | Recommended starting point |
| ONNX Runtime (CPU) | onnxruntime | CPU-only / no CUDA |
| TensorRT | Build from ONNX | Maximum throughput (user-built engines) |

Note: We do not provide pre-built TensorRT engines, as they are tied to your specific GPU architecture and TensorRT version. To use TRT, build engines from the ONNX models yourself; see moss_audio_tokenizer/trt/build_engine.sh in the main repository.
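For the two ONNX Runtime rows in the table above, a minimal sketch of choosing an execution provider might look like the following. The helper names `pick_providers` and `make_session` are hypothetical and not part of this repository; the sketch assumes only the standard ONNX Runtime Python API, and it does not cover user-built TensorRT engines.

```python
# Preference order for the ONNX Runtime backends: GPU first, then CPU.
PREFERRED = ("CUDAExecutionProvider", "CPUExecutionProvider")


def pick_providers(available):
    """Return the preferred providers that the installed ONNX Runtime offers.

    `available` is a list of provider names, for example the result of
    onnxruntime.get_available_providers().
    """
    chosen = [name for name in PREFERRED if name in available]
    # Fall back to the CPU provider if nothing in the preference list matched.
    return chosen or ["CPUExecutionProvider"]


def make_session(model_path):
    """Create an InferenceSession (hypothetical helper).

    Requires either onnxruntime or onnxruntime-gpu to be installed.
    """
    import onnxruntime as ort  # imported lazily so pick_providers has no dependencies

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

With the weights downloaded as in Quick Start, `make_session("weights/MOSS-Audio-Tokenizer-ONNX/encoder.onnx")` would then run on CUDA when `onnxruntime-gpu` is installed and on CPU otherwise.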

Repository Contents

| File | Description |
| --- | --- |
| encoder.onnx | ONNX model for audio encoding (waveform → discrete codes) |
| decoder.onnx | ONNX model for audio decoding (discrete codes → waveform) |
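This README does not list the tensor names or shapes of the two models, so it can help to inspect them before wiring up a pipeline. The sketch below uses a hypothetical helper, `describe_io`, that works on any object exposing ONNX Runtime's `get_inputs()` and `get_outputs()` methods, such as an `onnxruntime.InferenceSession`.

```python
def describe_io(session):
    """List the inputs and outputs of a loaded model as readable strings.

    `session` is expected to behave like onnxruntime.InferenceSession:
    get_inputs() and get_outputs() return objects with name, type and shape.
    """
    lines = []
    for kind, args in (("input", session.get_inputs()),
                       ("output", session.get_outputs())):
        for arg in args:
            # Dynamic dimensions appear as strings or None inside arg.shape.
            lines.append(f"{kind}: {arg.name} {arg.type} {arg.shape}")
    return lines
```

For example, `print("\n".join(describe_io(onnxruntime.InferenceSession("encoder.onnx"))))` prints the signature of the encoder; the actual names and shapes depend on how the models were exported.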

Quick Start

# Download
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX \
    --local-dir weights/MOSS-Audio-Tokenizer-ONNX

This is typically used together with MOSS-TTS-GGUF for the llama.cpp inference pipeline. See the llama.cpp Backend documentation for the full end-to-end setup.

Main Repositories

| Repository | Description |
| --- | --- |
| OpenMOSS/MOSS-TTS | MOSS-TTS Family main repository (includes llama.cpp backend, PyTorch inference, and all models) |
| OpenMOSS/MOSS-Audio-Tokenizer | MOSS-Audio-Tokenizer source code, PyTorch weights, ONNX/TRT export scripts, and evaluation |
| OpenMOSS-Team/MOSS-Audio-Tokenizer | PyTorch weights on Hugging Face (for trust_remote_code=True usage) |
| OpenMOSS-Team/MOSS-TTS-GGUF | Pre-quantized GGUF backbone weights (companion to this ONNX repo) |

About MOSS-Audio-Tokenizer

MOSS-Audio-Tokenizer compresses 24 kHz raw audio to a 12.5 Hz frame rate using a 32-layer Residual Vector Quantizer (RVQ), supporting high-fidelity reconstruction at bitrates from 0.125 kbps to 4 kbps. It is trained from scratch on 3 million hours of speech, sound effects, and music, achieving state-of-the-art reconstruction quality among open-source audio tokenizers.
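The figures above fit together arithmetically, as the short sketch below shows. The value of 10 bits per code is an assumption, not something this README states: it is inferred from the evaluation table, where 6 quantizers at 12.5 Hz correspond to 750 bps. The function names are illustrative only.

```python
SAMPLE_RATE_HZ = 24_000   # input sample rate stated in this README
FRAME_RATE_HZ = 12.5      # token frame rate stated in this README
MAX_QUANTIZERS = 32       # number of RVQ layers stated in this README
BITS_PER_CODE = 10        # ASSUMPTION: inferred from 750 bps / (12.5 Hz * 6)


def samples_per_frame():
    """Number of waveform samples summarized by one token frame."""
    return SAMPLE_RATE_HZ / FRAME_RATE_HZ


def bitrate_bps(num_quantizers):
    """Bitrate when only the first `num_quantizers` RVQ layers are used."""
    if not 1 <= num_quantizers <= MAX_QUANTIZERS:
        raise ValueError("num_quantizers must be between 1 and 32")
    return FRAME_RATE_HZ * num_quantizers * BITS_PER_CODE
```

Under that assumption, one frame covers 1920 samples (80 ms), a single quantizer gives 125 bps (0.125 kbps), and all 32 quantizers give 4000 bps (4 kbps), which matches the stated range.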

For the full model description, architecture details, and evaluation metrics, please refer to the repositories listed under Main Repositories above.

Evaluation Metrics

The table below compares the reconstruction quality of open-source audio tokenizers with MOSS Audio Tokenizer on speech and audio/music data.

  • Speech metrics are evaluated on LibriSpeech test-clean (English) and AISHELL-2 (Chinese), reported as EN/ZH.
  • Audio metrics are evaluated on the AudioSet evaluation subset, while music metrics are evaluated on MUSDB, reported as audio/music.
  • STFT-Dist. denotes the STFT distance.
  • Higher is better for speech metrics, while lower is better for audio/music metrics (Mel-Loss, STFT-Dist.).
  • Nq denotes the number of quantizers.

| Model | bps | Frame rate (Hz) | Nq | Speech: SIM ↑ (EN/ZH) | Speech: STOI ↑ (EN/ZH) | Speech: PESQ-NB ↑ (EN/ZH) | Speech: PESQ-WB ↑ (EN/ZH) | Audio/Music: Mel-Loss ↓ | Audio/Music: STFT-Dist. ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| XCodec2.0 | 800 | 50 | 1 | 0.82 / 0.74 | 0.92 / 0.86 | 3.04 / 2.46 | 2.43 / 1.96 | -- / -- | -- / -- |
| MiMo Audio Tokenizer | 850 | 25 | 4 | 0.80 / 0.74 | 0.91 / 0.87 | 2.94 / 2.62 | 2.39 / 2.14 | 0.82 / 0.81 | 2.33 / 2.23 |
| Higgs Audio Tokenizer | 1000 | 25 | 4 | 0.77 / 0.68 | 0.83 / 0.82 | 3.03 / 2.61 | 2.48 / 2.14 | 0.83 / 0.80 | 2.20 / 2.05 |
| SpeechTokenizer | 1000 | 50 | 2 | 0.36 / 0.25 | 0.77 / 0.68 | 1.59 / 1.38 | 1.25 / 1.17 | -- / -- | -- / -- |
| XY-Tokenizer | 1000 | 12.5 | 8 | 0.85 / 0.79 | 0.92 / 0.87 | 3.10 / 2.63 | 2.50 / 2.12 | -- / -- | -- / -- |
| BigCodec | 1040 | 80 | 1 | 0.84 / 0.69 | 0.93 / 0.88 | 3.27 / 2.55 | 2.68 / 2.06 | -- / -- | -- / -- |
| Mimi | 1100 | 12.5 | 8 | 0.74 / 0.59 | 0.91 / 0.85 | 2.80 / 2.24 | 2.25 / 1.78 | 1.24 / 1.19 | 2.62 / 2.49 |
| MOSS Audio Tokenizer (Ours) | 750 | 12.5 | 6 | 0.82 / 0.75 | 0.93 / 0.89 | 3.14 / 2.73 | 2.60 / 2.22 | 0.86 / 0.85 | 2.21 / 2.10 |
| MOSS Audio Tokenizer (Ours) | 1000 | 12.5 | 8 | 0.88 / 0.81 | 0.94 / 0.91 | 3.38 / 2.96 | 2.87 / 2.43 | 0.82 / 0.80 | 2.16 / 2.04 |
| - | - | - | - | - | - | - | - | - | - |
| DAC | 1500 | 75 | 2 | 0.48 / 0.41 | 0.83 / 0.79 | 1.87 / 1.67 | 1.48 / 1.37 | -- / -- | -- / -- |
| Encodec | 1500 | 75 | 2 | 0.60 / 0.45 | 0.85 / 0.81 | 1.94 / 1.80 | 1.56 / 1.48 | 1.12 / 1.04 | 2.60 / 2.42 |
| Higgs Audio Tokenizer | 2000 | 25 | 8 | 0.90 / 0.83 | 0.85 / 0.85 | 3.59 / 3.22 | 3.11 / 2.73 | 0.74 / 0.70 | 2.07 / 1.92 |
| SpeechTokenizer | 2000 | 50 | 4 | 0.66 / 0.50 | 0.88 / 0.80 | 2.38 / 1.79 | 1.92 / 1.49 | -- / -- | -- / -- |
| Qwen3 TTS Tokenizer | 2200 | 12.5 | 16 | 0.95 / 0.88 | 0.96 / 0.93 | 3.66 / 3.10 | 3.19 / 2.62 | -- / -- | -- / -- |
| MiMo Audio Tokenizer | 2250 | 25 | 12 | 0.89 / 0.83 | 0.95 / 0.92 | 3.57 / 3.25 | 3.05 / 2.71 | 0.70 / 0.68 | 2.21 / 2.10 |
| Mimi | 2475 | 12.5 | 18 | 0.89 / 0.76 | 0.94 / 0.91 | 3.49 / 2.90 | 2.97 / 2.35 | 1.10 / 1.06 | 2.45 / 2.32 |
| MOSS Audio Tokenizer (Ours) | 1500 | 12.5 | 12 | 0.92 / 0.86 | 0.95 / 0.93 | 3.64 / 3.27 | 3.20 / 2.74 | 0.77 / 0.74 | 2.08 / 1.96 |
| MOSS Audio Tokenizer (Ours) | 2000 | 12.5 | 16 | 0.95 / 0.89 | 0.96 / 0.94 | 3.78 / 3.46 | 3.41 / 2.96 | 0.73 / 0.70 | 2.03 / 1.90 |
| - | - | - | - | - | - | - | - | - | - |
| DAC | 3000 | 75 | 4 | 0.74 / 0.67 | 0.90 / 0.88 | 2.76 / 2.47 | 2.31 / 2.07 | 0.86 / 0.83 | 2.23 / 2.10 |
| MiMo Audio Tokenizer | 3650 | 25 | 20 | 0.91 / 0.85 | 0.95 / 0.93 | 3.73 / 3.44 | 3.25 / 2.89 | 0.66 / 0.65 | 2.17 / 2.06 |
| SpeechTokenizer | 4000 | 50 | 8 | 0.85 / 0.69 | 0.92 / 0.85 | 3.05 / 2.20 | 2.60 / 1.87 | -- / -- | -- / -- |
| Mimi | 4400 | 12.5 | 32 | 0.94 / 0.83 | 0.96 / 0.94 | 3.80 / 3.31 | 3.43 / 2.78 | 1.02 / 0.98 | 2.34 / 2.21 |
| Encodec | 4500 | 75 | 6 | 0.86 / 0.75 | 0.92 / 0.91 | 2.91 / 2.63 | 2.46 / 2.15 | 0.91 / 0.84 | 2.33 / 2.17 |
| DAC | 6000 | 75 | 8 | 0.89 / 0.84 | 0.95 / 0.94 | 3.75 / 3.57 | 3.41 / 3.20 | 0.65 / 0.63 | 1.97 / 1.87 |
| MOSS Audio Tokenizer (Ours) | 3000 | 12.5 | 24 | 0.96 / 0.92 | 0.97 / 0.96 | 3.90 / 3.64 | 3.61 / 3.20 | 0.69 / 0.66 | 1.98 / 1.84 |
| MOSS Audio Tokenizer (Ours) | 4000 | 12.5 | 32 | 0.97 / 0.93 | 0.97 / 0.96 | 3.95 / 3.71 | 3.69 / 3.30 | 0.68 / 0.64 | 1.96 / 1.82 |

LibriSpeech Speech Metrics (MOSS Audio Tokenizer vs. Open-source Tokenizers)

The plots below compare our MOSS Audio Tokenizer model with other open-source speech tokenizers on the LibriSpeech dataset, evaluated with SIM, STOI, PESQ-NB, and PESQ-WB (higher is better). We control the bps of the same model by adjusting the number of RVQ codebooks used during inference.

[Figure panels: SIM, STOI, PESQ-NB, PESQ-WB]
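Controlling the bitrate by the number of RVQ codebooks, as described above, amounts to keeping only the first few quantizer layers of the encoder output, since each RVQ layer quantizes the residual left by the layers before it. The sketch below assumes the codes are laid out as one token sequence per layer (layer index first); that layout and the helper name `truncate_codebooks` are assumptions for illustration, and the exported decoder may expect a different shape or a fixed number of layers.

```python
def truncate_codebooks(codes, num_quantizers):
    """Keep the first `num_quantizers` RVQ layers of a code sequence.

    `codes` is assumed to be a list with one entry per RVQ layer, each entry
    being the list of token ids for that layer over time.
    """
    if not 1 <= num_quantizers <= len(codes):
        raise ValueError("num_quantizers must be between 1 and len(codes)")
    # Earlier layers carry the coarsest information, so drop from the end.
    return [list(layer) for layer in codes[:num_quantizers]]
```

For instance, keeping 8 of 32 layers corresponds to the 1000 bps row of the table above, and keeping 16 corresponds to the 2000 bps row.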

Citation

If you use this code or these results in your paper, please cite our work as: