Update README

README.md (CHANGED)

**Previous version:**
---
license: apache-2.0
library_name:
tags:
- audio
- audio-tokenizer
- moss-tts-family
- MOSS Audio Tokenizer
- speech-tokenizer
-
---

#

This

**
* **Pure Transformer Architecture**: The model features a "CNN-free" homogeneous architecture built entirely from Causal Transformer blocks. With 1.6B combined parameters (Encoder + Decoder), it ensures exceptional scalability and supports low-latency streaming inference.
* **Large-Scale General Audio Training**: Trained on 3 million hours of diverse audio data, the model excels at encoding and reconstructing all audio domains, including speech, sound effects, and music.
* **Unified Semantic-Acoustic Representation**: While achieving state-of-the-art reconstruction quality, Cat produces discrete tokens that are "semantic-rich," making them ideal for downstream tasks like speech understanding (ASR) and generation (TTS).
* **Fully Trained From Scratch**: Cat does not rely on any pretrained encoders (such as HuBERT or Whisper) or distillation from teacher models. All representations are learned autonomously from raw data.
* **End-to-End Joint Optimization**: All components—including the encoder, quantizer, decoder, discriminator, and a decoder-only LLM for semantic alignment—are optimized jointly in a single unified training pipeline.

By combining a simple, scalable architecture with massive-scale data, the Cat architecture overcomes the bottlenecks of traditional audio tokenizers. It provides a robust, high-fidelity, and semantically grounded interface for the next generation of native audio foundation models.

<p align="center">
<img src="images/arch.png" width="95%"> <br>
Architecture of MossAudioTokenizer
</p>
<br>

##

```python
import torch
from transformers import AutoModel
import torchaudio

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

# Load an example waveform (path is illustrative) and match the model's sampling rate
wav, sr = torchaudio.load("demo/demo.wav")
if sr != model.sampling_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sampling_rate)
wav = wav.unsqueeze(0)  # (batch, channels, samples)

enc = model.encode(wav, return_dict=True)
print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")
dec = model.decode(enc.audio_codes, return_dict=True)
print(f"dec.audio.shape: {dec.audio.shape}")

wav = dec.audio.squeeze(0)
torchaudio.save("demo/demo_rec.wav", wav, sample_rate=model.sampling_rate)

# Decode using only the first 8 layers of the RVQ
dec_rvq8 = model.decode(enc.audio_codes[:8], return_dict=True)
wav_rvq8 = dec_rvq8.audio.squeeze(0)
torchaudio.save("demo/demo_rec_rvq8.wav", wav_rvq8, sample_rate=model.sampling_rate)
```

`MossAudioTokenizerModel.encode` and `MossAudioTokenizerModel.decode` support simple streaming via a `chunk_duration` argument.

- It must be <= `MossAudioTokenizerConfig.causal_transformer_context_duration`.
- `chunk_duration * MossAudioTokenizerConfig.sampling_rate` must be divisible by `MossAudioTokenizerConfig.downsample_rate`.
- Streaming chunking only supports `batch_size=1`.
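The constraints above reduce to simple arithmetic, sketched here in plain Python. The `24000` sampling rate and `1920` downsample rate match the example values used elsewhere in this README; the `10.0` s context duration is a placeholder for the config's actual `causal_transformer_context_duration`, not a documented value.

```python
# Sketch: check a chunk_duration against the streaming constraints above.
# sampling_rate=24000 and downsample_rate=1920 are taken from this README;
# context_duration=10.0 is an assumed placeholder.
SAMPLING_RATE = 24000
DOWNSAMPLE_RATE = 1920

def is_valid_chunk_duration(chunk_duration, context_duration=10.0):
    # Round to avoid float artifacts; the chunk must map to a whole
    # number of samples, and that count must split into whole frames.
    samples = round(chunk_duration * SAMPLING_RATE)
    return chunk_duration <= context_duration and samples % DOWNSAMPLE_RATE == 0

print(is_valid_chunk_duration(0.08))  # True: 1920 samples, exactly one frame
print(is_valid_chunk_duration(0.1))   # False: 2400 samples, not divisible by 1920
```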

```python
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
audio = torch.randn(1, 1, 3200)  # dummy waveform

# 0.08s @ 24kHz = 1920 samples, divisible by downsample_rate=1920
enc = model.encode(audio, return_dict=True, chunk_duration=0.08)
dec = model.decode(enc.audio_codes, return_dict=True, chunk_duration=0.08)
```

-
-
- `config.json`
- model weights

## Evaluation Metrics

**Updated version:**

---
license: apache-2.0
library_name: onnx
tags:
- audio
- audio-tokenizer
- moss-tts-family
- MOSS Audio Tokenizer
- speech-tokenizer
- onnx
- tensorrt
---

# MOSS-Audio-Tokenizer-ONNX

This repository provides the **ONNX exports** of [MOSS-Audio-Tokenizer](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer) (encoder & decoder), enabling **torch-free** audio encoding/decoding for the [MOSS-TTS](https://github.com/OpenMOSS/MOSS-TTS) family.

## Overview

**MOSS-Audio-Tokenizer** is the unified discrete audio interface for the entire MOSS-TTS Family, based on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture — a 1.6B-parameter, pure Causal Transformer audio tokenizer trained on 3M hours of diverse audio.

This ONNX repository is designed for **lightweight, torch-free deployment** scenarios. It serves as the audio tokenizer component in the [MOSS-TTS llama.cpp inference backend](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_delay/llama_cpp/README.md), which combines [llama.cpp](https://github.com/ggerganov/llama.cpp) (for the Qwen3 backbone) with ONNX Runtime or TensorRT (for the audio tokenizer) to achieve fully **PyTorch-free** TTS inference.

### Supported Backends

| Backend | Runtime | Use Case |
|---------|---------|----------|
| **ONNX Runtime (GPU)** | `onnxruntime-gpu` | Recommended starting point |
| **ONNX Runtime (CPU)** | `onnxruntime` | CPU-only / no CUDA |
| **TensorRT** | Build from ONNX | Maximum throughput (user-built engines) |

> **Note:** We do **not** provide pre-built TensorRT engines, as they are tied to your specific GPU architecture and TensorRT version. To use TRT, build engines from the ONNX models yourself — see `moss_audio_tokenizer/trt/build_engine.sh` in the main repository.

## Repository Contents

| File | Description |
|------|-------------|
| `encoder.onnx` | ONNX model for audio encoding (waveform → discrete codes) |
| `decoder.onnx` | ONNX model for audio decoding (discrete codes → waveform) |
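As a minimal sketch of running these two files with ONNX Runtime: the code below assumes each export has a single input and a single output, a float32 `(batch, channels, samples)` waveform at 24 kHz, and the local directory produced by the download command in this README; inspect the sessions with `get_inputs()`/`get_outputs()` to confirm the real shapes and names before relying on it.

```python
import os

MODEL_DIR = "weights/MOSS-Audio-Tokenizer-ONNX"  # matches the --local-dir used in this README

def reconstruct(wav, model_dir=MODEL_DIR):
    """Encode a waveform with encoder.onnx, then decode the codes back to audio.

    Assumes one input and one output tensor per model; verify against the exports.
    """
    import onnxruntime as ort  # pip install onnxruntime (or onnxruntime-gpu)

    encoder = ort.InferenceSession(os.path.join(model_dir, "encoder.onnx"))
    decoder = ort.InferenceSession(os.path.join(model_dir, "decoder.onnx"))

    # Look up tensor names at runtime rather than hard-coding them.
    enc_input = encoder.get_inputs()[0].name
    (codes,) = encoder.run(None, {enc_input: wav})

    dec_input = decoder.get_inputs()[0].name
    (audio,) = decoder.run(None, {dec_input: codes})
    return audio

if os.path.isdir(MODEL_DIR):
    import numpy as np
    wav = np.random.randn(1, 1, 24_000).astype(np.float32)  # 1 s of dummy audio
    print(reconstruct(wav).shape)
```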

## Quick Start

```bash
# Download
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX \
  --local-dir weights/MOSS-Audio-Tokenizer-ONNX
```

This is typically used together with [MOSS-TTS-GGUF](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-GGUF) for the llama.cpp inference pipeline. See the [llama.cpp Backend documentation](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_delay/llama_cpp/README.md) for the full end-to-end setup.
## Main Repositories

| Repository | Description |
|------------|-------------|
| [OpenMOSS/MOSS-TTS](https://github.com/OpenMOSS/MOSS-TTS) | MOSS-TTS Family main repository (includes llama.cpp backend, PyTorch inference, and all models) |
| [OpenMOSS/MOSS-Audio-Tokenizer](https://github.com/OpenMOSS/MOSS-Audio-Tokenizer) | MOSS-Audio-Tokenizer source code, PyTorch weights, ONNX/TRT export scripts, and evaluation |
| [OpenMOSS-Team/MOSS-Audio-Tokenizer](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer) | PyTorch weights on Hugging Face (for `trust_remote_code=True` usage) |
| [OpenMOSS-Team/MOSS-TTS-GGUF](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-GGUF) | Pre-quantized GGUF backbone weights (companion to this ONNX repo) |

## About MOSS-Audio-Tokenizer

**MOSS-Audio-Tokenizer** compresses 24kHz raw audio into a 12.5Hz frame rate using a 32-layer Residual Vector Quantizer (RVQ), supporting high-fidelity reconstruction from 0.125kbps to 4kbps. It is trained from scratch on 3 million hours of speech, sound effects, and music, achieving state-of-the-art reconstruction quality among open-source audio tokenizers.
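A quick back-of-the-envelope check shows these figures are internally consistent. The 10-bit code width (a 1024-entry codebook per RVQ layer) is an assumption inferred from the quoted bitrates, not a value stated on this card.

```python
# Sanity-check the figures above: 24 kHz audio -> 12.5 Hz frames, 32 RVQ
# layers, 0.125-4 kbps. BITS_PER_CODE = 10 is assumed (1024-entry codebook).
SAMPLING_RATE = 24_000
FRAME_RATE = 12.5
NUM_RVQ_LAYERS = 32
BITS_PER_CODE = 10

samples_per_frame = SAMPLING_RATE / FRAME_RATE        # audio samples per frame
kbps_one_layer = FRAME_RATE * BITS_PER_CODE / 1000    # bitrate with 1 RVQ layer
kbps_all_layers = kbps_one_layer * NUM_RVQ_LAYERS     # bitrate with all 32 layers

print(samples_per_frame)  # 1920.0
print(kbps_one_layer)     # 0.125
print(kbps_all_layers)    # 4.0
```

The 0.125 kbps floor thus corresponds to decoding from a single RVQ layer, and 4 kbps to all 32.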

For the full model description, architecture details, and evaluation metrics, please refer to:

- [MOSS-Audio-Tokenizer GitHub Repository](https://github.com/OpenMOSS/MOSS-Audio-Tokenizer)
- [MOSS-TTS README — Audio Tokenizer Section](https://github.com/OpenMOSS/MOSS-TTS#moss-audio-tokenizer)

## Evaluation Metrics
