Update README

README.md (CHANGED)

**Previous version:**
---
license: apache-2.0
library_name:
tags:
- audio
- audio-tokenizer
- moss-tts-family
- MOSS Audio Tokenizer
- speech-tokenizer
-
---

#

This

**
* **Pure Transformer Architecture**: The model features a "CNN-free" homogeneous architecture built entirely from Causal Transformer blocks. With 1.6B combined parameters (Encoder + Decoder), it ensures exceptional scalability and supports low-latency streaming inference.
* **Large-Scale General Audio Training**: Trained on 3 million hours of diverse audio data, the model excels at encoding and reconstructing all audio domains, including speech, sound effects, and music.
* **Unified Semantic-Acoustic Representation**: While achieving state-of-the-art reconstruction quality, Cat produces discrete tokens that are "semantic-rich," making them ideal for downstream tasks like speech understanding (ASR) and generation (TTS).
* **Fully Trained From Scratch**: Cat does not rely on any pretrained encoders (such as HuBERT or Whisper) or distillation from teacher models. All representations are learned autonomously from raw data.
* **End-to-End Joint Optimization**: All components—including the encoder, quantizer, decoder, discriminator, and a decoder-only LLM for semantic alignment—are optimized jointly in a single unified training pipeline.

By combining a simple, scalable architecture with massive-scale data, the Cat architecture overcomes the bottlenecks of traditional audio tokenizers. It provides a robust, high-fidelity, and semantically grounded interface for the next generation of native audio foundation models.

<p align="center">
<img src="images/arch.png" width="95%"> <br>
Architecture of MossAudioTokenizer
</p>
<br>

##

```python
import torch
from transformers import AutoModel
import torchaudio

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

# Load an example waveform (path is illustrative) and match the model's sampling rate
wav, sr = torchaudio.load("demo/demo.wav")
if sr != model.sampling_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sampling_rate)
wav = wav.unsqueeze(0)  # (batch, channels, samples)

enc = model.encode(wav, return_dict=True)
print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")
dec = model.decode(enc.audio_codes, return_dict=True)
print(f"dec.audio.shape: {dec.audio.shape}")

wav = dec.audio.squeeze(0)
torchaudio.save("demo/demo_rec.wav", wav, sample_rate=model.sampling_rate)

# Decode using only the first 8 layers of the RVQ
dec_rvq8 = model.decode(enc.audio_codes[:8], return_dict=True)
wav_rvq8 = dec_rvq8.audio.squeeze(0)
torchaudio.save("demo/demo_rec_rvq8.wav", wav_rvq8, sample_rate=model.sampling_rate)
```

`MossAudioTokenizerModel.encode` and `MossAudioTokenizerModel.decode` support simple streaming via a `chunk_duration` argument.

- It must be <= `MossAudioTokenizerConfig.causal_transformer_context_duration`.
- `chunk_duration * MossAudioTokenizerConfig.sampling_rate` must be divisible by `MossAudioTokenizerConfig.downsample_rate`.
- Streaming chunking only supports `batch_size=1`.
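The constraints above reduce to simple arithmetic, sketched here in plain Python. The `24000` sampling rate and `1920` downsample rate match the example values used elsewhere in this README; the `10.0` s context duration is a placeholder for the config's actual `causal_transformer_context_duration`, not a documented value.

```python
# Sketch: check a chunk_duration against the streaming constraints above.
# sampling_rate=24000 and downsample_rate=1920 are taken from this README;
# context_duration=10.0 is an assumed placeholder.
SAMPLING_RATE = 24000
DOWNSAMPLE_RATE = 1920

def is_valid_chunk_duration(chunk_duration, context_duration=10.0):
    # Round to avoid float artifacts; the chunk must map to a whole
    # number of samples, and that count must split into whole frames.
    samples = round(chunk_duration * SAMPLING_RATE)
    return chunk_duration <= context_duration and samples % DOWNSAMPLE_RATE == 0

print(is_valid_chunk_duration(0.08))  # True: 1920 samples, exactly one frame
print(is_valid_chunk_duration(0.1))   # False: 2400 samples, not divisible by 1920
```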

```python
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
audio = torch.randn(1, 1, 3200)  # dummy waveform

# 0.08s @ 24kHz = 1920 samples, divisible by downsample_rate=1920
enc = model.encode(audio, return_dict=True, chunk_duration=0.08)
dec = model.decode(enc.audio_codes, return_dict=True, chunk_duration=0.08)
```

-
-
- `config.json`
- model weights

## Evaluation Metrics

**Updated version:**

---
license: apache-2.0
library_name: onnx
tags:
- audio
- audio-tokenizer
- moss-tts-family
- MOSS Audio Tokenizer
- speech-tokenizer
- onnx
- tensorrt
---

# MOSS-Audio-Tokenizer-ONNX

This repository provides the **ONNX exports** of [MOSS-Audio-Tokenizer](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer) (encoder & decoder), enabling **torch-free** audio encoding/decoding for the [MOSS-TTS](https://github.com/OpenMOSS/MOSS-TTS) family.

## Overview

**MOSS-Audio-Tokenizer** is the unified discrete audio interface for the entire MOSS-TTS Family, based on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture — a 1.6B-parameter, pure Causal Transformer audio tokenizer trained on 3M hours of diverse audio.

This ONNX repository is designed for **lightweight, torch-free deployment** scenarios. It serves as the audio tokenizer component in the [MOSS-TTS llama.cpp inference backend](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_delay/llama_cpp/README.md), which combines [llama.cpp](https://github.com/ggerganov/llama.cpp) (for the Qwen3 backbone) with ONNX Runtime or TensorRT (for the audio tokenizer) to achieve fully **PyTorch-free** TTS inference.

### Supported Backends

| Backend | Runtime | Use Case |
|---------|---------|----------|
| **ONNX Runtime (GPU)** | `onnxruntime-gpu` | Recommended starting point |
| **ONNX Runtime (CPU)** | `onnxruntime` | CPU-only / no CUDA |
| **TensorRT** | Build from ONNX | Maximum throughput (user-built engines) |

> **Note:** We do **not** provide pre-built TensorRT engines, as they are tied to your specific GPU architecture and TensorRT version. To use TRT, build engines from the ONNX models yourself — see `moss_audio_tokenizer/trt/build_engine.sh` in the main repository.

## Repository Contents

| File | Description |
|------|-------------|
| `encoder.onnx` | ONNX model for audio encoding (waveform → discrete codes) |
| `decoder.onnx` | ONNX model for audio decoding (discrete codes → waveform) |
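As a minimal sketch of running these two files with ONNX Runtime: the code below assumes each export has a single input and a single output, a float32 `(batch, channels, samples)` waveform at 24 kHz, and the local directory produced by the download command in this README; inspect the sessions with `get_inputs()`/`get_outputs()` to confirm the real shapes and names before relying on it.

```python
import os

MODEL_DIR = "weights/MOSS-Audio-Tokenizer-ONNX"  # matches the --local-dir used in this README

def reconstruct(wav, model_dir=MODEL_DIR):
    """Encode a waveform with encoder.onnx, then decode the codes back to audio.

    Assumes one input and one output tensor per model; verify against the exports.
    """
    import onnxruntime as ort  # pip install onnxruntime (or onnxruntime-gpu)

    encoder = ort.InferenceSession(os.path.join(model_dir, "encoder.onnx"))
    decoder = ort.InferenceSession(os.path.join(model_dir, "decoder.onnx"))

    # Look up tensor names at runtime rather than hard-coding them.
    enc_input = encoder.get_inputs()[0].name
    (codes,) = encoder.run(None, {enc_input: wav})

    dec_input = decoder.get_inputs()[0].name
    (audio,) = decoder.run(None, {dec_input: codes})
    return audio

if os.path.isdir(MODEL_DIR):
    import numpy as np
    wav = np.random.randn(1, 1, 24_000).astype(np.float32)  # 1 s of dummy audio
    print(reconstruct(wav).shape)
```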

## Quick Start

```bash
# Download
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX \
  --local-dir weights/MOSS-Audio-Tokenizer-ONNX
```

This is typically used together with [MOSS-TTS-GGUF](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-GGUF) for the llama.cpp inference pipeline. See the [llama.cpp Backend documentation](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_delay/llama_cpp/README.md) for the full end-to-end setup.
## Main Repositories

| Repository | Description |
|------------|-------------|
| [OpenMOSS/MOSS-TTS](https://github.com/OpenMOSS/MOSS-TTS) | MOSS-TTS Family main repository (includes llama.cpp backend, PyTorch inference, and all models) |
| [OpenMOSS/MOSS-Audio-Tokenizer](https://github.com/OpenMOSS/MOSS-Audio-Tokenizer) | MOSS-Audio-Tokenizer source code, PyTorch weights, ONNX/TRT export scripts, and evaluation |
| [OpenMOSS-Team/MOSS-Audio-Tokenizer](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer) | PyTorch weights on Hugging Face (for `trust_remote_code=True` usage) |
| [OpenMOSS-Team/MOSS-TTS-GGUF](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-GGUF) | Pre-quantized GGUF backbone weights (companion to this ONNX repo) |

## About MOSS-Audio-Tokenizer

**MOSS-Audio-Tokenizer** compresses 24kHz raw audio into a 12.5Hz frame rate using a 32-layer Residual Vector Quantizer (RVQ), supporting high-fidelity reconstruction from 0.125kbps to 4kbps. It is trained from scratch on 3 million hours of speech, sound effects, and music, achieving state-of-the-art reconstruction quality among open-source audio tokenizers.
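A quick back-of-the-envelope check shows these figures are internally consistent. The 10-bit code width (a 1024-entry codebook per RVQ layer) is an assumption inferred from the quoted bitrates, not a value stated on this card.

```python
# Sanity-check the figures above: 24 kHz audio -> 12.5 Hz frames, 32 RVQ
# layers, 0.125-4 kbps. BITS_PER_CODE = 10 is assumed (1024-entry codebook).
SAMPLING_RATE = 24_000
FRAME_RATE = 12.5
NUM_RVQ_LAYERS = 32
BITS_PER_CODE = 10

samples_per_frame = SAMPLING_RATE / FRAME_RATE        # audio samples per frame
kbps_one_layer = FRAME_RATE * BITS_PER_CODE / 1000    # bitrate with 1 RVQ layer
kbps_all_layers = kbps_one_layer * NUM_RVQ_LAYERS     # bitrate with all 32 layers

print(samples_per_frame)  # 1920.0
print(kbps_one_layer)     # 0.125
print(kbps_all_layers)    # 4.0
```

The 0.125 kbps floor thus corresponds to decoding from a single RVQ layer, and 4 kbps to all 32.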

For the full model description, architecture details, and evaluation metrics, please refer to:

- [MOSS-Audio-Tokenizer GitHub Repository](https://github.com/OpenMOSS/MOSS-Audio-Tokenizer)
- [MOSS-TTS README — Audio Tokenizer Section](https://github.com/OpenMOSS/MOSS-TTS#moss-audio-tokenizer)

## Evaluation Metrics
