cms42 committed on
Commit c7468e6 · verified · 1 Parent(s): 8d409dd

Update README

Files changed (1)
  1. README.md +38 -72
README.md CHANGED
@@ -1,6 +1,6 @@
  ---
  license: apache-2.0
- library_name: transformers
  tags:
  - audio
  - audio-tokenizer
@@ -8,97 +8,63 @@ tags:
  - moss-tts-family
  - MOSS Audio Tokenizer
  - speech-tokenizer
- - trust-remote-code
  ---

- # MossAudioTokenizer

- This is the code for MOSS-Audio-Tokenizer presented in [MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models](https://arxiv.org/abs/2602.10934).

- **MOSS-Audio-Tokenizer** is a unified discrete audio tokenizer based on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture. Scaling to 1.6 billion parameters, it functions as a unified discrete interface, delivering both lossless-quality reconstruction and high-level semantic alignment.

- **Key Features:**

- * **Extreme Compression & Variable Bitrate**: It compresses 24kHz raw audio into a remarkably low frame rate of 12.5Hz. Utilizing a 32-layer Residual Vector Quantizer (RVQ), it supports high-fidelity reconstruction across a wide range of bitrates, from 0.125kbps to 4kbps.
- * **Pure Transformer Architecture**: The model features a "CNN-free" homogeneous architecture built entirely from Causal Transformer blocks. With 1.6B combined parameters (Encoder + Decoder), it ensures exceptional scalability and supports low-latency streaming inference.
- * **Large-Scale General Audio Training**: Trained on 3 million hours of diverse audio data, the model excels at encoding and reconstructing all audio domains, including speech, sound effects, and music.
- * **Unified Semantic-Acoustic Representation**: While achieving state-of-the-art reconstruction quality, Cat produces discrete tokens that are "semantic-rich," making them ideal for downstream tasks like speech understanding (ASR) and generation (TTS).
- * **Fully Trained From Scratch**: Cat does not rely on any pretrained encoders (such as HuBERT or Whisper) or distillation from teacher models. All representations are learned autonomously from raw data.
- * **End-to-End Joint Optimization**: All components, including the encoder, quantizer, decoder, discriminator, and a decoder-only LLM for semantic alignment, are optimized jointly in a single unified training pipeline.

- **Summary:**
- By combining a simple, scalable architecture with massive-scale data, the Cat architecture overcomes the bottlenecks of traditional audio tokenizers. It provides a robust, high-fidelity, and semantically grounded interface for the next generation of native audio foundation models.
 
- This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers
- `transformers.models.moss_audio_tokenizer` module. It is intended to be uploaded to a Hugging Face Hub model repository
- and loaded with `trust_remote_code=True` when needed.

- <br>
- <p align="center">
-     <img src="images/arch.png" width="95%"> <br>
-     Architecture of MossAudioTokenizer
- </p>
- <br>

- ## Usage

- ### Quickstart

- ```python
- import torch
- from transformers import AutoModel
- import torchaudio
-
- repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
- model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
-
- wav, sr = torchaudio.load('demo/demo_gt.wav')
- if sr != model.sampling_rate:
-     wav = torchaudio.functional.resample(wav, sr, model.sampling_rate)
- wav = wav.unsqueeze(0)
- enc = model.encode(wav, return_dict=True)
- print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")
- dec = model.decode(enc.audio_codes, return_dict=True)
- print(f"dec.audio.shape: {dec.audio.shape}")
- wav = dec.audio.squeeze(0)
- torchaudio.save("demo/demo_rec.wav", wav, sample_rate=model.sampling_rate)
-
- # Decode using only the first 8 layers of the RVQ
- dec_rvq8 = model.decode(enc.audio_codes[:8], return_dict=True)
- wav_rvq8 = dec_rvq8.audio.squeeze(0)
- torchaudio.save("demo/demo_rec_rvq8.wav", wav_rvq8, sample_rate=model.sampling_rate)
  ```

- ### Streaming
-
- `MossAudioTokenizerModel.encode` and `MossAudioTokenizerModel.decode` support simple streaming via a `chunk_duration`
- argument.

- - `chunk_duration` is expressed in seconds.
- - It must be <= `MossAudioTokenizerConfig.causal_transformer_context_duration`.
- - `chunk_duration * MossAudioTokenizerConfig.sampling_rate` must be divisible by `MossAudioTokenizerConfig.downsample_rate`.
- - Streaming chunking only supports `batch_size=1`.

- ```python
- import torch
- from transformers import AutoModel
-
- repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
- model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
- audio = torch.randn(1, 1, 3200)  # dummy waveform
-
- # 0.08s @ 24kHz = 1920 samples, divisible by downsample_rate=1920
- enc = model.encode(audio, return_dict=True, chunk_duration=0.08)
- dec = model.decode(enc.audio_codes, return_dict=True, chunk_duration=0.08)
- ```
 
- ## Repository layout

- - `configuration_moss_audio_tokenizer.py`
- - `modeling_moss_audio_tokenizer.py`
- - `__init__.py`
- - `config.json`
- - model weights

  ## Evaluation Metrics
 
 
  ---
  license: apache-2.0
+ library_name: onnx
  tags:
  - audio
  - audio-tokenizer
  - moss-tts-family
  - MOSS Audio Tokenizer
  - speech-tokenizer
+ - onnx
+ - tensorrt
  ---

+ # MOSS-Audio-Tokenizer-ONNX

+ This repository provides the **ONNX exports** of [MOSS-Audio-Tokenizer](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer) (encoder & decoder), enabling **torch-free** audio encoding/decoding for the [MOSS-TTS](https://github.com/OpenMOSS/MOSS-TTS) family.

+ ## Overview

+ **MOSS-Audio-Tokenizer** is the unified discrete audio interface for the entire MOSS-TTS Family, based on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture — a 1.6B-parameter, pure Causal Transformer audio tokenizer trained on 3M hours of diverse audio.

+ This ONNX repository is designed for **lightweight, torch-free deployment** scenarios. It serves as the audio tokenizer component in the [MOSS-TTS llama.cpp inference backend](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_delay/llama_cpp/README.md), which combines [llama.cpp](https://github.com/ggerganov/llama.cpp) (for the Qwen3 backbone) with ONNX Runtime or TensorRT (for the audio tokenizer) to achieve fully **PyTorch-free** TTS inference.

+ ### Supported Backends

+ | Backend | Runtime | Use Case |
+ |---------|---------|----------|
+ | **ONNX Runtime (GPU)** | `onnxruntime-gpu` | Recommended starting point |
+ | **ONNX Runtime (CPU)** | `onnxruntime` | CPU-only / no CUDA |
+ | **TensorRT** | Build from ONNX | Maximum throughput (user-built engines) |

+ > **Note:** We do **not** provide pre-built TensorRT engines, as they are tied to your specific GPU architecture and TensorRT version. To use TRT, build engines from the ONNX models yourself — see `moss_audio_tokenizer/trt/build_engine.sh` in the main repository.

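At session-creation time, ONNX Runtime picks the first available backend from a provider preference list, which is how the GPU/CPU choice in the table above is expressed in code. A minimal sketch, assuming the download path used later in this card; the session creation is guarded so the snippet degrades gracefully when `onnxruntime` or the model file is absent:

```python
# Sketch: execution-provider selection for ONNX Runtime.
# "CUDAExecutionProvider" / "CPUExecutionProvider" are standard ONNX Runtime
# provider names; the encoder path below is illustrative.

def pick_providers(prefer_gpu: bool) -> list:
    """Build a provider preference list for ort.InferenceSession."""
    if prefer_gpu:
        # ONNX Runtime falls through to the next entry if CUDA is unavailable.
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]

providers = pick_providers(prefer_gpu=True)

try:
    import onnxruntime as ort
    session = ort.InferenceSession(
        "weights/MOSS-Audio-Tokenizer-ONNX/encoder.onnx",
        providers=providers,
    )
except Exception:
    session = None  # onnxruntime missing, or the model not downloaded yet
```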
+ ## Repository Contents

+ | File | Description |
+ |------|-------------|
+ | `encoder.onnx` | ONNX model for audio encoding (waveform → discrete codes) |
+ | `decoder.onnx` | ONNX model for audio decoding (discrete codes → waveform) |

+ ## Quick Start

+ ```bash
+ # Download
+ huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX \
+     --local-dir weights/MOSS-Audio-Tokenizer-ONNX
  ```
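The same fetch can also be scripted from Python via `huggingface_hub.snapshot_download`. A small sketch; the helper name `download_tokenizer` is ours, not part of any published API:

```python
# Scripted alternative to the huggingface-cli call above.
# download_tokenizer is a hypothetical helper, not a repo-provided function.

def download_tokenizer(local_dir: str = "weights/MOSS-Audio-Tokenizer-ONNX") -> str:
    """Fetch encoder.onnx / decoder.onnx from the Hub; returns the local path."""
    from huggingface_hub import snapshot_download
    return snapshot_download(
        repo_id="OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX",
        local_dir=local_dir,
    )

# path = download_tokenizer()  # requires network access and huggingface_hub
```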

+ This is typically used together with [MOSS-TTS-GGUF](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-GGUF) for the llama.cpp inference pipeline. See the [llama.cpp Backend documentation](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_delay/llama_cpp/README.md) for the full end-to-end setup.

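With the files in place, a torch-free encode/decode round trip looks roughly like the following. This is a sketch under stated assumptions: the waveform rank `(batch, channel, samples)` and the single-output-per-model assumption are our guesses, not documented contracts, so inspect `session.get_inputs()` / `get_outputs()` on the real models. The 1920-samples-per-frame figure follows from the tokenizer's stated 24 kHz input and 12.5 Hz frame rate:

```python
# Sketch of a torch-free round trip with ONNX Runtime.
# ASSUMED (not documented here): input rank (batch, channel, samples) and a
# single output per model; verify with get_inputs()/get_outputs().
import os

SAMPLING_RATE = 24000   # Hz, stated on the model card
DOWNSAMPLE_RATE = 1920  # samples per frame: 24000 Hz / 12.5 Hz

def num_frames(num_samples: int) -> int:
    """Number of discrete frames the encoder emits for num_samples samples."""
    return num_samples // DOWNSAMPLE_RATE

print(num_frames(2 * SAMPLING_RATE))  # 2 s of audio -> 25 frames

enc_path = "weights/MOSS-Audio-Tokenizer-ONNX/encoder.onnx"
dec_path = "weights/MOSS-Audio-Tokenizer-ONNX/decoder.onnx"
if os.path.exists(enc_path) and os.path.exists(dec_path):
    import numpy as np
    import onnxruntime as ort

    encoder = ort.InferenceSession(enc_path, providers=["CPUExecutionProvider"])
    decoder = ort.InferenceSession(dec_path, providers=["CPUExecutionProvider"])

    wav = np.random.randn(1, 1, 2 * SAMPLING_RATE).astype(np.float32)
    # Read the real tensor names from the sessions instead of guessing them.
    (codes,) = encoder.run(None, {encoder.get_inputs()[0].name: wav})
    (audio,) = decoder.run(None, {decoder.get_inputs()[0].name: codes})
```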
+ ## Main Repositories

+ | Repository | Description |
+ |------------|-------------|
+ | [OpenMOSS/MOSS-TTS](https://github.com/OpenMOSS/MOSS-TTS) | MOSS-TTS Family main repository (includes llama.cpp backend, PyTorch inference, and all models) |
+ | [OpenMOSS/MOSS-Audio-Tokenizer](https://github.com/OpenMOSS/MOSS-Audio-Tokenizer) | MOSS-Audio-Tokenizer source code, PyTorch weights, ONNX/TRT export scripts, and evaluation |
+ | [OpenMOSS-Team/MOSS-Audio-Tokenizer](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer) | PyTorch weights on Hugging Face (for `trust_remote_code=True` usage) |
+ | [OpenMOSS-Team/MOSS-TTS-GGUF](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-GGUF) | Pre-quantized GGUF backbone weights (companion to this ONNX repo) |

+ ## About MOSS-Audio-Tokenizer

+ **MOSS-Audio-Tokenizer** compresses 24kHz raw audio into a 12.5Hz frame rate using a 32-layer Residual Vector Quantizer (RVQ), supporting high-fidelity reconstruction from 0.125kbps to 4kbps. It is trained from scratch on 3 million hours of speech, sound effects, and music, achieving state-of-the-art reconstruction quality among open-source audio tokenizers.
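These figures are mutually consistent, and the per-layer codebook size can be derived from them. A quick sanity check; the 10-bit, 1024-entry per-layer codebook is inferred from the stated bitrates, not read from the model card:

```python
# Back-of-envelope check of the stated compression numbers.
# Inferred (not stated here): bits per RVQ layer, derived from the bitrates.
FRAME_RATE = 12.5   # Hz, stated
NUM_LAYERS = 32     # RVQ layers, stated
MAX_KBPS = 4.0      # all 32 layers, stated
MIN_KBPS = 0.125    # stated; corresponds to a single RVQ layer

tokens_per_second = FRAME_RATE * NUM_LAYERS            # 400 tokens/s at full depth
bits_per_token = MAX_KBPS * 1000 / tokens_per_second   # -> 10 bits/token
codebook_size = 2 ** int(bits_per_token)               # -> 1024 entries per layer

# One layer at 10 bits/token and 12.5 Hz gives exactly the minimum bitrate:
assert FRAME_RATE * bits_per_token / 1000 == MIN_KBPS

# 24 kHz waveform -> 12.5 Hz frames implies a 1920x temporal downsample:
assert 24000 / FRAME_RATE == 1920

print(bits_per_token, codebook_size)  # 10.0 1024
```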
 
+ For the full model description, architecture details, and evaluation metrics, please refer to:
+ - [MOSS-Audio-Tokenizer GitHub Repository](https://github.com/OpenMOSS/MOSS-Audio-Tokenizer)
+ - [MOSS-TTS README — Audio Tokenizer Section](https://github.com/OpenMOSS/MOSS-TTS#moss-audio-tokenizer)

  ## Evaluation Metrics