upd readme
README.md
CHANGED
@@ -11,15 +11,15 @@ tags:
 - trust-remote-code
 ---
 
-#
+# Moss-Audio-Tokenizer-V2
 
-This is the code for MOSS-Audio-Tokenizer presented in [MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models](https://arxiv.org/abs/2602.10934).
+This is the code for the 48khz stereo version of MOSS-Audio-Tokenizer presented in [MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models](https://arxiv.org/abs/2602.10934).
 
-**
+**MOSS-Audio-Tokenizer-V2** is a unified discrete audio tokenizer based on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture. Scaling to 2 billion parameters, it functions as a unified discrete interface, delivering both lossless-quality reconstruction and high-level semantic alignment.
 
 **Key Features:**
 
-* **Extreme Compression & Variable Bitrate**: It compresses 48kHz stereo audio into a remarkably low frame rate of 12.5Hz. Utilizing a 32-layer Residual
+* **Extreme Compression & Variable Bitrate**: It compresses 48kHz stereo audio into a remarkably low frame rate of 12.5Hz. Utilizing a 32-layer Residual Vector Quantization stack, it supports high-fidelity reconstruction across a wide range of bitrates.
 * **Pure Transformer Architecture**: The model features a "CNN-free" homogeneous architecture built entirely from Causal Transformer blocks. With 1.6B combined parameters (Encoder + Decoder), it ensures exceptional scalability and supports low-latency streaming inference.
 * **Large-Scale General Audio Training**: Trained on 3 million hours of diverse audio data, the model excels at encoding and reconstructing all audio domains, including speech, sound effects, and music.
 * **Unified Semantic-Acoustic Representation**: While achieving state-of-the-art reconstruction quality, Cat produces discrete tokens that are "semantic-rich," making them ideal for downstream tasks like speech understanding (ASR) and generation (TTS).
@@ -42,7 +42,7 @@ import torch
 from transformers import AutoModel
 import torchaudio
 
-repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
+repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer-V2"
 model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
 
 wav, sr = torchaudio.load('demo/demo_gt.wav')
@@ -92,7 +92,7 @@ The quantizer always runs in fp32.
 import torch
 from transformers import AutoModel
 
-repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
+repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer-V2"
 model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
 audio = torch.randn(2, 48000 * 6) # dummy stereo waveform
 
@@ -121,5 +121,18 @@ batch_dec = model.batch_decode(codes_list, chunk_duration=0.08)
 ## Citation
 If you use this code or result in your paper, please cite our work as:
 ```tex
-
+@misc{gong2026mossaudiotokenizerscaling,
+title={MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models},
+author={Yitian Gong and Kuangwei Chen and Zhaoye Fei and Xiaogui Yang and Ke Chen and Yang Wang and Kexin Huang and Mingshu Chen and Ruixiao Li and Qingyuan Cheng and Shimin Li and Xipeng Qiu},
+year={2026},
+eprint={2602.10934},
+archivePrefix={arXiv},
+primaryClass={cs.SD},
+url={https://arxiv.org/abs/2602.10934}
+}
 ```
+
+## License
+<!-- TODO: check and add license -->
+MOSS-Audio-Tokenizer-V2 is released under the Apache 2.0 license.
+
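The "Extreme Compression & Variable Bitrate" bullet in the updated README gives three numbers: 48 kHz input, a 12.5 Hz frame rate, and a 32-layer RVQ stack. The sketch below works through what those numbers imply. It is illustrative arithmetic only and does not call the model. The codebook size (`ASSUMED_CODEBOOK_SIZE`), the single-code-stream bitrate formula, and the rounding in `num_frames` are assumptions that the README does not state, and all function names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 48_000       # from the README: 48 kHz input audio
FRAME_RATE_HZ = 12.5          # from the README: token frame rate
MAX_RVQ_LAYERS = 32           # from the README: depth of the RVQ stack
ASSUMED_CODEBOOK_SIZE = 1024  # hypothetical; not stated in the README


def samples_per_frame(sample_rate_hz: float = SAMPLE_RATE_HZ,
                      frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Waveform samples (per channel) summarized by one token frame."""
    return sample_rate_hz / frame_rate_hz


def num_frames(duration_s: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Token frames for a clip, assuming a partial last frame is padded up."""
    return math.ceil(duration_s * frame_rate_hz)


def bitrate_bps(num_layers: int,
                codebook_size: int = ASSUMED_CODEBOOK_SIZE,
                frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Bits per second when keeping the first `num_layers` RVQ codes per frame.

    Assumes one code stream with log2(codebook_size) bits per code.
    """
    if not 1 <= num_layers <= MAX_RVQ_LAYERS:
        raise ValueError(f"num_layers must be in [1, {MAX_RVQ_LAYERS}]")
    return frame_rate_hz * num_layers * math.log2(codebook_size)


if __name__ == "__main__":
    print("samples per frame:", samples_per_frame())
    print("frames in a 6 s clip:", num_frames(6))
    for layers in (1, 8, 32):
        print(f"{layers:>2} layers: {bitrate_bps(layers):.0f} bps (assumed codebook)")
```

Under these assumptions, dropping trailing RVQ layers scales the bitrate linearly, which is one common way a residual quantizer offers variable bitrate. The README should be consulted for the model's actual codebook size and stereo handling.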