Li-Ruixiao fdugyt committed on
Commit 66562e3 · 1 Parent(s): 4a52ecf

modify readme (#2)


- modify readme (f9c0141315f2c2fe1fb6442a19af4a13f49eda15)


Co-authored-by: yitian gong <fdugyt@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +4 -28
README.md CHANGED
@@ -13,8 +13,7 @@ tags:
 
 # MossAudioTokenizer
 
-MossAudioTokenizer is a Transformer-based neural audio tokenizer model jointly optimizing the encoder, quantizer, and decoder from scratch for high-fidelity reconstruction of general audio, audio tokenization and synthesis.
-Both the encoder and decoder of MossAudioTokenizer contain approximately 0.8 billion parameters each, totaling about 1.6 billion. MossAudioTokenizer operates at 12.5 Hz, uses a 32-layer residual vector quantizer (RVQ), and supports variable-codebook decoding.
+MOSS Audio Tokenizer is a unified audio tokenizer designed to achieve both high-fidelity reconstruction and semantically rich representations across speech, sound, and music. Built on the Cat (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture, the model scales to 1.6 billion parameters and was trained on 3 million hours of audio, surpassing previous open-source tokenizers in reconstruction quality across all bitrates. It processes 24 kHz audio at a low 12.5 Hz frame rate, with all components—including the encoder, quantizer, decoder, decoder-only LLM, and discriminator—optimized jointly in an end-to-end manner. Featuring a 32-layer residual vector quantizer (RVQ) with variable-bitrate support, it provides a scalable, native foundation for the next generation of autoregressive audio foundation models.
 
 This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers
 `transformers.models.moss_audio_tokenizer` module. It is intended to be uploaded to a Hugging Face Hub model repository
@@ -29,31 +28,8 @@ and loaded with `trust_remote_code=True` when needed.
 
 ## Usage
 
-### Installation
-
-```bash
-cd MOSS-Audio-Tokenizer
-pip install -r requirements.txt
-```
-
 ### Quickstart
 
-```python
-import torch
-from transformers import AutoModel
-
-repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
-model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
-
-audio = torch.randn(1, 1, 3200)  # dummy waveform
-enc = model.encode(audio, return_dict=True)
-print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")
-dec = model.decode(enc.audio_codes, return_dict=True)
-print(f"dec.audio.shape: {dec.audio.shape}")
-```
-
-### Quickstart (Waveform I/O)
-
 ```python
 import torch
 from transformers import AutoModel
@@ -118,10 +94,10 @@ The table below compares the reconstruction quality of open-source audio tokeniz
 - Audio metrics are evaluated on the AudioSet evaluation subset, while music metrics are evaluated on MUSDB, reported as audio/music.
 - STFT-Dist. denotes the STFT distance.
 - Higher is better for speech metrics, while lower is better for audio/music metrics (Mel-Loss, STFT-Dist.).
-- $\boldsymbol{N}_{\mathrm{VQ}}$ denotes the number of quantizers.
+- Nq denotes the number of quantizers.
 
-| Model | bps | Frame rate | $\boldsymbol{N}_{\mathrm{VQ}}$ | Speech: SIM ↑ (EN/ZH) | Speech: STOI ↑ (EN/ZH) | Speech: PESQ-NB ↑ (EN/ZH) | Speech: PESQ-WB ↑ (EN/ZH) | Audio/Music: Mel-Loss ↓ | Audio/Music: STFT-Dist. ↓ |
-| --- | ---: | ---: | ---: | --- | --- | --- | --- | --- | --- |
+| Model | bps | Frame rate | Nq | Speech: SIM ↑ (EN/ZH) | Speech: STOI ↑ (EN/ZH) | Speech: PESQ-NB ↑ (EN/ZH) | Speech: PESQ-WB ↑ (EN/ZH) | Audio/Music: Mel-Loss ↓ | Audio/Music: STFT-Dist. ↓ |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
 | **XCodec2.0** | 800 | 50 | 1 | 0.82 / 0.74 | 0.92 / 0.86 | 3.04 / 2.46 | 2.43 / 1.96 | -- / -- | -- / -- |
 | **MiMo Audio Tokenizer** | 850 | 25 | 4 | 0.80 / 0.74 | 0.91 / 0.87 | 2.94 / 2.62 | 2.39 / 2.14 | **0.82** / 0.81 | 2.33 / 2.23 |
 | **Higgs Audio Tokenizer** | 1000 | 25 | 4 | 0.77 / 0.68 | 0.83 / 0.82 | 3.03 / 2.61 | 2.48 / 2.14 | 0.83 / **0.80** | 2.20 / 2.05 |
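A note for readers cross-checking the comparison table: the bps column follows from frame rate, Nq, and bits per code as bps = frame_rate × Nq × log2(codebook size). A quick sketch below; the codebook sizes used are assumptions chosen to reproduce the listed figures, not values taken from the respective model cards.

```python
import math

def bitrate_bps(frame_rate_hz, n_quantizers, codebook_size):
    """Token bitrate of an RVQ codec: frames/s x quantizers x bits per code."""
    return frame_rate_hz * n_quantizers * math.log2(codebook_size)

# Hypothetical codebook sizes: a single 65536-entry codebook reproduces
# XCodec2.0's 800 bps; four 1024-entry codebooks reproduce Higgs Audio
# Tokenizer's 1000 bps.
print(bitrate_bps(50, 1, 65536))  # 800.0
print(bitrate_bps(25, 4, 1024))   # 1000.0
```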
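The updated description mentions a 32-layer RVQ with variable-bitrate support. The mechanism behind variable-codebook decoding can be sketched in plain NumPy: each quantizer encodes the residual left by the previous layers, so decoding any prefix of the code streams yields a coarser but still valid reconstruction. This is a generic illustration of residual vector quantization, not the model's actual quantizer.

```python
import numpy as np

def rvq_encode(frames, codebooks):
    """Residual VQ: layer i quantizes the residual left by layers 0..i-1."""
    residual = frames.copy()
    codes = []
    for cb in codebooks:
        # nearest codebook entry to each frame's current residual
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        residual -= cb[idx]
    return np.stack(codes)  # (n_layers, n_frames)

def rvq_decode(codes, codebooks, n_layers=None):
    """Variable-codebook decoding: sum entries of the first n_layers only."""
    n_layers = len(codes) if n_layers is None else n_layers
    return sum(codebooks[i][codes[i]] for i in range(n_layers))

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 8))
# entry 0 of each codebook is the zero vector, so a layer can leave the
# residual unchanged; extra layers then never increase reconstruction error
codebooks = [np.vstack([np.zeros((1, 8)), rng.normal(size=(63, 8))])
             for _ in range(4)]
codes = rvq_encode(frames, codebooks)
coarse = rvq_decode(codes, codebooks, n_layers=1)
fine = rvq_decode(codes, codebooks)
print(((frames - fine) ** 2).mean() <= ((frames - coarse) ** 2).mean())  # True
```

Truncating to the first k code streams is what lets one encoding pass serve multiple bitrates, which is the property the table's per-bitrate comparisons exercise.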