Commit 66562e3
Parent(s): 4a52ecf

modify readme (#2)

- modify readme (f9c0141315f2c2fe1fb6442a19af4a13f49eda15)

Co-authored-by: yitian gong <fdugyt@users.noreply.huggingface.co>

README.md CHANGED
@@ -13,8 +13,7 @@ tags:
 
 # MossAudioTokenizer
 
-
-Both the encoder and decoder of MossAudioTokenizer contain approximately 0.8 billion parameters each, totaling about 1.6 billion. MossAudioTokenizer operates at 12.5 Hz, uses a 32-layer residual vector quantizer (RVQ), and supports variable-codebook decoding.
+MOSS Audio Tokenizer is a unified audio tokenizer designed to achieve both high-fidelity reconstruction and semantically rich representations across speech, sound, and music. Built on the Cat (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture, the model scales to 1.6 billion parameters and was trained on 3 million hours of audio, surpassing previous open-source tokenizers in reconstruction quality across all bitrates. It processes 24 kHz audio at a low 12.5 Hz frame rate, with all components—including the encoder, quantizer, decoder, decoder-only LLM, and discriminator—optimized jointly in an end-to-end manner. Featuring a 32-layer residual vector quantizer (RVQ) with variable-bitrate support, it provides a scalable, native foundation for the next generation of autoregressive audio foundation models.
 
 This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers
 `transformers.models.moss_audio_tokenizer` module. It is intended to be uploaded to a Hugging Face Hub model repository
@@ -29,31 +28,8 @@ and loaded with `trust_remote_code=True` when needed.
 
 ## Usage
 
-### Installation
-
-```bash
-cd MOSS-Audio-Tokenizer
-pip install -r requirements.txt
-```
-
 ### Quickstart
 
-```python
-import torch
-from transformers import AutoModel
-
-repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
-model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
-
-audio = torch.randn(1, 1, 3200)  # dummy waveform
-enc = model.encode(audio, return_dict=True)
-print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")
-dec = model.decode(enc.audio_codes, return_dict=True)
-print(f"dec.audio.shape: {dec.audio.shape}")
-```
-
-### Quickstart (Waveform I/O)
-
 ```python
 import torch
 from transformers import AutoModel
@@ -118,10 +94,10 @@ The table below compares the reconstruction quality of open-source audio tokenizers.
 - Audio metrics are evaluated on the AudioSet evaluation subset, while music metrics are evaluated on MUSDB, reported as audio/music.
 - STFT-Dist. denotes the STFT distance.
 - Higher is better for speech metrics, while lower is better for audio/music metrics (Mel-Loss, STFT-Dist.).
-
+- Nq denotes the number of quantizers.
 
-| Model | bps | Frame rate |
-
+| Model | bps | Frame rate | Nq | Speech: SIM ↑ (EN/ZH) | Speech: STOI ↑ (EN/ZH) | Speech: PESQ-NB ↑ (EN/ZH) | Speech: PESQ-WB ↑ (EN/ZH) | Audio/Music: Mel-Loss ↓ | Audio/Music: STFT-Dist. ↓ |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
 | **XCodec2.0** | 800 | 50 | 1 | 0.82 / 0.74 | 0.92 / 0.86 | 3.04 / 2.46 | 2.43 / 1.96 | -- / -- | -- / -- |
 | **MiMo Audio Tokenizer** | 850 | 25 | 4 | 0.80 / 0.74 | 0.91 / 0.87 | 2.94 / 2.62 | 2.39 / 2.14 | **0.82** / 0.81 | 2.33 / 2.23 |
 | **Higgs Audio Tokenizer** | 1000 | 25 | 4 | 0.77 / 0.68 | 0.83 / 0.82 | 3.03 / 2.61 | 2.48 / 2.14 | 0.83 / **0.80** | 2.20 / 2.05 |
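A note on the table's bps column: for an RVQ tokenizer, bitrate is frame rate × number of quantizers × bits per code, so the rows can be sanity-checked with a few lines of Python (the codebook sizes below are assumptions for illustration; the table reports bps directly and does not state them).

```python
import math

def bits_per_second(frame_rate_hz, num_quantizers, codebook_size):
    """Bitrate of an RVQ tokenizer: frames/s x quantizers x bits per code."""
    return frame_rate_hz * num_quantizers * math.log2(codebook_size)

# Higgs Audio Tokenizer row: 25 Hz x 4 quantizers; assuming 1024-entry
# codebooks (10 bits/code) gives 25 * 4 * 10 = 1000 bps, matching the table.
print(bits_per_second(25, 4, 1024))  # 1000.0
```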
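The updated description credits the 32-layer RVQ with variable-bitrate support. A minimal sketch of why RVQ allows this (the function name and the layers-by-frames code layout here are hypothetical, not the MossAudioTokenizer API): residual quantizer layers are ordered coarse to fine, so decoding from only a prefix of the layers trades fidelity for bitrate.

```python
def truncate_rvq_codes(codes, keep_layers):
    """Keep the first `keep_layers` layers of a residual-VQ code stack.

    `codes` is a list of per-layer code sequences, coarsest layer first;
    an RVQ decoder can reconstruct audio from any such prefix, at
    keep_layers / len(codes) of the full bitrate.
    """
    if not 1 <= keep_layers <= len(codes):
        raise ValueError("keep_layers must be in [1, num_quantizers]")
    return codes[:keep_layers]

# 32 layers x 4 frames of dummy codes; keeping 8 layers = 1/4 bitrate.
codes = [[layer] * 4 for layer in range(32)]
low_rate = truncate_rvq_codes(codes, 8)
print(len(low_rate), len(low_rate[0]))  # 8 4
```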