Upload 6 files
Update pictures, metrics and usage in README.
- .gitattributes +5 -0
- README.md +110 -2
- images/arch.png +3 -0
- images/pesq-nb.png +3 -0
- images/pesq-wb.png +3 -0
- images/sim.png +3 -0
- images/stoi.png +3 -0
.gitattributes
CHANGED
|
@@ -34,3 +34,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 34 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 36 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 36 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 37 |
+
images/arch.png filter=lfs diff=lfs merge=lfs -text
|
| 38 |
+
images/pesq-nb.png filter=lfs diff=lfs merge=lfs -text
|
| 39 |
+
images/pesq-wb.png filter=lfs diff=lfs merge=lfs -text
|
| 40 |
+
images/sim.png filter=lfs diff=lfs merge=lfs -text
|
| 41 |
+
images/stoi.png filter=lfs diff=lfs merge=lfs -text
|
README.md
CHANGED
|
@@ -6,21 +6,36 @@ tags:
|
|
| 6 |
- audio-tokenizer
|
| 7 |
- neural-codec
|
| 8 |
- moss-tts-family
|
| 9 |
-
-
|
| 10 |
- speech-tokenizer
|
| 11 |
- trust-remote-code
|
| 12 |
---
|
| 13 |
|
| 14 |
# MossAudioTokenizer
|
| 15 |
|
| 16 |
-
MossAudioTokenizer is a neural audio
|
|
|
|
| 17 |
|
| 18 |
This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers
|
| 19 |
`transformers.models.moss_audio_tokenizer` module. It is intended to be uploaded to a Hugging Face Hub model repository
|
| 20 |
and loaded with `trust_remote_code=True` when needed.
|
| 21 |
|
| 22 |
## Usage
|
| 23 |
|
| 24 |
### Quickstart
|
| 25 |
|
| 26 |
```python
|
|
@@ -32,7 +47,36 @@ model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
|
|
| 32 |
|
| 33 |
audio = torch.randn(1, 1, 3200) # dummy waveform
|
| 34 |
enc = model.encode(audio, return_dict=True)
|
| 35 |
dec = model.decode(enc.audio_codes, return_dict=True)
|
| 36 |
```
|
| 37 |
|
| 38 |
### Streaming
|
|
@@ -65,3 +109,67 @@ dec = model.decode(enc.audio_codes, return_dict=True, chunk_duration=0.08)
|
|
| 65 |
- `__init__.py`
|
| 66 |
- `config.json`
|
| 67 |
- model weights
|
| 6 |
- audio-tokenizer
|
| 7 |
- neural-codec
|
| 8 |
- moss-tts-family
|
| 9 |
+
- MOSS Audio Tokenizer
|
| 10 |
- speech-tokenizer
|
| 11 |
- trust-remote-code
|
| 12 |
---
|
| 13 |
|
| 14 |
# MossAudioTokenizer
|
| 15 |
|
| 16 |
+
MossAudioTokenizer is a Transformer-based neural audio tokenizer that jointly optimizes the encoder, quantizer, and decoder from scratch for high-fidelity reconstruction of general audio, audio tokenization, and synthesis.
|
| 17 |
+
The encoder and decoder each contain approximately 0.8 billion parameters, for about 1.6 billion in total. MossAudioTokenizer operates at a frame rate of 12.5 Hz, uses a 32-layer residual vector quantizer (RVQ), and supports variable-codebook decoding.
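As a rough illustration of what a 12.5 Hz frame rate and a 32-layer RVQ imply, the sketch below counts the discrete codes produced for a clip of a given duration. The helper name `codes_per_clip` is hypothetical, and rounding a trailing partial frame up is an assumption; the model's actual padding behavior may differ.

```python
import math


def codes_per_clip(duration_s: float, frame_rate_hz: float = 12.5, n_vq: int = 32) -> tuple[int, int]:
    """Return (frames, total_codes) for a clip of `duration_s` seconds.

    Assumption: a trailing partial frame is rounded up to a full frame.
    """
    frames = math.ceil(duration_s * frame_rate_hz)
    return frames, frames * n_vq


# A 10-second clip at 12.5 Hz gives 125 frames; with all 32 RVQ layers
# that is 125 * 32 = 4000 codes, and 125 * 8 = 1000 codes with 8 layers.
print(codes_per_clip(10.0))          # (125, 4000)
print(codes_per_clip(10.0, n_vq=8))  # (125, 1000)
```

Using fewer RVQ layers leaves the frame count unchanged and only reduces the number of codes per frame.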
|
| 18 |
|
| 19 |
This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers
|
| 20 |
`transformers.models.moss_audio_tokenizer` module. It is intended to be uploaded to a Hugging Face Hub model repository
|
| 21 |
and loaded with `trust_remote_code=True` when needed.
|
| 22 |
|
| 23 |
+
<br>
|
| 24 |
+
<p align="center">
|
| 25 |
+
<img src="images/arch.png" width="95%"> <br>
|
| 26 |
+
Architecture of MossAudioTokenizer
|
| 27 |
+
</p>
|
| 28 |
+
<br>
|
| 29 |
+
|
| 30 |
## Usage
|
| 31 |
|
| 32 |
+
### Installation
|
| 33 |
+
|
| 34 |
+
```bash
|
| 35 |
+
cd MOSS-Audio-Tokenizer
|
| 36 |
+
pip install -r requirements.txt
|
| 37 |
+
```
|
| 38 |
+
|
| 39 |
### Quickstart
|
| 40 |
|
| 41 |
```python
|
|
|
|
| 47 |
|
| 48 |
audio = torch.randn(1, 1, 3200) # dummy waveform
|
| 49 |
enc = model.encode(audio, return_dict=True)
|
| 50 |
+
print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")
|
| 51 |
+
dec = model.decode(enc.audio_codes, return_dict=True)
|
| 52 |
+
print(f"dec.audio.shape: {dec.audio.shape}")
|
| 53 |
+
```
|
| 54 |
+
|
| 55 |
+
### Quickstart (Waveform I/O)
|
| 56 |
+
|
| 57 |
+
```python
|
| 58 |
+
import torch
|
| 59 |
+
from transformers import AutoModel
|
| 60 |
+
import torchaudio
|
| 61 |
+
|
| 62 |
+
repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
|
| 63 |
+
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
|
| 64 |
+
|
| 65 |
+
wav, sr = torchaudio.load('demo/demo_gt.wav')
|
| 66 |
+
if sr != model.sampling_rate:
|
| 67 |
+
    wav = torchaudio.functional.resample(wav, sr, model.sampling_rate)
|
| 68 |
+
wav = wav.unsqueeze(0)
|
| 69 |
+
enc = model.encode(wav, return_dict=True)
|
| 70 |
+
print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")
|
| 71 |
dec = model.decode(enc.audio_codes, return_dict=True)
|
| 72 |
+
print(f"dec.audio.shape: {dec.audio.shape}")
|
| 73 |
+
wav = dec.audio.squeeze(0)
|
| 74 |
+
torchaudio.save("demo/demo_rec.wav", wav, sample_rate=model.sampling_rate)
|
| 75 |
+
|
| 76 |
+
# Decode using only the first 8 layers of the RVQ
|
| 77 |
+
dec_rvq8 = model.decode(enc.audio_codes[:8], return_dict=True)
|
| 78 |
+
wav_rvq8 = dec_rvq8.audio.squeeze(0)
|
| 79 |
+
torchaudio.save("demo/demo_rec_rvq8.wav", wav_rvq8, sample_rate=model.sampling_rate)
|
| 80 |
```
|
| 81 |
|
| 82 |
### Streaming
|
|
|
|
| 109 |
- `__init__.py`
|
| 110 |
- `config.json`
|
| 111 |
- model weights
|
| 112 |
+
|
| 113 |
+
## Evaluation Metrics
|
| 114 |
+
|
| 115 |
+
The table below compares the reconstruction quality of open-source audio tokenizers with MossAudioTokenizer on speech and audio/music data.
|
| 116 |
+
|
| 117 |
+
- Speech metrics are evaluated on LibriSpeech test-clean (English) and AISHELL-2 (Chinese), reported as EN/ZH.
|
| 118 |
+
- Audio metrics are evaluated on the AudioSet evaluation subset, while music metrics are evaluated on MUSDB, reported as audio/music.
|
| 119 |
+
- STFT-Dist. denotes the STFT distance.
|
| 120 |
+
- Higher is better for speech metrics, while lower is better for audio/music metrics (Mel-Loss, STFT-Dist.).
|
| 121 |
+
- $\boldsymbol{N}_{\mathrm{VQ}}$ denotes the number of quantizers.
|
| 122 |
+
|
| 123 |
+
| Model | bps | Frame rate (Hz) | $\boldsymbol{N}_{\mathrm{VQ}}$ | Speech: SIM ↑ (EN/ZH) | Speech: STOI ↑ (EN/ZH) | Speech: PESQ-NB ↑ (EN/ZH) | Speech: PESQ-WB ↑ (EN/ZH) | Audio/Music: Mel-Loss ↓ | Audio/Music: STFT-Dist. ↓ |
|
| 124 |
+
| --- | ---: | ---: | ---: | --- | --- | --- | --- | --- | --- |
|
| 125 |
+
| **XCodec2.0** | 800 | 50 | 1 | 0.82 / 0.74 | 0.92 / 0.86 | 3.04 / 2.46 | 2.43 / 1.96 | -- / -- | -- / -- |
|
| 126 |
+
| **MiMo Audio Tokenizer** | 850 | 25 | 4 | 0.80 / 0.74 | 0.91 / 0.87 | 2.94 / 2.62 | 2.39 / 2.14 | **0.82** / 0.81 | 2.33 / 2.23 |
|
| 127 |
+
| **Higgs Audio Tokenizer** | 1000 | 25 | 4 | 0.77 / 0.68 | 0.83 / 0.82 | 3.03 / 2.61 | 2.48 / 2.14 | 0.83 / **0.80** | 2.20 / 2.05 |
|
| 128 |
+
| **SpeechTokenizer** | 1000 | 50 | 2 | 0.36 / 0.25 | 0.77 / 0.68 | 1.59 / 1.38 | 1.25 / 1.17 | -- / -- | -- / -- |
|
| 129 |
+
| **XY-Tokenizer** | 1000 | 12.5 | 8 | 0.85 / 0.79 | 0.92 / 0.87 | 3.10 / 2.63 | 2.50 / 2.12 | -- / -- | -- / -- |
|
| 130 |
+
| **BigCodec** | 1040 | 80 | 1 | 0.84 / 0.69 | 0.93 / 0.88 | 3.27 / 2.55 | 2.68 / 2.06 | -- / -- | -- / -- |
|
| 131 |
+
| **Mimi** | 1100 | 12.5 | 8 | 0.74 / 0.59 | 0.91 / 0.85 | 2.80 / 2.24 | 2.25 / 1.78 | 1.24 / 1.19 | 2.62 / 2.49 |
|
| 132 |
+
| **MOSS Audio Tokenizer (Ours)** | 750 | 12.5 | 6 | 0.82 / 0.75 | 0.93 / 0.89 | 3.14 / 2.73 | 2.60 / 2.22 | 0.86 / 0.85 | 2.21 / 2.10 |
|
| 133 |
+
| **MOSS Audio Tokenizer (Ours)** | 1000 | 12.5 | 8 | **0.88** / **0.81** | **0.94** / **0.91** | **3.38** / **2.96** | **2.87** / **2.43** | **0.82** / **0.80** | **2.16** / **2.04** |
|
| 134 |
+
| **--** | **--** | **--** | **--** | **--** | **--** | **--** | **--** | **--** | **--** |
|
| 135 |
+
| **DAC** | 1500 | 75 | 2 | 0.48 / 0.41 | 0.83 / 0.79 | 1.87 / 1.67 | 1.48 / 1.37 | -- / -- | -- / -- |
|
| 136 |
+
| **Encodec** | 1500 | 75 | 2 | 0.60 / 0.45 | 0.85 / 0.81 | 1.94 / 1.80 | 1.56 / 1.48 | 1.12 / 1.04 | 2.60 / 2.42 |
|
| 137 |
+
| **Higgs Audio Tokenizer** | 2000 | 25 | 8 | 0.90 / 0.83 | 0.85 / 0.85 | 3.59 / 3.22 | 3.11 / 2.73 | 0.74 / 0.70 | 2.07 / 1.92 |
|
| 138 |
+
| **SpeechTokenizer** | 2000 | 50 | 4 | 0.66 / 0.50 | 0.88 / 0.80 | 2.38 / 1.79 | 1.92 / 1.49 | -- / -- | -- / -- |
|
| 139 |
+
| **Qwen3 TTS Tokenizer** | 2200 | 12.5 | 16 | **0.95** / 0.88 | **0.96** / 0.93 | 3.66 / 3.10 | 3.19 / 2.62 | -- / -- | -- / -- |
|
| 140 |
+
| **MiMo Audio Tokenizer** | 2250 | 25 | 12 | 0.89 / 0.83 | 0.95 / 0.92 | 3.57 / 3.25 | 3.05 / 2.71 | **0.70** / **0.68** | 2.21 / 2.10 |
|
| 141 |
+
| **Mimi** | 2475 | 12.5 | 18 | 0.89 / 0.76 | 0.94 / 0.91 | 3.49 / 2.90 | 2.97 / 2.35 | 1.10 / 1.06 | 2.45 / 2.32 |
|
| 142 |
+
| **MOSS Audio Tokenizer (Ours)** | 1500 | 12.5 | 12 | 0.92 / 0.86 | 0.95 / 0.93 | 3.64 / 3.27 | 3.20 / 2.74 | 0.77 / 0.74 | 2.08 / 1.96 |
|
| 143 |
+
| **MOSS Audio Tokenizer (Ours)** | 2000 | 12.5 | 16 | **0.95** / **0.89** | **0.96** / **0.94** | **3.78** / **3.46** | **3.41** / **2.96** | 0.73 / 0.70 | **2.03** / **1.90** |
|
| 144 |
+
| **--** | **--** | **--** | **--** | **--** | **--** | **--** | **--** | **--** | **--** |
|
| 145 |
+
| **DAC** | 3000 | 75 | 4 | 0.74 / 0.67 | 0.90 / 0.88 | 2.76 / 2.47 | 2.31 / 2.07 | 0.86 / 0.83 | 2.23 / 2.10 |
|
| 146 |
+
| **MiMo Audio Tokenizer** | 3650 | 25 | 20 | 0.91 / 0.85 | 0.95 / 0.93 | 3.73 / 3.44 | 3.25 / 2.89 | 0.66 / 0.65 | 2.17 / 2.06 |
|
| 147 |
+
| **SpeechTokenizer** | 4000 | 50 | 8 | 0.85 / 0.69 | 0.92 / 0.85 | 3.05 / 2.20 | 2.60 / 1.87 | -- / -- | -- / -- |
|
| 148 |
+
| **Mimi** | 4400 | 12.5 | 32 | 0.94 / 0.83 | 0.96 / 0.94 | 3.80 / 3.31 | 3.43 / 2.78 | 1.02 / 0.98 | 2.34 / 2.21 |
|
| 149 |
+
| **Encodec** | 4500 | 75 | 6 | 0.86 / 0.75 | 0.92 / 0.91 | 2.91 / 2.63 | 2.46 / 2.15 | 0.91 / 0.84 | 2.33 / 2.17 |
|
| 150 |
+
| **DAC** | 6000 | 75 | 8 | 0.89 / 0.84 | 0.95 / 0.94 | 3.75 / 3.57 | 3.41 / 3.20 | **0.65** / **0.63** | 1.97 / 1.87 |
|
| 151 |
+
| **MOSS Audio Tokenizer (Ours)** | 3000 | 12.5 | 24 | 0.96 / 0.92 | **0.97** / **0.96** | 3.90 / 3.64 | 3.61 / 3.20 | 0.69 / 0.66 | 1.98 / 1.84 |
|
| 152 |
+
| **MOSS Audio Tokenizer (Ours)** | 4000 | 12.5 | 32 | **0.97** / **0.93** | **0.97** / **0.96** | **3.95** / **3.71** | **3.69** / **3.30** | 0.68 / 0.64 | **1.96** / **1.82** |
|
| 153 |
+
|
| 154 |
+
### LibriSpeech Speech Metrics (MOSS Audio Tokenizer vs. Open-source Tokenizers)
|
| 155 |
+
|
| 156 |
+
The plots below compare our MOSS Audio Tokenizer model with other open-source speech tokenizers on the LibriSpeech dataset, evaluated with SIM, STOI, PESQ-NB, and PESQ-WB (higher is better).
|
| 157 |
+
We control the bps of the same model by adjusting the number of RVQ codebooks used during inference.
|
| 158 |
+
|
| 159 |
+
<table>
|
| 160 |
+
<tr>
|
| 161 |
+
<td align="center"><b>SIM</b><br><img src="images/sim.png" width="100%"></td>
|
| 162 |
+
<td align="center"><b>STOI</b><br><img src="images/stoi.png" width="100%"></td>
|
| 163 |
+
</tr>
|
| 164 |
+
<tr>
|
| 165 |
+
<td align="center"><b>PESQ-NB</b><br><img src="images/pesq-nb.png" width="100%"></td>
|
| 166 |
+
<td align="center"><b>PESQ-WB</b><br><img src="images/pesq-wb.png" width="100%"></td>
|
| 167 |
+
</tr>
|
| 168 |
+
</table>
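The relationship between bitrate and the number of RVQ codebooks can be sketched with simple arithmetic. The MOSS Audio Tokenizer rows in the table above are consistent with 10 bits per code (for example, 750 bps / (12.5 Hz × 6 codebooks) = 10), which would correspond to a codebook size of 1024. That figure is inferred from the table rather than stated in the README, and `bitrate_bps` is a hypothetical helper name.

```python
def bitrate_bps(n_vq: int, frame_rate_hz: float = 12.5, bits_per_code: int = 10) -> float:
    """Bits per second when decoding with the first `n_vq` RVQ codebooks.

    Assumption: every codebook contributes `bits_per_code` bits per frame
    (10 bits is inferred from the table above, not stated by the authors).
    """
    return frame_rate_hz * n_vq * bits_per_code


# Reproduce the bps column for the MOSS Audio Tokenizer rows.
for n_vq in (6, 8, 12, 16, 24, 32):
    print(f"N_VQ={n_vq:2d} -> {bitrate_bps(n_vq):.0f} bps")
```

Under these assumptions the six settings map to 750, 1000, 1500, 2000, 3000, and 4000 bps, matching the table.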
|
| 169 |
+
|
| 170 |
+
|
| 171 |
+
## Citation
|
| 172 |
+
If you use this code or these results in your paper, please cite our work as:
|
| 173 |
+
```tex
|
| 174 |
+
|
| 175 |
+
```
|
images/arch.png
ADDED
|
Git LFS Details
|
images/pesq-nb.png
ADDED
|
Git LFS Details
|
images/pesq-wb.png
ADDED
|
Git LFS Details
|
images/sim.png
ADDED
|
Git LFS Details
|
images/stoi.png
ADDED
|
Git LFS Details
|