cms42 committed on
Commit 8d409dd · verified · Parent(s): e764a41

Upload model weights
.gitattributes CHANGED
@@ -40,3 +40,5 @@ tokenizer.json filter=lfs diff=lfs merge=lfs -text
 *.npy filter=lfs diff=lfs merge=lfs -text
 *.onnx filter=lfs diff=lfs merge=lfs -text
 *.data filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
+
README.md ADDED
@@ -0,0 +1,165 @@
---
license: apache-2.0
library_name: transformers
tags:
- audio
- audio-tokenizer
- neural-codec
- moss-tts-family
- MOSS Audio Tokenizer
- speech-tokenizer
- trust-remote-code
---

# MossAudioTokenizer

This is the code for MOSS-Audio-Tokenizer, presented in [MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models](https://arxiv.org/abs/2602.10934).

**MOSSAudioTokenizer** is a discrete audio tokenizer built on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture. Scaled to 1.6 billion parameters, it serves as a unified discrete interface that delivers both lossless-quality reconstruction and high-level semantic alignment.

**Key Features:**

* **Extreme Compression & Variable Bitrate**: Compresses 24kHz raw audio down to a frame rate of 12.5Hz. With a 32-layer Residual Vector Quantizer (RVQ), it supports high-fidelity reconstruction across a wide range of bitrates, from 0.125kbps to 4kbps.
* **Pure Transformer Architecture**: A "CNN-free" homogeneous architecture built entirely from causal Transformer blocks. With 1.6B combined parameters (encoder + decoder), it scales well and supports low-latency streaming inference.
* **Large-Scale General Audio Training**: Trained on 3 million hours of diverse audio data, the model encodes and reconstructs all audio domains, including speech, sound effects, and music.
* **Unified Semantic-Acoustic Representation**: While achieving state-of-the-art reconstruction quality, Cat produces semantically rich discrete tokens, making them well suited to downstream tasks such as speech understanding (ASR) and generation (TTS).
* **Fully Trained From Scratch**: Cat does not rely on pretrained encoders (such as HuBERT or Whisper) or distillation from teacher models; all representations are learned from raw data.
* **End-to-End Joint Optimization**: All components, including the encoder, quantizer, decoder, discriminator, and a decoder-only LLM for semantic alignment, are optimized jointly in a single unified training pipeline.

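The quoted bitrate range follows directly from the frame rate and the number of RVQ layers in use. As a sanity check, assuming each codebook holds 1024 entries (10 bits per code; this is inferred from the quoted numbers, not a value stated here):

```python
# Back-of-the-envelope bitrate check for the numbers quoted above.
# Assumption: every RVQ layer uses a 1024-entry codebook (10 bits per
# code); this is inferred from the 0.125-4 kbps range, not stated here.
FRAME_RATE_HZ = 12.5
BITS_PER_CODE = 10  # log2(1024)

def bitrate_bps(num_quantizers: int) -> float:
    """Bits/s when only the first `num_quantizers` RVQ layers are kept."""
    return FRAME_RATE_HZ * BITS_PER_CODE * num_quantizers

print(bitrate_bps(1) / 1000)   # 0.125 kbps with a single codebook
print(bitrate_bps(32) / 1000)  # 4.0 kbps with all 32 codebooks
```

Dropping trailing RVQ layers therefore trades fidelity for bitrate in 0.125kbps steps, which is how the variable-bitrate operating points in the evaluation below are obtained.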
**Summary:**
By combining a simple, scalable architecture with massive-scale data, the Cat architecture overcomes the bottlenecks of traditional audio tokenizers. It provides a robust, high-fidelity, and semantically grounded interface for the next generation of native audio foundation models.

This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers `transformers.models.moss_audio_tokenizer` module. It is intended to be uploaded to a Hugging Face Hub model repository and loaded with `trust_remote_code=True` when needed.

<br>
<p align="center">
    <img src="images/arch.png" width="95%"> <br>
    Architecture of MossAudioTokenizer
</p>
<br>

## Usage

### Quickstart

```python
import torch
import torchaudio
from transformers import AutoModel

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

wav, sr = torchaudio.load("demo/demo_gt.wav")
if sr != model.sampling_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sampling_rate)
wav = wav.unsqueeze(0)  # (batch, channels, samples)

with torch.inference_mode():
    enc = model.encode(wav, return_dict=True)
    print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")
    dec = model.decode(enc.audio_codes, return_dict=True)
    print(f"dec.audio.shape: {dec.audio.shape}")

    # Decode using only the first 8 layers of the RVQ
    # (audio_codes is (batch, num_quantizers, frames))
    dec_rvq8 = model.decode(enc.audio_codes[:, :8], return_dict=True)

torchaudio.save("demo/demo_rec.wav", dec.audio.squeeze(0), sample_rate=model.sampling_rate)
torchaudio.save("demo/demo_rec_rvq8.wav", dec_rvq8.audio.squeeze(0), sample_rate=model.sampling_rate)
```

### Streaming

`MossAudioTokenizerModel.encode` and `MossAudioTokenizerModel.decode` support simple streaming via a `chunk_duration` argument.

- `chunk_duration` is expressed in seconds.
- It must be <= `MossAudioTokenizerConfig.causal_transformer_context_duration`.
- `chunk_duration * MossAudioTokenizerConfig.sampling_rate` must be divisible by `MossAudioTokenizerConfig.downsample_rate`.
- Streaming chunking only supports `batch_size=1`.

```python
import torch
from transformers import AutoModel

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
audio = torch.randn(1, 1, 3200)  # dummy waveform

# 0.08s @ 24kHz = 1920 samples, divisible by downsample_rate=1920
enc = model.encode(audio, return_dict=True, chunk_duration=0.08)
dec = model.decode(enc.audio_codes, return_dict=True, chunk_duration=0.08)
```
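A small helper can make the constraints above explicit before calling `encode`/`decode`. The function below is a hypothetical sketch, not part of this repository; the defaults mirror the 24kHz sampling rate and `downsample_rate=1920` mentioned above, while the 10-second context duration is a placeholder (read the real value from `MossAudioTokenizerConfig.causal_transformer_context_duration`).

```python
# Hypothetical validator for the chunk_duration rules listed above.
# Defaults mirror sampling_rate=24000 and downsample_rate=1920 from this
# README; context_duration=10.0 is a placeholder assumption.
def validate_chunk_duration(
    chunk_duration: float,
    sampling_rate: int = 24000,
    downsample_rate: int = 1920,
    context_duration: float = 10.0,
) -> None:
    if chunk_duration > context_duration:
        raise ValueError(
            f"chunk_duration={chunk_duration}s exceeds the causal "
            f"Transformer context of {context_duration}s"
        )
    # Round to sidestep float error (e.g. 0.08 * 24000 != 1920.0 exactly).
    samples = round(chunk_duration * sampling_rate)
    if samples % downsample_rate != 0:
        raise ValueError(
            f"{chunk_duration}s * {sampling_rate}Hz = {samples} samples, "
            f"not a multiple of downsample_rate={downsample_rate}"
        )

validate_chunk_duration(0.08)  # ok: 1920 samples, exactly one frame
```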

## Repository layout

- `configuration_moss_audio_tokenizer.py`
- `modeling_moss_audio_tokenizer.py`
- `__init__.py`
- `config.json`
- model weights

## Evaluation Metrics

The table below compares the reconstruction quality of open-source audio tokenizers with MossAudioTokenizer on speech and audio/music data.

- Speech metrics are evaluated on LibriSpeech test-clean (English) and AISHELL-2 (Chinese), reported as EN/ZH.
- Audio metrics are evaluated on the AudioSet evaluation subset; music metrics are evaluated on MUSDB, reported as audio/music.
- STFT-Dist. denotes the STFT distance.
- Higher is better for the speech metrics; lower is better for the audio/music metrics (Mel-Loss, STFT-Dist.).
- Nq denotes the number of quantizers.

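For reference, STFT distance is commonly computed as a distance between log-magnitude spectrograms of the reference and the reconstruction. The pure-PyTorch sketch below shows one common variant; the FFT size, hop length, and L1 norm are illustrative assumptions, not necessarily the settings used for the table.

```python
import torch

# One common form of STFT distance: mean L1 distance between
# log-magnitude spectrograms. n_fft/hop_length and the L1 norm are
# illustrative assumptions, not the settings used for the table below.
def stft_distance(ref: torch.Tensor, rec: torch.Tensor,
                  n_fft: int = 1024, hop_length: int = 256) -> torch.Tensor:
    window = torch.hann_window(n_fft)

    def log_mag(x: torch.Tensor) -> torch.Tensor:
        spec = torch.stft(x, n_fft, hop_length, window=window,
                          return_complex=True)
        return torch.log(spec.abs() + 1e-5)

    return (log_mag(ref) - log_mag(rec)).abs().mean()

wave = torch.randn(24000)                       # 1s of noise at 24kHz
assert stft_distance(wave, wave).item() == 0.0  # identical signals
```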
| Model | bps | Frame rate | Nq | Speech: SIM ↑ (EN/ZH) | Speech: STOI ↑ (EN/ZH) | Speech: PESQ-NB ↑ (EN/ZH) | Speech: PESQ-WB ↑ (EN/ZH) | Audio/Music: Mel-Loss ↓ | Audio/Music: STFT-Dist. ↓ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **XCodec2.0** | 800 | 50 | 1 | 0.82 / 0.74 | 0.92 / 0.86 | 3.04 / 2.46 | 2.43 / 1.96 | -- / -- | -- / -- |
| **MiMo Audio Tokenizer** | 850 | 25 | 4 | 0.80 / 0.74 | 0.91 / 0.87 | 2.94 / 2.62 | 2.39 / 2.14 | **0.82** / 0.81 | 2.33 / 2.23 |
| **Higgs Audio Tokenizer** | 1000 | 25 | 4 | 0.77 / 0.68 | 0.83 / 0.82 | 3.03 / 2.61 | 2.48 / 2.14 | 0.83 / **0.80** | 2.20 / 2.05 |
| **SpeechTokenizer** | 1000 | 50 | 2 | 0.36 / 0.25 | 0.77 / 0.68 | 1.59 / 1.38 | 1.25 / 1.17 | -- / -- | -- / -- |
| **XY-Tokenizer** | 1000 | 12.5 | 8 | 0.85 / 0.79 | 0.92 / 0.87 | 3.10 / 2.63 | 2.50 / 2.12 | -- / -- | -- / -- |
| **BigCodec** | 1040 | 80 | 1 | 0.84 / 0.69 | 0.93 / 0.88 | 3.27 / 2.55 | 2.68 / 2.06 | -- / -- | -- / -- |
| **Mimi** | 1100 | 12.5 | 8 | 0.74 / 0.59 | 0.91 / 0.85 | 2.80 / 2.24 | 2.25 / 1.78 | 1.24 / 1.19 | 2.62 / 2.49 |
| **MOSS Audio Tokenizer (Ours)** | 750 | 12.5 | 6 | 0.82 / 0.75 | 0.93 / 0.89 | 3.14 / 2.73 | 2.60 / 2.22 | 0.86 / 0.85 | 2.21 / 2.10 |
| **MOSS Audio Tokenizer (Ours)** | 1000 | 12.5 | 8 | **0.88** / **0.81** | **0.94** / **0.91** | **3.38** / **2.96** | **2.87** / **2.43** | **0.82** / **0.80** | **2.16** / **2.04** |
| **—** | **—** | **—** | **—** | **—** | **—** | **—** | **—** | **—** | **—** |
| **DAC** | 1500 | 75 | 2 | 0.48 / 0.41 | 0.83 / 0.79 | 1.87 / 1.67 | 1.48 / 1.37 | -- / -- | -- / -- |
| **Encodec** | 1500 | 75 | 2 | 0.60 / 0.45 | 0.85 / 0.81 | 1.94 / 1.80 | 1.56 / 1.48 | 1.12 / 1.04 | 2.60 / 2.42 |
| **Higgs Audio Tokenizer** | 2000 | 25 | 8 | 0.90 / 0.83 | 0.85 / 0.85 | 3.59 / 3.22 | 3.11 / 2.73 | 0.74 / 0.70 | 2.07 / 1.92 |
| **SpeechTokenizer** | 2000 | 50 | 4 | 0.66 / 0.50 | 0.88 / 0.80 | 2.38 / 1.79 | 1.92 / 1.49 | -- / -- | -- / -- |
| **Qwen3 TTS Tokenizer** | 2200 | 12.5 | 16 | **0.95** / 0.88 | **0.96** / 0.93 | 3.66 / 3.10 | 3.19 / 2.62 | -- / -- | -- / -- |
| **MiMo Audio Tokenizer** | 2250 | 25 | 12 | 0.89 / 0.83 | 0.95 / 0.92 | 3.57 / 3.25 | 3.05 / 2.71 | **0.70** / **0.68** | 2.21 / 2.10 |
| **Mimi** | 2475 | 12.5 | 18 | 0.89 / 0.76 | 0.94 / 0.91 | 3.49 / 2.90 | 2.97 / 2.35 | 1.10 / 1.06 | 2.45 / 2.32 |
| **MOSS Audio Tokenizer (Ours)** | 1500 | 12.5 | 12 | 0.92 / 0.86 | 0.95 / 0.93 | 3.64 / 3.27 | 3.20 / 2.74 | 0.77 / 0.74 | 2.08 / 1.96 |
| **MOSS Audio Tokenizer (Ours)** | 2000 | 12.5 | 16 | **0.95** / **0.89** | **0.96** / **0.94** | **3.78** / **3.46** | **3.41** / **2.96** | 0.73 / 0.70 | **2.03** / **1.90** |
| **—** | **—** | **—** | **—** | **—** | **—** | **—** | **—** | **—** | **—** |
| **DAC** | 3000 | 75 | 4 | 0.74 / 0.67 | 0.90 / 0.88 | 2.76 / 2.47 | 2.31 / 2.07 | 0.86 / 0.83 | 2.23 / 2.10 |
| **MiMo Audio Tokenizer** | 3650 | 25 | 20 | 0.91 / 0.85 | 0.95 / 0.93 | 3.73 / 3.44 | 3.25 / 2.89 | 0.66 / 0.65 | 2.17 / 2.06 |
| **SpeechTokenizer** | 4000 | 50 | 8 | 0.85 / 0.69 | 0.92 / 0.85 | 3.05 / 2.20 | 2.60 / 1.87 | -- / -- | -- / -- |
| **Mimi** | 4400 | 12.5 | 32 | 0.94 / 0.83 | 0.96 / 0.94 | 3.80 / 3.31 | 3.43 / 2.78 | 1.02 / 0.98 | 2.34 / 2.21 |
| **Encodec** | 4500 | 75 | 6 | 0.86 / 0.75 | 0.92 / 0.91 | 2.91 / 2.63 | 2.46 / 2.15 | 0.91 / 0.84 | 2.33 / 2.17 |
| **DAC** | 6000 | 75 | 8 | 0.89 / 0.84 | 0.95 / 0.94 | 3.75 / 3.57 | 3.41 / 3.20 | **0.65** / **0.63** | 1.97 / 1.87 |
| **MOSS Audio Tokenizer (Ours)** | 3000 | 12.5 | 24 | 0.96 / 0.92 | **0.97** / **0.96** | 3.90 / 3.64 | 3.61 / 3.20 | 0.69 / 0.66 | 1.98 / 1.84 |
| **MOSS Audio Tokenizer (Ours)** | 4000 | 12.5 | 32 | **0.97** / **0.93** | **0.97** / **0.96** | **3.95** / **3.71** | **3.69** / **3.30** | 0.68 / 0.64 | **1.96** / **1.82** |

### LibriSpeech Speech Metrics (MOSS Audio Tokenizer vs. Open-source Tokenizers)

The plots below compare our MOSS Audio Tokenizer with other open-source speech tokenizers on the LibriSpeech dataset, evaluated with SIM, STOI, PESQ-NB, and PESQ-WB (higher is better). We control the bps of a single model by varying the number of RVQ codebooks used at inference.

<table>
  <tr>
    <td align="center"><b>SIM</b><br><img src="images/sim.png" width="100%"></td>
    <td align="center"><b>STOI</b><br><img src="images/stoi.png" width="100%"></td>
  </tr>
  <tr>
    <td align="center"><b>PESQ-NB</b><br><img src="images/pesq-nb.png" width="100%"></td>
    <td align="center"><b>PESQ-WB</b><br><img src="images/pesq-wb.png" width="100%"></td>
  </tr>
</table>

## Citation

If you use this code or our results in your paper, please cite our work as:

```tex

```
decoder.data ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e5680b64d283e68fd9a7cc4074ddcd7f65a7c89e460ec6f74db379920e2cbb3e
+size 7098576524
decoder.onnx ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f90ac5129d47c9b571fc2521eff6c9948fb9fdf2f27ff077414286638f453aa8
+size 13900498
encoder.data ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:79b0eccd392bc08243a3e40cd4a826556f1a46a4f82973efc2b01df9f4da2eff
+size 7101884132
encoder.onnx ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b22b3a8e4cc7a2fda4a50cb93f4480188f3a7c18b275aa72d4b1ac9283cb0faa
+size 1469716
images/arch.png ADDED

Git LFS Details

  • SHA256: 6a7108429fc230ead608a22837a27615aae3458f86e5e47d12fd3bd5d95c7058
  • Pointer size: 131 Bytes
  • Size of remote file: 215 kB
images/pesq-nb.png ADDED

Git LFS Details

  • SHA256: 107c2f8fb91247264497b50f0bb42f34390ce4593eed28ce4dfea47b88a36797
  • Pointer size: 131 Bytes
  • Size of remote file: 743 kB
images/pesq-wb.png ADDED

Git LFS Details

  • SHA256: ae7b264b13e8570c292d8843ce83fcf21e96360e2900c8e19a8ff837be3c90fd
  • Pointer size: 131 Bytes
  • Size of remote file: 512 kB
images/sim.png ADDED

Git LFS Details

  • SHA256: 79c6340a32e1229ab89fdc71447aa0addf91f146d250b1c3416707c5aacee75d
  • Pointer size: 131 Bytes
  • Size of remote file: 493 kB
images/stoi.png ADDED

Git LFS Details

  • SHA256: 86d2a427d213fbd8913052d0d16e87c2384a6643f26458314c8771b95eab89c4
  • Pointer size: 131 Bytes
  • Size of remote file: 440 kB