lucasnewman committed on
Commit ce4120e · verified
1 Parent(s): 13f41ac

Update README.md

  1. README.md +203 -195
README.md CHANGED
@@ -1,195 +1,203 @@
- ---
- license: apache-2.0
- library_name: transformers
- tags:
- - audio
- - audio-tokenizer
- - neural-codec
- - moss-tts-family
- - MOSS Audio Tokenizer
- - speech-tokenizer
- - trust-remote-code
- ---
-
- # MossAudioTokenizer
-
- This is the code for MOSS-Audio-Tokenizer presented in [MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models](https://arxiv.org/abs/2602.10934).
-
- **MOSSAudioTokenizer** is a unified discrete audio tokenizer based on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture. Scaling to 1.6 billion parameters, it functions as a unified discrete interface, delivering both lossless-quality reconstruction and high-level semantic alignment.
-
- **Key Features:**
-
- * **Extreme Compression & Variable Bitrate**: It compresses 48kHz stereo audio into a remarkably low frame rate of 12.5Hz. Utilizing a 32-layer Residual LFQ quantizer stack, it supports high-fidelity reconstruction across a wide range of bitrates.
- * **Pure Transformer Architecture**: The model features a "CNN-free" homogeneous architecture built entirely from Causal Transformer blocks. With 1.6B combined parameters (Encoder + Decoder), it ensures exceptional scalability and supports low-latency streaming inference.
- * **Large-Scale General Audio Training**: Trained on 3 million hours of diverse audio data, the model excels at encoding and reconstructing all audio domains, including speech, sound effects, and music.
- * **Unified Semantic-Acoustic Representation**: While achieving state-of-the-art reconstruction quality, Cat produces discrete tokens that are "semantic-rich," making them ideal for downstream tasks like speech understanding (ASR) and generation (TTS).
- * **Fully Trained From Scratch**: Cat does not rely on any pretrained encoders (such as HuBERT or Whisper) or distillation from teacher models. All representations are learned autonomously from raw data.
- * **End-to-End Joint Optimization**: All components—including the encoder, quantizer, decoder, discriminator, and a decoder-only LLM for semantic alignment—are optimized jointly in a single unified training pipeline.
-
- **Summary:**
- By combining a simple, scalable architecture with massive-scale data, the Cat architecture overcomes the bottlenecks of traditional audio tokenizers. It provides a robust, high-fidelity, and semantically grounded interface for the next generation of native audio foundation models.
-
- This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers
- `transformers.models.moss_audio_tokenizer` module. It is intended to be uploaded to a Hugging Face Hub model repository
- and loaded with `trust_remote_code=True` when needed.
-
-
- ## Usage
-
- ### Quickstart
-
- ```python
- import torch
- from transformers import AutoModel
- import torchaudio
-
- repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
- model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
-
- wav, sr = torchaudio.load('demo/demo_gt.wav')
- if sr != model.sampling_rate:
-     wav = torchaudio.functional.resample(wav, sr, model.sampling_rate)
- if wav.shape[0] == 1:
-     wav = wav.repeat(model.config.number_channels, 1)
- else:
-     wav = wav[: model.config.number_channels]
- wav = wav.unsqueeze(0)
- enc = model.encode(wav, return_dict=True)
- print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")
- dec = model.decode(enc.audio_codes, return_dict=True)
- print(f"dec.audio.shape: {dec.audio.shape}")
- wav = dec.audio.squeeze(0)
- torchaudio.save("demo/demo_rec.wav", wav, sample_rate=model.sampling_rate)
-
- # Decode using only the first 8 layers of the RVQ
- dec_rvq8 = model.decode(enc.audio_codes[:8], return_dict=True)
- wav_rvq8 = dec_rvq8.audio.squeeze(0)
- torchaudio.save("demo/demo_rec_rvq8.wav", wav_rvq8, sample_rate=model.sampling_rate)
- ```
-
- ### Attention Backend And Compute Dtype
-
- `config.attention_implementation` controls whether transformer layers prefer `sdpa` or `flash_attention_2`.
- `config.compute_dtype` controls the non-quantizer autocast dtype and supports `fp32`, `bf16`, and `fp16`.
-
- ```python
- model.set_attention_implementation("flash_attention_2")
- model.set_compute_dtype("fp16")
- ```
-
- The quantizer always runs in fp32.
-
- ### Streaming
-
- `MossAudioTokenizerModel.encode`, `decode`, `batch_encode`, and `batch_decode` all support streaming through a
- `chunk_duration` argument.
-
- - `chunk_duration` is expressed in seconds.
- - `chunk_duration * MossAudioTokenizerConfig.sampling_rate` must be divisible by `MossAudioTokenizerConfig.downsample_rate`.
- - Streaming batch inference is supported.
- - The public waveform interface expects stereo inputs shaped `(2, T)` or batched stereo inputs shaped `(B, 2, T)`.
-
- ```python
- import torch
- from transformers import AutoModel
-
- repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
- model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
- audio = torch.randn(2, 48000 * 6) # dummy stereo waveform
-
- # 6.0s @ 48kHz = 288000 samples, divisible by downsample_rate=3840
- enc = model.encode(audio.unsqueeze(0), return_dict=True, chunk_duration=0.08)
- dec = model.decode(enc.audio_codes, return_dict=True, chunk_duration=0.08)
-
- batch_enc = model.batch_encode([audio, audio[:, : 48000 * 3]], chunk_duration=0.08)
- codes_list = [
-     batch_enc.audio_codes[:, i, : batch_enc.audio_codes_lengths[i]]
-     for i in range(batch_enc.audio_codes.shape[1])
- ]
- batch_dec = model.batch_decode(codes_list, chunk_duration=0.08)
- ```
-
- #### Continuous Batch Streaming Decode
-
- For decoder-side continuous batching, prefer `batch_decode(..., streaming=True, ...)`.
-
- - The first streaming call may pass `max_batch_size=...`. If it is omitted, the first batch size reserves the
-   fixed-slot decoder budget for that public stream.
- - Same-size calls continue the existing logical rows in-order.
- - If a later call is larger, the new rows are admitted by tail append.
- - `finalize_indices` means "decode these rows one last time, then evict them". The indices are interpreted against the
-   pre-call logical order.
- - After a finalize call returns, the next streaming call may use the smaller survivor batch.
- - `reset_stream=True` discards the hidden public streaming state and starts a fresh stream.
-
- Milestone 1 boundaries:
-
- - decode-only continuous batching
- - one active streaming decode state per model instance
- - fixed-slot decoder reservation from `max_batch_size`
- - no encode-side continuous batching
- - no physical compaction of surviving decode slots
- - no multi-session concurrency on one model instance
-
- ```python
- import torch
- from transformers import AutoModel
-
- repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
- model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
- num_quantizers = model.config.quantizer_kwargs["num_quantizers"]
-
- codes_a0 = torch.randint(0, 8, (num_quantizers, 2))
- codes_b0 = torch.randint(0, 8, (num_quantizers, 3))
- codes_a1 = torch.randint(0, 8, (num_quantizers, 2))
- codes_b1 = torch.randint(0, 8, (num_quantizers, 2))
- codes_c0 = torch.randint(0, 8, (num_quantizers, 1))
- codes_a2 = torch.randint(0, 8, (num_quantizers, 1))
- codes_b2 = torch.randint(0, 8, (num_quantizers, 2))
- codes_c1 = torch.randint(0, 8, (num_quantizers, 2))
- codes_b3 = torch.randint(0, 8, (num_quantizers, 1))
- codes_c2 = torch.randint(0, 8, (num_quantizers, 1))
-
- # First call reserves 3 fixed decoder slots for A and B.
- out_ab0 = model.batch_decode(
-     [codes_a0, codes_b0],
-     streaming=True,
-     max_batch_size=3,
-     reset_stream=True,
- )
-
- # Same logical rows continue in-order; C is a tail append.
- out_abc1 = model.batch_decode(
-     [codes_a1, codes_b1, codes_c0],
-     streaming=True,
- )
-
- # Finalize A against the pre-call logical order. A still decodes in this call,
- # then is evicted immediately afterward.
- out_abc2 = model.batch_decode(
-     [codes_a2, codes_b2, codes_c1],
-     streaming=True,
-     finalize_indices=[0],
- )
-
- # The next call can shrink to the surviving logical rows only.
- out_bc3 = model.batch_decode(
-     [codes_b3, codes_c2],
-     streaming=True,
- )
- ```
-
- ## Repository layout
-
- - `configuration_moss_audio_tokenizer.py`
- - `modeling_moss_audio_tokenizer.py`
- - `__init__.py`
- - `config.json`
- - model weights
-
-
- ## Citation
- If you use this code or result in your paper, please cite our work as:
- ```tex
-
- ```
+ ---
+ license: apache-2.0
+ library_name: transformers
+ tags:
+ - audio
+ - audio-tokenizer
+ - neural-codec
+ - moss-tts-family
+ - MOSS Audio Tokenizer
+ - speech-tokenizer
+ - mlx
+ - mlx-audio
+ base_model: OpenMOSS-Team/MOSS-Audio-Tokenizer
+ ---
+
+ # mlx-community/MOSS-Audio-Tokenizer
+
+ This model was converted to MLX format from [`OpenMOSS-Team/MOSS-Audio-Tokenizer`](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer) using mlx-audio version **0.4.0**.
+
+ Refer to the [original model card](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer) for more details on the model.
+
+ # MossAudioTokenizer
+
+ This is the code for MOSS-Audio-Tokenizer presented in [MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models](https://arxiv.org/abs/2602.10934).
+
+ **MOSSAudioTokenizer** is a unified discrete audio tokenizer based on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture. Scaling to 1.6 billion parameters, it functions as a unified discrete interface, delivering both lossless-quality reconstruction and high-level semantic alignment.
+
+ **Key Features:**
+
+ * **Extreme Compression & Variable Bitrate**: It compresses 48kHz stereo audio into a remarkably low frame rate of 12.5Hz. Utilizing a 32-layer Residual LFQ quantizer stack, it supports high-fidelity reconstruction across a wide range of bitrates.
+ * **Pure Transformer Architecture**: The model features a "CNN-free" homogeneous architecture built entirely from Causal Transformer blocks. With 1.6B combined parameters (Encoder + Decoder), it ensures exceptional scalability and supports low-latency streaming inference.
+ * **Large-Scale General Audio Training**: Trained on 3 million hours of diverse audio data, the model excels at encoding and reconstructing all audio domains, including speech, sound effects, and music.
+ * **Unified Semantic-Acoustic Representation**: While achieving state-of-the-art reconstruction quality, Cat produces discrete tokens that are "semantic-rich," making them ideal for downstream tasks like speech understanding (ASR) and generation (TTS).
+ * **Fully Trained From Scratch**: Cat does not rely on any pretrained encoders (such as HuBERT or Whisper) or distillation from teacher models. All representations are learned autonomously from raw data.
+ * **End-to-End Joint Optimization**: All components—including the encoder, quantizer, decoder, discriminator, and a decoder-only LLM for semantic alignment—are optimized jointly in a single unified training pipeline.
+
+ **Summary:**
+ By combining a simple, scalable architecture with massive-scale data, the Cat architecture overcomes the bottlenecks of traditional audio tokenizers. It provides a robust, high-fidelity, and semantically grounded interface for the next generation of native audio foundation models.
+
+ This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers
+ `transformers.models.moss_audio_tokenizer` module. It is intended to be uploaded to a Hugging Face Hub model repository
+ and loaded with `trust_remote_code=True` when needed.
+
+
+ ## Usage
+
+ ### Quickstart
+
+ ```python
+ import torch
+ from transformers import AutoModel
+ import torchaudio
+
+ repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
+ model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
+
+ wav, sr = torchaudio.load('demo/demo_gt.wav')
+ if sr != model.sampling_rate:
+     wav = torchaudio.functional.resample(wav, sr, model.sampling_rate)
+ if wav.shape[0] == 1:
+     wav = wav.repeat(model.config.number_channels, 1)
+ else:
+     wav = wav[: model.config.number_channels]
+ wav = wav.unsqueeze(0)
+ enc = model.encode(wav, return_dict=True)
+ print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")
+ dec = model.decode(enc.audio_codes, return_dict=True)
+ print(f"dec.audio.shape: {dec.audio.shape}")
+ wav = dec.audio.squeeze(0)
+ torchaudio.save("demo/demo_rec.wav", wav, sample_rate=model.sampling_rate)
+
+ # Decode using only the first 8 layers of the RVQ
+ dec_rvq8 = model.decode(enc.audio_codes[:8], return_dict=True)
+ wav_rvq8 = dec_rvq8.audio.squeeze(0)
+ torchaudio.save("demo/demo_rec_rvq8.wav", wav_rvq8, sample_rate=model.sampling_rate)
+ ```
+
+ ### Attention Backend And Compute Dtype
+
+ `config.attention_implementation` controls whether transformer layers prefer `sdpa` or `flash_attention_2`.
+ `config.compute_dtype` controls the non-quantizer autocast dtype and supports `fp32`, `bf16`, and `fp16`.
+
+ ```python
+ model.set_attention_implementation("flash_attention_2")
+ model.set_compute_dtype("fp16")
+ ```
+
+ The quantizer always runs in fp32.
+
+ ### Streaming
+
+ `MossAudioTokenizerModel.encode`, `decode`, `batch_encode`, and `batch_decode` all support streaming through a
+ `chunk_duration` argument.
+
+ - `chunk_duration` is expressed in seconds.
+ - `chunk_duration * MossAudioTokenizerConfig.sampling_rate` must be divisible by `MossAudioTokenizerConfig.downsample_rate`.
+ - Streaming batch inference is supported.
+ - The public waveform interface expects stereo inputs shaped `(2, T)` or batched stereo inputs shaped `(B, 2, T)`.
+
+ ```python
+ import torch
+ from transformers import AutoModel
+
+ repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
+ model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
+ audio = torch.randn(2, 48000 * 6) # dummy stereo waveform
+
+ # 6.0s @ 48kHz = 288000 samples, divisible by downsample_rate=3840
+ enc = model.encode(audio.unsqueeze(0), return_dict=True, chunk_duration=0.08)
+ dec = model.decode(enc.audio_codes, return_dict=True, chunk_duration=0.08)
+
+ batch_enc = model.batch_encode([audio, audio[:, : 48000 * 3]], chunk_duration=0.08)
+ codes_list = [
+     batch_enc.audio_codes[:, i, : batch_enc.audio_codes_lengths[i]]
+     for i in range(batch_enc.audio_codes.shape[1])
+ ]
+ batch_dec = model.batch_decode(codes_list, chunk_duration=0.08)
+ ```
+
+ #### Continuous Batch Streaming Decode
+
+ For decoder-side continuous batching, prefer `batch_decode(..., streaming=True, ...)`.
+
+ - The first streaming call may pass `max_batch_size=...`. If it is omitted, the first batch size reserves the
+   fixed-slot decoder budget for that public stream.
+ - Same-size calls continue the existing logical rows in-order.
+ - If a later call is larger, the new rows are admitted by tail append.
+ - `finalize_indices` means "decode these rows one last time, then evict them". The indices are interpreted against the
+   pre-call logical order.
+ - After a finalize call returns, the next streaming call may use the smaller survivor batch.
+ - `reset_stream=True` discards the hidden public streaming state and starts a fresh stream.
+
+ Milestone 1 boundaries:
+
+ - decode-only continuous batching
+ - one active streaming decode state per model instance
+ - fixed-slot decoder reservation from `max_batch_size`
+ - no encode-side continuous batching
+ - no physical compaction of surviving decode slots
+ - no multi-session concurrency on one model instance
+
+ ```python
+ import torch
+ from transformers import AutoModel
+
+ repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
+ model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
+ num_quantizers = model.config.quantizer_kwargs["num_quantizers"]
+
+ codes_a0 = torch.randint(0, 8, (num_quantizers, 2))
+ codes_b0 = torch.randint(0, 8, (num_quantizers, 3))
+ codes_a1 = torch.randint(0, 8, (num_quantizers, 2))
+ codes_b1 = torch.randint(0, 8, (num_quantizers, 2))
+ codes_c0 = torch.randint(0, 8, (num_quantizers, 1))
+ codes_a2 = torch.randint(0, 8, (num_quantizers, 1))
+ codes_b2 = torch.randint(0, 8, (num_quantizers, 2))
+ codes_c1 = torch.randint(0, 8, (num_quantizers, 2))
+ codes_b3 = torch.randint(0, 8, (num_quantizers, 1))
+ codes_c2 = torch.randint(0, 8, (num_quantizers, 1))
+
+ # First call reserves 3 fixed decoder slots for A and B.
+ out_ab0 = model.batch_decode(
+     [codes_a0, codes_b0],
+     streaming=True,
+     max_batch_size=3,
+     reset_stream=True,
+ )
+
+ # Same logical rows continue in-order; C is a tail append.
+ out_abc1 = model.batch_decode(
+     [codes_a1, codes_b1, codes_c0],
+     streaming=True,
+ )
+
+ # Finalize A against the pre-call logical order. A still decodes in this call,
+ # then is evicted immediately afterward.
+ out_abc2 = model.batch_decode(
+     [codes_a2, codes_b2, codes_c1],
+     streaming=True,
+     finalize_indices=[0],
+ )
+
+ # The next call can shrink to the surviving logical rows only.
+ out_bc3 = model.batch_decode(
+     [codes_b3, codes_c2],
+     streaming=True,
+ )
+ ```
+
+ ## Repository layout
+
+ - `configuration_moss_audio_tokenizer.py`
+ - `modeling_moss_audio_tokenizer.py`
+ - `__init__.py`
+ - `config.json`
+ - model weights
+
+
+ ## Citation
+ If you use this code or result in your paper, please cite our work as:
+ ```tex
+
+ ```
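
Note on the Streaming section of the README in this diff: it requires `chunk_duration * sampling_rate` to be divisible by `downsample_rate`. The sketch below shows one way a caller might check a candidate `chunk_duration` before passing it to the model. It is not part of the committed README or of the library. The helper name `is_valid_chunk_duration` is hypothetical, and the constants 48000 and 3840 are assumed from the README's own streaming example comment rather than read from the model config.

```python
from fractions import Fraction

# Assumed values, taken from the README's streaming example comment
# ("6.0s @ 48kHz = 288000 samples, divisible by downsample_rate=3840").
SAMPLING_RATE = 48_000
DOWNSAMPLE_RATE = 3_840

# Frames per second implied by those two values.
FRAME_RATE_HZ = SAMPLING_RATE / DOWNSAMPLE_RATE


def is_valid_chunk_duration(chunk_duration,
                            sampling_rate=SAMPLING_RATE,
                            downsample_rate=DOWNSAMPLE_RATE):
    """Hypothetical helper: True if the chunk spans a whole number of frames.

    chunk_duration is in seconds, as in the README.
    """
    # Going through the decimal string keeps the arithmetic exact and
    # avoids binary floating-point rounding in the multiplication.
    samples = Fraction(str(chunk_duration)) * sampling_rate
    if samples <= 0 or samples.denominator != 1:
        return False
    return samples.numerator % downsample_rate == 0


if __name__ == "__main__":
    for candidate in (0.08, 0.1, 0.16):
        print(candidate, is_valid_chunk_duration(candidate))
```

With the assumed constants, one frame corresponds to 3840 samples, so the README's `chunk_duration=0.08` is exactly one frame per chunk, and the implied frame rate matches the 12.5Hz stated in the feature list. In real use the two rates would more likely be read from the loaded model's config than hard-coded.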