Commit d4a3b2c · 1 Parent(s): 61f845e

update nano
README.md CHANGED
@@ -1,3 +1,125 @@
  ---
  license: apache-2.0
+ library_name: transformers
+ tags:
+ - audio
+ - audio-tokenizer
+ - neural-codec
+ - moss-tts-family
+ - MOSS Audio Tokenizer
+ - speech-tokenizer
+ - trust-remote-code
  ---
+
+ # MossAudioTokenizer
+
+ This is the code for MOSS-Audio-Tokenizer, presented in [MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models](https://arxiv.org/abs/2602.10934).
+
+ **MOSSAudioTokenizer** is a discrete audio tokenizer based on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture. Scaled to 1.6 billion parameters, it serves as a unified discrete interface that delivers both lossless-quality reconstruction and high-level semantic alignment.
+
+ **Key Features:**
+
+ * **Extreme Compression & Variable Bitrate**: Compresses 48 kHz stereo audio to a frame rate of just 12.5 Hz. A 32-layer Residual LFQ quantizer stack supports high-fidelity reconstruction across a wide range of bitrates.
+ * **Pure Transformer Architecture**: A CNN-free, homogeneous architecture built entirely from causal Transformer blocks. With 1.6B combined parameters (encoder + decoder), it scales cleanly and supports low-latency streaming inference.
+ * **Large-Scale General Audio Training**: Trained on 3 million hours of diverse audio data, the model encodes and reconstructs all audio domains, including speech, sound effects, and music.
+ * **Unified Semantic-Acoustic Representation**: While achieving state-of-the-art reconstruction quality, Cat produces semantically rich discrete tokens, well suited to downstream tasks such as speech understanding (ASR) and generation (TTS).
+ * **Fully Trained From Scratch**: Cat does not rely on pretrained encoders (such as HuBERT or Whisper) or distillation from teacher models. All representations are learned from raw data.
+ * **End-to-End Joint Optimization**: All components, including the encoder, quantizer, decoder, discriminator, and a decoder-only LLM for semantic alignment, are optimized jointly in a single unified training pipeline.
+
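For intuition, the compression figures above can be checked with a little arithmetic. This sketch is illustrative only and is not part of the released code; the 1024-entry codebooks (10 bits per layer) are taken from this repository's `config.json` and are an assumption here.

```python
import math

# Figures quoted above (README) and assumed from config.json.
sampling_rate = 48_000     # Hz
downsample_rate = 3840     # waveform samples per token frame
num_layers = 32            # RLFQ layers at the full stack
codebook_size = 1024       # assumed: 10 bits per code

frame_rate = sampling_rate / downsample_rate            # tokens per second
bits_per_frame = num_layers * math.log2(codebook_size)  # bits across all layers

print(frame_rate)                          # 12.5
print(bits_per_frame * frame_rate / 1000)  # 4.0 kbps at the full 32-layer stack
```

Decoding with fewer quantizer layers (as in the RVQ-8 example below) scales the bitrate down proportionally.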
+ **Summary:**
+ By combining a simple, scalable architecture with massive-scale data, Cat overcomes the bottlenecks of traditional audio tokenizers. It provides a robust, high-fidelity, and semantically grounded interface for the next generation of native audio foundation models.
+
+ This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers
+ `transformers.models.moss_audio_tokenizer` module. It is intended to be uploaded to a Hugging Face Hub model repository
+ and loaded with `trust_remote_code=True` when needed.
+
+ ## Usage
+
+ ### Quickstart
+
+ ```python
+ import torch
+ import torchaudio
+ from transformers import AutoModel
+
+ repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
+ model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
+
+ # Load and resample to the model's sampling rate (48 kHz)
+ wav, sr = torchaudio.load("demo/demo_gt.wav")
+ if sr != model.sampling_rate:
+     wav = torchaudio.functional.resample(wav, sr, model.sampling_rate)
+ # Match the expected channel count (stereo)
+ if wav.shape[0] == 1:
+     wav = wav.repeat(model.config.number_channels, 1)
+ else:
+     wav = wav[: model.config.number_channels]
+ wav = wav.unsqueeze(0)  # (B, C, T)
+
+ enc = model.encode(wav, return_dict=True)
+ print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")
+ dec = model.decode(enc.audio_codes, return_dict=True)
+ print(f"dec.audio.shape: {dec.audio.shape}")
+ wav = dec.audio.squeeze(0)
+ torchaudio.save("demo/demo_rec.wav", wav, sample_rate=model.sampling_rate)
+
+ # Decode using only the first 8 quantizer layers of the RVQ (lower bitrate)
+ dec_rvq8 = model.decode(enc.audio_codes[:8], return_dict=True)
+ wav_rvq8 = dec_rvq8.audio.squeeze(0)
+ torchaudio.save("demo/demo_rec_rvq8.wav", wav_rvq8, sample_rate=model.sampling_rate)
+ ```
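For orientation, the number of token frames produced per quantizer layer follows directly from `downsample_rate`. A small standalone sketch; the ceiling-division behavior for a partial trailing chunk is an assumption here, not something verified against the model:

```python
# How many token frames a clip produces, assuming one frame per
# `downsample_rate` input samples and that a partial trailing chunk
# still yields a frame (assumption).
sampling_rate = 48_000
downsample_rate = 3840

def num_frames(num_samples: int) -> int:
    # Ceiling division without floats.
    return -(-num_samples // downsample_rate)

print(num_frames(6 * sampling_rate))  # 75 frames for a 6 s clip
```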
+
+ ### Attention Backend and Compute Dtype
+
+ `config.attention_implementation` selects the attention implementation preferred by the transformer layers (`"sdpa"` or `"flash_attention_2"`).
+ `config.compute_dtype` sets the autocast dtype for all non-quantizer modules and supports `"fp32"`, `"bf16"`, and `"fp16"`.
+
+ ```python
+ model.set_attention_implementation("flash_attention_2")
+ model.set_compute_dtype("fp16")
+ ```
+
+ The quantizer always runs in fp32.
+
+ ### Streaming
+
+ `MossAudioTokenizerModel.encode`, `decode`, `batch_encode`, and `batch_decode` all support streaming through a
+ `chunk_duration` argument.
+
+ - `chunk_duration` is expressed in seconds.
+ - `chunk_duration * MossAudioTokenizerConfig.sampling_rate` must be divisible by `MossAudioTokenizerConfig.downsample_rate`.
+ - Streaming batch inference is supported.
+ - The public waveform interface expects stereo inputs shaped `(2, T)` or batched stereo inputs shaped `(B, 2, T)`.
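The divisibility constraint on `chunk_duration` can be checked ahead of time. A minimal sketch assuming the default 48 kHz sampling rate and 3840 downsample rate; the helper name is ours, not part of the model API:

```python
from fractions import Fraction

sampling_rate = 48_000
downsample_rate = 3840

def is_valid_chunk_duration(chunk_duration: float) -> bool:
    # Use Fraction to avoid float rounding (0.08 * 48000 is not exact in binary).
    # The chunk must cover a whole number of samples, divisible by downsample_rate.
    samples = Fraction(str(chunk_duration)) * sampling_rate
    return samples.denominator == 1 and int(samples) % downsample_rate == 0

print(is_valid_chunk_duration(0.08))  # True: 3840 samples, exactly one frame
print(is_valid_chunk_duration(0.05))  # False: 2400 samples
```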
+
+ ```python
+ import torch
+ from transformers import AutoModel
+
+ repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
+ model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
+ audio = torch.randn(2, 48000 * 6)  # dummy 6 s stereo waveform
+
+ # 6.0 s @ 48 kHz = 288000 samples, divisible by downsample_rate=3840
+ enc = model.encode(audio.unsqueeze(0), return_dict=True, chunk_duration=0.08)
+ dec = model.decode(enc.audio_codes, return_dict=True, chunk_duration=0.08)
+
+ # Batched streaming: trim each item's codes to its true length before decoding
+ batch_enc = model.batch_encode([audio, audio[:, : 48000 * 3]], chunk_duration=0.08)
+ codes_list = [
+     batch_enc.audio_codes[:, i, : batch_enc.audio_codes_lengths[i]]
+     for i in range(batch_enc.audio_codes.shape[1])
+ ]
+ batch_dec = model.batch_decode(codes_list, chunk_duration=0.08)
+ ```
+
+ ## Repository layout
+
+ - `configuration_moss_audio_tokenizer.py`
+ - `modeling_moss_audio_tokenizer.py`
+ - `__init__.py`
+ - `config.json`
+ - model weights
+
+ ## Citation
+
+ If you use this code or our results in your paper, please cite our work as:
+
+ ```tex
+
+ ```
__init__.py ADDED
@@ -0,0 +1 @@
+ """Remote code package for Moss audio tokenizer."""
config.json ADDED
@@ -0,0 +1,304 @@
+ {
+   "architectures": [
+     "MossAudioTokenizerModel"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_moss_audio_tokenizer.MossAudioTokenizerConfig",
+     "AutoModel": "modeling_moss_audio_tokenizer.MossAudioTokenizerModel"
+   },
+   "model_type": "moss-audio-tokenizer",
+   "sample_rate": 48000,
+   "sampling_rate": 48000,
+   "downsample_rate": 3840,
+   "causal_transformer_context_duration": 10.0,
+   "number_channels": 2,
+   "enable_channel_interleave": true,
+   "attention_implementation": "sdpa",
+   "compute_dtype": "fp32",
+   "dtype": "float32",
+   "code_dim": 768,
+   "encoder_kwargs": [
+     {
+       "module_type": "PatchedPretransform",
+       "patch_size": 240
+     },
+     {
+       "causal": true,
+       "context_duration": 4.0,
+       "conv_layout": true,
+       "d_model": 256,
+       "dim_feedforward": 1024,
+       "gating": "none",
+       "input_dimension": 240,
+       "layer_scale": 0.01,
+       "max_period": 10000,
+       "module_type": "Transformer",
+       "norm": "layer_norm",
+       "num_heads": 4,
+       "num_layers": 4,
+       "output_dimension": 384,
+       "positional_embedding": "rope"
+     },
+     {
+       "module_type": "PatchedPretransform",
+       "patch_size": 2
+     },
+     {
+       "causal": true,
+       "context_duration": 6.0,
+       "conv_layout": true,
+       "d_model": 256,
+       "dim_feedforward": 1024,
+       "gating": "none",
+       "input_dimension": 768,
+       "layer_scale": 0.01,
+       "max_period": 10000,
+       "module_type": "Transformer",
+       "norm": "layer_norm",
+       "num_heads": 4,
+       "num_layers": 2,
+       "output_dimension": 384,
+       "positional_embedding": "rope"
+     },
+     {
+       "module_type": "PatchedPretransform",
+       "patch_size": 2
+     },
+     {
+       "causal": true,
+       "context_duration": 8.0,
+       "conv_layout": true,
+       "d_model": 256,
+       "dim_feedforward": 1024,
+       "gating": "none",
+       "input_dimension": 768,
+       "layer_scale": 0.01,
+       "max_period": 10000,
+       "module_type": "Transformer",
+       "norm": "layer_norm",
+       "num_heads": 4,
+       "num_layers": 2,
+       "output_dimension": 384,
+       "positional_embedding": "rope"
+     },
+     {
+       "module_type": "PatchedPretransform",
+       "patch_size": 2
+     },
+     {
+       "causal": true,
+       "context_duration": 10.0,
+       "conv_layout": true,
+       "d_model": 256,
+       "dim_feedforward": 1024,
+       "gating": "none",
+       "input_dimension": 768,
+       "layer_scale": 0.01,
+       "max_period": 10000,
+       "module_type": "Transformer",
+       "norm": "layer_norm",
+       "num_heads": 4,
+       "num_layers": 4,
+       "output_dimension": 192,
+       "positional_embedding": "rope"
+     },
+     {
+       "module_type": "PatchedPretransform",
+       "patch_size": 4
+     }
+   ],
+   "decoder_kwargs": [
+     {
+       "module_type": "PatchedPretransform",
+       "patch_size": 4
+     },
+     {
+       "causal": true,
+       "context_duration": 10.0,
+       "conv_layout": true,
+       "d_model": 256,
+       "dim_feedforward": 1024,
+       "gating": "none",
+       "input_dimension": 192,
+       "layer_scale": 0.01,
+       "max_period": 10000,
+       "module_type": "Transformer",
+       "norm": "layer_norm",
+       "num_heads": 4,
+       "num_layers": 4,
+       "output_dimension": 768,
+       "positional_embedding": "rope"
+     },
+     {
+       "module_type": "PatchedPretransform",
+       "patch_size": 2
+     },
+     {
+       "causal": true,
+       "context_duration": 8.0,
+       "conv_layout": true,
+       "d_model": 256,
+       "dim_feedforward": 1024,
+       "gating": "none",
+       "input_dimension": 384,
+       "layer_scale": 0.01,
+       "max_period": 10000,
+       "module_type": "Transformer",
+       "norm": "layer_norm",
+       "num_heads": 4,
+       "num_layers": 2,
+       "output_dimension": 768,
+       "positional_embedding": "rope"
+     },
+     {
+       "module_type": "PatchedPretransform",
+       "patch_size": 2
+     },
+     {
+       "causal": true,
+       "context_duration": 6.0,
+       "conv_layout": true,
+       "d_model": 256,
+       "dim_feedforward": 1024,
+       "gating": "none",
+       "input_dimension": 384,
+       "layer_scale": 0.01,
+       "max_period": 10000,
+       "module_type": "Transformer",
+       "norm": "layer_norm",
+       "num_heads": 4,
+       "num_layers": 2,
+       "output_dimension": 768,
+       "positional_embedding": "rope"
+     },
+     {
+       "module_type": "PatchedPretransform",
+       "patch_size": 2
+     },
+     {
+       "causal": true,
+       "context_duration": 4.0,
+       "conv_layout": true,
+       "d_model": 256,
+       "dim_feedforward": 1024,
+       "gating": "none",
+       "input_dimension": 384,
+       "layer_scale": 0.01,
+       "max_period": 10000,
+       "module_type": "Transformer",
+       "norm": "layer_norm",
+       "num_heads": 4,
+       "num_layers": 4,
+       "output_dimension": 240,
+       "positional_embedding": "rope"
+     },
+     {
+       "module_type": "PatchedPretransform",
+       "patch_size": 240
+     }
+   ],
+   "quantizer_type": "rlfq",
+   "quantizer_kwargs": {
+     "codebook_dim": 8,
+     "codebook_loss_weight": 1.0,
+     "codebook_size": 1024,
+     "commitment_loss_weight": 0.25,
+     "input_dim": 768,
+     "num_quantizers": 16,
+     "output_dim": 768,
+     "quantizer_dropout": 1.0,
+     "quantizer_type": "rlfq",
+     "rvq_dim": 512
+   },
+   "transformers_version": "4.56.0.dev0",
+   "reversed_decoder_kwargs": [
+     {
+       "module_type": "PatchedPretransform",
+       "patch_size": 240
+     },
+     {
+       "causal": true,
+       "context_duration": 4.0,
+       "conv_layout": true,
+       "d_model": 256,
+       "dim_feedforward": 1024,
+       "gating": "none",
+       "input_dimension": 240,
+       "layer_scale": 0.01,
+       "max_period": 10000,
+       "module_type": "Transformer",
+       "norm": "layer_norm",
+       "num_heads": 4,
+       "num_layers": 4,
+       "output_dimension": 384,
+       "positional_embedding": "rope"
+     },
+     {
+       "module_type": "PatchedPretransform",
+       "patch_size": 2
+     },
+     {
+       "causal": true,
+       "context_duration": 6.0,
+       "conv_layout": true,
+       "d_model": 256,
+       "dim_feedforward": 1024,
+       "gating": "none",
+       "input_dimension": 768,
+       "layer_scale": 0.01,
+       "max_period": 10000,
+       "module_type": "Transformer",
+       "norm": "layer_norm",
+       "num_heads": 4,
+       "num_layers": 2,
+       "output_dimension": 384,
+       "positional_embedding": "rope"
+     },
+     {
+       "module_type": "PatchedPretransform",
+       "patch_size": 2
+     },
+     {
+       "causal": true,
+       "context_duration": 8.0,
+       "conv_layout": true,
+       "d_model": 256,
+       "dim_feedforward": 1024,
+       "gating": "none",
+       "input_dimension": 768,
+       "layer_scale": 0.01,
+       "max_period": 10000,
+       "module_type": "Transformer",
+       "norm": "layer_norm",
+       "num_heads": 4,
+       "num_layers": 2,
+       "output_dimension": 384,
+       "positional_embedding": "rope"
+     },
+     {
+       "module_type": "PatchedPretransform",
+       "patch_size": 2
+     },
+     {
+       "causal": true,
+       "context_duration": 10.0,
+       "conv_layout": true,
+       "d_model": 256,
+       "dim_feedforward": 1024,
+       "gating": "none",
+       "input_dimension": 768,
+       "layer_scale": 0.01,
+       "max_period": 10000,
+       "module_type": "Transformer",
+       "norm": "layer_norm",
+       "num_heads": 4,
+       "num_layers": 4,
+       "output_dimension": 192,
+       "positional_embedding": "rope"
+     },
+     {
+       "module_type": "PatchedPretransform",
+       "patch_size": 4
+     }
+   ]
+ }
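As a consistency check on the configuration above: the product of the `PatchedPretransform` patch sizes along the encoder path is 7680, which is `number_channels * downsample_rate`. One reading (an assumption on our part, not documented in this repository) is that `enable_channel_interleave` flattens the stereo waveform into a single stream of twice the length, so 7680 interleaved samples correspond to 3840 samples per channel:

```python
# Patch sizes of the PatchedPretransform stages in encoder_kwargs above.
patch_sizes = [240, 2, 2, 2, 4]

product = 1
for p in patch_sizes:
    product *= p

number_channels = 2      # "number_channels" above
downsample_rate = 3840   # "downsample_rate" above

print(product)                                       # 7680
print(product == number_channels * downsample_rate)  # True under the interleave reading
print(48000 / downsample_rate)                       # 12.5 Hz token frame rate
```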
configuration_moss_audio_tokenizer.py ADDED
@@ -0,0 +1,467 @@
+ # coding=utf-8
+ # Copyright 2026 OpenMOSS and the HuggingFace Inc. team. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """MossAudioTokenizer model configuration."""
+
+ from typing import Any
+
+ try:
+     from transformers.configuration_utils import PreTrainedConfig
+ except ImportError:
+     from transformers.configuration_utils import PretrainedConfig as PreTrainedConfig
+ from transformers.utils import logging
+
+
+ logger = logging.get_logger(__name__)
+
+
+ class MossAudioTokenizerConfig(PreTrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`MossAudioTokenizerModel`]. It is used to
+     instantiate a MossAudioTokenizer model according to the specified arguments, defining the model architecture.
+
+     Instantiating a configuration with the defaults will yield a similar configuration to that of the
+     [VoiceAgentGroup/moss_audio_tokenizer](https://huggingface.co/VoiceAgentGroup/moss_audio_tokenizer) architecture.
+
+     Configuration objects inherit from [`PreTrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PreTrainedConfig`] for more information.
+
+     Args:
+         sampling_rate (`int`, *optional*, defaults to 48000):
+             The sampling rate at which the audio waveform should be digitized, expressed in hertz (Hz).
+         downsample_rate (`int`, *optional*, defaults to 3840):
+             Total downsampling rate from waveform samples to tokens.
+         causal_transformer_context_duration (`float`, *optional*, defaults to 10.0):
+             Legacy global fallback context duration in seconds for causal transformers. If an individual transformer
+             entry in `encoder_kwargs` or `decoder_kwargs` provides `context_duration`, that per-module value takes
+             precedence.
+         encoder_kwargs (`list[dict]`, *optional*):
+             List of encoder module configurations. Each dict specifies a module type and its parameters.
+         decoder_kwargs (`list[dict]`, *optional*):
+             List of decoder module configurations in execution order.
+         number_channels (`int`, *optional*, defaults to 2):
+             Number of audio channels exposed by the public waveform interface.
+         enable_channel_interleave (`bool`, *optional*, defaults to `True`):
+             Whether to flatten multi-channel waveforms into a single internal stream before codec inference.
+         attention_implementation (`str`, *optional*, defaults to `"sdpa"`):
+             Attention implementation to prefer for transformer layers. Supported values are `"sdpa"` and
+             `"flash_attention_2"`.
+         compute_dtype (`str`, *optional*, defaults to `"fp32"`):
+             Inference compute dtype for non-quantizer modules. Supported values are `"fp32"`, `"bf16"`, and `"fp16"`.
+         quantizer_type (`str`, *optional*, defaults to `"rlfq"`):
+             Quantizer type. Options include `"rvq"`, `"spec_rvq"`, `"rlfq"`, and `"random_prefix_rlfq"`.
+         quantizer_kwargs (`dict`, *optional*):
+             Configuration for the quantizer, including `input_dim`, `rvq_dim`, `output_dim`, `num_quantizers`,
+             `codebook_size`, and `codebook_dim`.
+
+     Example:
+
+     ```python
+     >>> from transformers import MossAudioTokenizerModel, MossAudioTokenizerConfig
+
+     >>> # Initializing a MossAudioTokenizer style configuration
+     >>> configuration = MossAudioTokenizerConfig()
+
+     >>> # Initializing a model (with random weights) from the configuration
+     >>> model = MossAudioTokenizerModel(configuration)
+
+     >>> # Accessing the model configuration
+     >>> configuration = model.config
+     ```
+     """
+
+     model_type = "moss-audio-tokenizer"
+
+     # Backward-compatible alias used by some checkpoints.
+     attribute_map = {"sample_rate": "sampling_rate"}
+
+     sampling_rate: int
+     downsample_rate: int
+     causal_transformer_context_duration: float
+     encoder_kwargs: list[dict[str, Any]]
+     decoder_kwargs: list[dict[str, Any]]
+     number_channels: int
+     enable_channel_interleave: bool
+     attention_implementation: str
+     compute_dtype: str
+     quantizer_type: str
+     quantizer_kwargs: dict[str, Any]
+
+     def __init__(
+         self,
+         version: str | None = None,
+         sampling_rate: int = 48000,
+         downsample_rate: int = 3840,
+         causal_transformer_context_duration: float = 10.0,
+         encoder_kwargs: list[dict[str, Any]] | None = None,
+         decoder_kwargs: list[dict[str, Any]] | None = None,
+         number_channels: int = 2,
+         enable_channel_interleave: bool = True,
+         attention_implementation: str = "sdpa",
+         compute_dtype: str = "fp32",
+         quantizer_type: str = "rlfq",
+         quantizer_kwargs: dict[str, Any] | None = None,
+         **kwargs,
+     ):
+         # Some checkpoints might include an incorrect/legacy `model_type` (e.g. "speech_tokenizer").
+         # We drop it to avoid overriding the class-level `model_type`.
+         kwargs.pop("model_type", None)
+         if "channels_numbers" in kwargs:
+             number_channels = kwargs.pop("channels_numbers")
+         if "enable_channel_interleave" in kwargs:
+             enable_channel_interleave = kwargs.pop("enable_channel_interleave")
+         if "attention_backend" in kwargs and attention_implementation == "sdpa":
+             attention_implementation = kwargs.pop("attention_backend")
+         if "codec_compute_dtype" in kwargs and compute_dtype == "fp32":
+             compute_dtype = kwargs.pop("codec_compute_dtype")
+         reversed_decoder_kwargs = kwargs.pop("reversed_decoder_kwargs", None)
+
+         # `version` is accepted for compatibility but not used in modeling.
+         self.version = version
+         self.sampling_rate = sampling_rate
+         self.downsample_rate = downsample_rate
+         self.causal_transformer_context_duration = causal_transformer_context_duration
+         self.number_channels = number_channels
+         self.enable_channel_interleave = enable_channel_interleave
+         self.attention_implementation = attention_implementation
+         self.compute_dtype = compute_dtype
+         # Default encoder configuration
+         if encoder_kwargs is None:
+             encoder_kwargs = [
+                 {
+                     "module_type": "PatchedPretransform",
+                     "patch_size": 240,
+                 },
+                 {
+                     "module_type": "Transformer",
+                     "input_dimension": 240,
+                     "output_dimension": 384,
+                     "d_model": 768,
+                     "num_heads": 12,
+                     "num_layers": 12,
+                     "dim_feedforward": 3072,
+                     "causal": True,
+                     "norm": "layer_norm",
+                     "positional_embedding": "rope",
+                     "max_period": 10000,
+                     "gating": "none",
+                     "layer_scale": 0.01,
+                     "conv_layout": True,
+                     "context_duration": 1.0,
+                 },
+                 {
+                     "module_type": "PatchedPretransform",
+                     "patch_size": 2,
+                 },
+                 {
+                     "module_type": "Transformer",
+                     "input_dimension": 768,
+                     "output_dimension": 384,
+                     "d_model": 768,
+                     "num_heads": 12,
+                     "num_layers": 12,
+                     "dim_feedforward": 3072,
+                     "causal": True,
+                     "norm": "layer_norm",
+                     "positional_embedding": "rope",
+                     "max_period": 10000,
+                     "gating": "none",
+                     "layer_scale": 0.01,
+                     "conv_layout": True,
+                     "context_duration": 2.0,
+                 },
+                 {
+                     "module_type": "PatchedPretransform",
+                     "patch_size": 2,
+                 },
+                 {
+                     "module_type": "Transformer",
+                     "input_dimension": 768,
+                     "output_dimension": 384,
+                     "d_model": 768,
+                     "num_heads": 12,
+                     "num_layers": 12,
+                     "dim_feedforward": 3072,
+                     "causal": True,
+                     "norm": "layer_norm",
+                     "positional_embedding": "rope",
+                     "max_period": 10000,
+                     "gating": "none",
+                     "layer_scale": 0.01,
+                     "conv_layout": True,
+                     "context_duration": 4.0,
+                 },
+                 {
+                     "module_type": "PatchedPretransform",
+                     "patch_size": 2,
+                 },
+                 {
+                     "module_type": "Transformer",
+                     "input_dimension": 768,
+                     "output_dimension": 384,
+                     "d_model": 768,
+                     "num_heads": 12,
+                     "num_layers": 12,
+                     "dim_feedforward": 3072,
+                     "causal": True,
+                     "norm": "layer_norm",
+                     "positional_embedding": "rope",
+                     "max_period": 10000,
+                     "gating": "none",
+                     "layer_scale": 0.01,
+                     "conv_layout": True,
+                     "context_duration": 8.0,
+                 },
+                 {
+                     "module_type": "PatchedPretransform",
+                     "patch_size": 2,
+                 },
+                 {
+                     "module_type": "Transformer",
+                     "input_dimension": 768,
+                     "output_dimension": 640,
+                     "d_model": 768,
+                     "num_heads": 12,
+                     "num_layers": 12,
+                     "dim_feedforward": 3072,
+                     "causal": True,
+                     "norm": "layer_norm",
+                     "positional_embedding": "rope",
+                     "max_period": 10000,
+                     "gating": "none",
+                     "layer_scale": 0.01,
+                     "conv_layout": True,
+                     "context_duration": 10.0,
+                 },
+                 {
+                     "module_type": "PatchedPretransform",
+                     "patch_size": 2,
+                 },
+                 {
+                     "module_type": "Transformer",
+                     "input_dimension": 1280,
+                     "output_dimension": 768,
+                     "d_model": 1280,
+                     "num_heads": 20,
+                     "num_layers": 32,
+                     "dim_feedforward": 5120,
+                     "causal": True,
+                     "norm": "layer_norm",
+                     "positional_embedding": "rope",
+                     "max_period": 10000,
+                     "gating": "none",
+                     "layer_scale": 0.01,
+                     "conv_layout": True,
+                     "context_duration": 10.0,
+                 },
+             ]
+         else:
+             encoder_kwargs = [dict(module_kwargs) for module_kwargs in encoder_kwargs]
+             for module_kwargs in encoder_kwargs:
+                 if module_kwargs.get("module_type") == "Transformer":
+                     module_kwargs.setdefault("context_duration", causal_transformer_context_duration)
+         self.encoder_kwargs = encoder_kwargs
+
+         # Default decoder configuration (execution order)
+         if decoder_kwargs is None and reversed_decoder_kwargs is not None:
+             reversed_decoder_kwargs = [dict(module_kwargs) for module_kwargs in reversed_decoder_kwargs]
+             decoder_kwargs = []
+             for module_kwargs in reversed_decoder_kwargs[::-1]:
+                 if module_kwargs.get("module_type") != "Transformer":
+                     decoder_kwargs.append(module_kwargs)
+                     continue
+                 module_kwargs = dict(module_kwargs)
+                 module_kwargs["input_dimension"], module_kwargs["output_dimension"] = (
+                     module_kwargs["output_dimension"],
+                     module_kwargs["input_dimension"],
+                 )
+                 decoder_kwargs.append(module_kwargs)
+
+         if decoder_kwargs is None:
+             decoder_kwargs = [
+                 {
+                     "module_type": "Transformer",
+                     "input_dimension": 768,
+                     "output_dimension": 1280,
+                     "d_model": 1280,
+                     "num_heads": 20,
+                     "num_layers": 32,
+                     "dim_feedforward": 5120,
+                     "causal": True,
+                     "norm": "layer_norm",
+                     "positional_embedding": "rope",
+                     "max_period": 10000,
+                     "gating": "none",
+                     "layer_scale": 0.01,
+                     "conv_layout": True,
+                     "context_duration": 10.0,
+                 },
+                 {
+                     "module_type": "PatchedPretransform",
+                     "patch_size": 2,
+                 },
+                 {
+                     "module_type": "Transformer",
+                     "input_dimension": 640,
+                     "output_dimension": 768,
+                     "d_model": 768,
+                     "num_heads": 12,
+                     "num_layers": 12,
+                     "dim_feedforward": 3072,
+                     "causal": True,
+                     "norm": "layer_norm",
+                     "positional_embedding": "rope",
+                     "max_period": 10000,
+                     "gating": "none",
+                     "layer_scale": 0.01,
+                     "conv_layout": True,
+                     "context_duration": 10.0,
+                 },
+                 {
+                     "module_type": "PatchedPretransform",
+                     "patch_size": 2,
+                 },
+                 {
+                     "module_type": "Transformer",
+                     "input_dimension": 384,
+                     "output_dimension": 768,
+                     "d_model": 768,
+                     "num_heads": 12,
+                     "num_layers": 12,
+                     "dim_feedforward": 3072,
+                     "causal": True,
+                     "norm": "layer_norm",
+                     "positional_embedding": "rope",
+                     "max_period": 10000,
+                     "gating": "none",
+                     "layer_scale": 0.01,
+                     "conv_layout": True,
+                     "context_duration": 8.0,
+                 },
+                 {
+                     "module_type": "PatchedPretransform",
+                     "patch_size": 2,
+                 },
+                 {
+                     "module_type": "Transformer",
+                     "input_dimension": 384,
+                     "output_dimension": 768,
+                     "d_model": 768,
+                     "num_heads": 12,
+                     "num_layers": 12,
+                     "dim_feedforward": 3072,
+                     "causal": True,
+                     "norm": "layer_norm",
+                     "positional_embedding": "rope",
+                     "max_period": 10000,
+                     "gating": "none",
+                     "layer_scale": 0.01,
+                     "conv_layout": True,
+                     "context_duration": 4.0,
+                 },
+                 {
+                     "module_type": "PatchedPretransform",
+                     "patch_size": 2,
+                 },
+                 {
+                     "module_type": "Transformer",
+                     "input_dimension": 384,
+                     "output_dimension": 768,
+                     "d_model": 768,
+                     "num_heads": 12,
+                     "num_layers": 12,
+                     "dim_feedforward": 3072,
+                     "causal": True,
+                     "norm": "layer_norm",
+                     "positional_embedding": "rope",
+                     "max_period": 10000,
+                     "gating": "none",
+                     "layer_scale": 0.01,
+                     "conv_layout": True,
+                     "context_duration": 2.0,
+                 },
+                 {
+                     "module_type": "PatchedPretransform",
+                     "patch_size": 2,
+                 },
+                 {
+                     "module_type": "Transformer",
+                     "input_dimension": 384,
+                     "output_dimension": 240,
+                     "d_model": 768,
+                     "num_heads": 12,
+                     "num_layers": 12,
+                     "dim_feedforward": 3072,
+                     "causal": True,
+                     "norm": "layer_norm",
+                     "positional_embedding": "rope",
+                     "max_period": 10000,
+                     "gating": "none",
+                     "layer_scale": 0.01,
+                     "conv_layout": True,
+                     "context_duration": 1.0,
+                 },
+                 {
+                     "module_type": "PatchedPretransform",
+                     "patch_size": 240,
+                 },
+             ]
+         else:
+             decoder_kwargs = [dict(module_kwargs) for module_kwargs in decoder_kwargs]
+             for module_kwargs in decoder_kwargs:
+                 if module_kwargs.get("module_type") == "Transformer":
+                     module_kwargs.setdefault("context_duration", causal_transformer_context_duration)
+         self.decoder_kwargs = decoder_kwargs
+
+         # Default quantizer configuration
+         if quantizer_kwargs is None:
+             quantizer_kwargs = {
+                 "input_dim": 768,
+                 "rvq_dim": 512,
+                 "output_dim": 768,
+                 "num_quantizers": 32,
+                 "codebook_size": 1024,
+                 "codebook_dim": 8,
+                 "quantizer_type": "rlfq",
+             }
+
+         # Handle quantizer_type from kwargs or config
+         kw_qtype = quantizer_kwargs.get("quantizer_type", None)
+         if kw_qtype is not None:
+             self.quantizer_type = kw_qtype
+         else:
+             self.quantizer_type = quantizer_type
+             quantizer_kwargs["quantizer_type"] = quantizer_type
+
+         self.quantizer_kwargs = quantizer_kwargs
+
+         super().__init__(**kwargs)
+
+     @property
+     def num_quantizers(self) -> int:
+         """Return the number of quantizers from `quantizer_kwargs`."""
+         return self.quantizer_kwargs.get("num_quantizers", 32)
+
+     @property
+     def codebook_size(self) -> int:
+         """Return the codebook size from `quantizer_kwargs`."""
+         return self.quantizer_kwargs.get("codebook_size", 4096)
+
+     @property
+     def frame_rate(self) -> float:
+         """Return the frame rate (tokens per second)."""
+         return self.sampling_rate / self.downsample_rate
+
+
+ __all__ = ["MossAudioTokenizerConfig"]
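The `reversed_decoder_kwargs` handling above derives `decoder_kwargs` by reversing the module order and swapping each Transformer's input/output dimensions. A standalone sketch of that transformation (same logic, extracted outside the class; the function name is ours):

```python
def reverse_decoder_kwargs(reversed_kwargs):
    """Reverse module order; swap Transformer input/output dimensions."""
    decoder_kwargs = []
    for module_kwargs in reversed_kwargs[::-1]:
        module_kwargs = dict(module_kwargs)  # avoid mutating the caller's dicts
        if module_kwargs.get("module_type") == "Transformer":
            module_kwargs["input_dimension"], module_kwargs["output_dimension"] = (
                module_kwargs["output_dimension"],
                module_kwargs["input_dimension"],
            )
        decoder_kwargs.append(module_kwargs)
    return decoder_kwargs

modules = [
    {"module_type": "PatchedPretransform", "patch_size": 240},
    {"module_type": "Transformer", "input_dimension": 240, "output_dimension": 384},
]
print(reverse_decoder_kwargs(modules))
# [{'module_type': 'Transformer', 'input_dimension': 384, 'output_dimension': 240},
#  {'module_type': 'PatchedPretransform', 'patch_size': 240}]
```

This mirroring is why `config.json` can ship only `reversed_decoder_kwargs` (the encoder-ordered view) and still reconstruct the decoder stack.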
model-00001-of-00001.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:34d9880d805eecb21bde975202b1c256dbd0eb98c8680b9d3aeffd2bc6ac2f67
+ size 87922568
model.safetensors.index.json ADDED
@@ -0,0 +1,382 @@
+ {
+   "metadata": {
+     "total_parameters": 21969664,
+     "total_size": 87878656
+   },
+   "weight_map": {
+     "encoder.1.input_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.0.norm1.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.0.norm1.bias": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.0.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.0.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.0.norm2.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.0.norm2.bias": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.0.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.0.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.0.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.0.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.1.norm1.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.1.norm1.bias": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.1.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.1.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.1.norm2.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.1.norm2.bias": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.1.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.1.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.1.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.1.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.2.norm1.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.2.norm1.bias": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.2.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.2.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.2.norm2.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.2.norm2.bias": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.2.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.2.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.2.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.2.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.3.norm1.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.3.norm1.bias": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.3.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.3.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.3.norm2.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.3.norm2.bias": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.3.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.3.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.3.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "encoder.1.transformer.layers.3.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "encoder.1.output_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.3.input_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.0.norm1.weight": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.0.norm1.bias": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.0.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.0.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.0.norm2.weight": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.0.norm2.bias": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.0.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.0.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.0.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.0.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.1.norm1.weight": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.1.norm1.bias": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.1.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.1.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.1.norm2.weight": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.1.norm2.bias": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.1.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.1.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.1.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "encoder.3.transformer.layers.1.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "encoder.3.output_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.5.input_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.0.norm1.weight": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.0.norm1.bias": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.0.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.0.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.0.norm2.weight": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.0.norm2.bias": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.0.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.0.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.0.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.0.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.1.norm1.weight": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.1.norm1.bias": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.1.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.1.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.1.norm2.weight": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.1.norm2.bias": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.1.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.1.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.1.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "encoder.5.transformer.layers.1.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "encoder.5.output_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.input_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.0.norm1.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.0.norm1.bias": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.0.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.0.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.0.norm2.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.0.norm2.bias": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.0.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.0.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.0.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.0.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.1.norm1.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.1.norm1.bias": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.1.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.1.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.1.norm2.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.1.norm2.bias": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.1.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.1.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.1.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.1.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.2.norm1.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.2.norm1.bias": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.2.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.2.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.2.norm2.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.2.norm2.bias": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.2.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.2.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.2.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.2.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.3.norm1.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.3.norm1.bias": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.3.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.3.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.3.norm2.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.3.norm2.bias": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.3.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.3.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.3.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "encoder.7.transformer.layers.3.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "encoder.7.output_proj.weight": "model-00001-of-00001.safetensors",
+     "quantizer.input_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.input_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.input_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.output_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.output_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.output_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.0.in_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.0.in_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.0.in_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.0.out_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.0.out_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.0.out_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.0.codebook.weight": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.1.in_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.1.in_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.1.in_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.1.out_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.1.out_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.1.out_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.1.codebook.weight": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.2.in_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.2.in_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.2.in_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.2.out_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.2.out_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.2.out_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.2.codebook.weight": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.3.in_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.3.in_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.3.in_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.3.out_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.3.out_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.3.out_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.3.codebook.weight": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.4.in_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.4.in_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.4.in_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.4.out_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.4.out_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.4.out_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.4.codebook.weight": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.5.in_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.5.in_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.5.in_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.5.out_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.5.out_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.5.out_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.5.codebook.weight": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.6.in_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.6.in_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.6.in_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.6.out_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.6.out_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.6.out_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.6.codebook.weight": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.7.in_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.7.in_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.7.in_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.7.out_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.7.out_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.7.out_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.7.codebook.weight": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.8.in_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.8.in_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.8.in_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.8.out_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.8.out_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.8.out_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.8.codebook.weight": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.9.in_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.9.in_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.9.in_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.9.out_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.9.out_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.9.out_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.9.codebook.weight": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.10.in_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.10.in_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.10.in_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.10.out_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.10.out_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.10.out_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.10.codebook.weight": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.11.in_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.11.in_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.11.in_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.11.out_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.11.out_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.11.out_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.11.codebook.weight": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.12.in_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.12.in_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.12.in_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.12.out_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.12.out_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.12.out_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.12.codebook.weight": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.13.in_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.13.in_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.13.in_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.13.out_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.13.out_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.13.out_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.13.codebook.weight": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.14.in_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.14.in_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.14.in_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.14.out_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.14.out_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.14.out_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.14.codebook.weight": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.15.in_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.15.in_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.15.in_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.15.out_proj.bias": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.15.out_proj.parametrizations.weight.original0": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.15.out_proj.parametrizations.weight.original1": "model-00001-of-00001.safetensors",
+     "quantizer.quantizers.15.codebook.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.input_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.0.norm1.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.0.norm1.bias": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.0.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.0.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.0.norm2.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.0.norm2.bias": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.0.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.0.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.0.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.0.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.1.norm1.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.1.norm1.bias": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.1.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.1.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.1.norm2.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.1.norm2.bias": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.1.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.1.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.1.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.1.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.2.norm1.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.2.norm1.bias": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.2.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.2.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.2.norm2.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.2.norm2.bias": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.2.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.2.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.2.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.2.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.3.norm1.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.3.norm1.bias": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.3.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.3.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.3.norm2.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.3.norm2.bias": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.3.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.3.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.3.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "decoder.1.transformer.layers.3.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "decoder.1.output_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.3.input_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.0.norm1.weight": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.0.norm1.bias": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.0.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.0.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.0.norm2.weight": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.0.norm2.bias": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.0.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.0.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.0.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.0.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.1.norm1.weight": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.1.norm1.bias": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.1.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.1.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.1.norm2.weight": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.1.norm2.bias": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.1.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.1.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.1.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "decoder.3.transformer.layers.1.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "decoder.3.output_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.5.input_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.0.norm1.weight": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.0.norm1.bias": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.0.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.0.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.0.norm2.weight": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.0.norm2.bias": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.0.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.0.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.0.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.0.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.1.norm1.weight": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.1.norm1.bias": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.1.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.1.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.1.norm2.weight": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.1.norm2.bias": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.1.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.1.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.1.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "decoder.5.transformer.layers.1.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "decoder.5.output_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.7.input_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.0.norm1.weight": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.0.norm1.bias": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.0.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.0.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.0.norm2.weight": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.0.norm2.bias": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.0.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.0.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.0.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.0.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.1.norm1.weight": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.1.norm1.bias": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.1.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.1.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.1.norm2.weight": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.1.norm2.bias": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.1.ffn.0.weight": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.1.ffn.2.weight": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.1.layer_scale_1.scale": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.1.layer_scale_2.scale": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.2.norm1.weight": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.2.norm1.bias": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.2.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.2.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
+     "decoder.7.transformer.layers.2.norm2.weight": "model-00001-of-00001.safetensors",
365
+ "decoder.7.transformer.layers.2.norm2.bias": "model-00001-of-00001.safetensors",
366
+ "decoder.7.transformer.layers.2.ffn.0.weight": "model-00001-of-00001.safetensors",
367
+ "decoder.7.transformer.layers.2.ffn.2.weight": "model-00001-of-00001.safetensors",
368
+ "decoder.7.transformer.layers.2.layer_scale_1.scale": "model-00001-of-00001.safetensors",
369
+ "decoder.7.transformer.layers.2.layer_scale_2.scale": "model-00001-of-00001.safetensors",
370
+ "decoder.7.transformer.layers.3.norm1.weight": "model-00001-of-00001.safetensors",
371
+ "decoder.7.transformer.layers.3.norm1.bias": "model-00001-of-00001.safetensors",
372
+ "decoder.7.transformer.layers.3.self_attn.in_proj.weight": "model-00001-of-00001.safetensors",
373
+ "decoder.7.transformer.layers.3.self_attn.out_proj.weight": "model-00001-of-00001.safetensors",
374
+ "decoder.7.transformer.layers.3.norm2.weight": "model-00001-of-00001.safetensors",
375
+ "decoder.7.transformer.layers.3.norm2.bias": "model-00001-of-00001.safetensors",
376
+ "decoder.7.transformer.layers.3.ffn.0.weight": "model-00001-of-00001.safetensors",
377
+ "decoder.7.transformer.layers.3.ffn.2.weight": "model-00001-of-00001.safetensors",
378
+ "decoder.7.transformer.layers.3.layer_scale_1.scale": "model-00001-of-00001.safetensors",
379
+ "decoder.7.transformer.layers.3.layer_scale_2.scale": "model-00001-of-00001.safetensors",
380
+ "decoder.7.output_proj.weight": "model-00001-of-00001.safetensors"
381
+ }
382
+ }
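The entries above form the `weight_map` of a standard safetensors index file, mapping each parameter name to the shard that stores it (here a single shard, `model-00001-of-00001.safetensors`). As a minimal sketch of how such an index can be inspected, the snippet below parses a small excerpt of this map and counts parameters per shard; the excerpt literal is illustrative, not the full file:

```python
import json
from collections import Counter

# A small excerpt of the weight_map shown above (single-shard checkpoint).
index_json = """
{
  "weight_map": {
    "decoder.7.transformer.layers.3.ffn.0.weight": "model-00001-of-00001.safetensors",
    "decoder.7.transformer.layers.3.ffn.2.weight": "model-00001-of-00001.safetensors",
    "decoder.7.output_proj.weight": "model-00001-of-00001.safetensors"
  }
}
"""

weight_map = json.loads(index_json)["weight_map"]

# Count how many parameters live in each shard file.
shard_counts = Counter(weight_map.values())
print(shard_counts)
```

For the real checkpoint, the same logic applies after reading `model.safetensors.index.json` from the repository; `transformers` consumes this index automatically when loading sharded weights.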
modeling_moss_audio_tokenizer.py ADDED
The diff for this file is too large to render. See raw diff