Kuangwei Chen committed on
Commit
ceff0d0
·
1 Parent(s): c9c8d5d

Add ONNX weights and update model card

.gitattributes CHANGED
@@ -33,3 +33,9 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ *.data filter=lfs diff=lfs merge=lfs -text
37
+ *.png filter=lfs diff=lfs merge=lfs -text
38
+ *.jpg filter=lfs diff=lfs merge=lfs -text
39
+ *.wav filter=lfs diff=lfs merge=lfs -text
40
+ *.gguf filter=lfs diff=lfs merge=lfs -text
41
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,118 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ library_name: onnx
4
+ tags:
5
+ - audio
6
+ - audio-tokenizer
7
+ - neural-codec
8
+ - moss-tts-family
9
+ - moss-audio-tokenizer-nano
10
+ - speech-tokenizer
11
+ - onnx
12
+ - onnxruntime
13
+ - browser
14
+ ---
15
+
16
+ # MOSS-Audio-Tokenizer-Nano-ONNX
17
+
18
+ This repository provides the **ONNX exports** of [MOSS-Audio-Tokenizer-Nano](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano), the lightweight audio tokenizer used by [MOSS-TTS-Nano](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Nano). It is intended for **torch-free** deployment with ONNX Runtime and ONNX Runtime Web.
19
+
20
+ ## Overview
21
+
22
+ The Nano variant is a lightweight tokenizer with about **20M parameters**, designed to reduce deployment cost while preserving strong perceptual quality.
23
+
24
+ MOSS-Audio-Tokenizer-Nano supports:
25
+
26
+ - **48 kHz**, **stereo** audio
27
+ - **12.5 Hz** token rate
28
+ - **16 RVQ codebooks**
29
+ - high-fidelity reconstruction across variable bitrates
30
+
31
+ This ONNX repository is designed for lightweight inference pipelines such as:
32
+
33
+ - local CPU deployment with `onnxruntime`
34
+ - browser deployment with `onnxruntime-web`
35
+ - companion audio encoding/decoding for `MOSS-TTS-Nano-100M-ONNX`
36
+
37
+ ## Supported Backends
38
+
39
+ | Backend | Runtime | Use Case |
40
+ |---------|---------|----------|
41
+ | **ONNX Runtime (CPU)** | `onnxruntime` | Local CPU inference |
42
+ | **ONNX Runtime Web** | `onnxruntime-web` | Browser-based deployment |
43
+
44
+ ## Repository Contents
45
+
46
+ | File | Description |
47
+ |------|-------------|
48
+ | `moss_audio_tokenizer_encode.onnx` | Encoder graph for waveform -> discrete audio codes |
49
+ | `moss_audio_tokenizer_encode.data` | External weights for the encoder graph |
50
+ | `moss_audio_tokenizer_decode_full.onnx` | Full decoder graph for audio codes -> waveform |
51
+ | `moss_audio_tokenizer_decode_step.onnx` | Streaming decoder-step graph for incremental decode |
52
+ | `moss_audio_tokenizer_decode_shared.data` | External weights shared by the decoder graphs |
53
+ | `codec_browser_onnx_meta.json` | Metadata for browser / ONNX runtime integration |
54
+
55
+ ## Quick Start
56
+
57
+ ```bash
58
+ huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano-ONNX \
59
+ --local-dir weights/MOSS-Audio-Tokenizer-Nano-ONNX
60
+ ```
61
+
62
+ This repository is typically used together with [OpenMOSS-Team/MOSS-TTS-Nano-100M-ONNX](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Nano-100M-ONNX) for fully torch-free MOSS-TTS-Nano deployment.
63
+
64
+ ## Main Repositories
65
+
66
+ | Repository | Description |
67
+ |------------|-------------|
68
+ | [OpenMOSS/MOSS-TTS-Nano](https://github.com/OpenMOSS/MOSS-TTS-Nano) | MOSS-TTS-Nano source code and inference pipeline |
69
+ | [OpenMOSS-Team/MOSS-TTS-Nano](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Nano) | PyTorch MOSS-TTS-Nano weights |
70
+ | [OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano) | PyTorch MOSS-Audio-Tokenizer-Nano weights |
71
+ | [OpenMOSS-Team/MOSS-TTS-Nano-100M-ONNX](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Nano-100M-ONNX) | Companion ONNX TTS weights |
72
+
73
+ ## About MOSS-Audio-Tokenizer-Nano
74
+
75
+ **MOSS-Audio-Tokenizer-Nano** serves as the lightweight codec backbone for MOSS-TTS-Nano. It keeps the same unified audio-token interface used across the MOSS-TTS family while reducing inference cost for CPU and browser deployment scenarios.
76
+
77
+ For the original PyTorch implementation, setup instructions, and more background, see:
78
+
79
+ - [MOSS-Audio-Tokenizer Repository](https://github.com/OpenMOSS/MOSS-Audio-Tokenizer)
80
+ - [MOSS-TTS-Nano Repository](https://github.com/OpenMOSS/MOSS-TTS-Nano)
81
+
82
+ ## Citation
83
+
84
+ If you use the MOSS-TTS work in your research or product, please cite:
85
+
86
+ ```bibtex
87
+ @misc{openmoss2026mossttsnano,
88
+ title={MOSS-TTS-Nano},
89
+ author={OpenMOSS Team},
90
+ year={2026},
91
+ howpublished={GitHub repository},
92
+ url={https://github.com/OpenMOSS/MOSS-TTS-Nano}
93
+ }
94
+ ```
95
+
96
+ ```bibtex
97
+ @misc{gong2026mossttstechnicalreport,
98
+ title={MOSS-TTS Technical Report},
99
+ author={Yitian Gong and Botian Jiang and Yiwei Zhao and Yucheng Yuan and Kuangwei Chen and Yaozhou Jiang and Cheng Chang and Dong Hong and Mingshu Chen and Ruixiao Li and Yiyang Zhang and Yang Gao and Hanfu Chen and Ke Chen and Songlin Wang and Xiaogui Yang and Yuqian Zhang and Kexin Huang and ZhengYuan Lin and Kang Yu and Ziqi Chen and Jin Wang and Zhaoye Fei and Qinyuan Cheng and Shimin Li and Xipeng Qiu},
100
+ year={2026},
101
+ eprint={2603.18090},
102
+ archivePrefix={arXiv},
103
+ primaryClass={cs.SD},
104
+ url={https://arxiv.org/abs/2603.18090}
105
+ }
106
+ ```
107
+
108
+ ```bibtex
109
+ @misc{gong2026mossaudiotokenizerscalingaudiotokenizers,
110
+ title={MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models},
111
+ author={Yitian Gong and Kuangwei Chen and Zhaoye Fei and Xiaogui Yang and Ke Chen and Yang Wang and Kexin Huang and Mingshu Chen and Ruixiao Li and Qingyuan Cheng and Shimin Li and Xipeng Qiu},
112
+ year={2026},
113
+ eprint={2602.10934},
114
+ archivePrefix={arXiv},
115
+ primaryClass={cs.SD},
116
+ url={https://arxiv.org/abs/2602.10934}
117
+ }
118
+ ```
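The Quick Start command in the README above downloads all files into one directory. The following is a minimal sketch of how a loader might use `codec_browser_onnx_meta.json` to work out which files each graph needs before opening an ONNX Runtime session. The helper names (`required_files`, `missing_files`, `open_session`) are hypothetical and not part of this commit; only the `files` and `external_data_files` keys come from the metadata file added here. It assumes the `.data` files sit next to their `.onnx` graphs, which is where ONNX Runtime looks for external weights when a model is loaded from a path.

```python
from pathlib import Path


def required_files(meta, graph):
    """Return the .onnx file for `graph` followed by its external weight files."""
    onnx_name = meta["files"][graph]
    return [onnx_name, *meta["external_data_files"].get(onnx_name, [])]


def missing_files(root, meta, graph):
    """List the required files that are not present under `root`."""
    root = Path(root)
    return [name for name in required_files(meta, graph) if not (root / name).is_file()]


def open_session(root, meta, graph):
    """Open a CPU session for one graph (hypothetical helper, needs onnxruntime)."""
    import onnxruntime as ort  # imported lazily so the helpers above work without it

    missing = missing_files(root, meta, graph)
    if missing:
        raise FileNotFoundError(f"missing files for {graph!r}: {missing}")
    model_path = Path(root) / meta["files"][graph]
    return ort.InferenceSession(str(model_path), providers=["CPUExecutionProvider"])
```

A caller would typically load the metadata with `json.loads((root / "codec_browser_onnx_meta.json").read_text())` and then call `open_session(root, meta, "encode")`. Input shapes and dtypes are not documented in this commit, so they are best read from the session itself (`session.get_inputs()`) rather than assumed.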
codec_browser_onnx_meta.json ADDED
@@ -0,0 +1,576 @@
1
+ {
2
+ "format_version": 2,
3
+ "checkpoint_path": "MOSS-Audio-Tokenizer-Nano",
4
+ "files": {
5
+ "encode": "moss_audio_tokenizer_encode.onnx",
6
+ "decode_full": "moss_audio_tokenizer_decode_full.onnx",
7
+ "decode_step": "moss_audio_tokenizer_decode_step.onnx"
8
+ },
9
+ "external_data_files": {
10
+ "moss_audio_tokenizer_encode.onnx": [
11
+ "moss_audio_tokenizer_encode.data"
12
+ ],
13
+ "moss_audio_tokenizer_decode_full.onnx": [
14
+ "moss_audio_tokenizer_decode_shared.data"
15
+ ],
16
+ "moss_audio_tokenizer_decode_step.onnx": [
17
+ "moss_audio_tokenizer_decode_shared.data"
18
+ ]
19
+ },
20
+ "codec_config": {
21
+ "sample_rate": 48000,
22
+ "channels": 2,
23
+ "downsample_rate": 3840,
24
+ "num_quantizers": 16
25
+ },
26
+ "onnx": {
27
+ "opset": 17,
28
+ "encode_input_names": [
29
+ "waveform",
30
+ "input_lengths"
31
+ ],
32
+ "encode_output_names": [
33
+ "audio_codes",
34
+ "audio_code_lengths"
35
+ ],
36
+ "decode_input_names": [
37
+ "audio_codes",
38
+ "audio_code_lengths"
39
+ ],
40
+ "decode_output_names": [
41
+ "audio",
42
+ "audio_lengths"
43
+ ],
44
+ "decode_step_input_names": [
45
+ "audio_codes",
46
+ "audio_code_lengths",
47
+ "transformer_offset_0",
48
+ "transformer_offset_1",
49
+ "transformer_offset_2",
50
+ "transformer_offset_3",
51
+ "attn_offset_0",
52
+ "attn_cached_keys_0",
53
+ "attn_cached_values_0",
54
+ "attn_cached_positions_0",
55
+ "attn_offset_1",
56
+ "attn_cached_keys_1",
57
+ "attn_cached_values_1",
58
+ "attn_cached_positions_1",
59
+ "attn_offset_2",
60
+ "attn_cached_keys_2",
61
+ "attn_cached_values_2",
62
+ "attn_cached_positions_2",
63
+ "attn_offset_3",
64
+ "attn_cached_keys_3",
65
+ "attn_cached_values_3",
66
+ "attn_cached_positions_3",
67
+ "attn_offset_4",
68
+ "attn_cached_keys_4",
69
+ "attn_cached_values_4",
70
+ "attn_cached_positions_4",
71
+ "attn_offset_5",
72
+ "attn_cached_keys_5",
73
+ "attn_cached_values_5",
74
+ "attn_cached_positions_5",
75
+ "attn_offset_6",
76
+ "attn_cached_keys_6",
77
+ "attn_cached_values_6",
78
+ "attn_cached_positions_6",
79
+ "attn_offset_7",
80
+ "attn_cached_keys_7",
81
+ "attn_cached_values_7",
82
+ "attn_cached_positions_7",
83
+ "attn_offset_8",
84
+ "attn_cached_keys_8",
85
+ "attn_cached_values_8",
86
+ "attn_cached_positions_8",
87
+ "attn_offset_9",
88
+ "attn_cached_keys_9",
89
+ "attn_cached_values_9",
90
+ "attn_cached_positions_9",
91
+ "attn_offset_10",
92
+ "attn_cached_keys_10",
93
+ "attn_cached_values_10",
94
+ "attn_cached_positions_10",
95
+ "attn_offset_11",
96
+ "attn_cached_keys_11",
97
+ "attn_cached_values_11",
98
+ "attn_cached_positions_11"
99
+ ],
100
+ "decode_step_output_names": [
101
+ "audio",
102
+ "audio_lengths",
103
+ "transformer_offset_out_0",
104
+ "transformer_offset_out_1",
105
+ "transformer_offset_out_2",
106
+ "transformer_offset_out_3",
107
+ "attn_offset_out_0",
108
+ "attn_cached_keys_out_0",
109
+ "attn_cached_values_out_0",
110
+ "attn_cached_positions_out_0",
111
+ "attn_offset_out_1",
112
+ "attn_cached_keys_out_1",
113
+ "attn_cached_values_out_1",
114
+ "attn_cached_positions_out_1",
115
+ "attn_offset_out_2",
116
+ "attn_cached_keys_out_2",
117
+ "attn_cached_values_out_2",
118
+ "attn_cached_positions_out_2",
119
+ "attn_offset_out_3",
120
+ "attn_cached_keys_out_3",
121
+ "attn_cached_values_out_3",
122
+ "attn_cached_positions_out_3",
123
+ "attn_offset_out_4",
124
+ "attn_cached_keys_out_4",
125
+ "attn_cached_values_out_4",
126
+ "attn_cached_positions_out_4",
127
+ "attn_offset_out_5",
128
+ "attn_cached_keys_out_5",
129
+ "attn_cached_values_out_5",
130
+ "attn_cached_positions_out_5",
131
+ "attn_offset_out_6",
132
+ "attn_cached_keys_out_6",
133
+ "attn_cached_values_out_6",
134
+ "attn_cached_positions_out_6",
135
+ "attn_offset_out_7",
136
+ "attn_cached_keys_out_7",
137
+ "attn_cached_values_out_7",
138
+ "attn_cached_positions_out_7",
139
+ "attn_offset_out_8",
140
+ "attn_cached_keys_out_8",
141
+ "attn_cached_values_out_8",
142
+ "attn_cached_positions_out_8",
143
+ "attn_offset_out_9",
144
+ "attn_cached_keys_out_9",
145
+ "attn_cached_values_out_9",
146
+ "attn_cached_positions_out_9",
147
+ "attn_offset_out_10",
148
+ "attn_cached_keys_out_10",
149
+ "attn_cached_values_out_10",
150
+ "attn_cached_positions_out_10",
151
+ "attn_offset_out_11",
152
+ "attn_cached_keys_out_11",
153
+ "attn_cached_values_out_11",
154
+ "attn_cached_positions_out_11"
155
+ ]
156
+ },
157
+ "streaming_decode": {
158
+ "batch_size": 1,
159
+ "transformer_offsets": [
160
+ {
161
+ "index": 0,
162
+ "decoder_index": 1,
163
+ "input_name": "transformer_offset_0",
164
+ "output_name": "transformer_offset_out_0",
165
+ "shape": [
166
+ 1
167
+ ],
168
+ "dtype": "int32"
169
+ },
170
+ {
171
+ "index": 1,
172
+ "decoder_index": 3,
173
+ "input_name": "transformer_offset_1",
174
+ "output_name": "transformer_offset_out_1",
175
+ "shape": [
176
+ 1
177
+ ],
178
+ "dtype": "int32"
179
+ },
180
+ {
181
+ "index": 2,
182
+ "decoder_index": 5,
183
+ "input_name": "transformer_offset_2",
184
+ "output_name": "transformer_offset_out_2",
185
+ "shape": [
186
+ 1
187
+ ],
188
+ "dtype": "int32"
189
+ },
190
+ {
191
+ "index": 3,
192
+ "decoder_index": 7,
193
+ "input_name": "transformer_offset_3",
194
+ "output_name": "transformer_offset_out_3",
195
+ "shape": [
196
+ 1
197
+ ],
198
+ "dtype": "int32"
199
+ }
200
+ ],
201
+ "attention_caches": [
202
+ {
203
+ "index": 0,
204
+ "decoder_index": 1,
205
+ "layer_index": 0,
206
+ "context": 500,
207
+ "num_heads": 4,
208
+ "head_dim": 64,
209
+ "offset_input_name": "attn_offset_0",
210
+ "offset_output_name": "attn_offset_out_0",
211
+ "cached_keys_input_name": "attn_cached_keys_0",
212
+ "cached_keys_output_name": "attn_cached_keys_out_0",
213
+ "cached_values_input_name": "attn_cached_values_0",
214
+ "cached_values_output_name": "attn_cached_values_out_0",
215
+ "cached_positions_input_name": "attn_cached_positions_0",
216
+ "cached_positions_output_name": "attn_cached_positions_out_0",
217
+ "offset_shape": [
218
+ 1
219
+ ],
220
+ "cache_shape": [
221
+ 1,
222
+ 4,
223
+ 500,
224
+ 64
225
+ ],
226
+ "positions_shape": [
227
+ 1,
228
+ 500
229
+ ],
230
+ "cache_dtype": "float32",
231
+ "positions_dtype": "int32"
232
+ },
233
+ {
234
+ "index": 1,
235
+ "decoder_index": 1,
236
+ "layer_index": 1,
237
+ "context": 500,
238
+ "num_heads": 4,
239
+ "head_dim": 64,
240
+ "offset_input_name": "attn_offset_1",
241
+ "offset_output_name": "attn_offset_out_1",
242
+ "cached_keys_input_name": "attn_cached_keys_1",
243
+ "cached_keys_output_name": "attn_cached_keys_out_1",
244
+ "cached_values_input_name": "attn_cached_values_1",
245
+ "cached_values_output_name": "attn_cached_values_out_1",
246
+ "cached_positions_input_name": "attn_cached_positions_1",
247
+ "cached_positions_output_name": "attn_cached_positions_out_1",
248
+ "offset_shape": [
249
+ 1
250
+ ],
251
+ "cache_shape": [
252
+ 1,
253
+ 4,
254
+ 500,
255
+ 64
256
+ ],
257
+ "positions_shape": [
258
+ 1,
259
+ 500
260
+ ],
261
+ "cache_dtype": "float32",
262
+ "positions_dtype": "int32"
263
+ },
264
+ {
265
+ "index": 2,
266
+ "decoder_index": 1,
267
+ "layer_index": 2,
268
+ "context": 500,
269
+ "num_heads": 4,
270
+ "head_dim": 64,
271
+ "offset_input_name": "attn_offset_2",
272
+ "offset_output_name": "attn_offset_out_2",
273
+ "cached_keys_input_name": "attn_cached_keys_2",
274
+ "cached_keys_output_name": "attn_cached_keys_out_2",
275
+ "cached_values_input_name": "attn_cached_values_2",
276
+ "cached_values_output_name": "attn_cached_values_out_2",
277
+ "cached_positions_input_name": "attn_cached_positions_2",
278
+ "cached_positions_output_name": "attn_cached_positions_out_2",
279
+ "offset_shape": [
280
+ 1
281
+ ],
282
+ "cache_shape": [
283
+ 1,
284
+ 4,
285
+ 500,
286
+ 64
287
+ ],
288
+ "positions_shape": [
289
+ 1,
290
+ 500
291
+ ],
292
+ "cache_dtype": "float32",
293
+ "positions_dtype": "int32"
294
+ },
295
+ {
296
+ "index": 3,
297
+ "decoder_index": 1,
298
+ "layer_index": 3,
299
+ "context": 500,
300
+ "num_heads": 4,
301
+ "head_dim": 64,
302
+ "offset_input_name": "attn_offset_3",
303
+ "offset_output_name": "attn_offset_out_3",
304
+ "cached_keys_input_name": "attn_cached_keys_3",
305
+ "cached_keys_output_name": "attn_cached_keys_out_3",
306
+ "cached_values_input_name": "attn_cached_values_3",
307
+ "cached_values_output_name": "attn_cached_values_out_3",
308
+ "cached_positions_input_name": "attn_cached_positions_3",
309
+ "cached_positions_output_name": "attn_cached_positions_out_3",
310
+ "offset_shape": [
311
+ 1
312
+ ],
313
+ "cache_shape": [
314
+ 1,
315
+ 4,
316
+ 500,
317
+ 64
318
+ ],
319
+ "positions_shape": [
320
+ 1,
321
+ 500
322
+ ],
323
+ "cache_dtype": "float32",
324
+ "positions_dtype": "int32"
325
+ },
326
+ {
327
+ "index": 4,
328
+ "decoder_index": 3,
329
+ "layer_index": 0,
330
+ "context": 800,
331
+ "num_heads": 4,
332
+ "head_dim": 64,
333
+ "offset_input_name": "attn_offset_4",
334
+ "offset_output_name": "attn_offset_out_4",
335
+ "cached_keys_input_name": "attn_cached_keys_4",
336
+ "cached_keys_output_name": "attn_cached_keys_out_4",
337
+ "cached_values_input_name": "attn_cached_values_4",
338
+ "cached_values_output_name": "attn_cached_values_out_4",
339
+ "cached_positions_input_name": "attn_cached_positions_4",
340
+ "cached_positions_output_name": "attn_cached_positions_out_4",
341
+ "offset_shape": [
342
+ 1
343
+ ],
344
+ "cache_shape": [
345
+ 1,
346
+ 4,
347
+ 800,
348
+ 64
349
+ ],
350
+ "positions_shape": [
351
+ 1,
352
+ 800
353
+ ],
354
+ "cache_dtype": "float32",
355
+ "positions_dtype": "int32"
356
+ },
357
+ {
358
+ "index": 5,
359
+ "decoder_index": 3,
360
+ "layer_index": 1,
361
+ "context": 800,
362
+ "num_heads": 4,
363
+ "head_dim": 64,
364
+ "offset_input_name": "attn_offset_5",
365
+ "offset_output_name": "attn_offset_out_5",
366
+ "cached_keys_input_name": "attn_cached_keys_5",
367
+ "cached_keys_output_name": "attn_cached_keys_out_5",
368
+ "cached_values_input_name": "attn_cached_values_5",
369
+ "cached_values_output_name": "attn_cached_values_out_5",
370
+ "cached_positions_input_name": "attn_cached_positions_5",
371
+ "cached_positions_output_name": "attn_cached_positions_out_5",
372
+ "offset_shape": [
373
+ 1
374
+ ],
375
+ "cache_shape": [
376
+ 1,
377
+ 4,
378
+ 800,
379
+ 64
380
+ ],
381
+ "positions_shape": [
382
+ 1,
383
+ 800
384
+ ],
385
+ "cache_dtype": "float32",
386
+ "positions_dtype": "int32"
387
+ },
388
+ {
389
+ "index": 6,
390
+ "decoder_index": 5,
391
+ "layer_index": 0,
392
+ "context": 1200,
393
+ "num_heads": 4,
394
+ "head_dim": 64,
395
+ "offset_input_name": "attn_offset_6",
396
+ "offset_output_name": "attn_offset_out_6",
397
+ "cached_keys_input_name": "attn_cached_keys_6",
398
+ "cached_keys_output_name": "attn_cached_keys_out_6",
399
+ "cached_values_input_name": "attn_cached_values_6",
400
+ "cached_values_output_name": "attn_cached_values_out_6",
401
+ "cached_positions_input_name": "attn_cached_positions_6",
402
+ "cached_positions_output_name": "attn_cached_positions_out_6",
403
+ "offset_shape": [
404
+ 1
405
+ ],
406
+ "cache_shape": [
407
+ 1,
408
+ 4,
409
+ 1200,
410
+ 64
411
+ ],
412
+ "positions_shape": [
413
+ 1,
414
+ 1200
415
+ ],
416
+ "cache_dtype": "float32",
417
+ "positions_dtype": "int32"
418
+ },
419
+ {
420
+ "index": 7,
421
+ "decoder_index": 5,
422
+ "layer_index": 1,
423
+ "context": 1200,
424
+ "num_heads": 4,
425
+ "head_dim": 64,
426
+ "offset_input_name": "attn_offset_7",
427
+ "offset_output_name": "attn_offset_out_7",
428
+ "cached_keys_input_name": "attn_cached_keys_7",
429
+ "cached_keys_output_name": "attn_cached_keys_out_7",
430
+ "cached_values_input_name": "attn_cached_values_7",
431
+ "cached_values_output_name": "attn_cached_values_out_7",
432
+ "cached_positions_input_name": "attn_cached_positions_7",
433
+ "cached_positions_output_name": "attn_cached_positions_out_7",
434
+ "offset_shape": [
435
+ 1
436
+ ],
437
+ "cache_shape": [
438
+ 1,
439
+ 4,
440
+ 1200,
441
+ 64
442
+ ],
443
+ "positions_shape": [
444
+ 1,
445
+ 1200
446
+ ],
447
+ "cache_dtype": "float32",
448
+ "positions_dtype": "int32"
449
+ },
450
+ {
451
+ "index": 8,
452
+ "decoder_index": 7,
453
+ "layer_index": 0,
454
+ "context": 1600,
455
+ "num_heads": 4,
456
+ "head_dim": 64,
457
+ "offset_input_name": "attn_offset_8",
458
+ "offset_output_name": "attn_offset_out_8",
459
+ "cached_keys_input_name": "attn_cached_keys_8",
460
+ "cached_keys_output_name": "attn_cached_keys_out_8",
461
+ "cached_values_input_name": "attn_cached_values_8",
462
+ "cached_values_output_name": "attn_cached_values_out_8",
463
+ "cached_positions_input_name": "attn_cached_positions_8",
464
+ "cached_positions_output_name": "attn_cached_positions_out_8",
465
+ "offset_shape": [
466
+ 1
467
+ ],
468
+ "cache_shape": [
469
+ 1,
470
+ 4,
471
+ 1600,
472
+ 64
473
+ ],
474
+ "positions_shape": [
475
+ 1,
476
+ 1600
477
+ ],
478
+ "cache_dtype": "float32",
479
+ "positions_dtype": "int32"
480
+ },
481
+ {
482
+ "index": 9,
483
+ "decoder_index": 7,
484
+ "layer_index": 1,
485
+ "context": 1600,
486
+ "num_heads": 4,
487
+ "head_dim": 64,
488
+ "offset_input_name": "attn_offset_9",
489
+ "offset_output_name": "attn_offset_out_9",
490
+ "cached_keys_input_name": "attn_cached_keys_9",
491
+ "cached_keys_output_name": "attn_cached_keys_out_9",
492
+ "cached_values_input_name": "attn_cached_values_9",
493
+ "cached_values_output_name": "attn_cached_values_out_9",
494
+ "cached_positions_input_name": "attn_cached_positions_9",
495
+ "cached_positions_output_name": "attn_cached_positions_out_9",
496
+ "offset_shape": [
497
+ 1
498
+ ],
499
+ "cache_shape": [
500
+ 1,
501
+ 4,
502
+ 1600,
503
+ 64
504
+ ],
505
+ "positions_shape": [
506
+ 1,
507
+ 1600
508
+ ],
509
+ "cache_dtype": "float32",
510
+ "positions_dtype": "int32"
511
+ },
512
+ {
513
+ "index": 10,
514
+ "decoder_index": 7,
515
+ "layer_index": 2,
516
+ "context": 1600,
517
+ "num_heads": 4,
518
+ "head_dim": 64,
519
+ "offset_input_name": "attn_offset_10",
520
+ "offset_output_name": "attn_offset_out_10",
521
+ "cached_keys_input_name": "attn_cached_keys_10",
522
+ "cached_keys_output_name": "attn_cached_keys_out_10",
523
+ "cached_values_input_name": "attn_cached_values_10",
524
+ "cached_values_output_name": "attn_cached_values_out_10",
525
+ "cached_positions_input_name": "attn_cached_positions_10",
526
+ "cached_positions_output_name": "attn_cached_positions_out_10",
527
+ "offset_shape": [
528
+ 1
529
+ ],
530
+ "cache_shape": [
531
+ 1,
532
+ 4,
533
+ 1600,
534
+ 64
535
+ ],
536
+ "positions_shape": [
537
+ 1,
538
+ 1600
539
+ ],
540
+ "cache_dtype": "float32",
541
+ "positions_dtype": "int32"
542
+ },
543
+ {
544
+ "index": 11,
545
+ "decoder_index": 7,
546
+ "layer_index": 3,
547
+ "context": 1600,
548
+ "num_heads": 4,
549
+ "head_dim": 64,
550
+ "offset_input_name": "attn_offset_11",
551
+ "offset_output_name": "attn_offset_out_11",
552
+ "cached_keys_input_name": "attn_cached_keys_11",
553
+ "cached_keys_output_name": "attn_cached_keys_out_11",
554
+ "cached_values_input_name": "attn_cached_values_11",
555
+ "cached_values_output_name": "attn_cached_values_out_11",
556
+ "cached_positions_input_name": "attn_cached_positions_11",
557
+ "cached_positions_output_name": "attn_cached_positions_out_11",
558
+ "offset_shape": [
559
+ 1
560
+ ],
561
+ "cache_shape": [
562
+ 1,
563
+ 4,
564
+ 1600,
565
+ 64
566
+ ],
567
+ "positions_shape": [
568
+ 1,
569
+ 1600
570
+ ],
571
+ "cache_dtype": "float32",
572
+ "positions_dtype": "int32"
573
+ }
574
+ ]
575
+ }
576
+ }
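The metadata above is enough to check two numbers by hand: the token rate quoted in the README, and the size of the state that a streaming decoder has to carry between `decode_step` calls. The sketch below hard-codes values copied from the JSON (`sample_rate`, `downsample_rate`, `num_quantizers`, and the per-cache `context`, `num_heads`, `head_dim`); the function names are illustrative, and the byte count assumes the state consists of exactly the tensors listed under `streaming_decode`, with the dtypes given there.

```python
# Values copied from codec_browser_onnx_meta.json.
SAMPLE_RATE = 48_000
DOWNSAMPLE_RATE = 3_840
NUM_QUANTIZERS = 16
NUM_TRANSFORMER_OFFSETS = 4  # transformer_offset_0..3, each int32 with shape [1]

# (context, num_heads, head_dim) for attention caches 0..11, in index order.
ATTENTION_CACHES = (
    [(500, 4, 64)] * 4 + [(800, 4, 64)] * 2 + [(1200, 4, 64)] * 2 + [(1600, 4, 64)] * 4
)

FLOAT32_BYTES = 4
INT32_BYTES = 4


def frame_rate_hz(sample_rate, downsample_rate):
    """Code frames per second: one frame per `downsample_rate` input samples."""
    return sample_rate / downsample_rate


def streaming_state_bytes(caches, num_transformer_offsets, batch_size=1):
    """Total bytes of the streaming decode state for the listed tensors."""
    total = num_transformer_offsets * INT32_BYTES
    for context, num_heads, head_dim in caches:
        # cached keys + cached values: [batch, heads, context, head_dim], float32
        kv = 2 * batch_size * num_heads * context * head_dim * FLOAT32_BYTES
        # cached positions: [batch, context], int32
        positions = batch_size * context * INT32_BYTES
        # attention offset: shape [1], int32
        total += kv + positions + INT32_BYTES
    return total


rate = frame_rate_hz(SAMPLE_RATE, DOWNSAMPLE_RATE)
state = streaming_state_bytes(ATTENTION_CACHES, NUM_TRANSFORMER_OFFSETS)
print(f"frame rate: {rate} Hz, codes per second: {rate * NUM_QUANTIZERS}")
print(f"streaming state: {state} bytes ({state / 2**20:.1f} MiB)")
```

With these values the frame rate works out to 48000 / 3840 = 12.5 Hz, matching the README, and the streaming state is 25,444,864 bytes (about 24.3 MiB) at batch size 1. Almost all of that is the key and value caches, so a browser integration would likely want to allocate them once and reuse the buffers across steps.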
moss_audio_tokenizer_decode_full.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0fbbafe3fd4afa2a019af5c5ced204af6e2d1db044fa40f021525d2aee95b4ac
3
+ size 681902
moss_audio_tokenizer_decode_shared.data ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e69d52e0f4e84ca27850557ee54face46632d3a5a16c89bd246c7c408466dcad
3
+ size 44198912
moss_audio_tokenizer_decode_step.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9527c86a29e1837edec1f74db57d5eeaadb3a715af3382703566460afed25855
3
+ size 351400
moss_audio_tokenizer_encode.data ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:aa751265b2bab2887eac224484546b194875aa7494b607115439b3dc6b228a2c
3
+ size 44507136
moss_audio_tokenizer_encode.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eadea4a645abdcf98714c7aead122ee2ce7da6e080f9f80b977cd1ca8e19473a
3
+ size 815775
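Each `.onnx` and `.data` entry above is a Git LFS pointer: three lines giving the spec version, a SHA-256 object ID, and the size in bytes of the real file. As a rough sketch, a downloaded file can be checked against its pointer as follows. The function names are hypothetical, and the example pointer text is the `moss_audio_tokenizer_decode_step.onnx` entry from this commit with the diff `+` markers removed.

```python
import hashlib
from pathlib import Path

# Pointer text for moss_audio_tokenizer_decode_step.onnx, copied from the diff above.
DECODE_STEP_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:9527c86a29e1837edec1f74db57d5eeaadb3a715af3382703566460afed25855
size 351400
"""


def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into a small dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and hash the pointer records."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False  # cheap check first; avoids hashing a truncated download
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]
```

For example, `matches_pointer("weights/MOSS-Audio-Tokenizer-Nano-ONNX/moss_audio_tokenizer_decode_step.onnx", parse_lfs_pointer(DECODE_STEP_POINTER))` should return `True` for an intact download, assuming the local directory from the Quick Start command.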