schwarztgyt committed on
Commit
38f484f
·
1 Parent(s): e9ada85
Files changed (6)
  1. .gitattributes +1 -0
  2. .gitignore +3 -0
  3. LICENSE +201 -0
  4. README.md +71 -17
  5. demo/demo_gt.wav +3 -0
  6. dev/huggingface_compliance_audit.md +215 -0
.gitattributes CHANGED
@@ -43,5 +43,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
43
  *.pkl filter=lfs diff=lfs merge=lfs -text
44
  *.tar filter=lfs diff=lfs merge=lfs -text
45
  *.wasm filter=lfs diff=lfs merge=lfs -text
 
46
  *.zst filter=lfs diff=lfs merge=lfs -text
47
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
43
  *.pkl filter=lfs diff=lfs merge=lfs -text
44
  *.tar filter=lfs diff=lfs merge=lfs -text
45
  *.wasm filter=lfs diff=lfs merge=lfs -text
46
+ *.wav filter=lfs diff=lfs merge=lfs -text
47
  *.zst filter=lfs diff=lfs merge=lfs -text
48
  *tfevents* filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,3 @@
1
+ dev/*
2
+ !dev/huggingface_compliance_audit.md
3
+ demo/demo_rec*.wav
LICENSE ADDED
@@ -0,0 +1,201 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright [yyyy] [name of copyright owner]
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
README.md CHANGED
@@ -9,6 +9,7 @@ tags:
9
  - MOSS Audio Tokenizer
10
  - speech-tokenizer
11
  - trust-remote-code
 
12
  ---
13
 
14
  # Moss-Audio-Tokenizer-V2
@@ -30,8 +31,58 @@ This is the code for the 48khz stereo version of MOSS-Audio-Tokenizer presented
30
  By combining a simple, scalable architecture with massive-scale data, the Cat architecture overcomes the bottlenecks of traditional audio tokenizers. It provides a robust, high-fidelity, and semantically grounded interface for the next generation of native audio foundation models.
31
 
32
  This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers
33
- `transformers.models.moss_audio_tokenizer` module. It is intended to be uploaded to a Hugging Face Hub model repository
34
- and loaded with `trust_remote_code=True` when needed.
35
 
36
  ## Usage
37
 
@@ -45,7 +96,8 @@ import torchaudio
45
  repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer-V2"
46
  model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
47
 
48
- wav, sr = torchaudio.load('demo/demo_gt.wav')
 
49
  if sr != model.sampling_rate:
50
  wav = torchaudio.functional.resample(wav, sr, model.sampling_rate)
51
  if wav.shape[0] == 1:
@@ -66,6 +118,8 @@ wav_rvq8 = dec_rvq8.audio.squeeze(0)
66
  torchaudio.save("demo/demo_rec_rvq8.wav", wav_rvq8, sample_rate=model.sampling_rate)
67
  ```
68
 
 
 
69
  ### Attention Backend And Compute Dtype
70
 
71
  `config.attention_implementation` controls whether transformer layers prefer `sdpa` or `flash_attention_2`.
@@ -114,25 +168,25 @@ batch_dec = model.batch_decode(codes_list, chunk_duration=0.08)
114
  - `modeling_moss_audio_tokenizer.py`
115
  - `__init__.py`
116
  - `config.json`
117
- - model weights
118
-
119
-
 
120
 
121
  ## Citation
122
  If you use this code or result in your paper, please cite our work as:
123
  ```tex
124
- @misc{gong2026mossaudiotokenizerscaling,
125
- title={MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models},
126
- author={Yitian Gong and Kuangwei Chen and Zhaoye Fei and Xiaogui Yang and Ke Chen and Yang Wang and Kexin Huang and Mingshu Chen and Ruixiao Li and Qingyuan Cheng and Shimin Li and Xipeng Qiu},
127
- year={2026},
128
- eprint={2602.10934},
129
- archivePrefix={arXiv},
130
- primaryClass={cs.SD},
131
- url={https://arxiv.org/abs/2602.10934}
132
  }
133
  ```
134
 
135
- ## License
136
- <!-- TODO: check and add license -->
137
- MOSS-Audio-Tokenizer-V2 is released under the Apache 2.0 license.
138
 
 
9
  - MOSS Audio Tokenizer
10
  - speech-tokenizer
11
  - trust-remote-code
12
+ - arxiv:2602.10934
13
  ---
14
 
15
  # Moss-Audio-Tokenizer-V2
 
31
  By combining a simple, scalable architecture with massive-scale data, the Cat architecture overcomes the bottlenecks of traditional audio tokenizers. It provides a robust, high-fidelity, and semantically grounded interface for the next generation of native audio foundation models.
32
 
33
  This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers
34
+ `transformers.models.moss_audio_tokenizer` module. It is hosted as a Hugging Face Hub model repository and should be
35
+ loaded with `trust_remote_code=True`.
36
+
37
+ ## Model Details
38
+
39
+ - **Architecture:** Cat (Causal Audio Tokenizer with Transformer), a CNN-free neural audio codec/tokenizer.
40
+ - **Sampling rate:** 48 kHz.
41
+ - **Channels:** stereo public waveform interface.
42
+ - **Token frame rate:** 12.5 Hz.
43
+ - **Quantization:** 32-layer residual vector quantization stack.
44
+ - **Checkpoint size:** the safetensors index reports 2,123,701,248 total parameters.
45
+ - **Weight format:** sharded `safetensors` weights with a `model.safetensors.index.json` index.
46
+
47
+ ## Intended Use
48
+
49
+ MOSS-Audio-Tokenizer-V2 is intended for research and development on audio tokenization, neural codec reconstruction,
50
+ native audio foundation models, speech/audio understanding, speech generation, and related downstream modeling. It can
51
+ encode 48 kHz stereo waveforms into discrete audio codes and decode those codes back to waveforms.
52
+
53
+ This model is not intended for use in applications that impersonate a real person, reproduce private or copyrighted
54
+ audio without permission, or make high-stakes decisions from reconstructed audio without additional validation.
55
+
56
+ ## Training Data And Procedure
57
+
58
+ The model was trained from scratch on 3 million hours of diverse audio data, covering speech, sound effects, and music,
59
+ as described in the accompanying paper. The training pipeline jointly optimizes the encoder, quantizer, decoder,
60
+ discriminator, and a decoder-only LLM used for semantic alignment.
61
+
62
+ The full training data mixture is not included in this repository. For details on dataset composition, filtering, and
63
+ training/evaluation methodology, refer to the paper.
64
+
65
+ ## Evaluation
66
+
67
+ The model is designed to provide high-fidelity reconstruction and semantically rich discrete representations across
68
+ speech, sound effects, and music. Please refer to the paper for the full benchmark setup and quantitative results.
69
+
70
+ ## Limitations
71
+
72
+ - Audio outside the 48 kHz stereo setting may require resampling and channel conversion before inference.
73
+ - Reconstruction quality depends on audio domain, signal quality, selected number of RVQ layers, and inference settings.
74
+ - The repository uses custom Transformers remote code, so users should review the code and pin a trusted revision in
75
+ production deployments.
76
+ - `flash_attention_2` is optional; if it is unavailable, use the default `sdpa` attention implementation.
77
+
78
+ ## Requirements
79
+
80
+ - Python 3.10 or newer.
81
+ - PyTorch.
82
+ - Transformers. This checkpoint was prepared with `transformers_version` set to `4.56.0.dev0`; use a recent Transformers
83
+ build that supports custom remote-code models.
84
+ - `torchaudio` for the examples below.
85
+ - Optional: `flash-attn` if using `model.set_attention_implementation("flash_attention_2")`.
86
 
87
  ## Usage
88
 
 
96
  repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer-V2"
97
  model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
98
 
99
+ audio_path = "demo/demo_gt.wav" # replace with your own 48 kHz stereo audio path if needed
100
+ wav, sr = torchaudio.load(audio_path)
101
  if sr != model.sampling_rate:
102
  wav = torchaudio.functional.resample(wav, sr, model.sampling_rate)
103
  if wav.shape[0] == 1:
 
118
  torchaudio.save("demo/demo_rec_rvq8.wav", wav_rvq8, sample_rate=model.sampling_rate)
119
  ```
120
 
121
+ For production use with `trust_remote_code=True`, pin `revision` to a reviewed commit hash.
122
+
123
  ### Attention Backend And Compute Dtype
124
 
125
  `config.attention_implementation` controls whether transformer layers prefer `sdpa` or `flash_attention_2`.
 
168
  - `modeling_moss_audio_tokenizer.py`
169
  - `__init__.py`
170
  - `config.json`
171
+ - `model.safetensors.index.json`
172
+ - sharded model weights: `model-00001-of-00003.safetensors`, `model-00002-of-00003.safetensors`,
173
+ `model-00003-of-00003.safetensors`
174
+ - `demo/demo_gt.wav`
175
 
176
  ## Citation
177
  If you use this code or result in your paper, please cite our work as:
178
  ```tex
179
+ @misc{gong2026mossaudiotokenizerscaling,
180
+ title={MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models},
181
+ author={Yitian Gong and Kuangwei Chen and Zhaoye Fei and Xiaogui Yang and Ke Chen and Yang Wang and Kexin Huang and Mingshu Chen and Ruixiao Li and Qingyuan Cheng and Shimin Li and Xipeng Qiu},
182
+ year={2026},
183
+ eprint={2602.10934},
184
+ archivePrefix={arXiv},
185
+ primaryClass={cs.SD},
186
+ url={https://arxiv.org/abs/2602.10934}
187
  }
188
  ```
189
 
190
+ ## License
191
+ MOSS-Audio-Tokenizer-V2 is released under the Apache 2.0 license. See `LICENSE` for the full license text.
 
192
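The README's advice to pin `revision` when loading with `trust_remote_code=True` can be sketched as follows. This is a minimal illustration, not the repository's own tooling: the helper `pinned_load_kwargs` and the placeholder hash are hypothetical, and the only library behavior assumed is that `from_pretrained` accepts `revision` and `trust_remote_code` keyword arguments.

```python
import re


def pinned_load_kwargs(revision: str) -> dict:
    """Build from_pretrained keyword arguments that pin a full commit hash.

    Hypothetical helper: it rejects branch names and abbreviated hashes so
    that the remote code being executed cannot change silently.
    """
    if not re.fullmatch(r"[0-9a-f]{40}", revision):
        raise ValueError("expected a full 40-character commit hash")
    return {"trust_remote_code": True, "revision": revision}


# Placeholder only; substitute the commit hash you have actually reviewed.
REVIEWED_COMMIT = "0" * 40
kwargs = pinned_load_kwargs(REVIEWED_COMMIT)
print(kwargs)

# Usage (needs network access and the transformers package, so it is left commented out):
# from transformers import AutoModel
# model = AutoModel.from_pretrained("OpenMOSS-Team/MOSS-Audio-Tokenizer-V2", **kwargs).eval()
```

Requiring the full hash rather than a branch or tag is a design choice in this sketch: a branch can move to new, unreviewed code, whereas a commit hash cannot.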
 
demo/demo_gt.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:631608f5c8b931ece1d45adc7f40a3b3b0ae2ec056a8a08a3565b04cc5750a4b
3
+ size 243244
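The three added lines above are a Git LFS pointer, not the audio itself. As a rough sketch of how such a pointer can be read, the snippet below parses the key/value lines into a dictionary. The function name `parse_lfs_pointer` is hypothetical, and the parser assumes the simple `key value` layout shown in this diff.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


# Pointer contents copied from the diff above.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:631608f5c8b931ece1d45adc7f40a3b3b0ae2ec056a8a08a3565b04cc5750a4b
size 243244
"""

info = parse_lfs_pointer(pointer_text)
print(info["hash_algo"], len(info["digest"]), info["size"])
```

The `size` field is the size of the tracked file, which is why the pointer blob itself stays tiny regardless of how large the audio or weight file is.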
dev/huggingface_compliance_audit.md ADDED
@@ -0,0 +1,215 @@
1
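+ # Hugging Face Compliance Audit Summary
2
+
3
+ Audit date: 2026-06-05 UTC
4
+
5
+ Audit target: the local repository `MOSS-Audio-Tokenizer-V2`, whose remote is `https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer-V2`.
6
+
7
+ Conclusion: the local repository already has the standard layout of a Hugging Face Transformers custom-model repository. The weights are stored as sharded safetensors tracked by Git LFS, and both the `auto_map` in `config.json` and the remote-code loading entry points pass local verification. No issue was found that would directly block uploading or the dynamic import of `AutoConfig` or the model class. A round of remediation was completed on 2026-06-05 and addressed the main release-quality issues; the remaining items are mainly verification of the remote page state, optional metadata, and more complete structured evaluation information.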
8
+
9
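+ ## Remediation Status
10
+
11
+ Completed:
12
+
13
+ - Added a root-level `LICENSE` file containing the full Apache-2.0 text.
14
+ - Removed the TODO comment from the license section of the README.
15
+ - Added Model Details, Intended Use, Training Data And Procedure, Evaluation, Limitations, and Requirements sections to the README.
16
+ - Kept `demo/demo_gt.wav` in the README example and noted that it can be replaced with the user's own audio; `demo/demo_gt.wav` now exists locally.
17
+ - Added a README recommendation to pin a commit revision when using `trust_remote_code=True` in production.
18
+ - Extended the README repository layout with the sharded weights, the index, and the demo audio.
19
+ - Added `!dev/huggingface_compliance_audit.md` to `.gitignore` so that this audit document can be tracked by Git, and ignored the `demo/demo_rec*.wav` files generated by the quickstart.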
20
+
21
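+ ## Audit Basis
22
+
23
+ This audit follows these areas of the official Hugging Face documentation:
24
+
25
+ - Model Cards: `README.md` is the model card and should contain YAML metadata plus a descriptive body; the card should describe the model, its uses and limitations, training information, data, evaluation results, and so on. Reference: <https://huggingface.co/docs/hub/model-cards>
26
+ - Model Card metadata: setting `library_name` explicitly is recommended; `pipeline_tag`, `license`, `datasets`, `model-index`, and similar fields can be added to improve discoverability. Reference: <https://huggingface.co/docs/hub/model-cards>
27
+ - Custom Transformers model: a custom model needs a `model_type` on the config class, a `config_class` on the model class, and an `auto_map`, and is loaded with `trust_remote_code=True`. Reference: <https://huggingface.co/docs/transformers/custom_models>
28
+ - Large files and repository structure: the recommendations are fewer than 100k files, fewer than 10k entries per directory, shards smaller than 200GB per file, and preferably fewer than 100 large-file operations per commit. Reference: <https://huggingface.co/docs/hub/en/storage-limits>
29
+ - Safe weight formats: safetensors is safer than pickle; pickle weights carry a risk of arbitrary code execution. References: <https://huggingface.co/docs/safetensors/en/index>, <https://huggingface.co/docs/hub/security-pickle>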
30
+
31
+ ## Local Evidence
32
+
33
+ ### Repository Structure
34
+
35
+ The files currently tracked according to `git ls-files` are:
36
+
37
+ - `.gitattributes`
38
+ - `README.md`
39
+ - `__init__.py`
40
+ - `config.json`
41
+ - `configuration_moss_audio_tokenizer.py`
42
+ - `modeling_moss_audio_tokenizer.py`
43
+ - `model-00001-of-00003.safetensors`
44
+ - `model-00002-of-00003.safetensors`
45
+ - `model-00003-of-00003.safetensors`
46
+ - `model.safetensors.index.json`
47
+
48
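+ Structure assessment: this matches the basic structure of a Hugging Face model repo. The custom Transformers remote-code files, the config, the model weights, and the weight index are all in the root directory, so users can load the model with `AutoModel.from_pretrained(repo_id, trust_remote_code=True)`.
49
+
50
+ Note: `.gitignore` still contains `dev/*`, but `!dev/huggingface_compliance_audit.md` has been added, so this file can be tracked by Git. `.gitignore` also ignores the `demo/demo_rec*.wav` files generated by the quickstart.
51
+
52
+ ### Model Card README
53
+
54
+ Compliant:
55
+
56
+ - `README.md` has YAML metadata at the top.
57
+ - `license: apache-2.0` is declared.
58
+ - `library_name: transformers` is declared. This matters because, for model repositories created after 2024-08, Hugging Face no longer always infers Transformers from `config.json`.
59
+ - Tags relevant to an audio tokenizer are provided: `audio`, `audio-tokenizer`, `neural-codec`, `moss-tts-family`, `speech-tokenizer`, `trust-remote-code`, and others.
60
+ - The README includes a quickstart, a streaming usage example, RVQ layer-count control, a citation, and a license statement.
61
+ - The README examples explicitly use `trust_remote_code=True`, consistent with the custom-model requirements.
62
+
63
+ Still to improve or verify:
64
+
65
+ - The YAML metadata still has no `pipeline_tag`. An audio tokenizer does not necessarily have an exactly matching official pipeline; if the Hugging Face metadata UI accepts it, `feature-extraction` could be considered, otherwise keep the existing custom tags.
66
+ - The YAML metadata has no `datasets` field. The README now describes the training data; if the training data has a public Hub dataset id, `datasets` metadata can be added as well.
67
+ - There is no `model-index` or structured eval metadata. If the paper reports reconstruction-quality or downstream ASR/TTS metrics, adding a table to the body is recommended; add `model-index` only if the metrics can be structured.
68
+ - Whether the metadata renders correctly on the remote page still needs confirmation from an account with access.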
69
+
70
+ ### config and AutoClass
71
+
72
+ Compliant:
73
+
74
+ - `config.json` contains:
75
+ - `model_type: "moss-audio-tokenizer"`
76
+ - `architectures: ["MossAudioTokenizerModel"]`
77
+ - `auto_map.AutoConfig: "configuration_moss_audio_tokenizer.MossAudioTokenizerConfig"`
78
+ - `auto_map.AutoModel: "modeling_moss_audio_tokenizer.MossAudioTokenizerModel"`
79
+ - `MossAudioTokenizerConfig.model_type` in `configuration_moss_audio_tokenizer.py` matches `config.json`.
80
+ - In `modeling_moss_audio_tokenizer.py`, `MossAudioTokenizerPreTrainedModel.config_class = MossAudioTokenizerConfig`.
81
+ - `modeling_moss_audio_tokenizer.py` sets `main_input_name = "input_values"`, `input_modalities = "audio"`, and `_no_split_modules`, which is a positive signal for Transformers loading and device placement.
82
+
83
+ Local verification passed:
84
+
85
+ ```bash
86
+ python -c "from transformers import AutoConfig; c=AutoConfig.from_pretrained('.', trust_remote_code=True); print(type(c).__name__, c.model_type, c.architectures, c.auto_map)"
87
+ ```
88
+
89
+ Key output:
90
+
91
+ ```text
92
+ MossAudioTokenizerConfig moss-audio-tokenizer ['MossAudioTokenizerModel'] {'AutoConfig': 'configuration_moss_audio_tokenizer.MossAudioTokenizerConfig', 'AutoModel': 'modeling_moss_audio_tokenizer.MossAudioTokenizerModel'}
93
+ ```
94
+
95
+ Dynamic import of the model class also passes:
96
+
97
+ ```bash
98
+ python -c "from transformers.dynamic_module_utils import get_class_from_dynamic_module; cls=get_class_from_dynamic_module('modeling_moss_audio_tokenizer.MossAudioTokenizerModel', '.'); print(cls.__name__, cls.config_class.__name__)"
99
+ ```
100
+
101
+ Key output:
102
+
103
+ ```text
104
+ MossAudioTokenizerModel MossAudioTokenizerConfig
105
+ ```
106
+
107
+ Current status:
108
+
109
+ - The README now has a "Requirements" section covering Python, PyTorch, Transformers, `torchaudio`, and the optional `flash-attn`.
110
+ - The README states that `transformers_version` in `config.json` is `4.56.0.dev0` and recommends a recent Transformers build that supports custom remote-code models.
111
+ - The README now recommends pinning a reviewed commit hash when using `trust_remote_code=True` in production.
112
+
113
+ ### Weights, LFS, and Repository Size
114
+
115
+ Compliant:
116
+
117
+ - `.gitattributes` sets `filter=lfs diff=lfs merge=lfs -text` for `*.safetensors`.
118
+ - `git lfs ls-files` lists three safetensors shards:
119
+ - `model-00001-of-00003.safetensors`
120
+ - `model-00002-of-00003.safetensors`
121
+ - `model-00003-of-00003.safetensors`
122
+ - `git cat-file -s HEAD:model-00001-of-00003.safetensors` reports 135 bytes, which shows that the Git object is an LFS pointer rather than the 3.9GB weight file committed as a regular Git blob.
123
+ - The first LFS pointer records the real size as `3978639168` bytes.
124
+ - The Git blob sizes of the other two shards are 135 and 134 bytes, also as expected for LFS pointers.
125
+ - `model.safetensors.index.json` metadata:
126
+ - `total_parameters: 2123701248`
127
+ - `total_size: 8494804992`
128
+ - `weight_map` entry count: 2094
129
+ - `weight_map` distribution across shards:
130
+ - `model-00001-of-00003.safetensors`: 898 entries
131
+ - `model-00002-of-00003.safetensors`: 1010 entries
132
+ - `model-00003-of-00003.safetensors`: 186 entries
133
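+ - The three shards are about 3.98GB, 3.99GB, and 0.52GB, far below Hugging Face's recommended 200GB per-shard line for large files and also below the 500GB hard limit for a single file.
134
+ - The repository tracks only 10 files, far below the 100k-file recommendation, and no directory approaches 10k entries.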
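The index checks listed above (entry count, per-shard distribution, and totals) can be reproduced with a few lines of Python. The sketch below is illustrative only: `summarize_index` is a hypothetical helper, the tensor names in the toy index are made up, and only the two metadata totals are taken from the audit. The resulting ratio of 4 bytes per parameter is consistent with 32-bit storage, but that is an inference from the numbers rather than something the audit states.

```python
from collections import Counter


def summarize_index(index: dict) -> dict:
    """Summarize a safetensors index: entry count, entries per shard, bytes per parameter."""
    weight_map = index["weight_map"]
    per_shard = Counter(weight_map.values())  # shard file name -> number of tensors
    meta = index.get("metadata", {})
    bytes_per_param = None
    if meta.get("total_parameters"):
        bytes_per_param = meta["total_size"] / meta["total_parameters"]
    return {
        "entries": len(weight_map),
        "per_shard": dict(per_shard),
        "bytes_per_param": bytes_per_param,
    }


# Toy index: tensor names are invented; the metadata values are those quoted in the audit.
toy_index = {
    "metadata": {"total_parameters": 2123701248, "total_size": 8494804992},
    "weight_map": {
        "encoder.layer0.weight": "model-00001-of-00003.safetensors",
        "encoder.layer1.weight": "model-00001-of-00003.safetensors",
        "decoder.layer0.weight": "model-00003-of-00003.safetensors",
    },
}

summary = summarize_index(toy_index)
print(summary)
```

To run it against the real file, load `model.safetensors.index.json` with `json.load` and pass the result to the same function; the per-shard counts reported in the audit (898, 1010, and 186) sum to the stated 2094 entries.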
135
+
136
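+ Needs improvement or verification:
137
+
138
+ - Only the LFS pointers on the current local branch can be confirmed. Whether the remote repository history contains old large LFS versions, uncleaned PR refs, or duplicate uploads must be checked under "List LFS files" in the Hugging Face repo Settings or through an API with access.
139
+ - If the weights are updated later, keep the shard count low and limit each commit to no more than 50-100 large-file operations.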
140
+
141
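+ ### Security
142
+
143
+ Compliant:
144
+
145
+ - The weights use safetensors; no pickle-risk weight files such as `.bin`, `.pt`, `.pth`, `.pkl`, or `.pickle` were found.
146
+ - A scan of the custom code's imports found no obvious high-risk patterns such as `subprocess`, `requests`, `urllib`, `socket`, `pickle`, `torch.load`, `eval(`, or `exec(`.
147
+ - `flash_attn` is an optional dependency wrapped in try/except; when it is missing the code falls back and the basic import is not blocked.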
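The audit does not say how the import scan was performed. One possible approach, shown here purely as an assumption, is to walk the Python AST and flag imports of the listed modules and calls to `eval`, `exec`, and `torch.load`. The names `RISKY_MODULES`, `RISKY_CALLS`, and `scan_source` are hypothetical.

```python
import ast

# Module and call names taken from the audit's list of high-risk patterns.
RISKY_MODULES = {"subprocess", "requests", "urllib", "socket", "pickle"}
RISKY_CALLS = {"eval", "exec"}


def scan_source(source: str) -> list:
    """Return (kind, name, line) findings for risky imports and calls in Python source."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in RISKY_MODULES:
                    findings.append(("import", alias.name, node.lineno))
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in RISKY_MODULES:
                findings.append(("import", node.module, node.lineno))
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id in RISKY_CALLS:
                findings.append(("call", func.id, node.lineno))
            elif (
                isinstance(func, ast.Attribute)
                and func.attr == "load"
                and isinstance(func.value, ast.Name)
                and func.value.id == "torch"
            ):
                findings.append(("call", "torch.load", node.lineno))
    # Sort by line number so the report order does not depend on traversal order.
    return sorted(findings, key=lambda finding: finding[2])


sample = "import os\nimport subprocess\nx = eval('1 + 1')\n"
print(scan_source(sample))
```

A syntactic scan like this only catches direct uses; aliased imports such as `from torch import load` or dynamically built names would need additional handling, so it complements a manual review rather than replacing it.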
148
+
149
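+ Current status:
150
+
151
+ - The README states that the repo uses custom Transformers remote code and recommends pinning a reviewed commit hash in production.
152
+ - If dependencies are added in the future, network access, filesystem side effects, and shell calls should still be avoided during model import and forward passes.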
153
+
154
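+ ### Remote Page Status
155
+
156
+ To be verified:
157
+
158
+ - Browser and anonymous API access to `https://huggingface.co/OpenMOSS-Team/MOSS-Audio-Tokenizer-V2` returns 401, so the actual remote page rendering, file list, parsed model card metadata, LFS file list, and download statistics cannot currently be confirmed anonymously.
159
+ - Local `git remote -v` confirms that origin points to this Hugging Face repo.
160
+
161
+ Someone with access should manually confirm on the Hugging Face page:
162
+
163
+ - Whether the README metadata is parsed correctly by the page.
164
+ - Whether `Files and versions` shows the 3 safetensors shards and the index.
165
+ - Whether the page displays `Safetensors`, `Transformers`, the license, and the task tags.
166
+ - Whether the `Security`/file scan is clean, with no pickle or malware warnings.
167
+ - Whether the LFS storage page is free of superfluous historical large files.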
168
+
169
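+ ## Priority Recommendations
170
+
171
+ ### P0: Blockers
172
+
173
+ No clear P0 blocker was found locally. `AutoConfig` and the dynamic import of the model class pass, and the weights are tracked as LFS pointers.
174
+
175
+ ### P1: Completed
176
+
177
+ - Removed the license TODO comment from the README.
178
+ - Added a root-level `LICENSE` file with the full Apache-2.0 text.
179
+ - Added the standard model card sections to the README: intended use, limitations, training data, training procedure, evaluation results, ethical considerations.
180
+ - Added a "Requirements" section that spells out Python, Transformers, PyTorch, `torchaudio`, and the optional `flash-attn`.
181
+ - Confirmed and kept the `demo/demo_gt.wav` example audio; the README notes that it can be replaced with the user's own audio.
182
+
183
+ ### P2: Improve Discoverability and Maintainability
184
+
185
+ - If it passes validation in the Hugging Face metadata UI, add `pipeline_tag: feature-extraction`.
186
+ - If the paper has metrics, add an eval table; add `model-index` when they can be structured.
187
+ - If the training data has a public Hub dataset id, add `datasets` metadata; otherwise explain the data scope and the reason it cannot be released in the body.
188
+ - Use an account with access to confirm the state of the remote page and LFS storage.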
189
+
190
+ ## Suggested README Metadata Direction
191
+
192
+ The following is only a direction; `pipeline_tag` is subject to the validation result of the Hugging Face metadata UI:
193
+
194
+ ```yaml
195
+ ---
196
+ license: apache-2.0
197
+ library_name: transformers
198
+ pipeline_tag: feature-extraction
199
+ tags:
200
+ - audio
201
+ - audio-tokenizer
202
+ - neural-codec
203
+ - speech-tokenizer
204
+ - trust-remote-code
205
+ - arxiv:2602.10934
206
+ ---
207
+ ```
208
+
209
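+ If `pipeline_tag: feature-extraction` does not fit this tokenizer, do not force it; keep the custom tags and state clearly in the body that this is an audio tokenizer / neural codec.
210
+
211
+ ## Final Assessment
212
+
213
+ Based on the local evidence, this repository already largely meets the key conventions for a Hugging Face model repository and Transformers custom remote code: the model card exists, the metadata is basically usable, config/auto_map is correct, the weights are safetensors, LFS tracking is correct, the shards and index are consistent, and the file count and shard sizes are within the recommended ranges.
214
+
215
+ The main release-quality issues have been remediated: the LICENSE, the README TODO comment, the model card additions, the dependency version notes, and the remote-code security advice are all in place. The main remaining risk is that the remote page returns 401 to anonymous access, so an account with access is needed for a final confirmation; in addition, whether to add `pipeline_tag`, `datasets`, and `model-index` depends on the Hugging Face metadata validation result and on what training and evaluation information can be made public.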