exlaw committed
Commit d718caa · verified · 1 Parent(s): 6841252

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,500 @@
+ Tencent is pleased to support the open source community by making Hidden-Decoding 大模型序列长度扩展 available.
+
+ Copyright (C) 2026 Tencent. All rights reserved.
+
+ The open-source software and/or models included in this distribution may have been modified by Tencent (“Tencent Modifications”). All Tencent Modifications are Copyright (C) Tencent.
+
+ Hidden-Decoding 大模型序列长度扩展 is licensed under the License Terms of Hidden-Decoding 大模型序列长度扩展, except for the third-party components listed below, which remain licensed under their respective original terms. Hidden-Decoding 大模型序列长度扩展 does not impose any additional restrictions beyond those specified in the original licenses of these third-party components. Users are required to comply with all applicable terms and conditions of the original licenses and to ensure that the use of these third-party components conforms to all relevant laws and regulations.
+
+ For the avoidance of doubt, Hidden-Decoding 大模型序列长度扩展 refers solely to training code, inference code, parameters, and weights made publicly available by Tencent in accordance with the License Terms of Hidden-Decoding 大模型序列长度扩展.
+
+ Terms of the License Terms of Hidden-Decoding 大模型序列长度扩展:
+ --------------------------------------------------------------------
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 0. Additional Territorial Limitation **Hidden-Decoding 大模型序列长度扩展 IS NOT INTENDED FOR USE WITHIN THE EUROPEAN UNION.** IN THE EVENT OF ANY CONFLICT, THIS CLAUSE SHALL PREVAIL.
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
+
+ You must give any other recipients of the Work or Derivative Works a copy of this License; and
+
+ You must cause any modified files to carry prominent notices stating that You changed the files; and
+
+ You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
+
+ If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
+
+
+ END OF TERMS AND CONDITIONS
+
+
+
+ Dependencies and Licenses:
+
+ This open-source project, Hidden-Decoding 大模型序列长度扩展, builds upon the following open-source models and/or software components, each of which remains licensed under its original license. Certain models or software may include modifications made by Tencent (“Tencent Modifications”), which are Copyright (C) Tencent.
+
+ In case you believe there have been errors in the attribution below, you may submit the concerns to us for review and correction.
+
+
+ Open Source Model Licensed under the apache-2.0:
+ --------------------------------------------------------------------
+ 1. Qwen3-8B-Base
+ Copyright (C) Qwen3-8B-Base original author and authors
+ Please note this model has been modified by Tencent in this distribution.
+
+
+
+ Terms of the apache-2.0:
+ --------------------------------------------------------------------
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+
+ Open Source Software Licensed under the apache-2.0
+ --------------------------------------------------------------------
+ 1. sglang
+ Copyright 2023-2024 SGLang Team
+ Please note this software has been modified by Tencent in this distribution.
+
+
+
+ Terms of the apache-2.0
+ --------------------------------------------------------------------
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
README.md ADDED
@@ -0,0 +1,97 @@
+ ---
+ license: other
+ license_name: sequential-hidden-decoding
+ license_link: LICENSE
+ base_model:
+ - Qwen/Qwen3-8B-Base
+ tags:
+ - sequential-hidden-decoding
+ - pretrained
+ - base-model
+ ---
+
+ # Sequential-Hidden-Decoding-8B-n4
+
+ This is the **n=4** variant of Sequential Hidden Decoding, a method that scales sequence length by n× with only additional Embedding parameters — same Transformer, more compute per token.
+
+ - **Base model:** [Qwen3-8B-Base](https://huggingface.co/Qwen/Qwen3-8B-Base)
+ - **Scale:** 4×
+ - **Additional Embedding Params:** 3.1B
+ - **Training Tokens:** 150B
+ - **Dtype:** bfloat16
+
+ > **Note:** This is a **base model** (not instruction-tuned). It is intended for benchmarking, text completion, and as a foundation for downstream fine-tuning (SFT / RLHF). For conversational or instruction-following use cases, please fine-tune on your own data.
+
+ ## Key Idea
+
+ Prepare *n* independent Embedding matrices to encode the same token sequence *n* times, interleave the results, and feed the *n*×-length sequence into the same Transformer. Only the last embedding of each token computes the next-token loss, while the preceding embeddings serve as implicit reasoning steps in a continuous latent space.
+
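As an illustration only, here is a minimal PyTorch sketch of this interleaving (not the released modeling code; `n`, the tensor names, and the toy batch are assumptions, with the vocabulary and hidden sizes taken from config.json):

```python
import torch
import torch.nn as nn

# n=4 variant; vocab_size and hidden_size follow config.json (assumed shapes)
n, vocab_size, hidden_size = 4, 151936, 4096
embeds = nn.ModuleList([nn.Embedding(vocab_size, hidden_size) for _ in range(n)])

token_ids = torch.randint(0, vocab_size, (1, 3))  # (batch, seq)

# Encode the same tokens with each of the n embedding tables, then interleave
# per token: [t0_e0, ..., t0_e3, t1_e0, ...] -> (batch, seq * n, hidden)
stacked = torch.stack([emb(token_ids) for emb in embeds], dim=2)
inputs_embeds = stacked.flatten(1, 2)

# The unchanged Transformer would consume `inputs_embeds`; only every n-th
# output position (the last copy of each token) carries the next-token loss.
assert inputs_embeds.shape == (1, 3 * n, hidden_size)
```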
+ ## Results
+
+ | Benchmark | # Shots | 8B Baseline | 8B scale n=2 | 8B scale n=4 | 8B scale n=8 |
+ |-----------|:-------:|:-----------:|:------------:|:------------:|:------------:|
+ | BBH (EM) | 3-shot | 78.8 | 81.3 | **83.0** | 83.9 |
+ | MMLU (EM) | 5-shot | 79.8 | 80.9 | **81.9** | 82.2 |
+ | MBPP+ (Pass@1) | 1-shot | 66.7 | 69.4 | **68.7** | 69.4 |
+ | MATH (LLM-judge) | 4-shot | 56.0 | 58.2 | **60.0** | 61.1 |
+ | ARC-C | 25-shot | 93.9 | 94.3 | **94.4** | 94.7 |
+ | Hellaswag | 10-shot | 79.7 | 83.1 | **85.0** | 85.3 |
+ | GSM8K | 4-shot | 92.5 | 93.3 | **93.9** | 94.6 |
+
+ ## Serving (SGLang)
+
+ This model requires a patched version of [SGLang](https://github.com/sgl-project/sglang) for inference. See the [project page](https://huggingface.co/collections/tencent/sequential-hidden-decoding) for installation options (Docker image, forked repo, or manual patch).
+
+ ```bash
+ python -m sglang.launch_server \
+     --model-path tencent/Sequential-Hidden-Decoding-8B-n4 \
+     --trust-remote-code \
+     --tp-size 1 \
+     --port 30000 --host 0.0.0.0 \
+     --chunked-prefill-size -1 \
+     --attention-backend fa3 \
+     --mem-fraction-static 0.82 \
+     --max-running-requests 32 \
+     --context-length 131072 \
+     --cuda-graph-max-bs 128 \
+     --cuda-graph-bs 1 2 4 8 16 32 64 128
+ ```
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
+ response = client.completions.create(
+     model="tencent/Sequential-Hidden-Decoding-8B-n4",
+     prompt="The meaning of life is",
+     max_tokens=128,
+     temperature=0,
+ )
+ print(response.choices[0].text)
+ ```
+
+ ## All Models
+
+ | Model | Scale | Embedding Params | Training Tokens |
+ |-------|:-----:|:----------------:|:---------------:|
+ | [Sequential-Hidden-Decoding-8B-n2](https://huggingface.co/tencent/Sequential-Hidden-Decoding-8B-n2) | 2× | 1.9B | 75B |
+ | [Sequential-Hidden-Decoding-8B-n4](https://huggingface.co/tencent/Sequential-Hidden-Decoding-8B-n4) | 4× | 3.1B | 150B |
+ | [Sequential-Hidden-Decoding-8B-n8](https://huggingface.co/tencent/Sequential-Hidden-Decoding-8B-n8) | 8× | 5.6B | 187B |
+
+ ## Citation
+
+ ```bibtex
+ @article{hidden_decoding_2026,
+     title = {Sequential Hidden Decoding: Scaling Sequence Length in Pretraining},
+     year = {2026},
+     url = {https://welm.weixin.qq.com/posts/hidden_decoding/}
+ }
+ ```
+
+ ## Contact
+
+ Sijun Zhang (nepheloturbulence@gmail.com), Aiwei Liu (liuaiwei20@gmail.com)
+
+ ## License
+
+ This model is released under the [License Terms of Sequential-Hidden-Decoding](LICENSE).
added_tokens.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "</tool_call>": 151658,
+   "<tool_call>": 151657,
+   "<|box_end|>": 151649,
+   "<|box_start|>": 151648,
+   "<|endoftext|>": 151643,
+   "<|file_sep|>": 151664,
+   "<|fim_middle|>": 151660,
+   "<|fim_pad|>": 151662,
+   "<|fim_prefix|>": 151659,
+   "<|fim_suffix|>": 151661,
+   "<|im_end|>": 151645,
+   "<|im_start|>": 151644,
+   "<|image_pad|>": 151655,
+   "<|object_ref_end|>": 151647,
+   "<|object_ref_start|>": 151646,
+   "<|quad_end|>": 151651,
+   "<|quad_start|>": 151650,
+   "<|repo_name|>": 151663,
+   "<|video_pad|>": 151656,
+   "<|vision_end|>": 151653,
+   "<|vision_pad|>": 151654,
+   "<|vision_start|>": 151652
+ }
chat_template.jinja ADDED
@@ -0,0 +1,54 @@
+ {%- if tools %}
+     {{- '<|im_start|>system\n' }}
+     {%- if messages[0]['role'] == 'system' %}
+         {{- messages[0]['content'] }}
+     {%- else %}
+         {{- 'You are a helpful assistant.' }}
+     {%- endif %}
+     {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+     {%- for tool in tools %}
+         {{- "\n" }}
+         {{- tool | tojson }}
+     {%- endfor %}
+     {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+ {%- else %}
+     {%- if messages[0]['role'] == 'system' %}
+         {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
+     {%- else %}
+         {{- '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}
+     {%- endif %}
+ {%- endif %}
+ {%- for message in messages %}
+     {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
+         {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+     {%- elif message.role == "assistant" %}
+         {{- '<|im_start|>' + message.role }}
+         {%- if message.content %}
+             {{- '\n' + message.content }}
+         {%- endif %}
+         {%- for tool_call in message.tool_calls %}
+             {%- if tool_call.function is defined %}
+                 {%- set tool_call = tool_call.function %}
+             {%- endif %}
+             {{- '\n<tool_call>\n{"name": "' }}
+             {{- tool_call.name }}
+             {{- '", "arguments": ' }}
+             {{- tool_call.arguments | tojson }}
+             {{- '}\n</tool_call>' }}
+         {%- endfor %}
+         {{- '<|im_end|>\n' }}
+     {%- elif message.role == "tool" %}
+         {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
+             {{- '<|im_start|>user' }}
+         {%- endif %}
+         {{- '\n<tool_response>\n' }}
+         {{- message.content }}
+         {{- '\n</tool_response>' }}
+         {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+             {{- '<|im_end|>\n' }}
+         {%- endif %}
+     {%- endif %}
+ {%- endfor %}
+ {%- if add_generation_prompt %}
+     {{- '<|im_start|>assistant\n' }}
+ {%- endif %}
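This is the standard Qwen ChatML-style template. A hedged usage sketch with `transformers` follows (the checkpoint id is taken from the README; since the card describes a base model, chat formatting mainly matters after instruction fine-tuning):

```python
from transformers import AutoTokenizer

# Assumes the repo id from the README; trust_remote_code resolves the custom config.
tok = AutoTokenizer.from_pretrained(
    "tencent/Sequential-Hidden-Decoding-8B-n4", trust_remote_code=True
)
messages = [{"role": "user", "content": "Hello!"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# Hello!<|im_end|>
# <|im_start|>assistant
```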
config.json ADDED
@@ -0,0 +1,75 @@
+ {
+   "architectures": [
+     "Qwen3ScaleSeqForCausalLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "auto_map": {
+     "AutoConfig": "configuration_qwen3_scale_seq.Qwen3ScaleSeqConfig",
+     "AutoModel": "modeling_qwen3_scale_seq.Qwen3ScaleSeqModel",
+     "AutoModelForCausalLM": "modeling_qwen3_scale_seq.Qwen3ScaleSeqForCausalLM"
+   },
+   "dtype": "bfloat16",
+   "eos_token_id": 151643,
+   "head_dim": 128,
+   "hidden_act": "silu",
+   "hidden_size": 4096,
+   "initializer_range": 0.02,
+   "intermediate_size": 12288,
+   "layer_types": [
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention"
+   ],
+   "max_position_embeddings": 131072,
+   "max_window_layers": 28,
+   "model_type": "qwen3_scale_seq",
+   "num_attention_heads": 32,
+   "num_hidden_layers": 36,
+   "num_key_value_heads": 8,
+   "pad_token_id": 151643,
+   "rms_norm_eps": 1e-06,
+   "rope_scaling": null,
+   "rope_theta": 1000000.0,
+   "scale_seq_times": 3,
+   "sliding_window": null,
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.57.1",
+   "use_cache": true,
+   "use_sliding_window": false,
+   "vocab_size": 151936
+ }
configuration_qwen3_scale_seq.py ADDED
@@ -0,0 +1,27 @@
+ """Qwen3ScaleSeq model configuration.
+
+ Extends Qwen3Config with scale_seq_times for embedding replication
+ to scale effective sequence length. See Scale_SeqLen_via_Embedding_Replication.md.
+ """
+
+ from transformers import Qwen3Config
+
+
+ class Qwen3ScaleSeqConfig(Qwen3Config):
+     """
+     Configuration for Qwen3 with scaled sequence length via embedding replication.
+
+     Adds one parameter on top of Qwen3Config:
+         scale_seq_times (int): Number of additional embedding copies (n-1 in the doc).
+             0 means no scaling (standard Qwen3 behavior).
+             1 means 2x sequence length (original + 1 copy), etc.
+     """
+
+     model_type = "qwen3_scale_seq"
+
+     def __init__(self, scale_seq_times=0, **kwargs):
+         self.scale_seq_times = scale_seq_times
+         super().__init__(**kwargs)
+
+
+ __all__ = ["Qwen3ScaleSeqConfig"]
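A small usage sketch (assumptions: the repo id from the README, and that the `auto_map` in config.json resolves this class with `trust_remote_code=True`). Note that `scale_seq_times` counts *additional* embedding copies, so the value 3 in config.json corresponds to the n=4 variant (original embedding plus 3 copies, 4x sequence length):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "tencent/Sequential-Hidden-Decoding-8B-n4", trust_remote_code=True
)
print(cfg.model_type)       # qwen3_scale_seq
print(cfg.scale_seq_times)  # 3 -> n = scale_seq_times + 1 = 4
```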
generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "bos_token_id": 151643,
+   "eos_token_id": 151643,
+   "max_new_tokens": 2048,
+   "transformers_version": "4.57.1",
+   "trust_remote_code": true
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0a61e66d0591ce4b29e3d094e86a84244721ccb377300d32335f9f902f668ad2
+ size 4902257696
model-00002-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ef31828d5a0792255dc8b86e1e9b2a806be3aa1ef644a5fef0eeceff9e768e8c
+ size 4915960368
model-00003-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5eb1c89f5b92159731935710b16851528b690b7049e3a17df2207e97c02f8b00
+ size 4983068496
model-00004-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:02270b2f4428bcbd78c1d4d39a1b426f1997fdc2e7f87796325592a3d2b68de5
+ size 4069549928
model-00005-of-00005.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b28bf8c97bdb6b6afd533de4ec8f387194fdb6123c771239b7424d3b6f9b7348
+ size 1244659840
model.safetensors.index.json ADDED
@@ -0,0 +1,410 @@
+ {
+   "metadata": {
+     "total_parameters": 10057724928,
+     "total_size": 20115449856
+   },
+   "weight_map": {
+     "lm_head.weight": "model-00005-of-00005.safetensors",
+     "model.embed_tokens.weight": "model-00001-of-00005.safetensors",
+     "model.layers.0.input_layernorm.weight": "model-00001-of-00005.safetensors",
+     "model.layers.0.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+     "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+     "model.layers.0.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+     "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+     "model.layers.0.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
+     "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+     "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+     "model.layers.0.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
+     "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+     "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+     "model.layers.1.input_layernorm.weight": "model-00001-of-00005.safetensors",
+     "model.layers.1.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
+     "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
+     "model.layers.1.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
+     "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
+     "model.layers.1.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
+     "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
+     "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
+     "model.layers.1.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
+     "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
+     "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
+     "model.layers.10.input_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.10.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.10.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.10.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.10.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.10.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.10.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.10.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.10.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.10.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.10.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.11.input_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.11.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.11.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.11.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.11.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.11.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.11.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.11.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.11.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.11.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.11.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.12.input_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.12.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.12.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.12.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.12.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.12.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.12.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.12.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.12.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.12.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.13.input_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.13.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.13.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.13.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.13.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.13.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.13.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.13.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.13.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.13.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.13.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.14.input_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.14.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.14.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.14.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.14.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.14.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.14.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.14.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.14.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.14.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.14.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.15.input_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.15.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.15.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.15.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.15.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.15.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.15.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.15.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.15.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.15.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.15.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.16.input_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.16.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.16.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.16.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.16.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.16.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.16.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.16.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.16.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.16.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.16.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.17.input_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.17.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.17.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.17.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.17.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.17.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.17.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.17.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.17.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.17.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.17.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.18.input_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.18.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.18.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.18.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.18.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.18.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.18.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.18.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.18.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.18.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.18.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.19.input_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.19.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.19.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.19.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
+     "model.layers.19.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
+     "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
138
+ "model.layers.19.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
139
+ "model.layers.19.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
140
+ "model.layers.19.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
141
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00005.safetensors",
142
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
143
+ "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
144
+ "model.layers.2.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
145
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
146
+ "model.layers.2.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
147
+ "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
148
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
149
+ "model.layers.2.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
150
+ "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
151
+ "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
152
+ "model.layers.20.input_layernorm.weight": "model-00002-of-00005.safetensors",
153
+ "model.layers.20.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
154
+ "model.layers.20.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
155
+ "model.layers.20.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
156
+ "model.layers.20.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
157
+ "model.layers.20.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
158
+ "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
159
+ "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
160
+ "model.layers.20.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
161
+ "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
162
+ "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
163
+ "model.layers.21.input_layernorm.weight": "model-00002-of-00005.safetensors",
164
+ "model.layers.21.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
165
+ "model.layers.21.mlp.gate_proj.weight": "model-00002-of-00005.safetensors",
166
+ "model.layers.21.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
167
+ "model.layers.21.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
168
+ "model.layers.21.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
169
+ "model.layers.21.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
170
+ "model.layers.21.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
171
+ "model.layers.21.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
172
+ "model.layers.21.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
173
+ "model.layers.21.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
174
+ "model.layers.22.input_layernorm.weight": "model-00003-of-00005.safetensors",
175
+ "model.layers.22.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
176
+ "model.layers.22.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
177
+ "model.layers.22.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
178
+ "model.layers.22.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
179
+ "model.layers.22.self_attn.k_norm.weight": "model-00002-of-00005.safetensors",
180
+ "model.layers.22.self_attn.k_proj.weight": "model-00002-of-00005.safetensors",
181
+ "model.layers.22.self_attn.o_proj.weight": "model-00002-of-00005.safetensors",
182
+ "model.layers.22.self_attn.q_norm.weight": "model-00002-of-00005.safetensors",
183
+ "model.layers.22.self_attn.q_proj.weight": "model-00002-of-00005.safetensors",
184
+ "model.layers.22.self_attn.v_proj.weight": "model-00002-of-00005.safetensors",
185
+ "model.layers.23.input_layernorm.weight": "model-00003-of-00005.safetensors",
186
+ "model.layers.23.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
187
+ "model.layers.23.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
188
+ "model.layers.23.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
189
+ "model.layers.23.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
190
+ "model.layers.23.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
191
+ "model.layers.23.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
192
+ "model.layers.23.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
193
+ "model.layers.23.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
194
+ "model.layers.23.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
195
+ "model.layers.23.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
196
+ "model.layers.24.input_layernorm.weight": "model-00003-of-00005.safetensors",
197
+ "model.layers.24.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
198
+ "model.layers.24.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
199
+ "model.layers.24.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
200
+ "model.layers.24.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
201
+ "model.layers.24.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
202
+ "model.layers.24.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
203
+ "model.layers.24.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
204
+ "model.layers.24.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
205
+ "model.layers.24.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
206
+ "model.layers.24.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
207
+ "model.layers.25.input_layernorm.weight": "model-00003-of-00005.safetensors",
208
+ "model.layers.25.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
209
+ "model.layers.25.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
210
+ "model.layers.25.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
211
+ "model.layers.25.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
212
+ "model.layers.25.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
213
+ "model.layers.25.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
214
+ "model.layers.25.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
215
+ "model.layers.25.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
216
+ "model.layers.25.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
217
+ "model.layers.25.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
218
+ "model.layers.26.input_layernorm.weight": "model-00003-of-00005.safetensors",
219
+ "model.layers.26.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
220
+ "model.layers.26.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
221
+ "model.layers.26.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
222
+ "model.layers.26.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
223
+ "model.layers.26.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
224
+ "model.layers.26.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
225
+ "model.layers.26.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
226
+ "model.layers.26.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
227
+ "model.layers.26.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
228
+ "model.layers.26.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
229
+ "model.layers.27.input_layernorm.weight": "model-00003-of-00005.safetensors",
230
+ "model.layers.27.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
231
+ "model.layers.27.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
232
+ "model.layers.27.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
233
+ "model.layers.27.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
234
+ "model.layers.27.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
235
+ "model.layers.27.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
236
+ "model.layers.27.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
237
+ "model.layers.27.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
238
+ "model.layers.27.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
239
+ "model.layers.27.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
240
+ "model.layers.28.input_layernorm.weight": "model-00003-of-00005.safetensors",
241
+ "model.layers.28.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
242
+ "model.layers.28.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
243
+ "model.layers.28.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
244
+ "model.layers.28.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
245
+ "model.layers.28.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
246
+ "model.layers.28.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
247
+ "model.layers.28.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
248
+ "model.layers.28.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
249
+ "model.layers.28.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
250
+ "model.layers.28.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
251
+ "model.layers.29.input_layernorm.weight": "model-00003-of-00005.safetensors",
252
+ "model.layers.29.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
253
+ "model.layers.29.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
254
+ "model.layers.29.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
255
+ "model.layers.29.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
256
+ "model.layers.29.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
257
+ "model.layers.29.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
258
+ "model.layers.29.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
259
+ "model.layers.29.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
260
+ "model.layers.29.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
261
+ "model.layers.29.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
262
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00005.safetensors",
263
+ "model.layers.3.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
264
+ "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
265
+ "model.layers.3.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
266
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
267
+ "model.layers.3.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
268
+ "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
269
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
270
+ "model.layers.3.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
271
+ "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
272
+ "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
273
+ "model.layers.30.input_layernorm.weight": "model-00003-of-00005.safetensors",
274
+ "model.layers.30.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
275
+ "model.layers.30.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
276
+ "model.layers.30.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
277
+ "model.layers.30.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
278
+ "model.layers.30.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
279
+ "model.layers.30.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
280
+ "model.layers.30.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
281
+ "model.layers.30.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
282
+ "model.layers.30.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
283
+ "model.layers.30.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
284
+ "model.layers.31.input_layernorm.weight": "model-00003-of-00005.safetensors",
285
+ "model.layers.31.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
286
+ "model.layers.31.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
287
+ "model.layers.31.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
288
+ "model.layers.31.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
289
+ "model.layers.31.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
290
+ "model.layers.31.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
291
+ "model.layers.31.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
292
+ "model.layers.31.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
293
+ "model.layers.31.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
294
+ "model.layers.31.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
295
+ "model.layers.32.input_layernorm.weight": "model-00003-of-00005.safetensors",
296
+ "model.layers.32.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
297
+ "model.layers.32.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
298
+ "model.layers.32.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
299
+ "model.layers.32.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
300
+ "model.layers.32.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
301
+ "model.layers.32.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
302
+ "model.layers.32.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
303
+ "model.layers.32.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
304
+ "model.layers.32.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
305
+ "model.layers.32.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
306
+ "model.layers.33.input_layernorm.weight": "model-00003-of-00005.safetensors",
307
+ "model.layers.33.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
308
+ "model.layers.33.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
309
+ "model.layers.33.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
310
+ "model.layers.33.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
311
+ "model.layers.33.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
312
+ "model.layers.33.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
313
+ "model.layers.33.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
314
+ "model.layers.33.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
315
+ "model.layers.33.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
316
+ "model.layers.33.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
317
+ "model.layers.34.input_layernorm.weight": "model-00003-of-00005.safetensors",
318
+ "model.layers.34.mlp.down_proj.weight": "model-00003-of-00005.safetensors",
319
+ "model.layers.34.mlp.gate_proj.weight": "model-00003-of-00005.safetensors",
320
+ "model.layers.34.mlp.up_proj.weight": "model-00003-of-00005.safetensors",
321
+ "model.layers.34.post_attention_layernorm.weight": "model-00003-of-00005.safetensors",
322
+ "model.layers.34.self_attn.k_norm.weight": "model-00003-of-00005.safetensors",
323
+ "model.layers.34.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
324
+ "model.layers.34.self_attn.o_proj.weight": "model-00003-of-00005.safetensors",
325
+ "model.layers.34.self_attn.q_norm.weight": "model-00003-of-00005.safetensors",
326
+ "model.layers.34.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
327
+ "model.layers.34.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
328
+ "model.layers.35.input_layernorm.weight": "model-00004-of-00005.safetensors",
329
+ "model.layers.35.mlp.down_proj.weight": "model-00004-of-00005.safetensors",
330
+ "model.layers.35.mlp.gate_proj.weight": "model-00004-of-00005.safetensors",
331
+ "model.layers.35.mlp.up_proj.weight": "model-00004-of-00005.safetensors",
332
+ "model.layers.35.post_attention_layernorm.weight": "model-00004-of-00005.safetensors",
333
+ "model.layers.35.self_attn.k_norm.weight": "model-00004-of-00005.safetensors",
334
+ "model.layers.35.self_attn.k_proj.weight": "model-00003-of-00005.safetensors",
335
+ "model.layers.35.self_attn.o_proj.weight": "model-00004-of-00005.safetensors",
336
+ "model.layers.35.self_attn.q_norm.weight": "model-00004-of-00005.safetensors",
337
+ "model.layers.35.self_attn.q_proj.weight": "model-00003-of-00005.safetensors",
338
+ "model.layers.35.self_attn.v_proj.weight": "model-00003-of-00005.safetensors",
339
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00005.safetensors",
340
+ "model.layers.4.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
341
+ "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
342
+ "model.layers.4.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
343
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
344
+ "model.layers.4.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
345
+ "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
346
+ "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
347
+ "model.layers.4.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
348
+ "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
349
+ "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
350
+ "model.layers.5.input_layernorm.weight": "model-00001-of-00005.safetensors",
351
+ "model.layers.5.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
352
+ "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
353
+ "model.layers.5.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
354
+ "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
355
+ "model.layers.5.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
356
+ "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
357
+ "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
358
+ "model.layers.5.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
359
+ "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
360
+ "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
361
+ "model.layers.6.input_layernorm.weight": "model-00001-of-00005.safetensors",
362
+ "model.layers.6.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
363
+ "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
364
+ "model.layers.6.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
365
+ "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
366
+ "model.layers.6.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
367
+ "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
368
+ "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
369
+ "model.layers.6.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
370
+ "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
371
+ "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
372
+ "model.layers.7.input_layernorm.weight": "model-00001-of-00005.safetensors",
373
+ "model.layers.7.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
374
+ "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
375
+ "model.layers.7.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
376
+ "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
377
+ "model.layers.7.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
378
+ "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
379
+ "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
380
+ "model.layers.7.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
381
+ "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
382
+ "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
383
+ "model.layers.8.input_layernorm.weight": "model-00001-of-00005.safetensors",
384
+ "model.layers.8.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
385
+ "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
386
+ "model.layers.8.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
387
+ "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
388
+ "model.layers.8.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
389
+ "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
390
+ "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
391
+ "model.layers.8.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
392
+ "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
393
+ "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
394
+ "model.layers.9.input_layernorm.weight": "model-00002-of-00005.safetensors",
395
+ "model.layers.9.mlp.down_proj.weight": "model-00002-of-00005.safetensors",
396
+ "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
397
+ "model.layers.9.mlp.up_proj.weight": "model-00002-of-00005.safetensors",
398
+ "model.layers.9.post_attention_layernorm.weight": "model-00002-of-00005.safetensors",
399
+ "model.layers.9.self_attn.k_norm.weight": "model-00001-of-00005.safetensors",
400
+ "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00005.safetensors",
401
+ "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00005.safetensors",
402
+ "model.layers.9.self_attn.q_norm.weight": "model-00001-of-00005.safetensors",
403
+ "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
404
+ "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00005.safetensors",
405
+ "model.norm.weight": "model-00004-of-00005.safetensors",
406
+ "model.scale_seq_embed_tokens_list.0.weight": "model-00004-of-00005.safetensors",
407
+ "model.scale_seq_embed_tokens_list.1.weight": "model-00004-of-00005.safetensors",
408
+ "model.scale_seq_embed_tokens_list.2.weight": "model-00004-of-00005.safetensors"
409
+ }
410
+ }
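Note: the weight map above ends with three extra embedding tables, model.scale_seq_embed_tokens_list.{0,1,2}.weight, which matches a checkpoint exported with scale_seq_times = 3 in the modeling code below (four streams per token including the original embed_tokens). A minimal sketch, assuming a local copy of this repo, of using the index to pull one of these tensors from the correct shard; the ckpt_dir path is hypothetical:

import json
from safetensors.torch import load_file

ckpt_dir = "path/to/local/checkpoint"  # hypothetical local clone of this repo
with open(f"{ckpt_dir}/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

name = "model.scale_seq_embed_tokens_list.0.weight"
shard = weight_map[name]  # "model-00004-of-00005.safetensors" per the map above
tensor = load_file(f"{ckpt_dir}/{shard}")[name]
print(tuple(tensor.shape))  # expected (vocab_size, hidden_size), like embed_tokens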
modeling_qwen3_scale_seq.py ADDED
@@ -0,0 +1,226 @@
+ """Qwen3 with scaled sequence length via embedding replication.
+
+ Extends Qwen3Model/Qwen3ForCausalLM with scale_seq_times additional
+ embedding tables. During forward, the original token sequence of length L
+ is expanded to (1 + scale_seq_times) * L via interleaved multi-stream
+ embedding, then processed by the standard Qwen3 transformer body.
+
+ Architecture overview (n = 1 + scale_seq_times):
+ - n Embedding tables: E_0 (original), E_1, ..., E_{n-1} (new)
+ - Interleaved layout: [E_0(t1), E_1(t1), ..., E_0(t2), E_1(t2), ...]
+ - RoPE positions: 0, 1, 2, ..., n*L - 1 (continuous)
+ - Standard causal attention over all n*L positions
+ - Contraction: only the last stream's hidden_state per token goes through
+   lm_head (the stream with the richest context), matching v4dev behavior.
+
+ See: Scale_SeqLen_via_Embedding_Replication.md
+ """
+
+ from typing import Optional, Tuple, Union
+
+ import torch
+ from torch import nn
+ from transformers import Qwen3ForCausalLM, Qwen3Model
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
+ from transformers.processing_utils import Unpack
+ from transformers.utils import TransformersKwargs, can_return_tuple
+
+ from .configuration_qwen3_scale_seq import Qwen3ScaleSeqConfig
+
+
+ class Qwen3ScaleSeqModel(Qwen3Model):
+     """Qwen3Model extended with multi-stream embedding for sequence scaling."""
+
+     config_class = Qwen3ScaleSeqConfig
+
+     def __init__(self, config: Qwen3ScaleSeqConfig):
+         super().__init__(config)
+         self.scale_seq_times = getattr(config, "scale_seq_times", 0)
+
+         if self.scale_seq_times > 0:
+             self.scale_seq_embed_tokens_list = nn.ModuleList(
+                 [
+                     nn.Embedding(
+                         config.vocab_size,
+                         config.hidden_size,
+                         self.padding_idx,
+                     )
+                     for _ in range(self.scale_seq_times)
+                 ]
+             )
+
+         self.post_init()
+
+     def _expand_scale_seq(
+         self,
+         input_ids: torch.LongTensor,
+         hidden_states: torch.FloatTensor,
+     ) -> torch.FloatTensor:
+         """Expand hidden_states from (B, T, D) to (B, T * scale, D).
+
+         Layout per original token i:
+             [main_emb_i, scale_seq_1_emb_i, ..., scale_seq_N_emb_i]
+
+         Args:
+             input_ids: (batch, seq_len) original token ids.
+             hidden_states: (batch, seq_len, hidden) main embedding output.
+
+         Returns:
+             Expanded tensor of shape (batch, seq_len * scale, hidden).
+         """
+         device = hidden_states.device
+         B, T, D = hidden_states.shape
+
+         # (B, T, D) -> (B, T, 1, D)
+         parts = [hidden_states.unsqueeze(2)]
+
+         for s in range(self.scale_seq_times):
+             emb_module = self.scale_seq_embed_tokens_list[s]
+             hs_s = emb_module(input_ids.to(emb_module.weight.device)).to(device)
+             parts.append(hs_s.unsqueeze(2))  # (B, T, 1, D)
+
+         # (B, T, scale, D) -> (B, T * scale, D)
+         expanded = torch.cat(parts, dim=2)
+         return expanded.reshape(B, T * (self.scale_seq_times + 1), D)
+
+     def forward(
+         self,
+         input_ids: Optional[torch.LongTensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values=None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         **kwargs,
+     ) -> Union[Tuple, BaseModelOutputWithPast]:
+         if (
+             self.scale_seq_times > 0
+             and input_ids is not None
+             and inputs_embeds is None
+         ):
+             scale = self.scale_seq_times + 1
+
+             # Compute main embedding, then expand with scale_seq streams
+             inputs_embeds = self.embed_tokens(input_ids)
+             inputs_embeds = self._expand_scale_seq(input_ids, inputs_embeds)
+
+             B = inputs_embeds.shape[0]
+             T_expanded = inputs_embeds.shape[1]
+
+             # Recompute cache_position and position_ids in expanded space
+             past_seen_tokens = (
+                 past_key_values.get_seq_length()
+                 if past_key_values is not None else 0
+             )
+             cache_position = torch.arange(
+                 past_seen_tokens, past_seen_tokens + T_expanded,
+                 device=inputs_embeds.device,
+             )
+             position_ids = cache_position.unsqueeze(0).expand(B, -1)
+
+             # Expand attention_mask to match expanded sequence length
+             if attention_mask is not None:
+                 attention_mask = attention_mask.repeat_interleave(scale, dim=1)
+
+             input_ids = None  # avoid double embedding lookup in super().forward()
+
+         return super().forward(
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_values=past_key_values,
+             inputs_embeds=inputs_embeds,
+             use_cache=use_cache,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+             cache_position=cache_position,
+             **kwargs,
+         )
+
+
+ class Qwen3ScaleSeqForCausalLM(Qwen3ForCausalLM):
+     """Qwen3ForCausalLM with multi-stream embedding for sequence scaling.
+
+     Contraction: after the transformer body produces (B, T*scale, D),
+     select only the last stream per token (the one with richest context)
+     before applying lm_head, producing (B, T, vocab_size).
+     """
+
+     config_class = Qwen3ScaleSeqConfig
+     _tied_weights_keys = ["lm_head.weight"]
+
+     def __init__(self, config: Qwen3ScaleSeqConfig):
+         super().__init__(config)
+         # Replace the inner model with our scaled version
+         self.model = Qwen3ScaleSeqModel(config)
+         self.post_init()
+
+     @can_return_tuple
+     def forward(
+         self,
+         input_ids: Optional[torch.LongTensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values=None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         labels: Optional[torch.LongTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         logits_to_keep: Union[int, torch.Tensor] = 0,
+         **kwargs,
+     ) -> CausalLMOutputWithPast:
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         outputs = self.model(
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_values=past_key_values,
+             inputs_embeds=inputs_embeds,
+             use_cache=use_cache,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+             cache_position=cache_position,
+             **kwargs,
+         )
+
+         hidden_states = outputs[0]
+
+         # ---- scale_seq contraction ----
+         # Contract expanded hidden_states (B, T*scale, D) back to logical
+         # token space (B, T, D) by selecting the last stream per token group
+         # (the stream with the richest context), matching v4dev behavior.
+         if self.model.scale_seq_times > 0:
+             scale = self.model.scale_seq_times + 1
+             hidden_states = hidden_states[:, scale - 1::scale, :]
+
+         slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
+         logits = self.lm_head(hidden_states[:, slice_indices, :])
+
+         loss = None
+         if labels is not None:
+             loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
+
+         return CausalLMOutputWithPast(
+             loss=loss,
+             logits=logits,
+             past_key_values=outputs.past_key_values if use_cache else None,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+         )
+
+
+ __all__ = ["Qwen3ScaleSeqModel", "Qwen3ScaleSeqForCausalLM"]
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "eos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9c5ae00e602b8860cbd784ba82a8aa14e8feecec692e7076590d014d7b7fdafa
+ size 11421896
tokenizer_config.json ADDED
@@ -0,0 +1,207 @@
+ {
+   "add_bos_token": false,
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "151643": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151644": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151645": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151646": {
+       "content": "<|object_ref_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151647": {
+       "content": "<|object_ref_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151648": {
+       "content": "<|box_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151649": {
+       "content": "<|box_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151650": {
+       "content": "<|quad_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151651": {
+       "content": "<|quad_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151652": {
+       "content": "<|vision_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151653": {
+       "content": "<|vision_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151654": {
+       "content": "<|vision_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151655": {
+       "content": "<|image_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151656": {
+       "content": "<|video_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151657": {
+       "content": "<tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151658": {
+       "content": "</tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151659": {
+       "content": "<|fim_prefix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151660": {
+       "content": "<|fim_middle|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151661": {
+       "content": "<|fim_suffix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151662": {
+       "content": "<|fim_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151663": {
+       "content": "<|repo_name|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151664": {
+       "content": "<|file_sep|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     }
+   },
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "bos_token": null,
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|endoftext|>",
+   "errors": "replace",
+   "extra_special_tokens": {},
+   "model_max_length": 131072,
+   "pad_token": "<|endoftext|>",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null
+ }
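Note: per this config there is no bos token, and <|endoftext|> serves as both eos and pad, with a 131072-token model_max_length. A quick sanity check after download; the path/repo id below is hypothetical:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/this-repo")  # or the hub model id
assert tok.bos_token is None
assert tok.eos_token == tok.pad_token == "<|endoftext|>"
print(tok.model_max_length)  # 131072, per tokenizer_config.json above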
vocab.json ADDED
The diff for this file is too large to render. See raw diff