wangzihan99 committed
Commit 74a1327 · 2 Parent(s): 89a2cd35ff8f11

Merge branch 'main' of https://huggingface.co/Qwen/Qwen-7B-Chat-Int4 into pr/6

Files changed (8)
  1. NOTICE +229 -1
  2. README.md +5 -5
  3. assets/logo.jpg +0 -0
  4. assets/wechat.png +0 -0
  5. config.json +1 -1
  6. generation_config.json +11 -11
  7. modeling_qwen.py +63 -145
  8. tokenizer_config.json +1 -1
NOTICE CHANGED
@@ -49,4 +49,232 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
49
  AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
50
  LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
51
  OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
52
- SOFTWARE.
49
  AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
50
  LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
51
  OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
52
+ SOFTWARE.
53
+
54
+ ------------- LICENSE FOR stanford_alpaca code --------------
55
+
56
+ Apache License
57
+ Version 2.0, January 2004
58
+ http://www.apache.org/licenses/
59
+
60
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
61
+
62
+ 1. Definitions.
63
+
64
+ "License" shall mean the terms and conditions for use, reproduction,
65
+ and distribution as defined by Sections 1 through 9 of this document.
66
+
67
+ "Licensor" shall mean the copyright owner or entity authorized by
68
+ the copyright owner that is granting the License.
69
+
70
+ "Legal Entity" shall mean the union of the acting entity and all
71
+ other entities that control, are controlled by, or are under common
72
+ control with that entity. For the purposes of this definition,
73
+ "control" means (i) the power, direct or indirect, to cause the
74
+ direction or management of such entity, whether by contract or
75
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
76
+ outstanding shares, or (iii) beneficial ownership of such entity.
77
+
78
+ "You" (or "Your") shall mean an individual or Legal Entity
79
+ exercising permissions granted by this License.
80
+
81
+ "Source" form shall mean the preferred form for making modifications,
82
+ including but not limited to software source code, documentation
83
+ source, and configuration files.
84
+
85
+ "Object" form shall mean any form resulting from mechanical
86
+ transformation or translation of a Source form, including but
87
+ not limited to compiled object code, generated documentation,
88
+ and conversions to other media types.
89
+
90
+ "Work" shall mean the work of authorship, whether in Source or
91
+ Object form, made available under the License, as indicated by a
92
+ copyright notice that is included in or attached to the work
93
+ (an example is provided in the Appendix below).
94
+
95
+ "Derivative Works" shall mean any work, whether in Source or Object
96
+ form, that is based on (or derived from) the Work and for which the
97
+ editorial revisions, annotations, elaborations, or other modifications
98
+ represent, as a whole, an original work of authorship. For the purposes
99
+ of this License, Derivative Works shall not include works that remain
100
+ separable from, or merely link (or bind by name) to the interfaces of,
101
+ the Work and Derivative Works thereof.
102
+
103
+ "Contribution" shall mean any work of authorship, including
104
+ the original version of the Work and any modifications or additions
105
+ to that Work or Derivative Works thereof, that is intentionally
106
+ submitted to Licensor for inclusion in the Work by the copyright owner
107
+ or by an individual or Legal Entity authorized to submit on behalf of
108
+ the copyright owner. For the purposes of this definition, "submitted"
109
+ means any form of electronic, verbal, or written communication sent
110
+ to the Licensor or its representatives, including but not limited to
111
+ communication on electronic mailing lists, source code control systems,
112
+ and issue tracking systems that are managed by, or on behalf of, the
113
+ Licensor for the purpose of discussing and improving the Work, but
114
+ excluding communication that is conspicuously marked or otherwise
115
+ designated in writing by the copyright owner as "Not a Contribution."
116
+
117
+ "Contributor" shall mean Licensor and any individual or Legal Entity
118
+ on behalf of whom a Contribution has been received by Licensor and
119
+ subsequently incorporated within the Work.
120
+
121
+ 2. Grant of Copyright License. Subject to the terms and conditions of
122
+ this License, each Contributor hereby grants to You a perpetual,
123
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
124
+ copyright license to reproduce, prepare Derivative Works of,
125
+ publicly display, publicly perform, sublicense, and distribute the
126
+ Work and such Derivative Works in Source or Object form.
127
+
128
+ 3. Grant of Patent License. Subject to the terms and conditions of
129
+ this License, each Contributor hereby grants to You a perpetual,
130
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
131
+ (except as stated in this section) patent license to make, have made,
132
+ use, offer to sell, sell, import, and otherwise transfer the Work,
133
+ where such license applies only to those patent claims licensable
134
+ by such Contributor that are necessarily infringed by their
135
+ Contribution(s) alone or by combination of their Contribution(s)
136
+ with the Work to which such Contribution(s) was submitted. If You
137
+ institute patent litigation against any entity (including a
138
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
139
+ or a Contribution incorporated within the Work constitutes direct
140
+ or contributory patent infringement, then any patent licenses
141
+ granted to You under this License for that Work shall terminate
142
+ as of the date such litigation is filed.
143
+
144
+ 4. Redistribution. You may reproduce and distribute copies of the
145
+ Work or Derivative Works thereof in any medium, with or without
146
+ modifications, and in Source or Object form, provided that You
147
+ meet the following conditions:
148
+
149
+ (a) You must give any other recipients of the Work or
150
+ Derivative Works a copy of this License; and
151
+
152
+ (b) You must cause any modified files to carry prominent notices
153
+ stating that You changed the files; and
154
+
155
+ (c) You must retain, in the Source form of any Derivative Works
156
+ that You distribute, all copyright, patent, trademark, and
157
+ attribution notices from the Source form of the Work,
158
+ excluding those notices that do not pertain to any part of
159
+ the Derivative Works; and
160
+
161
+ (d) If the Work includes a "NOTICE" text file as part of its
162
+ distribution, then any Derivative Works that You distribute must
163
+ include a readable copy of the attribution notices contained
164
+ within such NOTICE file, excluding those notices that do not
165
+ pertain to any part of the Derivative Works, in at least one
166
+ of the following places: within a NOTICE text file distributed
167
+ as part of the Derivative Works; within the Source form or
168
+ documentation, if provided along with the Derivative Works; or,
169
+ within a display generated by the Derivative Works, if and
170
+ wherever such third-party notices normally appear. The contents
171
+ of the NOTICE file are for informational purposes only and
172
+ do not modify the License. You may add Your own attribution
173
+ notices within Derivative Works that You distribute, alongside
174
+ or as an addendum to the NOTICE text from the Work, provided
175
+ that such additional attribution notices cannot be construed
176
+ as modifying the License.
177
+
178
+ You may add Your own copyright statement to Your modifications and
179
+ may provide additional or different license terms and conditions
180
+ for use, reproduction, or distribution of Your modifications, or
181
+ for any such Derivative Works as a whole, provided Your use,
182
+ reproduction, and distribution of the Work otherwise complies with
183
+ the conditions stated in this License.
184
+
185
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
186
+ any Contribution intentionally submitted for inclusion in the Work
187
+ by You to the Licensor shall be under the terms and conditions of
188
+ this License, without any additional terms or conditions.
189
+ Notwithstanding the above, nothing herein shall supersede or modify
190
+ the terms of any separate license agreement you may have executed
191
+ with Licensor regarding such Contributions.
192
+
193
+ 6. Trademarks. This License does not grant permission to use the trade
194
+ names, trademarks, service marks, or product names of the Licensor,
195
+ except as required for reasonable and customary use in describing the
196
+ origin of the Work and reproducing the content of the NOTICE file.
197
+
198
+ 7. Disclaimer of Warranty. Unless required by applicable law or
199
+ agreed to in writing, Licensor provides the Work (and each
200
+ Contributor provides its Contributions) on an "AS IS" BASIS,
201
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
202
+ implied, including, without limitation, any warranties or conditions
203
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
204
+ PARTICULAR PURPOSE. You are solely responsible for determining the
205
+ appropriateness of using or redistributing the Work and assume any
206
+ risks associated with Your exercise of permissions under this License.
207
+
208
+ 8. Limitation of Liability. In no event and under no legal theory,
209
+ whether in tort (including negligence), contract, or otherwise,
210
+ unless required by applicable law (such as deliberate and grossly
211
+ negligent acts) or agreed to in writing, shall any Contributor be
212
+ liable to You for damages, including any direct, indirect, special,
213
+ incidental, or consequential damages of any character arising as a
214
+ result of this License or out of the use or inability to use the
215
+ Work (including but not limited to damages for loss of goodwill,
216
+ work stoppage, computer failure or malfunction, or any and all
217
+ other commercial damages or losses), even if such Contributor
218
+ has been advised of the possibility of such damages.
219
+
220
+ 9. Accepting Warranty or Additional Liability. While redistributing
221
+ the Work or Derivative Works thereof, You may choose to offer,
222
+ and charge a fee for, acceptance of support, warranty, indemnity,
223
+ or other liability obligations and/or rights consistent with this
224
+ License. However, in accepting such obligations, You may act only
225
+ on Your own behalf and on Your sole responsibility, not on behalf
226
+ of any other Contributor, and only if You agree to indemnify,
227
+ defend, and hold each Contributor harmless for any liability
228
+ incurred by, or claims asserted against, such Contributor by reason
229
+ of your accepting any such warranty or additional liability.
230
+
231
+ END OF TERMS AND CONDITIONS
232
+
233
+ APPENDIX: How to apply the Apache License to your work.
234
+
235
+ To apply the Apache License to your work, attach the following
236
+ boilerplate notice, with the fields enclosed by brackets "[]"
237
+ replaced with your own identifying information. (Don't include
238
+ the brackets!) The text should be enclosed in the appropriate
239
+ comment syntax for the file format. We also recommend that a
240
+ file or class name and description of purpose be included on the
241
+ same "printed page" as the copyright notice for easier
242
+ identification within third-party archives.
243
+
244
+ Copyright 2023 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li
245
+
246
+ Licensed under the Apache License, Version 2.0 (the "License");
247
+ you may not use this file except in compliance with the License.
248
+ You may obtain a copy of the License at
249
+
250
+ http://www.apache.org/licenses/LICENSE-2.0
251
+
252
+ Unless required by applicable law or agreed to in writing, software
253
+ distributed under the License is distributed on an "AS IS" BASIS,
254
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
255
+ See the License for the specific language governing permissions and
256
+ limitations under the License.
257
+
258
+ ------------- LICENSE FOR PanQiWei AutoGPTQ code --------------
259
+
260
+ MIT License
261
+
262
+ Copyright (c) 2023 潘其威(William)
263
+
264
+ Permission is hereby granted, free of charge, to any person obtaining a copy
265
+ of this software and associated documentation files (the "Software"), to deal
266
+ in the Software without restriction, including without limitation the rights
267
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
268
+ copies of the Software, and to permit persons to whom the Software is
269
+ furnished to do so, subject to the following conditions:
270
+
271
+ The above copyright notice and this permission notice shall be included in all
272
+ copies or substantial portions of the Software.
273
+
274
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
275
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
276
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
277
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
278
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
279
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
280
+ SOFTWARE.
README.md CHANGED
@@ -16,11 +16,11 @@ inference: false
16
  <br>
17
 
18
  <p align="center">
19
- 🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://arxiv.org/abs/2309.16609">Paper</a>&nbsp&nbsp | &nbsp&nbsp🖥️ <a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>
20
  <br>
21
- <a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat (微信)</a>&nbsp&nbsp | &nbsp&nbsp DingTalk (钉钉) &nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>&nbsp&nbsp
22
  </p>
23
- <br><br>
24
 
25
  ## 介绍(Introduction)
26
 
@@ -597,9 +597,9 @@ If you find our work helpful, feel free to give us a cite.
597
 
598
  ## 使用协议(License Agreement)
599
 
600
- 我们的代码和模型权重对学术研究完全开放,并支持商用。请查看[LICENSE](https://github.com/QwenLM/Qwen/blob/main/LICENSE)了解具体的开源协议细节。如需商用,请填写[问卷](https://dashscope.console.aliyun.com/openModelApply/qianwen)申请。
601
 
602
- Our code and checkpoints are open to research purpose, and they are allowed for commercial purposes. Check [LICENSE](https://github.com/QwenLM/Qwen/blob/main/LICENSE) for more details about the license. If you have requirements for commercial use, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.
603
  <br>
604
 
605
 
 
16
  <br>
17
 
18
  <p align="center">
19
+ 🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://arxiv.org/abs/2309.16609">Paper</a> &nbsp&nbsp | &nbsp&nbsp🖥️ <a href="https://modelscope.cn/studios/qwen/Qwen-7B-Chat-Demo/summary">Demo</a>
20
  <br>
21
+ <a href="assets/wechat.png">WeChat (微信)</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://discord.gg/z3GAxXZ9Ce">Discord</a>&nbsp&nbsp | &nbsp&nbsp<a href="https://dashscope.aliyun.com">API</a>
22
  </p>
23
+ <br>
24
 
25
  ## 介绍(Introduction)
26
 
 
597
 
598
  ## 使用协议(License Agreement)
599
 
600
+ 我们的代码和模型权重对学术研究完全开放,并支持商用。请查看[LICENSE](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT)了解具体的开源协议细节。如需商用,请填写[问卷](https://dashscope.console.aliyun.com/openModelApply/qianwen)申请。
601
 
602
+ Our code and checkpoints are open to research purpose, and they are allowed for commercial purposes. Check [LICENSE](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) for more details about the license. If you have requirements for commercial use, please fill out the [form](https://dashscope.console.aliyun.com/openModelApply/qianwen) to apply.
603
  <br>
604
 
605
 
assets/logo.jpg CHANGED
assets/wechat.png CHANGED
config.json CHANGED
@@ -16,7 +16,7 @@
16
  "initializer_range": 0.02,
17
  "kv_channels": 128,
18
  "layer_norm_epsilon": 1e-06,
19
- "max_position_embeddings": 8192,
20
  "model_type": "qwen",
21
  "no_bias": true,
22
  "num_attention_heads": 32,
 
16
  "initializer_range": 0.02,
17
  "kv_channels": 128,
18
  "layer_norm_epsilon": 1e-06,
19
+ "max_position_embeddings": 32768,
20
  "model_type": "qwen",
21
  "no_bias": true,
22
  "num_attention_heads": 32,
generation_config.json CHANGED
@@ -1,12 +1,12 @@
1
  {
2
- "chat_format": "chatml",
3
- "eos_token_id": 151643,
4
- "pad_token_id": 151643,
5
- "max_window_size": 6144,
6
- "max_new_tokens": 512,
7
- "do_sample": true,
8
- "top_k": 0,
9
- "top_p": 0.8,
10
- "repetition_penalty": 1.1,
11
- "transformers_version": "4.31.0"
12
- }
 
1
  {
2
+ "chat_format": "chatml",
3
+ "eos_token_id": 151643,
4
+ "pad_token_id": 151643,
5
+ "max_window_size": 24000,
6
+ "max_new_tokens": 512,
7
+ "do_sample": true,
8
+ "top_k": 0,
9
+ "top_p": 0.8,
10
+ "repetition_penalty": 1.1,
11
+ "transformers_version": "4.31.0"
12
+ }
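generation_config.json keeps the same sampling settings but raises max_window_size from 6144 to 24000, i.e. how much history is kept before truncation. A hedged sketch of reading it back (GenerationConfig stores unrecognized keys such as max_window_size as plain attributes):

```python
# Hedged sketch: read the updated generation settings back from the repo.
from transformers import GenerationConfig

gen_cfg = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat-Int4")
print(gen_cfg.max_window_size, gen_cfg.max_new_tokens)  # 24000, 512 after this change
```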
modeling_qwen.py CHANGED
@@ -13,7 +13,6 @@ import torch
13
  import torch.nn.functional as F
14
  import torch.utils.checkpoint
15
  import warnings
16
- from torch.cuda.amp import autocast
17
 
18
  from torch.nn import CrossEntropyLoss
19
  from transformers import PreTrainedTokenizer, GenerationConfig, StoppingCriteriaList
@@ -80,9 +79,10 @@ apply_rotary_emb_func = None
80
  apply_rotary_emb_func_triton = None
81
  rms_norm = None
82
  flash_attn_unpadded_func = None
 
83
 
84
  def _import_flash_attn():
85
- global apply_rotary_emb_func, rms_norm, flash_attn_unpadded_func
86
  try:
87
  from flash_attn.layers.rotary import apply_rotary_emb_func as __apply_rotary_emb_func
88
  apply_rotary_emb_func = __apply_rotary_emb_func
@@ -103,14 +103,18 @@ def _import_flash_attn():
103
 
104
  try:
105
  import flash_attn
 
106
  if not hasattr(flash_attn, '__version__'):
107
  from flash_attn.flash_attn_interface import flash_attn_unpadded_func as __flash_attn_unpadded_func
108
  else:
109
  if int(flash_attn.__version__.split(".")[0]) >= 2:
 
 
110
  from flash_attn.flash_attn_interface import flash_attn_varlen_func as __flash_attn_unpadded_func
111
  else:
112
  from flash_attn.flash_attn_interface import flash_attn_unpadded_func as __flash_attn_unpadded_func
113
  flash_attn_unpadded_func = __flash_attn_unpadded_func
 
114
  except ImportError:
115
  logger.warn(
116
  "Warning: import flash_attn fail, please install FlashAttention to get higher efficiency "
@@ -207,6 +211,11 @@ class FlashSelfAttention(torch.nn.Module):
207
  seqlen_k = k.shape[1]
208
  seqlen_out = seqlen_q
209
 
 
 
 
 
 
210
  q, k, v = [rearrange(x, "b s ... -> (b s) ...") for x in [q, k, v]]
211
  cu_seqlens_q = torch.arange(
212
  0,
@@ -336,7 +345,7 @@ class QWenAttention(nn.Module):
336
  warnings.warn("Failed to import KV cache kernels.")
337
  self.cache_kernels = None
338
 
339
- def _attn(self, query, key, value, registered_causal_mask, attention_mask=None, head_mask=None):
340
  device = query.device
341
  if self.use_cache_quantization:
342
  qk, qk_scale, qk_zero = key
@@ -361,26 +370,13 @@ class QWenAttention(nn.Module):
361
  size_temp = value[0].size(-1)
362
  else:
363
  size_temp = value.size(-1)
364
- attn_weights = attn_weights / torch.full(
365
- [],
366
- size_temp ** 0.5,
367
- dtype=attn_weights.dtype,
368
- device=attn_weights.device,
369
- )
370
- if self.use_cache_quantization:
371
- query_length, key_length = query.size(-2), key[0].size(-2)
372
- else:
373
- query_length, key_length = query.size(-2), key.size(-2)
374
- causal_mask = registered_causal_mask[
375
- :, :, key_length - query_length : key_length, :key_length
376
- ]
377
  mask_value = torch.finfo(attn_weights.dtype).min
378
- mask_value = torch.full([], mask_value, dtype=attn_weights.dtype).to(
379
- attn_weights.device
380
- )
381
- attn_weights = torch.where(
382
- causal_mask, attn_weights.to(attn_weights.dtype), mask_value
383
- )
384
 
385
  if attention_mask is not None:
386
  attn_weights = attn_weights + attention_mask
@@ -420,62 +416,6 @@ class QWenAttention(nn.Module):
420
 
421
  return attn_output, attn_weights
422
 
423
- def _upcast_and_reordered_attn(
424
- self, query, key, value, registered_causal_mask, attention_mask=None, head_mask=None
425
- ):
426
- bsz, num_heads, q_seq_len, dk = query.size()
427
- _, _, k_seq_len, _ = key.size()
428
-
429
- attn_weights = torch.empty(
430
- bsz * num_heads,
431
- q_seq_len,
432
- k_seq_len,
433
- dtype=torch.float32,
434
- device=query.device,
435
- )
436
-
437
- scale_factor = 1.0
438
- if self.scale_attn_weights:
439
- scale_factor /= float(value.size(-1)) ** 0.5
440
-
441
- with autocast(enabled=False):
442
- q, k = query.reshape(-1, q_seq_len, dk), key.transpose(-1, -2).reshape(
443
- -1, dk, k_seq_len
444
- )
445
- attn_weights = torch.baddbmm(
446
- attn_weights, q.float(), k.float(), beta=0, alpha=scale_factor
447
- )
448
- attn_weights = attn_weights.reshape(bsz, num_heads, q_seq_len, k_seq_len)
449
-
450
- query_length, key_length = query.size(-2), key.size(-2)
451
- causal_mask = registered_causal_mask[
452
- :, :, key_length - query_length : key_length, :key_length
453
- ]
454
- mask_value = torch.finfo(attn_weights.dtype).min
455
- mask_value = torch.tensor(mask_value, dtype=attn_weights.dtype).to(
456
- attn_weights.device
457
- )
458
- attn_weights = torch.where(causal_mask, attn_weights, mask_value)
459
-
460
- if attention_mask is not None:
461
- attn_weights = attn_weights + attention_mask
462
-
463
- attn_weights = nn.functional.softmax(attn_weights, dim=-1)
464
-
465
- if attn_weights.dtype != torch.float32:
466
- raise RuntimeError(
467
- "Error with upcasting, attn_weights does not have dtype torch.float32"
468
- )
469
- attn_weights = attn_weights.type(value.dtype)
470
- attn_weights = self.attn_dropout(attn_weights)
471
-
472
- if head_mask is not None:
473
- attn_weights = attn_weights * head_mask
474
-
475
- attn_output = torch.matmul(attn_weights, value)
476
-
477
- return attn_output, attn_weights
478
-
479
  def _split_heads(self, tensor, num_heads, attn_head_size):
480
  new_shape = tensor.size()[:-1] + (num_heads, attn_head_size)
481
  tensor = tensor.view(new_shape)
@@ -490,7 +430,6 @@ class QWenAttention(nn.Module):
490
  self,
491
  hidden_states: Optional[Tuple[torch.FloatTensor]],
492
  rotary_pos_emb_list: Optional[List[List[torch.Tensor]]] = None,
493
- registered_causal_mask: Optional[torch.Tensor] = None,
494
  layer_past: Optional[Tuple[torch.Tensor]] = None,
495
  attention_mask: Optional[torch.FloatTensor] = None,
496
  head_mask: Optional[torch.FloatTensor] = None,
@@ -564,7 +503,8 @@ class QWenAttention(nn.Module):
564
  else:
565
  present = None
566
 
567
- if self.use_logn_attn and not self.training:
 
568
  if self.use_cache_quantization:
569
  seq_start = key[0].size(2) - query.size(1)
570
  seq_end = key[0].size(2)
@@ -583,12 +523,19 @@ class QWenAttention(nn.Module):
583
  q, k, v = query, key, value
584
  attn_output = self.core_attention_flash(q, k, v, attention_mask=attention_mask)
585
  else:
 
 
 
 
 
 
 
586
  query = query.permute(0, 2, 1, 3)
587
  if not self.use_cache_quantization:
588
  key = key.permute(0, 2, 1, 3)
589
  value = value.permute(0, 2, 1, 3)
590
  if (
591
- registered_causal_mask is None
592
  and self.use_flash_attn
593
  and flash_attn_unpadded_func is not None
594
  and not self.is_fp32
@@ -597,13 +544,12 @@ class QWenAttention(nn.Module):
597
  raise Exception(_ERROR_INPUT_CPU_QUERY_WITH_FLASH_ATTN_ACTIVATED)
598
 
599
  if not self.use_cache_quantization and SUPPORT_TORCH2:
600
- causal_mask = registered_causal_mask[
601
- :, :, key.size(-2) - query.size(-2): key.size(-2), :key.size(-2)
602
- ]
603
  if attention_mask is not None:
604
  attention_mask = attention_mask.expand(
605
  -1, -1, causal_mask.size(2), -1
606
- ).masked_fill(~causal_mask, torch.finfo(query.dtype).min)
 
 
607
  else:
608
  attention_mask = causal_mask
609
  attn_output = F.scaled_dot_product_attention(
@@ -612,7 +558,7 @@ class QWenAttention(nn.Module):
612
  attn_weight = None
613
  else:
614
  attn_output, attn_weight = self._attn(
615
- query, key, value, registered_causal_mask, attention_mask, head_mask
616
  )
617
  context_layer = self._merge_heads(
618
  attn_output, self.num_heads, self.head_dim
@@ -628,6 +574,8 @@ class QWenAttention(nn.Module):
628
  and not self.is_fp32
629
  ):
630
  raise ValueError("Cannot output attentions while using flash-attn")
 
 
631
  else:
632
  outputs += (attn_weight,)
633
 
@@ -653,6 +601,7 @@ class QWenMLP(nn.Module):
653
  output = self.c_proj(intermediate_parallel)
654
  return output
655
 
 
656
  class QWenBlock(nn.Module):
657
  def __init__(self, config):
658
  super().__init__()
@@ -675,7 +624,6 @@ class QWenBlock(nn.Module):
675
  self,
676
  hidden_states: Optional[Tuple[torch.FloatTensor]],
677
  rotary_pos_emb_list: Optional[List[List[torch.Tensor]]] = None,
678
- registered_causal_mask: Optional[torch.Tensor] = None,
679
  layer_past: Optional[Tuple[torch.Tensor]] = None,
680
  attention_mask: Optional[torch.FloatTensor] = None,
681
  head_mask: Optional[torch.FloatTensor] = None,
@@ -689,7 +637,6 @@ class QWenBlock(nn.Module):
689
  attn_outputs = self.attn(
690
  layernorm_output,
691
  rotary_pos_emb_list,
692
- registered_causal_mask=registered_causal_mask,
693
  layer_past=layer_past,
694
  attention_mask=attention_mask,
695
  head_mask=head_mask,
@@ -723,6 +670,7 @@ class QWenPreTrainedModel(PreTrainedModel):
723
  is_parallelizable = False
724
  supports_gradient_checkpointing = True
725
  _no_split_modules = ["QWenBlock"]
 
726
 
727
  def __init__(self, *inputs, **kwargs):
728
  super().__init__(*inputs, **kwargs)
@@ -789,21 +737,6 @@ class QWenModel(QWenPreTrainedModel):
789
 
790
  self.use_flash_attn = config.use_flash_attn
791
  self.is_fp32 = not (config.bf16 or config.fp16)
792
- if (
793
- self.use_flash_attn
794
- and flash_attn_unpadded_func is not None
795
- and not self.is_fp32
796
- ):
797
- self.registered_causal_mask = None
798
- else:
799
- max_positions = config.max_position_embeddings
800
- self.register_buffer(
801
- "registered_causal_mask",
802
- torch.tril(
803
- torch.ones((max_positions, max_positions), dtype=torch.bool)
804
- ).view(1, 1, max_positions, max_positions),
805
- persistent=False,
806
- )
807
 
808
  self.h = nn.ModuleList(
809
  [
@@ -975,7 +908,6 @@ class QWenModel(QWenPreTrainedModel):
975
  create_custom_forward(block),
976
  hidden_states,
977
  rotary_pos_emb_list,
978
- self.registered_causal_mask,
979
  None,
980
  attention_mask,
981
  head_mask[i],
@@ -987,7 +919,6 @@ class QWenModel(QWenPreTrainedModel):
987
  hidden_states,
988
  layer_past=layer_past,
989
  rotary_pos_emb_list=rotary_pos_emb_list,
990
- registered_causal_mask=self.registered_causal_mask,
991
  attention_mask=attention_mask,
992
  head_mask=head_mask[i],
993
  encoder_hidden_states=encoder_hidden_states,
@@ -1031,11 +962,6 @@ class QWenLMHeadModel(QWenPreTrainedModel):
1031
  assert (
1032
  config.bf16 + config.fp16 + config.fp32 <= 1
1033
  ), "Only one of \"bf16\", \"fp16\", \"fp32\" can be true"
1034
- logger.warn(
1035
- "Warning: please make sure that you are using the latest codes and checkpoints, "
1036
- "especially if you used Qwen-7B before 09.25.2023."
1037
- "请使用最新模型和代码,尤其如果你在9月25日前已经开始使用Qwen-7B,千万注意不要使用错误代码和模型。"
1038
- )
1039
 
1040
  autoset_precision = config.bf16 + config.fp16 + config.fp32 == 0
1041
 
@@ -1094,7 +1020,6 @@ class QWenLMHeadModel(QWenPreTrainedModel):
1094
  self.lm_head.half()
1095
  self.post_init()
1096
 
1097
-
1098
  def get_output_embeddings(self):
1099
  return self.lm_head
1100
 
@@ -1104,22 +1029,13 @@ class QWenLMHeadModel(QWenPreTrainedModel):
1104
  def prepare_inputs_for_generation(
1105
  self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs
1106
  ):
1107
- token_type_ids = kwargs.get("token_type_ids", None)
1108
  if past_key_values:
1109
  input_ids = input_ids[:, -1].unsqueeze(-1)
1110
- if token_type_ids is not None:
1111
- token_type_ids = token_type_ids[:, -1].unsqueeze(-1)
1112
-
1113
- attention_mask = kwargs.get("attention_mask", None)
1114
- position_ids = kwargs.get("position_ids", None)
1115
 
1116
- if attention_mask is not None and position_ids is None:
1117
- position_ids = attention_mask.long().cumsum(-1) - 1
1118
- position_ids.masked_fill_(attention_mask == 0, 1)
1119
- if past_key_values:
1120
- position_ids = position_ids[:, -1].unsqueeze(-1)
1121
  else:
1122
- position_ids = None
1123
 
1124
  if inputs_embeds is not None and past_key_values is None:
1125
  model_inputs = {"inputs_embeds": inputs_embeds}
@@ -1130,9 +1046,7 @@ class QWenLMHeadModel(QWenPreTrainedModel):
1130
  {
1131
  "past_key_values": past_key_values,
1132
  "use_cache": kwargs.get("use_cache"),
1133
- "position_ids": position_ids,
1134
  "attention_mask": attention_mask,
1135
- "token_type_ids": token_type_ids,
1136
  }
1137
  )
1138
  return model_inputs
@@ -1403,8 +1317,7 @@ class RotaryEmbedding(torch.nn.Module):
1403
  self._ntk_alpha_cached = 1.0
1404
  self._ntk_alpha_cached_list = [1.0]
1405
 
1406
- def update_rotary_pos_emb_cache(self, max_seq_len, offset=0, ntk_alpha=1.0):
1407
- seqlen = max_seq_len + offset
1408
  if seqlen > self._seq_len_cached or ntk_alpha != self._ntk_alpha_cached:
1409
  base = self.base * ntk_alpha ** (self.dim / (self.dim - 2))
1410
  self.inv_freq = 1.0 / (
@@ -1427,10 +1340,10 @@ class RotaryEmbedding(torch.nn.Module):
1427
  cos, sin = emb.cos(), emb.sin()
1428
  self._rotary_pos_emb_cache = [cos, sin]
1429
 
1430
- def forward(self, max_seq_len, offset=0, ntk_alpha=1.0):
1431
- self.update_rotary_pos_emb_cache(max_seq_len, offset, ntk_alpha)
1432
  cos, sin = self._rotary_pos_emb_cache
1433
- return [cos[:, offset : offset + max_seq_len], sin[:, offset : offset + max_seq_len]]
1434
 
1435
 
1436
  def _rotate_half(x):
@@ -1442,23 +1355,28 @@ def _rotate_half(x):
1442
 
1443
 
1444
  def apply_rotary_pos_emb(t, freqs):
 
 
 
 
 
 
 
 
 
1445
  cos, sin = freqs
1446
- if apply_rotary_emb_func_triton is not None and t.is_cuda:
1447
- return apply_rotary_emb_func_triton(t, cos, sin)
1448
- elif apply_rotary_emb_func is not None and t.is_cuda:
1449
- t_ = t.float()
1450
- cos = cos.squeeze(0).squeeze(1)[:, : cos.shape[-1] // 2]
1451
- sin = sin.squeeze(0).squeeze(1)[:, : sin.shape[-1] // 2]
1452
- output = apply_rotary_emb_func(t_, cos, sin).type_as(t)
1453
- return output
1454
  else:
1455
- rot_dim = freqs[0].shape[-1]
1456
- cos, sin = freqs
1457
- t_, t_pass_ = t[..., :rot_dim], t[..., rot_dim:]
1458
- t_ = t_.float()
1459
- t_pass_ = t_pass_.float()
1460
- t_ = (t_ * cos) + (_rotate_half(t_) * sin)
1461
- return torch.cat((t_, t_pass_), dim=-1).type_as(t)
1462
 
1463
 
1464
  class RMSNorm(torch.nn.Module):
 
13
  import torch.nn.functional as F
14
  import torch.utils.checkpoint
15
  import warnings
 
16
 
17
  from torch.nn import CrossEntropyLoss
18
  from transformers import PreTrainedTokenizer, GenerationConfig, StoppingCriteriaList
 
79
  apply_rotary_emb_func_triton = None
80
  rms_norm = None
81
  flash_attn_unpadded_func = None
82
+ flash_attn_func = None
83
 
84
  def _import_flash_attn():
85
+ global apply_rotary_emb_func, rms_norm, flash_attn_unpadded_func, flash_attn_func
86
  try:
87
  from flash_attn.layers.rotary import apply_rotary_emb_func as __apply_rotary_emb_func
88
  apply_rotary_emb_func = __apply_rotary_emb_func
 
103
 
104
  try:
105
  import flash_attn
106
+ _flash_attn_func = None
107
  if not hasattr(flash_attn, '__version__'):
108
  from flash_attn.flash_attn_interface import flash_attn_unpadded_func as __flash_attn_unpadded_func
109
  else:
110
  if int(flash_attn.__version__.split(".")[0]) >= 2:
111
+ if int(flash_attn.__version__.split(".")[1]) >= 1:
112
+ from flash_attn.flash_attn_interface import flash_attn_func as _flash_attn_func
113
  from flash_attn.flash_attn_interface import flash_attn_varlen_func as __flash_attn_unpadded_func
114
  else:
115
  from flash_attn.flash_attn_interface import flash_attn_unpadded_func as __flash_attn_unpadded_func
116
  flash_attn_unpadded_func = __flash_attn_unpadded_func
117
+ flash_attn_func = _flash_attn_func
118
  except ImportError:
119
  logger.warn(
120
  "Warning: import flash_attn fail, please install FlashAttention to get higher efficiency "
 
211
  seqlen_k = k.shape[1]
212
  seqlen_out = seqlen_q
213
 
214
+ if flash_attn_func is not None and batch_size == 1:
215
+ dropout_p = self.dropout_p if self.training else 0
216
+ output = flash_attn_func(q, k, v, dropout_p, softmax_scale=self.softmax_scale, causal=self.causal)
217
+ return output
218
+
219
  q, k, v = [rearrange(x, "b s ... -> (b s) ...") for x in [q, k, v]]
220
  cu_seqlens_q = torch.arange(
221
  0,
 
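With flash_attn_func available, FlashSelfAttention.forward now short-circuits for batch_size == 1: the padded (batch, seq_len, n_heads, head_dim) tensors go straight into the kernel, skipping the rearrange/unpad bookkeeping of the var-len path. A hedged sketch of that call (requires a CUDA GPU, fp16/bf16 tensors, and flash-attn ≥ 2.1):

```python
# Hedged sketch of the batch-size-1 fast path (CUDA + flash-attn >= 2.1 only).
import torch
from flash_attn.flash_attn_interface import flash_attn_func

q = torch.randn(1, 128, 32, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 128, 32, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 128, 32, 128, dtype=torch.float16, device="cuda")
out = flash_attn_func(q, k, v, 0.0, causal=True)  # dropout_p=0.0, same layout as q
print(out.shape)  # torch.Size([1, 128, 32, 128])
```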
345
  warnings.warn("Failed to import KV cache kernels.")
346
  self.cache_kernels = None
347
 
348
+ def _attn(self, query, key, value, causal_mask=None, attention_mask=None, head_mask=None):
349
  device = query.device
350
  if self.use_cache_quantization:
351
  qk, qk_scale, qk_zero = key
 
370
  size_temp = value[0].size(-1)
371
  else:
372
  size_temp = value.size(-1)
373
+ attn_weights = attn_weights / (size_temp ** 0.5)
374
+
 
 
375
  mask_value = torch.finfo(attn_weights.dtype).min
376
+ if causal_mask is not None:
377
+ attn_weights = torch.where(
378
+ causal_mask, attn_weights.to(attn_weights.dtype), mask_value
379
+ )
 
 
380
 
381
  if attention_mask is not None:
382
  attn_weights = attn_weights + attention_mask
 
416
 
417
  return attn_output, attn_weights
418

 
 
419
  def _split_heads(self, tensor, num_heads, attn_head_size):
420
  new_shape = tensor.size()[:-1] + (num_heads, attn_head_size)
421
  tensor = tensor.view(new_shape)
 
430
  self,
431
  hidden_states: Optional[Tuple[torch.FloatTensor]],
432
  rotary_pos_emb_list: Optional[List[List[torch.Tensor]]] = None,
 
433
  layer_past: Optional[Tuple[torch.Tensor]] = None,
434
  attention_mask: Optional[torch.FloatTensor] = None,
435
  head_mask: Optional[torch.FloatTensor] = None,
 
503
  else:
504
  present = None
505
 
506
+ key_size = key[0].size(2) if self.use_cache_quantization else key.size(1)
507
+ if key_size > self.seq_length and self.use_logn_attn and not self.training:
508
  if self.use_cache_quantization:
509
  seq_start = key[0].size(2) - query.size(1)
510
  seq_end = key[0].size(2)
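LogN attention scaling is now applied only when the cached key length actually exceeds the training context (key_size > self.seq_length), rather than on every inference step. A rough sketch of the scale itself; the training length used here is an illustrative assumption, not a value read from this repo:

```python
# Rough sketch of the LogN scale gated above: queries at positions beyond the
# training length get multiplied by log(position)/log(train_len), otherwise 1.
import math

train_len = 2048  # illustrative assumption
scales = [math.log(i, train_len) if i > train_len else 1.0 for i in (512, 2048, 4096, 8192)]
print(scales)  # [1.0, 1.0, ~1.09, ~1.18]
```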
 
523
  q, k, v = query, key, value
524
  attn_output = self.core_attention_flash(q, k, v, attention_mask=attention_mask)
525
  else:
526
+ key_size = key[0].size(2) if self.use_cache_quantization else key.size(1)
527
+ if query.size(1) == key_size:
528
+ causal_mask = torch.tril(
529
+ torch.ones((key_size, key_size), dtype=torch.bool, device=query.device)
530
+ ).view(1, 1, key_size, key_size)
531
+ else:
532
+ causal_mask = None
533
  query = query.permute(0, 2, 1, 3)
534
  if not self.use_cache_quantization:
535
  key = key.permute(0, 2, 1, 3)
536
  value = value.permute(0, 2, 1, 3)
537
  if (
538
+ causal_mask is None
539
  and self.use_flash_attn
540
  and flash_attn_unpadded_func is not None
541
  and not self.is_fp32
 
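Instead of the registered_causal_mask buffer that used to be created in QWenModel and threaded through every layer, the mask is now built on the fly with torch.tril, and only for the prefill case where query and key lengths match; during incremental decoding (a single query token against the cache) no mask is needed. A minimal sketch of the constructed mask:

```python
# Minimal sketch of the on-the-fly mask built above: lower-triangular booleans
# shaped (1, 1, key_size, key_size), True where attention is allowed.
import torch

key_size = 4
causal_mask = torch.tril(
    torch.ones((key_size, key_size), dtype=torch.bool)
).view(1, 1, key_size, key_size)
print(causal_mask[0, 0].int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]], dtype=torch.int32)
```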
544
  raise Exception(_ERROR_INPUT_CPU_QUERY_WITH_FLASH_ATTN_ACTIVATED)
545
 
546
  if not self.use_cache_quantization and SUPPORT_TORCH2:
 
 
 
547
  if attention_mask is not None:
548
  attention_mask = attention_mask.expand(
549
  -1, -1, causal_mask.size(2), -1
550
+ )
551
+ if causal_mask is not None:
552
+ attention_mask = attention_mask.masked_fill(~causal_mask, torch.finfo(query.dtype).min)
553
  else:
554
  attention_mask = causal_mask
555
  attn_output = F.scaled_dot_product_attention(
 
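On PyTorch 2.x the non-flash path hands the boolean causal mask (combined with any padding mask) to F.scaled_dot_product_attention rather than materialising attention weights by hand, which is also why attention outputs can no longer be returned on this branch. A hedged sketch of the call with a boolean mask:

```python
# Hedged sketch of the PyTorch 2.x fallback: SDPA with a boolean mask where
# True marks positions that may be attended to.
import torch
import torch.nn.functional as F

b, h, s, d = 1, 2, 8, 16
q, k, v = (torch.randn(b, h, s, d) for _ in range(3))
causal_mask = torch.tril(torch.ones(s, s, dtype=torch.bool)).view(1, 1, s, s)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=causal_mask)
print(out.shape)  # torch.Size([1, 2, 8, 16])
```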
558
  attn_weight = None
559
  else:
560
  attn_output, attn_weight = self._attn(
561
+ query, key, value, causal_mask, attention_mask, head_mask
562
  )
563
  context_layer = self._merge_heads(
564
  attn_output, self.num_heads, self.head_dim
 
574
  and not self.is_fp32
575
  ):
576
  raise ValueError("Cannot output attentions while using flash-attn")
577
+ elif not self.use_cache_quantization and SUPPORT_TORCH2:
578
+ raise ValueError("Cannot output attentions while using scaled_dot_product_attention")
579
  else:
580
  outputs += (attn_weight,)
581
 
 
601
  output = self.c_proj(intermediate_parallel)
602
  return output
603
 
604
+
605
  class QWenBlock(nn.Module):
606
  def __init__(self, config):
607
  super().__init__()
 
624
  self,
625
  hidden_states: Optional[Tuple[torch.FloatTensor]],
626
  rotary_pos_emb_list: Optional[List[List[torch.Tensor]]] = None,
 
627
  layer_past: Optional[Tuple[torch.Tensor]] = None,
628
  attention_mask: Optional[torch.FloatTensor] = None,
629
  head_mask: Optional[torch.FloatTensor] = None,
 
637
  attn_outputs = self.attn(
638
  layernorm_output,
639
  rotary_pos_emb_list,
 
640
  layer_past=layer_past,
641
  attention_mask=attention_mask,
642
  head_mask=head_mask,
 
670
  is_parallelizable = False
671
  supports_gradient_checkpointing = True
672
  _no_split_modules = ["QWenBlock"]
673
+ _skip_keys_device_placement = "past_key_values"
674
 
675
  def __init__(self, *inputs, **kwargs):
676
  super().__init__(*inputs, **kwargs)
 
737
 
738
  self.use_flash_attn = config.use_flash_attn
739
  self.is_fp32 = not (config.bf16 or config.fp16)
 
 
740
 
741
  self.h = nn.ModuleList(
742
  [
 
908
  create_custom_forward(block),
909
  hidden_states,
910
  rotary_pos_emb_list,
 
911
  None,
912
  attention_mask,
913
  head_mask[i],
 
919
  hidden_states,
920
  layer_past=layer_past,
921
  rotary_pos_emb_list=rotary_pos_emb_list,
 
922
  attention_mask=attention_mask,
923
  head_mask=head_mask[i],
924
  encoder_hidden_states=encoder_hidden_states,
 
962
  assert (
963
  config.bf16 + config.fp16 + config.fp32 <= 1
964
  ), "Only one of \"bf16\", \"fp16\", \"fp32\" can be true"
 
 
 
 
 
965
 
966
  autoset_precision = config.bf16 + config.fp16 + config.fp32 == 0
967
 
 
1020
  self.lm_head.half()
1021
  self.post_init()
1022
 
 
1023
  def get_output_embeddings(self):
1024
  return self.lm_head
1025
 
 
1029
  def prepare_inputs_for_generation(
1030
  self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs
1031
  ):
 
1032
  if past_key_values:
1033
  input_ids = input_ids[:, -1].unsqueeze(-1)
 
 
 
 
 
1034
 
1035
+ if input_ids.size(0) == 1:
1036
+ attention_mask = None
 
 
 
1037
  else:
1038
+ attention_mask = kwargs.get("attention_mask", None)
1039
 
1040
  if inputs_embeds is not None and past_key_values is None:
1041
  model_inputs = {"inputs_embeds": inputs_embeds}
 
1046
  {
1047
  "past_key_values": past_key_values,
1048
  "use_cache": kwargs.get("use_cache"),
 
1049
  "attention_mask": attention_mask,
 
1050
  }
1051
  )
1052
  return model_inputs
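prepare_inputs_for_generation is pared down: token_type_ids and the manually derived position_ids are gone, and for single-sample generation the attention mask is dropped entirely (the causal mask built inside the attention layer is sufficient). End-user generation code is unchanged; a hedged usage sketch, assuming auto-gptq and optimum are installed for the Int4 checkpoint:

```python
# Hedged usage sketch: the public chat interface is unaffected by the
# prepare_inputs_for_generation cleanup (assumes auto-gptq/optimum installed).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4", device_map="auto", trust_remote_code=True
).eval()
response, history = model.chat(tokenizer, "Hi, introduce yourself.", history=None)
print(response)
```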
 
1317
  self._ntk_alpha_cached = 1.0
1318
  self._ntk_alpha_cached_list = [1.0]
1319
 
1320
+ def update_rotary_pos_emb_cache(self, seqlen, ntk_alpha=1.0):
 
1321
  if seqlen > self._seq_len_cached or ntk_alpha != self._ntk_alpha_cached:
1322
  base = self.base * ntk_alpha ** (self.dim / (self.dim - 2))
1323
  self.inv_freq = 1.0 / (
 
1340
  cos, sin = emb.cos(), emb.sin()
1341
  self._rotary_pos_emb_cache = [cos, sin]
1342
 
1343
+ def forward(self, max_seq_len, ntk_alpha=1.0):
1344
+ self.update_rotary_pos_emb_cache(max_seq_len, ntk_alpha)
1345
  cos, sin = self._rotary_pos_emb_cache
1346
+ return [cos[:, :max_seq_len], sin[:, :max_seq_len]]
1347
 
1348
 
1349
  def _rotate_half(x):
 
1355
 
1356
 
1357
  def apply_rotary_pos_emb(t, freqs):
1358
+ """ Apply rotary embedding to the first rotary_dim of the iput
1359
+
1360
+ Arguments:
1361
+ t (tensor(batch_size, seq_len, n_head, head_dim)):
1362
+ the input embedding/hidden states
1363
+ freqs (list[tensor(1, seq_len, 1, rotary_dim), tensor(1, seq_len, 1, rotary_dim)]):
1364
+ the cached cos/sin position embeddings
1365
+ """
1366
+ rot_dim = freqs[0].shape[-1]
1367
  cos, sin = freqs
1368
+ t_float = t.float()
1369
+ if apply_rotary_emb_func is not None and t.is_cuda:
1370
+ # apply_rotary_emb in flash_attn requires cos/sin to be of
1371
+ # shape (seqlen, rotary_dim / 2) and apply rotary embedding
1372
+ # to the first rotary_dim of the input
1373
+ cos = cos.squeeze(0).squeeze(1)[:, : rot_dim // 2]
1374
+ sin = sin.squeeze(0).squeeze(1)[:, : rot_dim // 2]
1375
+ return apply_rotary_emb_func(t_float, cos, sin).type_as(t)
1376
  else:
1377
+ t_rot, t_pass = t_float[..., :rot_dim], t_float[..., rot_dim:]
1378
+ t_rot = (t_rot * cos) + (_rotate_half(t_rot) * sin)
1379
+ return torch.cat((t_rot, t_pass), dim=-1).type_as(t)
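The rewritten apply_rotary_pos_emb drops the offset handling (the cache is always sliced from position 0) and documents the expected shapes. A tiny numeric sketch of the non-CUDA fallback branch above, with a local `_rotate_half` equivalent to the module's helper:

```python
# Tiny numeric sketch of the fallback branch: rotate the first rot_dim channels
# of t with the cached cos/sin, pass the remaining channels through unchanged.
import torch

def _rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

t = torch.randn(1, 4, 2, 8)                                 # (batch, seq_len, n_head, head_dim)
inv_freq = 1.0 / (10000 ** (torch.arange(0, 4, 2).float() / 4))
freqs = torch.outer(torch.arange(4).float(), inv_freq)
emb = torch.cat((freqs, freqs), dim=-1).view(1, 4, 1, 4)    # rot_dim = 4
cos, sin = emb.cos(), emb.sin()
rot_dim = cos.shape[-1]
t_rot, t_pass = t[..., :rot_dim], t[..., rot_dim:]
t_rot = (t_rot * cos) + (_rotate_half(t_rot) * sin)
out = torch.cat((t_rot, t_pass), dim=-1)
print(out.shape)  # torch.Size([1, 4, 2, 8])
```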
 
 
 
 
1380
 
1381
 
1382
  class RMSNorm(torch.nn.Module):
tokenizer_config.json CHANGED
@@ -1,5 +1,5 @@
1
  {
2
- "model_max_length": 8192,
3
  "tokenizer_class": "QWenTokenizer",
4
  "auto_map": {
5
  "AutoTokenizer": [
 
1
  {
2
+ "model_max_length": 32768,
3
  "tokenizer_class": "QWenTokenizer",
4
  "auto_map": {
5
  "AutoTokenizer": [