PhilipGAQ committed on
Commit
73e2bf7
·
verified ·
1 Parent(s): aed749c

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ merged_model/tokenizer.json filter=lfs diff=lfs merge=lfs -text
37
+ query_encoder/tokenizer.json filter=lfs diff=lfs merge=lfs -text
.msc ADDED
Binary file (2.58 kB).
.mv ADDED
@@ -0,0 +1 @@
1
+ Revision:master,CreatedAt:1757614282
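The `.mv` line above can be read as comma-separated `key:value` pairs. The following is a minimal sketch, assuming (the diff does not say so) that `CreatedAt` is a Unix timestamp in seconds; the helper name `parse_mv_line` is hypothetical.

```python
from datetime import datetime, timezone


def parse_mv_line(line):
    """Split a 'Key:value,Key:value' line into a dict (hypothetical helper)."""
    fields = {}
    for pair in line.strip().split(","):
        key, _, value = pair.partition(":")
        fields[key] = value
    return fields


mv_fields = parse_mv_line("Revision:master,CreatedAt:1757614282")

# Assumption: CreatedAt counts seconds since the Unix epoch.
created_utc = datetime.fromtimestamp(int(mv_fields["CreatedAt"]), tz=timezone.utc)
print(mv_fields["Revision"], created_utc.isoformat())
```

Under that assumption the value corresponds to 11 September 2025, 18:11:22 UTC.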
configuration.json ADDED
@@ -0,0 +1 @@
1
+ {"framework":"Pytorch","task":"sentence-similarity"}
doc_encoder/README.md ADDED
@@ -0,0 +1,202 @@
1
+ ---
2
+ base_model: /train20/intern/permanent/aqjiang/model/asymmed-emb-new/asym-med-8B-v7/merged_model
3
+ library_name: peft
4
+ ---
5
+
6
+ # Model Card for Model ID
7
+
8
+ <!-- Provide a quick summary of what the model is/does. -->
9
+
10
+
11
+
12
+ ## Model Details
13
+
14
+ ### Model Description
15
+
16
+ <!-- Provide a longer summary of what this model is. -->
17
+
18
+
19
+
20
+ - **Developed by:** [More Information Needed]
21
+ - **Funded by [optional]:** [More Information Needed]
22
+ - **Shared by [optional]:** [More Information Needed]
23
+ - **Model type:** [More Information Needed]
24
+ - **Language(s) (NLP):** [More Information Needed]
25
+ - **License:** [More Information Needed]
26
+ - **Finetuned from model [optional]:** [More Information Needed]
27
+
28
+ ### Model Sources [optional]
29
+
30
+ <!-- Provide the basic links for the model. -->
31
+
32
+ - **Repository:** [More Information Needed]
33
+ - **Paper [optional]:** [More Information Needed]
34
+ - **Demo [optional]:** [More Information Needed]
35
+
36
+ ## Uses
37
+
38
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
+
40
+ ### Direct Use
41
+
42
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
+
44
+ [More Information Needed]
45
+
46
+ ### Downstream Use [optional]
47
+
48
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
+
50
+ [More Information Needed]
51
+
52
+ ### Out-of-Scope Use
53
+
54
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
+
56
+ [More Information Needed]
57
+
58
+ ## Bias, Risks, and Limitations
59
+
60
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
+
62
+ [More Information Needed]
63
+
64
+ ### Recommendations
65
+
66
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
+
68
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
+
70
+ ## How to Get Started with the Model
71
+
72
+ Use the code below to get started with the model.
73
+
74
+ [More Information Needed]
75
+
76
+ ## Training Details
77
+
78
+ ### Training Data
79
+
80
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
+
82
+ [More Information Needed]
83
+
84
+ ### Training Procedure
85
+
86
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
+
88
+ #### Preprocessing [optional]
89
+
90
+ [More Information Needed]
91
+
92
+
93
+ #### Training Hyperparameters
94
+
95
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
+
97
+ #### Speeds, Sizes, Times [optional]
98
+
99
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
+
101
+ [More Information Needed]
102
+
103
+ ## Evaluation
104
+
105
+ <!-- This section describes the evaluation protocols and provides the results. -->
106
+
107
+ ### Testing Data, Factors & Metrics
108
+
109
+ #### Testing Data
110
+
111
+ <!-- This should link to a Dataset Card if possible. -->
112
+
113
+ [More Information Needed]
114
+
115
+ #### Factors
116
+
117
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
+
119
+ [More Information Needed]
120
+
121
+ #### Metrics
122
+
123
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
+
125
+ [More Information Needed]
126
+
127
+ ### Results
128
+
129
+ [More Information Needed]
130
+
131
+ #### Summary
132
+
133
+
134
+
135
+ ## Model Examination [optional]
136
+
137
+ <!-- Relevant interpretability work for the model goes here -->
138
+
139
+ [More Information Needed]
140
+
141
+ ## Environmental Impact
142
+
143
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
+
145
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
+
147
+ - **Hardware Type:** [More Information Needed]
148
+ - **Hours used:** [More Information Needed]
149
+ - **Cloud Provider:** [More Information Needed]
150
+ - **Compute Region:** [More Information Needed]
151
+ - **Carbon Emitted:** [More Information Needed]
152
+
153
+ ## Technical Specifications [optional]
154
+
155
+ ### Model Architecture and Objective
156
+
157
+ [More Information Needed]
158
+
159
+ ### Compute Infrastructure
160
+
161
+ [More Information Needed]
162
+
163
+ #### Hardware
164
+
165
+ [More Information Needed]
166
+
167
+ #### Software
168
+
169
+ [More Information Needed]
170
+
171
+ ## Citation [optional]
172
+
173
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
+
175
+ **BibTeX:**
176
+
177
+ [More Information Needed]
178
+
179
+ **APA:**
180
+
181
+ [More Information Needed]
182
+
183
+ ## Glossary [optional]
184
+
185
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
+
187
+ [More Information Needed]
188
+
189
+ ## More Information [optional]
190
+
191
+ [More Information Needed]
192
+
193
+ ## Model Card Authors [optional]
194
+
195
+ [More Information Needed]
196
+
197
+ ## Model Card Contact
198
+
199
+ [More Information Needed]
200
+ ### Framework versions
201
+
202
+ - PEFT 0.13.2
doc_encoder/adapter_config.json ADDED
@@ -0,0 +1,34 @@
1
+ {
2
+ "alpha_pattern": {},
3
+ "auto_mapping": null,
4
+ "base_model_name_or_path": "/train20/intern/permanent/aqjiang/model/asymmed-emb-new/asym-med-8B-v7/merged_model",
5
+ "bias": "none",
6
+ "fan_in_fan_out": false,
7
+ "inference_mode": true,
8
+ "init_lora_weights": true,
9
+ "layer_replication": null,
10
+ "layers_pattern": null,
11
+ "layers_to_transform": null,
12
+ "loftq_config": {},
13
+ "lora_alpha": 64.0,
14
+ "lora_dropout": 0.1,
15
+ "megatron_config": null,
16
+ "megatron_core": "megatron.core",
17
+ "modules_to_save": null,
18
+ "peft_type": "LORA",
19
+ "r": 32,
20
+ "rank_pattern": {},
21
+ "revision": null,
22
+ "target_modules": [
23
+ "k_proj",
24
+ "up_proj",
25
+ "q_proj",
26
+ "v_proj",
27
+ "down_proj",
28
+ "o_proj",
29
+ "gate_proj"
30
+ ],
31
+ "task_type": "FEATURE_EXTRACTION",
32
+ "use_dora": false,
33
+ "use_rslora": false
34
+ }
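The adapter settings above imply a LoRA scaling factor and a trainable-parameter count. The sketch below works both out, assuming the adapter's base model has the dimensions listed in `merged_model/config.json` from this same commit and that every targeted module is a bias-free linear layer. With `use_rslora` set to false, the usual LoRA scaling is `lora_alpha / r`. The names `shapes` and `adapter_params` are illustrative only.

```python
# Values copied from doc_encoder/adapter_config.json
r = 32
lora_alpha = 64.0
target_modules = ["k_proj", "up_proj", "q_proj", "v_proj",
                  "down_proj", "o_proj", "gate_proj"]

# Values copied from merged_model/config.json
# (assumption: this config describes the adapter's base model)
hidden_size = 4096
intermediate_size = 12288
num_attention_heads = 32
num_key_value_heads = 8
head_dim = 128
num_hidden_layers = 36

# Standard (non-rsLoRA) scaling applied to the low-rank update
scaling = lora_alpha / r

q_out = num_attention_heads * head_dim
kv_out = num_key_value_heads * head_dim

# (in_features, out_features) of each targeted linear layer
shapes = {
    "q_proj": (hidden_size, q_out),
    "k_proj": (hidden_size, kv_out),
    "v_proj": (hidden_size, kv_out),
    "o_proj": (q_out, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}

# Each LoRA pair holds A (r x in) and B (out x r): r * (in + out) values
per_layer = sum(r * (shapes[m][0] + shapes[m][1]) for m in target_modules)
adapter_params = per_layer * num_hidden_layers

# Size recorded in the adapter_model.safetensors LFS pointer of this commit
adapter_file_bytes = 174653016
residual_bytes = adapter_file_bytes - 2 * adapter_params

print(scaling, per_layer, adapter_params, residual_bytes)
```

The small residual left after assuming 2 bytes per value is consistent with 16-bit adapter weights plus a safetensors header, though the diff itself does not state the storage dtype.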
doc_encoder/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2489c82b4049268a477f668f6bbd4c7e19938405428874c4ba0151835d6c627e
3
+ size 174653016
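The three added lines are a Git LFS pointer rather than the weights themselves: one `key value` pair per line. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper, not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its fields (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>"
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:2489c82b4049268a477f668f6bbd4c7e19938405428874c4ba0151835d6c627e
size 174653016
"""

pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algo"], pointer["size"])
```

The `size` field is the byte length of the real file stored in LFS, so it can be checked against a download before loading the adapter.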
merged_model/added_tokens.json ADDED
@@ -0,0 +1,28 @@
1
+ {
2
+ "</think>": 151668,
3
+ "</tool_call>": 151658,
4
+ "</tool_response>": 151666,
5
+ "<think>": 151667,
6
+ "<tool_call>": 151657,
7
+ "<tool_response>": 151665,
8
+ "<|box_end|>": 151649,
9
+ "<|box_start|>": 151648,
10
+ "<|endoftext|>": 151643,
11
+ "<|file_sep|>": 151664,
12
+ "<|fim_middle|>": 151660,
13
+ "<|fim_pad|>": 151662,
14
+ "<|fim_prefix|>": 151659,
15
+ "<|fim_suffix|>": 151661,
16
+ "<|im_end|>": 151645,
17
+ "<|im_start|>": 151644,
18
+ "<|image_pad|>": 151655,
19
+ "<|object_ref_end|>": 151647,
20
+ "<|object_ref_start|>": 151646,
21
+ "<|quad_end|>": 151651,
22
+ "<|quad_start|>": 151650,
23
+ "<|repo_name|>": 151663,
24
+ "<|video_pad|>": 151656,
25
+ "<|vision_end|>": 151653,
26
+ "<|vision_pad|>": 151654,
27
+ "<|vision_start|>": 151652
28
+ }
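A quick sanity check on the added tokens is sketched below. It confirms that the IDs form one contiguous block and that all of them fit inside the `vocab_size` of 151936 given in `merged_model/config.json`. The dictionary is copied from the diff above; the variable names are illustrative.

```python
# Copied from merged_model/added_tokens.json in this commit
added_tokens = {
    "</think>": 151668, "</tool_call>": 151658, "</tool_response>": 151666,
    "<think>": 151667, "<tool_call>": 151657, "<tool_response>": 151665,
    "<|box_end|>": 151649, "<|box_start|>": 151648, "<|endoftext|>": 151643,
    "<|file_sep|>": 151664, "<|fim_middle|>": 151660, "<|fim_pad|>": 151662,
    "<|fim_prefix|>": 151659, "<|fim_suffix|>": 151661, "<|im_end|>": 151645,
    "<|im_start|>": 151644, "<|image_pad|>": 151655,
    "<|object_ref_end|>": 151647, "<|object_ref_start|>": 151646,
    "<|quad_end|>": 151651, "<|quad_start|>": 151650, "<|repo_name|>": 151663,
    "<|video_pad|>": 151656, "<|vision_end|>": 151653,
    "<|vision_pad|>": 151654, "<|vision_start|>": 151652,
}

vocab_size = 151936  # from merged_model/config.json

ids = sorted(added_tokens.values())
is_contiguous = ids == list(range(ids[0], ids[-1] + 1))
all_in_vocab = all(0 <= token_id < vocab_size for token_id in ids)

# bos_token_id and eos_token_id from merged_model/config.json
id_to_token = {token_id: token for token, token_id in added_tokens.items()}
bos_token = id_to_token[151643]
eos_token = id_to_token[151645]

print(len(ids), ids[0], ids[-1], is_contiguous, all_in_vocab, bos_token, eos_token)
```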
merged_model/config.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "architectures": [
3
+ "Qwen3Model"
4
+ ],
5
+ "attention_bias": false,
6
+ "attention_dropout": 0.0,
7
+ "bos_token_id": 151643,
8
+ "eos_token_id": 151645,
9
+ "head_dim": 128,
10
+ "hidden_act": "silu",
11
+ "hidden_size": 4096,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 12288,
14
+ "max_position_embeddings": 40960,
15
+ "max_window_layers": 36,
16
+ "model_type": "qwen3",
17
+ "num_attention_heads": 32,
18
+ "num_hidden_layers": 36,
19
+ "num_key_value_heads": 8,
20
+ "rms_norm_eps": 1e-06,
21
+ "rope_scaling": null,
22
+ "rope_theta": 1000000,
23
+ "sliding_window": null,
24
+ "tie_word_embeddings": false,
25
+ "torch_dtype": "float32",
26
+ "transformers_version": "4.51.1",
27
+ "use_cache": false,
28
+ "use_sliding_window": false,
29
+ "vocab_size": 151936
30
+ }
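The configuration above is enough to estimate the model's parameter count and compare it with the `total_size` of 30273622016 bytes recorded in `merged_model/model.safetensors.index.json` in this commit. The sketch assumes bias-free projections (`attention_bias` is false), per-head `q_norm`/`k_norm` weights of length `head_dim`, and a single final norm weight of length `hidden_size`. That last tensor is an assumption, since it does not appear in the visible part of the weight map.

```python
# Values copied from merged_model/config.json
vocab_size = 151936
hidden_size = 4096
intermediate_size = 12288
num_hidden_layers = 36
num_attention_heads = 32
num_key_value_heads = 8
head_dim = 128
bytes_per_param = 4  # torch_dtype is float32

q_out = num_attention_heads * head_dim
kv_out = num_key_value_heads * head_dim

embed = vocab_size * hidden_size

attention = (
    hidden_size * q_out          # q_proj
    + 2 * hidden_size * kv_out   # k_proj and v_proj
    + q_out * hidden_size        # o_proj
    + 2 * head_dim               # q_norm and k_norm
)
mlp = 3 * hidden_size * intermediate_size  # gate_proj, up_proj, down_proj
layer_norms = 2 * hidden_size              # input and post-attention norms

per_layer = attention + mlp + layer_norms

# Assumption: one final norm weight of length hidden_size
final_norm = hidden_size

total_params = embed + num_hidden_layers * per_layer + final_norm
total_bytes = total_params * bytes_per_param

print(total_params, total_bytes)
```

Under these assumptions the estimate is about 7.57 billion parameters, and the byte count equals the index's `total_size` exactly.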
merged_model/merges.txt ADDED
The diff for this file is too large to render.
merged_model/model-00001-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ab3656de4af5c665f4ce1d37771255d54391036430a287b7a7e0569f7850cc9d
3
+ size 4972454136
merged_model/model-00002-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2056b2988fef3f42a4f98da4905ba799b53ec6c9256a9556ced3eb8ee5e1e22a
3
+ size 4832048200
merged_model/model-00003-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6b9fc7579d310fe65cdb33bf2b0c25172fcbe3f62450e572a9f50fe20cff35ce
3
+ size 4832048248
merged_model/model-00004-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:243acabb6075117e224294ae829fe18659820e4661239556bf37082e84106448
3
+ size 4999855080
merged_model/model-00005-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:911db739134122f45cf339653e52d4917e42121aa7f007162e173fd120585c40
3
+ size 4832048272
merged_model/model-00006-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4db72a2a8588178fcd9eb76f827ac55c1f0182860da6c151a9b7e89237731886
3
+ size 4832048264
merged_model/model-00007-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:385bbead79a0c3c123b801a44163d8a3edbbe4513e07150b2689d3ac25f42d39
3
+ size 973163080
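The seven shard sizes in the LFS pointers above can be cross-checked against the `total_size` of 30273622016 bytes in `merged_model/model.safetensors.index.json`. A minimal sketch, assuming `total_size` counts only tensor bytes while each shard file also carries its own safetensors header:

```python
# Sizes copied from the seven LFS pointers in this commit
shard_sizes = {
    "model-00001-of-00007.safetensors": 4972454136,
    "model-00002-of-00007.safetensors": 4832048200,
    "model-00003-of-00007.safetensors": 4832048248,
    "model-00004-of-00007.safetensors": 4999855080,
    "model-00005-of-00007.safetensors": 4832048272,
    "model-00006-of-00007.safetensors": 4832048264,
    "model-00007-of-00007.safetensors": 973163080,
}

# metadata.total_size from model.safetensors.index.json
index_total_size = 30273622016

disk_total = sum(shard_sizes.values())

# Bytes on disk not accounted for by tensor data (assumed to be headers)
overhead = disk_total - index_total_size

print(disk_total, overhead)
```

A small positive overhead is the expected outcome. A negative or very large difference would suggest a missing or truncated shard.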
merged_model/model.safetensors.index.json ADDED
@@ -0,0 +1,405 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 30273622016
4
+ },
5
+ "weight_map": {
6
+ "embed_tokens.weight": "model-00001-of-00007.safetensors",
7
+ "layers.0.input_layernorm.weight": "model-00001-of-00007.safetensors",
8
+ "layers.0.mlp.down_proj.weight": "model-00001-of-00007.safetensors",
9
+ "layers.0.mlp.gate_proj.weight": "model-00001-of-00007.safetensors",
10
+ "layers.0.mlp.up_proj.weight": "model-00001-of-00007.safetensors",
11
+ "layers.0.post_attention_layernorm.weight": "model-00001-of-00007.safetensors",
12
+ "layers.0.self_attn.k_norm.weight": "model-00001-of-00007.safetensors",
13
+ "layers.0.self_attn.k_proj.weight": "model-00001-of-00007.safetensors",
14
+ "layers.0.self_attn.o_proj.weight": "model-00001-of-00007.safetensors",
15
+ "layers.0.self_attn.q_norm.weight": "model-00001-of-00007.safetensors",
16
+ "layers.0.self_attn.q_proj.weight": "model-00001-of-00007.safetensors",
17
+ "layers.0.self_attn.v_proj.weight": "model-00001-of-00007.safetensors",
18
+ "layers.1.input_layernorm.weight": "model-00001-of-00007.safetensors",
19
+ "layers.1.mlp.down_proj.weight": "model-00001-of-00007.safetensors",
20
+ "layers.1.mlp.gate_proj.weight": "model-00001-of-00007.safetensors",
21
+ "layers.1.mlp.up_proj.weight": "model-00001-of-00007.safetensors",
22
+ "layers.1.post_attention_layernorm.weight": "model-00001-of-00007.safetensors",
23
+ "layers.1.self_attn.k_norm.weight": "model-00001-of-00007.safetensors",
24
+ "layers.1.self_attn.k_proj.weight": "model-00001-of-00007.safetensors",
25
+ "layers.1.self_attn.o_proj.weight": "model-00001-of-00007.safetensors",
26
+ "layers.1.self_attn.q_norm.weight": "model-00001-of-00007.safetensors",
27
+ "layers.1.self_attn.q_proj.weight": "model-00001-of-00007.safetensors",
28
+ "layers.1.self_attn.v_proj.weight": "model-00001-of-00007.safetensors",
29
+ "layers.10.input_layernorm.weight": "model-00003-of-00007.safetensors",
30
+ "layers.10.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
31
+ "layers.10.mlp.gate_proj.weight": "model-00003-of-00007.safetensors",
32
+ "layers.10.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
33
+ "layers.10.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
34
+ "layers.10.self_attn.k_norm.weight": "model-00003-of-00007.safetensors",
35
+ "layers.10.self_attn.k_proj.weight": "model-00003-of-00007.safetensors",
36
+ "layers.10.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
37
+ "layers.10.self_attn.q_norm.weight": "model-00003-of-00007.safetensors",
38
+ "layers.10.self_attn.q_proj.weight": "model-00003-of-00007.safetensors",
39
+ "layers.10.self_attn.v_proj.weight": "model-00003-of-00007.safetensors",
40
+ "layers.11.input_layernorm.weight": "model-00003-of-00007.safetensors",
41
+ "layers.11.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
42
+ "layers.11.mlp.gate_proj.weight": "model-00003-of-00007.safetensors",
43
+ "layers.11.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
44
+ "layers.11.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
45
+ "layers.11.self_attn.k_norm.weight": "model-00003-of-00007.safetensors",
46
+ "layers.11.self_attn.k_proj.weight": "model-00003-of-00007.safetensors",
47
+ "layers.11.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
48
+ "layers.11.self_attn.q_norm.weight": "model-00003-of-00007.safetensors",
49
+ "layers.11.self_attn.q_proj.weight": "model-00003-of-00007.safetensors",
50
+ "layers.11.self_attn.v_proj.weight": "model-00003-of-00007.safetensors",
51
+ "layers.12.input_layernorm.weight": "model-00003-of-00007.safetensors",
52
+ "layers.12.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
53
+ "layers.12.mlp.gate_proj.weight": "model-00003-of-00007.safetensors",
54
+ "layers.12.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
55
+ "layers.12.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
56
+ "layers.12.self_attn.k_norm.weight": "model-00003-of-00007.safetensors",
57
+ "layers.12.self_attn.k_proj.weight": "model-00003-of-00007.safetensors",
58
+ "layers.12.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
59
+ "layers.12.self_attn.q_norm.weight": "model-00003-of-00007.safetensors",
60
+ "layers.12.self_attn.q_proj.weight": "model-00003-of-00007.safetensors",
61
+ "layers.12.self_attn.v_proj.weight": "model-00003-of-00007.safetensors",
62
+ "layers.13.input_layernorm.weight": "model-00003-of-00007.safetensors",
63
+ "layers.13.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
64
+ "layers.13.mlp.gate_proj.weight": "model-00003-of-00007.safetensors",
65
+ "layers.13.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
66
+ "layers.13.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
67
+ "layers.13.self_attn.k_norm.weight": "model-00003-of-00007.safetensors",
68
+ "layers.13.self_attn.k_proj.weight": "model-00003-of-00007.safetensors",
69
+ "layers.13.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
70
+ "layers.13.self_attn.q_norm.weight": "model-00003-of-00007.safetensors",
71
+ "layers.13.self_attn.q_proj.weight": "model-00003-of-00007.safetensors",
72
+ "layers.13.self_attn.v_proj.weight": "model-00003-of-00007.safetensors",
73
+ "layers.14.input_layernorm.weight": "model-00003-of-00007.safetensors",
74
+ "layers.14.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
75
+ "layers.14.mlp.gate_proj.weight": "model-00003-of-00007.safetensors",
76
+ "layers.14.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
77
+ "layers.14.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
78
+ "layers.14.self_attn.k_norm.weight": "model-00003-of-00007.safetensors",
79
+ "layers.14.self_attn.k_proj.weight": "model-00003-of-00007.safetensors",
80
+ "layers.14.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
81
+ "layers.14.self_attn.q_norm.weight": "model-00003-of-00007.safetensors",
82
+ "layers.14.self_attn.q_proj.weight": "model-00003-of-00007.safetensors",
83
+ "layers.14.self_attn.v_proj.weight": "model-00003-of-00007.safetensors",
84
+ "layers.15.input_layernorm.weight": "model-00004-of-00007.safetensors",
85
+ "layers.15.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
86
+ "layers.15.mlp.gate_proj.weight": "model-00003-of-00007.safetensors",
87
+ "layers.15.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
88
+ "layers.15.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
89
+ "layers.15.self_attn.k_norm.weight": "model-00003-of-00007.safetensors",
90
+ "layers.15.self_attn.k_proj.weight": "model-00003-of-00007.safetensors",
91
+ "layers.15.self_attn.o_proj.weight": "model-00003-of-00007.safetensors",
92
+ "layers.15.self_attn.q_norm.weight": "model-00003-of-00007.safetensors",
93
+ "layers.15.self_attn.q_proj.weight": "model-00003-of-00007.safetensors",
94
+ "layers.15.self_attn.v_proj.weight": "model-00003-of-00007.safetensors",
95
+ "layers.16.input_layernorm.weight": "model-00004-of-00007.safetensors",
96
+ "layers.16.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
97
+ "layers.16.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
98
+ "layers.16.mlp.up_proj.weight": "model-00004-of-00007.safetensors",
99
+ "layers.16.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
100
+ "layers.16.self_attn.k_norm.weight": "model-00004-of-00007.safetensors",
101
+ "layers.16.self_attn.k_proj.weight": "model-00004-of-00007.safetensors",
102
+ "layers.16.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
103
+ "layers.16.self_attn.q_norm.weight": "model-00004-of-00007.safetensors",
104
+ "layers.16.self_attn.q_proj.weight": "model-00004-of-00007.safetensors",
105
+ "layers.16.self_attn.v_proj.weight": "model-00004-of-00007.safetensors",
106
+ "layers.17.input_layernorm.weight": "model-00004-of-00007.safetensors",
107
+ "layers.17.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
108
+ "layers.17.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
109
+ "layers.17.mlp.up_proj.weight": "model-00004-of-00007.safetensors",
110
+ "layers.17.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
111
+ "layers.17.self_attn.k_norm.weight": "model-00004-of-00007.safetensors",
112
+ "layers.17.self_attn.k_proj.weight": "model-00004-of-00007.safetensors",
113
+ "layers.17.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
114
+ "layers.17.self_attn.q_norm.weight": "model-00004-of-00007.safetensors",
115
+ "layers.17.self_attn.q_proj.weight": "model-00004-of-00007.safetensors",
116
+ "layers.17.self_attn.v_proj.weight": "model-00004-of-00007.safetensors",
117
+ "layers.18.input_layernorm.weight": "model-00004-of-00007.safetensors",
118
+ "layers.18.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
119
+ "layers.18.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
120
+ "layers.18.mlp.up_proj.weight": "model-00004-of-00007.safetensors",
121
+ "layers.18.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
122
+ "layers.18.self_attn.k_norm.weight": "model-00004-of-00007.safetensors",
123
+ "layers.18.self_attn.k_proj.weight": "model-00004-of-00007.safetensors",
124
+ "layers.18.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
125
+ "layers.18.self_attn.q_norm.weight": "model-00004-of-00007.safetensors",
126
+ "layers.18.self_attn.q_proj.weight": "model-00004-of-00007.safetensors",
127
+ "layers.18.self_attn.v_proj.weight": "model-00004-of-00007.safetensors",
128
+ "layers.19.input_layernorm.weight": "model-00004-of-00007.safetensors",
129
+ "layers.19.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
130
+ "layers.19.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
131
+ "layers.19.mlp.up_proj.weight": "model-00004-of-00007.safetensors",
132
+ "layers.19.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
133
+ "layers.19.self_attn.k_norm.weight": "model-00004-of-00007.safetensors",
134
+ "layers.19.self_attn.k_proj.weight": "model-00004-of-00007.safetensors",
135
+ "layers.19.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
136
+ "layers.19.self_attn.q_norm.weight": "model-00004-of-00007.safetensors",
137
+ "layers.19.self_attn.q_proj.weight": "model-00004-of-00007.safetensors",
138
+ "layers.19.self_attn.v_proj.weight": "model-00004-of-00007.safetensors",
139
+ "layers.2.input_layernorm.weight": "model-00001-of-00007.safetensors",
140
+ "layers.2.mlp.down_proj.weight": "model-00001-of-00007.safetensors",
141
+ "layers.2.mlp.gate_proj.weight": "model-00001-of-00007.safetensors",
142
+ "layers.2.mlp.up_proj.weight": "model-00001-of-00007.safetensors",
143
+ "layers.2.post_attention_layernorm.weight": "model-00001-of-00007.safetensors",
144
+ "layers.2.self_attn.k_norm.weight": "model-00001-of-00007.safetensors",
145
+ "layers.2.self_attn.k_proj.weight": "model-00001-of-00007.safetensors",
146
+ "layers.2.self_attn.o_proj.weight": "model-00001-of-00007.safetensors",
147
+ "layers.2.self_attn.q_norm.weight": "model-00001-of-00007.safetensors",
148
+ "layers.2.self_attn.q_proj.weight": "model-00001-of-00007.safetensors",
149
+ "layers.2.self_attn.v_proj.weight": "model-00001-of-00007.safetensors",
150
+ "layers.20.input_layernorm.weight": "model-00004-of-00007.safetensors",
151
+ "layers.20.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
152
+ "layers.20.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
153
+ "layers.20.mlp.up_proj.weight": "model-00004-of-00007.safetensors",
154
+ "layers.20.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
155
+ "layers.20.self_attn.k_norm.weight": "model-00004-of-00007.safetensors",
156
+ "layers.20.self_attn.k_proj.weight": "model-00004-of-00007.safetensors",
157
+ "layers.20.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
158
+ "layers.20.self_attn.q_norm.weight": "model-00004-of-00007.safetensors",
159
+ "layers.20.self_attn.q_proj.weight": "model-00004-of-00007.safetensors",
160
+ "layers.20.self_attn.v_proj.weight": "model-00004-of-00007.safetensors",
161
+ "layers.21.input_layernorm.weight": "model-00004-of-00007.safetensors",
162
+ "layers.21.mlp.down_proj.weight": "model-00004-of-00007.safetensors",
163
+ "layers.21.mlp.gate_proj.weight": "model-00004-of-00007.safetensors",
164
+ "layers.21.mlp.up_proj.weight": "model-00004-of-00007.safetensors",
165
+ "layers.21.post_attention_layernorm.weight": "model-00004-of-00007.safetensors",
166
+ "layers.21.self_attn.k_norm.weight": "model-00004-of-00007.safetensors",
167
+ "layers.21.self_attn.k_proj.weight": "model-00004-of-00007.safetensors",
168
+ "layers.21.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
169
+ "layers.21.self_attn.q_norm.weight": "model-00004-of-00007.safetensors",
170
+ "layers.21.self_attn.q_proj.weight": "model-00004-of-00007.safetensors",
171
+ "layers.21.self_attn.v_proj.weight": "model-00004-of-00007.safetensors",
172
+ "layers.22.input_layernorm.weight": "model-00005-of-00007.safetensors",
173
+ "layers.22.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
174
+ "layers.22.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
175
+ "layers.22.mlp.up_proj.weight": "model-00005-of-00007.safetensors",
176
+ "layers.22.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
177
+ "layers.22.self_attn.k_norm.weight": "model-00004-of-00007.safetensors",
178
+ "layers.22.self_attn.k_proj.weight": "model-00004-of-00007.safetensors",
179
+ "layers.22.self_attn.o_proj.weight": "model-00004-of-00007.safetensors",
180
+ "layers.22.self_attn.q_norm.weight": "model-00004-of-00007.safetensors",
181
+ "layers.22.self_attn.q_proj.weight": "model-00004-of-00007.safetensors",
182
+ "layers.22.self_attn.v_proj.weight": "model-00004-of-00007.safetensors",
183
+ "layers.23.input_layernorm.weight": "model-00005-of-00007.safetensors",
184
+ "layers.23.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
185
+ "layers.23.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
186
+ "layers.23.mlp.up_proj.weight": "model-00005-of-00007.safetensors",
187
+ "layers.23.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
188
+ "layers.23.self_attn.k_norm.weight": "model-00005-of-00007.safetensors",
189
+ "layers.23.self_attn.k_proj.weight": "model-00005-of-00007.safetensors",
190
+ "layers.23.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
191
+ "layers.23.self_attn.q_norm.weight": "model-00005-of-00007.safetensors",
192
+ "layers.23.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
193
+ "layers.23.self_attn.v_proj.weight": "model-00005-of-00007.safetensors",
194
+ "layers.24.input_layernorm.weight": "model-00005-of-00007.safetensors",
195
+ "layers.24.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
196
+ "layers.24.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
197
+ "layers.24.mlp.up_proj.weight": "model-00005-of-00007.safetensors",
198
+ "layers.24.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
199
+ "layers.24.self_attn.k_norm.weight": "model-00005-of-00007.safetensors",
200
+ "layers.24.self_attn.k_proj.weight": "model-00005-of-00007.safetensors",
201
+ "layers.24.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
202
+ "layers.24.self_attn.q_norm.weight": "model-00005-of-00007.safetensors",
203
+ "layers.24.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
204
+ "layers.24.self_attn.v_proj.weight": "model-00005-of-00007.safetensors",
205
+ "layers.25.input_layernorm.weight": "model-00005-of-00007.safetensors",
206
+ "layers.25.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
207
+ "layers.25.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
208
+ "layers.25.mlp.up_proj.weight": "model-00005-of-00007.safetensors",
209
+ "layers.25.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
210
+ "layers.25.self_attn.k_norm.weight": "model-00005-of-00007.safetensors",
211
+ "layers.25.self_attn.k_proj.weight": "model-00005-of-00007.safetensors",
212
+ "layers.25.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
213
+ "layers.25.self_attn.q_norm.weight": "model-00005-of-00007.safetensors",
214
+ "layers.25.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
215
+ "layers.25.self_attn.v_proj.weight": "model-00005-of-00007.safetensors",
216
+ "layers.26.input_layernorm.weight": "model-00005-of-00007.safetensors",
217
+ "layers.26.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
218
+ "layers.26.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
219
+ "layers.26.mlp.up_proj.weight": "model-00005-of-00007.safetensors",
220
+ "layers.26.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
221
+ "layers.26.self_attn.k_norm.weight": "model-00005-of-00007.safetensors",
222
+ "layers.26.self_attn.k_proj.weight": "model-00005-of-00007.safetensors",
223
+ "layers.26.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
224
+ "layers.26.self_attn.q_norm.weight": "model-00005-of-00007.safetensors",
225
+ "layers.26.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
226
+ "layers.26.self_attn.v_proj.weight": "model-00005-of-00007.safetensors",
227
+ "layers.27.input_layernorm.weight": "model-00005-of-00007.safetensors",
228
+ "layers.27.mlp.down_proj.weight": "model-00005-of-00007.safetensors",
229
+ "layers.27.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
230
+ "layers.27.mlp.up_proj.weight": "model-00005-of-00007.safetensors",
231
+ "layers.27.post_attention_layernorm.weight": "model-00005-of-00007.safetensors",
232
+ "layers.27.self_attn.k_norm.weight": "model-00005-of-00007.safetensors",
233
+ "layers.27.self_attn.k_proj.weight": "model-00005-of-00007.safetensors",
234
+ "layers.27.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
235
+ "layers.27.self_attn.q_norm.weight": "model-00005-of-00007.safetensors",
236
+ "layers.27.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
237
+ "layers.27.self_attn.v_proj.weight": "model-00005-of-00007.safetensors",
238
+ "layers.28.input_layernorm.weight": "model-00006-of-00007.safetensors",
239
+ "layers.28.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
240
+ "layers.28.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
241
+ "layers.28.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
242
+ "layers.28.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
243
+ "layers.28.self_attn.k_norm.weight": "model-00005-of-00007.safetensors",
244
+ "layers.28.self_attn.k_proj.weight": "model-00005-of-00007.safetensors",
245
+ "layers.28.self_attn.o_proj.weight": "model-00005-of-00007.safetensors",
246
+ "layers.28.self_attn.q_norm.weight": "model-00005-of-00007.safetensors",
247
+ "layers.28.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
248
+ "layers.28.self_attn.v_proj.weight": "model-00005-of-00007.safetensors",
249
+ "layers.29.input_layernorm.weight": "model-00006-of-00007.safetensors",
250
+ "layers.29.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
251
+ "layers.29.mlp.gate_proj.weight": "model-00006-of-00007.safetensors",
252
+ "layers.29.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
253
+ "layers.29.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
254
+ "layers.29.self_attn.k_norm.weight": "model-00006-of-00007.safetensors",
255
+ "layers.29.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
256
+ "layers.29.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
257
+ "layers.29.self_attn.q_norm.weight": "model-00006-of-00007.safetensors",
258
+ "layers.29.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
259
+ "layers.29.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
260
+ "layers.3.input_layernorm.weight": "model-00002-of-00007.safetensors",
261
+ "layers.3.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
262
+ "layers.3.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
263
+ "layers.3.mlp.up_proj.weight": "model-00002-of-00007.safetensors",
264
+ "layers.3.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
265
+ "layers.3.self_attn.k_norm.weight": "model-00001-of-00007.safetensors",
266
+ "layers.3.self_attn.k_proj.weight": "model-00001-of-00007.safetensors",
267
+ "layers.3.self_attn.o_proj.weight": "model-00001-of-00007.safetensors",
268
+ "layers.3.self_attn.q_norm.weight": "model-00001-of-00007.safetensors",
269
+ "layers.3.self_attn.q_proj.weight": "model-00001-of-00007.safetensors",
270
+ "layers.3.self_attn.v_proj.weight": "model-00001-of-00007.safetensors",
271
+ "layers.30.input_layernorm.weight": "model-00006-of-00007.safetensors",
272
+ "layers.30.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
273
+ "layers.30.mlp.gate_proj.weight": "model-00006-of-00007.safetensors",
274
+ "layers.30.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
275
+ "layers.30.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
276
+ "layers.30.self_attn.k_norm.weight": "model-00006-of-00007.safetensors",
277
+ "layers.30.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
278
+ "layers.30.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
279
+ "layers.30.self_attn.q_norm.weight": "model-00006-of-00007.safetensors",
280
+ "layers.30.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
281
+ "layers.30.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
282
+ "layers.31.input_layernorm.weight": "model-00006-of-00007.safetensors",
283
+ "layers.31.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
284
+ "layers.31.mlp.gate_proj.weight": "model-00006-of-00007.safetensors",
285
+ "layers.31.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
286
+ "layers.31.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
287
+ "layers.31.self_attn.k_norm.weight": "model-00006-of-00007.safetensors",
288
+ "layers.31.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
289
+ "layers.31.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
290
+ "layers.31.self_attn.q_norm.weight": "model-00006-of-00007.safetensors",
291
+ "layers.31.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
292
+ "layers.31.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
293
+ "layers.32.input_layernorm.weight": "model-00006-of-00007.safetensors",
294
+ "layers.32.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
295
+ "layers.32.mlp.gate_proj.weight": "model-00006-of-00007.safetensors",
296
+ "layers.32.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
297
+ "layers.32.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
298
+ "layers.32.self_attn.k_norm.weight": "model-00006-of-00007.safetensors",
299
+ "layers.32.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
300
+ "layers.32.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
301
+ "layers.32.self_attn.q_norm.weight": "model-00006-of-00007.safetensors",
302
+ "layers.32.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
303
+ "layers.32.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
304
+ "layers.33.input_layernorm.weight": "model-00006-of-00007.safetensors",
305
+ "layers.33.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
306
+ "layers.33.mlp.gate_proj.weight": "model-00006-of-00007.safetensors",
307
+ "layers.33.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
308
+ "layers.33.post_attention_layernorm.weight": "model-00006-of-00007.safetensors",
309
+ "layers.33.self_attn.k_norm.weight": "model-00006-of-00007.safetensors",
310
+ "layers.33.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
311
+ "layers.33.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
312
+ "layers.33.self_attn.q_norm.weight": "model-00006-of-00007.safetensors",
313
+ "layers.33.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
314
+ "layers.33.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
315
+ "layers.34.input_layernorm.weight": "model-00007-of-00007.safetensors",
316
+ "layers.34.mlp.down_proj.weight": "model-00007-of-00007.safetensors",
317
+ "layers.34.mlp.gate_proj.weight": "model-00006-of-00007.safetensors",
318
+ "layers.34.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
319
+ "layers.34.post_attention_layernorm.weight": "model-00007-of-00007.safetensors",
320
+ "layers.34.self_attn.k_norm.weight": "model-00006-of-00007.safetensors",
321
+ "layers.34.self_attn.k_proj.weight": "model-00006-of-00007.safetensors",
322
+ "layers.34.self_attn.o_proj.weight": "model-00006-of-00007.safetensors",
323
+ "layers.34.self_attn.q_norm.weight": "model-00006-of-00007.safetensors",
324
+ "layers.34.self_attn.q_proj.weight": "model-00006-of-00007.safetensors",
325
+ "layers.34.self_attn.v_proj.weight": "model-00006-of-00007.safetensors",
326
+ "layers.35.input_layernorm.weight": "model-00007-of-00007.safetensors",
327
+ "layers.35.mlp.down_proj.weight": "model-00007-of-00007.safetensors",
328
+ "layers.35.mlp.gate_proj.weight": "model-00007-of-00007.safetensors",
329
+ "layers.35.mlp.up_proj.weight": "model-00007-of-00007.safetensors",
330
+ "layers.35.post_attention_layernorm.weight": "model-00007-of-00007.safetensors",
331
+ "layers.35.self_attn.k_norm.weight": "model-00007-of-00007.safetensors",
332
+ "layers.35.self_attn.k_proj.weight": "model-00007-of-00007.safetensors",
333
+ "layers.35.self_attn.o_proj.weight": "model-00007-of-00007.safetensors",
334
+ "layers.35.self_attn.q_norm.weight": "model-00007-of-00007.safetensors",
335
+ "layers.35.self_attn.q_proj.weight": "model-00007-of-00007.safetensors",
336
+ "layers.35.self_attn.v_proj.weight": "model-00007-of-00007.safetensors",
337
+ "layers.4.input_layernorm.weight": "model-00002-of-00007.safetensors",
338
+ "layers.4.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
339
+ "layers.4.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
340
+ "layers.4.mlp.up_proj.weight": "model-00002-of-00007.safetensors",
341
+ "layers.4.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
342
+ "layers.4.self_attn.k_norm.weight": "model-00002-of-00007.safetensors",
343
+ "layers.4.self_attn.k_proj.weight": "model-00002-of-00007.safetensors",
344
+ "layers.4.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
345
+ "layers.4.self_attn.q_norm.weight": "model-00002-of-00007.safetensors",
346
+ "layers.4.self_attn.q_proj.weight": "model-00002-of-00007.safetensors",
347
+ "layers.4.self_attn.v_proj.weight": "model-00002-of-00007.safetensors",
348
+ "layers.5.input_layernorm.weight": "model-00002-of-00007.safetensors",
349
+ "layers.5.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
350
+ "layers.5.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
351
+ "layers.5.mlp.up_proj.weight": "model-00002-of-00007.safetensors",
352
+ "layers.5.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
353
+ "layers.5.self_attn.k_norm.weight": "model-00002-of-00007.safetensors",
354
+ "layers.5.self_attn.k_proj.weight": "model-00002-of-00007.safetensors",
355
+ "layers.5.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
356
+ "layers.5.self_attn.q_norm.weight": "model-00002-of-00007.safetensors",
357
+ "layers.5.self_attn.q_proj.weight": "model-00002-of-00007.safetensors",
358
+ "layers.5.self_attn.v_proj.weight": "model-00002-of-00007.safetensors",
359
+ "layers.6.input_layernorm.weight": "model-00002-of-00007.safetensors",
360
+ "layers.6.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
361
+ "layers.6.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
362
+ "layers.6.mlp.up_proj.weight": "model-00002-of-00007.safetensors",
363
+ "layers.6.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
364
+ "layers.6.self_attn.k_norm.weight": "model-00002-of-00007.safetensors",
365
+ "layers.6.self_attn.k_proj.weight": "model-00002-of-00007.safetensors",
366
+ "layers.6.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
367
+ "layers.6.self_attn.q_norm.weight": "model-00002-of-00007.safetensors",
368
+ "layers.6.self_attn.q_proj.weight": "model-00002-of-00007.safetensors",
369
+ "layers.6.self_attn.v_proj.weight": "model-00002-of-00007.safetensors",
370
+ "layers.7.input_layernorm.weight": "model-00002-of-00007.safetensors",
371
+ "layers.7.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
372
+ "layers.7.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
373
+ "layers.7.mlp.up_proj.weight": "model-00002-of-00007.safetensors",
374
+ "layers.7.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
375
+ "layers.7.self_attn.k_norm.weight": "model-00002-of-00007.safetensors",
376
+ "layers.7.self_attn.k_proj.weight": "model-00002-of-00007.safetensors",
377
+ "layers.7.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
378
+ "layers.7.self_attn.q_norm.weight": "model-00002-of-00007.safetensors",
379
+ "layers.7.self_attn.q_proj.weight": "model-00002-of-00007.safetensors",
380
+ "layers.7.self_attn.v_proj.weight": "model-00002-of-00007.safetensors",
381
+ "layers.8.input_layernorm.weight": "model-00002-of-00007.safetensors",
382
+ "layers.8.mlp.down_proj.weight": "model-00002-of-00007.safetensors",
383
+ "layers.8.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
384
+ "layers.8.mlp.up_proj.weight": "model-00002-of-00007.safetensors",
385
+ "layers.8.post_attention_layernorm.weight": "model-00002-of-00007.safetensors",
386
+ "layers.8.self_attn.k_norm.weight": "model-00002-of-00007.safetensors",
387
+ "layers.8.self_attn.k_proj.weight": "model-00002-of-00007.safetensors",
388
+ "layers.8.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
389
+ "layers.8.self_attn.q_norm.weight": "model-00002-of-00007.safetensors",
390
+ "layers.8.self_attn.q_proj.weight": "model-00002-of-00007.safetensors",
391
+ "layers.8.self_attn.v_proj.weight": "model-00002-of-00007.safetensors",
392
+ "layers.9.input_layernorm.weight": "model-00003-of-00007.safetensors",
393
+ "layers.9.mlp.down_proj.weight": "model-00003-of-00007.safetensors",
394
+ "layers.9.mlp.gate_proj.weight": "model-00002-of-00007.safetensors",
395
+ "layers.9.mlp.up_proj.weight": "model-00003-of-00007.safetensors",
396
+ "layers.9.post_attention_layernorm.weight": "model-00003-of-00007.safetensors",
397
+ "layers.9.self_attn.k_norm.weight": "model-00002-of-00007.safetensors",
398
+ "layers.9.self_attn.k_proj.weight": "model-00002-of-00007.safetensors",
399
+ "layers.9.self_attn.o_proj.weight": "model-00002-of-00007.safetensors",
400
+ "layers.9.self_attn.q_norm.weight": "model-00002-of-00007.safetensors",
401
+ "layers.9.self_attn.q_proj.weight": "model-00002-of-00007.safetensors",
402
+ "layers.9.self_attn.v_proj.weight": "model-00002-of-00007.safetensors",
403
+ "norm.weight": "model-00007-of-00007.safetensors"
404
+ }
405
+ }
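Note on the weight map above: some layers have their tensors spread over two shard files (for example, `layers.28.mlp.gate_proj.weight` sits in shard 5 while `layers.28.mlp.down_proj.weight` sits in shard 6). The following is a minimal sketch, not part of this commit, of how such layers could be listed from a `weight_map` dictionary. The helper name `shards_per_layer` is an assumption made for illustration, and the sample dictionary is a small excerpt copied from the entries shown above.

```python
import re
from collections import defaultdict

# Small excerpt of the "weight_map" entries shown in the diff above.
weight_map = {
    "layers.28.input_layernorm.weight": "model-00006-of-00007.safetensors",
    "layers.28.mlp.down_proj.weight": "model-00006-of-00007.safetensors",
    "layers.28.mlp.gate_proj.weight": "model-00005-of-00007.safetensors",
    "layers.28.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
    "layers.28.self_attn.q_proj.weight": "model-00005-of-00007.safetensors",
    "layers.29.mlp.up_proj.weight": "model-00006-of-00007.safetensors",
    "norm.weight": "model-00007-of-00007.safetensors",
}


def shards_per_layer(weight_map):
    """Map each layer index to the sorted list of shard files holding its tensors.

    Hypothetical helper for illustration; it only inspects tensor names.
    """
    layers = defaultdict(set)
    for name, shard in weight_map.items():
        match = re.match(r"layers\.(\d+)\.", name)
        if match:  # skip non-layer tensors such as "norm.weight"
            layers[int(match.group(1))].add(shard)
    return {idx: sorted(files) for idx, files in layers.items()}


# Keep only the layers whose tensors live in more than one shard file.
split_layers = {
    idx: files for idx, files in shards_per_layer(weight_map).items() if len(files) > 1
}
print(split_layers)
```

With the full index file, the same function could be applied to the parsed `weight_map` object instead of the excerpt.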
merged_model/special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "eos_token": {
+     "content": "<|im_end|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
merged_model/tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:93623af029cdc69b87f2864d3b2cc2424fdf16684f15e139b5b9d08ec34ced91
+ size 11423701
merged_model/tokenizer_config.json ADDED
@@ -0,0 +1,241 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_eos_token": true,
4
+ "add_prefix_space": false,
5
+ "added_tokens_decoder": {
6
+ "151643": {
7
+ "content": "<|endoftext|>",
8
+ "lstrip": false,
9
+ "normalized": false,
10
+ "rstrip": false,
11
+ "single_word": false,
12
+ "special": true
13
+ },
14
+ "151644": {
15
+ "content": "<|im_start|>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false,
20
+ "special": true
21
+ },
22
+ "151645": {
23
+ "content": "<|im_end|>",
24
+ "lstrip": false,
25
+ "normalized": false,
26
+ "rstrip": false,
27
+ "single_word": false,
28
+ "special": true
29
+ },
30
+ "151646": {
31
+ "content": "<|object_ref_start|>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false,
36
+ "special": true
37
+ },
38
+ "151647": {
39
+ "content": "<|object_ref_end|>",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false,
44
+ "special": true
45
+ },
46
+ "151648": {
47
+ "content": "<|box_start|>",
48
+ "lstrip": false,
49
+ "normalized": false,
50
+ "rstrip": false,
51
+ "single_word": false,
52
+ "special": true
53
+ },
54
+ "151649": {
55
+ "content": "<|box_end|>",
56
+ "lstrip": false,
57
+ "normalized": false,
58
+ "rstrip": false,
59
+ "single_word": false,
60
+ "special": true
61
+ },
62
+ "151650": {
63
+ "content": "<|quad_start|>",
64
+ "lstrip": false,
65
+ "normalized": false,
66
+ "rstrip": false,
67
+ "single_word": false,
68
+ "special": true
69
+ },
70
+ "151651": {
71
+ "content": "<|quad_end|>",
72
+ "lstrip": false,
73
+ "normalized": false,
74
+ "rstrip": false,
75
+ "single_word": false,
76
+ "special": true
77
+ },
78
+ "151652": {
79
+ "content": "<|vision_start|>",
80
+ "lstrip": false,
81
+ "normalized": false,
82
+ "rstrip": false,
83
+ "single_word": false,
84
+ "special": true
85
+ },
86
+ "151653": {
87
+ "content": "<|vision_end|>",
88
+ "lstrip": false,
89
+ "normalized": false,
90
+ "rstrip": false,
91
+ "single_word": false,
92
+ "special": true
93
+ },
94
+ "151654": {
95
+ "content": "<|vision_pad|>",
96
+ "lstrip": false,
97
+ "normalized": false,
98
+ "rstrip": false,
99
+ "single_word": false,
100
+ "special": true
101
+ },
102
+ "151655": {
103
+ "content": "<|image_pad|>",
104
+ "lstrip": false,
105
+ "normalized": false,
106
+ "rstrip": false,
107
+ "single_word": false,
108
+ "special": true
109
+ },
110
+ "151656": {
111
+ "content": "<|video_pad|>",
112
+ "lstrip": false,
113
+ "normalized": false,
114
+ "rstrip": false,
115
+ "single_word": false,
116
+ "special": true
117
+ },
118
+ "151657": {
119
+ "content": "<tool_call>",
120
+ "lstrip": false,
121
+ "normalized": false,
122
+ "rstrip": false,
123
+ "single_word": false,
124
+ "special": false
125
+ },
126
+ "151658": {
127
+ "content": "</tool_call>",
128
+ "lstrip": false,
129
+ "normalized": false,
130
+ "rstrip": false,
131
+ "single_word": false,
132
+ "special": false
133
+ },
134
+ "151659": {
135
+ "content": "<|fim_prefix|>",
136
+ "lstrip": false,
137
+ "normalized": false,
138
+ "rstrip": false,
139
+ "single_word": false,
140
+ "special": false
141
+ },
142
+ "151660": {
143
+ "content": "<|fim_middle|>",
144
+ "lstrip": false,
145
+ "normalized": false,
146
+ "rstrip": false,
147
+ "single_word": false,
148
+ "special": false
149
+ },
150
+ "151661": {
151
+ "content": "<|fim_suffix|>",
152
+ "lstrip": false,
153
+ "normalized": false,
154
+ "rstrip": false,
155
+ "single_word": false,
156
+ "special": false
157
+ },
158
+ "151662": {
159
+ "content": "<|fim_pad|>",
160
+ "lstrip": false,
161
+ "normalized": false,
162
+ "rstrip": false,
163
+ "single_word": false,
164
+ "special": false
165
+ },
166
+ "151663": {
167
+ "content": "<|repo_name|>",
168
+ "lstrip": false,
169
+ "normalized": false,
170
+ "rstrip": false,
171
+ "single_word": false,
172
+ "special": false
173
+ },
174
+ "151664": {
175
+ "content": "<|file_sep|>",
176
+ "lstrip": false,
177
+ "normalized": false,
178
+ "rstrip": false,
179
+ "single_word": false,
180
+ "special": false
181
+ },
182
+ "151665": {
183
+ "content": "<tool_response>",
184
+ "lstrip": false,
185
+ "normalized": false,
186
+ "rstrip": false,
187
+ "single_word": false,
188
+ "special": false
189
+ },
190
+ "151666": {
191
+ "content": "</tool_response>",
192
+ "lstrip": false,
193
+ "normalized": false,
194
+ "rstrip": false,
195
+ "single_word": false,
196
+ "special": false
197
+ },
198
+ "151667": {
199
+ "content": "<think>",
200
+ "lstrip": false,
201
+ "normalized": false,
202
+ "rstrip": false,
203
+ "single_word": false,
204
+ "special": false
205
+ },
206
+ "151668": {
207
+ "content": "</think>",
208
+ "lstrip": false,
209
+ "normalized": false,
210
+ "rstrip": false,
211
+ "single_word": false,
212
+ "special": false
213
+ }
214
+ },
215
+ "additional_special_tokens": [
216
+ "<|im_start|>",
217
+ "<|im_end|>",
218
+ "<|object_ref_start|>",
219
+ "<|object_ref_end|>",
220
+ "<|box_start|>",
221
+ "<|box_end|>",
222
+ "<|quad_start|>",
223
+ "<|quad_end|>",
224
+ "<|vision_start|>",
225
+ "<|vision_end|>",
226
+ "<|vision_pad|>",
227
+ "<|image_pad|>",
228
+ "<|video_pad|>"
229
+ ],
230
+ "bos_token": null,
231
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0].role == 'system' %}\n {{- messages[0].content + '\\n\\n' }}\n {%- endif %}\n {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0].role == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0].content + '<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n {%- set index = (messages|length - 1) - loop.index0 %}\n {%- if ns.multi_step_tool and message.role == \"user\" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}\n {%- set ns.multi_step_tool = false %}\n {%- set ns.last_query_index = index %}\n {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n {%- if message.content is string %}\n {%- set content = message.content %}\n {%- else %}\n {%- set content = '' %}\n {%- endif %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) %}\n {{- '<|im_start|>' + message.role + '\\n' + content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {%- set reasoning_content = '' %}\n {%- if message.reasoning_content is string %}\n {%- set reasoning_content = message.reasoning_content %}\n {%- else %}\n {%- if '</think>' in content %}\n {%- set reasoning_content = 
content.split('</think>')[0].rstrip('\\n').split('<think>')[-1].lstrip('\\n') %}\n {%- set content = content.split('</think>')[-1].lstrip('\\n') %}\n {%- endif %}\n {%- endif %}\n {%- if loop.index0 > ns.last_query_index %}\n {%- if loop.last or (not loop.last and reasoning_content) %}\n {{- '<|im_start|>' + message.role + '\\n<think>\\n' + reasoning_content.strip('\\n') + '\\n</think>\\n\\n' + content.lstrip('\\n') }}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- if message.tool_calls %}\n {%- for tool_call in message.tool_calls %}\n {%- if (loop.first and content) or (not loop.first) %}\n {{- '\\n' }}\n {%- endif %}\n {%- if tool_call.function %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {%- if tool_call.arguments is string %}\n {{- tool_call.arguments }}\n {%- else %}\n {{- tool_call.arguments | tojson }}\n {%- endif %}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {%- endif %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n {%- if enable_thinking is defined and enable_thinking is false %}\n {{- '<think>\\n\\n</think>\\n\\n' }}\n {%- endif %}\n{%- endif %}",
232
+ "clean_up_tokenization_spaces": false,
233
+ "eos_token": "<|im_end|>",
234
+ "errors": "replace",
235
+ "extra_special_tokens": {},
236
+ "model_max_length": 131072,
237
+ "pad_token": "<|endoftext|>",
238
+ "split_special_tokens": false,
239
+ "tokenizer_class": "Qwen2Tokenizer",
240
+ "unk_token": null
241
+ }
merged_model/vocab.json ADDED
The diff for this file is too large to render. See raw diff
query_encoder/config.json ADDED
@@ -0,0 +1,40 @@
+ {
+   "architectures": [
+     "NewModel"
+   ],
+   "attention_probs_dropout_prob": 0.0,
+   "auto_map": {
+     "AutoConfig": "configuration.NewConfig",
+     "AutoModel": "modeling.NewModel",
+     "AutoModelForMaskedLM": "modeling.NewForMaskedLM",
+     "AutoModelForMultipleChoice": "/train20/intern/permanent/aqjiang/model/gte-multilingual-mlm-base/gte--modeling.NewForMultipleChoice",
+     "AutoModelForQuestionAnswering": "/train20/intern/permanent/aqjiang/model/gte-multilingual-mlm-base/gte--modeling.NewForQuestionAnswering",
+     "AutoModelForSequenceClassification": "/train20/intern/permanent/aqjiang/model/gte-multilingual-mlm-base/gte--modeling.NewForSequenceClassification",
+     "AutoModelForTokenClassification": "/train20/intern/permanent/aqjiang/model/gte-multilingual-mlm-base/gte--modeling.NewForTokenClassification"
+   },
+   "classifier_dropout": 0.1,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "layer_norm_type": "layer_norm",
+   "logn_attention_clip1": false,
+   "logn_attention_scale": false,
+   "max_position_embeddings": 8192,
+   "model_type": "new",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pack_qkv": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "rope",
+   "rope_scaling": null,
+   "rope_theta": 160000,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.51.1",
+   "type_vocab_size": 1,
+   "unpad_inputs": false,
+   "use_memory_efficient_attention": false,
+   "vocab_size": 250048
+ }
query_encoder/configuration.py ADDED
@@ -0,0 +1,137 @@
+ # coding=utf-8
+ # Copyright 2024 The GTE Team Authors and Alibaba Group.
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ NEW model configuration"""
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.utils import logging
+
+ logger = logging.get_logger(__name__)
+
+
+ class NewConfig(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`NewModel`] or a [`TFNewModel`]. It is used to
+     instantiate a NEW model according to the specified arguments, defining the model architecture. Instantiating a
+     configuration with the defaults will yield a similar configuration to that of the NEW
+     [izhx/new-base-en](https://huggingface.co/izhx/new-base-en) architecture.
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+     Args:
+         vocab_size (`int`, *optional*, defaults to 30528):
+             Vocabulary size of the NEW model. Defines the number of different tokens that can be represented by the
+             `input_ids` passed when calling [`NewModel`] or [`TFNewModel`].
+         hidden_size (`int`, *optional*, defaults to 768):
+             Dimensionality of the encoder layers and the pooler layer.
+         num_hidden_layers (`int`, *optional*, defaults to 12):
+             Number of hidden layers in the Transformer encoder.
+         num_attention_heads (`int`, *optional*, defaults to 12):
+             Number of attention heads for each attention layer in the Transformer encoder.
+         intermediate_size (`int`, *optional*, defaults to 3072):
+             Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
+         hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
+             The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
+             `"relu"`, `"silu"` and `"gelu_new"` are supported.
+         hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
+             The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+         attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
+             The dropout ratio for the attention probabilities.
+         max_position_embeddings (`int`, *optional*, defaults to 2048):
+             The maximum sequence length that this model might ever be used with. Typically set this to something large
+             just in case (e.g., 512 or 1024 or 2048).
+         type_vocab_size (`int`, *optional*, defaults to 1):
+             The vocabulary size of the `token_type_ids` passed when calling [`NewModel`] or [`TFNewModel`].
+         initializer_range (`float`, *optional*, defaults to 0.02):
+             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+         layer_norm_eps (`float`, *optional*, defaults to 1e-12):
+             The epsilon used by the layer normalization layers.
+         position_embedding_type (`str`, *optional*, defaults to `"rope"`):
+             Type of position embedding. Choose one of `"absolute"`, `"rope"`.
+         rope_theta (`float`, *optional*, defaults to 10000.0):
+             The base period of the RoPE embeddings.
+         rope_scaling (`Dict`, *optional*):
+             Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
+             strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
+             `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
+             `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
+             these scaling strategies behave:
+             https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
+             experimental feature, subject to breaking API changes in future versions.
+         classifier_dropout (`float`, *optional*):
+             The dropout ratio for the classification head.
+     Examples:
+     ```python
+     >>> from transformers import NewConfig, NewModel
+     >>> # Initializing a NEW izhx/new-base-en style configuration
+     >>> configuration = NewConfig()
+     >>> # Initializing a model (with random weights) from the izhx/new-base-en style configuration
+     >>> model = NewModel(configuration)
+     >>> # Accessing the model configuration
+     >>> configuration = model.config
+     ```"""
+
+     model_type = "new"
+
+     def __init__(
+         self,
+         vocab_size=30528,
+         hidden_size=768,
+         num_hidden_layers=12,
+         num_attention_heads=12,
+         intermediate_size=3072,
+         hidden_act="gelu",
+         hidden_dropout_prob=0.1,
+         attention_probs_dropout_prob=0.0,
+         max_position_embeddings=2048,
+         type_vocab_size=1,
+         initializer_range=0.02,
+         layer_norm_type='layer_norm',
+         layer_norm_eps=1e-12,
+         # pad_token_id=0,
+         position_embedding_type="rope",
+         rope_theta=10000.0,
+         rope_scaling=None,
+         classifier_dropout=None,
+         pack_qkv=True,
+         unpad_inputs=False,
+         use_memory_efficient_attention=False,
+         logn_attention_scale=False,
+         logn_attention_clip1=False,
+         **kwargs,
+     ):
+         super().__init__(**kwargs)
+
+         self.vocab_size = vocab_size
+         self.hidden_size = hidden_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.hidden_act = hidden_act
+         self.intermediate_size = intermediate_size
+         self.hidden_dropout_prob = hidden_dropout_prob
+         self.attention_probs_dropout_prob = attention_probs_dropout_prob
+         self.max_position_embeddings = max_position_embeddings
+         self.type_vocab_size = type_vocab_size
+         self.initializer_range = initializer_range
+         self.layer_norm_type = layer_norm_type
+         self.layer_norm_eps = layer_norm_eps
+         self.position_embedding_type = position_embedding_type
+         self.rope_theta = rope_theta
+         self.rope_scaling = rope_scaling
+         self.classifier_dropout = classifier_dropout
+
+         self.pack_qkv = pack_qkv
+         self.unpad_inputs = unpad_inputs
+         self.use_memory_efficient_attention = use_memory_efficient_attention
+         self.logn_attention_scale = logn_attention_scale
+         self.logn_attention_clip1 = logn_attention_clip1
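Note on `rope_scaling`: the docstring above describes the expected shape as `{"type": strategy name, "factor": scaling factor}` with strategies `linear` and `dynamic` and a float factor greater than 1, but the constructor as shown stores the value without checking it. The following is a minimal sketch, not part of this commit, of what such a check might look like. The function name `validate_rope_scaling` is hypothetical, and the rules are taken only from the docstring text.

```python
def validate_rope_scaling(rope_scaling):
    """Hypothetical check of the documented {"type", "factor"} shape.

    Returns None when the value is acceptable and raises ValueError otherwise.
    """
    if rope_scaling is None:
        return  # scaling disabled, nothing to check
    if not isinstance(rope_scaling, dict) or set(rope_scaling) != {"type", "factor"}:
        raise ValueError(
            f"`rope_scaling` must be a dict with exactly the keys 'type' and 'factor', got {rope_scaling!r}"
        )
    if rope_scaling["type"] not in ("linear", "dynamic"):
        raise ValueError(
            f"`rope_scaling['type']` must be 'linear' or 'dynamic', got {rope_scaling['type']!r}"
        )
    factor = rope_scaling["factor"]
    if not isinstance(factor, float) or factor <= 1.0:
        raise ValueError(
            f"`rope_scaling['factor']` must be a float greater than 1, got {factor!r}"
        )


# Usage: a well-formed value passes silently, a malformed one raises.
validate_rope_scaling({"type": "dynamic", "factor": 2.0})
```

A check like this could be called from `__init__` before assigning `self.rope_scaling`; whether the upstream model code performs an equivalent check elsewhere is not visible in this diff.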
query_encoder/model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c358869cca1a75d77bca1d53e43d595b0ef1d65a530d2182d22b82adc974d8f0
3
+ size 610751248
query_encoder/modeling.py ADDED
@@ -0,0 +1,1418 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 The GTE Team Authors and Alibaba Group.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """PyTorch NEW model."""
17
+
18
+ import math
19
+ from dataclasses import dataclass
20
+ from typing import List, Optional, Tuple, Union
21
+
22
+ import torch
23
+ import torch.utils.checkpoint
24
+ from torch import nn
25
+
26
+ from transformers.activations import ACT2FN
27
+ from transformers.modeling_outputs import (
28
+ BaseModelOutput,
29
+ BaseModelOutputWithPooling,
30
+ MaskedLMOutput,
31
+ MultipleChoiceModelOutput,
32
+ QuestionAnsweringModelOutput,
33
+ SequenceClassifierOutput,
34
+ ModelOutput,
35
+ )
36
+ from transformers.modeling_utils import PreTrainedModel
37
+ from transformers.utils import logging
38
+
39
+ try:
40
+ import xformers.ops as xops
41
+ except ImportError:
42
+ xops = None
43
+
44
+ from .configuration import NewConfig
45
+
46
+
47
+ logger = logging.get_logger(__name__)
48
+
49
+
50
+ # Adapted from https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/bert_padding.py
51
+ # Which was adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
52
+ class IndexFirstAxis(torch.autograd.Function):
53
+ @staticmethod
54
+ def forward(ctx, input, indices):
55
+ ctx.save_for_backward(indices)
56
+ assert input.ndim >= 2
57
+ ctx.first_axis_dim, other_shape = input.shape[0], input.shape[1:]
58
+ second_dim = other_shape.numel()
59
+ # TD [2022-03-04] For some reason torch.gather is a bit faster than indexing.
60
+ # return input[indices]
61
+ # return torch.gather(
62
+ # rearrange(input, "b ... -> b (...)"), 0, repeat(indices, "z -> z d", d=second_dim)
63
+ # ).reshape(-1, *other_shape)
64
+ return torch.gather(
65
+ input.view(ctx.first_axis_dim, second_dim),
66
+ 0,
67
+ indices.unsqueeze(-1).expand(indices.size(0), second_dim)
68
+ ).reshape(-1, *other_shape)
69
+
70
+ @staticmethod
71
+ def backward(ctx, grad_output):
72
+ (indices,) = ctx.saved_tensors
73
+ assert grad_output.ndim >= 2
74
+ other_shape = grad_output.shape[1:]
75
+ # grad_output = rearrange(grad_output, "b ... -> b (...)")
76
+ grad_output = grad_output.view(grad_output.size(0), other_shape.numel())
77
+ grad_input = torch.zeros(
78
+ [ctx.first_axis_dim, grad_output.shape[1]],
79
+ device=grad_output.device,
80
+ dtype=grad_output.dtype,
81
+ )
82
+ # TD [2022-03-04] For some reason torch.scatter is a bit faster than indexing.
83
+ # grad_input[indices] = grad_output
84
+ # grad_input.scatter_(0, repeat(indices, "z -> z d", d=grad_output.shape[1]), grad_output)
85
+ grad_input.scatter_(
86
+ 0, indices.unsqueeze(-1).expand(indices.size(0), grad_output.size(1)), grad_output
87
+ )
88
+ return grad_input.reshape(ctx.first_axis_dim, *other_shape), None
89
+
90
+
91
+ index_first_axis = IndexFirstAxis.apply
92
+
93
+
94
+ def unpad_input(hidden_states, attention_mask=None, indices=None):
95
+ """
96
+ Arguments:
97
+ hidden_states: (batch, seqlen, ...)
98
+ attention_mask: (batch, seqlen), bool / int, 1 means valid and 0 means not valid.
99
+ indices: (total_nnz), the indices of non-masked tokens from the flattened input sequence.
100
+ Return:
101
+ hidden_states: (total_nnz, ...), where total_nnz = number of tokens selected in attention_mask.
102
+ """
103
+ if indices is None:
104
+ assert attention_mask is not None
105
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
106
+
107
+ # TD [2022-03-04] We don't want to index with a bool mask, because PyTorch will expand the
108
+ # bool mask, then call nonzero to get the indices, then index with those. The indices is @dim
109
+ # times larger than it needs to be, wasting memory. It's faster and more memory-efficient to
110
+ # index with integer indices. Moreover, torch's index is a bit slower than it needs to be,
111
+ # so we write custom forward and backward to make it a bit faster.
112
+ hidden_states = hidden_states.view(-1, *hidden_states.shape[2:])
113
+ return index_first_axis(hidden_states, indices)
114
+
115
+
116
+ class IndexPutFirstAxis(torch.autograd.Function):
117
+ @staticmethod
118
+ def forward(
119
+ ctx,
120
+ values: torch.Tensor,
121
+ indices: torch.Tensor,
122
+ first_axis_dim
123
+ ) -> torch.Tensor:
124
+ ctx.save_for_backward(indices)
125
+ assert indices.ndim == 1
126
+ assert values.ndim >= 2
127
+ output = torch.zeros(
128
+ first_axis_dim, *values.shape[1:], device=values.device, dtype=values.dtype
129
+ )
130
+ output[indices] = values
131
+ return output
132
+
133
+ @staticmethod
134
+ def backward(ctx, grad_output: torch.Tensor) -> Tuple[torch.Tensor, None, None]:
135
+ indices, = ctx.saved_tensors
136
+ grad_values = grad_output[indices]
137
+ return grad_values, None, None
138
+
139
+
140
+ index_put_first_axis = IndexPutFirstAxis.apply
141
+
142
+
143
+ def pad_input(inputs: torch.Tensor, indices: torch.Tensor, batch: int, seqlen: int) -> torch.Tensor:
144
+ """Add padding to sequences.
145
+
146
+ Arguments:
147
+ inputs: (total_nnz, ...), where total_nnz = number of tokens selected in attention_mask.
148
+ indices: (total_nnz), `indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()`
149
+ batch: int batch_size
150
+ seqlen: int max sequence length
151
+
152
+ Returns:
153
+ inputs: (batch, seqlen, ...)
154
+ """
155
+ output = index_put_first_axis(inputs, indices, batch * seqlen)
156
+ return output.view(batch, seqlen, *inputs.shape[1:])
157
+
158
+
159
+ def rotate_half(x):
160
+ """Rotates half the hidden dims of the input."""
161
+ x1 = x[..., : x.shape[-1] // 2]
162
+ x2 = x[..., x.shape[-1] // 2 :]
163
+ return torch.cat((-x2, x1), dim=-1)
164
+
165
+
166
+ def apply_rotary_pos_emb(q, k, cos, sin):
167
+ """Applies Rotary Position Embedding to the query and key tensors.
168
+
169
+ Args:
170
+ q (`torch.Tensor`): The query tensor.
171
+ k (`torch.Tensor`): The key tensor.
172
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
173
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
174
+ Returns:
175
+ `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
176
+ """
177
+ cos, sin = cos.to(q.dtype), sin.to(q.dtype)
178
+ q_embed = (q * cos) + (rotate_half(q) * sin)
179
+ k_embed = (k * cos) + (rotate_half(k) * sin)
180
+ return q_embed, k_embed
181
+
182
+
183
+ class RotaryEmbedding(torch.nn.Module):
184
+ def __init__(self, dim, max_position_embeddings=512, base=10000.0, device=None):
185
+ super().__init__()
186
+
187
+ self.dim = dim
188
+ self.max_position_embeddings = max_position_embeddings
189
+ self.base = base
190
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
191
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
192
+
193
+ # Build here to make `torch.jit.trace` work.
194
+ self._set_cos_sin_cache(
195
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
196
+ )
197
+
198
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
199
+ self.max_seq_len_cached = seq_len
200
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
201
+
202
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
203
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
204
+ emb = torch.cat((freqs, freqs), dim=-1)
205
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
206
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
207
+
208
+ def forward(self, x, seq_len=None):
209
+ # x: [bs, num_attention_heads, seq_len, head_size]
210
+ if seq_len > self.max_seq_len_cached:
211
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
212
+
213
+ return (
214
+ self.cos_cached[:seq_len, ...].to(dtype=x.dtype),
215
+ self.sin_cached[:seq_len, ...].to(dtype=x.dtype),
216
+ )
217
+
218
+
219
+ class NTKScalingRotaryEmbedding(RotaryEmbedding):
220
+ """RotaryEmbedding extended with fixed and mixed NTK scaling. https://kexue.fm/archives/9706 """
221
+
222
+ def __init__(self, dim, max_position_embeddings=512, base=10000, device=None, scaling_factor=1.0, mixed_b=None):
223
+ self.scaling_factor = scaling_factor
224
+ self.mixed_b = mixed_b
225
+ super().__init__(dim, max_position_embeddings, base, device)
226
+ max_position_embeddings = max_position_embeddings * self.scaling_factor
227
+ self._set_cos_sin_cache(max_position_embeddings, self.inv_freq.device, torch.get_default_dtype())
228
+
229
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
230
+ self.max_seq_len_cached = seq_len
231
+
232
+ if seq_len > self.max_position_embeddings:
233
+ base = self.base * (self.scaling_factor if self.mixed_b is None else 1)
234
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
235
+
236
+ if self.mixed_b is None:
237
+ inv_freq = inv_freq / self.scaling_factor ** (2 / self.dim) # (6)
238
+ else:
239
+ a = torch.tensor(self.scaling_factor).log() / (self.dim / 2) ** self.mixed_b # (13)
240
+ lambda_1_m = (a * torch.arange(1, self.dim // 2 + 1).float().to(device) ** self.mixed_b).exp() # (12)
241
+ inv_freq = inv_freq / lambda_1_m # (10)
242
+
243
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
244
+
245
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
246
+
247
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
248
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
249
+ emb = torch.cat((freqs, freqs), dim=-1)
250
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
251
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
252
+
253
+
254
+ class RMSNorm(nn.Module):
255
+ def __init__(self, hidden_size, eps=1e-6):
256
+ """
257
+ RMSNorm is equivalent to T5LayerNorm
258
+ """
259
+ super().__init__()
260
+ self.weight = nn.Parameter(torch.ones(hidden_size))
261
+ self.variance_epsilon = eps
262
+
263
+ def forward(self, hidden_states):
264
+ input_dtype = hidden_states.dtype
265
+ hidden_states = hidden_states.to(torch.float32)
266
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
267
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
268
+ return self.weight * hidden_states.to(input_dtype)
269
+
270
+
271
+ LAYER_NORM = {
272
+ 'layer_norm': nn.LayerNorm,
273
+ 'rms_norm': RMSNorm
274
+ }
275
+
276
+
277
+ class NewEmbeddings(nn.Module):
278
+ """
279
+ Embedding and Unpadding.
280
+ """
281
+
282
+ def __init__(self, config: NewConfig):
283
+ super().__init__()
284
+ self.padding_idx = config.pad_token_id
285
+ self.word_embeddings = nn.Embedding(
286
+ config.vocab_size, config.hidden_size, padding_idx=self.padding_idx
287
+ )
288
+
289
+ self.position_embedding_type = config.position_embedding_type
290
+ if self.position_embedding_type == 'absolute':
291
+ self.position_embeddings = nn.Embedding(
292
+ config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx
293
+ )
294
+ elif self.position_embedding_type == 'rope':
295
+ self._init_rope(config)
296
+ else:
297
+ raise ValueError(f"Unknown position embedding type {self.position_embedding_type}")
298
+
299
+ self.type_vocab_size = config.type_vocab_size
300
+ if self.type_vocab_size > 0:
301
+ self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
302
+
303
+ # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
304
+ # any TensorFlow checkpoint file
305
+ self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
306
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
307
+ # position_ids is contiguous in memory and excluded when serialized
308
+ self.register_buffer(
309
+ "position_ids", torch.arange(config.max_position_embeddings), persistent=False
310
+ )
311
+
312
+ def _init_rope(self, config):
313
+ kwargs = dict(
314
+ dim=int(config.hidden_size / config.num_attention_heads),
315
+ max_position_embeddings=config.max_position_embeddings,
316
+ base=config.rope_theta
317
+ )
318
+ if config.rope_scaling is None:
319
+ self.rotary_emb = RotaryEmbedding(**kwargs)
320
+ else:
321
+ kwargs.update(scaling_factor=config.rope_scaling["factor"])
322
+ scaling_type = config.rope_scaling["type"]
323
+ if scaling_type == 'ntk':
324
+ kwargs.update(mixed_b=config.rope_scaling.get('mixed_b', None))
325
+ self.rotary_emb = NTKScalingRotaryEmbedding(**kwargs)
326
+ # elif scaling_type == "linear":
327
+ # self.rotary_emb = LinearScalingRotaryEmbedding(**kwargs)
328
+ # elif scaling_type == "dynamic":
329
+ # self.rotary_emb = DynamicNTKScalingRotaryEmbedding(**kwargs)
330
+ else:
331
+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
332
+
333
+ def forward(
334
+ self,
335
+ unpad_inputs: bool,
336
+ input_ids: Optional[torch.Tensor] = None,
337
+ attention_mask: Optional[torch.Tensor] = None,
338
+ length: Optional[List[int]] = None,
339
+ token_type_ids: Optional[torch.Tensor] = None,
340
+ position_ids: Optional[torch.Tensor] = None,
341
+ inputs_embeds: Optional[torch.Tensor] = None,
342
+ ) -> Tuple[torch.Tensor, torch.Tensor, Optional[Tuple], Optional[List[int]]]:
343
+ """
344
+ """
345
+ if inputs_embeds is None:
346
+ device, input_shape = input_ids.device, input_ids.shape
347
+ else:
348
+ device, input_shape = inputs_embeds.device, inputs_embeds.shape[:2]
349
+ batch_size, seq_length = input_shape
350
+
351
+ # Set attention_mask if it's None
352
+ if attention_mask is None:
353
+ attention_mask = torch.ones(input_shape, device=device)
354
+ if length is not None:
355
+ for i, l in enumerate(length):
356
+ attention_mask[i, l:] = 0
357
+
358
+ # Set attention_mask_bool for unpadding
359
+ if unpad_inputs:
360
+ attention_mask_bool = attention_mask.bool()
361
+ if length is None:
362
+ length = attention_mask.sum(-1).tolist()
363
+
364
+ # Get word embeddings
365
+ if inputs_embeds is None:
366
+ if unpad_inputs:
367
+ input_ids = input_ids[attention_mask_bool].unsqueeze(0)
368
+ inputs_embeds = self.word_embeddings(input_ids)
369
+ else:
370
+ if unpad_inputs:
371
+ inputs_embeds = inputs_embeds[attention_mask_bool].unsqueeze(0)
372
+ embeddings = inputs_embeds
373
+
374
+ # Set and unpad position_ids
375
+ if position_ids is None:
376
+ if seq_length > self.position_ids.size(0):
377
+ self.register_buffer(
378
+ "position_ids", torch.arange(seq_length, device=embeddings.device), persistent=False
379
+ )
380
+ if unpad_inputs:
381
+ # [1, cumsum_seq_len]
382
+ position_ids = torch.cat([self.position_ids[:l] for l in length]).unsqueeze(0)
383
+ else:
384
+ # [bs, seq_len]
385
+ position_ids = self.position_ids[:seq_length].expand(batch_size, -1)
386
+ elif unpad_inputs:
387
+ position_ids = position_ids[attention_mask_bool].unsqueeze(0) # [1, cumsum_seq_len]
388
+
389
+ # Compute rotary embedding
390
+ if self.position_embedding_type == 'rope':
391
+ rope_cos, rope_sin = self.rotary_emb(inputs_embeds, seq_len=seq_length)
392
+ rope_cos = rope_cos[position_ids].unsqueeze(2) # [bs, seq_len, 1, dim]
393
+ rope_sin = rope_sin[position_ids].unsqueeze(2) # [bs, seq_len, 1, dim]
394
+ rope_embeds = rope_cos, rope_sin
395
+ else:
396
+ rope_embeds = None
397
+
398
+ if self.type_vocab_size > 0:
399
+ if token_type_ids is None:
400
+ token_type_ids = position_ids.mul(0)
401
+ else:
402
+ if self.type_vocab_size < 2:
403
+ token_type_ids.mul_(0)
404
+ if unpad_inputs:
405
+ token_type_ids = token_type_ids[attention_mask_bool].unsqueeze(0)
406
+
407
+ token_type_embeddings = self.token_type_embeddings(token_type_ids)
408
+ embeddings = embeddings + token_type_embeddings
409
+
410
+ # BERT position
411
+ if self.position_embedding_type == "absolute":
412
+ position_embeddings = self.position_embeddings(position_ids)
413
+ embeddings = embeddings + position_embeddings
414
+
415
+ embeddings = self.LayerNorm(embeddings)
416
+ embeddings = self.dropout(embeddings)
417
+
418
+ return embeddings, attention_mask, rope_embeds, length
419
+
420
+
421
+ class NewAttention(nn.Module):
422
+ def __init__(self, config: NewConfig, pack_qkv=None, use_memory_efficient_attention=None):
423
+ super().__init__()
424
+ self.config = config
425
+ if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
426
+ raise ValueError(
427
+ f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
428
+ f"heads ({config.num_attention_heads})"
429
+ )
430
+
431
+ self.hidden_size = config.hidden_size
432
+ self.num_attention_heads = config.num_attention_heads
433
+ self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
434
+ self.all_head_size = self.num_attention_heads * self.attention_head_size
435
+
436
+ if pack_qkv is None:
437
+ pack_qkv = config.pack_qkv
438
+ self.pack_qkv = pack_qkv
439
+
440
+ if self.pack_qkv:
441
+ self.qkv_proj = nn.Linear(config.hidden_size, self.all_head_size * 3, bias=True)
442
+ else:
443
+ self.q_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
444
+ self.k_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
445
+ self.v_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
446
+
447
+ self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
448
+ self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=True)
449
+
450
+ if use_memory_efficient_attention is None:
451
+ use_memory_efficient_attention = self.config.use_memory_efficient_attention
452
+ self.use_memory_efficient_attention = use_memory_efficient_attention
453
+ self.memory_efficient_attention = None if xops is None else xops.memory_efficient_attention
454
+ if self.use_memory_efficient_attention:
455
+ assert self.memory_efficient_attention is not None, 'please install xformers'
456
+
457
+ def forward(
458
+ self,
459
+ hidden_states: torch.Tensor,
460
+ attention_bias: torch.FloatTensor,
461
+ rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
462
+ padding_inputs: Optional[Tuple] = None, # indices, batch, seqlen
463
+ attention_scale: Optional[torch.FloatTensor] = None,
464
+ head_mask: Optional[torch.FloatTensor] = None,
465
+ output_attentions: Optional[bool] = False,
466
+ qkv_inputs: Optional[Tuple] = None, # For RetroMAE
467
+ ) -> Tuple[torch.Tensor, ...]:
468
+ shape_hd = (self.num_attention_heads, self.attention_head_size)
469
+ # qkv
470
+ if self.pack_qkv and qkv_inputs is None:
471
+ qkv_pack = self.qkv_proj(hidden_states).split(self.all_head_size, dim=-1)
472
+ else:
473
+ if qkv_inputs is None:
474
+ qkv_inputs = (hidden_states, hidden_states, hidden_states)
475
+ qkv_pack = [
476
+ getattr(self, n + '_proj')(s) for s, n in zip(qkv_inputs, 'qkv')
477
+ ]
478
+ query_states, key_states, value_states = [t.view(t.shape[:-1] + shape_hd) for t in qkv_pack]
479
+
480
+ if self.config.position_embedding_type == 'rope':
481
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, *rope_embeds)
482
+
483
+ dtype = query_states.dtype
484
+
485
+ if self.config.logn_attention_scale and attention_scale is not None:
486
+ # https://kexue.fm/archives/8823
487
+ query_states = query_states * attention_scale.to(dtype)
488
+
489
+ if padding_inputs is not None:
490
+ query_states = pad_input(query_states.squeeze(0), *padding_inputs)
490
+ key_states = pad_input(key_states.squeeze(0), *padding_inputs)
491
+ value_states = pad_input(value_states.squeeze(0), *padding_inputs)
493
+
494
+ if self.use_memory_efficient_attention:
495
+ assert self.memory_efficient_attention is not None, "xformers is not loaded"
496
+ assert output_attentions is False, "memory_efficient_attention does not output attentions"
497
+ assert head_mask is None, "head_mask is not supported yet"
498
+ attention_probs = None
499
+ if torch.is_tensor(attention_bias):
500
+ attention_bias = attention_bias.to(dtype)
501
+ context_layer = self.memory_efficient_attention(
502
+ query_states,
503
+ key_states,
504
+ value_states,
505
+ attn_bias=attention_bias,
506
+ p=self.dropout.p if self.training else 0.0,
507
+ )
508
+ else:
509
+ if output_attentions and isinstance(self, NewSdpaAttention):
510
+ raise RuntimeError("SDPA does not output attentions")
511
+ context_layer, attention_probs = self._attention(
512
+ query_states, key_states, value_states, attention_bias, head_mask
513
+ )
514
+
515
+ if padding_inputs is not None:
516
+ context_layer = unpad_input(context_layer, indices=padding_inputs[0])
517
+
518
+ new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
519
+ context_layer = context_layer.view(new_context_layer_shape)
520
+
521
+ # output proj
522
+ attn_output = self.o_proj(context_layer)
523
+
524
+ # add attentions if we output them
525
+ outputs = (attn_output, attention_probs) if output_attentions else (attn_output,)
526
+ return outputs
527
+
528
+ def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
529
+ """
530
+ Args:
531
+ q/k/v: (B, L, n_head, head_dim),
532
+ Returns:
533
+ attn_output: (B, L, n_head, head_dim)
534
+ """
535
+ query_states = query_states.transpose(1, 2)
536
+ key_states = key_states.transpose(1, 2)
537
+ value_states = value_states.transpose(1, 2)
538
+ # Take the dot product between "query" and "key" to get the raw attention scores.
539
+ attention_scores = torch.matmul(query_states, key_states.transpose(-1, -2))
540
+
541
+ attention_scores = attention_scores / math.sqrt(self.attention_head_size)
542
+ if attention_bias is not None:
543
+ # Apply the attention mask (precomputed once for all layers in the model's forward() function)
544
+ attention_scores = attention_scores + attention_bias
545
+
546
+ # Normalize the attention scores to probabilities.
547
+ attention_probs = nn.functional.softmax(attention_scores, dim=-1)
548
+
549
+ # This is actually dropping out entire tokens to attend to, which might
550
+ # seem a bit unusual, but is taken from the original Transformer paper.
551
+ if self.dropout.p > 0:
552
+ attention_probs = self.dropout(attention_probs)
553
+
554
+ # Mask heads if we want to
555
+ if head_mask is not None:
556
+ attention_probs = attention_probs * head_mask
557
+
558
+ context_layer = torch.matmul(attention_probs, value_states)
559
+
560
+ context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
561
+ return context_layer, attention_probs
562
+
563
+
564
+ class NewSdpaAttention(NewAttention):
565
+ """
566
+ New attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
567
+ `NewAttention` as the weights of the module stay untouched. The only changes are on the forward pass to adapt to
568
+ SDPA API.
569
+ """
570
+ def __init__(self, config: NewConfig, **kwargs):
571
+ super().__init__(config, **kwargs)
572
+ # torch.backends.cuda.enable_mem_efficient_sdp(False)
573
+ # logger.warning(
574
+ # "Disable memory efficient attention kernel for `NewSdpaAttention`, you can set "
575
+ # "`use_memory_efficient_attention=True` if you intend to use it."
576
+ # )
577
+
578
+ def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
579
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
580
+ query_states.transpose(1, 2),
581
+ key_states.transpose(1, 2),
582
+ value_states.transpose(1, 2),
583
+ attn_mask=attention_bias,
584
+ dropout_p=self.dropout.p if self.training else 0.0,
585
+ )
586
+ attn_output = attn_output.permute(0, 2, 1, 3).contiguous()
587
+ return attn_output, None
588
+
589
+
590
+ NEW_ATTENTION_CLASSES = {
591
+ "eager": NewAttention,
592
+ # "flash_attention_2": , # TODO
593
+ "sdpa": NewSdpaAttention,
594
+ }
595
+
596
+
597
+ class NewGatedMLP(nn.Module):
598
+ """
599
+ GLU Variants Improve Transformer.
600
+ """
601
+
602
+ def __init__(self, config: NewConfig):
603
+ super().__init__()
604
+ self.intermediate_size = config.intermediate_size
605
+ self.up_gate_proj = nn.Linear(config.hidden_size, self.intermediate_size * 2, bias=False)
606
+ self.down_proj = nn.Linear(self.intermediate_size, config.hidden_size, bias=True)
607
+ self.act_fn = ACT2FN[config.hidden_act]
608
+ if config.hidden_dropout_prob > 0:
609
+ self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
610
+ else:
611
+ self.hidden_dropout = None
612
+
613
+ def forward(self, hidden_states):
614
+ up_gate = self.up_gate_proj(hidden_states)
615
+ up_states, gate = torch.split(up_gate, self.intermediate_size, dim=-1)
616
+ gate = self.act_fn(gate)
617
+ gated_states = gate * up_states
618
+ if self.hidden_dropout is not None:
619
+ gated_states = self.hidden_dropout(gated_states)
620
+ down_states = self.down_proj(gated_states)
621
+ return down_states
622
+
623
+
624
+ class NewLayer(nn.Module):
625
+ def __init__(
626
+ self,
627
+ config: NewConfig,
628
+ pack_qkv=None,
629
+ use_memory_efficient_attention=None,
630
+ attn_implementation=None
631
+ ):
632
+ super().__init__()
633
+ if attn_implementation is None:
634
+ attn_implementation = config._attn_implementation
635
+ if use_memory_efficient_attention is None:
636
+ use_memory_efficient_attention = config.use_memory_efficient_attention
637
+ if use_memory_efficient_attention:
638
+ if attn_implementation != 'eager':
639
+ logger.warning_once(f"Override {attn_implementation=} to 'eager' as {use_memory_efficient_attention=}")
640
+ attn_implementation = 'eager' # Since it will be SDPA by default for torch>=2.1.1
641
+ self.attention = NEW_ATTENTION_CLASSES[attn_implementation](
642
+ config, pack_qkv=pack_qkv, use_memory_efficient_attention=use_memory_efficient_attention
643
+ )
644
+ self.mlp = NewGatedMLP(config)
645
+
646
+ ln_class = LAYER_NORM[config.layer_norm_type]
647
+ self.attn_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
648
+ self.mlp_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
649
+
650
+ if config.hidden_dropout_prob > 0:
651
+ self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
652
+ else:
653
+ self.hidden_dropout = None
654
+
655
+ def forward(
656
+ self,
657
+ hidden_states: torch.Tensor,
658
+ attention_bias: torch.FloatTensor,
659
+ rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
660
+ padding_inputs: Optional[Tuple] = None, # indices, batch, seqlen
661
+ attention_scale: Optional[torch.FloatTensor] = None,
662
+ subset_indices: Optional[torch.LongTensor] = None,
663
+ head_mask: Optional[torch.FloatTensor] = None,
664
+ output_attentions: Optional[bool] = False,
665
+ qkv_inputs: Optional[Tuple] = None, # For RetroMAE
666
+ ) -> Tuple[torch.Tensor, ...]:
667
+ # Multi head self attention
668
+ residual = hidden_states if qkv_inputs is None else qkv_inputs[0]
669
+ attention_outputs = self.attention(
670
+ hidden_states,
671
+ attention_bias,
672
+ rope_embeds,
673
+ padding_inputs,
674
+ attention_scale,
675
+ head_mask,
676
+ output_attentions=output_attentions,
677
+ qkv_inputs=qkv_inputs,
678
+ )
679
+ hidden_states = attention_outputs[0]
680
+ if self.hidden_dropout is not None:
681
+ hidden_states = self.hidden_dropout(hidden_states)
682
+ hidden_states = residual + hidden_states
683
+
684
+ # In pretraining, after the attention of the last layer, we only need the masked tokens.
685
+ if subset_indices is not None:
686
+ hidden_states = hidden_states[subset_indices]
687
+
688
+ hidden_states = self.attn_ln(hidden_states)
689
+
690
+ # Fully Connected
691
+ residual = hidden_states
692
+ hidden_states = self.mlp(hidden_states)
693
+ if self.hidden_dropout is not None:
694
+ hidden_states = self.hidden_dropout(hidden_states)
695
+ hidden_states = residual + hidden_states
696
+ hidden_states = self.mlp_ln(hidden_states)
697
+
698
+ # add self attentions if we output attention weights
699
+ outputs = (hidden_states,) + attention_outputs[1:]
700
+ return outputs
701
+
702
+
703
+ class NewEncoder(nn.Module):
704
+ def __init__(self, config):
705
+ super().__init__()
706
+ self.config = config
707
+ self.layer = nn.ModuleList([NewLayer(config) for _ in range(config.num_hidden_layers)])
708
+ self.gradient_checkpointing = False
709
+
710
+ def forward(
711
+ self,
712
+ hidden_states: torch.Tensor,
713
+ attention_bias: Optional[torch.FloatTensor] = None,
714
+ rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
715
+ padding_inputs: Optional[Tuple] = None, # indices, batch, seqlen
716
+ attention_scale: Optional[torch.FloatTensor] = None,
717
+ subset_indices: Optional[torch.LongTensor] = None,
718
+ head_mask: Optional[torch.FloatTensor] = None,
719
+ output_attentions: Optional[bool] = False,
720
+ output_hidden_states: Optional[bool] = False,
721
+ return_dict: Optional[bool] = True,
722
+ ) -> Union[Tuple[torch.Tensor], BaseModelOutput]:
723
+ all_hidden_states = () if output_hidden_states else None
724
+ all_self_attentions = () if output_attentions else None
725
+
726
+ for i, layer_module in enumerate(self.layer):
727
+ if output_hidden_states:
728
+ all_hidden_states = all_hidden_states + (hidden_states,)
729
+
730
+ if i >= len(self.layer) - 1:
731
+ layer_subset_indices = subset_indices
732
+ else:
733
+ layer_subset_indices = None
734
+
735
+ layer_head_mask = head_mask[i] if head_mask is not None else None
736
+
737
+ if self.gradient_checkpointing and self.training:
738
+ layer_outputs = self._gradient_checkpointing_func(
739
+ layer_module.__call__,
740
+ hidden_states,
741
+ attention_bias,
742
+ rope_embeds,
743
+ padding_inputs,
744
+ attention_scale,
745
+ layer_subset_indices,
746
+ layer_head_mask,
747
+ )
748
+ else:
749
+ layer_outputs = layer_module(
750
+ hidden_states,
751
+ attention_bias,
752
+ rope_embeds,
753
+ padding_inputs,
754
+ attention_scale,
755
+ layer_subset_indices,
756
+ layer_head_mask,
757
+ output_attentions,
758
+ )
759
+
760
+ hidden_states = layer_outputs[0]
761
+ if output_attentions:
762
+ all_self_attentions = all_self_attentions + (layer_outputs[1],)
763
+
764
+ if output_hidden_states:
765
+ all_hidden_states = all_hidden_states + (hidden_states,)
766
+
767
+ if not return_dict:
768
+ return tuple(
769
+ v
770
+ for v in [
771
+ hidden_states,
772
+ all_hidden_states,
773
+ all_self_attentions,
774
+ ]
775
+ if v is not None
776
+ )
777
+ return BaseModelOutput(
778
+ last_hidden_state=hidden_states,
779
+ hidden_states=all_hidden_states,
780
+ attentions=all_self_attentions,
781
+ )
782
+
783
+
784
+ # Copied from transformers.models.bert.modeling_bert.BertPooler with Bert->New
785
+ class NewPooler(nn.Module):
786
+ def __init__(self, config):
787
+ super().__init__()
788
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
789
+ self.activation = nn.Tanh()
790
+
791
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
792
+ # We "pool" the model by simply taking the hidden state corresponding
793
+ # to the first token.
794
+ first_token_tensor = hidden_states[:, 0]
795
+ pooled_output = self.dense(first_token_tensor)
796
+ pooled_output = self.activation(pooled_output)
797
+ return pooled_output
798
+
799
+
800
+ class NewPreTrainedModel(PreTrainedModel):
801
+ """
802
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
803
+ models.
804
+ """
805
+
806
+ config_class = NewConfig
807
+ base_model_prefix = "new"
808
+ supports_gradient_checkpointing = True
809
+ _supports_sdpa = True
810
+
811
+ def _init_weights(self, module):
812
+ """Initialize the weights"""
813
+ if isinstance(module, nn.Linear):
814
+ # Slightly different from the TF version which uses truncated_normal for initialization
815
+ # cf https://github.com/pytorch/pytorch/pull/5617
816
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
817
+ if module.bias is not None:
818
+ module.bias.data.zero_()
819
+ elif isinstance(module, nn.Embedding):
820
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
821
+ if module.padding_idx is not None:
822
+ module.weight.data[module.padding_idx].zero_()
823
+ elif isinstance(module, nn.LayerNorm):
824
+ module.bias.data.zero_()
825
+ module.weight.data.fill_(1.0)
826
+
827
+
828
+ class NewModel(NewPreTrainedModel):
829
+ """
830
+ The bare New Model transformer outputting raw hidden-states without any specific head on top.
831
+ """
832
+
833
+ def __init__(self, config: NewConfig, add_pooling_layer=False):
834
+ super().__init__(config)
835
+ self.config = config
836
+
837
+ self.embeddings = NewEmbeddings(config)
838
+ self.encoder = NewEncoder(config)
839
+
840
+ self.pooler = NewPooler(config) if add_pooling_layer else None
841
+
842
+ # Initialize weights and apply final processing
843
+ self.post_init()
844
+
845
+ def get_input_embeddings(self):
846
+ return self.embeddings.word_embeddings
847
+
848
+ def set_input_embeddings(self, value):
849
+ self.embeddings.word_embeddings = value
850
+
851
+ def forward(
852
+ self,
853
+ input_ids: Optional[torch.Tensor] = None,
854
+ attention_mask: Optional[torch.Tensor] = None,
855
+ length: Optional[List[int]] = None,
856
+ subset_indices: Optional[torch.LongTensor] = None,
857
+ token_type_ids: Optional[torch.Tensor] = None,
858
+ position_ids: Optional[torch.Tensor] = None,
859
+ head_mask: Optional[torch.Tensor] = None,
860
+ inputs_embeds: Optional[torch.Tensor] = None,
861
+ output_attentions: Optional[bool] = None,
862
+ output_hidden_states: Optional[bool] = None,
863
+ return_dict: Optional[bool] = None,
864
+ unpad_inputs: Optional[bool] = None,
865
+ ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPooling]:
866
+ r"""
867
+ length (`list` of length `batch_size`, *optional*):
868
+ If `None`, the padded `last_hidden_state` is returned.
869
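+ subset_indices (`torch.LongTensor`, *optional*):
870
+ Selects the tokens kept after the last layer's attention; in pretraining, only the masked tokens are needed.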
871
+ unpad_inputs (`bool`, *optional*):
872
+ Whether to strip padding tokens before encoding. Defaults to `config.unpad_inputs`.
873
+ """
874
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
875
+ output_hidden_states = (
876
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
877
+ )
878
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
879
+ unpad_inputs = unpad_inputs if unpad_inputs is not None else self.config.unpad_inputs
880
+ output_padded = length is None
881
+
882
+ if input_ids is not None and inputs_embeds is not None:
883
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
884
+ elif input_ids is not None:
885
+ self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
886
+ input_shape = input_ids.size()
887
+ elif inputs_embeds is not None:
888
+ input_shape = inputs_embeds.size()[:-1]
889
+ else:
890
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
891
+
892
+ # TODO: not used
893
+ # # Prepare head mask if needed
894
+ # # 1.0 in head_mask indicate we keep the head
895
+ # # attention_probs has shape bsz x n_heads x N x N
896
+ # # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
897
+ # # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
898
+ # head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
899
+
900
+ # Get embeddings, may unpad them
901
+ (embedding_output, attention_mask, rope_embeds, length) = self.embeddings(
902
+ unpad_inputs,
903
+ input_ids=input_ids,
904
+ attention_mask=attention_mask,
905
+ length=length,
906
+ token_type_ids=token_type_ids,
907
+ position_ids=position_ids,
908
+ inputs_embeds=inputs_embeds
909
+ )
910
+
911
+ batch_size, seq_length = input_shape
912
+ if unpad_inputs and self.config.use_memory_efficient_attention:
913
+ attention_bias = xops.fmha.attn_bias.BlockDiagonalMask.from_seqlens(length)
914
+ else:
915
+ # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
916
+ # ourselves in which case we just need to make it broadcastable to all heads.
917
+ attention_bias = self.get_extended_attention_mask(attention_mask, input_shape)
918
+ if self.config.use_memory_efficient_attention:
919
+ # Memory-efficient attention rejects a broadcastable bias ("Invalid shape for attention bias: torch.Size([48, 1, 1, 512]) (expected (48, 12, 512, 512))"), so expand it explicitly.
920
+ attention_bias = attention_bias.expand(-1, self.config.num_attention_heads, seq_length, -1)
921
+
922
+ padding_inputs = None
923
+ if unpad_inputs and (output_padded or not self.config.use_memory_efficient_attention):
924
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
925
+ if not self.config.use_memory_efficient_attention:
926
+ padding_inputs = (indices, *input_shape)
927
+
928
+ attention_scale = None
929
+ if self.config.logn_attention_scale:
930
+ logger.warning_once("TODO: logn_attention_scale")
931
+ # # attention scale log_512(input_len)
932
+ # attention_scale = attention_mask.sum(1).log() / torch.tensor(self.config.max_position_embeddings).log()
933
+ # # inference-time logn scale need clip 1
934
+ # if self.config.logn_attention_clip1:
935
+ # attention_scale.clip_(1)
936
+ # attention_scale = attention_scale[:, None, None, None]
937
+ # else:
938
+ # attention_scale = None
939
+
940
+ encoder_outputs = self.encoder(
941
+ embedding_output,
942
+ attention_bias=attention_bias,
943
+ rope_embeds=rope_embeds,
944
+ padding_inputs=padding_inputs,
945
+ attention_scale=attention_scale,
946
+ subset_indices=subset_indices,
947
+ head_mask=head_mask,
948
+ output_attentions=output_attentions,
949
+ output_hidden_states=output_hidden_states,
950
+ return_dict=return_dict,
951
+ )
952
+ sequence_output = encoder_outputs[0]
953
+ if unpad_inputs and output_padded:
954
+ sequence_output = pad_input(
955
+ sequence_output.squeeze(), indices, batch_size, seq_length
956
+ )
957
+
958
+ pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
959
+
960
+ if not return_dict:
961
+ return (sequence_output, pooled_output) + encoder_outputs[1:]
962
+
963
+ return BaseModelOutputWithPooling(
964
+ last_hidden_state=sequence_output,
965
+ pooler_output=pooled_output,
966
+ hidden_states=encoder_outputs.hidden_states,
967
+ attentions=encoder_outputs.attentions,
968
+ )
969
+
970
+
971
+ class NewLMPredictionHead(nn.Module):
972
+ def __init__(self, config):
973
+ super().__init__()
974
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
975
+ self.transform_act_fn = ACT2FN[config.hidden_act]
976
+ self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
977
+
978
+ # The output weights are the same as the input embeddings, but there is
979
+ # an output-only bias for each token.
980
+ self.decoder = nn.Linear(config.hidden_size, config.vocab_size)
981
+
982
+ def forward(self, hidden_states):
983
+ hidden_states = self.dense(hidden_states)
984
+ hidden_states = self.transform_act_fn(hidden_states)
985
+ hidden_states = self.norm(hidden_states)
986
+ hidden_states = self.decoder(hidden_states)
987
+ return hidden_states
988
+
989
+
990
+ class NewForMaskedLM(NewPreTrainedModel):
991
+ _tied_weights_keys = ["lm_head.decoder.bias", "lm_head.decoder.weight"]
992
+
993
+ def __init__(self, config: NewConfig):
994
+ super().__init__(config)
995
+ self.new = NewModel(config, add_pooling_layer=False)
996
+ self.lm_head = NewLMPredictionHead(config)
997
+ self.loss_fct = nn.CrossEntropyLoss()
998
+
999
+ # Initialize weights and apply final processing
1000
+ self.post_init()
1001
+
1002
+ def get_output_embeddings(self):
1003
+ return self.lm_head.decoder
1004
+
1005
+ def set_output_embeddings(self, new_embeddings):
1006
+ self.lm_head.decoder = new_embeddings
1007
+
1008
+ def forward(
1009
+ self,
1010
+ input_ids: Optional[torch.Tensor] = None,
1011
+ attention_mask: Optional[torch.Tensor] = None,
1012
+ token_type_ids: Optional[torch.Tensor] = None,
1013
+ position_ids: Optional[torch.Tensor] = None,
1014
+ head_mask: Optional[torch.Tensor] = None,
1015
+ inputs_embeds: Optional[torch.Tensor] = None,
1016
+ labels: Optional[torch.Tensor] = None,
1017
+ output_attentions: Optional[bool] = None,
1018
+ output_hidden_states: Optional[bool] = None,
1019
+ return_dict: Optional[bool] = None,
1020
+ unpad_inputs: Optional[bool] = None,
1021
+ ) -> Union[Tuple[torch.Tensor], MaskedLMOutput]:
1022
+ r"""
1023
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1024
+ Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
1025
+ config.vocab_size]` (see the `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked); the
1026
+ loss is only computed for tokens with labels in `[0, ..., config.vocab_size]`.
1027
+ """
1028
+
1029
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1030
+
1031
+ if labels is None or not self.new.config.unpad_inputs:
1032
+ length = None
1033
+ subset_indices = None
1034
+ else:
1035
+ length = attention_mask.sum(-1).tolist()
1036
+ labels = labels[attention_mask.bool()].unsqueeze(0)
1037
+ subset_indices = labels > -100
1038
+
1039
+ outputs = self.new(
1040
+ input_ids,
1041
+ attention_mask=attention_mask,
1042
+ length=length,
1043
+ subset_indices=subset_indices,
1044
+ token_type_ids=token_type_ids,
1045
+ position_ids=position_ids,
1046
+ head_mask=head_mask,
1047
+ inputs_embeds=inputs_embeds,
1048
+ output_attentions=output_attentions,
1049
+ output_hidden_states=output_hidden_states,
1050
+ return_dict=return_dict,
1051
+ unpad_inputs=unpad_inputs,
1052
+ )
1053
+
1054
+ sequence_output = outputs[0]
1055
+ prediction_scores = self.lm_head(sequence_output)
1056
+
1057
+ masked_lm_loss = None
1058
+ if labels is not None:
1059
+ if subset_indices is None:
1060
+ mask = attention_mask.bool()
1061
+ prediction_scores = prediction_scores[mask]
1062
+ labels = labels[mask]
1063
+ else:
1064
+ labels = labels[subset_indices]
1065
+ masked_lm_loss = self.loss_fct(prediction_scores, labels)
1066
+
1067
+ if not return_dict:
1068
+ output = (prediction_scores,) + outputs[2:]
1069
+ return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
1070
+
1071
+ return MaskedLMOutput(
1072
+ loss=masked_lm_loss,
1073
+ logits=prediction_scores,
1074
+ hidden_states=outputs.hidden_states,
1075
+ attentions=outputs.attentions,
1076
+ )
1077
+
1078
+
1079
+ class NewForSequenceClassification(NewPreTrainedModel):
1080
+ def __init__(self, config):
1081
+ super().__init__(config)
1082
+ self.num_labels = config.num_labels
1083
+ self.config = config
1084
+
1085
+ self.new = NewModel(config, add_pooling_layer=True)
1086
+ classifier_dropout = (
1087
+ config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
1088
+ )
1089
+ self.dropout = nn.Dropout(classifier_dropout)
1090
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels)
1091
+
1092
+ # Initialize weights and apply final processing
1093
+ self.post_init()
1094
+
1095
+ def forward(
1096
+ self,
1097
+ input_ids: Optional[torch.Tensor] = None,
1098
+ attention_mask: Optional[torch.Tensor] = None,
1099
+ token_type_ids: Optional[torch.Tensor] = None,
1100
+ position_ids: Optional[torch.Tensor] = None,
1101
+ head_mask: Optional[torch.Tensor] = None,
1102
+ inputs_embeds: Optional[torch.Tensor] = None,
1103
+ labels: Optional[torch.Tensor] = None,
1104
+ output_attentions: Optional[bool] = None,
1105
+ output_hidden_states: Optional[bool] = None,
1106
+ return_dict: Optional[bool] = None,
1107
+ unpad_inputs: Optional[bool] = None,
1108
+ ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
1109
+ r"""
1110
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1111
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1112
+ config.num_labels - 1]`. If `config.num_labels == 1`, a regression loss is computed (mean squared error); if
1113
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1114
+ """
1115
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1116
+
1117
+ outputs = self.new(
1118
+ input_ids,
1119
+ attention_mask=attention_mask,
1120
+ token_type_ids=token_type_ids,
1121
+ position_ids=position_ids,
1122
+ head_mask=head_mask,
1123
+ inputs_embeds=inputs_embeds,
1124
+ output_attentions=output_attentions,
1125
+ output_hidden_states=output_hidden_states,
1126
+ return_dict=return_dict,
1127
+ unpad_inputs=unpad_inputs,
1128
+ )
1129
+
1130
+ pooled_output = outputs[1]
1131
+
1132
+ pooled_output = self.dropout(pooled_output)
1133
+ logits = self.classifier(pooled_output)
1134
+
1135
+ loss = None
1136
+ if labels is not None:
1137
+ if self.config.problem_type is None:
1138
+ if self.num_labels == 1:
1139
+ self.config.problem_type = "regression"
1140
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1141
+ self.config.problem_type = "single_label_classification"
1142
+ else:
1143
+ self.config.problem_type = "multi_label_classification"
1144
+
1145
+ if self.config.problem_type == "regression":
1146
+ loss_fct = nn.MSELoss()
1147
+ if self.num_labels == 1:
1148
+ loss = loss_fct(logits.squeeze(), labels.squeeze())
1149
+ else:
1150
+ loss = loss_fct(logits, labels)
1151
+ elif self.config.problem_type == "single_label_classification":
1152
+ loss_fct = nn.CrossEntropyLoss()
1153
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
1154
+ elif self.config.problem_type == "multi_label_classification":
1155
+ loss_fct = nn.BCEWithLogitsLoss()
1156
+ loss = loss_fct(logits, labels)
1157
+
1158
+ if not return_dict:
1159
+ output = (logits,) + outputs[2:]
1160
+ return ((loss,) + output) if loss is not None else output
1161
+
1162
+ return SequenceClassifierOutput(
1163
+ loss=loss,
1164
+ logits=logits,
1165
+ hidden_states=outputs.hidden_states,
1166
+ attentions=outputs.attentions,
1167
+ )
1168
+
1169
+
1170
+ class NewForMultipleChoice(NewPreTrainedModel):
1171
+ def __init__(self, config):
1172
+ super().__init__(config)
1173
+
1174
+ self.new = NewModel(config, add_pooling_layer=True)
1175
+ classifier_dropout = (
1176
+ config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
1177
+ )
1178
+ self.dropout = nn.Dropout(classifier_dropout)
1179
+ self.classifier = nn.Linear(config.hidden_size, 1)
1180
+
1181
+ # Initialize weights and apply final processing
1182
+ self.post_init()
1183
+
1184
+ def forward(
1185
+ self,
1186
+ input_ids: Optional[torch.Tensor] = None,
1187
+ attention_mask: Optional[torch.Tensor] = None,
1188
+ token_type_ids: Optional[torch.Tensor] = None,
1189
+ position_ids: Optional[torch.Tensor] = None,
1190
+ head_mask: Optional[torch.Tensor] = None,
1191
+ inputs_embeds: Optional[torch.Tensor] = None,
1192
+ labels: Optional[torch.Tensor] = None,
1193
+ output_attentions: Optional[bool] = None,
1194
+ output_hidden_states: Optional[bool] = None,
1195
+ return_dict: Optional[bool] = None,
1196
+ unpad_inputs: Optional[bool] = None,
1197
+ ) -> Union[Tuple[torch.Tensor], MultipleChoiceModelOutput]:
1198
+ r"""
1199
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1200
+ Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
1201
+ num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
1202
+ `input_ids` above)
1203
+ """
1204
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1205
+ num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
1206
+
1207
+ input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
1208
+ attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None
1209
+ token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
1210
+ position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
1211
+ inputs_embeds = (
1212
+ inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))
1213
+ if inputs_embeds is not None
1214
+ else None
1215
+ )
1216
+
1217
+ outputs = self.new(
1218
+ input_ids,
1219
+ attention_mask=attention_mask,
1220
+ token_type_ids=token_type_ids,
1221
+ position_ids=position_ids,
1222
+ head_mask=head_mask,
1223
+ inputs_embeds=inputs_embeds,
1224
+ output_attentions=output_attentions,
1225
+ output_hidden_states=output_hidden_states,
1226
+ return_dict=return_dict,
1227
+ unpad_inputs=unpad_inputs,
1228
+ )
1229
+
1230
+ pooled_output = outputs[1]
1231
+
1232
+ pooled_output = self.dropout(pooled_output)
1233
+ logits = self.classifier(pooled_output)
1234
+ reshaped_logits = logits.view(-1, num_choices)
1235
+
1236
+ loss = None
1237
+ if labels is not None:
1238
+ loss_fct = nn.CrossEntropyLoss()
1239
+ loss = loss_fct(reshaped_logits, labels)
1240
+
1241
+ if not return_dict:
1242
+ output = (reshaped_logits,) + outputs[2:]
1243
+ return ((loss,) + output) if loss is not None else output
1244
+
1245
+ return MultipleChoiceModelOutput(
1246
+ loss=loss,
1247
+ logits=reshaped_logits,
1248
+ hidden_states=outputs.hidden_states,
1249
+ attentions=outputs.attentions,
1250
+ )
1251
+
1252
+
1253
+ @dataclass
1254
+ class NewTokenClassifierOutput(ModelOutput):
1255
+ loss: Optional[torch.FloatTensor] = None
1256
+ logits: torch.FloatTensor = None
1257
+ last_hidden_state: torch.FloatTensor = None
1258
+ hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
1259
+ attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
1260
+
1261
+
1262
+ class NewForTokenClassification(NewPreTrainedModel):
1263
+ def __init__(self, config):
1264
+ super().__init__(config)
1265
+ self.num_labels = config.num_labels
1266
+
1267
+ self.new = NewModel(config, add_pooling_layer=False)
1268
+ classifier_dropout = (
1269
+ config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
1270
+ )
1271
+ self.dropout = nn.Dropout(classifier_dropout)
1272
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels)
1273
+
1274
+ # Initialize weights and apply final processing
1275
+ self.post_init()
1276
+
1277
+ def forward(
1278
+ self,
1279
+ input_ids: Optional[torch.Tensor] = None,
1280
+ attention_mask: Optional[torch.Tensor] = None,
1281
+ token_type_ids: Optional[torch.Tensor] = None,
1282
+ position_ids: Optional[torch.Tensor] = None,
1283
+ head_mask: Optional[torch.Tensor] = None,
1284
+ inputs_embeds: Optional[torch.Tensor] = None,
1285
+ labels: Optional[torch.Tensor] = None,
1286
+ output_attentions: Optional[bool] = None,
1287
+ output_hidden_states: Optional[bool] = None,
1288
+ return_dict: Optional[bool] = None,
1289
+ unpad_inputs: Optional[bool] = None,
1290
+ ) -> Union[Tuple[torch.Tensor], NewTokenClassifierOutput]:
1291
+ r"""
1292
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1293
+ Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
1294
+ """
1295
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1296
+
1297
+ outputs = self.new(
1298
+ input_ids,
1299
+ attention_mask=attention_mask,
1300
+ token_type_ids=token_type_ids,
1301
+ position_ids=position_ids,
1302
+ head_mask=head_mask,
1303
+ inputs_embeds=inputs_embeds,
1304
+ output_attentions=output_attentions,
1305
+ output_hidden_states=output_hidden_states,
1306
+ return_dict=return_dict,
1307
+ unpad_inputs=unpad_inputs,
1308
+ )
1309
+
1310
+ sequence_output = outputs[0]
1311
+
1312
+ sequence_output = self.dropout(sequence_output)
1313
+ logits = self.classifier(sequence_output)
1314
+
1315
+ loss = None
1316
+ if labels is not None:
1317
+ loss_fct = nn.CrossEntropyLoss()
1318
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
1319
+
1320
+ if not return_dict:
1321
+ output = (logits,) + outputs[2:]
1322
+ return ((loss,) + output) if loss is not None else output
1323
+
1324
+ return NewTokenClassifierOutput(
1325
+ loss=loss,
1326
+ logits=logits,
1327
+ last_hidden_state=sequence_output,
1328
+ hidden_states=outputs.hidden_states,
1329
+ attentions=outputs.attentions,
1330
+ )
1331
+
1332
+
1333
+ class NewForQuestionAnswering(NewPreTrainedModel):
1334
+ def __init__(self, config):
1335
+ super().__init__(config)
1336
+ self.num_labels = config.num_labels
1337
+
1338
+ self.new = NewModel(config, add_pooling_layer=False)
1339
+ self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
1340
+
1341
+ # Initialize weights and apply final processing
1342
+ self.post_init()
1343
+
1344
+ def forward(
1345
+ self,
1346
+ input_ids: Optional[torch.Tensor] = None,
1347
+ attention_mask: Optional[torch.Tensor] = None,
1348
+ token_type_ids: Optional[torch.Tensor] = None,
1349
+ position_ids: Optional[torch.Tensor] = None,
1350
+ head_mask: Optional[torch.Tensor] = None,
1351
+ inputs_embeds: Optional[torch.Tensor] = None,
1352
+ start_positions: Optional[torch.Tensor] = None,
1353
+ end_positions: Optional[torch.Tensor] = None,
1354
+ output_attentions: Optional[bool] = None,
1355
+ output_hidden_states: Optional[bool] = None,
1356
+ return_dict: Optional[bool] = None,
1357
+ unpad_inputs: Optional[bool] = None,
1358
+ ) -> Union[Tuple[torch.Tensor], QuestionAnsweringModelOutput]:
1359
+ r"""
1360
+ start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1361
+ Labels for position (index) of the start of the labelled span for computing the token classification loss.
1362
+ Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
1363
+ are not taken into account for computing the loss.
1364
+ end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1365
+ Labels for position (index) of the end of the labelled span for computing the token classification loss.
1366
+ Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
1367
+ are not taken into account for computing the loss.
1368
+ """
1369
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1370
+
1371
+ outputs = self.new(
1372
+ input_ids,
1373
+ attention_mask=attention_mask,
1374
+ token_type_ids=token_type_ids,
1375
+ position_ids=position_ids,
1376
+ head_mask=head_mask,
1377
+ inputs_embeds=inputs_embeds,
1378
+ output_attentions=output_attentions,
1379
+ output_hidden_states=output_hidden_states,
1380
+ return_dict=return_dict,
1381
+ unpad_inputs=unpad_inputs,
1382
+ )
1383
+
1384
+ sequence_output = outputs[0]
1385
+
1386
+ logits = self.qa_outputs(sequence_output)
1387
+ start_logits, end_logits = logits.split(1, dim=-1)
1388
+ start_logits = start_logits.squeeze(-1).contiguous()
1389
+ end_logits = end_logits.squeeze(-1).contiguous()
1390
+
1391
+ total_loss = None
1392
+ if start_positions is not None and end_positions is not None:
1393
+ # On multi-GPU, splitting the batch adds a dimension; squeeze it away
1394
+ if len(start_positions.size()) > 1:
1395
+ start_positions = start_positions.squeeze(-1)
1396
+ if len(end_positions.size()) > 1:
1397
+ end_positions = end_positions.squeeze(-1)
1398
+ # sometimes the start/end positions are outside our model inputs, we ignore these terms
1399
+ ignored_index = start_logits.size(1)
1400
+ start_positions = start_positions.clamp(0, ignored_index)
1401
+ end_positions = end_positions.clamp(0, ignored_index)
1402
+
1403
+ loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)
1404
+ start_loss = loss_fct(start_logits, start_positions)
1405
+ end_loss = loss_fct(end_logits, end_positions)
1406
+ total_loss = (start_loss + end_loss) / 2
1407
+
1408
+ if not return_dict:
1409
+ output = (start_logits, end_logits) + outputs[2:]
1410
+ return ((total_loss,) + output) if total_loss is not None else output
1411
+
1412
+ return QuestionAnsweringModelOutput(
1413
+ loss=total_loss,
1414
+ start_logits=start_logits,
1415
+ end_logits=end_logits,
1416
+ hidden_states=outputs.hidden_states,
1417
+ attentions=outputs.attentions,
1418
+ )
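The `unpad_inputs` path in `NewModel.forward` above flattens the batch, keeps only the positions where `attention_mask` is non-zero, and later scatters the results back with `pad_input`. The following is a minimal pure-Python sketch of that round trip, not the model's implementation: the real code operates on tensors, and the helper names `unpad` and `repad` are hypothetical.

```python
from typing import List, Tuple


def unpad(batch: List[List[int]], mask: List[List[int]]) -> Tuple[List[int], List[int], List[int]]:
    """Flatten a padded batch and keep only the positions where mask is non-zero."""
    flat_tokens = [tok for row in batch for tok in row]
    flat_mask = [m for row in mask for m in row]
    # Same idea as torch.nonzero(attention_mask.flatten()).flatten() in the diff.
    indices = [i for i, m in enumerate(flat_mask) if m]
    tokens = [flat_tokens[i] for i in indices]
    # Same idea as attention_mask.sum(-1).tolist() in NewForMaskedLM.forward.
    lengths = [sum(row) for row in mask]
    return tokens, indices, lengths


def repad(tokens: List[int], indices: List[int], batch_size: int, seq_len: int, pad_value: int = 0) -> List[List[int]]:
    """Scatter unpadded values back into a (batch_size, seq_len) grid (hypothetical stand-in for pad_input)."""
    flat = [pad_value] * (batch_size * seq_len)
    for tok, i in zip(tokens, indices):
        flat[i] = tok
    return [flat[r * seq_len:(r + 1) * seq_len] for r in range(batch_size)]


batch = [[5, 6, 7, 0], [8, 9, 0, 0]]
mask = [[1, 1, 1, 0], [1, 1, 0, 0]]
tokens, indices, lengths = unpad(batch, mask)
restored = repad(tokens, indices, batch_size=2, seq_len=4)
```

The `lengths` list plays the role of the `length` argument, which the diff passes to `BlockDiagonalMask.from_seqlens` so that tokens from different sequences do not attend to each other once the batch is flattened.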
query_encoder/special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
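In the special tokens map above, several roles point at the same token string. A small sketch, using only the standard library and an abbreviated copy of the JSON (only the `content` fields are kept, which is a simplification made here), groups the roles by token to make the sharing explicit:

```python
import json
from collections import defaultdict

# Abbreviated copy of special_tokens_map.json: only the "content" fields are kept.
special_tokens_map = json.loads("""
{
  "bos_token": {"content": "<s>"},
  "cls_token": {"content": "<s>"},
  "eos_token": {"content": "</s>"},
  "mask_token": {"content": "<mask>"},
  "pad_token": {"content": "<pad>"},
  "sep_token": {"content": "</s>"},
  "unk_token": {"content": "<unk>"}
}
""")

# Group role names by the token string they map to.
roles_by_content = defaultdict(list)
for role, spec in special_tokens_map.items():
    roles_by_content[spec["content"]].append(role)

# Keep only the token strings that serve more than one role.
shared = {content: roles for content, roles in roles_by_content.items() if len(roles) > 1}
print(shared)
```

With this file, `<s>` serves as both `bos_token` and `cls_token`, and `</s>` as both `eos_token` and `sep_token`.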
query_encoder/tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8373f9cd3d27591e1924426bcc1c8799bc5a9affc4fc857982c5d66668dd1f41
+ size 17082832
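The three lines above are a Git LFS pointer rather than the tokenizer itself: the repository stores the spec version, the object's SHA-256 digest, and its size in bytes, while the actual `tokenizer.json` lives in LFS storage. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper written for illustration, not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:8373f9cd3d27591e1924426bcc1c8799bc5a9affc4fc857982c5d66668dd1f41\n"
    "size 17082832\n"
)
print(pointer["hash_algo"], pointer["size"])
```

The `size` field gives the real file's length in bytes (about 17 MB here), which explains why `.gitattributes` routes the `tokenizer.json` files through LFS.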
query_encoder/tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "250001": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": "<mask>",
+   "max_length": 512,
+   "model_max_length": 32768,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "stride": 0,
+   "tokenizer_class": "XLMRobertaTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "<unk>"
+ }
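The tokenizer config above keys `added_tokens_decoder` by token ID (as strings) and separately names the token used for each role. A short sketch, using an abbreviated copy of the JSON and only the standard library, resolves each role to its integer ID and surfaces the two length fields. Which of `max_length` and `model_max_length` takes effect depends on how the tokenizer is called, so the sketch only reports the values and makes no claim about truncation behavior.

```python
import json

# Abbreviated copy of tokenizer_config.json: only the fields used below are kept.
config = json.loads("""
{
  "added_tokens_decoder": {
    "0": {"content": "<s>", "special": true},
    "1": {"content": "<pad>", "special": true},
    "2": {"content": "</s>", "special": true},
    "3": {"content": "<unk>", "special": true},
    "250001": {"content": "<mask>", "special": true}
  },
  "cls_token": "<s>",
  "mask_token": "<mask>",
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "unk_token": "<unk>",
  "max_length": 512,
  "model_max_length": 32768
}
""")

# Invert the decoder table: token string -> integer ID.
token_to_id = {spec["content"]: int(token_id) for token_id, spec in config["added_tokens_decoder"].items()}

# Resolve each role name to the ID of the token it refers to.
roles = ("cls_token", "pad_token", "sep_token", "unk_token", "mask_token")
role_ids = {role: token_to_id[config[role]] for role in roles}

print(role_ids)
print(config["max_length"], config["model_max_length"])
```

Note that `<mask>` sits at ID 250001, far from the other special tokens at IDs 0 to 3.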
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:155b78dcd6b943b3fa338b1893ed5140f32df8bbedd9c8465bb493dc271cdec8
+ size 7736
+ size 7736