sujr committed
Commit 8a3746a · verified · 1 Parent(s): ba9f0b5

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes. See the raw diff for the full change set.

Files changed (50)
  1. checkpoint-1200/README.md +203 -0
  2. checkpoint-1200/adapter_config.json +380 -0
  3. checkpoint-1200/adapter_model.safetensors +3 -0
  4. checkpoint-1200/latest +1 -0
  5. checkpoint-1200/qwen.tiktoken +0 -0
  6. checkpoint-1200/rng_state_0.pth +3 -0
  7. checkpoint-1200/rng_state_1.pth +3 -0
  8. checkpoint-1200/rng_state_2.pth +3 -0
  9. checkpoint-1200/rng_state_3.pth +3 -0
  10. checkpoint-1200/rng_state_4.pth +3 -0
  11. checkpoint-1200/rng_state_5.pth +3 -0
  12. checkpoint-1200/rng_state_6.pth +3 -0
  13. checkpoint-1200/rng_state_7.pth +3 -0
  14. checkpoint-1200/scheduler.pt +3 -0
  15. checkpoint-1200/special_tokens_map.json +3 -0
  16. checkpoint-1200/tokenization_qwen.py +598 -0
  17. checkpoint-1200/tokenizer_config.json +14 -0
  18. checkpoint-1200/trainer_state.json +873 -0
  19. checkpoint-1200/training_args.bin +3 -0
  20. checkpoint-1200/zero_to_fp32.py +587 -0
  21. checkpoint-1600/README.md +203 -0
  22. checkpoint-1600/adapter_config.json +380 -0
  23. checkpoint-1600/adapter_model.safetensors +3 -0
  24. checkpoint-1600/latest +1 -0
  25. checkpoint-1600/qwen.tiktoken +0 -0
  26. checkpoint-1600/rng_state_0.pth +3 -0
  27. checkpoint-1600/rng_state_1.pth +3 -0
  28. checkpoint-1600/rng_state_2.pth +3 -0
  29. checkpoint-1600/rng_state_3.pth +3 -0
  30. checkpoint-1600/rng_state_4.pth +3 -0
  31. checkpoint-1600/rng_state_5.pth +3 -0
  32. checkpoint-1600/rng_state_6.pth +3 -0
  33. checkpoint-1600/rng_state_7.pth +3 -0
  34. checkpoint-1600/scheduler.pt +3 -0
  35. checkpoint-1600/special_tokens_map.json +3 -0
  36. checkpoint-1600/tokenization_qwen.py +598 -0
  37. checkpoint-1600/tokenizer_config.json +14 -0
  38. checkpoint-1600/trainer_state.json +1153 -0
  39. checkpoint-1600/training_args.bin +3 -0
  40. checkpoint-1600/zero_to_fp32.py +587 -0
  41. checkpoint-2000/README.md +203 -0
  42. checkpoint-2000/adapter_config.json +380 -0
  43. checkpoint-2000/adapter_model.safetensors +3 -0
  44. checkpoint-2000/latest +1 -0
  45. checkpoint-2000/qwen.tiktoken +0 -0
  46. checkpoint-2000/rng_state_0.pth +3 -0
  47. checkpoint-2000/rng_state_1.pth +3 -0
  48. checkpoint-2000/rng_state_2.pth +3 -0
  49. checkpoint-2000/rng_state_3.pth +3 -0
  50. checkpoint-2000/rng_state_4.pth +3 -0
checkpoint-1200/README.md ADDED
@@ -0,0 +1,203 @@
+ ---
+ library_name: peft
+ base_model: Qwen/Qwen-VL-Chat
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
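The card's quick-start section is still a placeholder. As a stopgap, here is a minimal, hedged sketch of attaching one of the uploaded adapter checkpoints to the base model with PEFT; the local directory name `checkpoint-1200` comes from this commit's folder layout, and the rest is standard `transformers`/`peft` usage rather than anything confirmed by the card:

```python
# Minimal sketch: load Qwen-VL-Chat and attach this LoRA adapter.
# Assumes the adapter directory has been downloaded locally as "checkpoint-1200".
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", trust_remote_code=True, device_map="auto"
)
model = PeftModel.from_pretrained(base, "checkpoint-1200")
model.eval()
```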
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
+
+ ### Framework versions
+
+ - PEFT 0.10.0
+ - PEFT 0.11.1
checkpoint-1200/adapter_config.json ADDED
@@ -0,0 +1,380 @@
+ {
+   "alpha_pattern": {},
+   "auto_mapping": null,
+   "base_model_name_or_path": "Qwen/Qwen-VL-Chat",
+   "bias": "none",
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 16,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "r": 64,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "transformer.h.16.mlp.w1",
+     "transformer.visual.transformer.resblocks.13.attn.out_proj",
+     "transformer.h.28.mlp.w1",
+     "transformer.h.16.attn.c_attn",
+     "transformer.h.3.mlp.w1",
+     "transformer.visual.transformer.resblocks.29.attn.in_proj",
+     "transformer.visual.transformer.resblocks.19.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.47.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.34.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.4.attn.out_proj",
+     "transformer.h.31.attn.c_attn",
+     "transformer.h.16.mlp.w2",
+     "transformer.visual.transformer.resblocks.5.attn.out_proj",
+     "transformer.h.2.mlp.w1",
+     "transformer.visual.transformer.resblocks.7.attn.in_proj",
+     "transformer.h.20.mlp.w2",
+     "transformer.h.19.mlp.w1",
+     "transformer.visual.transformer.resblocks.18.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.27.attn.out_proj",
+     "transformer.visual.transformer.resblocks.10.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.43.mlp.c_fc",
+     "transformer.h.5.mlp.w1",
+     "transformer.visual.transformer.resblocks.15.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.25.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.10.attn.out_proj",
+     "transformer.visual.transformer.resblocks.4.mlp.c_fc",
+     "transformer.h.31.mlp.w2",
+     "transformer.visual.transformer.resblocks.37.attn.out_proj",
+     "transformer.h.8.attn.c_proj",
+     "transformer.h.29.attn.c_attn",
+     "transformer.visual.transformer.resblocks.24.mlp.c_proj",
+     "transformer.h.19.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.11.attn.out_proj",
+     "transformer.h.13.mlp.c_proj",
+     "transformer.h.27.mlp.c_proj",
+     "transformer.h.31.mlp.w1",
+     "transformer.visual.transformer.resblocks.7.mlp.c_proj",
+     "transformer.h.28.mlp.w2",
+     "transformer.visual.transformer.resblocks.3.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.13.attn.in_proj",
+     "transformer.h.21.attn.c_attn",
+     "transformer.visual.transformer.resblocks.23.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.33.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.42.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.3.attn.in_proj",
+     "transformer.h.13.mlp.w1",
+     "transformer.visual.transformer.resblocks.22.attn.out_proj",
+     "transformer.visual.transformer.resblocks.20.mlp.c_fc",
+     "transformer.h.26.mlp.w2",
+     "transformer.h.14.attn.c_attn",
+     "transformer.h.16.attn.c_proj",
+     "transformer.h.1.mlp.w1",
+     "transformer.visual.transformer.resblocks.21.attn.out_proj",
+     "transformer.visual.transformer.resblocks.39.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.4.attn.in_proj",
+     "transformer.h.29.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.12.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.14.attn.in_proj",
+     "transformer.h.28.attn.c_proj",
+     "transformer.h.18.mlp.w1",
+     "transformer.h.27.mlp.w2",
+     "transformer.h.18.attn.c_attn",
+     "transformer.visual.transformer.resblocks.33.attn.out_proj",
+     "transformer.h.5.mlp.w2",
+     "transformer.visual.transformer.resblocks.37.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.2.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.42.attn.out_proj",
+     "transformer.visual.transformer.resblocks.15.attn.in_proj",
+     "transformer.visual.transformer.resblocks.6.mlp.c_fc",
+     "transformer.h.13.mlp.w2",
+     "transformer.h.23.attn.c_proj",
+     "transformer.h.20.mlp.c_proj",
+     "transformer.h.14.mlp.w2",
+     "transformer.visual.transformer.resblocks.9.attn.in_proj",
+     "transformer.visual.transformer.resblocks.46.attn.in_proj",
+     "transformer.h.9.attn.c_attn",
+     "transformer.visual.transformer.resblocks.36.mlp.c_proj",
+     "transformer.h.31.attn.c_proj",
+     "transformer.visual.transformer.resblocks.19.mlp.c_fc",
+     "transformer.h.17.mlp.w1",
+     "transformer.h.2.attn.c_proj",
+     "transformer.visual.transformer.resblocks.47.attn.in_proj",
+     "transformer.visual.transformer.resblocks.45.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.46.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.27.attn.in_proj",
+     "transformer.visual.transformer.resblocks.26.attn.out_proj",
+     "transformer.h.22.attn.c_proj",
+     "transformer.visual.transformer.resblocks.40.attn.out_proj",
+     "transformer.visual.transformer.resblocks.46.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.18.attn.out_proj",
+     "transformer.h.27.attn.c_proj",
+     "transformer.visual.transformer.resblocks.26.attn.in_proj",
+     "transformer.h.4.mlp.w1",
+     "transformer.h.10.attn.c_proj",
+     "transformer.h.6.attn.c_attn",
+     "transformer.h.2.attn.c_attn",
+     "transformer.h.22.mlp.w1",
+     "transformer.visual.transformer.resblocks.39.mlp.c_fc",
+     "transformer.h.8.mlp.w2",
+     "transformer.h.4.attn.c_attn",
+     "transformer.h.26.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.29.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.5.mlp.c_proj",
+     "transformer.h.11.mlp.c_proj",
+     "transformer.h.0.mlp.w2",
+     "transformer.visual.transformer.resblocks.36.attn.out_proj",
+     "transformer.h.29.mlp.w1",
+     "transformer.h.12.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.2.attn.in_proj",
+     "transformer.visual.transformer.resblocks.2.mlp.c_fc",
+     "transformer.h.25.attn.c_attn",
+     "transformer.visual.transformer.resblocks.19.attn.in_proj",
+     "transformer.visual.transformer.resblocks.43.attn.out_proj",
+     "transformer.visual.transformer.resblocks.35.attn.out_proj",
+     "transformer.h.22.attn.c_attn",
+     "transformer.h.0.mlp.w1",
+     "transformer.h.3.attn.c_attn",
+     "transformer.h.28.attn.c_attn",
+     "transformer.visual.transformer.resblocks.25.attn.in_proj",
+     "transformer.visual.transformer.resblocks.34.attn.out_proj",
+     "transformer.h.21.attn.c_proj",
+     "transformer.h.6.attn.c_proj",
+     "transformer.visual.transformer.resblocks.11.mlp.c_proj",
+     "transformer.h.13.attn.c_attn",
+     "transformer.visual.transformer.resblocks.38.attn.out_proj",
+     "transformer.h.3.attn.c_proj",
+     "transformer.visual.transformer.resblocks.17.mlp.c_fc",
+     "transformer.h.26.mlp.w1",
+     "transformer.visual.transformer.resblocks.36.mlp.c_fc",
+     "transformer.h.26.attn.c_attn",
+     "transformer.visual.transformer.resblocks.29.attn.out_proj",
+     "transformer.h.7.mlp.w1",
+     "transformer.visual.transformer.resblocks.40.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.9.attn.out_proj",
+     "transformer.h.3.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.26.mlp.c_fc",
+     "transformer.h.11.mlp.w2",
+     "transformer.visual.transformer.resblocks.33.attn.in_proj",
+     "transformer.visual.transformer.resblocks.42.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.32.attn.out_proj",
+     "transformer.h.4.attn.c_proj",
+     "transformer.visual.transformer.resblocks.27.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.11.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.25.attn.out_proj",
+     "transformer.visual.transformer.resblocks.23.attn.in_proj",
+     "transformer.h.5.attn.c_attn",
+     "transformer.h.16.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.14.mlp.c_proj",
+     "transformer.h.22.mlp.w2",
+     "transformer.h.25.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.10.mlp.c_fc",
+     "transformer.h.24.mlp.c_proj",
+     "transformer.h.19.mlp.w2",
+     "transformer.h.14.mlp.w1",
+     "transformer.visual.transformer.resblocks.40.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.28.attn.out_proj",
+     "transformer.visual.transformer.resblocks.24.mlp.c_fc",
+     "transformer.h.8.attn.c_attn",
+     "transformer.h.9.mlp.w1",
+     "transformer.h.6.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.19.attn.out_proj",
+     "transformer.visual.transformer.resblocks.32.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.7.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.44.attn.in_proj",
+     "transformer.visual.transformer.resblocks.34.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.9.mlp.c_fc",
+     "transformer.visual.conv1",
+     "transformer.visual.transformer.resblocks.8.attn.out_proj",
+     "transformer.h.23.mlp.w2",
+     "transformer.h.7.mlp.w2",
+     "transformer.h.24.attn.c_proj",
+     "transformer.h.30.attn.c_proj",
+     "transformer.h.29.attn.c_proj",
+     "transformer.visual.transformer.resblocks.9.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.35.attn.in_proj",
+     "transformer.visual.transformer.resblocks.21.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.41.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.38.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.13.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.41.attn.out_proj",
+     "transformer.visual.transformer.resblocks.16.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.45.attn.out_proj",
+     "transformer.h.11.mlp.w1",
+     "transformer.visual.transformer.resblocks.16.attn.in_proj",
+     "transformer.visual.transformer.resblocks.47.attn.out_proj",
+     "transformer.h.9.attn.c_proj",
+     "transformer.h.31.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.12.attn.in_proj",
+     "transformer.visual.transformer.resblocks.28.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.20.attn.out_proj",
+     "transformer.h.12.attn.c_attn",
+     "transformer.h.24.mlp.w1",
+     "transformer.visual.transformer.resblocks.21.attn.in_proj",
+     "transformer.visual.transformer.resblocks.41.attn.in_proj",
+     "transformer.h.10.mlp.w1",
+     "transformer.h.1.mlp.w2",
+     "transformer.h.0.mlp.c_proj",
+     "transformer.h.22.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.18.attn.in_proj",
+     "transformer.visual.transformer.resblocks.38.mlp.c_proj",
+     "transformer.h.12.mlp.w1",
+     "transformer.h.1.attn.c_attn",
+     "transformer.visual.transformer.resblocks.31.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.44.mlp.c_proj",
+     "transformer.h.15.mlp.c_proj",
+     "transformer.h.6.mlp.w1",
+     "transformer.visual.transformer.resblocks.16.mlp.c_proj",
+     "transformer.h.13.attn.c_proj",
+     "transformer.h.15.attn.c_attn",
+     "transformer.h.15.mlp.w1",
+     "transformer.h.17.mlp.w2",
+     "transformer.visual.transformer.resblocks.10.attn.in_proj",
+     "transformer.h.26.attn.c_proj",
+     "transformer.visual.transformer.resblocks.20.attn.in_proj",
+     "transformer.h.10.mlp.w2",
+     "transformer.h.24.attn.c_attn",
+     "transformer.h.8.mlp.w1",
+     "transformer.h.23.mlp.w1",
+     "transformer.visual.transformer.resblocks.1.mlp.c_proj",
+     "transformer.h.4.mlp.w2",
+     "transformer.visual.transformer.resblocks.38.attn.in_proj",
+     "transformer.h.12.mlp.w2",
+     "transformer.h.7.attn.c_proj",
+     "transformer.h.4.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.31.attn.out_proj",
+     "transformer.visual.transformer.resblocks.17.mlp.c_proj",
+     "transformer.h.21.mlp.w2",
+     "transformer.visual.transformer.resblocks.5.attn.in_proj",
+     "transformer.h.18.attn.c_proj",
+     "transformer.visual.transformer.resblocks.31.mlp.c_fc",
+     "transformer.h.18.mlp.w2",
+     "transformer.visual.transformer.resblocks.6.attn.out_proj",
+     "transformer.visual.transformer.resblocks.8.attn.in_proj",
+     "transformer.visual.transformer.resblocks.30.mlp.c_proj",
+     "transformer.h.30.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.30.attn.out_proj",
+     "transformer.visual.transformer.resblocks.16.attn.out_proj",
+     "transformer.visual.transformer.resblocks.14.attn.out_proj",
+     "transformer.h.25.mlp.w1",
+     "transformer.visual.transformer.resblocks.45.attn.in_proj",
+     "transformer.h.11.attn.c_proj",
+     "transformer.visual.transformer.resblocks.30.attn.in_proj",
+     "transformer.visual.transformer.resblocks.43.mlp.c_proj",
+     "transformer.h.10.mlp.c_proj",
+     "transformer.h.21.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.43.attn.in_proj",
+     "transformer.visual.transformer.resblocks.3.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.44.attn.out_proj",
+     "transformer.h.23.attn.c_attn",
+     "transformer.visual.transformer.resblocks.22.attn.in_proj",
+     "transformer.visual.transformer.resblocks.6.attn.in_proj",
+     "transformer.visual.transformer.resblocks.44.mlp.c_fc",
+     "transformer.h.17.attn.c_attn",
+     "transformer.h.7.attn.c_attn",
+     "transformer.visual.transformer.resblocks.42.attn.in_proj",
+     "transformer.visual.transformer.resblocks.20.mlp.c_proj",
+     "transformer.h.8.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.17.attn.out_proj",
+     "transformer.h.14.attn.c_proj",
+     "transformer.visual.transformer.resblocks.40.attn.in_proj",
+     "transformer.h.25.attn.c_proj",
+     "transformer.h.28.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.35.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.36.attn.in_proj",
+     "transformer.visual.transformer.resblocks.41.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.14.mlp.c_fc",
+     "transformer.h.30.mlp.w2",
+     "transformer.h.20.mlp.w1",
+     "transformer.visual.transformer.resblocks.33.mlp.c_fc",
+     "transformer.h.29.mlp.w2",
+     "transformer.visual.transformer.resblocks.47.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.30.mlp.c_fc",
+     "transformer.h.10.attn.c_attn",
+     "transformer.visual.transformer.resblocks.1.attn.in_proj",
+     "transformer.h.1.attn.c_proj",
+     "transformer.visual.transformer.resblocks.8.mlp.c_proj",
+     "transformer.h.19.attn.c_proj",
+     "transformer.visual.transformer.resblocks.37.attn.in_proj",
+     "transformer.h.15.attn.c_proj",
+     "transformer.h.5.attn.c_proj",
+     "transformer.visual.transformer.resblocks.32.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.3.attn.out_proj",
+     "transformer.visual.transformer.resblocks.32.attn.in_proj",
+     "transformer.h.21.mlp.w1",
+     "transformer.h.23.mlp.c_proj",
+     "transformer.h.30.mlp.w1",
+     "transformer.h.0.attn.c_attn",
+     "transformer.visual.transformer.resblocks.24.attn.out_proj",
+     "transformer.visual.transformer.resblocks.31.attn.in_proj",
+     "transformer.h.18.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.25.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.22.mlp.c_fc",
+     "transformer.h.30.attn.c_attn",
+     "transformer.visual.transformer.resblocks.13.mlp.c_fc",
+     "transformer.h.17.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.24.attn.in_proj",
+     "transformer.h.11.attn.c_attn",
+     "transformer.h.2.mlp.w2",
+     "transformer.visual.transformer.resblocks.8.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.0.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.2.attn.out_proj",
+     "transformer.visual.transformer.resblocks.35.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.39.attn.out_proj",
+     "transformer.h.12.attn.c_proj",
+     "transformer.visual.transformer.resblocks.28.attn.in_proj",
+     "transformer.visual.transformer.resblocks.29.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.0.attn.out_proj",
+     "transformer.visual.transformer.resblocks.23.mlp.c_proj",
+     "transformer.h.20.attn.c_attn",
+     "transformer.visual.transformer.resblocks.7.attn.out_proj",
+     "transformer.visual.transformer.resblocks.15.attn.out_proj",
+     "transformer.h.7.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.1.attn.out_proj",
+     "transformer.h.3.mlp.w2",
+     "transformer.h.9.mlp.w2",
+     "transformer.visual.transformer.resblocks.34.attn.in_proj",
+     "transformer.h.27.attn.c_attn",
+     "transformer.visual.transformer.resblocks.12.mlp.c_fc",
+     "transformer.h.6.mlp.w2",
+     "transformer.visual.transformer.resblocks.39.attn.in_proj",
+     "transformer.h.15.mlp.w2",
+     "transformer.visual.transformer.resblocks.18.mlp.c_proj",
+     "transformer.h.0.attn.c_proj",
+     "transformer.h.19.attn.c_attn",
+     "transformer.visual.transformer.resblocks.27.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.23.attn.out_proj",
+     "transformer.h.14.mlp.c_proj",
+     "transformer.h.9.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.12.attn.out_proj",
+     "transformer.visual.transformer.resblocks.0.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.5.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.28.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.6.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.22.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.37.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.17.attn.in_proj",
+     "transformer.visual.transformer.resblocks.46.attn.out_proj",
+     "transformer.h.24.mlp.w2",
+     "transformer.h.27.mlp.w1",
+     "transformer.visual.transformer.resblocks.11.attn.in_proj",
+     "transformer.visual.transformer.resblocks.4.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.21.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.26.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.15.mlp.c_fc",
+     "transformer.h.2.mlp.c_proj",
+     "transformer.h.1.mlp.c_proj",
+     "transformer.h.5.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.45.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.0.attn.in_proj",
+     "transformer.h.25.mlp.w2",
+     "transformer.h.20.attn.c_proj",
+     "transformer.h.17.attn.c_proj",
+     "transformer.visual.transformer.resblocks.1.mlp.c_fc"
+   ],
+   "task_type": "CAUSAL_LM",
+   "use_dora": false,
+   "use_rslora": false
+ }
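The config above describes a LoRA adapter (r=64, alpha=16, dropout 0.05, bias "none") over roughly 350 attention and MLP projections in both the language tower (`transformer.h.*`) and the vision tower (`transformer.visual.*`), plus the vision `conv1`. For orientation, a hedged sketch of expressing the same settings as a `peft.LoraConfig`; the module list is abbreviated here, and in practice you would pass the full list from this file:

```python
# Sketch only: mirror this adapter_config.json as a peft.LoraConfig.
# target_modules is truncated here; the real config enumerates ~350 entries.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "transformer.h.16.mlp.w1",
        "transformer.visual.transformer.resblocks.13.attn.out_proj",
        # ... remaining modules exactly as listed above ...
    ],
)
```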
checkpoint-1200/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:25433b71645e2f2f21acef45e1cd3dd51471fc7d7d8cbcfa08984f46e78ae8ab
+ size 469105640
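`adapter_model.safetensors` is stored as a Git LFS pointer: the repository records only the spec version, the sha256 object id, and the byte size (~469 MB), while the blob itself lives in LFS storage. A hedged sketch of checking a downloaded copy against the recorded oid (the local path is an assumption):

```python
# Sketch: verify a downloaded LFS object against the pointer's sha256 oid.
import hashlib

# Assumed local path; adjust to wherever the file was downloaded.
with open("checkpoint-1200/adapter_model.safetensors", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

assert digest == "25433b71645e2f2f21acef45e1cd3dd51471fc7d7d8cbcfa08984f46e78ae8ab"
```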
checkpoint-1200/latest ADDED
@@ -0,0 +1 @@
+ global_step1200
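`latest` is DeepSpeed's checkpoint tag file: it names the subdirectory (`global_step1200`) holding the sharded ZeRO states, and the bundled `zero_to_fp32.py` reads it to consolidate those shards. A hedged sketch of recovering a single fp32 state dict with the script's helper (the working directory and checkpoint path are assumptions, and details vary with the DeepSpeed version that produced `zero_to_fp32.py`):

```python
# Sketch: consolidate the sharded ZeRO checkpoint into one fp32 state dict.
# Assumes you run from a directory where the bundled zero_to_fp32.py is importable;
# with tag=None the helper reads the `latest` file (here: global_step1200) itself.
from zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint("checkpoint-1200")
```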
checkpoint-1200/qwen.tiktoken ADDED
The diff for this file is too large to render. See raw diff
 
checkpoint-1200/rng_state_0.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d0e05d703defebb48cb1ce8c7911952ccae578d1a7947d21425f3ff731f0503c
+ size 15920
checkpoint-1200/rng_state_1.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:84b1cb9d1609ea4d4950ef70e57b5a4c92bd381b97235bb8f28c84dd2d1c8b9f
+ size 15920
checkpoint-1200/rng_state_2.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:856080e41a8ab6aae185d671f94419379f1fa3fb0f0e7be7beacb1f897ff85b1
+ size 15920
checkpoint-1200/rng_state_3.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bf81d56a0cdea27b5e3e9186c6df18ce9e3f7be5271892df15accb3df0e0c218
+ size 15920
checkpoint-1200/rng_state_4.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:11b4622ea11d41a3e43b7c396b97a48c41e47f53cd9ee003472fe4ed7d8bcfd6
+ size 15920
checkpoint-1200/rng_state_5.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4ed6bae991517d1ab99fa861cfc1756d30b51a35dccf81c79c6476ebed2ddd93
+ size 15920
checkpoint-1200/rng_state_6.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:27c9309aec78b496fd1d73ec24a274926f4b1442325c3303730b620697588e2e
+ size 15920
checkpoint-1200/rng_state_7.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:82bc476f5997e3852636a20556416202397a1d429d441c40112a9011e79ef517
+ size 15920
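The eight `rng_state_{0..7}.pth` files are per-rank RNG snapshots saved by `transformers.Trainer` so that an 8-GPU run can resume with reproducible data ordering and dropout. A hedged sketch of inspecting one; the key names follow what recent Trainer versions store and may differ across versions:

```python
# Sketch: inspect one per-rank RNG snapshot saved by the Trainer.
import torch

state = torch.load("checkpoint-1200/rng_state_0.pth")
# Typically a dict along the lines of {'python': ..., 'numpy': ..., 'cpu': ..., 'cuda': ...}
print(sorted(state.keys()))
```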
checkpoint-1200/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f4585a0555f6a1312741348f75004d3499afabae4ab299739739d92b9544be0c
+ size 1064
checkpoint-1200/special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "pad_token": "<|endoftext|>"
+ }
checkpoint-1200/tokenization_qwen.py ADDED
@@ -0,0 +1,598 @@
+ # Copyright (c) Alibaba Cloud.
+ #
+ # This source code is licensed under the license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """Tokenization classes for QWen."""
+
+ import base64
+ import logging
+ import os
+ import requests
+ import unicodedata
+ from typing import Collection, Dict, List, Set, Tuple, Union, Any, Callable, Optional
+
+ import tiktoken
+ import numpy as np
+ from PIL import Image
+ from PIL import ImageFont
+ from PIL import ImageDraw
+ from transformers import PreTrainedTokenizer, AddedToken
+ from transformers.utils import try_to_load_from_cache
+
+ import matplotlib.colors as mcolors
+ from matplotlib.font_manager import FontProperties
+
+ logger = logging.getLogger(__name__)
+
+
+ VOCAB_FILES_NAMES = {"vocab_file": "qwen.tiktoken", "ttf": "SimSun.ttf"}
+ FONT_PATH = try_to_load_from_cache("Qwen/Qwen-VL-Chat", "SimSun.ttf")
+ if FONT_PATH is None:
+     if not os.path.exists("SimSun.ttf"):
+         ttf = requests.get("https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/SimSun.ttf")
+         with open("SimSun.ttf", "wb") as f:
+             f.write(ttf.content)
+     FONT_PATH = "SimSun.ttf"
+
+ PAT_STR = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
+ ENDOFTEXT = "<|endoftext|>"
+ IMSTART = "<|im_start|>"
+ IMEND = "<|im_end|>"
+ # as the default behavior is changed to allow special tokens in
+ # regular texts, the surface forms of special tokens need to be
+ # as different as possible to minimize the impact
+ EXTRAS = tuple((f"<|extra_{i}|>" for i in range(205)))
+ SPECIAL_TOKENS = (
+     ENDOFTEXT,
+     IMSTART,
+     IMEND,
+ ) + EXTRAS
+ IMG_TOKEN_SPAN = 256
+
+
+ def _load_tiktoken_bpe(tiktoken_bpe_file: str) -> Dict[bytes, int]:
+     with open(tiktoken_bpe_file, "rb") as f:
+         contents = f.read()
+     return {
+         base64.b64decode(token): int(rank)
+         for token, rank in (line.split() for line in contents.splitlines() if line)
+     }
+
+ def _list_find(
+     input_list: List[Any],
+     candidates: Tuple[Any],
+     start: int = 0,
+ ):
+     for i in range(start, len(input_list)):
+         if input_list[i] in candidates:
+             return i
+     return -1
+
+ def _replace_closed_tag(
+     input_tokens: List[Any],
+     start_tags: Union[Any, Tuple[Any]],
+     end_tags: Union[Any, Tuple[Any]],
+     inclusive_replace_func: Callable,
+     exclusive_replace_func: Callable = lambda x: x,
+ ):
+     if isinstance(start_tags, (str, int)):
+         start_tags = (start_tags,)
+     if isinstance(end_tags, (str, int)):
+         end_tags = (end_tags,)
+     assert len(start_tags) == len(end_tags)
+
+     output_tokens = []
+     end = 0
+     while True:
+         start = _list_find(input_tokens, start_tags, end)
+         if start == -1:
+             break
+         output_tokens.extend(exclusive_replace_func(input_tokens[end : start]))
+         tag_idx = start_tags.index(input_tokens[start])
+         end = _list_find(input_tokens, (end_tags[tag_idx],), start)
+         if end == -1:
+             raise ValueError("Unclosed image token")
+         output_tokens.extend(inclusive_replace_func(input_tokens[start : end + 1]))
+         end += 1
+     output_tokens.extend(exclusive_replace_func(input_tokens[end : ]))
+     return output_tokens
+
+ class QWenTokenizer(PreTrainedTokenizer):
+     """QWen tokenizer."""
+
+     vocab_files_names = VOCAB_FILES_NAMES
+
+     def __init__(
+         self,
+         vocab_file,
+         errors="replace",
+         image_start_tag='<img>',
+         image_end_tag='</img>',
+         image_pad_tag='<imgpad>',
+         ref_start_tag='<ref>',
+         ref_end_tag='</ref>',
+         box_start_tag='<box>',
+         box_end_tag='</box>',
+         quad_start_tag='<quad>',
+         quad_end_tag='</quad>',
+         **kwargs,
+     ):
+         self.image_start_tag = image_start_tag
+         self.image_end_tag = image_end_tag
+         self.image_pad_tag = image_pad_tag
+         self.ref_start_tag = ref_start_tag
+         self.ref_end_tag = ref_end_tag
+         self.box_start_tag = box_start_tag
+         self.box_end_tag = box_end_tag
+         self.quad_start_tag = quad_start_tag
+         self.quad_end_tag = quad_end_tag
+         self.IMAGE_ST = (
+             ref_start_tag, ref_end_tag,
+             box_start_tag, box_end_tag,
+             quad_start_tag, quad_end_tag,
+             image_start_tag, image_end_tag,
+             image_pad_tag
+         )
+         super().__init__(**kwargs)
+
+         self.errors = errors  # how to handle errors in decoding
+
+         self.mergeable_ranks = _load_tiktoken_bpe(vocab_file)  # type: dict[bytes, int]
+         self.special_tokens = {
+             token: index
+             for index, token in enumerate(
+                 SPECIAL_TOKENS + self.IMAGE_ST, start=len(self.mergeable_ranks)
+             )
+         }
+         self.img_start_id = self.special_tokens[self.image_start_tag]
+         self.img_end_id = self.special_tokens[self.image_end_tag]
+         self.img_pad_id = self.special_tokens[self.image_pad_tag]
+         self.ref_start_id = self.special_tokens[self.ref_start_tag]
+         self.ref_end_id = self.special_tokens[self.ref_end_tag]
+         self.box_start_id = self.special_tokens[self.box_start_tag]
+         self.box_end_id = self.special_tokens[self.box_end_tag]
+         self.quad_start_id = self.special_tokens[self.quad_start_tag]
+         self.quad_end_id = self.special_tokens[self.quad_end_tag]
+         self.image_special_tokens = set([
+             self.ref_start_id, self.ref_end_id, self.box_start_id, self.box_end_id,
+             self.quad_start_id, self.quad_end_id,
+         ])
+
+         enc = tiktoken.Encoding(
+             "Qwen",
+             pat_str=PAT_STR,
+             mergeable_ranks=self.mergeable_ranks,
+             special_tokens=self.special_tokens,
+         )
+         assert (
+             len(self.mergeable_ranks) + len(self.special_tokens) == enc.n_vocab
+         ), f"{len(self.mergeable_ranks) + len(self.special_tokens)} != {enc.n_vocab} in encoding"
+
+         self.decoder = {
+             v: k for k, v in self.mergeable_ranks.items()
+         }  # type: dict[int, bytes|str]
+         self.decoder.update({v: k for k, v in self.special_tokens.items()})
+
+         self.tokenizer = enc  # type: tiktoken.Encoding
+
+         self.eod_id = self.tokenizer.eot_token
+         self.im_start_id = self.special_tokens[IMSTART]
+         self.im_end_id = self.special_tokens[IMEND]
+
+     def __getstate__(self):
+         # for pickle lovers
+         state = self.__dict__.copy()
+         del state['tokenizer']
+         return state
+
+     def __setstate__(self, state):
+         # tokenizer is not python native; don't pass it; rebuild it
+         self.__dict__.update(state)
+         enc = tiktoken.Encoding(
+             "Qwen",
+             pat_str=PAT_STR,
+             mergeable_ranks=self.mergeable_ranks,
+             special_tokens=self.special_tokens,
+         )
+         self.tokenizer = enc
+
+
+     def __len__(self) -> int:
+         return self.tokenizer.n_vocab
+
+     def get_vocab(self) -> Dict[bytes, int]:
+         return self.mergeable_ranks
+
+     def convert_tokens_to_ids(
+         self, tokens: Union[bytes, str, List[Union[bytes, str]]]
+     ) -> List[int]:
+         ids = []
+         if isinstance(tokens, (str, bytes)):
+             if tokens in self.special_tokens:
+                 return self.special_tokens[tokens]
+             else:
+                 return self.mergeable_ranks.get(tokens)
+         for token in tokens:
+             if token in self.special_tokens:
+                 ids.append(self.special_tokens[token])
+             else:
+                 ids.append(self.mergeable_ranks.get(token))
+         return ids
+
+     def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_tokens: bool = False) -> int:
+         if not special_tokens and new_tokens:
+             raise ValueError('Adding regular tokens is not supported')
+         for token in new_tokens:
+             surface_form = token.content if isinstance(token, AddedToken) else token
+             if surface_form not in SPECIAL_TOKENS + self.IMAGE_ST:
+                 raise ValueError('Adding unknown special tokens is not supported')
+         return 0
+
+     def save_vocabulary(self, save_directory: str, **kwargs) -> Tuple[str]:
+         """
+         Save only the vocabulary of the tokenizer (the BPE ranks).
+
+         Returns:
+             `Tuple(str)`: Paths to the files saved.
+         """
+         file_path = os.path.join(save_directory, "qwen.tiktoken")
+         with open(file_path, "w", encoding="utf8") as w:
+             for k, v in self.mergeable_ranks.items():
+                 line = base64.b64encode(k).decode("utf8") + " " + str(v) + "\n"
+                 w.write(line)
+         return (file_path,)
+
+     def tokenize(
+         self,
+         text: str,
+         allowed_special: Union[Set, str] = "all",
+         disallowed_special: Union[Collection, str] = (),
+         **kwargs,
+     ) -> List[Union[bytes, str]]:
+         """
+         Converts a string into a sequence of tokens.
+
+         Args:
+             text (`str`):
+                 The sequence to be encoded.
+             allowed_special (`Literal["all"]` or `set`):
+                 The surface forms of the tokens to be encoded as special tokens in regular texts.
+                 Default to "all".
+             disallowed_special (`Literal["all"]` or `Collection`):
+                 The surface forms of the tokens that should not be in regular texts and trigger errors.
+                 Default to an empty tuple.
+
+             kwargs (additional keyword arguments, *optional*):
+                 Will be passed to the underlying model specific encode method.
+
+         Returns:
+             `List[bytes|str]`: The list of tokens.
+         """
+         tokens = []
+         text = unicodedata.normalize("NFC", text)
+
+         # this implementation takes a detour: text -> token id -> token surface forms
+         for t in self.tokenizer.encode(
+             text, allowed_special=allowed_special, disallowed_special=disallowed_special
+         ):
+             tokens.append(self.decoder[t])
+
+         def _encode_imgurl(img_tokens):
+             assert img_tokens[0] == self.image_start_tag and img_tokens[-1] == self.image_end_tag
+             img_tokens = img_tokens[1:-1]
+             img_url = b''.join(img_tokens)
+             out_img_tokens = list(map(self.decoder.get, img_url))
+             if len(out_img_tokens) > IMG_TOKEN_SPAN:
+                 raise ValueError("The content in {}..{} is too long".format(
+                     self.image_start_tag, self.image_end_tag))
+             out_img_tokens.extend([self.image_pad_tag] * (IMG_TOKEN_SPAN - len(out_img_tokens)))
+             out_img_tokens = [self.image_start_tag] + out_img_tokens + [self.image_end_tag]
+             return out_img_tokens
+
+         return _replace_closed_tag(tokens, self.image_start_tag, self.image_end_tag, _encode_imgurl)
+
+     def convert_tokens_to_string(self, tokens: List[Union[bytes, str]]) -> str:
+         """
+         Converts a sequence of tokens into a single string.
+         """
+         text = ""
+         temp = b""
+         for t in tokens:
+             if isinstance(t, str):
+                 if temp:
+                     text += temp.decode("utf-8", errors=self.errors)
+                     temp = b""
+                 text += t
+             elif isinstance(t, bytes):
+                 temp += t
+             else:
+                 raise TypeError("token should only be of type bytes or str")
+         if temp:
+             text += temp.decode("utf-8", errors=self.errors)
+         return text
+
+     @property
+     def vocab_size(self):
+         return self.tokenizer.n_vocab
+
+     def _convert_id_to_token(self, index: int) -> Union[bytes, str]:
+         """Converts an id to a token, special tokens included"""
+         if index in self.decoder:
+             return self.decoder[index]
+         raise ValueError("unknown ids")
+
+     def _convert_token_to_id(self, token: Union[bytes, str]) -> int:
+         """Converts a token to an id using the vocab, special tokens included"""
+         if token in self.special_tokens:
+             return self.special_tokens[token]
+         if token in self.mergeable_ranks:
+             return self.mergeable_ranks[token]
+         raise ValueError("unknown token")
+
+     def _tokenize(self, text: str, **kwargs):
+         """
+         Converts a string into a sequence of tokens (string), using the tokenizer. Split in words for word-based
+         vocabulary or sub-words for sub-word-based vocabularies (BPE/SentencePieces/WordPieces).
+
+         Do NOT take care of added tokens.
+         """
+         raise NotImplementedError
+
+     def _decode(
+         self,
+         token_ids: Union[int, List[int]],
+         skip_special_tokens: bool = False,
+         errors: str = None,
+         **kwargs,
+     ) -> str:
+         if isinstance(token_ids, int):
+             token_ids = [token_ids]
+
+         def _decode_imgurl(img_token_ids):
+             assert img_token_ids[0] == self.img_start_id and img_token_ids[-1] == self.img_end_id
+             img_token_ids = img_token_ids[1:-1]
+             img_token_ids = img_token_ids[ : img_token_ids.index(self.img_pad_id)]
+             img_url = bytes(img_token_ids).decode('utf-8')
+             return [self.img_start_id] + self.tokenizer.encode(img_url) + [self.img_end_id]
+
+         token_ids = _replace_closed_tag(token_ids, self.img_start_id, self.img_end_id, _decode_imgurl)
+
+         if skip_special_tokens:
+             if kwargs.get('keep_image_special', False):
+                 token_ids = [i for i in token_ids if i < self.eod_id
+                              or i in self.image_special_tokens]
+             else:
+                 token_ids = [i for i in token_ids if i < self.eod_id]
+         return self.tokenizer.decode(token_ids, errors=errors or self.errors)
+
+     def to_list_format(self, text: str):
+         text = unicodedata.normalize("NFC", text)
+         token_ids = self.tokenizer.encode(
+             text, allowed_special=set(self.IMAGE_ST + (ENDOFTEXT,)))
+
+         def _encode_vl_info(tokens):
+             if len(tokens) == 0:
+                 return []
+             if tokens[0] == self.img_start_id and tokens[-1] == self.img_end_id:
+                 key = 'image'
+             elif tokens[0] == self.ref_start_id and tokens[-1] == self.ref_end_id:
+                 key = 'ref'
+             elif tokens[0] == self.box_start_id and tokens[-1] == self.box_end_id:
+                 key = 'box'
+             elif tokens[0] == self.quad_start_id and tokens[-1] == self.quad_end_id:
+                 key = 'quad'
+             else:
+                 _tobytes = lambda x: x.encode('utf-8') if isinstance(x, str) else x
+                 return [{'text': b''.join(map(_tobytes, map(self.decoder.get, tokens))).decode('utf-8')}]
+             _tobytes = lambda x: x.encode('utf-8') if isinstance(x, str) else x
+             val = b''.join(map(_tobytes, map(self.decoder.get, tokens[1:-1]))).decode('utf-8')
+             return [{key: val}]
+
+         return _replace_closed_tag(
+             token_ids,
+             (self.img_start_id, self.ref_start_id, self.box_start_id, self.quad_start_id),
+             (self.img_end_id, self.ref_end_id, self.box_end_id, self.quad_end_id),
+             _encode_vl_info,
+             _encode_vl_info,
+         )
+
+     def from_list_format(self, list_format: List[Dict]):
+         text = ''
+         num_images = 0
+         for ele in list_format:
+             if 'image' in ele:
+                 num_images += 1
+                 text += f'Picture {num_images}: '
+                 text += self.image_start_tag + ele['image'] + self.image_end_tag
+                 text += '\n'
+             elif 'text' in ele:
+                 text += ele['text']
+             elif 'box' in ele:
+                 if 'ref' in ele:
+                     text += self.ref_start_tag + ele['ref'] + self.ref_end_tag
+                 for box in ele['box']:
+                     text += self.box_start_tag + '(%d,%d),(%d,%d)' % (box[0], box[1], box[2], box[3]) + self.box_end_tag
+             else:
+                 raise ValueError("Unsupported element: " + str(ele))
+         return text
+
+     def _fetch_latest_picture(self, response, history):
+         if history is None:
+             history = []
+         _history = history + [(response, None)]
+         for q, r in _history[::-1]:
+             for ele in self.to_list_format(q)[::-1]:
+                 if 'image' in ele:
+                     return ele['image']
+         return None
+
+     def _fetch_all_box_with_ref(self, text):
+         list_format = self.to_list_format(text)
+         output = []
+         for i, ele in enumerate(list_format):
+             if 'box' in ele:
+                 bbox = tuple(map(int, ele['box'].replace('(', '').replace(')', '').split(',')))
+                 assert len(bbox) == 4
+                 output.append({'box': bbox})
+                 if i > 0 and 'ref' in list_format[i-1]:
+                     output[-1]['ref'] = list_format[i-1]['ref'].strip()
+         return output
+
+     def draw_bbox_on_latest_picture(
+         self,
+         response,
+         history=None,
+     ) -> Optional[Image.Image]:
+         image = self._fetch_latest_picture(response, history)
+         if image is None:
+             return None
+         if image.startswith("http://") or image.startswith("https://"):
+             image = Image.open(requests.get(image, stream=True).raw).convert("RGB")
+             h, w = image.height, image.width
+         else:
+             image = np.asarray(Image.open(image).convert("RGB"))
+             h, w = image.shape[0], image.shape[1]
+         visualizer = Visualizer(image)
+
+         boxes = self._fetch_all_box_with_ref(response)
+         if not boxes:
+             return None
+         color = random.choice([_ for _ in mcolors.TABLEAU_COLORS.keys()])  # init color
+         for box in boxes:
+             if 'ref' in box:  # random new color for new refexps
+                 color = random.choice([_ for _ in mcolors.TABLEAU_COLORS.keys()])
+             x1, y1, x2, y2 = box['box']
+             x1, y1, x2, y2 = (int(x1 / 1000 * w), int(y1 / 1000 * h), int(x2 / 1000 * w), int(y2 / 1000 * h))
+             visualizer.draw_box((x1, y1, x2, y2), alpha=1, edge_color=color)
+             if 'ref' in box:
+                 visualizer.draw_text(box['ref'], (x1, y1), color=color, horizontal_alignment="left")
+         return visualizer.output
+
+
+ # The imports below support the Visualizer helpers used by draw_bbox_on_latest_picture;
+ # they execute at module import time, so `random` above resolves correctly.
+ import colorsys
+ import logging
+ import math
+ import numpy as np
+ import matplotlib as mpl
+ import matplotlib.colors as mplc
+ import matplotlib.figure as mplfigure
+ import torch
+ from matplotlib.backends.backend_agg import FigureCanvasAgg
+ from PIL import Image
+ import random
+
+ logger = logging.getLogger(__name__)
+
+
+ class VisImage:
+     def __init__(self, img, scale=1.0):
+         self.img = img
+         self.scale = scale
+         self.width, self.height = img.shape[1], img.shape[0]
+         self._setup_figure(img)
+
+     def _setup_figure(self, img):
+         fig = mplfigure.Figure(frameon=False)
+         self.dpi = fig.get_dpi()
+         # add a small 1e-2 to avoid precision lost due to matplotlib's truncation
+         # (https://github.com/matplotlib/matplotlib/issues/15363)
+         fig.set_size_inches(
+             (self.width * self.scale + 1e-2) / self.dpi,
+             (self.height * self.scale + 1e-2) / self.dpi,
+         )
+         self.canvas = FigureCanvasAgg(fig)
+         # self.canvas = mpl.backends.backend_cairo.FigureCanvasCairo(fig)
+         ax = fig.add_axes([0.0, 0.0, 1.0, 1.0])
+         ax.axis("off")
+         self.fig = fig
+         self.ax = ax
+         self.reset_image(img)
+
+     def reset_image(self, img):
+         img = img.astype("uint8")
+         self.ax.imshow(img, extent=(0, self.width, self.height, 0), interpolation="nearest")
+
+     def save(self, filepath):
+         self.fig.savefig(filepath)
+
+     def get_image(self):
+         canvas = self.canvas
+         s, (width, height) = canvas.print_to_buffer()
+
+         buffer = np.frombuffer(s, dtype="uint8")
+
+         img_rgba = buffer.reshape(height, width, 4)
+         rgb, alpha = np.split(img_rgba, [3], axis=2)
+         return rgb.astype("uint8")
+
+
+ class Visualizer:
+     def __init__(self, img_rgb, metadata=None, scale=1.0):
+         self.img = np.asarray(img_rgb).clip(0, 255).astype(np.uint8)
+         self.font_path = FONT_PATH
+         self.output = VisImage(self.img, scale=scale)
+         self.cpu_device = torch.device("cpu")
+
+         # too small texts are useless, therefore clamp to a minimum font size
+         self._default_font_size = max(
+             np.sqrt(self.output.height * self.output.width) // 30, 15 // scale
+         )
+
+     def draw_text(
+         self,
+         text,
+         position,
+         *,
+         font_size=None,
+         color="g",
+         horizontal_alignment="center",
+         rotation=0,
+     ):
+         if not font_size:
+             font_size = self._default_font_size
+
+         # since the text background is dark, we don't want the text to be dark
+         color = np.maximum(list(mplc.to_rgb(color)), 0.2)
+         color[np.argmax(color)] = max(0.8, np.max(color))
+
+         x, y = position
+         self.output.ax.text(
+             x,
+             y,
+             text,
+             size=font_size * self.output.scale,
+             fontproperties=FontProperties(fname=self.font_path),
+             bbox={"facecolor": "black", "alpha": 0.8, "pad": 0.7, "edgecolor": "none"},
+             verticalalignment="top",
+             horizontalalignment=horizontal_alignment,
+             color=color,
+             zorder=10,
+             rotation=rotation,
+         )
+         return self.output
+
+     def draw_box(self, box_coord, alpha=0.5, edge_color="g", line_style="-"):
+
+         x0, y0, x1, y1 = box_coord
+         width = x1 - x0
+         height = y1 - y0
+
+         linewidth = max(self._default_font_size / 4, 1)
+
+         self.output.ax.add_patch(
+             mpl.patches.Rectangle(
+                 (x0, y0),
+                 width,
+                 height,
+                 fill=False,
+                 edgecolor=edge_color,
+                 linewidth=linewidth * self.output.scale,
+                 alpha=alpha,
+                 linestyle=line_style,
+             )
+         )
+         return self.output
+
+     def get_output(self):
+
+         return self.output
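For orientation, a hedged usage sketch of the multimodal helpers defined above; the image URL and prompt are placeholders, not content from this repository:

```python
# Sketch: exercise QWenTokenizer's list-format helpers.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)

# from_list_format interleaves <img>...</img> spans with plain text.
query = tokenizer.from_list_format([
    {"image": "https://example.com/demo.jpeg"},  # placeholder URL
    {"text": "What is in this picture?"},
])

# to_list_format parses such a string back into structured elements.
print(tokenizer.to_list_format(query))
```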
checkpoint-1200/tokenizer_config.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "added_tokens_decoder": {},
+   "auto_map": {
+     "AutoTokenizer": [
+       "Qwen/Qwen-VL-Chat--tokenization_qwen.QWenTokenizer",
+       null
+     ]
+   },
+   "clean_up_tokenization_spaces": true,
+   "model_max_length": 768,
+   "pad_token": "<|endoftext|>",
+   "padding_side": "right",
+   "tokenizer_class": "QWenTokenizer"
+ }
checkpoint-1200/trainer_state.json ADDED
@@ -0,0 +1,873 @@
+ {
+ "best_metric": null,
+ "best_model_checkpoint": null,
+ "epoch": 0.07831364615284213,
+ "eval_steps": 500,
+ "global_step": 1200,
+ "is_hyper_param_search": false,
+ "is_local_process_zero": true,
+ "is_world_process_zero": true,
+ "log_history": [
+ {
+ "epoch": 0.0006526137179403511,
+ "grad_norm": 17.690582114691438,
+ "learning_rate": 1.948051948051948e-06,
+ "loss": 1.3559,
+ "step": 10
+ },
+ {
+ "epoch": 0.0013052274358807021,
+ "grad_norm": 7.768088366444893,
+ "learning_rate": 3.896103896103896e-06,
+ "loss": 1.2706,
+ "step": 20
+ },
+ {
+ "epoch": 0.001957841153821053,
+ "grad_norm": 7.705313536090087,
+ "learning_rate": 5.844155844155845e-06,
+ "loss": 1.3781,
+ "step": 30
+ },
+ {
+ "epoch": 0.0026104548717614043,
+ "grad_norm": 34.39078827766783,
+ "learning_rate": 7.792207792207792e-06,
+ "loss": 1.2749,
+ "step": 40
+ },
+ {
+ "epoch": 0.0032630685897017554,
+ "grad_norm": 68.28824334896528,
+ "learning_rate": 9.74025974025974e-06,
+ "loss": 1.2955,
+ "step": 50
+ },
+ {
+ "epoch": 0.003915682307642106,
+ "grad_norm": 14.220322607917241,
+ "learning_rate": 1.168831168831169e-05,
+ "loss": 1.2315,
+ "step": 60
+ },
+ {
+ "epoch": 0.0045682960255824575,
+ "grad_norm": 12.611848231734811,
+ "learning_rate": 1.3636363636363637e-05,
+ "loss": 1.0953,
+ "step": 70
+ },
+ {
+ "epoch": 0.0052209097435228086,
+ "grad_norm": 6.055664298727015,
+ "learning_rate": 1.5584415584415583e-05,
+ "loss": 1.105,
+ "step": 80
+ },
+ {
+ "epoch": 0.00587352346146316,
+ "grad_norm": 3.52269227801977,
+ "learning_rate": 1.753246753246753e-05,
+ "loss": 0.9563,
+ "step": 90
+ },
+ {
+ "epoch": 0.006526137179403511,
+ "grad_norm": 10.771884023354394,
+ "learning_rate": 1.948051948051948e-05,
+ "loss": 0.9523,
+ "step": 100
+ },
+ {
+ "epoch": 0.007178750897343862,
+ "grad_norm": 33.41476483216757,
+ "learning_rate": 2.1428571428571428e-05,
+ "loss": 0.832,
+ "step": 110
+ },
+ {
+ "epoch": 0.007831364615284213,
+ "grad_norm": 31.120240364617406,
+ "learning_rate": 2.337662337662338e-05,
+ "loss": 0.8376,
+ "step": 120
+ },
+ {
+ "epoch": 0.008483978333224564,
+ "grad_norm": 5.517231564060886,
+ "learning_rate": 2.5324675324675325e-05,
+ "loss": 0.8293,
+ "step": 130
+ },
+ {
+ "epoch": 0.009136592051164915,
+ "grad_norm": 4.311605388342058,
+ "learning_rate": 2.7272727272727273e-05,
+ "loss": 0.8295,
+ "step": 140
+ },
+ {
+ "epoch": 0.009789205769105266,
+ "grad_norm": 6.997724163121519,
+ "learning_rate": 2.922077922077922e-05,
+ "loss": 0.7662,
+ "step": 150
+ },
+ {
+ "epoch": 0.010441819487045617,
+ "grad_norm": 6.517836234400708,
+ "learning_rate": 2.999998841890695e-05,
+ "loss": 0.8158,
+ "step": 160
+ },
+ {
+ "epoch": 0.011094433204985968,
+ "grad_norm": 4.186989141019666,
+ "learning_rate": 2.99999176456253e-05,
+ "loss": 0.8037,
+ "step": 170
+ },
+ {
+ "epoch": 0.01174704692292632,
+ "grad_norm": 5.181546943355458,
+ "learning_rate": 2.9999782533305785e-05,
+ "loss": 0.7274,
+ "step": 180
+ },
+ {
+ "epoch": 0.01239966064086667,
+ "grad_norm": 3.767076521211455,
+ "learning_rate": 2.9999583082527935e-05,
+ "loss": 0.7474,
+ "step": 190
+ },
+ {
+ "epoch": 0.013052274358807021,
+ "grad_norm": 18.84416377940188,
+ "learning_rate": 2.999931929414726e-05,
+ "loss": 0.7708,
+ "step": 200
+ },
+ {
+ "epoch": 0.013704888076747372,
+ "grad_norm": 3.169160630444992,
+ "learning_rate": 2.999899116929522e-05,
+ "loss": 0.8279,
+ "step": 210
+ },
+ {
+ "epoch": 0.014357501794687724,
+ "grad_norm": 1.912782077307437,
+ "learning_rate": 2.999859870937924e-05,
+ "loss": 0.7407,
+ "step": 220
+ },
+ {
+ "epoch": 0.015010115512628075,
+ "grad_norm": 3.3906505952914974,
+ "learning_rate": 2.9998141916082696e-05,
+ "loss": 0.7732,
+ "step": 230
+ },
+ {
+ "epoch": 0.015662729230568426,
+ "grad_norm": 2.7144492322383584,
+ "learning_rate": 2.999762079136491e-05,
+ "loss": 0.7272,
+ "step": 240
+ },
+ {
+ "epoch": 0.01631534294850878,
+ "grad_norm": 7.109330196029837,
+ "learning_rate": 2.9997035337461135e-05,
+ "loss": 0.7748,
+ "step": 250
+ },
+ {
+ "epoch": 0.016967956666449128,
+ "grad_norm": 1.6054280593801813,
+ "learning_rate": 2.9996385556882555e-05,
+ "loss": 0.7676,
+ "step": 260
+ },
+ {
+ "epoch": 0.01762057038438948,
+ "grad_norm": 10.883212441614672,
+ "learning_rate": 2.9995671452416274e-05,
+ "loss": 0.735,
+ "step": 270
+ },
+ {
+ "epoch": 0.01827318410232983,
+ "grad_norm": 3.511064886507805,
+ "learning_rate": 2.999489302712529e-05,
+ "loss": 0.7741,
+ "step": 280
+ },
+ {
+ "epoch": 0.018925797820270183,
+ "grad_norm": 3.618603818375307,
+ "learning_rate": 2.9994050284348497e-05,
+ "loss": 0.749,
+ "step": 290
+ },
+ {
+ "epoch": 0.019578411538210532,
+ "grad_norm": 6.012944880342178,
+ "learning_rate": 2.9993143227700668e-05,
+ "loss": 0.7411,
+ "step": 300
+ },
+ {
+ "epoch": 0.020231025256150885,
+ "grad_norm": 2.348670372295822,
+ "learning_rate": 2.9992171861072428e-05,
+ "loss": 0.7394,
+ "step": 310
+ },
+ {
+ "epoch": 0.020883638974091234,
+ "grad_norm": 4.728309497649916,
+ "learning_rate": 2.9991136188630263e-05,
+ "loss": 0.8077,
+ "step": 320
+ },
+ {
+ "epoch": 0.021536252692031587,
+ "grad_norm": 15.611917863290122,
+ "learning_rate": 2.9990036214816467e-05,
+ "loss": 0.7209,
+ "step": 330
+ },
+ {
+ "epoch": 0.022188866409971936,
+ "grad_norm": 3.7315277354070817,
+ "learning_rate": 2.998887194434916e-05,
+ "loss": 0.7101,
+ "step": 340
+ },
+ {
+ "epoch": 0.02284148012791229,
+ "grad_norm": 6.618759094750745,
+ "learning_rate": 2.998764338222222e-05,
+ "loss": 0.7759,
+ "step": 350
+ },
+ {
+ "epoch": 0.02349409384585264,
+ "grad_norm": 6.770044306239603,
+ "learning_rate": 2.998635053370533e-05,
+ "loss": 0.7398,
+ "step": 360
+ },
+ {
+ "epoch": 0.02414670756379299,
+ "grad_norm": 12.471224202357552,
+ "learning_rate": 2.998499340434389e-05,
+ "loss": 0.7046,
+ "step": 370
+ },
+ {
+ "epoch": 0.02479932128173334,
+ "grad_norm": 4.147359416986547,
+ "learning_rate": 2.9983571999959013e-05,
+ "loss": 0.761,
+ "step": 380
+ },
+ {
+ "epoch": 0.025451934999673693,
+ "grad_norm": 34.84722866603778,
+ "learning_rate": 2.9982086326647533e-05,
+ "loss": 0.757,
+ "step": 390
+ },
+ {
+ "epoch": 0.026104548717614043,
+ "grad_norm": 5.245498180313093,
+ "learning_rate": 2.998053639078193e-05,
+ "loss": 0.7536,
+ "step": 400
+ },
+ {
+ "epoch": 0.026757162435554396,
+ "grad_norm": 36.55990241841121,
+ "learning_rate": 2.997892219901034e-05,
+ "loss": 0.7395,
+ "step": 410
+ },
+ {
+ "epoch": 0.027409776153494745,
+ "grad_norm": 5.03198653806696,
+ "learning_rate": 2.9977243758256494e-05,
+ "loss": 0.7208,
+ "step": 420
+ },
+ {
+ "epoch": 0.028062389871435098,
+ "grad_norm": 11.376914733036081,
+ "learning_rate": 2.997550107571972e-05,
+ "loss": 0.719,
+ "step": 430
+ },
+ {
+ "epoch": 0.028715003589375447,
+ "grad_norm": 2.958119684662306,
+ "learning_rate": 2.9973694158874898e-05,
+ "loss": 0.7271,
+ "step": 440
+ },
+ {
+ "epoch": 0.0293676173073158,
+ "grad_norm": 6.037096737490817,
+ "learning_rate": 2.9971823015472418e-05,
+ "loss": 0.7356,
+ "step": 450
+ },
+ {
+ "epoch": 0.03002023102525615,
+ "grad_norm": 5.3042973640363575,
+ "learning_rate": 2.9969887653538164e-05,
+ "loss": 0.7207,
+ "step": 460
+ },
+ {
+ "epoch": 0.030672844743196502,
+ "grad_norm": 2.4985603001745624,
+ "learning_rate": 2.996788808137347e-05,
+ "loss": 0.7769,
+ "step": 470
+ },
+ {
+ "epoch": 0.03132545846113685,
+ "grad_norm": 7.607065841315647,
+ "learning_rate": 2.9965824307555084e-05,
+ "loss": 0.7091,
+ "step": 480
+ },
+ {
+ "epoch": 0.03197807217907721,
+ "grad_norm": 4.322533035107957,
+ "learning_rate": 2.9963696340935144e-05,
+ "loss": 0.7114,
+ "step": 490
+ },
+ {
+ "epoch": 0.03263068589701756,
+ "grad_norm": 5.878565903250334,
+ "learning_rate": 2.9961504190641108e-05,
+ "loss": 0.7284,
+ "step": 500
+ },
+ {
+ "epoch": 0.033283299614957906,
+ "grad_norm": 5.0026507027119855,
+ "learning_rate": 2.9959247866075764e-05,
+ "loss": 0.6992,
+ "step": 510
+ },
+ {
+ "epoch": 0.033935913332898256,
+ "grad_norm": 7.12632150273901,
+ "learning_rate": 2.9956927376917137e-05,
+ "loss": 0.7285,
+ "step": 520
+ },
+ {
+ "epoch": 0.03458852705083861,
+ "grad_norm": 5.211123255860348,
+ "learning_rate": 2.9954542733118496e-05,
+ "loss": 0.7511,
+ "step": 530
+ },
+ {
+ "epoch": 0.03524114076877896,
+ "grad_norm": 9.925273547498618,
+ "learning_rate": 2.995209394490827e-05,
+ "loss": 0.7699,
+ "step": 540
+ },
+ {
+ "epoch": 0.03589375448671931,
+ "grad_norm": 7.418381681996765,
+ "learning_rate": 2.9949581022790025e-05,
+ "loss": 0.759,
+ "step": 550
+ },
+ {
+ "epoch": 0.03654636820465966,
+ "grad_norm": 4.352380973507467,
+ "learning_rate": 2.9947003977542423e-05,
+ "loss": 0.7537,
+ "step": 560
+ },
+ {
+ "epoch": 0.037198981922600016,
+ "grad_norm": 9.712842120769198,
+ "learning_rate": 2.9944362820219167e-05,
+ "loss": 0.7063,
+ "step": 570
+ },
+ {
+ "epoch": 0.037851595640540366,
+ "grad_norm": 5.757600819230482,
+ "learning_rate": 2.994165756214895e-05,
+ "loss": 0.7893,
+ "step": 580
+ },
+ {
+ "epoch": 0.038504209358480715,
+ "grad_norm": 5.529209601152462,
+ "learning_rate": 2.9938888214935426e-05,
+ "loss": 0.6771,
+ "step": 590
+ },
+ {
+ "epoch": 0.039156823076421064,
+ "grad_norm": 10.550479346499758,
+ "learning_rate": 2.9936054790457127e-05,
+ "loss": 0.737,
+ "step": 600
+ },
+ {
+ "epoch": 0.03980943679436142,
+ "grad_norm": 8.284279553451016,
+ "learning_rate": 2.9933157300867437e-05,
+ "loss": 0.7182,
+ "step": 610
+ },
+ {
+ "epoch": 0.04046205051230177,
+ "grad_norm": 8.18511648646326,
+ "learning_rate": 2.9930195758594542e-05,
+ "loss": 0.6901,
+ "step": 620
+ },
+ {
+ "epoch": 0.04111466423024212,
+ "grad_norm": 14.569754827631956,
+ "learning_rate": 2.9927170176341365e-05,
+ "loss": 0.7008,
+ "step": 630
+ },
+ {
+ "epoch": 0.04176727794818247,
+ "grad_norm": 4.214581273685441,
+ "learning_rate": 2.992408056708551e-05,
+ "loss": 0.7489,
+ "step": 640
+ },
+ {
+ "epoch": 0.042419891666122825,
+ "grad_norm": 10.038596627079452,
+ "learning_rate": 2.9920926944079224e-05,
+ "loss": 0.7649,
+ "step": 650
+ },
+ {
+ "epoch": 0.043072505384063174,
+ "grad_norm": 2.386544029221306,
+ "learning_rate": 2.9917709320849305e-05,
+ "loss": 0.7223,
+ "step": 660
+ },
+ {
+ "epoch": 0.043725119102003523,
+ "grad_norm": 8.286359254511249,
+ "learning_rate": 2.9914427711197096e-05,
+ "loss": 0.7089,
+ "step": 670
+ },
+ {
+ "epoch": 0.04437773281994387,
+ "grad_norm": 4.235819327444911,
+ "learning_rate": 2.9911082129198372e-05,
+ "loss": 0.7138,
+ "step": 680
+ },
+ {
+ "epoch": 0.04503034653788423,
+ "grad_norm": 5.187338033698449,
+ "learning_rate": 2.9907672589203316e-05,
+ "loss": 0.7192,
+ "step": 690
+ },
+ {
+ "epoch": 0.04568296025582458,
+ "grad_norm": 6.360475337181379,
+ "learning_rate": 2.9904199105836443e-05,
+ "loss": 0.7094,
+ "step": 700
+ },
+ {
+ "epoch": 0.04633557397376493,
+ "grad_norm": 4.906400836156689,
+ "learning_rate": 2.990066169399654e-05,
+ "loss": 0.654,
+ "step": 710
+ },
+ {
+ "epoch": 0.04698818769170528,
+ "grad_norm": 17.600495314130633,
+ "learning_rate": 2.9897060368856603e-05,
+ "loss": 0.7299,
+ "step": 720
+ },
+ {
+ "epoch": 0.04764080140964563,
+ "grad_norm": 7.765935941492389,
+ "learning_rate": 2.989339514586377e-05,
+ "loss": 0.7486,
+ "step": 730
+ },
+ {
+ "epoch": 0.04829341512758598,
+ "grad_norm": 7.30026395137639,
+ "learning_rate": 2.9889666040739252e-05,
+ "loss": 0.6941,
+ "step": 740
+ },
+ {
+ "epoch": 0.04894602884552633,
+ "grad_norm": 4.676985481218465,
+ "learning_rate": 2.9885873069478275e-05,
+ "loss": 0.7701,
+ "step": 750
+ },
+ {
+ "epoch": 0.04959864256346668,
+ "grad_norm": 42.50656974727186,
+ "learning_rate": 2.9882016248350006e-05,
+ "loss": 0.7428,
+ "step": 760
+ },
+ {
+ "epoch": 0.05025125628140704,
+ "grad_norm": 3.9893667031114766,
+ "learning_rate": 2.9878095593897474e-05,
+ "loss": 0.7204,
+ "step": 770
+ },
+ {
+ "epoch": 0.05090386999934739,
+ "grad_norm": 8.909028486553332,
+ "learning_rate": 2.9874111122937518e-05,
+ "loss": 0.7336,
+ "step": 780
+ },
+ {
+ "epoch": 0.051556483717287736,
+ "grad_norm": 5.256925284136456,
+ "learning_rate": 2.9870062852560698e-05,
+ "loss": 0.7674,
+ "step": 790
+ },
+ {
+ "epoch": 0.052209097435228086,
+ "grad_norm": 5.835535487534073,
+ "learning_rate": 2.986595080013123e-05,
+ "loss": 0.7547,
+ "step": 800
+ },
+ {
+ "epoch": 0.05286171115316844,
+ "grad_norm": 4.7337998648314565,
+ "learning_rate": 2.9861774983286913e-05,
+ "loss": 0.7412,
+ "step": 810
+ },
+ {
+ "epoch": 0.05351432487110879,
+ "grad_norm": 4.020304406250962,
+ "learning_rate": 2.9857535419939053e-05,
+ "loss": 0.7351,
+ "step": 820
+ },
+ {
+ "epoch": 0.05416693858904914,
+ "grad_norm": 7.005748568175158,
+ "learning_rate": 2.9853232128272367e-05,
+ "loss": 0.7146,
+ "step": 830
+ },
+ {
+ "epoch": 0.05481955230698949,
+ "grad_norm": 12.598315147497464,
+ "learning_rate": 2.984886512674494e-05,
+ "loss": 0.7066,
+ "step": 840
+ },
+ {
+ "epoch": 0.055472166024929846,
+ "grad_norm": 5.636755294839953,
+ "learning_rate": 2.9844434434088114e-05,
+ "loss": 0.8033,
+ "step": 850
+ },
+ {
+ "epoch": 0.056124779742870196,
+ "grad_norm": 2.5964949457129305,
+ "learning_rate": 2.9839940069306436e-05,
+ "loss": 0.718,
+ "step": 860
+ },
+ {
+ "epoch": 0.056777393460810545,
+ "grad_norm": 5.496060434333994,
+ "learning_rate": 2.9835382051677548e-05,
+ "loss": 0.7382,
+ "step": 870
+ },
+ {
+ "epoch": 0.057430007178750894,
+ "grad_norm": 3.367511777906771,
+ "learning_rate": 2.9830760400752117e-05,
+ "loss": 0.7049,
+ "step": 880
+ },
+ {
+ "epoch": 0.05808262089669125,
+ "grad_norm": 12.228282751386294,
+ "learning_rate": 2.9826075136353762e-05,
+ "loss": 0.7135,
+ "step": 890
+ },
+ {
+ "epoch": 0.0587352346146316,
+ "grad_norm": 7.426066867205744,
+ "learning_rate": 2.9821326278578955e-05,
+ "loss": 0.6966,
+ "step": 900
+ },
+ {
+ "epoch": 0.05938784833257195,
+ "grad_norm": 5.720080945169142,
+ "learning_rate": 2.981651384779693e-05,
+ "loss": 0.7325,
+ "step": 910
+ },
+ {
+ "epoch": 0.0600404620505123,
+ "grad_norm": 3.3362738196336275,
+ "learning_rate": 2.9811637864649622e-05,
+ "loss": 0.7013,
+ "step": 920
+ },
+ {
+ "epoch": 0.060693075768452655,
+ "grad_norm": 5.5481143050516675,
+ "learning_rate": 2.980669835005154e-05,
+ "loss": 0.7107,
+ "step": 930
+ },
+ {
+ "epoch": 0.061345689486393004,
+ "grad_norm": 2.7247889305754533,
+ "learning_rate": 2.980169532518971e-05,
+ "loss": 0.6839,
+ "step": 940
+ },
+ {
+ "epoch": 0.06199830320433335,
+ "grad_norm": 12.705144630158374,
+ "learning_rate": 2.9796628811523576e-05,
+ "loss": 0.7061,
+ "step": 950
+ },
+ {
+ "epoch": 0.0626509169222737,
+ "grad_norm": 3.1174966376805777,
+ "learning_rate": 2.9791498830784896e-05,
+ "loss": 0.706,
+ "step": 960
+ },
+ {
+ "epoch": 0.06330353064021406,
+ "grad_norm": 6.454819870022971,
+ "learning_rate": 2.9786305404977657e-05,
+ "loss": 0.6901,
+ "step": 970
+ },
+ {
+ "epoch": 0.06395614435815442,
+ "grad_norm": 8.62099817289566,
+ "learning_rate": 2.9781048556377982e-05,
+ "loss": 0.6737,
+ "step": 980
+ },
+ {
+ "epoch": 0.06460875807609476,
+ "grad_norm": 12.649532843245389,
+ "learning_rate": 2.977572830753404e-05,
+ "loss": 0.6777,
+ "step": 990
+ },
+ {
+ "epoch": 0.06526137179403511,
+ "grad_norm": 5.019508830810828,
+ "learning_rate": 2.9770344681265925e-05,
+ "loss": 0.7125,
+ "step": 1000
+ },
+ {
+ "epoch": 0.06591398551197546,
+ "grad_norm": 5.417114630539967,
+ "learning_rate": 2.9764897700665595e-05,
+ "loss": 0.7558,
+ "step": 1010
+ },
+ {
+ "epoch": 0.06656659922991581,
+ "grad_norm": 13.487574757960102,
+ "learning_rate": 2.975938738909674e-05,
+ "loss": 0.7305,
+ "step": 1020
+ },
+ {
+ "epoch": 0.06721921294785617,
+ "grad_norm": 4.115297871929447,
+ "learning_rate": 2.97538137701947e-05,
+ "loss": 0.7382,
+ "step": 1030
+ },
+ {
+ "epoch": 0.06787182666579651,
+ "grad_norm": 4.218133725965425,
+ "learning_rate": 2.974817686786636e-05,
+ "loss": 0.7131,
+ "step": 1040
+ },
+ {
+ "epoch": 0.06852444038373687,
+ "grad_norm": 23.754945260227526,
+ "learning_rate": 2.9742476706290044e-05,
+ "loss": 0.6854,
+ "step": 1050
+ },
+ {
+ "epoch": 0.06917705410167722,
+ "grad_norm": 9.992382581534882,
+ "learning_rate": 2.973671330991541e-05,
+ "loss": 0.7224,
+ "step": 1060
+ },
+ {
+ "epoch": 0.06982966781961757,
+ "grad_norm": 9.022842665053004,
+ "learning_rate": 2.973088670346336e-05,
+ "loss": 0.69,
+ "step": 1070
+ },
+ {
+ "epoch": 0.07048228153755792,
+ "grad_norm": 7.180693480173149,
+ "learning_rate": 2.97249969119259e-05,
+ "loss": 0.6752,
+ "step": 1080
+ },
+ {
+ "epoch": 0.07113489525549826,
+ "grad_norm": 4.631581340679664,
+ "learning_rate": 2.9719043960566088e-05,
+ "loss": 0.7078,
+ "step": 1090
+ },
+ {
+ "epoch": 0.07178750897343862,
+ "grad_norm": 3.8365551360021497,
+ "learning_rate": 2.9713027874917867e-05,
+ "loss": 0.7455,
+ "step": 1100
+ },
+ {
+ "epoch": 0.07244012269137898,
+ "grad_norm": 20.612721990589407,
+ "learning_rate": 2.9706948680785984e-05,
+ "loss": 0.7123,
+ "step": 1110
+ },
+ {
+ "epoch": 0.07309273640931932,
+ "grad_norm": 8.515913036269723,
+ "learning_rate": 2.9700806404245893e-05,
+ "loss": 0.6755,
+ "step": 1120
+ },
+ {
+ "epoch": 0.07374535012725968,
+ "grad_norm": 8.702591994450561,
+ "learning_rate": 2.9694601071643607e-05,
+ "loss": 0.743,
+ "step": 1130
+ },
+ {
+ "epoch": 0.07439796384520003,
+ "grad_norm": 20.204623397644042,
+ "learning_rate": 2.968833270959562e-05,
+ "loss": 0.6995,
+ "step": 1140
+ },
+ {
+ "epoch": 0.07505057756314037,
+ "grad_norm": 3.4150625200259563,
+ "learning_rate": 2.9682001344988768e-05,
+ "loss": 0.7245,
+ "step": 1150
+ },
+ {
+ "epoch": 0.07570319128108073,
+ "grad_norm": 4.827412673105033,
+ "learning_rate": 2.967560700498013e-05,
+ "loss": 0.6764,
+ "step": 1160
+ },
+ {
+ "epoch": 0.07635580499902107,
+ "grad_norm": 5.9778449783108965,
+ "learning_rate": 2.9669149716996897e-05,
+ "loss": 0.7094,
+ "step": 1170
+ },
+ {
+ "epoch": 0.07700841871696143,
+ "grad_norm": 4.626419468156439,
+ "learning_rate": 2.9662629508736278e-05,
+ "loss": 0.7139,
+ "step": 1180
+ },
+ {
+ "epoch": 0.07766103243490179,
+ "grad_norm": 8.23953369228554,
+ "learning_rate": 2.9656046408165344e-05,
+ "loss": 0.7132,
+ "step": 1190
+ },
+ {
+ "epoch": 0.07831364615284213,
+ "grad_norm": 5.755275462407804,
+ "learning_rate": 2.964940044352095e-05,
+ "loss": 0.6923,
+ "step": 1200
+ }
+ ],
+ "logging_steps": 10,
+ "max_steps": 15323,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 1,
+ "save_steps": 400,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": false,
+ "should_save": true,
+ "should_training_stop": false
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 3.280284708293837e+18,
+ "train_batch_size": 8,
+ "trial_name": null,
+ "trial_params": null
+ }
checkpoint-1200/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a3a6a5052a9445cc570063f5939fdeea3ff8007e9c2718674bb335b9eea0bfff
+ size 6520
checkpoint-1200/zero_to_fp32.py ADDED
@@ -0,0 +1,587 @@
+ #!/usr/bin/env python
+
+ # Copyright (c) Microsoft Corporation.
+ # SPDX-License-Identifier: Apache-2.0
+
+ # DeepSpeed Team
+
+ # This script extracts fp32 consolidated weights from ZeRO 1, 2 and 3 DeepSpeed checkpoints. It gets
+ # copied into the top level checkpoint dir, so the user can easily do the conversion at any point in
+ # the future. Once extracted, the weights don't require DeepSpeed and can be used in any
+ # application.
+ #
+ # example: python zero_to_fp32.py . pytorch_model.bin
+
+ import argparse
+ import torch
+ import glob
+ import math
+ import os
+ import re
+ from collections import OrderedDict
+ from dataclasses import dataclass
+
+ # while this script doesn't use deepspeed to recover data, since the checkpoints are pickled with
+ # DeepSpeed data structures it has to be available in the current python environment.
+ from deepspeed.utils import logger
+ from deepspeed.checkpoint.constants import (DS_VERSION, OPTIMIZER_STATE_DICT, SINGLE_PARTITION_OF_FP32_GROUPS,
+                                             FP32_FLAT_GROUPS, ZERO_STAGE, PARTITION_COUNT, PARAM_SHAPES, BUFFER_NAMES,
+                                             FROZEN_PARAM_SHAPES, FROZEN_PARAM_FRAGMENTS)
+
+
+ @dataclass
+ class zero_model_state:
+     buffers: dict()
+     param_shapes: dict()
+     shared_params: list
+     ds_version: int
+     frozen_param_shapes: dict()
+     frozen_param_fragments: dict()
+
+
+ debug = 0
+
+ # load to cpu
+ device = torch.device('cpu')
+
+
+ def atoi(text):
+     return int(text) if text.isdigit() else text
+
+
+ def natural_keys(text):
+     '''
+     alist.sort(key=natural_keys) sorts in human order
+     http://nedbatchelder.com/blog/200712/human_sorting.html
+     (See Toothy's implementation in the comments)
+     '''
+     return [atoi(c) for c in re.split(r'(\d+)', text)]
+
+
+ def get_model_state_file(checkpoint_dir, zero_stage):
+     if not os.path.isdir(checkpoint_dir):
+         raise FileNotFoundError(f"Directory '{checkpoint_dir}' doesn't exist")
+
+     # there should be only one file
+     if zero_stage <= 2:
+         file = os.path.join(checkpoint_dir, "mp_rank_00_model_states.pt")
+     elif zero_stage == 3:
+         file = os.path.join(checkpoint_dir, "zero_pp_rank_0_mp_rank_00_model_states.pt")
+
+     if not os.path.exists(file):
+         raise FileNotFoundError(f"can't find model states file at '{file}'")
+
+     return file
+
+
+ def get_checkpoint_files(checkpoint_dir, glob_pattern):
+     # XXX: need to test that this simple glob rule works for multi-node setup too
+     ckpt_files = sorted(glob.glob(os.path.join(checkpoint_dir, glob_pattern)), key=natural_keys)
+
+     if len(ckpt_files) == 0:
+         raise FileNotFoundError(f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")
+
+     return ckpt_files
+
+
+ def get_optim_files(checkpoint_dir):
+     return get_checkpoint_files(checkpoint_dir, "*_optim_states.pt")
+
+
+ def get_model_state_files(checkpoint_dir):
+     return get_checkpoint_files(checkpoint_dir, "*_model_states.pt")
+
+
+ def parse_model_states(files):
+     zero_model_states = []
+     for file in files:
+         state_dict = torch.load(file, map_location=device)
+
+         if BUFFER_NAMES not in state_dict:
+             raise ValueError(f"{file} is not a model state checkpoint")
+         buffer_names = state_dict[BUFFER_NAMES]
+         if debug:
+             print("Found buffers:", buffer_names)
+
+         # recover just the buffers while restoring them to fp32 if they were saved in fp16
+         buffers = {k: v.float() for k, v in state_dict["module"].items() if k in buffer_names}
+         param_shapes = state_dict[PARAM_SHAPES]
+
+         # collect parameters that are included in param_shapes
+         param_names = []
+         for s in param_shapes:
+             for name in s.keys():
+                 param_names.append(name)
+
+         # update with frozen parameters
+         frozen_param_shapes = state_dict.get(FROZEN_PARAM_SHAPES, None)
+         if frozen_param_shapes is not None:
+             if debug:
+                 print(f"Found frozen_param_shapes: {frozen_param_shapes}")
+             param_names += list(frozen_param_shapes.keys())
+
+         # handle shared params
+         shared_params = [[k, v] for k, v in state_dict["shared_params"].items()]
+
+         ds_version = state_dict.get(DS_VERSION, None)
+
+         frozen_param_fragments = state_dict.get(FROZEN_PARAM_FRAGMENTS, None)
+
+         z_model_state = zero_model_state(buffers=buffers,
+                                          param_shapes=param_shapes,
+                                          shared_params=shared_params,
+                                          ds_version=ds_version,
+                                          frozen_param_shapes=frozen_param_shapes,
+                                          frozen_param_fragments=frozen_param_fragments)
+         zero_model_states.append(z_model_state)
+
+     return zero_model_states
+
+
+ def parse_optim_states(files, ds_checkpoint_dir):
+
+     total_files = len(files)
+     state_dicts = []
+     for f in files:
+         state_dict = torch.load(f, map_location=device)
+         # immediately discard the potentially huge 2 optimizer states as we only care for fp32 master weights
+         # and also handle the case where it was already removed by another helper script
+         state_dict["optimizer_state_dict"].pop("optimizer_state_dict", None)
+         state_dicts.append(state_dict)
+
+     if not ZERO_STAGE in state_dicts[0][OPTIMIZER_STATE_DICT]:
+         raise ValueError(f"{files[0]} is not a zero checkpoint")
+     zero_stage = state_dicts[0][OPTIMIZER_STATE_DICT][ZERO_STAGE]
+     world_size = state_dicts[0][OPTIMIZER_STATE_DICT][PARTITION_COUNT]
+
+     # For ZeRO-2 each param group can have different partition_count as data parallelism for expert
+     # parameters can be different from data parallelism for non-expert parameters. So we can just
+     # use the max of the partition_count to get the dp world_size.
+
+     if type(world_size) is list:
+         world_size = max(world_size)
+
+     if world_size != total_files:
+         raise ValueError(
+             f"Expected {world_size} of '*_optim_states.pt' under '{ds_checkpoint_dir}' but found {total_files} files. "
+             "Possibly due to an overwrite of an old checkpoint, or a checkpoint didn't get saved by one or more processes."
+         )
+
+     # the groups are named differently in each stage
+     if zero_stage <= 2:
+         fp32_groups_key = SINGLE_PARTITION_OF_FP32_GROUPS
+     elif zero_stage == 3:
+         fp32_groups_key = FP32_FLAT_GROUPS
+     else:
+         raise ValueError(f"unknown zero stage {zero_stage}")
+
+     if zero_stage <= 2:
+         fp32_flat_groups = [state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key] for i in range(len(state_dicts))]
+     elif zero_stage == 3:
+         # if there is more than one param group, there will be multiple flattened tensors - one
+         # flattened tensor per group - for simplicity merge them into a single tensor
+         #
+         # XXX: could make the script more memory efficient for when there are multiple groups - it
+         # will require matching the sub-lists of param_shapes for each param group flattened tensor
+
+         fp32_flat_groups = [
+             torch.cat(state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key], 0) for i in range(len(state_dicts))
+         ]
+
+     return zero_stage, world_size, fp32_flat_groups
+
+
+ def _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir):
+     """
+     Returns fp32 state_dict reconstructed from ds checkpoint
+
+     Args:
+         - ``ds_checkpoint_dir``: path to the deepspeed checkpoint folder (where the optimizer files are)
+
+     """
+     print(f"Processing zero checkpoint '{ds_checkpoint_dir}'")
+
+     optim_files = get_optim_files(ds_checkpoint_dir)
+     zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
+     print(f"Detected checkpoint of type zero stage {zero_stage}, world_size: {world_size}")
+
+     model_files = get_model_state_files(ds_checkpoint_dir)
+
+     zero_model_states = parse_model_states(model_files)
+     print(f'Parsing checkpoint created by deepspeed=={zero_model_states[0].ds_version}')
+
+     if zero_stage <= 2:
+         return _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states)
+     elif zero_stage == 3:
+         return _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states)
+
+
+ def _zero2_merge_frozen_params(state_dict, zero_model_states):
+     if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
+         return
+
+     frozen_param_shapes = zero_model_states[0].frozen_param_shapes
+     frozen_param_fragments = zero_model_states[0].frozen_param_fragments
+
+     if debug:
+         num_elem = sum(s.numel() for s in frozen_param_shapes.values())
+         print(f'rank 0: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
+
+     wanted_params = len(frozen_param_shapes)
+     wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
+     avail_numel = sum([p.numel() for p in frozen_param_fragments.values()])
+     print(f'Frozen params: Have {avail_numel} numels to process.')
+     print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
+
+     total_params = 0
+     total_numel = 0
+     for name, shape in frozen_param_shapes.items():
+         total_params += 1
+         unpartitioned_numel = shape.numel()
+         total_numel += unpartitioned_numel
+
+         state_dict[name] = frozen_param_fragments[name]
+
+         if debug:
+             print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
+
+     print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
+
+
+ def _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
+     param_shapes = zero_model_states[0].param_shapes
+
+     # Reconstruction protocol:
+     #
+     # XXX: document this
+
+     if debug:
+         for i in range(world_size):
+             for j in range(len(fp32_flat_groups[0])):
+                 print(f"{FP32_FLAT_GROUPS}[{i}][{j}].shape={fp32_flat_groups[i][j].shape}")
+
+     # XXX: memory usage doubles here (zero2)
+     num_param_groups = len(fp32_flat_groups[0])
+     merged_single_partition_of_fp32_groups = []
+     for i in range(num_param_groups):
+         merged_partitions = [sd[i] for sd in fp32_flat_groups]
+         full_single_fp32_vector = torch.cat(merged_partitions, 0)
+         merged_single_partition_of_fp32_groups.append(full_single_fp32_vector)
+     avail_numel = sum(
+         [full_single_fp32_vector.numel() for full_single_fp32_vector in merged_single_partition_of_fp32_groups])
+
+     if debug:
+         wanted_params = sum([len(shapes) for shapes in param_shapes])
+         wanted_numel = sum([sum(shape.numel() for shape in shapes.values()) for shapes in param_shapes])
+         # not asserting if there is a mismatch due to possible padding
+         print(f"Have {avail_numel} numels to process.")
+         print(f"Need {wanted_numel} numels in {wanted_params} params.")
+
+     # params
+     # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
+     # out-of-core computing solution
+     total_numel = 0
+     total_params = 0
+     for shapes, full_single_fp32_vector in zip(param_shapes, merged_single_partition_of_fp32_groups):
+         offset = 0
+         avail_numel = full_single_fp32_vector.numel()
+         for name, shape in shapes.items():
+
+             unpartitioned_numel = shape.numel()
+             total_numel += unpartitioned_numel
+             total_params += 1
+
+             if debug:
+                 print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
+             state_dict[name] = full_single_fp32_vector.narrow(0, offset, unpartitioned_numel).view(shape)
+             offset += unpartitioned_numel
+
+         # Z2 started to align to 2*world_size to improve nccl performance. Therefore both offset and
+         # avail_numel can differ by anywhere between 0..2*world_size. Due to two unrelated complex
+         # paddings performed in the code it's almost impossible to predict the exact numbers w/o the
+         # live optimizer object, so we are checking that the numbers are within the right range
+         align_to = 2 * world_size
+
+         def zero2_align(x):
+             return align_to * math.ceil(x / align_to)
+
+         if debug:
+             print(f"original offset={offset}, avail_numel={avail_numel}")
+
+         offset = zero2_align(offset)
+         avail_numel = zero2_align(avail_numel)
+
+         if debug:
+             print(f"aligned offset={offset}, avail_numel={avail_numel}")
+
+         # Sanity check
+         if offset != avail_numel:
+             raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
+
+     print(f"Reconstructed fp32 state dict with {total_params} params {total_numel} elements")
+
+
+ def _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states):
+     state_dict = OrderedDict()
+
+     # buffers
+     buffers = zero_model_states[0].buffers
+     state_dict.update(buffers)
+     if debug:
+         print(f"added {len(buffers)} buffers")
+
+     _zero2_merge_frozen_params(state_dict, zero_model_states)
+
+     _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
+
+     # recover shared parameters
+     for pair in zero_model_states[0].shared_params:
+         if pair[1] in state_dict:
+             state_dict[pair[0]] = state_dict[pair[1]]
+
+     return state_dict
+
+
+ def zero3_partitioned_param_info(unpartitioned_numel, world_size):
+     remainder = unpartitioned_numel % world_size
+     padding_numel = (world_size - remainder) if remainder else 0
+     partitioned_numel = math.ceil(unpartitioned_numel / world_size)
+     return partitioned_numel, padding_numel
+
+
+ def _zero3_merge_frozen_params(state_dict, world_size, zero_model_states):
+     if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
+         return
+
+     if debug:
+         for i in range(world_size):
+             num_elem = sum(s.numel() for s in zero_model_states[i].frozen_param_fragments.values())
+             print(f'rank {i}: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
+
+     frozen_param_shapes = zero_model_states[0].frozen_param_shapes
+     wanted_params = len(frozen_param_shapes)
+     wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
+     avail_numel = sum([p.numel() for p in zero_model_states[0].frozen_param_fragments.values()]) * world_size
+     print(f'Frozen params: Have {avail_numel} numels to process.')
+     print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
+
+     total_params = 0
+     total_numel = 0
+     for name, shape in zero_model_states[0].frozen_param_shapes.items():
+         total_params += 1
+         unpartitioned_numel = shape.numel()
+         total_numel += unpartitioned_numel
+
+         param_frags = tuple(model_state.frozen_param_fragments[name] for model_state in zero_model_states)
+         state_dict[name] = torch.cat(param_frags, 0).narrow(0, 0, unpartitioned_numel).view(shape)
+
+         partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
+
+         if debug:
+             print(
+                 f"Frozen params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
+             )
+
+     print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
+
+
+ def _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
+     param_shapes = zero_model_states[0].param_shapes
+     avail_numel = fp32_flat_groups[0].numel() * world_size
+     # Reconstruction protocol: For zero3 we need to zip the partitions together at boundary of each
+     # param, re-consolidating each param, while dealing with padding if any
+
+     # merge list of dicts, preserving order
+     param_shapes = {k: v for d in param_shapes for k, v in d.items()}
+
+     if debug:
+         for i in range(world_size):
+             print(f"{FP32_FLAT_GROUPS}[{i}].shape={fp32_flat_groups[i].shape}")
+
+     wanted_params = len(param_shapes)
+     wanted_numel = sum(shape.numel() for shape in param_shapes.values())
+     # not asserting if there is a mismatch due to possible padding
+     avail_numel = fp32_flat_groups[0].numel() * world_size
+     print(f"Trainable params: Have {avail_numel} numels to process.")
+     print(f"Trainable params: Need {wanted_numel} numels in {wanted_params} params.")
+
+     # params
+     # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
+     # out-of-core computing solution
+     offset = 0
+     total_numel = 0
+     total_params = 0
+     for name, shape in param_shapes.items():
+
+         unpartitioned_numel = shape.numel()
+         total_numel += unpartitioned_numel
+         total_params += 1
+
+         partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
+
+         if debug:
+             print(
+                 f"Trainable params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
+             )
+
+         # XXX: memory usage doubles here
+         state_dict[name] = torch.cat(
+             tuple(fp32_flat_groups[i].narrow(0, offset, partitioned_numel) for i in range(world_size)),
+             0).narrow(0, 0, unpartitioned_numel).view(shape)
+         offset += partitioned_numel
+
+     offset *= world_size
+
+     # Sanity check
+     if offset != avail_numel:
+         raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
+
+     print(f"Reconstructed Trainable fp32 state dict with {total_params} params {total_numel} elements")
+
+
+ def _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states):
+     state_dict = OrderedDict()
+
+     # buffers
+     buffers = zero_model_states[0].buffers
+     state_dict.update(buffers)
+     if debug:
+         print(f"added {len(buffers)} buffers")
+
+     _zero3_merge_frozen_params(state_dict, world_size, zero_model_states)
+
+     _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
+
+     # recover shared parameters
+     for pair in zero_model_states[0].shared_params:
+         if pair[1] in state_dict:
+             state_dict[pair[0]] = state_dict[pair[1]]
+
+     return state_dict
+
+
+ def get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag=None):
+     """
+     Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated state_dict that can be loaded with
+     ``load_state_dict()`` and used for training without DeepSpeed or shared with others, for example
+     via a model hub.
+
+     Args:
+         - ``checkpoint_dir``: path to the desired checkpoint folder
+         - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in 'latest' file. e.g., ``global_step14``
+
+     Returns:
+         - pytorch ``state_dict``
+
+     Note: this approach may not work if your application doesn't have sufficient free CPU memory and
+     you may need to use the offline approach using the ``zero_to_fp32.py`` script that is saved with
+     the checkpoint.
+
+     A typical usage might be ::
+
+         from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
+         # do the training and checkpoint saving
+         state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
+         model = model.cpu() # move to cpu
+         model.load_state_dict(state_dict)
+         # submit to model hub or save the model to share with others
+
+     In this example the ``model`` will no longer be usable in the deepspeed context of the same
+     application. i.e. you will need to re-initialize the deepspeed engine, since
+     ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
+
+     If you want it all done for you, use ``load_state_dict_from_zero_checkpoint`` instead.
+
+     """
+     if tag is None:
+         latest_path = os.path.join(checkpoint_dir, 'latest')
+         if os.path.isfile(latest_path):
+             with open(latest_path, 'r') as fd:
+                 tag = fd.read().strip()
+         else:
+             raise ValueError(f"Unable to find 'latest' file at {latest_path}")
+
+     ds_checkpoint_dir = os.path.join(checkpoint_dir, tag)
+
+     if not os.path.isdir(ds_checkpoint_dir):
+         raise FileNotFoundError(f"Directory '{ds_checkpoint_dir}' doesn't exist")
+
+     return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir)
+
+
+ def convert_zero_checkpoint_to_fp32_state_dict(checkpoint_dir, output_file, tag=None):
+     """
+     Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict`` file that can be
+     loaded with ``torch.load(file)`` + ``load_state_dict()`` and used for training without DeepSpeed.
+
+     Args:
+         - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
+         - ``output_file``: path to the pytorch fp32 state_dict output file (e.g. path/pytorch_model.bin)
+         - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
+     """
+
+     state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
+     print(f"Saving fp32 state dict to {output_file}")
+     torch.save(state_dict, output_file)
+
+
+ def load_state_dict_from_zero_checkpoint(model, checkpoint_dir, tag=None):
+     """
+     1. Put the provided model to cpu
+     2. Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict``
+     3. Load it into the provided model
+
+     Args:
+         - ``model``: the model object to update
+         - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
+         - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
+
+     Returns:
+         - ``model``: modified model
+
+     Make sure you have plenty of CPU memory available before you call this function. If you don't
+     have enough use the ``zero_to_fp32.py`` utility to do the conversion. You will find it
+     conveniently placed for you in the checkpoint folder.
+
+     A typical usage might be ::
+
+         from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
+         model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
+         # submit to model hub or save the model to share with others
+
+     Note, that once this was run, the ``model`` will no longer be usable in the deepspeed context
+     of the same application. i.e. you will need to re-initialize the deepspeed engine, since
+     ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
+
+     """
+     logger.info(f"Extracting fp32 weights")
+     state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
+
+     logger.info(f"Overwriting model with fp32 weights")
+     model = model.cpu()
+     model.load_state_dict(state_dict, strict=False)
+
+     return model
+
+
+ if __name__ == "__main__":
+
+     parser = argparse.ArgumentParser()
+     parser.add_argument("checkpoint_dir",
+                         type=str,
+                         help="path to the desired checkpoint folder, e.g., path/checkpoint-12")
+     parser.add_argument(
+         "output_file",
+         type=str,
+         help="path to the pytorch fp32 state_dict output file (e.g. path/checkpoint-12/pytorch_model.bin)")
+     parser.add_argument("-t",
+                         "--tag",
+                         type=str,
+                         default=None,
+                         help="checkpoint tag used as a unique identifier for checkpoint. e.g., global_step1")
+     parser.add_argument("-d", "--debug", action='store_true', help="enable debug")
+     args = parser.parse_args()
+
+     debug = args.debug
+
+     convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir, args.output_file, tag=args.tag)
checkpoint-1600/README.md ADDED
@@ -0,0 +1,203 @@
+ ---
+ library_name: peft
+ base_model: Qwen/Qwen-VL-Chat
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
+ ### Framework versions
+
+ - PEFT 0.10.0
+ - PEFT 0.11.1
checkpoint-1600/adapter_config.json ADDED
@@ -0,0 +1,380 @@
+ {
+   "alpha_pattern": {},
+   "auto_mapping": null,
+   "base_model_name_or_path": "Qwen/Qwen-VL-Chat",
+   "bias": "none",
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 16,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "r": 64,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "transformer.h.16.mlp.w1",
+     "transformer.visual.transformer.resblocks.13.attn.out_proj",
+     "transformer.h.28.mlp.w1",
+     "transformer.h.16.attn.c_attn",
+     "transformer.h.3.mlp.w1",
+     "transformer.visual.transformer.resblocks.29.attn.in_proj",
+     "transformer.visual.transformer.resblocks.19.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.47.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.34.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.4.attn.out_proj",
+     "transformer.h.31.attn.c_attn",
+     "transformer.h.16.mlp.w2",
+     "transformer.visual.transformer.resblocks.5.attn.out_proj",
+     "transformer.h.2.mlp.w1",
+     "transformer.visual.transformer.resblocks.7.attn.in_proj",
+     "transformer.h.20.mlp.w2",
+     "transformer.h.19.mlp.w1",
+     "transformer.visual.transformer.resblocks.18.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.27.attn.out_proj",
+     "transformer.visual.transformer.resblocks.10.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.43.mlp.c_fc",
+     "transformer.h.5.mlp.w1",
+     "transformer.visual.transformer.resblocks.15.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.25.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.10.attn.out_proj",
+     "transformer.visual.transformer.resblocks.4.mlp.c_fc",
+     "transformer.h.31.mlp.w2",
+     "transformer.visual.transformer.resblocks.37.attn.out_proj",
+     "transformer.h.8.attn.c_proj",
+     "transformer.h.29.attn.c_attn",
+     "transformer.visual.transformer.resblocks.24.mlp.c_proj",
+     "transformer.h.19.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.11.attn.out_proj",
+     "transformer.h.13.mlp.c_proj",
+     "transformer.h.27.mlp.c_proj",
+     "transformer.h.31.mlp.w1",
+     "transformer.visual.transformer.resblocks.7.mlp.c_proj",
+     "transformer.h.28.mlp.w2",
+     "transformer.visual.transformer.resblocks.3.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.13.attn.in_proj",
+     "transformer.h.21.attn.c_attn",
+     "transformer.visual.transformer.resblocks.23.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.33.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.42.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.3.attn.in_proj",
+     "transformer.h.13.mlp.w1",
+     "transformer.visual.transformer.resblocks.22.attn.out_proj",
+     "transformer.visual.transformer.resblocks.20.mlp.c_fc",
+     "transformer.h.26.mlp.w2",
+     "transformer.h.14.attn.c_attn",
+     "transformer.h.16.attn.c_proj",
+     "transformer.h.1.mlp.w1",
+     "transformer.visual.transformer.resblocks.21.attn.out_proj",
+     "transformer.visual.transformer.resblocks.39.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.4.attn.in_proj",
+     "transformer.h.29.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.12.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.14.attn.in_proj",
+     "transformer.h.28.attn.c_proj",
+     "transformer.h.18.mlp.w1",
+     "transformer.h.27.mlp.w2",
+     "transformer.h.18.attn.c_attn",
+     "transformer.visual.transformer.resblocks.33.attn.out_proj",
+     "transformer.h.5.mlp.w2",
+     "transformer.visual.transformer.resblocks.37.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.2.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.42.attn.out_proj",
+     "transformer.visual.transformer.resblocks.15.attn.in_proj",
+     "transformer.visual.transformer.resblocks.6.mlp.c_fc",
+     "transformer.h.13.mlp.w2",
+     "transformer.h.23.attn.c_proj",
+     "transformer.h.20.mlp.c_proj",
+     "transformer.h.14.mlp.w2",
+     "transformer.visual.transformer.resblocks.9.attn.in_proj",
+     "transformer.visual.transformer.resblocks.46.attn.in_proj",
+     "transformer.h.9.attn.c_attn",
+     "transformer.visual.transformer.resblocks.36.mlp.c_proj",
+     "transformer.h.31.attn.c_proj",
+     "transformer.visual.transformer.resblocks.19.mlp.c_fc",
+     "transformer.h.17.mlp.w1",
+     "transformer.h.2.attn.c_proj",
+     "transformer.visual.transformer.resblocks.47.attn.in_proj",
+     "transformer.visual.transformer.resblocks.45.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.46.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.27.attn.in_proj",
+     "transformer.visual.transformer.resblocks.26.attn.out_proj",
+     "transformer.h.22.attn.c_proj",
+     "transformer.visual.transformer.resblocks.40.attn.out_proj",
+     "transformer.visual.transformer.resblocks.46.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.18.attn.out_proj",
+     "transformer.h.27.attn.c_proj",
+     "transformer.visual.transformer.resblocks.26.attn.in_proj",
+     "transformer.h.4.mlp.w1",
+     "transformer.h.10.attn.c_proj",
+     "transformer.h.6.attn.c_attn",
+     "transformer.h.2.attn.c_attn",
+     "transformer.h.22.mlp.w1",
+     "transformer.visual.transformer.resblocks.39.mlp.c_fc",
+     "transformer.h.8.mlp.w2",
+     "transformer.h.4.attn.c_attn",
+     "transformer.h.26.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.29.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.5.mlp.c_proj",
+     "transformer.h.11.mlp.c_proj",
+     "transformer.h.0.mlp.w2",
+     "transformer.visual.transformer.resblocks.36.attn.out_proj",
+     "transformer.h.29.mlp.w1",
+     "transformer.h.12.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.2.attn.in_proj",
+     "transformer.visual.transformer.resblocks.2.mlp.c_fc",
+     "transformer.h.25.attn.c_attn",
+     "transformer.visual.transformer.resblocks.19.attn.in_proj",
+     "transformer.visual.transformer.resblocks.43.attn.out_proj",
+     "transformer.visual.transformer.resblocks.35.attn.out_proj",
+     "transformer.h.22.attn.c_attn",
+     "transformer.h.0.mlp.w1",
+     "transformer.h.3.attn.c_attn",
+     "transformer.h.28.attn.c_attn",
+     "transformer.visual.transformer.resblocks.25.attn.in_proj",
+     "transformer.visual.transformer.resblocks.34.attn.out_proj",
+     "transformer.h.21.attn.c_proj",
+     "transformer.h.6.attn.c_proj",
+     "transformer.visual.transformer.resblocks.11.mlp.c_proj",
+     "transformer.h.13.attn.c_attn",
+     "transformer.visual.transformer.resblocks.38.attn.out_proj",
+     "transformer.h.3.attn.c_proj",
+     "transformer.visual.transformer.resblocks.17.mlp.c_fc",
+     "transformer.h.26.mlp.w1",
+     "transformer.visual.transformer.resblocks.36.mlp.c_fc",
+     "transformer.h.26.attn.c_attn",
+     "transformer.visual.transformer.resblocks.29.attn.out_proj",
+     "transformer.h.7.mlp.w1",
+     "transformer.visual.transformer.resblocks.40.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.9.attn.out_proj",
+     "transformer.h.3.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.26.mlp.c_fc",
+     "transformer.h.11.mlp.w2",
+     "transformer.visual.transformer.resblocks.33.attn.in_proj",
+     "transformer.visual.transformer.resblocks.42.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.32.attn.out_proj",
+     "transformer.h.4.attn.c_proj",
+     "transformer.visual.transformer.resblocks.27.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.11.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.25.attn.out_proj",
+     "transformer.visual.transformer.resblocks.23.attn.in_proj",
+     "transformer.h.5.attn.c_attn",
+     "transformer.h.16.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.14.mlp.c_proj",
+     "transformer.h.22.mlp.w2",
+     "transformer.h.25.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.10.mlp.c_fc",
+     "transformer.h.24.mlp.c_proj",
+     "transformer.h.19.mlp.w2",
+     "transformer.h.14.mlp.w1",
+     "transformer.visual.transformer.resblocks.40.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.28.attn.out_proj",
+     "transformer.visual.transformer.resblocks.24.mlp.c_fc",
+     "transformer.h.8.attn.c_attn",
+     "transformer.h.9.mlp.w1",
+     "transformer.h.6.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.19.attn.out_proj",
+     "transformer.visual.transformer.resblocks.32.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.7.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.44.attn.in_proj",
+     "transformer.visual.transformer.resblocks.34.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.9.mlp.c_fc",
+     "transformer.visual.conv1",
+     "transformer.visual.transformer.resblocks.8.attn.out_proj",
+     "transformer.h.23.mlp.w2",
+     "transformer.h.7.mlp.w2",
+     "transformer.h.24.attn.c_proj",
+     "transformer.h.30.attn.c_proj",
+     "transformer.h.29.attn.c_proj",
+     "transformer.visual.transformer.resblocks.9.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.35.attn.in_proj",
+     "transformer.visual.transformer.resblocks.21.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.41.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.38.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.13.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.41.attn.out_proj",
+     "transformer.visual.transformer.resblocks.16.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.45.attn.out_proj",
+     "transformer.h.11.mlp.w1",
+     "transformer.visual.transformer.resblocks.16.attn.in_proj",
+     "transformer.visual.transformer.resblocks.47.attn.out_proj",
+     "transformer.h.9.attn.c_proj",
+     "transformer.h.31.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.12.attn.in_proj",
+     "transformer.visual.transformer.resblocks.28.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.20.attn.out_proj",
+     "transformer.h.12.attn.c_attn",
+     "transformer.h.24.mlp.w1",
+     "transformer.visual.transformer.resblocks.21.attn.in_proj",
+     "transformer.visual.transformer.resblocks.41.attn.in_proj",
+     "transformer.h.10.mlp.w1",
+     "transformer.h.1.mlp.w2",
+     "transformer.h.0.mlp.c_proj",
+     "transformer.h.22.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.18.attn.in_proj",
+     "transformer.visual.transformer.resblocks.38.mlp.c_proj",
+     "transformer.h.12.mlp.w1",
+     "transformer.h.1.attn.c_attn",
+     "transformer.visual.transformer.resblocks.31.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.44.mlp.c_proj",
+     "transformer.h.15.mlp.c_proj",
+     "transformer.h.6.mlp.w1",
+     "transformer.visual.transformer.resblocks.16.mlp.c_proj",
+     "transformer.h.13.attn.c_proj",
+     "transformer.h.15.attn.c_attn",
+     "transformer.h.15.mlp.w1",
+     "transformer.h.17.mlp.w2",
+     "transformer.visual.transformer.resblocks.10.attn.in_proj",
+     "transformer.h.26.attn.c_proj",
+     "transformer.visual.transformer.resblocks.20.attn.in_proj",
+     "transformer.h.10.mlp.w2",
+     "transformer.h.24.attn.c_attn",
+     "transformer.h.8.mlp.w1",
+     "transformer.h.23.mlp.w1",
+     "transformer.visual.transformer.resblocks.1.mlp.c_proj",
+     "transformer.h.4.mlp.w2",
+     "transformer.visual.transformer.resblocks.38.attn.in_proj",
+     "transformer.h.12.mlp.w2",
+     "transformer.h.7.attn.c_proj",
+     "transformer.h.4.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.31.attn.out_proj",
+     "transformer.visual.transformer.resblocks.17.mlp.c_proj",
+     "transformer.h.21.mlp.w2",
+     "transformer.visual.transformer.resblocks.5.attn.in_proj",
+     "transformer.h.18.attn.c_proj",
+     "transformer.visual.transformer.resblocks.31.mlp.c_fc",
+     "transformer.h.18.mlp.w2",
+     "transformer.visual.transformer.resblocks.6.attn.out_proj",
+     "transformer.visual.transformer.resblocks.8.attn.in_proj",
+     "transformer.visual.transformer.resblocks.30.mlp.c_proj",
+     "transformer.h.30.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.30.attn.out_proj",
+     "transformer.visual.transformer.resblocks.16.attn.out_proj",
+     "transformer.visual.transformer.resblocks.14.attn.out_proj",
+     "transformer.h.25.mlp.w1",
+     "transformer.visual.transformer.resblocks.45.attn.in_proj",
+     "transformer.h.11.attn.c_proj",
+     "transformer.visual.transformer.resblocks.30.attn.in_proj",
+     "transformer.visual.transformer.resblocks.43.mlp.c_proj",
+     "transformer.h.10.mlp.c_proj",
+     "transformer.h.21.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.43.attn.in_proj",
+     "transformer.visual.transformer.resblocks.3.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.44.attn.out_proj",
+     "transformer.h.23.attn.c_attn",
+     "transformer.visual.transformer.resblocks.22.attn.in_proj",
+     "transformer.visual.transformer.resblocks.6.attn.in_proj",
+     "transformer.visual.transformer.resblocks.44.mlp.c_fc",
+     "transformer.h.17.attn.c_attn",
+     "transformer.h.7.attn.c_attn",
+     "transformer.visual.transformer.resblocks.42.attn.in_proj",
+     "transformer.visual.transformer.resblocks.20.mlp.c_proj",
+     "transformer.h.8.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.17.attn.out_proj",
+     "transformer.h.14.attn.c_proj",
+     "transformer.visual.transformer.resblocks.40.attn.in_proj",
+     "transformer.h.25.attn.c_proj",
+     "transformer.h.28.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.35.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.36.attn.in_proj",
+     "transformer.visual.transformer.resblocks.41.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.14.mlp.c_fc",
+     "transformer.h.30.mlp.w2",
+     "transformer.h.20.mlp.w1",
+     "transformer.visual.transformer.resblocks.33.mlp.c_fc",
+     "transformer.h.29.mlp.w2",
+     "transformer.visual.transformer.resblocks.47.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.30.mlp.c_fc",
+     "transformer.h.10.attn.c_attn",
+     "transformer.visual.transformer.resblocks.1.attn.in_proj",
+     "transformer.h.1.attn.c_proj",
+     "transformer.visual.transformer.resblocks.8.mlp.c_proj",
+     "transformer.h.19.attn.c_proj",
+     "transformer.visual.transformer.resblocks.37.attn.in_proj",
+     "transformer.h.15.attn.c_proj",
+     "transformer.h.5.attn.c_proj",
+     "transformer.visual.transformer.resblocks.32.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.3.attn.out_proj",
+     "transformer.visual.transformer.resblocks.32.attn.in_proj",
+     "transformer.h.21.mlp.w1",
+     "transformer.h.23.mlp.c_proj",
+     "transformer.h.30.mlp.w1",
+     "transformer.h.0.attn.c_attn",
+     "transformer.visual.transformer.resblocks.24.attn.out_proj",
+     "transformer.visual.transformer.resblocks.31.attn.in_proj",
+     "transformer.h.18.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.25.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.22.mlp.c_fc",
+     "transformer.h.30.attn.c_attn",
+     "transformer.visual.transformer.resblocks.13.mlp.c_fc",
+     "transformer.h.17.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.24.attn.in_proj",
+     "transformer.h.11.attn.c_attn",
+     "transformer.h.2.mlp.w2",
+     "transformer.visual.transformer.resblocks.8.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.0.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.2.attn.out_proj",
+     "transformer.visual.transformer.resblocks.35.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.39.attn.out_proj",
+     "transformer.h.12.attn.c_proj",
+     "transformer.visual.transformer.resblocks.28.attn.in_proj",
+     "transformer.visual.transformer.resblocks.29.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.0.attn.out_proj",
+     "transformer.visual.transformer.resblocks.23.mlp.c_proj",
+     "transformer.h.20.attn.c_attn",
+     "transformer.visual.transformer.resblocks.7.attn.out_proj",
+     "transformer.visual.transformer.resblocks.15.attn.out_proj",
+     "transformer.h.7.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.1.attn.out_proj",
+     "transformer.h.3.mlp.w2",
+     "transformer.h.9.mlp.w2",
+     "transformer.visual.transformer.resblocks.34.attn.in_proj",
+     "transformer.h.27.attn.c_attn",
+     "transformer.visual.transformer.resblocks.12.mlp.c_fc",
+     "transformer.h.6.mlp.w2",
+     "transformer.visual.transformer.resblocks.39.attn.in_proj",
+     "transformer.h.15.mlp.w2",
+     "transformer.visual.transformer.resblocks.18.mlp.c_proj",
+     "transformer.h.0.attn.c_proj",
+     "transformer.h.19.attn.c_attn",
+     "transformer.visual.transformer.resblocks.27.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.23.attn.out_proj",
+     "transformer.h.14.mlp.c_proj",
+     "transformer.h.9.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.12.attn.out_proj",
+     "transformer.visual.transformer.resblocks.0.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.5.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.28.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.6.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.22.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.37.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.17.attn.in_proj",
+     "transformer.visual.transformer.resblocks.46.attn.out_proj",
+     "transformer.h.24.mlp.w2",
+     "transformer.h.27.mlp.w1",
+     "transformer.visual.transformer.resblocks.11.attn.in_proj",
+     "transformer.visual.transformer.resblocks.4.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.21.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.26.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.15.mlp.c_fc",
+     "transformer.h.2.mlp.c_proj",
+     "transformer.h.1.mlp.c_proj",
+     "transformer.h.5.mlp.c_proj",
+     "transformer.visual.transformer.resblocks.45.mlp.c_fc",
+     "transformer.visual.transformer.resblocks.0.attn.in_proj",
+     "transformer.h.25.mlp.w2",
+     "transformer.h.20.attn.c_proj",
+     "transformer.h.17.attn.c_proj",
+     "transformer.visual.transformer.resblocks.1.mlp.c_fc"
+   ],
+   "task_type": "CAUSAL_LM",
+   "use_dora": false,
+   "use_rslora": false
+ }
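For reference, a minimal sketch of how an adapter saved with this config could be attached to the base model through the peft library; the local checkpoint path is an assumption, not part of this upload:

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the frozen base model named in adapter_config.json.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", trust_remote_code=True, device_map="auto"
)
# Attach the LoRA weights from this checkpoint directory (hypothetical path).
model = PeftModel.from_pretrained(base, "checkpoint-1600")
model.eval()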
checkpoint-1600/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:80f99bd20b2a57ae180db378d4c2ad8777288d01fd71f21c7b258c2141ccd27c
+ size 469105640
checkpoint-1600/latest ADDED
@@ -0,0 +1 @@
+ global_step1600
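The latest file is DeepSpeed's pointer to the most recent optimizer-state directory (global_step1600); the bundled zero_to_fp32.py script reads it to consolidate the ZeRO shards into a single fp32 state dict. A minimal usage sketch, assuming it is run from inside the checkpoint directory:

python zero_to_fp32.py . pytorch_model.bin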
checkpoint-1600/qwen.tiktoken ADDED
The diff for this file is too large to render. See raw diff
checkpoint-1600/rng_state_0.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fa74b506d85700151c4e4c4f5c6adc63d055ed8ecb10bd6702453c61ca1d200b
+ size 15920
checkpoint-1600/rng_state_1.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dc196d48c3771157921ae2bef9abcc68219ad9aab60637928c27798c1a979dca
+ size 15920
checkpoint-1600/rng_state_2.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:94a45f5c84891190bab691174f3d23d0e4ba0525dd98afbfaa45c8a5faa2bb5e
+ size 15920
checkpoint-1600/rng_state_3.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8435d9f70042fb8d3d78a56558df657ae47801a72e408d1c47602693b6facda2
+ size 15920
checkpoint-1600/rng_state_4.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4521c9dde92d304631e1dcdcb52f6d8149f69ce405bd47f0cdd43efa2d2fb5bf
+ size 15920
checkpoint-1600/rng_state_5.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8b9e9e23c4202da095f27b94a494e52c7f529a7b81972744f1ee768dac1b8ca5
+ size 15920
checkpoint-1600/rng_state_6.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b81f03f8ec1fd50599c19d8224e60fe0ef15e2b9d856f9b1f5653703f7ad0408
+ size 15920
checkpoint-1600/rng_state_7.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7fd839ce13a82530b1a2d875e0a29bcf7ca4daa14fe5a49a2fc9f255a4be0688
+ size 15920
checkpoint-1600/scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b629c1752a3080bde72cb93dc63770861076540b3c0bc6419645c02a824c238f
+ size 1064
checkpoint-1600/special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "pad_token": "<|endoftext|>"
+ }
checkpoint-1600/tokenization_qwen.py ADDED
@@ -0,0 +1,598 @@
+ # Copyright (c) Alibaba Cloud.
+ #
+ # This source code is licensed under the license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """Tokenization classes for QWen."""
+
+ import base64
+ import logging
+ import os
+ import requests
+ import unicodedata
+ from typing import Collection, Dict, List, Set, Tuple, Union, Any, Callable, Optional
+
+ import tiktoken
+ import numpy as np
+ from PIL import Image
+ from PIL import ImageFont
+ from PIL import ImageDraw
+ from transformers import PreTrainedTokenizer, AddedToken
+ from transformers.utils import try_to_load_from_cache
+
+ import matplotlib.colors as mcolors
+ from matplotlib.font_manager import FontProperties
+
+ logger = logging.getLogger(__name__)
+
+
+ VOCAB_FILES_NAMES = {"vocab_file": "qwen.tiktoken", "ttf": "SimSun.ttf"}
+ FONT_PATH = try_to_load_from_cache("Qwen/Qwen-VL-Chat", "SimSun.ttf")
+ if FONT_PATH is None:
+     if not os.path.exists("SimSun.ttf"):
+         ttf = requests.get("https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/SimSun.ttf")
+         open("SimSun.ttf", "wb").write(ttf.content)
+     FONT_PATH = "SimSun.ttf"
+
+ PAT_STR = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
+ ENDOFTEXT = "<|endoftext|>"
+ IMSTART = "<|im_start|>"
+ IMEND = "<|im_end|>"
+ # as the default behavior is changed to allow special tokens in
+ # regular texts, the surface forms of special tokens need to be
+ # as different as possible to minimize the impact
+ EXTRAS = tuple((f"<|extra_{i}|>" for i in range(205)))
+ SPECIAL_TOKENS = (
+     ENDOFTEXT,
+     IMSTART,
+     IMEND,
+ ) + EXTRAS
+ IMG_TOKEN_SPAN = 256
+
+
+ def _load_tiktoken_bpe(tiktoken_bpe_file: str) -> Dict[bytes, int]:
+     with open(tiktoken_bpe_file, "rb") as f:
+         contents = f.read()
+     return {
+         base64.b64decode(token): int(rank)
+         for token, rank in (line.split() for line in contents.splitlines() if line)
+     }
+
+ def _list_find(
+     input_list: List[Any],
+     candidates: Tuple[Any],
+     start: int = 0,
+ ):
+     for i in range(start, len(input_list)):
+         if input_list[i] in candidates:
+             return i
+     return -1
+
+ def _replace_closed_tag(
+     input_tokens: List[Any],
+     start_tags: Union[Any, Tuple[Any]],
+     end_tags: Union[Any, Tuple[Any]],
+     inclusive_replace_func: Callable,
+     exclusive_replace_func: Callable = lambda x: x,
+ ):
+     if isinstance(start_tags, (str, int)):
+         start_tags = (start_tags,)
+     if isinstance(end_tags, (str, int)):
+         end_tags = (end_tags,)
+     assert len(start_tags) == len(end_tags)
+
+     output_tokens = []
+     end = 0
+     while True:
+         start = _list_find(input_tokens, start_tags, end)
+         if start == -1:
+             break
+         output_tokens.extend(exclusive_replace_func(input_tokens[end : start]))
+         tag_idx = start_tags.index(input_tokens[start])
+         end = _list_find(input_tokens, (end_tags[tag_idx],), start)
+         if end == -1:
+             raise ValueError("Unclosed image token")
+         output_tokens.extend(inclusive_replace_func(input_tokens[start : end + 1]))
+         end += 1
+     output_tokens.extend(exclusive_replace_func(input_tokens[end : ]))
+     return output_tokens
+
+ class QWenTokenizer(PreTrainedTokenizer):
+     """QWen tokenizer."""
+
+     vocab_files_names = VOCAB_FILES_NAMES
+
+     def __init__(
+         self,
+         vocab_file,
+         errors="replace",
+         image_start_tag='<img>',
+         image_end_tag='</img>',
+         image_pad_tag='<imgpad>',
+         ref_start_tag='<ref>',
+         ref_end_tag='</ref>',
+         box_start_tag='<box>',
+         box_end_tag='</box>',
+         quad_start_tag='<quad>',
+         quad_end_tag='</quad>',
+         **kwargs,
+     ):
+         self.image_start_tag = image_start_tag
+         self.image_end_tag = image_end_tag
+         self.image_pad_tag = image_pad_tag
+         self.ref_start_tag = ref_start_tag
+         self.ref_end_tag = ref_end_tag
+         self.box_start_tag = box_start_tag
+         self.box_end_tag = box_end_tag
+         self.quad_start_tag = quad_start_tag
+         self.quad_end_tag = quad_end_tag
+         self.IMAGE_ST = (
+             ref_start_tag, ref_end_tag,
+             box_start_tag, box_end_tag,
+             quad_start_tag, quad_end_tag,
+             image_start_tag, image_end_tag,
+             image_pad_tag
+         )
+         super().__init__(**kwargs)
+
+         self.errors = errors  # how to handle errors in decoding
+
+         self.mergeable_ranks = _load_tiktoken_bpe(vocab_file)  # type: dict[bytes, int]
+         self.special_tokens = {
+             token: index
+             for index, token in enumerate(
+                 SPECIAL_TOKENS + self.IMAGE_ST, start=len(self.mergeable_ranks)
+             )
+         }
+         self.img_start_id = self.special_tokens[self.image_start_tag]
+         self.img_end_id = self.special_tokens[self.image_end_tag]
+         self.img_pad_id = self.special_tokens[self.image_pad_tag]
+         self.ref_start_id = self.special_tokens[self.ref_start_tag]
+         self.ref_end_id = self.special_tokens[self.ref_end_tag]
+         self.box_start_id = self.special_tokens[self.box_start_tag]
+         self.box_end_id = self.special_tokens[self.box_end_tag]
+         self.quad_start_id = self.special_tokens[self.quad_start_tag]
+         self.quad_end_id = self.special_tokens[self.quad_end_tag]
+         self.image_special_tokens = set([
+             self.ref_start_id, self.ref_end_id, self.box_start_id, self.box_end_id,
+             self.quad_start_id, self.quad_end_id,
+         ])
+
+         enc = tiktoken.Encoding(
+             "Qwen",
+             pat_str=PAT_STR,
+             mergeable_ranks=self.mergeable_ranks,
+             special_tokens=self.special_tokens,
+         )
+         assert (
+             len(self.mergeable_ranks) + len(self.special_tokens) == enc.n_vocab
+         ), f"{len(self.mergeable_ranks) + len(self.special_tokens)} != {enc.n_vocab} in encoding"
+
+         self.decoder = {
+             v: k for k, v in self.mergeable_ranks.items()
+         }  # type: dict[int, bytes|str]
+         self.decoder.update({v: k for k, v in self.special_tokens.items()})
+
+         self.tokenizer = enc  # type: tiktoken.Encoding
+
+         self.eod_id = self.tokenizer.eot_token
+         self.im_start_id = self.special_tokens[IMSTART]
+         self.im_end_id = self.special_tokens[IMEND]
+
+     def __getstate__(self):
+         # for pickle lovers
+         state = self.__dict__.copy()
+         del state['tokenizer']
+         return state
+
+     def __setstate__(self, state):
+         # tokenizer is not python native; don't pass it; rebuild it
+         self.__dict__.update(state)
+         enc = tiktoken.Encoding(
+             "Qwen",
+             pat_str=PAT_STR,
+             mergeable_ranks=self.mergeable_ranks,
+             special_tokens=self.special_tokens,
+         )
+         self.tokenizer = enc
+
+
+     def __len__(self) -> int:
+         return self.tokenizer.n_vocab
+
+     def get_vocab(self) -> Dict[bytes, int]:
+         return self.mergeable_ranks
+
+     def convert_tokens_to_ids(
+         self, tokens: Union[bytes, str, List[Union[bytes, str]]]
+     ) -> List[int]:
+         ids = []
+         if isinstance(tokens, (str, bytes)):
+             if tokens in self.special_tokens:
+                 return self.special_tokens[tokens]
+             else:
+                 return self.mergeable_ranks.get(tokens)
+         for token in tokens:
+             if token in self.special_tokens:
+                 ids.append(self.special_tokens[token])
+             else:
+                 ids.append(self.mergeable_ranks.get(token))
+         return ids
+
+     def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_tokens: bool = False) -> int:
+         if not special_tokens and new_tokens:
+             raise ValueError('Adding regular tokens is not supported')
+         for token in new_tokens:
+             surface_form = token.content if isinstance(token, AddedToken) else token
+             if surface_form not in SPECIAL_TOKENS + self.IMAGE_ST:
+                 raise ValueError('Adding unknown special tokens is not supported')
+         return 0
+
+     def save_vocabulary(self, save_directory: str, **kwargs) -> Tuple[str]:
+         """
+         Save only the vocabulary of the tokenizer.
+
+         Returns:
+             `Tuple(str)`: Paths to the files saved.
+         """
+         file_path = os.path.join(save_directory, "qwen.tiktoken")
+         with open(file_path, "w", encoding="utf8") as w:
+             for k, v in self.mergeable_ranks.items():
+                 line = base64.b64encode(k).decode("utf8") + " " + str(v) + "\n"
+                 w.write(line)
+         return (file_path,)
+
+     def tokenize(
+         self,
+         text: str,
+         allowed_special: Union[Set, str] = "all",
+         disallowed_special: Union[Collection, str] = (),
+         **kwargs,
+     ) -> List[Union[bytes, str]]:
+         """
+         Converts a string into a sequence of tokens.
+
+         Args:
+             text (`str`):
+                 The sequence to be encoded.
+             allowed_special (`Literal["all"]` or `set`):
+                 The surface forms of the tokens to be encoded as special tokens in regular texts.
+                 Defaults to "all".
+             disallowed_special (`Literal["all"]` or `Collection`):
+                 The surface forms of the tokens that should not appear in regular texts and should trigger errors.
+                 Defaults to an empty tuple.
+
+             kwargs (additional keyword arguments, *optional*):
+                 Will be passed to the underlying model specific encode method.
+
+         Returns:
+             `List[bytes|str]`: The list of tokens.
+         """
+         tokens = []
+         text = unicodedata.normalize("NFC", text)
+
+         # this implementation takes a detour: text -> token id -> token surface forms
+         for t in self.tokenizer.encode(
+             text, allowed_special=allowed_special, disallowed_special=disallowed_special
+         ):
+             tokens.append(self.decoder[t])
+
+         def _encode_imgurl(img_tokens):
+             assert img_tokens[0] == self.image_start_tag and img_tokens[-1] == self.image_end_tag
+             img_tokens = img_tokens[1:-1]
+             img_url = b''.join(img_tokens)
+             out_img_tokens = list(map(self.decoder.get, img_url))
+             if len(out_img_tokens) > IMG_TOKEN_SPAN:
+                 raise ValueError("The content in {}..{} is too long".format(
+                     self.image_start_tag, self.image_end_tag))
+             out_img_tokens.extend([self.image_pad_tag] * (IMG_TOKEN_SPAN - len(out_img_tokens)))
+             out_img_tokens = [self.image_start_tag] + out_img_tokens + [self.image_end_tag]
+             return out_img_tokens
+
+         return _replace_closed_tag(tokens, self.image_start_tag, self.image_end_tag, _encode_imgurl)
+
+     def convert_tokens_to_string(self, tokens: List[Union[bytes, str]]) -> str:
+         """
+         Converts a sequence of tokens into a single string.
+         """
+         text = ""
+         temp = b""
+         for t in tokens:
+             if isinstance(t, str):
+                 if temp:
+                     text += temp.decode("utf-8", errors=self.errors)
+                     temp = b""
+                 text += t
+             elif isinstance(t, bytes):
+                 temp += t
+             else:
+                 raise TypeError("token should only be of type bytes or str")
+         if temp:
+             text += temp.decode("utf-8", errors=self.errors)
+         return text
+
+     @property
+     def vocab_size(self):
+         return self.tokenizer.n_vocab
+
+     def _convert_id_to_token(self, index: int) -> Union[bytes, str]:
+         """Converts an id to a token, special tokens included"""
+         if index in self.decoder:
+             return self.decoder[index]
+         raise ValueError("unknown ids")
+
+     def _convert_token_to_id(self, token: Union[bytes, str]) -> int:
+         """Converts a token to an id using the vocab, special tokens included"""
+         if token in self.special_tokens:
+             return self.special_tokens[token]
+         if token in self.mergeable_ranks:
+             return self.mergeable_ranks[token]
+         raise ValueError("unknown token")
+
+     def _tokenize(self, text: str, **kwargs):
+         """
+         Converts a string into a sequence of tokens (string), using the tokenizer. Split in words for word-based
+         vocabulary or sub-words for sub-word-based vocabularies (BPE/SentencePieces/WordPieces).
+
+         Do NOT take care of added tokens.
+         """
+         raise NotImplementedError
+
+     def _decode(
+         self,
+         token_ids: Union[int, List[int]],
+         skip_special_tokens: bool = False,
+         errors: str = None,
+         **kwargs,
+     ) -> str:
+         if isinstance(token_ids, int):
+             token_ids = [token_ids]
+
+         def _decode_imgurl(img_token_ids):
+             assert img_token_ids[0] == self.img_start_id and img_token_ids[-1] == self.img_end_id
+             img_token_ids = img_token_ids[1:-1]
+             img_token_ids = img_token_ids[ : img_token_ids.index(self.img_pad_id)]
+             img_url = bytes(img_token_ids).decode('utf-8')
+             return [self.img_start_id] + self.tokenizer.encode(img_url) + [self.img_end_id]
+
+         token_ids = _replace_closed_tag(token_ids, self.img_start_id, self.img_end_id, _decode_imgurl)
+
+         if skip_special_tokens:
+             if kwargs.get('keep_image_special', False):
+                 token_ids = [i for i in token_ids if i < self.eod_id
+                              or i in self.image_special_tokens]
+             else:
+                 token_ids = [i for i in token_ids if i < self.eod_id]
+         return self.tokenizer.decode(token_ids, errors=errors or self.errors)
+
+     def to_list_format(self, text: str):
+         text = unicodedata.normalize("NFC", text)
+         token_ids = self.tokenizer.encode(
+             text, allowed_special=set(self.IMAGE_ST + (ENDOFTEXT,)))
+
+         def _encode_vl_info(tokens):
+             if len(tokens) == 0:
+                 return []
+             if tokens[0] == self.img_start_id and tokens[-1] == self.img_end_id:
+                 key = 'image'
+             elif tokens[0] == self.ref_start_id and tokens[-1] == self.ref_end_id:
+                 key = 'ref'
+             elif tokens[0] == self.box_start_id and tokens[-1] == self.box_end_id:
+                 key = 'box'
+             elif tokens[0] == self.quad_start_id and tokens[-1] == self.quad_end_id:
+                 key = 'quad'
+             else:
+                 _tobytes = lambda x: x.encode('utf-8') if isinstance(x, str) else x
+                 return [{'text': b''.join(map(_tobytes, map(self.decoder.get, tokens))).decode('utf-8')}]
+             _tobytes = lambda x: x.encode('utf-8') if isinstance(x, str) else x
+             val = b''.join(map(_tobytes, map(self.decoder.get, tokens[1:-1]))).decode('utf-8')
+             return [{key: val}]
+
+         return _replace_closed_tag(
+             token_ids,
+             (self.img_start_id, self.ref_start_id, self.box_start_id, self.quad_start_id),
+             (self.img_end_id, self.ref_end_id, self.box_end_id, self.quad_end_id),
+             _encode_vl_info,
+             _encode_vl_info,
+         )
+
+     def from_list_format(self, list_format: List[Dict]):
+         text = ''
+         num_images = 0
+         for ele in list_format:
+             if 'image' in ele:
+                 num_images += 1
+                 text += f'Picture {num_images}: '
+                 text += self.image_start_tag + ele['image'] + self.image_end_tag
+                 text += '\n'
+             elif 'text' in ele:
+                 text += ele['text']
+             elif 'box' in ele:
+                 if 'ref' in ele:
+                     text += self.ref_start_tag + ele['ref'] + self.ref_end_tag
+                 for box in ele['box']:
+                     text += self.box_start_tag + '(%d,%d),(%d,%d)' % (box[0], box[1], box[2], box[3]) + self.box_end_tag
+             else:
+                 raise ValueError("Unsupported element: " + str(ele))
+         return text
+
+     def _fetch_latest_picture(self, response, history):
+         if history is None:
+             history = []
+         _history = history + [(response, None)]
+         for q, r in _history[::-1]:
+             for ele in self.to_list_format(q)[::-1]:
+                 if 'image' in ele:
+                     return ele['image']
+         return None
+
+     def _fetch_all_box_with_ref(self, text):
+         list_format = self.to_list_format(text)
+         output = []
+         for i, ele in enumerate(list_format):
+             if 'box' in ele:
+                 bbox = tuple(map(int, ele['box'].replace('(', '').replace(')', '').split(',')))
+                 assert len(bbox) == 4
+                 output.append({'box': bbox})
+                 if i > 0 and 'ref' in list_format[i-1]:
+                     output[-1]['ref'] = list_format[i-1]['ref'].strip()
+         return output
+
+     def draw_bbox_on_latest_picture(
+         self,
+         response,
+         history=None,
+     ) -> Optional[Image.Image]:
+         image = self._fetch_latest_picture(response, history)
+         if image is None:
+             return None
+         if image.startswith("http://") or image.startswith("https://"):
+             image = Image.open(requests.get(image, stream=True).raw).convert("RGB")
+             h, w = image.height, image.width
+         else:
+             image = np.asarray(Image.open(image).convert("RGB"))
+             h, w = image.shape[0], image.shape[1]
+         visualizer = Visualizer(image)
+
+         boxes = self._fetch_all_box_with_ref(response)
+         if not boxes:
+             return None
+         color = random.choice([_ for _ in mcolors.TABLEAU_COLORS.keys()])  # init color
+         for box in boxes:
+             if 'ref' in box:  # random new color for new refexps
+                 color = random.choice([_ for _ in mcolors.TABLEAU_COLORS.keys()])
+             x1, y1, x2, y2 = box['box']
+             x1, y1, x2, y2 = (int(x1 / 1000 * w), int(y1 / 1000 * h), int(x2 / 1000 * w), int(y2 / 1000 * h))
+             visualizer.draw_box((x1, y1, x2, y2), alpha=1, edge_color=color)
+             if 'ref' in box:
+                 visualizer.draw_text(box['ref'], (x1, y1), color=color, horizontal_alignment="left")
+         return visualizer.output
+
+
+ import colorsys
+ import logging
+ import math
+ import numpy as np
+ import matplotlib as mpl
+ import matplotlib.colors as mplc
+ import matplotlib.figure as mplfigure
+ import torch
+ from matplotlib.backends.backend_agg import FigureCanvasAgg
+ from PIL import Image
+ import random
+
+ logger = logging.getLogger(__name__)
+
+
+ class VisImage:
+     def __init__(self, img, scale=1.0):
+         self.img = img
+         self.scale = scale
+         self.width, self.height = img.shape[1], img.shape[0]
+         self._setup_figure(img)
+
+     def _setup_figure(self, img):
+         fig = mplfigure.Figure(frameon=False)
+         self.dpi = fig.get_dpi()
+         # add a small 1e-2 to avoid precision lost due to matplotlib's truncation
+         # (https://github.com/matplotlib/matplotlib/issues/15363)
+         fig.set_size_inches(
+             (self.width * self.scale + 1e-2) / self.dpi,
+             (self.height * self.scale + 1e-2) / self.dpi,
+         )
+         self.canvas = FigureCanvasAgg(fig)
+         # self.canvas = mpl.backends.backend_cairo.FigureCanvasCairo(fig)
+         ax = fig.add_axes([0.0, 0.0, 1.0, 1.0])
+         ax.axis("off")
+         self.fig = fig
+         self.ax = ax
+         self.reset_image(img)
+
+     def reset_image(self, img):
+         img = img.astype("uint8")
+         self.ax.imshow(img, extent=(0, self.width, self.height, 0), interpolation="nearest")
+
+     def save(self, filepath):
+         self.fig.savefig(filepath)
+
+     def get_image(self):
+         canvas = self.canvas
+         s, (width, height) = canvas.print_to_buffer()
+
+         buffer = np.frombuffer(s, dtype="uint8")
+
+         img_rgba = buffer.reshape(height, width, 4)
+         rgb, alpha = np.split(img_rgba, [3], axis=2)
+         return rgb.astype("uint8")
+
+
+ class Visualizer:
+     def __init__(self, img_rgb, metadata=None, scale=1.0):
+         self.img = np.asarray(img_rgb).clip(0, 255).astype(np.uint8)
+         self.font_path = FONT_PATH
+         self.output = VisImage(self.img, scale=scale)
+         self.cpu_device = torch.device("cpu")
+
+         # too small texts are useless, therefore clamp to a minimum
+         self._default_font_size = max(
+             np.sqrt(self.output.height * self.output.width) // 30, 15 // scale
+         )
+
+     def draw_text(
+         self,
+         text,
+         position,
+         *,
+         font_size=None,
+         color="g",
+         horizontal_alignment="center",
+         rotation=0,
+     ):
+         if not font_size:
+             font_size = self._default_font_size
+
+         # since the text background is dark, we don't want the text to be dark
+         color = np.maximum(list(mplc.to_rgb(color)), 0.2)
+         color[np.argmax(color)] = max(0.8, np.max(color))
+
+         x, y = position
+         self.output.ax.text(
+             x,
+             y,
+             text,
+             size=font_size * self.output.scale,
+             fontproperties=FontProperties(fname=self.font_path),
+             bbox={"facecolor": "black", "alpha": 0.8, "pad": 0.7, "edgecolor": "none"},
+             verticalalignment="top",
+             horizontalalignment=horizontal_alignment,
+             color=color,
+             zorder=10,
+             rotation=rotation,
+         )
+         return self.output
+
+     def draw_box(self, box_coord, alpha=0.5, edge_color="g", line_style="-"):
+
+         x0, y0, x1, y1 = box_coord
+         width = x1 - x0
+         height = y1 - y0
+
+         linewidth = max(self._default_font_size / 4, 1)
+
+         self.output.ax.add_patch(
+             mpl.patches.Rectangle(
+                 (x0, y0),
+                 width,
+                 height,
+                 fill=False,
+                 edgecolor=edge_color,
+                 linewidth=linewidth * self.output.scale,
+                 alpha=alpha,
+                 linestyle=line_style,
+             )
+         )
+         return self.output
+
+     def get_output(self):
+
+         return self.output
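A minimal usage sketch for the multimodal helpers defined above; the image path and prompt text are assumptions, not part of this upload:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
# from_list_format builds the <img>...</img> tagged prompt this tokenizer expects.
query = tokenizer.from_list_format([
    {"image": "demo.jpeg"},  # hypothetical local image path
    {"text": "Describe the picture."},
])
input_ids = tokenizer(query, return_tensors="pt").input_ids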
checkpoint-1600/tokenizer_config.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "added_tokens_decoder": {},
+   "auto_map": {
+     "AutoTokenizer": [
+       "Qwen/Qwen-VL-Chat--tokenization_qwen.QWenTokenizer",
+       null
+     ]
+   },
+   "clean_up_tokenization_spaces": true,
+   "model_max_length": 768,
+   "pad_token": "<|endoftext|>",
+   "padding_side": "right",
+   "tokenizer_class": "QWenTokenizer"
+ }
checkpoint-1600/trainer_state.json ADDED
@@ -0,0 +1,1153 @@
+ {
+   "best_metric": null,
+   "best_model_checkpoint": null,
+   "epoch": 0.10441819487045617,
+   "eval_steps": 500,
+   "global_step": 1600,
+   "is_hyper_param_search": false,
+   "is_local_process_zero": true,
+   "is_world_process_zero": true,
+   "log_history": [
+     {
+       "epoch": 0.0006526137179403511,
+       "grad_norm": 17.690582114691438,
+       "learning_rate": 1.948051948051948e-06,
+       "loss": 1.3559,
+       "step": 10
+     },
+     {
+       "epoch": 0.0013052274358807021,
+       "grad_norm": 7.768088366444893,
+       "learning_rate": 3.896103896103896e-06,
+       "loss": 1.2706,
+       "step": 20
+     },
+     {
+       "epoch": 0.001957841153821053,
+       "grad_norm": 7.705313536090087,
+       "learning_rate": 5.844155844155845e-06,
+       "loss": 1.3781,
+       "step": 30
+     },
+     {
+       "epoch": 0.0026104548717614043,
+       "grad_norm": 34.39078827766783,
+       "learning_rate": 7.792207792207792e-06,
+       "loss": 1.2749,
+       "step": 40
+     },
+     {
+       "epoch": 0.0032630685897017554,
+       "grad_norm": 68.28824334896528,
+       "learning_rate": 9.74025974025974e-06,
+       "loss": 1.2955,
+       "step": 50
+     },
+     {
+       "epoch": 0.003915682307642106,
+       "grad_norm": 14.220322607917241,
+       "learning_rate": 1.168831168831169e-05,
+       "loss": 1.2315,
+       "step": 60
+     },
+     {
+       "epoch": 0.0045682960255824575,
+       "grad_norm": 12.611848231734811,
+       "learning_rate": 1.3636363636363637e-05,
+       "loss": 1.0953,
+       "step": 70
+     },
+     {
+       "epoch": 0.0052209097435228086,
+       "grad_norm": 6.055664298727015,
+       "learning_rate": 1.5584415584415583e-05,
+       "loss": 1.105,
+       "step": 80
+     },
+     {
+       "epoch": 0.00587352346146316,
+       "grad_norm": 3.52269227801977,
+       "learning_rate": 1.753246753246753e-05,
+       "loss": 0.9563,
+       "step": 90
+     },
+     {
+       "epoch": 0.006526137179403511,
+       "grad_norm": 10.771884023354394,
+       "learning_rate": 1.948051948051948e-05,
+       "loss": 0.9523,
+       "step": 100
+     },
+     {
+       "epoch": 0.007178750897343862,
+       "grad_norm": 33.41476483216757,
+       "learning_rate": 2.1428571428571428e-05,
+       "loss": 0.832,
+       "step": 110
+     },
+     {
+       "epoch": 0.007831364615284213,
+       "grad_norm": 31.120240364617406,
+       "learning_rate": 2.337662337662338e-05,
+       "loss": 0.8376,
+       "step": 120
+     },
+     {
+       "epoch": 0.008483978333224564,
+       "grad_norm": 5.517231564060886,
+       "learning_rate": 2.5324675324675325e-05,
+       "loss": 0.8293,
+       "step": 130
+     },
+     {
+       "epoch": 0.009136592051164915,
+       "grad_norm": 4.311605388342058,
+       "learning_rate": 2.7272727272727273e-05,
+       "loss": 0.8295,
+       "step": 140
+     },
+     {
+       "epoch": 0.009789205769105266,
+       "grad_norm": 6.997724163121519,
+       "learning_rate": 2.922077922077922e-05,
+       "loss": 0.7662,
+       "step": 150
+     },
+     {
+       "epoch": 0.010441819487045617,
+       "grad_norm": 6.517836234400708,
+       "learning_rate": 2.999998841890695e-05,
+       "loss": 0.8158,
+       "step": 160
+     },
+     {
+       "epoch": 0.011094433204985968,
+       "grad_norm": 4.186989141019666,
+       "learning_rate": 2.99999176456253e-05,
+       "loss": 0.8037,
+       "step": 170
+     },
+     {
+       "epoch": 0.01174704692292632,
+       "grad_norm": 5.181546943355458,
+       "learning_rate": 2.9999782533305785e-05,
+       "loss": 0.7274,
+       "step": 180
+     },
+     {
+       "epoch": 0.01239966064086667,
+       "grad_norm": 3.767076521211455,
+       "learning_rate": 2.9999583082527935e-05,
+       "loss": 0.7474,
+       "step": 190
+     },
+     {
+       "epoch": 0.013052274358807021,
+       "grad_norm": 18.84416377940188,
+       "learning_rate": 2.999931929414726e-05,
+       "loss": 0.7708,
+       "step": 200
+     },
+     {
+       "epoch": 0.013704888076747372,
+       "grad_norm": 3.169160630444992,
+       "learning_rate": 2.999899116929522e-05,
+       "loss": 0.8279,
+       "step": 210
+     },
+     {
+       "epoch": 0.014357501794687724,
+       "grad_norm": 1.912782077307437,
+       "learning_rate": 2.999859870937924e-05,
+       "loss": 0.7407,
+       "step": 220
+     },
+     {
+       "epoch": 0.015010115512628075,
+       "grad_norm": 3.3906505952914974,
+       "learning_rate": 2.9998141916082696e-05,
+       "loss": 0.7732,
+       "step": 230
+     },
+     {
+       "epoch": 0.015662729230568426,
+       "grad_norm": 2.7144492322383584,
+       "learning_rate": 2.999762079136491e-05,
+       "loss": 0.7272,
+       "step": 240
+     },
+     {
+       "epoch": 0.01631534294850878,
+       "grad_norm": 7.109330196029837,
+       "learning_rate": 2.9997035337461135e-05,
+       "loss": 0.7748,
+       "step": 250
+     },
+     {
+       "epoch": 0.016967956666449128,
+       "grad_norm": 1.6054280593801813,
+       "learning_rate": 2.9996385556882555e-05,
+       "loss": 0.7676,
+       "step": 260
+     },
+     {
+       "epoch": 0.01762057038438948,
+       "grad_norm": 10.883212441614672,
+       "learning_rate": 2.9995671452416274e-05,
+       "loss": 0.735,
+       "step": 270
+     },
+     {
+       "epoch": 0.01827318410232983,
+       "grad_norm": 3.511064886507805,
+       "learning_rate": 2.999489302712529e-05,
+       "loss": 0.7741,
+       "step": 280
+     },
+     {
+       "epoch": 0.018925797820270183,
+       "grad_norm": 3.618603818375307,
+       "learning_rate": 2.9994050284348497e-05,
+       "loss": 0.749,
+       "step": 290
+     },
+     {
+       "epoch": 0.019578411538210532,
+       "grad_norm": 6.012944880342178,
+       "learning_rate": 2.9993143227700668e-05,
+       "loss": 0.7411,
+       "step": 300
+     },
+     {
+       "epoch": 0.020231025256150885,
+       "grad_norm": 2.348670372295822,
+       "learning_rate": 2.9992171861072428e-05,
+       "loss": 0.7394,
+       "step": 310
+     },
+     {
+       "epoch": 0.020883638974091234,
+       "grad_norm": 4.728309497649916,
+       "learning_rate": 2.9991136188630263e-05,
+       "loss": 0.8077,
+       "step": 320
+     },
+     {
+       "epoch": 0.021536252692031587,
+       "grad_norm": 15.611917863290122,
+       "learning_rate": 2.9990036214816467e-05,
+       "loss": 0.7209,
+       "step": 330
+     },
+     {
+       "epoch": 0.022188866409971936,
+       "grad_norm": 3.7315277354070817,
+       "learning_rate": 2.998887194434916e-05,
+       "loss": 0.7101,
+       "step": 340
+     },
+     {
+       "epoch": 0.02284148012791229,
+       "grad_norm": 6.618759094750745,
+       "learning_rate": 2.998764338222222e-05,
+       "loss": 0.7759,
+       "step": 350
+     },
+     {
+       "epoch": 0.02349409384585264,
+       "grad_norm": 6.770044306239603,
+       "learning_rate": 2.998635053370533e-05,
+       "loss": 0.7398,
+       "step": 360
+     },
+     {
+       "epoch": 0.02414670756379299,
+       "grad_norm": 12.471224202357552,
+       "learning_rate": 2.998499340434389e-05,
+       "loss": 0.7046,
+       "step": 370
+     },
+     {
+       "epoch": 0.02479932128173334,
+       "grad_norm": 4.147359416986547,
+       "learning_rate": 2.9983571999959013e-05,
+       "loss": 0.761,
+       "step": 380
+     },
+     {
+       "epoch": 0.025451934999673693,
+       "grad_norm": 34.84722866603778,
+       "learning_rate": 2.9982086326647533e-05,
+       "loss": 0.757,
+       "step": 390
+     },
+     {
+       "epoch": 0.026104548717614043,
+       "grad_norm": 5.245498180313093,
+       "learning_rate": 2.998053639078193e-05,
+       "loss": 0.7536,
+       "step": 400
+     },
+     {
+       "epoch": 0.026757162435554396,
+       "grad_norm": 36.55990241841121,
+       "learning_rate": 2.997892219901034e-05,
+       "loss": 0.7395,
+       "step": 410
+     },
+     {
+       "epoch": 0.027409776153494745,
+       "grad_norm": 5.03198653806696,
+       "learning_rate": 2.9977243758256494e-05,
+       "loss": 0.7208,
+       "step": 420
+     },
+     {
+       "epoch": 0.028062389871435098,
+       "grad_norm": 11.376914733036081,
+       "learning_rate": 2.997550107571972e-05,
+       "loss": 0.719,
+       "step": 430
+     },
+     {
+       "epoch": 0.028715003589375447,
+       "grad_norm": 2.958119684662306,
+       "learning_rate": 2.9973694158874898e-05,
+       "loss": 0.7271,
+       "step": 440
+     },
+     {
+       "epoch": 0.0293676173073158,
+       "grad_norm": 6.037096737490817,
+       "learning_rate": 2.9971823015472418e-05,
+       "loss": 0.7356,
+       "step": 450
+     },
+     {
+       "epoch": 0.03002023102525615,
+       "grad_norm": 5.3042973640363575,
+       "learning_rate": 2.9969887653538164e-05,
+       "loss": 0.7207,
+       "step": 460
+     },
+     {
+       "epoch": 0.030672844743196502,
+       "grad_norm": 2.4985603001745624,
+       "learning_rate": 2.996788808137347e-05,
+       "loss": 0.7769,
+       "step": 470
+     },
+     {
+       "epoch": 0.03132545846113685,
+       "grad_norm": 7.607065841315647,
+       "learning_rate": 2.9965824307555084e-05,
+       "loss": 0.7091,
+       "step": 480
+     },
+     {
+       "epoch": 0.03197807217907721,
+       "grad_norm": 4.322533035107957,
+       "learning_rate": 2.9963696340935144e-05,
+       "loss": 0.7114,
+       "step": 490
+     },
+     {
+       "epoch": 0.03263068589701756,
+       "grad_norm": 5.878565903250334,
+       "learning_rate": 2.9961504190641108e-05,
+       "loss": 0.7284,
+       "step": 500
+     },
+     {
+       "epoch": 0.033283299614957906,
+       "grad_norm": 5.0026507027119855,
+       "learning_rate": 2.9959247866075764e-05,
+       "loss": 0.6992,
+       "step": 510
+     },
+     {
+       "epoch": 0.033935913332898256,
+       "grad_norm": 7.12632150273901,
+       "learning_rate": 2.9956927376917137e-05,
+       "loss": 0.7285,
+       "step": 520
+     },
+     {
+       "epoch": 0.03458852705083861,
+       "grad_norm": 5.211123255860348,
+       "learning_rate": 2.9954542733118496e-05,
+       "loss": 0.7511,
+       "step": 530
+     },
+     {
+       "epoch": 0.03524114076877896,
+       "grad_norm": 9.925273547498618,
+       "learning_rate": 2.995209394490827e-05,
+       "loss": 0.7699,
+       "step": 540
+     },
+     {
+       "epoch": 0.03589375448671931,
+       "grad_norm": 7.418381681996765,
+       "learning_rate": 2.9949581022790025e-05,
+       "loss": 0.759,
+       "step": 550
+     },
+     {
+       "epoch": 0.03654636820465966,
+       "grad_norm": 4.352380973507467,
+       "learning_rate": 2.9947003977542423e-05,
+       "loss": 0.7537,
+       "step": 560
+     },
+     {
+       "epoch": 0.037198981922600016,
+       "grad_norm": 9.712842120769198,
+       "learning_rate": 2.9944362820219167e-05,
+       "loss": 0.7063,
+       "step": 570
+     },
+     {
+       "epoch": 0.037851595640540366,
+       "grad_norm": 5.757600819230482,
+       "learning_rate": 2.994165756214895e-05,
+       "loss": 0.7893,
+       "step": 580
+     },
+     {
+       "epoch": 0.038504209358480715,
+       "grad_norm": 5.529209601152462,
+       "learning_rate": 2.9938888214935426e-05,
+       "loss": 0.6771,
+       "step": 590
+     },
+     {
+       "epoch": 0.039156823076421064,
+       "grad_norm": 10.550479346499758,
+       "learning_rate": 2.9936054790457127e-05,
+       "loss": 0.737,
+       "step": 600
+     },
+     {
+       "epoch": 0.03980943679436142,
+       "grad_norm": 8.284279553451016,
+       "learning_rate": 2.9933157300867437e-05,
+       "loss": 0.7182,
+       "step": 610
+     },
+     {
+       "epoch": 0.04046205051230177,
+       "grad_norm": 8.18511648646326,
+       "learning_rate": 2.9930195758594542e-05,
+       "loss": 0.6901,
+       "step": 620
+     },
+     {
+       "epoch": 0.04111466423024212,
+       "grad_norm": 14.569754827631956,
+       "learning_rate": 2.9927170176341365e-05,
+       "loss": 0.7008,
+       "step": 630
+     },
+     {
+       "epoch": 0.04176727794818247,
+       "grad_norm": 4.214581273685441,
+       "learning_rate": 2.992408056708551e-05,
+       "loss": 0.7489,
+       "step": 640
+     },
+     {
+       "epoch": 0.042419891666122825,
+       "grad_norm": 10.038596627079452,
+       "learning_rate": 2.9920926944079224e-05,
+       "loss": 0.7649,
+       "step": 650
+     },
+     {
+       "epoch": 0.043072505384063174,
+       "grad_norm": 2.386544029221306,
+       "learning_rate": 2.9917709320849305e-05,
+       "loss": 0.7223,
+       "step": 660
+     },
+     {
+       "epoch": 0.043725119102003523,
+       "grad_norm": 8.286359254511249,
+       "learning_rate": 2.9914427711197096e-05,
+       "loss": 0.7089,
+       "step": 670
+     },
+     {
+       "epoch": 0.04437773281994387,
+       "grad_norm": 4.235819327444911,
+       "learning_rate": 2.9911082129198372e-05,
+       "loss": 0.7138,
+       "step": 680
+     },
+     {
+       "epoch": 0.04503034653788423,
+       "grad_norm": 5.187338033698449,
+       "learning_rate": 2.9907672589203316e-05,
+       "loss": 0.7192,
+       "step": 690
+     },
+     {
+       "epoch": 0.04568296025582458,
+       "grad_norm": 6.360475337181379,
+       "learning_rate": 2.9904199105836443e-05,
+       "loss": 0.7094,
+       "step": 700
+     },
+     {
+       "epoch": 0.04633557397376493,
+       "grad_norm": 4.906400836156689,
+       "learning_rate": 2.990066169399654e-05,
+       "loss": 0.654,
+       "step": 710
+     },
+     {
+       "epoch": 0.04698818769170528,
+       "grad_norm": 17.600495314130633,
+       "learning_rate": 2.9897060368856603e-05,
+       "loss": 0.7299,
+       "step": 720
+     },
+     {
+       "epoch": 0.04764080140964563,
+       "grad_norm": 7.765935941492389,
+       "learning_rate": 2.989339514586377e-05,
+       "loss": 0.7486,
+       "step": 730
+     },
+     {
+       "epoch": 0.04829341512758598,
+       "grad_norm": 7.30026395137639,
+       "learning_rate": 2.9889666040739252e-05,
+       "loss": 0.6941,
+       "step": 740
+     },
+     {
+       "epoch": 0.04894602884552633,
+       "grad_norm": 4.676985481218465,
+       "learning_rate": 2.9885873069478275e-05,
+       "loss": 0.7701,
+       "step": 750
+     },
+     {
+       "epoch": 0.04959864256346668,
+       "grad_norm": 42.50656974727186,
+       "learning_rate": 2.9882016248350006e-05,
+       "loss": 0.7428,
+       "step": 760
+     },
+     {
+       "epoch": 0.05025125628140704,
+       "grad_norm": 3.9893667031114766,
+       "learning_rate": 2.9878095593897474e-05,
+       "loss": 0.7204,
+       "step": 770
+     },
+     {
+       "epoch": 0.05090386999934739,
+       "grad_norm": 8.909028486553332,
+       "learning_rate": 2.9874111122937518e-05,
+       "loss": 0.7336,
+       "step": 780
+     },
+     {
+       "epoch": 0.051556483717287736,
+       "grad_norm": 5.256925284136456,
+       "learning_rate": 2.9870062852560698e-05,
+       "loss": 0.7674,
+       "step": 790
+     },
+     {
+       "epoch": 0.052209097435228086,
+       "grad_norm": 5.835535487534073,
+       "learning_rate": 2.986595080013123e-05,
+       "loss": 0.7547,
+       "step": 800
+     },
+     {
+       "epoch": 0.05286171115316844,
+       "grad_norm": 4.7337998648314565,
+       "learning_rate": 2.9861774983286913e-05,
+       "loss": 0.7412,
+       "step": 810
+     },
+     {
+       "epoch": 0.05351432487110879,
+       "grad_norm": 4.020304406250962,
+       "learning_rate": 2.9857535419939053e-05,
+       "loss": 0.7351,
+       "step": 820
+     },
+     {
+       "epoch": 0.05416693858904914,
+       "grad_norm": 7.005748568175158,
+       "learning_rate": 2.9853232128272367e-05,
+       "loss": 0.7146,
+       "step": 830
+     },
+     {
+       "epoch": 0.05481955230698949,
+       "grad_norm": 12.598315147497464,
+       "learning_rate": 2.984886512674494e-05,
+       "loss": 0.7066,
+       "step": 840
+     },
+     {
+       "epoch": 0.055472166024929846,
+       "grad_norm": 5.636755294839953,
+       "learning_rate": 2.9844434434088114e-05,
+       "loss": 0.8033,
+       "step": 850
+     },
+     {
+       "epoch": 0.056124779742870196,
+       "grad_norm": 2.5964949457129305,
+       "learning_rate": 2.9839940069306436e-05,
+       "loss": 0.718,
+       "step": 860
+     },
+     {
+       "epoch": 0.056777393460810545,
+       "grad_norm": 5.496060434333994,
+       "learning_rate": 2.9835382051677548e-05,
+       "loss": 0.7382,
+       "step": 870
+     },
+     {
+       "epoch": 0.057430007178750894,
+       "grad_norm": 3.367511777906771,
+       "learning_rate": 2.9830760400752117e-05,
+       "loss": 0.7049,
+       "step": 880
+     },
+     {
+       "epoch": 0.05808262089669125,
+       "grad_norm": 12.228282751386294,
+       "learning_rate": 2.9826075136353762e-05,
+       "loss": 0.7135,
+       "step": 890
+     },
+     {
+       "epoch": 0.0587352346146316,
+       "grad_norm": 7.426066867205744,
+       "learning_rate": 2.9821326278578955e-05,
+       "loss": 0.6966,
+       "step": 900
+     },
+     {
+       "epoch": 0.05938784833257195,
+       "grad_norm": 5.720080945169142,
+       "learning_rate": 2.981651384779693e-05,
+       "loss": 0.7325,
+       "step": 910
+     },
+     {
+       "epoch": 0.0600404620505123,
+       "grad_norm": 3.3362738196336275,
+       "learning_rate": 2.9811637864649622e-05,
+       "loss": 0.7013,
+       "step": 920
+     },
+     {
+       "epoch": 0.060693075768452655,
+       "grad_norm": 5.5481143050516675,
+       "learning_rate": 2.980669835005154e-05,
+       "loss": 0.7107,
+       "step": 930
+     },
+     {
+       "epoch": 0.061345689486393004,
+       "grad_norm": 2.7247889305754533,
+       "learning_rate": 2.980169532518971e-05,
+       "loss": 0.6839,
+       "step": 940
+     },
+     {
+       "epoch": 0.06199830320433335,
+       "grad_norm": 12.705144630158374,
+       "learning_rate": 2.9796628811523576e-05,
+       "loss": 0.7061,
+       "step": 950
+     },
+     {
+       "epoch": 0.0626509169222737,
+       "grad_norm": 3.1174966376805777,
+       "learning_rate": 2.9791498830784896e-05,
+       "loss": 0.706,
+       "step": 960
+     },
+     {
+       "epoch": 0.06330353064021406,
+       "grad_norm": 6.454819870022971,
+       "learning_rate": 2.9786305404977657e-05,
+       "loss": 0.6901,
688
+ "step": 970
689
+ },
690
+ {
691
+ "epoch": 0.06395614435815442,
692
+ "grad_norm": 8.62099817289566,
693
+ "learning_rate": 2.9781048556377982e-05,
694
+ "loss": 0.6737,
695
+ "step": 980
696
+ },
697
+ {
698
+ "epoch": 0.06460875807609476,
699
+ "grad_norm": 12.649532843245389,
700
+ "learning_rate": 2.977572830753404e-05,
701
+ "loss": 0.6777,
702
+ "step": 990
703
+ },
704
+ {
705
+ "epoch": 0.06526137179403511,
706
+ "grad_norm": 5.019508830810828,
707
+ "learning_rate": 2.9770344681265925e-05,
708
+ "loss": 0.7125,
709
+ "step": 1000
710
+ },
711
+ {
712
+ "epoch": 0.06591398551197546,
713
+ "grad_norm": 5.417114630539967,
714
+ "learning_rate": 2.9764897700665595e-05,
715
+ "loss": 0.7558,
716
+ "step": 1010
717
+ },
718
+ {
719
+ "epoch": 0.06656659922991581,
720
+ "grad_norm": 13.487574757960102,
721
+ "learning_rate": 2.975938738909674e-05,
722
+ "loss": 0.7305,
723
+ "step": 1020
724
+ },
725
+ {
726
+ "epoch": 0.06721921294785617,
727
+ "grad_norm": 4.115297871929447,
728
+ "learning_rate": 2.97538137701947e-05,
729
+ "loss": 0.7382,
730
+ "step": 1030
731
+ },
732
+ {
733
+ "epoch": 0.06787182666579651,
734
+ "grad_norm": 4.218133725965425,
735
+ "learning_rate": 2.974817686786636e-05,
736
+ "loss": 0.7131,
737
+ "step": 1040
738
+ },
739
+ {
740
+ "epoch": 0.06852444038373687,
741
+ "grad_norm": 23.754945260227526,
742
+ "learning_rate": 2.9742476706290044e-05,
743
+ "loss": 0.6854,
744
+ "step": 1050
745
+ },
746
+ {
747
+ "epoch": 0.06917705410167722,
748
+ "grad_norm": 9.992382581534882,
749
+ "learning_rate": 2.973671330991541e-05,
750
+ "loss": 0.7224,
751
+ "step": 1060
752
+ },
753
+ {
754
+ "epoch": 0.06982966781961757,
755
+ "grad_norm": 9.022842665053004,
756
+ "learning_rate": 2.973088670346336e-05,
757
+ "loss": 0.69,
758
+ "step": 1070
759
+ },
760
+ {
761
+ "epoch": 0.07048228153755792,
762
+ "grad_norm": 7.180693480173149,
763
+ "learning_rate": 2.97249969119259e-05,
764
+ "loss": 0.6752,
765
+ "step": 1080
766
+ },
767
+ {
768
+ "epoch": 0.07113489525549826,
769
+ "grad_norm": 4.631581340679664,
770
+ "learning_rate": 2.9719043960566088e-05,
771
+ "loss": 0.7078,
772
+ "step": 1090
773
+ },
774
+ {
775
+ "epoch": 0.07178750897343862,
776
+ "grad_norm": 3.8365551360021497,
777
+ "learning_rate": 2.9713027874917867e-05,
778
+ "loss": 0.7455,
779
+ "step": 1100
780
+ },
781
+ {
782
+ "epoch": 0.07244012269137898,
783
+ "grad_norm": 20.612721990589407,
784
+ "learning_rate": 2.9706948680785984e-05,
785
+ "loss": 0.7123,
786
+ "step": 1110
787
+ },
788
+ {
789
+ "epoch": 0.07309273640931932,
790
+ "grad_norm": 8.515913036269723,
791
+ "learning_rate": 2.9700806404245893e-05,
792
+ "loss": 0.6755,
793
+ "step": 1120
794
+ },
795
+ {
796
+ "epoch": 0.07374535012725968,
797
+ "grad_norm": 8.702591994450561,
798
+ "learning_rate": 2.9694601071643607e-05,
799
+ "loss": 0.743,
800
+ "step": 1130
801
+ },
802
+ {
803
+ "epoch": 0.07439796384520003,
804
+ "grad_norm": 20.204623397644042,
805
+ "learning_rate": 2.968833270959562e-05,
806
+ "loss": 0.6995,
807
+ "step": 1140
808
+ },
809
+ {
810
+ "epoch": 0.07505057756314037,
811
+ "grad_norm": 3.4150625200259563,
812
+ "learning_rate": 2.9682001344988768e-05,
813
+ "loss": 0.7245,
814
+ "step": 1150
815
+ },
816
+ {
817
+ "epoch": 0.07570319128108073,
818
+ "grad_norm": 4.827412673105033,
819
+ "learning_rate": 2.967560700498013e-05,
820
+ "loss": 0.6764,
821
+ "step": 1160
822
+ },
823
+ {
824
+ "epoch": 0.07635580499902107,
825
+ "grad_norm": 5.9778449783108965,
826
+ "learning_rate": 2.9669149716996897e-05,
827
+ "loss": 0.7094,
828
+ "step": 1170
829
+ },
830
+ {
831
+ "epoch": 0.07700841871696143,
832
+ "grad_norm": 4.626419468156439,
833
+ "learning_rate": 2.9662629508736278e-05,
834
+ "loss": 0.7139,
835
+ "step": 1180
836
+ },
837
+ {
838
+ "epoch": 0.07766103243490179,
839
+ "grad_norm": 8.23953369228554,
840
+ "learning_rate": 2.9656046408165344e-05,
841
+ "loss": 0.7132,
842
+ "step": 1190
843
+ },
844
+ {
845
+ "epoch": 0.07831364615284213,
846
+ "grad_norm": 5.755275462407804,
847
+ "learning_rate": 2.964940044352095e-05,
848
+ "loss": 0.6923,
849
+ "step": 1200
850
+ },
851
+ {
852
+ "epoch": 0.07896625987078248,
853
+ "grad_norm": 3.8396649246253816,
854
+ "learning_rate": 2.9642691643309572e-05,
855
+ "loss": 0.7082,
856
+ "step": 1210
857
+ },
858
+ {
859
+ "epoch": 0.07961887358872284,
860
+ "grad_norm": 5.7429454484886415,
861
+ "learning_rate": 2.963592003630723e-05,
862
+ "loss": 0.7095,
863
+ "step": 1220
864
+ },
865
+ {
866
+ "epoch": 0.08027148730666318,
867
+ "grad_norm": 17.628494673763004,
868
+ "learning_rate": 2.962908565155932e-05,
869
+ "loss": 0.7309,
870
+ "step": 1230
871
+ },
872
+ {
873
+ "epoch": 0.08092410102460354,
874
+ "grad_norm": 4.83400055237192,
875
+ "learning_rate": 2.9622188518380528e-05,
876
+ "loss": 0.6925,
877
+ "step": 1240
878
+ },
879
+ {
880
+ "epoch": 0.08157671474254388,
881
+ "grad_norm": 3.1535973307593905,
882
+ "learning_rate": 2.9615228666354667e-05,
883
+ "loss": 0.7441,
884
+ "step": 1250
885
+ },
886
+ {
887
+ "epoch": 0.08222932846048424,
888
+ "grad_norm": 4.085385929026401,
889
+ "learning_rate": 2.9608206125334586e-05,
890
+ "loss": 0.7137,
891
+ "step": 1260
892
+ },
893
+ {
894
+ "epoch": 0.0828819421784246,
895
+ "grad_norm": 4.299591870123697,
896
+ "learning_rate": 2.9601120925442016e-05,
897
+ "loss": 0.7515,
898
+ "step": 1270
899
+ },
900
+ {
901
+ "epoch": 0.08353455589636494,
902
+ "grad_norm": 12.873434323415678,
903
+ "learning_rate": 2.959397309706746e-05,
904
+ "loss": 0.6852,
905
+ "step": 1280
906
+ },
907
+ {
908
+ "epoch": 0.0841871696143053,
909
+ "grad_norm": 6.427088345402557,
910
+ "learning_rate": 2.958676267087004e-05,
911
+ "loss": 0.6499,
912
+ "step": 1290
913
+ },
914
+ {
915
+ "epoch": 0.08483978333224565,
916
+ "grad_norm": 4.70723263638176,
917
+ "learning_rate": 2.9579489677777387e-05,
918
+ "loss": 0.6803,
919
+ "step": 1300
920
+ },
921
+ {
922
+ "epoch": 0.08549239705018599,
923
+ "grad_norm": 4.819218491318424,
924
+ "learning_rate": 2.9572154148985495e-05,
925
+ "loss": 0.6798,
926
+ "step": 1310
927
+ },
928
+ {
929
+ "epoch": 0.08614501076812635,
930
+ "grad_norm": 3.0652661968089827,
931
+ "learning_rate": 2.9564756115958592e-05,
932
+ "loss": 0.6935,
933
+ "step": 1320
934
+ },
935
+ {
936
+ "epoch": 0.08679762448606669,
937
+ "grad_norm": 5.997224165634556,
938
+ "learning_rate": 2.9557295610429017e-05,
939
+ "loss": 0.7133,
940
+ "step": 1330
941
+ },
942
+ {
943
+ "epoch": 0.08745023820400705,
944
+ "grad_norm": 3.3593003375605717,
945
+ "learning_rate": 2.954977266439706e-05,
946
+ "loss": 0.7335,
947
+ "step": 1340
948
+ },
949
+ {
950
+ "epoch": 0.0881028519219474,
951
+ "grad_norm": 4.161242018302672,
952
+ "learning_rate": 2.954218731013083e-05,
953
+ "loss": 0.7054,
954
+ "step": 1350
955
+ },
956
+ {
957
+ "epoch": 0.08875546563988775,
958
+ "grad_norm": 5.827431481546491,
959
+ "learning_rate": 2.953453958016614e-05,
960
+ "loss": 0.6321,
961
+ "step": 1360
962
+ },
963
+ {
964
+ "epoch": 0.0894080793578281,
965
+ "grad_norm": 7.1039105888444904,
966
+ "learning_rate": 2.952682950730634e-05,
967
+ "loss": 0.6941,
968
+ "step": 1370
969
+ },
970
+ {
971
+ "epoch": 0.09006069307576846,
972
+ "grad_norm": 2.7616336275225892,
973
+ "learning_rate": 2.951905712462219e-05,
974
+ "loss": 0.6928,
975
+ "step": 1380
976
+ },
977
+ {
978
+ "epoch": 0.0907133067937088,
979
+ "grad_norm": 4.261061690296871,
980
+ "learning_rate": 2.9511222465451716e-05,
981
+ "loss": 0.7176,
982
+ "step": 1390
983
+ },
984
+ {
985
+ "epoch": 0.09136592051164916,
986
+ "grad_norm": 5.4134818862551395,
987
+ "learning_rate": 2.950332556340006e-05,
988
+ "loss": 0.7048,
989
+ "step": 1400
990
+ },
991
+ {
992
+ "epoch": 0.0920185342295895,
993
+ "grad_norm": 6.3477656240577085,
994
+ "learning_rate": 2.949536645233935e-05,
995
+ "loss": 0.6842,
996
+ "step": 1410
997
+ },
998
+ {
999
+ "epoch": 0.09267114794752986,
1000
+ "grad_norm": 63.477804314776044,
1001
+ "learning_rate": 2.9487345166408545e-05,
1002
+ "loss": 0.6876,
1003
+ "step": 1420
1004
+ },
1005
+ {
1006
+ "epoch": 0.09332376166547021,
1007
+ "grad_norm": 4.368664541213622,
1008
+ "learning_rate": 2.9479261740013286e-05,
1009
+ "loss": 0.6913,
1010
+ "step": 1430
1011
+ },
1012
+ {
1013
+ "epoch": 0.09397637538341055,
1014
+ "grad_norm": 9.476938465079238,
1015
+ "learning_rate": 2.9471116207825754e-05,
1016
+ "loss": 0.6891,
1017
+ "step": 1440
1018
+ },
1019
+ {
1020
+ "epoch": 0.09462898910135091,
1021
+ "grad_norm": 8.434794578560851,
1022
+ "learning_rate": 2.9462908604784523e-05,
1023
+ "loss": 0.6585,
1024
+ "step": 1450
1025
+ },
1026
+ {
1027
+ "epoch": 0.09528160281929127,
1028
+ "grad_norm": 4.798759761163433,
1029
+ "learning_rate": 2.945463896609441e-05,
1030
+ "loss": 0.6736,
1031
+ "step": 1460
1032
+ },
1033
+ {
1034
+ "epoch": 0.09593421653723161,
1035
+ "grad_norm": 9.782724872581115,
1036
+ "learning_rate": 2.9446307327226306e-05,
1037
+ "loss": 0.6659,
1038
+ "step": 1470
1039
+ },
1040
+ {
1041
+ "epoch": 0.09658683025517197,
1042
+ "grad_norm": 3.997516099278308,
1043
+ "learning_rate": 2.9437913723917058e-05,
1044
+ "loss": 0.6527,
1045
+ "step": 1480
1046
+ },
1047
+ {
1048
+ "epoch": 0.09723944397311232,
1049
+ "grad_norm": 4.623015725563099,
1050
+ "learning_rate": 2.942945819216928e-05,
1051
+ "loss": 0.7274,
1052
+ "step": 1490
1053
+ },
1054
+ {
1055
+ "epoch": 0.09789205769105266,
1056
+ "grad_norm": 3.2197835799755055,
1057
+ "learning_rate": 2.942094076825123e-05,
1058
+ "loss": 0.6966,
1059
+ "step": 1500
1060
+ },
1061
+ {
1062
+ "epoch": 0.09854467140899302,
1063
+ "grad_norm": 3.5107988249516984,
1064
+ "learning_rate": 2.9412361488696628e-05,
1065
+ "loss": 0.7235,
1066
+ "step": 1510
1067
+ },
1068
+ {
1069
+ "epoch": 0.09919728512693336,
1070
+ "grad_norm": 18.7865650951996,
1071
+ "learning_rate": 2.9403720390304518e-05,
1072
+ "loss": 0.7382,
1073
+ "step": 1520
1074
+ },
1075
+ {
1076
+ "epoch": 0.09984989884487372,
1077
+ "grad_norm": 3.85598692653545,
1078
+ "learning_rate": 2.93950175101391e-05,
1079
+ "loss": 0.7475,
1080
+ "step": 1530
1081
+ },
1082
+ {
1083
+ "epoch": 0.10050251256281408,
1084
+ "grad_norm": 20.459657003411998,
1085
+ "learning_rate": 2.938625288552957e-05,
1086
+ "loss": 0.6558,
1087
+ "step": 1540
1088
+ },
1089
+ {
1090
+ "epoch": 0.10115512628075442,
1091
+ "grad_norm": 6.416583997846208,
1092
+ "learning_rate": 2.9377426554069976e-05,
1093
+ "loss": 0.7205,
1094
+ "step": 1550
1095
+ },
1096
+ {
1097
+ "epoch": 0.10180773999869477,
1098
+ "grad_norm": 5.532087704430113,
1099
+ "learning_rate": 2.936853855361904e-05,
1100
+ "loss": 0.7189,
1101
+ "step": 1560
1102
+ },
1103
+ {
1104
+ "epoch": 0.10246035371663513,
1105
+ "grad_norm": 4.756518458886862,
1106
+ "learning_rate": 2.9359588922299986e-05,
1107
+ "loss": 0.7088,
1108
+ "step": 1570
1109
+ },
1110
+ {
1111
+ "epoch": 0.10311296743457547,
1112
+ "grad_norm": 5.775658785412931,
1113
+ "learning_rate": 2.9350577698500408e-05,
1114
+ "loss": 0.682,
1115
+ "step": 1580
1116
+ },
1117
+ {
1118
+ "epoch": 0.10376558115251583,
1119
+ "grad_norm": 7.714313915746094,
1120
+ "learning_rate": 2.9341504920872087e-05,
1121
+ "loss": 0.7393,
1122
+ "step": 1590
1123
+ },
1124
+ {
1125
+ "epoch": 0.10441819487045617,
1126
+ "grad_norm": 11.153510433173501,
1127
+ "learning_rate": 2.933237062833082e-05,
1128
+ "loss": 0.6616,
1129
+ "step": 1600
1130
+ }
1131
+ ],
1132
+ "logging_steps": 10,
1133
+ "max_steps": 15323,
1134
+ "num_input_tokens_seen": 0,
1135
+ "num_train_epochs": 1,
1136
+ "save_steps": 400,
1137
+ "stateful_callbacks": {
1138
+ "TrainerControl": {
1139
+ "args": {
1140
+ "should_epoch_stop": false,
1141
+ "should_evaluate": false,
1142
+ "should_log": false,
1143
+ "should_save": true,
1144
+ "should_training_stop": false
1145
+ },
1146
+ "attributes": {}
1147
+ }
1148
+ },
1149
+ "total_flos": 4.3737129443917824e+18,
1150
+ "train_batch_size": 8,
1151
+ "trial_name": null,
1152
+ "trial_params": null
1153
+ }
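The log above is machine-written by the Hugging Face Trainer. A minimal sketch for inspecting it offline, assuming the standard trainer_state.json layout with a top-level `log_history` list (the checkpoint path is illustrative):

```python
import json

# Load the trainer state saved alongside the checkpoint.
with open("checkpoint-1600/trainer_state.json") as f:
    state = json.load(f)

# Each logged training entry carries epoch, grad_norm, learning_rate, loss and step.
for entry in state["log_history"]:
    print(entry["step"], entry["loss"], entry["grad_norm"])
```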
checkpoint-1600/training_args.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a3a6a5052a9445cc570063f5939fdeea3ff8007e9c2718674bb335b9eea0bfff
size 6520
checkpoint-1600/zero_to_fp32.py ADDED
@@ -0,0 +1,587 @@
#!/usr/bin/env python

# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team

# This script extracts fp32 consolidated weights from a zero 1, 2 and 3 DeepSpeed checkpoints. It gets
# copied into the top level checkpoint dir, so the user can easily do the conversion at any point in
# the future. Once extracted, the weights don't require DeepSpeed and can be used in any
# application.
#
# example: python zero_to_fp32.py . pytorch_model.bin

import argparse
import torch
import glob
import math
import os
import re
from collections import OrderedDict
from dataclasses import dataclass

# while this script doesn't use deepspeed to recover data, since the checkpoints are pickled with
# DeepSpeed data structures it has to be available in the current python environment.
from deepspeed.utils import logger
from deepspeed.checkpoint.constants import (DS_VERSION, OPTIMIZER_STATE_DICT, SINGLE_PARTITION_OF_FP32_GROUPS,
                                            FP32_FLAT_GROUPS, ZERO_STAGE, PARTITION_COUNT, PARAM_SHAPES, BUFFER_NAMES,
                                            FROZEN_PARAM_SHAPES, FROZEN_PARAM_FRAGMENTS)


@dataclass
class zero_model_state:
    buffers: dict()
    param_shapes: dict()
    shared_params: list
    ds_version: int
    frozen_param_shapes: dict()
    frozen_param_fragments: dict()


debug = 0

# load to cpu
device = torch.device('cpu')


def atoi(text):
    return int(text) if text.isdigit() else text


def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    '''
    return [atoi(c) for c in re.split(r'(\d+)', text)]


def get_model_state_file(checkpoint_dir, zero_stage):
    if not os.path.isdir(checkpoint_dir):
        raise FileNotFoundError(f"Directory '{checkpoint_dir}' doesn't exist")

    # there should be only one file
    if zero_stage <= 2:
        file = os.path.join(checkpoint_dir, "mp_rank_00_model_states.pt")
    elif zero_stage == 3:
        file = os.path.join(checkpoint_dir, "zero_pp_rank_0_mp_rank_00_model_states.pt")

    if not os.path.exists(file):
        raise FileNotFoundError(f"can't find model states file at '{file}'")

    return file


def get_checkpoint_files(checkpoint_dir, glob_pattern):
    # XXX: need to test that this simple glob rule works for multi-node setup too
    ckpt_files = sorted(glob.glob(os.path.join(checkpoint_dir, glob_pattern)), key=natural_keys)

    if len(ckpt_files) == 0:
        raise FileNotFoundError(f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")

    return ckpt_files


def get_optim_files(checkpoint_dir):
    return get_checkpoint_files(checkpoint_dir, "*_optim_states.pt")


def get_model_state_files(checkpoint_dir):
    return get_checkpoint_files(checkpoint_dir, "*_model_states.pt")


def parse_model_states(files):
    zero_model_states = []
    for file in files:
        state_dict = torch.load(file, map_location=device)

        if BUFFER_NAMES not in state_dict:
            raise ValueError(f"{file} is not a model state checkpoint")
        buffer_names = state_dict[BUFFER_NAMES]
        if debug:
            print("Found buffers:", buffer_names)

        # recover just the buffers while restoring them to fp32 if they were saved in fp16
        buffers = {k: v.float() for k, v in state_dict["module"].items() if k in buffer_names}
        param_shapes = state_dict[PARAM_SHAPES]

        # collect parameters that are included in param_shapes
        param_names = []
        for s in param_shapes:
            for name in s.keys():
                param_names.append(name)

        # update with frozen parameters
        frozen_param_shapes = state_dict.get(FROZEN_PARAM_SHAPES, None)
        if frozen_param_shapes is not None:
            if debug:
                print(f"Found frozen_param_shapes: {frozen_param_shapes}")
            param_names += list(frozen_param_shapes.keys())

        # handle shared params
        shared_params = [[k, v] for k, v in state_dict["shared_params"].items()]

        ds_version = state_dict.get(DS_VERSION, None)

        frozen_param_fragments = state_dict.get(FROZEN_PARAM_FRAGMENTS, None)

        z_model_state = zero_model_state(buffers=buffers,
                                         param_shapes=param_shapes,
                                         shared_params=shared_params,
                                         ds_version=ds_version,
                                         frozen_param_shapes=frozen_param_shapes,
                                         frozen_param_fragments=frozen_param_fragments)
        zero_model_states.append(z_model_state)

    return zero_model_states


def parse_optim_states(files, ds_checkpoint_dir):

    total_files = len(files)
    state_dicts = []
    for f in files:
        state_dict = torch.load(f, map_location=device)
        # immediately discard the potentially huge 2 optimizer states as we only care for fp32 master weights
        # and also handle the case where it was already removed by another helper script
        state_dict["optimizer_state_dict"].pop("optimizer_state_dict", None)
        state_dicts.append(state_dict)

    if not ZERO_STAGE in state_dicts[0][OPTIMIZER_STATE_DICT]:
        raise ValueError(f"{files[0]} is not a zero checkpoint")
    zero_stage = state_dicts[0][OPTIMIZER_STATE_DICT][ZERO_STAGE]
    world_size = state_dicts[0][OPTIMIZER_STATE_DICT][PARTITION_COUNT]

    # For ZeRO-2 each param group can have different partition_count as data parallelism for expert
    # parameters can be different from data parallelism for non-expert parameters. So we can just
    # use the max of the partition_count to get the dp world_size.

    if type(world_size) is list:
        world_size = max(world_size)

    if world_size != total_files:
        raise ValueError(
            f"Expected {world_size} of '*_optim_states.pt' under '{ds_checkpoint_dir}' but found {total_files} files. "
            "Possibly due to an overwrite of an old checkpoint, or a checkpoint didn't get saved by one or more processes."
        )

    # the groups are named differently in each stage
    if zero_stage <= 2:
        fp32_groups_key = SINGLE_PARTITION_OF_FP32_GROUPS
    elif zero_stage == 3:
        fp32_groups_key = FP32_FLAT_GROUPS
    else:
        raise ValueError(f"unknown zero stage {zero_stage}")

    if zero_stage <= 2:
        fp32_flat_groups = [state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key] for i in range(len(state_dicts))]
    elif zero_stage == 3:
        # if there is more than one param group, there will be multiple flattened tensors - one
        # flattened tensor per group - for simplicity merge them into a single tensor
        #
        # XXX: could make the script more memory efficient for when there are multiple groups - it
        # will require matching the sub-lists of param_shapes for each param group flattened tensor

        fp32_flat_groups = [
            torch.cat(state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key], 0) for i in range(len(state_dicts))
        ]

    return zero_stage, world_size, fp32_flat_groups


def _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir):
    """
    Returns fp32 state_dict reconstructed from ds checkpoint

    Args:
        - ``ds_checkpoint_dir``: path to the deepspeed checkpoint folder (where the optimizer files are)

    """
    print(f"Processing zero checkpoint '{ds_checkpoint_dir}'")

    optim_files = get_optim_files(ds_checkpoint_dir)
    zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
    print(f"Detected checkpoint of type zero stage {zero_stage}, world_size: {world_size}")

    model_files = get_model_state_files(ds_checkpoint_dir)

    zero_model_states = parse_model_states(model_files)
    print(f'Parsing checkpoint created by deepspeed=={zero_model_states[0].ds_version}')

    if zero_stage <= 2:
        return _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states)
    elif zero_stage == 3:
        return _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states)


def _zero2_merge_frozen_params(state_dict, zero_model_states):
    if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
        return

    frozen_param_shapes = zero_model_states[0].frozen_param_shapes
    frozen_param_fragments = zero_model_states[0].frozen_param_fragments

    if debug:
        num_elem = sum(s.numel() for s in frozen_param_shapes.values())
        print(f'rank 0: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')

    wanted_params = len(frozen_param_shapes)
    wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
    avail_numel = sum([p.numel() for p in frozen_param_fragments.values()])
    print(f'Frozen params: Have {avail_numel} numels to process.')
    print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')

    total_params = 0
    total_numel = 0
    for name, shape in frozen_param_shapes.items():
        total_params += 1
        unpartitioned_numel = shape.numel()
        total_numel += unpartitioned_numel

        state_dict[name] = frozen_param_fragments[name]

        if debug:
            print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")

    print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")


def _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
    param_shapes = zero_model_states[0].param_shapes

    # Reconstruction protocol:
    #
    # XXX: document this

    if debug:
        for i in range(world_size):
            for j in range(len(fp32_flat_groups[0])):
                print(f"{FP32_FLAT_GROUPS}[{i}][{j}].shape={fp32_flat_groups[i][j].shape}")

    # XXX: memory usage doubles here (zero2)
    num_param_groups = len(fp32_flat_groups[0])
    merged_single_partition_of_fp32_groups = []
    for i in range(num_param_groups):
        merged_partitions = [sd[i] for sd in fp32_flat_groups]
        full_single_fp32_vector = torch.cat(merged_partitions, 0)
        merged_single_partition_of_fp32_groups.append(full_single_fp32_vector)
    avail_numel = sum(
        [full_single_fp32_vector.numel() for full_single_fp32_vector in merged_single_partition_of_fp32_groups])

    if debug:
        wanted_params = sum([len(shapes) for shapes in param_shapes])
        wanted_numel = sum([sum(shape.numel() for shape in shapes.values()) for shapes in param_shapes])
        # not asserting if there is a mismatch due to possible padding
        print(f"Have {avail_numel} numels to process.")
        print(f"Need {wanted_numel} numels in {wanted_params} params.")

    # params
    # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
    # out-of-core computing solution
    total_numel = 0
    total_params = 0
    for shapes, full_single_fp32_vector in zip(param_shapes, merged_single_partition_of_fp32_groups):
        offset = 0
        avail_numel = full_single_fp32_vector.numel()
        for name, shape in shapes.items():

            unpartitioned_numel = shape.numel()
            total_numel += unpartitioned_numel
            total_params += 1

            if debug:
                print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
            state_dict[name] = full_single_fp32_vector.narrow(0, offset, unpartitioned_numel).view(shape)
            offset += unpartitioned_numel

        # Z2 started to align to 2*world_size to improve nccl performance. Therefore both offset and
        # avail_numel can differ by anywhere between 0..2*world_size. Due to two unrelated complex
        # paddings performed in the code it's almost impossible to predict the exact numbers w/o the
        # live optimizer object, so we are checking that the numbers are within the right range
        align_to = 2 * world_size

        def zero2_align(x):
            return align_to * math.ceil(x / align_to)

        if debug:
            print(f"original offset={offset}, avail_numel={avail_numel}")

        offset = zero2_align(offset)
        avail_numel = zero2_align(avail_numel)

        if debug:
            print(f"aligned offset={offset}, avail_numel={avail_numel}")

        # Sanity check
        if offset != avail_numel:
            raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")

    print(f"Reconstructed fp32 state dict with {total_params} params {total_numel} elements")


def _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states):
    state_dict = OrderedDict()

    # buffers
    buffers = zero_model_states[0].buffers
    state_dict.update(buffers)
    if debug:
        print(f"added {len(buffers)} buffers")

    _zero2_merge_frozen_params(state_dict, zero_model_states)

    _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)

    # recover shared parameters
    for pair in zero_model_states[0].shared_params:
        if pair[1] in state_dict:
            state_dict[pair[0]] = state_dict[pair[1]]

    return state_dict


def zero3_partitioned_param_info(unpartitioned_numel, world_size):
    remainder = unpartitioned_numel % world_size
    padding_numel = (world_size - remainder) if remainder else 0
    partitioned_numel = math.ceil(unpartitioned_numel / world_size)
    return partitioned_numel, padding_numel


def _zero3_merge_frozen_params(state_dict, world_size, zero_model_states):
    if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
        return

    if debug:
        for i in range(world_size):
            num_elem = sum(s.numel() for s in zero_model_states[i].frozen_param_fragments.values())
            print(f'rank {i}: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')

    frozen_param_shapes = zero_model_states[0].frozen_param_shapes
    wanted_params = len(frozen_param_shapes)
    wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
    avail_numel = sum([p.numel() for p in zero_model_states[0].frozen_param_fragments.values()]) * world_size
    print(f'Frozen params: Have {avail_numel} numels to process.')
    print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')

    total_params = 0
    total_numel = 0
    for name, shape in zero_model_states[0].frozen_param_shapes.items():
        total_params += 1
        unpartitioned_numel = shape.numel()
        total_numel += unpartitioned_numel

        param_frags = tuple(model_state.frozen_param_fragments[name] for model_state in zero_model_states)
        state_dict[name] = torch.cat(param_frags, 0).narrow(0, 0, unpartitioned_numel).view(shape)

        partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)

        if debug:
            print(
                f"Frozen params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
            )

    print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")


def _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
    param_shapes = zero_model_states[0].param_shapes
    avail_numel = fp32_flat_groups[0].numel() * world_size
    # Reconstruction protocol: For zero3 we need to zip the partitions together at boundary of each
    # param, re-consolidating each param, while dealing with padding if any

    # merge list of dicts, preserving order
    param_shapes = {k: v for d in param_shapes for k, v in d.items()}

    if debug:
        for i in range(world_size):
            print(f"{FP32_FLAT_GROUPS}[{i}].shape={fp32_flat_groups[i].shape}")

    wanted_params = len(param_shapes)
    wanted_numel = sum(shape.numel() for shape in param_shapes.values())
    # not asserting if there is a mismatch due to possible padding
    avail_numel = fp32_flat_groups[0].numel() * world_size
    print(f"Trainable params: Have {avail_numel} numels to process.")
    print(f"Trainable params: Need {wanted_numel} numels in {wanted_params} params.")

    # params
    # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
    # out-of-core computing solution
    offset = 0
    total_numel = 0
    total_params = 0
    for name, shape in param_shapes.items():

        unpartitioned_numel = shape.numel()
        total_numel += unpartitioned_numel
        total_params += 1

        partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)

        if debug:
            print(
                f"Trainable params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
            )

        # XXX: memory usage doubles here
        state_dict[name] = torch.cat(
            tuple(fp32_flat_groups[i].narrow(0, offset, partitioned_numel) for i in range(world_size)),
            0).narrow(0, 0, unpartitioned_numel).view(shape)
        offset += partitioned_numel

    offset *= world_size

    # Sanity check
    if offset != avail_numel:
        raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")

    print(f"Reconstructed Trainable fp32 state dict with {total_params} params {total_numel} elements")


def _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states):
    state_dict = OrderedDict()

    # buffers
    buffers = zero_model_states[0].buffers
    state_dict.update(buffers)
    if debug:
        print(f"added {len(buffers)} buffers")

    _zero3_merge_frozen_params(state_dict, world_size, zero_model_states)

    _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)

    # recover shared parameters
    for pair in zero_model_states[0].shared_params:
        if pair[1] in state_dict:
            state_dict[pair[0]] = state_dict[pair[1]]

    return state_dict


def get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag=None):
    """
    Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated state_dict that can be loaded with
    ``load_state_dict()`` and used for training without DeepSpeed or shared with others, for example
    via a model hub.

    Args:
        - ``checkpoint_dir``: path to the desired checkpoint folder
        - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in 'latest' file. e.g., ``global_step14``

    Returns:
        - pytorch ``state_dict``

    Note: this approach may not work if your application doesn't have sufficient free CPU memory and
    you may need to use the offline approach using the ``zero_to_fp32.py`` script that is saved with
    the checkpoint.

    A typical usage might be ::

        from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
        # do the training and checkpoint saving
        state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
        model = model.cpu() # move to cpu
        model.load_state_dict(state_dict)
        # submit to model hub or save the model to share with others

    In this example the ``model`` will no longer be usable in the deepspeed context of the same
    application. i.e. you will need to re-initialize the deepspeed engine, since
    ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.

    If you want it all done for you, use ``load_state_dict_from_zero_checkpoint`` instead.

    """
    if tag is None:
        latest_path = os.path.join(checkpoint_dir, 'latest')
        if os.path.isfile(latest_path):
            with open(latest_path, 'r') as fd:
                tag = fd.read().strip()
        else:
            raise ValueError(f"Unable to find 'latest' file at {latest_path}")

    ds_checkpoint_dir = os.path.join(checkpoint_dir, tag)

    if not os.path.isdir(ds_checkpoint_dir):
        raise FileNotFoundError(f"Directory '{ds_checkpoint_dir}' doesn't exist")

    return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir)


def convert_zero_checkpoint_to_fp32_state_dict(checkpoint_dir, output_file, tag=None):
    """
    Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict`` file that can be
    loaded with ``torch.load(file)`` + ``load_state_dict()`` and used for training without DeepSpeed.

    Args:
        - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
        - ``output_file``: path to the pytorch fp32 state_dict output file (e.g. path/pytorch_model.bin)
        - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
    """

    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
    print(f"Saving fp32 state dict to {output_file}")
    torch.save(state_dict, output_file)


def load_state_dict_from_zero_checkpoint(model, checkpoint_dir, tag=None):
    """
    1. Put the provided model to cpu
    2. Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict``
    3. Load it into the provided model

    Args:
        - ``model``: the model object to update
        - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
        - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``

    Returns:
        - ``model``: modified model

    Make sure you have plenty of CPU memory available before you call this function. If you don't
    have enough use the ``zero_to_fp32.py`` utility to do the conversion. You will find it
    conveniently placed for you in the checkpoint folder.

    A typical usage might be ::

        from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
        model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
        # submit to model hub or save the model to share with others

    Note, that once this was run, the ``model`` will no longer be usable in the deepspeed context
    of the same application. i.e. you will need to re-initialize the deepspeed engine, since
    ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.

    """
    logger.info(f"Extracting fp32 weights")
    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)

    logger.info(f"Overwriting model with fp32 weights")
    model = model.cpu()
    model.load_state_dict(state_dict, strict=False)

    return model


if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument("checkpoint_dir",
                        type=str,
                        help="path to the desired checkpoint folder, e.g., path/checkpoint-12")
    parser.add_argument(
        "output_file",
        type=str,
        help="path to the pytorch fp32 state_dict output file (e.g. path/checkpoint-12/pytorch_model.bin)")
    parser.add_argument("-t",
                        "--tag",
                        type=str,
                        default=None,
                        help="checkpoint tag used as a unique identifier for checkpoint. e.g., global_step1")
    parser.add_argument("-d", "--debug", action='store_true', help="enable debug")
    args = parser.parse_args()

    debug = args.debug

    convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir, args.output_file, tag=args.tag)
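The script above is self-contained once DeepSpeed is installed. A minimal usage sketch, assuming the DeepSpeed shard folder named in checkpoint-1600/latest was uploaded alongside the checkpoint (the output filename is illustrative):

```python
# Consolidate the ZeRO-partitioned shards under checkpoint-1600 into one fp32 file.
from zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict

convert_zero_checkpoint_to_fp32_state_dict("checkpoint-1600", "checkpoint-1600/pytorch_model_fp32.bin")
```

Equivalently, from the command line: `python checkpoint-1600/zero_to_fp32.py checkpoint-1600 pytorch_model_fp32.bin`.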
checkpoint-2000/README.md ADDED
@@ -0,0 +1,203 @@
---
library_name: peft
base_model: Qwen/Qwen-VL-Chat
---

# Model Card for Model ID

<!-- Provide a quick summary of what the model is/does. -->



## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->



- **Developed by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** [More Information Needed]

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

[More Information Needed]

### Downstream Use [optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

[More Information Needed]

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing [optional]

[More Information Needed]


#### Training Hyperparameters

- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

[More Information Needed]

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

[More Information Needed]

### Results

[More Information Needed]

#### Summary



## Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->

[More Information Needed]

## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]
### Framework versions

- PEFT 0.10.0
- PEFT 0.11.1
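The model card's "How to Get Started" section above is still a stub. A minimal loading sketch for these adapters follows; it is not an official recipe from this repo, and it assumes Qwen/Qwen-VL-Chat's trust_remote_code requirement is acceptable and that an adapter directory such as checkpoint-2000 is used as saved:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the frozen base model, then attach the LoRA adapter from this repo.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "checkpoint-2000")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
```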
checkpoint-2000/adapter_config.json ADDED
@@ -0,0 +1,380 @@
{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "Qwen/Qwen-VL-Chat",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 16,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 64,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "transformer.h.16.mlp.w1",
    "transformer.visual.transformer.resblocks.13.attn.out_proj",
    "transformer.h.28.mlp.w1",
    "transformer.h.16.attn.c_attn",
    "transformer.h.3.mlp.w1",
    "transformer.visual.transformer.resblocks.29.attn.in_proj",
    "transformer.visual.transformer.resblocks.19.mlp.c_proj",
    "transformer.visual.transformer.resblocks.47.mlp.c_fc",
    "transformer.visual.transformer.resblocks.34.mlp.c_fc",
    "transformer.visual.transformer.resblocks.4.attn.out_proj",
    "transformer.h.31.attn.c_attn",
    "transformer.h.16.mlp.w2",
    "transformer.visual.transformer.resblocks.5.attn.out_proj",
    "transformer.h.2.mlp.w1",
    "transformer.visual.transformer.resblocks.7.attn.in_proj",
    "transformer.h.20.mlp.w2",
    "transformer.h.19.mlp.w1",
    "transformer.visual.transformer.resblocks.18.mlp.c_fc",
    "transformer.visual.transformer.resblocks.27.attn.out_proj",
    "transformer.visual.transformer.resblocks.10.mlp.c_proj",
    "transformer.visual.transformer.resblocks.43.mlp.c_fc",
    "transformer.h.5.mlp.w1",
    "transformer.visual.transformer.resblocks.15.mlp.c_proj",
    "transformer.visual.transformer.resblocks.25.mlp.c_proj",
    "transformer.visual.transformer.resblocks.10.attn.out_proj",
    "transformer.visual.transformer.resblocks.4.mlp.c_fc",
    "transformer.h.31.mlp.w2",
    "transformer.visual.transformer.resblocks.37.attn.out_proj",
    "transformer.h.8.attn.c_proj",
    "transformer.h.29.attn.c_attn",
    "transformer.visual.transformer.resblocks.24.mlp.c_proj",
    "transformer.h.19.mlp.c_proj",
    "transformer.visual.transformer.resblocks.11.attn.out_proj",
    "transformer.h.13.mlp.c_proj",
    "transformer.h.27.mlp.c_proj",
    "transformer.h.31.mlp.w1",
    "transformer.visual.transformer.resblocks.7.mlp.c_proj",
    "transformer.h.28.mlp.w2",
    "transformer.visual.transformer.resblocks.3.mlp.c_proj",
    "transformer.visual.transformer.resblocks.13.attn.in_proj",
    "transformer.h.21.attn.c_attn",
    "transformer.visual.transformer.resblocks.23.mlp.c_fc",
    "transformer.visual.transformer.resblocks.33.mlp.c_proj",
    "transformer.visual.transformer.resblocks.42.mlp.c_fc",
    "transformer.visual.transformer.resblocks.3.attn.in_proj",
    "transformer.h.13.mlp.w1",
    "transformer.visual.transformer.resblocks.22.attn.out_proj",
    "transformer.visual.transformer.resblocks.20.mlp.c_fc",
    "transformer.h.26.mlp.w2",
    "transformer.h.14.attn.c_attn",
    "transformer.h.16.attn.c_proj",
    "transformer.h.1.mlp.w1",
    "transformer.visual.transformer.resblocks.21.attn.out_proj",
    "transformer.visual.transformer.resblocks.39.mlp.c_proj",
    "transformer.visual.transformer.resblocks.4.attn.in_proj",
    "transformer.h.29.mlp.c_proj",
    "transformer.visual.transformer.resblocks.12.mlp.c_proj",
    "transformer.visual.transformer.resblocks.14.attn.in_proj",
    "transformer.h.28.attn.c_proj",
    "transformer.h.18.mlp.w1",
    "transformer.h.27.mlp.w2",
    "transformer.h.18.attn.c_attn",
    "transformer.visual.transformer.resblocks.33.attn.out_proj",
    "transformer.h.5.mlp.w2",
    "transformer.visual.transformer.resblocks.37.mlp.c_fc",
    "transformer.visual.transformer.resblocks.2.mlp.c_proj",
    "transformer.visual.transformer.resblocks.42.attn.out_proj",
    "transformer.visual.transformer.resblocks.15.attn.in_proj",
    "transformer.visual.transformer.resblocks.6.mlp.c_fc",
    "transformer.h.13.mlp.w2",
    "transformer.h.23.attn.c_proj",
    "transformer.h.20.mlp.c_proj",
    "transformer.h.14.mlp.w2",
    "transformer.visual.transformer.resblocks.9.attn.in_proj",
    "transformer.visual.transformer.resblocks.46.attn.in_proj",
    "transformer.h.9.attn.c_attn",
    "transformer.visual.transformer.resblocks.36.mlp.c_proj",
    "transformer.h.31.attn.c_proj",
    "transformer.visual.transformer.resblocks.19.mlp.c_fc",
    "transformer.h.17.mlp.w1",
    "transformer.h.2.attn.c_proj",
    "transformer.visual.transformer.resblocks.47.attn.in_proj",
    "transformer.visual.transformer.resblocks.45.mlp.c_proj",
    "transformer.visual.transformer.resblocks.46.mlp.c_fc",
    "transformer.visual.transformer.resblocks.27.attn.in_proj",
    "transformer.visual.transformer.resblocks.26.attn.out_proj",
    "transformer.h.22.attn.c_proj",
    "transformer.visual.transformer.resblocks.40.attn.out_proj",
    "transformer.visual.transformer.resblocks.46.mlp.c_proj",
    "transformer.visual.transformer.resblocks.18.attn.out_proj",
    "transformer.h.27.attn.c_proj",
    "transformer.visual.transformer.resblocks.26.attn.in_proj",
    "transformer.h.4.mlp.w1",
    "transformer.h.10.attn.c_proj",
    "transformer.h.6.attn.c_attn",
    "transformer.h.2.attn.c_attn",
    "transformer.h.22.mlp.w1",
    "transformer.visual.transformer.resblocks.39.mlp.c_fc",
    "transformer.h.8.mlp.w2",
    "transformer.h.4.attn.c_attn",
    "transformer.h.26.mlp.c_proj",
    "transformer.visual.transformer.resblocks.29.mlp.c_proj",
    "transformer.visual.transformer.resblocks.5.mlp.c_proj",
    "transformer.h.11.mlp.c_proj",
    "transformer.h.0.mlp.w2",
    "transformer.visual.transformer.resblocks.36.attn.out_proj",
    "transformer.h.29.mlp.w1",
    "transformer.h.12.mlp.c_proj",
    "transformer.visual.transformer.resblocks.2.attn.in_proj",
    "transformer.visual.transformer.resblocks.2.mlp.c_fc",
    "transformer.h.25.attn.c_attn",
    "transformer.visual.transformer.resblocks.19.attn.in_proj",
    "transformer.visual.transformer.resblocks.43.attn.out_proj",
    "transformer.visual.transformer.resblocks.35.attn.out_proj",
    "transformer.h.22.attn.c_attn",
    "transformer.h.0.mlp.w1",
    "transformer.h.3.attn.c_attn",
    "transformer.h.28.attn.c_attn",
    "transformer.visual.transformer.resblocks.25.attn.in_proj",
    "transformer.visual.transformer.resblocks.34.attn.out_proj",
    "transformer.h.21.attn.c_proj",
    "transformer.h.6.attn.c_proj",
    "transformer.visual.transformer.resblocks.11.mlp.c_proj",
    "transformer.h.13.attn.c_attn",
    "transformer.visual.transformer.resblocks.38.attn.out_proj",
    "transformer.h.3.attn.c_proj",
    "transformer.visual.transformer.resblocks.17.mlp.c_fc",
    "transformer.h.26.mlp.w1",
    "transformer.visual.transformer.resblocks.36.mlp.c_fc",
    "transformer.h.26.attn.c_attn",
    "transformer.visual.transformer.resblocks.29.attn.out_proj",
    "transformer.h.7.mlp.w1",
    "transformer.visual.transformer.resblocks.40.mlp.c_fc",
    "transformer.visual.transformer.resblocks.9.attn.out_proj",
    "transformer.h.3.mlp.c_proj",
    "transformer.visual.transformer.resblocks.26.mlp.c_fc",
    "transformer.h.11.mlp.w2",
    "transformer.visual.transformer.resblocks.33.attn.in_proj",
    "transformer.visual.transformer.resblocks.42.mlp.c_proj",
    "transformer.visual.transformer.resblocks.32.attn.out_proj",
    "transformer.h.4.attn.c_proj",
    "transformer.visual.transformer.resblocks.27.mlp.c_fc",
    "transformer.visual.transformer.resblocks.11.mlp.c_fc",
    "transformer.visual.transformer.resblocks.25.attn.out_proj",
    "transformer.visual.transformer.resblocks.23.attn.in_proj",
    "transformer.h.5.attn.c_attn",
    "transformer.h.16.mlp.c_proj",
    "transformer.visual.transformer.resblocks.14.mlp.c_proj",
    "transformer.h.22.mlp.w2",
    "transformer.h.25.mlp.c_proj",
    "transformer.visual.transformer.resblocks.10.mlp.c_fc",
    "transformer.h.24.mlp.c_proj",
    "transformer.h.19.mlp.w2",
    "transformer.h.14.mlp.w1",
    "transformer.visual.transformer.resblocks.40.mlp.c_proj",
    "transformer.visual.transformer.resblocks.28.attn.out_proj",
    "transformer.visual.transformer.resblocks.24.mlp.c_fc",
    "transformer.h.8.attn.c_attn",
    "transformer.h.9.mlp.w1",
    "transformer.h.6.mlp.c_proj",
    "transformer.visual.transformer.resblocks.19.attn.out_proj",
    "transformer.visual.transformer.resblocks.32.mlp.c_fc",
    "transformer.visual.transformer.resblocks.7.mlp.c_fc",
    "transformer.visual.transformer.resblocks.44.attn.in_proj",
    "transformer.visual.transformer.resblocks.34.mlp.c_proj",
    "transformer.visual.transformer.resblocks.9.mlp.c_fc",
    "transformer.visual.conv1",
    "transformer.visual.transformer.resblocks.8.attn.out_proj",
    "transformer.h.23.mlp.w2",
    "transformer.h.7.mlp.w2",
    "transformer.h.24.attn.c_proj",
    "transformer.h.30.attn.c_proj",
    "transformer.h.29.attn.c_proj",
    "transformer.visual.transformer.resblocks.9.mlp.c_proj",
    "transformer.visual.transformer.resblocks.35.attn.in_proj",
    "transformer.visual.transformer.resblocks.21.mlp.c_fc",
    "transformer.visual.transformer.resblocks.41.mlp.c_proj",
    "transformer.visual.transformer.resblocks.38.mlp.c_fc",
    "transformer.visual.transformer.resblocks.13.mlp.c_proj",
    "transformer.visual.transformer.resblocks.41.attn.out_proj",
    "transformer.visual.transformer.resblocks.16.mlp.c_fc",
    "transformer.visual.transformer.resblocks.45.attn.out_proj",
    "transformer.h.11.mlp.w1",
    "transformer.visual.transformer.resblocks.16.attn.in_proj",
    "transformer.visual.transformer.resblocks.47.attn.out_proj",
    "transformer.h.9.attn.c_proj",
    "transformer.h.31.mlp.c_proj",
    "transformer.visual.transformer.resblocks.12.attn.in_proj",
    "transformer.visual.transformer.resblocks.28.mlp.c_proj",
    "transformer.visual.transformer.resblocks.20.attn.out_proj",
    "transformer.h.12.attn.c_attn",
    "transformer.h.24.mlp.w1",
    "transformer.visual.transformer.resblocks.21.attn.in_proj",
    "transformer.visual.transformer.resblocks.41.attn.in_proj",
    "transformer.h.10.mlp.w1",
    "transformer.h.1.mlp.w2",
    "transformer.h.0.mlp.c_proj",
    "transformer.h.22.mlp.c_proj",
    "transformer.visual.transformer.resblocks.18.attn.in_proj",
    "transformer.visual.transformer.resblocks.38.mlp.c_proj",
    "transformer.h.12.mlp.w1",
    "transformer.h.1.attn.c_attn",
    "transformer.visual.transformer.resblocks.31.mlp.c_proj",
    "transformer.visual.transformer.resblocks.44.mlp.c_proj",
    "transformer.h.15.mlp.c_proj",
    "transformer.h.6.mlp.w1",
    "transformer.visual.transformer.resblocks.16.mlp.c_proj",
    "transformer.h.13.attn.c_proj",
    "transformer.h.15.attn.c_attn",
    "transformer.h.15.mlp.w1",
    "transformer.h.17.mlp.w2",
    "transformer.visual.transformer.resblocks.10.attn.in_proj",
    "transformer.h.26.attn.c_proj",
    "transformer.visual.transformer.resblocks.20.attn.in_proj",
    "transformer.h.10.mlp.w2",
    "transformer.h.24.attn.c_attn",
    "transformer.h.8.mlp.w1",
240
+ "transformer.h.23.mlp.w1",
241
+ "transformer.visual.transformer.resblocks.1.mlp.c_proj",
242
+ "transformer.h.4.mlp.w2",
243
+ "transformer.visual.transformer.resblocks.38.attn.in_proj",
244
+ "transformer.h.12.mlp.w2",
245
+ "transformer.h.7.attn.c_proj",
246
+ "transformer.h.4.mlp.c_proj",
247
+ "transformer.visual.transformer.resblocks.31.attn.out_proj",
248
+ "transformer.visual.transformer.resblocks.17.mlp.c_proj",
249
+ "transformer.h.21.mlp.w2",
250
+ "transformer.visual.transformer.resblocks.5.attn.in_proj",
251
+ "transformer.h.18.attn.c_proj",
252
+ "transformer.visual.transformer.resblocks.31.mlp.c_fc",
253
+ "transformer.h.18.mlp.w2",
254
+ "transformer.visual.transformer.resblocks.6.attn.out_proj",
255
+ "transformer.visual.transformer.resblocks.8.attn.in_proj",
256
+ "transformer.visual.transformer.resblocks.30.mlp.c_proj",
257
+ "transformer.h.30.mlp.c_proj",
258
+ "transformer.visual.transformer.resblocks.30.attn.out_proj",
259
+ "transformer.visual.transformer.resblocks.16.attn.out_proj",
260
+ "transformer.visual.transformer.resblocks.14.attn.out_proj",
261
+ "transformer.h.25.mlp.w1",
262
+ "transformer.visual.transformer.resblocks.45.attn.in_proj",
263
+ "transformer.h.11.attn.c_proj",
264
+ "transformer.visual.transformer.resblocks.30.attn.in_proj",
265
+ "transformer.visual.transformer.resblocks.43.mlp.c_proj",
266
+ "transformer.h.10.mlp.c_proj",
267
+ "transformer.h.21.mlp.c_proj",
268
+ "transformer.visual.transformer.resblocks.43.attn.in_proj",
269
+ "transformer.visual.transformer.resblocks.3.mlp.c_fc",
270
+ "transformer.visual.transformer.resblocks.44.attn.out_proj",
271
+ "transformer.h.23.attn.c_attn",
272
+ "transformer.visual.transformer.resblocks.22.attn.in_proj",
273
+ "transformer.visual.transformer.resblocks.6.attn.in_proj",
274
+ "transformer.visual.transformer.resblocks.44.mlp.c_fc",
275
+ "transformer.h.17.attn.c_attn",
276
+ "transformer.h.7.attn.c_attn",
277
+ "transformer.visual.transformer.resblocks.42.attn.in_proj",
278
+ "transformer.visual.transformer.resblocks.20.mlp.c_proj",
279
+ "transformer.h.8.mlp.c_proj",
280
+ "transformer.visual.transformer.resblocks.17.attn.out_proj",
281
+ "transformer.h.14.attn.c_proj",
282
+ "transformer.visual.transformer.resblocks.40.attn.in_proj",
283
+ "transformer.h.25.attn.c_proj",
284
+ "transformer.h.28.mlp.c_proj",
285
+ "transformer.visual.transformer.resblocks.35.mlp.c_proj",
286
+ "transformer.visual.transformer.resblocks.36.attn.in_proj",
287
+ "transformer.visual.transformer.resblocks.41.mlp.c_fc",
288
+ "transformer.visual.transformer.resblocks.14.mlp.c_fc",
289
+ "transformer.h.30.mlp.w2",
290
+ "transformer.h.20.mlp.w1",
291
+ "transformer.visual.transformer.resblocks.33.mlp.c_fc",
292
+ "transformer.h.29.mlp.w2",
293
+ "transformer.visual.transformer.resblocks.47.mlp.c_proj",
294
+ "transformer.visual.transformer.resblocks.30.mlp.c_fc",
295
+ "transformer.h.10.attn.c_attn",
296
+ "transformer.visual.transformer.resblocks.1.attn.in_proj",
297
+ "transformer.h.1.attn.c_proj",
298
+ "transformer.visual.transformer.resblocks.8.mlp.c_proj",
299
+ "transformer.h.19.attn.c_proj",
300
+ "transformer.visual.transformer.resblocks.37.attn.in_proj",
301
+ "transformer.h.15.attn.c_proj",
302
+ "transformer.h.5.attn.c_proj",
303
+ "transformer.visual.transformer.resblocks.32.mlp.c_proj",
304
+ "transformer.visual.transformer.resblocks.3.attn.out_proj",
305
+ "transformer.visual.transformer.resblocks.32.attn.in_proj",
306
+ "transformer.h.21.mlp.w1",
307
+ "transformer.h.23.mlp.c_proj",
308
+ "transformer.h.30.mlp.w1",
309
+ "transformer.h.0.attn.c_attn",
310
+ "transformer.visual.transformer.resblocks.24.attn.out_proj",
311
+ "transformer.visual.transformer.resblocks.31.attn.in_proj",
312
+ "transformer.h.18.mlp.c_proj",
313
+ "transformer.visual.transformer.resblocks.25.mlp.c_fc",
314
+ "transformer.visual.transformer.resblocks.22.mlp.c_fc",
315
+ "transformer.h.30.attn.c_attn",
316
+ "transformer.visual.transformer.resblocks.13.mlp.c_fc",
317
+ "transformer.h.17.mlp.c_proj",
318
+ "transformer.visual.transformer.resblocks.24.attn.in_proj",
319
+ "transformer.h.11.attn.c_attn",
320
+ "transformer.h.2.mlp.w2",
321
+ "transformer.visual.transformer.resblocks.8.mlp.c_fc",
322
+ "transformer.visual.transformer.resblocks.0.mlp.c_fc",
323
+ "transformer.visual.transformer.resblocks.2.attn.out_proj",
324
+ "transformer.visual.transformer.resblocks.35.mlp.c_fc",
325
+ "transformer.visual.transformer.resblocks.39.attn.out_proj",
326
+ "transformer.h.12.attn.c_proj",
327
+ "transformer.visual.transformer.resblocks.28.attn.in_proj",
328
+ "transformer.visual.transformer.resblocks.29.mlp.c_fc",
329
+ "transformer.visual.transformer.resblocks.0.attn.out_proj",
330
+ "transformer.visual.transformer.resblocks.23.mlp.c_proj",
331
+ "transformer.h.20.attn.c_attn",
332
+ "transformer.visual.transformer.resblocks.7.attn.out_proj",
333
+ "transformer.visual.transformer.resblocks.15.attn.out_proj",
334
+ "transformer.h.7.mlp.c_proj",
335
+ "transformer.visual.transformer.resblocks.1.attn.out_proj",
336
+ "transformer.h.3.mlp.w2",
337
+ "transformer.h.9.mlp.w2",
338
+ "transformer.visual.transformer.resblocks.34.attn.in_proj",
339
+ "transformer.h.27.attn.c_attn",
340
+ "transformer.visual.transformer.resblocks.12.mlp.c_fc",
341
+ "transformer.h.6.mlp.w2",
342
+ "transformer.visual.transformer.resblocks.39.attn.in_proj",
343
+ "transformer.h.15.mlp.w2",
344
+ "transformer.visual.transformer.resblocks.18.mlp.c_proj",
345
+ "transformer.h.0.attn.c_proj",
346
+ "transformer.h.19.attn.c_attn",
347
+ "transformer.visual.transformer.resblocks.27.mlp.c_proj",
348
+ "transformer.visual.transformer.resblocks.23.attn.out_proj",
349
+ "transformer.h.14.mlp.c_proj",
350
+ "transformer.h.9.mlp.c_proj",
351
+ "transformer.visual.transformer.resblocks.12.attn.out_proj",
352
+ "transformer.visual.transformer.resblocks.0.mlp.c_proj",
353
+ "transformer.visual.transformer.resblocks.5.mlp.c_fc",
354
+ "transformer.visual.transformer.resblocks.28.mlp.c_fc",
355
+ "transformer.visual.transformer.resblocks.6.mlp.c_proj",
356
+ "transformer.visual.transformer.resblocks.22.mlp.c_proj",
357
+ "transformer.visual.transformer.resblocks.37.mlp.c_proj",
358
+ "transformer.visual.transformer.resblocks.17.attn.in_proj",
359
+ "transformer.visual.transformer.resblocks.46.attn.out_proj",
360
+ "transformer.h.24.mlp.w2",
361
+ "transformer.h.27.mlp.w1",
362
+ "transformer.visual.transformer.resblocks.11.attn.in_proj",
363
+ "transformer.visual.transformer.resblocks.4.mlp.c_proj",
364
+ "transformer.visual.transformer.resblocks.21.mlp.c_proj",
365
+ "transformer.visual.transformer.resblocks.26.mlp.c_proj",
366
+ "transformer.visual.transformer.resblocks.15.mlp.c_fc",
367
+ "transformer.h.2.mlp.c_proj",
368
+ "transformer.h.1.mlp.c_proj",
369
+ "transformer.h.5.mlp.c_proj",
370
+ "transformer.visual.transformer.resblocks.45.mlp.c_fc",
371
+ "transformer.visual.transformer.resblocks.0.attn.in_proj",
372
+ "transformer.h.25.mlp.w2",
373
+ "transformer.h.20.attn.c_proj",
374
+ "transformer.h.17.attn.c_proj",
375
+ "transformer.visual.transformer.resblocks.1.mlp.c_fc"
376
+ ],
377
+ "task_type": "CAUSAL_LM",
378
+ "use_dora": false,
379
+ "use_rslora": false
380
+ }
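The `target_modules` list above covers every trainable projection in Qwen-VL-Chat: the LLM blocks (`transformer.h.*` with `attn.c_attn`, `attn.c_proj`, `mlp.w1`, `mlp.w2`, `mlp.c_proj`), the ViT blocks of the vision tower (`transformer.visual.transformer.resblocks.*` with `attn.in_proj`, `attn.out_proj`, `mlp.c_fc`, `mlp.c_proj`), and the patch embedding `transformer.visual.conv1`. A minimal sketch of attaching one of these adapter checkpoints to the base model with `peft` (the local path and device settings are illustrative assumptions):

```python
# Minimal sketch: load the base model, then attach the LoRA adapter
# saved in one of these checkpoint directories. Paths are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-VL-Chat", trust_remote_code=True
)

# adapter_config.json + adapter_model.safetensors are read from here
model = PeftModel.from_pretrained(base, "checkpoint-2000")
model = model.merge_and_unload()  # optional: fold LoRA deltas into the base
```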
checkpoint-2000/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5b8112968ddb6d5c9cc45ec4d181a2563b8c368858d7518e99e7a2f245a9f0f9
+ size 469105640
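The `.safetensors` and `.pth` entries in this commit are Git LFS pointers rather than the binaries themselves: three text lines giving the spec version, the blob's SHA-256 `oid`, and its `size` in bytes (~469 MB here). A small sketch for parsing and verifying such a pointer (the helper names are mine, not a library API):

```python
# Sketch: parse a git-lfs pointer file and verify a downloaded blob
# against it. Pointer format per https://git-lfs.github.com/spec/v1.
import hashlib

def parse_lfs_pointer(path: str) -> dict:
    """Read the "key value" lines of an LFS pointer into a dict."""
    fields = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                key, _, value = line.strip().partition(" ")
                fields[key] = value
    return fields

def verify_blob(blob_path: str, pointer: dict) -> bool:
    """Compare the blob's streamed SHA-256 digest to the pointer's oid."""
    h = hashlib.sha256()
    with open(blob_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return pointer["oid"] == f"sha256:{h.hexdigest()}"
```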
checkpoint-2000/latest ADDED
@@ -0,0 +1 @@
+ global_step2000
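`latest` follows DeepSpeed's checkpoint convention: a one-line tag file naming the sub-directory (here `global_step2000`) that holds the sharded optimizer and model partitions, which resume logic and consolidation utilities such as DeepSpeed's `zero_to_fp32.py` locate first. A sketch of the lookup, assuming the standard layout:

```python
# Sketch: resolve DeepSpeed's `latest` tag to the sharded-state directory.
# The layout (<ckpt>/latest naming <ckpt>/global_stepNNNN/) is the usual
# DeepSpeed convention, assumed rather than guaranteed for this repo.
import os

ckpt_dir = "checkpoint-2000"
with open(os.path.join(ckpt_dir, "latest")) as f:
    tag = f.read().strip()               # "global_step2000"

state_dir = os.path.join(ckpt_dir, tag)  # checkpoint-2000/global_step2000
print(state_dir)
```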
checkpoint-2000/qwen.tiktoken ADDED
The diff for this file is too large to render. See raw diff
checkpoint-2000/rng_state_0.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:373e583e765629a92d3530782b1b5ca914f786284fe0f518884228570ac59903
+ size 15920
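Each `rng_state_<rank>.pth` file (one per data-parallel rank) snapshots that process's RNG state so a resumed run replays the same data order and dropout masks. A hedged sketch of restoring one, assuming the Hugging Face Trainer's usual dict layout with `python`/`numpy`/`cpu`/`cuda` keys (verify against your own checkpoint before relying on it):

```python
# Hedged sketch: restore a rank's RNG state from a Trainer-style
# checkpoint. The dict keys below are assumptions about the layout.
import random

import numpy as np
import torch

# weights_only=False because the file holds plain Python/NumPy state,
# not just tensors (newer torch defaults to weights_only=True).
state = torch.load("checkpoint-2000/rng_state_0.pth", weights_only=False)

random.setstate(state["python"])          # Python's builtin RNG
np.random.set_state(state["numpy"])       # NumPy's global RNG
torch.random.set_rng_state(state["cpu"])  # CPU generator
if torch.cuda.is_available() and "cuda" in state:
    torch.cuda.random.set_rng_state_all(state["cuda"])  # all local GPUs
```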
checkpoint-2000/rng_state_1.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:361a2793fa584b3154fa7c73ce06ed5ea5168d465509e86dc4cb35aaab2a8bc1
+ size 15920
checkpoint-2000/rng_state_2.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4375f3bebf15db7c1ff742b2f45104c74917bdd457bdcf7c4e871a438ef88a23
+ size 15920
checkpoint-2000/rng_state_3.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:53e9816ac4dacdc9737c33654b217a3d3423ce888a9165acc83b6c109118e8bb
+ size 15920
checkpoint-2000/rng_state_4.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9be6ba5b929abd1627d02881ae59d6445ddd79d165123739de9ec3f1ecf40134
+ size 15920