Gege24 committed on
Commit
70d72d7
·
verified ·
1 Parent(s): b9c6121

Upload task output 1

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,209 @@
+ ---
+ base_model: None
+ library_name: peft
+ pipeline_tag: text-generation
+ tags:
+ - base_model:adapter:/cache/models/Qwen--Qwen3-4B-Instruct-2507
+ - grpo
+ - lora
+ - transformers
+ - trl
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
+ ### Framework versions
+
+ - PEFT 0.18.1
adapter_config.json ADDED
@@ -0,0 +1,46 @@
+ {
+   "alora_invocation_tokens": null,
+   "alpha_pattern": {},
+   "arrow_config": null,
+   "auto_mapping": null,
+   "base_model_name_or_path": null,
+   "bias": "none",
+   "corda_config": null,
+   "ensure_weight_tying": false,
+   "eva_config": null,
+   "exclude_modules": null,
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 64,
+   "lora_bias": false,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "peft_version": "0.18.1",
+   "qalora_group_size": 16,
+   "r": 32,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "down_proj",
+     "k_proj",
+     "q_proj",
+     "gate_proj",
+     "o_proj",
+     "v_proj",
+     "up_proj"
+   ],
+   "target_parameters": null,
+   "task_type": "CAUSAL_LM",
+   "trainable_token_indices": null,
+   "use_dora": false,
+   "use_qalora": false,
+   "use_rslora": false
+ }
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:11f10923c51131df28cbbc3b2ebac46c8ea940027309c1f8f36b14a93c130c18
+ size 264308896
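As a rough sanity check on `adapter_config.json` and the adapter file size above, the following sketch computes the LoRA scaling factor and the number of adapter parameters the config implies. The layer dimensions are assumptions based on the commonly published Qwen3-4B configuration (hidden size 2560, MLP size 9728, 36 layers, 32 query heads and 8 key/value heads of dimension 128); none of them appear in this commit. Under those assumptions, and assuming float32 storage, the weights account for all but about 68 kB of the 264308896-byte file, which would be consistent with a safetensors header.

```python
# Values copied from adapter_config.json in this commit.
R = 32
LORA_ALPHA = 64
TARGET_MODULES = [
    "down_proj", "k_proj", "q_proj", "gate_proj", "o_proj", "v_proj", "up_proj",
]

# ASSUMED base-model dimensions (Qwen3-4B); not stated anywhere in this commit.
HIDDEN = 2560
INTERMEDIATE = 9728
NUM_LAYERS = 36
Q_OUT = 32 * 128   # query heads * head dimension
KV_OUT = 8 * 128   # key/value heads * head dimension

# (in_features, out_features) of each targeted linear layer, under the assumptions above.
SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r, shapes, modules, num_layers):
    """Each adapted layer adds A (r x in) and B (out x r), i.e. r * (in + out) weights."""
    per_layer = sum(r * (shapes[m][0] + shapes[m][1]) for m in modules)
    return per_layer * num_layers


# Standard LoRA scaling (alpha / r), which applies because use_rslora is false.
scaling = LORA_ALPHA / R
n_params = lora_param_count(R, SHAPES, TARGET_MODULES, NUM_LAYERS)
fp32_bytes = n_params * 4          # assumes float32 storage
ADAPTER_FILE_SIZE = 264308896      # "size" from the LFS pointer above
remainder = ADAPTER_FILE_SIZE - fp32_bytes

print(f"scaling={scaling}, params={n_params}, fp32_bytes={fp32_bytes}, remainder={remainder}")
```

If the assumed dimensions were wrong, the remainder would not come out as a small positive number, so the check also gives some indirect support for the assumptions.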
added_tokens.json ADDED
@@ -0,0 +1,28 @@
+ {
+   "</think>": 151668,
+   "</tool_call>": 151658,
+   "</tool_response>": 151666,
+   "<think>": 151667,
+   "<tool_call>": 151657,
+   "<tool_response>": 151665,
+   "<|box_end|>": 151649,
+   "<|box_start|>": 151648,
+   "<|endoftext|>": 151643,
+   "<|file_sep|>": 151664,
+   "<|fim_middle|>": 151660,
+   "<|fim_pad|>": 151662,
+   "<|fim_prefix|>": 151659,
+   "<|fim_suffix|>": 151661,
+   "<|im_end|>": 151645,
+   "<|im_start|>": 151644,
+   "<|image_pad|>": 151655,
+   "<|object_ref_end|>": 151647,
+   "<|object_ref_start|>": 151646,
+   "<|quad_end|>": 151651,
+   "<|quad_start|>": 151650,
+   "<|repo_name|>": 151663,
+   "<|video_pad|>": 151656,
+   "<|vision_end|>": 151653,
+   "<|vision_pad|>": 151654,
+   "<|vision_start|>": 151652
+ }
chat_template.jinja ADDED
@@ -0,0 +1,61 @@
+ {%- if tools %}
+     {{- '<|im_start|>system\n' }}
+     {%- if messages[0].role == 'system' %}
+         {{- messages[0].content + '\n\n' }}
+     {%- endif %}
+     {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+     {%- for tool in tools %}
+         {{- "\n" }}
+         {{- tool | tojson }}
+     {%- endfor %}
+     {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+ {%- else %}
+     {%- if messages[0].role == 'system' %}
+         {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
+     {%- endif %}
+ {%- endif %}
+ {%- for message in messages %}
+     {%- if message.content is string %}
+         {%- set content = message.content %}
+     {%- else %}
+         {%- set content = '' %}
+     {%- endif %}
+     {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
+         {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
+     {%- elif message.role == "assistant" %}
+         {{- '<|im_start|>' + message.role + '\n' + content }}
+         {%- if message.tool_calls %}
+             {%- for tool_call in message.tool_calls %}
+                 {%- if (loop.first and content) or (not loop.first) %}
+                     {{- '\n' }}
+                 {%- endif %}
+                 {%- if tool_call.function %}
+                     {%- set tool_call = tool_call.function %}
+                 {%- endif %}
+                 {{- '<tool_call>\n{"name": "' }}
+                 {{- tool_call.name }}
+                 {{- '", "arguments": ' }}
+                 {%- if tool_call.arguments is string %}
+                     {{- tool_call.arguments }}
+                 {%- else %}
+                     {{- tool_call.arguments | tojson }}
+                 {%- endif %}
+                 {{- '}\n</tool_call>' }}
+             {%- endfor %}
+         {%- endif %}
+         {{- '<|im_end|>\n' }}
+     {%- elif message.role == "tool" %}
+         {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+             {{- '<|im_start|>user' }}
+         {%- endif %}
+         {{- '\n<tool_response>\n' }}
+         {{- content }}
+         {{- '\n</tool_response>' }}
+         {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+             {{- '<|im_end|>\n' }}
+         {%- endif %}
+     {%- endif %}
+ {%- endfor %}
+ {%- if add_generation_prompt %}
+     {{- '<|im_start|>assistant\n' }}
+ {%- endif %}
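To make the template above easier to follow, here is a minimal plain-Python sketch of its simplest path: no `tools` argument, no assistant `tool_calls`, and no `tool` messages. The function name `render_chatml` is hypothetical and this is not how the tokenizer actually renders prompts (that goes through the Jinja template); it only mirrors the string layout for that restricted case.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Mirror the template's output for plain system/user/assistant turns.

    Hypothetical helper: tool definitions, tool calls and tool responses
    are deliberately not handled here.
    """
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ("system", "user", "assistant"):
            raise ValueError(f"role {role!r} is outside the scope of this sketch")
        # The template substitutes an empty string for non-string content.
        content = message["content"] if isinstance(message["content"], str) else ""
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "You are concise."},
        {"role": "user", "content": "Hi"},
    ]
)
print(prompt)
```

Each turn is wrapped in `<|im_start|>` and `<|im_end|>`; the latter is also the `eos_token` in the tokenizer files of this commit, so generation stops at the end of the assistant turn.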
loss.txt ADDED
@@ -0,0 +1 @@
+ 53,no_eval
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "eos_token": {
+     "content": "<|im_end|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
+ size 11422654
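The three lines above are a Git LFS pointer, not the tokenizer itself: the repository stores only a version URL, a content hash, and a byte size, and the real file lives in LFS storage. A small sketch of reading such a pointer follows; the helper name `parse_lfs_pointer` is hypothetical and it assumes the simple `key value` line layout shown in this diff.

```python
def parse_lfs_pointer(text):
    """Split a Git LFS pointer into its fields (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid is "<hash algorithm>:<hex digest>".
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer content copied from the tokenizer.json diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
size 11422654
"""

info = parse_lfs_pointer(POINTER)
print(info["hash_algo"], info["size"])
```

After downloading the real file, its SHA-256 digest and byte length can be compared against `digest` and `size` to confirm the download is intact.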
tokenizer_config.json ADDED
@@ -0,0 +1,239 @@
+ {
+   "add_bos_token": false,
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "151643": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151644": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151645": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151646": {
+       "content": "<|object_ref_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151647": {
+       "content": "<|object_ref_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151648": {
+       "content": "<|box_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151649": {
+       "content": "<|box_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151650": {
+       "content": "<|quad_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151651": {
+       "content": "<|quad_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151652": {
+       "content": "<|vision_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151653": {
+       "content": "<|vision_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151654": {
+       "content": "<|vision_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151655": {
+       "content": "<|image_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151656": {
+       "content": "<|video_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151657": {
+       "content": "<tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151658": {
+       "content": "</tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151659": {
+       "content": "<|fim_prefix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151660": {
+       "content": "<|fim_middle|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151661": {
+       "content": "<|fim_suffix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151662": {
+       "content": "<|fim_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151663": {
+       "content": "<|repo_name|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151664": {
+       "content": "<|file_sep|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151665": {
+       "content": "<tool_response>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151666": {
+       "content": "</tool_response>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151667": {
+       "content": "<think>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151668": {
+       "content": "</think>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     }
+   },
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "bos_token": null,
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "extra_special_tokens": {},
+   "model_max_length": 1010000,
+   "pad_token": "<|endoftext|>",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null
+ }
trainer_state.json ADDED
@@ -0,0 +1,1315 @@
+ {
+   "best_global_step": null,
+   "best_metric": null,
+   "best_model_checkpoint": null,
+   "epoch": 0.00053,
+   "eval_steps": 500,
+   "global_step": 53,
+   "is_hyper_param_search": false,
+   "is_local_process_zero": true,
+   "is_world_process_zero": true,
+   "log_history": [
+     {
+       "clip_ratio/high_max": 0.0,
+       "clip_ratio/high_mean": 0.0,
+       "clip_ratio/low_mean": 0.0,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.0,
+       "completions/clipped_ratio": 0.0,
+       "completions/max_length": 3.0,
+       "completions/max_terminated_length": 3.0,
+       "completions/mean_length": 3.0,
+       "completions/mean_terminated_length": 3.0,
+       "completions/min_length": 3.0,
+       "completions/min_terminated_length": 3.0,
+       "entropy": 0.48936397582292557,
+       "epoch": 1e-05,
+       "frac_reward_zero_std": 0.0,
+       "grad_norm": 0.0005664514610543847,
+       "kl": 0.0,
+       "learning_rate": 0.0,
+       "loss": 0.0,
+       "num_tokens": 146997.0,
+       "reward": -3.158337116241455,
+       "reward_std": 11.52173900604248,
+       "rewards/rollout_reward_func/mean": -3.158337116241455,
+       "rewards/rollout_reward_func/std": 11.52173900604248,
+       "sampling/importance_sampling_ratio/max": 0.9331093430519104,
+       "sampling/importance_sampling_ratio/mean": 0.8982968330383301,
+       "sampling/importance_sampling_ratio/min": 0.8507044315338135,
+       "sampling/sampling_logp_difference/max": 0.15258322656154633,
+       "sampling/sampling_logp_difference/mean": 0.03600664064288139,
+       "step": 1,
+       "step_time": 78.72238844999993
+     },
+     {
+       "clip_ratio/high_max": 0.0,
+       "clip_ratio/high_mean": 0.0,
+       "clip_ratio/low_mean": 0.0,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.0,
+       "entropy": 0.48936397582292557,
+       "epoch": 2e-05,
+       "grad_norm": 0.0005677085719071329,
+       "kl": 0.0,
+       "learning_rate": 2.2857142857142855e-07,
+       "loss": 0.0,
+       "step": 2,
+       "step_time": 15.87692344900006
+     },
+     {
+       "clip_ratio/high_max": 0.0,
+       "clip_ratio/high_mean": 0.0,
+       "clip_ratio/low_mean": 0.010416666977107525,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.010416666977107525,
+       "completions/clipped_ratio": 0.0,
+       "completions/max_length": 3.0,
+       "completions/max_terminated_length": 3.0,
+       "completions/mean_length": 2.96875,
+       "completions/mean_terminated_length": 2.96875,
+       "completions/min_length": 2.0,
+       "completions/min_terminated_length": 2.0,
+       "entropy": 0.48732541501522064,
+       "epoch": 3e-05,
+       "frac_reward_zero_std": 0.0,
+       "grad_norm": 0.002515916945412755,
+       "kl": 0.003698288830491947,
+       "learning_rate": 4.571428571428571e-07,
+       "loss": -0.0,
+       "num_tokens": 300422.0,
+       "reward": -3.6841847896575928,
+       "reward_std": 11.474555015563965,
+       "rewards/rollout_reward_func/mean": -3.6841847896575928,
+       "rewards/rollout_reward_func/std": 11.474555015563965,
+       "sampling/importance_sampling_ratio/max": 0.9328096508979797,
+       "sampling/importance_sampling_ratio/mean": 0.8980345726013184,
+       "sampling/importance_sampling_ratio/min": 0.8055793642997742,
+       "sampling/sampling_logp_difference/max": 0.14251333475112915,
+       "sampling/sampling_logp_difference/mean": 0.03639774024486542,
+       "step": 3,
+       "step_time": 80.83202714800018
+     },
+     {
+       "clip_ratio/high_max": 0.0,
+       "clip_ratio/high_mean": 0.0,
+       "clip_ratio/low_mean": 0.0,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.0,
+       "entropy": 0.47548339515924454,
+       "epoch": 4e-05,
+       "grad_norm": 0.0018481042934581637,
+       "kl": 0.0001756823347314196,
+       "learning_rate": 6.857142857142857e-07,
+       "loss": -0.0,
+       "step": 4,
+       "step_time": 16.098328456000672
+     },
+     {
+       "clip_ratio/high_max": 0.0,
+       "clip_ratio/high_mean": 0.0,
+       "clip_ratio/low_mean": 0.0,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.0,
+       "completions/clipped_ratio": 0.0,
+       "completions/max_length": 3.0,
+       "completions/max_terminated_length": 3.0,
+       "completions/mean_length": 2.875,
+       "completions/mean_terminated_length": 2.875,
+       "completions/min_length": 2.0,
+       "completions/min_terminated_length": 2.0,
+       "entropy": 0.5311543047428131,
+       "epoch": 5e-05,
+       "frac_reward_zero_std": 0.0,
+       "grad_norm": 0.0028014357667416334,
+       "kl": 0.0004833272454334292,
+       "learning_rate": 9.142857142857142e-07,
+       "loss": 0.0001,
+       "num_tokens": 456990.0,
+       "reward": -1.6541011333465576,
+       "reward_std": 11.05872631072998,
+       "rewards/rollout_reward_func/mean": -1.6541011333465576,
+       "rewards/rollout_reward_func/std": 11.058727264404297,
+       "sampling/importance_sampling_ratio/max": 0.9697506427764893,
+       "sampling/importance_sampling_ratio/mean": 0.8795021772384644,
+       "sampling/importance_sampling_ratio/min": 0.5391229391098022,
+       "sampling/sampling_logp_difference/max": 0.4758959412574768,
+       "sampling/sampling_logp_difference/mean": 0.04838065803050995,
+       "step": 5,
+       "step_time": 86.28426266100041
+     },
+     {
+       "clip_ratio/high_max": 0.02500000037252903,
+       "clip_ratio/high_mean": 0.012500000186264515,
+       "clip_ratio/low_mean": 0.0,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.012500000186264515,
+       "entropy": 0.5359498672187328,
+       "epoch": 6e-05,
+       "grad_norm": 0.0022959199268370867,
+       "kl": 0.002978163921227406,
+       "learning_rate": 1.1428571428571428e-06,
+       "loss": 0.0001,
+       "step": 6,
+       "step_time": 16.73348437599998
+     },
+     {
+       "clip_ratio/high_max": 0.0,
+       "clip_ratio/high_mean": 0.0,
+       "clip_ratio/low_mean": 0.0,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.0,
+       "completions/clipped_ratio": 0.0,
+       "completions/max_length": 3.0,
+       "completions/max_terminated_length": 3.0,
+       "completions/mean_length": 3.0,
+       "completions/mean_terminated_length": 3.0,
+       "completions/min_length": 3.0,
+       "completions/min_terminated_length": 3.0,
+       "entropy": 0.5112962238490582,
+       "epoch": 7e-05,
+       "frac_reward_zero_std": 0.0,
+       "grad_norm": 0.00931193120777607,
+       "kl": 5.794192431451961e-05,
+       "learning_rate": 1.3714285714285715e-06,
+       "loss": 0.0,
+       "num_tokens": 606759.0,
+       "reward": -1.4654988050460815,
+       "reward_std": 12.649130821228027,
+       "rewards/rollout_reward_func/mean": -1.4654988050460815,
+       "rewards/rollout_reward_func/std": 12.649128913879395,
+       "sampling/importance_sampling_ratio/max": 1.0540623664855957,
+       "sampling/importance_sampling_ratio/mean": 0.9033976197242737,
+       "sampling/importance_sampling_ratio/min": 0.8677928447723389,
+       "sampling/sampling_logp_difference/max": 0.17156457901000977,
+       "sampling/sampling_logp_difference/mean": 0.037677086889743805,
+       "step": 7,
+       "step_time": 80.9153498870005
+     },
+     {
+       "clip_ratio/high_max": 0.0,
+       "clip_ratio/high_mean": 0.0,
+       "clip_ratio/low_mean": 0.0,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.0,
+       "entropy": 0.5051367729902267,
+       "epoch": 8e-05,
+       "grad_norm": 0.004157723858952522,
+       "kl": 4.364301761228262e-05,
+       "learning_rate": 1.6e-06,
+       "loss": 0.0,
+       "step": 8,
+       "step_time": 16.163697053000305
+     },
+     {
+       "clip_ratio/high_max": 0.0,
+       "clip_ratio/high_mean": 0.0,
+       "clip_ratio/low_mean": 0.0,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.0,
+       "completions/clipped_ratio": 0.0,
+       "completions/max_length": 3.0,
+       "completions/max_terminated_length": 3.0,
+       "completions/mean_length": 2.875,
+       "completions/mean_terminated_length": 2.875,
+       "completions/min_length": 2.0,
+       "completions/min_terminated_length": 2.0,
+       "entropy": 0.5464204885065556,
+       "epoch": 9e-05,
+       "frac_reward_zero_std": 0.0,
+       "grad_norm": 0.003582538105547428,
+       "kl": 0.0011944683974434156,
+       "learning_rate": 1.8285714285714284e-06,
+       "loss": 0.0,
+       "num_tokens": 757249.0,
+       "reward": -7.697338104248047,
+       "reward_std": 8.61632251739502,
+       "rewards/rollout_reward_func/mean": -7.697338104248047,
+       "rewards/rollout_reward_func/std": 8.61632251739502,
+       "sampling/importance_sampling_ratio/max": 1.1465768814086914,
+       "sampling/importance_sampling_ratio/mean": 0.9066257476806641,
+       "sampling/importance_sampling_ratio/min": 0.8192122578620911,
+       "sampling/sampling_logp_difference/max": 0.23310279846191406,
+       "sampling/sampling_logp_difference/mean": 0.042134545743465424,
+       "step": 9,
+       "step_time": 87.67375510900001
+     },
+     {
+       "clip_ratio/high_max": 0.0,
+       "clip_ratio/high_mean": 0.0,
+       "clip_ratio/low_mean": 0.012500000186264515,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.012500000186264515,
+       "entropy": 0.5470796823501587,
+       "epoch": 0.0001,
+       "grad_norm": 0.004420289304107428,
+       "kl": 0.0015031041582567184,
+       "learning_rate": 2.057142857142857e-06,
+       "loss": -0.0,
+       "step": 10,
+       "step_time": 16.138566152999374
+     },
+     {
+       "clip_ratio/high_max": 0.0,
+       "clip_ratio/high_mean": 0.0,
+       "clip_ratio/low_mean": 0.010416666977107525,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.010416666977107525,
+       "completions/clipped_ratio": 0.0,
+       "completions/max_length": 3.0,
+       "completions/max_terminated_length": 3.0,
+       "completions/mean_length": 3.0,
+       "completions/mean_terminated_length": 3.0,
+       "completions/min_length": 3.0,
+       "completions/min_terminated_length": 3.0,
+       "entropy": 0.4877597391605377,
+       "epoch": 0.00011,
+       "frac_reward_zero_std": 0.0,
+       "grad_norm": 0.0015756486682221293,
+       "kl": 0.0009349783420873337,
+       "learning_rate": 2.2857142857142856e-06,
+       "loss": 0.0,
+       "num_tokens": 910271.0,
+       "reward": -1.2807047367095947,
+       "reward_std": 11.581576347351074,
+       "rewards/rollout_reward_func/mean": -1.2807047367095947,
+       "rewards/rollout_reward_func/std": 11.581576347351074,
+       "sampling/importance_sampling_ratio/max": 1.075101375579834,
+       "sampling/importance_sampling_ratio/mean": 0.9127196669578552,
+       "sampling/importance_sampling_ratio/min": 0.7523821592330933,
+       "sampling/sampling_logp_difference/max": 0.17058521509170532,
+       "sampling/sampling_logp_difference/mean": 0.04037006199359894,
+       "step": 11,
+       "step_time": 84.74441413100021
+     },
+     {
+       "clip_ratio/high_max": 0.0,
+       "clip_ratio/high_mean": 0.0,
+       "clip_ratio/low_mean": 0.0,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.0,
+       "entropy": 0.48288414254784584,
+       "epoch": 0.00012,
+       "grad_norm": 0.0031028895173221827,
+       "kl": 0.00019705418833382282,
+       "learning_rate": 2.5142857142857142e-06,
+       "loss": 0.0,
+       "step": 12,
+       "step_time": 16.036360359999435
+     },
+     {
+       "clip_ratio/high_max": 0.0,
+       "clip_ratio/high_mean": 0.0,
+       "clip_ratio/low_mean": 0.0,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.0,
+       "completions/clipped_ratio": 0.0,
+       "completions/max_length": 4.0,
+       "completions/max_terminated_length": 4.0,
+       "completions/mean_length": 2.9375,
+       "completions/mean_terminated_length": 2.9375,
+       "completions/min_length": 2.0,
+       "completions/min_terminated_length": 2.0,
+       "entropy": 0.49491142481565475,
+       "epoch": 0.00013,
+       "frac_reward_zero_std": 0.0,
+       "grad_norm": 0.0027884214650839567,
+       "kl": 0.0024777476012687316,
+       "learning_rate": 2.742857142857143e-06,
+       "loss": 0.0,
+       "num_tokens": 1049539.0,
+       "reward": -7.930685997009277,
+       "reward_std": 10.647638320922852,
+       "rewards/rollout_reward_func/mean": -7.930685997009277,
+       "rewards/rollout_reward_func/std": 10.647637367248535,
325
+ "sampling/importance_sampling_ratio/max": 1.1072638034820557,
326
+ "sampling/importance_sampling_ratio/mean": 0.9023596048355103,
327
+ "sampling/importance_sampling_ratio/min": 0.5414906740188599,
328
+ "sampling/sampling_logp_difference/max": 0.4979434013366699,
329
+ "sampling/sampling_logp_difference/mean": 0.0404733270406723,
330
+ "step": 13,
331
+ "step_time": 80.93550676899986
332
+ },
333
+ {
334
+ "clip_ratio/high_max": 0.0,
335
+ "clip_ratio/high_mean": 0.0,
336
+ "clip_ratio/low_mean": 0.0,
337
+ "clip_ratio/low_min": 0.0,
338
+ "clip_ratio/region_mean": 0.0,
339
+ "entropy": 0.503012377768755,
340
+ "epoch": 0.00014,
341
+ "grad_norm": 0.002723120851442218,
342
+ "kl": 0.00012674138329771267,
343
+ "learning_rate": 2.9714285714285716e-06,
344
+ "loss": 0.0,
345
+ "step": 14,
346
+ "step_time": 16.415520546000607
347
+ },
348
+ {
349
+ "clip_ratio/high_max": 0.0,
350
+ "clip_ratio/high_mean": 0.0,
351
+ "clip_ratio/low_mean": 0.0,
352
+ "clip_ratio/low_min": 0.0,
353
+ "clip_ratio/region_mean": 0.0,
354
+ "completions/clipped_ratio": 0.0,
355
+ "completions/max_length": 3.0,
356
+ "completions/max_terminated_length": 3.0,
357
+ "completions/mean_length": 2.96875,
358
+ "completions/mean_terminated_length": 2.96875,
359
+ "completions/min_length": 2.0,
360
+ "completions/min_terminated_length": 2.0,
361
+ "entropy": 0.5103708207607269,
362
+ "epoch": 0.00015,
363
+ "frac_reward_zero_std": 0.0,
364
+ "grad_norm": 0.0012367322342470288,
365
+ "kl": 0.0009814446365226104,
366
+ "learning_rate": 3.2e-06,
367
+ "loss": -0.0,
368
+ "num_tokens": 1198467.0,
369
+ "reward": -1.7910524606704712,
370
+ "reward_std": 14.345645904541016,
371
+ "rewards/rollout_reward_func/mean": -1.7910524606704712,
372
+ "rewards/rollout_reward_func/std": 14.345645904541016,
373
+ "sampling/importance_sampling_ratio/max": 0.9826624393463135,
374
+ "sampling/importance_sampling_ratio/mean": 0.9046387672424316,
375
+ "sampling/importance_sampling_ratio/min": 0.8640578985214233,
376
+ "sampling/sampling_logp_difference/max": 0.1460454910993576,
377
+ "sampling/sampling_logp_difference/mean": 0.03705396130681038,
378
+ "step": 15,
379
+ "step_time": 84.01811012399958
380
+ },
381
+ {
382
+ "clip_ratio/high_max": 0.0,
383
+ "clip_ratio/high_mean": 0.0,
384
+ "clip_ratio/low_mean": 0.0,
385
+ "clip_ratio/low_min": 0.0,
386
+ "clip_ratio/region_mean": 0.0,
387
+ "entropy": 0.5053780749440193,
388
+ "epoch": 0.00016,
389
+ "grad_norm": 0.0012754682684317231,
390
+ "kl": 0.0003874969786465954,
391
+ "learning_rate": 3.428571428571428e-06,
392
+ "loss": -0.0,
393
+ "step": 16,
394
+ "step_time": 16.14125642999943
395
+ },
396
+ {
397
+ "clip_ratio/high_max": 0.0,
398
+ "clip_ratio/high_mean": 0.0,
399
+ "clip_ratio/low_mean": 0.0,
400
+ "clip_ratio/low_min": 0.0,
401
+ "clip_ratio/region_mean": 0.0,
402
+ "completions/clipped_ratio": 0.0,
403
+ "completions/max_length": 4.0,
404
+ "completions/max_terminated_length": 4.0,
405
+ "completions/mean_length": 3.0,
406
+ "completions/mean_terminated_length": 3.0,
407
+ "completions/min_length": 2.0,
408
+ "completions/min_terminated_length": 2.0,
409
+ "entropy": 0.4758913367986679,
410
+ "epoch": 0.00017,
411
+ "frac_reward_zero_std": 0.0,
412
+ "grad_norm": 0.000835731509141624,
413
+ "kl": 2.2824605100169038e-05,
414
+ "learning_rate": 3.657142857142857e-06,
415
+ "loss": -0.0,
416
+ "num_tokens": 1352552.0,
417
+ "reward": -2.025902271270752,
418
+ "reward_std": 9.88846492767334,
419
+ "rewards/rollout_reward_func/mean": -2.025902271270752,
420
+ "rewards/rollout_reward_func/std": 9.88846492767334,
421
+ "sampling/importance_sampling_ratio/max": 0.9374787211418152,
422
+ "sampling/importance_sampling_ratio/mean": 0.9018987417221069,
423
+ "sampling/importance_sampling_ratio/min": 0.8059737086296082,
424
+ "sampling/sampling_logp_difference/max": 0.1600051373243332,
425
+ "sampling/sampling_logp_difference/mean": 0.034846384078264236,
426
+ "step": 17,
427
+ "step_time": 87.90823437400013
428
+ },
429
+ {
430
+ "clip_ratio/high_max": 0.0,
431
+ "clip_ratio/high_mean": 0.0,
432
+ "clip_ratio/low_mean": 0.0,
433
+ "clip_ratio/low_min": 0.0,
434
+ "clip_ratio/region_mean": 0.0,
435
+ "entropy": 0.47047749534249306,
436
+ "epoch": 0.00018,
437
+ "grad_norm": 0.000915601325687021,
438
+ "kl": 3.488661536721338e-05,
439
+ "learning_rate": 3.885714285714286e-06,
440
+ "loss": -0.0,
441
+ "step": 18,
442
+ "step_time": 16.648166115000095
443
+ },
444
+ {
445
+ "clip_ratio/high_max": 0.0,
446
+ "clip_ratio/high_mean": 0.0,
447
+ "clip_ratio/low_mean": 0.0,
448
+ "clip_ratio/low_min": 0.0,
449
+ "clip_ratio/region_mean": 0.0,
450
+ "completions/clipped_ratio": 0.0,
451
+ "completions/max_length": 4.0,
452
+ "completions/max_terminated_length": 4.0,
453
+ "completions/mean_length": 3.0,
454
+ "completions/mean_terminated_length": 3.0,
455
+ "completions/min_length": 2.0,
456
+ "completions/min_terminated_length": 2.0,
457
+ "entropy": 0.4496545232832432,
458
+ "epoch": 0.00019,
459
+ "frac_reward_zero_std": 0.0,
460
+ "grad_norm": 0.0004858207830693573,
461
+ "kl": 3.536299266215792e-05,
462
+ "learning_rate": 4.114285714285714e-06,
463
+ "loss": 0.0,
464
+ "num_tokens": 1501094.0,
465
+ "reward": -3.7144558429718018,
466
+ "reward_std": 11.072718620300293,
467
+ "rewards/rollout_reward_func/mean": -3.7144558429718018,
468
+ "rewards/rollout_reward_func/std": 11.072717666625977,
469
+ "sampling/importance_sampling_ratio/max": 0.9308063983917236,
470
+ "sampling/importance_sampling_ratio/mean": 0.899206817150116,
471
+ "sampling/importance_sampling_ratio/min": 0.66890549659729,
472
+ "sampling/sampling_logp_difference/max": 0.32890522480010986,
473
+ "sampling/sampling_logp_difference/mean": 0.03617248684167862,
474
+ "step": 19,
475
+ "step_time": 80.7782728230004
476
+ },
477
+ {
478
+ "clip_ratio/high_max": 0.0,
479
+ "clip_ratio/high_mean": 0.0,
480
+ "clip_ratio/low_mean": 0.0,
481
+ "clip_ratio/low_min": 0.0,
482
+ "clip_ratio/region_mean": 0.0,
483
+ "entropy": 0.44183528050780296,
484
+ "epoch": 0.0002,
485
+ "grad_norm": 0.00042393436888232827,
486
+ "kl": 0.00015307544185816369,
487
+ "learning_rate": 4.342857142857142e-06,
488
+ "loss": 0.0,
489
+ "step": 20,
490
+ "step_time": 16.09888975399963
491
+ },
492
+ {
493
+ "clip_ratio/high_max": 0.0,
494
+ "clip_ratio/high_mean": 0.0,
495
+ "clip_ratio/low_mean": 0.0,
496
+ "clip_ratio/low_min": 0.0,
497
+ "clip_ratio/region_mean": 0.0,
498
+ "completions/clipped_ratio": 0.0,
499
+ "completions/max_length": 3.0,
500
+ "completions/max_terminated_length": 3.0,
501
+ "completions/mean_length": 2.9375,
502
+ "completions/mean_terminated_length": 2.9375,
503
+ "completions/min_length": 2.0,
504
+ "completions/min_terminated_length": 2.0,
505
+ "entropy": 0.5056380964815617,
506
+ "epoch": 0.00021,
507
+ "frac_reward_zero_std": 0.0,
508
+ "grad_norm": 0.00032198833650909364,
509
+ "kl": 4.869016663633374e-05,
510
+ "learning_rate": 4.571428571428571e-06,
511
+ "loss": -0.0,
512
+ "num_tokens": 1651955.0,
513
+ "reward": -1.5917177200317383,
514
+ "reward_std": 13.17538070678711,
515
+ "rewards/rollout_reward_func/mean": -1.5917177200317383,
516
+ "rewards/rollout_reward_func/std": 13.175381660461426,
517
+ "sampling/importance_sampling_ratio/max": 0.9360762238502502,
518
+ "sampling/importance_sampling_ratio/mean": 0.9005730152130127,
519
+ "sampling/importance_sampling_ratio/min": 0.5624815821647644,
520
+ "sampling/sampling_logp_difference/max": 0.6438815593719482,
521
+ "sampling/sampling_logp_difference/mean": 0.038462258875370026,
522
+ "step": 21,
523
+ "step_time": 86.13508222099972
524
+ },
525
+ {
526
+ "clip_ratio/high_max": 0.0,
527
+ "clip_ratio/high_mean": 0.0,
528
+ "clip_ratio/low_mean": 0.0,
529
+ "clip_ratio/low_min": 0.0,
530
+ "clip_ratio/region_mean": 0.0,
531
+ "entropy": 0.4966563805937767,
532
+ "epoch": 0.00022,
533
+ "grad_norm": 0.0003532824048306793,
534
+ "kl": 8.798080261840369e-05,
535
+ "learning_rate": 4.8e-06,
536
+ "loss": -0.0,
537
+ "step": 22,
538
+ "step_time": 16.589822670999638
539
+ },
540
+ {
541
+ "clip_ratio/high_max": 0.0,
542
+ "clip_ratio/high_mean": 0.0,
543
+ "clip_ratio/low_mean": 0.0,
544
+ "clip_ratio/low_min": 0.0,
545
+ "clip_ratio/region_mean": 0.0,
546
+ "completions/clipped_ratio": 0.0,
547
+ "completions/max_length": 3.0,
548
+ "completions/max_terminated_length": 3.0,
549
+ "completions/mean_length": 2.96875,
550
+ "completions/mean_terminated_length": 2.96875,
551
+ "completions/min_length": 2.0,
552
+ "completions/min_terminated_length": 2.0,
553
+ "entropy": 0.46145135909318924,
554
+ "epoch": 0.00023,
555
+ "frac_reward_zero_std": 0.0,
556
+ "grad_norm": 0.0017623284365981817,
557
+ "kl": 8.294408689835109e-05,
558
+ "learning_rate": 5.0285714285714285e-06,
559
+ "loss": 0.0,
560
+ "num_tokens": 1801431.0,
561
+ "reward": -3.1079812049865723,
562
+ "reward_std": 15.382396697998047,
563
+ "rewards/rollout_reward_func/mean": -3.1079812049865723,
564
+ "rewards/rollout_reward_func/std": 15.382396697998047,
565
+ "sampling/importance_sampling_ratio/max": 0.9533551335334778,
566
+ "sampling/importance_sampling_ratio/mean": 0.9110425710678101,
567
+ "sampling/importance_sampling_ratio/min": 0.8689562678337097,
568
+ "sampling/sampling_logp_difference/max": 0.12607532739639282,
569
+ "sampling/sampling_logp_difference/mean": 0.031991466879844666,
570
+ "step": 23,
571
+ "step_time": 84.70673029099953
572
+ },
573
+ {
574
+ "clip_ratio/high_max": 0.0,
575
+ "clip_ratio/high_mean": 0.0,
576
+ "clip_ratio/low_mean": 0.0,
577
+ "clip_ratio/low_min": 0.0,
578
+ "clip_ratio/region_mean": 0.0,
579
+ "entropy": 0.4546354450285435,
580
+ "epoch": 0.00024,
581
+ "grad_norm": 0.002184377983212471,
582
+ "kl": 0.00013984292854729574,
583
+ "learning_rate": 5.257142857142857e-06,
584
+ "loss": 0.0,
585
+ "step": 24,
586
+ "step_time": 15.989325058999839
587
+ },
588
+ {
589
+ "clip_ratio/high_max": 0.02083333395421505,
590
+ "clip_ratio/high_mean": 0.010416666977107525,
591
+ "clip_ratio/low_mean": 0.0,
592
+ "clip_ratio/low_min": 0.0,
593
+ "clip_ratio/region_mean": 0.010416666977107525,
594
+ "completions/clipped_ratio": 0.0,
595
+ "completions/max_length": 3.0,
596
+ "completions/max_terminated_length": 3.0,
597
+ "completions/mean_length": 2.9375,
598
+ "completions/mean_terminated_length": 2.9375,
599
+ "completions/min_length": 2.0,
600
+ "completions/min_terminated_length": 2.0,
601
+ "entropy": 0.50193877145648,
602
+ "epoch": 0.00025,
603
+ "frac_reward_zero_std": 0.0,
604
+ "grad_norm": 0.0028643528930842876,
605
+ "kl": 0.0006277600778048509,
606
+ "learning_rate": 5.485714285714286e-06,
607
+ "loss": -0.0,
608
+ "num_tokens": 1954111.0,
609
+ "reward": -1.1967151165008545,
610
+ "reward_std": 9.904693603515625,
611
+ "rewards/rollout_reward_func/mean": -1.1967151165008545,
612
+ "rewards/rollout_reward_func/std": 9.904692649841309,
613
+ "sampling/importance_sampling_ratio/max": 1.0367523431777954,
614
+ "sampling/importance_sampling_ratio/mean": 0.9070800542831421,
615
+ "sampling/importance_sampling_ratio/min": 0.641558825969696,
616
+ "sampling/sampling_logp_difference/max": 0.2708436846733093,
617
+ "sampling/sampling_logp_difference/mean": 0.03570127859711647,
618
+ "step": 25,
619
+ "step_time": 88.23927604999926
620
+ },
621
+ {
622
+ "clip_ratio/high_max": 0.0,
623
+ "clip_ratio/high_mean": 0.0,
624
+ "clip_ratio/low_mean": 0.0,
625
+ "clip_ratio/low_min": 0.0,
626
+ "clip_ratio/region_mean": 0.0,
627
+ "entropy": 0.493778470903635,
628
+ "epoch": 0.00026,
629
+ "grad_norm": 0.002861190587282181,
630
+ "kl": 0.006509271008326323,
631
+ "learning_rate": 5.7142857142857145e-06,
632
+ "loss": -0.0,
633
+ "step": 26,
634
+ "step_time": 17.19145845399953
635
+ },
636
+ {
637
+ "clip_ratio/high_max": 0.0,
638
+ "clip_ratio/high_mean": 0.0,
639
+ "clip_ratio/low_mean": 0.0,
640
+ "clip_ratio/low_min": 0.0,
641
+ "clip_ratio/region_mean": 0.0,
642
+ "completions/clipped_ratio": 0.0,
643
+ "completions/max_length": 3.0,
644
+ "completions/max_terminated_length": 3.0,
645
+ "completions/mean_length": 2.90625,
646
+ "completions/mean_terminated_length": 2.90625,
647
+ "completions/min_length": 2.0,
648
+ "completions/min_terminated_length": 2.0,
649
+ "entropy": 0.4369519129395485,
650
+ "epoch": 0.00027,
651
+ "frac_reward_zero_std": 0.0,
652
+ "grad_norm": 0.0023894549813121557,
653
+ "kl": 0.00021573404501395999,
654
+ "learning_rate": 5.942857142857143e-06,
655
+ "loss": 0.0,
656
+ "num_tokens": 2103766.0,
657
+ "reward": -4.149303436279297,
658
+ "reward_std": 11.891597747802734,
659
+ "rewards/rollout_reward_func/mean": -4.149303436279297,
660
+ "rewards/rollout_reward_func/std": 11.891597747802734,
661
+ "sampling/importance_sampling_ratio/max": 0.9366415739059448,
662
+ "sampling/importance_sampling_ratio/mean": 0.9109276533126831,
663
+ "sampling/importance_sampling_ratio/min": 0.7386661171913147,
664
+ "sampling/sampling_logp_difference/max": 0.17405521869659424,
665
+ "sampling/sampling_logp_difference/mean": 0.03263023495674133,
666
+ "step": 27,
667
+ "step_time": 84.90730395599985
668
+ },
669
+ {
670
+ "clip_ratio/high_max": 0.0,
671
+ "clip_ratio/high_mean": 0.0,
672
+ "clip_ratio/low_mean": 0.0,
673
+ "clip_ratio/low_min": 0.0,
674
+ "clip_ratio/region_mean": 0.0,
675
+ "entropy": 0.4287884756922722,
676
+ "epoch": 0.00028,
677
+ "grad_norm": 0.003158987034112215,
678
+ "kl": 0.0010838782172868378,
679
+ "learning_rate": 6.171428571428571e-06,
680
+ "loss": 0.0,
681
+ "step": 28,
682
+ "step_time": 16.073362338999686
683
+ },
684
+ {
685
+ "clip_ratio/high_max": 0.0,
686
+ "clip_ratio/high_mean": 0.0,
687
+ "clip_ratio/low_mean": 0.0,
688
+ "clip_ratio/low_min": 0.0,
689
+ "clip_ratio/region_mean": 0.0,
690
+ "completions/clipped_ratio": 0.0,
691
+ "completions/max_length": 3.0,
692
+ "completions/max_terminated_length": 3.0,
693
+ "completions/mean_length": 2.90625,
694
+ "completions/mean_terminated_length": 2.90625,
695
+ "completions/min_length": 2.0,
696
+ "completions/min_terminated_length": 2.0,
697
+ "entropy": 0.47590041905641556,
698
+ "epoch": 0.00029,
699
+ "frac_reward_zero_std": 0.0,
700
+ "grad_norm": 0.0019009847892448306,
701
+ "kl": 0.002789211330764374,
702
+ "learning_rate": 6.4e-06,
703
+ "loss": -0.0,
704
+ "num_tokens": 2255958.0,
705
+ "reward": -1.561665415763855,
706
+ "reward_std": 11.948278427124023,
707
+ "rewards/rollout_reward_func/mean": -1.561665415763855,
708
+ "rewards/rollout_reward_func/std": 11.948278427124023,
709
+ "sampling/importance_sampling_ratio/max": 0.951849639415741,
710
+ "sampling/importance_sampling_ratio/mean": 0.8972554802894592,
711
+ "sampling/importance_sampling_ratio/min": 0.328483909368515,
712
+ "sampling/sampling_logp_difference/max": 1.0063753128051758,
713
+ "sampling/sampling_logp_difference/mean": 0.04340764880180359,
714
+ "step": 29,
715
+ "step_time": 91.08275033600057
716
+ },
717
+ {
718
+ "clip_ratio/high_max": 0.0,
719
+ "clip_ratio/high_mean": 0.0,
720
+ "clip_ratio/low_mean": 0.0,
721
+ "clip_ratio/low_min": 0.0,
722
+ "clip_ratio/region_mean": 0.0,
723
+ "entropy": 0.47105172276496887,
724
+ "epoch": 0.0003,
725
+ "grad_norm": 0.002664634957909584,
726
+ "kl": 0.0012541578576019674,
727
+ "learning_rate": 6.628571428571428e-06,
728
+ "loss": -0.0,
729
+ "step": 30,
730
+ "step_time": 16.112164762000702
731
+ },
732
+ {
733
+ "clip_ratio/high_max": 0.0,
734
+ "clip_ratio/high_mean": 0.0,
735
+ "clip_ratio/low_mean": 0.0,
736
+ "clip_ratio/low_min": 0.0,
737
+ "clip_ratio/region_mean": 0.0,
738
+ "completions/clipped_ratio": 0.0,
739
+ "completions/max_length": 3.0,
740
+ "completions/max_terminated_length": 3.0,
741
+ "completions/mean_length": 3.0,
742
+ "completions/mean_terminated_length": 3.0,
743
+ "completions/min_length": 3.0,
744
+ "completions/min_terminated_length": 3.0,
745
+ "entropy": 0.3540297709405422,
746
+ "epoch": 0.00031,
747
+ "frac_reward_zero_std": 0.0,
748
+ "grad_norm": 0.00012217683251947165,
749
+ "kl": 0.00030874026197125204,
750
+ "learning_rate": 6.857142857142856e-06,
751
+ "loss": -0.0,
752
+ "num_tokens": 2407027.0,
753
+ "reward": 0.8819369077682495,
754
+ "reward_std": 9.408540725708008,
755
+ "rewards/rollout_reward_func/mean": 0.8819369077682495,
756
+ "rewards/rollout_reward_func/std": 9.408540725708008,
757
+ "sampling/importance_sampling_ratio/max": 0.9496375322341919,
758
+ "sampling/importance_sampling_ratio/mean": 0.926586389541626,
759
+ "sampling/importance_sampling_ratio/min": 0.8927135467529297,
760
+ "sampling/sampling_logp_difference/max": 0.11497166007757187,
761
+ "sampling/sampling_logp_difference/mean": 0.025536321103572845,
762
+ "step": 31,
763
+ "step_time": 87.01734963399849
764
+ },
765
+ {
766
+ "clip_ratio/high_max": 0.0,
767
+ "clip_ratio/high_mean": 0.0,
768
+ "clip_ratio/low_mean": 0.0,
769
+ "clip_ratio/low_min": 0.0,
770
+ "clip_ratio/region_mean": 0.0,
771
+ "entropy": 0.34124791249632835,
772
+ "epoch": 0.00032,
773
+ "grad_norm": 0.00012294144835323095,
774
+ "kl": 0.0003481047842797125,
775
+ "learning_rate": 7.085714285714285e-06,
776
+ "loss": -0.0,
777
+ "step": 32,
778
+ "step_time": 16.119427530000394
779
+ },
780
+ {
781
+ "clip_ratio/high_max": 0.0,
782
+ "clip_ratio/high_mean": 0.0,
783
+ "clip_ratio/low_mean": 0.0,
784
+ "clip_ratio/low_min": 0.0,
785
+ "clip_ratio/region_mean": 0.0,
786
+ "completions/clipped_ratio": 0.0,
787
+ "completions/max_length": 3.0,
788
+ "completions/max_terminated_length": 3.0,
789
+ "completions/mean_length": 2.9375,
790
+ "completions/mean_terminated_length": 2.9375,
791
+ "completions/min_length": 2.0,
792
+ "completions/min_terminated_length": 2.0,
793
+ "entropy": 0.37663908302783966,
794
+ "epoch": 0.00033,
795
+ "frac_reward_zero_std": 0.0,
796
+ "grad_norm": 0.00028033516719006,
797
+ "kl": 0.00011349991063980269,
798
+ "learning_rate": 7.314285714285714e-06,
799
+ "loss": 0.0,
800
+ "num_tokens": 2556831.0,
801
+ "reward": -5.249267101287842,
802
+ "reward_std": 9.296294212341309,
803
+ "rewards/rollout_reward_func/mean": -5.249267101287842,
804
+ "rewards/rollout_reward_func/std": 9.296294212341309,
805
+ "sampling/importance_sampling_ratio/max": 0.9602704644203186,
806
+ "sampling/importance_sampling_ratio/mean": 0.9252135753631592,
807
+ "sampling/importance_sampling_ratio/min": 0.8558757305145264,
808
+ "sampling/sampling_logp_difference/max": 0.1554594784975052,
809
+ "sampling/sampling_logp_difference/mean": 0.027311213314533234,
810
+ "step": 33,
811
+ "step_time": 91.05606818099977
812
+ },
813
+ {
814
+ "clip_ratio/high_max": 0.0,
815
+ "clip_ratio/high_mean": 0.0,
816
+ "clip_ratio/low_mean": 0.0,
817
+ "clip_ratio/low_min": 0.0,
818
+ "clip_ratio/region_mean": 0.0,
819
+ "entropy": 0.37676023691892624,
820
+ "epoch": 0.00034,
821
+ "grad_norm": 0.0002865525020752102,
822
+ "kl": 0.00011139449998154305,
823
+ "learning_rate": 7.542857142857142e-06,
824
+ "loss": 0.0,
825
+ "step": 34,
826
+ "step_time": 16.66090228600069
827
+ },
828
+ {
829
+ "clip_ratio/high_max": 0.0,
830
+ "clip_ratio/high_mean": 0.0,
831
+ "clip_ratio/low_mean": 0.010416666977107525,
832
+ "clip_ratio/low_min": 0.0,
833
+ "clip_ratio/region_mean": 0.010416666977107525,
834
+ "completions/clipped_ratio": 0.0,
835
+ "completions/max_length": 3.0,
836
+ "completions/max_terminated_length": 3.0,
837
+ "completions/mean_length": 3.0,
838
+ "completions/mean_terminated_length": 3.0,
839
+ "completions/min_length": 3.0,
840
+ "completions/min_terminated_length": 3.0,
841
+ "entropy": 0.38403602689504623,
842
+ "epoch": 0.00035,
843
+ "frac_reward_zero_std": 0.0,
844
+ "grad_norm": 0.0007655026274733245,
845
+ "kl": 0.0008573867726227036,
846
+ "learning_rate": 7.771428571428572e-06,
847
+ "loss": 0.0,
848
+ "num_tokens": 2711341.0,
849
+ "reward": -3.2598910331726074,
850
+ "reward_std": 11.653153419494629,
851
+ "rewards/rollout_reward_func/mean": -3.2598910331726074,
852
+ "rewards/rollout_reward_func/std": 11.653153419494629,
853
+ "sampling/importance_sampling_ratio/max": 1.3437409400939941,
854
+ "sampling/importance_sampling_ratio/mean": 0.9472247362136841,
855
+ "sampling/importance_sampling_ratio/min": 0.872779130935669,
856
+ "sampling/sampling_logp_difference/max": 0.351446270942688,
857
+ "sampling/sampling_logp_difference/mean": 0.03257140517234802,
858
+ "step": 35,
859
+ "step_time": 97.56342971100003
860
+ },
861
+ {
862
+ "clip_ratio/high_max": 0.0,
863
+ "clip_ratio/high_mean": 0.0,
864
+ "clip_ratio/low_mean": 0.010416666977107525,
865
+ "clip_ratio/low_min": 0.0,
866
+ "clip_ratio/region_mean": 0.010416666977107525,
867
+ "entropy": 0.37373727560043335,
868
+ "epoch": 0.00036,
869
+ "grad_norm": 0.0005624265759252012,
870
+ "kl": 0.0011794225774792721,
871
+ "learning_rate": 8e-06,
872
+ "loss": 0.0,
873
+ "step": 36,
874
+ "step_time": 16.08771912999964
875
+ },
876
+ {
877
+ "clip_ratio/high_max": 0.0,
878
+ "clip_ratio/high_mean": 0.0,
879
+ "clip_ratio/low_mean": 0.0,
880
+ "clip_ratio/low_min": 0.0,
881
+ "clip_ratio/region_mean": 0.0,
882
+ "completions/clipped_ratio": 0.0,
883
+ "completions/max_length": 3.0,
884
+ "completions/max_terminated_length": 3.0,
885
+ "completions/mean_length": 3.0,
886
+ "completions/mean_terminated_length": 3.0,
887
+ "completions/min_length": 3.0,
888
+ "completions/min_terminated_length": 3.0,
889
+ "entropy": 0.3612252026796341,
890
+ "epoch": 0.00037,
891
+ "frac_reward_zero_std": 0.0,
892
+ "grad_norm": 0.003778834594413638,
893
+ "kl": 0.005491923870067694,
894
+ "learning_rate": 7.999999999958871e-06,
895
+ "loss": 0.0,
896
+ "num_tokens": 2860979.0,
897
+ "reward": -3.204911947250366,
898
+ "reward_std": 13.321104049682617,
899
+ "rewards/rollout_reward_func/mean": -3.204911947250366,
900
+ "rewards/rollout_reward_func/std": 13.3211030960083,
901
+ "sampling/importance_sampling_ratio/max": 0.9534703493118286,
902
+ "sampling/importance_sampling_ratio/mean": 0.9272962808609009,
903
+ "sampling/importance_sampling_ratio/min": 0.812048077583313,
904
+ "sampling/sampling_logp_difference/max": 0.125631183385849,
905
+ "sampling/sampling_logp_difference/mean": 0.027381815016269684,
906
+ "step": 37,
907
+ "step_time": 93.92803074699941
908
+ },
909
+ {
910
+ "clip_ratio/high_max": 0.0416666679084301,
911
+ "clip_ratio/high_mean": 0.02083333395421505,
912
+ "clip_ratio/low_mean": 0.0,
913
+ "clip_ratio/low_min": 0.0,
914
+ "clip_ratio/region_mean": 0.02083333395421505,
915
+ "entropy": 0.3561956323683262,
916
+ "epoch": 0.00038,
917
+ "grad_norm": 0.00041312171379104257,
918
+ "kl": 0.009066745038580848,
919
+ "learning_rate": 7.999999999835487e-06,
920
+ "loss": -0.0,
921
+ "step": 38,
922
+ "step_time": 16.57034048200012
923
+ },
924
+ {
925
+ "clip_ratio/high_max": 0.0,
926
+ "clip_ratio/high_mean": 0.0,
927
+ "clip_ratio/low_mean": 0.0,
928
+ "clip_ratio/low_min": 0.0,
929
+ "clip_ratio/region_mean": 0.0,
930
+ "completions/clipped_ratio": 0.0,
931
+ "completions/max_length": 3.0,
932
+ "completions/max_terminated_length": 3.0,
933
+ "completions/mean_length": 2.90625,
934
+ "completions/mean_terminated_length": 2.90625,
935
+ "completions/min_length": 2.0,
936
+ "completions/min_terminated_length": 2.0,
937
+ "entropy": 0.37458185851573944,
938
+ "epoch": 0.00039,
939
+ "frac_reward_zero_std": 0.0,
940
+ "grad_norm": 0.003013370558619499,
941
+ "kl": 0.0015790614006618853,
942
+ "learning_rate": 7.999999999629846e-06,
943
+ "loss": -0.0,
944
+ "num_tokens": 3014870.0,
945
+ "reward": -4.495431900024414,
946
+ "reward_std": 9.237488746643066,
947
+ "rewards/rollout_reward_func/mean": -4.495431900024414,
948
+ "rewards/rollout_reward_func/std": 9.237488746643066,
949
+ "sampling/importance_sampling_ratio/max": 0.955286979675293,
950
+ "sampling/importance_sampling_ratio/mean": 0.9233821630477905,
951
+ "sampling/importance_sampling_ratio/min": 0.7778754234313965,
952
+ "sampling/sampling_logp_difference/max": 0.19711723923683167,
953
+ "sampling/sampling_logp_difference/mean": 0.028256624937057495,
954
+ "step": 39,
955
+ "step_time": 95.70763315500017
956
+ },
957
+ {
958
+ "clip_ratio/high_max": 0.0,
959
+ "clip_ratio/high_mean": 0.0,
960
+ "clip_ratio/low_mean": 0.0,
961
+ "clip_ratio/low_min": 0.0,
962
+ "clip_ratio/region_mean": 0.0,
963
+ "entropy": 0.3661472350358963,
964
+ "epoch": 0.0004,
965
+ "grad_norm": 0.0022348016500473022,
966
+ "kl": 0.0027103804022772238,
967
+ "learning_rate": 7.99999999934195e-06,
968
+ "loss": -0.0,
969
+ "step": 40,
970
+ "step_time": 16.061642451000353
971
+ },
972
+ {
973
+ "clip_ratio/high_max": 0.0,
974
+ "clip_ratio/high_mean": 0.0,
975
+ "clip_ratio/low_mean": 0.0,
976
+ "clip_ratio/low_min": 0.0,
977
+ "clip_ratio/region_mean": 0.0,
978
+ "completions/clipped_ratio": 0.0,
979
+ "completions/max_length": 3.0,
980
+ "completions/max_terminated_length": 3.0,
981
+ "completions/mean_length": 2.96875,
982
+ "completions/mean_terminated_length": 2.96875,
983
+ "completions/min_length": 2.0,
984
+ "completions/min_terminated_length": 2.0,
985
+ "entropy": 0.28179030306637287,
986
+ "epoch": 0.00041,
987
+ "frac_reward_zero_std": 0.0,
988
+ "grad_norm": 6.525115895783529e-05,
989
+ "kl": 0.000415502121541067,
990
+ "learning_rate": 7.999999998971795e-06,
991
+ "loss": -0.0,
992
+ "num_tokens": 3164817.0,
993
+ "reward": 2.3517534732818604,
994
+ "reward_std": 12.381294250488281,
995
+ "rewards/rollout_reward_func/mean": 2.3517534732818604,
996
+ "rewards/rollout_reward_func/std": 12.381294250488281,
997
+ "sampling/importance_sampling_ratio/max": 0.9573846459388733,
998
+ "sampling/importance_sampling_ratio/mean": 0.9431747198104858,
999
+ "sampling/importance_sampling_ratio/min": 0.9145121574401855,
1000
+ "sampling/sampling_logp_difference/max": 0.08903493732213974,
1001
+ "sampling/sampling_logp_difference/mean": 0.01973060891032219,
1002
+ "step": 41,
1003
+ "step_time": 90.856857966
1004
+ },
1005
+ {
1006
+ "clip_ratio/high_max": 0.0,
1007
+ "clip_ratio/high_mean": 0.0,
1008
+ "clip_ratio/low_mean": 0.0,
1009
+ "clip_ratio/low_min": 0.0,
1010
+ "clip_ratio/region_mean": 0.0,
1011
+ "entropy": 0.27641875483095646,
1012
+ "epoch": 0.00042,
1013
+ "grad_norm": 0.0001371208782074973,
1014
+ "kl": 0.0004361617029644549,
1015
+ "learning_rate": 7.999999998519386e-06,
1016
+ "loss": -0.0,
1017
+ "step": 42,
1018
+ "step_time": 16.129788360000475
1019
+ },
1020
+ {
1021
+ "clip_ratio/high_max": 0.0,
1022
+ "clip_ratio/high_mean": 0.0,
1023
+ "clip_ratio/low_mean": 0.0,
1024
+ "clip_ratio/low_min": 0.0,
1025
+ "clip_ratio/region_mean": 0.0,
1026
+ "completions/clipped_ratio": 0.0,
1027
+ "completions/max_length": 3.0,
1028
+ "completions/max_terminated_length": 3.0,
1029
+ "completions/mean_length": 2.90625,
1030
+ "completions/mean_terminated_length": 2.90625,
1031
+ "completions/min_length": 2.0,
1032
+ "completions/min_terminated_length": 2.0,
1033
+ "entropy": 0.32416083104908466,
1034
+ "epoch": 0.00043,
1035
+ "frac_reward_zero_std": 0.0,
1036
+ "grad_norm": 0.0014907444128766656,
1037
+ "kl": 0.0035448235757939983,
1038
+ "learning_rate": 7.99999999798472e-06,
1039
+ "loss": 0.0,
1040
+ "num_tokens": 3317992.0,
1041
+ "reward": 0.9361296892166138,
1042
+ "reward_std": 10.966468811035156,
1043
+ "rewards/rollout_reward_func/mean": 0.9361296892166138,
1044
+ "rewards/rollout_reward_func/std": 10.966468811035156,
1045
+ "sampling/importance_sampling_ratio/max": 1.039656639099121,
1046
+ "sampling/importance_sampling_ratio/mean": 0.9434545636177063,
1047
+ "sampling/importance_sampling_ratio/min": 0.8878768682479858,
1048
+ "sampling/sampling_logp_difference/max": 0.11892110854387283,
1049
+ "sampling/sampling_logp_difference/mean": 0.022497989237308502,
1050
+ "step": 43,
1051
+ "step_time": 92.78864051999926
1052
+ },
1053
+ {
1054
+ "clip_ratio/high_max": 0.0,
1055
+ "clip_ratio/high_mean": 0.0,
1056
+ "clip_ratio/low_mean": 0.0,
1057
+ "clip_ratio/low_min": 0.0,
1058
+ "clip_ratio/region_mean": 0.0,
1059
+ "entropy": 0.3210602290928364,
1060
+ "epoch": 0.00044,
1061
+ "grad_norm": 0.0009183208458125591,
1062
+ "kl": 0.0044001892892993055,
1063
+ "learning_rate": 7.999999997367799e-06,
1064
+ "loss": -0.0,
1065
+ "step": 44,
1066
+ "step_time": 16.18974080999942
1067
+ },
1068
+ {
1069
+ "clip_ratio/high_max": 0.0,
1070
+ "clip_ratio/high_mean": 0.0,
1071
+ "clip_ratio/low_mean": 0.0,
1072
+ "clip_ratio/low_min": 0.0,
1073
+ "clip_ratio/region_mean": 0.0,
1074
+ "completions/clipped_ratio": 0.0,
1075
+ "completions/max_length": 3.0,
1076
+ "completions/max_terminated_length": 3.0,
1077
+ "completions/mean_length": 2.96875,
1078
+ "completions/mean_terminated_length": 2.96875,
1079
+ "completions/min_length": 2.0,
1080
+ "completions/min_terminated_length": 2.0,
1081
+ "entropy": 0.2824931964278221,
1082
+ "epoch": 0.00045,
1083
+ "frac_reward_zero_std": 0.0,
1084
+ "grad_norm": 0.005496473051607609,
1085
+ "kl": 0.09349646368718822,
1086
+ "learning_rate": 7.999999996668619e-06,
1087
+ "loss": 0.0,
1088
+ "num_tokens": 3472410.0,
1089
+ "reward": 1.872360348701477,
1090
+ "reward_std": 12.642463684082031,
1091
+ "rewards/rollout_reward_func/mean": 1.872360348701477,
1092
+ "rewards/rollout_reward_func/std": 12.642463684082031,
1093
+ "sampling/importance_sampling_ratio/max": 0.9575082063674927,
1094
+ "sampling/importance_sampling_ratio/mean": 0.941848635673523,
1095
+ "sampling/importance_sampling_ratio/min": 0.8221862316131592,
1096
+ "sampling/sampling_logp_difference/max": 0.12410736083984375,
1097
+ "sampling/sampling_logp_difference/mean": 0.0203759353607893,
1098
+ "step": 45,
1099
+ "step_time": 92.56621814599839
1100
+ },
+ {
+ "clip_ratio/high_max": 0.0,
+ "clip_ratio/high_mean": 0.0,
+ "clip_ratio/low_mean": 0.010416666977107525,
+ "clip_ratio/low_min": 0.0,
+ "clip_ratio/region_mean": 0.010416666977107525,
+ "entropy": 0.2739644553512335,
+ "epoch": 0.00046,
+ "grad_norm": 0.0008409882429987192,
+ "kl": 0.17168781322834548,
+ "learning_rate": 7.999999995887185e-06,
+ "loss": 0.0,
+ "step": 46,
+ "step_time": 16.97712109299937
+ },
1116
+ {
1117
+ "clip_ratio/high_max": 0.0,
1118
+ "clip_ratio/high_mean": 0.0,
1119
+ "clip_ratio/low_mean": 0.0,
1120
+ "clip_ratio/low_min": 0.0,
1121
+ "clip_ratio/region_mean": 0.0,
1122
+ "completions/clipped_ratio": 0.0,
1123
+ "completions/max_length": 3.0,
1124
+ "completions/max_terminated_length": 3.0,
1125
+ "completions/mean_length": 2.90625,
1126
+ "completions/mean_terminated_length": 2.90625,
1127
+ "completions/min_length": 2.0,
1128
+ "completions/min_terminated_length": 2.0,
1129
+ "entropy": 0.3127005323767662,
1130
+ "epoch": 0.00047,
1131
+ "frac_reward_zero_std": 0.0,
1132
+ "grad_norm": 0.005147633608430624,
1133
+ "kl": 0.02404813828979968,
1134
+ "learning_rate": 7.999999995023493e-06,
1135
+ "loss": 0.0,
1136
+ "num_tokens": 3623971.0,
1137
+ "reward": -2.005354404449463,
1138
+ "reward_std": 9.463374137878418,
1139
+ "rewards/rollout_reward_func/mean": -2.005354404449463,
1140
+ "rewards/rollout_reward_func/std": 9.463374137878418,
1141
+ "sampling/importance_sampling_ratio/max": 1.8077476024627686,
1142
+ "sampling/importance_sampling_ratio/mean": 0.9680489897727966,
1143
+ "sampling/importance_sampling_ratio/min": 0.8782174587249756,
1144
+ "sampling/sampling_logp_difference/max": 0.6487047672271729,
1145
+ "sampling/sampling_logp_difference/mean": 0.027968257665634155,
1146
+ "step": 47,
1147
+ "step_time": 90.81133413299995
1148
+ },
1149
+ {
1150
+ "clip_ratio/high_max": 0.02083333395421505,
1151
+ "clip_ratio/high_mean": 0.010416666977107525,
1152
+ "clip_ratio/low_mean": 0.0,
1153
+ "clip_ratio/low_min": 0.0,
1154
+ "clip_ratio/region_mean": 0.010416666977107525,
1155
+ "entropy": 0.31450361758470535,
1156
+ "epoch": 0.00048,
1157
+ "grad_norm": 0.010047705844044685,
1158
+ "kl": 0.011364400590537116,
1159
+ "learning_rate": 7.999999994077545e-06,
1160
+ "loss": -0.0,
1161
+ "step": 48,
1162
+ "step_time": 16.131004170999404
1163
+ },
1164
+ {
1165
+ "clip_ratio/high_max": 0.0,
1166
+ "clip_ratio/high_mean": 0.0,
1167
+ "clip_ratio/low_mean": 0.0,
1168
+ "clip_ratio/low_min": 0.0,
1169
+ "clip_ratio/region_mean": 0.0,
1170
+ "completions/clipped_ratio": 0.0,
1171
+ "completions/max_length": 3.0,
1172
+ "completions/max_terminated_length": 3.0,
1173
+ "completions/mean_length": 2.9375,
1174
+ "completions/mean_terminated_length": 2.9375,
1175
+ "completions/min_length": 2.0,
1176
+ "completions/min_terminated_length": 2.0,
1177
+ "entropy": 0.31293279118835926,
1178
+ "epoch": 0.00049,
1179
+ "frac_reward_zero_std": 0.0,
1180
+ "grad_norm": 0.0025514012668281794,
1181
+ "kl": 0.004930681967380224,
1182
+ "learning_rate": 7.999999993049342e-06,
1183
+ "loss": -0.0,
1184
+ "num_tokens": 3769693.0,
1185
+ "reward": -1.6621102094650269,
1186
+ "reward_std": 11.216950416564941,
1187
+ "rewards/rollout_reward_func/mean": -1.6621102094650269,
1188
+ "rewards/rollout_reward_func/std": 11.216950416564941,
1189
+ "sampling/importance_sampling_ratio/max": 0.963409960269928,
1190
+ "sampling/importance_sampling_ratio/mean": 0.9157481789588928,
1191
+ "sampling/importance_sampling_ratio/min": 0.4775127172470093,
1192
+ "sampling/sampling_logp_difference/max": 0.6763087511062622,
1193
+ "sampling/sampling_logp_difference/mean": 0.03370527923107147,
1194
+ "step": 49,
1195
+ "step_time": 87.11181479400057
1196
+ },
1197
+ {
1198
+ "clip_ratio/high_max": 0.0,
1199
+ "clip_ratio/high_mean": 0.0,
1200
+ "clip_ratio/low_mean": 0.02083333395421505,
1201
+ "clip_ratio/low_min": 0.0,
1202
+ "clip_ratio/region_mean": 0.02083333395421505,
1203
+ "entropy": 0.305842787027359,
1204
+ "epoch": 0.0005,
1205
+ "grad_norm": 0.001049969345331192,
1206
+ "kl": 0.01695527902484173,
1207
+ "learning_rate": 7.999999991938882e-06,
1208
+ "loss": -0.0,
1209
+ "step": 50,
1210
+ "step_time": 17.37328599800003
1211
+ },
1212
+ {
1213
+ "clip_ratio/high_max": 0.0,
1214
+ "clip_ratio/high_mean": 0.0,
1215
+ "clip_ratio/low_mean": 0.0,
1216
+ "clip_ratio/low_min": 0.0,
1217
+ "clip_ratio/region_mean": 0.0,
1218
+ "completions/clipped_ratio": 0.0,
1219
+ "completions/max_length": 3.0,
1220
+ "completions/max_terminated_length": 3.0,
1221
+ "completions/mean_length": 2.9375,
1222
+ "completions/mean_terminated_length": 2.9375,
1223
+ "completions/min_length": 2.0,
1224
+ "completions/min_terminated_length": 2.0,
1225
+ "entropy": 0.26627207547426224,
1226
+ "epoch": 0.00051,
1227
+ "frac_reward_zero_std": 0.0,
1228
+ "grad_norm": 0.000988983316347003,
1229
+ "kl": 0.0010694206866901368,
1230
+ "learning_rate": 7.999999990746166e-06,
1231
+ "loss": -0.0,
1232
+ "num_tokens": 3918993.0,
1233
+ "reward": 0.07662695646286011,
1234
+ "reward_std": 12.704819679260254,
1235
+ "rewards/rollout_reward_func/mean": 0.07662695646286011,
1236
+ "rewards/rollout_reward_func/std": 12.704818725585938,
1237
+ "sampling/importance_sampling_ratio/max": 0.9753377437591553,
1238
+ "sampling/importance_sampling_ratio/mean": 0.950916051864624,
1239
+ "sampling/importance_sampling_ratio/min": 0.8742276430130005,
1240
+ "sampling/sampling_logp_difference/max": 0.08313722163438797,
1241
+ "sampling/sampling_logp_difference/mean": 0.018695753067731857,
1242
+ "step": 51,
1243
+ "step_time": 84.65364402499836
1244
+ },
1245
+ {
1246
+ "clip_ratio/high_max": 0.0,
1247
+ "clip_ratio/high_mean": 0.0,
1248
+ "clip_ratio/low_mean": 0.0,
1249
+ "clip_ratio/low_min": 0.0,
1250
+ "clip_ratio/region_mean": 0.0,
1251
+ "entropy": 0.26170745119452477,
1252
+ "epoch": 0.00052,
1253
+ "grad_norm": 0.0009646639809943736,
1254
+ "kl": 0.0012894975807284936,
1255
+ "learning_rate": 7.999999989471194e-06,
1256
+ "loss": -0.0,
1257
+ "step": 52,
1258
+ "step_time": 15.9307113390023
1259
+ },
1260
+ {
1261
+ "clip_ratio/high_max": 0.0,
1262
+ "clip_ratio/high_mean": 0.0,
1263
+ "clip_ratio/low_mean": 0.0,
1264
+ "clip_ratio/low_min": 0.0,
1265
+ "clip_ratio/region_mean": 0.0,
1266
+ "completions/clipped_ratio": 0.0,
1267
+ "completions/max_length": 3.0,
1268
+ "completions/max_terminated_length": 3.0,
1269
+ "completions/mean_length": 2.96875,
1270
+ "completions/mean_terminated_length": 2.96875,
1271
+ "completions/min_length": 2.0,
1272
+ "completions/min_terminated_length": 2.0,
1273
+ "entropy": 0.26491483487188816,
1274
+ "epoch": 0.00053,
1275
+ "frac_reward_zero_std": 0.0,
1276
+ "grad_norm": 0.0002617822028696537,
1277
+ "kl": 0.0023044159715936985,
1278
+ "learning_rate": 7.999999988113964e-06,
1279
+ "loss": -0.0,
1280
+ "num_tokens": 4071645.0,
1281
+ "reward": -0.05424162745475769,
1282
+ "reward_std": 9.241598129272461,
1283
+ "rewards/rollout_reward_func/mean": -0.05424162745475769,
1284
+ "rewards/rollout_reward_func/std": 9.241597175598145,
1285
+ "sampling/importance_sampling_ratio/max": 0.9647132754325867,
1286
+ "sampling/importance_sampling_ratio/mean": 0.9405180215835571,
1287
+ "sampling/importance_sampling_ratio/min": 0.6948662400245667,
1288
+ "sampling/sampling_logp_difference/max": 0.30833685398101807,
1289
+ "sampling/sampling_logp_difference/mean": 0.021141711622476578,
1290
+ "step": 53,
1291
+ "step_time": 93.0931882169998
1292
+ }
+ ],
+ "logging_steps": 1.0,
+ "max_steps": 600000,
+ "num_input_tokens_seen": 4071645,
+ "num_train_epochs": 6,
+ "save_steps": 500,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": false,
+ "should_save": true,
+ "should_training_stop": true
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 0.0,
+ "train_batch_size": 2,
+ "trial_name": null,
+ "trial_params": null
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2010939fa76a3f4dbd78aaab9e103d580462957e8424de882b2bd2c6f761ef2c
+ size 8081
vocab.json ADDED