iamPi committed on
Commit be08e11 · verified · 1 Parent(s): 33b90e4

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,209 @@
+ ---
+ base_model: Qwen/Qwen3-4B-Instruct-2507
+ library_name: peft
+ pipeline_tag: text-generation
+ tags:
+ - base_model:adapter:Qwen/Qwen3-4B-Instruct-2507
+ - grpo
+ - lora
+ - transformers
+ - trl
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
+ ### Framework versions
+
+ - PEFT 0.18.1
adapter_config.json ADDED
@@ -0,0 +1,46 @@
+ {
+   "alora_invocation_tokens": null,
+   "alpha_pattern": {},
+   "arrow_config": null,
+   "auto_mapping": null,
+   "base_model_name_or_path": "/cache/models/Qwen--Qwen3-4B-Instruct-2507",
+   "bias": "none",
+   "corda_config": null,
+   "ensure_weight_tying": false,
+   "eva_config": null,
+   "exclude_modules": null,
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 64,
+   "lora_bias": false,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "peft_version": "0.18.1",
+   "qalora_group_size": 16,
+   "r": 32,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "q_proj",
+     "o_proj",
+     "v_proj",
+     "down_proj",
+     "k_proj",
+     "up_proj",
+     "gate_proj"
+   ],
+   "target_parameters": null,
+   "task_type": "CAUSAL_LM",
+   "trainable_token_indices": null,
+   "use_dora": false,
+   "use_qalora": false,
+   "use_rslora": false
+ }
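Reviewer note: the effective multiplier on the LoRA update follows from `r`, `lora_alpha`, and `use_rslora` in the config above. The sketch below assumes PEFT's usual convention (`lora_alpha / r`, or `lora_alpha / sqrt(r)` when rank-stabilized LoRA is enabled). The helper name `lora_scaling` is hypothetical and is not part of PEFT.

```python
import math

# Values copied from the adapter_config.json shown above.
config = {"r": 32, "lora_alpha": 64, "use_rslora": False}


def lora_scaling(cfg: dict) -> float:
    """Return the multiplier applied to the low-rank update B @ A.

    Assumption: standard LoRA scales by alpha / r, while rank-stabilized
    LoRA (use_rslora) scales by alpha / sqrt(r).
    """
    if cfg["use_rslora"]:
        return cfg["lora_alpha"] / math.sqrt(cfg["r"])
    return cfg["lora_alpha"] / cfg["r"]


print(lora_scaling(config))  # 2.0
```

Under that assumption, this adapter's update is scaled by 2 before being added to each of the seven targeted projection layers.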
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5ef77f1c459d6bc3a23ef214661eafa56d59c8491aa15bc5a3a10e00cab04091
+ size 264308896
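Reviewer note: the three lines above are a Git LFS pointer, not the weights themselves. The diff records only the object hash and byte size. As a rough sketch of how such a pointer could be read, the hypothetical helper below (`parse_lfs_pointer` is an illustrative name, not an LFS tool) splits it into fields:

```python
# Pointer text as it appears in the diff above (Git LFS spec v1).
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:5ef77f1c459d6bc3a23ef214661eafa56d59c8491aa15bc5a3a10e00cab04091
size 264308896
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of an LFS pointer into named fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


info = parse_lfs_pointer(POINTER)
print(info["hash_algo"], info["size_bytes"])
```

The `size` field puts the adapter at roughly 264 MB. The `tokenizer.json` entry further down uses the same pointer layout.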
added_tokens.json ADDED
@@ -0,0 +1,28 @@
+ {
+   "</think>": 151668,
+   "</tool_call>": 151658,
+   "</tool_response>": 151666,
+   "<think>": 151667,
+   "<tool_call>": 151657,
+   "<tool_response>": 151665,
+   "<|box_end|>": 151649,
+   "<|box_start|>": 151648,
+   "<|endoftext|>": 151643,
+   "<|file_sep|>": 151664,
+   "<|fim_middle|>": 151660,
+   "<|fim_pad|>": 151662,
+   "<|fim_prefix|>": 151659,
+   "<|fim_suffix|>": 151661,
+   "<|im_end|>": 151645,
+   "<|im_start|>": 151644,
+   "<|image_pad|>": 151655,
+   "<|object_ref_end|>": 151647,
+   "<|object_ref_start|>": 151646,
+   "<|quad_end|>": 151651,
+   "<|quad_start|>": 151650,
+   "<|repo_name|>": 151663,
+   "<|video_pad|>": 151656,
+   "<|vision_end|>": 151653,
+   "<|vision_pad|>": 151654,
+   "<|vision_start|>": 151652
+ }
chat_template.jinja ADDED
@@ -0,0 +1,61 @@
+ {%- if tools %}
+ {{- '<|im_start|>system\n' }}
+ {%- if messages[0].role == 'system' %}
+ {{- messages[0].content + '\n\n' }}
+ {%- endif %}
+ {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+ {%- for tool in tools %}
+ {{- "\n" }}
+ {{- tool | tojson }}
+ {%- endfor %}
+ {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+ {%- else %}
+ {%- if messages[0].role == 'system' %}
+ {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
+ {%- endif %}
+ {%- endif %}
+ {%- for message in messages %}
+ {%- if message.content is string %}
+ {%- set content = message.content %}
+ {%- else %}
+ {%- set content = '' %}
+ {%- endif %}
+ {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
+ {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
+ {%- elif message.role == "assistant" %}
+ {{- '<|im_start|>' + message.role + '\n' + content }}
+ {%- if message.tool_calls %}
+ {%- for tool_call in message.tool_calls %}
+ {%- if (loop.first and content) or (not loop.first) %}
+ {{- '\n' }}
+ {%- endif %}
+ {%- if tool_call.function %}
+ {%- set tool_call = tool_call.function %}
+ {%- endif %}
+ {{- '<tool_call>\n{"name": "' }}
+ {{- tool_call.name }}
+ {{- '", "arguments": ' }}
+ {%- if tool_call.arguments is string %}
+ {{- tool_call.arguments }}
+ {%- else %}
+ {{- tool_call.arguments | tojson }}
+ {%- endif %}
+ {{- '}\n</tool_call>' }}
+ {%- endfor %}
+ {%- endif %}
+ {{- '<|im_end|>\n' }}
+ {%- elif message.role == "tool" %}
+ {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+ {{- '<|im_start|>user' }}
+ {%- endif %}
+ {{- '\n<tool_response>\n' }}
+ {{- content }}
+ {{- '\n</tool_response>' }}
+ {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+ {{- '<|im_end|>\n' }}
+ {%- endif %}
+ {%- endif %}
+ {%- endfor %}
+ {%- if add_generation_prompt %}
+ {%- '<|im_start|>assistant\n' }}
+ {%- endif %}
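Reviewer note: to make the template's output concrete, here is a plain-Python sketch of its no-tools path only. It assumes every message has string `content` and that no assistant turn carries `tool_calls`. The function name `render_chat` is hypothetical. The real rendering is done by the Jinja template above, which also handles tool definitions, tool calls, and tool responses.

```python
def render_chat(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Mirror the template's no-tools branch for plain text turns."""
    out = []
    for i, msg in enumerate(messages):
        role, content = msg["role"], msg["content"]
        if role == "system" and i == 0:
            # A leading system message is emitted by the header block.
            out.append("<|im_start|>system\n" + content + "<|im_end|>\n")
        elif role in ("user", "system", "assistant"):
            # Later turns share the same <|im_start|>role ... <|im_end|> frame.
            out.append("<|im_start|>" + role + "\n" + content + "<|im_end|>\n")
        else:
            raise ValueError(f"role {role!r} is outside the scope of this sketch")
    if add_generation_prompt:
        # Leaves the prompt open for the model's reply.
        out.append("<|im_start|>assistant\n")
    return "".join(out)


example = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
]
print(render_chat(example))
```

Each turn is wrapped in `<|im_start|>` and `<|im_end|>`, and `<|im_end|>` is also the `eos_token` declared in the tokenizer files later in this commit.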
loss.txt ADDED
@@ -0,0 +1 @@
+ 83,no_eval
merges.txt ADDED
The diff for this file is too large to render. See raw diff
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "eos_token": {
+     "content": "<|im_end|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
+ size 11422654
tokenizer_config.json ADDED
@@ -0,0 +1,239 @@
+ {
+   "add_bos_token": false,
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "151643": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151644": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151645": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151646": {
+       "content": "<|object_ref_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151647": {
+       "content": "<|object_ref_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151648": {
+       "content": "<|box_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151649": {
+       "content": "<|box_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151650": {
+       "content": "<|quad_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151651": {
+       "content": "<|quad_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151652": {
+       "content": "<|vision_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151653": {
+       "content": "<|vision_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151654": {
+       "content": "<|vision_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151655": {
+       "content": "<|image_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151656": {
+       "content": "<|video_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "151657": {
+       "content": "<tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151658": {
+       "content": "</tool_call>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151659": {
+       "content": "<|fim_prefix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151660": {
+       "content": "<|fim_middle|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151661": {
+       "content": "<|fim_suffix|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151662": {
+       "content": "<|fim_pad|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151663": {
+       "content": "<|repo_name|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151664": {
+       "content": "<|file_sep|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151665": {
+       "content": "<tool_response>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151666": {
+       "content": "</tool_response>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151667": {
+       "content": "<think>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "151668": {
+       "content": "</think>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     }
+   },
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "bos_token": null,
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "extra_special_tokens": {},
+   "model_max_length": 1010000,
+   "pad_token": "<|endoftext|>",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null
+ }
trainer_state.json ADDED
@@ -0,0 +1,2035 @@
+ {
+   "best_global_step": null,
+   "best_metric": null,
+   "best_model_checkpoint": null,
+   "epoch": 0.00083,
+   "eval_steps": 500,
+   "global_step": 83,
+   "is_hyper_param_search": false,
+   "is_local_process_zero": true,
+   "is_world_process_zero": true,
+   "log_history": [
+     {
+       "clip_ratio/high_max": 0.0,
+       "clip_ratio/high_mean": 0.0,
+       "clip_ratio/low_mean": 0.0,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.0,
+       "completions/clipped_ratio": 0.0,
+       "completions/max_length": 13588.0,
+       "completions/max_terminated_length": 13588.0,
+       "completions/mean_length": 11062.28125,
+       "completions/mean_terminated_length": 11062.28125,
+       "completions/min_length": 3240.0,
+       "completions/min_terminated_length": 3240.0,
+       "entropy": 0.024793223245069385,
+       "epoch": 1e-05,
+       "frac_reward_zero_std": 0.0,
+       "grad_norm": 8.580889701843262,
+       "kl": 0.0,
+       "learning_rate": 0.0,
+       "loss": 0.3828,
+       "num_tokens": 380760.0,
+       "reward": -0.18871402740478516,
+       "reward_std": 0.6556680202484131,
+       "rewards/rollout_reward_func/mean": -0.18871402740478516,
+       "rewards/rollout_reward_func/std": 0.7358422875404358,
+       "sampling/importance_sampling_ratio/max": 3.0,
+       "sampling/importance_sampling_ratio/mean": 1.0005069971084595,
+       "sampling/importance_sampling_ratio/min": 0.2118072211742401,
+       "sampling/sampling_logp_difference/max": 1.5520787239074707,
+       "sampling/sampling_logp_difference/mean": 0.007466587238013744,
+       "step": 1,
+       "step_time": 164.69555583700276
+     },
+     {
+       "clip_ratio/high_max": 0.0,
+       "clip_ratio/high_mean": 0.0,
+       "clip_ratio/low_mean": 0.0,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.0,
+       "entropy": 0.024793223245069385,
+       "epoch": 2e-05,
+       "grad_norm": 8.476791381835938,
+       "kl": 0.0,
+       "learning_rate": 1.4285714285714286e-06,
+       "loss": 0.3828,
+       "step": 2,
+       "step_time": 76.13254943499851
+     },
+     {
+       "clip_ratio/high_max": 0.007962517178384587,
+       "clip_ratio/high_mean": 0.004406428648508154,
+       "clip_ratio/low_mean": 0.004677400924265385,
+       "clip_ratio/low_min": 0.0013333333772607148,
+       "clip_ratio/region_mean": 0.009083829529117793,
+       "completions/clipped_ratio": 0.0,
+       "completions/max_length": 13187.0,
+       "completions/max_terminated_length": 13187.0,
+       "completions/mean_length": 10307.78125,
+       "completions/mean_terminated_length": 10307.78125,
+       "completions/min_length": 1148.0,
+       "completions/min_terminated_length": 1148.0,
+       "entropy": 0.02775774576002732,
+       "epoch": 3e-05,
+       "frac_reward_zero_std": 0.0,
+       "grad_norm": 7.864254951477051,
+       "kl": 0.003300241682154592,
+       "learning_rate": 2.8571428571428573e-06,
+       "loss": -0.359,
+       "num_tokens": 737612.0,
+       "reward": 0.12486958503723145,
+       "reward_std": 0.6849837303161621,
+       "rewards/rollout_reward_func/mean": 0.12486958503723145,
+       "rewards/rollout_reward_func/std": 0.7890143394470215,
+       "sampling/importance_sampling_ratio/max": 3.0,
+       "sampling/importance_sampling_ratio/mean": 0.9988386631011963,
+       "sampling/importance_sampling_ratio/min": 0.06573130935430527,
+       "sampling/sampling_logp_difference/max": 2.722179889678955,
+       "sampling/sampling_logp_difference/mean": 0.009859845042228699,
+       "step": 3,
+       "step_time": 162.92369715300083
+     },
+     {
+       "clip_ratio/high_max": 0.006946050678379834,
+       "clip_ratio/high_mean": 0.003898195398505777,
+       "clip_ratio/low_mean": 0.002575721693574451,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.006473917106632143,
+       "entropy": 0.02751988806994632,
+       "epoch": 4e-05,
+       "grad_norm": 5.452242851257324,
+       "kl": 0.002423067733161588,
+       "learning_rate": 4.285714285714286e-06,
+       "loss": -0.3579,
+       "step": 4,
+       "step_time": 72.9002098730034
+     },
+     {
+       "clip_ratio/high_max": 0.0,
+       "clip_ratio/high_mean": 0.0,
+       "clip_ratio/low_mean": 0.003770460345549509,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.003770460345549509,
+       "completions/clipped_ratio": 0.0,
+       "completions/max_length": 13268.0,
+       "completions/max_terminated_length": 13268.0,
+       "completions/mean_length": 10374.875,
+       "completions/mean_terminated_length": 10374.875,
+       "completions/min_length": 1137.0,
+       "completions/min_terminated_length": 1137.0,
+       "entropy": 0.019460081937722862,
+       "epoch": 5e-05,
+       "frac_reward_zero_std": 0.0,
+       "grad_norm": 2.471012830734253,
+       "kl": 0.004856036448204648,
+       "learning_rate": 5.7142857142857145e-06,
+       "loss": 0.641,
+       "num_tokens": 1096168.0,
+       "reward": 0.023967057466506958,
+       "reward_std": 0.8080207109451294,
+       "rewards/rollout_reward_func/mean": 0.023967057466506958,
+       "rewards/rollout_reward_func/std": 0.7734705805778503,
+       "sampling/importance_sampling_ratio/max": 2.3860788345336914,
+       "sampling/importance_sampling_ratio/mean": 0.9997902512550354,
+       "sampling/importance_sampling_ratio/min": 0.24590186774730682,
+       "sampling/sampling_logp_difference/max": 1.402822732925415,
+       "sampling/sampling_logp_difference/mean": 0.006388354115188122,
+       "step": 5,
+       "step_time": 163.10432826500255
+     },
+     {
+       "clip_ratio/high_max": 0.0008591166697442532,
+       "clip_ratio/high_mean": 0.0004295583348721266,
+       "clip_ratio/low_mean": 0.002894193778047338,
+       "clip_ratio/low_min": 0.00044014083687216043,
+       "clip_ratio/region_mean": 0.0033237521129194647,
+       "entropy": 0.019741965195862576,
+       "epoch": 6e-05,
+       "grad_norm": 2.724069595336914,
+       "kl": 0.0028640855837238632,
+       "learning_rate": 7.142857142857143e-06,
+       "loss": 0.6421,
+       "step": 6,
+       "step_time": 72.53219779800384
+     },
+     {
+       "clip_ratio/high_max": 0.002466684323735535,
+       "clip_ratio/high_mean": 0.0014459271915256977,
+       "clip_ratio/low_mean": 0.003938428315450437,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.00538435549242422,
+       "completions/clipped_ratio": 0.0,
+       "completions/max_length": 13555.0,
+       "completions/max_terminated_length": 13555.0,
+       "completions/mean_length": 10355.03125,
+       "completions/mean_terminated_length": 10355.03125,
+       "completions/min_length": 671.0,
+       "completions/min_terminated_length": 671.0,
+       "entropy": 0.022901446820469573,
+       "epoch": 7e-05,
+       "frac_reward_zero_std": 0.0,
+       "grad_norm": 2.8263163566589355,
+       "kl": 0.003971936868765624,
+       "learning_rate": 8.571428571428573e-06,
+       "loss": 0.1888,
+       "num_tokens": 1454826.0,
+       "reward": -0.312838077545166,
+       "reward_std": 0.7269577980041504,
+       "rewards/rollout_reward_func/mean": -0.312838077545166,
+       "rewards/rollout_reward_func/std": 0.8357037305831909,
+       "sampling/importance_sampling_ratio/max": 2.0568044185638428,
+       "sampling/importance_sampling_ratio/mean": 1.000103235244751,
+       "sampling/importance_sampling_ratio/min": 0.08904052525758743,
+       "sampling/sampling_logp_difference/max": 2.418663740158081,
+       "sampling/sampling_logp_difference/mean": 0.0077797044068574905,
+       "step": 7,
+       "step_time": 162.22604865499852
+     },
+     {
+       "clip_ratio/high_max": 0.0032546367147006094,
+       "clip_ratio/high_mean": 0.0018399033870082349,
+       "clip_ratio/low_mean": 0.003713095487910323,
+       "clip_ratio/low_min": 0.0010530821746215224,
+       "clip_ratio/region_mean": 0.005552998904022388,
+       "entropy": 0.02342148229945451,
+       "epoch": 8e-05,
+       "grad_norm": 2.752257823944092,
+       "kl": 0.004395232397655491,
+       "learning_rate": 1e-05,
+       "loss": 0.1866,
+       "step": 8,
+       "step_time": 74.18662748600218
+     },
+     {
+       "clip_ratio/high_max": 0.005761671985965222,
+       "clip_ratio/high_mean": 0.0031964925583451986,
+       "clip_ratio/low_mean": 0.0030218699539545923,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.0062183625414036214,
+       "completions/clipped_ratio": 0.0,
+       "completions/max_length": 13308.0,
+       "completions/max_terminated_length": 13308.0,
+       "completions/mean_length": 10721.71875,
+       "completions/mean_terminated_length": 10721.71875,
+       "completions/min_length": 1860.0,
+       "completions/min_terminated_length": 1860.0,
+       "entropy": 0.02882851444883272,
+       "epoch": 9e-05,
+       "frac_reward_zero_std": 0.0,
+       "grad_norm": 4.064320087432861,
+       "kl": 0.011813669728326204,
+       "learning_rate": 1.1428571428571429e-05,
+       "loss": -0.7679,
+       "num_tokens": 1824780.0,
+       "reward": -0.08607947826385498,
+       "reward_std": 0.5436858534812927,
+       "rewards/rollout_reward_func/mean": -0.08607947826385498,
+       "rewards/rollout_reward_func/std": 0.6569795608520508,
+       "sampling/importance_sampling_ratio/max": 1.9653983116149902,
+       "sampling/importance_sampling_ratio/mean": 0.9994874000549316,
+       "sampling/importance_sampling_ratio/min": 0.12670548260211945,
+       "sampling/sampling_logp_difference/max": 2.065889835357666,
+       "sampling/sampling_logp_difference/mean": 0.008421818725764751,
+       "step": 9,
+       "step_time": 166.1723138519992
+     },
+     {
+       "clip_ratio/high_max": 0.006464897654950619,
+       "clip_ratio/high_mean": 0.0037041135947220027,
+       "clip_ratio/low_mean": 0.00621041064732708,
+       "clip_ratio/low_min": 0.00045289855916053057,
+       "clip_ratio/region_mean": 0.009914524242049083,
+       "entropy": 0.030018516205018386,
+       "epoch": 0.0001,
+       "grad_norm": 3.770601511001587,
+       "kl": 0.0105567153639754,
+       "learning_rate": 1.2857142857142857e-05,
+       "loss": -0.7706,
+       "step": 10,
+       "step_time": 73.40501256899915
+     },
+     {
+       "clip_ratio/high_max": 0.0021470933861564845,
+       "clip_ratio/high_mean": 0.0010735466930782422,
+       "clip_ratio/low_mean": 0.005622330281767063,
+       "clip_ratio/low_min": 0.0,
+       "clip_ratio/region_mean": 0.006695876887533814,
+       "completions/clipped_ratio": 0.0,
+       "completions/max_length": 13362.0,
+       "completions/max_terminated_length": 13362.0,
+       "completions/mean_length": 10876.53125,
+       "completions/mean_terminated_length": 10876.53125,
+       "completions/min_length": 1132.0,
+       "completions/min_terminated_length": 1132.0,
+       "entropy": 0.0270388865028508,
+       "epoch": 0.00011,
+       "frac_reward_zero_std": 0.0,
268
+ "grad_norm": 3.835709571838379,
269
+ "kl": 0.01752202024363214,
270
+ "learning_rate": 1.4285714285714285e-05,
271
+ "loss": -0.382,
272
+ "num_tokens": 2199775.0,
273
+ "reward": -0.10052897781133652,
274
+ "reward_std": 0.4733438193798065,
275
+ "rewards/rollout_reward_func/mean": -0.10052897781133652,
276
+ "rewards/rollout_reward_func/std": 0.5049359202384949,
277
+ "sampling/importance_sampling_ratio/max": 2.5024189949035645,
278
+ "sampling/importance_sampling_ratio/mean": 0.9998884201049805,
279
+ "sampling/importance_sampling_ratio/min": 0.030437257140874863,
280
+ "sampling/sampling_logp_difference/max": 3.4920878410339355,
281
+ "sampling/sampling_logp_difference/mean": 0.007882161997258663,
282
+ "step": 11,
283
+ "step_time": 168.9051820099976
284
+ },
285
+ {
286
+ "clip_ratio/high_max": 0.0065168983419425786,
287
+ "clip_ratio/high_mean": 0.0032584491709712893,
288
+ "clip_ratio/low_mean": 0.007071279003866948,
289
+ "clip_ratio/low_min": 0.0019379844889044762,
290
+ "clip_ratio/region_mean": 0.010329728116630577,
291
+ "entropy": 0.02872419002233073,
292
+ "epoch": 0.00012,
293
+ "grad_norm": 3.390139579772949,
294
+ "kl": 0.026051949804241303,
295
+ "learning_rate": 1.5714285714285715e-05,
296
+ "loss": -0.3917,
297
+ "step": 12,
298
+ "step_time": 75.23061114399752
299
+ },
300
+ {
301
+ "clip_ratio/high_max": 0.005910941952606663,
302
+ "clip_ratio/high_mean": 0.003374952997546643,
303
+ "clip_ratio/low_mean": 0.0033144439657917246,
304
+ "clip_ratio/low_min": 0.00042517005931586027,
305
+ "clip_ratio/region_mean": 0.006689396948786452,
306
+ "completions/clipped_ratio": 0.0,
307
+ "completions/max_length": 13292.0,
308
+ "completions/max_terminated_length": 13292.0,
309
+ "completions/mean_length": 10776.84375,
310
+ "completions/mean_terminated_length": 10776.84375,
311
+ "completions/min_length": 506.0,
312
+ "completions/min_terminated_length": 506.0,
313
+ "entropy": 0.023527769571956014,
314
+ "epoch": 0.00013,
315
+ "frac_reward_zero_std": 0.0,
316
+ "grad_norm": 7.241506099700928,
317
+ "kl": 0.038550363320811964,
318
+ "learning_rate": 1.7142857142857145e-05,
319
+ "loss": -0.3493,
320
+ "num_tokens": 2571951.0,
321
+ "reward": -0.014623546972870827,
322
+ "reward_std": 0.896544337272644,
323
+ "rewards/rollout_reward_func/mean": -0.014623546972870827,
324
+ "rewards/rollout_reward_func/std": 1.0157890319824219,
325
+ "sampling/importance_sampling_ratio/max": 3.0,
326
+ "sampling/importance_sampling_ratio/mean": 1.0013093948364258,
327
+ "sampling/importance_sampling_ratio/min": 0.1680545210838318,
328
+ "sampling/sampling_logp_difference/max": 1.7834668159484863,
329
+ "sampling/sampling_logp_difference/mean": 0.010352442041039467,
330
+ "step": 13,
331
+ "step_time": 166.36859219799953
332
+ },
333
+ {
334
+ "clip_ratio/high_max": 0.008198851748602465,
335
+ "clip_ratio/high_mean": 0.004307759183575399,
336
+ "clip_ratio/low_mean": 0.003082827272010036,
337
+ "clip_ratio/low_min": 0.0008474673668388277,
338
+ "clip_ratio/region_mean": 0.007390586411929689,
339
+ "entropy": 0.024192788765503792,
340
+ "epoch": 0.00014,
341
+ "grad_norm": 4.684647083282471,
342
+ "kl": 0.047261360837082655,
343
+ "learning_rate": 1.8571428571428572e-05,
344
+ "loss": -0.3576,
345
+ "step": 14,
346
+ "step_time": 74.20421179199911
347
+ },
348
+ {
349
+ "clip_ratio/high_max": 0.006833544030087069,
350
+ "clip_ratio/high_mean": 0.00451569254801143,
351
+ "clip_ratio/low_mean": 0.003979612258262932,
352
+ "clip_ratio/low_min": 0.0012784616847056895,
353
+ "clip_ratio/region_mean": 0.008495304791722447,
354
+ "completions/clipped_ratio": 0.0,
355
+ "completions/max_length": 13161.0,
356
+ "completions/max_terminated_length": 13161.0,
357
+ "completions/mean_length": 11274.875,
358
+ "completions/mean_terminated_length": 11274.875,
359
+ "completions/min_length": 677.0,
360
+ "completions/min_terminated_length": 677.0,
361
+ "entropy": 0.030410012637730688,
362
+ "epoch": 0.00015,
363
+ "frac_reward_zero_std": 0.0,
364
+ "grad_norm": 7.206306457519531,
365
+ "kl": 0.16209679024177603,
366
+ "learning_rate": 2e-05,
367
+ "loss": -0.6323,
368
+ "num_tokens": 2959792.0,
369
+ "reward": -0.04547331482172012,
370
+ "reward_std": 0.512954592704773,
371
+ "rewards/rollout_reward_func/mean": -0.04547331482172012,
372
+ "rewards/rollout_reward_func/std": 0.5874266028404236,
373
+ "sampling/importance_sampling_ratio/max": 1.758615493774414,
374
+ "sampling/importance_sampling_ratio/mean": 0.9982211589813232,
375
+ "sampling/importance_sampling_ratio/min": 0.14112254977226257,
376
+ "sampling/sampling_logp_difference/max": 1.9581265449523926,
377
+ "sampling/sampling_logp_difference/mean": 0.010034291073679924,
378
+ "step": 15,
379
+ "step_time": 172.67286533600054
380
+ },
381
+ {
382
+ "clip_ratio/high_max": 0.009126297401962802,
383
+ "clip_ratio/high_mean": 0.00564141028735321,
384
+ "clip_ratio/low_mean": 0.0038173396605998278,
385
+ "clip_ratio/low_min": 0.0017036317440215498,
386
+ "clip_ratio/region_mean": 0.009458749933401123,
387
+ "entropy": 0.030430902494117618,
388
+ "epoch": 0.00016,
389
+ "grad_norm": 5.043717861175537,
390
+ "kl": 0.09108352061593905,
391
+ "learning_rate": 2.1428571428571428e-05,
392
+ "loss": -0.6478,
393
+ "step": 16,
394
+ "step_time": 75.43593462200079
395
+ },
396
+ {
397
+ "clip_ratio/high_max": 0.0034075405565090477,
398
+ "clip_ratio/high_mean": 0.002120436984114349,
399
+ "clip_ratio/low_mean": 0.0051906249864259735,
400
+ "clip_ratio/low_min": 0.00041946308920159936,
401
+ "clip_ratio/region_mean": 0.007311061970540322,
402
+ "completions/clipped_ratio": 0.0,
403
+ "completions/max_length": 13232.0,
404
+ "completions/max_terminated_length": 13232.0,
405
+ "completions/mean_length": 10035.0,
406
+ "completions/mean_terminated_length": 10035.0,
407
+ "completions/min_length": 692.0,
408
+ "completions/min_terminated_length": 692.0,
409
+ "entropy": 0.03239170782035217,
410
+ "epoch": 0.00017,
411
+ "frac_reward_zero_std": 0.0,
412
+ "grad_norm": 26.07994842529297,
413
+ "kl": 0.5745490518165752,
414
+ "learning_rate": 2.2857142857142858e-05,
415
+ "loss": -0.4504,
416
+ "num_tokens": 3308089.0,
417
+ "reward": -0.21278375387191772,
418
+ "reward_std": 0.5441625118255615,
419
+ "rewards/rollout_reward_func/mean": -0.21278375387191772,
420
+ "rewards/rollout_reward_func/std": 0.612115204334259,
421
+ "sampling/importance_sampling_ratio/max": 3.0,
422
+ "sampling/importance_sampling_ratio/mean": 0.9986231327056885,
423
+ "sampling/importance_sampling_ratio/min": 0.07349392026662827,
424
+ "sampling/sampling_logp_difference/max": 2.6105525493621826,
425
+ "sampling/sampling_logp_difference/mean": 0.010844497010111809,
426
+ "step": 17,
427
+ "step_time": 161.29507867800203
428
+ },
429
+ {
430
+ "clip_ratio/high_max": 0.009956223861081526,
431
+ "clip_ratio/high_mean": 0.005192153024836443,
432
+ "clip_ratio/low_mean": 0.008039093692786992,
433
+ "clip_ratio/low_min": 0.0030091897933743894,
434
+ "clip_ratio/region_mean": 0.013231246688519605,
435
+ "entropy": 0.034400187025312334,
436
+ "epoch": 0.00018,
437
+ "grad_norm": 17.663053512573242,
438
+ "kl": 0.061504869983764365,
439
+ "learning_rate": 2.4285714285714288e-05,
440
+ "loss": -0.4629,
441
+ "step": 18,
442
+ "step_time": 74.907978645002
443
+ },
444
+ {
445
+ "clip_ratio/high_max": 0.009938271541614085,
446
+ "clip_ratio/high_mean": 0.005400170251959935,
447
+ "clip_ratio/low_mean": 0.0053618835809174925,
448
+ "clip_ratio/low_min": 0.0013297871919348836,
449
+ "clip_ratio/region_mean": 0.010762053818325512,
450
+ "completions/clipped_ratio": 0.0,
451
+ "completions/max_length": 13178.0,
452
+ "completions/max_terminated_length": 13178.0,
453
+ "completions/mean_length": 10745.46875,
454
+ "completions/mean_terminated_length": 10745.46875,
455
+ "completions/min_length": 1214.0,
456
+ "completions/min_terminated_length": 1214.0,
457
+ "entropy": 0.037911236577201635,
458
+ "epoch": 0.00019,
459
+ "frac_reward_zero_std": 0.0,
460
+ "grad_norm": 5.704324722290039,
461
+ "kl": 0.07061941432766616,
462
+ "learning_rate": 2.5714285714285714e-05,
463
+ "loss": 0.201,
464
+ "num_tokens": 3679071.0,
465
+ "reward": 0.08096315711736679,
466
+ "reward_std": 0.6758204698562622,
467
+ "rewards/rollout_reward_func/mean": 0.08096315711736679,
468
+ "rewards/rollout_reward_func/std": 0.6970977783203125,
469
+ "sampling/importance_sampling_ratio/max": 3.0,
470
+ "sampling/importance_sampling_ratio/mean": 0.9999469518661499,
471
+ "sampling/importance_sampling_ratio/min": 0.22013714909553528,
472
+ "sampling/sampling_logp_difference/max": 1.5135045051574707,
473
+ "sampling/sampling_logp_difference/mean": 0.013475988060235977,
474
+ "step": 19,
475
+ "step_time": 167.33121058399774
476
+ },
477
+ {
478
+ "clip_ratio/high_max": 0.0077229262969922274,
479
+ "clip_ratio/high_mean": 0.0049346807500114664,
480
+ "clip_ratio/low_mean": 0.007034420879790559,
481
+ "clip_ratio/low_min": 0.00042808218859136105,
482
+ "clip_ratio/region_mean": 0.011969101600698195,
483
+ "entropy": 0.03793096466688439,
484
+ "epoch": 0.0002,
485
+ "grad_norm": 4.580775737762451,
486
+ "kl": 0.0702707355376333,
487
+ "learning_rate": 2.714285714285714e-05,
488
+ "loss": 0.1831,
489
+ "step": 20,
490
+ "step_time": 74.7389482260005
491
+ },
492
+ {
493
+ "clip_ratio/high_max": 0.012402922351611778,
494
+ "clip_ratio/high_mean": 0.007489239520509727,
495
+ "clip_ratio/low_mean": 0.005276143303490244,
496
+ "clip_ratio/low_min": 0.0012991318944841623,
497
+ "clip_ratio/region_mean": 0.012765382809448056,
498
+ "completions/clipped_ratio": 0.0,
499
+ "completions/max_length": 13736.0,
500
+ "completions/max_terminated_length": 13736.0,
501
+ "completions/mean_length": 11500.34375,
502
+ "completions/mean_terminated_length": 11500.34375,
503
+ "completions/min_length": 699.0,
504
+ "completions/min_terminated_length": 699.0,
505
+ "entropy": 0.0338919882196933,
506
+ "epoch": 0.00021,
507
+ "frac_reward_zero_std": 0.0,
508
+ "grad_norm": 4.951229572296143,
509
+ "kl": 0.08188517112284899,
510
+ "learning_rate": 2.857142857142857e-05,
511
+ "loss": -0.7709,
512
+ "num_tokens": 4074157.0,
513
+ "reward": -0.03381938114762306,
514
+ "reward_std": 0.35305583477020264,
515
+ "rewards/rollout_reward_func/mean": -0.03381938114762306,
516
+ "rewards/rollout_reward_func/std": 0.3805231750011444,
517
+ "sampling/importance_sampling_ratio/max": 2.957324266433716,
518
+ "sampling/importance_sampling_ratio/mean": 0.9992592930793762,
519
+ "sampling/importance_sampling_ratio/min": 0.08525566011667252,
520
+ "sampling/sampling_logp_difference/max": 2.4621007442474365,
521
+ "sampling/sampling_logp_difference/mean": 0.012218999676406384,
522
+ "step": 21,
523
+ "step_time": 172.31501902600212
524
+ },
525
+ {
526
+ "clip_ratio/high_max": 0.015019576210761443,
527
+ "clip_ratio/high_mean": 0.008796090245596133,
528
+ "clip_ratio/low_mean": 0.0036695960588986054,
529
+ "clip_ratio/low_min": 0.00041946308920159936,
530
+ "clip_ratio/region_mean": 0.012465686231735162,
531
+ "entropy": 0.033622618939261883,
532
+ "epoch": 0.00022,
533
+ "grad_norm": 3.219212293624878,
534
+ "kl": 0.07884596765507013,
535
+ "learning_rate": 3e-05,
536
+ "loss": -0.7798,
537
+ "step": 22,
538
+ "step_time": 77.3749247250089
539
+ },
540
+ {
541
+ "clip_ratio/high_max": 0.0047400579787790775,
542
+ "clip_ratio/high_mean": 0.002854942809790373,
543
+ "clip_ratio/low_mean": 0.0032737747678766027,
544
+ "clip_ratio/low_min": 0.002108728658640757,
545
+ "clip_ratio/region_mean": 0.0061287175776669756,
546
+ "completions/clipped_ratio": 0.0,
547
+ "completions/max_length": 13377.0,
548
+ "completions/max_terminated_length": 13377.0,
549
+ "completions/mean_length": 12026.28125,
550
+ "completions/mean_terminated_length": 12026.28125,
551
+ "completions/min_length": 3719.0,
552
+ "completions/min_terminated_length": 3719.0,
553
+ "entropy": 0.03219945449382067,
554
+ "epoch": 0.00023,
555
+ "frac_reward_zero_std": 0.0,
556
+ "grad_norm": 5.014606475830078,
557
+ "kl": 0.06574048777110875,
558
+ "learning_rate": 3.142857142857143e-05,
559
+ "loss": 0.1808,
560
+ "num_tokens": 4485860.0,
561
+ "reward": 0.021906912326812744,
562
+ "reward_std": 0.6035071015357971,
563
+ "rewards/rollout_reward_func/mean": 0.021906912326812744,
564
+ "rewards/rollout_reward_func/std": 0.6828868389129639,
565
+ "sampling/importance_sampling_ratio/max": 2.2410309314727783,
566
+ "sampling/importance_sampling_ratio/mean": 0.9987294673919678,
567
+ "sampling/importance_sampling_ratio/min": 0.14094144105911255,
568
+ "sampling/sampling_logp_difference/max": 1.9594107866287231,
569
+ "sampling/sampling_logp_difference/mean": 0.01040736399590969,
570
+ "step": 23,
571
+ "step_time": 174.87443501801317
572
+ },
573
+ {
574
+ "clip_ratio/high_max": 0.007845318003091961,
575
+ "clip_ratio/high_mean": 0.004138176242122427,
576
+ "clip_ratio/low_mean": 0.006086511944886297,
577
+ "clip_ratio/low_min": 0.0018401192501187325,
578
+ "clip_ratio/region_mean": 0.010224688216112554,
579
+ "entropy": 0.03171470231609419,
580
+ "epoch": 0.00024,
581
+ "grad_norm": 3.639631748199463,
582
+ "kl": 0.09756560530513525,
583
+ "learning_rate": 3.285714285714286e-05,
584
+ "loss": 0.1634,
585
+ "step": 24,
586
+ "step_time": 76.57009659699543
587
+ },
588
+ {
589
+ "clip_ratio/high_max": 0.005132828722707927,
590
+ "clip_ratio/high_mean": 0.002983081038109958,
591
+ "clip_ratio/low_mean": 0.0029943463450763375,
592
+ "clip_ratio/low_min": 0.0004310344811528921,
593
+ "clip_ratio/region_mean": 0.005977427397738211,
594
+ "completions/clipped_ratio": 0.0,
595
+ "completions/max_length": 13353.0,
596
+ "completions/max_terminated_length": 13353.0,
597
+ "completions/mean_length": 11107.46875,
598
+ "completions/mean_terminated_length": 11107.46875,
599
+ "completions/min_length": 1148.0,
600
+ "completions/min_terminated_length": 1148.0,
601
+ "entropy": 0.03088551398832351,
602
+ "epoch": 0.00025,
603
+ "frac_reward_zero_std": 0.0,
604
+ "grad_norm": 5.480138778686523,
605
+ "kl": 0.07383820961695164,
606
+ "learning_rate": 3.428571428571429e-05,
607
+ "loss": -0.0496,
608
+ "num_tokens": 4867735.0,
609
+ "reward": 0.1064801961183548,
610
+ "reward_std": 0.7724429368972778,
611
+ "rewards/rollout_reward_func/mean": 0.1064801961183548,
612
+ "rewards/rollout_reward_func/std": 0.7649414539337158,
613
+ "sampling/importance_sampling_ratio/max": 3.0,
614
+ "sampling/importance_sampling_ratio/mean": 1.0009795427322388,
615
+ "sampling/importance_sampling_ratio/min": 0.29720377922058105,
616
+ "sampling/sampling_logp_difference/max": 1.2752747535705566,
617
+ "sampling/sampling_logp_difference/mean": 0.007685038261115551,
618
+ "step": 25,
619
+ "step_time": 167.8738094909968
620
+ },
621
+ {
622
+ "clip_ratio/high_max": 0.011608412605710328,
623
+ "clip_ratio/high_mean": 0.0062891200941521674,
624
+ "clip_ratio/low_mean": 0.004773339023813605,
625
+ "clip_ratio/low_min": 0.0009652225417084992,
626
+ "clip_ratio/region_mean": 0.011062459088861942,
627
+ "entropy": 0.030013347961357795,
628
+ "epoch": 0.00026,
629
+ "grad_norm": 3.696756362915039,
630
+ "kl": 0.08500297460705042,
631
+ "learning_rate": 3.571428571428572e-05,
632
+ "loss": -0.0613,
633
+ "step": 26,
634
+ "step_time": 74.28590359098962
635
+ },
636
+ {
637
+ "clip_ratio/high_max": 0.008908686431823298,
638
+ "clip_ratio/high_mean": 0.004666928187361918,
639
+ "clip_ratio/low_mean": 0.002849717377102934,
640
+ "clip_ratio/low_min": 0.0,
641
+ "clip_ratio/region_mean": 0.007516645535361022,
642
+ "completions/clipped_ratio": 0.0,
643
+ "completions/max_length": 12989.0,
644
+ "completions/max_terminated_length": 12989.0,
645
+ "completions/mean_length": 11003.25,
646
+ "completions/mean_terminated_length": 11003.25,
647
+ "completions/min_length": 646.0,
648
+ "completions/min_terminated_length": 646.0,
649
+ "entropy": 0.0298636804218404,
650
+ "epoch": 0.00027,
651
+ "frac_reward_zero_std": 0.0,
652
+ "grad_norm": 3.567798376083374,
653
+ "kl": 0.06775169487809762,
654
+ "learning_rate": 3.7142857142857143e-05,
655
+ "loss": 0.1227,
656
+ "num_tokens": 5246228.0,
657
+ "reward": -0.022418564185500145,
658
+ "reward_std": 0.5492140054702759,
659
+ "rewards/rollout_reward_func/mean": -0.022418564185500145,
660
+ "rewards/rollout_reward_func/std": 0.568433403968811,
661
+ "sampling/importance_sampling_ratio/max": 3.0,
662
+ "sampling/importance_sampling_ratio/mean": 1.001232385635376,
663
+ "sampling/importance_sampling_ratio/min": 0.21295833587646484,
664
+ "sampling/sampling_logp_difference/max": 1.5466587543487549,
665
+ "sampling/sampling_logp_difference/mean": 0.009141004644334316,
666
+ "step": 27,
667
+ "step_time": 167.0670825299967
668
+ },
669
+ {
670
+ "clip_ratio/high_max": 0.00908595250803046,
671
+ "clip_ratio/high_mean": 0.004751309592393227,
672
+ "clip_ratio/low_mean": 0.0031262846459867433,
673
+ "clip_ratio/low_min": 0.0,
674
+ "clip_ratio/region_mean": 0.007877594252931885,
675
+ "entropy": 0.028611631889361888,
676
+ "epoch": 0.00028,
677
+ "grad_norm": 4.556273937225342,
678
+ "kl": 0.08138555369805545,
679
+ "learning_rate": 3.857142857142858e-05,
680
+ "loss": 0.1296,
681
+ "step": 28,
682
+ "step_time": 73.89199204599572
683
+ },
684
+ {
685
+ "clip_ratio/high_max": 0.006030527874827385,
686
+ "clip_ratio/high_mean": 0.0032235972757916898,
687
+ "clip_ratio/low_mean": 0.0035321618779562414,
688
+ "clip_ratio/low_min": 0.00042229730752296746,
689
+ "clip_ratio/region_mean": 0.006755759153747931,
690
+ "completions/clipped_ratio": 0.0,
691
+ "completions/max_length": 12965.0,
692
+ "completions/max_terminated_length": 12965.0,
693
+ "completions/mean_length": 11042.71875,
694
+ "completions/mean_terminated_length": 11042.71875,
695
+ "completions/min_length": 679.0,
696
+ "completions/min_terminated_length": 679.0,
697
+ "entropy": 0.035136714577674866,
698
+ "epoch": 0.00029,
699
+ "frac_reward_zero_std": 0.0,
700
+ "grad_norm": 3.9678425788879395,
701
+ "kl": 0.12332394430995919,
702
+ "learning_rate": 4e-05,
703
+ "loss": 0.0405,
704
+ "num_tokens": 5625953.0,
705
+ "reward": 0.09734156727790833,
706
+ "reward_std": 0.8730854988098145,
707
+ "rewards/rollout_reward_func/mean": 0.09734156727790833,
708
+ "rewards/rollout_reward_func/std": 0.8854982256889343,
709
+ "sampling/importance_sampling_ratio/max": 2.165529251098633,
710
+ "sampling/importance_sampling_ratio/mean": 1.0002460479736328,
711
+ "sampling/importance_sampling_ratio/min": 0.1636945754289627,
712
+ "sampling/sampling_logp_difference/max": 1.8097529411315918,
713
+ "sampling/sampling_logp_difference/mean": 0.010431939736008644,
714
+ "step": 29,
715
+ "step_time": 166.6891715229831
716
+ },
717
+ {
718
+ "clip_ratio/high_max": 0.008942051383201033,
719
+ "clip_ratio/high_mean": 0.004679359000874683,
720
+ "clip_ratio/low_mean": 0.011221011402085423,
721
+ "clip_ratio/low_min": 0.0025366951304022223,
722
+ "clip_ratio/region_mean": 0.015900370432063937,
723
+ "entropy": 0.03758203284814954,
724
+ "epoch": 0.0003,
725
+ "grad_norm": 3.95874285697937,
726
+ "kl": 0.1673376166727394,
727
+ "learning_rate": 4.1428571428571437e-05,
728
+ "loss": 0.0305,
729
+ "step": 30,
730
+ "step_time": 73.23714858698804
731
+ },
732
+ {
733
+ "clip_ratio/high_max": 0.005954384687356651,
734
+ "clip_ratio/high_mean": 0.0029771923436783254,
735
+ "clip_ratio/low_mean": 0.002211941609857604,
736
+ "clip_ratio/low_min": 0.00042229730752296746,
737
+ "clip_ratio/region_mean": 0.005189133953535929,
738
+ "completions/clipped_ratio": 0.0,
739
+ "completions/max_length": 13006.0,
740
+ "completions/max_terminated_length": 13006.0,
741
+ "completions/mean_length": 10191.125,
742
+ "completions/mean_terminated_length": 10191.125,
743
+ "completions/min_length": 407.0,
744
+ "completions/min_terminated_length": 407.0,
745
+ "entropy": 0.02958523586858064,
746
+ "epoch": 0.00031,
747
+ "frac_reward_zero_std": 0.0,
748
+ "grad_norm": 5.264285564422607,
749
+ "kl": 0.15118286339566112,
750
+ "learning_rate": 4.2857142857142856e-05,
751
+ "loss": -0.1601,
752
+ "num_tokens": 5978592.0,
753
+ "reward": 0.15017695724964142,
754
+ "reward_std": 0.7236286401748657,
755
+ "rewards/rollout_reward_func/mean": 0.15017695724964142,
756
+ "rewards/rollout_reward_func/std": 0.810089647769928,
757
+ "sampling/importance_sampling_ratio/max": 2.8205833435058594,
758
+ "sampling/importance_sampling_ratio/mean": 1.0002295970916748,
759
+ "sampling/importance_sampling_ratio/min": 0.11224253475666046,
760
+ "sampling/sampling_logp_difference/max": 2.1870932579040527,
761
+ "sampling/sampling_logp_difference/mean": 0.011024706065654755,
762
+ "step": 31,
763
+ "step_time": 159.12965879299736
764
+ },
765
+ {
766
+ "clip_ratio/high_max": 0.009853226481936872,
767
+ "clip_ratio/high_mean": 0.004926613240968436,
768
+ "clip_ratio/low_mean": 0.006775343528715894,
769
+ "clip_ratio/low_min": 0.0013721397845074534,
770
+ "clip_ratio/region_mean": 0.0117019567405805,
771
+ "entropy": 0.02985566615825519,
772
+ "epoch": 0.00032,
773
+ "grad_norm": 4.398590087890625,
774
+ "kl": 0.12289534416049719,
775
+ "learning_rate": 4.428571428571428e-05,
776
+ "loss": -0.1953,
777
+ "step": 32,
778
+ "step_time": 72.89587214800122
779
+ },
780
+ {
781
+ "clip_ratio/high_max": 0.004788487451151013,
782
+ "clip_ratio/high_mean": 0.003007469786098227,
783
+ "clip_ratio/low_mean": 0.002534152939915657,
784
+ "clip_ratio/low_min": 0.0,
785
+ "clip_ratio/region_mean": 0.005541622740565799,
786
+ "completions/clipped_ratio": 0.0,
787
+ "completions/max_length": 13141.0,
788
+ "completions/max_terminated_length": 13141.0,
789
+ "completions/mean_length": 11030.53125,
790
+ "completions/mean_terminated_length": 11030.53125,
791
+ "completions/min_length": 2621.0,
792
+ "completions/min_terminated_length": 2621.0,
793
+ "entropy": 0.030806915485300124,
794
+ "epoch": 0.00033,
795
+ "frac_reward_zero_std": 0.0,
796
+ "grad_norm": 10.662086486816406,
797
+ "kl": 0.12435430695768446,
798
+ "learning_rate": 4.5714285714285716e-05,
799
+ "loss": 0.0348,
800
+ "num_tokens": 6357887.0,
801
+ "reward": 0.0787108913064003,
802
+ "reward_std": 0.6069952249526978,
803
+ "rewards/rollout_reward_func/mean": 0.0787108913064003,
804
+ "rewards/rollout_reward_func/std": 0.6843085289001465,
805
+ "sampling/importance_sampling_ratio/max": 2.1111693382263184,
806
+ "sampling/importance_sampling_ratio/mean": 0.9992009997367859,
807
+ "sampling/importance_sampling_ratio/min": 0.3298701345920563,
808
+ "sampling/sampling_logp_difference/max": 1.1090562343597412,
809
+ "sampling/sampling_logp_difference/mean": 0.00803404301404953,
810
+ "step": 33,
811
+ "step_time": 171.64439654499802
812
+ },
813
+ {
814
+ "clip_ratio/high_max": 0.008590338955400512,
815
+ "clip_ratio/high_mean": 0.004512183382757939,
816
+ "clip_ratio/low_mean": 0.0033453173091402277,
817
+ "clip_ratio/low_min": 0.0,
818
+ "clip_ratio/region_mean": 0.007857500677346252,
819
+ "entropy": 0.03131913376273587,
820
+ "epoch": 0.00034,
821
+ "grad_norm": 4.566284656524658,
822
+ "kl": 0.19618933752644807,
823
+ "learning_rate": 4.714285714285714e-05,
824
+ "loss": 0.0389,
825
+ "step": 34,
826
+ "step_time": 75.43813904400304
827
+ },
828
+ {
829
+ "clip_ratio/high_max": 0.007634009321918711,
830
+ "clip_ratio/high_mean": 0.004664549152948894,
831
+ "clip_ratio/low_mean": 0.002761319585260935,
832
+ "clip_ratio/low_min": 0.0,
833
+ "clip_ratio/region_mean": 0.007425868738209829,
834
+ "completions/clipped_ratio": 0.0,
835
+ "completions/max_length": 13050.0,
836
+ "completions/max_terminated_length": 13050.0,
837
+ "completions/mean_length": 11396.5,
838
+ "completions/mean_terminated_length": 11396.5,
839
+ "completions/min_length": 1138.0,
840
+ "completions/min_terminated_length": 1138.0,
841
+ "entropy": 0.03832057095132768,
842
+ "epoch": 0.00035,
843
+ "frac_reward_zero_std": 0.0,
844
+ "grad_norm": 5.9035139083862305,
845
+ "kl": 0.11720841797068715,
846
+ "learning_rate": 4.8571428571428576e-05,
847
+ "loss": -0.8942,
848
+ "num_tokens": 6749218.0,
849
+ "reward": -0.07082575559616089,
850
+ "reward_std": 0.47015607357025146,
851
+ "rewards/rollout_reward_func/mean": -0.07082575559616089,
852
+ "rewards/rollout_reward_func/std": 0.5316166281700134,
853
+ "sampling/importance_sampling_ratio/max": 3.0,
854
+ "sampling/importance_sampling_ratio/mean": 0.9994174242019653,
855
+ "sampling/importance_sampling_ratio/min": 0.2981066107749939,
856
+ "sampling/sampling_logp_difference/max": 1.524237871170044,
857
+ "sampling/sampling_logp_difference/mean": 0.012385983020067215,
858
+ "step": 35,
859
+ "step_time": 168.27690360299312
860
+ },
861
+ {
862
+ "clip_ratio/high_max": 0.010596543666906655,
863
+ "clip_ratio/high_mean": 0.006154476490337402,
864
+ "clip_ratio/low_mean": 0.007853075832827017,
865
+ "clip_ratio/low_min": 0.0018445946625433862,
866
+ "clip_ratio/region_mean": 0.014007552264956757,
867
+ "entropy": 0.04002648440655321,
868
+ "epoch": 0.00036,
869
+ "grad_norm": 4.1411895751953125,
870
+ "kl": 0.13405020465143025,
871
+ "learning_rate": 5e-05,
872
+ "loss": -0.903,
873
+ "step": 36,
874
+ "step_time": 72.99781238099604
875
+ },
876
+ {
877
+ "clip_ratio/high_max": 0.005275360541418195,
878
+ "clip_ratio/high_mean": 0.0026376802707090974,
879
+ "clip_ratio/low_mean": 0.0028754653612850234,
880
+ "clip_ratio/low_min": 0.00042517005931586027,
881
+ "clip_ratio/region_mean": 0.005513145631994121,
882
+ "completions/clipped_ratio": 0.0,
883
+ "completions/max_length": 13047.0,
884
+ "completions/max_terminated_length": 13047.0,
885
+ "completions/mean_length": 9878.625,
886
+ "completions/mean_terminated_length": 9878.625,
887
+ "completions/min_length": 657.0,
888
+ "completions/min_terminated_length": 657.0,
889
+ "entropy": 0.03371463587973267,
890
+ "epoch": 0.00037,
891
+ "frac_reward_zero_std": 0.0,
892
+ "grad_norm": 3.8389480113983154,
893
+ "kl": 0.14478663937188685,
894
+ "learning_rate": 4.999300402366083e-05,
895
+ "loss": -0.0642,
896
+ "num_tokens": 7091798.0,
897
+ "reward": 0.2947101891040802,
898
+ "reward_std": 0.9005900621414185,
899
+ "rewards/rollout_reward_func/mean": 0.2947101891040802,
900
+ "rewards/rollout_reward_func/std": 0.9348424077033997,
901
+ "sampling/importance_sampling_ratio/max": 2.4184277057647705,
902
+ "sampling/importance_sampling_ratio/mean": 0.9990720152854919,
903
+ "sampling/importance_sampling_ratio/min": 0.15915784239768982,
904
+ "sampling/sampling_logp_difference/max": 1.83785879611969,
905
+ "sampling/sampling_logp_difference/mean": 0.009822597727179527,
906
+ "step": 37,
907
+ "step_time": 157.96763956499854
908
+ },
909
+ {
910
+ "clip_ratio/high_max": 0.008132616407237947,
911
+ "clip_ratio/high_mean": 0.004482974822167307,
912
+ "clip_ratio/low_mean": 0.00759831492905505,
913
+ "clip_ratio/low_min": 0.00041946308920159936,
914
+ "clip_ratio/region_mean": 0.012081289780326188,
915
+ "entropy": 0.03477160807233304,
916
+ "epoch": 0.00038,
917
+ "grad_norm": 2.4872944355010986,
918
+ "kl": 0.13733661081641912,
919
+ "learning_rate": 4.997202131530303e-05,
920
+ "loss": -0.0742,
921
+ "step": 38,
922
+ "step_time": 69.91781051699581
923
+ },
924
+ {
925
+ "clip_ratio/high_max": 0.008490972220897675,
926
+ "clip_ratio/high_mean": 0.005065609118901193,
927
+ "clip_ratio/low_mean": 0.005926517609623261,
928
+ "clip_ratio/low_min": 0.0008651451207697392,
929
+ "clip_ratio/region_mean": 0.010992126757628284,
930
+ "completions/clipped_ratio": 0.0,
931
+ "completions/max_length": 13241.0,
932
+ "completions/max_terminated_length": 13241.0,
933
+ "completions/mean_length": 9449.46875,
934
+ "completions/mean_terminated_length": 9449.46875,
935
+ "completions/min_length": 724.0,
936
+ "completions/min_terminated_length": 724.0,
937
+ "entropy": 0.04318245640024543,
938
+ "epoch": 0.00039,
939
+ "frac_reward_zero_std": 0.0,
940
+ "grad_norm": 4.977228164672852,
941
+ "kl": 0.15251367259770632,
942
+ "learning_rate": 4.993706753300993e-05,
943
+ "loss": -0.2261,
944
+ "num_tokens": 7421075.0,
945
+ "reward": -0.10847704857587814,
946
+ "reward_std": 0.6628357172012329,
947
+ "rewards/rollout_reward_func/mean": -0.10847704857587814,
948
+ "rewards/rollout_reward_func/std": 0.6781213283538818,
949
+ "sampling/importance_sampling_ratio/max": 3.0,
950
+ "sampling/importance_sampling_ratio/mean": 1.0004980564117432,
951
+ "sampling/importance_sampling_ratio/min": 0.38631442189216614,
952
+ "sampling/sampling_logp_difference/max": 1.6516692638397217,
953
+ "sampling/sampling_logp_difference/mean": 0.012933445163071156,
954
+ "step": 39,
955
+ "step_time": 159.9343009430013
956
+ },
957
+ {
958
+ "clip_ratio/high_max": 0.00801005051471293,
959
+ "clip_ratio/high_mean": 0.00463565593236126,
960
+ "clip_ratio/low_mean": 0.0074972594738937914,
961
+ "clip_ratio/low_min": 0.0017302902415394783,
962
+ "clip_ratio/region_mean": 0.012132915493566543,
963
+ "entropy": 0.04298010759521276,
964
+ "epoch": 0.0004,
965
+ "grad_norm": 3.2948248386383057,
966
+ "kl": 0.14709224458783865,
967
+ "learning_rate": 4.988816876060381e-05,
968
+ "loss": -0.2293,
969
+ "step": 40,
970
+ "step_time": 71.67063941899687
971
+ },
972
+ {
973
+ "clip_ratio/high_max": 0.0078367855749093,
974
+ "clip_ratio/high_mean": 0.00391839278745465,
975
+ "clip_ratio/low_mean": 0.004580665918183513,
976
+ "clip_ratio/low_min": 0.0016778901626821607,
977
+ "clip_ratio/region_mean": 0.008499058720190078,
978
+ "completions/clipped_ratio": 0.0,
979
+ "completions/max_length": 13282.0,
980
+ "completions/max_terminated_length": 13282.0,
981
+ "completions/mean_length": 11286.65625,
982
+ "completions/mean_terminated_length": 11286.65625,
983
+ "completions/min_length": 5097.0,
984
+ "completions/min_terminated_length": 5097.0,
985
+ "entropy": 0.040187596692703664,
986
+ "epoch": 0.00041,
987
+ "frac_reward_zero_std": 0.0,
988
+ "grad_norm": 6.325024127960205,
989
+ "kl": 0.1349788053194061,
990
+ "learning_rate": 4.98253614881812e-05,
991
+ "loss": 0.4806,
992
+ "num_tokens": 7808749.0,
993
+ "reward": 0.17059466242790222,
994
+ "reward_std": 0.8136788010597229,
995
+ "rewards/rollout_reward_func/mean": 0.17059466242790222,
996
+ "rewards/rollout_reward_func/std": 0.7990698218345642,
997
+ "sampling/importance_sampling_ratio/max": 2.513362169265747,
998
+ "sampling/importance_sampling_ratio/mean": 1.0001041889190674,
999
+ "sampling/importance_sampling_ratio/min": 0.04595469310879707,
1000
+ "sampling/sampling_logp_difference/max": 3.08009934425354,
1001
+ "sampling/sampling_logp_difference/mean": 0.012912388890981674,
1002
+ "step": 41,
1003
+ "step_time": 168.68965952500002
1004
+ },
1005
+ {
1006
+ "clip_ratio/high_max": 0.010451015492435545,
1007
+ "clip_ratio/high_mean": 0.0054425216221716255,
1008
+ "clip_ratio/low_mean": 0.011983542804955505,
1009
+ "clip_ratio/low_min": 0.005103718169266358,
1010
+ "clip_ratio/region_mean": 0.017426064412575215,
1011
+ "entropy": 0.042241197486873716,
1012
+ "epoch": 0.00042,
1013
+ "grad_norm": 3.489445209503174,
1014
+ "kl": 0.12189190601930022,
1015
+ "learning_rate": 4.974869258488254e-05,
1016
+ "loss": 0.4733,
1017
+ "step": 42,
1018
+ "step_time": 74.903593405008
1019
+ },
1020
+ {
1021
+ "clip_ratio/high_max": 0.009101398580241948,
1022
+ "clip_ratio/high_mean": 0.005175699247047305,
1023
+ "clip_ratio/low_mean": 0.0022814181866124272,
1024
+ "clip_ratio/low_min": 0.0,
1025
+ "clip_ratio/region_mean": 0.007457117491867393,
1026
+ "completions/clipped_ratio": 0.0,
1027
+ "completions/max_length": 13482.0,
1028
+ "completions/max_terminated_length": 13482.0,
1029
+ "completions/mean_length": 10740.0625,
1030
+ "completions/mean_terminated_length": 10740.0625,
1031
+ "completions/min_length": 512.0,
1032
+ "completions/min_terminated_length": 512.0,
1033
+ "entropy": 0.051280891231726855,
1034
+ "epoch": 0.00043,
1035
+ "frac_reward_zero_std": 0.0,
1036
+ "grad_norm": 4.938914775848389,
1037
+ "kl": 0.1590550672262907,
1038
+ "learning_rate": 4.965821926391673e-05,
1039
+ "loss": -1.2291,
1040
+ "num_tokens": 8178979.0,
1041
+ "reward": -0.07395707070827484,
1042
+ "reward_std": 0.6287177205085754,
1043
+ "rewards/rollout_reward_func/mean": -0.07395707070827484,
1044
+ "rewards/rollout_reward_func/std": 0.6483203172683716,
1045
+ "sampling/importance_sampling_ratio/max": 3.0,
1046
+ "sampling/importance_sampling_ratio/mean": 1.0006033182144165,
1047
+ "sampling/importance_sampling_ratio/min": 0.17635640501976013,
1048
+ "sampling/sampling_logp_difference/max": 1.735248327255249,
1049
+ "sampling/sampling_logp_difference/mean": 0.01553763635456562,
1050
+ "step": 43,
1051
+ "step_time": 167.16221409800346
1052
+ },
1053
+ {
1054
+ "clip_ratio/high_max": 0.015113038418348879,
1055
+ "clip_ratio/high_mean": 0.008806519123027101,
1056
+ "clip_ratio/low_mean": 0.008764334983425215,
1057
+ "clip_ratio/low_min": 0.002106164349243045,
1058
+ "clip_ratio/region_mean": 0.017570854048244655,
1059
+ "entropy": 0.051857893471606076,
1060
+ "epoch": 0.00044,
1061
+ "grad_norm": 4.257857322692871,
1062
+ "kl": 0.1681835250928998,
1063
+ "learning_rate": 4.9554009039866464e-05,
1064
+ "loss": -1.259,
1065
+ "step": 44,
1066
+ "step_time": 75.09785533400282
1067
+ },
1068
+ {
1069
+ "clip_ratio/high_max": 0.014557944756234065,
1070
+ "clip_ratio/high_mean": 0.00841726346698124,
1071
+ "clip_ratio/low_mean": 0.007099654772900976,
1072
+ "clip_ratio/low_min": 0.0015867681941017509,
1073
+ "clip_ratio/region_mean": 0.015516918079811148,
1074
+ "completions/clipped_ratio": 0.0,
1075
+ "completions/max_length": 13154.0,
1076
+ "completions/max_terminated_length": 13154.0,
1077
+ "completions/mean_length": 9749.5,
1078
+ "completions/mean_terminated_length": 9749.5,
1079
+ "completions/min_length": 1209.0,
1080
+ "completions/min_terminated_length": 1209.0,
1081
+ "entropy": 0.07222186587750912,
1082
+ "epoch": 0.00045,
1083
+ "frac_reward_zero_std": 0.0,
1084
+ "grad_norm": 4.37516975402832,
1085
+ "kl": 0.18799102026969194,
1086
+ "learning_rate": 4.9436139678306335e-05,
1087
+ "loss": -1.0063,
1088
+ "num_tokens": 8517773.0,
1089
+ "reward": 0.27246785163879395,
1090
+ "reward_std": 0.9603098034858704,
1091
+ "rewards/rollout_reward_func/mean": 0.27246785163879395,
1092
+ "rewards/rollout_reward_func/std": 0.9796059131622314,
1093
+ "sampling/importance_sampling_ratio/max": 2.65130352973938,
1094
+ "sampling/importance_sampling_ratio/mean": 0.9977819919586182,
1095
+ "sampling/importance_sampling_ratio/min": 0.19320373237133026,
1096
+ "sampling/sampling_logp_difference/max": 1.644010066986084,
1097
+ "sampling/sampling_logp_difference/mean": 0.018468504771590233,
1098
+ "step": 45,
1099
+ "step_time": 157.59911965499487
1100
+ },
1101
+ {
1102
+ "clip_ratio/high_max": 0.018529696972109377,
1103
+ "clip_ratio/high_mean": 0.011184744304046035,
1104
+ "clip_ratio/low_mean": 0.014836993927019648,
1105
+ "clip_ratio/low_min": 0.00389691605232656,
1106
+ "clip_ratio/region_mean": 0.02602173836203292,
1107
+ "entropy": 0.07017613667994738,
1108
+ "epoch": 0.00046,
1109
+ "grad_norm": 4.444145679473877,
1110
+ "kl": 0.19352262001484632,
1111
+ "learning_rate": 4.930469913777124e-05,
1112
+ "loss": -1.0142,
1113
+ "step": 46,
1114
+ "step_time": 72.60382612999456
1115
+ },
1116
+ {
1117
+ "clip_ratio/high_max": 0.009264436783269048,
1118
+ "clip_ratio/high_mean": 0.0050559911178424954,
1119
+ "clip_ratio/low_mean": 0.005405787727795541,
1120
+ "clip_ratio/low_min": 0.0017123287543654442,
1121
+ "clip_ratio/region_mean": 0.010461778845638037,
1122
+ "completions/clipped_ratio": 0.0,
1123
+ "completions/max_length": 13436.0,
1124
+ "completions/max_terminated_length": 13436.0,
1125
+ "completions/mean_length": 9738.90625,
1126
+ "completions/mean_terminated_length": 9738.90625,
1127
+ "completions/min_length": 2217.0,
1128
+ "completions/min_terminated_length": 2217.0,
1129
+ "entropy": 0.06129000161308795,
1130
+ "epoch": 0.00047,
1131
+ "frac_reward_zero_std": 0.0,
1132
+ "grad_norm": 5.193581581115723,
1133
+ "kl": 0.22149202227592468,
1134
+ "learning_rate": 4.91597855041184e-05,
1135
+ "loss": -0.5912,
1136
+ "num_tokens": 8856267.0,
1137
+ "reward": 0.06697382032871246,
1138
+ "reward_std": 0.7488831877708435,
1139
+ "rewards/rollout_reward_func/mean": 0.06697382032871246,
1140
+ "rewards/rollout_reward_func/std": 0.8408234119415283,
1141
+ "sampling/importance_sampling_ratio/max": 3.0,
1142
+ "sampling/importance_sampling_ratio/mean": 0.9977284669876099,
1143
+ "sampling/importance_sampling_ratio/min": 0.14137963950634003,
1144
+ "sampling/sampling_logp_difference/max": 1.9563064575195312,
1145
+ "sampling/sampling_logp_difference/mean": 0.019741810858249664,
1146
+ "step": 47,
1147
+ "step_time": 162.02058434099672
1148
+ },
1149
+ {
1150
+ "clip_ratio/high_max": 0.01009550440357998,
1151
+ "clip_ratio/high_mean": 0.005703912320313975,
1152
+ "clip_ratio/low_mean": 0.012514074885984883,
1153
+ "clip_ratio/low_min": 0.003124057315289974,
1154
+ "clip_ratio/region_mean": 0.018217987177195027,
1155
+ "entropy": 0.059198580449447036,
1156
+ "epoch": 0.00048,
1157
+ "grad_norm": 4.741568565368652,
1158
+ "kl": 0.33581918594427407,
1159
+ "learning_rate": 4.900150691733207e-05,
1160
+ "loss": -0.5888,
1161
+ "step": 48,
1162
+ "step_time": 72.85700078900118
1163
+ },
1164
+ {
1165
+ "clip_ratio/high_max": 0.006105485779698938,
1166
+ "clip_ratio/high_mean": 0.0032638915290590376,
1167
+ "clip_ratio/low_mean": 0.005087121811811812,
1168
+ "clip_ratio/low_min": 0.0012962963082827628,
1169
+ "clip_ratio/region_mean": 0.008351013195351698,
1170
+ "completions/clipped_ratio": 0.0,
1171
+ "completions/max_length": 13361.0,
1172
+ "completions/max_terminated_length": 13361.0,
1173
+ "completions/mean_length": 10574.78125,
1174
+ "completions/mean_terminated_length": 10574.78125,
1175
+ "completions/min_length": 2209.0,
1176
+ "completions/min_terminated_length": 2209.0,
1177
+ "entropy": 0.06214652210474014,
1178
+ "epoch": 0.00049,
1179
+ "frac_reward_zero_std": 0.0,
1180
+ "grad_norm": 5.913769245147705,
1181
+ "kl": 0.2271044417284429,
1182
+ "learning_rate": 4.8829981490825384e-05,
1183
+ "loss": 0.0406,
1184
+ "num_tokens": 9221107.0,
1185
+ "reward": 0.0014273226261138916,
1186
+ "reward_std": 1.042165994644165,
1187
+ "rewards/rollout_reward_func/mean": 0.0014273226261138916,
1188
+ "rewards/rollout_reward_func/std": 1.1747405529022217,
1189
+ "sampling/importance_sampling_ratio/max": 3.0,
1190
+ "sampling/importance_sampling_ratio/mean": 1.0004156827926636,
1191
+ "sampling/importance_sampling_ratio/min": 0.03238532319664955,
1192
+ "sampling/sampling_logp_difference/max": 3.4300498962402344,
1193
+ "sampling/sampling_logp_difference/mean": 0.020954634994268417,
1194
+ "step": 49,
1195
+ "step_time": 167.1314073070098
1196
+ },
1197
+ {
1198
+ "clip_ratio/high_max": 0.011936334864003584,
1199
+ "clip_ratio/high_mean": 0.00661664160725195,
1200
+ "clip_ratio/low_mean": 0.011766647483455017,
1201
+ "clip_ratio/low_min": 0.0026851851725950837,
1202
+ "clip_ratio/region_mean": 0.018383289090706967,
1203
+ "entropy": 0.05987148266285658,
1204
+ "epoch": 0.0005,
1205
+ "grad_norm": 4.5713090896606445,
1206
+ "kl": 0.25915910955518484,
1207
+ "learning_rate": 4.864533722329971e-05,
1208
+ "loss": 0.032,
1209
+ "step": 50,
1210
+ "step_time": 73.75267140600772
1211
+ },
1212
+ {
1213
+ "clip_ratio/high_max": 0.005659546091919765,
1214
+ "clip_ratio/high_mean": 0.003255021831137128,
1215
+ "clip_ratio/low_mean": 0.005070317463832907,
1216
+ "clip_ratio/low_min": 0.002108652319293469,
1217
+ "clip_ratio/region_mean": 0.008325339294970036,
1218
+ "completions/clipped_ratio": 0.0,
1219
+ "completions/max_length": 13190.0,
1220
+ "completions/max_terminated_length": 13190.0,
1221
+ "completions/mean_length": 11075.46875,
1222
+ "completions/mean_terminated_length": 11075.46875,
1223
+ "completions/min_length": 3639.0,
1224
+ "completions/min_terminated_length": 3639.0,
1225
+ "entropy": 0.04613603337202221,
1226
+ "epoch": 0.00051,
1227
+ "frac_reward_zero_std": 0.0,
1228
+ "grad_norm": 6.114112854003906,
1229
+ "kl": 0.16452285135164857,
1230
+ "learning_rate": 4.8447711903227245e-05,
1231
+ "loss": 0.3885,
1232
+ "num_tokens": 9601902.0,
1233
+ "reward": 0.20258194208145142,
1234
+ "reward_std": 0.9743450880050659,
1235
+ "rewards/rollout_reward_func/mean": 0.20258194208145142,
1236
+ "rewards/rollout_reward_func/std": 1.0126794576644897,
1237
+ "sampling/importance_sampling_ratio/max": 3.0,
1238
+ "sampling/importance_sampling_ratio/mean": 1.001020073890686,
1239
+ "sampling/importance_sampling_ratio/min": 0.14685897529125214,
1240
+ "sampling/sampling_logp_difference/max": 1.9182825088500977,
1241
+ "sampling/sampling_logp_difference/mean": 0.01429649069905281,
1242
+ "step": 51,
1243
+ "step_time": 167.67579722999653
1244
+ },
1245
+ {
1246
+ "clip_ratio/high_max": 0.00904727095621638,
1247
+ "clip_ratio/high_mean": 0.005424798160674982,
1248
+ "clip_ratio/low_mean": 0.008320969296619296,
1249
+ "clip_ratio/low_min": 0.0025309495395049453,
1250
+ "clip_ratio/region_mean": 0.013745767500950024,
1251
+ "entropy": 0.04530790273565799,
1252
+ "epoch": 0.00052,
1253
+ "grad_norm": 4.374583721160889,
1254
+ "kl": 0.1778181241825223,
1255
+ "learning_rate": 4.8237253006028074e-05,
1256
+ "loss": 0.373,
1257
+ "step": 52,
1258
+ "step_time": 73.67422152999279
1259
+ },
1260
+ {
1261
+ "clip_ratio/high_max": 0.008981934457551688,
1262
+ "clip_ratio/high_mean": 0.005327096936525777,
1263
+ "clip_ratio/low_mean": 0.0028215366910444573,
1264
+ "clip_ratio/low_min": 0.0,
1265
+ "clip_ratio/region_mean": 0.00814863367122598,
1266
+ "completions/clipped_ratio": 0.0,
1267
+ "completions/max_length": 13266.0,
1268
+ "completions/max_terminated_length": 13266.0,
1269
+ "completions/mean_length": 10068.46875,
1270
+ "completions/mean_terminated_length": 10068.46875,
1271
+ "completions/min_length": 2111.0,
1272
+ "completions/min_terminated_length": 2111.0,
1273
+ "entropy": 0.052251385175623,
1274
+ "epoch": 0.00053,
1275
+ "frac_reward_zero_std": 0.0,
1276
+ "grad_norm": 4.039042949676514,
1277
+ "kl": 0.15936051355674863,
1278
+ "learning_rate": 4.801411758401846e-05,
1279
+ "loss": -0.8609,
1280
+ "num_tokens": 9950736.0,
1281
+ "reward": -0.07826434075832367,
1282
+ "reward_std": 0.8228805065155029,
1283
+ "rewards/rollout_reward_func/mean": -0.07826434075832367,
1284
+ "rewards/rollout_reward_func/std": 0.9726049900054932,
1285
+ "sampling/importance_sampling_ratio/max": 3.0,
1286
+ "sampling/importance_sampling_ratio/mean": 0.9996135234832764,
1287
+ "sampling/importance_sampling_ratio/min": 0.1369163990020752,
1288
+ "sampling/sampling_logp_difference/max": 1.98838472366333,
1289
+ "sampling/sampling_logp_difference/mean": 0.014011010527610779,
1290
+ "step": 53,
1291
+ "step_time": 166.02416877299765
1292
+ },
1293
+ {
1294
+ "clip_ratio/high_max": 0.011853837757371366,
1295
+ "clip_ratio/high_mean": 0.007465479196980596,
1296
+ "clip_ratio/low_mean": 0.010039673798019066,
1297
+ "clip_ratio/low_min": 0.0014534883666783571,
1298
+ "clip_ratio/region_mean": 0.017505153038655408,
1299
+ "entropy": 0.050861918134614825,
1300
+ "epoch": 0.00054,
1301
+ "grad_norm": 2.7967240810394287,
1302
+ "kl": 0.1533252433873713,
1303
+ "learning_rate": 4.777847214921259e-05,
1304
+ "loss": -0.8692,
1305
+ "step": 54,
1306
+ "step_time": 74.05675573799817
1307
+ },
1308
+ {
1309
+ "clip_ratio/high_max": 0.01076295055099763,
1310
+ "clip_ratio/high_mean": 0.006015452978317626,
1311
+ "clip_ratio/low_mean": 0.004845005663810298,
1312
+ "clip_ratio/low_min": 0.00041666667675599456,
1313
+ "clip_ratio/region_mean": 0.010860458569368348,
1314
+ "completions/clipped_ratio": 0.0,
1315
+ "completions/max_length": 13128.0,
1316
+ "completions/max_terminated_length": 13128.0,
1317
+ "completions/mean_length": 11066.8125,
1318
+ "completions/mean_terminated_length": 11066.8125,
1319
+ "completions/min_length": 899.0,
1320
+ "completions/min_terminated_length": 899.0,
1321
+ "entropy": 0.052756483550183475,
1322
+ "epoch": 0.00055,
1323
+ "frac_reward_zero_std": 0.0,
1324
+ "grad_norm": 5.2398858070373535,
1325
+ "kl": 0.16460610833019018,
1326
+ "learning_rate": 4.753049254906501e-05,
1327
+ "loss": 0.0265,
1328
+ "num_tokens": 10330842.0,
1329
+ "reward": 0.053395774215459824,
1330
+ "reward_std": 0.7901345491409302,
1331
+ "rewards/rollout_reward_func/mean": 0.053395774215459824,
1332
+ "rewards/rollout_reward_func/std": 0.8272984623908997,
1333
+ "sampling/importance_sampling_ratio/max": 3.0,
1334
+ "sampling/importance_sampling_ratio/mean": 1.0008231401443481,
1335
+ "sampling/importance_sampling_ratio/min": 0.21802321076393127,
1336
+ "sampling/sampling_logp_difference/max": 1.5231537818908691,
1337
+ "sampling/sampling_logp_difference/mean": 0.01787828654050827,
1338
+ "step": 55,
1339
+ "step_time": 168.7718367270063
1340
+ },
1341
+ {
1342
+ "clip_ratio/high_max": 0.012885269446996972,
1343
+ "clip_ratio/high_mean": 0.006856542007881217,
1344
+ "clip_ratio/low_mean": 0.013156471701222472,
1345
+ "clip_ratio/low_min": 0.0017187500488944352,
1346
+ "clip_ratio/region_mean": 0.020013013621792197,
1347
+ "entropy": 0.04982141393702477,
1348
+ "epoch": 0.00056,
1349
+ "grad_norm": 4.425920486450195,
1350
+ "kl": 0.19017641432583332,
1351
+ "learning_rate": 4.727036383524666e-05,
1352
+ "loss": 0.0102,
1353
+ "step": 56,
1354
+ "step_time": 73.32767669900568
1355
+ },
1356
+ {
1357
+ "clip_ratio/high_max": 0.008151731628458947,
1358
+ "clip_ratio/high_mean": 0.004496746012591757,
1359
+ "clip_ratio/low_mean": 0.0041994539787992835,
1360
+ "clip_ratio/low_min": 0.0016025641234591603,
1361
+ "clip_ratio/region_mean": 0.00869619999139104,
1362
+ "completions/clipped_ratio": 0.0,
1363
+ "completions/max_length": 13100.0,
1364
+ "completions/max_terminated_length": 13100.0,
1365
+ "completions/mean_length": 10688.53125,
1366
+ "completions/mean_terminated_length": 10688.53125,
1367
+ "completions/min_length": 1659.0,
1368
+ "completions/min_terminated_length": 1659.0,
1369
+ "entropy": 0.041110213147476315,
1370
+ "epoch": 0.00057,
1371
+ "frac_reward_zero_std": 0.0,
1372
+ "grad_norm": 4.053499221801758,
1373
+ "kl": 0.16082082968205214,
1374
+ "learning_rate": 4.699828012555243e-05,
1375
+ "loss": -0.2207,
1376
+ "num_tokens": 10698978.0,
1377
+ "reward": 0.07224054634571075,
1378
+ "reward_std": 0.7553356885910034,
1379
+ "rewards/rollout_reward_func/mean": 0.07224054634571075,
1380
+ "rewards/rollout_reward_func/std": 0.7995120286941528,
1381
+ "sampling/importance_sampling_ratio/max": 3.0,
1382
+ "sampling/importance_sampling_ratio/mean": 0.9975321292877197,
1383
+ "sampling/importance_sampling_ratio/min": 0.21801193058490753,
1384
+ "sampling/sampling_logp_difference/max": 2.1512062549591064,
1385
+ "sampling/sampling_logp_difference/mean": 0.012901031412184238,
1386
+ "step": 57,
1387
+ "step_time": 166.55692352800543
1388
+ },
1389
+ {
1390
+ "clip_ratio/high_max": 0.01201454337569885,
1391
+ "clip_ratio/high_mean": 0.0074542641377775,
1392
+ "clip_ratio/low_mean": 0.005944907039520331,
1393
+ "clip_ratio/low_min": 0.00042517005931586027,
1394
+ "clip_ratio/region_mean": 0.01339917117729783,
1395
+ "entropy": 0.04046116629615426,
1396
+ "epoch": 0.00058,
1397
+ "grad_norm": 2.3814172744750977,
1398
+ "kl": 0.18354935571551323,
1399
+ "learning_rate": 4.671444445904316e-05,
1400
+ "loss": -0.2364,
1401
+ "step": 58,
1402
+ "step_time": 73.06675892299609
1403
+ },
1404
+ {
1405
+ "clip_ratio/high_max": 0.004129695822484791,
1406
+ "clip_ratio/high_mean": 0.00248151458799839,
1407
+ "clip_ratio/low_mean": 0.0025588898279238492,
1408
+ "clip_ratio/low_min": 0.00041666667675599456,
1409
+ "clip_ratio/region_mean": 0.005040404415922239,
1410
+ "completions/clipped_ratio": 0.0,
1411
+ "completions/max_length": 13176.0,
1412
+ "completions/max_terminated_length": 13176.0,
1413
+ "completions/mean_length": 9803.90625,
1414
+ "completions/mean_terminated_length": 9803.90625,
1415
+ "completions/min_length": 663.0,
1416
+ "completions/min_terminated_length": 663.0,
1417
+ "entropy": 0.04748129227664322,
1418
+ "epoch": 0.00059,
1419
+ "frac_reward_zero_std": 0.0,
1420
+ "grad_norm": 4.284725189208984,
1421
+ "kl": 0.18153538985643536,
1422
+ "learning_rate": 4.641906864453027e-05,
1423
+ "loss": -0.7635,
1424
+ "num_tokens": 11038846.0,
1425
+ "reward": 0.13717541098594666,
1426
+ "reward_std": 0.9412651062011719,
1427
+ "rewards/rollout_reward_func/mean": 0.13717541098594666,
1428
+ "rewards/rollout_reward_func/std": 0.9482499361038208,
1429
+ "sampling/importance_sampling_ratio/max": 3.0,
1430
+ "sampling/importance_sampling_ratio/mean": 0.9980131983757019,
1431
+ "sampling/importance_sampling_ratio/min": 0.18534508347511292,
1432
+ "sampling/sampling_logp_difference/max": 1.6855359077453613,
1433
+ "sampling/sampling_logp_difference/mean": 0.0141033586114645,
1434
+ "step": 59,
1435
+ "step_time": 160.25458542300476
1436
+ },
1437
+ {
1438
+ "clip_ratio/high_max": 0.009755878942087293,
1439
+ "clip_ratio/high_mean": 0.005086272780317813,
1440
+ "clip_ratio/low_mean": 0.010913642530795187,
1441
+ "clip_ratio/low_min": 0.00582064944319427,
1442
+ "clip_ratio/region_mean": 0.015999915311113,
1443
+ "entropy": 0.04707136598881334,
1444
+ "epoch": 0.0006,
1445
+ "grad_norm": 3.2246789932250977,
1446
+ "kl": 0.2255641722586006,
1447
+ "learning_rate": 4.6112373102516095e-05,
1448
+ "loss": -0.789,
1449
+ "step": 60,
1450
+ "step_time": 72.25460137299524
1451
+ },
1452
+ {
1453
+ "clip_ratio/high_max": 0.007714824576396495,
1454
+ "clip_ratio/high_mean": 0.004483810480451211,
1455
+ "clip_ratio/low_mean": 0.003787761408602819,
1456
+ "clip_ratio/low_min": 0.00042229730752296746,
1457
+ "clip_ratio/region_mean": 0.008271571874502115,
1458
+ "completions/clipped_ratio": 0.0,
1459
+ "completions/max_length": 13179.0,
1460
+ "completions/max_terminated_length": 13179.0,
1461
+ "completions/mean_length": 10362.78125,
1462
+ "completions/mean_terminated_length": 10362.78125,
1463
+ "completions/min_length": 2001.0,
1464
+ "completions/min_terminated_length": 2001.0,
1465
+ "entropy": 0.038825739873573184,
1466
+ "epoch": 0.00061,
1467
+ "frac_reward_zero_std": 0.0,
1468
+ "grad_norm": 3.541637420654297,
1469
+ "kl": 0.1603014951106161,
1470
+ "learning_rate": 4.5794586700707875e-05,
1471
+ "loss": 0.1258,
1472
+ "num_tokens": 11396771.0,
1473
+ "reward": 0.24396184086799622,
1474
+ "reward_std": 0.822333037853241,
1475
+ "rewards/rollout_reward_func/mean": 0.24396184086799622,
1476
+ "rewards/rollout_reward_func/std": 0.7946814298629761,
1477
+ "sampling/importance_sampling_ratio/max": 2.8133909702301025,
1478
+ "sampling/importance_sampling_ratio/mean": 1.0008466243743896,
1479
+ "sampling/importance_sampling_ratio/min": 0.37330174446105957,
1480
+ "sampling/sampling_logp_difference/max": 1.0343904495239258,
1481
+ "sampling/sampling_logp_difference/mean": 0.009579579345881939,
1482
+ "step": 61,
1483
+ "step_time": 163.49556491199837
1484
+ },
1485
+ {
1486
+ "clip_ratio/high_max": 0.013337028271052986,
1487
+ "clip_ratio/high_mean": 0.007507459231419489,
1488
+ "clip_ratio/low_mean": 0.009406370008946396,
1489
+ "clip_ratio/low_min": 0.0033371542231179774,
1490
+ "clip_ratio/region_mean": 0.016913829240365885,
1491
+ "entropy": 0.039391311816871166,
1492
+ "epoch": 0.00062,
1493
+ "grad_norm": 1.9976162910461426,
1494
+ "kl": 0.17127820663154125,
1495
+ "learning_rate": 4.546594658322805e-05,
1496
+ "loss": 0.1168,
1497
+ "step": 62,
1498
+ "step_time": 72.66579336100767
1499
+ },
1500
+ {
1501
+ "clip_ratio/high_max": 0.007516533107263967,
1502
+ "clip_ratio/high_mean": 0.0037582665536319837,
1503
+ "clip_ratio/low_mean": 0.004609227966284379,
1504
+ "clip_ratio/low_min": 0.00041946308920159936,
1505
+ "clip_ratio/region_mean": 0.008367494549020194,
1506
+ "completions/clipped_ratio": 0.0,
1507
+ "completions/max_length": 12647.0,
1508
+ "completions/max_terminated_length": 12647.0,
1509
+ "completions/mean_length": 9626.34375,
1510
+ "completions/mean_terminated_length": 9626.34375,
1511
+ "completions/min_length": 2082.0,
1512
+ "completions/min_terminated_length": 2082.0,
1513
+ "entropy": 0.04371998517308384,
1514
+ "epoch": 0.00063,
1515
+ "frac_reward_zero_std": 0.0,
1516
+ "grad_norm": 3.0750207901000977,
1517
+ "kl": 0.23586172657087445,
1518
+ "learning_rate": 4.512669799364848e-05,
1519
+ "loss": 0.2851,
1520
+ "num_tokens": 11730926.0,
1521
+ "reward": 0.1786271631717682,
1522
+ "reward_std": 0.671964168548584,
1523
+ "rewards/rollout_reward_func/mean": 0.1786271631717682,
1524
+ "rewards/rollout_reward_func/std": 0.7027593851089478,
1525
+ "sampling/importance_sampling_ratio/max": 2.313385248184204,
1526
+ "sampling/importance_sampling_ratio/mean": 0.9989328384399414,
1527
+ "sampling/importance_sampling_ratio/min": 0.14483538269996643,
1528
+ "sampling/sampling_logp_difference/max": 1.9321575164794922,
1529
+ "sampling/sampling_logp_difference/mean": 0.012537499889731407,
1530
+ "step": 63,
1531
+ "step_time": 156.11718567599382
1532
+ },
1533
+ {
1534
+ "clip_ratio/high_max": 0.014760866208234802,
1535
+ "clip_ratio/high_mean": 0.00759016461961437,
1536
+ "clip_ratio/low_mean": 0.009175922066788189,
1537
+ "clip_ratio/low_min": 0.002094518975354731,
1538
+ "clip_ratio/region_mean": 0.01676608665729873,
1539
+ "entropy": 0.04859339352697134,
1540
+ "epoch": 0.00064,
1541
+ "grad_norm": 2.8551297187805176,
1542
+ "kl": 0.23367749294266105,
1543
+ "learning_rate": 4.477709409198042e-05,
1544
+ "loss": 0.2813,
1545
+ "step": 64,
1546
+ "step_time": 68.68542803400123
1547
+ },
1548
+ {
1549
+ "clip_ratio/high_max": 0.0030558938160538673,
1550
+ "clip_ratio/high_mean": 0.0015279469080269337,
1551
+ "clip_ratio/low_mean": 0.003682863956782967,
1552
+ "clip_ratio/low_min": 0.0,
1553
+ "clip_ratio/region_mean": 0.005210810893913731,
1554
+ "completions/clipped_ratio": 0.0,
1555
+ "completions/max_length": 12710.0,
1556
+ "completions/max_terminated_length": 12710.0,
1557
+ "completions/mean_length": 9393.03125,
1558
+ "completions/mean_terminated_length": 9393.03125,
1559
+ "completions/min_length": 1634.0,
1560
+ "completions/min_terminated_length": 1634.0,
1561
+ "entropy": 0.05264301272109151,
1562
+ "epoch": 0.00065,
1563
+ "frac_reward_zero_std": 0.0,
1564
+ "grad_norm": 4.439396858215332,
1565
+ "kl": 0.2075956475455314,
1566
+ "learning_rate": 4.441739576575714e-05,
1567
+ "loss": 0.701,
1568
+ "num_tokens": 12057521.0,
1569
+ "reward": 0.40391504764556885,
1570
+ "reward_std": 0.80250084400177,
1571
+ "rewards/rollout_reward_func/mean": 0.40391504764556885,
1572
+ "rewards/rollout_reward_func/std": 0.8927465081214905,
1573
+ "sampling/importance_sampling_ratio/max": 2.7482974529266357,
1574
+ "sampling/importance_sampling_ratio/mean": 1.0016303062438965,
1575
+ "sampling/importance_sampling_ratio/min": 0.26992499828338623,
1576
+ "sampling/sampling_logp_difference/max": 1.309611201286316,
1577
+ "sampling/sampling_logp_difference/mean": 0.012656865641474724,
1578
+ "step": 65,
1579
+ "step_time": 157.0456754090119
1580
+ },
1581
+ {
1582
+ "clip_ratio/high_max": 0.011578783625736833,
1583
+ "clip_ratio/high_mean": 0.006229532649740577,
1584
+ "clip_ratio/low_mean": 0.008213154564145952,
1585
+ "clip_ratio/low_min": 0.0016835207934491336,
1586
+ "clip_ratio/region_mean": 0.01444268724299036,
1587
+ "entropy": 0.05576217197813094,
1588
+ "epoch": 0.00066,
1589
+ "grad_norm": 2.467440605163574,
1590
+ "kl": 0.2055382311809808,
1591
+ "learning_rate": 4.404787143534977e-05,
1592
+ "loss": 0.6828,
1593
+ "step": 66,
1594
+ "step_time": 69.32088893900436
1595
+ },
1596
+ {
1597
+ "clip_ratio/high_max": 0.005042135278927162,
1598
+ "clip_ratio/high_mean": 0.003166579277603887,
1599
+ "clip_ratio/low_mean": 0.006012097903294489,
1600
+ "clip_ratio/low_min": 0.0017156008398160338,
1601
+ "clip_ratio/region_mean": 0.009178677151794545,
1602
+ "completions/clipped_ratio": 0.0,
1603
+ "completions/max_length": 13314.0,
1604
+ "completions/max_terminated_length": 13314.0,
1605
+ "completions/mean_length": 10746.4375,
1606
+ "completions/mean_terminated_length": 10746.4375,
1607
+ "completions/min_length": 413.0,
1608
+ "completions/min_terminated_length": 413.0,
1609
+ "entropy": 0.05329764215275645,
1610
+ "epoch": 0.00067,
1611
+ "frac_reward_zero_std": 0.0,
1612
+ "grad_norm": 3.5107641220092773,
1613
+ "kl": 0.19909493857994676,
1614
+ "learning_rate": 4.366879685366202e-05,
1615
+ "loss": -0.0363,
1616
+ "num_tokens": 12427620.0,
1617
+ "reward": -0.05278514325618744,
1618
+ "reward_std": 0.5065455436706543,
1619
+ "rewards/rollout_reward_func/mean": -0.05278514325618744,
1620
+ "rewards/rollout_reward_func/std": 0.6038879156112671,
1621
+ "sampling/importance_sampling_ratio/max": 1.9659762382507324,
1622
+ "sampling/importance_sampling_ratio/mean": 0.9986153841018677,
1623
+ "sampling/importance_sampling_ratio/min": 0.21797966957092285,
1624
+ "sampling/sampling_logp_difference/max": 1.5233535766601562,
1625
+ "sampling/sampling_logp_difference/mean": 0.012393541634082794,
1626
+ "step": 67,
1627
+ "step_time": 166.0341528559984
1628
+ },
1629
+ {
1630
+ "clip_ratio/high_max": 0.008735484327189624,
1631
+ "clip_ratio/high_mean": 0.005423485417850316,
1632
+ "clip_ratio/low_mean": 0.010732135677244514,
1633
+ "clip_ratio/low_min": 0.0028922114870510995,
1634
+ "clip_ratio/region_mean": 0.01615562100778334,
1635
+ "entropy": 0.05276717653032392,
1636
+ "epoch": 0.00068,
1637
+ "grad_norm": 5.79787540435791,
1638
+ "kl": 0.2132586808875203,
1639
+ "learning_rate": 4.3280454900353015e-05,
1640
+ "loss": -0.0566,
1641
+ "step": 68,
1642
+ "step_time": 73.57829091900203
1643
+ },
1644
+ {
1645
+ "clip_ratio/high_max": 0.011086496990174055,
1646
+ "clip_ratio/high_mean": 0.0059599152009468526,
1647
+ "clip_ratio/low_mean": 0.003915530367521569,
1648
+ "clip_ratio/low_min": 0.0008503401186317205,
1649
+ "clip_ratio/region_mean": 0.009875445597572252,
1650
+ "completions/clipped_ratio": 0.0,
1651
+ "completions/max_length": 13366.0,
1652
+ "completions/max_terminated_length": 13366.0,
1653
+ "completions/mean_length": 8519.09375,
1654
+ "completions/mean_terminated_length": 8519.09375,
1655
+ "completions/min_length": 403.0,
1656
+ "completions/min_terminated_length": 403.0,
1657
+ "entropy": 0.06176481384318322,
1658
+ "epoch": 0.00069,
1659
+ "frac_reward_zero_std": 0.0,
1660
+ "grad_norm": 4.625095367431641,
1661
+ "kl": 0.22681198676582426,
1662
+ "learning_rate": 4.288313537074191e-05,
1663
+ "loss": -0.8238,
1664
+ "num_tokens": 12726508.0,
1665
+ "reward": 0.10592889785766602,
1666
+ "reward_std": 0.8413736820220947,
1667
+ "rewards/rollout_reward_func/mean": 0.10592889785766602,
1668
+ "rewards/rollout_reward_func/std": 0.8298947811126709,
1669
+ "sampling/importance_sampling_ratio/max": 2.5253984928131104,
1670
+ "sampling/importance_sampling_ratio/mean": 0.9988598823547363,
1671
+ "sampling/importance_sampling_ratio/min": 0.3644062876701355,
1672
+ "sampling/sampling_logp_difference/max": 1.0094858407974243,
1673
+ "sampling/sampling_logp_difference/mean": 0.013074668124318123,
1674
+ "step": 69,
1675
+ "step_time": 160.9605769760019
1676
+ },
1677
+ {
1678
+ "clip_ratio/high_max": 0.010014714149292558,
1679
+ "clip_ratio/high_mean": 0.005424023693194613,
1680
+ "clip_ratio/low_mean": 0.009666147583629936,
1681
+ "clip_ratio/low_min": 0.0025510203558951616,
1682
+ "clip_ratio/region_mean": 0.015090171465999447,
1683
+ "entropy": 0.059861738176550716,
1684
+ "epoch": 0.0007,
1685
+ "grad_norm": 2.4509620666503906,
1686
+ "kl": 0.2412201176630333,
1687
+ "learning_rate": 4.2477134759551676e-05,
1688
+ "loss": -0.8366,
1689
+ "step": 70,
1690
+ "step_time": 72.93759117600348
1691
+ },
1692
+ {
1693
+ "clip_ratio/high_max": 0.004440079559572041,
1694
+ "clip_ratio/high_mean": 0.002428373132715933,
1695
+ "clip_ratio/low_mean": 0.0032831220742082223,
1696
+ "clip_ratio/low_min": 0.0,
1697
+ "clip_ratio/region_mean": 0.00571149516326841,
1698
+ "completions/clipped_ratio": 0.0,
1699
+ "completions/max_length": 12877.0,
1700
+ "completions/max_terminated_length": 12877.0,
1701
+ "completions/mean_length": 10483.0,
1702
+ "completions/mean_terminated_length": 10483.0,
1703
+ "completions/min_length": 400.0,
1704
+ "completions/min_terminated_length": 400.0,
1705
+ "entropy": 0.042265110299922526,
1706
+ "epoch": 0.00071,
1707
+ "frac_reward_zero_std": 0.0,
1708
+ "grad_norm": 6.566309928894043,
1709
+ "kl": 0.29861711349803954,
1710
+ "learning_rate": 4.206275603965376e-05,
1711
+ "loss": -0.0193,
1712
+ "num_tokens": 13088062.0,
1713
+ "reward": 0.2799592912197113,
1714
+ "reward_std": 0.8195935487747192,
1715
+ "rewards/rollout_reward_func/mean": 0.2799592912197113,
1716
+ "rewards/rollout_reward_func/std": 0.8185666799545288,
1717
+ "sampling/importance_sampling_ratio/max": 3.0,
1718
+ "sampling/importance_sampling_ratio/mean": 0.9999350905418396,
1719
+ "sampling/importance_sampling_ratio/min": 0.20893976092338562,
1720
+ "sampling/sampling_logp_difference/max": 1.5657092332839966,
1721
+ "sampling/sampling_logp_difference/mean": 0.010059979744255543,
1722
+ "step": 71,
1723
+ "step_time": 163.97088142000212
1724
+ },
1725
+ {
1726
+ "clip_ratio/high_max": 0.009024532453622669,
1727
+ "clip_ratio/high_mean": 0.0049289329326711595,
1728
+ "clip_ratio/low_mean": 0.003570742075680755,
1729
+ "clip_ratio/low_min": 0.0,
1730
+ "clip_ratio/region_mean": 0.008499675037455745,
1731
+ "entropy": 0.04315118561498821,
1732
+ "epoch": 0.00072,
1733
+ "grad_norm": 3.726982355117798,
1734
+ "kl": 0.16696504969149828,
1735
+ "learning_rate": 4.1640308435978284e-05,
1736
+ "loss": -0.033,
1737
+ "step": 72,
1738
+ "step_time": 70.53560518000086
1739
+ },
1740
+ {
1741
+ "clip_ratio/high_max": 0.0072561182314530015,
1742
+ "clip_ratio/high_mean": 0.004068221780471504,
1743
+ "clip_ratio/low_mean": 0.005030757718486711,
1744
+ "clip_ratio/low_min": 0.0008477011579088867,
1745
+ "clip_ratio/region_mean": 0.0090989794844063,
1746
+ "completions/clipped_ratio": 0.0,
1747
+ "completions/max_length": 13527.0,
1748
+ "completions/max_terminated_length": 13527.0,
1749
+ "completions/mean_length": 10463.59375,
1750
+ "completions/mean_terminated_length": 10463.59375,
1751
+ "completions/min_length": 697.0,
1752
+ "completions/min_terminated_length": 697.0,
1753
+ "entropy": 0.06954505993053317,
1754
+ "epoch": 0.00073,
1755
+ "frac_reward_zero_std": 0.0,
1756
+ "grad_norm": 3.154179573059082,
1757
+ "kl": 0.2885348331183195,
1758
+ "learning_rate": 4.121010719475882e-05,
1759
+ "loss": 0.4437,
1760
+ "num_tokens": 13448939.0,
1761
+ "reward": 0.031230026856064796,
1762
+ "reward_std": 1.2288157939910889,
1763
+ "rewards/rollout_reward_func/mean": 0.031230026856064796,
1764
+ "rewards/rollout_reward_func/std": 1.4065639972686768,
1765
+ "sampling/importance_sampling_ratio/max": 3.0,
1766
+ "sampling/importance_sampling_ratio/mean": 1.001633882522583,
1767
+ "sampling/importance_sampling_ratio/min": 0.07619146257638931,
1768
+ "sampling/sampling_logp_difference/max": 2.5745058059692383,
1769
+ "sampling/sampling_logp_difference/mean": 0.015677718445658684,
1770
+ "step": 73,
1771
+ "step_time": 168.45864097499725
1772
+ },
1773
+ {
1774
+ "clip_ratio/high_max": 0.01567425244138576,
1775
+ "clip_ratio/high_mean": 0.008502019758452661,
1776
+ "clip_ratio/low_mean": 0.007619034266099334,
1777
+ "clip_ratio/low_min": 0.0012556306610349566,
1778
+ "clip_ratio/region_mean": 0.0161210541264154,
1779
+ "entropy": 0.06863664695993066,
1780
+ "epoch": 0.00074,
1781
+ "grad_norm": 3.359342098236084,
1782
+ "kl": 0.32731896825134754,
1783
+ "learning_rate": 4.077247334828387e-05,
1784
+ "loss": 0.4367,
1785
+ "step": 74,
1786
+ "step_time": 76.37142113899972
1787
+ },
1788
+ {
1789
+ "clip_ratio/high_max": 0.00919221993535757,
1790
+ "clip_ratio/high_mean": 0.004596109967678785,
1791
+ "clip_ratio/low_mean": 0.0027718102501239628,
1792
+ "clip_ratio/low_min": 0.0008627254865132272,
1793
+ "clip_ratio/region_mean": 0.007367920188698918,
1794
+ "completions/clipped_ratio": 0.0,
1795
+ "completions/max_length": 13662.0,
1796
+ "completions/max_terminated_length": 13662.0,
1797
+ "completions/mean_length": 12236.15625,
1798
+ "completions/mean_terminated_length": 12236.15625,
1799
+ "completions/min_length": 6765.0,
1800
+ "completions/min_terminated_length": 6765.0,
1801
+ "entropy": 0.05073382775299251,
1802
+ "epoch": 0.00075,
1803
+ "frac_reward_zero_std": 0.0,
1804
+ "grad_norm": 3.2718288898468018,
1805
+ "kl": 0.21677575819194317,
1806
+ "learning_rate": 4.032773347533051e-05,
1807
+ "loss": 0.4564,
1808
+ "num_tokens": 13866649.0,
1809
+ "reward": 0.2860051393508911,
1810
+ "reward_std": 0.8219031095504761,
1811
+ "rewards/rollout_reward_func/mean": 0.2860051393508911,
1812
+ "rewards/rollout_reward_func/std": 0.7910440564155579,
1813
+ "sampling/importance_sampling_ratio/max": 3.0,
1814
+ "sampling/importance_sampling_ratio/mean": 0.99900221824646,
1815
+ "sampling/importance_sampling_ratio/min": 0.13723966479301453,
1816
+ "sampling/sampling_logp_difference/max": 1.9860265254974365,
1817
+ "sampling/sampling_logp_difference/mean": 0.012662259861826897,
1818
+ "step": 75,
1819
+ "step_time": 176.83860804200594
1820
+ },
1821
+ {
1822
+ "clip_ratio/high_max": 0.015345552063081414,
1823
+ "clip_ratio/high_mean": 0.008016182604478672,
1824
+ "clip_ratio/low_mean": 0.011612206522841007,
1825
+ "clip_ratio/low_min": 0.004224211996188387,
1826
+ "clip_ratio/region_mean": 0.019628389243735,
1827
+ "entropy": 0.05343453283421695,
1828
+ "epoch": 0.00076,
1829
+ "grad_norm": 3.2833566665649414,
1830
+ "kl": 0.24319026991724968,
1831
+ "learning_rate": 3.9876219457459105e-05,
1832
+ "loss": 0.4571,
1833
+ "step": 76,
1834
+ "step_time": 77.4465379820067
1835
+ },
1836
+ {
1837
+ "clip_ratio/high_max": 0.006052493612514809,
1838
+ "clip_ratio/high_mean": 0.003868083542329259,
1839
+ "clip_ratio/low_mean": 0.0047318042779807,
1840
+ "clip_ratio/low_min": 0.000838963984278962,
1841
+ "clip_ratio/region_mean": 0.00859988784941379,
1842
+ "completions/clipped_ratio": 0.0,
1843
+ "completions/max_length": 12997.0,
1844
+ "completions/max_terminated_length": 12997.0,
1845
+ "completions/mean_length": 10965.78125,
1846
+ "completions/mean_terminated_length": 10965.78125,
1847
+ "completions/min_length": 3016.0,
1848
+ "completions/min_terminated_length": 3016.0,
1849
+ "entropy": 0.0676007338333875,
1850
+ "epoch": 0.00077,
1851
+ "frac_reward_zero_std": 0.0,
1852
+ "grad_norm": 4.018630027770996,
1853
+ "kl": 0.3990188483148813,
1854
+ "learning_rate": 3.9418268231350794e-05,
1855
+ "loss": 0.3117,
1856
+ "num_tokens": 14243707.0,
1857
+ "reward": 0.3758563995361328,
1858
+ "reward_std": 0.6880111694335938,
1859
+ "rewards/rollout_reward_func/mean": 0.3758563995361328,
1860
+ "rewards/rollout_reward_func/std": 0.7155471444129944,
1861
+ "sampling/importance_sampling_ratio/max": 3.0,
1862
+ "sampling/importance_sampling_ratio/mean": 1.0001542568206787,
1863
+ "sampling/importance_sampling_ratio/min": 0.26981011033058167,
1864
+ "sampling/sampling_logp_difference/max": 1.7861847877502441,
1865
+ "sampling/sampling_logp_difference/mean": 0.014852023683488369,
1866
+ "step": 77,
1867
+ "step_time": 168.0876079659938
1868
+ },
1869
+ {
1870
+ "clip_ratio/high_max": 0.011769446049584076,
1871
+ "clip_ratio/high_mean": 0.007204167588497512,
1872
+ "clip_ratio/low_mean": 0.005082681338535622,
1873
+ "clip_ratio/low_min": 0.0016722972795832902,
1874
+ "clip_ratio/region_mean": 0.012286848752410151,
1875
+ "entropy": 0.07258684397675097,
1876
+ "epoch": 0.00078,
1877
+ "grad_norm": 3.411888360977173,
1878
+ "kl": 0.3419452579692006,
1879
+ "learning_rate": 3.8954221537372784e-05,
1880
+ "loss": 0.2974,
1881
+ "step": 78,
1882
+ "step_time": 74.37252307100789
1883
+ },
1884
+ {
1885
+ "clip_ratio/high_max": 0.005192018230445683,
1886
+ "clip_ratio/high_mean": 0.00372389325639233,
1887
+ "clip_ratio/low_mean": 0.004230349280987866,
1888
+ "clip_ratio/low_min": 0.00041946308920159936,
1889
+ "clip_ratio/region_mean": 0.007954242537380196,
1890
+ "completions/clipped_ratio": 0.0,
1891
+ "completions/max_length": 13711.0,
1892
+ "completions/max_terminated_length": 13711.0,
1893
+ "completions/mean_length": 9765.40625,
1894
+ "completions/mean_terminated_length": 9765.40625,
1895
+ "completions/min_length": 1120.0,
1896
+ "completions/min_terminated_length": 1120.0,
1897
+ "entropy": 0.07584212138317525,
1898
+ "epoch": 0.00079,
1899
+ "frac_reward_zero_std": 0.0,
1900
+ "grad_norm": 2.9946579933166504,
1901
+ "kl": 0.3551831729710102,
1902
+ "learning_rate": 3.848442566455879e-05,
1903
+ "loss": 0.3019,
1904
+ "num_tokens": 14582149.0,
1905
+ "reward": 0.6430723071098328,
1906
+ "reward_std": 0.8897272944450378,
1907
+ "rewards/rollout_reward_func/mean": 0.6430723071098328,
1908
+ "rewards/rollout_reward_func/std": 0.8830904364585876,
1909
+ "sampling/importance_sampling_ratio/max": 1.8536871671676636,
1910
+ "sampling/importance_sampling_ratio/mean": 0.9992114305496216,
1911
+ "sampling/importance_sampling_ratio/min": 0.13193820416927338,
1912
+ "sampling/sampling_logp_difference/max": 2.025421619415283,
1913
+ "sampling/sampling_logp_difference/mean": 0.015373140573501587,
1914
+ "step": 79,
1915
+ "step_time": 164.90890989500258
1916
+ },
1917
+ {
1918
+ "clip_ratio/high_max": 0.012501898920163512,
1919
+ "clip_ratio/high_mean": 0.006734387134201825,
1920
+ "clip_ratio/low_mean": 0.006024214468197897,
1921
+ "clip_ratio/low_min": 0.00042229730752296746,
1922
+ "clip_ratio/region_mean": 0.012758601747918874,
1923
+ "entropy": 0.07754075806587934,
1924
+ "epoch": 0.0008,
1925
+ "grad_norm": 2.0877890586853027,
1926
+ "kl": 0.3181155929341912,
1927
+ "learning_rate": 3.800923119219528e-05,
1928
+ "loss": 0.2807,
1929
+ "step": 80,
1930
+ "step_time": 75.99154542500037
1931
+ },
1932
+ {
1933
+ "clip_ratio/high_max": 0.006057915627025068,
1934
+ "clip_ratio/high_mean": 0.0034836535924114287,
1935
+ "clip_ratio/low_mean": 0.002263873800984584,
1936
+ "clip_ratio/low_min": 0.00041946308920159936,
1937
+ "clip_ratio/region_mean": 0.0057475273933960125,
1938
+ "completions/clipped_ratio": 0.0,
1939
+ "completions/max_length": 13044.0,
1940
+ "completions/max_terminated_length": 13044.0,
1941
+ "completions/mean_length": 9906.5625,
1942
+ "completions/mean_terminated_length": 9906.5625,
1943
+ "completions/min_length": 194.0,
1944
+ "completions/min_terminated_length": 194.0,
1945
+ "entropy": 0.07884204341098666,
1946
+ "epoch": 0.00081,
1947
+ "frac_reward_zero_std": 0.0,
1948
+ "grad_norm": 3.3087422847747803,
1949
+ "kl": 0.33128924760967493,
1950
+ "learning_rate": 3.752899272820599e-05,
1951
+ "loss": 0.5296,
1952
+ "num_tokens": 14925339.0,
1953
+ "reward": 0.6571594476699829,
1954
+ "reward_std": 0.9445677995681763,
1955
+ "rewards/rollout_reward_func/mean": 0.6571594476699829,
1956
+ "rewards/rollout_reward_func/std": 0.9522512555122375,
1957
+ "sampling/importance_sampling_ratio/max": 2.334723472595215,
1958
+ "sampling/importance_sampling_ratio/mean": 1.0011037588119507,
1959
+ "sampling/importance_sampling_ratio/min": 0.5123927593231201,
1960
+ "sampling/sampling_logp_difference/max": 0.847893476486206,
1961
+ "sampling/sampling_logp_difference/mean": 0.013168051838874817,
1962
+ "step": 81,
1963
+ "step_time": 160.75939445400218
1964
+ },
1965
+ {
1966
+ "clip_ratio/high_max": 0.00975386772188358,
1967
+ "clip_ratio/high_mean": 0.00487693386094179,
1968
+ "clip_ratio/low_mean": 0.003392288461327553,
1969
+ "clip_ratio/low_min": 0.00042808218859136105,
1970
+ "clip_ratio/region_mean": 0.008269222205854021,
1971
+ "entropy": 0.08188656461425126,
1972
+ "epoch": 0.00082,
1973
+ "grad_norm": 3.4174883365631104,
1974
+ "kl": 0.3137277886271477,
1975
+ "learning_rate": 3.7044068644530266e-05,
1976
+ "loss": 0.5072,
1977
+ "step": 82,
1978
+ "step_time": 72.33279583299736
1979
+ },
1980
+ {
1981
+ "clip_ratio/high_max": 0.006675469485344365,
1982
+ "clip_ratio/high_mean": 0.0033377347426721826,
1983
+ "clip_ratio/low_mean": 0.001848820800660178,
1984
+ "clip_ratio/low_min": 0.0,
1985
+ "clip_ratio/region_mean": 0.005186555499676615,
1986
+ "completions/clipped_ratio": 0.0,
1987
+ "completions/max_length": 13170.0,
1988
+ "completions/max_terminated_length": 13170.0,
1989
+ "completions/mean_length": 9203.78125,
1990
+ "completions/mean_terminated_length": 9203.78125,
1991
+ "completions/min_length": 1503.0,
1992
+ "completions/min_terminated_length": 1503.0,
1993
+ "entropy": 0.0853383056819439,
1994
+ "epoch": 0.00083,
1995
+ "frac_reward_zero_std": 0.0,
1996
+ "grad_norm": 3.260225296020508,
1997
+ "kl": 0.3031325452029705,
1998
+ "learning_rate": 3.6554820809692434e-05,
1999
+ "loss": 0.2749,
2000
+ "num_tokens": 15246024.0,
2001
+ "reward": 0.7069042325019836,
2002
+ "reward_std": 0.9800997972488403,
2003
+ "rewards/rollout_reward_func/mean": 0.7069042325019836,
2004
+ "rewards/rollout_reward_func/std": 0.9830536246299744,
2005
+ "sampling/importance_sampling_ratio/max": 1.983609914779663,
2006
+ "sampling/importance_sampling_ratio/mean": 0.9999707341194153,
2007
+ "sampling/importance_sampling_ratio/min": 0.27940061688423157,
2008
+ "sampling/sampling_logp_difference/max": 1.2751085758209229,
2009
+ "sampling/sampling_logp_difference/mean": 0.014450456015765667,
2010
+ "step": 83,
2011
+ "step_time": 154.39635029400597
2012
+ }
2013
+ ],
2014
+ "logging_steps": 1.0,
2015
+ "max_steps": 150,
2016
+ "num_input_tokens_seen": 15246024,
2017
+ "num_train_epochs": 1,
2018
+ "save_steps": 500,
2019
+ "stateful_callbacks": {
2020
+ "TrainerControl": {
2021
+ "args": {
2022
+ "should_epoch_stop": false,
2023
+ "should_evaluate": false,
2024
+ "should_log": false,
2025
+ "should_save": true,
2026
+ "should_training_stop": false
2027
+ },
2028
+ "attributes": {}
2029
+ }
2030
+ },
2031
+ "total_flos": 0.0,
2032
+ "train_batch_size": 1,
2033
+ "trial_name": null,
2034
+ "trial_params": null
2035
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3cf2afd4c2e1d5c0ab3119cc46dbc965396a829819d96dc569939c4c2045adc2
3
+ size 8145
vocab.json ADDED
The diff for this file is too large to render. See raw diff